The Computer Visionaries: Final Update

Rishab Kaup, Karthik Praturu, Noah Sutter and Austen Schunk
Fall 2018 CS 4476 Computer Vision: Final Project
Georgia Tech

Abstract

Style transfer is a method of transferring the style of one image, such as a painting, to another image. This is useful for creating new types of computer-generated artwork, and it can even be applied to videos. Our approach focuses on exploring and optimizing style transfer for videos, as well as the inference time of the models used for style transfer. Through our experiments, we found key areas for improvement in the most popular style transfer algorithm currently in use (shown in [1]), and we propose a modified structure for potentially better performance and faster inference.

Introduction

Style transfer algorithms produce artistic and often seemingly imaginative outputs, making them a very popular way to filter images and generate abstract art. Tools like the Deep Dream Generator use style transfer algorithms to generate visual content for consumer use or research purposes. Real-time style transfer can be used for augmented reality applications and videos by applying the transformation frame-by-frame, but this requires faster methods for determining how to transfer style.

The modern style transfer algorithm, introduced by Johnson et al., works as shown in the following figure:

Johnson et al. Style Transfer Pipeline [1]

The algorithm is split up into two parts, an image transformation network (ITN) and a loss network. The loss network is a pre-trained VGG16 network meant for image classification, and it defines weighted functions of feature reconstruction loss and style reconstruction loss. Most modern style transfer implementations make use of the same loss network. The loss functions generated by the loss network are used when training the ITN, and training the ITN introduced in [1] involves fixing a style image and feeding through a large database of content images. The ITN introduced by Johnson et al. is a convolutional neural network with multiple convolutional layers and residual blocks, using strided convolutions rather than pooling layers for downsampling; however, other modern implementations use different approaches. For example, the method introduced in [7] has an ITN that matches the mean and variance of intermediate features, and the ITN in [8] uses a network that attempts to minimize the difference of centered covariance between the loss network output and the combined style and content images.

We approached the problem of style transfer by first taking into account what currently exists, and then modifying and combining the traits of each method to try to arrive at an algorithm better suited to modern, high-quality, fast videos. Specifically, we attempted to make the resulting videos smoother and more pleasing to the eye. Additionally, we wanted to implement model compression with two goals in mind. The first goal is simply to reduce the size of the model, because the original architecture is a massive 150Mb. The second is to use the compressed model to perform faster stylizations.

Approach

Part 1: Smoother Videos

In order to create smoother videos, we decided to approach the problem using post-processing rather than training. This is because the other major problem that we are attempting to solve is the long inference times we encountered throughout the process. In this vein, there are several methods through which we attempted to reduce the flickering effect in stylized videos:

Gaussian Filtering
Interpolation of video (predicting frames between frames)
Frame subtraction
Optical Flow

We judged these techniques to have the best chance of making the videos smoother and reducing the flickering effect.

Part 2: Model Compression

The general idea behind using model compression in this context is to reduce the number of parameters in order to conserve space and reduce inference time. There are two main approaches to this task: using a small-dense network or using a large-sparse network. Using the results from [9], which say that a large-sparse network will produce higher classification accuracy than a small-dense network with the same number of parameters, we decided to go with the approach of introducing sparsity into the network proposed in [1]. In order to achieve sparsity, we used TensorFlow's built-in pruning library, which uses the idea of threshold pruning presented in [9]. Unfortunately, TensorFlow does not include a sparse convolution operator, so we used the library in [10], which references the techniques presented in [11].
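As an illustration of the threshold pruning idea, the following is a minimal NumPy sketch, not the TensorFlow pruning library we used. It zeroes the smallest-magnitude weights of a single tensor in one shot, whereas the library applies the mask gradually during training. The function name prune_by_magnitude and the example filter shape are assumptions made for this sketch.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out roughly `sparsity` (a fraction in [0, 1)) of the weights.

    Hypothetical helper: the weights with the smallest absolute value
    are dropped. Returns the pruned weights and the boolean keep-mask.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    magnitudes = np.abs(weights).ravel()
    k = int(round(sparsity * magnitudes.size))  # number of weights to drop
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest magnitude is the pruning threshold.
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Example: a 9x9 filter bank with 3 input and 32 output channels.
rng = np.random.default_rng(0)
filters = rng.normal(size=(9, 9, 3, 32))
pruned, keep = prune_by_magnitude(filters, 0.9)
print("fraction of zeros:", 1.0 - keep.mean())
```

A pruned dense tensor like this still occupies the same memory and costs the same to convolve; the savings only appear once the weights are stored and applied in a sparse format, which is why a sparse convolution op is needed.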

Experiments and Results

Part 1: Smoother Videos

Gaussian Filtering

Original

Left: Sigma = 1.0 | Center: Sigma = 2.0 | Right: Sigma = 3.0

A first, rudimentary attempt to get a smoother video was to apply Gaussian filtering to the frames after stylization. Ideally, this would create a smoother interpretation of the background versus the fox and result in a more pleasing image. In reality, however, this attempt did nothing to make the frames less jittery and only served to blur the coloring in the videos. This makes sense, as Gaussian filtering does not take into account the way the neural net styles images. This was when we first thought that the key to getting stable videos may lie in the actual training of the neural net, but we pushed onward to test our other ideas regardless.
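To make the limitation concrete, here is a small sketch of per-frame Gaussian filtering written with NumPy only. The helper names gaussian_kernel_1d and blur_frame, the three-sigma kernel radius, and the edge padding are assumptions for illustration and are not taken from our pipeline. Note that the filter only ever sees one frame at a time, so it has no information about how the stylization changes between frames.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Normalized 1-D Gaussian kernel covering about +/- truncate * sigma."""
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def blur_frame(frame, sigma):
    """Blur one H x W (x C) frame by filtering rows, then columns."""
    kernel = gaussian_kernel_1d(sigma)
    radius = kernel.size // 2
    out = frame.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        # Repeat border pixels so the image edges are not darkened.
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

# Each stylized frame is filtered independently of its neighbors.
rng = np.random.default_rng(0)
stylized_frames = [rng.uniform(0, 255, size=(32, 32, 3)) for _ in range(3)]
blurred_frames = [blur_frame(f, sigma=2.0) for f in stylized_frames]
```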

Interpolation of video (predicting frames between frames)

Original

Left: Original Number of Frames Result | Center: Double the Frames by Interpolation Result | Right: Quadruple the Frames by Interpolation Result

This result was obtained by increasing the number of frames in the original video through frame interpolation and running the video with the extra frames through the transformation network. The hope was that the extra frames would smooth the flickering by reducing the difference between consecutive frames, making the overall video seem less jumpy. This was successful to a degree: the flickering is smoother because there are smaller changes between frames, but there is still just as much flickering as there was with the original number of frames.

Frame subtraction

Left: Original | Middle: Using previous stylized frame without overlay | Right: Using previous stylized frame with overlay

For this approach, we attempted to use the previous frame of the video to help make intelligent decisions about the next frame in the video. There were two main attempts to do this, as shown by the results in the middle and right videos:

Using unedited stylized frame as previous frame

Using edited stylized frame as previous frame
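One way to read the frame subtraction idea is sketched below: pixels whose content changed noticeably since the previous frame take their value from the newly stylized frame, and all other pixels keep the previous output. This is an assumption-laden illustration rather than our exact code. The function name, the threshold of 20 intensity levels, and the use of the channel-averaged absolute difference are all hypothetical. The prev_output argument can be either the unedited stylized previous frame or the edited (composited) previous output, which corresponds to the two variants listed above.

```python
import numpy as np

def frame_subtraction_update(prev_content, curr_content,
                             prev_output, curr_stylized, threshold=20.0):
    """Restyle only the pixels that changed between two content frames.

    All frames are H x W x C arrays. `threshold` is a hypothetical
    cutoff on the channel-averaged absolute difference.
    """
    diff = np.abs(curr_content.astype(np.float64)
                  - prev_content.astype(np.float64)).mean(axis=2)
    changed = diff > threshold
    # Changed pixels come from the new stylized frame, the rest are reused.
    output = np.where(changed[..., None], curr_stylized, prev_output)
    return output, changed

# Toy example: a single pixel changes between two content frames.
prev_content = np.zeros((4, 4, 3))
curr_content = prev_content.copy()
curr_content[1, 2] = 255.0
prev_output = np.full((4, 4, 3), 10.0)
curr_stylized = np.full((4, 4, 3), 200.0)
output, changed = frame_subtraction_update(
    prev_content, curr_content, prev_output, curr_stylized
)
print(int(changed.sum()), output[1, 2], output[0, 0])
```

Because each pixel is decided in isolation, a region can end up as a patchwork of old and new stylization, and any mistake is carried forward when the edited output is reused as the next previous frame.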

Optical Flow

Left: Original | Middle: Using only optical flow | Right: Optical flow with blending and median filtering

The experiments above helped us realize what needed to be done to get good output videos with reduced flickering. Taking the difference between frames and applying style only to the changed pixels was not enough, since the context of the changed pixels is not conserved: changing the style of randomly consistent pixels led to undesired popping. Since movement of the camera, movement of the fox, and changes in brightness between frames all worsen the flickering of the output video, we utilized optical flow, which estimates pixel motion, to achieve better results. We did the following as a control: as with frame subtraction, we took the current frame and the previous frame and computed the optical flow, then used the magnitude of the flow to threshold the output image, applying style changes only to locations above the threshold. Raw optical flow like this does not work: problems are compounded as with frame subtraction, causing a "smearing" effect to appear in the output image. However, blending this output with the predicted next-frame output from the style transfer network and passing the result through a median filter greatly reduces the smearing effect and achieves slightly smoother, less flickery results.
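The steps in the paragraph above can be sketched as follows. This is a rough NumPy illustration under several assumptions: the dense flow field is taken as an input array of shape H x W x 2 (it would come from whatever optical flow estimator is used), and the function names, the flow threshold, the 50/50 blend weight alpha, and the 3x3 median window are all hypothetical choices rather than the values we used.

```python
import numpy as np

def flow_thresholded_frame(flow, prev_output, curr_stylized, threshold=1.0):
    """Apply the new style only where the flow magnitude exceeds `threshold`."""
    magnitude = np.linalg.norm(flow, axis=2)
    moving = magnitude > threshold
    return np.where(moving[..., None], curr_stylized, prev_output)

def median_filter_3x3(frame):
    """3x3 median filter on an H x W x C frame with edge padding."""
    h, w = frame.shape[:2]
    padded = np.pad(frame, ((1, 1), (1, 1), (0, 0)), mode="edge")
    windows = [padded[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0)

def stabilize(flow, prev_output, curr_stylized, threshold=1.0, alpha=0.5):
    """Blend the flow-thresholded frame with the regular stylized frame."""
    thresholded = flow_thresholded_frame(
        flow, prev_output, curr_stylized, threshold
    )
    blended = alpha * curr_stylized + (1.0 - alpha) * thresholded
    return median_filter_3x3(blended)

# Static scene: no motion, so half of the previous output is retained.
flow = np.zeros((4, 4, 2))
prev_output = np.full((4, 4, 3), 10.0)
curr_stylized = np.full((4, 4, 3), 30.0)
print(stabilize(flow, prev_output, curr_stylized)[0, 0])
```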

The "smearing" effect from naively using optical flow is an improvement over basic frame subtraction. To an extent, the context is being preserved, but by too much, and blending the optical flow thresholded style frame with the regularly predicted style frame improves upon this. We can improve this further by weighting pixels that move around more. In essence, instead of globally blending both frames together, we set each pixel to a weighted combination of the regularly predicted style frame pixel and the optical flow thresholded style frame pixel at that location. A pixel location with a large optical flow has a higher weight and blends in more of the regularly predicted style frame pixel and less of the optical flow thresholded pixel. Each weight is normalized by the maximum weight, and we also added a penalty (referred to as p in the figure below) that is multiplied by the normalized weight before blending. Next, all the normalized-weight-penalties (nwp) are clipped to lie between 0 and 1, and blending is achieved by combining nwp of the regularly predicted style frame pixel with (1-nwp) of the optical flow thresholded pixel.
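The per-pixel blend described above can be written compactly in NumPy. In this sketch the weight of a pixel is assumed to be its optical flow magnitude, and the function name per_pixel_blend and the array layouts (H x W x 2 for the flow, H x W x C for the frames) are assumptions for illustration.

```python
import numpy as np

def per_pixel_blend(flow, regular, flow_thresholded, penalty):
    """Blend two stylized frames with a motion-dependent per-pixel weight.

    nwp = clip(penalty * weight / max_weight, 0, 1)
    out = nwp * regular + (1 - nwp) * flow_thresholded
    """
    weight = np.linalg.norm(flow, axis=2)  # assumed weight: flow magnitude
    max_weight = weight.max()
    if max_weight > 0:
        normalized = weight / max_weight
    else:
        normalized = np.zeros_like(weight)  # no motion anywhere
    nwp = np.clip(normalized * penalty, 0.0, 1.0)[..., None]
    return nwp * regular + (1.0 - nwp) * flow_thresholded

# Toy example: one fast pixel, one slow pixel, two static pixels.
flow = np.zeros((2, 2, 2))
flow[0, 1] = (1.0, 0.0)      # fastest pixel, normalized weight 1
flow[1, 0] = (0.001, 0.0)    # slow pixel, normalized weight 0.001
regular = np.full((2, 2, 3), 100.0)
flow_thresholded = np.zeros((2, 2, 3))
blended = per_pixel_blend(flow, regular, flow_thresholded, penalty=200)
print(blended[..., 0])
```

With p = 200, any pixel whose normalized weight is at least 1/200 is taken entirely from the regularly predicted frame, so a larger penalty lets progressively slower pixels switch over to the fresh stylization.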

Left: Small Penalty (p=200) | Center: Medium Penalty (p=500) | Right: Large Penalty (p=1000)

Part 2: Model Compression

First we trained a single network, using the approach from above, with 50% sparsity and saved the weights. The next task was to find a library that would support sparse convolution, which was extremely challenging. We originally wanted the computations to be performed on CPU, but we were not able to find any library that would work on our available machines. As a result, we ended up using [10], which is built only for GPUs with CUDA support. Next, we tested the previously saved weights in a network with the architecture from [1], but with all convolution ops replaced by the new sparse convolution op. In doing this we found two problems. The first is memory allocation issues for larger input images. The second is that the sparse convolution op is actually slower for smaller filter sizes, i.e. the 3x3 ones. Given both of these issues, we decided to create a hybrid architecture that uses TensorFlow's built-in convolution for all layers except the first and last ones, which are 9x9. Next, we trained models with sparsity constraints of 70%, 90%, and 95%. We then gathered information on the speed of inference by inputting square images ranging from 240x240 to 1440x1440. We calculated these times by using 20 frames to warm up the GPU and then timing inference. The following are the results from our test.
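The timing procedure described above (20 warm-up frames, then timed inference) can be sketched with the standard library alone. The function name time_inference and the run_once callable are hypothetical stand-ins for a model's forward pass; the sketch assumes run_once blocks until the result is ready, which for GPU code means it must synchronize before returning.

```python
import time

def time_inference(run_once, inputs, warmup=20):
    """Average seconds per call after discarding `warmup` warm-up calls.

    `run_once` is a hypothetical callable standing in for one forward
    pass; it is assumed to return only once the result is available.
    """
    if len(inputs) <= warmup:
        raise ValueError("need more inputs than warm-up iterations")
    for x in inputs[:warmup]:
        run_once(x)  # untimed: lets caches, allocators and kernels settle
    timed = inputs[warmup:]
    start = time.perf_counter()
    for x in timed:
        run_once(x)
    elapsed = time.perf_counter() - start
    return elapsed / len(timed)

# Stand-in workload so the sketch runs without a model or a GPU.
frames = list(range(30))
seconds_per_frame = time_inference(lambda x: sum(range(1000)), frames)
print(f"{seconds_per_frame * 1e3:.3f} ms per frame")
```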

As you can see, our new hybrid network consistently outperforms TensorFlow's regular convolution operator. Additionally, note that the runtime for the dense convolution follows a polynomial path, while the runtime for our hybrid network is nearly linear. As a result, this new hybrid network could be especially useful for style transfer on Ultra-HD images.
Now that we have shown the runtime results, we will demonstrate the quality of style transfer obtained from a sparse network. The following are a few samples of style transfer using our sparse structure and this style image.

Dense

50%

70%

90%

95%

Before discussing the results, it is important to note that the original dimensions of the above images are Golden Gate Bridge, 496 x 331; Stata Building, 1024 x 679; and Lion, 3840 x 2160, while the training and style images are 256 x 256. In general, the images are stylized quite well even at 95% sparsity. There are really only two main differences. The first is that the stylization seems to lose vibrance and variance in color. For example, the lion stylized by the dense network has a very vibrant color range and a background with many different colors, while the lion stylized by the 95% sparse network has less color range and a background split between two types of colors. The second difference is the correlation between sparsity and a lack of "swirliness." By "swirliness", we are referring to how the original style image could be characterized by swirl-like patterns across the image. A good example of the network losing this attribute as sparsity increases can be seen in the Golden Gate Bridge sequence, which starts with high "swirliness" and ends with only slightly curved lines. On a final note, it is interesting to point out that since all networks were trained on low-resolution images, they learn the features in the context of a low-resolution image. As a result, when applied to high-resolution images, such as the lion, the style is transferred to each small patch of the image.

Conclusions and Future Work

Part 1: Smoother Videos

We obtained fairly good results in generating smoother videos. Optical flow with blending and filtering preserved the background much better than the simpler techniques we tried before it. However, we did come across an approach that could work very well if we focused more on the training of the neural network. During training of the neural net, three losses are minimized: content loss, style loss, and total variation loss. Together, these determine the balance between content and style in the produced images. With videos, however, another loss component can be added to try to minimize the variation in stylization between frames: noise loss. Consecutive frames in a video contain only small differences, which can essentially be treated as noise. If, during training, we could compute the styled image both with and without noise added to the input, and minimize the loss between those two images, the neural net would be much better suited to creating stylized videos with less flickering between frames, as the background and non-moving portions of the frames would be more consistent. Unfortunately, due to time constraints and high training times, we could not test this implementation. However, we did work on reducing inference times for style models, which would have helped us achieve this goal.
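Since we did not implement the noise loss, the following is only a sketch of how such a term could be computed, under stated assumptions: stylize stands in for the image transformation network, the noise is Gaussian with a hypothetical standard deviation of 5 intensity levels, and the penalty is the mean squared difference between the two stylized outputs. In training, this term would be weighted and added to the content, style, and total variation losses.

```python
import numpy as np

def noise_loss(stylize, content, noise_std=5.0, rng=None):
    """Mean squared difference between stylizing a frame and a noisy copy.

    `stylize` is a hypothetical callable standing in for the image
    transformation network; `noise_std` is an assumed noise level.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = content + rng.normal(0.0, noise_std, size=content.shape)
    return float(np.mean((stylize(noisy) - stylize(content)) ** 2))

# Toy stand-ins: a network that passes noise straight through is penalized
# more than one that damps small input changes.
frame = np.random.default_rng(0).uniform(0, 255, size=(32, 32, 3))
sensitive = noise_loss(lambda x: x, frame, rng=np.random.default_rng(1))
damped = noise_loss(lambda x: 0.1 * x, frame, rng=np.random.default_rng(1))
print(sensitive, damped)
```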

Part 2: Model Compression

The model compression achieved the two goals we set out, which were reducing the space complexity of the model and the time complexity of inference. Unfortunately, the time complexity improvement only holds for a device with a GPU and CUDA support, which is usually not the case for the devices that benefit most from model compression, such as phones and wearables. As a result, our future work would be to implement a sparse convolution op that is supported on CPUs or phone GPUs. Additionally, we would like to combine the results from our two parts into one system that could efficiently compute stable style transfer.

Relevant papers and articles:

[1] Johnson et al. Perceptual Losses for Real-Time Style Transfer and Super-Resolution
[2] Ruder et al. Artistic Style Transfer for Video
[3] Cheng et al. A Survey of Model Compression and Acceleration for Deep Neural Networks
[4] Huang et al. Real-Time Neural Style Transfer for Videos
[5] Gao et al. ReCoNet: Real-time Coherent Video Style Transfer Network
[6] Weinzaepfel et al. DeepFlow: Large displacement optical flow with deep matching
[7] Huang et al. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
[8] Li et al. Learning Linear Transformations for Fast Arbitrary Style Transfer
[9] Zhu et al. Exploring the Efficacy of Pruning for Model Compression
[10] Hanxiang Hao. Sparse Convolution Op
[11] Han et al. Learning both Weights and Connections for Efficient Neural Networks