September 21, 2020
Despite their great upscaling performance, deep learning backed Super-Resolution methods cannot be easily applied to real-world applications due to their heavy computational requirements. At Collabora we have addressed this issue by introducing an accurate and light-weight deep network for video super-resolution, running on a completely open source software stack using Panfrost, the free and open-source graphics driver for Mali GPUs. Here's an overview of Super Resolution, its purpose for image and video upscaling, and how our model came about.
Internet streaming has experienced tremendous growth in the past few years, and continues to advance at a rapid pace. Streaming now accounts for over 60% of internet traffic and is expected to quadruple over the next five years.
Video delivery quality depends critically on available network bandwidth. Due to bandwidth limitations, most video sources are compressed, resulting in image artifacts, noise, and blur. Quality is also degraded by routine image upscaling, which is required to match the very high pixel density of newer mobile devices.
The upscaling community has provided us with many fundamental advances in video and image upscaling, including classic methods such as Nearest-Neighbor, Linear and Lanczos resampling. However, no fundamentally new methods have been introduced in over 20 years. Moreover, traditional algorithm-based upscaling methods produce results that lack fine detail, and they cannot remove defects and compression artifacts.
All of this is changing thanks to the Deep Learning revolution. We now have a whole new class of techniques for state-of-the-art upscaling, called Deep Learning Super Resolution (DLSR).
An image's effective resolution may be reduced either by lowering its spatial resolution (for example to reduce bandwidth) or by image quality degradation such as blurring.
Super-resolution (SR) is a technique for constructing a high-resolution (HR) image from a collection of observed low-resolution (LR) images. SR increases high frequency components and removes compression artifacts.
The HR and LR images are related via the equation:
LR = degradation(HR).
By applying the degradation function, we obtain the LR image from the HR image. If we know the degradation function in advance, we can apply its inverse to the LR image to recover the HR image. Unfortunately we usually do not know the degradation function beforehand. The problem is thus ill-posed, and the quality of the SR result is limited.
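To illustrate why the inverse problem is ill-posed, here is a minimal sketch that assumes one very simple degradation: averaging each block of pixels into a single pixel. The function name `degrade` and the choice of block averaging are assumptions made for illustration; real degradations (compression, blur, noise) are more complex and usually unknown. Plain Python lists are used to keep the sketch dependency-free.

```python
def degrade(hr, factor=2):
    """Hypothetical degradation: average each factor x factor block of a
    grayscale image (a list of rows) into one low-resolution pixel."""
    height, width = len(hr), len(hr[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    lr = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [hr[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        lr.append(row)
    return lr


# Two different HR images collapse to the same LR image, so the LR image
# alone cannot tell us which HR image it came from.
hr_a = [[0, 4], [4, 0]]
hr_b = [[2, 2], [2, 2]]
lr_a = degrade(hr_a)
lr_b = degrade(hr_b)
```

Because several HR images map to the same LR image, any method that goes from LR back to HR has to choose among many plausible candidates, which is where learned image priors help.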
DLSR solves this problem by learning image prior information from HR and/or LR example images, thereby improving the quality of the LR to HR transformation.
The key to DLSR success is the recent rapid development of deep convolutional neural networks (CNNs). Recent years have witnessed dramatic improvements in the design and training of CNN models used by Super-Resolution.
Upscaling can be achieved using different techniques, such as the aforementioned Nearest-Neighbor, Linear and Lanczos resampling methods. The group of images below demonstrates these different options.
First, the lower resolution input image to be upscaled:
Then, the various methods can be applied. Click on the image below to get a closer look at each result, as well as the original image before it was downscaled.
The objective is to improve the quality of the LR image to approach the quality of the target, known as the ground truth. In this case, the ground truth is the original image which was downscaled to create the low-resolution image.
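The three classic methods named above can all be seen as a weighted sum of neighbouring pixels, differing mainly in the weighting kernel. The following is a minimal one-dimensional sketch of that idea, not the code used to produce the images in this post. The function names, the border clamping, and the choice of three lobes for Lanczos (`a=3`) are assumptions made for illustration.

```python
import math


def box(x):
    """Nearest-neighbour kernel: only the closest sample counts."""
    return 1.0 if -0.5 <= x < 0.5 else 0.0


def triangle(x):
    """Linear interpolation kernel: weight falls off with distance."""
    return max(0.0, 1.0 - abs(x))


def lanczos(x, a=3):
    """Lanczos kernel: sinc(x) * sinc(x / a) inside |x| < a, zero outside."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def resample_1d(samples, factor, kernel, support):
    """Upscale a 1-D signal by an integer factor with the given kernel."""
    n = len(samples)
    out = []
    for i in range(n * factor):
        centre = (i + 0.5) / factor - 0.5  # output position in source coordinates
        first = math.floor(centre - support) + 1
        last = math.floor(centre + support)
        total = 0.0
        weight_sum = 0.0
        for j in range(first, last + 1):
            w = kernel(centre - j)
            total += w * samples[min(max(j, 0), n - 1)]  # clamp at the borders
            weight_sum += w
        out.append(total / weight_sum)  # normalise so flat areas stay flat
    return out


edge = [0, 100]
nearest_result = resample_1d(edge, 2, box, 0.5)
linear_result = resample_1d(edge, 2, triangle, 1.0)
lanczos_result = resample_1d([5, 5, 5, 5], 2, lanczos, 3)
```

Nearest-neighbour keeps the edge hard and blocky, while linear interpolation smooths it into a ramp. The Lanczos kernel has negative lobes, which tends to give sharper results but can overshoot near strong edges. None of the three can add detail that is not already in the input.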
The standard approach to Super-Resolution using Deep Learning or Convolutional Neural Networks (CNNs) is a fully supervised one, where a low-resolution image is processed by a network comprising convolutional and up-sampling layers to produce a high-resolution image. This generated HR image is then matched against the original HR image using an appropriate loss function. This approach is commonly known as the "paired setting", as it uses pairs of LR and corresponding HR images for training.
More recently, following the introduction of generative adversarial networks (GANs), GANs have become one of the most utilized machine-learning architectures for Super-Resolution.
In generative adversarial networks, two networks train and compete against each other, resulting in mutual learning. The first network, called the generator, generates high-resolution images and tries to fool the second network, the discriminator, into accepting these as true high-quality inputs. The discriminator output predicts whether an input is a real high-quality image (similar to the training set) or a fake or poorly upscaled image.
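The two competing objectives can be written down as a pair of loss functions. The sketch below uses the common binary cross-entropy formulation of GAN losses and is an illustration only: this post does not state which loss the model described here uses, and all function names are hypothetical. The inputs are the discriminator's outputs, interpreted as the probability that an image is real.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12
    p = min(max(prediction, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_real, d_fake):
    # The discriminator wants real images scored 1 and generated ones 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    # The generator wants its output to be scored as real.
    return bce(d_fake, 1.0)
```

The same discriminator score on a generated image pulls in opposite directions: a high score lowers the generator's loss and raises the discriminator's, which is the competition described above.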
The technical details are considerably more complex but follow these general principles.
The following shows different examples of 4x upsampling using our trained Deep Learning Super Resolution model. You can click on each image to view its original size. We also list the output for Nearest Neighbour, Bi-linear and Lanczos interpolation for comparison.
The model adds details to the sky and the signs. The hotel sign is not 100% accurate, but compared with the other upscaling methods it is a huge improvement. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
Due to the complex lighting, the output is not as sharp as in the previous examples. Still, the model was able to bring back details to the shirt and face. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
Since the model was trained on animation videos as well, it works on various kinds of content. However, in our experiments, a model trained on a specific content type showed even better results. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
Another animation example: compared with the other upscaling methods, our Super-Resolution model was able to add details to the background and to objects in the foreground. Input, Nearest Neighbour, Bi-linear, Lanczos, Original.
For more examples: https://medel.pages.collabora.com/super-resolution-examples/.
Super Resolution is one of the areas where we can fortunately rely on an almost infinite supply of data (high-quality images and videos) which we can use to create a training set. By down-sampling the high-quality images, we can create the low-resolution and high-resolution image pairs needed to train our model.
The low-resolution image starts as a copy of the ground truth image at half the dimensions. It is then upscaled using a bi-linear transformation so that its dimensions match the target image and it is ready to serve as input for our model.
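The two steps above (halve the image, then upscale it back with bi-linear interpolation) can be sketched as follows. This is a simplified stand-in for a real data pipeline: it works on small grayscale images stored as lists of rows, the function names are hypothetical, and using bi-linear resampling for the downscaling step is an assumption, since the text only specifies it for the upscaling step.

```python
def resize_linear_1d(values, new_len):
    """Linearly resample a 1-D sequence to new_len samples."""
    old_len = len(values)
    out = []
    for i in range(new_len):
        # Map the output sample centre into source coordinates and clamp.
        pos = min(max((i + 0.5) * old_len / new_len - 0.5, 0.0), old_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def resize_bilinear(img, new_h, new_w):
    """Bi-linear resize done separably: along rows first, then along columns."""
    rows = [resize_linear_1d(row, new_w) for row in img]
    cols = [resize_linear_1d(list(col), new_h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def make_training_pair(hr):
    """Return (model_input, target) for one ground truth image."""
    h, w = len(hr), len(hr[0])
    lr = resize_bilinear(hr, h // 2, w // 2)  # copy at half the dimensions
    model_input = resize_bilinear(lr, h, w)   # upscaled back to the target size
    return model_input, hr


# A checkerboard is pure fine detail, and all of it is lost in the round trip.
checkerboard = [[0, 8, 0, 8],
                [8, 0, 8, 0],
                [0, 8, 0, 8],
                [8, 0, 8, 0]]
model_input, target = make_training_pair(checkerboard)
```

The input and target have identical dimensions, so the model only has to learn how to restore the missing detail rather than how to change the image size.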
To make the model robust against different forms of image degradation and to better generalize, the dataset can be further augmented by:
One big question we need to answer is how to quantitatively evaluate the performance of our model.
Simply comparing video resolution doesn't reveal much about quality. In fact, it may be completely misleading. A 1080p movie of 500MB may look worse than a 720p movie at 500MB, because the former's bitrate may be too low, introducing various kinds of compression artifacts.
The same goes for comparing bitrates at similar frame sizes, as different encoders can deliver better quality at lower bitrates, or vice-versa. For example, a 720p 500MB video produced with XviD will look worse than a 500MB video produced with x264, because the latter is much more efficient.
To solve the problem, several methods have been introduced over the past decade, commonly classified as either full-reference, reduced-reference, or no-reference, based on the amount of information they use from a reference image of ostensibly pristine quality.
Video quality has traditionally been measured using either PSNR (Peak Signal-to-Noise Ratio) or SSIM (Structural Similarity Index Measure). However, PSNR doesn’t take human perception into account; it is derived directly from the mean squared error between the original clean signal and the compressed signal. SSIM does consider human perception, but it was originally developed to analyze static images and doesn’t account for human perception over time, although more recent versions of SSIM have started to address this issue.
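For reference, PSNR is computed from the mean squared error (MSE) as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. Below is a minimal sketch for 8-bit grayscale images stored as lists of rows; the function name and the default `max_value` of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    squared_errors = [(r - d) ** 2
                      for ref_row, dis_row in zip(reference, distorted)
                      for r, d in zip(ref_row, dis_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since every pixel error is weighted equally regardless of where it occurs, two images with the same PSNR can look very different to a human viewer.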
With the rapid development of machine learning, important data-driven models have begun to emerge. One such model is Netflix’s Video Multi-method Assessment Fusion (VMAF). VMAF combines multiple quality features to train a Support Vector Regressor to predict subjective judgments of video quality.
At Collabora, we use a combination of SSIM and VMAF to train and test our Deep Learning Super-Resolution models. SSIM is fast to calculate and serves as a basic indicator for how the model is performing. VMAF, on the other hand, delivers more accurate results, capturing quality differences that are usually missed by traditional methods.
Despite their great upscaling performance, deep learning backed Super-Resolution methods cannot be easily applied to real-world applications due to their heavy computational requirements. At Collabora we have addressed this issue by introducing an accurate and light-weight deep network for video super-resolution.
To achieve a good tradeoff between computational complexity and reproduction quality, we implemented a cascading mechanism on top of a standard network architecture, producing a light-weight solution. We also used a multi-tile approach in which we divide a large input into smaller tiles to better utilize memory bandwidth and overcome size constraints posed by certain frameworks and devices. Multi-tile significantly improves inference speed. This approach can be extended from single image SR to video SR where video frames are treated as a group of multiple tiles.
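The multi-tile idea can be sketched as follows: split the input into tiles, run the upscaler on each tile, and paste the results into one output image. This is a simplified illustration, not the implementation described above. The function names are hypothetical, the tiles do not overlap, and a plain 2x nearest-neighbour function stands in for the network.

```python
def split_tiles(img, tile):
    """Yield (y, x, pixels) for non-overlapping tiles; edge tiles may be smaller."""
    for y in range(0, len(img), tile):
        for x in range(0, len(img[0]), tile):
            yield y, x, [row[x:x + tile] for row in img[y:y + tile]]


def upscale_tiled(img, tile, factor, model):
    """Run model on each tile and assemble the upscaled tiles into one image."""
    h, w = len(img), len(img[0])
    out = [[None] * (w * factor) for _ in range(h * factor)]
    for y, x, pixels in split_tiles(img, tile):
        result = model(pixels)
        for dy, row in enumerate(result):
            start = x * factor
            out[y * factor + dy][start:start + len(row)] = row
    return out


def nearest_x2(pixels):
    """Stand-in for the network: plain 2x nearest-neighbour upscaling."""
    return [[p for p in row for _ in range(2)]
            for row in pixels for _ in range(2)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
tiled = upscale_tiled(image, tile=2, factor=2, model=nearest_x2)
```

With this stand-in the tiled result is identical to upscaling the whole image at once, because each output pixel depends on a single input pixel. A convolutional network looks at neighbouring pixels, so a practical version would likely need overlapping or padded tiles to avoid visible seams; that detail is left out here.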
We designed our solution on top of the open-source Panfrost graphics driver, allowing us to offload compute to the GPU.
Coming up in Part 2 of this series, we'll take a deep dive into how our model works, and how you can use free, open source software to achieve a higher level of compression than existing video compression methods. Stay tuned!
Update (Sept. 24):
By popular demand, the code to train your own model and to reproduce the results from the blog-post can be found here: https://gitlab.collabora.com/medel/super-resolution.
Due to licensing issues (a large number of images used have a research license attached to them), we can't release the pre-trained model for the second stage of the Super-Resolution method at this point. However, we are currently re-training the model to solve the issue, and will be making the updated model checkpoint available soon!