Open source machine learning for video compression

Marcus Edel
September 14, 2022

Over the past few years, several video codecs, including H.265 and VP9, have been developed to meet the needs of various applications, ranging from video conferencing platforms like Zoom to streaming services like YouTube and broadcasting software like OBS.

The quality of the reconstructed video using these codecs is excellent at medium-to-low bitrates, but it degrades when operating at very low bitrates. While these codecs leverage expert knowledge of human perception and carefully engineered signal processing pipelines, there has been a massive interest in replacing these handcrafted methods with machine learning approaches that learn to encode video data.

Using open source software, Collabora has developed an efficient compression pipeline that enables a face video broadcasting system that achieves the same visual quality as the H.264 standard while only using one-tenth of the bandwidth. In a nutshell, the face video compression algorithms rely on a source frame of the face, a pipeline to extract the important features from a face image, and a generator to reconstruct the face using the extracted and compressed features on the receiving side.

Key takeaways

Machine learning model to predict facial landmarks, capturing both facial expressions and overall head poses from a video.
We generate speaker-aware talking-head animations based on a single source image and a driving video.
The compact landmark representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while only using one-tenth of the bandwidth.

Talking heads problem

Animating expressive talking heads is essential for filmmaking, virtual avatars, video streaming, computer games, and mixed realities. Despite recent advances, generating realistic facial animation with little or no manual labor remains an open challenge in computer graphics. Two key factors contribute to this challenge. First, the generation process traditionally needs a lot of compute, making it nontrivial to run in real time in a video conference setting. Second, facial dynamics are difficult to reconstruct from only a few images.

We present a method that generates expressive talking-head videos from a single facial image and a driving video. The key component of our method is the prediction of the facial landmarks reflecting the facial dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework and generalizes well for faces that were not observed during training.

Talking heads generation

A neural network extracts and encodes the locations of key facial features of the user for each frame, which is much more efficient than compressing pixel and color data. The encoded data is then passed on to a generative adversarial network along with a reference video frame captured at the beginning of the session. The GAN is trained to reconstruct the new image by projecting the facial features onto the reference frame.
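To illustrate why landmark locations are cheaper to send than pixel data, here is a minimal sketch that packs one frame of landmarks into a byte payload and unpacks it on the receiving side. The landmark count (68), the 16-bit quantization, and all function names are assumptions made for this example; the post does not describe the actual encoding used in the pipeline.

```python
import struct

NUM_LANDMARKS = 68  # assumption: a common facial landmark count, not from the post
SCALE = 65535       # assumption: quantize normalized coordinates to 16 bits


def encode_landmarks(landmarks):
    """Pack (x, y) pairs with coordinates in [0, 1] into little-endian uint16 bytes."""
    values = []
    for x, y in landmarks:
        for c in (x, y):
            c = min(max(c, 0.0), 1.0)  # clamp to the valid range before quantizing
            values.append(round(c * SCALE))
    return struct.pack(f"<{len(values)}H", *values)


def decode_landmarks(payload):
    """Inverse of encode_landmarks: bytes back to a list of (x, y) floats."""
    count = len(payload) // 2
    values = struct.unpack(f"<{count}H", payload)
    return [(values[i] / SCALE, values[i + 1] / SCALE) for i in range(0, count, 2)]


def payload_kbps(payload, fps):
    """Raw bitrate in kbit/s if one payload of this size is sent per frame."""
    return len(payload) * 8 * fps / 1000


# Hypothetical frame: synthetic coordinates standing in for a landmark detector's output.
frame = [(i / NUM_LANDMARKS, 1.0 - i / NUM_LANDMARKS) for i in range(NUM_LANDMARKS)]
payload = encode_landmarks(frame)
restored = decode_landmarks(payload)
```

Under these assumptions a frame costs 272 bytes before any entropy coding, which is about 65 kbit/s at 30 frames per second. The receiver would hand `restored` to the generator together with the reference frame.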

Implementation details

We base our generator network on the image-to-image translation architecture proposed by Johnson et al., but replace the downsampling and upsampling layers with residual blocks. The discriminator uses a similar network, consisting of residual downsampling blocks without normalization layers. We also insert self-attention blocks at 32×32 spatial resolution in all downsampling parts of the networks and at 64×64 resolution in the upsampling part of the generator.
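The post does not say which self-attention formulation is used. As a rough sketch, assuming a common design in which every spatial position attends to all others and the result is added back to the input with a scaling factor, the core computation looks like the following. The learned query, key, and value projections are omitted for brevity, and `gamma` is a hypothetical name for the residual weight.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, gamma=0.5):
    """Apply residual self-attention to a flattened feature map.

    features: list of N spatial positions, each a list of C channel values.
    Queries, keys, and values are the features themselves here; a real
    block would first apply learned projections to each.
    """
    n = len(features)
    channels = len(features[0])
    out = []
    for i in range(n):
        # Similarity of position i to every position j (dot product).
        scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in range(n)]
        weights = softmax(scores)
        # Weighted average of all positions, per channel.
        attended = [sum(weights[j] * features[j][c] for j in range(n)) for c in range(channels)]
        # Residual connection scaled by gamma.
        out.append([x + gamma * a for x, a in zip(features[i], attended)])
    return out


# A 2x2 feature map with 2 channels, flattened to 4 positions.
fmap = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
result = self_attention(fmap)
```

A 32×32 feature map has 1,024 positions, so the attention matrix has about a million entries per image. That quadratic cost is one reason to place such blocks only at moderate resolutions.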

We also integrated our Super-Resolution model on top of the reconstructed output to enhance the overall image quality without increasing the necessary bandwidth.

Video compression in action

The demo shows the compression model in action as three videos. The first is the H.264-compressed stream. The second is the video reconstructed from a single source image and the landmarks predicted for the driving video. The third applies Super-Resolution on top of the reconstruction to improve the overall video quality.

The compression pipeline can be used as a standalone tool, but it can also be embedded directly into existing video conferencing tools. When embedded, the model can tap into the metadata available for the video stream and dynamically adjust the number of landmarks to improve facial reconstruction.
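The post does not describe how the landmark count is adjusted. One simple policy, sketched here under assumed values, is to pick the largest landmark set whose raw payload fits the bandwidth currently reported for the stream. The candidate counts, the four bytes per landmark, and the function name are all hypothetical.

```python
BYTES_PER_LANDMARK = 4  # assumption: two 16-bit coordinates per landmark


def choose_landmark_count(budget_kbps, fps, candidates=(17, 34, 68, 136)):
    """Return the largest candidate count whose raw payload fits the budget.

    Falls back to the smallest candidate when nothing fits, so the
    receiver always gets at least a coarse set of landmarks.
    """
    fitting = [
        n for n in candidates
        if n * BYTES_PER_LANDMARK * 8 * fps / 1000 <= budget_kbps
    ]
    return max(fitting) if fitting else min(candidates)


# Example: a 50 kbit/s budget at 30 frames per second.
count = choose_landmark_count(budget_kbps=50, fps=30)
```

With these numbers, 34 landmarks cost 32.64 kbit/s and 68 landmarks cost 65.28 kbit/s, so a 50 kbit/s budget selects 34.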

Limitations

Currently, the key limitation of our method is that using landmarks from a different person leads to a noticeable mismatch. In addition, our reconstruction network takes a lot of compute, which hinders wider adoption on resource-constrained devices.

Outlook

Our work would not have been possible without the help of countless open source resources. We hope our contributions will help others in the video compression and web conferencing community build the next generation of innovative technology. We have released the code to reproduce the results.

If you have questions or ideas on how to compress your data, join us on our Gitter #lounge channel or leave a comment in the comment section.

Comments (5)

Guillaume:
Oct 04, 2022 at 08:03 AM

Hi Marcus! :)

Great work! I was wondering how such a codec could be integrated with a videoconferencing web client? Would it be possible to ship it as a browser plugin or something?

1. Marcus Edel:
  Dec 23, 2022 at 06:25 PM
  
  Depending on the videoconferencing platform, a plugin could work. The main issue is intercepting the video stream before it gets sent out. One solution we looked into is to provide a special Chrome or Firefox version with the necessary fixes to make it work.
  
LinuxLover:
Dec 21, 2022 at 10:07 PM

Could this approach be adapted to game/desktop streaming?

1. Marcus Edel:
  Dec 23, 2022 at 06:34 PM
  
  This particular method focuses on web video conferencing (faces); in this setting, we can extract keypoints from the face and later use them to reconstruct the face. That said, this technique can be transferred to other areas, like arbitrary objects in a game. However, we are looking into combining foveated rendering and super-resolution, specifically targeting games, to reduce the bandwidth required.
  
Salvador:
Mar 14, 2023 at 05:39 PM

Amazing work, sir. Regarding what you mention about H.265 and OBS: I would like to showcase OBS with H.265 or H.264 encoding on rk3399. Would that be possible? If so, please consider giving me some hints, since getting the VPU to work on mainline on rk3399 was always a bit difficult.
I got OBS working nicely on blobs with H.265 encoding on rk3588, but would love to see it working decently at least up to 1080p30 on rk3399 with H.264 or H.265 encoding.
