Bag of Freebies for XR Hand Tracking: Machine Learning & OpenXR


Marcus Edel
June 17, 2021

In our previous post, we presented a project backed by INVEST-AI which introduces a multi-stage neural network-based solution that accurately locates and tracks the hands despite complex background noise and occlusion between hands. Now let's dive into the machine learning details of our innovative, open source hand-tracking pipeline.

Hand pose estimation using a video stream lays the foundation for efficient human-computer interaction on head-mounted Augmented Reality (AR) devices such as the Valve Index, Microsoft HoloLens, and Magic Leap One. There has been significant progress in this field recently due to advances in deep learning algorithms and the proliferation of inexpensive consumer-grade cameras.

Despite these advances, it remains a challenge to obtain precise and robust hand pose estimation due to complex pose variations, significant variability in global orientation, self-similarity between fingers, and severe self-occlusion. The time required to estimate the hand pose is another big challenge for XR applications, since real-time responses are needed for reliable applications.

Taking into account the above motivation and challenges, we have implemented a lightweight and top-down pose estimation technique that is suitable for the performance-constrained XR sector. As a result, our methods can be integrated into frameworks such as Monado XR, a free, open-source XR platform that offers fundamental building blocks for different XR devices and platforms.

Our pose estimation pipeline is initiated by detecting the hand using the built-in low light sensitivity cameras found in XR devices such as the Valve Index. It then crops the image to a Region Of Interest (ROI) and localizes 3D landmarks based on heatmap regression. The ROI is then fed into an encoder-decoder network. The encoder is designed to have a small number of parameters, while learning high-level information through 1x1 convolutions, skip connections, and depthwise-separable convolutions. The decoder improves feature map resolution via a deconvolution process, which preserves the high resolution of the hand and avoids false recognition of the background.
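To illustrate why depthwise-separable convolutions keep the encoder small, it helps to compare parameter counts against a standard convolution. The channel sizes below are hypothetical, chosen only for this sketch, and do not correspond to the actual layer configuration:

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1x1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer: 64 -> 128 channels with a 3x3 kernel.
standard = conv_params(64, 128, 3)                  # 73728
separable = depthwise_separable_params(64, 128, 3)  # 8768
print(f"standard: {standard}, separable: {separable}, "
      f"reduction: {standard / separable:.1f}x")
```

The roughly eightfold parameter reduction per layer is what makes this building block attractive for performance-constrained XR hardware.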

Our contributions to the field can be summarized as follows:

    1. Development of an efficient and powerful hand-pose estimation pipeline, comprising a hand detector and a hand landmark estimation model.
    2. Verification of the advantages of data-augmentation and semi-supervised learning during model training.
    3. Modification of state-of-the-art methods to make them more efficient and suitable for timing-constrained XR use cases.

Bag of freebies

Deep learning-based methods are only effective if a large amount of training data is available. The data is usually collected and labeled manually, which is tedious, time-consuming and error-prone. This labeling issue is even worse for 3D computer vision problems, which require the labeling of 3D data, an especially difficult task for humans.

Therefore, many recent works focus on using computer graphics methods to automatically synthesize image data and corresponding annotations. However, the resulting detection performance is usually sub-optimal because synthetic images differ from real images.

In addition, when working on specific complex tasks such as estimating the hand pose from an image, it is difficult to acquire the large amounts of data required to train the models. Though transfer learning techniques could be used to significant effect, it is challenging to find a suitable pre-trained model.

Bag of freebies - Data Augmentation

A popular collection of object detection methods which meet the definition of "bag of freebies" is known as data augmentation. The purpose of data augmentation is to increase the variability of the input images so that the designed object detection model is more robust when applied to images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods, and they definitely benefit the object detection task. To increase the size of our data set and also to create challenging real-world examples, we implemented the following data augmentation strategies:

      • Crop: Crop a part of the image of random size and location.
      • Blur: There are many ways to blur an image. The best known are the average, median, Gaussian and bilateral filters.
      • Perspective transformation: A combination of rotation, translation, shearing, and scaling; these transformations can be performed in three dimensions.
      • Cutout: Cutout involves removing regions of the input image at random. In addition we perform cutout per channel, and also select different replacement values of the deleted regions.
      • Brightness and Contrast: There are several ways to adjust the brightness and contrast; we simply add a random bias.
      • Noise: Noise injection is a fairly common augmentation technique. In practice, we simply add a matrix of the same size as the input, whose elements follow a random distribution.
      • Lens distortion: To mimic the distortion of different camera lenses, we simulated various lens distortions.
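As a minimal sketch, two of the augmentations above (per-channel cutout and noise injection) can be written in a few lines of NumPy. The region size, replacement values, and noise scale here are illustrative choices, not the values used in our pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def cutout_per_channel(img, size=16):
    """Remove a random square region independently in each channel,
    filling it with a randomly chosen replacement value."""
    out = img.copy()
    h, w, c = out.shape
    for ch in range(c):
        y = rng.integers(0, h - size)
        x = rng.integers(0, w - size)
        out[y:y + size, x:x + size, ch] = rng.integers(0, 256)
    return out

def add_noise(img, sigma=10.0):
    """Noise injection: add a random matrix of the same size as the input."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

image = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
augmented = add_noise(cutout_per_channel(image))
print(augmented.shape)  # (128, 128, 3)
```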

Bag of freebies - Semi-supervised learning

In addition to the data-augmentation methods proposed above, another set of "bag of freebies" methods are those dedicated to solving the unlabeled data problem.

We studied how to make use of unlabeled images found for example by using Google image search. Web images are diverse and can be easily acquired in large quantities, supplying a wide variety of object poses, appearances and interactions with the context, which may be absent from curated datasets. However, web images usually lie outside the statistical distribution of the curated data. The domain gap between the two data sets calls for careful method design to make effective use of this second data set.

To facilitate the use of massive amounts of unlabeled data, we build upon a technique known as Noisy Student Training, a semi-supervised learning approach. This technique extends the idea of self-training and distillation by the use of an equal-or-larger student model, as well as the addition of noise to the student model during learning.

This technique has three main steps:

      1. Train a teacher model on labeled images.
      2. Use the teacher model to generate pseudo-labels on unlabeled images.
      3. Train a student model on the combination of labeled images and pseudo-labeled images.

The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and train a new student.
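The three steps above can be sketched as a generic loop. The nearest-centroid "models" below are toy stand-ins for the actual networks, included only so the loop is runnable, and input jitter stands in for the noise added to the student:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y):
    """Toy stand-in for model training: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    """Label each sample by its nearest class centroid."""
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def noisy_student(X_l, y_l, X_u, iterations=3, noise=0.1):
    # Step 1: train a teacher model on labeled data.
    teacher = train(X_l, y_l)
    for _ in range(iterations):
        # Step 2: use the teacher to pseudo-label the unlabeled pool.
        pseudo = predict(teacher, X_u)
        # Step 3: train a noised student on labeled + pseudo-labeled data.
        X_all = np.vstack([X_l, X_u])
        X_all = X_all + rng.normal(0.0, noise, X_all.shape)
        student = train(X_all, np.concatenate([y_l, pseudo]))
        teacher = student  # the student becomes the next teacher
    return teacher

# Two synthetic clusters: a small labeled set and a larger unlabeled pool.
X_l = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
y_l = np.array([0] * 10 + [1] * 10)
X_u = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

student = noisy_student(X_l, y_l, X_u)
print(f"labeled-set accuracy: {(predict(student, X_l) == y_l).mean():.2f}")
```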

Noisy Student Training seeks to improve on self-training and distillation in two ways. Firstly, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. Secondly, it adds noise to the student model so that the noised student is forced to learn harder from the pseudo labels.

Using Noisy Student training, we improved our model pipeline accuracy by 7%. In addition, we not only improved the standard accuracy, but we also improved robustness on real-world scenarios by large margins.

Bag of specials

Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number, and the number of layer outputs for each model in our pipeline.

Bag of specials - Hand detection

The task of hand detection is to find the bounding box of each hand in every input image. To that end, we implemented a simple and efficient CNN architecture inspired by the YOLOv4 architecture, which simultaneously localizes and classifies hands.

We reduced our compute requirements by running the detection model only on one of the images returned from the camera. Once the hands have been detected and tracked in a single image, bounding boxes in the remaining image in the next frame can be obtained using the tracked pose, allowing for subsequent stereo tracking. The resulting system can acquire the hands almost instantaneously while only incurring the cost of a single evaluation.
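As a sketch of how a next-frame region of interest could be derived from previously tracked keypoints, so the detector does not have to run on every frame, consider the following; the padding factor and keypoint values are hypothetical:

```python
import numpy as np

def roi_from_keypoints(keypoints, pad=1.5):
    """Derive the next frame's ROI from tracked 2D keypoints:
    take their bounding box and enlarge it to allow for hand motion."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half = max(x_max - x_min, y_max - y_min) * pad / 2
    return (cx - half, cy - half, cx + half, cy + half)

# Hypothetical tracked keypoints in pixel coordinates.
kps = np.array([[100.0, 120.0], [140.0, 180.0], [120.0, 150.0]])
print(roi_from_keypoints(kps))  # (75.0, 105.0, 165.0, 195.0)
```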

Bag of specials - Keypoint estimation

Our keypoint estimation network predicts the 26 keypoints defined in the OpenXR specification, using a crop of the image based on the bounding box predicted by the hand detection step described above. This is in contrast to previous works, where keypoints are typically estimated from each image independently, which is problematic when the hand is occluded by objects or is only partially visible in the frame. To overcome this issue, our model takes information extracted from previous frames into account: it explicitly incorporates the extrapolated keypoints as an additional network input.

The network outputs a 2D heatmap for each of the 26 predicted keypoints, constructed by evaluating a Gaussian centered at each keypoint location; the heatmaps are then processed by a simple regression model to calculate the final landmarks.
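To make the heatmap representation concrete, the following sketch builds a Gaussian target heatmap for a single keypoint and recovers its coordinates with a simple weighted average, standing in for the regression step; the resolution and sigma are illustrative:

```python
import numpy as np

def gaussian_heatmap(center, size=64, sigma=2.0):
    """2D heatmap with a Gaussian placed at the keypoint location."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def regress_keypoint(heatmap):
    """Recover (x, y) as the heatmap-weighted mean of pixel coordinates,
    a simple stand-in for the regression model on top of the heatmaps."""
    size = heatmap.shape[0]
    ys, xs = np.mgrid[0:size, 0:size]
    w = heatmap / heatmap.sum()
    return float((xs * w).sum()), float((ys * w).sum())

hm = gaussian_heatmap((20.0, 35.0))
print(regress_keypoint(hm))  # approximately (20.0, 35.0)
```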


Our pipeline can accurately detect the landmarks of the hands in different scenarios while also performing favorably in cases where occlusions occur. In our experiments, we found that our method is able to recognize hand gestures in numerous settings. However, we did encounter issues when occlusion occurred over longer periods of time.

The relevant end-to-end usage scenario, source code, and pre-trained models can be found here.

Future Work

Our system remains limited in the following way: it can handle interactions with virtual objects in the air, but it is not designed to reason about hand-to-hand or hand-to-real-world-object interactions. As both of these interactions are critical for an immersive VR experience, we plan to explore how to bridge the gap between hand pose estimation and how hands interact with real-world objects.
