June 17, 2021
In our previous post, we presented a project backed by INVEST-AI which introduces a multi-stage neural network-based solution that accurately locates and tracks the hands despite complex background noise and occlusion between hands. Now let's dive into the machine learning details of our innovative, open source hand-tracking pipeline.
Hand pose estimation from a video stream lays the foundation for efficient human-computer interaction on a head-mounted Augmented Reality (AR) device. See for example the Valve Index, Microsoft HoloLens and Magic Leap One. There has been significant progress recently in this field due to advances in deep learning algorithms and the proliferation of inexpensive consumer-grade cameras.
Despite these advances, it remains a challenge to obtain precise and robust hand pose estimation due to complex pose variations, significant variability in global orientation, self-similarity between fingers, and severe self-occlusion. The time required to estimate the hand pose is another major challenge, since XR applications need real-time responses to be reliable.
Taking into account the above motivation and challenges, we have implemented a lightweight and top-down pose estimation technique that is suitable for the performance-constrained XR sector. As a result, our methods can be integrated into frameworks such as Monado XR, a free, open-source XR platform that offers fundamental building blocks for different XR devices and platforms.
Our pose estimation pipeline is initiated by detecting the hand using the built-in low light sensitivity cameras found in XR devices such as the Valve Index. It then crops the image to a Region Of Interest (ROI) and localizes 3D landmarks based on heatmap regression. The ROI is then fed into an encoder-decoder network. The encoder is designed to have a small number of parameters, while learning high-level information through 1x1 convolutions, skip connections, and depthwise-separable convolutions. The decoder improves feature map resolution via a deconvolution process, which preserves the high resolution of the hand and avoids false recognition of the background.
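To illustrate why depthwise-separable convolutions keep the encoder small, the following minimal sketch compares the number of weights in a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution. The channel counts (64 in, 128 out) and the 3x3 kernel are illustrative assumptions, not necessarily the values used in our encoder, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


# Assumed layer sizes, chosen only for illustration.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 1))
```

With these assumed sizes, the separable layer needs 8,768 weights instead of 73,728, roughly 8.4 times fewer.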
Our contributions to the field can be summarized as follows:
Deep learning-based methods are only effective if a large amount of training data is available. The data is usually collected and labeled manually, which is tedious, time-consuming and error-prone. This labeling issue is even worse for 3D computer vision problems, which require the labeling of 3D data, an especially difficult task for humans.
Therefore, many recent works focus on using computer graphics methods to automatically synthesize image data and corresponding annotation data. However, resulting detection performance is usually sub-optimal because synthetic images do differ from real images.
In addition, when working on specific complex tasks such as estimating the hand pose from an image, it is difficult to acquire the large amounts of data required to train the models. Though transfer learning techniques could be used to significant effect, it is challenging to find a suitable pre-trained model.
Data augmentation is a popular family of object detection methods that meets the definition of a "bag of freebies". The purpose of data augmentation is to increase the variability of the input images so that the designed object detection model is more robust when applied to images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods, and both benefit the object detection task. To increase the size of our data set and also to create challenging real-world examples, we implemented the following data augmentation strategies:
In addition to the data augmentation methods listed above, another set of "bag of freebies" methods is dedicated to solving the unlabeled data problem.
We studied how to make use of unlabeled images found, for example, through Google image search. Web images are diverse and can be easily acquired in large quantities, supplying a wide variety of object poses, appearances and interactions with the context, which may be absent from curated datasets. However, web images usually lie outside the statistical distribution of the curated data. The domain gap between the two data sets calls for careful method design to make effective use of this second data set.
To facilitate the use of massive amounts of unlabeled data, we build upon a technique known as Noisy Student Training, a semi-supervised learning approach. This technique extends the idea of self-training and distillation by the use of an equal-or-larger student model, as well as the addition of noise to the student model during learning.
This technique has three main steps:
The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and train a new student.
Noisy Student Training seeks to improve on self-training and distillation in two ways. Firstly, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. Secondly, it adds noise to the student model so that the noised student is forced to learn harder from the pseudo labels.
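The loop described above can be sketched on a toy problem. As commonly described for Noisy Student Training, a teacher is trained on labeled data, the teacher generates pseudo labels for the unlabeled data, and a noised student is trained on both sets before becoming the next teacher. The sketch below is an assumption-laden illustration, not our training code: the model is a hypothetical 1D nearest-centroid classifier, Gaussian input jitter stands in for the student noise, and the student has the same capacity as the teacher rather than a larger one.

```python
import random


def train_centroids(points, labels, noise=0.0, rng=None):
    """Fit one centroid per class; optional input jitter stands in for student noise."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        if noise and rng is not None:
            x = x + rng.gauss(0.0, noise)
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def noisy_student(labeled_x, labeled_y, unlabeled_x, iterations=3, noise=0.1, seed=0):
    rng = random.Random(seed)
    # Train the teacher on labeled data only, without noise.
    teacher = train_centroids(labeled_x, labeled_y)
    for _ in range(iterations):
        # The teacher assigns pseudo labels to the unlabeled data.
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        # The noised student learns from labeled and pseudo-labeled data.
        student = train_centroids(
            labeled_x + unlabeled_x, labeled_y + pseudo_y, noise=noise, rng=rng
        )
        # The student becomes the teacher for the next round.
        teacher = student
    return teacher


model = noisy_student([0.0, 1.0], [0, 1], [-0.2, 0.1, 0.9, 1.2])
print(predict(model, -0.5), predict(model, 1.5))
```

In a real pipeline the noise would typically come from data augmentation and regularization such as dropout, and the pseudo labels would be keypoints or boxes rather than class indices.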
Using Noisy Student training, we improved our model pipeline accuracy by 7%. In addition, we not only improved the standard accuracy, but we also improved robustness on real-world scenarios by large margins.
Our objective is to find the optimal balance among the input network resolution, the convolutional layer number, the parameter number, and the number of layer outputs for each model in our pipeline.
The task of hand detection is to find the bounding box of each hand in every input image. To that end, we implemented a simple and efficient CNN architecture inspired by the YOLOv4 architecture, which simultaneously localizes and classifies.
We reduced our compute requirements by running the detection model only on one of the images returned from the camera. Once the hands have been detected and tracked in a single image, bounding boxes in the remaining image in the next frame can be obtained using the tracked pose, allowing for subsequent stereo tracking. The resulting system can acquire the hands almost instantaneously while only incurring the cost of a single evaluation.
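As a rough sketch of how a bounding box can be derived from a tracked pose instead of a second detector pass, the hypothetical helper below takes 2D keypoints and returns a padded box. It assumes the keypoints have already been projected into the target image; the function name, the 25% margin and the clamping behavior are illustrative choices, not part of our pipeline.

```python
def bbox_from_keypoints(keypoints, margin=0.25, image_size=None):
    """Return (x_min, y_min, x_max, y_max) around 2D keypoints, padded by `margin`."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Pad each side by a fraction of the box size so the whole hand stays in view.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    box = [x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y]
    if image_size is not None:
        # Clamp the box to the image bounds (width, height).
        width, height = image_size
        box = [
            max(0.0, box[0]),
            max(0.0, box[1]),
            min(float(width), box[2]),
            min(float(height), box[3]),
        ]
    return tuple(box)


print(bbox_from_keypoints([(100, 200), (140, 260), (120, 220)]))
```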
Our keypoint estimation network predicts the 26 keypoints defined in the OpenXR specification, using a crop of the image based on the predicted bounding box from the hand detection step mentioned above. In previous works, by contrast, keypoint estimation is typically performed on each image independently. This is problematic when the hand is occluded by objects or is only partially visible in the frame. To overcome this issue, our model takes information extracted from previous frames into account by explicitly incorporating the extrapolated keypoints as an additional network input.
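One simple way to obtain extrapolated keypoints is a constant-velocity prediction from the two most recent frames. The sketch below shows that idea only; the extrapolation scheme actually used in our pipeline is not specified here, and the function name is hypothetical.

```python
def extrapolate_keypoints(previous, current):
    """Constant-velocity prediction: next = current + (current - previous)."""
    return [
        (2 * cx - px, 2 * cy - py)
        for (px, py), (cx, cy) in zip(previous, current)
    ]


previous = [(0.0, 0.0), (10.0, 5.0)]
current = [(1.0, 2.0), (12.0, 5.0)]
print(extrapolate_keypoints(previous, current))
```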
The network outputs a 2D heatmap for each of the 21 predicted keypoints, constructed by evaluating a Gaussian function. The heatmaps are then processed by a simple regression model to calculate the final landmarks.
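To make the heatmap representation concrete, the following sketch builds a Gaussian heatmap around a keypoint and recovers the keypoint location from it. It is an illustration under assumptions: the heatmap size and sigma are arbitrary, and a soft-argmax (the expected position under the normalized heatmap) stands in for the regression model mentioned above.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Evaluate an unnormalized 2D Gaussian centered at (cx, cy) on a pixel grid."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_heatmap(heatmap):
    """Soft-argmax: expected (x, y) position under the normalized heatmap."""
    total = sum(sum(row) for row in heatmap)
    x = sum(v * col for row in heatmap for col, v in enumerate(row)) / total
    y = sum(v * r for r, row in enumerate(heatmap) for v in row) / total
    return x, y


heatmap = gaussian_heatmap(32, 32, cx=10.5, cy=20.25)
print(decode_heatmap(heatmap))
```

Unlike a plain argmax, the soft-argmax recovers sub-pixel positions, which matters when the heatmap resolution is lower than that of the input crop.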
Our pipeline can accurately detect the landmarks of the hands in different scenarios while also performing favorably in cases where occlusions occur. In our experiments, we found that our method is able to recognize hand gestures in numerous settings. However, we did encounter issues when occlusion occurred over longer periods of time.
The relevant end-to-end usage scenario, source code, and pre-trained models can be found here.
Our system remains limited in the following ways: it can handle interactions with virtual objects in the air, but it is not designed to reason about hand-to-hand or hand-to-real-world-object interactions. As both of these interactions are critical for an immersive VR experience, we plan to explore how to bridge the gap between hand pose estimation and how hands interact with real-world objects.