Monado's hand tracking: hand-waving our way towards a first attempt

Moses Turner
May 31, 2022

As part of my internship at Collabora, I picked up Monado's hand tracking project. Today I will outline the section I did during the summer of 2021, which was a fairly bare-bones first attempt. Keep in mind that we've moved on from the architecture I describe here and have made considerable progress since then. More than anything, this is to illustrate how awesome it is to experience an internship at Collabora.

So. I started my internship right in the middle of this project - we had already done the work on model architecture and had developed unique techniques for training these models. Now it was time to take those trained models and try to deploy them inside Monado. Optical hand tracking for XR has a bit of a reputation as a Very Hard Tracking Task, and indeed it is - getting training data, training neural nets, and deploying them in real-time, low-latency environments such as XR is every bit as hard as they say it is. On top of that, when I started, I had very little experience with computer vision. But somebody needed to do this; I decided I'd be crazy and just go for it.

A few months later, we ended up with this hand-tracking pipeline running inside Monado, our OpenXR runtime:

On Valve Index:

With my custom North Star setup:

That's hand tracking all right! It's not great, but to the best of my knowledge, this was still by far the best optical hand tracking suitable for XR and available on Linux. As far as I'm aware, I was the first person on earth to get optical hand tracking working on the Valve Index's onboard cameras. And it was entirely outside of Valve or SteamVR's purview - we just went for it independently. This fits pretty well with the ethos of libsurvive, an open-source clone of SteamVR Lighthouse tracking, which we also had integrated into Monado.

So how does this work?

Machine Learning (ML) models

We used Mediapipe's model architectures, but we used our own training data and training pipelines to train them.

Inference with ONNX Runtime instead of Tensorflow Lite

We decided not to go with Mediapipe, as its C++ library is very heavy, and it builds with Bazel which is notorious for being finicky and annoying if you want compatibility with CMake/Meson. Monado is lightweight and easy to build, so this wasn't a good fit. Instead, we use ONNX Runtime to run the ML models, which was a great choice - during our tests it was much faster than Tensorflow Lite on CPU, and it's a simple CMake build. Also, it runs natively with ONNX, the popular open file format and de facto standard for machine learning models. Using ONNX has made trying out other inference platforms much easier - these days, everything seems to have a way to go back and forth between ONNX and its native format; +1 for interoperability!

Absolute depth based on keypoint triangulation

Our ML models estimate hand landmarks in "2.5D coordinates," where each keypoint's location is predicted in pixel coordinates and its depth is predicted relative to the wrist. The models don't directly say anything about the hand's absolute depth relative to the camera. That's a problem because we care very much about absolute depth relative to the camera! If your virtual hands don't show up in the same place as your real hands and move through space like real hands, it'll feel weird and be hard to use. So, we run all the models in both views, find the hand keypoints, and extend them as rays coming out of the cameras. Since two such rays rarely intersect exactly, we estimate each hand joint at the point where its pair of rays comes closest together.
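To make the ray idea concrete, here is a minimal sketch in plain Python. It is not Monado's code: the function name `triangulate_midpoint`, the tuple-based vectors, and the choice to return the midpoint of the shortest segment between the two rays are all assumptions made for illustration. Each ray is given by a camera origin and a direction through the detected keypoint.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def scale(a: Vec3, s: float) -> Vec3:
    return (a[0] * s, a[1] * s, a[2] * s)


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(o1: Vec3, d1: Vec3, o2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between two rays (treated as lines).

    o1/o2 are the camera centers, d1/d2 the ray directions through the
    same keypoint as seen by each camera. Directions need not be unit length.
    """
    w = sub(o1, o2)
    a = dot(d1, d1)
    b = dot(d1, d2)
    c = dot(d2, d2)
    d = dot(d1, w)
    e = dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel, depth is undefined")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = add(o1, scale(d1, s))
    p2 = add(o2, scale(d2, t))
    return scale(add(p1, p2), 0.5)
```

When the 2D keypoints are noisy, the two rays shift slightly from frame to frame, and because the cameras sit close together a small angular error becomes a large depth error. That is one way to see where the jitter comes from.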

This works fairly well but can be quite jittery!

As you can see, quite jittery. Just using triangulation isn't good enough - given the short timeframe, we opted for the simplest way to correct the jitter.

Euro filtering

I see your jitter and raise you One Euro!

Given the limited time, we went with a One Euro Filter - a common, quick-to-implement way to smooth out noisy samples. As a type of Infinite Impulse Response filter, the One Euro Filter stores an internal state (in this case a 3D position vector) - on the first sample it receives, it sets this internal state to exactly the first sample. Then, upon receiving a new sample, it interpolates somewhere between its internal state and the new sample. The main innovation of the One Euro Filter is to figure out a good amount to interpolate in order to keep it smooth while not increasing latency more than it has to. If you interpolate too close to the new sample, the latency is low, but the jitter doesn't decrease that much. If you don't interpolate that much, it's smooth, but now there's a lot of latency. So, we tune euro filters to not interpolate very much when the finger measurement is just jittering around but interpolate a lot when we think the hand is actually moving. There are many ways to do filtering, and One Euro filters are very simple to implement. But in most situations they add too much latency for use in XR, and we've since moved on in our new tracking.
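For readers who have not met it before, here is a compact sketch of a One Euro Filter following the published algorithm. The class names, the default parameter values, and the choice to filter each axis of the 3D position independently are assumptions for illustration, not a description of Monado's implementation.

```python
import math


class OneEuroFilter:
    """Scalar One Euro Filter: a low-pass filter with a speed-dependent cutoff."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # cutoff (Hz) used when the signal is still
        self.beta = beta              # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff      # cutoff (Hz) for smoothing the derivative
        self.x_prev = None            # internal state: last filtered value
        self.dx_prev = 0.0            # last filtered derivative

    @staticmethod
    def _alpha(cutoff, dt):
        # Interpolation amount for an exponential low-pass at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, dt):
        if self.x_prev is None:
            # First sample: the state is set to exactly the sample.
            self.x_prev = x
            return x
        # Estimate and smooth the speed of the signal.
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Fast motion -> higher cutoff -> interpolate closer to the new sample.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev = x_hat
        self.dx_prev = dx_hat
        return x_hat


class OneEuroFilter3D:
    """Applies one scalar filter per axis of a 3D position (an assumption)."""

    def __init__(self, **kwargs):
        self.axes = [OneEuroFilter(**kwargs) for _ in range(3)]

    def __call__(self, position, dt):
        return tuple(f(v, dt) for f, v in zip(self.axes, position))
```

The two knobs map onto the trade-off described above: `min_cutoff` sets how heavily a nearly still joint is smoothed, and `beta` sets how quickly the filter lets go of that smoothing once the joint starts moving.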

And a lot more ✨

There's much more I'd like to talk about, so here's a speed round:

The machine learning models used in this pipeline didn't infer handedness, so I came up with a really neat heuristic to figure it out. For each of the four fingers, it takes the cross product of the direction each joint points in with the direction the next joint points in. Hand joints (usually) curl in only one direction, so if most of the cross products point towards the thumb, we guess it's a right hand; otherwise, we guess it's a left hand. This is basically like applying the curl version of the Right-Hand Rule in reverse - instead of using the right-ness of the hand to figure out which way the cross product should point, we use the way the cross product points to figure out the right-ness of the hand. This is a very silly way of doing things, but for what it is, it works surprisingly well.
In this pipeline, all of the hands are re-detected in every frame. I wrote a reasonably complex sorting method to figure out the correspondences between currently-observed hands and previously-observed hands so we can do euro filtering and moving averages of the handednesses.
The "keypoint estimator" - the neural net that estimates the hand keypoints - only estimates 21 keypoints on your hand, and ignores the four metacarpal joints that are near the base of your palm. But, OpenXR's XR_EXT_hand_tracking requires that we estimate them, so we just linearly interpolate between the proximal joints and the wrist joint.
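The handedness heuristic from the first item above can be sketched roughly as follows. This is a reconstruction from the prose, not the original code: the function name `guess_handedness`, the input layout (four fingers, each a list of 3D joint positions from base to tip), and using the vector from each finger's base to the thumb's base as the "towards the thumb" direction are all assumptions.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def guess_handedness(fingers, thumb_base):
    """Vote on handedness from the curl direction of the four fingers.

    fingers: four lists of 3D joint positions, ordered base to tip.
    thumb_base: 3D position of the base of the thumb (hypothetical input).
    """
    towards_thumb = 0
    away_from_thumb = 0
    for joints in fingers:
        to_thumb = sub(thumb_base, joints[0])
        for i in range(len(joints) - 2):
            bone = sub(joints[i + 1], joints[i])
            next_bone = sub(joints[i + 2], joints[i + 1])
            curl_axis = cross(bone, next_bone)
            side = dot(curl_axis, to_thumb)
            if side > 0:
                towards_thumb += 1
            elif side < 0:
                away_from_thumb += 1
            # side == 0 means the finger is perfectly straight: no vote.
    return "right" if towards_thumb > away_from_thumb else "left"
```

A perfectly flat hand produces zero-length cross products and therefore no votes at all, which is one way to see why this heuristic struggles with flat hands.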

Limitations and how to fix them

When I came out of this project, I found a lot of problems!

Running the detection model every frame

The official Mediapipe implementation only runs the hand detection model once every now and then, then for all the subsequent frames it predicts where the new hand should be in pixel coordinates, just predicting based on the keypoints predicted in the past two frames. Since there's a little bit of extra room in the hand region of interest, the predicted region of interest does not have to be perfect, it just has to include the entire hand at a reasonable scale. So, when your hands move slowly, this is a lot better because A) it reduces the compute per frame by roughly 60% and B) the regions of interest that the detection model predicts are quite jittery, and using this prediction method is much smoother. However, Mediapipe's method fails when you move your hands too fast, and ours doesn't. This is a trade-off, and there's no good place to be on it. If this were a dichotomy, the normal Mediapipe way would probably be better. But it's not; there is a third option. ;)
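As a rough picture of what "predicting from the past two frames" can look like, here is a sketch that extrapolates each keypoint at constant velocity and wraps the result in a padded square. The function name `predict_roi`, the constant-velocity assumption, the square shape, and the `margin` factor are illustrative assumptions and are not taken from Mediapipe's source.

```python
def predict_roi(prev_keypoints, curr_keypoints, margin=1.5):
    """Predict the next frame's square region of interest in pixel coordinates.

    prev_keypoints / curr_keypoints: lists of (x, y) pixel positions for the
    same keypoints in the previous two frames.
    Returns (center_x, center_y, side_length).
    """
    # Constant-velocity extrapolation: next = curr + (curr - prev).
    predicted = [
        (2.0 * cx - px, 2.0 * cy - py)
        for (px, py), (cx, cy) in zip(prev_keypoints, curr_keypoints)
    ]
    xs = [p[0] for p in predicted]
    ys = [p[1] for p in predicted]
    center_x = (min(xs) + max(xs)) / 2.0
    center_y = (min(ys) + max(ys)) / 2.0
    # Pad the box so a slightly wrong prediction still contains the whole hand.
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * margin
    return center_x, center_y, side
```

The failure mode mentioned above falls out directly: if the hand accelerates more than the padding allows for, the predicted box no longer contains the hand and the keypoint model has nothing useful to look at.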

Mixing up left and right hands

This one is very clear-cut - that "right-hand rule" trick doesn't work in some cases, and if you only show it exactly flat hands it'll get confused.

The solution is easy - train a neural network that classifies hands!

Region of interest orientation

Mediapipe's keypoint estimation model expects the input image to be rotated such that the fingers are at the top. So, what happens if you make your hand flat like this?

Well, nothing good - you can see that the tracking totally breaks down. The problem is that if the fingers are in the middle, the rotation varies wildly, and sometimes it's undefined. It'll do something, but it won't be what you want. Most other Mediapipe implementations handle it a little better, but in all cases, it's an obvious pain point where it fails a lot more than you'd want. This is another easy fix: we've trained some new models that don't expect the hand to come in any specific orientation, and they work far better.

Failures on tricky fist poses

Here, the detection model and keypoint model are both failing. My best guess is that the training data didn't have very many examples of fists. Simple fix - just include more fists in the training data!

No kinematic constraint

Since we simply triangulate the keypoints, the depth can be wildly wrong if some of the models fail, and you get meter-long hands. The correct move is to use some method to constrain the bone lengths to stay constant over time - your fingers don't grow over timescales we're concerned with - and also force the virtual fingers to only bend in ways human fingers can bend. It is complicated to do this correctly, but today we're almost there.
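To show just the bone-length half of that idea, here is a deliberately naive sketch: it walks a single finger from base to tip and rescales every bone to a fixed reference length while keeping its measured direction. It says nothing about joint angle limits and is not the approach used in the newer tracking; the function name `enforce_bone_lengths` and the per-finger reference `lengths` are hypothetical.

```python
import math


def enforce_bone_lengths(joints, lengths):
    """Rescale each bone of one finger chain to a fixed reference length.

    joints: 3D joint positions ordered base to tip (the measured, noisy chain).
    lengths: reference length for each bone, len(joints) - 1 values.
    Returns a new chain that starts at joints[0] and keeps each bone's
    measured direction but uses the reference length.
    """
    if len(lengths) != len(joints) - 1:
        raise ValueError("need exactly one length per bone")
    fixed = [joints[0]]
    for i, target in enumerate(lengths):
        direction = tuple(joints[i + 1][k] - joints[i][k] for k in range(3))
        norm = math.sqrt(sum(c * c for c in direction))
        if norm < 1e-9:
            raise ValueError("degenerate bone: two joints coincide")
        fixed.append(
            tuple(fixed[-1][k] + direction[k] / norm * target for k in range(3))
        )
    return fixed
```

Even this crude version would rule out meter-long hands, but because each finger is corrected on its own from a possibly wrong base position, a real solution has to solve for the whole hand at once.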

RGB cameras have low light efficiency

Many cameras used for computer vision, notably the ones on most Windows Mixed Reality headsets and the Oculus Quest headsets, don't see in color. Instead, each cell on the sensor just sees one light intensity value, and it can't differentiate between colors. The only difference between these cameras and the normal RGB cameras you know and love is that the grayscale cameras do not have color filters over the sensor cells. So, they're colorblind, but each cell sees all the light that hits it, not just the light of one particular color. They thus receive more light, and we say that grayscale cameras are generally more "light-efficient" than RGB cameras.

Seeing as much light as possible is very important here. If it's too dark, the cameras have to expose for a long time, the hands get all blurry, and at some point it gets hard to differentiate individual fingers. If we were to train neural networks that operate on grayscale images, we could take advantage of grayscale cameras' higher efficiency, and be able to track hands in more adverse conditions.

Better than expected, next steps!

Given the really short timespan of this part of the project and how little I knew when I started, it's pretty amazing to me that this pipeline worked at all, and I'm thrilled with the progress so far.

Since the end of this summer project, we took all the things we learned and applied them to training a much better hand tracking pipeline. We're not diving into that here, but a little sneak peek can't hurt:

This is a demo I recorded recently - everything is still under heavy R&D, and the final product will be quite a lot nicer! I'm really excited to announce it officially when the time comes, but we have a lot more work to do before it's truly ready for prime time.

Care to try our hand-tracking?

Sure, the tracking I showed you is not amazing yet, but it's the best we've got on Linux, and it's worth using! If you have a Valve Index and are comfortable with building from source, all you have to do is build Monado with libsurvive and follow the instructions here and it should work. And, when our new tracking is ready, it'll be a drop-in replacement!

Want to work on something like this at Collabora?

This project has been really cool for me. For the past four years, I've been deeply interested in open-source software, human-machine interfaces, computer vision, and the idea of a free future where knowledge is shared without limitations. Getting an internship (that turned into a full-time engineer position!) has been a dream come true for me. It means I can work full-time on what I'm truly passionate about with minimal stress and a good work-life balance.

I learned plenty of new things after I started. At Collabora, I trained my first neural networks. Since then, I've been soaking up and then directly applying everything I can learn about gradient descent, backpropagation, and all the different types of neural networks people are creating. Similarly, I knew a little bit about Kalman Filters and One Euro Filters before joining Collabora, but I'd never applied them in the real world; I was just theory-crafting. All the different flavors of filtering: Infinite Impulse Response Filters, Finite Impulse Response Filters, Kalman Filters, and energy-minimizing optimizers like Levenberg-Marquardt or just good old Gauss-Newton are all super fascinating to me. I could (and do) talk about them for hours, and I've been able to learn about and directly apply them through my work at Collabora.

If the above sounds like random technobabble to you, good. That was the intention. At Collabora, I could dive deeply into this machine vision problem with little distraction. If this specific avenue isn't interesting to you, that's not surprising! What I do is very niche. But the real takeaway is that we work on many diverse things, and are given a lot of freedom to really understand what it is we're doing. Have a look at some of the other things happening on Collabora's XR team:

We work on SLAM/VIO, which is an entirely different field! The only thing optical hand tracking genuinely has in common with SLAM is that it uses the same cameras and has to run incredibly fast.
We develop and maintain xrdesktop, a library designed for spatial interaction with traditional desktop apps. This one is completely different - the primary focus is not tracking, but computer graphics and user experience. Last year we had a successful summer of code working on xrdesktop.
We're active participants in the Khronos OpenXR working group - bringing you open, cross-platform, interoperable XR.
And of course, we have Monado, our OpenXR runtime that we're building alongside the open-source community!

Right now is an extremely fascinating time for XR, and there is a vast number of intriguing projects to work on. Almost every day I'm sincerely excited to go to work - I undoubtedly love what I do, and it's really amazing to be able to do research out in the open. I get the privilege of knowing that it'll be open access to everyone, forever, and wherever I go I'll be free to use, apply, and discuss the work I'm doing now. If you're interested in software freedom, bringing the lower levels of XR to life, or in general, making computers do things really fast, I urge you to apply for an internship, or apply for an engineering position here at Collabora! I promise that you won't regret it, you'll meet an incredible group of people, and you'll learn a ton.

Special thanks to:

Project North Star and Luxonis, for providing some of the cameras we use while developing our tracking.

StereoKit, for being an incredibly easy to use XR library which has made debugging our hand tracking orders of magnitude easier.

Comments (4)

Andrei:
Jun 07, 2022 at 07:13 AM

Hey Moses, congrats for landing a full time job!

Which languages you used in this project? ML usually involves python

I like what Collabora does/works at, they keep appearing at Phoronix

1. Moses Turner:
  Jun 07, 2022 at 02:48 PM
  
  Python for training, C/C++ for inference!
  
Backer Kuo:
Aug 12, 2022 at 10:34 AM

Hi Moses,

How to make the project run on an android phone?

Alister Mcneely:
Mar 01, 2023 at 12:23 AM

Hi I was wondering if the steam VR driver had an estimated release date because i use windows and I would love to try out your hand tracking
