May 31, 2022
As part of my internship at Collabora, I picked up Monado's hand tracking project. Today I will outline the part I worked on during the summer of 2021, which was a fairly bare-bones first attempt. Keep in mind that we've moved on from the architecture I describe here and have made considerable progress since then. More than anything, this is to illustrate how awesome it is to experience an internship at Collabora.
So. I started my internship right in the middle of this project - we had already done the work on model architecture and had developed unique techniques for training these models. Now it was time to take those trained models and try to deploy them inside Monado. Optical hand tracking for XR has a bit of a reputation as a Very Hard Tracking Task, and indeed it is - getting training data, training neural nets, and deploying them in real-time, low-latency environments such as XR is every bit as hard as they say it is. Also, when I started, I had very little experience with computer vision. But somebody needed to do this; I decided I'd be crazy and just go for it.
A few months later, we ended up with this hand-tracking pipeline running inside Monado, our OpenXR runtime:
On Valve Index:
With my custom North Star setup:
That's hand tracking all right! It's not great, but to the best of my knowledge, this was still by far the best optical hand tracking suitable for XR and available on Linux. As far as I'm aware, I was the first person on earth to get optical hand tracking working on the Valve Index's onboard cameras. And it was entirely outside of Valve or SteamVR's purview - we just went for it independently. This fits pretty well with the ethos of libsurvive, an open-source clone of SteamVR Lighthouse tracking, which we also had integrated into Monado.
We used Mediapipe's model architectures, but we used our own training data and training pipelines to train them.
We decided not to use the Mediapipe library itself, as its C++ library is very heavy, and it builds with Bazel, which is notorious for being finicky and annoying if you want compatibility with CMake/Meson. Monado is lightweight and easy to build, so this wasn't a good fit. Instead, we use ONNX Runtime to run the ML models, which was a great choice - during our tests it was much faster than TensorFlow Lite on CPU, and it's a simple CMake build. Also, it runs natively with ONNX, the popular open file format and de facto standard for machine learning models. Using ONNX has made trying out other inference platforms much easier - these days, everything seems to have a way to go back and forth between ONNX and its native format; +1 for interoperability!
Our ML models estimate hand landmarks in "2.5D coordinates," where each keypoint's location is predicted in pixel coordinates and its depth is predicted relative to the wrist. The models don't directly say anything about the hand's absolute depth relative to the camera. That's a problem because we care very much about absolute depth relative to the camera! If your virtual hands don't show up in the same place as your real hands and move through space like real hands, it'll feel weird and be hard to use. So, we run all the models in both views, find the hand keypoints, and extend them as rays coming out of the cameras. We estimate each hand joint at the point where its pair of rays, one from each camera, comes closest to intersecting.
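To make the ray step concrete, here is a minimal sketch of one common way to do it: find the shortest segment between the two rays and take its midpoint. This is not lifted from Monado; the function name `closest_point_between_rays` and the toy camera positions are made up for illustration, and the rays are treated as infinite lines.

```python
def closest_point_between_rays(o1, d1, o2, d2):
    """Midpoint of the shortest segment joining two 3D lines.

    o1, o2: ray origins (camera centers). d1, d2: ray directions.
    The rays are treated as infinite lines, so a result behind a camera
    is not rejected here. Returns None if the lines are (nearly) parallel.
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def along(o, d, t):
        return tuple(x + t * y for x, y in zip(o, d))

    w = sub(o1, o2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel rays never get closer, so there is no answer

    # Parameters of the closest point on each line.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = along(o1, d1, t1)
    p2 = along(o2, d2, t2)
    return tuple((x + y) / 2 for x, y in zip(p1, p2))


# Two cameras 10 cm apart, both looking at the same fingertip.
left_cam, right_cam = (-0.05, 0.0, 0.0), (0.05, 0.0, 0.0)
fingertip = (0.1, 0.2, 0.5)
ray_left = tuple(f - c for f, c in zip(fingertip, left_cam))
ray_right = tuple(f - c for f, c in zip(fingertip, right_cam))
estimate = closest_point_between_rays(left_cam, ray_left, right_cam, ray_right)
```

With perfect keypoints the two rays meet exactly at the fingertip. With noisy keypoints they miss each other slightly, and the midpoint moves around from frame to frame.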
This works fairly well but can be quite jittery!
As you can see, quite jittery. Just using triangulation isn't good enough - given the short timeframe, we opted for the simplest way to correct the jitter.
I see your jitter and raise you One Euro!
Given the limited time, we went with a One Euro Filter - a common, quick-to-implement way to smooth out noisy samples. As a type of Infinite Impulse Response filter, the One Euro Filter stores an internal state (in this case a 3D position vector) - on the first sample it receives, it sets this internal state to exactly the first sample. Then, upon receiving a new sample, it interpolates somewhere between its internal state and the new sample. The main innovation of the One Euro Filter is to figure out a good amount to interpolate in order to keep it smooth while not increasing latency more than it has to. If you interpolate too close to the new sample, the latency is low, but the jitter doesn't decrease that much. If you stay too close to the internal state, it's smooth, but now there's a lot of latency. So, we tune One Euro filters to stay close to the internal state when the finger measurement is just jittering around, but to move most of the way toward the new sample when we think the hand is actually moving. There are many ways to do filtering, and One Euro filters are very simple to implement. But in most situations they add too much latency for use in XR, and we've since moved on in our new tracking.
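For the curious, here is a small sketch of a One Euro Filter over a 3D position, following the commonly published formulation. It is not Monado's code: filtering each axis independently and the parameter names `min_cutoff`, `beta`, and `d_cutoff` are assumptions made for this example, and the values below are arbitrary rather than tuned.

```python
import math


class OneEuroFilter:
    """Sketch of a One Euro Filter applied per axis to a position vector."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # smoothing applied when nearly still (Hz)
        self.beta = beta              # how strongly speed relaxes the smoothing
        self.d_cutoff = d_cutoff      # smoothing for the speed estimate (Hz)
        self.prev_raw = None
        self.prev_filtered = None
        self.prev_deriv = None
        self.prev_t = None

    @staticmethod
    def _alpha(cutoff, dt):
        # Interpolation amount of a first-order low-pass filter.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, t, x):
        x = list(x)
        if self.prev_filtered is None:
            # First sample: the internal state becomes exactly that sample.
            self.prev_raw, self.prev_filtered = x, list(x)
            self.prev_deriv, self.prev_t = [0.0] * len(x), t
            return list(x)

        dt = t - self.prev_t
        if dt <= 0.0:
            return list(self.prev_filtered)  # ignore out-of-order samples

        a_d = self._alpha(self.d_cutoff, dt)
        out = []
        for i, xi in enumerate(x):
            # Smoothed speed estimate for this axis.
            dx = (xi - self.prev_raw[i]) / dt
            dx_hat = a_d * dx + (1.0 - a_d) * self.prev_deriv[i]
            self.prev_deriv[i] = dx_hat
            # Faster motion -> higher cutoff -> follow the new sample more.
            a = self._alpha(self.min_cutoff + self.beta * abs(dx_hat), dt)
            out.append(a * xi + (1.0 - a) * self.prev_filtered[i])

        self.prev_raw, self.prev_filtered, self.prev_t = x, out, t
        return list(out)


joint_filter = OneEuroFilter(min_cutoff=1.0, beta=10.0)
first = joint_filter(0.0, (0.10, 0.20, 0.50))
second = joint_filter(1.0 / 60.0, (0.101, 0.199, 0.501))
```

Raising `beta` lowers the latency during fast motion at the cost of letting more jitter through, which is exactly the trade-off described above.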
There's much more I'd like to talk about, so here's a speed round:
When I came out of this project, I found a lot of problems!
The official Mediapipe implementation only runs the hand detection model once every now and then. For all the subsequent frames, it predicts where the new hand should be in pixel coordinates, based only on the keypoints predicted in the past two frames. Since there's a little bit of extra room in the hand region of interest, the predicted region of interest does not have to be perfect; it just has to include the entire hand at a reasonable scale. So, when your hands move slowly, this is a lot better because A) it reduces the compute per frame by roughly 60% and B) the regions of interest that the detection model predicts are quite jittery, and using this prediction method is much smoother. However, Mediapipe's method fails when you move your hands too fast, and ours doesn't. This is a trade-off, and there's no good place to be on it. If this were a dichotomy, the normal Mediapipe way would probably be better. But it's not; there is a third option. ;)
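As a rough illustration of predicting a region of interest from two past frames, the sketch below assumes each keypoint keeps moving at constant pixel velocity, then wraps the predicted keypoints in a padded square. Mediapipe's actual logic may well differ; `predict_roi` and the `margin` value are hypothetical names and numbers chosen for this example.

```python
def predict_roi(kps_prev, kps_curr, margin=0.25):
    """Guess the next frame's hand region from the last two frames.

    kps_prev, kps_curr: lists of (x, y) pixel keypoints in the same order.
    Returns (center_x, center_y, size) of a square region of interest.
    """
    # Constant-velocity assumption: next = current + (current - previous).
    predicted = [
        (2 * xc - xp, 2 * yc - yp)
        for (xp, yp), (xc, yc) in zip(kps_prev, kps_curr)
    ]
    xs = [p[0] for p in predicted]
    ys = [p[1] for p in predicted]
    center_x = (min(xs) + max(xs)) / 2
    center_y = (min(ys) + max(ys)) / 2
    # Pad on every side so a slightly wrong guess still contains the hand.
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    size = extent * (1 + 2 * margin)
    return center_x, center_y, size


# A hand drifting 10 px to the right per frame.
roi = predict_roi([(100, 100), (140, 120)], [(110, 100), (150, 120)])
```

The padding is what lets slow motion work so well here, and it is also why fast motion breaks it: once the hand moves further in one frame than the padding allows, it is cropped out.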
This one is very clear-cut - that "right-hand rule" trick doesn't work in some cases, and if you only show it exactly flat hands it'll get confused.
The solution is easy - train a neural network that classifies hands!
Mediapipe's keypoint estimation model expects the input image to be rotated such that the fingers are at the top. So, what happens if you make your hand flat like this?
Well, nothing good - you can see that the tracking totally breaks down. The problem is that if the fingers are in the middle, the rotation varies wildly, and sometimes it's undefined. It'll do something, but it won't be what you want. Most other Mediapipe implementations handle it a little better, but in all cases, it's an obvious pain point where it fails a lot more than you'd want. This is another easy fix: we've trained some new models that don't expect the hand to come in any specific orientation, and they work far better.
Here, the detection model and keypoint model are both failing. My best guess is that the training data didn't have very many examples of fists. Simple fix - just include more fists in the training data!
Since we simply triangulate the keypoints, the depth can be wildly wrong if some of the models fail, and you get meter-long hands. The correct move is to use some method to constrain the bone lengths to stay constant over time - your fingers don't grow over timescales we're concerned with - and also force the virtual fingers to only bend in ways human fingers can bend. It is complicated to do this correctly, but today we're almost there.
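To give a feel for the bone-length half of that idea, here is a deliberately naive sketch: walk one finger from the wrist outward and rescale each bone to a fixed reference length while keeping its direction. This is not the approach we're building, and it ignores joint-angle limits entirely; `enforce_bone_lengths` and the example lengths are made up for illustration.

```python
import math


def enforce_bone_lengths(joints, lengths):
    """Rescale each bone in a joint chain to its reference length.

    joints: list of (x, y, z) positions ordered from the wrist outward.
    lengths: reference bone lengths in meters, one fewer than joints.
    The first joint is kept where it is; bone directions are preserved.
    """
    fixed = [tuple(joints[0])]
    for i in range(1, len(joints)):
        direction = tuple(c - p for c, p in zip(joints[i], joints[i - 1]))
        norm = math.sqrt(sum(v * v for v in direction))
        if norm < 1e-9:
            # No usable direction: collapse this joint onto its parent.
            fixed.append(fixed[-1])
            continue
        scale = lengths[i - 1] / norm
        fixed.append(tuple(f + v * scale for f, v in zip(fixed[-1], direction)))
    return fixed


# A triangulation failure that produced a finger several meters long.
broken = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 3.0)]
repaired = enforce_bone_lengths(broken, [0.04, 0.03])
```

This only hides the symptom. The bone directions still come from the bad triangulation, which is part of why doing it properly is so much more involved.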
Many cameras used for computer vision, notably the ones on most Windows Mixed Reality headsets and the Oculus Quest headsets, don't see in color. Instead, each cell on the sensor just sees one light intensity value, and it can't differentiate between colors. The only difference between these cameras and the normal RGB cameras you know and love is that the grayscale cameras do not have color filters over the sensor cells. So, they're colorblind, but each sensor cell sees all the light that hits it, not just the light of one particular color. They thus receive more light, and we say that grayscale cameras are generally more "light-efficient" than RGB cameras.
Seeing as much light as possible is very important here. If it's too dark, the cameras have to expose for a long time, the hands get all blurry, and at some point it gets hard to differentiate individual fingers. If we were to train neural networks that operate on grayscale images, we could take advantage of grayscale cameras' higher efficiency, and be able to track hands in more adverse conditions.
Given the really short timespan of this part of the project and how little I knew when I started, it's pretty amazing to me that this pipeline worked at all, and I'm thrilled with the progress so far.
Since the end of this summer project, we took all the things we learned and applied them to training a much better hand tracking pipeline. We're not diving into that here, but a little sneak peek can't hurt:
This is a demo I recorded recently - everything is still under heavy R&D, and the final product will be quite a lot nicer! I'm really excited to announce it officially when the time comes, but we have a lot more work to do before it's truly ready for prime time.
Sure, the tracking I showed you is not amazing yet, but it's the best we've got on Linux, and it's worth using! If you have a Valve Index and are comfortable with building from source, all you have to do is build Monado with libsurvive and follow the instructions here and it should work. And, when our new tracking is ready, it'll be a drop-in replacement!
This project has been really cool for me. For the past four years, I've been deeply interested in open-source software, human-machine interfaces, computer vision, and the idea of a free future where knowledge is shared without limitations. Getting an internship (that turned into a full-time engineer position!) has been a dream come true for me. It means I can work full-time on what I'm truly passionate about with minimal stress and a good work-life balance.
I learned plenty of new things after I started. At Collabora, I trained my first neural networks. Since then, I've been soaking up and then directly applying everything I can learn about gradient descent, backpropagation, and all the different types of neural networks people are creating. Similarly, I knew a little bit about Kalman Filters and One Euro Filters before joining Collabora, but I'd never applied them in the real world; I was just theory-crafting. All the different flavors of filtering: Infinite Impulse Response Filters, Finite Impulse Response Filters, Kalman Filters, and energy-minimizing optimizers like Levenberg-Marquardt or just good old Gauss-Newton are all super fascinating to me. I could (and do) talk about them for hours, and I've been able to learn about and directly apply them through my work at Collabora.
If the above sounds like random technobabble to you, good. That was the intention. At Collabora, I could dive deeply into this machine vision problem with little distraction. If this specific avenue isn't interesting to you, that's not surprising! What I do is very niche. But the real takeaway is that we work on many diverse things, and are given a lot of freedom to really understand what it is we're doing. Have a look at some of the other things happening on Collabora's XR team:
Right now is an extremely fascinating time for XR, and there is a vast number of intriguing projects to work on. Almost every day I'm sincerely excited to go to work - I undoubtedly love what I do, and it's really amazing to be able to do research out in the open. I get the privilege of knowing that it'll be open access to everyone, forever, and wherever I go I'll be free to use, apply, and discuss the work I'm doing now. If you're interested in software freedom, bringing the lower levels of XR to life, or in general, making computers do things really fast, I urge you to apply for an internship, or apply for an engineering position here at Collabora! I promise that you won't regret it, you'll meet an incredible group of people, and you'll learn a ton.
Special thanks to:
StereoKit, for being an incredibly easy to use XR library which has made debugging our hand tracking orders of magnitude easier.