Mateo de Mayo
April 05, 2022
Monado now has initial support for 6DoF ("inside-out") tracking for devices with cameras and an IMU. Three free and open source SLAM/VIO solutions were integrated and adapted to work on XR: Kimera-VIO, ORB-SLAM3, and Basalt. Thanks to this, the RealSense and WinMR (Linux only) drivers in Monado were extended to support this type of tracking.
During my six-month internship at Collabora, I had the opportunity to gain firsthand experience on this project. Let me walk you through the concepts, the type of work, and exposure I attained while working remotely as an intern in the XR team for Monado. If any of this has piqued your interest, be sure to check out our careers page.
It is easy to underestimate the number of systems and moving parts that need to work together for something like virtual reality (VR) and augmented reality (AR) to work reasonably well. VR and AR are commonly combined under a single term, XR, because there is a big overlap between the problems that each field needs to solve.
One of XR's critical and surprisingly difficult problems is tracking. Tracking refers to the ability of the XR hardware and software to identify the device's position and rotation in the real world (its pose) so that it can mirror it in the simulated world. There are many ways to achieve this task, and they mostly involve the smart use of physical sensors, be they mechanical, inertial, magnetic, optical, or even acoustic.
All of these tracking methods share a common problem: sensors are imperfect and full of noise. Take, for example, one of the most common sensor packages in use for the task, an inertial measurement unit or IMU. It provides access to a (rate) gyroscope to measure angular velocity and an accelerometer for linear acceleration (and sometimes also a magnetometer). In theory, if their measurements were perfect, IMUs would provide enough data to figure out the pose in space of a device containing one. But even the most expensive IMUs accumulate so much noise that integrating their measurements over short time spans returns poses that are off by hundreds of meters. To counteract the imperfect nature of physical sensors like these, the most successful approaches employ a combination of multiple sensors, together with smart fusion algorithms that can integrate the wildly different measurement types into a good enough pose estimation.
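To get a feel for how quickly this error grows, here is a minimal sketch that dead-reckons the position of a device that is actually at rest. It assumes a one-dimensional setup and a constant accelerometer bias, which is a simplification: real IMU error is also random, and orientation errors leak gravity into the acceleration. The function name drift_after and the bias value used below are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: integrates accelerometer readings twice for a device
// that is really standing still, so every reading is pure bias (in m/s^2).
// Returns the position error (in meters) after the given number of seconds.
double drift_after(double bias_mps2, double rate_hz, double seconds) {
    assert(rate_hz > 0.0);
    const double dt = 1.0 / rate_hz;
    const long steps = std::lround(seconds * rate_hz);

    double velocity = 0.0;
    double position = 0.0;
    for (long i = 0; i < steps; ++i) {
        velocity += bias_mps2 * dt; // first integration: acceleration -> velocity
        position += velocity * dt;  // second integration: velocity -> position
    }
    return position;
}
```

Under these assumptions, drift_after(0.05, 200.0, 60.0) comes out at roughly 90 meters: the error grows with the square of time, which is why an IMU alone cannot be trusted for long.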
In recent years, visual-inertial (VI) tracking has been gaining a lot of traction in the XR ecosystem. By employing camera sensors, usually at least two, together with IMUs inside an XR device (e.g., headset, mobile phone), VI tracking estimates the agent pose by having the device observe its environment and build some understanding of it. One of the main benefits of this type of tracking is that having all these sensors packed inside one device is very convenient for the user, as it requires no special hardware setup in the surrounding areas. This kind of tracking for real-time applications like XR was considered infeasible just about a decade ago, but that is no longer the case thanks to advancements in both hardware and techniques.
There are XR devices that already employ visual-inertial tracking successfully in off-the-shelf products like the Meta Quest, Windows Mixed Reality headsets, or even the ARCore and ARKit SDKs present in mobile devices. However, all of these solutions are proprietary. There is no way for people to improve on their imperfections, re-use them in new projects and products, or even learn from their implementations without getting special licenses from their vendors.
Examples of devices that contain stereo cameras and an IMU and can be used for visual-inertial tracking. Left: Samsung Odyssey+; Right: Intel RealSense D455.
The main problem we're exploring here is called (visual-inertial) simultaneous localization and mapping, or (VI-)SLAM for short. As the name implies, SLAM tries to simultaneously create a map and localize the device in it, starting from no prior knowledge of either the map or the agent's initial pose. There are multiple ways of implementing SLAM and VI-SLAM, but we will focus on approaches that are suited for tracking XR devices. These usually involve the use of fast IMU samples (e.g., 200 Hz) together with slow (e.g., 20 Hz) camera images. The IMU samples are used for measuring the “internal movement” the agent experiences, also known as proprioceptive measurements. On the other hand, the camera samples give information on how the environment is changing when the device moves to help correct the noisy IMU data; they are exteroceptive measurements. The map for VI-SLAM is usually formed from landmarks in the environment that were detected as corners in the camera images and triangulated across multiple measurements. In the best cases, the map in SLAM and the updates it goes through can last the entire run of the simulation; this helps the device understand when it is seeing a place it has already seen before. However, there are simpler solutions in which the map is only kept for a short time; these approaches are called visual-inertial odometry or VIO. They are faster than full SLAM solutions at the cost of accuracy.
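One practical consequence of the two sample rates is that every camera frame arrives with a small batch of IMU samples taken since the previous frame (about ten of them at 200 Hz and 20 Hz). The sketch below shows one way to select that batch by timestamp. The ImuSample layout and the imu_between function are hypothetical and are not taken from any of the systems discussed here; they also assume the samples are already sorted by time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical IMU sample: timestamp plus gyroscope and accelerometer readings.
struct ImuSample {
    std::int64_t timestamp_ns;
    double gyro[3];  // angular velocity, rad/s
    double accel[3]; // linear acceleration, m/s^2
};

// Returns the samples with prev_frame_ns < timestamp <= frame_ns.
// Assumes `imu` is sorted by timestamp in ascending order.
std::vector<ImuSample> imu_between(const std::vector<ImuSample> &imu,
                                   std::int64_t prev_frame_ns,
                                   std::int64_t frame_ns) {
    assert(prev_frame_ns <= frame_ns);
    auto later_than = [](std::int64_t t, const ImuSample &s) {
        return t < s.timestamp_ns;
    };
    // First sample strictly after the previous frame.
    auto first = std::upper_bound(imu.begin(), imu.end(), prev_frame_ns, later_than);
    // First sample strictly after the current frame (exclusive end of the batch).
    auto last = std::upper_bound(first, imu.end(), frame_ns, later_than);
    return std::vector<ImuSample>(first, last);
}
```

A tracker could then integrate each batch to estimate the motion between two frames and let the images correct whatever error accumulated in that short interval.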
Fortunately, academic research on visual-inertial navigation has been thriving during these last decades. It is not only a central issue in areas like robotics, but it is also a fascinating one by nature as it combines topics in computer vision, sensor fusion, optimization, and probabilistic estimation, among others. This intense research has resulted in many, and I mean many, free and open-source implementations with varying degrees of performance, robustness, accuracy, applications, ease of use, and countless other properties. Pages like openslam.org and published surveys, while sometimes outdated, can list dozens of readily available systems to consider. Furthermore, every year new systems emerge while others stop being maintained. Keeping up with the area can be a challenge, but that is a byproduct of its active development.
After years of fragmentation in the XR ecosystem, the Khronos Group, in conjunction with major industry parties, developed the OpenXR API for standardizing the XR software stack while also providing support for vendor-specific extensions. Collabora has been involved in developing the standard since its inception and supports the development of Monado.
Monado is an open-source implementation of the OpenXR standard. It also provides plenty of tools and functionality for XR and several drivers for common hardware devices. While the project had already implemented various methods for device tracking, visual-inertial tracking was a missing feature.
For my internship, I was entrusted with the task of integrating SLAM/VIO solutions into Monado and adapting them for XR. The topics I was exposed to during this period were way out of my realm of knowledge. I'm very glad to have been introduced to them through this project, with great mentors from the company to guide me along the ride.
Next, we'll look at what work has been done to make Monado support SLAM/VIO systems for device tracking.
As discussed above, just deciding which SLAM systems to use can be complicated. I had to filter out dozens of packages, and even then, it was sometimes hard to know if the chosen ones were the best fit for the task. Each implementation has its pros and cons, but their papers usually only highlight the pros. Furthermore, knowing which aspects you should give more weight to can be challenging before knowing more about the field. Questions that arise from the trade-offs the systems present can be difficult to answer if you don't know a priori how important each aspect might be in the final result. Some examples of these decisions involved choosing between faster but less accurate systems or vice versa, systems with different theoretical foundations, different underlying architectures, technologies, etc. Because of this, a large portion of my work consisted of getting a better idea of what these properties looked like for each evaluated implementation by reading their source code, papers, and references.
The process of getting introduced to a new field can be daunting, but it is also very exciting to be able to see your progress along the way. Papers you once felt were indecipherable start making sense bit by bit. The whole picture of the problem you are tackling continues to get clearer in the process. And you start to imagine solutions of your own to different problems in your path. There have been some dead ends in the process, but even those made things a bit more evident and helped the whole project take shape. A key takeaway from my experience is that when you are learning, reading the source code in the early stages is a must. Reading a paper describing a system with jargon is mostly useful for people that already have experience with the terms. For a newcomer, however, it can be confusing until you see how everything gets tied together in the final result. A nice side effect of reading different packages implemented by experts is that you start getting familiar with the approaches they choose, what they have in common, and where they decide to diverge, all of this without getting lost in superficial details like notation or terminology differences.
During the course of the internship, three systems were studied and integrated with Monado in one way or another. I started with Kimera-VIO, a very promising VIO system with a permissive license that supports operating with one (monocular) or two (stereo) cameras together with an IMU. Reading its paper and source code helped me a lot to get introduced to many concepts in the area and to understand how everything is tied together in a real piece of software. Unfortunately, I could not get good results out of it, which was a bit discouraging considering it was my first and supposedly best option at the time.
ORB-SLAM3 was the next shot at getting a system to work. ORB-SLAM3 is the third iteration from a line of SLAM implementations that have consistently appeared within the top scores of the different surveys in the field. Not only that, but the project can operate in almost any sensor configuration, from purely monocular (no IMU) SLAM to full stereo-IMU SLAM. Fortunately, integrating it brought usable results: the tracking worked relatively well, and I now had a new point of reference as to what could be achieved. Besides that, concepts started to get clearer. It is one thing to read that a system has some feature named A, but it makes more sense once you experiment with what feature A produces and, more importantly, how it feels for XR. However, a package with a permissive license would have been preferred, and ORB-SLAM3 is GPL-3.0 licensed (reciprocal rather than permissive).
Finally, Basalt was the last system I approached: a permissively licensed implementation that supports mostly stereo configurations but is flexible enough to extend in the future to other setups. Basalt is surprisingly fast and works great for real-time applications like XR. ORB-SLAM3 was good too, but its strong points seemed to be in the processing of pre-recorded datasets. Not only is Basalt very fast, but its source code also adheres to good software engineering practices in general; things like good amounts of documentation, usage of CI, and an easy build process were a breath of fresh air.
There were other systems considered that almost made it to the list, but didn't make the cut for some reason or another. Implementations like ProSLAM, OpenVINS, RTAB-Map, and HybVIO among others were considered. In a couple of cases, the systems were released during the internship when the backlog was full already.
To integrate these systems with Monado, it was necessary to create a SLAM tracking interface flexible enough to make it easy to swap the implementations in use while keeping any external code out of Monado. This resulted in a standard slam_tracker.hpp header file that each system needs to implement in a separate fork. Writing such a file requires a bit of knowledge of how the system pipeline starts and ends, and sometimes adding some simple queuing mechanisms. This is just an adapter that allows each package to expose the inputs and outputs of its pipeline to Monado for tracking. In Monado, a central SLAM tracker class is in charge of talking with this adapter and implementing any extra functionality that is generic enough for all systems (e.g., pose prediction).
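To make the adapter idea more concrete, here is a rough sketch of what such an interface, a simple output queue, and a generic pose prediction step could look like. Everything in it is hypothetical: the type and function names (SlamAdapter, PoseQueue, predict_linear) are assumptions for illustration and are not copied from the actual slam_tracker.hpp, and the prediction is a plain constant-velocity extrapolation of position rather than whatever Monado really does.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical data types exchanged between Monado and a SLAM system.
struct ImuSample { std::int64_t timestamp_ns; double wx, wy, wz; double ax, ay, az; };
struct Frame { std::int64_t timestamp_ns; bool is_left; /* image data omitted */ };
struct Pose { std::int64_t timestamp_ns; double px, py, pz; double qx, qy, qz, qw; };

// Hypothetical adapter: each SLAM system would implement this in its own fork.
class SlamAdapter {
public:
    virtual ~SlamAdapter() = default;
    virtual void push_imu(const ImuSample &sample) = 0; // pipeline input
    virtual void push_frame(const Frame &frame) = 0;    // pipeline input
    virtual bool try_pop_pose(Pose &out) = 0;           // pipeline output, non-blocking
};

// Simple thread-safe queue an adapter could use to hand estimated poses over.
class PoseQueue {
public:
    void push(const Pose &pose) {
        std::lock_guard<std::mutex> lock(mutex_);
        poses_.push_back(pose);
    }
    // Returns false (and leaves `out` untouched) when no pose is available.
    bool try_pop(Pose &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (poses_.empty()) return false;
        out = poses_.front();
        poses_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<Pose> poses_;
};

// Generic prediction: extrapolate position at constant velocity from the last
// two estimated poses. Orientation is simply kept from the newer pose.
Pose predict_linear(const Pose &older, const Pose &newer, std::int64_t target_ns) {
    assert(newer.timestamp_ns > older.timestamp_ns);
    const double dt = static_cast<double>(newer.timestamp_ns - older.timestamp_ns);
    const double k = static_cast<double>(target_ns - newer.timestamp_ns) / dt;
    Pose out = newer;
    out.timestamp_ns = target_ns;
    out.px += (newer.px - older.px) * k;
    out.py += (newer.py - older.py) * k;
    out.pz += (newer.pz - older.pz) * k;
    return out;
}
```

The point of a split like this is that the queue and the push/pop functions live with each SLAM system, while something like predict_linear only depends on poses and can therefore live once in the central tracker, regardless of which system produced them.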
Here is a simplified diagram detailing the different stages and modules data needs to go through, from raw IMU and camera samples to predicted usable poses for an OpenXR application.
Once a central SLAM tracker entity was in place, it was just a matter of getting drivers for actual hardware devices with cameras and IMUs to stream their data to this class.
The first driver that was adapted to use this SLAM tracker class was the RealSense driver. These cameras from Intel are handy for a multitude of computer vision applications. For example, the T265 model, which has been discontinued, possesses an on-device SLAM solution. The tracking occurs entirely inside the camera and is reported back to the host. The original RealSense driver in Monado supported only that kind of tracking, but it has now been adapted to also support the new SLAM tracker. Now any RealSense device with camera streams and an IMU can be used for SLAM tracking even if it does not support on-device SLAM. The hardware this has been tested with was a RealSense D455.
Monado also supports Windows Mixed Reality (WMR) headsets, thanks to continued work from its community. One particularly amazing experience I had during this internship was the opportunity to work with some contributors of the WMR driver to extend it to support SLAM tracking. Its current state is not as good as the RealSense driver, but it's just a matter of time until it gets there. These are among the first headsets running visual-inertial tracking on a fully open-source stack.
Other minor functionalities were also upstreamed, like a dataset recorder that can save IMU and camera data in a standard format (EuRoC) as well as a way to play back those datasets to the SLAM tracker. Most of my contributions resulted in merge requests to Monado. There was also some involvement with other projects I used along the way, including the SLAM systems themselves.
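For reference, IMU data in the EuRoC layout is stored as CSV rows holding a timestamp in nanoseconds, then the three angular velocity components, then the three linear acceleration components. The snippet below is a minimal sketch of formatting one such row. It is not Monado's recorder, and the ImuSample struct and euroc_imu_row function are names made up for this example.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical IMU sample, with fields ordered like an EuRoC IMU row.
struct ImuSample {
    std::int64_t timestamp_ns;
    double wx, wy, wz; // angular velocity, rad/s
    double ax, ay, az; // linear acceleration, m/s^2
};

// Formats one sample as "timestamp,wx,wy,wz,ax,ay,az" (no trailing newline).
std::string euroc_imu_row(const ImuSample &s) {
    std::ostringstream out;
    out.precision(17); // enough digits to round-trip a double
    out << s.timestamp_ns << ',' << s.wx << ',' << s.wy << ',' << s.wz << ','
        << s.ax << ',' << s.ay << ',' << s.az;
    return out.str();
}
```

Playing a dataset back is then mostly the reverse operation: parse each row and feed the samples to the tracker in timestamp order.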
To wrap things up, below is a video of an OpenXR application (bottom right), running on Monado, using a RealSense D455 as the data source (top left) and Basalt as the SLAM backend (top right).
These last months have been incredible; the people I've met, the topics I was introduced to, the type of work I was involved in, and all from the comfort of my home. Doing an internship at Collabora has been an incredible experience; it provided me with the time and resources to make meaningful contributions to a remarkable open-source project, which can be hard to do or even get started with otherwise. If you feel like any of this sounded like something you would be interested in, I encourage you to apply for the next round of internships!