Visual-inertial tracking for Monado

Mateo de Mayo
April 05, 2022

Monado now has initial support for 6DoF ("inside-out") tracking for devices with cameras and an IMU. Three free and open source SLAM/VIO solutions were integrated and adapted to work on XR: Kimera-VIO, ORB-SLAM3, and Basalt. Thanks to this, the RealSense and WinMR (Linux only) drivers in Monado were extended to support this type of tracking.

During my six-month internship at Collabora, I had the opportunity to gain firsthand experience on this project. Let me walk you through the concepts, the type of work, and the exposure I gained while working remotely as an intern in the XR team for Monado. If any of this has piqued your interest, be sure to check out our careers page.

Some background

Tracking in XR

It is often easy to underestimate the number of systems and moving parts that need to be working together for something like virtual reality (VR) and augmented reality (AR) to work reasonably well. Combining VR and AR under a single term, XR, is common because there is a big overlap between the problems that each field needs to solve.

One of XR's critical and surprisingly difficult problems is tracking. Tracking refers to the ability of the XR hardware and software to identify the device's position and rotation in the real world, its pose, so that it can mirror it in the simulated world. There are many ways to achieve this task, and they mostly involve the smart use of physical sensors, be they mechanical, inertial, magnetic, optical, or even acoustic [1].

All of these tracking methods share a common problem: sensors are imperfect and full of noise. Take, for example, one of the most common sensor packages in use for the task, an inertial measurement unit or IMU. It provides access to a (rate) gyroscope to measure angular velocity and an accelerometer to measure linear acceleration (and sometimes also a magnetometer). In theory, if their measurements were perfect, IMUs would provide enough data to figure out the pose in space of a device containing them. But even the most expensive IMUs accumulate so much error that integrating their measurements over short time spans returns poses that are off by hundreds of meters. To counteract the imperfect nature of physical sensors like these, the most successful approaches employ a combination of multiple sensors, together with smart fusion algorithms that can integrate the wildly different measurement types into a good enough pose estimate.
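To get a feel for why pure IMU integration drifts, here is a minimal sketch. It assumes a stationary device and a one-dimensional accelerometer with a small constant bias plus Gaussian noise. The bias, noise level, and sample rate are illustrative values made up for this example, not measurements of any real sensor.

```python
import random


def integrate_position_error(bias=0.05, noise_std=0.02, rate_hz=200, seconds=60.0, seed=0):
    """Double-integrate the accelerometer of a stationary device (1D).

    The true acceleration is zero, so whatever position comes out is pure error.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    dt = 1.0 / rate_hz
    velocity = 0.0
    position = 0.0
    for _ in range(int(seconds * rate_hz)):
        measured = bias + rng.gauss(0.0, noise_std)  # m/s^2, ideally 0
        velocity += measured * dt   # first integration: acceleration -> velocity
        position += velocity * dt   # second integration: velocity -> position
    return position


for t in (1, 10, 60):
    print(f"after {t:>2} s: {integrate_position_error(seconds=t):8.2f} m of drift")
```

With a constant bias b, the position error grows roughly as b·t²/2, so doubling the integration time quadruples the drift. This is why an external correction, such as camera observations, is needed.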

In recent years, visual-inertial (VI) tracking has been gaining a lot of traction in the XR ecosystem. By employing camera sensors, usually at least two, together with IMUs inside an XR device (e.g., headset, mobile phone), VI tracking estimates the agent's pose by making the device look at its environment and gain some understanding of it. One of the main benefits of this type of tracking is that having all these sensors packed inside one device is very convenient for the user, as it requires no special hardware setup in the surrounding areas. This kind of tracking for real-time applications like XR was considered unfeasible just about a decade ago, but that is no longer the case thanks to advancements in both hardware and techniques.

There are XR devices that already employ visual-inertial tracking successfully in off-the-shelf products like the Meta Quest, Windows Mixed Reality headsets, or even the ARCore and ARKit SDKs present in mobile devices. However, all of these solutions are proprietary. There is no way for people to improve on their imperfections, re-use them in new projects and products, or even learn from their implementations without getting special licenses from their vendors.

Examples of devices that contain stereo cameras and an IMU, and that can be used for visual-inertial tracking. Left: Samsung Odyssey+; Right: Intel RealSense D455.

SLAM research field

The main problem we're exploring here is called (visual-inertial) simultaneous localization and mapping, or (VI-)SLAM for short. As the name implies, SLAM tries to simultaneously create a map and localize the device in it, starting with no prior knowledge of either the map or the agent's initial pose. There are multiple ways of implementing SLAM and VI-SLAM, but we will focus on approaches that are suited for tracking XR devices. These usually involve the use of fast IMU samples (e.g., 200 Hz) together with slow camera images (e.g., 20 Hz). The IMU samples are used for measuring the “internal movement” the agent experiences, also known as proprioceptive measurements. The camera samples, on the other hand, give information on how the environment changes as the device moves, which helps correct the noisy IMU data; they are exteroceptive measurements. The map for VI-SLAM is usually formed from landmarks in the environment that were detected as corners in the camera images and triangulated over multiple measurements. In the best cases, the map in SLAM and the updates it goes through can last the entire run of the simulation; this helps the device understand when it is seeing a place it has already seen before. However, there are simpler solutions in which the map is only kept for a short time; these approaches are called visual-inertial odometry or VIO. They are faster than full SLAM solutions at the cost of accuracy.
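Because the two sensors run at different rates, a tracker typically has to associate each camera frame with the IMU samples that arrived since the previous frame. The sketch below shows one simple way to do that grouping. The function name and the use of plain timestamp lists are assumptions made for illustration and are not taken from any of the systems discussed here.

```python
from bisect import bisect_right


def imu_batches(imu_timestamps, frame_timestamps):
    """Group sorted IMU timestamps by the camera frame interval they fall in.

    Batch i holds the IMU samples with frame[i-1] < t <= frame[i]
    (batch 0 holds everything up to and including the first frame).
    Samples newer than the last frame are left out until a new frame arrives.
    """
    batches = []
    start = 0
    for frame_t in frame_timestamps:
        end = bisect_right(imu_timestamps, frame_t)
        batches.append(imu_timestamps[start:end])
        start = end
    return batches


# Integer nanosecond timestamps avoid floating-point comparison surprises.
NS = 1_000_000_000
imu_ts = [i * NS // 200 for i in range(1, 201)]   # 200 Hz for one second
frame_ts = [i * NS // 20 for i in range(1, 21)]   # 20 Hz for one second
batches = imu_batches(imu_ts, frame_ts)
print([len(b) for b in batches])
```

With these example rates, every frame ends up with a batch of ten IMU samples between it and the previous frame.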

Fortunately, academic research on visual-inertial navigation has been thriving during these last decades. It is not only a central issue in areas like robotics, but it is also a fascinating one by nature as it combines topics in computer vision, sensor fusion, optimization, and probabilistic estimation, among others. This intense research has resulted in many, and I mean many, free and open-source implementations with varying degrees of performance, robustness, accuracy, applications, ease of use, and countless other properties. Pages like openslam.org and surveys like [2] and [3], while sometimes outdated, can list dozens of readily available systems to consider. Furthermore, every year new systems emerge while others stop being maintained. Keeping up with the area can be a challenge, but that is a byproduct of its active development.

Monado and OpenXR

After years of fragmentation in the XR ecosystem, the Khronos Group, in conjunction with major industry parties, developed the OpenXR API for standardizing the XR software stack while also providing support for vendor-specific extensions. Collabora has been involved in developing the standard since its inception and supports the development of Monado.

Monado is an open-source implementation of the OpenXR standard. It also provides plenty of tools and functionality for XR and several drivers for common hardware devices. While the project had already implemented various methods for device tracking, visual-inertial tracking was a missing feature.

Internship work

For my internship, I was entrusted with the task of integrating SLAM/VIO solutions into Monado and adapting them for XR. The topics I was exposed to during this period were way out of my realm of knowledge. I'm very glad to have been introduced to them through this project, with great mentors from the company to guide me along the ride.

Next, we'll look at what work has been done to make Monado support SLAM/VIO systems for device tracking.


As discussed above, just deciding which SLAM systems to use can be complicated. I had to filter out dozens of packages, and even then, it was sometimes hard to know if the chosen ones were the best fit for the task. Each implementation has its pros and cons, but their papers usually only highlight the pros. Furthermore, knowing which aspects you should give more weight to can be challenging before knowing more about the field. Questions that arise from the trade-offs the systems present can be difficult to answer if you don't know a priori how important each aspect might be in the final result. Some examples of these decisions involved choosing between faster but less accurate systems or vice versa, systems with different theoretical foundations, different underlying architectures, technologies, etc. Because of this, a large portion of my work consisted of getting a better idea of what these properties looked like for each evaluated implementation by reading their source code, papers, and references.

The process of getting introduced to a new field can be daunting, but it is also very exciting to be able to see your progress along the way. Papers you once felt were indecipherable start making sense bit by bit. The whole picture of the problem you are tackling continues to get clearer in the process. And you start to imagine solutions of your own to different problems in your path. There have been some dead ends in the process, but even those made things a bit more evident and helped the whole project take shape. A key takeaway from my experience is that when you are learning, reading the source code in the early stages is a must. Reading a paper describing a system with jargon is mostly useful for people that already have experience with the terms. However, it can get confusing for a newcomer until you see how everything gets tied together in the final result. A nice side effect of reading different packages implemented by experts is that you start getting familiar with the approaches they choose, what they have in common, and where they decide to diverge, all of this without getting lost in superficial details like notation or terminology differences.


During the course of the internship, three systems were studied and integrated with Monado in one way or another. I started with Kimera-VIO [4], a very promising VIO system with a permissive license that supports operating with one (monocular) or two (stereo) cameras together with an IMU. Reading its paper and source code helped me a lot to get introduced to many concepts in the area and to understand how everything is tied together in a real piece of software. Unfortunately, I could not get good results out of it, which was a bit discouraging considering it was my first and supposedly best option at the time.

ORB-SLAM3 [5] was the next shot at getting a system to work. ORB-SLAM3 is the third iteration in a line of SLAM implementations that have consistently appeared among the top scores of the different surveys in the field. Not only that, but the project can operate in almost any sensor configuration, from purely monocular (no IMU) SLAM to full stereo-IMU SLAM. Fortunately, integrating this brought usable results: the tracking worked relatively well, and I now had a new point of reference as to what could be achieved. Besides that, concepts started to get clearer; it is one thing to read that a system has some feature named A, but it makes much more sense once you experience what feature A produces and, more importantly, how it feels for XR. However, in this case, a package with a permissive license would have been preferred, while ORB-SLAM3 is GPL-3.0 licensed (reciprocal rather than permissive).

Finally, Basalt [6] was the last system I approached: a permissively licensed implementation that mostly supports stereo configurations but is flexible enough to be extended to other setups in the future. Basalt is surprisingly fast and works great for real-time applications like XR. ORB-SLAM3 was good too, but its strong points seemed to be in the processing of pre-recorded datasets. Not only is Basalt fast, but its source code also adheres to good software engineering practices in general; things like good amounts of documentation, usage of CI, and an easy build process were a breath of fresh air.

There were other systems considered that almost made it to the list, but didn't make the cut for some reason or another. Implementations like ProSLAM, OpenVINS, RTAB-Map, and HybVIO among others were considered. In a couple of cases, the systems were released during the internship when the backlog was full already.

Integration and devices

For integrating these systems with Monado, it was necessary to create a SLAM tracking interface flexible enough to make it easy to swap the implementations in use while keeping any external code out of Monado. This resulted in a standard slam_tracker.hpp header file that each system needs to implement in a separate fork. Writing such an implementation requires a bit of knowledge of how the system's pipeline starts and ends, and sometimes adding some simple queuing mechanisms. This is just an adapter that allows each package to expose the inputs and outputs of its pipeline to Monado for tracking. In Monado, a central SLAM tracker class is in charge of talking with this adapter and implementing any extra functionality that is generic enough for all systems (e.g., pose prediction).
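To make the adapter idea more concrete, here is a rough sketch of its shape, written in Python for brevity. The real interface is the C++ header slam_tracker.hpp; the class and method names below (SlamAdapter, push_imu, push_frame, try_pop_pose) are hypothetical and only illustrate the pattern of pushing sensor data in and polling estimated poses out through a queue.

```python
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass


@dataclass
class ImuSample:
    timestamp_ns: int
    gyro: tuple    # angular velocity (rad/s)
    accel: tuple   # linear acceleration (m/s^2)


@dataclass
class Pose:
    timestamp_ns: int
    position: tuple     # (x, y, z)
    orientation: tuple  # quaternion (w, x, y, z)


class SlamAdapter(ABC):
    """Hypothetical adapter: the only part of a SLAM system the runtime sees."""

    @abstractmethod
    def push_imu(self, sample): ...

    @abstractmethod
    def push_frame(self, timestamp_ns, image): ...

    @abstractmethod
    def try_pop_pose(self): ...


class DummyAdapter(SlamAdapter):
    """Stand-in backend that emits an identity pose for every frame it receives."""

    def __init__(self):
        self._poses = deque()  # simple output queue between pipeline and runtime
        self.imu_count = 0

    def push_imu(self, sample):
        self.imu_count += 1

    def push_frame(self, timestamp_ns, image):
        self._poses.append(Pose(timestamp_ns, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)))

    def try_pop_pose(self):
        # Non-blocking: returns None when no estimate is ready yet.
        return self._poses.popleft() if self._poses else None


adapter = DummyAdapter()
adapter.push_imu(ImuSample(1_000, (0.0, 0.0, 0.0), (0.0, 0.0, 9.81)))
adapter.push_frame(5_000, b"")
pose = adapter.try_pop_pose()
print(pose)
```

The point of the pattern is that the caller depends only on the abstract class, so swapping one backend for another does not require touching the code that feeds samples and reads poses.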

Here is a simplified diagram detailing the different stages and modules data needs to go through, from raw IMU and camera samples to predicted usable poses for an OpenXR application.

OpenXR SLAM system data flow in Monado
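The last stage of that data flow, going from an estimated pose to a predicted one, can be illustrated with a generic constant-velocity model. The sketch below is an assumption-laden example and not Monado's actual prediction code: it extrapolates the position with a constant linear velocity and the orientation with a constant angular velocity, as a gyroscope would report it in the device frame.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def predict_pose(position, orientation, velocity, angular_velocity, dt):
    """Extrapolate a pose dt seconds ahead, assuming constant velocities.

    angular_velocity is expressed in the device (body) frame, in rad/s.
    """
    new_position = tuple(p + v * dt for p, v in zip(position, velocity))
    rate = math.sqrt(sum(w * w for w in angular_velocity))
    angle = rate * dt
    if angle < 1e-12:
        return new_position, orientation  # no measurable rotation
    axis = tuple(w / rate for w in angular_velocity)
    half = angle / 2.0
    delta = (math.cos(half),) + tuple(a * math.sin(half) for a in axis)
    # Body-frame rotation is applied on the right of the current orientation.
    return new_position, quat_mul(orientation, delta)


# Predict about 11 ms ahead (an illustrative value) from the latest estimate.
pos, rot = predict_pose((0.0, 1.6, 0.0), (1.0, 0.0, 0.0, 0.0),
                        (0.5, 0.0, 0.0), (0.0, 1.0, 0.0), 0.011)
print(pos, rot)
```

A model this simple is only reasonable over short horizons; the further ahead the prediction, the more the constant-velocity assumption breaks down.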

Once a central SLAM tracker entity was in place, it was just a matter of getting drivers for actual hardware devices with cameras and IMUs to stream their data to this class.

The first driver that was adapted to use this SLAM tracker class was the RealSense driver. These cameras from Intel are handy for a multitude of computer vision applications. For example, the T265 model, which has been discontinued, possesses an on-device SLAM solution. The tracking occurs entirely inside the camera and is reported back to the host. The original RealSense driver in Monado supported only that kind of tracking, but it has now been adapted to also support the new SLAM tracker. Now any RealSense device with camera streams and an IMU can be used for SLAM tracking even if it does not support on-device SLAM. The hardware this has been tested with was a RealSense D455.

Monado also supports Windows Mixed Reality (WMR) headsets, thanks to continued work from its community. One particularly amazing experience I had during this internship was the opportunity to work with some contributors of the WMR driver for extending it to support SLAM tracking. Its current state is not as good as the RealSense driver, but it's just a matter of time until it gets there. These are among the first headsets running visual-inertial tracking on a fully open-source stack.

Other minor functionalities were also upstreamed, like a dataset recorder that can save IMU and camera data in a standard format (EuRoC) as well as a way to play back those datasets to the SLAM tracker. Most of my contributions resulted in merge requests to Monado. There was also some involvement with other projects I used along the way, including the SLAM systems themselves.
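As a rough illustration of what recording and playing back IMU data in a EuRoC-style layout involves, here is a small sketch. The header row follows the column layout of the IMU CSV files in the public EuRoC MAV dataset as I understand it, so treat it as an assumption to check against the dataset. The function names are hypothetical and are not Monado's recorder API.

```python
import csv
import io

# Assumed EuRoC-style IMU columns: timestamp, gyroscope xyz, accelerometer xyz.
EUROC_IMU_HEADER = [
    "#timestamp [ns]",
    "w_RS_S_x [rad s^-1]", "w_RS_S_y [rad s^-1]", "w_RS_S_z [rad s^-1]",
    "a_RS_S_x [m s^-2]", "a_RS_S_y [m s^-2]", "a_RS_S_z [m s^-2]",
]


def write_imu_csv(stream, samples):
    """Write (timestamp_ns, gyro_xyz, accel_xyz) tuples as CSV rows."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(EUROC_IMU_HEADER)
    for timestamp_ns, gyro, accel in samples:
        writer.writerow([timestamp_ns, *gyro, *accel])


def read_imu_csv(stream):
    """Read the rows back into (timestamp_ns, gyro_xyz, accel_xyz) tuples."""
    reader = csv.reader(stream)
    next(reader)  # skip the header row
    return [
        (int(row[0]), tuple(map(float, row[1:4])), tuple(map(float, row[4:7])))
        for row in reader
    ]


samples = [
    (1_000_000, (0.01, -0.02, 0.0), (0.0, 9.81, 0.1)),
    (6_000_000, (0.02, -0.01, 0.0), (0.0, 9.80, 0.2)),
]
buffer = io.StringIO()
write_imu_csv(buffer, samples)
buffer.seek(0)
print(read_imu_csv(buffer) == samples)
```

An in-memory buffer is used here so the example is self-contained; a real recorder would write to a file inside the dataset directory and store the camera images alongside it.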


To wrap things up, below is a video of an OpenXR application (bottom right), running on Monado, using a RealSense D455 as the data source (top left) and Basalt as the SLAM backend (top right).

These last months have been incredible; the people I've met, the topics I was introduced to, the type of work I was involved in, and all from the comfort of my home. Doing an internship at Collabora has been an incredible experience; it provided me with the time and resources to make meaningful contributions to a remarkable open-source project, which can be hard to do or even get started with otherwise. If you feel like any of this sounded like something you would be interested in, I encourage you to apply for the next round of internships!


  1. Welch & Foxlin 2002. Motion tracking: no silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications 22, 6 (November 2002).
  2. Servières et al. 2021. Visual and Visual-Inertial SLAM: State of the Art, Classification, and Experimental Benchmarking. Journal of Sensors 2021, (February 2021).
  3. Taketomi 2017. Visual-SLAM Algorithms: a Survey from 2010 to 2016. (2017), 11
  4. Rosinol et al. 2020. Kimera: an Open-Source Library for Real-Time Metric-Semantic Localization and Mapping. arXiv:1910.02490 [cs] (March 2020).
  5. Campos et al. 2021. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. IEEE Trans. Robot. 37, 6 (December 2021), 1874–1890.
  6. Usenko et al. 2020. Visual-Inertial Mapping with Non-Linear Factor Recovery. IEEE Robot. Autom. Lett. 5, 2 (April 2020), 422–429.

Comments (10)

  1. shawn.xiao:
    Apr 25, 2022 at 10:21 AM

    Hi Mayo:

    I’m a VR software engineer from Shenzhen, China.
    I’m currently doing some work on OpenXR, the runtime, and SLAM.
    The runtime we use is Monado.
    There is a problem that has troubled me for a while:
    how is the TUM data (time,tx,ty,tz,qx,qy,qz,qw) obtained by the SLAM algorithm passed to the OpenXR SDK through Monado?

    Hope to get your help

    1. Mateo de Mayo:
      Apr 25, 2022 at 04:11 PM

      Hello shawn.xiao!

      In general, the VIO/SLAM tracking occurs entirely inside Monado with the device driver being the one in charge of setting up the tracker.
      These systems usually return a pose estimate for each pair of stereo frames captured with the cameras you may be using.
      That frequency is usually too low for tracking in VR; about 30fps while screen refresh rates are usually 3-5 times higher. Because of this, we instead use the system pose estimates as a baseline to which we also apply simple prediction algorithms that make use of the latest IMU samples Monado might have received on top of the VIO/SLAM estimates. On top of that, we also have some filtering algorithms to smooth out some systems that produce jittery trajectories.

      All of this to say that, by default, you won't get the same poses that the VIO/SLAM systems output in your OpenXR app unless you explicitly disable all of these modifications in the configuration UI. Having said that, it's worth mentioning that you can instruct the tracker to output raw estimates in a CSV file in EuRoC format (time,tx,ty,tz,qw,qx,qy,qz).

      I hope this makes things a bit more clear, and in any case, I would invite you to join Monado's discord server and tag me (@mateosss) if you need more help: https://discord.gg/8RkJgRJ
      Or to create an issue on https://gitlab.freedesktop.org/monado/monado

      1. shawn.xiao:
        Apr 26, 2022 at 07:19 AM

        Hi mateo:

        Thank you very much for your reply, it helped me a lot,
        but I don't know why, I can't access https://discord.gg/8RkJgRJ; it shows that it can't connect.
        Access to https://gitlab.freedesktop.org/monado/monado is OK.

        If I have other problems that I can't solve during development,
        I will seek for your help on freedesktop or your blogs,

        Thanks again!

        1. Moses Turner:
          Apr 26, 2022 at 05:46 PM

          try this! https://discord.gg/B545yA4hAE
          I'll go look at the link on our freedesktop page, it might be broken.

        2. Mateo de Mayo:
          Apr 26, 2022 at 06:12 PM

          No problem!

          The discord link does work for me, maybe your network is blocking it?
          In any case feel free to ask through the issue tracker on gitlab.freedesktop.org.

          Good luck

  2. Winters:
    Sep 01, 2022 at 06:14 AM

    I used Basalt as the SLAM system in Monado, but it seems that the relocalization function is not implemented in Basalt.
    Do you have any idea how to implement relocalization, or which other SLAM system could be used instead of Basalt?


  3. Mateo de Mayo:
    Sep 01, 2022 at 06:15 PM


    Indeed, Basalt is not a full SLAM system and it's only doing VIO for now; we have plans to improve on this.
    ORB-SLAM3 does have relocalization/full SLAM capabilities right from the start and you might want to try it out with Monado.

    If you do, you'll need to check out this fork of ORB-SLAM3 for instructions:

    As this system has not been our focus, some of the documentation could be slightly outdated,
    so feel free to reach out for help in Monado's discord server.

    1. Winters:
      Sep 08, 2022 at 01:16 AM

      Thank you for your kind answer.
      Do you mean that your plan is to add the relocalization function of ORB-SLAM3 on the Basalt ? Or select ORB-SLAM3 full system ?


      1. Mateo de Mayo:
        Sep 08, 2022 at 09:34 PM

        As of now, it is possible to use ORB-SLAM3, the "full system", with a fork of Monado by following the repository I linked in my answer.
        However, after evaluating different metrics (and other characteristics) of ORB-SLAM3 w.r.t. XR, and not looking only at the absolute trajectory error (ATE) which is the main metric these systems are usually compared with, I think the best way forward would be to invest in developing new modules on top of Basalt which already provides a very good foundation. Some topics of interest for XR are the ones mentioned in issues #62, #69, and #88 in Basalt upstream: https://gitlab.com/VladyslavUsenko/basalt/-/issues

  4. Winters:
    Sep 08, 2022 at 01:22 AM

    Thank you for your kind answer.

    Do you mean that your plan is to add the relocalization part of ORB-SLAM3 to the Basalt ?
    Or select the Full ORB-SLAM3 ?

