October 04, 2022
For the past several months, I've been working on writing a brand new open-source Vulkan driver for NVIDIA hardware in Mesa called NVK. This new driver has been written primarily by me (Jason Ekstrand), along with Karol Herbst and Dave Airlie at Red Hat. In the last month or two, we've started picking up a few commits here and there from community folks, and I'm hopeful that community involvement will only increase going forward.
Support for NVIDIA hardware in open-source drivers has always been somewhat lacking. The nouveau drivers exist, but they're often missing features, buggy, or just don't support certain cards. This is due to a combination of factors. Unlike the Intel and AMD drivers, the nouveau driver stack has been developed with little to no official documentation or help from NVIDIA. NVIDIA occasionally provides little bits of support here and there. Historically, that support has mostly been focused on enabling nouveau just enough that you can install your Linux distro, get to a web browser, and download their proprietary driver stack.
Most of the hardware knowledge we (the open-source graphics community) have is learned by reverse-engineering, digging through CUDA documentation (it's amazingly low-level sometimes), and picking through the few bits NVIDIA drops us here and there. This slows down development in the best of times and makes solving certain problems nearly impossible.
Probably the biggest area of technical struggle has been properly driving the hardware from kernel space. NVIDIA hardware depends on signed firmware for everything from display to job execution to power management. The firmware blobs that NVIDIA provided in the past were trimmed down versions they created just for open-source drivers. These firmware blobs were likely missing features (we have no way to know) and never got as much internal testing as the firmware used by the NVIDIA proprietary drivers, so sometimes certain cards just don't work, and we have no idea why. We also don't fully understand how to properly do certain things like power management, so most NVIDIA cards run at minimum clock rates all the time, leading to performance that's worse than some integrated cards.
Another factor has been the lack of contributors. This isn't really a failing of the community so much as an unfortunate effect of the economics of open-source software. Unlike the Intel and AMD drivers in Mesa, nouveau has never had strong corporate backing. Apart from a couple people at Red Hat who are part-time on nouveau, most of the developers who've been involved in the past several years have been volunteers doing it in their free time. Reverse-engineering is hard work and people often find something easier or more fulfilling to do and wander off. Anyone who sticks around and demonstrates significant skill and capability within either Mesa or the kernel quickly gets hired by some company to work on something that isn't nouveau and their nouveau contributions quickly drop off.
The unfortunate reality is that, while the original nouveau drivers were written by some amazing engineers and were state-of-the-art a decade ago, they have fallen behind in the last several years. The few developer hours it gets are mostly spent on basic hardware enablement and trying to get new OpenGL versions, leaving the systemic and architectural issues unaddressed.
There are a few things which have changed recently to make the technical landscape a bit more friendly. The first change actually happened a few years ago when NVIDIA launched their Turing class of hardware. Turing GPUs use a new unified firmware blob. The GSP firmware contains everything needed for bringing up the GPU, and once the GSP firmware is loaded, it loads all the other firmwares needed by various pieces of the GPU. If we can get GSP working, this will allow us to use the same firmware as the proprietary NVIDIA drivers and should solve many of the mystery bugs. Dave Airlie gave a very good talk at LPC about this.
Second: a few months ago, NVIDIA released an open-source version of their kernel driver. While this isn't quite documentation, it does give us something to reference to see how NVIDIA drives their hardware. The code drop from NVIDIA isn't a good fit for upstream Linux, but it does give us the opportunity to rework the upstream driver situation and do it right.
Third: NVIDIA has started providing official headers for the 3D and compute hardware. While this isn't as good as real documentation either, it at least gives us names for all the registers we've been programming. Mesa developers working on nouveau have figured out a lot about how the hardware operates the hard way by reverse engineering, but the nouveau driver is still full of mystery bits shoved into mystery registers. We now finally have names for all those bits! We also have something we can search through if we're trying to figure out why some particular feature isn't working. "Oh, look! There are more depth test registers I'm not setting!"
All these things together make now an ideal time to reboot the nouveau driver stack.
As said above, NVK is a new open-source Vulkan driver for NVIDIA hardware in Mesa. It's been written almost entirely from scratch using the new official headers from NVIDIA. We occasionally reference the existing nouveau OpenGL driver but, because the official headers often use different names for things than the reverse-engineered driver, we often can't copy+paste directly. Vulkan is also different enough from OpenGL that we often have to re-think things anyway.
One of my personal goals for NVK is for it to become the new reference Vulkan driver within Mesa. All of the Vulkan drivers in Mesa can trace their lineage back to the Intel Vulkan driver (ANV) and were started by copying+pasting from it. We won't be there for a while, but my hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I'm building NVK with all the best practices we've developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.
I'm also trying to build NVK to be as modern as possible, using it as motivation for developing the common Vulkan runtime code in Mesa. Copy+paste from other Vulkan drivers is kept to a minimum. Instead, whenever I'm tempted to copy+paste from ANV or some other driver, that's an indication to me that we need more common framework code. For example, NVK has never had a line of render pass code; it implements dynamic rendering and uses the common render pass implementation for legacy render passes. When implementing robustBufferAccess a couple of weeks ago, I went straight for VK_EXT_pipeline_robustness and added some helpers to the common code to make it a bit more ergonomic.
Long-term, the hope is for NVK to be for NVIDIA hardware what RADV is to AMD hardware. However, that's a pretty high bar. RADV is a quite mature driver with a lot of features and fantastic run-time performance. There's a lot of work between where we are now and RADV-level driver quality, but it gives us a goal.
We've gotten surprisingly far with NVK, considering that it's only been in development for a few months. As of the writing of this blog post, we're passing about 98% of the Vulkan CTS with a very basic feature set. More specifically, my last CTS run had the following results:
Pass: 193734, Fail: 1064, Crash: 1286, Warn: 4, Skip: 1364208, Flake: 265
These data provide not only a sense of driver quality but also an indication of how far along we are in development. As you can see, we're only running a little over 10% of the tests at this time. An average full-featured Vulkan 1.3 driver such as ANV or RADV runs about 50% of the CTS, with the other 50% skipped because various image formats and minor features aren't supported. This means we're probably about 20-25% of the way there in terms of features.
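For the curious, the pass-rate and coverage figures above fall straight out of the raw counts; here's a quick sketch of the arithmetic (the counts are copied from the run above, the percentages are just division):

```python
# CTS result counts from the run quoted above
results = {"Pass": 193734, "Fail": 1064, "Crash": 1286,
           "Warn": 4, "Skip": 1364208, "Flake": 265}

# Tests actually executed: everything except the skips
executed = sum(v for k, v in results.items() if k != "Skip")
total = executed + results["Skip"]

pass_rate = results["Pass"] / executed   # fraction of executed tests passing
run_fraction = executed / total          # fraction of the CTS being run at all

print(f"pass rate: {pass_rate:.1%}, tests run: {run_fraction:.1%}")
```

This prints a pass rate of about 98.7% over the roughly 12.6% of the CTS that's currently being exercised, matching the "about 98%" and "a little over 10%" figures above.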
Architecturally, it's also in pretty good shape at this point. I've been focused on the large core components which may require deep changes or large refactors in order to get right. Before we get many more contributors, I want to make sure that the bones of the driver are good so that they have a solid foundation to build on. We also need the code-base to be relatively stable if more than just a couple of people will be working on it at the same time. Otherwise, there are likely to be conflicts while working on seemingly unrelated features. As of now, most of the big pieces are in good shape. The only thing that's likely to need significant architectural rework is pipeline compilation, but I'm putting that off until we get the compiler situation in better shape overall.
Currently, we're targeting Turing+. I have an RTX 2060 I've been using. I think Karol and Dave have been working on Ampere cards. With the Lovelace cards coming out, I'll likely upgrade to one of those for my development before too long.
Karol has patches for Kepler, Maxwell, and Pascal, but that support is still incomplete. It's also unclear what the kernel situation will look like going forward. Thanks to GSP, we may choose to restrict any new kernel work to Turing+, which might limit userspace support a bit. The current nouveau kernel interface isn't very good for Vulkan, so we may have a hard dependency on the new kernel going forward, limiting hardware support. It's all still very TBD.
Can you try NVK out yourself? Sure! Trying out NVK is no different from trying any other Mesa driver. Just pull the nvk/main branch from the nouveau/mesa project, build it, and give it a try. However, as much as we welcome people playing around with the driver and contributing, please don't file bug reports asking for additional hardware support or about specific apps not working. We're well aware that there are lots of missing features and bugs; the driver should be considered alpha-quality for a while yet. Once things have stabilized, helping to find app bugs would be great, but for now we're still focused on fixing CTS failures and closing the feature gap, so those kinds of bug reports aren't helpful.
If you do wish to contribute, I strongly recommend getting a Turing or newer GPU. Fortunately, the GPU shortage seems to be over and, since Turing is 4 years old now, they're pretty easy to get your hands on these days.
Why isn't NVK upstream in Mesa yet? That's a good question! Normally, I would have submitted the merge request long ago; Mesa already has plenty of alpha-quality drivers in-tree. The problem is that we really need a new kernel uAPI to support Vulkan properly, and I don't want to be stuck supporting the current nouveau uAPI for the next five years. In theory, we could upstream NVK and just leave out the bits that touch the kernel, but then we'd always be developing on top of a branch and rebasing to drop the kernel patches every time we made an MR. So far, it's been easier to just work in a branch. If we can get the kernel situation sorted out quickly enough, I'm hopeful that we can land the new kernel uAPI upstream in tandem with NVK going into upstream Mesa.
What does this mean for the current nouveau OpenGL drivers? First off, no one is going to delete them, so they'll continue working as well as they ever have. However, there are some significant issues with the current gallium drivers and, as is the story with the rest of the nouveau stack, no one has put in the time to fix them. Many of those issues aren't obvious when using nouveau to drive a desktop and a few simple applications. Once we get re-clocking sorted on Turing+ with GSP firmware and people attempt serious gaming, those bottlenecks will quickly take center stage. We will need a solution to this long-term.
One option would be to improve the nouveau gallium drivers based on what we've learned by writing NVK or even rewrite them entirely (that's not as much work as it sounds). While working on NVK, I've been intentionally splitting various pieces out into libraries that can be shared with a gallium driver like we did with the Intel drivers several years ago. This should make it easy to share things like the new and improved image layout code between drivers. As I dig more into the compiler, it will get the same treatment.
Another option being discussed is to use Zink for OpenGL going forward. It's already capable of running most Wayland compositors, XWayland, X.org with the modesetting back-end, and most of the apps anyone cares about. It will take some work yet to get full Zink support on NVK (there are still features missing) but it's likely easier than building a whole OpenGL driver. Whether or not this is the best option long-term is still undecided.
There are a lot of options and nothing is decided. As we get further with NVK and re-building various pieces of the nouveau stack, it will be more clear which options are best.
As mentioned a couple of times already, we need a new kernel uAPI and the nouveau kernel needs quite a bit of work. I won't go into detail as to what all needs to change in this blog post, but it's pretty close to throwing out the old API and starting over. Likely, we need other internal structural changes to the nouveau kernel driver as well. Currently, Dave is looking into this but can't afford to be one of the primary nouveau kernel maintainers long-term.
The other thing that needs a lot of work before we go much further is the compiler. The nouveau shader compiler, like the rest of nouveau, was probably state-of-the-art a decade ago but has fallen into disrepair. People continue enabling new hardware but larger structural changes or really hard bugfixes just don't happen. Karol added NIR support a couple years ago but it only recently got enabled by default and the whole compiler is still structured in a very gallium-centric way. Fixing that is going to require deep surgery at best. My next project, starting either late 2022 or early 2023, will be to either write a new compiler or figure out how to fix what we have.
(Yes, I know someone is going to comment asking why write a new compiler and not fix the one we have. Honestly, just fixing what's there is still on the table. However, modernizing the compiler and making it NIR-centric would require deep surgery and very broad refactoring. There's quite a bit being done by the nv50 back-end itself that really should be done by NIR and all that needs to be carefully ripped out. There are also serious problems with the register allocator where it just fails sometimes with no fallback. By the time all is said and done, there likely wouldn't be much of the original left. It may be easier to rewrite with the old one as a reference than to try and slowly refactor it without breaking anything. Generally, I'm skeptical that trying to salvage the current infrastructure is worth the effort long-term, but I need to dig into it more before I'll be sure of the best path forward.)
In parallel with those two efforts, I'm hoping that others will work on enabling features and work towards conformance. We likely won't get actual conformance until the compiler and kernel pieces are in place, but there are plenty of bugs to fix and little features to add before we'll be anywhere close to feature parity with the proprietary NVIDIA drivers.