June 09, 2022
One of the important lessons the graphics industry has learned over the last decade or so is the need for explicit synchronization between different pieces of asynchronous accelerator work. The Vulkan graphics and compute API learned this lesson and uses explicit synchronization for everything. On Linux, unfortunately, we still need implicit synchronization to talk to various window systems, which has caused pain for Linux Vulkan drivers for years.
With older graphics APIs like OpenGL, the client makes a series of API calls, each of which either mutates some bit of state or performs a draw operation. There are a number of techniques that have been used over the years to parallelize the rendering work, but the implementation has to ensure that everything appears to happen in order from the client's perspective. While this served us well for years, it's become harder and harder to keep the GPU fully occupied. Games and other 3D applications have gotten more complex and need multiple CPU cores in order to have enough processing power to reliably render their entire scene in less than 16 milliseconds and achieve a smooth 60 frames per second. GPUs have also gotten larger with more parallelism, and there's only so much a driver can do behind the client's back to parallelize things.
To improve both GPU and CPU utilization, modern APIs like Vulkan take a different approach. Most Vulkan objects such as images are immutable: while the underlying image contents may change, the fundamental properties of the image such as its dimensions, color format, and number of miplevels do not. This is different from OpenGL where the application can change any property of anything at any time. To draw, the client records sequences of rendering commands in command buffers which are submitted to the GPU as a separate step. The command buffers themselves are still stateful, and the recorded commands have the same in-order guarantees as OpenGL. However, the state and ordering guarantees only apply within the command buffer, making it safe to record multiple command buffers simultaneously from different threads. The client only needs to synchronize between threads at the last moment when they submit those command buffers to the GPU. Vulkan also allows the driver to expose multiple hardware work queues of different types which all run in parallel. Getting the most out of a large desktop GPU often requires having 3D rendering, compute, and image/buffer copy (DMA) work happening all at the same time and in parallel with the CPU prep work for the next batch of GPU work.
Enabling all this additional CPU and GPU parallelism comes at a cost: synchronization. One piece of GPU work may depend on other pieces of GPU work, possibly on a different queue. For instance, you may upload a texture on a copy queue and then use that texture on a 3D queue. Because command buffers can be built in parallel and the driver has no idea what the client is actually trying to do, the client has to explicitly provide that dependency information to the driver. In Vulkan, this is done through
VkSemaphore objects. If command buffers are the nodes in the dependency graph of work to be done, semaphores are the edges. When a command buffer is submitted to a queue, the client provides two sets of semaphores: a set to wait on before executing the command buffer and a set to signal when the command buffer completes. In our texture upload example, the client would tell the driver to signal a semaphore when the texture upload operation completes and then have it wait on that same semaphore before doing the 3D rendering which uses the texture. This allows the client to take advantage of as much parallelism as it can manage while still having things happen in the correct order as needed.
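The nodes-and-edges picture can be made concrete with a small toy model. The sketch below is plain C, not Vulkan: `toy_job`, `toy_run`, and the semaphore indices are hypothetical names invented for illustration, and real binary semaphores have reset semantics that are ignored here. It only shows that a job submitted first still runs second when it waits on a semaphore that another job signals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_SEMS 4

/* A toy "command buffer": a node in the dependency graph. */
struct toy_job {
    const char *name;
    int wait[TOY_MAX_SEMS];   /* semaphore indices to wait on, -1 terminated */
    int signal[TOY_MAX_SEMS]; /* semaphore indices to signal, -1 terminated */
    bool done;
};

/* True if every semaphore the job waits on has been signaled. */
static bool toy_job_ready(const struct toy_job *job, const bool *sems)
{
    for (int i = 0; i < TOY_MAX_SEMS && job->wait[i] >= 0; i++) {
        if (!sems[job->wait[i]])
            return false;
    }
    return true;
}

/* Run jobs until none can make progress, recording the execution order.
 * Returns the number of jobs executed. */
static size_t toy_run(struct toy_job *jobs, size_t count, bool *sems,
                      const char **order)
{
    size_t executed = 0;
    bool progress = true;

    while (progress) {
        progress = false;
        for (size_t j = 0; j < count; j++) {
            if (jobs[j].done || !toy_job_ready(&jobs[j], sems))
                continue;
            /* "Execute" the job, then signal its completion semaphores. */
            for (int i = 0; i < TOY_MAX_SEMS && jobs[j].signal[i] >= 0; i++)
                sems[jobs[j].signal[i]] = true;
            jobs[j].done = true;
            order[executed++] = jobs[j].name;
            progress = true;
        }
    }
    return executed;
}

/* The texture upload example: "render" is submitted first but waits on
 * semaphore 0, which only "upload" signals. */
static bool toy_demo_upload_runs_first(void)
{
    bool sems[1] = { false };
    struct toy_job jobs[2] = {
        { "render", { 0, -1 }, { -1 },    false },
        { "upload", { -1 },    { 0, -1 }, false },
    };
    const char *order[2];
    size_t n = toy_run(jobs, 2, sems, order);

    return n == 2 && strcmp(order[0], "upload") == 0 &&
           strcmp(order[1], "render") == 0;
}
```

Nothing in the model orders the two jobs except the semaphore. Remove the wait and they are free to run in either order, which is exactly the parallelism the client is trying to keep.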
Everything we just discussed is in the context of a single client trying to get as much out of the GPU as it can. But what if we have multiple clients involved? While this isn't something most game engine developers want to think about, it's critical when you look at a desktop system as a whole. In the common case, you don't just have a single game rendering and displaying on the screen; you have multiple clients all rendering their own window and a compositor putting everything together into the final image you see on-screen. If you're watching a video, you may also have a video decoder which is feeding into your web browser, adding another layer of complexity.
All of these pieces working together to create the final image you see on your screen poses many of the same problems as multi-queue rendering. Instead of having multiple queues being targeted by a single client, each client has its own queues and we need to synchronize between them. In particular, we need to make sure that the composition happens after each of the clients has completed its rendering or else we risk getting stale or incomplete data on the screen.
The way this typically works with OpenGL on Linux is that the client will draw to the back buffer (framebuffer 0) and then call
eglSwapBuffers() (or glXSwapBuffers() if using X11 and GLX). Inside the
eglSwapBuffers() call, the driver ensures that all the rendering work has been submitted to the kernel driver and then hands the back buffer to the compositor (either the X server or a Wayland compositor) to be composited in the next frame. The compositor then submits its rendering commands to composite the latest frames from all the apps. Who ensures that the compositor's rendering work happens only after all the clients have completed rendering to their respective back buffers? The kernel does, implicitly. For each shared buffer, it tracks all the GPU work which has been submitted globally, across the entire system, which may touch that buffer and ensures it happens in order. While this auto-magic tracking sounds nice, it has the same over-synchronization downsides as the rest of OpenGL that we discussed above.
The way this is supposed to work with Vulkan is via explicit synchronization. The client first acquires an image to render to via
vkAcquireNextImageKHR() which takes an optional semaphore and fence to be signaled once the acquired image is actually ready for rendering. The client is expected to block its rendering on that semaphore or fence. Then, once the client has submitted its rendering, it calls
vkQueuePresentKHR() and passes it a set of semaphores to wait on before reading the image. Exactly how those fences and semaphores get shared between the compositor and client and exactly what they do is left as an implementation detail. The mental model, however, is that the semaphore and fence in
vkAcquireNextImageKHR() are the signaled semaphore and fence from the compositor's last GPU job which read that image and the semaphores passed to
vkQueuePresentKHR() are the ones the compositor waits on before compositing.
The description above is how Vulkan is "supposed to work" because, as nice as that mental model is, it's all a lie. The fundamental problem is that, even if the app is using Vulkan, the compositors are typically written in OpenGL and the window-system protocols (X11 and Wayland) are written assuming implicit synchronization. In Wayland, once the
wl_surface.commit request is sent across the wire, the compositor is free to assume the surface is ready and begin rendering, trusting in implicit synchronization to make it all work. There has been some work to allow passing sync files along with
wl_buffer.release events but it's still incomplete and not broadly supported. The X11 PRESENT extension has a mechanism for creating a synchronization primitive which is shared between the X server and client. However, that primitive is always
xshmfence which only synchronizes between the two userspace processes on the CPU; implicit synchronization is required to ensure the GPU work happens in order. In the end, then, in spite of all the nice explicit semaphores we have in the Vulkan window-system APIs, we have to somehow turn that into implicit synchronization because we live in an implicitly synchronized world.
As a quick side note, none of the above is a problem on Android. The Android APIs are designed to use explicit synchronization from the ground up. All the
SurfaceFlinger APIs pass sync files between the client and compositor to do the synchronization. This maps fairly well to the Vulkan APIs. It's only generic Linux where we have a real problem here.
If implicit synchronization is auto-magic and handled by the kernel, doesn't Vulkan get it for free? Why is this a problem? Good questions! Yes, Vulkan drivers are running on top of the same kernel drivers as OpenGL but they typically shut off implicit synchronization and use the explicit primitives. There are a few different reasons for this, all of which come down to trying to avoid over-synchronization:
With multiple queues the client can submit to, if implicit synchronization were enabled, the client might end up synchronizing with itself more than needed. We don't know what the client is trying to do and it's better to only do the synchronization it explicitly asks for so we can get maximum parallelism and keep that beast full.
Vulkan doesn't know when a piece of memory is being written as opposed to read, so we would always have to assume the worst case. The kernel implicit synchronization stuff is smart enough to allow multiple simultaneous reads but only one client job writing at a time. If everything looks like a write, everything which touches a given memory object would get serialized.
Vulkan lets the client sub-allocate images out of larger memory objects. Because the kernel's implicit synchronization is at the memory object granularity, every job which touches the same memory object would get synchronized, even if two jobs are accessing completely independent images within it.
If you're using bindless (
UPDATE_AFTER_BIND_BIT) or buffer device address, the Vulkan driver doesn't even know which memory objects are being used by any given command buffer. It has to assume any memory object which exists may be used by any command buffer. If we left implicit synchronization enabled, this would mean everything would synchronize on everything else.
Each of those can be pretty bad by itself but when you put them together the result is that, in practice, using implicit synchronization in Vulkan would completely serialize all work and kill your multi-queue parallelism. So we shut it off if the kernel driver allows it.
If we're turning off implicit synchronization, how do we synchronize with the window system? That's the real question, isn't it? There are a number of different strategies for this which have been employed by various drivers over the years and they all come down to some form of selective enabling of implicit synchronization. Also, they're all terrible and lead to over-synchronization somewhere.
The RADV driver currently tracks when each window-system buffer is acquired by the client and only enables implicit synchronization for window-system buffers and only when owned by the client. Thanks to details of the amdgpu kernel driver, enabling implicit synchronization doesn't actually cause the client to synchronize with itself when doing work on multiple queues. However, because of our inability to accurately track when a buffer is in use, this strategy leads to over-synchronization if the client acquires multiple images from the swapchain and is working on them simultaneously. There's no way for us to separate which work is targeting which image and only make the
vkQueuePresentKHR() wait on the work for the one image.
In ANV (the Intel driver), we get hints from the window-system code and flag the window-system buffer as written by the dummy submit done as part of
vkQueuePresentKHR() and consider it to be read by everything that waits on the semaphore from
vkAcquireNextImageKHR(). This strategy works well for GPU <-> GPU synchronization but we run into problems when implementing
vkWaitForFences() for the fence from
vkAcquireNextImageKHR(). That has to be done via
DRM_IOCTL_I915_GEM_WAIT which can't tell the difference between the compositor's work and work which has since been submitted by the client. If you call
vkWaitForFences() on such a fence after submitting any client work, it basically ends up being a
vkDeviceWaitIdle() which isn't at all what you want.
If you didn't follow all that, don't worry too much. It's all very complicated and detailed and annoying. The important thing to understand is that there is no one strategy for dealing with this; every driver has its own. Also, all the strategies we've employed to date can cause massive over-synchronization somewhere. We need a better plan.
So, how do we do this better? We've tried and failed so many times. Is there a better way? Yes, I believe there is.
Before getting into the details of marrying implicit and explicit synchronization, we need to understand how implicit synchronization works in the kernel. Each graphics memory allocation in the kernel is represented by a dma-buf object. (This corresponds to a
VkDeviceMemory object in Vulkan or a single buffer or image in OpenGL.) Each dma-buf object in the kernel has a dma reservation object attached to it which is a container of dma fences. A dma fence is a lightweight object which represents an event that is guaranteed to happen at some point, possibly in the future. Whenever some GPU job is enqueued in kernel space, a dma fence is created which signals when that GPU job is complete and that fence is added to the reservation object on any buffers used by that job. Each dma fence in the reservation object has a usage flag saying what kind of fence it is. When a job is created, it captures some subset of the fences associated with the buffers used by the job and the kernel waits on those before executing the job. Depending on the job and its relation to the buffer in question, it may wait on some or all of the fences. For instance, if doing a read with implicit synchronization, the job must wait on any fences from previously enqueued jobs which write the buffer.
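As a rough illustration of the rule described above, here is a toy model of a reservation object in plain C. It is not kernel code: `toy_resv`, `toy_fence`, and the two-value usage enum are hypothetical simplifications (the real kernel structures are more involved). It captures only the rule that a new reader waits on earlier writers, while a new writer waits on every unsignaled fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FENCES 8

enum toy_usage { TOY_USAGE_WRITE, TOY_USAGE_READ };

/* Toy fence: signals when the job with this id completes. */
struct toy_fence {
    int job_id;
    enum toy_usage usage;
    bool signaled;
};

/* Toy reservation object: a container of fences attached to a buffer. */
struct toy_resv {
    struct toy_fence fences[TOY_MAX_FENCES];
    size_t count;
};

/* Attach the completion fence of a newly enqueued job to the buffer. */
static bool toy_resv_add(struct toy_resv *resv, int job_id,
                         enum toy_usage usage)
{
    if (resv->count == TOY_MAX_FENCES)
        return false;
    resv->fences[resv->count++] = (struct toy_fence){ job_id, usage, false };
    return true;
}

/* Collect the job ids a new job must wait on before touching the buffer.
 * A reader skips other readers; a writer waits on everything pending. */
static size_t toy_resv_deps(const struct toy_resv *resv,
                            enum toy_usage new_usage, int *deps)
{
    size_t n = 0;

    for (size_t i = 0; i < resv->count; i++) {
        const struct toy_fence *f = &resv->fences[i];

        if (f->signaled)
            continue;
        if (new_usage == TOY_USAGE_READ && f->usage != TOY_USAGE_WRITE)
            continue;
        deps[n++] = f->job_id;
    }
    return n;
}

/* Job 1 writes the buffer, job 2 reads it. A third job then asks what it
 * has to wait on, first as a reader and then as a writer. */
static bool toy_resv_demo(void)
{
    struct toy_resv resv = { .count = 0 };
    int deps[TOY_MAX_FENCES];

    toy_resv_add(&resv, 1, TOY_USAGE_WRITE);
    toy_resv_add(&resv, 2, TOY_USAGE_READ);

    size_t r = toy_resv_deps(&resv, TOY_USAGE_READ, deps);
    bool read_ok = (r == 1 && deps[0] == 1);

    size_t w = toy_resv_deps(&resv, TOY_USAGE_WRITE, deps);
    bool write_ok = (w == 2 && deps[0] == 1 && deps[1] == 2);

    return read_ok && write_ok;
}
```

This is also why treating every Vulkan access as a write is so costly: in the model, every job would then depend on every earlier job that touched the same memory object.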
So how do we tie implicit and explicit sync together? Let userspace extract and set fences itself, of course! The new API, which should be in Linux 5.20, adds two new ioctls on dma-buf file descriptors which allow userspace to extract and insert fences directly. In userspace, these dma fences are represented by sync files. A sync file wraps a dma fence which turns it into a file descriptor that can be passed around by userspace and waited on via
poll(). The first ioctl extracts all of the fences from a dma-buf's reservation object and returns them to userspace as a single sync file. It takes a flags parameter which lets you specify whether you expect to read the data in the dma-buf, write it, or both. If you specify read-only, the returned sync file will only contain write fences but if you specify write or read-write, the returned sync file will wait on all implicit sync fences currently in the reservation object. The second ioctl allows userspace to add a sync file to the reservation object. It also takes read/write flags to allow you to control whether the newly added fence is considered a write fence or only a read fence.
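A minimal sketch of how userspace might call the two ioctls follows. The struct layout, flag values, and ioctl numbers are local copies of what I understand the upstream `linux/dma-buf.h` uapi header to define, renamed with an `EXAMPLE_` prefix so they cannot clash with it; treat them as assumptions and check them against your own kernel headers. The three helper functions are hypothetical wrappers, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Assumed to mirror the uapi definitions; verify against linux/dma-buf.h. */
struct example_dma_buf_sync_file {
    uint32_t flags;
    int32_t fd;
};

#define EXAMPLE_DMA_BUF_SYNC_READ  (1u << 0)
#define EXAMPLE_DMA_BUF_SYNC_WRITE (2u << 0)
#define EXAMPLE_DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
    _IOWR('b', 2, struct example_dma_buf_sync_file)
#define EXAMPLE_DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
    _IOW('b', 3, struct example_dma_buf_sync_file)

/* Export the fences we would have to wait on for the given access type.
 * Returns a new sync file fd, or -1 with errno set. */
static int example_export_sync_file(int dmabuf_fd, uint32_t flags)
{
    struct example_dma_buf_sync_file arg = { .flags = flags, .fd = -1 };
    int ret;

    do {
        ret = ioctl(dmabuf_fd, EXAMPLE_DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg);
    } while (ret == -1 && errno == EINTR);

    return ret == 0 ? arg.fd : -1;
}

/* Add the fence wrapped by sync_file_fd to the dma-buf, flagged as a read
 * or write fence. Returns 0 on success, or -1 with errno set. */
static int example_import_sync_file(int dmabuf_fd, int sync_file_fd,
                                    uint32_t flags)
{
    struct example_dma_buf_sync_file arg = {
        .flags = flags,
        .fd = sync_file_fd,
    };
    int ret;

    do {
        ret = ioctl(dmabuf_fd, EXAMPLE_DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
    } while (ret == -1 && errno == EINTR);

    return ret;
}

/* Wait on a sync file with poll(). Returns 1 if it signaled, 0 on timeout,
 * and -1 on error. */
static int example_sync_file_wait(int sync_file_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sync_file_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

On a kernel without these ioctls the export call fails and returns -1, which is the sort of check a driver could use at initialization time to decide whether to fall back to one of the older strategies.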
These new ioctls have unfortunately been quite a long time in coming. I typed the initial patches around a year ago and they got quickly nacked by Christian König at AMD who saw two big problems. First was that the sync file export patch was going to cause serious over-synchronization if it was ever used on the amdgpu kernel driver because of some of the clever tricks they play with dma fences to avoid over-synchronization internally. Second, thanks to the design of reservation objects at the time, the sync file import patch made both him and Daniel Vetter nervous because of the way it let userspace add arbitrary dma fences that might interact with low-level kernel operations such as memory eviction and swapping. Neither Daniel nor Christian was opposed to the API in principle, but it had to wait until we had solutions to those problems. Over the course of the past year, Christian has been working steadily on refactoring the amdgpu driver and reworking the design of reservation objects away from the old read/write lock design towards a new "bag of fences" design which allows a lot more flexibility. Now that his work has landed, it's safe to go ahead with the new fence import/export API and it should be landing in time for Linux 5.20.
With this new API, we can finally move to a new implicit synchronization strategy in the Mesa Vulkan window-system code which should work correctly for everyone with no additional over-synchronization. In
vkAcquireNextImageKHR(), we can export the fences from the dma-buf that backs the window-system image as a sync file and then import that sync file into the semaphore and fence provided by the client. Because the export takes a snapshot of the dma fences, any calls to
vkWaitForFences() on the acquire fence won't have the GPU-stalling effect the ANV solution has today. In
vkQueuePresentKHR(), instead of playing all the object ownership and memory object signaling tricks we play today, we can take the wait semaphores passed in from the client or produced by the present blit, turn them into a sync file, and then import that sync file into the dma-buf that backs the window system image before handing it off to the compositor. As far as the compositor is concerned, we look just like an OpenGL driver using implicit synchronization and, from the perspective of the Vulkan driver, it all looks like explicit synchronization. Everyone wins!
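The value of the snapshot can be seen in a toy model. The sketch below is plain C with hypothetical names (`toy_buffer`, `toy_snapshot`, and so on), and it assumes jobs on a buffer complete in submission order, which real hardware with multiple queues does not guarantee. It contrasts a "wait until the buffer is idle" style of wait with a wait on a snapshot taken at acquire time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buffer: jobs touching it are numbered in submission order and are
 * assumed to complete in that same order. */
struct toy_buffer {
    unsigned submitted; /* number of jobs submitted so far */
    unsigned completed; /* number of jobs completed so far */
};

static unsigned toy_submit(struct toy_buffer *buf)
{
    return ++buf->submitted;
}

static void toy_complete_one(struct toy_buffer *buf)
{
    if (buf->completed < buf->submitted)
        buf->completed++;
}

/* "Live" wait: satisfied only once nothing at all is pending on the
 * buffer, including work submitted after the wait was set up. */
static bool toy_live_wait_done(const struct toy_buffer *buf)
{
    return buf->completed == buf->submitted;
}

/* Snapshot wait: remembers what was pending at acquire time and ignores
 * anything submitted afterwards. */
static unsigned toy_snapshot(const struct toy_buffer *buf)
{
    return buf->submitted;
}

static bool toy_snapshot_wait_done(const struct toy_buffer *buf,
                                   unsigned snapshot)
{
    return buf->completed >= snapshot;
}

static bool toy_acquire_demo(void)
{
    struct toy_buffer buf = { 0, 0 };

    toy_submit(&buf);                       /* compositor reads the image */
    unsigned snapshot = toy_snapshot(&buf); /* client acquires the image */
    toy_submit(&buf);                       /* client submits new rendering */
    toy_complete_one(&buf);                 /* compositor's job finishes */

    /* The snapshot wait is done; the live wait still sees client work. */
    return toy_snapshot_wait_done(&buf, snapshot) &&
           !toy_live_wait_done(&buf);
}
```

In the model, the live wait plays the role of the buffer-idle wait that makes the acquire fence behave like a device-wide stall once the client has submitted its own work, while the snapshot wait returns as soon as the compositor's job is done.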
Of course, all those old strategies will have to hang around in the tree for several years while we wait for the new ioctls to be reliably available everywhere. In another 3-5 years or so, we can delete support for all the legacy implicit synchronization mechanisms and we'll finally be living in explicit synchronization nirvana.
Before we wrap up, it's worth addressing one more question. A lot of people have asked me over the last couple years why we don't just plumb explicit synchronization support through Wayland and call it a day. That's how things work on Android, and it worked out okay.
The fundamental problem is that Linux is heterogeneous by nature. People mix and match different components and versions of those components all the time. Even in the best case, there are version differences. Ubuntu and Fedora come out at roughly the same time every 6 months but they still don't ship the same versions of every package. There are also LTS versions which update some packages but not others, spins which make different choices from the main distro, etc. The end result is that we can't just rewire everything and drop in a new solution atomically. Whatever we do has to be something that can be rolled out one component at a time.
This solution allows us to roll out better explicit synchronization support to users seamlessly. Vulkan drivers seamlessly work with compositors which only understand implicit synchronization and, if Wayland compositors pick up sufficient explicit synchronization support, we can transition to that once the compositors are ready. We could have driven this from the Wayland side first and rolled out explicit synchronization support to a bunch of Wayland compositors and said you need a new Wayland compositor if you want to get the fastest possible Vulkan experience. However, that would have been a lot more work. It would have involved a bunch of protocol work, adding sync file support to KMS, and touching every Wayland compositor we collectively care about. It would also have been much harder to get 100% transitioned to explicit synchronization because you can only use explicit synchronization without stalling if every component in the entire display path supports it. Likely, had we taken that path, some configurations would be stuck with the old hacky solutions forever and we would never be able to delete that code from Mesa.
There are two other advantages of the kernel ioctl over relying on Wayland protocol. First is that we can check for support on driver initialization. Because of the way Vulkan is structured, we know nothing about the window system when the driver first starts up. We do, however, know about the kernel. If we ever want to have driver features or other behavior depend on "real" explicit synchronization, we can check for these new ioctls early in the driver initialization process and adjust accordingly instead of having to wait until the client connects us to the window system, possibly after they've already done some rendering work.
Second, these new ioctls allow people to write Wayland compositors in Vulkan! We've had the dma-buf import/export APIs in Vulkan for a while but synchronization was left for later. Now that we have these ioctls, a Wayland compositor written in Vulkan can do the same thing as I described above with
vkQueuePresentKHR() only in reverse. When they get the composite request from the client, they can export the fences from the client's buffer to a sync file and use that as a wait semaphore for their Vulkan composite job. Once they submit the composite job, the completion semaphore for the composite job can then be exported as a sync file and re-imported into each of the clients' buffers. For a Vulkan client, this will be equivalent to if they had just passed
VkSemaphore objects back and forth. For an OpenGL client, this will appear the same as if the compositor were running OpenGL with implicit synchronization.
There you have it! After fighting with the divide between implicit and explicit synchronization with Vulkan on Linux for over seven years, we may finally have some closure. The work's not all done, however. A few of us in the Linux graphics space have a lot of ideas on where we'd like to see synchronization go in the future. We're not to synchronization nirvana quite yet, but this is an important step along the way.