Faith Ekstrand
October 27, 2025
Before we could switch Nouveau to use Zink+NVK as the default OpenGL implementation, we had to fix a bunch of bugs. Most of those bugs had to do with synchronization. Let's talk about it!
Earlier this year, a couple of my colleagues at Collabora were trying to prepare an NVK demo for Embedded World and ran across an issue with WebGL in Chromium. If they enabled Vulkan support in Chromium and tried to run any WebGL demo, it would flicker like mad for a few seconds and then hang. After a few more seconds, it would go back to flickering and the process would repeat.
The hang, as it turns out, was Chromium's render process getting stuck and the watchdog killing and re-starting it. Where did it get stuck? Inside of Mesa's core vkQueueSubmit() implementation, waiting for a binary semaphore to materialize. Why? Because Zink broke one of the fundamental rules of binary Vulkan semaphores: there can only be one wait operation on any given signal operation. You can't signal once and then wait multiple times. More specifically, when doing GL<->Vulkan interop, Zink would sometimes fail to signal semaphores.
Wait, why are you doing GL<->Vulkan interop for WebGL?
Good question! When you enable Vulkan support in Chromium, that doesn't necessarily mean the entire web browser is running on top of Vulkan. Some parts of Chromium switch to using ANGLE to layer OpenGL on top of Vulkan, others may use Vulkan directly, and others may continue using the system OpenGL implementation (Zink, in our case). It then uses the GL<->Vulkan interop extensions to pass images and synchronization primitives between components so everything works together.
Once I figured out that it was an issue with synchronization in Zink's GL<->Vulkan interop implementation, I gave up on trying to debug Chromium itself. It's a massive multi-threaded, multi-process application and synchronization issues are hard enough to debug without all that excess complexity. So I spent about half my trans-Atlantic flight back from Vulkanised writing a very mean Piglit test. We already had a pretty good Vulkan test for this in crucible (a tiny Mesa Vulkan test suite). It has a pair of compute shaders which do slightly different hash-like operations on a buffer. It then ping-pongs between two different VkDevices, executing the two compute shaders. If there is ever a synchronization issue and the ping-pong doesn't properly interleave between the two devices, the final hash result will be wrong. My new Piglit test was a clone of the crucible test that runs one of the compute shaders in Vulkan and the other in OpenGL.
Once I had a targeted test case, finding the Zink bug wasn't too difficult. It turns out that Zink was assuming that all synchronization primitives were one-shot and would throw away the VkSemaphore as soon as it had been signaled. This was a reasonable assumption in the world of sync files and most of the other primitives Mesa uses. However, when a VkSemaphore is imported via GL_EXT_semaphore, it is supposed to be persistent. You import the semaphore once and can then signal and wait on it as many times as you wish. With the Zink bug fixed, both WebGL and WebGPU now work great.
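To make the semaphore rules concrete, here is a toy model of a binary semaphore's payload. It is only a sketch: `SemaphoreModel` and `ping_pong` are hypothetical names, nothing here calls the Vulkan API, and it is not Zink's code. It shows that one persistent semaphore can go through many signal/wait cycles, while a second wait on the same signal has nothing to consume.

```cpp
#include <cassert>

// Toy model of a binary semaphore payload. Hypothetical, not Vulkan API.
class SemaphoreModel {
public:
    // A signal is only valid when no earlier signal is still pending.
    bool signal() {
        if (signaled_) return false;
        signaled_ = true;
        return true;
    }

    // A wait consumes the pending signal. Returning false models the case
    // where a real submit would block waiting for a signal that never comes.
    bool wait() {
        if (!signaled_) return false;
        signaled_ = false;
        return true;
    }

private:
    bool signaled_ = false;
};

// Runs `cycles` signal/wait pairs on one persistent semaphore object,
// the way an imported semaphore is meant to be reused.
bool ping_pong(SemaphoreModel &sem, int cycles) {
    for (int i = 0; i < cycles; i++) {
        if (!sem.signal() || !sem.wait()) return false;
    }
    return true;
}
```

In this model, throwing the object away after the first signal is harmless for one-shot primitives, but it breaks any caller that keeps using the same imported semaphore.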
This one didn't actually turn out to be a synchronization issue but it sure looked like one. Firefox was flickering and rendering the wrong contents. Sometimes it would even slowly corrupt over time.
Upon further investigation, it wasn't so much a flicker as parts of the UI just not getting updated when they were supposed to. Firefox, for example, was misrendering for a bit and then fixing itself. This is an error pattern I'd seen before when working on compositors and I started suspecting the EGL_EXT_buffer_age extension. For compositors and UI-heavy applications like Firefox, you can get significant performance improvements by only re-painting the parts of the screen that have actually changed. In order to do that, however, you need to know how old your back buffer is so that you know how much has changed since you rendered to that buffer. EGL_EXT_buffer_age allows the application to query the "age" of the current back buffer. What if we were reporting the wrong age?
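As a rough illustration of why the age matters, the sketch below computes the region a client has to repaint from the buffer age and a history of per-frame damage. The `Rect` type, `unite`, and `repaint_region` are hypothetical helpers. In a real client the age would come from `eglQuerySurface()` with `EGL_BUFFER_AGE_EXT`, which this example does not call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rect { int x0, y0, x1, y1; };  // hypothetical helper type

// Bounding box of two rectangles.
Rect unite(const Rect &a, const Rect &b) {
    return { std::min(a.x0, b.x0), std::min(a.y0, b.y0),
             std::max(a.x1, b.x1), std::max(a.y1, b.y1) };
}

// frame_damage[0] is the damage of the frame being drawn now,
// frame_damage[1] is the damage of the previous frame, and so on.
// An age of n means the back buffer holds the frame from n frames ago,
// so it is missing the damage of frames 0..n-1. An age of 0 means the
// contents are undefined and everything must be repainted.
Rect repaint_region(const std::vector<Rect> &frame_damage, int age,
                    const Rect &full) {
    if (age <= 0 || static_cast<std::size_t>(age) > frame_damage.size())
        return full;
    Rect r = frame_damage[0];
    for (int i = 1; i < age; i++)
        r = unite(r, frame_damage[i]);
    return r;
}
```

If the driver reports an age that is too small, the client repaints too little and stale pixels stay on screen, which is the kind of misrendering described above.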
Digging through Zink window-system code, everything seemed to be in order. Then someone on our Discord suggested, "What if you try it without damage?" I disabled the damage extension and voila! Firefox rendered correctly.
Okay, so what is damage and what did disabling it do?
Another important part of being able to do partial re-paints is the EGL_KHR_partial_update extension. This extension allows the client to specify a set of rectangles, the union of which is the area that will be updated. This is especially important for tiling GPUs as it allows them to avoid the memory-intensive tile load/store operations for parts of the back buffer that the client isn't going to render to.
Can you spot the bug?
```c++
res->damage.extent.width = u.y1 - u.y0;
res->damage.extent.height = u.x1 - u.x0;
res->damage.offset.x = u.x0;
res->damage.offset.y = u.y0;
```
Yup! We were flipping the width and height, effectively rotating the damage rectangle by 90 degrees. To make matters worse, this was in the loop that added up all the damage rectangles so we were spinning the total damage rectangle as we were adding them all up.
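For comparison, a corrected accumulation could look like the sketch below. Only the `damage.extent`, `damage.offset`, and `u.x0`/`u.y0`/`u.x1`/`u.y1` names come from the snippet above. The `Box` and `Damage` types and the `accumulate_damage` function are hypothetical stand-ins so the example compiles on its own, and this is not the actual Mesa patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Box { int x0, y0, x1, y1; };               // hypothetical stand-in
struct Offset { int32_t x, y; };
struct Extent { uint32_t width, height; };
struct Damage { Offset offset; Extent extent; };  // offset + extent rectangle

// Unions all damage boxes, then converts the result to offset + extent.
Damage accumulate_damage(const std::vector<Box> &rects) {
    Damage damage = {{0, 0}, {0, 0}};
    if (rects.empty()) return damage;

    Box u = rects[0];
    for (const Box &r : rects) {
        u.x0 = std::min(u.x0, r.x0);
        u.y0 = std::min(u.y0, r.y0);
        u.x1 = std::max(u.x1, r.x1);
        u.y1 = std::max(u.y1, r.y1);
    }

    // Width comes from the x range and height from the y range.
    damage.extent.width = static_cast<uint32_t>(u.x1 - u.x0);
    damage.extent.height = static_cast<uint32_t>(u.y1 - u.y0);
    damage.offset.x = u.x0;
    damage.offset.y = u.y0;
    return damage;
}
```

Doing the conversion once, after the union, also avoids re-deriving the extent on every loop iteration.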
In the process of debugging this, I also looked at the damage code in Iris (the Intel OpenGL driver) because Firefox rendered correctly there. When I did, I discovered that Iris's damage was broken too, but in a way that caused the damage rectangles to be too big, rather than too small or in the wrong spot. I fixed both drivers.
Another issue we saw with Firefox on Zink (there were a few of them) was that it would sometimes crash with a Wayland protocol error: "Release or Acquire point set but no buffer attached". This gets thrown by mutter (the GNOME compositor) whenever wl_surface.commit gets invoked with explicit sync time points but no buffer attached. Running the client with WAYLAND_DEBUG=client, I eventually saw this right before it crashed:
{mesa vk display queue} -> wp_linux_drm_syncobj_surface_v1#79.set_acquire_point(wp_linux_drm_syncobj_timeline_v1#
{Default Queue} -> wl_surface#68.frame(new id wl_callback#70)
{Default Queue} -> wl_surface#68.commit()
{mesa vk display queue} -> wp_linux_drm_syncobj_surface_v1#79.set_release_point(wp_linux_drm_syncobj_timeline_v1#
{mesa vk display queue} -> wl_surface#68.attach(wl_buffer#92, 0, 0)
{mesa vk display queue} -> wl_surface#68.damage_buffer(0, 0, 3408, 2066)
{mesa vk display queue} -> wl_surface#68.commit()
Mesa sets the acquire point and then something else comes in and does a quick wl_surface.frame and wl_surface.commit before Mesa could finish attaching the buffer and committing its changes.
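The race can be reproduced in a toy model of the surface's pending state. This is only a sketch: `PendingSurface` and both functions are hypothetical, it covers only the "point set but no buffer" check, and it is not libwayland or mutter code.

```cpp
#include <cassert>

// Toy model of a wl_surface's pending (double-buffered) state.
struct PendingSurface {
    bool has_buffer = false;
    bool has_acquire_point = false;
    bool has_release_point = false;

    void set_acquire_point() { has_acquire_point = true; }
    void set_release_point() { has_release_point = true; }
    void attach() { has_buffer = true; }
    void frame() {}  // a frame callback request does not touch buffer state

    // Applies the pending state. Returns false where the compositor would
    // raise the protocol error: a sync point was set but no buffer attached.
    bool commit() {
        bool ok = has_buffer || !(has_acquire_point || has_release_point);
        has_buffer = has_acquire_point = has_release_point = false;
        return ok;
    }
};

// The order from the log: the app's frame + commit lands in the middle
// of the driver's present sequence.
bool interleaved_present() {
    PendingSurface s;
    s.set_acquire_point();          // driver
    s.frame();                      // app
    if (!s.commit()) return false;  // app: point set, no buffer yet
    s.set_release_point();          // driver
    s.attach();                     // driver
    return s.commit();              // driver
}

// The same requests with the driver's sequence kept together.
bool atomic_present() {
    PendingSurface s;
    s.set_acquire_point();
    s.set_release_point();
    s.attach();
    if (!s.commit()) return false;  // driver's present, uninterrupted
    s.frame();
    return s.commit();              // app's commit sees clean pending state
}
```

Here `interleaved_present()` fails at the app's commit, while `atomic_present()` succeeds with exactly the same set of requests.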
This is actually a known issue with Wayland in general and something we've had to be really careful about in Mesa in the past. In order for things to properly interleave between a driver talking to Wayland and the rest of the app talking to Wayland, both Vulkan WSI and EGL make a guarantee: All Wayland protocol messages happen inside the eglSwapBuffers() or vkQueuePresent() call. Implementations are not allowed to queue the present and submit it later. Otherwise, this exact sort of issue might happen if the client needs to also talk to the wl_surface.
Okay, so we have a threading bug somewhere. At first I thought it must be a bug in Firefox. I was pretty sure Mesa's EGL implementation was correct here and I couldn't see any threads in the Vulkan Wayland WSI code. Digging around in the Firefox bug tracker, I found a similar issue on the NVIDIA proprietary driver but they said it was fixed. So I tried Firefox nightly. Still broken. Maybe it regressed again? Maybe they didn't totally fix it? Seemed unlikely given that NVIDIA users seemed happy.
Then I looked at the Zink window-system code again and noticed that it was doing what looked like thread synchronization. Sure enough! Zink was queuing presents to a submit thread to work around vkQueuePresent() performance issues in drivers. This is probably fine if Zink is running on Windows or with X11 but it's taboo on Wayland. The solution was to refactor Zink a bit to avoid asynchronous present on Wayland.
wl_surface objects

With the threading issue fixed, Firefox took a lot longer to crash. Instead of crashing after a few seconds, it now took about 10 or 20 minutes of web browsing before it would crash with a different Wayland protocol error: "DRM Syncobj surface object already created for surface 69"
This error gets thrown whenever you have more than one wp_linux_drm_syncobj_surface_v1 for a given wl_surface. wp_linux_drm_syncobj_surface_v1 is effectively an extension of the wl_surface itself that adds the set_acquire_point and set_release_point requests used for explicit synchronization. Since it's an extension of the wl_surface, there's no reason why it would ever need more than one of them for a given wl_surface, so the protocol disallows it.
In order to ensure we follow this rule, the Vulkan WSI code in Mesa hangs on to the wp_linux_drm_syncobj_surface_v1 as part of the VkSurface. Vulkan then requires that only one VkSurface can exist for any given wl_surface at any given time. Zink has a cache of wl_surfaces and their corresponding VkSurface to ensure it never double-creates a VkSurface.
But somehow Zink was double-creating one anyway. Why? The answer came down to proxy-wrapper objects...
Proxy-wrapper objects are a concept in Wayland that allow a Wayland client to have multiple queues and control which queues events get returned on. New objects created through the proxy wrapper will be assigned to the proxy wrapper's queue, rather than the parent object's queue. Mesa uses these, for instance, to get wl_surface.frame callback events on its own queue rather than the client's queue. Whenever you create a surface in Mesa (whether an EGLSurface or a VkSurface), the first thing we do is create a proxy-wrapper object for the wl_surface and attach it to our own queue. This way we know that we can implement eglSwapBuffers() or vkQueuePresent() safely, regardless of what the client may be doing on the Wayland protocol from some other thread.
But Zink doesn't use Mesa's normal Wayland EGL implementation. Instead, whenever it can, Zink uses something called Kopper. Kopper, unlike most EGL back-ends in Mesa, is a sort of back-door that implements EGL on top of Vulkan WSI, much like Zink implements OpenGL on top of Vulkan. This has the advantage that most apps running on top of Zink look like regular, everyday Vulkan apps from the perspective of the Vulkan driver. They're not doing piles of explicit buffer import/export. They just create a VkSurface, make a VkSwapchain, and present like any other Vulkan application. The downside to this approach is that Kopper attempts to circumvent most of Mesa's EGL implementation in places and this leads to interesting corner cases.
In this particular case, we were getting burned by Wayland proxy-wrapper objects. I already mentioned that Zink/Kopper have a cache to avoid creating duplicate VkSurfaces for a given wl_surface and I described how Wayland proxy-wrapper objects work. The problem was that Kopper was caching the VkSurface based on the Mesa-created proxy-wrapper wl_surface object, not the original. This meant that any time Mesa's EGL implementation decided to re-initialize and create a new wl_surface proxy-wrapper, Zink would create a duplicate VkSurface and we risked that protocol error. The fix was to just pass the original wl_surface into Kopper.
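The effect of the cache key can be shown with a small model. Everything here is a hypothetical stand-in (`WlSurface`, `ProxyWrapper`, `surfaces_created`). It does not use libwayland or Kopper's real data structures, and it keeps every wrapper alive so that their addresses are guaranteed to differ.

```cpp
#include <cassert>
#include <list>
#include <set>

struct WlSurface { int id; };                        // stand-in for wl_surface
struct ProxyWrapper { const WlSurface *original; };  // stand-in for a wrapper

// Counts how many cache entries (and so how many VkSurfaces) get created
// when EGL re-creates its proxy wrapper `reinits` times for one wl_surface.
int surfaces_created(bool key_on_original, int reinits) {
    WlSurface surface{69};
    std::set<const void *> cache;      // one entry per created VkSurface
    std::list<ProxyWrapper> wrappers;  // std::list keeps addresses stable

    for (int i = 0; i <= reinits; i++) {
        wrappers.push_back(ProxyWrapper{&surface});
        const ProxyWrapper &w = wrappers.back();
        const void *key = key_on_original
            ? static_cast<const void *>(w.original)  // same key every time
            : static_cast<const void *>(&w);         // new key per wrapper
        cache.insert(key);
    }
    return static_cast<int>(cache.size());
}
```

Keyed on the wrapper, each re-initialization produces a cache miss and a new surface. Keyed on the original object, the count stays at one.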
The final Zink issue I want to raise is one which we only observed when Zink was running inside the X server. This is the case where Zink is the OpenGL driver used by X11's Glamor renderer. We were seeing significant flickering of X11 apps with this configuration. Chromium was one of the worst offenders, with UI elements flickering in and out of existence constantly.
This is a really difficult case for Zink. When running inside the X server or as an X11 compositor, we often have no DRM format modifiers and, more importantly, we have no explicit synchronization. The GL driver X11 uses is expected to just implicitly synchronize with all the apps on your desktop. (This is also the case with some Wayland compositors but some are starting to support the zwp_linux_explicit_synchronization_v1 protocol these days.) Because Vulkan uses entirely explicit synchronization, it's up to Zink to somehow implicitly synchronize with clients using Vulkan's explicit synchronization primitives.
Using just what's available in the Vulkan API, this would be impossible. However, a few years ago, we added sync file import/export ioctls to dma-buf which allow you to query a dma-buf for the current set of write or read fences in the form of a sync file as well as set the write fence from a sync file. This was originally designed to help Vulkan drivers better synchronize with X11 and older Wayland compositors, which is kind of the opposite problem. However, when combined with Vulkan's support for the import and export of sync files, it gives us all the pieces Zink needs in order to implement implicit synchronization. The trick, then, is to insert synchronization in all the right places while also avoiding over-synchronizing everything.
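As a minimal sketch of the kernel side, the two ioctls can be wrapped as below. The wrapper names `export_sync_file` and `import_sync_file` are hypothetical, and the fallback definitions are an assumption that only matters when building against kernel headers that predate these ioctls. This is not Zink's code. On the Vulkan side, the exported file descriptor would then typically be imported into a semaphore through the sync-fd external semaphore handle type, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

// Fallback definitions for older kernel headers (assumed to match the UAPI).
#ifndef DMA_BUF_IOCTL_EXPORT_SYNC_FILE
struct dma_buf_export_sync_file { __u32 flags; __s32 fd; };
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
    _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file)
#endif
#ifndef DMA_BUF_IOCTL_IMPORT_SYNC_FILE
struct dma_buf_import_sync_file { __u32 flags; __s32 fd; };
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
    _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
#endif

// Snapshots the dma-buf's current fences as a sync file.
// `flags` is DMA_BUF_SYNC_READ and/or DMA_BUF_SYNC_WRITE, describing the
// access the caller is about to perform. Returns a new fd, or -1 on failure.
int export_sync_file(int dmabuf_fd, uint32_t flags) {
    dma_buf_export_sync_file args = {};
    args.flags = flags;
    args.fd = -1;
    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) != 0)
        return -1;
    return args.fd;
}

// Attaches the fence in `sync_file_fd` to the dma-buf so that implicitly
// synchronized users will wait on it. Returns true on success.
bool import_sync_file(int dmabuf_fd, int sync_file_fd, uint32_t flags) {
    dma_buf_import_sync_file args = {};
    args.flags = flags;
    args.fd = sync_file_fd;
    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args) == 0;
}
```

A caller would export before touching the buffer and wait on the result, then import its own completion fence after submitting work. The caller owns the returned fd and must close it.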
And Zink attempts to do this. Its barrier code is able to detect dma-bufs and do the queue family transition from VK_QUEUE_FAMILY_EXTERNAL to the 3D queue and insert the necessary synchronization. It would also transition the dma-buf image back to VK_QUEUE_FAMILY_EXTERNAL and synchronize at the end of the batch so any other clients trying to use the dma-buf wouldn't accidentally race with it. All of this worked great... Once. If the dma-buf image was ever used a second time, however, Zink would fail to detect the needed queue family ownership transfers and start treating the dma-buf image like any other totally unsynchronized image.
We never noticed this because Wayland compositors always re-import the image into the texture via glEGLImageTargetTexture2DOES() every frame. This meant that Zink was always working with a fresh image and, as long as the queue family ownership transfer and related synchronization happened for the first composite, it was close enough. In the world of X11, however, images are typically imported once and then referenced over and over again. This meant that after the first frame, X11 clients were effectively running entirely unsynchronized against Glamor and the X11 compositor, if any.
Once I managed to track down the bug, fixing it was actually fairly simple. There were a couple of places where we missed flagging an image as possibly needing a barrier. The fix was less than a dozen lines of code and now X11 clients are running flicker-free (as much as any X11 client is flicker-free).
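The shape of the bug can be captured in a small state machine. This is a toy model with hypothetical names (`DmabufImage`, `needs_acquire`, `unsynchronized_uses`). It does not reflect Zink's actual barrier tracking or the real fix, only the "worked once" behavior described above.

```cpp
#include <cassert>

enum class Owner { External, Graphics };

struct DmabufImage {
    Owner owner = Owner::External;
    bool needs_acquire = true;  // flagged when the image is imported
};

// Uses one imported image in `uses` consecutive batches and counts how
// many batches touched it without first acquiring it from EXTERNAL.
int unsynchronized_uses(int uses, bool reflag_on_release) {
    DmabufImage img;
    int unsynced = 0;
    for (int i = 0; i < uses; i++) {
        // Start of batch: acquire from EXTERNAL and wait on the dma-buf's
        // fences, but only if the image is flagged as needing it.
        if (img.needs_acquire) {
            img.owner = Owner::Graphics;
            img.needs_acquire = false;
        }
        if (img.owner != Owner::Graphics)
            unsynced++;

        // End of batch: release back to EXTERNAL for other clients.
        img.owner = Owner::External;
        if (reflag_on_release)
            img.needs_acquire = true;  // without this, only use #1 is synced
    }
    return unsynced;
}
```

With a fresh import every frame, `uses` is always 1 and the missing re-flag never shows, which matches why the problem stayed hidden on Wayland compositors.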
If there's one thing to take away from this story it's that synchronization is hard and that Firefox and Chromium together make a pretty good EGL test suite, finding all sorts of corner cases. With both of them working now, though, I feel pretty confident that Zink's EGL is probably in decent shape.