Deep dive into OpenGL over DirectX layering

Louis-Francis Ratté-Boulianne
July 09, 2020

Earlier this year, we announced a new project in partnership with Microsoft: the implementation of OpenCL and OpenGL to DirectX 12 translation layers (Git repository). Time for a summer update! In this blog post, I will explain a little more about the OpenGL part of this work and more specifically the steps that have been taken to improve the performance of the OpenGL-On-D3D12 driver.

General Architecture

In the initial steps of this project, we quickly realized that the best way forward was to build on top of Mesa. Zink, a project started by Erik Faye-Lund, has already proven that we could achieve a similar goal: translating OpenGL to a lower-level graphics API (s/Vulkan/DirectX12/). People familiar with that project will therefore experience a strong déjà-vu feeling when looking at the architecture of our current effort:

The Mesa state tracker is responsible for translating OpenGL state (blend modes, texture state, etc.) and drawing commands (like glDrawArrays and glDrawPixels) into objects and operations that map well to modern GPU hardware features (Gallium API). The "D3D12 Driver" is thus an implementation of that interface.

On the shader side, the state tracker is able to convert OpenGL fixed-functions, traditionally implemented directly by the hardware, into shaders. Mesa will also translate GLSL shaders into an intermediate representation named NIR. We use that representation to produce the DXIL bytecode consumed by DirectX. I'm not gonna focus on the NIR-to-DXIL compiler here as it definitely deserves its own blog post.

Finally, a different component of Mesa, the WGL State tracker, is handling WGL calls (API between OpenGL and the windowing system interface of Windows). Internally, an existing implementation of the windowing system was using GDI (Graphics Device Interface) to actually display the rendered frames on the screen. We added a new implementation using DXGI Swapchains. More on that later.

DirectX 12 - 101

In order to better understand the next sections, let's dive a little more into the details of DirectX 12.

DirectX 12 requires that we record commands (e.g. clearing the render target, draw calls, etc.) into an ID3D12GraphicsCommandList and then submit that list to a command queue with ID3D12CommandQueue::ExecuteCommandLists() to actually process the commands. But before we can record drawing commands, we first need to set some state on the command list, including (not an exhaustive list):

Viewport, scissor clipping, blend factor, topology (whether we want to draw points, lines or triangles), vertex and index buffers.
Render Targets and Depth/Stencil resources: where the draw call is going to render the resulting pixels.
Pipeline State Object (state bits that will probably stay the same for multiple draws): compiled shaders (DXIL bytecode), blend state, depth/stencil/alpha state, rasterizer state...
Root Signature: defines what types of resources are bound to the graphics pipeline. For example, if a pipeline requires access to two textures, the root signature is going to declare a parameter that is a 2-descriptor range into the SRV (shader resource view) heap.
Descriptor Heaps: where we set the relevant descriptors for resources, samplers and constant buffers.
Resource State: description of how a GPU intends to access a resource. Transition barriers are required to guarantee the proper state for a command. For example, before sampling from a texture, we need to make sure that the source is in the D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE state. The exact details of the transition vary from hardware to hardware, but it would at a minimum make sure that all writes to the texture are completed, that the resource has the proper layout (e.g. (de)compression) and that the cache is coherent.

Vulkan follows a similar model, with VkPipeline objects encapsulating state such as image formats, render-target attachments, blend modes, and shaders all bound into a single object. Like DirectX 12 PSOs, Vulkan pipelines are immutable once created. This is one of the biggest sources of impedance mismatch when translating from GL, where applications set global state parameters and the final state is only known when a draw call is submitted.
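To make that mismatch a bit more concrete, here is a minimal sketch in Python (not driver code). The PipelineKey fields and the GLContext class are hypothetical names picked for illustration; the only point is that GL-style state is mutated piecemeal, and the immutable pipeline description can only be assembled at draw time.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineKey:
    """Immutable snapshot of the state baked into a pipeline (hypothetical fields)."""
    vertex_shader: str
    pixel_shader: str
    blend_enabled: bool
    depth_test: bool
    topology: str


class GLContext:
    """GL-style context: state is changed bit by bit, in any order."""

    def __init__(self):
        self.state = PipelineKey("vs_default", "ps_default", False, False, "triangles")

    def set(self, **changes):
        # Each glEnable/glUseProgram-style call only touches part of the state.
        self.state = replace(self.state, **changes)

    def draw(self):
        # Only here is the complete state known, so only here can the
        # matching immutable pipeline object be looked up or created.
        return self.state
```

Because the snapshot is frozen and hashable, it can be used directly as a lookup key, which is what makes caching pipelines practical.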

Performance Work

Initial State

Our initial implementation of the driver was as straightforward as possible, as we wanted to validate our approach and not focus too much on performance early on. Mission accomplished, it was really slow! For each draw call, we were setting the entire pipeline state, filling the descriptor heaps, recording the draw command, immediately executing the command list, and waiting for it to finish.

When drawing a scene with 6 textured triangles (red, green, blue, red, green, blue), the sequence of events would look like this:

Command Batching

This is of course extremely inefficient, and one easy approach to reduce latency is to batch multiple commands (clear, draw, etc.) together. Concretely, we create multiple batch objects that each contain a command allocator and a set of descriptor heaps (sampler and CBV/SRV heaps). Commands are recorded in the batch until the descriptor heaps are full, at which point we can simply execute (send the commands to the GPU), create a fence (for future CPU/GPU synchronization) and start a new batch. Given that queuing of the command and its actual execution are now decoupled, this optimization also requires that we keep track of needed resources. For example, we need to make sure to not delete textures that the draw call will sample from, once executed.

Batch
Command Allocator
Sampler Descriptor Heap
CBV/SRV Descriptor Heap
Tracked Objects
Fence

When all of the batch objects are used (wrap-around), we wait for the oldest submitted batch to complete. It is then safe to unreference all of the tracked resources and re-use the allocator and heaps. Assuming a maximum of 2 active batches and when allocating heaps just big enough for two draw calls (let's start small), drawing our 6 triangles looks like this:
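The wrap-around behaviour can be modelled with a toy ring of batches. This is only a sketch under simplifying assumptions: the Batch and BatchRing classes are made up for this post, each draw is assumed to consume exactly one descriptor slot, and a boolean stands in for the fence.

```python
class Batch:
    def __init__(self, heap_capacity):
        self.heap_capacity = heap_capacity  # descriptor slots (one per draw here)
        self.commands = []
        self.tracked = set()                # resources kept alive until the GPU is done
        self.in_flight = False              # stands in for an unsignaled fence


class BatchRing:
    def __init__(self, num_batches=2, heap_capacity=2):
        self.batches = [Batch(heap_capacity) for _ in range(num_batches)]
        self.current = 0
        self.log = []                       # records "execute N" and "wait" events

    def _current_batch(self):
        batch = self.batches[self.current]
        if batch.in_flight:
            # Wrap-around: wait for the oldest batch, then release its
            # tracked resources and reuse its allocator and heaps.
            self.log.append("wait")
            batch.in_flight = False
            batch.commands.clear()
            batch.tracked.clear()
        return batch

    def flush(self):
        batch = self.batches[self.current]
        if batch.commands and not batch.in_flight:
            self.log.append(f"execute {len(batch.commands)}")
            batch.in_flight = True
            self.current = (self.current + 1) % len(self.batches)

    def draw(self, name, texture):
        batch = self._current_batch()
        if len(batch.commands) == batch.heap_capacity:
            self.flush()                    # heaps are full: submit and move on
            batch = self._current_batch()
        batch.commands.append(name)
        batch.tracked.add(texture)
```

Feeding the six triangles through a ring of two batches with room for two draws each gives three submissions and a single CPU stall, when the first batch has to be reused for the last two triangles.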

It is important to note that some more flushing (waiting for some or all of the commands to be finished) is needed in some scenarios, the main one being when a resource currently used by a command (texturing, blit source/destination) is mapped for access by the CPU.

Dirty State and PSO Caching

In real-life situations, it is really rare that ALL of the state bits change in-between draw calls. It is probably safe to assume, for example, that the viewport stays constant during a frame and that the blending state won't budge much either. So, in a similar fashion to how (real) hardware drivers are implemented, we use dirty-state flags to keep track of which state has changed. We still need to re-assert the entire state when starting a new batch (resetting a command list also resets the entire pipeline state). However, it saves us some CPU cycles when doing multiple draw commands in one batch (very likely).
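As a rough illustration of dirty-state tracking, here is a small Python model. The Dirty flags and the StateTracker class are invented for this sketch and cover only three pieces of state; the real driver tracks many more.

```python
from enum import Flag, auto


class Dirty(Flag):
    NONE = 0
    VIEWPORT = auto()
    BLEND = auto()
    SHADERS = auto()
    ALL = VIEWPORT | BLEND | SHADERS


class StateTracker:
    def __init__(self):
        self.state = {}
        self.dirty = Dirty.ALL   # a fresh command list has no state set at all
        self.emitted = []        # what actually got recorded

    def set_state(self, flag, value):
        # Setting a value that is already current does not dirty anything.
        if self.state.get(flag) != value:
            self.state[flag] = value
            self.dirty |= flag

    def new_batch(self):
        # Resetting the command list loses everything: re-assert it all.
        self.dirty = Dirty.ALL

    def draw(self):
        for flag in (Dirty.VIEWPORT, Dirty.BLEND, Dirty.SHADERS):
            if flag in self.dirty and flag in self.state:
                self.emitted.append(flag.name)
        self.dirty = Dirty.NONE
        self.emitted.append("DRAW")
```

With this model, a second draw that only changes the blend state re-emits the blend state alone, while the first draw of a new batch re-emits everything that has been set.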

In addition to that, and given that PSO (Pipeline State Object) creation is relatively costly, we cache those. If any of the PSO-related dirty flags are set, we can then search the cache and in case of a miss, create a new PSO.
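The caching itself boils down to a dictionary lookup. In this sketch the PSOCache class and the create_pso callback are hypothetical; the key is assumed to be any hashable description of the PSO-related state.

```python
class PSOCache:
    def __init__(self, create_pso):
        self._create = create_pso  # the expensive creation function
        self._cache = {}
        self.misses = 0

    def get(self, key):
        # key: hashable description of the PSO-related state
        pso = self._cache.get(key)
        if pso is None:
            self.misses += 1
            pso = self._create(key)
            self._cache[key] = pso
        return pso
```

The lookup only needs to happen when one of the PSO-related dirty flags is set; otherwise the previously bound PSO is still valid.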

The total rendering time hasn't changed much in our example scenario, but the CPU usage is lowered. Another effect of not re-asserting the descriptor heaps on each draw call is that we can sometimes fit more commands in one batch without allocating bigger heaps.

DXGI Swapchain Winsys

Initially, only CPU-driven GDI winsys integration was implemented. There are two downsides to that approach.

Each time we are done recording the commands for a frame, we need to wait for the rendering to finish and are then completely stalling the CPU.
The resulting framebuffer content is copied to the GDI display target for composition by the window manager.

Let's zoom out and see what is happening when drawing 4 frames. For the purpose of this next diagram, we'll assume that we can draw an entire frame using only one batch:

By implementing integration with DXGI swapchains (only supported for double-buffered pixel formats; we still rely on the old GDI code path otherwise), we can solve these two issues. The swapchain provides us with a back buffer into which the GPU can render the current frame. It also keeps track of a front buffer that is used for the display. When the application wants to present the next frame (wglSwapBuffers), the swapchain simply flips these two buffers.

Please note that this diagram is only valid for full-screen applications when throttling is disabled (wglSwapIntervalEXT(0)). When syncing with V-Sync, the GPU might also introduce some stalling to make sure it doesn't render over the currently displayed buffer. When rendering in windowed mode, the window manager will use the front buffer to compose the final display scene. It can also, in some situations, directly use the buffer without any blitting if the hardware supports hardware overlays.

One final caveat: the application will suffer a performance hit when drawing on the front buffer (glDrawBuffer(GL_FRONT)). The buffer-flip presentation model rules out that possibility; the front buffer needs to stay intact for scanout. If that happens, we have no choice but to create a fake front buffer and to perform some copies before drawing and swapping buffers.

Resource State Manager

Resource state transition barriers require that we specify both the initial state of the resource and the desired new state. The naive approach is to add a state barrier before and after each resource usage (COMMON -> RENDER_TARGET, draw, RENDER_TARGET -> COMMON). But that solution has a real performance cost: each transition may involve layout transitions, resolves, or even copies to shadow resources, so whilst this is an acceptable crutch for short-term development, the cost of using the lowest common denominator is too much for real-world usage. However, getting the details right for a better solution is tricky:

Subresources in a resource can have different states;
Some states are compatible together (read states);
Multiple transition barriers can be aggregated to reduce the number of commands;
Some transitions are implicit (with no cost associated) and don't require a barrier.

Luckily for us, the Microsoft team already worked on a solution for a very similar problem. They have previously developed a project named D3D11on12 (yep, you guessed it, translating DirectX11 API to DirectX12) that is itself relying on D3D12TranslationLayer. They were able to adapt some code from the latter into our Mesa tree, fixing all of the problems previously mentioned.
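To give a feel for what such a state manager has to do, here is a deliberately simplified Python model. It is not the D3D12TranslationLayer code: the ResourceStateTracker class is made up, READ_STATES is just an illustrative subset of state names, and implicit transitions are not modelled at all. It only shows per-subresource tracking, skipping redundant barriers, combining read states, and submitting the pending barriers together.

```python
# Illustrative subset of read-only states (names shortened for the sketch).
READ_STATES = {"PIXEL_SHADER_RESOURCE", "COPY_SOURCE", "VERTEX_BUFFER"}


class ResourceStateTracker:
    def __init__(self):
        self.current = {}  # (resource, subresource) -> set of state names
        self.pending = []  # barriers not yet recorded

    def require(self, resource, subresource, state):
        key = (resource, subresource)
        cur = self.current.get(key, {"COMMON"})
        if state in cur:
            return  # already in a suitable state: no barrier needed
        if state in READ_STATES and cur <= READ_STATES:
            new = cur | {state}  # read states are compatible: widen the set
        else:
            new = {state}
        self.pending.append((resource, subresource, frozenset(cur), frozenset(new)))
        self.current[key] = new

    def flush_barriers(self):
        # Hand back every pending transition so they can go out in one call.
        batch, self.pending = self.pending, []
        return batch
```

Compared with the naive approach, a resource that is rendered to twice in a row produces one barrier instead of four, and two subresources of the same texture can sit in different states.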

Buffer allocation

It is not always optimal to create a new committed resource for each buffer. To speed up resource allocation, we don't immediately destroy unreferenced buffers but instead try to re-use them for new allocations. Allocating whole resources for small buffers is also inefficient because of the alignment requirements. Therefore, we create a new buffer slab on demand to sub-allocate smaller buffers. In the future, it might be possible to implement a similar solution for textures, but this approach is strictly used for buffers as of now.
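Both ideas, re-using released buffers and sub-allocating from slabs, fit in a short Python model. Everything here is an assumption made for illustration: the Slab and BufferAllocator classes, the simple bump allocation inside a slab, the exact-size free list, and the 64 KiB / 256-byte values.

```python
def align_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment


class Slab:
    """One large buffer carved into small sub-allocations (bump allocator)."""

    def __init__(self, size):
        self.size = size
        self.offset = 0

    def alloc(self, size, alignment=256):
        start = align_up(self.offset, alignment)
        if start + size > self.size:
            return None  # does not fit in this slab
        self.offset = start + size
        return start


class BufferAllocator:
    def __init__(self, slab_size=64 * 1024):
        self.slab_size = slab_size
        self.slabs = []
        self.free_list = {}  # size -> released (slab index, offset) pairs

    def alloc(self, size):
        if size > self.slab_size:
            raise ValueError("too large for a slab; needs its own resource")
        cached = self.free_list.get(size)
        if cached:
            return cached.pop()  # re-use a released buffer
        for index, slab in enumerate(self.slabs):
            offset = slab.alloc(size)
            if offset is not None:
                return (index, offset)
        self.slabs.append(Slab(self.slab_size))  # new slab on demand
        return (len(self.slabs) - 1, self.slabs[-1].alloc(size))

    def free(self, allocation, size):
        # Keep the allocation around instead of destroying it.
        self.free_list.setdefault(size, []).append(allocation)
```

Two 100-byte buffers end up 256 bytes apart in the same slab rather than in two separately aligned resources, and a freed buffer is handed straight back to the next request of the same size.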

Wrap-Up

TL;DR: All of these incremental changes sum up to an amazing result: less CPU time wasted waiting on completions (pipelining), less CPU time wasted on overhead (batching), less CPU time wasted on redundant operations (caching), more efficient memory usage (suballocation), zero-copy presentation pipeline (DXGI), more efficient GPU hardware usage (explicit image states which aren't COMMON).

For the ones that just came for the screenshot:

And some numbers I compiled for Doom 3 timedemo benchmark. In this very specific scenario, on my system (mobile Intel CPU/GPU), the cumulative gain of our changes is around 40x!

Step	FPS
Initial State	1.0
Command Batching	4.2
Dirty State & PSO Cache	12.9
DXGI Swapchain *	14.8
Resource State Manager	24.6
Buffer Caching/Suballoc	42.5

* The improvement when switching to DXGI is low because of resource creation that requests a kernel lock.
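For the curious, the per-step and cumulative gains can be derived from the table with a few lines of Python. The FPS values are copied from the table above; the step_gains helper is just a throwaway name for this sketch.

```python
# FPS numbers copied from the table above.
FPS = [
    ("Initial State", 1.0),
    ("Command Batching", 4.2),
    ("Dirty State & PSO Cache", 12.9),
    ("DXGI Swapchain", 14.8),
    ("Resource State Manager", 24.6),
    ("Buffer Caching/Suballoc", 42.5),
]


def step_gains(rows):
    """Speed-up of each step relative to the previous one."""
    return [(name, cur / prev) for (_, prev), (name, cur) in zip(rows, rows[1:])]


gains = step_gains(FPS)
cumulative = FPS[-1][1] / FPS[0][1]
```

The product of the individual gains is the cumulative figure, 42.5 / 1.0, which is where the "around 40x" comes from.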

Disclaimer: You might not be able to replicate these numbers if you use the main repository as I disabled the debug layer and changed the descriptor heap sizes for my benchmark.

Future

Here are some of the ideas we could consider going forward:

Further CPU overhead analysis
Optimize number of batches and size of descriptor heaps
Placed textures (sub-allocation from a large texture)
Same-subresource copies will always be inefficient unless we can work around that DirectX 12 restriction
Use a different BO when the whole content is discarded
Create bucket root signature (round up needed size for descriptor ranges)
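As an example of what the last idea could look like, here is one possible bucketing scheme in Python. Rounding up to the next power of two is purely my assumption for this sketch (other bucket sizes would work too), and bucket_size is a hypothetical helper, not something that exists in the driver.

```python
def bucket_size(count):
    """Round a descriptor count up to the next power of two (0 stays 0)."""
    if count <= 0:
        return 0
    return 1 << (count - 1).bit_length()


# Pipelines needing 1 to 8 descriptors would share only four range sizes.
buckets = {bucket_size(n) for n in range(1, 9)}
```

The trade-off is a few unused descriptor slots per range in exchange for fewer distinct root signatures to create and switch between.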

Acknowledgments

Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Erik Faye-Lund, Gert Wollny) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).
