July 09, 2020
Earlier this year, we announced a new project in partnership with Microsoft: the implementation of OpenCL and OpenGL to DirectX 12 translation layers (Git repository). Time for a summer update! In this blog post, I will explain a little more about the OpenGL part of this work and more specifically the steps that have been taken to improve the performance of the OpenGL-On-D3D12 driver.
In the initial steps of this project, we quickly realized that the best way forward was to build on top of Mesa. Zink, a project started by Erik Faye-Lund, has already proven that we could achieve a similar goal: translating OpenGL to a lower-level graphics API (s/Vulkan/DirectX12/). People familiar with that project will therefore experience a strong déjà-vu feeling when looking at the architecture of our current effort:
The Mesa state tracker is responsible for translating OpenGL state (blend modes, texture state, etc.) and drawing commands (like glDrawArrays and glDrawPixels) into objects and operations that map well to modern GPU hardware features (Gallium API). The "D3D12 Driver" is thus an implementation of that interface.
On the shader side, the state tracker is able to convert OpenGL fixed-functions, traditionally implemented directly by the hardware, into shaders. Mesa will also translate GLSL shaders into an intermediate representation named NIR. We use that representation to produce the DXIL bytecode consumed by DirectX. I'm not gonna focus on the NIR-to-DXIL compiler here as it definitely deserves its own blog post.
Finally, a different component of Mesa, the WGL State tracker, is handling WGL calls (API between OpenGL and the windowing system interface of Windows). Internally, an existing implementation of the windowing system was using GDI (Graphics Device Interface) to actually display the rendered frames on the screen. We added a new implementation using DXGI Swapchains. More on that later.
In order to better understand the next sections, let's dive a little more into the details of DirectX 12.
DirectX 12 requires that we record commands (e.g. clearing the render target, draw calls, etc.) into an ID3D12GraphicsCommandList and then call ExecuteCommandLists() on the command queue to actually process the commands. But before we can record drawing commands, we first need to set some state on the command list, including (not an exhaustive list):
Vulkan follows a similar model, with VkPipeline objects encapsulating state such as image formats, render-target attachments, blend modes, and shaders all bound into a single object. Like DirectX 12 pipeline state objects, Vulkan pipelines are immutable once created. This is one of the biggest sources of impedance mismatch when translating from GL, where applications set global state parameters and the final state is only known when a draw call is submitted.
Our initial implementation of the driver was as straightforward as possible, as we wanted to validate our approach and not focus too much on performance early on. Mission accomplished, it was really slow! For each draw call, we were setting the entire pipeline state, filling the descriptor heaps, recording the draw command, immediately executing the command list, and waiting for it to finish.
When drawing a scene with 6 textured triangles (red, green, blue, red, green, blue), the sequence of events would look like this:
This is of course extremely inefficient, and one easy approach to reduce latency is to batch multiple commands (clear, draw, etc.) together. Concretely, we create multiple batch objects that each contain a command allocator and a set of descriptor heaps (sampler and CBV/SRV heaps). Commands are recorded in the batch until the descriptor heaps are full, at which point we can simply execute (send the commands to the GPU), create a fence (for future CPU/GPU synchronization) and start a new batch. Given that the queuing of a command and its actual execution are now decoupled, this optimization also requires that we keep track of the resources each batch needs. For example, we must not delete a texture that a queued draw call will sample from once it executes.
|Sampler Descriptor Heap|
|CBV/SRV Descriptor Heap|
When all of the batch objects are used (wrap-around), we wait for the oldest submitted batch to complete. It is then safe to unreference all of the tracked resources and re-use the allocator and heaps. Assuming a maximum of 2 active batches and when allocating heaps just big enough for two draw calls (let's start small), drawing our 6 triangles looks like this:
It is important to note that some more flushing (waiting for some or all of the commands to be finished) is needed in some scenarios. The main one is mapping a resource for CPU access while it is still used by a queued command (texturing, blit source/destination).
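To make the batching and wrap-around behaviour easier to follow, here is a small Python model of it. This is only a sketch, not the driver's code (which lives in Mesa and is written in C): the `Batch` and `BatchRing` names, the `log` trace and the "heap capacity counted in draws" simplification are all assumptions made for illustration. A real batch would own a command allocator, descriptor heaps and a fence rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Stand-in for a command allocator plus its descriptor heaps."""
    heap_capacity: int                              # draws that fit before the heaps are full
    draws: list = field(default_factory=list)       # recorded commands
    tracked: set = field(default_factory=set)       # resources kept alive until completion
    in_flight: bool = False                         # submitted, fence not yet waited on


class BatchRing:
    """Hypothetical ring of batches, re-used in submission order."""

    def __init__(self, num_batches=2, heap_capacity=2):
        self.batches = [Batch(heap_capacity) for _ in range(num_batches)]
        self.current = 0
        self.log = []                               # trace of submit/wait events

    def _acquire(self):
        batch = self.batches[self.current]
        if batch.in_flight:
            # Wrap-around: this is the oldest submitted batch, so wait on its
            # fence, then drop the tracked resources and re-use it.
            self.log.append(f"wait for batch {self.current}")
            batch.draws.clear()
            batch.tracked.clear()
            batch.in_flight = False
        return batch

    def draw(self, name, resources=()):
        batch = self._acquire()
        if len(batch.draws) == batch.heap_capacity:
            self.flush()                            # heaps are full: submit and move on
            batch = self._acquire()
        batch.draws.append(name)
        batch.tracked.update(resources)             # keep textures alive until the GPU is done

    def flush(self):
        batch = self.batches[self.current]
        if batch.in_flight or not batch.draws:
            return                                  # nothing new to submit
        self.log.append(f"submit batch {self.current}: {batch.draws}")
        batch.in_flight = True
        self.current = (self.current + 1) % len(self.batches)


ring = BatchRing(num_batches=2, heap_capacity=2)
for color in ["red", "green", "blue", "red", "green", "blue"]:
    ring.draw(color, resources={color + "_texture"})
ring.flush()
for event in ring.log:
    print(event)
```

With two batches that each hold two draws, the trace shows two submissions back to back, a single wait on batch 0 once the ring wraps around, and then the final submission. In this model the CPU only blocks when it runs out of free batches.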
In real-life situations, it is really rare that ALL of the state bits change in-between draw calls. It is probably safe to assume, for example, that the viewport stays constant during a frame and that the blending state won't budge much either. So, in a similar fashion to how (real) hardware drivers are implemented, we use dirty-state flags to keep track of which state has changed. We still need to re-assert the entire state when starting a new batch (resetting a command list also resets the entire pipeline state). However, it saves us some CPU cycles when doing multiple draw commands in one batch (very likely).
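The dirty-flag idea can be sketched in a few lines of Python. The flag names, the `CommandListState` class and the `emitted` trace below are hypothetical and only there to show the mechanism: a state group is re-sent at draw time only if its bit is set, and starting a new batch sets every bit again.

```python
from enum import IntFlag


class Dirty(IntFlag):
    # Illustrative subset of state groups; a real driver tracks many more.
    VIEWPORT = 1
    BLEND = 2
    SHADER = 4
    ALL = 7


class CommandListState:
    def __init__(self):
        self.values = {}
        self.dirty = Dirty.ALL          # nothing has been sent yet
        self.emitted = []               # per draw: which state groups were re-sent

    def set_state(self, flag, value):
        # Only mark the group dirty if the value actually changed.
        if self.values.get(flag) != value:
            self.values[flag] = value
            self.dirty |= flag

    def begin_batch(self):
        # A freshly reset command list has lost all state: re-assert everything.
        self.dirty = Dirty.ALL

    def draw(self):
        groups = (Dirty.VIEWPORT, Dirty.BLEND, Dirty.SHADER)
        self.emitted.append([g.name for g in groups if self.dirty & g])
        self.dirty = Dirty(0)


ctx = CommandListState()
ctx.set_state(Dirty.VIEWPORT, (0, 0, 640, 480))
ctx.set_state(Dirty.BLEND, "opaque")
ctx.set_state(Dirty.SHADER, "red")
ctx.draw()                              # first draw: everything is sent
ctx.set_state(Dirty.SHADER, "green")
ctx.draw()                              # only the shader changed
print(ctx.emitted)
```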
In addition to that, and given that PSO (Pipeline State Object) creation is relatively costly, we cache those. If any of the PSO-related dirty flags are set, we can then search the cache and in case of a miss, create a new PSO.
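As a rough illustration of the caching step, the sketch below keys a dictionary on the state that ends up baked into a PSO. The `PsoCache` class, the three-field key and the `create_pso` callback are assumptions for the example; the real key covers far more state, and the real creation call is the expensive part we want to avoid repeating.

```python
class PsoCache:
    """Hypothetical cache: one PSO per unique combination of baked-in state."""

    def __init__(self, create_pso):
        self._create_pso = create_pso   # expensive creation function
        self._cache = {}
        self.misses = 0

    def get(self, shader, blend, rt_format):
        key = (shader, blend, rt_format)
        pso = self._cache.get(key)
        if pso is None:
            # Cache miss: pay the creation cost once and remember the result.
            self.misses += 1
            pso = self._create_pso(shader, blend, rt_format)
            self._cache[key] = pso
        return pso


def fake_create_pso(shader, blend, rt_format):
    # Placeholder for the costly driver call.
    return {"shader": shader, "blend": blend, "rt_format": rt_format}


cache = PsoCache(fake_create_pso)
for color in ["red", "green", "blue", "red", "green", "blue"]:
    cache.get(color, "opaque", "RGBA8")
print(cache.misses)
```

For the six-triangle scene, only the first red, green and blue draws create a PSO; the second round is served from the cache.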
The total rendering time hasn't changed much in our example scenario, but the CPU usage is lowered. Another effect of not re-asserting the descriptor heaps on each draw call is that we can sometimes fit more commands in one batch without allocating bigger heaps.
Initially, only CPU-driven GDI winsys integration was implemented. There are two downsides to that approach.
Let's zoom out and see what is happening when drawing 4 frames. For the purpose of this next diagram, we'll assume that we can draw an entire frame using only one batch:
By implementing integration with DXGI swapchains (only supported for double-buffered pixel formats; we still rely on the old GDI code path otherwise), we can solve these two issues. The swapchain provides us with a back buffer into which the GPU can render the current frame. It also keeps track of a front buffer that is used for the display. When the application wants to present the next frame (wglSwapBuffers), the swapchain simply flips these two buffers.
Please note that this diagram is only valid for full-screen applications when throttling is disabled (wglSwapIntervalEXT(0)). When syncing with V-Sync, the GPU might also introduce some stalling to make sure it doesn't render over the currently displayed buffer. When rendering in windowed mode, the window manager will use the front buffer to compose the final display scene. It can also, in some situations, directly use the buffer without any blitting if the hardware supports hardware overlays.
One final caveat: the application will suffer a performance hit when drawing on the front buffer (glDrawBuffer(GL_FRONT)). The buffer-flip presentation model rules out that possibility; the front buffer needs to stay intact for scanout. If that happens, we have no choice but to create a fake front buffer and to perform some copies before drawing and swapping buffers.
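For readers who prefer code to diagrams, here is a toy Python model of the buffer-flip presentation described above. `FlipSwapchain` and its methods are invented for this sketch and have nothing to do with the actual DXGI interfaces; the point is only that presenting swaps the roles of two buffers instead of copying pixels between them.

```python
class FlipSwapchain:
    """Toy two-buffer swapchain: present() flips roles, it never copies."""

    def __init__(self, size=4):
        self.buffers = [bytearray(size), bytearray(size)]
        self.back = 0                    # index of the buffer being rendered into

    @property
    def front(self):
        return 1 - self.back             # the other buffer is on display

    def render(self, pixels):
        # The "GPU" writes the new frame into the back buffer only.
        self.buffers[self.back][:] = pixels

    def present(self):
        # Flip: the back buffer becomes the front buffer and vice versa.
        self.back = 1 - self.back

    def scanout(self):
        return bytes(self.buffers[self.front])


chain = FlipSwapchain()
chain.render(b"AAAA")
chain.present()                          # frame A is now on display
chain.render(b"BBBB")                    # frame B is drawn while A stays intact
print(chain.scanout())
```

Rendering to the front buffer does not fit this model, since it would overwrite the frame that is currently being scanned out.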
Resource state transition barriers require that we specify both the initial state of the resource and the desired new state. The naive approach is to add a state barrier before and after each resource usage (transitioning from COMMON to the required state, then back to COMMON). But that solution has a real performance cost: each transition may involve layout transitions, resolves, or even copies to shadow resources, so whilst this is an acceptable crutch for short-term development, the cost of using the lowest common denominator is too much for real-world usage. However, getting the details right for a better solution is tricky:
Luckily for us, the Microsoft team had already worked on a solution for a very similar problem. They previously developed a project named D3D11on12 (yep, you guessed it, translating the DirectX 11 API to DirectX 12) that itself relies on D3D12TranslationLayer. They were able to adapt some code from the latter into our Mesa tree, fixing all of the problems previously mentioned.
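To give a feel for why tracking states pays off, the Python sketch below counts barriers for the same sequence of resource uses under two policies. It is deliberately simplistic and is not the algorithm used by D3D12TranslationLayer: the function names are made up, the state names are shortened versions of the D3D12_RESOURCE_STATE_* values, and it ignores subresources, batch boundaries and every other tricky detail.

```python
def naive_barriers(uses):
    """Return to COMMON after every use: two barriers per use."""
    barriers = []
    for resource, state in uses:
        barriers.append((resource, "COMMON", state))
        barriers.append((resource, state, "COMMON"))
    return barriers


def tracked_barriers(uses):
    """Remember each resource's current state; transition only on change."""
    current = {}
    barriers = []
    for resource, state in uses:
        before = current.get(resource, "COMMON")    # assume resources start in COMMON
        if before != state:
            barriers.append((resource, before, state))
            current[resource] = state
    return barriers


# A texture is rendered to twice, then sampled twice.
uses = [
    ("tex", "RENDER_TARGET"),
    ("tex", "RENDER_TARGET"),
    ("tex", "PIXEL_SHADER_RESOURCE"),
    ("tex", "PIXEL_SHADER_RESOURCE"),
]
print(len(naive_barriers(uses)), len(tracked_barriers(uses)))
```

In this small case the naive policy records eight barriers where two are enough.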
It is not always optimal to create a new committed resource for each buffer. To speed up resource allocation, we don't immediately destroy unreferenced buffers but instead try to re-use them for new allocations. Allocating whole resources for small buffers is also inefficient because of the alignment requirements. Therefore, we create a new buffer slab on demand to sub-allocate smaller buffers. In the future, it might be possible to implement a similar solution for textures, but this approach is strictly used for buffers as of now.
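Sub-allocation itself is a simple idea, sketched below in Python. Everything here is an assumption made for illustration: the `Slab` and `SlabAllocator` classes, the bump-pointer strategy, and the slab size and alignment values are not taken from the driver, and the sketch leaves out freeing and the re-use of unreferenced buffers entirely.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


class Slab:
    """One large backing buffer handed out in aligned pieces."""

    def __init__(self, size):
        self.size = size
        self.offset = 0                  # first unused byte

    def alloc(self, size, alignment):
        start = align_up(self.offset, alignment)
        if start + size > self.size:
            return None                  # does not fit in this slab
        self.offset = start + size
        return start


class SlabAllocator:
    def __init__(self, slab_size=64 * 1024, alignment=256):
        self.slab_size = slab_size       # illustrative values only
        self.alignment = alignment
        self.slabs = []

    def alloc(self, size):
        """Return (slab index, offset) for a small buffer."""
        for index, slab in enumerate(self.slabs):
            offset = slab.alloc(size, self.alignment)
            if offset is not None:
                return index, offset
        # No existing slab has room: create a new one on demand.
        slab = Slab(max(self.slab_size, size))
        self.slabs.append(slab)
        return len(self.slabs) - 1, slab.alloc(size, self.alignment)


allocator = SlabAllocator(slab_size=1024, alignment=256)
for size in (100, 100, 300, 300):
    print(size, allocator.alloc(size))
```

Many small buffers end up sharing one backing resource, and a new slab is only created when the existing ones are full.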
TL;DR: All of these incremental changes sum up to an amazing result: less CPU time wasted waiting on completions (pipelining), less CPU time wasted on overhead (batching), less CPU time wasted on redundant operations (caching), more efficient memory usage (suballocation), a zero-copy presentation pipeline (DXGI), and more efficient GPU hardware usage (explicit image states which aren't COMMON).
For the ones that just came for the screenshot:
And some numbers I compiled for the Doom 3 timedemo benchmark. In this very specific scenario, on my system (mobile Intel CPU/GPU), the cumulative gain of our changes is around 40x!
|Dirty State & PSO Cache||12.9|
|DXGI Swapchain *||14.8|
|Resource State Manager||24.6|
* The improvement when switching to DXGI is low because of resource creation that requests a kernel lock.
Disclaimer: You might not be able to replicate these numbers if you use the main repository, as I disabled the debug layer and changed the descriptor heap sizes for my benchmark.
Here are some of the ideas we could consider going forward:
Our team consists of five additional Collabora engineers (Boris Brezillon, Daniel Stone, Elie Tournier, Erik Faye-Lund, Gert Wollny) and two Microsoft DirectX engineers (Bill Kristiansen, Jesse Natalie).