July 07, 2022
Even though the hardware served by the r600 driver is ageing, it is still in wide use, and high-end cards from that generation still deliver good performance for mid-range gaming. When the drivers were originally implemented, TGSI was the dominant intermediate representation (IR) used by the shader compilers in Mesa. Several years back, NIR (new intermediate representation) was introduced, which has since been adopted by most drivers in Mesa. Among other things, NIR allows adding hardware-specific opcodes that make it easy to transform the shader code into something that can readily be translated into hardware-specific assembly. (To learn more about the features of NIR, take a look at Faith Ekstrand's excellent blog post.)
With that in mind, and the general sentiment that I should learn something about NIR, I got the idea to implement a NIR back-end for the r600 hardware while I was at XDC 2018. At that time, the driver created non-optimized assembly from the TGSI, which was then optimized by SB, an optimizer that was added in 2013 to the r600 driver. This optimizer has quite a few quirks; it does not work for compute or tessellation shaders, or shaders that use images or atomic operations. On top of that, it has some bugs that are difficult to fix because the code base is not well documented and difficult to understand.
When I started this project, I did not have any idea about how NIR was actually implemented or to be used. My only experience with compilers was the implementation of an improved register allocation pass for TGSI, so obviously I would make a lot of errors.
As someone who likes test-driven development, my approach for bringing up the back-end, that is, getting basic vertex and fragment shaders to work, was to implement a function that makes it possible to create a NIR shader from its printout. Then I would write a test with the expected assembly, and implement the code to actually create that assembly. (The code to create a NIR shader from a printout can be found in a development branch; however, the way NIR is printed has changed since, so some redesign would be needed to make it useful again.)
Thanks to the working TGSI back-end, test expectations were easy to obtain. So I happily coded away to first get the shaders to draw a simple triangle, add texturing, matrix operations, and so on.
Once the basic shaders were running, and glxgears did what it was supposed to do, it was easy to move forward: run a set of piglit tests and, for those that crash or fail, see what the TGSI-created assembly does, and fill in the gaps.
With that it was simple to get vertex and fragment shaders working. The most challenging part was not to get the assembly right, but to get the shader info in sync with what was created by the TGSI back-end, because that's what the state code expects.
Up to this point, the difference between the assembler output of the TGSI code path and the output of the NIR code path was not substantial. Granted, NIR was way better optimized from the start, so the assembly created from the IR was usually better, but at this stage SB would level the playing field (apart from the bugs).
By the end of 2019 - when support for fragment and vertex shaders had been implemented - the back-end was upstreamed, and development continued there.
To support r600 properly, a few instructions, like nir_op_cube_r600, had already been added, and while these made some things easier, they did not really contribute to more optimized code. Only with the use of local data storage (LDS) in tessellation shaders did NIR really begin to shine: with the TGSI code path, the memory address is evaluated from scratch for every instruction that accesses LDS, resulting in a lot of code duplication. Because LDS handling was only added after SB had landed, support for optimizing the generated code was initially not available, and because LDS reads actually require two dependent instructions (a fetch to a queue and a read from that queue), implementing this support is not trivial. Dave Airlie merged some code to SB to do this optimization, but it is still disabled.
With NIR, things became simple: just add back-end-specific intrinsics for accessing LDS, and lower the shared memory access with all the address calculations to use these intrinsics, which can be translated directly to r600 assembly. Then, let the NIR passes take care of removing the code duplication and optimizing the address calculation. With that, TessMark performance (with factor 32) improved from 32 FPS to 52 FPS, and a few rendering bugs were fixed too.
From this point on, implementing further functionality was, again, straightforward. At the beginning of 2021 the NIR back-end had been brought to parity with the TGSI back-end for Evergreen class hardware, and soft-fp64 had been tied in, so that support for OpenGL 4.5 could be advertised. By mid-2021 Cayman class hardware was also supported, although without the hardware fp64 support.
However, since I had jumped into the project without much knowledge about how to write a compiler, my initial design of the intermediate representation used in the back-end did not really plan for optimization or scheduling. Hence, the limitations that were true for the TGSI back-end in that regard were still true.
In addition, NIR itself is a constantly moving target. For instance, initially it was not possible to consume the lowered IO in the r600 back-end, because some information about semantics was lost. Later, when this data was added to the IO intrinsics, I changed the code to lower IO, because it makes things a lot easier, but this left a fair amount of dead code lying around. In addition, the better I understood NIR the more code became obsolete, but was still somewhat used and difficult to rip out. Hence, I decided that the back-end should be rewritten, taking into account the lessons learned, and this time some optimization and better scheduling would be built in.
Because the functionality was already there, rewriting the back-end was quite easy, mostly copying and pasting the existing code and adjusting the interfaces. The new back-end implements some copy-propagation, still a bit conservative, though, and a pre-scheduler. The final code-arrangement is still made by the old assembler code. Still, it barely changes the pre-scheduled code - it mostly takes care of emitting additional instructions for indirect addressing, and it validates the created assembly.
Thanks to the work done by Emma Anholt, the glsl-to-tgsi code path has been replaced by glsl-to-nir and nir-to-tgsi. With that, the TGSI the driver sees is already a lot better optimized than before, but a few problems still remain: The per-LDS address calculation is still done, instruction groups are not filled if a TGSI instruction does not use all four slots, and if the shader does not allow for SB to be used, then this is the code that is executed by the hardware.
With that in mind, adding a native NIR back-end still has its virtues.
As of now, the NIR back-end supports Evergreen and Northern Islands-based hardware. It is, again, on par with the TGSI back-end, although a few piglit regressions remain. For some test results I ran piglit on Cayman PRO (Radeon HD 6950) and Cedar (Radeon HD 5000). Because the GPU soft-reset sometimes crashes the graphics hardware in a way that makes a reboot necessary, I excluded a number of tests from the piglit runs.
On Cayman, piglit was run like this:

    ./piglit run gpu -x conditional \
        -x glx \
        -x tex3d-maxsize \
        -x atomicity \
        -x ssbo-atomiccompswap-int \
        -x image_load_store \
        -x gs-max-output \
        -x spec@arb_compute_shader@execution@min-dvec4-double-large-group-size \
        -j1 --dmesg -v --timeout 100
The NIR code provides quite a number of fixes and it was possible to enable a few more features so that the driver now advertises OpenGL 4.5.
On Cedar, piglit was run similarly to Cayman. Since the TGSI back-end doesn't support fp64 here, piglit was once run on NIR skipping the fp64 tests to directly compare to TGSI, and once including the fp64 tests:
    SKIP_FP64="-x dmat -x fp64 -x double -x dvec"
    ./piglit run gpu -x conditional \
        -x glx \
        -x tex3d-maxsize \
        -x atomicity \
        $SKIP_FP64 \
        -x ssbo-atomiccompswap-int -j1 --dmesg -v --timeout 100
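The fp64 exclusions above can also be built from a plain list of patterns, which makes it easier to switch them on and off between the two runs. This is only a minimal sketch: `exclude_args` is a hypothetical helper name that is not part of piglit or Mesa, and the only piglit option it relies on is the `-x` flag already used in the commands above.

```shell
# Hypothetical helper (not part of piglit): turn a list of test-name
# patterns into a string of "-x <pattern>" arguments.
exclude_args() {
    args=""
    for pattern in "$@"; do
        # Append "-x <pattern>", separated by a space after the first entry.
        args="${args:+$args }-x $pattern"
    done
    printf '%s\n' "$args"
}

# Patterns taken from the Cedar run described above.
exclude_args dmat fp64 double dvec
# prints: -x dmat -x fp64 -x double -x dvec
```

The result would then be assigned with `SKIP_FP64=$(exclude_args dmat fp64 double dvec)` and passed unquoted (`$SKIP_FP64`) so that the shell splits it into separate arguments; leaving the variable empty gives the run that includes the fp64 tests.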
Here we see a picture similar to Cayman: the fixes outweigh the few regressions, and many tests were enabled because OpenGL 4.5 can be exposed with the NIR back-end.
[Table: piglit results on Cedar; columns include NIR (fp64 included)]
Performance-wise the NIR back-end is mostly a win. A number of benchmarks were run by using the Phoronix test suite, comparing TGSI and NIR both with SB disabled and enabled.
[Table: Phoronix test suite benchmark results; columns include TGSI + SB and NIR + SB]
As can be seen in the table above, all but two test cases show a performance improvement, i.e. NIR standalone performs better than TGSI standalone, and NIR+SB performs better than TGSI+SB. In addition, even though SB is usually capable of improving the code produced by the NIR back-end, the performance win is generally smaller than when optimizing the code that was created by the TGSI back-end. There are two exceptions, though: for OpenArena no performance improvement can be seen, and Xonotic sees a significant performance regression with the NIR back-end as compared to TGSI. The poor performance here can mostly be attributed to lost opportunities for copy propagation and for vectorizing gradient evaluations. SB can level the playing field, but since it also doesn't vectorize the gradient evaluation, a slight performance regression remains.
The detailed results can be found on openbenchmarking.org
A number of improvements can still be applied to the NIR back-end:
Finally, for NIR to become the default back-end, all piglit regressions and the big performance regression with Xonotic must be fixed.
The new NIR code is available with the merge request. If you want to help and test this code, the back-end is enabled with R600_DEBUG=nir. SB is enabled by default, but you can use R600_DEBUG=nir,nosb to run NIR with disabled SB. Play your favorite games with the back-end enabled and report bugs at https://gitlab.freedesktop.org/mesa/mesa/-/issues. If you are a developer and have the hardware, just pick a task from the list above and start fixing.
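For testing, it can be convenient to set the variable only for a single program instead of exporting it for the whole session. The following is a small sketch under that assumption: `run_nir` and `run_nir_nosb` are hypothetical wrapper names, and only the two `R600_DEBUG` values come from the text above.

```shell
# Hypothetical wrappers: run one command with the NIR back-end enabled,
# leaving the environment of the calling shell untouched.
run_nir() {
    env R600_DEBUG=nir "$@"
}

# Same, but with the SB optimizer disabled as well.
run_nir_nosb() {
    env R600_DEBUG=nir,nosb "$@"
}

# Show what the wrapped command sees in its environment.
run_nir sh -c 'echo "R600_DEBUG=$R600_DEBUG"'
run_nir_nosb sh -c 'echo "R600_DEBUG=$R600_DEBUG"'
```

A game or test program is then started as, for example, `run_nir glxgears`, and the same program can be re-run with `run_nir_nosb` to check whether a problem comes from the back-end itself or from SB.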