June 05, 2020
In our last blog update for Panfrost, the free and open-source graphics driver for modern Mali GPUs, we announced initial support for the Bifrost architecture. We have since extended this support to all major features of OpenGL ES 2.0 and even some features of desktop OpenGL 2.1. With only free software, a Mali G31 chip can now run Wayland compositors with zero-copy graphics, including GNOME 3. We can run every scene in glmark2-es2, and 3D games like Neverball can be played. In addition, we can support hardware-accelerated video players mpv and Kodi. Screenshots above are from a Mali G31 board running Panfrost.
All of the above is included in upstream Mesa with no out-of-tree patches required, with the upcoming Bifrost support enabled via the
PAN_MESA_DEBUG=bifrost environmental variable.
Bringing up these new applications required implementing many new floating-point arithmetic opcodes, including comparisons, selections, and additional type conversions. Further, I’ve added initial support for integer arithmetic and bitwise operations, used to implement integer types directly as well as booleans. While there are a number of arithmetic logic unit (ALU) opcodes required, this is not an obstacle on architectures with regular instruction encodings.
Unfortunately, Bifrost is not a regular architecture and has dozens of distinct instruction encodings in order to conserve space. Adding opcodes to the compiler is still routine, but requires adding quite a bit more code. Plus, the duplication can be error-prone, so as soon as I add a new opcode, I add comprehensive tests against the real hardware iterating through different combinations of operand size and modifiers to exercise all the packing special cases.
The upshot is that the testing coverage eliminates entire classes of compiler bugs which tend to plague new drivers, allowing our open source Bifrost driver to flourish despite such a quirky architecture.
Beyond new ALU opcodes, I extended the texture support to enable simple texture operations from vertex shaders, a pattern occurring in
glmark2’s terrain scene. Mali GPUs use slightly different encodings for fragment and vertex texture operations, since fragment shaders can automatically compute the level-of-detail parameter based on neighboring fragments, whereas there is no notion of neighboring fragments in vertex shaders.
Finally, I added initial control flow support (branching) support for if/else statements and loops. As Bifrost is a Single Instruction, Multiple Thread (SIMT) architecture in which multiple threads run the same shader in lockstep, branching is a complicated affair if threads diverge. Most of the complexity is handled in hardware, but just enough seeps through that the branching implementation ends up a hair more complicated than that of Midgard. Still, it’s enough for glmark2’s
loop scene, and there’s always room for improvement.
Of couse, Bifrost progress is no obstacle to improving our Midgard support. Inspired by the lessons learned designing the Bifrost Intermediate Representation as previously blogged, I revisited our Midgard Intermediate Representation as well. The focus was two fold:
Simplify to enable faster, more effective optimizations in fewer lines of code.
Generalize the IR to support non-32-bit operation.
To do so, I implemented generic helpers for inferring instruction modifiers like saturation. Consider a shader that squares a variable and saturates it to the range [0, 1].
X = clamp(X * X, 0.0, 1.0);
In NIR, Mesa’s common intermediate representation used across drivers, this line might look like the following, using NIR’s
fsat opcode to clamp to [0, 1]:
ssa_10 = fmul ssa_9, ssa_9 ssa_11 = fsat ssa_10
Our hardware has native support for saturating the results of floating-point instructions. There are a few approaches to take advantage of this. One is to use NIR’s builtin saturation handling, as Midgard’s compiler used to. A NIR pass can fuse the
fsat instruction into the multiply, producing the NIR:
ssa_10 = fmul.sat ssa_9, ssa_9
Then our backend compiler can use the
.sat flag directly. While this is an easy approach, it is inflexible, since the hardware might be able to use modifiers that NIR does not express. For instance, Mali GPUs have a
.clamp_positive operation which does max(x, 0.0) on the result for free. If we wrote
X = max(X * X, 0.0), NIR could give us code using a dedicated
ssa_10 = fmul ssa_9, ssa_9 ssa_11 = fclamp_positive ssa_10
However, it could not fuse the modifier in without substantial changes affecting common code. The second approach would be to compile this to two instructions in the IR, and use a second propagation pass on our backend IR to fuse it together.
10 = fmul 9, 9 10 = fmul.pos 9, 9 11 = fclamp_positive ssa_10
However, there’s a third option unifying both cases and simplifying the compiler: inferring the modifiers generically while translating NIR into our backend IR. This enables us to use architecture-specific modifiers, like
.pos, while still having the original NIR available for efficient handling. This approach enabled us to replace hundreds of lines of optimizations for floating-point modifiers and bitwise inverses, while optimizing new patterns that the original design could not, promising savings in code complexity and performance improvements. Since it’s generic, it allows us to optimize not just Midgard programs, but soon Bifrost modifiers as well.
With a simpler compiler, I was able to add 16-bit support to the Midgard compiler to reduce register pressure and improve thread count (occupancy) due to the architecture’s register sharing mechanism. As previously blogged, our Bifrost compiler is built to support this from day 1, and through the lessons learned there, I was able to backport the improvements to Midgard.
To prepare, I added types into the IR to avoid compiler passes requiring type inference, a complex and error prone pursuit. Once type sizes were preserved cleanly, I added additional support to the Midgard compiler’s packing routines to handle some outstanding details of 16-bit instructions. Midgard is significantly simpler to pack than Bifrost; whereas 16-bit and 32-bit instructions on Bifrost involve separate instructions with dramatically differing opcodes and formats, Midgard has a one-size-fits-most approach which – despite its inherent limitations – is refreshing. Miscellaneous fixes were needed across the compiler; nevertheless, the simplified IR lived up to its design and is now able to support 16-bit operations.
The bulk of the code required for FP16 has now landed in upstream Mesa but is disabled by default pending further testing. Nevertheless, for the adventurous among you, you can set
PAN_MESA_DEBUG=fp16 on a recent build of master. Beware: here be dragons.
Stepping away from the compiler, an interesting improvement is the new handling of draws with colour masked out. A typical draw in OpenGL that does not use blending or colour masks might look like:
glColorMask(true, true, true, true); glDepthMask(true); glDrawArrays(GL_TRIANGLES, 0, 15);
Since blending is disabled and all colour channels (RGBA) are written simultaneously, this draw does not need to read from the colour buffer (tilebuffer). But what if the draw does not write to any colour channels?
glColorMask(false, false, false, false); glDepthMask(true); glDrawArrays(GL_TRIANGLES, 0, 15);
Naively, the GPU would need to read the previous colour and write it back immediately - but that’s wasteful. Instead, we can detect the case where no colour is written, and elide all access to the colour buffer, skipping both the read and the write.
Could we skip the draw entirely? If there are no side effects, we can, but applications typically mask out colour while also unmasking the depth buffer, which is independent of the colour computation. Midgard has a solution.
Even if depth/stencil updates are required, as long as the shader only computes colour with no side effects, there’s no reason to run the shader. While Bifrost does not appear to, Midgard allows the driver to specify a draw with no shader, saving not only colour buffer read/write but also shader execution.
In addition to our work on Midgard performance, community Panfrost hacker Icecream95 has been improving the Midgard stack nonstop.
Since our last blog post, they contributed a major bug fix for handling
discard instructions. For background, OpenGL conceptually first runs the fragment shader for each pixel on the screen and then performs depth testing. In practice, modern hardware attempts to perform depth tests before running the shader, known as “early-z” testing, in order to avoid needlessly executing the shader for occluded pixels.
However, games use
discard, an OpenGL directive allowing shaders to eliminate fragments, which can interfere with optimizations like early-z. The driver is responsible for detecting these situations, disabling these optimizations, and enabling standards-compliant fallback paths including “late-z” testing. After Icecream95 investigated issues with Panfrost’s handling of depth testing in the presence of
discard instructions, they were able to fix rendering bugs in many games including SuperTuxKart, OpenMW, and RVGL.
Some Panfrost (Mali T760) screenshots of games improved by Icecream95’s patches:
Hats off to a great community contributor!
One final area that we’ve been working on is exposing Mali’s performance counters to userspace in Panfrost, allowing us to identify bottlenecks in the driver and other developers to identify bottlenecks in their application running on Panfrost. For about a year, we have had experimental support for passing the raw counters from kernelspace. Collaborans Antonio Caggiano and Rohan Garg, in conjunction with Icecream95 and other contributors, have been working on integrating these counters with Perfetto to enable high-level analysis with an elegant, free software user interface.
In the past 3 months since we began work on Bifrost, fellow Collaboran Tomeu Vizoso and I have progressed from stubbing out the new compiler and command stream in March to running real programs by May. Driven by a reverse-engineering effort in tandem with the free software community, we are confident that against proprietary blobs and downstream hacks, open-source software will prevail.
Looking to the future, we plan to improve Bifrost’s coverage of OpenGL ES 2.0 to support more 3D games, now that the basic accelerated desktop is working. We also plan to improve Bifrost compiler performance, in order to approach the proprietary stack’s performance as we did for Midgard. Most of all, we’d like to build a community around the driver, with software freedom and an open first approach as core values.
It worked for Freedreno, Etnaviv, and Lima. It worked for Panfrost on Midgard. And I’m confident it will work again on Bifrost.
Our recent efforts on the Hantro kernel driver have resulted in the addition of H.264 decoding support and multiple performance improvements.…
Hwangsaeul, or H8L, a remote surveillance streaming solution, utilizes the capability of libsrt to collect statistics from open SRT sockets…
Complex, real-world correctness tests and performance analysis are now possible thanks to gltrim, a new tool recently added to apitrace,…
Earlier this week, WebRTC became an official W3C and IETF standard. GStreamer has a powerful and rapidly maturing WebRTC implementation.…
Last year, from June to September, I worked on the kernel development tool Coccinelle under Collabora. I implemented a performance boosting…
The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL…