June 11, 2021
Panfrost, the open source driver for Arm Mali, now supports OpenGL ES 3.1 on both Midgard (Mali T760 and newer) and Bifrost (Mali G31, G52, G72) GPUs. OpenGL ES 3.1 adds a number of features on top of OpenGL ES 3.0, notably including compute shaders. While Panfrost has had limited support for compute shaders on Midgard for use in TensorFlow Lite, the latest work extends the support to more GPUs and adds complementary features required by the OpenGL ES 3.1 specification, like indirect draws and no-attachment framebuffers.
The new feature support represents the cumulative effort of multiple Collaborans -- Boris Brezillon, Italo Nicola, and myself -- in tandem with the wider Mesa community. The OpenGL driver has seen over 1000 commits since the beginning of 2021, including several hundred targeting OpenGL ES 3.1 features. Our focus is Mali G52, where we are passing essentially all drawElements Quality Program and Khronos conformance tests and are aiming to become formally conformant. Nevertheless, thanks to a unified driver, many new features on Bifrost trickle down to Midgard, allowing the older architecture, which is still in wide use, to improve long after the vendor has dropped support. On Mali T860, we are passing about 99.5% of the tests required for OpenGL ES 3.1 conformance. That number can only grow, thanks to Mesa's continuous integration running these tests for every merge request and preventing Panfrost regressions. With a Vulkan driver in the works, Panfrost's API support is looking good.
Since the last Panfrost update, we've added an instruction scheduler to the Bifrost compiler. To understand the motivation, recall the hardware design. The Bifrost instruction set pairs instructions into "tuples", one using the multipliers and the other using the adders. Up to 8 tuples are grouped into a "clause", a sequence of instructions with fixed latency that can execute back-to-back with no pipeline bubbles in the middle. The benefit to the hardware designers was that Bifrost's pipeline could be statically filled by the compiler, rather than adding logic in the hardware to dynamically dispatch instructions to different units like a superscalar chip. Unfortunately, that means the compiler becomes significantly more complicated, as it has to group instructions itself while satisfying dozens of architectural invariants. If any of these invariants is violated, the GPU will fault with an Invalid Instruction Encoding exception and abort execution. In Panfrost, we've approached this by formally modeling the constraints to produce a predicate (a function returning a boolean) for whether a given instruction may be scheduled in a given position in the program. Then it's simple enough to schedule greedily, choosing among the instructions that pass this predicate according to some selection heuristic. Wait, selection heuristic?
An algorithm is "greedy" if it makes a locally optimal choice at every step. On the surface, it seems like that strategy would produce a globally optimal result. In special cases, this is true and the best algorithms to solve a problem are greedy. Unfortunately, greedy algorithms produce suboptimal results on many other problems, sometimes spectacularly so. Instruction scheduling is one of those cases: when the predicate shows two different instructions can be scheduled next, which one should be picked? It's not enough to always pick the first or pick one at random; while both strategies are locally optimal, both have poor global performance. We need a heuristic to choose the best instruction from the set of candidate instructions at each point, taking into account how it will affect our ability to schedule in the future. Coming up with good heuristics is tricky, and we have a great deal of room to grow in Panfrost, but the basic model is serving us well so far.
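To make the predicate-plus-heuristic idea concrete, here is a minimal sketch in Python. It is a toy model, not Panfrost's scheduler: the names `Instr`, `can_schedule`, `critical_path`, `pick_first` and `pick_critical` are hypothetical, the only constraints modeled are "right unit" and "inputs come from earlier tuples", and the heuristic (longest remaining dependency chain) is just one common choice, not necessarily the one the driver uses. Real Bifrost has many more invariants and groups tuples into clauses, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # "fma" or "add" (toy assumption: one unit per instruction)
    deps: tuple = ()   # names of instructions whose results this one reads


def can_schedule(instr, unit, done):
    """Predicate: may `instr` occupy the `unit` slot of the next tuple?

    `done` holds names scheduled in earlier tuples only, so in this toy
    model an instruction cannot read a result produced in its own tuple.
    """
    return instr.unit == unit and all(d in done for d in instr.deps)


def critical_path(instr, users):
    """Heuristic score: length of the longest dependency chain from `instr`."""
    return 1 + max((critical_path(u, users) for u in users[instr.name]), default=0)


def pick_first(candidates, users):
    """Naive selection: take whichever candidate comes first in program order."""
    return candidates[0]


def pick_critical(candidates, users):
    """Heuristic selection: prefer the candidate that unblocks the longest chain."""
    return max(candidates, key=lambda i: critical_path(i, users))


def schedule(instrs, pick):
    """Greedily fill (fma, add) tuples, using `pick` to choose among candidates."""
    users = {i.name: [u for u in instrs if i.name in u.deps] for i in instrs}
    pending, done, tuples = list(instrs), set(), []
    while pending:
        current = []
        for unit in ("fma", "add"):
            candidates = [i for i in pending if can_schedule(i, unit, done)]
            if candidates:
                chosen = pick(candidates, users)
                current.append(chosen)
                pending.remove(chosen)
        if not current:
            raise ValueError("nothing schedulable (dependency cycle?)")
        done.update(i.name for i in current)
        tuples.append([i.name for i in current])
    return tuples


# "a" is independent; "b" -> "c" -> "d" is a dependency chain.
program = [
    Instr("a", "fma"),
    Instr("b", "fma"),
    Instr("c", "add", ("b",)),
    Instr("d", "fma", ("c",)),
]

naive = schedule(program, pick_first)
tuned = schedule(program, pick_critical)
print(len(naive), naive)
print(len(tuned), tuned)
```

Both strategies only ever pick instructions the predicate allows, so both produce valid schedules. The difference is that picking the first candidate spends the first tuple on the independent instruction and delays the whole chain, while the critical-path heuristic starts the chain immediately and fits the independent instruction into a later tuple alongside it.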
Another large change to the driver since our last blog post was the addition of dirty tracking, a common graphics driver optimization with a twist for Mali. Typical GPUs are stateful, and the driver emits commands to set pieces of graphics state -- for instance, setting uniforms -- before each draw. Mali GPUs present an unusual stateless interface. Instead of commands, the driver prepares descriptors containing large amounts of graphics state bundled together, and each draw has pointers to the different descriptors. In a sense, typical GPUs are programmed with an OpenGL-like state machine, whereas Mali is programmed with Vulkan-like pipelines.
There is conventional graphics wisdom that "state changes are expensive", so graphics programmers try to minimize the number of API calls they make. Drivers can help minimize state changes as well, by tracking which state is "dirty" (modified) and which state is "clean" (unmodified). Then the driver only has to emit commands for the dirty state, reducing its own CPU overhead as well as reducing work for the GPU to process.
Surprisingly, the same idea generalizes to Mali. It is inefficient for the driver to upload every Mali descriptor for every application draw call. Ideally, we could reuse the same descriptor for subsequent draw calls if we know the state hasn't changed. Dirty tracking lets us know exactly when the state has changed, allowing us to upload new descriptors only when required and to reuse the existing descriptor otherwise. On the surface, the purpose is simply reducing CPU overhead, since the GPU's programming model is stateless and therefore must redo work anyway. However, the GPU has several layers of caches, so reusing these descriptors can enable the GPU to use cached descriptors as opposed to invalidating its descriptor caches after every draw call. Implementing dirty tracking in Panfrost improved draws per second in one synthetic benchmark by about 400%. Not bad for a week's work.
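As a rough illustration of descriptor reuse driven by dirty flags, here is a small Python sketch. Everything in it is hypothetical: the `Dirty` flag names, the `Context` class, and the fake "addresses" returned by `_upload` are stand-ins chosen for the example, not Panfrost's actual data structures or state groups.

```python
from enum import Flag


class Dirty(Flag):
    # Hypothetical state groups, one descriptor each.
    UNIFORMS = 1
    TEXTURES = 2
    BLEND = 4


KINDS = (Dirty.UNIFORMS, Dirty.TEXTURES, Dirty.BLEND)


class Context:
    def __init__(self):
        self.state = {kind: None for kind in KINDS}
        # Nothing has been uploaded yet, so everything starts out dirty.
        self.dirty = Dirty.UNIFORMS | Dirty.TEXTURES | Dirty.BLEND
        self.descriptors = {}   # kind -> fake GPU address of the live descriptor
        self.uploads = 0        # counts descriptor uploads, for illustration
        self._next_addr = 0x1000

    def set_state(self, kind, value):
        """Record new state; mark it dirty only if it actually changed."""
        if self.state[kind] != value:
            self.state[kind] = value
            self.dirty |= kind

    def _upload(self, kind):
        """Stand-in for packing a descriptor and copying it to GPU memory."""
        self.uploads += 1
        self._next_addr += 0x40
        return self._next_addr

    def draw(self):
        """Emit a draw: re-upload dirty descriptors, reuse clean ones."""
        for kind in KINDS:
            if kind in self.dirty:
                self.descriptors[kind] = self._upload(kind)
        self.dirty = Dirty(0)
        # The draw carries a pointer to each descriptor.
        return dict(self.descriptors)


ctx = Context()
first = ctx.draw()                        # uploads all three descriptors
second = ctx.draw()                       # nothing changed: pure reuse
ctx.set_state(Dirty.UNIFORMS, (1.0, 0.0))
third = ctx.draw()                        # only the uniform descriptor is new
print(ctx.uploads)
```

In this sketch, the second draw points at exactly the same descriptors as the first, and the third draw shares its texture and blend descriptors with the earlier draws. That pointer reuse is what gives the GPU's descriptor caches a chance to hit.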
In the short term, we're aiming to polish the OpenGL ES 3.1 support in time for the Mesa 21.2 release next month. Next stop after that: Bifrost performance improvements and introducing support for the modern Valhall (Mali G77 and newer) architecture family.