April 23, 2020
In Panfrost’s infancy, community members Connor Abbott and Lyude Paul largely reverse-engineered Bifrost and built a proof-of-concept shader dis/assembler. Meanwhile, I focused on the Midgard architecture (Mali T600+), building an OpenGL driver alongside developers like Collaboran Tomeu Vizoso.
As Midgard support has grown – including initial GLES3 support – we have now turned our attention to building a Bifrost driver. We at Collabora got to work in late February, with Tomeu porting the Panfrost command stream, while I built up a new Bifrost compiler.
This week, we've reached our first major milestone: the first 3D renders on Bifrost, including basic texture support!
The interface to a modern GPU has two components, the fixed-function command stream and the programmable instruction set architecture. The command stream controls the hardware, dispatching shaders and containing the state required by OpenGL or Vulkan. By contrast, the instruction set encodes the shaders themselves, as with any programmable architecture. Thus the GPU driver contains two major components, generating the command stream and compiling programs respectively.
From Midgard to Bifrost, there have been few changes to the command stream. After all, both architectures feature approximately the same OpenGL and Vulkan capabilities, and the fixed-function hardware has not required much driver-visible optimization. The largest changes involve the interfaces between the shaders and the command stream, including the titular shader descriptors. Indeed, squinting command stream traces from Midgard and Bifrost look similar – but the long tail of minor updates implies a nontrivial Panfrost port.
But the Bifrost instruction set, on the other hand? A rather different story.
Let’s dive in.
Bifrost’s instruction set was redesigned completely from Midgard’s, requiring us to build a free software compiler targeting Bifrost from scratch. Midgard’s architecture is characterized as:
Vector. 128-bit arithmetic logic unit (ALU) allows 32-bit 4-channel SIMD.
Very Long Instruction Word. 5 blocks – scalar/vector add/multiply and special function unit – operate with partial parallelism across 2 pipeline stages.
Vector and VLIW architectures promise high-performance in theory, and in theory, theory and practice are the same. In practice, these architectures are extremely difficult to compile efficiently for. Automatic vectorization is difficult; if a shader uses a 3-channel 32-bit vector (
vec3), most likely the extra channel will go unused, wasting resources. Likewise, scheduling for VLIW architectures is difficult and slots often go unused, again wasting resources and preventing shaders from reaching peak architectural performance.
Bifrost by contrast is:
Scalarish. 32-bit ALU. 32-bit operations are purely scalar, 16-bit/2-channel and 8-bit/4-channel SIMD. 64-bit operation requires a special half-performance mode.
Slightly Long Instruction Word. Bifrost has 2 blocks – fused multiply-add and add – pipelined without parallelism. Simplified special function unit.
This redesign promises better performance - and a redesign of Panfrost’s compiler, too.
At the heart of a modern compiler is one or more Intermediate Representations (IRs). In Mesa, OpenGL shaders are parsed into GLSL IR – a tree IR expressing language-visible types – which is converted to NIR – a flat static single assignment IR enabling powerful optimizations. Backends convert NIR to a custom backend IR, whose design seals the fate of a compiler. A poor IR design can impede the growth and harm the performance of the entire compiler, while a well-designed IR enables complex features to be implemented naturally with simple, fast code. There is no one-size-fits-all IR; the design necessarily is built to make certain compiler passes (like algebraic optimization) simple at the expense of others (like register allocation), justifying the use of multiple IRs within a compiler. Further, IR design permeates every aspect of the compiler, so IR changes to a mature compiler are difficult. Both Intel and AMD GPU compilers have required ground-up rewrites to fix IR issues, so I was eager to get the Bifrost IR (BIR) right to begin.
An IR is simply a set of data structures representing a program. Typically backends use a “flat” IR: a list of blocks, where each block contains a list of instructions. But how do we store an instruction?
We could reuse the hardware’s instruction format as-is, with abstract variables instead of registers. It’s tempting, and for simple architectures, it can work. The initial NIR-to-BIR pass is a bit harder than with abstract instructions, but final code generation is trivial, since the packing is already done.
Unfortunately, real world instructions sets are irregular and as quirky as their compilers’ developers. Further, they are tightly packed to be small instead of simple. For final assembly, we will always need to pack the machine formats, but with this design, we also need to unpack them. Worse, the machine irregularities will permeate every aspect of the compiler since they are now embedded into the IR. On Bifrost, for example, most operations have multiple unrelated encodings; this design would force much of the compiler to be duplicated.
So what if the IR is entirely machine-independent, compiling in the abstract and converting to the machine-specific form at the very end. Such IRs are helpful; in Mesa, the machine-independent NIR facilitates sharing of powerful optimizations across drivers. Unfortunately, some design traits really do affect our compiler. Is there a compromise?
Notice the first IR simplifies packing at the expense of the rest of the compiler, whereas the second simplifies NIR-to-BIR at the expense of machine-specific passes. All designs trade complexity in one part of the compiler for another. Hence a good IR simplifies the hardest compiler passes at the expense of the easiest. For us, scheduling and register allocation are NP-complete problems requiring complex heuristics, whereas NIR-to-BIR and packing are linear translations with straightforward rules. Ideally, our IR simplifies scheduling and register allocation, pushing the complexity into NIR-to-BIR and packing. For Bifrost, this yields one golden rule:
A single BIR instruction corresponds to a single Bifrost instruction.
While single instructions may move during scheduling and be rewired from register allocation, the operation is unaffected. On the other hand, within an instruction:
BIR instructions are purely abstract without packing.
By delaying packing, we simplify manipulation. So NIR-to-BIR is complicated by the one-to-many mapping of NIR to Bifrost opcodes with special functions; meanwhile, final code generation is complicated by the pattern matching required to infer packing from abstract instructions. But by pushing this complexity to the edges, the optimizations in between are greatly simplified.
But speaking of IR mistakes, there is one other common issue…
One oversight I made in designing the Midgard IR – an oversight shared by the IR of many other compilers in Mesa – is often assuming instructions to operate on 32-bit data only. In OpenGL with older Mesa versions, this assumption was true as the default float and integer types are 32-bit. However, the assumption is problematic for OpenCL where 8-, 16-, and 64-bit types are first class. Even for OpenGL, it is suboptimal. While the specification mandates minimum precision requirement for operation, fulfillable with 32-bit arithmetic, on shaders using
mediump precision qualifiers we may use 16-bit arithmetic instead. About a month ago, Mesa landed support for optimizing
mediump fragment shaders to use 16-bit arithmetic, so for Bifrost, we want to make sure we can take advantage of these optimizations.
The benefit of reduced precision is two-fold. First, shader computations need to be stored in registers, but space in the register file is scarce, so we need to conserve it. If a shader runs out of registers, it spills to main memory, which is slow, so by using 16-bit types instead of 32-bit, we can reduce spilling. Second, although Bifrost is scalar for 32-bit, it is 2-channel vector for 16-bit. As mentioned, automatic vectorization is difficult, but if shaders are vectorized to begin, the compiler can take advantage of this.
As an example, to add 32-bit vector
R4-R7, we need code like:
ADD.f32 R0, R0, R4 ADD.f32 R1, R1, R5 ADD.f32 R2, R2, R6 ADD.f32 R3, R3, R7
But in 16-bit mode with vectors
ADD.v2f16 R0, R0, R2 ADD.v2f16 R1, R1, R3
Notice both register usage and instruction count are halved. How do we support this in Mesa? Mesa pipes 16-bitness through NIR into our backend, so we must ensure types are respected across our backend compiler. To do so, we include types explicitly in our backend intermediate representation, which the NIR-to-BIR pass simply passes through from NIR. Certain backend optimizations have to be careful to respect these types, whereas others work with little change provided the IR is well-designed. Scheduling is mostly unaffected. But where are there major changes?
Enter register allocation.
A fundamental problem every backend compiler faces is register allocation, mapping abstract IR variables to concrete machine registers:
0 = load uniform #0 1 = load attribute #0 2 = add 0, 1 3 = mul 2, 2
R0 = load uniform #0 R1 = load attribute #0 R0 = add R0, R1 R0 = mul R0, R0
Traditionally, register allocation is modeled by a graph, where each IR variable is represented by a node. Any variables that are simultaneously live, in the sense that both of their values will be read later in the program, are said to interfere since they require separate registers to avoid clobbering each other’s data. Interference is represented by edges. Then the problem reduces to graph colouring, finding colours (registers) for each node (variable) that don’t coincide with the colours (registers) of any nodes connected by edges (interfering variables). Initially, Panfrost used a graph colouring approach.
However, the algorithm above is difficult to adapt to irregular vector architectures. One alternative approach is to model the register file explicitly, modeling modeling interference as constraints and using a constraint solver to allocate registers. For the above program, letting ki denote the register allocated to variable i, there is a single constraint on the registers k0 ≠ k1, since (0, 1) is the only pair of interfering nodes, yielding a optimal valid solution k0 = 0, k1 = 1, k2 = 0, k3 = 0, corresponding to the allocation above.
As-is, this approach is mathematically equivalent to graph colouring. However, unlike graph colouring, it extends easily to model vectorization, enabling per-component liveness tracking, so some components of a vector are allocated to a register while others are reused for another value. It also extends easily to vectors of varying type sizes, crucial for 16-bit support, whereas the corresponding generalization for graph colouring is much more complicated.
This work was originally conducted in October for the Midgard compiler, but the approach works just as well for Bifrost. Although Bifrost is conceptually “almost” scalar, there are enough corner cases where we dip into vector structures that a vector-aware register allocator is useful. In particular, 16-bit instructions involve subdivides 32-bit registers, and vectorized input/output requires contiguous registers; both are easily modeled with linear-style constraints.
The final stage of a backend compiler is packing (final code generation), taking the scheduled, register allocated IR and assembling a final binary. Compared to Midgard, packing for Bifrost is incredibly complicated. Why?
Vectorized programs for vector architectures can be smaller than equivalent programs for scalar architectures. The above instruction sequences to add a 4-channel vector corresponds to just a single instruction on Midgard:
VADD.FADD R0.xyzw, R0.xyzw, R1.xyzw
We would like to minimize program size, since accesses to main memory are slow and increasing the size of the instruction cache is expensive. By minimizing size, smaller caches can be used with better efficiency. Unfortunately, naively scalarizing the architecture by a factor of 4 would appear to inflate program size by 4x, requiring a 4x larger cache for the same performance.
We can do better than simply duplicating instructions. First, by simplifying the vector semantics (since we know most code will be scalar or small contiguous vectors), we eliminate vector write masks and swizzles. But this may not be good enough.
Bifrost goes beyond to save instruction bits in any way possible, since small savings in instruction encoding accumulate quickly in complex shaders. For example, Bifrost separates medium-latency register file accesses from low latency “port” accesses, loading registers into ports ahead of the instruction. There are 64 registers (32-bits each), requiring 6-bits to index a register number. The structure to specify which registers should be loaded looks like:
unsigned port_0 : 5; unsigned port_1 : 6;
We have 6 bits to encode port 1, but only 5 bits for port 0. Does that mean port 0 can only load registers R0-R31, instead of the full range?
Actually, no - if port 0 is smaller than port 1, the port numbers are as you would expect. But Bifrost has a trick: if port 0 is larger than port 1, then the hardware subtracts 63 from both ports to get the actual port number. In effect, the ordering of the ports is used as an implicit extra bit, storing 12-bits of port data in 11-bits on the wire. (What if the port numbers are equal? Then just reuse the same port!)
Similar tricks permeate the architecture, a win for code size but a loss for compiler packing complexity. The good news is that our compiler’s packing code is self-contained and unit tested independent of the rest of the compiler.
Putting it all together, we have the beginnings of a Bifrost compiler, sufficient for the screenshots above. Next will be adding support for more complex instructions and scheduling to support more complex shaders.
Architecturally, Bifrost turns Midgard on its head, ideally bringing performance improvements but rather complicating the driver. While Bifrost support in Panfrost is still early, it’s progressing rapidly. The compiler code required for the screenshots above is all upstreamed, and the command stream code is working its way up. So stay tuned, and happy hacking.
The Hantro Video4Linux2 (V4L2) kernel module has gained support for another SoC! The Microchip SAMA5D4 features a single decode unit supporting…
DKMS is a framework that is mostly used to build and install external kernel modules. It can also be used to install a specific patch to…
Building GTK 4 as a Meson subproject for your own application is not only useful for Windows builds, but also for many Linux distributions…
Recently, we have been using Perfetto to successfully profile Apitrace traces in crosvm through VirGL renderer. We have now added perfetto…
As part of a project backed by INVEST-AI, a program managed by IVADO Labs, we have developed a multi-stage neural network-based solution…
Did you know you could run a permissively-licensed MTP implementation with minimal dependencies on an embedded device? Here's a step-by-step…