June 26, 2019
In my last Panfrost blog post, I announced my internship goal: improve Panfrost to run GNOME3. GNOME is a popular Linux desktop making heavy use of OpenGL; to use GNOME with only free and open source software on a machine with Mali graphics, Panfrost is necessary.
Two months ahead of schedule, here I am, drafting this blog post from GNOME on my laptop running Panfrost!
Bring-up of GNOME required improving the driver's robustness and performance, focused on Mali's tiled architecture. Typically found in mobile devices, tiling GPU architectures divide the screen into many small tiles, like a kitchen floor, rendering each tile separately. This allows for unique optimizations but also poses unique challenges.
One natural question is: how big should tiles be? If the tiles are too big, there's no point to tiling, but if the tiles are too small, the GPU will repeat unnecessary work. Mali offers a hybrid answer: allow lots of different sizes! Mali's technique of "hierarchical tiling" allows the GPU to use tiles as small as 16x16 pixels all the way up to 2048x2048 pixels. This "sliding scale" allows different types of content to be optimized in different ways. The tiling needs of a 3D game like SuperTuxKart are different from those of a user interface like GNOME Shell, so this technique gets us the best of both worlds!
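To make the "sliding scale" concrete, here is a minimal sketch of how many tiles cover a framebuffer at each size. It assumes the hierarchy levels are the power-of-two sizes between the 16x16 and 2048x2048 bounds mentioned above; the function names are purely illustrative and are not taken from the driver.

```python
# Assumption: hierarchy levels are power-of-two tile sizes between the
# 16x16 and 2048x2048 bounds quoted in the text.
MIN_TILE = 16
MAX_TILE = 2048


def hierarchy_levels(min_tile=MIN_TILE, max_tile=MAX_TILE):
    """List candidate tile edge lengths, smallest first."""
    sizes = []
    size = min_tile
    while size <= max_tile:
        sizes.append(size)
        size *= 2
    return sizes


def tiles_per_level(width, height):
    """Map each tile size to the number of tiles needed to cover the framebuffer."""
    counts = {}
    for size in hierarchy_levels():
        across = -(-width // size)  # ceiling division
        down = -(-height // size)
        counts[size] = across * down
    return counts


if __name__ == "__main__":
    for size, count in tiles_per_level(1920, 1080).items():
        print(f"{size}x{size}: {count} tiles")
```

Under that assumption, a 1920x1080 framebuffer needs thousands of tiles at the smallest level but only one at the largest, which is the trade-off the driver's configuration has to balance.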
Although primarily handled in hardware, hierarchical tiling is configured by the driver; I researched this configuration mechanism in order to understand it and improve our configuration with respect to performance and memory usage.
Tiled architectures additionally present an optimization opportunity: if the driver can figure out a priori which 16x16 tiles will definitely not change, those tiles can be culled from rendering entirely, saving both read and write bandwidth. As a conceptual example, if the GPU composites your entire desktop while you're writing an email, there's no need to re-render your web browser in the other window, since that hasn't changed. I implemented an initial version of this optimization in Panfrost, accumulating the scissor state across draws within a frame, rendering only to the largest bounding box of the scissors. This optimization is particularly helpful for desktop composition, ideally improving performance on workloads like GNOME, Sway, and Weston.
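The scissor accumulation described above can be sketched roughly as follows. This is a simplified illustration rather than the driver's code: the rectangle format `(minx, miny, maxx, maxy)` and both function names are assumptions made for the example.

```python
TILE = 16  # smallest tile edge length mentioned in the text


def accumulate_scissors(scissors):
    """Return the bounding box (minx, miny, maxx, maxy) enclosing every
    scissor rectangle seen in the frame, or None if nothing was drawn."""
    if not scissors:
        return None
    minx = min(s[0] for s in scissors)
    miny = min(s[1] for s in scissors)
    maxx = max(s[2] for s in scissors)
    maxy = max(s[3] for s in scissors)
    return (minx, miny, maxx, maxy)


def align_to_tiles(box, tile=TILE):
    """Grow a bounding box outward so its edges land on tile boundaries."""
    minx, miny, maxx, maxy = box
    return (
        minx // tile * tile,        # round down
        miny // tile * tile,
        -(-maxx // tile) * tile,    # round up
        -(-maxy // tile) * tile,
    )


if __name__ == "__main__":
    draws = [(10, 20, 100, 50), (90, 5, 300, 40)]
    box = accumulate_scissors(draws)
    print(box, align_to_tiles(box))
```

Tiles that fall entirely outside the aligned box are never touched by any draw in the frame, so they are the candidates for culling.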
…Of course, theory aside, mostly what GNOME needed was a good, old-fashioned bugfixing spree, because the answer is always under your nose. Turns out what really broke the desktop was a trivial bug in the viewport specification code. Alas.
Looking forward to sophisticated workloads as this open driver matures, I researched job "scoreboarding". For some background, the Mali hardware divides a frame into many small "jobs". For instance, a "vertex job" executes a vertex shader; a "tiler job" executes tiling (sorting geometry into tiles at varying hierarchy levels). Many of these jobs have to execute in a specific order; for instance, geometry has to be output by a vertex job before a tiler job can read that geometry. Previously, these relationships were hard-coded into the driver, which was okay for simple workloads but did not scale well.
I have since replaced this code with an elegant dependency management system, based on the hardware's scoreboarding. Instead of hard-coding relationships, the driver can now specify high level dependencies, and a generic algorithm (based on topological sorting) works out the order of submission and scoreboard flags necessary to actualize the given requirements. The new scoreboarding implementation has enabled new features, like rasterizer discard, to be implemented with ease.
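As a rough illustration of the idea, here is a small topological sort (Kahn's algorithm) that turns declared dependencies into a submission order. It is a sketch under assumptions, not the driver's implementation: the job names, the dictionary format, and the `submission_order` function are all hypothetical, and the hardware scoreboard flags are left out entirely.

```python
from collections import deque


def submission_order(deps):
    """Order jobs so that every job comes after the jobs it depends on.

    `deps` maps a job name to the list of jobs it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    jobs = set(deps)
    for reqs in deps.values():
        jobs.update(reqs)

    # Number of unmet dependencies per job, and the reverse edges.
    pending = {job: len(set(deps.get(job, ()))) for job in jobs}
    dependents = {job: [] for job in jobs}
    for job, reqs in deps.items():
        for req in set(reqs):
            dependents[req].append(job)

    # Start with jobs that depend on nothing; sort for a stable order.
    ready = deque(sorted(job for job, n in pending.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for dependent in sorted(dependents[job]):
            pending[dependent] -= 1
            if pending[dependent] == 0:
                ready.append(dependent)

    if len(order) != len(jobs):
        raise ValueError("dependency cycle")
    return order


if __name__ == "__main__":
    # Hypothetical frame with two draws: each tiler job needs its vertex
    # job, and the second tiler job also waits for the first.
    deps = {
        "vertex0": [],
        "tiler0": ["vertex0"],
        "vertex1": [],
        "tiler1": ["vertex1", "tiler0"],
    }
    print(submission_order(deps))
```

The appeal of this shape is that adding a new kind of job only means declaring what it depends on; the ordering falls out of the generic algorithm.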
With these improvements and more, several new features have landed in the driver, fixing hundreds of failing dEQP tests since my last blog post, bringing us nearer to conformance on OpenGL ES 2.0 and beyond.