Boris Brezillon
February 23, 2023
As fate would have it, a new DRM driver for recent Mali GPUs was submitted earlier this month. This is a bit of an oddity in the DRM subsystem world, where support for new hardware is usually added to GPU drivers supporting previous hardware generations. So let's have a look at why this was done differently this time, and the challenges to come to get this new driver merged.
Version 10 of the Mali architecture (second iteration of Valhall GPUs) introduced a major change in how jobs and their parameters are passed to the GPU. Arm replaced the Job Manager (JM) block with a Command Stream Frontend (CSF). As the new name implies, CSF hardware introduces a command-stream-based solution to update the pipeline state and submit GPU jobs, thus avoiding re-allocation of relatively big pipeline state structures when only a single parameter of the pipeline changes between two job submissions. Nothing really new here: other GPU vendors have been using this command-stream-based submission pattern for years now, and Arm is just catching up.
We won't give a detailed overview of how CSF works, but it is worth noting that the CSF frontend has a dedicated instruction set, and a bunch of registers to pass data around or keep internal states. There are instructions to submit jobs (compute, tiling and fragment jobs), and others to do more trivial stuff, like read/write memory, wait for job completion, wait for fences, jump/branch... That alone means we have to provide CSF-specific handling in Mesa to deal with the command stream emission and submission. If CSF were just about moving away from a descriptor-based job submission approach, we could get away with a minimal amount of kernel changes and squash CSF support into the existing kernel driver.
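To make the difference with descriptor-based submission more concrete, here is a minimal sketch of a toy command stream builder. Everything in it is hypothetical: the opcodes, the instruction layout, the register count and the `toy_` names are invented for illustration and are not the real CSF instruction set, nor what Mesa does. It only shows the general idea that state lives in frontend registers, so a second job that changes one parameter costs one extra instruction rather than a whole new state structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS   16
#define TOY_MAX_INSTRS 64

/* Hypothetical opcodes, NOT the real CSF instruction set. */
enum toy_opcode {
    TOY_OP_SET_REG = 1,     /* write an immediate into a frontend register */
    TOY_OP_RUN_COMPUTE = 2, /* launch a job using the current register state */
};

struct toy_instr {
    uint8_t opcode;
    uint8_t reg;
    uint32_t value;
};

struct toy_cmd_stream {
    struct toy_instr instrs[TOY_MAX_INSTRS];
    size_t count;
    uint32_t shadow[TOY_NUM_REGS]; /* last value emitted for each register */
    uint16_t shadow_valid;         /* bit n is set once register n was emitted */
};

/* Append one instruction; returns -1 when the stream is full. */
int toy_emit(struct toy_cmd_stream *cs, struct toy_instr instr)
{
    assert(cs != NULL);
    if (cs->count >= TOY_MAX_INSTRS)
        return -1;
    cs->instrs[cs->count++] = instr;
    return 0;
}

/* Emit a register write only when the value actually changes. */
int toy_set_reg(struct toy_cmd_stream *cs, uint8_t reg, uint32_t value)
{
    if (reg >= TOY_NUM_REGS)
        return -1;
    if ((cs->shadow_valid & (1u << reg)) && cs->shadow[reg] == value)
        return 0; /* register already holds this value, nothing to emit */
    cs->shadow[reg] = value;
    cs->shadow_valid |= (uint16_t)(1u << reg);
    return toy_emit(cs, (struct toy_instr){ TOY_OP_SET_REG, reg, value });
}

/* Program the first n state registers, then launch a job. */
int toy_dispatch(struct toy_cmd_stream *cs, const uint32_t *state, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (toy_set_reg(cs, (uint8_t)i, state[i]))
            return -1;
    }
    return toy_emit(cs, (struct toy_instr){ TOY_OP_RUN_COMPUTE, 0, 0 });
}
```

With a four-register state, the first `toy_dispatch()` call emits five instructions, and a second call with a single modified value emits only two (one register write and the job launch).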
But here comes the second major change brought by CSF hardware: firmware-assisted scheduling. The GPU not only embeds its unified shader cores (used to execute shader code) and the Command Execution Unit (the block processing the CSF instructions), it also has a Cortex-M7 microcontroller in front, which is there to do some high-level queue scheduling. Before we get into that, let's take a step back, and explain how job scheduling is done in the Panfrost driver.
Panfrost uses the drm_sched framework to deal with job scheduling. This framework is based on the concept of hardware queues (represented by drm_gpu_scheduler), which process jobs in order and have a predefined number of job slots available. These hardware queues are fed by a software scheduler taking jobs from higher-level scheduling entities represented by drm_sched_entity. To keep things simple, let's assume these scheduling entities are backing VkQueue or GL context objects, which end up being passed render/compute jobs to execute [1].
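The hardware queue model described above can be sketched as a small userspace toy. This is not the drm_sched API: `toy_hw_queue` merely stands in for the role of drm_gpu_scheduler and `toy_entity` for drm_sched_entity, all names are hypothetical, and the round-robin pick is an assumption made for the example. Dependencies, priorities and timeouts are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_JOBS 8

/* Stand-in for a scheduling entity (say, one per VkQueue or GL context). */
struct toy_entity {
    int jobs[TOY_MAX_JOBS]; /* FIFO of pending job ids */
    size_t head, tail;
};

/* Stand-in for a hardware queue: in-order, with a fixed number of slots. */
struct toy_hw_queue {
    int slots[TOY_MAX_JOBS]; /* ring of in-flight job ids */
    size_t num_slots;        /* job slots exposed by the hardware */
    size_t head, in_flight;
    size_t next_entity;      /* round-robin cursor */
};

/* Queue a job on an entity; returns -1 when the entity FIFO is full. */
int toy_entity_push(struct toy_entity *e, int job)
{
    if (e->tail - e->head >= TOY_MAX_JOBS)
        return -1;
    e->jobs[e->tail++ % TOY_MAX_JOBS] = job;
    return 0;
}

/* Fill free hardware slots from the entities; returns the number of jobs submitted. */
size_t toy_sched_run(struct toy_hw_queue *q, struct toy_entity *ents, size_t n)
{
    size_t submitted = 0, idle = 0;

    assert(q->num_slots <= TOY_MAX_JOBS);
    /* Stop when the slots are full or every entity was found empty in a row. */
    while (q->in_flight < q->num_slots && idle < n) {
        struct toy_entity *e = &ents[q->next_entity];

        q->next_entity = (q->next_entity + 1) % n;
        if (e->head == e->tail) {
            idle++;
            continue;
        }
        idle = 0;
        q->slots[(q->head + q->in_flight) % q->num_slots] =
            e->jobs[e->head++ % TOY_MAX_JOBS];
        q->in_flight++;
        submitted++;
    }
    return submitted;
}

/* The hardware retires the oldest in-flight job; returns its id, or -1 if idle. */
int toy_hw_complete(struct toy_hw_queue *q)
{
    int job;

    if (!q->in_flight)
        return -1;
    job = q->slots[q->head];
    q->head = (q->head + 1) % q->num_slots;
    q->in_flight--;
    return job;
}
```

The important property is that the software side decides, job by job, what goes into the hardware slots next. That is exactly the part that moves to the firmware in the model described below.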
Modern GPUs have been moving away from the hardware queue model, where operations are submitted to hardware queues at the job granularity, towards firmware-assisted scheduling. In this new model, an intermediate microcontroller takes high-level queue objects containing a stream of instructions to execute (the job submissions being encoded in the command stream) and schedules these high-level queues. The following diagram describes the Mali CSF scheduling model, but other GPU vendors have pretty similar scheduling schemes, with different naming for their scheduling entities, and probably different ways of passing those scheduling entities around.
We initially tried to re-use drm_sched, but quickly realized it would be challenging to reconcile the hardware queue and firmware-assisted scheduling schemes. Eventually we gave up on this idea and went for our own scheduler implementation, duplicating the drm_sched job dependency tracking logic in our scheduler code. This change alone made us reconsider the viability of having the CSF and JM backends implemented in the same driver. But there is still quite a bit of common code to be shared, even if the scheduling logic is diverging: MMU handling, driver initialization boilerplate, device frequency scaling, power management, and probably other stuff I forgot about.
On a side note, Intel has been working on making drm_sched ready for the firmware-assisted scheduling case, so we will likely go back to a drm_sched-based implementation, thus reducing the friction between the CSF and JM scheduling logic. But there are still two crucial reasons we would rather have a separate driver:
The first aspect is pretty obvious, but let's go over the second one and try to detail what a Vulkan-friendly uAPI looks like, and how it differs from the Panfrost uAPI.
For those who are unfamiliar with graphics APIs, it is worth recalling that Vulkan is all about giving control back to the user by making a lot of the graphics pipeline management explicit, whereas OpenGL was trying to hide things from its users to make their life easier. We won't go over the pros and cons of each API here, but this design decision has an impact on the uAPI needed to have a performant Vulkan driver. We will detail some of these implications here.
When a Vulkan command buffer is submitted, the fence and semaphore objects passed to vkQueueSubmit are waited on before the queued work begins, or signaled after it finishes. That means the waits on buffer object idleness that were required when dealing with GL-like submissions can go away. The only places where implicit fencing is still needed are the Window System Integration layers. Luckily, this has been recently addressed with the addition of two dma-buf ioctls, allowing one to import a sync-file into a dma-buf, or export all fences attached to a dma-buf to a sync-file. With these new ioctls, we can reconcile the implicit and explicit fencing worlds, and allow kernel drivers to be explicit-synchronization centric (no code to deal with the implicit synchronization case).
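As an illustration, here is a hedged sketch of thin userspace wrappers around those two ioctls, as exposed by `<linux/dma-buf.h>` on recent kernels (`DMA_BUF_IOCTL_EXPORT_SYNC_FILE` and `DMA_BUF_IOCTL_IMPORT_SYNC_FILE`). The wrapper names are ours, and the `#ifdef` fallback is an assumption made so the sketch still builds against older headers, in which case it simply reports `ENOSYS`.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Export the fences currently attached to a dma-buf as a sync-file.
 * flags is DMA_BUF_SYNC_READ, DMA_BUF_SYNC_WRITE or DMA_BUF_SYNC_RW and
 * describes the access the caller is about to make.
 * Returns the new sync-file fd, or -1 with errno set.
 */
int dmabuf_export_sync_file(int dmabuf_fd, unsigned int flags)
{
    /* Sketch-level sanity check: at least one access direction is expected. */
    assert(flags & DMA_BUF_SYNC_RW);
#ifdef DMA_BUF_IOCTL_EXPORT_SYNC_FILE
    struct dma_buf_export_sync_file arg = { .flags = flags, .fd = -1 };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
        return -1;
    return arg.fd;
#else
    /* Installed headers predate these ioctls. */
    (void)dmabuf_fd;
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Attach the fence carried by a sync-file to a dma-buf, so that
 * implicit-sync users of the buffer wait on it.
 * Returns 0 on success, or -1 with errno set.
 */
int dmabuf_import_sync_file(int dmabuf_fd, int sync_file_fd, unsigned int flags)
{
    assert(flags & DMA_BUF_SYNC_RW);
#ifdef DMA_BUF_IOCTL_IMPORT_SYNC_FILE
    struct dma_buf_import_sync_file arg = { .flags = flags, .fd = sync_file_fd };

    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg) < 0 ? -1 : 0;
#else
    (void)dmabuf_fd;
    (void)sync_file_fd;
    errno = ENOSYS;
    return -1;
#endif
}
```

A WSI layer would typically call the export helper on a buffer it just acquired, turn the resulting sync-file into something the driver can wait on explicitly, and call the import helper with the "rendering done" sync-file before handing the buffer back to a consumer relying on implicit synchronization.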
With explicit synchronization, we get rid of the step that was iterating over all buffer objects passed to a submit ioctl to extract the implicit fences to wait on, and to add the job-done implicit fence back to these buffer objects so other users can wait on buffer idleness. At this point, however, we would still have to pass this list of buffer objects to make sure the GPU mappings of these buffers are preserved while the GPU is potentially accessing them.
Again, Vulkan is pretty explicit about object lifecycles and when things can and can't be freed. One such case is VkDeviceMemory objects and the bind/unbind operations that are used to attach memory to a VkImage or VkBuffer object. The user is responsible for keeping the memory objects live in the GPU virtual address space while jobs are still in flight. This in turn means we don't need to pass all buffer objects the GPU jobs are accessing when we submit a batch.
This new paradigm requires quite a few changes compared to what the Panfrost uAPI provides. In Panfrost, GPU virtual address mappings were implicitly created at buffer object creation time. We now want to add explicit VM_{MAP,UNMAP} ioctls to allow creating these mappings explicitly. While we're at it, and since other drivers already allow it, we can provide extra ioctls to create/destroy virtual address spaces (VM instances), so a single DRM file descriptor can deal with multiple independent contexts. And if we go further and envision support for sparse memory binding, we also need a way to queue binding/unbinding operations to a VkQueue. This generally implies adding some sort of VM_BIND ioctl that provides asynchronous/queue-based VM_{MAP,UNMAP} operations.
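The following toy model illustrates the difference between synchronous VM_{MAP,UNMAP} operations and queued, VM_BIND-style ones. It is only a sketch under our own assumptions: the `toy_` structures and functions are hypothetical and do not describe the actual PanCSF uAPI, and `toy_bind_queue_flush()` stands in for the point where a real driver would execute queued operations once their dependencies are met.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPPINGS 16
#define TOY_MAX_OPS      16

/* One GPU virtual address range backed by a buffer object handle. */
struct toy_mapping {
    uint64_t va, size;
    uint32_t bo_handle;
};

/* A VM instance, i.e. one independent GPU virtual address space. */
struct toy_vm {
    struct toy_mapping maps[TOY_MAX_MAPPINGS];
    size_t num_maps;
};

enum toy_bind_type { TOY_BIND_MAP, TOY_BIND_UNMAP };

struct toy_bind_op {
    enum toy_bind_type type;
    struct toy_mapping m;
};

/* Queue of pending bind operations, applied in submission order. */
struct toy_bind_queue {
    struct toy_bind_op ops[TOY_MAX_OPS];
    size_t count;
};

/* Synchronous map: fails if the range overlaps an existing mapping. */
int toy_vm_map(struct toy_vm *vm, uint64_t va, uint64_t size, uint32_t bo)
{
    if (!size || vm->num_maps >= TOY_MAX_MAPPINGS)
        return -1;
    for (size_t i = 0; i < vm->num_maps; i++) {
        const struct toy_mapping *m = &vm->maps[i];

        if (va < m->va + m->size && m->va < va + size)
            return -1;
    }
    vm->maps[vm->num_maps++] = (struct toy_mapping){ va, size, bo };
    return 0;
}

/* Synchronous unmap: removes the mapping matching exactly [va, va + size). */
int toy_vm_unmap(struct toy_vm *vm, uint64_t va, uint64_t size)
{
    for (size_t i = 0; i < vm->num_maps; i++) {
        if (vm->maps[i].va == va && vm->maps[i].size == size) {
            vm->maps[i] = vm->maps[--vm->num_maps];
            return 0;
        }
    }
    return -1;
}

/* Record an operation without touching the VM yet. */
int toy_bind_queue_push(struct toy_bind_queue *q, struct toy_bind_op op)
{
    if (q->count >= TOY_MAX_OPS)
        return -1;
    q->ops[q->count++] = op;
    return 0;
}

/* Apply queued operations in order; returns how many succeeded before the first failure. */
size_t toy_bind_queue_flush(struct toy_bind_queue *q, struct toy_vm *vm)
{
    size_t done = 0;

    assert(q != NULL && vm != NULL);
    for (; done < q->count; done++) {
        const struct toy_bind_op *op = &q->ops[done];
        int ret = op->type == TOY_BIND_MAP
            ? toy_vm_map(vm, op->m.va, op->m.size, op->m.bo_handle)
            : toy_vm_unmap(vm, op->m.va, op->m.size);

        if (ret)
            break;
    }
    q->count = 0;
    return done;
}
```

Note that pushing operations to the queue leaves the address space untouched until the flush happens, which is the property sparse binding relies on: binds are ordered with respect to the other work submitted to the same queue.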
This is quite a major shift in how we deal with the GPU virtual address space; retrofitting that in the Panfrost driver would be both painful and error-prone. So, we just lost one common denominator between Panfrost and the new driver: the MMU/GPU-VA-management logic.
To sum up, we have a completely new uAPI (almost nothing shared with the old one), a new scheduling logic, and a new MMU/GPU-VA-management logic. This leaves us with some driver initialization boilerplate, the device frequency scaling implementation, and the power management code, which is likely to differ too, because some of the power management is now done by the firmware. So, the only sane decision here was to fork Panfrost and make PanCSF an independent driver. We might end up sharing some code at some point if it makes sense, but it sounds a bit premature to try to do that now.
The first thing to note is that this RFC, while being at least partly functional (only tested on basic GLES2 workloads so far), is far from being ready. There are things we need to address, like trying to use drm_sched instead of implementing our own timesharing-based scheduler, having a proper buffer object eviction mechanism to gracefully handle situations where the system is under memory pressure (and implementing the VM fencing mechanism that goes with it, so we don't end up with GPU faults when such evictions happen), and of course, making sure we are robust to all kinds of failures. It also lacks support for power management, device frequency scaling, and probably other useful features like performance counters, but those should be relatively straightforward to implement compared to the scheduling and memory management logic.
At any rate, this is still an important step in our attempt at having a fully upstream open-source graphics stack for Mali CSF GPUs. And with this RFC being posted early, we hope to get the discussion started and sort out some important implementation details before we get too far and risk a major rewrite of the code when others start reviewing what we have done.
Note that the Mesa changes needed to support CSF hardware and interface with this PanCSF driver should be posted soon, so stay tuned!
Special thanks to Faith Ekstrand, Alyssa Rosenzweig, Daniel Stone, and Daniel Vetter for supporting/advising me when I was working on the various iterations of this driver.
1. In practice, the graphics API queue object might require more than one scheduling entity.