PanCSF: A new DRM driver for Mali CSF-based GPUs

Boris Brezillon
February 23, 2023

As fate would have it, a new DRM driver for recent Mali GPUs was submitted earlier this month. This is a bit of an oddity in the DRM subsystem world, where support for new hardware is usually added to GPU drivers supporting previous hardware generations. So let's have a look at why this was done differently this time, and the challenges to come to get this new driver merged.

Job Manager vs Command Stream Frontend

Version 10 of the Mali architecture (second iteration of Valhall GPUs) introduced a major change in how jobs and their parameters are passed to the GPU: Arm replaced the Job Manager (JM) block with a Command Stream Frontend (CSF). As the new name implies, CSF hardware introduces a command-stream-based solution to update the pipeline state and submit GPU jobs, thus avoiding re-allocation of relatively big pipeline state structures when only a single parameter of the pipeline changes between two job submissions. Nothing really new here: other GPU vendors have been using this command-stream-based submission pattern for years, and Arm is just catching up.
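
To make the difference concrete, here is a toy Python model contrasting the two submission styles. It is only a sketch: the state fields, the `SET` and `RUN_JOB` opcodes, and the function names are all hypothetical, and none of it reflects the real Mali descriptor layout or CSF instruction encoding.

```python
from dataclasses import dataclass, replace

# Hypothetical pipeline state; real hardware descriptors are far larger.
@dataclass(frozen=True)
class PipelineState:
    shader: int
    viewport: tuple
    blend_mode: str

def descriptor_submissions(states):
    """JM-style: every job carries a full copy of the pipeline state."""
    return [("JOB", s) for s in states]

def command_stream(states):
    """CSF-style: emit only the fields that changed since the previous job."""
    stream, current = [], {}
    for s in states:
        for field, value in vars(s).items():
            if current.get(field) != value:
                stream.append(("SET", field, value))
                current[field] = value
        stream.append(("RUN_JOB",))
    return stream

base = PipelineState(shader=0x1000, viewport=(0, 0, 1920, 1080), blend_mode="opaque")
states = [base, replace(base, blend_mode="alpha")]

jobs = descriptor_submissions(states)   # two full state copies
stream = command_stream(states)         # second job only re-emits blend_mode
```

In the descriptor model the second job needs a whole new state structure, whereas the command stream only carries one extra `SET` before the second `RUN_JOB`.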

We won't give a detailed overview of how CSF works, but it is worth noting that the CSF has a dedicated instruction set, and a bunch of registers to pass data around or keep internal state. There are instructions to submit jobs (compute, tiling and fragment jobs), and others to do more trivial stuff, like reading/writing memory, waiting for job completion, waiting for fences, jumping/branching... That alone means we have to provide CSF-specific handling in Mesa to deal with command stream emission and submission. If CSF were just about moving away from a descriptor-based job submission approach, we could get away with a minimal amount of kernel changes and squash CSF support into the existing kernel driver.

But here comes the second major change brought by CSF hardware: firmware-assisted scheduling. The GPU not only embeds its unified shader cores (used to execute shader code) and the Command Execution Unit (the block processing the CSF instructions), it also has a Cortex-M7 microcontroller in front, whose role is to do some high-level queue scheduling. Before we get into that, let's take a step back and explain how job scheduling is done in the Panfrost driver.

1. Hardware queue scheduling model

Panfrost uses the drm_sched framework to deal with job scheduling. This framework is based on the concept of hardware queues (represented by drm_gpu_scheduler), which process jobs in order and have a predefined number of job slots available. These hardware queues are fed by a software scheduler taking jobs from higher-level scheduling entities represented by drm_sched_entity. To keep things simple, let's assume these scheduling entities are backing VkQueue or GL context objects, which end up being passed render/compute jobs to execute [1].
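
The model described above can be sketched as a small Python simulation. This is a toy, not the drm_sched API: the `Entity` and `HwQueue` classes merely stand in for drm_sched_entity and drm_gpu_scheduler, and the round-robin pick between entities is an assumption made for illustration.

```python
from collections import deque

class Entity:
    """Stand-in for drm_sched_entity: a FIFO of jobs from one context."""
    def __init__(self, name):
        self.name = name
        self.jobs = deque()

    def push(self, job):
        self.jobs.append(job)

class HwQueue:
    """Stand-in for drm_gpu_scheduler: in-order queue with a fixed slot count."""
    def __init__(self, slots):
        self.slots = slots
        self.in_flight = deque()
        self.entities = deque()
        self.completed = []

    def add_entity(self, entity):
        self.entities.append(entity)

    def schedule(self):
        """Fill free slots, visiting entities round-robin."""
        idle_visits = 0
        while len(self.in_flight) < self.slots and idle_visits < len(self.entities):
            entity = self.entities[0]
            self.entities.rotate(-1)
            if entity.jobs:
                self.in_flight.append((entity.name, entity.jobs.popleft()))
                idle_visits = 0
            else:
                idle_visits += 1

    def complete_one(self):
        """The hardware finishes jobs in submission order, freeing a slot."""
        self.completed.append(self.in_flight.popleft())
        self.schedule()

a, b = Entity("a"), Entity("b")
a.push("a0"); a.push("a1"); b.push("b0")

hwq = HwQueue(slots=2)
hwq.add_entity(a); hwq.add_entity(b)
hwq.schedule()                      # only two of the three jobs fit
in_flight_at_start = list(hwq.in_flight)
for _ in range(3):
    hwq.complete_one()
```

The point to notice is that the kernel-side scheduler sees and places every single job: the hardware queue itself has no notion of contexts.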

2. Firmware-assisted scheduling model

In the hardware queue model, operations are submitted to hardware queues at job granularity. Modern GPUs have been moving away from that, towards firmware-assisted scheduling. In this new model, an intermediate microcontroller takes high-level queue objects containing a stream of instructions to execute (the job submissions being encoded in the command stream) and schedules these high-level queues. The following diagram describes the Mali CSF scheduling model, but other GPU vendors have pretty similar scheduling schemes, with different naming for their scheduling entities, and probably different ways of passing those scheduling entities around.
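
For comparison with the hardware queue model, here is an equally simplified Python sketch of the firmware-assisted one. Everything in it is hypothetical: the instruction names are made up, and the fixed time slice is just one possible policy, not a description of what the Mali firmware actually does.

```python
from collections import deque

class CommandQueue:
    """High-level queue object: a stream of instructions, jobs included."""
    def __init__(self, name, instructions):
        self.name = name
        self.stream = deque(instructions)

class Firmware:
    """Toy microcontroller: it owns the scheduling decisions between queues."""
    def __init__(self, timeslice):
        self.timeslice = timeslice   # instructions executed per turn
        self.runnable = deque()
        self.trace = []

    def bind_queue(self, queue):
        # The kernel driver hands over whole queues, not individual jobs.
        self.runnable.append(queue)

    def run(self):
        while self.runnable:
            queue = self.runnable.popleft()
            for _ in range(min(self.timeslice, len(queue.stream))):
                self.trace.append((queue.name, queue.stream.popleft()))
            if queue.stream:
                self.runnable.append(queue)   # not done: wait for another turn

fw = Firmware(timeslice=2)
fw.bind_queue(CommandQueue("gfx", ["SET_STATE", "RUN_FRAGMENT", "WAIT"]))
fw.bind_queue(CommandQueue("compute", ["RUN_COMPUTE"]))
fw.run()
```

Here the kernel side never looks inside the streams: it only decides which queues the firmware gets to see.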

3. PanCSF Scheduler implementation

We initially tried to re-use drm_sched, but quickly realized it would be challenging to reconcile the hardware queue and firmware-assisted scheduling schemes. Eventually we gave up on this idea and went for our own scheduler implementation, duplicating the drm_sched job dependency tracking logic in our scheduler code. This change alone made us reconsider the viability of having the CSF and JM backends implemented in the same driver. But there is still quite a bit of common code to be shared, even if the scheduling logic is diverging: MMU handling, driver initialization boilerplate, device frequency scaling, power management, and probably other stuff I forgot about.
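
To give an idea of what job dependency tracking means here, the following Python sketch holds a job back until all the fences it depends on have signaled. It is a generic illustration with hypothetical `Fence` and `Job` classes, not the drm_sched or PanCSF code.

```python
class Fence:
    """Toy fence: runs registered callbacks once, when signaled."""
    def __init__(self):
        self.signaled = False
        self.callbacks = []

    def add_callback(self, callback):
        if self.signaled:
            callback()
        else:
            self.callbacks.append(callback)

    def signal(self):
        self.signaled = True
        callbacks, self.callbacks = self.callbacks, []
        for callback in callbacks:
            callback()

class Job:
    """A job becomes ready once every dependency fence has signaled."""
    def __init__(self, name, deps, on_ready):
        self.name = name
        self.done = Fence()          # signaled when the job itself completes
        self.pending = len(deps)
        self.on_ready = on_ready
        if not deps:
            on_ready(self)
        for dep in deps:
            dep.add_callback(self._dep_signaled)

    def _dep_signaled(self):
        self.pending -= 1
        if self.pending == 0:
            self.on_ready(self)

ready = []
a = Job("a", [], ready.append)
b = Job("b", [a.done], ready.append)
c = Job("c", [a.done, b.done], ready.append)

ready_initially = [job.name for job in ready]
a.done.signal()                      # unblocks b, c still waits on b
ready_after_a = [job.name for job in ready]
b.done.signal()                      # unblocks c
ready_after_b = [job.name for job in ready]
```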

On a side note, Intel has been working on making drm_sched ready for the firmware-assisted scheduling case, so we will likely go back to a drm_sched-based implementation, thus reducing the friction between the CSF and JM scheduling logic. But there are still two crucial reasons why we would rather have a separate driver:

  1. Any change to the Panfrost driver is a potential source of regressions.
  2. We want the new driver to be Vulkan-friendly, and the Panfrost uAPI was clearly not designed with Vulkan in mind.

The first aspect is pretty obvious, but let's go over the second one and try to detail what a Vulkan-friendly uAPI looks like, and how it differs from the Panfrost uAPI.

The 'perfect' uAPI for a Vulkan driver

For those who are unfamiliar with graphics APIs, it is worth recalling that Vulkan is all about giving control back to the user by making a lot of the graphics pipeline management explicit, whereas OpenGL was trying to hide things from its users to make their lives easier. We won't go over the pros and cons of each API here, but this design decision has an impact on the uAPI needed to have a performant Vulkan driver. We will detail some of these uAPI requirements here.

1. Getting rid of (almost) all implicit fences

When a Vulkan command buffer is submitted, the fence and semaphore objects passed to vkQueueSubmit are waited on before the queued work begins, or signaled after it finishes. That means the waits on buffer object idleness that were required when dealing with GL-like submissions can go away. The only places where implicit fencing is still needed are the Window System Integration layers. Luckily, this has recently been addressed with the addition of two dma-buf ioctls, allowing one to import a sync-file into a dma-buf, or export all fences attached to a dma-buf to a sync-file. With these new ioctls, we can reconcile the implicit and explicit fencing worlds, and allow kernel drivers to be explicit-synchronization centric (no code to deal with the implicit synchronization case).
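
The import/export round trip can be pictured with the toy Python model below. The class and method names are hypothetical and only mimic the idea behind the two dma-buf ioctls; the real interface works on file descriptors and takes read/write flags, which this sketch ignores.

```python
class Fence:
    """Toy fence: signaled once the GPU work it tracks has finished."""
    def __init__(self, label):
        self.label = label
        self.signaled = False

class SyncFile:
    """Explicit-sync handle wrapping a snapshot of fences."""
    def __init__(self, fences):
        self.fences = list(fences)

    def is_signaled(self):
        return all(fence.signaled for fence in self.fences)

class DmaBuf:
    """Shared buffer carrying implicit fences, as seen by WSI consumers."""
    def __init__(self):
        self.implicit_fences = []

    def import_sync_file(self, sync_file):
        # Explicit -> implicit: attach the fences to the buffer.
        self.implicit_fences.extend(sync_file.fences)

    def export_sync_file(self):
        # Implicit -> explicit: snapshot the fences currently attached.
        return SyncFile(self.implicit_fences)

# An explicit-sync driver renders into a window-system buffer...
render_done = Fence("render")
wsi_buffer = DmaBuf()
wsi_buffer.import_sync_file(SyncFile([render_done]))

# ...and a consumer turns the implicit fences back into something to wait on.
acquire = wsi_buffer.export_sync_file()
ready_before = acquire.is_signaled()
render_done.signaled = True
ready_after = acquire.is_signaled()
```

With this split, only the window-system boundary touches implicit fences, and the submission path itself can stay purely explicit.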

2. Letting the userspace driver control its GPU virtual address space

With explicit synchronization, we get rid of the step that iterated over all buffer objects passed to a submit ioctl to extract the implicit fences to wait on, and added the job-done implicit fence back to these buffer objects so other users could wait on buffer idleness. We still have this list of buffer objects to pass, though, in order to make sure the GPU mappings of these buffers are preserved while the GPU is potentially accessing them.

Again, Vulkan is pretty explicit about object lifecycles and when things can and can't be freed. One such case is VkDeviceMemory objects and the bind/unbind operations that are used to attach memory to a VkImage or VkBuffer object. That means the user is responsible for keeping the memory objects live in the GPU virtual address space while jobs are still in flight. This in turn means we don't need to pass all buffer objects the GPU jobs are accessing when we submit a batch.

This new paradigm requires quite a few changes compared to what the Panfrost uAPI provides. In Panfrost, GPU virtual address mappings were implicitly created at buffer object creation time. We now want to add explicit VM_{MAP,UNMAP} ioctls to allow creating these mappings explicitly. And while we're at it, we can provide extra ioctls to create/destroy virtual address spaces (VM instances), as other drivers already do, so a single DRM file descriptor can deal with multiple independent contexts. And if we go further and envision support for sparse memory binding, we also need a way to queue binding/unbinding operations to a VkQueue. This generally implies adding some sort of VM_BIND ioctl that provides asynchronous/queue-based VM_{MAP,UNMAP} operations.
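
As a rough illustration of what such a uAPI has to offer, here is a toy Python model of explicit address-space management. The `VM` class and its method names are hypothetical (they are not the PanCSF ioctls), and the overlap check is an assumption about how a sane implementation would behave.

```python
class VM:
    """Toy GPU virtual address space (one VM instance)."""
    def __init__(self):
        self.mappings = {}    # start VA -> (size, buffer object)
        self.pending = []     # queued VM_BIND-style operations

    def vm_map(self, va, size, bo):
        for start, (length, _) in self.mappings.items():
            if va < start + length and start < va + size:
                raise ValueError("VA range overlaps an existing mapping")
        self.mappings[va] = (size, bo)

    def vm_unmap(self, va):
        del self.mappings[va]

    def vm_bind_async(self, op, *args):
        # Queued, not applied: models bind operations ordered on a queue.
        self.pending.append((op, args))

    def process_pending(self):
        ops = {"map": self.vm_map, "unmap": self.vm_unmap}
        while self.pending:
            op, args = self.pending.pop(0)
            ops[op](*args)

vm_a, vm_b = VM(), VM()
vm_a.vm_map(0x1000, 0x1000, "bo0")
vm_b.vm_map(0x1000, 0x1000, "bo1")   # same VA, different address space

vm_a.vm_bind_async("map", 0x4000, 0x2000, "sparse_bo")
vm_a.vm_bind_async("unmap", 0x1000)
queued_only = 0x4000 not in vm_a.mappings
vm_a.process_pending()
```

Userspace picks the addresses and decides when mappings come and go, and two VM instances never see each other's mappings.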

This is quite a major shift in how we deal with the GPU virtual address space; retrofitting that into the Panfrost driver would be both painful and error-prone. So, we just lost one more common denominator between Panfrost and the new driver: the MMU/GPU-VA-management logic.

The wise choice: a new driver

To sum up, we have a completely new uAPI (almost nothing shared with the old one), a new scheduling logic, and a new MMU/GPU-VA-management logic. This leaves us with some driver initialization boilerplate, the device frequency scaling implementation, and the power management code, which is likely to differ too, because some of the power management is now done by the firmware. So, the only sane decision here was to fork Panfrost and make PanCSF an independent driver. We might end up sharing some code at some point if it makes sense, but it sounds a bit premature to try to do that now.

What's next?

The first thing to note is that this RFC, while being at least partly functional (only tested on basic GLES2 workloads so far), is far from being ready. There are things we need to address, like trying to use drm_sched instead of implementing our own timesharing-based scheduler, having a proper buffer object eviction mechanism to gracefully handle situations where the system is under memory pressure (and implementing the VM fencing mechanism that goes with it, so we don't end up with GPU faults when such evictions happen), and of course, making sure we are robust to all kinds of failures. It also lacks support for power management, device frequency scaling, and probably other useful features like performance counters, but those should be relatively straightforward to implement compared to the scheduling and memory management logic.

At any rate, that is still an important step in our attempt at having a fully upstream open-source graphics stack for Mali CSF GPUs. And with this RFC being posted early, we hope to get the discussion started and sort out some important implementation details before we get too far and risk a major rewrite of the code when others start reviewing what we have done.

Note that the Mesa changes needed to support CSF hardware and interface with this PanCSF driver should be posted soon, so stay tuned!

Special thanks to Faith Ekstrand, Alyssa Rosenzweig, Daniel Stone, and Daniel Vetter for supporting/advising me when I was working on the various iterations of this driver.

[1] In practice, the graphics API queue object might require more than one scheduling entity.


Comments (7)

  1. Nikos:
    Feb 23, 2023 at 10:11 PM

    Thank you for this work, it sounds very promising.

  2. Googulator:
    Mar 02, 2023 at 11:37 AM

    Any timeline for the Mesa counterpart?

    1. bbrezillon:
      Mar 07, 2023 at 10:49 AM

      We pushed it here [1] a few hours ago, and here [2] is a branch containing the latest kernel driver version. Please keep in mind that this is still work-in-progress, so don't expect a stable or performant driver.


  3. Fredrum:
    Mar 03, 2023 at 09:00 PM

    Would this driver help improve general GLES performance on Mali-G610 using panfrost drivers?
    Currently I understand that Panfrost GLES2 is only running at 25-40% of capacity, and I also read someone mention that some sort of scheduling was part of that problem.
    Would this improve that situation, or does it have nothing to do with it?

    1. bbrezillon:
      Mar 07, 2023 at 12:00 PM

      We haven't benchmarked this driver yet, so I'm not sure where you get these numbers from. We do intend to work on the performance aspect further down the road, but that's not our main priority right now.

      1. Fredrum:
        Mar 07, 2023 at 05:06 PM

        The estimated numbers were not about your driver, just Panfrost in general on Mali-G610 (rk3588s).
        They were based on GL benchmark scores, vendor blob driver vs Panfrost.

        1. bbrezillon:
          Mar 07, 2023 at 05:17 PM

          I'm pretty sure there's a confusion between the official mesa project [1] and panfork [2] (which is a fork of mesa with Mali-G610 support on top). We don't support Mali-G610 in mesa yet.


Collabora Ltd © 2005-2023. All rights reserved.