Daniel Almeida
May 14, 2026
In the previous part of this series we looked at the interface offered by the kernel-mode driver (KMD) to the user-mode driver (UMD). The UMD, which is panvk in our case, implements the Vulkan-facing part of the stack and prepares work for the GPU. The KMD, which is Tyr/Panthor here, handles the privileged side: memory management, submission, synchronization, and the hardware-facing execution contexts. In particular, we saw that Tyr exposes the same API as Panthor, which lets panvk create GPU virtual address spaces, allocate buffer objects, create scheduling groups, and submit work to the device.
This is a good place to go one level deeper and discuss what those groups and submissions actually map to on Mali CSF hardware. The main component we need to discuss is the Microcontroller Unit, or MCU, an Arm Cortex-M7 core that runs the firmware supplied by Arm. The MCU sits between the host CPU and the GPU execution machinery, which is why CSF-based Mali GPUs look quite different from the older Mali Job Manager architecture.
When we say that CSF has hardware scheduling support, we should be a bit careful about what this means. The KMD still has a fair amount of work to do around submission and execution. It tracks dependencies, programs the VM and address-space state a group needs, and signals completions back to userspace. The scheduling step proper is the decision about which runnable groups get access to the finite hardware slots exposed by the firmware. The firmware, on the other hand, consumes command-stream ring buffers, tracks the state of the active command streams, and helps arbitrate access to the GPU-side execution resources.
At this point we must distinguish between two separate events: creating a queue and submitting work on it.
Queue creation sets up the objects that later submissions will use: Vulkan device/queue creation leads panvk to ask the KMD to create a Panthor/Tyr group and its queues. The KMD then returns a group handle and keeps the queue state attached to that group.
A later submission uses those existing objects: Vulkan queue submission leads panvk to submit work to an existing group and queue. The KMD can then bind a runnable group to a CSG slot, bind one of its queues to a CS slot, append work to the queue ring buffer, and ring a doorbell. The CSF firmware consumes that ring buffer and drives the GPU endpoints.
This article will focus on that path. It will not try to fully explain GPU memory management yet, but we will need a small amount of it to understand why groups carry a vm_id, and why getting the MCU to boot is not by itself enough to execute a Vulkan workload.
We will start from a small Vulkan application like vkcube, move from the VkQueue abstraction down into the groups and queues created by panvk, and then map those objects to the CSG and CS interfaces exposed by the firmware.
Let's begin at the Vulkan end of the stack. There, the main workflow revolves around describing work as a series of commands that get recorded into command buffers and later submitted to a queue. The application expects these commands to be executed by the GPU at a suitable time, and may request to be notified when the work is complete.
A VkQueue is the object an application uses to hand recorded command buffers to the implementation. The queue is not created at submission time. It is requested when the VkDevice is created, from a queue family that advertises the kind of work the application wants to run, such as graphics or compute.
An application like vkcube might retrieve a graphics queue like so:
float queue_priorities[1] = {0.0};
VkDeviceQueueCreateInfo queues[2];
queues[0].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queues[0].pNext = NULL;
queues[0].flags = 0;
queues[0].queueFamilyIndex = demo->graphics_queue_family_index;
queues[0].queueCount = 1;
queues[0].pQueuePriorities = queue_priorities;

VkDeviceCreateInfo device = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .queueCreateInfoCount = 1,
    .pQueueCreateInfos = queues,
    ...
};
err = vkCreateDevice(demo->gpu, &device, NULL, &demo->device);
...
vkGetDeviceQueue(demo->device, demo->graphics_queue_family_index, 0,
                 &demo->graphics_queue);
At this point, the application can begin recording command buffers. These command buffers are only a description of the work, e.g., begin a render pass, bind the pipeline and buffers that describe the cube, issue the draw commands, and eventually present the result. Submitting that recorded work to a queue is what turns it into something the Vulkan implementation must actually execute.
Once the application has a queue, submission is a separate operation. The application records commands into VkCommandBuffers, then calls something like vkQueueSubmit or vkQueueSubmit2 to append that work to an existing queue. The submission also carries synchronization information: what must be waited on before the work can run, and what should be signaled once it completes.
Again, vkcube shows this in a nicely compact way:
VkPipelineStageFlags pipe_stage_flags =
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info;
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.pNext = NULL;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &current_submission.image_acquired_semaphore;
submit_info.pWaitDstStageMask = &pipe_stage_flags;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &current_submission.cmd;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores =
    &current_swapchain_resource.draw_complete_semaphore;
err = vkQueueSubmit(demo->graphics_queue, 1, &submit_info,
                    current_submission.fence);
An implementation must then back the VkQueue with some hardware context that is capable of carrying out the work that was submitted to it. In our case, panvk implements the Vulkan queue by asking the KMD to create a scheduling group, which is the object that Tyr and Panthor use to back an execution context.
Let's proceed by inspecting the source code in panvk:
struct drm_panthor_queue_create qc[] = {
[PANVK_SUBQUEUE_VERTEX_TILER] = {
.priority = 1,
.ringbuf_size = 64 * 1024,
},
[PANVK_SUBQUEUE_FRAGMENT] = {
.priority = 1,
.ringbuf_size = 64 * 1024,
},
[PANVK_SUBQUEUE_COMPUTE] = {
.priority = 1,
.ringbuf_size = 64 * 1024,
},
};
struct drm_panthor_group_create gc = {
.compute_core_mask = phys_dev->compute_core_mask,
.fragment_core_mask = phys_dev->fragment_core_mask,
.tiler_core_mask = 1,
.max_compute_cores = util_bitcount64(phys_dev->compute_core_mask),
.max_fragment_cores = util_bitcount64(phys_dev->fragment_core_mask),
.max_tiler_cores = 1,
.priority = group_priority,
.queues = DRM_PANTHOR_OBJ_ARRAY(ARRAY_SIZE(qc), qc),
.vm_id = pan_kmod_vm_handle(dev->kmod.vm),
};
int ret = pan_kmod_ioctl(dev->drm_fd, DRM_IOCTL_PANTHOR_GROUP_CREATE, &gc);
There are two important things to notice here. First, the UMD does not create just one queue. It creates a group containing several queues, each with its own priority and ring-buffer size. Second, the group is tied to a vm_id, meaning that all submissions through the group's queues execute in that GPU virtual address space.
The corresponding UAPI structs make the same model explicit:
struct drm_panthor_queue_create {
    /** Priority of this queue within its group. */
    __u8 priority;
    __u8 pad[3];
    /** Size of the ring buffer backing this queue, in bytes. */
    __u32 ringbuf_size;
};
struct drm_panthor_group_create {
    /** Array of drm_panthor_queue_create entries. */
    struct drm_panthor_obj_array queues;
    __u8 max_compute_cores;
    __u8 max_fragment_cores;
    __u8 max_tiler_cores;
    /** Scheduling priority of the group. */
    __u8 priority;
    __u32 pad;
    __u64 compute_core_mask;
    __u64 fragment_core_mask;
    __u64 tiler_core_mask;
    /** VM in which all of the group's queues execute. */
    __u32 vm_id;
    /** Returned by the kernel on success. */
    __u32 group_handle;
};
Once this ioctl returns successfully, the UMD has a group_handle that it can use for future submissions. Internally, the KMD has created a Group object and a number of Queue objects. These are driver-side execution contexts; they are not the firmware-visible hardware slots themselves.
To understand how these objects relate to the hardware, we now need to look at the firmware-visible interfaces they are eventually bound to.
In CSF terminology, CSG means Command Stream Group. A CSG is the firmware-visible container for a set of command streams that share scheduling state and a GPU address space. This makes it a natural fit for the UMD-facing notion of a queue context, such as a VkQueue, even if the mapping is not a strict one-to-one correspondence in every API.
Inside a CSG we find one or more CS (Command Stream) interfaces. A Command Stream is the lower-level endpoint that actually consumes CSF instructions from a ring buffer. The UMD builds GPU command streams in memory, the KMD wraps those submissions into the format expected by the firmware queue, and the MCU then processes the ring buffer as the GPU resources become available.
This gives us two layers of scheduling to keep separate:
Panthor/Tyr Group -> firmware CSG slot
Panthor/Tyr Queue -> firmware CS slot inside that CSG
A CSG slot can only host one group at a time. If a process has a group that is not currently bound to a CSG slot, that group is idle from the hardware point of view, even if it has pending work. This is why Panthor still contains a scheduler even when the hardware has a firmware scheduler of its own. The firmware can only act on the groups and command streams currently visible through its slots, so the KMD must rotate access to those scarce active slots. In practice, that often means evicting work that is blocked or idle before work that is actively using the GPU.
That tells us why the scheduler exists. The next step is to see how actual work gets submitted to one of these queues.
Creating a group only creates the execution context. The actual work reaches the GPU through DRM_IOCTL_PANTHOR_GROUP_SUBMIT, which contains one or more queue submissions:
struct drm_panthor_queue_submit {
    /** Index of the queue inside the group. */
    __u32 queue_index;
    /** Size of the userspace command stream, in bytes. */
    __u32 stream_size;
    /** GPU address of the userspace command stream. */
    __u64 stream_addr;
    /** FLUSH_ID read when the command stream was built. */
    __u32 latest_flush;
    __u32 pad;
    /** Array of sync operations: waits and signals. */
    struct drm_panthor_obj_array syncs;
};
struct drm_panthor_group_submit {
    /** Handle returned by DRM_IOCTL_PANTHOR_GROUP_CREATE. */
    __u32 group_handle;
    __u32 pad;
    /** Array of drm_panthor_queue_submit entries. */
    struct drm_panthor_obj_array queue_submits;
};
The queue_index tells the KMD which queue inside the group should receive the work. The stream_addr and stream_size describe the userspace command stream that should be called from the kernel-managed command stream ring buffer. Synchronization is carried through the syncs array, which tells the KMD what must be waited on before the submission can run and what must be signaled once it completes.
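To make the shape of this ioctl concrete, here is a minimal userspace sketch. It is not code from panvk: the command stream address stream_va, its size, the group handle, and the syncobj handle are placeholders the caller is assumed to have set up beforehand, and latest_flush is left at zero for brevity.

#include <stdint.h>
#include <xf86drm.h>
#include <drm/panthor_drm.h>

/* Sketch: submit one userspace command stream on queue 0 of an existing
 * group, signaling a syncobj when it completes. */
int submit_stream(int fd, uint32_t group_handle, uint64_t stream_va,
                  uint32_t stream_size, uint32_t syncobj)
{
    struct drm_panthor_sync_op sync = {
        .flags = DRM_PANTHOR_SYNC_OP_SIGNAL |
                 DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ,
        .handle = syncobj,
    };
    struct drm_panthor_queue_submit qs = {
        .queue_index = 0,
        .stream_size = stream_size,
        .stream_addr = stream_va,
        /* .latest_flush: real code passes the FLUSH_ID read while
         * building the stream; 0 keeps the sketch simple. */
        .syncs = DRM_PANTHOR_OBJ_ARRAY(1, &sync),
    };
    struct drm_panthor_group_submit gs = {
        .group_handle = group_handle,
        .queue_submits = DRM_PANTHOR_OBJ_ARRAY(1, &qs),
    };

    return drmIoctl(fd, DRM_IOCTL_PANTHOR_GROUP_SUBMIT, &gs);
}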
An end-to-end submission therefore looks like this:

- The UMD builds a command stream in memory mapped into the group's VM.
- The UMD calls DRM_IOCTL_PANTHOR_GROUP_SUBMIT for a group and one of its queues.
- The KMD resolves the sync dependencies, binds the group to a CSG slot and the queue to a CS slot if they are not already resident, appends a call to the userspace stream into the queue's ring buffer, and rings a doorbell.
- The firmware consumes the ring buffer and drives the GPU endpoints.
- The KMD signals the syncs once the work completes, so userspace can observe completion.

This is where the split between KMD and firmware becomes most visible. The KMD still owns dependencies, isolation, and Linux synchronization objects. The KMD decides which groups become resident, while the firmware decides which resident groups get scheduled based on priority. The firmware is responsible for the low-level command stream protocol once the work has been made visible to an active CSG/CS pair.
Before any of the above can work, Tyr needs the MCU firmware to be running. The firmware is not just an optional helper. It is the other side of the CSF interface.
Booting it is somewhat straightforward. It consists of parsing the firmware binary, finding the relevant code and data sections, allocating backing memory for them, and mapping those sections at the addresses where the MCU expects to find them. Once the device is powered and the firmware entry point is known, Tyr can start the MCU and wait for the firmware to expose its shared interfaces.
Only after that point can the KMD discover the firmware-visible hierarchy. The root of that hierarchy is the Global interface. From there, the driver can discover how many CSG interfaces exist, what each CSG exposes, and how many CS interfaces live under each CSG.
In other words, firmware boot is not separate from scheduling. It is the prerequisite that makes scheduling through CSF possible at all.
Communication between the CPU and the MCU happens through shared memory regions defined by the firmware interface, together with the doorbells and interrupts each side uses to ask the other to react. The hierarchy starts with the Global interface and then leads to the CSG and CS interfaces described above.
Each interface is split into Control, Input, and Output regions. The naming is easy to misread, so it is worth being precise here.
The Control region is mostly firmware-provided discovery state. This is where the KMD learns things like firmware version, feature bits, input and output addresses, group counts, stream counts, strides, suspend buffer sizes, and similar properties of the interface. The KMD reads this state so it can program the rest of the interface correctly.
The Input region is the host-writable side of the protocol. This is where Tyr writes requests and configuration values that the firmware should act on. The Output region is the firmware-writable side, used for acknowledgments, status, progress information, and events that the CPU must eventually process.
The normal workflow is therefore:
- The KMD reads Control to discover the interface.
- The KMD writes Input fields and queue ring buffers.
- The KMD rings a doorbell.
- The MCU consumes the request.
- The MCU writes Output fields.
- The MCU raises an interrupt when the CPU must react.
This gives the CPU and the MCU a lockless communication channel where each side has a well-defined ownership rule. The exact fields differ between Global, CSG, and CS interfaces, but the pattern remains the same: discovery in Control, requests in Input, status and acknowledgments in Output.
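As a rough illustration of that ownership rule, consider the following sketch. The names are invented and this is not actual Tyr code; it only models the request/acknowledge pattern, where a request is in flight while the host-written bits differ from the firmware-written ones.

#include <stdint.h>

/* Hypothetical MMIO pointers; in the real driver they are derived from
 * addresses published in the Control region. */
static volatile uint32_t *glb_req;   /* Input region: host-writable */
static volatile uint32_t *glb_ack;   /* Output region: MCU-writable */
static volatile uint32_t *doorbell;  /* kicks the MCU */

/* Post a request by toggling its bits in the Input word, then ring the
 * doorbell so the MCU notices. */
static void post_request(uint32_t mask)
{
    *glb_req ^= mask;
    *doorbell = 1;
}

/* The MCU mirrors the bits into the Output word once it has handled the
 * request, so equality on the masked bits means "done". */
static int request_done(uint32_t mask)
{
    return ((*glb_req ^ *glb_ack) & mask) == 0;
}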
The vm_id in drm_panthor_group_create deserves a brief detour because it is one of those fields that can look incidental until the first fault happens.
Command streams contain GPU virtual addresses. A shader may load from a buffer, a command stream may reference another indirect command stream, and the firmware itself needs to execute in a well-defined GPU-side address space. When userspace submits a stream through a group, the addresses in that stream must be resolved in the VM associated with that group.
This is also an isolation boundary. One client should not be able to submit a command stream that reads another client's memory simply by guessing an address. The GPU MMU, programmed by the KMD, enforces that separation by translating GPU virtual addresses through the page tables belonging to the right VM.
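As a point of reference, the vm_id that groups carry comes from an earlier DRM_IOCTL_PANTHOR_VM_CREATE call, part of the interface covered in the previous article. Here is a minimal sketch, with a placeholder VA-range size; real drivers derive it from the device's reported MMU features.

#include <stdint.h>
#include <xf86drm.h>
#include <drm/panthor_drm.h>

/* Sketch: create the GPU VM that a group's vm_id will refer to. */
int create_vm(int fd, uint32_t *vm_id)
{
    struct drm_panthor_vm_create vc = {
        .user_va_range = 1ull << 40,  /* placeholder size */
    };
    int ret = drmIoctl(fd, DRM_IOCTL_PANTHOR_VM_CREATE, &vc);

    if (!ret)
        *vm_id = vc.id;  /* what drm_panthor_group_create.vm_id takes */
    return ret;
}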
The full GPU memory-management story deserves its own article, because it involves GEM buffer objects, VM binding, page tables, fault handling, and a fair amount of Linux DRM infrastructure.
At this point we have most of the execution path on the table: firmware boot, the shared interfaces, submission, and the VM the command streams run in. That is enough context to step back and look at what this means for Tyr's upstreaming path.
The CSF MCU underpins the job submission front end on newer Mali GPUs, but booting it is only the first step in the chain.
To execute real userspace work, the driver needs the following pieces to fit together:

- a booted MCU firmware whose Global, CSG, and CS interfaces have been discovered;
- a VM layer that gives each group a valid GPU virtual address space;
- groups and queues that can be bound to CSG and CS slots;
- a submission path that feeds userspace command streams into the queue ring buffers;
- event and completion handling that reports results back to userspace.
Missing any one of these leaves us with an incomplete path. For example, if the firmware boots but groups do not have valid VMs, submitted command streams will fault as soon as the GPU tries to dereference their addresses. If groups and queues exist but event handling is missing, the work may reach the hardware but userspace will not get reliable completion information back.
This is why the CSF pieces tend to appear in a fairly strict order in the driver. The firmware must boot before we can discover the interfaces, and the VM layer must exist before those interfaces can safely execute userspace command streams. The group, queue, submission, and event paths then turn those pieces into a usable userspace execution path.
The next part of this series will explain how Tyr manages GPU memory, and why that layer is required before the group and queue machinery described above can execute useful work.
In the meantime, come see Tyr in action at RustWeek in Utrecht! We'll be running a SuperTuxKart tournament powered by the driver. See you there!