
Re-thinking framebuffers in PanVK

Faith Ekstrand
March 23, 2026

One of the improvements I've been working on lately for the Panfrost driver stack in Mesa (for Arm's Mali GPUs) is a brand new framebuffer abstraction for PanVK. The previous abstraction was lifted out of the OpenGL driver and re-used by PanVK (the Vulkan driver). This was an effective way of getting the Vulkan driver off the ground, but it has started to become limiting. In order to implement things like efficient MSAA resolves, we needed a more flexible abstraction with fewer OpenGL assumptions baked in.

A brief introduction to tiled rendering

Before we discuss the details of the new framebuffer abstraction in PanVK, it's worth spending a few paragraphs on the tiled rendering strategy employed by most mobile GPUs and how it differs from the immediate-mode rendering on desktop GPUs.

Most desktop GPUs, such as those from Intel, Nvidia, and AMD, use what we call immediate-mode rendering. In an immediate-mode renderer, triangles are rasterized more or less one at a time and written directly to the framebuffer. For color blending and depth/stencil testing, the framebuffer values are read, the blend or depth test is done as necessary, and the new value is written back out. Obviously, all this reading and writing of color and depth/stencil values is very memory intensive. In order to make this fast enough, desktop GPUs typically depend on fast memory (GDDR or HBM), large render caches, and various forms of lossless image compression.

In the mobile/embedded space, memory is typically fairly slow and we don't have the die area or the power budget for massive caches. This means we need a new strategy. The strategy employed by most mobile GPUs is called tiled rendering. With a tiled renderer, the framebuffer is divided into small (e.g., 32x32) tiles and the entire render pass (everything between vkCmdBeginRendering() and vkCmdEndRendering()) is rasterized per-tile. This allows us to replace the large render cache with a much smaller tile memory that only needs to be big enough to hold the render targets for a single tile. On Mali GPUs (and likely others), the size of the tile memory is fixed and we adjust the tile size based on the number and pixel formats of the render targets and the number of samples to ensure that all of the color and depth/stencil data for a single tile will fit in tile memory.
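To make the fixed tile memory concrete, here is a minimal sketch of how a driver might pick a tile size. The 32 KiB budget, the starting size, and the halving strategy are illustrative assumptions for this post, not Mali's actual numbers or algorithm.

```c
#include <assert.h>

/* Hypothetical fixed on-chip tile memory budget (illustrative only). */
#define TILE_MEM_BYTES (32 * 1024)

/* Shrink a square tile until all render target data for one tile fits
 * in tile memory. bytes_per_pixel is the sum over all color and
 * depth/stencil attachments; samples multiplies every pixel's cost. */
static unsigned
pick_tile_size(unsigned bytes_per_pixel, unsigned samples)
{
   unsigned tile = 32; /* start at, e.g., 32x32 */
   while (tile > 4 &&
          tile * tile * bytes_per_pixel * samples > TILE_MEM_BYTES)
      tile /= 2;
   return tile;
}
```

With one RGBA8 color target plus D24S8 (8 bytes per pixel), a single-sampled tile fits at 32x32, while 8x MSAA forces the sketch down to 16x16 tiles, which matches the intuition that more samples or more attachments mean smaller tiles.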

In order to avoid doing too much duplicated work, there is an additional stage that sits between geometry and rasterization called binning. The binner takes the triangles that come out of the geometry pipeline and bins them based on which tiles they touch. This way, during rasterization, each tile only processes the triangles that actually intersect it. Depending on the implementation, the pre-binning geometry pipeline may only generate positions, leaving the rest of the geometry work to happen during rasterization.

The tile-based rendering design has a very different set of performance trade-offs from conventional immediate-mode renderers. Because the tile memory is so small, it's able to live on-chip, very close to the shader cores, and is extremely fast, often almost as fast as register access. Even though it requires duplicating some of the geometry work, the improvement in fragment access speed more than makes up for that when memory bandwidth is at a premium. The downside is that performance on tile-based renderers is highly dependent on keeping as much in tile memory as possible for as long as possible. On an immediate-mode renderer, switching render targets hurts your cache utilization, but there's generally enough memory bandwidth that you can get away with it. On a tile-based renderer, however, it's critical that you have a few well-defined render passes with nothing in the middle that might cause the driver to split the render pass. The tile-based design is all about avoiding memory traffic, and anything that forces us to spill everything out to main memory and read it back in destroys performance.

An astute reader might feel inclined to point out here that most modern desktop GPUs also do some amount of tiled rendering as well. This is true, but the strategy is very different. Tiled rendering on a desktop GPU is typically done implicitly and on-demand. The hardware batches up triangles at the end of the geometry pipeline and rasterizes the whole batch per-tile. This improves cache locality and allows a much smaller, much faster L2 or L3 cache to handle most of the memory traffic from depth testing and color blending. Because this is done implicitly by the hardware and depends on extra levels of caching for the speedup, it can be made almost entirely automatic and transparent to the software. The downside is that internal batching is quite limited and many batches of triangles get rasterized in each render pass. This means that, at the end of the day, the implicit tiling design still depends on fast memory and large caches; it's just better able to take advantage of L2 or L3 caching than a naive forward-rendering implementation.

The tile-based rendering design leads to a different set of performance trade-offs in other areas as well. One example is color blending. Immediate-mode GPUs typically have a large piece of hardware dedicated to handling every possible combination of render target pixel formats, color blend modes, and logic ops supported by the OpenGL, Vulkan, and Direct3D APIs. This hardware consumes a lot of die area because there are so many possible combinations it has to handle. However, because access to tile memory is so fast, tile-based renderers can get away with doing some of this work in shaders. Instead of carrying a large fixed-function blending unit, Mali only supports hardware blending for a small subset of the possible color formats and blending modes. For everything else, the fragment shader jumps to a blend shader which reads from the framebuffer, does the blending math in the shader, and writes the result back out to the framebuffer. This isn't quite as fast as fixed-function blending, but it's fast enough that it's not worth spending precious silicon on fixed-function hardware for anything except a handful of common cases.
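The math a blend shader performs is the standard blend equation, just evaluated in shader code against a value read from tile memory. As a sketch, here is one channel of SRC_ALPHA / ONE_MINUS_SRC_ALPHA blending written as plain C (the real thing is of course a generated shader, not host code):

```c
#include <assert.h>

/* One channel of SRC_ALPHA / ONE_MINUS_SRC_ALPHA blending, as a blend
 * shader would compute it after reading dst back from tile memory. */
static float
blend_src_over(float src, float dst, float src_alpha)
{
   return src * src_alpha + dst * (1.0f - src_alpha);
}
```

Because the dst read hits tile memory rather than DRAM, running a handful of multiply-adds like this per channel is cheap enough to replace fixed-function hardware for the uncommon format/mode combinations.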

Tile-based renderers are typically also fairly good at multisampled rendering. Because tile memory is so fast, the additional bandwidth required during the render pass for maintaining 4 or 8 copies of every pixel isn't significant. The only real cost is the memory bandwidth at the start and end of the render pass to pre-populate the tile memory at the start and write it back out to memory at the end. If only the final, resolved version of the image is required at the end (VK_ATTACHMENT_STORE_OP_NONE in Vulkan), we can go even further and do the multisample resolve entirely in tile memory and only ever write out a single sample per pixel at the end, further reducing the memory bandwidth required. If the image is cleared at the start of the render pass (VK_ATTACHMENT_LOAD_OP_CLEAR in Vulkan), we can further reduce the memory bandwidth used by clearing the tile memory directly and skipping the attachment load at the start of the render pass. When these are combined, we can do a full multisampled render at the bandwidth cost of a single-sampled attachment write at the end.
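From the application side, this best case corresponds to Vulkan dynamic rendering with a clear, a resolve attachment, and a store op that discards the multisampled data. Here is a sketch assuming Vulkan 1.3 (or VK_EXT_load_store_op_none for VK_ATTACHMENT_STORE_OP_NONE) and valid, caller-provided image views; it only fills out the attachment struct, not the full vkCmdBeginRendering() setup:

```c
#include <vulkan/vulkan.h>

/* Sketch: a color attachment set up so a tiler can clear, render, and
 * resolve entirely in tile memory, writing only the single-sampled
 * result out to memory at the end of the render pass. */
static VkRenderingAttachmentInfo
make_resolving_attachment(VkImageView msaa_view, VkImageView resolve_view)
{
   VkRenderingAttachmentInfo color = {
      .sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
      .imageView = msaa_view,
      .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
      .resolveMode = VK_RESOLVE_MODE_AVERAGE_BIT,
      .resolveImageView = resolve_view,
      .resolveImageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
      .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,  /* clear in tile memory */
      .storeOp = VK_ATTACHMENT_STORE_OP_NONE, /* never write MSAA data */
      .clearValue = { .color = { .float32 = { 0, 0, 0, 1 } } },
   };
   return color;
}
```

With this combination, a driver that implements the optimization never has to read or write the multisampled image at all.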

Trouble with multisampled resolves in PanVK

In order to get efficient multisample resolves, however, a lot has to line up. The client has to specify VK_ATTACHMENT_STORE_OP_NONE so that we know we don't have to maintain the multisampled data after the render pass completes. We then need to be able to insert a post-frame resolve shader and tell the hardware to write to the single-sampled resolve target instead of the main render target. We also have to deal with the fact that the tiled area may not exactly align with the render area specified by the Vulkan client because we can't write outside that area in the resolve image.

None of this was really possible with the old framebuffer abstraction. It conflates too many things. Probably the worst of these is the assumption that there is exactly one image view bound to a given attachment and that one image view is used for the load at the top, the store at the bottom, and the mid-render-pass spill for incremental rendering. In OpenGL and OpenGL ES, this makes a fair amount of sense. There is no concept of load/store ops or attachments there. There is simply the set of bound attachments. The multisample resolve optimization is also much more difficult to implement in OpenGL because we don't have any information from the client telling us that it won't use the full multisampled result so we don't know when it's safe to discard and only keep the single-sampled, resolved version.

The old framebuffer abstraction also conflated a few other things such as the hardware's "clean tile write enable" bit with whether or not we were doing a clear at the top of the render pass. It also had no real concept of the tiled area versus the client's render area. It knew enough to set the "clean pixel write enable" bit in certain cases where a color attachment might force an alignment requirement on the output rectangle but it had no way to communicate that information back to the driver so that it could adjust accordingly. Because of this, PanVK had to carry code which tests the client's render area against a fairly generous tile size and, if the render area was unaligned, did a full load and then ran a shader to clear the render area. This leads to additional loads and can result in performance issues whenever partial renders are used.

A new framebuffer abstraction

The new framebuffer abstraction was intentionally designed to separate all these different concerns into 5 separable pieces.

  1. A framebuffer layout: The new pan_fb_layout struct contains the number of color attachments, the pixel format of each attachment, the number of samples, the tile size, and the render area information. This provides a complete description of the tile memory layout and the parameters that will be used to program the tiler itself.
  2. A framebuffer load: The new pan_fb_load struct describes the load operation that happens at the start of the render pass. The only real constraints on the load is that, for image loads, the formats of the image views must match the formats in the framebuffer layout. If loading from a multisampled image, the number of samples also has to match. Some of the information from the load, such as clear colors, is used to program the hardware directly and others, such as image loads, are used to generate preload shaders which get executed at the start of the render pass before anything else is drawn.
  3. A framebuffer store: The new pan_fb_store struct describes the store operation that happens at the end of the render pass. As with loads, the only real requirement is that, if an attachment is written back out, then the destination image view must have the same format as the corresponding attachment in the framebuffer layout. Limited multisample resolving is also supported directly by stores, though it's easier and safer to use resolves for that.
  4. A framebuffer resolve: The new pan_fb_resolves struct describes an optional resolve operation that can happen at the end of the render pass. Because the resolve is always compiled down to a shader, it's fairly generic and can do a full parallel copy from render targets or images to render targets. This gives us a lot of options when it comes to how we handle store ops in Vulkan. We can easily do the above multisample resolve optimization by doing an in-place resolve of the attachment and then doing the store directly to the single-sampled resolve attachment, or we can resolve to a second attachment and store out to both the multisampled attachment and the resolve attachment at the same time.
  5. Framebuffer descriptor info: The pan_fb_desc_info struct contains all the other bits of data we need when filling out the hardware framebuffer descriptor that don't affect the framebuffer layout and aren't tied to any of the above operations. This includes things like the tiling heap, which the hardware uses to allocate temporary memory for tiling, and a few pieces of state, which Mali annoyingly puts in the framebuffer descriptor instead of making them dynamic.
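To visualize the separation, here is a rough sketch of how these five pieces decompose. The struct and field names below are hypothetical and simplified for illustration; they do not match the actual Mesa definitions:

```c
#include <assert.h>

struct image_view; /* stand-in for a driver image view */

struct fb_layout {                  /* 1: tile memory + tiler params */
   unsigned nr_color;               /* number of color attachments */
   unsigned color_format[8];        /* per-attachment pixel format */
   unsigned nr_samples;
   unsigned tile_size;
   struct { unsigned x, y, w, h; } render_area;
};

struct fb_load {                    /* 2: start-of-pass loads/clears */
   struct image_view *color_src[8]; /* format must match the layout */
   float clear_color[8][4];         /* used to program the hardware */
};

struct fb_store {                   /* 3: end-of-pass stores */
   struct image_view *color_dst[8]; /* format must match the layout */
};

struct fb_resolve {                 /* 4: shader-based resolve */
   struct image_view *resolve_dst[8];
};

struct fb_desc_info {               /* 5: leftover descriptor state */
   void *tiler_heap;                /* temporary memory for tiling */
};
```

The key property is that nothing outside the layout affects the tile memory configuration, so loads, stores, and resolves can each bind whatever image views they need.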

One of the important changes is that the loads, stores, and resolves are entirely decoupled and the image views used by each need not be the same. For multisampled rendering, for instance, this allows us to load the multisampled render target at the top of the render pass, do the resolve in a resolve shader at the end, and then store out to the resolve target at the end, which may be an entirely different VkImageView.

Another critical change is that, unlike the old framebuffer abstraction, which only had a single concept of load vs. clear, loads and resolves are now both split into two halves: in-bounds and border. Because we now have a well-defined concept of the tiling area as distinct from the client-specified render area, we're able to do different things inside the client render area vs. in the border pixels. This lets us handle unaligned render areas much more gracefully. In the case of render pass clears, we're able to do an image load in the border pixels and only clear inside the render area. In the case of multisample resolves, we're able to resolve the attachment inside the render area, load the resolve target into the border pixels, and then write the whole thing back out, making it appear as if pixels outside the render area are unchanged in the resolve target, even if we did have to load and store them to deal with tiler alignments.
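The in-bounds/border split can be pictured as each load or resolve carrying two ops, selected per pixel by whether the pixel falls inside the client render area. This is an illustrative sketch with hypothetical names, not the actual PanVK code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: a load/resolve carries one op for pixels inside the
 * client render area and one for the border pixels that tile
 * alignment forces us to touch. */
enum fb_op { FB_OP_NONE, FB_OP_CLEAR, FB_OP_LOAD, FB_OP_RESOLVE };

struct fb_split_op {
   enum fb_op in_bounds; /* applied inside the render area */
   enum fb_op border;    /* applied to tile-alignment padding */
};

/* Pick the op for a pixel based on the client-specified render area. */
static enum fb_op
op_for_pixel(const struct fb_split_op *op, unsigned x, unsigned y,
             unsigned rx, unsigned ry, unsigned rw, unsigned rh)
{
   bool inside = x >= rx && x < rx + rw && y >= ry && y < ry + rh;
   return inside ? op->in_bounds : op->border;
}
```

An unaligned render pass clear then becomes { .in_bounds = FB_OP_CLEAR, .border = FB_OP_LOAD }: clear where the client asked, preserve everything else.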

The one thing that's a little less obvious with the new abstraction is the way we handle incremental rendering. Incremental rendering is what happens when we run out of geometry or binning memory or overflow some fixed hardware limit. In this case, we have to split the render pass into two or more render passes behind the client's back. This obviously isn't great for performance, but it only happens when the client is drawing huge amounts of geometry or uses a very large number of draws, so while splitting the render pass does come at a cost, that cost probably won't dominate the performance at that point.

In the old framebuffer abstraction, the load image views were always the same as the store image views so incremental rendering could be handled by just disabling the clear and enabling loads for every render pass except the first one. The new framebuffer abstraction, however, doesn't have a single set of image views that represent both load and store. Instead, this is handled by having two sets of load and store ops: The set derived from Vulkan load and store ops and a second set which we call the spill. The spill load/store always target the image views bound through the Vulkan API because it's always safe to update the bound render targets. In the common case, when the geometry for the entire render pass fits in the geometry buffer and no incremental rendering is needed, we use only the API load and store. But when incremental rendering is required, the first pass uses the API load and stores to the spill, the last pass loads from the spill and stores to the API targets, and any passes in the middle both load from and store to the spill.
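The pass-by-pass selection of API versus spill targets described above can be sketched as two small helpers. Again, the names are hypothetical and the real driver logic has more to it:

```c
#include <assert.h>

enum fb_target { FB_TARGET_API, FB_TARGET_SPILL };

/* Only the first pass of a split render pass runs the API load ops;
 * every later pass reloads the partial result from the spill. */
static enum fb_target
load_target(unsigned pass, unsigned nr_passes)
{
   (void)nr_passes;
   return pass == 0 ? FB_TARGET_API : FB_TARGET_SPILL;
}

/* Only the last pass runs the API store ops (and resolves); every
 * earlier pass spills the partial result out. */
static enum fb_target
store_target(unsigned pass, unsigned nr_passes)
{
   return pass == nr_passes - 1 ? FB_TARGET_API : FB_TARGET_SPILL;
}
```

Note that when nr_passes is 1 (the common, non-incremental case), both helpers pick the API targets and the spill never comes into play.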

Benchmark results

I benchmarked this MR using the multisampling demo from Sascha Willems on my MediaTek Chromebook and, as expected, this eliminates most of the memory bandwidth and yields substantial speedups:

  • With 2x MSAA: 590 FPS -> 2605 FPS (4.4x speedup)
  • With 4x MSAA: 347 FPS -> 2570 FPS (7.4x speedup)
  • With 8x MSAA: 188 FPS -> 2494 FPS (13.2x speedup)
  • With 16x MSAA: 96.7 FPS -> 2483 FPS (25.7x speedup)

While 25x may look fairly fantastic, the number actually makes sense. Mobile GPUs are often bandwidth limited and the amount of bandwidth saved by optimizing multisample resolves is significant. Without this, we store the full 16x multisampled result and then resolve in a separate pass, which means reading all 16 samples back in again, averaging, and writing out one sample. This adds up to 33 samples being loaded or stored per pixel in the final result. With resolve shaders, the entire resolve happens in fast tile memory and we only store the one final sample at the end, bringing the number of samples loaded or stored per pixel from 33 down to 1. We only see a 25x speedup rather than 33x because the benchmark is doing more than just reading and writing render targets.

The math for the other sample counts is the same. For 8x multisampling, the theoretical maximum speedup is 17x, for 4x, it's 9x, and for 2x we could see as much as a 5x speedup. At least with this simple demo, the real-world numbers are fairly close to the theoretical maximums.
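The arithmetic behind these theoretical maximums is simple enough to write down: an unoptimized resolve stores all S samples, reads all S back in the resolve pass, and writes 1 resolved sample, while the optimized path only writes the 1 resolved sample.

```c
#include <assert.h>

/* Samples moved per pixel for an unoptimized resolve of an S-sample
 * image: store S, read S back, write 1. The in-tile-memory resolve
 * moves exactly 1, so this is also the theoretical max speedup. */
static unsigned
samples_moved_unoptimized(unsigned s)
{
   return 2 * s + 1;
}
```

Plugging in S = 16, 8, 4, and 2 gives the 33x, 17x, 9x, and 5x ceilings quoted above.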

These theoretical maximums assume that render target loads and stores are the only memory accesses in the entire render. For this simple demo, this is almost true since it renders a relatively small amount of geometry and only has a couple simple textures. In a more realistic scenario, render target access is likely to only be a few percent of the total memory bandwidth required to render a frame. A real-world speed-up of over 2x would be surprising, regardless of the sample count. However, as long as the app is using VK_ATTACHMENT_STORE_OP_NONE, we should still be able to get a noticeable speed-up for multisampled use cases.

One other thing that's worth pointing out is how little the FPS numbers vary between the different sample counts once we fix our resolve performance. The claim made earlier that tile-based renderers are good at multisampling really is true if we're able to avoid writing out the full multisampled result. In this benchmark, the jump from 2x to 16x multisampling only cost 5%.

Future work and OpenGL

At the moment, the new framebuffer abstraction has only been hooked up for Vulkan. It could eventually be enabled for OpenGL as well, but it will take a lot of refactoring to get there. Unfortunately, the benefit there is unlikely to be quite as large, since we always have to store the full multisampled result anyway because OpenGL may want it later. However, with a little work we can still eliminate the texture operation required by the resolve and probably get about half the benefit we got for Vulkan.

