Faith Ekstrand
May 07, 2026
Programmable shaders have been the heart of 3D graphics for the last 25 years. What started as a neat trick for more interesting texturing and lighting has become the single most important part of modern 3D graphics. They're no longer just a few lines of code that multiply matrices or apply a specular color map. Modern games often contain shaders with hundreds or thousands of lines doing anything from physics simulation to ray tracing to live AI filtering.
As shaders have become more important and more complex, NIR, the compiler core that sits at the heart of Mesa, has grown with them. These days NIR is quite a competent compiler infrastructure. We're able to handle the worst that modern games throw at us as well as complex compute and AI workloads.
However, there are a few bits of NIR that can be confusing at times and one of those is the way that we optimize memory access. This blog post aims to be a bit of a primer on how NIR thinks about memory access and optimization. As such, it's going to get quite technical but it should be interesting for Mesa developers and anyone who's curious about how a compiler works on the inside.
In NIR there are two different types of values: variables and SSA values (also called SSA defs in NIR). A variable has a data type, such as struct { uvec3 foo[4]; } and a mode, such as nir_var_mem_ubo, which specifies where it lives in memory. This is basically NIR's equivalent of a storage class in SPIR-V. Depending on the mode, some of those variables live in externally visible memory, such as UBOs or SSBOs, and others are entirely local to the shader, such as shared memory and function-local variables.
SSA values, on the other hand, are always vectors or scalars and are always function-local. Instead of a full data type, they simply have a number of bits per component and a number of components. SSA values are also guaranteed to be assigned at exactly one place in the shader, making it easy for the optimizer to go from an SSA value to the instruction which generated it. Instead of assigning the same variable different values on different sides of an if statement, like you would do in a conventional programming language, values which may be different based on control flow are resolved using special phi instructions. This is what is called single static assignment (SSA) form in the compiler literature.
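To make that concrete, here's a small hand-written illustration. The SSA form is illustrative pseudo-IR, not exact NIR syntax:

```c
/* Source-level code that assigns x on both sides of an if: */
int x;
if (cond)
   x = f();
else
   x = g();
use(x);

/* In SSA form, each assignment produces a brand-new value and a phi
 * instruction selects the right one based on which branch ran:
 *
 *    if (cond) {
 *        ssa_1 = f()
 *    } else {
 *        ssa_2 = g()
 *    }
 *    ssa_3 = phi(ssa_1, ssa_2)
 *    use(ssa_3)
 */
```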
Because SSA values are always vectors or scalars, accessing them is easy. All we need to completely specify access to an SSA value is a simple swizzle, which re-orders vector components. For variables, however, we need to specify the entire dereference (or deref) chain from the variable down to a single element. For instance, if we have a variable like struct { uvec3 foo[4]; } a;, then we need to be able to represent a.foo[2].y in the IR (intermediate representation). This is done in NIR by way of deref instructions. These are sort of NIR's equivalent to pointers in C or C++. Each chain starts with one of two deref instruction types: a variable or a cast. Variable derefs return a pointer to the variable while cast derefs take a pointer or raw memory address and return a pointer with whatever type was specified by the cast. Then struct member and array deref instructions can be used to descend deeper into the data type until we reach a vector or scalar. This is similar to OpAccessChain in SPIR-V except that each step in the access chain is its own instruction.
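As a minimal sketch, this is roughly how a pass might build the deref chain for a.foo[2].y using NIR's builder helpers. Here, b is an assumed nir_builder and a_var is the nir_variable for a; exact names have evolved over time (nir_def was formerly nir_ssa_def):

```c
/* One deref instruction per step in the chain a.foo[2].y: */
nir_deref_instr *a_deref = nir_build_deref_var(b, a_var);
nir_deref_instr *foo = nir_build_deref_struct(b, a_deref, 0);  /* a.foo    */
nir_deref_instr *elem =
   nir_build_deref_array(b, foo, nir_imm_int(b, 2));           /* a.foo[2] */

/* Derefs bottom out at a vector or scalar; the .y component is just a
 * swizzle on the loaded SSA value. */
nir_def *vec = nir_load_deref(b, elem);
nir_def *y = nir_channel(b, vec, 1);                         /* a.foo[2].y */
```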
Once you have a deref chain, NIR has explicit load and store intrinsics. Loads read a value from memory and return it as an SSA value while stores take an SSA value and write it to memory. We also have a deref copy intrinsic which copies everything pointed to by one deref to another. Loads and stores can only access a single vector or scalar at a time since they're bound by the restrictions of SSA values. Copies, on the other hand, can copy whole arrays or structures at a time. We also have a memcpy intrinsic, which is different from deref copies as it works on a range of bytes starting from a deref, instead of operating on logical data types.
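Continuing the sketch above (elem is the a.foo[2] deref from before; value, dst_deref, and src_deref stand in for values built elsewhere), stores and copies look roughly like this:

```c
/* Store an SSA value to the uvec3 at a.foo[2]; the write mask 0x7
 * says all three components are written. */
nir_store_deref(b, elem, value, 0x7);

/* Copy everything one deref points to over to another deref.  Unlike
 * loads and stores, this can move a whole struct or array at once. */
nir_copy_deref(b, dst_deref, src_deref);
```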
Because they're so much simpler, SSA defs are much easier for most of the optimizer to reason about. They're designed to make it easy to see patterns in the shader which we can optimize. The primary goal of any sort of variable or memory optimization is to try and promote things from variables or memory to SSA values so that we can further optimize those values. Most NIR optimizations don't even consider values that live in memory or variables. SSA defs also typically end up in registers in the final compiled shader, so this reduces the amount of thread-local memory we need, often eliminating it entirely.
The core of NIR's deref optimization strategy is nir_lower_vars_to_ssa(). This optimization pass is almost as old as NIR itself. It attempts to find any values which can be promoted directly to SSA.
The strategy is fairly straightforward. First, it walks over the IR and builds a data structure to track variable access. For each function-local variable and for each vector or scalar in that variable, we categorize it as never used, always used directly, or possibly used indirectly. For an array, direct access is one where the array element is a known constant so we know exactly what element will get read or written. An indirect access is one where we don't know the array index at compile time and we have to assume that it might access any element in the array. We can only optimize a value to SSA if we can prove that it will never be accessed indirectly.
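For example (GLSL-style; i is only known at run time):

```c
uvec3 foo[4];

foo[2] = a;   /* direct: we know exactly which element is written */
foo[i] = b;   /* indirect: this might write any element of foo    */
x = foo[2];   /* even this direct load can't be promoted, because
               * the indirect store above might have hit foo[2]   */
```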
Once we've built up that data structure, we do a second pass over the IR which converts each direct-only value to SSA. Each store is replaced with a move instruction which copies from the source of the store to a new SSA value. Each load is replaced with a move from the SSA value from the most recent store. If the most recent store was not in the current basic block in the control-flow graph, there may not be one unique most recent store (such as when the same value is written on both sides of an if/else), so we place phi instructions to resolve the values across control-flow edges.
The advantage of this strategy is that it's fast. The entire optimization is two linear-time passes over the IR. This pass is also able to handle arbitrary control-flow. As long as the value is only ever accessed directly, it can be read and written from anywhere in the function and we can still replace it with an SSA value. The downside is that it can only eliminate a variable access if every access to that vector or scalar value is fully known. There are other strategies which allow for more complex analysis, but they're typically more expensive or struggle with complex control-flow. We'll talk more about those later.
Because optimizing values to SSA is core to NIR's optimization strategy and because a single shader can contain thousands of variable accesses, speed is important here. We need to be able to run this pass often because turning one value into SSA may allow us to optimize other values. For instance, if we have i = 3 followed by a.foo[i] in the shader, nir_lower_vars_to_ssa(), combined with nir_opt_constant_folding(), will eliminate the temporary variable i and turn the array deref into a.foo[3]. The next time nir_lower_vars_to_ssa() is run, it can then convert all accesses to a.foo[3] into SSA values.
In order for nir_lower_vars_to_ssa() to work its magic, we have to first split copies. As stated earlier, a deref copy can copy an entire structure or array at a time. nir_lower_vars_to_ssa(), however, works on individual vectors or scalars. It also doesn't really like copies. When nir_lower_vars_to_ssa() encounters a deref copy that touches a value it can optimize, it implicitly converts that copy to a load and a store.
One solution to this would be to eliminate copies entirely by replacing them with a series of loads and stores. This is what nir_lower_var_copies() does. However, copies are still useful for variable-based copy propagation (more on that later) so we want to keep them around for as long as we can.
We settle on a compromise. Instead of fully lowering copies, we have a pass called nir_split_var_copies() which splits each copy of a complex data type into a series of copies that each copy a single logical vector or scalar. When an array is encountered, we have a special array deref type called a "wildcard" which says that the given deref accesses the entire array. So if we have the variables struct { uvec3 foo[4]; } a, b; and a copy b = a, nir_split_var_copies() will turn it into b.foo[*] = a.foo[*] where * represents the wildcard. In NIR, we call a.foo[*] a fully-qualified deref since it specifies the whole chain from a, which has a structure type, down to a single vector. This lets us make nir_lower_vars_to_ssa() simpler since it only has to reason about fully-qualified deref chains.
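In builder terms, a wildcard is just another array deref step, minus the index. A rough sketch of what nir_split_var_copies() emits for the example above, assuming a nir_builder b and the nir_variables var_a and var_b:

```c
/* Split b = a into b.foo[*] = a.foo[*] using wildcard array derefs. */
nir_deref_instr *src = nir_build_deref_array_wildcard(
   b, nir_build_deref_struct(b, nir_build_deref_var(b, var_a), 0));
nir_deref_instr *dst = nir_build_deref_array_wildcard(
   b, nir_build_deref_struct(b, nir_build_deref_var(b, var_b), 0));
nir_copy_deref(b, dst, src);
```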
Those familiar with compilers might recognize this as being very similar to the conversion from array-of-structs to struct-of-arrays form. The difference is that we're able to do it without breaking up the original variables. For function-local variables, when there are no memcpy intrinsics, these two transforms are logically equivalent. We could just as easily replace any variable with a structure type with a bunch of variables with (possibly nested) array-of-vector types. However, for anything which lives in client-visible memory such as UBOs or SSBOs, we can't go around re-ordering struct members and arrays because doing so would change the memory layout. NIR's transform to fully-qualified derefs doesn't do this. It accomplishes the goal of making things simpler for the optimizer while leaving the original types and memory layout alone.
nir_split_var_copies() is one of the first steps in any NIR-based shader optimizer. It typically only has to be called once or twice because very few passes add new copies which aren't fully qualified. But it does have to be called before nir_lower_vars_to_ssa() can make much progress.
nir_split_var_copies() and nir_lower_vars_to_ssa() are enough for 99% of function-local variable access. Between function inlining and loop unrolling, most variable accesses end up being direct-only eventually, after enough iterations of the optimization loop. However, there are still a few cases where this approach isn't enough.
One is the case where the shader uses an array as scratch space. For instance, let's say we have a loop which writes a.foo[i] and then reads a.foo[i] a few lines later. nir_lower_vars_to_ssa() can't do anything in this case because it only works on direct-only values and we don't know the value of the array index i. For this, we need a pass which is able to see two uses of a.foo[i] as the same as long as the array index i doesn't change between them.
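A GLSL-style example of the pattern: nir_lower_vars_to_ssa() marks every element of a.foo as possibly-indirect and gives up, even though the load's value is knowable.

```c
for (int i = 0; i < 4; i++) {
   a.foo[i] = compute(i);   /* indirect store: i isn't a constant     */
   /* ... code that neither writes a.foo nor changes i ... */
   total += a.foo[i];       /* indirect load of the very same element */
}
```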
The other major limitation to nir_lower_vars_to_ssa() is that it only works on function-local variables. Its goal as an optimization is to replace a variable element with SSA values. It doesn't know what to do with a value that might live in memory. For UBOs, this doesn't much matter since they're never written so every access is a load. But for SSBOs, for instance, we need an optimization which understands interactions between the shader and the CPU or between different shader threads to know when we can reliably replace a load from a.foo[3] with the value previously stored to a.foo[3]. It also needs to understand when we need to leave the store in place because the CPU or some other thread might need that value.
The solution to both of these issues is an optimization pass called nir_opt_copy_prop_vars(), which takes an entirely different approach. Instead of looking at the usage of a variable across the entire shader, it operates locally, looking for places where we can replace a load with the source of a store. It walks the IR, building up a data structure that tracks all stores seen up until that point. When we see a load, we look it up in that data structure to see if it has a known value. Because the store tracking structure is basically a map from derefs to values, it's able to track stores with indirect array access. When we encounter a load from a.foo[i], we're able to search back and find a store to a.foo[i] and propagate its SSA value.
The data structure is also able to track arbitrary deref-to-deref copies and propagate those. If we see a copy b.foo[*] = a.foo[*] and then later see a load from b.foo[i], we can turn that into a load from a.foo[i]. There are a couple of cases where this comes up particularly often. One is DXVK and VKD3D's input handling. Due to the way inputs work in HLSL, DXVK and VKD3D sometimes have to create a big array containing all the shader's inputs and copy that array to a temporary which then gets read by the shader. If inputs only ever get read directly, nir_lower_vars_to_ssa() can get rid of the temporary, but if there are indirect array accesses, we have to fall back to nir_opt_copy_prop_vars() and hope it's able to propagate the copy into the loads. We also run into this sometimes because GLSL doesn't have a way to pass references into functions. All function arguments are passed by value. So if a naive shader author passes an entire UBO into a function, the IR looks like a copy of the whole UBO to a temporary followed by uses of the temporary. If they then indirectly access an array in that UBO, we need to be able to propagate the access back to the original UBO to avoid copying all that data.
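In pseudo-IR, the transformation looks something like this (illustrative, not exact NIR syntax):

```c
/* Before: the shader reads from a temporary that's a copy of inputs.
 *
 *    copy_deref &tmp[*], &inputs[*]
 *    ssa_1 = load_deref &tmp[i]
 *
 * After nir_opt_copy_prop_vars() propagates the copy into the load:
 *
 *    copy_deref &tmp[*], &inputs[*]   // now dead, removable later
 *    ssa_1 = load_deref &inputs[i]
 */
```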
The tricky part of this pass is in pruning the data structure as we go to make sure we never propagate anything we can't. Every time a store or a copy is added to the tracker, we discard any previous stores that may point to the same element. So, for instance, a write to a.foo[i] will invalidate a.foo[3] because, even if we don't know the value of i, i might have been 3 so we can no longer trust the a.foo[3] store. It's not that we now know the value of a.foo[3]; we just can't trust the value we had before. Likewise, a store to a.foo[3] will cause us to discard a store to a.foo[i].
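In code terms (GLSL-style):

```c
a.foo[3] = x;   /* tracker now knows: a.foo[3] holds x                */
a.foo[i] = y;   /* i might be 3: discard a.foo[3], track a.foo[i] = y */
z = a.foo[3];   /* no trusted value remains; this load has to stay    */
```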
We also have to watch out for barrier instructions. Barriers are a way that the shader tells us that data may have been written from another thread. If we see an SSBO barrier, for instance, then we have to discard anything we thought we knew about values in SSBOs.
Finally, we also have to watch out for aliasing. Aliasing is any time one pointer (or deref, in NIR terminology) might point to the same values as another. The example of a.foo[i] possibly being a.foo[3] is one example of aliasing, albeit a simple one. With pointer casts, you can get deref chains that look very different but still happen to point to the same memory. Thanks to extensions like VK_KHR_variable_pointers and VK_KHR_buffer_device_address, we also have ways that the client can pull an arbitrary pointer out of a data structure and use it. Since we don't know anything about such pointers, we have to be very careful when making assumptions about what they do or do not alias. In general, the approach is to always assume two derefs may alias and try to prove otherwise. If the proof fails, we have to assume the worst.
The upside to this approach is that nir_opt_copy_prop_vars() is able to handle a lot of cases that nir_lower_vars_to_ssa() can't. The downside is that it's slow. The store/copy tracker is a complex data structure and maintaining it isn't trivial. If it were only a matter of looking up a store from the deref, we could use a hashing algorithm. But the alias analysis we have to do means every store we add has to look at every single other store in the tracker to see if they might alias. We have a few tricks to help reduce the cost as much as we can but it's still fundamentally a linear walk of all previous stores. To compensate for this, NIR-based optimizers should make sure nir_lower_vars_to_ssa() is run first to help eliminate as many local variables as possible before nir_opt_copy_prop_vars() is run. But at the end of the day, the two passes are complementary and either can potentially help the other make more progress.
The little siblings to nir_opt_copy_prop_vars() are nir_opt_combine_stores() and nir_opt_dead_write_vars(), which perform what is commonly known as store elimination.
Copy propagation is only half of a solution to the problem of too much variable access. Once we've gone through and replaced loads with either a load from the source of an earlier copy or with the SSA value written by a previous store, we still have all those stores and copies lying around. Store elimination is the process of detecting when a store instruction is unnecessary or redundant and getting rid of it. Say, for example, that we have some variable and we did a store followed by a load and then a second store. After nir_opt_copy_prop_vars() eliminates the load by replacing it with the first store, we're left with two stores back-to-back. The first of those two stores is redundant and can be removed.
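In GLSL-style terms:

```c
v = x;   /* first store                                          */
y = v;   /* load: nir_opt_copy_prop_vars() turns this into y = x */
v = z;   /* second store completely overwrites the first         */

/* Once the load is rewritten, nothing reads v between the two stores,
 * so the first store is dead and can be removed. */
```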
To do this, nir_opt_combine_stores() has a store tracker, somewhat similar to the store/copy tracker used by nir_opt_copy_prop_vars(). As we walk the IR, we build up this structure. Whenever we encounter a new store, we check to see if there is anything in the tracker that we know will be overwritten by the new store. If so, we remove the earlier store instruction since we now know that it's redundant.
Maintaining this store tracker is somewhat simpler than, but fundamentally the same as, maintaining the tracker in nir_opt_copy_prop_vars(). Any time we see a load, we have to go through and discard any stores from the tracker which may affect it because they can't be eliminated. Barriers also force us to reset the tracker because we don't know if some other thread will read the stored values before the next store from this thread. And we have to do exactly the same alias analysis to determine when two things may overlap in the presence of deref casts and proper pointers.
nir_opt_combine_stores() does have one trick up its sleeve that classic store elimination does not, however. As the name implies, it doesn't just eliminate stores, it combines them. Because NIR is fundamentally a vector IR, store elimination is not as straightforward as it looks. Every store instruction contains a write mask specifying which vector components get overwritten. For scalars, each store overwrites the entire scalar so this is a non-issue. But for vectors, we need to know that a store to a.xy isn't made fully redundant by a later store to a.xz because the y and z components differ. The easy thing to do here would be to treat overlapping write masks the way we treat possibly-aliasing derefs and just say we can't eliminate the first store. nir_opt_combine_stores() is smarter than that, though, and instead replaces both stores with a single store to a.xyz.
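For example (GLSL-style swizzles):

```c
a.xy = u;   /* write mask covers x and y */
a.xz = v;   /* write mask covers x and z */

/* The second store doesn't overwrite y, so the first store isn't dead.
 * Instead of giving up, nir_opt_combine_stores() merges the two into a
 * single store to a.xyz that takes x and z from the second store and
 * y from the first. */
```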
nir_opt_dead_write_vars() is similar to nir_opt_combine_stores() but only performs store elimination. However, there are a few cases where it is able to eliminate stores but nir_opt_combine_stores() cannot. At some point in the future, we may combine the two passes.
One of the biggest challenges in optimizing variable and memory access in NIR is managing copies. This is also where many drivers' compiler stacks mess up and don't get as much out of NIR as they could.
Most variable and memory optimization passes need copies to be split in order to do anything so nir_split_var_copies() should be run early. However, you don't want to lower copies too early because nir_opt_copy_prop_vars() wants big copies whenever possible so it can propagate array accesses. If you lower copies too aggressively or too early, you end up hampering nir_opt_copy_prop_vars(). But before handing the NIR off to the hardware back-end compiler, you typically want to run a bunch of lowering passes which simplify the IR to make it easier to consume in the back-end and most of these lowering passes want copies eliminated entirely. The premier example of this is nir_lower_explicit_io() which converts external memory access such as UBO and SSBO access from deref instructions with explicit types to pointer arithmetic and intrinsics which load and store from raw pointers or descriptors and byte offsets. Since this pass is generating concrete load and store instructions, it doesn't want to deal with large copies which may copy an entire array at a time.
The typical strategy here is to break the optimization process into two phases. First, we run an optimization loop which allows copies and tries to find more copies whenever possible. This gives nir_opt_copy_prop_vars() the most opportunity to eliminate variable and memory access for us. Once we're ready to lower UBO and SSBO access, we get rid of copies and run nir_lower_explicit_io() to turn deref-based UBO and SSBO access into the byte-based intrinsics the back-end compiler actually wants to consume. From that point on, we disallow adding new copies and optimize without them.
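A rough sketch of what this looks like in a driver. Pass selection and ordering vary widely between drivers, the address format here is just an example, and NIR_PASS is Mesa's helper macro for running a pass and accumulating progress:

```c
/* Phase 1: keep copies around and iterate until nothing changes. */
bool progress;
do {
   progress = false;
   NIR_PASS(progress, nir, nir_split_var_copies);
   NIR_PASS(progress, nir, nir_lower_vars_to_ssa);
   NIR_PASS(progress, nir, nir_opt_find_array_copies);
   NIR_PASS(progress, nir, nir_opt_copy_prop_vars);
   NIR_PASS(progress, nir, nir_opt_dead_write_vars);
   NIR_PASS(progress, nir, nir_opt_constant_folding);
   /* ... plus the usual algebraic, CSE, and DCE passes ... */
} while (progress);

/* Phase 2: get rid of copies entirely, then lower UBO/SSBO access to
 * byte-based intrinsics.  The address format is driver-specific. */
NIR_PASS(_, nir, nir_lower_var_copies);
NIR_PASS(_, nir, nir_lower_explicit_io,
         nir_var_mem_ubo | nir_var_mem_ssbo,
         nir_address_format_32bit_index_offset);
```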
During that first phase, we have several passes which attempt to find new copies or optimize variables so nir_opt_copy_prop_vars() can make more progress. These include:
- nir_opt_find_array_copies() searches for unrolled array copies. This can look like a series of copies or a series of loads and stores that copy an entire array, one element at a time. It replaces those series of instructions with a single deref copy with wildcard derefs which copies the entire array. Converting split copies to whole-array copies enables copy propagation to potentially propagate more indirect array loads.
- nir_opt_deref() works on the derefs themselves and tries to eliminate unnecessary casts. (With VK_KHR_shader_untyped_pointers, we can get a lot of unnecessary casts, even in 3D graphics shaders.) For OpenCL generic pointers, it also tries to specialize generic pointers to specific storage classes.
- nir_opt_memcpy() attempts to replace memcpy with deref copies. However, it can only do this if there are no holes in one of the source or destination types.
- nir_lower_vec3_to_vec4() replaces vec3 types with vec4. With OpenCL, vec3 types consume the same memory as a vec4 so this doesn't actually bloat memory. Instead, it fills in holes in structs and arrays so nir_opt_memcpy() can detect more copies.

While nir_split_var_copies() is able to reduce derefs to a form we can optimize, it leaves the variable itself intact. However, if we're unable to optimize a function-local variable to SSA values and it ends up in stack memory, we may still be able to reduce the memory footprint somewhat. For that, we have a few more passes:
- nir_opt_split_struct_vars() splits struct variables into individual variables, one per struct member. This sort of finishes the job that nir_split_var_copies() started. This way unused struct members can be eliminated entirely.
- nir_opt_split_array_vars() looks for variables with arrays that are only ever accessed directly and splits the array into individual elements. As with the struct version, the goal is to potentially reduce stack memory. However, if run too early, it can split a variable before nir_opt_find_array_copies() has the chance to find a copy. You only want to run it after most variable and memory optimizations.
- nir_shrink_vec_array_vars() looks at arrays of vectors and attempts to eliminate unused components. A lot of shaders use vec4 types even when they're not needed. You have to be careful, though, because if nir_shrink_vec_array_vars() and nir_lower_vec3_to_vec4() are run in a loop, they will end up fighting with each other.

When all these passes are put together in the right order, we get a powerful optimizer that's able to chew through even the most complex shaders and eliminate a lot of variable and memory access.
When bringing up ray-tracing on Intel, we spent a lot of time looking at the BVH-building kernels which were written in OpenCL C and compiled to SPIR-V with LLVM. Even for fairly straightforward OpenCL C code, LLVM would generate large numbers of casts and memcpy instructions as part of its own lowering and optimization process. When all these optimizations are put together in the right order, NIR can chew through it all and eliminate virtually all stack memory. It's also proven very effective in RustiCL when running complex OpenCL benchmarks.