
Optimizing 3D performance with virglrenderer


Gert Wollny
May 17, 2021

Collabora has been investing in Perfetto to enable driver authors and users to get deep insights into driver internals and GPU performance which were not previously visible. This post shows how we applied this work and other performance analysis tools to study a number of workloads on the virtualized VirGL implementation, and used this insight to improve performance by up to 6.2%.

Back in August 2019, I wrote about running games in a virtual machine by using virglrenderer. Now, let's look at how the code can be tweaked to squeeze out the last bit of performance.

Perfetto and cleaning up the decoding loop

In a first step to analyze performance-related hot spots in virglrenderer, Perfetto, a tool to trace and analyze guest-host performance, was used to obtain a run-time profile of virglrenderer in conjunction with the OpenGL calls on the host and the guest side. Perfetto was already discussed in another blog post.

Here, the virtualization was provided by crosvm. The whole analysis was done within a Docker environment that can be created using scripts provided with the virglrenderer source code, and that is based on running apitrace on trimmed traces.

Figure 1: Screenshot of the perfetto trace visualization focusing on the command decoding loop. Note that many short commands are executed here with the white gaps comprising the loop overhead.


This analysis revealed that the command stream decoding loses a lot of time between the actual function calls. From a performance point of view, two things stood out:

Firstly, the command buffer query always dereferenced the context and the decode buffer pointers:

static inline uint32_t 
get_buf_entry(struct vrend_decode_ctx *ctx, uint32_t offset)
{
   return ctx->ds->buf[ctx->ds->buf_offset + offset];
}

and secondly, in the decoding loop the next buffer offset was evaluated two times and errors were checked two times per command (loop skeleton):

   while (gdctx->ds->buf_offset < gdctx->ds->buf_total) {
      ...
      /* check if the guest is doing something bad */
      if (gdctx->ds->buf_offset + len + 1 > gdctx->ds->buf_total) {
         break;
      }

      ret = ... /* decode and run command */

      if (ret == EINVAL)
         goto out;

      if (ret == ENOMEM)
         goto out;

      gdctx->ds->buf_offset += (len) + 1;
   }

To improve the code, the pointer dereferencing was moved out of the get_buf_entry function to the beginning of command decoding, the decode loop was refactored to check for errors only once per command in the no-error case, and the position of the next command in the buffer is now also evaluated only once. A few other refactorings were applied as well: the interface of the decode functions was unified, and the switch statement was replaced by a callback table. With that, the buffer query is now

static inline uint32_t get_buf_entry(const uint32_t *buf, uint32_t offset)
{
   return buf[offset];
}

and the loop (skeleton) reads:

   ... /* sanitize loop parameters */
   while (buf_offset < buf_total) {

      ...
      const uint32_t *buf = &typed_buf[buf_offset];
      buf_offset += len + 1;

      /* check if the guest is doing something bad */
      if (buf_offset > buf_total) {
         break;
      }

      ret = ... /* decode and run command */
      if (ret)
         return ret;
   }

perf and re-arranging the shader selection

To also test a different environment, the following analysis was done using QEMU, and perf was used for instrumentation to zoom in to the instruction level. To obtain a performance profile, the Unigine Heaven benchmark was run in the guest while perf ran on the host.

The selection of the shader program was identified as the main hot spot. In particular, in each draw call all already linked shader programs are searched to see whether the current combination of shader stages and dual-source blend state is already available as a linked program:

{
   struct vrend_linked_shader_program *ent;
   LIST_FOR_EACH_ENTRY(ent, &ctx->sub->programs, head) {
      if (ent->dual_src_linked != dual_src)
         continue;
      if (ent->ss[PIPE_SHADER_COMPUTE])
         continue;
      if (ent->ss[PIPE_SHADER_VERTEX]->id != vs_id)
         continue;
      if (ent->ss[PIPE_SHADER_FRAGMENT]->id != fs_id)
         continue;
      ...
      return ent;
   }
   return NULL;
}

Various opportunities for optimization are immediately visible:

  • Firstly, a GFX shader program is never linked with a compute shader, hence, managing the GFX programs and the compute programs in two different lists eliminates the needless check for compute shaders when searching for GFX shaders.
  • Secondly, shader IDs are 32 bit values, and since a shader program must consist of at least a vertex shader (VS) and a fragment shader (FS) to result in actual drawing, one can combine these two 32 bit values into one 64 bit value outside the LIST_FOR_EACH_ENTRY loop, and thereby save one comparison and one deref inside the loop.
  • Thirdly, by re-inserting each used program at the front of its list, shaders that are used more often in an OpenGL scene are kept in the front of the list, further reducing the search time.
  • Then, by distributing the IDs over 2^n buckets of shader program lists, one can further reduce the number of times the loop must be run to find a shader program or signal that a new program must be linked. With this last change the n lowest bits in the combined ID (VS + FS) are freed, which gives the opportunity to merge in the dual_src boolean, removing one more deref and compare from the loop.
  • Finally, given that OpenGL applications might actually use only a few shaders, it is likely that a program is already at the front of its corresponding list, in which case one can skip removing the program from and re-adding it to its list.

With these changes lookup_shader_program now reads

#define VREND_PROGRAM_NQUEUE_MASK (VREND_PROGRAM_NQUEUES - 1)
...

{
   uint64_t vs_fs_key = (((uint64_t)fs_id) << 32) | (vs_id & ~VREND_PROGRAM_NQUEUE_MASK) |
                        (dual_src ? 1 : 0);

   struct vrend_linked_shader_program *ent;

   struct list_head *programs = &ctx->sub->gl_programs[vs_id & VREND_PROGRAM_NQUEUE_MASK];
   LIST_FOR_EACH_ENTRY(ent, programs, head) {
      if (likely(ent->vs_fs_key != vs_fs_key))
         continue;
      ...
      /* put the entry in front */
      if (programs->next != &ent->head) {
         list_del(&ent->head);
         list_add(&ent->head, programs);
      }
      return ent;
   }
   return NULL;
}

An analysis with a program that uses reasonably complex 3D scenes, i.e. Unigine Heaven, showed that with just one program list, the body of the loop to find a program was run about 120 times on average. Re-inserting used programs at the front brought this number down to about 60. Since struct list_head consists of just two pointers, the memory overhead per additional array element in gl_programs is rather low compared to the possible run-time savings from shortening the program lists that need to be searched linearly. In light of the numbers obtained from running the Unigine Heaven benchmark, a value of VREND_PROGRAM_NQUEUES = 64 was chosen. (Using a hash table to manage the shader programs was also considered, but its overhead resulted in a considerable performance regression.)

Refactoring pointer dereferencing

In virglrenderer, most functions used to take a vrend_context as parameter, only to later dereference the current vrend_sub_context and never use the parent context, i.e. the code was littered with statements containing ctx->sub->. In order to improve code clarity, and to avoid this dereferencing that the compiler might not always be able to optimize away, the code was refactored to pass the vrend_sub_context directly where possible, and also to use a helper pointer for similar pointer dereferences that were used multiple times in functions or loops.

Further micro-optimizations

In a final round of optimizations, the hash function for virgl resources was changed to use not only xor but also a bit rotation, so as to distribute the input bits and better avoid hash collisions, and a series of if-conditions was combined into one condition so that its evaluation exits as soon as the boolean result is known. Finally, the VBO setup used to be checked in each draw call in order to work around a bug in older Intel graphics driver versions; since this bug can no longer be reproduced, the check was dropped.

The benchmarks and the host environment

For an analysis of the performance improvements obtained by applying all these optimizations, a series of benchmarks was run. The benchmarks were executed on a computer running Gentoo Linux, comprising an AMD FX-6300 processor and a Radeon RX 580 graphics card. Virtualization was provided by QEMU git-7c79721606b compiled to support the SDL interface. The Mesa host and guest version was 21.1.0-devel git103beecd36, and the guest OS was Ubuntu/Linux 20.10. The VM ran with a graphical resolution of 1440x900.

In order to get reproducible performance numbers, the Phoronix Test Suite was used to run these benchmarks, and a suite consisting of a number of benchmarks and games was created, comprising four Unigine benchmarks, GLmark2, Open Arena, Xonotic, and GPUtest/FurMark.

The Unigine benchmarks cover different levels of OpenGL with high-quality texturing and graphical effects, Open Arena and Xonotic are two open source computer games with rather low requirements that can run at very high frame rates, and GLmark2 focuses on general purpose 3D graphics. GPUtest/FurMark, on the other hand, is a GPU stress test that handles all relevant computations in shaders.

Results

The results of running the benchmarks before and after applying the optimizations can be found at openbenchmark.org and are summarized in the following table:

Benchmark Baseline FPS/Score Optimized FPS/Score Change (%)
Open Arena 89.7 ± 0.2 89.3 ± 0.8 -0.4
GLmark2 1273 1312 3.1
GPUtest/FurMark 6293 ± 15.4 6492 ± 76.0 3.2
Unigine Heaven 60.7 ± 0.6 64.5 ± 0.2 6.2
Unigine Sanctuary 141.9 ± 1.5 145.8 ± 2.0 2.7
Unigine Tropics 118.4 ± 0.2 121.9 ± 0.2 2.9
Unigine Valley 41.6 ± 0.0 42.5 ± 0.1 2.1
Xonotic 76.6 ± 0.2 77.9 ± 0.3 1.7


Seven out of the eight selected benchmarks showed an increase in the framerate/benchmark score, and only one regression can be seen. A look at the confidence intervals of the results shows that this one regression is actually not significant, but six of the seven reported improvements are.

Conclusion

A number of micro-optimizations were applied to virglrenderer that, each taken on their own, would probably not give a notable performance improvement, but taken together they show an increase in performance for most of the selected benchmarks. With these changes, perf no longer shows any performance hot spots in the code that can easily be optimized.

Future work to improve the performance of virglrenderer will focus on further reducing small overheads, e.g. by re-arranging and compressing data structures to optimize cache usage. On a higher level, optimizing the guest-host synchronization, reducing one-time overheads like recompiling shaders, and optimizing the command stream are currently being investigated.
