
Optimizing 3D performance with virglrenderer


Gert Wollny
May 17, 2021

Collabora has been investing in Perfetto to enable driver authors and users to get deep insights into driver internals and GPU performance which were not previously visible. This post shows how we applied this work and other performance analysis tools to study a number of workloads on the virtualized VirGL implementation, and used this insight to improve performance by up to 6.2%.

Back in August 2019, I wrote about running games in a virtual machine by using virglrenderer. Now, let's look at how the code can be tweaked to squeeze out the last bit of performance.

Perfetto and cleaning up the decoding loop

In a first step to analyze performance-related hot spots in virglrenderer, Perfetto, a tool to trace and analyze guest-host performance, was used to obtain a run-time profile of virglrenderer in conjunction with the OpenGL calls on the host and the guest side. Perfetto was already discussed in another blog post.

Here, the virtualization was provided by crosvm. The whole analysis was done within a Docker environment that can be created using scripts provided with the virglrenderer source code, and that is based on running apitrace on trimmed traces.

Figure 1: Screenshot of the perfetto trace visualization focusing on the command decoding loop. Note that many short commands are executed here with the white gaps comprising the loop overhead.


This analysis revealed that the command stream decoding loses a lot of time between the actual function calls. From a performance point of view, two things stood out:

Firstly, the command buffer query always dereferenced the context and the decode buffer pointers:

static inline uint32_t 
get_buf_entry(struct vrend_decode_ctx *ctx, uint32_t offset)
{
   return ctx->ds->buf[ctx->ds->buf_offset + offset];
}

and secondly, in the decoding loop the next buffer offset was evaluated two times and errors were checked two times per command (loop skeleton):

   while (gdctx->ds->buf_offset < gdctx->ds->buf_total) {
      ...
      /* check if the guest is doing something bad */
      if (gdctx->ds->buf_offset + len + 1 > gdctx->ds->buf_total) {
         break;
      }

      ret = ... /* decode and run command */

      if (ret == EINVAL)
         goto out;

      if (ret == ENOMEM)
         goto out;

      gdctx->ds->buf_offset += (len) + 1;
   }

To improve the code, the pointer dereferencing was moved out of the get_buf_entry function to the beginning of command decoding, the decode loop was refactored to check for errors only once per command in the no-error case, and the position of the next command in the buffer is now also evaluated only once. A few other refactorings were applied as well: the interface of the decode functions was unified, and the switch statement was replaced by a callback table. With that, the buffer query is now

static inline uint32_t get_buf_entry(const uint32_t *buf, uint32_t offset)
{
   return buf[offset];
}

and the loop (skeleton) reads:

   ... /* sanitize loop parameters */
   while (buf_offset < buf_total) {

      ...
      const uint32_t *buf = &typed_buf[buf_offset];
      buf_offset += len + 1;

      /* check if the guest is doing something bad */
      if (buf_offset > buf_total) {
         break;
      }

      ret = ... /* decode and run command */
      if (ret)
         return ret;
   }

perf and re-arranging the shader selection

To also test a different environment, the following analysis was done using QEMU, and perf was used for instrumentation to zoom in to the instruction level. To obtain a performance profile, the Unigine Heaven benchmark was run in the guest while perf ran on the host.

The selection of the shader program was identified as the main hot spot. In particular, in each draw call all already linked shader programs are searched to see whether the current combination of shader stages and dual-source blend state is already available as a linked program:

{
   struct vrend_linked_shader_program *ent;
   LIST_FOR_EACH_ENTRY(ent, &ctx->sub->programs, head) {
      if (ent->dual_src_linked != dual_src)
         continue;
      if (ent->ss[PIPE_SHADER_COMPUTE])
         continue;
      if (ent->ss[PIPE_SHADER_VERTEX]->id != vs_id)
         continue;
      if (ent->ss[PIPE_SHADER_FRAGMENT]->id != fs_id)
         continue;
      ...
      return ent;
   }
   return NULL;
}

Various opportunities for optimization are immediately visible:

  • Firstly, a GFX shader program is never linked with a compute shader, hence, managing the GFX programs and the compute programs in two different lists eliminates the needless check for compute shaders when searching for GFX shaders.
  • Secondly, shader IDs are 32 bit values, and since a shader program must consist of at least a vertex shader (VS) and a fragment shader (FS) to result in actual drawing, one can combine these two 32 bit values into one 64 bit value outside the LIST_FOR_EACH_ENTRY loop, and thereby save one comparison and one deref inside the loop.
  • Thirdly, by re-inserting each used program at the front of its list, shaders that are used more often in an OpenGL scene are kept in the front of the list, further reducing the search time.
  • Then, by distributing the IDs over 2^n buckets of shader program lists, one can further reduce the number of times the loop must be run to find a shader program or signal that a new program must be linked. With this last change the n lowest bits in the combined ID (VS + FS) are freed, which gives the opportunity to merge in the dual_src boolean, removing one more deref and compare from the loop.
  • Finally, given that OpenGL applications might actually use only a few shaders, it is likely that a program is already at the front of its corresponding list, in which case one can skip removing the program from and re-adding it to its list.

With these changes lookup_shader_program now reads

#define VREND_PROGRAM_NQUEUE_MASK (VREND_PROGRAM_NQUEUES - 1)
...

{
   uint64_t vs_fs_key = (((uint64_t)fs_id) << 32) | (vs_id & ~VREND_PROGRAM_NQUEUE_MASK) |
                        (dual_src ? 1 : 0);

   struct vrend_linked_shader_program *ent;

   struct list_head *programs = &ctx->sub->gl_programs[vs_id & VREND_PROGRAM_NQUEUE_MASK];
   LIST_FOR_EACH_ENTRY(ent, programs, head) {
      if (likely(ent->vs_fs_key != vs_fs_key))
         continue;
      ...
      /* put the entry in front */
      if (programs->next != &ent->head) {
         list_del(&ent->head);
         list_add(&ent->head, programs);
      }
      return ent;
   }
   return NULL;
}

An analysis with a program that uses reasonably complex 3D scenes, i.e. Unigine Heaven, showed that with just one program list, the body of the loop to find a program was run about 120 times on average. Re-inserting used programs at the front brought this number down to about 60. Since struct list_head consists of just two pointers, the memory overhead per additional array element in gl_programs is rather low compared to the possible run-time savings from shortening the program lists that need to be searched linearly. In light of the numbers obtained from running the Unigine Heaven benchmark, a value of VREND_PROGRAM_NQUEUES = 64 was chosen. (Using a hash table to manage the shader programs was also considered, but its overhead resulted in a considerable performance regression.)

Refactoring pointer dereferencing

In virglrenderer, most functions used to take a vrend_context as parameter, only to later dereference the current vrend_sub_context and never use the parent context, i.e. the code was littered with statements containing ctx->sub->. In order to improve code clarity, and to avoid this dereferencing that the compiler might not always be able to optimize away, the code was refactored to pass the vrend_sub_context directly where possible, and also to use a helper pointer for similar pointer dereferences that were used multiple times in functions or loops.

Further micro-optimizations

In a final round of optimizations, the hash function for virgl resources was changed to use not only xor but also a bit rotation, so as to distribute the input bits and better avoid hash collisions, and a series of if-conditions was combined into one condition so that its evaluation exits as soon as the boolean result is known. Finally, the VBO setup used to be checked in each draw call in order to work around a bug in older Intel graphics driver versions; since this bug can no longer be reproduced, the check was dropped.

The benchmarks and the host environment

For an analysis of the performance improvements obtained by applying all these optimizations, a series of benchmarks was run. The benchmarks were executed on a computer running Gentoo Linux, comprising an AMD FX-6300 processor and a Radeon RX 580 graphics card. Virtualization was provided by QEMU git-7c79721606b compiled to support the SDL interface. The Mesa host and guest version was 21.1.0-devel git103beecd36, and the guest OS was Ubuntu/Linux 20.10. The VM ran with a graphical resolution of 1440x900.

In order to get reproducible performance numbers, the Phoronix Test Suite was used to run these benchmarks, and a suite consisting of a number of benchmarks and games was created, comprising four Unigine benchmarks, GLmark2, Open Arena, Xonotic, and GPUtest/FurMark.

The Unigine benchmarks cover different levels of OpenGL with high-quality texturing and graphical effects, Open Arena and Xonotic are two open source computer games with rather low requirements that can run at very high frame rates, and GLmark2 focuses on general purpose 3D graphics. GPUtest/FurMark, on the other hand, is a GPU stress test that handles all relevant computations in shaders.

Results

The results of running the benchmarks before and after applying the optimizations can be found at openbenchmark.org and are summarized in the following table:

Benchmark Baseline FPS/Score Optimized FPS/Score Change (%)
Open Arena 89.7 ± 0.2 89.3 ± 0.8 -0.4
GLmark2 1273 1312 3.1
GPUtest/FurMark 6293 ± 15.4 6492 ± 76.0 3.2
Unigine Heaven 60.7 ± 0.6 64.5 ± 0.2 6.2
Unigine Sanctuary 141.9 ± 1.5 145.8 ± 2.0 2.7
Unigine Tropics 118.4 ± 0.2 121.9 ± 0.2 2.9
Unigine Valley 41.6 ± 0.0 42.5 ± 0.1 2.1
Xonotic 76.6 ± 0.2 77.9 ± 0.3 1.7


Seven out of the eight selected benchmarks showed an increase in the framerate/benchmark score, and only one regression can be seen. A look at the confidence intervals of the results shows that this one regression is actually not significant, but six of the seven reported improvements are.

Conclusion

A number of micro-optimizations were applied to virglrenderer that, each taken on their own, would probably not give a notable performance improvement, but taken together they show an increase in performance for most of the selected benchmarks. With these changes, perf no longer shows any performance hot spots in the code that can easily be optimized.

Future work to improve the performance of virglrenderer will focus on further reducing small overheads, e.g. by re-arranging and compressing data structures to optimize cache usage. On a higher level, optimizing the guest-host synchronization, reducing one-time overheads like recompiling shaders, and optimizing the command stream are currently being investigated.
