
Profiling virtualized GPU acceleration with Perfetto

Antonio Caggiano
April 22, 2021

Recently, we have been using Perfetto to successfully profile Apitrace traces in crosvm through VirGL renderer. We have now added Perfetto instrumentation to VirGL renderer, Mesa, and Apitrace to see precisely what happens within a frame.

For a brief introduction to the tools just mentioned:

  • Perfetto is an open-source project for performance instrumentation and tracing. With its trace viewer, it is very useful for long system-wide tracing sessions.
  • Apitrace is a tool for tracing OpenGL, Direct3D, and other graphics APIs. Here we use it to trace and replay OpenGL applications.
  • crosvm is a virtual machine monitor, based on Linux KVM.
  • VirGL is a research project which lets us create a virtual 3D GPU for use inside a virtual machine.

I already wrote about how gfx-pps and Perfetto work in a previous post, so here I will focus on the VirGL renderer Perfetto instrumentation.

Host

On the host side we can capture GPU hardware counters with gfx-pps and VirGL renderer commands with perfetto instrumentation.

By using the Perfetto tracing SDK library in VirGL renderer, we can capture trace events; for example, we can generate slices representing when a VirGL command starts being decoded and when its handling finishes.

To be precise, VirGL renderer actually uses vperfetto, which simplifies processing Perfetto traces in virtual machines. We are looking into percetto, a C wrapper on top of the C++ Perfetto tracing SDK, to replace vperfetto in the future.

VirGL renderer

The VirGL renderer initialization takes care of initializing tracing resources when tracing is enabled. Tracing can be enabled by generating a meson project with the -Dtracing=perfetto option.

With the help of useful macros defined in virgl_util.h, we can generate trace events representing the time spent executing certain functions or specific scopes, as we do when calling decode callbacks.
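As an illustration of how such scope macros can work, here is a minimal RAII sketch. The names (`ScopedTrace`, `TRACE_SCOPE`, the in-memory `trace_sink`) are hypothetical stand-ins; the real virgl_util.h macros forward to the Perfetto SDK rather than to a local vector.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// A recorded slice: name plus begin/end timestamps.
struct Slice {
    std::string name;
    std::chrono::steady_clock::time_point begin;
    std::chrono::steady_clock::time_point end;
};

// Global sink standing in for the Perfetto tracing backend.
inline std::vector<Slice>& trace_sink() {
    static std::vector<Slice> slices;
    return slices;
}

// RAII guard: emits one slice spanning the enclosing scope.
class ScopedTrace {
public:
    explicit ScopedTrace(std::string name)
        : name_(std::move(name)), begin_(std::chrono::steady_clock::now()) {}
    ~ScopedTrace() {
        trace_sink().push_back({name_, begin_, std::chrono::steady_clock::now()});
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point begin_;
};

// Hypothetical equivalent of a virgl_util.h function-scope macro.
#define TRACE_SCOPE(name) ScopedTrace trace_scope_guard_(name)

void decode_command() {
    TRACE_SCOPE("vrend_decode_set_vertex_buffers");
    // ... actual command decoding would happen here ...
}
```

The slice is emitted automatically when the guard goes out of scope, so a single macro at the top of a function covers every return path.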

Figure 1: A view into the decode/submit stage of VirGL renderer. We can see how VirGL commands are translated to GL calls.

Description of events

As pointed out above, the main events tracked are the VirGL decode callbacks. We need to consider that the VirGL driver converts OpenGL calls on the guest into VirtIO GPU commands, then pushes them into a command buffer which is submitted to the VirtIO GPU when a flush is needed.
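The guest-side flow just described can be sketched roughly as follows. This is a simplified model, not the actual VirGL driver code; the command representation and class names are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A VirtIO GPU command, reduced to a bare opcode for illustration.
struct VirtioGpuCmd {
    uint32_t opcode;
};

// Simplified guest-side driver: converts GL calls into commands,
// buffers them, and submits the whole batch on flush.
class GuestDriver {
public:
    // Stand-in for a GL entry point translated by the VirGL driver.
    void gl_call(uint32_t opcode) { cmdbuf_.push_back({opcode}); }

    // Flush: submit all buffered commands to the VirtIO GPU in one go.
    std::vector<VirtioGpuCmd> flush() {
        std::vector<VirtioGpuCmd> submitted = std::move(cmdbuf_);
        cmdbuf_.clear();
        return submitted;
    }

    std::size_t pending() const { return cmdbuf_.size(); }

private:
    std::vector<VirtioGpuCmd> cmdbuf_;
};
```

Batching commands this way amortizes the cost of guest-to-host submission over many GL calls, which is why the decode side on the host sees whole command buffers rather than individual calls.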

On the host side, VirGL renderer takes a submitted command buffer and decodes the commands, effectively performing a conversion from VirtIO GPU commands back to OpenGL calls. Decode callbacks are invoked when commands are read from the buffer, and we generate a TrackEvent for every command decoded and handled by the renderer.

Looking at a higher level, we can see many slice events that map directly to VirGL renderer functions, thereby giving us a bird's-eye view of its activity: polling for events, processing command batches, transferring data, decoding and submitting commands.

Guest

On the guest side we have Apitrace running a trace. Since we would like to inspect what happens at the frame level, it would make sense to generate a slice which tells us when a frame starts and when it finishes.

We generate this slice using a different approach, which involves writing a custom ftrace event to the trace_marker file. While this removes the dependency on the Perfetto tracing SDK, it requires some changes to Perfetto to enable parsing of these events by its ftrace data source.
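A sketch of how such a marker can be emitted from userspace follows. The payload format here ("B|frame N" / "E|frame N") is illustrative only; it is not necessarily the format our patched Perfetto ftrace parser expects, and writing to trace_marker requires the tracing filesystem to be mounted and writable.

```cpp
#include <cstdio>
#include <string>

// Build the marker payload for a frame-boundary event.
// "B"/"E" mark begin/end, mirroring common trace-marker conventions.
std::string frame_marker(bool begin, unsigned frame) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%s|frame %u", begin ? "B" : "E", frame);
    return buf;
}

// Write a marker into ftrace. Each write() to trace_marker becomes
// one ftrace event, timestamped by the kernel.
bool write_marker(const std::string& payload,
                  const char* path = "/sys/kernel/tracing/trace_marker") {
    std::FILE* f = std::fopen(path, "w");
    if (!f)
        return false;
    bool ok = std::fputs(payload.c_str(), f) >= 0;
    std::fclose(f);
    return ok;
}
```

Since the kernel timestamps the marker, the frame boundary lands on the same clock as the rest of the ftrace data source, which is what makes it line up with the other tracks in the trace.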

Apitrace OpenGL commands are handled by the VirGL driver in Mesa, still on the guest. This has been instrumented as well with vperfetto to generate TrackEvents for the relevant ioctls sent to the VirtIO GPU.

Figure 2: Apitrace and VirGL driver tracks. Looking at these two, we can immediately see what happens in a frame, and how GL calls are translated to VirtGPU ioctls.

Where to optimize

The examples above show only a small section of the whole trace; indeed, at the frame level we could easily be overwhelmed by the amount of information at our disposal. To the rescue comes a nice feature of the Perfetto Trace Viewer which lets us select an area of a track and open the Slices tab for more details: names, wall duration, average duration, number of occurrences; precious data which can help us get an idea of where most of the time in a frame is spent.
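The aggregation the Slices tab performs can be reproduced over raw slice data. The following is a rough, illustrative sketch of that computation, not Perfetto's actual implementation.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One slice occurrence with its wall duration in nanoseconds.
struct SliceSample {
    std::string name;
    uint64_t wall_ns;
};

// Per-name statistics, as shown in the Slices tab.
struct SliceStats {
    uint64_t occurrences = 0;
    uint64_t total_ns = 0;
    uint64_t average_ns() const {
        return occurrences ? total_ns / occurrences : 0;
    }
};

// Aggregate occurrences and total wall duration per slice name.
std::map<std::string, SliceStats>
aggregate(const std::vector<SliceSample>& slices) {
    std::map<std::string, SliceStats> stats;
    for (const auto& s : slices) {
        auto& entry = stats[s.name];
        entry.occurrences += 1;
        entry.total_ns += s.wall_ns;
    }
    return stats;
}
```

Sorting the resulting map by total duration immediately surfaces the biggest time sinks in the selected region.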

Figure 3: Apitrace and VirGL renderer tracks. The Slices tab shows some information about the highlighted area.

Looking at Figure 3, it seems there are too many glBindVertexBuffers calls, which together take up to 7 ms: a considerable amount of time for a frame. It is definitely a good candidate for optimization efforts.

By looking at the VirGL renderer code, we can see that glBindVertexBuffers calls are emitted only when ctx->sub->vbo_dirty is true, so if we manage to keep that flag clean we might avoid some of these calls altogether.
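The dirty-flag pattern at work here amounts to the following minimal sketch. The struct and function names loosely mirror the VirGL renderer code but are simplified and hypothetical.

```cpp
// Counts calls into the GL driver, for illustration.
struct GlBackend {
    unsigned bind_vertex_buffers_calls = 0;
    void bind_vertex_buffers() { ++bind_vertex_buffers_calls; }
};

// Simplified sub-context; vbo_dirty loosely mirrors ctx->sub->vbo_dirty.
struct SubContext {
    bool vbo_dirty = false;
};

// Only re-bind vertex buffers when the state actually changed,
// then clear the flag so subsequent draws skip the bind.
void emit_draw(SubContext& sub, GlBackend& gl) {
    if (sub.vbo_dirty) {
        gl.bind_vertex_buffers();
        sub.vbo_dirty = false;
    }
    // ... issue the actual draw call here ...
}
```

With this guard, a run of draws sharing the same vertex buffer state pays for a single bind; the cost returns only when something sets the flag again.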

The next step is looking for the places in the codebase where we set vbo_dirty to true, and that happens in three functions: vrend_set_single_vbo(), vrend_set_num_vbo(), and vrend_bind_vertex_elements_state(). For simplicity, let us focus on the first two: they are both invoked only by vrend_decode_set_vertex_buffers() and, as we can see from the Slices tab, this function is called 462 times. How can we reduce this number?

A possible solution would be to group draw calls by their vertex buffer state, so that we would emit a single SET_VERTEX_BUFFERS command for multiple draw calls, as long as this does not affect the resulting frame.
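That grouping idea can be sketched as follows. This is a hypothetical encoder, not VirGL driver code; a real implementation would also have to prove that deferring the state command never changes the rendered frame.

```cpp
#include <cstdint>
#include <vector>

// A draw call tagged with an identifier of its vertex buffer state.
struct Draw {
    uint32_t vbo_state_id;
};

enum class CmdKind { SetVertexBuffers, Draw };

struct Cmd {
    CmdKind kind;
    uint32_t vbo_state_id;
};

// Emit one SET_VERTEX_BUFFERS per run of draws sharing the same
// vertex buffer state, instead of one per draw.
std::vector<Cmd> encode(const std::vector<Draw>& draws) {
    std::vector<Cmd> out;
    bool have_state = false;
    uint32_t current = 0;
    for (const Draw& d : draws) {
        if (!have_state || d.vbo_state_id != current) {
            out.push_back({CmdKind::SetVertexBuffers, d.vbo_state_id});
            current = d.vbo_state_id;
            have_state = true;
        }
        out.push_back({CmdKind::Draw, d.vbo_state_id});
    }
    return out;
}
```

For a stream of draws that mostly reuse the same vertex buffers, this collapses hundreds of SET_VERTEX_BUFFERS commands into a handful, which is exactly the reduction the 462-call figure above suggests is available.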

Conclusion

Being able to look at the big picture has proven really valuable for finding points of the VirGL stack that need optimization. The tools provided by Perfetto are very effective for a visual analysis of the entire system, and its Python API might help in cases where such analysis can be automated.

Finally, to learn more about what the future holds, head over to the ongoing discussion on the mesa-dev mailing list about integrating Perfetto and gfx-pps into Mesa, and take a look at the corresponding Merge Request on Freedesktop's GitLab.
