Panfrost performance counters with Perfetto

Panfrost performance counters with Perfetto

Antonio Caggiano
August 21, 2020

Share this post:

Reading time:

Linux system information is available in several scattered forms. We can query kernel events, CPU counters, and memory counters through ftrace, procfs and sysfs, but historically we've lacked a holistic view of the system - including graphics performance counters - to target optimization. But we have now integrated Mali GPU hardware counters supported by Panfrost with Perfetto's tracing SDK, unlocking all-in-one graphics-aware profiling on Panfrost systems!

What is Perfetto?

Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps. It enables you to capture the state of various components of your system into a trace file, which can be loaded into a web-based trace viewer, also available online.

At the moment of writing, Perfetto offers a good number of probes to see what is going on in the CPU, what is the memory usage, and other things like power consumption. There is also a GPU probe, but it is only capable of sampling the GPU frequency when the driver outputs that information via ftrace.

A key feature of Perfetto is its extendibility. You can feed your own data into Perfetto by either instrumenting your program or making a custom data source. So how about making one to put a magnifying glass on the GPU?

Graphics Perfetto producers

Collabora is working on a new project, namely gfx-pps, which aims to collect various perfetto data sources related to graphics hardware and sofware. The term producer is a key Perfetto concept which refers to a client process contributing to the tracing service with one or more data sources.

The gfx-pps project is under active development on FreeDesktop's GitLab licensed under MIT. It currently includes two data sources: one is able to sample Mali performance counters, while the other generates track events about Weston timeline.

Panfrost data source

One of the Perfetto data sources available in gfx-pps is the Panfrost data source, which is able to query Mali Midgard performance counters using the Panfrost driver.

You can follow the README to compile the tools for your target platform, but all the binaries you need to start tracing your application are already available as artifacts of the project's GitLab CI, on master and perfetto branches. The executables you can find for either x86_64 and aarch64 are the following:

traced, the tracing service.
traced_probes, the OS probes service.
libperfetto.so and perfetto, which is the command line tool used for recording traces.
producer-gpu, which provides the Panfrost data source.
gpu.cfg, config file to feed as input to perfetto describing what to trace. This one and other config files can be found under the gfx-pps/scripts directory.

Once you have everything ready on your target platform, follow these steps to capture a trace.

Start the tracing service by running traced.
Start the OS probes service with traced_probes.
Start the GPU producer producer-gpu.
Start perfetto to capture a trace following the directives of a config file:
```
perfetto --txt -c gpu.cfg -o trace
```

Once tracing has finished, you will find a trace file ready to be opened with ui.perfetto.dev.

Analysis guidelines

The golden rule of performance analysis is to find the bottleneck, which means applying Amdahl's law in order to parallelize as much as possible or to optimize the longest of a series of processes. Then it is all about finding the right balance.

CPU/GPU balancing

The first thing to check is the balance between CPU and GPU workload. If the GPU is idle most of the time, while the CPU is continuously busy, it would not make sense to focus on graphics, but the code running on the CPU needs to be optimized instead.

Once we know the GPU is the bottleneck, we need to make sure to parallelize work on CPU and GPU as much as possible by taking advantage of multi-buffering. Multi-buffering enables us to draw multiple frames in-flight, therefore exploiting the full potentiality of the GPU.

The screenshot below shows a trace of WebGL Aquarium taken on a RK3399 processor which uses a Mali Midgard GPU. You can see a frame generated in 69.3 ms. You can also notice how both GPU and CPU activity occupy respectively the first and the second half of the highlighted area. This suggests that improvements are needed on both sides.

Vertex/Fragment balancing

The next step moves our focus on vertex and fragment workloads. We generally expect to spend more time processing fragments. If the opposite is true, it means we would probably achieve a better speedup by either optimizing the vertex shader or reducing the number of geometry submitted for drawing.

Note that spending more time processing fragments does not mean it should occupy 100% of GPU time. If that happens, it is a sign we need to simplify this stage of the graphics pipeline, by reducing the complexity of the fragment shader.

Tripipe: Arithmetic/Load-Store/Texture balancing

In order to optimize our shaders, it is important to review the Midgard shader core structure, by focusing on the Tripipe execution core. The Tripipe is so-called because it can run arithmetic, load/store, and texture instructions in parallel. By looking at the counters of these three pipes, we could find a hint on which one to focus.

Bandwidth

Last but not least, optimizing memory bandwidth usage is crucial for mobile devices, as it directly results in less power consumption and better performance.

Note:

With the following formula you can calculate the bandwidth in bytes:

( L2 external reads + L2 external writes ) * bus width

While L2 counters are available for tracing, the bus width depends on the specific GPU, e.g. Mali T860 GPUs have a 16 bytes AXI bus. Keep in mind that an ideal value for a mobile GPU should stay below 5 GB/s.

Conclusion

The gfx-pps project is a good starting point for empowering the open source world of performance analysis tooling. As it has proven to be greatly valuable for our work on Panfrost, we have already planned to implement Perfetto data sources for other GPU families!

Learn more

Also useful: Mali Midgard family performance counters.

Bifrost meets GNOME: Onward & upward to zero graphics blobs

From Bifrost to Panfrost - deep dive into the first render

Experimental Panfrost GLES 3.0 support has landed in Mesa

Bifrost meets GNOME: Onward & upward to zero graphics blobs

From Bifrost to Panfrost - deep dive into the first render

Experimental Panfrost GLES 3.0 support has landed in Mesa

Comments (0)

Add a Comment

Search the newsroom

Latest Blog Posts

PipeWire workshop 2025: Updates on video transport, Rust efforts, TSN networking, and Bluetooth support

03/07/2025

As part of the activities Embedded Recipes in Nice, France, Collabora hosted a PipeWire workshop/hackfest, an opportunity for attendees…

Coccinelle for Rust progress report

25/06/2025

In collaboration with Inria, the French Institute for Research in Computer Science and Automation, Tathagata Roy shares the progress made…

Linux Media Summit 2025 recap

23/06/2025

Last month in Nice, active media developers came together for the annual Linux Media Summit to exchange insights and tackle ongoing challenges…

Constructor acquires, destructor releases

09/06/2025

In this final article based on Matt Godbolt's talk on making APIs easy to use and hard to misuse, I will discuss locking, an area where…

What if C++ had decades to learn?

21/05/2025

In this second article of a three-part series, I look at how Matt Godbolt uses modern C++ features to try to protect against misusing an…

Unleashing gst-python-ml: Python-powered ML analytics for GStreamer pipelines

12/05/2025

Powerful video analytics pipelines are easy to make when you're well-equipped. Combining GStreamer and Machine Learning frameworks are the…

About Collabora

Whether writing a line of code or shaping a longer-term strategic software development plan, we'll help you navigate the ever-evolving world of Open Source.

한국의 국기 한국어 버전의 Collabora.com 보기