Panfrost performance counters with Perfetto

Antonio Caggiano
August 21, 2020

Linux system information is available in several scattered forms. We can query kernel events, CPU counters, and memory counters through ftrace, procfs and sysfs, but historically we have lacked a holistic view of the system, including graphics performance counters, to guide optimization. We have now integrated the Mali GPU hardware counters supported by Panfrost with Perfetto's tracing SDK, unlocking all-in-one graphics-aware profiling on Panfrost systems!

What is Perfetto?

Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps. It enables you to capture the state of various components of your system into a trace file, which can be loaded into a web-based trace viewer, also available online.

At the time of writing, Perfetto offers a good number of probes to see what is going on in the CPU, how much memory is in use, and other things like power consumption. There is also a GPU probe, but it is only capable of sampling the GPU frequency when the driver outputs that information via ftrace.

A key feature of Perfetto is its extensibility. You can feed your own data into Perfetto by either instrumenting your program or making a custom data source. So how about making one to put a magnifying glass on the GPU?

Graphics Perfetto producers

Collabora is working on a new project, gfx-pps, which aims to collect various Perfetto data sources related to graphics hardware and software. The term producer is a key Perfetto concept: it refers to a client process that contributes to the tracing service with one or more data sources.

The gfx-pps project is under active development on FreeDesktop's GitLab and is licensed under MIT. It currently includes two data sources: one samples Mali performance counters, while the other generates track events about the Weston timeline.

Panfrost data source

One of the Perfetto data sources available in gfx-pps is the Panfrost data source, which is able to query Mali Midgard performance counters using the Panfrost driver.

You can follow the README to compile the tools for your target platform, but all the binaries you need to start tracing your application are already available as artifacts of the project's GitLab CI, on the master and perfetto branches. The following files are available for both x86_64 and aarch64:

  • traced, the tracing service.
  • traced_probes, the OS probes service.
  • libperfetto.so and perfetto, which is the command line tool used for recording traces.
  • producer-gpu, which provides the Panfrost data source.
  • gpu.cfg, config file to feed as input to perfetto describing what to trace. This one and other config files can be found under the gfx-pps/scripts directory.

Once you have everything ready on your target platform, follow these steps to capture a trace.

  1. Start the tracing service by running traced.
  2. Start the OS probes service with traced_probes.
  3. Start the GPU producer producer-gpu.
  4. Start perfetto to capture a trace following the directives of a config file:
    perfetto --txt -c gpu.cfg -o trace

Once tracing has finished, you will find a trace file ready to be opened with ui.perfetto.dev.
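
The four steps above can also be scripted. The following is a minimal sketch, not part of gfx-pps: it assumes all binaries and gpu.cfg sit in one directory, and the helper names (perfetto_command, capture) and the one-second pause between services are illustrative assumptions.

```python
import os
import subprocess
import time

# Background services, in the order given in the steps above.
SERVICES = ["traced", "traced_probes", "producer-gpu"]


def perfetto_command(config="gpu.cfg", output="trace"):
    """Build the perfetto invocation from step 4."""
    return ["perfetto", "--txt", "-c", config, "-o", output]


def capture(config="gpu.cfg", output="trace", bin_dir=".", settle_seconds=1.0):
    """Start the services, record one trace, then stop the services."""
    started = []
    try:
        for name in SERVICES:
            started.append(subprocess.Popen([os.path.join(bin_dir, name)]))
            # Assumed grace period so each service is up before the next starts.
            time.sleep(settle_seconds)
        cmd = perfetto_command(config, output)
        cmd[0] = os.path.join(bin_dir, cmd[0])
        subprocess.run(cmd, check=True)
    finally:
        # Stop the services in reverse start order, even if recording failed.
        for proc in reversed(started):
            proc.terminate()
            proc.wait()
```

Calling capture() on the target then leaves the trace file in the current directory, ready to be loaded into the trace viewer.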

Analysis guidelines

The golden rule of performance analysis is to find the bottleneck, which means applying Amdahl's law in order to parallelize as much as possible or to optimize the longest of a series of processes. Then it is all about finding the right balance.
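
To make the rule concrete, here is a small sketch of Amdahl's law: if a fraction p of the run time is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The percentages in the example are made up for illustration.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time becomes `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Doubling the speed of a stage that takes 80% of the frame...
print(round(amdahl_speedup(0.8, 2.0), 2))   # 1.67
# ...beats making a stage that takes 20% of the frame ten times faster.
print(round(amdahl_speedup(0.2, 10.0), 2))  # 1.22
```

This is why the longest stage is the one worth optimizing first.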

CPU/GPU balancing

The first thing to check is the balance between CPU and GPU workload. If the GPU is idle most of the time, while the CPU is continuously busy, it would not make sense to focus on graphics, but the code running on the CPU needs to be optimized instead.

Once we know the GPU is the bottleneck, we need to parallelize work on the CPU and GPU as much as possible by taking advantage of multi-buffering. Multi-buffering lets us keep multiple frames in flight, exploiting the full potential of the GPU.

The screenshot below shows a trace of WebGL Aquarium taken on an RK3399 processor, which uses a Mali Midgard GPU. You can see a frame generated in 69.3 ms. You can also see that GPU and CPU activity occupy the first and the second half of the highlighted area, respectively. This suggests that improvements are needed on both sides.
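
As a rough illustration of what is at stake, the sketch below compares a fully serialized frame with an ideally overlapped one. It is an idealized two-stage model, not a measurement: it assumes the CPU and GPU each take exactly half of the 69.3 ms frame, and that perfect overlap is achievable.

```python
def frame_time_ms(cpu_ms, gpu_ms, pipelined):
    """Steady-state time per frame in a simple two-stage CPU/GPU model."""
    if pipelined:
        # With frames in flight, the slower stage sets the pace.
        return max(cpu_ms, gpu_ms)
    # Without overlap, each frame pays for both stages back to back.
    return cpu_ms + gpu_ms


half = 69.3 / 2  # assumed even split between CPU and GPU work
print(f"serialized: {frame_time_ms(half, half, pipelined=False):.2f} ms")  # 69.30 ms
print(f"overlapped: {frame_time_ms(half, half, pipelined=True):.2f} ms")   # 34.65 ms
```

Under these assumptions, overlapping CPU and GPU work would roughly halve the frame time before any shader or application code is optimized.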

Vertex/Fragment balancing

The next step moves our focus to vertex and fragment workloads. We generally expect to spend more time processing fragments. If the opposite is true, we would probably achieve a better speedup by either optimizing the vertex shader or reducing the amount of geometry submitted for drawing.

Note that spending more time processing fragments does not mean it should occupy 100% of GPU time. If that happens, it is a sign we need to simplify this stage of the graphics pipeline, by reducing the complexity of the fragment shader.
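
The two checks above can be written down as a small decision helper. This is only a sketch: the argument names are hypothetical stand-ins for whichever vertex, fragment and GPU active-cycle counters your trace exposes, and the 95% saturation threshold is an arbitrary assumption.

```python
def stage_hint(vertex_cycles, fragment_cycles, gpu_cycles, saturation=0.95):
    """Suggest which stage to look at, given cycle counts from the same interval."""
    if gpu_cycles <= 0:
        raise ValueError("gpu_cycles must be positive")
    if vertex_cycles > fragment_cycles:
        # More vertex than fragment work: look at the vertex shader or geometry.
        return "vertex"
    if fragment_cycles / gpu_cycles >= saturation:
        # Fragment work fills (almost) all GPU time: simplify the fragment shader.
        return "fragment"
    return "balanced"


print(stage_hint(vertex_cycles=700, fragment_cycles=300, gpu_cycles=1000))  # vertex
print(stage_hint(vertex_cycles=100, fragment_cycles=990, gpu_cycles=1000))  # fragment
print(stage_hint(vertex_cycles=200, fragment_cycles=600, gpu_cycles=1000))  # balanced
```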

Tripipe: Arithmetic/Load-Store/Texture balancing

In order to optimize our shaders, it is important to review the Midgard shader core structure, focusing on the Tripipe execution core. The Tripipe is so called because it can run arithmetic, load/store, and texture instructions in parallel. By looking at the counters of these three pipes, we can find a hint about which one to focus on.
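
One simple way to read those three counters is to compare their shares, as in the sketch below. The counter values are made up, and treating raw per-pipe counts as directly comparable is a simplifying assumption, so take the result as a first hint rather than a verdict.

```python
def pipe_shares(arithmetic, load_store, texture):
    """Return each pipe's share of the summed counts and the busiest pipe."""
    counts = {
        "arithmetic": arithmetic,
        "load/store": load_store,
        "texture": texture,
    }
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one counter must be non-zero")
    shares = {name: value / total for name, value in counts.items()}
    busiest = max(shares, key=shares.get)
    return shares, busiest


shares, busiest = pipe_shares(arithmetic=200, load_store=600, texture=200)
print(busiest)               # load/store
print(shares["load/store"])  # 0.6
```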


Last but not least, optimizing memory bandwidth usage is crucial for mobile devices, as it directly results in less power consumption and better performance.


With the following formula you can calculate the external memory traffic in bytes; dividing it by the sampling interval gives the bandwidth:

( L2 external reads + L2 external writes ) * bus width

While L2 counters are available for tracing, the bus width depends on the specific GPU, e.g. Mali T860 GPUs have a 16-byte AXI bus. Keep in mind that an ideal value for a mobile GPU should stay below 5 GB/s.
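
As a worked example, the sketch below applies the formula and converts the result to GB/s. The counter values are made up, and it assumes they are deltas collected over a known interval, so that dividing by the interval length yields a rate.

```python
def external_traffic_bytes(l2_reads, l2_writes, bus_width_bytes):
    """(L2 external reads + L2 external writes) * bus width."""
    return (l2_reads + l2_writes) * bus_width_bytes


def bandwidth_gb_per_s(total_bytes, seconds):
    """Convert bytes moved during `seconds` into GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_bytes / seconds / 1e9


# Made-up counter deltas over one second on a 16-byte bus (e.g. Mali T860).
total = external_traffic_bytes(150_000_000, 50_000_000, 16)
print(total)                           # 3200000000
print(bandwidth_gb_per_s(total, 1.0))  # 3.2
```

In this made-up case the result stays below the 5 GB/s guideline.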


The gfx-pps project is a good starting point for empowering the open source world of performance analysis tooling. As it has proven to be greatly valuable for our work on Panfrost, we have already planned to implement Perfetto data sources for other GPU families!

Learn more

Also useful: Mali Midgard family performance counters.
