Panfrost performance counters with Perfetto

Antonio Caggiano
August 21, 2020

Linux system information is available in several scattered forms. We can query kernel events, CPU counters, and memory counters through ftrace, procfs and sysfs, but historically we've lacked a holistic view of the system, including graphics performance counters, to guide optimization. We have now integrated the Mali GPU hardware counters supported by Panfrost with Perfetto's tracing SDK, unlocking all-in-one graphics-aware profiling on Panfrost systems!

What is Perfetto?

Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps. It enables you to capture the state of various components of your system into a trace file, which can be loaded into a web-based trace viewer, also available online.

At the time of writing, Perfetto offers a good number of probes that show what is going on in the CPU, how much memory is in use, and other things like power consumption. There is also a GPU probe, but it can only sample the GPU frequency, and only when the driver outputs that information via ftrace.

A key feature of Perfetto is its extensibility. You can feed your own data into Perfetto by either instrumenting your program or writing a custom data source. So how about making one to put a magnifying glass on the GPU?

Graphics Perfetto producers

Collabora is working on a new project, gfx-pps, which aims to collect various Perfetto data sources related to graphics hardware and software. The term producer is a key Perfetto concept: it refers to a client process that contributes one or more data sources to the tracing service.

The gfx-pps project is under active development on FreeDesktop's GitLab and is licensed under MIT. It currently includes two data sources: one samples Mali performance counters, while the other generates track events for the Weston timeline.

Panfrost data source

One of the Perfetto data sources available in gfx-pps is the Panfrost data source, which is able to query Mali Midgard performance counters using the Panfrost driver.

You can follow the README to compile the tools for your target platform, but all the binaries you need to start tracing your application are already available as artifacts of the project's GitLab CI, on the master and perfetto branches. The following files are available for both x86_64 and aarch64:

  • traced, the tracing service.
  • traced_probes, the OS probes service.
  • libperfetto.so and perfetto, the latter being the command-line tool used for recording traces.
  • producer-gpu, which provides the Panfrost data source.
  • gpu.cfg, a config file passed to perfetto that describes what to trace. This and other config files can be found under the gfx-pps/scripts directory.

Once you have everything ready on your target platform, follow these steps to capture a trace.

  1. Start the tracing service by running traced.
  2. Start the OS probes service with traced_probes.
  3. Start the GPU producer, producer-gpu.
  4. Start perfetto to capture a trace following the directives of a config file:
    perfetto --txt -c gpu.cfg -o trace

Once tracing has finished, you will find a trace file ready to be opened with ui.perfetto.dev.
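The four steps above can also be scripted. The following is only a sketch, assuming the binaries are on the PATH of the target and that gpu.cfg sits in the current directory. The helper names (perfetto_command, capture) and the one-second start-up delay are illustrative choices, not part of gfx-pps or Perfetto.

```python
import subprocess
import time

# Long-running services from steps 1-3, in the order listed above.
SERVICES = [["traced"], ["traced_probes"], ["producer-gpu"]]


def perfetto_command(config="gpu.cfg", output="trace"):
    """Build the step 4 command line as given in the post."""
    return ["perfetto", "--txt", "-c", config, "-o", output]


def capture(config="gpu.cfg", output="trace", startup_delay=1.0):
    """Start the services, record one trace, then stop the services."""
    running = []
    try:
        for cmd in SERVICES:
            running.append(subprocess.Popen(cmd))
            # Assumption: a crude pause is enough for each service to come up.
            time.sleep(startup_delay)
        # Blocks until perfetto has finished recording.
        subprocess.run(perfetto_command(config, output), check=True)
    finally:
        # Stop the services in reverse start order.
        for proc in reversed(running):
            proc.terminate()
            proc.wait()
```

Calling capture() on the target would run the same sequence as the manual steps and leave the result in a file named trace.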

Analysis guidelines

The golden rule of performance analysis is to find the bottleneck, which means applying Amdahl's law in order to parallelize as much as possible or to optimize the longest of a series of processes. Then it is all about finding the right balance.
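To make the reference to Amdahl's law concrete, here is a small sketch of the formula: if a fraction p of the run time is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The function name and the 80%/20% split are made up for illustration.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time becomes `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical frame: one stage takes 80% of the time, another 20%.
print(round(amdahl_speedup(0.8, 2.0), 2))  # doubling the speed of the 80% stage
print(round(amdahl_speedup(0.2, 2.0), 2))  # doubling the speed of the 20% stage
```

Doubling the speed of the dominant stage gives roughly a 1.67x overall speedup, while the same effort spent on the minor stage gives only about 1.11x. This is why finding the bottleneck comes first.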

CPU/GPU balancing

The first thing to check is the balance between CPU and GPU workload. If the GPU is idle most of the time while the CPU is continuously busy, it makes little sense to focus on graphics; the code running on the CPU needs to be optimized instead.

Once we know the GPU is the bottleneck, we need to make sure to parallelize work on CPU and GPU as much as possible by taking advantage of multi-buffering. Multi-buffering lets us keep multiple frames in flight, and therefore exploit the full potential of the GPU.

The screenshot below shows a trace of WebGL Aquarium taken on an RK3399 processor, which uses a Mali Midgard GPU. You can see a frame generated in 69.3 ms. You can also see that GPU activity occupies the first half of the highlighted area and CPU activity the second half. This suggests that improvements are needed on both sides.
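The CPU/GPU check can be expressed as a rough heuristic. The sketch below is not part of gfx-pps or Perfetto: the function name and the 90% saturation cut-off are assumptions, and the busy times are approximated from the description of the trace (each side active for about half of the 69.3 ms frame).

```python
def likely_bottleneck(cpu_busy_ms, gpu_busy_ms, interval_ms, saturated=0.9):
    """Guess which side limits a frame from its busy time within an interval.

    `saturated` is a hypothetical cut-off, not a value from Perfetto or gfx-pps.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    cpu_saturated = cpu_busy_ms / interval_ms >= saturated
    gpu_saturated = gpu_busy_ms / interval_ms >= saturated
    if cpu_saturated and gpu_saturated:
        return "both"
    if cpu_saturated:
        return "cpu"
    if gpu_saturated:
        return "gpu"
    return "neither"


frame_ms = 69.3
# Approximation of the trace described above: each side busy for half the frame.
print(likely_bottleneck(frame_ms / 2, frame_ms / 2, frame_ms))
```

For these numbers the result is "neither": no side is saturated, which is what you would expect when CPU and GPU take turns instead of overlapping. That points at keeping more frames in flight before tuning either side in isolation.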

Vertex/Fragment balancing

The next step moves our focus to vertex and fragment workloads. We generally expect to spend more time processing fragments. If the opposite is true, we would probably achieve a better speedup by either optimizing the vertex shader or reducing the amount of geometry submitted for drawing.

Note that spending more time processing fragments does not mean fragment processing should occupy 100% of GPU time. If that happens, it is a sign that we need to simplify this stage of the graphics pipeline by reducing the complexity of the fragment shader.
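The two rules of thumb above can be written down as a tiny decision helper. This is a sketch only: the function name, its arguments (busy time or cycles for each stage over the same interval) and the 95% limit are hypothetical, not counter names or thresholds defined by Panfrost or gfx-pps.

```python
def stage_hint(vertex_busy, fragment_busy, gpu_busy, fragment_limit=0.95):
    """Suggest where to look first, given per-stage busy time over one interval."""
    if gpu_busy <= 0:
        raise ValueError("gpu_busy must be positive")
    if vertex_busy > fragment_busy:
        # More time on vertices than on fragments is the unexpected case.
        return "vertex"
    if fragment_busy / gpu_busy >= fragment_limit:
        # Fragment work fills (almost) all of the GPU time.
        return "fragment"
    return "balanced"


print(stage_hint(vertex_busy=40, fragment_busy=25, gpu_busy=60))
print(stage_hint(vertex_busy=10, fragment_busy=59, gpu_busy=60))
```

A "vertex" hint corresponds to optimizing the vertex shader or submitting less geometry, and a "fragment" hint to simplifying the fragment shader.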

Tripipe: Arithmetic/Load-Store/Texture balancing

In order to optimize our shaders, it is important to review the Midgard shader core structure, focusing on the Tripipe execution core. The Tripipe is so called because it can run arithmetic, load/store, and texture instructions in parallel. Looking at the counters of these three pipes gives us a hint about which one to focus on.
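One way to read the three pipe counters is to normalize each against the shader core's active cycles and rank them. The sketch below assumes you have already extracted four cycle counts for the same interval. The function name, parameter names and sample values are illustrative and do not match actual counter names.

```python
def pipe_utilization(core_active, arithmetic, load_store, texture):
    """Rank the three pipes by utilization (pipe cycles / core active cycles)."""
    if core_active <= 0:
        raise ValueError("core_active must be positive")
    pipes = {"arithmetic": arithmetic, "load/store": load_store, "texture": texture}
    ranked = [(name, cycles / core_active) for name, cycles in pipes.items()]
    # Busiest pipe first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical sample: the load/store pipe is busy for most of the active cycles.
for name, share in pipe_utilization(1000, arithmetic=400, load_store=900, texture=250):
    print(f"{name}: {share:.0%}")
```

With these made-up numbers the load/store pipe comes out on top, which would suggest looking at memory accesses in the shader before arithmetic or texturing.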

Bandwidth

Last but not least, optimizing memory bandwidth usage is crucial for mobile devices, as it directly results in lower power consumption and better performance.

Note:

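The following formula gives the amount of data transferred to and from external memory, in bytes, over the sampled interval; dividing it by the length of that interval gives the bandwidth: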

( L2 external reads + L2 external writes ) * bus width

While L2 counters are available for tracing, the bus width depends on the specific GPU, e.g. Mali T860 GPUs have a 16-byte AXI bus. Keep in mind that an ideal value for a mobile GPU should stay below 5 GB/s.
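As a worked example of the formula, the sketch below turns two counter values into a bandwidth figure. It assumes the counter values are the increments observed over a known interval. The function name and the sample numbers are invented, while the 16-byte bus width and the 5 GB/s guideline come from the note above.

```python
GUIDELINE_BYTES_PER_S = 5e9  # "below 5 GB/s", taking 1 GB as 10**9 bytes


def external_bandwidth(l2_ext_reads, l2_ext_writes, bus_width_bytes, interval_s):
    """Bytes per second moved over the external bus during `interval_s`."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    transferred = (l2_ext_reads + l2_ext_writes) * bus_width_bytes
    return transferred / interval_s


# Hypothetical counter increments over half a second on a 16-byte bus.
bandwidth = external_bandwidth(100_000_000, 50_000_000, 16, 0.5)
print(f"{bandwidth / 1e9:.1f} GB/s")
print("within guideline" if bandwidth < GUIDELINE_BYTES_PER_S else "above guideline")
```

Here 150 million transfers of 16 bytes each amount to 2.4 GB in half a second, so 4.8 GB/s, just under the guideline.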

Conclusion

The gfx-pps project is a good starting point for strengthening open source performance analysis tooling. As it has proven very valuable for our work on Panfrost, we already plan to implement Perfetto data sources for other GPU families!
