August 21, 2020
Linux system information is available in several scattered forms. We can query kernel events, CPU counters, and memory counters through ftrace, procfs and sysfs, but historically we've lacked a holistic view of the system - including graphics performance counters - to target optimization. But we have now integrated Mali GPU hardware counters supported by Panfrost with Perfetto's tracing SDK, unlocking all-in-one graphics-aware profiling on Panfrost systems!
Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps. It enables you to capture the state of various components of your system into a trace file, which can be loaded into a web-based trace viewer, also available online.
At the time of writing, Perfetto offers a good number of probes to see what is going on in the CPU, how much memory is in use, and other things like power consumption. There is also a GPU probe, but it can only sample the GPU frequency, and only when the driver outputs that information via ftrace.
A key feature of Perfetto is its extensibility. You can feed your own data into Perfetto by either instrumenting your program or making a custom data source. So how about making one to put a magnifying glass on the GPU?
Collabora is working on a new project, gfx-pps, which aims to collect various Perfetto data sources related to graphics hardware and software. In Perfetto terms, a producer is a client process that contributes one or more data sources to the tracing service.
The gfx-pps project is under active development on FreeDesktop's GitLab under the MIT license. It currently includes two data sources: one samples Mali performance counters, while the other generates track events for the Weston timeline.
You can follow the README to compile the tools for your target platform, but everything you need to start tracing your application is already available as artifacts of the project's GitLab CI, on the master and perfetto branches. The artifacts, built for targets including aarch64, are the following:
traced, the tracing service.
traced_probes, the OS probes service.
perfetto, which is the command line tool used for recording traces.
producer-gpu, which provides the Panfrost data source.
gpu.cfg, config file to feed as input to perfetto describing what to trace. This one and other config files can be found under the gfx-pps/scripts directory.
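Once you have everything ready on your target platform, make sure the traced and traced_probes services and the producer-gpu data source are running, then record a trace with the perfetto command line tool: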
perfetto --txt -c gpu.cfg -o trace
Once tracing has finished, you will find a trace file ready to be opened with ui.perfetto.dev.
The golden rule of performance analysis is to find the bottleneck, which means applying Amdahl's law in order to parallelize as much as possible or to optimize the longest of a series of processes. Then it is all about finding the right balance.
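To make the reasoning behind Amdahl's law concrete, here is a minimal sketch in Python. The function name and the sample numbers are illustrative assumptions, not measurements from a trace; the point is only that the overall gain is capped by the part of the work you leave untouched.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime becomes `factor` times faster.

    `fraction` is the share of total runtime spent in the optimized part (0..1),
    `factor` is how much faster that part becomes (> 0).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Hypothetical numbers: doubling the speed of a stage that takes 10% of
    # the frame barely helps, while doubling one that takes 90% helps a lot.
    for share in (0.1, 0.5, 0.9):
        print(f"share={share:.1f} -> overall speedup {amdahl_speedup(share, 2.0):.2f}x")
```

This is why finding the longest stage comes first: the same optimization effort pays off very differently depending on where it is spent.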
The first thing to check is the balance between CPU and GPU workload. If the GPU is idle most of the time, while the CPU is continuously busy, it would not make sense to focus on graphics, but the code running on the CPU needs to be optimized instead.
Once we know the GPU is the bottleneck, we need to make sure to parallelize work on CPU and GPU as much as possible by taking advantage of multi-buffering. Multi-buffering lets us keep multiple frames in flight, exploiting the full potential of the GPU.
The screenshot below shows a trace of WebGL Aquarium taken on an RK3399 processor, which uses a Mali Midgard GPU. You can see a frame generated in 69.3 ms. You can also notice how GPU and CPU activity occupy the first and the second half of the highlighted area, respectively. This suggests that improvements are needed on both sides.
The next step moves our focus to vertex and fragment workloads. We generally expect to spend more time processing fragments. If the opposite is true, we would probably achieve a better speedup by either optimizing the vertex shader or reducing the amount of geometry submitted for drawing.
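As a rough illustration of why running CPU and GPU work one after the other is costly, the sketch below compares a serial frame with an idealized pipelined one. It assumes the 69.3 ms frame splits evenly between CPU and GPU (the trace only shows "roughly half" each) and that multi-buffering allows perfect overlap, which real workloads will not fully reach. All function names are hypothetical.

```python
def frames_per_second(frame_time_ms: float) -> float:
    """Convert a frame time in milliseconds into frames per second."""
    return 1000.0 / frame_time_ms


def serial_frame_time(cpu_ms: float, gpu_ms: float) -> float:
    """Frame time when the GPU only starts once the CPU has finished (no overlap)."""
    return cpu_ms + gpu_ms


def pipelined_frame_time(cpu_ms: float, gpu_ms: float) -> float:
    """Idealized steady-state frame time when CPU work for the next frame
    fully overlaps GPU work for the current one."""
    return max(cpu_ms, gpu_ms)


if __name__ == "__main__":
    # Assumption: an even split of the 69.3 ms frame between CPU and GPU.
    cpu_ms = gpu_ms = 69.3 / 2

    serial = serial_frame_time(cpu_ms, gpu_ms)
    pipelined = pipelined_frame_time(cpu_ms, gpu_ms)

    print(f"serial:    {serial:.1f} ms per frame, {frames_per_second(serial):.1f} FPS")
    print(f"pipelined: {pipelined:.2f} ms per frame, {frames_per_second(pipelined):.1f} FPS")
```

Under these assumptions, overlapping the two halves would at best halve the frame time; beyond that, the CPU and GPU parts each need to be optimized.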
Note that spending more time processing fragments does not mean it should occupy 100% of GPU time. If that happens, it is a sign we need to simplify this stage of the graphics pipeline, by reducing the complexity of the fragment shader.
In order to optimize our shaders, it is important to review the Midgard shader core structure, focusing on the Tripipe execution core. The Tripipe is so called because it can run arithmetic, load/store, and texture instructions in parallel. By looking at the counters of these three pipes, we can find a hint on which one to focus on.
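One simple way to read the three pipe counters is to normalize each by the number of cycles the GPU was active and look at the highest ratio. The sketch below does just that. The dictionary keys and the sample values are made up for illustration; they are not the counter names exposed by Panfrost or gfx-pps, and the counts are assumed to be deltas over the same sampling interval.

```python
from typing import Dict, Tuple


def busiest_pipe(pipe_cycles: Dict[str, int], gpu_active_cycles: int) -> Tuple[str, float]:
    """Return the pipe with the highest utilization and that utilization.

    Utilization is the pipe's busy cycles divided by the GPU active cycles
    measured over the same interval.
    """
    if gpu_active_cycles <= 0:
        raise ValueError("gpu_active_cycles must be positive")
    utilization = {
        name: cycles / gpu_active_cycles for name, cycles in pipe_cycles.items()
    }
    name = max(utilization, key=utilization.get)
    return name, utilization[name]


if __name__ == "__main__":
    # Hypothetical sample: counter deltas collected over one interval.
    sample = {"arithmetic": 700_000, "load_store": 450_000, "texture": 900_000}
    pipe, ratio = busiest_pipe(sample, gpu_active_cycles=1_000_000)
    print(f"busiest pipe: {pipe} ({ratio:.0%} of active cycles)")
```

A pipe that is busy for most of the active cycles is the first candidate to look at, for example fewer or cheaper texture fetches if the texture pipe dominates.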
Last but not least, optimizing memory bandwidth usage is crucial for mobile devices, as it directly results in less power consumption and better performance.
With the following formula you can calculate the bandwidth in bytes:
(L2 external reads + L2 external writes) * bus width
While L2 counters are available for tracing, the bus width depends on the specific GPU, e.g. Mali T860 GPUs have a 16-byte AXI bus. Keep in mind that an ideal value for a mobile GPU should stay below 5 GB/s.
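As a worked example, the sketch below multiplies the sum of the L2 external read and write counters by the bus width, then divides by the sampling interval to get a rate that can be compared against the 5 GB/s guideline. It assumes the counter values are deltas over that interval and that each count corresponds to one bus-width transfer; the counter values themselves are invented for illustration.

```python
def l2_external_bytes(reads: int, writes: int, bus_width_bytes: int) -> int:
    """Bytes moved over the external bus: (reads + writes) * bus width."""
    return (reads + writes) * bus_width_bytes


def bandwidth_gb_per_s(total_bytes: int, interval_s: float) -> float:
    """Average bandwidth in GB/s (1 GB = 1e9 bytes) over the sampling interval."""
    if interval_s <= 0.0:
        raise ValueError("interval_s must be positive")
    return total_bytes / interval_s / 1e9


if __name__ == "__main__":
    BUS_WIDTH_BYTES = 16  # Mali T860 AXI bus width, as stated in the text

    # Hypothetical counter deltas sampled over a 10 ms interval.
    reads, writes, interval_s = 1_000_000, 500_000, 0.010

    total = l2_external_bytes(reads, writes, BUS_WIDTH_BYTES)
    rate = bandwidth_gb_per_s(total, interval_s)
    print(f"{total} bytes in {interval_s * 1000:.0f} ms -> {rate:.2f} GB/s")
    print("within the 5 GB/s guideline" if rate < 5.0 else "above the 5 GB/s guideline")
```

With these sample values the result is 24,000,000 bytes in 10 ms, or 2.4 GB/s, which stays under the guideline.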
The gfx-pps project is a good starting point for empowering the open source world of performance analysis tooling. As it has proven to be greatly valuable for our work on Panfrost, we have already planned to implement Perfetto data sources for other GPU families!