April 08, 2020
Rockchip SoCs, notably the RK3399, are popular in devices such as Chromebooks and single-board computers. Indeed, they bring some interesting features, one of them being the Arm Frame Buffer Compression (AFBC).
To understand why, let's have a look at a typical display pipeline. A GPU usually does the 3D rendering, and its output must ultimately reach the actual display; to get there, it must be formatted suitably by a Display Processor. Nowadays each frame weighs at least a few megabytes, and sometimes several frames are accessed at once. This amount of memory cannot be provided internally by the SoC, so we need to access memory which is external to it, and we need to do this a lot! This, in turn, directly translates into memory bandwidth (which is limited: you can only transfer so much per unit of time), battery power usage and heat generation. These reasons alone justify looking for a way to reduce memory bandwidth usage in a display pipeline.
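To put rough numbers on "a few megabytes" and "a lot", here is a minimal sketch. The resolution, the 4 bytes per pixel and the 60 Hz refresh rate are illustrative assumptions, not figures for any particular device, and the function names are made up for this example.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


def scanout_bandwidth(width, height, fps, bytes_per_pixel=4):
    """Bytes per second read just to scan out a single uncompressed layer."""
    return frame_bytes(width, height, bytes_per_pixel) * fps


# Assumed example: a 1920x1080 layer, 4 bytes per pixel, refreshed at 60 Hz.
size = frame_bytes(1920, 1080)
bandwidth = scanout_bandwidth(1920, 1080, 60)
print(f"{size / 2**20:.1f} MiB per frame")
print(f"{bandwidth / 1e6:.0f} MB/s for scanout alone")
```

This counts only the Display Processor reading one layer. The GPU writing the frame, and any additional layers, add to the total.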
Arm created AFBC exactly to mitigate the problems mentioned in the previous paragraph. The trick is that the image is split into blocks (of e.g. 16 x 16 pixels) and each block's data is compressed. Each block also has a fixed length header associated with it to store compression metadata. All the headers are sent first, followed by blocks.
So, theoretically more memory is used by a compressed frame, but read on to understand how the reduction of memory bandwidth becomes possible.
AFBC will allocate slightly more memory to hold each block's data (headers plus the usual amount for the frame) than a normal uncompressed frame would. However, if the block's data compresses well (and you can compress by up to 50%), much of that allocation won't actually be used. So, depending on the dynamic compression ratio, you can save up to 50% of memory bandwidth, improving both efficiency and performance.
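The allocation overhead can be estimated with a small sketch. It assumes 16x16 blocks, 4 bytes per pixel and a 16-byte header per block, and it ignores any extra alignment or padding that real hardware may require, so treat the result as an approximation. The names are hypothetical.

```python
BLOCK_W = BLOCK_H = 16      # assumed block size in pixels
HEADER_BYTES = 16           # assumed fixed-length header per block
BYTES_PER_PIXEL = 4         # assumed 32-bit pixel format


def block_count(width, height):
    """Number of blocks needed to cover the frame, rounding up at the edges."""
    cols = -(-width // BLOCK_W)
    rows = -(-height // BLOCK_H)
    return cols * rows


def afbc_allocation(width, height):
    """Worst-case buffer size: all headers plus one full-size slot per block."""
    n = block_count(width, height)
    headers = n * HEADER_BYTES
    payload = n * BLOCK_W * BLOCK_H * BYTES_PER_PIXEL
    return headers + payload


def uncompressed_allocation(width, height):
    """Size of a plain linear frame with the same pixel format."""
    return width * height * BYTES_PER_PIXEL


afbc = afbc_allocation(1920, 1080)
plain = uncompressed_allocation(1920, 1080)
print(afbc, plain, f"{(afbc - plain) / plain:.1%} extra")
```

Under these assumptions a 1920x1080 frame needs 8,486,400 bytes instead of 8,294,400, roughly 2.3% more. Part of that comes from the headers and part from rounding the height up to a whole number of blocks.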
The compression is lossless, which wouldn't be that important if frames were to be only displayed to human users, but compression being lossless is crucial if decompressed frames serve as reference frames, because there are no inaccuracies which would otherwise keep accumulating. And despite the compression, the scheme even allows randomly accessing image data down to 4px x 4px! Great? Wait...
The compression scheme is proprietary, so unfortunately we don't know how it works. The AFBC-related parts of hardware in the GPU and Display Processor are total black boxes, which limits the potential of the Open Source community to write great software utilizing AFBC. Nevertheless, AFBC-aware IP blocks should be able to exchange AFBC buffers even if we don't fully understand the buffer contents, so it still is an interesting option for the moment. It does prove useful because of memory bandwidth reduction - but it has trade-offs, including a complete lack of knowledge of its internals. You should know that other vendors have competing implementations of similar schemes, too -- so here's to a proper, open standard for frame buffer compression in the future.
We can still use AFBC in our display pipelines without knowing "what's inside": if the components of our pipeline understand AFBC, we can make them use it and enjoy the benefits. To do so we need to control the involved components and that can be done even without understanding what they do under the hood.
That being said, we can try understanding what's inside an AFBC-compressed frame. You can have a look here, branch afbc-test to see how an all-red frame can be generated. It doesn't take a rocket scientist to notice that for each 16x16 block there is a fixed-length header of 16 bytes. All the headers go first and then all the blocks follow. In the header, the first 4 bytes specify where in the buffer the corresponding block data is, and the fifth byte is 0x07.
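The observed layout can be sketched in a few lines. Only the 16-byte header size, the 4-byte location field and the 0x07 fifth byte come from the observation above. Everything else here is an assumption made for illustration: that the location is a little-endian offset from the start of the buffer, that each block gets a full 1024-byte slot placed right after the header area, and that the remaining 11 header bytes are zero. The function names are hypothetical and this is not a description of the real format.

```python
import struct

HEADER_BYTES = 16          # observed fixed-length header
BLOCK_SLOT_BYTES = 1024    # assumed slot per 16x16 block at 4 bytes per pixel


def build_header_table(n_blocks):
    """Build the header area for n_blocks under the assumptions stated above."""
    header_area = n_blocks * HEADER_BYTES
    table = bytearray()
    for i in range(n_blocks):
        # Assumed: block slots are laid out back to back after all headers.
        offset = header_area + i * BLOCK_SLOT_BYTES
        # 4-byte offset (assumed little-endian), then the observed 0x07 byte,
        # then 11 bytes whose meaning is unknown (zero-filled here).
        table += struct.pack("<IB", offset, 0x07) + bytes(11)
    return bytes(table)


def parse_header(table, index):
    """Return (block offset, fifth byte) for the header at the given index."""
    return struct.unpack_from("<IB", table, index * HEADER_BYTES)


table = build_header_table(4)
for i in range(4):
    offset, fifth = parse_header(table, i)
    print(f"block {i}: data at byte {offset}, fifth byte 0x{fifth:02x}")
```

Keeping the headers together at the front means a reader can find any block with one small lookup, which would fit the random-access property mentioned earlier.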
Inside the block, the first 6 bytes contain specific values and the rest is filled with zeros. So, for an all-red frame, we use only 6 bytes out of 1024 (16 * 16 pixels * 4 bytes per pixel)!
Of course the all-red frame is kind of an extreme case, but you get the idea of how we are limiting memory accesses with AFBC.
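As a back-of-the-envelope estimate of that extreme case, the sketch below counts the bytes that carry information in an all-red frame: a 16-byte header plus 6 payload bytes per 16x16 block. The 1920x1080 resolution is an assumed example, and real memory traffic would likely be higher because memory is normally fetched in bursts rather than byte by byte.

```python
def all_red_estimate(width, height):
    """Return (uncompressed bytes, meaningful AFBC bytes) for an all-red frame."""
    cols = -(-width // 16)    # blocks per row, rounded up
    rows = -(-height // 16)   # block rows, rounded up
    n_blocks = cols * rows
    uncompressed = width * height * 4
    # Per block: 16-byte header plus the 6 payload bytes that are not zero fill.
    meaningful = n_blocks * (16 + 6)
    return uncompressed, meaningful


plain, used = all_red_estimate(1920, 1080)
print(f"{used} of {plain} bytes ({used / plain:.1%})")
```

Under these assumptions about 2.2% of the uncompressed size carries information, which is far better than what typical content can be expected to achieve.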
Since late 2018, the Mali-DP display drivers, malidp and komeda, have had the ability to use AFBC, and now support for Rockchip (RK3399) is also on its way. While the initial work was done by Rockchip in 2014, it unfortunately wasn't upstreamed. Efforts to provide AFBC support for Rockchip in mainline have recently concluded and the feature is available in the drm-misc-next tree. You can find the actual patches here.
The DRM subsystem follows the "library instead of midlayer" approach, which is nicely described in this LWN article from 2009. Consequently, there is the DRM core and there are DRM drivers. The latter are free to use so-called DRM helpers, but there is nothing preventing them from opting out. During the work on AFBC, an idea crystallized on the mailing list in November 2019 to put AFBC handling in helpers. This has proved a good design decision, because thanks to it the core does not deal with a very specific (and proprietary!) extension and, in fact, the use of helpers is purely optional. The patch series first lays the foundation for drivers to allocate struct drm_afbc_framebuffer explicitly in order to be able to do special AFBC-related checks, and now the drivers can opt in to use the new helpers.
We expect AFBC support for Rockchip to land in the next Linux kernel release. It is worth mentioning that existing AFBC users can also benefit from the newly added helper functions (patch). If you have any questions on how to use these new functions, or if you would like to try to bring AFBC support to another AFBC-enabled SoC, please contact us! While we obviously can't create AFBC per se, we can help control the hardware so that it starts using it.