February 11, 2015
Now that Presentation feedback has finally landed in Weston (feedback, flags), people are starting to pay attention to the output timings as now you can better measure them. I have seen a couple of complaints already that Weston has an extra frame of latency, and this is true. I also have a patch series to fix it that I am going to propose.
To explain how the patch series affects Weston's repaint loop, I made some JSON-timeline recordings before and after, and produced some graphs with Wesgr. Here I will explain how the repaint loop works timing-wise.
The old repaint scheduling algorithm in Weston repaints immediately on receiving the pageflip completion event. This maximizes the time available for the compositor itself to repaint, but it also means that clients can never hit the very next vblank / pageflip.
|Figure 1. The old algorithm, the client paints as response to frame callbacks.|
Frame callback events are sent at the "post repaint" step. This gives clients almost a full frame's time to draw and send their content before the compositor goes to "begin repaint" again. In Figure 1. you see, that if a client paints extremely fast, the latency to screen is almost two frame periods. The frame latency can never be less than one frame period, because the compositor samples the surface contents (the "repaint flush" point) immediately after the previous vblank.
|Figure 2. The old algorithm, the client paints as response to Presentation feedback events.|
While frame callback driven clients still get to the full frame rate, the situation is worse if the client painting is driven by presentation_feedback.presented events. The intent is to draw and show a new frame as soon as the old frame was shown. Because Weston starts repaint immediately on the pageflip completion, which is essentially the same time when Presentation feedback is sent, the client cannot hit the repaint of this frame and gets postponed to the next. This is the same two frame latency as with frame callbacks, but here the framerate is halved because the client waits for the frame to be actually shown before continuing, as is evident in Figure 2.
|Figure 3. The old algorithm, client posts a frame while the compositor is idle.|
Figure 3. shows a less relevant case, where the compositor is idle while a client posts a new frame ("damage commit"). When the compositor is idle graphics-wise (the gray background in the figure), it is not repainting continuously according to the output scanout cycle. To start painting again, Weston waits for an extra vblank first, then repaints, and then the new frame is shown on the next vblank. This is also a 1-2 frame period latency, but it is unrelated to the other two cases, and is not changed by the patches.
The modification is simple, yet perhaps counter-intuitive at first. We reduce the latency by adding a delay. The "delay before repaint" is in all the figures, and the old algorithm is essentially using a zero delay. The compositor's repaint is delayed so that clients have a chance to post a new frame before the compositor samples the surface contents.
A good amount of delay is a hard question. Too small delay and clients do not have time to act. Too long delay and the compositor itself will be in danger of missing the vblank deadline. I do not know what a good amount is or how to derive it, so I just made it configurable. You can set the repaint window length in milliseconds in weston.ini. The repaint window is the time from starting repaint to the deadline, so the delay is the frame period minus the repaint window. If the repaint window is too long for a frame period, the algorithm will reduce to the old behaviour.
The following figures are made with a 60 Hz refresh and a 7 millisecond repaint window.
|Figure 4. The new algorithm, the client paints as response to frame callback.|
When a client paints as response to the frame callback (Figure 4), it still has a whole frame period of time to paint and post the frame. The total latency to screen is a little shorter now, by the length of the delay before compositor's repaint. It is a slight improvement.
|Figure 5. The new algorithm, the client paints as response to Presentation feedback.|
A significant improvement can be seen in Figure 5. A client that uses the Presentation extension to wait for a frame to be actually shown before painting again is now able to reach the full output frame rate. It just needs to paint and post a new frame during the delay before compositor's repaint. This mode of operation provides the shortest possible latency to screen as the client is able to target the very next vblank. The latency is below one frame period if the deadlines are met.
This is a relatively simple change that should reduce display latency, but analyzing how exactly it affects things is not trivial. That is why Wesgr was born.
This change does not really allow clients to wait some additional time before painting to reduce the latency even more, because nothing tells clients when the compositor will repaint exactly. The risk of missing an unknown deadline grows the later a client paints. Would knowing the deadline have practical applications? I'm not sure.
These figures also show the difference between the frame callback and Presentation feedback. When a client's repaint loop is driven by frame callbacks, it maximizes the time available for repainting, which reduces the possibility to miss the deadline. If a client drives its repaint loop by Presentation feedback events, it minimizes the display latency at the cost of increased risk of missing the deadline.
All the above ignores a few things. First, we assume that the time of display is the point of vblank which starts to scan out the new frame. Scanning out a frame actually takes most of the frame period, it's not instantaneous. Going deeper, updating the framebuffer during scanout period instead of vblank could allow reducing latency even more, but the matter becomes complicated and even somewhat subjective. I hear some people prefer tearing to reduce the latency further. Second, we also ignore any fencing issues that might come up in practise. If a client submits a GPU job that takes a long while, there is a good chance it will cause everything to miss a deadline or more.
As usual, this work and most of the development of JSON-timeline and Wesgr were sponsored by Collabora.
PS. Latency and timing issues are nothing new. Owen Taylor has several excellent posts on related subjects in his blog.
Our recent efforts on the Hantro kernel driver have resulted in the addition of H.264 decoding support and multiple performance improvements.…
Hwangsaeul, or H8L, a remote surveillance streaming solution, utilizes the capability of libsrt to collect statistics from open SRT sockets…
Complex, real-world correctness tests and performance analysis are now possible thanks to gltrim, a new tool recently added to apitrace,…
Earlier this week, WebRTC became an official W3C and IETF standard. GStreamer has a powerful and rapidly maturing WebRTC implementation.…
Last year, from June to September, I worked on the kernel development tool Coccinelle under Collabora. I implemented a performance boosting…
The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL…