August 20, 2020
RTP is the dominant protocol for low-latency audio and video transport. It sits at the core of many systems used in a wide array of industries, from WebRTC to SIP (IP telephony), and from RTSP (security cameras) to RIST and SMPTE ST 2022 (broadcast TV backend).
Being a flexible, Open Source framework, GStreamer is used in a variety of applications. Its RTP stack has been battle tested in multiple use-cases across all of the aforementioned industries, giving it the distinct advantage of being able to apply optimisations from one use case to another. Without a doubt, GStreamer has one of the most mature and complete RTP stacks available.
Additional unit tests, as well as key fixes and performance improvements to the GStreamer RTP elements, have recently landed in GStreamer 1.18:
The latter in particular provides an important boost in throughput, opening the gate to high bitrate video streaming.
Let's go deeper on that.
One of the essential tasks of GStreamer is to move (push) buffers from an upstream element to the next downstream element, making the pipeline progress.
But what does pushing a buffer mean from a low level point of view?
Elements are connected through pads. Each element has a pad for each possible connection. A pad can either be a "source pad", which the element uses to output buffers, or a "sink pad", which it uses to input buffers. To create a connection between two elements, the application programmer connects the source pad of one element to the sink pad of another. When an element wishes to send a buffer with data to the next element, it "pushes" it onto its source pad, which then chains it to the peer sink pad, which in turn calls into the next element.
The basic tool that an element uses to push a buffer is the gst_pad_push function:
GstFlowReturn gst_pad_push (GstPad * pad, GstBuffer * buffer);
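The pad mechanics described above can be sketched with a small toy model. This is not GStreamer code: every `Toy*` and `toy_*` name is hypothetical, buffers are reduced to plain integers, and all locking is left out. It only shows the shape of the call chain, where a push on a source pad ends up in the chain function of the linked sink pad.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not GStreamer code: all Toy* names are hypothetical. */
typedef struct ToyPad ToyPad;
typedef int (*ToyChainFunc)(ToyPad *sink, int buffer);

struct ToyPad {
  ToyPad *peer;        /* the pad this one is linked to, or NULL */
  ToyChainFunc chain;  /* sink pads only: entry point into the element */
  int received;        /* sink pads only: number of buffers seen */
  int last_buffer;     /* sink pads only: most recent buffer value */
};

/* Link a source pad to a sink pad, as the application programmer would. */
static void toy_pad_link(ToyPad *src, ToyPad *sink)
{
  assert(src != NULL && sink != NULL);
  src->peer = sink;
  sink->peer = src;
}

/* Push a buffer on a source pad: hand it to the peer's chain function.
 * Returns 0 on success, -1 if the pad is not linked to a usable sink pad. */
static int toy_pad_push(ToyPad *src, int buffer)
{
  if (src->peer == NULL || src->peer->chain == NULL)
    return -1;
  return src->peer->chain(src->peer, buffer);
}

/* Chain function of a trivial downstream element: record the buffer. */
static int toy_sink_chain(ToyPad *sink, int buffer)
{
  sink->received++;
  sink->last_buffer = buffer;
  return 0;
}
```

In this model, pushing on an unlinked pad fails, while pushing after `toy_pad_link()` runs `toy_sink_chain()` in the same thread as the caller, which is the property the rest of the article relies on.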
A buffer push is actually a series of intricate function calls and lock acquisitions. The sequence is as follows:
The element calls the gst_pad_push() function on its source pad.
As you can see from this incomplete list, each transfer of a buffer, even though it happens on one thread, involves a number of mutex locks and other atomic operations, which are relatively costly on modern pipelined processors. When profiling a GStreamer pipeline, this is the part that causes the most overhead when transmitting a large number of small buffers.
Is it possible to do better?
GStreamer has a mechanism called "buffer list" which can be used to reduce the overhead of pushing a single buffer.
The entry point for an element to use this functionality is the gst_pad_push_list function.
GstFlowReturn gst_pad_push_list (GstPad * pad, GstBufferList * list);
Buffer lists group together a number of buffers so that they are forwarded through the pipeline as one operation. This can significantly reduce the overhead, as the sequence of operations described above happens once per list and not once per buffer.
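To make the "once per list, not once per buffer" argument concrete, here is a minimal cost model. It is a sketch under stated assumptions, not a measurement: `SYNC_OPS_PER_PUSH` is an arbitrary illustrative constant, and the `toy_*` functions are hypothetical helpers that only count how many synchronisation operations would be paid.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cost model, not GStreamer code. SYNC_OPS_PER_PUSH is an
 * illustrative assumption, not a measured value. */
enum { SYNC_OPS_PER_PUSH = 6 };

/* Synchronisation operations needed to push n buffers one at a time:
 * the fixed per-push overhead is paid for every buffer. */
static unsigned long toy_cost_individual(size_t n_buffers)
{
  return (unsigned long) n_buffers * SYNC_OPS_PER_PUSH;
}

/* Synchronisation operations needed to push the same buffers grouped in
 * lists of at most list_size buffers: the overhead is paid once per list. */
static unsigned long toy_cost_lists(size_t n_buffers, size_t list_size)
{
  assert(list_size > 0);
  size_t n_lists = (n_buffers + list_size - 1) / list_size; /* round up */
  return (unsigned long) n_lists * SYNC_OPS_PER_PUSH;
}
```

With these assumed numbers, 1000 buffers cost 6000 operations when pushed individually and 120 when pushed in lists of 50. The real ratio depends on the elements involved and on how much work each buffer needs once it has been delivered.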
In case some elements do not support chaining buffer lists, GStreamer provides a fall-back mechanism like gst_pad_chain_list_default to push buffers one by one under the hood. This means that elements can always implement processing buffers in a list independently from the level of support in other elements.
This is nice for compatibility and allows incremental refinements. However, to actually avoid the bottlenecks of pushing individual buffers and to get the biggest performance improvements, all elements in a pipeline should natively support chaining buffer lists (i.e. have their own chainlist function installed on their sink pads).
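The difference between the fall-back path and native support can be illustrated with another toy model. Again, this is not the GStreamer implementation: the `Toy*` types and `toy_*` functions are hypothetical, buffers are plain integers, and `toy_chain_list_default()` only imitates the idea of a default that unrolls the list and chains each buffer individually.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not GStreamer code: all Toy* names are hypothetical. */
typedef struct ToySinkPad ToySinkPad;
typedef int (*ToyChainFunc)(ToySinkPad *pad, int buffer);
typedef int (*ToyChainListFunc)(ToySinkPad *pad, const int *list, size_t n);

struct ToySinkPad {
  ToyChainFunc chain;           /* mandatory per-buffer handler */
  ToyChainListFunc chain_list;  /* optional native list handler */
  unsigned calls;               /* how many times the element was entered */
  long total;                   /* sum of all buffer values processed */
};

/* Fallback: unroll the list and chain each buffer individually. */
static int toy_chain_list_default(ToySinkPad *pad, const int *list, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    int ret = pad->chain(pad, list[i]);
    if (ret != 0)
      return ret;  /* stop at the first error */
  }
  return 0;
}

/* Deliver a list: use the native handler if installed, else fall back. */
static int toy_pad_push_list(ToySinkPad *pad, const int *list, size_t n)
{
  assert(pad->chain != NULL);
  if (pad->chain_list != NULL)
    return pad->chain_list(pad, list, n);
  return toy_chain_list_default(pad, list, n);
}

/* Per-buffer handler: the element is entered once per buffer. */
static int toy_chain(ToySinkPad *pad, int buffer)
{
  pad->calls++;
  pad->total += buffer;
  return 0;
}

/* Native list handler: the element is entered once for the whole list. */
static int toy_chain_list(ToySinkPad *pad, const int *list, size_t n)
{
  pad->calls++;
  for (size_t i = 0; i < n; i++)
    pad->total += list[i];
  return 0;
}
```

Both pads process the same data, but a pad with only `toy_chain` installed is entered once per buffer, while a pad that also installs `toy_chain_list` is entered once per list. The caller does not need to know which kind of pad it is talking to, which is what makes incremental adoption possible.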
The RTP specification, described in RFC 3550, defines a set of rules for the association of participants during a conversation using RTP. This is called an "RTP Session".
In GStreamer, the core element that implements the session management is rtpsession.
The rtpsession element already had support for buffer lists in its send path but not in its receive path.
Let's consider the following pipeline built around the rtpsession element:
gst-launch-1.0 -e \
  rtpsession name=rtpsess \
  videotestsrc ! imagefreeze num-buffers=10000 ! video/x-raw,format=RGB,width=320,height=240 ! rtpvrawpay ! rtpsess.recv_rtp_sink \
  rtpsess.recv_rtp_src ! fakesink async=false sync=false
A test stream is generated (imagefreeze is used to reduce CPU usage in this case), split into RTP packets, processed by rtpsession, and consumed by a fakesink.
The upstream element (rtpvrawpay) and downstream element (fakesink) could already chain buffer lists, but rtpsession could not.
After enabling buffer lists in rtpsession, the element throughput improved dramatically:
A simplified visual interpretation can be obtained using flamegraphs.
When pushing individual buffers the call graph is deeper:
When pushing buffer lists the call graph is more balanced:
To be fair, this huge improvement is only achievable in controlled use cases. The boost in a generic real-world scenario is currently mitigated by other factors.
The rtpsession element is usually not used directly but via rtpbin, which, depending on the scenario, also connects it to other elements (like rtpssrcdemux). The input may also come from a remote source, like udpsrc.
Consider this more realistic pipeline:
gst-launch-1.0 -e \
  rtpbin name=rtpbin \
  udpsrc port=5000 caps='application/x-rtp,media=(string)video,clock-rate=(int)90000,encoding-name=RAW,payload=96,sampling=RGB,depth=(string)8,width=(string)320,height=(string)240' ! queue ! rtpbin.recv_rtp_sink_0 \
  rtpbin. ! fakesink async=false sync=false \
  udpsrc port=5001 caps=application/x-rtcp ! queue ! rtpbin.recv_rtcp_sink_0 \
  rtpbin.send_rtcp_src ! queue ! udpsink host=127.0.0.1 port=5003 sync=false async=false
This is the receiving pipeline for one sender. The two udpsrc elements receive RTP and RTCP respectively; rtpbin handles all the RTP details, delivering media data to fakesink and sending RTCP replies for the other participant via udpsink.
Unless all elements support pushing buffer lists natively there will still be bottlenecks due to individual buffer pushes.
See a comparison of before and after using buffer lists in rtpsession with a pipeline that uses rtpbin:
The improvement is there but it is not as dramatic as in the controlled scenario.
The improvements in rtpsession available in GStreamer 1.18 are an important step towards a more efficient RTP implementation in high bitrate scenarios, but further work is needed (e.g. enabling buffer lists in udpsrc) to bring some of the theoretical improvements into practical usage.