
GStreamer 1.28 brings AI inference to your media pipeline

Olivier Crête

February 17, 2026

Over the last year, we've been working very hard at Collabora to make mainline GStreamer the premier tool for combining media pipelines, GStreamer's core strength, with machine learning. In past releases, we introduced a way to encode metadata such as the position of objects in images, their type, and so on, and, more importantly, the relationships between these pieces of data that make it possible to comprehend all this information together. In this release, we've implemented many elements to turn this vision into a reality.

Inference engines: ONNX-Runtime, LiteRT, and Burn

When it comes to machine learning, the first thing we need is an inference engine. In past releases, we had already included an element to support the ONNX Runtime project. In this release, Daniel Morin and I improved it, porting it from C++ to C and cleaning it up in the process.

We've added support for the VeriSilicon backend, enabling it to use their NPU, which is present on a number of SoCs such as the ST STM32MP2 family, the Amlogic A311D, the NXP i.MX 8M Plus, and more. Sjoerd Simons also contributed a last-minute fix, correcting a crash.

In addition, we've added a plugin for the LiteRT inference engine. Formerly known as TensorFlow Lite, it is Google's embedded offering; it is included in Android and is a popular choice on a number of embedded platforms. This is the result of a long project by Daniel Morin and Denis Shimizu. In addition to the core work they've accomplished, I've added support for the Google EdgeTPU backend as well as the VeriSilicon backend, offering multiple options for embedded developers. We called this element "tfliteinference" because the library is still called tflite, even though Google has officially renamed the project.

I've also simplified the usage of the inference elements by automatically guessing whether the model takes its data as planar or interleaved, or, as the AI people call it, CHW or HWC. There is no longer a property for this; the layout is now inferred from the dimensions of the input tensor of each model.
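To give an idea of how such a guess can work, here is a minimal sketch of the heuristic in Python (illustrative only; the actual implementation in GStreamer may differ): for image models, the channel dimension is almost always 1, 3, or 4, so its position gives away the layout.

    def guess_layout(shape):
        """Guess whether an image input tensor is planar (CHW) or interleaved (HWC)."""
        # Drop a leading batch dimension if present.
        dims = shape[1:] if len(shape) == 4 else shape
        if dims[0] in (1, 3, 4):    # channels first -> planar
            return "CHW"
        if dims[-1] in (1, 3, 4):   # channels last -> interleaved
            return "HWC"
        raise ValueError(f"cannot guess layout for shape {shape}")

    print(guess_layout((1, 3, 224, 224)))  # CHW, typical of models exported from PyTorch
    print(guess_layout((1, 224, 224, 3)))  # HWC, typical of LiteRT models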

The output of machine learning models is a set of tensors, which are simply multi-dimensional arrays of numbers. Most models will output multiple tensors and often give them meaningless names like output0 and output1, or even just an index. Those names can even be unexpectedly changed when converting between machine learning frameworks, for example, when exporting from PyTorch to ONNX or LiteRT. In order to know which tensor is which, we had embedded an additional name into ONNX files consumed by the ONNX Runtime. We thought this would be convenient as the ONNX format has an easily accessible table of metadata that can be modified with simple Python scripts. Sadly, the LiteRT framework doesn't have a similar concept, so we had to create a sidecar file, which we call a ModelInfo file. It is an ini-style file containing a description of each tensor, its type, its dimensions, its name in the model file, and a recognizable, unique name. This way, the LiteRT inference element can attach this name as metadata to each tensor it outputs, and we can later decode it.
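To make the idea concrete, a ModelInfo file for a hypothetical two-tensor detection model could look something like the sketch below. This is purely illustrative: the actual section and key names are defined by the GStreamer analytics tooling and may differ.

    ; hypothetical-model.modelinfo -- illustrative only
    [output0]
    id = example/detection-boxes    ; the recognizable, unique name
    type = float32
    dims = 1,100,4                  ; 100 candidate boxes, 4 coordinates each

    [output1]
    id = example/detection-scores
    type = float32
    dims = 1,100                    ; one confidence score per box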

We found this concept so useful that Daniel Morin also ported the ONNX Runtime element to use the same ModelInfo file. The main reason for this is that we had realized that having to modify the ONNX file often created friction for our users, and we hope a separate ModelInfo file will make their lives easier. Daniel Morin even created a Python tool to simplify creating ModelInfo files.

Sebastian Dröge, a key contributor from the community, also implemented two useful things. The first is an inference element in Rust, using the Burn framework; the second, a batching element, is described below. Sadly, because of the way the Burn framework is designed, the specific type of model is set at compile time, so the element currently only supports the YoloX model.

Making sense of tensors

Once we have tensors and know their names, we can interpret them. This requires code specific to the type of tensor, which often means specific to a model or a family of models. We already had a decoder for the SSD MobileNet family of models, but we've added many more in this release.

As tensor names are not yet standardised in the industry, we created a tensor ID registry, where we try to document the encoding (the format) of the output of as many models as we can, starting with all of those supported by GStreamer. We hope others will join us and that this will become a community effort beyond the needs of the GStreamer project.

First, Daniel added a tensor decoder that works for most image classification models; in this case, the tensor is a vector of probabilities, one per class. Generally, these need to be normalized with a softmax function, but sometimes the softmax is already part of the model, so I've added a mode for that variant as well.
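For reference, the softmax step is just a normalization of the raw scores (logits) into a probability distribution; a minimal Python version of the math (not the decoder's actual code) looks like this:

    import math

    def softmax(logits):
        """Turn raw class scores into probabilities that sum to 1."""
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    print(softmax([2.0, 1.0, 0.1]))  # -> roughly [0.66, 0.24, 0.10]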

Raghavendra Rao added support for another quite popular model, this one for detecting faces. It is fast and accurate for this specific use case, making it quite useful for counting people or anonymizing videos. We demonstrated it running on the STM32MP257 at SIDO in Lyon last year, and we will be demonstrating it again at Embedded World 2026.

That said, the most popular family of models for image analytics is the YOLO family. Daniel Morin, Vineet Suryan, and Santosh Mahto collaborated to create decoders for the object detection, classification, and segmentation variants of the YOLO v8-11 and FastSAM models. This should make GStreamer useful out of the box for many more use cases.

Sebastian also wrote the matching YoloX tensor decoder in Rust, and because of GStreamer's modular architecture, it not only works with the Burn inference element, but also with any others such as the ONNX Runtime and LiteRT ones.

All of those models operate on frames one at a time. While they can find objects, they can't tell that a detection in one frame and a detection in the next are the same object. A simple technique is to assume that an object hasn't moved much since the last frame and match detections whose bounding boxes overlap. The most common way to score that overlap is to calculate the Intersection-over-Union (IoU), and Santosh implemented an element to do this.
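The metric itself is straightforward: the overlap area of the two bounding boxes divided by the area of their union. A minimal sketch in Python (not the element's actual code):

    def iou(box_a, box_b):
        """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
        # Compute the intersection rectangle.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        if inter == 0:
            return 0.0
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # Detections of the same object in consecutive frames overlap heavily:
    print(iou((10, 10, 50, 50), (12, 14, 52, 54)))  # ~0.75 -> likely the same object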

As we implemented all of those elements, we've been finding more and more things that are common to many elements, and we've been adding helper functions to the shared analytics library to make it easier to implement more tensor decoders in the future.

What's my tensor? And which decoder do I plug in?

When writing an ML inference application, there are two parts that must always go together: the first is the model file, containing the weights and the architecture of the model, and the second is the tensor decoder, which can make sense of the output. In a way, this is similar to decoding a movie, where we have a file containing data in a specific codec and need to load the correct plugin to decode it. Daniel Morin thought it should be just as easy, and he created a tensordecodebin that works just like our trusted decodebin does for video. Based on the model file loaded, it gets the information about each tensor from the caps and loads the right tensor decoder, making the whole operation seamless.
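As a sketch of what this enables, a Python application could build such a pipeline with Gst.parse_launch. The element names onnxinference and tensordecodebin are the ones described in this post, but the property names and surrounding elements are assumptions; check gst-inspect-1.0 against your build before relying on them.

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)
    # Hypothetical pipeline: the model path and the model-file property
    # name are placeholders, and bus/error handling is omitted.
    pipeline = Gst.parse_launch(
        "v4l2src ! videoconvertscale ! "
        "onnxinference model-file=detection-model.onnx ! "  # or tfliteinference
        "tensordecodebin ! "  # picks the right tensor decoder from the caps
        "objectdetectionoverlay ! videoconvert ! autovideosink"
    )
    pipeline.set_state(Gst.State.PLAYING)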

As is often the case with things that seem to "just work," a large amount of effort went into the framework to make it happen. In particular, Daniel had to design a way to describe tensors as part of the GStreamer caps. This required adding a new concept to the GStreamer caps, a "set" or, as we had to call it, a "GstValueUniqueList". This was needed because a tensor decoder needs a specific set of tensors, but the order doesn't matter. GStreamer only had an ordered list previously. Daniel also had to improve the caps subset functionality to produce the correct result in more cases, adding many new tests in the process.

Transforming this metadata along with the image

In GStreamer, we can attach metadata to each buffer as a GstMeta, and this is the mechanism we used to attach all of the object detection, segmentation, and classification information to each buffer. We also have a mechanism to transform this metadata when a buffer containing an image goes through a transformation element. For example, if the image is scaled to a smaller size, we can scale the metadata with it. However, we only had the simplest transformations defined: copying and scaling. We could not describe rotations or crops. At the GStreamer hackfest in Nice last May, I took it upon myself to fix this by creating a new type of transformation. At first, I called it CropScale, but after multiple iterations, and thanks to input from the community, I've added a transformation matrix, making it possible to express any affine transformation, such as rotations, scaling, shears, symmetries, and translations. I then went on a campaign to implement this new operation in as many places as I could, going through videoconvertscale, overlaycomposition, videobox, videocrop, compositor, glvideomixer, videoflip, and videorotate, and also implementing it in the popular Region of Interest meta as well as in the Analytics metas, such as the object detection and segmentation metas, which encode positions within an image.
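The underlying idea is standard 2D geometry: with homogeneous coordinates, a single 3x3 matrix can express any combination of crop, scale, rotation, shear, or flip, and metadata coordinates just get multiplied through it. A sketch of the concept (not GStreamer's actual API):

    def apply_affine(m, point):
        """Apply a 3x3 affine matrix to an (x, y) point in homogeneous coordinates."""
        x, y = point
        return (m[0][0] * x + m[0][1] * y + m[0][2],
                m[1][0] * x + m[1][1] * y + m[1][2])

    # Cropping 100 px off the left and then upscaling 2x combine into one matrix:
    crop_then_scale = [[2, 0, -200],
                       [0, 2,    0],
                       [0, 0,    1]]
    # A bounding-box corner at (150, 80) in the original image lands at:
    print(apply_affine(crop_then_scale, (150, 80)))  # (100, 160)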

Overlaying the information in the video

Now that we support all of those models and produce many types of metadata, it is often useful to draw that metadata on the images for visualization. We already had an element to draw the bounding boxes of detected objects; I've enhanced it to be able to draw filled rectangles, turning it into a privacy filter. I've also added the possibility of drawing the boxes in different colors when a tracker is used (such as the aforementioned IoU tracker) and of writing the tracking number of each object, which makes it easier to debug trackers.

Daniel Morin also contributed an element that draws segmented regions as semi-transparent overlays.

Batch it up

Sebastian's second contribution is a batching element, also written in Rust, with the core batching concept placed in the main libgstanalytics. His batching element works with heterogeneous data, making it possible, for example, to batch audio and video together. I've improved it by simplifying the way the batch is stored, to facilitate the implementation of an element that does homogeneous batching, where each member of the batch is of the exact same type. I hope to complete this element soon so we can include it in a future GStreamer release.

And so much more

As we worked through this, we made a few more small improvements to the framework. Daniel added a "tensor" mtd, which allows tensors to be attached as metadata and related to specific objects or segments. This will make it possible to implement more clever tracking efficiently in the future.

Santosh improved the GStreamer Python bindings, adding native Python iterators for the analytics meta, meaning that Python code can now be more pythonic when dealing with tensors produced by GStreamer.
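As a hedged illustration of what this looks like from an application (the exact names exposed by the GstAnalytics introspection bindings may differ), iterating the relation meta can now read like ordinary Python:

    import gi
    gi.require_version("Gst", "1.0")
    gi.require_version("GstAnalytics", "1.0")
    from gi.repository import Gst, GstAnalytics

    def on_buffer(buffer):
        # Assumption: this is how the relation meta accessor is exposed
        # through GObject introspection; verify against your bindings.
        meta = GstAnalytics.buffer_get_analytics_relation_meta(buffer)
        if meta is None:
            return
        # The new native iterators let us walk the attached mtds directly
        # instead of using index-based, C-style accessors.
        for mtd in meta:
            print(mtd)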

Looking forward

As you can see, we've made so much progress in the past year, but we're not stopping. Vineet wrote an inference element for the ExecuTorch inference library, an embedded member of the PyTorch family, and it should be merged any day now. Daniel Morin is working on adding support for video frames in floating point formats native to GStreamer, which should make it easier to implement the most efficient pipelines utilizing the platform hardware capabilities to their fullest. And I'm working on a second tracker using DeepSORT, which not only tracks object positions but also their content by comparing feature vectors produced by AI models.

If you're looking to integrate GStreamer to maximize your video or audio analytics, we can help. With over 20 years of GStreamer expertise, your project is in experienced hands. Get in touch today.

 
