MLfix to quickly fix datasets

Vineet Suryan
June 08, 2023

Introduction

Unlike in traditional software development, in machine learning data matters more than code, and data labelling matters most of all. Building a high-performing model requires reliable, precisely labelled data, but poor-quality data is not always easy to spot.

Working with data is critical for ML researchers because the resulting model depends on the quantity and quality of the data used. Erroneous data points are far more common in machine learning datasets than one might expect, and detecting them early is crucial for the performance of downstream tasks. However, manually inspecting each data point is neither efficient nor feasible. According to a study evaluating the impact of data cleaning on ML classification tasks, data quality affects ML model performance, and data scientists spend a considerable amount of time on data cleaning before model training.

Deep learning algorithms tolerate random mistakes in the training set rather well. As long as the errors are inadvertent and reasonably random, it is fine not to spend too much time correcting them. For example, a few mislabelled stop signs in a traffic sign training dataset would not significantly affect the model. Deep learning algorithms are not, however, robust to systematic mistakes: if a notably large number of stop sign bounding boxes are mislabelled, or the bounding box annotation itself is inconsistent, the model will learn that pattern.

MLfix

Building on open-source software, Collabora has developed MLfix, which helps identify and filter out labelling errors in machine learning datasets quickly and efficiently. To assist annotators and machine learning engineers in finding and removing these errors, MLfix combines unsupervised machine learning pipelines with a novel user interface concept. We used this tool to spot errors in our synthetic datasets generated with the CARLA simulator for vehicle and traffic sign detection.

Key takeaways

Synthetic datasets generated via simulators can have flaws too, and should be verified and analyzed before machine learning models are trained on them.
We could not find any open-source tool that quickly identifies errors in a huge dataset; the only alternative was to analyze each sample manually.
MLfix is open source. With its user-friendly interface, it reduces the time and resources required to identify errors in an image dataset.

Carla Synthetic Datasets

Developing autonomous driving systems requires a massive amount of training data, typically gathered and labelled by humans, which is both costly and error-prone. Instead, we use Carlafox, a web-based CARLA visualizer covered in a dedicated blog post, to generate a large number of training samples with automatic ground-truth annotations. We assumed that this would give us error-free data simply because the annotations come from a simulator rather than from humans, who are prone to mistakes when a task is as repetitive as annotating traffic signs or vehicles. We then curated two object detection datasets with Carlafox: one for a 3D object detection task and one for a traffic sign detection task.

Finding a needle in a haystack with MLfix

As shown in our previous experiments, MLfix successfully identified errors even in carefully curated datasets such as the Mapillary traffic sign dataset, which improved the performance of the models. We therefore ran our synthetic datasets through MLfix, which helped us identify many critical issues in our CARLA synthetic data generation pipeline.

MLfix's interface makes it really easy to spot errors in almost all image-based datasets:

Figure 1: MLfix interface to spot errors.

One of the significant issues with the traffic sign bounding boxes was incorrect width annotation. When only part of a traffic sign is visible within the camera frame as the vehicle's camera sensor approaches it, our data generation pipeline annotates the sign incorrectly, as shown in the examples below:

Figure 2: Bounding box inconsistency in the data generation pipeline.
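One common way to handle partially visible objects is to clip each projected box to the image frame and measure how much of it remains. The sketch below illustrates that idea only; it is not the Carlafox pipeline, and the box layout, the function names and the 1280x720 frame size are assumptions made for the example.

```python
from typing import Optional, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def clip_box(box: Box, width: int, height: int) -> Optional[Box]:
    """Clip a box to the image frame; return None if nothing is visible."""
    x_min, y_min, x_max, y_max = box
    cx_min, cy_min = max(x_min, 0.0), max(y_min, 0.0)
    cx_max, cy_max = min(x_max, float(width)), min(y_max, float(height))
    if cx_max <= cx_min or cy_max <= cy_min:
        return None
    return (cx_min, cy_min, cx_max, cy_max)


def visible_fraction(box: Box, width: int, height: int) -> float:
    """Fraction of the unclipped box area that lies inside the frame."""
    x_min, y_min, x_max, y_max = box
    full_area = (x_max - x_min) * (y_max - y_min)
    clipped = clip_box(box, width, height)
    if clipped is None or full_area <= 0:
        return 0.0
    cx_min, cy_min, cx_max, cy_max = clipped
    return (cx_max - cx_min) * (cy_max - cy_min) / full_area


# A sign whose projection extends past the right edge of the frame.
box = (1200.0, 300.0, 1360.0, 400.0)
print(clip_box(box, 1280, 720))          # (1200.0, 300.0, 1280.0, 400.0)
print(visible_fraction(box, 1280, 720))  # 0.5
```

A pipeline could then keep the clipped box, or drop the annotation when the visible fraction falls below a chosen threshold.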

There is no API in CARLA to determine whether a traffic sign faces the camera or faces away from it. With MLfix it was straightforward to see that our dataset also contained traffic signs facing away from the camera, which helped us find inconsistencies in the code we use to filter them out:

Figure 3: Opposite facing traffic signs.
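A facing check of this kind can be written as a dot product between the direction the sign's face points in and the direction from the sign to the camera. The following is a minimal sketch under stated assumptions, not our actual filter: it assumes the sign's face normal is known in world coordinates, and `is_facing_camera` and its parameters are hypothetical names.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_facing_camera(sign_pos: Vec3, sign_normal: Vec3, camera_pos: Vec3,
                     min_cos: float = 0.0) -> bool:
    """True if the sign's face normal points towards the camera.

    min_cos is the minimum cosine of the angle between the normal and the
    sign-to-camera direction; 0.0 accepts anything under 90 degrees.
    """
    to_camera = (camera_pos[0] - sign_pos[0],
                 camera_pos[1] - sign_pos[1],
                 camera_pos[2] - sign_pos[2])
    dist = dot(to_camera, to_camera) ** 0.5
    normal_len = dot(sign_normal, sign_normal) ** 0.5
    if dist == 0.0 or normal_len == 0.0:
        return False  # degenerate input: the direction is undefined
    cos_angle = dot(sign_normal, to_camera) / (dist * normal_len)
    return cos_angle > min_cos


camera = (0.0, 0.0, 1.5)
sign = (10.0, 0.0, 2.0)
print(is_facing_camera(sign, (-1.0, 0.0, 0.0), camera))  # True: faces the camera
print(is_facing_camera(sign, (1.0, 0.0, 0.0), camera))   # False: faces away
```

Whether a simulator's forward vector for a sign actor actually coincides with the sign's printed face is itself an assumption that is worth verifying visually on a few samples.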

Since CARLA provides a label for each traffic sign, we assumed that each label would correspond to a single sign design. With MLfix we easily identified that the class No Turns contained two completely different-looking traffic signs:

Figure 4: No Turns label with different looking traffic signs.
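A rough automated hint for this kind of problem is to measure how spread out the samples of each class are in some feature space. The sketch below is an illustration only and does not describe how MLfix works internally: it assumes each image crop already has a feature vector (for example from a pretrained network), and the toy 2D vectors, the function names and the threshold are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def class_spread(vectors: List[Sequence[float]]) -> float:
    """Mean Euclidean distance of a class's vectors to their centroid."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return sum(math.dist(v, centroid) for v in vectors) / len(vectors)


def flag_inconsistent_classes(features: Dict[str, List[Sequence[float]]],
                              max_spread: float) -> List[str]:
    """Return the labels whose samples are unusually spread out."""
    return [label for label, vectors in features.items()
            if class_spread(vectors) > max_spread]


# Toy feature vectors: "No Turns" contains two clearly separated groups.
features = {
    "No Turns": [(0.0, 0.0), (0.0, 0.2), (5.0, 5.0), (5.0, 5.2)],
    "Stop": [(1.0, 1.0), (1.1, 1.0), (0.9, 1.0)],
}
print(flag_inconsistent_classes(features, max_spread=1.0))  # ['No Turns']
```

A large spread only suggests that a class deserves a closer look; confirming that two different signs share one label still requires looking at the images.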

We also used Carlafox to gather a perception dataset. Again, we assumed that this dataset contained error-free ground-truth annotations because a simulator annotated the samples, but thanks to MLfix we could see that we were mistaken.

Our data generation pipeline was annotating numerous fully occluded objects, and looking at the samples through MLfix showed that this issue was fairly common. With MLfix, it was easy to identify bounding boxes that, because of significant occlusion, showed a different object than the assigned label:

Figure 5: Identified occluded objects.
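One possible way to catch such cases automatically is to compare the rendered depth inside each 2D box with the known distance to the annotated object. The sketch below is an assumption-laden illustration rather than our pipeline: it assumes a per-pixel depth image in metres aligned with the RGB frame (for example from a depth camera at the same pose), and `occluded_fraction`, its parameters and the tolerance are hypothetical.

```python
from typing import List, Tuple


def occluded_fraction(depth_map: List[List[float]],
                      box: Tuple[int, int, int, int],
                      object_distance: float,
                      tolerance: float = 1.0) -> float:
    """Fraction of box pixels covered by something closer than the object.

    depth_map[y][x] holds the rendered depth in metres; box is
    (x_min, y_min, x_max, y_max) in pixels, with exclusive max edges.
    """
    x_min, y_min, x_max, y_max = box
    total = 0
    blocked = 0
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            total += 1
            # A pixel clearly nearer than the object belongs to an occluder.
            if depth_map[y][x] < object_distance - tolerance:
                blocked += 1
    return blocked / total if total else 1.0


# Toy 4x4 depth image: a wall at 5 m hides three of the four columns
# in front of an object annotated at 20 m.
depth_map = [[5.0, 5.0, 5.0, 20.0] for _ in range(4)]
print(occluded_fraction(depth_map, (0, 0, 4, 4), object_distance=20.0))  # 0.75
```

Annotations whose occluded fraction exceeds a chosen threshold could then be dropped or marked as occluded instead of being exported as regular ground truth.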

Another unexpected error was the width of the cyclists. We suspect that the CARLA API returns a width of zero for some cyclists; this still needs further testing and debugging.

Figure 6: Cyclist 3D bounding box width is zero.
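Degenerate boxes like these can be caught with a simple sanity check before a dataset is exported. The sketch below assumes a hypothetical `Box3D` record with dimensions in metres and an arbitrary minimum size; it does not reflect the CARLA API or our export format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box3D:
    """Hypothetical 3D annotation record; dimensions are in metres."""
    label: str
    length: float
    width: float
    height: float


def find_degenerate(boxes: List[Box3D], min_size: float = 0.05) -> List[Box3D]:
    """Return boxes with at least one dimension below min_size."""
    return [b for b in boxes
            if min(b.length, b.width, b.height) < min_size]


boxes = [
    Box3D("car", 4.5, 1.8, 1.5),
    Box3D("cyclist", 1.8, 0.0, 1.7),  # zero width, as in Figure 6
]
for bad in find_degenerate(boxes):
    print(bad.label, bad.width)  # cyclist 0.0
```

Counting the flagged boxes per class also shows at a glance whether the problem is limited to one object type, as it appeared to be for cyclists here.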

MLfix is quite simple and yet very effective in spotting errors in a huge dataset. If not for MLfix, it would have been a tedious task to go through one sample at a time and try to figure out the errors in both of our synthetic datasets. We plan on overcoming these issues by fixing our data generation pipeline.

Outlook

Numerous open-source resources made our work possible. We hope our contributions will help others find errors in machine learning datasets and train better neural networks. MLfix is open source and the code has been released on GitHub.

If you have questions or ideas on how to analyze your datasets, join us on our Gitter #lounge channel or leave a comment in the comment section.
