MLfix to quickly fix datasets

Vineet Suryan
June 08, 2023


Contrary to traditional software development, in machine learning data matters more than code, and the data labelling part in particular. Building a high-performing model requires reliable, precisely labelled data, but poor-quality data is not always obvious.

Working with data is critical for ML researchers because the resulting model depends on both the quantity and the quality of the data used. Erroneous data points are far more common in machine learning datasets than expected, and detecting them early is crucial for the performance of downstream tasks. However, manually inspecting each data point is neither efficient nor feasible. According to a study evaluating the impact of data cleaning on ML classification tasks, data quality directly affects model performance, and data scientists spend considerable amounts of time on data cleaning prior to model training.

Deep learning algorithms can withstand random mistakes in the training set rather well. It is acceptable not to spend too much time correcting errors as long as they are inadvertent and roughly random. For example, a few mislabelled stop signs in a traffic sign training dataset would not significantly affect the model. On the other hand, deep learning algorithms are not resistant to systematic mistakes: if a notably large number of stop sign bounding boxes are mislabelled, or the bounding box annotation itself is inconsistent, the model will learn that pattern.


Using open-source software, Collabora has developed MLfix, which helps identify and filter out labelling errors in machine learning datasets quickly and efficiently. To assist annotators and machine learning engineers in finding and removing labelling errors, MLfix combines unsupervised machine learning pipelines with a novel user interface concept. We used this tool to spot errors in the synthetic datasets we generated from the CARLA simulator for vehicle and traffic sign detection.

Key takeaways

  • Synthetic datasets generated via simulators can have flaws too; they should be verified and analyzed before machine learning models are trained on them.
  • Other than manually analyzing each sample, we could not find any open-source tool that can quickly identify errors in a large dataset.
  • MLfix is open source. With its user-friendly interface, it reduces the time and resources required to identify the errors in an image dataset.

Carla Synthetic Datasets

Developing autonomous driving systems requires a massive amount of training data, typically gathered and labelled by humans, which is both costly and error-prone. Instead, we use Carlafox, a web-based CARLA visualizer (covered in a dedicated blog post) that generates large numbers of training samples with automatic ground-truth annotation. Because a simulator produces the annotations, we assumed the data would be error-free, unlike human labels, which are prone to mistakes when a task is as repetitive as annotating traffic signs or vehicles. We then curated two object detection datasets with Carlafox: one for a 3D object detection task and one for a traffic sign detection task.

Finding a needle in a haystack with MLfix

MLfix had already proven successful at identifying errors even in carefully curated datasets such as the Mapillary traffic sign dataset, improving model performance in our previous experiments. We therefore ran our synthetic datasets through MLfix, which helped us identify several critical issues in our CARLA synthetic data generation pipeline.

MLfix's interface makes it easy to spot errors in almost any image-based dataset:

Figure 1: MLfix interface to spot errors.

One significant issue with the traffic sign bounding boxes was incorrect width annotation. When only part of a traffic sign is visible within the camera frame as the vehicle's camera sensor approaches it, our data generation pipeline would annotate the traffic sign incorrectly, as shown in the examples below:

Figure 2: Bounding box inconsistency in the data generation pipeline.
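A minimal sketch of the kind of fix this suggests (a hypothetical helper, not our actual pipeline code): clipping the projected 2D box to the image bounds, so a partially visible sign gets a width that matches only its visible portion.

```python
def clip_box_to_frame(box, img_w, img_h):
    """Clip a 2D box (x_min, y_min, x_max, y_max) in pixel coordinates
    to the image frame; return None if nothing remains visible."""
    x_min, y_min, x_max, y_max = box
    x_min = max(0, min(x_min, img_w))
    x_max = max(0, min(x_max, img_w))
    y_min = max(0, min(y_min, img_h))
    y_max = max(0, min(y_max, img_h))
    # A degenerate box means the object lies fully outside the frame,
    # so the annotation should be dropped rather than kept with a bad width.
    if x_max - x_min <= 0 or y_max - y_min <= 0:
        return None
    return (x_min, y_min, x_max, y_max)
```

For example, a sign whose projected box starts 50 pixels left of the frame would keep only its in-frame extent instead of an inflated width.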

CARLA provides no API to determine whether a traffic sign faces the camera or points the opposite way. With MLfix it was straightforward to see that our dataset also contained opposite-facing traffic signs, which helped us recognize inconsistencies in the code meant to filter them out:

Figure 3: Opposite facing traffic signs.
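One way to filter such signs, sketched below under the assumption that the sign's forward vector and the camera position can be read from the simulator (the helper name and signature are ours, not a CARLA API): a sign faces the camera when its forward vector points toward the camera position.

```python
import math

def is_facing_camera(sign_forward, sign_pos, cam_pos, max_angle_deg=90.0):
    """Return True if the sign's face is oriented toward the camera.
    sign_forward is the sign's unit forward vector; positions are 3D points."""
    # Vector from the sign to the camera, normalized.
    to_cam = tuple(c - s for c, s in zip(cam_pos, sign_pos))
    norm = math.sqrt(sum(v * v for v in to_cam))
    if norm == 0:
        return False
    to_cam = tuple(v / norm for v in to_cam)
    # Cosine of the angle between the sign's facing direction and the
    # direction to the camera; opposite-facing signs give a negative value.
    cos_angle = sum(f * v for f, v in zip(sign_forward, to_cam))
    return cos_angle > math.cos(math.radians(max_angle_deg))
```

With the default 90° threshold, any sign whose face points even slightly toward the camera passes; a tighter threshold would also reject signs seen at a steep angle.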

Since CARLA provides the label for each traffic sign, we assumed the labels would be distinct. With MLfix we easily identified that the class No Turns contained two completely different looking traffic signs:

Figure 4: No Turns label with different looking traffic signs.

We also used Carlafox to gather a perception dataset. Again, we assumed this dataset would contain error-free ground truth annotations since a simulator produced them, but thanks to MLfix we could see that we were mistaken.

Our data generation pipeline was annotating numerous fully occluded objects, and browsing the samples through MLfix showed that this issue was fairly common. With MLfix, it was easy to identify boxes where, because of significant occlusion, the object inside the bounding box differed from the assigned label:

Figure 5: Identified occluded objects.
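A sketch of one possible occlusion filter, assuming an instance segmentation mask is also rendered by the simulator (the function names and the 25% threshold are our illustrative choices): keep an annotation only if enough of its box is actually covered by the labelled instance.

```python
import numpy as np

def visible_ratio(instance_mask, instance_id, box):
    """Fraction of the box's pixels that belong to the given instance."""
    x_min, y_min, x_max, y_max = box
    crop = instance_mask[y_min:y_max, x_min:x_max]
    if crop.size == 0:
        return 0.0
    return float((crop == instance_id).sum()) / crop.size

def keep_annotation(instance_mask, instance_id, box, min_ratio=0.25):
    """Drop boxes that are mostly filled by occluders rather than the object."""
    return visible_ratio(instance_mask, instance_id, box) >= min_ratio
```

A fully occluded object would score near zero and be dropped, while a partially occluded one above the threshold would still be kept as a (harder) training sample.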

Another unexpected error was the width of the cyclists. We suspect that the CARLA API returns a width of zero for some of the cyclists, which requires further testing and debugging.

Figure 6: Cyclist 3D bounding box width is zero.
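Until the root cause is found, a simple sanity check can at least keep such degenerate boxes out of the training set. The sketch below is illustrative only; the annotation dictionary layout is an assumption, not our pipeline's actual format.

```python
def filter_degenerate_boxes(annotations, min_extent=1e-3):
    """Split annotations into valid boxes and boxes with a (near-)zero
    extent along any axis, such as the zero-width cyclists."""
    valid, dropped = [], []
    for ann in annotations:
        w, l, h = ann["extent"]  # assumed 3D box extents in meters
        if min(w, l, h) < min_extent:
            dropped.append(ann)
        else:
            valid.append(ann)
    return valid, dropped
```

Logging the dropped boxes rather than silently discarding them also makes it easier to quantify how often the simulator misreports the extent.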

MLfix is quite simple, yet very effective at spotting errors in a huge dataset. Without it, going through one sample at a time to find the errors in both of our synthetic datasets would have been a tedious task. We plan to overcome these issues by fixing our data generation pipeline.


Numerous open-source resources made our work possible. We hope our contributions will help others find errors in machine learning datasets and train better neural networks. MLfix is open source and the code has been released on GitHub.

If you have questions or ideas on how to analyze your datasets, join us on our Gitter #lounge channel or leave a comment in the comment section.
