June 08, 2023
Contrary to traditional software development, data is more important than code in machine learning, and data labelling in particular. Building a high-performing model requires reliable, precisely labelled data, but poor-quality data is not always obvious.
Working with data is critical for ML researchers because the resulting model depends on the quantity and quality of the data used. Erroneous data points are far more common in machine learning datasets than expected. Early detection of these data points is crucial for the performance of downstream tasks. However, manually inspecting each data point is neither efficient nor feasible. According to a study evaluating the impact of data cleaning on ML classification tasks, data quality affects ML model performance, and data scientists spend considerable amounts of time on data cleaning prior to model training.
Deep learning algorithms can withstand random mistakes in the training set rather well. It is fine not to spend too much time correcting errors as long as they are inadvertent and somewhat random. For example, a few mislabelled stop signs in a traffic sign training dataset would not significantly affect the model. On the other hand, deep learning algorithms are not resistant to systematic mistakes: if a notably large number of stop sign bounding boxes are mislabelled, or the bounding box annotation itself is not consistent, the model will learn this pattern.
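One rough way to tell the two cases apart, once a sample of labels has been reviewed by hand, is to check whether the errors are spread out or pile up in a single confusion. The sketch below is illustrative only: the function name and the `(assigned, correct)` pair format are assumptions, not part of MLfix.

```python
from collections import Counter


def error_concentration(pairs):
    """Summarise reviewed labels.

    pairs: iterable of (assigned_label, correct_label) tuples.
    Returns (error_rate, top_confusion_share), where top_confusion_share is
    the fraction of all errors that fall into the single most frequent
    (assigned, correct) confusion. A value close to 1.0 hints at a
    systematic mistake rather than random noise.
    """
    pairs = list(pairs)
    confusions = Counter((a, c) for a, c in pairs if a != c)
    n_errors = sum(confusions.values())
    if n_errors == 0:
        return 0.0, 0.0
    top_count = confusions.most_common(1)[0][1]
    return n_errors / len(pairs), top_count / n_errors


# Same error rate, very different structure.
scattered = [("stop", "stop")] * 8 + [("stop", "yield"), ("yield", "stop")]
systematic = [("stop", "stop")] * 8 + [("stop", "yield")] * 2
print(error_concentration(scattered))
print(error_concentration(systematic))
```

Both samples have the same share of wrong labels, but only in the second one do all errors fall into one confusion, which is the pattern a model is likely to learn.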
Using open-source software, Collabora has developed MLfix which helps to identify and filter out labelling errors in machine learning datasets quickly and efficiently. In order to assist annotators and machine learning engineers in identifying and removing labelling errors, MLfix blends cutting-edge unsupervised machine learning pipelines with a novel user interface idea. We further used this tool to spot errors in our synthetic datasets generated from the CARLA simulator for vehicle and traffic sign detection.
Developing autonomous driving systems requires a massive amount of training data, typically gathered and labelled by humans, which is both costly and error-prone. Instead, we use Carlafox, a web-based CARLA visualizer for which we released a dedicated blog post, to generate a large number of training samples with automatic ground-truth generation. We assumed that this tool would give us error-free data simply because the annotations come from a simulator rather than from humans, who are prone to mistakes when a task is as repetitive as annotating traffic signs or vehicles. We then curated two object detection datasets with Carlafox: one for a 3D object detection task and one for a traffic sign detection task.
As our previous experiments showed, MLfix successfully identified errors even in carefully curated AI datasets such as the Mapillary traffic sign dataset, which improved the performance of the models. We therefore ran the synthetic datasets through MLfix, which helped us identify many critical issues in our CARLA synthetic data generation pipeline.
MLfix's interface makes it really easy to spot errors in almost all image-based datasets:
|Figure 1: MLfix interface to spot errors.|
One of the significant issues with the bounding boxes of traffic signs was incorrect width annotation. When only part of a traffic sign is visible within the camera frame as the vehicle's camera sensor approaches, our data generation pipeline annotates the traffic sign incorrectly, as shown in the examples below:
|Figure 2: Bounding box inconsistency in the data generation pipeline.|
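The fix in our pipeline is not covered in this post. As a minimal sketch of one possible guard, assuming boxes are projected into pixel coordinates and may extend past the image border, a box can be clipped to the frame and dropped when too little of it remains visible. The function name and the `min_visible` threshold are hypothetical.

```python
def clip_box(box, image_width, image_height, min_visible=0.5):
    """Clip a projected 2D box to the image and reject mostly hidden ones.

    box: (x_min, y_min, x_max, y_max) in pixels; may lie partly outside
    the image. Returns (clipped_box, visible_fraction), or None when the
    box is degenerate or less than `min_visible` of its area is in frame.
    """
    x_min, y_min, x_max, y_max = box
    full_area = (x_max - x_min) * (y_max - y_min)
    if full_area <= 0:
        return None

    # Intersect the box with the image rectangle.
    cx_min = max(0.0, x_min)
    cy_min = max(0.0, y_min)
    cx_max = min(float(image_width), x_max)
    cy_max = min(float(image_height), y_max)

    visible_area = max(0.0, cx_max - cx_min) * max(0.0, cy_max - cy_min)
    visible_fraction = visible_area / full_area
    if visible_fraction < min_visible:
        return None
    return (cx_min, cy_min, cx_max, cy_max), visible_fraction


# A sign that is half outside the left edge of a 100x100 frame.
print(clip_box((-20, 10, 20, 50), 100, 100))
# A sign that is almost entirely out of frame is rejected.
print(clip_box((-38, 10, 2, 50), 100, 100))
```

Whether to keep a truncated sign at all, and at which visibility threshold, is a dataset design choice; the point is only that the stored width should match what the camera actually sees.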
There is no API in CARLA to determine whether a traffic sign faces the camera or away from it. With MLfix it was straightforward to see that our dataset also contained opposite-facing traffic signs, which helped us find inconsistencies in the code we use to filter them out:
|Figure 3: Opposite facing traffic signs.|
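A common geometric way to filter opposite-facing signs is to compare the direction the sign's face points with the direction from the sign to the camera. The sketch below is not our pipeline code. It uses plain tuples and assumes a normal vector for the printed face of the sign is available; how that normal maps onto the simulator's actor transform is an assumption that has to be verified per sign asset.

```python
import math


def is_facing_camera(sign_position, sign_normal, camera_position,
                     max_angle_deg=80.0):
    """Return True when the sign's printed face is turned towards the camera.

    sign_position, camera_position: (x, y, z) world coordinates.
    sign_normal: vector pointing out of the printed face (assumed known).
    max_angle_deg: largest allowed angle between the normal and the
    sign-to-camera direction (hypothetical threshold).
    """
    to_camera = [c - s for c, s in zip(camera_position, sign_position)]
    distance = math.sqrt(sum(t * t for t in to_camera))
    normal_length = math.sqrt(sum(n * n for n in sign_normal))
    if distance == 0.0 or normal_length == 0.0:
        return False

    # Cosine of the angle between the face normal and the view direction.
    cos_angle = sum(n * t for n, t in zip(sign_normal, to_camera))
    cos_angle /= distance * normal_length
    return cos_angle >= math.cos(math.radians(max_angle_deg))


camera = (0.0, 0.0, 2.0)
sign = (10.0, 0.0, 2.0)
print(is_facing_camera(sign, (-1.0, 0.0, 0.0), camera))  # face towards camera
print(is_facing_camera(sign, (1.0, 0.0, 0.0), camera))   # face away from camera
```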
Since CARLA provides the label for each traffic sign, we assumed the labels would be distinct. With MLfix we easily identified that the class No Turns covered two completely different-looking traffic signs:
|Figure 4: No Turns label with different looking traffic signs.|
We also used Carlafox to gather a perception dataset. Again, we assumed that this dataset contains error-free ground truth annotations as we are using a simulator to annotate the samples, but thanks to MLfix we could see that we were mistaken.
Our data generation pipeline was annotating numerous fully occluded objects. Looking at the samples through MLfix, we found this issue to be fairly common. MLfix also made it easy to identify objects whose bounding box showed a different object than the assigned label because of significant occlusion:
|Figure 5: Identified occluded objects.|
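One possible way to filter such annotations, sketched here under the assumption that a per-pixel depth map aligned with the RGB frame is available, is to measure how much of the box is covered by something closer to the camera than the annotated object. The function name, the `tolerance` value and the data layout are hypothetical and do not describe our pipeline.

```python
def unoccluded_fraction(depth_map, box, object_depth, tolerance=1.0):
    """Fraction of box pixels not covered by a nearer surface.

    depth_map: 2D list (rows of columns) of per-pixel depth in metres.
    box: (x_min, y_min, x_max, y_max) integer pixel coordinates, with the
         max values exclusive.
    object_depth: distance from the camera to the annotated object in metres.
    A pixel counts as occluded when the rendered depth is more than
    `tolerance` metres closer than the object itself.
    """
    x_min, y_min, x_max, y_max = box
    total = 0
    unoccluded = 0
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            total += 1
            if depth_map[y][x] >= object_depth - tolerance:
                unoccluded += 1
    return unoccluded / total if total else 0.0


# A 4x4 patch where the top half is covered by a wall 5 m away,
# while the annotated vehicle is 20 m away.
depth = [[5.0] * 4, [5.0] * 4, [20.0] * 4, [20.0] * 4]
print(unoccluded_fraction(depth, (0, 0, 4, 4), 20.0))
```

Annotations whose fraction falls below a chosen threshold could then be dropped or flagged for review. This check treats background pixels inside the box as unoccluded, so it is only a coarse filter.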
Another unexpected error was the width of the cyclists. We suspect that the CARLA API returns a width of zero for some cyclists; this still requires further testing and debugging.
|Figure 6: Cyclist 3D bounding box width is zero.|
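Until the root cause is understood, degenerate boxes like these can at least be caught with a simple sanity check on the exported annotations. The sketch below assumes each annotation is a dictionary with `id`, `label` and an `extent` of (width, length, height) in metres; these field names and the `min_size` threshold are illustrative, not the format our pipeline uses.

```python
def find_degenerate_boxes(annotations, min_size=0.05):
    """Return annotations whose 3D box has a near-zero dimension.

    annotations: list of dicts with 'id', 'label' and 'extent', where
    extent is (width, length, height) in metres (hypothetical layout).
    Each result is (id, label, [names of the offending dimensions]).
    """
    names = ("width", "length", "height")
    flagged = []
    for ann in annotations:
        bad = [name for name, size in zip(names, ann["extent"])
               if size <= min_size]
        if bad:
            flagged.append((ann["id"], ann["label"], bad))
    return flagged


samples = [
    {"id": 1, "label": "car", "extent": (1.9, 4.5, 1.5)},
    {"id": 2, "label": "cyclist", "extent": (0.0, 1.8, 1.7)},
]
print(find_degenerate_boxes(samples))
```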
MLfix is quite simple and yet very effective in spotting errors in a huge dataset. If not for MLfix, it would have been a tedious task to go through one sample at a time and try to figure out the errors in both of our synthetic datasets. We plan on overcoming these issues by fixing our data generation pipeline.
Numerous open-source resources helped us to make our work possible. We hope our contributions will help others to find errors in machine learning datasets and train better neural networks. MLfix is open-source and the code has been released on GitHub here.
If you have questions or ideas on how to analyze your datasets, join us on our Gitter #lounge channel or leave a comment in the comment section.