Labeling tools are great, but what about quality checks?

Jakub Piotr Cłapa
January 17, 2023



Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, random errors in the dataset tend to average out and can often be ignored, but systematic biases transfer directly to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. In the words of deep learning pioneer Andrew Ng:

In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. – Andrew Ng: Unbiggen AI - IEEE Spectrum

Over the course of 2022, we worked on an open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors.

Key takeaways

  • Even carefully curated AI datasets have errors that can be spotted and fixed to improve the accuracy of resulting models.
  • Existing labeling tools do not have good support for doing quality assurance.
  • Fixing labels on around 3% of the affected annotations improved the model's error rate by almost 2 percentage points, although exact results will depend on the dataset and task.
  • Thanks to MLfix, even a big dataset like the Mapillary Traffic Sign Dataset could be fully verified and fixed by a single person over a few days of work.

The QA problem in data labeling

Labeling is a difficult cognitive task, and accurate labels require a serious Quality Assurance (QA) process. Most existing labeling tools (both commercial and open source) have only minimal support for review. Frequently, the QA process is more difficult (and expensive!) than the initial labeling, since you are forced to use an interface optimized for drawing bounding boxes to verify whether all labels were assigned correctly. Here is the process described by a leading annotation service provider:

Annotations are reviewed four times in order to confirm accuracy. Two annotators label a given object, a supervisor then checks the quality of their work. – keymakr, a leading annotation provider

How hard can it be?

Can you spot the mistake in the following photo? If you can't, we can't blame you. This is hard because it requires expert knowledge and a lot of cognitive resources to read all the labels, remember what each of these signs should look like, and finally spot the ones that are incorrect.

What if instead we show the exact same data like this:

Now it's not so difficult to spot the one speed limit sign that does not fit with the rest (the 30 km/h speed limit). It only requires you to keep a single type of object in your working memory at a time, and it taps into the intuitive skill of spotting items that stand out from the rest. It also takes an order of magnitude less time.

This insight directly led to the creation of MLfix. Using the streamlined interface lets us perform the QA process more than 10 times faster and avoid missing up to 30% of the errors.
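The single-class review idea above can be sketched in a few lines: group all annotation crops by their ground-truth label so a reviewer scans one class at a time. This is a minimal illustration of the concept, not MLfix's actual code; the record fields (`id`, `label`) are hypothetical.

```python
from collections import defaultdict

def group_by_label(annotations):
    """Group annotation records by ground-truth label so a reviewer
    can scan one class at a time and spot the outliers."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann["label"]].append(ann)
    return dict(groups)

# Hypothetical annotation records for illustration:
anns = [
    {"id": 1, "label": "speed-limit-60"},
    {"id": 2, "label": "speed-limit-30"},
    {"id": 3, "label": "speed-limit-60"},
]
groups = group_by_label(anns)
# One review page per class: all 60 km/h crops shown together,
# so a stray 30 km/h sign stands out immediately.
```

Shown this way, the reviewer's task shifts from "read and check every label" to "spot the item that doesn't belong", which is a much cheaper perceptual judgment.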

MLfix in action

The video below shows a user quickly scrolling through 40 objects belonging to 5 classes and finding 6 mislabeled examples.

You can also try it yourself on a selection of 60km/h speed limit signs coming from the Mapillary Traffic Sign Dataset. Note that depending on demand the live demo can take some time to start.

Slicing the data in many ways

MLfix can be used as a standalone tool, but it can also be embedded directly into Jupyter notebooks that are used by data scientists to prepare and train deep learning networks. Thanks to that, MLfix can tap into all the metadata you have about your dataset and also utilize networks you've trained to help you with the QA process. You can:

  1. Slice the images based on the ground truth label:

  2. Show visually similar images together (based on the LPIPS metric or a novel sorting network pretrained in an unsupervised manner):

  3. Show the output of your model (sorted by loss) on the validation set images to fish out mistakes. Here we are looking at the ground-truth class other-sign that the model believed to be the do-not-enter sign; we can see that it was right most of the time:
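The third slicing mode above – sorting validation examples by loss – can be sketched as follows. High-loss examples are where the model and the label disagree most strongly, which makes them good candidates for annotation mistakes. This is an illustrative sketch, not MLfix's API; the `prob_of_true_class` field stands in for whatever per-example probabilities your model produces.

```python
import math

def rank_by_loss(examples):
    """Sort validation examples by cross-entropy loss, highest first.
    The loss is the negative log-probability the model assigned to
    the ground-truth class, so low-probability labels rise to the top."""
    return sorted(
        examples,
        key=lambda ex: -math.log(ex["prob_of_true_class"]),
        reverse=True,
    )

# Hypothetical validation records for illustration:
val = [
    {"id": "a", "prob_of_true_class": 0.98},  # model agrees with the label
    {"id": "b", "prob_of_true_class": 0.02},  # strong disagreement: inspect!
    {"id": "c", "prob_of_true_class": 0.60},
]
suspects = rank_by_loss(val)
# Reviewing from the top of this list concentrates effort on the
# examples most likely to be mislabeled.
```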

All right, but is it worth doing?

We made a comparison on the Mapillary Traffic Sign Dataset, which is an extensive dataset of 206 thousand traffic signs divided into 401 classes. Among these, there are 6,400 annotations of speed limit signs, and with MLfix, in about 30 minutes we could find and remove 3% of them that were erroneous. In other words, we corrected 0.11% of all the labels in the whole dataset.

We trained image classification models (based on the ResNet50 backbone) on both the original and fixed datasets 20 times and averaged the accuracy metrics. After fixing the dataset, the model error rate went down from 7.28% to 7.05%, and the error rate for speed signs improved by almost 2 percentage points (from 10.42% to 8.49%), which is a significant improvement for a very modest amount of effort. More information about these experiments (including the code to reproduce the results) can be found in the GitHub repo - jpc/mlfix-mapillary-traffic-signs. The accuracy histograms show that the improvement is consistent over multiple training runs:
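Averaging over repeated runs matters here: training is noisy, so a single run cannot reliably show a fraction-of-a-percentage-point improvement. A minimal sketch of the comparison, with illustrative numbers only (not the actual experiment logs):

```python
from statistics import mean

def mean_error_rate(error_rates):
    """Average the error rate (in %) over repeated training runs;
    individual runs vary too much to compare one-to-one."""
    return mean(error_rates)

# Illustrative per-run error rates; the real experiment used 20 runs.
original = [7.4, 7.1, 7.3, 7.3]
fixed    = [7.1, 7.0, 7.0, 7.1]
improvement = mean_error_rate(original) - mean_error_rate(fixed)
# A consistent gap between the two averages, as in the histograms
# below, is what shows the fix helped rather than noise.
```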


Our work would not have been possible without the help of countless open-source resources. We hope MLfix will help the annotation community build the next generation of innovative technology.

If you have questions or ideas, join us on our Gitter #lounge channel or leave a comment in the comment section.
