
Font recognition reimagined with FasterViT-2

Marcus Edel
November 11, 2025

Fonts are everywhere: on posters, product labels, presentation slides, and user interfaces. Recognizing them automatically from images has long been challenging because typefaces often differ only in fine-grained visual details. In 2015, the DeepFont system introduced a convolutional neural network (CNN) approach to the problem, along with the AdobeVFR dataset. Today, we revisit this problem with modern data and a modern architecture, using FasterViT-2, a high-performance hybrid vision transformer, to set a new standard in font classification.

Dataset: Updating AdobeVFR for the modern font landscape

The original AdobeVFR dataset offered a strong foundation for visual font recognition. Collabora has since expanded and modernized it to reflect today’s typographic landscape and how fonts are used in screen-based environments.

Overview of different fonts from the dataset.

Key improvements

  • Extended font set: Over 2,700 fonts collected from modern online repositories like Google Fonts and Font Squirrel.
  • Synthetic training data: 2.7 million grayscale word images generated with augmentations that simulate real-world conditions (e.g., blur, noise, compression, perspective distortion).
  • New real-world test set: 5,000 manually labeled text snippets cropped from contemporary sources such as design forums, screen captures, and digital publications.
  • Consistent pre-processing: All images resized to a height of 105 pixels, with randomized character spacing and aspect ratio adjustments.

These changes make the dataset more reflective of the visual variability found in real-world design and screen environments.
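The fixed-height resize and randomized aspect-ratio adjustment can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming grayscale images stored as lists of rows with 0-255 values; the function names, the nearest-neighbour resampling, the 0.8-1.2 aspect range, and the noise level are hypothetical choices of ours, not Collabora's actual pipeline. Only Gaussian noise is shown; blur, compression, and perspective distortion would be separate augmentation steps.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale image as rows of 0-255 pixel values

TARGET_HEIGHT = 105  # fixed input height described in the post


def resize_to_height(img: Image, target_h: int = TARGET_HEIGHT,
                     aspect_scale: float = 1.0) -> Image:
    """Nearest-neighbour resize to a fixed height.

    The width follows the original aspect ratio, multiplied by
    `aspect_scale` to squeeze or stretch the word horizontally.
    """
    src_h, src_w = len(img), len(img[0])
    target_w = max(1, round(src_w * target_h / src_h * aspect_scale))
    # Map every output pixel back to its nearest source pixel.
    return [
        [img[y * src_h // target_h][x * src_w // target_w] for x in range(target_w)]
        for y in range(target_h)
    ]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clamp back to the 0-255 range."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]


def preprocess(img: Image, rng: random.Random,
               aspect_range=(0.8, 1.2), max_sigma: float = 10.0) -> Image:
    """Hypothetical pipeline: random aspect jitter, fixed-height resize, then noise."""
    aspect = rng.uniform(*aspect_range)
    sigma = rng.uniform(0.0, max_sigma)
    return add_gaussian_noise(resize_to_height(img, aspect_scale=aspect), sigma, rng)


# A white 30x60 "word image" becomes 105 pixels tall.
word = [[255] * 60 for _ in range(30)]
sample = preprocess(word, random.Random(0))
print(len(sample), len(sample[0]))
```

For a 30×60 input, `resize_to_height(word)` returns a 105×210 image because the width is scaled by the same factor of 3.5 as the height; the random `aspect_scale` then varies that width from sample to sample.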

Model: Introducing FasterViT-2

To fully leverage the new dataset, we employed FasterViT-2, a hybrid vision transformer designed for both speed and accuracy.

Why FasterViT-2?

FasterViT-2 combines the fast local feature learning of CNNs with the global modeling of transformers through its Hierarchical Attention (HAT) mechanism. HAT introduces carrier tokens that summarize each local attention window; because the carrier tokens also attend to one another, information flows between windows and long-range dependencies are captured efficiently. This is critical for capturing the nuanced differences between fonts.
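The carrier-token idea can be illustrated with a heavily simplified sketch: average-pool each local window into one carrier token, then let the carrier tokens attend to each other so every window receives information from every other window. This assumes plain Python lists for features and uses identity query/key/value projections; the real HAT block differs in several ways (learned projections, multiple heads, and a step in which carrier tokens join each window's local attention), none of which is shown. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # H x W grid of d-dimensional token features


def carrier_tokens(grid: Grid, window: int) -> List[Vector]:
    """Average-pool each non-overlapping window x window patch into one carrier token."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert h % window == 0 and w % window == 0, "grid must tile evenly into windows"
    carriers = []
    for wy in range(0, h, window):          # window rows, top to bottom
        for wx in range(0, w, window):      # window columns, left to right
            acc = [0.0] * d
            for y in range(wy, wy + window):
                for x in range(wx, wx + window):
                    for i, v in enumerate(grid[y][x]):
                        acc[i] += v
            carriers.append([a / (window * window) for a in acc])
    return carriers


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                      # subtract max for a stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Each output is a softmax-weighted average of all carrier tokens.
        out.append([sum(wt * v[i] for wt, v in zip(weights, tokens)) / total
                    for i in range(d)])
    return out


# 4x4 grid of 2-d features split into four 2x2 windows.
grid = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
carriers = carrier_tokens(grid, window=2)
print(carriers)
print(self_attention(carriers))
```

Because the global step runs over only one token per window, its cost depends on the number of windows rather than the number of pixels-level tokens.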

Architectural highlights

  • Hybrid backbone: Convolutional layers in early stages, transformer layers in later stages.
  • Hierarchical Attention: Efficient multi-scale token interaction for better spatial understanding.
  • Real-time throughput: Over 3,100 images per second on A100 GPUs.

This combination allows FasterViT-2 to outperform traditional CNNs and earlier ViT-based models in both accuracy and throughput.
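A rough count of query-key pairs shows why windowed attention with carrier tokens is cheaper than full global attention. This back-of-the-envelope sketch assumes square windows that tile the feature map exactly and one carrier token per window, and it counts only the local window attention plus the attention among carrier tokens; the 7×7 window and the feature-map sizes are illustrative assumptions, not the model's actual dimensions.

```python
from typing import Tuple


def attention_pairs(h: int, w: int, window: int) -> Tuple[int, int]:
    """Return (full_pairs, hat_pairs) for an h x w grid of tokens."""
    assert h % window == 0 and w % window == 0, "windows must tile the grid"
    n = h * w
    full = n * n                                   # every token attends to every token
    num_windows = (h // window) * (w // window)
    tokens_per_window = window * window
    local = num_windows * tokens_per_window ** 2   # attention inside each window
    carrier = num_windows ** 2                     # one carrier per window, attending globally
    return full, local + carrier


for size in (14, 28, 56):
    full, hat = attention_pairs(size, size, 7)
    print(f"{size}x{size}: full={full:,} hat={hat:,} ratio={full / hat:.1f}x")
```

With N tokens and window side k, full attention costs N² pairs, the local term costs N·k² (linear in N), and the carrier term costs N²/k⁴, which is k⁴ times smaller than full attention. The gap therefore widens as the feature map grows.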

Results: A new state of the art in font classification

We trained FasterViT-2 on the extended AdobeVFR dataset and evaluated its performance on the newly curated real-world test set.

Example font classification, showing the top-5 predictions.
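A top-5 prediction like the one in the figure comes from ranking the classifier's per-font scores. The sketch below assumes the model outputs one raw score (logit) per font; the font names and scores are made up for illustration, and `top_k_fonts` is a hypothetical helper rather than part of any released code.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_fonts(logits: List[float], font_names: List[str],
                k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable fonts with their probabilities, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(font_names[i], probs[i]) for i in ranked[:k]]


# Made-up scores for a single word image.
fonts = ["Roboto", "Open Sans", "Lato", "Montserrat", "Merriweather", "Inter"]
scores = [2.0, 3.5, 0.1, 1.2, -1.0, 3.0]
for name, p in top_k_fonts(scores, fonts):
    print(f"{name:<12} {p:.1%}")
```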

Training setup

  • Training set: 2.7M synthetic images (with font-aware augmentations)
  • Validation: 270K synthetic samples
  • Test set: 5,000 labeled real-world images
  • Backbone: FasterViT-2, pretrained on ImageNet-1K

Evaluation results

Model              | Top-1 accuracy | Top-5 accuracy | Inference throughput (img/s, RTX 4090)
DeepFont (2015)    | 62.3%          | 81.4%          | ~450
FasterViT-2 (ours) | 87.4%          | 92.1%          | 3,161

These results establish FasterViT-2 as the new state of the art for real-world font classification. On the real-world test set, it improves top-1 accuracy by 25.1 percentage points over DeepFont (87.4% vs. 62.3%) and delivers roughly 7× the inference throughput (3,161 vs. ~450 img/s), making it viable for both offline and real-time applications.
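The top-1 and top-5 figures above are top-k accuracies: a test image counts as correct if its true font is among the model's k highest-scoring predictions. A minimal sketch, assuming per-image score lists and integer labels; the function name and toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(all_scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(all_scores, labels):
        # Softmax is monotonic, so ranking raw logits gives the same top-k as ranking probabilities.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy example: 3 images, 4 fonts.
scores = [
    [0.1, 0.9, 0.3, 0.2],    # true font 1 ranked first
    [0.8, 0.1, 0.6, 0.05],   # true font 2 ranked second
    [0.2, 0.1, 0.3, 0.9],    # true font 1 ranked last
]
labels = [1, 2, 1]
for k in (1, 2, 4):
    print(f"top-{k}: {top_k_accuracy(scores, labels, k):.3f}")
```

Top-5 accuracy is always at least top-1 accuracy, which is why the gap between the two columns indicates how often the correct font is a close runner-up.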

Applications: From design tools to video upscaling

While our primary goal was to advance font recognition, FasterViT-2 has also proven useful in broader applications.

As part of Collabora’s winning submission to Track 3 (Screen Sharing Videos) of the ICME 2025 Video Super-Resolution Challenge, FasterViT-2 serves as a semantic font recognition module. Its predictions help guide super-resolution models to better preserve and enhance text during upscaling, improving the clarity of screen-shared content such as code editors, terminal windows, and slide presentations. We will write more about the challenge soon.

This integration shows that font classification can go beyond recognition, serving as a perceptual cue for more intelligent, content-aware visual enhancement.

Conclusion

By combining a significantly updated dataset with the power of FasterViT-2, we’ve set a new benchmark for visual font recognition. This work highlights how modern architectural advances can tackle fine-grained classification tasks with greater precision and efficiency.

FasterViT-2 is not only a state-of-the-art model for font identification but also a stepping stone toward better visual quality in downstream tasks, such as screen content upscaling and intelligent design tools.

Collabora Limited © 2005-2026. All rights reserved.