Font recognition reimagined with FasterViT-2

Marcus Edel
November 11, 2025

Fonts are everywhere, from posters and product labels to presentation slides and user interfaces. Recognizing them automatically from images has long been a challenging problem because the visual differences between typefaces are so fine-grained. In 2015, the DeepFont system introduced a new approach using convolutional neural networks (CNNs), along with the AdobeVFR dataset. Today, we revisit this problem with modern data and a modern architecture, using FasterViT-2, a high-performance hybrid vision transformer, to set a new standard in font classification.

Dataset: Updating AdobeVFR for the modern font landscape

The original AdobeVFR dataset offered a strong foundation for visual font recognition. Collabora has since expanded and modernized it to reflect today’s typographic landscape and usage in screen-based environments.

Overview of different fonts from the dataset.

Key improvements

Extended font set: Over 2,700 fonts collected from modern online repositories like Google Fonts and Font Squirrel.
Synthetic training data: 2.7 million grayscale word images generated with augmentations that simulate real-world conditions (e.g., blur, noise, compression, perspective distortion).
New real-world test set: 5,000 manually labeled text snippets cropped from contemporary images, including design forums, screen captures, and digital publications.
Consistent pre-processing: All images resized to a height of 105 pixels, with randomized character spacing and aspect ratio adjustments.

These changes make the dataset more reflective of the visual variability found in real-world design and screen environments.
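
As a rough illustration of the pre-processing and augmentation steps listed above, the following Python sketch computes the output size for a fixed height of 105 pixels with a random aspect-ratio squeeze, and adds Gaussian noise to a grayscale image stored as nested lists. The function names, the squeeze range, and the noise level are assumptions chosen for this example; they are not details of the actual data pipeline.

```python
import random

TARGET_HEIGHT = 105  # image height stated in the post


def resized_shape(width, height, squeeze=1.0, target_height=TARGET_HEIGHT):
    """Return (new_width, new_height) after scaling to the target height.

    `squeeze` stretches (>1) or compresses (<1) the width to vary the
    aspect ratio. The parameter is hypothetical, for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    new_width = max(1, round(width * scale * squeeze))
    return new_width, target_height


def sample_squeeze(rng, low=5 / 6, high=7 / 6):
    """Draw a random width squeeze factor (the range is an assumed example)."""
    return rng.uniform(low, high)


def add_gaussian_noise(pixels, rng, sigma=3.0):
    """Add zero-mean Gaussian noise to 8-bit grayscale pixels, clamped to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in pixels
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    squeeze = sample_squeeze(rng)
    print(resized_shape(400, 210, squeeze=squeeze))
    print(add_gaussian_noise([[0, 128, 255]], rng))
```

A real pipeline would apply the same idea to rendered word images, together with the blur, compression, and perspective augmentations mentioned above.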

Model: Introducing FasterViT-2

To fully leverage the new dataset, we employed FasterViT-2, a hybrid vision transformer designed for both speed and accuracy.

Why FasterViT-2?

FasterViT-2 combines CNN-like fast local feature learning with transformer-style global modeling through its Hierarchical Attention (HAT) mechanism. It introduces carrier tokens to summarize local windows and enable efficient long-range dependencies—critical for capturing the nuanced differences between fonts.
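
To make the efficiency argument concrete, the sketch below counts token-to-token interactions for plain global self-attention and for a simplified windowed scheme with carrier tokens. It assumes that each window attends over its own tokens plus its carrier tokens, and that all carrier tokens attend to each other globally. This is a simplified cost model with hypothetical function names, not the FasterViT implementation.

```python
def full_attention_pairs(height, width):
    """Pairwise interactions when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def hierarchical_attention_pairs(height, width, window, carriers_per_window=1):
    """Pairwise interactions for windowed attention plus carrier tokens.

    Simplified model: local attention inside each window (window tokens
    plus that window's carrier tokens), and global attention among all
    carrier tokens.
    """
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    n_windows = (height // window) * (width // window)
    local_tokens = window * window + carriers_per_window
    local_pairs = n_windows * local_tokens * local_tokens
    n_carriers = n_windows * carriers_per_window
    global_pairs = n_carriers * n_carriers
    return local_pairs + global_pairs


if __name__ == "__main__":
    for side in (14, 28, 56):
        full = full_attention_pairs(side, side)
        hier = hierarchical_attention_pairs(side, side, window=7)
        print(f"{side}x{side}: full={full}, hierarchical={hier}")
```

Under these assumptions, the cost of the windowed scheme grows far more slowly with the feature-map size than that of global attention, while the carrier tokens still give every window a path to distant regions.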

Architectural highlights

Hybrid backbone: Convolutional layers in early stages, transformer layers in later stages.
Hierarchical Attention: Efficient multi-scale token interaction for better spatial understanding.
Real-time throughput: Over 3,100 images per second on A100 GPUs.

This combination allows FasterViT-2 to surpass both traditional CNNs and earlier ViT-based models in accuracy and in practical throughput.
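
Throughput figures such as images per second are usually obtained by timing repeated forward passes over a fixed batch. The following is a minimal, framework-agnostic sketch of that measurement; `measure_throughput` and its parameters are hypothetical names for this example, and it does not reproduce the benchmark setup behind the number quoted above.

```python
import time


def measure_throughput(model_fn, batch, n_warmup=3, n_iters=10,
                       clock=time.perf_counter):
    """Return processed items per second for `model_fn` on `batch`.

    `model_fn` is any callable that takes the batch. On a GPU, the
    asynchronous kernels would have to be synchronized before each clock
    read (for example with torch.cuda.synchronize() in PyTorch), which
    this sketch leaves out.
    """
    if n_iters <= 0:
        raise ValueError("n_iters must be positive")
    for _ in range(n_warmup):  # warm-up runs are not timed
        model_fn(batch)
    start = clock()
    for _ in range(n_iters):
        model_fn(batch)
    elapsed = clock() - start
    return n_iters * len(batch) / elapsed


if __name__ == "__main__":
    # Stand-in "model": sums each dummy image, used only to exercise the timer.
    dummy_batch = [[0.0] * 1000 for _ in range(64)]
    rate = measure_throughput(lambda b: [sum(x) for x in b], dummy_batch)
    print(f"{rate:.0f} items/sec")
```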

Results: A new state-of-the-art in font classification

We trained FasterViT-2 on the extended AdobeVFR dataset and evaluated its performance on the newly curated real-world test set.

Example font classification, showing the top-5 prediction.

Training setup

Training set: 2.7M synthetic images (with font-aware augmentations)
Validation: 270K synthetic samples
Test set: 5,000 labeled real-world images
Backbone: FasterViT-2, pretrained on ImageNet-1K

Evaluation results

Model	Top-1 Accuracy	Top-5 Accuracy	Inference Throughput (img/sec, 4090)
DeepFont (2015)	62.3%	81.4%	~450
FasterViT-2 (ours)	87.4%	92.1%	3161

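This establishes FasterViT-2 as the new state of the art for real-world font classification on this test set. Compared with DeepFont, top-1 accuracy rises by 25.1 percentage points (62.3% to 87.4%), top-5 accuracy rises by 10.7 points (81.4% to 92.1%), and inference throughput is roughly seven times higher (3161 vs. ~450 img/sec), making the model viable for both offline and real-time applications.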
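
For readers unfamiliar with the metrics, top-k accuracy counts a prediction as correct when the true font is among the k highest-scoring classes. The sketch below shows one way to compute it from per-class scores. The function name and the toy data are illustrative assumptions, not the evaluation code used for the table above.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k best-scored classes.

    `scores` is a list of per-class score lists (one per sample) and
    `labels` holds the true class index of each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy example with three samples and three font classes.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

With k=1 this reduces to ordinary classification accuracy; larger values of k reward a model for ranking the correct font near the top, which is useful when several typefaces are nearly indistinguishable.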

Applications: From design tools to video upscaling

While our primary goal was to advance font recognition, FasterViT-2 has also proven useful in broader applications.

As part of Collabora’s winning submission to the ICME 2025 Video Super-Resolution Challenge—specifically Track 3: Screen Sharing Videos—FasterViT-2 serves as a semantic font recognition module. Its predictions help guide super-resolution models to better preserve and enhance text during upscaling, improving the clarity of screen-shared content such as code editors, terminal windows, and slide presentations. We will write more about the challenge soon.

This integration shows that font classification can go beyond recognition, serving as a perceptual cue for more intelligent, content-aware visual enhancement.

Conclusion

By combining a significantly updated dataset with the power of FasterViT-2, we’ve set a new benchmark for visual font recognition. This work highlights how modern architectural advances can tackle fine-grained classification tasks with greater precision and efficiency.

FasterViT-2 is not only a state-of-the-art model for font identification, but it is also a stepping stone for improving visual quality in downstream tasks, such as screen content upscaling and intelligent design tools.
