
Breaking language barriers 2.0: Moving closer towards fully reliable, production-ready Hindi ASR


Vineet Suryan
May 29, 2025


Previously, we launched Whisper-Hindi v1 by fine-tuning OpenAI’s Whisper on 2,500 hours of Hindi speech with Indic normalization, improving whisper-tiny’s WER on google/fleurs from ~172% to ~14%. That work not only showcased the impact of Indic normalization but also paved the way for timestamp-aware fine-tuning.

Now, in this follow-up post, we’ve cleaned and expanded that corpus to 3,000 hours, added explicit timestamp prediction, reorganized our pipeline around WebDataset for 5× faster I/O, and fine-tuned multilingual Whisper models across all sizes. The result is Whisper-Hindi v2, where we achieve an impressive ~5% WER on google/fleurs—bringing us even closer to fully reliable, production-ready Hindi ASR.

Jump to a section: Demo: A verse in Hindi | Why Indic Normalization is critical | Datasets | Why WebDataset is a game changer | Modelling & results | Looking ahead | Acknowledgments | Resources



Demo: A verse in Hindi

Here’s a short clip of a Hindi poem. Though we don’t have an official transcript, you’ll notice the model’s output aligns perfectly—capturing both the verse’s rhythm and its literary nuance.

Text output

युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं 
लड़ेगा दिन कर नहीं प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी ने बोलो कब पाया है रंग 
सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला स्वीकार करेगा जीवन एक संग्राम नहीं है

Even without a reference transcript, it's evident that every matra and conjunct is intact, the poetic cadence flows naturally, and the timestamps below map precisely to each spoken phrase:

Timestamped chunks

[{'timestamp': (0.0, 9.48), 'text': ' युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर'},
{'timestamp': (9.48, 18.3), 'text': ' रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं लड़ेगा दिन कर नहीं'},
{'timestamp': (18.3, 29.8), 'text': ' प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी'},
{'timestamp': (29.8, 40.04), 'text': ' ने बोलो कब पाया है रंग सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला'},
{'timestamp': (40.04, 44.0), 'text': ' स्वीकार करेगा जीवन एक संग्राम नहीं है'}] 

Reproduce the demo

import torch
from transformers import pipeline

# Use the GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="collabora/whisper-large-v2-hindi",
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
    device=device,
)

# Transcribe the clip and return timestamped chunks
prediction = asr_pipe("./sample.wav", return_timestamps=True)
print(prediction)

Live YouTube captioning in action using WhisperLive (Chrome extension)


Why Indic Normalization is critical

Hindi’s diacritics (matras) and conjuncts aren’t just decorative—they encode critical phonetic and semantic cues. If a normalizer strips or mangles these marks, it can turn valid words into unreadable text, breaking user comprehension and any NLP that follows. An Indic-aware normalizer preserves every matra and cluster, keeping transcripts accurate, legible, and robust downstream.

Semantic integrity & readability

Applications like subtitles, captions, or live transcription power real-time communication. Missing or misplaced diacritics not only confuse listeners but also make it hard to trust the output:

Original:

क्षेत्रफल बढ़ने से उत्पादन बढ़ा।
Transliteration => (kṣetrafal baṛhne se utpādan baṛhā.)
Whisper’s Default Normalization: कषतरफल बढन स उतप दन बढ
  • क्ष split into क्‌ष
  • ढ़ loses its aspirate matra

Indic Normalization: क्षेत्रफल बढ़ने से उत्पादन बढ़ा।
  • Conjunct क्ष preserved
  • ढ़ retains its aspirated marker
  • Full readability maintained


Quick example: compound characters

Original:      कृषि       (krshi)  => Agriculture
Whisper Norm:  कष        (ks)     => NA
Indic Norm:    कृषि       (krshi)  => Agriculture

By maintaining compounds like कृ, the model learns true phonetic clusters, improving both transcription accuracy and pronunciation.
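
For illustration, here is a minimal sketch of how the two normalizers can be compared, assuming the indic-nlp-library package and the BasicTextNormalizer shipped with transformers (this is not our exact pipeline code):

# Compare Whisper's language-agnostic normalizer with an Indic-aware one.
# Assumes the `indic-nlp-library` and `transformers` packages are installed;
# illustrative sketch, not our exact pipeline code.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

text = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"

# Whisper's basic normalizer replaces combining marks, which strips matras
# and breaks conjuncts apart
whisper_norm = BasicTextNormalizer()
print(whisper_norm(text))

# The Indic-aware normalizer keeps every matra and conjunct intact
indic_norm = IndicNormalizerFactory().get_normalizer("hi")
print(indic_norm.normalize(text))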

Key benefits

  • Conjunct preservation: Compound characters such as क्ष or कृ remain intact, ensuring correct pronunciation and meaning.
  • Diacritic retention: Aspirated characters like ढ़ keep their matras, so बढ़ isn’t misread as बढ.
  • Word boundaries & readability: Every matra and conjunct stays intact, making transcripts fully legible and trustworthy—critical for real-time ASR applications.

Datasets

We assembled a diverse collection of Hindi ASR corpora, totaling roughly 3,000 hours after cleanup, each with a clearly stated license to ensure compliance and reproducibility.

Data sources

Dataset Hours (Hi) License Source
Shrutilipi ~1,558 h CC BY 4.0 ai4bharat/shrutilipi
IITM Madras SpringLab ~900 h CC BY 4.0 SpringLab
Common Voice 11.0 (Mozilla) ~20 h CC 0 1.0 (public domain) mozilla/commonvoice
IndicSUPERB 150 h Apache License 2.0 ai4bharat/indic-superb
snow-mountain 67.6 h CC BY-SA 4.0 bridgeconn/snow-mountain
yodas ~200 h CC BY 3.0 espnet/yodas
IndicVoices-R_Hindi 75 h CC BY 4.0 SPRINGLab/IndicVoices-R_Hindi
Lahaja 12.5 h CC BY 4.0 ai4bharat/lahaja
fleurs 30.0 h CC BY 4.0 google/fleurs
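
Each corpus can be pulled straight from the Hugging Face Hub. For example, the Hindi split of google/fleurs loads as follows (assuming the "hi_in" configuration name used on the Hub for Hindi):

from datasets import load_dataset

# Hindi configuration of FLEURS; audio comes pre-sampled at 16 kHz
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="test")
print(fleurs_hi[0]["transcription"])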


Preprocessing pipeline

  1. Raw → HuggingFace Audio
    • Download raw speech corpus (WAV/MP3).
    • Merge segments into clips ≤ 30 s:
      • Concatenate consecutive audio files until either the 30s limit or the Whisper token-length limit is reached.
      • Create & record per-segment start/end times for timestamp tokens.
    • Apply Indic Normalization on merged transcripts so that all matras and conjuncts are preserved.
    • Emit a CSV/JSON metadata file with:
      • file_name (path to merged clip)
      • sentence (merged, normalized transcript with timestamps)
      • duration (clip length in seconds)
    • Load into HuggingFace via load_dataset(..., features=Audio()) so downstream scripts treat it as an “audio dataset.”
  2. Audio Dataset → Model-Ready Features
    • Cast audio: ensure the audio column is at 16 kHz.
    • Feature extraction: convert each audio clip into log-Mel spectrogram tensors using WhisperFeatureExtractor.
    • Label tokenization: run the (already normalized) text through WhisperTokenizer to produce input_ids with timestamp tokens.
    • Save the preprocessed dataset to disk.
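
A minimal sketch of this second stage is shown below, assuming the merged clips and normalized transcripts from stage 1 are available as an "audiofolder"-style dataset with file_name and sentence columns (paths and the checkpoint size are illustrative, timestamp-token handling is omitted for brevity, and this is not our exact preprocessing script):

from datasets import Audio, load_dataset
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="hindi", task="transcribe"
)

def prepare_example(example):
    audio = example["audio"]
    # Log-Mel spectrogram expected by the Whisper encoder
    example["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token IDs for the (already Indic-normalized) transcript
    example["labels"] = tokenizer(example["sentence"]).input_ids
    return example

# Stage 1 output: merged clips plus a metadata CSV (file_name, sentence, duration)
dataset = load_dataset("audiofolder", data_dir="./merged_clips")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset = dataset.map(prepare_example, remove_columns=dataset.column_names["train"])
dataset.save_to_disk("./whisper_hindi_features")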

Why this two-stage approach?

  • Modularity: separate raw merging (with normalization & timestamp logic) from expensive spectrogram/tokenization.
  • Efficiency: saving preprocessed features means repeated experiments skip audio decoding and normalization, cutting hours off hyperparameter sweeps.


Why WebDataset is a game changer

When you’re firing through thousands of hours of speech, nothing grinds the GPU to a halt faster than slow disk reads. Packing our clips (with all their timestamp tags) into WebDataset shards was a game-changer—suddenly we were training 5–6× faster, and actually enjoying the process again with no more staring at stalled training runs.

  • Tar-based shards
    Each shard (~700 MB) packs 1,000 samples into a single tar file. Within each sample we store input_features.npz (precomputed log-Mel arrays) and labels.npz (token ID sequences with timestamp markers).
  • Shard streaming
    Shards can be read sequentially by multiple CPU workers in parallel. This maximizes throughput by:
    1. Fast sequential reads (perfect for NVMe or even spinning disks)
    2. Independent decoding per worker
    3. Shards are created from a dataset that is already shuffled.
  • Training: workers independently stream from different shards, decode on the fly, and feed batches to the GPU.

Shifting everything, from preprocessing through training, into WebDataset shards meant we barely touched the file system during a run. No more millions of tiny file opens, just fast, steady throughput from start to finish. With our data fully preprocessed and sharded, and timestamps baked in, we’re ready to fine-tune and evaluate.
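
The sharding step itself is only a few lines with the webdataset library. The sketch below continues from the preprocessed dataset above (shard paths, key names, and the npz layout are illustrative, not our exact packing script):

import io
import numpy as np
import webdataset as wds

def npz_bytes(**arrays):
    # Serialize numpy arrays to raw .npz bytes so the writer stores them as-is
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    return buf.getvalue()

# Pack ~1,000 samples per tar shard from the (already shuffled) dataset
with wds.ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
    for i, example in enumerate(dataset["train"]):
        sink.write({
            "__key__": f"{i:09d}",
            "input_features.npz": npz_bytes(x=np.asarray(example["input_features"], dtype=np.float32)),
            "labels.npz": npz_bytes(x=np.asarray(example["labels"], dtype=np.int64)),
        })

# At training time, workers stream shards sequentially and decode on the fly
def decode(sample):
    feats = np.load(io.BytesIO(sample["input_features.npz"]))["x"]
    labels = np.load(io.BytesIO(sample["labels.npz"]))["x"]
    return {"input_features": feats, "labels": labels}

train_stream = wds.WebDataset("shards/train-{000000..000099}.tar").map(decode)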


Modelling & results

We optimized for both speed and stability on a single RTX 4090. In practice, we trained for 5 epochs on Tiny/Base/Small, 4 on Medium, and 3 on Large-v2, with 600 warm-up steps (300 for Large-v2). Mixed-precision training and 8-bit AdamW (just for Large-v2) helped keep memory in check. Here are the full details:

Setting Value
Hardware NVIDIA RTX 4090 GPU
Epochs Tiny/Base/Small – 5; Medium – 4; Large-v2 – 3
Effective Batch Size 128
Warm-up Steps 600 (300 for Large-v2)
Optimizer AdamW 8-bit (bitsandbytes) for Large-v2; fp16 AdamW for others
β₁, β₂, ε 0.9, 0.999, 1e-8
Weight Decay 0.05 (versus 0.1 in Whisper paper)
LR Scheduler Linear decay after warm-up
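
In transformers terms, these settings translate roughly to the configuration below for Large-v2 (the per-device batch size and gradient-accumulation split are an assumption chosen to reach the effective batch of 128; this is a sketch, not our exact training script):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-hindi",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = 128 effective batch size
    learning_rate=3.5e-5,             # max LR for Large-v2 (see table below)
    warmup_steps=300,
    lr_scheduler_type="linear",       # linear decay after warm-up
    weight_decay=0.05,
    num_train_epochs=3,
    fp16=True,                        # mixed-precision training
    optim="adamw_bnb_8bit",           # 8-bit AdamW via bitsandbytes
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    predict_with_generate=True,
)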


Learning rate comparison

Model Size Paper Max LR v2 Max LR
Tiny 1.5 × 10⁻³ 3 × 10⁻⁴
Base 1 × 10⁻³ 1.5 × 10⁻⁴
Small 5 × 10⁻⁴ 1 × 10⁻⁴
Medium 2.5 × 10⁻⁴ 5 × 10⁻⁵
Large-v2 2 × 10⁻⁴ 3.5 × 10⁻⁵


Why these choices?

  • Lower max LRs ensure gradual adaptation of pre-trained weights, reducing the risk of divergence or sub-optimal minima.
  • Longer warm-up gives larger models time to “ramp up” before decay.
  • Reduced weight decay (0.05) strikes a balance between regularization and preserving pre-trained features.
  • 8-bit optimizer on the largest model cuts memory use, allowing higher batch sizes without GPU OOMs.
  • Timestamp tokens: training with explicit <start>/<end> allows the model to also output timestamp tokens without impacting WER.

In practice, dialing in these learning rates, warm-ups, and weight-decay settings gave us smooth, predictable loss curves and dramatic WER drops—from Tiny all the way to Large-v2—without any nasty surprises.

Fig.1 - Training loss vs. global step.


Baseline (pre–fine-tuning) Whisper WER %

Model Whisper Norm Indic Norm
Tiny 172.60 196.57
Base 149.17 160.58
Small 67.37 89.73
Medium 26.85 43.16
Large-v2 21.54 38.46


Fine-tuned Whisper WER %

Model Whisper Norm Indic Norm
Tiny 10.01 18.94
Base 8.95 17.42
Small 7.17 15.10
Medium 6.10 13.36
Large-v2 5.33 13.06


We decoded using beam search with beam_size=5, applied a repetition_penalty=1.15, and removed any samples containing numeric characters.
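
For reference, here is a minimal sketch of this decoding and scoring setup, using the pipeline from the demo and the evaluate library (the reference string is a placeholder; this is illustrative rather than our exact evaluation script):

import evaluate

# Beam search with a repetition penalty, re-using the asr_pipe from the demo
prediction = asr_pipe(
    "./sample.wav",
    return_timestamps=True,
    generate_kwargs={"num_beams": 5, "repetition_penalty": 1.15},
)

# Word error rate against a reference transcript, with both sides passed
# through the same (Whisper or Indic) normalizer beforehand
wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=[prediction["text"]], references=["<reference transcript>"])
print(f"WER: {wer:.2%}")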

Fig.2 - WER comparison.


Key takeaway

Fine-tuning reduces WER by over 15× on Tiny and Base models, and by 4–5× on Medium and Large-v2—while preserving semantic integrity with Indic normalization.


Looking ahead

Next up, we plan to contribute to ASR with several key initiatives:

  • Architectural Exploration: Fine-tune and compare Conformer-based ASR models or lightweight CTC architectures that can offer lower latency and higher accuracy, while retaining Indic normalization and timestamp techniques.
  • Multilingual Extension: Apply the same fine-tuning pipeline (normalization, WebDataset sharding, timestamp-aware training) to other Indian languages—such as Bengali, Tamil, and Marathi—and even low-resource global languages, democratizing high-quality ASR worldwide.
  • On-Device & Real-Time Inference: Optimize and benchmark models on edge platforms (ARM, Jetson) to enable live captioning and mobile ASR with sub-second latency.

Breaking the myth that you need massive compute, Whisper-Hindi demonstrates that a single RTX 4090, combined with a few smart hacks (WebDataset, mixed-precision, 8-bit AdamW, gentle learning-rate curves), can deliver cutting-edge ASR performance. As we move forward, combining these best practices with new architectures will push Indic ASR—and indeed global multilingual speech recognition—to the next level, making accurate, aligned transcriptions accessible across languages, devices, and applications.

As we roll out new variants and interactive demos, our goal is clear: make robust speech recognition accessible to every language community and every developer with no supercomputer required.

Try Whisper-Hindi for yourself from Hugging Face and help us make ASR accessible to everyone!



Acknowledgments

This project was made possible because of the following resources:



Resources

 
