Vineet Suryan
May 29, 2025
Previously, we launched Whisper-Hindi v1 by fine-tuning OpenAI’s Whisper on 2,500 hours of Hindi speech with Indic normalization, improving whisper-tiny’s WER on google/fleurs from ~172% to ~14%. That work not only showcased the impact of Indic normalization but also paved the way for timestamp-aware fine-tuning.
Now, in this follow-up post, we’ve cleaned and expanded that corpus to 3,000 hours, added explicit timestamp prediction, reorganized our pipeline around WebDataset for 5× faster I/O, and fine-tuned multilingual Whisper models across all sizes. The result is Whisper-Hindi v2, where we achieve an impressive ~5% WER on google/fleurs—bringing us even closer to fully reliable, production-ready Hindi ASR.
Jump to a section: Demo: A verse in Hindi | Why Indic Normalization is critical | Datasets | Why WebDataset is a game changer | Modelling & results | Looking ahead | Acknowledgments | Resources
Here’s a short clip of a Hindi poem. Though we don’t have an official transcript, you’ll notice the model’s output aligns perfectly—capturing both the verse’s rhythm and its literary nuance.
Text output
युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं
लड़ेगा दिन कर नहीं प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी ने बोलो कब पाया है रंग
सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला स्वीकार करेगा जीवन एक संग्राम नहीं है
Even without a reference transcript, it's evident that every matra and conjunct is intact, the poetic cadence flows naturally, and the timestamps below map precisely to each spoken phrase:
Timestamped chunks
[{'timestamp': (0.0, 9.48), 'text': ' युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर'}
{'timestamp': (9.48, 18.3), 'text': ' रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं लड़ेगा दिन कर नहीं'},
{'timestamp': (18.3, 29.8), 'text': ' प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी'},
{'timestamp': (29.8, 40.04), 'text': ' ने बोलो कब पाया है रंग सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला'},
{'timestamp': (40.04, 44.0), 'text': ' स्वीकार करेगा जीवन एक संग्राम नहीं है'}]
Reproduce the demo
import torch
from transformers import pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="collabora/whisper-large-v2-hindi",
    chunk_length_s=30,
    device=device
)
prediction = asr_pipe("./sample.wav", return_timestamps=True)
print(prediction)
Live YouTube captioning in action using WhisperLive (Chrome extension)
Hindi’s diacritics (matras) and conjuncts aren’t just decorative—they encode critical phonetic and semantic cues. If a normalizer strips or mangles these marks, it can turn valid words into unreadable text, breaking user comprehension and any NLP that follows. An Indic-aware normalizer preserves every matra and cluster, keeping transcripts accurate, legible, and robust downstream.
Semantic integrity & readability
Applications like subtitles, captions, or live transcription power real-time communication. Missing or misplaced diacritics not only confuse listeners but also make it hard to trust the output:
Original:
क्षेत्रफल बढ़ने से उत्पादन बढ़ा।
Transliteration => (kṣetrafal baṛhne se utpādan baṛhā.)
Variant | Text | Notes |
---|---|---|
Whisper’s Default Normalization | कषतरफल बढन स उतप दन बढ | Conjunct क्ष split apart; ढ़ loses its aspirate marker |
Indic Normalization | क्षेत्रफल बढ़ने से उत्पादन बढ़ा। | Conjunct क्ष preserved; ढ़ retains its aspirate marker; full readability maintained |
Quick example: compound conjuncts
Original: कृषि (krshi) => Agriculture
Whisper Norm: कष (ks) => NA
Indic Norm: कृषि (krshi) => Agriculture
By maintaining compounds like कृ, the model learns true phonetic clusters, improving both transcription accuracy and pronunciation.
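To see the difference concretely, here is a minimal sketch comparing a Whisper-style basic normalizer with an Indic-aware one from indic-nlp-library. The package choices and options are illustrative, not necessarily the exact normalization used in our training pipeline.

from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

text = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"

# Whisper-style normalization: stripping "diacritics" also strips Devanagari
# matras and viramas, which breaks conjuncts like क्ष.
whisper_norm = BasicTextNormalizer(remove_diacritics=True)
print(whisper_norm(text))

# Indic-aware normalization: canonicalizes code points (e.g. nukta forms)
# while keeping every matra and conjunct intact.
indic_norm = IndicNormalizerFactory().get_normalizer("hi")
print(indic_norm.normalize(text))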
Key benefits
- Conjuncts like क्ष or कृ remain intact, ensuring correct pronunciation and meaning.
- Characters like ढ़ keep their matras, so बढ़ isn’t misread as बढ.

We assemble a diverse collection of Hindi ASR corpora, totaling 3,000 hours after cleanup, with clearly stated licenses to ensure compliance and reproducibility.
Data sources
Dataset | Hours (Hi) | License | Source |
---|---|---|---|
Shrutilipi | ~1,558 h | CC BY 4.0 | ai4bharat/shrutilipi |
IITM Madras SpringLab | ~900 h | CC BY 4.0 | SpringLab |
Common Voice 11.0 (Mozilla) | ~20 h | CC 0 1.0 (public domain) | mozilla/commonvoice |
IndicSUPERB | 150 h | Apache License 2.0 | ai4bharat/indic-superb |
snow-mountain | 67.6 h | CC BY-SA 4.0 | bridgeconn/snow-mountain |
yodas | ~200 h | CC BY 3.0 | espnet/yodas |
IndicVoices-R_Hindi | 75 h | CC BY 4.0 | SPRINGLab/IndicVoices-R_Hindi |
Lahaja | 12.5 h | CC BY 4.0 | ai4bharat/lahaja |
fleurs | 30.0 h | CC BY 4.0 | google/fleurs |
Preprocessing pipeline
Stage 1 builds a metadata table with three columns:
- file_name (path to merged clip)
- sentence (merged, normalized transcript with timestamps)
- duration (clip length in seconds)

Stage 2 turns that metadata into model-ready features (a minimal sketch follows below):
- Load the table with load_dataset(..., features=Audio()) so downstream scripts treat it as an “audio dataset.”
- Compute log-Mel features with WhisperFeatureExtractor.
- Tokenize transcripts with WhisperTokenizer to produce input_ids with timestamp tokens.
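As a rough illustration of that second stage, here is how one sample might be converted; the file name "metadata.csv" and the checkpoint "openai/whisper-tiny" are placeholders, and the exact training script may differ.

from datasets import load_dataset, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-tiny", language="hindi", task="transcribe", predict_timestamps=True
)

# Load the stage-1 metadata and decode file_name paths into an audio column.
ds = load_dataset("csv", data_files="metadata.csv", split="train")
ds = ds.rename_column("file_name", "audio").cast_column("audio", Audio(sampling_rate=16000))

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel spectrogram computed from the raw waveform.
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token IDs for the transcript, including its embedded timestamp tokens.
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

ds = ds.map(prepare)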
Why this two-stage approach?
When you’re firing through thousands of hours of speech, nothing grinds the GPU to a halt faster than slow disk reads. Packing our clips (with all their timestamp tags) into WebDataset shards was a game-changer—suddenly we were training 5–6× faster, and actually enjoying the process again with no more staring at stalled training runs.
Each sample in a shard contains input_features.npz (precomputed log-Mel arrays) and labels.npz (token ID sequences with timestamp markers).

Shifting everything, from preprocessing through training, into WebDataset shards meant we barely touched the file system during a run. No more millions of tiny file opens, just fast, steady throughput from start to finish. With our data fully preprocessed and sharded, and timestamps baked in, we’re ready to fine-tune and evaluate.
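For reference, a stripped-down sketch of how precomputed features can be packed into WebDataset shards and streamed back; the shard paths and sizes are placeholders rather than our exact layout.

import io
import numpy as np
import webdataset as wds

# Placeholder: iterable of (log-Mel ndarray, token-id list) from the preprocessing step.
samples = []

# Write: serialize each sample's arrays into .npz payloads inside .tar shards.
with wds.ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
    for idx, (features, label_ids) in enumerate(samples):
        feat_buf, label_buf = io.BytesIO(), io.BytesIO()
        np.savez(feat_buf, input_features=features)
        np.savez(label_buf, labels=np.asarray(label_ids))
        sink.write({
            "__key__": f"sample{idx:08d}",
            "input_features.npz": feat_buf.getvalue(),
            "labels.npz": label_buf.getvalue(),
        })

# Read: sequential tar reads replace millions of small file opens.
def load_sample(sample):
    feats = np.load(io.BytesIO(sample["input_features.npz"]))["input_features"]
    labels = np.load(io.BytesIO(sample["labels.npz"]))["labels"]
    return feats, labels

train_ds = wds.WebDataset("shards/train-{000000..000099}.tar").map(load_sample)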
We optimized for both speed and stability on a single RTX 4090. In practice, we trained for 5 epochs on Tiny/Base/Small, 4 on Medium, and 3 on Large-v2, with 600 warm-up steps (300 for Large-v2). Mixed-precision training and 8-bit AdamW (just for Large-v2) helped keep memory in check. Here are the full details:
Setting | Value |
---|---|
Hardware | NVIDIA RTX 4090 GPU |
Epochs | Tiny/Base/Small – 5; Medium – 4; Large-v2 – 3 |
Effective Batch Size | 128 |
Warm-up Steps | 600 (300 for Large-v2) |
Optimizer | AdamW 8-bit (bitsandbytes) for Large-v2; fp16 AdamW for others |
β₁, β₂, ε | 0.9, 0.999, 1e-8 |
Weight Decay | 0.05 (versus 0.1 in Whisper paper) |
LR Scheduler | Linear decay after warm-up |
Learning rate comparison
Model Size | Paper Max LR | v2 Max LR |
---|---|---|
Tiny | 1.5 × 10⁻³ | 3 × 10⁻⁴ |
Base | 1 × 10⁻³ | 1.5 × 10⁻⁴ |
Small | 5 × 10⁻⁴ | 1 × 10⁻⁴ |
Medium | 2.5 × 10⁻⁴ | 5 × 10⁻⁵ |
Large-v2 | 2 × 10⁻⁴ | 3.5 × 10⁻⁵ |
Why these choices?
Wrapping each segment with <start>/<end> timestamp tokens allows the model to also output timestamp tokens without impacting WER.

In practice, dialing in these learning rates, warm-ups, and weight-decay settings gave us smooth, predictable loss curves and dramatic WER drops, from Tiny all the way to Large-v2, without any nasty surprises.
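As one concrete illustration, the table above maps roughly onto a Hugging Face Seq2SeqTrainingArguments configuration like the following for the Large-v2 run; the per-device batch size and gradient-accumulation split are assumptions, only their product (128) comes from the table.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-hindi",
    per_device_train_batch_size=8,     # assumed split; effective batch = 8 x 16 = 128
    gradient_accumulation_steps=16,
    learning_rate=3.5e-5,              # max LR for Large-v2 from the table above
    warmup_steps=300,
    num_train_epochs=3,
    lr_scheduler_type="linear",        # linear decay after warm-up
    weight_decay=0.05,
    fp16=True,                         # mixed-precision training
    optim="adamw_bnb_8bit",            # 8-bit AdamW via bitsandbytes
    predict_with_generate=True,
)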
Fig.1 - Training loss vs. global step.
Baseline (pre–fine-tuning) Whisper WER %
Model | Whisper Norm | Indic Norm |
---|---|---|
Tiny | 172.60 | 196.57 |
Base | 149.17 | 160.58 |
Small | 67.37 | 89.73 |
Medium | 26.85 | 43.16 |
Large-v2 | 21.54 | 38.46 |
Fine-tuned Whisper WER %
Model | Whisper Norm | Indic Norm |
---|---|---|
Tiny | 10.01 | 18.94 |
Base | 8.95 | 17.42 |
Small | 7.17 | 15.10 |
Medium | 6.10 | 13.36 |
Large-v2 | 5.33 | 13.06 |
We decoded using beam search with beam_size=5, applied a repetition_penalty=1.15, and removed any samples containing numeric characters.
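A hedged sketch of those decoding settings, reusing the asr_pipe pipeline from the demo above; beam_size maps onto num_beams in Hugging Face generate, and the full evaluation loop (dataset iteration, numeric-sample filtering) is omitted.

# Beam search with a repetition penalty, passed through to model.generate().
prediction = asr_pipe(
    "./sample.wav",
    return_timestamps=True,
    generate_kwargs={"num_beams": 5, "repetition_penalty": 1.15},
)
print(prediction["text"])

# WER can then be computed on normalized text, e.g. with the evaluate library.
import evaluate
wer = evaluate.load("wer")
score = wer.compute(predictions=[prediction["text"]], references=["<reference transcript>"])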
Fig.2 - WER comparison.
Key takeaway
Fine-tuning reduces WER by over 15× on Tiny and Base models, and by 4–5× on Medium and Large-v2—while preserving semantic integrity with Indic normalization.
Next up, we plan to contribute to ASR with several key initiatives:
Breaking the myth that you need massive compute, Whisper-Hindi demonstrates that a single RTX 4090, combined with a few smart hacks (WebDataset, mixed-precision, 8-bit AdamW, gentle learning-rate curves), can deliver cutting-edge ASR performance. As we move forward, combining these best practices with new architectures will push Indic ASR—and indeed global multilingual speech recognition—to the next level, making accurate, aligned transcriptions accessible across languages, devices, and applications.
As we roll out new variants and interactive demos, our goal is clear: make robust speech recognition accessible to every language community and every developer with no supercomputer required.
Try Whisper-Hindi for yourself from Hugging Face and help us make ASR accessible to everyone!
This project was made possible because of the following resources: