
Breaking language barriers 2.0: Moving closer towards fully reliable, production-ready Hindi ASR


Vineet Suryan
May 29, 2025


Previously, we launched Whisper-Hindi v1 by fine-tuning OpenAI’s Whisper on 2,500 hours of Hindi speech with Indic normalization, improving whisper-tiny’s WER on google/fleurs from ~172% to ~14%. That work not only showcased the impact of Indic normalization but also paved the way for timestamp-aware fine-tuning.

Now, in this follow-up post, we’ve cleaned and expanded that corpus to 3,000 hours, added explicit timestamp prediction, reorganized our pipeline around WebDataset for 5× faster I/O, and fine-tuned multilingual Whisper models across all sizes. The result is Whisper-Hindi v2, where we achieve an impressive ~5% WER on google/fleurs—bringing us even closer to fully reliable, production-ready Hindi ASR.

Jump to a section: Demo: A verse in Hindi | Why Indic Normalization is critical | Datasets | Why WebDataset is a game changer | Modelling & results | Looking ahead | Acknowledgments | Resources



Demo: A verse in Hindi

Here’s a short clip of a Hindi poem. Though we don’t have an official transcript, you’ll notice the model’s output aligns perfectly—capturing both the verse’s rhythm and its literary nuance.

Text output

युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं 
लड़ेगा दिन कर नहीं प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी ने बोलो कब पाया है रंग 
सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला स्वीकार करेगा जीवन एक संग्राम नहीं है

Even without a reference transcript, it's evident that every matra and conjunct is intact, the poetic cadence flows naturally, and the timestamps below map precisely to each spoken phrase:

Timestamped chunks

[{'timestamp': (0.0, 9.48), 'text': ' युद्ध नहीं जिनके जीवन में वे भी बहुत अभागे होंगे या तो प्रण को तोड़ा होगा या फिर'},
{'timestamp': (9.48, 18.3), 'text': ' रण से भागे होंगे दीपक का कुछ अर्थ नहीं है जब तक तम से नहीं लड़ेगा दिन कर नहीं'},
{'timestamp': (18.3, 29.8), 'text': ' प्रभा बांटेगा जब तक स्वयं नहीं धधकेगा कभी दहकती ज्वाला के बिन कुंदन भला बना है सोना बिना घिसे मेंहदी'},
{'timestamp': (29.8, 40.04), 'text': ' ने बोलो कब पाया है रंग सलोना जीवन के पथके राही को क्षण भर भी विश्राम नहीं है कौन भला'},
{'timestamp': (40.04, 44.0), 'text': ' स्वीकार करेगा जीवन एक संग्राम नहीं है'}] 

Reproduce the demo

import torch
from transformers import pipeline

# Use the GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="collabora/whisper-large-v2-hindi",
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
    device=device,
)

# Transcribe the clip and return timestamped chunks
prediction = asr_pipe("./sample.wav", return_timestamps=True)
print(prediction)

Live YouTube captioning in action using WhisperLive (Chrome extension)


Why Indic Normalization is critical

Hindi’s diacritics (matras) and conjuncts aren’t just decorative—they encode critical phonetic and semantic cues. If a normalizer strips or mangles these marks, it can turn valid words into unreadable text, breaking user comprehension and any NLP that follows. An Indic-aware normalizer preserves every matra and cluster, keeping transcripts accurate, legible, and robust downstream.

Semantic integrity & readability

Applications like subtitles, captions, or live transcription power real-time communication. Missing or misplaced diacritics not only confuse listeners but also make it hard to trust the output:

Original:

क्षेत्रफल बढ़ने से उत्पादन बढ़ा।
Transliteration => (kṣetrafal baṛhne se utpādan baṛhā.)
Whisper’s Default Normalization: कषतरफल बढन स उतप दन बढ
  • क्ष split into क्‌ष
  • ढ़ loses its aspirate matra

Indic Normalization: क्षेत्रफल बढ़ने से उत्पादन बढ़ा।
  • Conjunct क्ष preserved
  • ढ़ retains its aspirated marker
  • Full readability maintained


Quick example: compound characters

Original:      कृषि       (krshi)  => Agriculture
Whisper Norm:  कष        (ks)     => NA
Indic Norm:    कृषि       (krshi)  => Agriculture

By maintaining compounds like कृ, the model learns true phonetic clusters, improving both transcription accuracy and pronunciation.
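
For illustration, here is a minimal sketch of how the two normalizers can be compared, assuming the indic-nlp-library package and the BasicTextNormalizer shipped with transformers (this is not our exact pipeline code):

# Compare Whisper's language-agnostic normalizer with an Indic-aware one.
# Assumes the `indic-nlp-library` and `transformers` packages are installed;
# illustrative sketch, not our exact pipeline code.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

text = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"

# Whisper's basic normalizer replaces combining marks, which strips matras
# and breaks conjuncts apart
whisper_norm = BasicTextNormalizer()
print(whisper_norm(text))

# The Indic-aware normalizer keeps every matra and conjunct intact
indic_norm = IndicNormalizerFactory().get_normalizer("hi")
print(indic_norm.normalize(text))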

Key benefits

  • Conjunct preservation: Compound characters such as क्ष or कृ remain intact, ensuring correct pronunciation and meaning.
  • Diacritic retention: Aspirated characters like ढ़ keep their matras, so बढ़ isn’t misread as बढ.
  • Word boundaries & readability: Every matra and conjunct stays intact, making transcripts fully legible and trustworthy—critical for real-time ASR applications.

Datasets

We assembled a diverse collection of Hindi ASR corpora, totaling roughly 3,000 hours after cleanup, each with a clearly stated license to ensure compliance and reproducibility.

Data sources

Dataset Hours (Hi) License Source
Shrutilipi ~1,558 h CC BY 4.0 ai4bharat/shrutilipi
IITM Madras SpringLab ~900 h CC BY 4.0 SpringLab
Common Voice 11.0 (Mozilla) ~20 h CC 0 1.0 (public domain) mozilla/commonvoice
IndicSUPERB 150 h Apache License 2.0 ai4bharat/indic-superb
snow-mountain 67.6 h CC BY-SA 4.0 bridgeconn/snow-mountain
yodas ~200 h CC BY 3.0 espnet/yodas
IndicVoices-R_Hindi 75 h CC BY 4.0 SPRINGLab/IndicVoices-R_Hindi
Lahaja 12.5 h CC BY 4.0 ai4bharat/lahaja
fleurs 30.0 h CC BY 4.0 google/fleurs
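
Each corpus can be pulled straight from the Hugging Face Hub. For example, the Hindi split of google/fleurs loads as follows (assuming the "hi_in" configuration name used on the Hub for Hindi):

from datasets import load_dataset

# Hindi configuration of FLEURS; audio comes pre-sampled at 16 kHz
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="test")
print(fleurs_hi[0]["transcription"])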


Preprocessing pipeline

  1. Raw → HuggingFace Audio
    • Download raw speech corpus (WAV/MP3).
    • Merge segments into clips ≤ 30 s:
      • Concatenate consecutive audio files until either the 30s limit or the Whisper token-length limit is reached.
      • Create & record per-segment start/end times for timestamp tokens.
    • Apply Indic Normalization on merged transcripts so that all matras and conjuncts are preserved.
    • Emit a CSV/JSON metadata file with:
      • file_name (path to merged clip)
      • sentence (merged, normalized transcript with timestamps)
      • duration (clip length in seconds)
    • Load into HuggingFace via load_dataset(..., features=Audio()) so downstream scripts treat it as an “audio dataset.”
  2. Audio Dataset → Model-Ready Features
    • Cast audio: ensure the audio column is at 16 kHz.
    • Feature extraction: convert each audio clip into log-Mel spectrogram tensors using WhisperFeatureExtractor.
    • Label tokenization: run the (already normalized) text through WhisperTokenizer to produce input_ids with timestamp tokens.
    • Save the preprocessed dataset to disk.
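
A minimal sketch of this second stage is shown below, assuming the merged clips and normalized transcripts from stage 1 are available as an "audiofolder"-style dataset with file_name and sentence columns (paths and the checkpoint size are illustrative, timestamp-token handling is omitted for brevity, and this is not our exact preprocessing script):

from datasets import Audio, load_dataset
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="hindi", task="transcribe"
)

def prepare_example(example):
    audio = example["audio"]
    # Log-Mel spectrogram expected by the Whisper encoder
    example["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token IDs for the (already Indic-normalized) transcript
    example["labels"] = tokenizer(example["sentence"]).input_ids
    return example

# Stage 1 output: merged clips plus a metadata CSV (file_name, sentence, duration)
dataset = load_dataset("audiofolder", data_dir="./merged_clips")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset = dataset.map(prepare_example, remove_columns=dataset.column_names["train"])
dataset.save_to_disk("./whisper_hindi_features")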

Why this two-stage approach?

  • Modularity: separate raw merging (with normalization & timestamp logic) from expensive spectrogram/tokenization.
  • Efficiency: saving preprocessed features means repeated experiments skip audio decoding and normalization, cutting hours off hyperparameter sweeps.


Why WebDataset is a game changer

When you’re firing through thousands of hours of speech, nothing grinds the GPU to a halt faster than slow disk reads. Packing our clips (with all their timestamp tags) into WebDataset shards was a game-changer—suddenly we were training 5–6× faster, and actually enjoying the process again with no more staring at stalled training runs.

  • Tar-based shards
    Each shard (~700 MB) packs 1,000 samples into a single tar file. Within each sample we store input_features.npz (precomputed log-Mel arrays) and labels.npz (token ID sequences with timestamp markers).
  • Shard streaming
    Shards can be read sequentially by multiple CPU workers in parallel. This maximizes throughput by:
    1. Fast sequential reads (perfect for NVMe or even spinning disks)
    2. Independent decoding per worker
    3. Shards are created from a dataset that is already shuffled.
  • Training: workers independently stream from different shards, decode on the fly, and feed batches to the GPU.

Shifting everything, from preprocessing through training, into WebDataset shards meant we barely touched the file system during a run. No more millions of tiny file opens, just fast, steady throughput from start to finish. With our data fully preprocessed and sharded, and timestamps baked in, we’re ready to fine-tune and evaluate.
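
The sharding step itself is only a few lines with the webdataset library. The sketch below continues from the preprocessed dataset above (shard paths, key names, and the npz layout are illustrative, not our exact packing script):

import io
import numpy as np
import webdataset as wds

def npz_bytes(**arrays):
    # Serialize numpy arrays to raw .npz bytes so the writer stores them as-is
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    return buf.getvalue()

# Pack ~1,000 samples per tar shard from the (already shuffled) dataset
with wds.ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
    for i, example in enumerate(dataset["train"]):
        sink.write({
            "__key__": f"{i:09d}",
            "input_features.npz": npz_bytes(x=np.asarray(example["input_features"], dtype=np.float32)),
            "labels.npz": npz_bytes(x=np.asarray(example["labels"], dtype=np.int64)),
        })

# At training time, workers stream shards sequentially and decode on the fly
def decode(sample):
    feats = np.load(io.BytesIO(sample["input_features.npz"]))["x"]
    labels = np.load(io.BytesIO(sample["labels.npz"]))["x"]
    return {"input_features": feats, "labels": labels}

train_stream = wds.WebDataset("shards/train-{000000..000099}.tar").map(decode)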


Modelling & results

We optimized for both speed and stability on a single RTX 4090. In practice, we trained for 5 epochs on Tiny/Base/Small, 4 on Medium, and 3 on Large-v2, with 600 warm-up steps (300 for Large-v2). Mixed-precision training and 8-bit AdamW (just for Large-v2) helped keep memory in check. Here are the full details:

Setting Value
Hardware NVIDIA RTX 4090 GPU
Epochs Tiny/Base/Small – 5; Medium – 4; Large-v2 – 3
Effective Batch Size 128
Warm-up Steps 600 (300 for Large-v2)
Optimizer AdamW 8-bit (bitsandbytes) for Large-v2; fp16 AdamW for others
β₁, β₂, ε 0.9, 0.999, 1e-8
Weight Decay 0.05 (versus 0.1 in Whisper paper)
LR Scheduler Linear decay after warm-up
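
In transformers terms, these settings translate roughly to the configuration below for Large-v2 (the per-device batch size and gradient-accumulation split are an assumption chosen to reach the effective batch of 128; this is a sketch, not our exact training script):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v2-hindi",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = 128 effective batch size
    learning_rate=3.5e-5,             # max LR for Large-v2 (see table below)
    warmup_steps=300,
    lr_scheduler_type="linear",       # linear decay after warm-up
    weight_decay=0.05,
    num_train_epochs=3,
    fp16=True,                        # mixed-precision training
    optim="adamw_bnb_8bit",           # 8-bit AdamW via bitsandbytes
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    predict_with_generate=True,
)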


Learning rate comparison

Model Size Paper Max LR v2 Max LR
Tiny 1.5 × 10⁻³ 3 × 10⁻⁴
Base 1 × 10⁻³ 1.5 × 10⁻⁴
Small 5 × 10⁻⁴ 1 × 10⁻⁴
Medium 2.5 × 10⁻⁴ 5 × 10⁻⁵
Large-v2 2 × 10⁻⁴ 3.5 × 10⁻⁵


Why these choices?

  • Lower max LRs ensure gradual adaptation of pre-trained weights, reducing the risk of divergence or sub-optimal minima.
  • Longer warm-up gives larger models time to “ramp up” before decay.
  • Reduced weight decay (0.05) strikes a balance between regularization and preserving pre-trained features.
  • 8-bit optimizer on the largest model cuts memory use, allowing higher batch sizes without GPU OOMs.
  • Timestamp tokens: training with explicit <start>/<end> allows the model to also output timestamp tokens without impacting WER.

In practice, dialing in these learning rates, warm-ups, and weight-decay settings gave us smooth, predictable loss curves and dramatic WER drops—from Tiny all the way to Large-v2—without any nasty surprises.

Fig.1 - Training loss vs. global step.


Baseline (pre–fine-tuning) Whisper WER %

Model Whisper Norm Indic Norm
Tiny 172.60 196.57
Base 149.17 160.58
Small 67.37 89.73
Medium 26.85 43.16
Large-v2 21.54 38.46


Fine-tuned Whisper WER %

Model Whisper Norm Indic Norm
Tiny 10.01 18.94
Base 8.95 17.42
Small 7.17 15.10
Medium 6.10 13.36
Large-v2 5.33 13.06


We decoded using beam search with beam_size=5, applied a repetition_penalty=1.15, and removed any samples containing numeric characters.
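
For reference, here is a minimal sketch of this decoding and scoring setup, using the pipeline from the demo and the evaluate library (the reference string is a placeholder; this is illustrative rather than our exact evaluation script):

import evaluate

# Beam search with a repetition penalty, re-using the asr_pipe from the demo
prediction = asr_pipe(
    "./sample.wav",
    return_timestamps=True,
    generate_kwargs={"num_beams": 5, "repetition_penalty": 1.15},
)

# Word error rate against a reference transcript, with both sides passed
# through the same (Whisper or Indic) normalizer beforehand
wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=[prediction["text"]], references=["<reference transcript>"])
print(f"WER: {wer:.2%}")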

Fig.2 - WER comparison.


Key takeaway

Fine-tuning reduces WER by over 15× on Tiny and Base models, and by 4–5× on Medium and Large-v2—while preserving semantic integrity with Indic normalization.


Looking ahead

Next up, we plan to contribute to ASR with several key initiatives:

  • Architectural Exploration: Fine-tune and compare Conformer-based ASR models or lightweight CTC architectures that can offer lower latency and higher accuracy, while retaining Indic normalization and timestamp techniques.
  • Multilingual Extension: Apply the same fine-tuning pipeline (normalization, WebDataset sharding, timestamp-aware training) to other Indian languages—such as Bengali, Tamil, and Marathi—and even low-resource global languages, democratizing high-quality ASR worldwide.
  • On-Device & Real-Time Inference: Optimize and benchmark models on edge platforms (ARM, Jetson) to enable live captioning and mobile ASR with sub-second latency.

Breaking the myth that you need massive compute, Whisper-Hindi demonstrates that a single RTX 4090, combined with a few smart hacks (WebDataset, mixed-precision, 8-bit AdamW, gentle learning-rate curves), can deliver cutting-edge ASR performance. As we move forward, combining these best practices with new architectures will push Indic ASR—and indeed global multilingual speech recognition—to the next level, making accurate, aligned transcriptions accessible across languages, devices, and applications.

As we roll out new variants and interactive demos, our goal is clear: make robust speech recognition accessible to every language community and every developer with no supercomputer required.

Try Whisper-Hindi for yourself from Hugging Face and help us make ASR accessible to everyone!



Acknowledgments

This project was made possible because of the following resources:



Resources

 
