WhisperSpeech: Exploring new horizons in text-to-speech tech

Jakub Piotr Clapa
September 13, 2023

TL;DR: Collabora is building the best natural-sounding Open Source speech synthesis solution which is ready for commercial use – based only on properly licensed speech datasets and unrestricted Open Source code.

In the modern digital era, the influence of speech technology is rapidly expanding. Text-to-speech (TTS) models are playing a transformative role, from enriching audiobooks to enhancing podcasts and even improving interactions with chatbots. We're introducing a new player in this field – WhisperSpeech, an Open Source text-to-speech model developed by Collabora.

The Versatility of Text-to-Speech

Text-to-speech technology is no longer limited to screen readers or creating impromptu audiobooks from blog posts for personal use. Its applications now extend across a range of areas beyond traditional uses. While the most apparent application is to produce captivating narrated content or improve podcasts, WhisperSpeech has potential uses in:

Audio Editing: TTS models let creators seamlessly modify audio tracks in podcasts and videos. One demonstration on YouTube replaces explicit content with synthesized speech, a creative solution for self-censorship and content adaptation. Another application is editing interviews and improving vocal performances without re-recording.

Interactive Voice Response (IVR) Systems: WhisperSpeech's natural-sounding speech is ideal for IVR systems, making automated interactions more personalized and engaging for customers.

Public Announcements: In public spaces or commercial environments, the model's realistic speech could be harnessed for clear and effective announcements.

Introducing WhisperSpeech: A Glimpse into Innovation

WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model focuses on delivering natural-sounding speech for improved communication. The aim is a TTS model that is adaptable, easy to integrate, and multilingual.

Collabora holds ambitious plans for the future of WhisperSpeech. Larger models with even higher speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, our company builds this foundational model using properly licensed datasets, facilitating commercial use without legal concerns.

An end-to-end generation example, inspired by one famous president’s speech, is embedded as a video in the original post.

The Genesis of WhisperSpeech: Inspiration and Innovation

WhisperSpeech's foundation traces back to the SPEAR TTS paper from Google Research. Unfortunately, Google released neither the code nor the weights. Inspired by the model's impressive quality and straightforward design, we ventured into developing a new Open Source TTS model.

At the time, Collabora was working on WhisperLive, a live transcription tool based on OpenAI's Whisper model. The exceptional accuracy and multilingual capabilities of the Whisper models were striking, yet Google's SPEAR TTS solution used custom models for speech transcription and semantic token extraction. This led to the question: could Whisper handle these tasks instead, improving on the SPEAR TTS approach by using a high-quality supervised model?

Unveiling the Inner Workings of WhisperSpeech

WhisperSpeech's innovative architecture takes its inspiration from the Whisper speech recognition model and reverses its operation to move from transcription to text-to-speech synthesis. This unique approach opens doors to a host of possibilities in generating natural speech. The model utilizes existing open-source technologies such as the Encodec audio codec from Meta and the Vocos vocoder from charactr, ensuring efficiency by building upon established solutions.

The insights gained from the SPEAR TTS architecture were pivotal in this advancement. Dividing speech synthesis into two separate phases, reading and speaking, makes the problem more manageable and improves accuracy. Both matter because a single sentence can be spoken in many subtly different ways.
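The two-phase split can be pictured as a simple composition of stages. The sketch below is only an illustration of the data flow: the function names (synthesize, toy_read, toy_speak, toy_vocode) are hypothetical, and the toy stages are trivial arithmetic stand-ins for trained models, not WhisperSpeech code.

```python
from typing import Callable, List


def synthesize(
    text: str,
    read: Callable[[str], List[int]],
    speak: Callable[[List[int]], List[int]],
    vocode: Callable[[List[int]], List[float]],
) -> List[float]:
    """Chain the phases: text -> semantic tokens -> acoustic tokens -> audio."""
    semantic_tokens = read(text)              # "reading" phase
    acoustic_tokens = speak(semantic_tokens)  # "speaking" phase
    return vocode(acoustic_tokens)            # tokens back to a waveform


# Toy stand-ins (assumptions for illustration only; real stages are neural models).
def toy_read(text: str) -> List[int]:
    return [ord(ch) % 8 for ch in text]


def toy_speak(semantic: List[int]) -> List[int]:
    # Emit two acoustic tokens per semantic token.
    return [t * 2 + k for t in semantic for k in (0, 1)]


def toy_vocode(acoustic: List[int]) -> List[float]:
    return [a / 16.0 for a in acoustic]


waveform = synthesize("hi", toy_read, toy_speak, toy_vocode)
print(waveform)  # [0.0, 0.0625, 0.125, 0.1875]
```

Because each phase only depends on the token format of the previous one, a stage can be retrained or swapped without touching the others.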

Whisper, the foundation of WhisperSpeech, consists of two main parts: the encoder and the decoder. The encoder produces a continuous stream of embeddings, each containing contextual information and prosody that enrich the audio data. The decoder then transforms these embeddings into words, utilizing cross-attention mechanisms to identify sound fragments composing each word.
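To make the cross-attention step more concrete, here is a minimal pure-Python sketch of scaled dot-product attention, in which one decoder query scores every encoder frame. It is a simplification under stated assumptions: the encoder frames serve as both keys and values, there are no learned projections or multiple heads, and the numbers are made up.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Three hypothetical encoder frames and one decoder query (toy values).
encoder_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [4.0, 0.0]

weights, context = cross_attention(query, encoder_frames, encoder_frames)
# Frames 0 and 2 match the query equally well, so they share most of the weight.
print(weights)
```

The weights sum to one, so the context vector is a weighted average of the encoder frames that the query found most relevant.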

WhisperSpeech operates in reverse: it takes text as input, generates embeddings, and then transforms them into sound. Whisper's encoder output is quantized to form semantic tokens (500 bps), which carry phonetic and prosodic attributes. Meta's Encodec model compresses the audio itself into acoustic tokens at 1.5 kbps. Collabora trained two standard seq2seq transformer models on these tokens: one for the Text-to-Semantic (T2S) step and one for the Semantic-to-Acoustic (S2A) step. Combined with the open-source Vocos vocoder, these models turn text input into high-quality speech.
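As a rough illustration of what quantizing embeddings into tokens means, the sketch below maps each embedding to the index of its nearest codebook vector and computes the bitrate of a token stream. The codebook and embeddings are toy values, and the helper names (quantize, bitrate_bps) are hypothetical. The EnCodec figures used (75 frames per second, 2 codebooks of 1024 entries at the 1.5 kbps setting of the 24 kHz model) are our understanding of the published EnCodec configuration, not something stated in this post.

```python
import math
from typing import List


def quantize(embedding: List[float], codebook: List[List[float]]) -> int:
    """Return the index of the codebook vector closest to `embedding`."""
    def squared_distance(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(embedding, codebook[i]))


def bitrate_bps(frames_per_second: float, codebooks_per_frame: int,
                codebook_size: int) -> float:
    """Bits per second of a token stream: each token costs log2(codebook_size) bits."""
    return frames_per_second * codebooks_per_frame * math.log2(codebook_size)


# Toy 2-D codebook with four entries (real codebooks are learned and much larger).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

tokens = [quantize(e, toy_codebook) for e in embeddings]
print(tokens)  # [0, 3, 2]

# Assumed EnCodec 24 kHz setting: 75 frames/s, 2 codebooks, 1024 entries each.
encodec_bps = bitrate_bps(75, 2, 1024)
print(encodec_bps)  # 1500.0
```

The same formula applies to the semantic token stream once its token rate and codebook size are known.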

WhisperSpeech's Future

Collabora's WhisperSpeech project holds promise in reshaping communication, content creation, and interaction through text-to-speech technology. WhisperSpeech's unique approach, built upon the successes of Whisper and SPEAR TTS, has the potential to establish new standards in open-source natural speech synthesis.

As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperSpeech's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperSpeech is poised to make an impact in the speech technology landscape. For a more in-depth discussion about WhisperSpeech, check out the latest episode of Democratizing AI: "Open Source Text-To-Speech Projects: WhisperSpeech - In Depth Discussion".
