WhisperSpeech: Exploring new horizons in text-to-speech tech

Jakub Piotr Cłapa
September 13, 2023

TL;DR: Collabora is building the best natural-sounding Open Source speech synthesis solution which is ready for commercial use – based only on properly licensed speech datasets and unrestricted Open Source code.

In the modern digital era, the influence of speech technology is rapidly expanding. Text-to-speech (TTS) models are playing a transformative role, from enriching audiobooks to enhancing podcasts and even improving interactions with chatbots. We're introducing a new player in this field – WhisperSpeech, an Open Source text-to-speech model developed by Collabora.

The Versatility of Text-to-Speech

Text-to-speech technology is no longer limited to screen readers or creating impromptu audiobooks from blog posts for personal use. Its applications now extend across a range of areas beyond traditional uses. While the most apparent application is to produce captivating narrated content or improve podcasts, WhisperSpeech has potential uses in:

Audio Editing: TTS models offer creators the ability to seamlessly modify audio tracks in podcasts and videos. This YouTube video demonstrates the replacement of explicit content with synthesized speech, providing a creative solution for self-censorship and content adaptation. Another application is editing interviews and improving vocal performances without re-recording.

Interactive Voice Response (IVR) Systems: WhisperSpeech's natural-sounding speech is ideal for IVR systems, making automated interactions more personalized and engaging for customers.

Public Announcements: In public spaces or commercial environments, the model's realistic speech could be harnessed for clear and effective announcements.

Introducing WhisperSpeech: A Glimpse into Innovation

WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.

Collabora holds ambitious plans for the future of WhisperSpeech. Larger models with even higher speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, our company builds this foundational model using properly licensed datasets, facilitating commercial use without legal concerns.

[Video: an end-to-end generation example, inspired by one famous president's speech]

The Genesis of WhisperSpeech: Inspiration and Innovation

WhisperSpeech's foundation traces back to the SPEAR TTS paper from Google Research. The paper's impressive quality and straightforward design inspired us, but Google released neither the code nor the weights, so we set out to develop a new Open Source TTS model.

At the time, Collabora was working on WhisperLive, a live transcription tool based on OpenAI's Whisper model. The exceptional accuracy and multilingual capabilities of the Whisper models were striking, yet Google's SPEAR TTS solution used custom models for speech transcription and semantic token extraction. This led to the question: could Whisper handle these tasks instead, improving on the SPEAR TTS approach with a high-quality supervised model?

Unveiling the Inner Workings of WhisperSpeech

WhisperSpeech's innovative architecture takes its inspiration from the Whisper speech recognition model and reverses its operation to move from transcription to text-to-speech synthesis. This unique approach opens doors to a host of possibilities in generating natural speech. The model utilizes existing open-source technologies such as the Encodec audio codec from Meta and the Vocos vocoder from charactr, ensuring efficiency by building upon established solutions.

The insights gained from the SPEAR TTS architecture were pivotal in this advancement. Dividing speech synthesis into two separate phases, reading and speaking, makes the process more efficient, improving both manageability and accuracy. Both matter because a single sentence can be spoken in many subtly different ways.

Whisper, the foundation of WhisperSpeech, consists of two main parts: the encoder and the decoder. The encoder produces a continuous stream of embeddings, each containing contextual information and prosody that enrich the audio data. The decoder then transforms these embeddings into words, utilizing cross-attention mechanisms to identify sound fragments composing each word.

WhisperSpeech operates in reverse by taking text as input, generating embeddings, and then transforming them into sound. Whisper's encoder output is quantized to form semantic tokens (500 bps), enriched with phonetic and prosodic attributes. Meta's Encodec model is then used to compress the audio data into acoustic tokens at 1.5 kbps. These tokens are handled by popular seq2seq transformer architectures, and Collabora trained dedicated models for the Text-to-Semantic (T2S) and Semantic-to-Acoustic (S2A) steps. Combined with the open-source Vocos vocoder, these models yield high-quality speech from text input.
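To make the quantization and bitrate figures more concrete, here is a minimal Python sketch. It is a toy illustration and not WhisperSpeech's actual code: `quantize` and `bitrate_bps` are hypothetical helper names, the 2-D codebook is made up, and the 1024-entry codebook assumed for the semantic tokens is a guess used only to show the arithmetic. The Encodec numbers (75 frames per second, two 1024-entry codebooks at the 1.5 kbps setting) and Whisper's encoder rate (1500 frames per 30-second window, so 50 per second) come from those models' published designs.

```python
import math


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    tokens = []
    for vec in embeddings:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((a - b) ** 2 for a, b in zip(vec, entry))
            for entry in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def bitrate_bps(tokens_per_second, codebook_size, num_codebooks=1):
    """Bits per second needed to transmit a stream of codebook indices."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * num_codebooks * bits_per_token


# Toy example: three 2-D "frames" and a hypothetical 4-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]
print(quantize(frames, codebook))  # [0, 3, 2]

# Encodec (24 kHz model) at its lowest setting:
# 75 frames/s, 2 codebooks of 1024 entries each.
print(bitrate_bps(75, 1024, num_codebooks=2))  # 1500.0

# Assumed semantic quantizer: one token per Whisper encoder frame (50/s),
# with a hypothetical 1024-entry codebook.
print(bitrate_bps(50, 1024))  # 500.0
```

The point of the sketch is that a token stream's bitrate is just tokens per second times bits per token, which is why semantic tokens can be far more compact than the acoustic tokens that carry the actual sound.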

WhisperSpeech's Future

Collabora's WhisperSpeech project holds promise in reshaping communication, content creation, and interaction through text-to-speech technology. WhisperSpeech's unique approach, built upon the successes of Whisper and SPEAR TTS, has the potential to establish new standards in open-source natural speech synthesis.

As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperSpeech's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperSpeech is poised to make an impact in the speech technology landscape. For a more in-depth discussion about WhisperSpeech, check out the latest episode of Democratizing AI: "Open Source Text-To-Speech Projects: WhisperSpeech - In Depth Discussion".

Comments (3)

Stuart Naylor:
Sep 14, 2023 at 12:40 PM

Hi Jakub, I have a tangential question: with TTS there are a lot of models to choose from, but I am still struggling to find a single causal online BSS (Blind Source Separation) algorithm or model for binaural audio that favours low computational cost for embedded use.
It's been a long-term hindrance to open-source voice assistants, as key elements at the start of the audio chain are missing.

I do follow the excellent work you guys do at Collabora, and in this area there seem to be no libs, ALSA or other, for BSS, and for me binaural would be the choice.
So I just thought I would highlight that in Linux it seems to be missing, whilst we have quite a choice with the likes of TTS, even if another is great news.

1. Stuart Naylor:
  Sep 21, 2023 at 03:59 PM
  
  PS it doesn't have to be BSS, as there are filters also, and the same goes for binaural, but after some research the binaural algorithm likely could have better SNR with lower computational cost, and looking at what Google did, it seems it does fit low-cost devices.
  The start of the 'Smart Assistant' audio chain is missing in Linux as open source, and it's extremely heavy on science and math, but surely with contacts at bigger entities someone can provide an alternative to what are now very outdated methods such as Speex and RNNoise.
  
  1. Jakub Piotr Cłapa:
    Sep 21, 2023 at 09:20 PM
    
    These are good points, thank you. We decided to start with text-to-speech because we noticed a flurry of research activity on proprietary models which resulted in much better quality and we did not want Open Source solutions to be left behind. But we did discuss voice enhancement and directional hearing through multiple microphones (most of us have a few microphones around these days and yet speech on video calls is often undecipherable) as another potential project that could be interesting, especially for embedded AI applications.
    There are some recent papers on using neural networks to do source separation: https://paperswithcode.com/task/audio-source-separation/latest but it would require some work to filter out those that talk about music track separation and such.
    