WhisperSpeech: Exploring new horizons in text-to-speech tech

Jakub Piotr Cłapa
September 13, 2023

TL;DR: Collabora is building the best natural-sounding Open Source speech synthesis solution which is ready for commercial use – based only on properly licensed speech datasets and unrestricted Open Source code.

In the modern digital era, the influence of speech technology is rapidly expanding. Text-to-speech (TTS) models are playing a transformative role, from enriching audiobooks to enhancing podcasts and even improving interactions with chatbots. We're introducing a new player in this field – WhisperSpeech, an Open Source text-to-speech model developed by Collabora.

The Versatility of Text-to-Speech

Text-to-speech technology is no longer limited to screen readers or creating impromptu audiobooks from blog posts for personal use. Its applications now extend across a range of areas beyond traditional uses. While the most apparent application is to produce captivating narrated content or improve podcasts, WhisperSpeech has potential uses in:

Audio Editing: TTS models offer creators the ability to seamlessly modify audio tracks in podcasts and videos. This YouTube video demonstrates the replacement of explicit content with synthesized speech, providing a creative solution for self-censorship and content adaptation. Another application is editing interviews and improving vocal performances without re-recording.

Interactive Voice Response (IVR) Systems: WhisperSpeech's natural-sounding speech is ideal for IVR systems, making automated interactions more personalized and engaging for customers.

Public Announcements: In public spaces or commercial environments, the model's realistic speech could be harnessed for clear and effective announcements.

Introducing WhisperSpeech: A Glimpse into Innovation

WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.

Collabora has ambitious plans for the future of WhisperSpeech. Larger models with even higher speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, we are building this foundational model using only properly licensed datasets, enabling commercial use without legal concerns.

An end-to-end generation example, inspired by one famous president's speech:

The Genesis of WhisperSpeech: Inspiration and Innovation

WhisperSpeech's foundation traces back to the SPEAR TTS paper from Google Research. The model's impressive quality and straightforward design inspired us, but since Google released neither the code nor the weights, we ventured into developing a new Open Source TTS model.

At the time, Collabora was working on WhisperLive, a live transcription tool based on OpenAI's Whisper model. Whisper's exceptional accuracy and multilingual capabilities were striking, yet Google's SPEAR TTS solution used custom models for speech transcription and semantic token extraction. This raised the question: could Whisper be employed to tackle these tasks instead, improving on Google's approach with a high-quality supervised model?

Unveiling the Inner Workings of WhisperSpeech

WhisperSpeech's innovative architecture takes its inspiration from the Whisper speech recognition model and reverses its operation to move from transcription to text-to-speech synthesis. This unique approach opens doors to a host of possibilities in generating natural speech. The model utilizes existing open-source technologies such as the Encodec audio codec from Meta and the Vocos vocoder from charactr, ensuring efficiency by building upon established solutions.

The insights gained from the SPEAR TTS architecture were pivotal in this advancement. Dividing speech synthesis into two separate phases, reading and speaking, makes the process more manageable and more accurate: the same sentence can be spoken in many subtly different ways, and separating what is said from how it is voiced lets each stage focus on a single problem.

Whisper, the foundation of WhisperSpeech, consists of two main parts: the encoder and the decoder. The encoder produces a continuous stream of embeddings, each containing contextual information and prosody that enrich the audio data. The decoder then transforms these embeddings into words, utilizing cross-attention mechanisms to identify sound fragments composing each word.
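WhisperSpeech discretizes this continuous embedding stream into tokens, and the basic mechanics can be illustrated with a toy nearest-centroid quantizer. Everything below is a simplified sketch: the two-dimensional codebook is invented for the example, while the real quantizer is trained on large amounts of Whisper encoder output and operates on much higher-dimensional embeddings.

```python
# Toy vector quantization: map continuous embedding vectors to
# discrete token ids by nearest-centroid assignment.
# The centroids here are made up; a real quantizer learns its
# codebook (e.g. via k-means) from the encoder's outputs.

def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to `vec`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist2(vec, centroids[i]))

def quantize(embeddings, centroids):
    """Turn a sequence of embedding vectors into discrete token ids."""
    return [nearest_centroid(v, centroids) for v in embeddings]

# Three 2-D centroids standing in for a learned codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tokens = quantize([(0.1, 0.1), (0.9, -0.1), (0.2, 0.8)], codebook)
print(tokens)  # -> [0, 1, 2]
```

Once embeddings become small integer ids like these, the synthesis problem turns into ordinary sequence modeling, which is what makes the transformer-based stages described below practical.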

WhisperSpeech operates in reverse: it takes text as input, generates embeddings, and then transforms them into sound. Whisper's encoder output is quantized to form semantic tokens (500kbps) that carry phonetic and prosodic attributes. Meta's Encodec model is then used to compress the audio data into acoustic tokens operating at 1.5kbps. Both kinds of tokens are modeled with standard seq2seq transformers, and Collabora meticulously trained models for the Text-to-Semantic (T2S) and Semantic-to-Acoustic (S2A) stages. Combined with the open-source Vocos vocoder, these models turn text input into high-quality, impressive-sounding speech.
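The full pipeline can be sketched as a composition of three stages. The functions below are toy stand-ins with invented behaviour, not the actual WhisperSpeech API; they only show how text flows through semantic tokens and acoustic tokens to audio.

```python
# Illustrative sketch of the WhisperSpeech stages. Each toy
# function stands in for a trained model in the real pipeline.

def text_to_semantic(text: str) -> list[int]:
    """T2S stage: text -> semantic tokens (in WhisperSpeech these
    come from quantized Whisper encoder embeddings and carry
    phonetic and prosodic information)."""
    # Toy stand-in: one pseudo-token per character.
    return [ord(c) % 512 for c in text]

def semantic_to_acoustic(semantic: list[int]) -> list[int]:
    """S2A stage: semantic tokens -> acoustic tokens (in
    WhisperSpeech, Encodec codes at 1.5 kbps)."""
    # Toy stand-in: two acoustic tokens per semantic token,
    # mimicking the higher temporal rate of acoustic codes.
    return [t for s in semantic for t in (s, (s * 7) % 1024)]

def vocode(acoustic: list[int]) -> bytes:
    """Vocoder stage (Vocos in WhisperSpeech): tokens -> audio."""
    return bytes(t % 256 for t in acoustic)

def tts(text: str) -> bytes:
    """End-to-end synthesis: text -> semantic -> acoustic -> audio."""
    return vocode(semantic_to_acoustic(text_to_semantic(text)))

audio = tts("hello")
print(len(audio))  # -> 10
```

The design point this composition illustrates is that each stage can be trained and upgraded independently, which is how the project can swap in larger T2S or S2A models without touching the rest of the pipeline.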

WhisperSpeech's Future

Collabora's WhisperSpeech project holds promise in reshaping communication, content creation, and interaction through text-to-speech technology. WhisperSpeech's unique approach, built upon the successes of Whisper and SPEAR TTS, has the potential to establish new standards in open-source natural speech synthesis.

As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperSpeech's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperSpeech is poised to make an impact in the speech technology landscape. For a more in-depth discussion about WhisperSpeech, check out the latest episode of Democratizing AI: "Open Source Text-To-Speech Projects: WhisperSpeech - In Depth Discussion".

Comments (3)

  1. Stuart Naylor:
    Sep 14, 2023 at 12:40 PM

    Hi Jakub, I have a tangential question: with TTS there are a lot of models to choose from, but I am still struggling to find a single causal online BSS (Blind Source Separation) algorithm or model for binaural audio that favours low computational cost for embedded use.
    It's been a long-term hindrance to open source voice assistants that key elements at the start of the audio chain are missing.

    I do follow the excellent work you guys do at Collabora, and in this area there seem to be no libs, ALSA or otherwise, for BSS; for me, binaural would be the choice.
    So I just thought I would highlight that in Linux it seems to be missing, whilst we have quite a choice with the likes of TTS, even if another one is great news.


    1. Stuart Naylor:
      Sep 21, 2023 at 03:59 PM

      PS it doesn't have to be BSS, as there are filters also, and the same goes for binaural, but after some research the binaural algorithm likely could have better SNR with lower computational cost, and looking at what Google did it seems it does fit low-cost devices.
      The start of the 'Smart Assistant' audio chain is missing in Linux as open source, and it's extremely heavy on science and maths, but surely with contacts at bigger entities someone could provide something beyond what are now the very outdated methods of Speex and RNNoise.


      1. Jakub Piotr Cłapa:
        Sep 21, 2023 at 09:20 PM

        These are good points, thank you. We decided to start with text-to-speech because we noticed a flurry of research activity on proprietary models which resulted in much better quality and we did not want Open Source solutions to be left behind. But we did discuss voice enhancement and directional hearing through multiple microphones (most of us have a few microphones around these days and yet speech on video calls is often undecipherable) as another potential project that could be interesting, especially for embedded AI applications.
        There are some recent papers on using neural networks to do source separation: https://paperswithcode.com/task/audio-source-separation/latest but it would require some work to filter out those that talk about music track separation and such.

