Jakub Piotr Clapa
September 13, 2023
TL;DR: Collabora is building the best natural-sounding Open Source speech synthesis solution which is ready for commercial use – based only on properly licensed speech datasets and unrestricted Open Source code.
In the modern digital era, the influence of speech technology is rapidly expanding. Text-to-speech (TTS) models are playing a transformative role, from enriching audiobooks to enhancing podcasts and even improving interactions with chatbots. We're introducing a new player in this field – WhisperSpeech, an Open Source text-to-speech model developed by Collabora.
Text-to-speech technology is no longer limited to screen readers or creating impromptu audiobooks from blog posts for personal use. Its applications now extend across a range of areas beyond traditional uses. While the most apparent application is to produce captivating narrated content or improve podcasts, WhisperSpeech has potential uses in:
Audio Editing: TTS models offer creators the ability to seamlessly modify audio tracks in podcasts and videos. This YouTube video demonstrates the replacement of explicit content with synthesized speech, providing a creative solution for self-censorship and content adaptation. Another application is editing interviews and improving vocal performances without re-recording.
Interactive Voice Response (IVR) Systems: WhisperSpeech's natural-sounding speech is ideal for IVR systems, making automated interactions more personalized and engaging for customers.
Public Announcements: In public spaces or commercial environments, the model's realistic speech could be harnessed for clear and effective announcements.
WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.
Collabora holds ambitious plans for the future of WhisperSpeech. Larger models with even higher speech quality are in the pipeline, and we expect to release them next month. Furthermore, we are building this foundational model using only properly licensed datasets, so it can be used commercially without legal concerns.
An end-to-end generation example, inspired by one famous president’s speech (click on the video to play it):
WhisperSpeech's foundation traces back to the SPEAR TTS paper from Google Research. Unfortunately, Google released neither the code nor the weights. Inspired by the model's impressive quality and straightforward design, we set out to develop a new Open Source TTS model.
At the time, Collabora was working on WhisperLive, a live transcription tool based on OpenAI's Whisper model. The exceptional accuracy and multilingual capabilities of the Whisper models were striking, yet Google's SPEAR TTS solution used custom models for speech transcription and semantic token extraction. This led to the question: could Whisper be employed to tackle these tasks, improving on their approach by using a high-quality supervised model?
WhisperSpeech's innovative architecture takes its inspiration from the Whisper speech recognition model and reverses its operation to move from transcription to text-to-speech synthesis. This unique approach opens doors to a host of possibilities in generating natural speech. The model utilizes existing open-source technologies such as the Encodec audio codec from Meta and the Vocos vocoder from charactr, ensuring efficiency by building upon established solutions.
The insights gained from the SPEAR TTS architecture were pivotal in this advancement. Dividing speech synthesis into two separate phases, reading and speaking, makes the problem more manageable and improves accuracy. Both matter because a single sentence can be spoken in many subtly different ways.
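The two-phase split can be pictured as a simple chain of stages. The sketch below is only an illustration of the data flow, not WhisperSpeech's actual API: the type aliases, the `synthesize` function and the `toy_*` stand-ins are all hypothetical names, and the toy stages are arbitrary arithmetic where the real system would use trained models.

```python
from typing import Callable, List

# Hypothetical type aliases for illustration only; not WhisperSpeech's API.
SemanticTokens = List[int]        # output of the "reading" phase
AcousticTokens = List[List[int]]  # output of the "speaking" phase, one list per frame
Waveform = List[float]            # audio samples produced by a vocoder


def synthesize(
    text: str,
    read: Callable[[str], SemanticTokens],
    speak: Callable[[SemanticTokens], AcousticTokens],
    vocode: Callable[[AcousticTokens], Waveform],
) -> Waveform:
    """Chain the phases: text -> semantic tokens -> acoustic tokens -> audio."""
    semantic = read(text)
    acoustic = speak(semantic)
    return vocode(acoustic)


# Toy stand-ins so the sketch runs; real phases would be trained models.
def toy_read(text: str) -> SemanticTokens:
    return [ord(ch) % 1024 for ch in text]


def toy_speak(semantic: SemanticTokens) -> AcousticTokens:
    return [[tok, (tok * 7) % 1024] for tok in semantic]


def toy_vocode(acoustic: AcousticTokens) -> Waveform:
    return [sum(frame) / 2048.0 for frame in acoustic]


audio = synthesize("hi", toy_read, toy_speak, toy_vocode)
print(len(audio))
```

Because each phase only depends on the token format of the previous one, a stage can be retrained or replaced without touching the others.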
Whisper, the foundation of WhisperSpeech, consists of two main parts: the encoder and the decoder. The encoder produces a continuous stream of embeddings, each containing contextual information and prosody that enrich the audio data. The decoder then transforms these embeddings into words, utilizing cross-attention mechanisms to identify sound fragments composing each word.
WhisperSpeech operates in reverse by taking text as input, generating embeddings, and then transforming them into sound. Whisper's encoder output is quantized to form semantic tokens (500 bps), enriched with phonetic and prosodic attributes. Meta's Encodec model is then used to compress the audio data into acoustic tokens operating at 1.5 kbps. Both token streams are modeled with standard seq2seq transformers: Collabora trained one model for the Text-to-Semantic (T2S) step and another for the Semantic-to-Acoustic (S2A) step. Combined with the open-source Vocos vocoder, these models produce high-quality speech from text input.
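As a rough sanity check on token bitrates, the rate of a discrete token stream is the number of frames per second times the number of codebooks times the bits needed to index one codebook entry. The sketch below uses a hypothetical helper, `token_bitrate`. The Encodec figures (75 frames per second, two 1024-entry codebooks at the lowest bandwidth of the 24 kHz model) match its published configuration. The semantic-token figures (50 tokens per second, a single 1024-entry codebook) are an assumption made for illustration.

```python
import math


def token_bitrate(frames_per_second: float, codebooks: int, codebook_size: int) -> float:
    """Bits per second for a stream of discrete tokens."""
    bits_per_token = math.log2(codebook_size)
    return frames_per_second * codebooks * bits_per_token


# Encodec 24 kHz at its lowest bandwidth: 75 frames/s, 2 codebooks of 1024 entries.
acoustic_bps = token_bitrate(75, 2, 1024)

# Assumed semantic stream: 50 tokens/s drawn from one 1024-entry codebook.
semantic_bps = token_bitrate(50, 1, 1024)

print(acoustic_bps, semantic_bps)  # 1500.0 500.0
```

Under these assumptions the semantic stream carries a third of the bits of the acoustic stream, which fits the idea that semantic tokens capture what is said and how, while acoustic tokens add the detail needed to reconstruct the sound.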
Collabora's WhisperSpeech project holds promise in reshaping communication, content creation, and interaction through text-to-speech technology. WhisperSpeech's unique approach, built upon the successes of Whisper and SPEAR TTS, has the potential to establish new standards in open-source natural speech synthesis.
As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperSpeech's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperSpeech is poised to make an impact in the speech technology landscape. For a more in-depth discussion about WhisperSpeech, check out the latest episode of Democratizing AI: "Open Source Text-To-Speech Projects: WhisperSpeech - In Depth Discussion".