WhisperFusion: Ultra-low latency conversations with an AI chatbot

Marcus Edel
January 25, 2024

In this blog post, Vineet Suryan, Jakub Piotr Cłapa, and Marcus Edel share their research and findings on implementing real-time communication with an AI chatbot.

You know that anticipation that sets in when you’re expecting a message from a potential interest? Keeping your phone constantly in your peripheral vision, lunging at every buzz? Now chatbots can give you that same excitement!

The great advantage of bots is supposed to be that they can reply instantly, without spending any time typing or even thinking. But if you reflect further, you’ll realize that the machine, supposedly responding in milliseconds, leaves a clear delay between human speech and the bot’s spoken answer. Even when the information is accurate, the delay makes the interaction feel unnatural and can frustrate users.

That is why at Collabora, we looked at every piece of the process and implemented an ultra-low latency pipeline using WhisperLive and WhisperSpeech.

Building towards real-time communication with an AI chatbot

There is both a long and a short answer on how we achieved ultra-low latency communications with an AI chatbot.

The short answer

Simply put, by using WhisperLive and WhisperSpeech:

WhisperLive is a nearly-live implementation of OpenAI's Whisper. The project is a real-time transcription application that uses the OpenAI Whisper model to convert speech input into text output. It can transcribe both live audio input from a microphone and pre-recorded audio files. Unlike traditional speech recognition systems that rely on continuous audio streaming, we use voice activity detection (VAD) to detect the presence of speech and only send audio data to Whisper when speech is detected. This reduces the amount of data sent to the Whisper model and improves the accuracy of the transcription output. Check out our transcription post and the WhisperLive repository for more details.
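
The VAD gating idea can be illustrated with a minimal sketch. It uses a plain energy threshold as a stand-in for a real VAD model, so it is not how WhisperLive detects speech; the names frame_energy and speech_segments and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable, Iterator, List

Frame = List[float]  # one short chunk of audio samples in [-1.0, 1.0]


def frame_energy(frame: Frame) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(
    frames: Iterable[Frame], threshold: float = 0.1
) -> Iterator[List[Frame]]:
    """Group consecutive frames whose energy meets the threshold.

    Only these groups would be forwarded to the transcription model;
    silent frames are dropped. The threshold is an assumed value.
    """
    current: List[Frame] = []
    for frame in frames:
        if frame_energy(frame) >= threshold:
            current.append(frame)
        elif current:
            # Silence after speech closes the current segment.
            yield current
            current = []
    if current:
        yield current


# Synthetic input: silence, two speech frames, silence, one speech frame.
silence = [0.0] * 160
speech = [0.5, -0.5] * 80
stream = [silence, speech, speech, silence, silence, speech]

segments = list(speech_segments(stream))
print([len(seg) for seg in segments])  # frames per detected segment
```

In this toy stream, three of the six frames are forwarded, which is the data reduction the paragraph above describes.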

WhisperSpeech stands as a significant advancement in the realm of Open Source text-to-speech technology. Developed by Collabora, the model's focus is on delivering natural-sounding speech for improved communication. The aim is to create an adaptable and seamlessly integrated TTS model with multilingual capabilities.

Collabora holds ambitious plans for the future of WhisperLive and WhisperSpeech. More extensive models with even higher transcription and speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, we are building this foundational model using properly licensed datasets, which facilitates commercial use without legal concerns. Explore our text-to-speech post and the WhisperSpeech repository for some background.

The longer answer

Broken down, there are two main reasons why chatbot conversations are slow: an algorithmic one and a hardware one.

Algorithmically, the usual pipeline is strictly sequential: record the audio, wait until the audio is transcribed, send the transcription to the large language model, generate the audio using the text-to-speech model, and send everything to the client. In contrast, we implemented a highly parallel pipeline that doesn't wait for one process to finish before it triggers the next. At a high level, our pipeline works as follows:

The process starts with human speech, which WhisperLive transcribes; the app receives the transcription from the WhisperLive endpoint.
The app forwards the transcription to the LLM to stream the response, without waiting for the full transcription to be returned. This important change ensures that a response is relayed back to the user as soon as they are done speaking.
The LLM forwards its output directly to the TTS model. Like the LLM, the TTS model is interrupted by the transcription service if we receive new speech.
Once we detect the end of the speech, we directly forward the output of the LLM and the TTS model to the user.
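
The steps above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the WhisperFusion implementation: fake_llm and fake_tts are hypothetical stand-ins that only sleep and return strings, and the list of partial transcripts simulates what a streaming transcription service might deliver.

```python
import asyncio
from typing import List, Optional


async def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it just waits and echoes."""
    await asyncio.sleep(0.05)
    return f"reply to '{prompt}'"


async def fake_tts(text: str) -> str:
    """Hypothetical stand-in for a text-to-speech call."""
    await asyncio.sleep(0.05)
    return f"audio<{text}>"


async def respond(prompt: str) -> str:
    """Run the LLM and then TTS for one transcript snapshot."""
    return await fake_tts(await fake_llm(prompt))


async def pipeline(partials: List[str], gap: float) -> str:
    """Start responding on every partial transcript.

    Newer speech cancels the work started for older, shorter transcripts,
    so LLM and TTS are already running when the speaker stops.
    """
    if not partials:
        raise ValueError("need at least one partial transcript")
    task: Optional[asyncio.Task] = None
    for text in partials:
        if task is not None and not task.done():
            task.cancel()  # new speech arrived: interrupt LLM/TTS
        task = asyncio.create_task(respond(text))
        await asyncio.sleep(gap)  # simulated wait for the next partial
    # End of speech: the response for the final transcript is in flight.
    return await task


result = asyncio.run(
    pipeline(["hello", "hello how", "hello how are you"], gap=0.01)
)
print(result)
```

The design choice shown here is speculative execution: work on stale transcripts is thrown away, which costs some compute but removes the wait between the end of speech and the start of response generation.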

The hardware-related explanation is straightforward. The core issue lies in the size of transcription models, large language models, and text-to-speech models, which are typically enormous. Even the smaller models have over 50 million parameters, all of which need to be stored in RAM. The problem with RAM is that it is slow relative to the processor. To counteract this, CPUs and GPUs are equipped with cache memory situated close to the processor for quicker access. The specifics vary depending on the processor's type and model, but the crux of the matter is that most models run slowly either because they exceed the cache capacity or because they fail to fully utilize the available hardware.
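
As a back-of-the-envelope illustration of the size argument, the sketch below computes the weight footprint of a 50-million-parameter model at a few common precisions and compares it with a cache size. The 32 MiB cache figure is an assumed, illustrative value, not a measurement of any specific processor.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Assumed cache size for illustration only; real sizes vary by processor.
ASSUMED_CACHE_MIB = 32.0


def weight_footprint_mib(num_params: int, dtype: str) -> float:
    """Memory needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


for dtype in BYTES_PER_PARAM:
    size = weight_footprint_mib(50_000_000, dtype)
    fits = size <= ASSUMED_CACHE_MIB
    print(f"{dtype}: {size:.1f} MiB, fits in assumed cache: {fits}")
```

Under these assumptions, even the 8-bit version of a small model is larger than the cache, so weights have to be fetched from slower memory during inference.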

A straightforward way to speed up inference is to buy better hardware or to take better advantage of the hardware you have. We opted for the second and incorporated multiple optimization strategies, including torch.compile, TensorRT, batching, quantization, KV caching, and attention and decoding optimizations.

In this blog post we will not go into the details, but we will dive into some of these optimization techniques in a related follow-up post.
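
Until then, here is a generic toy sketch of one of the listed techniques, KV caching, for readers who have not met it before. It is not taken from WhisperFusion: the project function is a hypothetical stand-in for learned projections, and the example only counts how many key/value projections each decoding strategy performs.

```python
import math
from typing import List, Tuple

Vector = List[float]


def project(x: Vector, scale: float) -> Vector:
    """Hypothetical stand-in for a learned linear projection."""
    return [scale * v for v in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query."""
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        for key in keys
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


def decode_without_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(prefix[-1], 1.0), keys, values))
    return outputs, projections


def decode_with_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    """Project each token once and reuse cached keys and values."""
    outputs, projections = [], 0
    keys: List[Vector] = []
    values: List[Vector] = []
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(project(x, 1.0), keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count, slow_out == fast_out)
```

Both strategies produce the same outputs, but the cached version performs a number of key/value projections that grows linearly with the sequence length rather than quadratically.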

Live Demo

Want to see this in action? Just head over to the WhisperFusion repository and run the script, or watch the sample video:

Build your own voice AI bot with Collabora

The future of customer interaction lies in the harmonious fusion of sophisticated AI and powerful communication technologies. The era of waiting for delayed responses or dealing with inefficient chatbots is fading thanks to the innovative strides we have shown with WhisperFusion.

You can achieve real-time, efficient, intelligent communication by using the rapid processing capabilities and low-latency communication implementations of WhisperLive and WhisperSpeech. This adaptability ensures that your model remains a step ahead as your business expands while meeting customers' needs, a hallmark of top-notch service.

As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperFusion's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperFusion is poised to make an impact in the communication technology landscape.

If you have questions or ideas, join us on our Gitter #lounge channel or leave a comment down below.
