WhisperFusion: Ultra-low latency conversations with an AI chatbot

Marcus Edel
January 25, 2024

In this blog post, Vineet Suryan, Jakub Piotr Cłapa, and Marcus Edel share their research and findings on implementing real-time communication with an AI chatbot.

You know that anticipation that sets in when you're expecting a message from a potential interest? Keeping your phone constantly in your peripheral vision, lunging at every buzz? Now chatbots can give you that same excitement!

Of course, the great advantage of bots is that they can reply instantly, without spending any time typing or even thinking. But reflect a little further and you'll notice that the machine, supposedly responding in milliseconds, introduces a clear delay between human speech and the bot's spoken answer. Even when the information is accurate, the delay makes the interaction feel unnatural and can frustrate users.

That is why at Collabora, we looked at every piece of the process and implemented an ultra-low latency pipeline using WhisperLive and WhisperSpeech.

Building towards real-time communication with an AI chatbot

There is both a short and a long answer to how we achieved ultra-low latency communication with an AI chatbot.

The short answer

Simply put, by using WhisperLive and WhisperSpeech:

WhisperLive is a nearly-live implementation of OpenAI's Whisper. The project is a real-time transcription application that uses the OpenAI Whisper model to convert speech input into text output. It can transcribe both live audio input from a microphone and pre-recorded audio files. Unlike traditional speech recognition systems that rely on continuous audio streaming, we use voice activity detection (VAD) to detect the presence of speech and only send the audio data to Whisper when speech is detected. This reduces the amount of data sent to the Whisper model and improves the accuracy of the transcription output. Check out our transcription post and the WhisperLive repository for more details.
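To illustrate the VAD-gating idea, here is a minimal sketch using a crude energy threshold. WhisperLive itself uses a proper VAD model, so the threshold, frame size, and function names below are illustrative assumptions, not the project's actual API:

```python
import numpy as np

def is_speech(frame: np.ndarray, energy_threshold: float = 0.01) -> bool:
    """Crude energy-based VAD: treat a frame as speech if its RMS
    energy exceeds a fixed threshold."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > energy_threshold

def gate_frames(frames, energy_threshold: float = 0.01):
    """Yield only the frames that look like speech, so the downstream
    transcription model never receives silent audio."""
    for frame in frames:
        if is_speech(frame, energy_threshold):
            yield frame

# Example: one loud "speech" frame and one near-silent frame
# (100 ms at 16 kHz each).
rng = np.random.default_rng(0)
speech = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
silence = 0.001 * rng.standard_normal(1600)
kept = list(gate_frames([speech, silence]))
```

Only the speech frame survives the gate; the silent frame is dropped before it ever reaches the transcriber.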

WhisperSpeech is a significant advancement in Open Source text-to-speech technology. Developed by Collabora, the model focuses on delivering natural-sounding speech for improved communication. The aim is to create an adaptable, seamlessly integrable TTS model with multilingual capabilities.

Collabora has ambitious plans for the future of WhisperLive and WhisperSpeech. More extensive models with even higher transcription and speech quality are in the pipeline, promising an enhanced auditory experience next month. Furthermore, we are building this foundational model on properly licensed datasets, which facilitates commercial use without legal concerns. Explore our text-to-speech post and the WhisperSpeech repository for some background.

The longer answer

To break it down, there are two main reasons that chatbot conversations are slow: an algorithmic one and a hardware one.

Algorithmically, the usual pipeline is implemented in a very sequential way: record the audio, wait until the audio is transcribed, send the transcription to the large language model, generate the audio with the text-to-speech model, and send everything to the client. We instead implemented a highly parallel pipeline that doesn't wait for one stage to finish before triggering the next. A high-level overview of our pipeline:

  1. The process starts with human speech, which a WhisperLive endpoint transcribes and streams to the app.
  2. The app forwards the transcription to the LLM to stream the response—without waiting for the full transcription to be returned. This important change ensures that a response is relayed back to the user as soon as they are done speaking.
  3. The LLM streams its output directly to the TTS model; like the LLM, the TTS model is interrupted by the transcription service if new speech arrives.
  4. Once we detect the end of the speech, we directly forward the output of the LLM and the TTS model to the user.
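The streaming hand-offs in the steps above can be sketched with plain Python queues and threads. The toy stages below stand in for the real models, and interruption handling is omitted; every name here is illustrative, not WhisperFusion's actual code:

```python
import queue
import threading

# Each stage streams items onward as soon as they are produced,
# instead of waiting for the previous stage to finish its whole input.
def transcriber(words, out_q):
    for w in words:              # pretend each word arrives live
        out_q.put(w)
    out_q.put(None)              # end-of-speech marker

def llm(in_q, out_q):
    while (w := in_q.get()) is not None:
        out_q.put(w.upper())     # "respond" to each token immediately
    out_q.put(None)

def tts(in_q, results):
    while (w := in_q.get()) is not None:
        results.append(f"audio({w})")

speech = ["hello", "world"]
q1, q2, audio = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=transcriber, args=(speech, q1)),
    threading.Thread(target=llm, args=(q1, q2)),
    threading.Thread(target=tts, args=(q2, audio)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because all three stages run concurrently, the TTS stage can start producing audio for the first word while the transcriber is still receiving the second.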

The hardware-related explanation is straightforward. The core issue lies in the size of transcription, large language, and text-to-speech models, which are typically enormous. Even the smaller models have over 50 million parameters, all of which need to be stored in RAM. The problem with RAM is its relative slowness. To counteract this, CPUs and GPUs are equipped with substantial cache memory situated close to the processor for quicker access. The specifics vary depending on the processor's type and model, but the crux of the matter is that most models run slowly either because they exceed the cache capacity or because they fail to fully utilize the available hardware.
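A quick back-of-the-envelope calculation shows why cache size matters. Assuming a 50-million-parameter model and an illustrative 32 MB of L3 cache (actual cache sizes vary widely by processor):

```python
def model_memory_mb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in megabytes."""
    return n_params * bytes_per_param / 1e6

params = 50_000_000                  # "even the smaller models"
fp32 = model_memory_mb(params, 4)    # 32-bit floats: 200 MB
fp16 = model_memory_mb(params, 2)    # half precision: 100 MB

l3_cache_mb = 32                     # illustrative desktop-CPU figure
```

Even at half precision, the weights alone are several times larger than the cache, so every inference pass is forced to stream most of the model from RAM.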

A straightforward way to speed up inference is to buy better hardware, or to take better advantage of the hardware you have. We opted for the latter and incorporated multiple optimization strategies, including torch.compile, TensorRT, batching, quantization, KV caching, and optimized attention and decoding.
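As a flavor of one of these techniques, here is a toy single-head sketch of the KV-caching idea: instead of recomputing keys and values for the whole prefix at every generation step, we append one new key/value pair to a cache and attend over the stored ones. The "model" here is deliberately trivial (key and value are just the token itself); it is an illustration of the principle, not our implementation:

```python
import numpy as np

def attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(1)
d = 8
tokens = [rng.standard_normal(d) for _ in range(5)]

# With a cache, each step appends one key/value pair instead of
# recomputing them for the entire prefix.
K_cache, V_cache, cached_out = [], [], []
for x in tokens:
    K_cache.append(x)            # toy model: key == value == token
    V_cache.append(x)
    cached_out.append(attention(x, np.stack(K_cache), np.stack(V_cache)))

# Recomputing everything from scratch at the last step gives the
# same answer, but does O(n) redundant work per step.
full = attention(tokens[-1], np.stack(tokens), np.stack(tokens))
```

The cached and from-scratch results match, which is the whole point: caching changes the cost, not the output.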

In this blog post we won't go into the details; we will dive into some of these optimization techniques in a follow-up post.

Live Demo

Want to see this in action? Just head over to the WhisperFusion repository and run the script, or watch the sample video:

Build your own voice AI bot with Collabora

The future of customer interaction lies in the harmonious fusion of sophisticated AI and powerful communication technologies. The era of waiting for delayed responses or dealing with inefficient chatbots is fading thanks to the innovative strides we have shown with WhisperFusion.

You can achieve real-time, efficient, intelligent communication using WhisperLive's and WhisperSpeech's rapid processing capabilities and low-latency communication implementations. This adaptability ensures that your model remains a step ahead as your business expands while meeting customers' needs, a marker of top-notch service.

As we continue our mission and build this model fully in the open, we actively seek partnerships and collaborations, offering support for integration and deployment. WhisperFusion's diverse applications span entertainment, commercial, and educational contexts. With forthcoming enhanced models and ongoing research, the evolution of WhisperFusion is poised to make an impact in the communication technology landscape.

If you have questions or ideas, join us on our Gitter #lounge channel or leave a comment down below.

