French AI company Mistral has released a new open-source text-to-speech (TTS) model aimed at voice AI assistants and enterprise uses such as customer support. Businesses can use the model to build voice agents for sales and customer engagement, which puts Mistral in direct competition with ElevenLabs, Deepgram, and OpenAI.
The model, named Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
“Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance,” stated Pierre Stock, VP of Science Operations at Mistral AI, in an interview with TechCrunch.
Mistral says Voxtral TTS can adapt a custom voice from an audio sample of less than five seconds, capturing characteristics such as subtle accents, inflections, intonations, and natural irregularities in speech flow. The model is built on the Ministral 3B architecture and can switch languages while preserving the distinct qualities of the voice, which suits applications such as dubbing or real-time translation. Stock said the company wanted the model to sound human rather than robotic.
According to the company, the model is built for real-time use. Its time-to-first-audio (TTFA), the latency between receiving input and starting audio output, is 90 milliseconds for a 10-second sample of 500 characters. Voxtral TTS also generates audio about six times faster than real time, which the company describes as a real-time factor (RTF) of 6x, so a 10-second audio clip renders in roughly 1.6 seconds.
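The arithmetic behind those figures can be checked with a short sketch. It assumes the "6x" figure is a speed-up over real time, as the article uses it; the function names are illustrative and are not part of any Mistral API. Note that speech-processing literature often defines RTF the other way round, as processing time divided by audio duration, in which case a 6x speed-up corresponds to an RTF of about 0.17.

```python
def render_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time to synthesize a clip, given a speed-up over real time.

    Assumption: `speedup` is audio duration divided by processing time
    (the "6x" in the article), not the inverse ratio.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def implied_speedup(audio_seconds: float, render_seconds: float) -> float:
    """Speed-up over real time implied by a measured render time."""
    if render_seconds <= 0:
        raise ValueError("render_seconds must be positive")
    return audio_seconds / render_seconds


# A 10-second clip at exactly 6x real time.
print(f"{render_time_seconds(10.0, 6.0):.2f} s")  # prints 1.67 s

# The quoted 1.6-second render time implies a speed-up slightly above 6x.
print(f"{implied_speedup(10.0, 1.6):.2f}x")  # prints 6.25x
```

At exactly 6x, a 10-second clip takes about 1.67 seconds to render; the quoted 1.6 seconds corresponds to a speed-up of about 6.25x, which is consistent with both numbers being rounded.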
The release follows Mistral's introduction earlier this year of two transcription models: one for large-scale batch processing and another for low-latency, real-time use cases. With Voxtral TTS added, Mistral appears to be building a full suite of voice AI products for enterprise clients.
“We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output,” Stock elaborated on Mistral's future vision.
Mistral is betting that its open-source approach and customization options will lead enterprises to choose its voice models over competitors', since they can fine-tune the technology to their own requirements.
The Editorial Staff at AIChief is a team of professional content writers with extensive experience in AI and marketing. Founded in 2025, AIChief has quickly grown into the largest free AI resource hub in the industry.