Mistral launches Voxtral: open-source speech understanding models with top-tier accuracy and deep audio comprehension.
Mistral just dropped Voxtral, two new speech AI models designed for both cloud and edge use. There’s a 24B parameter version for production workloads and a 3B “Mini” variant for local devices. Both are Apache 2.0 licensed and available via API and Hugging Face.
This sidesteps the usual trade-off between open-source speech models and costly proprietary APIs. Mistral claims state-of-the-art transcription and semantic understanding at less than half the price of comparable APIs.
Key features:
- Handles up to 30 minutes of speech for transcription, 40 minutes for understanding
- Built-in Q&A and summarization directly from audio—no need to chain separate models
- Automatic language detection with support for English, Spanish, French, Hindi, German, and more
- Direct voice-triggered function calls for backend workflows without extra parsing
- Strong text comprehension thanks to its Mistral Small 3.1 backbone
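To make the length limits concrete, here is a minimal Python sketch that splits a long recording into windows that fit the 30-minute transcription cap. The function name, the constant, and the fixed-window strategy are illustrative assumptions, not part of Mistral's API. A real pipeline would likely cut at silences rather than at fixed offsets.

```python
from typing import List, Tuple

# Assumption: the 30-minute transcription cap from the feature list, in seconds.
TRANSCRIPTION_LIMIT_S = 30 * 60


def plan_chunks(
    duration_s: float, limit_s: float = TRANSCRIPTION_LIMIT_S
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording,
    with each window at most limit_s long."""
    if duration_s < 0 or limit_s <= 0:
        raise ValueError("duration_s must be >= 0 and limit_s must be > 0")
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + limit_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    # Hypothetical 75-minute recording.
    for start, end in plan_chunks(75 * 60):
        print(f"{start / 60:.0f} min to {end / 60:.0f} min")
```

Passing `limit_s=40 * 60` would plan windows for the 40-minute understanding limit instead.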
Mistral's benchmarks show Voxtral outperforming OpenAI Whisper large-v3, GPT-4o mini Transcribe, and Gemini 2.5 Flash, and rivaling ElevenLabs Scribe in transcription accuracy. It posts state-of-the-art results on European and multilingual datasets.
In audio understanding tests, Voxtral is competitive with GPT-4o mini and Gemini 2.5 Flash, and it comes out on top in speech translation tasks.
Mistral offers enterprise options for private deployment, domain-specific fine-tuning, longer contexts, and speaker/emotion detection.
Both the 24B model and Voxtral Mini 3B can be downloaded from Hugging Face. API pricing starts at $0.001 per minute.
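For a rough sense of what that rate means at volume, the sketch below multiplies audio minutes by the quoted starting price. The function name and the flat-rate model are assumptions for illustration. Since the price "starts at" $0.001 per minute, treat the result as a lower bound rather than a quote.

```python
# Assumption: the entry-level rate quoted in the article; actual rates may
# differ by model or plan.
PRICE_PER_MINUTE_USD = 0.001


def estimate_cost_usd(
    audio_minutes: float, price_per_minute: float = PRICE_PER_MINUTE_USD
) -> float:
    """Estimate API cost in USD for a given number of audio minutes,
    assuming a flat per-minute rate."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be >= 0")
    # Round to micro-dollars to hide floating-point noise.
    return round(audio_minutes * price_per_minute, 6)


if __name__ == "__main__":
    hours = 10_000  # hypothetical archive size
    print(f"{hours} hours: ${estimate_cost_usd(hours * 60):.2f}")
```

At that rate, 10,000 hours of audio (600,000 minutes) works out to about $600.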
Mistral will demo Voxtral with Inworld’s speech-to-speech tech on August 6th.
Sample audio and full benchmarks are live on Mistral’s site.
Try Voxtral in Le Chat’s voice mode soon: record, transcribe, ask questions, and summarize audio on web or mobile.
Mistral promises more audio features in the coming months.