ElevenLabs, an AI startup that has just raised a $180 million mega funding round, is known primarily for its audio generation capabilities. The company has now taken a step in another technical direction by launching its first standalone speech-to-text model, called Scribe.
The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through a large library of voices. However, the company is now moving into speech recognition and is trying to compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram and OpenAI's Whisper models.
ElevenLabs’ Scribe model supports over 99 languages at launch. The company places more than 25 of them in its excellent accuracy category, where the model's word error rate is below 5%. This list includes English (97% accuracy rate), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish and Vietnamese. Other languages are placed in the high (5%-10% word error rate), good (10%-20% word error rate) and moderate (25%-50% word error rate) accuracy categories.
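For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition; the function names and the simple lowercase-and-split normalization are assumptions made for illustration, not ElevenLabs' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: lowercase and split on whitespace. Real benchmarks
    # usually apply more careful text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def is_excellent(wer: float) -> bool:
    """True when the rate is below the 5% threshold quoted in the article."""
    return wer < 0.05


if __name__ == "__main__":
    wer = word_error_rate(
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumped over the lazy dog",
    )
    print(f"WER: {wer:.3f}, excellent: {is_excellent(wer)}")
```

In this example one of the nine reference words is wrong, so the rate is one in nine, or roughly 11%, well outside the excellent band. An accuracy figure such as the 97% quoted for English corresponds to a word error rate of about 3% if accuracy is read as one minus the error rate.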
The company said the model outperforms Google's Gemini 2.0 Flash and OpenAI's Whisper Large V3 across multiple languages on the FLEURS and Common Voice benchmark tests.

ElevenLabs had already developed speech-to-text components for its AI conversational agent platform, released last year. However, this is the first time the company has released a standalone speech recognition model. In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving its speech recognition model.
“We want to better understand what is being said in a conversation. We’re working on ways to move beyond content generation alone and into understanding and transcribing speech,” Staniszewski said at that point. “A lot of people say that speech-to-text is a solved problem. But in many languages, it’s pretty bad. I think we can build a better speech recognition model because we have an internal team that annotates the data and provides quick feedback.”
The model also offers smart speaker diarization to identify who is talking, word-level timestamps for accurate subtitles, and automatic tagging of sound events such as audience laughter. The startup also offers a way for customers to transcribe video content directly and add subtitles or captions in its studio.
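To illustrate why word-level timestamps and speaker labels matter for subtitling, here is a minimal Python sketch that groups timed words into SubRip (SRT) cues and starts a new cue whenever the speaker changes. The `Word` structure, its field names and the seven-word cue limit are hypothetical choices made for this example; they do not describe Scribe's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape for one transcribed word; not Scribe's real schema.
    text: str
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    speaker: str   # diarization label, e.g. "speaker_0"


def format_timestamp(seconds: float) -> str:
    """Render seconds in the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into cues, splitting on speaker change or cue length."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        speaker_changed = bool(current) and word.speaker != current[0].speaker
        if current and (speaker_changed or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n[{cue[0].speaker}] {text}")
    return "\n\n".join(blocks) + "\n" if blocks else ""


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "speaker_0"),
        Word("there", 0.5, 0.9, "speaker_0"),
        Word("Hi", 1.2, 1.5, "speaker_1"),
    ]
    print(words_to_srt(sample))
```

Because each cue takes its start time from its first word and its end time from its last, the subtitles stay aligned with the speech without any manual timing.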
Currently, Scribe only works with pre-recorded audio. The company said it will soon release a low-latency, real-time version of the model. This means that it is not yet well suited to real-time uses such as live transcription or voice note-taking.
ElevenLabs prices Scribe at $0.40 per hour of transcribed audio. The rate is competitive, but some of its rivals currently offer lower prices for audio transcription, with differing feature sets.