OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases.
For OpenAI, the models fit into its broader "agentic" vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of "agent" may be contested, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can talk to a business's customers.
"We're going to see more and more agents pop up in the coming months," Godement told TechCrunch during a briefing. "And the general theme is helping customers and developers leverage agents that are useful, available, and accurate."
OpenAI claims its new text-to-speech model, "gpt-4o-mini-tts," not only delivers more nuanced and realistic-sounding speech but is also more "steerable" than its previous-generation speech synthesis models. Developers can instruct gpt-4o-mini-tts in natural language on how to say things, for example, "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher."
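In practice, that steering happens through a plain-text instruction passed alongside the text to be spoken. The sketch below shows what a call might look like with the OpenAI Python SDK; the voice name, instruction text, and output path are illustrative assumptions, not values from OpenAI's announcement.

```python
# Minimal sketch: steering gpt-4o-mini-tts with a natural-language instruction.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in
# the environment; voice and file names here are illustrative.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the API's built-in voices
    input="Thanks for calling. I'm sorry about the mix-up with your order.",
    instructions="Use a warm, apologetic customer-support tone.",
) as response:
    response.stream_to_file("apology.mp3")  # write the generated audio to disk
```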
Here's a sample of a weathered, "true crime-styled" voice:
And here is a sample of a woman’s “professional” voice:
Jeff Harris, a member of the product staff at OpenAI, told TechCrunch that the goal is to let developers tailor both the voice "experience" and "context."
"In different contexts, you don't just want a flat, monotonous voice," Harris said. "If you're in a customer support experience and you want the voice to be apologetic because it's made a mistake, you can actually have the voice carry that emotion … Our big belief here is that developers and users want to control not just what is spoken, but how things are spoken."
As for OpenAI's new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," they effectively replace the company's long-standing Whisper transcription model. Trained on "diverse, high-quality audio datasets," the new models can better capture accented and varied speech, the company claims, even in chaotic environments.
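For developers, the new models slot into the API's existing transcriptions endpoint, so switching from Whisper is largely a matter of changing the model name. A minimal sketch, again assuming the OpenAI Python SDK and an illustrative file name:

```python
# Minimal sketch: transcribing an audio file with gpt-4o-transcribe via the
# audio transcriptions endpoint. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the file name is illustrative.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)  # the transcribed text
```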
The new models are also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words, and even whole passages, in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.
However, mileage may vary depending on the language being transcribed.
According to OpenAI's internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a word error rate approaching 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. In other words, three out of every 10 words from the model will differ from a human transcription in those languages.
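For readers unfamiliar with the metric: word error rate is the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into a human reference transcript, divided by the length of the reference. The sketch below is a generic illustration of that calculation, not OpenAI's benchmark code.

```python
# Illustrative word-error-rate (WER) calculation: edit distance over words
# between a reference transcript and a model's hypothesis, divided by the
# reference length. Not OpenAI's benchmark code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions out of six reference words gives a WER of about 0.33,
# i.e., roughly three in ten words diverge from the human transcript.
print(wer("the cat sat on the mat", "the cat sat in a mat"))
```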

And in a break from tradition, OpenAI doesn't plan to make its new transcription models openly available. The company has historically released new versions of Whisper for commercial use under an MIT license.
Harris said gpt-4o-transcribe and gpt-4o-mini-transcribe are "much bigger than Whisper" and thus not good candidates for an open release.
Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and to update the benchmark results chart with a more recent version.