On Tuesday, Amazon debuted Nova Sonic, its new generative AI model that can natively process voice and generate natural-sounding speech. Amazon claims Nova Sonic's performance is competitive with frontier voice models from OpenAI and Google on benchmarks measuring speed, speech recognition, and conversational quality.
Nova Sonic is Amazon's answer to newer AI voice models, such as the model powering ChatGPT's Voice Mode, that feel more natural to speak with than the stilted voices of Amazon Alexa's early years. Recent technical breakthroughs have made those legacy models, and the digital assistants they power, such as Alexa and Apple's Siri, seem rigid by comparison.
Nova Sonic is available through Bedrock, Amazon's developer platform for building enterprise AI applications, via a new bidirectional streaming API. In a press release, Amazon called Nova Sonic the "most cost-effective" AI voice model on the market, roughly 80% cheaper than OpenAI's GPT-4o.
According to Rohit Prasad, Amazon's SVP and head scientist of AGI, components of Nova Sonic already power Amazon's upgraded digital voice assistant, Alexa+.
In an interview with TechCrunch, Prasad said Nova Sonic builds on Amazon's expertise in "large-scale orchestration systems," the technical backbone of Alexa. Compared with rival AI voice models, Nova Sonic is better at routing user requests to different APIs, Prasad said. This helps Nova Sonic "know" when it needs to fetch real-time information from the internet, analyze a proprietary data source, or take actions in an external application, and which tool is the right one to use.
During a back-and-forth conversation, Nova Sonic waits to speak "at the appropriate time," taking into account a speaker's pauses and interruptions, Amazon says. It also generates a text transcript of the user's speech, which developers can use for various applications.
According to Prasad, Nova Sonic is less prone to speech recognition errors than other AI voice models. In other words, the model is relatively good at understanding a user's intent even when they mumble, misspeak, or speak in a noisy environment. On a benchmark measuring speech recognition across languages and dialects, Multilingual LibriSpeech, Amazon says Nova Sonic achieved a word error rate (WER) of just 4.2% when averaged across English, French, Italian, German, and Spanish. That means roughly four out of every 100 words from the model differed from human transcriptions in those languages.
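Word error rate, the metric cited above, is the standard measure of transcription accuracy: the minimum number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the human reference, divided by the number of words in the reference. A minimal sketch of the standard computation (this is the textbook definition, not Amazon's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # delete every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j              # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("cat" -> "hat") across a six-word reference:
print(wer("the cat sat on the mat", "the hat sat on the mat"))  # → 0.1666...
```

A 4.2% WER corresponds to this ratio averaged over a large test set, so a transcript can still contain occasional errors even at a low overall rate.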
On another benchmark measuring noisy, multi-participant interactions, Amazon says Nova Sonic is 46.7% more accurate than OpenAI's GPT-4o transcription model. Nova Sonic's speed is also industry-leading, according to Amazon, with an average response latency of 1.09 seconds. That makes it faster than the GPT-4o model powering OpenAI's Realtime API, which responds in 1.18 seconds, per a benchmark from the firm Artificial Analysis.
According to Prasad, Nova Sonic is part of Amazon's broader strategy for building artificial general intelligence (AGI), which the company defines as "AI systems that can do anything a human can do on a computer." Prasad said Amazon will release more AI models that can understand a variety of modalities, including images, video, and voice, as well as "other sensory data that's relevant to bringing things into the physical world."
Amazon's AGI division, which Prasad oversees, appears to be playing a larger role in the company's recent product strategy. Last week, Amazon began previewing Nova Act, a browser-using AI model that appears to power elements of Alexa+ and Amazon's Buy For Me feature. Starting with Nova Sonic, Prasad says, the company wants to release more of its internal AI models for developers to build with.