Two co-founders without extensive AI expertise say they have created an openly available AI model that can generate podcast-style clips similar to Google’s NotebookLM.
The market for synthetic speech tools is vast and growing. ElevenLabs is one of the biggest players, but there is no shortage of challengers (see PlayAI, Sesame, and others). Investors see great potential in these tools: according to PitchBook, startups developing voice AI tech raised more than $398 million in VC funding last year.
Toby Kim, one of the co-founders of Nari Labs, the group behind the newly released model, said he and his fellow co-founders began learning about speech AI three months ago. Inspired by NotebookLM, they wanted to build a model that offered more control over generated voices and “freedom of scripts.”
Kim says Nari used Google’s TPU Research Cloud program, which gives researchers free access to the company’s TPU AI chips, to train Nari’s model, Dia. Dia weighs in at 1.6 billion parameters and generates dialogue from a script, letting users customize speakers’ tones and insert disfluencies, coughs, laughs, and other nonverbal cues.
Parameters are the internal variables a model uses to make predictions. Generally speaking, models with more parameters perform better.
Available through the AI dev platform Hugging Face and on GitHub, Dia can run on most modern PCs with at least 10GB of VRAM. It generates a random voice unless prompted with a description of the intended style, but it can also clone a person’s voice.
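For readers curious what that script-driven control looks like in practice, here is a minimal sketch adapted from the quick-start example in Nari’s GitHub repository. The package layout (dia.model), the Dia class, its from_pretrained and generate methods, and the nari-labs/Dia-1.6B checkpoint ID are assumptions based on the repo at the time of writing and may change.

```python
# Illustrative sketch of script-driven dialogue generation with Dia,
# adapted from the quick-start in Nari's GitHub repo. Treat the API
# surface (dia.model.Dia, from_pretrained, generate) and the checkpoint
# ID as assumptions, not a canonical reference.
import soundfile as sf
from dia.model import Dia

# Download the 1.6B-parameter weights from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] tags mark alternating speakers; parenthesized cues like
# (laughs) or (coughs) insert nonverbal sounds into the dialogue.
script = (
    "[S1] Welcome back to the show. Today we're talking about open "
    "speech models. [S2] Thanks for having me. (laughs) It's a fun topic."
)

# Generate a waveform from the script and save it as a WAV file.
audio = model.generate(script)
sf.write("dialogue.wav", audio, 44100)
```

Notably, the speaker tags and nonverbal cues live in the script itself, which is where the “freedom of scripts” Kim describes comes from.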
In TechCrunch’s brief test of Dia via Nari’s web demo, the model worked well, readily generating two-way chats on any subject. The quality of the voices seems competitive with other tools, and the voice cloning feature is among the easiest to use this reporter has tried.
But like many voice generators, Dia offers few safeguards. It would be trivially easy to craft disinformation or scam recordings. On Dia’s project page, Nari discourages abuse of the model to impersonate, deceive, or otherwise engage in illegitimate campaigns, but the group says it is “not responsible” for misuse.
Nari also has not revealed which data it scraped to train Dia, raising the possibility that the model was developed using copyrighted content; a Hacker News commenter points out that one sample sounds like the hosts of NPR’s “Planet Money” podcast. Training models on copyrighted content is a widespread but legally fraught practice. Some AI companies argue that fair use shields them from liability, while rights holders counter that fair use doesn’t apply to training.
In any case, Kim says Nari’s plan is to build a synthetic speech platform with a “social aspect” on top of Dia and larger future models. Nari also intends to release a technical report for Dia and to expand the model’s support to languages beyond English.