MLCommons, a nonprofit AI safety working group, has teamed up with an AI dev platform to release one of the world's largest collections of public domain voice recordings for AI research.
The dataset, called Unsupervised People's Speech, contains at least a million hours of audio spanning 89 languages. MLCommons says it was motivated to create the dataset by a desire to support R&D in various areas of speech technology.
“Supporting broader natural language processing for languages other than English helps bring communication technologies to more people around the world,” MLCommons wrote in a blog post Thursday. “There are several promising avenues for the research community to continue to build and develop, such as improving low-resource language speech models, enhancing speech recognition across diverse accents and dialects, and enabling novel applications in speech synthesis.”
It's certainly an admirable goal. But AI datasets like Unsupervised People's Speech can carry risks for the researchers who choose to use them.
Biased data is one of those risks. Unsupervised People's Speech was sourced from Archive.org, the nonprofit perhaps best known for its Wayback Machine web archival tool. Because many of Archive.org's contributors are English-speaking, and American, most of the recordings in Unsupervised People's Speech are in American-accented English, according to the readme on the official project page.
In other words, AI systems trained on Unsupervised People's Speech without careful filtering, such as speech recognition and voice synthesis models, could exhibit some of the same biases. They might, for example, struggle to transcribe English spoken by non-native speakers, or to generate synthetic voices in languages other than English.
Unsupervised People's Speech may also contain recordings from people who aren't aware that their voices are being used for AI research, including for commercial applications. While MLCommons says that all of the recordings in the dataset are in the public domain or available under Creative Commons licenses, there's the possibility that mistakes were made.
According to an MIT analysis, hundreds of publicly available AI training datasets lack licensing information and contain errors. Creator advocates, including Ed Newton-Rex, the CEO of the AI-ethics-focused nonprofit Fairly Trained, have argued that creators shouldn't have to “opt out” of AI datasets, owing to the onerous burden that opting out imposes on them.
“Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex said in a post on X last June. “For creators who can opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them.”
MLCommons says it's committed to updating, maintaining, and improving the quality of Unsupervised People's Speech. But given the potential flaws, developers would do well to exercise serious caution.