EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.
The dataset, called the Common Pile v0.1, took around two years to complete and was created in collaboration with AI startups Poolside and Hugging Face, among others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new EleutherAI models, Comma v0.1-1T and Comma v0.1-2T.
AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as books and research journals, to build model training datasets. While some AI companies have licensing agreements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability when they train on copyrighted work without permission.
EleutherAI argues that these lawsuits have "dramatically diminished" transparency from AI companies, which the organization says has harmed the broader field of AI research by making it more difficult to understand how models work and what their flaws are.
"(Copyright) lawsuits have not significantly changed data procurement practices in (model) training, but they have significantly reduced the transparency companies engage in," Stella Biderman, executive director at EleutherAI, wrote in a blog post on Hugging Face early Friday. "Some of the companies we've spoken to have cited the lawsuits specifically as the reason they've been unable to publish the research they're doing in highly data-centric areas."
The Common Pile v0.1, which can be downloaded from Hugging Face's AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open source speech-to-text model, to transcribe audio content.
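For context, here's a minimal sketch of what transcribing audio with Whisper looks like, using the open source `openai-whisper` Python package. The "base" checkpoint and the audio file name are illustrative assumptions, not details from EleutherAI's pipeline:

```python
# Minimal sketch: speech-to-text with OpenAI's open source Whisper model.
# Requires: pip install openai-whisper
import whisper

model = whisper.load_model("base")      # load a pretrained checkpoint (illustrative choice)
result = model.transcribe("talk.mp3")   # hypothetical audio file
print(result["text"])                   # the plain-text transcript
```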
EleutherAI argues that Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to let developers build models competitive with proprietary alternatives. According to EleutherAI, both models are 7 billion parameters in size, were trained on only a portion of the Common Pile v0.1, and rival models such as Meta's first Llama AI model.
Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
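As a rough illustration (not from the article), the tiny PyTorch network below shows where parameters live in a model and how they add up; the layer sizes are arbitrary, chosen only for the example:

```python
# Rough illustration: parameters (weights and biases) are the learned
# numbers stored inside a model's layers.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),  # holds a 256x128 weight matrix plus a 256-entry bias
    nn.ReLU(),            # activation function; has no parameters of its own
    nn.Linear(256, 10),
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # 128*256 + 256 + 256*10 + 10 = 35,594
```

A 7-billion-parameter model like Comma v0.1 is the same idea at vastly larger scale.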
"In general, we think the common idea that unlicensed text drives performance is unjustified," Biderman wrote in her post. "As the amount of accessible, openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve."
The Common Pile v0.1 appears to be, in part, an effort by EleutherAI to right its historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train their models.
EleutherAI says it's committed to working with its research and infrastructure partners to release open datasets more frequently going forward.