EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.
The dataset, called the Common Pile v0.1, took around two years to complete and was created in collaboration with AI startups Poolside and Hugging Face, among others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new EleutherAI models, Comma v0.1-1T and Comma v0.1-2T.
AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as books and research journals, to build model training datasets. While some AI companies have licensing agreements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability when they train on copyrighted work without permission.
EleutherAI argues that these lawsuits have "dramatically diminished" transparency from AI companies, which the organization says has harmed the broader field of AI research by making it more difficult to understand how models work and what their flaws are.
"(Copyright) lawsuits have not significantly changed data procurement practices in (model) training, but they have significantly reduced the transparency companies engage in," Stella Biderman, executive director at EleutherAI, wrote in a blog post on Hugging Face early Friday. "Some of the companies we've spoken to have cited the lawsuits specifically as the reason they've been unable to publish the research they're doing in highly data-centric areas."
The Common Pile v0.1, which can be downloaded from Hugging Face's AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open source speech-to-text model, to transcribe audio content.
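For context, here's a minimal sketch of what transcribing audio with Whisper looks like, using the open source `openai-whisper` Python package. The "base" checkpoint and the audio file name are illustrative assumptions, not details from EleutherAI's pipeline:

```python
# Minimal sketch: speech-to-text with OpenAI's open source Whisper model.
# Requires: pip install openai-whisper
import whisper

model = whisper.load_model("base")      # load a pretrained checkpoint (illustrative choice)
result = model.transcribe("talk.mp3")   # hypothetical audio file
print(result["text"])                   # the plain-text transcript
```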
EleutherAI argues that Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to let developers build models competitive with proprietary alternatives. According to EleutherAI, both models are 7 billion parameters in size, were trained on only a portion of the Common Pile v0.1, and rival models such as Meta's first Llama AI model.
Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
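As a rough illustration (not from the article), the tiny PyTorch network below shows where parameters live in a model and how they add up; the layer sizes are arbitrary, chosen only for the example:

```python
# Rough illustration: parameters (weights and biases) are the learned
# numbers stored inside a model's layers.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),  # holds a 256x128 weight matrix plus a 256-entry bias
    nn.ReLU(),            # activation function; has no parameters of its own
    nn.Linear(256, 10),
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # 128*256 + 256 + 256*10 + 10 = 35,594
```

A 7-billion-parameter model like Comma v0.1 is the same idea at vastly larger scale.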
"In general, we think the common idea that unlicensed text drives performance is unjustified," Biderman wrote in her post. "As the amount of accessible, openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve."
The Common Pile v0.1 appears to be, in part, an effort by EleutherAI to right its historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train their models.
EleutherAI says it's committed to working with its research and infrastructure partners to release open datasets more frequently going forward.