
EleutherAI releases a large AI training dataset of licensed and open-domain text

By support, June 6, 2025


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called the Common Pile v0.1, took about two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as books and research journals, to build model training datasets. While some AI companies have licensing agreements with certain content providers, most argue that the U.S. legal doctrine of fair use shields them from liability when they train on copyrighted work without permission.

EleutherAI argues that these lawsuits have sharply reduced transparency from AI companies. The organization says this has harmed the broader field of AI research by making it more difficult to understand how models work and what their flaws might be.

“The (copyright) lawsuits have not significantly changed data procurement practices in (model) training, but they have significantly reduced the transparency companies engage in,” Stella Biderman, executive director at EleutherAI, wrote in a blog post on Hugging Face early on Friday. “Some of the companies we’ve spoken to have cited the lawsuits specifically as a reason why they were unable to publish research they’re doing in highly data-centric fields.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts. It draws on sources that include 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI argues that Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough for developers to build models that compete with proprietary alternatives. According to EleutherAI, both models are 7 billion parameters in size, were trained on only a fraction of the Common Pile v0.1, and rival models such as Meta’s first Llama AI model.

Parameters, sometimes called weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we consider the common idea that unlicensed text drives performance to be unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data increases, we expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be, in part, an effort by EleutherAI to correct its historical mistakes. A few years ago, the organization released The Pile, an open collection of training text that contains copyrighted material. AI companies have come under fire, and under legal pressure, for using The Pile to train their models.

EleutherAI is committing to release open datasets more frequently going forward, in collaboration with its research and infrastructure partners.


