Meta CEO Mark Zuckerberg appears to have used YouTube’s battle to remove pirated content to defend his company’s use of a dataset containing copyrighted e-books, according to newly released excerpts from a deposition he gave late last year.
The deposition was filed with the court as part of a complaint by plaintiffs’ lawyers in the AI copyright case Kadrey v. Meta Platforms. It is one of many similar cases in the U.S. court system pitting AI companies against authors and other intellectual property holders. In most of these lawsuits, the AI companies named as defendants claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, I think YouTube might end up hosting content that people pirate for a period of time, but YouTube is trying to remove that content,” Zuckerberg said in the deposition. “And I think most of what’s on YouTube is reasonably good and licensed.”
Excerpts from Zuckerberg’s deposition provide insight into his thinking on copyrighted content and fair use. However, it should be noted that the full transcript of the deposition has not been made public. TechCrunch has reached out to Meta for additional information and will update the article if we hear back from the company.
Based on the excerpts, Zuckerberg appears to be defending Meta’s use of an e-book dataset called LibGen to train its family of AI models known as Llama. Meta’s Llama competes with flagship models from AI companies such as OpenAI.
LibGen describes itself as a “link aggregator” and provides access to works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times for copyright infringement, ordered to shut down, and fined tens of millions of dollars.
According to court filings made public this week, Zuckerberg allegedly authorized the use of LibGen to train at least one of Meta’s Llama models, despite concerns about legal ramifications among the company’s AI executives and research team.
Lawyers for the plaintiffs, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, said that Meta employees called LibGen a “known to be pirated dataset” and warned that its use “could jeopardize (Meta’s) negotiating position with regulators,” the legal filing states.
During his deposition, Zuckerberg claimed he had “actually never heard of” LibGen.
“I know you’re trying to get me to give an opinion on LibGen, but I haven’t heard much about it,” Zuckerberg said during his deposition. “It’s just that I don’t have the knowledge about that specific thing.”
Under questioning from one of the plaintiffs’ lawyers, David Boies, Zuckerberg explained why it would be unreasonable to ban the use of datasets like LibGen.
“So, do we want to have a policy that prohibits the use of YouTube because some content may be copyrighted? No,” he said. “There are cases where such a blanket ban is not appropriate.”
Zuckerberg said Meta should be “very cautious” about training on copyrighted material.
“You know, if there were (someone) providing a website and they were intentionally trying to violate people’s rights… obviously that’s something we would want to be careful about, including how we engage with it, and in some cases we may even prevent our teams from engaging with it,” Zuckerberg said during his deposition, according to the transcript.
New allegations
Plaintiffs’ attorneys in Kadrey v. Meta Platforms have amended their complaint several times since it was filed in 2023 in the U.S. District Court for the Northern District of California, San Francisco Division. The latest amended complaint, filed by plaintiffs’ attorneys late Wednesday, contains new allegations against Meta, including that the company cross-referenced certain pirated books in LibGen with copyrighted books that were available for licensing. Lawyers argue that Meta used this tactic to determine whether it made sense to enter into licensing agreements with publishers.
According to the amended filing, Meta used LibGen to train its latest Llama model family, Llama 3. Plaintiffs also claim that Meta is using the dataset to train its next-generation Llama 4 models.
According to the amended filing, Meta researchers allegedly tried to conceal the fact that Llama models were trained on copyrighted material by inserting “supervised samples” into Llama’s fine-tuning. Meta also downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, the amended complaint states.
Z-Library (or Z-Lib) has been the subject of a number of legal actions brought by publishers, including domain seizures and takedowns. In 2022, Russian nationals who allegedly maintained it were indicted on charges of copyright infringement, wire fraud, and money laundering.