Meta's Secret AI Training: Using Pirated Data and Hiding It from Regulators

📆 1/15/2025 9:28 AM

📰 verge


Leaked internal documents reveal Meta's aggressive pursuit of AI dominance, including its use of copyrighted data from Library Genesis (LibGen) to train its Llama models. These documents also show Meta's attempts to conceal this practice from regulators and the public.

A major copyright lawsuit against Meta has revealed a trove of internal communications about the company's plans to develop its Llama open-source AI models. These communications, unsealed by a California court, suggest Meta used copyrighted data when training its AI systems and worked to conceal it as it raced to beat rivals like OpenAI and Mistral. Portions of the messages were first revealed last week.

In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta's vice president of generative AI, wrote that the company's goal “needs to be GPT4,” referring to the large language model OpenAI announced in March 2023. Meta had “to learn how to build frontier and win this race,” Al-Dahle added. Those plans apparently involved using the book piracy site Library Genesis (LibGen) to train its AI systems.

An undated email from Meta director of product Sony Theakanath, sent to VP of AI research Joelle Pineau, weighed whether to use LibGen internally only, for benchmarks included in a blog post, or to create a model trained on the site. In the email, Theakanath writes that “GenAI has been approved to use LibGen for Llama3... with a number of agreed upon mitigations” after escalating it to “MZ,” presumably Meta CEO Mark Zuckerberg. As noted in the email, Theakanath believed “Libgen is essential to meet SOTA numbers,” adding “it is known that OpenAI and Mistral are using the library for their models (through word of mouth).” Mistral and OpenAI haven’t stated whether they use LibGen. (The Verge reached out to both for more information.)

The court documents stem from a class action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its AI models in violation of intellectual property laws. Meta, like other AI companies, has argued that using copyrighted material in training data should constitute legal fair use. The Verge reached out to Meta with a request for comment but didn’t immediately hear back.

Some of the “mitigations” for using LibGen included stipulations that Meta must “remove data clearly marked as pirated/stolen,” while avoiding externally citing “the use of any training data” from the site. Theakanath’s email also said the company would need to “red team” its models “for bioweapons and CBRNE” (chemical, biological, radiological, nuclear, and explosives) risks.

The email also went over some of the “policy risks” posed by the use of LibGen, including how regulators might respond to media coverage suggesting Meta used pirated content. “This may undermine our negotiating position with regulators on these issues,” the email said. An April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu also showed Bashlykov admitting he’s “not sure we can use meta’s IPs to load through torrents pirate content.”

Other internal documents show the measures Meta took to obscure the copyright information in LibGen’s training data. A document titled “observations on LibGen-SciMag” shows comments left by employees about how to improve the dataset. One suggestion is to “remove more copyright headers and document identifiers,” including any lines containing “ISBN,” “Copyright,” “All rights reserved,” or the copyright symbol. Other notes mention taking out more metadata “to avoid potential legal complications” as well as considering whether to remove a paper’s list of authors “to reduce liability.”

Last June, The New York Times reported on the frantic race inside Meta after ChatGPT’s debut, revealing the company had hit a wall: it had used up almost every available English book, article, and poem it could find online. Desperate for more data, executives reportedly discussed buying Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission. In the report, some executives justified their approach by pointing to OpenAI’s “market precedent” of using copyrighted works, while others argued Google’s 2015 court victory establishing its right to scan books could provide legal cover.

“The only thing holding us back from being as good as ChatGPT is literally just data volume,” one executive said in a meeting, per The New York Times.

It’s been reported that frontier labs like OpenAI and Anthropic have hit a data wall, which means they don’t have sufficient new data to train their large language models. Many leaders have denied this. OpenAI CEO Sam Altman said plainly: “There is no wall.” OpenAI cofounder Ilya Sutskever, who left the company last May to start a new frontier lab, has been more straightforward about the potential of a data wall. At a premier AI conference last month, Sutskever said, “We’ve achieved peak data and there’ll be no more. We have to deal with the data that we have. There’s only one internet.”
