Meta AI Accused of Massive Book Piracy for Training Llama
Newly unsealed court documents allege that Meta torrented at least 81.7 terabytes of data from shadow libraries, including Z-Library, LibGen, and the Internet Archive, to train its Llama AI models. The lawsuit, filed by authors Richard Kadrey, Sarah Silverman, and Christopher Golden, accuses Meta of large-scale copyright infringement.
Evidence of Meta’s Alleged Data Theft
Internal Meta communications show employees discussing and carrying out bulk downloads:
- 10TB from LibGen
- 54TB from Z-Library
- 126TB from the Internet Archive
Employees even noted slow download speeds due to limited seeders. A 2022 internal message from a Meta employee reads, “Using pirated material should be beyond our ethical threshold.”
Authors’ Copyright Lawsuit Against Meta
The authors claim Meta’s Llama AI was trained using copyrighted books without permission. The court documents highlight that:
- Meta downloaded datasets from Anna’s Archive, a piracy aggregator.
- The Pile, a dataset used by Meta, allegedly contained 197,000 pirated books.
- Meta may have been a seeder, actively distributing copyrighted materials.
Court Rulings and Meta’s Defense
U.S. District Court Judge Vince Chhabria previously dismissed most claims against Meta but allowed a direct copyright infringement claim to proceed. The judge criticized the plaintiffs’ attorneys for “dragging out litigation.”
Meta, however, argues that its actions fall under fair use and that plaintiffs failed to show direct harm. The tech giant’s legal motion states:
“Meta’s use of publicly available datasets to train its open-source large language models (LLMs) constitutes fair use under US copyright law.”
Publishers Demand More Evidence
Textbook publishers argue that Meta’s conduct meets the threshold for the crime-fraud exception to attorney-client privilege, and they are demanding additional internal documents and legal records.
The Future of AI and Copyright Battles
This lawsuit could set a precedent for AI model training and copyright law. If Meta loses, it may reshape how AI companies source training data and how copyright laws apply to AI-generated content.
Stay tuned as the legal battle unfolds and impacts the AI industry.