A new class action lawsuit in San Francisco federal court accuses software giant Salesforce of building XGen AI models based on pirated book libraries and removing references to those sources when questions arose.
The lawsuit, filed Wednesday by authors E. Molly Tanzer and Jennifer Gilmore, is based on copyright law and states that Salesforce “continues to infringe by continuing to store, copy, use, and process datasets containing copies of Plaintiffs’…copyrighted books.”
According to the complaint, Salesforce Inc. “pirated hundreds of thousands of copyrighted books to develop its XGen series of large-scale language models” and relied on the “notorious RedPajama and The Pile datasets,” including a book corpus known as Books3 of more than 196,000 books copied from private tracker Bibliotik.
When Salesforce launched XGen in June 2023, it initially listed “RedPajama-Books” among its training sources, and its engineers linked GitHub users directly to both datasets, according to the filing.
But by September, Salesforce had purportedly removed those references from its website and replaced them with vague descriptions of “natural language data” extracted from “publicly available sources.”
Hugging Face, the platform hosting Books3, removed the dataset the following month, citing copyright infringement claims, according to the complaint.
The complaint alleges that Salesforce used The Pile to train a CodeGen model in 2022 and then commercialized the technology through its Agentforce AI platform, including the XGen-Sales model released in October 2024.
Two months later, Salesforce allegedly revised its disclosures, removing charts and references to “RedPajama-Books” and replacing them with ambiguous language about “a mixture of publicly available data,” before claiming by December 2023 that its models used “legally compliant datasets” without mentioning RedPajama at all.
Ishita Sharma, Managing Partner at Fathom Legal, told Decrypt that authors must “prove actual economic harm, not just that their book was used for training,” noting that Judge Vince Chhabria recently rejected a similar lawsuit against Meta, ruling that “merely claiming that ‘our work was used’ is not enough.”
In recent rulings in favor of OpenAI and Anthropic in similar cases, judges found the authors failed to prove harm to the market, but one judge criticized Anthropic for maintaining a “perpetual library of pirated books.”
“Using a public data set like RedPajama or The Pile does not automatically eliminate willful infringement,” Sharma said, adding, “If you knew or ignored that a copyrighted work was included, a court could still find reckless disregard.”
“Unless the AI is able to reproduce some part of the original work, model weights themselves are not considered copyright infringement,” she added.
The complaint cites a statement Salesforce CEO Marc Benioff made in a January 2024 Bloomberg interview, in which he said that AI companies “appropriated” training data and that “all training data has been stolen.”
The authors are seeking class certification on behalf of all U.S. copyright owners whose works have been used since October 2022, along with statutory damages, destruction of infringing copies, lost profits, a declaration of willful infringement, and payment of attorneys’ fees.


