After the United States, France: Meta, the parent company of Facebook, WhatsApp, and Instagram, is being sued in the Paris District Court for copyright infringement and parasitism. Mark Zuckerberg's group is accused of having blithely mined French literature without authorization to train its generative AI model, Llama. The accusation is being brought by several book trade unions, including the National Publishing Union (SNE), the National Authors and Composers Union (SNAC), and the Society of Writers (SGDL), according to a press release published on Wednesday, March 12.
They are demanding that the "data directories created without authorization and used to train AI" be "completely removed." According to our colleagues at Le Figaro, they are also asking that the authors of the works used to train Llama receive financial compensation. While Meta is the first AI giant to be taken to court in France by these publishers and authors, others could be sued on the same grounds. "The creation of an AI market cannot be at the expense of the cultural sector," said Vincent Montagne, president of the SNE, quoted in the press release.
At the heart of this dispute is once again "Books3," a database of 170,000 pirated books used by many companies in the sector. A year earlier, in the United States, Meta admitted to having used it to train Llama, relying there on a copyright exception ("fair use") that does not exist in France.
What is Books3, the database at the heart of this dispute?
As we explained in this article, Books3 was put online in 2020 by Shawn Presser, a researcher who campaigns for open source. It reportedly contains nearly 196,640 entries in plain .txt format, according to one of his tweets relayed by TorrentFreak.
This database was allegedly used by Meta to train LLaMA (short for Large Language Model Meta AI), an open-source model presented as an alternative to OpenAI's GPT, as the company itself wrote in a research paper. This use is also at the heart of another lawsuit, filed in July 2023 in the United States, which pits American comedian Sarah Silverman and two other authors against Meta and OpenAI.
And what does this database contain? According to The Atlantic, it includes a large number of pirated books (nearly 170,000), most of them published in the last 20 years, as well as other, more surprising data, such as YouTube video subtitles, documents and transcripts from the European Parliament, English Wikipedia, and emails sent and received by employees of Enron Corporation before its collapse in 2001.