Reddit, the popular social media platform known for its topic-specific forums, holds a treasure trove of user-generated content that A.I. companies can use to train large language models. But the platform doesn’t take kindly to having its data used without permission. In a lawsuit filed yesterday (June 4), Reddit accused A.I. company Anthropic of scraping its site’s content without authorization. Describing Anthropic as a company that “bills itself as the white knight of the A.I. industry,” Reddit argued in its court filings that the startup is “anything but.”
Reddit’s archives, which span two decades of online discussions, make the site an especially valuable resource for human-generated text. This type of content is increasingly sought after by tech companies as their data pools—necessary for training A.I. models—begin to dwindle.
“Reddit’s vast corpus of public content has enormous utility, including as a potential source of inputs for training emerging large language A.I. technologies, like Anthropic’s Claude offering, and assisting A.I. technologies in generating answers to user queries,” said Reddit in the suit.
Reddit accuses Anthropic of using Reddit users’ personal data to train its Claude models without obtaining consent. Reddit claims this violates user agreements that prohibit the commercial exploitation of its content without prior authorization.
While Anthropic claimed in July 2024 that it had blocked its web crawlers from accessing Reddit, Reddit’s audit logs show that the A.I. company accessed its data more than 100,000 times using automated bots in the months that followed. The lawsuit also referenced a 2021 paper co-authored by Anthropic CEO Dario Amodei, which highlighted Reddit’s subreddits as a valuable source of high-quality training data.
“We disagree with Reddit’s claims and will defend ourselves vigorously,” said an Anthropic spokesperson in a statement.
Reddit has formal licensing agreements with some of Anthropic’s competitors, including OpenAI and Google. Reddit executives have previously said the platform is selective when approaching licensing partners, particularly for large-scale training agreements. The company’s vast collection of authentic, unique conversations on “every topic imaginable” has made it a prized asset in the A.I. era, CEO Steve Huffman said during a quarterly earnings call last year. “The paradox I see is that as more content on the internet is written by machines, there’s an increasing premium on content that comes from real people,” he noted.
On the company’s most recent earnings call last month, Huffman said “authentic content from humans” is Reddit’s primary value proposition.
Co-founded by Huffman and his college roommate Alexis Ohanian in 2005, Reddit has more than 100 million daily active users who use the platform’s subreddits to ask questions, provide tips and share perspectives on various subjects. The company went public last year and currently has a market capitalization of $21.8 billion.