Reddit has filed a lawsuit in a New York federal court accusing Perplexity AI and three data-scraping firms of unlawfully harvesting its user content to train Perplexity’s AI-based “answer engine”. The complaint names Lithuania-based Oxylabs UAB, Texas-based SerpApi and the web domain AWMProxy (described by Reddit as a former Russian botnet) as accomplices in the scheme. Reddit alleges that these firms circumvented its data protection systems, bypassed technical safeguards and funnelled valuable Reddit-hosted user comments into Perplexity’s model without permission.
In its complaint, Reddit says it sent a cease-and-desist letter to Perplexity in May 2024 after detecting unauthorized harvesting. Rather than halt the practice, Reddit claims Perplexity increased its citations to Reddit content forty-fold after that notice, pointing to an escalation in scraping activity. Reddit emphasises that unlike companies such as Google and OpenAI which have licensed Reddit content Perplexity did not secure a deal and instead used unlicensed data, creating what Reddit’s chief legal officer described as an “industrial-scale ‘data laundering’ economy”.
According to the lawsuit, Reddit’s anti-scraping protections were intentionally bypassed. The complaint asserts that the data-scraping firms accessed billions of search-engine result pages containing Reddit content, disguised their activity, masked user-agent identifiers, evaded rate-limiting and deliberately harvested content that Reddit did not intend for training commercial AI systems. Reddit describes the strategy as akin to “would-be bank robbers” who, unable to access the vault directly, target the armored truck instead.
Perplexity is accused of acquiring scraped Reddit data from at least one of the third-party scrapers and feeding it into its AI training pipeline. Reddit says that this conduct generated value for Perplexity boosting its retrieval and answer engine capabilities without compensating Reddit or its community of creators. The firm is seeking both monetary damages and a court injunction to block future use of its content by the defendants.
Why This Lawsuit Matters
This legal move highlights escalating tensions between content platforms and AI-developers over the sourcing and licensing of human-generated data. Reddit’s business model includes licensing its content to major AI and tech companies, a revenue stream it argues is being undermined by entities that bypass licensing and extract content without permission. In suing Perplexity and its data-scraping network, Reddit is signalling that it is shifting from issuing takedown notices to pursuing full legal redress and trying to define new norms around large-scale data usage for AI training.
For Perplexity and other AI-builders, the case underscores the growing legal risk of relying on publicly-available web content without explicit licensing agreements. If the court sides with Reddit’s broad claims around “unauthorised scraping” and the circumvention of protections, it could precipitate stricter industry standards for how training data is sourced, verified and compensated.
The implications of the lawsuit extend beyond the two companies involved. First, it adds to a growing wave of litigation in which publishers, content-platforms and creator communities are asserting rights over data used in AI systems. Reddit itself sued another AI company, Anthropic, earlier this year under similar claims. The new case against Perplexity broadens the scope to include both AI companies and the lesser-known infrastructure firms that supply scraped data.
Second, the case raises questions about how AI companies should treat “publicly available” content. Just because content is publicly posted does not automatically mean it is free for commercial training without restriction, Reddit argues. For its part, Reddit contends that while its content is publicly visible, it maintains licence relationships for training usage and that the protection systems it employs should not be so easily circumvented.
Third, the lawsuit could shift the economics of the creator-economy and user-content platforms. If platforms are able to extract greater compensation from AI firms or force agreements before their data is used, it may change how value flows and how content platforms monetise their user contributions. For AI companies, this may increase the cost and complexity of sourcing large-scale training data.
Despite its significance, the case faces uncertainties. One key challenge is proving causation: Reddit must show not just that scraping occurred but that Perplexity’s commercial product was trained using the scraped Reddit content and benefitted from it. Establishing the link between unauthorized data ingestion and downstream AI performance is inherently complex.
Another question is legal precedent. Courts are still grappling with how copyright, database rights and technical protections apply in the context of large-scale AI training. The outcome of this case may set a precedent for how platforms and training-data providers negotiate access rights or litigate unauthorized usage.
From Perplexity’s perspective, the company denies wrongdoing, stating that it provides “factual answers with accurate AI” and will “vigorously defend” its practices. The outcome will therefore also hinge on how the court interprets scraping, licence obligations and the role of indirect data-acquisition via third-party crawlers.
In filing its lawsuit against Perplexity and three data-scraping firms, Reddit has signalled a new phase in the battle over who controls online content and how it is used in AI systems. The case brings to the forefront important issues around data licensing, automated scraping, and the rights of platforms that host user-generated content. For AI companies, it reinforces that public content does not equate to “free training data”, and that legal risks around large-scale ingestion are real and intensifying.
Ultimately, the resolution of this lawsuit may influence how content platforms, AI developers and data-scrapers structure their relationships going forward and possibly shape the emerging legal frameworks for AI-training data usage.




