Meta, formerly Facebook, has recently launched a pair of web crawlers, the Meta External Agent and the Meta External Fetcher, that are stirring controversy among website owners and industry experts. These bots are designed to gather data from across the internet to improve Meta’s AI models and other products, but the difficulty of blocking them has raised serious concerns about privacy and consent.
New Bots with Advanced Capabilities
The Meta External Agent, introduced last month, is programmed to harvest publicly available information from a wide range of online sources. This includes news articles, online forums, and other types of public content. The data collected by this bot is used to train AI models, helping Meta refine its products and services.
Alongside this, Meta has deployed the Meta External Fetcher, which focuses on collecting web links to support the company’s AI assistant tools. Together, these bots are integral to Meta’s strategy for advancing its AI technology.
Comparing Meta’s Bots to Industry Peers
Meta’s new bots are reminiscent of those used by other tech giants like OpenAI, whose GPTBot also scrapes the web for AI training data. According to Dark Visitors, a company that tracks web scrapers, Meta’s bots function similarly to OpenAI’s tools. Both are designed to gather extensive online data, crucial for developing effective AI systems.
However, Meta’s new bots are built in ways that make them harder for website owners to block, which has heightened unease among content creators concerned that their data is being harvested without permission.
The Challenge of Blocking Web Scrapers
For decades, website owners have used the `robots.txt` file to restrict automated bots from accessing their content. This protocol has been a standard method for managing web scraping activities. Yet, the increasing demand for high-quality data has led some companies to ignore or bypass these rules.
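To illustrate, opting out of a specific crawler has traditionally meant adding a per-user-agent rule to `robots.txt`. The entries below are a hypothetical example; the exact user-agent tokens Meta’s crawlers respond to are defined in Meta’s own documentation and are not confirmed here.

```
# Hypothetical robots.txt entries opting out of Meta's new crawlers
User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /
```

Compliance with such rules is voluntary on the crawler’s side, which is why the bypass behavior described below matters.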
In recent months, it was reported that OpenAI and Anthropic have found ways to circumvent `robots.txt` restrictions, highlighting the limits of this system. Meta’s new bots also strain the protocol. The Meta External Fetcher, in particular, is reported to be capable of bypassing `robots.txt` rules, complicating efforts by website owners to prevent unwanted data collection.
Moreover, the Meta External Agent combines data collection and content indexing into one bot, making it more difficult for website administrators to block specific functions without impacting others.
Industry Reactions and Concerns
The rollout of Meta’s new bots has sparked a debate about the ethics of large-scale data scraping for AI training. Jon Gillham, CEO of Originality.ai, a firm that identifies AI-generated content, voiced concerns about the limited options available for website owners. He stressed the need for companies like Meta to offer ways for content creators to control how their data is used while still allowing their sites to be visible to users.
Current data shows that only a small fraction of top websites have blocked Meta’s new bots: approximately 1.5% block the Meta External Agent, and less than 1% block the Meta External Fetcher. By contrast, Meta’s older crawler, FacebookBot, is blocked by about 10% of major websites, suggesting that the newer bots are harder to detect and block.
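Surveys like these are generally produced by fetching the `robots.txt` files of popular sites and checking which user-agent tokens they disallow. The Python sketch below shows one way to approximate such a check using only the standard library; the user-agent tokens and site list are illustrative assumptions, not the actual methodology of Dark Visitors or Originality.ai.

```python
# Minimal sketch: check whether a site's robots.txt disallows particular
# crawler user-agent tokens. Standard library only; the tokens below are
# illustrative assumptions, not confirmed identifiers.
from urllib.robotparser import RobotFileParser

CRAWLER_TOKENS = [
    "meta-externalagent",    # assumed token for the Meta External Agent
    "meta-externalfetcher",  # assumed token for the Meta External Fetcher
    "FacebookBot",           # Meta's older crawler
]

def blocked_crawlers(site: str) -> dict[str, bool]:
    """Return, for each token, whether the site's robots.txt forbids fetching its homepage."""
    parser = RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # downloads and parses robots.txt; network errors propagate
    return {token: not parser.can_fetch(token, site) for token in CRAWLER_TOKENS}

if __name__ == "__main__":
    # Example run against a placeholder site; a real survey would iterate
    # over a list of top domains and aggregate the results.
    for site in ["https://example.com"]:
        print(site, blocked_crawlers(site))
```

A real measurement would also need to handle sites that serve no `robots.txt` at all (which most parsers treat as allowing everything) and crawlers that ignore the file entirely.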
Meta’s Response to Criticisms
In response to these concerns, Meta has stated its commitment to providing website owners with more control over their data. A Meta spokesperson assured that the company is working to make it easier for publishers to manage their content in relation to AI training. This includes allowing web administrators to choose which bots to block.
Despite these assurances, the rapid advancement of AI web crawlers continues to raise questions about data privacy and content ownership. As Meta and other tech giants, including Google and Anthropic, advance their AI technologies, there is an urgent need for clearer guidelines and protections for website owners.