A recent report reveals that major tech companies, including Apple, have trained AI models using YouTube videos without obtaining creators’ consent, raising serious copyright issues. The companies reportedly used subtitle files that a third party downloaded from over 170,000 videos. Notable creators affected include Marques Brownlee (MKBHD), MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel.
According to Wired, an investigation by Proof News uncovered that some of the world’s wealthiest AI companies utilized material from thousands of YouTube videos to train their AI models. This action was taken despite YouTube’s strict rules against extracting materials from its platform without permission.
Proof News discovered that subtitle files from 173,536 YouTube videos, sourced from more than 48,000 channels, were used by prominent tech firms like Anthropic, Nvidia, Apple, and Salesforce. The data was reportedly downloaded by EleutherAI, a non-profit organization that supports AI model development.
The Pile Dataset
EleutherAI’s research paper mentions that the subtitle files were included in a dataset called the Pile, which the organization released. This dataset is accessible to anyone with sufficient storage and computing power. While it was intended to assist small developers and academics, it was also utilized by major tech companies, including Apple.
Research papers and posts from companies such as Apple, Nvidia, and Salesforce reveal their use of the Pile to train AI models. Apple specifically used the Pile to train OpenELM, a high-profile model released in April. This was just weeks before Apple announced new AI capabilities for iPhones and MacBooks.
Legal and Ethical Concerns
Apple’s training of AI models on YouTube content without creators’ consent has drawn criticism and raises significant legal and ethical concerns. While Apple and other companies likely used the publicly available dataset in good faith, the case highlights the legal complexities of web scraping for AI training.
One major issue is the use of copyrighted content without permission. YouTube creators like Marques Brownlee earn income from ads on their videos. Using their content without consent or compensation is akin to a copyright violation. Existing copyright laws, rooted in the Berne Convention (last substantially revised in 1971), are ill-suited to modern technology like AI.
Copyright laws traditionally cover derivative works, such as movies based on novels. However, AI training on vast amounts of text presents a more distant and complex connection. There is ongoing debate about whether existing copyright protections should extend to AI training data.
This incident underscores the legal and ethical challenges in using web-scraped content for AI training. The use of datasets compiled by third parties without proper permissions creates a potential minefield for companies, highlighting the need for clearer regulations in the digital age.
Analysis of AI Training with YouTube Content
Recent reports show that tech giants like Apple used YouTube content to train AI models without creators’ consent. This practice has stirred controversy, highlighting critical copyright and ethical issues.
Using subtitle files from over 170,000 YouTube videos raises serious copyright concerns. Creators like Marques Brownlee were affected when their subtitles were used to train AI models without consent. Using their work without permission is like stealing: it takes away their potential earnings and violates their intellectual property rights.
Current copyright laws, rooted in an international framework last substantially revised in 1971, are outdated. They were made for a time before the internet and AI. Today, AI training involves huge amounts of data, making the connection between the original content and the AI’s output unclear. This raises the question of whether current copyright protections should cover AI training data.
There is also an ethical issue. YouTube’s rules forbid extracting content without permission. EleutherAI, a non-profit organization, reportedly downloaded the subtitles, seemingly violating these rules. Although its aim was to help small developers and academics, tech giants ended up using this data.