AI companies reportedly used YouTube video transcripts for training

Generative Artificial Intelligence (Gen AI) companies scraped YouTube video transcripts to train their engines, claims a new report. Several popular YouTubers such as MrBeast and Marques Brownlee have raised concerns, claiming their content is part of the massive datasets.

Investigation reveals subtitles from more than 170,000 YouTube videos scraped

According to an investigation by Proof News, several big companies used text scraped from YouTube videos to train their AI engines. The observations and claims were co-published with Wired.

The investigation claims Apple, Anthropic, Nvidia, and Salesforce were among several tech companies that used a dataset called “YouTube Subtitles”. The dataset contains subtitles taken from 173,536 YouTube videos.

Overall, videos from more than 48,000 YouTube channels ended up in the datasets these companies used to train their AI engines, the report claims. YouTubers including MrBeast (289Mn subscribers), MKBHD (19Mn subscribers), PewDiePie (111Mn subscribers), and several more have their content in the datasets.

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids “fault” here because they’re not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY

— Marques Brownlee (@MKBHD) July 16, 2024

Apart from YouTubers, videos from news outlets like ABC News, the BBC, and The New York Times are part of the dataset. Simply put, several tech giants plugged YouTube subtitles into their AI engines.

Tool to confirm AI companies used YouTube data posted online

According to The Verge, the YouTube video subtitles dataset is part of a larger collection of material. Technically speaking, the majority of the companies using YouTube data relied on nonprofit EleutherAI’s dataset called The Pile. This is supposed to be an open-source collection that also contains datasets of books, Wikipedia articles, and content available in the public domain.

To back its claim that AI companies are using YouTube content to build their datasets and train their engines, Proof News also released an interactive lookup tool. Any YouTuber, or any member of the public, can use it to check the data.

““It’s theft,” said Dave Wiskus, the CEO of Nebula, a streaming service partially owned by its creators, some of whom have had their work taken from YouTube to train AI.” https://t.co/X34e3LuODW

— Distributed AI Research Institute is on Mastodon (@DAIRInstitute) July 16, 2024

Besides the obvious issue of compensating YouTubers for their content, these companies could also face legal trouble. YouTube states that using its video content, including transcripts, to train AI would violate the platform’s terms.

YouTube has reportedly not responded to the report. However, it is quite likely that its parent company, Google, will take some steps to protect the video-sharing platform and its content creators.

So far, the datasets seem to contain only plain text data. In other words, AI companies could be using only video transcripts or subtitles, and not the videos themselves, to train their engines. Incidentally, the plain text data also contains live translations of the videos in Japanese, German, and Arabic.

Google has previously admitted it used some YouTube videos to train its AI engines. However, the search giant says it has appropriate agreements with YouTubers. EleutherAI, by contrast, may not have any such agreement with the YouTubers whose videos are now part of the datasets used by tech giants to train their AI.

2024-07-17 15:06:06