Apple denies using YouTube subtitles to train “Apple Intelligence”

Apple has denied allegations that it scraped YouTube subtitles to train “Apple Intelligence”. However, the iPhone maker hasn’t categorically stated that YouTube transcripts played no part in training its generative artificial intelligence (Gen AI).

Apple relied on OpenELM data, not EleutherAI, to train its AI

According to an investigation by Proof News, several big companies used transcripts of YouTube videos to train their AI engines. The observations and claims were co-published with Wired.

The investigation claimed Apple, Anthropic, Nvidia, and Salesforce were among several tech companies that used YouTube subtitles or video transcripts in multiple languages. Specifically, the report claimed these companies relied on The Pile, a large dataset from the nonprofit EleutherAI, which in turn includes YouTube subtitles.

According to the report, 173,536 YouTube videos from more than 48,000 YouTube channels were part of the dataset. Apple has now clarified how it utilized content from OpenELM to train its AI.

Apple Intelligence doesn’t have YouTube subtitles as training material?

Notably, Apple hasn’t specifically denied that Apple Intelligence was trained on YouTube subtitle data. Instead, the company has reportedly said that it respects the rights of creators and publishers. The company also mentioned that it lets websites opt out of having their data used to train Apple Intelligence.

It appears Apple is suggesting that it relied on OpenELM, not EleutherAI’s dataset, to build Apple Intelligence. However, in a research paper on OpenELM (PDF), researchers admitted that they trained it on Pile data.

Apple says its OpenELM model doesn’t power Apple Intelligence amid YouTube controversy #ReceptiveLanguage #Vocabulary #Rhyming #Singing #Speaking [Video]https://t.co/NixVnMzOSy

— Marta Fernandez (@MartaFGNN) July 18, 2024

Apple stressed that it trains its AI models “using high-quality data that includes licensed data from publishers, stock images, and some publicly available data from the web.” However, the company claimed that OpenELM is intended for research purposes only.

Apple has further stated that OpenELM is not used to power AI features in any Apple devices. Moreover, the company implied that it doesn’t intend to build future versions of the model.

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids “fault” here because they’re not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY

— Marques Brownlee (@MKBHD) July 16, 2024

YouTube video subtitles are not intended to be a public resource, even though they are publicly accessible. YouTube has stated that using the platform’s video content, including transcripts, to train AI would violate the platform’s terms.

Some reports suggest Apple could be trying to shield itself from legal troubles by relying on third-party datasets to train its AI engine. However, unless YouTube or its parent company thoroughly analyzes the datasets, it would be difficult to draw a decisive conclusion.

2024-07-19 15:05:17