We’ve exhausted data available for AI training, Elon Musk says

It hasn’t been too long since artificial intelligence took over the tech industry. ChatGPT sparked a revolution that has brought huge advances in just a few years. During that time, AI-focused companies have been using publicly available data to train their models. However, some prominent figures in the field, such as Elon Musk, believe that the industry has exhausted all the data available for AI training.

Elon Musk and other experts say the industry has exhausted AI training data

Training complex AI models requires huge amounts of data. Many might think that it would take companies a long time to use all the data available in the world. However, experts claim that the moment is near. Ilya Sutskever, a former OpenAI chief scientist, participated in the machine learning-focused NeurIPS conference in December. During the event, Sutskever stated that the AI industry has already reached the so-called “peak data.”

This means that, in the scientist’s opinion, we have practically reached the peak in terms of using data to train AI. There is very little unused data left, which will force a paradigm shift in the development of AI models. In line with that, during a livestreamed conversation with Stagwell chairman Mark Penn, Elon Musk said that “we’ve now exhausted basically the cumulative sum of human knowledge … in AI training.”

Musk owns xAI, an AI development company closely tied to X (formerly Twitter). Grok, an AI-powered chatbot and image generator built into X, is the company’s most popular product. Musk claims that, based on his experience in the AI field, the industry reached the “peak data” mentioned by Sutskever “basically last year.”

Using synthetic data could be the solution, but with nuances

That said, there is a way to get new data for AI training. For a while now, some big AI companies have been using synthetic data as part of training their own models. Synthetic data is basically data generated by other AI models. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” Musk said. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning,” he added.

Research and consulting firm Gartner estimates that 60% of the data used for AI-based development in 2024 was synthetic. Models trained at least partly on synthetic data include Microsoft’s Phi-4, Google’s Gemma, Anthropic’s Claude 3.5 Sonnet, and even Meta’s Llama.

That said, developers should be careful when using this type of data on a large scale. Over-reliance on synthetic data can amplify bias and make a model less creative, which can degrade the quality of an AI platform’s output. On the other hand, using synthetic data results in huge cost savings.

2025-01-10 15:05:35
