Elon Musk says he agrees with AI experts that there’s very little real-world data left for model training. Not to worry though – the billionaire knows the path forward.
During a livestream conversation with Stagwell chairman Mark Penn streamed on X, Musk said: “We’ve now exhausted basically the cumulative sum of human knowledge in AI training. That happened basically last year.”
His statement isn’t exactly surprising. AI scientists have been saying for some time that even though the internet is a vast ocean of human knowledge, it isn’t infinite – and training ever-bigger neural networks has sucked it almost entirely dry.
Watch Stagwell's CEO Mark Penn interview Elon Musk at CES! https://t.co/BO3Z7bbHOZ
– Live (@Live) January 9, 2025
In fact, Ilya Sutskever, OpenAI’s former chief scientist who left the AI company last year after the boardroom drama and rumored disagreements with CEO Sam Altman, also touched on the issue in December.
At NeurIPS, the machine learning conference, Sutskever said the AI industry had already reached “peak data” and predicted that models will have to be developed in a different way in the very near future.
Full @ilyasut TALK! about Pre-training is dead and more pic.twitter.com/PigZVvcEGB
– Diego | AI 🚀 - e/acc (@diegocabezas01) December 14, 2024
Epoch AI, a virtual research institute, previously projected that we’re likely to run out of training data in about four years. It doesn’t help that the models themselves are constantly growing in size and power.
However, researchers do have new ideas and workarounds, and AI training won’t stop just because the revolution is running out of data.
Companies such as Microsoft, Meta, OpenAI, and Anthropic have publicly acknowledged the issue and suggested they were already generating new data in unconventional ways.
Gartner recently estimated that 60% of the data used for AI projects in 2024 was synthetically generated. Google’s Gemma models and Meta’s Llama series of models, for example, were trained on synthetic data alongside real-world data.
A spokesperson for OpenAI told Nature: “We use numerous sources, including publicly available data and partnerships for non-public data, synthetic data generation, and data from AI trainers.”
Musk agrees that synthetic data – generated by AI models themselves – is the way forward. During the livestream, he said: “With synthetic data, AI will sort of grade itself and go through this process of self-learning.”
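At a very high level, that kind of self-grading loop looks like the sketch below. The `generate_candidates` and `grade` functions are hypothetical stand-ins for calls into a real model; in practice the model itself (or a separate reward model) would both produce the candidate text and score it, and only the highest-scoring outputs would be kept as new training examples.

```python
import random

random.seed(0)

# Hypothetical stand-in: a real system would sample n completions from an LLM.
def generate_candidates(prompt, n=4):
    return [f"{prompt} -> answer variant {i}" for i in range(n)]

# Hypothetical stand-in: a real system would have the model (or a reward
# model) grade its own output; here we just return a random score.
def grade(candidate):
    return random.random()

def make_synthetic_example(prompt):
    """Generate several candidates and keep the one the grader scores highest."""
    candidates = generate_candidates(prompt)
    best = max(candidates, key=grade)
    return {"prompt": prompt, "completion": best}

# Build a small synthetic dataset from a handful of prompts.
dataset = [make_synthetic_example(p) for p in ["Q1", "Q2", "Q3"]]
```

The filtered `dataset` is what would then be fed back into the next round of training.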
Training on synthetic data can also be much cheaper. However, AI models can then become less “creative” and more biased, some research suggests.
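The loss of diversity that researchers warn about can be illustrated with a toy simulation (an illustration of the failure mode, not a model of real training): each "generation" fits a distribution to a filtered slice of the previous generation's own samples, and the spread of the data steadily shrinks.

```python
import random
import statistics

random.seed(1)

# Generation 0: "real" data drawn from a wide distribution.
values = [random.gauss(0.0, 1.0) for _ in range(1000)]

stdevs = []
for generation in range(5):
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    stdevs.append(sigma)
    # Each new generation trains only on a filtered slice of the previous
    # generation's own samples (here: the "safest" middle half).
    samples = sorted(random.gauss(mu, sigma) for _ in range(1000))
    values = samples[250:750]

# The measured spread shrinks every generation: diversity is lost.
print([round(s, 3) for s in stdevs])
```

Each round of filtering throws away the tails, so the fitted distribution narrows – the statistical analogue of models becoming less "creative" when trained on their own curated output.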