There’s a reason AI companies are rushing to train their models on publicly available data as quickly as possible: those resources could soon be exhausted, spelling disaster for tech firms.
A new study released by Epoch AI, a research group, predicts that tech companies will exhaust the supply of publicly available training data for AI language models by around the turn of the decade – sometime between 2026 and 2032.
Tamay Besiroglu, one of the study's authors, said that once the AI field drains the reserves of human-generated writing, it will find it difficult to maintain its current pace of progress.
Right now, there’s still content to scrape and use for training. OpenAI, Google, and other tech firms can still acquire or pay for data sources to train their large language models.
Deals with popular social media platforms like Reddit or news organizations such as The Wall Street Journal are especially useful, since new data is generated daily.
On the other hand, resistance to such agreements is growing. The New York Times and several other newspapers have sued OpenAI for using their copyrighted work to train its models, and authors such as George R.R. Martin have filed similar lawsuits.
In short, there might not be enough new material to sustain the current trajectory of AI development – and the companies are hungry. OpenAI alone reportedly generates 100 billion words per day.
“We argue that human-generated public text data cannot sustain scaling beyond this decade,” the study says.
“Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained.”
According to the study, the amount of text data fed into AI language models has been growing by about 2.5 times per year, while the computing power used to train them has grown by about four times per year.
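To make the arithmetic behind that projection concrete, here is a minimal sketch of how exponential dataset growth runs into a fixed stock of text. Only the 2.5x annual growth rate comes from the article; the starting dataset size and total stock below are hypothetical placeholders, not Epoch AI’s actual estimates.

```python
import math

# Hypothetical illustration -- these figures are placeholders,
# NOT Epoch AI's actual estimates.
current_dataset_tokens = 15e12   # assumed size of today's largest training sets
public_text_stock = 300e12       # assumed total stock of public human text
data_growth_per_year = 2.5       # annual growth rate cited in the study

# Solve current * growth^t = stock for t (years until exhaustion).
years_to_exhaustion = (
    math.log(public_text_stock / current_dataset_tokens)
    / math.log(data_growth_per_year)
)
print(f"Stock exhausted in ~{years_to_exhaustion:.1f} years")
# -> ~3.3 years under these made-up assumptions
```

Under these invented numbers the window closes within a few years; Epoch’s actual 2026–2032 range rests on its own estimates of the stock of public text and observed scaling trends.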
Squeezing more out of existing data is possible, of course. AI researchers can make better use of the data they already have, and they can train models on the same sources multiple times, though that is essentially photocopying a photocopy and degrades performance.
But there are limits, Epoch says, and what’s to come might be even worse. Companies may feel pressured by eager investors to keep finding new resources and decide to tap into sensitive data now considered private, such as emails or text messages.
That’s probably the nuclear option. First, the AI industry might try using the models themselves to generate synthetic training data. Another avenue is multimodality and transfer learning: training language models on other kinds of existing data, such as images, audio, and video.
Epoch is a nonprofit institute under the umbrella of San Francisco-based Rethink Priorities. It’s funded by proponents of effective altruism, a philanthropic movement that has invested money into mitigating AI’s worst-case risks.