
If the current trends do not change, artificial intelligence (AI) systems could eventually fill the internet with truly incomprehensible hogwash, a new study has warned. The possible end result is a “model collapse.”
AI models such as GPT-4, which powers OpenAI’s ChatGPT, or Claude 3 Opus rely on the trillions of words shared online to keep getting smarter. But as AI-generated content gradually colonizes the internet, self-damaging feedback loops could emerge.
Researchers from prestigious universities in the United Kingdom and Canada say the end result could fill the web with unintelligible nonsense if left unchecked.
“AI models collapse when trained on recursively generated data. The proliferation of AI-generated content online could be devastating to the models themselves,” says the paper, published in Nature.
The best generative AI models are trained on human-generated content. For instance, GPT-3.5 was trained on around 570 gigabytes of text data from the repository Common Crawl – that’s 300 billion words.
But this type of data is, of course, finite and likely to be exhausted by the end of this decade. That’s why, for instance, OpenAI is rushing to strike deals with news organizations and social platforms such as Reddit – their content is constantly being renewed.
Still, AI systems – powered by hungry data centers – could devour all of the internet’s free knowledge as early as 2026, recent findings show.
Once this has happened, tech companies will have to begin looking for data elsewhere, and this could, of course, include synthetic data. They could also turn to lower-quality sources or even tap into private data, like our messages and emails.
Algorithms trained on insufficient or low-quality data produce sketchy outputs, as Google’s AI Overviews have hilariously demonstrated. And in the case of models being trained on AI-generated data, the risks are even greater.
In the aforementioned study, lead author Ilia Shumailov, a computer scientist at Oxford, and his colleagues first trained a large language model on human-written Wikipedia text, then trained each successive generation on text produced by the generation before it, over nine iterations.
With generations of self-produced content accumulating, the model’s responses degraded into delirious ramblings.
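To make the setup concrete, here is a toy sketch of that feedback loop – not the authors’ code. A word-bigram model stands in for the study’s actual large language model, and a hypothetical `wikipedia_sample.txt` file stands in for the human-written seed corpus. Each “generation” is trained only on text generated by the previous one, and the number of distinct words it can still produce shrinks round after round.

```python
import random
from collections import defaultdict

# Toy stand-in for the study's setup (the authors fine-tuned a real LLM):
# a word-bigram model is trained on seed text, and each later "generation"
# is trained only on text generated by the generation before it.

def train_bigrams(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, length=2000):
    """Sample a word sequence by walking the bigram table."""
    word = random.choice(list(table))
    out = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        word = random.choice(followers) if followers else random.choice(list(table))
        out.append(word)
    return " ".join(out)

corpus = open("wikipedia_sample.txt").read()   # hypothetical human-written corpus
for gen in range(10):                          # generation 0 plus nine recursive rounds
    model = train_bigrams(corpus)
    corpus = generate(model)
    print(f"gen {gen}: distinct words remaining = {len(set(corpus.split()))}")
```

Because each round can only recycle words the previous round happened to produce, vocabulary never grows and rare words are gradually lost – a crude analogue of what the researchers observed in their far larger models.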
The model was instructed to produce the next sentence for this input: “Some started before 1360 – was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish laborers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.”
By the ninth and final generation, the AI’s response was: “architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
According to the researchers, this happens because each generation samples from an ever-narrower slice of the previous generation’s output: rare patterns in the original data vanish first, and the compounding errors leave an overfitted, noise-filled model.
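A rough simulation of that narrowing is easy to run. In the sketch below – an illustrative toy under simplified assumptions, not the study’s experiment – each “generation” fits a normal distribution to a finite sample and the next generation’s data is drawn from that fit. Tail values are the first to disappear, and the fitted spread drifts toward zero as estimation errors compound.

```python
import numpy as np

# Toy illustration of model collapse on a single distribution:
# fit a Gaussian to finite data, resample from the fit, repeat.
rng = np.random.default_rng(0)
n = 100                                            # samples per generation
sample = rng.normal(0.0, 1.0, size=n)              # generation 0: "human" data

for gen in range(1, 501):
    mu, sigma = sample.mean(), sample.std()        # fit the model to the current data
    sample = rng.normal(mu, sigma, size=n)         # next generation sees only model output
    if gen % 100 == 0:
        print(f"gen {gen}: fitted sigma = {sigma:.4f}")
```

Using larger samples per generation slows the shrinkage but does not stop it, which is why the researchers stress that the defects accumulate however much synthetic data is poured in.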
“Indiscriminate use of model-generated content in training causes irreversible defects in the resulting models,” says the study. “It must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.”