The cost of training AI models is rising exponentially


It’s becoming more expensive to train the large language models (LLMs) that underpin chatbots created by OpenAI, Microsoft, and other companies, suggests the AI Index report published by Stanford University.

The report, which covers many other trends as well, also warns that a lack of data could hinder the development of AI in the future.

Sam Altman, the CEO of OpenAI, revealed last year that training GPT-4 cost over $100 million. The researchers note that while AI companies seldom reveal the expenses involved in training their models, these costs are widely believed to run into millions of dollars and to be rising.

They illustrate this trend with their own estimates of how much it costs to train an LLM. For example, the original Transformer model, which introduced the architecture that underpins virtually every modern LLM, cost only around $900 to train.

RoBERTa Large, released in 2019, cost around $160,000 to train, while the training costs of OpenAI’s GPT-4 and Google’s Gemini Ultra, both released more recently, are estimated at around $78 million and $191 million, respectively.

These estimates suggest that the cost of training the models currently in development could well run into the billions of dollars.
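To put the trend in perspective, here is a rough back-of-the-envelope sketch in Python using the figures cited above; the release years and the simple exponential extrapolation are assumptions made for illustration, not figures from the report.

```python
# Back-of-the-envelope extrapolation of the training-cost estimates cited above.
# The release years are assumed here for illustration purposes.
costs = {
    2017: 900,           # original Transformer
    2019: 160_000,       # RoBERTa Large
    2023: 78_000_000,    # GPT-4 (estimated)
}

# Average yearly growth factor between the earliest and latest data points.
first, last = min(costs), max(costs)
growth = (costs[last] / costs[first]) ** (1 / (last - first))
print(f"average growth per year: ~{growth:.1f}x")

# Naive extrapolation a few years ahead, assuming the same exponential trend holds.
for year in (2025, 2026, 2027):
    print(f"{year}: ~${costs[last] * growth ** (year - last):,.0f}")
```

Under these assumptions, costs grow by a factor of six to seven per year, which is how projections for models now in development reach into the billions of dollars.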

Another challenge facing the creators of LLMs is a lack of data to train them on. In the last couple of years, AI chatbots have made significant progress, largely because the underlying LLMs were trained on ever larger amounts of data, such as books and articles, which act as their fuel.

However, this growing dependence of AI models on data has led to concerns that future generations of computer scientists will run out of data with which to further scale and improve their systems.

One study from Epoch, published in 2022, estimates that computer scientists could deplete the stock of high-quality language data by 2024, exhaust low-quality language data within two decades, and use up image data between the late 2030s and the mid-2040s.

One workaround would be to train LLMs on so-called synthetic data, that is, data generated by LLMs themselves. According to the Stanford researchers, this would not only address potential data depletion but could also supply data in instances where naturally occurring data is sparse.

However, two studies published last year suggest that training models on synthetic data comes with limitations. One problem with this approach is that, at some point, LLMs trained on synthetic data “lose the ability to remember true underlying data distributions and start producing a narrow range of outputs.”

One experiment demonstrated that with each subsequent generation trained on synthetic data produced by the previous one, LLMs generate an increasingly limited set of outputs. The same effect holds for both text and images.
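As a concrete, if highly simplified, picture of how this narrowing happens, the following toy simulation (written for illustration here, not taken from the cited studies) repeatedly refits a one-parameter Gaussian “model” to synthetic samples drawn from its own previous generation and tracks how the spread of its outputs changes.

```python
import random
import statistics

# Toy sketch of the collapse effect described above: each "generation" is a simple
# Gaussian model refitted to a small synthetic sample drawn from the previous one.
# This is an illustrative, assumption-laden setup, not the cited experiments.
random.seed(0)

mean, stdev = 0.0, 1.0   # generation 0 stands in for the true data distribution
sample_size = 10         # small samples make the narrowing visible quickly

for generation in range(1, 101):
    # Generate synthetic data from the current model, then refit the model on it.
    samples = [random.gauss(mean, stdev) for _ in range(sample_size)]
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: spread of outputs = {stdev:.4f}")
```

In runs like this, the estimated spread typically drifts toward zero, so later generations produce an increasingly narrow range of outputs, mirroring the behaviour the studies describe for text and image models.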