Three issues with ChatGPT and similar AI chatbots


A lack of data, rising costs, and electricity consumption are three problems that the creators of large language models (LLMs), which underpin chatbots like ChatGPT, will have to address.

One of the most fascinating things about LLMs, which form the backbone of AI chatbots, is the speed at which they’ve progressed. The first version of GPT, released in 2018, could do very basic tasks such as summarizing text, completing sentences, and answering questions.

The latest version, GPT-4o, which was announced last week, is capable of reasoning in real time across voice, text, and vision.

None of the models could achieve such results without training data, which is crucial to their growth. For example, GPT-3 was trained on 570GB of data, which included websites, articles, books, forums, and other publicly available text sources.

But AI models' growing dependency on data has led to concerns that in the future we will run out of data to scale and improve LLMs.

Lack of high-quality data

A research paper from Epoch estimated that computer scientists could deplete the stock of high-quality language data by 2024 or 2025 and exhaust low-quality data within two decades.

An example of high-quality data might be a book or academic paper, while low-quality data can include internet forums, spam emails, and biased articles. In essence, high-quality data is crucial in developing reliable LLMs.

There is one more factor to consider. With copyright infringement lawsuits hanging over OpenAI, Microsoft, Google, and Anthropic, companies are now much more careful about what they feed to LLMs.

These limitations don't necessarily mean that the development of LLMs will be stopped. One way to overcome the lack of training data would be to use synthetic data created by LLMs themselves. For example, it’s possible to use text produced by one LLM to train another LLM.

However, two studies published last year suggested that there are limitations associated with training models on synthetic data. One problem with such an approach is that at some point, LLMs trained on synthetic data "lose the ability to remember true underlying data distributions and start producing a narrow range of outputs," researchers from Stanford University concluded in their AI Index.
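
The narrowing effect can be illustrated with a toy simulation. The sketch below is not the method used in the studies mentioned above. It assumes a hypothetical "model" that simply memorizes how often each token appears in its training data and then produces the next training set by sampling from those frequencies. All names and parameters are made up for illustration.

```python
import random
from collections import Counter


def resample_generations(vocab_size=50, sample_size=100, generations=30, seed=0):
    """Return the number of distinct tokens seen in each generation.

    Toy assumption: each "model" only learns token frequencies from its
    training data, and the next generation is trained solely on its output.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn uniformly from the full vocabulary.
    data = [rng.randrange(vocab_size) for _ in range(sample_size)]
    distinct_counts = [len(set(data))]
    for _ in range(generations):
        # "Train": estimate token frequencies from the current data.
        counts = Counter(data)
        tokens = list(counts.keys())
        weights = list(counts.values())
        # "Generate": build the next training set from the model's own output.
        data = rng.choices(tokens, weights=weights, k=sample_size)
        distinct_counts.append(len(set(data)))
    return distinct_counts


if __name__ == "__main__":
    history = resample_generations()
    print("distinct tokens per generation:", history)
```

Because each generation can only emit tokens it has already seen, the number of distinct tokens can never grow, and rare tokens tend to drop out over time.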

But some industry experts have other opinions. Anthropic CEO Dario Amodei has said that it may be possible to generate infinite amounts of data by injecting only very small amounts of new information. According to him, the key to success is "doing it right."

According to Andrés Diana, Chief Innovation Officer at Accrete AI, the real value of synthetic data emerges when it's generated from private, high-quality sources that were previously inaccessible. He thinks that by using synthetic data as a proxy, LLMs can access the depth of private data without compromising confidentiality.

Another way would be to blend human domain expertise with computational processes, which is the method used by Accrete AI, a company that licenses expert AI agents to government and commercial customers.

"This synergy is crucial as it unlocks tacit knowledge that humans possess but which is often not recorded in data formats traditionally used for machine learning," Diana says.

The expert adds that LLM providers are forming partnerships to access private datasets. This approach diversifies the training materials and enhances the models' understanding of niche and specialized content, which is not typically available in public datasets.

Power-hungry data centers

Another issue associated with AI is the amount of electricity needed to operate data centers, which are increasingly used for AI computations. A report by the International Energy Agency estimates that search tools like Google could see a tenfold increase in their electricity demand if AI were fully implemented in them.

It estimates that electricity consumption from data centers, AI, and cryptocurrencies could double by 2026.

Growing electricity demand for AI might be the biggest issue of all, says Justin Uberti, CTO and co-founder of Fixie.ai, a startup that’s building voice-based communication with AI.

He refers to a study estimating that this demand may double to 3% of global electricity consumption by 2030. However, Uberti thinks the figure may be higher, reaching 5%.

"I don't think people have really internalized how much things are going to change here. Unless we hit some sort of point where scaling up no longer yields better results, I don't see anything stopping these increases in training investment, model size, and power consumption. A big data center right now is 100MW. We will see AI data centers at 1GW and possibly beyond, to the point where they have their own power plant(s) feeding them," Uberti stresses.
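
To put those figures in perspective, a rough back-of-the-envelope calculation converts a constant power draw into yearly energy use. The sketch below assumes a facility runs at its full rated power all year, which is a simplification. The function name and the utilization parameter are illustrative and do not come from Uberti or any cited report.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_gwh(power_mw, utilization=1.0):
    """Energy in GWh used over one year at a constant draw of power_mw megawatts.

    utilization is a hypothetical knob: 1.0 means full rated power all year.
    """
    energy_mwh = power_mw * utilization * HOURS_PER_YEAR
    return energy_mwh / 1000  # convert MWh to GWh


if __name__ == "__main__":
    for label, power_mw in [("100MW data center", 100), ("1GW data center", 1000)]:
        print(f"{label}: {annual_energy_gwh(power_mw):,.0f} GWh per year")
```

Under that assumption, a 100MW facility would use about 876 GWh a year and a 1GW facility about 8,760 GWh (8.76 TWh), ten times as much.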

In the future, new technologies, including energy-efficient semiconductors and cooling systems, and possible advancements in quantum computing may reduce energy consumption. But for now, AI electricity consumption is another growing pain in the attempt to reach our climate goals. There have also been a number of instances of pushback against data centers in the US.

According to Uberti, another major challenge in developing AI models might be the lack of graphics processing unit (GPU) power. He says that the compute needs for running video models like Sora are orders of magnitude beyond even today's large models.

Rising costs

Last but not least, there’s concern about the rising costs of training LLMs. Sam Altman, the CEO of OpenAI, revealed that it cost around $100 million to train ChatGPT. Such an amount of money isn't that difficult to find for a company backed by a tech behemoth like Microsoft or other big players in the field.

However, this might soon change. Amodei of Anthropic thinks that the most expensive model next year will cost more than $1 billion to train, and by 2025, we may have a $10 billion model.

At some point, the cost of training LLMs may become too high, even for big companies.

Rising costs also mean even more tools concentrated in the hands of big tech companies, leaving less space for other players.

However, there will still be plenty of ways for smaller companies to innovate, according to Bob Rogers, a Harvard-trained data scientist and the CEO of Oii.ai, a supply chain AI company.

"They will be able to innovate around locally tuned models that are really great at specific tasks, or in specific domains, and also in use cases that require less energy consumption and computer size for actually using the model once it's developed. AI is starting to look like botany or zoology, with many different sized models with different features and attributes, each one tuned to fit its niche in the ecosystem," he says.