Apple study exposes flaws in AI models’ mathematical reasoning


While current large language models (LLMs) demonstrate remarkable capabilities across various tasks, Apple research shows that they lack "true logical reasoning."

Apple researchers found that simple mathematical problems, which the vast majority of people can solve, become surprisingly difficult for AI chatbots. The answer an LLM produces also depends on how the question is phrased.

“Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question,” the paper by Apple researchers reads.


“Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases.”

Researchers argue that current LLMs “are not capable of genuine logical reasoning.” Instead, they attempt to replicate the reasoning steps observed in their training data.

They proposed a new benchmark, called GSM-Symbolic, to evaluate the models. It improves on existing mathematical reasoning benchmarks by enabling the generation of diverse question variants from symbolic templates.

“We add seemingly relevant but ultimately inconsequential statements to GSM-Symbolic templates,” researchers said. “These additions do not affect the reasoning required to solve the problem.”
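To make the setup concrete, here is a minimal Python sketch of how such template-based generation might work. The template text, placeholder names, and the `add_noop` flag are illustrative assumptions, not the paper's actual implementation:

```python
import random

# Illustrative GSM-Symbolic-style template. The placeholder format and the
# distractor clause below are assumptions for demonstration only.
TEMPLATE = (
    "Oliver picks {x} kiwis on Friday. Then he picks {y} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday.{noop} "
    "How many kiwis does Oliver have?"
)
NOOP = " However, {z} of them were a bit smaller than average."

def instantiate(add_noop: bool = False):
    """Return one question variant and its ground-truth answer."""
    x, y = random.randint(10, 60), random.randint(10, 60)
    noop = NOOP.format(z=random.randint(2, 9)) if add_noop else ""
    question = TEMPLATE.format(x=x, y=y, noop=noop)
    answer = x + y + 2 * x  # the distractor clause never changes the answer
    return question, answer

q, a = instantiate(add_noop=True)
print(q)
print("ground truth:", a)
```

Each call produces a fresh numeric variant of the same underlying problem, with the ground truth computed mechanically, so the added distractor clause should never change the correct answer.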

Performance across all state-of-the-art AI models dropped by as much as 65% when a single such clause was added to the prompt. The clause contributed nothing to the solution and only appeared relevant to the question. Researchers described this as a "catastrophic performance loss."

Here is an example of such a prompt: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?”

LLMs, including o1-mini and Llama3-8B, subtracted the five smaller kiwis and arrived at 185 instead of the correct answer of 190.
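The arithmetic behind the example is trivial, which makes the failure easy to verify:

```python
friday = 44
saturday = 58
sunday = 2 * friday                    # "double the number he did on Friday"

correct = friday + saturday + sunday   # 190 -- the size remark changes nothing
distracted = correct - 5               # 185 -- wrongly subtracting the smaller kiwis

print(correct, distracted)             # 190 185
```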

“Overall, we find that models tend to convert statements to operations without truly understanding their meaning. For instance, a common case we observe is that models interpret statements about ‘discount’ as ‘multiplication,’ regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough,” researchers said.


However, Cybernews could not reproduce the results on some larger LLMs, such as Claude and Gemini, which counted the kiwis correctly.


The largest accuracy drops were observed in the smallest LLMs, those with only a few billion parameters. Even o1-preview, OpenAI's most advanced model, demonstrated a still significant 17.5% drop in accuracy.

Apple researchers also observed that performance deteriorates as the complexity of the question increases. This suggests that these token-generating machines have deeper issues in their reasoning process "that cannot be easily mitigated through few-shot learning or fine-tuning."

Further research is needed to develop AI models capable of formal reasoning and more robust and generalizable problem-solving skills.

“This remains a critical challenge for the field as we strive to create systems with human-like cognitive abilities or general intelligence,” the researchers concluded.