A popular type of generative AI model can navigate New York with stunning accuracy. However, if you close some streets and add detours, it starts imagining things.
New research shows that large language models lack a coherent understanding of the world, despite their impressive performance on everyday tasks such as navigation.
For example, a model can provide turn-by-turn directions in New York without having formed an accurate internal map of the city. But its performance breaks down as soon as new conditions are added to the task.
Researchers from the Massachusetts Institute of Technology (MIT), Harvard University, Cornell University, and the University of Chicago Booth School of Business analyzed the transformer, a type of generative AI model that is the backbone of many LLMs.
They added detours to the map of New York and witnessed how those extra conditions led the transformer to fail.
“I was surprised by how quickly the performance deteriorated as soon as we added a detour. If we close just 1 percent of the possible streets, accuracy immediately plummets from nearly 100 percent to just 67 percent,” lead author Keyon Vafa, a postdoc at Harvard University, said.
To put it simply, the transformer started hallucinating. According to MIT News, the recovered maps looked like an imagined New York, with random flyovers above other streets and multiple streets with impossible orientations.
“Because these transformers fail to recover the true street map of New York City, they are fragile for downstream tasks. While they sometimes have amazing route planning abilities, their performance breaks down when detours are introduced,” the research paper reads.
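To make the kind of stress test described here concrete, the sketch below shows one plausible way to measure detour fragility on a toy street graph: close a small fraction of edges at random and count how many of a model's proposed routes remain traversable. It is purely illustrative; the graph, the `route_model` stub, and all names are hypothetical stand-ins, not the researchers' code or data.

```python
# Minimal sketch of a detour stress test on a toy street graph (illustration only;
# the graph, the route_model stub, and all names are hypothetical, not the
# authors' code). Idea: close a small fraction of streets at random and measure
# how many of the model's proposed routes remain traversable.
import random

# A tiny directed "street map": intersection -> set of reachable neighbors.
STREETS = {
    "A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": {"E"}, "E": set(),
}

def route_model(start, goal, closed_edges):
    """Stand-in for a generative model's turn-by-turn output.
    It replays a memorized route and ignores closures, mimicking a model
    that navigates well without a coherent internal map."""
    memorized = {("A", "E"): ["A", "B", "D", "E"]}
    return memorized.get((start, goal), [start, goal])

def route_is_valid(route, closed_edges):
    """A route is valid if every consecutive pair is an open street."""
    return all(
        dst in STREETS.get(src, set()) and (src, dst) not in closed_edges
        for src, dst in zip(route, route[1:])
    )

def detour_accuracy(queries, close_fraction, trials=100, seed=0):
    """Fraction of proposed routes that stay valid after random closures."""
    rng = random.Random(seed)
    all_edges = [(s, d) for s, ns in STREETS.items() for d in ns]
    k = max(1, int(close_fraction * len(all_edges)))
    valid = 0
    for _ in range(trials):
        closed = set(rng.sample(all_edges, k))
        valid += sum(
            route_is_valid(route_model(s, g, closed), closed) for s, g in queries
        )
    return valid / (trials * len(queries))

print(detour_accuracy([("A", "E")], close_fraction=0.2))
```

Because the stub ignores closures entirely, its accuracy collapses whenever a closed street happens to lie on its memorized route, which is the qualitative pattern the researchers report at city scale.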
The authors noted that the results are not unique to navigation; they also experimented with logic puzzles and game playing.
“In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly,” the researchers said.
Their paper proposed theoretically grounded metrics for assessing the world models implicit inside generative models.
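One way to picture such a metric: if two different move sequences are known to lead to the same underlying state, a model with a coherent world model should accept the same set of next moves after either one. The sketch below is a hedged illustration in that spirit, not the authors' exact definition; the stubbed model and all names are invented for the example.

```python
# Hedged sketch of a coherence check in the spirit of the paper's metrics
# (not the authors' exact definitions): two prefixes that reach the same true
# state should admit the same next moves. All names here are illustrative.

def valid_next_moves_from(model_accepts, prefix, candidate_moves):
    """Ask the (stubbed) model which next moves it treats as legal after a prefix."""
    return {m for m in candidate_moves if model_accepts(prefix + [m])}

def compression_error(model_accepts, prefix_a, prefix_b, candidate_moves):
    """Disagreement between two prefixes known to reach the same state:
    0.0 means the model treats them identically, 1.0 means total disagreement."""
    a = valid_next_moves_from(model_accepts, prefix_a, candidate_moves)
    b = valid_next_moves_from(model_accepts, prefix_b, candidate_moves)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0

def toy_model_accepts(seq):
    # A stand-in model that (incorrectly) only recognizes routes through B.
    return "B" in seq or len(seq) <= 2

# Two routes from A that both end at intersection D should be interchangeable,
# but the toy model treats them differently, so the error is 1.0.
print(compression_error(toy_model_accepts,
                        prefix_a=["A", "B", "D"],
                        prefix_b=["A", "C", "D"],
                        candidate_moves=["E"]))
```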