LLMs still can’t replicate the emotional expression of human language

Large language models (LLMs) are getting better at replicating the form of human interaction on social networks, but they still fail to capture its feeling.

A novel “computational Turing test” can distinguish social media responses generated by LLMs from those written by actual humans most of the time, according to a new paper published on the preprint server arXiv.

Researchers used the test against nine large language models: Llama 3.1 8B, Llama 3.1 8B Instruct, Llama 3.1 70B, Mistral 7B v0.1, Mistral 7B Instruct v0.2, Qwen 2.5 7B Instruct, Gemma 3 4B Instruct, DeepSeek-R1-Distill-Llama-8B, and Apertus-8B-2509.

They prompted models to reproduce user interactions on the social media networks X, Bluesky, and Reddit.

The computational Turing test identified AI-generated text with 70-80% accuracy, even when the most advanced models and calibration methods such as persona descriptions and fine-tuning were used.

Optimizing AI-generated text to sound more human reduced its semantic accuracy, and vice versa, suggesting a trade-off between realism and meaning.

The models were more successful at replicating the conversational patterns of X users than those of Bluesky and Reddit users.

Failure to capture emotional expression

Affective language was found to be the clearest marker of AI-generated language. The test identified positive emotion, affection, optimism, and communication as the features that best differentiate human language from that of AI.
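
To make the idea of an affect-based marker concrete, here is a minimal Python sketch of a toy detector. It is not the method used in the paper: the word list `AFFECT_WORDS`, the 0.1 threshold, the function names, and the four sample texts are all hypothetical and chosen only for illustration. It flags a text as AI-generated when very few of its words carry emotion, then measures how often that guess matches the label.

```python
# Illustrative toy detector only; NOT the paper's computational Turing test.
# The lexicon, threshold, and sample texts below are hypothetical.
AFFECT_WORDS = {"love", "hate", "great", "awful", "happy", "sad", "lol", "wow"}


def affect_score(text: str) -> float:
    """Return the fraction of tokens found in the toy affect lexicon."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in AFFECT_WORDS for t in tokens) / len(tokens)


def predict_is_ai(text: str, threshold: float = 0.1) -> bool:
    """Guess 'AI-generated' when the affect score is below the threshold."""
    return affect_score(text) < threshold


def accuracy(samples: list[tuple[str, bool]], threshold: float = 0.1) -> float:
    """Share of (text, is_ai) pairs that the toy detector labels correctly."""
    correct = sum(predict_is_ai(text, threshold) == is_ai for text, is_ai in samples)
    return correct / len(samples)


# Made-up examples: (text, is_ai)
samples = [
    ("I love this, wow!", False),
    ("The report summarizes the quarterly figures.", True),
    ("This is an informative overview.", False),  # a flat human post fools the toy rule
    ("It provides a balanced perspective on the topic.", True),
]

print(accuracy(samples))  # 0.75
```

The third sample shows why a single cue is not enough: humans also write neutral posts, so a detector built on one feature will misclassify some of them.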

Even when the researchers optimized LLM-written text to sound more human-like by changing sentence length or word count, differences in emotional tone persisted.

The models also failed to realistically replicate toxicity, sentiment, and emotional expression, including both positive and negative emotions.

The authors wrote, “These results suggest that while LLMs can reproduce the form of online dialogue, they struggle to capture its feeling: the spontaneous, affect-laden expression characteristic of human interaction.”

Even though AI-generated interactions on social networks are relatively easy to spot, that doesn’t stop users from relying on LLMs.

Over half (54%) of LinkedIn long-form posts may be AI-generated or heavily edited, with a steep surge in this type of content identified after the release of ChatGPT in 2022, according to a study by Originality AI.
