
AI models may ace medical exams, but they are not yet qualified to give you a correct diagnosis in real life, say Harvard and Stanford researchers.
Turning to artificial intelligence (AI) for health advice is becoming increasingly popular. A survey conducted by the University of Virginia revealed that 78% of respondents were open to using ChatGPT for self-diagnosis.
Beyond self-diagnosis, AI models are expected to ease the workload of healthcare professionals by taking medical histories and offering preliminary diagnoses based on a patient's symptoms.
Research by Silicon Valley medical startup Ansible Health showed that the viral chatbot can pass medical exams with high accuracy: ChatGPT performed well on all three steps of the United States Medical Licensing Exam (USMLE) without any specialized training or reinforcement.
On the other hand, large language models still struggle with accurate diagnosis. In August, a study by the National Institutes of Health (NIH) found that while popular AI tools are good at diagnosing genetic diseases from textbook-like descriptions, their accuracy drops significantly when they analyze patients' own descriptions of their health.
That’s a crucial pain point, as navigating conversations about health is at the core of diagnosing patients. A recent study from Harvard Medical School and Stanford University confirmed this again: the scientists created a framework that tests how AI models perform in real-world situations at medical offices.
Passing a medical exam is not enough
Researchers tested four AI models, both commercial and open-source, using 2,000 clinical cases covering common primary care conditions and 12 medical specialties.
All models performed well on multiple-choice questions resembling medical exams. However, their performance declined when they had to give open-ended responses and hold conversations similar to real-world interactions with patients.
"Our work reveals a striking paradox – while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School.
Failing to hear patients out accurately undermined the AI’s ability to take medical histories and reach correct diagnoses. For example, the models often failed to ask the right questions, missed important details, and had trouble piecing together scattered information.
“The dynamic nature of medical conversations – the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms – poses unique challenges that go far beyond answering multiple choice questions,” adds Rajpurkar.
The reason is straightforward. The researchers point out that developers typically evaluate AI models by having them answer multiple-choice medical questions, often taken from national exams for graduating medical students or certification tests for medical residents.
“This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world, this process is far messier,” said study co-first author Shreya Johri.
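To make that contrast concrete, here is a minimal sketch of what such a conventional multiple-choice evaluation looks like in code. The vignette, answer key, and function names are illustrative assumptions, not taken from any specific benchmark or from the study itself.

```python
# Minimal sketch of a conventional multiple-choice evaluation:
# the case is pre-digested into a single prompt with answer options,
# and grading reduces to a string match against the keyed letter.
# The vignette and answer key below are illustrative assumptions.

def grade_mcq(model_answer: str, answer_key: str) -> bool:
    """Exact-match grading: did the model pick the keyed option letter?"""
    return model_answer.strip().upper().startswith(answer_key.upper())

question = (
    "A 58-year-old man presents with crushing substernal chest pain radiating "
    "to the left arm, diaphoresis, and ST-segment elevation on ECG. "
    "Most likely diagnosis?\n"
    "A) Pericarditis  B) Myocardial infarction  C) GERD  D) Costochondritis"
)
answer_key = "B"

# model_answer = query_model(question)   # one call, all information handed over up front
model_answer = "B) Myocardial infarction"  # stand-in for a model reply
print(grade_mcq(model_answer, answer_key))  # True
```

All the relevant findings are handed to the model in one tidy prompt, complete with textbook buzzwords, and grading is a simple string match. That is exactly the setup the researchers argue does not reflect a real consultation.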
Testing AI models for accuracy
The researchers designed an evaluation framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) to simulate real-world medical conversations.
CRAFT-MD tests how well large language models handle real-world interactions by evaluating their ability to gather details about symptoms, medications, and family history, and to make a diagnosis.
In this process, an AI agent acts as a patient, responding naturally to questions, while another AI agent grades the model's final diagnosis. Later, human experts review each interaction to assess the model's ability to collect relevant information, diagnose accurately with scattered details, and follow prompts.
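As a rough illustration of that setup, the sketch below simulates the doctor-patient exchange with a patient agent and a grader agent. The `query_model` placeholder, the prompts, and the scoring rule are assumptions made for illustration; they are not the authors' actual CRAFT-MD implementation.

```python
# Sketch of a CRAFT-MD-style conversational evaluation loop.
# All prompts, field names, and the grading rule are illustrative assumptions.

def query_model(system_prompt: str, messages: list[dict]) -> str:
    """Placeholder for a chat-completion call to whichever LLM is being tested.
    Wire this to your provider's API; it must return the reply as plain text."""
    raise NotImplementedError


def run_case(case: dict, max_turns: int = 10) -> dict:
    """Simulate one doctor-patient conversation for a single clinical case."""
    # The "doctor" is the model under evaluation; it only sees the chief complaint.
    doctor_history = [{"role": "user",
                       "content": f"Patient's opening statement: {case['chief_complaint']}"}]
    # The "patient" agent answers from the full case vignette, without volunteering extras.
    patient_system = ("You are a patient. Answer the doctor's questions truthfully "
                      f"using only this case description:\n{case['vignette']}\n"
                      "Do not volunteer information that was not asked about.")

    for _ in range(max_turns):
        doctor_turn = query_model(
            "You are a physician taking a history. Ask one question at a time. "
            "When ready, reply with 'FINAL DIAGNOSIS:' followed by your diagnosis.",
            doctor_history)
        doctor_history.append({"role": "assistant", "content": doctor_turn})

        if "FINAL DIAGNOSIS:" in doctor_turn:
            diagnosis = doctor_turn.split("FINAL DIAGNOSIS:")[-1].strip()
            break

        # For brevity only the latest question is forwarded to the patient agent;
        # a fuller version would pass the whole conversation from its perspective.
        patient_reply = query_model(patient_system,
                                    [{"role": "user", "content": doctor_turn}])
        doctor_history.append({"role": "user", "content": patient_reply})
    else:
        diagnosis = ""  # the model never committed to a diagnosis

    # A separate grader agent compares the final diagnosis with the ground truth.
    verdict = query_model(
        "You are a medical exam grader. Answer only 'correct' or 'incorrect'.",
        [{"role": "user",
          "content": f"Ground truth: {case['diagnosis']}\nModel's answer: {diagnosis}"}])

    return {"diagnosis": diagnosis,
            "correct": "correct" in verdict.lower(),
            "transcript": doctor_history}
```

In the study itself, human experts then review the transcripts produced by such runs, checking whether the model gathered the relevant information, diagnosed accurately from scattered details, and followed the prompts.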
How to improve AI’s performance in healthcare?
Based on their findings, published on January 2nd in Nature Medicine, the researchers advise AI developers and regulators to follow these guidelines to improve AI’s performance in the healthcare sector:
- Incorporate conversational, open-ended questions into the design, training, and testing of AI tools to better reflect unstructured doctor-patient interactions.
- Evaluate models based on their ability to ask relevant questions and extract critical information.
- Develop models that can track multiple conversations and synthesize information from them.
- Create AI systems capable of combining textual data (e.g., conversation notes) with non-textual data (e.g., images, EKGs).
- Design advanced AI agents that can interpret non-verbal cues, such as facial expressions, tone, and body language.