ChatGPT can read medical textbooks but likely will misdiagnose you


Researchers have found that popular language models can diagnose patients accurately from textbook-like descriptions of their symptoms, but often fail to work out what is wrong when analyzing the patients' own accounts.

Even though language models are not specifically designed for healthcare, people are already inclined to use ChatGPT in healthcare contexts. In 2023, a survey by West Virginia University researchers showed that 78.4% of respondents were willing to use ChatGPT for self-diagnosis.

AI technologies have been rapidly adopted in the medical field, with successes in automating tasks and analyzing medical images. Numerous studies have highlighted ChatGPT's potential in healthcare.

Some research indicates that ChatGPT effectively provides information and support across a range of scenarios, including mental health assessments, counseling, medication management, and patient education. Nonetheless, other studies have pointed to its limits: in one, ChatGPT's accuracy in diagnosing conditions in children was only 17%.

Genetic diseases show a similar pattern. A recent study by the National Institutes of Health (NIH), published in the American Journal of Human Genetics, found that popular AI tools such as Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT-3.5, and ChatGPT-4 are good at diagnosing genetic diseases from textbook-like descriptions.

However, their accuracy drops sharply when they analyze summaries that patients write about their own health.

“We may not always think of it this way, but so much of medicine is words-based,” said Ben Solomon, M.D., senior author of the study and clinical director at the NIH’s National Human Genome Research Institute (NHGRI).

“For example, electronic health records and the conversations between doctors and patients all consist of words. Large language models have been a huge leap forward for AI, and being able to analyze words in a clinically useful way could be incredibly transformational.”

Ten different large language models tested

The researchers evaluated ten large language models by creating questions about 63 genetic conditions, drawing from medical textbooks and reference materials. These conditions included both common ones, like sickle cell anemia and cystic fibrosis, and many rare genetic disorders.

To capture a range of possible symptoms, they selected three to five symptoms for each condition and framed questions in a consistent format: “I have X, Y, and Z symptoms. What’s the most likely genetic condition?”
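To make the setup concrete, here is a minimal sketch of how prompts in that format could be assembled programmatically. The study did not publish code in this form, and the conditions and symptom lists below are illustrative placeholders, not the researchers' data.

```python
# Illustrative only: hypothetical condition/symptom data, not the study's dataset.
conditions = {
    "cystic fibrosis": ["a chronic cough", "salty-tasting skin", "frequent lung infections"],
    "sickle cell anemia": ["episodes of severe pain", "fatigue", "swelling in the hands and feet"],
}

def build_prompt(symptoms):
    """Frame symptoms in the consistent question format described in the study."""
    listed = ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]
    return f"I have {listed}. What's the most likely genetic condition?"

for condition, symptoms in conditions.items():
    print(build_prompt(symptoms))
```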

The models varied greatly in their ability to diagnose correctly, with initial accuracies ranging from 21% to 90%. Generally, larger models with more training data performed better, with GPT-4 being the most accurate.

For many lower-performing models, accuracy improved in subsequent tests. Overall, the models were more accurate than traditional non-AI methods, including standard Google searches.

The researchers also tested the models' performance with simplified language. For example, instead of “macrocephaly,” they used “a big head” to better reflect how patients might describe symptoms.
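A rough sketch of that substitution step might look like the following; the mapping from clinical terms to lay phrasing here is an assumption for illustration, not the researchers' actual list.

```python
# Hypothetical clinical-to-lay term mapping; illustrative only.
LAY_TERMS = {
    "macrocephaly": "a big head",
    "hypotonia": "weak muscle tone",
    "polydactyly": "extra fingers or toes",
}

def simplify(prompt: str) -> str:
    """Replace medical terminology with everyday wording before querying a model."""
    for clinical, lay in LAY_TERMS.items():
        prompt = prompt.replace(clinical, lay)
    return prompt

print(simplify("I have macrocephaly, hypotonia, and polydactyly. "
               "What's the most likely genetic condition?"))
```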

While accuracy dropped when medical terms were replaced, seven out of ten models still outperformed Google searches with the simplified language.

Accuracy drops with real patients

To evaluate how well large language models perform with real patient information, researchers at the NIH Clinical Center asked patients to write brief summaries about their genetic conditions and symptoms.

These descriptions varied in length from a single sentence to several paragraphs and differed significantly in style and content from textbook-style questions.

When the models were given these patient-written descriptions, the most accurate model identified the correct diagnosis only 21% of the time, and some models scored as low as 1%.

The researchers had anticipated that these patient summaries would be more challenging, since NIH patients often have very rare conditions about which the models may simply lack enough information to make a correct diagnosis.

However, the models' accuracy improved when the researchers provided standardized questions about the same rare genetic conditions.

This suggests that the inconsistent phrasing and formatting of the patient summaries made it harder for the models to interpret the information, likely because they’re trained primarily on more uniform and textbook-like data.

“For these models to be clinically useful in the future, we need more data, and those data need to reflect the diversity of patients,” said Dr. Solomon in a press release.

“Not only do we need to represent all known medical conditions, but also variation in age, race, gender, cultural background, and so on, so that the data capture the diversity of patient experiences. Then these models can learn how different people may talk about their conditions.”