AI doctor gets a reality check

Even medical artificial intelligence models that are “incredibly” well-trained can make mistakes when confronted with real-world scenarios, a new study has found.

Researchers at Northwestern University trained four AI models to diagnose various medical conditions from actual tissue samples. Notably, some of these samples were intentionally contaminated.

The study's results indicated that such contamination can “easily confuse” AI models, which are often developed in pristine, simulated laboratory environments.

“Mistakes can happen” when lab-trained AI is exposed to the complexities of the real world, which includes encountering a variety of materials not present in sterile, artificial environments, explained Dr. Jeffrey Goldstein, a corresponding author of the study.

The findings should serve as a reminder that “AI works incredibly well in the lab but may fall on its face in the real world,” said Goldstein, who is a director of perinatal pathology at Northwestern’s Feinberg School of Medicine.

“Patients should continue to expect that a human expert is the final decider on diagnoses made on biopsies and other tissue samples. Pathologists fear – and AI companies hope – that the computers are coming for our jobs. Not yet,” he said.

Human pathologists are extensively trained to diagnose diseases through laboratory tissue analysis and are experts at identifying tissue contamination, a common lab error where one patient’s tissue samples are mistakenly placed on another's microscope slides.

This type of error proved challenging for the four AI models used in the study. Unlike human pathologists, who recognize contaminants, set them aside, and evaluate the sample as a whole, the AI systems fixated on the contaminated tissue, leading to inaccurate diagnoses.

“If it’s paying attention to tissue contaminants, then it’s paying less attention to the tissue from the patient that is being examined,” Goldstein said.

Researchers tasked three of the AI models with scanning microscope slides of placental tissue to detect blood vessel damage, estimate gestational age, and classify macroscopic lesions. The fourth model was trained to detect prostate cancer in tissue collected from needle biopsies.

According to Northwestern, this was the first study to examine the impact of tissue contamination on AI models. The paper was published in the journal Modern Pathology.

The findings suggest that AI models may be unable to account for biological contaminants, a problem the study's authors say practitioners should address. A well-developed AI could be invaluable in specialized fields like perinatal pathology, which has only 50 to 100 specialists across the entire US.

“I'm actually very excited about how well we were able to build the models and how well they performed before we deliberately broke them for the study,” Goldstein said.

“Our results make me confident that AI evaluations of placenta are doable. We ran into a real-world problem, but hitting that speed bump means we're on the road to better integrating the use of machine learning in pathology.”