
How good are AI doctors at medical conversations?
Artificial intelligence tools like ChatGPT have been praised for their promise to ease the workload of clinicians by triaging patients, capturing medical histories and even providing preliminary diagnoses.
These tools, known as large language models, are already used by patients to understand their symptoms and medical test results.
But while these AI models perform well on standardized medical tests, how do they perform in situations that more closely resemble the real world?
Not so well, according to the results of a new study led by researchers at Harvard Medical School and Stanford University.
Their analysis was published Jan. 2 in Nature Medicine. The researchers designed an evaluation framework, or test, called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large language models to see how they performed in settings that closely mimic actual interactions with patients.
All four large language models performed well on medical exam-type questions, but their performance deteriorated when engaging in conversations that more closely resembled real-world interactions.
The researchers say this gap highlights a twofold need: first, to create more realistic evaluations that better measure how well clinical AI models will work in the real world, and second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.
The research team noted that evaluation tools like CRAFT-MD can not only assess the real-world capabilities of AI models more accurately but also help optimize their performance in clinical settings.
“Our work reveals a surprising paradox: while these AI models perform well on medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said Pranav Rajpurkar, the study’s senior author and assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations, the need to ask the right questions at the right time, piece together scattered information, and reason through symptoms, poses unique challenges that go far beyond answering multiple-choice questions. When we move from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”
Better tests to see how AI performs in the real world
Currently, developers test the efficacy of AI models by asking them to answer multiple-choice medical questions, often taken from national exams for medical graduates or from tests given to residents as part of certification.
“This approach assumes that all relevant information is presented clearly and concisely, often using medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is much messier,” said doctoral student Shreya Johri, co-first author of the study. “We need a testing framework that better reflects reality and is therefore better at predicting how well a model will perform.”
CRAFT-MD is designed to be such a more realistic instrument.
To simulate real-world interactions, CRAFT-MD evaluates how well large language models can collect information about symptoms, medications, and family history and then make a diagnosis. An AI agent poses as the patient, answering questions in a conversational, natural style. Another AI agent grades the accuracy of the final diagnosis rendered by the large language model. Human experts then evaluate the outcomes of each encounter, including the model’s ability to gather relevant patient information, its diagnostic accuracy when presented with scattered information, and its adherence to prompts.
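To make that setup concrete, here is a minimal sketch in Python of how such a multi-agent evaluation loop could be wired together. The Agent class, the prompts, and the chat() stub are hypothetical stand-ins for illustration, not the authors’ actual implementation or any particular model API.

```python
# Minimal sketch of a CRAFT-MD-style multi-agent evaluation loop.
# All names, prompts, and the chat() stub are hypothetical placeholders.

from dataclasses import dataclass, field


def chat(system_prompt: str, history: list) -> str:
    """Placeholder for a call to whichever language model backend is under test."""
    raise NotImplementedError("plug in an actual model client here")


@dataclass
class Agent:
    """Wraps a language model behind a fixed role-setting system prompt."""
    system_prompt: str
    history: list = field(default_factory=list)

    def reply(self, message: str) -> str:
        self.history.append(("user", message))
        answer = chat(self.system_prompt, self.history)
        self.history.append(("assistant", answer))
        return answer


def evaluate_case(vignette: str, true_diagnosis: str, max_turns: int = 10) -> dict:
    """Run one simulated doctor-patient encounter and grade the final diagnosis."""
    clinician = Agent("You are a clinician. Ask one question at a time, then give a final diagnosis.")
    patient = Agent(f"You are a patient. Answer naturally, using only these facts: {vignette}")
    grader = Agent("Compare a candidate diagnosis to the ground truth. Reply CORRECT or INCORRECT.")

    question = clinician.reply("A new patient has arrived. Begin the interview.")
    for _ in range(max_turns):
        answer = patient.reply(question)     # patient-AI responds conversationally
        question = clinician.reply(answer)   # model under test asks more or concludes
        if "final diagnosis" in question.lower():
            break

    verdict = grader.reply(f"Ground truth: {true_diagnosis}\nCandidate: {question}")
    return {"transcript": clinician.history, "verdict": verdict}
```

In practice, the grader agent’s verdicts would be aggregated over thousands of such simulated encounters before human experts review the outcomes.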
Researchers used CRAFT-MD to test the performance of four artificial intelligence models, including proprietary, commercial, and open-source models, on 2,000 clinical cases covering conditions common in primary care and across 12 medical specialties.
All of the AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information provided by patients. That, in turn, compromised their ability to take histories and render appropriate diagnoses. For example, the models often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The models’ accuracy declined when they were given open-ended information rather than multiple-choice answers. The models also performed worse when engaged in back-and-forth exchanges, as most real-world conversations are, than when working from summarized conversations.
Recommendations for optimizing real-world performance of artificial intelligence
Based on these findings, the team provides a set of recommendations for AI developers designing AI models and regulators responsible for evaluating and approving these tools.
These include:
- Use conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools (see the sketch after this list)
- Evaluate a model’s ability to ask the right questions and extract the most important information
- Design models that track multiple conversations and integrate information from them
- Design AI models capable of integrating textual data (notes from conversations) with non-textual data (images, electrocardiograms)
- Design more sophisticated AI agents that can interpret nonverbal cues such as facial expression, tone of voice, and body language
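As an illustration of the first recommendation, the hypothetical snippet below contrasts a board-exam-style multiple-choice item with the kind of open-ended, turn-by-turn exchange the recommendations call for. The case details, options, and dialogue are invented for illustration and are not drawn from the study’s dataset.

```python
# Invented example contrasting the two evaluation formats; not taken from the study.

# Board-exam style: everything is pre-summarized and the answer space is fixed.
multiple_choice_item = {
    "vignette": "A 45-year-old presents with an itchy, scaly rash on both elbows for 3 weeks.",
    "question": "Which of the following is the most likely diagnosis?",
    "options": ["A. Psoriasis", "B. Atopic dermatitis", "C. Tinea corporis", "D. Lichen planus"],
}

# Open-ended conversational style: the model must decide what to ask, when to ask it,
# and when it has enough information to commit to a diagnosis.
conversational_turns = [
    {"role": "patient", "text": "I've had an itchy rash on my elbows for a few weeks."},
    {"role": "model",   "text": "Is the rash scaly or blistered? Does anyone in your family have skin problems?"},
    {"role": "patient", "text": "It's scaly, and my mother has something similar."},
    # ...the exchange continues until the model offers a final diagnosis.
]
```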
Additionally, the researchers suggest that evaluations incorporate both AI agents and human experts, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. By comparison, a human-based approach would require extensive recruitment and an estimated 500 hours for patient simulations (about 3 minutes per conversation) plus roughly 650 hours for expert assessments (about 4 minutes per conversation). Using AI evaluators as the first line also avoids the risk of exposing real patients to unvalidated AI tools.
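For readers who want to check the workload comparison, a short sketch of the arithmetic behind those estimates, using the approximate per-conversation times quoted above, follows.

```python
# Arithmetic check of the workload figures cited above; the per-conversation
# minutes are the approximations given in the article.
n_conversations = 10_000

human_simulation_hours = n_conversations * 3 / 60  # ~3 minutes per simulated conversation
human_grading_hours = n_conversations * 4 / 60     # ~4 minutes per expert assessment

print(f"Human patient simulation: ~{human_simulation_hours:.0f} hours")  # -> ~500 hours
print(f"Human expert assessment:  ~{human_grading_hours:.0f} hours")     # -> ~667 hours (~650 cited)
print("CRAFT-MD: 48-72 hours of automated processing, plus 15-16 hours of expert evaluation")
```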
The researchers said they expect CRAFT-MD itself will be regularly updated and optimized to incorporate improved patient AI models.
“As a physician-scientist, I am interested in artificial intelligence models that can effectively and ethically enhance clinical practice,” said study co-senior author Roxana Daneshjou, assistant professor of biomedical data science and dermatology at Stanford. “CRAFT-MD creates a framework that more closely matches real-world interactions, so it helps move the field forward when testing the performance of AI models in healthcare.”
Authorship, Funding, Disclosure
Additional authors include Jaehwan Jeong and Hong-Yu Zhou, Harvard Medical School; Benjamin A. Tran, Georgetown University; Daniel I. Schlessinger, Northwestern University; Shannon Wongvibulsin, University of California, Los Angeles; Leandra A. Barnes, Zhuo Ran Cai, and David Kim, Stanford University; and Eliezer M. Van Allen, Dana-Farber Cancer Institute.
This work was supported by an HMS Dean Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. SJ received further support through the IIE Quad Fellowship.
Daneshjou reports receiving personal fees from DWA, Pfizer, L’Oreal, and VisualDx; stock options from MDAlgorithms and Revea; and has a TrueImage patent pending outside the submitted work. Schlessinger is a co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant to Appiell Inc. and LuminDx, and an investigator for AbbVie and Sanofi. Van Allen serves as a consultant to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institutes for BioMedical Research, and Serinus Bio. Van Allen receives research support from Novartis, BMS, Sanofi, and NextPoint. Van Allen holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio, and Syapse. Van Allen has filed institutional patents on chromatin mutations and immunotherapy response and on methods for clinical interpretation, intermittently serves as a consultant on patent matters to Foley Hoag, and serves on the editorial board of Science Advances.