Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance
Department
Internal Medicine
Additional Department
Gastroenterology
Document Type
Article
Publication Title
The American Journal of Gastroenterology
Abstract
INTRODUCTION: Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT 4 enhanced with retrieval-augmented generation (RAG) technology.
METHODS: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of GPT-4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by 4 independent investigators to ensure impartiality.
RESULTS: ChatGPT 4, augmented with RAG, demonstrated superior performance compared with others, consistently scoring the highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.
DISCUSSION: The study highlights the superior performance of ChatGPT 4 + RAG compared with other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves the clarity, relevance, and accuracy of responses, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.
First Page
2081
Last Page
2085
DOI
10.14309/ajg.0000000000003255
Volume
120
Issue
9
Publication Date
9-1-2025
Medical Subject Headings
Humans; Liver Failure, Acute (diagnosis, therapy); Artificial Intelligence; Gastroenterology
PubMed ID
39688962
Recommended Citation
Malik, S., Frey, L. J., Gutman, J., Mushtaq, A., Warraich, F., & Qureshi, K. (2025). Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance. The American Journal of Gastroenterology, 120(9), 2081-2085. https://doi.org/10.14309/ajg.0000000000003255