Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance

Department

Internal Medicine

Additional Department

Gastroenterology

Document Type

Article

Publication Title

The American Journal of Gastroenterology

Abstract

INTRODUCTION: Recent advancements in artificial intelligence (AI), particularly through the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on their response accuracy, clarity, and relevance to queries concerning acute liver failure (ALF). We subsequently compare these results with ChatGPT 4 enhanced with retrieval-augmented generation (RAG) technology.

METHODS: Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions or clinical scenarios to explore LLMs' ability to handle different clinical questions. Using the "New Chat" functionality, each query was processed individually across the models to reduce any bias. Additionally, we employed the RAG functionality of ChatGPT 4, which integrates external sources as references to ground the results. All responses were evaluated on a Likert scale from 1 to 5 for accuracy, clarity, and relevance by 4 independent investigators to ensure impartiality.

RESULTS: ChatGPT 4, augmented with RAG, demonstrated superior performance compared with the other models, consistently scoring the highest (4.70, 4.89, 4.78) across all 3 domains. ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.04 in clarity, 3.6 in relevance, and 3.65 in accuracy. Meanwhile, BARD and COPILOT exhibited lower performance levels; BARD recorded scores of 2.01 in accuracy and 3.03 in relevance, while COPILOT obtained 2.26 in accuracy and 3.12 in relevance.

DISCUSSION: The study highlights ChatGPT 4 + RAG's superior performance compared with other LLMs. By integrating RAG with LLMs, the system combines generative language skills with accurate, up-to-date information. This improves response clarity, relevance, and accuracy, making them more effective in healthcare. However, AI models must continually evolve and align with medical practices for successful healthcare integration.

First Page

2081

Last Page

2085

DOI

10.14309/ajg.0000000000003255

Volume

120

Issue

9

Publication Date

9-1-2025

Medical Subject Headings

Humans; Liver Failure, Acute (diagnosis, therapy); Artificial Intelligence; Gastroenterology

PubMed ID

39688962
