The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease

Bright Huo, Elisa Calabrese, Patricia Sylla, Sunjay Kumar, Romeo C. Ignacio, Rodolfo Oviedo, Imran Hassan, Bethany J. Slater, Andreas Kaiser, Danielle S. Walsh, Wesley Vosburg

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Background: Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD). Methods: Nine patient cases were created based on key questions (KQs) addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Chatbot accuracy was measured as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported as counts and percentages. Results: For an adult patient, surgeons were given accurate recommendations, according to the SAGES guidelines, for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity. For a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity. Conclusions: Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when they are used for advice on the surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.

Original language: English
Pages (from-to): 2320-2330
Number of pages: 11
Journal: Surgical Endoscopy and Other Interventional Techniques
Volume: 38
Issue number: 5
State: Published - May 2024

Keywords

  • ChatGPT
  • GERD
  • Generative artificial intelligence
  • Guidelines
  • Large language models
  • Natural language processing
  • Surgery
