The Artificial Intelligence Shoulder Arthroplasty Score: development and validation of a tool for large language model responses to common patient questions regarding total shoulder arthroplasty

AISAS Study Group

Research output: Contribution to journal › Article › peer-review

Abstract

Background and Hypothesis: Research into the ability of artificial intelligence, specifically large language models (LLMs), to respond to patient questions regarding specific orthopedic pathologies continues to grow, yet no tool presently exists to systematically and comprehensively evaluate the quality of LLM responses. The present study seeks to develop and validate the Artificial Intelligence Shoulder Arthroplasty Score (AISAS) as a comprehensive, standardized, and reproducible system for evaluating artificial intelligence responses to patient questions regarding their orthopedic pathology. Methods: The novel scoring tool, AISAS, was developed to include four equally weighted components: accuracy, completeness, clarity, and readability. Fifteen common patient questions on glenohumeral arthritis were posed one at a time to three of the most widely used LLMs: ChatGPT (version 3.5), Claude (version 3.5 Sonnet), and Gemini. Ten shoulder and elbow fellowship-trained orthopedic surgeons used the proposed framework to evaluate each of the 45 responses. Inter-rater reliability was calculated via Cohen's kappa, and rater-score correlation was calculated via Cronbach's alpha. Results: AISAS ratings of Claude and ChatGPT responses showed moderate agreement among raters (κ = 0.55 and 0.43, respectively), while ratings of Gemini responses showed substantial agreement (κ = 0.66). Cronbach's alpha demonstrated excellent correlation for the Gemini ratings (α = 0.91) and acceptable correlation for the Claude and ChatGPT ratings (α = 0.79 and 0.75, respectively). Discussion: AISAS enables systematic assessment of both the overall quality of an LLM response and the individual components that may vary in quality, allowing easy comparison across LLM responses. It also offers a tool for tracking the progress of LLMs in responding to patient questions. Such a framework can identify areas for improvement, help optimize LLMs as a patient tool, and allow physicians to better direct patients on how to use these tools. Conclusion: The AISAS is a comprehensive and reproducible tool for evaluating LLM responses, with high levels of inter-rater reliability. AISAS can help evaluate responses to patient questions and guide the growth and improvement of LLMs for use in the orthopedic setting.
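The two statistics named in the Methods can be illustrated with a minimal sketch. This is not the study's analysis code. The function names and the toy ratings are hypothetical, and the abstract does not state how kappa was aggregated across the ten raters, so only the standard two-rater form of Cohen's kappa is shown. Cronbach's alpha treats each rater as an "item" and each rated response as an observation.

```python
from collections import Counter
from statistics import variance


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a categorical scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def cronbachs_alpha(scores):
    """Cronbach's alpha; `scores` holds one row per rated response, one column per rater."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two raters and two rated responses")
    # Variance of each rater's scores across all responses.
    rater_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the per-response total score summed over raters.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)


# Hypothetical toy ratings (not data from the study).
print(cohens_kappa([1, 1, 2, 2], [1, 1, 2, 2]))
print(cronbachs_alpha([[1, 1], [2, 2], [3, 3]]))
```

With more than two raters, a multi-rater statistic or an average of pairwise kappas would be needed; which approach the authors used is not specified in the abstract.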

Original language: English
Journal: Seminars in Arthroplasty JSES
State: Accepted/In press - 2025

Keywords

  • AISAS
  • Artificial intelligence
  • ChatGPT
  • Claude
  • Gemini
  • Large language model
  • Level III
  • rTSA
  • TSA
