TY - JOUR
T1 - The Artificial Intelligence Shoulder Arthroplasty Score
T2 - development and validation of a tool for large language model responses to common patient questions regarding total shoulder arthroplasty
AU - AISAS Study Group
AU - Fiedler, Benjamin
AU - Hauck, Jeffrey
AU - Wilhelm, Chris
AU - LeVasseur, Matt
AU - Leinweber, Kathleen
AU - Kurowicki, Jennifer
AU - Haase, Lucas
AU - Nieboer, Micah
AU - Boubekri, Amir
AU - Hachadorian, Mike
AU - Meyerson, Lucas
AU - Edwards, T. Bradley
AU - Elkousy, Hussein A.
AU - Cagle, Paul J.
AU - Phillips, Todd
N1 - Publisher Copyright: © 2025 American Shoulder and Elbow Surgeons
PY - 2025
Y1 - 2025
N2 - Background and Hypothesis: While research into the ability of artificial intelligence, specifically large language models (LLMs), to respond to patient questions regarding specific orthopedic pathologies continues to grow, no tool presently exists to systematically and comprehensively evaluate the quality of LLM responses. The present study seeks to develop and validate the Artificial Intelligence Shoulder Arthroplasty Score (AISAS) to create a comprehensive, standardized, and reproducible system for evaluating artificial intelligence responses to patient questions regarding their orthopedic pathology. Methods: The novel scoring tool, AISAS, was developed to include four equally weighted components related to accuracy, completeness, clarity, and readability. Fifteen common patient questions on glenohumeral arthritis were asked one by one to three of the most used LLMs: ChatGPT (version 3.5), Claude (version 3.5 Sonnet), and Gemini. Ten shoulder and elbow fellowship-trained orthopedic surgeons used the proposed framework to evaluate each of the 45 responses. Inter-rater reliability was calculated via Cohen's kappa and rater-score correlation was calculated via Cronbach's alpha. Results: AISAS use for Claude and ChatGPT produced moderate agreement (κ = 0.55 and 0.43, respectively), while Gemini produced substantial agreement among raters (κ = 0.66). Cronbach's alpha scores demonstrated excellent correlation of the Gemini ratings (α = 0.91) and acceptable correlation of the Claude and ChatGPT ratings (α = 0.79 and 0.75, respectively). Discussion and Conclusion: AISAS enables systematic assessment of the overall quality of an LLM response, as well as of its individual components, which may vary in quality, allowing easy comparison between LLM responses. Furthermore, it offers a tool to track the progress of LLMs in their ability to respond to patient questions. Establishing such a framework will help optimize LLMs as a patient tool, identify areas for improvement, and allow physicians to better direct patients on how to use these tools. Conclusion: The AISAS is a comprehensive and reproducible tool for evaluating LLM responses, with high levels of inter-rater reliability. AISAS use can help to evaluate responses to patient questions to guide growth and improvement of LLMs for use in the orthopedic setting.
AB - Background and Hypothesis: While research into the ability of artificial intelligence, specifically large language models (LLMs), to respond to patient questions regarding specific orthopedic pathologies continues to grow, no tool presently exists to systematically and comprehensively evaluate the quality of LLM responses. The present study seeks to develop and validate the Artificial Intelligence Shoulder Arthroplasty Score (AISAS) to create a comprehensive, standardized, and reproducible system for evaluating artificial intelligence responses to patient questions regarding their orthopedic pathology. Methods: The novel scoring tool, AISAS, was developed to include four equally weighted components related to accuracy, completeness, clarity, and readability. Fifteen common patient questions on glenohumeral arthritis were asked one by one to three of the most used LLMs: ChatGPT (version 3.5), Claude (version 3.5 Sonnet), and Gemini. Ten shoulder and elbow fellowship-trained orthopedic surgeons used the proposed framework to evaluate each of the 45 responses. Inter-rater reliability was calculated via Cohen's kappa and rater-score correlation was calculated via Cronbach's alpha. Results: AISAS use for Claude and ChatGPT produced moderate agreement (κ = 0.55 and 0.43, respectively), while Gemini produced substantial agreement among raters (κ = 0.66). Cronbach's alpha scores demonstrated excellent correlation of the Gemini ratings (α = 0.91) and acceptable correlation of the Claude and ChatGPT ratings (α = 0.79 and 0.75, respectively). Discussion and Conclusion: AISAS enables systematic assessment of the overall quality of an LLM response, as well as of its individual components, which may vary in quality, allowing easy comparison between LLM responses. Furthermore, it offers a tool to track the progress of LLMs in their ability to respond to patient questions. Establishing such a framework will help optimize LLMs as a patient tool, identify areas for improvement, and allow physicians to better direct patients on how to use these tools. Conclusion: The AISAS is a comprehensive and reproducible tool for evaluating LLM responses, with high levels of inter-rater reliability. AISAS use can help to evaluate responses to patient questions to guide growth and improvement of LLMs for use in the orthopedic setting.
KW - AISAS
KW - Artificial intelligence
KW - ChatGPT
KW - Claude
KW - Gemini
KW - Large language model
KW - Level III
KW - rTSA
KW - TSA
UR - http://www.scopus.com/inward/record.url?scp=105001409890&partnerID=8YFLogxK
U2 - 10.1053/j.sart.2025.02.003
DO - 10.1053/j.sart.2025.02.003
M3 - Article
AN - SCOPUS:105001409890
SN - 1045-4527
JO - Seminars in Arthroplasty JSES
JF - Seminars in Arthroplasty JSES
ER -