TY - JOUR
T1 - Large language models can support generation of standardized discharge summaries – A retrospective study utilizing ChatGPT-4 and electronic health records
AU - Schwieger, Arne
AU - Angst, Katrin
AU - de Bardeci, Mateo
AU - Burrer, Achim
AU - Cathomas, Flurin
AU - Ferrea, Stefano
AU - Grätz, Franziska
AU - Knorr, Marius
AU - Kronenberg, Golo
AU - Spiller, Tobias
AU - Troi, David
AU - Seifritz, Erich
AU - Weber, Samantha
AU - Olbrich, Sebastian
N1 - Publisher Copyright:
© 2024 The Author(s)
PY - 2024/12
Y1 - 2024/12
N2 - Objective: To evaluate whether psychiatric discharge summaries (DS) generated with ChatGPT-4 from electronic health records (EHR) can match the quality of DS written by psychiatric residents. Methods: At a psychiatric primary care hospital, we compared 20 inpatient DS written by residents to versions generated with ChatGPT-4 from pseudonymized residents’ notes in the patients’ EHRs using a standardized prompt. Eight blinded psychiatry specialists rated both versions on a custom Likert scale from 1 to 5 across 15 quality subcategories. The primary outcome was the overall rating difference between the two groups; the secondary outcomes were the rating differences at the level of individual question, case, and rater. Results: Human-written DS were rated significantly higher than AI-written DS (mean ratings: human 3.78, AI 3.12, p < 0.05). They significantly surpassed AI-DS in 12/15 questions and 16/20 cases and were significantly favored by 7/8 raters. For “low expected correction effort”, human DS were rated as 67% favorable, 19% neutral, and 14% unfavorable, whereas AI-DS were rated as 22% favorable, 33% neutral, and 45% unfavorable. Hallucinations were present in 40% of AI-DS, with 37.5% of these deemed highly clinically relevant. Minor content mistakes were found in 30% of AI-DS and 10% of human DS. Raters correctly identified AI-DS with 81% sensitivity and 75% specificity. Discussion: Overall, AI-DS did not match the quality of resident-written DS, but they performed similarly in 20% of cases and were rated favorably for “low expected correction effort” in 22% of cases. AI-DS were weakest in content specificity, distillation of key case information, and coherence, but performed adequately in conciseness, adherence to formalities, relevance of included content, and form. Conclusion: LLM-generated DS show promise as templates for physicians to finalize and could save time in the future.
AB - Objective: To evaluate whether psychiatric discharge summaries (DS) generated with ChatGPT-4 from electronic health records (EHR) can match the quality of DS written by psychiatric residents. Methods: At a psychiatric primary care hospital, we compared 20 inpatient DS written by residents to versions generated with ChatGPT-4 from pseudonymized residents’ notes in the patients’ EHRs using a standardized prompt. Eight blinded psychiatry specialists rated both versions on a custom Likert scale from 1 to 5 across 15 quality subcategories. The primary outcome was the overall rating difference between the two groups; the secondary outcomes were the rating differences at the level of individual question, case, and rater. Results: Human-written DS were rated significantly higher than AI-written DS (mean ratings: human 3.78, AI 3.12, p < 0.05). They significantly surpassed AI-DS in 12/15 questions and 16/20 cases and were significantly favored by 7/8 raters. For “low expected correction effort”, human DS were rated as 67% favorable, 19% neutral, and 14% unfavorable, whereas AI-DS were rated as 22% favorable, 33% neutral, and 45% unfavorable. Hallucinations were present in 40% of AI-DS, with 37.5% of these deemed highly clinically relevant. Minor content mistakes were found in 30% of AI-DS and 10% of human DS. Raters correctly identified AI-DS with 81% sensitivity and 75% specificity. Discussion: Overall, AI-DS did not match the quality of resident-written DS, but they performed similarly in 20% of cases and were rated favorably for “low expected correction effort” in 22% of cases. AI-DS were weakest in content specificity, distillation of key case information, and coherence, but performed adequately in conciseness, adherence to formalities, relevance of included content, and form. Conclusion: LLM-generated DS show promise as templates for physicians to finalize and could save time in the future.
KW - Artificial intelligence
KW - Discharge summaries
KW - Electronic health record
KW - Machine learning
KW - Psychiatric
UR - http://www.scopus.com/inward/record.url?scp=85206902315&partnerID=8YFLogxK
U2 - 10.1016/j.ijmedinf.2024.105654
DO - 10.1016/j.ijmedinf.2024.105654
M3 - Article
AN - SCOPUS:85206902315
SN - 1386-5056
VL - 192
JO - International Journal of Medical Informatics
JF - International Journal of Medical Informatics
M1 - 105654
ER -