TY - JOUR
T1 - Closing the gap between open source and commercial large language models for medical evidence summarization
AU - Zhang, Gongbo
AU - Jin, Qiao
AU - Zhou, Yiliang
AU - Wang, Song
AU - Idnay, Betina
AU - Luo, Yiming
AU - Park, Elizabeth
AU - Nestor, Jordan G.
AU - Spotnitz, Matthew E.
AU - Soroush, Ali
AU - Campion, Thomas R.
AU - Lu, Zhiyong
AU - Weng, Chunhua
AU - Peng, Yifan
N1 - Publisher Copyright:
© The Author(s) 2024.
PY - 2024/12
Y1 - 2024/12
AB - Large language models (LLMs) hold great promise for summarizing medical evidence. Most recent studies have focused on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of that of proprietary models. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Using MedReview, a benchmark dataset of 8161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs: PRIMERA, LongT5, and Llama-2. Overall, the performance of all open-source models improved after fine-tuning. The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes demonstrated superior performance compared to larger zero-shot models. These trends of improvement were manifested in both a human evaluation and a larger-scale GPT-4-simulated evaluation.
UR - http://www.scopus.com/inward/record.url?scp=85203312261&partnerID=8YFLogxK
DO - 10.1038/s41746-024-01239-w
M3 - Article
AN - SCOPUS:85203312261
SN - 2398-6352
VL - 7
JO - npj Digital Medicine
JF - npj Digital Medicine
IS - 1
M1 - 239
ER -