TY - GEN
T1 - Supervised pretraining through contrastive categorical positive samplings to improve COVID-19 mortality prediction
AU - Wanyan, Tingyi
AU - Lin, Mingquan
AU - Klang, Eyal
AU - Menon, Kartikeya M.
AU - Gulamali, Faris F.
AU - Azad, Ariful
AU - Zhang, Yiye
AU - Ding, Ying
AU - Wang, Zhangyang
AU - Wang, Fei
AU - Glicksberg, Benjamin
AU - Peng, Yifan
N1 - Funding Information:
This work is supported by the National Library of Medicine under Award No. 4R00LM013001.
Publisher Copyright:
© 2022 ACM.
PY - 2022/8/7
Y1 - 2022/8/7
N2 - Clinical EHR data is naturally heterogeneous, containing abundant sub-phenotypes. Such diversity creates challenges for outcome prediction with machine learning models because it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the performance benefit of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality on real-world COVID-19 EHR data covering over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC of 0.872, outperforming alternative pre-training models and traditional machine learning methods. Additionally, our method performs substantially better when the training set is small (345 training instances).
AB - Clinical EHR data is naturally heterogeneous, containing abundant sub-phenotypes. Such diversity creates challenges for outcome prediction with machine learning models because it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the performance benefit of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality on real-world COVID-19 EHR data covering over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC of 0.872, outperforming alternative pre-training models and traditional machine learning methods. Additionally, our method performs substantially better when the training set is small (345 training instances).
KW - Intra-class variance
KW - Mortality prediction
KW - Pre-training
KW - Self-supervised learning
KW - Sub-phenotype
KW - Supervised contrastive learning
UR - http://www.scopus.com/inward/record.url?scp=85136490570&partnerID=8YFLogxK
U2 - 10.1145/3535508.3545541
DO - 10.1145/3535508.3545541
M3 - Conference contribution
AN - SCOPUS:85136490570
T3 - Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2022
BT - Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2022
PB - Association for Computing Machinery, Inc
Y2 - 7 August 2022 through 8 August 2022
ER -