TY - JOUR
T1 - Automated disease cohort selection using word embeddings from Electronic Health Records
AU - Glicksberg, Benjamin S.
AU - Miotto, Riccardo
AU - Johnson, Kipp W.
AU - Shameer, Khader
AU - Li, Li
AU - Chen, Rong
AU - Dudley, Joel T.
N1 - Funding Information:
We would like to thank the Mount Sinai Data Warehouse for facilitating data accessibility and the Mount Sinai Scientific Computing team for infrastructural support. This study was funded by the following grants to JTD: National Institutes of Health (NIH), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) R01-DK098242-03 and the Harris Center for Precision Wellness.
Publisher Copyright:
© 2017 The Authors.
PY - 2018
Y1 - 2018
N2 - Accurate and robust cohort definition is critical to biomedical discovery using Electronic Health Records (EHR). As with prospective study designs, high-quality EHR-based research requires rigorous selection criteria to designate case/control status particular to each disease. Electronic phenotyping algorithms, which are manually built and validated per disease, have been successful in filling this need. However, these approaches are time-consuming, so algorithms have been developed for only a relatively small number of diseases. Methodologies that automatically learn features from EHRs have been used for cohort selection as well. To date, however, there has been no systematic analysis of how these methods perform against current gold standards. Accordingly, this paper compares the performance of a state-of-the-art automated feature learning method for extracting research-grade cohorts for five diseases against their established electronic phenotyping algorithms. In particular, we use word2vec to create unsupervised embeddings of the phenotype space within an EHR system. Using medical concepts as a query, we then rank patients by their proximity in the embedding space and automatically extract putative disease cohorts via a distance threshold. Experimental evaluation shows promising results, with an average F-score of 0.57 and AUC-ROC of 0.98. However, results varied considerably between diseases, necessitating further investigation and/or phenotype-specific refinement of the approach before it can be readily deployed across all diseases.
KW - Automated cohort selection
KW - Electronic Health Records
KW - Electronic phenotyping algorithms
KW - Feature learning
KW - Vector-based representations
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85048460869&partnerID=8YFLogxK
U2 - 10.1142/9789813235533_0014
DO - 10.1142/9789813235533_0014
M3 - Conference article
C2 - 29218877
AN - SCOPUS:85048460869
SN - 2335-6936
VL - 0
SP - 145
EP - 156
JO - Pacific Symposium on Biocomputing
JF - Pacific Symposium on Biocomputing
T2 - 23rd Pacific Symposium on Biocomputing, PSB 2018
Y2 - 3 January 2018 through 7 January 2018
ER -