Accurate and robust cohort definition is critical to biomedical discovery using Electronic Health Records (EHRs). As in prospective study designs, high-quality EHR-based research requires rigorous selection criteria to designate case/control status for each disease. Electronic phenotyping algorithms, which are manually built and validated per disease, have been successful in filling this need. However, these approaches are time-consuming, so algorithms have been developed for only a relatively small number of diseases. Methodologies that automatically learn features from EHRs have also been used for cohort selection. To date, however, there has been no systematic analysis of how these methods perform against current gold standards. Accordingly, this paper compares the performance of a state-of-the-art automated feature learning method for extracting research-grade cohorts for five diseases against their established electronic phenotyping algorithms. In particular, we use word2vec to create unsupervised embeddings of the phenotype space within an EHR system. Using medical concepts as a query, we then rank patients by their proximity in the embedding space and automatically extract putative disease cohorts via a distance threshold. Experimental evaluation shows promising results, with an average F-score of 0.57 and AUC-ROC of 0.98. However, results varied considerably between diseases, so further investigation and/or phenotype-specific refinement of the approach is needed before it can be readily deployed across all diseases.
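The ranking-and-threshold step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how patients are represented or which distance is used, so the sketch assumes a patient vector is the mean of that patient's concept vectors and that proximity is cosine distance. The toy two-dimensional vectors stand in for word2vec output, and all names (`select_cohort`, `concept_vectors`, `patient_concepts`, `max_distance`) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_cohort(patient_concepts, concept_vectors, query_concept, max_distance):
    """Rank patients by cosine distance to a query concept, then threshold.

    Assumption: a patient is represented by the mean of their concept vectors.
    Returns (ranked, cohort): ranked is a list of (patient_id, distance) sorted
    nearest first; cohort lists patient ids with distance <= max_distance.
    """
    query = concept_vectors[query_concept]
    ranked = []
    for patient_id, concepts in patient_concepts.items():
        known = [concept_vectors[c] for c in concepts if c in concept_vectors]
        if not known:
            continue  # no embedded concepts, so the patient cannot be placed
        distance = 1.0 - cosine_similarity(mean_vector(known), query)
        ranked.append((patient_id, distance))
    ranked.sort(key=lambda pair: pair[1])
    cohort = [patient_id for patient_id, d in ranked if d <= max_distance]
    return ranked, cohort


# Toy embeddings (hand-written placeholders, not learned from any EHR).
concept_vectors = {
    "diabetes": [1.0, 0.0],
    "metformin": [0.9, 0.1],
    "fracture": [0.0, 1.0],
    "cast": [0.1, 0.9],
}
patient_concepts = {
    "p1": ["diabetes", "metformin"],
    "p2": ["fracture", "cast"],
    "p3": ["metformin", "cast"],
}

ranked, cohort = select_cohort(
    patient_concepts, concept_vectors, "diabetes", max_distance=0.1
)
```

In this toy setting the threshold controls the trade-off between including borderline patients and keeping the cohort clean; in practice it would presumably need tuning per disease, which is consistent with the variation between diseases that the abstract reports.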

Original language: English
Pages (from-to): 145-156
Number of pages: 12
Journal: Pacific Symposium on Biocomputing
Issue number: 212669
State: Published - 2018
Event: 23rd Pacific Symposium on Biocomputing, PSB 2018 - Kohala Coast, United States
Duration: 3 Jan 2018 - 7 Jan 2018


  • Automated cohort selection
  • Electronic Health Records
  • Electronic phenotyping algorithms
  • Feature learning
  • Vector-based representations
  • Word embedding


Dive into the research topics of 'Automated disease cohort selection using word embeddings from Electronic Health Records'. Together they form a unique fingerprint.
