TY - GEN
T1 - Machine Learning Based Prediction of Incident Cases of Crohn’s Disease Using Electronic Health Records from a Large Integrated Health System
AU - Hugo, Julian
AU - Ibing, Susanne
AU - Borchert, Florian
AU - Sachs, Jan Philipp
AU - Cho, Judy
AU - Ungaro, Ryan C.
AU - Böttinger, Erwin P.
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - Early diagnosis and treatment of Crohn’s Disease (CD) are associated with a decreased risk of surgery and complications. However, diagnostic delay is common in clinical practice. To better understand CD risk factors and disease indicators, we identified incident CD patients and controls within the Mount Sinai Data Warehouse (MSDW) and developed machine learning (ML) models for disease prediction. Incident CD cases were defined based on CD diagnosis codes, medication prescriptions, healthcare utilization before the first CD diagnosis, and clinical text, using structured Electronic Health Records (EHR) and clinical notes from MSDW. Cases were matched to controls based on sex, age, and healthcare utilization. In total, we identified 249 incident CD cases and 1,242 matched controls in MSDW. We excluded data from the 180 days preceding the first CD diagnosis from cohort characterization and predictive modeling. Clinical text was encoded by term frequency-inverse document frequency, and structured EHR features were aggregated. We compared three ML models: Logistic Regression, Random Forest, and XGBoost. Gastrointestinal symptoms, for instance anal fistula and irritable bowel syndrome, were significantly overrepresented in cases at least 180 days before the first CD code (prevalence of 33% in cases compared to 12% in controls). XGBoost was the best-performing model for predicting CD, with an AUROC of 0.72 based on structured EHR data only. Features with the highest predictive importance from structured EHR data included anemia lab values and race (white). The results suggest that ML algorithms could enable earlier diagnosis of CD and reduce the diagnostic delay.
KW - Crohn disease
KW - Diagnostic delay
KW - Electronic health records
UR - https://www.scopus.com/pages/publications/85163981817
U2 - 10.1007/978-3-031-34344-5_35
DO - 10.1007/978-3-031-34344-5_35
M3 - Conference contribution
AN - SCOPUS:85163981817
SN - 9783031343438
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 293
EP - 302
BT - Artificial Intelligence in Medicine - 21st International Conference on Artificial Intelligence in Medicine, AIME 2023, Proceedings
A2 - Juarez, Jose M.
A2 - Marcos, Mar
A2 - Stiglic, Gregor
A2 - Tucker, Allan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st International Conference on Artificial Intelligence in Medicine, AIME 2023
Y2 - 12 June 2023 through 15 June 2023
ER -