Abstract
Background: A critical gap persists in the diagnosis and management of pulmonary embolism (PE) despite contemporary medical advances, necessitating large-scale electronic health record analysis through hybrid natural language processing (NLP) methodologies to elucidate its pathophysiology and optimize clinical interventions.

Objectives: The purpose of this study was to develop and validate a hybrid NLP pipeline combining machine learning (ML) and rule-based techniques for accurate identification of PE cases from large-scale radiology report data sets.

Methods: The hybrid NLP pipeline consisted of an ML algorithm trained on 1,040 computed tomography pulmonary angiogram reports from Brigham and Women's Hospital, with 80% used for training and 20% for testing. The pipeline was then validated on a larger data set of 49,611 radiology reports from the Mass General Brigham (MGB) health care system. Performance was evaluated using accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.

Results: The ML model achieved an accuracy of 91% and area under the curve of 0.90 on the Brigham and Women's Hospital testing data set. When deployed on the larger MGB data set, the model's accuracy decreased to 85%. Iterative application of the rule-based algorithm improved the model's accuracy to 94.8%, sensitivity to 96.4%, specificity to 93.2%, positive predictive value to 93.0%, and negative predictive value to 96.5% on the MGB data set.

Conclusions: The hybrid NLP approach required less training data than pure ML models and demonstrated high performance across diverse health care settings. A hybrid NLP pipeline can efficiently and accurately identify PE cases from radiology reports and could be deployed for broader PE-focused research and clinical surveillance if similarly validated in external data sets.
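As an illustration only, the following minimal Python sketch shows how a rule-based layer might correct an ML classifier's labels and how the five reported metrics are computed from confusion-matrix counts. The abstract does not describe the study's actual rules or code, so the negation patterns, the function names `apply_rules` and `evaluate`, and the sample reports are all hypothetical assumptions.

```python
import math
import re
from typing import Dict, List

# Hypothetical negation patterns; the study's actual rules are not given in the abstract.
NEGATION_PATTERNS = [
    re.compile(r"\bno (evidence of |acute )?pulmonary embol", re.IGNORECASE),
    re.compile(r"\bnegative for pulmonary embol", re.IGNORECASE),
]


def apply_rules(report: str, ml_label: int) -> int:
    """Override a positive ML label (1) when an explicit negation phrase is found."""
    if ml_label == 1 and any(p.search(report) for p in NEGATION_PATTERNS):
        return 0
    return ml_label


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, PPV, and NPV for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
    }


# Made-up example reports, reference labels, and ML outputs (1 = PE present).
reports = [
    "No evidence of pulmonary embolism.",
    "Acute pulmonary embolism in the right lower lobe.",
    "Negative for pulmonary embolus.",
    "Filling defect consistent with pulmonary embolism.",
]
truth = [0, 1, 0, 1]
ml_labels = [1, 1, 0, 1]  # the first report is a hypothetical ML false positive

hybrid_labels = [apply_rules(r, y) for r, y in zip(reports, ml_labels)]
print("ML only:", evaluate(truth, ml_labels))
print("Hybrid: ", evaluate(truth, hybrid_labels))
```

In this toy setup the rule layer only removes false positives; a real pipeline would presumably also need rules for missed positives, uncertainty phrases, and chronic or historical PE mentions.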
| Original language | English |
|---|---|
| Article number | 101845 |
| Journal | JACC: Advances |
| Volume | 4 |
| Issue number | 11P2 |
| State | Published - Nov 2025 |
| Externally published | Yes |
Keywords
- automated PE identification
- clinical text mining
- hybrid NLP pipeline
- machine learning radiology classification
- medical document classification
- pulmonary embolism detection
Fingerprint
Dive into the research topics of 'Leveraging Hybrid Natural Language Processing Techniques for Large-Scale Pulmonary Embolism Identification'. Together they form a unique fingerprint.