Leveraging Hybrid Natural Language Processing Techniques for Large-Scale Pulmonary Embolism Identification

  • Syed Moin Hassan
  • Ruben Mylvaganam
  • Tekle Didebulidze
  • Malaika Khalid
  • Pietro Nardelli
  • Gregory Piazza
  • Ruben San Jose Estepar
  • Michael J. Cuttica
  • Shelsey Johnson
  • Nam Dao
  • Raul San Jose Estepar
  • George R. Washko
  • Farbod Nicholas Rahaghi

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Background: A critical gap persists in the diagnosis and management of pulmonary embolism (PE) despite contemporary medical advances, necessitating large-scale electronic health record analysis through hybrid natural language processing (NLP) methodologies to elucidate its pathophysiology and optimize clinical interventions.

Objectives: The purpose of this study was to develop and validate a hybrid NLP pipeline combining machine learning (ML) and rule-based techniques for accurate identification of PE cases from large-scale radiology report data sets.

Methods: The hybrid NLP pipeline consisted of a ML algorithm trained on 1,040 computed tomography pulmonary angiogram reports from Brigham and Women's Hospital, with 80% used for training and 20% for testing. The pipeline was then validated on a larger data set of 49,611 radiology reports from the Mass General Brigham (MGB) health care system. Performance was evaluated using accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.

Results: The ML model achieved an accuracy of 91% and area under the curve of 0.90 on the Brigham and Women's Hospital testing data set. When deployed on the larger MGB data set, the model's accuracy decreased to 85%. Iterative application of the rule-based algorithm improved the model's accuracy to 94.8%, sensitivity to 96.4%, specificity to 93.2%, positive predictive value to 93.0%, and negative predictive value to 96.5% on the MGB data set.

Conclusions: The hybrid NLP approach required less training data than pure ML models and demonstrated high performance across diverse health care settings. A hybrid NLP pipeline can efficiently and accurately identify PE cases from radiology reports and could be deployed for broader PE-focused research and clinical surveillance if similarly validated in external data sets.
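
For readers less familiar with the five performance measures named in the Methods, the following is a minimal sketch of how each is derived from the four cells of a binary confusion matrix. It is not the authors' evaluation code. The names `ConfusionCounts` and `diagnostic_metrics` are hypothetical, and the counts in the usage example are invented for illustration; they are not the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 confusion matrix for a binary PE classifier."""
    tp: int  # PE present, classified positive
    fp: int  # PE absent, classified positive
    tn: int  # PE absent, classified negative
    fn: int  # PE present, classified negative


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return accuracy, sensitivity, specificity, PPV, and NPV.

    Raises ZeroDivisionError if any denominator is zero, for example
    when the sample contains no positive cases.
    """
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # true positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true negative rate
        "ppv": c.tp / (c.tp + c.fp),          # positive predictive value
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only; NOT taken from the study.
    counts = ConfusionCounts(tp=90, fp=10, tn=85, fn=15)
    for name, value in diagnostic_metrics(counts).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are properties of the classifier alone, whereas positive and negative predictive value also depend on how common PE is in the evaluated reports, which is one reason all five are usually reported together.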

Original language: English
Article number: 101845
Journal: JACC: Advances
Volume: 4
Issue number: 11P2
State: Published - Nov 2025
Externally published: Yes

Keywords

  • automated PE identification
  • clinical text mining
  • hybrid NLP pipeline
  • machine learning radiology classification
  • medical document classification
  • pulmonary embolism detection
