TY - JOUR
T1 - An integrative pipeline for multi-modal discovery of disease relationships
AU - Glicksberg, Benjamin S.
AU - Li, Li
AU - Cheng, Wei Yi
AU - Shameer, Khader
AU - Hakenberg, Jörg
AU - Castellanos, Rafael
AU - Ma, Meng
AU - Shi, Lisong
AU - Shah, Hardik
AU - Dudley, Joel T.
AU - Chen, Rong
PY - 2015
Y1 - 2015
N2 - In the past decade there has been an explosion in genetic research that has resulted in the generation of enormous quantities of disease-related data. In the current study, we have compiled disease risk gene variant information and Electronic Medical Record (EMR) classification codes from various repositories for 305 diseases. Using such data, we developed a pipeline to test for clinical prevalence, gene-variant overlap, and literature presence for all 46,360 unique diseases pairs. To determine whether disease pairs were enriched we systematically employed both Fishers' Exact (medical and literature) and Term Frequency-Inverse Document Frequency (genetics) methodologies to test for enrichment, defining statistical significance at a Bonferonni adjusted threshold of (p ≤ 1×10 -6 ) and weighted q≤0.05 accordingly. We hypothesize that disease pairs that are statistically enriched in medical and genetic spheres, but not so in the literature have the potential to reveal non-obvious connections between clinically disparate phenotypes. Using this pipeline, we identified 2,316 disease pairs that were significantly enriched within an EMR and 213 enriched genetically. Of these, 65 disease pairs were statistically enriched in both, 19 of which are believed to be novel. These identified non-obvious relationships between disease pairs are suggestive of a shared underlying etiology with clinical presentation. Further investigation of uncovered disease-pair relationships has the potential to provide insights into the architecture of complex diseases, and update existing knowledge of risk factors.
AB - In the past decade there has been an explosion in genetic research that has resulted in the generation of enormous quantities of disease-related data. In the current study, we have compiled disease risk gene variant information and Electronic Medical Record (EMR) classification codes from various repositories for 305 diseases. Using such data, we developed a pipeline to test for clinical prevalence, gene-variant overlap, and literature presence for all 46,360 unique diseases pairs. To determine whether disease pairs were enriched we systematically employed both Fishers' Exact (medical and literature) and Term Frequency-Inverse Document Frequency (genetics) methodologies to test for enrichment, defining statistical significance at a Bonferonni adjusted threshold of (p ≤ 1×10 -6 ) and weighted q≤0.05 accordingly. We hypothesize that disease pairs that are statistically enriched in medical and genetic spheres, but not so in the literature have the potential to reveal non-obvious connections between clinically disparate phenotypes. Using this pipeline, we identified 2,316 disease pairs that were significantly enriched within an EMR and 213 enriched genetically. Of these, 65 disease pairs were statistically enriched in both, 19 of which are believed to be novel. These identified non-obvious relationships between disease pairs are suggestive of a shared underlying etiology with clinical presentation. Further investigation of uncovered disease-pair relationships has the potential to provide insights into the architecture of complex diseases, and update existing knowledge of risk factors.
UR - http://www.scopus.com/inward/record.url?scp=84971212642&partnerID=8YFLogxK
M3 - Conference article
C2 - 25592600
AN - SCOPUS:84971212642
SN - 2335-6936
SP - 407
EP - 418
JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
T2 - 20th Pacific Symposium on Biocomputing, PSB 2015
Y2 - 4 January 2015 through 8 January 2015
ER -