TY - JOUR
T1 - NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data
AU - Linderman, Michael D.
AU - Paudyal, Crystal
AU - Shakeel, Musab
AU - Kelley, William
AU - Bashir, Ali
AU - Gelb, Bruce D.
N1 - Publisher Copyright: © The Author(s) 2021. Published by Oxford University Press GigaScience.
PY - 2021/7/1
Y1 - 2021/7/1
N2 - Background: Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant- and sample-specific biases. Results: We introduce NPSV, a machine learning-based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Conclusions: Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a "black box" that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
AB - Background: Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant- and sample-specific biases. Results: We introduce NPSV, a machine learning-based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Conclusions: Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a "black box" that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
KW - next-generation sequencing
KW - structural variants
KW - whole-genome sequencing
UR - http://www.scopus.com/inward/record.url?scp=85110058503&partnerID=8YFLogxK
U2 - 10.1093/gigascience/giab046
DO - 10.1093/gigascience/giab046
M3 - Article
C2 - 34195837
AN - SCOPUS:85110058503
SN - 2047-217X
VL - 10
JO - GigaScience
JF - GigaScience
IS - 7
M1 - giab046
ER -