Long-read sequence and assembly of segmental duplications

Mitchell R. Vollger, Philip C. Dishuck, Melanie Sorensen, Anne Marie E. Welch, Vy Dang, Max L. Dougherty, Tina A. Graves-Lindsay, Richard K. Wilson, Mark J.P. Chaisson, Evan E. Eichler

Research output: Contribution to journal › Article › peer-review

99 Scopus citations

Abstract

We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply SDA to single-molecule, real-time sequence data from three human genomes and recover 33–79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8% sequence identity) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
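The graph idea in the abstract can be illustrated with a toy sketch. This is not SDA's code or its clustering algorithm: the input format (each read mapped to 0/1 calls at paralogous sequence variant, or PSV, sites), the function names, the edge-weighting rule, and the use of connected components over net-attractive edges are all assumptions chosen for brevity.

```python
from collections import defaultdict
from itertools import combinations


def build_psv_graph(reads):
    """Return net edge weights between PSVs (hypothetical scoring rule).

    reads: dict read_id -> dict psv_id -> 1 if the read carries the
    PSV allele, 0 if it covers the site without the allele.
    """
    weights = defaultdict(int)
    for calls in reads.values():
        for a, b in combinations(sorted(calls), 2):
            if calls[a] and calls[b]:
                weights[(a, b)] += 1  # attraction: both PSVs seen on one read
            elif calls[a] != calls[b]:
                weights[(a, b)] -= 1  # repulsion: read carries only one of them
    return dict(weights)


def cluster_psvs(weights, psvs):
    """Group PSVs joined by net-attractive edges (simple union-find)."""
    parent = {p: p for p in psvs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w > 0:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in psvs:
        groups[find(p)].append(p)
    return sorted(sorted(g) for g in groups.values())


def partition_reads(reads, clusters):
    """Assign each read to the PSV cluster it supports most, or None."""
    assignment = {}
    for read_id, calls in reads.items():
        scores = [sum(calls.get(p, 0) for p in c) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__)
        assignment[read_id] = best if scores[best] > 0 else None
    return assignment


# Toy data: two paralogs, each tagged by two PSVs (all values invented).
example_reads = {
    "r1": {"p1": 1, "p2": 1, "p3": 0, "p4": 0},
    "r2": {"p1": 1, "p2": 1, "p3": 0},
    "r3": {"p1": 0, "p2": 0, "p3": 1, "p4": 1},
    "r4": {"p2": 0, "p3": 1, "p4": 1},
}
example_psvs = sorted({p for calls in example_reads.values() for p in calls})
example_weights = build_psv_graph(example_reads)
example_clusters = cluster_psvs(example_weights, example_psvs)
example_assignment = partition_reads(example_reads, example_clusters)
```

In this toy input, reads r1 and r2 support one PSV group and r3 and r4 the other, so each group of reads could then be assembled separately. Real long-read data carry sequencing errors, so a practical method would need a clustering step that tolerates conflicting edges rather than plain connected components.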

Original language: English
Pages (from-to): 88-94
Number of pages: 7
Journal: Nature Methods
Volume: 16
Issue number: 1
State: Published - 1 Jan 2019
Externally published: Yes