TY - JOUR
T1 - Gaps and complex structurally variant loci in phased genome assemblies
AU - Human Pangenome Reference Consortium
AU - Porubsky, David
AU - Vollger, Mitchell R.
AU - Harvey, William T.
AU - Rozanski, Allison N.
AU - Ebert, Peter
AU - Hickey, Glenn
AU - Hasenfeld, Patrick
AU - Sanders, Ashley D.
AU - Stober, Catherine
AU - Korbel, Jan O.
AU - Paten, Benedict
AU - Marschall, Tobias
AU - Eichler, Evan E.
AU - Abel, Haley J.
AU - Antonacci-Fulton, Lucinda L.
AU - Asri, Mobin
AU - Baid, Gunjan
AU - Baker, Carl A.
AU - Belyaeva, Anastasiya
AU - Billis, Konstantinos
AU - Bourque, Guillaume
AU - Buonaiuto, Silvia
AU - Carroll, Andrew
AU - Chaisson, Mark J.P.
AU - Chang, Pi Chuan
AU - Chang, Xian H.
AU - Cheng, Haoyu
AU - Chu, Justin
AU - Cody, Sarah
AU - Colonna, Vincenza
AU - Cook, Daniel E.
AU - Cook-Deegan, Robert M.
AU - Cornejo, Omar E.
AU - Diekhans, Mark
AU - Doerr, Daniel
AU - Ebler, Jana
AU - Eizenga, Jordan M.
AU - Fairley, Susan
AU - Fedrigo, Olivier
AU - Felsenfeld, Adam L.
AU - Feng, Xiaowen
AU - Fischer, Christian
AU - Flicek, Paul
AU - Formenti, Giulio
AU - Frankish, Adam
AU - Fulton, Robert S.
AU - Gao, Yan
AU - Kenny, Eimear E.
N1 - Publisher Copyright: © 2023 Cold Spring Harbor Laboratory Press. All rights reserved.
PY - 2023/4
Y1 - 2023/4
N2 - There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6–7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
UR - http://www.scopus.com/inward/record.url?scp=85157978121&partnerID=8YFLogxK
U2 - 10.1101/gr.277334.122
DO - 10.1101/gr.277334.122
M3 - Article
C2 - 37164484
AN - SCOPUS:85157978121
SN - 1088-9051
VL - 33
SP - 496
EP - 510
JO - Genome Research
JF - Genome Research
IS - 4
ER -