An expanded sequence context model broadly explains variability in polymorphism levels across the human genome

Varun Aggarwala, Benjamin F. Voight

Research output: Contribution to journal › Article › peer-review

128 Scopus citations

Abstract

The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site - the site's trinucleotide sequence context - to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
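
As a rough illustration of what a heptanucleotide context is, the sketch below tabulates, for each 7-mer (three flanking bases on either side of a site), the fraction of its occurrences that carry a given substitution. This is a minimal sketch and not the authors' statistical framework: the function name `context_substitution_probs`, the input layout (a reference string plus a dictionary of 0-based positions to alternate alleles), and the simple count ratio are all assumptions made for illustration. It also does not fold reverse-complement contexts together.

```python
from collections import Counter

def context_substitution_probs(reference, variants, flank=3):
    """Estimate per-context substitution proportions (hypothetical helper).

    reference: reference sequence as an uppercase string.
    variants:  dict mapping 0-based position -> alternate allele.
    flank:     bases on each side of the site (3 gives a heptanucleotide).
    Returns a dict mapping (context, alt_allele) -> observed proportion.
    """
    context_counts = Counter()  # how often each context occurs
    sub_counts = Counter()      # how often each (context, alt) is polymorphic

    for i in range(flank, len(reference) - flank):
        ctx = reference[i - flank:i + flank + 1]
        # Skip windows containing ambiguous bases such as N.
        if set(ctx) - set("ACGT"):
            continue
        context_counts[ctx] += 1
        alt = variants.get(i)
        if alt is not None and alt != reference[i]:
            sub_counts[(ctx, alt)] += 1

    return {
        (ctx, alt): n / context_counts[ctx]
        for (ctx, alt), n in sub_counts.items()
    }
```

With `flank=1` the same function reduces to the trinucleotide context that the abstract attributes to earlier studies, which makes the two context sizes easy to compare on the same data.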

Original language: English
Pages (from-to): 349-355
Number of pages: 7
Journal: Nature Genetics
Volume: 48
Issue number: 4
State: Published - 29 Mar 2016
Externally published: Yes
