Exact distribution of a spaced seed statistic for DNA homology detection

Gary Benson, Denise Y.F. Mak

Research output: Contribution to journal › Conference article › peer-review

6 Scopus citations

Abstract

Let a seed, S, be a string over the alphabet {1,*}, of arbitrary length k, that starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution of the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{n,S,p}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
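
To make the definition of covered 1s concrete, the following is a minimal brute-force sketch in Python. It is not the efficient method the abstract refers to: it enumerates all 2^n binary strings and is usable only for small n. The function names `covered_count` and `covered_distribution` are hypothetical, and the code indexes positions from 0 by the start of each occurrence rather than by its end position h.

```python
from fractions import Fraction
from itertools import product


def covered_count(seed: str, text: str) -> int:
    """Count the 1s in `text` covered by at least one occurrence of `seed`."""
    k = len(seed)
    # Offsets within the seed that require a 1 in the text.
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every required offset holds a 1.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed: str, n: int, p):
    """Brute-force distribution of the covered count (illustration only).

    Returns a list `dist` where dist[c] is the probability that exactly c
    ones are covered in an i.i.d. Bernoulli(p) string of length n.
    Runs in time exponential in n.
    """
    dist = [Fraction(0)] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        w = text.count("1")
        dist[covered_count(seed, text)] += p**w * (1 - p) ** (n - w)
    return dist


if __name__ == "__main__":
    # In 110111 the seed 11*1 occurs once, covering three of the five 1s.
    print(covered_count("11*1", "110111"))
    print(covered_distribution("11*1", 6, Fraction(1, 4)))
```

Passing p as a `Fraction` keeps the probabilities exact, in the spirit of an exact distribution; a float also works at the cost of rounding.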

Original language: English
Pages (from-to): 282-293
Number of pages: 12
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5280 LNCS
State: Published - 2008
Externally published: Yes
Event: 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008 - Melbourne, VIC, Australia
Duration: 10 Nov 2008 to 12 Nov 2008
