TY - JOUR

T1 - Exact distribution of a spaced seed statistic for DNA homology detection

AU - Benson, Gary

AU - Mak, Denise Y.F.

N1 - Funding Information:
This research was supported in part by NSF grant IIS-0612153 and NIH grant 1 R01 GM072084.

PY - 2008

Y1 - 2008

N2 - Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.

AB - Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.

UR - http://www.scopus.com/inward/record.url?scp=85041117073&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-89097-3_27

DO - 10.1007/978-3-540-89097-3_27

M3 - Conference article

AN - SCOPUS:85041117073

SN - 0302-9743

VL - 5280 LNCS

SP - 282

EP - 293

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

T2 - 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008

Y2 - 10 November 2008 through 12 November 2008

ER -