TY - JOUR
T1 - Exact distribution of a spaced seed statistic for DNA homology detection
AU - Benson, Gary
AU - Mak, Denise Y.F.
N1 - Funding Information: This research was supported in part by NSF grant IIS-0612153 and NIH grant 1 R01 GM072084.
PY - 2008
Y1 - 2008
N2 - Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C nSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
AB - Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C nSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
UR - http://www.scopus.com/inward/record.url?scp=85041117073&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-89097-3_27
DO - 10.1007/978-3-540-89097-3_27
M3 - Conference article
AN - SCOPUS:85041117073
SN - 0302-9743
VL - 5280 LNCS
SP - 282
EP - 293
JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
T2 - 15th International Symposium on String Processing and Information Retrieval, SPIRE 2008
Y2 - 10 November 2008 through 12 November 2008
ER -