TY - JOUR
T1 - On the distribution of k-tuple matches for sequence homology
T2 - A constant time exact calculation of the variance
AU - Benson, Gary
AU - Su, Xiaoping
PY - 1998
Y1 - 1998
N2 - We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is Ra(n,k), the total number of heads in head runs of length k or more in a sequence of iid Bernoulli trials of length n. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say .75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n2) random variables, can be computed in constant time, independent of the length of the sequence and probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process where the variance can be calculated in O(n) time.
AB - We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is Ra(n,k), the total number of heads in head runs of length k or more in a sequence of iid Bernoulli trials of length n. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say .75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n2) random variables, can be computed in constant time, independent of the length of the sequence and probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process where the variance can be calculated in O(n) time.
UR - http://www.scopus.com/inward/record.url?scp=0031923832&partnerID=8YFLogxK
U2 - 10.1089/cmb.1998.5.87
DO - 10.1089/cmb.1998.5.87
M3 - Article
C2 - 9541873
AN - SCOPUS:0031923832
SN - 1066-5277
VL - 5
SP - 87
EP - 100
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 1
ER -