TY - JOUR
T1 - Comparative analysis of methods for representing and searching for transcription factor binding sites
AU - Osada, Robert
AU - Zaslavsky, Elena
AU - Singh, Mona
N1 - Funding Information: The authors thank Jessica Fong, Carl Kingsford and the anonymous reviewers for helpful suggestions on the manuscript and Irene Pak for assistance in the initial stages of this project. M.S. thanks the NSF for PECASE award MCB-0093399 and DARPA for grant MDA972-00-1-0031.
PY - 2004/12/12
Y1 - 2004/12/12
N2 - Motivation: An important step in unravelling the transcriptional regulatory network of an organism is to identify, for each transcription factor, all of its DNA binding sites. Several approaches are commonly used in searching for a transcription factor's binding sites, including consensus sequences and position-specific scoring matrices. In addition, methods that compute the average number of nucleotide matches between a putative site and all known sites can be employed. Such basic approaches can all be naturally extended by incorporating pairwise nucleotide dependencies and per-position information content. In this paper, we evaluate the effectiveness of these basic approaches and their extensions in finding binding sites for a transcription factor of interest without erroneously identifying other genomic sequences. Results: In cross-validation testing on a dataset of Escherichia coli transcription factors and their binding sites, we show that there are statistically significant differences in how well various methods identify transcription factor binding sites. The use of per-position information content improves the performance of all basic approaches. Furthermore, including local pairwise nucleotide dependencies within binding site models results in statistically significant performance improvements for approaches based on nucleotide matches. Based on our analysis, the best results when searching for DNA binding sites of a particular transcription factor are obtained by methods that incorporate both information content and local pairwise correlations.
UR - http://www.scopus.com/inward/record.url?scp=12344291347&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bth438
DO - 10.1093/bioinformatics/bth438
M3 - Article
C2 - 15297295
AN - SCOPUS:12344291347
SN - 1367-4803
VL - 20
SP - 3516
EP - 3525
JO - Bioinformatics
JF - Bioinformatics
IS - 18
ER -