TY - JOUR
T1 - Total ancestry measure
T2 - Quantifying the similarity in tree-like classification, with genomic applications
AU - Yu, Haiyuan
AU - Jansen, Ronald
AU - Stolovitzky, Gustavo
AU - Gerstein, Mark
N1 - Funding Information:
This work is supported by a grant from NIH (P50 HG02357-01). Funding to pay the open access charges was provided by a grant from NIH (P50 HG02357-01).
PY - 2007/8/15
Y1 - 2007/8/15
N2 - Motivation: Many classifications of protein function such as Gene Ontology (GO) are organized in directed acyclic graph (DAG) structures. In these classifications, the proteins are terminal leaf nodes; the categories 'above' them are functional annotations at various levels of specialization and the computation of a numerical measure of relatedness between two arbitrary proteins is an important proteomics problem. Moreover, analogous problems are important in other contexts in large-scale information organization - e.g. the Wikipedia online encyclopedia and the Yahoo and DMOZ web page classification schemes. Results: Here we develop a simple probabilistic approach for computing this relatedness quantity, which we call the total ancestry method. Our measure is based on counting the number of leaf nodes that share exactly the same set of 'higher up' category nodes in comparison to the total number of classified pairs (i.e. the chance for the same total ancestry). We show such a measure is associated with a power-law distribution, allowing for the quick assessment of the statistical significance of shared functional annotations. We formally compare it with other quantitative functional similarity measures (such as, shortest path within a DAG, lowest common ancestor shared and zuaje's information-theoretic similarity) and provide concrete metrics to assess differences. Finally, we provide a practical implementation for our total ancestry measure for GO and the MIPS functional catalog and give two applications of it in specific functional genomics contexts.
AB - Motivation: Many classifications of protein function such as Gene Ontology (GO) are organized in directed acyclic graph (DAG) structures. In these classifications, the proteins are terminal leaf nodes; the categories 'above' them are functional annotations at various levels of specialization and the computation of a numerical measure of relatedness between two arbitrary proteins is an important proteomics problem. Moreover, analogous problems are important in other contexts in large-scale information organization - e.g. the Wikipedia online encyclopedia and the Yahoo and DMOZ web page classification schemes. Results: Here we develop a simple probabilistic approach for computing this relatedness quantity, which we call the total ancestry method. Our measure is based on counting the number of leaf nodes that share exactly the same set of 'higher up' category nodes in comparison to the total number of classified pairs (i.e. the chance for the same total ancestry). We show such a measure is associated with a power-law distribution, allowing for the quick assessment of the statistical significance of shared functional annotations. We formally compare it with other quantitative functional similarity measures (such as, shortest path within a DAG, lowest common ancestor shared and zuaje's information-theoretic similarity) and provide concrete metrics to assess differences. Finally, we provide a practical implementation for our total ancestry measure for GO and the MIPS functional catalog and give two applications of it in specific functional genomics contexts.
UR - http://www.scopus.com/inward/record.url?scp=34548558977&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btm291
DO - 10.1093/bioinformatics/btm291
M3 - Article
C2 - 17540677
AN - SCOPUS:34548558977
SN - 1367-4803
VL - 23
SP - 2163
EP - 2173
JO - Bioinformatics
JF - Bioinformatics
IS - 16
ER -