TY - JOUR
T1 - A machine learning approach for hot-spot detection at protein-protein interfaces
AU - Melo, Rita
AU - Fieldhouse, Robert
AU - Melo, André
AU - Correia, João D.G.
AU - Cordeiro, Maria Natália D.S.
AU - Gümüs, Zeynep H.
AU - Costa, Joaquim
AU - Bonvin, Alexandre M.J.J.
AU - Moreira, Irina S.
N1 - Publisher Copyright: © 2016 by the authors; licensee MDPI, Basel, Switzerland.
PY - 2016/8/1
Y1 - 2016/8/1
N2 - Understanding protein-protein interactions is a key challenge in biochemistry. In this work, we describe a methodology that predicts Hot-Spots (HS) in protein-protein interfaces from their native complex structure more accurately than previously published Machine Learning (ML) techniques. Our model is trained on a large number of complexes and on a significantly larger number of different structural- and evolutionary sequence-based features. In particular, we added interface size, type of interaction between residues at the interface of the complex, number of different types of residues at the interface and the Position-Specific Scoring Matrix (PSSM), for a total of 79 features. We used twenty-seven algorithms, ranging from a simple linear-based function to support-vector machine models with different cost functions. The best model was obtained with the conditional inference random forest (c-forest) algorithm on a dataset pre-processed by normalization of features and up-sampling of the minority class. On the independent test set, the method has an overall accuracy of 0.80, an F1-score of 0.73, a sensitivity of 0.76 and a specificity of 0.82.
KW - Evolutionary sequence conservation
KW - Hot-spots
KW - Machine learning
KW - Protein-protein interfaces
KW - Solvent Accessible Surface Area (SASA)
UR - http://www.scopus.com/inward/record.url?scp=84979919014&partnerID=8YFLogxK
U2 - 10.3390/ijms17081215
DO - 10.3390/ijms17081215
M3 - Article
C2 - 27472327
AN - SCOPUS:84979919014
SN - 1661-6596
VL - 17
JO - International Journal of Molecular Sciences
JF - International Journal of Molecular Sciences
IS - 8
M1 - 1215
ER -