Developing machine learning methods to predict peptide-protein binding affinity has become an important approach in proteomics. A variety of linear and nonlinear machine learning algorithms are applied in quantitative structure-activity relationships (QSAR) to generate predictive models for ligand binding to a biological receptor. QSAR models are regression models that define quantitative correlations between the chemical structure of molecules and their physical, chemical, or biological properties. A QSAR equation predicts a molecular property from a set of molecular descriptors, which serve as the input data to a machine learning algorithm such as linear regression, partial least squares, artificial neural networks, or support vector machines. Here we present a comparative QSAR study for peptides binding to the human amphiphysin-1 SH3 domain, based on seven machine learning methods, namely partial least squares, radial basis function artificial neural networks, support vector machines, Gaussian processes, k-nearest neighbors, and the decision trees REPTree and M5P, as implemented in the machine learning software Weka. The peptide structure was encoded with five amino acid scales, namely the Miyazawa-Jernigan (MJ) substitution matrix, G. Schneider's principal component (GSPC) scale, Lv's DPPS scale, Clementi's GRID scale, and Wold's z scale. The machine learning models were trained with a dataset of 200 peptides, and the resulting QSAR models were tested on a prediction dataset of 684 peptides. The best predictions were obtained with the decision tree M5P for all five amino acid scales, namely z scale q² = 0.543, MJ scale q² = 0.553, GSPC scale q² = 0.557, GRID scale q² = 0.558, and DPPS scale q² = 0.599. These results show that M5P decision trees yield predictive QSAR models for peptide-protein binding affinity and should be considered valuable candidates for other peptide QSAR studies. The newer DPPS scale also gave the highest q² of the five amino acid scales, indicating an advantage over the earlier amino acid descriptors.
The study supports QSAR approaches that rely on a large-scale evaluation of machine learning algorithms and diverse classes of structural descriptors.
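As an illustration of the two steps the abstract relies on, the following minimal Python sketch shows how a peptide might be encoded with an amino acid scale and how a predictive q² (written `q2` in code) could be computed. It rests on several assumptions not stated in the text: peptides are encoded by concatenating per-residue descriptors, q² is taken as 1 - PRESS/SS with SS computed about the mean of the observed values, and the names `TOY_SCALE`, `encode_peptide`, and `q2` are hypothetical. The descriptor values are placeholders, not the published z, MJ, GSPC, GRID, or DPPS values, and the sketch does not reproduce the Weka workflow used in the study.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-component descriptors for illustration only; these are
# NOT the published z, MJ, GSPC, GRID, or DPPS values.
TOY_SCALE: Dict[str, List[float]] = {
    "A": [0.1, -0.2, 0.3],
    "P": [-0.5, 0.4, 0.0],
    "R": [0.9, 0.1, -0.7],
}


def encode_peptide(sequence: str, scale: Dict[str, List[float]]) -> List[float]:
    """Concatenate per-residue descriptors into one feature vector.

    Assumes every residue in `sequence` has an entry in `scale`.
    """
    features: List[float] = []
    for residue in sequence:
        features.extend(scale[residue])
    return features


def q2(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Return 1 - PRESS / SS, with SS taken about the mean of `observed`.

    This is one common convention; other definitions use the training-set mean.
    """
    mean_obs = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / ss
```

With equal-length peptides, each encoded vector has the same number of features (peptide length times the number of scale components), so the vectors can be passed directly to a regression method such as partial least squares or a model tree.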