AntiAngioPred: A Server for Prediction of Anti-Angiogenic Peptides

Angiogenesis is a vital step in the formation of malignant tumors. Anti-angiogenic peptides are therefore promising candidates for the treatment of cancer. In this study, we collected anti-angiogenic peptides from the literature and analyzed the residue preferences in these peptides. Residues such as Cys, Pro, Ser, Arg, Trp, Thr and Gly are preferred, while Ala, Asp, Ile, Leu, Val and Phe are not preferred in these peptides. There is a positional preference for Ser, Pro, Trp and Cys in the N-terminal region and for Cys, Gly and Arg in the C-terminal region of anti-angiogenic peptides. Motif analysis suggests that motifs such as “CG-G”, “TC”, “SC” and “SP-S” are highly prominent in anti-angiogenic peptides. Based on this primary analysis, we developed prediction models using different machine learning based methods. The maximum accuracy and MCC for the amino acid composition based model are 80.9% and 0.62, respectively. The performance of the models on an independent dataset is also reasonable. Based on the above study, we have developed a user-friendly web server named “AntiAngioPred” for the prediction of anti-angiogenic peptides. The AntiAngioPred web server is freely accessible at http://clri.res.in/subramanian/tools/antiangiopred/index.html (mirror site: http://crdd.osdd.net/raghava/antiangiopred/).


Introduction
Angiogenesis is the growth of new capillary blood vessels, a process used in healing and reproduction. It occurs during wound healing and restores blood flow to tissues after injury. In healthy tissues, angiogenesis is controlled by maintaining a balance between growth and inhibitory factors [1,2]. Angiogenesis is regulated by 'on' and 'off' switches: angiogenesis-stimulating growth factors act as the 'on switches', while angiogenesis inhibitors act as the 'off switches'. Excess production of angiogenic growth factors favors the growth of blood vessels, whereas an excess of angiogenic inhibitors prevents angiogenesis. Recent studies have identified several endogenous anti-angiogenic peptides from various biological sources that regulate angiogenesis and tumor growth [3][4][5][6].
The increasing interest in peptide based therapeutics has led to the development of many databases of peptides with therapeutic properties such as anticancer [14], antihypertensive [15], antimicrobial [16], blood-brain barrier [17], antiparasitic [18], hemolytic [19], quorum-sensing [20], tumor homing [21] and cell penetrating [22]. So far, peptide based drugs have been employed for many diseases and are being investigated in clinical applications against tumors, either for imaging or therapy [3][23][24][25][26]. In general, peptides are attractive as therapeutics because of their natural availability, ability to penetrate cells, specific target binding, and diverse modifications that give flexibility for different applications. The discovery of peptide inhibitors of angiogenesis will help in the development of therapeutic treatments against cancer. Several web-based tools are available for annotating protein sequences to identify the family and subfamily of a protein [27][28][29]. So far, however, there are no web-based tools to predict anti-angiogenic peptides, although the search for anti-angiogenic agents for the treatment of cancer is particularly important. Hence, in this study, a systematic attempt has been made to develop machine learning based models using various features extracted from peptide sequences, namely binary profile patterns (BPP), amino acid composition (AAC) and dipeptide composition (DPC). A user-friendly web server has also been developed to help experimental biologists predict anti-angiogenic peptides.

Datasets
Positive dataset. The main dataset was collected from the literature. In this study, we have obtained 257 anti-angiogenic peptides from various research articles and patents (S5 Table). Due to the redundancy in the sequences, CD-HIT software was used to eliminate highly similar sequences and it was ensured that no two sequences have more than 70% sequence identity. The resulting dataset contains 135 sequences in the positive dataset (S1 Table). Among these 135 sequences, 20% of the dataset (~28 sequences) was kept separately to be used as independent dataset (S3 Table).
Negative dataset. As there is no source of experimentally proven non-anti-angiogenic peptides, we extracted 135 random peptide regions from proteins in the Swiss-Prot database [30] and treated them as non-anti-angiogenic peptides (S1 Table). Some of these randomly selected peptides could be anti-angiogenic in nature, but the probability is very low. The random peptide sequences were extracted in such a way that the length distribution of the negative dataset remains the same as that of the positive dataset. Among these 135 sequences, 20% of the dataset (~28 sequences) was kept separately to be used as independent dataset.
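The length-matched sampling described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `sample_negative_peptides`, the toy protein strings and the choice of one random window per positive peptide length are all assumptions.

```python
import random

def sample_negative_peptides(proteins, positive_lengths, seed=0):
    """Draw one random window per positive-peptide length from the given proteins.

    Assumption: drawing exactly one window for each positive length is one simple
    way to reproduce the positive length distribution; the paper does not state
    the exact procedure.
    """
    rng = random.Random(seed)
    negatives = []
    for length in positive_lengths:
        # Only proteins long enough to contain a window of this length qualify.
        candidates = [p for p in proteins if len(p) >= length]
        if not candidates:
            raise ValueError(f"no protein is long enough for length {length}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - length + 1)
        negatives.append(protein[start:start + length])
    return negatives

# Hypothetical toy input; real use would read Swiss-Prot sequences instead.
toy_proteins = ["MKTAYIAKQRQISFVKSHFSRQ", "GSHMLEDPVDAFKLMNRT"]
toy_negatives = sample_negative_peptides(toy_proteins, [5, 10, 18], seed=1)
```

Fixing the seed keeps a given random dataset reproducible, and changing it yields a different negative set drawn in the same way.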
Terminus datasets. We divided the main dataset into nine terminus datasets, which are NT5, CT5, NTCT5, NT10, CT10, NTCT10, NT15, CT15 and NTCT15. NT5 and CT5 contain first five residues and last five residues from the N-terminal and C-terminal region of the peptide sequence respectively. NTCT5 is obtained by joining the NT5 and CT5 sequence. Similarly other terminus datasets were also constructed to understand the region of the peptide containing maximum information to discriminate these peptides from random sequence.
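As an illustration of how the terminus datasets are built, the sketch below extracts the NT, CT and NTCT segments for a given size. The function name `terminus_segments` is hypothetical, and returning `None` for peptides shorter than the requested size is an assumption, since the text does not say how short peptides were handled.

```python
def terminus_segments(peptide, n):
    """Return the NT, CT and NTCT segments of size n for one peptide.

    Assumption: peptides shorter than n are skipped (None is returned).
    """
    if len(peptide) < n:
        return None
    nt = peptide[:n]    # first n residues (N-terminal region)
    ct = peptide[-n:]   # last n residues (C-terminal region)
    return {f"NT{n}": nt, f"CT{n}": ct, f"NTCT{n}": nt + ct}

# Example with an arbitrary 12-residue string, not a peptide from the dataset.
segments = terminus_segments("ACDEFGHIKLMN", 5)
```

Calling the function with n = 5, 10 and 15 over the whole dataset would give the nine terminus datasets named above.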
Independent dataset. The independent dataset was made by extracting 20% of the sequences (~28 sequences) from the positive, as well as negative dataset, thereby making a total of 56 sequences (S2 Table). These sequences were not used in either training or testing procedure while developing any model.
Random datasets. In order to check the reliability of the models, we created five more random negative datasets using the same procedure as used in developing the negative dataset. These datasets were created to check whether the performance of the developed model changes if the negative dataset is replaced with another randomly created dataset. These datasets were named 'Random1', 'Random2', 'Random3', 'Random4' and 'Random5' (S4 Table).
Calculation of residue propensities. The propensity of each amino acid in anti-angiogenic peptides was calculated by the following formula:
$$P(i) = \frac{AAC_p(i)}{AAC_s(i)} \tag{1}$$
where P(i) represents the propensity of the i-th amino acid, and AACp(i) and AACs(i) represent the average composition of the i-th amino acid in the positive and Swiss-Prot datasets, respectively. We also calculated the position wise propensities of amino acids in both the N-terminal and C-terminal regions of the peptides.
Cross validation technique. In the present study, we used the ten-fold cross-validation technique to develop our models. In this technique, the sequences are randomly divided into ten sets. Nine sets are used for training the model while the remaining tenth set is used for testing. The process is repeated ten times such that each set is used once as a test set. The average performance over all ten sets is reported as the final performance of the method.
Input features for prediction. A machine learning based method requires a set of features in the form of numbers as input. These features contain the global information of the biological molecules being studied by the method. The features used in this study are described below.
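The ten-fold cross-validation procedure can be sketched in a few lines of plain Python. This is only an illustration of the splitting and averaging logic: `k_fold_indices`, `cross_validate` and the `evaluate` callback are hypothetical names, and the actual training and scoring of a classifier is left to the caller.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random division into k sets
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

# Placeholder callback: a real one would train on `train` and score on `test`.
mean_test_size = cross_validate(20, lambda train, test: len(test))
```

With 214 sequences, for example, the folds hold 21 or 22 sequences each, and every sequence appears in exactly one test fold.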
Amino acid composition (AAC). It is represented by the percentage of each amino acid within a peptide, giving a vector of size 20. It was calculated using the following equation:
$$AAC(i) = \frac{A_i}{N} \times 100 \tag{2}$$
where AAC(i) represents the percentage of amino acid (i), A_i represents the frequency of the i-th residue and N is the total number of residues in the peptide.
Dipeptide composition. Dipeptide composition refers to the percentage of all the possible pairs of amino acids (e.g. AA, AC, AD etc.) present in the peptide. It gives a vector of size 400 (20 x 20) and also includes information about the neighboring residues. It was calculated using the following equation:
$$DPC(i) = \frac{DP(i)}{N} \times 100 \tag{3}$$
where DPC(i) represents the percentage of dipeptide (i), DP(i) represents the frequency of the i-th dipeptide and N represents the total number of dipeptides.
Binary profile. In the binary profile, each amino acid is represented by a binary vector of size 20, where the element corresponding to that amino acid is set to 1 and the other 19 elements are set to 0 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). Therefore, for a stretch of 5 amino acids, the total vector size of the binary profile will be 100 (20 x 5).
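The three feature encodings described above can be sketched as follows. The function names and the ordering of residues by one-letter code are assumptions made for illustration; the ordering matches the Ala example in the text (first element set to 1), but the order actually used for the remaining residues is not stated.

```python
from itertools import product

# Assumed residue order: alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def aac(peptide):
    """Amino acid composition: percentage of each residue (vector of size 20)."""
    n = len(peptide)
    return [100.0 * peptide.count(a) / n for a in AMINO_ACIDS]

def dpc(peptide):
    """Dipeptide composition: percentage of each adjacent pair (vector of size 400)."""
    if len(peptide) < 2:
        raise ValueError("dipeptide composition needs at least two residues")
    # Overlapping pairs, so a peptide of length L contributes L - 1 dipeptides.
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return [100.0 * pairs.count(d) / len(pairs) for d in DIPEPTIDES]

def binary_profile(segment):
    """Binary profile: one 20-element indicator vector per residue, concatenated."""
    vector = []
    for residue in segment:
        vector.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return vector
```

Because the binary profile grows with sequence length, it is only usable on fixed-length segments such as the terminus datasets, whereas the two composition features give fixed-size vectors for peptides of any length.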
Two sample logos. Online service of two sample logo software was used to generate two sample logos [38][39][40]. It is useful in representing the frequency of amino acids at specific positions in the peptide sequence. The size of the residues displayed at each position is proportional to the relative frequency of each amino acid at that position.
Performance measures. The performance of the developed models was calculated using standard performance parameters, namely Sensitivity (Sn), Specificity (Sp), Accuracy (Acc) and Matthews correlation coefficient (MCC). These measures are given by the following equations:
$$Sn = \frac{TP}{TP + FN} \times 100 \tag{4}$$
$$Sp = \frac{TN}{TN + FP} \times 100 \tag{5}$$
$$Acc = \frac{TP + TN}{TP + FP + TN + FN} \times 100 \tag{6}$$
$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$
where TP, TN, FP and FN represent True Positives, True Negatives, False Positives and False Negatives, respectively.

Amino acid composition analysis
The amino acid composition analysis was carried out to extract certain residues, which are dominant in anti-angiogenic peptides. We compared the average amino acid composition in anti-angiogenic and non-anti-angiogenic peptides (Fig 1). It was observed that residues like Cys, Pro, Ser, Arg, Trp, Thr and Gly are predominant in anti-angiogenic peptides while residues Ala, Asp, Ile, Leu, Val and Phe are under-represented in these peptides. Composition was also computed for all the random datasets and compared with the negative dataset and no bias was observed, ensuring that the dataset is purely random. We also calculated the average amino acid composition of the entire Swiss-Prot database to be used as reference for analyzing the difference (Fig 1).

Residue propensities and positional preference
The propensities of residues are in accordance with the amino acid composition analysis, with Cys, Trp, Ser, Arg and Pro being predominant in anti-angiogenic peptides while Val, Ala, Leu and Ile are less preferred in these peptides (S6 Table). To understand the position wise preference of amino acids in the first 10 residues of the N-terminus and the last 10 residues of the C-terminus of anti-angiogenic peptides, we calculated the position wise propensities using Swiss-Prot as reference dataset (S7 Table). Cys, Ser, Thr and His are preferred at the N1 position; Pro at N2; Trp and Pro at N3; Ser and Phe at N4; and Cys is predominant at the N5, N6 and N7 positions in anti-angiogenic peptides. In the C-terminal region, Cys is prominent at C1 and C2; Gly and Cys at C3 and C4; Cys at C5; while Arg is most favoured at the C8, C9 and C10 positions. We also performed residue based preference analysis using two sample logos (S1 and S2 Fig), which is in accordance with the results described above.
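A residue propensity of this kind can be sketched as below, assuming that the propensity is the ratio of the average composition in the positive set to the composition in the reference set. The function names, the uniform reference composition and the two toy peptides are hypothetical and serve only to make the arithmetic visible.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def average_composition(peptides):
    """Average percentage composition of each amino acid over a set of peptides."""
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for peptide in peptides:
        for a in AMINO_ACIDS:
            totals[a] += 100.0 * peptide.count(a) / len(peptide)
    return {a: totals[a] / len(peptides) for a in AMINO_ACIDS}

def propensities(positive_peptides, reference_composition):
    """Ratio of positive-set composition to reference composition (assumed form)."""
    positive = average_composition(positive_peptides)
    return {a: positive[a] / reference_composition[a]
            for a in AMINO_ACIDS if reference_composition[a] > 0}

# Toy reference: every residue at 5%; a real run would use Swiss-Prot averages.
uniform_reference = {a: 5.0 for a in AMINO_ACIDS}
toy_propensities = propensities(["CCGG", "CGCG"], uniform_reference)
```

Under this definition a value above 1 means the residue is enriched relative to the reference and a value below 1 means it is depleted; applying the same ratio to one sequence position at a time would give position wise propensities.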

Motif analysis
To find the frequent motifs in the anti-angiogenic peptides, we extracted motifs using MERCI software with the following criteria: i) the motif should be present in at least 10% (~14 peptides) of the total number of peptides in the positive dataset, ii) the motif can have a maximum of 5 gaps. Here, a gap represents the presence or absence of any amino acid. Using the above criteria, we obtained a total of 151 motifs. Further, we selected the motifs that had a propensity (Eq 1) greater than or equal to 0.90. This resulted in a total of 22 motifs, which are "CG-G", "TC", "SC", "SP-S", "W-S-C", "WS-C", "S-T-C", "S-C-S", "CS-T", "C-S-T", "T-C", "S-C", "C-G-G", "TR", "S-T-G", "S-P-S", "SP", "RT", "P-W", "P-C", "C-N" and "CG" (hyphen '-' represents a gap). These motifs are important for understanding and identification of anti-angiogenic peptides. The full list of 151 motifs sorted by propensity is given in S8 Table.

Performance of various machine learning approaches on the dataset
We used different machine learning classifiers like SVM, Random Forest (RF), IBk, J48, Naïve Bayes, Logistic and Multilayer Perceptron (MP) to develop amino acid composition based models on the whole peptide dataset. This helps us to compare the performance of different classifiers on the same dataset. The models developed in this study are explained below.

Amino acid composition based model
We used amino acid composition of the peptide as input feature to develop the prediction model using SVM, J48, RF, Naïve Bayes, MP, Logistic and IBk machine learning classifiers (Table 1). SVM (MCC = 0.48), MP (MCC = 0.49) and RF (MCC = 0.48) based models performed better than the other methods. The performance of the best models (SVM, RF, MP) was similar, and we therefore selected the SVM machine learning method for further development of models using different input features.
The performance of SVM based models developed on the terminus datasets is summarized in Table 2. We observed that the best results in terms of accuracy (80.9%) and MCC (0.62) were obtained on the NT15 terminus dataset.

Dipeptide composition based model
An SVM based model was developed using dipeptide composition of the whole peptide as input feature, and we achieved an accuracy of 74.8% with MCC 0.50 (Table 3). We also developed models on the nine terminus datasets as done previously. The maximum accuracy (75.9%) and MCC (0.52) were obtained on the CT15 terminus dataset, although the performance on NT15 (74.1% accuracy) was close. There was a slight increase in the performance of the dipeptide composition based model (74.8% accuracy) as compared to the amino acid composition based model (73.8% accuracy) on the whole peptide dataset.

Binary profile based model
We also developed SVM based models using binary profile of peptide as input feature. We achieved the best accuracy (77.6%) and MCC (0.55) on NTCT5 terminus dataset (Table 4).

Performance on independent dataset
In order to validate the models, the performance of all the best models was tested on an independent dataset. The amino acid and dipeptide composition based models both achieved an accuracy of 69.6% with MCC 0.41 on the whole peptide dataset (Table 5). The model based on amino acid composition on the NT15 terminus dataset achieved an accuracy of 75.0% with MCC 0.51. These results indicate that our models are robust and perform reasonably well on the independent dataset.

Reliability of models
We created five random datasets (Random1 to Random5) and developed an amino acid composition based model using the positive dataset and each of the random datasets, generating a total of 5 models. The performance of these models is given in Table 6. The results indicate that the developed models are reliable and stable enough to perform well on all the random datasets.

Implementation of web server
Based on the above study and to assist the scientific community, we developed a web server named 'AntiAngioPred' with a user-friendly interface. We implemented two models in the web server: i) the amino acid composition based model on the NT15 terminus dataset, ii) the amino acid composition based model on the whole peptide dataset. The former model is implemented because it performed best among all models, and the latter is implemented for peptides that are less than 15 residues in length. Due to the limited number of anti-angiogenic peptide sequences, the models implemented in the web server were developed using all the sequences. A user can submit a peptide sequence in the 'Predict' module of the web server to predict whether the peptide has anti-angiogenic property or not. The user can also obtain the single mutant analogs of the submitted peptide along with their predictions. This helps the user to identify the minimum mutations, and their locations in the peptide sequence, needed to confer anti-angiogenic properties on that peptide. For multiple peptides, the 'Multiple Peptide' module predicts the anti-angiogenic nature of all submitted peptides using a single submission form. The web service can be accessed at http://clri.res.in/subramanian/tools/antiangiopred/ or at its mirror site at http://crdd.osdd.net/raghava/antiangiopred/
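The single mutant analogs mentioned above can be enumerated as in the following sketch, which also shows the length-based choice between the two implemented models. It does not reproduce the server's code: `single_mutants`, `choose_model` and the labels "NT15" and "whole" are hypothetical names, and substitution is assumed to be restricted to the 20 standard amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_mutants(peptide):
    """Yield (position, original, replacement, analog) for every single substitution.

    Positions are 1-based. A peptide of length L has 19 * L single mutant analogs.
    """
    for pos, original in enumerate(peptide):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                analog = peptide[:pos] + replacement + peptide[pos + 1:]
                yield pos + 1, original, replacement, analog

def choose_model(peptide):
    """Pick the NT15 model for peptides of 15 or more residues, else the whole-peptide model."""
    return "NT15" if len(peptide) >= 15 else "whole"

# Arbitrary three-residue string, used only to show the enumeration.
analogs = list(single_mutants("ACD"))
```

Each analog would then be scored with the selected model, so that substitutions changing the predicted class can be listed together with their positions.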

Discussion
In this study, an attempt has been made to develop an effective in silico method to predict anti-angiogenic peptides. We used a dataset of 107 positive and 107 negative sequences to develop models and checked their performance using the ten-fold cross validation technique. We also tested the performance of the developed models on the independent dataset with 28 positive and 28 negative sequences. Primary analysis based on the amino acid composition and residue propensities reveals that residues such as Cys, Trp, Ser, Arg and Pro are preferred in anti-angiogenic peptides while Val, Ala, Ile and Asp are not preferred in these peptides. Analysis of two sample logos and positional preference shows the predominance of certain residues like Ser, Pro, Trp and Cys in the N-terminal region of anti-angiogenic peptides, while residues such as Cys, Gly and Arg predominate in the C-terminal region. Both Ser and Cys have high propensities while Ala and Val have low propensities at most of the positions in the N-terminal region. In the C-terminal region, Arg, His and Cys have high propensities while Ala has low propensity at most of the positions. Further, motif analysis suggests prominent motifs like "CG-G", "TC", "SC", "SP-S", "W-S-C", "WS-C", etc., which are present in the anti-angiogenic peptides. Based on the primary analysis, we developed models for discriminating anti-angiogenic and non-anti-angiogenic peptides using different machine learning techniques. The SVM based models developed in this study were able to discriminate anti-angiogenic and non-anti-angiogenic peptides with 80.9% accuracy and 0.62 MCC on the NT15 dataset using amino acid composition as input feature. On an independent dataset, the above model achieved 75% accuracy and 0.51 MCC. Further, the performance of amino acid composition based models on the whole peptide dataset developed using all the random datasets indicates that the model is stable and hence reliable.
To assist the scientific community, we have integrated the models developed in this study in the web server AntiAngioPred, which can be accessed at http://clri.res.in/subramanian/tools/antiangiopred/index.html (mirror site: http://crdd.osdd.net/raghava/antiangiopred/).

Limitations and future development
The current study is based on a small dataset of anti-angiogenic peptides. Therefore, the predictor may not be robust enough to apply to a set of peptides far more diverse than the dataset used in this study. As more anti-angiogenic peptides become available in the literature, the predictor will require retraining on the new dataset to make it more robust. The choice of random peptides as the negative dataset poses a further limitation on this predictor. Ideally, a negative dataset should contain experimentally validated non-anti-angiogenic peptides. However, in the absence of such peptides, a more appropriate choice would be random peptides having physico-chemical properties similar to those of anti-angiogenic peptides. The above suggestions should be considered for the future development of models.

Supporting Information
S7 Table. Positional propensity of amino acids in first and last 10 residues of N- and C-terminus of anti-angiogenic peptides. N1 represents the first residue and C10 represents the last residue. The propensities were calculated using Swiss-Prot as reference dataset. (DOCX)
S8 Table. List of motifs extracted by MERCI software. (DOCX)