Analysis and Prediction of Highly Effective Antiviral Peptides Based on Random Forests

The goal of this study was to examine and predict antiviral peptides. Although antiviral peptides hold great potential in antiviral drug discovery, little is done in antiviral peptide prediction. In this study, we demonstrate that a physicochemical model using random forests outperform in distinguishing antiviral peptides. On the experimental benchmark, our physicochemical model aided with aggregation and secondary structural features reaches 90% accuracy and 0.79 Matthew's correlation coefficient, which exceeds the previous models. The results suggest that aggregation could be an important feature for identifying antiviral peptides. In addition, our analysis reveals the characteristics of the antiviral peptides such as the importance of lysine and the abundance of α-helical secondary structures.

AVPs are known to fight against various viruses. All of the AVPs are derived from either synthetic combinatorial libraries or segments of natural proteins and their homologues. A list of highly effective antiviral peptides against HIV [15], HSV [16], hepatitis C virus [17], influenza virus [18][19][20], rabies virus [21], and west nile virus [22] has been compiled into an online database AVPpred [23]. Recently, there is an dedicated AVP database HIPdb for HIV, comprehensively collecting the experimentally validated HIV inhibiting peptides [24].
Several mechanisms are available for AVPs to fight against viruses. Antiviral therapeutics agents are known to block the attachment of viruses, prevent from the fusion of viruses to host cells, interrupt the signaling process of viruses, or inhibit the replication of viruses in host cells which may involve DNA polymerase, reverse transcriptase, integrase, and protease [14]. Currently studies have shown that AVPs inhibited the fusion of viruses to the cells [25,26]; others have shown that AVPs interfered the replication of viruses [27][28][29].
Little is done in predicting and examining antiviral peptides. Broadly speaking, antiviral peptides should be a part of antimicrobial peptides, which fight against bacteria, fungi, parasites, and viruses. Several studies have been done in antimicrobial peptides [30][31][32][33][34][35], but a recent study by Thakur et al. demonstrated that antimicrobial peptide predictors are not suitable to assess AVPs [23]. In addition, this study was the first to explore four different approaches to predict effective AVPs: motif, sequence alignment, amino acid composition, and physicochemical features. Their results demonstrated that a support vector machine (SVM) approach using physiochemical features was a powerful method to identify AVPs. However, it is not clear whether key residues exist in AVPs and whether other methods can outperform SVM in predicting AVPs.
In this study, we demonstrate that our random forests (RF) model based on physiochemical properties works better for identifying AVPs. Physicochemical properties of peptides are a useful means to identify AVPs. A previous study demonstrated that predicting antimicrobial peptides (AMP) could depend on sequence-derived physicochemical properties and this study also suggested that aggregation could be important for classifying AMPs [33]; A recent study indeed showed that identifying AVPs using physicochemical properties of peptides worked [23]. Here we further investigated this finding.

Training, validation, and test data sets
The data sets were obtained from the study by Thakur et al. [23]. 1,056 peptides were validated experimentally, containing 604 highly effective AVPs and 452 non-effective peptides; another 604 peptides without experimental validation were non-effective from the study by Lata et al. [32]. Each of the peptides in the data sets was different from one another.
Two training sample sets and two independent test sets were established based on the data described above. Here we followed the same nomenclature used in the study by Thakur et al. [23]. 10fold cross-validation was performed in our analysis, where the training and validation sets came from either of the two sample sets T 544P+407N and T 544P+544N* . T 544P+407N consisted of 544 highly effective AVPs and 407 non-effective experimental peptides; T 544P+544N* contained the same 544 positive AVPs but different 544 non-experimental negative peptides. The independent test sets V 60P+45N and V 60P+60N* were used for the benchmark. V 60P+45N consisted of 60 highly effective AVPs and 45 non-effective peptides; V 60P+60N* contained 60 positive peptides and 60 non-experimental negative peptides.

Viral proteomes
The viral proteins were obtained from the viral genome database at the NCBI Entrez (January 2013), which consisted of 3251 viral genomes and viroids. All the 41316 viral proteins expressed by these genomes and viroids were retrieved. Nonstandard amino acids such as 'B', 'J', 'U' and 'X' in the viral proteins, which were less than 0.01% of the overall residues, were eliminated during the analysis. The viral proteomes were treated as non-effective antiviral peptides under a simple assumption that the viral proteins would not discourage themselves to develop.

RF classifier
RF is a classification method using an ensemble of unpruned decision trees with randomly selected features. The RF algorithm integrates random subspace method [36] into the concept of bootstrap aggregating or bagging [37] to generate an ensemble of decision trees. Independently each decision tree is best split by a small fraction of randomly selected features, trained on cases chosen by random sampling with replacement to all the available data. The aggregated decision trees then determine which class the predicted case belongs. The RF classifier known to avoid overfitting is a highly accurate method in many classification problems [38,39]. In this study, the randomForest R package version 4.6-7 was utilized [40], which was based on Breiman and  Cutler's algorithm [39]. Two parameters the number of ntree decision trees and the number of mtry selected features were set as follows: ntree = 100 and mtry~ffi ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi number of features p as recommended [39]. One additional advantage of the RF model is that the model is possible to interpret the importance of the features using measures such as decrease mean accuracy or Gini importance.

Artificial Neural Network (ANN) classifier
In this study, ANN was trained by the backpropagation algorithm. Its learning rate and momentum rate were equal to 0.3 and 0.2 respectively. The number of hidden units was set to half of the number of features and the number of classes.

Linear Discriminant Analysis (LDA) classifier
The MASS R package version 7.3-26 was utilized to build the LDA models in this study. The LDA models seek the best linear combination of the features to separate AVPs from others.

Gini importance
Gini importance or the mean decrease of Gini index (MDGI) is a robust quantity to measure variable importance in the RF model [41]. Gini index is an impurity quantity defined as follows: where i contains all the classes and P i is the fraction of class i. In our case there are two classes: AVPs and non-AVPs. The Gini index ranges from 0 to 1 and the index is closer to 0 if the sample is purer. The decrease of Gini index in a tree model refers to the difference between the Gini index of a parent node and the weighted Gini index of its descending nodes. Given a particular variable, the MDGI is the sum of the decrease of Gini index over the number of the decision trees in the RF. The larger the MDGI is, the more important the variable is.

Amino Acid Composition (AAC)
Amino acid composition is the ratio of each amino acid in a peptide. The ratio of an amino acid with type T in a peptide X is calculated as follows: where N T is the number of the amino acid with type T and L is the length of peptide X.

Physicochemical properties
In the first attempt, 544 physicochemical indices in amino acid index database (AAindex version 9.1) [42] were examined. Each index contains 20 numerical values to represent the specific physicochemical properties of 20 amino acids. A simple linear addition model, the scalar product of each of the 544 indices and each of the amino acid composition of peptides, was used for the evaluation. Since the number of the parameters generated by these 544 indices was much larger than the number of the training cases, identifying crucial indices using feature selection was recom-mended. We tested few feature selection methods such as minimum Redundancy Maximum Relevance to identify the crucial indices [43]. However, neither the top-ranked indices from our study nor those by Thakur et al. [23] provided a satisfactory predictive performance in the RF models. In fact, our results suggested that the RF models with more indices performed worst (data not shown), which followed similar trends as indicated in the previous finding [34].
In the next attempt, the following physicochemical properties were selected: length, net charges, instability index, aliphatic index, and hydropathicity. Our basic physicochemical models were built by utilizing the five properties combined with amino acid composition. The advanced features such as secondary structures of peptides described below were considered.
The instability index, an estimate of peptide stability, is calculated as follows: where L is the length of peptide and DIWV from the study by Guruprasad et al. is an instability weight value of a dipeptide starting at position i [44]. Peptides with II values greater than 40 are considered to be unstable. The aliphatic index, the relative volume of aliphatic residues in a peptide, is calculated as follows: where a and b are the constants, which represent the relative volume of valine and leucine or isoleucine to alanine. X Ala , X Val , X Leu , and X Ile are the fractions of alanine, valine, leucine and isoleucine multiplied by 100, respectively [45].
Grand average of hydropathicity index (GRAVY) is used to represent the hydrophobicity value of a peptide, which calculates the sum of the hydropathy values of all the amino acids divided by the sequence length. GRAVY was calculated using the hydropathy values from Kyte and Doolittle [46]. Positive GRAVY values indicate hydrophobic; negative values mean hydrophilic. All these physicochemical values could be obtained directly from the ExPASy website (http://www.expasy.org) [47].

Aggregation
AGGRESCAN was utilized to estimate the aggregation tendencies of a peptide [48]. AGGRESCAN, which applied aggregation propensities of amino acids derived from the experimental data of b-amyloid peptides, is a good indicator of in vivo aggregation.

Secondary Structure
Protein secondary structure prediction (PSSpred version 2.0), which was a neural network classifier integrated into the famous I-TASSER server [49], was utilized to predict the secondary structure of a peptide [50]. Each amino acid on the peptide was classified into a-helix, b-sheet, or random coil. The number of each type of secondary structure was then recorded.

Amino acid composition of the AVPs
The comparison of amino acid composition of AVPs was shown in Fig. 1 and Fig. 2. The viral proteomes were chosen to be an optional baseline. Compared to the non-effective experimental peptides and the viral proteomes, the AVPs consistently showed higher percentages of leucine and lysine, but lower percentages of threonin, proline, and valine. Leucine, medium-sized hydrophobic residue, was the most abundant residues in the AVPs. In addition, a clear preference was shown for a basic residue lysine, which was also known as the key characteristics for anti-microbial peptides.

Physicochemical properties, aggregation, and secondary structures of the AVPs
The statistical analysis of the physicochemical properties of the AVPs was done (Fig. 3 and Fig. 4). The following properties were examined, including length, aliphatic index, instability, net charge, hydropathy, aggregation, and secondary structure. The lengths of the 604 AVPs ranged from 6 to 45 amino-acid residue, with an average length of 24.2 residues; the 452 non-effective AVPs had similar range with an average length of 18.2 residues; the viral proteins were longer with an average length over 200 residues. The AVPs had the strongest aliphatic tendency among them all ( Fig. 3B and Fig. 4B). Indeed, the AVPs had the largest portions of aliphatic residues such as leucine, isoleucine, and valine ( Fig. 1 and Fig. 2). Besides, only the AVPs had an average instability value over 40, which considered to be unstable; both the non-effective AVPs and viral proteins had an instability value less than 40 ( Fig. 3C and Fig. 4C). In terms of residue charge, the AVPs had higher tendency to be charged positively than the non-effective peptides; the viral proteins preferred negative net charges ( Fig. 3D and Fig 4D). The AVPs had a slightly negative GRAVY score, but higher than both the non-effective peptides and the viral proteins ( Fig. 3E and Fig. 4E). In the aggregation analysis, the AVP also had a slightly negative AGGRESCAN score on average, but higher than the non-effective peptides and the viral proteins  ( Fig. 3F and Fig 4F). Each residue of the peptides was classified into a-helix, b-sheet, or coil according to the secondary structure prediction. More than half of all the residues of the AVPs were a-helix (Fig. 3H), but the AVPs had lower portions of coil than the non-effective peptides and the viral proteins ( Fig. 3G and Fig. 4G). b-sheet propensity was less clear, for few residues were classified into b-sheet ( Fig. 3I and Fig 4I).

Feature importance of amino acid composition
The analysis of Gini importance on the RF model built with amino acid composition was performed. Lysine, a positively charged molecule, was the most important residue among the 20 amino acids for classifying AVPs. The result was consistent with those evaluated by different methods, which ranked the importance of amino acid similarly albeit slightly different (Table S1). Medium-size hydrophobic residues such as leucine and isoleucine also played a role in distinguishing AVPs from non-AVPs. Besides, the importance of threonine was also identified (Fig. 5).
In addition, the same Gini analysis of a RF model on an AMP database was performed. Although both lysine and arginine were abundant in the AMPs, the Gini analysis showed that neither lysine nor arginine was the most effective residue to tell apart AMPs from non-AMPs (Fig. S1). Our results indicated relatively less abundant residues in the AMPs such as methionine, aspartic acid, glutamic acid, and threonine are important to distinguish AMPs (Fig. S1). This differs from what we have seen in the AVPs, where lysine was the most effective residue.

Performance analysis on the training samples
All eight RF models were examined using 10-fold crossvalidation trained by either T 544P+407N or T 544P+544N* as summarized in Table 1. Those models trained by T 544P+544N* were marked with the number sign #; those trained by T 544P+407N were without the sign. In order to easily compare AVPpred, our models named similarly after the models of AVPpred. The abbreviations compo, physico, structure, and agg represented amino acid composition, the amino acid composition plus five basic physicochemical properties, secondary structure, and aggregation respectively. For example, RFphysico was a RF model based on five basic physicochemical properties and amino acid  composition. RFcompo+structure+agg was based on amino acid composition, secondary structure, and aggregation. In the 10-fold cross-validation, the performances of the eight RF models were close. In the T 544P+407N training data, the best model RFcompo+structure+agg had an accuracy of 92% and 0.83 MCC. Both RFcompo+structure and RFphysico+structure reached a slightly higher accuracy of 85% and 0.69 MCC than RFcompo and RFphysico. In the T 544P+544N* training data, similar trends were found. Four models RFcompo+structure#, RFcompo+structure+agg#, RFphysico+structure#, and RFphysico+structure+agg# achieved a high accuracy of 92% and 0.83 MCC. As suggested before, different levels of the performances on the two training data sets might be due to T 544P+544N* containing non-experimental peptides [23]. Those non-experimental peptides were retrieved from non-secreted proteins, for the original study assumed antimicrobial peptides to be naturally secreted proteins [32]. The non-secreted proteins might be very different from the AVPs, thus distinguishing AVPs from those non-experimental peptides would be easy. Table 2 shows the performances of four learning methods with the standard models. The models using ANN, LDA, and RF were built on the same datasets. Two best SVM models of AVPpred, the only other tool available for predicting AVPs, were also included. All these models were evaluated on the two test sets V 60P+45N and V 60P+60N* . V 60P+45N had only experimentally verified peptides, but V 60P+60N* contained non-experimental peptides. As seen in Table 2, the LDA models performed fairly, the ANN models gave a satisfactory prediction, and the SVM or RF models were very good in distinguishing AVPs. In general, the comparison results demonstrated that the RF models were superior to the other models in this problem.

Performance evaluation on the independent test sets
On the V 60P+45N test set, the RF models performed well ( Table 2). The best RF models consistently outperformed the best models of AVPpred. For example, RFphysico surpassed AVPphysico, which was the performance indicator of AVPpred with an accuracy of 86% and 0.71 MCC, while both models utilized only 25 physicochemical properties.
On the V 60P+60N* test set containing non-experimental peptides, similar results were seen. For this test set, the two best models of AVPpred trained by T 544P+544N* , AVPcompo# and AVPphysico#, were included. Our best model RFcompo# achieved the maximal accuracy of 93% and the highest 0.87 MCC. RFcompo# outperformed AVPcompo# while RFphysico and AVPphysico were comparable.
Whether additional features such as aggregation and secondary structure could improve the prediction was examined. Table 3 shows a performance comparison among the eight RF models. Generally speaking, adding aggregation to the RF models tended to improve the performance slightly. For example, on both test sets, RFcompo+agg surpassed RFcompo while RFcompo+agg# and RFcompo# were comparable. Adding secondary structure could also improve the RF models. For example, RFcompo+structure outperformed RFcompo. In addition, on the V 60P+45N test set, our best model RFcompo+structure+agg achieved the maximal accuracy of 90% and the highest 0.79 MCC. On the V 60P+60N* test set, our best model RFcompo# and RFcompo+agg# achieved the maximal accuracy of 93% and the highest 0.87 MCC. The overall results demonstrated that these features could be useful for identifying the AVPs.

Discussion
This was the first study to apply RF into AVP prediction. Our RF models were based on size, amino acid composition, net charge, aliphaticity, instability, hydrophobicity, and secondary structure. The results demonstrated that predicting AVPs by the RF models through the basic physicochemical properties worked. On the independent test data provided by AVPpred, our evaluation indicated that RF worked better than SVM in distinguishing AVPs using these physiochemical properties. Our result supports that RF, a robust classifier, excels in many problems [38].
In order to reach optimal performances, several training designs for the RF models were explored. One option was to remove highly similar sequences from the training cases. The performance dropped as the RF models were trained by the non-redundant cases, which were generated by removing highly similar cases over 80% identities, leaving out more than one-third of the entire training cases. Additional analyses on the model attributes were also done such as increasing the number of attributes by AAindex and replacing the secondary-structure counts with the secondarystructure percentages. However, none enhanced performance. This suggests that improving AVP prediction is not a trivial task.
To compare AVPpred fully, the effects of different training sets on the RF models were examined. One set had the experimental data; the other contained the negatives without experimental verification. This led to different levels of specificity. Since true negatives in biological context are often limited and difficult to obtain, hypothetical negatives or negatives without experimental verification are the substitutes. However, ideally all true negatives for building the models should be verified experimentally to avoid performance errors. More attention, therefore, should be paid to the models with the experimental data. Our RF models outperformed the previous ones in either training set, but more obvious in the experimental training data.
Feature importance of amino acid composition of the AVPs was analyzed using the experimental data. Lysine, also known as a key residue in antimicrobial peptides, was the most important residue in distinguish AVPs. The residues with great importance need not to be the abundant residues. Our analyses showed that the AVPs were abundant in leucine, lysine, alanine, and glutamic acid ( Fig. 1 and Fig. 2), which supports the previous finding by HIPdb [24]. Several other interesting properties were found in our analysis, but we only emphasized lysine, for this finding was supported by various feature ranking methods. However, why the AVPs were biased toward lysine is not clear. One possible explanation is that the positively charged AVPs might interact with enveloped viruses like HIV [30], inhibiting the entry of the viruses into the cells.
The aggregation propensities of the AVPs were revealed. Our results suggested that the AVPs had higher tendencies to be aggregated in vivo than the non-effective peptides and the viral proteins. Our RF models also indicated that aggregation could be an important feature for classifying the AVPs. It is not clear how aggregation affects the AVPs to restrain the viruses. However, aggregation has been suggested to be an important feature of AMPs, for accumulating and clumping peptides together could affect the peptide availability [33].
In addition, the abundance of a-helix secondary structures in the AVPs was discovered in this study. What role the secondarystructure preference plays in fighting viral infection is not understood. It has been suggested that the a-helical structures are an common element for protein-protein interactions [51] and the a-helical structures of antimicrobial peptides are linked to the permeability to the membranes to abrupt the cell membranes of bacteria [35]. Whether the a-helical structures of the AVPs, a subset of antimicrobial peptides, interact with the enzymes in the virus replication or have similar impact on host cell membranes still need to be investigated. However, given secondary structure information, the performance of the RF models could further improve. On the experimental benchmark, our model trained by amino acid composition, aggregation, and secondary structure achieved the optimal performances. The importance of secondary structure was also supported by the previous study-several top physiochemical indices were related to secondary structure [23]. Figure S1 Feature importance of amino acid composition of AMPs. The importance of each amino acid is measured using the mean decrease of Gini index (MDGI) of the 20 amino acids ranked by the RF model built for the AMPs and non-AMPs. The larger the MDGI value, the more important the residue. (TIF) Text S1 Supplementary Information for Figure S1. (DOCX) Author Contributions