Achieving High Accuracy Prediction of Minimotifs

The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates less than 1% false positives or even no false positives and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state-of-the-art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.


Introduction
Minimotifs (also called Short Linear Motifs) are short contiguous peptide pieces of proteins that have a known biological function, which can be categorized into binding, posttranslational modification of the minimotif, and protein trafficking. Minimotifs are involved in nearly all cell processes including intracellular signaling, extra-cellular activities, and disease [1][2][3][4].
Minimotifs combine a known biological function with a short protein sequence representation, generally of fewer than 15 amino acids. This distinguishes them from protein domains such as those in ProSite, and from tools such as MEME and SCOP that identify sequence patterns without known functions [5,6]. Computational minimotif prediction tools have arisen to perform searches and predict new functions in proteins based upon established functions associated with minimotifs in other proteins. Minimotif Miner (MnM), Eukaryotic Linear Motif (ELM), and ScanSite fulfill these roles [3,7-11]. These approaches have had successes; however, the relatively low sequence complexity of minimotifs gives rise to many false positives, which limits their usefulness.
To address this problem, we developed five separate scores/filters, each of which has significant value in reducing false positive predictions [7,8,12,13]. Frequency Score analysis (FS) uses the complexity of minimotif sequence definitions to rank-order minimotifs. A Surface Prediction (SP) algorithm identifies minimotifs likely to be on the surface of a protein. The remaining three approaches take advantage of both the target and source proteins. The Protein-Protein Interaction filter (PPI) refines minimotif predictions by selecting only motifs whose source protein and target protein are known to interact in vivo, eliminating any whose source protein and target protein do not interact [12]. In addition to exact PPIs, protein-protein interactions are also expanded based on orthologues and paralogues across species and taxa ("HomoloGene-PPI"), as well as sequence similarity ("Similarity-PPI"). The Cellular and Molecular Function filters (CF/MF) retain minimotifs whose source protein and target protein share a common cellular or molecular function, respectively [13]. Exact functional matching is not required; rather, function terms are related through the network structure provided by the Gene Ontology (GO) database [14]. For example, one function may be a subclass of another function, or one function may regulate another function.
These scores/filters exploit different components of a minimotif syntax developed for this purpose [15]. We next demonstrated that pairwise combinations of filters were better than either alone, suggesting that each filter used distinct information. This led us to perform a systematic comparison of different combinations of the five scores/filters that we had developed previously. A study of minimotif filtering with linear regression, support vector machine, and neural network algorithms shows a vast improvement in minimotif prediction, with accuracies above 85% and, in one analysis, less than 1% false positives while retaining more than ~90% of the true positives. This advance sets us on a path to vastly reducing false positive predictions. Implementation of this filter combination on the MnM website renders minimotif-mediated protein function prediction much more reliable and influential.

Results
To build and test the multi-filter approach we used five existing filters designed to remove false-positive minimotifs [7,8,12,13]. This multi-filter approach was enabled in large part due to a rich model of the syntactical and semantic structure for minimotifs [14]. Briefly, a minimotif is found in a 'source protein' and the target protein binds the minimotif or alters the minimotif. Two of the filters are based upon regular expression searches involving solely the source protein (where the minimotif is found). Frequency Score analysis (FS) uses the complexity of minimotif sequence definitions to rank-order minimotifs. A surface prediction algorithm identifies minimotifs likely to be on the surface of a protein.
We first evaluated each individual filter on the same dataset by generating Receiver Operating Characteristic (ROC) curves and comparing the areas under the curves (AUC) (Fig. 1, Table 1). The AUCs for individual filters ranged from 0.72 to 0.88, indicating good filter performance but leaving much room for improvement.
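As a minimal sketch of how an AUC can be obtained from filter scores, the function below computes it as the fraction of (positive, negative) pairs in which the positive scores higher, with ties counted as one half. This is the standard rank-based formulation and is not necessarily the procedure used to produce Fig. 1; the example scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical filter scores for three verified and three negative minimotifs.
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
```

In this toy example eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9.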
We evaluated several approaches for combining the different filtering techniques. Linear regression, support vector machine, and neural network multi-filter models were trained and tested by randomly partitioning the true positive and true negative data equally into five groups, each of which contained a subset of 400 instances. A five-fold cross validation was performed by successively using four groups to train the multi-filter models and one group of validation data to evaluate the effectiveness of the multi-filter. The three multi-filter models used the individual CF, MF, FS, PPI, and SP minimotif filters. The AUC values indicated that the multi-filters were significantly better than any individual filter (Table 2).
We next optimized the multi-filters. We repeated the minimotif filtering while varying the filter score threshold to identify the maximum AUC for the best cross validation test in each of the three models (#2 of linear regression, #3 of support vector machine, and #3 of neural network). Plots showing the dependency of sensitivity, specificity, and accuracy on the filter threshold are shown for the linear regression, support vector machine, and neural network models in Fig. 2. The threshold dependence was typical of that for any filter. For all three models, the sensitivity decreased and the specificity increased as the threshold increased, as one would expect. The accuracy increased to a maximum and then decreased as the sensitivity dropped.
The plots shown in Fig. 2 were used to identify several threshold values for each model to help us select the best minimotif-filtering model. A threshold with the maximum accuracy is defined as the optimal threshold (T_o). A stringent threshold that minimizes the number of false positives while retaining a high sensitivity is denoted as T_s. The optimal threshold for the three minimotif filtering models produced accuracies above 90% with ~85% true positive rate and less than 1% false positives (6% for the neural network) (Table 3). The stringent threshold produced less than 1% or in some cases no false positives (linear regression in Table 3), while retaining more than ~90% of the true positives for the linear regression and neural network models (84% for the support vector machine model). Our evaluation of the selected models was also supported by the Matthews Correlation Coefficient (MCC), which showed good performance of the filter combinations (an MCC of 1 indicates a perfect prediction while 0 indicates a prediction no better than random).
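The threshold analysis can be sketched as a simple sweep. The code below is illustrative only: it assumes that a prediction is retained when its score is at or above the threshold, that both classes are present in the data, and it takes T_s to be the lowest tested threshold with no false positives, which is one possible reading of "stringent" and not necessarily the rule used for Table 3. All scores and thresholds are hypothetical.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are kept as predictions."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        kept = s >= threshold
        if kept and y == 1:
            tp += 1
        elif kept and y == 0:
            fp += 1
        elif not kept and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity, accuracy) rows."""
    rows = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_at(scores, labels, t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        rows.append((t, sensitivity, specificity, accuracy))
    return rows


def pick_thresholds(rows):
    """T_o maximizes accuracy; T_s (assumed rule) is the lowest threshold with no FP."""
    t_o = max(rows, key=lambda r: r[3])[0]
    zero_fp = [r[0] for r in rows if r[2] == 1.0]
    t_s = min(zero_fp) if zero_fp else None
    return t_o, t_s


# Hypothetical multi-filter scores: four positives followed by four negatives.
scores = [0.95, 0.85, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
rows = sweep_thresholds(scores, labels, [0.2, 0.35, 0.5, 0.58, 0.7])
t_o, t_s = pick_thresholds(rows)
```

When several thresholds tie for the maximum accuracy, this sketch keeps the first one tested; a real analysis would need an explicit tie-breaking rule.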
Remarkably, the linear regression model with the T_s threshold produced 84% true positives with no false positives (Table 3), and the neural network model produced 83% true positives with less than 0.3% false positives. The ROC analysis further validated the optimized multi-filter approach as being far superior to any one filter by itself (Fig. 1). These ROC plots showed that each multi-filter model significantly outperformed any single filter, with AUCs above 0.95, whereas the AUCs for individual filters ranged from 0.72 to 0.88. The neural network had an AUC of 0.998, indicating that it is a superior filter model. This AUC was significantly better than that of the linear regression and the support vector machine models. The identification of highly efficient and accurate minimotif filter approaches represents an important milestone in the prediction of minimotifs.
In most minimotif searches the number of negatives far outweighs the number of true positives. Therefore, we also repeated the training and testing analysis on a larger data set in which the negative data size was increased 5-fold (10,000 randomly generated negative data points). This analysis of the combined filters showed a modest increase in the AUC and accuracy for all three models, further supporting this approach for minimotif identification.
Since some of the individual filters had non-significant P values (Fig. 1), we questioned whether all five minimotif filters were needed to achieve the high level of accuracy. We repeated the filter analysis to find the best performing of the five four-filter combinations for each model. The average value of the AUCs and the standard deviation (STD) of the 5-fold cross validation were calculated, and a t-test was used to test which filters were optimal (P<0.05; Table 4). When the t-test identified more than one filter with similar performance, we reported the filter with the highest average AUC, but also list the alternative filter combinations. The same approach was used to successively identify the best three-filter and two-filter combinations (Table 4).
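The enumeration of filter subsets and the per-combination summary statistics can be sketched as follows. The filter names come from the text; the per-fold AUC values are hypothetical placeholders, and the t-test itself is not shown.

```python
from itertools import combinations
from statistics import mean, stdev

FILTERS = ("CF", "MF", "FS", "PPI", "SP")


def filter_combinations(size):
    """All filter subsets of the given size, e.g. the five four-filter subsets."""
    return list(combinations(FILTERS, size))


def summarize_folds(fold_aucs):
    """Mean and sample standard deviation of per-fold AUC values."""
    return mean(fold_aucs), stdev(fold_aucs)


four_filter_sets = filter_combinations(4)

# Hypothetical AUCs from five cross-validation folds of one combination.
avg_auc, std_auc = summarize_folds([0.90, 0.92, 0.94, 0.96, 0.98])
```

There are five four-filter, ten three-filter, and ten two-filter subsets, so a full comparison covers 26 combinations per model including the five-filter set.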
To identify the best performing filters, the t-tests were also used to compare two-, three-, four-, and five-filter combinations based on AUCs. One of the best filter combinations was the neural network model with the MF+FS+PPI+SP filters, having an AUC of 99.5%. This combination had an accuracy of 96.4% at the optimal threshold and an accuracy of 95.0% at the stringent threshold. For the linear regression, the FS+PPI two-filter combination was significantly better than the other filter combinations. For the support vector machine, FS+SP was significantly better than FS+PPI+SP and CF+MF+FS+PPI+SP. For the neural network, MF+FS+PPI+SP and FS+PPI+SP were significantly better than the other filter combinations. Collectively, this analysis identified the best model and filter combinations for increasing the accuracy of minimotif predictions.

Implementation
We have now implemented multi-filtering on the Minimotif Miner website to help eliminate false-positive predictions (http://mnm.engr.uconn.edu and http://minimotifminer.org). The minimotif results table now lists the predictions ranked by the five-filter linear regression multi-filter score. We chose this model over the neural network because it produced so few false positives while maintaining an accuracy very similar to that of the neural network. We chose the five-filter combination because, even though it had only a non-significant increase in AUC over some two-, three-, and four-filter combinations, we could identify a threshold that had high accuracy with no false positives and a high percentage of true positives. Minimotifs with a score larger than 0.48 (the threshold above which only true positives score, and at which the maximum accuracy of 92.1% is reached) are colored green, those with a score below 0.33 (the intersection of sensitivity and specificity shown in Fig. 2A) are colored red, and those between 0.33 and 0.48 are colored yellow. Minimotifs for which information is lacking, and hence no score can be calculated, are also colored red. A test of 20 randomly selected queries shows that, on average, 83% of minimotif predictions are rejected when the threshold of 0.33 is used and 88% are rejected when 0.48 is used. This demonstrates that the trained filter successfully reduces the number of candidate minimotifs, and the analysis of the global test set shows that most of the removed minimotifs are likely false positives.
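The color assignment described above amounts to a three-way threshold rule. The following sketch is not the website's code; the function name and the use of None for a missing score are assumptions, and scores exactly equal to 0.33 or 0.48 are treated as yellow, following the wording "larger than" and "below".

```python
GREEN_THRESHOLD = 0.48  # from the text: only true positives score above this
RED_THRESHOLD = 0.33    # from the text: sensitivity/specificity intersection


def color_for_score(score):
    """Map a multi-filter score to a display color; None means no score."""
    if score is None:
        return "red"      # information lacking, so no score can be calculated
    if score > GREEN_THRESHOLD:
        return "green"
    if score < RED_THRESHOLD:
        return "red"
    return "yellow"
```

For example, `color_for_score(0.60)` gives "green", `color_for_score(0.40)` gives "yellow", and `color_for_score(None)` gives "red".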

Discussion
Minimotifs are, by definition, short; they are therefore of low complexity and highly prone to false-positive predictions, which limits their usefulness. As a result, tools that predict new minimotifs have developed scoring techniques or filter approaches. Even though a number of such scoring mechanisms are known, their effectiveness in reducing the false positive rate has been limited [3,7-9,12,13,16]. One approach has been to try to increase the expected value by reducing the search space [17]. Most other approaches use different types of information to eliminate false positives. In our prior work we considered pairwise combinations of select filters and found better filtering efficiency [13].
In this paper, we have developed and tested a new approach that combines multiple filters to achieve more effective filtering. An important decision in this case is how to create a composite score from several disparate scoring metrics. We take the general view that the combination problem can be posed as one of learning. We have investigated three important learning techniques, namely linear regression, neural networks, and the support vector machine. Neural networks have been employed to solve different learning problems in biology, such as identifying tyrosine-based sorting signals and nucleolar localization sequences [18,19]. Likewise, linear regression and support vector machines have been fruitfully employed in examples such as DNA splice site prediction and the prediction of antifreeze protein sequences and NAD+ binding sites [20][21][22]. The suitability of these techniques for a given application can only be decided empirically because they do not easily lend themselves to complexity analysis. For instance, even for simple neural networks, convergence proofs are hard to derive. Similarly, for support vector machines, the separation achievable between the hyperplanes depends not only on the application, but also on the specific set of data points.
Our empirical results show the robustness of the multi-filter in eliminating false positives and reaching a high accuracy. However, because it joins the knowledge from each individual filter, the multi-filter also has limitations. The multi-filter works only if all the information required by each individual filter is known for a minimotif, that is, if all individual filters give valid results. Missing information for one individual filter, or incomplete data, will limit the effectiveness of the multi-filter. This is part of the rationale for choosing the five-filter combination over other combinations with fewer filters and similar levels of significance. Also, there is bound to be bias in the datasets used in this analysis: the true positives are those reported for well-studied proteins, and a tiny portion of false negatives is acknowledged to be introduced in our generation of the negative dataset. Despite these limitations, the combined score, with its excellent accuracy, represents the state-of-the-art in minimotif prediction and will be of great importance for biologists investigating proteins and disease mechanisms.

Data Sources
To both train and evaluate the multi-filter, it was necessary to compare a dataset of verified minimotifs with one containing known negatives. A set of ~5,300 verified minimotifs exists in the MnM 2 database [8]. However, due to the nature of the individual filtering mechanisms, not all filters give definite results for each minimotif (for instance, minimotifs in which either the target or source proteins are undefined). The inclusion of such instances would bias the training towards those filters that can act on incomplete definitions. Thus, the verified dataset was pruned to the 2,000 minimotifs that had unique source proteins and for which each filter yields a definite result, termed the "validated positive dataset".
Since some minimotif source proteins in the Minimotif Miner database have more than one target, we wanted to ensure that this was not providing a strong bias to our minimotif filtering analyses. We randomly selected 100 minimotif source proteins and performed pairwise alignment to all other minimotif source proteins in the dataset using BLAST [23]. Approximately 10% of the source sequences had a bit score >30. Since this is often considered a threshold for common ancestry, this analysis indicates that there is some sequence similarity in the dataset used, but not enough to impact our conclusions.
Unfortunately, no database of verified negative minimotifs exists. Therefore, a negative dataset was computationally generated as previously described for our analyses of the PPI, CF and MF filters [12,13]. First, pairs of (source protein, target protein) were randomly generated without duplicates. For each source protein, minimotifs were found by sequence matching against the minimotifs in the MnM database. In this manner, unique tuples of (source protein, minimotif, target protein) were generated. We created two data sets, one with the same number of data points as the positive dataset and one with a 5-fold excess of negative data points. These entries were treated as the negative dataset and were estimated to contain a negligible number of false negatives, which we expect to have negligible impact on the conclusions of our paper.
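The generation procedure can be illustrated with a small sketch. It is not the code used in the study: the protein identifiers, sequences, and the single "PxxP" pattern are hypothetical, motifs are assumed to be expressible as regular expressions, and excluding pairs in which a protein is paired with itself is an added assumption.

```python
import random
import re


def generate_negative_tuples(proteins, motif_patterns, n_pairs, seed=0):
    """Build unique (source, motif, target) tuples from random protein pairs.

    proteins: dict mapping protein id -> amino acid sequence (hypothetical input).
    motif_patterns: dict mapping motif name -> regular expression (hypothetical).
    """
    rng = random.Random(seed)
    ids = sorted(proteins)
    # Assumption: a protein is never paired with itself.
    all_pairs = [(s, t) for s in ids for t in ids if s != t]
    # Sampling without replacement guarantees no duplicate pairs.
    pairs = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    tuples = set()
    for source, target in pairs:
        for name, pattern in motif_patterns.items():
            if re.search(pattern, proteins[source]):
                tuples.add((source, name, target))
    return sorted(tuples)


# Toy inputs; none of these sequences come from the MnM database.
proteins = {"P1": "MKRPPLPSRA", "P2": "AAAGGG", "P3": "MSPTPPKKR"}
motif_patterns = {"PxxP": r"P..P"}
negatives = generate_negative_tuples(proteins, motif_patterns, n_pairs=6)
```

Only source proteins whose sequence matches a motif pattern contribute tuples, so P2 never appears as a source in this toy example.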

Linear Regression
In linear regression, it is assumed that the relationship between a dependent variable and the associated independent variables is approximately linear and the model postulates the formula in (eq. 1).
Given n statistical observations of Y and X, the linear regression problem is to find $\beta = (\beta_0, \beta_1, \ldots, \beta_m)$ such that the linear model best predicts Y from X, for example by minimizing $\sum_j \varepsilon_j^2$ in the least squares approach.
In the filter combination we envision that the independent variables are the outputs of the PPI filter (PPI), cellular function filter (CF), molecular function filter (MF), frequency score filter (FS), and the surface prediction filter (SP). The value of the dependent variable, called Score, is 1 or 0 and is decided based on whether the training entry is from positive data or negative data, respectively. Thus, the combination model is shown in formula (eq. 2).
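The two formulas referred to as eq. 1 and eq. 2 are not present in this copy of the text. Assuming the standard multiple linear regression form, with coefficients $\beta_0, \ldots, \beta_m$ and an error term $\varepsilon$, they would plausibly read as follows; the order of the filter terms in eq. 2 is an assumption.

```latex
% Eq. 1 (assumed standard form): general linear model
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m + \varepsilon \tag{1}

% Eq. 2 (assumed form): the five filter outputs as independent variables
\mathrm{Score} = \beta_0 + \beta_1\,\mathrm{PPI} + \beta_2\,\mathrm{CF}
               + \beta_3\,\mathrm{MF} + \beta_4\,\mathrm{FS}
               + \beta_5\,\mathrm{SP} + \varepsilon \tag{2}
```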
The outputs of the cellular function filter and the molecular function filter are not binary. These filters output the shortest distance, or the least number of edges between cellular or molecular functions associated with the source and the target proteins in this training phase. Similarly, the frequency score filter outputs

Support Vector Machine
The support vector machine is a training and learning technique for classifying data of different classes. In contrast to linear regression, which fits a hyperplane that lies as close as possible to the data points, the support vector machine produces a separating hyperplane that maximizes the margin between the closest points of two classes of data. That is, given $X = \{x_i \mid x_i \in \mathbb{R}^n\}$ and $Y = \{y_i \mid y_i \in \{-1, 1\}\}$, where $x_i$ is a data point in n-dimensional space and $y_i$ is the class to which $x_i$ belongs, the algorithm identifies a hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that maximizes the distance between two parallel hyperplanes ($\vec{w} \cdot \vec{x} + b = 1$ and $\vec{w} \cdot \vec{x} + b = -1$) which separate the data points into two groups. The distance between those two hyperplanes is $2 / \|\vec{w}\|$. Therefore, the support vector machine tries to find a hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that minimizes $\|\vec{w}\|$, given X and Y.
The original support vector machine is a linear classification technique [24]. With a kernel function, the non-linear support vector machine can be created to get a curve with the maximum margin [25]. Non-linear separation is achieved by transforming the data points from the original space into a new space in which they can be more easily separated, which is done by replacing the linear dot product operations for vectors with (non-linear) kernel functions. Several kernel functions can be used in the support vector machine, like polynomial, radial basis function, and sigmoid. A linear kernel also exists to recover the computation back to the linear support vector machine. In the proposed method, we used the original linear support vector machine model based on the assumption of the independence of individual filters.
The training data for the support vector machine were collected as for the linear regression, except that the Score parameter for positive and negative data is not necessary here. Each filter is treated as one dimension of a multi-dimensional space, so a motif is located at the coordinate given by the output values of its filters, (PPI, CF, MF, FS, SP). The support vector machine constructs a hyperplane that separates the motif points of positive data from those of negative data in the training phase, and this hyperplane is then tested in the evaluation.
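To make the geometry concrete, the sketch below evaluates a given linear hyperplane on a five-dimensional filter vector and computes the margin width 2/||w||. It does not train a support vector machine; the weight vector and offset are hypothetical values chosen for easy arithmetic, not parameters from the study.

```python
import math


def decision_value(w, b, x):
    """Signed value w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 for the positive side of the hyperplane, otherwise -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical weights over (PPI, CF, MF, FS, SP); not values from the study.
w = [3.0, 0.0, 0.0, 4.0, 0.0]
b = -3.5

motif_point = [1.0, 0.0, 0.0, 1.0, 0.0]
predicted_class = classify(w, b, motif_point)
```

Here ||w|| is 5, so the margin width is 0.4, and the example point lies on the positive side of the hyperplane.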

Neural Network
A neural network, or artificial neural network, is a model that simulates biological neural networks. Neurons are the basic units or nodes in this network and are interconnected layer by layer. Each neuron is connected to neurons in adjacent layers through weighted edges. Each neuron works independently and accepts inputs (or input signals) from the previous layer. Layer by layer, the outputs of all the neurons are combined into a final output that models the relationship between the inputs and their desired output. Sometimes an activation function determines whether to activate a neuron by thresholding its input values.
Mathematically, the neural network uses a function $f: X \to Y$, in which each neuron contributes to the final output based upon edge weights. The output of the $j$th neuron on the $i$th layer of a neural network is a function $f^i_j(x)$ that is based on the outputs from the $(i-1)$th layer, $f^{i-1}$. The output of the $j$th neuron on the first layer is defined as $f^1_j(x) = \sum_k w^0_k x_k$, where $w^0_1, w^0_2, \ldots, w^0_k$ are the weights on the inputs $x_1, x_2, \ldots, x_k$.
In the filter combination we have constructed, the outputs of the individual filters (PPI, CF, MF, FS, SP) are used as input data X ($x_1 = \mathrm{PPI}$, $x_2 = \mathrm{CF}$, $x_3 = \mathrm{MF}$, $x_4 = \mathrm{FS}$, $x_5 = \mathrm{SP}$), and for Y we use 1 for positive data and 0 for negative data. By training this model, it is expected that the filter's output will be ~1 for positive data and ~0 for negative data. When training and testing the neural network model, hidden layers were eliminated as much as possible without sacrificing performance. The reported neural network has two hidden layers.
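A forward pass through a network with five inputs, two hidden layers, and one output can be sketched as below. The layer sizes, the weights, and the use of a sigmoid activation on every layer are assumptions made for illustration; the text specifies only the five inputs and the number of hidden layers, and the weights here are untrained.

```python
import math


def sigmoid(z):
    """Logistic activation squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def forward(x, layers):
    """Propagate x through a list of (weights, biases) layers."""
    activations = x
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations[0]


# Hypothetical, untrained weights: 5 inputs -> 3 -> 2 -> 1 output.
layers = [
    ([[0.5] * 5, [-0.3] * 5, [0.2] * 5], [0.0, 0.1, -0.1]),
    ([[0.4, -0.6, 0.3], [0.1, 0.2, -0.5]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

# Input order follows the text: (PPI, CF, MF, FS, SP).
score = forward([1.0, 0.2, 0.4, 0.7, 1.0], layers)
```

Because the output neuron uses a sigmoid, the score always lies strictly between 0 and 1, which matches the training targets of 1 for positive and 0 for negative data.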

Cross Validation
A 5-fold cross-validation was used to validate the linear regression, support vector machine, and neural network models as follows: 1) partition the positive and negative data into five equally sized groups, each with 400 positive and 400 negative data points; 2) leave one group out, use the remaining data to train a model, and test it with the left-out group of data; 3) repeat training and testing five times until each group has been used as testing data once. The filters were evaluated with Receiver Operating Characteristic (ROC) curves. A threshold is used to determine whether a new query (source, motif, target) should be retained or eliminated by the multi-filter. The optimal threshold is determined recursively for a maximum accuracy (eq. 3, TP: true positive; TN: true negative; FP: false positive; FN: false negative; P: positive data; N: negative data).
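The partitioning in steps 1 to 3 can be sketched as follows. The entries are placeholders standing in for the 2,000 positive and 2,000 negative data points; the function name, the fixed random seed, and the way the groups are formed are assumptions for illustration.

```python
import random


def five_fold_splits(positives, negatives, k=5, seed=0):
    """Yield (train, test) lists; each fold holds an equal share of both classes."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Fold i takes every k-th shuffled item, starting at offset i.
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Placeholder entries standing in for the positive and negative datasets.
positives = [("pos", i) for i in range(2000)]
negatives = [("neg", i) for i in range(2000)]
splits = list(five_fold_splits(positives, negatives))
```

Each of the five test groups contains 400 positive and 400 negative entries, and every entry is used for testing exactly once.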

$$\mathrm{Accuracy} = \frac{TP \times P + TN \times N}{P + N} \qquad (3)$$
The Matthews Correlation Coefficient (MCC) is used to show the performance of the multi-filter (eq. 4). A value of 1 indicates a perfect prediction, and a value of 0 indicates a prediction no better than random.
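Both measures can be computed from the four confusion-matrix counts, as in the sketch below. It uses raw counts rather than rates; the accuracy function agrees with eq. 3 if TP and TN in that equation are read as the true positive and true negative rates, which is an assumption. The MCC function implements the standard definition of the coefficient, and returning 0.0 when the denominator is zero is a common convention rather than something stated in the text.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct, from raw counts."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from raw counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A perfect classifier, for example `matthews_corrcoef(50, 50, 0, 0)`, gives 1.0, while a classifier whose errors are evenly spread, such as `matthews_corrcoef(25, 25, 25, 25)`, gives 0.0.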