Figure 1.
Assessing Similarity-Based Inference.
(A) The plot assesses the errors made when inferring GO terms from the nearest neighbor of each protein. The inferred annotations are sorted according to the similarity measures (CE, TM, LP, GP) and then binned such that each bin contains an equal number of annotations (ca. 670). This makes the error rates comparable across similarity measures that operate on different scales. The x-axis denotes the range of similarity scores falling into each bin, the y-axis the ratio of correct annotations in that range. (B) In contrast to (A), the inferred annotations are sorted according to raw function conservation scores, based on the similarity measures (CE, TM, LP, GP). The x-axis denotes the range of raw function conservation scores falling into each bin, the y-axis the ratio of correct annotations in that range.
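The equal-count binning described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `equal_count_bins` and its arguments are hypothetical, and the number of bins is a free parameter here, whereas the caption fixes the bin size at roughly 670 annotations.

```python
def equal_count_bins(scores, correct, n_bins):
    """Sort inferred annotations by score and split them into bins of
    (nearly) equal size.

    scores  -- one similarity (or conservation) score per inferred annotation
    correct -- one boolean per inferred annotation (True if it was correct)
    Returns one (min_score, max_score, fraction_correct) tuple per bin.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if n_bins <= 0 or n_bins > len(scores):
        raise ValueError("n_bins must be between 1 and the number of annotations")

    # Sort by score only, so the bin boundaries depend on the score scale
    # of the measure but every bin holds the same number of annotations.
    pairs = sorted(zip(scores, correct), key=lambda p: p[0])
    n = len(pairs)

    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        lowest, highest = chunk[0][0], chunk[-1][0]
        fraction = sum(1 for _, ok in chunk if ok) / len(chunk)
        bins.append((lowest, highest, fraction))
    return bins
```

Because each bin holds the same number of annotations, the fraction of correct annotations per bin can be compared between measures even when their score ranges differ.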
Figure 2.
Impact of Different Similarity Measures on Inferring Function.
The four-set Venn diagram covers the correct GO terms inferred from the neighbors based on the individual similarity measures. Each ellipse represents the number of GO terms correctly inferred using one similarity measure. The numbers of GO terms correctly inferred by several similarity measures are shown in the intersections between two or more ellipses.
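The region counts of such a Venn diagram can be derived from four sets of correctly inferred annotations. The sketch below assumes each annotation is represented by a hashable item such as a (protein, GO term) pair; the function name `venn_regions` and the input layout are hypothetical.

```python
def venn_regions(correct_by_measure):
    """Count items in each exclusive region of a Venn diagram.

    correct_by_measure -- dict mapping a measure name (e.g. "CE") to the
                          set of annotations it inferred correctly
    Returns a dict mapping a tuple of measure names to the number of
    annotations inferred correctly by exactly those measures.
    """
    names = sorted(correct_by_measure)
    all_items = set().union(*correct_by_measure.values())

    regions = {}
    for item in all_items:
        # The region key lists every measure that got this annotation right.
        key = tuple(name for name in names if item in correct_by_measure[name])
        regions[key] = regions.get(key, 0) + 1
    return regions
```

The count for a whole ellipse is the sum over all regions whose key contains that measure.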
Figure 3.
We exemplify the GOdot method on a set of five template proteins (t1–t5) having two different molecular functions (drawn in yellow and red, respectively). The training procedure (top row) consists of similarity calculations (A1), yielding four different similarity matrices, one of which is shown (A2). Based on these similarities, logistic curves are fitted for each molecular function in the dataset (A3). The prediction (bottom row) comprises similarity computations between the query protein and the proteins in our dataset (B1), which are then used to predict the conservation of molecular functions in the query's proximity (B2). The final ranking of GO terms is obtained using combination schemes along the GO graph structure (B3). See Methods section for details.
Figure 4.
(A) Using TM to identify the nearest neighbor of the sample query protein 1ve3 yields protein domain d1vlma. For d1vlma the TM scores were pre-computed, resulting in the neighborhood illustrated here with Kruskal's non-metric multidimensional scaling [44] (where similar protein structures are depicted close together). Domain d1vlma has several molecular functions attached; for this illustration we selected GO:0008757 (S-adenosylmethionine-dependent methyltransferase activity). Protein domains having this function are colored yellow, domains not annotated with this function are colored grey. (B) TM scores with respect to d1vlma are sorted along the x-axis. Protein domains annotated with molecular function GO:0008757 are assigned a y coordinate of 1 (drawn in yellow), domains not annotated with this function are assigned a y coordinate of 0 (drawn in grey). Unlabeled domains are from the 200 nearest neighbors of d1vlma. A logistic curve is fit through these points (drawn in orange). Evaluating the logistic curve at a given TM score yields the raw function conservation score.
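The curve fit in panel (B) can be illustrated with a one-dimensional logistic regression on (similarity score, 0/1 label) points. The sketch below is an assumption-laden illustration: the caption does not say how the curve is fitted, so plain gradient descent is used as a stand-in, and the names `fit_logistic` and `conservation_score` as well as the learning rate and step count are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit p(x) = sigmoid(a * x + b) to points with labels in {0, 1}.

    Uses batch gradient descent on the mean logistic loss. This is a
    stand-in optimizer, not necessarily the procedure used in the paper.
    Returns the slope a and the intercept b.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = _sigmoid(a * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def conservation_score(x, a, b):
    """Evaluate the fitted curve at similarity score x."""
    return _sigmoid(a * x + b)
```

For example, with `a, b = fit_logistic(tm_scores, has_function)`, calling `conservation_score(0.8, a, b)` reads off the fitted curve at a TM score of 0.8.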
Figure 5.
Comparing Similarity Scores to Raw and Combined Function Conservation Scores.
The ROC plot serves to analyze the reliability of inferring GO level three functional annotations from the nearest protein neighbors. For each protein domain, nearest neighbors are sought according to the four similarity measures (CE, TM, LP, GP). The GO terms attached to these nearest neighbors can potentially be inferred for a query protein. By sorting annotation transfers according to the similarity scores and evaluating the true positive rate versus the false positive rate, a ROC curve is derived. The black curve displays the average ROC curve for the four similarity measures (CE, TM, LP, GP); the attached boxplots serve to estimate the observed spread. Similarly, when sorting according to raw function conservation scores, we obtain four ROC curves, the average of which is shown as a green curve along with the estimated spread as boxplots. Merging the information into a combined consensus score yields one score per inferred annotation; the corresponding ROC curve is plotted in violet for selective combination and in blue for consensus combination.
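Deriving a ROC curve from sorted annotation transfers can be sketched as below. This is a generic illustration rather than the evaluation code behind the figure: the function names `roc_points` and `auc` are hypothetical, and tied scores are grouped into a single threshold step as a design choice of this sketch.

```python
def roc_points(scores, labels):
    """Compute ROC points by sweeping a threshold from high to low scores.

    scores -- one score per annotation transfer
    labels -- True if the transferred annotation was correct
    Returns a list of (false_positive_rate, true_positive_rate) points,
    starting at (0.0, 0.0) and ending at (1.0, 1.0).
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one correct and one incorrect transfer")

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Consume all transfers sharing the same score before emitting a point.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Averaging several such curves, as done for the black and green curves, would additionally require evaluating each curve on a common grid of false positive rates.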
Table 1.
Evidence Codes Used by the Gene Ontology Annotation Project.
Figure 6.
On Experimental Annotation Data Only.
Comparing similarity scores to raw and combined function conservation scores. ROC analysis on a reduced, high-quality data set containing only experimental annotation data (evidence codes IDA, IEP, IGI, IMP, IPI) for 629 proteins. The black curve displays the average ROC curve for the four similarity measures (CE, TM, LP, GP); the boxplots estimate the observed spread. The green curve corresponds to the average of the four raw function conservation scores. The ROC performance of selective and consensus combination is shown with the violet and blue curves, respectively.
Figure 7.
Evaluation According to PHUNCTIONER Protocol.
Following the protocol described for evaluation of the PHUNCTIONER method in [17], the ROC curve considers only the highest scoring predicted level three GO term for each query protein. A diagonal line in the ROC plot indicates random predictor performance. Optimal performance is demonstrated by a curve passing through the upper left corner.
Figure 8.
(A) Structural neighbors of hypothetical protein TT1426 (PDB 1wd5) according to TM-align. The image was generated by multidimensional scaling in the same way as Figure 4A. Proteins annotated with GO term GO:0016757 (glycosyltransferase activity) are colored yellow and form a large group on the lower left, where the query is also located. The glycosyltransferase group is subdivided into subgroups. In general these subgroups are associated with different substrates, in particular adenine phosphoribosyltransferase (d1l1qa, d1g2qa), uracil phosphoribosyltransferase (d1o5oa, d1a3c), or xanthine/hypoxanthine/guanine phosphoribosyltransferases (d1nula, d1hgxa, d1dqna, d1j7ja, d1bzya). Proteins not annotated with GO:0016757 are colored grey. They are less structurally related to the query than the glycosyltransferases, and accordingly they group separately on the right and top. (B) Structural superposition of query TT1426 (PDB 1wd5 [41] in light blue) and the nearest neighbor, xanthine phosphoribosyltransferase (ASTRAL d1nula [45] in gold). The conserved 5-phosphoribosyl-1-pyrophosphate (PRPP)-binding motif characteristic of type I PRTases is colored pink in 1wd5 and violet in d1nula. Residues Arg32 and Lys56 in the query 1wd5 are shown as blue sticks. They are likely to be functionally relevant (involved in binding the pyrophosphate [41]). The structurally equivalent residues in the nearest neighbor are shown in orange. The structural differences in helices α3 and α4, as well as in the substrate-binding C-terminal hood region (helices α7 and α8), indicate that the two proteins might have different substrates.
Figure 9.
Selective and Consensus Combination Schemes.
Examples of selective (A) and consensus (B) raw score combinations. (A) and (B) both show a subgraph of the full Gene Ontology. Raw function conservation scores were mapped to specific GO terms (red). We compute combined function conservation scores for more general GO terms (orange) using the selective and consensus combination schemes. Grey nodes indicate GO terms that were not predicted by the method.