Table 1.
Mathematical notation.
Table 2.
Model parameters.
Table 3.
Parameters for the beta priors used in MCMC and MAP estimation.
Shape parameters (α, β) were set to encode the prior belief that functional gain and loss (μ01d, μ10d) are more likely at duplication events.
Fig 1.
ROC curve for each estimation method.
As reflected in the parameter estimates, ROC curves are very similar across all four methods.
Table 4.
Parameter estimates for model with experimentally annotated trees.
Overall, the parameter estimates are highly correlated across methods. With the exception of the root-node probabilities, most parameter estimates are essentially the same regardless of the method. As expected, mutation rates at duplication nodes, indexed with a d, are higher than those observed at speciation nodes, indexed with an s.
Table 5.
Differences in Mean Absolute Error (MAE).
Each cell shows the 95% confidence interval for the difference in MAE resulting from MCMC estimates using different priors (row method minus column method). Cells are colored blue when the method on that row has a significantly smaller MAE than the method on that column; conversely, cells are colored red when the method in that column outperforms the method in that row. Overall, predictions based on pooled-data parameter estimates outperform those based on one-at-a-time estimates.
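For concreteness, the quantities in this table can be sketched in a few lines of Python: the MAE between predicted probabilities and 0/1 labels, and a paired confidence interval for the per-tree difference between two methods. This is a minimal illustration with a normal-approximation interval (z = 1.96) and hypothetical inputs; the paper's intervals may be computed differently (e.g., via a t distribution).

```python
import statistics

def mae(pred, truth):
    """Mean absolute error between predicted probabilities and 0/1 labels."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def paired_diff_ci(mae_a, mae_b, z=1.96):
    """Normal-approximation 95% CI for mean(mae_a - mae_b) across trees.

    mae_a and mae_b are per-tree MAEs for two methods (paired by tree).
    """
    diffs = [a - b for a, b in zip(mae_a, mae_b)]
    m = statistics.mean(diffs)
    se = statistics.stdev(diffs) / len(diffs) ** 0.5
    return (m - z * se, m + z * se)
```

An interval entirely below zero would correspond to a blue cell (the row method has a significantly smaller MAE); an interval entirely above zero would correspond to a red cell.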
Fig 2.
Boxplots of MAEs as a function of single-parameter updates. Sub-figures A1 through D show how the MAEs change as the given parameter is fixed at values ranging from 0 to 1. In each of these plots, the white box indicates the parameter value used to generate the data (i.e., the “correct” parameter value). The last boxplot, sub-figure E, shows the distribution of the MAEs as a function of the amount of available annotation data, that is, how prediction error changes as annotations are randomly removed from the available data.
Fig 3.
Number of annotated leaves for all 138 trees by type of annotation.
Most of the annotations available in GO are positive assertions of function, that is, 1s. Furthermore, as the figure shows, all of the trees used in this study have at least two “Not” (i.e., 0) annotations, a direct consequence of the selection criteria, as we explicitly required a minimum of two annotations of each type per tree+function.
Fig 4.
Number of internal nodes by type of event.
Each bar represents one of the 138 trees used in the paper, broken down by type of event (mostly speciation). The embedded histogram shows the distribution of the prevalence of duplication events per tree. Duplication events are the minority, about 20%.
Table 6.
Effects of missing data on MAE on 138 trees.
The null hypothesis for the paired t-test is MAE(X% of annotations randomly dropped) − MAE(baseline) = 0. As more annotations are randomly removed from the tree, the mean absolute error increases significantly.
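The paired t statistic underlying this test can be sketched as follows, using only the standard library. The per-tree MAE vectors below are hypothetical; in practice the statistic would be compared against a t distribution with n − 1 degrees of freedom (or computed directly with scipy.stats.ttest_rel).

```python
import statistics

def paired_t_stat(mae_dropped, mae_baseline):
    """t statistic for H0: mean(MAE_dropped - MAE_baseline) = 0,
    with observations paired by tree."""
    diffs = [d - b for d, b in zip(mae_dropped, mae_baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
```

A large positive statistic indicates that dropping annotations increases the MAE, matching the pattern reported in the table.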
Fig 5.
Mean Absolute Error versus number of annotations on 138 trees.
Each point represents a single tree+function, colored by the number of negative annotations (“absent”). The x-axis is in log scale, and the figure includes a locally estimated scatterplot smoothing (LOESS) curve with a 95% confidence interval.
Fig 6.
The first set of annotations, first column, shows the experimental annotations of the term GO:0001730 for PTHR11258. The second column shows the 95% CI of the predicted annotations. The column ranges from 0 (left end) to 1 (right end). Bars closer to the left are colored red to indicate that lack of function is suggested, while bars closer to the right are colored blue to indicate that function is suggested. Depth of color corresponds to strength of inference. The AUC for this analysis is 0.91 and the Mean Absolute Error is 0.34.
Fig 7.
The first set of annotations, first column, shows the experimental annotations of the term GO:0004571 for PTHR45679. The second column shows the 95% CI of the predicted annotations using leave-one-out. Bars closer to the left are colored red to indicate that lack of function is suggested, while bars closer to the right are colored blue to indicate that function is suggested. Depth of color corresponds to strength of inference. The AUC for this analysis is 0.33 and the Mean Absolute Error is 0.52.
Fig 8.
ROC curve for aphylo and SIFTER predictions.
This figure includes 184 annotations on 147 proteins. Of the 184 annotations, 18 were negative. The corresponding AUCs for these curves are 0.72 for aphylo, and 0.60 and 0.52 for SIFTER using truncation levels 1 and 3, respectively.
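The AUC values reported here can be computed, equivalently to the area under the ROC curve, as the probability that a randomly chosen positive annotation receives a higher score than a randomly chosen negative one (ties counting one half). A minimal sketch with hypothetical scores and labels:

```python
def auc(scores, labels):
    """AUC as the rank statistic: P(score_pos > score_neg), ties = 0.5.
    labels are 0/1; equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With only 18 negatives among 184 annotations, each misranked negative shifts the AUC noticeably, which is worth keeping in mind when comparing the 0.72, 0.60, and 0.52 figures.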
Fig 9.
Comparison of our 220 proposed annotations to annotations currently in the GO database.
Of the 220, 46 were not found in the GO database, 8 of which we proposed as negative assertions. Of the remaining 174 found in the GO database, five were inconsistent (having both a present and an absent annotation), two had an experimental evidence code, and the remaining 167 corresponded to annotations without experimental evidence.
Table 7.
List of proposed annotations for mouse proteins.
The 95% CI corresponds to the posterior probability under the model. We suggest that values with a CI close to zero indicate absence of function and should be annotated with the qualifier “NOT”. The Ref. column lists literature in which we found evidence for the corresponding annotation, and the last column lists evidence codes found for annotations to the same term in the GO database (note that all corresponding GO database annotations assert presence of function). Notes: (1) Exp. evidence found in literature. (2) Not present in our training dataset, but we correctly predicted an experimentally supported annotation.
Table 8.
Distribution of trees per number of functions (annotations) in PANTHER.
At least 50% of the trees used in this paper have 10 or more annotations.
Fig 10.
Distribution of AUCs and MAEs for the scenario with partially annotated trees and mislabeling.
The x-axis shows the proportion of missing annotations, while the y-axis shows the score (AUC or MAE).
Fig 11.
Empirical bias distribution for the fully annotated scenario by type of prior, parameter, and number of leaves.
Fig 12.
Empirical bias distribution for the partially annotated scenario by parameter and proportion of missing labels.
Fig 13.
Difference in the number of input proteins used for predictions.
Each bar represents a single annotation (prediction) to be made. The y-axis shows the difference between the number of input proteins used by aphylo and SIFTER. Negative values indicate that SIFTER included more proteins as input for making the prediction, whereas positive values indicate that aphylo included more. A paired t-test shows that, on average, SIFTER included more proteins than aphylo for each of its calculations.
Fig 14.
Scatter plot comparing all 184 prediction scores between SIFTER with truncation level 1 and SIFTER with truncation level 3.
Red triangles correspond to negative annotations, while blue triangles mark positive annotations. The six negative annotations with a black border highlight observations that were correctly classified with truncation level 1 but misclassified with truncation level 3. The coordinates were jittered to avoid overlap.