Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees

Gene function annotation is important for a variety of downstream analyses of genetic data, but experimental characterization of function remains costly and slow, making computational prediction an important endeavor. Phylogenetic approaches to prediction have been developed, but implementation of a practical Bayesian framework for parameter estimation remains an outstanding challenge. We have developed a computationally efficient, phylogeny-based model of the evolution of gene annotations, set in a Bayesian framework that uses Markov chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well in leave-one-out cross-validation, and we further validated some of the predictions against the experimental scientific literature.

For the analysis in our paper, we focused on proteins for which there were at least two proteins with an annotation of type "present", i.e., a "1", and two others with an annotation of type "absent", i.e., a "0", for the same function within the same tree; that is, at least four annotations. These requirements yielded an initial set of ~1,500 annotations over ~1,300 proteins, which we used for fitting our evolutionary model. However, when mapping those proteins to Pfam trees, we were not able to use all 1,300 with SIFTER: many of the resulting trees reduced either to a degenerate tree with a single leaf, or to a tree with a single function, in which case SIFTER cannot be used, since it requires at least two functions for inference (if there is just one function, it simply infers all leaves as having that function). Because of these considerations, the initial set of candidate proteins that SIFTER could analyze was reduced from 1,300 to 472.
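The filtering rule above can be sketched in a few lines. The data layout and function name below are purely illustrative and are not taken from our codebase:

```python
# Hypothetical sketch of the filtering rule: keep a (tree, function)
# pair only if it has at least two "present" (1) and at least two
# "absent" (0) annotations among its proteins.
from collections import defaultdict

def eligible_pairs(annotations):
    """annotations: iterable of (tree_id, function_id, protein_id, state)
    tuples with state in {0, 1}. Returns the set of (tree_id, function_id)
    pairs meeting the >=2 present and >=2 absent requirement."""
    counts = defaultdict(lambda: [0, 0])  # (tree, fn) -> [n_absent, n_present]
    for tree, fn, _protein, state in annotations:
        counts[(tree, fn)][state] += 1
    return {key for key, (n0, n1) in counts.items() if n0 >= 2 and n1 >= 2}
```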
After running SIFTER for those 472 proteins, the total number of proteins that could be compared with our method was further reduced to 147. This reflects two issues that we observed when running SIFTER: 1. In a pre-processing step, SIFTER prunes the GO tree, which ultimately translates into dropping some annotations. For example, if we passed SIFTER a dataset with 2 GO terms, in some cases it would drop one of those and proceed with the calculations using only a single GO term. This is described in detail on page 18 of the Supplemental Materials of Engelhardt et al. (2011): "[i]n the GO DAG, we first prune all ancestors of nodes with annotations (even if the ancestors themselves have annotations), then we prune all non-annotated nodes. This leaves a set of candidate functions that are neither ancestors nor descendants of each other, ensuring there are no deterministic dependencies between them in terms of the semantic network." 2. In another handful of cases, SIFTER was not able to find some of the proteins included in the analysis, mostly because the Pfam database it relies on is outdated.
After these cases were excluded, there was only one case in which SIFTER failed to converge and another case in which the final set of proteins was of size one, so leave-one-out cross-validation (LOO-CV) was not possible.
Thus, ultimately, because of the specific requirements of SIFTER, our comparison of methods is conducted on 147 of the ~1,300 proteins that we were able to analyze using our own methodology. Nonetheless, this is a reasonable amount of data with which to estimate true- and false-positive rates, so we proceeded with the comparison. The total number of annotations used in this comparison was 184, 18 of which were negative, spread across those 147 proteins.
We calculated mean absolute errors (MAEs) and ROC-AUC measurements for each method and report them in the paper in the new Figure 7, which shows the ROC curves for both analyses. We found that our method out-performed SIFTER on these data.
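For concreteness, the two metrics can be sketched in a few lines of plain Python (illustrative only, not the code from our benchmark repository), with the ROC-AUC computed via its rank-sum (Mann-Whitney) interpretation:

```python
# Mean absolute error between 0/1 annotations and predicted probabilities.
def mae(y_true, y_prob):
    return sum(abs(t - p) for t, p in zip(y_true, y_prob)) / len(y_true)

# ROC-AUC as the probability that a randomly chosen positive is scored
# above a randomly chosen negative, counting ties as 1/2.
def roc_auc(y_true, y_prob):
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```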
We note that this analysis required considerable effort in terms of code-writing and deep-diving into the innards of SIFTER. We are making available to the wider community all the code used to generate our benchmark comparison, together with detailed instructions on how to set up SIFTER and the data for this kind of analysis (which proved to be extremely non-trivial). That code is available on GitHub at https://github.com/USCbiostats/sifter-aphylo, as noted in the revised version of the paper.
These details are now included in the paper in two new subsections, at lines 440-487 and 732-789, respectively.
We note that, during the preparation of the revised manuscript, the number of experimentally annotated trees featured throughout the results section changed. The count decreased from 141 to 138 because it became apparent that three of those families included both negative and positive assertions for a handful of genes, which we had been treating as negative assertions only. After excluding those annotations, the three families had only positive assertions, which we could not use for our LOO-CV analysis.
Finally, we would like to mention that conducting this leave-one-out cross-validation (LOO-CV) process using SIFTER for the set of 147 proteins took about 110 minutes, whereas the analysis using our method, aphylo, took about one second.
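The LOO-CV procedure referred to throughout can be summarized by the following sketch, where `fit_and_predict` is a hypothetical stand-in for either method (aphylo or SIFTER); names and interfaces are illustrative:

```python
# Hedged sketch of leave-one-out cross-validation over annotated leaves:
# each annotated leaf is hidden in turn, the model predicts it from the
# remaining annotations, and predictions are pooled for scoring.
def loo_cv(annotations, fit_and_predict):
    """annotations: dict leaf -> 0/1. fit_and_predict(train, leaf) should
    return a probability that `leaf` has the function. Returns parallel
    lists of true states and predicted probabilities."""
    truths, preds = [], []
    for leaf, state in annotations.items():
        train = {k: v for k, v in annotations.items() if k != leaf}
        truths.append(state)
        preds.append(fit_and_predict(train, leaf))
    return truths, preds
```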
Big-picture issue: intent of paper. While we are happy to include the above benchmarking against SIFTER, and believe it improves the value of the paper, we also wish to note that our goal is not really to compete with, or out-perform, the annotation-prediction performance of SIFTER at this point. Instead, our paper aims to develop a model in which parameter estimation itself (the precursor to prediction) is feasible not only over a single gene family, but jointly over multiple gene families. We have attempted to clarify our intent by changing one sentence in the Motivation section of the abstract, and adding the following text to the introduction section of the paper (lines 51-71).
"We emphasize that in this paper we focus on developing a model in which parameter estimation is feasible not only over a single gene family, but jointly over multiple gene families. Parameter estimation in SIFTER was attempted for only a single, specific gene family (AMP/adenosine deaminase), and was otherwise infeasible due to multiple factors, including the use of a transition-based function evolution model that results in a combinatorially increasing number of parameters as the number of different functions increases. Further work on SIFTER (Engelhardt et al., 2011) developed an approximation to the likelihood calculation, and did not attempt parameter estimation, instead relying on fixed parameters.
Our goal in this paper is to provide a method that retains computational tractability when estimating the evolutionary parameters, and therefore does not require the use of fixed parameter estimates. Indeed, by adopting a Bayesian perspective we allow the user to account for uncertainty in the parameter estimates by integrating over the posterior distribution of those parameter values. In order to do so, we simplify the evolutionary model to treat gain and loss of each function independently, minimizing the number of model parameters. We consider only two properties of each gene family tree: topology and branch type, distinguishing only between branches following speciation versus gene duplication events. Thus, by simplifying the model we regain computational tractability on large trees without paying any great price in terms of prediction error (see Results). This paves the way for future, more refined versions of our model that add more realism to the evolutionary model." Thus, our aim here is to demonstrate that parameter estimation is feasible in such contexts, and additionally yields useful function predictions. We believe the comparison with SIFTER makes this point clearer, and we are grateful to the reviewers for suggesting it. But we have also added a discussion of this question of "goal" to the introduction to make our overall intent clearer. Finally, to emphasize the latter, we have changed the title of the manuscript from "On the automatic annotation of gene functions using observational data and phylogenetic trees", which was rather vague, to "Bayesian parameter estimation for automatic annotation of gene functions using observational data and phylogenetic trees", which more specifically indicates our focus.
We now address the comments of each reviewer in detail:

Reviewer #1

The authors present a computational method for modeling the evolution of gene/protein function annotations in a phylogenetic tree. Since such annotations are typically sparse and noisy, computational approaches for inferring annotations can aid biological and biomedical analyses, and the authors' use of phylogenetic trees for their evolutionary model provides biological justification for the inferred gene functions. The paper is generally clear and well written with appropriate mathematical notation for describing their method, which is easy to understand. However, it would benefit by addressing several questions and concerns.
1. How sensitive is the method to the topology of the phylogenetic trees or missing or spurious edges in the phylogenetic trees?
We have added the following discussion of this issue to the Discussion section (see lines 546-562): "Sensitivity to misspecified trees. Algorithms such as ours, and that of SIFTER, are built to exploit the information provided by the phylogenetic tree. Loosely speaking, leaves that are most closely related are most likely to have the same function. Whilst it clearly makes sense to exploit this information, it does imply that errors in tree topology may lead to errors in inference of function. In principle, one could extend an algorithm such as ours to explicitly mix over the space of possible trees, but this would have several consequences: [...]"

3. The authors could be more rigorous/precise about how they evaluate their method's performance. Some statements can be difficult to understand, such as "about 73% accuracy using the area under the curve statistic". Some mathematical and statistical terms are used inconsistently or interchangeably, such as accuracy and AUC/AUROC and MAE, or are described in non-standard ways. Unfortunately, these issues may be distracting and confusing to some readers.
Thank you for this observation. We were indeed using accuracy and MAE interchangeably, and understand that when we used the word accuracy, readers could confuse it with the actual metric of the same name. We now make sure that the word accuracy is used only when its meaning is clear, replacing most instances in which it was used with "MAE", "prediction error", or simply "error" across the manuscript.

4. The method appears to be very robust to missing data and surprisingly seems to result in higher performance with higher missingness. The MAEs of their results also appear to be close to 0.5 for many of the experiments, where 0.5 is what one would expect for a random classifier. It is not clear how well the method is performing or that it is performing as one might expect, e.g., better performance with more data. (Also, the authors say that a "paired t-test found a significant drop in the accuracy levels", but they're not comparing accuracy, the effect size seems very small, and it is unclear if the comparison is overpowered.)

This was a misunderstanding caused by our poor choice of labeling for the relevant figure. The previous caption of Figure 2 (Figure 1, in [...])

As noted in the opening section of our response letter, we have now added a comparison with SIFTER in order to better benchmark our method. We refer the reader to that response.

6. The authors helpfully perform a detailed sensitivity analysis of their model's parameters. They also apply their method to a number of GO terms and phylogenies from the PANTHER database, but they limit themselves to a small subset (141 of 15,000) of the phylogenies that are relatively well annotated and therefore potentially unrepresentative of what many users may encounter. There could be more discussion of the biological results.
As described in the opening to this response letter, we have added text to clarify the goals of this paper, which is focused on the issue of parameter estimation and not on large-scale function prediction. As the reviewer points out, our work was focused on the families that have at least two annotations of each type (present and absent), which was necessary in order to assess true-positive and true-negative rates in a leave-one-out cross-validation. We agree that these trees are generally better-annotated than most, which was one reason we included an analysis of how our performance is affected by missing annotations. We realize that we did not sufficiently discuss this point, and have added text in the sensitivity analysis section highlighting the fact that more needs to be done to understand the behavior of the model under different levels of annotation (lines 347-355 and 383-385). To expand the discussion of the biological results, we have added a figure (Figure 9) and a table (Table 7). We have also identified three mouse proteins for which our predictions are inconsistent with the annotations in the GO database, and briefly discuss these examples.

Reviewer #2
This paper presents a new phylogeny-based method, *aphylo*, for predicting gene ontology (GO) annotations from data.
The data are gene trees, where each leaf represents a protein sequence that has been annotated based on function (1 = function present and 0 = function absent). Importantly, each gene tree has been **reconciled** based on a rooted species tree, so that the gene tree is rooted and its internal nodes are labeled by evolutionary events (e.g. speciation or duplication). The training data are reconciled gene trees where all leaves have annotations, and the testing data are reconciled gene trees where some but not all leaves have annotations. The goal is then to predict the missing annotations. The proposed method, *aphylo*, achieves this by defining a generic model of function evolution and then utilizing this model within a Bayesian framework.
At a high level, function is modeled as having evolved down the gene tree, meaning that function occurs at the root with some probability and then can be gained or lost on each edge with some probability that depends on whether the node above the edge represents a speciation or duplication event. The posterior distributions of the model parameters (e.g., the probability of gaining function after a gene duplication event) are estimated from the training data within a Bayesian framework (although MLE and MAP estimation are also enabled). Then, at each node, conditioned on these distributions, the posterior probability of function can be computed; this is used to predict whether function is present (1) or absent (0) at leaves without annotations.
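The generative process just summarized can be sketched as a small simulation; all names, data layouts, and parameter values below are hypothetical and are not taken from aphylo's actual implementation:

```python
# Illustrative simulation of the function-evolution model: a binary state
# evolves from the root down the tree, with gain/loss probabilities on
# each branch depending on whether the parent node is a speciation or a
# duplication event.
import random

def simulate(tree, root_prob, gain, loss, rng=None):
    """tree: dict mapping node -> list of (child, event) pairs, where
    event in {"speciation", "duplication"} labels the parent node type.
    gain/loss: dicts keyed by event giving per-branch transition probs.
    Returns dict node -> 0/1 state (root node is named "root")."""
    rng = rng or random.Random(42)
    states = {"root": 1 if rng.random() < root_prob else 0}
    stack = ["root"]
    while stack:
        node = stack.pop()
        for child, event in tree.get(node, []):
            if states[node] == 1:  # function may be lost along the branch
                states[child] = 0 if rng.random() < loss[event] else 1
            else:                  # or gained if currently absent
                states[child] = 1 if rng.random() < gain[event] else 0
            stack.append(child)
    return states
```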
The major weakness of this paper is that *aphylo* (which seems simpler than other phylogeny-based methods for gene ontology annotation) is not benchmarked against other methods (major comment #1).
Because this paper does not evaluate whether *aphylo* improves upon the accuracy or speed of other methods for gene ontology annotation (e.g. those evaluated in the CAFA2 study), it is difficult to evaluate its significance. Therefore, I recommend "Reject".
Major Comments

1. On line 509, the paper reads: "Notwithstanding [*aphylo*'s] simplicity, it provides probabilistic predictions with an accuracy level comparable to that of other more complex, phyla-based models." This statement is surprising given that the proposed method, *aphylo*, was not benchmarked against other methods (at least I didn't notice this in the main paper). In lines 253-270, there is a brief comparison of *aphylo* to SIFTER [1,2]; however, this discussion is focused on running time, not prediction (NOTE: Figures 1-8 in the main text do not seem to show comparisons to SIFTER). Although SIFTER may be too computationally intensive to run on all datasets evaluated in this study, comparisons could be made on at least a small subset of the datasets. If it is impossible to make a comparison to SIFTER, then a comparison could be made to a "base-line method", for example the BLAST-based approach used in the "term-centric evaluation" from the second Critical Assessment of Functional Annotation study (CAFA2) [3].
Term-centric experiments evaluate methods for binary classification tasks, where given a protein sequence, an ontology term (related to molecular function, biological process, cellular component, or human phenotype) is assigned (1 = present and 0 = absent). Several different methods, including SIFTER 2.4 [1,2], were evaluated in CAFA2. Interestingly, Jones-UCL [3] outperformed SIFTER (Fig 7 in [3]); therefore, it seems relevant to compare *aphylo* to Jones-UCL, SIFTER, and a base-line method (assuming that it is computationally feasible to perform such experiments).
As noted in the opening section of our response letter, we have now added a comparison with SIFTER in order to better benchmark our method. We refer the reader to that as a response to this observation.
2. Figure 1 shows the results of several experiments on simulated datasets to evaluate *aphylo* under different conditions (NOTE: Figure 1 would benefit from having the 8 subplots labeled "A", "B", ... "H"). Under ideal conditions (white boxes), the mean MAE is approximately 0.38 (MAE of 0.50 is no better than a random coin toss); however, the MAE takes on a HUGE range. It would be beneficial to do a more fine-grained analysis of datasets with very low and very high MAE.

Although justification for showing MAE (instead of ROC curves) is provided, my opinion is that showing ROC curves in addition would be helpful for many readers.

In addition, as suggested, we have added the ROC curves corresponding to each estimation method featured in Table 4.

3. On line 320, the paper reads: "Finally, we evaluated the impact of removing annotations on accuracy. While a significant drop may not be obvious from the last box in Figure 1, a paired t-test found a significant drop in the accuracy levels." From the description starting on line 536, it sounds like annotations were deleted from gene trees uniformly at random. Does this pattern of missing annotations reflect the pattern of missing annotations in the biological datasets (especially for the lower rates of missing data)? Given the featured example where co-location of missing annotations impacts accuracy, it may be useful to evaluate this issue further (major comment #2).
We do not address this issue for the simple reason that, for the overwhelming majority of proteins, annotations are extremely sparse. The median number of annotations is 12 per tree + GO-term combination, and of those only two are of type "Absent". When annotation is this sparse, one cannot really assess the importance of non-random missingness since, even if we try to remove annotations in a clustered way, we will very rarely have two annotated leaves sufficiently close together to be affected by this. Thus, we expect results would be essentially identical in the vast majority of cases. We have added a brief discussion of this issue to the paper, explaining this perspective. See lines 367-376.

4. Gene duplication has a huge impact on the probability of gaining function (Tables 3 and 4). Because the gene tree and its reconciliation are estimated from molecular sequence data, it seems important to consider how the accuracy of the estimated gene trees and their reconciliations may impact the accuracy of *aphylo*. Note that gene tree reconciliation may be especially challenging when there is hidden paralogy (https://ecoevorxiv.org/wzcbg/). It would be interesting if strange results, such as parallel evolution of function, or many gains/losses in function, could be explained by gene tree estimation error or reconciliation error. Along these lines, it also would be helpful to briefly note how gene trees were estimated and reconciled for PANTHER v11 (as well as whether branch support was estimated).
The initial question, about sensitivity to errors in tree topology, was addressed in our response to comment 1 from Reviewer 1.

Minor Comments
1. Section 3.5 (Discoveries) seems quite valuable, and it may be worthwhile to expand this section and to display the key findings (e.g. for the 10 mouse genes) in a table.
As suggested, we have expanded this section by adding a table with details regarding the analyzed annotations, including a tabulation comparing our proposed annotations with a QuickGO API search. See lines 488-525.
2. This paper models the evolution of a binary trait (in this case the presence or absence of a particular function), so it may be beneficial to discuss the connections to related work in this area (e.g., [5,6,7,8]).
Thank you for providing those references. Indeed, phylogenetic regression models have been developed extensively for analyzing both binary and continuous traits in the context of non-independent data. However, these models use the tree information to build correlation matrices which, conceptually, sidesteps our main goal of modelling the evolutionary process itself.