On the automatic annotation of gene functions using observational data and phylogenetic trees

Motivation Gene function annotation is important for a variety of downstream analyses of genetic data. Yet experimental characterization of function remains costly and slow, making computational prediction an important endeavor. In this paper we use a probabilistic evolutionary model built upon phylogenetic trees and experimental Gene Ontology functional annotations that allows automated prediction of function for unannotated genes.

Results We have developed a computationally efficient model of evolution of gene annotations using phylogenies, based on a Bayesian framework using Markov Chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well on leave-one-out validation, and we further validated some of the predictions in the experimental scientific literature.

Availability Our method has been implemented as an R package and is available online at https://github.com/USCBiostats/aphylo. Code needed to reproduce the tables and figures can be found at https://github.com/USCbiostats/aphylo-simulations.

Author summary Understanding the individual role that genes play in life is a key issue in the biomedical sciences. While information regarding gene functions is continuously growing, the number of genes with unknown biological purpose is even greater. Because of this, scientists have dedicated much of their time to building and designing tools that automatically infer gene functions. In this paper, we present yet another attempt to do so.
While very simple, our model of gene-function evolution has some key features that have the potential to make an impact in the field: (a) compared to other methods, ours is highly scalable, which means that it is possible to simultaneously analyze hundreds of what are known as gene families, comprising thousands of genes; (b) it supports our biological intuition, as the model's data-driven results coherently agree with what theory dictates regarding how gene functions evolve; (c) notwithstanding its simplicity, the model's prediction accuracy is comparable to that of other, more complex alternatives; and (d) perhaps most importantly, it can be used both to support new annotations and to suggest areas in which existing annotations show inconsistencies that may indicate errors or controversies.


1 Introduction
The overwhelming majority of sequences in public databases remain experimentally uncharacterized, a trend that is increasing rapidly with the ease of modern sequencing technologies. To give a rough idea of the disparity between characterized and uncharacterized sequences, there are ∼ 15 million protein sequences in the UniProt database that are candidates for annotation, while only 81,000 (0.3%) have been annotated with a Gene Ontology (GO) term based on experimental evidence. It is therefore a high priority to develop powerful and reliable computational methods for inferring protein function. Many methods have been developed, and a growing number of these have been assessed in the two Critical Assessment of Function Prediction (CAFA) experiments held to date [27,19].

In previous work, we developed a semi-automated method for inferring gene function based on creating an explicit model of function evolution through a gene tree [13]. This ... node in the phylogenetic tree. Using information about experimental annotations available in the GO database, and phylogenetic trees from the PANTHER project [22], we show that we can build a parsimonious and highly scalable model of functional evolution that provides both intuitive insights on the evolution of functions and highly accurate predictions.

The content of the paper is as follows: section 2 presents mathematical notation and formally introduces the model, including parameter estimation and calculation of posterior probabilities. Section 3 presents a large-scale application in which we take a sample of annotations from the GO database along with their corresponding phylogenies, fit our model, and analyze the results. Finally, section 4 discusses method limitations and future research directions for this project.

In general terms, we propose a probabilistic model that reflects the process by which gene functions are propagated through an evolutionary tree. The fundamental idea is that for any given node in the tree, we can write down the probability of observing a function to be present for the gene as a function of model parameters and the functional state of its parent node, essentially modeling the probability of gaining and losing function. The baseline model has 5 parameters: the probability that the most recent common ancestor of the entire tree, i.e. the root node, had the function of interest; the probability of functional gain (the offspring node gains the function when the parent node did not have it); the probability of functional loss (the offspring node loses the function when the parent node had it); and two additional parameters capturing the probability that the gene was incorrectly labeled, i.e. the probability of classifying an absent function as present and vice-versa. We also consider two simple extensions: specifying the functional gain and loss by type of evolutionary event (speciation or duplication), and pooling information across trees.

As explained later, in this version of our model, if there are multiple gene functions of interest, we analyze one function at a time (i.e., we treat those functions as independent). But later we also show results for a joint analysis of multiple functions, assuming they share parameter values. We then discuss further extensions to our model in the Discussion section.

May 6, 2020 4/39

We assume that our starting data consist of a set of gene annotations for a given tree. We further assume that those annotations occur only at the leaves of the tree (as is typical) and that those leaves were annotated via manual curation derived from experimental data (and are therefore subject to the misclassification probabilities outlined above).
Our goal is then to predict the functional state of un-annotated leaves (and, conceivably, to assess the likely quality of the existing annotations). Our perspective is Bayesian. Therefore, we proceed by estimating the posterior distributions of the parameters of the evolutionary model and then, conditional on those estimated distributions, computing the posterior probability of function for each node on the tree. We now give full details, beginning with some notation. Let Λ denote the phylogenetic tree with node set N, and let X ≡ {x_n}_{n∈N} denote the true functional states, where x_n = 1 if node n has the function of interest and x_n = 0 otherwise. We define an Annotated Phylogenetic Tree as the tuple D ≡ (Λ, X).
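For concreteness, the objects just defined can be represented with a minimal data structure. The sketch below is purely illustrative (it is not the aphylo package's internal representation); nodes are integer ids, node 0 is the root, and un-annotated leaves carry None:

```python
# A minimal, purely illustrative representation of an Annotated Phylogenetic
# Tree D = (Lambda, X) -- NOT the aphylo package's internal structure.
# Nodes are integer ids, node 0 is the root, and leaves carry observed
# annotations z in {0, 1, None}, with None marking un-annotated leaves.
tree = {
    "offspring": {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []},
    "z": {2: 1, 3: 0, 4: None},   # leaf 4 is un-annotated
}

def leaves(tree):
    """Leaf nodes are exactly those with no offspring."""
    return sorted(n for n, kids in tree["offspring"].items() if not kids)
```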

Our goal is to infer the true state of X, while only observing an imperfect approximation of it, Z ≡ {z_l}_{l∈N}, derived from experimentally supported GO annotations [29]. Typically only a small subset of the leaf nodes will have been annotated. We refer to the tuple D̃ ≡ (Λ, Z) as an Experimentally Annotated Phylogenetic Tree. Finally, let D̃_n ⊂ D̃ be the induced experimentally annotated subtree that includes all information (i.e., tree structure and observed annotations) regarding the descendants of node n (including node n itself). This object constitutes a key element of the likelihood calculation.

π: Probability that the root node has the function
µ01: Probability of gaining a function
µ10: Probability of losing a function
ψ01: Probability of experimental mislabeling of a 0
ψ10: Probability of experimental mislabeling of a 1

Our evolutionary model for gene function is characterized by the phylogeny and five model parameters: the probability that the root node has the function, π; the probability of an offspring node either gaining or losing a function, µ01 and µ10; and the probability that either an absent function is misclassified as present or that a present function is misclassified as absent, ψ01 and ψ10 respectively, in the input data.

Following [10], we use a pruning algorithm to compute the likelihood of an annotated phylogenetic tree. In doing so, we visit each node in the tree following a post-order traversal, i.e., we start at the leaf nodes and follow their ancestors up the tree until we reach the root node.

Given the true state of leaf node l, the mislabeling probabilities are defined as follows:

P(z_l = 1 | x_l = 0) = ψ01,  P(z_l = 0 | x_l = 1) = ψ10

We can now calculate the probability of leaf l having state z_l, given that its true state is x_l, as:

P(z_l | x_l, ψ) = x_l [z_l (1 − ψ10) + (1 − z_l) ψ10] + (1 − x_l) [z_l ψ01 + (1 − z_l)(1 − ψ01)]   (1)

Similarly, the functional gain and functional loss probabilities are defined as follows:

P(x_n = 1 | x_p(n) = 0) = µ01,  P(x_n = 0 | x_p(n) = 1) = µ10

Note that in this version of our model we assume that these probabilities are independent of the time that has passed along the branch connecting the two nodes. We return to this point in the Discussion. Now, for any internal node n ∈ N, we can calculate the probability of it having state x_n, given the state of its parent p(n) and the vector of parameters µ, as:

P(x_n | x_p(n), µ) = x_p(n) [x_n (1 − µ10) + (1 − x_n) µ10] + (1 − x_p(n)) [x_n µ01 + (1 − x_n)(1 − µ01)]

Together with (1), and following [10], this allows us to calculate the probability of the interior node n having state x_n conditional on D̃_n, its induced subtree, as the product of the conditional probabilities of the induced subtrees of its offspring:

P(D̃_n | x_n) = ∏_{c ∈ offspring(n)} Σ_{x_c ∈ {0,1}} P(D̃_c | x_c) P(x_c | x_n, µ)

where P(D̃_l | x_l) = P(z_l | x_l, ψ) for a leaf node l. This is a recursive function that depends upon knowing the offspring state probabilities for internal nodes, which, since we are using a pruning algorithm, will already have been calculated as part of the process. Finally, the probability of the experimentally annotated phylogenetic tree can be computed using the root node conditional state probabilities:

P(D̃) = Σ_{x_0 ∈ {0,1}} P(D̃_0 | x_0) P(x_0 | π)

where P(x_0 | π) = π x_0 + (1 − π)(1 − x_0).
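The pruning recursion described above can be sketched in a handful of lines. The following Python sketch is illustrative only (it is not the aphylo API); node 0 is assumed to be the root, and un-annotated leaves (None) contribute no information:

```python
# Illustrative sketch of the five-parameter pruning likelihood P(D~).
# Not the aphylo package API; node 0 is assumed to be the root.

def leaf_prob(z, x, psi01, psi10):
    """P(z | x): mislabeling model; z may be None (un-annotated leaf)."""
    if z is None:
        return 1.0                       # missing annotation: no information
    if x == 0:
        return psi01 if z == 1 else 1 - psi01
    return psi10 if z == 0 else 1 - psi10

def transition(x_child, x_parent, mu01, mu10):
    """P(x_child | x_parent): functional gain/loss model."""
    if x_parent == 0:
        return mu01 if x_child == 1 else 1 - mu01
    return mu10 if x_child == 0 else 1 - mu10

def tree_likelihood(offspring, z, pi, mu01, mu10, psi01, psi10):
    """Marginal probability of the observed annotations Z."""
    def cond(n):
        # Returns [P(D~_n | x_n = 0), P(D~_n | x_n = 1)].
        kids = offspring[n]
        if not kids:
            return [leaf_prob(z.get(n), x, psi01, psi10) for x in (0, 1)]
        child_cond = [cond(c) for c in kids]   # post-order: children first
        probs = []
        for x in (0, 1):
            p = 1.0
            for pc in child_cond:
                p *= sum(pc[xc] * transition(xc, x, mu01, mu10)
                         for xc in (0, 1))
            probs.append(p)
        return probs
    p0, p1 = cond(0)
    return (1 - pi) * p0 + pi * p1
```

A useful sanity check is that summing this likelihood over all possible observed annotation vectors yields 1, since the model defines a proper probability distribution over Z.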

In the next section we introduce additional refinements that allow us to take into account prior biological knowledge that might constrain the parameter space and alleviate the typical sparseness of the curated data (we return to these issues in the Discussion). The sparsity of annotations in the single-tree context makes inference challenging. However, in order to improve inference we might attempt to "borrow strength" across a set of functions, by combining them in a single analysis. As a "proof of principle" of this idea, we will also show results in which we assume that sets of annotated phylogenetic trees share population parameters. This way, we can estimate a joint model by pooling those annotated phylogenetic trees, providing stronger inference about the evolutionary parameters and therefore, we hope, more accurate inference of gene annotations. We note that we have strived to make our software implementation extremely computationally efficient, allowing us to estimate a joint model with hundreds of annotated trees in reasonable time on a single machine.

Our results show that using a pooled-data model for parameter estimation greatly increases the model's predictive power. In the Discussion we will outline possible future, less simplified, approaches to this problem.

The non-leaf nodes in the phylogenetic trees that we are considering here come in two types, reflecting duplication and speciation events. It has been widely observed that change of gene function most commonly occurs after a duplication event, whereas after a speciation event the functions of the offspring genes tend to remain broadly the same. Our model reflects this biological insight by allowing the functional gain and loss probabilities to differ by type of event. In particular, instead of having a single parameter pair (µ01, µ10), we now have two pairs of functional gain and loss parameters, (µ01d, µ10d) and (µ01s, µ10s), the gain and loss probabilities for duplication and speciation events, respectively.

We adopt a Bayesian framework, introducing priors for the model parameters. In particular, as described in section 3, we use Beta priors for all model parameters. We store the conditional state probabilities computed during the likelihood calculation (the pruning process) and feed them into the posterior probability prediction algorithm, which is exhaustively described in subsection 5.1 (see also [18]). It is important to notice that, in general, predictions are made using a leave-one-out approach, meaning that to compute the probability that a given gene has a given function, we remove the annotation for that gene from the data before calculating the overall likelihood of the tree. Otherwise we would be including that gene's own observed annotation when predicting itself. Prediction for un-annotated genes, on the other hand, is performed using all the available information.

In general, whenever using MCMC to estimate the model parameters, we used 4 parallel chains and verified convergence using the Gelman-Rubin statistic [3]. In all cases, we sampled every tenth iteration after applying a burn-in of half of the sample.

As a rule of thumb, the processes were considered to have reached a stationary state once the potential scale reduction factor statistic was below 1.1.
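For reference, the potential scale reduction factor is computed from between- and within-chain variances. The following is a minimal sketch of the classic Gelman-Rubin statistic for one parameter, not the implementation used by the package:

```python
# Minimal sketch of the Gelman-Rubin potential scale reduction factor
# (R-hat) for M chains of equal length N, for a single parameter.
import statistics

def psrf(chains):
    """chains: list of equal-length lists of MCMC draws for one parameter."""
    m = len(chains)
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    w = statistics.fmean([statistics.variance(c) for c in chains])  # within-chain
    var_hat = (n - 1) / n * w + b / n   # pooled posterior-variance estimate
    return (var_hat / w) ** 0.5
```

Chains that have mixed give values near 1; chains stuck in different regions give large values, which is why a threshold slightly above 1 is used in practice.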
Prediction accuracy was summarized using the Mean Absolute Error:

MAE = (1/|Z|) Σ_n |z_n − x̂_n|

where |Z| is the number of observed annotations, z_n is the observed annotation (0/1), and x̂_n is the predicted probability that the n-th observation has the function of interest. Among other desirable properties, this measure is robust to probabilistic noise; see [11,5] for a more general discussion.

We now turn our attention to an application using experimental annotations from the Gene Ontology Project.

3 Results

To evaluate the performance of our model, we used data from the Gene Ontology project.

In particular, in order to assess the potential utility of combining information across 210 similar genes, we used our model in two different ways: a pooled-data model, in which 211 we treated all trees as if they had the same evolutionary parameter values, and a 212 one-tree-at-a-time model, in which we estimated parameters, and then imputed 213 functional status, for each tree separately.

To reduce sparsity of annotations somewhat, from the entire set of available annotated phylogenies in PANTHER [22] (about 15,000), we selected only those in which there were at least two annotations of each type (i.e., at least two genes annotated as possessing a given function, and two as lacking it). This yielded the final set of trees used in our analysis. Beta priors, which are detailed in Table 3, were chosen to reflect the biological intuition that change-of-state probabilities should be higher at duplication events.
Finally, accuracy was measured using leave-one-out cross-validated Mean Absolute Error. Parameter estimates are reported in Table 4, and are seen to reflect biological intuition: we see high probabilities of change of function at duplication events, which occurs regardless of whether we use an informative prior or not. While both probabilities are high, gain of function is roughly twice as likely as loss of function at such nodes. We note that we estimate relatively high values of ψ01 and low values of ψ10. We believe this is largely due to the sparsity of 0, i.e. "absent", experimental annotations, which means that high values of ψ10 are implausible and implies that any mislabelings must perforce be those of assigning function falsely. Thus we think the relatively high estimates of ψ01 are an artefact that will likely disappear in the presence of less sparse annotation data.
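The leave-one-out cross-validated MAE described here can be sketched as follows; predict_fn stands in for a hypothetical prediction routine that re-evaluates the model with one leaf's annotation removed:

```python
# Sketch of leave-one-out cross-validated Mean Absolute Error:
# MAE = (1/|Z|) * sum_n |z_n - xhat_n|, where xhat_n is predicted with
# leaf n's own annotation held out. predict_fn is a hypothetical callable.

def loo_mae(z, predict_fn):
    """z: dict mapping leaf id -> 0/1 observed annotation.
    predict_fn(leaf, z_minus): predicted probability of function for `leaf`,
    computed from the remaining annotations z_minus."""
    errs = []
    for leaf, obs in z.items():
        z_minus = {k: v for k, v in z.items() if k != leaf}
        errs.append(abs(obs - predict_fn(leaf, z_minus)))
    return sum(errs) / len(errs)
```

For example, with the trivial predictor "mean of the remaining annotations", the held-out annotation never contributes to its own prediction, which is the point of the leave-one-out design.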

We used both non-informative (uniform) and relatively informative priors. We see that the choice of prior had no significant effect on the final set of estimates, meaning that these are driven by the data and not by the specified priors. The only exception is the root node probability, π, which is hard to estimate since the root node exists far back in evolutionary time [12]. We also conducted a simulation study across a broader range of parameter values, using simulated datasets, in which we saw similar behavior. These results are shown in subsection 5.2. However, we note that when analyzing one tree at a time, this robustness is lost [results not shown] because the sparsity of data typical in any single given tree does not permit strong inference regarding evolutionary parameters.

Therefore, the prior becomes more influential.

Table 4. Parameter estimates for the model with experimentally annotated trees. Overall, the parameter estimates show a high level of correlation across methods. Regardless of the method, with the exception of the root node probabilities, most parameter estimates remain the same. As expected, mutation rates at duplication nodes, indexed with a d, are higher than those observed at speciation nodes, indexed with an s.

In the Discussion we outline how we propose to extend our method in future work to allow for more sophisticated joint analyses across functions than the one we present here.

Each cell shows the 95% confidence interval for the difference in MAE resulting from two methods (row method minus column method). Cells are color-coded blue when the method on that row has a significantly smaller MAE than the method on that column; conversely, cells are colored red when the method in that column outperforms the method in that row. Overall, predictions calculated using the parameter estimates from the pooled-data model outperform those from the one-tree-at-a-time model.

It is only for mislabeling probability values close to one that the prediction accuracy significantly deteriorates. While higher values of ψ10 have a consistent negative impact on accuracy, the same is not evident for ψ01. The latter could be a consequence of the low prevalence of 0 ("no function") annotations in the data, as illustrated in Figure 2.

Functional gain and loss probabilities (second and third rows) have a larger effect on MAEs when misspecified, especially at speciation events, (µ01s, µ10s). This is largely because speciation events are much more common than duplication events across trees. Figure 3 shows a detailed picture of the number of internal nodes by type of event. Of the seven parameters in the model, the root node probability π has a negligible impact on MAE.

Finally, we evaluated the impact of removing annotations on accuracy. While a significant drop may not be obvious from the last box in Figure 1, a paired t-test found a significant drop in the accuracy levels.

Predictions were computed using each set of sampled parameter values. As before, we used a leave-one-out approach to make predictions, meaning that to predict any given leaf, we removed any annotation for that leaf and recalculated the probabilities described in subsection 5.1.

In the high accuracy example, Figure 4, which predicts the GO term GO:0001730 on

In an effort to understand why the predictions were so poor in this case, we reviewed the family in detail. Because all GO annotations include references to the scientific publications upon which they are based, we were able to review the literature for genes in this family, the ER degradation enhancing mannosidase (EDEM) family. It turns out that the inconsistency of the annotations in the tree reflects a bona fide scientific controversy [24], with some publications reporting evidence that these proteins function as enzymes that cleave mannose linkages in glycoproteins, while others report that they bind but do not cleave these linkages. This explains the divergent experimental annotations among members of this family, with some genes being annotated as having mannosidase activity (e.g. MNL1 in budding yeast), others as lacking this activity (e.g. mnl1 in fission yeast), and some with conflicting annotations (e.g. human EDEM1).

The inability of our model to correctly predict the experimental annotations actually has a practical utility in this case, by identifying an area of inconsistency in the GO knowledgebase.

To focus on annotations whose state was predicted with a greater degree of certainty, we now consider the subset of predicted annotations whose 95% credible interval was either entirely below 0.1 or entirely above 0.9, i.e., low or high chance of having the function, respectively. We then compared the list of annotations to those available from the QuickGO API, regardless of the evidence code, which resulted in a list of 295 novel predictions.

Ten of the predictions were for genes in the mouse, a well-studied organism [16]. We searched the literature for evidence of the predicted functions, and uncovered evidence for six of the ten predictions: estradiol dehydrogenase activity for Hsd17b1 [15], involvement in response to iron ion for Slc11a2 [28], involvement of Bok in promoting apoptosis [4], DNA binding activity of Nfib [17], extracellular location for Lipg [1], and the adenylate kinase activity of Ak2 [21].

Our approach has the advantage of being based on an explicit evolutionary model, which allows formal statistical inference. However, being based on a model also means being "wrong" [2]. We now discuss the ways in which we are wrong, and how we believe that despite this "wrongness" our approach is still useful.

A more sophisticated approach would jointly model annotations across functions in order to obtain better parameter inference (see Table 6). While such an approach will be challenging to implement, we intend to pursue it in future work. We also intend to explore approaches in which we use a simple loglinear model to

Table 6. Distribution of trees per number of functions (annotations) in PANTHERDB. At least 50% of the trees used in this paper have 10 or more annotations per tree.
Parameters suggest biological interpretations. Our model, while simple, has several advantages and appears to perform well overall. We are able to determine parameters

Use in epidemiologic analyses. We emphasize that the goal of this method is not simply to assign presence or absence of various gene functions to presently unannotated human genes, but to estimate the probabilities π_gp that each gene g has each function p. In analyzing epidemiologic studies of gene-disease associations, we anticipate using this annotation information as "prior covariates" in a hierarchical model, in which the first level would be a conventional multiple logistic regression for the log relative risk coefficients β_g for each gene g and the second level would be a regression of these β_g on the vector π_g = (π_gp) of estimated function probabilities. This second-level model could take various forms, depending upon whether one thought the functions were informative about both the magnitude and sign of the relative risks or only their magnitudes. In the former case, one might adopt a linear regression of the form β_g ∼ N(α_0 + π_g α, σ²). In the latter case, one might instead model their dispersion as β_g ∼ N(0, λ_g) or β_g ∼ Laplace(λ_g), where λ_g = exp(α_0 + π_g α), corresponding to an individualized form of ridge or Lasso penalized regression, respectively.
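The individualized ridge penalty mentioned above can be sketched numerically. Under the Gaussian prior β_g ∼ N(0, λ_g) with λ_g = exp(α_0 + π_g α), the posterior mode is a ridge estimate with a per-gene penalty; all data, shapes, and names below are illustrative assumptions, not values from the paper:

```python
# Illustrative sketch of "prior covariates": function probabilities pi_g
# set per-gene prior variances lambda_g = exp(alpha0 + pi_g . alpha),
# giving an individualized ridge estimate
#   beta_hat = (X'X + sigma^2 * diag(1/lambda_g))^(-1) X'y.
import numpy as np

def individualized_ridge(X, y, pi, alpha0, alpha, sigma2=1.0):
    """X: design matrix; y: outcome; pi: per-gene function-probability
    vectors (one row per gene); alpha0, alpha: second-level coefficients."""
    lam = np.exp(alpha0 + pi @ alpha)        # per-gene prior variances
    penalty = np.diag(sigma2 / lam)          # small lambda -> strong shrinkage
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```

With a very small α_0 the prior variances collapse and all coefficients shrink toward zero; with a large α_0 the estimate approaches ordinary least squares, which matches the ridge interpretation in the text.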

In summary, we have presented a parsimonious evolutionary model of gene function. While we intend to further develop this model to reflect additional biological features, we note that in its current form it has the following key features:

Using the law of total probability, the second term of (7) can be expressed in terms of n's parent state, x_p(n). The denominator of (9) can then be rewritten as follows (given that we know x_n, x_p(n) is no longer informative for D̃_n):

P(D̃_n | x_p(n) = s) = P(D̃_n | x_n = 1) P(x_n = 1 | x_p(n) = s) + P(D̃_n | x_n = 0) P(x_n = 0 | x_p(n) = s)

Now, we can substitute this into the denominator of (9). Together with the pruning probabilities calculated during the model fitting process, this equation can be computed using a recursive algorithm, in particular the pre-order traversal [23], in which we iterate through the nodes from root to leaves.

When node n is the root node, the posterior probability can be calculated in a straightforward way, since the terms P(D̃_n | x_n) have already been calculated as part of the pruning algorithm:

P(x_0 = 1 | D̃) = π P(D̃_0 | x_0 = 1) / [π P(D̃_0 | x_0 = 1) + (1 − π) P(D̃_0 | x_0 = 0)]

In the second scenario we proceeded as above, but now estimated the model using a partially annotated tree. Specifically, we randomly dropped a proportion of leaf annotations m ∼ Uniform(0.1, 0.9).
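The root-node posterior amounts to a single application of Bayes' rule over the two pruning terms, e.g.:

```python
# Sketch of the root-node posterior P(x_0 = 1 | D~): Bayes' rule over the
# two conditional pruning terms P(D~_0 | x_0 = 0) and P(D~_0 | x_0 = 1).

def root_posterior(p_d_given_x0, p_d_given_x1, pi):
    """pi is the prior probability that the root has the function."""
    num = pi * p_d_given_x1
    return num / (num + (1 - pi) * p_d_given_x0)
```

When the two conditional likelihoods are equal, the data carry no information about the root and the posterior reduces to the prior π, as expected.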

In order to assess the effect of the prior distribution, in each scenario we performed estimation twice using two different priors: a well-specified prior, i.e., the one used in the data-generating process, and a biased prior in which the α shape was twice that of the data-generating process. Overall, the predictions are good, with a relatively low MAE and high AUC.
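As a small worked example of what "biased" means here: for a Beta(α, β) prior with mean α/(α + β), doubling the α shape shifts the prior mean upward (the shape values below are hypothetical, not the ones in Table 3):

```python
# For a Beta(a, b) prior, the mean is a / (a + b); doubling the `a` shape
# (the "biased prior" in the text) pulls the prior mean upward.
# Shape values here are hypothetical, for illustration only.

def beta_mean(a, b):
    return a / (a + b)

well_specified = beta_mean(2, 18)   # mean 0.10
biased = beta_mean(4, 18)           # alpha doubled -> mean ~0.18
```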

Furthermore, as seen in Figure 6, both AUC and MAE worsen as the data becomes more sparse (fewer annotated leaves).

We now consider bias. Figure 7 shows the distribution of the empirical bias, defined as the population parameter minus the parameter estimate, for the first scenario (fully annotated tree). Since the tree is fully annotated and there is no mislabeling, the plot only shows the parameters for functional gain, loss, and the root node probability. Of the three parameters, π is the one the model has the hardest time recovering; what's more, it generally seems to be over-estimated. On the other hand, µ01 and µ10 estimates do significantly better than those for π, and moreover, in sufficiently large trees the model with the correct prior is able to recover the true parameter value. Figure 8 shows the empirical distribution of the parameter estimates in the third simulation scheme: a partially annotated tree with mislabeling. As the proportion of missing annotations increases, the model tends, as expected, to do worse.

In the case that the model includes a single set of gain and loss probabilities (i.e. no difference according to type of node), in the limit as the number of branches between the root and the node goes to infinity, the probability that a given node has a function can be calculated as

P(z_n = 1) = (1 − µ10) P(z_{n−1} = 1) + µ01 P(z_{n−1} = 0).

Since in the limit P(z_n = 1) = P(z_{n+k} = 1) for all k,

P(z_n = 1) = (1 − µ10) P(z_n = 1) + µ01 P(z_n = 0),

and finally

P(z_n = 1) = µ01 / (µ01 + µ10).

However, when the gain and loss probabilities differ by type of node, the unconditional probability of observing the function is computed as follows:

P(z_n = 1) = P(z_n = 1 | C_n = D) P(C_n = D) + P(z_n = 1 | C_n = ¬D) P(C_n = ¬D)   (12)

where C_n ∈ {D, ¬D} denotes the type of event (duplication or not duplication, respectively). We need to calculate P(z_n = 1 | C_n = D) and P(z_n = 1 | C_n = ¬D).
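The limiting value µ01/(µ01 + µ10) can be checked numerically by iterating the recurrence above until it stabilizes:

```python
# Numeric check of the limiting probability: iterating
#   p <- (1 - mu10) * p + mu01 * (1 - p)
# converges to the fixed point mu01 / (mu01 + mu10).

def stationary(mu01, mu10, p=0.5, iters=500):
    for _ in range(iters):
        p = (1 - mu10) * p + mu01 * (1 - p)
    return p
```

The fixed point does not depend on the starting value p, since the contraction factor per step is (1 − µ01 − µ10).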

Without loss of generality, let us start with the first:

P(z_n = 1 | C_n = D) = P(z_n = 1 | C_n = D, z_{n−1} = 1) P(z_{n−1} = 1 | C_n = D) + P(z_n = 1 | C_n = D, z_{n−1} = 0) P(z_{n−1} = 0 | C_n = D)

Now, following the same argument made in (11),

P(z_n = 1 | C_n = D) = µ01d / (µ01d + µ10d),

where the probability of a node having a given class can be approximated by the observed proportion of that class in the given tree.

Figure caption (sensitivity analysis): The first seven plots show how the MAEs change as a function of fixing the given parameter value, ranging from 0 to 1. In each of these plots, the white box indicates the parameter value used to generate the data (i.e., the "correct" parameter value). The last boxplot shows the distribution of the MAEs as a function of the amount of available annotation data, that is, how accuracy changes as the degree of missing annotations increases.

High accuracy predictions: The first set of annotations, first column, shows the experimental annotations of the term GO:0001730 for PTHR11258. The second column shows the 95% C.I. of the predicted annotations. The column ranges from 0 (left end) to 1 (right end). Bars closer to the left are colored red to indicate that lack of function is suggested, while bars closer to the right are colored blue to indicate function is suggested. Depth of color corresponds to strength of inference. The AUC for this analysis is 0.91 and the Mean Absolute Error is 0.34.