Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models

Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators are biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators.


Introduction
Virtually all codon models in wide use today (see [1,2] for recent reviews) are members of the class of finite-state, continuous time reversible Markov chains, each defined by an instantaneous rate matrix Q. Transition matrices for finite amounts of time are found via the matrix exponential of Q, so the probability that a position initially occupied by codon I is occupied by codon J after t units of time is P IJ (t)~e Qt À Á IJ (throughout the manuscript we will use upper-case letters to index codons and lower-case letters to index nucleotides). If M is a model in this class, the individual entries of its rate matrix can be written in the canonical form Q IJ~hIJ p M J . The h IJ can be thought of as ''rate parameters'' that govern the relative rates of substitutions between different codons, while parameters p M J induce the equilibrium frequencies of the codons. The choice of p M J is the primary distinction between the two popular families of codon models: MG (introduced in [3]) and GY (introduced in [4]). How to best estimate the p M J -or more precisely, how to estimate model parameters that actually determine the p M J -from sequence alignments is the focus of this note. In order to frame this discussion we need to define what we mean by empirical frequencies, model parameters and equilibrium frequencies ( Figure 1). Given an observed alignment, the positionspecific empirical nucleotide frequencies, e p a where a is a nucleotide (A,C,G,T) and p the codon position (1,2,3), can be estimated directly by counts from the data, and the empirical codon frequencies, e J , can be estimated by counts as well (the latter gives rise to the F61 codon frequency estimator [4]). Either of these estimates can be used to set model parameters, however typical alignments have insufficient information for the direct estimation of empirical codon frequencies with a sufficient degree of confidence. Rather, the empirical nucleotide frequencies are used to set the nucleotide frequency parameters, w p a , and by multiplication of their constituents, the codon frequency parameters, p M J . For example, in the original MG94 model of codon evolution [3], the equilibrium frequency of codon J~xyz is given by A common extension of this model, referred to as MG94 F364, allows the three codon positions to have their own nucleotide frequency parameters and leads to equilibrium codon expressed as: In this expression the superscripts indicate the position, and the equation for P stop is modified in the obvious way. If we set all of the model nucleotide frequency parameters to be equal, i.e. w p a~0 :25, the result is equal equilibrium frequencies for all codons, i.e. p J~1 =61 for all J. This vector of codon equilibrium frequencies allows us to easily tabulate, via marginalization, the equilibrium frequencies of each nucleotide at each position: Note that there are only 13 occurrences of T in the first position, 14 of A in the second position, etc because the model explicitly disallows (TAG,TAA,TGA) as is standard for all other codon models. The finding from this exercise is that when one sets all the w p a~0 :25, each of the codon equilibrium frequencies, p J takes the anticipated value of 1=61. However, remarkably, the equilibrium nucleotide frequencies generated by this model are not the anticipated 0:25. For instance, the equilibrium frequency of A at the first position is 1=61|16~0:262. Traditionally, the empirical nucleotide frequencies are used to set nucleotide frequency parameters, and it is therefore assumed that the induced equilibrium nucleotide frequencies are equal to those observed in the alignment. However, given that the nucleotide composition of stop codons is not accounted for, this practice is flawed, because w p a =p p a . The conflation of frequency parameters (w p a ) and equilibrium nucleotide (p p a ) frequencies results in incorrect estimates of equilibrium nucleotide (and codon) frequencies as demonstrated in (2) above. This phenomenon is not restricted to the MG family of models. It is simple to demonstrate the exact same behavior for the GY family of models, again because of the incorrect designation of nucleotide frequency parameters in the rate matrix as equal to empirical nucleotide frequencies. We show that the traditional identification of frequency parameters and observed nucleotide frequencies leads to a cascade of problems. Model frequency parameters are estimated with bias, which leads to biased estimation of the equilibrium codon frequencies, which leads to compensatory biased estimation of the substitution rate parameters. We propose a correction, and a maximum likelihood frequency parameterization and show that both these approaches are not similarly biased, and therefore advocate their use in codon models.

Materials and Methods
To ensure clarity of presentation, we first carefully introduce the necessary notation (summarized in Figure 1). For a given substitution model, let p J be the frequency of sense codon J (J~1,2,3, . . . ,61) in its equilibrium distribution, and p p a , a~1,2,3,4 be the equilibrium frequency of nucleotide a in codon position p~1,2,3. When necessary, we will indicate specific models via a superscript (ie, MG or GY). The position specific nucleotide equilibrium frequencies, p p a , are uniquely determined by the codon equilibrium frequencies, p J , through marginalization, e.g. p 1 T is simply the sum of frequencies of the 13 sense codons that have a T in their first position, e.g. as in equation (2).
These equilibrium frequencies, of both nucleotides and codons, have traditionally been assumed equal to empirical frequencies observed in a sequence alignment, e J or e p a , and used to set model parameters. If the specified model is correct, e J converges to p J and e p a to p p a as the sequence length N increases. (However, note that this result requires that the evolutionary process itself be at equilibrium; many important biological mechanisms-notably directional positive selection-are likely to disrupt equilibrium; see [5][6][7]).
Because the simple example in equation (2) demonstrated that the empirical and equilibrium nucleotide frequencies are not synonymous, we strive to obtain an expression that relates the equilibrium nucleotide frequencies to the model nucleotide frequencies, w p a , and through extension -to the observed empirical frequencies. Even though the MG and GY models treat equilibrium codon frequencies differently, it is a fortunate coincidence that in either case the p J have identical forms when written in terms of w p a . Given twelve MG nucleotide frequency parameters, only 9 of which are independent because P a w p a~1 for each position p, the equilibrium frequency of codon J~xyz induced by their values is as in equation (1).
By using e p a to directly estimate w p a in equation (1), one obtains the popular F 3|4 estimator of codon equilibrium frequenciesby far the most common estimator used in literature for both MG and GY classes of models. The statistical and computational appeal of F 3|4 lies in its use of only 9 nucleotide parameters to describe 61 codon frequencies. However, the key shortcut-direct estimation of nucleotide frequency parameters with empirical nucleotide frequencies from the data-is flawed. The empirical nucleotide frequencies are unbiased estimates of the true equilibrium frequencies; unfortunately, the model parameters they are being used to estimate are something different. Thus, a fundamental problem with current practices is that use of the F 3|4 estimators with either MG or GY models leads to biased estimates of the w p a , and in turn the p J . As we will show below, the problems do not end there, and lead to biased estimation of other model parameters.
We first present two approaches for correcting these estimation errors. The obvious, but more computationally demanding method is to estimate the w p a by maximum likelihood along with other model parameters. We dub this approach MLF3|4. Theory suggests that estimates from this methodology will have all the desirable properties of maximum likelihood estimation. Maximum likelihood estimation of these values has been available in some software packages, e.g. in HyPhy [8], for a number of years, but to our knowledge it has rarely been used.
The second strategy, described here for the first time, relies on finding an expression for the induced equilibrium frequency of nucleotide a at codon position p (p p a ) as a function of w p a . Since the w p a define codon equilibrium frequencies (equation 1), we can readily obtain such equations by marginalization: Here, 1{p X is simply scaling for the absence of stop codons: p X~P xyz[X p 1 x p 2 y p 3 z , and X~fTAA,TAG,TGAg defines the set of stop codons. The corrected F 3|4, or CF 3|4 estimator equates p p a with observed nucleotide frequencies e p a , and then solves the nonlinear system (3) for w p a to obtain estimates of the latter. Because P 4 a~1 w p a~P 4 a~1 p p a~1 , the above system of 12 nonlinear equations relate 9 independent observed statistics (e p a , e.g. for a [ fA,G,Tg) with 9 independent model parameters w p a . We were unable to obtain a closed form solution to the system, but it can be easily solved numerically at a negligible computational cost.
We conducted simulations to further investigate the effects of biases in the equilibrium frequencies on parameters typically estimated using phylogenetic models. We generated two-sequence codon alignments with uniform codon frequency composition (w p a~0 :25). We used h AC~0 :5, h AG~1 , h AT~0 :8, h CG~0 :3, h CT~2 :0, h GT~0 :1 as substitution bias parameters in the MG94xREV model [9], and set the nonsynonymous/synonymous substitution rate ratio v to 0:25. The two sequences were 10% divergent on average, and the length of the alignment, N, was one of 400, 1,600 or 32,000 codons. 100 replicates were generated for each value of N. We compared the fits of F 3|4, CF 3|4 and MLF 3|4 on simulated data sets, and furthermore compared simulated to inferred parameter estimates with each of the three frequency parameterizations. In addition to the simulated data, we fitted all three frequency parameterizations to a sample of 856 alignments from the carefully curated Pandit database [10]. All alignments were chosen to contain between 10 and 20 sequences and at least 200 reliably aligned codon sites. Given that each estimator has the same number of independent parameters (9), an improvement in log-likelihood under one of the models is considered as evidence in favor of the better fitting model, e.g. under the BIC [11] criterion. All new estimators for the MG94 class of models are implemented in HyPhy.

Results and Discussion
We simulated data with a uniform codon frequency composition and fitted all three frequency parameterizations for alignments of various sequence lengths. The suboptimal nature of the F 3|4 estimator is immediately apparent from Figure 2a, where the improvement in log L scores of the model equipped with the corrected estimator CF 3|4 is shown. For all replicates, the CF 3|4 estimator yielded better log L, with median improvements of 2:29, 9:46, and 184 (for 400,1,600, and 32,000 codons respectively), or approximately 0:006 likelihood points per codon site. Note that as the sample size increased, the estimators from (3) effectively matched the performance of the maximum likelihood estimator (Figure 2b). Even more importantly, the use of the F 3|4 frequency estimator led to biased inference of other model parameters. Maximum likelihood estimates of some substitution rates were biased under the F 3|4, and the bias was progressively more pronounced with increasing sample size (Figure 2c). Indeed, for N~32,000, a simple likelihood ratio test rejected the (true) null of h CT~2 :0 at pv0:05 for all 100 replicates. Biased MLEs of the substitution rate parameter h CT is a result of the under/overestimates of p p T and p p C using F 3|4. Similar results were seen for the other h IJ . To our relief, the maximum likelihood estimate (MLE) for the v ratio was not noticeably affected even for the largest sample size (mean 0:2494, median 0:2495, IQR 0:2445{0:2539 under F 3|4; mean 0:2500, median 0:2501, IQR 0:2452,0:2545 under CF 3|4, Figure 2d). For the Pandit alignments log L values were, of course, higher for the models estimated using MLF3|4 than for those using F 3|4. However, the magnitudes of the differences were impressive (median 17:59, IQR 10:29{27:55, max 453:2). The CF 3|4 estimator improved the log L score of the F 3|4 estimator for over 80%(692=856) of the alignments by a median of 7:4 points; in the remaining cases the median decrease in log L score was 2:9 points. As with the simulated data, the MLEs of v were largely unaffected by the choice of frequency estimators (but there were some datasets where the difference was large), while some substitution rate estimates appeared biased (Figure 3). For example, the estimates of h CT were strongly linearly correlated between MLF 3|4 and F 3|4 methods (r 2~0 :952), but the regression line was estimated as F 3|4~0:073z0:930MLF 3|4, which recapitulates the downward bias observed on simulated data (if the estimates were unbiased, we would expect an intercept of zero and slope of one).
We have demonstrated through simulations that the almost universally used F 3|4 estimator of equilibrium frequencies in codon substitution models is biased, and we have pointed out how a misinterpretation of standard codon model parameters is responsible for these biases. Although this bias appears to have little effect on estimation of ''composite'' parameters such as the nonsynonymous/synonymous rate ratio (v) and branch lengths (results not shown), the bias has considerable damaging effects on the estimation of substitution rate parameters in the instantaneous rate matrix. This problem will become acutely relevant as researchers pursue finer-scale studies of the evolutionary process, such as developing substitution models with protein residuedependent codon substitution rates [12,13]. Since the computational burden of the F 3|4 estimator is virtually identical to that of our proposed CF 3|4 estimator, which in turn is only marginally faster than MLF 3|4, we recommend the use of either of the alternatives offered in this manuscript over the F 3|4 estimator. Our current recommendation is to obtain CF 3|4 estimates and use them to initialize the optimization procedure for MLF3|4 to speed up convergence.