## Abstract

### Background

Empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. We develop a codon-based model in which the mutational tendencies of codons, the genetic code, and the strength of selective constraints against amino acid replacements can be tailored to a given gene. First, selective constraints averaged over proteins are estimated by maximizing the likelihood of each 1-PAM matrix of empirical amino acid (JTT, WAG, and LG) and codon (KHG) substitution matrices. Then, selective constraints specific to given proteins are approximated as a linear function of those estimated from the empirical substitution matrices.

### Results

Akaike information criterion (AIC) values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices significantly better. Also, the ML estimates of transition-transversion bias obtained from these empirical matrices are not as large as previously estimated. The selective constraints are characteristic of proteins rather than species. However, their relative strengths among amino acid pairs can be approximated as depending mainly on the amino acid pairs rather than on protein families, because the present model, in which selective constraints are approximated as a linear function of those estimated from the JTT/WAG/LG/KHG matrices, provides a good fit to other empirical substitution matrices, including cpREV for chloroplast proteins and mtREV for vertebrate mitochondrial proteins.

### Conclusions/Significance

The present codon-based model with the ML estimates of selective constraints and with adjustable mutation rates of nucleotide would be useful as a simple substitution model in ML and Bayesian inferences of molecular phylogenetic trees, and enables us to obtain biologically meaningful information at both nucleotide and amino acid levels from codon and protein sequences.

**Citation:** Miyazawa S (2011) Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes. PLoS ONE 6(3): e17244. https://doi.org/10.1371/journal.pone.0017244

**Editor:** Darren Martin, Institute of Infectious Disease and Molecular Medicine, South Africa

**Received:** November 5, 2010; **Accepted:** January 24, 2011; **Published:** March 18, 2011

**Copyright:** © 2011 Sanzo Miyazawa. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** This work was supported by Gunma University. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The author has declared that no competing interests exist.

## Introduction

Any method for inferring molecular phylogeny is implicitly or explicitly based on the evolutionary mechanism of nucleotide or amino acid substitutions, and the reliability of phylogenetic analyses strongly depends on the models assumed for the substitution processes of nucleotides and amino acids. Mutational events occur at the individual nucleotide level, but selective pressure primarily operates at the amino acid level. Thus, a codon-based model of amino acid substitutions has the potential to be preferable to both mononucleotide substitution models [1]–[3] and amino acid substitution models [4]–[12], because it can take into account both mutational tendencies at the nucleotide level and selective pressure on amino acid replacements, as well as knowledge of the genetic code. Schneider et al. [13] and Kosiol et al. [14] empirically estimated a codon substitution matrix from a large number of coding sequence alignments. However, the tendencies of substitutions differ among nuclear, mitochondrial [6], and chloroplast genes [8]. Delport et al. [15], [16] pointed out that empirical substitution matrices represent the average tendencies of substitutions over various protein families by sacrificing gene-level resolution. A mechanistic codon substitution model, in which one can change the genetic code and adjust mutational tendencies at the codon level and selective preferences on amino acid replacements, is potentially superior to empirical codon substitution matrices.

A main difference among the current mechanistic codon substitution models [7], [15]–[24] resides in the estimation of selective constraints against amino acid replacements. (1) In [19], [20], [22], the difference between nonsynonymous and synonymous substitution rates was taken into account but the amino acid dependences of selective constraints were not; i.e., a single selective constraint was used. (2) In [7], [17], [18], selective constraints against amino acid replacements were evaluated from physico-chemical properties of amino acids. (3) In [21], [23], [24], codon exchangeabilities for nonsynonymous changes were evaluated from those in empirical amino acid substitution matrices. (4) In [15], [16], selective constraints were grouped, and the number of groups and the strength of selective constraint of each group were optimized for a given protein phylogeny. The fourth method has the highest resolution of selective constraints, employing as many substitution groups as necessary. However, it appears to be very computationally intensive [16]. Here, we try to estimate the selective constraint for each type of amino acid replacement by maximizing the likelihood of individual empirical substitution matrices. Unlike the present method, in the previous methods of the third category codon exchangeabilities for nonsynonymous changes were assumed to be proportional to the corresponding amino acid exchangeability [23], or a codon substitution matrix was restricted to yield amino acid exchangeabilities equal to empirically derived ones [21]. The empirical substitution matrices fitted are 1-PAM amino acid substitution frequency matrices, the JTT matrix [5], the WAG matrix [10], and the LG matrix [11], evaluated from relatively large data sets of nuclear-encoded proteins, the mtREV matrix [6] from vertebrate mitochondrial proteins, and the cpREV matrix [8] from chloroplast-encoded proteins, and also a 1-PAM codon substitution frequency matrix (KHG) [14]. In the following, these empirical substitution frequency matrices corresponding to 1 PAM will be referred to simply by their common acronyms, JTT, WAG, LG, KHG, mtREV, and cpREV.

In most of the reversible Markov models for codon substitutions, instantaneous rates for codon substitutions that require multiple nucleotide changes were assumed to be equal to zero [15], [17]–[19]. However, in all empirical substitution matrices non-negligible rates are assigned to amino acid replacements that require multiple nucleotide changes. Variations in substitution rates or time intervals would yield significant probabilities for the multi-step substitutions. An alternative explanation is that a significant fraction of these substitutions occurred with multiple nucleotide changes. Thus, both of them are taken into account in the present work. It is assumed that substitution rates follow a Γ distribution. The use of a Γ distribution for rate variation has been attempted in many studies [25], [26]. Multiple nucleotide changes are assumed to occur on the same order of time as single nucleotide changes.

Interdependence of nucleotide substitutions at the three codon positions [7] and also spanning codon boundaries [20] has been pointed out. Evidence for a high frequency of double-nucleotide substitutions, on the order of 0.1 per site per billion years, was found in diverse organisms by Averof et al. [27], although there is a report [28] indicating a low rate of double-nucleotide mutations in primates. Bazykin et al. [29] pointed out the possibility of successive single compensatory substitutions for multiple nucleotide changes. Recently, many codon models that relax mathematical assumptions in a more sophisticated way than the models of Goldman and Yang [18] and Muse and Gaut [19] have been devised to study and to detect evidence of positive selection in codon evolutionary processes; see Anisimova and Kosiol [30] for a review.

In the Singlet-Doublet-Triplet (SDT) mutation model [20], single-nucleotide, doublet, and triplet mutations spanning codon boundaries are taken into account, but double nucleotide mutations at the first and the third positions in a codon are not, nor are the dependences of selective constraints on amino acid pairs. In the present model, it is assumed that nucleotide mutations occur independently at each codon position, so that any double nucleotide mutation occurs as frequently as doublet mutations. The codon substitution rate matrix of KHG [14] indicates that some types of double nucleotide mutations at the first and the third positions occur frequently.

Close relationships between selective constraints on amino acids and physico-chemical properties of amino acids and protein structures have been pointed out [4],[9],[17],[31]–[34]. We suppose that the relative strengths of selective constraints among amino acid pairs do not strongly depend on species, organelles, and even protein families but amino acid pairs. Then, we examine the performance of the present codon-based model, in which selective constraints are approximated to be a linear function of those estimated from JTT, WAG, LG, or KHG, in respect of how well other empirical substitution matrices including cpREV and mtREV can be fitted by adjusting parameters such as mutational tendencies and the strength of selective constraints. It is shown that these maximum likelihood (ML) estimators of the selective constraints perform better than any physico-chemical estimation. It is also indicated that the present model yields good values of Akaike information criterion (AIC) for a phylogenetic tree of mitochondrial coding sequences in comparison with the codon model almost equivalent to mtREV. If the present model is applied to the ML inference of phylogenetic trees, it will allow us to estimate mutational tendencies at the nucleotide level, which are specific to each species and organelle, such as transition-transversion bias and the ratio of nonsynonymous to synonymous rate. One of the interesting results revealed by the present model is that the ML estimators of transition to transversion bias calculated from the empirical substitution matrices are not so large as previously estimated. Also, AIC values indicate that a model allowing multiple nucleotide changes fits the empirical substitution matrices and the phylogeny of vertebrate mitochondrial proteins significantly better.

The present codon-based model with the new estimates for selective constraints on amino acids is useful as a simple evolutionary model for phylogenetic estimation, and also useful to generate log-odds for codon substitutions in protein-coding sequences with any genetic code.

## Methods

### A mechanistic codon substitution model with multiple nucleotide changes

In early codon substitution models [17], [18], the probabilities of multiple nucleotide replacements in the infinitesimal time difference $\Delta t$ were completely neglected by assuming them to be $O(\Delta t^2)$, when the probabilities of single nucleotide replacements are taken to be $O(\Delta t)$. In other words, the instantaneous mutation rate $M_{\mu\nu}$ from codon $\mu$ to $\nu$ was assumed to be equal to zero for codon pairs requiring multiple nucleotide replacements. However, multiple nucleotide mutations may not be negligible in real protein evolution [7], [14], [20], [27], [29], [35]. Here, multiple nucleotide changes are assumed to occur on the same order of time as single nucleotide changes, but unlike the SDT model [20] the mutation process is simplified in such a way that mutations occur independently at each position of a codon. Thus, the mutation rate matrix $M$ for a codon is defined here as

$$M_{\mu\nu} \equiv \prod_{i=1}^{3} \left[ \delta_{\mu_i \nu_i} + (1 - \delta_{\mu_i \nu_i})\, M^{i}_{\mu_i \nu_i} \right] \quad \text{for } \mu \neq \nu \tag{1}$$

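where $M^{i}$ is a mutation rate matrix between the four types of nucleotides at the $i$th codon position, $\delta_{\mu_i \nu_i}$ is the Kronecker delta, and the index $\mu_i$ means the $i$th nucleotide in the codon $\mu$; $\mu_i \in \{a, t, c, g\}$, where $i = 1, 2, 3$. Assuming that the rate matrix $M^{i}$ satisfies the detailed balance condition, it is represented as

$$M^{i}_{\mu_i \nu_i} = m^{i}_{\mu_i \nu_i}\, f^{\text{mut}}_{i,\nu_i} \quad \text{for } \mu_i \neq \nu_i \tag{2}$$

(3)(4)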

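where $f^{\text{mut}}_{i,\nu_i}$ is the equilibrium composition of nucleotide $\nu_i$ at the $i$th codon position, and $m^{i}_{\mu_i \nu_i}$ is the exchangeability between nucleotides $\mu_i$ and $\nu_i$ at the $i$th codon position. As a result of the detailed balance condition assumed for the $M^{i}$, the $M$ also satisfies the detailed balance condition with the codon composition $f^{\text{mut}}_{\mu} = \prod_{i=1}^{3} f^{\text{mut}}_{i,\mu_i}$;

$$f^{\text{mut}}_{\mu} M_{\mu\nu} = f^{\text{mut}}_{\nu} M_{\nu\mu} \tag{5}$$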
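The construction of the codon mutation rate matrix can be illustrated with a minimal sketch. It assumes that the rate for a codon pair is the product, over the codon positions that differ, of single-nucleotide rates of the form exchangeability times target nucleotide frequency, and that the same nucleotide parameters apply at all three positions. The function names, the frequencies, and the two exchangeability values are hypothetical and chosen only for illustration.

```python
from itertools import product

NUCLEOTIDES = "tcag"
TRANSITIONS = {frozenset("tc"), frozenset("ag")}  # pyrimidine and purine pairs


def nucleotide_rate(x, y, freq, m_transition, m_transversion):
    """Single-nucleotide rate x -> y: exchangeability times target frequency."""
    m = m_transition if frozenset((x, y)) in TRANSITIONS else m_transversion
    return m * freq[y]


def codon_mutation_rate(mu, nu, freq, m_transition, m_transversion):
    """Rate mu -> nu (mu != nu) as a product over the positions that differ."""
    rate = 1.0
    for x, y in zip(mu, nu):
        if x != y:
            rate *= nucleotide_rate(x, y, freq, m_transition, m_transversion)
    return rate


def codon_frequency(mu, freq):
    """Mutational equilibrium frequency of a codon under independent positions."""
    p = 1.0
    for x in mu:
        p *= freq[x]
    return p


# Illustrative (assumed) parameter values.
freq = {"t": 0.2, "c": 0.3, "a": 0.3, "g": 0.2}
m_ts, m_tv = 0.2, 0.1

codons = ["".join(c) for c in product(NUCLEOTIDES, repeat=3)]

# Largest violation of detailed balance over all ordered codon pairs.
max_imbalance = max(
    abs(codon_frequency(mu, freq) * codon_mutation_rate(mu, nu, freq, m_ts, m_tv)
        - codon_frequency(nu, freq) * codon_mutation_rate(nu, mu, freq, m_ts, m_tv))
    for mu in codons for nu in codons if mu != nu
)
print(max_imbalance)
```

Under these assumptions a double change such as `ttt` to `cct` receives the product of two single-nucleotide rates, so its rate is second order in the exchangeabilities, and the detailed balance of the nucleotide-level matrices carries over to the codon level.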

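The instantaneous substitution rate $R_{\mu\nu}$ from codon $\mu$ to $\nu$ can be represented as the product of the mutation rate $M_{\mu\nu}$ and the fixation probability $F_{\mu\nu}$ of the mutations under selection pressure; $R_{\mu\nu} \propto M_{\mu\nu} F_{\mu\nu}$. Let us assume that the $R$ also satisfies the detailed balance condition; that is,

$$f_{\mu} R_{\mu\nu} = f_{\nu} R_{\nu\mu} \tag{6}$$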

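where $f_{\mu}$ is the equilibrium codon composition of the substitution rate matrix $R$. The detailed balance condition Eq. 6 for the $R$ is equivalent to the condition that $R_{\mu\nu}$ can be expressed as a product of the $(\mu, \nu)$ element of a symmetric matrix and the equilibrium composition $f_{\nu}$. Similarly, the detailed balance condition Eq. 5 for the $M$ is equivalent to the condition that the matrix whose $(\mu, \nu)$ element is equal to $M_{\mu\nu} / f^{\text{mut}}_{\nu}$ is symmetric. Thus, the detailed balance conditions for the $M$ and the $R$ require that the fixation probability $F_{\mu\nu}$ must be represented as the product of frequency-dependent, $f_{\nu} / f^{\text{mut}}_{\nu}$, and frequency-independent, $e^{w_{\mu\nu}}$, terms; $F_{\mu\nu} = (f_{\nu} / f^{\text{mut}}_{\nu})\, e^{w_{\mu\nu}}$, where $w_{\mu\nu} = w_{\nu\mu}$. Then, the codon substitution rate $R_{\mu\nu}$ can be represented as

$$R_{\mu\nu} = C_{\text{onst}}\, M_{\mu\nu}\, \frac{f_{\nu}}{f^{\text{mut}}_{\nu}}\, e^{w_{\mu\nu}} \quad \text{for } \mu \neq \nu \tag{7}$$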

where $C_{\text{onst}}$ is an arbitrary scaling constant. The unit of time is chosen by determining the arbitrary scaling constant $C_{\text{onst}}$ in Eq. 7 in such a way that the total rate of the rate matrix $R$ is equal to one;

$$-\sum_{\mu} f_{\mu} R_{\mu\mu} = 1 \tag{8}$$
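The scaling step can be made concrete with a small sketch. It assumes that the off-diagonal substitution rate is the mutation rate multiplied by a frequency-dependent factor (equilibrium frequency over mutational frequency of the target state) and by the exponential of a symmetric selective-constraint term, and that the scaling constant is then fixed so that the total rate equals one. The three-state system and all numerical values are hypothetical; a real application would use the 61 sense codons.

```python
import math


def substitution_rates(M, f, f_mut, w):
    """Build a rate matrix R from mutation rates M, equilibrium frequencies f,
    mutational frequencies f_mut, and symmetric constraints w (assumed form),
    scaled so that sum_i f[i] * sum_{j != i} R[i][j] equals one."""
    n = len(f)
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                raw[i][j] = M[i][j] * (f[j] / f_mut[j]) * math.exp(w[i][j])
    # Total rate before scaling; dividing by it fixes the unit of time.
    total = sum(f[i] * raw[i][j] for i in range(n) for j in range(n) if i != j)
    R = [[raw[i][j] / total for j in range(n)] for i in range(n)]
    for i in range(n):
        R[i][i] = -sum(R[i][j] for j in range(n) if j != i)
    return R


# Toy three-state example with assumed values.
f_mut = [0.5, 0.3, 0.2]                      # mutational equilibrium
f = [0.2, 0.5, 0.3]                          # equilibrium under selection
m = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]   # symmetric exchangeabilities
M = [[m[i][j] * f_mut[j] for j in range(3)] for i in range(3)]
w = [[0.0, -1.0, -2.0], [-1.0, 0.0, -0.5], [-2.0, -0.5, 0.0]]  # symmetric constraints

R = substitution_rates(M, f, f_mut, w)
total_rate = -sum(f[i] * R[i][i] for i in range(3))
print(total_rate)
```

Because the unscaled rates are divided by their frequency-weighted sum, only ratios between rates survive the normalization, and the resulting matrix keeps `f` as its equilibrium composition.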

Therefore, only the relative values among are meaningful. The frequency-dependent term $f_{\nu} / f^{\text{mut}}_{\nu}$ represents the effects of selection pressures at the DNA level as well as at the amino acid level, which preserve the codon frequency, $f_{\nu}$, specific to a species and a protein, from the mutational frequency, $f^{\text{mut}}_{\nu}$. By taking the frequencies of stop codons to be zero, the rates from any codon to the termination codons are set to zero. The quantity $e^{w_{\mu\nu}}$ is the same as the one that Miyata et al. [32] called the rate of acceptance. We assume that selection pressure against codon replacements principally appears on an amino acid sequence encoded by a nucleotide sequence; $w_{\mu\nu}$ for the codon pair $(\mu, \nu)$ is equal to the selective constraint $w_{ab}$ for the encoded amino acid pair $(a, b)$.

$$w_{\mu\nu} \equiv \sum_{a} \sum_{b} C_{\mu a}\, w_{ab}\, C_{\nu b} \tag{9}$$

where $C$ is a genetic code table and $C_{\mu a}$ takes the value one if codon $\mu$ encodes amino acid $a$, otherwise zero. At the amino acid level, there should be no selection pressure against synonymous mutations. Thus, the $w_{ab}$ satisfies

$$w_{aa} = 0 \tag{10}$$

The matrix $w$ will be directly estimated by maximizing the likelihood of an empirical substitution matrix, or it will be evaluated for a specific protein family as a linear function of such an estimate of $w$;

$$w_{ab} = \beta\, \hat{w}_{ab} + w_{0}\, (1 - \delta_{ab}) \tag{11}$$

In Eq. 11, $\delta_{ab}$ is the Kronecker delta, and $\hat{w}_{ab}$ means the estimate of $w_{ab}$, which is either a physico-chemical estimate or an ML estimate calculated from a specific substitution matrix, and satisfies Eq. 10. The parameter $\beta$, which is non-negative, adjusts the strength of selective constraints for a protein family. The parameter $w_{0}$ controls the ratio of nonsynonymous to synonymous substitution rate, but it will be ineffective and may be assumed to be equal to 0 if amino acid sequences rather than codon sequences are analyzed.

Then, the substitution probability matrix $S(t)$ at time $t$ in a time-homogeneous Markov process can be calculated as

$$S(t) = \exp(R\, t) \tag{12}$$

Because the rate matrix $R$ satisfies the detailed balance condition, the $S(t)$ also satisfies it. Therefore, a substitution process is modeled as a reversible Markov process. The $R$ and the $S(t)$ that satisfy the detailed balance condition can be easily diagonalized with real eigenvalues and eigenvectors [17]; the eigenvalues of $R$ are the same as those of a symmetric matrix whose $(\mu, \nu)$ element is equal to $f_{\mu}^{1/2} R_{\mu\nu} f_{\nu}^{-1/2}$.
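As an illustration of how a transition matrix follows from a reversible rate matrix, the sketch below computes the matrix exponential for a toy three-state chain. Note that it uses scaling and squaring with a truncated Taylor series rather than the eigen-decomposition mentioned in the text, only to stay free of external libraries; the rate matrix and the helper names `mat_mul` and `mat_exp` are hypothetical.

```python
def mat_mul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(R, t, squarings=10, terms=12):
    """exp(R * t) by scaling and squaring with a truncated Taylor series."""
    n = len(R)
    scale = t / (2 ** squarings)
    A = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_mul(term, A)
        term = [[x / k for x in row] for row in term]   # now A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)                # undo the scaling
    return result


# Toy reversible rate matrix: symmetric exchangeabilities times target frequency.
f = [0.5, 0.3, 0.2]
s = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
R = [[s[i][j] * f[j] for j in range(3)] for i in range(3)]
for i in range(3):
    R[i][i] = -sum(R[i][j] for j in range(3) if j != i)

P = mat_exp(R, 1.0)
for row in P:
    print([round(x, 6) for x in row])
```

For the 61-state codon problem an eigen-decomposition of the symmetrized matrix would normally be preferred, since it is computed once and then reused for every branch length.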

If multiple nucleotide changes were completely ignored, then Eq. 1 would be simplified as , whose formulation for a codon mutation rate matrix with Eq. 2 is essentially the same as the one proposed by Muse and Gaut [19]. Here, it should be noted that in Eq. 2 is defined to be proportional to the equilibrium nucleotide composition . Alternatively, one may define in the same way as Miyazawa and Jernigan [17] and others [7], [18], who defined it to be explicitly proportional to the composition of the base triplet, . This alternative definition with Eqs. 7 and 8 is equivalent to Eqs. 1 and 2 with , and thus it is a special case of the present formulation; see [36] for justifications of this alternative definition.

In the present analyses, we assume for simplicity that and do not depend on codon position ; that is, and , where . This assumption is reasonable because mutational tendencies may not depend on a nucleotide position in a codon. Let us define to represent the average of the exchangeabilities of the transversion type, , , , and , and likewise to represent the average of the exchangeabilities of the transition type, and . We use the ratios as parameters for exchangeabilities, and to represent the ratio of the exchangeability of double nucleotide change to that of single nucleotide change and also the ratio of the exchangeability of triple nucleotide change to that of double nucleotide change; note that the exchangeabilities of single, double, and triple nucleotide changes are of , and in Eq. 1, respectively, and that Eq. 8 must be satisfied. Then, multiple nucleotide changes in a codon can be completely neglected by making the parameter approach zero with keeping constant in Eq. 8. Also, it is noted that double nucleotide changes at the first and the third positions in a codon are assumed to occur as frequently as doublet changes.

### Empirical substitution matrices used for model fitting

Maximum likelihood (ML) values are calculated for each 1-PAM substitution frequency matrix, which corresponds to the time duration of 1 amino acid substitution per 100 amino acids, of the JTT [5], the WAG [10], the LG [11], the cpREV [8], and the mtREV [6] amino acid substitution matrices, and of the KHG codon substitution matrix [14]. We have arbitrarily chosen the transition matrices of 1-PAM, whose time interval is long enough for the significant number of substitutions to occur and also too short for multi-step substitutions to cover multiple nucleotide changes. JTT is an accepted point mutation matrix compiled from the pairs of closely related proteins encoded in nuclear DNA. WAG, LG, cpREV, and mtREV are amino acid substitution matrices estimated by maximizing the likelihood of a given set of optimum phylogenetic trees. The KHG matrix used is the one named ECMunrest in the supplement of their paper, for which multiple nucleotide changes are allowed. JTT, WAG, LG, and KHG were all calculated from nuclear-encoded proteins, although JTT was calculated by a different method from the others. The matrices of cpREV and mtREV were calculated from proteins encoded in chloroplast DNA, and in vertebrate mitochondrial DNA, respectively. It should be noted here that a non-universal genetic code is used in the mitochondrial DNA.

### Average of a transition matrix over time or over rate

In the present study, model parameters are estimated by maximizing the likelihood of each 1-PAM substitution frequency matrix of JTT, WAG, LG, cpREV, mtREV, and KHG. In the case of JTT, pairs of closely related sequences were used to count substitutions, and the transition matrix was calculated by a parsimony method that completely neglects multiple substitutions at a site. Thus, JTT should be considered to consist of substitutions that occurred in various time intervals (various branch lengths). The substitution rate matrices of WAG, LG, mtREV, cpREV, and KHG were estimated by the ML method for a given set of protein phylogenetic trees. Each site of protein families may have evolved at a different rate. As a result, these substitution matrices may be regarded as an average over different substitution rates. Here we assume that evolutionary time intervals or substitution rates for each substitution matrix follow a Γ distribution. There have been many attempts [25], [26] to use a Γ distribution for rate variation.

If the substitution rate matrix is assumed to vary only by a scalar factor, the mean of a substitution matrix irrespective of over-time and over-rate will be calculated as (13)

where is the probability density function of a distribution with a scale parameter and a shape parameter , is the function, and is the identity matrix. The mean and the variance of the distribution are equal to and , respectively. Here we should recall that the rate matrix is normalized such that the total rate per unit time is equal to one; see Eq. 8.
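The effect of averaging over a Γ distribution can be checked numerically for a single eigenvalue. The sketch below assumes that the time (or rate) factor has a gamma density with a given shape and scale, in which case the expectation of $e^{\lambda \tau}$ for an eigenvalue $\lambda \le 0$ has the closed form $(1 - \text{scale} \cdot \lambda)^{-\text{shape}}$; applied to every eigenvalue of a rate matrix, this corresponds to a matrix expression built from the identity matrix and the rate matrix. The function names and the numerical values are hypothetical.

```python
import math


def gamma_pdf(tau, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    return (tau ** (shape - 1) * math.exp(-tau / scale)
            / (math.gamma(shape) * scale ** shape))


def averaged_exponential(lam, shape, scale):
    """Closed form of E[exp(lam * tau)] for gamma-distributed tau, lam <= 0."""
    return (1.0 - scale * lam) ** (-shape)


def averaged_exponential_numeric(lam, shape, scale, upper=40.0, steps=20000):
    """Midpoint-rule check of the same expectation."""
    h = upper / steps
    return h * sum(
        math.exp(lam * (k + 0.5) * h) * gamma_pdf((k + 0.5) * h, shape, scale)
        for k in range(steps)
    )


lam = -1.5                 # an assumed eigenvalue of a rate matrix
shape, scale = 2.0, 0.5    # mean = shape * scale = 1.0
closed = averaged_exponential(lam, shape, scale)
numeric = averaged_exponential_numeric(lam, shape, scale)
print(closed, numeric)
```

As the shape parameter grows with the mean held fixed, the distribution concentrates at its mean and the averaged factor approaches the plain exponential, which is the limit of no rate or time variation.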

### Evaluation of the log-likelihood of an empirical substitution matrix

The log-likelihood of the empirical frequency, , of substitutions from to in the present model can be calculated as (14)

where and mean one of the amino acid types for amino acid substitution matrices or one of the codon types for codon substitution matrices, is an observed transition probability matrix corresponding to the accepted point mutation matrix , is the observed composition of amino acid or codon , and is the total number of amino acid or codon sites compared to count substitutions. The observed composition is assumed to be the equilibrium composition of . is a set of parameters and is a set of the maximum likelihood (ML) estimators. Similarly, the estimate of the Kullback-Leibler (K-L) information by replacing the real distribution to the observed frequency distribution is calculated as (15)(16)

Maximum log-likelihood corresponds to the minimum of the estimate of K-L information, .
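The relation between the log-likelihood and the K-L information can be sketched in a few lines. The code assumes a multinomial form in which the observed count for the pair $(i, j)$ is the number of sites times the observed composition of $i$ times the observed transition probability from $i$ to $j$, and in which the model predicts the joint probability as composition times model transition probability. This form, the function names, and the toy matrices are assumptions made for illustration and may differ in detail from the definitions used in the paper.

```python
import math


def log_likelihood(S_obs, S_model, f_obs, n_sites):
    """Multinomial log-likelihood, up to a model-independent constant."""
    n = len(f_obs)
    return n_sites * sum(
        f_obs[i] * S_obs[i][j] * math.log(f_obs[i] * S_model[i][j])
        for i in range(n) for j in range(n) if S_obs[i][j] > 0.0
    )


def kl_information(S_obs, S_model, f_obs):
    """Estimate of the K-L information with the observed frequencies
    standing in for the true distribution."""
    n = len(f_obs)
    return sum(
        f_obs[i] * S_obs[i][j] * math.log(S_obs[i][j] / S_model[i][j])
        for i in range(n) for j in range(n) if S_obs[i][j] > 0.0
    )


# Toy two-state example with assumed values.
f_obs = [2.0 / 3.0, 1.0 / 3.0]
S_obs = [[0.99, 0.01], [0.02, 0.98]]
S_model = [[0.985, 0.015], [0.03, 0.97]]
n_sites = 1000

kl = kl_information(S_obs, S_model, f_obs)
gap = (log_likelihood(S_obs, S_obs, f_obs, n_sites)
       - log_likelihood(S_obs, S_model, f_obs, n_sites))
print(kl, gap)
```

Under this assumed form the log-likelihood of a model differs from its upper bound by exactly the number of sites times the K-L estimate, which is why maximizing one minimizes the other.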

The transition probability, , between amino acids and and the composition, , of amino acid are related to those for codons as follows. (17)(18)
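The mapping from codon-level to amino-acid-level quantities can be sketched as follows. The code assumes the usual lumping rule: the composition of an amino acid is the sum of the compositions of its codons, and the amino acid transition probability is the composition-weighted average of codon transition probabilities over the source codons, summed over the target codons. The three-codon genetic code and all numbers are hypothetical toy values.

```python
def amino_acid_composition(f_codon, code):
    """Sum codon frequencies over the codons that encode each amino acid."""
    f_aa = {}
    for codon, aa in code.items():
        f_aa[aa] = f_aa.get(aa, 0.0) + f_codon[codon]
    return f_aa


def amino_acid_transition(S_codon, f_codon, code):
    """Lump a codon transition matrix into an amino acid transition matrix."""
    f_aa = amino_acid_composition(f_codon, code)
    S_aa = {a: {b: 0.0 for b in f_aa} for a in f_aa}
    for mu, a in code.items():
        for nu, b in code.items():
            S_aa[a][b] += f_codon[mu] * S_codon[mu][nu] / f_aa[a]
    return S_aa


# Toy genetic code: two codons for F, one for L (assumed values throughout).
code = {"ttt": "F", "ttc": "F", "tta": "L"}
f_codon = {"ttt": 0.3, "ttc": 0.2, "tta": 0.5}
S_codon = {
    "ttt": {"ttt": 0.97, "ttc": 0.02, "tta": 0.01},
    "ttc": {"ttt": 0.03, "ttc": 0.96, "tta": 0.01},
    "tta": {"ttt": 0.006, "ttc": 0.004, "tta": 0.99},
}

f_aa = amino_acid_composition(f_codon, code)
S_aa = amino_acid_transition(S_codon, f_codon, code)
print(f_aa, S_aa)
```

Synonymous changes such as `ttt` to `ttc` fall on the diagonal of the lumped matrix, so they are invisible at the amino acid level.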

The goodness of a model and the significance of parameters can be indicated by Akaike Information Criterion (AIC). The AIC value is defined as (19)(20)(21)

For convenience, , which is equal to a constant value added to the AIC value, is also defined above. The AIC and always take a non-negative value. Models with smaller AIC and can be considered to be more appropriate [37].

Parameters in the present model are , , , , , and . Assuming that the observed process of substitutions is in the stationary state, the estimates of the equilibrium codon and the equilibrium amino acid compositions, and , are taken to be the observed composition of the codon and of the amino acid: (22)

In the case of amino acid sequences, for which their coding sequences are not available, codon compositions may be parameterized by (23)(24)

In the present analyses, this parameterization is used for the equilibrium codon compositions in amino acid sequences.

Then, the shape parameter of a distribution for variations in mutation rates or evolutionary time intervals for observed codon or amino acid substitutions is estimated by equating the ratio of the expected number of substitutions in the model to its observed value.(25)

Other parameters , , , , and are evaluated as ML estimators or fixed to a proper value. The observed transition matrix corresponding to 1-PAM is used here; PAM means accepted point mutations per 100 amino acids. (26)

### The total number of site comparisons () for each empirical substitution matrix

In the case of JTT, 59190 accepted point mutations found in 16130 protein sequences were used to build a substitution probability matrix of 1-PAM [5]. Thus, the total number of amino acid comparisons for JTT is assumed to be equal to . On the other hand, a phylogenetic tree for cpREV is based on amino acid sites of 45 proteins encoded in chloroplast DNAs of 9 species [8], and the one for mtREV is based on amino acid sites of the complete mitochondrial DNA from 20 vertebrate species (3 individuals from human) [6]. Thus, the total number of site comparisons for them may be approximated to be equal to the number of amino acid sites multiplied by the number of branches in the phylogenetic tree used to evaluate the transition matrices; that is, for cpREV, and for mtREV. The BRKALN database consisting of 50867 sites and 895132 residues was used to estimate WAG. Thus, is used for WAG [10], [11]. To evaluate LG, 3412 of 3912 alignments consisting of 49637 sequences, 599692 sites, and 6697813 residues are used [11]. Therefore, is assumed for LG. These crude estimates of are used to evaluate the AICs of JTT, WAG, LG, cpREV and mtREV.

In the case of KHG, which was estimated by maximizing a likelihood of a set of phylogenetic trees of coding sequences of 7332 nuclear protein families taken from Pandit database [38], the total numbers of residues and sites are not written in Kosiol et al. [14], so that an AIC value is not given for KHG in the following.

## Results

Models, each of which includes a different number of parameters and is a special case of models including more parameters, are fitted by a maximum likelihood method to each of the 1-PAM amino acid substitution frequency matrices, JTT [5], WAG [10], and LG [11] for proteins encoded in nuclear DNA, cpREV [8] for chloroplast DNA, and mtREV [6] for mitochondrial DNA. Also, the models are fitted to the 1-PAM codon substitution frequency matrix of KHG [14] for nuclear DNA. The selective constraints are either directly estimated by ML or evaluated from a known estimate by Eq. 11 that includes two parameters and . The parameter is fixed here to for amino acid substitution matrices because the likelihood of an amino acid substitution matrix does not strongly depend on ; codon substitution data are required to reliably estimate the value of , which significantly affects the ratio of nonsynonymous to synonymous substitution rate. Each model is named to indicate either the method to estimate or the name of with a suffix meaning the number of ML parameters. Each model is briefly described in Table 1. The Nelder-Mead Simplex algorithm has been used for the maximization of likelihoods.

### The effects of selective constraints

First, the No-Constraints models, in which selective constraints do not depend on amino acid pairs, in Eq. 11, were examined to see how well nucleotide mutation rates, codon frequencies and a genetic code can explain the observed frequencies of amino acid substitutions in JTT, WAG, cpREV, and mtREV; the No-Constraints models disallowing multiple nucleotide changes are equivalent to mononucleotide substitution models, because is used here. The value and the ML estimates for each parameter set are listed in Table 2 and Table S1, respectively. Please refer to Text S1 for details. These No-Constraints models serve as a reference to measure how much selection models can improve the likelihoods. Then, we examine various estimations of selective constraints on amino acids based on the physico-chemical distances of amino acids evaluated by Grantham [31] and by Miyata et al. [32] and on mean energy increments due to an amino acid substitution. These models are called Grantham, Miyata, and Energy-Increment-based (EI) models, respectively. Please refer to Text S1 for the definition of the mean energy increment and for the details of each model. The values and the ML estimates for these models with various sets of parameters are also listed in Table 2, and Tables S2 and S3, respectively. Comparisons of values between the models in Table 2 indicate that the selective constraints on amino acids representing conservative selection against amino acid substitutions significantly improve the values of all substitution matrices. It is also indicated that Miyata's physico-chemical distance performs better in all parameter sets than Grantham's distance. This result is consistent with that of Yang et al. [7] for mitochondrial proteins. The present physico-chemical evaluation of selective constraints (EI models) fits JTT and WAG even better than Miyata's distance scale, although the performances of the two methods are almost the same for cpREV and mtREV. One of the important facts in these results is that allowing multiple nucleotide changes in a codon significantly improves the AIC irrespective of the estimations of selective constraints; compare the values between the Grantham-10 and the Grantham-11, between the Miyata-10 and the Miyata-11, and between the EI-10 and the EI-11.

### The effects of multiple nucleotide changes on ML estimations

In principle, all parameters for selective constraints can be optimized in the case of codon sequences. In the case of protein sequences, all 190 off-diagonal elements of $w$, in addition to the parameters for mutational tendencies at the nucleotide level and others, cannot be optimized simultaneously; the number of degrees of freedom in a general reversible model for an amino acid transition matrix is equal to 209.

In order to see how well amino acid substitution matrices can be explained with the assumption of successive single nucleotide substitutions, let us optimize corresponding to single-step amino acid pairs by assuming that only single nucleotide mutations are possible, i.e., by with in Eq. 8. The number of for the single-step amino acid pairs is equal to 75 in the case of the universal genetic code. All 75 for the single-step amino acid pairs have been optimized for each of JTT and WAG together with the nucleotide exchangeabilities , the equilibrium nucleotide composition , the codon usage parameters and the scale parameter ; the total number of the parameters is equal to 87 in addition to the 19 amino acid frequencies and the shape parameter . This maximum likelihood model to estimate the matrix is called ML with a suffix meaning the number of ML parameters; see Table 1. The ML estimates of these parameters except for the ML-87 are listed in Table 3 for JTT and WAG.

In the lowest rows of this table, the ratio of the total nucleotide substitution rate per codon to the codon substitution rate, which represents the average number of nucleotide changes for substituting a codon, the ratio of the total transition to the total transversion rate per codon, and the ratio of nonsynonymous to synonymous substitution rate per codon are listed for the models. The sum of the total transition and the total transversion rates per codon is equal to the total nucleotide substitution rate per codon. The lowest three rows list their values in the case of and , and the second lowest three rows for the case of . Thus, the differences of their values between the lowest and second lowest three rows represent the effects of selective constraints on amino acids (), and those between the second lowest and the third lowest three rows describe the effects of rate/time variations on the substitution matrix. If codon substitutions proceed by successive single nucleotide changes, i.e., , then the ratio of the total nucleotide to the codon substitution rate will be equal to 1 in the case of .

Here it should be noted that the nonsynonymous and the synonymous substitution rates are defined not as rates per site but simply as rates per codon. The sum of the nonsynonymous and the synonymous substitution rates is equal to the codon substitution rate. The ratio of the nonsynonymous to the synonymous substitution rate per codon does not correspond to the ratio of nonsynonymous to synonymous substitutions per site, [39], but to the ratio of nonsynonymous to synonymous substitutions per codon, [39]. The ratio ( [39]) of the effective number of nonsynonymous sites to that of synonymous sites per codon corresponds to the ratio of nonsynonymous to synonymous rate in the case of no selective constraints (). In the present models, indicating the effects of selection on amino acid replacements corresponds to the nonsynonymous to synonymous substitution rate ratio in the case of divided by that in the case of and . Table 3 indicates that selection on amino acids is conservative, because the ratio of nonsynonymous to synonymous rate per codon is much smaller in the case of than in the case of and .

As expected, the AIC value drastically decreases from that of the EI-14 in both cases of JTT and WAG, indicating that the introduction of many parameters may be still appropriate. However, there are large discrepancies between the observed transition matrix and the one estimated by the ML-87. Let us see the discrepancies between them in terms of log-odds.

A log-odds matrix introduced by Dayhoff et al. [4] is one of the representations of amino acid substitution propensities. The element of the log-odds matrix is defined to be the logarithm of odds to find an amino acid pair in comparison with random sequences. The odds is equal to the element of transition matrix divided by the amino acid composition (Eqs. 27 and 28).

The proportional constant in Eq. 28 is the one originally used by Dayhoff et al. [4].
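As a rough numerical illustration of the log-odds definition, the sketch below divides each transition probability by the composition of the target amino acid and scales the base-10 logarithm by 10. Both the indexing convention (`transition[a][b]` as the probability of going from `a` to `b`) and the factor 10 (the scaling commonly attributed to Dayhoff et al.) are assumptions of this example; Eqs. 27 and 28 remain the authoritative definitions.

```python
import math


def log_odds(transition, composition):
    """Return 10 * log10(transition[a][b] / composition[b]) for all pairs.

    transition[a][b]: probability that a is replaced by b (assumed layout).
    composition[b]:   equilibrium frequency of b.
    Zero probabilities map to -inf rather than raising an error.
    """
    n = len(composition)
    result = []
    for a in range(n):
        row = []
        for b in range(n):
            odds = transition[a][b] / composition[b]
            row.append(10.0 * math.log10(odds) if odds > 0.0 else -math.inf)
        result.append(row)
    return result


# Toy two-state reversible example (invented numbers, not from any matrix):
# detailed balance holds because 0.25 * 0.03 == 0.75 * 0.01.
f = [0.25, 0.75]
S = [[0.97, 0.03],
     [0.01, 0.99]]
L = log_odds(S, f)
print(L)
```

For a reversible process the resulting matrix is symmetric, so a single value describes each amino acid pair.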

In Fig. 1, the log-odds corresponding to the 1 PAM transition matrix of the ML-87 model fitted to JTT are plotted against those calculated from JTT. Plus, circle and cross marks show the log-odds for one-, two-, and three-step amino acid pairs, respectively. Although the estimated values of log-odds for one-step amino acid pairs are almost exactly equal to those of the JTT matrix, there are still large discrepancies between the log-odds values for two- and three-step amino acid pairs, indicating a non-stepwise manner of codon substitutions. Similar discrepancies are also found in Fig. S1 for WAG.
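The classification into one-, two-, and three-step amino acid pairs can be reproduced with a short script. This is a sketch that assumes the standard genetic code and takes the step count of a pair to be the minimum number of nucleotide differences between any codon of one amino acid and any codon of the other; the helper names are hypothetical.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code in TCAG order (third codon position varies fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group sense codons by the amino acid they encode.
codons_of = {}
for codon, aa in CODE.items():
    if aa != "*":
        codons_of.setdefault(aa, []).append(codon)


def min_steps(a, b):
    """Minimum number of nucleotide changes needed to convert a into b."""
    return min(
        sum(x != y for x, y in zip(c, d))
        for c in codons_of[a]
        for d in codons_of[b]
    )


# Count the amino acid pairs in each step class.
counts = {1: 0, 2: 0, 3: 0}
for a, b in combinations(sorted(codons_of), 2):
    counts[min_steps(a, b)] += 1
print(counts)
```

A different genetic code, such as the vertebrate mitochondrial code relevant to mtREV, would change the `AMINO` string and can move some pairs between classes.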

Each element log- of the log-odds matrices of (A) the ML-87 and (B) the ML-91 models fitted to the 1-PAM JTT matrix is plotted against the log-odds log- calculated from JTT. Plus, circle, and cross marks show the log-odds values for the types of substitutions requiring single, double and triple nucleotide changes, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

We have examined how the AIC is improved by enabling multiple nucleotide changes in a codon. The selective constraints for multiple nucleotide changes are classified into 6 groups according to the amounts of discrepancies between the observed and the estimated values of the log-odds as shown in Fig. 1. Then, the ML estimates of 94 parameters including 7 additional parameters, for the 6 groups of multiple nucleotide changes and the parameter for the rate of multiple nucleotide change, are calculated. This model is called ML-94. Also, the values of for multi-step amino acid pairs are calculated by maximizing the likelihood while fixing the values of all other parameters including for the single-step amino acid pairs; this model is called here ML-94+ by appending the "+" mark. It should be noted that these values of for the multi-step amino acid pairs in the ML-94+ are not ML estimates at all. The ML estimates for single-step amino acid pairs, the classification of multi-step amino acid pairs into the 6 groups, and the ML estimates for those categories of are provided in Data S1. As shown in Table 3, the ML estimates of , , and for the ML-87 model are very different from those for the ML-94, and some of them for the ML-87 seem to be unrealistic. For example, is evaluated to be smaller than . Also, the small value of indicates the extremely biased usage of codons. The ML estimate of a distribution is too large. These parameters are forced in the ML-87 to take such values to reduce the discrepancies between the observed and the estimated counts for multi-step amino acid pairs. In the ML-94 model, the ML estimators of these parameters take more reasonable values. However, it may also yield unreasonable estimates for codon usage parameters, ; for example, in the ML-94 for WAG, and in the ML-94 for LG. Thus, the ML-91 model with , which means equal codon usage, may be better than the ML-94. 
The ML-91 model was applied for JTT, WAG, and LG, and the ML estimates for them in the ML-91 are also listed in Table 3.

The ML estimators , , and show a similar tendency between the ML-91 models for all the amino acid substitution matrices, i.e., JTT, WAG, and LG. The parameter for multiple nucleotide changes and the scale parameter for rate variation are both significant for all the matrices. The values of for JTT, WAG, and LG indicate that the mean exchangeability of the transition type is larger than that of the transversion type in all the matrices.

As shown in Fig. 1 for JTT and in Fig. S1 for WAG, the large discrepancies of the log-odds for the multi-step amino acid pairs disappear in the ML-91, in which multiple nucleotide changes are taken into account. The AIC values of JTT and WAG are significantly improved by enabling multiple nucleotide changes in the ML-91. This fact confirms that multiple nucleotide changes are statistically significant and should be taken into account to build a codon substitution model.
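The model comparisons above rely on AIC, which penalizes the maximized log-likelihood by the number of free parameters. The sketch below uses the standard definition; whether the AIC values in the tables are additionally rescaled is not stated in this section, and the log-likelihood numbers in the example are invented purely for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2 k (smaller is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical values: a model with 4 extra parameters is preferred only
# if its log-likelihood improves by more than 4 units.
aic_small = aic(log_likelihood=-1000.0, n_params=87)
aic_large = aic(log_likelihood=-990.0, n_params=91)
print(aic_small, aic_large, aic_large < aic_small)
```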

### ML estimation for the KHG codon substitution matrix

If a codon substitution matrix is used for model fitting with the assumption of multiple nucleotide changes, all 190 parameters of selective constraints will be able to be optimized. The ML-200 model has been fitted to the 1-PAM codon substitution frequency matrix of KHG, which was empirically estimated without any restriction on multiple nucleotide changes [14].

The log-odds values for the codon pairs requiring single, double, and triple nucleotide changes are shown in Figs. 2A, 2B, and 2C, respectively. In these figures, upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. The dotted line shows the line of values where the observed and the estimated values of log-odds are equal to each other. The log-odds of the codon pairs requiring single/double/triple nucleotide changes for one/two/three-step amino acid pairs respectively tend to fall along the dotted line in comparison with the log-odds of the other codon pairs. In other words, the log-odds of the codon pairs for which any nucleotide change is accompanied by an amino acid change are correctly estimated. On the other hand, the estimated log-odds values do not agree well with the observed ones for synonymous codon pairs shown by the upper triangles. These estimated log-odds can be adjusted only by changing nucleotide mutation rates, i.e., and . Thus, the approximations of the independence and of no difference of nucleotide exchangeabilities between nucleotide positions may be limited; see Eq. 1.

Each element log- of the log-odds matrix corresponding to (A) single, (B) double, and (C) triple nucleotide changes in the ML-200 model fitted to the 1-PAM KHG codon substitution matrix is plotted against the log-odds log- calculated from KHG. In (D), codon log-exchangeabilities of the 1-PAM KHG codon substitution matrix corresponding to triple nucleotide changes are plotted against the log-odds log- calculated from KHG. The log-exchangeability of the 1-PAM KHG is defined as . Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. Log-exchangeabilities for the codon pairs whose instantaneous rates are estimated to be in KHG are shown to be about in this figure. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

The codon pairs whose log-odds values are less than and which require more nucleotide changes than the minimum required for the corresponding amino acid pair tend to be located above rather than below the dotted line; see plus marks in Fig. 2B and plus and circle marks in Fig. 2C. This tendency is clearer in Fig. 2C, in which plus and circle marks corresponding to one- and two-step amino acid pairs are mostly located far from and almost parallel to the dotted line. The estimated values of the log-odds for these one- and two-step amino acid pairs are greater by 10 – 15 than the observed values.

In Fig. 2D, the log-exchangeabilities of the codon pairs requiring triple nucleotide changes in the 1-PAM KHG matrix are plotted against their log-odds of the 1-PAM KHG matrix. The log-exchangeability is defined here to be . The log-exchangeabilities of the codon pairs corresponding to three-step amino acid pairs are all nearly equal to their log-odds. The smallest log-exchangeabilities of these codon pairs reach almost . However, there are many codon pairs whose log-exchangeabilities are smaller than , and all of them correspond to one- or two-step amino acid pairs. The log-exchangeabilities of these codon pairs are significantly smaller than their log-odds, indicating that almost all substitutions of these codon pairs were estimated in KHG not to occur by triple nucleotide changes but rather by successive single or double nucleotide changes.

In the present model, codon exchangeabilities are approximated by the product of nucleotide exchangeabilities; see Eq. 1 for the exact expression. Therefore, all codon exchangeabilities for triple nucleotide changes are in the same order of magnitude, and specific codon pairs cannot be significantly less exchangeable. Thus, the present approximation for codon exchangeabilities may have a limitation, unless those exchangeabilities of KHG are underestimated. Estimation of the exchangeabilities for those codon pairs, which require more nucleotide changes than the least nucleotide changes required for the corresponding amino acid pair, may be less reliable than for the others.
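The consequence of a product form can be seen in a small numerical sketch. It assumes, for illustration only, position-independent nucleotide exchangeabilities with one value for transitions (`kappa`) and one for transversions (1.0), and it omits any additional weight for multiple changes that Eq. 1 may contain; the function names are hypothetical.

```python
from itertools import product

PURINES = {"A", "G"}


def ts_tv_exchangeabilities(kappa):
    """Symmetric nucleotide exchangeabilities: kappa for ts, 1.0 for tv."""
    table = {}
    for x, y in product("TCAG", repeat=2):
        if x != y:
            same_class = (x in PURINES) == (y in PURINES)
            table[frozenset((x, y))] = kappa if same_class else 1.0
    return table


def codon_exchangeability(c, d, nuc_exch):
    """Product of nucleotide exchangeabilities over the differing positions."""
    value = 1.0
    for x, y in zip(c, d):
        if x != y:
            value *= nuc_exch[frozenset((x, y))]
    return value


def triple_change_range(nuc_exch):
    """Smallest and largest value over all triplet pairs differing at 3 sites."""
    triplets = ["".join(t) for t in product("TCAG", repeat=3)]
    values = [
        codon_exchangeability(c, d, nuc_exch)
        for c in triplets
        for d in triplets
        if all(x != y for x, y in zip(c, d))
    ]
    return min(values), max(values)


ex = ts_tv_exchangeabilities(2.0)
print(codon_exchangeability("CGT", "AGG", ex))
print(triple_change_range(ex))
```

Under these assumptions every triple change lies between the cube of the transversion value and the cube of the transition value, so no individual codon pair can be made much less exchangeable than the rest, which is the limitation noted in the text.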

The ML estimates , and for KHG are listed in Table 3. The scale parameter of the distribution is estimated to be for KHG, meaning that variations in rates need not be taken into account for KHG. There is a different tendency in the between KHG and the amino acid substitution matrices. One remarkable difference between them is that the parameter for transition-transversion bias is estimated to be greater than one in the ML-91 for JTT, WAG, and LG but to be less than one in the ML-200 for KHG. This estimation of transition to transversion bias for KHG results from the fact that the ratio of the total transition to the total transversion substitution rate is actually equal to in KHG, although this fact is contrary to the common understanding of transition-transversion bias. Because selective constraints on amino acids favor transitions more than transversions, transition-transversion bias in nucleotide mutation rates for KHG must be much less than . Indeed, the ratio of the total transition to the total transversion mutation rate is estimated to be 0.427; see Table 3.

### Comparison of ML estimates among the present models

In Table 4, the correlation coefficients of between the present models are listed. The lower half of the table lists those for single-step amino acid pairs, and the upper half lists those for multi-step amino acid pairs by excluding the amino acid pairs that belong to the least exchangeable class at least in one of the models. Each model name of JTT/WAG/LG-ML91+ and KHG-ML200 means the empirical substitution matrix and the method used to estimate selective constraints, . In the following, these ML estimates of will be specified as and . In the EI method, selective constraints are approximated by a linear function of the energy increment due to an amino acid substitution, , which is defined by Eqs. S1-4, S1-5, and S1-6 in Text S1; therefore, .

The correlations of the ML estimates between the JTT-ML91+, the WAG-ML91+, and the LG-ML91+ are very strong even for the multi-step amino acid pairs. Comparisons of the ML estimates of selective constraints between various models are shown in Fig. S2. The estimated from the KHG codon substitution matrix are less correlated with from the other amino acid substitution matrices, especially less for the multi-step amino acid pairs. The ML estimates for the multi-step amino acid pairs are relatively smaller in the KHG-ML200 than in the JTT/WAG/LG-ML91+ models; see Fig. S2.

The correlations of between the EI and others are not as good as those between the other estimates, but they are significant especially between the EI and the KHG-ML200 even for the multi-step amino acid pairs. In Fig. 3A, the ML estimates in the JTT-ML91+ are plotted against the energy increments due to an amino acid substitution; the least exchangeable category of multi-step amino acid pairs are not shown in this figure. Similar plots for the WAG-ML91+ and for the LG-ML91+ are shown in Fig. S3. The ML estimates for all amino acid pairs in the KHG-ML200 are plotted against the energy increments in Fig. 3B. No drastic difference in the correlation between these two quantities is found among one-, two-, and three-step amino acid pairs. The correlations of between the EI and the other models are better for the ML-91 than for the ML-87; the correlation coefficient between them for the single step amino acid pairs is equal to for the JTT-ML87 but for the JTT-ML91 and for the WAG-ML87 but for the WAG-ML91. The ML estimates for the single step amino acid pairs are compared between the ML-87 and the ML-91 models in Fig. S4.

The ML estimate, (A) in the ML-91+ model fitted to the 1-PAM JTT amino acid substitution matrix and (B) in the ML-200 model fitted to the 1-PAM KHG codon substitution matrix, for each amino acid pair is plotted against the mean energy increment due to an amino acid substitution, () defined by Eqs. S1-4, S1-5, and S1-6 in Text S1. In (A), the estimates for the least exchangeable class of multi-step amino acid pairs are not shown. Plus, circle, and cross marks show the values for one-, two-, and three-step amino acid pairs, respectively.

In the next section, we will examine whether the differences among these estimates of are significant in representing selective constraints on amino acids.

### Performance of the ML estimates and the characteristics of nucleotide mutations estimated

The present model for codon substitutions is designed to separate selective pressures at the amino acid level from mutational events at the nucleotide level. Both unequal usage of degenerate codons and different rates of transition and transversion are characteristic of a genetic system specific to each species and each organelle. On the other hand, the relative strengths of selective constraints on amino acids would be far less specific to each species and each protein than each type of amino acid, although the mean strength of the selective constraints is specific to each protein family. Thus, we tried to approximate selective constraints () for empirical substitution matrices including cpREV and mtREV by a linear function of those () estimated from each of JTT, WAG, LG, and KHG; and are used as in Eq. 11. We call these models JTT/WAG/LG-ML91+ or KHG-ML200, which mean the empirical substitution matrix and the model used to estimate , with a suffix meaning the number of ML parameters; see Table 1.
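The linear approximation can be pictured as rescaling a fixed table of base estimates with two gene-specific numbers. The sketch below is only a schematic: Eq. 11 is not reproduced in this section, so the exact functional form, the parameter names `slope` and `offset`, and the toy base values are all assumptions made for illustration.

```python
def scaled_constraints(base, slope, offset):
    """Apply an assumed linear map slope * w + offset to each pair's value.

    base maps an amino acid pair (a, b) with a != b to a base estimate of
    the selective constraint taken from a reference matrix.
    """
    return {pair: slope * w + offset for pair, w in base.items()}


# Invented base values for three amino acid pairs (not from Data S1).
base_estimates = {("D", "E"): -0.5, ("I", "V"): -0.25, ("D", "W"): -4.0}

# A larger slope strengthens all constraints while keeping their ordering.
weak = scaled_constraints(base_estimates, slope=0.5, offset=0.0)
strong = scaled_constraints(base_estimates, slope=2.0, offset=-1.0)
print(weak)
print(strong)
```

Because the map is linear with a positive slope, the ranking of amino acid pairs is inherited from the reference matrix and only the overall strength is fitted per gene or per matrix.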

In Table 5, the ML values for these models with the various sets of parameters are listed for all empirical substitution matrices. The ML estimates in the JTT/WAG/LG-ML91+−11 and the KHG-ML200-11 models are listed in Tables 6, 7, and 8. The JTT-ML91+−0, the WAG-ML91+−0 and the LG-ML91+−0 models are the codon-based models corresponding to the JTT-F, the WAG-F and the LG-F amino-acid-based model, respectively, in which the JTT, the WAG and the LG rate matrices with an adjustment for the equilibrium frequencies of amino acids are used as a substitution rate matrix, because all 11 parameters of , , and are fixed to the values of their ML estimators in the ML-91+ for JTT, WAG, and LG; and are assumed. However, a critical difference is that a genetic code cannot be taken into account in the JTT/WAG/LG-F but in the JTT/WAG/LG-ML94+−0. This difference between the two kinds of models can be clearly seen in the present models applied to mtREV, because a non-universal genetic code is used in the vertebrate mitochondrial DNA. The AIC is improved from in the JTT-F to in the JTT-ML91+−0. This indicates an advantage of the present mechanistic model over the empirical amino acid substitution model.

The AIC values of the JTT/WAG/LG-ML91+−0 are better for all the four matrices (JTT, WAG, cpREV, and mtREV) than those of the physico-chemical method EI-11; compare Tables 2 and 5. The AIC values of the KHG-200-0 are better for all except for JTT than those of the EI-11. The AIC values of all the models are drastically improved for all the matrices by optimizing the 11 parameters; see Table 5. It is noteworthy that all the models of the JTT-ML91+−11, the LG-ML91+−11, and the KHG-ML200-11 yield a better AIC value for WAG than the ML-87 model does, rejecting the null hypothesis of no multiple nucleotide change again; see Tables 3 and 5. Thus, the ML estimates and sufficiently represent selective constraints on amino acid substitutions.

In addition, Table 5 indicates which parameters are the most effective for improving AIC. As with the EI models, the JTT/WAG/LG-ML91+−7, in which the parameters are fixed to the ML estimates for JTT/WAG/LG with a certain ratio of transition to transversion exchangeability, can improve the AIC to a degree similar to the AIC values of the JTT/WAG/LG-ML91+−11, respectively. In other words, the parameters are very effective in improving the AIC in comparison with the parameters .

The log-odds values of amino acid pairs estimated by the KHG-ML200-11 are plotted against their empirical values for the 1-PAM amino acid substitution matrices of JTT, WAG, LG, and mtREV in Fig. 4. Similar plots are shown in Figs. S5 – S10. The comparisons of Fig. 1 and Fig. S1 for the ML-87 model with Fig. 4 and Fig. S5 clearly indicate the good qualities of the ML estimators and . Relatively large disagreements between empirical and estimated log-odds exist for cpREV and mtREV in comparison with those for JTT, WAG, LG, and the KHG-derived amino acid substitution matrix (KHGaa); see Fig. 4 and Figs. S5 – S7. It is unknown whether the disagreements shown in these figures represent meaningful features in the amino acid substitutions in the chloroplast DNA and the mitochondrial DNA or result from the relatively small size of sequence data used for cpREV and mtREV. However, the large disagreements in the region of low log-odds values may be artifacts, because cpREV and mtREV tend to include relatively large errors in this region, especially for mtREV; the log-odds values for mtREV whose values are smaller than about are all assumed to be ; see the original paper [6].

Each element log- of the log-odds matrices of the KHG-ML200-11 model fitted to the 1-PAM matrices of (A) JTT, (B) WAG, (C) LG, and (D) mtREV is plotted against the log-odds log- calculated from the corresponding empirical substitution matrices. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa. The log-odds elements of mtREV whose values are smaller than about are all assumed to be ; see the original paper [6].

The ML estimates of listed in Tables 6, 7, and 8 indicate that the strength of selective constraints on amino acids is strong in the order of LG, WAG, and JTT. The strength of selective constraints is also shown by the change of the ratio of nonsynonymous to synonymous rate per codon between the two cases without and with selective constraints, i.e., the cases of and , and . As already noted, the ratio of these values between the two cases represents the strength of selective constraints. In the KHG-ML200-11, these ratios are equal to , , and for LG, WAG, and JTT, respectively, meaning that the selective constraints of LG are strongest; it should be noted that this order agrees with the increasing order of .

Tables 6 and 7 indicate that the selective constraints estimated from the KHG codon substitution matrix tend to estimate the contribution of multiple nucleotide changes () to be smaller, the ratio of transition to transversion exchangeability () to be smaller, to be larger, and variations in substitution rates () to be less than the from the amino acid substitution matrices. Table 8 shows that the same characteristic differences will be observed if the JTT/WAG/LG-ML91+−11 models are fitted to the codon substitution matrix of KHG instead of its derived amino acid substitution matrix. Tables 6, 7, and 8 also show that the ratio of transition to transversion exchangeability () tends to be estimated to be smaller in the order of the LG-ML91+, the WAG-ML91+, the JTT-ML91+, and the KHG-ML200. The is estimated by the ML-91 or the ML-200 model to be smaller in the order of LG, WAG, JTT, and KHG; see Table 3. The present ML estimates for selective constraints on amino acids seem to reflect the characteristics of respective substitution matrices to which the models are fitted. It remains to be analyzed which estimation is better among the JTT/WAG/LG-ML91+ and the KHG-ML200, and by how much. Irrespective of which estimation of the selective constraints is better, the ML estimates indicate that the transition to transversion bias is not so strong as previously estimated.

One of the interesting facts is that the ratio of the total transition to the total transversion rate per codon will be estimated to be much larger if multiple nucleotide changes are neglected; (and the ratio of the total transition to the total transversion rate for ) are estimated for the mtREV to be 2.15 (3.32) in the JTT-ML91+−10 but 2.01 (2.52) in the JTT-ML91+−11, 4.27 (4.13) in the WAG-ML91+−10 but 3.43 (2.73) in the WAG-ML91+−11, 4.57 (4.74) in the LG-ML91+−10 but 3.82 (3.31) in the LG-ML91+−11, and 1.81 (2.58) in the KHG-ML200-10 but 1.64 (1.96) in the KHG-ML200-11. The same tendency is observed for JTT, WAG, cpREV, and mtREV irrespective of the matrices, and for the EI, the Miyata, and the Grantham models irrespective of the models.

In the case of mtREV, not only the transition-transversion exchangeability bias () but also the ratio of the total transition to the total transversion rate per codon is larger in the JTT/WAG/LG-ML91+−11 than in the JTT/WAG/LG-ML91+−0, and in the KHG-ML200-11 than in the KHG-ML200-0. Also, the JTT/WAG/LG-ML91+−11 and the KHG-ML200-11 models estimate and the ratio of the total transition to the total transversion rate to be larger for mtREV than for JTT, WAG, and cpREV. These results are consistent with a well-known fact that transition to transversion bias is larger in mitochondrial DNA than in nuclear DNA.

## Discussion

Halpern and Bruno [40] considered a codon-substitution model in which site-specific selection is taken into account in terms of residue frequencies. If site-specific codon frequencies are explicitly taken into account in the present model, the substitution rate will be regarded as the average of the site-specific rate over sites . According to Eq. 7, the site-specific rate is defined as the product of site-independent mutation rate and site-dependent fixation probability, .(29)

Here the site-dependency of the fixation probability is taken into account only in terms of codon frequencies. Then, the average of the site-specific rate over sites is calculated as follows. (30)(31)

where is the average of over sites. Thus, the defined here includes the effects of site-specific selection in terms of codon frequencies.

In the model of Halpern and Bruno [40], the term of was not distinguished from and merged with the mutation rate ; that is, for was assumed. Yang and Nielsen [22] considered mutation-selection models of codon substitutions and estimated selective strengths on codon usage. In their models, selection pressures that deviate codon frequencies from the equilibrium codon frequencies at the mutational level were explicitly taken into account, and selective constraints on amino acids were assumed to be constant over amino acid pairs; that is, for was assumed. However, the site-specific selection was not considered; that is, . In other words, unlike the present model, selection was taken into account principally in terms of codon or residue frequencies in both models. Also, multiple nucleotide changes were not taken into account. Halpern and Bruno [40] developed their model for distance calculation. As pointed out by Yang and Nielsen [22], taking account of site-specific codon frequencies is not practical for real data analysis due to the use of too many parameters. Instead, the use of is more practical. The present results show that the ML values of the JTT/WAG/cpREV/mtREV amino acid substitution matrices are too small in the No-Constraints models in which is assumed, and they can be improved by taking account of the term of the selective constraints . Also, it is indicated that selective constraints on amino acids strongly depend on the type of amino acid.

In some previous models [7], [17], [18], amino acid substitutions were assumed to proceed in a stepwise manner by successive single nucleotide changes in a codon. The empirical amino acid substitution matrices of JTT, WAG, LG, cpREV, and mtREV, and the codon substitution matrix KHG all include many substitutions between amino acid or codon pairs requiring multiple nucleotide changes. Significance of multiple nucleotide substitutions was pointed out [7], [14], [20], [27], [29]. There are two possible mechanisms to yield substitutions between such multi-step amino acid pairs even for a short time interval. One is variations in substitution rates or time intervals. Another is multiple nucleotide changes in a codon. Here, the assumption of multiple nucleotide changes has been directly introduced into a codon-based substitution model together with the use of a distribution for variations in substitution rates and time intervals, and the effectiveness of the assumption has been examined.

In the models using any physico-chemical evaluation of selective constraints, the significance of multiple nucleotide changes has been indicated; see Tables 2 and 3. The ML-87 models fitted to JTT and WAG, in which the selective constraints for all single-step amino acid pairs are optimized by maximizing the likelihood with the assumptions of no multiple nucleotide change for codon substitutions and of variations in substitution rates, reveal that large discrepancies between the observed and the estimated log-odds values remain for multi-step amino acid pairs; see Fig. 1. When multiple nucleotide changes are taken into account in the model ML-91, these discrepancies disappear and the AIC values significantly decrease, indicating the significance of multiple nucleotide changes in codon substitutions; see Fig. 1, Fig. S1, and Table 3.

Evidence for multiple nucleotide changes was found by Averof et al. [27], and the frequency of multiple nucleotide changes was evaluated [20]. On the other hand, a possibility for successive single compensatory substitutions was pointed out by Bazykin et al. [29]. As pointed out by Kosiol et al. [14], the high exchangeabilities of the double nucleotide changes, Rcgt ↔ Ragg and Rcgt ↔ Raga, in KHG may result from successive single compensatory substitutions. On the other hand, a selection on synonymous substitutions is necessary for compensatory substitutions to cause the higher exchangeability of Rcga ↔ Ragg than estimated, because the most probable paths of single nucleotide changes between Rcga and Ragg are Rcga → Raga → Ragg and Rcga → Rcgg → Ragg, neither of which is accompanied by any amino acid change; see Fig. 2. Whatever causes multiple nucleotide changes, the present scheme for codon substitutions could be applied to phylogenetic analyses of protein-coding sequences, because the underlying time scale in the present substitution model is much longer than that of positive selection for successive single compensatory substitutions.

The models JTT/WAG/LG-ML91+−0 and KHG-ML200-0, in which parameters are taken to be equal to the ML estimates for JTT/WAG/LG in the ML-91+ model and the ML estimates for KHG in the ML-200 model, are codon-based models corresponding to the JTT/WAG/LG/KHG-F model, respectively. The model ML-91+ can almost perfectly reproduce JTT, WAG, and LG. The model ML-200 for the KHG codon substitution matrix can reproduce well the codon substitution probabilities for the codon pairs for which any nucleotide change is accompanied by an amino acid change, although the exchangeabilities of the other codon pairs are over-estimated for KHG. This means that the JTT/WAG/LG-ML91+−0 and the KHG-ML200-0 models can be used as a simple substitution model without any loss of information instead of the empirical substitution matrices of the JTT/WAG/LG/KHG in maximum likelihood and Bayesian inferences of phylogenetic trees of amino acid and codon sequences, respectively. Although the empirical substitution matrices represent the average tendencies of substitutions over proteins and species and may lack gene-level resolution [15], [16], the present mechanistic codon model has adjustable parameters for nucleotide mutation and for the strength of selective constraints, which can be tailored to specific genes. It is possible to optimize the selective constraints for each gene. However, such a method [12], [15], [16] is far more computer-intensive than the present method. The present methods, JTT/WAG/LG-ML91+− using and the KHG-ML200- with the , provide alternative models for amino acid/codon substitutions with a small number of ML parameters in the probabilistic inference of phylogenetic trees. The number of ML parameters specific to the present model is at most 6 exchangeabilities and 3 equilibrium frequencies for nucleotide mutations, and 2 parameters for selective constraints. Thus, the present model requires the same order of CPU time as the nucleotide substitution model (GTR) does. 
In other codon models [21], [23], exchangeabilities between amino acids are taken to be equal to their values in empirical amino acid substitution matrices. However, in the present codon model, amino acid and codon exchangeabilities vary according to nucleotide mutation rates and the strength of selective constraints.

The parameters , , and are differently estimated by the KHG-ML200- and the JTT/WAG/LG-ML91+− using different ; see Tables 6, 7, and 8. The yields a smaller rate of multiple nucleotide changes, a smaller , a smaller ratio of transition to transversion exchangeability, and a smaller ratio of nonsynonymous to synonymous rate per codon than the does. Whichever estimation is better, the present ML estimators for transition-transversion bias strongly indicate that the transition-transversion bias is not so large as previously estimated. An excess of transitional over transversional substitutions was shown in the DNA sequences of metazoa, and has been assumed to be universal. However, Keller et al. [41] found a counter example to the transition-transversion bias from grasshopper pseudogenes. The present ML estimate of the ratio of transition to transversion exchangeability for the KHG codon substitution matrix is rather less than 1.0, i.e., in the ML-200 model, which corresponds to the overall rate bias of transitions over transversions, . Even for the amino acid substitution matrices JTT, WAG, and LG, the ML-91 model estimates to be less than , making the overall rate bias of transitions over transversions less than ; see Table 3. It should be noted that the ratio of transition to transversion exchangeability tends to be overestimated if no multiple nucleotide change is allowed; see Tables S2 and S3. Thus, the present results indicate that transition-transversion bias is not a solid assumption. On the other hand, the present results indicate that transition-transversion bias is stronger in mitochondrial DNA than in nuclear DNA in accordance with previous understanding; see Tables 6 and 7.

The ML estimates and significantly correlate with each other and also with the mean energy increments due to an amino acid replacement. However, the JTT/WAG/LG-ML91+− and KHG-ML200- models fit substitution data significantly better than the EI- model; see Tables 2 and 5. This fact indicates that the differences between the physico-chemical estimates and the ML estimates for selective pressure at the amino acid level reflect the actual tendency of selective constraints for respective types of amino acid pairs in protein evolution. Eq. 31 indicates that the is modulated by site-specific codon frequencies and differentiated from the site-independent constraints, , which may be more similar to the physico-chemical estimates than the . The selective constraints estimated here may be used as a base line to detect evidence of positive selection. Models [20], [22] in which the dependences of selective constraints on amino acid pairs are not taken into account may be improved by introducing them. On the other hand, it still remains to be examined whether or not the JTT/WAG/LG-ML91+− and the KHG-ML200- perform comparably with cpREV for the maximum likelihood inferences of phylogenetic trees of chloroplast proteins and with mtREV for those of mitochondrial proteins. Also, it should be examined which performs better.

A preliminary calculation was carried out to examine the performance of the present substitution models in the ML inference of a phylogenetic tree. Log-likelihoods of the present models and the codon models corresponding to the mtREV-F, the JTT-F, the WAG-F, and the LG-F are calculated and listed in Table 9 for a phylogenetic tree [6] of the concatenated sequences of 12 protein-coding sequences encoded on the same strand of mitochondrial DNA from 20 vertebrate species with 2 races from human. The phylogenetic tree and the proteins used are those which Adachi and Hasegawa [6] used to estimate mtREV; the Japanese mtDNA was not used because it could not be found in the GenBank database. The coding sequences of each protein were aligned with codon score matrices with ClustalW2 [42], and then concatenated. Their likelihoods on the phylogenetic tree were calculated with PhyML [43]. Both programs were modified for the analysis of coding sequences. Log-odds calculated by the KHG-ML200-11 fitted to mtREV were used as the codon score matrices. Positions with gaps are included in the calculation of the likelihoods. The codon substitution matrices corresponding to mtREV, JTT, WAG, LG, and the KHG-derived amino acid substitution matrix (KHGaa) are calculated in such a way that codon exchangeabilities for nonsynonymous codon pairs are taken to be equal to multiplied by the exchangeability of the corresponding amino acid pair and those for synonymous codon pairs are assumed to be all equal to the mean amino acid exchangeability. In all models, the parameter in Eq. 11 was optimized even for the No-Constraints models, and codon frequencies were taken to be equal to those in coding sequences. The substitution matrices JTT, WAG, LG, and KHG were estimated from nuclear DNA, which uses a different genetic code from vertebrate mtDNA. On the other hand, mtREV was estimated by a maximum likelihood method from almost the same set of protein sequences encoded in mtDNA. Thus, it is expected that the log-likelihood values of the mtDNA phylogenetic tree for the models KHGaa-1-F, LG-1-F, WAG-1-F, and JTT-1-F are worse than that for the mtREV-1-F. Importantly, the codon models with the selective constraints estimated from nuclear DNA or by the physico-chemical method yield a much smaller value of AIC than the mtREV-1-F. One of the effective parameters is that directly controls the ratio of nonsynonymous to synonymous substitution rate. Explicitly taking account of rate variations over sites also improves the likelihood. The discrete approximation [44] of the gamma distribution with 4 categories was used to represent rate variations over sites in the models named with the suffix "dG4"; the shape parameter is a ML parameter. An interesting and reasonable fact is that averaging substitution matrices over rate becomes unnecessary, i.e., , in the case that rate variations over sites are explicitly taken into account; in Yang's model [26], [44], the likelihood of a phylogenetic tree of each site is averaged over rate. Also, all the present codon-based models estimate , which indicates the significance of multiple nucleotide changes. The present results strongly indicate that the tendencies of nucleotide mutations and codon usage are characteristic of a genetic system specific to each species and organelle, but the amino acid dependences of selective constraints are more specific to each type of amino acid than to each species, organelle, and protein family. Full evaluation will be provided in a succeeding paper.
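
To make the "dG4" rate categories concrete, the following is a minimal sketch of the discrete approximation [44] of a mean-one gamma distribution with equal-probability categories, each represented by its mean rate. It is written with the standard library only and is not the code used in the modified PhyML; the function names and the example shape parameter of 0.5 are illustrative assumptions.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def gamma_quantile(p, alpha):
    """Quantile of a gamma distribution with shape = rate = alpha (mean 1)."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):  # plain bisection on the cumulative distribution
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories."""
    bounds = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The integral of r * g(r) up to b equals P(alpha + 1, alpha * b) when mean = 1.
    cum = [reg_lower_gamma(alpha + 1.0, alpha * b) for b in bounds] + [1.0]
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]


rates = discrete_gamma_rates(0.5, k=4)  # hypothetical shape parameter
print(rates)
```

The site likelihood is then the average, over the categories, of the likelihoods computed with branch lengths scaled by each category rate.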

One may question whether the whole evolutionary process of protein-coding sequences can be approximated by a reversible Markov process or not. Kinjo and Nishikawa [45] reported that the log-odds matrices constructed for 18 different levels of sequence identities from structure-based protein alignments have a characteristic dependence on time in the principal components of their eigenspectra. Although they did not mention it explicitly, this type of temporal process peculiar to the log-odds matrix in protein evolution is fully encoded in the transition matrices of JTT, WAG, LG, and KHG. In Fig. S11, it is shown that this characteristic dependence of log-odds on time can be reproduced by the transition matrix based on the present reversible Markov model fitted to JTT; see Text S1 for details. This fact supports the appropriateness of the present Markov model for codon substitutions. The present codon-based model can be used to generate log-odds for codon substitutions as well as amino acid substitutions. Such a log-odds matrix of codon substitutions would allow us to align nucleotide sequences at the codon level rather than the amino acid level, increasing the quality of sequence alignments.
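
How a log-odds matrix follows from a reversible Markov model at a given evolutionary time can be sketched as below. The example is a toy three-state model with hypothetical exchangeabilities and frequencies, not the fitted codon model; the natural logarithm, the function names, and the simple scaling-and-squaring matrix exponential are illustrative assumptions.

```python
import math


def rate_matrix(exch, freqs):
    """Reversible rate matrix R_ij = e_ij * f_j, scaled to one substitution per unit time."""
    n = len(freqs)
    R = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        R[i][i] = -sum(R[i])
    scale = -sum(freqs[i] * R[i][i] for i in range(n))
    return [[r / scale for r in row] for row in R]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(R, t, squarings=10, terms=12):
    """exp(R t) by a Taylor series on R t / 2**squarings, then repeated squaring."""
    n = len(R)
    h = t / 2 ** squarings
    A = [[R[i][j] * h for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, A)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = matmul(P, P)
    return P


def log_odds(R, freqs, t):
    """Log-odds log(P_ij(t) / f_j); symmetric when the process is reversible."""
    P = expm(R, t)
    n = len(freqs)
    return [[math.log(P[i][j] / freqs[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy model: symmetric exchangeabilities and equilibrium frequencies.
freqs = [0.5, 0.3, 0.2]
exch = [[0.0, 1.0, 0.2], [1.0, 0.0, 0.5], [0.2, 0.5, 0.0]]
R = rate_matrix(exch, freqs)
S = log_odds(R, freqs, 0.01)  # t = 0.01 corresponds to 1 PAM in this scaling
print(S)
```

Evaluating `log_odds` over a range of `t` gives the time dependence of the matrix, whose eigenvalues and eigenvectors can then be followed as in the eigenvalue analysis of [45].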

As a result, the present model would enable us to obtain more biologically meaningful information at both nucleotide and amino acid levels from codon sequences and even from protein sequences, because this is a codon-based model.

## Supporting Information

### Text S1.

**Supporting information consisting of the following sections.** 1. A method for the physico-chemical evaluation of selective constraints on amino acid replacement. 2. Models with no amino acid dependences of selective constraints. 3. A physico-chemical evaluation of selective constraints on amino acids. 4. Other physico-chemical evaluations of selective constraints on amino acids. 5. Evolutionary process of amino acid substitutions in terms of log-odds.

https://doi.org/10.1371/journal.pone.0017244.s001

(PDF)

### Data S1.

**A computer-readable dataset of the ML estimates of parameters in the ML-200 for KHG, and the ML-91 and the ML-91+ for LG, WAG, and JTT as well as the EI.**

https://doi.org/10.1371/journal.pone.0017244.s002

(TXT)

### Figure S1.

**The ML-87 and the ML-91 models fitted to WAG.** Each element log- of the log-odds matrices of (A) the ML-87 and (B) the ML-91 models fitted to the 1-PAM WAG matrix is plotted against the log-odds log- calculated from WAG. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s003

(PDF)

### Figure S2.

**Comparison between various estimates of selective constraint for each amino acid pair.** The ML estimates of selective constraint on substitutions of each amino acid pair are compared between the models fitted to various empirical substitution matrices. The estimates for multi-step amino acid pairs that belong to the least exchangeable class at least in one of the models are not shown. Plus, circle, and cross marks show the values for one-, two-, and three-step amino acid pairs, respectively.

https://doi.org/10.1371/journal.pone.0017244.s004

(PDF)

### Figure S3.

**Selective constraint for each amino acid pair estimated from WAG and from LG.** The ML estimate, in (A) and in (B), of selective constraint on substitutions of each amino acid pair in the ML-91+ models fitted to the 1-PAM matrices of WAG and LG is plotted against the mean energy increment due to an amino acid substitution, () defined by Eqs. S1-4, S1-5, and S1-6 in Text S1. The estimates for the least exchangeable class of multi-step amino acid pairs are not shown. Plus, circle, and cross marks show the values for one-, two-, and three-step amino acid pairs, respectively.

https://doi.org/10.1371/journal.pone.0017244.s005

(PDF)

### Figure S4.

**Comparison of the ML estimates of selective constraint for each amino acid pair between the ML-87 and the ML-91 models.** The ML estimate of selective constraint for each single step amino acid pair in the ML-87 model fitted to (A) the 1-PAM JTT matrix or (B) the 1-PAM WAG matrix is plotted against that in the ML-91 model.

https://doi.org/10.1371/journal.pone.0017244.s006

(PDF)

### Figure S5.

**Models fitted to each of JTT, WAG, and LG.** Each element log- of the log-odds matrix of the model fitted to each empirical substitution matrix is plotted against the log-odds log- calculated from the corresponding empirical substitution matrix. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s007

(PDF)

### Figure S6.

**Models fitted to each of cpREV and mtREV.** Each element log- of the log-odds matrix of the model fitted to each empirical substitution matrix is plotted against the log-odds log- calculated from the corresponding empirical substitution matrix. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s008

(PDF)

### Figure S7.

**Models fitted to the KHG-derived amino acid substitution matrix.** Each element log- of the log-odds matrix of the model fitted to the 1-PAM KHG-derived amino acid substitution matrix (KHGaa) is plotted against the log-odds log- calculated from KHGaa. Plus, circle, and cross marks show the log-odds values for one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s009

(PDF)

### Figure S8.

**The JTT-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix.** Each element log- of the log-odds matrix corresponding to (A) single, (B) double, and (C) triple nucleotide changes in the JTT-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix is plotted against the log-odds log- calculated from KHG. Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s010

(PDF)

### Figure S9.

**The WAG-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix.** Each element log- of the log-odds matrix corresponding to (A) single, (B) double, and (C) triple nucleotide changes in the WAG-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix is plotted against the log-odds log- calculated from KHG. Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s011

(PDF)

### Figure S10.

**The LG-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix.** Each element log- of the log-odds matrix corresponding to (A) single, (B) double, and (C) triple nucleotide changes in the LG-ML91+−12 model fitted to the 1-PAM KHG codon substitution matrix is plotted against the log-odds log- calculated from KHG. Upper triangle, plus, circle, and cross marks show the log-odds values for synonymous pairs and one-, two-, and three-step amino acid pairs, respectively. The dotted line in each figure shows the line of equal values between the ordinate and the abscissa.

https://doi.org/10.1371/journal.pone.0017244.s012

(PDF)

### Figure S11.

**Temporal changes of the eigenvalues and the eigenvectors of the log-odds matrix log- calculated by the ML-91+ model fitted to JTT as a function of sequence identity.** In (A), the solid, the broken, and the dotted lines show the temporal changes of the first (), the second (), and the third () principal eigenvalues, respectively. The inner products of the eigenvectors with the eigenvectors of the JTT 20-PAM log-odds matrix, , are shown in (B) for the first principal eigenvector (), in (C) for the second principal eigenvector (), and in (D) for the third principal eigenvector (), by solid lines for , by broken lines for , and by dotted lines for .

https://doi.org/10.1371/journal.pone.0017244.s013

(PDF)

### Table S1.

**ML estimates of the present models without selective constraints on amino acids for the 1-PAM substitution matrices of JTT, WAG, cpREV, and mtREV.**

https://doi.org/10.1371/journal.pone.0017244.s014

(PDF)

### Table S2.

**ML estimates of the present models with the selective constraints based on mean energy increments due to an amino acid substitution (EI) for the 1-PAM substitution matrices of JTT, WAG, cpREV, and mtREV.**

https://doi.org/10.1371/journal.pone.0017244.s015

(PDF)

### Table S3.

**ML estimates of the present models with the selective constraints based on Grantham's and Miyata's amino acid distances for the 1-PAM substitution matrices of JTT and WAG.**

https://doi.org/10.1371/journal.pone.0017244.s016

(PDF)

## Acknowledgments

The author would like to thank Prof. Masami Hasegawa and Prof. Hiroyuki Toh for their valuable advice. The author also thanks the reviewers for their constructive suggestions on the manuscript.

## Author Contributions

Conceived and designed the experiments: SM. Performed the experiments: SM. Analyzed the data: SM. Contributed reagents/materials/analysis tools: SM. Wrote the paper: SM.

## References

- 1. Kimura M (1980) A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111–120.
- 2. Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22: 160–174.
- 3. Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10: 512–526.
- 4. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of protein sequence and structure. Washington D.C.: National Biomedical Research Foundation. pp. 345–352. volume 5. Suppl. 3 edition.
- 5. Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275–282.
- 6. Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42: 459–468.
- 7. Yang Z, Nielsen R, Hasegawa M (1998) Models of amino acid substitution and application to mitochondrial protein evolution. Mol Biol Evol 15: 1600–1611.
- 8. Adachi J, Waddell PJ, Martin W, Hasegawa M (2000) Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol 50: 348–358.
- 9. Dimmic MW, Mindell DP, Goldstein RA (2000) Modelling evolution at the protein level using an adjustable amino acid fitness model. Pacific Symposium on Biocomputing 5: 18–29.
- 10. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18: 691–699.
- 11. Le SQ, Gascuel O (2008) An improved general amino acid replacement matrix. Mol Biol Evol 25: 1307–1320.
- 12. Huelsenbeck JP, Joyce P, Lakner C, Ronquist F (2008) Bayesian analysis of amino acid substitution models. Phil Trans R Soc B 363: 3941–3953.
- 13. Schneider A, Cannarozzi GM, Gonnet GH (2005) Empirical codon substitution matrix. BMC Bioinformatics 6: 134.
- 14. Kosiol C, Holmes I, Goldman N (2007) An empirical codon model for protein sequence evolution. Mol Biol Evol 24: 1464–1479.
- 15. Delport W, Scheffler K, Gravenor MB, Muse SV, Kosakovsky Pond S (2010) Benchmarking multi-rate codon models. PLoS ONE 5: e11587.
- 16. Delport W, Scheffler K, Botha G, Gravenor MB, Muse SV, et al. (2010) CodonTest: Modeling amino acid substitution preferences in coding sequences. PLoS Comput Biol 6: e1000885.
- 17. Miyazawa S, Jernigan RL (1993) A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng 6: 267–278.
- 18. Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA. Mol Biol Evol 11: 725–736.
- 19. Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11: 715–724.
- 20. Whelan S, Goldman N (2004) Estimating the frequency of events that cause multiple-nucleotide changes. Genetics 167: 2027–2043.
- 21. Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model. Mol Biol Evol 24: 388–397.
- 22. Yang Z, Nielsen R (2008) Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol 25: 568–579.
- 23. Seo TK, Kishino H (2008) Synonymous substitutions substantially improve evolutionary inference from highly diverged proteins. Syst Biol 57: 367–377.
- 24. Seo TK, Kishino H (2009) Statistical comparison of nucleotide, amino acid, and codon substitution models for evolutionary analysis of protein-coding sequences. Syst Biol 58: 199–210.
- 25. Jin L, Nei M (1990) Limitations of the evolutionary parsimony method of phylogeny analysis. Mol Biol Evol 7: 82–102.
- 26. Yang Z (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10: 1396–1401.
- 27. Averof M, Rokas A, Wolfe KH, Sharp PM (2000) Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science 287: 1283–1286.
- 28. Smith NGC, Webster MT, Ellegren H (2003) A low rate of simultaneous double-nucleotide mutations in primates. Mol Biol Evol 20: 47–53.
- 29. Bazykin G, Kondrashov F, Ogurtsov A, Sunyaev S, Kondrashov A (2004) Positive selection at sites of multiple amino acid replacements since rat-mouse divergence. Nature 429: 558–562.
- 30. Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26: 255–271.
- 31. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185: 862–864.
- 32. Miyata T, Miyazawa S, Yasunaga T (1979) Two types of amino acid substitutions in protein evolution. J Mol Evol 12: 219–236.
- 33. Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL (2007) Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol 24: 1769–1782.
- 34. Conant GC, Wagner GP, Stadler PF (2007) Modeling amino acid substitution patterns in orthologous and paralogous genes. Mol Phylogenet Evol 42: 298–307.
- 35. Takahata N (1987) On the overdispersed molecular clock. Genetics 116: 169–179.
- 36. Rodrigue N, Lartillot N, Philippe H (2008) Bayesian comparisons of codon substitution models. Genetics 180: 1579–1591.
- 37. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Contr AC-19: 716–723.
- 38. Whelan S, de Bakker P, Quevillon E, Rodriguez N, Goldman N (2006) Pandit: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res 34: D327–D331.
- 39. Miyata T, Yasunaga T (1980) Molecular evolution of mRNA: a method for estimating evolutionary rates of synonymous and amino acid substitutions from homologous nucleotide sequences and its applications. J Mol Evol 16: 23–36.
- 40. Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15: 910–917.
- 41. Keller I, Bensasson D, Nichols RA (2007) Transition-transversion bias is not universal: A counter example from grasshopper pseudogenes. PLoS Genet 3: 0185–0191.
- 42. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947–2948.
- 43. Guindon S, Gascuel O (2003) Simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.
- 44. Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol 39: 306–314.
- 45. Kinjo AR, Nishikawa K (2004) Eigenvalue analysis of amino acid substitution matrices reveals a sharp transition of the mode of sequence conservation in proteins. Bioinformatics 20: 2504–2508.