Selection Pressure in Alternative Reading Frames

Overlapping genes are two protein-coding sequences sharing a significant part of the same DNA locus in different reading frames. Although in recent times an increasing number of examples have been found in bacteria the underlying mechanisms of their evolution are unknown. In this work we explore how selective pressure in a protein-coding sequence influences its overlapping genes in alternative reading frames. We model evolution using a time-continuous Markov process and derive the corresponding model for the remaining frames to quantify selection pressure and genetic noise. Our findings lead to the presumption that, once information is embedded in the reverse reading frame −2 (relative to the mother gene in +1) purifying selection in the protein-coding reading frame automatically protects the sequences in both frames. We also found that this coincides with the fact that the genetic noise measured using the conditional entropy is minimal in frame −2 under selection in the coding frame.


Introduction
Overlapping genes are protein coding genes sharing the same DNA locus in different reading frames. As DNA consists of two strands and each amino acid is encoded by non-overlapping triplets (codons), up to six reading frames are possible at a given locus. Overlapping genes are a well known and accepted phenomenon in viruses, however this effect was explained from space limitations of the capsid volume [1]. Until lately most authors denied the existence of overlapping genes in bacterial genomes, consequently bacterial genome annotation programs excluded overlapping candidates in alternative reading frames deliberately [2][3][4][5]. Although an experimental verification of two protein-coding genes in the same DNA locus is extremely challenging, over the last years an increasing number of non trivially overlapping genes in prokayotes have been found [6][7][8][9][10].
This paper is concerned with the question how selection pressure in the protein-coding frame influences alternative reading frames. Is it possible to protect by selection two protein-coding sequences simultaneously? We explore this question using a stochastic model for the evolution of the protein-coding reading frame and predict the consequent behaviour in the alternative reading frames. Sequence evolution can be described on nucleotide level [11][12][13][14], amino acid level, e.g. Dayhoff and Schwartz [15], or on codon level. Here we chose the latter approach using a time-continuous Markov process as suggested by Goldman and Yang [16] and Muse and Gaut [17]. We apply the model of Yang and Nielsen [18] which is based on [16]. An extended model was already used by Sabath et al. [19] to study the evolution of a random proteincoding sequence. In contrast to our approach, Sabath investigated the selection intensities of overlapping genes assuming that each gene of the overlapping pair faces selection independently.
Several studies analyzed selection intensities in virus genomes within overlapping gene regions investigating how nonsynonymous and synonymous mutations influence two reading frames simultaneously showing that a high rate of nonsynonymous mutations in one reading frame falls onto synonymous substitutions in an alternative frame at the same time, e.g. [20][21][22].
Our investigation reveals that selection pressure in the proteincoding reading frame +1 is correlated to the reverse reading frame 22, where in fact many examples of overlapping genes found so far are located e.g., [9,10]. Precisely there is a strong coupling of the nonsynonymous to synonymous substitutions rate ratios in these frames. In another approach following Yockey [23], we quantify the genetic noise using the conditional entropy and the mutual information as a measure of sequence similarity. The results obtained coincides with the former observations. The outline of the paper is as follows: In Section Methods we introduce the evolutionary framework and the calculation of selection pressure. The biological and information theoretic measures are presented in Section Results, together with an application of the model to a bacterial genome and evidence on the robustness of our approach. Finally we discuss the results in the last section.

Framework of Evolutionary Model
This section introduces the evolutionary framework and the notations used. We denote a discrete random variable with X and their corresponding probability mass function with p X (x), where x is the concrete realization of X. Throughout the paper the nucleotide alphabet is denoted with N = {A, C, G, T} and the codon alphabet is denoted with C = {A, C, G, T} 3 .
In the following we consider the well known Goldman and Yang model [16] in a simplified version as it was introduced by [18], where the following definitions can be found. (For more details on the derivation of the model see [24].) The model assumes a stationary codon distribution and independence of the evolving codon sites. The evolution of protein-coding DNA sequences is modelled by a time-continous Markov process described by the substitution rate matrix Q = {q xy }, where q xy is the rate from codon x to codon y with x?y q xy (p y , k,v)0 if x and y differ at more than one position where k is the transition/transversion rate, v is the nonsynonymous/synonymous rate ratio and p y is the equilibrium frequency of codon y. Note that p y [ C 61~f C \(TAG,TGA,TAA)g as transitions to stop codons are not allowed inside functional proteins. The row sums of the rate matrix Q = {q xy } have to be zero, which determines the main diagonal of the matrix. Further the rate matrix is multiplied by a scaling factor to normalize the expected number of nucleotide substitutions per codon to one. With every time-continuous Markov process, a discrete time Markov chain can be associated. This leads to a discrete evolution matrix P(t)~P Y DX (t) with conditional probabilities that describes a transition of an input X [ C 61 to an output Y [ C 61 for a fixed t. The evolutionary transition probability matrix is determined by where p xy is the probability that input codon x becomes y after time t. Note that pP(t)~p and pQ~0 holds, where row vector p is the stationary codon distribution. We call (X ,Y ,P Y DX ) an evolutionary channel referring to the communication theoretic term [25]. Note that the rate matrix Q is also a channel matrix. Further the parameters of the rate matrix t, v and k are arbitrary but fixed.
Given a rate matrix for the protein-coding reading frame, we are interested in computing the resulting rate and evolutionary channel matrices in the other reading frames. We define the protein-coding reading frame as +1 and denote the shifted and reverse complement reading frames as non-coding reading frames f = {21, 62, 63}. If we refer to a special reading frame, we use the index f. The setup we consider is as follows: In the proteincoding reading frame we assume that codons x [ C 61 with codon usage p +1 from a bacterial organism are transmitted independently over the evolutionary channel P(t, k, v, p) to the output y. This is called a discrete memoryless channel. For convenience we write In frame +1 we observe the scheme presented in Figure 1. Given P z1 Y DX and p +1 we want to determine the evolution matrix per reading frame P f Y DX , f [ f{1,+2,+3g. We solve this task directly via the rate matrix per reading frame Q f given the rate matrix Q +1 and p +1 such that we are independent of the evolution time. For the alternative reading frames we combine two independent time-continuous Markov chains to the corresponding di-codon matrix in frame +1 by where fl is the Kronecker product and I Q is the identity matrix with the same dimension as the rate matrix [26]. The rates of the di-codon transitions are now combined to compute the rate matrices in the other frames. Without loss a generality, we consider frame +2 (black parts in Figure 1).
The rate matrices of the other frames can be determined accordingly. Note that Q f is a 64664 matrix for f = {62, 63} and a 61661 matrix for f = {61}. Given the rate matrix Q f of a time-continuous Markov chain the corresponding stationary distribution p f in each reading frame as well as the transition matrix P f Y DX for time t can be easily determined.

Selection pressure during evolution
An important parameter describing the selection pressure on the protein level is the ratio of nonsynonymous d N to synonymous d S substitution rates, denoted with v~d N d S , see e.g., [16]. Three basic scenarios are distinguished e.g., [27]: Purifying selection when v, 1, adaptive selection for v.1 and neutral mutation if v = 1. To determine the nonsynonymous/synonymous rate ratio v in each reading frame, we apply the procedure presented in [18] and [24], that is based on the transition probability matrix P Y DX , but can be easily adapted to the rate matrix Q. Assume we determined, the rate matrix Q f in each frame f [ f+1,+2,+3g as presented in Framework of Evolutionary Model as well as the stationary distributions p f . The proportion of synonymous substitutions is the sum over all codon pairs x and y (x?y) that code for the same amino acid where aa x is the amino acid encoded by codon x. The proportion of nonsynonymous substitutions is calculated accordingly by r f N~X x=y,aax=aay p f q f xy : The transition/tranversion rate k is the same in all reading frames. We assume that it is known from reading frame +1. To : where the time as well as scaling factors of the rate matrices cancel out.

Results
Throughout the paper, we use the model genome Escherichia coli O157:H7 EDL933 (Accession number NC_002655, abbreviation EHEC), with a GC content of 50.4% and a length of 5528445 base pairs. In File S1 Model verification we present some simulation results to validate the calculations of the equilibrium frequencies per reading frame p f .
To investigate the influence of selection pressure during evolution we chose two different input scenarios: The transition/ transversion rates are k = 1.0 or k = 5.0 at time t = 1.0 and the nonsynonymous/synonymous rate ratio v takes values between [0, 3]. The calculation of the nonsynonymous/synonymous rate ratio v f for f [ f+1,+2,+3g reveals the following results.
Purifying selection refers to a selection against nonsynonymous substitutions on the DNA level, which protects the sequence. In Figure 2, we see that a protection of the coding frame +1 with v, 1, also protects the sequence in frame 22, the other alternative frames face adaptive selection. The opposite is observed for v.1, where new information can be induced in frames +1 and 22, whereas the other frames are slightly below the neutral mutation line. The behaviour of v f is consistent for both scenarios. Note, there are numerous methods to determine the synonymous to nonsynonymous rate ratio. The File S1 Selection pressure shows a comparison of our approach with an alternative method.

Quantifying noise during evolution
In this part, we deal with the following question: An amino acid is transmitted over the evolutionary channel, how long is this information conserved in the different reading frames?
Evolution of a sequence can be considered as a communication process over time. In his book [23] proposed to use the conditional entropy to measure the amount of genetic information that can be transmitted over a noisy channel (based on [25]). We define the amino acid alphabet A~G(C 61 ), where G is the genetic code, which results in a cardinality of DAD~20. The codon evolution matrix per frame P f Y DX can be summarized to determine the amino acid evolution matrix P f ,A Y DX , based on the stationary distribution p f . The stop codon probabilities are removed in all frames. The conditional entropy between two random variables X and Y over alphabets X ,Y is defined as, e.g., in [28]: where P Y DX (yDx) is the conditional probability. The conditional entropy between two randomly chosen amino acids X and Y in frame f conditioned on X = a with a [ A is accordingly where P f ,A Y DX is the amino acid substitution matrix per reading frame. If we know that a specific amino acid was transmitted, how much of this knowledge is lost after time t? As the comparison of 20 values over time is inconvenient, we apply uniform weighting according to the amino acids p f ,A , which results in Note that Eq. (2) is bounded by where the entropy (or uncertainty) H f t (Y )~{ P y [ A p f ,A (y) log 2 p f ,A (y) is maximal for a uniformly distributed random variable Y.
Yockey [23] additionally suggests the application of the mutual information as a measure of similarity between sequences. The mutual information between amino acid X and Y per frame f is defined as, e.g., in [28]: Note, that the channel capacity, which is the maximal mutual information for all input distributions, can be determined numerically using the Blahut-Arimoto algorithm. But as there is no direct interpretation in our framework and the results match those of the mutual information, we abandoned the presentation.
Results at the example of EHEC. We chose two different input scenarios: Set v = 0.3 to model purifying selection and v = 3.0 to model adaptive selection. The transition/transversion rate is fixed to k = 1.0 and the time t is changed during simulation. The same parameter setting was already used in [27]. We apply the conditional entropy introduced in Eq. (2) to answer the question how long the information which amino acid was transmitted is conserved in the different reading frames. The results are presented in Figure 3.
Evolution means loss of information over time or from a complementary point of view, an increase of uncertainty. To quantify this information loss, we determine the time needed to loose half of the information. As the conditional entropy in bounded by log 2 (20), we determine for each frame t1 = 2 such that The results are summarized in Table 1.  In accordance with the results of v f in Figure 2 we interpret Figure 3 and Table 1 as follows. Protected frames with v,1, store information longer than the unprotected frames with v.1. When frame +1 is protected, then the 22 frame is protected automatically, therefore those frames show a slower increasing uncertainty than the alternative frames. Now, the mutual information I f t (X ; Y ) per reading frame f [ f+1,+2,+3g is investigated applying Eq. (3). The mutual information measures the similarity between X and Y, which is directly connected to the amount of information that can be transmitted over the channel [23]. We observe for the first scenario, where v = 0.3, presented in the left panel of Figure 4 that most information can be transmitted in reading frame +1 followed by reading frame 22. In general the proportion of information, that can be transmitted over the evolutionary channel decreases over time, but this information loss is faster in the frames, where v f .1. In the right panel of Figure 4, where v = 3.0, we see that the mutual information is smallest, for the frames +1 and 22 which is also in accordance with Figure 2. This observation is confirmed in the File S1 Conditional entropy and mutual information for different values of v.
Robustness of method. The question arises, how robust our method is, if we choose another codon substitution matrix. As we are able to determine the mutual information and the conditional entropy per reading frame, given only the evolution matrix in the coding reading frame P z1 Y DX and the stationary distribution of EHEC p +1 it is also possible to substitute the channel matrix P z1 Y DX . In 2005 [29] published an empirical codon substitution matrix (P ECM ) obtained from an alignment of vertebrate DNA, which can also be applied to bacteria. Given a transition matrix we present in File S1 Robustness of results a method to estimate the transition matrices per reading frame based on the channel matrix P z1 Y DX . For our investigations, the different time points t presented in Figure 5 are obtained by P t ECM ,t [ Z. The results confirm our findings, that most information can be transmitted in +1, followed by 22. That makes sense, as the matrix is based on genes with purifying selection, otherwise they would not have survived over time.

Summary
In this paper we introduced a model to determine the codon evolution in different reading frames based on the protein-coding reading frame +1. The model is used to predict the selection pressure within different reading frames and reveals that a protection of the protein-coding reading frame also preserves the reverse reading frame 22. For the case of adaptive selection both frames are free to evolve. The remaining alternative frames show the reverse relation, i.e. they give preference to nonsynonymous substitutions while reading frame +1 is protected and are preserved when +1 is exposed to adaptive selection. These findings are further confirmed by the presented results on the conditional entropy. Namely, if v,1, the genetic noise is minimal in frames + 1 and 22, also the sequence similarity measured by the mutual information is largest. Conversely for v.1, where the genetic noise of +1 and 22 is largest and the sequence similarity accordingly smallest.

Discussion and Conclusion
At a first glance, understanding the evolution of overlapping protein-coding regions is extremely challenging, because one DNA segment codes for two proteins which are translated in different reading frames simultaneously, such that a mutation affects both proteins [30,31]. Biologist investigate evolutionary adaption of proteins for years now, assuming that adaption requires more nucleotide mutations at positions that change an amino acid than at positions that preserve a site [32]. The parameter of choice that measures the substitution rate at those sites is v~d N d S and is therefore used as an indicator of selective pressure within genes. Meanwhile a large field emerged, investigating the evolutionary constraints within overlapping and non overlapping reading frames [30,[33][34][35][36]. There exist empirical analyses describing, that a loss of a stop codon within a protein-coding gene by deletion, mutation or frame-shift, causes an elongation to the next stop codon, whereby an overlapping pair originates [37,38]. Other studies suggest, that the loss of a start codon is responsible for the development of an overlap [39][40][41]. From this point of view, a random formation can not be ruled out.
Our point of interest is slightly different, assuming we are given a protein-coding reading frame that evolves over time, we are interested in the evolutionary constraints implied within alternative reading frames. A biological interpretation of our findings is that during adaption many mutations occur that change amino acids in reading frames +1 and 22 simultaneously. Once a protein in reading frame +1 is fixed and adaptive selection is replaced by purifying selection, this process stops and the amount of synonymous substitutions increases, again in both reading frames. Note that we make no statement that both reading frames are already translated into proteins, since function of a sequence could also evolve later. As a matter of fact, over time the divergence of a sequence always increases even if it is protected, but we showed that this change happens slower in case of purifying selection for both, the +1 and 22 reading frame. No matter, how or if an overlapping gene pair evolved, our observations indicate the special role of the 22 reading frame. Interestingly, two recently experimentally verified examples of overlapping gene pairs in bacteria yaaW/htga by [10] and dmdR1/adm by [9] are in frame 22. We showed that it is possible to protect this frame by simply controlling the selection pressure within the protein-coding reading frame. This can be attributed to a property of the genetic code, as the most important codon positions are the first and second which fall onto the second respectively first position in the 22 frame. But this could also mean that a conserved sequence in 22 might be solely an artefact, providing not necessarily evidence for functionality.
Finally note that it is challenging to embed information in the overlapping reading frame 22, when the protein-coding reading frame +1 has a fixed amino acid sequence. Assume two amino acids A 1 ,A 2 [ A Ã , where A * is A plus a stop label, should be encoded in the coding reading frame. Obviously each amino acid corresponds to an individual number of codons, hence it is possible to encode a certain number of different amino acids in the alternative reading frames without changing A 1 and A 2 . The average taken over all possible pairs A 1 , A 2 are shown in Table 2; it turns out that the degree of freedom is smallest in 22. It is worth noting that in general it is possible to embed information even in protein-coding sequences, see for example [42].