Skip to main content
  • Loading metrics

Protein and DNA Sequence Determinants of Thermophilic Adaptation

  • Konstantin B Zeldovich,

    Affiliation Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America

  • Igor N Berezovsky,

    Affiliation Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America

  • Eugene I Shakhnovich

    To whom correspondence should be addressed. E-mail:

    Affiliation Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America


There have been considerable attempts in the past to relate phenotypic trait—habitat temperature of organisms—to their genotypes, most importantly compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from −10 °C to 110 °C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = −0.10). On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of proteins structure and stability and evolutionary mechanisms of thermophylic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.

Author Summary

Prokaryotes living at extreme environmental temperatures exhibit pronounced signatures in the amino acid composition of their proteins and the nucleotide compositions of their genomes, reflective of adaptation to their thermal environments. However, despite significant efforts, the definitive answer of what are the genomic and proteomic compositional determinants of optimal growth temperature (OGT) of prokaryotic organisms remained elusive. Here we performed a comprehensive analysis of amino acid and nucleotide compositional signatures of thermophylic adaptation by exhaustively evaluating all combinations of amino acids and nucleotides as possible determinants of OGT for all prokaryotic organisms with fully sequenced genomes. We discovered that total concentration of seven amino acids in proteomes—IVYWREL—serves as a universal proteomic predictor of OGT in prokaryotes. Resolving the old-standing controversy, we determined that the variation in nucleotide composition (increase of purine load, or A + G content with temperature) is largely a consequence of thermal adaptation of proteins. However, the frequency with which A and G nucleotides appear as nearest neighbors in genome sequences is strongly and independently correlated with OGT as a result of codon bias in corresponding genomes. Together these results provide a complete picture of proteomic and genomic determinants of thermophilic adaptation.


As proteins and nucleic acids must remain in their native conformations at physiologically relevant temperatures, thermal adaptation requires adjustment of interactions within these biopolymers. Given the limited alphabet of amino acid residues, an apparent way to control protein stability is to properly choose the fractions of different residue types and then to arrange them in sequences that fold into stable and unique native structures [14] at physiological conditions of a given organism. Various mechanisms of thermostability were discussed in the literature, and many authors pointed to changes in amino acid composition as one of the clearest manifestations of thermal adaptation [2,48].

Indeed, it is well-known that enhanced thermostability is reflected in specific trends in amino acid composition [911]. The most pronounced ones are in elevated fractions of charged residues [1216], and/or increased amounts of hydrophobic residues in (hyper)thermophilic organisms as compared with mesophilic ones. An early attempt of a systematic search for amino acids that are most significant for protein thermostability was made by Ponnuswamy et al. [17], who considered a set of 30 proteins and about 65,000 combinations of different amino acids to find the amino acid sets serving as best predictors of denaturation temperature.

Furthermore, a number of authors explored the possible relationship between an optimal growth temperature (OGT) of organisms and nucleotide content of their genomes [16,18,19]. An increase in purine (A + G) load of bacterial genomes of some thermophiles was noted recently [20] as a possible primary adaptation mechanism.

However, despite significant efforts, a clear and comprehensive picture of genomic and proteomic signatures of thermal adaptation remains elusive. First, it is not fully established which compositional biases in genomes and proteomes represent the most definitive signatures of thermophilic adaptation. On the genomic level, is it G + C as widely believed by many after the classical experimental study by Marmur and Doty [21] or alternative signatures such as purine loading index as suggested by others [19]? For proteomic compositions is it excess of charge residues or hydrophobic or both, and which amino acids specifically are most sensitive to thermal adaptation? On a more fundamental level the key question is which factor—amino acid or nucleotide composition—is primary in thermal adaptation and which one is derivative? This issue is most vivid in prokaryotes whose genomes consist mostly of coding DNA, leading to a well-defined relationship between nucleotide and amino acid compositions. However, this relationship may be not rigid due to degeneracy of genetic code. In principle, amino acid and nucleotide compositions can adapt, to a certain degree, independently due to possible application of codon bias.

Definite answers to these questions can be obtained only via a comprehensive study that considers in a systematic way all known prokaryotic genomes and all possible composition factors. Here, we carry out a comprehensive investigation of the relationship between the OGT (Topt) of prokaryotes and the compositions of their complete proteomes and compositions and pairwise nearest-neighbor correlations in their genomes. First, our goal is to find the sets of amino acids and nucleotides whose total content in a proteome (genome) serves as the best predictor of the OGT of an organism. We perform an exhaustive enumeration of all possible combinations of amino acid residue types, without making any a priori assumptions of the relevance of a particular combination for thermostability. Our analysis is based on 204 complete genomes of bacteria and archaea thriving under temperatures from −10 °C to 110°C, comprising psychro-, meso-, thermo-, and hyperthermophilic organisms (Table S1). It reveals a particular combination of amino acids whose cumulative concentration in proteomes is remarkably well-correlated with OGT. This result holds mainly for globular proteins composing most parts of prokaryotic proteomes. For comparison, we apply this approach also to membrane proteins having α-helical bundles [22] and β-barrel [23] structures, and show that the amino acids predictors of OGT found for proteomes of globular proteins do not work for membrane proteins. This finding clearly indicates that mechanisms of thermal stabilization of membrane proteins are different from those of globular proteins.

Next we turn to genomes and carry out the same comprehensive analysis to determine compositional genomic determinants of OGT. We find that purine load index, i.e., concentration of A + G exhibits the highest correlation with OGT, consistent with an earlier observation made on several individual genomes [20]. Having found both proteomic and genomic compositional characteristics that correlate with OGT, we then turn to the key question of whether genomic and proteomic determinants are independent of each other or one is a derivative of the other. By running a set or reshuffling controls as described below, we show that the primary factor is adaptation on the level of amino acid composition and the variation in DNA composition is largely a derivative of amino acid adaptation. While the nucleotide composition biases appear to be largely due to adaptation at the level of amino acid compositions, an additional DNA sequence adaptation that is independent of amino acid composition adaptation is still possible—at the level of higher-order correlation in nucleotide sequences. A possible mechanism for that in prokaryotes is through codon bias. Indeed, we find clear evidence for independent adaptation in DNA at the level of nearest-neighbor correlation of nucleotides (possibly due to strengthening of stacking interactions). We show that this adaptation of DNA sequences does indeed occur via codon bias. Finally we discuss how observed patterns of change in amino acid compositions in response to extreme conditions of the environment are related to physical principles that govern stability of globular proteins.


The complete genomes (Table S1) were downloaded from the NCBI Genome database, and the OGTs were collected from original publications, and from the Web sites of the American Type Culture Collection,, the German Collection of Microorganisms and Cell Cultures,, and PGTdb database [24], The distribution of OGTs of prokaryotes with completely sequenced genomes shows that 75 species out of 204 have an OGT of 37 °C (mostly human and animal pathogens), 27 species have an OGT of 30 °C, and 19 species grow best at 26 °C. Therefore, 121 genomes, or 59% of the data, correspond to just three values of OGT. To overcome this obvious bias in the data, we averaged the amino acid compositions of the organisms with OGTs of 26 °C, 30 °C, and 37 °C, respectively. After this procedure, our dataset consists of 86 proteomes, 83 real proteomes, and three averaged proteomes at 26 °C, 30 °C, and 37 °C. To validate the averaging of the overrepresented amino acid compositions, we repeated the complete analysis for the 204 genomes and did not find any major differences (see Figure S1).

We represent the different sets of amino acids by vectors {ai} of 20 binary digits, where each digit ai takes the value of 1 if the amino acid of type i is present in the set, and 0 otherwise. For the set to be nontrivial, at least one of the components of ai must be 1 and at least one other must be 0. There are 220 − 2 = 1,048,574 such vectors, enumerating all possible subsets of 20 amino acids. Let be the fraction of amino acid of type i in proteome j. Then, for each of the 86 proteomes and each vector {ai}, we calculate the total fraction F(j) of the amino acids from a particular subset, and perform the linear regression between the values of F(j) and the organism living temperatures . The higher the correlation coefficient R of this regression, the more important is a given subset of amino acids as a predictor of OGT. Since for each proteome , there are in fact only 219 − 1 linearly independent combinations of fractions of amino acids; if the total fraction of N amino acids increases with OGT and the correlation coefficient is R, the total fraction of the complementary set of 20 − N amino acids is decreasing with OGT with the correlation coefficient –R. Therefore, the predictive powers of the complementary subsets of amino acids are equal. To resolve this ambiguity, we focus on the amino acid combinations that are positively correlated with OGT.

For 86 proteomes, the best predictor consists of amino acids IVYWREL, correlation coefficient R = 0.930, and accuracy of prediction 8.9 °C (see Results).

We also performed an exhaustive enumeration of 320 − 1 combinations of amino acids in a ternary model, where the coefficients ai can take the values of −1, 0, and 1. We expected that this method could be advantageous, as it allows us to enrich the predictor combination by the amino acids whose fraction is decreasing with OGT (such amino acids would have a weight of −1). The best predictor combination in the ternary model consists of amino acids VYWP with weight +1 and CFAGTSNQDH with the weight −1, correlation coefficient R = 0.948, accuracy of prediction of 8.1 °C for 86 proteomes. While this result is generally consistent with the IVYWREL predictor, the ternary model does not offer a significant increase of accuracy of the prediction of OGT as compared with a much simpler binary model. It is also possible to perform a full linear regression between the amino acid fractions and OGT, effectively allowing for noninteger values of {ai}. This procedure involves a great risk of overfitting, and results only in a marginal improvement of accuracy of OGT prediction as compared with both binary and ternary models.

TMHMM prediction server [22] at has been used to identify the sequences of alpha-helical membrane proteins in the 83 genomes. The number of proteins predicted in a genome did not correlate with OGT, so the prediction algorithm is sufficiently robust with respect to the changes of average amino acid composition. The PROFTmb server [23] at has been used to identify the transmembrane beta-barrels; proteins with eight to 22 transmembrane beta-strands have been selected for further study.

We applied the same technique to enumerate all possible combinations of nucleotides in the coding DNA of prokaryotes and find the correlation coefficients between the total fractions of sets of nucleotides and OGT. To quantify the pairwise nearest-neighbor correlations in DNA, for each of the 16 possible pairs of nucleotide types i and j and for each of the 83 genomes, we calculated the correlation function , where N is the total number of nucleotides in the DNA sequence, ni, nj are the numbers of nucleotides of types i and j, and nij is the number of pairs where nucleotide j immediately follows nucleotide i in the coding strand. In random DNA without sequence correlations, cij = 0 for any i,j and for any nucleotide composition of the sequence.

Reshuffling of DNA and protein sequences has been performed by swapping two randomly chosen letters in the sequence and repeating this procedure 2N times, where N is the length of the sequence.

Sequences of ten protein folds (according to SCOP description [25]) with less than 50% of sequence identity were extracted from PDB database. Given a fold, representatives of its sequences were found in 83 complete proteomes by using blastp program with BLOSUM62 substitution matrix, e-value equal to 0.005. For every analyzed protein fold, representatives of its sequences were detected in organisms with OGTs covering the whole interval from +10 °C to 110 °C. Extracted proteomic sequences were used in calculations of thermostability predictor if the length of the alignment was more than 70% of the size of the fold's sequence.


Sequence Determinants of Thermal Adaptation of Soluble Proteins

We computed the 220 − 2 = 1,048,574 values of the correlation coefficient R of the dependence between OGT and the fractions F(j) for all possible sets of amino acids {ai}, and found the set yielding the highest R. For the 86 proteomes with different OGT (see Methods) studied, the best set of amino acids is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), giving the correlation coefficient between FIVYWREL of a proteome and OGT of the organism R = 0.930 (Figure 1). The quantitative relationship between the OGT Topt (in degrees Celsius) and fraction F of IVYWREL amino acids reads

Figure 1. Correlation between the Sum F of Fractions of Ile, Val, Tyr, Trp, Arg, Glu, and Leu (IVYWREL) Amino Acids in 86 Proteomes and the OGT of Organisms Topt

The linear regression (red line) corresponds to the correlation coefficient R = 0.93. The OGT Topt (in degrees Celsius) can be calculated from the total fraction F of IVYWREL in the proteome according to Topt = 937F-335. By construction, the IVYWREL set is the most precise predictor of OGT among all possible combinations of amino acids; other combinations statistically yield a larger error of prediction of OGT.

The root-mean-square deviation between the actual and predicted OGTs for 86 genomes is 8.9 °C. The same analysis performed on the complete set of 204 genomes yields KVYWREP as the best OGT predictor, with the correlation coefficient R = 0.84 and root-mean-square error of 9.2 °C (Figure S1). The ten best amino acid combinations for 86 genomes and the corresponding correlation coefficients are presented in Table S2. Table S5 presents all of the one- and two-letter amino acid combinations and their correlation coefficients R.

Statistical Tests and Controls

We performed several tests to substantiate our procedure and to prove that the IVYWREL set is not a numerical aberration in a fit of noisy data, but a robust predictor of habitat temperature of prokaryotes.

First, as a control that the data are not overfitted, we randomly reshuffled the values of between organisms, and, using the same exhaustive enumeration of all amino acid combinations, found the best composition–temperature correlation in the resulting artificially randomized proteome–temperature combinations. The distribution of the maximum correlation coefficient Rmax obtained in 1,000 such reshufflings follows a Gaussian shape centered around <Rmax> = 0.344, σ = 0.060. Therefore, the probability to find the observed correlation of R = 0.93 as a result of overfitting in a random combination of proteomes and temperatures is less than p = 10−20, which proves an extremely high statistical significance of both the IVYWREL set and its correlation coefficient R = 0.93.

Figure S2 shows another random control–histogram of the correlation coefficient R of the F(Topt) dependence for all possible combinations of seven amino acids (black curve) and for all variations of the IVYWREL set with one substitution of the amino acid (red curve). The correlation coefficient for the IVYWREL-like sets is significantly higher than that for a combination of seven random amino acids.

To demonstrate the robustness and stability of the IVYWREL predictor set, in Figure 2 we plot the distribution of correlation coefficients of the F(Topt) dependence for different variations of the IVYWREL set. Figure 2 shows that the IVYWREL set is very tolerant to addition or removal of one amino acid, with the median value of R in excess of 0.8. Addition or deletion of two letters, or a substitution of one letter, has a more significant effect on the accuracy of OGT prediction, with median R on the order of 0.75. Table 1 shows the examples of additions, deletions, and substitutions in the IVYWREL set having the strongest or weakest effect on the correlation coefficient R.

Figure 2. Distribution of the Correlation Coefficient R between OGT and Fractions of Amino Acids in a Proteome for Different Variation of the IVYWREL Set, Additions or Deletions of One or Two Amino Acids to/from the Set, or Substitution of One or Two Amino Acids from the Set by One or Two Amino Acids Not from the Set

The dashed red line at R = 0.93 corresponds to the unperturbed IVYWREL. The horizontal red lines indicate the median values of the correlation coefficient for the given type of change of the predictor set.

Table 1.

Effects of the Changes in the IVYWREL Predictor Combination on Its Predictive Power

To check the power of our method in the prediction of OGT of prokaryotes, we randomly split the set of 86 proteomes into training sets of 43 and a test set of 43 species. We used the training set to determine the parameters of the best linear relationship between the fractions of all 220 − 1 amino acid sets and OGT, and employed this regression to predict OGTs of organisms in the test set. The accuracy of this prediction can be characterized by the root-mean-square difference σΔT between predicted and actual OGT of the 43 organisms in the test set,

Figure 3 presents the histogram of the values of σΔT in 1,000 test/train splits, with average σΔT being equal to 12 °C for 43 organisms. This level of precision allows us to reliably discriminate between psychro-, meso-, thermo-, and hyperthermophilic organisms [2] knowing solely the amino acid composition of their proteomes. The histogram of probability to find each of the 20 amino acids in the predictor set (based on 43 organisms in 1,000 training sets) is presented in Figure S3. Each of the amino acids from the IVYWREL set can be found in the predictor sets with the probability p > 0.85, the most frequent amino acids being Val, Tyr, and Arg.

Figure 3. The Distribution of the Root-Mean-Square Error σΔT of the Prediction of the OGT in the 1,000 43-Species Test Sets (Black), and the Prediction Errors for 86 Organisms Using the IVYWREL Set (Red) or Sets of Charged or Hydrophobic Residues (Blue)

Individually, sets of hydrophobic or charged residues provide a much lower precision than their proper combination, IVYWREL.

Earlier, it has been suggested that the fraction of charged amino acid residues (Asp, Glu, Lys, Arg, DEKR) can be a significant predictor of living temperature [9,1215]. We found that the fraction of these residues predicts the OGT with an accuracy (root-mean-square deviation) of 21 °C for 86 proteomes or 14 °C for 204 species. Similarly, a set of hydrophobic residues (Ile, Val, Trp, Leu, IVWL) predicts the OGT with an accuracy of 16.8 °C for 86 proteomes. The IVYWREL predictor we discovered is accurate up to 8.9 °C for 86 proteomes. The result clearly demonstrates that consideration of both charged and hydrophobic residues is crucially important for predicting the living temperature and, thus, thermostability.

Membrane Proteins and Specific Folds

The calculation above made use of the complete proteomes, considering soluble and membrane proteins together. It is well known, however, that membrane proteins are markedly different from soluble ones, especially in terms of their stability and folding mechanism [26]. Interactions with the lipid bilayer presumably result in the mechanisms of thermostabilization of membrane proteins different from those of soluble ones. We used the TMHMM hidden Markov model [22] to identify α-helical membrane proteins in the proteomes of the 83 organisms, and performed the same analysis as for the complete proteomes. The IVYWREL combination is the best predictor of thermostability of membrane proteins with three or more transmembrane helices, R = 0.89, and is among the five highest R predictors for membrane proteins with ten or more helices, R = 0.85 (see legend to Figure 4). In the latter case, the best predictor is IVYWRELGKP, R = 0.86. Figure 4 shows that the fraction of IVYWREL amino acids in membrane alpha-helical proteins of mesophilic organisms is always higher than the overall IVYWREL fraction, whereas in hyperthermophilic organisms the fractions of IVYWREL in soluble proteins and in membrane α-helical bundles practically coincide. The weaker dependence of IVYWREL content on temperature suggests that the mechanism of thermostabilization of alpha-helical membrane proteins is different from that of soluble proteins (note, however, that IVYWREL compositions for soluble and α-helical membrane proteins appear to converge in hyperthermophiles). Next, we used the ProfTMB [23] server to identify transmembrane β-barrel proteins in the proteomes of 29 organisms uniformly covering the temperature scale (see Table S3). The most predictive amino acid combination for thermostability of transmembrane beta barrels is CVYP, R = 0.72, while IVYWREL is a very poor predictor, R = 0.35. The low slope of IVYWREL correlation with OGT in α-membrane proteins and the CVYP predictor in TM β barrels suggest that thermal adaptation in membrane proteins is governed by different rules than in globular ones (and probably different between different types of membrane proteins).

Figure 4. Total Fraction of IVYWREL Amino Acids in Alpha-Helical Membrane Proteins Containing Three or More (Green) or Ten or More (Red) Helices, Plotted against the OGT

Solid lines are linear regressions. The black line is the linear regression of the fraction of IVYWREL amino acids in all proteins in a proteome, the same as in Figure 1. The fraction of IVYWREL in membrane proteins of thermophilic organisms is about the same as in all of their proteins. In mesophiles, membrane proteins are enriched with hydrophobic residues. For membrane proteins with ten or more helices, the five best predictors are ILVWYGERKP, FILVWYATSERKP, FILVWYATSEHRKP, IVYWREL, MFILVWYATSEHRKP, 0.85 < R < 0.86.

Table 2 shows that thermostability predictors of most of the folds are very similar to universal combination IVYWREL. The most abundant protein folds, such as α/β barrel and β barrel, Rossman fold, and bundle, reveal a high correlation coefficient between IVYWREL content in their sequences and OGT. Importantly, protein folds that invoke complementary mechanisms of stability, involving heme- and metal-binding (globin, cytochrome C, ferredoxin) or S-S bridges (lysozyme) show a significantly lower correlation between IVYWREL content and OGT.

Table 2.

Thermostability Predictors for the Major Protein Folds

IVYWREL Is Not a Consequence of Nucleotide Composition Bias

Proteins are encoded in the nucleotide sequences of their genes, and thermal adaptation presumably leads to increased stability of both proteins and DNA. Therefore, signatures of thermal adaptation in protein sequences can be due to the specific biases in nucleotide sequences and vice versa. In other words, one has to explore whether a specific composition of nucleotide (amino acid) sequences shapes the content of amino acid (nucleotide) ones, or thermal adaptation of proteins and DNA (at the level of sequence compositions) are independent processes.

To resolve this crucial issue, we applied the following logic. If amino acid biases are a consequence of just nucleotide biases and not protein adaptation, then proteomes translated from randomly reshuffled genomes will feature similar “thermal adaptation” trends in amino acid composition as observed in real proteomes. In contrast, if amino acid compositions are selected independently, then such control calculation will result in apparently different amino acid “trends” in randomly reshuffled genomes than observed in reality. We carried out this computational experiment by randomly reshuffling nucleotides in the coding sequences of 83 genomes and translating the reshuffled genomes into protein sequences. Then, we calculated the amino acid compositions of the resulting control proteomes, and found (using the same exhaustive enumeration of all amino acid combinations) the amino acid combinations that are best correlated with temperature. In Figure 5 we present the probability to find an amino acid in 1,000 highest-R combinations for real proteomes and control proteomes created from reshuffled DNA. The striking difference between the two plots suggests that the increase of IVYWREL with environmental temperature is not a consequence of the nucleotide composition bias of coding DNA. This result proves the existence of a specific adaptation pressure acting on the amino acid composition of proteins in the process of thermal adaptation.

Figure 5. Histogram of the Probability to Find an Amino Acid among 1,000 Combinations of Amino Acids Which Are Most Correlated with OGT for Real Proteomes (Red) and for Artificial Proteomes Created from Reshuffled DNA Sequences (Blue)

The histogram for real proteomes supports the stability of IVYWREL predictor, while the difference between the two histograms suggests that amino acid biases upon thermophilic adaptation are not a consequence of the trends in nucleotide composition.

Thermal Adaptation of DNA

In Figure 6A, we plotted the fraction of the proteomic thermostability predictor, IVYWREL, against the G + C content, a major contributor to pairing interactions in DNA, and a presumable indicator of DNA thermostability. A very weak negative correlation (R = −0.14) between the fractions of IVYWREL in the proteome and that of G + C in the coding DNA suggests that protein thermostability and its IVYWREL predictor are not consequences of enhanced GC content.

Figure 6. G + C Content of Coding DNA Is Not Correlated with OGT or IVYWREL

(A) Dependence of the fraction of IVYWREL amino acids in 83 proteomes (protein thermostability predictor) on the fraction of G + C in the coding DNA in the corresponding genomes.

(B) Dependence of the G + C content in the coding DNA of the 83 complete genomes on the OGT of the organisms. The correlation coefficient is R = −0.15, indicating that G + C content of the coding DNA is not related to the OGT.

We used a complete enumeration of the 24 −2 = 14 possible sets of nucleotides to look for possible composition–OGT relationships in the coding DNA of prokaryotes, see Table S4 (on average, coding DNA constitutes 85% of prokaryotic genomes). We found that the fraction of G + C in the complete genomes of 83 (Figure 6B) or 204 (Figure S4) species does not show any significant correlation with OGT. The same holds true for complete genome sequences, including noncoding DNA (unpublished data).

The only combination of nucleotides whose fraction is statistically significantly correlated with temperature is purine composition, A + G, R = 0.60 (Figure 7A), in agreement with earlier observations made for several genomes [19]. As this correlation is very significant, there is a possibility that the trends in amino acid and nucleotide composition are tightly linked: either the proteomic predictor, IVYWREL, is a direct consequence of the A + G nucleotide bias (the possibility that we already discarded, see above), or vice versa, the A + G nucleotide bias automatically follows from the prevalence of IVYWREL in the proteomes of thermophilic organisms.

Figure 7. Effect of Codon Bias on the Increase of Purine Load with OGT

(A) The fraction of A + G in the coding DNA of the 83 complete genomes is highly correlated with the OGT, R = 0.60.

(B) When protein sequences of the 83 organisms are reverse-translated into DNA without codon bias, the fraction of A + G remains correlated with OGT, R = 0.48.

Purine Loading Bias Is Mainly due to IVYWREL

To what extent can the increase of A + G with temperature be explained by the trends in amino acid composition? Obviously, the only way to adjust nucleotide sequences without affecting the encoded proteins is through the use of codon bias. A particularly important question is whether a specific codon bias is required to reproduce the trends in DNA composition.

To answer this question, we reverse-translated the protein sequences of the 83 organisms into DNA sequences without codon bias, i.e., by using synonymous codons with equal probabilities. As shown in Figure 7B, the fraction of A + G in the resulting nucleotide sequences is very significantly correlated with the OGT, R = 0.48, which is very close to the corresponding value in actual genomes, R = 0.60 (Figure 7A). In other words, the amount of variation of A + G explained by the peculiarities of amino acid composition is almost the same with and without codon bias. (We note, however, that the slopes of dependencies in Figure 7A and 7B are somewhat different suggesting that codon bias may be partly responsible for the overall purine composition of DNA). Therefore, we conclude that the fraction of A + G in the coding DNA is largely defined by the composition of proteins. Indeed, by imposing the correct amino acid composition and choosing the available synonymous codons with equal probabilities, one arrives at the correct prediction of the trend in DNA composition, increase of A + G with temperature. Together with the apparent irrelevance of G + C content for thermal adaptation on the organism scale, this finding suggests that on the level of nucleotide composition, direct selection pressure on DNA sequence composition appears to be weak.

Nearest-Neighbor Correlation in DNA Sequences

After nucleotide composition, the next level of description of DNA sequences is the pairwise nearest-neighbor correlation function, or normalized probability to find successive pairs of specific nucleotides. Although sequence correlations have a minor effect on base pairing, they determine the strength of the stacking interactions, where successive A and G nucleotides (ApG pairs, or dinucleotides) have a low energy, stabilizing the DNA [27,28].

For each of the 83 genomes, we calculated the correlation function cij (see Methods) of all 16 possible successive pairs of nucleotides in the coding strand. For each pair, we plotted its cij value in the 83 genomes against the OGT, and calculated the correlation coefficient (see Table 3, column 1). It turns out that the correlation function cAG, or the excess probability to find successive pairs of A and G nucleotides (ApG pairs) in the coding DNA is significantly increasing with OGT, R = 0.68. The increase of cAG with temperature is observed for both coding DNA and the complete genome sequence.

Table 3.

Correlation Coefficients R between OGT and DNA Sequence Correlation Function cij for the 16 Combinations of Nucleotides i,j in 83 Genomes

We have shown above that codon bias does not define the dominant trends in nucleotide composition of coding DNA. Is codon bias necessary to explain the observed sequence correlations in coding parts of DNA, given that amino acid bias is established independently? To answer this question, we reverse-translated the actual protein sequences into nucleotide sequences, using synonymous codons with equal probabilities.

In principle, there are two possible sources of the correlations in DNA sequences, one stemming from the neighboring nucleotides within a codon, and another one stemming from the combination of nucleotides at the interface of successive codons. The latter possibility implies the dependence of correlations in DNA on the sequence correlations in proteins. Reshuffling of protein sequences while retaining the actual codons used for each amino acid removes the effect of codon interface (Table 3, column 2). These data show that ApG pairs are still highly correlated with OGT, and correlations in protein sequences have a small effect on the correlations in DNA.

After removal of codon bias, ApG pairs are no longer correlated with OGT, R = 0.20 (see Table 3, column 3, and Figure 8). Therefore, the correlation between the ApG pairs and temperature is entirely due to the evolved codon bias. Importantly, CpT pairs in coding sequences of DNA are also highly correlated with OGT. As we analyze correlations in the sense strand of DNA molecule, abundance of CpT pairs points out to the enrichment of anti-sense strand with ApG ones. Therefore, double-stranded DNA is stabilized by stacking interactions provided by ApG dinucleotides that are spread in different locations of both sense and anti-sense strands. This conclusion holds for the entire DNA sequence when both coding and noncoding parts are considered. Therefore, it appears that the crucial role of codon bias is to increase the number of ApG pairs in coding DNA in response to elevated environmental temperatures, enhancing the stacking interactions in DNA. We also note that the trend to increase the frequency of ApG dinucleotides via codon adaptation may be another factor (besides amino acid adaptation, see above) that gives rise to increased overall composition of purine nucleotides A + G in hyperthermophiles, as can be seen from comparison of slopes of scatterplots shown in Figure 7A and 7B.

Figure 8. Dependence of the Pairwise Nearest-Neighbor Correlation Function for A,G Nucleotides cAG on the OGT in the Genomes of 83 Species (Black) and in the DNA Sequences Obtained from the Proteomes of 83 Species without Codon Bias (Red)

While codon bias is not essential for reproducing the trends in nucleotide composition (Figure 6), nucleotide correlations are entirely dependent on the proper choice of codon bias. An increase of the number of ApG dinucleotides enhances the stacking interactions in DNA, stabilizing it at elevated environmental temperatures.


Earlier works [2931] have established an empirical correlation between OGT of an organism and the melting temperature of its proteins. Here we found the pronounced amino acid biases and related proteomic determinant of thermal adaptation: IVYWREL combination of amino acid residues, which are highly correlated with the OGT of prokaryotes. To better understand possible molecular mechanisms responsible for the observed highly significant proteomic trends, one must establish a connection between thermal adaptation at organismal and molecular levels.

Remarkably, all amino acids from the IVYWREL predictor set are attached to tRNA by class I aminoacyl-tRNA synthetases [32]. Other amino acids belonging to the group of class I synthetases are Cys (prone to form stabilizing S-S bridges), Lys (important in the generic mechanism of thermostability, see below and [33]), and Met which are often placed in N-termini of proteins. Thus, class I amino acids, contrary to those of class II [34], may constitute a group of amino acids sufficient for the synthesis of thermostable proteins. Besides, all class I synthetases have the Rossmann fold structure, one of ancient LUCA domains [35] with high compactness and, therefore, stability [36]. These observations suggest a possible connection between thermal adaptation and evolution of protein biosynthesis.

As it is known from statistical mechanics and the theory of protein folding [37], the stability of proteins is largely determined by the Boltzmann weight exp(−ΔE/kBT), where ΔE is the energy gap between the native state and the nearest in energy misfolded decoy structure, T, is environmental temperature, and kB is the Boltzmann constant. As IVYWREL amino acids are important for thermostability, it is natural to assume that the energy gap ΔE is established mainly by interactions between these types of residues. Thus, we suggest that natural selection adjusted the content of IVYWREL in the proteomes to maintain ΔE/kBT at a nearly constant level irrespective of the environmental temperature. Interestingly, the total fraction of IVYWREL residues in the proteomes changes from 0.37 to 0.48 over the accessible temperature range. This relatively small yet significant change highlights a very delicate balance between hydrophobic, van der Waals, ionic, and hydrogen-bond interactions in correctly folded proteins. Indeed, the IVYWREL set contains residues of all major types, aliphatic and aromatic hydrophobic (Ile, Val, Trp, Leu), polar (Tyr), and charged (Arg, Glu), both basic and acidic. Recently, we have shown, using exact statistical mechanical models of protein stability, that the increase of the content of hydrophobic and charged amino acids can be quantitatively explained as a physical response to the requirement of enhanced thermostability (“from both ends of the hydrophobicity scale” mechanism, unpublished data), reflecting the positive and negative components of protein design.

The basic mechanism of protein thermal stability can be further augmented or modified by other biological, evolutionary, and environmental inputs. Availability of different amino acids in the environment, protein–protein interactions, and protein function requirements are just a few of those other major factors affecting the amino acid composition. Membrane proteins deliver an example where the mechanism of thermal adaptation is adjusted according to the specificity of the structure and interactions with its hydrophobic environment. In α-helical bundles, the temperature dependence of IVYWREL deviates from that of soluble proteins, whereas in β barrels even the determinant of thermostability, CVYP, is completely different. These observations are suggestive of the differences between the folding processes of globular and membrane proteins. Thermostability predictors derived for individual folds confirm a generic nature of the IVWREL predictor, which dominates in the majority of protein folds. Specific mechanisms of structure stability present in metal-binding proteins (globin, cytochrome C, ferredoxin) or those stabilized by S-S bridges (lysozyme) explain the lesser importance of the IVYWREL predictor for these folds.

Thermostability of proteins is, however, only a partial prerequisite of thermal adaptation of an organism, as DNA molecules must also remain stable at elevated environmental temperatures. Numerous experimental and theoretical works established two fundamental interactions in double-stranded DNA, base pairing and base stacking, as major determinants of DNA stability in vitro. GC pairs in the double helix have stronger base-pairing interactions than AT pairs [21], while purines, A and G, yield a low energy of stacking [27,28,38].

The role of G + C content in establishing certain biases in amino acid composition has been widely discussed [10,18,3941]. Our high-throughput analysis of nucleotide and amino acid compositions has confirmed that the G + C content of DNA is not correlated with the OGT of prokaryotes. The only signature of thermal adaptation in DNA composition that we found, increase of A + G content in the coding DNA, is to a considerable extent a consequence of the thermal adaptation of protein sequences. Elimination of codon bias and reverse-translation of the protein sequences into nucleotide sequences does not change the major trends in DNA composition (Figure 7). This result suggests that composition-dependent (pairing) interactions in double-stranded DNA are not the bottleneck of thermal adaptation in prokaryotes, as even without codon bias the variation in composition of DNA follows the variation of composition of proteins across the whole range of environmental temperatures.

On the contrary, pairwise nearest-neighbor correlations observed in DNA sequences are largely independent from amino acid trends as they are entirely determined by the codon bias developed as a form of thermal adaptation in prokaryotes. We found that natural selection tailored the codon bias to increase the fraction of ApG pairs in both strands of DNA molecules of thermophilic organisms. Indeed, an increased number of ApG pairs leads to a lower stacking energy, and, thus, stabilization of DNA. We also demonstrated that the trend of higher frequency of ApG pairs in thermophiles persists for the whole DNA, including its coding and noncoding parts. To our knowledge, this consideration is one of the first physical models that quantitatively explain the necessity of and one of the possible reasons for codon bias in prokaryotes.

An important finding presented in this work is that amino acid composition adaptation is a primary factor, while certain signatures of thermal adaptation at the level of nucleotide composition such as purine loading index may be partly derivative of the amino acid adaptation requirement. This is perhaps not surprising given that many (but not all) proteins are present in cytoplasm in monomeric form and must be stabilized on their own—by interactions between their own amino acids (and also destabilize misfolded conformations, performing negative design as well). On the other hand, many proteins bind to DNA in prokaryotic cells, and some of them may provide additional stabilization to DNA in hyperthermophiles, making thermal adaptation at the level of DNA sequences a “collective enterprise” of the cell, relieving direct pressure on DNA sequence to adapt to high temperature. Positive superhelicity provided by gyrases may be one such mechanism, and perhaps several other mechanisms of DNA stabilization by proteins could be discovered in the future. In particular, the observed purine–purine AG correlation as an “independent” (from amino acid composition) adaptation mechanism may be a signature of the important role of stacking interaction in DNA thermal stabilization. Alternatively, the AG correlations may be a signature of an additional thermal adaptation of DNA via modulation of local mechanical properties or facilitation of interaction with proteins. Elucidation of the exact mechanism(s) by which DNA in prokaryotic genomes adapt to high temperatures is a matter of future research.

To summarize, we have shown that thermal adaptation of proteins in prokaryotes is strongly manifest at the level of amino acid composition. Among all possible combinations of amino acids, the fraction of IVYWREL amino acids in a proteome is the most precise predictor of OGT. We also observe a difference in the signatures of thermal adaptation of soluble and membrane proteins. We did not find a strong evidence for thermal adaptation of DNA at the level of nucleotide composition. The G + C content of the genomes is not correlated with environmental temperature, while A + G content increases with environmental temperature, to a considerable extent, as a consequence of thermal adaptation of proteins. At the same time, our analysis of sequence correlations in DNA shows that ApG dinucleotides are overrepresented in both strands of dsDNA of thermophilic organisms. The abundance of ApG pairs is a direct consequence of the codon bias developed in thermophilic prokaryotes. These findings provide a definitive answer to the long-standing problem of which genomic and proteomic compositional trends reflect thermal adaptation.

Supporting Information

Figure S1. Dependence of the Fraction of KVYWREP Amino Acids on the OGT for the 204 Genomes from Table S1

(134 KB DOC)

Figure S2. Histogram of the Values of the Correlation Coefficient R between OGT and Fraction F of a Set of Amino Acids for All Possible Seven-Letter Combinations and IVYWREL Combinations with One Substitution

(132 KB DOC)

Figure S3. Probability to Find Each of the 20 Amino Acids in the Predictor Combinations (Highest-R Amino Acid Combination) Calculated from 1,000 of 43-Organism Training Sets

(126 KB DOC)

Figure S4. Dependence of the Fraction of G + C Nucleotides on the OGT for the 204 Genomes from Table S1

(147 KB DOC)

Table S1. Cardinal Temperatures of Prokaryotes (Minimum, Optimal, and Maximum Growth Temperatures)

(477 KB DOC)

Table S2. Combinations of Amino Acids Giving the Ten Highest Values of the Correlation Coefficient R between OGT and Fraction of the Amino Acids for 86 Proteomes

(29 KB DOC)

Table S3. Organisms Used for Prediction of Transmembrane Beta Barrel Proteins and Their OGT

(44 KB DOC)

Table S4. Combinations of Nucleotides and Correlation Coefficients between OGT and Fraction of the Corresponding Nucleotides in the Coding DNA of 83 Organisms

(31 KB DOC)

Table S5. Correlation Coefficients R between OGT and Amino Acids and Pairs of Amino Acids

(A) Correlation coefficients R between OGT and fractions of individual amino acids for 83 genomes, ordered by R.

(B) Correlation coefficients R between OGT and fractions of two-letter combinations of amino acids for 83 genomes, ordered by R

(70 KB DOC)


We are grateful to M. D. Frank-Kamenetskii for illuminating discussions and to M. Kloster for useful comments on an initial version of this paper. We would like to thank an anonymous referee for the suggestion to analyze membrane proteins and to H. Bigelow (Columbia University) for the help with prediction of TM beta barrels.

Author Contributions

KBZ and EIS conceived and designed the experiments and wrote the paper. KBZ performed the experiments. KBZ, INB, and EIS analyzed the data. INB contributed reagents/materials/analysis tools.


  1. 1. Shakhnovich EI (1994) Proteins with selected sequences fold into unique native conformation. Phys Rev Lett 72: 3907–3910.
  2. 2. Jaenicke R, Bohm G (1998) Stability of proteins in extreme environments. Curr Opin Struct Biol 8: 738–748.
  3. 3. Jaenicke R (2000) Stability and stabilization of globular proteins in solution. J Biotechnol 79: 193–203.
  4. 4. Berezovsky IN, Shakhnovich EI (2005) Physics and evolution of thermophilic adaptation. Proc Natl Acad Sci U S A 102: 12742–12747.
  5. 5. Thompson MJ, Eisenberg D (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J Mol Biol 290: 595–604.
  6. 6. Querol E, Perez-Pons JA, Mozo-Villarias A (1996) A simple electrostatic criterion for predicting the thermal stability of proteins. Protein Eng 9: 265–271.
  7. 7. Vetriani C, Maeder DL, Tolliday N, Yip KS, Stillman TJ, et al. (1998) Protein thermostability above 100 °C: A key role for ionic interactions. Proc Natl Acad Sci U S A 95: 12300–12305.
  8. 8. Hurley JH, Baase WA, Matthews BW (1992) Design and structural analysis of alternative hydrophobic core packing arrangements in bacteriophage T4 lysozyme. J Mol Biol 224: 1143–1159.
  9. 9. Szilagyi A, Zavodszky P (2000) Structural differences between mesophilic, moderately thermophilic, and extremely thermophilic protein subunits: Results of a comprehensive survey. Structure 8: 493–504.
  10. 10. Kreil DP, Ouzounis CA (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res 29: 1608–1615.
  11. 11. Pe'er I, Felder CE, Man O, Silman I, Sussman JL, et al. (2004) Proteomic signatures: Amino acid and oligopeptide compositions differentiate among phyla. Proteins 54: 20–40.
  12. 12. Cambillau C, Claverie JM (2000) Structural and genomic correlates of hyperthermostability. J Biol Chem 275: 32383–32386.
  13. 13. Das R, Gerstein M (2000) The stability of thermophilic proteins: A study based on comprehensive genome comparison. Funct Integr Genomics 1: 76–88.
  14. 14. Chakravarty S, Varadarajan R (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: A structural genomics based study. Biochemistry 41: 8152–8161.
  15. 15. Tekaia F, Yeramian E, Dujon B (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: A global picture with correspondence analysis. Gene 297: 51–60.
  16. 16. Singer GA, Hickey DA (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317: 39–47.
  17. 17. Ponnuswamy P, Muthusamy R, Manavalan P (1982) Amino acid composition and thermal stability of globular proteins. Intl J Biol Macromol 4: 186–190.
  18. 18. Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, et al. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett 573: 73–77.
  19. 19. Lambros RJ, Mortimer JR, Forsdyke DR (2003) Optimum growth temperature and the base composition of open reading frames in prokaryotes. Extremophiles 7: 443–450.
  20. 20. Lin FH, Forsdyke DR (2006) Prokaryotes that grow optimally in acid have purine-poor codons in long open reading frames. Extremophiles. 7 September 2006. epub.
  21. 21. Marmur J, Doty P (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J Mol Biol 5: 109–118.
  22. 22. Sonnhammer ELL, von Heijne G, Krogh A (1998) A hidden markov model for predicting transmembrane helices in protein sequences. In: Glasgow TL, Major F, Lathrop D, Sankoff D, Sensen C, editors. Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. Menlo Park (California): AAAI Press. pp. 175–182.
  23. 23. Bigelow HR, Petrey DS, Liu J, Przybylski D, Rost B (2004) Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res 32: 2566–2577.
  24. 24. Huang SL, Wu LC, Liang HK, Pan KT, Horng JT, et al. (2004) PGTdb: A database providing growth temperatures of prokaryotes. Bioinformatics 20: 276–278.
  25. 25. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540.
  26. 26. Bowie JU (2005) Solving the membrane protein folding problem. Nature 438: 581–589.
  27. 27. Friedman RA, Honig B (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys J 69: 1528–1535.
  28. 28. Saenger W (1984) Principles of nucleic acid structure. New York: Springer-Verlag.
  29. 29. Hochachka P, Somero G (2002) Biochemical adaptation. Mechanism and process in physiological evolution. Oxford: Oxford University Press.
  30. 30. McFall-Ngai MJ, Horwitz J (1990) A comparative study of the thermal stability of the vertebrate eye lens: Antarctic ice fish to the desert iguana. Exp Eye Res 50: 703–709.
  31. 31. Gromiha MM, Oobatake M, Sarai A (1999) Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys Chem 82: 51–67.
  32. 32. Eriani G, Delarue M, Poch O, Gangloff J, Moras D (1990) Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequence motifs. Nature 347: 203–206.
  33. 33. Berezovsky IN, Chen WW, Choi PJ, Shakhnovich EI (2005) PLoS Comput Biol 1(4): e47..
  34. 34. Klipcan L, Safro I, Temkin B, Safro M (2006) Optimal growth temperature of prokaryotes correlates with class II amino acid composition. FEBS Lett 580: 1672–1676.
  35. 35. Shakhnovich BE, Deeds E, Delisi C, Shakhnovich E (2005) Protein structure and evolutionary history determine sequence space topology. Genome Res 15: 385–392.
  36. 36. England JL, Shakhnovich BE, Shakhnovich EI (2003) Natural selection of more designable folds: A mechanism for thermophilic adaptation. Proc Natl Acad Sci U S A 100: 8727–8731.
  37. 37. Shakhnovich EI (1997) Theoretical studies of protein-folding thermodynamics and kinetics. Curr Opin Struct Biol 7: 29–40.
  38. 38. Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res 34: 564–574.
  39. 39. Hurst LD, Merchant AR (2001) High guanine-cytosine content is not an adaptation to high temperature content: A comparative analysis amongst prokaryotes. Proc Biol Sci 268: 493–497.
  40. 40. Suhre K, Claverie JM (2003) Genomic correlates of hyperthermostability, an update. J Biol Chem 278: 17198–17202.
  41. 41. Wang HC, Susko E, Roger AJ (2006) On the correlation between genomic G + C content and optimal growth temperature in prokaryotes: Data quality and confounding factors. Biochem Biophys Res Commun 342: 681–684.