Relative Amino Acid Composition Signatures of Organisms and Environments

Background: Identifying organism-environment interactions at the molecular level is crucial to understanding how organisms adapt to and change the chemical and molecular landscape of their habitats. In this work we investigated whether relative amino acid compositions could be used as a molecular signature of an environment and whether such a signature could also be observed at the level of the cellular amino acid composition of the microorganisms that inhabit that environment.
Methodologies/Principal Findings: To address these questions we collected and analyzed environmental amino acid determinations from the literature, and estimated from complete genomic sequences the global relative amino acid abundances of organisms that are cognate to the different types of environment. Environmental relative amino acid abundances clustered into broad groups (ocean waters, host-associated environments, grassland environments, sandy soils and sediments, and forest soils), indicating the presence of amino acid signatures specific for each environment. These signatures correlate with those found in organisms. Nevertheless, the relative amino acid abundance of organisms was more influenced by GC content than by habitat or phylogeny.
Conclusions: Our results suggest that relative amino acid composition can be used as a signature of an environment. In addition, we observed that the relative amino acid composition of organisms is not highly determined by environment, reinforcing previous studies that find GC content to be the major factor correlating with amino acid composition in living organisms.


Introduction
As early as the 1930s, Alfred Redfield analyzed the oceanic ratios of carbon, nitrogen and phosphorus and found that they were approximately constant at 106C:16N:1P and similar to those observed in the organisms living in those ecosystems [1]. Later, Redfield suggested that this was a consequence of organisms maintaining the environmental abundance of the major chemical elements at homeostatic values closer to those in protoplasm [2]. Measurements of the Redfield ratio in other environments suggest that, even though the ratios vary slightly, they are approximately constant for a given type of environment [3,4].
Recent work also reveals that the Redfield ratios in individual organisms and clades deviate from the global values and depend on phylogeny, geochemical constraints, and nutrient availability [4][5][6][7][8]. In fact, long-term deficits of a given environmental chemical nutrient can be a driving force for evolutionary changes in the composition of the enzymes that fix that nutrient. Such changes in enzyme composition usually lead to a decrease in the frequency of amino acids that contain large amounts of the limiting environmental nutrient [9]. Such biases are also observed at the molecular level: enzymes that synthesize a specific amino acid when it is absent from the environment contain low relative amounts of that cognate amino acid [10].
Taken together, the above observations raise the following questions: i) Does the environmental relative amino acid abundance (eRAAA) of the 20 naturally occurring protein L-α-amino acids have ratios that are analogous to the Redfield ratios for chemical elements? ii) If so, are such ratios as widespread as those for C:N:P? iii) Is the cellular relative amino acid abundance (cRAAA) of each organism approximately the same as the eRAAA of its habitat, suggesting an efficient utilization of free amino acids by the microorganisms in a given community? If not, does each organism have a distinctive cRAAA, suggesting that eRAAA is a complex function of the dynamics of amino acid production and turnover by the microbial communities inhabiting such environments? iv) Finally, could the environment feed back onto the organisms and contribute to the evolution of their whole-cell amino acid composition in specific environments, as it appears to do for the nucleotide composition of genomes [11]?
To answer the first two questions we compiled complete amino acid measurements from different environments and compared them. We found that indeed there are specific signatures for eRAAA and that these signatures are broadly similar within each type of environment.
Answering the last two questions required an additional estimation of the cellular relative amino acid composition (cRAAA) of the different species inhabiting the various environments. Given that single-cell organisms represent the vast majority of biomass in aquatic and terrestrial environments [12], to estimate such cRAAA we used the predicted proteome composition of prokaryotes and unicellular eukaryotes with fully sequenced genomes. Our results showed that the population in any given environment is heterogeneous with respect to their cRAAA, suggesting that organisms evolved the ability to differentiate their cRAAA from that of their environment through the regulation of their amino acid biosynthesis and utilization pathways.

Data collection and classification
We collected more than 100 different environmental determinations of natural amino acid abundances. Of these, we retained only those measurements that simultaneously determined at least 16 of the 20 L-α-amino acids (n = 69, see Table S1 and references therein), covering a wide spectrum of habitats, including water bodies, land masses and intestinal environments. Determinations of Asp/Asn and Glu/Gln were considered together for the analysis, because environmental measurements did not distinguish between the two amino acids in each pair. Environments were classified as aquatic (ocean and freshwater environments), terrestrial or host-associated.
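The filtering and pooling rules described above can be illustrated with a minimal Python sketch. The function name `relative_abundance`, the label format (`"Asn+Asp"`), and the choice to count a pooled label as covering both of its amino acids are assumptions made for this illustration; they are not taken from the study's own scripts.

```python
# Illustrative sketch only: names and input format are hypothetical.
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
# Asp/Asn and Glu/Gln are pooled because environmental assays do not separate them.
MERGED = {"Asn": "Asn+Asp", "Asp": "Asn+Asp", "Gln": "Gln+Glu", "Glu": "Gln+Glu"}


def relative_abundance(measurement, min_measured=16):
    """Return pooled relative abundances (summing to 1), or None if the
    measurement covers fewer than `min_measured` of the 20 amino acids.

    `measurement` maps three-letter codes, or pooled labels such as
    "Asn+Asp", to amounts expressed in one common unit.
    """
    covered = set()
    pooled = {}
    for label, amount in measurement.items():
        members = label.split("+")  # a pooled label covers two amino acids (assumption)
        if any(m not in AMINO_ACIDS for m in members):
            raise ValueError(f"unknown amino acid label: {label}")
        covered.update(members)
        key = MERGED.get(members[0], label)
        pooled[key] = pooled.get(key, 0.0) + amount
    if len(covered) < min_measured:
        return None
    total = sum(pooled.values())
    if total <= 0:
        raise ValueError("measurement has no positive amounts")
    return {aa: amount / total for aa, amount in pooled.items()}
```

With all 20 amino acids reported, the result has 18 entries, because the two pooled pairs each collapse into a single value.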
Completely sequenced genomes and predicted protein sequences of unicellular organisms were obtained from the KEGG and NCBI databases [13,14]. Similar strains were discarded from the analysis to reduce phylogenetic bias. Genomes used in further analyses (n = 1086) included 961 Bacteria, 72 Archaea and 53 Eukarya (Table S2).
Organisms were classified in terms of habitat (aquatic, terrestrial, versatile, specialized and host-associated) based on information retrieved from the Integrated Microbial Genomes [15], Genomes Online [16] and NCBI Genome Project [17] databases and from the primary literature.

Calculation of genome and proteome properties
Using locally developed PERL scripts, we estimated the following properties for each organism from completely sequenced and fully annotated genomes: GC content, base pair composition of genes, codon usage and absolute amino acid abundance. These were important to control for the influence of non-environmental factors on protein amino acid composition.
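As an illustration of two of the properties listed above, the following Python sketch computes GC content and in-frame codon counts for a nucleotide sequence. It is not the Perl code used in the study; the function names and the decision to ignore ambiguous bases are assumptions of this example.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counts = Counter(sequence.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["G"] + counts["C"]) / acgt


def codon_counts(cds):
    """Count in-frame codons of a coding sequence.

    Trailing bases that do not fill a codon, and codons containing
    ambiguous bases, are skipped (an assumption of this sketch).
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if set(c) <= set("ACGT"))
```

Genome-wide values would be obtained by applying these functions to the full genome sequence and to each annotated coding sequence, respectively.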

Estimation of cellular amino acid content
Since it was not feasible to obtain experimental determinations of the cRAAA for all the organisms used in this work, we estimated cRAAA from an organism's predicted protein abundances, assuming that cells grow without nutrient restrictions. Under such conditions, the level of expression of the different proteins in the genome can be estimated with respect to that of the abundant ribosomal proteins [18][19][20][21][22]. In this way, the average amino acid composition of each organism was calculated using locally developed PERL scripts by weighting the abundance of each protein with respect to the ribosomal proteins, whose abundance was set to be maximal. Two different metric functions were used to weight protein abundance: the Codon Adaptation Index (CAI) and a d index.
CAI defines translationally optimal codons [23]. To calculate it, we normalized the data using the relative adaptiveness ($w_{c,a}$), as previously described [22]. This adaptiveness was calculated for ribosomal proteins, in which the frequency of each synonymous codon ($n_{c,a}$) was normalized by the frequency of the most frequent codon ($C_a$ being the set of synonymous codons used by amino acid $a$):
$$w_{c,a} = \frac{n_{c,a}}{\max_{c' \in C_a} n_{c',a}}$$
Thus, the codon usage of each coding sequence was represented by a vector of length 59 (stop codons and amino acids with only one codon were discarded). CAI was then computed for each gene by summing over the codon usage vector (rather than over the length) [22]:
$$\mathrm{CAI} = \exp\left(\frac{1}{n_{tot}} \sum_{c=1}^{59} n_c \ln w_c\right)$$
Here, $n_c$ is the number of occurrences of codon $c$ in the gene and $n_{tot}$ is the total number of codons in the gene.
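The relative adaptiveness and CAI calculation described above can be sketched in Python as follows. This is a minimal illustration that assumes the standard geometric-mean form of CAI and the standard genetic code. The function names, the pseudocount used for codons absent from the ribosomal reference set, and the choice to count only the 59 informative codons in the denominator are assumptions of this sketch, not details confirmed by the text.

```python
import math
from collections import Counter

BASES = "TCAG"
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" marks stop codons).
GENETIC_CODE = {a + b + c: AA_STRING[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def informative_codons():
    """The 59 codons left after dropping stops and single-codon amino acids."""
    family_size = Counter(GENETIC_CODE.values())
    return [c for c, aa in GENETIC_CODE.items()
            if aa != "*" and family_size[aa] > 1]


def relative_adaptiveness(ribosomal_counts, pseudocount=0.5):
    """w[c] = n[c] / (largest n among synonymous codons), from ribosomal genes.

    Zero counts are replaced by `pseudocount` so that ln(w) stays finite
    (an assumption of this sketch).
    """
    w = {}
    for codon in informative_codons():
        aa = GENETIC_CODE[codon]
        family = [c for c in GENETIC_CODE if GENETIC_CODE[c] == aa]
        top = max(ribosomal_counts.get(c, 0) for c in family)
        n = ribosomal_counts.get(codon, 0)
        w[codon] = (n if n > 0 else pseudocount) / (top if top > 0 else pseudocount)
    return w


def cai(gene_counts, w):
    """exp(sum_c n_c * ln(w_c) / n_tot), summed over the informative codons."""
    n_tot = sum(gene_counts.get(c, 0) for c in w)
    if n_tot == 0:
        raise ValueError("gene has no informative codons")
    log_sum = sum(gene_counts.get(c, 0) * math.log(w[c]) for c in w)
    return math.exp(log_sum / n_tot)
```

Both functions take plain codon-count dictionaries, so the reference counts would be pooled over all ribosomal protein genes of one genome before calling `relative_adaptiveness`.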
To compute the d index, first we measured the Euclidean distance ($c_p$) between the codon usage of protein $p$ ($CU_{c,\mathrm{protein}\,p}$, representing the average relative usage of codon $c$ in protein $p$) and the average codon usage of ribosomal proteins ($CU_{c,rib}$):
$$c_p = \sqrt{\sum_{c} \left(CU_{c,\mathrm{protein}\,p} - CU_{c,rib}\right)^2}$$
Then, we defined d as an independent weighting function for gene expression as follows:
$$d = 1 - \frac{c_p}{c_{max}}$$
Here, $c_{max}$ is the maximum Euclidean distance in a given genome.
Figure 1. Characterization of different environments by their relative amino acid composition. A) Scatter plot by Principal Component Analysis according to the type of environment; B) Hierarchical clustering analysis. The length of branches represents the degree of dissimilarity between clusters. The x-axis of the heat map represents the 20 amino acids by alphabetical order of the three-letter code name. Determinations of Asp/Asn and Glu/Gln were considered together for the analysis, because environmental measurements did not distinguish between the two amino acids in the pairs. The y-axis of the heat map represents the individual environments where amino acid abundance was determined. Over- and underrepresentation of amino acid residues in each environment are represented in green and red colored squares, respectively. doi:10.1371/journal.pone.0077319.g001
Environmental and Cellular Amino Acid Signatures PLOS ONE | www.plosone.org
Table 2. Spearman rank correlation coefficients between estimated amino acid compositions (based on CAI and d predictors) and experimentally determined amino acid abundances.
CAI and d range between 0 and 1 for any given gene, with higher values indicating genes that are more highly expressed and that therefore contribute more to the organism's amino acid content.
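A minimal Python sketch of the d index follows. It assumes that the index is the Euclidean distance to the ribosomal codon usage rescaled as 1 - c_p / c_max, which matches the stated range (0 to 1, with higher values for genes whose codon usage is closer to that of ribosomal proteins). The function names and the dictionary-based input format are hypothetical.

```python
import math


def euclidean_distance(usage_a, usage_b):
    """Euclidean distance between two codon-usage vectors stored as dicts."""
    codons = set(usage_a) | set(usage_b)
    return math.sqrt(sum((usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) ** 2
                         for c in codons))


def d_index(protein_usages, ribosomal_usage):
    """Map each protein name to 1 - c_p / c_max (assumed form of the index).

    `protein_usages` maps protein names to relative codon-usage dicts for
    one genome; `ribosomal_usage` is the average usage of ribosomal proteins.
    """
    if not protein_usages:
        raise ValueError("no proteins supplied")
    distances = {name: euclidean_distance(usage, ribosomal_usage)
                 for name, usage in protein_usages.items()}
    c_max = max(distances.values())
    if c_max == 0:
        # Every protein matches the ribosomal usage exactly.
        return {name: 1.0 for name in distances}
    return {name: 1.0 - c / c_max for name, c in distances.items()}
```

Because c_max is taken within one genome, the protein that is most distant from the ribosomal usage always receives a weight of 0 under this assumed form.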
The cRAAA of each organism was computed as a vector of 20 amino acids, in which the cellular relative abundance of each amino acid ($\mathrm{cRAAA}_{aa_i}$) was calculated as follows:
$$\mathrm{cRAAA}_{aa_i} = \frac{\sum_{\mathrm{proteins}} n_{aa_i} \times EI}{N_{aa_i}}$$
Here, $n_{aa_i}$ is the frequency of amino acid $i$ in each protein, $EI$ is the predicted expression index (CAI or d) of that protein and $N_{aa_i}$ is the total count of that amino acid in the organism's proteome.
As a control, predictions were compared with experimentally determined amino acid abundances in Bacillus subtilis, Escherichia coli and Staphylococcus aureus retrieved from the literature [24][25][26].
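The expression-weighted composition can be sketched in Python as below. The function name and input format are hypothetical. The normalisation used here, dividing by the weighted total over all amino acids so that the resulting vector sums to 1, is an assumption of this sketch and may differ from the denominator used in the study.

```python
from collections import Counter


def weighted_composition(proteins, expression_index):
    """Expression-weighted relative amino acid abundances for one organism.

    `proteins` maps protein names to one-letter amino acid sequences;
    `expression_index` maps the same names to a weight in [0, 1]
    (for example CAI or d).
    """
    weighted = Counter()
    for name, sequence in proteins.items():
        weight = expression_index[name]
        for aa, count in Counter(sequence.upper()).items():
            weighted[aa] += count * weight
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("all expression weights are zero")
    return {aa: value / total for aa, value in weighted.items()}
```

Setting every weight to 1 reproduces the unweighted proteome composition, which is the baseline that the weighted estimates are compared against in the validation.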

Statistical analyses
Multi-dimensional matrices of cRAAA and eRAAA were generated in which each column represented the relative content of one amino acid and each row represented one organism or one environmental measurement, respectively.
Principal Component Analysis (PCA) and Hierarchical Clustering were carried out to analyze the segregation of environments and organisms as a function of eRAAA and cRAAA, respectively.
To understand whether environmental factors contributed significantly to shaping the global amino acid composition of organisms, we needed to control for the factors that are already known to explain that composition: genomic GC content and phylogeny [11]. In addition, we considered the main habitat of each organism. Finally, we created a generalized linear model of amino acid composition as a function of these three factors in order to estimate the effect of each factor on that composition:
$$\mathrm{cRAAA}_{aa_j} = a_1\,\mathrm{GC} + a_2\,\mathrm{Phylum} + a_3\,\mathrm{Habitat} + e$$
In this equation, $\mathrm{cRAAA}_{aa_j}$ is the cRAAA of amino acid $j$; $a_1$, $a_2$ and $a_3$ are linear coefficients; and $e$ is a noise term [27]. The variables Phylum and Habitat were treated as discrete categorical data and given integer values (e.g. Crenarchaeota → 1, Euryarchaeota → 2, Korarchaeota → 3, Nanoarchaeota → 4, Acidobacteria → 5, Actinobacteria → 6, etc.; and Aquatic → 1, Terrestrial → 2, Versatile → 3, Specialized → 4, Host-associated → 5, Gut → 6, respectively). We created independent models for each amino acid and estimated the coefficients and their significance using ANOVA [28].
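A least-squares fit of the three-factor model can be sketched in pure Python as follows. The study used Mathematica; this sketch only illustrates the fitting step. It assumes integer-coded Phylum and Habitat values as described, fits the model without an intercept because the equation as written has none, and does not cover the ANOVA significance testing. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(rows, response):
    """Least-squares coefficients [a1, a2, a3] via the normal equations.

    Each row is (GC, Phylum code, Habitat code); `response` holds the
    cRAAA of one amino acid for the same organisms.
    """
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(p)]
    return solve_linear_system(xtx, xty)
```

One model would be fitted per amino acid by calling `fit_linear_model` with the same predictor rows and a different response vector each time.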
Linear and Rank Correlation analyses between environmental and cellular amino acid compositions were performed based on Pearson's and Spearman's correlation coefficients, respectively. Statistical significance of the correlation coefficients was calculated using a t-test [28].
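The correlation measures named above can be sketched in pure Python. The t statistic uses the standard relation t = r·sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom; converting it to a p-value requires a t-distribution table or a statistics library and is left out here. Function names are illustrative.

```python
import math


def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For two composition vectors, `x` and `y` would hold the relative abundances of the same amino acids in the same order, so n is the number of amino acid categories compared.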
All statistical analyses were done using Wolfram Mathematica 8.0 (Wolfram Research, Inc., USA).
Table 1 shows the average environmental relative amino acid abundances (eRAAA) obtained from the experimentally determined amino acid measurements collected from the literature (Table S1 and references therein). Overall, eRAAA ranged between 0.2% and 16%, with Gly, Ser, Ala, Asn+Asp and Gln+Glu being the most abundant amino acids (mean > 10%).

Environments share amino acid signatures
PCA was performed on the multidimensional matrix of eRAAA for the environments, and the principal components were used to investigate how environments grouped as a function of their amino acid composition. PCA showed segregation of water, soil and intestinal environments with respect to eRAAA (Figure 1A). Eight principal components were needed to explain more than 90% of the variation in the environmental composition data (91.7%). Figure 1B represents the clustering of environments on the basis of their eRAAA. Overall, similar environments clustered together. Environments segregated into clusters corresponding to ocean waters, soil and host-associated environments, indicating the presence of habitat-specific trends in eRAAA. In general, ocean waters (cluster I) showed relatively higher abundance of Ala, Trp, Gly, Gln+Glu and Leu. Host-associated environments (cluster II) showed relatively higher abundance of Cys, Leu, Lys, Met and Pro. Terrestrial environments (cluster III) grouped into three sub-clusters: a) boreal forest soils, characterized by higher abundance of Asn+Asp, Gln+Glu, His and Ala; b) sandy soils (sacaton, mesquite, open), characterized by higher content of Arg, Lys, Leu, Ala and Asn+Asp; and c) grassland environments, richer in Tyr, Lys, Asn+Asp, Ala, Leu and Thr.
Taken together, these results suggest that the abundance of specific sets of amino acids creates signatures that are particular to each environment. However, it should be noted that Spearman rank correlations between the eRAAA of each pair of environments were statistically significant, ranging between 0.517 (p-value < 0.05) and 0.860 (p-value < 0.001). This indicates that, although the eRAAA of each environment is specific to that environment and significantly different from those of other environments, the absolute differences between environments are small.

Prediction of cellular amino acid abundance
To answer the third and fourth questions we estimated the cellular amino acid abundance of the organisms inhabiting the different environments, using CAI and d (see methods) as predictors of an organism's cellular amino acid abundance. These indexes weight the contribution of a given protein to the cellular amino acid pool by its predicted abundance relative to ribosomal proteins. To validate this approach, predictions were compared with published cRAAA of reference organisms. Spearman rank correlations ranged from 0.743 to 0.861, showing that estimated amino acid abundances correlated highly significantly (p-values < 0.001) with experimental determinations (Table 2). On average, higher correlations were obtained with amino acid abundances weighted by d. Nevertheless, CAI-weighted cellular amino acid abundances also correlated highly and significantly with the experimental determinations, whereas unweighted RAAA in the full proteomes of the test organisms correlated with the experimental determinations with significantly lower Spearman correlations (Table 2).
Further calculations and analyses were performed using both the d and CAI indexes and produced similar results. For convenience, the data shown refer to the d predictor only.

Organisms did not segregate according to habitat or lifestyle
To investigate whether the relative cellular abundance of amino acids also contained a signature of the environment in which the organisms have evolved, we performed PCA of the cRAAA and related each organism to its main environment. Figure 2A shows the PCA of the amino acid composition of organisms, colored according to the type of habitat. Projection of the data onto the 3 largest components accounted for 78.14%, 75.38% and 72.67% of the variation among organisms belonging to the Archaea, Bacteria and Eukarya domains, respectively. However, no segregation of habitats by principal components was observed (Figure 2A). Similar results were obtained when considering lower taxonomic levels (phyla and classes) as well as when neglecting relative expression levels (data not shown).
Analyses based only on the amino acid composition of ribosomal proteins and RNA polymerases provided similar trends, with the 3 largest components accounting for only 62.77%, 62.70% and 66.14% of the variation among the Archaea, Bacteria and Eukarya domains, respectively (Figure S1).
Hierarchical clustering analysis grouped organisms into two main clusters. The first cluster (cluster I) included organisms from the three domains and different habitats, which share relatively homogeneous amino acid abundance and a GC content lower than 50%. The second cluster (cluster II) also included organisms from the three domains and different habitats, but these had a higher content of Ala, Arg, Gly, His, Pro, Trp and Val, a lower relative abundance of Asn, Ile, Lys, Phe and Tyr, and a GC content higher than 50% (Figure 2B). Thus, segregation of relative amino acid compositions was not habitat- or domain-specific, being more constrained by GC content.

%GC is the major factor influencing amino acid composition
To understand the contribution of the different factors on the cellular RAAA, we performed a more detailed analysis by modeling amino acid composition as a function of phylogeny, GC content and habitat using generalized linear models.
The best-fit models confirmed that for 17 out of 20 amino acids GC% was the major factor influencing amino acid composition. For Ala, Arg, Asn, Gly, Ile, Lys, Pro and Tyr, that effect was higher than 75% (Table 3). Phylogeny was the factor with the greatest impact on the relative abundance of Cys, Gln and Met, although it explained only about 3% of the variance.
In fact, amino acid composition plotted against average GC content showed a strong correlation for the majority of amino acids; Asp, Cys, Gln, Glu, His, Leu, Met, Ser and Thr were the amino acids least affected by GC composition (Figure 3). Results obtained considering the entire set of proteins in the genome did not differ from those obtained when considering only the set of highly expressed proteins (data not shown). The same trends were also observed for amino acid compositions not weighted by expression.
Finally, we determined Spearman rank correlations between cellular and environmental RAAAs. Correlations were high and significant (Figure 4). However, the correlation between the cRAAA of a given organism and the eRAAA of its own environment was not significantly different from the correlation between the cRAAA of the same organism and the eRAAA of non-cognate environments.
Thus, our results suggest that cRAAA is not unequivocally determined by the eRAAA of the habitat and that there is a complex and dynamic relationship between the relative amino acid abundance of an environment and that of its inhabiting organisms.

Discussion
The ability to both adapt to and change its environment is a characteristic of life. Such changes are observed from macroscopic to microscopic and chemical scales. It is known that the relative amount of the major chemical elements that compose organisms is similar to that of environments. The accepted explanation for this is that, over time, cells have modified the environmental chemical landscape in such a way that it becomes more similar to themselves [1,2].
To the best of our knowledge, whether a similar process is observed at the molecular level had not been analyzed before. In this work, we performed such an analysis by focusing on L-α-amino acids and their relative abundance in environments and in cells. Given that L-α-amino acid production in Earth environments is only biological, one might expect those abundances to be relatively constant. However, given that many microorganisms can efficiently scavenge amino acids from the environment, one might also expect the RAAA ratios to be a dynamic result of the balance between amino acid release to the environment and amino acid uptake from the environment by the biota. Our results support the view that the relative abundance of the different amino acids in environmental sources is approximately constant.
Nevertheless, specific sets of amino acids were enriched in specific habitats, creating molecular signatures. The following environment-specific patterns of increased RAAA were observed: Ala, Trp, Gly, Gln+Glu and Leu, in oceans; Cys, Leu, Lys, Met and Pro, in host-associated environments; Asn+Asp, Gln+Glu, His and Ala, in soil boreal forest; Arg, Lys, Leu, Ala and Asn+Asp in sandy soils; and Tyr, Lys, Asn+Asp, Ala, Leu and Thr, in grass environments.
When it comes to identifying a direct correlation between environmental and cellular RAAA, our findings again suggest that such abundances correlate well globally. In contrast, when looking at how the cRAAA of specific organisms correlate to that of their cognate environments, we find that such correlations cannot be used to infer the environment from which the organism was extracted. These findings are apparently at odds with those from a previous study [11] that found both GC content and amino acid content of metagenomic datasets to be influenced by their environment. However, that study does not take into account environmental amino acid determinations and only four metagenomes are analyzed.
In fact, there is a large body of work on the compositional biases of genomes and proteomes and on what controls such biases. Examples of such compositional biases and their probable causes are known at the nucleotide [29,30], codon [31] and amino acid [9,10][32][33][34][35] levels. Nevertheless, such studies involved a limited number of sequences/organisms and only looked at the amino acid composition of individual proteins, not that of whole cells.
Our analyses took into account the relative amino acid compositions of proteins weighted by predicted levels of expression. These predictions were based on metrics that compare codon utilization between the ribosomal coding genes and other protein-coding genes [18][19][20][21]. More sophisticated estimations could also be achieved by considering aa-tRNA abundance and ribosome occupancy [36][37][38]. However, aa-tRNA abundances are well known only for a small number of organisms and under very specific conditions. Estimations of aa-tRNA abundance based on the number of genes coding for tRNAs could also be used [39]. However, given the size of the dataset used in this study, this could be biased by the varying quality of the genome annotation for each organism.
Correlations between the cRAAA calculated using the CAI and d indexes and experimentally determined mRNA and protein abundance were not significant (data not shown). However, both CAI- and (more notably) d-calculated cellular RAAA correlated highly and significantly with experimentally determined amino acid compositions. To the best of our knowledge, this constitutes the first study of relative amino acid compositions across domains that takes into account differential gene expression.
Hierarchical clustering analysis and PCA showed no apparent segregation of organisms according to habitat or domain. Clustering of amino acid relative content weighted by the CAI and d indexes showed segregation of organisms with higher content of residues with GC-rich codons (Ala, Arg, Gly, Pro) and organisms with higher content of residues with AU-rich codons (Asn, Ile, Lys, Phe, Tyr). A recent study [40] found similar results and reported that overall amino acid usage in Archaea is dominated by GC bias. Lightfield and co-workers [41] also reported that distantly related bacterial genomes with similar GC content have similar patterns of amino acid usage. Analyses of Sargasso Sea shotgun sequencing reads have also shown an overrepresentation of AU-rich residues in such low-GC environments [11]. Taken together, these observations strongly suggest that the amino acid composition of organisms cannot be directly predicted from their cognate environments and is strongly dependent on the GC content of their genomes. Our generalized linear model analysis showed that, with the exception of Cys, Gln and Met, the variation in the cellular RAAA of all other amino acids was clearly explained by the variation in the GC content of the genomes. Given that the genomes of the organisms we are looking at consist mostly of coding sequences, such GC dependency could be a result of the relative abundance of amino acids coded by GC-rich (Ala, Gly, Pro, Arg and Ser) and/or GC-poor (Phe, Ile, Lys, Met, Asn, Tyr, and Leu) codons in the genes. However, the observation that the genomic GC content is the factor that explains the largest amount of variation in cRAAA for 17 out of 20 amino acids indicates that the dependency of cRAAA on genomic %GC content is not strongly affected by the GC content of codons.
The variation in the cRAAA of Cys, Gln and Met was, in contrast, influenced mainly by the organism's phylogeny. Models based on mutation and selection in nearly 600 genomes also suggest that GC content drives codon usage (and implicitly amino acid composition), rather than the reverse [42].
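The grouping of amino acids by the GC content of their codons, discussed above, can be explored directly from the standard genetic code. The Python sketch below computes the mean GC fraction over the synonymous codons of each amino acid, giving every codon equal weight. That equal weighting is an assumption of this illustration; real genomes use synonymous codons unevenly, so genome-specific values will differ.

```python
BASES = "TCAG"
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" marks stop codons).
GENETIC_CODE = {a + b + c: AA_STRING[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def mean_codon_gc():
    """Mean GC fraction of the codons of each amino acid (one-letter keys)."""
    per_amino_acid = {}
    for codon, aa in GENETIC_CODE.items():
        if aa == "*":
            continue  # stop codons encode no amino acid
        gc_fraction = sum(base in "GC" for base in codon) / 3
        per_amino_acid.setdefault(aa, []).append(gc_fraction)
    return {aa: sum(values) / len(values)
            for aa, values in per_amino_acid.items()}


if __name__ == "__main__":
    # Print amino acids from the most GC-rich to the most GC-poor codon sets.
    for aa, gc in sorted(mean_codon_gc().items(), key=lambda item: -item[1]):
        print(f"{aa}\t{gc:.3f}")
```

Sorting the result gives a codon-level ranking that can be compared with the GC-rich and GC-poor groupings named in the text.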
In conclusion, our findings are consistent with environmental amino acid abundances following relationships that are analogous to those of the Redfield ratios for chemical elements. Our results point to the existence of specific amino acid signatures that are particular to each environment, while also indicating that there are global relationships between the relative amino acid abundances in different environments. In contrast, the relative amino acid composition of organisms is not highly determined by the environment, even if the environmental composition is undoubtedly determined by its community of resident microorganisms. This is consistent with the existence of a complex and dynamic relationship between the RAAA of an environment and that of its inhabiting organisms, suggesting that individual organisms have evolved the capacity to mold their amino acid composition selectively, in a manner that is mostly independent of the eRAAA.
Figure S1. Principal Component Analysis of organisms as a function of cRAAA considering A) all predicted protein sequences in the genome [as shown in Figure 2A] and B) only ribosomal proteins and RNA polymerases. (TIF)