Figure 1.

Phylogenomic analyses of protein domains and tRNA structures and functions.

A. Flow diagram showing the reconstruction of trees of protein domain structures. A census of domain structures in the proteomes of hundreds of completely sequenced organisms is used to compose data matrices, which are then used to build phylogenomic trees describing the evolution of individual protein structures. Elements of the matrix (g) represent genomic abundances of domains in proteomes, defined at different levels of classification of domain structure (e.g. SCOP F, FSF, and FF). They are converted into multi-state phylogenetic characters with character states transforming according to linearly ordered and reversible pathways. Trees of proteomes can also be generated from the matrices of phylogenetic characters. They are not used in this paper but are largely congruent with traditional classification. B. Evolution of tRNA structure and function. The ancient ‘top half’ of tRNA embeds an ‘operational code’ in the identity elements of the acceptor arm that interact with the catalytic domain of aaRSs through class I and II modes of tRNA recognition. The evolutionarily recent ‘bottom half’ of tRNA holds the standard code in identity elements of the anticodon loop that interact with anticodon-binding domains of aaRSs. The flow diagram below describes the phylogenetic reconstruction of trees of tRNA substructures (ToSs). The structures of tRNA molecules are first decomposed into substructures. Structural features (e.g., length, Shannon entropic descriptors) of substructures such as helical stem tracts and unpaired regions are coded as phylogenetic characters and assigned character states according to an evolutionary model that polarizes character transformation towards an increase in conformational order (character argumentation). Coded characters (s) are arranged in data matrices, which can be transposed for further cladistic analyses (e.g., to produce trees of substructures). Phylogenetic analysis using maximum parsimony optimality criteria generates rooted phylogenetic trees of tRNA molecules. Embedded in trees of domains and trees of tRNAs are timelines that assign age to molecular structures and associated functions. C. Culling of PDB sequences for calculation of amino acid frequencies and dipeptide counts. Dipeptides define concatenated 2-mer amino acid sequences.
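
The dipeptide counts mentioned in panel C can be illustrated with a minimal Python sketch. It assumes that dipeptides are read as overlapping 2-mers along a sequence and that pairs containing non-standard residues are skipped; neither detail is stated in the caption, and the function name `dipeptide_counts` is hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_counts(sequence: str) -> Counter:
    """Count overlapping 2-mers in a protein sequence.

    Assumption: windows overlap (positions 1-2, 2-3, ...), and any pair
    that contains a non-standard residue (e.g. 'X') is ignored.
    """
    seq = sequence.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    return Counter(
        pair for pair in pairs
        if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS
    )
```

For example, `dipeptide_counts("MKLLA")` counts the four dipeptides MK, KL, LL and LA once each.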


Figure 2.

Evolutionary accretion of domains in aaRS enzymes.

The ages of domains of aaRS enzymes (arrowheads) were mapped along a timeline of domain appearance generated from a phylogenomic analysis of FF structures in 420 free-living organisms (Figure S1). The three epochs of the protein world are shaded, and the evolution of NRPS modules is given as a reference together with other relevant landmarks. Domains were identified with concise classification strings [8]. A molecular clock of domain structures places the relative timeline in a geological time scale [13]. The inset shows examples of evolutionary accretion of aaRS domains: structural models of ProRS from Thermus thermophilus complexed with tRNAPro (PDB entry 1H4S) and LeuRS from Pyrococcus horikoshii complexed with tRNALeu (1WZ2), with catalytic, editing, anticodon-binding and accessory domains colored according to their age of origin.


Figure 3.

Coevolution of aaRS domains and cognate tRNA.

A. The ages of catalytic and editing domain FFs (ndFF) interacting with type II tRNA (Group 1) and type I tRNA (Group 2) and the age of anticodon-binding domain FFs (Group 3; provided only as a reference) were plotted against the age of tRNA isoacceptors (Saac). The significant correlation (P<0.0067) unfolds an evolutionary timeline of early domain function. The time period spanning the most ancient functions (class I TyrRS) and the most recent trans-editing function (Ala-X of AlaRS) involves ∼2 Gy of evolution. B. Coevolution of anticodon-specific tRNAs (Scod) and anticodon-binding domain FFs (ndFF) is significant for the early start of the code (dashed line; F = 20.8; P<0.001), but is followed by episodes of tRNA structural recruitment. An ‘idealized’ timeline partitions code expansions into three age groups (A, B and C).


Figure 4.

Origin and evolution of the genetic code.

A. Inception of the ‘operational’ code. Mapping of amino acid charging functions onto a binary decision-tree and a condensed vis-à-vis complementarity representation of the genetic code. Cells are indexed with Group 1, 2 and 3 domain inception, discriminator base identity, and nucleotide composition (pie charts) of the N2 position of the acceptor stem of tRNA. On the right, structural models of TyrRS (1H3R) interacting with tRNATyr and an acceptor-minihelix illustrate a possible evolutionary route of domain growth and accretion as the binary tree unfolds in evolution (domains are colored according to their geological age). B. Evolution of the ‘standard’ code. Ancestries define a timeline of early genetic code expansion in the condensed vis-à-vis code representation with major and minor groove modes of tRNA recognition. The mappings take into consideration the most parsimonious alphabet and number of anticodon positions, as well as anticodon loop identity elements. Note that Pro, the founder, already uses 2nd and 1st code positions (identity elements G35 and G36) and that the 3rd codon position (G34) is first used with Thr and then His (the last two initial recruitments of c.51.1.1) when the alphabet expands to the triplex code. Also, the Yin-Yang complementarity pattern is fulfilled with the last recruitment of a.27.1.1 once the modern tetraplex code is in place.


Figure 5.

Evolutionary heat maps describing the amino acid and dipeptide compositions of FF domain structures of different age.

A. Frequency of amino acids in FFs. The color array of 29,480 cells (1,475 rows×20 columns) describes the amino acid composition of 1,475 FFs along the evolutionary timeline. Columns represent the 20 standard amino acids ordered (from left to right) according to average amino acid frequency and rows represent FFs ordered (from top to bottom) according to domain age (ndFF = 0 ∼ 1). B. Frequency of dipeptides in FFs. The color array of 589,600 cells (1,475 rows×400 columns) describes the 400-dipeptide composition of FFs along the timeline. Columns represent dipeptide types ordered (from left to right) according to average frequency (from LL to WW) and rows represent FFs ordered according to age. The heat maps confirm the existence of non-random patterns of amino acid and dipeptide compositions along the evolutionary timeline of FFs and reveal unique signatures of amino acid and dipeptide use in FFs. Amino acids are described with single-letter codes.
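
The layout of these heat maps (rows ordered by domain age, columns ordered by average frequency) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the caption does not give the procedure, so the overlapping-window counting, the descending column order, and the names `composition_row` and `heat_map` are all hypothetical, and the input is assumed to be a non-empty list of (age, sequence) pairs.

```python
from itertools import product

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")                      # 20 columns
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 columns


def composition_row(sequence, items, k):
    """Relative frequency of each k-mer in `items` within one sequence."""
    counts = dict.fromkeys(items, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:          # skip k-mers with non-standard residues
            counts[kmer] += 1
            total += 1
    return [counts[item] / total if total else 0.0 for item in items]


def heat_map(domains, items, k):
    """Build the matrix behind a heat map.

    domains: non-empty list of (age, sequence) pairs (hypothetical format).
    Rows are sorted by increasing age; columns by decreasing mean
    frequency (assumed direction of the left-to-right ordering).
    """
    ordered = sorted(domains, key=lambda d: d[0])
    rows = [composition_row(seq, items, k) for _, seq in ordered]
    means = [sum(r[j] for r in rows) / len(rows) for j in range(len(items))]
    order = sorted(range(len(items)), key=lambda j: -means[j])
    labels = [items[j] for j in order]
    matrix = [[r[j] for j in order] for r in rows]
    return labels, matrix
```

Calling `heat_map(domains, AMINO_ACIDS, 1)` gives the 20-column amino acid array and `heat_map(domains, DIPEPTIDES, 2)` gives the 400-column dipeptide array, each with one row per domain.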


Figure 6.

Dipeptide makeup of ancient proteins.

A. The distribution of dipeptide compositions in proteins shows remarkable conservation along the FF timeline. Stacked column charts describe the 400 possible dipeptides (ordered pairs of two amino acids) corresponding to 9 sets specified by Groups 1, 2 and 3 aaRS structures (1-1, 1-2, 2-1, etc). The stacked columns on the right display the general distribution pattern of dipeptides in the dipeptide sets for all 2,384 sequences and the expectation of dipeptide set distributions calculated by free permutation. Circles and asterisks represent groups that are over- or underrepresented, respectively, following χ-square statistical contrasts. B. Ancient FFs appearing before anticodon-binding domains (ndFF ≤0.2) were significantly enriched (P<0.01) in dipeptides composed of amino acids specified by the ancient editing domains (Group 1 and 2). The bar plot shows the amino acid frequencies of the 33 enriched dipeptides, the doughnut chart describes enriched dipeptide set compositions, and the network displays dipeptide makeup, with peptide bonds (edges, weighted by number of dipeptide types) connecting participating amino acids (nodes, with size proportional to connections). C. Mapping of enriched dipeptides in protein structures. Box-and-whisker plots describe the distribution of the 33 dipeptides that are significantly enriched in early FFs (ndFF ≤0.2) versus that of all dipeptides in regular and non-regular structural regions of the 2,384 protein sequences analyzed. Regular structures include helical regions (H) with α-helix (h), 3₁₀-helix (g) and π-helix (i) elements, strand regions (E) with β-strand (e) and β-bridge (b) elements, and turn/bend regions (T) with turns (t) and bends (b). Non-regular (unstructured) regions include loops (Ω). PBT amino acids can span different regions. Statistical differences between PBT were defined by p-values of Mann-Whitney non-parametric tests. Increases and decreases in central tendencies for the ancestral proteins are indicated with + and − signs, respectively, for structural sets with significant associations.
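
The partition of the 400 dipeptides into 9 sets and the χ-square contrast in panel A can be sketched as follows. The caption does not say which amino acids belong to Groups 1, 2 and 3, so the mapping is passed in by the caller and no assignment is assumed here. The expected shares below treat every dipeptide type as equally likely, which is only one possible reading of "free permutation". All function names are hypothetical.

```python
from collections import Counter


def dipeptide_set(dipeptide, group_of):
    """Label a dipeptide by the groups of its two residues, e.g. '1-2'."""
    return f"{group_of[dipeptide[0]]}-{group_of[dipeptide[1]]}"


def observed_set_counts(sequences, group_of):
    """Tally overlapping dipeptides of all sequences into the 9 sets."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair[0] in group_of and pair[1] in group_of:
                counts[dipeptide_set(pair, group_of)] += 1
    return counts


def expected_set_shares(group_of):
    """Share of each set if all dipeptide types were equally likely
    (an assumption, not necessarily the permutation scheme used)."""
    sizes = Counter(group_of.values())
    n = len(group_of)
    return {f"{a}-{b}": sizes[a] * sizes[b] / (n * n)
            for a in sizes for b in sizes}


def chi_square(observed, shares):
    """Pearson chi-square statistic of observed counts against shares."""
    total = sum(observed.get(s, 0) for s in shares)
    return sum((observed.get(s, 0) - total * p) ** 2 / (total * p)
               for s, p in shares.items() if p > 0)
```

Here `group_of` is a dictionary mapping each single-letter amino acid code to its group label (1, 2 or 3). A set whose observed count lies well above or below `total * share` contributes most to the statistic, which is the kind of over- or underrepresentation that the circles and asterisks mark.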


Figure 7.

Mapping enriched dipeptides in the structural model of proteins of ancient and recent origin.

A. The N-terminal domain of Escherichia coli MukB chromosome partitioning protein (1QHL), which harbors the most ancient FF structure (c.37.1.12), is compared to the pore-forming domain of colicin A (1COL), which harbors the more recent colicin FF structure (f.1.1.1). B. The glycine betaine-binding periplasmic protein ProX (1R9L), which harbors the ancient phosphate-binding protein-like FF (c.94.1.1), is compared to the antiviral protein Ski8 (1SQ9), which harbors the very recent WD40-repeat FF (b.69.4.1). Dipeptides significantly enriched (P<0.01) in ancient proteins are labeled in yellow and red, with red regions involving loop segments. Ancient proteins are on average enriched in dipeptides located in regular structures (segments in yellow) and are depleted in dipeptides located in loops (Ω) (segments in red), except for those in boundaries with turns (T-Ω) (with labels in red). The ratio of the number of T-Ω dipeptides to other dipeptides in loops (r) clearly shows the rigidity of the loops of ancient proteins. With exceptions (e.g. the comparison of MukB and colicin A), the total number of enriched dipeptides (n) reveals impoverishment of enriched dipeptides in evolution. The structures shown are representative pairs of typical structures of similar total length belonging to the ancient and derived groups. Except for MukB, they were selected at random. We note that protein class does not affect the dipeptide distribution trends we report.


Figure 8.

Model of origin and evolution of archaic protein biosynthesis.

The flow diagram describes the evolutionary progression of protein biosynthesis and its diversification into ribosome-like processive and NRPS-like assembly-line systems. Translation starts with archaic non-specific synthetases capable of producing dipeptides and small peptides [7]. We assume that these primordial enzymes were originally peptides of less than 60 amino acid residues that emerged from a pool of small peptides (some of them ∼25 residues in length and loop-forming) through non-specific condensation reactions. These emergent molecules quickly gained structural properties and stable molecular functions, all of which were initially driven by enhancements of the persistence of emerging cells [7]. The initial synthetases developed the ability to acylate a wide variety of cofactors (4′-phosphopantetheine, CoA, NADP, and related derivatives, and short polynucleotides) in two-step catalytic reactions involving activated intermediates. Peptides could be further ligated into quasi-statistical proteins by the action of non-specific ligase derivatives of the synthetases. It is likely that prebiotic biases in dipeptide makeup resulting from amino acid chemical synthesis [47], [99] and prebiotic peptide formation [48], [49] acted as initial constraints of the emerging quasi-statistical system. In all cases, the quasi-statistical proteins that were formed achieved only Rossmannoid and bundle folded structures, constrained by primitive membranes, and were founders of the most basal fold structures of our phylogenomic timelines [7]. These included P-loop hydrolases and extended or tandem AAA-ATPase mechanoenzymes, oxidoreductases, chaperones and factors. The initial biases were then enhanced by fortuitous contacts between proteins that were beneficial for the primordial cellular system, including the protection of cofactors from degradation. These contacts stabilized protein biosynthesis complexes that facilitated the initial enzymatic activities, and the resulting ensembles behaved very much as modules as the synthetases diversified and enhanced their catalytic toolkit. In some cases they went on to produce assembly-line complexes similar to modern NRPS systems. In other cases, some modules interacted with polynucleotides and specialized in aminoacylation reactions, leading to modern aaRS functions. Other modules specialized in processive functions leading to the modern ribosome. Polynucleotides gained in some cases folded structures (minihelices, L-shaped conformations) that tuned the makeup of interacting protein structures. These initial chains became ancient genomes and important cofactors, and could have also gained functions as nucleic acid replicases, helicases, and ligases. The model that we propose here is fully compatible with a framework that explains the generation of modules and hierarchical structure in biology [100]. Under this framework, modules emerge through two phases of diversification of parts. In the first phase, parts interact weakly and associate diversely. As they diversify and compete, parts interact and these interactions increasingly constrain their structure and associations, leading to modular structures. In the second phase of diversification, variants of the modules and their functions evolve and become new parts for a new cycle of generation of higher-level modules. In our model, parts are emerging proteins and modules are complexes that gain biosynthetic functions. The model highlights the biphasic patterns of diversification of the underlying framework, which we also see unfolding at the amino acid composition level (Fig. 5) and when studying protein flexibility [63].
