Molecular Evolution of the Oxygen-Binding Hemerythrin Domain

Background The evolution of oxygenic photosynthesis during Precambrian times entailed the diversification of strategies minimizing reactive oxygen species-associated damage. Four families of oxygen-carrier proteins (hemoglobin, hemerythrin and the two non-homologous families of arthropodan and molluscan hemocyanins) are known to have evolved independently the capacity to bind oxygen reversibly, providing cells with strategies to cope with the evolutionary pressure of oxygen accumulation. Oxygen-binding hemerythrin was first studied in marine invertebrates but further research has made it clear that it is present in the three domains of life, strongly suggesting that its origin predated the emergence of eukaryotes. Results Oxygen-binding hemerythrins are a monophyletic sub-group of the hemerythrin/HHE (histidine, histidine, glutamic acid) cation-binding domain. Oxygen-binding hemerythrin homologs were unambiguously identified in 367/2236 bacterial, 21/150 archaeal and 4/135 eukaryotic genomes. Overall, oxygen-binding hemerythrin homologues were found in the same proportion as single-domain and as long protein sequences. The associated functions of protein domains in long hemerythrin sequences can be classified in three major groups: signal transduction, phosphorelay response regulation, and protein binding. This suggests that in many organisms the reversible oxygen-binding capacity was incorporated in signaling pathways. A maximum-likelihood tree of oxygen-binding hemerythrin homologues revealed a complex evolutionary history in which lateral gene transfer, duplications and gene losses appear to have played an important role. Conclusions Hemerythrin is an ancient protein domain with a complex evolutionary history. The distinctive iron-binding coordination site of oxygen-binding hemerythrins evolved first in prokaryotes, very likely prior to the divergence of Firmicutes and Proteobacteria, and spread into many bacterial, archaeal and eukaryotic species. The later evolution of the oxygen-binding hemerythrin domain in both prokaryotes and eukaryotes led to a wide variety of functions, ranging from protection against oxidative damage in anaerobic and microaerophilic organisms, to oxygen supplying to particular enzymes and pathways in aerobic and facultative species.


Introduction
Before the evolution of oxygenic photosynthesizers, sources of free oxygen were scarce [1]. Free molecular oxygen constitutes 21% of present-day terrestrial atmosphere and its main source is, essentially, oxygenic photosynthesis. Accumulation of free atmospheric oxygen during the Precambrian [1][2][3] is, undoubtedly, one of the major changes in the history of the planet and may be considered the most significant biogeochemical process after the origin of life itself. Oxygen-dependent metabolism evolved first in bacteria and is pervasive in contemporary eukaryotes.
The evolution of metazoans was constrained by the oxygen requirements of tissues [4][5][6]. Therefore, oxygen-carrier proteins that maintain a continuous delivery of oxygen while avoiding autoxidation as well as the formation and accumulation of reactive oxygen species [4] became essential for the development of animals. Four evolutionarily unrelated families of oxygen-carrier proteins are known: hemoglobin, hemerythrin and the two non-homologous families of molluscan and arthropodan hemocyanins. Such diverse assortment can only be understood in terms of the selective pressure imposed on the biosphere by free oxygen.
Hemerythrin is a small 118 amino acid protein classically studied in four metazoan phyla: Sipuncula, Brachiopoda, Priapulida and Annelida [7,8]; it has also been identified in other eukaryotes as well as in bacterial and archaeal genomes [9][10][11][12]. However, the molecular evolution of hemerythrin and its relationship to the geochemical history of the Earth has been largely ignored. In this work, we studied the phylogenetic distribution of hemerythrin-like sequences in 2521 completely sequenced bacterial, archaeal and eukaryotic genomes, and correlated the possible evolutionary scenarios with what is currently known about the structure and function of the hemerythrin domain in both prokaryotes and eukaryotes.

Sequence similarity search
Oxygen-binding hemerythrin (O 2 -binding Hr) homologs were identified based on the statistical significance of aligning O 2 -binding Hr sequences from two annelid species, Phascolopsis gouldii [13] and Themiste hennahi [14], with non-mutated protein sequences annotated as hemerythrin or hemerythrin-like proteins in the PDB. The sequences of two O 2 -binding Hrs, from the proteobacteria Methylococcus capsulatus [15] and Desulfovibrio vulgaris [16], were statistically similar to the annelid sequences (Table 1), and were used as queries in subsequent sequence similarity searches. In contrast, the N-terminal domain of the human F-box and leucine-rich repeat protein 5 (FBXL5), which possess an iron-coordination site that does not bind oxygen [17][18][19], was not significantly similar to any of the annelid hemerythrin sequences used here as reference (Table 1). We constructed a profile Hidden Markov Model (S1 Fig) with 148 nodes, exclusively based on O 2 -binding Hr homologues, which constitute a divergent monophyletic sub-group of the hemerythrin/HHE (histidine, histidine, glutamic acid) cation-binding domain (Fig 1). long sequences and the presence of additional protein domains was investigated using the Pfam-A database (Figs 3 and 4). O 2 -binding Hr homologues were found in the same proportion as single-domain and as long protein sequences in archaeal, bacterial and eukaryotic genomes overall. A total of 56 different domain architectures were identified in long O 2 -binding Hr sequences (Fig 4 and S2 Fig). The largest group of long O 2 -binding Hr sequence architectures (56.9%) contained N-terminal or C-terminal regions of variable size with no clear homologs in Pfam-A. The most frequent location of the O 2 -binding Hr domain in long protein sequences is the protein termini (117 proteins at the C-terminus, 52 proteins at the N-terminus), suggesting a later incorporation by gene fusion events. The most frequent architectures were: 1) N-terminal methyl accepting chemotaxis protein domain (MCP signal) with a C-terminal O 2 -binding Hr domain (33 protein sequences); 2) N-terminal O 2 -binding Hr domain with a C-terminal diguanylate cyclase (GGDEF) domain (22 protein sequences); and 3) N-terminal histidine kinases, adenyl cyclases, methyl-accepting proteins and phosphatases (HAMP) domain with a MCP signal and a C-terminal O 2 -binding Hr domain (14 protein sequences). The Gene Ontology terms associated to the Pfam-A domains identified corresponded to: 1) signal transduction (43%); 2) phosphorelay response regulation (7%); and 3) protein binding (6%).

Phylogenetic distribution of oxygen-binding hemerythrin sequences
Of the 2521 cellular genomes that were analyzed in this work, a total of 392 (367 bacterial, 21 archaeal and 4 eukaryotic genomes) encoded for O 2 -binding Hr sequences (Fig 2). The number of O 2 -binding Hr copies in a genome varies widely, in particular, species of Spirochaetes, Chlorobi, Acidobacteria, Firmicutes-Clostridia, and α-, βand δ-Proteobacteria present several copies of single-domain O 2 -binding Hr sequences (Fig 3). The greatest number of singledomain O 2 -binding Hr copies was found in Magnetospirillum magneticum AMB-1, a facultative α-Proteobacteria encoding 15 paralogous sequences (Fig 3).
The sample studied here included archaeal genomes only from the Crenarchaeota and Euryarchaeota phyla due to an under-representation of Archaea in genome databases and the low agreement between the databases consulted (See the Methods section). Within Crenarchaeota, O 2 -binding Hr was found only in five species of the order Desulfurococcales within long protein sequences. In Euryarchaeota, five species of the Methanomicrobia class contained single-  Fig 2, the sequences of single-domain HHE cation-binding hemerythrin and single-domain O 2 -binding Hr, which we have analyzed in this work, are the outcome of an ancient gene duplication predating the explosive divergence of Bacteria during Precambrian times. The maximum-likelihood tree in Fig 2 also shows that the single-domain O 2 -binding Hr is a well-defined, monophyletic, highly divergent group distinct from the hemerytrin/HHE cation-binding domain.
The tree in Fig 5 shows two well-defined single-domain O2-binding Hr derived groups, a and b, formed by sequences encoded mostly by anaerobic species. The derived clade a includes sequences from Deinococcus-Thermus; Cyanobacteria; Firmicutes; Spirochaetes; and δ-Proteobacteria. The derived clade b is formed by sequences from Acidobacteria; Chlorobi; α-, β-, γand δ-Proteobacteria, as well as from the archaeal Methanospirillum hungatei JF-1, where it is present most likely because of a lateral gene event. Sequences in both derived clusters diverged from an early gene duplication of the single-domain O 2 -binding Hr gene, and also appear to have been subject of lateral gene transfer, gene loss, gene duplication and orthologous replacement events.
There are several cases in which more than one O 2 -binding Hr sequence can be identified in a single cellular genome. This may be due to lateral gene transfer events or, most likely, to paralogous duplication events ( Table 2). The many cases in which highly similar single-domain O 2 -binding Hr copies are closely grouped in the same cluster in the phylogenetic tree suggest recent paralogous duplication events. For instance, Magnetospirillum magneticum AMB-1, a facultative proteobacteria, encodes for 15 single-domain O 2 -binding Hr gene copies, some of which are located in the same cluster, while others are found in distant clusters: amb4136 is represented by a single branch; amb4265, amb2654, amb4171, amb2239 and amb1415 are in cluster ε; amb1987, amb0226, amb1549 and amb2249, in cluster ι; amb3418, amb4296 and amb3966 in cluster η; amb0569 in cluster z; and amb1952 in cluster b1, indicating duplication events that occurred at different times during evolution of M. magneticum AMB-1 ( Fig 5). Alternatively, the peculiarities of the distribution of one or more O 2 -binding Hr gene copies can be explained by lateral gene transfer events.
The overall topology of the single-domain O 2 -binding Hr phylogenetic tree is in clear disagreement with the 16/18S rRNA reference tree ( Fig 5). This indicates that the significance of O 2 -binding Hr as a phylogenetic marker is hindered by a complex history of gene losses, gene duplications, paralogous replacements and lateral gene transfer events. The base of the tree is characterized by a highly branching pattern of single-domain O 2 -binding Hr sequences encoded by aerobic organisms including Bacteria, Archaea and Eukarya ( Fig 5 and Table 3). The ample distribution of highly divergent O 2 -binding Hr sequences among aerobic organisms is probably best understood by the adaptive value of oxygen-binding proteins in a Precambrian environment that was becoming increasingly oxidizing as time went by.
Agreement with the topology of the species tree was observed mainly at the sub-group level of the O 2 -binding Hr sequence tree. For instance, as shown in Fig 5 and Table 3, sequences from Aquificae form a minor sub-group (sub-group α), suggesting that a single-domain O 2binding hemerythrin was present in their common ancestor. A comparable situation may be seen in the θ sub-group, which can be interpreted as the outcome of vertically inherited single-

Discussion
The HHE cation-binding domain was first predicted by bioinformatics methods as a domain composed of two helical regions and a conserved HHE cation-binding site [22]. The hemerythrin-like domain family is a repetition of the HHE cation-binding domain, which folds into an up-and-down bundle of four left-handed helices [14,23]. The molecular function of proteins containing the hemerythrin/HHE cation-binding domain, including metazoan oxygen-carrier hemerythrins, is often related to O 2 or reactive oxygen species responses [10,16,[24][25][26][27][28][29].
The search for O 2 -binding Hr domain homologs was performed using a manually curated profile Hidden Markov Model, and resulted in the identification of sequences of 86 to 2425 amino acids length, using a profile that contained 148 nodes (S1 Fig), highly weighting the position of the iron-coordinating amino acids in X-ray solved structures of O 2 -binding Hrs. Sequences with subject coverage lower than 85% were considered long sequences. To identify the presence of possible additional domains, we searched the protein profile database Pfam-A, which confirmed known additional domains in 37.2% of long sequences. In agreement with previous bioinformatics searches [9,11], the most frequent domain found in long O 2 -binding Hr sequences was the MCP signal domain (Fig 4). The characterization of the hemerythrin domain of the methyl-accepting chemotaxis protein dcrH from Desulfovibrio vulgaris has shown that the four-helix bundle fold and the amino acids of the active site are conserved [24]. It has been proposed that the hemerythrin domain in this structure could have a role in signal transduction [30]. O 2 -binding Hr sequences were identified in two archaeal groups (Euryarchaeota and Crenarchaeota), ten bacterial groups (Aquificae, Deinococcus-Thermus, Cyanobacteria, Chloroflexi, Firmicutes, Acidobacteria, Chlorobi, Spirochaetes and Proteobacteria), and four eukaryotic species (Naegleria gruberi, Micromonas pusilla CCMP1545, Nematostella vectensis and Acanthamoeba castellanii). It is also known to be present in marine invertebrates (Sipuncula, Brachiopoda, Priapulida and Annelida) [8]. By far, the highest number of O 2 -binding Hr sequences is found in Proteobacteria (Figs 2 and 3).
Single-domain and long O 2 -binding Hr-containing sequences are not randomly distributed in the phylogenetic tree, but exhibit a differential distribution among taxonomic groups, particularly in Bacteria, where two separate groups can be distinguished. In the first bacterial group, that includes deep-branching bacteria and the Firmicutes-Clostridia clade, O 2 -binding Hr Internal nodes with approximate likelihood-ratio test lower than 0.6 were collapsed. Names of the sequences appear at the tips of the branches. Sub-clusters are named by Greek letters on the base of the tree and by letters and numbers in the direved clusters. Node names appear in color when there is at least another copy of single-domain O 2 -binding Hr in a separate cluster of the tree. Oxygen requirement of the species source, according to the Genomes OnLine Database [21], is indicated by color bars at the tips of the nodes.

Evolution of Hemerythrin
Geobacter bemidjiensis Bem 8 Anaeromyxobacter dehalogenans 2CP-C 8 Anaeromyxobacter dehalogenans 2CP-1 7 Anaeromyxobacter sp. Fw109-5 7 Anaeromyxobacter sp. K 7 Geobacter sp. M18 7 Desulfovibrio salexigens DSM 2638 6 Geobacter sulfurreducens KN400 5 Geobacter sulfurreducens PCA 5  homologues are predominantly single-domain sequences (91.6%). The expression of the singledomain HerA hemerythrin in the microaerophilic bacteria Campylobacter jejuni [26] reduces its susceptibility to oxygen and hydrogen peroxide-mediated damage of two iron-sulphur cluster enzymes (pyruvate:acceptor oxydoreductase and 2-oxoglutarate:acceptor oxidoreductase). This suggest that single-domain O 2 -binding Hr may also be involved in the prevention of oxygenmediated damage in microaerophilic aquificales and in anaerobic clostridia, and probably reflect an evolutionary adaptation to the presence of free molecular oxygen. The absence of O 2 -binding Hr homologues in Mollicutes and the class Bacilli of Firmicutes, two bacterial groups closely related to clostridial species, is probably due to secondary loss. The second bacterial group includes Acidobacteria, Chlorobi, Spirochaetes and Proteobacteria. Within this group, long O 2 -binding Hr sequences (68.1%) are more frequent than singledomain (31.9%) O 2 -binding Hr. In Proteobacteria, which is the most diverse bacterial phylum, the study of the O 2 -binding Hr domain led to the identification of three different functions: a) as a 135-amino acid domain in a chemotactic protein of Desulfovubrio vulgaris [24]; b) as a single-domain protein McHr in Methylococcus capsulatus, that supplies oxygen to the membranebound methane monooxygenase [10]; and c) as a single-domain protein HerA in Campylobacter jejuni, which protects iron-sulphur cluster enzymes from oxidative damage [26]. This exemplifies how the O 2 -binding Hr domain may have been coopted into specific physiological functions by different species, particularly in later-branching bacteria, where it was incorporated into a wide variety of sequences. In some cases, the additional sequences within long O 2binding Hr sequences are group-specific. For instance, long O 2 -binding Hr sequences from Acidobacteria and Chlorobi exhibit elongations that do not match domains in the Pfam-A database, and two architectures of the long O 2 -binding Hr sequences from Spirochaetes genomes contain a domain called Cache_1, which is an acronym for calcium channels and chemotaxis receptor.
Variations at the iron-coordinating amino acid residues in 106 single-domain O 2 -binding Hr sequences were observed (S4 Fig). This is quite evident in sequences from the Deinococcus-Thermus, Cyanobacteria, Firmicutes-Clostridia and from the α-, βand δ-Proteobacteria clades. Amino acid substitutions may modify the reversible oxygen-binding ability of hemerythrins [31], or even the native metal-ion preference. Molecular characterization of the variants could clarify whether sequence variations at the iron-coordination site produce loss of function, neo-functionalization, or alternative iron-binding mechanisms.   Hemerythrin homologues have an ample biological distribution, and are present in the three domains of life. However, as shown here, their distribution is not universal, which is consistent with previous genomic searches, and may be explained by frequent gene losses [11,12]. Although the topology of the O 2 -binding Hr tree indicates intense horizontal gene transfers, gene duplications and differential gene loss (Fig 5), it is clear that O 2 -binding Hr is widely distributed in Clostridia and all five classes of Proteobacteria, suggesting that it is an ancient Precambrian trait that was already present before the divergence of Firmicutes and Proteobacteria. Alternatively, the evolution of O2-binding Hr may have occurred in a divergent phylogenetic group, and was horizontally transferred, through independent events, to the individual orders

Conclusion
The four known evolutionary unrelated oxygen-carrier protein families (hemoglobin, hemerytrhin and the two non-homologous molluscan and arthropodan hemocyanins) represent a polyphyletic response to the selective pressure imposed by the accumulation of free oxygen during the Precambrian. Like other molecular mechanisms involved in the protection and repair of oxygen-induced damage that evolved during Precambrian times [32], the ample biological distribution of O 2 -binding Hr probably reflects its adaptive significance in an environment became increasingly oxidizing. The biological group where hemerythrin first evolved remains unknown, but the distribution of O 2 -binding Hr sequences suggest that its capacity to reversibly bind oxygen was exploited first by prokaryotic species to fulfill a range of different needs, ranging from protection against oxidative damage to oxygen supply to particular enzymes and pathways. More specifically, the available data suggests that it may have originated prior to the divergence of Firmicutes and Proteobacteria. Thereafter, the incorporation of the O 2 -binding Hr domain into preexisting proteins, combined with other mechanisms involved in the protection against oxidative damage, may have allowed a functional diversification of the protein repertoire particularly in the proteobacteria, one of the most diverse groups of bacteria. The evolution of O 2 -binding Hr sequences appears to parallel the evolution of strategies allowing the incorporation of oxygen to biological processes as a consequence of accumulation of free oxygen in the Precambrian oceans and atmosphere.

Sequence similarity searches
Eleven non-mutated hemerythrin and hemerythrin-like amino acid sequences retrieved from the Protein Data Bank (PDB) database [33] were compared to the hemerythrin sequence from Themiste hennahi (PDB codes 2MHR) and from Phascolopsis gouldii (PDB code 1I4Y) with the program PRSS version 36.3.8c (number of shuffles: 200; scoring matrix: Blosum50; open gap penalty: -10; extension gap penalty: -2) from UVa FASTA Server [34,35]. Expect value lower than 1e-3 of one-to-one alignments was used as cutoff for homologous sequences. An initial BLAST search against the Kegg Genes database release 75.1 [20,36] identified 365 protein sequences with expect value < 1e-5 and subject coverage > 95%. The sequences were aligned with MAFFT G-INS-I algorithm [37] with default parameters (gap opening penalty: 1.53; offset: 0.0). A profile Hidden Markov Model (pHMM) was constructed with HMMER 3.1b1 [38]. The Pfam hemerythrin profile (http://pfam.xfam.org/family/ hemerythrin) [39] has a low restriction to include sequences with changes at the iron coordination amino acids observed in oxygen-binding hemerythrins (S1 Fig). The distribution of the hemerythrin domain homologues found by the Pfam profile is shown in S3 Fig. The hand curated hemerythrin pHMM was scanned against 2521 genomes from the Kegg Genes database release 75.1 that were unequivocally traced to an entry of the SSU RefNR99 guide tree from the SILVA SSU database release 119 [40] based on taxonomy identifiers from both databases.

Domain annotation and associated biological processes
Each sequence was scanned with hmmscan from the HMMER 3.1b1 software using the database of protein families Pfam-A version 28.0 [39]. The hemerythrin profile in Pfam-A was substituted with the pHMM that we generated. Domain overlapping was solved with a pearl script preferring domains with lower expect values. Gene Ontology information on molecular function of Pfam domains was obtained from the Pfam web page [41].

Phylogenetic analysis
The maximum-likelihood (ML) tree of single-domain oxygen-binding hemerythrin sequences and of hemerythrin HHE cation-binding domain sequences were constructed with PHYML [42]. The parameters for the construction of the trees (Table 4) were automatically selected with SMS: Smart Model Selection using the Akaike Information Criterion at the South of France bioinformatics platform (http://www.atgc-montpellier.fr). Branch support was given by the approximate likelihood-ratio test (aLRT) for branches [43]. Internal nodes with low support (< 0.6) were collapsed. Additional bootstrap support is provided by a 100-replicates as shown in S5 Fig. The multiple sequence alignments were constructed with MAFFT, L-INS-i algorithm using the default parameters (gap opening penalty: 1.53; offset: 0.0). The maximumlikelihood tree of hemerythrin HHE cation-binding domain sequences was midpoint-rooted; the maximum-likelihood of single-domain oxygen-binding hemerythrin sequences was rooted following the topology of the hemerythrin HHE cation-binding domain tree. Tree visualization was made with Interactive Tree Of Life V1.0 [44].