Multiple Inter-Kingdom Horizontal Gene Transfers in the Evolution of the Phosphoenolpyruvate Carboxylase Gene Family

Pepcase is a gene encoding phosphoenolpyruvate carboxylase that exists in bacteria, archaea and plants,playing an important role in plant metabolism and development. Most plants have two or more pepcase genes belonging to two gene sub-families, while only one gene exists in other organisms. Previous research categorized one plant pepcase gene as plant-type pepcase (PTPC) while the other as bacteria-type pepcase (BTPC) because of its similarity with the pepcase gene found in bacteria. Phylogenetic reconstruction showed that PTPC is the ancestral lineage of plant pepcase, and that all bacteria, protistpepcase and BTPC in plants are derived from a lineage of pepcase closely related with PTPC in algae. However, their phylogeny contradicts the species tree and traditional chronology of organism evolution. Because the diversification of bacteria occurred much earlier than the origin of plants, presumably all bacterialpepcase derived from the ancestral PTPC of algal plants after divergingfrom the ancestor of vascular plant PTPC. To solve this contradiction, we reconstructed the phylogeny of pepcase gene family. Our result showed that both PTPC and BTPC are derived from an ancestral lineage of gamma-proteobacteriapepcases, possibly via an ancient inter-kingdom horizontal gene transfer (HGT) from bacteria to the eukaryotic common ancestor of plants, protists and cellular slime mold. Our phylogenetic analysis also found 48other pepcase genes originated from inter-kingdom HGTs. These results imply that inter-kingdom HGTs played important roles in the evolution of the pepcase gene family and furthermore that HGTsare a more frequent evolutionary event than previouslythought.


Introduction
Following wide acceptance of Darwin's theory of evolution, the tree of life became a well accepted representation of the evolutionary relationships among organisms.Recent findings of the horizontal gene transfer (HGT) in the genomes of many species [1,2,3,4,5,6] strongly challenge this certainty.HGT, though, is still thought as rare event and genes that originated from HGT account for a tiny proportion in each genome, while vertical descent of genes remains the major mechanism of evolution.Moreover, all HGT genes are treated as noise when species phylogeny is constructed.Here, for the first time, we found 48 members from well supported inter-kingdom HGT in a single gene family coding phosphoenolpyruvate carboxylase.This case demonstratesthe means by which the evolution of a single gene family can form a complex web via horizontal gene transfer, and likewise suggests that the previously ignored contribution of HGT to the evolution pattern would strongly enhance our understanding of the evolution as a tree of life to more rich and diversified web of life that revealsthe unexpected complexity of evolution.
Phosphoenolpyruvate carboxylase (PEPC) is an important enzyme that catalyzes the carboxylation reaction of phosphoenolpyruvate into oxalacetate, which is then used by the citric cycle.This reaction is also used by C4 and crassulacean acid metabolic pathway and is an important step to store and concentrate carbon dioxide for photosynthesis.In 2003, Sanchez and Cejudo found a PEPC gene in Arabidopsis and rice with close homologs with PEPCs in bacteria [7].Since then, the plant PEPC gene family has been categorized in to plant-type (PTPC) and bacteria-type (BTPC) subfamilies.Despite this organization, the actual evolution of the whole gene family has not been discussed in any detail.Only O'Leary et al.'s [8] recent review included a constructed phylogeny of PEPC gene family including members from Archaea, Bacteria, protists and plants.In this tree, the BTPC were clustered with bacteria PEPCs forming a clade as a sister group of protist PEPC.This phylogeny showed that the ancestor of all bacteria PEPCs, protists PEPCs and BTPCs originated from a duplication event in the lineage of PTPC to algae after its divergence with vascular plant PTPCs.This gene phylogeny has many inconsistencies with the accepted species tree constructed by multiple gene Table 1.Sequences used in the phylogenetic reconstruction.

Taxon
GenBank or Uniprot ID

Riemerella anatipestifer g3 E6JHS7 RIEAN (E6JHS7)
Tetrahymena thermophila Q23YQ3 TETTH (Q23YQ3) analysis and can only be explained by multiple gene transfer from the common ancestor of all BTPC, protists PEPC and bacteria PEPC to the ancestor of protists and bacteria.There is one remaining problem: the diversification of bacteria is a very ancient event,predating the divergence between algae and vascular plants.
In theory, the duplicated copy of the ancestral PTPC which postdates the divergence of vascular and algal plant PTPC can by no means be transferred to the ancestors of the bacteria.Reconciliation between the gene tree and species tree is then almost impossible.This phylogeny must be reconsidered with caution.
We searched the GenBank and UniProt to explore the entire range of existent PEPC genes in all organisms sequenced in the database.We identified possible inter-kingdom HGT candidates in PEPC family, and constructed the gene family phylogeny with genes from representative taxa and those identified inter-kingdom HGT candidates in order to clarify the evolution of this gene family and validate the suspected inter-kingdom HGT events.

Results and Discussion
We searched the GenBank by BLASTP and tBLASTn using PEPCs as a query and found that PEPC is a widely spread gene in archaea, prokaryotes and eukaryotes.In eukaryotes, PEPC exists mostly in plants, protists and slime mold.Only two hits were found in animals:The first was found in the genome of the black-legged tick, Ixodesscapularis.The 164-amino-acid fragment on the Cterminus of a 193-amino-acid protein(gene ID: 8031581) has 100% identity with pepcase from an alpha-proteobacterium, Rhodobacterales bacterium HTCC2255.Because this peptide is very short and possibly non-functional, it may be the relic of a recent unsuccessful horizontal gene transfer.The second was found in the genome of platypus, Ornithorhynchusanatinus.This is a peptide of 374 amino-acid (gene ID: 345310721) coded on a short contig of 1,614 base pair in the genome assembly.This gene has its closest homolog (e value, 3e-98) in a parasite, Babesiabovis.This may be a result of gene transfer from the parasite to the host, but we cannot exclude the possibility of parasitic genome pollution during genomic DNA preparation of the sequencing project.
We confirmed our suspicion of parasite contamination after reviewing the gene family information in Pfam database, in which we found two PEPC gene families, PEPcase (PF00311) and PEPcase_2 (PF14010).PEPcase is distributed in bacteria and eukaryotes including plants, protists and slime mold, while PEPcase_2 is mainly distributed in Archaea.However, there are also members within the two gene families whose taxonomy positions are incongruent with the main distribution, potentially due to an inter-kingdom HGT.From the maximum likelihood phylogenetic tree based on the curated seed alignment of PF00311 (Figure S1), we saw that plant PEPC is clustered with a group of PEPCs from gamma-proteobacteria,forming a sister group to other bacteria PEPCs.This phylogeny supported the idea that plant PEPCs is a lineage derived from ancestral bacteria PEPCs by means of an ancestral inter-kingdom HGT, contrary to the previous understandings that bacteria PEPCs originated from plant PEPCs.However, the plant PEPCs in the seed alignment all belong to the so-called BTPC group and many important eukaryotic taxa that are not plant, such as the protist and cellular slime mold, were not included in the seed alignment.To identify the origin of PTPC and PEPCs in the non-plant eukaryotic taxa, we carried out further phylogeny reconstruction of PEPCs from representative taxa in bacteria, archaea, plant and non-plant eukaryotes.
To explore the possible existence of inter-kingdom HGT in PEPC, we screened the full curation of PF00311 and PF14010 in the Pfam database to find inter-kingdom HGT candidates and included those candidates in the sequences for the following phylogenetic reconstruction.We searched the Pfam ''full'' tree to find the PEPC sequences from different kingdoms with the branches surrounding it.As no PEPC is found in fungi and only two are found in animals, we focused on divisions of the plants, bacteria and archaea.In total, we found 29 sequences from nonarchaea organisms in the full tree of PF14010, 49 sequences from non-plant organisms and 30 sequences from non-bacteria organisms in the plant and bacteria divisions of the PF00311 full tree, respectively.Because the phylogeny of PF00311 contain 2976 sequences and many alignments of short fragments are represented on the tree and many internal branches have low bootstrap support value, we removed dubious candidates from short fragment of peptide (less than 300 amino acids), and used the remaining 21 sequences from non-plant organisms and 19 sequences from non-bacteria organisms to carry out further phylogenetic analysis.
Having collected the inter-kingdom HGT candidates from plant and bacteria, we carried out phylogeny reconstruction in combination with the sequences of the non-plant eukaryotic taxa, BTPC and PTPC from several plants and representative bacteria PEPCs curated in the seed alignment (Table 1).In total, we used 122 PEPCs for gene phylogeny reconstruction.For the interkingdom HGT in archaeaphylogenetic reconstruction, we used the sequences of all 77 members of PEPcase_2 and four bacteria PEPCs as outgroups.We first aligned the sequences and then adopted a program MUMSA to assess the quality in order to find the best alignment by calculating the multiple overlap score (MOS) that indicates the overall inter-consistency with other alignments (see Materials and Methods).The alignment with the highest MOS was selected as the best alignment, and those alignments were then used to carry out phylogeny reconstruction.
We constructed the phylogenetic tree using three methods: maximum likelihood, neighbor joining and maximum parsimony.The protein substitution model used in maximum likelihood was selected by calculating the likelihood score under all 20 available models implemented in RAxML, and then we selected the model with the highest score.To avoid artificial resultscaused by improper construction methods, we combined the three trees to build a consensus tree that only contained branches supported by all the three methods.By inspecting this final consensus tree manually, we confirmed that there are 19 non-bacteria sequences clustered within the bacteria branches, a single non-plant sequence clustered within the plant branches (Figure 1) and 29 non-archaea sequences clustered within the archaea branches (Figure 2).To avoid artificial results due to uncertainty of alignment, we also repeated the phylogenetic analysis with the second best alignments and found no contradictory evidence (data not shown).To further exclude the possibility of artifacts due to alignment, we used GUIDANCE [9] to carry out alignment and bootstrap assessment of the alignment confidence and used only the high confidence columns (with bootstrap scores greater than 0.93) in the alignment to reconstruct the phylogeny.The results also showed no contradictory evidence with our major conclusion (See Figure S2  and S3).In Figure S2, the monophyly of all eukaryotic genes is supported by ML, NJ, MP with bootstrap value of 0.43, 0.64 and 0.17, respectively.However, the relation between eukaryotic groups (plant, protist, slime mold) is not consistent among three methods and most of the nodes are of low confidence.And for the pepcase_2 tree in Figure S3, the topology of NJ tree and MP tree are mostly consistent and those consensus nodes also receive high bootstrap support in NJ tree.The ML tree differs with the other two trees in the branch order of the basal branches.In the ML tree, thegroup of HGTs in Clostridium split first with the other archeae groups, while in NJ and MP trees a group of Crenarchaeota containing Ignicoccushospitalis diverges first with the other archeae groups.And also the NJ tree received the highest bootstrap support of those consensus nodes for pepcase_2.Compared with the computational cost of the ML and MP method, NJ seems to be the most efficient method among them.And we also checked the genomic location of those candidates to exclude the possibility of sequence pollution for those unclustered HGT genes.The result showed that most genes are from long genomic scaffolds except for the HGT genes in poplar and Microbaterium sp. which are from short fragments of 1,312 bp and 2,913 bp (Table S1).However, because the HGT gene in Microbacterium sp.clustered together with genes from seed plants and the possibility of genomic contamination of microbial genome library from multi-cellular organism is very low.We believe that the HGT in Microbatierium sp. is probably not the result of contamination.Further experiment is needed to exclude the possibility of genomic contamination for the HGT candidates in Populustrichocarpa.Collectively, in the evolution of phosphoenolpyruvate carboxylase gene family, we found 48 sequences originated from inter-kingdom HGTs.We also found that there three separate ancient HGT events,one from bacteria to archaea and the other two from archaea to bacteria,that respectively contributed to 15, 10 and 14 genes (Figure 1 and 2).
As for the origin of BTPC and PTPC, our phylogeny supported the idea that each type of PEPCs form a monophyletic group and both originated from ancestral bacteria lineage.That said, there is still uncertainty as to the precise relationship between these two groups and other eukaryotic PEPCs, due to inconsistency between different methods and low bootstrap support.This is consistent with the reality that the deep phylogeny of eukaryotes is still surrounded by controversy.Hopefully, further research on the basal phylogeny of eukaryotes will shed light on some of the controversy and further help explain the evolution of BTPC and PTPC.And our results also provide some information concerning the large scale phylogeny of the three life domains: Eukaryote, Eubacteria and Archeae.The well accepted phylogeny based on small-subunit (SSU) rDNA showed that Eukaryote and Archeae form a sister group with Eubacteria as the outgroup.However, many operational genes in Eukaryote are found to be more similar with homologs in Eubacteria while most eukaryotic informational genes is closer to their homologs in Archeae.And many hypothesis of symbiotic origin of Eukaryote are formed based on this finding.PEPC in Eukaryote is another gene originated via the horizontal gene transfer from bacteria symbiont (probably the ancestor of chloroplast) to the nucleus of the ancestral eukaryotic host [10,11].
On a broader level, HGT was thought to be a relatively rare event in evolution.As more and more genome sequences become available, we continue to find many genes in the genome originated from HGT [12,13,14].To date, however, there are no well-supported cases of multiple HGT events occurring in one gene family.One potential reason is that HGT was thought of as rare event, unlikely to hit a single gene family more than once.Consequently, little systematic research looking for HGT events in one gene family has been done.Our research provides the first case of multiple inter-kingdom HGTs in a single gene family and furthermore suggests that HGTsare much more frequent and important than previously expected.There is also research showing that HGT is more frequent between closely related organisms [15].Here we opted to only look into the inter-kingdom HGT because HGTs between different kingdomsaremore readily identified when the intra-kingdom phylogeny of many species based on well recognized orthologs is not available.However, the frequency of all HGTs should be much higher than that of interkingdom HGT which we found in this study.
Successful HGTs involve two processes: the physical transfer of the genetic material into the recipient genome of another species, and the fixation of the gene in the population of the species by selection forces.Our findings are consistent with the fact that HGTs were found to be biased toward operational genes as opposed to informational gene because the operational gene can function and bring out fitness advantages with less interaction with other genes [11,16].PEPC is an operational gene that can function in many metabolic and developmental pathways but does not need many partner genes.We can only speculate that this may be the reason there are so many HGT events surrounding the evolution of this gene.

Materials and Methods
We downloaded the protein sequences, alignment and phylogenetic trees of PEPcase (PF00311) and PEPcase_2 (PF14010) from the Pfam database [17].Phylogenetic tree viewing and editing was done in the tree editor Archaeopteryx (0.960 beta A48) [18].We cut the kingdom specific sub-trees for both bacteria and plant from Pfam full tree of PF00311.For archaea, we use the full Pfam tree of PF14010.Base on those kingdom specific tree, we use home-made scripts to find out the inter-kingdom HGT candidate, which is wrapped in the branches belong to a different kingdom in the Pfam tree.First, the taxonomy codes of all leaves were extracted from the sub-trees of bacteria, plant and archaea and searched in the UniProt taxonomy database [19].We then inspected the taxonomy search results to find the taxa whose lineages do not contain the bacteria, plant or archaea.Finally, we extracted the full protein sequences and aligned fragments of those taxa from Pfam database; aligned fragments shorter than 300 amino acids were excluded from candidate list.
To validate the phylogenetic relationship between those HGT candidates and other members of PEPcase gene family and get a panorama of the gene family evolution in plant and bacteria, we collected the HGT candidates' full sequences and PEPcase sequences from representative taxa, totally 122 protein sequences Figure 1.Phylogeny of bacteria and eukaryotic PEPcase and inter-kingdom HGT candidates.Phylogeny of inter-kingdom HGT candidates and PEPcase sequences from representative taxa in bacteria and eukaryotes were reconstructed.Nine archaea sequences were included as outgroups.HGT candidates confirmed in this phylogeny are in bold letters with red branches.The branches of outgrouparcheae are in grey and all eukaryotic branches are in blue.The bootstrap values of 100 replicates in the three different methods were labeled on each branch in order of maximum likelihood, neighbor joining and maximum parsimony.Ancient HGT events are marked with a triangle on the branch.doi:10.1371/journal.pone.0051159.g001to reconstruct the phylogeny of the gene family.For archaea, we used the full sequences of all PF14010 members.We applied four programs (T-Coffee, MAFFT, MUSCLE and ClustalW) to align the sequences and then assessed the quality of the alignments with Mumsa (online server at http://msa.sbc.su.se/cgi-bin/msa.cgi)[20,21,22,23,24].All alignment programs were run using the default parameters, except T-Coffee where we used the ''expresso'' option.
The sequences in all alignments were sorted into the same order with MEGA5 [25] and then submitted to the Mumsa server to get the quality scores.Mumsa program calculates the MOS score of each alignment (See [24] for the detail of the algorithim).Briefly, the aligned residues shared by many alignments are more reliable, and the alignment with the largest number of such residues is supposed to be the closest to the true alignment [24].We then selected the alignment with best quality to carry out phylogeny reconstruction with maximum likelihood, neighbor-joining and maximum parsimony methods.For maximum likelihood tree, we first use RAxML and a wrapperPERL script proteinmodelselection.pl to find the substitution model with highest likelihood score for the protein alignment, and then we used this substitution model with GAMMA model of rate heterogeneity and carried out rapid bootstrap test of 100 replicates [26].The neighbor-joining tree was inferred using MEGA5 with distances calculated with Possion correction and bootstrap test of 100 replicates.The maximum parsimony tree was also inferred using MEGA5 with the Close-Neighbor-Interchange algorithm and bootstrap test of 100 replicates.We combined the consensus trees of three methods using TreeGraph2 and deleted the different methods' contradictory nodes [27].Finally, inter-kingdom HGT genes were identified by manual inspection of the combined phylogenetic tree.
To further test our conclusion against alignment artifacts, we used the GUIDANCE webserver [9] to carry out alignment and assessment of the alignment accuracy.The analysis was carried out with default parameters, using MAFFT as the aligner and GUIDANCE as the algorithms for evaluating confidence scores, which measures the robustness of the alignment to guide-tree uncertainty.Then the high confidence columns of the alignments were extracted from the result with threshold of score 0.93.Then the filtered alignments were further used to reconstruct the phylogeny with three different methods (same as the above).Table S1 Genomic information on singular HGT candidates.(DOCX)

Supporting Information
. Phylogeny of archaea PEPcase and inter-kingdom HGT candidates.Phylogeny of PEPcase sequences from PF14010 were reconstructed.Four bacteria sequences were included as outgroups and their branches are in grey.HGT candidates confirmed in this phylogeny are in bold letters with red branches.The bootstrap values of 100 replicates are labeled in the same manner as Figure 1.Ancient HGT events were marked with triangles.Euryarchaeota branches were drawn in yellow while Crenarchaeota branches were in green.

Figure
Figure S1 Maximum likelihood tree of PF00311 seed alignment.Phylogenetic tree of PF00311 seed alignment were