Rickettsia Phylogenomics: Unwinding the Intricacies of Obligate Intracellular Life

Background Completed genome sequences are rapidly increasing for Rickettsia, obligate intracellular α-proteobacteria responsible for various human diseases, including epidemic typhus and Rocky Mountain spotted fever. In light of phylogeny, the establishment of orthologous groups (OGs) of open reading frames (ORFs) will distinguish the core rickettsial genes and other group specific genes (class 1 OGs or C1OGs) from those distributed indiscriminately throughout the rickettsial tree (class 2 OG or C2OGs). Methodology/Principal Findings We present 1823 representative (no gene duplications) and 259 non-representative (at least one gene duplication) rickettsial OGs. While the highly reductive (∼1.2 MB) Rickettsia genomes range in predicted ORFs from 872 to 1512, a core of 752 OGs was identified, depicting the essential Rickettsia genes. Unsurprisingly, this core lacks many metabolic genes, reflecting the dependence on host resources for growth and survival. Additionally, we bolster our recent reclassification of Rickettsia by identifying OGs that define the AG (ancestral group), TG (typhus group), TRG (transitional group), and SFG (spotted fever group) rickettsiae. OGs for insect-associated species, tick-associated species and species that harbor plasmids were also predicted. Through superimposition of all OGs over robust phylogeny estimation, we discern between C1OGs and C2OGs, the latter depicting genes either decaying from the conserved C1OGs or acquired laterally. Finally, scrutiny of non-representative OGs revealed high levels of split genes versus gene duplications, with both phenomena confounding gene orthology assignment. Interestingly, non-representative OGs, as well as OGs comprised of several gene families typically involved in microbial pathogenicity and/or the acquisition of virulence factors, fall predominantly within C2OG distributions. Conclusion/Significance Collectively, we determined the relative conservation and distribution of 14354 predicted ORFs from 10 rickettsial genomes across robust phylogeny estimation. The data, available at PATRIC (PathoSystems Resource Integration Center), provide novel information for unwinding the intricacies associated with Rickettsia pathogenesis, expanding the range of potential diagnostic, vaccine and therapeutic targets.

Due to both small genome size and medical importance, ten genome sequences from Rickettsia spp.have been published and annotated in the last decade [9,[21][22][23][24][25][26][27], providing a foundation to study the evolutionary history of these lineages through comparative genomics.Recently, Gillespie et al. [28] proposed a revision to the long-standing classification of Rickettsia by erecting the transitional group (TRG) as a distinct lineage that shares immediate ancestry with the members of the spotted fever group (SFG) rickettsiae.Coupled with the typhus group (TG) and ancestral group (AG) rickettsiae, these four rickettsial lineages comprising 10 sequenced genomes present an opportunity to create a database that encompasses the distribution of the predicted open reading frames (ORFs) across all ten annotated genomes (Figure 1).
Establishing orthology across multiple genomes serves not only to identify genes with shared evolutionarily histories, but also facilitates genome annotation [29,30], and significant attention has focused on algorithms for creating orthologous groups (OGs).Recent work has centered on the following four aspects: i) overall improvement of OG assignment in the face of paralogy, ii) building tools for the cross-querying of taxon-specific databases, iii) creating databases that house specific gene or protein profiles for facilitating the identification of orthologs in novel sequences, and iv) the inclusion of phylogeny estimation into the processes of assigning orthology and detecting paralogy.
At the PathoSystems Resource Integration Center (PATRIC) [31], OGs have been preliminarily established for several groups of organisms, including Rickettsia spp.The advantage of a Rickettsiaspecific database lies not only in the ability to query exclusively against the 10 genomes currently annotated in our system, but also to evaluate the results of several algorithmic approaches that create OGs.Furthermore, PATRIC offers continued updates to the annotation of rickettsial genes and proteins, and provides multiple sequence alignments as well as phylogenetic trees, when applicable, for each OG consisting of two to ten rickettsial taxa.The database will continually evolve with the addition of newly sequenced rickettsial genomes, with existing OG assignments driving the curation process of raw genome data.
In the present study, we report the rickettsial OGs (RiOGs) in conjunction with a highly robust phylogeny of the core rickettsial genes, providing an evolutionary framework for interpreting the genomic characteristics of the four main lineages of Rickettsia.These data highlight the genetic anomalies previously characterized for this genus, such as extremely reduced genomes and the high presence of putative pseudogenes, and also reveal novel characteristics including the lack of group-specific virulence factors and high occurrence of lateral transfer between groups that harbor plasmids (AG and TRG rickettsiae).Information on the conserved core genes, as well as those that may be involved in specific functions that define monophyletic groups, host associations, and plasmid-related behavior, will be valuable resources for future laboratory work (e.g., development of vaccines, diagnostics and therapeutics) as well as further evolutionary studies of this intriguing obligate intracellular bacterial group.

Synteny and Phylogeny of Rickettsia Genomes
Whole genome alignments for the ten analyzed Rickettsia taxa reveal highly conserved colinearity in six of the seven derived species (sans R. bellii and R. canadensis) with minimal gene rearrangements, most of which occur near the predicted origin of replication termination (Figure 2).However, the R. felis genome contains several long-range symmetrical inversions in the central region of the alignment that are not found in other taxa.Removal of R. felis from the alignment illustrates the highly conserved synteny across the derived rickettsia taxa (Figure S1-A).Furthermore, switching the positions of R. akari and R. felis in the alignment (Figure S1-B) demonstrates that these central inversions in R. felis, as well as a large genome size, are autapomorphic (uniquely derived) traits within derived rickettsiae.Among the three AG rickettsiae, R. canadensis (formerly R. canada) is more colinear with the derived taxa than it is to either R. bellii strain.Like R. felis, R. canadensis contains several autapomorphic symmetrical inversions in the central region of the alignment, yet they are smaller than the long-range inversions found in R. felis.As previously reported [32], R. bellii str.RML369-C shares little colinearity with other rickettsial genomes, and our analysis of both R. bellii genomes is in agreement with this observation.Despite several long and short range inversions between the R. bellii str.RML369-C and R. bellii str.OSU 85-389 genomes, few gene positions are shared with R. bellii and R. canadensis or the derived taxa (Figure 2), and switching the positions of the R. bellii strains in the alignment does not result in more conserved synteny between either strain and the derived taxa (Figure S1-C, D).
Phylogenetic analyses implementing both maximum likelihood and parsimony of the 731 representative core rickettsial proteins (discussed below) resulted in robust estimates for these 10 taxa (Figure 3).The estimated tree topologies are identical in branching pattern and are congruent with the tree from our previous analysis of 716 fewer genes [28], suggesting that ten or more concatenated (and well-behaved, with high signal to noise ratio) genes are sufficient for obtaining a robust phylogenetic estimate for these rickettsial taxa.Thus, our recent classification scheme for Rickettsia consisting of 4 major groups (AG rickettsiae: R. bellii str.RML369-C, R. bellii str.OSU 85 389, R. canadensis str. McKiel; TG rickettsiae: R. prowazekii str.Madrid E, R. typhi str.Wilmington; TRG rickettsiae: R. akari str.Hartford, R. felis str.URRWXCal2; SFG rickettsiae: R. rickettsii str.Sheila Smith CWPP, R. conorii str.Malish 7, R. sibirica str.246) is substantiated with a phylogenomic approach.In what follows, we use this evolutionary framework to analyze the distribution and relative conservation of all predicted genes for these ten rickettsial genomes.3, with taxon abbreviations explained in the Figure 1 legend.Alignment created using Mauve [189] after reindexing the R. sibirica genome (see text for details).doi:10.1371/journal.pone.0002018.g002representative (Figure 4A), meaning they include only one CDS per strain, thus ranging in membership from 2-10 sequences.The remaining 12% of the OGs are non-representative (Figure 4B) and include multiple predicted ORFs from at least one member.Categorization of the OGs into two classes based on distribution across the rickettsial tree and other attributes, such as presence of plasmids and common arthropod hosts (Figure 4C-D), reveals that 69% of the OGs are comprised of single rickettsial groups (e.g., AG, TG, TRG, and SFG), shared rickettsial groups (subgeneric), plasmid-harboring genomes, and genomes with common arthropod hosts (Table 1).These class 1 OGs (C1OGs) contain 76% of the predicted ORFs grouped into OGs by OrthoMCL, suggesting that our criteria for distinguishing biologically interesting protein families based empirically on robust phylogeny estimation, presence of extra-chromosomal DNA and shared arthropod hosts is valid.The remaining ORFs grouped into class 2 OGs (C2OGs) depict gene families drifting or sporadically lost from the core genetic repertoire of the rickettsial ancestor [32] or genes acquired laterally (Figure S2).Interestingly, while the majority (71%) of representative OGs qualify as C1OGs, the non-representative OGs are distributed within C1OGs and C2OGs in near equal frequency (Table 1), suggesting minimal conservation for gene duplications and laterally acquired genes in these rickettsial genomes.

Predicted OGs: Conservation and Representation
The RiOGs range in membership from two to 31 ORFs, with few (,3%) OGs exceeding more than 10 ORFs (Table 2).Representative C1OGs comprise a substantial portion (64%) of the OGs with membership of 10 or fewer ORFs.Regarding the OGs with more than 10 members, a range from 4% (R. prowazekii) to 32% (R. conorii) illustrates the frequencies at which a particular rickettsial genome contributes to non-representation.As expected due to their smaller genome sizes and few gene duplications [21,25], TG rickettsiae make little contribution (avg.5%) to larger non-representative OGs as compared to AG (avg.19%), TRG (avg.17%) and SFG (avg.31%) rickettsiae (Table 2).Thus, these three latter groups have genomes more tolerant of multicopy genes, particularly those resulting from transposases and other insertion sequences, which act to produce elevated levels of paralogous genes.For instance, analysis of the distribution of RiOGs containing genes associated with mobile DNA and/or Tree from Bayesian analysis.Three MCMC chains were primed with a neighbor-joining tree and run independently for 25000 generations in model-jumping mode.Burn-in was attained by 2500 generations for all chains, and a single tree topology with exclusive use of the Jones substitution model was observed in post burn-in data.
The consensus tree shown here thus has 100% support for every branch.Branch support is from the distribution of posterior probabilities from all trees minus the burn-in.(B) Tree from exhaustive search using parsimony.Branch support is from one million bootstrap replicates.doi:10.1371/journal.pone.0002018.g003 horizontal gene transfer (HGT), such as genes coding for proteins with ankyrin (ANK) and tetratricopeptide repeat (TPR) motifs, proteins with rickettsial palindromic elements (RPE), proteins associated with transposable elements (TNP), proteins of toxinantitoxin modules (TA), and phage related elements, revealed that they are nearly non-existent in TG rickettsial genomes (Table 3).
The remaining three lineages, all purportedly containing some species that harbor plasmids, have elevated levels of most of these gene groups compared to TG rickettsiae.Interestingly, nearly half (47%) of the C2OGs are comprised of these six gene groups, while only a small portion of the C1OGs (5%) and singletons (4%) contain them (Table 3).Given the probable lateral inheritance of many of these genes, either as facilitators or products of HGT, it is evident that they are less conserved and of less importance to overall rickettsial fitness and survival.However, their contribution to species-and strain-specific pathogenicity cannot be overlooked.Interestingly, our observation that these more promiscuous gene families tend to occur predominantly within C2OGs is congruent with a recent study demonstrating that barriers to bacterial HGT are more stringent for single copy genes [35].
A comparison of the distributions of both representative and non-representative C1OGs and their associated singletons uncovers the high occurrence of singleton genes (53%) per representative C1OGs (Figure 5).While many singletons may be the product of gene overprediction (discussed below), some could possibly have important species-or strain-specific functions, such as host manipulation.''False singletons'', which depict non-representative OGs with all members from a single genome (Figure 4C), contribute less (17%) towards non-representation when identical genes from R. felis plasmids pRF and pRFd are not considered (for speculation on the existence of pRFd see Gillespie et al. [28]).Thus the biological causes of non-representation, such as HGT and gene duplication, tend to occur more within gene families common across multiple rickettsial genomes rather than in unique genes within individual genomes.This is congruent with our determination of the high occurrence within C2OGs of six gene families typically associated with mobile DNA and/or HGT (above).

The Nature of Non-Representation
The degree of non-representation recovered by OrthoMCL is not a surprise as Rickettsia genomes are notorious for being highly reductive [e.g., [36][37][38], having a high occurrence of split genes and pseudogenes [e.g., 22,23,32,39,40] and limited conservation in important host-recognition proteins such as rickettsial outer membrane protein A (rOmpA) and other cell surface antigens (Scas) [e.g., [41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57].Coupled with this, some of the more recently sequenced genomes (namely both R. bellii strains and R. felis) are riddled with gene rearrangements and elevated levels of repetitive elements and transposases [9,27], and the staggering degree of repetitive sequences and gene duplications in the recently sequenced genome of Orientia tsutsugamushi [58] suggest the old paradigms for genome reduction and synteny in Rickettsiaceae need reevaluation.Furthermore, as we recently predicted [28], new evidence is mounting for the presence of plasmids in several members of AG, TRG and SFG rickettsiae (reviewed in Baldridge et al. [59]), with some proteins having high similarity to counterparts encoded on rickettsial chromosomes [e.g., 28,60].All of these factors confound the accurate assignment of gene orthology across genomes, and it is important to view our results as algorithm-dependent, which further required manual scrutiny and adjustment.
Manual inspection of the 259 non-representative OGs via multiple sequence alignment of each specific case revealed the high occurrence of split genes versus true gene duplications (Table 4; Table S1).Including spurious duplications from the identical R. felis pRF and pRFd plasmids, 387 problematic ORFs were eliminated or stitched together to create pseudogene ORFs, resulting in only 80 remaining non-representative OGs defined by true gene duplications.Notably, elimination of identical pRF and pRFd plasmid genes created 33 additional R. felis singletons.After ''repairing'' OGs defined by split ORFs, four distributions contained the majority of C1OGs, illustrating the instances of gene decay from the core, -TG, TRG+SFG, and SFG distributions (Figure 6).Regarding the repaired OGs with a core distribution, nearly half of the split genes were from the R. bellii str.OSU 85-389 genome and include critical genes such as those encoding alanyl-and leucyl-tRNA synthetases and one of the five virB6 components of the type IV secretion system.OGs containing split genes with a -TG distribution include two proteins possibly involved in DNA transformation: a ComEC/Rec2-related protein and a putative DNA processing protein DprA, plus two phage related proteins and a TPR motif-containing protein.This illustrates that genes deleted from the TG genomes involved in conjugation or other methods of foreign DNA uptake are in the process of decaying from the remaining rickettsial genomes.Through the comparison of the proportion of split genes to gene duplications per rickettsial genome (Table 5), it is evident that split genes occur more frequently, particularly in SFG rickettsiae, and that both split genes and gene duplications are nearly nonexistent in TG rickettsiae.Interestingly, the genomes with plasmids and elevated levels of transposases and related elements, namely R. felis and R. bellii, also have elevated levels of gene duplications.Core and Group-Specific C1OGs The distribution of representative (1300) and non-representative (79) C1OGs and singletons are shown over our estimated phylogeny (Figure 7).Singletons (1467) are also shown but discussed in a separate section below.Of the 1379 C1OGs, 31% are annotated as hypothetical proteins (HPs), suggesting that a significant amount of even the conserved genes within these rickettsial genomes remain to be characterized.Not considering the bellii C1OG, which contains genes unique to the R. bellii genomes, the amount of HPs within the C1OGs decreases to 18%.The core and lineage specific C1OGs are discussed below.
Core rickettsial genes.OrthoMCL grouped 731 representative and 21 non-representative protein families that are present in all ten analyzed rickettsial genomes (Table S2).Thus, the genes encoding these proteins define the foundation of rickettsial biology, such as ''house-keeping'' functions, as well as rudimentary processes in host cell recognition, invasion and survival (but not necessarily virulence as not all Rickettsia spp.are known pathogens).The distribution of the assigned cellular functions of each of these core proteins provides insight on the conservation of cellular activities relative to other bacteria (Figure 8A).Not surprising, OGs involved in translation represent the largest functional category (16.14%), as other cellular functions such as amino acid (2.6%), carbohydrate (2.1%), nucleotide (2.3%), and lipid (2.2%) synthesis are less necessary when many of these resources can be obtained from host cells [61,62].Analyzing a crude depiction of the R. felis proteome, Ogawa et al. [40] reached a similar observation as their 172 identified proteins sorted into cellular function categories similar to those assigned for our core proteins, although with far fewer members per category (Figure 8B).The core rickettsial protein distribution across cellular function categories is also similar to another obligate intracellular pathogen, Chlamydia trachomatis, suggesting that this lifecycle is defined by reduction of many genes with conserved cellular functions (save translation) in facultative intracellular (Yersinia pestis) and extracellular      ORFs found in only one rickettsial genome.Does not include false singletons (see Figure 4).doi:10.1371/journal.pone.0002018.t003(Escherichia coli) pathogenic bacteria.The percentage of ORFs coding for metabolic genes is lower in the obligate intracellular bacteria, with exception of the coenzyme transport/metabolism and lipid transport/metabolism genes of Chlamydia, which equal and exceed that of the two larger genomes, respectively.
AG rickettsiae.Based on phylogeny estimation of over 30 proteins that placed R. canadensis basal to the TG, TRG and SFG rickettsiae, we categorized it with both R. bellii strains in the AG rickettsiae [28], a result recovered here and consistent with several previous studies [3; consensus tree of Vitorino et al. [63]].Conversely, our analysis of OG distribution recovered only two proteins that are unique to AG rickettsiae: RiOG_1416 (Type I restriction-modification system, M subunit) and RiOG_1429 (F pilus assembly protein TraB).RiOG_1416 is truncated in R. bellii str.OSU 85-389 and extremely truncated in R. canadensis.Similarly, RiOG_1429 is truncated in R. canadensis; thus it is unlikely that either ORF is an important signature for AG rickettsiae.Furthermore, while both strains of R. bellii share 321 unique representative protein families (Figure 7, Table S3), R. canadensis only shares two unique proteins with the remaining derived rickettsiae: RiOG_925 (COG0419: ATPase involved in DNA repair) and RiOG_927 (methyltransferase family protein), with the latter likely part of a multigene family with other R. bellii homologs.Thus, OG distribution provides little evidence for placing R. canadensis either within AG rickettsiae or as derived.For instance, of the three derived rickettsial groups, R. canadensis shares more OGs with SFG (13; However, the three OGs shared between R. canadensis and TG rickettsiae are all unique sugar transferases, and all three genomes share an unprecedented 52 lost OGs relative to the remaining seven rickettsial genomes (Table 6; Figure S2-F3).Interestingly, R. canadensis shares zero lost genes with either TRG or SFG rickettsiae.It also shares with R. prowazekii a unique split gene, scaI, that is the most conserved member of the scas and is present in all analyzed Rickettsia spp.[57].Thus, while phylogeny estimation places R. canadensis basal to the TG, TRG and SFG rickettsiae, and common OGs suggest an affinity to SFG and TRG rickettsiae over TG rickettsiae, the mode of gene loss across the lineages branching off after R. bellii suggests the position of R. canadensis within our generated phylogeny is well supported, but with possible affinities with TG rickettsiae, which were originally suggested based on serological cross reactivity studies [64].Accordingly, phylogenetic analysis and signature proteins alone should not be solely used to characterize rickettsial groups, as shared absence of genes may reflect relatedness that is difficult to detect otherwise in these highly reductive genomes.Interestingly, Vitorino et al. [63] recently demonstrated an affinity between R. canadensis and R. helvetica based on phylogeny estimation from eight genes, although they concluded that the phylogenetic position of R. canadensis was unstable, which is consistent with previous studies.For instance, like SFG rickettsiae, R. canadensis was isolated from ixodid ticks and is maintained transstadially and transovarially [65,66], grows within the nuclei of its host [65], and contains both rOmpA and rOmpB genes [67,68].However, like TG rickettsiae, R. canadensis grows abundantly in yolk sac, lyses red blood cells, is susceptible to erythromycin, and forms smaller plaques as compared to SFG rickettsiae [69].Genomic characteristics are just as anomalous, as despite sharing the same G+C% [26,69] and only a slightly larger genome size than TG rickettsiae (Figure 7), R. canadensis shares more common repetitive elements with SFG rickettsiae genomes than with any other group [26] and has many similar genes found within the tra cluster of R. massiliae [70].Switching the position of R. canadensis in our genome alignment to reflect a derived relationship relative to TG rickettsiae did not improve synteny with the other rickettsial genomes, and despite a large central inversion, R. canadensis gene order is highly conserved with most of the derived taxa (Figure S1-D).In an effort to test a putative affinity between R. canadensis and R. helvetica (genome sequence unavailable), we selected 16 existing full or partial gene sequences for R. helvetica and estimated a phylogeny (Figure 9).R. helvetica is supported as basal to the remaining SFG rickettsiae in an otherwise identical phylogeny estimated from the 731 core rickettsial genes (Figure 3), thus refuting an affinity between R. canadensis and R. helvetica.The recent phylogenies estimated from 16S rDNA and groEL Taxon abbreviations are explained in the Figure 1 legend.
3 Number of split genes, followed by number of ORFs resulting from splits, followed by overestimated ORFs.Note: split genes may be split more than once. 4 Number of gene duplications, followed by number of duplicated ORFs.Note: some genes are duplicated more than once, and pRF genes are considered duplications of R. felis chromosomal orthologs. 5 Total number of split ORFs and gene duplication events per genome.
nucleotide sequences, the VirB4 protein and 14 concatenated proteins of the T4SS complex, and entire genome sequences placed R. canadensis between TG and TRG rickettsiae [26]; however, R. bellii was not sampled, likely affecting character polarity with the absence of an ancestral taxon.Thus, given our estimation of phylogeny from all available annotated rickettsial genomes, we are confident in the placement of R. canadensis as basal to the TG, TRG and SFG rickettsiae, although limited similarity is apparent to both R. bellii genomes as revealed by OG distribution and synteny.It is not unreasonable to predict that R. canadensis will ultimately group within a fifth distinct rickettsial group once more genomes are sequenced from lesser known rickettsiae, particularly species non-pathogenic to humans.TG rickettsiae.Despite being distinct from the other rickettsial groups with its highly reductive genomes and strictly insect-specific lifestyles, TG rickettsiae were predicted to contain only three unique representative OGs: a putative GTP pyrophosphokinase (RiOG_2080) and two HPs (RiOG_2081 and RiOG_2082).RiOG_2080 is part of a probable multigene family that is duplicated in most rickettsial genomes.These enzymes catalyze the synthesis of guanosine 59-triphosphate 39diphosphate (pppGpp) as well as guanosine 39,59bispyrophosphate (ppGpp) by transferring pyrophosphoryl groups from ATP to GTP or GDP respectively [71], functioning as mediators of the stringent response that coordinate a wide range of cellular activities in reaction to changes in nutritional abundance [72].While common in multiple variable copies across the sampled genomes, the role lineage specific GTP pyrophosphokinases play in accommodating the different modes of intracellular replication and intercellular spreading by different rickettsial groups is worth exploring.RiOG_2081 is an uncharacterized protein conserved in a limited number of other bacteria (COG3274) and unknown from non-TG rickettsiae.The distribution of this protein, a putative membrane associated acyltransferase, in many pathogenic bacterial species and one bacteriophage, PhiV10, is interesting (Table 7).Finally,   RiOG_2082 is a small putative ORF that BLASTs to no other organisms, with the start codon missing in R. typhi.
While a wealth of unique genes seemingly does not define TG rickettsiae, 53 unique gene loss events may offer insight into the streamlined manner of their evolution (Table 6).The loss of the Arp2/3 complex activating protein, rickA, from TG rickettsiae has been well-documented, and distinguishes this group in its mode of host cell spreading [73,74].Interestingly, our comparative analysis has revealed two other curious proteins that are present and conserved in all other non-TG rickettsiae genomes.The first is RiOG_897, a putative trichohyalin, which are intermediate filament-associated proteins found predominantly in the hair follicle cells of mammals [75,76] but also expressed in the hard palate, tongue, nail bed, and a suite of pathological epidermal tissues [77,78].We discuss more about trichohyalins below in regards to insect-associated rickettsiae containing a unique trichohyalin-like homolog that is different from the gene found in all other non-TG rickettsiae.The second interesting OG (RiOG_901) found exclusively in non-TG rickettsiae is an ecotinlike protein.Ecotin is a dimeric periplasmic protein described in Escherichia coli that belongs to the protease inhibitor I11 (ecotin) family (PF03974).Ecotin inhibits several pancreatic serine proteases, including chymotrypsin, trypsin, elastases, factor X, kallikrein, as well as a variety of other proteases [79][80][81].Eggers et al. [82] have shown that ecotin protects E. coli from neutrophil elastase (NE), a mammalian serine protease demonstrated to be important for neutrophil killing of several gram-negative bacteria.Specifically, NE cleaves ompA causing increased permeability to the bacterial outer membrane [83].Once NE translocates across the vulnerable outer membrane, it functions in inhibiting bacterial cell growth and repair, causing cell death.The presence of ecotin in the periplasm inhibits NE function, thus fostering recovery and growth of the invading bacterial cells [82].Given the diversity of rickettsial outer membrane surface proteins, particularly the Scas [55], it is reasonable to suggest that one or several surface proteins present in all non-TG rickettsiae may be dependent upon the putative NE inhibitory function of RiOG_901.
TRG rickettsiae.Based on the monophyly of its sampled members (R. felis and R. akari), its strongly supported position in our estimated rickettsial phylogeny, an affinity with AG rickettsiae plasmid-associated genes, and the use of both acarines and insects as primary invertebrate hosts, we erected the TRG rickettsiae as a third derived lineage of Rickettsia [28].OrthoMCL predicted 37 OGs unique to TRG rickettsiae (Table 8).Of the three other rickettsial lineages, TRG shares more common OGs with SFG rickettsiae (36) than with TG rickettsiae (0) or AG rickettsiae (6) (Figure 7), reflecting its shared common ancestry with the ''true'' spotted fever group taxa.However, exclusion of R. canadensis sheds light on our previously described affinities of TRG rickettsiae with AG rickettsiae (Table 9).For instance, 26 OGs are shared between the R. bellii genomes and TRG rickettsiae (Figure S2-C25), with six of these annotated as members of toxin-antitoxin (TA) modules, and another two annotated as bacteriophagederived proteins.Additionally, the R. felis genome shares 44 OGs with the R. bellii genomes (Figure 7), six of which are annotated as members of TA modules, with another one annotated as bacteriophage-derived protein.Furthermore, the R. akari genome shares 10 OGs with the R. bellii genomes (Figure S2-B23), and two of these OGs are predicted members of TA modules.This high presence of TA system components, as well as bacteriophagederived proteins, attests to our previous observations that AG (at least R. bellii) and TRG rickettsiae are linked via conjugative systems and have a pronounced presence of similar plasmid (and now phage) related ORFs, likely the end products of various lateral gene exchanges between these distantly related groups.
Despite the abovementioned characteristics shared between AG and TRG rickettsiae, the TRG rickettsiae also share three TA components exclusively with SFG rickettsiae (Table S4).Additionally, SFG rickettsiae and the R. bellii genomes have three TA components not found in the other analyzed genomes (Figure  S13 for gene names and sequence accession numbers.Tree estimated under parsimony (see text).doi:10.1371/journal.pone.0002018.g009S2-D7).This alludes to the likelihood that SFG rickettsiae and R. bellii have also had lateral exchange between plasmids at some point in their evolution, although not nearly to the degree that TRG and the R. bellii genomes have had.For instance, of the 27 OGs shared between R. felis and SFG rickettsiae (Figure S2-C1), only three are components of TA modules (Table S4).And of the 22 OGs shared between R. akari and SFG rickettsiae (Figure S2-C2), none are predicted as components of TA modules.This distinction of the close relatedness of TRG to AG rickettsiae (at least the R. bellii genomes) relative to its sister clade, SFG rickettsiae, based on plasmid associated gene distribution is critical in understanding the mode of gene loss from the last common ancestor of Rickettsia, as well as the degree conjugative systems have contributed to the architecture of these genomes.
Based on phylogeny estimation of 16S rDNA sequences, the largest clade recovered to date for TRG rickettsiae included R. akari, R. felis, R. australis, and poorly characterized rickettsiae from booklouse (Liposcelis sp.) and parasitic wasp (Neochrysocharis sp.) hosts [16].In addition, Reeves et al. [84] recently identified two novel rickettsial genotypes from the mite Ornithonyssus bacoti from Egypt that are closer to TRG rickettsiae than the other rickettsial groups based on partial sequence comparison of the 17 kD antigenic gene.Aside from R. australis, which has been found exclusively in tick hosts, none of these taxa purportedly parasitize ticks, with R. akari found in mites [85], R. felis found in fleas [51,[86][87][88][89], and the other unnamed Rickettsia spp.known only from their booklouse, wasp and mite hosts.Thus the group is interesting from an arthropod host perspective as well as from its apparent affinities to the R. bellii genomes.In light of this, we suggested that R. australis would continue to group within the TRG rickettsiae [28], as it has previously done in some cases wherein one or few genes were analyzed [e.g., 16,63,[90][91][92].Our dataset including 16 gene sequences from R. helvetica (discussed above) also contained eight sequences from R. australis and grouped this taxon with R. akari in a clade subtended by R. felis with strong bootstrap support (Figure 9).However, while the TRG rickettsiae is still recovered when R. akari and R. australis are analyzed in the absence of R. felis [49,92,93], the exclusion of R. akari in the presence of R. australis and R. felis [51] failed to recover a monophyletic TRG rickettsiae.Furthermore, while four of the eight single gene phylogeny estimates by Vitorino et al. [63] recovered the TRG rickettsiae, the consensus tree did not, as the TG rickettsiae was placed within the TRG rickettsiae, splitting the R. akari/R.australis clade from R. felis.Thus, the TRG rickettsiae is not easily demonstrated as a distinct lineage of rickettsiae unless the taxon and character sampling is robust enough for this intriguing lineage to emerge (Figure 9; [28]).SFG rickettsiae.The majority of the described species of Rickettsia fall within the SFG rickettsiae.The analyzed spotted fever group genomes form a monophyletic cluster of taxa with little sequence divergence relative to the other rickettsial groups (Figure 3).OrthoMCL predicted 113 OGs that are unique to SFG rickettsiae (Table 10).Of note, in addition to the four core rickettsial proline/betaine transporters (Table S2), SFG rickettsiae contain two variant copies (RiOG_1314 and RiOG_ 1332).Other transporters unique to SFG rickettsiae include three ATPase and permease components of an ABC-type multidrug transporter (RiOG_1347, RiOG_1364 and RiOG_1365), an ATP-binding protein similar to ABC transporter (RiOG_1376), an MSF-like sugar transporter (RiOG_1355), and an RND family efflux transporter (RiOG_1294).While high numbers of transporters are expected in Rickettsia to counterbalance depleted metabolic pathways and acquire host resources, it is unclear why the SFG rickettsiae have elevated levels of unique components of organic and inorganic transport systems relative to the other three rickettsial groups.As with TG rickettsiae, there are group-specific GTP pyrophosphokinases (RiOG_1350 and RiOG_1361) in SFG rickettsial genomes, and their role in a group-specific stringent response is worthy of attention.Like AG and TRG rickettsial genomes, SFG rickettsiae have group-specific ANK repeat containing proteins, with a particular one (RiOG_1344) similar to metazoan tankyrases, telomeric repeat binding factorinteracting ANK-related ADP-ribose polymerases.Aside from potentially playing key roles in the maintenance of telomere function [e.g., 94], tankyrases have been implicated in mitogen-activated protein kinase signaling [95], regulation of cell death [96,97] and viral inhibition [98].
Using EasyGene [99], a program that ranks prokaryotic predicted ORFs based on statistical significance, Nielsen and Krogh [100] determined that the R. conorii str.Malish 7 genome was over-annotated by 16%, ranking 7th among most overannotated replicons in a sample of 143 prokaryotic genomes.Specifically, EasyGene determined 225 RefSeq genes to be false, with 34 additional genes predicted by EasyGene that were not called in the original study [22,23].Aside from possible gross ORF over-prediction in all ten rickettsial genomes (discussed below), our analysis yielded many OGs with imperfect representation within the SFG group, as 54 OGs are found exclusively in the R. conorii and R. sibirica genomes (Figure S2-A1), 52 are found exclusively in the R. rickettsii and R. sibirica genomes (Figure S2-A2), and 36 are found exclusively in the R. rickettsii and R. conorii genomes (Figure S2-A3).Given that the SFG rickettsial genomes have elevated split genes as compared to other rickettsial genomes (Table 5; Table S1), our findings and those of Nielsen and Krogh [100] hint at a pronounced rate of pseudogenization in SFG rickettsiae depicted by a patchy distribution of split and truncated ORFs decaying from the ancestral SFG genome.
One hallmark occurrence of probable pseudogenization in SFG rickettsiae involves a Sec7-domain-containing protein known in prokaryotes only from Rickettsia and Legionella species [101].The Legionella counterpart of this curious protein, named RalF, is a guanine nucleotide exchange factor that recruits ADP-ribosylation factor to occupied phagosomes, permitting Legionella to replicate free from the host immune system [102].The rickettsial RalF xenolog (RiOG_19), including the N-terminal Sec7 domain and immediate flanking Sec7-capping-domain [103], is present in all rickettsial genomes except for SFG rickettsiae and R. canadensis, suggesting a biological mechanism that has been lost from the true spotted fever group and R. canadensis.Unlike Legionella RalF, which has a short (44 aa) C-terminal tail containing a type 4 secretion system (T4SS) signal sequence [104], the rickettsial genes encode an additional variable domain (97-315 aa) between the Sec7capping-domain and the C-terminal tail.Within this third domain lies a region immediately flanking the predicted T4SS signal sequence that is extraordinarily rich in proline residues, much like the P-rich domain of rickA proteins [74].Interestingly, the SFG genomes each contain small ORFs corresponding to the tails of the RalF-like sequences.A similar sequence within the R. canadensis genome (not annotated) also spans this region yet is riddled with frame-shift mutations.Given that Rickettsia, unlike Legionella, quickly lyse the phagosome upon host cell entry, the function of a RalF xenolog, particularly given its curious distribution in the rickettsial tree, is worthy of investigation.Finally, full intact RalF xenologs in both TRG rickettsial genomes further attest the distinction of this lineage from the SFG rickettsiae [28].

Arthropod Host-Specific OGs
Several studies have demonstrated the presence of certain rickettsial species outside of their natural arthropod hosts.For example, the louse (and less often flea) associated R. prowazekii has been found in ticks in Africa [105] and Mexico [106], and was also reported in acarids from flying squirrels in the United States [107].However, it should be recognized that many blood-feeding arthropods have a wide range of vertebrate hosts and likely act as reservoirs for a variety of bacteria that incidentally fall outside of their natural arthropod vector.To this extent reports of pathogenic bacteria (i.e., R. prowazekii) in unusual vectors need to be substantiated beyond simple detection in these foreign hosts, and caution should be taken when immediately assigning novel host associations.Given the low frequency of resident bacteria in many natural arthropod populations [108], substantiation of novel arthropod hosts can be achieved in the field by robustly sampling other invertebrate and vertebrate animals from the same locality that may actually be the true host of the incidentally collected bacterium.Furthermore, laboratory studies would be needed to determine the pathogenicity, if any, that the bacterium causes in its novel host.However, laboratory inoculation of an animal may result in pathogenesis only because the number of bacteria far exceeded what occurs in nature, thus compromising an immune system that under natural circumstances is quite capable of killing the pathogen.Furthermore, demonstrating laboratory bacterial infection or vectorization in a foreign host, for example R. conorii in the body louse [109], may initially prove successful, but eventually will clear from the host as it would from natural populations.For instance, Rickettsia have been grown in mosquito cell lines, yet to our knowledge no wild caught mosquitoes to date have been shown to act as hosts to any Rickettsia.In fact, based on the analysis of the highly divergent sca genes in rickettsiae, which are suspected to directly interact with host cell proteins [47,110], Blanc et al. [55] concluded that rapid evolution of such important host colonization genes likely keep Rickettsia host ranges quite narrow.Given our conservative stance on definitive rickettsial arthropod hosts, we have chosen to present the predicted genes that are exclusive to insect associated Rickettsia and tick associated Rickettsia (as depicted in Figure 7).Because only one analyzed genome is from a mite-associated species (R. akari), we have no comparative analysis to describe potential mite-specific rickettsial genes.However, the list of singleton genes found in R. akari may provide a start to such an approach (see below).
Insect-associated rickettsiae.Three of the ten analyzed rickettsial genomes have definitive insect hosts, with R. typhi and R. felis reported from rodent, shrew and feline [51,[86][87][88] associated fleas, and R. prowazekii predominantly pathogenic in lice, as well as fleas in the sylvatic form.Thus these three rickettsial lineages share common arthropod hosts at least in fleas.Regarding R. typhi, It has become apparent that the ecology of murine typhus in both south Texas and southern California, where the classic cycle of R. typhi involving commensal rats and primarily the rat flea (Xensopsylla cheopis), has been replaced by the Virginia opossum (Didelphis virginiana)/cat flea (Ctenocephalides felis) cycle.For instance, Sorvillo et al. [111] demonstrated the association of 33 cases of locally acquired murine typhus in Los Angeles County with seropositive domestic cats and opossums.However, urban rat/flea populations are still the main reservoir of R. typhi worldwide and particularly in Table 9. cont.
many cities where urban settings provide a constellation of factors for the perpetuation of murine typhus, including declining infrastructures, increased immunocompromised populations, homelessness, and high population density of rats and fleas.Thus, aside from the reported louse host of R. prowazekii and a laboratory demonstration that R. typhi infection is lethal for human body lice [112] despite R. typhi being unknown from wild lice, these three rickettsial taxa are all capable of infecting and causing pathogenicity in an overlapping range of flea species, prompting a genomic comparison to detect common genes possibly involved in flea cell invasion and pathogenicity.Despite the vast evolutionary divergence between arachnids and hexapods, two lineages with a common ancestor estimated to have split over 500 million years ago [113], only two OGs (RiOG_1496 and RiOG_1497) specific to the R. prowazekii, R. typhi and R. felis genomes were predicted by OrthoMCL (Figure 5, Figure 7, Table 11).However, these genes are exceptionally interesting from two perspectives.First, while the ORFs encoding both OGs are contiguous in all three genomes, they are present only on the pRF plasmid and not the chromosome of R. felis, suggesting a possible lateral exchange of these genes between TG rickettsiae and the R. felis genome.Second, these ORFs share little homology with genes from other organisms, and the taxonomic distribution of these organisms is quite intriguing.RiOG_1496 is annotated as myosin-11 and has close similarities to RiOG_1454, which is annotated as a HP found in the R. felis genome as well as both R. bellii genomes.Furthermore, RiOG_897 (discussed above), a predicted trichohyalin-like protein found in all analyzed rickettsial genomes but TG rickettsiae, has limited similarity with RiOG_1496.Aside from the more general functions described above, trichohyalin also acts as a cross-bridging protein that assists in the coordination of mechanical strength between the peripheral cell envelope barrier structures and cytoplasmic keratin filament networks [114].The lysosomal cysteine protease, cathepsin L, which is critical for skin and hair follicle homeostasis, likely uses trichohyalin as a substrate [115].Recently, Ou et al. [116] determined that a trichohyalin homolog, DYF-14, in the nematode Caenorhabditis elegans is essential for cilium biogenesis.Thus, this group of proteins seems to be critical for epithelial cell maintenance in a wide range of animals, and the presence of similar proteins in TG rickettsiae may hint at a molecular function involved with epithelial (invertebrate host) or endothelial (vertebrate host) cell entry and modification, as both R. typhi, R. prowazekii and R. felis enter their vertebrate hosts transdermally through inoculation or inhalation of insect feces.
Aside from sharing limited homology to these other OGs, RiOG_1496 is also similar to a predicted permease component of a ribose/galactose ABC transporter from the bacterium Mycoplasma mycoides (mollicutes: Spiroplasma group), the etiological agent of contagious bovine pleuropneumonia.Interestingly, a similar ORF is present in the cow genome, possible hinting at a horizontal exchange between M. mycoides and its bovine host.RiOG_1496 also Blasts to sequences from three other metazoans, the rust red flour beetle, Tribolium castaneum, the African clawed frog, Xenopus laevis, and the California purple sea urchin, Strongylocentrotus purpuratus.The beetle and frog ORFs are predicted as structural maintenance of chromosomes (SMC) proteins 6 and 5, respectively.SMC proteins are involved in such cellular processes as chromosome condensation, sister chromatid cohesion, chromosome partitioning, dosage compensation, DNA repair, and recombination [e.g., [117][118][119].In Bacillus subtilis, an SMC protein (BsSMC) plays a role in chromosome organization and partitioning, and has been shown to affect supercoiling in vivo, most likely by constraining positive supercoils, an activity contributing to the compaction and organization of chromosomes [120].The ORF from the sea urchin, as well as one final BLASTP hit to a sequence from Neurospora crassa, a type of red bread mold of the phylum Ascomycota, are annotated as HPs.
Like RiOG_1496, RiOG_1497 had only a few BLASTP hits with significant alignments, yet they cover a range of diverse Table 10.OGs present only in SFG rickettsiae.organisms.RiOG_1497 shares limited similarity with RiOG_1439, which is annotated as a HP and found only in the R. bellii genomes and the chromosomal genome of R. felis.Regarding eukaryotes, RiOG_1497 shares limited similarity with HPs from the green spotted pufferfish, Tetraodon nigroviridis, and the rice blast fungus, Magnaporthe grisea.RiOG_1497 also Blasts to a HP from another a-proteobacterium, Stappia aggregata (Rhodobacterales).Interestingly, there is also limited similarity between RiOG_1497 and a serine/threonine protein kinase from the marine filamentous cyanobacterium, Trichodesmium erythraeum.
OrthoMCL predicted zero non-representative OGs for the insect-associated Rickettsia (Figure 5, Figure 7), and only two representative and two non-representative OGs are present in all other genomes except the insect-associated rickettsiae (depicting shared lost genes in the insect-associated genomes) (Figure S2-F6).Both representative OGs (RiOG_948 and RiOG_951) are HPs, while the two non-representative OGs, RiOG_814 and RiOG_817, are annotated as a conserved uncharacterized bacterial protein (COG4374) and a HP, respectively.Thus, only the poorly characterized tandem gene group of RiOG_1496 and RiOG_1497 exists for attempting to distinguish the insectassociated Rickettsia from the other lineages with non-insect hosts.
Although the similarity of both RiOG_1496 and RiOG_1497 to the sequences described above is limited, it is nonetheless interesting that their distribution as contiguous ORFs in the TG rickettsiae and the R. felis pRF plasmid is unique amongst the analyzed rickettsial genomes.It is also interesting that at least one of the ORFs (RiOG_1496) has homology to vertebrate smooth muscle protein myosin-11, which is known to be expressed in the esophagus and trachea of humans, as well as trichohyalin, a protein associated with various healthy and pathological epithelial cell types.Both of these proteins are present at the infection interface between insect associated Rickettsia and vertebrate hosts and, at the very least, provide our best guess for a means to distinguish, at the genomic level, insect-associated vertebrate cell invasion from that of acarine.This result of a few examples from  the comparative analysis of ten genomes is surprising, and perhaps can be improved upon by the sequencing of more insect-associated rickettsial genomes.
Tick-associated rickettsiae.Six of the ten analyzed rickettsial genomes have definitive tick hosts, including both R. bellii genomes, R. canadensis, R. rickettsii, R. conorii, and R. sibirica.In general, little is known about the definitive host ranges of members of the AG and SFG rickettsiae, partly because few host-specific characteristics have been described for any rickettsial/acarine relationship, but also because multiple arthropod or vertebrate (or other eukaryote) hosts are seldom sampled from a given locality to distinguish between true rickettsial hosts and incidental vectors (discussed above).R. bellii seems to parasitize the widest range of tick genera [17], while of the pathogenic taxa, only R. conorii seems to be limited to one vector species [129].OrthoMCL predicted one non-representative (RiOG_866) and three representative (RiOG_1005, RiOG_1012 and RiOG_1021) OGs specific to the tick-associated rickettsial genomes (Figure 5, Figure 7).RiOG_866 is an alpha-(1,3)fucosyltransferase that is highly truncated in all but the R. bellii genomes and further split in R. conorii and R. sibirica (Table S1), depicting a gene undergoing decay.Similarly, RiOG_1021, annotated as a poly-beta-hydroxybutyrate polymerase, is also experiencing pseudogenization, as it depicts an artifact of the clustering process.RiOG_1021 is related to RiOG_834 (core distribution), which has full-length (,583 aa) proteins in TG and TRG rickettsiae, but parts of split genes from the tick-associated taxa.The corresponding halves of these split genes constitute RiOG_1021.Thus, if only the full sized ORFs are functional, alpha-(1,3)-fucosyltransferase is the lone signature protein found exclusively in TG and TRG genomes (the converse of the tick-associated rickettsiae).
RiOG_1005 has mild similarity to fic (filamentation induced by cAMP) proteins (Table 12), which are involved in cell division and folate metabolism (IPR003812).Aside from R. canadensis, which is highly truncated, the rickettsial sequences contain the central conserved HPFXXGNG motif characteristic of this protein family.Critical for the production and maintenance of new cells [130], folate is especially important during periods of rapid cell division and growth.While the exact molecular function of fic proteins is unknown, it is possible RiOG_1005 is involved in some aspect of folate synthesis, an incomplete pathway in Rickettsia likely requiring energy-coupled transporters to uptake host stores of the vitamin and/or its derivatives [61,62].However, the absence of this gene in insect-and mite-associated rickettsial genomes and the loss of the majority of the protein in R. canadensis hint more toward the decaying of this gene family.The identification of a core rickettsial transporter involved in folate/ folate derivative uptake would support this hypothesis.
RiOG_1012 is highly similar to macrolide, virginiamycin A, chloramphenicol, and streptogramin A acetyltransferases, acetyltransferases of the isoleucine patch superfamily and transferases with hexapeptide repeats from many different bacterial species, several of which are highly pathogenic (Table 12).In particular, streptogramin A and virginiamycin A acetyltransferases confer gram-positive bacteria resistance to A-type compounds of virginiamycin-like (Vml) antibiotics [e.g., [131][132][133][134].Because gram-negative bacteria typically have an innate resistance to Vml antibiotics [e.g., 135,136], the presence of Vml acetyltransferases in certain gram-negative bacterial genomes went unnoticed until their discovery early this decade in Yersinia enterocolitica [137].With the rapid accumulation of bacterial genome sequences it became apparent that many gram-negative bacterial genomes harbor Vml acetyltransferases (e.g., Table 12).Interestingly, the predominant presence of Vml acetyltransferases on plasmids in gram-positive bacteria versus their typical chromosomal location in gram-negative bacteria suggests that the genes encoding these variable proteins likely spread via conjugation and possibly equip gram-positive bacteria with resistance to Vml antibiotics [137].While all six sequences within RiOG_1012 are highly similar in the C-terminal region, the N-terminal halves of the proteins are highly divergent between SFG rickettsiae, the R. belii sequences, and R. canadensis (Table 12).This is consistent with the initial studies that concluded streptogramin, chloramphenicol and related acetytransferases belong to a vast family of enzymes with varying substrates [131,138].The presence of a Vml acetyltransferase only in tick-associated rickettsiae is interesting and implores further laboratory investigation.
As with insect-associated rickettsiae, OrthoMCL predicted few signatures for tick-associated rickettsiae.Despite the diversity between insects and ticks, all of the analyzed rickettsial species are capable of infecting vertebrates; thus the identified host-specific OGs likely do not contain proteins involved in vertebrate host cell invasion and pathogenicity.The likelihood that these signatures are involved in arthropod cell entry is also low, given the incidental collection of rickettsial species outside the range of their expected hosts (discussed above).However, these signature genes may be involved in mechanisms specific to arthropod host lifestyle, aiding long-term infection and the ability to persist in tick (via transstadial and transovarial transmission) and insect (via fecal inoculation and inhalation) populations despite the rapid generation times of these arthropods.

Plasmid Associated OGs
We recently analyzed the genetic composition of the pRF plasmid of R. felis and determined that the replicon is composed of  genes with likely origins to AG rickettsiae and other plasmidcontaining bacteria [28].This suggests that the last common ancestor of all rickettsiae likely harbored plasmids, with R. bellii [139], R. felis [27], likely R. akari [140] and other members of TRG rickettsiae, and some members of SFG rickettsiae either maintaining plasmids despite the constraints of shrinking genomes, or acquiring plasmids later in their evolution.Given the plasticity of plasmid presence/absence in other obligate intracellular bacteria [[e.g., [141][142][143][144][145]], as well as other medically-important pathogenic bacteria [e.g., [146][147][148][149][150], it is probable that the presence of plasmids may be variable at the strain level in Rickettsia, particularly when only one of the two sequenced R. bellii genomes harbors a plasmid [9,139].Past reports of pulsed-field gel electrophoresis (PGE) on rickettsial species that do not correlate with the sizes of recently sequenced genomes [151,152] may also allude to plasmid plasticity in populations of species and strains of Rickettsia.
Our previous suspicion that plasmids are likely to be found in some lineages of SFG rickettsiae [28] has recently been confirmed, as the plasmid pRM from R. monacensis was identified by transposon insertion and further characterization and sequencing [60].Subsequently, the same research group used PGE and southern blotting to identify plasmids of variable size and composition in R. helvetica, R. peacockii, R. amblyommii, and R. massiliae [59].The entire plasmid sequence of R. massiliae was later reported [70].Furthermore, the duplication of several ORFs associated with the type IV secretion system (T4SS) in rickettsiae (VirB4, VirB6, VirB8, and VirB9), coupled with phylogenetic evidence for an ancestral plasmid origin of all T4SSs [153], suggests plasmid systems and related chromosomal genes are a major constituent of rickettsial genomes, possibly contributing to pathogenicity in many lineages.The recent discovery of extraordinarily duplicated conjugative operons, as well as extremely elevated levels of transposons, TPR and ANK motif-containing proteins, integrases, and potential T4SS effector proteins in the Orientia tsutsugamushi genome further attests to the phenomena of plasmid plasticity and HGT amongst the Rickettsiales [58], implying that the rickettsiae progenitor was larger and less stream-lined than its modern descendants [32] and likely equipped with a suite of conjugative machineries [28].
Plasmids.OrthoMCL grouped 58 predicted pRF ORFs into 49 OGs, with 11 pRF ORFs left as singletons (Table 13).Of these 49 OGs, six contained two pRF ORFs (RiOG_920, RiOG_1057, RiOG_1279, RiOG_1282, RiOG_1283, and RiOG_1596), and one contained three pRF ORFs (RiOG_928), depicting the presence of duplicated genes on the plasmid, including the chromosomal replication initiator protein DnaA (pRF04 and pRF19), a probable transposase of the mutator family (pRF01, pRF30 and pRF55), an epsilon subunit-like protein of DNA polymerase III (pRF34 and pRF53), two TPR motif-containing proteins: (pRF12 and pRF15) and (pRF16 and pRF18), a site specific recombinase similar to DNA invertase Pin homologs and TnpR resolvase (pRF32 and pRF66), and a predicted transcription regulatory protein (pRF02 and pRF29).The remaining representative OGs containing single pRF ORFs generally reflect the distribution reported by Gillespie et al. [28] based on BLASTP results, except for a few instances (italicized OGs in Table 13).In comparison to the recently discovered SFG rickettsial plasmids, it is apparent that at least three proteins, namely a DnaA-like replication initiation protein, a Sca12-like protein and a small heat shock protein, are common to all rickettsial plasmids [59].Thus, despite the growing number of plasmids in Rickettsia, their unknown origin in the rickettsial tree and lack of conserved genes involved in conjugation keep their exact function and essentiality elusive.Table 13.cont.
on plasmids, typically as two-component operons, for the control of plasmid partitioning and stable inheritance [154][155][156][157]. Antitoxins, usually highly labile in their mature form, are constitutively expressed and neutralize the accumulation of their counterpart toxins, which are more stable.Upon imperfect segregation of plasmids after cell division, plasmidless daughter cells are destroyed by elevated toxin levels due to the rapid breakdown of the unstable antitoxin and lack of its further synthesis [158,159].Although originally described as mediators of bacterial programmed cell death, studies now suggest that TA modules also act as regulators of the stringent response (reviewed in [159]) and are widely present on chromosomes of diverse bacteria [160].While TA systems are found in many free-living bacteria, they are typically uncommon among obligate intracellular pathogens [160,161].However, the genome sequences for both R. bellii strains and R. felis contain elevated levels of chromosomally encoded TA loci, the majority of which seem to be degraded [9,27].Moreover, these genomes typically retain only one component of the TA modules, possibly alluding to a neofunctionalization [153] of the remaining genes for adaptation to eukaryotic hosts, as has been suggested for at least R. felis toxin and antitoxin genes [27].However, given that the reductive nature of rickettsial genomes may result in high levels of constitutively expressed loci and reduced operons, and that many antitoxins contain motifs common to two, three and even four different DNA-binding-proteins [162], incomplete and noncontiguous rickettsial TA modules may still interact with one another to coordinate a response to stress within host cells.Alternatively, the presence of incomplete TA modules may reflect vertically acquired plasmid-associated genes that are in the process of pseudogenization.In support of this, of the numerous TA components in the R. felis genome, only one VapB antitoxin (RiOG_941) was recovered in a proteome screen [40].
Our bioinformatic analysis reveals that components of 5 TA systems (relBE, phd/doc, vapBC/vag, mazEF, and parDE) are recurrent in all rickettsial genomes save the TG rickettsiae and R. canadensis (Table S4).Of the predicted 56 toxin and 86 antitoxin ORFs, zero occur in the TG rickettsiae and only two are found in R. canadensis.The majority of these ORFs occur in the R. bellii genomes and TRG rickettsiae (avg. of 22 and 26 TA ORFs per genome, respectively), although slightly lower levels also occur in SFG rickettsiae (avg. of 14.3 TA ORFs per genome).However, there are more occurrences of similar TA module components shared between the R. bellii genomes and TRG rickettsiae (12) than between SFG rickettsiae and either the R. bellii genomes (1) or TRG rickettsiae (5) (Table S4).Thus, the presence and distribution of these TA ORFs correlates to the lineages of sampled rickettsiae that contain plasmids, and further supports the TRG rickettsiae having affinities with AG rickettsiae [28].Furthermore, the R. belli and TRG rickettsiae genomes have elevated levels of predicted PIN-domain proteins (homologs of the pilT N-terminal domain), which in eukaryotes function as ribonucleases [163,164] involved in RNAi and nonsense-mediated RNA degradation [162,163].Most of the described prokaryotic PIN-domain proteins are toxins of chromosomally-encoded TA operons [159][160][161] that are present in a diverse array of unrelated bacteria, likely having arisen due to the advantages they bestow on competing mobile elements [165,166].Indeed, the PilT protein of the pathogenic Neisseria meningitidis has been hypothesized to interact with the T4SS due to its limited homology to the DotB protein of the Legionella T4SS [167].
While the exact manner of their origin and current functional significance is debatable, it is apparent that TA systems have arisen via HGT in a wide range of bacteria [168].Given the distribution of plasmids and associated TA systems in the analyzed rickettsial genomes, it is likely that conjugation via plasmids befits some rickettsial lineages with genes important for survival in stressful environments, allowing for dormancy and slow growth.However, it remains to be determined if those rickettsial species that harbor plasmids use TA modules for mediating the partitioning and stable inheritance of said plasmids.

Singleton ORFs
OrthoMCL failed to group 1467 ORFs (10.2% of total predicted ORFs) from the ten analyzed genomes into any OG (Figure 5, Figure 7; Table S5, S6, S7, S8, S9, S10, S11, S12, S13, S14).The range across rickettsial groups shows TG genomes contribute the least (8.5%) and TRG genomes contribute the most (41%) to the total count of singletons (Figure 10A).The individual genome contributions to the overall singleton count range from 4% (R. typhi) to 21% (R. felis), with the rank of all genomes matching the group ranking (TG,SFG,AG,TRG) (Figure 10A).However, an inherent bias of these comparisons is difficult to avoid, as OrthoMCL grouped 321 ORFs present only from both R. bellii genomes (Figure 5, Figure 7, Table S3).Accounting for these R. bellii doubletons, the rank and proportion of singletons per rickettsial group is modified: TG (7%),SFG (20%),TRG (34%),AG (39%), and illustrates that TRG and AG genomes are more similar in their number of singleton ORFs relative to TG and SFG rickettsiae.This brings up a practical concern with phylogenomic analysis in that sampling one genome per species (or strain) may not suffice for capturing the true composition of genes within the bacterial population.This is consistent with a recent study that cautioned on the very same idea in relation to vaccine design for Streptococcus agalactiae, an organism that has a core genome of approximately 80% across various strains, with the accessory genome quite plastic [169].Using mathematics, a rather daunting conclusion was reached suggesting even after sampling hundreds of additional S. agalactiae genome sequences, novel genes would still be added to the accessory genome [169].Nonetheless, inclusion of the R. bellii doubletons illustrated the similar composition of singletons in AG and TRG genomes and further adds to the similarities these genomes share as a result of related conjugation systems.
Unsurprisingly, the majority of singleton ORFs are annotated as HPs, ranging from 68% (R. felis) to 95% (R. typhi) across the  analyzed genomes (Figure 10A, B).In an effort to identify the degree of over-prediction of ORFs, we plotted the average lengths of singleton ORFs with predicted functions versus singleton ORFs annotated as HPs for all ten rickettsial genomes (Figure 10C).The rationale for this is that the majority of singletons under 100 amino acids in length should be HPs, with many having arisen by chance [100].Aside from R. felis, R. prowazekii and the R. bellii genomes, there is minimal difference between the average lengths of singletons with predicted functions and singletons annotated as HPs.The much larger average lengths of singletons with predicted functions versus singleton HPs are expected in the R. bellii and R. felis genomes, as many of the larger singletons in these genomes are probable products of HGT events (e.g., larger transposases, ANKand TPR-motif containing proteins).This same pattern in the R. prowazekii singletons, however, is unexpected, yet is skewed in part due to the presence of several large split ORFs that did not cluster into their respective OGs.While the shorter singleton HPs may have arisen by chance, it is likely that some of them are functional genes that are difficult to homologize with other closely related sequences, given the problems with assessing percent conservation across short sequences with even minimal differences.For instance, small ORFs are found in a variety of protein classes, including ribosomal proteins, transcriptional regulators, chaperonins, thioredoxins, metal ion chelators, proteolipids, stress proteins, nucleases, and mating pheromones [170].Of the original 299 Saccharomyces cerevisiae small ORFs annotated as HPs, 170 have since been assigned cellular functions, with the majority of information coming from laboratory evidence [171].Given the probable plasticity of the accessory genomes of rickettsial strains (discussed above) and the growing importance small ORFs have garnered in the literature [e.g., [171][172][173][174], the high number of small singleton HPs in Rickettsia should not be ignored.Experimental evidence has confirmed the translation of several small HPs in R. felis [40] and future microarray data will help lend resolution to this poorly understood characteristic of rickettsial genomes.

Conclusion
This study analyzed 14354 predicted ORFs from ten rickettsial genomes and generated OGs ranging from two to 31 sequences for 90 percent of the total ORFs.A conserved core rickettsial genome consisting of 731 OGs (51% of total predicted ORFs) was identified, and a phylogeny was estimated from this core genome to allow for subsequent phylogenomic comparison of the remaining accessory genome.This robust phylogeny estimate is congruent with our recent reclassification of rickettsial lineages into four groups [28] and OGs specific to each group provide the first signature genes possibly involved in the phenotypic characteristics defining each group.The unstable phylogenetic position of R. canadensis, coupled with it only sharing three OGs with the R. bellii genomes, reflects that the base of the rickettsial tree is poorly defined.However, an unprecedented mode of gene loss was discovered in the lineage spanning R. canadensis and TG rickettsiae, illustrating that gene signatures alone may not well-characterize specific rickettsial groups, but instead the modes of gene loss (and stricter reliance on host resources) may be the defining features [175].Given the emerging diversity of Rickettsia [16], particularly species associated with medically non-important metazoans and ancestrally related to the pathogenic species analyzed here, the origins of pathogenicity from primitive rickettsial symbionts may not be elucidated without a broader genomic comparison reflective of the overall diversity within the genus.
As a consequence of distinguishing OGs comprising single rickettsial groups (e.g., AG, TG, TRG, and SFG), shared rickettsial groups (subgeneric), plasmid-harboring genomes, and genomes with common arthropod hosts (C1OGs) from OGs with a patchy distribution across the rickettsial tree (C2OGs), two interesting results were obtained.First, C2OGs comprise 31% of all generated OGs, implying a significant portion of the rickettsial accessory genome is comprised of gene decay and laterally acquired genes.Supporting this is the presence of the majority of split ORFs within C2OGs (Table S1) and the high proportion of gene families typically associated with the bacterial mobile gene pool in C2OGs (47%) versus the low proportions in C1OGs (5%) and singleton ORFs (4%).Second, the ratio of representative OGs to non-representative OGs is skewed within C1OG distributions (71-29%) but nearly equal in C2OG distributions (56-44%), suggesting that gene duplications (paralogs) and HGT events (xenologs) are more prevalent in C2OGs.Taken collectively, these observations yield the manner in which the rickettsial genomes have acquired their variation: a conserved core genome is supplemented with a highly variable accessory genome that is comprised of gene decay and many horizontally acquired genes.However, the nature of the horizontally acquired genes remains unknown: for example, did the products of HGT arise ancestrally in the analyzed taxa, becoming shuffled over time through recombination and high rates of decay, or are HGT products continually sculpting the variation within the accessory genome overtop of a highly reductive nature of all genes within the genome?The recent explosion of reported cases of plasmids in all rickettsial groups except TG rickettsiae argues for the latter scenario, and is congruent with our findings of nearly zero instances of plasmid associated genes, genes typical of HGT events and gene duplications within TG rickettsial genomes.Thus, while many Rickettsia seem to be able to accept and pass genes of the mobile gene pool, the contribution of HGT products to pathogenicity is unknown and seemingly nonessential to all known rickettsial pathogens.The role lineage specific virulence factors play in pathogenic strains is thus an important aspect of future laboratory work.While HGT was traditionally considered rare in Rickettsia, we recently suggested, based on a detailed analysis of the R. felis pRF genes, that it is more common, particularly among species in which conjugation systems had yet been discovered [28].Our suspicions have recently been verified [70] and the exact degree HGT contributes to rickettsial diversification will only be elicited with the accumulation of more rickettsial genome sequences.Such endeavors will challenge our existing classification scheme; however, a preliminary analysis of two recently published SFG rickettsiae genomes (R. massiliae str.MTU5 and R. africae str.ESF 5) using genome alignment (Figure S1-E) and phylogeny estimation (Figure S3) does not overturn our results, and we predict that OGs generated with the inclusion of these new genomes will not alter the conclusions reached herein.
Finally, we present two concerns regarding phylogenomic analysis of Rickettsia.First, the high degree of pseudogenization in rickettsial genomes means that OG prediction programs and related methods alone are insufficient for grouping related genes.Manual inspection of algorithm output is imperative, as the high occurrence of split genes will lead to overestimation of nonrepresentative OGs as well as inaccuracies in ORF clustering (see Table S1).Second, and perhaps more pressing, is the revelation that rickettsial species may be comprised of highly variable genomes, particularly across exceedingly divergent strains.Attesting to this, our analysis of predicted OGs included two strains of R. bellii that shared 321 species-specific genes but contained 97 (str.RML369-C) and 117 (str.OSU 85 389) strain-specific genes.Similarly, a recent genomic comparison of R. rickettsii str.Sheila Smith CWPP with the avirulent R. rickettsii str.Iowa revealed 143 These OGs likely include pseudogenes, genes with less conserved functions in rickettsiae, and laterally acquired genes.Black =strictly representative OGs, blue = strictly non-representative OGs, red = both representative and non-representative OGs.Top numbers depict total number of OGs and bottom numbers show proportion of hypothetical proteins.Numbers in parentheses depict the proportion of non-representative OGs made representative via concatenation of split ORFs (see Table S1).Asterisks denote distributions that are made entirely representative after split ORF concatenation (27 of 47 non-representative distributions; see Table 4 and Table S1).Found at: doi:10.1371/journal.pone.0002018.s002(2.99 MB PDF) Figure S3 Phylogenetic analysis of 14 rickettsial taxa.Tree estimated using the same 16 proteins as the analysis in Figure 9, with the addition of orthologous sequences from the recently completed genomes of R. massiliae str.MTU5 and R. africae str.ESF 5 (sequences obtained from WGS reads using tBlastn).Tree estimated under parsimony (see text for details).Found at: doi:10.1371/journal.pone.0002018.s003(0.34 MB PDF) Table S1 Distribution and characterization of predicted ORFs within 259 non-representative OGs across ten rickettsial genomes, and the results after manual curation.Found at: doi:10.1371/journal.pone.0002018.s004(0.18 MB PDF) Table S2 Seven hundred-fifty two core rickettsial OGs predicted across ten analyzed genomes.Found at: doi:10.1371/journal.pone.0002018.s005(0.19 MB PDF) Table S3 OGs present only in the R. bellii genomes.Found at: doi:10.1371/journal.pone.0002018.s006(0.06 MB PDF) Table S4 Distribution of putative toxin-antitoxin (TA) systems within the rickettsial OGs predicted by OrthoMCL.Found at: doi:10.1371/journal.pone.0002018.s007(0.07 MB PDF) Table S5 Singletons present in the R. bellii str.RML369-C genome.Found at: doi:10.1371/journal.pone.0002018.s008(0.05 MB PDF) Table S6 Singletons and false singletons present in the R. bellii str.OSU 85 389 genome.

Figure 2 .
Figure 2. Alignment of 10 rickettsial genomes.Taxa are in the same position as in estimated trees in Figure 3, with taxon abbreviations explained in the Figure 1 legend.Alignment created using Mauve [189] after reindexing the R. sibirica genome (see text for details).doi:10.1371/journal.pone.0002018.g002

Figure 3 .
Figure 3.Estimated phylogenies of ten rickettsial taxa based on 731 representative core proteins.(A) Tree from Bayesian analysis.Three MCMC chains were primed with a neighbor-joining tree and run independently for 25000 generations in model-jumping mode.Burn-in was attained by 2500 generations for all chains, and a single tree topology with exclusive use of the Jones substitution model was observed in post burn-in data.The consensus tree shown here thus has 100% support for every branch.Branch support is from the distribution of posterior probabilities from all trees minus the burn-in.(B) Tree from exhaustive search using parsimony.Branch support is from one million bootstrap replicates.doi:10.1371/journal.pone.0002018.g003

Figure 4 .
Figure 4. Illustration of representative and non-representative OGs and their categorization into Class 1 and Class 2 OGs.Taxon abbreviations are explained in the Figure 1 legend.Dark circles depict gene presence, while open circles depict gene absence.(A) Representative OGs: orthologous groups with only one ORF per included genome.Our analysis includes ten rickettsial genomes, thus representative OGs only include from 2-10 ORFs.Four examples are shown.(B) Non-representative OGs: orthologous groups with multiple ORFs from at least one included genome, comprised of either recent (orthologs) or distant (paralogs) gene duplications (dupl).False singleton OGs are comprised of only one taxon, but with multiple ORFs from that taxon (example on right).Four examples are shown.(C) Class 1 OGs (C1OGs): orthologous groups comprising single rickettsial groups (e.g., AG, TG, TRG, and SFG), shared rickettsial groups (subgeneric), plasmid-harboring genomes, and genomes with common arthropod hosts.Two representative (left) and two non-representative (right) C1OGs are shown.(D) Class 2 OGs (C2OGs): orthologous groups with patchy distribution across the rickettsial tree, depicting gene losses and/or genes acquired laterally.Two representative and two non-representative C2OGs are shown.doi:10.1371/journal.pone.0002018.g004

3 C2OGs (see Figure 4
for description and Figure S2 for distribution of representative and non-representative C2OGs across rickettsial phylogeny).4 Percentage of 637 C2OGs present within each rickettsial genome.The 128 distributions of these OGs are illustrated in Figure S2.

Figure 7 .
Figure 7. Distribution of representative and non-representative class 1 OGs (C1OGs) and singleton ORFs over estimated rickettsial phylogeny.Boxes depict the distribution of phylogenetic groups, singletons, plasmid associated groups, and host-related groups: Red = AG rickettsiae, aquamarine = TG rickettsiae, blue = TRG rickettsiae, brown = SFG rickettsiae, gray = higher-level groupings, light green = R. bellii strains only.Orange boxes depict genes found on the pRF plasmid of R. felis str.URRWXCal2 and chromosomes R. felis and both R. bellii strains (as of this publication the R. bellii plasmids remain unavailable).Genes specific to single rickettsial genomes (singletons) are in yellow boxes, with taxon abbreviations explained in the Figure 1 legend.Host specific groups are defined by green (insect) and tan (tick) boxes.Genome statistics were compiled from the PATRIC and NCBI databases.Cladogram is based on trees shown in Figure 3. Inset in dashed box describes general schema for each box.*Total R. felis genome size: 1,485,148 bp = chromosome; 62,829 bp = pRF and 39,263 bp = pRFd.doi:10.1371/journal.pone.0002018.g007

Figure 8 .
Figure 8. Bioinformatic analysis of core representative OGs.(A) Assignment of 731 core representative RiOGs to predicted cellular function categories.Format follows that established at the COG database (NCBI) except for cf = combined function and rpe = rickettsial palindromic element.(B) Comparison of the distribution of cellular function categories across 731 core rickettsial OGs (Ri), a recent protein expression profile for R. felis [40] (Rf), and COGs for three other bacteria: Escherichia coli (Ec), Yersinia pestis (Yp) and Chlamydia trachomatis (Ct).Inset at left shows the number of genes per genome for cellular function categories involved in organic and inorganic transport and metabolism (E, F, G, H, I, P, and Q) followed by the percentage these genes comprise of total protein-encoding genes.Results from a six-way regression analysis are shown in the right inset.doi:10.1371/journal.pone.0002018.g008

Figure 10 .
Figure 10.Analysis of the distribution of 1467 singleton ORFs omitted from OG prediction across 10 rickettsial genomes.(A) Singleton ORFs across four rickettsial groups.(B) Singleton ORFs across 10 rickettsial genomes.First number is total number of singleton ORFs per taxon, with second number the total singleton ORFs annotated as HPs.Dashed lines in pie charts separate characterized proteins from HPs, with percentages given only for HPs.(C) Average lengths of singleton ORFs with predicted functions versus singleton ORFs annotated as HPs for all ten analyzed rickettsial genomes.doi:10.1371/journal.pone.0002018.g010

Table 1 .
Distribution of representative and non-representative OGs predicted across 14354 ORFs from ten rickettsial genomes, and their categorization into Class 1 and Class 2 OGs.1

Table 4 .
Manual evaluation of 259 non-representative OGs across ten rickettsial genomes.Split genes may be split multiple times, and multiple gene duplications may occur within single genomes (see TableS1).doi:10.1371/journal.pone.0002018.t004
1Not including 52 instances where pRFd ORFs cause or further contribute to non-representation. 2

Table 6 .
OGs missing in the lineage spanning R. canadensis and TG rickettsiae.

Table 7 .
Results of a BLASTP search for RiOG_2081 using RP338 (R. prowazekii) as a query 1 .

Table 8 .
OGs present only in TRG rickettsiae.

Table 11 .
Results of BLASTP searches evaluating two OGs (1496 and 1497) predicted by OrthoMCL to contain only insectassociated rickettsiae.

Table 12 .
Results of BLASTP searches evaluating two OGs (RiOG_1005 and RiOG_1012) predicted by OrthoMCL to contain only tick-associated rickettsiae.

Table 13 .
Distribution of the 68 R. felis pRF plasmid ORFs within the OGs predicted by OrthoMCL A .
ORFs present on the pRF plasmid and the R. felis chromosome