The genomes of Crithidia bombi and C. expoeki, common parasites of bumblebees

Trypanosomatids (Trypanosomatidae, Kinetoplastida) are flagellated protozoa containing many parasites of medical or agricultural importance. Among those, Crithidia bombi and C. expoeki, are common parasites in bumble bees around the world, and phylogenetically close to Leishmania and Leptomonas. They have a simple and direct life cycle with one host, and partially castrate the founding queens greatly reducing their fitness. Here, we report the nuclear genome sequences of one clone of each species, extracted from a field-collected infection. Using a combination of Roche 454 FLX Titanium, Pacific Biosciences PacBio RS, and Illumina GA2 instruments for C. bombi, and PacBio for C. expoeki, we could produce high-quality and well resolved sequences. We find that these genomes are around 32 and 34 MB, with 7,808 and 7,851 annotated genes for C. bombi and C. expoeki, respectively—which is somewhat less than reported from other trypanosomatids, with few introns, and organized in polycistronic units. A large fraction of genes received plausible functional support in comparison primarily with Leishmania and Trypanosoma. Comparing the annotated genes of the two species with those of six other trypanosomatids (C. fasciculata, L. pyrrhocoris, L. seymouri, B. ayalai, L. major, and T. brucei) shows similar gene repertoires and many orthologs. Similar to other trypanosomatids, we also find signs of concerted evolution in genes putatively involved in the interaction with the host, a high degree of synteny between C. bombi and C. expoeki, and considerable overlap with several other species in the set. A total of 86 orthologous gene groups show signatures of positive selection in the branch leading to the two Crithidia under study, mostly of unknown function. As an example, we examined the initiating glycosylation pathway of surface components in C. bombi, finding it deviates from most other eukaryotes and also from other kinetoplastids, which may indicate rapid evolution in the extracellular matrix that is involved in interactions with the host. Bumble bees are important pollinators and Crithidia-infections are suspected to cause substantial selection pressure on their host populations. These newly sequenced genomes provide tools that should help better understand host-parasite interactions in these pollinator pathogens.


Introduction
Previous genomic studies of trypanosomatids have largely focussed on the dixenous taxa, such as Trypanosoma [43,44] and Leishmania [45,46]. Only more recently monoxenous representatives were added, e.g. Lotmaria passim (formerly, Crithidia mellificae) [17,47], or Leptomonas pyrrhochoris [19]. Collectively, these studies underline that the genomic organization of trypanosomatids is unusual in several respects [48]. For example, the characteristic, single kinetoplast, located at the base of the flagellar pocket, contains the kinetoplast DNA (kDNA), which is organized as a set of circular DNAs. The so-called maxi-circles (typically, a few dozen) contain fairly conserved regions coding for functions similar to mitochondrial genes in higher organisms (e.g. in C. fasciculata, maxi-circles are 38 kb in size), whereas the numerous (typically, many thousands), genetically heterogeneous mini-circles (2.5 kb in size) carry sequences that are involved in RNA-editing [49][50][51][52]. Moreover, genomic sequences are typically organized in polycistronic clusters (that is, the same mRNA is coding for different genes), the gene-coding regions generally lack introns, and post-transcriptional mechanisms rather than changes in transcription rate are used to regulate gene expression [53][54][55][56][57], whereby gene duplication is common to increase expression levels.

Sequencing and assembly of the genome
The final genome assembly size of Crithidia bombi is 31. 66 Mb and is well resolved with 206 scaffolds (Table 1); the GC-content is 55.8%. The scaffolds contain a total of 585 contigs with a N50-contig size of 124.6 kb. The average scaffold and contig length is 155.5 kb and 54.1 kb, respectively; scaffold N50 is 855 kb long. Based on the assembly size and the total amount of cleaned sequencing reads, the average sequencing coverage is estimated at 243-times. The CEGMA-analysis on the final assembly resulted in 179 complete (72.2%) and an additional 14 (5.6%) partial matches to core eukaryotic genes. Being the second species to be sequenced, and with the genome of C. bombi already at hand, we only used PacBio for C. expoeki, which lead to a lower, but still sufficient coverage (estimated at 62-times; Table 1). We thus managed to establish a high-quality genome of Crithidia expoeki with 222 contigs of 34.08 Mb (Table 1); the GC-content is 54.4%. The contigs do not contain any gaps, have an average length of 153.5 kb and a N50 size of 592.2 kb. The longest contig has a size of 3,079,598 bp. The CEGMA-analysis resulted in 184 complete (74.2%) and an additional 11 (4.4%) partial matches to core eukaryotic genes.
To check the completeness of the assembly, we also ran a BUSCO analysis [58]. For C. bombi, this showed 140 complete and single-copy, 1 fragmented and 74 missing BUSCO orthologs. For C. expoeki we found 148 complete and single-copy, 12 complete and duplicated, and 67 missing BUSCO orthologs. This compares well to the L. major values of 175 complete and single-copy, 10 complete and duplicated, and 40 missing BUSCO orthologs.
Furthermore, there is a high degree of synteny between the two species as shown in Fig 2. Quantitatively, the SyMap synteny analysis revealed 5,098 anchors in 99 syntenic blocks with conserved gene order among C. bombi and C. expoeki [59]. Of these conserved blocks, 18 are smaller than 100 kbp, 80 blocks are between 100 kbp and 1 Mbp, and one block is larger than one Mbp. The average block size in C. bombi is 299 kbp, the smallest block is 32 kbp, the largest 1.596 Mbp. Average block size in C. expoeki is 291 kbp, the smallest block is 27 kbp, and the largest 1.559 Mbp (S2 Table). The analysis for C. bombi and the seven additional species revealed between 71 syntenic blocks (with T. brucei) and 280 blocks (with L. seymouri). The analysis with C. expoeki and the additional species revealed between 75 syntenic blocks (with L. major) and 260 blocks (with L. seymouri). Thus, not surprisingly, C. bombi and C. expoeki have lesser degrees of synteny with the other taxa in the set (S1 and S2 Figs), notably with T. brucei. As expected from the phylogenetic distances, a high degree of synteny is detected with L. pyrrhocoris. A similar degree of synteny is expected with L. seymouri, however, the analysis is hampered by the comparably fragmented draft assembly (e.g., N50 is only 70 kbp), resulting in the majority of syntenic blocks being shorter than 100 kbp. Surprisingly, the two species under study here show rather high degrees of synteny with the phylogenetically more distant Blechomonas ayalai (S1 and S2 Figs).
In addition, the distribution of GO-terms among the annotated proteins are, as expected, very similar in the two species under study (Fig 3). Differences were nevertheless visible, as there are more proteins in C. bombi that have been assigned to the protein-or ATP-binding categories, whereas C. expoeki seems richer in other categories, such as in serine family metabolic processes, glycolysis, or proteins associated with microtubules. Here we show sections of the genomes (kilobases, kb) of C. bombi and C. expoeki (top two panels; (scaffold3_4 and scf7180000000921, respectively) and the syntenic region in L. major (bottom panel) as an example of overall synteny among these genomes. Green arrows are gene sequences coding for proteins, as based on annotations in L. major and as indicated at the bottom. Reversed (left-facing) arrows indicate polycistronic regions. Note that, in this example, no introns are present. The red arrow refers to the amastin-like protein (LmjF.34.0970 in L. major), which is an ortholog to gene Ce.1.39770 (C. expoeki) and Cb.1.06720 (C. bombi). Two further amastin-like proteins are immediately up-and downstream from this location. The grey bars connect orthologs within the same orthologous group, as based on the OA analysis, and demonstrate a high degree of synteny among the three species. The yellow zone represents a gap in the C. bombi scaffold. https://doi.org/10.1371/journal.pone.0189738.g001

Orthologs
The OMA browser identified orthologous sequences from the two genomes studied here relative to the set of the other trypanosomatid species as available in the TriTryp database (Table 2), and using the protein annotations for both Crithidia spp. from MAKER2 as described in the Materials and Methods and from the ENSEMBL Protist database. OMA also generated groups of orthologs that are shared between a minimum of two and up to all (eight) taxa in the comparison; these OMA groups were analysed further below.  [59,60]. The plot shows all syntenic blocks between the scaffolds of C. expoeki (bottom half of the circle) mapping to scaffolds of C. bombi (upper half of the circle). Each coloured block indicates a scaffold of the respective genome. Syntenic blocks are linked with lines in the colour of the C. expoeki scaffolds. For illustrative purposes, a few scaffolds (as named in this study) are indicated at their approximate position in the circle.  In all, the two Crithidia-species studied here have substantial overlap with L. pyrrhocoris (for C. bombi: a total of 6,777 orthologs overlapping, C. expoeki: 6,333), and marginally fewer with  Trypanosomatids share some of their physiology and ultrastructure, but, for example, surface molecules show lineage-specific elements [62,63]. These are of great interest, since the cell surfaces of parasitic protists interact with the host in many ways and thus can determine essential properties, such as infection success or parasite virulence. Because of the specialized mechanisms for polycistronic gene expression in trypanosomatids, genes coding for cell surfaces are organized in groups, the 'contingency gene families'. It is thought [63] that a process of concerted evolution, also observed in trypanosomatids, can lead to a loss of orthology in surface molecules because derived sequences, originating from gene duplication as paralogs, gradually replace the ancestral ones, such that the derived genes within one species are more closely related to one another than with the homologs in other taxa. Whilst the paralogs can shift to new genomic locations and assume new functions, the remaining orthologs occupy similar genomic positions throughout the clade.
Several surface gene families are of interest in this context, such as the Major Surface Proteases (MSPs) that are found across all trypanosomatids [63,64]. These show signatures of selection [65] and divergence among lines [64]. Typically, MSPs encode metalloproteases that counteract the host's immune defences; for example, in Leishmania MSPs block macrophage activity [66], among other effects [67]. MSP homologues and metalloprotease activities are known for several taxa in the narrow or more distant vicinity of C. bombi and C. expoeki, e.g. in C. fasciculata, C. luciliae, C. deanei, C. guilhermei, Leptomonas seymouri, Bastocrithidia culicis [67], or Leptomonas pyrrhocoris [19]. In the following cases, we extracted all orthologs that carried a particular annotation (such as 'gp63', or 'amastin') in the TriTryp database and reconstructed the phylogeny within the set of the eight species.
gp63. Among the MSPs, glycoprotein 63 (gp63) is one of the best studied [67]. It is involved in adhesion to host cells in Leptomonas [68]. We explored the relationships of gp63 in the two Crithidia species studied here with those reported from the other trypanosomatids in our set of eight species. A total of four sequences for C. bombi and eight for C. expoeki, which met with the annotation 'gp63' in better studied taxa, were identified (see S3 Table). The other taxa in the set had a total of 69 sequences. Our phylogenetic analysis found all sequences of T. brucei clustered on a separate branch, whereas all other sequences intermingled with one another across the tree (Fig 5A). Except for one case (Ce.1.70950 pairing with Cb.1.37410), a given sequence for C. bombi was never closely associated with one from C. expoeki, and vice versa, whereas a conspicuous cluster of five sequences was found for C. expoeki only.
Amastins. These are transmembrane glycoproteins present in the surface of trypanosomatids and are expressed particularly in dixenous taxa when entering the mammalian host. In our set, we found a total of 190 sequences that met the criterion. Amastins have diversified in Leishmania (L. pyrrhocoris and L. seymouri together had 63 sequences), but orthologs are also found in the monoxenous species such as Crithidia [69], with C. fasciculata alone contributing 43 sequences in our study. Our annotation had identified 18 sequences in C. bombi, and 24 sequences in C. expoeki. The phylogenetic analysis ( Fig 5B) showed that amastins of the two species are found in all parts of the tree and group with those of other taxa in a seemingly arbitrary way. With few exceptions (e.g. the clusters containing Ce.1.77300, or Ce.1.28590; Fig 5B), C. bombi and C. expoeki always have representatives each within the same neighbourhood (clusters defined by the longer branches followed by small radiations; Fig 5B), suggesting that diversification of amastins in the two species happened in parallel several times along the different branches. Also, amastins of the two species are often nearest to Leptomonas or C. fasciculata ( Fig 5B).
Tryparedoxin and RAD51. These enzymes are unique for trypanosomatids and essential for their infection success, for example, qualifying as 'virulence factor' in Leishmania [71]. Their biological role is thought to be in the mitigation of host defences by oxidative stress [72,73], but this process may actually be independent of tryparedoxin, at least in L. infantum [74]. In our analysis, we find a total of 96 tryparedoxin-like sequences in the set of eight species, with 11 sequences in C. bombi and 14 in C. expoeki. The reconstructed phylogeny suggests a similar pattern as found in the amastins, i.e. the tryparedoxins of the two species are found at different places in the tree, and the clusters contain representatives of each (except Ce.1.65800 and Ce.1.65740) (S3A Fig) C. bombi shows frequent genetic exchange when different genotypes co-infect the same host [36], and the recombination pattern is consistent with Mendelian segregation as also reported in other trypanosomatids [39]. RAD51, for example, is part of the recombination system that underlies the variable expression of surface molecules and which is based on an archive system as best described for the 'African Trypanosomes' [75,76]. In this process, the infecting parasite changes its antigenic surface by retrieving variants from the archive in a programmed way such as to escape the detection by the host's immune system [77][78][79]. In our study, OMA identified at total of 31 sequences carrying a 'RAD51'-annotation in the set of eight taxa, of which four each were assigned to C. bombi and C. expoeki, respectively. Again, the same pattern as above emerged, with the sequences of the two species found in pairs around the tree (S3B Fig).

Signatures of selection
For this exploratory analysis, a total of n = 2,934 one-to-one orthology groups with entries for all eight species was available. Across the whole phylogeny (testing model M8 vs. M7 from PAML for strict trimming), a total of 350 orthology groups showed signs of significant selection and had some annotation information, whereas the remaining 119 groups had no annotation (based on annotations for C. bombi). In the example of strict trimming (S4 File), the most common meaningful annotations were 'dynein heavy chain' (n = 8 groups), 'ATP-dependent RNA helicase' (n = 6), 'protein kinase' (n = 4), or 'ABC transporter' (n = 15), whereas un-informative groups, such as ' hypothetical protein' (n = 32) and 'missing' (n = 158) were most frequent. We eventually found 380 orthologous groups showing evidence of positive selection (after Benjamini-Hochberg correction) for all trimming criteria; no trimming resulted in the most groups (n = 919), followed by 'relaxed' (n = 685) and 'strict' (n = 469) (S5 File, S4 Fig).
We also detected evidence of significant positive selection on the branch leading to Crithidia (BS model from PAML). A total of 86 groups tested significant for positive selection (after Benjamini-Hochberg correction) for all trimmings. The largest number of groups showed significance with no trimming (n = 522), followed by 'relaxed' (n = 316) and 'strict' (n = 91) (S6 File, S5 Fig). The significant groups in the BS-model contained very similar annotations as for the M8-model, as shown with the example of strict trimming (S4 File). Among the 91 significant groups, the most frequent categories were 'missing' (n = 33 groups) and 'hypothetical protein' (n = 14), followed by 'dynein heavy chain' (n = 3), and many others that appeared only once.
To test whether genes are evolving more rapidly in Crithidia than in the rest of our trypanosomatid tree, we compared the signature of selection from the M8-model that covers the whole phylogeny to the results from the branch-site (BS) model that calculates the selection only on the branch leading to Crithidia. Only groups that tested significant at P < 0.05 after a Benjamini-Hochberg correction, and that were significant in both models (M8 and BS) were included; a total of 23 groups met this criterion. Yet, the BS-model generated very high values of ω, which are biologically unlikely and may result when the proportion of sites assigned to ω > 1 is very small or when there is not enough information in the data to accurately infer the value of ω. Only one group (ID 261, annotated with 'dynein heavy chain') was within reasonable limits (i.e. in the range ω < 10).

Metabolic pathways: Example of N-glycan
As an example of the metabolic pathways found in Crithidia, we checked the pathway that leads to N-glycans of C. bombi, which, in eukaryotes, have many functional roles. N-glycans are assembled in the endoplasmic reticulum, transferred to selected asparagine residues (N-X-S/T sequon) of polypeptides that enter the secretory pathway and are further modified in the Golgi. In this study, we could not fully resolve the structure of the N-glycans, but instead focus on the evolutionary analysis of critical enzymes in the cascade, especially on ALGs ('asparagine-linked glycosylation'). These are enzymes that catalyse the addition of sugars to conserved oligosaccharide precursors in the endoplasmic reticulum.
In our analysis, we blasted putative genes from C. bombi against known sequences from yeast in the data bases. We found that N-glycan synthesis in the endoplasmic reticulum must be different from other eukaryotes. As is the case for other parasitic kinetoplastids [80], C. bombi lacks genes involved in the glucosylation of LLOs (lipid-linked oligosaccharides) in the luminal side of the endoplasmic reticulum, that is, ALG 5,6,8,10. In addition, the absence of the ALG12 locus (encoding a Dol-P-Man-dependent α-1,6 mannosyltransferase) suggests the transfer of a Man7GlcNAc2 oligosaccharides by oligosaccharyltransferase (OST; Fig 6, S6 Fig), which is similar to Crithidia and Leishmania but different from other trypanosomatids. ALG13 and ALG14 encode glycosyltransferase subunits, responsible for the second GlcNAc-addition (N-acetylglucosamine) to the LLO. The catalytic activity is with Alg13, but this is not activated unless when bound to Alg14-at least based on what is known from yeast [81]. In C. bombi, the linker sequence in the fused Alg13/Alg14 does not seem to take part in the folding of the active protein, although we cannot conclusively confirm this, as the protein structure could not be fully analysed. Alg13/Alg14 are encoded as a fused gene on a scaffold of the C. bombi genome ( Fig 6B).
Furthermore, C. bombi does not encode any non-catalytic subunits of oligosaccharyltransferase (OST; Fig 6), but three paralogs of catalytic subunit Stt3 (S7 Fig) located on scaffold 3/ 64. One of them has a large insertion sequence and thus might be a vestigial gene. We cloned the remaining two paralogs (CbSTT3A, CbSTT3B) for functional tests. It appears from these tests that CbSTT3A but not CbSTT3B can functionally complement the mutant stt3 from yeast (S8 Fig). CbSTT3B has a mutation in the DxE motif (corresponding to a DxD motif involved in binding a divalent cation [82,83]). It has been suggested that the ALG-glycosyltransferase set is secondarily lost [84]. Alternatively, the eukaryotic ALG pathway might have evolved by the addition of endoplasmic reticulum lumen-oriented glycosyltransferases [85]. Irrespective of the model applied, the systems in Crithidia and Leishmania were probably branching from a common ancestor that itself branched from Trypanosoma. Because Leishmania species encode three active STT3 genes [86], the inactivation of CbSTT3s may have happened early as the lineage split from the common ancestor.

Discussion
The biology of the genus Crithidia and, in particular, C. bombi has been the focus of a number of ecological, evolutionary studies over the last two decades. Taken together, these have explored and analysed the wide-spread occurrence of this parasite, e.g. [87][88][89][90], its effects on the host [30, 91,92], the genetic structure of populations [36, 88,89], the dynamics of multiple infections [93], or scrutinized the host genes and their expression when defending against this pathogen [94][95][96]. At the same time, the toolbox to study such questions had expanded over the years. For example, polymorphic microsatellites and SNPs are now available for C. bombi [36]. Both, C. bombi and C. expoeki, can be extracted from infected wild bees, and clones from single cells established, multiplied and maintained in pure culture [97]. The Bombus-Crithidia system is therefore very accessible for field work and laboratory studies.
Sequencing and annotation of the full genomes of C. bombi and C. expoeki now adds further important elements to the toolbox and adds a substantial amount of detailed information on these parasites. For C. bombi, the two-step assembly procedure resulted in an estimated genome of 31. 66 (55 Mb [48]). Furthermore, our analysis suggests that both, C. bombi and C. expoeki, have around 7,800 protein-coding genes ( Table 2). These numbers are quite comparable with estimates for L. seymouri (8,595 genes), and B. ayalai (8,126), but somewhat smaller than those in the other species of the set (c.f. Table 2). Some of this difference can be accounted for by the much more detailed study that some of the other species, such L. major, T. brucei, or C. fasciculata, have received. Alternatively, it could reflect evolutionary change that, for example, can lead to lineage-specific gene loss as a result of adaptation to parasitism and specific host groups (e.g. [9,46,80]. The average polycistronic gene cluster length of 142 kb in C. bombi and 114 kb in C. expoeki are slightly shorter than the sizes predicted for pol II clusters in T. brucei (153 kb) and L. major (180 kb) [98].
Taken together, we can see that the two genomes reported here are quite typical for the trypanosomatids, and the kinetoplastids more generally, in many of their aspects, such as genomic organisation (Figs 1 and 2) or the gene repertoire (Fig 3). Just like other trypanosomatid parasites [48,99], the two Crithidia species studied here also show polycistronic gene organisation and relatively few introns (S1 Table). At the same time, the phylogenetic reconstruction of gene trees shows a pattern consistent with concerted evolution, also similar to other trypanosomatids. The sequences of C. bombi and C. expoeki occur in 'pairs' (with 'pairs' sometimes meaning more than two sequences) within the same clusters and these 'pairs can be located in different parts of the tree (Fig 5, S3 Fig). In other words, these C. bombi and C. expoeki genes are each closely related to one another and to homologous sequences in the other trypanosomatids, but more distant from further, functionally similar genes (based on current annotations) within the same organism. An interesting slight deviation is visible for our reconstruction of gp63 (Fig 5A) because the sequences from T. brucei are conspicuously separated from all others, whilst C. expoeki shows an expansion to five sequences and seems more separated from C. bombi than observed in the other genes studied.
Because of the unusually high rates of genetic exchange observed in C. bombi [36], genes associated with the recombination machinery are particularly interesting. As an example, RAD51 is a conserved part of the recombination system that otherwise underlies the variable expression of surface molecules in the 'African Trypanosomes' [75,76]. We identified several orthologs of RAD51 in the two Crithidia species studied here, and, again, indications of concerted evolution, which can lead to the observed pairing of sequences from C. bombi and C. expoeki in different parts of the tree (S3B Fig). The function of RAD51 in the Crithidia species studied here, and whether their recombination system may be different, must remain unclear for the time being.
Our current analysis is still rather preliminary but for some genomic aspects, C. bombi and C. expoeki are closer to Leptomonas (especially, L. pyrrhochoris [19]) than, for example, to C. fasciculata-the 'type' species-or C. mellificae (Lotmaria passim)-a parasite of the honeybee [17]. On the other hand, C. bombi and C. expoeki are distinct in many aspects from the rest. For instance, a number of genes appear to be under positive selection in the branch leading to Crithidia (S6 File), many of those with unknown function, yet, 'dynein heavy chain' appearing prominently (S4 File). These proteins are typically associated with microtubules and involved in flagellar movement, e.g. [100]. Looking at the example of the initiating glycosylation pathway of surface components, we also found that the two Crithidia studied here deviate from most other eukaryotes and are also somewhat different from other kinetoplastids. For example, the two species lack various ALG glycosyltransferases, and all of them lack non-catalytic OST subunits that are typical for the eukaryotic pathway (Fig 6, S6-S8 Figs). The observed differences in canonical kinetoplastid N-glycosylation pathways could be the result of a rapid evolution of the glycan components of the extracellular matrix in kinetoplastids, possibly driven by a strong selection pressure exerted by the defence system of the host. Hence, we find many similarities to other trypanosomatids and a few differences. In this first study, however, there are no conspicuous, unique genomic features that could readily be associated with the simple, direct life cycle of the two species studied here and which would contrast with those genomic characteristics of species that have more than one hosts and/or a vector.
C. bombi, and very likely C. expoeki too, are important pathogens of bumblebees and arguably represent a considerable threat for the provision of pollination services by these bees. For example, infected spring queens loose nearly half of their reproductive success, thus exerting considerable selection on host populations [30]. These newly sequenced genomes represent another major step to better understanding these host-parasite interactions. Clearly, these genomes need to be analysed in more detail and, in particular, more functional tests are needed to develop a deeper insight into the genetic underpinnings of the infection and virulence processes.

Origin of samples and DNA extraction
We isolated our reference strain of C. bombi from a spring queen of Bombus terrestris L., collected on April 10, 2008, in Northeastern Switzerland (site: 'Neunforn'). We separated an individual cell and grew it clonally according to the methods described in [97] (designated as clone #08.076 in our project archives). We then extracted genomic DNA with the Blood & Cell Culture DNA Midi Kit (Qiagen, cat. no. 13343) according to the smanufacturer's instructions. We similarly isolated and extracted genomic DNA for a strain of C. expoeki from a B. lucorum worker collected in the Jura mountains (site 'Röschenz', on June 9, 2008) (clone #BJ08.175).

Genome sequencing
We sequenced the full genome of Crithidia bombi using a combination of sequencing runs on the Roche 454 FLX Titanium, Pacific Biosciences PacBio RS (starting in 2009; both at the Functional Genomics Center Zurich, FGCZ; http://www.fgcz.ch), and Illumina GA2 at GATC-Biotech (Konstanz, Germany). We used a total of 11 fragment libraries constructed for the 454-platform (9 paired-end libraries and 2 single-end libraries; S3 Table). In total, we generated 7,127,289 sequence reads with a mean length of 446 bp on this 454-platform. We produced one single-molecule real-time (SMRT) library for the PacBio platform according to the manufacturer's recommendations (Pacific Bioscience; but slightly modified, as we had to start with 10-20 μg, rather than 5 μg as suggested, sheared with g-tubes from Covaris, pn 520079), and then sequenced the library at the FGCZ on six SMRT cells and according to FGCZ's protocols. Our PacBio sequencing generated 270,958 sequences with mean length of 2,517 bp. For the Illumina, we constructed and sequenced four fragment libraries, producing 65,082,902 single-end reads of 76 bp length. Reads containing adapters were trimmed with cutadapt [101]. Quality-filtering and trimming was done with condetri.pl [102]. We used the Illumina reads to error-correct the PacBio reads with the pacBioToCA module of the WGS-Assembler version 7.
We sequenced the full genome of Crithidia expoeki with the Pacific Biosciences PacBio RS platform at the FGCZ. One SMRT library was constructed and sequenced on 9 SMRT cells, generating 381,293 sequences with a mean length of 7,181 bp; trimming was done within a local installation of the Pacific Bioscience SMRT portal version 2.3.0.

Genome assembly
We assembled the Crithidia bombi genome in two steps. First, we assembled all Roche 454 sequence reads using the runAssembly command line interface of the 454 GS de novo assembler version 2.7 with default settings, except for minimum overlap length (set to 40 bp), minimum overlap identity (set to 95%) and minimum contig length (set to 100 bp). The resulting assembly contained 265 scaffolds, a scaffold N50 of 658k bp, and a total size of 32.1 Mb. In a second step, we error-corrected the PacBio sequence reads with pacBioToCA [103] from the WGS-Assembler version 7 [104] using the Illumina sequence reads. The resulting corrected reads were then used to improve and extend the 454/Roche contigs from the first step using the software PBjelly version 12.9.14 [105]. In order to optimize parameters of the assembly tools and to assess the quality of the final assembly we used the CEGMA tool [106] to count the core eukaryotic genes. Higher numbers of complete proteins are an indication of a more complete and accurate assembly.
For Crithidia expoeki, we assembled the Pac Bio reads with a local installation of the PacBio SMRT Portal using the 'RS_HGAP_ Assembly.2' assembly protocol after filtering subreads to a minimum length of 500 bp, minimum quality of 0.75, and a seed read length of 8,000bp. A total of 367,242 reads with mean length 7,446 bp remained after filtering. We then used the Celera Assembler using the following settings: genome size = 35 Mb, target coverage = 25, overlapper error rate = 0.07, overlapper min length = 50, overlapper k-mer = 16. Finally, we used Quiver [107] in the polishing step using only the unambiguously mapped reads. We manually inspected the assembly and removed 73 scaffolds with less than 10x coverage.
In order to assess the completeness of the assembly we ran BUSCO v2.0.1 with the protist ensemble database downloaded from the BUSCO website. The option "-long" was set to turn on the Augustus optimization mode. BUSCO was run on the final genome assemblies of C. bombi and C. expoeki and for comparison also on the genome of L. major.
All reads are deposited in the European Nucleotide Archive (ENA) under accession numbers PRJEB21108 (C. bombi) and PRJEB21109 (C. expoeki).

Transcriptome sequencing and assembly
We isolated RNA from the clones of both Crithidia species using the RNEasy Mini Kit (Qiagen cat. no 74104) according to manufacturer's instructions. We then sent the extracted RNA to BGI (Beijing Genomics Institute) for sequencing on the Illumina platform, resulting in 53,695,762 paired-end sequencing reads of 100 bp length and 300 bp insert size. We removed read duplicates with filterPCRdupl.pl (condetri: PCRdupl_v1.01.pl) and trimmed adapters with cutadapt [101]. Finally, we used condetri.pl [102]to quality-filter and trim the reads using the default settings, except for parameters lq (set at 15), lfrac (set at 0.05), ml (set at 1) and minlen (set at 40). We assembled the resulting reads into transcripts with the software Trinity (r2012-05-18 [108]) using 'path reinforcement distance' set at 45 and 'group pairs distance' at 600.

Gene prediction and annotation
We produced automated gene predictions and structural annotations using MAKER2 [109] with the gene prediction tools SNAP [110], Augustus [111], and GeneMark-ES [112]. A de novo repeat library was constructed using RepeatModeler version 1.0.5 (http://www. repeatmasker.org/RepeatModeler.htmlh). We combined the proteins from the UniProt Swiss-Prot protein database [113] and all RefSeq proteins of Leishmania major, L. mexicana, Trypanosoma brucei, and T. cruzi available at the National Center for Biotechnology Information (NCBI) as evidence for protein homology. As EST evidence, we used the Trinity assembled transcripts. Two iterative MAKER2-runs were made to produce a final set of gene predictions and structural annotations. In a first run, MAKER2 was set to use EST evidence for predicting gene models (option: est2genome = 1) but not use the gene prediction tools. The resulting gene models (gff and fasta files) were collected. The corresponding protein translations where searched against a protein database containing all Swiss-Prot proteins and all RefSeq proteins of Leishmania major, L. mexicana, Trypanosoma brucei, and T. cruzi with BLAST+ v2.2.23 [114] using the algorithm blastp (numbering 8,316 hits at the time). Proteins with an E-value smaller 1 x 10 −8 and a query and target sequence coverage of at least 50% were collected. 500 of these proteins were randomly selected and the corresponding genes used as input for training the Augustus gene prediction tool according to the Augustus training tutorial (http://bioinf. uni-greifswald.de/augustus/binaries/tutorial/training.html) provided by the Augustus authors. Briefly, an initial training was run with the etraining command, then the gene models were optimized with the optimize_augustus.pl script and finally another etraining command was run.
The same 500 genes were used to train SNAP according to the tool's authors workflow. Briefly, the gff-file was converted with maker2zff (a MAKER2 accessory script), then the 500 genes were categorized and exported with the fathom command and model parameters estimated with forge. Finally, new hmm models were created with the hmm-assembler.pl script. The self-training gene predictor GeneMark-ES was run with the default settings on the genomic sequences, producing a GeneMark hmm-file. The resulting Augustus, SNAP and Gene-Mark-ES gene models were now used in a second iteration of MAKER2, this time with the option est2genome set to 0.
To annotate our predicted genes, we deployed reciprocal best hit BLAST using the protein predictions derived from the MAKER2 analysis and all RefSeq proteins of Leishmania major (numbering 8,316 hits at the time) downloaded from NCBI, using custom Perl scripts and BLAST+ v2.2.23 [114] with a cut-off of E 1 x 10 −8 . We searched for L. major homologs to all MAKER2-derived proteins and vice versa. We inferred further functional annotations from gene ontology terms using Blast2GO version 2.7.0 [61], which used the results of BLAST searches against the nt-database with BLAST+ v2.2.23, using a E-value threshold of of E 1 x 10 −8 . We ran BLAST locally and imported these results into Blast2GO. We supplemented the BLAST annotations with results from Interproscan, version 5RC7, which we also run locally and imported these results in Blast2GO. The mapping, annotation augmentation (ANNEX) and analysis steps were then performed using the graphical user interface of Blast2GO using the default settings. The resulting annotations were exported as text files. We assessed the over-representation of gene ontology terms with topGO [115]. The lengths of polycistronic gene clusters containing at least two genes and their gene numbers was extracted from the MAKER2-created annotation files using custom Perl scripts.

Orthologs and synteny
To find protein orthologs in the two Crithidia genomes to other sequenced trypansosomatids, we used a standalone version of the Orthologous Matrix (OMA.1.0.5) [116,117]. For the comparative analyses we used eight taxa. In addition to the annotated proteins of C. bombi and C. expoeki (this study), these were the protein libraries (annotated proteins) of the following taxa, generously made available in the TriTryp data base [118]: Crithidia fasciculata (TriTrypDB-33_CfasciculataCfCl_AnnotatedProteins; provider: BeverleyLab, by permission), Leptomonas pyrrhocoris (TriTrypDB-29_LpyrrhocorisH10_AnnotatedProteins; provider: LukesLab), Leptomonas seymouri (TriTrypDB33_LseymouriATCC30220_AnnotatedProteins; provider: Yurch-enkoLab), Blechomonas ayalai (TriTrypDB-33_BayalaiB08-376_AnnotatedProteins; provider: YurchenkoLab) Leishmania major Friedlin (TriTrypDB-29_LmajorFriedlin_AnnotateProteins provider: GeneDB), and Trypanosoma brucei (TriTrypDB-29_TbruceiTREU927_Annotated-Proteins provider: GeneDB). The choice was guided by having representatives of closer and more distant taxa within the Trypanosomatidae [20,119], and was limited by the computing time needed to run the the identification of orthologs with OMA, and the analysis of signatures of selection, respectively. To confirm the validity of orthologs, we used reciprocal best hit BLAST searches. In our comparative study, we could not include Crithidia mellificae/Lotmaria passim ( [17,47], as the protein database is not available.
Synteny analysis was done with SyMap, version 4.2 [59]. The genome assembly of each species was loaded into SyMap using a minimum contig length threshold of 2000 bp. Alignment and synteny analysis of C. bombi and C. expoeki versus all other species was done using the default parameters, with the exception of activating the "merge blocks" option.

Phylogenetics
We a priori selected a number of interesting genes-based on the biology of trypanosomatid genes discussed in the literature and as indicated for each case below-and extracted the protein sequences for both species of Crithidia, based on orthology to L. major. For the phylogenetic reconstruction of gene trees, we started with the orthologs identified by OMA, within the set of eight species, as defined above. After identification of all orthologous sequences in this set, we extracted all genes that carried a particular annotation (such as 'gp63', or 'amastin') in at least one of the annotations of the genes listed in the OMA groups. Typically, this was found to be the annotations for L. major, arguably the best characterized genome in the set. After identification of all orthologous sequences, we first checked for outliers with a simple neighbour-joining tree of the aligned sequences. Outliers, if any, were checked by submitting the protein sequence to HMMSCAN (biosequence analysis using profile hidden Markov models; available web service: https://www.ebi.ac.uk/Tools/hmmer/search/phmmer). Outliers were removed if the so identified domains were of doubtful support (eventually, only a few cases were removed). Final alignment was subsequently done with MAFFT as web tool (http://www. ebi.ac.uk/Tools/msa/mafft), and the aligned sequences submitted to MrBayes [120] (v3.2.6s) to produce phylogenetic trees, with default settings of four chains, two random start trees, a burn-in period of 25% of trees, and a total of, typically, 10 Mio generations. We used the consensus tree for further visualisation in FigTree v.1.4.2. [70].

Signatures of selection
We analyzed the signatures of selection and estimated rates of evolution using eight taxa, resulting in tests of 2,934 one-to-one orthology groups, i.e. the set containing a sequence for each of the taxa, as identified by OMA. Alignments were done with prank, and either left untrimmed, or trimmed with Gblocks with two trimming options, either the stringent ('strict trimming') or relaxed criteria ('relaxed trimming') [121]. We subsequently used the PAML v4.9 package [122], which implements likelihood-based codon models, for calculating the respective statistics. We identified the best model using likelihood ratios for the best fitting model among pairs of nested models. These differed solely in ω, the ratio of non-synonymous to synonymous substitutions (ω = dN/dS). We took ω > 1 to indicate positive (diversifying) selection, while ω < 1 and ω = 1 indicates negative (purifying) and neutral selection, respectively. Functional and structural constraints mean that most sites in functional genes are conserved, hence, the average ω is not a good indicator for positive selection [123,124]. We instead used the M7-and M8-models from PAML to test for the presence of positively selected sites [123]. In both models, can vary from site-to-site, based on a Beta-distribution in the interval (0, 1), divided into 11 discrete categories, with the last category (ω10) allowed to be ! 1. The model also calculates the proportion of sites, p, associated with ω10, i.e. the proportion of genes under positive selection, which is also the average dN/dS-ratio for those sites.
We used the branch-site model (BS) [125,126] to identify episodes of positive selection on the connecting branches between clades. In our case, we a priori assigned the branch to the genus Crithidia to the foreground, which allows testing for selection on the branch connecting Crithidia to the other species, such that foreground sites are constrained, whilst background branches are either constrained or evolving neutrally. We then compared the branch-site model for each orthology group to its corresponding null model, which assumes no difference between foreground and background branches, to identify orthology groups where the branch-site model better describes the evolution of these genes than the null model. In the BS-model, ω is divided into three categories with (0 < ω < 1), neutral (ω = 1), and positively selected (ω2 > 1; constrained to foreground branches); it returns the proportion of sites, p, associated with ω2, i.e. the proportion of genes under positive selection. We used three different initial estimates for ω and initialized branch lengths to values derived from maximum-likelihood trees constructed in PhyML [127] (using the LG substitution matrix, optimizing only branch-lengths, since the correct topology is known), such that PAML was not caught in local optima. Multiple testing was accounted for with the method of Benjamini-Hochberg [128].  Fig. Functional test of SST3-substitutes. C. bombi-derived CbSTT3A, but presumably not CbSTT3B, can complement the defective mutant stt3Δ from yeast (Saccharomyces cerevisiae) as shown by the appearance of a product. The background was stt3Δ, harbouring two plasmids, expressing C. bombi-derived STT3 (LEU2 marker) and yeast STT3 (URA3 marker); incubation at 30˚C and 4 days. The conditions were with and without 5-FOA (5-fluoroorotic acid), which, in yeast genetics, is used to select for the absence of the URA3-plasmids. (TIFF) S1