Extreme genome diversity in the hyper-prevalent parasitic eukaryote Blastocystis

Blastocystis is the most prevalent eukaryotic microbe colonizing the human gut, infecting approximately 1 billion individuals worldwide. Although Blastocystis has been linked to intestinal disorders, its pathogenicity remains controversial because most carriers are asymptomatic. Here, the genome sequence of Blastocystis subtype (ST) 1 is presented and compared to previously published sequences for ST4 and ST7. Despite a conserved core of genes, there is unexpected diversity between these STs in terms of their genome sizes, guanine-cytosine (GC) content, intron numbers, and gene content. ST1 has 6,544 protein-coding genes, which is several hundred more than reported for ST4 and ST7. The percentage of proteins unique to each ST ranges from 6.2% to 20.5%, greatly exceeding the differences observed within parasite genera. Orthologous proteins also display extreme divergence in amino acid sequence identity between STs (i.e., 59%–61% median identity), on par with observations of the most distantly related species pairs of parasite genera. The STs also display substantial variation in gene family distributions and sizes, especially for protein kinase and protease gene families, which could reflect differences in virulence. It remains to be seen to what extent these inter-ST differences persist at the intra-ST level. A full 26% of genes in ST1 have stop codons that are created on the mRNA level by a novel polyadenylation mechanism found only in Blastocystis. Reconstructions of pathways and organellar systems revealed that ST1 has a relatively complete membrane-trafficking system and a near-complete meiotic toolkit, possibly indicating a sexual cycle. Unlike some intestinal protistan parasites, Blastocystis ST1 has near-complete de novo pyrimidine, purine, and thiamine biosynthesis pathways and is unique amongst studied stramenopiles in being able to metabolize α-glucans rather than β-glucans. It lacks all genes encoding heme-containing cytochrome P450 proteins. Predictions of the mitochondrion-related organelle (MRO) proteome reveal an expanded repertoire of functions, including lipid, cofactor, and vitamin biosynthesis, as well as proteins that may be involved in regulating mitochondrial morphology and MRO/endoplasmic reticulum (ER) interactions. In sharp contrast, genes for peroxisome-associated functions are absent, suggesting Blastocystis STs lack this organelle. Overall, this study provides an important window into the biology of Blastocystis, showcasing significant differences between STs that can guide future experimental investigations into differences in their virulence and clarifying the roles of these organisms in gut health and disease.


Introduction
Blastocystis is a genus of atypical, nonflagellated, anaerobic stramenopiles commonly inhabiting the intestinal tract of humans and other animals. The majority of stramenopiles, which, along with the Alveolata and Rhizaria, are members of the eukaryotic supergroup known as SAR, are marine biflagellated cells with tubular hairs on their surface-Blastocystis has none of these characteristics. The number of humans infected with Blastocystis globally has been estimated at over 1 billion [1], with prevalence being higher in developing than in developed countries [2]. The infective stage of Blastocystis is an environmentally resistant cyst, with the most common mode of transmission being the fecal-oral route. The pathogenicity of Blastocystis is controversial, but colonization with this organism has been linked to gastrointestinal symptoms, including diarrhea, abdominal pain, nausea, and irritable bowel syndrome [3]. Blastocystis is capable of becoming established in the gut and is difficult to eradicate via pharmacological interventions [3]. However, a causal link between the presence of the organism and disease symptoms has not been established [1,3], and some authors argue that Blastocystis is a part of a healthy gut microbiota [4].
At the genetic level, Blastocystis is remarkably heterogeneous. Many morphologically similar but genetically distinct lineages of Blastocystis have been identified, based primarily on sequences of their small subunit (SSU) ribosomal RNA genes [5]. Seventeen lineages have been isolated from mammals and birds to date and are referred to as "subtypes" (STs); the inter-ST divergence of the SSU rRNA gene is at least 3%. Blastocystis STs can have a remarkably broad host range and are almost never found exclusively in 1 host [5,6]. Only STs 1-9 have been found in humans to date, with STs 1-4 being responsible for around 90% of all human cases examined [6].
The high degree of genetic diversity is a confounding factor in establishing whether Blastocystis is a pathogen. Recent in vitro and in vivo molecular investigations have identified hydrolases and proteases as candidate virulence factors. Blastocystis proteases cause cleavage and degradation of immunoglobulin A (IgA) secreted by the host, may disrupt the intestinal epithelial barrier, and increase production of pro-inflammatory cytokines [2,7,8]. However, there is substantial inter-ST variation in adhesion to enterocytes, disruption of intestinal epithelial tight junctions, activation of pro-inflammatory cytokines, and the ability to scavenge nitric oxide. Virulence factor variability presumably extends to the intra-ST level [9] because the same ST is commonly found in both symptomatic and asymptomatic hosts. The highly variable clinical presentations attributed to Blastocystis could potentially be due to colonization with different STs or strains of the organism and untangling the relationships remains an area of active research and concern [10][11][12].
Recently, researchers have begun to assess the relationship between Blastocystis colonization and the composition of the prokaryotic gut microbiota. While several studies indicate Blastocystis may be associated with a more diverse and "healthy" microbiota [13,14], others have reported an association between Blastocystis colonization and a decrease in protective bacteria in the gut [15] or no differences in microbiota between Blastocystis-positive and -negative patients [16].
The ST7 Blastocystis nuclear genome, obtained using a culture established from fecal matter from a symptomatic human, was the first to be sequenced [21]. The study reported a genome size of 18.8 Mb with 6,020 protein-coding genes. The authors described, among other things, a complex "mitochondria-like" organelle, effector proteins possibly involved in adaptation to a parasitic lifestyle, and a suite of secretory proteins that could have the potential to alter host physiology. They also detailed the genome architecture of ST7, finding it to be highly compact and having numerous duplicated blocks of genes.
More recently, an ST4 genome obtained from a laboratory rodent in Singapore was published [22]; its size was 12.91 Mb and a set of 5,713 protein-coding genes was predicted. There were no detailed analyses of the genome structure or of the genes themselves. In 2015, draft assemblies for the nuclear genomes of STs 2, 3, 4, 6, 8, and 9 were released to public databases. None of these assemblies have been annotated with predicted genes and no analyses have been reported to date beyond their use in a microbiome analysis [13].
Here, we report a draft genome sequence and transcriptome analysis of Blastocystis ST1, NandII strain from a symptomatic human, and compare it to the published genomes of Blastocystis ST7 [21] and ST4 [22]. We first provide a general overview and statistics for the NandII genome and perform high-level comparisons between the 3 genomes. We then present findings derived from extensive manual annotation, with a focus on genes that are potential host effectors, and highlight significant genomic differences between the STs. Our study provides a strong framework for subsequent molecular and cellular investigations of the role of Blastocystis in gastrointestinal health and disease and of its impact on the microbial communities of the gut.

Genome assembly
The assembled size of the Blastocystis ST1 genome is 16.5 Mb spread across 580 scaffolds (Table 1). In comparison, the assembly of ST7 is composed of 54 scaffolds and 18.8 Mb, while the ST4 assembly is 12.9 Mb and has 1,301 scaffolds. The substantially lower number of scaffolds in the ST7 assembly reflects the longer reads of the Sanger sequencing technology used versus the shorter reads generated in next-generation sequencing (454 and Illumina) used in the cases of ST1 and ST4. Although the assembly for ST7 is substantially larger with longer scaffolds, the fold coverage, gene complement, and transcriptome coverage suggest that the difference in size between the strains is real. The substantial genomic differences between ST1, ST4, and ST7 are not limited to size. The GC content varies by 15% between the 3 STs (39.6% in ST4, 45.2% in ST7, and 54.6% in ST1), which is significantly different when compared to GC-content variation in some parasites: the GC content difference among 3 strains of Giardia is only 2% [23], as is the difference among 4 species of Leishmania [24]. More biologically relevant is the variation among the Blastocystis STs in terms of overall gene numbers and gene structure. ST1 has 524 more protein-coding genes than were identified in ST7 and 831 more than in ST4 (Table 1). ST1 has the largest average gene size, the highest percentage of genes with introns, the highest average number of exons per gene, and the largest percentage of its genome devoted to encoding proteins, whereas ST7 has the lowest figures for these same attributes (Table 1). Although ST7 was characterized as being a compact genome [21], ST1 is demonstrably more compact, with an average intergenic spacer size of 615 bps versus 1,687 bps in ST7. ST4 tends to fall in the middle for these various measures, although it has more genes/kb, 0.44, than ST1 (0.39) or ST7 (0.32) because it has the smallest genome size.
While these differences in the number and organization of genes in different STs likely reflect real differences between their genomes, they may to some extent be impacted by the procedures used to predict gene models in the 3 genome assemblies. Particularly noteworthy is that ST7 and ST4 were annotated before the recognition of the phenomenon of polyadenylation-mediated creation of stop codons that occurs in Blastocystis (see below). Hence, the statistics concerning the different genomic elements in ST7 and ST4 (e.g., genes, exons, introns, intergenic spaces) reported in this and subsequent sections must be considered current estimates that may change after revised annotation.

Functional termination codons
A striking structural feature of Blastocystis STs that is unique amongst eukaryotic nuclear genomes is the polyadenylation-mediated generation of termination codons [25]. This process involves the creation of a functional stop codon by 3 0 -polyadenylation of mRNAs in a subset of genes, whereby the first 1 or 2 adenines of the poly-A tail appended to terminal uracil-adenine/uracil-guanine or uracil respectively, complete a termination codon missing in the actual gene sequence. Importantly, the addition of the poly-A tail occurs upstream of any possible canonical stop site in the underlying genomic sequence in these genes. Because most genefinding algorithms are predicated on the presence of standard stop codons, this unique feature in Blastocystis can lead to erroneous gene models with problems such as overly long protein sequences, overlapping genes, chimeric genes, and the introduction of false introns to reach the next downstream stop codon.
The phenomenon was first investigated in ST7 using expressed sequence tag (EST) data [25]. The authors concluded that potentially 15% of ST7 genes have stop codons generated by polyadenylation. They also found some examples of the same process in a preliminary draft genome of ST1. With the full genome of ST1 available, as well as RNA-sequencing (RNA-Seq) data, we investigated the extent of the phenomenon. Appropriate stop codons generated by polyadenylation were found in 1,693 protein-coding genes, representing 26% of the proteincoding genes in ST1. The 15% suggested for ST7 may be an underestimate because mapped RNA-Seq data are far superior to ESTs for confirming whether the stop codon is generated via polyadenylation.
Because this phenomenon was not identified prior to the gene-finding process for the Blastocystis ST7 genome, a number of the initial gene models were incorrect [25]. At present, it is unclear whether the gene models for ST4 suffer from similar problems or even whether the same polyadenylation-mediated generation of termination codons occurs. The sites for polyadenylation appear to be linked to a highly conserved motif (TGTTTGTT) usually found 5 bases downstream of the nucleotide preceding the poly-A tail. A search for this motif indicates that it is abundant (Table 2) in Blastocystis. All 3 genomes have roughly the same number of motifs when genome size is taken into account. ST1 has 1 site per 3,097 bps, ST4 has 1 per 2,751 bps, and ST7 has 1 per 2,782 bps. A random selection of other stramenopile genomes showed that Thalassiosira pseudonana has 1 motif per 17,754 bps, while Phytophthora infestans has 1 site per 37,441 bps, suggesting that in these 2 genomes, the presence of the motif is random.
Because ST4 has a very similar motif complement to ST1 and ST7, it is likely that it too uses the polyadenylation process to generate some of its termination codons. Also suggestive are the 95 gene model pairs that overlap in ST4 (Table 1). An examination of these overlapping coding regions found motifs at locations appropriate to allow separation of the genes in most of the 95 cases. For example, the 3 0 -ends of KNB46045 and KNB46046 overlap by 37 bps (S1 Fig). Forty-four bps upstream of the annotated stop codon for KNB46045 is the conserved motif TGTTTGTT, which, if used to direct the generation of a new termination codon, would eliminate the overlap between the 2 genes.
Future studies of additional Blastocystis STs need to take polyadenylation-mediated stop codons into account. The mechanism has been demonstrated to be active in ST1 and ST7, suggesting that the mechanism evolved prior to the divergence of these 2 STs.

Introns
The ST1 genome is substantially more intron rich than that of ST7 or ST4. It has 35,412 predicted introns compared with 18,200 in ST7 and 24,093 in ST4 (Table 1). Consequently, ST1 has the highest percentage of genes with introns and the highest number of introns per gene. The size of introns in ST1 is also less variable (Fig 1), with over half being 30 bps in length. In ST4 and ST7, the percentage of introns with a length of 30 bps is considerably lower. ST7 shows greater variation in intron size, with a much lower peak at 30 bps and higher percentages of introns in the size range 31-38 bps. ST1's relative intron richness may in part reflect methodological differences in gene calling. Exon/intron boundaries in ST1 were corrected using RNA-Seq transcriptome data, which were not available for ST4 and ST7. Depending on the parameters used to identify a "typical" gene and its exon/intron structure during the automated gene-calling process, having RNA-Seq data would tend to result in finding more real introns and annotating more introns with the correct boundaries.
Of the approximately 35,000 annotated introns in the Blastocystis ST1 nuclear genome, the vast majority (98.3%) represent the standard (or U2 type) spliceosomal introns characterized by GT-AG boundaries and spliced by the so-called major (U2) spliceosome containing U1, U2, U4, U5, and U6 small nuclear RNAs (snRNAs) containing small nuclear ribonucleo proteins. Two additional, much less abundant intron categories are also apparently spliced by the  Forward strand sites  2,620  2,320  3,329  954  3,069   Reverse strand sites  2,707  2,368  3,403  873  3,035   1 Site/length (bp)  1/3,097  1/2,751  1/2,792  1/17,754  1/37,441 major spliceosome. One has GC instead of GT as the 5 0 intron boundary and constitutes about 0.5% of all introns in both ST1 and ST7 (Fig 1), which is on par with the proportion of this intron found in metazoan or plant genomes [26]. The second category, characterized by the GA dinucleotide at the 5 0 border, is even more sparse, with 59 such introns (i.e., only around 0.16% of all introns) identified in ST1 and only 9 supported by EST data in ST7. The existence of such introns is not without precedent, as such GA-AG introns were previously identified in the genome of the dinoflagellate Symbiodinium minutum [27]. The final intron category in Blastocystis corresponds to the minor, or U12-type, introns, characterized by boundaries typically exhibiting the dinucleotides AT-AC (Fig 1) and spliced by the minor (U12) spliceosome containing U11, U12, U4atac, and U6atac snRNA molecules. A systematic analysis of the Blastocystis ST1 genome revealed 346 U12-type introns in 319 genes. In addition, genes for all 4 snRNAs and for all 7 protein subunits (20K, 25K, 31K, 35K, Comparative genomics of Blastocystis reveals diversity of subtypes 48K, 59K, and 65K) specific to the U12 spliceosome were identified [28] (S1 Table). U12-type introns and U12 spliceosome-specific components were not noted in the report on the ST7 genome [21], but both can be identified in the genome sequence of that ST (Fig 1, S1 Table). Hence, Blastocystis represents a previously missed eukaryotic lineage that has retained U12type introns and the associated splicing machinery. In stramenopiles, only oomycetes have so far been known to exhibit both types of spliceosomal introns, whereas only the major "standard" type has been retained in the other lineages [29].

Significant differences in gene complement between STs
Blastocystis ST1 was found to have 6,544 protein-coding genes, in contrast to 6,020 in ST7 and 5,713 in ST4 (Table 1). An examination of the amino acid identities between putative orthologs reveals an extraordinary degree of dissimilarity among the Blastocystis STs (Fig 2). The median sequence identity of aligned regions of orthologs among STs 1, 4, and 7 ranges from 59% to 61% (S2 Table). The differences among the 3 STs exceed those observed for orthologs from pairs of species from parasitic protistan genera such as Cryptosporidium (C. parvum-C. hominis, Alveolata), Leishmania (L. major-L. infantum, Excavata), and Theileria (T. parva-T. annulata, Alveolata) (  Table). The dissimilarity is comparable to that between species of Plasmodium (P. falciparum-P. knowlesi, Alveolata) and Trypanosoma (T. cruzi-T. brucei, Excavata). The extent of the dissimilarity among the Blastocystis STs supports the contention that they should be considered at least equivalent to separate species, particularly when placed in the context of other protistan pathogens. The lack of morphologically distinguishing traits and low correlation between ST and host [30] will continue, however, to make the taxonomy of Blastocystis challenging. ST1 and ST4 genes are the least alike at the protein level while ST7 protein-coding sequences are usually more similar to ST1 than to ST4 genes (Fig 2). This is in line with recent phylogenetic analyses [5,20,31] that place ST4 in a clade sister to one that includes ST1 and ST7. There are also marked differences in gene content between STs. The percentage of genes present in ST1 but not in ST4 is significantly lower than the percentage of genes in ST1 not present in ST7 (Fig 3). Similarly, ST4 has far fewer unique genes when compared with ST1 than with ST7. The percentage of unique genes in ST7 is virtually identical in comparisons with ST1 and ST4. The differences in gene complement between the Blastocystis STs greatly exceed the differences between selected parasitic protistan species pairs (Fig 3). Beyond a common core of genes presumably devoted to common housekeeping tasks, the Blastocystis STs possess a substantial number of genes unique to each ST. The consequences of these major genetic differences between the STs are discussed below.

Kinome
Kinase enzymes modify target proteins via phosphorylation and participate in the regulation of cellular pathways, particularly those involved in signal transduction [32]. Our analysis of Blastocystis ST1 identified 221 kinases, which were classified according to kinase.com [33]. Representatives of most kinase groups are similarly distributed among STs 1, 4, and 7 (Fig 4) except for 2 clades showing lineage-specific gene expansion. The first is a small clade of calcium/calmodulin-dependent-like (CAMKL) kinases almost exclusive to STs 4 and 7 (Fig 4). The second and more striking example of ST differences is a clade of STE20/7 kinases. While ST4 and ST7 both encode representatives of the STE family, they appear to completely lack members of a cluster of 58 closely related STE20/7 kinases found only in the ST1 genome. This seems to be a spectacular case of gene family expansion specific to ST1.
The exact roles of the various kinases in Blastocystis STs and, in particular, the large group of STE20/7 kinases exclusive to ST1 are currently unclear. The STE20/7 family includes members of the mitogen-activated protein kinase (MAPK) signaling pathway, which regulates responses to extracellular stimuli [34]. The ST1 cluster does not exhibit a particularly close relationship with any specific gene of the STE20/7 family. In Giardia, 2 members of the MAPK family have been implicated in initiation of encystation [35], while in T. brucei, a MAPK is involved in mediating its interferon-γ-induced proliferation in the host [36]. MAPK pathways have been characterized, at least in pathogenic fungi, as a "functional nervous system" that controls virulence and modulates the outcome of the disease [37]. General "housekeeping" responses to stimuli would presumably be similar across the STs, so the highly developed exclusive MAPK pathway in ST1 must have some unique consequences and could be worthy of future functional studies.

Secretome and degradome
A stringent, 4-step approach identified a total of 89 genes confidently predicted to encode secreted proteins in Blastocystis ST1 (S1 Data). The corresponding predicted number for ST7 was 307 [21], although those results were based only on SignalP predictions, a less stringent approach than the one used here. Most of the 89 secreted proteins predicted in ST1 are also Cumulative ortholog frequency is plotted on the y-axis while the x-axis represents the amino acid identity between orthologs. The more similar species pairs are to each other, the steeper the curve and further to the right. Note that the order of the pairs in the legend matches the order in the figure. Comparisons for Blastocystis STs were based on aligned regions of reciprocal best Basic Local Alignment Search Tool (BLAST) hits, while non-Blastocystis data were obtained from the authors of a comparative study of Giardia intestinalis [23]. Data for this figure can be found in S12 Data. present in ST7, with the exception of a metallophosphoesterase and GH2 and GH33 glycosyl hydrolases. Several of the 89 genes were not predicted as secreted in ST7 because their signal peptides were not part of the predicted gene models. More than half of these putative secreted proteins have a predicted role in posttranslational modification and protein turnover (Fig 5). We compared the secretory signal peptides of the different STs and were unable to identify a unifying characteristic apart from a core of 10 hydrophobic amino acids (S2 Fig). The majority of the secreted protein-coding genes in the 3 STs were present in more than 1 copy, with a notable expansion of cysteine proteases (see below). Comparative genomics of Blastocystis reveals diversity of subtypes Intriguingly, ST1 has 4 genes that are predicted to code for secreted collagen-like proteins. These 4 genes are very different at the amino acid level, with pairwise percentage identities only in the mid to upper 30s, but all have signal peptides that suggest they are secreted and all have the distinctive collagen-like motif GXY (glycine, second and third residues can be anything but frequently proline and hydroxyproline) repeated 63 times. ST4 also has 4 copies of genes encoding these collagen-like proteins, with only 1 predicted to be secreted, whereas ST7 has a single gene encoding a related protein that lacks a signal peptide. Based on sequence similarity and repeat characteristics, the Blastocystis collagen-like proteins are of the bacterial type [38]. A number of bacterial pathogens have collagen-like proteins that are involved in pathogenicity, immune response elicitation, and host-parasite interactions. For example, collagen-like proteins are able to bind to the human extracellular matrix, thereby aiding adhesion and colonization [39][40][41]. It is possible that collagen is one of the factors mediating adhesion of Blastocystis to enterocytes. Another possibility is that collagen may be part of a mechanism used by Blastocystis to trap bacteria and other microbial eukaryotes for nutritional purposes. This process has been observed by electron microscopy, although an exact mechanism has not been described [42][43][44]. At present, the roles played by collagen-like proteins in Blastocystis are purely speculative, based on their similarities to bacterial proteins that are clearly involved in pathogenicity. However, it suggests fertile ground for further investigation, specifically into whether differences in both number of genes and potential for secretion are implicated in variable virulence.
Another class of proteins demonstrating clear ST differences is proteases or peptidases. Proteases are crucial for many biological processes and constitute potential virulence factors in parasitic protists [45]. Cathepsin B, a cysteine protease, has been linked to increased intestinal cell permeability [8], while other cysteine proteases have been reported to cleave human Comparative genomics of Blastocystis reveals diversity of subtypes secretory IgA [46] and induce up-regulation of interleukin 8 cytokine transcription and secretion [47] in intestinal epithelial cells.
Blastocystis has a large number of proteases, with the 3 STs having mostly similar profiles, but with some notable differences. The total number of proteases encoded in ST1 is 243, with 198 in ST4 and 210 in ST7 (S2 Data) (Fig 6). All 3 STs encode aspartic, cysteine, metallo, serine, and threonine proteases. The most prevalent protease genes are those of the cysteine type, constituting between 39% (ST7) and 47% (ST4) of the degradome. Undoubtedly, many cysteine proteases play roles that are conserved in eukaryotes, such as lysosomal function, autophagy, and ubiquitination, but others have been implicated in host-parasite interactions [48][49][50]. All 3 STs show extensive gene expansions of the cysteine protease families C1, C13, and C19. A large number of C1 genes is fairly common. Less common is the number of C13 genes found in ST1 (16), ST4 (11), and ST7 (11). Most protist genomes contain fewer than 5 C13 genes and many just have a single version (see MEROPS database, merops.sanger.ac.uk, [51]). The only protist known to approach the number of C13 genes seen in Blastocystis is the sexually transmitted human excavate Trichomonas vaginalis (10), in which at least 1 type of C13 protein has been implicated in trichomonal cytoadherence [52]. While Blastocystis appears to have an elevated number of C19 genes compared with other protists (MEROPS database), currently there is no indication that these genes are involved in anything other than standard intracellular removal of ubiquitin molecules.
Other protease families that exhibit large differences between STs are C56, C95, C97, and M16 (S2 Data). The biological significance, if any, of these differences is currently unknown. The most striking disparity between the STs is seen in metallo-type proteases (Fig 6, S2 Data). The divergence in the number of genes belonging to this type is entirely attributable to the complete absence of subfamily M23B genes in the ST4 genome, while both ST1 and ST7 have 29 members. The M23 family is composed of metallopeptidases involved in the lysing of peptidoglycans in bacterial cell walls for either defense or feeding (MEROPS database). What functions they possess in Blastocystis and why they are absent from ST4 are open questions. Differences in the number of the S54 rhomboid serine proteases were also identified, with ST1 having 4 genes, ST4 having 3, and ST7 having only 1. These transmembrane peptidases play a role in the invasion of host cells in the alveolates Toxoplasma, Cryptosporidium, and Plasmodium, while in the amoebozoan Entamoeba histolytica, the single rhomboid protease is vital to immune evasion via the cleavage of lectins during receptor capping [53][54][55].

Unusual aspects of Blastocystis ST1
Mitochondria. The mitochondria of anaerobic protists are often metabolically distinct from their aerobic relatives [56], as they typically couple ATP generation to the production of Comparative genomics of Blastocystis reveals diversity of subtypes hydrogen and not oxidative phosphorylation. Due to the lack of canonical features, they are frequently referred to as MROs. The first EST survey of Blastocystis ST1 revealed the MRO to have elements of canonical mitochondria, as exemplified by a highly reduced cytochromemediated electron transport chain with only complexes I and II being present. These organelles also showed features characteristic of hydrogenosomes, including MRO-targeted homologs of [FeFe] hydrogenase and pyruvate:ferredoxin oxidoreductase (PFO) [18]. This mosaic composition of the MRO proteome has intrigued evolutionary biologists, as it represents an intermediate form between classical mitochondria and hydrogenosomes. Comparative analyses of MRO genomes of 8 STs have shown that they are identical in gene content and gene order [17,19,20].
From the Blastocystis ST1 nuclear genome sequence, a variety of MRO proteins were identified that were previously unreported (Fig 7, S3 Data). In particular, additional components of the protein import apparatus, including transporters, chaperones, and proteases, were identified (Fig 7). Genes were also found that are involved in "typical" mitochondrial processes such as cardiolipin and phospholipid metabolism, electron transfer (Complex II assembly [SDH5], electron transferring flavoprotein), Fe-S cluster biogenesis (NFU1), cofactor/vitamin metabolism (folate, B5, B12, steroid, and lipoate), fatty acid biosynthesis, cristae maintenance (Mic60), and regulation (PDH kinase). Genes encoding MRO proteins that are only found in anaerobes were identified, such as the rhodoquinone biosynthesis enzyme RQUA and a putative ATP-generating acetyl-CoA synthetase (ACS).
Previous attempts to identify PFO and hydrogenase activity in the MRO of Blastocystis ST1 have been unsuccessful [57]. While this observation might be due to the labile nature of these enzymes, it is possible that additional factors are at play. PFO catalyzes the oxidation of pyruvate and generates reduced ferredoxin to be used by hydrogenase to generate hydrogen. PNO is homologous to PFO, but contains a C-terminal NADPH-hemoprotein reductase domain [58]. PNO (and not PFO) activity has been reported previously in Blastocystis [57]; however, while several candidate PNO proteins could be identified, none had a predicted targeting peptide. It is possible that at least 1 of these possesses a cryptic mitochondrial targeting signal; such signals are not so uncommon in model system mitochondria [59]. However, genes encoding 3 homologs of PFO and a standalone NADPH-hemoprotein reductase were identified, all of which had canonical targeting peptides for localization to MROs. One possible scenario is that these mitochondrial PFO proteins are functioning in trans together with the soluble NADPH-hemoprotein reductase as PNOs, thus explaining the lack of organellar PFO activity. A similar situation occurs for the Fe hydrogenase homologs of Blastocystis. Two of the hydrogenase genes identified in ST1 have clear mitochondrial targeting peptides but possess an additional electron-transferring domain (flavodoxin), while the lone canonical hydrogenase (i.e., without the flavodoxin domain) is cytosolic. Recent reports have identified these atypical hydrogenases in other anaerobic protists [59][60][61]. Moreover, ST1 does not encode the maturases HydF and HydG, which are typically necessary for the correct assembly of the mature hydrogenase enzyme. A possible scenario is that the Blastocystis ST1 hydrogenase does not participate in hydrogen production but instead uses an alternative electron acceptor to protons. While these scenarios are speculative, the noncanonical domain architecture of the Blastocystis hydrogenases and PF(N)Os is intriguing and additional biochemical investigations will be needed to test these hypotheses.
An analysis of Blastocystis GTPases encoded in STs 1, 4, and 7 (S4 Data) identified Miro, a protein that has been implicated in regulating mitochondrial morphology, intracellular transport, and interactions with the endoplasmic reticulum (ER). Miro is a multidomain protein composed of 2 divergent GTPase domains separated by a region that includes 2 Ca 2+ -binding EF-hand motifs. The distal portion of the C-terminus includes a transmembrane domain  [62]. It is widely conserved across eukaryotes including stramenopiles but was previously found to be missing from all eukaryotic lineages with anaerobic MROs (hydrogenosomes or mitosomes), including Microsporidia, Entamoeba, and Metamonada [63]. The presence of Miro in Blastocystis is thus somewhat unexpected, but in fact further underscores the intermediate position of the MRO organelle in this organism between canonical aerobic mitochondria and highly modified mitochondrial derivatives such as hydrogenosomes and mitosomes. More unexpected is the presence of 3 quite different Miro paralogs conserved across all 3 Blastocystis STs, apparently resulting from duplications specific to the Blastocystis lineage (Fig 8). Duplicated Miro genes were previously detected only in eukaryotic groups known to have experienced cycles of whole-genome duplications, such as vertebrates, some plants, and some ciliates [63], so the expanded family of Miro genes in Blastocystis is noteworthy.
Peroxisomes. In contrast to stramenopiles with canonical peroxisome [64], Blastocystis STs 1,4, and 7 appear to have retained only a single subunit of the Dsl1 multisubunit tethering complex (S3 Fig), which is known to be involved in transport between the ER and the peroxisome [65]. Reduction of this complex has been observed in other eukaryotic organisms thought to lack peroxisomes [64]. Characterization of peroxisomes in genomes from the SAR clade is limited (reviewed in [66]), but there is ultrastructural and/or biochemical evidence for peroxisomes in Phytophthora palmivora [67] and Phaeodactylum tricornutum [68], and peroxisomes have been predicted to be present in T. pseudonana [69]. Like nearly all other parasitic organisms [70], peroxisomes have not been described ultrastructurally for Blastocystis. Consistent with this, with the Dsl1 result, and with a report that failed to find the AAA-ATPases Pex1 and Pex6 [71], no PEX genes (responsible for peroxisome biogenesis) were identified in any of the Blastocystis STs, with the exception of a possible Pex4p homolog (Fig 9). This makes Blastocystis only the second example, after alveolates, of a lineage with MROs retaining aspects of classical mitochondria but lacking peroxisomes [70].

Membrane-trafficking machinery
The membrane-trafficking system enables transport of proteins and lipids between intracellular locations and is an interface with the extracellular environment. Crucial to the healthy working of eukaryotic cells, it underpins the pathogenic mechanisms of diverse eukaryotic parasites through the release of virulence factors as well as the uptake of metabolites from the host. Trafficking is enabled by a suite of proteins involved in vesicle formation and fusion [72]. Comparative genomic and molecular phylogenetic analyses have established that a relatively complex complement of membrane-trafficking machinery was present in the last eukaryotic common ancestor (LECA) [73].
Our analyses showed that homologs of nearly all components of the vesicle-formation and -fusion machinery present in the LECA are also encoded in Blastocystis STs 1, 4, and 7 (S3 outer mitochondrial membrane (blue), inner mitochondrial membrane (green and orange), and mitochondrial matrix (purple). Numbers correspond to the protein number in the complex (e.g., TOM40  A number of Ras superfamily proteins (or small GTPases) known to be involved in membrane and protein trafficking have orthologs in Blastocystis (S4 Data). Notable among them are 2 (ST7) or 3 (ST4, ST1) paralogs of a GTPase that serves as the membrane-anchored β subunit (SRβ) of the signal-recognition particle receptor on the ER that is involved in the cotranslational import of proteins into the ER [74]. Most eukaryotes, including other stramenopiles, employ only 1 SRβ protein, so the functional significance of the varying number of SRβ paralogs in Blastocystis is unclear.
Blastocystis STs also harbor some Ras superfamily members that appear to have emerged in this lineage through extensive divergence of existing or duplicated genes. These can be considered "novel" genes and may potentially underpin significant evolutionary innovations. For instance, ST1 and ST4 both encode a divergent Rab7-like paralog (Rab7L in S4 Data) that is apparently missing from the genome of ST7. Although the cellular function is difficult to predict for such divergent paralogs, the similarity of Rab7L to standard Rab7 proteins suggests that it may be involved in some trafficking processes associated with lysosomes or vacuoles. The Blastocystis genomes encode 5 additional divergent Rab-related proteins whose evolutionary origin cannot be deduced with confidence and hence are labeled RabX1 to RabX5 (S4 Data). These proteins are presumably involved in specialized membrane-trafficking pathways in Blastocystis, but their functions cannot be predicted from sequence analyses alone. The 2 Rab1 paralogs in Blastocystis, annotated as Rab1A and Rab1B, differ from the previous examples in that their origin is more ancient, as they apparently stem from a Rab1 duplication previously suggested to be an evolutionary novelty of the SAR supergroup [75]. The functional significance of the 2 differentiated Rab1 paralogs in organisms of the SAR clade remains unknown despite an attempt to characterize both paralogs in Toxoplasma gondii cells [76]. Most recently, Rab1A was found in association with rhoptries in schizonts of P. falciparum and suggested to be involved in regulating vesicular trafficking from the ER to the former secretory organelles of the parasite [77]. Comparative genomics of Blastocystis reveals diversity of subtypes Anaphase promoting complex/cyclosome Control of the cell cycle and cell proliferation is mediated extensively by the anaphase promoting complex/cyclosome (APC/C) [78]. It functions as an E3 ubiquitin ligase that coordinates the degradation of specific substrates via the 26S proteasome at specific points in the cell cycle [79]. Typically, the complex is composed of about a dozen subunits with a combined mass of about 1.5 MDa. It can be divided into 3 functional parts or subcomplexes [80]: (i) a structural complex serving as a scaffold, (ii) a catalytic arm, and (iii) a tetratricopeptide repeat (TPR) arm designed to position the substrate for successful transfer of ubiquitin. Surprisingly, Blastocystis seems to lack genes for all the subunits that make up the scaffold (i.e., APC1, APC4, and APC5), and to encode a reduced TPR arm composed of only 2 subunits versus typically 4 in other stramenopiles and up to 7 in other eukaryotes [78]. Particularly notable is the loss of the APC3 subunit, which is found in virtually all eukaryotic organisms. However, Blastocystis appears to be unusual in that it has several paralogs encoding the TPR subunit APC8, which potentially may compensate for the absence of the other subunits in the TPR arm (S5 Data).
The APC/C interacts with a number of adaptors and coactivators that modulate its activity and specificity. The most important of these adaptors are cell-division cycle protein 20 (Cdc20) and cadherin-1 (Cdh1) because, at various stages of the cell cycle, they are essential for activation of the complex and selecting which substrates to interact with [81]. In particular, APC/C-Cdh1 plays a role in DNA synthesis during the G1/S phase because it allows the 26S proteasome to degrade several DNA replication inhibitors [80]. Surprisingly, while the vast majority of eukaryotes possess both Cdc20 and Cdh1, Blastocystis only has a homolog of Cdc20. How the end of anaphase is regulated in Blastocystis remains an open question.
In addition, genes encoding 2 of the main APC/C targets that are involved in the integrity and regulation of the cohesion complex, Scc2/Scc4 and Eco1, were absent. This protein complex keeps sister chromatids together and regulates their separation during cell division. Scc2/ Scc4 aids in loading the complex onto the chromosome, while Eco1 is responsible for the establishment of cohesion between cohesin and chromatin [80]. The absence of these components in Blastocystis and other stramenopiles suggests that an alternative route may exist to achieve a properly functioning cohesion complex and the separation of sister chromatids.

DNA repair and meiosis
Potential homologs of proteins involved in DNA damage response and repair, chromatin structure relevant to repair, and meiosis, a process not previously attributed to the life cycle of Blastocystis, were identified (S4 Fig, S6 Data). Homologs of genes encoding 9 out of 11 meiosis-specific proteins required for the progression of meiosis in other organisms [82,83] are found in Blastocystis ST1 and ST4; these include Hop1, Spo11-2, Top6BL, Dmc1, Hop2, Mnd1, Msh4, Msh5, and Mer3. Msh5 is absent from ST7, and Rec8 and Spo11-1 were not identified in any of the STs. Similar to T. vaginalis [84], the Blastocystis genomes apparently do not encode components of the nonhomologous end-joining machinery, suggesting that homologous recombination is the principal mechanism for the repair of double-stranded DNA breaks.

Carbohydrate active enzymes
A total of 203 carbohydrate active enzymes (CAZymes) were identified in the Blastocystis ST1 genome (S7 Data). The most interesting cases are discussed below and additional details are provided in S1 Text.
Metabolism of nonreducing disaccharides. Sucrose and trehalose are the nonreducing disaccharides present in eukaryotes. Sucrose metabolism is completely absent in stramenopiles including Blastocystis, as deduced from the absence of genes encoding sucrose synthase (family 4 of the glycosyltransferases [GT4]), sucrose-phosphate synthase (GT4), and invertases (glycoside hydrolase family 32 [GH32 and GH100]) in ST1 and ST7. OtsA-OtsB is the most widespread trehalose biosynthetic pathway, present in both eukaryotes and prokaryotes [85]. This pathway involves the synthesis of trehalose-6-phosphate by the GT20 trehalose phosphate synthetase (TPS). Two GT20 enzymes, similar to the typical stramenopile TPSs, are encoded in ST1 and ST7 (S7 Data), but no enzyme capable of degrading trehalose was identified. However, ST1 does encode several GT4 family members that are highly divergent, so it is possible that these proteins have trehalose-phosphorylase catabolic activity.
Storage polysaccharide metabolism. Blastocystis has a small but apparently complete set of glycogen metabolism genes: a glycogen synthase (GT5), 2 glycogen phosphorylases (GT35), branching enzyme (GH13), and indirect branching enzyme (GH133) (S7 Data). This is the first reported case of a stramenopile with α-glucan metabolism. In general, stramenopiles utilize β-glucans as storage polysaccharides and lack starch or glycogen metabolism-related genes, as evidenced by the stramenopile genomes examined so far (diatoms, brown algae, and oomycetes) [86,87]. Our results are consistent with previous microscopic observations of glycogen-like particles in the cytoplasm of several species of Blastocystis [88][89][90].
Interestingly, genes encoding enzymes involved in β-glucan metabolism are completely absent from all 3 genomes. This finding indicates that storage metabolism differs significantly between Blastocystis and other stramenopiles for which there are published analyses [86]. Other stramenopiles are β-glucan accumulators and lack α-glucan-related gene homologs. However, a recent search of data from the Marine Microbial Eukaryote Transcriptome Sequencing Project [91] for the basal branching stramenopile Cafeteria revealed strong matches for a glycogen synthase gene (S3 Table) as well as the absence of β-glucan-related genes. The presence of α-glucan metabolism in Blastocystis and Cafeteria can be accounted for by lateral gene transfer (LGT) or by maintenance of an ancestral eukaryotic glycogen metabolism lost in stramenopile sister lineages. Quite significantly, the glycogen synthase enzyme belongs to the GT5 database of carbohydrate-active enzymes (CAZy) family, meaning the animal hosts of Blastocystis can be rejected as possible LGT sources, as they all contain the very different GT3-type enzymes. Given that alveolates also employ α-glucan metabolism [86], it is likely that α-glucan metabolism was present in the ancestor of stramenopiles. The presence of the indirect debranching enzyme (iDBE), absent from red and green algae, suggests that this network is not a secondary endosymbiosis relic from Archaeplastida.
Coat carbohydrates. Previous spectrophotometric analyses have shown that the surface coat of Blastocystis contains α-D-mannose, α-D-glucose, N-acetyl-α-D-glucosamine, α-Lfucose, chitin, and sialic acid [92]. Our analyses have confirmed Blastocystis STs 1, 4, and 7 encode a eukaryotic-type chitin synthase (GT2) and 2 eukaryotic-type chitinases (GH18) orthologous to proteins found in other taxa, indicating that this organism is capable of synthesizing and degrading this polysaccharide. No orthologs for fucose and sialic acid synthesis were identified. Many pathogens have developed the ability to harvest sialic acid from their hosts and incorporate it into their own glycoconjugates. Sialylated glycoconjugates in many pathogens seem to be crucial for their survival in the mammalian host, possibly serving as molecular mimics of host cell surfaces to evade immune attack [93]. A bacterial-type α-fucosidase (GH29) ortholog was found in all 3 genomes. Two prokaryotic-type CAZymes were present in ST1 and ST4 but not in ST7: an exo-α-sialidase (GH33) and an α-fucosidase (GH95). All 3 bacterial-type enzymes are part of a suite of genes that have been acquired through LGT and are discussed in a separate analysis [94].

Intermediary metabolism
Unlike the intestinal parasites Giardia and Entamoeba, which rely on the host as a source of purines and pyrimidines [95,96], Blastocystis ST1 possesses complete pathways for the de novo synthesis of these compounds (S8 Data). Its capacity for de novo amino acid biosynthesis is limited to alanine, aspartate, and glutamate, while serine and glutamine can be produced via conversion from other amino acids. Blastocystis ST1 has a mostly complete folate biosynthesis pathway, lacking only the alkaline phosphatase responsible for the dephosphorylation of 7,8-dihydroneopterin 3 0 -triphosphate. However, this enzyme is also lacking in many probiotic bacteria such as bifidobacterial, which appear to use another NUDIX enzyme [97]. While genes encoding homologs of this protein could not be identified, it seems likely that Blastocystis ST1 uses an uncharacterized alkaline phosphatase to complete the pathway.
Unlike most parasites, Blastocystis ST1 appears to have a nearly complete de novo thiamine (vitamin B1) biosynthesis pathway. In this pathway, 4-amino-2-methyl-5-hydroxymethylpyrimidine (HMP) is pyrophosphorylated by HMP kinase sequentially to form HMP-PP, which is in turn condensed with thiazole to form thiamine phosphate by thiamine-phosphate pyrophosphorylase. Only the thiazole salvage enzyme hydroxyethylthiazole kinase (ThiM) could be identified, suggesting thiamine biosynthesis in Blastocystis ST1 is either dependent on exogenous thiazole or synthesizes thiazole by an unknown mechanism. The latter is not unprecedented, as P. falciparum has been shown to synthesize vitamin B1 de novo despite having the same repertoire of thiamine synthesis-related genes as Blastocystis [98].

Cytochrome P450 gene family
Cytochrome P450 (CYP) is a large and versatile heme-containing protein superfamily containing at least 317 families and numerous subfamilies (https://cyped.biocatnet.de/sequencebrowser). These proteins are found in all domains of life and in all major eukaryotic lineages. However, no homologs were found of any of the CYP families in the 3 Blastocystis genomes. Genomes available for other stramenopiles do encode various CYPs. The diatom P. tricornutum has 3 CYP genes, consisting of 2 CYP97 and 1 CYP51 enzymes, whereas T. pseudonana encodes 9 family members [99]. P. falciparum was the first documented eukaryotic species without a CYP-encoding gene [100]. However, it should be noted that Kinetoplastida such as T. brucei have a limited set of CYPs devoted to sterol synthesis (e.g., CYP51A). Another apicomplexan parasite, T. gondii, has a single CYP gene, encoding a steroid 11-β hydroxylase. Since sterols are essential for eukaryotic membranes, the lack of CYP51 in Blastocystis suggests that it obtains sterols from its host, as does Giardia [101].

Concluding remarks
The intestinal eukaryote Blastocystis continues to be of significant clinical interest because of it widespread prevalence. Despite years of study, however, its pathogenicity and role in the gut remain controversial. This ambiguity is compounded by the presence of STs that, while morphologically indistinguishable, are nevertheless genetically heterogeneous. The extent of the differences and the degree to which they matter clinically are still unclear, but the increase in genomic data reported here opens up the possibility of answering some of the outstanding questions through comparative analyses.
Structurally, there is considerable variation among the 3 Blastocystis ST genomes now available in terms of genome size and GC-content. ST1 also displays sizeable differences compared to ST7 and ST4 with regard to the number of genes identified and general gene characteristics, such as intron numbers and gene size. However, the use of a novel mechanism by which stop codons are generated was examined thoroughly in ST1 and led us to conclude that some of the differences in gene numbers and characteristics could be due to previous annotation efforts in ST4 and ST7 not taking the stop codon issue into account.
While the need for reannotation of ST4 and ST7 affects the quality of the protein data sets from these STs to some extent, nevertheless, divergence at the amino acid level between homologous proteins has been demonstrated that is almost an order of magnitude greater than that seen between species within other genera of parasitic protistans. Analyses of protein classes of interest revealed numerous differences. Of interest will be the variation in the number and type of protease genes identified among the STs, since this class of protein has been linked to pathogenicity. Another intriguing area ripe for exploration and experimentation is the kinome. ST-specific expansions of protein kinase genes were identified. In particular, ST1 encodes a large number of novel proteins from the STE20/7 family, which is typically involved in responding to extracellular stimuli.
General aspects of the Blastocystis genome were examined that help in understanding its biology and how it relates to other stramenopiles and other parasitic protistans as well. Some of the highlights (see S1 Text for additional analyses) include an expanded proteome of the MRO, the absence of peroxisome-related genes, the presence of α-glucan metabolism, and intriguing expansions and additions to the repertoire of membrane-trafficking machinery proteins. The extensive structural and gene complement differences between the genomes of the 3 STs suggests the need for a revised taxonomy of Blastocystis. Since Blastocystis has been linked to disease in humans, the primary focus of any genomics analyses will likely be medically oriented. The availability of Blastocystis genomes will greatly assist future functional studies. In particular, the development and successful implementation of a transformation system, to allow finely tuned control over constructs as well as localization experiments, will depend on mining the genome for appropriate primers, promoters, and genes of interest. Resequencing to evaluate the success of transformations and any off-target responses will be facilitated by preexisting genomes. Pathogenomic studies of Blastocystis will benefit greatly from access to a genome of ST1. Comparisons with existing and future genomes from other STs and other isolates of ST1 will help in finding genes that are correlated with virulence as well as the development and testing of drug targets. Equally as important to elucidating the roles of Blastocystis STs in the human gut is understanding the biology of these anaerobic protists and their repertoires of metabolic pathways, which allow them to thrive in the gut environment.

Methods and materials
Culturing and DNA extraction Blastocystis sp. NandII (ST1) was obtained from the American Type Culture Collection (ATCC 50177) and maintained in Locke's medium and horse serum egg slants at 35.6˚C in an anaerobic chamber [102]. Blastocystis ST1 cells were harvested by centrifugation at 850xg for 15 min at 4˚C. Total DNA was extracted with the standard CTAB protocol [103]. Subsequently, Hoechst dye cesium chloride (CsCl) centrifugation was used to obtain purified nuclear DNA. Extracted DNA was diluted with TE buffer to a final volume of 10 mL and resuspended in 11.5 g of CsCl. Ten mg of Hoechst dye was added and the mix was homogenized by shaking for 3 hours. The solution was then transferred into Quick-Seal centrifuge tubes and centrifuged in a fixed-angle Ti-75 rotor at 40,000xg for 44 h. The resulting DNA bands were visualized under long-wave UV light. The AT-rich organellar band was removed and the remaining nuclear DNA band retrieved with a 30-gauge needle. Hoechst dye was removed with 3 successive washes of water-saturated butanol. DNA was precipitated with 100% ethanol and suspended in water.

RNA extraction, sequencing, and assembly
Total RNA was isolated using a modified Trizol protocol [104], in which the supernatant, after the first round, underwent another Trizol extraction. The resulting RNA was used to create a cDNA paired-end library (Vertis Biotechnologies AG [Germany]) that was then sequenced on a 4SLX Titanium Platform (Genome Quebec). The reads were filtered for primer, vector, and adaptor sequences. Following filtering, 69,155,000 read pairs remained. A second cDNA library was constructed (Beijing Genomics Institute) and RNA-Seq data was generated using the Illumina HiSeq 2000 system. As above, the reads were filtered for primer, vector, and adaptor sequences. The reads were also filtered by quality scores using FASTX-Toolkit (version 0.0.13) http://hannonlab.cshl.edu/fastx_toolkit/index.html. To be retained, a minimum of 70% of the bases had to have quality scores of 20 or better [99]. Ultimately, 370,469,956 reads remained.
The miraEST [105] and Trinity [106] assembly software programs were used to assemble the paired-end reads into contigs. 20,426 contigs were assembled. Only contigs above 200 bps were considered for further analysis. The longest contig was 18,604 bps and N50 was 793 bps.

Genome sequencing and assembly
The Blastocystis ST1 genome was sequenced using 454 pyrosequencing and Illumina technologies. Illumina sequencing entailed generation of a mate paired-end library with an insert size of 3 kb. Following the filtering steps described above, 1,865,740 454 and 55,617,444 Illumina reads remained. The 454 and Illumina reads were assembled using the Ray genomic assembler software [107]. Only contigs above 200 bps were retained for further analysis.
Using different combinations of 454 and Illumina reads, 3 different assemblies were created. 2F was generated from just Illumina reads and contained 13,338 contigs. 2E was generated from all reads and contained 29,673 contigs. 2D was generated just from 454 reads. The software Minimus2 from the AMOS package [108] was used to merge the 3 into a single assembly with 14,304 contigs.

Bacterial contamination
The genome assembly for Blastocystis ST1 was interrogated for the presence of contigs composed of bacterial reads. For contigs of less than 500 bps, BLASTx and BLASTn [49] searches were performed. All contigs with a percent identity of 95% and higher against entries in the nr and nt NCBI databases were considered contaminants. In total, 305 contigs were removed. Large contigs were split into 5,000-bp pieces while those less than 5,000 bps in length were kept intact. All the pieces were compared, using BLASTx, against a specialized database consisting of protein sets from all publicly available bacterial genomes and protein sets from 28 eukaryotic genomes drawn mainly from Chromalveolata and Archaeplastida. Pieces that only had matches for bacterial proteins were putatively designated as derived from bacterial contamination. Because of the possibility of LGT from bacteria, all of the 5,000-bp pieces from a large contig were required to be "bacterial" to warrant removal of the contig. In total, 459 contigs were set aside as being bacterial in nature and most probably resulting from bacterial contamination. 150 were also removed as clearly derived from contaminating Saccharomyces cerevisiae and E. histolytica DNA.

Scaffolding
After the removal of bacterial/yeast/Entamoeba contamination, attempts were made to assemble the ST1 contigs into larger scaffolds. SSPACE [109], which is specially designed to work with preassembled contigs, was used on those contigs greater than 500 bps. The 2,494 contigs were assembled into 693 scaffolds with the N50 increasing from 8,801 bps to 50,205. The Blastocystis ST1 genome sequences are available from the NCBI with the accession LXWW00000 000.

Estimation of genome coverage
In the absence of an experimentally determined genome size for ST1, the extent to which the final assembly represents the full genome was determined roughly using 2 methods. The first method entails taking a set of 679 conserved protein sequences derived from RNA-Seq data and matching them against the genome assembly. Of the conserved transcripts, 5.4% were not found in the genome assembly, suggesting that the current assembly encompasses 94.6% of the genome or at least the gene space of the genome [110].
The second strategy for estimating genome coverage was based on a method first used in the Conus bullatus genome paper [111] to estimate genome size. A database of 31.8 million 90-bp Illumina reads was used in BLASTn searches of the assembled contigs. Read coverage for contigs was estimated as (number_of_hits Ã read_length)/length_of_contig. The mode was determined from a histogram of the read coverages of the contigs and used to estimate genome size: (number_of_reads Ã read_size)/read_coverage_mode ((31,826,528 Ã 100)/153 = 18.7 Mb).
The estimated size was then compared with the assembled genome size (17.4 Mb/18.7 Mb = 93%).
Additionally, the protein data sets for the 3 genomes were used for BUSCO [112] analyses to assess their completeness based on a set of 429 highly conserved single copy eukaryotic genes. An initial BUSCO analysis was done using the ST protein data set. Any conserved genes that were considered missing were then compared against the appropriate ST genomic scaffolds to determine if the gene had been missed during annotation. Given the evolutionary distances between Blastocystis and the organisms used to generate the BUSCO test set, the likelihood of achieving 100% coverage was low. More useful was a comparison of the results for the 3 STs.

Average level of heterozygosity
The average level of heterozygosity for ST1 was determined by aligning trimmed genomic reads to the genomic assembly with Bowtie 2.3.1 [113]. The program mpileup from SAMtools [114] was used to determine the read depth at each position of the assembly as well as the breakdown of the 4 possible bases. An in-house perl script was used to parse the mpileup data. Positions with less than 10X coverage were ignored. A site was deemed to be heterozygous if at least 2 different bases were present and there were at least 2 (or 3) reads with the different bases. The number of heterozygous sites was then divided by the total number of sites (with !10X coverage).

Gene modeling
A set of 679 protein sequences was generated from the ST1 RNA-Seq clustered transcripts and used to train AUGUSTUS [115] on a repeat masked assembly [116]. The training set contained protein sequences with the following criteria: at least 1 intron, less than 80% identical at the amino acid level to any other sequence in the set, and a start site based on highly similar eukaryotic protein sequences. The script optimize_augustus.pl was also run to create speciesspecific meta parameters. A set of 32,269 Trinity [106] assembled RNA-Seq transcripts were used to generate a hints file for donor/acceptor intron sites. Subsequent gene modeling generated 5,637 gene models, which were then corrected with PASA [117] based on RNA-Seq transcripts. Because of the unusual nature of termination codons in Blastocystis [25], it was necessary to set the parameter stopCodonExcludedFromCDS to true. Of the final gene model set, 58% was examined manually at some point to check various aspects such as intron boundaries, stop codons, and chimeric models. The degree to which intron boundaries were confirmed by RNA-Seq data was determined by mapping RNA-Seq data to the genome assembly using Bowtie2 [113]. Using the intron boundary positions and the alignment positions of individual reads from the SAM file, an in-house perl script calculated that 70.3% of the intron boundaries were confirmed by 5 or more RNA-Seq reads (60.1% with 10 or more RNA-Seq reads).

Functional annotation
To assist in the manual annotation of the gene models, various information was generated for each model. The models were compared against the conserved domain database [118] and domain hits as well as Pfam [119] hits were extracted from the results. The models were also compared against the NCBI protein database as well as the protein data set for ST7 using BLASTp. SignalP results were generated using a locally installed version of SignalP. The genome browser Genomeview [120] was chosen for annotation because of its relative ease in setting up and populating with information via gff files and the ability to alter the information using custom scripts. The genome browser also included tracks for genome reads mapped to the scaffolds (Bowtie2 [113] and SAMtools [114]), genome read map coverage, RNA-Seq reads mapped to the scaffolds, and RNA-Seq map coverage.
Because of the possibility of missed genes, the intergenic spacers were identified and the sequences compared against the NCBI protein database (nr) as well as the ST7 protein data set using BLASTx. Potentially missed genes were mapped to the scaffolds and displayed in the genome browser for confirmation and correction through manual annotation. The ST7 protein data set was also compared to the gene models for ST1 using BLASTp and the results mapped to the scaffolds as a genome browser track.

Comparative genomics
Automated annotation for eukaryote genomes, particularly those from nonmodel organisms and poorly sampled lineages, are prone to missed genes, i.e., the gene-finding algorithms fail to find all of the protein-coding regions. While having a transcriptome is invaluable for reducing the level of missed genes, invariably, some are not found. Therefore, as a general rule, the various analyses detailed below that involve presence/absence of genes, particularly when comparing STs, used techniques such as tBLASTn to interrogate the full genome sequence before declaring that a gene was missing from an ST. Additional steps to confirm the presence/ absence of a gene are indicated in the individual sections.

GC content
The genomic scaffolds for STs 1, 4, and 7 were ordered according to their lengths. They were further stripped of their Ns and header lines. The program GC-Profile [121] as implemented at http://tubic.tju.edu.cn/GC-Profile/ was used to create GC profile and GC content graphs for the 3 genomes. As recommended by the authors, a halting parameter of 300 and a minimum segmentation length of 3,000 bps were used.
GC-Profile, a windowless method, was chosen from many different algorithms to avoid some of the disadvantages present in other methods to calculate GC content such as GC patterns dependent on window size, and lack of resolution [121,122].

Read-depth coverage SNP
The SAMtools option depth [114] was used to determine the read-depth coverage at each position of the genome scaffolds. These values were used to calculate median read-depth coverage and normalized median read-depth coverage using in-house scripts. SNPs were detected using the SAMtools options mpileup and bcftools.

Amino acid divergence between STs
The level of amino acid divergence between Blastocystis protein sequence data sets was analyzed using reciprocal best BLAST hit protein sequence pairs. Because of concerns about retained introns in some of the STs as well as issues with incorrect stop codons, the calculation of amino acid identity scores for gene pairs was restricted to high-scoring segment pairs without the inclusion of gaps. The values were generated using an in-house perl script.

Intron categories
For Blastocystis ST1, the analysis of different intron categories was based on the final annotation of the genome, incorporating data from RNA-Seq and extensive manual curation. To analyze introns in the Blastocystis ST7 genome, the available annotation of the genome could not be relied on because it proved to be inaccurate in many cases. Therefore, analysis was restricted to introns that have direct support from EST sequences. Available ESTs were aligned to the genome sequence using STAR [123] and introns indicated by the alignments were identified and manually corrected if needed. Homologs of protein subunits specific for the minor spliceosome were identified by BLAST (BLASTp, tBLASTn) using sequences of previously defined subunits from Homo sapiens and plants. The identity of the candidate orthologs was confirmed by reciprocal BLASTp searches against the nr database at NCBI. Incorrect or missing gene models in Blastocystis ST1 were corrected or added to the annotation of the genome. Most of the identified homologs in Blastocystis ST7 are either represented by incorrect gene models or completely missing from the annotated genes but could be found in the assembly with tBLASTn. Models for snRNA in both STs were identified using the program cmscan from the Infernal package [124].

Mitochondrial protein predictions
A multipronged approach was used to predict mitochondrial proteins. (1) Each gene model and translated transcript was queried for N-terminal mitochondrial targeting sequences using Mitoprot and TargetP [125,126]. Sequences that returned a score greater than 0.5 were further investigated as MRO candidates. (2) The top 10 BLAST hits retrieved from the nr database at NCBI were surveyed for any mitochondrial annotation and manually investigated. (3) All gene models and translated transcripts were queried against a local curated MitoMiner data set [127] and significant hits (e-value < 1e-10) were further investigated. (4) Finally, a local MRO protein data set from various anaerobic protists including Pygsuia biforma, T. vaginalis, and Mastigamoeba balamuthi was queried against the gene models and translated transcript data sets. Mitochondrial targeting sequence scores and protein annotations are provided in S3 Data. Those proteins previously unrecognized as mitochondrial in Blastocystis species are shown in color in Fig 7.

Cell cycle
Homologs of proteins suspected to be involved in cytokinesis were identified by BLASTp against the Blastocystis ST1 predicted proteome (e-value cutoff: 10e-10), using as queries previously identified mammalian orthologs [128]. Absence of homologs was verified by tBLASTn against the genomic scaffolds and transcriptome. Blastocystis hits were then used as queries to run reversed BLASTp searches against nr and results were manually inspected to confirm assignation to a given protein family. Identification of homologs of the APC/C complex and targets was carried out in a similar fashion, using as queries previously identified homologs in Blastocystis ST1 or, if these did not exist, various other eukaryotic homologs [78]. For calcium-binding proteins, H. sapiens homologs of protein families identified in [129] were used as queries to carry out BLASTp searches against the Blastocystis ST1 predicted proteome but also against all eukaryotic protein data present in GenBank as of February 2016 ("Eukaryota[Organism]" option of the BLAST+ package).

DNA repair and meiosis
A query database including 135 DNA repair genes, meiosis-specific genes, and meiosis-related genes from a wide range of eukaryotes was established using literature and key word searches of the NCBI database. Genes with functional experimental evidence were used as queries to retrieve similar proteins using PSI-blast. After retrieval, protein sequence alignments were created and used to build HMM profiles with HMMER3.1b2 [130] to enable gene searches in Blastocystis STs 1, 4, and 7. Retrieval of genes was restricted to e-values below 1e-04. Phylogenetic trees were constructed with FastTree [131] using closely related paralogs and/or reconstructing entire gene families for each gene of interest. Protein domains were mapped to the trees to facilitate gene recognition. In a few cases, secondary structure predictions were made to confirm the gene identification.

Kinome analysis
221 kinases were identified from the protein-coding gene set of Blastocystis ST1 using HMMER [130]. The STYKc profile (SM00221) from the SMART database [132] was used, as was the Pfam Pkinase profile (PF00069) [119]. Blastocystis kinases were classified by their positions in phylogenetic trees inferred from the alignments of the HMMER output (both neighbor joining and maximum likelihood trees were taken into consideration). Blastocystis gene models with 2 kinase domains were classified by the phylogenetic position of the domain with a better HMMER score. The kinase classification follows the webite www.kinase.com [33].
The 221 kinases identified from Blastocystis ST1 were used to query the protein-coding gene sets from STs 4 and 7. The annotations for STs 4 and 7 were also searched for the key word "kinase" and putative protein kinases were checked against the kinase database (www. kinase.com) using the BLAST search tool.
The resulting sequences from STs 1, 4, and 7 were aligned using MUSCLE [133], followed by manual inspection of the alignment. A maximum likelihood tree was inferred by the RAxML program [134] with the PROTGAMMALG model and visualized using Figtree (http://tree.bio.ed.ac.uk/software/figtree/).

Membrane-trafficking system
The components of the membrane-trafficking system in Blastocystis STs 1, 4, and 7 were identified using both comparative genomic and phylogenetic methods. To identify potential homologs, functionally characterized membrane-trafficking components from H. sapiens and S. cerevisiae were used as queries to search the Blastocystis predicted proteins using BLASTp with default search parameters. Hits with an e-value 5e-02 were considered candidate homologs and were used as BLAST queries to reciprocally search predicted proteins of the H. sapiens and S. cerevisiae genomes. The Blastocystis protein was considered homologous to the initial H. sapiens query if it retrieved the query or a clear ortholog as (i) the top hit and (ii) with an evalue 5e-02.
If BLASTp or HMMer failed to identify a membrane-trafficking component, tBLASTn searches were performed, using the H. sapiens and S. cerevisiae queries to search the Blastocystis scaffolds. The region of the scaffold with a hit that had an e-value 5e-02 was excised and used as a BLASTx query to search the H. sapiens genome, and homologs were identified using the same criteria as were used in the BLASTp and HMMer searches.
The more permissive e-value cutoff of 5e-02 was intended to reduce the number of false negatives from divergent Blastocystis genes when searching with the experimentally characterized opisthokont queries. Nonetheless, we found in post hoc assessment that over 93% of the orthologs identified had e-values of lower than 1e-05 in either the forward or reverse BLAST searches.
In searching for the components of a protein family in Blastocystis, Bayesian and maximum likelihood phylogenetics methods were used to more rigorously determine orthology. For each phylogeny, a combination of homologs from H. sapiens, S. cerevisiae, T. pseudonana and P. sojae, P. infestans, A. thaliana, D. discoideum, and N. gruberi were collected and aligned with the Blastocystis homologs using MUSCLE version 3.8.31 [133] and manually adjusted. Positions of ambiguous homology were removed (alignments available upon request) and ProtTest 3.2 [141] was used to determine the best-fit model of sequence evolution, which, in all cases, was the LG model [142], incorporating rate among site and invariant site corrections where relevant. Phylobayes 3.3 [143] was used to produce the optimal topology and posterior probability values. Analyses were run until the average standard deviation of the split frequencies fell below 0.1 and the effective sample size was at least 100. Once convergence occurred, the first 20% of sampled trees were removed. Additionally, RAxML and, in some cases, PhyML version 3 [144] were used to obtain maximum likelihood bootstrap values (100 pseudoreplicates). Resultant phylogenetic trees were viewed using FigTree v1.4.0.

Peroxisomal machinery
To identify putative PEX genes in Blastocystis STs 1, 4, and 7, protein sequences from H. sapiens, S. cerevisiae, and/or Neurospora crassa were used as queries for BLASTp and tBLASTn searches against locally hosted Blastocystis protein sequences and scaffolds, respectively. Candidate homologs with e-values less than or equal to 0.05 were subjected to reciprocal BLASTp searches against the locally hosted query genome as well as the locally hosted nr database. The retrieval of the original query as the top reciprocal BLASTp hit with an e-value less than or equal to 0.05 was the criteria for identification of a putative homolog.
To identify putative PEX genes in A. thaliana and the stramenopile genomes of P. ramorum, T. pseudonana, and P. tricornutum, H. sapiens, S. cerevisiae, and/or N. crassa protein sequences were used as queries for pHMMer [145] searches against locally hosted query genomes. Candidate homologs with e-values less than or equal to 0.05 were subjected to reciprocal pHMMer searches against the locally hosted genome and NR database. Thus, the retrieval of the original query as the top reciprocal pHMMer hit with an e-value less than or equal to 0.05 was the criteria for identification of a putative homolog. Results were compared to previously listed PEX genes in A. thaliana [146] and P. tricornutum [68]. Newly identified A. thaliana, P. ramorum, T. pseudonana, and P. tricornutum PEX genes were subsequently used as queries for BLASTp and tBLASTn searches against locally hosted Blastocystis protein sequences and scaffolds, respectively. Candidate homologs were subjected to reciprocal BLASTp searches according to the criteria described above.

Identification and analysis of CAZymes
The CAZy annotation pipeline was used to analyze 5,966 predicted proteins from the Blastocystis ST1 genome using a 2-step procedure of identification and annotation [147]. Sequences were subjected to BLASTp analysis against the CAZy database, composed of full-length proteins. Hits with an e-value <0.1 were then subjected to a modular annotation procedure using BLASTp against libraries of catalytic and carbohydrate binding modules and profile HMMs [148]. The results were complemented with signal peptide, transmembrane, and glycosylphosphatidylinisotol (GPI) anchor predictions [149,150]. Fragmentary gene models and all models suspected of containing errors were identified and flagged. A final functional annotation step involved performing BLASTp comparisons against a library of protein modules derived from biochemically characterized enzymes [147]. Three other stramenopile species, the diatom T. pseudonana, and the oomycetes, Albugo laibachii Nc14 and P. infestans T30-4, are present in the CAZy database and were used for comparison. The predicted CAZymes encoded by the genome of the brown alga Ectocarpus siliculosus (http://bioinformatics.psb.ugent.be/orcae/ overview/Ectsi) were also annotated using the same procedures for comparison with other stramenopiles.

Intermediary metabolism homology probing
To predict various pathways related to intermediary metabolism (e.g., nucleotide, amino acid, cofactor metabolism, etc.), the predicted gene models and RNA-Seq transcripts were annotated using the KEGG automated annotation server (KAAS) using the single-directional best hit approach (see http://www.genome.jp/tools/kaas/). Each pathway was individually searched for completeness and results are summarized in S8 Data. In some cases, in which KAAS did not predict an ortholog capable of a reaction, query sequences were retrieved from KEGG and manually searched against the scaffold and transcriptome data.

Secretome
All predicted gene models were run through SignalP with a threshold of 0.70. The selected models were then run through TargetP in order to identify a secretory pathway signal peptide with a cutoff of 0.70. Subsequently, the remaining models were run through WoLF PSORT [151] and only those predicted as extracellular were retained. As a final step, TMHMM [152] was used as a final checkpoint to find and remove any model with predicted transmembrane domains. Therefore, the secretome was identified as having a secretory pathway signal peptide, extracellular localization, and absence of transmembrane regions.

Proteases
All gene models were run through the MEROPS database [51]. Only the models with predicted active sites were considered. Annotation and naming of the models follows the MEROPS database terminology (S2 Data).

Flagellar-associated proteins
A reciprocal best BLAST analysis was performed for the protein data sets from Blastocystis ST1 and C. reinhardtii using the program orthoparahomlist.pl with default settings (Stanke M. Orthoparahomlist.pl script. 2011. https://github.com/goshng/RNASeqAnalysis/blob/master/ pl/orthoparahomlist.pl). The resulting list of orthologs was compared against the 213 C. reinhardtii conserved ciliary proteins [153] to determine which of the orthologs found in Blastocystis ST1 had matches.

GTPases
Members of the Ras superfamily were searched for in the 3 Blastocystis genomes using wellcharacterized representative sequences from other eukaryotes (H. sapiens, P. sojae, and others) as queries. Both the databases of predicted protein sequences (using BLASTp) and the genome sequence assemblies (using tBLASTn) were searched to ensure that no gene was missed because of the lack of a corresponding protein prediction. Indeed, a number of genes that were missing in the protein sequence databases were identified in all 3 STs. The missing or inaccurate gene models in ST1 were manually predicted or corrected using various lines of evidence (transcriptomic data and comparisons to homologs) and incorporated into the main genome annotation. Models missing for the ST4 and ST7 genomes were constructed only when the phylogenetic relationship to ST1 genes was unclear from the tBLASTn searches. Orthologous and paralogous relationships among Blastocystis Ras superfamily genes were established primarily on the basis of reciprocal BLAST searches, but phylogenetic analyses were required in a few cases. Orthology to conserved eukaryotic groups of GTPases was, in most cases, obvious from BLAST comparisons, except for several highly divergent paralogs related or similar to Rab GTPases, which most likely represent lineage-specific, rapidly evolving genes derived from duplications of some common Rab genes. A more detailed analysis was carried out for the Blastocystis family of Miro proteins. These sequences and Miro homologs from 2 newly analyzed stramenopile lineages (Labyrinthulea, represented by Aurantichytrium limacisporum, and Placididea, represented by Cafeteria sp. Caron lab isolate) were added to a selection of Miro sequences previously analyzed by Vlahou et al. (2011) [63], excluding the very divergent sequences of N. gruberi, ciliates, and trypanosomatids. The sequences were aligned using MAFFT (version 7, http://mafft.cbrc.jp/alignment/server/, [154]) and the alignment was trimmed using the Gblocks server (http://molevol.cmima.csic.es/castresana/Gblocks_server. html, [155]) with the least stringent parameters, keeping 245 aligned amino acid positions. The tree was calculated using RAxML-HPC (8.2.8) run at the Cyberinfrastructure for Phylogenetic Research (CIPRES) Portal (http://www.phylo.org/sub_sections/portal/), employing the LG+Г+F substitution model and rapid bootstrapping followed by a thorough search for the optimal tree. The middle gene corresponds to RPS23. ST1 genes are in blue, ST7 in red. ST1 has 6 copies of RPS23, all of which differ from each other at the nucleotide level. The 3 copies of RPS23 in ST7 are aligned with the corresponding copy in ST1 based on similar flanking genes. Two ST1 copies (AV274_5206 and AV274_5757) do not have a corresponding ST7 copy and, based on the arrangement of flanking genes, represent either insertions in ST1 or deletions in ST7. (DOCX) S10 Fig. GC profile and GC content graphs for Blastocystis ST1, ST4, and ST7. GC profile graphs (A1, B1, and C1 for ST1, ST4, and ST7, respectively) are based on a windowless method. Straight lines denote a region of relatively homogenous GC content. An upward line indicates an abrupt increase in GC content while a downward line indicates an abrupt decrease in GC-content. GC content graphs (A2, B2, and C2 for ST1, ST4, and ST7, respectively) indicate the percentage of GC bases across a segment. The graphs were generated using http:// tubic.tju.edu.cn/GC-Profile/ with a halting parameter of 300 and a minimum segment length of 3,000. (DOCX) S1