Hidden genomic evolution in a morphospecies—The landscape of rapidly evolving genes in Tetrahymena

A morphospecies is defined as a taxonomic species based wholly on morphology, but often morphospecies consist of clusters of cryptic species that can be identified genetically or molecularly. The nature of the evolutionary novelty that accompanies speciation in a morphospecies is an intriguing question. Morphospecies are particularly common among ciliates, a group of unicellular eukaryotes that separates 2 kinds of nuclei—the silenced germline nucleus (micronucleus [MIC]) and the actively expressed somatic nucleus (macronucleus [MAC])—within a common cytoplasm. Because of their very similar morphologies, members of the Tetrahymena genus are considered a morphospecies. We explored the hidden genomic evolution within this genus by performing a comprehensive comparative analysis of the somatic genomes of 10 species and the germline genomes of 2 species of Tetrahymena. These species show high genetic divergence; phylogenomic analysis suggests that the genus originated about 300 million years ago (Mya). Seven universal protein domains are preferentially included among the species-specific (i.e., the youngest) Tetrahymena genes. In particular, leucine-rich repeat (LRR) genes make the largest contribution to the high level of genome divergence of the 10 species. LRR genes can be sorted into 3 different age groups. Parallel evolutionary trajectories have independently occurred among LRR genes in the different Tetrahymena species. Thousands of young LRR genes contain tandem arrays of exactly 90-bp exons. The introns separating these exons show a unique, extreme phase 2 bias, suggesting a clonal origin and successive expansions of 90-bp–exon LRR genes. Identifying LRR gene age groups allowed us to document a Tetrahymena intron length cycle. The youngest 90-bp exon LRR genes in T. thermophila are concentrated in pericentromeric and subtelomeric regions of the 5 micronuclear chromosomes, suggesting that these regions act as genome innovation centers. Copies of a Tetrahymena Long interspersed element (LINE)-like retrotransposon are very frequently found physically adjacent to 90-bp exon/intron repeat units of the youngest LRR genes. We propose that Tetrahymena species have used a massive exon-shuffling mechanism, involving unequal crossing over possibly in concert with retrotransposition, to create the unique 90-bp exon array LRR genes.


Introduction
The species is one of the fundamental units of biology. Defining a species has been a controversial issue for more than half a century, and there are more than 20 species concepts [1]. One commonly used method of species definition by taxonomists is the morphospecies concept [1]. The term morphospecies is applied to a taxonomic group containing multiple cryptic biological species that are sexually isolated but morphologically indistinguishable or very similar [2][3][4]. Morphospecies are quite common and widespread in eukaryotes, from unicellular protists to mammals [2,[5][6][7], and yet the evolutionary novelty hidden in morphospecies remains poorly investigated. Ciliates are a group of protists in which morphospecies are particularly common; thousands of free-living ciliate morphospecies are estimated to exist [8,9]. The ciliate genus Tetrahymena is composed of more than 40 currently described species [10]. Historically, a large group of morphologically very similar Tetrahymena species were referred to as T. pyriformis [9]. Following the discovery of mating types in Tetrahymena, it was recognized that T. pyriformis is a morphospecies consisting of many reproductively isolated species and some asexual species [11,12]. These species cannot be distinguished by ultrastructural morphology (such as cortical features revealed using protargol staining or electron microscopy), the number and size of micronuclear chromosomes, or tolerance to metals and high temperature [12]. Comparison of the ribosomal DNA (rDNA) sequences of these species suggested that they have high genetic divergence [9] and that the Tetrahymena genus is a few hundred million years old [13].
Like ciliates in general, Tetrahymena separates its germline and somatic genetic information by maintaining 2 functionally distinct nuclei: the silent, diploid micronucleus (MIC) or germline nucleus, and the highly expressed, polyploid macronucleus (MAC) or somatic nucleus [8,14]. The ciliate life cycle includes a sexual stage (conjugation) in which the conjugants undergo meiosis of the diploid MIC and exchange haploid gametic pronuclei, which fuse and generate a diploid fertilization nucleus in each exconjugant [15]. The new diploid MIC and polyploid MAC of the sexual progeny differentiate from mitotic siblings of the diploid fertilization nucleus, and the parental MAC is destroyed [15]. During MAC differentiation, transposable elements (TEs) and other repeated sequences inherited from the MIC are excised as Internal Eliminated Sequences (IESs) [16,17]. The MAC chromosomes are formed by joining the remaining MAC-destined sequences (MDSs) [16,17]. All gene expression during the vegetative phase of the life cycle occurs in the MAC [16,17]. Thus, from a genetic and evolutionary standpoint, ciliates are comparable to multicellular eukaryotes that separate germline and somatic cell lineages [18].
We were intrigued by the nature of the evolutionary novelty that has led to prolific speciation within this genus while retaining morphological similarity over a few hundred million years. The MAC and MIC genomes of the model species T. thermophila were sequenced, assembled, and annotated previously [19,20]. We report here the sequencing and annotation of 9 additional Tetrahymena MAC genomes widely spanning the phylogeny of the genus. The on ortholog clustering, all predicted protein-coding genes in each species were divided into 2 groups: conserved genes (represented in at least 2 species) and "species-specific" genes (represented in only 1 species in our current sample). Proportions of species-specific genes range  from 15% to 38% among the 10 species (Fig 2A). Different species seem to have independently undergone similar genome innovation processes, because the top protein domains encoded by species-specific genes are very similar for every species, mainly consisting of the following 7 domains: LRR, tetratricopeptide repeat (TPR), WD40 repeat, protein kinase (PK), cyclic nucleotide-binding domain (CNBD), growth factor receptor (GFR) cysteine-rich domain, and Ploop-containing nucleoside triphosphate hydrolase (P-loop NTPase) ( Fig 2B and S2B Fig). These 7 protein domains are essentially universal, occurring widely in both eukaryotes and prokaryotes.
In T. thermophila, the germline (MIC) contains 5 pairs of metacentric chromosomes [24,25]. Recently, chromosome-level assemblies of all 5 chromosomes were generated, and the limits of centromeric regions of each MIC chromosome were rigorously established [20], allowing us to map gene distribution patterns. We aligned T. thermophila's most conserved (category X, found in all 10 species) and least conserved (T. thermophila-specific) genes with the MDS of the 5 MIC chromosome assemblies. In general, species-specific genes are highly enriched in the pericentromeric and subtelomeric regions (approximately 2 Mb near the midpoints of the centromeric regions, and the terminal approximately 1 Mb at both ends of the chromosome assemblies-note that these are arbitrary lengths based on the observed gene distribution patterns). In contrast, species-specific genes are depleted in MIC chromosome arms (Fig 3). Conversely, the most conserved genes are enriched on MIC chromosome arms and depleted in MIC pericentromeric and subtelomeric regions (Fig 3). The same differential localization is observed for species-specific LRR genes, the largest group of species-specific genes (Fig 3). This differential localization supports the hypothesis that, in T. thermophila, speciesspecific genes are preferentially generated and/or retained within the pericentromeric and subtelomeric regions of the MIC genome. In a subsequent section, we provide additional support for the preferential generation hypothesis.

Species-specific genes show relatively rapid evolution, high incidence of tandem duplication, and low expression in axenic culture
The abundance of species-specific genes in Tetrahymena raises several questions that are addressed in this section. By what mechanism(s) have they expanded? Have they experienced reduced selective pressure, perhaps suggestive of sub-or neo-functionalization? Are they engaged in essential metabolic functions or rather more specialized functions?
Tandem gene duplication is a major way to generate raw material for genome innovation [26]. The initial analysis of the T. thermophila MAC genome sequence uncovered a high proportion of tandemly duplicated genes [19]. We have further investigated this phenomenon using comparative genomics. We determined the proportion of strictly tandemly arranged genes (no interspersion of unrelated genes) in each ortholog cluster and compared these frequencies among the 10 categories of ortholog clusters. We found far higher frequencies among the species-specific group (category I; range 8%-51%) compared with the most conserved group (category X; range 7%-14%; S3 Fig). Singletons, i.e., genes that do not fall into any OrthoMCL gene cluster in any of the 10 species (S2 Fig), and which thus show the greatest divergence, are also tandemly arranged (with respect to other singletons) with high frequency (S3 Fig) and show an absence of interspersion with Category I-X genes.
The largest tandemly duplicated gene cluster found in any of the 10 species was an LRR gene cluster in T. pyriformis consisting of 17 strict tandem inparalogs, extending to 65 LRR genes when the criterion used to identify tandemly duplicated genes was relaxed by allowing 2 inparalogs to be separated by no more than 3 other genes (S4 Fig). Although this tandem LRR gene cluster belongs to an ortholog group with genes present in 4 species (category IV), the extensive expansion is species specific, i.e., restricted to T. pyriformis. Additional species-specific clusters of tandemly duplicated LRR genes are found in other species (e.g. S5 Fig). As with the distribution of species specific, the proportion of tandemly duplicated genes is higher in the pericentromeric and subtelomeric regions than on chromosome arms (Fig 3).
The Ka/Ks value (the ratio of nonsynonymous to synonymous substitutions) is informative regarding the selective pressures encountered in gene evolution [27]. When paralogs in different Tetrahymena species are compared, pairwise Ka/Ks values are significantly higher (Mann Whitney U test, p < 0.01) for category I and II genes (i.e., genes in ortholog clusters represented in only 1 or 2 species), with median Ka/Ks values 0.25 and 0.30, respectively, than for categories III-X genes, with median Ka/Ks values between 0.03 and 0.07 (S6A Fig). Ka/Ks values less than 1 indicate that the genes underwent purifying selection. However, the higher pairwise Ka/Ks ratios of species-specific genes suggest relatively relaxed pressures of purifying selection in their evolution. This difference is especially true for LRR genes, for which the median Ka/Ks value of approximately 0.4 for categories I and II genes is 20 times higher than the approximate 0.02 value for category X genes (S6B Fig). Mapping Ka/Ks values of all genes to the T. thermophila MIC chromosomes shows that genes within pericentromeric and subtelomeric regions have higher Ka/Ks ratios than those within chromosomes arms (Fig 3). This distribution of Ka/Ks ratios resembles the gene distribution pattern of species-specific genes and provides independent evidence for the preferential localization of rapidly evolving genes to pericentromeric and subtelomeric chromosomal regions.
In its natural aquatic environment, bacteria are the principal food source of Tetrahymena, but it can also grow in axenic culture in the laboratory [28]. Tetrahymena can take up particulate nutrients through phagocytosis and absorb soluble nutrients in growth media through its membrane system [29]. When expression levels of T. thermophila genes during vegetative growth in axenic culture (Super Proteose Peptone [SPP] medium) are plotted against MIC chromosomal location, these levels are lower in pericentric and subtelomeric regions and higher on MIC chromosome arms (Fig 3). Directly checking the expression of species-specific genes shows that most of them are silenced or have very low expression during vegetative growth in axenic culture, the overall expression level being much lower than the whole-proteome background (S7A Fig). Further analysis of genes in different categories of gene clusters shows that gene expression levels correlate with the degree of gene conservation: conserved genes (categories II-X) are more highly expressed than species-specific genes (S7B Fig). This finding suggests that most species-specific genes do not function in basic structural or metabolic functions or in absorbing the soluble nutrients found in axenic culture. Further experimentation may reveal growth conditions under which these species-specific genes are expressed.
LRR genes make the largest contribution to the high level of genome divergence and show unique 90-bp exon arrays, phase 2 introns, and diverse gene structures The number of genes containing the 7 most frequent species-specific protein domains has extensively expanded in every Tetrahymena species (S2 Table). This is especially true for LRR genes: all but 2 species (T. empidokyrea and T. shanghaiensis) have more than 1,000 LRR genes, and T. pyriformis has over 2,000 (S2 Table), more than 6% of its total genes. Indeed, LRR genes numerically compose the largest category of species-specific genes in Tetrahymena (Fig 2B and S2 Table). Furthermore, in any given Tetrahymena species, about 45%-70% of LRR genes are specific to that species (except for T. empidokyrea, the mosquito parasite). This range is significantly higher (chi-squared test p < 1 × 10 −5 ) than that for all pooled species-specific genes in each species (15%-38%). Therefore, LRR genes form the largest gene category contributing to the high level of genome divergence between the 10 morphologically similar Tetrahymena species investigated here. LRR is a widespread protein structural motif, found in all life forms, that is generally 20 to 29 residues long and often involved in protein-protein interactions. Genes with different numbers of LRRs can form a variety of protein structures [30] and perform many different functions.
A striking feature of the MAC (somatic) genomes of the 10 Tetrahymena species is the large number of exons that are exactly 90 bp in length (Fig 4A and 4B). In silico functional annotation of genes containing at least one 90-bp exon showed that 76% of them are LRR genes (S8A Fig). Among LRR genes containing 90-bp exons, 77% of them have at least 3 consecutive 90-bp exons, which we will refer to as 90-bp exon arrays ( Fig 4B). Extreme intron phase bias associated with 90-bp exons was found in the LRR genes. The overwhelming majority of introns (>98%) that precede 90-bp exons (tens of thousands of them) are phase 2 introns, meaning that the intron is inserted at position 2 of a codon ( Fig 4C). This extreme bias is observed regardless of the number of 90-bp exons in the gene (S3 Table). Comparison controls (introns in LRR genes that lack 90-bp exons and introns preceding 90-bp exons in non-LRR genes) show no phase 2 bias (34%) or a mild phase 2 underrepresentation (24%), respectively (S9 Fig). This extreme intron phase bias of LRR 90-bp exons is observed in all 10 Tetrahymena species investigated here (Fig 4C). A consequence of this extreme phase 2 intron bias is that, in an array of consecutive 90-bp exons, the 5 0 base pair of any exon functions in concert with the two 3 0 base pairs of the previous exon to encode a"junction" amino acid. Thus, every 90-bp exon in the array contributes full or partial sequence information for the translation of 31 amino acids. The entire array of consecutive 90-bp exons thus behaves translationally as an interlocked unit. As far as we know, tandem arrays of exactly 90-bp exons and phase 2 introns have not been described in LRR genes of any other organism.
In Tetrahymena, the unique 90-bp exon structure can be used as an indicator to measure the number of LRRs in a gene (which cannot always be identified by common protein sequence similarity algorithms). When we focused on LRR genes containing 90-bp exons, the number of 90-bp exons in LRR genes showed large variation, ranging from 1 to 50 in most genes (Fig 4D), even among LRR genes that are clustered as inparalogs (S10 Fig). Thus 90-bp exon-containing LRR genes encode diverse protein structures with potentially different functions.

LRR genes in Tetrahymena MAC genomes fall into three major age groups
To further investigate the 90-bp exon-containing LRR genes, we first focused on the LRR genes in the model species, T. thermophila. In this species, 80% (1,500 out of 1,874) of LRR genes contain at least one 90-bp exon (S8B and S8C Fig). This finding raises the possibility of relatively recent evolutionary expansion that might be detected by the presence of repetitive sequences within an LRR gene or among different LRR genes. Therefore, we used RepeatModeler to identify nucleotide consensus repeat sequences (CRSs; the output consensus sequences of RepeatModeler) in the T. thermophila MAC genome. We de novo-identified 8 CRSs (CRS1-CRS8, see S4 Table and S3 Data) that mask LRR genes ( Fig 4B). Taken together, they mask 53% (986 out of 1,874) of T. thermophila LRR genes (S8D Fig). T. thermophila LRR genes can be sorted into 3 groups depending on whether they contain at least one 90-bp exon and whether they are masked by at least 1 of the 8 CRSs (Fig 5A). Group I consists of 322 genes that lack 90-bp exons and are not masked by any CRS. Group II consists of 566 genes that have at least one 90-bp exon and are not masked by any CRS. Group III consists of 934 genes that have at least one 90-bp exon and are masked by at least 1 CRS. Group II and III LRR genes generally contain arrays with tandem repeats of at least three 90-bp consecutive exons: 80% (456 out of 566) for group II and 87% (815 out of 934) for group III.
The fraction of species-specific LRR genes increases from group I to group III ( Fig 5B). Only 12% of group I LRR genes are species specific, while the percentages are 35% in group II and 80% in group III (Fig 5B). This finding suggests that LRR genes in T. thermophila fall into 3 major age groups and that group III mainly contains the youngest LRR genes. To validate this conclusion, we investigated additional features of the 3 groups.
Group I genes, which lack 90-bp exons (Fig 5C), appear to be the oldest T. thermophila LRR genes. Most group I LRR genes (88%) are conserved (i.e., not species specific, suggesting earlier origin) ( Fig 5B) and generally have fewer than 10 introns (92%) (S11A Fig). These traits are shared by the LRR genes of the more distantly related P. tetraurelia, which lack a 90-bp exon frequency peak and have few introns (S12 Fig). Introns of group I genes are usually approximately 56 bp long, similar to the genome background intron length (i.e., all predicted genes; Fig 5D and S13 Fig). These results suggest that ancestors of group I LRR genes originated before the Paramecium-Tetrahymena divergence. The GC content of T. thermophila group I genes is also similar to that of the genome background (approximately 27% for CDS; S11B  In contrast, group II and III genes have 90-bp exons, usually in arrays of at least 3 exons, and contain more introns (57% and 49% of genes in group II and III, respectively, have more than 10 introns) (S11A Fig). To the best of our knowledge, arrays of exactly 90-bp exons encoding LRR domains is a feature unique to Tetrahymena, which suggests that group II and III genes originated later than group I genes, after the Paramecium-Tetrahymena divergence.
Several observations strongly suggest that group III LRR genes are younger than group II genes. The former contain intragenic repeat units detectable at the nucleotide sequence level (masked by at least 1 CRS), which are lacking in group II LRR genes (S14 Fig). Furthermore, group II LRR genes have shorter introns, with a length distribution similar to that of group I genes and to the genome background length peak (approximately 56 bp; Fig 5D). In contrast, group III LRR genes usually have longer introns (main distribution peak at approximately 257 bp; Fig 5D and S14A Fig). Also, nucleotide conservation profiles of 90-bp exons show that group III LRR genes have greater intraspecies sequence conservation than group II LRR genes in all Tetrahymena species investigated (S15 Fig, S16 Fig and S17 Fig). All these findings suggest that group III represents the youngest group of LRR genes. There are additional characteristic differences between groups II and III. Group III LRR genes have a higher GC content than group I and II LRR genes, whose GC content is similar to the genome background (S11B Fig). Also, group III genes show different leucine codon preferences than those of group I and II genes (S11C Fig).

Recurrent waves of clonal expansion have shaped the evolutionary landscape of Tetrahymena LRR genes
Conservation profiles provide evidence of clonal expansions in Tetrahymena LRR genes. Using the same methods and criteria as in T. thermophila, we also identified LRR CRSs in the other 9 Tetrahymena species (S4 Table and S3 Data) and sorted their LRR genes into groups I, II, and III (S18 Fig) with the exception of T. empidokyrea, the only parasitic Tetrahymena examined here, which lacks the youngest group (III) of LRR genes.
The 90-bp exons of both group II and III LRR genes show extreme leucine conservation at 6 positions in all 10 Tetrahymena species (S19 Fig and S20 Fig). This observation, together with the exact 90-bp length and the extraordinary intron phase 2 bias, suggests that 90-bp exons originated by clonal expansion, by which we mean an increase in numbers by rounds of successive duplication starting from 1 or a few 90-bp exon(s) already present in the last common ancestor (LCA) of the 10 species, more than 300 Mya.
Amino acid conservation profiles of 90-bp exons of group II genes for 9 of the 10 species are very similar (S20 Fig). In contrast, the sequence profile of T. paravorax shows relatively higher conservation (S20 Fig and S5 Table). We next compared the sequence profiles between T. paravorax and the other 9 species using the Two Sample Logo tool based on the binomial statistical test. The amino acid profile of T. paravorax shows significant overrepresentation (35%-49%, S21A Fig) at some positions (9,10,12,13,15,18,20,23,24) compared to the other 9 species. At T. paravorax positions 3, 10, 11, 14, 17, and 20 (S5 Table), the most conserved amino acids are different from those in the other 9 species. The nucleotide sequence profile also showed significant overrepresentation at many positions (43%-50%, S21B Fig) compared to the other 9 species, especially at position 47, 59, 74, 80, and 82. These results suggest a later clonal expansion of group II exons in T. paravorax, after its divergence from the other 9 species.
The amino acid conservation profiles of group III exons show that, at several positions, different amino acids predominate in different species (S20 Fig). This is consistent with the hypothesis that group III exons resulted from later clonal expansions of progenitors having some differences from the consensus. For example, in contrast to the conserved leucine at position 16 in other species, isoleucine is the most common amino acid (85%) in T. shanghaiensis (S20 Fig), suggesting that the progenitor of the clonal expansion that generated the group III exons in T. shanghaiensis was a minority type with isoleucine at this position.
Intron length distributions provide independent evidence for recent independent waves of group III LRR gene repeat generation within different Tetrahymena species. A striking general feature of Tetrahymena group II and III LRR genes is that their 90-bp exons are usually found in tandem arrays of distinct intragenic repeat units (detectable at the nucleotide level) containing at least 3 consecutive exons (S14 Fig). For example, T. thermophila shows a peak at approximately 10 exons (S11A Fig). The introns in 90-bp exon arrays show characteristic lengths in each Tetrahymena species that has group III LRR genes (S18 Fig, column 4). For example, in T. thermophila, most group III LRR gene introns are approximately 257 bp in length (S18 Fig, column 4, green line), and therefore repeat units within LRR genes are approximately 347 bp long (90-bp exon + 257-bp intron). In T. malaccensis, its closest related species, the intron length of group III LRR genes peaks at about 232 bp. Such differences, also seen in other Tetrahymena species (S18 Fig, column 4, green lines), suggest that group III LRR genes expanded independently through recent lineage-specific events within individual species.
Although the main intron distribution peak of T. thermophila group III LRR genes is located at about 257 bp, we also see a sharp secondary peak at about 122 bp for a subset of 69 group III genes, which is exclusively masked by CRS1 (Fig 5D and S22 Fig). This result suggests a recent expansion of a subset of LRR genes originating from a T. thermophila LRR gene with at least one 122-bp intron. Interestingly, similar secondary intron length distribution peaks of group III LRR genes are seen in other species, including T. pyriformis, T. vorax, T. shanghaiensis, and T. paravorax (S18 Fig, column 4, green lines). These secondary intron length peaks imply that clonal expansions of subsets of LRR genes continue to independently occur in various species and can in principle account for the species-specific intron length distributions found today.
Very recent clonal expansions of 90-bp exons within group III LRR genes are revealed by CRS masking. As reported above, group III genes were assigned to subgroups based on CRS-masking scores. Most Tetrahymena CRS-masked exons fall into species-specific clusters or lineage-specific clusters (e.g., T. thermophila and T. malaccensis, or T. borealis and T. canadensis lineages) (S19 Fig), suggesting that clonal expansions of group III genes have continued to occur in different lineages or species.
Seven CRSs, which we refer to as "exceptional CRSs," cluster differently from all the others (S19B Fig, green clade). A phylogenetic analysis of exons masked by the various CRSs shows that 90-bp exons masked by exceptional CRSs of several distantly related species cluster together (S19B Fig, green clade), suggesting that the ancestor of these exons arose very early, pre-dating speciation within the Tetrahymena genus. However, other observations suggest that exons masked by exceptional CRSs expanded recently, in independent lineages. First, these exons still show intraspecies nucleotide level conservation, as they are masked by the exceptional CRSs. Second, nucleotide conservation profiles of T. thermophila CRS1-masked exons reveal nucleotide flips or greater conservation of the same nucleotide at particular positions relative to exons masked by the other 7 CRSs (S19C Fig and S23 Fig) In summary, our analyses of 90-bp exons of Tetrahymena LRR genes provide evidence of recurring waves of clonal expansion of 90-bp exons. The first detectable expansion occurred after the generation of the first exactly 90-bp exons and phase 2 introns in Tetrahymena LRR genes. This expansion occurred after the Tetrahymena-Paramecium divergence but preceded the divergence of the 10 currently sequenced Tetrahymena species. More recent expansion waves generated the group III genes that, within species, still retain nucleotide conservation with one another. The most recent expansions have generated multiple clades of lineage-specific or species-specific 90-bp exons, detected by CRS masking and as secondary peaks in the intron length landscape.

LRR gene evolutionary innovation: Proposed mechanisms at work
The strength of preferential localization of T. thermophila LRR genes to pericentromeric and subtelomeric regions of MIC chromosomes is related to the age of the genes. We have reported above that species-specific T. thermophila genes show preferential localization to pericentromeric and subtelomeric regions of MIC chromosomes. The 3 age groups of LRR genes in T. thermophila show different frequencies of species-specific gene content ( Fig  5B). Therefore, it is reasonable to expect differential localizations of these 3 groups of LRR genes on the MIC chromosomes. By mapping all the T. thermophila LRR genes in the 3 groups to the MIC chromosomes, we found that the youngest (group III) LRR genes, as expected, are highly enriched in the pericentromeric and subtelomeric regions (Fig 6). Group II LRR genes are somewhat enriched in the pericentromeric regions but are also found in chromosome arms (Fig 6). The oldest group of LRR genes, which have the highest proportion of conserved genes, are mainly distributed on the MIC chromosome arms (Fig 6). Our finding that the younger LRR genes are the higher is their concentration in pericentromeric and subtelomeric regions of MIC chromosomes strengthens the conclusion that these chromosomal regions are the highest source of genomic innovation in Tetrahymena.
Tetrahymena LRR genes show evidence of extensive ectopic recombination. Duplication by ectopic recombination (i.e., unequal crossing over) is generally considered to be the most common way of creating new genes [31, 32]. In T. thermophila, only about 1.6% of LRR genes appear to be the result of direct tandem duplications of entire genes. Most LRR genes show highly variable numbers of 90-bp exons, suggesting the occurrence of frequent interand intra-genic ectopic recombination. For example, the number of 90-bp exons in LRR genes ranges from 1 to 51 in T. thermophila. This suggests that extensive ectopic recombination has occurred during the evolution of LRR genes-not just gene duplications, which would not change the number of 90-bp exons per gene. To demonstrate the occurrence of ectopic recombination, we focused on group III LRR genes because they still show 90-bp exon sequence conservation at the nucleotide level (see also S17 Fig). We found that inparalogs in a gene cluster (ortholog group) with high sequence similarity to one another show high variability in the number of 90-bp exons (S10 Fig). This result strongly suggests that extensive ectopic recombination between LRR genes has generated inparalogs with diverse gene structures (i.e., different number of repeats).
To elucidate the evolutionary relationships among 90-bp exons of inparalogs, we investigated a T. thermophila LRR gene cluster consisting of 83 inparalogs. This cluster contains a total of 1,932 90-bp exons-an average of approximately 23 exons per gene. A phylogenetic tree of all the 90-bp exons (S24 Fig) clearly shows that extensive ectopic recombination has scrambled the sequence of the exons. Very often, the sequence of an exon is more closely related to that of an exon in a different gene of the cluster than to other exons in the same gene (S24 Fig). To further validate this result, we clustered the nucleotide sequences of all 90 bp-exons with identity levels of 97% or greater (i.e., allowing no more than 2 nucleotides' difference between 2 exons). The largest cluster contains twelve 90-bp exons, and these exons are found in different LRR genes with diverse gene structures on different MIC chromosomes (Fig  7, S25 Fig and S26 Fig). Ectopic recombination or gene conversion between 2 different exons in the same or different genes are possible mechanisms that can explain these observations. Exons of 90 bp participating in ectopic recombination need not belong to inparalogs; the same events can in principle occur between 90-bp exons of the same gene or of any 2 different LRR genes. Of course, only ectopic recombination that occurs in the MIC can be perpetuated. Recombination between nonhomologous chromosomes could often result in chromosomal rearrangements, which may become apparent when high-quality MIC genome assemblies are available for comparison to that of T. thermophila.
Group III 90-bp exon array LRR genes show preferential physical association with IESs in the T. thermophila MIC genome. As described in a previous section, the youngest T. thermophila LRR genes are highly enriched in the pericentromeric and subtelomeric regions of all 5 MIC chromosomes. These regions are also enriched with IESs [20] (S27 Fig), which include active and inactive TEs, other undefined repetitive sequence, as well as undefined unique sequence [20]. In the T. thermophila MIC, we have found that most (72%) species-specific LRR genes are flanked by IESs ( S28 Fig). This percentage is significantly higher than for other species-specific genes (72% versus 56%; chi-squared test p < 1 × 10 −5 ). These findings raise the possibility that IESs, and perhaps TEs within them, contribute to the generation of new species-specific genes.
To further investigate this possibility, we analyzed previously reported MIC CRSs [20] and found that some mask both MDSs and IESs (S29 Fig). A notable case is the MIC CRS named "Contig[0735]#nonTE/REP-MAC" [20], which masks >3,000 MIC DNA segments. The 5 0 segment of this MIC CRS shares significant nucleotide sequence identity to 7 of the 8 LRR gene-related MAC CRSs described in the previous section. Therefore, we will here abbreviate the name of that MIC CRS as tLRR-MIC-CRS. Compared to other MIC CRSs, tLRR-MIC-CRS is exceptional in that 80% of the masked DNA segments are located in MDS (i.e., also in the MAC). We now understand why: tLRR-MIC-CRS masks group III LRR genes containing Genome-wide analysis showed that 169 copies of the conserved 54-bp "tail" of Tt.REP are retained in the MAC genome (in contrast to most of Tt.REP's sequence, which is excluded from the MAC genome by developmental IES excision), and the distribution of those 54-bp sequences is highly correlated with the distribution of group III LRR genes in MIC chromosomes (Fig 6  and S31 Fig). Furthermore, CRS6 masks 3 tandem copies of what was originally called a "complex direct repeat" [33] (size approximately 350 bp) adjacent to Tt.REP and that we now recognize as 90-bp exon/intron repeats of group III LRR genes. In addition, by checking the relatively recently expanded copies of Tt.REP2 (a previously reported functional copy of Tt. REP [33]), we found that the tLRR-MIC-CRS repeat is usually located near Tt.REP (S32 Fig). As expected from previous findings about the distribution of TE elements in the T. thermophila MIC [20], Tt.REPs are highly enriched in pericentromeric and subtelomeric regions, as are the species-specific LRR genes ( Fig 8A).
Analysis of nearby repeat sequences shows that most Tt.REP elements (76%) colocalize with LRR repeat units (Fig 8B and 8C). More importantly, LRR repeat units are usually adjacent to both ends of Tt.REP, regardless of whether this TE is considered functional (i.e., has the 2 intact open reading frames of Tt.REP, S30A Fig) or nonfunctional (i.e., lacks at least 1 intact open reading frame) (Fig 8C). These neighboring LRR repeat units (approximately 62%) belong to functional protein-coding genes, some of which are expressed in the MAC (Fig 8B). These findings indicate an intimate physical relationship between Tt.REP and LRR genes.
To further validate the physical association between LRR gene repeat units and non-LTR REP retrotransposons in the Tetrahymena MIC genome, we obtained a draft MIC genome assembly of T. malaccensis (the species most closely related to T. thermophila). We used longread PacBio single-molecule real-time (SMRT) sequencing technology, which allows much more complete assembly of long repeated elements like REP. Most (60%) non-LTR REP retrotransposons in T. malaccensis (Tm.REP), and especially the functional ones (86%), were flanked by the T. thermophila "complex direct repeat" homolog, encoding an LRR 90-bp exon and intron sequence unit (S33 Fig). These combined findings in T. thermophila and T. malaccensis strengthen the evidence for a very specific and intimate association between REP and LRR genes.

High genetic diversity is observed among the 10 Tetrahymena species
Morphospecies have been widely found in many eukaryotes across the tree of life, including mammals, birds, insects, nematodes [5] [8]. Among ciliates, morphospecies appear to be particularly common. Foissner and colleagues reported that there are 4,500 described free-living ciliate morphospecies [9], suggesting that some general evolutionary features may favor their appearance within this particular group of unicellular eukaryotes. Despite their morphological similarity, molecular evidence (such as 18S rDNA sequence) indicates high genetic diversity among ciliate morphospecies, such as in Paramecium, Tetrahymena, Chilodonella, Carchesium, and some choreotrich and oligotrich ciliates [39][40][41][42][43][44][45]. Recent single-cell omics data also support the high genomic diversity of ciliates [46]. This dichotomy suggests that morphological and molecular evolution may be decoupled in many ciliate species [43]. Comparative genomics of Tetrahymena morphospecies Nanney noted this phenomenon in Tetrahymena long ago [13,21]. Based on the divergence of rDNA sequence, he estimated this genus has existed for approximately 300 million years although its morphological design has persisted unchanged [13,21]. Our whole genome studies of Tetrahymena more fully demonstrate the features of high genetic diversity and ancient divergence despite high morphology conservation. From a macroevolutionary perspective, we observe high genetic diversity at the MAC genome level when comparing the 10 sequenced Tetrahymena species. This high genetic diversity is consistent with our independent estimate that the Tetrahymena genus originated approximately 300 Mya (Fig 1), about the origin time of amniotes [47].
Our study further shows that the high genomic diversity is particularly pronounced among genes encoding proteins with certain domains, LRR domain-containing genes being the most numerous (Fig 2). The 10 species share >8,000 orthologs (6,052 one-to-one orthologs) in all 10 species, a number even larger than the total genes in the yeast genome. On the other hand, thousands of genes, averaging nearly a quarter (24%) of their total proteome, are unique to each species. Considering that morphospecies appear to be particularly common among ciliates, our findings in Tetrahymena likely are the tip of an iceberg: the biological diversity of other ciliate morphospecies is likely to be very high when investigated at the genome level.

Pericentromeric and subtelomeric regions of T. thermophila are genome innovation centers
The human genome sequencing project revealed that pericentromeric and subtelomeric regions are structurally complex [48], characterized by highly repetitive sequences and segmental duplication sequences specific to each chromosome arm [49]. In plants also, repetitive sequences are abundant in the centromeric, pericentromeric, and subtelomeric regions [50,51]. In Drosophila, TEs are abundant in centromeric and pericentromeric regions [52]. In some species, subtelomeric regions have also been shown to be gene rich [53][54][55]. For example, many variable surface antigen genes are located in the subtelomeric regions of the chromosomes of Plasmodium falciparum [54]. Pericentromeric regions are often referred to as genomic junkyards, but they can also be the birthplace of new genes with novel functions [56,57]. The occurrence of numerous duplications, transpositions, and rearrangements within pericentromeric and subtelomeric regions is associated with their being hotspots for eukaryotic chromosome evolution [58,59].
In T. thermophila, the only Tetrahymena species with a chromosome-level MIC genome assembly, TEs and other IES sequences are enriched in pericentromeric and subtelomeric regions as well [20]; they likely are also enriched in centromeric regions, but the centromeres are currently poorly assembled precisely because of the abundance of repeated sequence. Our study reveals that many species-specific genes of various types, especially LRR genes, are also enriched in these regions (Fig 2, Fig 3, Fig 8A and S27 Fig). Additional lines of evidence reported here support the idea that pericentromeric and subtelomeric regions are centers of gene innovation in T. thermophila.
1. Genes within MIC chromosome arms have significantly lower Ka/Ks ratios than those within MIC pericentromeric and subtelomeric regions (Fig 3 and S34 Fig), suggesting a higher concentration of rapidly evolving genes in the latter regions.
2. The proportion of tandemly duplicated genes is higher in pericentric and subtelomeric regions than chromosome arms.
3. Most telling, the youngest LRR genes are concentrated in pericentromeric regions, while the oldest are concentrated in chromosome arms. This finding rules out the alternative possibility that MIC chromosome arms are the centers of innovation and new genes are "swept" over time by some unknown mechanism towards centromeres and telomeres.
For these reasons, we conclude that pericentric and subtelomeric regions of T. thermophila MIC chromosomes are the main centers of gene innovation (Fig 9). We expect the same to be true for the other Tetrahymena species because the evolutionary observations reported in this article seem to be so conserved in all of them.
Because essential and conserved genes are preferentially located on germline chromosome arms, the segregation of genome innovation to pericentromeric and subtelomeric regions may effectively physically insulate the organism's need for genome stability from the uncertainty of genome innovation.

The uniqueness of 90-bp exons and phase 2 introns of Tetrahymena LRR genes
The primary function of the LRR domain is as a protein recognition motif [30,60]. LRR genes are widely distributed among eukaryotes [61] and often highly abundant [62][63][64][65][66], especially in plants (e.g., the apple genome has more than 1,000 LRR genes [67]). We report here that most Tetrahymena species have >1,000 LRR genes, and some of them have almost 2,000. The great Comparative genomics of Tetrahymena morphospecies abundance of Tetrahymena LRR genes and their continuous, species-specific diversification suggest that they may serve important evolutionary functions.
The exactly 90-bp exon peak (Fig 4) of LRR genes present in all Tetrahymena species investigated here is a remarkable feature of this gene family. Equally remarkable is that more than 96% of the approximately 72,000 90-bp exons are flanked by phase 2 introns on both ends ( Fig  4C and S9 Fig). Thus, 3 consecutive 90-bp exons must contribute to the translation of one LRR 30-amino acid repeat unit. Tandem repeats have contributed to gene diversity in many eukaryotes, including primates, insects, fungi, and dinoflagellates [68][69][70][71]; however, we are unaware of tandem repeats in any other eukaryote with such an exceptional uniformity of exon and intron features. These unique features suggest that the exactly 90-bp-exon LRR genes with phase 2 introns may represent a novel category of genes that have evolved only once: in the Tetrahymena lineage.

The special association between the non-LTR REP retrotransposon and LRR gene introns
The preferential association of the REP retroelement and LRR gene introns described in the Results section is very striking: 76% of Tt.REPs colocalize with LRR intron/exon repeat units (Fig 8). Furthermore, LRR repeat units are the most common element flanking Tt.REPs, (Fig  8, S31 Fig). Some of these nearby LRR repeat units belong to functional protein-coding genes ( Fig 8B). Indeed, we have reported here cases in which remnants of the REP element are inserted in the middle of a functional LRR gene (S31 Fig). In T. malaccensis, with a more complete assembly of repeated element sequence, 86% of functional Tm.REPs are flanked by LRR 90-bp intron/exon units (S33 Fig). These results show, at a minimum, that LRR gene introns are favored REP transposition targets, a possibility first glimpsed by Fillingham and colleagues [33]. The above observations also raise the possibility of REP-mediated co-retrotransduction of LRR intron/exon repeats. In fact, we can assemble some transcript fragments including adjacent regions of LRR gene repeats and REP DNA in developing MACs deficient in the RNA interference (RNAi) silencing pathway (S35 Fig); this result indicates the possibility of co-transcription and co-retrotransposition of LRR gene repeats and adjacent REP DNA. In other eukaryotes, such events often come to attention when paralogs are found in the genome that have lost their introns because they are spliced out prior to retrotranscription. In the case of the Tetrahymena LRR genes, this would result in fused 90-bp LRR exons. We searched the T. thermophila MAC genome and found only one possible case of fusion of 2 LRR exons, despite our efficient means of detection. Thus, if co-retrotransposition between LRR genes and REP elements really occurred, the introns of LRR genes may fail to be spliced out. However, more evidence is needed to support this hypothesis.

Tetrahymena LRR genes are a remarkable example of "exon shuffling"
Exon shuffling is a phenomenon whereby 2 or more exons from different genes are joined, or the same intron/exon unit is duplicated [31]. Exon shuffling is a major mechanism by which new genes are created [31]. Well-studied examples include the Jingwei gene in Drosophila [72] and LRR-containing pathogen receptors in plant [73]. The existence of thousands of Tetrahymena LRR genes with tens of thousands of exactly 90-bp exons and phase 2 introns is strong evidence for massive exon shuffling.
Our evidence implicates 2 types of MIC mechanisms as potential factors in LRR gene innovation by exon shuffling: ectopic recombination and retrotransposition. Ectopic recombination (i.e., unequal crossing over) is generally considered to be the most common way of creating new genes by tandem duplication [31, 32]. Cases of tandem gene duplication were previously reported in the T. thermophila MAC genome [19], and we describe here similar observations in the MAC genomes of 9 other Tetrahymena species. In the case of Tetrahymena LRR genes, MIC ectopic recombination between different repeats in the same LRR gene or between different LRR genes (Fig 7, S25 Fig, S26 Fig, Fig 9D1 and 9D2), either during meiosis or during asexual MIC mitosis, has likely resulted in the creation of LRR genes with more and fewer combinations of repeats (exon shuffling) and genomes with more and fewer LRR genes. Subsequent natural selection and genetic drift would result in the differential survival of genomes with more LRR genes and alleles with more repeats.
Several observations suggest that the REP retrotransposon could represent a second mechanism that may have caused ectopic recombination between LRR genes, although the phenomenon remains to be demonstrated. Tetrahymena REP non-LTR retrotransposon copies, as other TEs, are restricted to the MIC and are preferentially associated with pericentromeric and subtelomeric regions of MIC chromosomes (Fig 8A), just as the group III species-specific LRR genes. At a finer scale, Tetrahymena REPs are preferentially found next to LRR gene repeats, as described in the previous section. Retrotransposition of REP to introns of preexisting LRR genes (Fig 9D3, step1)-followed by the possible REP-mediated retrotransduction of LRR repeats to some other LRR gene-would lead to a net increase in the number of repeats in the target LRR gene (Fig 9D3, step2, left branch). A similar kind of exon shuffling has been reported for the human non-LTR retrotransposon LINE1, which retrotransduces DNA sequences flanking its 3' end into transcribed genes [74]. In addition, possible REP-mediated co-retrotransposition to a non-LRR intron location potentially leads to the dispersion of LRR repeats in the MIC genome (Fig 9D3, step2, right branch). Furthermore, REPs are repeated sequences, themselves capable of undergoing ectopic recombination with one another and thus generating novel LRR genes ( Fig 9D4).
Finally, we have reported here that some Tetrahymena LRR gene repeats containing exactly 90-bp exons are fused with other domains, including PK (S36A Fig). We conclude that exon shuffling has likely also occurred between LRR gene repeats and other protein domains to generate genes with novel functions.
In summary, during Tetrahymena evolution, it is very likely that exon shuffling, mediated by the concerted action of unequal crossing over, REP-mediated ectopic recombination, and possibly co-retrotransduction, is largely responsible for the origin, dispersion, expansion, and intragenic structure of LRR genes containing unique 90-bp exon arrays (Fig 9).

An intron length cycle revealed by Tetrahymena LRR 90-bp exon genes
We have reported above that cohorts of younger LRR genes have longer than average introns, with lengths inversely related to gene age, while the introns of the oldest LRR genes are approximately 50 bp-the genome background intron length. These observations suggest the existence of a Tetrahymena LRR "intron length cycle." We propose the following possible scenario. A wave of REP retrotranspositions into introns within LRR genes containing 90-bp exon/intron repeat units leads to a cohort of REP-containing LRR genes of the same age. Because the REP DNA is excised as an IES during MAC development, the insertion initially affects neither the length of the introns in the MAC nor the expression of the host genes in the cohort. With time, random mutations inactivate the REP elements of the cohort, but the REP-derived DNA is still recognized as an IES and excised during MAC development. This stage is consistent with our finding that many LRR gene introns in the MIC contain IESs (Fig 8B and S31 Fig).
As inactive REP copies continue to mutate, the length of REP remnants recognized and excised as IES gradually gets smaller, and the intron therefore gets correspondingly longer because DNA previously excised as IES has become additional intron DNA. Because of the randomness of those mutations, the cohort's peak in the intron length distribution slowly becomes broader and shallower. The long-term tendency of all Tetrahymena introns to shorten ultimately drives the cohort mean intron length back down to the approximately 50 bp genome background limit, and the cohort ceases to form a distinct peak. The cycle would operate identically if any of the REP transposition events also resulted in co-retrotransduction of LRR intron/exon repeats.
We can document this natural LRR intron length cycle because we can distinguish older from younger Tetrahymena LRR genes and detect the role played by the REP retrotransposon in their evolution. The waves of events that periodically restart the intron length cycle are likely to occur most frequently in pericentromeric and subtelomeric regions of MIC chromosomes, where active TEs-REP included-are concentrated. The intron length cycle may well apply to other genes, where age dependence of intron length may be more difficult to document than in the LRR genes.

On the function of the unique LRR genes in Tetrahymena
Tetrahymena species have more LRR genes than other eukaryotes. For example, whereas the human genome has <400 LRR genes [64], most Tetrahymena species have >1,000 LRR genes, and some (T. paravorax, T. pyriformis, T. malaccensis, and T. thermophila) have almost 2,000. These LRR genes have intact open reading frames, and RNA-seq data show that many are expressed genes under standard laboratory culture conditions. Furthermore, 9 of the 10 species have continued to generate LRR genes with tandem arrays of exactly 90-bp exons, including the earliest diverged T. paravorax. The mosquito parasite, T. empidokyrea, is the only Tetrahymena species in which the continued creation of LRR genes stopped-but only after group II LRR genes had evolved (S18 Fig). Significantly, Ka/Ks values indicate that most 90-bp exon LRR genes have undergone purifying selection, indicative of the continued evolutionarily importance of their function. Thus, the independent retention and continued evolution of 90-bp exon LRR genes in all the Tetrahymena species suggest a response to similar types of ever-present evolutionary challenges.
In plants, LRR genes are used for bacterial pathogen recognition-a type of innate immunity ( [75] and S36B Fig). Such genes include LRR receptor kinase genes (e.g., Xa21 [76], signal peptide + LRR domain + coiled-coil + PK domain), LRR genes (e.g. Xa21D [77], with only LRR domain), and nucleotide-binding site (NBS)-LRR genes (P-loop NTPase domain + LRR domain) [78]. In all 10 Tetrahymena species, domain compositions of species-specific genes show that not only LRR but also PKs and P-loop NTPase domain genes have been extensively expanded (S36A Fig). These expansions suggest that at least some Tetrahymena LRR genes could be involved in bacteria recognition-like processes. Indeed, the extensive diversity of Tetrahymena LRR genes hints at providing responsiveness to unpredictable challenges of a biological nature and is reminiscent of other well-known combinatorial immunity mechanisms, such as bacterial clustered regularly interspaced short palindromic repeats (CRISPR) systems to respond to viral infection [79] and vertebrate and plant systems of adaptive immunity [75,76].
In addition, Tetrahymena cells experience special environmental challenges because of their free-living, opportunistic grazer lifestyle. They live in fresh water and have diverse food sources-especially bacteria, the oral apparatus of which is specially adapted to ingest-whose relative availability and diversity can be quickly altered by environmental changes. Furthermore, the freshwater environment is one that is subject to more frequent and pronounced variation in physical and chemical conditions than the marine environment in which Tetrahymena's ancestors originated. Therefore, it seems reasonable to propose that Tetrahymena has evolved many different genes to specifically adjust to frequent environmental change, including the sensing of different bacterial foods and their digestion products [80][81][82]. Most Tetrahymena LRR genes do not include transmembrane domains, and many LRR proteins may be restricted to the cytoplasmic cell domain. If so, some could also function as a defense mechanism to neutralize noxious digestion byproducts unpredictably generated by opportunistic feeding on different types of bacteria. Additional experimental studies will be needed to test these hypotheses.

Materials and methods
Additional method details are given in S1 Text.

Cell strains, growth conditions, and morphological analysis
T. thermophila MAC and MIC genomes were previously sequenced and are available at the TGD Wiki (http://ciliate.org). Another 9 Tetrahymena species (T. malaccensis, T. elliotti, T. pyriformis, T. vorax, T. borealis, T. canadensis, T. empidokyrea, T. shanghaiensis, and T. paravorax) were newly sequenced in this study. These species were obtained from either the American Type Culture Collection (ATCC) or the Tetrahymena Stock Center (S6 Table). Cells were grown in SPP medium and harvested at a density of 150,000 to 250,000 cells/ml. The morphology of Tetrahymena cells was analyzed using the silver carbonate impregnation method [83].

Genome sequencing, genome assembly, and gene annotation
MAC genomes of the 9 Tetrahymena species were sequenced using the Illumina platform. Three of them (T. malaccensis, T. elliotti, and T. borealis) were sequenced by the Broad Institute in collaboration with the J. Craig Venter Institute, and the rest were sequenced by Institute of Hydrobiology, Chinese Academy of Sciences. In general, paired-end (insert size 180-220 bp) and mate-pair (insert size 2-3 Kb) libraries were constructed and sequenced using standard Illumina protocols. The germline MIC genome of T. malaccensis was sequenced using PacBio Sequel platform. For RNA-Seq, Poly-A mRNAs were isolated and sequenced using standard Illumina protocols.
MAC genomes of each species were assembled using SOAPdenovo [84] or ALLPATHS-LG [85]. The assembly with the longest N50 length was selected, and scaffolds with length <1 Kb were excluded. The T. malaccensis MIC genome was assembled using SMRTLINK_V5 (HGAP4, https://www.pacb.com/support/software-downloads/). Repeat consensus sequences were de novo identified and annotated using RepeatModeler (http://www.repeatmasker.org/ RepeatModeler/) with default settings, and the output consensus sequence library of Repeat-Modeler was used to mask the genomes using RepeatMasker (-div 20).
Gene models were predicted using ab initio-based and homology-based methods (S37 Fig). RNA-Seq data for logarithmic growth stage in all 9 newly sequenced species were generated to aid gene prediction. Finally, a set of gene models was created using Evidence Modeler [86] by merging all predicted gene models and RNA-Seq transcripts, followed by manual correction.

Ortholog group identification
Ortholog groups (clusters) were annotated using OrthoMCL [23], which provides the best overall balance of sensitivity and specificity for multiple species ortholog clustering [87]. The important parameter inflation index of OrthoMCL was set at 1.5 to balance sensitivity and selectivity, as used in OrthoMCL-DB construction [88]. OrthoMCL groups genes in different species into clusters (ortholog groups). These ortholog groups were classified into 10 categories according to their presence in different species (Category I-X, see S2 Fig). Genes were classified as singletons if they could not be assigned to any ortholog group by OrthoMCL. Note that the number of genes in an ortholog group may vary among species because of the presence of inparalogs and co-orthologs; therefore, statistical analyses were performed to determine whether inparalogs had undergone significant expansion (such inparalogs can also be species specific, also see S1 Text). We identified ortholog groups that show significant inparalog expansion in one species because they are usually species-specific genes (S38 Fig). Species-specific genes include singletons, category I, and members of significantly expanded ortholog group genes.

Phylogenomic analysis and divergence time estimation
A total of 198 one-to-one ortholog groups were identified in all 10 Tetrahymena species and other ciliates (I. multifiliis, P. persalinus, P. tetraurelia, O. trifallax, and S. mytilus). A maximum-likelihood tree was constructed using RAxML [89]. The divergence time for Tetrahymena species was estimated using r8s [90] with the divergence time for Ichthyophthirius and Tetrahymena as the approximate calibration time point. This divergence time is about 447 Mya [91] based on a phylogenomics study in ciliates incorporating a ciliate fossil record-based estimate reported by Weitschat and colleagues [92].

Ka/Ks analysis
Pairwise alignments of protein sequences were performed for gene pairs in each OrthoMCL ortholog group using MUSCLE [93], and pairwise nucleotide sequence alignments were generated by transforming protein alignments into codon alignments in RevTrans [94]. Ka/Ks ratios were calculated based on pairwise codon alignments using PAML (runmode = −2) [95].

Gene expression analysis
For gene expression analysis, RNA-Seq data, generated from cells during the vegetative growth stage, were mapped to assembled genomes using TopHat [96]. Gene expression level was estimated using uniquely mapped reads based on predicted gene models and expressed as fragments per kilobase of exon per million reads mapped (FPKM) values.

Sequence logos and phylogenetic tree of consensus sequences of 90-bp exons
The vast majority of the 90-bp exons are in the same phase. The sequences of 90-bp exons are naturally aligned, because all of them are exactly 90 bp in length. Because there is extreme phase 2 intron bias in consecutive 90-bp exons of LRR genes, a 90-bp exon can be translated into 29 amino acids after excluding the first base pair at the 5 0 end and the last 2 base pairs at the 3 0 end. Nucleotide sequence logos were generated using the sequences of 90-bp exons plus, as anchors, the GT-AG intronic splicing signals. Amino acid sequence logos were generated based on the 29 amino acids translated from 90-bp exons. LRR genes containing 90-bp exons were sorted into 2 groups, group II and group III (see Results section). For group III LRR genes, we further divided them into different groups based on the masking of CRSs identified by RepeatModeler. Each LRR gene in group III was exclusively assigned into a CRS group based on the highest masking score. Then, we compared the sequence logo difference between either group II and III or different CRS groups. For each group (group II and group III, or CRS groups), sequences of all 90-bp exons in all LRR genes were extracted, which are naturally aligned because they are identical in length. These alignments were used to generate sequence logos using WebLogo. Sequence logo profile comparisons between T. paravorax and the other 9 species were performed by the Two Sample Logo tool (http://www.twosamplelogo.org) [97] (binomial test, p-value cutoff: 1 × 10 −5 ).
To investigate the relationships of CRS groups (note that CRS groups only exist in group III LRR genes) in different species, the sequences of all 90-bp exons in all LRR genes of each CRS group were extracted. Because there are more than 45,000 exons of 90 bp, it is hard to interpret such a large tree. Therefore, we generated a 90-bp consensus sequence for all the 90-bp exons in each CRS group and then constructed a phylogenetic tree based on the consensus sequences (90 bp in length) of all CRS groups.
Supporting information S1 Fig. Tetrahymena species show high genome diversity. Heat map and matrix showing the percentage of homologous regions among all 10 genomes. Pairwise genome alignment was performed using promer in MUMmer 3 software (http://mummer.sourceforge.net/) with default settings, and only one-to-one alignments were used to identify homologous regions. Note that the difference between the percentages above and below the diagonal for the same species pair is due to the different genome sizes of the 2 species used as denominators when calculating the percentages. "pCen" indicates the pericentromeric region; "mid-arm" indicates near the middle of chromosome arms; and "NA" indicates not available because the gene is located in a still unassembled region. Note that (1) This is the largest group of nearly identical 90-bp exons and (2) TTHERM_001443819 and TTHERM_00001659049 both have 2 exons that belong to this group. The above phylogenetic network was constructed using SplitsTree4 (https://ab.inf.unituebingen.de/software/splitstree4) based on the Ucorrected_P-NeighborNet-EqualAngle pipeline with default settings and 1,000 bootstraps. The phylogenetic network defines each nonconstant column in the sequence alignment as a so-called split and generates a phylogenetic graph of the split network by integrating all compatible and incompatible splits, and thus the phylogenetic network gives all the possible relationships of 90-bp exons compared to a simple phylogenetic tree. LRR, leucine-rich repeat; MIC, micronucleus. Examples illustrating in detail the physical relationship between 3 Tt.REP copies highlighted in (a) and tLRR-MIC-CRS-masked sequences in the MIC genome. Note that Tt.REP copy at supercontig location 2.734:15527..17709 is not included in panel b because is at the end of a small, incomplete scaffold, whose flanking sequence, likely repetitive, remains unassembled. CRS, consensus repeat sequence; LRR, leucine-rich repeat; non-LTR, Non-long terminal repeat; MIC, micronucleus.

S36 Fig. Similar domain compositions of plant innate immunity related LRR genes and
Tetrahymena species-specific genes. (a) Protein domain architecture for Tetrahymena species-specific genes, presented as a network. Each node represents a specific domain type, and lines represent links between 2 domains within a gene. The width of each line indicates the number, which is also written on the line. LRR, TPR, WD40, PK, CNBD, GFR, and P-loop NTPase domains are shown in different colors, and lines connecting the same color of node represent genes containing only this domain (e.g., an LRR-LRR connection indicates genes that only contain the LRR domain). Coiled-coil (Coil), transmembrane helix (Transmembrane), and signal peptide (Signal_peptide) structures are also included to illustrate that other domains are often associated with the 7 most frequent domains in species-specific genes. (b) Domain architectures of 2 innate immunity LRR genes previously reported in plants. Xa21 (UniProt ID: Q1MX30) is a receptor kinase-like protein in Oryza sativa subsp. Japonica. NBS-LRR (represented as H9DWE0) is a class of proteins containing both P-loop NTPase and LRR domains in O. sativa subsp. Indica. CNBD, cyclic nucleotide-binding domain; GFR, growth factor receptor; LRR, leucine-rich repeat; P-loop NTPase, P-loop-containing nucleoside triphosphate hydrolase; PK, protein kinase; TPR, tetratricopeptide repeat. (TIF) S37 Fig. Gene prediction pipeline. Both ab initio-and homology-based methods were used for gene prediction. Assembled RNA-Seq data were used to generate training gene sets for ab initio gene predictions and were also incorporated as cDNA evidence. EvidenceModeler was used to generate a set of gene models combining evidence from all gene prediction programs. within an OrthoMCL ortholog group (related to S1 Text). OrthoMCL first identified reciprocal best blast hits between species ("true" orthologs: A1, B1, and C1) and then assigned, to the same ortholog group, genes that gave reciprocal better within-species blast hits to any of those genes (rest of the genes shown in the figure). Although OrthoMCL clusters are usually referred to as ortholog groups, pairs of genes within a cluster may be true orthologs (red arrows), inparalogs (if they are in the same species, blue arrow), or co-orthologs (if they are in different species, dotted black arrows). Orthologs are genes that are conserved between species. Inparalogs are members of a gene expansion within a certain species or lineage; they can be species-specific or very recently duplicated lineage-specific genes. (TIF) S1  Table. Domain accessions used for LRR and PK gene identification (related to S1 Text). LRR, leucine-rich repeat; PK, protein kinase. (DOCX) S1 Text. Supplementary methods. (DOCX) S1 Data. Ortholog groups generated by OrthoMCL. (ZIP) S2 Data. Individual numerical values that underlie the summary data displayed in the following figure panels: Fig 4D, Fig 8C, S3 Fig, S6B Fig, S7A Fig, S7B Fig, S10 Fig, S11A Fig,  S11B Fig, S11C Fig, S18 Fig left and right panels, S28 Fig, S29 Fig, S33