The Evolution of Tyrosine-Recombinase Elements in Nematoda

Transposable elements can be categorised into DNA and RNA elements based on their mechanism of transposition. Tyrosine recombinase elements (YREs) are relatively rare and poorly understood, despite sharing characteristics with both DNA and RNA elements. Previously, the Nematoda have been reported to have a substantially different diversity of YREs compared to other animal phyla: the Dirs1-like YRE retrotransposon was encountered in most animal phyla but not in Nematoda, and a unique Pat1-like YRE retrotransposon has only been recorded from Nematoda. We explored the diversity of YREs in Nematoda by sampling broadly across the phylum and including 34 genomes representing the three classes within Nematoda. We developed a method to isolate and classify YREs based on both feature organization and phylogenetic relationships in an open and reproducible workflow. We also ensured that our phylogenetic approach to YRE classification identified truncated and degenerate elements, informatively increasing the number of elements sampled. We identified Dirs1-like elements (thought to be absent from Nematoda) in the nematode classes Enoplia and Dorylaimia indicating that nematode model species do not adequately represent the diversity of transposable elements in the phylum. Nematode Pat1-like elements were found to be a derived form of another Pat1-like element that is present more widely in animals. Several sequence features used widely for the classification of YREs were found to be homoplasious, highlighting the need for a phylogenetically-based classification scheme. Nematode model species do not represent the diversity of transposable elements in the phylum.


Introduction Transposable elements
Transposable elements (TE) are mobile genetic elements capable of propagating within a genome and potentially transferring horizontally between organisms (Nakayashiki 2011). They typically constitute significant proportions of bilaterian genomes, comprising 45% of the human genome (Lander et al. 2001), 22% of the Drosophila melanogaster genome (Kapitonov and Jurka 2003) and 12% of the Caenorhabditis elegans genome (Bessereau 2006). TEs may also have important evolutionary effects, such as promoting alternative splicing (Sorek et al. 2002), inducing variation accumulation under stress (Badyaev 2005) and increasing the genetic load (Wessler 2006).
TEs can be broadly divided into DNA and RNA classes. DNA TEs (transposons) transfer as dsDNA, leaving a vacant locus at the point of origin, together with a target site duplication (TSD) (Wessler 2006). They are thought to increase in copy number via various recombination related mechanisms between vacant and populated TE loci, partly due to the similarity of TSDs across the genome (Wessler, 2006). RNA TEs (retrotransposons and retroposons) do not exit their locus of origin but rather propagate through the reverse transcription of an RNA intermediate copy back into an additional site in the genome (Finnegan 2012). RNA elements are usually the most numerous type of TE and can have tens of thousands, or even millions, of copies in a single genome (Finnegan 2012). Despite this there can be variation in the relative proportions and some species have DNA elements as the most frequent class, as the case is in C. elegans (Caenorhabditis elegans Sequencing Consortium 1998).

Tyrosine recombinase TEs
Tyrosine Recombinase Elements (YREs)  YREs are diverse in sequence and structure, but this diversity is not equally represented across the animal phyla (Poulter and Goodwin 2005;Kojima and Jurka 2011;Piednoël et al. 2011), and their evolutionary history can sometimes be puzzling. Nematoda for example are known to have one unique form of YREs (a form of Pat1 found only in this phylum) and to entirely lack another (Dirs1), which is otherwise relatively common (Piednoël et al. 2011). However, the diversity of YREs in Nematoda is still poorly understood and a phylogenetically informed analysis with broad taxonomic sampling of both the YREs and their hosts is required to thoroughly address the subject.

YRE classification
DNA YREs possess only a YR protein domain and include the Crypton and TEC elements. Although Cryptons were first discovered in fungi (Goodwin et al. 2003), four distinct, possibly polyphyletic, lineages have been defined in fungi, diatoms and animals (Kojima and Jurka 2011). It is thought that Cryptons may have contributed to the origin of RNA YREs (Kojima and Jurka 2011). TEC elements, by contrast, appear to have a very limited taxonomic distribution and are currently known only from ciliates (Doak et al. 1994;Jacobs et al. 2003).
RNA YREs, like other long terminal repeat (LTR) retrotransposons, possess the capsid protein Gag, and a polyprotein that includes the reverse transcriptase (RT) and RNase H (RH) domains. LTR retrotransposons (Gypsy, Copia and Bell) may have been the source of the ancestral RNA element of the YRE ancestor (Kojima and Jurka 2011). Unlike the LTR retrotransposons, YRE retrotransposons possess the YR domain and lack the integrase gene (Poulter and Goodwin 2005;Wicker et al. 2007). They sometimes also encode a methyltransferase (MT) domain.

Structure based classification of YREs
A set of molecular sequence features are widely used to classify YRE retrotransposons: the presence and strand of the RT, MT and YR domains, the presence of a zincfinger (ZF) motif in the Gag protein, and the presence and relative arrangement of characteristic repeat sequences (Cappello et al. 1984;Cappello et al. 1984;Goodwin and Poulter 2001;Goodwin et al. 2004;Piednoel and Bonnivard 2009;Piednoël et al. 2011;Muszewska et al. 2013; fig. 1). Three groups of YRE retrotransposons have been defined: DIRS, Ngaro and Viper (Goodwin and Poulter 2004;Lorenzi et al. 2006 1).

Fig. 1: The diversity of tyrosine recombinase elements (YREs) and their diagnostic features for taxonomic classification
The known taxonomic distribution of each element (A -H) is listed along with a cartoon of its structure. Metazoa are in bold font and Ecdysozoa are underlined. The features considered are the presence and absence of the reverse transcriptase (RT), methyltransferase (MT) and tyrosine recombinase (YR) domains and their direction (grey triangles), as well as the presence, absence and position of split direct repeats (pairs of triangles, sharing a colour and pointing in the same direction), inverted repeats (pairs of triangles, sharing a colour and pointing in opposite directions) and zinc-finger motifs (zf) from the Gag protein.

Nematoda-form
In spite of their exceptional diversity, YREs are quite rare. Dirs1 from Dictyostelium discoideum (Amoebozoa; Cappello et al. 1984) is present in 40 intact copies and 200 -300 fragments. Crypton ( fig. 1A) is present in a few dozen copies in each of a range of eukaryote species (Kojima and Jurka 2011). TEs with such small population size, however, will be subject to strong genetic drift and variation in copy number, and thus will be prone to elimination (Collins et al. 1987).

New Approaches
To identify and quantify YREs in nematodes, we utilized homology based search methods to locate YREs, made a preliminary classification based on characteristic features, and used phylogenetic methods to refine and corroborate these classifications. We conducted further phylogenetic analyses to classify partial or degenerate elements relative to complete elements. This stage allowed us to include partial and potentially degraded elements in the copy number counts and have a better understanding of the origins of the distribution of YREs among the nematode species. Unlike similarity based clustering methods (e.g., Piednoël et al. 2011;Muszewska et al. 2013;Guérillot et al. 2014;Iyer et al. 2014), a phylogenetic approach accounts for homoplasy and is better adapted for the analysis of potentially

Results
We identified putative homologues of three YRE protein domains, YR, RT and MT, in genome assemblies of 34 nematode and 12 outgroup species. Over 2,500 significant matches to YREs were found in 24 species (table 1) To assess whether the genome assemblies used were of sufficient quality to permit YRE discovery, we also searched for RT domains from three LTR elements, Gypsy, Copia and BELL, reasoning that if we were unable to detect any of the abundant LTR class elements it was likely that the assembly was too poor. The N50 contig lengths of the assemblies (table 1) did not correlate with the number of YRE matches (linear R 2 = 2*10 -3 , power R 2 = 8*10 -3 ). A greater number of matches were found in outgroup taxa with larger genomes than Nematoda. No species had zero matches in all four searches (YRE plus three LTR searches). Litomosoides sigmodontis had the lowest number of matches, including only three to BEL LTR retrotransposons, while Oscheius tipulae had 10 or less matches in any searches.

Bursaphelenchus xylophilus, Caenorhabditis angaria and
Caenorhabditis sp. 11 had a maximum of 27 matches in any of the searches. For the remaining species, at least 40 matches were found in at least one of the searches. Given these findings, we are confident that cases where no YREs were found usually indicate a real absence, or extreme scarcity, of YREs in those species. 11264 4102 12165 YR matches are shown, of which, the number of YREs that were classified based on their features and their phylogenetic position is indicated. In addition, the counts of RT hits from Bel, Copia and Gypsy LTR elements are indicated for each species. Matches were found in all the species in at least one of the PSITBLASTN searches. The number of matches found in each species seems to be detached from the mean contig length or contig length at N50 in the species' genome assembly.  (table 1). The absence Pat1 elements from P. redivivus was also unexpected, since this species is known to possess several Pat1-like elements (de Chastonay et al., 1992) and a reciprocal blast approach was taken to confirm this finding. The P. redivivus genome assembly was queried using BLAST with the first Pat1 sequence which was originally described in P. redivivus (Genbank accession X60774). Twelve significant matches were found. For confirmation, these fragments were then used as queries to search the online NCBI BLAST database with default settings, detecting the original Pat1 sequence (X60774) as a single hit. Since the matches were Pat1 fragments that did not contain the YR ORF, they had not been recovered by our pipeline, and this lack of complete Pat1 elements was likely due to incomplete assembly.

Non DIRS YREs
In the species surveyed we identified only a single Crypton element, in Nematostella, and this element has already been recorded in Repbase (locus Crypton-1_NV). An additional Crypton match in Nasonia was closely related to a previously identified element from oomycetes (locus CryptonF-6_PI in Repbase) and is a likely contamination. Using more lenient parameters, permitting larger clades with lower sh-like support to be included, increased the count of Crypton-like elements.
However, this resulted in clades with simultaneous conflicting classifications. In addition, we identified three major lineages of Ngaro elements, including 182 instances that clustered with LTR elements. These lineages included the Ngaro reference sequences. An additional minor lineage, from Caenorhabditis briggsae, clustered closely with Pat1-like elements from the same species and showed minimal sequence divergence from them ( fig. 2A). We suggest that this Ngaro lineage was a derived species-specific form of Pat1 element that has lost its MT domain. Unlike Crypton elements, Ngaro elements were found in most of the animal phyla examined ( fig. 3).
Ngaro were abundant in the cnidarian Nematostella vectensis (114 instances) and in the mollusc L. gigantea (53 instances). However, within Ecdysozoa, Ngaro counts were lower and ranged between 2 in the nematode E. brevis and 14 in H. dujardini.

Fig. 3: The distribution of YREs among Nematoda and outgroup species
The phylogenetic tree of Nematoda is based on De Ley and Blaxter (2002) and Kiontke et al. (2013). Element types are colour coded. The phylogenetically classified YRE matches in each species are indicated. Pie-charts represent the proportion of each element type with their radii proportional to the number of phylogenetically classified YRE matches.

The evolution of YRE features
Based on the RT phylogeny, one of the possible most parsimonious scenarios for feature evolution is annotated in fig. 2. Under this hypothesis, the loss of the MT domain, the inversion of the YR domain, the formation of split direct repeats and of inverted repeats, and the loss of a zinc-finger motif have each occurred more than once, independently, and both split direct repeats and inverted repeats must have been formed through multiple sequential inversions. Any other possible scenario would require that several YRE features have evolved in parallel. In addition, any possible scenario would be inconsistent with single step character changes between element types: in figure S3 we hypothesized a scenario in which the different YRE retrotransposons were created only by single character changes in preexisting element types. This scenario is not supported by the phylogenetic analysis.

Taxonomic representation in the study of TE
The distribution of transposable elements has been hypothesized to depend on a number of factors; with mating system, ploidy, zygosity, ecology and gene flow all potentially influencing the TE load and diversity in an organism, in addition to the constraints of its phylogenetic history (Charlesworth and Charlesworth 1995;Wright and Finnegan 2001;Dolgin and Charlesworth 2006;Kawakami et al. 2010;Carr et al. 2012;Eads et al. 2012). Even within species, strains and populations can differ markedly in TE abundance (Collins et al. 1987). Therefore, when studying the distribution of TEs, it is unlikely that one would identify a single or a few species that would accurately represent a whole phylum, especially a phylum as species rich and diverse as Nematoda. Piednoël et al. (2011) surveyed Dirs1-like YREs in a wide range of eukaryotes in order to understand the distribution of this element. Although 274 genome assemblies were analysed, only two nematode genomes were available to them, and these were from two closely related rhabditid superorders, Rhabditomorpha (C. elegans) and Diplogasteromorpha (P. pacificus). Neither species contained Dirs1-like sequences, leading to the conclusion that these elements were absent from nematodes as a whole. In this study, however, thanks to the wider taxonomic representation that is now available, we have identified Dirs1-like sequences in at least two out of the three nematode subclasses.
In addition, since many of the assemblies we screened were drafts and thus highly fragmented representations of the original genomes (the shortest average contig length was 411 bp in H. aoronymphium), we employed a search strategy that did not require the presence of complete YRE sequences, which may be as long as 6,000 bp (Piednoël et al. 2011). This approach,

YRE content in Nematoda has undergone a shift
Based on our findings, Nematoda has undergone a substantial change in the composition and numbers of YREs (see fig. 3). The YRE content of the enoplid and dorylaimid species examined was more similar to that of outgroup taxa in Dirs1 proportions than the YRE content of the rhabditid species. Indeed, Dirs1-like elements, relatively abundant in some outgroups, were found in E. brevis and R. culicivorax but were sparse in Rhabditida. The only potential Dirs1-like element found in Rhabditida was probably misclassified or a result of contamination, and Dirs1-like elements may be absent from Rhabditida altogether. In addition, the echinoderm-form Pat1-like element is found in R. culicivorax but not in other nematodes. It will be very informative to sample species from additional chromadorid superorders to identify the mode and tempo of this loss.

The evolution of PAT elements
The known distribution of the Pat1 group of elements in Metazoa has been puzzling. Here, through the phylogenetic classification of truncated elements, we identified the PAT elements in Mollusca and Cnidaria as Pat1-like, suggesting that these elements, though rare in general, are found in all three branches of Bilateria, and in non-bilaterian Metazoa. Surprisingly, the Pat1-like element that was found in the nematode R. culicivorax has the echinoderm-form structural arrangement rather than the nematode-form of Pat1. In addition, Pat1-like elements from Nematoda and from Echinodermata form sister clades in the RT tree ( fig. 2A). Thus, the nematode-form Pat1-like element is not an isolated element with an unknown origin, but rather a taxon specific clade of a more widespread Pat1 element family, and we suggest that there exists a greater diversity of these elements yet to be discovered by completed genome projects.

Homoplasy in YRE structural features and the need for phylogenetics
YREs have been suggested to have emerged from a composite ancestor combining an LTR element with a Crypton, as both Cryptons and LTRs are considered to be more ancient than YREs based on their distribution (Jurka et al. 2007). It is not clear, however, whether a single or several independent events of recombination are at the base of YRE retroelements. Our results support at least two origins for YRE retroposons: at least one for Ngaro elements and another for DIRS elements. As a consequence, split direct repeats must have evolved more than once, independently, resulting in homoplasious similarity. This result is in accordance with the phylogenetic tree presented in Goodwin and Poulter (2004). While Goodwin and Poulter (2004) found that PAT and Dirs1-like elements form a single clade each, we observed a paraphyletic, or possibly polyphyletic Dirs1 group. Since this was observed in both the RT and YR trees ( fig. 2), this result could either mean that PAT elements evolved from Dirs1 or that a Dirs1-like element evolved twice independently. It is worth noting that the formation of inverted repeats from split direct repeats is a complex process that would require some intermediate forms. However, these forms are not observed, possibly due their inviability.
Another homoplasious similarity between polyphyletic element lineages was observed in Ngaro and a derived lineage of Pat1-like elements in C.
briggsae, both lacking a MT domain. In addition, a derived PAT element in D. pulex had homoplasious similarity to Kangaroo from V. carteri, both having an inverted YR domain. Also, we infer that the loss of a zinc-finger motif from the Gag protein must have occurred independently multiple times. Taking these observations together, homoplasy in element features is a strong theme in the evolution of YREs. This strongly suggests that it is impractical to use structural characteristics as the sole descriptors for element classification, and incorporating an explicitly phylogenetic basis for classification would produce more biologically meaningful inferences.

Conclusions
In this study we utilised a large number of nematode genome assemblies to characterize the YRE content in Nematoda. We showed that the YRE content across the phylum is much more diverse than suggested by the analysis of a few model species. It was important to include truncated elements to fill the gaps in the extant diversity of both Dirs1-like and Pat1-like elements, both of which are more widely distributed than originally perceived. Our results strongly support a previous call (Seberg and Petersen, 2009) to classify transposable elements based on phylogenetic relationships rather than the features they contain or lack, thus conforming to a systematic approach to classification.

Material and Methods
The analyses presented here have been made In addition to genome assemblies, we also analysed the Repbase Crypton and DIRS datasets (Jurka et al. 2005), the Retrobase DIRS dataset (http://biocadmin.otago.ac.nz/fmi/xsl/retrobase/home.xsl) , four Pat1-like elements from P. pacificus kindly provided by M. Piednoël, and the first Pat1 sequence to have been described (Genbank accession X60774). These sources pooled together formed our reference dataset. We examined the validity of element classifications produced by the pipeline using these known elements and also for seeding query alignments.

Homology search based YRE identification
In order to find YREs in the assemblies we used a strategy modified from Piednoël et al. (2011). First, we searched for YR domains in each whole genome assembly. YR matches were extended by 10 kbp in each direction or to the contig end, whichever was encountered first. We then searched for RT and MT domains and direct and inverted repeats in the resulting sequences. This approach efficiently streamlined the homology searches while including only RT and MT domains that are likely to belong to YREs. The homology searches were conducted using PSITBLASTN (Altschul et al. 1997;Camacho et al. 2009) with an expected value threshold of 0.01. The query models for these searches were seeded with the alignments from Piednoël et al. (2011) and were extended by adding protein sequences from the reference dataset through PSIBLASTP search (Altschul et al. 1997;Camacho et al. 2009).
Direct and inverted repeats on the extended YR fragments were detected with the BLAST based program UGENE (Okonechnikov et al. 2012), with only identical repeats at least 20 bp long allowed. These values represent the minimal repeat sequence in the results of Piednoël et al. (2011). Each annotated fragment was subsequently programmatically given a preliminary classification based on its similarity to the structures illustrated in fig. 1.

Zinc-finger motifs pattern matching
Among PAT elements ( fig. 1), only Pat1 elements have zinc-finger motifs in their Gag sequence (Poulter and Goodwin 2005). Gag sequences from two Pat1 elements were used to query the reference databases to produce a Gag sequence model using PSIBLASTP (Altschul et al. 1997;Camacho et al. 2009). The sequences that were eventually used to produce the model represented all the DIRS elements' diversity.
PSITBLASTN (Altschul et al. 1997;Camacho et al. 2009) was used to recover Gag sequences from the YRE DNA sequences found in the previous stage, with an expected value threshold of 0.01. The Gag sequences detected were searched for the zinc-finger sequence patterns described by Poulter and Goodwin (2005) using a python script (see Supplementary Methods). The classification process was continued later on, using a phylogenetic approach, to account for partial and degraded elements as well as for complete ones.

Phylogenetic approach to YRE classification and quantification
We chose a phylogenetic approach to element classification over genetic-distance clustering methods to better account for homoplasy in our sequence data. Similar methods to the ones above were used to reconstruct two additional phylogenetic trees for the purpose of classification and quantification. The first tree was reconstructed from a dataset including only YR sequences from complete RNA YREs as well as Crypton YR sequences. This tree was used to delineate element clades. Only clades with sh-like support of 0.7 or above were considered, if they did not have conflicting YRE features based classifications. YR domain hits from reference elements helped to confirm the identity of the element clades.
The second tree included all the YR domain hits from both complete and truncated or degraded elements as well as YR sequences from Crypton elements. This tree was used to identify the phylogenetic position of degraded and truncated elements relatively to complete elements and adjust their count accordingly, for each of the clades recovered in the previous tree. Only truncated or degraded elements that clustered with complete elements with sh-like support of 0.9 or above were considered. However, we have detached nodes with long branches from clades that included complete elements and had sh-like value < 0.95, to avoid artifactual groupings. The branch-length cutoff that was used for node removal due to a long branch was four times the median branch-length of that clade.

Assessment of the reliabilty of YRE counts
Given that the originating genome does in fact contain YRE elements, draft genome assemblies could be missing YRE elements for two reasons: The first is that by being incomplete they may stochastically miss some elements. The second reason arises from the assembly algorithms used, where highly similar elements may yield assembly graphs that the algorithm rejects as being too complex, or of too high coverage, to include in the reported assembly contigs. Since YREs often have a low copy number (Cappello et al. 1984;Kojima and Jurka 2011) the second artefact is less likely, but a record of absence may simply reflect assembly quality. However, LTR retrotransposons are not likely to be absent from eukaryotic genomes and an inability to detect LTR elements would suggest that the assembly is simply not of sufficient quality. Therefore, in each of the species studied, we performed three additional PSITBLASTN (Altschul et al. 1997;Camacho et al. 2009) searches for RT domains of Gypsy, Copia and BEL LTR retrotransposons. The query alignments were constructed in the same manner as described above and are available in the github repository. This phylogeny was reconstructed using only YR sequences from elements with defined borders (also available as feagure 2B), with a midpoint root (white background). Clades from the full YR tree (in grey) are presented next to reduced tree clades with which they share leaves. Large font black leaves are shared between the full and reduced YR trees. Large font green leaves are additional reference sequences. Small font leaves from the full tree (in grey) were added to the leaf count of the corresponding reduced tree clade. Only full tree clades with sh-like support > 0.9 were considered. Full tree clades that included long branches were removed if they had sh-like support < 0.95. The branch-length cutoff was four times the median branch-length of the clade. Leaf names include the species code (as in tables 1 and S1), a unique number and the feature based classification. The unique number is the start position of the YR domain on its contig. table.out files in the pipeline results folder (http://dx.doi.org/10.6084/m9.figshare.1004150) provide access to the complete element information using the species code and the unique number. The unique number provides access to the element's diagram in the same folder. A flow chart depicting all the possible single step transitions between YRE retrotransposon types, using Ngaro as the ancestral form. Dirs1-like elements cannot be created from other element types in a single step. This scenario is not supported by the phylogenetic analysis (Fig. 2).