Superior ab initio identification, annotation and characterisation of TEs and segmental duplications from genome assemblies

Transposable Elements (TEs) are mobile DNA sequences that make up significant fractions of amniote genomes. However, they are difficult to detect and annotate ab initio because of their variable features, lengths and clade-specific variants. We have addressed this problem by refining and developing a Comprehensive ab initio Repeat Pipeline (CARP) to identify and cluster TEs and other repetitive sequences in genome assemblies. The pipeline begins with a pairwise alignment using krishna, a custom aligner. Single linkage clustering is then carried out to produce families of repetitive elements. Consensus sequences are then filtered for protein coding genes and then annotated using Repbase and a custom library of retrovirus and reverse transcriptase sequences. This process yields three types of family: fully annotated, partially annotated and unannotated. Fully annotated families reflect recently diverged/young known TEs present in Repbase. The remaining two types of families contain a mixture of novel TEs and segmental duplications. These can be resolved by aligning these consensus sequences back to the genome to assess copy number vs. length distribution. Our pipeline has three significant advantages compared to other methods for ab initio repeat identification: 1) we generate not only consensus sequences, but keep the genomic intervals for the original aligned sequences, allowing straightforward analysis of evolutionary dynamics, 2) consensus sequences represent low-divergence, recently/currently active TE families, 3) segmental duplications are annotated as a useful by-product. We have compared our ab initio repeat annotations for 7 genome assemblies to other methods and demonstrate that CARP compares favourably with RepeatModeler, the most widely used repeat annotation package.


12
Thousands of genomes have been sequenced thanks to decreased cost and increased speed of DNA se-13 quencing methods. The explosion of genome sequences has expanded our knowledge of repetitive DNA, 14 which is an important component of the genomes of almost all eukaryotes. Repetitive DNA is made up 15 of sequences that have been duplicated. Some repetitive elements are able to replicate to new genomic 16 locations and are referred to as transposable elements (TEs). TEs are known to account for a significant 17 proportion of genome sequences in eukaryotes, varying from a few percent to the majority of the genome. 18 For example, around 50% of the human [1] and 85% of the maize genome are TEs [2]. Therefore, it is 19 important to have an efficient and accurate ab initio method of identifying and annotating repeats in 20 newly sequenced genomes.
remove the enormous number of alignments produced by TEs that would overwhelm the SD output. This 31 means that repeat identification and annotation is currently required before SDs can be identified. 32 Transposable elements are the most prevalent repetitive sequences in eukaryotic genomes, and fall into 33 two major classes: those moving via direct cut and paste of their DNA sequences (DNA transposons) and 34 those moving/replicating via a copy and paste mechanism with an RNA intermediate (retrotransposons). 35 DNA transposons encode a transposase gene that is flanked by two Terminal Inverted Repeats (TIRs) [5]. 36 The transposase recognizes these TIRs to excise the transposon DNA, which is then inserted into a new 37 genomic location by cut and paste mobilization [6]. 38 39 Retrotransposons can be subdivided into two groups: those with long terminal repeats (LTRs), and 40 those without LTRs (non-LTR). Endogenous retroviruses (ERVs) are domesticated remnants of retroviral 41 infection and full-length ERVs encode an array of proteins (gag, pol, and env ) flanked by LTRs [7]. The env 42 protein allows ERVs to transfer to other organisms by infection [8] and thus ERVs can be acquired from the 43 environment. LTR retrotransposons are the dominant retrotransposons in plants and are less abundant in 44 mammals [9]. Similar to ERVs, LTR retrotransposons contain two long-terminal repeats that flank a 5-7kb 45 long internal protein-coding domain [10] containing two open reading frames (ORFs): gag and pol. The gag 46 ORF encodes the structural protein that makes up a virus-like particle (VLP) [11]. The pol ORF encodes 47 an enzyme needed for replication that contains protease (PR), integrase (IN), reverse transcriptase (RT), 48 and RNase H (RH) domains required for reverse transcription and integration. Promoter and transcription 49 termination signals are present in the LTRs that are divided into three functional areas: U3, R and U5. 50 U3 contains the enhancer and promoter sequences that drive viral transcription [11]. However, due to 51 the lack of env protein, LTR retrotransposons are not infectious; they are obligate intracellular elements [12]. 52 53 Non-LTR retrotransposons include two sub-types: autonomous long interspersed elements (LINEs), and 54 non-autonomous short interspersed elements (SINEs), that are dependent on LINEs for their replication [3]. 55 Typical insertions of non-LTR retrotransposons are flanked by target site duplications, which result from 56 micro-homology based repair during the insertion process [13]. LINEs contribute significantly to eukaryotic genomes. Full-length LINEs are around 6kb long and 59 usually contain two ORFs flanked by 5 and 3 untranslated regions (UTRs). LINE 5 UTRs possess 60 an internal RNA polymerase II promoter, which allows them to be transcribed [1]. ORF1 can vary 61 significantly from species to species, and can encode proteins with different characteristics [14]. ORF2 62 is similar across all LINEs and encodes a protein with endonuclease and reverse-transcriptase activities 63 Lu et.al 3/21 required for replication [14].
SINEs are much shorter; usually less than 500 base pairs. The 5 region contains an internal RNA 66 polymerase III promoter and the 3 end contains an oligo dA-rich tail. Alu elements have no ORFs, 67 therefore they have no coding capacity and are non-autonomous TEs. Because they share functional 68 sequences at their 3 with LINEs, they borrow the retrotransposition molecular machinery encoded by 69 LINEs that bind to their 3 end [1]. org/RepeatModeler/), REPET [18], Red [19] and PILER [20]. RepeatModeler (RMD) is a de novo 86 package that has been widely used for repeat identification and modeling that combines different programs: 87 RepeatMasker, RepeatScout [21], RECON [22] and TRF (Tandem Repeat Finder) [23]. GROUPER [24]) and a knowledge/library based annotation pipeline [25]. REPET produces a very 113 comprehensive output of repeat annotations, but excludes segmental duplications, is complex, requires 114 genome annotation of gene models and is computationally expensive.

116
In order to address these limitations, we have created a comprehensive ab initio repeat pipeline 117 (CARP) for identifying species-specific TE elements with high sensitivity and accuracy that deals with 118 both TEs and segmental duplications. Our method also provides a full audit trail that links identified 119 repeat sequences (and their genome intervals) to their families and consensus sequences. This permits 120 direct evolutionary analysis of highly similar TE families.

123
For a diagrammatic overview of our method for de novo discovery and annotation of repetitive elements 124 from genome sequences see ( Figure 1). that are typically separated in the genome, i.e. that are rarely or never found in tandem, and are usually 139 mobile elements such as retrotransposons [20]. Genome sequences were pairwise aligned using krishna 140 Identifiable repeat consensus sequences were annotated by using CENSOR [29]  whereas no SINE consensus sequences were produced by RepeatModeler in any of the species we tested 226 (see Table 2). Based on this result, CARP was more sensitive for detecting SINEs compared to RMD. 227 CARP generated many more consensus sequences than RMD and this is a function of the single linkage 228 clustering used to identify families. Because many LINE insertions are 5' truncated, leading to vari-229 able insertion sizes with a common 3' end, the requirement for family members to be at least 95% as 230 long as the longest family member means that many clusters are created across the insertion size continuum. 231 232 CENSOR was used to annotate the repeat content in our data set of seven species because it uses minimal 234 post-alignment processing of hits (see Table 3). In order to get a comprehensive annotation of repeats, we 235 used a combination of the Repbase 'Vertebrate' library and repeat consensus sequences generated from 236 CARP or RMD. Because CENSOR annotates based on the best hit, combining our consensus sequences 237 with Repbase sequences allows annotation of genomic intervals most similar to either recent/less diverged 238 repeats or Repbase repeats. As seen from In Table 3 we have labeled the unclassified repeat contribution to the genomes as segmental duplications 246 based on their properties (see below).  formed length for all seven genomes (see Figure 2  In order to determine if the small number of unclassified CENSOR hits with copy numbers >2000 were 258 Lu et.al 12/21 novel or partially annotated TEs, we used coverage plots for the CARP unclassified consensus sequences to 259 look for high copy number subsequences with TE properties. Figure 3 shows the top 5 high copy number 260 CARP unclassified consensus sequences from bearded dragon as an example. BLASTN and CENSOR 261 annotations were also used to characterize these consensus sequences in terms of TE or gene model homol-262 ogy. From Figure 3 we can see that coverage plots for high copy number CARP unclassified CENSOR hits 263 were of two types: those incorporating high copy subsequences ( Figure 3A,C,E) and those with uniform 264 high coverage ( Figure 3B,D). Close examination of the high copy subsequences from Figure 3A Based on the above results, we conclude that the vast majority of unclassified consensus sequences 275 represent segmental duplications. We have therefore labeled these annotations accordingly in Table 3. In 276 our final annotation, significant fractions of the genomes from our seven test species were annotated as 277 SD, particularly in bearded dragon (24.68%), anolis (12.02%) and echidna (12.60%) (see Table 3).

279
Because the human genome has the best SD annotation of our seven species, we compared segmental 280 duplication coordinates downloaded from the human 'Segmental Duplication Database' to our CARP 281 unclassified CENSOR hits. Approximately 70% of human SD overlapped with CENSOR hits from CARP 282 unclassified consensus sequences, confirming our conclusion above. Only 30% of human SD overlapped 283 with RMD unclassified CENSOR hits. are the most abundant and active repeats in monotremes (see Supplementary Table S9). This is in contrast 291 to metatheria and eutheria (marsupials and placentals) where they are inactive due to extinction 60-100 292 Myr ago.

294
L2s were defined as potentially active if they contained an intact ORF2 (regardless of the state 295 of ORF1), as this meant that they were capable of either autonomous retrotransposition [38] and/or 296 mobilisation of SINEs [39]. CARP identified numerous long L2 elements (2∼4kb) in the echidna genome. 297 Lu et.al 14/21 More than 66% (110/166) of these were potentially active based on the above criteria ( Figure 4A) and 298 some clusters of potentially active elements at the tips of short branches, were consistent with "hot" or 299 hyperactive elements. This differed significantly from the RMD result, which generated only four long 300 consensus sequences ( Figure 4C).

302
It is worth noting that the Repbase annotation for L2s puts the full-length platypus L2 consensus 303 sequences at 5kb long. However, based on both the CARP and RMD identification outputs, L2 elements 304 in echidna and platypus were significantly shorter, at 3kb, with the longest one we could find (in the 305 platypus genome) 3,110bp in length.  duplication annotation to what has been reported for these seven species, we found that our method 340 detected more candidate segmental duplications in the anolis (4.9%) [40] (Table 3) and the opossum 341 (1.7%) ( Here we introduce a simple and flexible ab initio repeat identification and annotation method (CARP) 350 that annotates TEs and candidate segmental duplications. We applied CARP to seven animal genomes 351 and demonstrated that it performs as well or better than RepeatModeller, the most commonly used ab 352 initio TE annotation package.

353
Limitation: Our approach is limited by memory requirements and runtime. However, as hardware improves 354 and becomes less expensive, these limitations will become less of an issue. to identify and annotate TEs from a genome assembly, including the benchmarks used for the seven 368 species in this report. 369 S1  Table. Repeat content in seven target species. Here we show the proportion and copy 379 number of each repeat class in chicken, anole, bearded dragon, opossum, platypus, echidna and human. 380