De novo identification of satellite DNAs in the sequenced genomes of Drosophila virilis and D. americana using the RepeatExplorer and TAREAN pipelines

Satellite DNAs (satDNAs) are among the most abundant repetitive DNAs found in eukaryotic genomes, where they participate in a variety of biological roles, from being components of important chromosome structures to gene regulation. Experimental methodologies used before the genomic era were insufficient, or too laborious and time-consuming, to recover the complete collection of satDNAs from a genome. Today, the availability of whole sequenced genomes, combined with the development of specific bioinformatic tools, is expected to foster the identification of virtually the entire “satellitome” of a particular species. While whole genome assemblies are important to obtain a global view of genome organization, most of them are incomplete and lack repetitive regions. We applied short-read sequencing and similarity clustering to perform a de novo identification of the most abundant satellite families in two Drosophila species from the virilis group, Drosophila virilis and D. americana, using the Tandem Repeat Analyzer (TAREAN) and RepeatExplorer pipelines. These species were chosen because they have been used as models to understand satDNA biology since the early 1970s. We combined the computational approach with data from the literature and chromosome mapping to obtain an overview of the major tandem repeat sequences of these species. The fact that all of the abundant tandem repeats (TRs) we detected had previously been identified in the literature allowed us to evaluate the efficiency of TAREAN in correctly identifying true satDNAs. Our results indicate that raw sequencing reads can be efficiently used to detect satDNAs, but that abundant tandem repeats present in dispersed arrays or associated with transposable elements are frequent false positives. We demonstrate that TAREAN, together with its parent method RepeatExplorer, may be used as a resource to detect tandem repeats associated with transposable elements and also to reveal families of dispersed tandem repeats.

although short arrays may additionally be present in the euchromatin [5,6]. The collection of satDNAs in the genome, also known as the "satellitome", usually represents a significant fraction (>30%) of several animal and plant genomes. Other classes of noncoding tandem repeats include the microsatellites, with repeat units less than 10 bp long, array sizes around 100 bp and a scattered distribution throughout the genome; and the minisatellites, with repeats between 10 and 100 bp long, forming up to kb-size arrays, located at several euchromatic regions, with a high density at terminal chromosome regions [3,4]. Therefore, the best criteria to distinguish satellites from micro- and minisatellites are long array sizes and preferential accumulation at heterochromatin for the former.

SatDNAs do not encode proteins, but they may play important functional roles in the chromosomes, most notably related to chromatin modulation and the establishment of centromeres [7-9]. They are among the fastest evolving components of the genome (although some conserved satellites have also been reported) [10-12], and such behavior, combined with their abundance and structural role, has major implications for the evolution and diversification of genomes and species [8,13].

Since the discovery of satDNAs in the early 1960s, species from the genus Drosophila have been used as a model to address several aspects of satDNA biology, such as their origin, organization, variation, evolution and function [e.g.

Among them, the RepeatExplorer software [19] has been successfully used for de novo identification of repetitive DNAs directly from unassembled short sequence reads, and the recently implemented TAREAN pipeline [20] was introduced to specifically identify putative satDNAs. Such a combination of sequenced genomes and bioinformatic tools is now expected to foster the identification of the full "satellitome" of any given species [e.g. 21-25].

repeats less than 10 bp long but at low abundance [18].
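The criteria given above for telling satellites apart from micro- and minisatellites can be summarised as a small decision rule. The Python sketch below is only an illustration: the function name, the 10 kb cut-off for a "long" array and the boolean heterochromatin flag are assumptions introduced here, since the text gives unit-length ranges but no numerical threshold for array length.

```python
def classify_tandem_repeat(unit_bp, array_bp, heterochromatic,
                           long_array_bp=10_000):
    """Toy classifier for a tandem repeat array.

    unit_bp         -- length of the repeat unit in bp
    array_bp        -- total length of the array in bp
    heterochromatic -- True if the array lies in heterochromatin
    long_array_bp   -- ASSUMED cut-off for a "long" array (not from the text)
    """
    if unit_bp <= 0 or array_bp < unit_bp:
        raise ValueError("array must contain at least one repeat unit")

    # Long arrays accumulated in heterochromatin are treated as satellites,
    # whatever the unit length (short-unit satellites exist as well).
    if array_bp >= long_array_bp and heterochromatic:
        return "satellite"
    # Otherwise fall back on the unit-length ranges quoted in the text.
    if unit_bp < 10:
        return "microsatellite"
    if unit_bp <= 100:
        return "minisatellite"
    return "unclassified tandem repeat"
```

Checking array length and chromatin context before unit length reflects the point made above that unit size alone does not separate the three classes.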

The high throughput and low cost of current whole-genome sequencing technologies have made it possible to obtain genome assemblies for a wide range of organisms. However, de novo whole-genome shotgun strategies are still largely unable to fully recover highly repetitive regions such as centromeres and pericentromeric regions and, as a result, satDNAs are usually misrepresented or absent from such assemblies [37]. One way of circumventing the assembly bottleneck is to directly identify repeats from raw sequencing reads. One such approach is implemented in the RepeatExplorer pipeline, already used in a wide range of plant and animal species [21,38,39].
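The general idea of finding repeats directly in raw reads, without an assembly, can be illustrated with a toy example. The sketch below is not RepeatExplorer's implementation: it assumes that two reads are "similar" when they share a minimum number of k-mers, and it groups reads by connected components instead of the sequence comparison and graph clustering used by the real pipeline. The function names and the parameters `k` and `min_shared` are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=8, min_shared=3):
    """Group reads that share at least `min_shared` distinct k-mers.

    Returns lists of read indices, largest cluster first.
    """
    # Index: k-mer -> set of reads containing it.
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(rid)

    # Count shared k-mers for every pair of reads that co-occur in the index.
    shared = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1

    # Union-find: merge reads linked by enough shared k-mers.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), n in shared.items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for rid in range(len(reads)):
        clusters[find(rid)].append(rid)
    return sorted(clusters.values(), key=len, reverse=True)
```

In such a scheme, reads sampled from an abundant tandem repeat end up in one large cluster because any two of them overlap in phase somewhere along the repeat unit, whereas reads from single-copy sequence remain isolated.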

RepeatExplorer performs similarity-based clustering of raw short sequencing reads and partial consensus assembly, allowing for repeat identification even

In the present study, we aim to test the ability of TAREAN to correctly identify

autosomes except the small dot chromosomes, and in the X and Y chromosomes (Fig 1A). However, the hybridization in polytene chromosomes

Our results from the TAREAN analysis classified 154TR as a putative satellite with low confidence in both species (

[34] similar to the one we observed for 172TR (Fig 2A and 2B)

few kbp) to each other (Fig 3).

in two loci in chromosome 2, including the subtelomeric region (Fig 1A and 2A).

Most of its arrays are located at distal chromosome regions. No hybridization signals were detected in the dot and Y chromosomes.

The FISH results in D. americana showed 172TR signals at multiple loci along all autosomes, except the dot, and more evenly distributed between the distal and proximal regions of chromosome arms (Fig 1B and 2B). As in D. virilis, no hybridization signal was detected in the Y chromosome (Fig 1B). The FISH data (Fig 1, 2A

confirmed these results in D. virilis (Fig 2C), additionally showing that in D. americana this family displays the same pattern of localization (Fig 2D).

In addition, we performed FISH with a 225TR probe in metaphase

repeats (data not shown). However, most of the arrays are dispersed and contain up to 10 tandemly repeated copies of the 225TR sequence, which agrees with its classification as a putative satellite with low confidence.