Genome-Wide Analysis of Repeat Diversity across the Family Musaceae

Background The banana family (Musaceae) includes genetically a diverse group of species and their diploid and polyploid hybrids that are widely cultivated in the tropics. In spite of their socio-economic importance, the knowledge of Musaceae genomes is basically limited to draft genome assemblies of two species, Musa acuminata and M. balbisiana. Here we aimed to complement this information by analyzing repetitive genome fractions of six species selected to represent various phylogenetic groups within the family. Results Low-pass sequencing of M. acuminata, M. ornata, M. textilis, M. beccarii, M. balbisiana, and Ensete gilletii genomes was performed using a 454/Roche platform. Sequence reads were subjected to analysis of their overall intra- and inter-specific similarities and, all major repeat families were quantified using graph-based clustering. Maximus/SIRE and Angela lineages of Ty1/copia long terminal repeat (LTR) retrotransposons and the chromovirus lineage of Ty3/gypsy elements were found to make up most of highly repetitive DNA in all species (14–34.5% of the genome). However, there were quantitative differences and sequence variations detected for classified repeat families as well as for the bulk of total repetitive DNA. These differences were most pronounced between species from different taxonomic sections of the Musaceae family, whereas pairs of closely related species (M. acuminata/M. ornata and M. beccarii/M. textilis) shared similar populations of repetitive elements. Conclusions This study provided the first insight into the composition and sequence variation of repetitive parts of Musaceae genomes. It allowed identification of repetitive sequences specific for a single species or a group of species that can be utilized as molecular markers in breeding programs and generated computational resources that will be instrumental in repeat masking and annotation in future genome assembly projects.


Introduction
Bananas are giant perennial herbs belonging to the genus Musa, which are grown in tropical and subtropical regions. Edible sweet and starchy banana cultivars provide a staple food for many millions of people and are a major export commodity. Based on a set of morphological descriptors and basic chromosome number, the genus Musa has been traditionally subdivided into four sections: Eumusa (x = 11), Rhodochlamys (x = 11), Australimusa (x = 10), and Callimusa (x = 9 or 10) [1]. However, this classification has been often questioned. The recent use of a variety of molecular markers provided detailed information on Musa genetic diversity and phylogenesis [2][3][4][5][6][7][8]. Finally, in 2013, the Australimusa and Callimusa sections were merged into the section Callimusa, and sections Eumusa and Rhodochlamys were merged into the section Musa [9].
Most of the edible bananas are vegetatively propagated diploid and polyploid forms of M. acuminata (A genome, 2n = 2x = 22) and hybrids that originated from crosses between M. acuminata and M. balbisiana (B genome, 2n = 2x = 22) [10] belonging to the section Musa. Although there is some inconsistency in the classification of cultivated banana clones, it has been estimated that there are at least 1000 different cultivars grown worldwide [11]. Another group of edible cultivars, called Fei, represents a separate line of banana evolution and comprises a different species complex classified in the section Callimusa. The section is represented by a number of wild Musa species, including M. maclayi, M. peekelii, and M. lolodensis, the most probable progenitors of edible Fei bananas [12][13][14][15] and M. textilis (T genome). Fei bananas are parthenocarpic and vegetatively propagated like other edible banana clones. However, they were domesticated independently from the cultivars of the section Musa.
The production of bananas has been seriously threatened by the increasing range of fungal, viral, and insect diseases. At the same time, breeding of improved cultivars is hampered by seed sterility, unknown origin of the cultivated clones, and poor knowledge on genetic diversity of the genus Musa. The nuclear genome of Musa species is relatively small (1C,600 Mbp; [16,17]), and previous studies showed that ,55% of the genome is represented by repetitive DNA, especially different types of retroelements [18][19][20]. Genomic repeats evolve more rapidly than coding sequences, and plant geneticists and breeders found them a unique source of molecular markers to map important genes, analyze genetic diversity, and study processes of speciation and genome evolution [21][22][23].
Global characterization of complex populations of plant genomic repeats recently has been made feasible by combining next-generation sequencing technologies with newly developed bioinformatic tools [24,25]. This approach also led to the first characterization of major repeat types in the genome of 'Calcutta', a clone of M. acuminata ssp. burmannica, a popular male parent in a number of breeding programs [19]. Additional information about repeat composition and organization in this species (ssp. malaccensis) was obtained after producing a draft sequence of a double haploid individual derived from wild accession 'Pahang' [20]. Finally, initial data about repeats in another species, M. balbisiana (variety 'Pisang Klutuk Wulung') recently have been released along with its draft genome [26]. Apart from M. acuminata and M. balbisiana from the section Musa, similar information is missing for the section Callimusa and other representatives of the Musaceae family.
In this study, we employed bioinformatic analysis of low-pass genome sequencing data to get deep insight into repeat composition of Musaceae genomes. We selected five Musa species and one representative of the genus Ensete for comparative analysis of repetitive fractions of their genomes in order to (1) identify and quantify major groups of repetitive sequences, (2) assess sequence diversity of repeats between the species and investigate its correlation with the phylogeny of the Musaceae family, and (3) generate bioinformatic resources for development of repeat-based genome-specific markers and for repeat identification and annotation for future genome assembly projects.

Results
Low-pass genome sequencing, estimation of repeat proportions, and similarities between the species Five representatives of the genus Musa and one Ensete species were selected for analysis to cover various phylogenetic clades of the Musaceae family ( Fig. 1). They possess relatively small genomes with only moderate differences between the species, ranging from 567 to 763 Mbp/1C [16,17,27]. Whole-genome shotgun sequencing was performed using Roche/454 technology, and resulting reads were trimmed to the same length of 200 nucleotides. The same amount of reads (380,599) from each species was used for analysis, providing 0.10-0.136 genome coverage (Table 1). For this coverage, the probability of detecting repetitive sequences with 10 and 100 copies per haploid genome was 63-74% and .99.9%, respectively [28].
Sequence reads derived from genomic repeats were identified and quantified based on the number of similarity hits generated in all-to-all read comparisons. In principle, low sequencing coverage and similarity threshold used in this analysis (90% identity over 55% of the sequence length) provide a small chance of detecting hits between single-copy genomic sequences. Thus, most of the similarities are expected between the reads representing repetitive sequences and, their frequencies are proportional to copy number of corresponding repetitive elements in the genome. In the first part of the analysis, similarity hits were investigated separately for reads from each species compared to themselves, thus providing information about repeat proportions in individual genomes. There were similar amounts (55-60%) of reads generating at least one similarity hit in all species. However, there were differences in proportions of moderately (.100 copies/1C) and especially of high-copy (.1000 copies/1C) repeats that were most abundant in M. beccarii, M. acuminata, and M. ornata ( Fig. 2A). Higher proportions of high-copy repeats in these species were also evident from differences in total numbers of similarity hits (Fig. 2B).
To evaluate overall differences in sequence composition and abundance between pairs of species, inter-specific comparisons of read similarities were performed and visualized in a form of scatter plots, where dots represented reads and their positions were determined by numbers of similarity hits in both species (Fig. 3). This analysis revealed remarkable similarity of repeat composition of M. acuminata and M. ornata genomes, resulting in a diagonal pattern of the dot positions, which was due to similar abundance of corresponding sequences in both genomes (Fig. 3O). High similarity was also revealed between repeats from M. beccarii and M. textilis, except for much higher genomic proportion of 45S rDNA repeats in M. beccarii (Fig. 3I). Quantification of rDNA proportions revealed that its differential amplification accounts for most of the genome size difference between these species. The detected quantitative differences in 45S rDNA content are also in agreement with previously published FISH experiments which have shown higher number of 45S rDNA sites in M. beccarii and E. gilletii [17]. On the other hand, repeat composition of E. gilletii was the most divergent from other species (Fig. 3A-E). The diagram in  Classification and comparative analysis of major groups of repeats To classify repetitive sequences and identify their orthologous groups present in individual genomes, the identified intra-and inter-specific similarities of sequence reads were analyzed using the RepeatExplorer pipeline [25]. The pipeline runs graph-based clustering algorithm [24] to identify groups of frequently overlapping reads representing families of repetitive elements followed by similarity-and structure-based repeat identification tools that aid in repeat classification. Classification of repeats in the 106 largest clusters exceeding in size 0.03% of the analyzed reads revealed that Ty1/copia, Ty3/gypsy, and rDNA repeats make up the majority of highly and moderately repeated sequences in the Musaceae genomes ( Table 2, Fig. S1). Ty1/copia LTR-retrotransposons, mainly represented by Maximus and Angela lineages, were 2-4-fold more abundant than Ty3/gypsy. However, proportions of most repeat groups varied between the species even for the most abundant elements like Ty1/copia Maximus, which was much less abundant in E. gilletii and showed almost 1.5-fold variation in abundance between Musa species. Depending on the genome size and repeat content, the annotated repeats corresponded to 24% (E. gilletii) up to 44% (M. beccarii) of the genome. The rest of the repeats including mainly low-copy sequences forming small repeat clusters made up 26-43% (Table 2). In summary, repetitive sequences occupy about 66-71% in all genomes.
An inherent feature of the clustering analysis when applied to sequence data from multiple species is that orthologous repeat families from different species are grouped to the same clusters. This facilitates identification and quantification of repeats that are shared between the species as well as detection of species-specific sequences. The analysis revealed that a large part of the repeat clusters representing various families of LTR-retrotransposons, DNA transposons, long interspersed nuclear elements (LINE), and rDNA was shared by all Musa species and, to a smaller extent, also with E. gilletii (Fig. 4A, group 4). This group of clusters was also the most significant in terms of genome proportions, containing about 32% of analyzed reads (89% of annotated repeats). Smaller groups 3, 5, and 7 included clusters that were shared by two or three species. They comprised mostly non-coding parts of LTRretrotransposons (e.g., LTR sequences), which evolve more rapidly than their gag-pol regions. Groups of species sharing these sequences included M. beccarii/M. textilis (group 3), M. acuminata/ M. ornata (group 5), and M. acuminata/M. ornata/M. balbisiana (group 7), which is in agreement with overall read similarities between the species presented in Fig. 3. The largest number of species-specific clusters, including complete retrotransposon sequences, was detected for E. gilletii, which was in agreement with the phylogenetic divergence of this species.
To verify some of the differences in repeat composition revealed by the clustering analysis, three putative section or species-specific repeats were detected in a set of Musaceae species using Southern blot hybridization ( Fig. 4C-E). Experimental results were in all cases in agreement with the output of bioinformatic analysis. The probe derived from CL16, classified as a Reina lineage of Ty3/ gypsy elements specific for the genome of E. gilletii, produced strong hybridization signals in species of the Ensete genus, with weak or no labeling of genomic DNAs of Musa species (Fig. 4C). An M. beccariispecific tandem repeat found in the cluster CL51 also showed predominantly species-specific hybridization pattern with only minor signals in related species from Callimusa sections (Fig. 4D). The probe derived from tandem-like repeat CL30 present in sequence reads from M. acuminata, M. ornata, and M. balbisiana was confirmed to be specific for section Musa, which includes these three species (Fig. 4E). Table 1. Sequenced species. In addition to identifying species-and section-specific repeats based on their presence in different clusters, it was possible to reveal more subtle sequence variations even for repeats grouped into the same cluster. An example of this variability is presented in Fig. 5A-B, showing a graph structure of the cluster CL18 representing a family of Ty3/gypsy elements belonging to the chromovirus CRM clade. In higher plants, transposition of this group of LTR-retrotransposons is targeted to centromeres, and this localization has also been reported for M. acuminata CRM elements [29]. The cluster graph is composed of nodes representing individual sequence reads and edges connecting reads with similarities exceeding the specified threshold [24]. Since the node The barplot shows the total number of reads with detected similarity hits which is proportional to size of repetitive fraction of the genome. As the number of similarity hits to each read is also proportional to its copy number, reads derived from repetitive elements can be divided into low, medium and high copy number fractions. (B) The total number of similarity hits that correspond to number of read pairs with similarity. doi:10.1371/journal.pone.0098918.g002 distances are inversely proportional to read similarities, the graph layout reflects sequence variability of the element copies in the genome, and in the case of reads from different species, reveals its inter-specific variability. This variability (Fig. 5B) as well as phylogenetic analysis of reverse transcriptase (RT)-coding sequences extracted from the reads (Fig. 5C) were in agreement with phylogenetic reltionships of the species (Fig. 1). In addition, the RT-based phylogenetic tree provided better discrimination of sequences from closely related species and contained some speciesor section-specific branches with shorter edges, implying recent amplification of CRM elements in M. acuminata, M. ornata, and M. balbisiana.

Preparation of repeat databases and their use for repeat annotation in assembled genomes
Sequence databases specific for various types or families of repetitive elements were prepared by merging reads from clusters with the same annotations. These databases can be used for similarity-based repeat identification in assembled sequences, as implemented, for example, in our Profrep server (http://w3lamc. umbr.cas.cz/profrep/public/) [30]. Compared to approaches using databases of representative elements or consensus sequences, collections of reads gathered from clustering analysis provide better sensitivity for detecting divergent repetitive elements because they better represent their sequence variability. Examples of repeat annotation of randomly selected tracks of M. acuminata genome assembly [20] based on similarity hits to repeat-specific collections of sequence reads are provided in Fig. 6, S2, and S3. Identified repetitive regions (Fig. 6B) were mostly in agreement with the assembly annotation (Fig. 6C). However, it was possible to assign specific repeat types to many regions listed as unclassified repeats in the assembly annotation (e.g. most Ty1/copia Maximus elements in Fig. 6 or Ty1/copia Angela in Fig. S2). An additional benefit of our approach is the possibility to visualize abundance of corresponding repeats in other Musaceae species. For example, Angela elements present in the analyzed region of chromosome 9 are well conserved in all Musaceae species including E. gilletii, whereas other repeats show larger variations in their abundance   (Fig. 6A). On the other hand, our approach failed to detect some of the small regions annotated as DNA transposos or unclassified repeats in the M. acuminata assembly (Fig. 6B-C), probably due to their very small genomic abundance, which resulted in no hits to our sequence read databases.

Discussion
Identification of repetitive elements using graph-based clustering of sequence reads is one of the novel bioinformatic tools specifically designed to utilize the power of next generation sequencing technologies [24,25]. This approach proved to be efficient in global repeat characterization in complex plant [24,31,32] and animal [33] genomes and in investigation of repeat composition of individual chromosomes [30,34] or their compartments [35,36]. This study employs repeat clustering methodology for comparative analysis of multiple genomes, including species representing two genera of the Musaceae family. It extends the previous survey of M. acuminata repeats [19] by analyzing more sequence data and investigating representatives of the genus covering its two taxonomic sections Musa and Callimusa [9] and three banana genomes (A, B, and T). It also complements information about repeat composition gathered from the current genome assemblies of M. acuminata and M. balbisiana [20,26].
Although highly and moderately repeated sequences can reach up to 80-85% in plants with larger genomes [37,38], their proportions in the investigated Musaceae species were smaller (66-71%). Such repeat proportions are slightly above repeat content estimated in species with relatively small genomes, including Oryza (25-66%), Vitis vinifera (41.4%), Sorghum bicolor (61%), Malus6domestica (67%), and Nelumbo nucifera (50%) [39][40][41][42][43]. The observed dominance of LTR-retrotransposons in the fraction of highly repeated sequences is a common feature of higher plant genomes where retroelements represent one of the major forces driving genome size evolution [38,44,45]. Although individual retroelement families varied in genome proportions in the investigated Musaceae species, there was no significant correlation of their abundance with genome size variation. Thus, genome size differences in the studied species cannot be attributed to simple amplification of particular repetitive element as seen in some other plant genomes [38,39]. Rather, genome size was affected by joint activities of more transposable element (TE) lineages.
The observed sequence variation and quantitative differences among individual TE lineages correspond well to previous results on phylogeny of the Musaceae family. The highest similarity of repeats within the groups of M. beccarii/M. textilis and M. acuminata/M. ornata or M. acuminata/M. ornata/M. balbisiana are in agreement with previous results that showed close phylogenetic relationships of these species [6,8,46] and support the new taxonomy of Musaceae [9]. In general, sequence and quantitative differences were proportional to diversification of species of the Musaceae family (compare Fig. 1, 4, and 5). This is in agreement with previous comparative studies of repeats in genomes of the Orobanchaceae family [47] and genus Oryza [39]. Overall, the present and our previous studies show that repeat analysis can support the analysis of evolutionary relationships.
Clustering-based repeat analysis employed in this study provides a useful alternative to repeat quantification from genome assemblies. Comparison of our shotgun data with the published assembly of M. acuminata DH-Pahang [20] confirmed that repetitive sequences are under-represented in the assembly anchored into 11 Musa chromosomes, while repeats are overrepresented in remaining 30% sequences of un-anchored assembly (Fig. S4). However, in general, we observed good agreement between our annotation of M. acuminata clusters and curated TEs in the DH-Pahang assembly. Approximately 90-95% of elements annotated in the genome assembly comply with the annotation based on our clustering results. Additionally, estimates of total abundance of LTR TEs in DH-Pahang and our estimates for M. acuminata are very close in the cases of Ty1/copia and Ty3/gypsy ( Table 2). The same Ty1/copia and Ty3/gypsy lineages were identified in genome assembly with exceptions of Ty1/copia Ivana, which was newly pinpointed in this study. Graph-based clustering allowed us to assign all identified LTR retrotransposons into lineages, while in genomic assembly about 46% and 24% of Ty1/ copia and Ty3/gypsy, respectively, were not further classified. This is one of the benefits of graph-based clustering, as the sequence clusters contain both sequences derived from complete autonomous TEs and continuum of incomplete and mutated TEs copies, which could be difficult to detect when only similarity search against database of known repeats is used. On the other hand, the amount of DNA transposons and LINEs estimated in the present work is much lower compared with the estimates based on the genome assembly. To explain this difference, we have compared the annotation of the DH-Pahang assembly with all our M.  acuminata sequences. The similarity search revealed that the large fraction of sequences annotated as DNA transposons and LINEs in the DH-Pahang assembly provide similarity hits to unclustered sequences or to small unannotated clusters in our data (data not shown). Consequently, it seems that we have missed these sequences in our annotation. Some differences in repeat abundance estimates can be also attributed to incompleteness of assembly and biased composition of sequences in the genome assembly.

Conclusions
The present study provides a detailed insight into the composition and diversity of repeats in genomes representing the family Musaceae. Next generation sequencing with genome coverage greater than 10% enabled annotation and quantification of repeats that form 30-45% of the Musaceae genomes. The remaining part of the genome consists of unidentified repeats (,30%) and low-and single-copy sequences (26-45%). We show that there is a prevalence of Ty1/copia elements in all Musa genomes, with a majority of Ty1/copia elements being Maximus/ SIRE and Angela. Most of the elements in Ty3/gypsy family belong to the lineage of chromoviruses. The present study revealed significant divergence in repeat composition between the species of Musa, and the extent of repeat divergence was related to the estimated divergence dates of the species in the Musaceae family. We also demonstrated that database of repeats derived from graph based clustering is well suited for annotation of genome assemblies and can complement other repeat annotation methods.

Plant material, DNA isolation and sequencing
In vitro rooted plants of most of the Musa and Ensete species used in this study were obtained from the International Transit Centre (ITC, Katholieke Universiteit Leuven, Belgium). The clone 'Pisang Klutuk Wulung' of M. balbisiana was obtained from CIRAD (Guadeloupe) as rooted plants. Plants were transferred to soil and maintained in a greenhouse.
Roche/454 shotgun sequencing libraries were prepared by the GS Titanium library preparation kit (454 Life Sciences, a Roche company, Branford, USA). The single-stranded libraries were quantified by a qPCR assay and processed utilizing the GS Titanium SV/LV emPCR and XLR70 sequencing kits according to the manufacturer's instructions (Roche Diagnostics). Sequencing was performed on a half 70675 picotiter plate for each Musa cultivar [8]. Sequence reads were divided into clusters using a graph-based method according to [24] with the difference in that the reads were trimmed to 200 nucleotides and reads from all species were clustered together. Computational tools used for clustering step are available at the public server (www.repeatexplorer.org) [25].

Repeat cluster annotation and repeat identification
Several resources were used to manually annotate clusters. Reads from clusters were scanned for similarity to a database of plant repetitive elements with RepeatMasker [50] using databases that were improved by adding specific sequences derived from the banana genome based on our previous work [19]. Blastx and blastn [51] were used for similarity search against public databases and also against our database of protein domains derived from plant mobile genetic elements. Clusters represented as graphs were also analyzed using SeqGrapheR program (http://cran.rproject. org/web/packages/SeqGrapheR/index.html).

Musaceae phylogenetic tree construction
Internal transcribed spacers (ITS) data obtained by [19] was used to construct a BioNJ tree based on the Jukes-Cantor model in the SeaView v4.2.3 program [52]. Phylogenetic trees were drawn and edited using the FigTree (http://tree.bio.ed.ac.uk/software/ figtree/) program.

Construction of phylogenetic tree for RT domains
Reads with similarity to the Ty3/gypsy reverse-transcriptase domain were trimmed and aligned using MAFFT software [53], and the maximum-likelihood phylogenetic tree was estimated using the FastTree program [54]. The resulting alignment and tree are provided in Files S1 and S2.

Southern blots
Genomic DNA of 15 selected Musaceae representatives was prepared from nuclei isolated from healthy young leaf tissue. Aliquots of genomic DNA samples corresponding to16610 6 of nuclear genomes were digested using DraI, EcoRV, RsaI or MspI restriction enzymes, size-fractionated by 1.2% agarose gel electrophoresis, and transferred onto Hybond N+ nylon membranes (Amersham Biosciences, Bath, UK). Biotin-labeled oligomers (file S3) were used as probes. The Southern hybridization was done at 68uC overnight followed by stringent washes (stringency 90%). Signals were detected using the BrightStar BioDetect kit according to the manufacturer's instructions (Ambion, Austin, USA), incubated with chemiluminescent substrate (CDP-Star, Amersham Biosciences), and exposed to X-ray film.

Resources
Sequences were deposited in the Sequence Read Archive under accession ERX047938-ERX047944.

(FAS)
File S2 Phylogenetic tree estimated by FastTree program from sequences derived from the Ty3/gypsy CRM element from cluster CL18.

(TREE)
File S3 Sequences of oligonucleotide probes used in Southern blot hybridization. (TXT)