Genome-Wide Comparative Analysis of 20 Miniature Inverted-Repeat Transposable Element Families in Brassica rapa and B. oleracea

Miniature inverted-repeat transposable elements (MITEs) are ubiquitous, non-autonomous class II transposable elements. Here, we conducted genome-wide comparative analysis of 20 MITE families in B. rapa, B. oleracea, and Arabidopsis thaliana. A total of 5894 and 6026 MITE members belonging to the 20 families were found in the whole genome pseudo-chromosome sequences of B. rapa and B. oleracea, respectively. Meanwhile, only four of the 20 families, comprising 573 members, were identified in the Arabidopsis genome, indicating that most of the families were activated in the Brassica genus after divergence from Arabidopsis. Copy numbers varied from 4 to 1459 for each MITE family, and there was up to 6-fold variation between B. rapa and B. oleracea. In particular, analysis of intact members showed that whereas eleven families were present in similar copy numbers in B. rapa and B. oleracea, nine families showed copy number variation ranging from 2- to 16-fold. Four of those families (BraSto-3, BraTo-3, 4, 5) were more abundant in B. rapa, and the other five (BraSto-1, BraSto-4, BraTo-1, 7 and BraHAT-1) were more abundant in B. oleracea. Overall, 54% and 51% of the MITEs resided in or within 2 kb of a gene in the B. rapa and B. oleracea genomes, respectively. Notably, 92 MITEs were found within the CDS of annotated genes, suggesting that MITEs might play roles in diversification of genes in the recently triplicated Brassica genome. MITE insertion polymorphism (MIP) analysis of 289 MITE members showed that 52% and 23% were polymorphic at the inter- and intra-species levels, respectively, indicating that there has been recent MITE activity in the Brassica genome. These recently activated MITE families with abundant MIP will provide useful resources for molecular breeding and identification of novel functional genes arising from MITE insertion.


Introduction
The Brassicaceae is one of the largest and most important plant families, with 338 genera and about 3709 species including many economically significant vegetable crops, oil seed plants, condiments and fodder crops [1,2]. The genomic relationship between the six interrelated cultivated Brassica species, including three diploids [Brassica rapa (2n = 2x = 20, AA genome, 529 Mb genome size), B. nigra (2n = 2x = 16, BB, 632 Mb), B. oleracea (2n = 2x = 18, CC, 696 Mb)] and three amphidiploid derivatives [B. juncea (2n = 4x = 36, AABB, 1068 Mb), B. napus (2n = 4x = 38, AACC, 1132 Mb), and B. carinata (2n = 4x = 34, BBCC, 1284 Mb)] has been summarized as the triangle of U [3,4]. Due to its wide distribution and the differences in polyploidy of its members, the Brassicaceae provides an excellent system in which to study polyploidization-mediated evolution of plants [5]. Comparative studies with the diploid model plant A. thaliana (2n = 2x = 10, 125 Mb) have confirmed approximately 16-fold genome size variation in the Brassicaceae [6,7]. Studies have also revealed that B. rapa and its close relative B. oleracea evolved as whole genome triplication derivatives after their split with A. thaliana 13-17 million years ago (MYA) [7]. Sub-or neo-functionalization is one of the important driving forces for maintenance of genome integrity in genomes subjected to whole-genome duplication [8][9][10]. Genome-wide comparative analysis has revealed that wholegenome duplication is the major factor responsible for the increase of genome size in Brassica. In addition to the polyploidization events, accumulation and amplification of transposable elements (TEs) contribute to increased genome size in Brassica species [11,12].
TEs are fundamental agents for genome size enlargement and evolution [13][14][15][16][17], and are classified as class I (retrotransposons) or class II (DNA transposons) mobile genetic elements based on their transposition mechanism. TEs that contain their own coding sequences for transposition are called autonomous TEs. Conversely, TEs with defective or no coding sequences are referred to as non-autonomous [13,16,18]. Miniature inverted-repeat transposable elements (MITEs), class II-type TEs, are relatively small in size (,800 bp) and share conserved structural characteristics, such as terminal inverted repeats (TIRs) and target site duplication (TSD). MITEs are A+T-rich (.50-65%) [19], and preferentially insert into inter-genic, near-genic or intronic regions but usually avoid exonic regions [20]. There are two major superfamilies of MITEs, namely Stowaway (with TA as the TSD) and Tourist (with TAA as the TSD), and several other minor families such as hAT (5, 6, or 8 bp TSDs), Mutator (9-10 bp TSDs), and En/Spm (3 bp TSDs) [21][22][23][24]. In plants, MITEs are present in tens of thousands of copies throughout the entire genome and influence genomic diversity and differentiation [14,25]. Indeed, MITEs can occupy a major fraction of plant genomes, up to 10% in rice, 8% in Medicago, 4% in B. rapa and 0.71% in A. thaliana [26].
MITEs are active and important players in gene and genome evolution [27,28]. Due to their close association with genes or specific genic regions such as introns, exons, untranslated regions (UTRs) and promoters, MITEs can alter or disturb gene structure, expression, and/or function [14,18,[29][30][31]. MITEs have been reported to be involved in the alteration of triplicated genes in B. rapa and up-or down-regulated gene expression [9,20]. Micro-RNAs are small non-coding RNAs (21-24 nt) that regulate specific target genes or transposable elements at the transcriptional and post-transcriptional level. Recent studies suggest that 20% of the known miRNAs in the human genome originated from TEs [32]. MicroRNAs derived from MITEs through stem-loop structures play an important role in silencing TEs [33][34][35][36]. Because of the well-defined MITE boundaries including the TIRs and TSD, de novo identification of MITEs has become possible using tools like MUST, RSBP, MAK, MITE-Hunter and MITE Digger [23,[37][38][39][40]. In addition, previously annotated information for MITE homologs has enabled homology-based MITE identification with tools like CENSOR at the Repbase database and RepeatMasker [23,37,38]. MITEs can be an excellent resource for the development of DNA markers for genomics and evolutionary studies because most are stably inherited and present in high copy numbers [20,[41][42][43][44].
TEs are one of the major factors contributing to genome size in the highly duplicated Brassica genome [11] and are thought to occupy 39.5% and 38.8% of the genome in B. rapa and B. oleracea, respectively [12,45], for which whole-genome sequence information is now publicly available [12,46]. Recent genome-wide characterization of MITEs using various in silico tools has revealed 174 MITE families in B. rapa including 90 hAT, 56 Tourist and 16 Stowaway, 11 Mutator and 1 CACTA families. A total of 45821 MITE members occupy .11 Mb (4.08%) of the B. rapa genome [26]. However, there have been only a few comparative genomics studies on transposable elements, especially MITEs, and their practical application in breeding and evolutionary studies of the Brassica genome [11,20,47,48]. Among these is our previous characterization of a high-copy Stowaway MITE family (BraMi-1), which highlighted the utility of MITEs as molecular markers and the importance of MITEs in the Brassica genome for genomic and breeding purposes [20]. Now, we have conducted genome-wide comparative analysis of 20 MITE families in B. rapa and B. oleracea to provide a basis for understanding MITE dynamics in the Brassica genome. Our analysis explores all members of 20 MITE families, including three previously unknown families, and their distribution in B. rapa and B. oleracea. Furthermore, the potential utility of MITEs as molecular markers for genomics is demonstrated.

Materials and Methods
Identification and characterization of MITEs in the B. rapa and B. oleracea genomes The whole genome pseudo-chromosome sequences with unanchored scaffolds for B. rapa (283 Mb) Version 1.2 and pseudochromosome sequences of B. oleracea (385 Mb) Version 1.0, with their gene annotation information, were obtained from the public databases BRAD [49] and BolBase [46], respectively. The MITE-Hunter program [38] was used for identification of the MITEs in the B. rapa and B. oleracea genomes with default parameters. The putative MITE sequences generated by MITE-Hunter were characterized using the BLAST 2 sequences tool from NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi) and the EMBOSS application 'einverted' (http://emboss.bioinformatics.nl/cgi-bin/ emboss/einverted) to identify the TIR and TSD structures. The expected hairpin structures were estimated using one representative intact member of the each MITE family using mfold [50]. The identified MITEs were searched against Repbase, RepeatMasker and P-MITE database [26,51,52] to find the homologous MITEs from plant species. MITE families were searched for MITEderived miRNA at miRbase version 19 using an E-value of ,e-10 against the Brassicaceae species [53].

MITE members in the B. rapa and B. oleracea genomes and their phylogeny
To retrieve the MITE members and study the distribution pattern of each MITE family, the consensus sequences for the MITE families were used as queries in BLASTN searches against the pseudo-chromosome sequences of B. rapa and B. oleracea. The members of each MITE family were extracted from each genome using an E-value of ,e-5. The duplicate hits from the same physical positions were removed by manual analysis in order to count exact copy numbers. MITE insertion positions of family members from B. rapa (Br-members) and B. oleracea (Bo-members) were characterized using gene annotation information by a custom Perl script. MITE members with $80% alignment length and $80% identity, hereafter designated as 80:80 coverage, were considered to be intact candidate member for each MITE family according to 80:80:80 rule. The third 80 of this rule denotes that the element has $80 bp sequence similarity to a TE family and can therefore be considered a member of that TE family; it is also possible for an element with ,80 bp sequence similarity to be considered a member of the TE family upon in-depth analysis [16]. Phylogenetic analysis was conducted using 20% of the intact MITE members for each family, because there were too many members in certain families. Based on a ClustalW sequence alignment, a phylogenetic tree was generated using the neighborjoining method with 500 replications in MEGA5 [54].

MITE insertion polymorphism analysis
The presence or absence of a MITE in a particular locus can produce polymorphism between accessions, which can be analyzed using MITE insertion polymorphism (MIP) analysis [44]. MIP was surveyed using 346 MITE-flanking primer pairs against eight accessions including two Brassica diploid species B. rapa (A genome), B. oleracea (C genome) and a corresponding tetraploid species B. napus (AC genome), as well as A. thaliana. The details for the plant materials, e.g., their ploidy and subgroup of the Brassica genome, were the same as in the previous MIP analysis [20]. Total genomic DNA was extracted from fresh leaves using the modified CTAB method [55]. Primer information and expected product size, generated by the Primer3 program [56], are listed in Table S1 in File S1. PCR mixtures (20 mL total) consisted of 10 ng DNA, 16 PCR buffer, 0.2 mM each primer, 2.5 mM dNTPs, and 1 unit Taq DNA polymerase (VIVAGEN, Korea). PCR was carried out as 5 min at 94uC, 35 cycles of 95uC for 30 sec, 56uC-62uC (dependent on the primers) for 30 sec, and 72uC for 1 min, with a final 5-min extension at 72uC, using a MG96G thermo cycler (LongGene Scientific Instruments, China). The PCR products were separated on 2% agarose gels, and the gels were stained with ethidium bromide and visualized on a UV trans-illuminator.

Identification of 20 MITE families in the B. rapa and B. oleracea genomes
Genome-wide analysis of MITEs using the MITE-Hunter program yielded 145 and 175 putative MITEs from B. rapa and B. oleracea, respectively. These 320 putative MITEs were characterized for the basic structural characteristics of MITEs including TIRs and TSD using the BLAST 2 sequences, einverted application with manual annotation, and we thereby identified 20 MITE families including the two previously known MITE families BrMi-1 (herein termed BraSto-1, as described below) and BraMi-1 (herein termed BraSto-2) [20,47,48]. Each family was characterized by short length, ranging from 160 to 556 bp, and had TIR sequences of between 16 and 85 bp. The identified MITE families had .50% A+T content, as is usual for MITEs, with one exception (BraTo-9, 39%). Relative empty site analysis showed that the TSD sequences for the families were different, being 2, 3, 8 and 10 bp ( Figure 1). Based on their TSD, the MITE families were classified into one of four superfamilies, Stowaway, Tourist, hAT and Mutator. We named the Stowaway families as Brassica Stowaway (BraSto) 1-4, the Tourist families as Brassica Tourist (BraTo) 1-13, the hAT families as Brassica hAT (BraHAT) 1-2 and the Mutator family as Brassica Mutator (BraMu) 1 (Table S2 in File S1). Homology-based repeat analysis using the CENSOR program in Repbase [51] showed that five families, BraSto-3, BraSto-4, BraTo-2, BraTo-7, and BraTo-8, had sequence similarity to previously reported MITE families. Among them, BraSto-3 and BraTo-2 shared over than 75% sequence similarity with MITEs in other plant families such as METMITE in Barrel Clover (Medicago truncatula) and HARB-1N1_Stu in potato (Solanum tuberosum). BraSto-4, BraTo-7, and BraTo-8, showed the highest sequence similarity to ATPOGO (80%), ATTIRX-1B (72%) and ATTIRX-1C (75%) in A. thaliana. Subsequent similarity searches against the recently developed plant MITE (P-MITE) database [26] revealed that 3 out of the 20 identified families (BraSto-4, BraTo-11 and BraTo-13) were not represented in that database (Table S2 in File S1).  (Tables S2, S3 and S4 in File S1). BraSto-4 was present in the highest copy numbers, with 1369 and 1459, whereas BraSto-3 had the fewest copies, with four and five, in B. rapa and B. oleracea, respectively.
Br-and Bo-members occupied approximately 0.38% (0.93 Mb of 283 Mb) and 0.33% (1.05 Mb of 385 Mb) of the available the B. rapa and B. oleracea whole genome pseudo-chromosome sequences. Secondary structure analysis using one representative intact member of each of the 20 MITE families revealed unique characteristic loops, which might be needed for their transposition ( Figure S1 in File S1) [14,28]. Homology search analysis against an miRNA database (miRBase) revealed that 10 MITEs shared homology with 10 miRNAs reported in A. thaliana. MITE-derived miRNAs were identified in various positions of the MITEs, from terminal to internal regions, with some mismatches (up to 7 bases) ( Table S5 in File S1).
Multiple sequence alignment showed overall sequence conservation levels of 66-90% similarity between members in the same MITE families. The TIRs were especially well conserved, showing .90% sequence similarity. Phylogenetic analysis with a representative 20% of intact MITE members (showing 80:80 coverage),   were distributed throughout the genome. The 1645 Br-members and 1604 Bo-members that were relatively intact (80:80 coverage) showed similar distribution patterns to those of all members (Figure 4; Figure S2 in File S1). The MITE insertion positions were characterized based on the whole-genome annotation of B. rapa and B. oleracea. Among the 5894 Br-members, 5761 were successfully mapped on B. rapa pseudo-chromosome sequences while the other 133 members were identified on unallocated scaffold sequences (Figure 2b: Table S6 in File S1). Out of the 5761 Br-members, 2641 (46%), 1675 (28.4%) and 1445 (24.5%) were positioned in intergenic spaces, near genic regions and in genic regions, respectively. Among the 1445 members in the genic regions, 89 (1.5%) and 692 (11.7%) elements were present in the coding sequences (CDSs) and intronic regions, respectively (Figure 2b, Tables S3 and S6 in File S1). We found that 54% of MITEs were present within 2 kb of genic regions, and 664 (11.5%) members resided within the 500 bp up-and down-stream of the start and stop codons (designated as the 59-and 39-UTRs), respectively (Table S6 in File S1). Similarly, of the 6026 Bomembers, 49.1% (2958), 31.6% (1906) and 1162 (19.3%) were found in intergenic spaces, near genic regions and in genic regions, respectively. Among the 1162 elements in genic regions, 412 (6.8%), 380 (6.3%), and 367 (6.1%) were in introns, 59-UTRs and 39-UTRs, respectively (Figure 2b, Tables S4 and S6 in File S1).

BraTo-9 MITE family members preferentially reside in genic regions in B. rapa
Unlike the members of the other 19 MITE families, BraTo-9 members, including .66% (150/225) of the Br-members, were preferentially accumulated in exons of many genes in B. rapa but not in B. oleracea. Partial sequences of BraTo-9 elements were expressed as chimeric forms in 29 genes of B. rapa (Table 1). Among these 29 genes, 3 represented one copy among triplicated genes, 11 were one copy of duplicated genes and 15 were singlecopy genes in B. rapa. Of the 15 single-copy genes, 7 were not found either in B. oleracea or A. thaliana, indicating that those genes are unique to B. rapa, and BraTo-9 could have a role in the evolution of these unique MITE insertion-containing genes ( Table 1). Comparative analysis of a BraTo-9-inserted gene (Bra016667) and its homologs from B. rapa, B. oleracea and A. thaliana showed that the structure of the Bra016667 gene was altered by the inclusion of one additional exon derived from the BraTo-9 MITE sequence (Figure 5a, b). Bra016667 was predicted to encode 162 amino acid residues, compared to 64 in its noninserted homologs ( Figure S3 in File S1).

Characterization of 20 MITE families reveals MITE dynamics in the Brassica genome
We performed genome-wide systematic analysis to identify novel MITE families in the recently published genome sequences of B. rapa and B. oleracea using a computational tool, MITE-Hunter [38]. Overall, 18 MITE families were newly identified from 320 modeled MITEs after manual editing based on signature MITE structures. Recent genome-wide characterization of MITEs using three different in silico tools has revealed 174 MITE families in B. rapa reported in the P-MITE database [26]. Though our analysis using only a single tool, MITE-Hunter, we identified three new MITE families in B. rapa that were not present in the P-MITE database, one of which was homologous to a family reported in A. thaliana [26]. We chose to use the MITE-Hunter tool because of its efficiency in MITE detection and relatively low false positive rates compared to other tools. For example, only 17 MITE families were identified as genuine from the 1350 predicted structures using the MUST program in silkworm [57].
The 20 MITE families examined herein of the Brassica genomes, including two previously reported families, were classified into four superfamilies based on their TSDs (Figure 1; Table S2 in File S1). Four, two and one MITE family belonged to the Stowaway, hAT and Mutator superfamilies, respectively, whereas 14 belonged to the Tourist superfamily, which is one of the predominant superfamilies in Brassica [26]. Though Tourist MITEs are thought to have evolved as deletion derivatives from autonomous elements, we were not able identify their putative transposase partner. The presence of MITE-derived miRNAs suggests that MITEs might influence the regulation of gene expression and activation of related MITEs and TEs [33][34][35][36]58]. Ten miRNA families of A. thaliana showed high similarity to 10 different MITE family sequences. We could not identify complete structures for nine of the ten MITE families in A. thaliana, suggesting that the nine family members were not activated but instead degenerated in A. thaliana. More in depth analysis is required to elucidate the biogenesis of MITE-derived miRNAs and the potential functional roles of such miRNAs.

MITEs were actively amplified at gene-rich regions in Brassica genome
A total of 5894 Br-members and 6026 Bo-members belonging 20 MITE families were retrieved from B. rapa and B. oleracea, respectively. Only four families with 573 members have been identified in the A. thaliana genome, suggesting that MITE evolution, amplification and burst occurred in the Brassica genus after divergence with Arabidopsis 17 MYA [7,9]. The total number of Br-and Bo-members were similar and the numbers of members of 13 families were also similar, suggesting that the major members of 13 MITE families evolved before the divergence of the two species 4 MYA. However, seven MITE families (BraSto-3, BraTo-1, 4, 7, 9, 11 and BraMu-1) displayed large variation in copy numbers between the two species. In particular, BraTo-1 was represented by 1216 members in the B. oleracea genome but only 207 copies in B. rapa, suggesting that these MITE families actively amplified after the divergence of B. rapa and B. oleracea around 4 MYA [45].
Members of the 20 MITE families are widely distributed throughout the pseudo-chromosome sequences of B. rapa and B. oleracea ( Figure S2 in File S1). There is much evidence that MITEs are associated with gene and gene-rich regions [20,30,[59][60][61], and MITEs mostly reside in genic regions such as promoters, 59-and 39-UTRs, introns and CDSs, which may influence the expression Table 1. Cont. of genes by providing regulatory sequences or recruiting epigenetic modifications [27,30,62]. In the present study, 3120 out of 5761 (54%) Br-members were found within the 2-kb genic regions, which is a higher frequency than would be expected by random transposition. Additionally, 51% (3068/6026) of Bo-members were present in or near genic regions in the B. oleracea genome, suggesting that Bo-MITEs are even more closely associated with genes than are Br-MITEs.
BraTo-9 could play a role in the evolution of duplicated genes in B. rapa We found that especially BraTo-9 MITE family members preferentially reside in genic regions, potentially providing novel exons for functional genes (Table 1). To illustrate this possibility, we demonstrated that the structure of a B. rapa gene (Bra016667) was modified by BraTo-9 insertion ( Figure 5). The non-inserted ortholog of Bra016667 from A. thaliana (AT1G15270, TRANSLA-TION MACHINERY ASSOCIATED7) has an important functional role in protein translation, and deletion of this gene results in alteration of the protein biosynthesis rate [63]. We found that when BraTo-9 insertion occurred in triplicated or duplicated genes in B. rapa, it was always present in only one of duplicated or triplicated genes, suggesting that the BraTo-9 members were actively amplified in B. rapa after divergence with B. oleracea 4 MYA [2,7].
Unlike other MITEs, BraTo-9 members are rich in G+C (61%), which may be a crucial factor aiding their preferential incorporation into exonic regions. It is tempting to speculate that the high G+C content of BraTo-9 members and their preferential location in genic regions are due to acquisition and adaptation from coding sequences or transposases ( Table 1). The G+C content of TEs has been suggested to be responsible for the high efficiency of TE excision and integration [64,65].

MITEs as valuable sources of DNA markers
The principle characteristics of MITEs, such as small size, genetically stability, high copy numbers, and close associations with genes, are useful for development of marker systems in plants and animals [41,42,44,[66][67][68]. Most DNA markers, like those based on simple sequence repeats (SSRs), amplified fragment length polymorphism (AFLP), random amplified polymorphic DNA (RAPD), and restriction fragment length polymorphism (RFLP), detect gradually and simultaneously accumulated mutations. Meanwhile the MIP markers detect InDel polymorphism derived from insertion in a certain genotype and lineage-based inheritance. MITE-based or MIP markers have been effectively utilized for genetic diversity, high density mapping, genomic and evolutionary studies in rice, wheat, soybean and Brassica [20,44,61,69,70]. DNA markers developed from a single Tourist-MITE, mPing, in rice detect .80% (150/183) polymorphism between two japonica rice lines, and have been effectively employed to map the QTL for heading date [69]. Abundant insertion polymorphism can be identified in a short period of time using the MITE display approach [21,41,42].
In our previous study, we developed markers from BraSto-2 and utilized them for diversity analysis of the Brassica population and identification of evolution dates of MITEs [20]. Here, MIP was surveyed using 289 MITE members and showed 52% (150) polymorphism between Brassica species, including 23% (66) at the intra-species level, suggesting that most MITEs are active in the Brassica genome. However, MITE families such as BraTo-7, 10 and BraHAT-1 did not produce any polymorphism, suggesting that these MITE families have been silenced for a long time. We could also find MITEs that were activated after the diversification between B. rapa and B. oleracea 4 MYA. Most of the analyzed families showed moderate to high levels of MIP (13%-100%), suggesting that these MITE families were recently activated and randomly distributed among cultivars. Overall, the MIP markers developed in this study revealed considerable polymorphism in the Brassica species, and these DNA markers can be utilized for various genomics applications such as assessment of genetic diversity, association mapping, genotyping and identification of novel functional genes evolving from MITE insertion.
We have incorporated the MITE member and marker information reported herein into a database, BrassicaTED (http://im-crop.snu.ac.kr/ted/index.php), to promote its effective utilization for further studies [71].

Conclusion
Genome-wide analyses of the B. rapa genome identified 18 previously uncharacterized MITE families belonging to the Stowaway, Tourist, hAT and Mutator superfamilies. We conducted a comparative genome-wide survey of around 12000 MITE members belonging to 20 families in B. rapa, B. oleracea and A. thaliana. We found that 52% (150/289) of MITE members have remained active since genome triplication 17 MYA in the Brassica genus based on MIP analysis, suggesting that MITE members played a dynamic role in the evolution of the Brassica genome. Our findings promote our understanding of MITE dynamics in the

Supporting Information
File S1 Figure S1. Representation of predicted secondary structure and expected loop formation of 20 MITE families used in this study. Figure S2.  Table S3 and S4. (C) The distribution of members of four homologous MITE families in the A. thaliana genome. Figure S3.
Comparisons of the amino acid sequences encoded by the MITEinserted gene Bra016667 and its paralogs Bra026774, Bra026150 and orthologs (At1g15270, TRANSLATION MACHINERY ASSO-CIATED7) and Bol038124, Bol029240 and Bol031556. The sequence added by the BraTo-9 insertion is indicated and highlighted. Table S1. Primers and polymorphism profile from MITE insertion polymorphism analysis of 19 MITE families.