Multipartite viruses contain more than one distinct genome component, and they have been suggested to have evolved from a non-segmented wild-type virus. To explore whether recombination also plays a role in the evolution of the genomes of multipartite viruses, we developed a systematic approach that employs motif-finding tools to detect conserved motifs from divergent genomic regions and applies statistical approaches to select high-confidence motifs. The information that this approach provides helps us understand the evolution of viruses. In this study, we compared our motif-based strategy with current alignment-based recombination-detection methods and applied our methods to the analysis of multipartite single-stranded plant DNA viruses, including bipartite begomoviruses, Banana bunchy top virus (BBTV) (consisting of 6 genome components) and Faba bean necrotic yellows virus (FBNYV) (consisting of 8 genome components). Our analysis revealed that recombination occurred between genome components in some begomoviruses, BBTV and FBNYV. Our data also show that several unusual recombination events have contributed to the evolution of BBTV genome components. We believe that similar approaches can be applied to resolve the evolutionary history of other viruses.
Citation: Wang H-I, Chang C-H, Lin P-H, Fu H-C, Tang C, Yeh H-H (2013) Application of Motif-Based Tools on Evolutionary Analysis of Multipartite Single-Stranded DNA Viruses. PLoS ONE 8(8): e71565. https://doi.org/10.1371/journal.pone.0071565
Editor: Frederick C. C. Leung, University of Hong Kong, China
Received: January 29, 2013; Accepted: July 1, 2013; Published: August 6, 2013
Copyright: © 2013 Wang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was partially supported by the National Science Council, Taiwan (grant nos. 97-2221-E-126-012-MY3, 98-2321-B-002-019-MY3, and 98-2313-B-002-043-MY3), and the Council of Agriculture, Taiwan (grant no. 100 management-2.4-plant quarantine-02). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Multipartite viruses contain more than one genome component (or segment), and for a multipartite virus to initiate a successful infection, all of the genome components must infect the same host cell and replicate simultaneously within it. Because the genome components of a multipartite virus can utilize the same protein(s) for replication and encapsidation, conserved sequences can be observed within these genome components. Aside from this small conserved region, the nucleotide sequences of the genome components are usually quite different from each other. It has been suggested that multipartite viruses evolved from a non-segmented wild-type virus, although different factors favoring the generation of segmented genomes have been proposed. Recombination events have been found between genome components in a variety of multipartite viruses.
Viruses evolved to have a fast replication cycle, and some studies have indicated that recombination events occur frequently. The rearranged genome could cause non-functional proteins to be produced; however, these recombined genome components might be maintained through complementation supported by other genome components or the original un-recombined genome component. Collectively, if recombination events accumulated during the long evolution of a small viral genome, the sequence of the recombined genome could be shuffled to the point that it can no longer be aligned well.
Phylogeny-based methods are the most commonly used methods for detecting recombination from the standpoint of evolutionary histories. These methods usually slide a window along the aligned sequences and monitor phylogenetic variation (e.g., the tree topology) to locate the recombination breakpoint. They therefore require a multiple sequence alignment as the first step and allow a comparison of the gene sequences of different genomes. However, it is difficult to deduce the evolutionary relationships of genomes that cannot be aligned.
An alternative to alignment-based evolutionary analysis, the rearrangement distance algorithm, was presented by Sankoff for the analysis of the evolutionary relationships between genomes. The first step in this method is the identification of homologous genes or common regions shared between the genomes of progeny, followed by the use of these genes or segments as markers. The evolutionary distance can then be calculated as the number of recombinational rearrangement steps that are necessary to convert the order of the markers in one genome to the order in another. The accuracy of these methods largely relies on the resolution of the markers between the genomes of the progeny.
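As a rough illustration of the marker-order idea, the sketch below counts breakpoints between two signed marker orders. The breakpoint count is a simpler stand-in for a full rearrangement distance and is not claimed to be the algorithm of Sankoff or of any specific tool; the integer marker labels, the function names and the linear (non-circular) treatment of the genome are assumptions made only for this example.

```python
def adjacencies(order):
    """Return the signed adjacencies of a linear marker order.

    Markers are nonzero ints; a negative sign means the marker is reversed.
    The adjacency (a, b) read on one strand equals (-b, -a) on the other.
    """
    pairs = set()
    for a, b in zip(order, order[1:]):
        pairs.add((a, b))
        pairs.add((-b, -a))
    return pairs


def breakpoint_distance(order_a, order_b):
    """Count adjacencies of order_a that are not preserved in order_b."""
    if sorted(map(abs, order_a)) != sorted(map(abs, order_b)):
        raise ValueError("both genomes must carry the same markers")
    preserved = adjacencies(order_b)
    return sum((a, b) not in preserved for a, b in zip(order_a, order_a[1:]))


# Hypothetical marker orders: markers 2 and 3 are inverted in genome_2.
genome_1 = [1, 2, 3, 4, 5]
genome_2 = [1, -3, -2, 4, 5]
print(breakpoint_distance(genome_1, genome_2))  # 2
```

A single inversion creates two breakpoints, so the example prints 2; a genome read entirely from the opposite strand has distance 0 to the original.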
Motifs are sequence patterns that recur in different genome regions; these patterns could have some biological significance, such as being binding sites of regulatory proteins or being associated with the structural motifs of proteins. Several motif-finding tools have been developed that can detect small stretches of recurrent sequences. We therefore reasoned that a motif-finding tool can be applied to detect possible sequence patterns that recur in different genome components of multipartite viruses; these patterns might have some biological significance. Moreover, the recurring motifs can be applied to studies of evolution.
In this study, we employed a motif-finding method and computational simulation to detect recombination events for phylogenetic analysis. We applied our methods to the analysis of multipartite plant DNA viruses, including bipartite begomoviruses, Banana bunchy top virus (BBTV) and Faba bean necrotic yellows virus (FBNYV). Our analysis revealed that several unexpected recombination events contributed to the evolution of these viruses. We believe that similar approaches can be applied to resolve the evolutionary history of other viruses.
Comparison of alignment- and motif-based recombination detection methods
To compare the alignment- and motif-based recombination detection methods, we utilized published geminivirus data for our initial trial. We first selected a monopartite begomovirus, Sweet potato leaf curl virus (SPLCV), which infects sweet potato, for analysis. Three isolates, SPLCV-BR (HQ393455), -SP (HQ393473) and -US (HQ393450), were selected for analysis (Table S1 A). The SPLCV-US isolate is a recombinant of the putative parents SPLCV-BR and SPLCV-SP. We repeated an alignment-based recombination analysis using BOOTSCAN, CHIMAERA, GENECONV, MAXCHI, RDP, SISCAN and 3SEQ, as implemented in the program RDP4 (Version 4.16), to detect recombination in these isolates, and similar results were obtained (Figure 1A).
The schematic representations based on alignments of Sweet potato leaf curl virus (SPLCV) isolates (A) and Tomato leaf curl China virus (TYLCCNV) components (B) are shown at the top of the figure and indicate the recombination events detected by RDP4. Each sequence is represented by an open rectangle and colored differently from the other sequences. The details of the recombination breakpoint detected by RDP4 are shown. The motifs detected by MEME in the genomes of SPLCV and TYLCCNV are shown in (C) and (D), respectively, and the same motifs are in the same color. Identical motifs in the TYLCCNV genomes are indicated by open rectangles.
For motif-based recombination detection, we selected the motif-finding tool Multiple Em for Motif Elicitation (MEME) (http://meme.nbcr.net/meme/). We used MEME to detect the common motifs between SPLCV-BR, -SP and -US (Figure 1C). MEME detected identical motif locations in each genome and was unable to detect any recombination event. This indicated that when analyzing genome sequences that are similar and well aligned, alignment-based methods are better than motif-based methods at detecting recombination events.
Next, we selected another monopartite begomovirus, Tomato leaf curl China virus (TYLCCNV), for analysis. Satellite molecules, named DNA-β, associate with DNA-A of TYLCCNV, which is essential for the induction of symptoms. Additionally, recombinant DNA molecules (RecDNA-Aβ) between DNA-A and DNA-β of TYLCCNV have been reported. We used the genome sequences of DNA-A (AJ319675), DNA-β (AJ421621) and RecDNA-Aβ (AJ781297) for analysis (listed in Table S1A). Both alignment- and motif-based recombination methods were performed (Figure 1B and 1D). The motif allocation detected by MEME clearly indicated that the genome sequence of RecDNA-Aβ had recombined with sequence derived from DNA-β (130–1140 nt position) and with sequences derived from DNA-A at the beginning and the end of the genome alignment (1–134 and 1140–1335 nt positions) (Figure 1D). However, the alignment-based methods were not able to detect the recombination at position 1140–1314 (Figure 1B) because this region could not be aligned well. The comparative analysis revealed that the alignment-based recombination detection methods are suitable for analyzing recombination events when sequences share high sequence similarities and can be easily aligned, and the motif-based recombination detection methods are suitable for analyzing recombination events when sequences share fewer similarities or cannot easily be aligned.
Identification of common motifs between components A and B of bipartite begomoviruses by MEME
We next aimed to determine whether recombination events happened between genome components of multipartite viruses, which are considered to contain distinct component sequences that share only short stretches of conserved sequences. We first analyzed whether recombination events happened between the A and B components of all bipartite begomoviruses listed in the Virus Taxonomy of the International Committee on Taxonomy of Viruses (ICTV), including those that have not been approved as species (Table S1).
Identification of high-confidence motifs between components A and B of bipartite begomoviruses by MEME
In total, 86 bipartite begomoviruses were analyzed by MEME (Figure 2A and data not shown). To understand whether the motifs that were detected by MEME might have significance in the evolutionary history of the viruses, we first generated several sequence sets and subjected them to MEME analysis to obtain reference information content (IC) values for evaluation (Figure 3A). To identify the maximum IC value, we generated identical sets, each set including two identical sequences whose length was the average of the corresponding DNA-A and DNA-B genome components of bipartite begomoviruses. We also generated begomovirus-genome random mimic sequence sets by simulating the bipartite begomovirus genome. Each random sequence had an identical length and GC content corresponding to a specific component of the bipartite begomovirus genome. We analyzed the random mimic sequence sets with MEME to identify their IC values. To provide a better control that reflects the arrangement of non-coding and coding sequence stretches in begomoviruses, we also generated begomovirus-genome coding region mimic sequence sets. Each begomovirus-genome coding region mimic sequence set comprised two sequences, each generated by randomly combining sequences that were selected from coding and non-coding sequences derived from the GenBank sequence database (see Materials and Methods). The coding and non-coding sequences had identical lengths and corresponded to a particular component of the bipartite begomovirus genome. To further evaluate the motifs that were detected in the begomovirus genomes, we introduced another value, the mean pairwise distance Dh (see Materials and Methods), which describes the compactness (the average similarity between the motifs in a set) of a set of motifs more precisely.
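The random mimic control can be sketched as follows. This is a minimal illustration, assuming that a mimic only needs to match the length and GC fraction of a genome component; the function names, the seed, and the example length and GC value are hypothetical and are not taken from the study.

```python
import random


def random_mimic(length, gc_content, rng):
    """Return a random DNA sequence with the requested length and GC fraction.

    The number of G/C bases is fixed to round(length * gc_content), so the
    mimic matches the target composition exactly rather than on average.
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    n_gc = round(length * gc_content)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)
    return "".join(bases)


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical component: 2600 nt with 45% GC (values made up for the example).
rng = random.Random(0)
mimic = random_mimic(2600, 0.45, rng)
print(len(mimic), round(gc_fraction(mimic), 3))
```

A coding region mimic would need real coding and non-coding stretches as input, so it is not reproduced here.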
(A) The distribution of motifs in the genomes of DNA-A and DNA-B components of Potato yellow mosaic Panama virus (PYMPV), Gossypium punctatum mild leaf curl virus (GPMLCuV), Tomato leaf curl Hsinchu virus (ToLCHsV) and Tomato yellow leaf curl Kanchanaburi virus (TYLCKaV), detected by MEME. (B) Schematic representation of the locations of high-confidence motifs. The genome sequences are represented by gray lines. The colored rectangles on the genomes represent the identified motifs, and arrows indicate that the motif is reversed. The motifs belonging to the same set in the same genome are indicated in the same color. Potential recombinant regions are indicated by red open rectangles, and the Dh of each motif set in the regions is listed.
(A) The x-axis represents the IC. The y-axis represents the percentage of motif sets with a certain IC. The green line and blue line represent the distribution of the percentage of the motif sets detected by the simulation of the begomovirus-genome coding region mimic sequence set and the begomovirus-genome random mimic sequence set, respectively. (B) The x-axis represents the mean pairwise distance (Dh). The y-axis represents the percentage of motif sets with a certain Dh value. The green line and blue line represent the distribution of the percentage of the motif sets detected by the simulation of the begomovirus-genome coding region mimic sequence set and the begomovirus-genome random mimic sequence set, respectively. (C) The percentage of motifs detected by MEME from motifs inserted in randomly generated sequences. The Dh of each MEME-detected motif set is designated with the numbers 1 to 4. The x-axis represents the Dh of artificial motifs that were generated and inserted randomly in begomovirus-genome mimic sequences. The y-axis represents the coverage rate Cr of detected motifs compared with the initially inserted artificial motifs.
We applied all sequence sets to identify the information content (IC) (a ranking number assigned to each set of motifs) (Figure 3A) and to calculate the Dh (Figure 3B) of every motif pair detected in the mimic genome. The results showed that the IC value of motifs detected in identical sequences was 40, whereas fewer than 0.5% of motifs detected in begomovirus-genome random mimic sequence sets and begomovirus-genome coding region mimic sequence sets had IC values larger than 37.6 and Dh values less than or equal to 3 (motif length = 20 bp). To further ascertain the motif detection accuracy by MEME, we constructed a set of simulation data to help determine the threshold values of Dh (Figure 3C). In the simulation, random sequences were generated and embedded with motif sets with Dh values less than or equal to 3, and the simulated sequences were analyzed by MEME. We then calculated the coverage rate Cr (see Materials and Methods) to estimate the fraction of embedded motif sets that were detected accurately by MEME. In the simulation, the coverage rate Cr was higher than 99.8%; i.e., more than 99.8% of the motifs (with Dh values less than or equal to 3) that were embedded in the simulation sequences were detected by MEME (Figure 3C). By the above evaluations, we selected motifs with IC greater than 37.6 and Dh less than or equal to 3 for the analysis of bipartite begomoviruses.
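The two selection criteria can be illustrated with a short sketch. It rests on two assumptions that are not confirmed by the text above: that IC is the summed per-column information in bits against a uniform A/C/G/T background (which gives exactly 40 for a fully conserved 20 bp motif, matching the value reported for identical sequences), and that Dh is the mean pairwise Hamming distance between the sites of a motif set. MEME's own reported values depend on its background model and may differ. The function names and example sites are hypothetical.

```python
from itertools import combinations
from math import log2


def information_content(sites):
    """Summed per-column information (bits) against a uniform background."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    total = 0.0
    for col in range(width):
        column = [s[col] for s in sites]
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                total += p * log2(p)  # negative entropy contribution
        total += 2.0  # maximum information of one DNA column
    return total


def mean_pairwise_distance(sites):
    """Mean Hamming distance over all pairs of sites (assumed form of Dh)."""
    pairs = list(combinations(sites, 2))
    return sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)


def is_high_confidence(sites, ic_min=37.6, dh_max=3):
    """Apply the thresholds quoted in the text: IC > 37.6 and Dh <= 3."""
    return (information_content(sites) > ic_min
            and mean_pairwise_distance(sites) <= dh_max)


site = "ACGTACGTACGTACGTACGT"      # hypothetical 20 bp site
variant = "TCGTACGTACGTACGTACGT"   # one mismatch
print(information_content([site, site]), is_high_confidence([site, variant]))
```

Under these assumptions, two sites that differ at one position give an IC of 39 bits and pass both thresholds, whereas two sites that differ at three positions give 37 bits and fail the IC threshold.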
The high-confidence motif allocation of all bipartite begomoviruses is shown in Figure S1, and the viruses with distinctive recombination events are shown in Figure 2. The conserved region between DNA-A and DNA-B was easily detected by MEME. Moreover, some possible recombination events were also found in the unaligned region between DNA-A and DNA-B in the Potato yellow mosaic Panama virus (PYMPV), Gossypium punctatum mild leaf curl virus (GPMLCuV), Tomato leaf curl Hsinchu virus (ToLCHsV) and Tomato yellow leaf curl Kanchanaburi virus (TYLCKaV) (Figure 2B).
Application of motif-based analysis to BBTV
Besides bipartite begomoviruses, we selected another multipartite virus, Banana bunchy top virus (BBTV), for our analysis. BBTV is a phytopathogenic DNA virus that causes banana bunchy top disease (BBTD). The genome of BBTV comprises at least six single-stranded, circular, integral DNA genome components, including DNA-R (encoding the replication initiation protein), -U3 (potentially encoding a protein with an unknown function), -S (encoding the capsid protein), -M (encoding a movement and RNA silencing suppressor protein), -C (encoding a cell cycle link and RNA silencing suppressor protein) and -N (encoding a nuclear shuttle protein), each of which is considered to be distinct and to share only short stretches of conserved sequences, the stem-loop (SL) and major (CR-M) common regions. We selected one isolate from each of several different geographical regions for analysis. The complete genome sequences of BBTV for each component were available from five geographical isolates (the Australia, Taiwan Type I, India Bihar, China Hainan and Tonga Tongatapu isolates) (Table 1). We also selected an Egypt Kalubia isolate that had complete sequences for DNA-R, DNA-U3, DNA-S and DNA-M but only partial sequences for DNA-C and DNA-N (Table 1).
BBTV forms two groups, the Asian group (the Taiwan Type I and China Hainan isolates) and the Pacific group (the Australia, India Bihar, Egypt Kalubia and Tongatapu isolates), and our phylogenetic analysis was consistent with that grouping (Figure S2). Two conserved regions, the SL (60 to 71 bp) and the CR-M region (83 to 90 bp), of the BBTV genome have been identified. Previously, the evolutionary relationships were analyzed based mainly on the SL and CR-M. The total length of these conserved regions represented approximately 12.9%–15.5% of the genome.
We used MEME to detect the conserved motifs between the six integral components of BBTV. The analyses were conducted for all isolates of BBTV, and for convenience, the complete analysis of the BBTV Taiwan Type I isolate is shown. The top 25 motif sets were detected (20 nucleotides in length for each motif), along with their similarity rankings (Figure 4A). Each set of motifs was assigned an IC value (Table 2). To select the motifs that might have significance for the evolutionary history of the virus, we also generated sequence sets as described above, including identical sets, BBTV-genome random mimic sequence sets and BBTV-genome coding region mimic sequence sets. We compared the IC values of each motif detected from BBTV (Figure 4B, red line), identical sequence sets (Figure 4B, yellow line), BBTV-genome random mimic sequence sets (Figure 4B, blue line) and BBTV-genome coding region mimic sequence sets (Figure 4B, green line). The comparison of the IC values indicated that the motif sets with a ranking within the top 13 had a higher IC value (22.7) compared with the highest IC values derived from the BBTV-genome coding region mimic (22.6) and random mimic sequences (20.8). We also calculated the Dh value of each motif set detected from the BBTV-genome coding region mimic sequence sets (Figure 4C, green line) and the BBTV-genome random mimic sequence sets (Figure 4C, blue line). The distributions of Dh are displayed as curves with mean values of 11.1 for the BBTV-genome coding region mimic sequence sets and 12.3 for the random mimic sequence sets. We then highlighted the Dh values of 14 ranking motif sets with high IC values detected from BBTV on the curve. The results show that the Dh values of the BBTV motif sets were located at the left side of the curve (Figure 4C). We also constructed a set of simulation data to ascertain the motif detection accuracy by MEME.
In the simulation, random sequences were generated and embedded with motif sets with different Dh values, and the simulated sequences were analyzed by MEME (Figure 4D). In the simulation, when the Dh value was smaller than 8, the coverage rate Cr was still higher than 70%. The Cr value decreased rapidly when Dh was higher than 8. When Dh became greater than 9, the Cr value dropped below 50.31% and then continuously approached the lower bound. Based on the simulation, we selected 8 as the threshold Dh value in our study.
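One way to score such a simulation is sketched below. The definition of Cr used here (the fraction of embedded sites that are overlapped by a detected site in the same sequence) is an assumption, since the exact formula is given only in Materials and Methods; the function name, the overlap tolerance and the toy positions are hypothetical.

```python
def coverage_rate(embedded, detected, width=20, min_overlap=10):
    """Fraction of embedded motif sites recovered by the motif finder.

    embedded, detected: dicts mapping a sequence name to a list of 0-based
    start positions. An embedded site counts as recovered when a detected
    site of the same width overlaps it by at least min_overlap bases.
    """
    total = 0
    hits = 0
    for name, starts in embedded.items():
        found = detected.get(name, [])
        for start in starts:
            total += 1
            if any(width - abs(start - d) >= min_overlap for d in found):
                hits += 1
    return hits / total if total else 0.0


# Toy positions (made up): two of the three embedded sites are recovered.
embedded = {"seq1": [100, 500], "seq2": [250]}
detected = {"seq1": [103], "seq2": [250, 900]}
print(coverage_rate(embedded, detected))
```

Repeating this calculation for motif sets embedded at increasing Dh would give a curve of the kind described for Figure 4D.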
(A) Schematic representation of the Banana bunchy top virus (BBTV) genome and the distribution of common motifs in the genomes of BBTV, detected by MEME. The genome sequences of BBTV are represented by gray lines. The rectangles on the genomes represent the identified motifs. The motifs belonging to the same set are indicated in the same color. (B) The evaluation of the motifs by IC is represented on the y-axis. The x-axis represents the rank of the motif among all of the motifs identified. The red line represents the motif sets detected in the BBTV whole-genome sequences, and the green line and blue line represent the motif sets detected by the simulation of the BBTV-genome coding region mimic sequence set and the BBTV-genome random mimic sequence set, respectively. The yellow line represents the IC values that were derived from six identical sequences. (C) The distribution of the percentage of the motif sets detected from BBTV-genome mimic sequences is shown. The x-axis represents the mean pairwise distance (Dh). The y-axis represents the percentage of motif sets with a certain Dh value. The green line and blue line represent the distribution of the percentage of the motif sets detected by the simulation of the BBTV-genome coding region mimic sequence set and the BBTV-genome random mimic sequence set, respectively. The black rectangle represents the Dh of the motif sets detected within the BBTV genome sequences. (D) The percentage of motifs detected by MEME from motifs inserted in randomly generated sequences. The x-axis represents the Dh of artificial motifs that were generated and inserted randomly in BBTV-genome mimic sequences. The y-axis represents the coverage rate Cr of detected motifs compared with the initially inserted artificial motifs. Only the result of the BBTV Taiwan Type I isolate is shown.
Phylogenetic analysis of BBTV
From the above analysis, we applied the most stringent criterion (Dh = 8) to select the high-confidence motifs from all of the BBTV genome components. In total, 8–12 high-confidence motifs were identified from different geographic isolates (Table 3). For every BBTV isolate, each identified high-confidence motif was studied using phylogenetic analysis. The results indicated that most of the sequences within the high-confidence motifs were similar and were phylogenetically unresolved. However, some of the motifs could be resolved (with bootstrap support >80), and the resolved phylogenies all indicated that BBTV DNA-S and DNA-M were the most closely related (Figure 5A; Table 3).
(A) The NJ trees supported the grouping of Banana bunchy top virus (BBTV) components; only the bootstrap values above 75% were counted. (B) The distance matrix calculated by SPRING represents the recombination steps that are necessary for changing the motif order from that of one genome to that of another. (C) The neighbor-joining tree constructed from the distance matrix calculated in (B). (D) The index of specifically shared motifs. The number represents the number of motifs that were shared specifically by subsets of components. We applied these methods to all of the BBTV isolates (Australia, India, Egypt, Taiwan, China and Tonga) (Table 3). The results derived from all of the BBTV isolates are similar, although the rearrangement distance (B) and the number of motifs that were shared specifically by subsets of components (D) varied between BBTV isolates (Figure S3). Only the result derived from the BBTV Taiwan Type I isolate is shown here.
In the second strategy, we applied rearrangement distance algorithms to estimate the relationships of the BBTV components based on rearrangements. The conserved motifs of each genome component were used as markers, and each sequence was represented in the form of its marker order. We used SPRING to calculate the rearrangement distance between genomes. The rearrangement distance matrices of six genomes derived from different geographic isolates are shown in Figure 5B and Figure S3, and the phylogenetic tree constructed from the distance matrix is shown in Figure 5C. The motif order of DNA-S and DNA-M had the minimal rearrangement distance in all analyzed isolates (Figure 5C and S3), indicating that these two components are more closely related.
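The conversion of motif hits into a marker order can be sketched as follows. The tuple layout for a hit, the integer motif identifiers and the optional rotation to a chosen anchor motif (a convenience for circular genomes) are assumptions of this example and do not describe the input format of SPRING.

```python
def marker_order(hits, anchor=None):
    """Turn motif hits on one genome component into a signed marker order.

    hits: list of (motif_id, start, strand) tuples, where motif_id is a
    positive int and strand is '+' or '-'. Markers are ordered by start
    position; a reversed motif is written with a negative sign. If anchor
    is given, the circular order is rotated so that it starts at that motif.
    """
    ordered = sorted(hits, key=lambda hit: hit[1])
    signed = [motif if strand == "+" else -motif for motif, _, strand in ordered]
    if anchor is not None:
        index = next(i for i, m in enumerate(signed) if abs(m) == anchor)
        signed = signed[index:] + signed[:index]
    return signed


# Toy hits (positions made up): motif 2 lies on the reverse strand.
hits = [(3, 700, "+"), (1, 50, "+"), (2, 400, "-")]
print(marker_order(hits))            # [1, -2, 3]
print(marker_order(hits, anchor=2))  # [-2, 3, 1]
```

Signed orders of this kind are the usual input for rearrangement distance calculations, whichever distance measure is then applied.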
In addition to the motif sets that were shared by all of the BBTV components, we also detected conserved motifs that were shared specifically by subsets of the components of BBTV (Figure 5D and S3, Table 3). We used a relatively stringent threshold and selected only motif sets with Dh values less than 6. The indices of these specifically shared motifs are shown in Figure 5D and S3. The pairs of DNA-S and DNA-M contained the largest number of specifically shared motifs (Figure 5D and S3). We conducted all of the analyses on the selected BBTV isolates; the correlation between DNA-S and DNA-M was supported by all of the methods.
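A pairwise index of specifically shared motifs could be tallied as in the sketch below. It assumes that a motif set counts as specifically shared when it occurs in at least two but not all components, and that each such set adds one to every pair of components that carry it; the motif names and memberships are toy data, not results from Table 3.

```python
from collections import Counter
from itertools import combinations


def specifically_shared_counts(motif_members, all_components):
    """Count, for every pair of components, the motif sets they share
    that are not present in all components.

    motif_members: dict mapping a motif set name to the components it occurs in.
    """
    everything = set(all_components)
    counts = Counter()
    for members in motif_members.values():
        members = set(members)
        if len(members) < 2 or members == everything:
            continue  # private to one component, or shared by all of them
        for pair in combinations(sorted(members), 2):
            counts[pair] += 1
    return counts


components = ["C", "M", "N", "R", "S", "U3"]
toy_motifs = {
    "m1": components,        # shared by all components, so ignored
    "m2": ["S", "M"],
    "m3": ["S", "M", "C"],
    "m4": ["R"],             # present in a single component, so ignored
}
counts = specifically_shared_counts(toy_motifs, components)
print(counts.most_common(1))
```

In this toy input the pair (M, S) has the highest count, which mirrors the kind of comparison shown in Figure 5D without reproducing its values.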
Motif distribution of BBTV
We also plotted the identified high-confidence motifs on the aligned BBTV genome (Figure 6). Our results reveal several interesting phenomena. First, although the alignment of the SL region of all of the BBTV genome components showed that the SL regions of DNA-U3 and DNA-N were more distinctive than the other components (Figure S4), the SLs of DNA-U3 and -N both lacked a motif (Figure 6B and 6F). The missing motif in DNA-U3 (Figure 6B, Motif 1) was found upstream of the SL region in DNA-U3 (the BBTV genome is circular), and it was derived from all of the analyzed isolates except the Taiwan Type I isolate (Figure 6B, Motif 1). Additionally, near the CR-M regions of DNA-S and -M, there were actually long stretches of common sequences (Figure 6C and 6D). More interestingly, the detected high-confidence motifs identified surrounding the CR-M region of DNA-S and -M were also observed in the other genome components but were scattered in different positions (Figure 6).
The solid lines represent the sequences of each Banana bunchy top virus (BBTV) genome component derived from the different isolates (Aus, Australia; Ind, India; Egy, Egypt; Tai, Taiwan, Chi, China; Tog, Tonga). All of the BBTV genome components, DNA-R (A), -U3 (B), -S (C), -M (D), -C (E), and -N (F), which were derived from different isolates, are aligned separately. The dotted lines represented gapped regions (only the gapped lengths longer than 10 are shown). The two conserved regions, the stem-loop (SL) and the major (CR-M) common region, are marked above the alignment. The colored rectangles represent the high-confidence motifs (see text) that are shared by all of the genome components of the isolates, and the motifs with similar sequences are indicated in the same color; an arrow on a rectangle indicates that the motif is reversed; a star on a rectangle indicates a high-confidence motif that is shared only by a subset of the genome components of an isolate.
Application of motif-based analysis to FBNYV
To demonstrate that our analytical methods can also be applied to other viruses, we selected the type species of the genus Nanovirus, Faba bean necrotic yellows virus (GQ274023–GQ274030), for analysis (Figure 7A). We used all the motif-finding and statistical analyses to identify high-confidence motifs for FBNYV (Figure S5). The motif distribution of FBNYV indicated that MEME could easily detect the known conserved region between genome components; however, our analysis revealed that several recombination events happened in these conserved regions (Figure 7B). Our analysis also revealed that several motifs (4, 10 and 12) found in the conserved region of most genome components were rearranged in the distal regions of other genome components (Figure 7B).
(A) Schematic representation of the Faba bean necrotic yellows virus (FBNYV) genome. (B) The distribution of common motifs in the genomes of FBNYV detected by MEME. The solid lines represent the sequences of each FBNYV genome component: DNA-R, -U1, -U2, -U4, -S, -M, -C and -N. The stem-loop region of all components is marked at the top of the figure. Three conserved regions, the CR-M-U2, CR-C-U1 and CR-N-U4 regions, shared only by certain components of FBNYV are also indicated. The colored rectangles represent the high-confidence motifs (see text) that are shared by genome components, and the motifs with similar sequences are indicated in the same color; an arrow on a rectangle indicates that the motif is reversed.
In this study, we have developed a systematic approach to analyze the common motifs that are shared by multipartite viruses. Our methods employ motif-finding tools to detect conserved motifs from divergent genomic regions and apply statistical approaches to select high-confidence motifs. Our methods also allow us to retrieve additional information that helps to understand the evolution of viruses. We have verified the effectiveness of our approach in bipartite begomoviruses, BBTV and FBNYV, which have multi-component genomes.
Our analysis revealed that within bipartite begomoviruses, few common motifs can be found outside the conserved region; however, putative recombination events were still observed outside the conserved region. For example, rearrangement and duplication were found outside the previously known common region of GPMLCuV (Figure 2B). Additionally, several reversions were found between genome components of PYMPV, ToLCHsV and TYLCKaV (Figure 2B).
Our statistical analysis indicated that several short DNA motifs were scattered throughout the BBTV genome and that these motifs were not likely to have been produced by random mutation. From the distribution of detected motifs (Figure 6), we found that the distinctive SL region of DNA-U3 most likely resulted from rearranging a common motif to a different region (Figure 6B, Motif 1). This suggested that recombination events occurred in DNA-U3. Phylogenetic analysis using each of the detected common motifs derived from different DNA components also indicated that DNA-S and DNA-M are the most closely related (Figure 5A). In addition to the sequence similarity, the order of these common motifs is also the most closely related between DNA-S and DNA-M (Figure 5B and S3). Furthermore, some of the motifs that we detected are actually not shared by all of the genome components; however, DNA-S and DNA-M contain more specifically shared motifs than other DNA components (Figure 5D, Figure S3 and Table 3). Thus, the correlation between DNA-S and -M was highly supported by phylogenetic-based methods (the analysis of each motif set), rearrangement-based methods, distance-based methods and the number of specifically shared motifs (Figure 5D and Table 3). No explanation for this observation has previously been reported. The distribution and arrangement of the motifs in the BBTV genome are similar in both the Pacific and the Asian groups, which suggests that the recombination event happened before the geographic separation.
Furthermore, the largest number of motifs detected in our analysis is centered near the CR-M region in DNA-S and DNA-M (Figure 6C and 6D). Endogenous primers within the BBTV virions can bind to the CR-M region and initiate the synthesis of complementary-strand DNA in vitro. Interestingly, the detected motifs centered in the CR-M of DNA-S and DNA-M are also scattered within all of the other BBTV genome components (Figure 6A, 6B, 6E and 6F); whether these detected motifs are important for BBTV replication remains to be determined.
We summarize our analysis in Figure 8A–8C. In the first situation (Figure 8A), translocation of short conserved sequences occurred in all of the BBTV genome components. This result is strongly supported in BBTV DNA-U3 (Figure 6B). Reversion of conserved sequences was also identified and found to occur in all of the isolates (Figure 6). Motifs that are shared between 2–5 genome components were also frequently found (Table 3).
(A) Conserved sequences (represented by dark rectangles) in the offspring genomes are separated by foreign segments (represented by empty blocks) as a result of multiple insertion events. (B) Recombination and inversion might occur in the offspring genomes and lead to positional and directional rearrangement of the conserved region (represented by dark arrows; the arrowhead indicates the direction). (C) Progeny genomes can share similar genome organization, but certain distinctive segments (represented by gray rectangles) within these regions can be shared only by a subset of the progeny.
Previously, BBTV recombination and reassortment events have been detailed by Stainton et al. (2012) using alignment-based methods, and several interesting recombination events have been deduced. For example, both inter- and intra-component recombination have been found in the SL common region of DNA-U3 in some BBTV Taiwan isolates (DQ826392, DQFJ778283 and FJ463043), an India isolate (FJ009239) and a Pakistan isolate (FJ859749).
Our analysis revealed that, unlike other DNA-U3 components, DNA-U3 of the Taiwan isolate consists of 3 motifs similar to those of other BBTV genome components. DNA-U3 found in the NC003475, FJ605508, AF102781, AY606084 and JF957644 isolates contains only 2 motifs (Figure 6B). Initially, it seemed that inter-component recombination had happened in the SL region of DNA-U3 of the Taiwan isolate. However, alignment using DNA-U3 of DQ826392, FJ463043, FJ009239 and FJ773283 showed that SL is more conserved between isolates (Figure S6) and is less conserved between its associated DNA-R, -S, -M, -C and -N. This result suggests that intra-component recombination happened in the SL region between DNA-U3 of DQ826392, FJ463043, FJ009239 and FJ773283, as previously revealed by Stainton et al. (2012).
The alignment-based methods easily identify recombination events in the conserved region, as reported by Stainton et al. (2012); however, sequences outside the conserved region, for example, the missing motif 1 in SL of DNA-U3 (Figure 6), which has relocated to a different position, will not be easily detected by alignment-based methods. Therefore, both methods are needed for a better understanding of virus evolution.
In FBNYV, we also detected some interesting rearrangement events. For example, motif 4 is located in a similar position in DNA-U1, -M, -C, -N and -U2, but reversion of motif 4 can be found in other positions in DNA-R, -S and -U4 (Figure 7B). Additionally, the distinctive conserved region of DNA-U2 and DNA-M, consisting of motifs 7, 4, 19, 3 and 2, results from the relocation of motif 4 between motifs 3 and 7, and the distinctive motifs 12, 15 and 19 are shared only by DNA-U2 and DNA-M.
Collectively, our analysis allows us to detect motifs between genome components of multipartite viruses. The analysis of these motifs allows us to reveal unusual evolution events that occurred between genome components in some begomoviruses, BBTV and FBNYV, and our data strongly suggest that genome recombination events have contributed to the evolution of BBTV integral genome components. However, we cannot rule out the possibility that some of the motifs emerged from convergent evolution. Although we interpreted these motifs from the standpoint of evolution, these motifs, which were detected as common or as partly shared motifs, could preserve function, which would allow us to further analyze their biological significance.
Materials and Methods
Sequences used in this study
The sequences of SPLCV, TYLCCNV and the genomes of bipartite begomoviruses used in the analysis are listed in Table S1. Six different isolates of BBTV from different geographical regions (Australia, Taiwan, India, Egypt, China and Tonga) were used (Table 1). The genome sequences of Faba bean necrotic yellows virus (GQ274023–GQ274030) were selected. The above sequences were all obtained from GenBank.
Alignment, rearrangement, distance calculation, phylogenetic analysis and recombination seeking
The sequences were aligned by ClustalX 2.0 using the default settings, and the rearrangement degree was estimated by SPRING (http://algorithm.cs.nthu.edu.tw/tools/SPRING/), which estimates the rearrangement distance between genomes by calculating the necessary editing steps of reversals and/or block-interchanges. All of the phylogenetic trees in this study were generated by MEGA 5 using maximum parsimony (MP). We performed heuristic searches with 1,000 random addition replicates and tree bisection-reconnection (TBR) branch-swapping in the maximum parsimony analysis, and 10 trees were selected from each replicate. The branch support was estimated by PAUP version 4.0b10 by bootstrapping with 1,000 replicates for both maximum parsimony and neighbor-joining (NJ) analyses. Evidence of recombination was sought with the program RDP4 (Version 4.16), which implements the methods BOOTSCAN, CHIMAERA, GENECONV, MAXCHI, RDP, SISCAN and 3SEQ.
The construction of the virus-genome mimic sequence set
The original genome sequences that were used in the sequence construction were retrieved from GenBank. The genome sequences were first processed by CDS Parser (the code can be downloaded from http://Algorithm.cs.nthu.edu.tw/CDSParser.php) to exclude sequences that had undetermined (N) sites and to store coding-region and non-coding-region sequences separately in the database (Figure S7). We constructed virus-genome mimic sequences for the simulation of bipartite begomoviruses, BBTV and FBNYV separately. For each kind of virus, two types of virus-genome mimic sequence sets were constructed. First, a virus-genome coding region mimic sequence set was generated by simulating the virus genome, which comprises the number of sequences corresponding to the mimicked virus, i.e., two sequences for begomoviruses, six sequences for BBTV and eight sequences for FBNYV. Each virus-genome coding region mimic sequence was generated by combining sequences that were selected randomly from the coding and non-coding GenBank sequences in the database. Each had coding and non-coding sequence lengths identical to those of a particular component of the virus genome. Second, a virus-genome random mimic sequence set was also generated; each random sequence had the same length and GC content as a specific component of the virus genome.
Motif detection and measurement of similarity
The EM-based algorithm Multiple EM for Motif Elicitation (MEME), which was introduced by Bailey and Elkan, was used for the detection of sequence motifs in the virus genomes (begomoviruses, BBTV and FBNYV) used in this study. The genome sequence of each virus was submitted to MEME (http://meme.nbcr.net/meme/) for the determination of similar segments (motifs). The variables used in MEME are listed in Table 4. The information content value (IC, the relative entropy of the motif relative to a uniform background frequency model) of each motif set was calculated. Pairwise motif correlations were checked by MAST to exclude similar motif pairs (correlation >0.60) and to identify the corresponding positions of each motif in the virus sequences.
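The information content described above can be illustrated with a minimal sketch. It assumes that a motif set is available as a list of equal-length aligned sites, uses observed column frequencies without pseudocounts, and sums the per-column relative entropy against a uniform background; the function name `information_content` is hypothetical, and the value reported by MEME itself may differ in detail.

```python
import math
from collections import Counter


def information_content(sites, alphabet="ACGT"):
    """Total information content (in bits) of equal-length motif sites.

    Each column contributes the relative entropy of its observed base
    frequencies against a uniform background (1/len(alphabet) per base).
    Assumption: no pseudocounts or small-sample correction are applied.
    """
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    background = 1.0 / len(alphabet)
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base in alphabet:
            p = counts[base] / len(sites)
            if p > 0:
                total += p * math.log2(p / background)
    return total


# A perfectly conserved 4-nt motif carries 2 bits per column.
print(information_content(["ACGT", "ACGT", "ACGT"]))
```

Under this definition a fully conserved DNA column contributes 2 bits and a column with all four bases equally frequent contributes 0 bits.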
The measure of similarity for each motif set was defined as the mean pairwise distance (MPD), or Dh, to further verify the confidence of each motif set detected by MEME. For every set of motifs, we first calculated the Hamming distance h between each pair of motifs in the set. Then, the distances h of all motif pairs were summed and divided by the total number of pairs (n motifs give n(n-1)/2 possible pairs) to calculate the Dh score. The score Dh of the motif set M can be written as:

$$D_h(M) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} h(m_i, m_j)$$

where n is the number of motifs in motif set M and m_i denotes the i-th motif in the set. The Dh score for each motif set acts as a normalized value for variation, in which low Dh scores correspond to high conservation between motifs. The observed Dh values theoretically reflect the compactness of the motif set M.
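As an illustration of the Dh score, the following sketch computes the mean pairwise Hamming distance over all n(n-1)/2 motif pairs. It assumes that all motifs in a set have the same length and are compared in a single orientation; the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("motifs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_distance(motifs):
    """Dh: sum of pairwise Hamming distances divided by n(n-1)/2."""
    pairs = list(combinations(motifs, 2))
    if not pairs:
        raise ValueError("need at least two motifs")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Three motifs give three pairs with distances 1, 2 and 1.
print(mean_pairwise_distance(["ACGT", "ACGA", "TCGA"]))
```

A set of identical motifs gives Dh = 0, and the score grows as the motifs in the set diverge.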
Simulation data construction for threshold determination
The procedure for constructing the simulation data includes 2 steps. In the first step, 1,000 sequence sets G1–G1000 are constructed, and each set Gi contains n sequences S1–Sn, where n = 2, 6 and 8 for the simulation of the bipartite begomovirus, BBTV and FBNYV genomes, respectively. Each sequence in the set (Sj) was constructed randomly. However, the length and GC content of each sequence in each set (Gi) were equal to those in each corresponding virus genome component.
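The first step can be sketched as follows. This is an assumption-laden illustration rather than the code used in the study: it fixes the number of G/C positions exactly to the rounded target, and the function names and the component lengths and GC contents in the usage example are hypothetical.

```python
import random


def random_sequence(length, gc_content, rng):
    """Random DNA sequence with the given length and (rounded) GC content."""
    n_gc = round(length * gc_content)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)  # scatter the G/C positions along the sequence
    return "".join(bases)


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def build_sequence_set(component_specs, rng):
    """One set G_i: one random sequence per genome component.

    component_specs is a list of (length, gc_content) tuples.
    """
    return [random_sequence(length, gc, rng) for length, gc in component_specs]


# Hypothetical lengths and GC contents for a six-component genome.
specs = [(1111, 0.42), (1060, 0.40), (1075, 0.41),
         (1043, 0.39), (1018, 0.40), (1089, 0.41)]
rng = random.Random(2013)
sequence_sets = [build_sequence_set(specs, rng) for _ in range(1000)]
print(len(sequence_sets), len(sequence_sets[0]))
```

Fixing the G/C count exactly, instead of drawing each base independently, keeps every simulated sequence at the target GC content regardless of sampling noise.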
In the second step, for each sequence set Gi in step 1, we randomly constructed 20 sets of motifs M1–M20, each of which contained n motifs m1–mn that were designed to have the Dh value d. Then, each mi of M1–M20 replaced a random subsequence that was located in Si and had the same length as mi. We also reversed the inserted motif randomly to reflect the sequence reversal events caused by recombination.
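A minimal sketch of the embedding step is given below. It assumes that "reversed" means the reverse complement of the motif (a plain reversal would be an equally simple variant), it places a single motif at a uniformly chosen position, and it does not prevent overlaps when several motifs are embedded in the same sequence; the function names and the `reverse_prob` parameter are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (assumed meaning of 'reversed')."""
    return seq.translate(_COMPLEMENT)[::-1]


def embed_motif(sequence, motif, rng, reverse_prob=0.5):
    """Replace a random subsequence of equal length with the motif.

    Returns (new_sequence, start, was_reversed). The sequence length
    is unchanged because the motif overwrites existing bases.
    """
    if len(motif) > len(sequence):
        raise ValueError("motif is longer than the sequence")
    start = rng.randrange(len(sequence) - len(motif) + 1)
    was_reversed = rng.random() < reverse_prob
    insert = reverse_complement(motif) if was_reversed else motif
    new_sequence = sequence[:start] + insert + sequence[start + len(motif):]
    return new_sequence, start, was_reversed


rng = random.Random(7)
seq, start, was_reversed = embed_motif("A" * 60, "CCGGT", rng)
print(start, was_reversed, seq[start:start + 5])
```

Recording the start position and orientation of each embedded motif makes it possible to check later whether the motif finder recovered it.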
The sequence construction in the second step was repeated 10,000 times with a Dh value d ranging from 0 to 4, 20 for the simulation of the bipartite begomovirus, BBTV and FBNYV genomes, respectively. We also constructed 10,000 supplementary sequence sets that were embedded with random motif sets (d = ∞) for contrast. Each simulation sequence of set G was submitted to MEME for motif detection with the variables listed in Table 4. To estimate the percentage of embedded motif sets (Mc) that could be detected accurately by MEME under different d values, we defined a normalized measure of the coverage ratio Cr as:

$$C_r = \frac{|M_c \cap M_d|}{|M_c|}$$

where Mc is the set of embedded motif sets and Md is the set of motif sets detected by MEME.
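One plausible reading of the coverage ratio is the fraction of embedded motif sets that also appear among the detected motif sets. The sketch below follows that reading and should be taken as an assumption: each motif set is represented as a frozenset of (component index, start position) pairs, and a detected set counts as accurate only if it matches an embedded set exactly. The matching criterion actually used in the study is not stated here, and the function name is hypothetical.

```python
def coverage_ratio(embedded_sets, detected_sets):
    """Fraction of embedded motif sets (Mc) found among detected sets (Md).

    Assumption: motif sets are hashable (e.g. frozensets of
    (component_index, start) pairs) and are matched exactly.
    """
    embedded = set(embedded_sets)
    if not embedded:
        raise ValueError("no embedded motif sets")
    return len(embedded & set(detected_sets)) / len(embedded)


embedded = [frozenset({(0, 10), (1, 40)}), frozenset({(0, 100), (1, 7)})]
detected = [frozenset({(0, 10), (1, 40)}), frozenset({(0, 55), (1, 60)})]
print(coverage_ratio(embedded, detected))
```

A tolerance on start positions could replace exact matching if detected motif boundaries are allowed to shift by a few nucleotides.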
The high-confidence motifs detected between genomes of bipartite begomovirus. The gray line represents the genome sequence, and common motifs detected by MEME are indicated by colored squares. Motifs located below the gray line indicate that the motifs are reversed, and the motifs belonging to the same set in the same genome are indicated in the same color.
Phylogram of the genome sequences of Banana bunchy top virus. (A–F) Phylograms of the maximum-parsimony trees based on the total nucleotide sequences (DNA-R, -U3, -S, -M, -C and -N) from Banana bunchy top virus (BBTV) geographic isolates (Australia, Taiwan, India, China, Egypt and Tonga, respectively). In the phylograms in E and F, the component sequences of the Egypt Kalubia strain (AF102784 and AF148139) were excluded from the phylogenetic analysis because they were not full-length sequences (see Table 1). Along the branches are the bootstrap supports of the maximum-parsimony and neighbor-joining methods; only values >70% are shown. For the BBTV integral components, the sequences were derived from isolates for which all of the integral-component sequences were available in GenBank.
Phylogenetic analysis of the genome of Banana bunchy top virus using motifs detected by MEME. (A) The index of specifically shared motifs derived from the Banana bunchy top virus (BBTV) isolates (Australia, India, Egypt, China and Tonga). The number represents the number of motifs that were shared between paired genome components. (B) The distance matrix calculated by SPRING represents the number of recombination steps necessary to change the motif order from that of one component genome to that of another.
Alignment of the stem-loop common region of Banana bunchy top virus. The sequence alignments of the stem-loop region derived from the Banana bunchy top virus (BBTV) genome components of the Pacific group (Australia, India Bihar, Egypt Kalubia) and the Asian group of isolates (Taiwan Type I, China Hainan). Identical and conserved sequences within the alignment are indicated in black and gray shadow, respectively.
The simulation result of Faba bean necrotic yellows virus (FBNYV). (A) The evaluation of the motifs detected in FBNYV using the information content (IC) is represented. (B) The distribution of the percentage of the motif sets detected from FBNYV-genome mimic sequences. (C) The percentage of motifs detected by MEME from motifs inserted in randomly generated sequences. The strategies are similar to Figure 4B, 4C and 4D in the analysis of BBTV.
Alignment of the stem-loop common region of Banana bunchy top virus. The sequence alignments of the stem-loop region derived from the Banana bunchy top virus (BBTV) genome components of Taiwan, India and Pakistan isolates.
Schematic diagram of the construction of a BBTV-genome coding region mimic sequence set.
Conceived and designed the experiments: HIW HHY. Performed the experiments: HIW CHC. Analyzed the data: HCF CYT HHY. Contributed reagents/materials/analysis tools: CHC PHL. Wrote the paper: HIW.
- 1. Nee S (1987) The Evolution of Multicompartmental Genomes in Viruses. Journal of Molecular Evolution 25: 277–281.
- 2. Chao L (1991) Levels of Selection, Evolution of Sex in RNA Viruses, and the Origin of Life. Journal of Theoretical Biology 153: 229–246.
- 3. Fu HC, Hu JM, Hung TH, Su HJ, Yeh HH (2009) Unusual Events Involved in Banana bunchy top virus Strain Evolution. Phytopathology 99: 812–822.
- 4. Hyder MZ, Shah SH, Hameed S, Naqvi SM (2011) Evidence of recombination in the Banana bunchy top virus genome. Infect Genet Evol 11: 1293–1300.
- 5. Stainton D, Kraberger S, Walters M, Wiltshire EJ, Rosario K, et al. (2012) Evidence of inter-component recombination, intra-component recombination and reassortment in banana bunchy top virus. J Gen Virol 93: 1103–1119.
- 6. Orlich M, Gottwald H, Rott R (1994) Nonhomologous recombination between the hemagglutinin gene and the nucleoprotein gene of an influenza virus. Virology 204: 462–465.
- 7. Martin DP, Biagini P, Lefeuvre P, Golden M, Roumagnac P, et al. (2011) Recombination in eukaryotic single stranded DNA viruses. Viruses 3: 1699–1738.
- 8. Sztuba-Solinska J, Urbanowicz A, Figlerowicz M, Bujarski JJ (2011) RNA-RNA recombination in plant virus replication and evolution. Annu Rev Phytopathol 49: 415–443.
- 9. Bruyere A, Wantroba M, Flasinski S, Dzianott A, Bujarski JJ (2000) Frequent homologous recombination events between molecules of one RNA component in a multipartite RNA virus. J Virol 74: 4214–4219.
- 10. Lefeuvre P, Lett JM, Varsani A, Martin DP (2009) Widely conserved recombination patterns among single-stranded DNA viruses. J Virol 83: 2697–2707.
- 11. Duffy S, Holmes EC (2008) Phylogenetic evidence for rapid rates of molecular evolution in the single-stranded DNA begomovirus Tomato yellow leaf curl virus. Journal of Virology 82: 957–965.
- 12. Harkins GW, Martin DP, Duffy S, Monjane AL, Shepherd DN, et al. (2009) Dating the origins of the maize-adapted strain of maize streak virus, MSV-A. J Gen Virol 90: 3066–3074.
- 13. Grigoras I, Timchenko T, Grande-Perez A, Katul L, Vetten HJ, et al. (2010) High variability and rapid evolution of a nanovirus. J Virol 84: 9105–9117.
- 14. Roossinck MJ (1997) Mechanisms of plant virus evolution. Annu Rev Phytopathol 35: 191–209.
- 15. Posada D (2002) Evaluation of methods for detecting recombination from DNA sequences: Empirical data. Molecular Biology and Evolution 19: 708–717.
- 16. Posada D, Crandall KA (2002) The effect of recombination on the accuracy of phylogeny estimation. Journal of Molecular Evolution 54: 396–402.
- 17. Sankoff D, Cedergren R, Abel Y (1990) Genomic Divergence through Gene Rearrangement. Methods in Enzymology 183: 428–438.
- 18. Sankoff D (1992) Edit Distance for Genome Comparison Based on Nonlocal Operations. Lecture Notes in Computer Science 644: 121–135.
- 19. Lin YC, Lu CL, Chang HY, Tang CY (2005) An efficient algorithm for sorting by block-interchanges and its application to the evolution of vibrio species. J Comput Biol 12: 102–112.
- 20. Das MK, Dai HK (2007) A survey of DNA motif finding algorithms. BMC Bioinformatics 8 Suppl 7: S21.
- 21. Albuquerque LC, Inoue-Nagata AK, Pinheiro B, Resende RO, Moriones E, et al. (2012) Genetic diversity and recombination analysis of sweepoviruses from Brazil. Virol J 9: 241.
- 22. Martin DP, Posada D, Crandall KA, Williamson C (2005) A modified bootscan algorithm for automated identification of recombinant sequences and recombination breakpoints. AIDS Res Hum Retroviruses 21: 98–102.
- 23. Posada D, Crandall KA (2001) Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A 98: 13757–13762.
- 24. Padidam M, Sawyer S, Fauquet CM (1999) Possible emergence of new geminiviruses by frequent recombination. Virology 265: 218–225.
- 25. Smith JM (1992) Analyzing the mosaic structure of genes. J Mol Evol 34: 126–129.
- 26. Martin D, Rybicki E (2000) RDP: detection of recombination amongst aligned sequences. Bioinformatics 16: 562–563.
- 27. Gibbs MJ, Armstrong JS, Gibbs AJ (2000) Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences. Bioinformatics 16: 573–582.
- 28. Boni MF, Posada D, Feldman MW (2007) An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics 176: 1035–1047.
- 29. Martin DP, Lemey P, Lott M, Moulton V, Posada D, et al. (2010) RDP3: a flexible and fast computer program for analyzing recombination. Bioinformatics 26: 2462–2463.
- 30. Bailey TL, Williams N, Misleh C, Li WW (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research 34: W369–W373.
- 31. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37: W202–208.
- 32. Tao X, Zhou X (2008) Pathogenicity of a naturally occurring recombinant DNA satellite associated with tomato yellow leaf curl China virus. J Gen Virol 89: 306–311.
- 33. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2008) GenBank. Nucleic Acids Research 36: D25–D30.
- 34. Gronenborn B (2004) Nanoviruses: genome organisation and protein function. Vet Microbiol 98: 103–109.
- 35. Burns TM, Harding RM, Dale JL (1995) The Genome Organization of Banana Bunchy Top Virus - Analysis of 6 ssDNA Components. Journal of General Virology 76: 1471–1482.
- 36. Harding RM, Burns TM, Hafner G, Dietzgen RG, Dale JL (1993) Nucleotide sequence of one component of the banana bunchy top virus genome contains a putative replicase gene. J Gen Virol 74 (Pt 3): 323–328.
- 37. Niu S, Wang B, Guo X, Yu J, Wang X, et al. (2009) Identification of two RNA silencing suppressors from banana bunchy top virus. Arch Virol 154: 1775–1783.
- 38. Horser C, Harding R, Dale J (2001) Banana bunchy top nanovirus DNA-1 encodes the ‘master’ replication initiation protein. J Gen Virol 82: 459–464.
- 39. Hafner GJ, Stafford MR, Wolter LC, Harding RM, Dale JL (1997) Nicking and joining activity of banana bunchy top virus replication protein in vitro. J Gen Virol 78 (Pt 7): 1795–1799.
- 40. Hanley-Bowdoin L, Settlage SB, Orozco BM, Nagar S, Robertson D (2000) Geminiviruses: models for plant DNA replication, transcription, and cell cycle regulation. Crit Rev Biochem Mol Biol 35: 105–140.
- 41. Karan M, Harding RM, Dale JL (1994) Evidence for two groups of banana bunchy top virus isolates. J Gen Virol 75 (Pt 12): 3541–3546.
- 42. Hu JM, Fu HC, Lin CH, Su HJ, Yeh HH (2007) Reassortment and concerted evolution in Banana bunchy top virus genomes. Journal of Virology 81: 1746–1761.
- 43. Lin YC, Lu CL, Liu YC, Tang CY (2006) SPRING: a tool for the analysis of genome rearrangement using reversals and block-interchanges. Nucleic Acids Research 34: W696–W699.
- 44. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947–2948.
- 45. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596–1599.
- 46. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, et al. (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28: 2731–2739.
- 47. Swofford DL (1993) PAUP - a Computer-Program for Phylogenetic Inference Using Maximum Parsimony. Journal of General Physiology 102: A9–A9.
- 48. Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36.
- 49. Bailey TL, Gribskov M (1998) Methods and statistics for combining motif match scores. Journal of Computational Biology 5: 211–221.
- 50. Bailey TL, Gribskov M (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14: 48–54.
- 51. Hamming RW (1950) Error Detecting and Error Correcting Codes. Bell System Technical Journal 29: 147–160.