An Evolutionary Analysis of the Secoviridae Family of Viruses

The plant-infecting Secoviridae family of viruses forms part of the Picornavirales order, an important group of non-enveloped viruses that infect vertebrates, arthropods, plants and algae. The impact of the secovirids on cultivated crops is significant, infecting a wide range of plants from grapevine to rice. The overwhelming majority are transmitted by ecdysozoan vectors such as nematodes, beetles and aphids. In this study, we have applied a variety of computational methods to examine the evolutionary traits of these viruses. Strong purifying selection pressures were calculated for the coat protein (CP) sequences of nine species, although for two species evidence of both codon specific and episodic diversifying selection were found. By using Bayesian phylogenetic reconstruction methods CP nucleotide substitution rates for four species were estimated to range from between 9.29×10−3 to 2.74×10−3 (subs/site/year), values which are comparable with the short-term estimates of other related plant- and animal-infecting virus species. From these data, we were able to construct a time-measured phylogeny of the subfamily Comovirinae that estimated divergence of ninety-four extant sequences occurred less than 1,000 years ago with present virus species diversifying between 50 and 250 years ago; a period coinciding with the intensification of agricultural practices in industrial societies. Although recombination (modularity) was limited to closely related taxa, significant and often unique similarities in the protein domains between secovirid and animal infecting picorna-like viruses, especially for the protease and coat protein, suggested a shared ancestry. We discuss our results in a wider context and find tentative evidence to indicate that some members of the Secoviridae might have their origins in insects, possibly colonizing plants in a number of founding events that have led to speciation. Such a scenario; virus infection between species of different taxonomic kingdoms, has significant implications for virus emergence.


Introduction
The Picornavirales order contains viruses that infect a wide range of eukaryotic organisms including vertebrates (Picornaviridae), arthropods (Dicistroviridae), plants (Secoviridae), insects (Iflaviridae) and algae (Marnaviridae) [1]. Members of the Picornaviridae include rhinoviruses, poliovirus, Foot-and-mouth disease virus, and Hepatitis A virus. The plant-infecting members of the Picornavirales are in the Secoviridae family. Most secovirid species fall within the Comovirinae subfamily which contains the Nepovirus, Comovirus and Fabavirus genera [2]. However, in the last decade or so, a number of novel more distantly related viruses have been characterized; these include Apple latent spherical virus (ALSV) [3], Cherry rasp leaf virus (CRLV) [4], Satsuma dwarf virus (SDV) [5], Strawberry mottle virus (SMoV) [6], Strawberry latent ringspot virus (SLRSV) [7] and Tomato torrado virus (ToTV) [8]. The agronomic importance of members of the Secoviridae is significant: Grapevine fanleaf virus (GFLV) is the oldest and most widespread viral disease to affect grapevine, being first documented in 1865 [9], while rice tungro disease, caused by a combination of two viruses, one of which is the secovirid Rice tungro spherical virus (RTSV) emerged in the 1960s to seriously disrupt rice production in Asia [10]. More recently multiple members of the newly described Torradovirus genus show signs of emergence in tomatoes [11,12]. Their present impact on a wide range of agronomically important crops combined with their continuing emergence means that understanding the Secoviridae from an evolutionary perspective will enhance our ability to develop adequate control strategies against present and future threats.
Members of the Picornavirales are all characterized as having a positive-sense single-stranded RNA genome with a conserved module incorporating a superfamily 3 helicase (HEL), a chymotrypsin-like protease (PRO) and RNA-dependent RNA polymerase (RdRp) functions. Large exons are proteolytic processed via post-translational cleavage into discrete structural and/or functional proteins. Virus particles are non-enveloped, icosahedrons of around 30 nm in diameter made up of 60 capsomers each containing three jelly-roll domains. Except for members of the Secoviridae, of which almost all are bipartite, viruses in the order Picornavirales have non-segmented genomes. In the vast majority of cases the viral RNA is polyadenylated at the 3' terminus. A small virus-encoded protein (VPg), predicted for most of the species, has been shown to be covalently attached to the 5' terminus [1]. The coat protein (CP) of the Secoviridae consists of either one, two or three cleaved peptides, depending on the genus or species [2,13]. Upstream of the CP is the movement protein (MP) which is required for cell-to-cell movement, although its biological function has only been verified for a small number of viruses. There exists upstream of the HEL and some MP functional domains regions of low levels of conservation. Except for the Nepovirus and Comovirus genera the functions of these regions remain unknown [13].
Beyond a number of reports measuring selection pressures and detecting specific examples of recombination there is limited information on the evolution of the Secoviridae. High levels of purifying selection for the CP and MP have been calculated for GFLV strains isolated from California [14]. RTSV strains taken from endemic regions in South East Asia also exhibited the characteristics of purifying selection for the CP [15]. Equally, an analysis of 30 global isolates of Broad bean wilt virus-2 (BBWV-2) found evidence of strong purifying selection exerted on four functional domains including the CP and MP [16]. Recombination in GFLV appears to have occurred both at an intraspecific level and with the closely related Arabis mosaic virus (ArMV), at an interspecific level [17][18][19][20][21]. This intimate relationship is believed to have spawned Grapevine deformation virus which appears to be a mosaic between GFLV and ArMV [22]. For the comovirus Bean pod mottle virus (BPMV) Zhang et al. [23] identified a naturally occurring partial diploid reassortant strain that contained recombinant sequences derived from different BPMV strains and which could be replicated by coinfection and passaging. Recombinants have also been experimentally generated between nepoviruses Tobacco black ring virus (TBRV) and Grapevine chrome mosaic virus (GCMV) [24]. Beyond the Comovirinae sequence analyses of certain functional domains have identified a chimeric-like composition to secovirid genomes that extends outside of the Secoviridae. The CPs of CRLV and ToTV are reported to have highest identities with non-plant infecting members of the Picornavirales [4,25]. In this study, by using a comparative genomics approach we set out to identify the tendencies underlying the genetics and evolution of the entire Secoviridae family. The results obtained help define the Secoviridae as an emerging group of viruses that we postulate in some cases originated in insects.

Sequence data collection and preparation
The full-length nucleotide sequences of 27 secovirids were downloaded from the NCBI GenBank database (http://www. ncbi.nlm.nih.gov) ( Table 1). Selection was based on those species that were described as full-length and were sufficiently annotated to allow for the majority of genes to be identified and delineated as discrete sequences. This therefore precluded the initial selection of partial or insufficiently annotated sequences (eg. Raspberry ringspot virus). Full-length sequences were analyzed in Vector NTI (Invitrogen) with each open reading frame separated into its putative functional domains -Protease co-factor (ProCo), Helicase (HEL), Viral protein genome-linked (VPg), Protease (PRO), RNAdependent RNA-polymerase (RdRp), Movement protein (MP) and Coat protein (CP) -as determined by reported cleavage sites, determined either experimentally or in silico. Since for some species there was functional ambiguity and/or additionally reported in-frame domains at the N-termini of ProCo and MP sequences, designation as the 1N(ProCo) and 2N(MP) domains, respectively, was more appropriate and adopted throughout this study (Fig.1). Because of its unusual position and unknown function, the frame-shifted 5' domain identified in torradoviruses was not considered for analysis [12]. Individual CP sequences were selected from the Genbank database on the basis of whether a definitive year of isolation could be associated either by means of the annotation or from a cited publication.

Alignments, selection pressures, nucleotide substitution rates
Nucleotide sequences of the seven identified functional domains -1N(ProCo), HEL, VPg, PRO, RdRp, 2N(MP) and CP were aligned using the MUSCLE [26] codon align option in the MEGA4.0 suite [27]. Each alignment was screened for significant breakpoint signals using the GARD program [28] with standard default settings. Saturation of all alignments was assessed using DAMBE [29]. Global selection pressures were analyzed by determining the ratio of non-synonymous to synonymous (d N / d S ) nucleotide substitutions using the Single-Likelihood Ancestor Counting (SLAC) method of the HYPHY program [30]. Evidence for episodic diversifying selection was screened using the branchsite REL method [31], at the datamonkey website (http://www. datamonkey.org/). Both SLAC and FUBAR [32] methods were used to identify codon specific selection. For SLAC and REL methods input neighbor-joining trees, generated within the programs themselves, were used with each alignment. For the former, a GTR substitution model -determined as the optimal for every alignment using JModeltest [33] -was used calculating 95% confidence intervals for a x 2 distribution. For the latter, alignments that generated trees with a single branch where episodic diversifying selection was identified were only considered positive if the same branches were identified during a re-run with maximum likelihood input parameters. Nucleotide substitution rate estimates were determined using BEAST 1.7 [34,35]. The optimal substitution model for each alignment, identified using JModeltest [33], was used to compare against the simpler default HKY substitution model in the BEAUTi program in combination with different priors for the molecular clock (strict, lognormal relaxed, exponential relaxed) and tree (Bayesian skyline, exponential and constant population) all with codon partitioning. All sequences were prior 'tip-dated' with their year of isolation reported in the NCBI GenBank submission. BEAST results were analyzed in the Tracer program, each alignment being compared pairwise to obtain Bayes Factors and the marginal likelihood for each prior combination; the analysis with the largest marginal likelihood being selected as the best (results available on request). Finally, to assess the temporal signal, the best analyses were then compared using an identical BEAST run but with the 'tip-dates' randomized. BEAST runs were continued until all relevant parameters converged -discarding 10% of MCMC chains as burn-in. A highest density probability (HPD) of 95% was used to assess statistical significance.

Trees
Maximum clade credibility trees with statistical support in the form of Bayesian posterior probabilities at each node were generated using BEAST and processed in the programs TreeAnnotator and FigTree. For maximum likelihood (ML) trees the MUSCLE alignments of each of the seven identified functional domains -1N(ProCo), HEL, VPg, PRO, RdRp, 2N(MP) and CP and their respective optimal substitution models, as determined by JModeltest, were analyzed in PAUP* [36] using a heuristic search with 100 bootstrap replicates.

Detection of recombination
Evidence for recombination prior to calculating selection pressures was carried out, as mentioned above, using GARD [28]. The GARD algorithm initially works by looking for the number and placement of breakpoints that yields the best Akaike Information Criteria (AICc); a measure of the goodness of fit. However, any improvement in model fit could be due to a number of factors (eg. spatial rate variation, heterotachy) other than a change in the tree topology, which would be the primary signature for recombination. GARD subsequently checks for tree congruence either side of the putative breakpoint using the Kishino-Hasegawa (KH) test [37]. For calculating selection pressures therefore detection of significant incongruity in an alignment was adjusted prior to analysis. Evidence of putative local recombination events more specific to individual viruses was determined using the RDP3 software [38] in a stepwise fashion as previously described [39]. This involved realigning subsets of viral sequences that were determined monophyletic in maximum likelihood inferred trees, thereby reducing the amount of sequence noise and the possibility of spurious results due to misalignment artefacts. Putative recombinants were then realigned with their parental donors and reassessed in isolation for the predicted recombination event in RDP3 followed by a further check in GARD. The selection criteria for recombinants was set at p,0.05 by KH-testing in GARD and three or more methods with p,0.05 in the RDP suite.

Protein homologue searches and directionality of colonization
Protein sequences of one member of a monophyletic virus group that was consistently identified in three or more ML trees were used to search for the closest homologues within picorna-like viruses. A first screen was done using the Blast algorithm with setting to exclude the input virus species. A Position-Specific Scoring Matrix (PSSM) was then generated for each protein using PSI-BLAST (position-specific iterative basic alignment search tool) with no exclusion [40]. All significant (picorna-like virus) hits were recorded. Searches were made using individual virus sequences at an inclusion threshold of 0.01. Iterations were continued for a maximum of five. The same approach was used for an 'inverse' PSI-BLAST search for determining the host distribution of known related sequences using the most significant (lowest e-value) animal infecting viral protein that was identified as being significantly homologous. Virus species were identified at two levels using definitions provided at the NCBI taxonomy (http://www.ncbi. nlm.nih.gov/taxonomy) and ICTV (http://ictvonline.org/ virusTaxonomy.asp) databases. An e-value of 10 was used as the threshold if the hit was with a viral functional homolog.

Results
In order to gain an understanding of the evolutionary tendencies of the Secoviridae family we began by dividing our analyses into the two principal components of virus evolution, 1) mutation (or associated elements: selection pressures and substitution rates) and, 2) recombination (or modularity).

Strong negative selection pressures are evident for all species' CPs
Prior to analysis all alignments were checked for recombination using the GARD algorithm [41] and any recombinant molecules removed from the alignment (Table S1 and File S1-File S9). The coat protein dataset was divided into two formats for analysis; fulllength (ArMV, BPMV and RTSV), corresponding to discrete domain sequences with predicted cleavage product ends and shorter partial sequences (Table S2). Differences between the d N / d S ratios were not statistically significant (overlapping confidence intervals of 0.031-0.050, 0.030-0.058 and 0.033-0.053, respectively), with evidence of strong purifying selection (0.039, 0.043, 0.042, respectively) ( Table 2). For partial CP sequences d N /d S ratios were more variable (ranging from 0.014 to 0.088), but were in-line with the full-length ratios, averaging 0.045. The total percentage of negatively selected codons varied depended on the method used, SLAC or FUBAR, the latter identifying many more. FUBAR also identified some positively selected codons where SLAC found none. These variations are attributed to the documented power and sensitivities of the methods employed [32]. In most cases the proportion of negatively selected sites detected by FUBAR was close or more than 50% (ArMV 60%, BPMV 35%, RTSV 60%, BBWV2 93%, (Blackcurrant reversion virus) BRV 49%, GFLV 71%, SMoV 30%, Tomato ringspot virus (ToRSV) 61%, Tobacco ringspot virus (TRSV) 15%). Lower values for BPMV, SMoV and TRSV were also found using SLAC, and are reflected in the lower diversities obtained. The shorter sequences used for SMoV and TRSV might also be a reason for these differences. In addition to a global and codon-based calculations of selection pressures, we tested the alignments for the possibility of lineage specific variations in selection pressure, and were able to find significant (p,0.05) evidence of episodic diversifying selection for two isolates of RTSV ((accession number underlined) U70989 p = 7.01610 26 , U71440 p = 7.93610 23 ), and one isolate each for GFLV (EU702441 p = 6.10610 23 ) and ToRSV (AF135414 p = 5.80610 25 ). Removal of these isolates from the d N /d S analyses did not significantly alter the original values obtained. Significantly, FUBAR was also able to identify positively selected codons, one and two respectively, for both GFLV and RTSV.

Coat protein nucleotide substitution rates
The substitution rates of the CP of nine species were estimated using BEAST. We focused our attention on the CP as this domain provided the most sequences to allow for comparison and to provide an informative range of potential values. Our approach was to screen 'time-stamped' alignments [42] of sequences using a combination of priors and select the best model by means of Bayes Factor (BF) comparisons. Two prior models for nucleotide substitution were compared; the simpler Hasegawa, Kishino and Yano (HKY85) model and the optimal model for each alignment selected previously by JModeltest [33]. Using this approach yielded rate estimates with relatively narrow 95% HPD ranges, with the lowest value being 7.13610 25 for Broad bean wilt virus-2 (BBWV2); a value of #1610 27 being proposed as an indicator of unreliability [43] (Table 3; Fig 2). The most favored clock and tree priors for most species was exponential with the JModeltest selected model being preferred in all cases. Estimated rates for all CPs ranged over one order of magnitude (7.22610 22 -1.21610 23 ). However, the values for BRV and SMoV were taken to be unreliable because of the limited date range collection period (,10 years). The strength of the temporal signal in the sequences was tested by comparing 'time-stamped' and randomized datasets. In all cases, except TRSV, the 95% HPD range was greater in the randomized data than in the 'time-stamped' data; the former ranging from 3 to 5 orders of magnitude compared to 1 to 3 for the former. In six cases (ArMV, BRV, GFLV, RTSV, SMoV and ToRSV) the 95% HPD values of the randomized data fell outside the mean rates estimated for the 'time-stamped' data, suggesting the differences were significant and that the temporal structure was sufficient for reliable rate estimation. A temporal structure was also supported (larger R 2 values in time-stamped trees) for the majority of species in regression analyses (Table S3). Removal or inclusion of sequences with recombinant signals from the alignments, as was the case for ArMV, BBWV2 and GFLV, did not alter these findings.

TMRCAs of the Comovirinae
Based on the substitution rate estimates obtained we wanted to gain an understanding of the timescale for the emergence of secovirids and their evolutionary history. To do this we employed two approaches: 1) defining the calibration nodes using the predetermined TMRCAs for each virus species in conjunction with a Yule process (a simple model of speciation more appropriate for sequences from different species) tree prior, and 2) using the most conservative reliable mean estimate of the nucleotide substitution rate calculated for the CP data (ToRSV 2.74610 23 ) as the rate prior in time-stamped alignments. Clock priors were varied (strict, exponential and lognormal) with the optimum combination of priors being selected using Bayes factors in the BEAST-associated program Tracer. Our initial attempts to generate a full tree for all members of the Secoviridae were unsuccessful due to nucleotide saturation. As a result, by focusing specifically on the Comovirinae subfamily, we were able to generate a coalescent tree using both a lognormal relaxed (Fig 3) and a strict clock (Fig S1), the latter having the larger marginal likelihood (Bayes factor 82.8). For the lognormal tree the root was calculated at 726 years (502-962) with the nepovirus genus TMRCA at 621 years (423-841) and the fabavirus/comovirus TMRCA at 597 years (330-852). For the strict clock tree these values were 944 (832-1063), 893 (777-1014) and 878 (751-1010), respectively. For either clock method individual species have TMRCAs close to or below 100 years; the two oldest virus TMRCA lineages being ToRSV (112 (log) and 108 (strict) years) and BBWV2 (107 (log) and 99 (strict) years), with ArMV and GFLV separating some 200 years ago.

Detection of genetic exchange is limited
The existence of potential chimeric-like sequences [4,25] and reports of the evidence of recombination in secovirids in this work and others [17,18,23,24] led us to look for evidence of recombination across the whole genome. Initially, we wanted to use the recombination algorithms (RDP3 and GARD) described above to screen alignments of full-length sequences, firstly using the RNA sequence so as to include the untranslated regions (UTR) and secondly, using the codons of open reading frames. Given the computational demands involved in processing and the inability of the programs used to align according to the known functional domains we opted to individually align each discrete functional domain (1N(ProCo), HEL, PRO, RdRp, 2N(MP) and CP) predicted as a post-proteolytic cleavage product. From these alignments were generated ML trees for six of the functional domains (Fig.4); the VPg was omitted from further analyses because of the failure to generate a ML tree, as were the 5' and 3'UTR. The ML-derived cladograms shown represent the most conservative statistically significant topology for each domain. When analyzed as a whole, all alignments, except for the RdRp, showed significant saturation. Saturation was however within acceptable limits when analyzing alignments that contained only those sequences representative for each significant branch, as determined by the bootstrap values. As we were only looking for incongruences within these branches for each functional domain, saturation due to the presence of collapsed branches were deemed inconsequential. The most structured trees are those for the RdRp, HEL and CP; the Comovirinae are monophyletic only for the RdRp separating into two distinctively separate clades containing the nepoviruses in one and the comoviruses/fabaviruses in the other -for the HEL and CP. Notably, there is no significant topological difference between the trees that would be indicative of recombination or a modular genome makeup except for a collapse of clades to below the significance level probably due to saturation and an inability to align those sequences due to extreme divergence. Such a situation could be result of higher mutation rates, genetic drift and/or recombination. If recombination had occurred then failure to detect it could either be because of nucleotide substitution saturation in the alignment or because the hypothetical parental donor(s) is external to the Secoviridae. Instead, what is seen are relatively consistent similarities throughout with monophylies for the non-Comovirinae species for at least three domains with the ALSV/CRLV (cheraviruses), BRNV/ SMoV/SDV, ToTV/ToMarV (torradoviruses) and MCDV/ RTSV (waikaviruses).
Using the interspecific domain alignments described above neither GARD nor RDP3 found significant evidence of recombination. We were also unable to demonstrate any significant change in phylogenetic similarities that might be indicative of recombination by comparing ML trees. Another approach to identifying recombinants, particularly for alignments of relatively divergent viruses is to reduce noise due to potential misalignment by aligning only those sequences with higher identities [39]. To aid in this process we progressively aligned gene sequences according to their groupings in the ML trees (Fig.4) moving from larger to smaller clades and then screened them by RDP3 with a final confirmation using GARD. Here too, no significant recombination event was detected (not shown). At an intraspecific level when determining selection pressures, however, recombinants were detected for ArMV, BBWV2 and GFLV (Table S1).    Homologies between secovirus and animal virus protein domains are significant and often unique To further examine the possible reasons for low similarities between specific virus lineages in the ML analyses, we focused our attention on protein sequences and extended our search to include all sequences in the Genbank database. By searching for distantly related protein sequences using both the BLAST and PSI-BLAST algorithms we were able to identify putative picorna-like homologs for each functional domain. Based on the ML tree data we concentrated on non-Comovirinae viruses, choosing a representative virus for each apparent monophyly (eg. CRLV (representing an ALSV and CRLV clade), SMoV (representing a BRNV, SDV, SMoV clade), SLRSV alone, and ToTV (representing a ToTV, ToMarV clade) and RTSV (representing the monopartite clade of MCDV PYFV, and RTSV)) to initiate the search and generate a first Position-Specific Scoring Matrix (PSSM). Use of another representative from the same clade consistently produced similar hits to the same viruses. We principally focused on the nonreplication association genome domains as these were the most divergent. The results for the 1N(ProCo), PRO, 2N(MP) and CP showed strikingly different associations for each gene (Fig 5; Table  S4). For example, taking the 1N(ProCo) results, SLRSV had no significant homolog in the database, CRLV had similarity only to ALSV, and SMoV had highest homology to BRNV and ToTV, but also significant homology to a number of members of the Comovirinae (Fig 5; Table S4). In contrast, for the PRO domain almost all similarities detected were outside of the Secoviridae; SMoV having homologies to Iflaviridae and Picornaviridae proteases. CRLV PRO has significant similarity, using BLAST, with the NIa-Pro protein of the potyvirus Keunjorong mosaic virus. A common ancestry for the PRO domains in potyviruses is further supported by homology (BLAST) of the SLRSV PRO to the NIa-Pro of Potato virus V. A similar case of associations is found for the CP (Fig 5, Table S4) with no homology found with members of the Comovirinae. CRLV, RTSV and ToTV lineages all have high CP similarities with insect and vertebrate infecting members of the Picornavirales, while SMoV shows similarity to the waikavirus MCDV and two calicivirids, all of which are monopartite. Predictably, similarities for the MP are found mainly within the Secoviridae. CRLV and SLRSV have similarities to nepovirus MPs, SMoV is more isolated in its similarities, while ToTV surprisingly has PSI-BLAST hits with movement proteins from another plant virus family, namely, the Umbraviridae, reinforcing its currently tentative assigned function as a movement protein [12]. BLAST analyses of the RdRp and HEL domains of the above virus lineages and for the three type members of the Comovirinae genera revealed significant interfamily similarities, along with strong (lower e-values) similarities to animal-infecting picorna-like viruses (Table S5).

Directionality of host colonization
The homologies identified for the CP and PRO domains is evidence of an ancestral link with animal infecting picorna-like viruses. The existence within the predominantly bipartite Secoviridae family of a small group of monopartite viruses (the waikaand sequiviruses) that share a higher similarity to monopartite animal-infecting viruses (Fig S2) than their bipartite relatives could suggest that viruses like RTSV are recent arrivals to the plant kingdom. The remnants of positive selection in the CP of RTSV (Table 2) supports this notion. To tentatively examine this hypothesis, we looked at the host (animal or plant) distribution of known related sequences for the CP and PRO within the simple model that the number of related sequences (per species) would be directly proportional to time within the host. (Fig 6a). Such a  Table 3 model requires a number of assumptions to be made (e.g. a randomly sampled population, similar rates of evolution) that could significantly alter the outcome. Therefore our interpretation should be limited to identifying a distribution that refutes the above proposed model. The results (Fig 6c and Table S6) show at least for those lineages (CRLV, RTSV, SMoV and ToTV) outside the Comovirinae, where the association with animal infecting picorna-like viruses is more evident, that there are no cases where the number of plant virus species' sequences is larger than those of animal virus species' sequences and in the majority of cases (six out of seven -no animal homolog of the ToTV PRO was identified) there are over five times as many animal virus species. For the Comovirinae type members (BBWV1, CPMV and TRSV) the distribution is varied, with not a single defined host population exceeding double that of the corresponding population in the other host.

Discussion
All coat protein domains tested were found to be under extreme purifying selection, an observation already reported for members of the Secoviridae [14][15][16] and other plant-infecting RNA viruses [44][45][46][47]. It has been suggested that such extreme purifying selection is a result of selection constraints imposed by specific molecular interactions during vector transmission [47,48]. The technique used to calculate the d N /d S ratios itself could also possibly overestimate the degree of purifying selection by failing to detect directional selection [49]. This possibility was tested using the branch-site REL method [31] and at least for some viruses (RTSV, GFLV, ToRSV) was shown to occur, implicating episodic diversifying selection in the evolution of certain secovirid lineages, as has been recently demonstrated for a number of plant virus genes [50,51]. In addition, analysis using the FUBAR algorithm, we were able to detect positive selection on individual codons for both RTSV and GFLV. These results therefore suggest that despite an overall strong purifying selection observed for all CPs some viruses are still undergoing restricted positive selection due to pressures to adapt to novel environments, such as new host or vector variants.
Calculations of the nucleotide substitution rates for the same species' coat proteins obtained values ranging from 9.29610 203 to 2.74610 203 . This compares with 1.61610 23 to 5.73610 23 for the VP1 of enteroviruses (Picornaviridae) [52], 6610 24 to 3.5610 22 for the CP of the Luteoviridae [43], and 1.17610 23 for the CP of Rice yellow mottle virus (RYMV) [53], and falls within the general range expected for ssRNA viruses [54]. However, caution is required in interpreting these data; shorter sampling periods can potentially elevate substitution rate estimates [54], a phenomenon  aptly illustrated for BRV and SMoV. The longest sampling period in this study was 20 years for GFLV, compared to 62 for enteroviruses [52], 91 for the Luteoviridae [43], and 40 for RYMV [53]. Therefore, although our estimates here may be correct for extant sequences the continual birth and death of lineages [49,55] means that inferring long-term estimates is fraught with difficulties [56]. Attempts to produce a coalescent tree for the whole Secoviridae family proved inconsistent due to nucleotide saturation. However, by focusing only on members of the subfamily Comovirinae, using both lognormal relaxed and strict clock models with the most conservative substitution rate obtained at the species level (2.74610 23 ), we were able to generate reliable (converged) trees rooted at 726 and 944 years ago, respectively. Species divergence of all extant members of the Comovirinae therefore is estimated to have occurred in the past 500 years. The first records of fanleaf, the oldest known grapevine virus disease, date back to 1865 [9]. ArMV, its closest relative, first reported in 1963 [9], also invariably causes fanleaf-like symptoms. Its divergence from GFLV is estimated here to have occurred around 200 years ago. Divergence estimates for the Comovirinae genera coincide with limited resolution at the root of both trees for the whole subfamily. Aside from the effects of heterotachy, there is also the possibility of underestimating the age of the Comovirinae due to the replace-ment of older viral lineages by younger ones [57]. Such a processextinction and reinfection -would have to be acting across the entire phylogeny, which is less plausible for a group of viruses whose host range (wide to restricted) and type (annuals to woody perennials) is highly variable. Comparisons with other studies in plant RNA viruses using the same heretochronous sampling approach are variable in their predictions. For the Luteoviridae family the TMRCA was estimated at around 2,000 years ago, with the Polerovirus and Luteovirus genera TMRCAs falling around 1,000 years ago [43]. The radiation of Potyviridae family has been estimated around 6,600 years ago [58], while for the Sobemovirus genus diversification is calculated to have occurred some 3,000 years ago [53]. In our study estimates at a family level were not possible due to the limited data available. However, the genus TMRCAs determined here are not unlike those obtained for the Luteoviridae genera, and all extant secovirid species show a recent (.500 years) emergence in-line with the above reports and coinciding with both modern agriculture and the escalation of marine trade.
Maximum likelihood cladograms generated for each secovirid functional domain were able to identify a number of statistically unrelated lineages that pointed to a possible origin of these monophyletic groups that lay outside the Secoviridae family. As a result, we broadened our search using both BLAST and the PSI-BLAST algorithms; the latter looks for more distant protein homologies by pooling conserved three-dimensional protein structural information [40,59]. The results obtained (Fig.5) clearly indicate shifting similarities of the genes analyzed, suggesting distinct evolutionary origins for each lineage. There appear to be three classes of functional domain: i) plant-ancestral domainsnamely, the N-terminal proteins 1N(ProCo) and the 2N(MP)which only have detectable similarities to other plant viruses, ii) the animal-ancestral domains -namely, the PRO and CP -which overwhelmingly have similarities with animal-infecting picornalike viruses, and iii) the replication-associated domains -namely the HEL and RdRp -which have closer (lower e-values) similarities to other members of the Secoviridae, but also detectable similarities with animal-infecting picorna-like viruses. Such an arrangement produces a mosaic of viral functional domains that fits neatly into the proposed hypothesis that the Pro and CP, and the HEL and RdRp, are respective paralogous duplications of ancient proto-CP and proto-Rep genes [60]. In contrast, the plant-ancestral domains appear to represent proteins that might have been acquired while evolving in the plant. The origin of plant virus movement proteins, including those of some secovirids, is postulated to be plant-chaperon-like proteins, such as the heat-shock proteins [61].
Based on three lines of evidence, namely 1) the existence of a minority of monopartite viruses within the Secoviridae, 2) a higher similarity of these monopartite viruses to monopartite animalinfecting viruses, and 3) evidence of positive selection in the CP of a monopartite virus, it seems the most parsimonious hypothesis for host-switching, if we consider the Secoviridae as a whole, is from animal-to-plant. In order to explore this hypothesis further we used the Genbank database to analyze the distribution of related virus species according to their host (at the taxonomical rank of kingdom) on the simple assumption that the degree of speciation will be directly proportional to the age of the virus population. Clearly, there are number of caveats to this approach, including the artificial composition of the sampled population, and the possibility of different rates of evolution and extinction, and unrecognized selection pressures (for example the immune system of animals could, in theory, result in a higher selection pressure when compared to plant hosts, although the calculated selection pressures for the CP of picornaviruses [52] are as comparably low as those found in this study). To try and address the artificial composition of the sample, we examined the number of ICTV (International Committee for the Taxonomy of Viruses (http:// ictvonline.org/virusTaxonomy.asp))-recognized virus species and grouped them according to their host; the assumption being that the number of animal virus species, and in particular vertebrateinfecting viruses would dominate and therefore bias any interpretation on specific distributions. Contrary to our assumption, the plant viruses made up 45% of the total number of eukaryoteinfecting viruses, with vertebrate-infecting viruses and insectinfecting viruses at 39% and 12%, respectively (Fig 6b) (Table S7). The Picornavirales order itself reflects a similar host distribution of virus species at 53%, 27% and 16%, respectively (Table S7). Therefore the observation that at least for the CP of RTSV and ToTV and the PRO of CLRV, RTSV and SMoV there were more related virus species of insects than of plants suggested a trend supporting an 'animal-to-plant' colonization. For the Comovirinae lineage, the picture may not be so simplistic and could be either the result of an equilibrium or a more interchangeable host-switching mechanism. Another factor that could significantly alter speciation patterns is the differences in host range: if their tendency was to infect more isolated populations, secovirids could be older than related animal viruses and still have a lower prevalence than animal viruses. Although the host ranges of these viruses has not been rigorously studied there are examples of viruses with demonstrated broad (CRLV, SDV, Cricket paralysis virus (Dicistroviridae)) and assumed narrow ranges (ToTV, RTSV, Sacbrood virus (Iflaviridae), Black queen cell virus (Dicistroviridae)) [13,62].
Outside of the plant-ancestral domains what are the possible explanations for the present similarities observed for the various functional domains? It would seem plausible that upon colonization of plants the viral RdRp and HEL genes might have evolved convergently due to high adaptive selection pressures initially imposed by interacting host cofactors that are required for virus replication [63]. Rapid convergent evolution in the replicationassociated domains has been identified for a number of distinct viruses [64][65][66]. The main selection pressure on the CP would be to maintain vector transmission along with secondary roles in the infection cycle [67]. It is tempting to see the often unique similarities between secovirid CPs and animal-infecting picornalike viruses as evidence of an older co-evolutionary relationship where present-day vectors might have been past hosts [68,69] (Fig.  S3).
A unidirectionality to novel host colonization may be too simplistic, but the very tight relationship between plant viruses and their vectors and the presence of related viruses in ecdysozoans presents a plausible scenario wherein ecdysozoan-infecting viruses could have been introduced during feeding into plants (Fig. 7). In this case, whether animal-to-plants or plants-to-animals, the overall process of host-switching appears to have led to a level of specialization where, in the case of secovirids, the virus appears to have lost its ability to replicate in the vector: unlike in other virus families (for example Bunyaviridae, Rhabdoviridae, and Reoviridae), there is no evidence of virus replication in the vector, although some nepoviruses and comoviruses are described as being persistently transmitted [70]. Recently, TRSV has been detected in honeybees and suggested to be replicating in a variety of different tissues from brain to leg. If this report can be more fully substantiated, the propagation of TRSV in an insect would provide an example of transkingdom host-switching [71]. The insect infecting Rhopalosiphum padi virus (Dicistroviridae, genus Cripavirus) is transmitted horizontally via plants in which it is able spread to all parts without replication [72,73]. In theory, adaptation to a plant environment would eventually require acquisition of a movement protein (and perhaps a 1N(ProCo)) which in the case of the five secovirid lineages analyzed here appear to have occurred from a number of different sources, including from other virus families. Movement of the nonenveloped insect infecting Flock house virus (FHV, family Nodaviridae, also a picorna-like virus) [74] in the plant host Nicotiana benthamiana was complemented by the movement proteins of two unrelated plant viruses. Furthermore, FHV was shown to replicate in the inoculated leaves of six monocot and dicot species [75].
The only salient theoretical example of 'plant-to-animal' hostswitching also involves a calicivirus; where it was postulated that nanoviruses, which are plant single-stranded, circular DNA viruses must have switched to an animal host and then recombined in the C-terminal of the Rep protein with the RNA of a calicivirus prior to returning to plants [76]. Such a scenario was proposed purely on the basis that all caliciviruses described so far infect invertebrates. Conceptually, the loss of the movement protein would be a more parsimonious process, but the de novo inoculation and colonization of insect cells by a plant virus is equally as difficult to imagine. Picornaviruses require cell-surface receptors that are involved in cell attachment, signaling and endocytosis, and in triggering of capsid structural alterations that are required for infectious entry [77].
Taken together, the results provide evidence for the recent evolutionary tendencies and putative ancestral origins of members of the Secoviridae, and provide the basis for further studies into these and related viruses and their ability to adapt to a huge variety of different hosts. Unanswered questions raised in the discussion here that could be addressed in future studies are: 1) Does an RdRp domain evolve rapidly within a founder population colonizing a markedly different host? 2) Is secovirid genome segmentation an unavoidable consequence of plant colonization, and under what conditions does it become selectively detrimental? 3) Are there ecdysozoan vectors harboring viruses that more closely resemble secovirids? 4) Are there emergent secovirid-like viruses that could threaten future agricultural production? 5) What are the characteristics of picorna-like viruses that favor hostswitching? In answering these questions we hope to better predict the sources of emergent viruses and how to control them.