Grapevine virus T diversity as revealed by full-length genome sequences assembled from high-throughput sequence data

RNASeq or double-stranded RNA based approaches allowed the reconstruction of a total of 9 full-length or near full-length genomes of the recently discovered grapevine virus T (GVT). In addition, datamining of publicly available grapevine RNASeq transcriptome data allowed the reconstruction of a further 14 GVT genomes from five grapevine sources. Together with four GVT sequences available in Genbank, these novel sequences were used to analyse GVT diversity. GVT shows a very limited amount of indels variation but a high level of nucleotide and aminoacid polymorphism. This level is comparable to that shown in the closely related grapevine rupestris stem pitting-associated virus (GRSPaV). Further analyses showed that GVT mostly evolves under conservative selection pressure and that recombination has contributed to its evolutionary history. Phylogenetic analyses allow to identify at least seven clearly separated groups of GVT isolates. Analysis of the only reported PCR GVT-specific detection primer pair indicates that it is likely to fail to amplify some GVT isolates. Taken together these results point at the distinctiveness of GVT but also at the many points it shares with GRSPaV. They constitute the first pan-genomic analysis of the diversity of this novel virus.


Introduction
Grapevine (Vitis vinifera) is one of the most important fruit crops with a production of more than 70 million metric tons/year worldwide, according to the last data released by the Food and Agriculture Organization of the United Nations (FAO) (www.fao.org/faostat; accessed a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 July 2018). The grapevine industry is threatened by a wide range of pathogens, including a range of viruses associated to the four major viral diseases, infectious degeneration and decline, leafroll, rugose wood and fleck [1]. Recently, the application of high throughput sequencing (HTS) to plant virology has allowed the discovery of new viruses without prior information of their genome [2], increasing our knowledge on the diversity of the grapevine virome. There are currently over 70 plant viruses reported to infect grapevine, including members of the genus Foveavirus [1,3]. Foveaviruses belong to the family Betaflexiviridae [4] and have helically filamentous particles ca. 800 nm long. These viruses have a positive sense, single-stranded, polyadenylated RNA genome, ranging from 8.4 to 9.3 kb in size. They have a single type of coat protein with a size of 28 to 44 kDa [3,4]. Their genome harbors five open reading frames (ORFs), encoding from 5' to 3' a replication-associated protein (ORF 1), the putative movement proteins (ORF 2 to 4, constituting a triple gene block), and the coat protein (ORF 5), respectively [3,4,5]. Foveaviruses have no identified vectors and their transmission occurs by the vegetative propagation of their host plants [3]. Grapevine rupestris stem pittingassociated virus (GRSPaV), the type member of the Foveavirus genus, is one of the most prevalent viruses of grapevine [6][7][8][9]. Although the pathological properties of GRSPaV infection in most grapevine varieties remain largely unknown, it is consistently associated with ''Rupestris stem pitting", a component of the important Rugose wood complex of diseases [9][10][11][12][13]. GRSPaV exhibits extensive genetic diversity and may have evolved from an ancient recombination event between a Carlavirus and a Potexvirus [14]. Phylogenetic analyses of viral sequences derived from infected grapevines suggest that GRSPaV comprises a wide range of sequence variants [6][7][8][9][15][16][17][18], which lead to the definition of four groups of sequence variants that can be further sub-divided in a number of subgroups [9,[18][19].
Recent advances in HTS technologies and bioinformatics have generated huge new opportunities for discovering and diagnosing novel plant viruses and viroids [2,20]. Among the agents recently described in grapevine, Grapevine virus T (GVT) is a putative new member of the genus Foveavirus, closely related to GRSPaV. It was recently described from transcriptomic data of the Teroldego variety [21]. Since then, GVT has been reported from grapevines in Germany [22], Croatia [23] and Slovakia and the Czech Republic, where it appears to be relatively widespread [24]. Furthermore, similarly to GRSPaV, GVT seems to exhibit extensive genetic diversity [24]. A phylogenetic analysis performed on the complete GVT CP gene of several Slovakian and Czech isolates, together with the initial Teroldego isolate, has shown the existence of two well supported clusters of isolates and suggested the existence of potential additional phylogenetic clusters [24]. Taken together the data available on this recently described grapevine virus indicate significant prevalence and diversity, raising issues on its detectability and highlighting the need to generate more information on GVT diversity.
In the present work a total of 23 full-length or near full-length GVT sequences have been obtained by HTS analyses or more classical means. All these sequences, together with the few sequences already available in databases, have allowed us to perform an analysis of GVT diversity at the pangenomic level. The results presented here provide a first comprehensive picture on the genomic features and evolution of GVT, which might have important implications for the diagnostics of this recently discovered Foveavirus.

Virus isolates and plant samples
The grapevine samples used in the present study are described in Table 1, which presents data on the variety, the sampled plants origin and the other viruses found to be present in these plants in addition to grapevine virus T.

High-throughput sequencing of cDNAs derived from highly purified double stranded RNAs
Highly purified double stranded RNAs were extracted from grapevine phloem scrapings of dormant canes using the small scale procedure described in [25]. This procedure involves two rounds of cellulose affinity chromatography and a nuclease digestion step to eliminate DNA and single-stranded RNAs. Complementary DNA molecules were then synthesized and randomly amplified, again as described [25,26] before being sequenced on an Illumina HighSeq 2500 (2 � 250 nt paired-end reads) in a multiplexed format (Genewiz, South Plainfield, NJ, USA).

High-throughput RNAseq
Total RNAs were extracted from phloem scrapings or from leaves, using the Spectrum Plant Total RNA Kit (Sigma). For the SK809 sample, ribosomal RNA was removed using the Ribo-Zero rRNA Removal Kit (Illumina, San Diego, USA) while total RNAs from other samples were directly converted to double-stranded cDNA from which sequencing banks were prepared. High-throughput sequencing in multiplexed format was performed on the following Illumina platforms: MiSeq (2 � 200 nucleotides) for the SK809 sample, HighSeq3000 (2 � 150 nt) for the three Sauvignon samples.

High-throughput sequence data processing
Following demultiplexing and quality trimming, high-quality reads were used for de novo assembly and contigs were annotated by BlastN and BlastX analysis [27] against Genbank or by alignment to the viral genomes database (ftp://ftp.ncbi.nih.gov/genomes/Viruses/all.fna.tar. gz) using CLC Genomics Workbench 7.5 (http://www.clcbio.com) and Geneious v.8.1.9 (https://www.geneious.com/). In some cases, contigs were further extended by several rounds of mapping of residual reads in CLC Genomics Workbench. Subsequently, in order to assess the representation of GVT reads, the reads from each sample were mapped against the complete GVT sequence(s) assembled from that sample.

Completion and validation of GVT genomic sequences assembled from the high-throughput sequence data
For some genome regions, the accuracy of the sequence of the assembled GVT genomic sequences was verified by direct Sanger sequencing of PCR products using primers designed from the NGS-based sequences. The same strategy was used to completely resequence and validate the sequence of the 486-1 isolate. Similarly, approximately 0.7 to 1.3 kilobases which were missing from the 5' end of contigs assembled from double-stranded RNA data (isolates Limniona V34, Saperavi V49-12 and V49-24 and Sauvignon gris V51, see Table 2) were determined by direct Sanger sequencing of PCR products obtained using primers designed from the contigs and from available GVT sequences. Terminal 3' sequences were determined for all isolates (with the exception of Sauvignon I60-2 and F79) by direct Sanger sequencing of PCR products obtained using a polyA-anchored PCR and an internal primer designed from the contigs. For isolates Sauvignon I27 and 486-1, terminal 5' sequences were determined using the 5' Rapid Amplification of cDNA Ends (RACE) strategy and internal primers designed from the contigs, following the kit supplier's instructions (Takara Bio Europe/Clontech, Saint-Germain-en-Laye, France).

Determination of the complete genomic sequence of GVT isolate 486-1 by classical Sanger sequencing
For the 486-1 isolate, total RNAs were extracted from leaf material using the Nucleospin RNA plant and fungi kit (Macherey Nagel). The GVT genome was completely sequenced by Sanger sequencing of cDNA clones of overlapping PCR products, including for the 3' terminal sequence, for which a polyA-anchored PCR was used. The 5' terminal sequence was determined 5' RACE (Invitrogen), following the instructions of the supplier. The cDNAs were dCand dG-tailing in separate reactions, and amplified by PCR using a GVT-specific internal primer and oligo-dG or oligo-dC primers, respectively. At least four clones from each reaction were sequenced. The obtained 5'end sequences were all identical.

Datamining of grapevine RNASeq data and assembly of GVT genomic sequences
All analyses were performed using the CLC Genomics Workbench 11.0 software. A selection of grapevine RNAseq transcriptome data available in GenBank (choice based on variety, country. . .) was downloaded and screened for the presence of GVT by a direct mapping of the reads to the GVT-Teroldego reference (NC_035203). Transcriptomes for which GVT presence was detected in this way were then assembled de novo using the following parameters: word size 19, bubble size 50 and a minimal contig size of 300nt. GVT contigs identified by BlastN analysis were then further extended by multiple rounds of residual reads mapping, until near complete genomes were obtained.

Selection of a dataset of representative GRSPaV genomic sequences
A dataset composed of 21 full length or near full-length GRSPaV sequences selected so as to represent the known genetic diversity of GRSPaV was retrieved from the Genbank database. In particular these GRSPaV sequences were selected to be representative of the various phylogenetic groups and sub-groups identified in its diversity [6,9,18]. In total, group 1 is represented by four isolates including GRSPaV-SY, group 2 by 13 isolates (including GRSPaV-1 typifying subgroup 2a, GRSPaV-SG1, typifying subgroup 2b and GRSPaV-BS, group 3 by three isolates each typifying a subgroup (GRSPaV-JF, GRSPaV-ML and GRSPaV-PN) and group 4 by the divergent isolate GRSPaV-LSL.

Sequence, diversity and phylogenetic analyses
Multiple sequence alignments were performed using ClustalW [28] implemented in MEGA version 7 [29]. Phylogenetic trees were constructed using the neighbor-joining method with strict nucleotide or amino acid distances and randomized bootstrapping for the evaluation of branch validity. Genetic distances (p-distances calculated on nucleotide or amino acid identity) were calculated using MEGA version 7. An analysis of the local diversity along the GVT genome was performed using DNASP v6 [30] while a search for potential recombination events in GVT genomic sequences was performed using RDP 4.0 [31]. A codon-based analysis of selection pressures in the various GVT open reading frames was performed using the FEL and SLAC programs on the Datamonkey web server (http://www.datamonkey.org/) [32].

Results and discussion
Determination of the complete genome sequence of GVT isolates using RNAseq or dsRNA high-throughput sequencing approaches As indicated in the Materials and Methods section, in the course of the analysis of the grapevine virome a number of samples were analysed by high-throughput sequencing (HTS), either by a double-stranded RNA (dsRNA)-based approach or by a total RNA or ribosomal RNAdepleted RNASeq approach. In parallel, the GVT genome of isolate 486-1 was completely determined by classical Sanger sequencing. In total, GVT was identified in eight samples, three of which represent the same Sauvignon origin. Besides these three Sauvignon samples from France, were three accessions (Limniona (2288Mtp2), Saperavi (1734Mtp2), Sauvignon gris (301Mtp4) obtained from the INRA Grapevine biological resource center of Vassal-Montpellier (https://www6.montpellier.inra.fr/vassal_eng/), a Welshriesling sample from Slovakia and a Riesling sample from Germany ( Table 1). All of these samples were found to be simultaneously infected by several other viruses and, in most cases, viroids (Table 1). However, some of these samples (French Sauvignon, Slovak Welshriesling and German Riesling) did not show any symptoms despite the mixed infections affecting them.
Following the bioinformatics analysis of the sequencing reads of the three grapevine samples analysed by the dsRNA-based approach (Table 2), large GVT contigs were assembled for the Limniona and Sauvignon gris samples while two GVT contigs were assembled from the Saperavi sample (Table 2). As compared to the reference GVT genome (NC_035203) [21], these large contigs (7.3 to 7.9 kilobases, see Table 2) were missing approximately between 0.7 to 1.3 kb at their 5' end and 72-108 nucleotides at their 3' end. They all however had an otherwise excellent average coverage of between 112x and 322x ( Table 2). The 3' sequences of the various isolates were completed by the direct sequencing of PCR products amplified using a specific primer designed from the initial contig and an oligodT anchored primer. The 5' sequences were completed by the direct sequencing of PCR products using specific primers designed from the initial contig and a primer designed from the 5' end of I27 GVT sequence. In all cases, the overlap regions between the initial contig and the PCR products were found to be 100% identical. No effort was made to resequence the region covered by the 5' terminal amplification primer, which was therefore not integrated in the finalized sequences.
In parallel, GVT genomes were similarly assembled from high-throughput Illumina data derived from total RNAs or from ribosomal RNA-depleted total RNAs (Table 3). In the case of the three Sauvignon samples, near complete contigs were readily assembled, missing only 14-18 nt at the 5' end and 23-27 nt at the 3' end as compared to the NC_035203 reference isolate. These three contigs originate from the same grapevine variety and are highly homologous, sharing a minimum of 99.1% nucleotide (nt) identity (the sequences assembled from the I60-2 phloem scrappings and the F79 leaves are in fact 100% identical). It was therefore decided to complete the sequence of a single variant, I27. This was achieved as described above for the 3' terminal region and by a RACE PCR for the 5' terminal region. The completed genome of the I27 isolate is 8,694 nt long, excluding the 3' polyA tail. The genomes of the SK809 isolate was similarly assembled, while that of the 486-1 isolate was fully determined by the Sanger sequencing of overlapping PCR products. In the case of the SK809 isolate, the genome was partially resequenced in some regions by direct Sanger sequencing of PCR products. No effort was made to complete the nine missing terminal 5' nucleotides but the 3' end was completed as described above. For the 486-1 isolate, the 5' and 3' terminal regions were determined by RACE and polyA-anchored PCR. Similar to the I27 isolate, the genomic sequence of the 486-1 isolate is 8,694 nt in length, excluding the polyA tail, a value to be compared to the 8,701 nt of the reference NC_035203 isolate.

Datamining of grapevine RNAseq transcriptomic data allows the assembly of near complete GVT genomes
In an attempt to identifiy and assemble additionnal complete or near complete GVT variant genomic sequences, publicly available grapevine transcriptomic RNAseq data were screened for the presence of GVT by mapping reads against the GVT reference sequence using CLC Genomics Workbench 11.0. In order to allow the detection of the broadest GVT diversity possible, relaxed parameters for length fraction (0.5) and similarity (0.7) were used while default values were used for other parameters. GVT genomes were then recovered from GVT-positive datasets by de novo assembly. Contigs were finally extended by multiple rounds of mapping and assembly as described recenty in the case of GRSPaV [18]. In total, 14 new near complete GVT genomes could be assembled in this way (Table 4) from six different grapevine varieties.
Pinot noir, Barbera, Moscato rosa and the wild grapevine relative Vitis amurensis each yielded a single variant while respectively four and six variants could be assembled from Cabernet Franc and Lambrusco, respectively. Only in the case of V. amurensis, for which GVT coverage was lower, was it necessary to perform an assembly from multiple datasets. However, there was no evidence for GVT variability between the assembled datasets. Since these genomes were obtained by datamining, no specific efforts were made to complete the missing 5' or 3' terminal sequences. The amount of missing data is however limited since only 4-29 nt are missing from the 5' end of the assembled genomes as compared to the NC_035203 reference isolate, while 4-53 nt are missing from the genomes 3' ends. Only for one variant, Lambrusco-3 did a more significant 3' region end up missing (363 nt). Overall, as compared to the reference isolate, the assembled contigs represent over 99.2% of the complete genome (99.2-99.8%) with the exception of the Lambrusco-6 (98.3%) and of the Lambrusco-3 isolates (95.6%) ( Table 4).

Identification of GVT terminal genomic sequences
In total, 8 new complete or near-complete GVT genomes assembled from high-throughput dsRNA or RNAseq data (Tables 2 and 3) and a complete genome determined by Sanger sequencing were thus obtained, together with another 14 genomes assembled from publicly available grapevine RNAseq transcriptome data (Table 4). These were compared to the reference isolate, to the two incomplete sequences reported from Croatian autochtonous grapevines [23, MG001925-26] and to a complete, unreleased sequence of a Slovak isolate from the Veltliner variety [24, MH182704]. The reference sequence has been assembled from transcriptomic data of the Teroldego variety but not experimentally validated [21]. The precise 5' end of the GVT genome has thus been determined only in three isolates, the MH182704 sequence [24] and the I27 and 486-1 isolates reported here. The sequence determined for these last two isolates are identical (5' GGATAAAC. . .) while the MH182704 and NC_035203 sequences are further extended 5' but by different sequences: acatgggGGATAAAC. . .. (the extension in bold lower case) for MH182704 and cgcGGATAAAC. . .. for the NC_035203 reference sequence. Given these discrepancies it is not possible to unambiguously conclude on GVT 5' genome end but the convergence of the results independently obtained on the I27 and 486-1 suggests that the extreme 5' end has been identified. The possibility that different isolates might have different 5' terminal sequences remains but seems unlikely in view of the high conservation of these sequences between isolates in most viruses. The absolute conservation of the terminal 19 5' nucleotides in all isolates of the closely related GRSPaV [9] also argues aginst this notion. Assuming the terminal sequence identified in the I27 and 486-1 isolates to be correct, the first 16 nucleotides would be fully conserved between all GVT isolates sequenced here: 5' GGATAAACAACAGAAC 3'. This sequence shows high homology with the corresponding GRSPaV sequence, which only differs by a single terminal G and by the insertion of the ATAAC sequence in bold and underlined here: 5' GATAAACATAACAACAGAA 3'. In all the isolates analysed here, highly homologous 3' terminal sequences were identified. Surprisingly the sequence of the NC_035203 GVT reference isolate is four nt longer and completely unrelated to the sequence of all other isolates for its last 21 nucleotides. Given the expected high conservation of the 3' terminal sequence (this region is again highly conserved between GRSPaV isolates [9]) and the fact that the reference sequence has not been experimentally confirmed, the most likely explanation is that the reference sequence contains an assembly artefact. The alternative possibility, that the reference isolate possesses a highly divergent 3' terminal sequence, although less likely, cannot however be totally disproved. But overall a change in the choice of the GVT reference isolate for an isolate with experimentally validated genome ends seems warranted.

GVT genomic organization is highly conserved between isolates
The genomic organization is fully conserved between all GVT isolates and is typical of members of the Foveavirus genus [3,4], with its five open reading frames (ORFs). From 5' to 3', these correspond to a large replication-associated protein (REP, ORF1), three ORFs encoding the three proteins of a movement-associated triple gene block (TGB1-2-3, ORFs 2-3-4) [5] and a coat protein (CP, ORF5). The genome is highly colinear between all isolates, with very few indels. The MG001925 sequence contains a six nt [two amino acids (aa)] deletion in REP gene and a two nt (frameshift) deletion in CP gene. However, the genome coverage for this isolate is incomplete so that it may still contain some errors. No other indels in the genome coding regions are identified among the 27 genomic sequences analyzed here.
Minor indel polymorphisms are observed in the internal non-coding region (NCR) between the end of REP ORF and the beginning of the TGB1 ORF. The Lambrusco-1 has three extra nt in that region while the Lambrusco-3 and CF-4 isolates each have a one nt deletion. A single one nt deletion is observed in the 3'NCR in the Sauvignon gris V51, CF-4 and Lambrusco-4 and Lambrusco-5 isolates.
The TGB1 ORF is fully conserved in length (663 nt, 221 aa) and ends with TGA or TAA stop codons. The TGB2 ORF is conserved in length (348 nt, 116 aa) in all isolates except Limniona V34 (357 nt, 119 aa, three aa longer) because of a mutation changing the conserved TGA stop codon in an AGA and termination on the next in-frame stop codon (TAG), for a three aa longer protein. The TGB3 is the most variable ORF size-wise. It is generally 234 nt long (78 aa) but is 246 nt (82 aa) in 486-1 and 249 nt (83 aa) in the three Sauvignon variants I27, I60-2 and F79. In each case of a longer protein, termination is on the next in-frame stop codon, four or five amino acids downstream. Depending on the isolate, TGB3 terminates, on TGA, TAG or TAA stop codons. The expected conserved motives are found in each of the TGB proteins: Viral_helicase1 (pfam01443, TGB1 aa 25-220), Plant_vir_prot (pfam01307, TGB2 aa 4-62) and the misnamed 7kD_coat (pfam02495, TGB3 aa 6-60).
The CP ORF is fully conserved in length, at 765 nt (255 aa) and ends either with TGA, TAA or TAG stop codons. It contains the expectyed Flexi_CP conserved motif (pfam00286, aa 72-209).
Overall this genomic organization is highly parallel to that of the closely related GRSPaV (Fig 1) although GRSPaV has an additional putative ORF that largely overlaps with the 3' part of the CP gene [9,13]. In some GRSPaV isolates this ORF encodes a protein of up to 14 kDa, but if assuming translation initiation on a conserved methionine, this conserved extra ORF encodes a 95 aa protein. This potential protein has no homology with any other protein in Genbank. This putative ORF is not observed in any GVT isolate. Although a similar placed, but shorter potential ORF can be observed in some apple stem pitting virus (ASPV) isolates, it is not conserved in ASPV and it is not observed in other foveaviruses such as rubus canadensis virus 1, peach chlorotic mottle virus, asian prunus virus 1 or apricot latent virus. This fully conserved ORF appears thus to be a feature so far unique to GRSPaV in the Foveavirus genus and to differenciate it from GVT.

Analysis of GVT variability using full genome sequence data
Using the dataset of 27 full length or near full length GVT sequences we then analyzed GVT diversity. At whole genome level, the average pairwise divergence between GVT isolates was calculated to be 16.6% +/-0.2%. This relatively high value is in line with the known high variability of many Betaflexiviridae members and is comparable to the 16.2% +/-1.9% computed for a dataset composed of 21 full length or near full-length GRSPaV sequences selected so as to represent the known genetic diversity of GRSPaV [6,9]. Remarkably, at respectively 17.3% +/-0.3% and 17.7% +/-0.2%, the average genetic distance between the various isolates identified in the Cabernet franc or Lambrusco transcriptomes proved as high as the overall GVT value, demonstrating the extreme GVT diversity involved in these mixed infections.
The GVT 5' and 3' NCRs are slightly more conserved as compared to the rest of the genome, with average divergence values of respectively 7.6 and 11.6% (Table 5). The CP ORF is the most conserved (14.8% +/-0.7%), while the TGB1 is the most variable (17.3% +/-0.8%), closely followed by the REP and TGB2 ORFs. Looking at the encoded proteins, somewhat different trends are observed. The CP is again the most conserved with only 6.1% average divergence. This high conservation of the CP is also observed in GRSPaV [9]. However the second most conserved protein is the REP (9.8% average divergence) followed in order by the TGB1 and TGB2, while at 17.9% the TGB3 appears to be the less conserved GVT protein. As previously shown for other Betaflexiviridae members [33] a very large fraction of GVT diversity is concentrated on the third position of codons which shows from three to six times more diversity than the first position of codons. The comparable values are even higher when considering the 2 nd position, which is the most constrained ( Table 5).
Plotting of the nucleotide diversity (pi) along the genome confirms these observations and shows a very comparable profile between GVT and the closely related GRSPaV. In particular, the two viruses show a steep diversity peak around position 2000 of the genome (Fig 1). This is immediately upstream of the AlkB domain found at positions 721-849 (aa) of the REP protein of GVT. The CP gene and the 3' NCR tend for both viruses to show a lower diversity than the rest of the genome. However, this trend is clearly stronger in the case of GRSPaV with a very strong drop in diversity, which largely corresponds to the region of overlap between the CP   gene and the tentative ORF6. In this region, which corresponds to nearly half of the CP gene, the GRSPaV nucleotide diversity is sharply lower (7.7 +/-0.8%) as compared to the upstream portion of the gene (15.0 +/ 1.0%). This trend is particularly strong for the 2 nd position of the codons, for which diversity is reduced more than four-fold (from 2.7% to 0.6%) while codon positions 1 and 3 show only roughly a halving of the diversity, in line with the trend computed when considering all positions. This particular diversity pattern is coherent with the existence of a conservative selection pressure applying to ORF6 and therefore supports the hypothesis of the functionality of this GRSPaV ORF. A codon-based analysis of selection pressures was performed on the five ORFs encoded on the GVT genome using the FEL and SLAC programs on the Datamonkey web server [32]. This analysis failed to identify any evidence for the existence of codons under divergent selection pressure at the 1% threshold, suggesting that GVT largely appears to be evolving in grapevine under conservative selection pressure.

Phylogenetic analysis of GVT isolates demonstrates the existence of several clusters of isolates
A phylogenetic analysis of the complete or near complete GVT genome sequences was performed using the sequences reported here and those available in Genbank. The 21 GRSPaV genome sequences selected so as to represent its known genetic diversity were analyzed in parallel, providing a benchmark on which to compare GVT diversity. Fig 2 shows a very clear separation of GVT from GRSPaV, further supporting its recognition as a distinct Foveavirus species. For GRSPaV, the four well known major phylogenetic clusters are observed [9], with group IV represented by the divergent LSL isolate [34,KR054735]. The GVT portion of the tree similarly reveals four clusters of isolates supported by 100% bootstrap values (Fig 2). In addition, a further three groups correspond to single isolates at the end of long branches (Limniona V34, Lambrusco-1 and Lambrusco-2). These various clusters are quite distinct and clearly identified. They show an average intragroup divergence of between 7.4% +/-0.2% (Group IV) and 12.2% +/-0.2% (Group I) while the intergroup divergence varies between 17.3 and 19.8% +/-0.4%. Comparable values for GRSPaV range from 5.5% +/-0.2% (Group I) to 14.4% +/-0.2% (Group III) for intragroup values but are slightly higher, between 22.7 and 23.4% +/-0.3% for intergroup values.
Additional analyses performed on the individual REP and CP ORFs at the nucleotide (not shown) or amino acid levels (S1 and S2 Figs) further confirm this clustering of GVT isolates in a number of well separated groups. For both GRSPaV and GVT, the intraspecific diversity is clearly narrower when considering the CP tree, thus confirming the diversity analysis performed on the individual ORFs. The detailed comparison of the REP and CP trees shows, however, a different behavior of isolate SK809 in the two trees: while it clusters with 100% bootstrap support in Group I in the REP tree, it clusters in Group II in the CP tree with significant bootstrap support (S1 and S2 Figs). As this type of behavior can be indicative of recombination events, a specific search for recombination events in the GVT full genomes dataset was performed.

Search for potential recombination events in GVT isolates
The two Croatian GVT sequences [23] were removed from the dataset because they contain internal gaps and because their assembly might still contain artifacts. The remaining dataset of 25 near complete GVT sequences were analyzed using the RDP4 software [31]. Some statistically supported potential recombination events could be identified, involving two groups of isolates. In one case, four recombination events involving isolates 486.1, CF-3 and Moscato rosa are detected by seven of the nine tested methods, with combined corrected probabilities varying between 1.0 e -7 and 9.0 e -26 (Table 6). These events affect different parts of the genome and suggest the existence of multiple recombination events linking these isolates and a possible complex mosaic structure for the genome of one of them. The second set of tentative recombination events detected involve isolates SK809 and Moscato rosa and either an unknown parent or the Lambrusco-1 isolate. The highest probability event (involving Lambrusco-1, detected by 7/9 programs with an excellent overall corrected probability of 1.3 e-37 ) recapitulates the other two detected events, that have lower probabilities (Table 6).
Taken together with the abnormal behavior of isolate SK809 between the REP and the CP phylogenetic tree, these results indicate that, similar to many other Betaflexiviridae members, recombination appears to have played a significant role in the evolution of GVT. This observation was not unexpected in view of the complex mixed infections of GVT isolates reported here and of the known impact of recombination on the evolution of the closely related GRSPaV [9,[17][18].

Consequences of GVT variability for its molecular diagnostics
The diversity of GVT evidenced in the present work is quite large, as evidenced by the 16.6% average divergence between isolates at full genome level reported here. This translates in an average 3.3 differences for a 20 nucleotides PCR primer binding site, indicating that designing detection primers able to detect all GVT isolates might be a challenging task. To date, as single pair of PCR primers is reported for GVT [24]. In an attempt to evaluate the polyvalence of these primers, we confronted them to the 27 GVT genomes analyzed here. Of particular interest for this comparison is the fact that having been identified by high-throughput sequencing approaches, these isolates can be considered as an unbiased representation of GVT diversity. The results of this comparison are presented in Fig 3. For the reverse primer GVT_8534R, the first 17 nucleotides from the 3' primer end are fully conserved for all isolates for which sequence information is available in that region. The last three nucleotides of the primer match with from zero to three mutations with the sequence of the various isolates. However, these mutations are unlikely to drastically affect binding of the primer, which is therefore likely to show excellent polyvalence. In the case of the forward primer, GVT_7632F, the situation is somewhat different. From zero to five mutations affect the primer binding site, depending on the GVT isolate. In some cases, these mutations affect positions close to the primer 3' end. This is particularly the case for the Lambrusco-1 isolate (five mutations, one of which is internal and three of which are clustered next to the primer 3' terminal nucleotide) and, to a slightly lesser extent, of isolates Lambrusco-4 and -5 (four mutations, two clustered close to the primer 3' end) (Fig 3). Although this would have to be experimentally confirmed, the number and critical position of the mutations in the primer binding site strongly suggest that the efficiency of the GVT_7632F primer is likely to be severely affected if not abolished for some GVT isolates.

Conclusion
The sequences and analyses reported here represent the first in depth, full genome analysis of GVT diversity. The only information available to date on GVT diversity is a PCR-based analysis [24], which was limited to the CP gene of Czech and Slovak isolates. The authors report an average pairwise divergence of 11.2% +/-0.8% at the nucleotide level and of 8.7% at the amino acid level. These values are close to those reported here (14.8% +/-0.7% for the nucleotide sequences, 6.1% +/-0.7% for the amino acid sequences), confirming these initial results. GVT shares many features with its closest relative, GRSPaV, which is also a grapevine virus. In both cases, there is very little indel polymorphism, while single nucleotide polymorphism rates are high, in particular on the third positions of codons. The genetic organization of the two viruses are highly similar, with the exception of the presence of a conserved, putative ORF6 largely overlapping the CP ORF in GRSPaV, that is absent in GVT. GVT diversity appears to be relatively high and of the same order of magnitude than that of GRSPaV. Similar to GRSPaV, GVT diversity is structured and the phylogenetic analyses reported here allow the delineation of at least seven quite clearly separated groups of isolates. The recombination analysis performed here also provides the first evidence for recombination in GVT.
As for other viruses with a high diversity, the development of GVT PCR-based detection assays of broad specificity is likely to prove challenging. The analysis of the only published GVT-specific primer pair [24] suggests that it might miss some GVT isolates and that further development efforts are likely needed. The complete or near complete GVT genomic  [24]. The sequences are shown in a 5' to 3' orientation (left to right). Missing sequence information is indicated by n. Mutations are indicated in gray-shaded, bold type. The sequence of the primers is shown in bold italics at the bottom of the alignement (for the GVT_8534R reverse primer, the reverse complement is shown).
https://doi.org/10.1371/journal.pone.0206010.g003 sequences reported here should prove an invaluable resource for such efforts. In this respect, a cursory look at the sequence alignments indicates that the use of degenerate primers could be the most interesting and fruitful strategy.
Due to its recent discovery, GVT is still a very poorly known virus. In particular, there are currently no indications about is potential impact in grapevine. Some of the plants analysed here, such as the French Sauvignon samples, the Slovak Welshriesling or the German Riesling did not show any symptoms or evidence of viral infection. This suggests that similar to GRSPaV, GVT might symptomlessly infect at least some grapevine varieties. Whether this notion can be further extended remains to be investigated.