Plant Virology and Next Generation Sequencing: Experiences with a Potyvirus

Next generation sequencing is quickly emerging as the go-to tool for plant virologists when sequencing whole virus genomes, and undertaking plant metagenomic studies for new virus discoveries. This study aims to compare the genomic and biological properties of Bean yellow mosaic virus (BYMV) (genus Potyvirus), isolates from Lupinus angustifolius plants with black pod syndrome (BPS), systemic necrosis or non-necrotic symptoms, and from two other plant species. When one Clover yellow vein virus (ClYVV) (genus Potyvirus) and 22 BYMV isolates were sequenced on the Illumina HiSeq2000, one new ClYVV and 23 new BYMV sequences were obtained. When the 23 new BYMV genomes were compared with 17 other BYMV genomes available on Genbank, phylogenetic analysis provided strong support for existence of nine phylogenetic groupings. Biological studies involving seven isolates of BYMV and one of ClYVV gave no symptoms or reactions that could be used to distinguish BYMV isolates from L. angustifolius plants with black pod syndrome from other isolates. Here, we propose that the current system of nomenclature based on biological properties be replaced by numbered groups (I–IX). This is because use of whole genomes revealed that the previous phylogenetic grouping system based on partial sequences of virus genomes and original isolation hosts was unsustainable. This study also demonstrated that, where next generation sequencing is used to obtain complete plant virus genomes, consideration needs to be given to issues regarding sample preparation, adequate levels of coverage across a genome and methods of assembly. It also provided important lessons that will be helpful to other plant virologists using next generation sequencing in the future.


Introduction
Next generation sequencing (NGS) technologies are fast becoming a popular method to obtain whole plant virus genomes in a relatively short period of time [1]. Their uptake by plant virologists has been slower than by their counterparts in the medical sciences where the applications are extending much further, rapidly approaching the concept of personalized medicine. Such a situation was impossible before the advent of NGS and its' rapid evolution into an affordable and accessible technology now appearing on laboratory bench-tops throughout the world [2,3]. Because of the ability to use total RNA extractions for NGS, it is becoming increasingly common to use it to sequence complete genomes of plant viruses and still obtain excellent results [4][5][6][7][8][9]. The challenge now lies not in accessing and using NGS technology, but in analyzing and interpreting the very large datasets suddenly at our disposal [1].
Bean yellow mosaic virus (BYMV) (family Potyviridae, genus Potyvirus) is a single stranded positive sense RNA virus that occurs worldwide. It is a virus with an extensive natural host range that encompasses monocots and dicots, and both domesticated and wild plant species [10,11]. It is transmitted non-persistently by many different aphid species [12]. BYMV causes serious diseases and losses in many cultivated plant species worldwide. For example, early BYMV infection, which causes serious losses, normally results in systemic necrosis and plant death [13][14][15]. In contrast, late infection with BYMV causes black pod syndrome (BPS) in Lupinus angustifolius (narrow-leafed lupin) also resulting in damaging losses [16]. Plants with BPS develop characteristic flat, black pods that have little or no seed [17]. It seems likely that both the BPS and systemic necrosis responses are related to presence of hypersensitivy Nbm-1 gene and another similar resistance gene [15,[18][19][20].
Wylie et al. [21] provided evidence for existence of seven BYMV phylogenetic groupings based on coat protein (CP) sequences and the original hosts of the isolates sequenced: one generalist group with a broad host range including monocots and dicots called the general group, and six other specialist groups each named after the original hosts of the isolates within them (broad bean, canna, lupin, monocot, pea, W). Partial CP sequences from BYMV isolates originally from L. angustifolius plants with BPS, systemic necrosis or non-necrotic symptoms placed all of them into the general group [16,21].
This study aims to compare the genomic and biological properties of BYMV isolates from L. angustifolius plants with BPS, systemic necrosis or non-necrotic symptoms, and from two other plant species. NGS was used to sequence 22 BYMV isolates, obtained as part a study conducted in 2011 and from previous studies in south-west Australia [16,19]. Here, we present the results of genome comparisons with the resulting 23 new BYMV genomes and one Clover yellow vein virus (ClYVV) genome with 17 genomes retrieved from Genbank, and biological host range studies with seven BYMV and one ClYVV isolates. We also make recommendations based on the lessons learned from our NGS studies which will be useful to plant virologists employing this approach to obtain whole genomes of other plant viruses.

Isolates and host plants
Seventeen BYMV isolates were collected from L. angustifolius plants with BPS (i.e. systemic necrotic stem streaking with black pods) (11) and systemic necrosis (no black pods) (6), and two from L. cosentinii plants with mosaic and leaf deformation as part of a 2011 study in south-western Australia [16]. The remaining three BYMV isolates (FB, LMBNN and LP) were from previous studies [19]. They had been maintained as freeze-dried leaf material obtained from the West Australian Plant Pathogen Culture Collection (FB -WAC10051, LMBNN -WAC10094 and LP -WAC10059). The ClYVV isolate was from the same culture collection (WAC10102).
All plants were maintained at 18-22uC in an insect-proof, air conditioned glasshouse. Plants of L. angustifolius cvs Jenabillup (partially resistant to BPS), Mandelup (susceptible to BPS) and germplasm accession P26697 (Nbm-1 gene absent) were grown in washed river sand. Plants of Nicotiana benthamiana, Trifolium subterraneum cv. Woogenellup (subterranean clover), Chenopodium amaranticolor, C. quinoa, Pisum sativum cv. Greenfeast (pea) and Vicia faba cv. Coles early dwarf (faba bean) were grown in steam-sterilised potting mix. Cultures of virus isolates were maintained by serial mechanical inoculation of infective sap to plants of N. benthamaniana or T. subterraneum. For inoculations to maintain cultures, or as part of experiments, virus-infected leaves from systemically infected plants were ground in 0.1M phosphate buffer, pH 7.2, and the infective sap mixed with celite before being rubbed onto leaves.
For testing by ELISA, leaf samples were extracted (1 g per 20 ml) in phosphate-buffered saline (10 mM potassium phosphate, 150 mM sodium chloride, pH 7.4, Tween 20 at 5 ml/liter, and polyvinyl pyrrolidone at 20 g/liter) using a mixer mill (Retsch, Germany). Sample extracts were tested for BYMV or ClYVV by double-antibody sandwich ELISA based on a modified protocol described by Clark and Adams [22] and according to manufacturer's recommendations. For generic Potyvirus testing, samples were extracted in 0.05 M sodium carbonate buffer, pH 9.6, and tested using the antigen-coated indirect ELISA protocol of Torrance and Pead [23]. The polyclonal antiserum to BYMV was from DSMZ (AS-0717), Germany, to ClYVV from Neogen Phytodiagnostics -formerly Adgen, UK (1171-05) and to generic potyvirus from Agdia, USA (SRA27200). All samples were tested in duplicate wells in microtiter plates. Sap from BYMV or ClYVV infected and healthy T. subterraneum leaf samples was included in paired wells to provide positive and negative controls. The substrate was p-nitrophenyl phosphate at 1.0 mg/ml in diethanolamine, pH 9.8, at 100 ml/liter. Absorbance values at A 405 were measured in a microplate reader (Bio-Rad laboratories, USA). Absorbance values of positive samples were always more than three times those of the healthy sap control.

Sequence data
Twenty two BYMV and one ClYVV sample were sent for NGS on an Illumina HiSeq 2000 (Table 1). For BYMV in total there were 11 samples from L. angustifolius plants with BPS, six from L. angustifolius plants with systemic necrosis and one from a L. angustifolius plant with non-necrotic symptoms. The remaining samples consisted of isolates from other Lupinus spp. or were isolates from other hosts representing other phylogenetic groups based on Wylie et al. [21], including two samples from L. cosentinii, one from L. pilosus, and one from V. faba. The single ClYVV sample was from T. repens (white clover). Total RNA was extracted from each sample using a Spectrum Plant Total RNA kit (Sigma-Aldrich, Australia). Following extraction, total RNA was sent to the Australian Genome Research Facility (AGRF) for library preparation and barcoding (24 samples per lane) before 100 bp paired-end sequencing on an Illumina HiSeq2000. For each sample, reads were first trimmed using CLC Genomics Workbench 6.5 (CLCGW) (CLC bio) with the quality scores limit set to 0.01, maximum number of ambiguities to two and removing any reads with ,30 nucleotides (nt). Contigs were assembled using the de novo assembly function of CLCGW with automatic word size, automatic bubble size, minimum contig length 500, mismatch cost two, insertion cost three, deletion cost three, length fraction 0.5 and similarity fraction 0.9. Contigs were sorted by length and the longest subjected to a BLAST search [24]. In addition, reads were also imported into Geneious 6.1.6 (Biomatters) and provided with a reference sequence obtained from Genbank (JX173278 for BYMV and NC003536 for ClYVV). Mapping was performed with minimum overlap 10%, minimum overlap identity 80%, allow gaps 10% and fine tuning set to iterate up to 10 times. A consensus between the contig of interest from CLCGW and the consensus from mapping in Geneious was created in Geneious by alignment with Clustal W. Open reading frames (ORFs) were predicted and annotations made using Geneious. Finalized sequences were designated as ''complete'' based on comparison with the reference sequences used in the mapping process, ''nearly complete'' if some of the 59 or 39 UTR was missing but the coding region was intact, and ''partial'' if all of the 59 or 39 UTR and some of the P1 or CP genes were missing.

Phylogenetic analysis
The new sequences were aligned with the 17 retrieved from Genbank using Clustal W in MEGA 5.2.1, prior to phylogenetic analysis [25]. Phylogenetic analysis compared (i) coding regions of all BYMV genome sequences and (ii) coding regions of all BYMV genome sequences except seven with average coverage of 10 times or less. Neighbor-joining trees were made using the number of differences model with a bootstrap value of 1000, Maximum Likelihood trees using the Tamura-Nei model with a bootstrap value of 1000, and Minimum Evolution trees using the number of differences model with a bootstrap value of 1000. Tables of nucleotide (nt) percentage differences were calculated for the complete genomes using the pairwise comparison function with the number of differences model. Final sequences were submitted to the European Nucleotide Archive (ENA) with accession numbers HG970847-HG970870 (Table 1).

Biological data
For host range studies, seven isolates of BYMV and one of ClYVV were mechanically inoculated onto leaves of L. angusti-

Sequence data
From the single ClYVV and 22 BYMV samples, the numbers of raw reads obtained from NGS were 10,841,138-31,131,660, but these numbers were reduced to 10,582,250-29,877,478 after trimming (Table 1). Following de novo assembly of each individual sample using CLCGW, the numbers of contigs produced were 149-2498. Contig of interest lengths were 534-9,655 nt with average coverage 3-10,173 times and the numbers of reads mapped to each contig were 18-987,972. After mapping to a reference genome in Geneious, the lengths of the consensus sequences were 9,034-10,324 nt, with average coverage of 4-12,313 times and the numbers of reads mapped to the references sequence were 471-1,002,513. Final sequence lengths consisted of the consensus of the contig from CLCGW and the consensus from Geneious, and were 9,274-9,530 nt. All samples yielded one sequence of interest, with the exception of FB, which contained a second BYMV sequence which we called ''LPexFB''. In all cases, except for ClYVV, the contigs of interest were most closely related to BYMV after being subjected to Blastn analysis. ClYVV was most closely related to the only other ClYVV complete genome available on Genbank. In total, there were nine complete genomes, ten nearly complete genomes (including ClYVV) and five partial genomes.

Phylogenetic analysis
Phylogenetic analysis comparing the coding regions of 23 new complete or nearly complete BYMV genomes and one new nearly complete ClYVV genome with those of 17 BYMV and one ClYVV genome retrieved from Genbank provided 100% bootstrap support for eight of nine phylogenetic groups (I, II, IV-IX). The remaining group (III) had 98% bootstrap support. Seven of the new genomes had average coverages of less than or equal to ten times (MD5, MD6, GB42C, ES69C, ES67C, PN77C and AR98C) and five of these (MD5, MD6, ES67C, ES69C and PN77C) did not sit well within groups I and II. Although they appear to belong to them, genomes such as MD6 and PN77C sit out on their own, almost separate from the other sequences, leaving groups I and II poorly resolved (Figure 1a). In contrast, when sequences of the seven genomes with poor average coverage (#10 times) were removed, phylogenetic analysis gave the same results but with much greater resolution between groups I and II and improved bootstrap support for groups I-IX (Figure 1b). Those removed were designated as ''draft'' genomes because all had low coverage and/or small gaps. When all the genomes, including those with poor coverage were analyzed using Maximum Likelihood or Minimum Evolution methods, the tree topologies shown were the same as the Neighbor-Joining method.
The range of original isolation hosts within each grouping varied ( Table 2). Group I consisted of nine sequences from two dicot, and two monocot species. Group II consisted of seven sequences from two dicot and one monocot species. Group III consisted entirely of three sequences from one monocot species. Group IV was made up of three sequences from an unknown original host or hosts, as well as two from a monocot and one from a dicot species. Groups V-IX consisted entirely of dicot species belonging to a single family, and were represented by up to three sequences. All dicot species were from families Fabaceae or Gentianaceae, and all monocot species were from families Orchidaceae or Iridaceae.

Sequence analysis
When the coding regions of the 16 new BYMV genomes (draft genomes excluded) and one ClYVV genome were analyzed against those retrieved from Genbank, the nt percentage identities within each phylogenetic group were $96.6% (I), $98.6% (II), $ 93.9% (III), $94% (IV), $90.7% (V), $99.8% (VI), $97.6% (VII) and $97.5% for ClYVV (Table S1). When the six sequences from L. angustifolius plants with BPS were compared to each other their percentage nt identites were $93.8%. When the sequences from all L. angustifolius plants were compared to each other their percentage nt identities were also $93.8%. Across all 33 BYMV sequences used in this analysis the nt identities were $75.6%. When the ClYVV sequences were compared to the BYMV sequences, overall the percentage nt identities were 66.4-67.9%.

Biological data
All seven BYMV isolates and one ClYVV isolate inoculated to plants caused systemic symptoms of varying severity in N. benthamiana, T. subterraneum and V. faba (Table 3). However, apart from ClYVV and BYMV isolate GB17A in V. faba, none of them induced systemic necrotic symptoms, which were severe only with ClYVV. In C. amaranticolor, ClYVV and five BYMV isolates caused obvious systemic symptoms, while infection was restricted to inoculated leaves with the isolate originally from L. angustifolius plants with BPS and another originally from an L. angustifolius plant with non-necrotic symptoms. In C. quinoa, although all isolates infected inoculated leaves, only ClYVV caused systemic invasion. In contrast, in P. sativum, only BYMV isolate LP caused any infection.
In L. angustifolius cvs Jenabillup and Mandelup, three BYMV isolates caused systemic non-necrotic symptoms. These were originally from plants of this species with non-necrotic symptoms (LMBNN) or systemic necrotic symptoms (ES11A), and L. cosentinii (MD7) from a plant with mosaic and leaf distortion. All other BYMV isolates and the ClYVV caused systemic necrotic symptoms in cvs Jenabillup and Mandelup. In accession P26697, with ClYVV and four BYMV isolates for which symptom data are available, the reactions resembled those in cv. Jenabillup, with the exception of MD5 which produced severe mosaic (i.e. nonnecrotic) symptoms instead of systemic necrosis. Isolates LBMNN and ES11A caused non-necrotic symptoms, while ClYVV and LP caused systemic necrosis. Failure of isolates AR93C and MD7 to infect P26697 probably represents escapes, but there was no seed left of P26697 for further testing. Isolate LP did not infect L. angustifolius cv. Mandelup on two separate occasions by sap inoculation, but further inoculations using grafting or aphids would be needed to establish if this is a resistance reaction.

Discussion
Before this study was conducted, there were only 17 complete BYMV genomes on Genbank. The ten complete and eight nearly complete genomes from this study doubled available BYMV   genomic data in the database. Moreover, the five additional partial genomes we obtained will be useful in future studies. Our genome results enabled the phylogenetic makeup of BYMV to be examined thoroughly, revealing presence of nine distinct groups, including the subdivision of the former generalist group into three new groups. We recommend replacing the phylogenetic groupings of Wylie et al. [21] with numbered group names (I-IX). We have not included one former specialist group based on CP genes, the canna group, in our analysis because it was not represented by any whole genome sequence. Use of whole genomes revealed that the previous phylogenetic grouping system based on partial genome sequences and original isolation hosts was unsustainable. This is because genome sequences from broad bean are present in two former specialist groups (now V and VII), from various Lupinus species in two former specialist groups (now VI and VIII), and two former generalist groups (now I and II). Moreover, although we have not re-analyzed sequences of CP genes, Wylie et al. [21] had previously placed a CP sequence from the dicot species Eustoma russellianum (family Gentianaceae), in the former monocot group (now III). Numbering of groups prevents such confusion arising from use of natural isolation host names. Our results highlight the importance of using complete genomes wherever possible to define phylogenetic groupings. The results also highlight the need for further sequencing and analysis of BYMV isolates likely to belong to former specialist phylogenetic groupings, which will provide greater insight into the genetic makeup of BYMV. Close examination of the nt percentage sequence identities between BYMV and ClYVV genomes revealed that the divergence between them is greater than previously thought. Overall, BYMV percentage nt identities ranged from 75.6 to 99.5%. The species demarcation for potyviruses is currently 23-24% divergence at the nt level [26], and some of the BYMV isolates compared came close to this. The two ClYVV genomes shared 97.5% nt identity, but when compared to all the BYMV genomes, nt identities were 66.4-67.9%, well beyond the species demarcation point for potyviruses. ClYVV was originally considered an isolate of BYMV but was later shown to be a distinct virus [26,27,28]. Our percentage identities support that distinction. However, some BYMV phylogenetic groups were more closely related to ClYVV than others. For example when compared with all other BYMV sequences, the single sequences from groups VIII and IX had percentage identities of just 78.4-79.8% and 75.6-76.9% to BYMV respectively, whereas when compared to ClYVV their nt percentage identities were 67.0-67.7% (Table S1). Again, further genome sequences from these groups and ClYVV are required for a more conclusive analysis.
Based on our phylogenetic and sequence analyses, BYMV isolates associated with BPS in L. angustifolius were not different phylogenetically from other BYMV isolates we sequenced from L. angustifolius, L. cosentinii, or other hosts within the same phylogenetic groups (I and II). Also, from the host data from our inoculations, there was no host reaction that could be used to distinguish a particular isolate as causing BPS. However, there were some other interesting differences. Although isolate ES11A behaved in a similar manner to isolate LMBNN, which overcomes the Nbm-1 hypersensitivity gene in L. angustifolius plants [19,20], it was isolated from a L. angustifolius plant originally displaying systemic necrosis. ClYVV behaved like isolate LP, but whether ClYVV interacts with both Nbm-1 and the second putative BYMV hypersensitivity genes, or unknown ClYVV-specific genes in L. angustifolius, is not clear [19]. ClYVV and all group I and II isolates failed to infect P. sativum cv. Greenfeast although the group VI isolate LP did cause infection. This may be due to the fact that this cultivar, like many commercial pea cultivars, may contain the BYMV resistance gene mo and ClYVV resistance genes cyv or cyv-2 [29,30] and their responses are strain specific. Induction of severe necrotic symptoms in V. faba by ClYVV but not the BYMV isolates is expected, as this is the classical method for distinguishing BYMV from ClYVV [10,20].
In this study, we used NGS to obtain complete virus genomes and it proved both an advantage and a disadvantage over traditional sequencing methods. It allowed large amounts of data to be generated quickly, but analysis of the data proved a major challenge. Many free programs exist for the assembly of NGS data (e.g. Velvet, SOAP de novo, Abyss and bowtie) but they all require the researcher to be proficient in the use of command line driven applications. As so-called ''benchtop biologists'', the use of Geneious and CLCGW was easy to learn and their cost was acceptable in view of the time saved in learning the use of command line driven programs. That said, our success was probably attributable to the small genome sizes of plant viruses, particularly BYMV and ClYVV, which are both c. 9535 nt long. Larger genomes, from unpurified RNA samples would undoubtedly be much harder to piece together, but not impossible. We found in most cases (17 out of 23) there was sufficient average coverage to be confident of good genome representation for the isolate sequenced. These sequences had average coverages as low as 65 and 457 with remaining average coverages being greater than 737 and up to 12,313 times when mapped back to a reference sequence using Geneious. Currently, sequencing a human genome of approximately 300 MB on an Illumina platform requires 30 times coverage to be adequate [31]. Therefore, it seems reasonable to designate our virus genomes with less than 30 times coverage as draft sequences. Although not meeting minimum requirements for average coverage, they are still valuable data sets, particularly given the low numbers of complete or nearly complete BYMV genomes available (now 32 including those from this study).
The settings used in de novo assembly are sufficient to distinguish between more than one strain or group of a plant virus when present in the same sample, as previously demonstrated by Kehoe et al. [9]. In our case, the sample from a V. faba BYMV isolate (FB) retrieved from the culture collection also contained a nearly complete LP isolate genome. The contamination probably occurred more than ten years ago when they were maintained next to each other in the same glasshouse prior to freeze-drying and storage in the collection. In such instances, if we had only been using Geneious to map to a reference genome, we would have likely missed the second sequence. It is therefore important to perform de novo assembly, as well as mapping to a reference genome. In cases where either the mapping or the de novo sequence had a gap, it was usually resolved after alignment with the sequence from the second program. However, for genomes with coverage less than ten (i.e. the draft genomes) this method was ineffective.
The uptake of NGS amongst plant virologists is increasing as the cost associated with it decreases [1]. The relatively small genome size of plant viruses allows us the opportunity to extract complete or nearly complete genomes using commercial packages. Use of NGS does raise concerns regarding the consequences of an increase in the discovery of virus or virus-like sequences. As such, MacDiarmid et al. [32] made recommendations regarding the identification of plant viruses through NGS, and the potential biosecurity issues associated with this. One of the recommendations was that the term ''uncultured virus'' should be used with any plant virus sequence not associated with a recognized virus infection. We support this recommendation whole-heartedly.
We know of no recommendation regarding requirements for depth of coverage for plant virus genomes, particularly ones involving new virus discoveries. Until such time as an appropriate set of comparative studies are done, we would recommend following in the path of our human genome colleagues by requiring a minimum coverage of at least 30 times, but this would likely lead to many nearly complete or draft plant virus genomes. However as with BYMV for example, we required coverage well into the 1000's to ensure a complete genome (including 59 and 39 UTRs, a constant challenge for plant virologists). Our samples sent for sequencing were total RNA, so different methods of sample preparation might have increased the numbers of virus reads. For example, use of subtractive hybridization [4], or extracting for dsRNA first, followed by random cDNA synthesis [1,33]. Despite this, there is no doubt that NGS has been an exceedingly useful tool for our study.