Analysis of intra-host genetic diversity of Prunus necrotic ringspot virus (PNRSV) using amplicon next generation sequencing

PCR amplicon next generation sequencing (NGS) analysis offers a broadly applicable and targeted approach to detect populations of both high- or low-frequency virus variants in one or more plant samples. In this study, amplicon NGS was used to explore the diversity of the tripartite genome virus, Prunus necrotic ringspot virus (PNRSV) from 53 PNRSV-infected trees using amplicons from conserved gene regions of each of PNRSV RNA1, RNA2 and RNA3. Sequencing of the amplicons from 53 PNRSV-infected trees revealed differing levels of polymorphism across the three different components of the PNRSV genome with a total number of 5040, 2083 and 5486 sequence variants observed for RNA1, RNA2 and RNA3 respectively. The RNA2 had the lowest diversity of sequences compared to RNA1 and RNA3, reflecting the lack of flexibility tolerated by the replicase gene that is encoded by this RNA component. Distinct PNRSV phylo-groups, consisting of closely related clusters of sequence variants, were observed in each of PNRSV RNA1, RNA2 and RNA3. Most plant samples had a single phylo-group for each RNA component. Haplotype network analysis showed that smaller clusters of PNRSV sequence variants were genetically connected to the largest sequence variant cluster within a phylo-group of each RNA component. Some plant samples had sequence variants occurring in multiple PNRSV phylo-groups in at least one of each RNA and these phylo-groups formed distinct clades that represent PNRSV genetic strains. Variants within the same phylo-group of each Prunus plant sample had ≥97% similarity and phylo-groups within a Prunus plant sample and between samples had less ≤97% similarity. Based on the analysis of diversity, a definition of a PNRSV genetic strain was proposed. The proposed definition was applied to determine the number of PNRSV genetic strains in each of the plant samples and the complexity in defining genetic strains in multipartite genome viruses was explored.

Introduction RNA viruses are genetically heterogeneous and their replication mechanism results in mutants or variants that may differ genetically from the original virus [1]. A single virus isolate (representing a single specific sample of virus) does not consist of a single RNA sequence, rather, it consists of a population of related sequence variants, often referred to as 'quasi-species', which are generated due to the error-prone nature of the viral RNA-dependent RNA polymerase and rapid replication of virus genomes [2][3][4]. These populations of sequence variants are significant in defining strains that make up virus species and have been shown to have implications in the emergence of new and fitter virus strains that result from changing selection pressures [2] and could complicate their diagnosis. The occurrence of quasi-species, or sequence variants, compounds the already complex challenge of defining strains for viruses with multipartite genomes. Multiple combinations of genome components, inherited from different parental lineages, may lead to the selection and emergence of highly virulent strains and wide host range [5][6][7]. Equally, the polymorphisms could lead to an accumulation of less virulent strains that are more persistent in the population due to their lack of symptom expression and latent infection status.
RNA viruses have been widely characterised genetically using direct sequencing of either virus amplicons generated from RT-PCR or sequencing the molecular clones of individual RT-PCR amplicons [8][9][10]. Although direct sequencing of a population of RT-PCR amplicons may be informative, the output is a consensus sequence of variants present in a virus-infected sample, which limits the suitability of direct sequencing in virus population diversity studies. Until recently, cloning and sequencing of individual RT-PCR amplicons has been the alternative method used to direct sequencing to assess the genetic diversity of RNA virus populations. However, this approach is time consuming and labour intensive, thus limiting virus population surveys to a few clones that can be feasibly sequenced [11]. It is also likely that cloning only identifies the sequence variants occurring with high frequency and some infrequent, but potentially important, variants may be missed. This pool of infrequent variants may be a reservoir on which selection can occur and from which fitter and more pathogenic virus variants and strains can emerge [12].
Next Generation Sequencing (NGS) allows for rapid and deep viral sequence coverage that is an obvious improvement for the characterization of RNA viruses compared to cloning and sequencing of cDNA fragments [13,14]. However, most applications of NGS, especially in plant virology, are designed for virus discovery and full genome sequencing, using de novo assembly and reference mapping tools to generate a virus consensus sequence output [13,15,16]. In addition, most sample preparation methods lead to high background levels of host sequences compared to virus sequences associated during NGS [17,18]. The consensus sequence and high levels of non-viral sequence can limit NGS to the identification of frequently occurring virus sequence variants. NGS of individual virus PCR amplicons offers an alternative approach to investigate diversity of virus populations, allowing targeted analysis of a specific region of the virus genome at greater depth of coverage [19,20] and minimizes background host sequence interference during bioinformatics analysis. The high coverage, depth and targeted approach that NGS of PCR amplicons provides also makes it possible to pool multiple virus amplicons in a single sequence run. NGS of amplicons also detects populations of both high and low-occurring virus sequence variants [11]. Amplicon sequencing of viral sequence variants has largely been applied to human or animal viruses [11,12,21,22]. The few studies on plant viruses have focused on viruses with monopartite genomes [23][24][25]. Less attention has been given to amplicon NGS of plant viruses with multipartite genomes and the subsequent analysis of observed sequence variants within each RNA component.
Prunus necrotic ringspot virus (PNRSV) (genus Ilarvirus) has a positive-sense, singlestranded tripartite RNA genome of 7,868 nucleotides (nt) in size, with each RNA component separately encapsidated with coat protein into icosahedral particles. RNA1 (3,332 nt) and RNA2 (1,943 nt) encodes for replicase proteins; and RNA3 (1,951 nt) encodes a movement protein (MP) and a coat protein (CP) [26,27]. PNRSV has been traditionally grouped into three serotypes [28] that have more recently been confirmed based on phylogenetic analysis of PNRSV RNA3 CP sequences [29,30]. These three PNRSV phylo-groups (PV32, PV96 and PE5) are based on RNA3 CP and are currently widely used for PNRSV isolate speciation [10,31,32]. However, these phylo-groups are only based on consensus sequences derived from cloning and/or direct Sanger sequencing of PNRSV isolates. It is unknown if low level variants exist within these phylo-groups and how these variants may contribute to PNRSV phylogeny. In addition, it is difficult to know if similar phylo-groupings occur across all the three RNA components of the PNRSV genome and how the combination of RNA variants within each component might represent groups of PNRSV genetic strains.
In this study, a broadly applicable approach of amplicon NGS and sequence analysis is described for virus genetic strain and diversity determination within and between virusinfected plant samples. The approach was successfully applied to PNRSV, which has a multipartite genome using RT-PCR amplicons from conserved regions of methyltransferase (MT), RNA dependent RNA polymerase (RdRp) and coat protein (CP) genes on RNA1, RNA2 and RNA3 respectively of PNRSV-infected Prunus trees (as determined by positive RT-PCR amplification with PNRSV-specific primers of RNA3). The sequence data generated were analysed using phylogenetic and identity-based methods to determine the relationship of the PNRSV amplicon sequence variants and from this analysis a definition of a PNRSV genetic strain is proposed.

Plant material and total RNA extraction
Leaf tissue samples from 53 Prunus trees that had previously been found to be infected with PNRSV by RT-PCR test with PNRSV-specific primers of RNA3 [33], were collected in spring (2014-15) from five states in Australia through channels provided by the Almond Board of Australia and the Victorian state government (S1 Table). Total RNA was extracted from 0.3g leaf tissue (fresh weight) of each sample using the RNeasy1 Plant Mini Kit (Qiagen) as described previously [33].

RT-PCR amplification and amplicon purification
The SuperScript™ III One-Step RT-PCR System (Invitrogen) was used for the amplification of 218 bp, 387 bp and 455 bp segments of the methyltransferase (MT) gene on RNA1, the RNA dependent RNA polymerase (RdRp) gene on RNA2, and the coat protein (CP) gene on RNA3, respectively (Table 1). One step RT-PCR was conducted according to the manufacturer's instructions except that the total reaction volume was 25 μl. Cycling conditions consisted of: a reverse transcription step at 50˚C for 30 minutes; denaturation at 95˚C for 15 minutes, followed by 35 cycles, denaturing at 95˚C for 1 minute, annealing for 30 seconds at 56˚C, 60˚C and 54˚C for MT, RdRP and CP primer pairs respectively, elongation at 72˚C for 1 minute; and a final elongation step at 72˚C for 10 minutes.
To confirm successful amplification of the MT, RdRp and CP genes segments in each of the 53 samples, 8 μl of each of the RT-PCR reactions were run on a 1% agarose gel and stained with SYBR1 Safe DNA gel stain (Invitrogen) for visualization. The remaining 17 μl of each the 159 PNRSV RT-PCR amplicons were purified using the Promega Wizard1 PCR clean-up kit (Promega) according to the manufacturer's instructions.

Library preparation for next generation sequencing
Twenty μl of each of the 159 purified amplicons were used to prepare the NGS amplicon libraries. For Illumina dual indexing, adapter mpxPE2 consisting of unique barcodes for each amplicon were ligated to the 3'-terminus to incorporate the sequencing primer site, while an adaptor containing one of 8 indexing sequences 5 bp in length (mpxPE1), was ligated to the 5'terminus of each of the amplicons using the NEBNext1 T4 ligase (New England BioLabs) in separate 30 µl reactions. The AMPure XP1 bead solution (Beckman Coulter) was used to purify the adapter-ligated amplicons prior to PCR enrichment.
For PCR enrichment, 6 μl of each of the adapter-ligated amplicons were used in separate 50 μl reactions each containing 4 μl of in-house multiplex PE barcode primer mix (2.5 μM) that was made up of the 8 mpxPE1 primers and mpxPE2 primers which were unique for each of the 159 amplicons and 40 μl Phusion1 high fidelity PCR mastermix (New England Bio-Labs). PCR cycling conditions consisted of: one cycle at 98˚C for 30 seconds, 15 cycles at 98˚C for 10 seconds, 65˚C for 30 seconds, 72˚C for 30 seconds; and a final extension step at 72˚C for 5 minutes. After the PCR, excess primers and any primer dimer present were removed using Ampure XP1 system (Beckman Coulter) according to the manufacturer's instructions and the final amplicon libraries eluted in 20 μl of RNAse-free water.
The size distribution and concentration of the amplicon libraries were determined using the 2200 TapeStation1 system (Agilent technologies) and Qubit1 Fluorometer 2.0 (Invitrogen) respectively. The resulting quantification values were used to pool the amplicon libraries at equal concentration. The pooled library was diluted to 4 nM, denatured, and diluted to a final concentration of 12 pM [34]. The amplicon library was sequenced using the Illumina MiSeq with a paired read length of 301 base pairs. The sequence read data for this study have been submitted to the NCBI Sequence Read Archive (SRA) database under the Bioproject accession PRJNA326695 and SRA study accession SRP077034.

Amplicon sequence reads pre-processing
The sequence reads were quality filtered, the adapter sequences were removed, and the sequence data subjected to read pair validation using Trim Galore! (version 0.4.0), with a quality score of >20 and minimum read length of 200 nt. [35]. The quality trimmed reads were then paired using PEAR (version 0.9.4), with default parameters [36] to generate paired-end reads. To ensure amplicon sequence reads from each PNRSV RNA gene segment had similar length and similar sequence start and stop, the paired reads in fastq format were first converted into FASTA format using fastq_to_fasta function of FASTX-Toolkit [37]. A local BLAST analysis with default settings using BLASTn (version 2.2.18) [38] was done to ensure all the paired amplicon sequence reads were in the correct forward orientation. Amplicon sequence reads that had a reverse orientation were reversed and complemented using fastx_reverse_complement function of FASTX-Toolkit [37]. The amplicon reads were then aligned using Muscle (version 3.8.31) [39] with default parameters. The overlapping alignment coverage for each RNA amplicon read was identified and Cutadapt (version 1.4.1) [40] was used to trim each RNA amplicon read as follows: 10 nt from 5' end and 8 nt from the 3' end of RNA1; 14 nt from the 5' end and 23 nt from the 3' end of RNA2; and 24 nt from the 5' end and 31 nt from the 3' end of RNA3. The resulting length of all amplicons reads for each of RNA1, RNA2 and RNA3 were 200 nt, 350 nt and 400 nt respectively; shorter reads in each set of amplicons were discarded.
The read depth for each of the MT (RNA1), RdRP (RNA2) and CP (RNA3) amplicons in of each of the 53 plant samples was calculated as the percentage fraction of number of reads of each amplicon to a combined total number of trimmed reads for all three amplicons.

Amplicon read clustering
Groups or clusters of unique sequence variants for each gene segment and their frequency were determined using Usearch (version 7.0.1090) [41]. Prior to analysis, the Illumina barcode read labels on each of the 159 amplicons were replaced with the name of the tree samples and RNA amplicon component using the Usearch fastq_strip_barcode_relabel.py python script. The Uclust function in the Usearch program was then used to cluster the relabeled amplicon reads at 100% identity/similarity and to determine the proportion of reads assigned to each cluster.
The resulting clusters were then sorted in descending order according to the proportion of reads represented using the Usearch "sortbysize" function. The clusters including single reads that did not have 100% identity to any other read were translated into amino acid sequences using Emboss Transeq (version 6.6.0) with default parameters [42] to identify the proportion of non-coding sequence clusters. The encoding amplicon cluster sequences were then translated back into nucleotide sequences using Emboss Backtranseq (version 6.6.0) [42] with default parameters.

Amplicon sequencing error detection control and filtering
Three transcribed RNA (trRNA) controls were produced from a single Prunus isolate cloned PNRSV RNA3 RT-PCR amplicon to assess the rate of sequence error introduced during RT-PCR and sequencing. To produce the controls, three RNA3 amplicons were cloned using the pGEM 1 -T Easy vector system (Promega). Three plasmids containing inserts of the expected size were selected and verified by DNA sequencing (BigDye v3.1 chemistry sequencing kit, Applied Biosystems). The plasmids were linearized using the SalI restriction enzyme (Promega) and transcribed into RNA using a MEGAScript 1 T7 Kit (Invitrogen), as recommended by the manufacturer.
DNA was removed from the trRNA using a DNA-free™ kit (Invitrogen) according to the manufacturers' instructions. PCR with PNRSV RNA3 primers (Table 1) was used to detect any remaining DNA in the DNase treated 20 μl trRNA reaction volume using Platinum™ Taq DNA Polymerase kit (Invitrogen) according to the manufacturer's instructions. PCR cycling conditions were: 1 cycle of 94˚C for 2 minutes followed by 35 cycles of 94˚C for 1 minute, 54˚C for 30 seconds, and 72˚C for 1 minute; with a final extension of 72˚C for 10 minutes. The DNase treatment was repeated as many times as necessary until DNA was not detected in the trRNA.
The three trRNA3 products were reverse transcribed and amplified using SuperScript™ III One-Step RT-PCR System (Invitrogen) and RNA3 primers (Table 1) as previously described. Separate libraries were prepared for each of the resulting amplicons and these were sequenced using the Illumina MiSeq with a paired read length of 301 base pairs as previously described. The resulting sequence reads were pre-processed and clustered at 100% identity as described previously.
The following process was used to determine the proportion (%) of error-associated reads and the error rate (%) occurring during RT-PCR and amplicon NGS: clusters of amplicon sequences for each trRNA control were compared to the original sequence of each clone; clusters that were not 100% identical to the original cloned PCR product were presumed to be error-associated; these error-associated clusters were also translated into amino acids using Emboss Transeq (version 6.6.0) with default parameters [42] to identify the proportion of non-coding clusters; the total number of amplicon sequencing reads in the error-associated clusters was divided by the total number of reads in each triplicate amplicon control to calculate the average proportion (%) of error-associated with RT-PCR and amplicon NGS; the NGS amplicon error rate (%) was calculated as the average of the proportion (%) of total nucleotide bases in the error-associated reads that were different to the source clone's sequence compared to the total nucleotide bases in each triplicate amplicon control.
It was assumed that this proportion of error was the same for the NGS data for each of the 159 amplicons. Therefore, a range of cluster sequences containing the same proportion of error-associated reads in ascending order starting from singletons and non-coding variants clusters were removed from the 159 amplicon samples data. The remaining filtered sequence variants from the 159 amplicon samples was used for phylogenetic and sequence identity analysis.

Phylogenetic and sequence identity analysis
Prior to phylogenetic and sequence identity analysis, sequences of the PNRSV type reference and phylo-groups sequences (AF278534, AF278535, U57046, Y07568, S78312, L38823; S2 Table) were retrieved from GenBank and trimmed according to each corresponding region of the genome that was amplified from each of the 53 Prunus tree samples and subsequently analysed in this study.
Nucleotide sequence from the type isolates and each of the filtered amplicon clusters, from all samples and for each RNA segment were aligned using Muscle (version 3.8.31) [39]. A maximum likelihood phylogenetic tree was constructed in RAxML (version 8 [44] and branches that had less than 90% bootstrap support were collapsed. Sequence identity analysis using the sequence demarcation tool (SDT) (version 1.2) [45] was carried out on the aligned amplicon clusters of each RNA and also on each of the individual amplicon clusters.

Amplicon cluster intra-host relationships
Haplotype networks were created in order to visualize the genetic variation and evolutionary relationship of amplicon sequence variant clusters for each of PNRSV RNA1, RNA2 and RNA3 at the intra-host level. The haplotype networks were constructed using Median Joining (MJ) algorithm [46] and visualized using PopART software (http://popart.otago.ac.nz).

Amplicon sequence variant calling and annotation
Each of the error-filtered RNA amplicon sequence variants were mapped to their respective PNRSV type reference (S2 Table) using the BWA (version 0.6.2) with default parameters [47]. The generated mapping SAM file was converted to BAM file format using Samtools (version 0.1.18) with default parameters [48]. Sequence mutations were then detected using Freebayes (version 1.0.2) with default parameters [49]. The resulting variant calling format (VCF) files were annotated using SnpEff (version 4.2) [50], to determine the functional significance of mutations identified.

Next generation sequencing data of PNRSV amplicons
The total number of raw reads was 3,431,678 when the NGS amplicon data for PNRSV MT, RdRp and CP gene segments of RNA1, RNA2 and RNA3 respectively, which were amplified from all 53 plant samples, were combined. The total number of reads for MT, RdRp and CP amplicons were 1,571,434, 541,757 and 1,318,487 respectively. After quality trimming there was an overall total number of 3,140,735 reads used for analysis and the read numbers for the MT, RdRp and CP amplicons were reduced to 1,449,457, 468,106 and 1,223,172 respectively ( Table 2). The read depth for PNRSV MT, RdRp and CP amplicons in each of the 53 plant samples, when compared to the overall total number of quality trimmed reads, ranged from 0.55-1.19%, 0.2-0.52% and 0.45-1.02% for MT, RdRp and CP gene segments amplified from RNA1, RNA2 and RNA3 respectively (S3 Table).

Amplicon read clustering
Clustering of the trimmed amplicon sequence reads at 100% identity resulted in unique sequence variants averaging at 1,032, 2,625 and 6,539 variants per sample for MT, RdRp and CP gene segments respectively, in each of the 53 plant samples ( Table 2). Singletons accounted for an average of 2%, 25% and 24% of total reads of MT, RdRp and CP gene variants respectively (S5, S6 and S7 Tables) and 80% of these reads resulted in non-coding amino acid sequence. On translation of all of the sequence variants, excluding singletons, into amino acid sequence, there were an average of 50 and 44 non-coding variant clusters for each of MT, and CP gene segments respectively across the 53 plants samples and more than 95% of these noncoding clusters had 2-10 reads (S6 and S8 Tables). Only three plant samples had non-coding variant clusters (57, 69 and 42 clusters) of RdRp gene segment and all of these were also clusters comprised of 1-10 reads (S7 Table).

Amplicon sequencing error detection control and filter
Sequencing and analysis of PCR amplicons from transcribed RNA (trRNA) error detection controls showed that the most abundant sequence in all analyses (i.e. the sequence with the highest frequency number of reads) in each triplicate control was 100% identical to their original source clone's sequence. It also revealed a high proportion of singletons averaging 29% of total reads and non-coding clusters averaging 20% of total clusters. This indicated that lowoccurring variants that includes singletons and non-coding variants are error-associated. These error-associated reads averaged 37% of the total reads generated in the error detection control (S4 Table) with an average amplicon error rate of 0.6% (S5 Table).
Based on the determined proportion of amplicon error-associated reads, 37% of total reads in ascending order, starting from singletons, were identified to occur mainly in low level variants with less than 10 reads in the 159 amplicons generated from the plant samples. Therefore, all variant sequences made up of less than 10 reads and non-coding variant sequences were removed from the data. A total of 1,034,572 (32%) of amplicon sequence reads were removed from the combined total of 3,140,735 quality trimmed and pre-processed reads. This resulted in a decrease of the average number of sequence variants per plant sample from 1,032 to 95 variants for the MT gene segment; 2,625 to 39 variants for the RdRp gene segment and 6,539 to 104 variants per sample for the CP gene segment ( Table 2).

Phylogenetic analysis and sequence identity analysis
Phylogenetic analysis of the error-filtered MT, RdRp and CP amplicon sequence variants of RNA1, RNA2 and RNA3 respectively from all 53 plant samples and the published PNRSV type reference and type phylo-groups provided >95% bootstrap support for two phylogenetic groups for each of RNA1 and RNA2 and three phylogenetic groups for RNA3 (Fig 1). In this study the phylo-groups PV32-I, PV96-II and PE5-III [30,51] were referred to as PG1, PGII and PGIII respectively and the phylogenetic groupings for both RNA1 and RNA2 were designated as PGI and PGII in accordance with this naming criteria. The largest phylo-groups for each PNRSV RNA segment were RNA1 PGI, RNA2 PGI and RNA3 PGII having 3788, 1692 and 3498 sequence variants respectively and occurred in a larger proportion of plant samples compared to the other phylo-groups. RNA1 PGI occurred in 49/53 plants samples, RNA2 PGI and RNA3 PGII each occurred in 46/53 plant samples (Fig 1). Seventeen of the 53 plant samples had amplicon sequence variants occurring in two or three phylogenetic groups of at least one of RNA1, RNA2 and/or RNA3 segments and 36/53 plant samples had amplicon sequence variants belonging to a single phylogenetic group for each RNA segment ( Table 3). The most frequently occurring combination of phylo-groups was RNA1 PGI, RNA2 PGI and RNA3 PGII (34/53 plant samples; Table 3).
SDT identity analysis was used to determine the identity demarcation of sequence variants in each of the 159 amplicons and the phylogenetic groups. Variants in each of the 159 amplicons had an identity cut-off of more than 97% with the exception of amplicons with variants occurring in multiple phylogenetic groups (Table 3). SDT identity analysis of amplicon sequence variants of each RNA showed similar groupings as observed in the phylogenetic analysis. Variants occurring within the same phylo-group for each of RNA1, RNA2 and RNA3 shared more than 97% similarity ( Table 3). The similarity of variants between phylo-groups for each of RNA1, RNA2 and RNA3 was less than 97% with some variants sharing as little as 84% similarity between phylo-groups (Table 3).

Amplicon sequence variants intra-host relationships
Median-joining haplotype networks were inferred to show the genetic and evolutionary relationship within each MT, RdRp and CP amplicon sequence variant in a sample. As an example, the haplotype network for the MT, RdRp and CP amplicon sequences of PNRSV of sample Q15 is shown in Fig 2 and the results indicate that there was a clear separation of variant clusters based on their RNA1, RNA2 and RNA3 phylo-groups. This trend was observed in the haplotype networks of all 53 amplicon samples (data not shown). In each RNA haplotype network, low-frequency sequence variants (variants made up of low copy number of reads)    Genetic diversity of Prunus necrotic ringspot virus (PNRSV) using amplicon next generation sequencing appeared to diverge in a star-burst pattern from the largest sequence variant (variants made up of highest copy number of reads) The multiple phylo-groups of RNA1 and RNA3 occurring in sample Q15 appeared as distinct populations separated by a series of nucleotide base changes in the haplotype networks.

Amplicon sequence variant calling and annotation
Variant calling on each of the RNA amplicon error-filtered sequence variants identified a total of 389, 211 and 682 nucleotide polymorphic sites on the amplified MT, RdRP and CP of RNA1, RNA2 and RNA3 respectively (S9 Table). The analysis showed that 387 of the 389 changes observed in RNA1 were synonymous mutations (S9 Table) with only one sample having 2 non-synonymous mutations that resulted in a missense substitution (S10 Table). In contrast, the majority of nucleotide changes observed in RNA2 and RNA3 (199 of 211 and 536 of 682 respectively) were non-synonymous mutations (S9 Table). All RNA2 and RNA3 non-synonymous mutations resulted in missense substitutions. In addition, all RNA3 amplicon samples belonging to PGII and III in 45 of the plant 53 samples each had a single hexa-nucleotide in-frame deletion mutation (S10 Table).

Discussion
This study describes a novel use of amplicon NGS to investigate the diversity of a tripartite genome virus, PNRSV, by focusing on conserved regions within the MT, RdRp and CP genes of each RNA component of the genome. The high sequence read depth associated with amplicon NGS allowed for an improved estimation of PNRSV diversity within and between plant samples when compared to other methods used for estimating virus diversity. This is the first in depth study of the genetic diversity of PNRSV RNA components 1 and 2. It provides evidence for two distinct PNRSV phylo-groupings for each of RNA1 and RNA2, which were observed for similar gene regions of RNA1 and RNA2 of the Bromoviridae family members Alfalfa mosaic virus (AMV) and Cucumber mosaic virus (CMV; genus Cucumovirus) [52,53]. Phylogenetic analysis of CP sequence variants from the 53 plant samples diverged into the three already defined PNRSV RNA3 phylo-groups [26,27,30,32], even though a smaller portion of the genome was used. The phylogenetic groups identified for RNA1, RNA2 and RNA3 of PNRSV in the 53 plant samples were further confirmed by pairwise sequence identity comparison of sequence variants within each PNRSV amplicon which had a ! 97% sequence identity. Therefore, an identity of 97% was determined as the demarcation threshold for each of the PNRSV RNA phylo-groups occurring within an amplicon sample.
The haplotype analysis also indicated that each phylo-group could consist of distinct clades of variants for each of the RNA1, RNA2 and RNA3 sequence variants within a plant sample and these were formed by smaller sequence variant clusters that were genetically connected but diverging from the largest sequence variant cluster (Fig 2). The occurrence of distinct multiple phylo-groups of an RNA component and distinct clades of variants within a phylo-group in a single plant (eg. Sample Q15; Fig 2) may indicate different infection events or infection at the same time by distinct RNA phylo-group variants or strains.
The PNRSV phylogenetic groups observed in this study were supported by the sequence identity analysis and haplotype networks, which indicates that these variant clusters within each of PNRSV MT, RdRp and CP amplicons represent a functional unit that is important in defining PNRSV species taxonomy. Virus species have widely been defined as a polythetic class whose members share a consensus group of properties but not all share a single common property [54][55][56]. These properties may include genome, biological and serological properties [57], and the variation in some of these properties among members of a virus species may be the differentiating element of species into variants. These variants of a virus species, can be further defined as strains if they are a genetically stable collection of variants that differ from the type variant in stable biological, serological or molecular characteristics [58][59][60].
In context to this study, phylogenetic analysis differentiated the PNRSV amplicon sequence variants in each RNA component into phylo-groups which represent distinct taxonomic units of each RNA component that may define a PNRSV genetic strain. We therefore propose the following definition of a PNRSV genetic strain based on our PNRSV amplicon sequence analysis: A genetic strain of PNRSV in a biological isolate (plant) must comprise of at least one variant of each RNA component that encodes the expected open reading frame (ORF); and may include sequence variants that are !97% similar. There is no formal taxonomic definition below the species level by ICTV for any plant virus and it may be possible to extend this proposed virus genetic strain definition to other species of both monopartite and multipartite genome viruses where functional proteins encoded by RNA component/s are required for a virus infection. However, like the demarcation criteria between virus species [61], the proposed 97% demarcation threshold assigned to PNRSV in this study may not hold for strains of other virus species and would need to be determined on a case by case basis.
In this study, the majority of plant samples (36 of 53) had variants of RNA1, RNA2 and RNA3 belonging to a single phylo-group and therefore could be described as having a single genetic strain of PNRSV. Based on the different possible combination of the phylo-groups for RNA1, RNA2 and RNA3 these 36 samples represent three PNRSV genetic strains. The most common PNRSV genetic strain consisting of single phylogenetic groups of each RNA component occurred in 34 samples and had the combination of RNA1 PGI, RNA2 PGI and RNA3 PGII [30]. The two remaining genetic strains each occurred in a single plant sample and consisted of RNA1 PGI, RNA2 PGI and RNA3 PGI and RNA1 PGII, RNA2 PGII and RNA3 PGI.
The existence of variants occurring in multiple phylo-groups for RNA1, RNA2 and/or RNA3 observed in 17 plant samples complicates the definition of a genetic strain for a multipartite virus and the determination of the number of PNRSV genetic strains present in a biological sample. The RNA components of a single-stranded RNA virus with a multipartite genome are frequently encapsidated separately allowing for pseudo-recombination or reassortment between the genome component variants [62,63]. This leads to a complexity that presents a challenge in accurately determining the number of genetic strains of a multipartite genome virus present in a plant. For example, tree sample M19 had two phylo-groups in each of RNA1, RNA2 and RNA3, which could either mean M19 is infected with two PNRSV genetic strains or may be up to eight (2 3 ) PNRSV genetic strains depending upon how the RNA components function together. Conversely in tree sample Q15 (Fig 2) there are two phylo-groups of RNA1 and RNA3 and one phylo-group of RNA2 suggesting that Q15 could either have 2 or 4 genetic strains of PNRSV depending on the different combinations the RNA components.
With this genetic complexity in mind, in this study each plant sample represents a biological isolate that may consist of one or more PNRSV genetic strains which are defined by the PNRSV RNA1, RNA2 and RNA 3 phylo-groups that were present. However, there was no association between PNRSV strains or specific RNA1, RNA2 or RNA3 phylogroups and host geographical origin or host specificity which supports similar observations of previous phylogenetic studies of PNRSV RNA3 [30,64]. In most samples, symptoms specific to a PNRSV infection were not observed but symptom expression could have been affected by the time of year that the sample was taken as well as the specific Prunus species or variety that was sampled.
The raw and clustered filtered NGS data showed that each of the PNRSV RNA2 genome components occurred with lower frequency compared to RNA1 and 3 amplicon sequence reads. Similarly, infection studies with the related tripartite genome virus, Alfalfa mosaic virus (AMV; genus Alfamovirus, family Bromoviridae) also showed that RNA2 occurred at a lower frequency than the other RNA genome segments [65]. The lower number of RdRp amplicon sequence variants is likely due to the highly conserved nature of the RdRp gene, which encodes several motifs necessary for replication [66,67]. However, the majority of nucleotide changes observed in RdRp and CP amplicon sequences were non-synonymous mutations which resulted in different amino acid sequence compared to the type isolate reference sequence [64,68]. RdRp non-synonymous mutations were mis-sense substitutions and the biological significance of these amino acid sequence changes is not known.
CP amplicons of all 53 samples had mis-sense mutations and CP amplicons of 45 of these samples which belong to phylo-groups PGII and PGIII also had a single hexa-nucleotide inframe deletion mutation. This six nucleotide deletion resulted in the deletion of two amino acids (Asp-Arg) when compared to the type reference [64] and has been reported previously as a characteristic feature of PNRSV CP PGII and III [30,64]. The biological significance of this deletion is not known, but it does serve as a PNRSV CP PGII-and III-specific feature that can be used to discriminate against PGI.
Regardless of the high level of sequence diversity at the nucleic acid level, all MT sequence variants, with the exception of two variants in one sample, had synonymous mutations that resulted in no change at the amino acid level. This high level of PNRSV methyltransferase amino acid sequence conservation could be associated with the importance of the MT gene in PNRSV replication [67,69], but might also be associated with less observed diversity due to the short RNA1 amplicon size (200 bp) used for this analysis.
The PNRSV amplicon NGS data from this study represent the minimum number of sequence variants within a virus isolate from a single sample of a tree at one time point. A high level of sequence diversity was observed across all the samples using NGS amplicon sequencing, with a total number of 5040, 2083 and 5486 sequence variants observed for MT, RdRp and CP amplicons respectively. This high level of diversity represents only a small portion of each RNA component and it is likely that a greater level of sequence diversity would be observed if the whole PNRSV genome sequence was analysed. Furthermore, the distribution of virus variant populations in a plant host can vary according to the plant tissue that was sampled [70,71] and in this study, variant detection may be biased towards the leaf tissue that was used.
Errors can occur during PCR and amplicon NGS which may inflate diversity [72][73][74][75]. Several different approaches to remove amplicon NGS errors that use statistical software have been reported [76][77][78][79][80]. Despite the usefulness of these error correction methods, Schirmer et al. [81] found that various experimental factors have a major influence on error patterns of sequence data. Therefore a stringent internal error detection control and filtering process was specifically designed for the experimental factors in this study to minimise the occurrence of sequences with errors so that they do not have a significant impact on the outcome of the bioinformatics analyses. The PNRSV amplicon sequencing error rate of 0.6% determined in this study was within the 97% amplicon sequence identity cut-off observed within PNRSV amplicon sequence variants present in each RNA phylo-grouping. Therefore, any error associated sequence variants that were filtered out in this study did not have a significant impact on the downstream PNRSV diversity analysis.
The error detection and filtering process that was used in this study could be adapted to other similar studies. However it is possible that the filtered variants, including some with non-coding sequences could be real [82]. The possibility also exists that some sequence variants that do not contain non-coding sequence artefacts of PCR amplification and sequencing, which are more difficult to distinguish from true biological variants, were missed during filtering and remained in the dataset.
This study explored the usefulness of amplicon NGS as an alternative to traditional cloning and Sanger sequencing to determine the genetic diversity of PNRSV in 53 Prunus plant samples from Australia. Based on the findings presented, we were able to propose a definition of a PNRSV genetic strain and determine the number of PNRSV genetic strains present in plant samples with sequence variants belonging to a single phylo-group in each RNA component. However, the complexity associated with variants occurring in multiple phylo-groups for RNA1, RNA2 and/or RNA3 made it difficult to determine the number of strains of PNRSV in these plant samples. Never the less the analytical approach described here is multi-dimensional and applicable to both monopartite and multipartite genome viruses, and may assist biosecurity agencies in defining virus strains of quarantine significance.
Supporting information S1 Table. The origin of each Prunus species sample used in this study.  Table. The total number of reads generated from transcribed RNA amplicon control triplicates and the resulting total number of clusters at 100% identity, the total number of nucleotide bases in all the clusters and the total number of error-associated nucleotide bases. The percentage amplicon error rate for each triplicate control and their average are also shown.