A computational method for the identification of Dengue, Zika and Chikungunya virus species and genotypes

In recent years, an increasing number of outbreaks of Dengue, Chikungunya and Zika viruses have been reported in Asia and the Americas. Monitoring virus genotype diversity is crucial to understand the emergence and spread of outbreaks, both aspects that are vital to develop effective prevention and treatment strategies. Hence, we developed an efficient method to classify virus sequences with respect to their species and sub-species (i.e. serotype and/or genotype). This tool provides an easy-to-use software implementation of this new method and was validated on a large dataset assessing the classification performance with respect to whole-genome sequences and partial-genome sequences. Available online: http://krisp.org.za/tools.php.


Introduction
In the recent years, an increasing number of outbreaks of Dengue (DENV), Chikungunya (CHIKV) and Zika (ZIKV) viruses have been reported in Asia and the Americas [1][2][3]. The predominant mosquito species transmitting DENV, CHIKV and ZIKV, are Aedes aegypti and Aedes Albopictus, which are widely distributed in tropical and sub-tropical regions [4]. In the past few years, several studies have reported concurrent outbreaks of DENV, CHIKV and ZIKV in the same geographical area [5,6]. Currently, unprecedented outbreaks of DENV, CHIKV and ZIKV are co-occurring in Brazil. In 2017, the Brazilian Ministry of Health estimated that approximately 251,000 suspected cases of DENV, 185,000 suspected cases of CHIKV and close to 18,000 suspected ZIKV cases had occurred in Brazil [7].
Monitoring virus genotype diversity is crucial to understand the emergence and spread of outbreaks, both aspects that are vital to develop effective prevention and treatment strategies. Both DENV and CHIKV epidemics are associated with a mortality and morbidity that puts a significant economic burden on the affected regions [8,9]. While infections with ZIKV are rarely fatal, as stated before, ZIKV infections may result in Guillain-Barré syndrome and congenital malformations [10,11]. Genomic surveillance of epidemics at the appropriate resolution and consistently classifying the reported genetic sequences, also enables the identification of strains associated with greater epidemic potential [12] or disease severity [13].
However, methods that consistently classify arbovirus sequences at the level of species and sub-species (i.e. serotype and/or genotype) are currently lacking. Additionally, whole genome sequences are often not available in routine clinical settings, forcing the use of shorter gene sequences to classify at viral species or sub-species level. It has however insufficiently been explored which genomic regions are most suitable for accurate classification.
A new computational method for the identification of DENV/CHIKV/ZIKV sequences, with respect to species and sub-species (i.e. serotype and/or genotype), is presented. The classification method is implemented in the Genome Detective software tool, which was validated on a large dataset by assessing the classification performance of whole-genome sequences, partial-genome sequences and products from next-generation sequencing methods. Furthermore, the suitability of different genomic regions for virus classification was evaluated.

Datasets
Global whole-genome sequence dataset (Global-WG). A dataset of previously published whole-genome sequences from GenBank [14] was compiled. This dataset consists out of 4,118 DENV sequences, 653 CHIKV sequences and 413 ZIKV sequences and contains DENV sequences for each of the four known serotypes: DENV-sero1 (n = 1688), DENV-sero2 (n = 1317), DENV-sero3 (n = 897) and DENV-sero4 (n = 216). The list of GenBank accession numbers for this global whole-genome dataset is available in the Supporting Information section (S1 File). In the remainder of this manuscript, this dataset will be referred to as Global-WG.
Global envelope sequence dataset (Global-ENV). A dataset of previously published envelope sequences from GenBank [14] was compiled. This dataset consists out of 4,118 DENV sequences, 2,531 CHIKV sequences and 413 ZIKV sequences and contains DENV sequences for each of the four known serotypes: DENV-sero1 (n = 1688), DENV-sero2 (n = 1317), DENV-sero3 (n = 897) and DENV-sero4 (n = 216). The list of GenBank accession numbers for this global envelope dataset is available in the Supporting Information section (S2 File). In the remainder of this manuscript, this dataset will be referred to as Global-ENV.
Identification of genotypes and selection of reference sequences. To identify the viral genotypes, a multiple sequence alignment was constructed with the MAFFT alignment software [15] per virus species, using the Global-WG dataset. Each alignment was edited manually until a codon-correct alignment was achieved in all genes. The next step in this exploration involved a phylogenetic analysis using PhyML (i.e. Maximum likelihood, 1000 bootstrap replicates) and MrBayes (i.e. Bayesian) [16,17]. With this approach, four main DENV clades (i.e. serotype 1 to 4) and 19 genotypes (i.e. 1I, 1II, 1III, 1IV, 1V, 2I, 2II, 2III, 2IV, 2V, 2VI, 3I, 3II, 3III, 3V, 4I, 4II, 4III and 4IV) were identified. These findings are in agreement with the current consensus in DENV classification [18][19][20][21]. For CHIKV, three phylogenetic clades can be distinguished: The East-Central-South African (ECSA) genotype, the Asian-Caribbean genotype and the West African genotype. The West African genotype being more divergent and less widespread than the ECSA genotype and the Asian-Caribbean genotype [22,23]. ZIKV, as well, can be classified into two genotypes. The African genotype, originally identified in Uganda in 1947 [24], is found in many African countries [25]. The Asian genotype was identified in Malaysia in 1966 [26], this genotype has recently caused the worldwide epidemic in Asia and the Pacific [27,28], and is responsible for the epidemic in the Americas [5].
The accuracy and consistency with which a method identifies viral species and genotype clades depends on the selection of a set of representative reference sequences [29][30][31].
The initial step in the selection of reference strains for our method involved the identification of highly divergent but equidistant whole-genome sequences that are representative for the diversity within the different DENV, CHIKV and ZIKV genotypes, by screening all published complete genome sequences in our Global-WG dataset. For example, we normally start by selecting 5-10 sequences that represent the diversity of each virus genotypes. Sequences that met these selection criteria were quality controlled for the presence of insertions, deletions, frame shifts and non-IUPAC characters using VIRULIGN [32]. For DENV, we used the reference sequences that are included with the VIRULIGN software, for ZIKV, we used the reference sequence presented in [33], and for CHIKV we constructed a new reference sequence from NC_004162 that we added to the VIRULIGN repository. Sequences that pass the quality control were aligned using MAFFT [15], and were subjected to phylogenetic analysis using PAUP � (i.e. Neighbor Joining), MrBayes (i.e. Bayesian) and PhyML (i.e. Maximum likelihood) [16,17,34,35] using GTR+G+I. Sequences that gave consistent topologies using all three tree inference methods were retained as potential reference sequences (see Supporting Information, S1 Table) and used in the next step of the evaluation process.
We established that none of the selected reference strains were recombinants (S2 Fig) using the recombination detection program RDP4 [36] Suitability of sub-genomic regions for genotyping purposes. The reference strain dataset (S1 Table) was then explored to establish the suitability of sub-genomic regions for automated genotyping. Two different methods were used.
The first was a boot-scanning method, using a sliding window approach exploring the range between 200 and 2,000 nucleotides. All windows across the genome were used for the construction of Neighbor joining trees with 1,000 bootstrap replicates. The aim was to find the size and segments of the genome that would correctly classify a query sequence with a bootstrap support of >70%.
The second method involved the calculation of the phylogenetic signal present in each of the DENV, CHIKV and ZIKV genes, using the same set of reference sequences. To compute the phylogenetic signal, the TreePuzzle software [37] implementation of the likelihood-mapping method [38] was used. Only between-genotype quartets were evaluated. Quartet puzzling essentially is a three-step procedure, first reconstructing all possible quartet maximum likelihood trees (maximum-likelihood step), then repeatedly combining the quartet trees to an overall tree (puzzling step), and finally computing the majority rule consensus of all intermediate trees giving the quartet puzzling tree (consensus step).

Classification method and implementation
Classification method. Our method involves a viral classification pipeline, drawing inspiration from the one described previously to classify HIV, hepatitis C virus and human T-lymphotropic virus sequences [29,30]. The classification pipeline presented here consists of two classification components. The first classification component enables species and sub-species assignments. The classification analysis subjects a query sequence to a BLAST analysis against a set of reference sequences [39]. A query is assigned to a particular type when BLAST reports an assignment with a score that exceeds a predefined threshold.
The second classification component involves the construction of a Neighbor Joining phylogenetic tree. This component enables assignments on genotype and/or subtype level. First, the query sequence is aligned with a set of reference sequences.
The alignment is produced using the profile alignment option in the ClustalW software [40], such that the query sequence is added to the existing alignment of reference sequences. Subsequent to the alignment, a Neighbor Joining phylogenetic tree, with 100 bootstrap replicates, is constructed. The tree is constructed using the HKY distance metric with gamma among-site rate variation, as implemented in the PAUP � software [34]. The query sequence is assigned to a particular genotype if it clusters monophyletically with that genotype clade with bootstrap support >70%. If the bootstrap support is <70%, the genotype is reported to be unassigned.
Software implementation. While the classification method was inspired by the one previously presented [29], a new software framework was developed to be easily adaptable to the classification procedures for various viral pathogens. All source code is written in the Java programming language (Fig 1). The software framework is part of the Genome Detective toolchain [41] ArboTyping classification method and implementation. Firstly, the viral species is determined using BLAST, classifying the sequence as DENV, CHIKV or ZIKV.
In case the submitted sequence was assigned either as ZIKV or CHIKV, a Neighbor joining tree is inferred to determine the respective ZIKV or CHIKV genotype. Only for DENV, another BLAST procedure is invoked to assign the serotype first. Based on the inferred serotype, a serotype specific Neighbor joining tree is constructed to determine the Dengue genotype.
For each of these steps, the earlier discussed reference strains were used, with respect to the appropriate typing level (i.e. virus species, serotype or genotype). This process is summarized in a decision tree in Fig 2. Testing revealed that a BLAST cut-off value of 200 allowed accurate identification of the virus species and DENV serotypes using sequence segments >150 base pairs.
Note that the species and serotype classification procedure are implemented as separate BLAST steps. This enables the tool to efficiently perform large throughput species classification, such as for the classification of next-generation sequencing reads.
An instance of the ArboTyping web application is publically available on a dedicated server (http://krisp.org.za/tools.php). The web interface on this server accepts up to 2,000 whole-genome or partial genome sequences at a time. The tool can be accessed by the Genome Detective interface or by the selection of individual viruses typing tool (i.e. Zika, Dengue and Chikungunya). . The typing report presents information about the sequence name of the query sequence, the nucleotide length of the sequence, an illustration of the position of the sequence in the virus' genome, the species assignment and the genotype assignment. A detailed report is provided for the phylogenetic analysis that resulted into this classification. All results can be exported to a variety of file formats (XML, CSV, Excel or FASTA format). The detailed HTML report (B) contains information on the sequence name, length, assigned virus and genotype, an illustration of the position of the sequence in the virus' genome and the phylogenetic analysis section. The phylogenetic analysis section shows the alignment and constructed phylogeny: the query sequence is always shown at the top of the phylogenetic tree. https://doi.org/10.1371/journal.pntd.0007231.g001 Classification performance for whole-genomes and sub-genomic regions. To determine the accuracy of the automated method for whole-genome sequences, the method was evaluated on a whole-genome sequence dataset (i.e. Global-WG dataset).
As sequences from sub-genomic regions are more commonly available than whole-genome sequences, the method's accuracy was also evaluated in this context. For this purpose, the envelope sequences in the Global-ENV dataset were used for evaluation.
Each of the sequences considered for evaluation was assigned using both the gold standard and the here described automated method. The gold standard, a manual classification consists of performing an assignment using both Bayesian (i.e. MrBayes, assignment with posterior > 90% [17]) and Maximum likelihood (i.e. PhyML, 1000 bootstrap replicates, assignment with > 70% of replicates [16]) phylogenetic analysis. When the assignments generated by both the Bayesian and Maximum likelihood technique match, the classification is confirmed [31]. , the viral species is determined using BLAST. When the submitted sequence is a Zika virus, a Neighbor joining tree is constructed to determine the Zika genotype (B). When the submitted sequence is a Chikungunya virus, a Neighbor joining tree is constructed to determine the Chikungunya genotype (C). When the submitted sequence is a Dengue virus, the serotype is determined using another BLAST invocation (D). Based on the inferred serotype, a serotype specific Neighbor joining tree is constructed to determine the Dengue genotype (E, F, G, H). https://doi.org/10.1371/journal.pntd.0007231.g002 The sensitivity, specificity and accuracy of our method was calculated for both species assignment and genotyping. Sensitivity was computed by the formula TP TPþFN , specificity by the formula TN TNþFP and accuracy by the formula TPþTN TPþFPþFNþTN [42]. In these formulas: TP = True Positives, FP = False Positives, TN = True Negatives and FN = False Negatives.

ArboTyping classification method and implementation
An efficient method to classify virus sequences with respect to their species and sub-species (i.e. serotype and/or genotype) was developed. This method was implemented in Java and this implementation was integrated in an easy-to-use web interface. A detailed description of the method and its implementation can be found in the 'Classification method and implementation' Methods subsection.

Suitability of sub-genomic regions for genotyping purposes
Two different methods were used to verify the suitability of sub-genomic regions for genotyping purposes: a boot-scanning method and a likelihood-mapping method (see Methods).
For DENV, the only sub genomic region that supports confident genotype assignment across the four different serotypes was the envelope gene. For CHIKV, the envelope region E1 was the only region that allowed consistent assignment. The boot-scanning analysis showed that for ZIKV, segments of around 1,200-1,500 base pairs support the genotype assignment with bootstrap > 70% (Fig 3). This was the case over the entire genome, with the exception of the end of the genome (i.e. the non-coding region) and near the NS3 region, where bootstraps fell below 60%.
Our likelihood-mapping analyses show that for DENV, the envelope, NS1, NS3 and NS5 had good phylogenetic signal across all four serotypes. For CHIKV, the envelope E2 gene had the best signal but this region did not provide good boot-scanning support for the classification of the ECSA genotype (Fig 3). For ZIKV, the envelope, NS1, NS2A, NS3, NS4A, NS4B and NS5 regions had good phylogenetic signal. A detailed overview of the results of the likelihood-mapping analysis can be found in the S2 Table of the Supporting Information. In summary, these analyses show that the envelope genes of the reference datasets of the three pathogens (DENV, 1,485 nucleotides; CHIKV, 1,317 nucleotides; ZIKV, 1,525 nucleotides) are the most suitable targets for reliable genotype classification.

Classification performance for whole-genome sequences
Our automated method provided specificity, sensitivity and accuracy of 100% for the identification of complete genomes for all viral species and genotypes compared to the gold standard, a manual classification. For a detailed overview of the DENV, CHIKV and ZIKV assignment performance, we refer to the Supporting Information S3 Table. Only ten of 4118 DENV whole-genomes could not be classified at the genotype level, either by manual phylogenetic analysis or by our automated method. Notably, the seven sequences (AF298807, KF864667, EU179860, JQ922546, KF184975, KF289073, EF457905 of DENV--Sero1 were outliers in the phylogenetic tree (see Supporting Information, S1 Fig). We tested all ten sequences for recombination using boot-scanning (see Supporting Information, S2 Fig) and the recombination detection program RDP4 [36]. We only found sequence AY496879 to be a clear recombinant of DENV genotype 3I and 3II. The two other sequences (DENV-Sero2 KF744408 and DENV-Sero3 JF262783) were also identified as a divergent outlier.

Classification performance for sub-genomic regions
Our analysis shows that the classification results for the envelope sub-genomic region at the species and genotype level were similar to that obtained using whole-genome sequences and largely in agreement with the gold standard, a manual classification.
For DENV, most of the genotypes were classified with great accuracy (i.e. specificity and sensitivity > 99%) using the envelope gene. The exception was DENV-sero2 genotype IV, of which 41 envelope sequences were available and for which 33 were correctly identified (i.e. sensitivity 80.49%, specificity 100%). The CHIKV sequences covering the E1 region were accurately classified for all three genotypes (i.e. 100% sensitivity and specificity). All the ZIKV envelope sequences were classified with 100% sensitivity and specificity. For a detailed overview of the DENV, CHIKV and ZIKV assignment performance refer to Supporting Information S4 Table. Since a good phylogenetic signal was reported for the DENV and ZIKV NS5 region and the CHIKV E2 region, a classification analysis was performed for these regions as well. For the DENV NS5 region a sensitivity of 57,48% and specificity of 31,35%) was observed. Nearly all ZIKV NS5 sequences were correctly assigned to the African genotype (i.e. sensitivity of 97.72% and specificity of 100%). This indicates that the ZIKV NS5 region might also be used for genotype classification. For CHIKV, the E2 region showed perfect accuracy, similar to the E1 region (i.e. specificity and sensitivity of 100%). However, our previous boot-scanning support showed that the genetically variable E2 region may cause problems for some strains to be correctly identified as ECSA genotype.
In summary, our results suggest that the envelope region of DENV and ZIKV and the E1 envelope region of CHIKV are suitable for genotyping purposes. In addition, these regions contain the largest number of sequences in public databases, which easily allows for a wide range of comparative analyses and validation experiments.

Discussion
Emerging infectious diseases caused by viral pathogens still represent a major threat to public health worldwide, as recently demonstrated by outbreaks of Ebola, Zika, Middle East Respiratory Syndrome (MERS) and Yellow Fever virus. Fast and accurate real-time monitoring of outbreaks and surveillance of on-going epidemics is crucial to anticipate viral spread and to design effective prevention or treatment strategies. To this end, an accurate and reliable method for the classification of ZIKV/DENV/CHIKV arboviruses was developed: The ArboTyping tool.
The ArboTyping tool implements a classification pipeline that consists of a BLAST-based species assignment and phylogenetic assessment to identify subspecies (i.e. genotypes) with respect to a set of reference strains, as exemplified for other virus species by previous work [29][30][31]. To enable accurate classification, a set of reference sequences that cover the extent of diversity within species and subspecies, was carefully selected.
The classification performance of the ArboTyping tool was assessed on a dataset of wholegenome sequences. All whole-genome sequences in this dataset that could be confidently assigned a species and genotype with the gold standard, a manual classification procedure, were concordant with the typing tool.
There were, however, 10 sequences that could not be classified using the manual classification procedure: further analyses show that these 10 sequences consist out of 3 outlier sequences, 2 clades of outlier sequences (3 sequences in each outlier clade) and 1 recombinant sequence. As these outliers have been previously identified [43], these results need to be further investigated to assess whether these outliers form new genotypes [44].
However, whole-genome sequences are currently not routinely available and the suitability of the different genomic regions was evaluated with respect to their use for classification. Since the envelope gene is a popular target for phylogenetic classification, there is a large availability of envelope sequences in public databases. Therefore, the performance of the ArboTyping tool was evaluated on a large dataset of envelope sequences (i.e. Global-ENV dataset). For these envelope sequences, a classification performance close to the tool's performance on wholegenome sequences was reported.
While the availability of sequence products originating from other genomic regions is currently low, it can be expected that these regions will increase in relevance given the interest in developing antiviral agents that target non-structural proteins. Therefore, more detailed studies to assess the classification performance of other genomic regions are warranted [44].
In this manuscript, we focus on the classification of consensus sequences on the species and sub-species level. However, Genome Detective, the framework in which our tools are integrated, is also a virus discovery toolchain [41]. Genome Detective's user interface allows users to supply raw next-generation sequence reads that can be automatically assembled into a consensus and passed to the ArboTyping tool. Details on the methods used to assemble reads in Genome Detective and an extensive validation using raw NGS reads can be found in [41].
In conclusion, the new method presented here allows the fast, accurate and high-throughput classification of DENV, CHIKV and ZIKV species and genotypes. Species can be classified using different sequencing products (i.e. whole-genome sequences, envelope sequences and individual next-generation sequencing reads) and genotypes can be classified most confidently when using envelope sequences or whole-genome sequences. This method accommodates the need to consistently and accurately classify DENV/CHIKV/ZIKV sequences, which is essential to implement epidemic tracing and to support outbreak surveillance efforts. Additionally, we present a solid framework that has the potential to serve as the foundation for many other arbovirus classification tools. These tools are also useful to be integrated in data management environments [45].
Our method is implemented in the Genome Detective software framework, suitable for many virus typing tools. The web application that makes our tool available through an easy-touse web interface is available online via a dedicated server that is hosted at http://www.krisp. org.za/tools.php.
Supporting information S1 Fig. Maximum likelihood phylogenetic tree of the DENV-sero1 outliers. All full genome DENV-sero1 sequences were assigned to genotype-level using manual phylogenetic analysis and classification by the automated typing tool. In total, seven full genomes of DENV-sero1 could not be classified at genotype level by either classification method. These seven sequences are visualized in a phylogenetic tree of the WGS datasets, colored according to genotype. (1I in blue, 1II in green, 1III in red, 1IV in yellow, 1V in pink) It can be seen that a divergent cluster of six genomes (AF298807, KF864667, KF184975, EU179860, KF289073 and JQ922546 in black) form an outlier clade and one genome (EF457905 in black) can be considered an outlier. However, note that these seven genomes could be properly assigned to serotype 1. for the ten whole genomes of DENV that could not be classified at genotype level are shown. Boot-scanning analysis was performed using a window length of 1500 base pairs and a step size of 100 base pairs. The different colours represent the genotypes for each serotype. The X-axis represents the nucleotide position in the genome and the Y-axis represents bootstrap results in percentages. In total, 7 DENV-sero1 sequences were analysed and 1 sequence for each of the other serotypes, i.e. DENV-sero2, DENV-sero3 and DENV-sero4. We only found sequence AY496879 to be a recombinant of DENV genotype 3I and 3II. The other sequences are outliers (i.e. JF262783, KF744408, EF457905) or clades of outliers (i.e.: AF298807, KF864667 and KF184975 form an outlier clade; EU179860, KF289073 and JQ922546 form an outlier clade). (TIF) S1 Table. Reference strains selected for the DENV, CHIKV, ZIKV genotypes. These reference sequences were selected to be representative for the diversity within the different DENV, CHIKV and ZIKV genotypes that circulate within these virus species. (DOCX) S2 Table. Phylogenetic signal estimated by likelihood mapping for DENV (DENV-sero1 to DENV-sero4), CHIKV and ZIKV sub-genomic regions. Phylogenetic signal was calculated separately per protein by the likelihood mapping method implemented in the software Tree-Puzzle. Likelihood mapping analysis computes the likelihood of the three possible trees that can be constructed from all possible inter-genotype quartets of taxa. The results for the resolved quartets and unresolved quartets are shown in the table, while the partially resolved quartets are not listed (can be obtained by 100%-(un)resolved quartets). Partially resolved quartets represent the quartets for which conflicting phylogenetic signal or potential recombination is present. Genomic regions for which the percentage of resolved quartets is higher than 90% are shaded in orange and are considered to be characterized by sufficient phylogenetic signal. (DOCX) S3 Table. Evaluation of the automated phylogenetic method to classify DENV, CHIKV and ZIKV whole-genome genomes. The new classification method consists of 2 parts: determining the species (and for DENV also the serotype) using a BLAST procedure, followed by determining the genotype using an automated phylogenetic method. Our method was able to assign all sequences in the whole-genome validation dataset to the right species and DENV serotype. Therefore, in this table, we focus on the classification performance with respect to genotype assignment, based on the output of the BLAST step (i.e. a dataset of the proper species and serotype). The classification results were compared to manual phylogenetic analysis. Column names: TP = total positives, TN = total negatives, FP = false positive, FN = false negative, SENS = sensitivity, SPEC = specificity, ACC = accuracy. (DOCX) S4 Table. Evaluation of the automated phylogenetic method to classify DENV, CHIKV and ZIKV envelope genomes. The new classification method consists of 2 parts: determining the species (and for DENV also the serotype) using a BLAST procedure, followed by determining the genotype using an automated phylogenetic method. Our method was able to assign all sequences in the envelope validation dataset to the right species and DENV serotype. Therefore, in this table, we focus on the classification performance with respect to genotype assignment, based on the output of the BLAST step (i.e. a dataset of the proper species and serotype). The classification results were compared to manual phylogenetic analysis. Column names: TP = total positives, TN = total negatives, FP = false positive, FN = false negative, SENS = sensitivity, SPEC = specificity, ACC = accuracy. (DOCX) S1 File. Accession number of the sequences collected from DENV, ZIKV and CHIKV whole-genome genomes. A GenBank mining of sequences was performed against whole-genome genomes of these viruses that had the genotype reported for sensitivity, specificity and accuracy tests of the tool. (XLSX) S2 File. Accession number of the sequences collected from DENV, ZIKV and CHIKV envelope genomes. A GenBank mining of sequences was performed against envelope genomes of these viruses that had the genotype reported for sensitivity, specificity and accuracy tests of the tool. (XLSX)