An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes

For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2, 784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.

Subtype classification is an important and challenging problem in the field of virology. 2 Subtypes (also termed clades or genotypes) are a fundamental unit of virus 3 nomenclature (taxonomy) within a defined species, where each subtype corresponds to a 4 cluster of genetic similarity among isolates from the global population. Defined subtype 5 references for hepatitis C virus, for example, can diverge by as much as 30% of the 6 nucleotide genome sequence [1], but there is no consistent threshold among virus species.
vector" which contains information on the number and distribution of nucleotides in the 59 sequence, applied to the classification of single-segmented [18] and multi-segmented [20] 60 whole viral genomes, as well as viral proteomes [21]; based on neural networks using 61 digital signal processing techniques to yield "genomic cepstral coefficient" features, 62 applied to distinguishing four different pathogenic viruses [22]; based on different 63 genomic materials (namely DNA sequences, protein sequences, and functional domains), 64 applied to the classification of some viral species at the order, family, and genus 65 levels [23]; and based on interpolated Markov models, applied to the phylogenetic 66 classification of metagenomic samples [24]. 67 k-mer-based classifiers 68 The use of k-mer (substrings of length k) frequencies for phylogenetic applications 69 started with Blaisdell, who reported success in constructing accurate phylogenetic trees 70 from several coding and non-coding nDNA sequences [25] and some mammalian alpha 71 and beta-globin genes [26]. Other authors [27][28][29][30][31] have observed that the excess and 72 scarcity of specific k-mers, across a variety of different DNA sequence types (including 73 viral DNA in [27]), can be explained by factors such as physical DNA/RNA structure, 74 mutational events, and some prokaryotic and eukaryotic repair and correction systems. 75 Typically, k-mer frequency vectors are paired together with a distance function in order 76 to measure the quantitative similarity between any pair of sequences. Studies measuring 77 quantitative similarity between DNA sequences from different sources have been 78 performed, for instance using the Manhattan distance [32,33], the weighted or 79 standardized Euclidean distance [34,35], and the Jensen-Shannon distance [36,37]. 80 Applications of these distances and others have been compared and benchmarked 81 in [38][39][40][41], and detailed reviews of the literature can be found in [42][43][44][45]. 82 In the context of viral phylogenetics, k-mer frequency vectors paired with a distance 83 metric have been used to construct pairwise distance matrices and derive phylogenetic 84 trees, e.g., dsDNA eukaryotic viruses [46], or fragments from Flaviviridae genomes [47]. 85 Other studies have investigated the multifractal properties of k-mer frequency patterns 86 in HIV-1 genomes [48], and the changes in dinucleotide frequencies in the HIV genome 87 across different years [49]. We used k-mer frequency vectors to train supervised 88 classification algorithms. Similar approaches have previously been explored (with 89 different classifiers than those used here), for example to subtype Influenza and classify 90 Polyoma and Rhinovirus fragments [50], to predict HPV genotypes [51,52], to classify 91 whole bacterial genomes to their corresponding taxonomic groups at different levels [53], 92 to classify whole eukaryotic mitochondrial genomes [54][55][56][57], to classify microbial 93 metagenomic samples [58], to predict virus-host relationships for some bacterial 94 genera [59], and to identify viral sequences in metagenomic samples [60]. 95 To evaluate our method, we curated manually-validated testing sets of 'real-world' 96 HIV-1 data sets. We assessed fifteen classification algorithms and conclude that for 97 these data the SVM-based classifiers, multilayer perceptron, and logistic regression 98 achieved the highest accuracy, with the SVM-based classifiers also achieving the lowest 99 running time out of those. We measured classification accuracy and running time for 100 k-mers of length k = 1 . . . 10 and found that k = 6 provides the optimal balance of 101 accuracy and speed. Overall, our open-source method obtains a classification accuracy 102 average of 97%, with individual accuracies equal or exceeding other subtyping methods 103 for most datasets, and processes over 1,500 sequences per second. Our method is also Methods 108 Supervised classification 109 First, we needed to determine which supervised classification method would be the most 110 effective for classifying virus sequences, using their respective k-mer frequencies as 111 feature vectors (numerical representations). We trained each of 15 classifiers (Table 2) 112 on a set S = {s 1 , s 2 , . . . s n } of nucleotide sequences partitioned into groups 113 g 1 , g 2 , . . . , g p . Given as input any new, previously unseen, sequence (i.e., not in the 114 dataset S), the method outputs a prediction of the group g i that the sequence belongs 115 to, having 'learned' from the training set S the correspondence between the k-mer 116 frequencies of training sequences and their groups. The feature vector F k (s) for an 117 input sequence s was constructed from the number of occurrences of all 4 k possible 118 k-mers (given the nucleotide alphabet {A, C, G, T }), divided by the total length of s.

119
Any ambiguous nucleotide codes (e.g., 'N' for completely ambiguous nucleotides) were 120 removed from s before computing F k (s).

121
Next, we processed the feature vectors for more efficient use by classifiers. We 122 rescaled the k-mer frequencies in F k (s) to have a standard deviation of 1, which satisfies 123 some statistical assumptions invoked by several classification methods. In addition, we 124 performed dimensionality reduction using truncated singular value decomposition [61] to 125 reduce the vectors to 10% of the average number of non-zero entries of the feature 126 vectors. This greatly reduces running time for most classifiers while having a negligible 127 effect on classification accuracy.

128
Finally, we trained a supervised classifier on the vectors F k (s). Supervised classifiers, 129 in general, can be intuitively thought of as constructing a mapping from the input 130 feature space to another space which in some sense effectively separates each training 131 class. As a concrete example, the support vector machine (SVM) classifier maps the 132 input space to another space of equal or higher dimensionality using a kernel function, 133 and then selects hyperplanes that represent the largest separation between every pair of 134 classes. Those hyperplanes induce a partition on the transformed space which is then 135 used for the classification of new items. We tested fifteen different specific classifier 136 algorithms: 10-nearest-neighbors [62] with Euclidean metric and uniform weights 137 (10-nearest-neighbors); nearest centroid, to class mean (nearest-centroid-mean) 138 and to class median (nearest-centroid-median) [63]; logistic regression with L2 139 regularization, one-vs-rest as the multiclass generalization, stopping tolerance of 10 −4 , 140 and regularization strength (λ) of 1 (logistic-regression) [64]; SVM with the linear 141 (linear-svm), quadratic (quadratic-svm), and cubic (cubic-svm) kernel functions, 142 with penalty parameter of 1 and stopping tolerance of 10 −3 [65]; SVM with stochastic 143 gradient descent learning (maximum 5 training epochs) and linear kernel function 144 (sgd) [66]; decision tree with Gini impurity metric (decision-tree) [67]; random forest 145 using 10 decision trees with Gini impurity metric as sub-estimators 146 (random-forest) [68]; AdaBoost with maximum 50 decision trees as the weak learners 147 and the SAMME.R real boosting algorithm (adaboost) [69,70]; Gaussian naïve Bayes 148 (gaussian-naive-bayes) [71]; linear (lda) and quadratic (qda) discriminant 149 analysis [72]; and multi-layer perceptron with a single 100-neuron hidden layer, rectified 150 linear unit (ReLU) activation function, the Adam stochastic gradient-based weight 151 optimizer, L2 penalty term of 10 −4 , learning rate of 10 −3 , and maximum 200 epochs 152 (multilayer-perceptron) [73,74]. We used the implementations of these classifiers in 153 the Python library scikit-learn [75] -all settings and hyperparameters were left as 154 the defaults given in the library.

155
For some of the results that follow, we required a method for measuring classification 156 accuracy without the need for a separate testing dataset. To do so, we used 10-fold classifiers [76]. N -fold cross-validation is performed by taking the given dataset and unsupervised data exploration techniques are useful, and herein we also explore the use 171 of Molecular Distance Maps (MoDMaps), previously described in [41,77,78], for this 172 purpose. After computing the vectors F k (s), this method proceeds by first constructing 173 a pairwise distance matrix. In this paper, we use the well-known Manhattan 174 distance [79], defined between two vectors A = (a 1 , . . . a n ) and B = (b 1 , . . . b n ) as being: 175 Next, the distance matrix is visualized by classical MultiDimensional Scaling (MDS) [80]. 176 MDS takes as input a pairwise distance matrix and produces as output a 2D or 3D plot, 177 called a MoDMap [81], wherein each point represents a different sequence, and the 178 distances between points approximate the distances from the input distance matrix. As 179 MoDMaps are constrained to two or three dimensions, it is in general not possible for 180 the distances in the 2D or 3D plot to match exactly the distances in the distance 181 matrix, but MDS attempts to make the difference as small as possible.

183
We have developed a software package called Kameris which implements our method. 184 It can be obtained from https://github.com/stephensolis/kameris, and may be 185 used on Windows, macOS, and Linux. Kameris is implemented in Python, with the 186 feature vector computation parts implemented in C++ for performance. It is packaged 187 so as to have no external dependencies, and thus is easy to run. The package has three 188 different modes: first, it can train one or more classifiers on a dataset and evaluate r4.8xlarge instance with 16 physical cores (32 threads) of a 2.3GHz Intel Xeon E5-2686 195 v4 processor. We also note that many of the implementations of the classifier algorithms 196 we use are single-threaded and that performance can almost certainly be substantially 197 improved by using parallelized implementations. 199 In this paper, a variety of different datasets were used to validate the performance of 200 the method. Straightforward reproducibility of results was a priority in the design of this study, and to that end, every sequence and its metadata from every dataset 202 referenced here can be retrieved from our GitHub repository at 203 https://github.com/stephensolis/kameris-experiments. Further, instructions for 204 using Kameris to replicate the experiment results are available in S1 Appendix.

205
In some cases, these datasets had few examples for some classes. Training on classes 206 with very few examples would unfairly lower accuracy since the classifier does not have 207 enough information to learn, so we wish to omit such classes from our analysis. However, 208 the minimum number of examples per class to achieve proper training of a classifier is 209 difficult to estimate; this number is known to be dependent on both the complexity of 210 the feature vectors and characteristics of the classifier algorithm being used [82,83].

211
Since we vary both k and the classifier algorithms in this study, this makes it especially 212 challenging to determine an adequate minimum class size. Here, we arbitrarily selected 213 18 as our minimum, so we omitted from analysis any subtype with fewer than 18 214 sequences. It may be that specific values of k and some classifier algorithms work well 215 in scenarios with very small datasets, and we leave this as an open question. To evaluate classifiers trained on HIV-1 sequences and subtype annotations curated by 237 the LANL database, we needed testing sets but wanted to avoid selecting them from the 238 same database. We manually searched the GenBank database for large datasets 239 comprising HIV-1 pol sequences collected from a region with known history of a 240 predominant subtype, and evaluated the associated publications to verify the 241 characteristics of the study population (Table 1). After selection of the datasets, we 242 wished to obtain labels without relying on another subtyping method. To do so, first we 243 made use of the known geographic distribution of HIV-1 subtypes, where specific 244 regions are predominantly affected by one or two particular subtypes or circulating 245 recombinant forms due to historical 'founding' events [84]. Next, we screened each 246 dataset using a manual phylogenetic subtyping process to verify subtype assignments 247 against the standard reference sequences. This was done, essentially, by reconstructing 248 phylogenetic trees to identify possible subtype clusters. A cluster was identified as a 249 certain subtype if it included a specific subtype reference sequence we had initially August 10, 2017, using the query options "Excluding recombinants", "Excluding 'no 309 genotype"', "Genomic region: complete genome", and "Excluding problematic" for a

314
Our subtype classification method has two main parameters that may be varied: 315 namely, the specific classifier to be used, and the value k of the length of the k-mers to 316 count when producing feature vectors. We begin with the full set of full-length HIV-1 317 genomes from the LANL database, and we perform a separate 10-fold cross-validation 318 experiment for each of the fifteen classifiers listed in the Methods section, and all values 319 of k from 1 to 10, that is, 160 independent experiments in total. For each value of k, we 320 plot the highest accuracy obtained by any classifier as well as the average running time 321 over the classifiers, see Fig 1. We observe that k = 6 achieves a good balance between 322 classifier performance and accuracy, so at k = 6, we list the accuracy obtained by each 323 classifier and its corresponding running time averaged over five runs, see Table 2. As  Running time is another important performance indicator, so we also compare the 359 performance of these five tools for the dataset of van Zyl et al. [95] and also for all 360 datasets together (see Table 4). We observe that our tool matches or outperforms the 361 competing state-of-the-art. Note that, for these comparison experiments, CASTOR,

362
COMET, SCUEAL, and REGA were run from their web-based interfaces, and therefore 363 the exact specifications of the machines running each programs could not be determined. 364 For this reason, the running times presented here should be taken as rough  In this case, a substantial number of sequences that were classified as subtype A by REGA and our method were labeled unclassified subtypes (U) by COMET. In an HIV-1 phylogeny, subtype U sequences tend to be assigned a basal position (near the root) within the subtype A clade, suggesting that these sequences may be unrecognized variants or complex recombinants of subtype A. 2 These low accuracies are primarily caused by REGA misclassifying many CRF01 sequences as subtype A, and subtype A is mostly equivalent to CRF01 in the pol region. If CRF01 and A were treated as equivalent, these accuracies would be 97.9% and 86.4% for the Rhee and Sukasem datasets, respectively, and unweighted and weighted averages of 93.8% and 96.2%, respectively.
So far, we have only discussed supervised classification, and we have presented 373 promising results for our approach. However, supervised classification requires data with 374 known labels, which can be problematic considering that the rapid rates of mutation and 375 recombination of viruses (particularly HIV-1) can lead to novel strains and recombinant 376 forms emerging quickly. Unsupervised data exploration tools can help address this 377 problem. To demonstrate, we take the set of all whole genomes from the LANL 378 Manhattan distance matrix obtained by computing all pairs of k-mer frequency vectors 380 (see Methods section), for 9 different pure subtypes or groups (Fig 2), and just subtypes 381 A, B, and C (Fig 3). As can be seen, based on these distances, the points in the plots 382 are grouped according to known subtypes, and indeed it can be seen that subtypes A1 383 and A6 group together, and as well B and D group together, as could be expected.  Synthetic data has been useful in the study of viral species such as HIV-1, because a 385 ground-truth classification is known for synthetic sequences without ambiguity.

386
However, one may wonder how well such synthetic sequences model natural ones. We 387 attempt to measure this by training a classifier on natural and synthetic HIV-1 388 sequence data -if natural and synthetic sequences cannot be distinguished, one may 389 conclude that the simulation is realistic. For the 'natural' class we use the set of all pol 390 genes from the LANL database, and for the 'synthetic' class we use 1500 synthetic pol 391 genes produced as detailed above (see 'Synthetic data' in the Methods section), and we 392 perform a 10-fold cross-validation at k = 6 and with the linear SVM classifier. We 393 obtain an accuracy of 100%, meaning that the classifier can distinguish natural from 394 synthetic sequences with perfect accuracy. This suggests that synthetic sequence data 395 should be used with caution, since this result indicates it may not be perfectly 396 representative of natural sequence data -specifically, our result suggests there is some 397 characteristic of the synthetic sequences which differs from the natural sequences, which 398 our method is able to recognize and use. We explore this further by generating a 399 MoDMap, as seen in Fig 4. Interestingly, even though our supervised classifiers 400 succeeded to discriminate between real and synthetic sequences with an accuracy of 401 100%, the approach using distances between k-mer frequency vectors results in the 402 natural and synthetic sequences of specific subtypes grouping together, indicating that 403 the synthetic sequences have some features that relate them to the corresponding 404 natural sequences of the same subtype.

406
The k-mer based supervised classification method we propose in this paper has several 407 advantages compared to other popular software packages for the classification of virus 408 subtypes. First, we have shown on several manually-curated data sets that k-mer 409 classification can be highly successful for rapid and accurate HIV-1 subtyping relative to 410 the current state-of-the-art. Furthermore, releasing our method as an open-source 411 software project confers significant advantages with respect to data privacy, 412 transparency and reproducibility. Other subtyping algorithms such as REGA [104] and 413 COMET [8] are usually accessed through a web application, where HIV-1 sequence data 414 is transmitted over the Internet to be processed on a remote server. This arrangement is 415 convenient for end-users because there is no requirement for installing software other 416 than a web browser. However, the act of transmitting HIV-1 sequence data over a 417 network may present a risk to data privacy and patient confidentiality -concerns 418 include web applications neglecting to use encryption protocols such as TLS, or servers 419 becoming compromised by malicious actors. As a concrete example, the webserver 420 hosting the first two major releases of the REGA subtyping algorithm [104] was recently 421 compromised by an unauthorized user (last access attempt on November 27, 2017). In 422 contrast, our implementation is available as a standalone program, without any need to 423 transmit sequence data to an external server, eliminating those issues. In addition, our 424 implementation is released under a permissive open-source license (MIT). In contrast, 425 REGA [9] and COMET [8] are proprietary 'closed-source' software, making it 426 impossible to determine exactly how subtype predictions are being generated from the 427 input sequences.

428
Relying on a remote web server to process HIV-1 sequence data makes it difficult to 429 determine which version of the software has been used to generate subtype 430 classifications, and by extension difficult to guarantee that classification results can be 431 reproduced. There is growing recognition that tracking the provenance (origin) of 432 bioinformatic analytical outputs is a necessary component of clinical practice. For 433 example, the College of American Pathologists recently amended laboratory guidelines 434 on next-generation sequence (NGS) data processing to require that: "the specific 435 version(s) of the bioinformatics pipeline for clinical testing using NGS data files are 436 traceable for each patient report" [105]. In contrast to other tools, our standalone 437 package makes it easy to use exactly the desired version of the software and thus 438 enables precise reproducibility. 439 We now discuss some limitations of our approach. Like many machine learning 440 approaches, our method does not provide an accessible explanation as to why a DNA 441 sequence is classified a certain way, compared to a more traditional alignment-based 442 method. In some sense, the classifiers act more as a black box, without providing a 443 rationale for their results. Another issue is our requirement for a sizable, clean set of 444 training data. As opposed to an alignment-based method that could function with even 445 a single curated reference genome per class, machine learning requires several examples 446 per training class, as discussed previously, to properly train. Finally, one issue common 447 to any HIV-1 subtyping tool is the fact that recombination and rapid sequence 448 divergence can make subtyping difficult, especially in cases where the recombinant form 449 was not known at the time of training. Other tools are capable of giving a result of 'no 450 match' to handle ambiguous cases, but our method always reports results from the 451 classes used for training.

452
To more clearly demonstrate this last issue, we generate a random sequence of length 453 10,000 with equal occurrence probabilities for A, C, G, and T, and we ask the five 454 subtyping tools evaluated in our study to predict its HIV-1 subtype. As expected, 455 REGA gives a result of 'unassigned' and SCUEAL reports a failure to align with the 456 reference. Our tool reports subtype 'U' with 100% confidence, CASTOR predicts HIV-1 457 group 'O' with 100% confidence, and COMET reports SIV CPZ (simian 458 immunodeficiency virus from chimpanzee) with 100% confidence. These outcomes are 459 consistent with the disproportionately large genetic distances that separate HIV-1 group 460 O and SIV CPZ from HIV-1 group M -a line drawn from a random point in sequence 461 space is more likely to intersect the branch relating either of these distant taxa to group 462 M. Similarly, branches leading to subtype U sequences tend to be longer and to intersect 463 the HIV-1 group M tree at a basal location 4 . This artificial example implies that real 464 HIV-1 sequences that do not readily fit into any of the defined subtypes or circulating 465 recombinant forms may result in incorrect predictions with misleadingly high confidence 466 scores.

467
In spite of these limitations, our method not only matches or improves upon current 468 HIV-1 subtyping algorithms, but it should also be broadly applicable to any DNA 469 sequence classification problem, including other virus subtyping problems. To 470 demonstrate this, we use the same method (with k set to 6 and a linear SVM classifier) 471 and 10-fold cross-validation to measure the accuracies for classifying dengue, hepatitis 472 B, hepatitis C, and influenza type A virus full-length genomes (described in the 473 Datasets section) to their respective reference subtypes. Overall, we obtain accuracies of 474 100% for dengue virus, 95.81% for hepatitis B virus, 100% for hepatitis C virus, and  Because of the exponential growth of sequence databases, modern bioinformatics 480 tools increasingly must be capable of handling NGS sequence data and must be scalable 481 enough to manage huge sets of data. As well, researchers often demand the privacy, 482 security, and reproducibility characteristics an open-source, standalone, offline tool such 483 as ours provides. However, there remain several areas for future work. Although our 484 tool matches or exceeds the classification speed of the competing state-of-the-art, 485 performance optimization was not a focus of this study and we believe there is room to 486 substantially improve running time even further. Similarly, although we match or 487 exceed the classification accuracy of the competing state-of-the-art, different modern 488 machine learning methods such as GeneVec [106] or deep neural networks may permit us 489 to achieve even higher accuracy on challenging datasets. As well, given the rapid rate of 490 mutation of many viruses, it would be highly useful for our tool to be capable of giving a 491 result of 'no match' with its training data. Further, since our software is able to classify 492 sequences quickly, we believe it should be capable of handling large NGS sequence 493 datasets, and we leave a careful study of this application to future work. Each of these 494 possibilities could make our method and software even more useful in the future.