ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples

Despite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignment methods classify many assembled contigs as “unknown”, since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern frequencies in raw metagenomic contigs. The training dataset included sequences obtained from 19 metagenomic experiments which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs, ViraMiner achieves an area under the ROC curve of 0.923. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures complementary types of information about genome composition, and can be used as a recommendation system to further investigate sequences labeled as “unknown” by conventional alignment methods. Exploring these highly divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.

Reply : We thank the reviewer for these comments. We ran HMMER with the Pfam database on the viral 300 bp sequences, and the table below lists the proteins found. The table contains only the most frequent proteins, because we want at least 10 data points when calculating average metrics. The second column counts the occurrences of the protein in the entire dataset (train+val+test). The third column counts the occurrences of the protein among val+test set samples. The fourth column shows, for val+test samples containing the given protein, the average probability of being viral according to the model (i.e. the average output value). The fifth column gives the quartiles of these score values. The sixth column gives the AUROC (computed using the test+val sequences containing the given protein as positives and all test set non-viruses as negatives).

2) Why didn't you try contig lengths larger than 300, for instance 1000, 5000, 10000? Would your model perform better with those contig lengths?
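The per-protein statistics described for columns four to six can be sketched as follows. The score distributions below are synthetic placeholders, not the model's actual outputs; the AUROC is computed directly as the rank statistic (probability that a positive outscores a negative), which is equivalent to the area under the ROC curve.

```python
import numpy as np

def auroc(pos, neg):
    # Probability that a random positive scores above a random negative,
    # counting ties as half -- equivalent to the area under the ROC curve.
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

rng = np.random.default_rng(0)

# Placeholder model scores: val+test viral contigs containing a given
# Pfam protein (positives) vs. all test-set non-viral contigs (negatives).
viral_scores = rng.uniform(0.4, 1.0, size=50)
nonviral_scores = rng.uniform(0.0, 0.6, size=500)

score_auroc = auroc(viral_scores, nonviral_scores)   # column 6
mean_score = float(viral_scores.mean())              # column 4
quartiles = np.percentile(viral_scores, [25, 50, 75])  # column 5
print(f"AUROC={score_auroc:.3f}  mean viral score={mean_score:.3f}")
```

In the real analysis the positive scores would come from ViraMiner's output on sequences with an HMMER hit for the protein in question.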
Reply : As mentioned in the "Data processing and labeling" subsection, we also tried sequence length 500, but it performed clearly worse than 300. In initial experiments we also briefly tested sequence length 1000, but the results were weaker still. We hypothesize that this is due to having fewer data points with longer (more restrictive) sequence lengths.
With sequence length 500 we have roughly a third of the data available at length 300, and with sequence length 1000 only a twentieth. The empirical results show that, in the present case, having more data matters most.
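The effect of the length cut-off can be shown with a minimal sketch (the contig set and helper name are hypothetical): raising the minimum contig length discards progressively more of the assembly.

```python
def filter_contigs(contigs, min_len):
    """Keep only contigs at least min_len bp long."""
    return [c for c in contigs if len(c) >= min_len]

# Toy assembly: many short contigs, few long ones.
contigs = ["A" * n for n in (150, 250, 320, 350, 480, 520, 900, 1200)]

counts = [len(filter_contigs(contigs, m)) for m in (300, 500, 1000)]
print(counts)  # [6, 3, 1] -- stricter cut-offs keep fewer contigs
```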
3) The tool requires further validation with more data. I understand that you are using 19 metagenomes with partitions/baselines to train, and that the AUROC can be considered a good parameter to evaluate your model. However, it is imperative to know with certainty what is in your metagenome in order to validate the current technology. I'd suggest generating simulated human metagenomes using taxon profiles similar to the ones used in the training model, with tools such as NeSSM, ART, or MetaSim, to determine how your trained model performs on completely new datasets.
Reply : We believe that the true effectiveness of our models can only be measured by their performance on real data, preferably originating from a sequencing experiment that was not used to train the model. Leaving an entire dataset out of both the training procedure (training set) and the model-selection procedure (validation set) achieves just that.
That said, we agree that working with real data, there is always some risk of unknown biases making results nicer than they should be. We have thus repeated our experiments on simulated data, as requested.
We considered all three simulation tools mentioned by the reviewer and decided to use ART, as it is the most straightforward to understand and use. The article now contains a new Methods section describing the simulation procedure and parameters, and a Results subsection describing the results obtained.
The results showed that the model trained on our 19 metagenomic experiments achieved an AUROC of 0.751 on the simulated data. Although the model performs clearly above chance level, this is indeed more moderate than its performance on the main test set. However, consider that the simulated dataset was generated from viral reference genomes picked at random from GenBank, without any prior selection. With a ViraMiner model both trained and tested on simulated data, the test AUROC increases to 0.921. We believe this further demonstrates that the architecture is useful and able to generalize to different datasets.
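The leave-one-experiment-out protocol invoked in our reply above can be sketched as follows (experiment identifiers are placeholders): each of the 19 sequencing experiments is held out in turn, and the model never sees contigs from the held-out experiment during training or validation.

```python
def leave_one_out_splits(experiment_ids):
    """Yield (train_ids, held_out_id) pairs, one per experiment."""
    for held_out in experiment_ids:
        train = [e for e in experiment_ids if e != held_out]
        yield train, held_out

# 19 metagenomic experiments (names hypothetical).
ids = [f"exp{i:02d}" for i in range(1, 20)]
splits = list(leave_one_out_splits(ids))
print(len(splits))  # one evaluation round per experiment
```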
- This article proposes a new machine learning approach for characterizing unknown metagenomic contigs. The approach of using neural networks with raw DNA sequences as inputs is unique and novel. The authors demonstrated that the proposed approach "ViraMiner" performs better than random forest and k-mer baselines. The writing is excellent as well. Because of the novelty of the approach, I recommend the paper be accepted after minor revision.
Several minor areas can be improved: 1) The AUROC is 0.92; however, the real performance of 0.9 accuracy and 0.32 recall is not as impressive. I believe these numbers are much worse than BLAST, so I recommend emphasizing this in the abstract.
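The gap the reviewer notes between ranking quality (AUROC) and thresholded metrics can be illustrated with synthetic scores (the Beta distributions below are illustrative, not the paper's predictions): a model can rank viral contigs well overall while most positives still fall below the default 0.5 decision threshold, giving low recall.

```python
import numpy as np

def recall_at(threshold, scores):
    """Fraction of positives scoring at or above the threshold."""
    return float((scores >= threshold).mean())

rng = np.random.default_rng(1)
# Imbalanced toy data: few viral positives, many non-viral negatives,
# with positive scores only moderately shifted upward.
pos = rng.beta(3, 4, size=100)   # viral contig scores
neg = rng.beta(2, 8, size=900)   # non-viral contig scores

# Ranking quality is decent...
auroc = float((pos[:, None] > neg[None, :]).mean())
# ...yet recall at the 0.5 cut-off is low, since most positives score below it.
rec = recall_at(0.5, pos)
print(f"AUROC ~ {auroc:.2f}, recall@0.5 = {rec:.2f}")
```

Reporting precision/recall at an operating point alongside AUROC, as the reviewer suggests, makes this trade-off explicit.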