Fig 1.
Feature selection algorithms applied at successive levels to scale down the model.
Tree-based and Lasso-regularization-based feature selection, applied successively at two levels, yielded six feature sets in total, one for each selection criterion.
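A minimal sketch of this kind of two-stage feature selection, using a tree-based importance filter followed by an L1 (Lasso-style) filter. The data set, estimators, and thresholds below are illustrative stand-ins, not the ones used in our pipeline.

```python
# Two-stage feature selection sketch: tree importance, then L1 penalty.
# All data and thresholds are synthetic/illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

# Stage 1: keep features with above-median tree importance.
tree_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median")
X1 = tree_sel.fit_transform(X, y)

# Stage 2: keep features with non-zero L1-penalized coefficients.
lasso_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
X2 = lasso_sel.fit_transform(X1, y)

print(X.shape[1], X1.shape[1], X2.shape[1])  # features shrink at each stage
```

Each stage prunes on its own criterion, so the surviving feature set satisfies both.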
Table 1.
Top three k-mer feature sets, ranked by the feature importance assigned by our feature selection workflow.
Fig 2.
AUROC for the 18 fitted models, trained with varying numbers of selected features.
(a) Performance of all classifiers with all features. (b) Performance with 194 features after the first round of selection. (c) Performance with 68 selected features. The weakest-performing model was Gaussian Naïve Bayes, with an AUC of 0.96. No significant decline in model performance was observed when the feature count was reduced from 194 to 68.
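The AUROC comparison across feature-set sizes can be reproduced in outline as follows; the data, classifier, and feature counts here are synthetic placeholders, and taking the first columns merely stands in for a real feature-selection step.

```python
# Illustrative AUROC comparison for one classifier trained on feature
# subsets of different sizes. Data and subset choice are synthetic.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=200,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

aucs = {}
for n_feat in (200, 60):  # full set vs. a reduced subset (first columns)
    clf = GaussianNB().fit(X_tr[:, :n_feat], y_tr)
    proba = clf.predict_proba(X_te[:, :n_feat])[:, 1]
    aucs[n_feat] = roc_auc_score(y_te, proba)
    print(n_feat, round(aucs[n_feat], 3))
```

The same loop, run over all fitted classifiers, yields the per-panel curves.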
Table 2.
Performance metrics of the 18 fitted models and the selected model.
Fig 3.
Leave-one-family-out cross-validation results heatmap.
For every RNA virus family on the horizontal axis, there is a performance score for each model on the vertical axis. Apart from the Decision Tree and Naïve Bayes models, performance was reasonably uniform across the map.
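Leave-one-family-out cross-validation can be sketched with scikit-learn's `LeaveOneGroupOut`, using each family label as the group; the data and the six pseudo-families below are synthetic, not our actual virus data.

```python
# Leave-one-family-out CV sketch: each "family" is held out in turn.
# Data and family labels are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=20, random_state=0)
families = np.repeat(np.arange(6), 20)  # 6 pseudo "virus families"

scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=families):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    held_out = families[test_idx][0]
    scores[held_out] = accuracy_score(y[test_idx], clf.predict(X[test_idx]))

print(scores)  # one score per held-out family
```

Arranging these per-family, per-model scores in a grid produces the heatmap.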
Fig 4.
Performance of the rbf-SVM68 and RF68 models on a divergent data set.
To assess the capacity of our models to generalize to new examples, we attempted to confound them with sequence data from viruses of other genome types (i.e., genomes other than positive- and negative-sense RNA) and RNA transcripts from mouse (listed in S3 Table). The performance metrics of the models remained robust.
Table 3.
Performance metrics of rbf-SVM68 model on real data.
Fig 5.
Performance on RNA-Seq assembly data from human cells cultured with Ebola virus.
The model performed best, with an AUC of 0.95, when human transcripts shorter than the training length range were excluded. Filtering out the short virus transcripts in addition to the short human transcripts caused a drop in the curve, owing to the skew in the number of examples belonging to the human class.
Fig 6.
Performance on RNA-Seq assemblies of human cells cultured with Ebola virus, stratified by assembly software.
There is significant variation in performance across the different assembly tools. SPAdes showed the most uniform performance, albeit a modest one, whereas more popular assemblers such as Trinity, Oases, and Trans-ABySS performed worse but with similar uniformity.
Fig 7.
Count of ORFs detected in RNA virus sequences as a function of the minimum length cutoff.
We would expect every viral genome to encode a minimal set of proteins necessary for successful replication, and the number of these proteins to vary little across viruses of different types. The graph shows that a cutoff between 100 and 150 is optimal: only the important ORFs, those with a greater probability of being involved in the information pathway of the virus, are retained.
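The cutoff scan can be sketched as a simple ORF counter over the three forward reading frames; the toy sequence and cutoffs below are illustrative, not the viral genomes analyzed here, and the count includes the stop codon in the ORF length.

```python
# Count ATG...stop ORFs of at least min_len_nt nucleotides across the
# three forward frames. Sequence and cutoffs are toy examples.
def count_orfs(seq, min_len_nt):
    """Count ORFs (start to stop, inclusive) of >= min_len_nt nucleotides."""
    count = 0
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if codon == "ATG" and start is None:
                start = idx  # open an ORF at the first start codon
            elif codon in ("TAA", "TAG", "TGA") and start is not None:
                if (idx - start + 1) * 3 >= min_len_nt:
                    count += 1
                start = None  # close the ORF at the stop codon

    return count

# One long ORF (186 nt) and one very short ORF (9 nt) in frame 0.
seq = "ATG" + "GCT" * 60 + "TAA" + "ATGCCC" + "TGA"
for cutoff in (30, 150, 300):
    print(cutoff, count_orfs(seq, cutoff))
```

Raising the cutoff progressively discards short, likely spurious ORFs, which is the trade-off the figure visualizes.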
Table 4.
Performance evaluation of rbf-SVM68 against BLAST and HMMER3.
Fig 8.
Performance of the rbf-SVM model trained with the same computational pipeline for classifying positive- and negative-sense RNA viruses.
(a) Performance on different train-test splits when trained with all 5460 features. (b) Performance when trained with 194 features. (c) Performance when trained with 68 selected features. All models distinguish the sequence classes well, demonstrating the flexibility of our pipeline.
Fig 9.
Diagrammatic representation of our complete experimental design.
Table 5.
Performance of rbf-SVM in classifying positive- and negative-sense RNA viruses.
Fig 10.
Suggested approach for acquiring the best automated annotations.
Our explorations indicate that no single annotation tool produces the best results without further curation. Perhaps the best way to use these tools is to stack their results, removing the maximum number of false positives while consolidating the true positives.
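One simple way to realize this stacking is majority voting over the tools' calls: an annotation survives only if enough tools report it. The tool names, ORF identifiers, and support threshold below are hypothetical.

```python
# Toy consolidation of annotations from multiple tools by vote count.
# Tool names and ORF identifiers are hypothetical.
from collections import Counter

def consolidate(annotations, min_support=2):
    """Keep annotations reported by at least min_support tools."""
    votes = Counter()
    for calls in annotations.values():
        votes.update(set(calls))  # each tool votes at most once per call
    return {a for a, n in votes.items() if n >= min_support}

annotations = {
    "tool_A": ["orf1", "orf2", "orf9"],  # orf9: tool-specific spurious call
    "tool_B": ["orf1", "orf2", "orf3"],
    "tool_C": ["orf1", "orf3", "orf7"],  # orf7: tool-specific spurious call
}
print(sorted(consolidate(annotations)))  # singleton calls are dropped
```

Raising `min_support` trades sensitivity for precision: a stricter vote removes more false positives at the risk of losing calls only one good tool makes.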