Table 1.
Heterogeneity of previously described ML models that predict the likelihood of human infection from viral genetic information. Our approach is shown in the bottom row, and cell colors emphasize the similarities and differences between approaches. The ROC AUC of 0.73 on the holdout dataset was the best performance across all of our estimator types when using a reconstitution of the original training and testing sets, produced at runtime by executing the code at https://github.com/Nardus/zoonotic_rank at hash 42f15a07.
Fig 1.
Average estimator performance on the test set (ROC AUC across 10 seeds, with standard deviations) for the different datasets in this work.
Estimators averaged include Random Forest, Extra Trees, gradient boosted trees, and support vector machines. The datasets include the original (O) from Mollentze et al., a corrected version (C) of the Mollentze et al. dataset, and a rebalanced (R) version of the corrected dataset. Results are further divided by whether hyperparameters were optimized (+) or not (-).
Fig 2.
Distribution of viral genomes in the datasets used in this work, categorized by human infectivity and training and test data split.
The datasets are improved versions of the original dataset analyzed by Mollentze et al. [2], and previously curated by several others [4,25,26], with specific improvements including removal of problematic genomes and updating known human infectivity (A), or additionally rebalancing the datasets by random shuffling with preservation of human infectivity ratios (B).
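One way to picture the rebalancing in (B) is as a stratified re-split: records are shuffled within each human-infectivity class and then divided so that the training and test sets keep the same class ratio. The sketch below illustrates that reading only; it is not the code used in this work, and the function name `stratified_resplit`, the `test_fraction` parameter, and the toy labels are hypothetical.

```python
import random


def stratified_resplit(labels, test_fraction, seed):
    """Return (train_idx, test_idx), splitting every class at the same fraction."""
    rng = random.Random(seed)

    # Group record indices by class label (e.g. 1 = human-infecting, 0 = not).
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random shuffling within the class
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data (assumed): 40 human-infecting and 60 non-human-infecting records.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_resplit(labels, test_fraction=0.25, seed=0)

train_ratio = sum(labels[i] for i in train_idx) / len(train_idx)
test_ratio = sum(labels[i] for i in test_idx) / len(test_idx)
```

With these toy labels both splits contain 40% human-infecting records, so the class ratio is preserved regardless of the seed.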
Fig 3.
The performance of common ML models was evaluated on the corrected (A) and rebalanced (B) datasets for prediction of viral human infectivity.
Features calculated using our workflow were similar to those used by Mollentze et al. [2], except that we additionally included peptide kmers. Hyperparameter optimization was performed for each model before training (see Methods). Mean ROC curves were calculated from prediction scores on the test data across 10 random seeds, with ROC AUC values shown in the legends and standard deviations (SD) summarized in Fig 4. The dashed red line represents an estimator that is no better than random chance.
Fig 4.
Mean ROC AUC on the test set, broken down by estimator, dataset (C = corrected; R = rebalanced), hyperparameter optimization status (- = non-optimized; + = optimized), and host target.
Results are averages and standard deviations across ten random seeds.
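As an illustration of how a per-seed ROC AUC and its summary across seeds could be computed, the following sketch uses the rank interpretation of ROC AUC (the probability that a randomly chosen positive record scores higher than a randomly chosen negative one). The function names, toy scores, and the choice of the sample standard deviation are assumptions made for the example, not details taken from this work.

```python
from statistics import mean, stdev


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Requires at least one positive (1) and one negative (0) label.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(auc_by_seed):
    """Mean and sample standard deviation of ROC AUC values across seeds."""
    return mean(auc_by_seed), stdev(auc_by_seed)


# Toy example: three of the four positive/negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical per-seed test-set ROC AUC values (one entry per random seed).
auc_mean, auc_sd = summarize([0.70, 0.74, 0.72])
```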
Fig 5.
Mean ROC AUC on the test set, broken down by estimator and hyperparameter optimization status.
These results are for a dataset closely matched to the one originally used by Mollentze et al., which has only human target labels. Results are averages and standard deviations across ten random seeds.
Fig 6.
The performance of common machine learning models was evaluated on the corrected and rebalanced datasets for human, primate, and mammal host targets.
The ROC AUC for each model is reported for predictions on the test set across 10 seeds. The violin plot displays the distribution of ROC AUC across estimators, with dashed lines marking the three quartiles: the 25th percentile, the median, and the 75th percentile.
Fig 7.
Confusion matrix data for hyperparameter-optimized Random Forest estimators, averaged across ten random seeds, for predictions on the test set.
The confusion matrix data for other estimators may be reproduced using the open source code accompanying this work. Average and standard deviation values are shown for different datasets. The threshold used for prediction was the equal error rate (EER).
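The equal error rate threshold is the score cutoff at which the false positive rate and the false negative rate are (as nearly as possible) equal. The sketch below shows one simple way to locate such a threshold and tabulate the resulting confusion matrix; it is an illustration with hypothetical function names and toy data, not the implementation in the accompanying code.

```python
def eer_threshold(y_true, scores):
    """Return the score threshold where |FPR - FNR| is smallest.

    A record is predicted positive when its score is >= the threshold.
    Requires at least one positive (1) and one negative (0) label.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_gap, best_threshold = None, None
    for threshold in sorted(set(scores)):
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
        gap = abs(fp / negatives - fn / positives)
        if best_gap is None or gap < best_gap:
            best_gap, best_threshold = gap, threshold
    return best_threshold


def confusion_counts(y_true, scores, threshold):
    """Return (TN, FP, FN, TP) for predictions made at the given threshold."""
    tn = fp = fn = tp = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy labels and prediction scores (assumed for illustration).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

threshold = eer_threshold(y_true, scores)
counts = confusion_counts(y_true, scores, threshold)
```

For these toy data the selected threshold yields one false positive and one false negative, so both error rates are 25%.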
Fig 8.
Representative example of the process of determining the number of trials required for hyperparameter optimization.
The estimate is based on a plateau in performance improvement across five independent random seeds, with the 5-fold cross-validation mean ROC AUC as the metric. In this case, approximately 500 steps appear to be sufficient to optimize the hyperparameters for SVC with a linear kernel.
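A plateau of this kind can also be checked numerically. The sketch below assumes each seed provides a list of cross-validation mean ROC AUC values, one per optimization trial, and reports the first trial count after which the seed-averaged best-so-far score stops improving by more than a tolerance. The function names, the `window` and `tolerance` parameters, and the toy histories are hypothetical; the plateau in this figure may have been judged differently.

```python
from statistics import mean


def running_best(trial_scores):
    """Best score seen so far after each trial."""
    best, curve = float("-inf"), []
    for score in trial_scores:
        best = max(best, score)
        curve.append(best)
    return curve


def plateau_trial(histories, window, tolerance):
    """First trial count after which the seed-averaged best score improves
    by at most `tolerance` over the next `window` trials; None if no plateau."""
    curves = [running_best(history) for history in histories]
    n_trials = min(len(curve) for curve in curves)
    averaged = [mean(curve[i] for curve in curves) for i in range(n_trials)]
    for i in range(n_trials - window):
        if averaged[i + window] - averaged[i] <= tolerance:
            return i + 1  # convert 0-based index to a trial count
    return None


# Toy per-trial CV mean ROC AUC histories for two seeds (assumed values).
histories = [
    [0.60, 0.70, 0.72, 0.71, 0.72, 0.72],
    [0.62, 0.66, 0.74, 0.74, 0.73, 0.74],
]
n_needed = plateau_trial(histories, window=2, tolerance=0.001)
```

Averaging the best-so-far curves before testing for a plateau keeps a single lucky seed from ending the search early.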
Table 2.
The range of hyperparameter values over which optimizations were performed. n represents the number of records in the design matrix, while f represents the number of features in the design matrix. In many cases the samples were drawn over the intervals using a uniform distribution, but other approaches were also used; see the open-source machine learning workflow for details.