Fig 1.
Virulence of currently known human RNA viruses with respect to taxonomy.
Number of known human RNA virus species split by ICTV taxonomic family. Shading denotes disease severity rating. Supporting data are available via figshare: 10.6084/m9.figshare.7406441.v3 (https://figshare.com/articles/Data_and_supporting_R_script_for_Tissue_Tropism_and_Transmission_Ecology_Predict_Virulence_of_Human_RNA_Viruses/7406441/3). ICTV, International Committee on Taxonomy of Viruses.
Fig 2.
Final pruned classification tree predicting disease severity for 181 human RNA viruses.
Final classification tree structure predicting virulence. Viruses begin at the top and are classified according to split criteria (white boxes) until they reach terminal nodes (shaded boxes), which show the model’s prediction of disease severity and the fraction of viruses following that path that were correctly classified, judged against literature-assigned ratings. ‘Tp: primary’ denotes primary tissue tropism, ‘Tr level’ denotes level of human-to-human transmissibility, and ‘Tp: renal’ denotes having a known renal tissue tropism. Tp, tropism; Tr, transmissibility.
Fig 3.
Variable importance from random forest models.
Importance of each variable in predicting virulence in random forest models applied to all known human RNA viruses and to zoonotic viruses only. Importance was calculated as the average decrease in Gini impurity following a tree split based on that predictor, then scaled against the most informative predictor within each random forest to give a relative measure. Points denote mean values across 200 random forest models with alternative training/test partitions. Error bars denote ± 1 standard deviation. Colour key denotes type of predictor variable. Supporting data are available via figshare: 10.6084/m9.figshare.7406441.v3 (https://figshare.com/articles/Data_and_supporting_R_script_for_Tissue_Tropism_and_Transmission_Ecology_Predict_Virulence_of_Human_RNA_Viruses/7406441/3). nh, nonhuman; tr, transmissibility.
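The scaling described in this legend can be illustrated with a minimal Python sketch. It is not the authors' R script: the function names, the predictor names, and the numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the legend does not say which estimator was used.

```python
from statistics import mean, stdev


def relative_importance(gini_decrease):
    """Scale one forest's mean Gini decreases against its top predictor."""
    top = max(gini_decrease.values())
    return {name: value / top for name, value in gini_decrease.items()}


def summarise(forests):
    """Mean and sample SD of relative importance across several forests."""
    scaled = [relative_importance(forest) for forest in forests]
    summary = {}
    for name in scaled[0]:
        values = [s[name] for s in scaled]
        summary[name] = (mean(values), stdev(values))
    return summary


# Illustrative numbers only, not taken from the study.
forests = [
    {"primary_tropism": 8.0, "tr_level": 4.0, "renal_tropism": 2.0},
    {"primary_tropism": 6.0, "tr_level": 6.0, "renal_tropism": 1.5},
]
summary = summarise(forests)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.3f}, sd={sd:.3f}")
```

Because each forest is scaled against its own most informative predictor, every forest contains at least one predictor with a relative importance of exactly 1, which is why the scaled values are comparable across the 200 partitions.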
Fig 4.
Partial dependence from random forest models in predicting severe virulence.
Predicted probability of classifying virulence as ‘severe’ for each of the most informative risk factors in random forest models applied to all known human RNA viruses and zoonotic viruses only (primary tissue tropism, any known neural tropism, any known renal tropism, level of human-to-human transmissibility, primary transmission route, and any known vector-borne transmission). Predicted probabilities are marginal, i.e., averaging over any effects of other predictors. Boxes denote distribution of probabilities across 200 random forest models with alternative training/test partitions, with heavy lines denoting median probability. Dashed line denotes raw prevalence of ‘severe’ virulence rating among the respective training datasets. Colour key denotes predictor variable type as in Fig 3, i.e., blue = tissue tropism, green = transmissibility, red = transmission route. Supporting data are available via figshare: 10.6084/m9.figshare.7406441.v3 (https://figshare.com/articles/Data_and_supporting_R_script_for_Tissue_Tropism_and_Transmission_Ecology_Predict_Virulence_of_Human_RNA_Viruses/7406441/3).
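The marginal probabilities described in this legend can be sketched as follows: fix one predictor at a given level for every virus, predict, and average over the remaining predictors as observed. The snippet is a hedged Python illustration only. `toy_model` is a hypothetical stand-in for a fitted random forest, and the predictor levels and data rows are invented for the example.

```python
def partial_dependence(predict_severe, rows, predictor, levels):
    """Average predicted P(severe) with `predictor` forced to each level."""
    result = {}
    for level in levels:
        # Override one predictor; all other predictors keep observed values.
        probs = [predict_severe({**row, predictor: level}) for row in rows]
        result[level] = sum(probs) / len(probs)
    return result


def toy_model(row):
    """Hypothetical scoring rule standing in for a fitted model."""
    p = 0.1
    if row["primary_tropism"] == "neural":
        p += 0.5
    if row["vector_borne"]:
        p += 0.2
    return p


# Invented rows, not taken from the study.
rows = [
    {"primary_tropism": "respiratory", "vector_borne": False},
    {"primary_tropism": "neural", "vector_borne": True},
    {"primary_tropism": "gastrointestinal", "vector_borne": True},
    {"primary_tropism": "respiratory", "vector_borne": False},
]
pd_tropism = partial_dependence(
    toy_model, rows, "primary_tropism", ["neural", "respiratory"]
)
print(pd_tropism)
```

Repeating this calculation for each of the 200 fitted forests would give one probability per forest and level, which is the kind of distribution the boxes summarise.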
Table 1.
Predictive performance metrics for classification tree and random forest model.
Sensitivity, specificity, NPV (proportion of ‘nonsevere’ predictions that correctly matched literature rating), TSS (sensitivity + specificity − 1), and AUROC for each modelling method when predicting virulence of viruses within the test set. Random forest diagnostics are mean values across 200 training/test partitions. Supporting data are available via figshare: 10.6084/m9.figshare.7406441.v3 (https://figshare.com/articles/Data_and_supporting_R_script_for_Tissue_Tropism_and_Transmission_Ecology_Predict_Virulence_of_Human_RNA_Viruses/7406441/3). AUROC, area under the receiver operating characteristic curve; NPV, negative predictive value; TSS, true skill statistic.
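As a worked illustration of the metrics defined above, the following Python sketch computes sensitivity, specificity, NPV, and TSS from paired literature ratings and model predictions. The function names and the example ratings are hypothetical, and the sketch assumes both classes occur in the test set and that at least one ‘nonsevere’ prediction is made, so no denominator is zero.

```python
def confusion_counts(actual, predicted, positive="severe"):
    """Count true/false positives and negatives, treating `positive` as the event."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance(actual, predicted):
    """Sensitivity, specificity, NPV, and TSS for 'severe' vs 'nonsevere'."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": tn / (tn + fn),  # correct share of 'nonsevere' predictions
        "tss": sensitivity + specificity - 1,
    }


# Invented ratings, not taken from the study.
actual = ["severe"] * 3 + ["nonsevere"] * 5
predicted = ["severe", "severe", "nonsevere", "severe"] + ["nonsevere"] * 4
metrics = performance(actual, predicted)
print(metrics)
```

TSS ranges from −1 to 1, with 0 corresponding to a model that performs no better than chance.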
Fig 5.
Receiver operating characteristic curve for tree-based machine learning models.
Receiver operating characteristic curves showing performance in predicting virulence in the test set(s) for the single classification tree (bold black line) and for random forest models averaged over 200 training/test set partitions (bold red line). The y axis denotes sensitivity (or true positive rate; proportion of viruses rated ‘severe’ by literature protocol that were correctly predicted as ‘severe’ by the model), and the x axis denotes 1 − specificity (or false positive rate; proportion of viruses rated ‘nonsevere’ by literature protocol that were incorrectly predicted as ‘severe’ by the model). Dashed black line indicates null expectation (i.e., a model with no discriminatory power). Model profiles further toward the top left indicate better predictive performance.
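The axes described in this legend can be made concrete with a short Python sketch that sweeps a classification threshold over predicted probabilities and records the false and true positive rates at each step. The function names, labels, and scores are hypothetical, and the trapezoidal area calculation is one common way to obtain an AUROC, not necessarily the one used in the supporting R script.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 for 'severe', 0 for 'nonsevere' (literature rating).
    scores: predicted probability of 'severe' from the model.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more viruses are called 'severe'.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auroc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented labels and scores, not taken from the study.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auroc(points)
print(points, area)
```

A model with no discriminatory power traces the diagonal from (0, 0) to (1, 1), giving an area of 0.5, which corresponds to the dashed null line in the figure.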