Table 1.

Summary of the UK and African datasets.

Table 2.

All training and testing datasets used during this study.

Fig 1.

Classifier performance on the UK and African datasets.

NB: naive Bayes, LR: Logistic Regression with Lasso regularization, RF: Random Forest, FC: Fisher Classifier, RD: agnostic random probabilistic classifier (this classifier predicts, as the probability of a sample belonging to a class, the frequency of that class in the training data). A) Adjusted mutual information (AMI; higher is better) between the ground truth and the predictions of classifiers trained on the dataset with all features (blue), without features corresponding to known RAMs (orange), and without RAM features and without sequences that have at least one known RAM (green). Hatching indicates the training set on which a classifier was trained and the testing set on which its performance was measured. The expected value is 0 for a null classifier and 1 for a perfect classifier; a * denotes that the p-value derived from the mutual information is ≤ 0.05. For example, when trained with all features, all classifiers have a significant MI. Conversely, when RAM features and RAM-bearing sequences are removed, none of the classifiers has a significant MI, and only LR trained on the entirety of the UK dataset has an AMI > 10⁻³. B) Balanced accuracy, i.e. the average of per-class accuracies (higher is better), for the same classifiers as in A). The red line at y = 0.5 is the expected balanced accuracy both for a null classifier that only predicts the majority class and for a random uniform (i.e. 50/50) classifier. C) Brier score, i.e. the mean squared difference between a sample's true RTI-experience status and its predicted probability of being RTI-experienced (lower is better), for the same classifiers as in A) and B).
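All three metrics in panels A–C are standard and available in scikit-learn. The following is a minimal sketch on hypothetical toy arrays (not data from the study) showing how each would be computed; the RD baseline corresponds to a dummy classifier that predicts the training-set class frequencies.

```python
# Sketch of the three metrics from panels A-C, on hypothetical toy arrays.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (adjusted_mutual_info_score,
                             balanced_accuracy_score, brier_score_loss)

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # 1 = RTI-experienced (toy)
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])  # hard class predictions (toy)
y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.3, 0.7, 0.6])  # P(experienced)

# A) Adjusted mutual information between ground truth and predictions:
#    0 expected for a null classifier, 1 for a perfect one.
print(adjusted_mutual_info_score(y_true, y_pred))

# B) Balanced accuracy, the average of per-class accuracies:
#    0.5 expected for a majority-class or uniform random classifier.
print(balanced_accuracy_score(y_true, y_pred))

# C) Brier score, the mean squared difference between the true label
#    and the predicted probability (lower is better).
print(brier_score_loss(y_true, y_prob))

# The RD baseline: a dummy classifier predicting, for every sample,
# the class frequencies observed in the training data.
rd = DummyClassifier(strategy="prior").fit(np.zeros((len(y_true), 1)), y_true)
print(rd.predict_proba(np.zeros((3, 1))))  # constant class priors
```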

Fig 2.

Discrimination between sequences carrying at least one known RAM and sequences carrying none, with training features corresponding to known RAMs removed.

NB: naive Bayes, LR: Logistic Regression with Lasso regularization, RF: Random Forest, FC: Fisher Classifier. A) Adjusted mutual information (higher is better) for classifiers trained without features corresponding to known RAMs. The classifiers are trained either to discriminate RTI-naive from RTI-experienced sequences (blue) or to discriminate sequences with at least one known RAM from sequences that have none (orange). Hatching and braced annotations indicate the training and testing sets behind each performance measure. B) Balanced accuracy, i.e. the average of per-class accuracies, for the same classifiers as in A) (higher is better). The red line at y = 0.5 is the expected value both for a classifier that only predicts the majority class and for a random uniform (50/50) classifier.
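The two tasks compared here differ only in how the target label is built from the same features. As a minimal sketch, where `mutations`, `known_rams`, `X`, and `rti_experienced` are hypothetical stand-ins rather than the study's actual encoding:

```python
# Sketch of the two alternative target labelings compared in Fig 2,
# on hypothetical toy data.
import numpy as np

rng = np.random.default_rng(0)
mutations = ["M184V", "K103N", "A98S", "V35T"]      # column names (toy)
known_rams = {"M184V", "K103N"}                     # known RAMs (toy)
X = rng.integers(0, 2, size=(100, len(mutations)))  # 0/1 mutation matrix
rti_experienced = rng.integers(0, 2, size=100)      # toy RTI labels

ram_idx = [i for i, m in enumerate(mutations) if m in known_rams]
other_idx = [i for i, m in enumerate(mutations) if m not in known_rams]

# Target 1: RTI-naive vs RTI-experienced (blue in panel A).
y_rti = rti_experienced

# Target 2: at least one known RAM vs none (orange in panel A).
y_ram = (X[:, ram_idx].sum(axis=1) >= 1).astype(int)

# In both tasks the classifiers see only the non-RAM features.
X_no_ram = X[:, other_idx]
```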

Table 3.

Analysis of new potential RAMs.

Fig 3.

Relative risk (RR) of the new mutations with regard to known RAMs on the UK dataset.

That is, the prevalence of a new mutation in sequences with a given known RAM divided by its prevalence in sequences without that RAM. RRs were computed only for mutations (new mutations and known RAMs) that appeared in at least 0.1% (= 55) of sequences. 95% confidence intervals, represented by vertical bars, were computed from 1000 bootstrap samples of the UK sequences. Only RRs whose lower CI boundary is greater than 4 are shown. The shape and color of each point indicate the type of RAM as defined by Stanford’s HIVDB. Blue circle: NRTI, orange square: NNRTI, green diamond: other. RR values are shown from left to right, in order of decreasing lower bound of the 95% CI.
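As a concrete illustration of this computation, here is a minimal sketch on simulated indicator arrays; `has_ram` and `has_new` are hypothetical toy data, not the UK sequences:

```python
# Sketch of the relative-risk computation: prevalence of a new mutation
# among sequences carrying a given known RAM, divided by its prevalence
# among sequences without that RAM, with a bootstrap 95% CI.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
has_ram = rng.random(n) < 0.10                      # toy RAM indicator
has_new = rng.random(n) < (0.02 + 0.15 * has_ram)   # toy new-mutation indicator

def relative_risk(new, ram):
    """Prevalence of the new mutation with vs without the known RAM."""
    return new[ram].mean() / new[~ram].mean()

rr = relative_risk(has_new, has_ram)

# 95% CI from 1000 bootstrap resamples of the sequences.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    boot.append(relative_risk(has_new[idx], has_ram[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(rr, lo, hi)
```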

Fig 4.

Structure of HIV-1 RT with important sites highlighted.

The p66 subunit is colored dark gray and the p51 subunit white. The active site is highlighted in blue, and the NNIBP (non-nucleoside inhibitor binding pocket) is highlighted in yellow. The sites of the new mutations are colored in red.
