Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa

Introduction: High-yield HIV testing strategies are critical to reach epidemic control in high-prevalence, low-resource settings such as East and Southern Africa. In this study, we aimed to predict the HIV status of individuals living in Angola, Burundi, Ethiopia, Lesotho, Malawi, Mozambique, Namibia, Rwanda, Zambia and Zimbabwe with the highest precision and sensitivity for different policy targets and constraints, based on a minimal set of socio-behavioural characteristics. Methods: We analysed the most recent Demographic and Health Survey from these 10 countries to predict individuals' HIV status using four algorithms: penalized logistic regression, a generalized additive model, a support vector machine (SVM), and gradient-boosted trees. The algorithms were trained and validated on 80% of the data and tested on the remaining 20%. We compared the predictions based on the F1 score, the harmonic mean of sensitivity and positive predictive value (PPV), and assessed the generalization of our models by testing them against an independent left-out country. The best-performing algorithm was retrained on a minimal subset of variables identified as the most predictive, and used to 1) identify 95% of people living with HIV (PLHIV) while maximising precision and 2) identify groups of individuals by adjusting the probability threshold of being HIV positive (90% in our scenario) to support specific testing strategies. Results: Overall, 55,151 males and 69,626 females were included in the analysis. The gradient-boosted trees algorithm performed best in predicting HIV status, with a mean F1 score of 76.8% [95% confidence interval (CI) 76.0%-77.6%] for males (vs [CI 67.8%-70.6%] for the SVM) and 78.8% [CI 78.2%-79.4%] for females (vs [CI 73.4%-75.8%] for the SVM).
Among the ten most predictive variables for each sex, nine were identical: longitude, latitude and altitude of place of residence, current age, age of most recent partner, total lifetime number of sexual partners, years lived in current place of residence, condom use during last intercourse, and wealth index. Only age at first sex for males (ranked 10th) and Rohrer's index for females (ranked 6th) differed between the sexes. Our large-scale scenario, which consisted of identifying 95% of all PLHIV, would have required testing 49.4% of males and 48.1% of females while achieving a precision of 15.4% for males and 22.7% for females. For the second scenario, only 4.6% of males and 6.0% of females would have had to be tested to find 55.7% of all males and 50.5% of all females living with HIV. Conclusions: We trained a gradient-boosted trees algorithm to find 95% of PLHIV with a precision twice as high as with general population testing, using only a limited number of socio-behavioural characteristics. We also successfully identified people at high risk of infection who may be offered pre-exposure prophylaxis or voluntary medical male circumcision. These findings can inform the implementation of new high-yield HIV tests and help develop very precise strategies based on the constraints of low-resource settings.

This is an area of significant importance, and indeed an area in which modern machine learning methods can be brought to bear for great societal impact. The paper is furthermore well-written and mostly easy to read, and the figures and tables are generally well-prepared (more on this below). It therefore saddens me to report that I cannot support publication at this stage, due to serious shortcomings in the methods description and potentially also inconsistencies in the results that undermine the authors' claims (the latter I cannot judge in full, again because of the shortcomings of the methods description).

Major points
1) The methods section is missing a number of important aspects in order for me to assess in detail the steps the authors have taken.
• How were the models fitted? What was the target variable, and what was the outcome of each training run? As far as I can tell from the files shared in the authors' model and code repository, the model-selection analysis is not included (and the readme in the repository does not provide much help). As will become apparent below, the details of model training are important for assessing the validity of the results the authors claim.
• Details of the imputation methods used are missing. Is the variance conserved by the MICE algorithm that the authors use? And what are the implications of relying on the built-in missing-value handling of the XGBoost algorithm? This needs to be clarified to know whether this step is artificially limiting or enhancing the presented results.
• I cannot follow the details of the three-step training process (Fig. 1). Is the whole thing a nested cross-validation, where the outer loop (Step 1) is across countries (switching the "holdout country") and the inner loop (Steps 2 and 3) is the 5-fold cross-validation used during model training? I sense the approach is fine but cannot follow all steps to make sure.

2) Two of the four model types, specifically support vector machines and gradient-boosted trees, do not actually model the probability that an individual has HIV unless specific steps are taken (e.g. Platt scaling). It does not seem that such steps have been taken, but I cannot know for sure due to the insufficient methods details provided. It is highly problematic that gradient-boosted trees (the model type selected by the authors) do not return probabilities, since the authors use them to identify the subpopulation with more than a 90% probability of having HIV (scenario 2); put differently, the results from scenario 2 cannot be trusted as long as uncalibrated gradient-boosted trees are used. This concern seems to invalidate these results.
For context: these models assign individuals to classes based on classification rules (the sign of the decision function obtained from the convex quadratic optimization problem for support vector machines; the particular splitting rules, often based on cross-entropy, used in the gradient-boosted trees [1]) instead of directly modeling the probability that an individual has HIV. These are examples of improper scoring rules, which are well known to yield incorrect probabilities. A probability can be inferred from such a model after the fact, but it will depend on the particular elements of the dataset, can change with the addition of a single new data point, and is in general not a good estimate of the actual probability.
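To make the remedy concrete: a calibration step of the kind alluded to above could look roughly as follows. This is a minimal sketch on synthetic data, not the authors' pipeline; scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` implements Platt scaling on top of an arbitrary base classifier.

```python
# Sketch (not the authors' code): Platt-scaling a gradient-boosted
# tree classifier so that its outputs can be read as probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, class-imbalanced stand-in for the survey data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

base = GradientBoostingClassifier(random_state=0)
# method="sigmoid" is Platt scaling; the calibration map is fitted
# on held-out folds of an inner cross-validation (cv=5).
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Only after calibration can a cut-off such as P(positive) > 0.90
# be interpreted as an actual probability threshold (scenario 2).
probs = calibrated.predict_proba(X_test)[:, 1]
high_risk = probs > 0.90
```

Whether such a step was applied is exactly what the methods section would need to state.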
3) The results in Fig. 2 show substantially lower F1 scores on the left-out samples for support vector machines and gradient-boosted trees (XGBoost), which needs to be investigated. This shows that the performance found in training is not retrieved when the models are applied to new data, i.e. it undermines the trust that can be placed in these models' ability to offer usable predictions. This puts all presented results at risk of being incorrect! While the authors do observe this, no explanation is offered, nor is further investigation conducted. At a minimum, the authors would need to explain why this should not be a point of concern for trusting the results.
It is unclear to me whether this is a result of overfitting, a consequence of using the F1 metric to evaluate the models (the F1 score is itself an improper scoring rule and a non-linear function of the models' class assignments), or something else entirely.
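One way to probe this generalisation gap would be to compare within-country cross-validation against an explicit leave-one-country-out evaluation. The sketch below uses synthetic data with mock country labels (an assumption; the paper's actual grouping comes from the DHS country variable) via scikit-learn's `LeaveOneGroupOut`.

```python
# Sketch: contrast within-country CV with leave-one-country-out CV.
# A large drop from `within` to `across` would signal that the model
# does not transfer to countries unseen during training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=3)
# Mock assignment of each respondent to one of 10 "countries".
countries = np.random.RandomState(3).randint(0, 10, size=len(y))

model = GradientBoostingClassifier(random_state=3)
within = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
across = cross_val_score(model, X, y, scoring="f1",
                         cv=LeaveOneGroupOut(), groups=countries).mean()
print("within-country F1: %.3f  left-out-country F1: %.3f"
      % (within, across))
```

Reporting both numbers per model would let readers judge whether the drop is specific to the tree-based and kernel models or systematic.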
4) It appears the authors are using the F1 score to pick the best model (in Algorithms, as part of the Methods section). As already mentioned, this is an improper scoring rule (it does not indicate which model best predicts the probability that an individual has HIV), and selecting models based on this metric could lead to models with good F1 scores that are nevertheless a bad representation of whether or not an individual has HIV! The authors will need to clarify that they are not selecting models based on an improper scoring rule.
More details on considerations for the important topic of proper scoring rules for classification can be found in [2] and [3].
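To illustrate the distinction in code: the Brier score (a proper scoring rule) evaluates the predicted probabilities directly, whereas F1 evaluates hard class assignments at a fixed threshold. The sketch below, on synthetic data only, shows how the two selection criteria are computed side by side; it is not a re-analysis of the paper's models.

```python
# Sketch: model selection under an improper rule (F1 at a 0.5
# threshold) versus a proper scoring rule (the Brier score, which is
# minimised only by the true class probabilities).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=1)

models = {"logistic": LogisticRegression(max_iter=1000),
          "gbt": GradientBoostingClassifier(random_state=1)}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    # F1 collapses p to hard labels; Brier scores p itself.
    print(name,
          "F1=%.3f" % f1_score(y_te, p > 0.5),
          "Brier=%.3f" % brier_score_loss(y_te, p))
```

The two criteria can rank models differently, which is precisely why selecting on F1 alone is risky when calibrated probabilities are the quantity of interest.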
Minor points
1. The authors use random selection of model hyper-parameters, which is a well-established approach. It would be helpful to include a graphical illustration of the improvement from hyperparameter optimization, as an insight into whether model type or hyperparameter tuning is more important in this case.
2. The authors use Shapley values to assess the impact of each covariate on the outcome, which is also an established approach. However, readers without deep statistical training could be led to believe that the type of impact suggested by Shapley values could be used for shaping intervention strategies. Unfortunately, this would not be correct, because Shapley values do not describe the causal impact of each covariate, only the additional change in the model's output from adding that covariate. I would suggest adding a comment along these lines to mitigate any unintended conclusions.
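Regarding minor point 1, the suggested comparison could start from something like the following sketch: a cross-validated score with default hyper-parameters next to the best score from a random search. The data, parameter ranges, and budget are placeholders chosen for illustration, not the paper's settings.

```python
# Sketch: how much does random hyperparameter search improve over
# default settings?  The gap between the two printed scores hints at
# whether model type or tuning matters more for a given problem.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

default_score = cross_val_score(
    GradientBoostingClassifier(random_state=2), X, y, cv=3).mean()

# Hypothetical search space; the paper's actual ranges are not given.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=2),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 6),
                         "learning_rate": uniform(0.01, 0.3)},
    n_iter=8, cv=3, random_state=2)
search.fit(X, y)

print("default: %.3f  tuned: %.3f" % (default_score, search.best_score_))
```

Plotting `search.cv_results_["mean_test_score"]` across the sampled candidates, per model type, would give exactly the graphical illustration requested.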