Unsupervised machine learning predicts future sexual behaviour and sexually transmitted infections among HIV-positive men who have sex with men

Machine learning is increasingly introduced into medical fields, yet there is limited evidence for its benefit over more commonly used statistical methods in epidemiological studies. We introduce an unsupervised machine learning framework for longitudinal features and evaluate it using sexual behaviour data from the last 20 years from over 3,700 participants in the Swiss HIV Cohort Study (SHCS). We use hierarchical clustering to find subgroups of men who have sex with men in the SHCS with similar sexual behaviour up to May 2017, and apply regression to test whether these clusters enhance predictions of sexual behaviour or sexually transmitted infections (STIs) after May 2017 beyond what can be predicted with conventional parameters. We find that behavioural clusters enhance model performance according to the likelihood ratio test, Akaike information criterion and area under the receiver operating characteristic curve for all outcomes studied, and according to the Bayesian information criterion for five out of ten outcomes, with particularly good performance for predicting future sexual behaviour and recurrent STIs. We thus assess a methodology that can serve as an alternative means of creating exposure categories from longitudinal data in epidemiological models and can contribute to the understanding of time-varying risk factors.
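For readers less familiar with the approach, the two-step pipeline described in the abstract (hierarchical clustering of behaviour trajectories, then using the cluster label as a regression covariate) can be sketched roughly as follows. This is an illustrative sketch on simulated data, not the study's actual code; the trajectory encoding and group sizes are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: each row is one participant's behaviour trajectory,
# e.g. yearly 0/1 indicators of reported nsCAI, padded to equal length.
rng = np.random.default_rng(0)
low_risk = rng.binomial(1, 0.1, size=(50, 10))    # rarely report nsCAI
high_risk = rng.binomial(1, 0.8, size=(50, 10))   # frequently report nsCAI
trajectories = np.vstack([low_risk, high_risk]).astype(float)

# Ward hierarchical clustering on the trajectories
Z = linkage(trajectories, method="ward")
clusters = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups

# The cluster label can then enter a regression model for future
# behaviour or STI incidence as a categorical covariate.
print(sorted(set(int(c) for c in clusters)))  # → [1, 2]
```

The cut point (here two clusters) is a modelling choice; the study's actual number of clusters and distance metric may differ.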


1. The authors used likelihood ratio tests (LRT), AIC, BIC, and auROC to assess the predictive performance of the clusters. To my knowledge, LRT, AIC, and BIC are typically used for model selection and not prediction. While these three metrics showed that a model including the cluster variables improves model fit, they do not assess prediction. Therefore, based on my understanding, statements such as those on lines 174 ("…improved the model fit for predicting") and 180 ("…improved model performance") are not accurate, as LRT does not assess prediction.
Response: We used regression to assess whether adding clusters to existing models would improve model fit. LRT was chosen because it can aptly compare the goodness of fit of nested models, while AIC and BIC were chosen for their over-/underfitting assessment properties. In other words, we compare a regression model for STIs that uses conventional variables to the same model augmented by the behavioural clusters. The aim of this study was to apply this clustering method (which can be seen as a dimensionality reduction framework for time-varying data) to a routine clinical problem rather than to find the ideal way to predict STIs within the Swiss HIV Cohort Study. We acknowledge, however, that more accurate wording could help the interpretation of our results. Therefore, in line with the reviewer's comment, we further clarify this in the revised manuscript. The relevant section in the methods now reads: "We used likelihood ratio tests (LRT) and Bayesian information criteria (BIC) to compare model fits with and without behavioural clusters. Likelihood ratio tests were chosen as they can aptly compare the goodness of fit of nested models, while Bayesian information criteria were chosen for their over-/underfitting assessment properties."
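As a worked illustration of this comparison, the statistics behind the nested-model tests can be computed directly from the two models' log-likelihoods. All numbers below are illustrative, not taken from the study; they merely show how the fuller model can win on LRT and AIC yet lose on BIC, which penalises extra parameters more heavily once log(n) > 2.

```python
from math import log

# Hypothetical log-likelihoods: a base model vs. the same model augmented
# with three cluster dummies, fitted on n observations (illustrative values).
ll_base, k_base = -1200.0, 5
ll_full, k_full = -1188.0, 8
n = 3700

# Likelihood ratio test for nested models
lr_stat = 2 * (ll_full - ll_base)    # 24.0, compared to a chi-squared(df=3)
chi2_crit_df3_95 = 7.815             # chi-squared 95% quantile for df = 3
lrt_significant = lr_stat > chi2_crit_df3_95

# Information criteria: lower is better
aic_base, aic_full = 2 * k_base - 2 * ll_base, 2 * k_full - 2 * ll_full
bic_base = log(n) * k_base - 2 * ll_base
bic_full = log(n) * k_full - 2 * ll_full

print(lrt_significant, aic_full < aic_base, bic_full < bic_base)
# → True True False: the augmented model wins on LRT and AIC but not on BIC
```

This mirrors the pattern reported in the manuscript, where BIC favoured the cluster-augmented model for only some outcomes.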
And we added the following passage to the discussion: "We recognise that there may be better-performing ways to predict STIs. However, the aim of this study was to test the relevance of behavioural clusters for understanding the epidemiology of STIs (i.e. to assess whether these clusters were associated with distinct patterns of STI incidence) rather than to find the ideal way to predict STIs within the SHCS. We test the association between behavioural clusters and STIs in a predictive context but, more generally, this analysis also informs which dimensions of human behaviour matter most for STI incidence and how the complex temporal variation of behavioural data can best be simplified to capture these essential dimensions."
2. The manuscript did not discuss the use of training and validation datasets. As written, it seems that the entire dataset was used to assess prediction. Without training/validation datasets, prediction performance estimates are typically too optimistic. Creating training/validation datasets seems to be the standard approach for prediction, so it would be nice to understand why that was not done.
Response: We did not originally consider training and validation datasets because our analysis was not a predictive one in the classic sense (please see our response to comment 1). Further, we did not expect a large risk of overfitting given the small number of parameters and the large number of events. However, following the reviewer's suggestion, we have added an analysis using 5-fold cross-validation to the supplementary material of the revised version. Accuracy in the test set ranged between 78% and 91% for predicting future sexual behaviour and STIs. As seen in the ROC analysis, adding clusters to a model considering other predictors brought only marginal and in some cases no benefit in accuracy. We report these analyses in the revised supplementary material (see Figures S4 and S5).
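The cross-validation scheme added in the revision can be sketched as follows. This is a minimal illustration on simulated data with a deliberately simple stand-in "model" (majority class per predictor level); the study's actual models, predictors, and accuracy values are not reproduced here.

```python
import numpy as np

# Hypothetical data: one binary predictor (e.g. last reported nsCAI)
# and a binary outcome associated with it.
rng = np.random.default_rng(1)
n = 500
x = rng.binomial(1, 0.5, n)
y = (rng.random(n) < np.where(x == 1, 0.8, 0.2)).astype(int)

# 5-fold cross-validation: fit on four folds, score the held-out fold
folds = np.array_split(rng.permutation(n), 5)
accuracies = []
for i in range(5):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
    # stand-in "model": majority outcome within each predictor level
    pred_for = {v: int(y[train_idx][x[train_idx] == v].mean() >= 0.5)
                for v in (0, 1)}
    preds = np.array([pred_for[int(v)] for v in x[test_idx]])
    accuracies.append(float((preds == y[test_idx]).mean()))

print([round(a, 2) for a in accuracies])  # one held-out accuracy per fold
```

Held-out accuracy is computed only on folds the model never saw, which is what guards against the over-optimism the reviewer raises.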

3. The paper makes the claim that the use of the clusters improves prediction. However, it would be nice to see a more robust model selection framework, i.e., including the previous two nsCAI values as variables (instead of just the previous one) or an "ever nsCAI" variable, as well as an investigation of the functional form of age.
Response: Following the reviewer's suggestion, the revised version contains analyses including the previous two nsCAI values, an "ever reported nsCAI" variable, and a "mean nsCAI" variable. The results suggest that considering clusters consistently yields better model performance than using an "ever reported nsCAI" variable, though in most cases worse model performance than using the last two available nsCAI values or a "mean nsCAI" value. Considering the last two available nsCAI values strongly improved model performance for predicting future nsCAI, yet yielded little to no performance improvement for predicting future STIs and syphilis. We used age as a linear predictor, as exploring its functional form showed a steady decrease of STI incidence with age. We present these analyses in the revised supplementary material (see Figures S2 and S6) and added the following passage to the results section: "Comparing models with behavioural clusters to models including other metrics derived from past behaviour showed that while clusters improve model fit, equal or better improvements can be achieved by considering other parameters, such as the last two available nsCAI values, or using a mean nsCAI value before cut-off. Models considering behavioural clusters performed consistently better than those only considering whether a participant had ever reported nsCAI (Supplementary Figure S6)."
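To make the alternative behaviour summaries concrete, the covariates compared above could be derived per participant along these lines. This is a hypothetical sketch; the function and variable names are ours, not the study's, and the input is one participant's chronological 0/1 nsCAI reports up to the cut-off.

```python
def behaviour_features(nscai_history):
    """Summaries of past nsCAI reports (chronological 0/1 values)."""
    last_two = nscai_history[-2:]  # the last two available values
    return {
        "last_nscai": nscai_history[-1],
        "second_last_nscai": last_two[0] if len(last_two) == 2 else None,
        "ever_nscai": int(any(nscai_history)),                       # ever reported
        "mean_nscai": sum(nscai_history) / len(nscai_history),       # mean before cut-off
    }

print(behaviour_features([0, 0, 1, 0, 1]))
# → {'last_nscai': 1, 'second_last_nscai': 0, 'ever_nscai': 1, 'mean_nscai': 0.4}
```

Each summary collapses the trajectory differently: "ever" discards recency entirely, the mean weights all visits equally, and the last two values capture only recent behaviour, which is consistent with the pattern of results reported above.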
4. The auROC metric (the one metric in the paper that I have typically seen used to assess prediction) did not seem very different with and without the clusters (as seen in Table 2). This left me wondering whether the unsupervised machine learning clusters really did improve prediction, especially once training and validation datasets are created and a more robust model selection approach is taken.