Predicting suicide attempt or suicide death following a visit to psychiatric specialty care: A machine learning study using Swedish national registry data

Background
Suicide is a major public health concern globally. Accurately predicting suicidal behavior remains challenging. This study aimed to use machine learning approaches to examine the potential of the Swedish national registry data for prediction of suicidal behavior.

Methods and findings
The study sample consisted of 541,300 inpatient and outpatient visits by 126,205 Sweden-born patients (54% female and 46% male) aged 18 to 39 years (mean age at the visit: 27.3 years) to psychiatric specialty care in Sweden between January 1, 2011 and December 31, 2012. The most common psychiatric diagnoses at the visit were anxiety disorders (20.0%), major depressive disorder (16.9%), and substance use disorders (13.6%). A total of 425 candidate predictors covering demographic characteristics, socioeconomic status (SES), electronic medical records, criminality, as well as family history of disease and crime were extracted from the Swedish registry data. The sample was randomly split into an 80% training set containing 433,024 visits and a 20% test set containing 108,276 visits. Models were trained separately for suicide attempt/death within 90 and 30 days following a visit using multiple machine learning algorithms. Model discrimination and calibration were both evaluated. Among all eligible visits, 3.5% (18,682) were followed by a suicide attempt/death within 90 days and 1.7% (9,099) within 30 days. The final models were based on ensemble learning that combined predictions from elastic net penalized logistic regression, random forest, gradient boosting, and a neural network. The areas under the receiver operating characteristic (ROC) curves (AUCs) on the test set were 0.88 (95% confidence interval [CI] = 0.87–0.89) and 0.89 (95% CI = 0.88–0.90) for the outcome within 90 days and 30 days, respectively, both being significantly better than chance (i.e., AUC = 0.50) (p < 0.01). Sensitivity, specificity, and predictive values were reported at different risk thresholds. A limitation of our study is that our models have not yet been externally validated, and thus, the generalizability of the models to other populations remains unknown.

Conclusions
By combining the ensemble method of multiple machine learning algorithms and high-quality data solely from the Swedish registers, we developed prognostic models to predict short-term suicide attempt/death with good discrimination and calibration. Whether novel predictors can improve predictive performance requires further investigation.

8. TRIPOD Guideline: The S1 Checklist is not present in the file inventory; please provide the TRIPOD checklist. When completing the checklist, please use section names and paragraph numbers to refer to locations within the text, rather than page numbers.

Response:
We have now uploaded the TRIPOD checklist.
Comments from the reviewers: Reviewer #3: The authors have done a good job in making clarifications and addressing reviewer concerns. The additional limitations added to the discussion are important. I am still confused by the calibration plot. I cannot find in the paper whether the calibration plot was created using the training data or the validation data set. It should be created in the validation data set using percentile bins from the training data. It appears that deciles were used for the calibration plot (but why are there only 9 dots instead of 10?). I am still very surprised that the observed probability of a suicide attempt in the highest risk group is 100%. The math doesn't make sense here: if this plot was created in the validation data set, then there would be about 10,000 visits in the highest risk decile, and the graph indicates that nearly 100% of those visits were observed to have a suicide attempt following the visit. That would be about 10,000 suicide attempts, but there should only be about 3,726 suicide attempts in the entire validation data set. The math remains a problem if the calibration plot was created with the training data set.
Please provide more details (not just the function that was used) on how these calibration plots were created.
A common approach is to divide the visits into deciles; these deciles are placed on the x-axis at the mean predicted risk within each decile, and the y-axis shows the observed proportion of visits followed by a suicide attempt. Ultimately, a calibration plot needs to indicate, within specific risk-defined bins of people, how similar the predicted risk (from the model) is to the observed risk (the proportion of those visits with an event following the visit).
Response: Thank you for giving us another opportunity to clarify how the calibration curves were generated.
The calibration curves were derived from the test set. This has been clarified in the Methods section on page 9, line 5: "The Brier score (equal to zero under perfect calibration), along with calibration plots, was used to assess model calibration in the test set (i.e., the agreement between the observed proportion of positives and the mean predicted risk of the outcome in different risk strata)." The number of bins (n_bins) is a parameter of the Python function sklearn.calibration.calibration_curve (https://scikit-learn.org/stable/modules/generated/sklearn.calibration.calibration_curve.html). It specifies the number of bins into which the [0, 1] interval is split, and the default is 5 (i.e., bin edges at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0). The choice of this parameter value is somewhat arbitrary: it does not have to be 10 and can be any positive whole number. We followed one guiding principle in choosing the number of bins: the subsample within each bin should not be too small. Bins containing no observations, however, do not affect the overall pattern of the calibration curve, because no value is returned for such bins.
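To make this behavior concrete, the following minimal sketch (using synthetic data, not the study data) shows how sklearn.calibration.calibration_curve splits [0, 1] into n_bins equal-width bins and silently drops empty ones, which is why a curve can have fewer points than n_bins:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic illustration only: 1,000 "visits" with predicted risks
# concentrated near zero, as is typical for rare-outcome models.
rng = np.random.default_rng(0)
y_prob = rng.beta(1, 20, size=1000)   # most predicted risks are small
y_true = rng.binomial(1, y_prob)      # outcomes drawn from those risks

# n_bins=9 splits [0, 1] into 9 equal-width bins; bins that contain no
# predictions are dropped, so fewer than 9 points may be returned.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=9)

print(len(prob_true))  # fewer than 9 points when high-risk bins are empty
```

Because almost no synthetic predictions fall in the high-risk bins, the returned curve has fewer than 9 points, mirroring why our 30-day plot shows only 6 dots.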
In our study, the parameter (n_bins, the number of risk bins) was set to 9 for the 90-day outcome and 8 for the 30-day outcome. Detailed numbers and calculations underlying the curves are shown in the tables below. To present this as clearly as possible, the tables have been combined into one table and added to the online supplement as eTable 9.
In the Results section on page 10, line 22: "… More details can be found in eTable 9." For the 90-day outcome, all 31 index visits in the risk group with predicted risk between 0.889 and 1.000 were followed by a suicidal event within 90 days. Hence, the observed proportion of positives in that risk group was 100%. For the 30-day outcome, only 6 dots were generated, because no index visits fell in the last two bins.
We would like to illustrate in this response letter how the calibration curves would look if the parameter value (i.e., the number of risk bins) were set to 10. For the 90-day outcome, there would be too few index visits (17 and 21) in the 9th and 10th bins. For the 30-day outcome, there would be only 2 index visits in the 8th bin, resulting in a relatively large distortion of the curve. Therefore, we did not set the parameter to 10. This can be seen in the two calibration plots below, included for illustrative purposes: