Use of machine learning to identify risk factors for insomnia

Importance Sleep is critical to a person’s physical and mental health, but there are few studies systematically assessing risk factors for sleep disorders. Objective The objective of this study was to identify risk factors for a sleep disorder through machine learning and to assess this methodology. Design, setting, and participants A retrospective, cross-sectional cohort study using the publicly available National Health and Nutrition Examination Survey (NHANES) was conducted in patients who completed the demographic, dietary, exercise, and mental health questionnaire and had laboratory and physical exam data. Methods A physician diagnosis of insomnia was the outcome of this study. Univariate logistic models, with insomnia as the outcome, were used to identify covariates that were associated with insomnia. Covariates that had a p<0.0001 on univariate analysis were included within the final machine-learning model. The machine learning model XGBoost was used due to its prevalence within the literature as well as its increased predictive accuracy in healthcare prediction. Model covariates were ranked according to the cover statistic to identify risk factors for insomnia. Shapley Additive Explanations (SHAP) were utilized to visualize the relationship between these potential risk factors and insomnia. Results Of the 7,929 patients who met the inclusion criteria in this study, 4,055 (51%) were female and 3,874 (49%) were male. The mean age was 49.2 (SD = 18.4), with 2,885 (36%) White patients, 2,144 (27%) Black patients, 1,639 (21%) Hispanic patients, and 1,261 (16%) patients of another race. Of a total of 684 features, 64 were found to be significant on univariate analysis (p<0.0001). These were fitted into the XGBoost model, which achieved an AUROC of 0.87, a sensitivity of 0.77, and a specificity of 0.77.
The highest ranked features by cover, a measure of the percentage contribution of the covariate to the overall model prediction, were the Patient Health Questionnaire depression survey (PHQ-9) (Cover = 31.1%), age (Cover = 7.54%), physician recommendation of exercise (Cover = 3.86%), weight (Cover = 2.99%), and waist circumference (Cover = 2.70%). Conclusion Machine learning models can effectively predict risk for a sleep disorder using demographic, laboratory, physical exam, and lifestyle covariates and identify key risk factors.


Introduction
Sleep is critical to a person's physical and mental health [1][2][3][4][5][6]. However, the prevalence of diagnosed sleep disorders among American patients has significantly increased over the past decade [1,5,7-10]. Sleep disorders are a broad categorization of disorders that encompass conditions that lead to difficulty falling asleep, poor sleep quality, early waking, circadian rhythm disorders, parasomnias, sleep-related movement disorders, and sleep-related breathing disorders [11][12][13]. This is particularly important as sleep disorders are a significant risk factor for diabetes, heart disease, obesity, and depression, leading to decreased quality of life and increased healthcare usage [14,15]. Additionally, poor quality of sleep has been associated with decreased productivity at work and at school, increased stress, and decreased quality of life [16][17][18][19]. To combat the debilitating consequences of sleep disorders, a plethora of pharmacologic treatments have been introduced to the market and prescribed by physicians [20][21][22][23][24][25][26]. While medications have shown efficacy in decreasing sleep latency, significant side effects have been associated with these medications [27][28][29][30][31][32]. These include addiction, respiratory depression, decreased quality of sleep, and significant withdrawal symptoms when these medications are discontinued [21,33-35]. Furthermore, due to the increasing prevalence of obstructive sleep apnea, continuous positive airway pressure (CPAP) machines are more regularly prescribed [27].
Despite recognition of sleep disorders as a strong contributor to increasing mortality and morbidity, little is known regarding specific risk factors that are strongly linked with increased probability of having sleep disorders. Given these limitations in the literature, we will leverage transparent machine-learning methods (Shapley Additive Explanations (SHAP) model explanations and model gain statistics) to identify pertinent risk factors for sleep disorders and compute their relative contribution to model prediction of risk for sleep disorder; the NHANES 2017-2020 cohort, a large, nationally representative sample of US adults, will be used within this study.

Methods
A retrospective, cross-sectional cohort study using the publicly available National Health and Nutrition Examination Survey (NHANES) was conducted in patients who completed the demographic, dietary, exercise, and mental health questionnaire and had laboratory and physical exam data. The acquisition and analysis of the data within this study was approved by the National Center for Health Statistics Ethics Review Board. Within this retrospective cohort, all data (medical records, survey information, demographic information) was fully anonymized before data analysis was carried out and all patients consented to their data being publicly available.

Dataset and cohort selection
The National Health and Nutrition Examination Survey (NHANES 2017-2020) is a program designed by the National Center for Health Statistics (NCHS), which has been leveraged to assess the health and nutritional status of the United States population. The NHANES dataset is a series of cross-sectional, complex, multi-stage surveys conducted by the Centers for Disease Control and Prevention (CDC) on a nationally representative cohort of the United States population to provide health, nutritional, and physical activity data. In the present study, we analyzed adult (≥18 years old) patients in the NHANES dataset who completed the demographic, dietary, exercise, and mental health questionnaire and had laboratory and physical exam data.

Assessment of sleep disorder
The medical conditions file was used to identify patients with a sleep disorder. Participants were asked: "Have you ever told a doctor or other healthcare professional that you have trouble sleeping?" Participants who answered "Yes" to this question were considered to have a sleep disorder within this study.

Independent variable
Potential model covariates were identified within the demographics, dietary, physical examination, laboratory, and medical questionnaire datasets in NHANES. A total of 783 covariates were identified from the NHANES dataset. All covariates were extracted and merged with the sleep disorder indicator.

Model construction and statistical analysis
Univariate logistic models, with a sleep disorder as the outcome, were used to identify covariates that were associated with a sleep disorder. Covariates that had a p<0.0001 on univariate analysis were included within the final machine-learning model. This initial univariable filter of the 700+ covariates within the dataset was used to ensure that every covariate entering the machine-learning models was, on its own, strongly associated with the outcome. Furthermore, this initial filtering allowed for physician review of risk factors that were clinically relevant. After initial filtering, model importance statistics from machine-learning models were used to identify pertinent risk factors.
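The univariate screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (wald_p_value, screen), the toy covariates, and the use of a Wald test on the slope are all assumptions, and only the p<0.0001 threshold comes from the text. It uses only the Python standard library and fits each one-covariate logistic model by Newton-Raphson.

```python
import math

def wald_p_value(x, y, max_iter=50):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson and return the
    two-sided Wald p-value for the slope b1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if sd == 0.0:
        return 1.0  # a constant covariate carries no information
    # Standardizing x improves stability and leaves the Wald z unchanged.
    z = [(v - mean) / sd for v in x]
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, y):
            eta = max(-30.0, min(30.0, b0 + b1 * zi))  # guard exp overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            g0 += yi - p            # score (gradient of the log-likelihood)
            g1 += (yi - p) * zi
            h00 += w                # observed information matrix entries
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    se_b1 = math.sqrt(h00 / det)    # slope variance from the inverse information
    wald_z = b1 / se_b1
    return math.erfc(abs(wald_z) / math.sqrt(2.0))

def screen(covariates, y, alpha=1e-4):
    """Return the names of covariates whose univariate p-value is below alpha."""
    return sorted(name for name, x in covariates.items()
                  if wald_p_value(x, y) < alpha)

# Hypothetical toy data (not NHANES): one informative score, one pure-noise flag.
rows = range(400)
y = [1 if (i // 10) % 10 < i % 10 else 0 for i in rows]
covariates = {
    "phq9_like": [float(i % 10) for i in rows],        # outcome rate rises with the score
    "noise": [float((i // 100) % 2) for i in rows],    # unrelated to the outcome
}
selected = screen(covariates, y)
print(selected)
```

In a real analysis a statistics package would normally be used for the model fits; the hand-rolled fit here only keeps the sketch dependency-free.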
Four machine-learning methods were carried out: XGBoost, Random Forest (RF), Adaptive Boost (ADABoost), and Artificial Neural Network (ANN). All machine-learning models were constructed using 10-fold cross validation. Cross validation was applied to only the training set. A train:test split (80:20) was used to compute the final set of model fit parameters. The model fit parameters used in this study were accuracy, F1, sensitivity, specificity, positive predictive value, negative predictive value, and AUROC (area under the receiver operating characteristic curve).
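The evaluation scheme above (an 80:20 split, 10-fold cross validation restricted to the training portion, and the listed fit parameters) can be sketched in plain Python. The function names, the random seed, the 0.5 classification threshold, and the toy labels and scores are assumptions made for illustration; the text does not state how the split was randomized or which threshold was used.

```python
import random

def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out test_fraction of them as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(train_idx, k=10):
    """Yield (fit, validation) index lists; folds are drawn from the training set only."""
    folds = [train_idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, val

def fit_metrics(y_true, y_score, threshold=0.5):
    """Compute the fit parameters named in the text from labels and predicted scores.
    Assumes both classes occur and at least one positive and one negative prediction."""
    tp = fp = tn = fn = 0
    for yt, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and yt == 1:
            tp += 1
        elif pred == 1 and yt == 0:
            fp += 1
        elif pred == 0 and yt == 0:
            tn += 1
        else:
            fn += 1
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    # AUROC as the probability that a random positive outscores a random negative.
    pos = [s for yt, s in zip(y_true, y_score) if yt == 1]
    neg = [s for yt, s in zip(y_true, y_score) if yt == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * ppv * sens / (ppv + sens),
        "sensitivity": sens, "specificity": spec,
        "ppv": ppv, "npv": npv,
        "auroc": wins / (len(pos) * len(neg)),
    }

train_idx, test_idx = train_test_indices(100)
folds = list(k_fold(train_idx))

# Hypothetical labels and scores for eight held-out patients.
metrics = fit_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                      [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2])
print(metrics)
```

Keeping the held-out 20% out of the fold construction is what allows the final fit parameters to be read as an estimate of out-of-sample performance.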
A grid search of hyperparameters for the XGBoost, Random Forest, and Adaptive Boost methods was conducted. Trees were searched between 200 and 2000 at 100 tree increments, with the optimal number being 600 trees for all models. The artificial neural network comprised an input layer, hidden layers, and a scalar output layer. The ReLU activation function was used at each hidden layer and a sigmoid function at the output layer. The hyperparameters were determined by optimal accuracy across a grid search of 2-10 hidden layers, 128-1024 for hidden layer dimensions, and 64-512 for batch size. The optimal hyperparameters were 4 hidden layers, a hidden layer dimension of 256, and a batch size of 64.
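To make the selected network shape concrete, the sketch below builds a forward pass with 4 hidden layers of 256 units, ReLU at each hidden layer, and a sigmoid at the scalar output. Everything beyond those three facts is an assumption: the weights are random and untrained, the initialization scheme is a common default rather than one reported in the text, and the input width of 64 is chosen only because the abstract reports 64 screened features.

```python
import math
import random

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def init_layer(n_in, n_out, rng):
    # He-style random initialization (an assumption; the text does not specify one).
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def build_network(n_inputs, n_hidden_layers=4, hidden_dim=256, seed=0):
    """Layer sizes: input -> 4 x 256 hidden -> 1 output."""
    rng = random.Random(seed)
    sizes = [n_inputs] + [hidden_dim] * n_hidden_layers + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def predict_proba(network, x):
    h = list(x)
    for weights, bias in network[:-1]:
        h = relu(dense(weights, bias, h))       # ReLU at every hidden layer
    weights, bias = network[-1]
    return sigmoid(dense(weights, bias, h)[0])  # sigmoid at the scalar output

network = build_network(n_inputs=64)
probability = predict_proba(network, [0.1] * 64)  # untrained, so the value is arbitrary
print(len(network), probability)
```

The batch size of 64 affects only training, which is omitted here; a deep-learning framework would be the practical choice for fitting such a model.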
The machine learning model XGBoost was used due to its prevalence within the literature as well as its increased predictive accuracy in healthcare prediction. Furthermore, XGBoost was chosen as the optimal model based upon the mean AUROC.

Model feature importance statistics and SHAP visualization
Model covariates were ranked according to the Gain, Cover, and Frequency to identify risk factors for a sleep disorder. The Gain is the relative contribution of the feature within the model. The Cover is the number of observations related to this feature that were present. The Frequency is the percentage of times the feature occurs in the trees of the machine-learning model. The Gain statistic was chosen as the method to rank features based upon feature importance due to its ease of interpretation: the proportion the covariate contributed to the final prediction.
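As an illustration of how the three statistics differ, the sketch below aggregates per-split records into normalized Gain, Cover, and Frequency values and ranks features by each. It follows the common convention of reporting each statistic as a proportion summing to 1 across features; the split records, feature names, and function names are hypothetical, and the numbers are not from the study. In XGBoost itself, a split's cover is computed from the second-order gradients of the observations reaching that split, which the toy numbers simply stand in for.

```python
from collections import defaultdict

def importance_table(splits):
    """splits: (feature, gain, cover) for every split node across all trees.
    Returns per-feature Gain, Cover, and Frequency as proportions."""
    gain = defaultdict(float)
    cover = defaultdict(float)
    count = defaultdict(int)
    for feature, g, c in splits:
        gain[feature] += g
        cover[feature] += c
        count[feature] += 1
    total_gain = sum(gain.values())
    total_cover = sum(cover.values())
    total_count = sum(count.values())
    return {
        f: {
            "Gain": gain[f] / total_gain,
            "Cover": cover[f] / total_cover,
            "Frequency": count[f] / total_count,
        }
        for f in gain
    }

def rank_features(table, by="Gain"):
    """Order features from most to least important by the chosen statistic."""
    return sorted(table, key=lambda f: table[f][by], reverse=True)

# Hypothetical split records from a tiny boosted ensemble.
splits = [
    ("phq9", 40.0, 500.0),
    ("age", 10.0, 300.0),
    ("phq9", 20.0, 200.0),
    ("weight", 10.0, 100.0),
    ("age", 20.0, 900.0),
]
table = importance_table(splits)
print(rank_features(table, by="Gain"))   # ['phq9', 'age', 'weight']
print(rank_features(table, by="Cover"))  # ['age', 'phq9', 'weight']
```

The toy example shows that the ranking can change with the statistic chosen, so the statistic used for a reported ranking should always be stated alongside it.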
SHAP explanations were utilized to visualize the relationship between the strongest continuous risk factors and a sleep disorder. SHAP visualizations were conducted for the top four continuous covariates by model cover (Fig 6). We observed that increased PHQ-9 scores were strongly linked to the odds of a sleep disorder. Each increase in PHQ-9 score is associated with increased odds of a sleep disorder up to around a PHQ-9 score of 11, at which point the odds of a sleep disorder no longer increase with increased PHQ-9 score. Additionally, we observed a curvilinear relationship between weight and odds of a sleep disorder. There is no significant increase in odds of a sleep disorder with increasing weight for patients weighing under 80 kg, but after 80 kg, increased weight is associated with significantly increased odds of a sleep disorder. Furthermore, age was found to be a significant risk factor for a sleep disorder, with odds of a sleep disorder increasing from age 20 until age 60, at which point there does not appear to be an increase in sleep disorder with increasing age. Lastly, there is a strong relationship between waist circumference and a sleep disorder. There is no significant increase in odds of a sleep disorder with increasing waist circumference until after 100 cm, at which point there is a significant increase in odds of a sleep disorder with increasing waist circumference.
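The idea behind a SHAP value can be illustrated with an exact Shapley computation on a toy model. This is a sketch under stated assumptions and not the procedure used in the study: the risk function below is hypothetical, with made-up coefficients chosen only to mimic the shapes described above (a plateau at a PHQ-9 of 11, a threshold at 80 kg, and an age effect between 20 and 60), and "missing" features are replaced by a single baseline patient. SHAP software for tree models uses a faster tree-specific algorithm rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).
    Features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_risk(features):
    """Hypothetical risk score; coefficients are illustrative, not estimated."""
    phq9, weight_kg, age = features
    return (0.05 * min(phq9, 11)
            + 0.01 * max(weight_kg - 80.0, 0.0)
            + 0.005 * min(max(age - 20.0, 0.0), 40.0))

patient = [15.0, 95.0, 45.0]    # PHQ-9, weight (kg), age (years)
reference = [0.0, 70.0, 20.0]   # baseline patient used for comparison
phi = shapley_values(toy_risk, patient, reference)
print(phi)
```

The attributions sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that lets each feature's contribution be plotted against its value.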

Discussion
In this retrospective, cross sectional cohort of United States adults, a machine learning model utilizing demographic, laboratory, physical examination, and lifestyle questionnaire data had strong predictive accuracy (AUROC = 0.87). The greatest predictors for a sleep disorder included depression (PHQ-9), weight, age, and waist circumference.
Prior studies have accurately predicted the presence of sleep disorders by applying numerous machine-learning methods to a variety of datasets [36-38]. Short-term insomnia detection was conducted using single-channel sleep electrooculography. Furthermore, natural language processing on 18,901 tweets was conducted to find correlations between words related to insomnia and negative health information [38][39][40]. Additionally, a comparative study of 15 machine learning algorithms identified 14 main factors for the prediction of insomnia, identifying that vision problems, mobility problems, and sleep disorders were significantly related to insomnia [38,39]. These studies highlight the utility of machine learning models in identifying patients at risk for sleep disorders. What our study adds to the literature is a large dataset (N = 7,929) and a diverse wealth of potential covariates (700+ covariates) to study how lifestyle, diet, demographic, and medical covariates are able to predict insomnia.
The visualizations completed for the top four continuous covariates were concordant with current literature: there is strong epidemiological evidence that sleep problems are heavily linked with depression. Multiple papers have found difficulty falling asleep and decreased hours of sleep with increased depression [41][42][43][44][45][46][47][48][49][50][51][52][53]. Additionally, depression has been linked to lower quality sleep and increased daytime exhaustion [31,34,46,54,55]. There is also strong literature evidence for the link between weight and sleep disorders [4]. There is epidemiological evidence for the relationship between increased age and increased sleep disorders: older age has been associated with increased sleep latency, decreased time spent in rapid eye movement (REM) sleep and stage-3 sleep, and increased frequency of waking up during the night [56][57][58][59][60][61][62]. Furthermore, increased caffeine usage has been found to be linked with difficulty falling asleep, decreased time falling asleep, and decreased quality of sleep [63]. Additionally, increased alcohol is associated with sleep disorders, leading to decreased sleep latency and potential physiologic need for alcohol as a depressant to allow for sleep in multiple patients [64][65][66].
Since visualizations for risk factors match literature relationships, we have increased confidence that the machine learning model is able to capture the actual physiological relationships of these covariates. These transparent machine-learning tools allow for increased confidence that these algorithms are picking up true signal within these covariates to predict the presence of a sleep disorder rather than just replicating potential biases stemming from systemic data-quality errors that are present within the dataset. Additionally, these SHAP visualizations allow us to interpret that the increased predictive power of these machine-learning methods is associated with the ability of these non-parametric methods to more accurately capture the non-linear interactive relationship between the covariates, rather than just over-fitting the model to get increased accuracy.
The greatest strength of this algorithmic method for identification of the covariates is the ability to search through hundreds of covariates systematically without relying upon judgment from the researcher, which may be muddled by potential personal biases. This method also allows for the ranking of the relative importance of each of these covariates through the cover statistic, which allows us to obtain the relative contribution each covariate makes to the prediction and from there infer an estimate of its relative contribution to each patient's true risk for a sleep disorder. Another strength is that after these covariates are selected and the model built, SHAP visualizations can be used to make sure that each covariate either matches the current literature understanding of the covariate's association with a sleep disorder or, in the case of a discrepancy, allows researchers to validate the plausibility of this feature and then evaluate for potential errors in data quality.

PLOS ONE
A potential weakness of this machine-learning analysis is the retrospective nature of this cohort. The covariates that were selected within this study will be better at predicting risk for a sleep disorder for this cohort than for other cohorts. However, this was mitigated by the use of training and testing sets to minimize the errors that come with overfitting. Furthermore, SHAP visualizations allow researchers to test the physiologic plausibility of each of these covariates and to assess whether these effects are due to true signal or are just noise that may be contributing to a type-1 error.
Given the analysis of the strengths and weaknesses of these methods, we argue that use of machine-learning methods can be an effective first step in the identification of risk-factors that can then be further selected by clinicians based upon the specific clinical presentation.

Limitations
This study has several strengths and weaknesses. We utilized the NHANES dataset, which is a retrospective cohort, carrying the limitations of retrospective studies. However, this study allows for the selection of a large cohort, evaluation of data quality, and due to the publicly available nature of the cohort, allows for increased replication and follow-up studies based upon the same cohort. Furthermore, the cohort relied on surveys to obtain the outcome of interest (a sleep disorder requiring medical attention) as well as the dietary and lifestyle information. More accurate measurements may have been achieved with prospective studies with automated measurement of foods. However, self-reported survey information allows a large volume of participants to be included within this study. Another weakness was the voluntary nature of this cohort, with participants choosing to opt into the study instead of being randomly selected. This may select a cohort that differs significantly from the population. However, our analysis found a demographically diverse population, so these results may still be generalizable to other cohorts.

Conclusion
Machine learning models can effectively predict risk for a sleep disorder using demographic, laboratory, physical exam, and lifestyle covariates and identify key risk factors. Depression, age, weight, and waist circumference were the strongest predictors of sleep disorder.