Abstract
Importance
Sleep is critical to physical and mental health, and there is a need both to create high-performing machine learning models and to understand critically how such models rank covariates.
Objective
The study aimed to compare how different model metrics rank the importance of various covariates.
Design, setting, and participants
A cross-sectional cohort study was conducted retrospectively using the National Health and Nutrition Examination Survey (NHANES), which is publicly available.
Methods
This study employed univariate logistic models to filter out strong, independent covariates associated with the sleep disorder outcome; these covariates were then used in several machine-learning models, from which the most optimal was chosen. The chosen machine-learning model was used to rank model covariates by gain, cover, and frequency to identify risk factors for sleep disorder, and feature importance was additionally evaluated using both univariable and multivariable t-statistics. A correlation matrix was created to determine how similarly the different model metrics ranked variable importance.
Results
The XGBoost model had the highest mean AUROC of 0.865 (SD = 0.010), with Accuracy of 0.762 (SD = 0.019), F1 of 0.875 (SD = 0.766), Sensitivity of 0.768 (SD = 0.023), Specificity of 0.782 (SD = 0.025), Positive Predictive Value of 0.806 (SD = 0.025), and Negative Predictive Value of 0.737 (SD = 0.034). The machine-learning metrics gain and cover were strongly positively correlated with one another (r > 0.70). Metrics from the multivariable and univariable models were weakly negatively correlated with the machine-learning metrics (r between -0.3 and 0).
Citation: Huang AA, Huang SY (2024) Comparison of model feature importance statistics to identify covariates that contribute most to model accuracy in prediction of insomnia. PLoS ONE 19(7): e0306359. https://doi.org/10.1371/journal.pone.0306359
Editor: Sergio A. Useche, University of Valencia: Universitat de Valencia, SPAIN
Received: April 22, 2023; Accepted: June 14, 2024; Published: July 2, 2024
Copyright: © 2024 Huang, Huang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data from this cohort is freely available without restriction and can be found on the NHANES section of the CDC website. Data Share Statement: Data described in the manuscript are present at: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?cycle=2017-2020.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Insomnia is a widespread clinical condition characterized by difficulty initiating or maintaining sleep, which can result in significant physical and mental health consequences. The annual prevalence of insomnia symptoms in the general adult population ranges from 35–50%, with the prevalence of insomnia disorder ranging from 12–20% [1]. Several risk factors contribute to the development of insomnia, including depression, female sex, older age, lower socioeconomic status, concurrent medical and mental disorders, marital status, and race [2–6]. Moreover, insomnia often follows a chronic course, and its functional consequences include reduced productivity, increased absenteeism, and increased healthcare costs. Insomnia also increases the risk of developing mental disorders, including depression, and is associated with worse treatment outcomes in depression and alcohol dependence [3, 7–11]. Furthermore, insomnia is linked to an increased risk of developing metabolic syndrome, hypertension, and coronary heart disease [1, 7, 12–15]. Despite the high prevalence and negative consequences of sleep disorders, researchers are only beginning to apply advanced mathematical models in this field, and it is necessary for physicians to understand how the models are created. Increased research into feature importance will strengthen physicians' capacity for clinical reasoning about model output.
Linear regression, logistic regression, multivariable statistics, and machine learning have all been essential tools for outcome researchers and physicians in the diagnosis and treatment of various diseases [16, 17]. Linear regression has been used to assess the relationship between continuous predictors and outcomes, which is useful in identifying risk factors for disease progression or treatment outcomes [18]. Logistic regression, on the other hand, has been used to model the probability of binary outcomes, such as the presence or absence of a disease. This technique is particularly useful for diagnosis, where the aim is to correctly classify patients as either having or not having a disease based on various predictors such as symptoms, demographics, and lab values. Multivariable statistics, which include techniques such as multiple regression and analysis of variance, have been used to model the relationship between multiple predictors and an outcome variable [19]. This is important in identifying the most important risk factors for disease progression or treatment outcomes, as well as determining the optimal treatment approach for different patient groups. Machine learning techniques, such as XGBoost and neural networks, have become increasingly popular in recent years due to their ability to handle complex data and identify patterns that may not be apparent with traditional statistical methods [20]. Machine learning has been used to develop predictive models for various diseases, such as diabetes, heart disease, and cancer, as well as to identify subgroups of patients who may benefit from targeted treatments [21–25]. With the rise of machine learning techniques in healthcare research, it is crucial to examine how these models compare to traditional statistical approaches in terms of variable selection and ranking. 
While traditional statistical models focus on hypothesis testing and estimation, machine learning models aim to predict outcomes by learning patterns in the data. These differences in approach may lead to variations in variable importance and ranking, which can ultimately impact clinical decision-making. Therefore, to study prediction of insomnia better, it is essential to conduct studies that assess the degree of similarity between machine learning and traditional statistical models in terms of their variable selection and ranking based on various importance statistics, such as cover, frequency, gain, the univariable t-statistic, and the multivariable t-statistic. This study aims to address this gap in knowledge and provide insight into the similarities and differences between these two modeling approaches.
To address these limitations, we will highlight how some of the most common models researchers use rank covariates based upon model statistics. We will utilize a correlation matrix to visually present the correlational relationships between these model statistics and show their degree of similarity. The analysis will use the NHANES 2017–2020 cohort, a large, nationally representative sample of US adults, covering demographic, laboratory, physical exam, and lifestyle covariates. This study will help to increase understanding of the different methods for evaluating risk factors for sleep disorders and provide a better understanding of the key risk factors for sleep disorders in the US population.
Methods
A cross-sectional cohort study was carried out using the publicly available National Health and Nutrition Examination Survey (NHANES) data. The retrospective study included patients who had completed questionnaires related to their demographics, diet, exercise, and mental health, and had undergone laboratory and physical exams. The National Center for Health Statistics (NCHS) Ethics Review Board approved the data acquisition and analysis for this study. To ensure patient privacy, all data including medical records, survey information, and demographic information were fully anonymized prior to analysis. All patients provided their written consent for their data to be made public.
Dataset and cohort selection
The NHANES program, developed by the NCHS, aims to assess the health and nutritional status of the US population through complex, multi-stage surveys conducted by the CDC. The NHANES dataset includes data on health, nutrition, and physical activity from a representative sample of the US population. This study focused on individuals aged 18 years or older who completed the demographic, dietary, exercise, and mental health questionnaires and had both laboratory and physical exam data available for analysis. All patients in the dataset with full insomnia data were included. A total of 7,929 patients met the inclusion criteria, of whom 2,302 (29%) had a sleep disorder.
Assessment of sleep disorder
To identify patients with sleep disorders in this study, we utilized the medical conditions file. Participants were queried with the following question: "Have you ever reported to a healthcare professional or doctor that you experience difficulty sleeping?" If the answer to this question was "Yes," the participant was classified as having a sleep disorder for the purposes of this study.
Independent variables
The NHANES dataset was searched to identify potential model covariates from the demographics, dietary, physical examination, laboratory, and medical questionnaire datasets. In total, 783 covariates were found and extracted, and they were then merged with the sleep disorder indicator.
Covariate selection considerations
Recognizing the potential influence of collinearity on variable selection, we underscore the importance of preliminary correlation analysis prior to covariate selection. To facilitate a balanced comparison of the machine learning and regression models' predictive capabilities for insomnia, we standardized the evaluation approach by incorporating common performance metrics: mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE). This uniformity in performance evaluation enables a more equitable assessment of each model's effectiveness. We also performed a residual analysis, examining the discrepancies between observed insomnia probabilities and those predicted by the models; this strengthens the robustness of the findings and provides deeper insight into the models' predictive accuracy. Given these concerns, we found the most effective methodology was the one described here, chosen for its simplicity while offering comparable results.
Model construction and statistical analysis
In this study, univariate logistic models were employed to determine which covariates were associated with a sleep disorder outcome. Covariates that demonstrated a p-value of less than 0.0001 in univariate analysis were included in the final machine-learning model. The use of univariate logistic models served as an initial filter of the 700+ covariates present in the dataset, ensuring that only strong, independent covariates were used in the machine learning models. This initial filtering also facilitated physician review of clinically relevant risk factors. Following the initial filtering process, model importance statistics derived from the machine-learning models were used to identify key risk factors.
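As an illustrative sketch of this screening step (the study's analysis was done in R; the covariate names and p-values below are hypothetical), the univariate filter amounts to keeping only covariates whose p-value clears the threshold:

```python
# Univariate screening: keep only covariates whose univariate
# logistic-regression p-value falls below the study's 0.0001 threshold.
P_THRESHOLD = 0.0001

# Hypothetical p-values from per-covariate univariate logistic fits.
univariate_p = {
    "phq9_score": 3.2e-40,
    "age": 1.1e-9,
    "waist_circumference": 4.5e-7,
    "vitamin_d": 0.03,   # fails the threshold
    "shoe_size": 0.71,   # fails the threshold
}

selected = [cov for cov, p in univariate_p.items() if p < P_THRESHOLD]
print(selected)  # covariates passed on to the machine-learning models
```

The surviving covariates are then handed to the machine-learning stage; the threshold of 0.0001 is deliberately strict given the 700+ candidate covariates.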
Four machine-learning methods were carried out: XGBoost, Random Forest, Adaptive Boost, and Artificial Neural Network. All machine-learning models were constructed using 10-fold cross-validation. An 80:20 train:test split was used to compute the final set of model fit parameters, which included accuracy, F1, sensitivity, specificity, positive predictive value, negative predictive value, and AUROC (area under the receiver operating characteristic curve).
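All fit parameters other than AUROC derive from the binary confusion matrix on the held-out test split. A minimal sketch of those definitions (the study used R; the counts here are hypothetical):

```python
# Compute six of the seven fit statistics (AUROC is excluded, as it
# requires ranked prediction scores) from a binary confusion matrix.
def fit_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    sensitivity = tp / (tp + fn)          # recall / true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"accuracy": accuracy, "f1": f1, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv}

# Hypothetical test-split counts for illustration only.
print(fit_metrics(tp=80, fp=20, tn=60, fn=40))
```

Under 10-fold cross-validation, these statistics are computed per fold and then summarized as the means and standard deviations reported in the Results.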
If the models performed differently from one another, the best model based upon the model metrics would be chosen; if they performed similarly, the machine learning model of choice would be decided based upon a literature search. In this case, XGBoost was used due to its prevalence within the literature as well as its strong predictive accuracy in healthcare prediction; it was also the most optimal model based upon the seven model fit parameters that were computed. To identify risk factors for sleep disorder, model covariates were ranked on three criteria: Gain, Cover, and Frequency. Gain refers to the relative contribution of a feature within the machine-learning model, Cover is the number of observations associated with the feature, and Frequency is the percentage of times the feature is present in the trees of the model. To visualize the relationship between potential risk factors and sleep disorder, SHAP explanations were utilized. Additionally, feature importance was evaluated using both univariable and multivariable t-statistics.
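Gain, Cover, and Frequency are aggregates of per-split statistics across the trees of the boosted model. A toy Python sketch of how the three metrics are assembled and normalized (the splits below are hypothetical; in practice these values are read directly from the fitted XGBoost model's importance table):

```python
# Toy aggregation of per-split statistics into the three importance
# metrics XGBoost reports. Each tuple is one tree split:
# (feature used, gain at the split, observations covered by the split).
splits = [
    ("phq9_score", 12.0, 500),
    ("phq9_score", 6.0, 220),
    ("age", 4.0, 300),
    ("waist_circumference", 2.0, 180),
]

def importance(splits, kind):
    totals = {}
    for feat, gain, cover in splits:
        value = {"gain": gain, "cover": cover, "frequency": 1}[kind]
        totals[feat] = totals.get(feat, 0) + value
    grand = sum(totals.values())
    return {f: v / grand for f, v in totals.items()}  # normalize to sum to 1

for kind in ("gain", "cover", "frequency"):
    print(kind, importance(splits, kind))
```

Because the three metrics aggregate different per-split quantities, they can and do produce different covariate rankings, which motivates the correlation analysis that follows.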
Determination of the similarity of the importance of variables by the model metrics
Variables were ranked based on each criterion (Gain, Cover, Frequency, univariable t-statistic, and multivariable t-statistic). A correlation matrix was created that calculated the correlation coefficient between all possible pairings of gain, cover, frequency, univariable t-statistic, and multivariable t-statistic. All statistical analysis was done in R using RStudio (Version 2023.06.0+421). Packages utilized: dplyr, tidyr, stringr, lubridate, summarytools, psych, ggplot2, plotly, ggpubr, caret, randomForest, glmnet, xgboost, keras, shap, pROC, missForest, boot, cvms, recipes, VennDiagram, fastshap [26].
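A sketch of this pairwise correlation matrix (Python with a hand-rolled Pearson coefficient; the metric scores below are hypothetical placeholders, not the study's values):

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical importance scores for the same four covariates
# under four of the five metrics.
metrics = {
    "gain":      [0.31, 0.08, 0.05, 0.02],
    "cover":     [0.20, 0.09, 0.06, 0.03],
    "frequency": [0.61, 0.06, 0.05, 0.04],
    "uni_t":     [5.5, 5.9, 3.1, 2.2],
}

# One correlation coefficient for every possible pairing of metrics.
matrix = {(a, b): pearson(metrics[a], metrics[b])
          for a, b in combinations(metrics, 2)}
for pair, r in matrix.items():
    print(pair, round(r, 2))
```

With all five metrics, the same construction yields the ten pairwise coefficients visualized in Fig 1.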
Results
Overall performance and variability of the models
Table 1 shows model accuracy statistics for the four machine learning models. The XGBoost model performed strongly, most notably with the highest mean AUROC of the four models: mean AUROC = 0.865 (SD = 0.010), Accuracy = 0.762 (SD = 0.019), F1 = 0.875 (SD = 0.766), Sensitivity = 0.768 (SD = 0.023), Specificity = 0.782 (SD = 0.025), Positive Predictive Value = 0.806 (SD = 0.025), and Negative Predictive Value = 0.737 (SD = 0.034). Among the 10,000 simulations completed, the AUROC ranged from 0.755 to 0.918 (a difference of 0.163), the accuracy from 0.657 to 0.894 (a 0.237 difference), the F1 from 0.655 to 0.875 (a 0.221 difference), the sensitivity from 0.675 to 0.768 (a 0.211 difference), and the specificity from 0.565 to 0.936 (a 0.370 difference). All of the machine learning models performed strongly, with mean AUROCs ranging from 0.818 to 0.865.
Table 2 shows the model statistics including the gain, cover, frequency, univariable t-statistic, and multivariable t-statistic for all covariates with p-values <0.0001. These allowed for variable selection and clinical evaluation of the importance of each of these potential features.
Table 3 highlights the top ten variables for each of the model statistics. Across all feature importance statistics, PHQ-9 score was the most important, with a gain of 0.309, cover of 0.197, frequency of 0.609, univariable t-statistic of 5.536, and multivariable t-statistic of 0.281. Age was the second most important feature for 4 of the 5 statistics, with a gain of 0.075, cover of 0.094, frequency of 0.061, univariable t-statistic of 5.933, and multivariable t-statistic of 0.177.
Correlation matrix of correlations between model gain statistics
Fig 1 shows that Gain and Cover were strongly positively correlated, with a correlation coefficient of 0.96. Pairs that were moderately positively correlated included gain and frequency (r = 0.61) and cover and frequency (r = 0.68). The pairs of univariable t-statistic and gain (r = -0.027), univariable t-statistic and cover (r = -0.067), univariable t-statistic and frequency (r = -0.046), multivariable t-statistic and gain (r = -0.13), multivariable t-statistic and cover (r = -0.16), multivariable t-statistic and frequency (r = -0.23), and multivariable t-statistic and univariable t-statistic (r = -0.076) were all weakly negatively correlated.
Discussion
In this retrospective, cross-sectional cohort of United States adults, machine learning models utilizing demographic, laboratory, physical examination, and lifestyle questionnaire data all had strong predictive accuracy, with mean AUROCs ranging from 0.818 to 0.865. From the machine learning models, the variables with the highest associations with a sleep disorder were depression (PHQ-9), weight, age, and waist circumference. XGBoost was chosen as the machine learning model of choice because it had the highest mean AUROC. Comparing the various machine learning models was important to show that their performance metrics are similar across the distribution, ensuring that any differences in variable contributions are not due to differences in model performance.
In the field of machine learning, identifying the most important variables in predicting an outcome is crucial. Our study reveals that different measures of feature importance result in wide variability in selecting the top 10 covariates in the final model. This discrepancy arises from the varied methods used to assess which covariates contribute the most to the model, such as linear regression's reliance on a least-squares metric that treats estimates as non-interactive. In contrast, machine learning models construct feature importance metrics from gain, cover, and frequency statistics, resulting in a different set of top covariates [17]. The interaction of these biomolecular pathways is challenging to comprehend, and traditional regression models may not effectively account for these complex interactions [27]. Therefore, we propose that machine learning methods utilizing gain, cover, and frequency model selection statistics are better equipped to handle these complexities and provide a more accurate representation of the most important covariates in predicting outcomes.
In the context of feature selection in machine learning, it is crucial to recognize that each method yields a different set of best covariates. As such, different model selection statistics need to be combined to determine the best approach. In this study, we evaluated three measures of machine learning feature importance (cover, gain, and frequency) and two measures for regression (univariable and multivariable t-statistics). We found a strong correlation among the machine-learning metrics frequency, gain, and cover. However, there were weak and sometimes negative correlations between the feature importance ranks of the machine learning models and those of univariable and multivariable regression. These findings suggest that complex interactions occur within the machine learning models that are not accounted for in the multivariable regression; interpreting the multivariable results alone may therefore yield an inaccurate representation of the importance of these covariates [28].
In modeling, understanding which covariates are important due to their interactions with other covariates or on their own is challenging. Accounting for confounding variables has always been difficult; multivariable regression is the most common approach, but it cannot efficiently account for every possible interaction, and it is impossible to run all the pairwise, three-way, four-way, and five-way interactions present in a multivariable model efficiently [29]. The most efficient way to capture these interactions is through machine learning models that iterate through the data, develop the most efficient models, and are cross-validated and tested through train-test splits. Therefore, the large discrepancy between the feature ranks of univariable and multivariable regression and those of machine learning models highlights the importance of accounting for interaction terms through machine learning methods.
Our study evaluates how differently these methods handle interaction terms and how those differences lead to different model feature statistic rankings. By efficiently visualizing the relationship between each covariate and the outcome while accounting for confounding variables through machine learning, we can better identify the most important variables for further investigation in prospective studies. We therefore argue that machine learning offers a way of evaluating variables beyond traditional regression, as it can account for confounding variables and identify important variables for future studies.
Limitations
This study has both strengths and limitations. The utilization of the NHANES dataset, which is a large retrospective cohort, allows for the selection of a substantial sample size, evaluation of data quality, and broad generalizability. However, it also carries the limitations of retrospective studies, such as reliance on self-reported surveys to obtain information on the outcome of interest and lifestyle choices. Prospective studies with automated measurements of foods may be more accurate, but they may not have the advantage of including a larger volume of participants through self-reported information. Another limitation is the voluntary nature of the cohort, which may introduce selection bias. However, the demographic diversity of the cohort analyzed suggests that the findings may still be generalizable to other cohorts. It is important to note that while this study focused on machine learning models and traditional statistical models, other models that are not linear or involve machine learning could be explored in future studies.
Conclusion
Machine learning models offer information beyond that of regression models when ranking variable importance for the prediction of insomnia.
References
- 1. Buysse DJ. Insomnia. JAMA. 2013;309(7):706–16. pmid:23423416; PubMed Central PMCID: PMC3632369.
- 2. Blake MJ, Trinder JA, Allen NB. Mechanisms underlying the association between insomnia, anxiety, and depression in adolescence: Implications for behavioral sleep interventions. Clin Psychol Rev. 2018;63:25–40. Epub 20180528. pmid:29879564.
- 3. Di H, Guo Y, Daghlas I, Wang L, Liu G, Pan A, et al. Evaluation of Sleep Habits and Disturbances Among US Adults, 2017–2020. JAMA Netw Open. 2022;5(11):e2240788. Epub 20221101. pmid:36346632; PubMed Central PMCID: PMC9644264.
- 4. M KP, Latreille V. Sleep Disorders. Am J Med. 2019;132(3):292–9. Epub 20181004. pmid:30292731.
- 5. Muth CC. Sleep-Wake Disorders. JAMA. 2016;316(21):2322. pmid:27923092.
- 6. Wesselius HM, van den Ende ES, Alsma J, Ter Maaten JC, Schuit SCE, Stassen PM, et al. Quality and Quantity of Sleep and Factors Associated With Sleep Disturbance in Hospitalized Patients. JAMA Intern Med. 2018;178(9):1201–8. pmid:30014139; PubMed Central PMCID: PMC6142965.
- 7. Edinger JD. Classifying insomnia in a clinically useful way. J Clin Psychiatry. 2004;65 Suppl 8:36–43. pmid:15153066.
- 8. Frydman D. [Individual evolution of idiopathic insomnia]. Waking Sleeping. 1979;3(1):51–5. pmid:494641.
- 9. Goldberg LD. Managing insomnia in an evolving marketplace. Am J Manag Care. 2006;12(8 Suppl):S212–3. pmid:16686590.
- 10. Medina-Chávez JH, Fuentes-Alexandro SA, Gil-Palafox IB, Adame-Galván L, Solís-Lam F, Sánchez-Herrera LY, et al. [Clinical practice guideline. Diagnosis and treatment of insomnia in the elderly]. Rev Med Inst Mex Seguro Soc. 2014;52(1):108–19. pmid:24625494.
- 11. Roth T. Introduction—Advances in our understanding of insomnia and its management. Sleep Med. 2007;8 Suppl 3:25–6. pmid:18032105.
- 12. Spiegelhalder K, Espie C, Nissen C, Riemann D. Sleep-related attentional bias in patients with primary insomnia compared with sleep experts and healthy controls. J Sleep Res. 2008;17(2):191–6. pmid:18482107.
- 13. Tsuchihashi-Makaya M, Matsuoka S. Insomnia in Heart Failure. Circ J. 2016;80(7):1525–6. Epub 20160603. pmid:27264415.
- 14. Wittchen HU, Krause P, Höfler M, Pittrow D, Winter S, Spiegel B, et al. [NISAS-2000: The "Nationwide Insomnia Screening and Awareness Study". Prevalence and interventions in primary care]. Fortschr Med Orig. 2001;119(1):9–19. pmid:11935661.
- 15. Yoshihisa A, Kanno Y, Takeishi Y. Insomnia and Cardiac Events in Patients With Heart Failure- Reply. Circ J. 2016;81(1):126. Epub 20161214. pmid:27980297.
- 16. Castro HM, Ferreira JC. Linear and logistic regression models: when to use and how to interpret them? J Bras Pneumol. 2023;48(6):e20220439. Epub 20230113. pmid:36651441; PubMed Central PMCID: PMC9747134.
- 17. Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One. 2023;18(2):e0281922. Epub 20230223. pmid:36821544; PubMed Central PMCID: PMC9949629.
- 18. Gomila R. Logistic or linear? Estimating causal effects of experimental treatments on binary outcomes using regression analysis. J Exp Psychol Gen. 2021;150(4):700–9. Epub 20200924. pmid:32969684.
- 19. Richardson AM, Joshy G, D’Este CA. Understanding statistical principles in linear and logistic regression. Med J Aust. 2018;208(8):332–4. pmid:29716508.
- 20. Huang AA, Huang SY. Use of machine learning to identify risk factors for insomnia. PLoS One. 2023;18(4):e0282622. Epub 20230412. pmid:37043435; PubMed Central PMCID: PMC10096447.
- 21. Baik SM, Kim KT, Lee H, Lee JH. Machine learning algorithm for early-stage prediction of severe morbidity in COVID-19 pneumonia patients based on bio-signals. BMC Pulm Med. 2023;23(1):121. Epub 20230414. pmid:37059983.
- 22. Cai Y, Su H, Si Y, Ni N. Machine learning-based prediction of diagnostic markers for Graves’ orbitopathy. Endocrine. 2023. Epub 20230415. pmid:37059863.
- 23. Dos Reis AHS, de Oliveira ALM, Fritsch C, Zouch J, Ferreira P, Polese JC. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12(1):68. Epub 20230415. pmid:37061711.
- 24. Meza Ramirez CA, Greenop M, Almoshawah YA, Martin Hirsch PL, Rehman IU. Advancing cervical cancer diagnosis and screening with spectroscopy and machine learning. Expert Rev Mol Diagn. 2023. Epub 20230415. pmid:37060617.
- 25. Mohebi M, Amini M, Alemzadeh-Ansari MJ, Alizadehasl A, Rajabi AB, Shiri I, et al. Post-revascularization Ejection Fraction Prediction for Patients Undergoing Percutaneous Coronary Intervention Based on Myocardial Perfusion SPECT Imaging Radiomics: a Preliminary Machine Learning Study. J Digit Imaging. 2023. Epub 20230414. pmid:37059890.
- 26. Liu Q, Gui D, Zhang L, Niu J, Dai H, Wei G, et al. Simulation of regional groundwater levels in arid regions using interpretable machine learning models. Sci Total Environ. 2022;831:154902. Epub 20220329. pmid:35364142.
- 27. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15(4):233–4. Epub 20180403. pmid:30100822; PubMed Central PMCID: PMC6082636.
- 28. Dharma C, Fu R, Chaiton M. Table 2 Fallacy in Descriptive Epidemiology: Bringing Machine Learning to the Table. Int J Environ Res Public Health. 2023;20(13). Epub 20230621. pmid:37444042; PubMed Central PMCID: PMC10340623.
- 29. Bunce C, Czanner G, Grzeda MT, Dore CJ, Freemantle N. Ophthalmic statistics note 12: multivariable or multivariate: what’s in a name? Br J Ophthalmol. 2017;101(10):1303–5. Epub 20170816. pmid:28814413.