Application of a transparent artificial intelligence algorithm for US adults in the obese category of weight

Abstract

Objective and aims

Identifying factors associated with the obese category of weight in the general US population will continue to advance our understanding of the condition and allow clinicians, providers, communities, families, and individuals to make more informed decisions. This study aims to improve the prediction of the obese category of weight and investigate its relationships with these factors, ultimately contributing to healthier lifestyle choices and timely management of obesity.

Methods

Questionnaires that included demographic, dietary, exercise and health information from the US National Health and Nutrition Examination Survey (NHANES 2017–2020) were utilized, with a BMI of 30 or higher defined as obesity. A machine learning model, XGBoost, predicted the obese category of weight, and Shapley Additive Explanations (SHAP) visualized the covariates and their feature importance. Model statistics including area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, positive predictive value, and negative predictive value, as well as feature properties such as gain, cover, and frequency, were measured. SHAP explanations were created for transparent and interpretable analysis.

Results

A total of 6,146 adults (age > 18) were included in the study, with an average age of 58.39 (SD = 12.94) and 3,122 (51%) females. The machine learning model had an area under the receiver operating characteristic curve of 0.8295. The top covariates by gain included waist circumference (gain = 0.185), GGT (gain = 0.101), platelet count (gain = 0.059), AST (gain = 0.057), weight (gain = 0.049), HDL cholesterol (gain = 0.032), and ferritin (gain = 0.034).

Conclusion

Machine learning models were effective in predicting the obese category of weight. By considering demographic information, laboratory results, physical examination findings, and lifestyle factors, these models identified risk factors associated with the obese category of weight.

Introduction

Obesity is a worldwide concern, including in the United States, where it has reached epidemic proportions. Over the past few decades, general epidemiological trends indicate a steady increase in the prevalence of obesity across all age groups and socioeconomic strata. This surge in obesity rates has far-reaching consequences for public health, as it is associated with a myriad of serious health issues and imposes a significant economic burden on individuals and society [1].

The alarming rise of obesity in the American population has raised numerous red flags for policymakers, healthcare providers, and researchers. Obesity is not just a matter of aesthetics or body image; it is a multifaceted problem with severe health implications. Individuals affected by obesity are at a higher risk of developing chronic conditions such as type 2 diabetes, cardiovascular diseases, hypertension, and certain cancers [1–4]. Moreover, obesity has been linked to reduced quality of life, decreased life expectancy, and increased healthcare costs.

The biochemical pathways that underlie obesity are complex and multifactorial. Genetic predisposition, environmental factors, sedentary lifestyles, and poor dietary choices all play pivotal roles in the development and progression of obesity. Understanding these underlying mechanisms is crucial for designing effective interventions and tailored treatment strategies.

In response to the obesity epidemic, health authorities have established guidelines and recommendations for its prevention and management. These guidelines typically emphasize a comprehensive approach that includes dietary modifications, increased physical activity, behavior change strategies, and, in some cases, medical interventions. Nevertheless, combating obesity remains a challenge due to its multifaceted nature and the need for personalized interventions [2].

Shapley Additive Explanations (SHAP) has emerged as a promising tool for understanding the complex interplay of factors contributing to obesity. SHAP explanations provide transparent and interpretable insights into the machine learning models used to predict obesity and help identify the most influential features driving the predictions. This enhances our understanding of the complex relationships between various factors and the obese category of weight [5, 6].

The aim of this study is to leverage the power of machine learning, specifically XGBoost, along with SHAP explanations to improve the prediction of the obese category of weight. By utilizing data from the US National Health and Nutrition Examination Survey (NHANES), we seek to investigate the associations between obesity and various demographic, dietary, exercise, and health factors. The ultimate goal is to gain deeper insights into obesity’s underlying mechanisms, contribute to the development of more effective preventive strategies, and facilitate timely and personalized management of obesity-related health conditions. By addressing obesity at its root causes, we hope to pave the way for a healthier and more resilient population.

Methods

A cross-sectional cohort study was conducted using data from the publicly available National Health and Nutrition Examination Survey (NHANES). Participants responded to a detailed questionnaire covering demographic information, dietary habits, exercise routines, and mental health, and underwent laboratory tests and physical examinations. The National Center for Health Statistics’ (NCHS) Ethics Review Board approved the study’s data collection and processing. All data, including medical records, survey responses, and demographic data, were de-identified before analysis to safeguard the participants’ confidentiality and privacy. Participants gave written consent prior to the study’s start, allowing the public release of their data.

Dataset and cohort selection

The National Center for Health Statistics (NCHS) developed the National Health and Nutrition Examination Survey (NHANES) 2017–2020 to evaluate the health and nutritional status of the American population. The Centers for Disease Control and Prevention (CDC) conducted a comprehensive series of cross-sectional, multi-stage surveys to collect data on health, nutrition, and physical activity for the NHANES dataset. Our investigation focused on adult participants (aged 18 and above) in the NHANES dataset who completed demographic, dietary, exercise, and mental health questionnaires and who underwent physical and laboratory examinations. This sample was selected to represent the national population of the United States.

Assessment of obesity

The obese category of weight is defined as having a Body Mass Index (BMI) greater than or equal to 30. BMI is calculated by dividing an individual’s weight (in kilograms) by the square of their height (in meters). A BMI of 30 or above indicates obesity, as per the standard classification set by the World Health Organization (WHO) and many health authorities.
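
The outcome definition above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names `bmi` and `is_obese` and the example participant are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float, threshold: float = 30.0) -> bool:
    """Binary outcome as defined in the text: BMI greater than or equal to 30."""
    return bmi(weight_kg, height_m) >= threshold


# Hypothetical participant (not from NHANES): 95 kg, 1.75 m
print(round(bmi(95, 1.75), 1))  # 31.0
print(is_obese(95, 1.75))       # True
```

Because the threshold is inclusive, a BMI of exactly 30 falls in the obese category.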

Model construction and statistical analysis

In this study, the NHANES dataset encompassed covariates drawn from demographics, dietary data, physical examinations, laboratory results, and medical questionnaires. Univariate analysis was initially used to explore the relationship between these covariates and the obese category of weight, the outcome variable. To identify strong independent covariates, variables with p-values below 0.0001 in the univariate analysis were selected for the machine learning model before their interactions were considered in the larger model. XGBoost, a widely used and effective algorithm in healthcare predictions, was chosen for this study. Past studies using NHANES data have identified XGBoost as the optimal algorithm, offering a balance of training efficiency, model accuracy, and interpretability. The dataset was split into a training set (80%) and a test set (20%) to determine the parameters for the final model fit. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, positive predictive value, negative predictive value, prevalence, detection rate, detection prevalence, and balanced accuracy. Additionally, before the XGBoost modeling presented in this paper was carried out, other machine learning methods, such as artificial neural networks, gradient boosting, and random forests, were evaluated. XGBoost was the most accurate across the metrics considered (AUROC, sensitivity, specificity) and thus was used in this paper.
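
As a rough sketch of how the threshold-based performance statistics listed above relate to one another, the following Python function derives them from the four cells of a confusion matrix. The function name and the example counts are hypothetical and are not taken from the study; the definitions are the standard ones, with the obese category treated as the positive class.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Threshold-based metrics from confusion-matrix counts.

    tp/fn: obese participants predicted obese / not obese
    tn/fp: non-obese participants predicted not obese / obese
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
        "prevalence": (tp + fn) / total,          # share of true positives in the data
        "detection_rate": tp / total,             # true positives over all observations
        "detection_prevalence": (tp + fp) / total,  # predicted positives over all observations
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical test-set counts, for illustration only
metrics = classification_metrics(tp=80, fp=30, tn=70, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUROC is not included here because it is computed from the ranking of predicted probabilities rather than from a single confusion matrix.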

Model feature importance statistics and SHAP visualization

The Gain metric assesses a feature’s importance in the model by calculating its individual contribution for each tree. A higher Gain value suggests greater significance in generating predictions compared to other features. On the other hand, the Cover metric indicates the relative number of observations associated with a specific feature. It is calculated by summing up the occurrences of that feature across all trees. For example, if feature one appears in 15, 10, 8, and 5 observations in tree one, tree two, tree three, and tree four, respectively, the cover for feature one would be 38 observations. The Cover metric is then expressed as a percentage of the total cover for all features. Additionally, the Frequency metric represents the relative occurrence of a particular feature in the model’s trees. Using the previous example, if feature one appears in 3, 2, 4, and 1 splits within tree one, tree two, tree three, and tree four, respectively, the weight for feature one would be 10. The Frequency for feature one is determined by calculating its weight as a percentage of the weights of all features.
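
The worked example above can be reproduced with a short Python sketch. It mirrors the simplified counting description in the text rather than XGBoost's internal computation. The counts for `feature_one` come from the example; `feature_two`, `feature_three`, and the helper name `relative_importance` are hypothetical and were added only so that a percentage can be formed.

```python
from typing import Dict, List


def relative_importance(per_tree_counts: Dict[str, List[int]]) -> Dict[str, float]:
    """Sum each feature's per-tree counts, then express as a share of the total."""
    totals = {name: sum(counts) for name, counts in per_tree_counts.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


# Observations associated with each feature in trees one to four
cover_counts = {
    "feature_one": [15, 10, 8, 5],   # from the text: sums to 38
    "feature_two": [20, 12, 6, 4],   # hypothetical: sums to 42
    "feature_three": [5, 5, 6, 4],   # hypothetical: sums to 20
}

# Number of splits using each feature in trees one to four
split_counts = {
    "feature_one": [3, 2, 4, 1],     # from the text: sums to 10
    "feature_two": [4, 3, 2, 1],     # hypothetical: sums to 10
    "feature_three": [2, 1, 1, 1],   # hypothetical: sums to 5
}

cover = relative_importance(cover_counts)
frequency = relative_importance(split_counts)

print(f"Cover of feature_one: {cover['feature_one']:.2f}")          # 0.38
print(f"Frequency of feature_one: {frequency['feature_one']:.2f}")  # 0.40
```

With these assumed counts, feature one accounts for 38 of 100 covered observations and 10 of 25 splits.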

Results

Table 1 shows the 6,146 participants who met the inclusion criteria in this study. The average age was 58.39 (SD = 12.94). Overall, participants had mean HS C-Reactive Protein levels of 4.34 mg/L (SD = 8.89), insulin levels of 15.13 uU/mL (SD = 25.09), blood lead levels of 1.33 ug/dL (SD = 1.30), blood cadmium levels of 0.50 ug/L (SD = 0.56), uric acid levels of 5.48 mg/dL (SD = 1.48), and creatinine levels of 121.99 mg/dL (SD = 80.62). Compared with participants who were not in the obese category of weight, those in the obese category had higher mean HS C-Reactive Protein (5.89 mg/L, SD = 9.56 vs. 3.14 mg/L, SD = 8.14), insulin (21.85 uU/mL, SD = 33.16 vs. 10.21 uU/mL, SD = 15.11), uric acid (5.80 mg/dL, SD = 1.51 vs. 5.24 mg/dL, SD = 1.42), and creatinine (134.34 mg/dL, SD = 84.15 vs. 112.60 mg/dL, SD = 76.52), and lower mean blood lead (1.15 ug/dL, SD = 1.29 vs. 1.47 ug/dL, SD = 1.30) and blood cadmium (0.43 ug/L, SD = 0.54 vs. 0.55 ug/L, SD = 0.56).

Seventy-eight features were significant on univariate analysis (P < 0.0001) and were fitted into the XGBoost model. As shown in Fig 1 and Table 2, the model had an AUROC of 0.8295, sensitivity of 0.9125, specificity of 0.5615, positive predictive value of 0.4226, and negative predictive value of 0.9481.
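
To illustrate why a model with high sensitivity and moderate specificity can still have a modest positive predictive value, the sketch below applies Bayes' rule to the sensitivity and specificity reported above. The function name and the prevalence values are illustrative assumptions, not figures from the study.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence               # true positive share
    fn = (1 - sensitivity) * prevalence         # false negative share
    tn = specificity * (1 - prevalence)         # true negative share
    fp = (1 - specificity) * (1 - prevalence)   # false positive share
    return tp / (tp + fp), tn / (tn + fn)


SENS, SPEC = 0.9125, 0.5615  # values reported in the text

# Balanced accuracy is the mean of sensitivity and specificity
balanced_accuracy = (SENS + SPEC) / 2
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.737

# Hypothetical prevalences, chosen only to show the trend
for prevalence in (0.10, 0.30, 0.50):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

For fixed sensitivity and specificity, the positive predictive value rises and the negative predictive value falls as prevalence increases, so predictive values should be read alongside the prevalence of the sample in which they were measured.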

Fig 1. Receiver operator characteristic curve and model statistics.

The receiver operating characteristic curve for the machine learning model predicting whether a participant was in the obese category of weight. AUROC = 0.8295 (P<0.0001).

https://doi.org/10.1371/journal.pone.0304509.g001

Table 3 shows that the top four covariates ranked by gain, a measure of the percentage contribution of the covariate to the overall model prediction, were HS C-Reactive Protein (mg/L) (Gain = 0.181), Insulin (uU/mL) (Gain = 0.149), Blood lead (ug/dL) (Gain = 0.056), and Blood cadmium (ug/L) (Gain = 0.038).

In Fig 2, overall SHAP explanations can be seen for all the statistically significant covariates on univariable regression.

Fig 2. Overall SHAP explanations.

In the SHAP explanations, purple represents higher values of the covariate and yellow represents lower values. The x-axis is the change in log-odds for being in the obese category of weight.

https://doi.org/10.1371/journal.pone.0304509.g002
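
The log-odds scale used in the SHAP figures can be illustrated with a toy calculation. The sketch below computes exact Shapley values for a small hypothetical log-odds model by averaging each feature's marginal contribution over all feature orderings, using a single reference participant as the baseline. The model, its coefficients, and the input values are invented for illustration; this is not the fitted XGBoost model, and the SHAP library's tree-based estimator works differently in practice.

```python
from itertools import permutations
from math import exp


def toy_log_odds(crp: float, insulin: float, lead: float) -> float:
    """Hypothetical log-odds model with one interaction term."""
    return -1.0 + 0.10 * crp + 0.05 * insulin - 0.30 * lead + 0.002 * crp * insulin


def shapley_values(model, x: dict, baseline: dict) -> dict:
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference participant
        previous = model(**current)
        for name in order:
            current[name] = x[name]       # switch this feature to its observed value
            updated = model(**current)
            phi[name] += updated - previous
            previous = updated
    return {name: value / len(orderings) for name, value in phi.items()}


def sigmoid(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + exp(-log_odds))


x = {"crp": 6.0, "insulin": 20.0, "lead": 1.0}          # hypothetical participant
baseline = {"crp": 3.0, "insulin": 10.0, "lead": 1.5}   # hypothetical reference

phi = shapley_values(toy_log_odds, x, baseline)
base_value = toy_log_odds(**baseline)
prediction = toy_log_odds(**x)

for name, value in phi.items():
    print(f"{name}: change in log-odds = {value:+.3f}")
print(f"baseline + contributions = {base_value + sum(phi.values()):.3f}")
print(f"model log-odds = {prediction:.3f}, probability = {sigmoid(prediction):.3f}")
```

The contributions sum to the difference between the participant's log-odds and the baseline log-odds, which is the additive property that gives SHAP its name. A positive value moves the prediction toward the obese category and a negative value moves it away.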

In Fig 3, SHAP visualizations are shown for the top four continuous covariates by overall SHAP explanations. HS C-Reactive Protein and insulin were positively associated with obesity, whereas blood lead and blood cadmium levels were negatively associated with obesity.

Fig 3. SHAP explanations, covariate value on the x-axis, change in log-odds on the y-axis, red line represents the relationship between the covariate and log-odds for being in the obese category of weight, each black dot represents an observation.

Covariates: top left–HS C-Reactive Protein (mg/L), top right–Insulin (uU/mL), bottom left–Blood lead (ug/dL), bottom right–Blood cadmium (ug/L).

https://doi.org/10.1371/journal.pone.0304509.g003

Discussion

In this cross-sectional cohort study of US adults, an artificial intelligence algorithm trained on demographic, laboratory, physical examination, and lifestyle factors from the National Health and Nutrition Examination Survey (NHANES) demonstrated high predictive accuracy, with an area under the receiver operating characteristic curve (AUROC) of 0.8295. This indicates that the model predicted obesity well above chance (an AUROC of 0.5). The top four covariates with significant associations with the obese category of weight in the artificial intelligence model, based on SHAP value, were HS C-Reactive Protein, insulin, the questionnaire item on whether a doctor had told the participant to reduce salt in the diet, and blood lead levels. The top four covariates ranked by gain, which corresponds to the contribution of each feature to the overall artificial intelligence algorithm, were HS C-Reactive Protein (mg/L) (Gain = 0.181), Insulin (uU/mL) (Gain = 0.149), Blood lead (ug/dL) (Gain = 0.056), and Blood cadmium (ug/L) (Gain = 0.038).

The artificial intelligence algorithm employed in our study demonstrates consistent associations and directionality with those reported in existing literature concerning the obese category of weight [1, 7–9]. These findings, supported by multiple studies, offer valuable insights into how the algorithm perceives these associations. The alignment of our study’s results with established literature enhances our confidence in the algorithm’s ability to accurately capture genuine physiological relationships related to obesity. A notable advantage of the algorithmic approach used in our study is its impartiality in identifying significant covariates. By systematically exploring numerous variables based on mathematical relationships, subjective influence from researchers is minimized. This enables the uncovering of nonlinear patterns, and the covariates can be ranked based on performance metrics, assessing the overall accuracy and reliability of the machine learning model in predicting obesity [10]. SHAP visualizations further aid researchers in comparing their own understanding of variable relationships with the machine learning model’s assessment, allowing for the testing of physiological plausibility. These visualizations provide a valuable tool for validating and comprehending the associations identified by the algorithm, enhancing the interpretability and applicability of the model’s predictions [8].

The study inherits the advantages and limitations associated with cross-sectional multistage survey questionnaire studies. These surveys use multistage sampling techniques, ensuring a representative sample and enabling generalizability to the broader population. However, they offer only a snapshot at a single point in time, restricting the ability to establish causality or temporal sequences. Despite their cost-effectiveness and efficiency in gathering data from a large number of participants within a relatively short period, there is a risk of recall and response bias. Therefore, it is vital to take these limitations into account when interpreting findings from cross-sectional multistage survey questionnaire data. Additionally, several significant variables, such as lead levels and obesity, may not appear to be causally linked, and further research is needed to determine whether these are correlations driven by a third variable (for lead levels and obesity, socioeconomic status) or whether the relationship may be causal. Further studies are needed to confirm these connections.

Conclusion

The artificial intelligence algorithm predicted obesity over and above random chance and uncovered associations and relationships in an understandable way for clinicians.

References

  1. Koh HE, Cao C, Mittendorfer B. Insulin Clearance in Obesity and Type 2 Diabetes. Int J Mol Sci. 2022;23(2). Epub 20220106. pmid:35054781; PubMed Central PMCID: PMC8776220.
  2. Cleveland LP, Grummon AH, Konieczynski E, Mancini S, Rao A, Simon D, et al. Obesity prevention across the US: A review of state-level policies from 2009 to 2019. Obes Sci Pract. 2023;9(2):95–102. Epub 20220615. pmid:37034562; PubMed Central PMCID: PMC10073818.
  3. Jokela M, Laakasuo M. Obesity as a causal risk factor for depression: Systematic review and meta-analysis of Mendelian Randomization studies and implications for population mental health. J Psychiatr Res. 2023;163:86–92. Epub 20230504. pmid:37207436.
  4. Jordan K, Fawsitt CG, Carty PG, Clyne B, Teljeur C, Harrington P, et al. Cost-effectiveness of metabolic surgery for the treatment of type 2 diabetes and obesity: a systematic review of economic evaluations. Eur J Health Econ. 2023;24(4):575–90. Epub 20220722. pmid:35869383; PubMed Central PMCID: PMC10175448.
  5. Huang AA, Huang SY. Use of machine learning to identify risk factors for insomnia. PLoS One. 2023;18(4):e0282622. Epub 20230412. pmid:37043435; PubMed Central PMCID: PMC10096447.
  6. Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One. 2023;18(2):e0281922. Epub 20230223. pmid:36821544; PubMed Central PMCID: PMC9949629.
  7. Choi J, Joseph L, Pilote L. Obesity and C-reactive protein in various populations: a systematic review and meta-analysis. Obes Rev. 2013;14(3):232–44. Epub 20121122. pmid:23171381.
  8. Shin H, Shim S, Oh S. Machine learning-based predictive model for prevention of metabolic syndrome. PLoS One. 2023;18(6):e0286635. Epub 20230602. pmid:37267302; PubMed Central PMCID: PMC10237504.
  9. Tinkov AA, Filippini T, Ajsuvakova OP, Aaseth J, Gluhcheva YG, Ivanova JM, et al. The role of cadmium in obesity and diabetes. Sci Total Environ. 2017;601–602:741–55. Epub 20170601. pmid:28577409.
  10. Huang AA, Huang SY. Dendrogram of transparent feature importance machine learning statistics to classify associations for heart failure: A reanalysis of a retrospective cohort study of the Medical Information Mart for Intensive Care III (MIMIC-III) database. PLoS One. 2023;18(7):e0288819. Epub 20230720. pmid:37471315; PubMed Central PMCID: PMC10358877.