Fig 1.
Flowchart depicting the analytic workflow adopted in the study.
a: adjusted by resampling methods, including oversampling, under-sampling, Random Over-Sampling Examples (ROSE) and the synthetic minority oversampling technique (SMOTE).
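As a rough illustration of the simplest resampling step named above, the sketch below performs plain random oversampling: minority-class rows are duplicated at random until the classes are the same size. It is not the study's code. The function name `random_oversample` and the toy data are hypothetical, and ROSE and SMOTE (which generate synthetic rows rather than duplicates) are not implemented here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement; the majority class draws zero extra rows
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy training set (assumed data): 2 positives, 8 negatives
rows = [[float(i)] for i in range(10)]
labels = [1, 1] + [0] * 8
balanced_rows, balanced_labels = random_oversample(rows, labels, seed=42)
```

Resampling of this kind would normally be applied to the training data only, so that the validation data keep their original class balance.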
Fig 2.
Overlapped ROC curves demonstrating predictive performance of logistic regression models on internal validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 3.
Overlapped ROC curves demonstrating predictive performance of logistic regression models on external validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 4.
Overlapped ROC curves demonstrating predictive performance of random forest models on internal validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 5.
Overlapped ROC curves demonstrating predictive performance of random forest models on external validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 6.
Overlapped ROC curves demonstrating predictive performance of artificial neural network models on internal validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 7.
Overlapped ROC curves demonstrating predictive performance of artificial neural network models on external validation data.
Models were trained on the original, unbalanced training data and on training data restructured with oversampling, ROSE and SMOTE resampling.
Fig 8.
Variable importance plot of the best-performing random forest model produced by ROSE resampling.
BMXWAIST = waist circumference; RIDAGEYR = age; BMXBMI = body mass index; WHQ150 = age when heaviest weight; BMXLEG = upper leg length; BMXARMC = arm circumference; BMXWT = weight; WHD050 = self-reported weight 1 year ago; WHD020 = current self-reported weight; WHD140 = self-reported greatest weight; BMXHT = standing height; carb = carbohydrate; caffeine = caffeine; INDFMPIR = income-poverty ratio; bcar = beta carotene; acar = alpha carotene; kcal = energy; dodecanoic = SFA 12:0 (Dodecanoic); copper = copper; atoc = vitamin E alpha tocopherol.
Fig 9.
Variable importance plot of the best-performing artificial neural network model produced by ROSE resampling.
BMXWAIST = waist circumference; BMXBMI = body mass index; RIDAGEYR = age; WHQ150 = age when heaviest weight; BMXARMC = arm circumference; BMXWT = weight; WHD050 = self-reported weight 1 year ago; WHD020 = current self-reported weight; BMXLEG = upper leg length; DMDEDUC2 = education level; WHD140 = self-reported greatest weight; HSD010 = self-rated general health; WHQ030 = How do you consider your weight?; PAQ650 = vigorous recreational activities; WHQ040 = Like to weigh more, less, or same?; DBD895 = number of meals not home prepared; BMXHT = standing height; PAQ665 = moderate recreational activities; DBD910 = number of frozen meals/pizzas in past 30 days; carb = carbohydrate.
Fig 10.
Variable importance plot of the best-performing artificial neural network model produced by SMOTE resampling.
BMXWAIST = waist circumference; BMXBMI = body mass index; RIDAGEYR = age; WHQ150 = age when heaviest weight; BMXARMC = arm circumference; BMXWT = weight; WHD050 = self-reported weight 1 year ago; WHD020 = current self-reported weight; BMXLEG = upper leg length; DMDEDUC2 = education level; WHD140 = self-reported greatest weight; HSD010 = self-rated general health; WHQ030 = How do you consider your weight?; PAQ650 = vigorous recreational activities; WHQ040 = Like to weigh more, less, or same?; DBD895 = number of meals not home prepared; BMXHT = standing height; PAQ665 = moderate recreational activities; DBD910 = number of frozen meals/pizzas in past 30 days; carb = carbohydrate.
Table 1.
Nutritional and other markers of undiagnosed type 2 diabetes identified by the best-performing logistic model (AUC = 75.7%).
Table 2.
Nutritional and other markers of undiagnosed type 2 diabetes identified by the best-performing artificial neural network (ANN) and random forest (RF) models.
Fig 11.
Benchmarking with the ADA diabetes risk test.
On internal validation data, the difference in predictive performance between the ADA diabetes risk test (AUC = 0.737028) and the best-performing predictive model (AUC = 0.7566544) was non-significant according to the DeLong test for comparing two ROC curves (p = 0.3201), indicating comparable performance. On external validation data, the difference between the ADA diabetes risk test (AUC = 0.7401352) and the best-performing predictive model (AUC = 0.7464) was also non-significant according to the DeLong test (p = 0.0643).
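For readers unfamiliar with the DeLong comparison reported above, the following is a minimal sketch of the test for two correlated ROC curves, meaning two sets of scores evaluated on the same subjects. It is not the study's code. The function name `delong_test` and the toy labels and scores are hypothetical, and the pairwise comparison used here is quadratic in sample size, so it only suits small data sets.

```python
import math


def _psi(x, y):
    # Pairwise kernel: 1 if the positive outranks the negative, 0.5 on ties
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos, neg):
    # Structural components: one value per positive and one per negative
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    # Sample covariance with an n - 1 denominator
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def delong_test(y_true, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models scored on the
    same subjects (hypothetical helper; needs >= 2 cases per class)."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    m, n = len(pos), len(neg)
    va10, va01 = _components([scores_a[i] for i in pos], [scores_a[i] for i in neg])
    vb10, vb01 = _components([scores_b[i] for i in pos], [scores_b[i] for i in neg])
    auc_a, auc_b = sum(va10) / m, sum(vb10) / m
    # Variance of the AUC difference from the component covariances
    var = (_cov(va10, va10) + _cov(vb10, vb10) - 2 * _cov(va10, vb10)) / m \
        + (_cov(va01, va01) + _cov(vb01, vb01) - 2 * _cov(va01, vb01)) / n
    if var <= 0:
        raise ValueError("zero variance: the two score sets rank subjects identically")
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p


# Toy example (assumed data): 2 negatives followed by 2 positives
y = [0, 0, 1, 1]
auc_a, auc_b, z, p = delong_test(y, [0.1, 0.4, 0.35, 0.8], [0.1, 0.2, 0.3, 0.9])
```

A large p-value from such a test means only that no difference in AUC was detected. It does not by itself establish that the two models are equivalent.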
Table 3.
Creation of variables analogous to those in the American Diabetes Association (ADA) diabetes risk test using National Health and Nutrition Examination Survey (NHANES) data.
Table 4.
Performance comparison of the ADA diabetes risk test versus the best-performing model on NHANES data.