Development of the diabetes typology model for discerning Type 2 diabetes mellitus with national survey data

Objective: To classify individuals with diabetes mellitus (DM) into DM subtypes using population-based studies.
Design: Population-based survey.
Setting: The 2003–2004, 2005–2006, and 2009–2010 cycles of the National Health and Nutrition Examination Survey (NHANES), and the 2010 Coronary Artery Risk Development in Young Adults (CARDIA) survey (research materials obtained from the National Heart, Lung, and Blood Institute Biologic Specimen and Data Repository Information Coordinating Center).
Participants: 3084, 3040 and 3318 US adults from the 2003–2004, 2005–2006 and 2009–2010 NHANES samples respectively, and 5,115 US adults in the CARDIA cohort.
Primary outcome measures: We proposed the Diabetes Typology Model (DTM), which uses six composite measures based on the Homeostatic Model Assessment (HOMA-IR, HOMA-%β, HOMA-%S), insulin and glucose levels, and body mass index, and conducted latent class analyses to empirically classify individuals into different classes.
Results: Three empirical latent classes consistently emerged across studies (entropy = 0.81–0.998). These three classes were likely Type 1 DM, likely Type 2 DM, and atypical DM. The classification has high sensitivity (75.5%), specificity (83.3%), and positive predictive value (97.4%) when validated against C-peptide level. Compared to regression analysis on known correlates of Type 2 DM using all diabetes cases as outcomes, using the DTM to remove likely Type 1 DM and atypical DM cases results in a 2.5–5.3% r-square improvement in the regression analysis, as well as better model fit as indicated by a significant improvement in -2 log likelihood (p<0.01). Lastly, model-defined likely Type 2 DM was significantly associated with known correlates of Type 2 DM (e.g., age, waist circumference), which provides additional validation of the DTM-defined classes.
Conclusions: Our Diabetes Typology Model reflects a promising first step toward discerning likely DM types from population-based data. This novel tool will improve how large population-based studies can be used to examine behavioral and environmental factors associated with different types of DM.


Introduction
Diabetes mellitus (DM) is a public health concern in the US. It has been estimated that 9.3% of the US population (29.1 million) have DM; of those, 27.8% are undiagnosed [1]. DM was the seventh leading cause of death in the US in 2010, claiming 69,071 lives [1]. DM is a complex metabolic disorder that develops due to inadequate insulin production or ineffective insulin utilization by insulin target cells in muscle, fat and the liver. Patients with diabetes are typically classified as having Type 1 (T1DM), Type 2 (T2DM) or gestational diabetes based on the lack of insulin production, insulin resistance or insulin resistance during pregnancy, respectively. Although T2DM is most common, paradoxically some patients manifest symptoms of both T1DM and T2DM. Additionally, other rarer forms of diabetes occur because of specific genetic mutations and pancreatic disease owing to tissue insults from drugs and toxins. Although the incidence of T1DM is highest among children and young adults, it is an autoimmune disease that can manifest at any age [2]. Owing in part to the global obesity epidemic, the incidence of T2DM in children continues to increase, and minority youth are disproportionately affected [3][4][5].
A current challenge in diabetes research is to use population-based studies to estimate the prevalence of diabetes subtypes despite the imprecise classification of diabetes in these studies. Respondents are often asked whether they have ever been diagnosed with diabetes, but are rarely asked a follow-up question regarding DM type. Further, individuals with undiagnosed DM cannot provide information on diabetes subtype. Additionally, no large national surveys of adults have measured autoantibodies, which can be used to identify T1DM cases. In contrast, it is increasingly common for population-based studies to collect physiologic data, including blood glucose and insulin levels, that can be used to screen for diabetes and evaluate insulin resistance and sensitivity. Surrogate indicators for insulin resistance and sensitivity, as well as pancreatic β-cell function, can be derived from the fasting blood glucose and insulin levels that are commonly included in population-based studies. The homeostatic model assessments (HOMA) are well-recognized methods for estimating pancreatic β-cell function and how well insulin is utilized by its target cell populations. Specifically, HOMA-%β is a surrogate for pancreatic β-cell insulin production, HOMA-IR is a measure of insulin resistance, and HOMA-%S is a measure of insulin sensitivity [6,7]. While these surrogate indexes cannot be used to diagnose DM subtypes, it is possible that, coupled with anthropometric measures like body mass index (BMI), they can correctly classify individuals with DM, both diagnosed and undiagnosed, into their corresponding subtypes. No studies to date have leveraged these measures to perform DM subtype classification.
In this study, we proposed the Diabetes Typology Model (DTM), which aimed to classify individuals with DM into subtypes of DM using a latent class analysis approach. We then examined the sensitivity, specificity, and positive predictive value of this classification method. Lastly, we tested these model-defined classes using known correlates of T2DM to examine the construct validity of these classes in discerning different subtypes of DM. The proposed DTM will enable researchers to estimate the prevalence of the various subtypes of DM and to examine the behavioral and environmental factors associated with each subtype of DM in population-based studies.

Data sources
The NHANES is a serial cross-sectional health survey of a US representative sample of adults and children undertaken by the U.S. Centers for Disease Control and Prevention [8]. It includes a detailed survey component and a full medical examination using a mobile examination center. Data collected include self-reported demographic, social, health, and nutrition information; supplementary blood test results; and anthropometric measurements. NHANES includes Mexican American, non-Mexican Hispanic, non-Hispanic White, and non-Hispanic Black participants, as well as participants of other races.
The CARDIA study, which began in 1985-1986 with 5,115 adults aged 18-30, is a prospective longitudinal study evaluating the risk of developing heart disease over time among Black and White US adults [9]. The same cohort of respondents has been followed for eight waves through 2010 (3,450 respondents, ages 43-55) at intervals ranging from 1 to 5 years between waves. Like NHANES, CARDIA collects biological and survey data. The data we used are from Wave 8, collected in 2010. This was a secondary analysis of de-identified data and was therefore exempt from review by the institutional review board.

DTM Model Variables.
Due to the importance of the Homeostasis Model in discerning diabetes type [7], we included three measures based on this model: HOMA-IR, HOMA-%β, and HOMA-%S. Because insulin levels are central to these calculations, we omitted all respondents currently taking insulin, as the presence of exogenous insulin would affect the validity of the models. HOMA-IR estimates insulin resistance (IR, Eq 1). Higher HOMA-IR values reflect higher IR, where the body is producing enough insulin but the insulin produced is not effectively controlling blood glucose levels, a characteristic of T2DM [10]. A value of 3 indicates moderate insulin resistance and a value ≥5 indicates severe insulin resistance [10]. We used a cut point of 1.7 to classify respondents as having low HOMA-IR (0-1.7 = 1; else = 0).
HOMA-%S estimates insulin sensitivity (Eq 3). High insulin sensitivity indicates the body is utilizing insulin effectively. High insulin sensitivity values are uncommon in T2DM and are more common in T1DM. Respondents with a value of 65% or above were classified as having high HOMA-%S.
Past research has shown a strong relationship between being overweight and obese according to body mass index (BMI, Eq 4) and risk of T2DM [12,13]. Respondents were classified as low-normal BMI if their BMI was <25.
The remaining two measures focus on insulin: (1) low fasting insulin (0-5 μU/mL = 1; else = 0), which is uncommon in T2DM and frequently seen with T1DM and other types of diabetes; [14] and (2) high glucose to insulin ratio (G: I, Eq 5). Type 2-diabetics typically have relatively low G:I ratios because they have high insulin production relative to glucose levels. The opposite pattern is typically seen with T1DM. Respondents with values >20 were classified as having high G:I ratio.
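The indicator definitions above can be sketched in code. The following is a minimal illustration, not the authors' implementation: because Eqs 1 to 5 are not reproduced in this text, it assumes the widely used HOMA1 approximation formulas with fasting glucose in mg/dL and fasting insulin in μU/mL, and it assumes the G:I ratio is computed in those same units. The cut point for low HOMA-%β is not stated in the text, so it is left as an optional parameter. All function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def homa_ir(glucose_mg_dl: float, insulin_uu_ml: float) -> float:
    """HOMA1 insulin resistance (assumed formula: glucose x insulin / 405)."""
    return glucose_mg_dl * insulin_uu_ml / 405.0


def homa_beta(glucose_mg_dl: float, insulin_uu_ml: float) -> float:
    """HOMA1 beta-cell function in percent (assumed formula).

    Only meaningful when fasting glucose is above 63 mg/dL.
    """
    return 360.0 * insulin_uu_ml / (glucose_mg_dl - 63.0)


def homa_s(glucose_mg_dl: float, insulin_uu_ml: float) -> float:
    """HOMA1 insulin sensitivity in percent (reciprocal of HOMA-IR x 100)."""
    return 100.0 / homa_ir(glucose_mg_dl, insulin_uu_ml)


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


@dataclass
class DtmIndicators:
    low_homa_ir: int
    low_homa_beta: Optional[int]  # None when no cut point is supplied
    high_homa_s: int
    low_normal_bmi: int
    low_fasting_insulin: int
    high_gi_ratio: int


def dtm_indicators(
    glucose_mg_dl: float,
    insulin_uu_ml: float,
    weight_kg: float,
    height_m: float,
    low_beta_cutoff: Optional[float] = None,  # not given in the text
) -> DtmIndicators:
    """Dichotomize the six indicators using the cut points stated in the text."""
    low_beta = None
    if low_beta_cutoff is not None:
        low_beta = int(homa_beta(glucose_mg_dl, insulin_uu_ml) < low_beta_cutoff)
    return DtmIndicators(
        low_homa_ir=int(homa_ir(glucose_mg_dl, insulin_uu_ml) <= 1.7),
        low_homa_beta=low_beta,
        high_homa_s=int(homa_s(glucose_mg_dl, insulin_uu_ml) >= 65.0),
        low_normal_bmi=int(bmi(weight_kg, height_m) < 25.0),
        low_fasting_insulin=int(insulin_uu_ml <= 5.0),
        high_gi_ratio=int(glucose_mg_dl / insulin_uu_ml > 20.0),
    )
```

For example, a hypothetical respondent with fasting glucose of 126 mg/dL, fasting insulin of 4 μU/mL, weight 70 kg, and height 1.75 m would be flagged on every indicator for which a cut point is available, a pattern the text associates with T1DM rather than T2DM.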

Glucose to insulin ratio
Demographic and T2DM correlates. Both NHANES and CARDIA sample different racial/ethnic groups. For CARDIA data, we included both Black and White respondents. For NHANES data, we re-categorized racial/ethnic categories into Hispanic (of any origin), non-Hispanic Black, non-Hispanic White, and other. Other demographic variables included sex (male = 1; female = 0), age at interview (continuous), marital status (married = 1; else = 0), and level of education (high school or less, some college, or college or above). We included the following six dichotomous measures of known T2DM correlates in our analyses: high gender-specific waist circumference (female 35+ inches, male 40+ inches), severe insulin resistance (HOMA-IR = 5+), high triglycerides (200+), high total cholesterol (200+), high diastolic blood pressure (90+ mmHg), and high systolic blood pressure (140+ mmHg).

Statistical analysis
Our focal analyses involve latent class analyses (LCA) across four samples of diabetic respondents identified either through self-report (diagnosed) or by hemoglobin A1c (HbA1c) level (undiagnosed). LCA is a type of mixture model developed to explore heterogeneity within a population that is not directly observed. LCA aims to categorize individuals into mutually exclusive and exhaustive subpopulations (i.e., classes) based on their responses to a set of directly observed variables, so that each class presents a unique pattern of responses. The "best" model was chosen based on entropy, model fit, interpretability, and class sizes. Goodness of fit was tested using the Vuong-Lo-Mendell-Rubin Likelihood Ratio Test to examine whether a k-class model was a better fit than a k-1 class model; a significant (p<0.05) test result means the model with the higher number of classes is a better fit than the one with the lower number of classes. Six indicators (HOMA-IR, HOMA-%β, HOMA-%S, BMI, glucose to insulin ratio, and fasting insulin) were included in the LCA models. The analysis was conducted across four different samples to determine whether the best-fit model was replicated across samples. Mplus version 7.4 was used to conduct the latent class analysis. Class membership based on the "best" model was then assigned to each individual in each sample.
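To make the mechanics of LCA with dichotomous indicators concrete, the sketch below fits a latent class model by expectation-maximization and computes the relative entropy statistic commonly reported for such models. It is an illustration under stated assumptions, not a reproduction of the Mplus 7.4 analysis: it uses a single random start, does not implement the Vuong-Lo-Mendell-Rubin test, and the names `fit_lca` and `relative_entropy` are hypothetical.

```python
import numpy as np


def fit_lca(X, n_classes, n_iter=500, tol=1e-8, seed=0):
    """Fit a latent class model to binary indicators with EM.

    X is an (n, j) array of 0/1 values. Returns class weights, the
    (n_classes, j) matrix of item-endorsement probabilities, the posterior
    class probabilities from the final E-step, and the log-likelihood.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, j = X.shape
    weights = np.full(n_classes, 1.0 / n_classes)
    probs = rng.uniform(0.25, 0.75, size=(n_classes, j))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log joint probability of each response pattern and class,
        # assuming indicators are independent within a class.
        log_joint = (
            np.log(weights)
            + X @ np.log(probs).T
            + (1.0 - X) @ np.log(1.0 - probs).T
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        post = np.exp(log_joint - log_norm)
        ll = float(log_norm.sum())
        # M-step: update class sizes and item probabilities from posteriors.
        class_mass = np.clip(post.sum(axis=0), 1e-12, None)
        weights = class_mass / class_mass.sum()
        probs = np.clip((post.T @ X) / class_mass[:, None], 1e-6, 1.0 - 1e-6)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, probs, post, ll


def relative_entropy(post):
    """Relative entropy in [0, 1]; values near 1 mean crisp classification."""
    n, k = post.shape
    p = np.clip(post, 1e-12, 1.0)
    return 1.0 - float(-(p * np.log(p)).sum()) / (n * np.log(k))
```

Each respondent can then be assigned to the class with the highest posterior probability, for example with `post.argmax(axis=1)`. In practice, multiple random starts are advisable because EM can converge to a local optimum.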
To validate the empirically derived classes, we cross-tabulated class membership against whether respondents had C-peptide values consistent with LCA-defined classes. C-peptide quantifies endogenous insulin secretion [15,16]; it would be low in T1DM and normal/high in T2DM cases. C-peptide measurements were only available in the 2003-2004 NHANES data. We calculated the positive predictive value, sensitivity, and specificity of the DTM in this sample. We also examined whether excluding DTM-defined cases that were inconsistent with the physiological profile of T2DM would increase the model fit and variance explained when assessing the association between demographics and known T2DM correlates across studies. For these analyses, we compared logistic regression models predicting all diabetes with logistic regression models predicting DTM-defined likely T2DM. Model fit statistics (variance explained, pseudo r-square, -2 log likelihood) were examined.

Latent class analyses
Table 2 displays results of the latent class analyses. The percentages under each class represent the proportion of respondents in that particular class having a specific attribute. For example, in 2003-2004 NHANES, 100%, 1.5%, and 4.1% of respondents in Class 1, 2, and 3, respectively, had HOMA-IR lower than 1.7. We also present the size of each class and its proportion of all diabetes respondents in a given sample. Three of the four models had entropy values above 0.995, with the fourth (NHANES 2009-2010 sample) at 0.817, indicating high classification certainty. Both Vuong-Lo-Mendell-Rubin tests and Lo-Mendell-Rubin adjusted likelihood ratio tests showed that a three-class model fitted the data significantly better than a two-class model in all samples (P<0.0001), while a four-class model did not fit the data significantly better than a three-class model (P>0.05).
The profiles of these three classes were also consistent across samples. Fig 1 presents the percent of respondents in each class having each of the six indicators. Class 1 was characterized by uniformly low HOMA-IR, high prevalence of low HOMA-%β, high prevalence of high HOMA-%S, moderate-high prevalence of high G:I ratios, and high prevalence of low fasting insulin. The measurement profile of this class was consistent with T1DM. We named this class "likely T1DM". Class 2 was characterized by uniformly low prevalence of high HOMA-%S, low fasting insulin, and high G:I ratios, as well as low prevalence of low HOMA-IR and low prevalence of low or normal BMI. The measurement profile of this class was consistent with T2DM. We named this class "likely T2DM".
A consistent third class with a unique measurement profile also exists among individuals with DM. Class 3 had uniformly high prevalence of low HOMA-%β and low prevalence of high HOMA-%S, moderate-high prevalence of high G:I ratios, very low prevalence of low HOMA-IR, low prevalence of low or normal BMI, and low prevalence of low fasting insulin. Since the measurement profile of this class was inconsistent with either T1DM or T2DM, we named this class "atypical DM".

Validation analyses
The top part of Table 3 presents the validation analysis of DTM-defined T2DM against C-peptide. When cross-tabulating C-peptide (low vs. normal/high) and class membership (class 2 vs. other) among diabetics not taking insulin in the 2003-2004 NHANES data, the DTM has 75.5% sensitivity and 83.3% specificity for T2DM, indicating it identified 75.5% of all cases with normal/high C-peptide as likely to be T2DM and 83.3% of all cases with low C-peptide as unlikely to be T2DM (Table 3). More importantly, the positive predictive value of the DTM is 97.42%, which shows that 97.42% of the model-identified T2DM cases had normal or high C-peptide. This further demonstrates that the DTM can classify T2DM with a high level of certainty.
The bottom part of Table 3 presents variance explained and various model fit statistics from regression models with all DM cases versus regression models with only DTM-defined T2DM cases. When comparing the variance explained by known correlates of T2DM, we found that excluding DTM-defined unlikely T2DM cases (i.e., classes 1 and 3) increased the variance explained by known correlates in multiple logistic regression models. Specifically, variance explained was 19%-32% using all cases of diabetes in the regression models, and increased to 24%-35% (a 2.6%-5.3% increase in variance explained; Table 3). Similarly, model fit, as measured by -2 log likelihood, improved significantly after excluding unlikely T2DM cases.
Table 4 shows the associations between known T2DM correlates and DTM-defined T2DM (Class 2), after removing DTM-defined non-T2DM (classes 1 and 3). Significant positive associations were observed between Class 2 (vs. non-diabetics) and known T2DM risk factors: non-White race/ethnicity, older age, high gender-adjusted waist circumference, severe insulin resistance, and high triglycerides.
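The sensitivity, specificity, and positive predictive value reported for the C-peptide validation follow directly from a 2×2 cross-tabulation. The sketch below shows the arithmetic. The cell counts are illustrative only: they were chosen because they reproduce the reported percentages, and they are not taken from Table 3, which is not reproduced in this text. The function name is hypothetical.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Sensitivity, specificity, and PPV from 2x2 cell counts.

    Here "positive" means normal/high C-peptide (reference standard) and
    "test positive" means assignment to class 2 (likely T2DM).
    """
    return {
        "sensitivity": tp / (tp + fn),  # class 2 among normal/high C-peptide
        "specificity": tn / (tn + fp),  # not class 2 among low C-peptide
        "ppv": tp / (tp + fp),          # normal/high C-peptide among class 2
    }


# Illustrative counts only (assumption, not the study's actual table).
metrics = diagnostic_metrics(tp=151, fn=49, fp=4, tn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Note that the positive predictive value depends on how common normal/high C-peptide is in the validation sample, so it would differ in a sample with a different mix of DM subtypes.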

Discussion
Existing research on prevalence, incidence, and predictors of DM type across the life course is often limited to small clinical samples of diagnosed T1DM or T2DM cases [17][18][19], or samples restricted to one stage of the life course [20]. Population-based research can inform trends in DM prevalence and incidence among diagnosed and undiagnosed diabetics. However, the inability to classify diabetes subtype in population-based studies hinders researchers from tracking DM prevalence by subtype, and from using population-based studies to examine risk factors for each subtype of DM. Even though some studies, such as SEARCH for Diabetes in Youth [20,21], are able to classify diabetes subtype by testing a blood sample for autoantibodies, this ability comes with high costs and is not standard practice for adult population-based surveys without an express diabetes focus. It is expensive and logistically challenging to have blood drawn via venipuncture for all respondents in a large national sample. Additionally, sending phlebotomists to individuals' homes is costly and cumbersome. Alternatively, asking individuals to visit a clinic for a blood draw is likely to result in a low response rate, particularly among those with limited access to medical care, including residents of rural areas and individuals of low socioeconomic status. Due to the challenges of collecting blood samples by venipuncture, some nationally representative studies have shifted toward blood collection via blood spots, especially for glucose and HbA1c testing [22].
Given the utility of a method to sort population-based samples of DM by subtype, we sought to create a model that could use multiple sources of biological and anthropometric data from population-based studies to meet this goal. Moreover, we sought measures that would yield valid indicators from blood samples collected by either venipuncture or blood spot methods. As demonstrated here, the DTM, based on an expanded set of variables and measures of the Homeostatic Model, is capable of meeting this goal. Model fit statistics showed that the DTM assigns individuals with diabetes to three classes (likely T1DM, likely T2DM, atypical DM) with high certainty. Our validation analyses also showed that the DTM has high positive predictive value, sensitivity, and specificity in identifying DM cases that are consistent with the physiological and anthropometric profile of T2DM among individuals with diabetes. We also showed that, after excluding model-defined classes 1 and 3, known T2DM correlates (e.g., non-White race/ethnicity, older age, high gender-adjusted waist circumference, severe insulin resistance, and high triglycerides) explained more variance for class 2 (likely T2DM) than for unsorted, commingled DM cases. Given that the physiological measures used in the LCA can be obtained through blood spots, the DTM greatly increases our ability to separate DM types in large population-based studies, thus enhancing our ability to examine the risk factors for each type of DM outside of clinical studies.
Interestingly, the DTM consistently detects a third class of individuals with DM that does not fit the T1DM-T2DM dichotomy. The size of this class (10%-17% across the four samples) is non-trivial among all diabetics, and additional investigations are needed to determine the significance of this DTM-identified population. While the available data are insufficient to determine whether this group reflects a different subtype of DM, we suspect that it could be an atypical presentation of a variant or subtype of either T1DM or T2DM, e.g., Maturity Onset Diabetes of the Young (MODY) or secondary diabetes. However, we cannot conclusively determine this with our data. Nonetheless, the emergence of a clear third class is promising: it suggests the model functions as intended in distinguishing different subtypes of DM, which would make research on this atypical DM variant possible in future studies.
As the prevalence of all subtypes of DM increases and DM presents across the age spectrum, it will become increasingly important for population-based research to identify social and environmental precursors to the development of DM. The only information required to replicate our latent class analyses is respondent DM status (either self-reported or confirmed using HbA1c levels), body mass index (or both height and weight), and fasting glucose and fasting insulin levels, which can be collected by either venipuncture or blood spot. Thus, the simplicity of this model and the general availability of these measures in population-based studies employing select biological sample collections create a unique and important opportunity to begin more comprehensive research on the incidence, prevalence, and socio-environmental contexts of DM by subtype.
Despite the DTM's ability to classify DM subtypes in population-based studies, our analysis has several limitations. First, the DTM cannot be used in pre-diabetic or normoglycemic individuals. However, once the presence of diabetes has been established, this model can facilitate sorting of cases by likely diabetes subtype. Second, while we were able to replicate the model in four different national adult samples, the utility of the DTM in specific subpopulations is unknown. Future research needs to focus on refining this model and testing its accuracy across the age spectrum and across racial/ethnic groups. Third, we were only able to validate the DTM-defined likely T2DM classification with C-peptide in one of the four samples, since the measure was not available in the other three samples. Measures to validate DTM-defined likely T1DM and atypical DM were not available in NHANES and CARDIA. A previous analysis estimated the prevalence of T1DM based on age of DM diagnosis, age of insulin initiation, and current use of insulin [23]. However, with the increasing prevalence of late-onset T1DM and of early-onset T2DM resulting in early insulin use, the previous approach may misclassify DM subtypes and should not be used to validate DTM-defined likely T1DM cases. The DTM will benefit from additional validation using data from large samples of clinically confirmed T1DM and T2DM cases. This will allow for a more confident determination of the model's validity and its ability to predict T1DM and T2DM. Lastly, we did not have information about use of oral hypoglycemic agents in our samples. It is possible that some individuals were using these agents at the time of study. However, we did not find reports in the literature indicating that oral hypoglycemic agents change insulin levels or alter the HOMA indexes used in the DTM.
In conclusion, DTM is a novel tool for classifying DM subtypes in large population-based datasets, and it has great potential to improve how these vast datasets are used to examine behavioral and environmental factors associated with different types of DM. Potential discoveries using this tool can inform preventive clinical practice in the near future.