Characterizing clinical pediatric obesity subtypes using electronic health record data

In this work, we present a study of electronic health record (EHR) data that aims to identify clinical subtypes of pediatric obesity. Specifically, we examine whether certain temporal condition patterns associated with childhood obesity incidence tend to cluster together to characterize subtypes of clinically similar patients. In a previous study, the sequence mining algorithm SPADE was applied to EHR data from a large retrospective cohort (n = 49 594 patients) to identify common condition trajectories surrounding pediatric obesity incidence. In this study, we used Latent Class Analysis (LCA) to identify potential subtypes formed by these temporal condition patterns. The demographic characteristics of patients in each subtype are also examined. An LCA model with 8 classes was developed that identified clinically similar patient subtypes. Patients in Class 1 had a high prevalence of respiratory and sleep disorders, patients in Class 2 had high rates of inflammatory skin conditions, patients in Class 3 had a high prevalence of seizure disorders, and patients in Class 4 had a high prevalence of asthma. Patients in Class 5 lacked a clear characteristic morbidity pattern, and patients in Classes 6, 7, and 8 had a high prevalence of gastrointestinal issues, neurodevelopmental disorders, and physical symptoms, respectively. Subjects generally had a high membership probability for a single class (>70%), suggesting shared clinical characterization within the individual groups. Using a Latent Class Analysis approach, we identified patient subtypes from temporal condition patterns that are significantly more common among obese pediatric patients. Our findings may be used to characterize the prevalence of common conditions among newly obese pediatric patients and to identify pediatric obesity subtypes.
The identified subtypes align with prior knowledge on comorbidities associated with childhood obesity, including gastro-intestinal, dermatologic, developmental, and sleep disorders, as well as asthma.


Introduction
Approximately one third of children in the United States are overweight (age- and sex-specific body mass index (BMI) greater than or equal to the 85th percentile per Centers for Disease Control and Prevention (CDC) growth charts) or obese (age- and sex-specific BMI greater than or equal to the 95th percentile per CDC growth charts). [1,2] Obesity is linked with an increased risk of developing multiple comorbidities, including asthma, diabetes, hypertension, and psychological conditions, among pediatric patients during childhood and later in life. [3,4] Pediatric obesity is a socially significant health issue that disproportionately impacts American Indian, African American, and Latino children, compared to non-Hispanic whites. [5,6] Obesity prevalence is also higher among low-income, rural, or less-educated population subtypes. [5,7] Despite its prevalence and social importance, it remains uncertain whether childhood obesity represents a single condition or is composed of unique phenotypes with possibly different underlying causes. Grouping all types of overweight and obesity into one clinical condition may conceal associations between risk factors and specific subtypes of obesity, which has implications for improving prevention, recognition, and treatment of pediatric obesity. [8] Analyzing large healthcare datasets may uncover previously unknown relationships concerning diagnoses, event patterns, and outcomes in healthcare, such as the presence of childhood obesity subtypes.
In recent years, use of such datasets in healthcare has increased. [9] Data sources include electronic health records (EHRs), medical imaging, wearable devices, genome sequencing, and payer records, among others. Data mining methods for pattern discovery and extraction form a core set of methods to facilitate knowledge discovery from large healthcare datasets. [10,11] Data mining in healthcare may be used for numerous purposes, including evaluating diagnostic outcomes, uncovering comorbidity and clinical event patterns, and detecting fraud and abuse. [12,13] We present an investigation of EHR data to identify clinically similar subtypes among a population of newly obese pediatric patients. We examine whether certain temporal condition patterns associated with childhood obesity incidence tend to cluster together to characterize subtypes of clinically similar patients, and we describe the demographic characteristics of these patient subtypes. Specifically, we address the following:
1. Do temporal condition patterns associated with childhood obesity form clinically meaningful subtypes?
2. What are the demographic characteristics of patients within these subtypes?
3. How might associations in diagnostic clustering and demographic subtypes be used to advance clinical and public health pediatric obesity research?

Materials and methods
Data for this study were derived from the Pediatric Big Data (PBD) resource at the Children's Hospital of Philadelphia (CHOP), a pediatric tertiary academic medical center. The PBD resource includes clinical data collected from CHOP, the CHOP Care Network (a primary care network of over 30 sites), and CHOP Specialty Care and Surgical Centers. Both clinical and non-clinical observations (as defined by Observational Health Data Sciences and Informatics (OHDSI) condition domain standards) from a patient's EHR are included in the PBD database. [14] The PBD resource contains health-related information, including demographic, encounter, medication, procedure, and measurement (e.g. vital signs, laboratory results) elements for a large, unselected population of children. Non-study personnel extracted all data from the EHR and removed protected health information (PHI) identifiers, with the exception of dates, prior to transfer to the study database. Date information was removed from the analysis dataset as described below. The CHOP Institutional Review Board approved this study and waived the requirement for consent.

Temporal condition patterns
In a previous study, [15] we applied a sequential pattern mining algorithm to a large retrospective cohort of patients (n = 49 694) from CHOP to identify common condition trajectories surrounding pediatric obesity incidence. This analysis used the CDC definition of childhood obesity (BMI z-score at or above the 95th percentile for age and sex). [16,17] Patients had at least one obesity measurement during a CHOP primary care visit and at least one visit prior to the first obesity measurement where an obese BMI was not recorded. The BMI z-scores were centrally calculated in this analysis. The same definition of obesity was used across study sites for the entire study period. Campbell et al. include a full study diagram detailing how the inclusion criteria were applied to obtain the study population. [15] EHR data from the healthcare visit in which an obese BMI was first recorded (the index visit), as well as from the visits immediately before (pre-index visit) and after (post-index visit), were compiled for analysis. The presence of a pre-index visit was required for study inclusion to ensure that patients new to the CHOP healthcare system were not already obese. However, the presence of a post-index visit was not required for inclusion. Approximately two thirds of patients (67.6%) had a post-index visit.
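The visit-selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Visit` record, its field names, and the simplification that every visit carries a BMI percentile are hypothetical and are not taken from the study's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

OBESITY_PERCENTILE = 95.0  # CDC threshold cited in the text


@dataclass
class Visit:
    """Hypothetical visit record; field names are illustrative only."""
    visit_date: date
    bmi_percentile: float  # age- and sex-specific BMI percentile


def select_visits(
    visits: List[Visit],
) -> Optional[Tuple[Visit, Visit, Optional[Visit]]]:
    """Return (pre-index, index, post-index) visits, or None if ineligible.

    The index visit is the first visit with an obese BMI. A pre-index
    visit is required; a post-index visit is optional.
    """
    ordered = sorted(visits, key=lambda v: v.visit_date)
    for i, visit in enumerate(ordered):
        if visit.bmi_percentile >= OBESITY_PERCENTILE:
            if i == 0:
                return None  # no earlier non-obese visit on record
            post = ordered[i + 1] if i + 1 < len(ordered) else None
            return ordered[i - 1], visit, post
    return None  # never recorded as obese


# Toy example with made-up values.
example = [
    Visit(date(2015, 3, 1), 97.0),
    Visit(date(2014, 1, 10), 80.0),
    Visit(date(2016, 6, 5), 96.0),
]
pre, index, post = select_visits(example)
```

Because the index visit is the first obese measurement in date order, every earlier visit is non-obese by construction, which mirrors the inclusion requirement stated above.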
The SPADE algorithm [18] was used to discover frequent temporal patterns among pre, index, and post visits in the study cohort. SPADE is a sequential pattern mining algorithm that finds frequent subsequence patterns from a larger sequence through an Apriori-based candidate generation method. [19] SPADE identified 163 condition patterns that were present in at least 1% of case patients. An example pattern is "1-ALL04, 2-EAR01" (a diagnosis of asthma in the pre-index visit, followed by a diagnosis of otitis media in the index visit). A control population of patients with a healthy BMI matched on age, prior healthcare visits, and sex was obtained and analyzed. We then examined prevalence in the control population of the common patterns identified among case patients. McNemar's test results indicated that 80 of the 163 patterns were significantly more common among case patients (p<0.05). [15]
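To make the case-control comparison concrete, the sketch below computes McNemar's test for a single pattern from the two discordant counts. It assumes 1:1 matched pairs and the uncorrected chi-square form of the statistic; the text does not state which variant was used, and the counts shown are invented for illustration.

```python
import math
from typing import Tuple


def mcnemar_test(case_only: int, control_only: int) -> Tuple[float, float]:
    """McNemar's chi-square test (1 degree of freedom, no continuity correction).

    case_only: matched pairs where only the case patient has the pattern.
    control_only: matched pairs where only the control patient has the pattern.
    Returns (statistic, p_value).
    """
    discordant = case_only + control_only
    if discordant == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    statistic = (case_only - control_only) ** 2 / discordant
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical discordant counts for one temporal pattern.
statistic, p_value = mcnemar_test(case_only=30, control_only=12)
is_more_common_in_cases = p_value < 0.05 and 30 > 12
```

Only the discordant pairs enter the statistic; pairs in which both or neither patient has the pattern carry no information about the difference between cases and controls.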

Latent class analysis
The current study builds on results from Campbell et al. and utilizes the same study population and temporal diagnoses that were previously identified. In this study, latent class analysis (LCA) [20] was used to identify potential subtypes formed by the diagnoses in temporal condition patterns that were significantly more common among obese pediatric patients. The assumption of conditional independence that underlies LCA is violated through the inclusion of both super-sequences and their frequent subsequences because each super-sequence is the intersection of its frequent subsequences. Therefore, only the frequent subsequences (i.e. individual temporal diagnoses) were included in our LCA modelling efforts. A total of 37 temporal diagnoses were evaluated to create patient subtypes in the LCA; these are listed in Table 1 in the Supplemental section. Each subsequence was treated as an individual binary feature in the dataset used for the LCA, coded as 1 if a patient had this temporal diagnosis and 0 otherwise.
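The binary coding described above amounts to a patient-by-feature indicator matrix. A minimal Python sketch follows; the labels `1-ALL04` and `2-EAR01` reuse the notation from the pattern example given earlier, while `3-HYPO99` and the patient data are hypothetical placeholders.

```python
from typing import Iterable, List, Sequence


def encode_patients(
    patient_diagnoses: Sequence[Iterable[str]],
    feature_names: Sequence[str],
) -> List[List[int]]:
    """Build a 0/1 matrix with one row per patient and one column per feature.

    Labels that are not in feature_names are ignored.
    """
    column = {name: j for j, name in enumerate(feature_names)}
    matrix = []
    for diagnoses in patient_diagnoses:
        row = [0] * len(feature_names)
        for label in diagnoses:
            if label in column:
                row[column[label]] = 1
        matrix.append(row)
    return matrix


# Illustrative feature list; the study evaluated 37 temporal diagnoses.
feature_names = ["1-ALL04", "2-EAR01", "3-HYPO99"]  # last label is made up
patients = [{"1-ALL04", "2-EAR01"}, set(), {"3-HYPO99"}]
matrix = encode_patients(patients, feature_names)
```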
LCA requires specification of the number of classes as a user-selected parameter. Prior research on chronic diseases, including asthma, [21] diabetes, [22] and adult obesity, [23] has indicated that there are typically 4-5 subtypes for these diseases. Input from clinician collaborators on this study suggested 8 classes as the maximum number that would be manageable and useful in care provision. Therefore, to obtain a clinically meaningful and interpretable number of patient subtypes, we elected to constrain our LCA evaluation to models with 3-8 classes.
The Akaike information criterion (AIC) and Bayesian information criterion (BIC) [24] were used to evaluate goodness-of-fit for each of the models tested.
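For reference, both criteria can be computed from a fitted model's maximized log-likelihood using their standard definitions. The sketch below assumes an unconstrained LCA with binary items, for which the free-parameter count is (K - 1) class proportions plus K x J item probabilities; the function names are illustrative.

```python
import math


def lca_parameter_count(n_classes: int, n_items: int) -> int:
    """Free parameters of an unconstrained LCA with binary items."""
    return (n_classes - 1) + n_classes * n_items


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# An 8-class model over 37 binary temporal diagnoses.
n_params = lca_parameter_count(n_classes=8, n_items=37)
```

Lower values indicate a better trade-off between fit and complexity. Because ln(n) exceeds 2 for any realistic sample size, BIC penalizes additional classes more heavily than AIC.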
R Version 3.6.1 was used for all data analysis in this study, [25] and the poLCA package [26] was used for the latent class modelling.
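To illustrate what a latent class model estimates, the following is a minimal pure-Python expectation-maximization (EM) sketch for binary items. It is not poLCA and is not the study's code: the function name, the fixed iteration count, the random initialization, and the toy data are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def fit_lca(
    data: List[List[int]],
    n_classes: int,
    n_iter: int = 100,
    seed: int = 0,
    eps: float = 1e-6,
) -> Tuple[List[float], List[List[float]], List[List[float]], float]:
    """Fit a binary-item latent class model with a fixed number of EM steps.

    Returns class weights, per-class item probabilities, per-patient
    posterior membership probabilities, and the log-likelihood from the
    final E-step. Assumes n_iter >= 1.
    """
    rng = random.Random(seed)
    n_obs, n_items = len(data), len(data[0])
    weights = [1.0 / n_classes] * n_classes
    item_probs = [
        [rng.uniform(0.25, 0.75) for _ in range(n_items)]
        for _ in range(n_classes)
    ]
    posteriors: List[List[float]] = []
    log_likelihood = float("-inf")

    for step in range(n_iter):
        # E-step: posterior class probabilities under conditional independence.
        posteriors = []
        log_likelihood = 0.0
        for row in data:
            log_joint = []
            for k in range(n_classes):
                value = math.log(weights[k])
                for j, x in enumerate(row):
                    p = item_probs[k][j]
                    value += math.log(p if x == 1 else 1.0 - p)
                log_joint.append(value)
            top = max(log_joint)  # log-sum-exp trick for numerical stability
            scaled = [math.exp(v - top) for v in log_joint]
            total = sum(scaled)
            posteriors.append([s / total for s in scaled])
            log_likelihood += top + math.log(total)
        if step == n_iter - 1:
            break  # keep posteriors consistent with the returned parameters

        # M-step: re-estimate weights and item probabilities.
        for k in range(n_classes):
            class_size = max(sum(post[k] for post in posteriors), eps)
            weights[k] = class_size / n_obs
            for j in range(n_items):
                expected = sum(
                    post[k] * row[j] for post, row in zip(posteriors, data)
                )
                # Clamp away from 0 and 1 so the logarithms stay finite.
                item_probs[k][j] = min(max(expected / class_size, eps), 1.0 - eps)

    return weights, item_probs, posteriors, log_likelihood


# Toy data: two clearly separated response patterns.
toy_data = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
weights, item_probs, posteriors, log_likelihood = fit_lca(toy_data, n_classes=2)
```

EM can converge to local optima, so practical analyses typically refit from several random starts and keep the solution with the highest log-likelihood; the sketch omits this for brevity.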

Demographic subtype analysis
The LCA model assigns a probability of membership for each subtype (class) for a given individual. To facilitate analysis, using the final clustering model, each patient was assigned to the group for which they had the highest probability of membership. The high-prevalence diagnoses within each LCA-identified subtype, defined as those with ≥ 10% prevalence among patients, were used to clinically describe and name the subtypes. Finally, demographic information from patients' EHR was incorporated to describe the patient subtypes. The demographic variables considered were sex, race, Medicaid enrollment (a proxy for socioeconomic status at the time of obesity incidence), [27,28] age at index visit (with age evaluated as both a continuous and categorical variable), and Philadelphia residence. Patients were classified as Hispanic if their self-identified ethnicity was specified as Hispanic or Latino; otherwise they were categorized by the value of their self-identified race in the EHR. Patients with missing race and ethnicity information were classified as unknown. If patients used multiple insurance types during their index visit, they were classified as being enrolled in Medicaid if one of those insurance plans was Medicaid or the Children's Health Insurance Program (CHIP), Pennsylvania's state program to provide health insurance to uninsured children and teens who are ineligible for or not enrolled in Medicaid. [29] If a patient did not have insurance information recorded for their index visit, all insurance information for the patient's visits within a year of the index visit was obtained from the PBD database and analyzed. If patients had a record of Medicaid/CHIP enrollment within a year of their index visit, then they were classified in the Medicaid/CHIP enrollment category (Medicaid/CHIP eligibility is assessed annually). [30] One hundred patients did not have any insurance information for a visit within a year of the index visit and were dropped, leaving a total study population of 49,594 patients.
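The insurance classification rule above can be expressed as a small decision function. The sketch below is an illustration only: the plan labels and the function name are hypothetical, and the one-year window is assumed to have been applied upstream when collecting `plans_within_year`.

```python
from typing import Iterable, Optional

# Hypothetical plan labels; the actual EHR coding is not given in the text.
MEDICAID_LIKE_PLANS = {"Medicaid", "CHIP"}


def classify_medicaid(
    index_visit_plans: Iterable[str],
    plans_within_year: Iterable[str],
) -> Optional[bool]:
    """Return True/False for Medicaid/CHIP enrollment, or None to exclude.

    Index-visit insurance takes precedence. Plans from visits within a
    year of the index visit are consulted only when the index visit has
    no insurance recorded. None marks a patient with no usable insurance
    information, who would be dropped from the analysis.
    """
    index_plans = list(index_visit_plans)
    if index_plans:
        return any(plan in MEDICAID_LIKE_PLANS for plan in index_plans)
    nearby_plans = list(plans_within_year)
    if nearby_plans:
        return any(plan in MEDICAID_LIKE_PLANS for plan in nearby_plans)
    return None


# Toy cases with made-up plan names.
mixed = classify_medicaid(["Commercial", "CHIP"], [])
fallback = classify_medicaid([], ["Medicaid"])
missing = classify_medicaid([], [])
```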
The frequency of categorical demographic variables (sex, race, Medicaid enrollment, age at index visit, and Philadelphia residence) and mean and standard deviation (SD) of continuous variables (mean age at index visit) were provided overall and for each subtype.

Code availability
The code used for data processing and analysis in this study may be found at: https://github.com/chop-dbhi/masino-lab-obesity-incidence.

Results
Table 1 presents the AIC and BIC values for each model, as well as the percent reduction in each criterion between models. As the number of classes in the latent class models increased, AIC and BIC values both declined. The model with 8 latent classes had both the lowest AIC and BIC values, and was the final model selected to study clinical subtypes among the study population. Table 2 shows the prevalence rates for all 37 temporal diagnoses evaluated in the LCA among the 8 classes and the total study population. The high-prevalence diagnoses (≥ 10% prevalence) used to characterize the eight subtypes are highlighted.
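The model comparison reported in Table 1 can be reproduced from a mapping of class counts to criterion values. The sketch below assumes the percent reduction is computed relative to the model with one fewer class; the numbers are invented placeholders and are not the values from Table 1.

```python
from typing import Dict, Tuple


def percent_reduction(previous: float, current: float) -> float:
    """Percent decrease in a criterion relative to the previous model."""
    return 100.0 * (previous - current) / previous


def summarize_criterion(values: Dict[int, float]) -> Tuple[Dict[int, float], int]:
    """Return per-model percent reductions and the class count with the lowest value."""
    classes = sorted(values)
    reductions = {
        current: percent_reduction(values[previous], values[current])
        for previous, current in zip(classes, classes[1:])
    }
    best = min(classes, key=lambda k: values[k])
    return reductions, best


# Placeholder BIC values for models with 3 to 8 classes (not study results).
hypothetical_bic = {3: 1000.0, 4: 950.0, 5: 920.0, 6: 905.0, 7: 900.0, 8: 898.0}
reductions, best = summarize_criterion(hypothetical_bic)
```

A shrinking percent reduction between successive models is one informal signal that additional classes are yielding diminishing returns, even when the criterion itself is still decreasing.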

Clinical subtypes
The high-prevalence diagnoses for the LCA-derived classes were aggregated and evaluated to characterize each subtype (Table 3). Patients in Class 1 had a high prevalence of upper respiratory and sleep disorders, including sleep apnea and chronic pharyngitis and tonsillitis. Inflammatory skin conditions (i.e. dermatitis and eczema) were common among patients in Class 2. Seizures and other neurological disorders were prevalent among patients in Class 3, as was asthma among patients in Class 4. No condition pattern was prevalent at a rate at or above 5% among patients in Class 5; thus, patients in this subtype were characterized by their lack of a clear morbidity pattern (which also may include patients without any of the temporal condition patterns). Gastrointestinal and genitourinary symptoms were common among patients in Class 6, and neurodevelopmental disorders (such as Autism) were prevalent among patients in Class 7. Finally, patients in Class 8 had a high prevalence of physical symptoms including headaches, fever, and nausea/vomiting.
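The characterization step (modal class assignment followed by a prevalence threshold) can be sketched as follows. The 10% threshold comes from the Methods; the function names and toy inputs are hypothetical.

```python
from typing import List


def assign_modal_class(posteriors: List[List[float]]) -> List[int]:
    """Assign each patient to the class with the highest membership probability."""
    return [row.index(max(row)) for row in posteriors]


def class_prevalence(
    features: List[List[int]], assigned: List[int], n_classes: int
) -> List[List[float]]:
    """Prevalence of each binary feature within each assigned class."""
    n_features = len(features[0])
    table = []
    for c in range(n_classes):
        members = [row for row, a in zip(features, assigned) if a == c]
        if not members:
            table.append([0.0] * n_features)  # empty class: nothing to report
            continue
        table.append(
            [sum(row[j] for row in members) / len(members) for j in range(n_features)]
        )
    return table


def high_prevalence(table: List[List[float]], threshold: float = 0.10) -> List[List[int]]:
    """Indices of features at or above the prevalence threshold in each class."""
    return [[j for j, p in enumerate(row) if p >= threshold] for row in table]


# Toy inputs: three patients, two classes, two features.
toy_posteriors = [[0.8, 0.2], [0.3, 0.7], [0.9, 0.1]]
toy_features = [[1, 0], [0, 1], [1, 1]]
assigned = assign_modal_class(toy_posteriors)
prevalence = class_prevalence(toy_features, assigned, n_classes=2)
characteristic = high_prevalence(prevalence)
```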
The mean and standard deviation of the probability that patients belong to their assigned subtype, as well as their mean probability of membership in the other classes, are presented in Table 4. The mean probability of membership in the assigned class ranged from 70.21% (patients in Class 8) to 89.7% (patients in Class 1). The mean probability of membership in other classes (those to which patients in each group were not assigned) ranged from 0.09% (the probability of patients in Class 3 belonging to Class 2 or Class 4) to 18.8% (patients in Class 5 belonging to Class 8).
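A table of this kind can be derived directly from the posterior membership probabilities. The sketch below is a hypothetical illustration with toy values: rows are indexed by assigned class, columns by class, and the diagonal holds the mean probability for the assigned class.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def membership_table(
    posteriors: List[List[float]],
) -> Tuple[Dict[int, List[float]], Dict[int, float]]:
    """Summarize posterior probabilities by assigned (modal) class.

    Returns (means, sds): means[c][k] is the mean probability of class k
    among patients assigned to class c; sds[c] is the sample standard
    deviation of the assigned-class probability (0.0 for a single member).
    """
    n_classes = len(posteriors[0])
    assigned = [row.index(max(row)) for row in posteriors]
    means: Dict[int, List[float]] = {}
    sds: Dict[int, float] = {}
    for c in range(n_classes):
        members = [row for row, a in zip(posteriors, assigned) if a == c]
        if not members:
            continue  # no patient was assigned to this class
        means[c] = [mean([row[k] for row in members]) for k in range(n_classes)]
        own = [row[c] for row in members]
        sds[c] = stdev(own) if len(own) > 1 else 0.0
    return means, sds


# Toy posteriors for three patients and two classes.
toy = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
means, sds = membership_table(toy)
```

High diagonal means with low off-diagonal means indicate that modal assignment discards little information, which is the pattern the text reports.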

Demographic analysis
Demographic data obtained from patients' EHRs were used to characterize the total study population (Table 5) and patient subtypes. Of the patients, 30.5% were between two and four years old, 42.9% were between five and eleven years, and 26.6% were between twelve and eighteen years. The mean age of patients at the time of the index visit was 8.5 years. The demographic analysis of patients in the LCA-derived classes indicated that females tended to have higher rates of gastrointestinal issues; they comprised a majority of patients in Class 6. Males had a much higher prevalence of neurodevelopmental disorders: more than three quarters (77.6%) of patients in Class 7 were male. More than half of patients in Class 4 (Asthma) were African American (52.8%), despite African American patients comprising approximately one third of the total study population. African American patients comprised similarly high proportions of patients in Class 2 (45.2%) and Class 8 (41.4%), which were characterized by a high prevalence of inflammatory skin conditions and physical symptoms, respectively. Class 2, Class 4, and Class 8 also had high proportions of urban youth compared to the total study population. Medicaid enrollment was higher among patients in Class 3 (Seizure Disorders and Epilepsy), Class 4 (Asthma), and Class 7 (Neurodevelopmental Disorders) than among the total study population. Newly obese patients with neurodevelopmental disorders (Class 7) tended to be younger, suggesting that obesity may occur earlier among patients with autism and developmental disorders. Finally, seizure disorders and epilepsy had higher prevalence among Hispanics; 12.2% of patients in Class 3 were Hispanic, compared to 8.8% of the total study population.

Pediatric obesity subclasses
This study utilized temporal condition patterns surrounding the time of obesity incidence that were previously identified as occurring at a significantly higher rate among obese pediatric patients compared to matched pediatric patients with a healthy BMI. These temporal condition patterns were used as input features to develop an LCA model with eight classes. Patients were assigned to the class for which they had the highest probability of membership. The mean probability of membership to the assigned class was high (>70% for each class) and the mean probability of belonging to a different class was low (<20%), suggesting a shared clinical characterization within the individual groups. The common condition patterns that occurred at a high prevalence rate (≥ 10% prevalence) among patients in each subtype were used to characterize the eight LCA-derived classes. Finally, the demographic characteristics of patients in each class were analyzed. Our findings reflect extant literature on known comorbidities associated with pediatric obesity, including sleep issues as well as dermatologic, endocrine, gastro-intestinal, neurologic, musculoskeletal, and psychosocial conditions. [3,31] While our findings reflect current clinical knowledge on known pediatric obesity comorbidities, the presence of these diagnoses at obesity onset suggests there may be a bi-directional relationship between the conditions. Additionally, there was a strong association between asthma and allergic rhinitis among patients in Class 4. Prior research has shown a strong association between pediatric obesity and asthma development, as well as the possibility that early-life asthma contributes to pediatric obesity onset. [32][33][34] However, the relationship between allergic rhinitis and obesity is less clear. Some studies did not find a strong association between allergic rhinitis and obesity. [35,36] Han, et al. [37] found that centrally obese children had reduced odds of developing allergic rhinitis, but Lei, et al. [38] found that pediatric patients with overweight and obesity had an increased risk of developing allergic rhinitis. Our results indicate that comorbid asthma may mediate the relationship between allergic rhinitis and pediatric obesity, which may guide future research and clinical care provision aimed at preventing obesity among pediatric patients with allergic rhinitis.
From a demographic standpoint, the characteristics of our subtypes both align with and challenge prior clinical knowledge. As in our study, African American, low-income, and urban patients are known to be disproportionately represented among patients with asthma and inflammatory skin conditions. [39][40][41] Our study found a strong association between male sex and neurodevelopmental disorders (such as Autism Spectrum Disorder). While the link between autism and pediatric obesity is well established, [42] the strong link between sex, obesity, and neurodevelopmental disorders present in this study has not been seen in others. [43][44][45] Additionally, prior research has shown an equal distribution between sexes or male preponderance for gastro-intestinal conditions, [46][47][48] while in our study, a majority of patients in the Gastrointestinal/Genitourinary Symptoms subclass were female. This suggests that obesity may influence the association between sex and gastrointestinal disorder prevalence.

Strengths and limitations
This study presents a data-driven approach to uncover pediatric obesity subtypes from a large electronic health record dataset. The study used a high volume of data from a randomly sampled population, which supports its goal of identifying possible obesity subtypes without assuming an a priori hypothesis. Additionally, while LCA allows individuals to have a probability of membership in multiple subtypes, our results showed subjects generally had a very high membership probability for a single class (>70%). This suggests shared clinical characterization within the individual groups. However, it is important to note our study's limitations. When working with such a large number of temporal condition patterns, certain constraints must be considered. Given the potential for false discoveries, correction for multiple comparisons would be necessary if attempting to establish statistical significance of the associations discovered using this approach. A further limitation comes from the somewhat arbitrary limits imposed in our study definitions (namely the 10% prevalence threshold for "high prevalence" conditions and limiting the number of LCA classes to a maximum of 8 in the modelling efforts). It is possible that a larger number of classes may achieve a better AIC or BIC score; however, one of our objectives was to constrain the number of classes to be clinically manageable.
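As one example of the kind of multiple-comparisons control mentioned above, the sketch below implements the Benjamini-Hochberg step-up procedure, a widely used way to limit the false discovery rate. It was not applied in this study; the function name and the p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a rejection flag per hypothesis, controlling the FDR at alpha.

    Step-up rule: find the largest rank r (1-based, ascending p-values)
    with p_(r) <= (r / m) * alpha, then reject the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[i] = True
    return rejected


# Made-up p-values for four hypothetical pattern comparisons.
flags = benjamini_hochberg([0.039, 0.001, 0.216, 0.008], alpha=0.05)
```

In this toy example a p-value of 0.039 falls below the unadjusted 0.05 cutoff yet is not rejected after adjustment, which illustrates why uncorrected significance counts can overstate the number of genuine associations.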

Future work
Our LCA modelling results suggest the existence of obesity subtypes. Obese patients with similar comorbidities at the time of obesity incidence may have similar future health trajectories. The clinical subtypes identified in our study can serve as hypothesis generation for such patient classes, whose future health outcomes can be explored. This would allow more specialized clinical care for obese pediatric patients with certain comorbidities.
Obesity is a complex and socially significant health issue that may affect different clinical and demographic subtypes of pediatric patients differently. Our findings can support the work of public health researchers and practitioners who seek to address the social disparities component of the obesity epidemic. By better understanding the needs and population demographics of obese pediatric patients, and the comorbidities that certain newly obese patients may be more likely to develop, the epidemiology of pediatric obesity can be better studied and resources can be better allocated to populations in need.
Future research may treat the identified subtypes as hypothesized pediatric obesity subtypes and explore causality in the associations uncovered in this study. Understanding both the demographic characteristics and physical comorbidities that differentiate obese pediatric patients can help to better understand the etiology of the condition and appropriate treatment for the diverse groups of patients the condition affects. Additionally, this work shows the possibility for future researchers to utilize temporal condition patterns to develop classification models that could identify the clinical subtype for individuals newly diagnosed with pediatric obesity. Developing data-driven subtypes using diagnostic information in patient EHR data represents an exciting new frontier for researchers across clinical domains.

Supporting information
S1 Table. Statistically Significant Temporal Diagnoses among Newly Obese Pediatric Patients (n = 49 594 patients). The number before each diagnosis in a sequence represents the diagnosis timing class: '1' denotes that the observation was recorded during a patient's pre-index visit, '2' represents the index visit, and '3' signifies the post-index visit. (DOCX)