Abstract
Electronic Health Records (EHRs) are increasingly used to develop machine learning models in predictive medicine. There has been limited research on utilizing machine learning methods to predict childhood obesity and related disparities in classifier performance among vulnerable patient subpopulations. In this work, classification models are developed to recognize pediatric obesity using temporal condition patterns obtained from patient EHR data in a U.S. study population. We trained four machine learning algorithms (Logistic Regression, Random Forest, Gradient Boosted Trees, and Neural Networks) to classify cases and controls as obesity positive or negative, and optimized hyperparameter settings through a bootstrapping methodology. To assess the classifiers for bias, we studied model performance by population subgroups and then used permutation analysis to identify the most predictive features for each model and the demographic characteristics of patients with these features. Mean AUC-ROC values were consistent across classifiers, ranging from 0.72–0.80. Some evidence of bias was identified, although it manifested as the models performing better for minority subgroups (African Americans and patients enrolled in Medicaid). Permutation analysis revealed that patients from vulnerable population subgroups were over-represented among patients with the most predictive diagnostic patterns. We hypothesize that our models performed better on under-represented groups because the features most strongly associated with obesity were more commonly observed among minority patients. These findings highlight the complex ways that bias may arise in machine learning models, and can inform future research on thorough analytical approaches to identifying and mitigating bias arising from features within EHR datasets when developing more equitable models.
Author summary
Childhood obesity is a pressing health issue. Machine learning methods are useful tools to study and predict the condition. Electronic Health Record (EHR) data may be used in clinical research to develop solutions and improve outcomes for pressing health issues such as pediatric obesity. However, EHR data may contain biases that impact how machine learning models perform for marginalized patient subgroups. In this paper, we present a comprehensive framework of how bias may be present within EHR data and external sources of bias in the model development process. Our pediatric obesity case study describes a detailed exploration of a real-world machine learning model to contextualize how concepts related to EHR data and machine learning model bias occur in an applied setting. We describe how we evaluated our models for bias, and considered how these results are representative of health disparity issues related to pediatric obesity. Our paper adds to the limited body of literature on the use of machine learning methods to study pediatric obesity and investigates the potential pitfalls in using a machine learning approach when studying socially significant health issues.
Citation: Campbell EA, Bose S, Masino AJ (2024) Conceptualizing bias in EHR data: A case study in performance disparities by demographic subgroups for a pediatric obesity incidence classifier. PLOS Digit Health 3(10): e0000642. https://doi.org/10.1371/journal.pdig.0000642
Editor: Rabie Adel El Arab, Almoosa College of Health Sciences, SAUDI ARABIA
Received: October 4, 2023; Accepted: September 14, 2024; Published: October 23, 2024
Copyright: © 2024 Campbell et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data cannot be shared publicly because of HIPAA requirements. Please contact pedsnet@chop.edu with questions regarding data availability.
Funding: This work was supported by a grant from the Commonwealth Universal Research Enhancement (C.U.R.E.) program funded by the Pennsylvania Department of Health—2015 Formula award—SAP #4100072543. This work was also supported by funding from The Children’s Hospital of Philadelphia (CHOP)-Drexel Research Fellowship Program: Informatics and Analytics Collaborative Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Throughout most disciplines, massive amounts of data are being digitally generated, collected, and stored at a rapidly expanding pace. Additionally, advances in computational methods enable extraction of information from such datasets that produce useful insights and knowledge [1].
In healthcare, there is increasing use of large and variable data sources that include medical imaging, wearable devices, genome sequencing, and payer records among others; electronic health records (EHRs) are one particularly robust source of healthcare data. This data is available in an exceptionally high volume, spans the healthcare sector’s digital ethos, and is extremely variable in its structure, semantics, and information content [2,3].
Advanced data mining and analytical methods are necessary to obtain, transform, and analyze EHR data for secondary uses such as clinical and health policy research; machine learning methods are key to addressing challenges in these secondary uses. However, although EHR data and the models that may be trained with this data hold tremendous potential to transform clinical care and research, caution must be exercised in how this data is utilized and interpreted analytically [4,5]. Bias is an inherent property of statistical models and data collection processes, and can also be introduced algorithmically or found within the data used to train and test machine learning models [6,7].
Issues in using machine learning methods to analyze EHR data often arise when the data are left to speak for themselves. Algorithms may be subject to biases that are present in EHR datasets from several sources, including study population characteristics, systemic errors in how EHR data is collected, missing data, misclassification, and sample size [8]. Spurious correlations and other dataset deficiencies, such as multicollinearity among predictors, may lead to algorithms overfitting predictions to biased data and producing unstable estimates. In turn, this affects the models’ performance and generalizability, potentially causing or perpetuating health system disparities. Machine learning models may be subject to new biases not typically seen in more traditional observational studies or statistical methods [9], such as adjusting away healthcare quality differences between patients or misinterpreting treatment outcomes when making therapy recommendations [10,11].
Bias definitions
Bias that is present in EHR data may result from numerous sources, including measurement errors or selection bias in the populations represented in EHR data relative to the broader communities from which they are drawn [12]. Biased data may reflect existing prejudices or disparities of the contexts from which data are collected. For example, inadequate access to insurance or under-diagnosis of certain conditions may lead to a misrepresentation of a condition’s prevalence among vulnerable populations. In EHR data, these pernicious biases may also manifest from inequities in usage of and access to care, or in the care that vulnerable subgroups receive in healthcare settings. Bias may also be introduced algorithmically or in model design processes, which makes measuring bias when evaluating machine learning models an important area for promoting equity [13–16]. In Fig 1, we conceptualize these EHR data bias sources and how they contribute to developing biased machine learning models in clinical research. In this study, we focus on contextualizing pernicious bias in a particular dataset and how such biases may be characterized.
Fig 1. A Framework to Understand Bias in EHR Data and Machine Learning Models. We identify three sources of bias (pernicious, measurement, and sampling biases) that may occur in raw data and translate to biased datasets. Biased data, along with bias introduced algorithmically and from design processes, may lead to differential machine learning model performance, interpretation, and implementation, which in turn may perpetuate health system disparities.
In this work, we investigate potential biases (which we define here to mean systemic errors or misrepresentations embedded within datasets) that may exist in machine learning models developed from EHR data in the context of a childhood obesity incidence case study.
Study motivation
Biased data may translate to machine learning model performance inequities among vulnerable population subgroups (such as race, gender, etc.) that impact health outcomes and deepen health disparities [17–20]. Social determinants of health, such as inadequate housing, food insecurity, and poor access to healthcare, that drive adverse health outcomes may replicate in model performance [13]. For example, Obermeyer et al. found that a commercial algorithm used to predict population health risk was biased against Black patients, as it used healthcare costs as a proxy for health; less money is spent caring for Black patients in the U.S. healthcare system than for White patients, and Black patients are less likely to seek out and obtain medical care for health issues than their White counterparts [17]. As machine learning models are increasingly used to automate clinical practice and medical decision making, it is paramount that principles and methods be created for developing fair machine learning models in healthcare settings [21,22].
While machine learning models have been increasingly used in predictive medicine [23–25], there has been limited research on utilizing machine learning methods to predict childhood obesity [26,27], and an even more limited scope in considering model performance for population subgroups (e.g. race, age) [28]. Due to the limited extant literature, the seriousness of the pediatric obesity epidemic in the United States, and the condition’s associated health disparities and social vulnerabilities [29], pediatric obesity was selected as a case study to analyze bias in EHR data and machine learning models developed with such data. Within the United States, approximately one third of children are overweight (age- and sex-specific body mass index (BMI) greater than or equal to the 85th percentile per Centers for Disease Control and Prevention (CDC) growth charts) or obese (age- and sex-specific BMI greater than or equal to the 95th percentile per CDC growth charts) [30,31]. Obesity is linked with numerous comorbidities, including an increased risk of developing asthma, diabetes, hypertension, and psychological conditions during childhood and later in life, as pediatric patients with obesity are more likely to stay obese into adulthood [32,33]. Obesity disproportionately impacts American Indian, African American, and Latino children compared to non-Hispanic White children, and prevalence is also higher among low-income, rural, or less-educated population subgroups [33,34]. These disparities make pediatric obesity a socially significant health issue and a vital focus area for a study on understanding and mitigating bias in EHR data.
In this work, we add to the limited extant literature on machine learning uses to study childhood obesity by developing a classification model to identify childhood obesity incidence using our previously published temporal condition patterns surrounding pediatric obesity that were derived from EHR data and patient demographic data from a US study population [35]. These models serve as a case study in how societal bias manifests in EHR data and translates into machine learning models developed using this data. Our study aims to address the following research questions:
- Can a machine learning classification model using temporal condition patterns and demographic information from EHR data accurately identify childhood obesity incidence?
- How can machine learning models developed using EHR data be analyzed for bias in model performance amongst population subgroups?
- How can biased model performance be understood in the context of the individual condition that a researcher is working to address?
Materials and methods
Setting
EHR data was obtained from the Pediatric Big Data (PBD) resource at the Children’s Hospital of Philadelphia (CHOP). Patients in this study were from a retrospective cohort of newly obese patients from the Philadelphia metropolitan area and matched control patients with a healthy BMI identified in a previous study [35]. The PBD resource was created for secondary research use, and contains health-related information, including demographic, encounter, medication, procedure, and measurement (e.g. vital signs, laboratory results) elements for a large, unselected population of children.
Ethics statement: Non-study personnel extracted all data from the EHR and removed protected health information (PHI) identifiers, except for dates, prior to transfer to the study database. Date information was removed from the analysis dataset used in this study. The CHOP Institutional Review Board approved this study and waived the requirement for consent.
Temporal condition pattern mining methodology
In a previous study, we applied a sequential pattern mining algorithm to a dataset from a large retrospective cohort of newly obese pediatric patients (n = 49,694) at CHOP from 2009–2017 [35,36]. Patients were identified using the CDC definition of childhood obesity (BMI z-score at or above the 95th percentile for age and sex). The BMI z-scores were centrally calculated in this analysis, and the same definition of obesity was used across study sites for the entire study period. Patients had at least one obesity measurement during a CHOP primary care visit and at least one visit prior to the first obesity measurement where an obese BMI was not recorded. The analysis aimed to identify common temporal condition patterns derived from visits immediately before (pre-index) and after (post-index) the first visit with an obese BMI (index). We found 163 condition patterns present in at least 1% of the obese patients, of which 80 were significantly more common than in matched controls. Campbell et al. [35] includes a full study diagram detailing the inclusion criteria implementation for obtaining the study population and the derivation of the temporal condition patterns.
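For concreteness, the sketch below applies this CDC cutoff to BMI z-scores computed with the standard LMS transformation; the function names are illustrative, and the LMS table lookup (which the study's central calculation would have handled) is not shown.

```python
import numpy as np
from scipy.stats import norm

# CDC definition: obese if the age- and sex-specific BMI is at or above the
# 95th percentile, i.e. a BMI z-score >= norm.ppf(0.95) (about 1.645).
OBESITY_Z_CUTOFF = norm.ppf(0.95)

def bmi_z_score(bmi, L, M, S):
    """Standard LMS transformation. L, M, S come from the CDC growth-chart
    tables for the patient's age and sex (table lookup not shown here)."""
    if L == 0:
        return np.log(bmi / M) / S
    return ((bmi / M) ** L - 1.0) / (L * S)

def is_obese(bmi, L, M, S):
    return bmi_z_score(bmi, L, M, S) >= OBESITY_Z_CUTOFF
```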
Study population
To obtain the study population for the machine learning case study presented here, we began with 49,694 pairs of matched cases and controls identified in the PBD database from the prior study. Patients in our final study population must have had a BMI measurement in both the pre-index and index visits (for control patients, the index visit was the visit on which they were matched with their corresponding case patient). For case patients, this meant that they needed a non-obese BMI measurement in the pre-index visit and an obese BMI measurement in the index visit; 15,522 case patients met these criteria. Control patients needed a healthy BMI measurement in both their pre-index and index visits; 31,366 control patients met this criterion. Finally, only patients from case-control pairs where both patients met the BMI inclusion criteria were kept in the study population; 4,843 case-control pairs met the criteria and 44,851 did not. A total of 9,686 patients met the BMI criterion for inclusion.
Patients and their corresponding matched case or control were eliminated if they did not have insurance information within 2 years of the matched index visit. Forty-five controls were missing insurance information within two years of the index visit or altogether; these controls and their matched cases were eliminated from the study population. Seven cases were missing this information and were likewise eliminated (along with their matched controls). S1 Table presents the inclusion and exclusion criteria for the study population. The final study population contained 4,777 matched pairs, or 9,554 total patients. The study population and data acquisition process are summarized in Fig 2.
Fig 2. A flow chart illustrating how patients in the final study population were obtained after filtering through the study’s inclusion criteria. M represents the total number of patients, and N represents the total number of matched case-control pairs.
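As a rough illustration of the pair-level filtering summarized in Fig 2, the sketch below applies the BMI and insurance criteria and then drops any pair in which either member fails a criterion; all column names are hypothetical, as the PBD schema is not public.

```python
import pandas as pd

# One row per patient; pair_id links each case to its matched control.
# Column names (role, *_obese flags, has_insurance_within_2y) are hypothetical.
patients = pd.read_csv("matched_cohort.csv")

def meets_bmi_criteria(row):
    # Cases: non-obese BMI at the pre-index visit, obese BMI at the index visit.
    # Controls: healthy BMI at both visits.
    if row["role"] == "case":
        return (row["pre_index_bmi_recorded"]
                and not row["pre_index_obese"] and row["index_obese"])
    return row["pre_index_healthy"] and row["index_healthy"]

eligible = (patients.apply(meets_bmi_criteria, axis=1)
            & patients["has_insurance_within_2y"])

# Matched design: drop the entire pair when either member fails a criterion.
pair_ok = eligible.groupby(patients["pair_id"]).transform("all")
final_cohort = patients[pair_ok]
```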
Feature selection, data acquisition and preprocessing
The features selected for the machine learning case study included the temporal condition patterns uncovered in Campbell et al. [35]. For obesity incidence prediction, only temporal condition patterns from the pre-index and index visits were considered; of the original 80 patterns identified in the previous study, 70 were selected for inclusion in this analysis. Each temporal condition pattern was considered separately as a feature for this study. Patient EHR data were analyzed for the presence of each temporal condition pattern, and patients were assigned a binary value of 0 (no record of the temporal condition pattern) or 1 (a record of the temporal condition pattern) for each variable. Diagnoses in the temporal condition patterns are coded using the Expanded Diagnostic Clusters (EDCs) from the Adjusted Clinical Group (ACG) System [37,38]. The temporal diagnoses that comprise the condition patterns used in this study may be found in S2 Table.
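A minimal sketch of this one-indicator-per-pattern encoding is shown below; the pattern names follow the paper's visit-position-plus-EDC convention, while the input structure is our assumption.

```python
import pandas as pd

# One 0/1 indicator column per temporal condition pattern.
patterns = ["1-ALL04", "2-MUS01", "2-ALL03", "2-SKN04"]  # ... 70 in total

def encode_patterns(patient_patterns):
    """patient_patterns: dict mapping patient id -> set of observed patterns."""
    rows = {pid: {p: int(p in observed) for p in patterns}
            for pid, observed in patient_patterns.items()}
    return pd.DataFrame.from_dict(rows, orient="index")

features = encode_patterns({"patient_a": {"1-ALL04", "2-ALL03"},
                            "patient_b": set()})
# patient_a -> 1, 0, 1, 0; patient_b -> all zeros
```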
Person-level characteristics including race, sex, ethnicity, age at index visit, and insurance were also extracted from EHR data within the PBD database and included in the final dataset. The demographic variables considered were sex assigned at birth, race, Medicaid enrollment (a proxy for socioeconomic status at the time of obesity incidence) [39,40], and age at index visit (coded as a categorical variable, for patients who were 2–4 years, 5–11 years, and 12–18 years). Patients were classified as Hispanic if their self-identified ethnicity was specified as Hispanic or Latino; otherwise, they were categorized by the value of their self-identified race from the EHR. Patients with missing race and ethnicity information were classified as unknown. Patients were classified as being enrolled in Medicaid if they used multiple insurance plans and one of those was Medicaid or Children’s Health Insurance Program (CHIP), Pennsylvania’s state program to provide health insurance to uninsured children and teens who are ineligible or not enrolled in Medicaid [41]. For patients who did not have insurance information recorded for their index visit, all insurance information for their visits within a year of their index visit was obtained from the PBD database and analyzed. If patients had a record of Medicaid/CHIP enrollment within a year of their index visit, they were classified in the Medicaid/CHIP enrollment category.
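The Medicaid/CHIP proxy and categorical age coding described above might be implemented roughly as follows; the record structure passed to these helpers is an assumption.

```python
MEDICAID_PLANS = {"Medicaid", "CHIP"}

def medicaid_enrolled(index_visit_plans, plans_within_one_year):
    """A patient counts as Medicaid-enrolled if any plan recorded at the index
    visit is Medicaid or CHIP; when no insurance is recorded at the index
    visit, plans within a year of it are checked instead. The plan-name
    strings are assumptions about the source data."""
    if index_visit_plans:
        return bool(MEDICAID_PLANS & set(index_visit_plans))
    return bool(MEDICAID_PLANS & set(plans_within_one_year))

def age_category(age_at_index):
    """Categorical age coding used in the models: 2-4, 5-11, or 12-18 years."""
    if age_at_index <= 4:
        return "2-4"
    return "5-11" if age_at_index <= 11 else "12-18"
```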
Machine learning analysis
We trained four machine learning models (Logistic Regression, Random Forest, Gradient Boosted Trees, and Neural Networks) to classify cases and controls as obesity positive or negative, and optimized hyperparameter settings through a bootstrapping methodology. We randomly shuffled the data and split it into training and validation folds in a stratified fashion relative to the 50:50 class balance to ensure an equal number of obese and non-obese patients in the training and validation sets. The demographic characteristics of patients in the training and test sets may be found in S3 Table. We trained each model with all hyperparameter settings on the training fold and evaluated its Area Under the Receiver Operating Characteristic curve (AUC-ROC) on the validation fold. We repeated the process 200 times to obtain 200 validation AUC-ROCs for each hyperparameter setting for each model, then selected the hyperparameter combination with the highest mean validation AUC-ROC for a given model class.
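A condensed sketch of this repeated stratified-split selection procedure is shown below for a single model class; the Random Forest grid and the 80:20 split proportion are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Illustrative grid for one model class.
param_grid = [{"n_estimators": n, "max_depth": d}
              for n in (100, 300) for d in (None, 5, 10)]

# 200 stratified shuffles preserve the 50:50 case/control balance.
splitter = StratifiedShuffleSplit(n_splits=200, test_size=0.2, random_state=0)

def mean_validation_auc(params, X, y):
    """Mean AUC-ROC over the 200 train/validation splits (X, y: NumPy arrays)."""
    aucs = []
    for train_idx, val_idx in splitter.split(X, y):
        model = RandomForestClassifier(**params, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], scores))
    return np.mean(aucs)

# best_params = max(param_grid, key=lambda p: mean_validation_auc(p, X, y))
```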
We used a Multi-layer Perceptron (MLP) architecture with 3 hidden layers for our Neural Network models. We defined a parameter grid containing a range of values for key hyperparameters that influence the performance of the MLP classifier. We considered both tanh and relu activation functions; constant and adaptive learning rates with initial learning rates of 0.1, 0.01, and 0.001; and several combinations of hidden layer dimensions including (30,50,30), (100,100,100), (50,100,50), and (75,50,25), where the numbers in the tuples represent the number of nodes in the corresponding hidden layers. We used the Adam solver for weight optimization. To select the optimal hyperparameter setting, we created 200 stratified train-test splits and computed the test set AUC-ROC for all hyperparameter combinations derived from the grid described above. The hyperparameter combination yielding the highest mean AUC-ROC across the 200 test sets was selected as optimal.
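The MLP search space described above maps naturally onto scikit-learn's MLPClassifier keywords, as in the sketch below; the exact keyword correspondence is our assumption.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPClassifier

# The grid values are those listed in the text.
mlp_grid = ParameterGrid({
    "activation": ["tanh", "relu"],
    "learning_rate": ["constant", "adaptive"],
    "learning_rate_init": [0.1, 0.01, 0.001],
    "hidden_layer_sizes": [(30, 50, 30), (100, 100, 100),
                           (50, 100, 50), (75, 50, 25)],
    "solver": ["adam"],
})

candidate_models = [MLPClassifier(**params) for params in mlp_grid]
```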
We implemented all algorithms using the Scikit-learn library in Python 3 [42]. We calculated the mean and standard deviation (SD) AUC-ROC values for the total study population and demographic subgroups for each algorithm.
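Per-subgroup AUC-ROC values for a single validation split might be computed as below; repeating this over the 200 splits yields the reported means and SDs. The subgroup labels are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_aucs(y_true, y_scores, groups):
    """AUC-ROC per demographic subgroup for one validation split.
    `groups` holds one subgroup label per patient (e.g. race category)."""
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) == 2:  # AUC requires both classes
            results[g] = roc_auc_score(y_true[mask], y_scores[mask])
    return results
```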
Permutation analysis
We performed a permutation feature importance analysis on the data split with median validation AUC-ROC for the model with hyperparameters corresponding to the highest mean validation AUC-ROC. The feature importance is computed by measuring the change in the AUC-ROC on the validation set when the values in the dataset for a given feature are randomly shuffled among samples. Feature importance is reflected by a decrease in AUC-ROC as compared to when the feature is not permuted, with higher importance indicated by a larger decrease.
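A minimal sketch of this shuffle-and-rescore procedure, assuming NumPy feature arrays and an illustrative number of repeats:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importances(model, X_val, y_val, n_repeats=10, seed=0):
    """Importance of each feature = mean drop in validation AUC-ROC when
    that feature's column is shuffled across patients."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            rng.shuffle(X_perm[:, j])  # shuffle one feature in place
            drops.append(baseline
                         - roc_auc_score(y_val,
                                         model.predict_proba(X_perm)[:, 1]))
        importances[j] = np.mean(drops)
    return importances  # larger drop => more important feature
```

scikit-learn's sklearn.inspection.permutation_importance provides equivalent functionality when given a roc_auc scorer.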
Results
Study population
Table 1 presents the demographic characteristics of the total study population, as well as the case and control populations respectively.
The study population is majority male (55.4%) and majority White (60.1%). African Americans are the second largest racial/ethnic group (25.3%). Approximately one third of patients (32.1%) were enrolled in Medicaid at the time of their index visit. The case population is also majority male (55.4%) and majority White (53.9%), but includes higher proportions of African American (31.7% vs. 25.3%) and Hispanic (4.4% vs. 3.5%) patients compared to the entire study population. Additionally, a greater proportion of case patients (38.1%) were enrolled in Medicaid compared to the overall study population (32.1%). The control population has a higher proportion of White patients compared to the case population (66.3% versus 53.9%) and a lower proportion of racial minorities. The control population also had a lower rate of Medicaid enrollment than the case population (26.1% versus 38.1%).
Machine learning results
Table 2 presents mean AUC-ROC values for each algorithm, which were consistent across classifiers, ranging from 0.72–0.80. On the full study population, mean AUC-ROC values were 0.76 for the Neural Network, 0.77 for Random Forest, and 0.78 for both Logistic Regression and Gradient Boosted Trees; across demographic subgroups, mean AUC-ROC values ranged from 0.72–0.79, 0.73–0.79, 0.73–0.80, and 0.73–0.80 for these algorithms, respectively. Gradient Boosted Trees and Logistic Regression tended to perform the best, both on the full study population and when evaluated by demographic subgroups.
Some evidence of bias was identified, although surprisingly it manifested as the models performing better for minority subgroups. The highest mean AUC-ROC values (0.80) were observed for pediatric obesity identification with the Gradient Boosted Trees and Logistic Regression models for patients enrolled in Medicaid (a proxy for low-income status), as well as patients aged 5–11 years. African American patients were the demographic subgroup with the second highest model performance, with mean AUC-ROC values ranging from 0.78–0.79 across all four models. The models performed the poorest for patients with unknown race information and patients of multiple races (mean AUC-ROC values of 0.72–0.73 and 0.73–0.76, respectively). White patients also had among the lowest mean AUC-ROC values, ranging from 0.74–0.76 across models.
S4 Table presents ANOVA results comparing model performance overall and for demographic subgroups, and S5 Table presents ANOVA results comparing model performance across demographic subgroups for each model. The models performed differently from one another on African American patients at the p<0.01 level for all model comparison combinations except Logistic Regression and Neural Net. All four models performed differently for African American patients compared to non-African American patients at the p<0.01 level. Similarly, the models performed differently from one another on patients enrolled in Medicaid at the p<0.01 level for all model comparison combinations except Random Forest and Neural Net. All four models performed differently for patients enrolled in Medicaid compared to patients who were not at the p<0.01 level.
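These inter-model comparisons can be reproduced in outline with a one-way ANOVA over each model's 200 per-split AUC-ROC values, as in the sketch below; the AUC arrays here are simulated placeholders, not study results.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated per-split subgroup AUC-ROC values for two models.
rng = np.random.default_rng(0)
gbt_aucs = rng.normal(0.80, 0.02, size=200)  # Gradient Boosted Trees
lr_aucs = rng.normal(0.78, 0.02, size=200)   # Logistic Regression

stat, p_value = f_oneway(gbt_aucs, lr_aucs)
print(f"F = {stat:.2f}, p = {p_value:.4g}")  # p < 0.01 => performance differs
```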
Permutation analysis findings
A permutation analysis was undertaken to investigate why models tended to perform slightly better for under-represented groups in the study population. We hypothesized that the features most predictive of obesity may be more common among marginalized subpopulations, and therefore used permutation feature importance to identify which features were most important in classifying patients as obese or not obese for each algorithm.
Table 3 presents the top ten most predictive sequences for each classification algorithm. Four temporal condition patterns were among the top ten most predictive features across all four algorithms: 1-ALL04 (a diagnosis of asthma in the pre-index visit), 2-MUS01 (a diagnosis of musculoskeletal signs and symptoms in the index visit), 2-ALL03 (a diagnosis of allergic rhinitis in the index visit), and 2-SKN04 (a diagnosis of acne in the index visit). A diagnosis of asthma in the pre-index visit (1-ALL04) was the most predictive feature for the Gradient Boosted Trees, Neural Network, and Random Forest algorithms, and the third most predictive for Logistic Regression. To better understand the disparate machine learning model performance and permutation analysis findings, the prevalence of these four temporal condition patterns was assessed among demographic subgroups in the study population (Table 4).
The two features most common among patients were also among the most predictive sequences for all four classifiers: 1-ALL04 (asthma in the pre-index visit), of which 851 patients had a record in their EHR data, and 2-ALL03 (allergic rhinitis in the index visit), of which 459 patients had a record (compared to 190 patients with the 2-MUS01 sequence and 140 patients with the 2-SKN04 diagnosis). African American patients and patients enrolled in Medicaid are over-represented among patients who have these diagnoses. Almost half of patients (49.4%) with a 1-ALL04 diagnosis were African American, even though African American patients make up only 25.3% of the study population, and 47.5% of patients with this diagnosis were enrolled in Medicaid compared to only 32.1% of the total study population. Similarly, 46.2% of patients with the 2-ALL03 diagnosis were African American and 40.1% were enrolled in Medicaid.
Discussion
Key findings
Machine learning model development using temporal condition patterns in EHR data.
In this study, four supervised machine learning algorithms were trained to identify pediatric patients from a US study population as obese or not obese using demographic variables and temporal condition patterns previously found to be associated with obesity incidence [35]. Model performance was evaluated for the total population and by demographic subgroups using mean AUC-ROC values. Mean AUC-ROC values were consistent across algorithms, ranging from 0.72–0.80. Gradient Boosted Trees and Logistic Regression tended to perform the best on the full study population and when evaluated by demographic subgroups. This result is consistent with existing work demonstrating that complex machine learning models may not perform better than logistic regression when using structured clinical data [43–46]. Our conjecture is that this is likely true when inputs consist of highly engineered feature variables, such as those found in the EHR and, similarly, the temporal patterns used in our work.
Model performance and pernicious bias
The algorithms tended to perform relatively consistently compared to one another, including when each classifier’s performance was analyzed by demographic subgroups. Some evidence of model bias relative to population subgroups was detected, although unexpectedly the bias manifested as the classifiers tending to perform slightly better on vulnerable subgroups, including African American patients and patients enrolled in Medicaid (a proxy for lower socioeconomic status), than on the entire population.
Our findings contradict prior studies that show biased machine learning model performance against vulnerable patient populations [47], even in studies addressing socially significant health issues. Schuch et al. utilized governmental survey data and four machine learning algorithms to predict foregone dental care, and Juhn et al. employed machine learning models trained using EHR data to predict asthma exacerbation [48–50]. Juhn et al. found that lower socioeconomic status was associated with poorer model performance, and in Schuch et al. the models performed worse for racial and ethnic minorities compared to the full study population. The fundamental causes of predictive disparities in healthcare machine learning models are varied, and bias may be introduced at any stage of the model development process [51,52]; possible biases need to be examined carefully. Our study’s findings illustrate that there are many complex ways that bias may emerge within EHR data, and that the context in which data are collected and the population impacted by a condition must be carefully considered when assessing EHR data for bias.
Contextualizing findings for conditions of interest
A permutation analysis was conducted on the classifiers developed in our case study. The ten variables that were most predictive for each of the four classifiers were identified, and their average impact on each model’s AUC-ROC was computed. Two variables, 1-ALL04 (asthma in the pre-index visit) and 2-ALL03 (allergic rhinitis in the index visit), were among the most predictive variables for all four classifiers and were the most prevalent condition patterns in the study population. African American patients and patients enrolled in Medicaid were over-represented among patients who had these temporal condition patterns.
Our findings align with prior research on the association between asthma and pediatric obesity, which provides insight into why these variables were so informative. Pediatric obesity and asthma are strongly associated, and early-life asthma contributes to the onset of pediatric obesity [53]. Although the relationship between allergic rhinitis and pediatric obesity is unclear [54,55], allergic rhinitis has been shown to be comorbid with pediatric asthma [56]. Low-income, urban, and racial minority children are disproportionately impacted by both pediatric obesity and asthma [57,58], which explains their over-representation amongst patients with the most predictive features and the slight classifier performance bias in their favor. Because these features were more represented among patients in vulnerable subgroups in this study, we hypothesize that this caused the classification algorithms to perform generally equitably, if not slightly better, for African American patients and patients enrolled in Medicaid (a bias that is in favor of medically underserved subpopulations). This may explain why our findings contradict prior research showing that bias in machine learning typically results in lower model performance for minorities due, in part, to their under-representation in data [17,21]. While these are simply associations and causality cannot be inferred, our results support the idea that causes of bias in datasets, and in the models trained from them, are much more nuanced than initially thought. It is also important to note that our findings are consistent with how pediatric obesity is understood amongst children and adolescents in the United States, and disparities related to pediatric obesity in other countries may manifest differently in EHR data.
Limitations
While this study serves to illustrate challenges and nuances associated with bias in machine learning models developed using EHR data, it does have limitations. First, the temporal condition patterns mined from EHR data that were utilized as features for the machine learning models only show associations. Findings are descriptive, and the discovered temporal patterns and comorbidities should be viewed in this light; no causality can be attributed to the associations uncovered in this study. We also acknowledge that, considering the outcome of obesity in the machine learning prediction problem, minority patients comprised a greater proportion of the case population compared to controls. This may have contributed to the model performance bias in favor of vulnerable subgroups. However, the case/control ratio in the data, and the corresponding performance, reflects the disease burden among the different demographic groups, and thus the model performance is reflective of what would be observed if deployed in this setting. While a more balanced dataset could alter model performance, we believe that, given the primary research question was to ascertain performance differences and potential bias among different groups, it was important in this context to maintain the balance observed in the data. Finally, all EHR data used in this study were derived from a US population and cannot be obtained without special permission from CHOP due to patient privacy and confidentiality constraints. These restrictions limit the study’s reproducibility. Given the geographic constraints in the patient catchment area, our study’s findings and the types of bias that manifest within this data may not be generalizable to EHR data from other settings.
Conclusion
Our paper presents a comprehensive framework of how bias may be present within EHR data and of external sources of bias in the model development process, which in turn impact machine learning model development and clinical applications. Our pediatric obesity case study describes a detailed exploration of a real-world machine learning model to contextualize how concepts related to EHR data and machine learning model bias occur in an applied setting. We describe how we evaluated our models for bias, and considered how these results are representative of health disparity issues related to pediatric obesity. Finally, our paper presents a novel application of data-driven temporal condition patterns surrounding pediatric obesity incidence in a predictive machine learning model. This adds to the limited body of literature on the use of machine learning methods to study pediatric obesity and investigates the potential pitfalls in using a machine learning approach when studying socially significant health issues.
Bias is a complex and multi-faceted issue that is present in society and translates into data collected in applied settings. We expect that our study may be used to define the types of bias that researchers working with EHR data to develop machine learning models may look for, and to understand that bias may manifest in machine learning models in unexpected ways. Our approach to evaluating a machine learning model for bias and contextualizing our model evaluation alongside clinical and psychosocial knowledge surrounding pediatric obesity provides a useful blueprint for researchers developing and evaluating machine learning models with EHR data in the obesity space and beyond. Finally, our findings support more equitable model development, and may be used to guide researchers and clinicians in the precision medicine space to consider the types of bias that may be present in machine learning models and how to implement these models in clinical settings in a way that helps to address and not advance existing systemic disparities.
Supporting information
S1 Table. Presents the inclusion criteria for case and control patients in the study population.
https://doi.org/10.1371/journal.pdig.0000642.s001
(DOCX)
S2 Table. Presents the temporal diagnoses that served as features for the machine learning model development.
https://doi.org/10.1371/journal.pdig.0000642.s002
(DOCX)
S3 Table. Presents the demographic characteristics of the test and training set patient populations.
https://doi.org/10.1371/journal.pdig.0000642.s003
(DOCX)
S4 Table. Presents ANOVA results comparing inter-model performance overall and for demographic subgroups between classifiers. P-value <0.05 indicates a statistically significant difference in model performance.
https://doi.org/10.1371/journal.pdig.0000642.s004
(DOCX)
S5 Table. Presents ANOVA results comparing intra-model performance for demographic subgroups for each model. P-value <0.05 indicates a statistically significant difference in model performance.
https://doi.org/10.1371/journal.pdig.0000642.s005
(DOCX)
References
- 1. Murdoch TB, Detsky AS. The Inevitable Application of Big Data to Health Care. JAMA. 2013 Apr 3;309(13):1351–2. pmid:23549579
- 2. Andreu-Perez J., Poon C. C. Y., Merrifield R. D., Wong S. T. C., Yang G. -Z. Big Data for Health. IEEE Journal of Biomedical and Health Informatics. 2015 Jul;19(4):1193–208. pmid:26173222
- 3. Ferrao JC, Oliveira MD, Janela F, Martins HM. Preprocessing structured clinical data for predictive modeling and decision support. Applied clinical informatics. 2016;7(04):1135–53.
- 4. Char DS, Shah NH, Magnus D. Implementing machine learning in health care—addressing ethical challenges. The New England journal of medicine. 2018;378(11):981. pmid:29539284
- 5. Carter SM, Rogers W, Win KT, Frazer H, Richards B, Houssami N. The ethical, legal and social implications of using artificial intelligence systems in breast cancer care. The Breast. 2020;49:25–32.
- 6. Shah DS, Schwartz HA, Hovy D. Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020. p. 5248–64.
- 7. Paulus JK, Kent DM. Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. NPJ Digital Medicine. 2020;3(1):99. pmid:32821854
- 8. Yadav P, Steinbach M, Kumar V, Simon G. Mining electronic health records (EHRs): A survey. ACM Computing Surveys (CSUR). 2018;50(6):1–40.
- 9. Vokinger KN, Feuerriegel S, Kesselheim AS. Mitigating bias in machine learning for medicine. Communications medicine. 2021 Aug 23;1(1):25. pmid:34522916
- 10. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA internal medicine. 2018;178(11):1544–7. pmid:30128552
- 11. Auerbach A, Fihn SD. Discovery, Learning, and Experimentation With Artificial Intelligence–Based Tools at the Point of Care—Perils and Opportunity. JAMA Network Open. 2021;4(3):e211474–e211474. pmid:33704470
- 12. Cutillo CM, Sharma KR, Foschini L, Kundu S, Mackintosh M, Mandl KD. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. NPJ digital medicine. 2020;3(1):1–5. pmid:32258429
- 13. McCradden MD, Joshi S, Anderson JA, Mazwi M, Goldenberg A, Zlotnik Shaul R. Patient safety and quality improvement: Ethical principles for a regulatory approach to bias in healthcare machine learning. Journal of the American Medical Informatics Association. 2020;27(12):2024–7. pmid:32585698
- 14. Chen IY, Szolovits P, Ghassemi M. Can AI help reduce disparities in general medical and mental health care? AMA journal of ethics. 2019;21(2):167–79.
- 15. Kaushal A, Kaushal D, Pathak Y. Practicing Medicine and Ethical Issues Applying Artificial Intelligence. In: Ethical Issues in AI for Bioinformatics and Chemoinformatics. CRC Press. p. 87–106.
- 16. Navarro CL, Damen JA, Takada T, Nijman SW, Dhiman P, Ma J, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ. 2021;375:n2281.
- 17. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. pmid:31649194
- 18. Benjamin R. Assessing risk, automating racism. Science. 2019;366(6464):421. pmid:31649182
- 19. Mhasawade V, Zhao Y, Chunara R. Machine learning and algorithmic fairness in public and population health. Nature Machine Intelligence. 2021 Aug;3(8):659–66.
- 20. Chen Z, Zhang JM, Sarro F, Harman M. A comprehensive empirical study of bias mitigation methods for machine learning classifiers. ACM Transactions on Software Engineering and Methodology. 2023;32(4):1–30.
- 21. Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M. Hurtful words: quantifying biases in clinical contextual word embeddings. In: Proceedings of the ACM Conference on Health, Inference, and Learning; 2020.
- 22. Pfohl S, Marafino B, Coulet A, Rodriguez F, Palaniappan L, Shah NH. Creating fair models of atherosclerotic cardiovascular disease risk. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society; 2019.
- 23. Garg A, Mago V. Role of machine learning in medical research: A survey. Computer science review. 2021 May 1;40:100370.
- 24. Ben-Israel D, Jacobs WB, Casha S, Lang S, Ryu WH, de Lotbiniere-Bassett M, Cadotte DW. The impact of machine learning on patient care: a systematic review. Artificial intelligence in medicine. 2020 Mar 1;103:101785 pmid:32143792
- 25. Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJ, Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Network Open. 2022 Sep 1;5(9):e2233946. pmid:36173632
- 26. Colmenarejo G. Machine learning models to predict childhood and adolescent obesity: a review. Nutrients. 2020 Aug 16;12(8):2466. pmid:32824342
- 27. Triantafyllidis A, Polychronidou E, Alexiadis A, Rocha CL, Oliveira DN, da Silva AS, et al. Computerized decision support and machine learning applications for the prevention and treatment of childhood obesity: A systematic review of the literature. Artif Intell Med. 2020;104:101844. pmid:32498995
- 28. Siddiqui H, Rattani A, Woods NK, Cure L, Lewis RK, Twomey J, Smith-Campbell B, Hill TJ. A survey on machine and deep learning models for childhood and adolescent obesity. IEEE Access. 2021 Nov 25;9:157337–60.
- 29. Iguacel I, Gasch-Gallen A, Ayala-Marin AM, De Miguel-Etayo P, Moreno LA. Social vulnerabilities as risk factor of childhood obesity development and their role in prevention programs. International journal of obesity. 2021 Jan;45(1):1–1. pmid:33033393
- 30. Kuczmarski RJ. 2000 CDC Growth Charts for the United States: methods and development. Department of Health and Human Services, Centers for Disease Control and …; 2002.
- 31. Karnik S, Kanekar A. Childhood obesity: a global public health crisis. Int J Prev Med. 2012;3(1):1–7. pmid:22506094
- 32. Pulgarón ER. Childhood obesity: a review of increased risk for physical and psychological comorbidities. Clinical therapeutics. 2013;35(1):A18–32. pmid:23328273
- 33. Ogden CL, Carroll MD, Fakhouri TH, Hales CM, Fryar CD, Li X, et al. Prevalence of obesity among youths by household income and education level of head of household—United States 2011–2014. Morbidity and mortality weekly report. 2018;67(6):186. pmid:29447142
- 34. Skinner AC, Ravanbakht SN, Skelton JA, Perrin EM, Armstrong SC. Prevalence of obesity and severe obesity in US children, 1999–2016. Pediatrics. 2018;141(3).
- 35. Campbell EA, Qian T, Miller JM, Bass EJ, Masino AJ. Identification of temporal condition patterns associated with pediatric obesity incidence using sequence mining and big data. International Journal of Obesity. 2020;44(8):1753–65. pmid:32494036
- 36. Campbell EA, Bass EJ, Masino AJ. Temporal condition pattern mining in large, sparse electronic health record data: A case study in characterizing pediatric asthma. Journal of the American Medical Informatics Association. 2020;27(4):558–66. pmid:32049282
- 37. Bailey LC, Milov DE, Kelleher K, Kahn MG, Del Beccaro M, Yu F, et al. Multi-institutional sharing of electronic health record data to assess childhood obesity. PloS one. 2013;8(6):e66192. pmid:23823186
- 38. Weiner J, Abrams C. The Johns Hopkins ACG System Technical Reference Guide, Version 10.0. Johns Hopkins Bloomberg School of Public Health. 2011.
- 39. Schechter MS, Shelton BJ, Margolis PA, FitzSimmons SC. The association of socioeconomic status with outcomes in cystic fibrosis patients in the United States. American journal of respiratory and critical care medicine. 2001;163(6):1331–7. pmid:11371397
- 40. Arpey NC, Gaglioti AH, Rosenbaum ME. How socioeconomic status affects patient perceptions of health care: a qualitative study. Journal of primary care & community health. 2017;8(3):169–75. pmid:28606031
- 41. About CHIP: Commonwealth of Pennsylvania [Internet]. 2019. Available from: https://www.chipcoverspakids.com/AboutCHIP/Pages/default.aspx
- 42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011;12:2825–30.
- 43. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology. 2019;110:12–22. pmid:30763612
- 44. Lynam A.L., Dennis J.M., Owen K.R. et al. Logistic regression has similar performance to optimised machine learning algorithms in a clinical setting: application to the discrimination between type 1 and type 2 diabetes in young adults. Diagn Progn Res 4, 6 (2020). pmid:32607451
- 45. de Hond AAH, Kant IMJ, Honkoop PJ, Smith AD, Steyerberg EW, Sont JK. Machine learning did not beat logistic regression in time series prediction for severe asthma exacerbations. Sci Rep. 2022 Nov 27;12(1):20363. pmid:36437306; PMCID: PMC9701686.
- 46. Gravesteijn BY, Nieboer D, Ercole A, Lingsma HF, Nelson D, Van Calster B, Steyerberg EW, Åkerlund C, Amrein K, Andelic N, Andreassen L. Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury. Journal of clinical epidemiology. 2020 Jun 1;122:95–107. pmid:32201256
- 47. Barton M, Hamza M, Guevel B. Racial Equity in Healthcare Machine Learning: Illustrating Bias in Models With Minimal Bias Mitigation. Cureus. 2023 Feb 15;15(2):e35037. pmid:36942183; PMCID: PMC10023594.
- 48. Schuch HS, Furtado M, Silva GFDS, Kawachi I, Chiavegatto Filho ADP, Elani HW. Fairness of Machine Learning Algorithms for Predicting Foregone Preventive Dental Care for Adults. JAMA Netw Open. 2023 Nov 1;6(11):e2341625. pmid:37921762; PMCID: PMC10625037.
- 49. Seol HY, Shrestha P, Muth JF, Wi CI, Sohn S, Ryu E, Park M, Ihrke K, Moon S, King K, Wheeler P, Borah B, Moriarty J, Rosedahl J, Liu H, McWilliams DB, Juhn YJ. Artificial intelligence-assisted clinical decision support for childhood asthma management: A randomized clinical trial. PLoS One. 2021 Aug 2;16(8):e0255261. pmid:34339438; PMCID: PMC8328289.
- 50. Juhn YJ, Ryu E, Wi CI, King KS, Malik M, Romero-Brufau S, Weng C, Sohn S, Sharp RR, Halamka JD. Assessing socioeconomic bias in machine learning algorithms in health care: a case study of the HOUSES index. Journal of the American Medical Informatics Association. 2022 Jul 1;29(7):1142–51. pmid:35396996
- 51. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Annals of internal medicine. 2018 Dec 18;169(12):866–72.
- 52. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR). 2021 Jul 13;54(6):1–35.
- 53. Azizpour Y, Delpisheh A, Montazeri Z, Sayehmiri K, Darabi B. Effect of childhood BMI on asthma: a systematic review and meta-analysis of case-control studies. BMC pediatrics. 2018;18(1):1–13.
- 54. Sidell D, Shapiro NL, Bhattacharyya N. Obesity and the risk of chronic rhinosinusitis, allergic rhinitis, and acute otitis media in school-age children. The Laryngoscope. 2013;123(10):2360–3. pmid:23918707
- 55. Weinmayr G, Forastiere F, Büchele G, Jaensch A, Strachan DP, Nagel G, et al. Overweight/obesity and respiratory and allergic disease in children: international study of asthma and allergies in childhood (ISAAC) phase two. PloS one. 2014;9(12):e113996. pmid:25474308
- 56. Stern J, Chen M, Fagnano M, Halterman JS. Allergic rhinitis co-morbidity on asthma outcomes in city school children. Journal of Asthma. 2022;1–7. pmid:35195499
- 57. Thakur N, Oh SS, Nguyen EA, Martin M, Roth LA, Galanter J, et al. Socioeconomic status and childhood asthma in urban minority youths. The GALA II and SAGE II studies. American journal of respiratory and critical care medicine. 2013;188(10):1202–9. pmid:24050698
- 58. Persky VW, Slezak J, Contreras A, Becker L, Hernandez E, Ramakrishnan V, et al. Relationships of race and socioeconomic status with prevalence, severity, and symptoms of asthma in Chicago school children. Annals of Allergy, Asthma & Immunology. 1998;81(3):266–71. pmid:9759806