Abstract
Objective
This retrospective, case-control study with internal validation evaluates the performance of machine learning (ML) and deep learning (DL) models in classifying pediatric patients at risk for anxiety disorders using structured electronic health records (EHRs) and area-based measures of health (ABMH). The aim is to enable proactive care by monitoring potential anxiety onset across developmental stages.
Methods
We trained a series of ML models (Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, XGBoost) and DL models (LSTM, GRU, RETAIN, Dipole) using structured EHR data from 30-day windows prior to diagnosis. Two datasets were used per age group: one with structured EHR data only, and another including both EHR and ABMH data. ML models were trained using short-term cross-sectional features, while DL models leveraged full longitudinal patient histories. Performance was assessed using AUROC, AUPRC, PPV, NPV, F1 score, and accuracy. Due to differences in input scope, model performance reflects both algorithmic and temporal design differences and is not intended as a direct comparison between ML and DL.
Results
ML models offered strong baseline performance, with XGBoost achieving AUROC scores of 0.817 (EHR) and 0.816 (EHR+ABMH) for 8-year-olds. Adding ABMH features did not significantly improve performance. DL models, particularly RETAIN and Dipole, achieved the highest AUROC values (e.g., Dipole: 0.853 with EHR, 0.857 with EHR+ABMH for 8-year-olds), outperforming other DL and ML models within their respective design constraints.
Conclusion
Both ML and DL models successfully identified likely anxiety onset using structured EHR data. DL models using longitudinal data achieved the highest performance, while XGBoost provided a robust ML baseline. The minimal impact of ABMH features highlights integration challenges, and performance variation across ages emphasizes the need for age-stratified modeling approaches.
Citation: Lee EW, Choo S, Maguire D, Shivanna A, Santel D, Bhatnagar S, et al. (2026) Comparing machine and deep learning models for pediatric anxiety classification using structured EHRs and area-based measures of health data. PLoS One 21(5): e0324673. https://doi.org/10.1371/journal.pone.0324673
Editor: Sreeram V. Ramagopalan, University of Oxford, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: April 30, 2025; Accepted: April 14, 2026; Published: May 12, 2026
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The data underlying this study are derived from patient EHRs and cannot be shared publicly due to legal and ethical restrictions. These restrictions are imposed by the Cincinnati Children’s Hospital Institutional Review Board (IRB) because the data contain potentially identifying information. Requests for data access can be submitted to the Cincinnati Children’s Hospital Medical Center IRB (CCHMC IRB; Phone: 513-636-8039; Address: 3333 Burnet Avenue, MLC 7040, Cincinnati, OH 45229). Email: IRB@cchmc.org.
Funding: This work was supported by Cincinnati Children’s Hospital Medical Center under Strategic Partnership Projects agreement NFE-21-08617. This work has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Anxiety disorders are the most common type of mental disorder, and an estimated 19.1% of adults in the U.S. population have an anxiety disorder [1]. Anxiety disorders typically begin during childhood or adolescence and persist into adulthood, with a lifetime prevalence [2,3]. They manifest at an earlier stage of development compared to depression [4,5] and, if not treated, are associated with substantial functional impairment and healthcare burden [6]. Early-onset anxiety disorders are more likely to lead to significant depression, substance dependency, suicidal behavior, and educational underachievement [7,8]. Therefore, implementing effective strategies for the early identification of anxiety could result in improved health outcomes throughout an individual’s lifespan.
Currently, the identification of children at risk for developing anxiety disorders is sub-optimal. Recent national data indicate persistent gaps in access to mental health care among children and adolescents [3,9]. Traditionally, anxiety is diagnosed by pediatricians, primary care clinicians, and psychiatric clinicians and requires knowledge of the patient’s history, specific symptoms (e.g., sleep patterns, concentration, restlessness), and physical health. Given the extensive nature of medical records, it is often difficult for a clinician to ingest and summarize the large volume of information about a patient’s health across their life course [10]. This is a significant barrier to effective and timely intervention, impacting short- and long-term patient outcomes. Enhancing electronic health record (EHR) systems with intelligent, streamlined tools for early diagnosis of clinical anxiety using classification models, i.e., machine learning (ML) and deep learning (DL), could substantially improve clinical decision-making processes and reduce the likelihood of missed diagnoses. Furthermore, early and effective anxiety diagnosis can improve the long-term health of the patient and diminish long-term healthcare expenses associated with a missed diagnosis [1,11]. In this study, we compare the performance of several ML and DL methods in identifying and classifying pediatric patients with anxiety. These identification and classification tasks are performed using time-dependent and static features in the EHR, along with area-based measures of health (ABMH) data for ages 2–21.
EHRs contain structured data such as diagnosis and treatment codes, prescription medications, and demographic data, which researchers have utilized to build ML and DL models to predict illness and diseases [12–14]. Although successful, many classification models ignore the time-dependent EHR data, which can be used effectively as a signal of future risk [15]. Each patient’s diagnoses are documented using the International Classification of Disease (ICD) codes at each clinical encounter, and the number of encounters increases over time. This sparse nature makes using the time-dependent EHR data challenging [16]. Studies have explored different techniques using recurrent neural networks (RNNs)-based models to handle long sequences of encounters [15,17–19]. Although several of these approaches have been proposed, a comprehensive comparative assessment of the different methods has not yet been performed, and is needed to confirm the best model to identify pediatric anxiety. Recent developments in model optimization and representation learning, including knowledge distillation in clinical prediction tasks and graph-based learning architectures, continue to inform the evolution of machine learning approaches in healthcare [20,21].
The influence of ABMH on child and adolescent development—and the risk of psychopathology—has been well-established over decades of research [4,22,23]. Yet, despite this robust evidence, ABMH is inconsistently integrated into our understanding of how and why psychiatric disorders emerge in children and adolescents [6]. Environmental exposures, such as poverty or unsafe neighborhoods as well as neighborhood-level air pollution [24,25], are linked to a higher likelihood of anxiety and related disorders [5,23]. Limited family resources significantly increase the risk of mood disorders, while negative life events and caregiver strain not only heighten the risk for anxiety but also reduce the chances of responding to treatment [26,27]. Additionally, exposure to childhood violence has a clear association with the development of anxiety and depression, although evidence regarding sex differences remains inconsistent [28]. At the same time, structured EHR fields often incompletely capture broader social and environmental factors, and area-level indices can dilute individual-level effects, which may limit incremental predictive value in retrospective designs [29–31]. Yet, despite this, much of our work in understanding and screening for anxiety disorders has focused narrowly on individual risk factors—like inhibited temperament, family history, or subsyndromal symptoms—without adequately considering the influence of social and environmental contexts. This oversight is more than a knowledge gap; it is a missed opportunity. Incorporating ABMH into screening represents a substantial advance in assessing risk and an opportunity to identify vulnerabilities earlier, design more comprehensive interventions, and ultimately reduce disparities in mental health care access and outcomes. For clinicians, screening and treating anxiety disorders in youth means moving closer to a system that recognizes the interconnectedness of their environment, experiences, and biology [11].
This study investigates the utility of various computational methods for identifying pediatric patients with anxiety using time-dependent and static features in EHRs and ABMH data. First, we assess the performance of ML models that incorporate only recent 30-day information with a 30-day blackout period prior to diagnosis. Second, we assess the performance of DL models that incorporate time-dependent features that cover the span of the patient’s history. We evaluate the performance of each model across different age groups with two nested datasets: (1) structured features generated from EHR data (EHR) and (2) the combination of structured EHR and ABMH data (EHR+ABMH). The overall goal is to provide bioinformaticians and clinicians with comprehensive information that can be used to guide the development of predictive models for pediatric anxiety.
Methods
This section introduces the datasets used in this study: (1) structured features generated from EHR data and (2) ABMH features extracted from multiple sources and linked to a patient’s residential location at the census tract level. Our approach applies a suite of ML models that are commonly used in classification tasks, including logistic regression (LR), decision tree (DT), random forest (RF) [32], k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost) [33]. For the DL models, we select RNN-based models to handle long sequences of encounters, such as gated recurrent unit (GRU) [34], long short-term memory (LSTM) [35], reverse time attention (RETAIN) [18], and diagnosis prediction model (Dipole) [19]. In anxiety, the presentation of disease and the features important for diagnosis can vary by age; hence, the models were stratified by age to account for this effect. Fig 1 provides a comprehensive summary of our approach for developing classification algorithms on EHR datasets for a single age group.
Dataset
The Cincinnati Children’s Hospital Institutional Review Board approved this retrospective study (STUDY# 2020−0942). The study analyzed existing pediatric electronic health records (EHRs) from 1.3 million patients collected between January 1, 2009, and March 31, 2022. The data were extracted and processed by authorized staff at Cincinnati Children’s Hospital Medical Center (CCHMC) to create a static database, which was subsequently transferred to Oak Ridge National Laboratory for secure hosting. The Institutional Review Board waived the requirement to obtain informed consent from adult participants, parental permission from parents or guardians of child participants, and assent from children. The IRB also granted a waiver from the requirement to obtain authorization for the use and/or disclosure of protected health information (PHI).
In this study, we used a retrospective case-control study design, where each anxiety case (anxiety group) was matched to a control (non-anxiety group) by age at the time of the case’s diagnosis and sex assigned at birth. Anxiety cases were identified using ICD codes described in S1 Table. The date of anxiety onset was determined using the first instance of an anxiety ICD code. This resulted in the selection of 53,728 anxiety patients diagnosed between the ages of 2 and 21 from 2009 to 2022. We included patients aged 2–21 to encompass the full developmental span from early childhood through young adulthood, consistent with recent psychiatric epidemiology studies that examine youth mental health across broad developmental windows [3,36]. At least one visit in the 18 months prior to the diagnosis date was required for inclusion in the anxiety or non-anxiety group. For the non-anxiety group selection, the patient was required to be the same sex assigned at birth as the case, to be born within 30 days of the case, to not have developed anxiety at the time of the case’s anxiety diagnosis record, and to have had at least one encounter in the EHRs in the 18 months preceding the case’s anxiety diagnosis date. The matched case-control dataset was then stratified by single-year age groups from ages 2–21 based on the case’s age at diagnosis. The descriptive statistics for each age group are shown in Table 1.
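The control-selection criteria above can be illustrated with a minimal sketch. It is not the study's implementation: the record fields (`id`, `sex`, `birth`, `dx_date`, `first_anxiety`, `visits`), the first-eligible selection order, and the 548-day approximation of 18 months are all assumptions made for illustration.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 548  # assumed day count for "18 months"

def find_control(case, candidates, used_ids):
    """Return the first candidate meeting all matching criteria, or None."""
    window_start = case["dx_date"] - timedelta(days=LOOKBACK_DAYS)
    for cand in candidates:
        if cand["id"] in used_ids:
            continue  # each control is used at most once (assumption)
        same_sex = cand["sex"] == case["sex"]
        close_birth = abs((cand["birth"] - case["birth"]).days) <= 30
        # No anxiety code on or before the case's diagnosis date
        anxiety_free = (cand["first_anxiety"] is None
                        or cand["first_anxiety"] > case["dx_date"])
        recent_visit = any(window_start <= v < case["dx_date"]
                           for v in cand["visits"])
        if same_sex and close_birth and anxiety_free and recent_visit:
            return cand
    return None

# Hypothetical records, for illustration only
case = {"id": "c1", "sex": "F", "birth": date(2010, 1, 15),
        "dx_date": date(2018, 3, 1)}
candidates = [
    {"id": "p1", "sex": "M", "birth": date(2010, 1, 20),
     "first_anxiety": None, "visits": [date(2017, 6, 1)]},    # different sex
    {"id": "p2", "sex": "F", "birth": date(2010, 4, 1),
     "first_anxiety": None, "visits": [date(2017, 6, 1)]},    # born too far apart
    {"id": "p3", "sex": "F", "birth": date(2010, 2, 1),
     "first_anxiety": date(2016, 5, 1), "visits": [date(2017, 6, 1)]},  # prior anxiety
    {"id": "p4", "sex": "F", "birth": date(2010, 2, 1),
     "first_anxiety": None, "visits": [date(2017, 10, 5)]},   # eligible
]
match = find_control(case, candidates, used_ids=set())
```

In this toy example only `p4` satisfies every criterion, so it is returned as the control.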
Data preprocessing and feature engineering
Comprehensive patient histories were extracted from the CCHMC EHR. We then followed a series of preprocessing and feature engineering steps to create the final analytic tables to train the ML and DL models: 1) We extracted the static features that describe patient characteristics that do not change over time. 2) We collapsed time-dependent features into 30-day time-bins to capture the relevant EHR events during each time-bin. 3) We created an analytic file for the ML analyses that only included 30-day information prior to a recent 30-day blackout period to the time of diagnosis. 4) We created a time-dependent dataset for the DL analyses that included 30-day time-bins from birth to the time of the case’s diagnosis with a 30-day blackout period. 5) We appended the ABMH data to the analytic files created in steps three and four. 6) We split the analytic files into age-specific datasets for analysis. 7) We created train/test splits using patient ID.
We selected 30-day time bins to reflect clinically meaningful intervals commonly used in pediatric monitoring (e.g., follow-up visits, readmission risk). A 30-day blackout period prior to diagnosis was applied to reduce information leakage from diagnostic encounters themselves, ensuring that models learned predictive rather than diagnostic signals. Temporally bounded windows are widely used in predictive modeling with EHR data [37–39]. Our approach extends this precedent by discretizing the full patient history into 30-day intervals, enabling deep learning models to capture longitudinal developmental patterns while maintaining clinical interpretability.
Structured EHR data
The structured data includes information from two types of features: time-dependent (features that change over time) and static (features that do not change over time). The time-dependent features consist of diagnosis codes, procedure codes, medication codes, visit metadata (encounter type, provider type, place of service, care site, hospitalization), and measures (BMI, height, weight, blood pressure, heart rate). The static features include information set at birth, such as allergies, family mental health history, and patient demographics. Allergies are categorized as food, medication, and environmental, and family histories as psychiatric disorders, substance abuse, sexual/verbal abuse, autism, attention-deficit/hyperactivity disorder, and developmental disorder. For the structured EHR data, we use frequency encoding and replace missing values with −1. Detailed information on time-dependent and static features is discussed in S1 Appendix.
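A minimal sketch of frequency encoding with a −1 sentinel for missing values is given below. The text does not specify whether raw counts or relative frequencies were used, nor how categories unseen during fitting were handled; this sketch assumes relative frequencies and maps unseen categories to −1 as well. The function names and category labels are hypothetical.

```python
from collections import Counter

def fit_frequency_encoder(values):
    """Map each observed category to its relative frequency (missing = None is skipped)."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {category: n / len(observed) for category, n in counts.items()}

def encode(values, table, missing=-1):
    """Replace categories by their frequency; missing or unseen values become -1."""
    return [table.get(v, missing) for v in values]

# Hypothetical encounter-type column
train_column = ["office", "ed", "office", None, "inpatient", "office"]
table = fit_frequency_encoder(train_column)
encoded = encode(["office", None, "telehealth"], table)
```

Here `"office"` appears in 3 of the 5 observed values, so it is encoded as 0.6, while the missing entry and the unseen `"telehealth"` category both receive −1.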
Generation of time-dependent features
Fig 1 illustrates the utilization of time-dependent features. For these experiments, we used 30-day time-bins. The 30-day time-bin consolidates all EHR and ABMH data within each time-bin during the 30-day period. Time-bins extend from the time of birth to the time of the case’s diagnosis and include relevant EHR events that occurred during each 30-day window. Using the 30-day time-bins results in approximately 12 time-bins per year. We utilize this sequence of time-bins as the input feature for the DL-based models, while for the ML-based models, we use the final time-bin (xT) as the input feature. Table 1 summarizes the feature size and the number of time-bins of each age group.
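The time-bin construction can be sketched as follows, under stated assumptions: bins are counted in consecutive 30-day intervals from the date of birth, and any event within the 30 days before diagnosis is discarded as the blackout period. How the study aligned bin boundaries with the blackout window is not specified, so this alignment is an assumption, and the event codes are placeholders.

```python
from datetime import date, timedelta

BIN_DAYS = 30
BLACKOUT_DAYS = 30

def build_time_bins(birth, dx_date, events):
    """Group (date, code) events into 30-day bins from birth, excluding the blackout."""
    cutoff = dx_date - timedelta(days=BLACKOUT_DAYS)
    span_days = max(0, (cutoff - birth).days)
    n_bins = -(-span_days // BIN_DAYS)  # ceiling division
    bins = [[] for _ in range(n_bins)]
    for event_date, code in events:
        if birth <= event_date < cutoff:
            bins[(event_date - birth).days // BIN_DAYS].append(code)
    return bins

# Hypothetical patient with placeholder event codes
events = [
    (date(2020, 1, 10), "A"),
    (date(2020, 3, 15), "B"),
    (date(2020, 5, 20), "C"),
    (date(2020, 6, 15), "D"),  # falls inside the blackout and is dropped
]
sequence = build_time_bins(date(2020, 1, 1), date(2020, 6, 29), events)
x_T = sequence[-1]  # final bin: the only input to the ML models
```

The full `sequence` corresponds to the DL input, while `x_T` corresponds to the single cross-sectional input used by the ML models.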
Area-based Measures of Health (ABMH) data
Measuring ABMH across time using EHR data is a challenging task. This information is not consistently stored in EHR records and must be reconstructed using residential history information for each patient. In order to do this effectively, each residential address must be geocoded, assigned a relevant time window, and spatially joined to external sources of data that describe a patient’s community environment. We use the DeGAUSS package [40] to define a community environment at a single time point and then develop ABMH trajectories using a patient’s residential history.
We sought to capture the dynamic ABMH a patient might be exposed to throughout their life course. Therefore, ABMH features were considered to be time-dependent. This was done by constructing patient residential histories from residential address information captured in the CCHMC enterprise data warehouse (EDW). We extracted all known residential locations for each patient and assigned a start and stop date for each unique location. The date of visit corresponding to the residential location was used to construct the residential history of the patient from birth to the time of anxiety diagnosis. This required several assumptions: (1) the patient’s first observed address in the EHR record was also their address at the time of birth, and (2) the patient resided at their last known or current location for the full length of time between visits. DeGAUSS was used to geocode and link each patient’s residential location to a corresponding United States Census Tract (CT) at a single point in time. We accounted for changes in US Census Tract boundaries over time to ensure a correct characterization of the community a patient resided in at each point in time. The CT level measures used to characterize a patient’s ABMH over time can be found in S2 Table. Using a modified version of the time-bin code, we created environment measures for every 30-day time-bin used in the EHR data construction. If an individual moved during a 30-day time-bin, we created a weighted average of each environmental measure, with weights determined by the proportion of time spent at each location during the bin to select the most common address during the time. In addition to the structured features (EHR), 17 features were added to each time-bin as ABMH features (EHR+ABMH) to train the model.
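The time-weighted averaging for patients who moved within a bin can be illustrated with a short sketch. The interval representation (start inclusive, end exclusive), the function name, and the example values are assumptions; the study's modified time-bin code is not reproduced here.

```python
from datetime import date

def weighted_bin_measure(bin_start, bin_end, residences):
    """Time-weighted mean of a tract-level measure over [bin_start, bin_end).

    residences: list of (start_date, end_date, value) with end_date exclusive.
    Returns None when no residence overlaps the bin.
    """
    total_days = 0
    weighted_sum = 0.0
    for start, end, value in residences:
        overlap = (min(end, bin_end) - max(start, bin_start)).days
        if overlap > 0:
            total_days += overlap
            weighted_sum += overlap * value
    return weighted_sum / total_days if total_days else None

# Hypothetical move on 2020-03-11, one third of the way through the bin
residences = [
    (date(2019, 1, 1), date(2020, 3, 11), 0.2),  # tract measure at first address
    (date(2020, 3, 11), date(2021, 1, 1), 0.5),  # tract measure at second address
]
value = weighted_bin_measure(date(2020, 3, 1), date(2020, 3, 31), residences)
```

With 10 days at the first address and 20 days at the second, the bin value is (10 × 0.2 + 20 × 0.5) / 30 = 0.4.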
Classification models
ML models were implemented as cross-sectional classifiers using 30-day feature windows to evaluate short-term patterns preceding diagnosis, while DL models were designed as sequence models that utilized full longitudinal patient histories to capture temporal dependencies. These approaches were intended to provide complementary perspectives on predictive modeling rather than to serve as direct comparators.
Machine learning-based models
For each age group, we apply ML models such as Logistic Regression (LR), Decision Tree (DT), Random Forest (RF) [32], K-Nearest Neighbor (KNN), and Extreme Gradient Boosting (XGBoost) [33]. All models are widely used in classification tasks in various domains such as image recognition [41,42], natural language processing [43,44], and recommender systems [45,46]. For all ML-based models, we use the last 30-day time-bin, xT, from Fig 1, as an input feature to prevent the feature size from becoming too large. A larger feature size can lead to the curse of dimensionality [47], which degrades performance because the distances between data points become less meaningful. Detailed information on each model is introduced in S2 Appendix.
Deep learning-based models
For the DL-based models, we apply Long Short-Term Memory (LSTM) [35], Gated Recurrent Unit (GRU) [34], Reverse Time Attention (RETAIN) [18], and Diagnosis Prediction Model (Dipole) [19]. Unlike ML-based models, all time-bins (from Fig 1) are used as the input feature matrix. We use Recurrent Neural Network (RNN)-based models, which are especially effective with sequential information, to handle such long sequences of encounters. Table 1 shows the statistics of the features and time-bins used for each age group. Note that the feature size in Table 1 is the feature size of EHR. Detailed information on each model is introduced in S2 Appendix.
Experimental setting
Model training and hyperparameter tuning are conducted to generate accurate classifications of patients at risk of developing anxiety. For hyperparameter tuning, we split the dataset into 64% training, 16% validation, and 20% test sets. All models use the same setting and are compared with the same split across the age group. All ML models are implemented using the scikit-learn package [48], except XGBoost, which has its dedicated library [33]. The implementation of LSTM and GRU uses the PyTorch library [49]. We use the source code provided by the papers for RETAIN (implementation available at PyHealth RETAIN) and Dipole (implementation available at Dipole GitHub repository). We optimize the ML-based models through a grid search for the most effective hyperparameters using the scikit-learn package [48]. For the DL-based models, we perform fine-tuning using the Adam optimizer with a grid search. S3 Appendix summarizes the hyperparameters used for tuning and shows the selected hyperparameters for each model. Hyperparameters are selected using the EHR features, and the same hyperparameter settings are used throughout the experiment.
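A patient-level 64%/16%/20% split can be sketched as below. The shuffling procedure, the seed, and the function name are assumptions for illustration; the point is that the split is made over unique patient IDs, so that no patient contributes records to more than one partition.

```python
import random

def split_patient_ids(patient_ids, seed=0):
    """Split unique patient IDs into 64% train, 16% validation, 20% test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # fixed seed so every model sees the same split
    n_test = round(0.20 * len(ids))
    n_val = round(0.16 * len(ids))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Hypothetical patient identifiers
patient_ids = [f"patient_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_patient_ids(patient_ids)
```

Fixing the seed is one simple way to reuse the same partition across all models, as the comparison described above requires.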
This study was reported in accordance with the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement to ensure transparent and complete reporting of predictive modeling studies [50].
Evaluation metrics
We present six metrics for the pediatric anxiety classification task in this study: (1) accuracy, (2) area under the receiver operating characteristic curve (AUROC), (3) area under the precision-recall curve (AUPRC), (4) positive predictive value (PPV), (5) negative predictive value (NPV), and (6) F1 score. The AUROC is a common evaluation tool that quantifies overall classification performance. The AUROC scale is from 0 to 1, with a value of 0.5 indicating an uninformative classifier. Higher AUROC scores typically signify better performance. While AUROC measures the area under the curve of the true positive rate plotted against the false positive rate, AUPRC measures the area under the curve of precision plotted against recall. AUROC is largely insensitive to class imbalance, whereas AUPRC is better suited to imbalanced datasets. The F1 score is the harmonic mean of the precision and recall of a classification model. It ranges between 0 and 1, and models with values closer to 1 are better models. PPV and NPV represent the accuracy of a diagnostic test in identifying true positive and true negative results, respectively. Predictive values indicate the likelihood that a specific diagnosis given by a test is accurate for a subject. The following equation computes PPV:
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
where TP is the true positive (the number of patients correctly identified as anxiety), and FP is the false positive (the number of patients incorrectly identified as anxiety). NPV can be computed with:
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
where TN is the true negative (the number of patients correctly identified as non-anxiety), and FN is the false negative (the number of patients incorrectly identified as non-anxiety).
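As a worked illustration of the threshold-based metrics, the sketch below computes PPV, NPV, F1 score, and accuracy from the four confusion-matrix counts. The counts are made up for the example, and the function does not guard against empty denominators.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute PPV, NPV, recall, F1, and accuracy from confusion-matrix counts."""
    ppv = tp / (tp + fp)        # precision: TP / (TP + FP)
    npv = tn / (tn + fn)        # TN / (TN + FN)
    recall = tp / (tp + fn)     # sensitivity: TP / (TP + FN)
    f1 = 2 * ppv * recall / (ppv + recall)  # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"ppv": ppv, "npv": npv, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for a balanced test set of 200 patients
metrics = confusion_metrics(tp=80, fp=20, tn=70, fn=30)
```

For these counts, PPV is 80/100 = 0.80, NPV is 70/100 = 0.70, accuracy is 150/200 = 0.75, and F1 is 160/210, or about 0.762.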
Empirical results
We compare the performance of ML- and DL-based models using two datasets: (1) structured features generated from EHR data (EHR) and (2) the combination of structured EHR and ABMH data (EHR+ABMH). For the ML models, we utilize only the final time-bin, which covers the 30 days preceding the 30-day blackout period before diagnosis, whereas for the DL models, we incorporate all of the time-bins. We quantified the uncertainty of each performance estimate using 1000 bootstrapping iterations. Fig 2 and Fig 3 display the mean of 1000 bootstrapping iterations at each data point, while the colored region represents the 95% confidence interval (CI). The CIs are tight and are therefore not visible in some places on the figures. This section focuses exclusively on the AUROC score and PPV. S4 Appendix provides results of additional evaluation metrics, such as accuracy, AUPRC, NPV, and F1 score of all ML and DL models with two datasets across different age groups.
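The bootstrap procedure can be sketched as follows for AUROC. The text does not state how the 95% interval was derived from the bootstrap replicates, so the percentile method used here is an assumption, as are the function names and the toy labels and scores. AUROC is computed directly from its pairwise definition, which is adequate for a small example but slow for large test sets.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc(labels, scores, n_boot=1000, seed=0):
    """Return (mean, lower, upper) of AUROC over bootstrap resamples of the test set."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample contains one class only
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    mean = sum(stats) / n_boot
    lower = stats[int(0.025 * n_boot)]       # approximate 2.5th percentile
    upper = stats[int(0.975 * n_boot) - 1]   # approximate 97.5th percentile
    return mean, lower, upper

# Toy test-set labels and predicted scores (illustrative only)
labels = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.3, 0.35, 0.4, 0.45, 0.8, 0.5, 0.7, 0.2, 0.9, 0.6, 0.55]
mean_auc, ci_low, ci_high = bootstrap_auroc(labels, scores)
```

The width of the resulting interval shrinks as the test set grows, which is consistent with the wider intervals reported for the smallest age groups.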
Machine learning-based models
Fig 2 displays the outcomes of ML models in terms of Area Under the Receiver Operating Characteristic (AUROC) score and Positive Predictive Value (PPV) for all age groups. Fig 2(a) and (c) employ the EHR features, while Fig 2(b) and (d) showcase outcomes utilizing EHR+ABMH features. Fig 2(a) and (b) display the AUROC score, whereas Fig 2(c) and (d) display the PPV for each model across the age groups.
According to the results shown in Fig 2(a) and (b), XGBoost outperforms the other ML models in terms of AUROC score in most of the age groups. RF demonstrates the second-best performance, while KNN performs worse than all other models. RF surpasses XGBoost in AUROC score in age groups 2 and 21, which have the lowest number of patients. Nevertheless, Fig 2(c) and (d) demonstrate that LR and DT exhibit superior PPV compared to RF. This indicates that LR and DT have a higher precision but a lower predictive performance than the other two models. The findings generally indicate that XGBoost performs better than other models in terms of AUROC score and PPV, while KNN performs worse than other models.
Deep learning-based models
Fig 3 displays the outcomes of DL models in terms of AUROC score and PPV for all age groups. Fig 3(a) and (c) employ the EHR features, while Fig 3(b) and (d) display the outcomes utilizing EHR+ABMH features in AUROC score and PPV, respectively.
Fig 3(a) demonstrates that either RETAIN or Dipole achieves a higher AUROC score than the other DL models in most of the age groups. The results utilizing EHR+ABMH features, however, display a distinct pattern, unlike the ML results, which are nearly identical for the two feature sets. Although most models show a consistent pattern when using EHR features, performance with EHR+ABMH features declines in certain age groups (namely, the GRU results for age group 10 and the RETAIN results for age group 17). Furthermore, although RETAIN demonstrates high performance in AUROC score when utilizing EHR features, it does not achieve the same performance with EHR+ABMH features. Fig 3(b) demonstrates that LSTM yields more consistent results but does not surpass other models. In both datasets, the RETAIN and Dipole models have superior AUROC scores, surpassing all other models in most age groups.
The PPV results exhibit greater complexity, as depicted in Fig 3(c) and (d). According to the data, Dipole does not demonstrate the highest PPV across the age groups. On the contrary, RETAIN demonstrates superior PPV in about 50% of the age groups, although the AUROC score is lower than Dipole in most age groups. From this, we infer that the findings of RETAIN exhibit higher precision but a poorer predictive performance than Dipole. For some age groups, specifically 2, 3, 20, and 21, the 95% confidence interval (CI) is substantial. This means the data exhibits significant variability, which could impact the analysis. It also shows that the patient population needs to be increased to train the model adequately. For instance, for the results utilizing EHR features (Fig 3(c)), age group 2 consists of 930 patients, resulting in a CI of 0.0026 for the PPV of the RETAIN model. On the other hand, age group 16 includes 11,150 patients and a CI of 0.00088. This indicates that the size of the dataset is also essential to train a more effective model.
Developmental subgroup analyses
To contextualize age-specific results, we summarized model performance across three developmental periods: preschool (ages 2–5), school-age (6–12), and adolescence (13–21), aggregating the single-year estimates reported in Figs 2 and 3. For ML models, XGBoost generally achieved the highest AUROC across subgroups, with RF competitive at the youngest and oldest edges (e.g., age 21 AUROC 0.845 with EHR and 0.848 with EHR+ABMH). For DL models, RETAIN and Dipole consistently yielded the strongest AUROC across all developmental periods under EHR, with Dipole exhibiting relatively stable performance into adolescence (e.g., ages 13–17 with EHR: 0.848, 0.854, 0.829, 0.845, 0.819). PPV was more variable: RETAIN often showed higher precision in preschool and school-age (e.g., EHR PPV from 0.834 to 0.838 at ages 4–8), whereas Dipole was frequently competitive in adolescence under EHR+ABMH. Comparing EHR+ABMH to EHR, ABMH features produced modest and non-uniform changes across ages and models (e.g., Dipole age 11 AUROC increased from 0.842 to 0.856, age 17 decreased from 0.819 to 0.813; RETAIN age 2 decreased from 0.833 to 0.765, age 8 increased from 0.851 to 0.853). Overall, these patterns indicate that developmental stage influences both discrimination and precision, with DL architectures (RETAIN/Dipole) generally outperforming other models and XGBoost providing the strongest ML baseline.
Discussion
Our study demonstrates the considerable potential of ML- and DL-based models to classify anxiety disorders in pediatric patients using structured features generated from EHR data and the combination of structured EHR and Area-based measures of health (ABMH) data. The use of ML models can leverage the predictive ability of structured features, such as diagnosis codes, demographics, and risk assessments, to classify pediatric mental health [14,51,52]. Many ML classification models, such as decision trees [53,54], support vector machines [55,56], and random forests [57,58] have previously been successfully used to classify anxiety disorders using structured features.
Unlike typical machine learning or ‘shallow’ learning approaches, deep learning employs artificial neural networks inspired by the structure and operation of the brain. Especially for time-dependent features, recurrent neural networks (RNNs) are introduced to handle sequential information to offer insights into disease development over time [59]. Many variants of RNN models have been proposed, for example, Long Short-Term Memory (LSTM) networks, and these models have proven highly effective in assessing time-series data, including patient visit timelines, symptom development, or treatment history [12,13,18,19,60]. These models can accurately represent the complex temporal patterns involved in the emergence and progression of anxiety disorders in pediatric patients. The ability to predict future symptoms based on previous data enables timely intervention, leading to substantial improvements in patient outcomes.
Attention-based interpretability in RETAIN [18] provides clinicians with insights into which temporal events or clinical features most strongly influence predicted anxiety risk. For example, elevated attention weights assigned to medication changes, visit frequency, or diagnostic transitions could be highlighted in the patient notes or extracted and displayed on a clinical dashboard. Drawing the physician’s attention to the factors that increase risk during clinic visits could inform earlier clinical intervention and/or patient monitoring. Such interpretability may enhance the practical adoption of these models in clinical settings by providing transparent and explainable outputs that align with established medical reasoning processes.
ABMH data provide information about a child’s social and physical environment, which can be incorporated into ML and DL models [61–63]. ABMH data contain residential history and environmental information to help understand pediatric anxiety. Incorporating these determinants into ML and DL models may improve the predictive power of anxiety classification models and help address health disparities [64,65]. However, when constructing the ABMH data, we assumed that patients resided at the same location between clinic visits and that any address change took effect on the date of the visit at which it was recorded. This assumption is a potential source of bias, as patients may move between visits, which may artificially affect the impact of the ABMH measures in our models.
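The residential-history assumption described above amounts to carrying the last recorded address forward until the next visit. A minimal sketch of that lookup follows, assuming a hypothetical per-patient list of (visit date, area identifier) pairs sorted by date; the function name and identifiers are illustrative and are not taken from the study's pipeline.

```python
from bisect import bisect_right
from datetime import date

def area_on(history, query_date):
    """Return the area recorded at the most recent visit on or before query_date.

    history: list of (visit_date, area_id) tuples sorted by visit_date.
    Returns None if query_date precedes the first recorded visit.
    """
    visit_dates = [d for d, _ in history]
    i = bisect_right(visit_dates, query_date)  # count of visits on or before query_date
    if i == 0:
        return None
    return history[i - 1][1]

# Hypothetical patient who was recorded at a new address on 2019-06-01.
history = [(date(2018, 1, 10), "tract_A"), (date(2019, 6, 1), "tract_B")]
print(area_on(history, date(2019, 5, 31)))  # tract_A
print(area_on(history, date(2019, 6, 1)))   # tract_B
```

Any move that actually happened between the two visits would be attributed to the later visit date, which is the source of the potential bias noted above.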
Anxiety disorders develop differently across age groups and also depend on the social and physical environment of each individual. For example, preschool-aged children may develop anxiety disorders when they are separated from their primary caregivers or experience traumatic separation-related events [66], school-aged children may develop anxiety disorders in response to aspects of their school environment [67], and adolescents may develop anxiety disorders through psychological processes that may be perpetuated or modeled in certain family environments [5]. Understanding these age-specific manifestations is essential for accurate diagnosis and appropriate treatment.
The finding that ABMH features provided little additional predictive value likely reflects two factors. First, ABMH variables often exert their influence indirectly, shaping long-term developmental trajectories rather than immediately preceding diagnostic encounters. A retrospective case-control design anchored to diagnosis dates is less suited to capture these gradual, upstream effects. Second, some ABMH constructs overlap with information already captured in structured EHR data (e.g., race/ethnicity, visit patterns, or family history), which may reduce their unique contribution. Additionally, model performance varied by age, reflecting differences in developmental pathways to anxiety and in how clinicians document and diagnose anxiety across developmental stages. Together, these considerations suggest that ABMH features may prove more informative in prospective or longitudinal prediction frameworks designed to capture developmental processes over time. In addition, the minimal performance gain observed from ABMH features may reflect limited variability and temporal resolution in area-level measures relative to individual-level EHR data. While area-based data provide valuable contextual information, their integration with clinical records may require more granular, temporally aligned, or prospective modeling approaches to capture environmental influences on pediatric mental health more accurately. Furthermore, assumptions made during the construction of residential histories, such as assuming patients remained at the same address between visits and using visit dates to infer transitions, may introduce noise into ABMH linkage, reducing its predictive value in retrospective analyses.
This study has several limitations. First, this study employed a retrospective case-control design, which carries inherent limitations. Structured EHR data are not collected for research purposes. As a result, we do not have complete follow-up information on all individuals in the study, and the observations that we do have are tied to a medical event that required care. While a prospective design would allow us to collect information systematically across a patient cohort and could be designed to capture measures of anxiety that are not tied to medical care, that type of study is expensive and takes years to conduct. Therefore, we believe the benefits of the retrospective design outweigh its limitations. In addition to limitations in study design, EHR data are also subject to biases that can affect the interpretation of results and reproducibility outside of the CCHMC ecosystem. For example, variability in ICD coding practices and patterns of missingness in measurements are determined by the EHR system and common practices at each medical institution. Such biases may affect model performance and limit portability across institutions. Future work using prospective cohorts and external validation datasets is needed to confirm generalizability. Second, comparators were drawn from a non-anxiety hospital group rather than healthy community controls, which may reduce generalizability to the broader pediatric population. Third, the construction of ABMH features required assumptions about residential stability between visits, which may introduce bias if patients moved between visits. Fourth, our modeling framework implemented both cross-sectional ML and longitudinal DL models to explore predictive value across different temporal scales. These approaches were evaluated in parallel to provide complementary insights rather than to serve as direct comparators. While this approach does not allow us to directly examine temporal modeling benefits, it does allow us to compare the performance of ML models for short-term predictions and DL models for long-term predictions. The differences in input scope and model structure should be considered when interpreting their respective performance characteristics. Finally, model performance was evaluated only on internal splits (64% training, 16% validation, 20% test). The absence of external validation limits the ability to assess generalizability, and future studies should evaluate these models on independent datasets.
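The 64%/16%/20% proportions are what results from holding out 20% of the data for testing and then holding out 20% of the remainder for validation (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). The sketch below shows one way to produce such a split. It is an assumption-laden illustration, not the study's procedure: the function name, the seed, and the choice to split plain shuffled indices (rather than, for example, stratifying by outcome) are all hypothetical.

```python
import random

def split_indices(n, test_frac=0.20, val_frac_of_rest=0.20, seed=0):
    """Shuffle n indices and split them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_test = round(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```

If a patient can contribute more than one record, the indices should identify patients rather than records so that no patient appears in more than one partition.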
Conclusion
The findings of this study affirm that ML and DL models can identify age-stratified pediatric patients at high risk of anxiety onset, although parametric logistic regression (LR) models performed at least as well as other ML-based models (DT, RF, KNN, XGBoost). The consistency and predictive strength of the models have substantial implications for enhancing clinical decision-making and patient outcomes. Integrating such models into healthcare practices promises a shift toward more efficient, data-driven, and personalized care. Future efforts should focus on customizing these models for diverse patient cohorts to maximize their utility in real-world settings and on comparing the age-stratified models to an all-age model. This study underscores the potential of a data-driven methodology to streamline early detection and support a shift in pediatric mental healthcare practices.
Supporting information
S1 Table. List of ICD codes to determine anxiety patients.
https://doi.org/10.1371/journal.pone.0324673.s001
(PDF)
S2 Table. Area-based Measures of Health (ABMH) data description.
https://doi.org/10.1371/journal.pone.0324673.s002
(PDF)
S1 Appendix. Structured EHR Data.
Provides detailed information about the time-dependent and static features in the dataset.
https://doi.org/10.1371/journal.pone.0324673.s003
(PDF)
S2 Appendix. Classification Models.
Describes the specifics of the ML- and DL-based classification models applied.
https://doi.org/10.1371/journal.pone.0324673.s004
(PDF)
S3 Appendix. Hyperparameter Tuning.
Describes the hyperparameter tuning procedure and the hyperparameters selected for the ML- and DL-based models.
https://doi.org/10.1371/journal.pone.0324673.s005
(PDF)
S4 Appendix. Additional Results.
Presents additional results using further evaluation metrics (accuracy, NPV, F1 score, and AUPRC) for both EHR and EHR+ABMH features of the ML- and DL-based models.
https://doi.org/10.1371/journal.pone.0324673.s006
(PDF)
References
- 1. O’Connor E, Henninger M, Perdue LA, Coppola EL, Thomas R, Gaynes BN. Screening for depression, anxiety, and suicide risk in adults: A systematic evidence review for the US preventive services task force. Agency for Healthcare Research and Quality (US). 2023.
- 2. Kessler RC, Berglund P, Demler O, Jin R, Merikangas KR, Walters EE. Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Arch Gen Psychiatry. 2005;62(6):593–602. pmid:15939837
- 3. Xiang AH, Martinez MP, Chow T, Carter SA, Negriff S, Velasquez B, et al. Depression and Anxiety Among US Children and Young Adults. JAMA Netw Open. 2024;7(10):e2436906. pmid:39352699
- 4. Beesdo K, Pine DS, Lieb R, Wittchen H-U. Incidence and risk patterns of anxiety and depressive disorders and categorization of generalized anxiety disorder. Arch Gen Psychiatry. 2010;67(1):47–57. pmid:20048222
- 5. Warner EN, Ammerman RT, Glauser TA, Pestian JP, Agasthya G, Strawn JR. Developmental Epidemiology of Pediatric Anxiety Disorders. Child Adolesc Psychiatr Clin N Am. 2023;32(3):511–30. pmid:37201964
- 6. Walkup JT, Green CM, Strawn JR. Screening for Pediatric Anxiety Disorders. JAMA. 2022;328(14):1399–401. pmid:36219415
- 7. Woodward LJ, Fergusson DM. Life course outcomes of young people with anxiety disorders in adolescence. J Am Acad Child Adolesc Psychiatry. 2001;40(9):1086–93. pmid:11556633
- 8. Racine N, McArthur BA, Cooke JE, Eirich R, Zhu J, Madigan S. Global Prevalence of Depressive and Anxiety Symptoms in Children and Adolescents During COVID-19: A Meta-analysis. JAMA Pediatr. 2021;175(11):1142–50. pmid:34369987
- 9. Zablotsky B, Ng A. Mental health treatment among children aged 5-17 years: United States, 2021. NCHS Data Brief. 2023.
- 10. Kariotis TC, Prictor M, Chang S, Gray K. Impact of Electronic Health Records on Information Practices in Mental Health Contexts: Scoping Review. J Med Internet Res. 2022;24(5):e30405. pmid:35507393
- 11. Fusar-Poli P, Correll CU, Arango C, Berk M, Patel V, Ioannidis JPA. Preventive psychiatry: a blueprint for improving the mental health of young people. World Psychiatry. 2021;20(2):200–21. pmid:34002494
- 12. Goenka N, Tiwari S. Deep learning for Alzheimer prediction using brain biomarkers. Artif Intell Rev. 2021;54(7):4827–71.
- 13. Suresha PB, Wang Y, Xiao C, Glass L, Yuan Y, Clifford GD. A deep learning approach for classifying nonalcoholic steatohepatitis patients from nonalcoholic fatty liver disease patients using electronic medical records. Explainable AI in Healthcare and Medicine: Building a Culture of Transparency and Accountability. 2021. p. 107–13. https://doi.org/10.1007/978-3-030-53352-6_10
- 14. Garriga R, Mas J, Abraha S, Nolan J, Harrison O, Tadros G, et al. Machine learning model to predict mental health crises from electronic health records. Nat Med. 2022;28(6):1240–8. pmid:35577964
- 15. Xie F, Yuan H, Ning Y, Ong MEH, Feng M, Hsu W, et al. Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies. J Biomed Inform. 2022;126:103980. pmid:34974189
- 16. Holmes JH, Beinlich J, Boland MR, Bowles KH, Chen Y, Cook TS, et al. Why is the electronic health record so challenging for research and clinical care?. Methods of Information in Medicine. 2021;60(01/02):032–48.
- 17. Lee DS, Stitt A, Austin PC, Stukel TA, Schull MJ, Chong A, et al. Prediction of heart failure mortality in emergent care: a cohort study. Ann Intern Med. 2012;156(11):767–75, W-261, W-262. pmid:22665814
- 18. Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Advances in Neural Information Processing Systems. 2016;29.
- 19. Ma F, Chitta R, Zhou J, You Q, Sun T, Gao J. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017. 1903–11. https://doi.org/10.1145/3097983.3098088
- 20. Karim AAJ, Asad KHM, Alam MGR. Larger models yield better results? Streamlined severity classification of ADHD-related concerns using BERT-based knowledge distillation. PLoS One. 2025;20(2):e0315829. pmid:39913350
- 21. Tu Z, Zhang J, Li H, Chen Y, Yuan J. Joint-Bone Fusion Graph Convolutional Network for Semi-Supervised Skeleton Action Recognition. IEEE Trans Multimedia. 2023;25:1819–31.
- 22. Beesdo-Baum K, Knappe S. Developmental epidemiology of anxiety disorders. Child Adolesc Psychiatr Clin N Am. 2012;21(3):457–78. pmid:22800989
- 23. Warner EN, Strawn JR. Risk Factors for Pediatric Anxiety Disorders. Child Adolesc Psychiatr Clin N Am. 2023;32(3):485–510. pmid:37201963
- 24. Brokamp C, Strawn JR, Beck AF, Ryan P. Pediatric Psychiatric Emergency Department Utilization and Fine Particulate Matter: A Case-Crossover Study. Environ Health Perspect. 2019;127(9):97006. pmid:31553231
- 25. Zundel CG, Ryan P, Brokamp C, Heeter A, Huang Y, Strawn JR, et al. Air pollution, depressive and anxiety disorders, and brain effects: A systematic review. Neurotoxicology. 2022;93:272–300. pmid:36280190
- 26. Compton SN, Peris TS, Almirall D, Birmaher B, Sherrill J, Kendall PC, et al. Predictors and moderators of treatment response in childhood anxiety disorders: results from the CAMS trial. J Consult Clin Psychol. 2014;82(2):212–24. pmid:24417601
- 27. Ginsburg GS, Becker EM, Keeton CP, Sakolsky D, Piacentini J, Albano AM, et al. Naturalistic follow-up of youths treated for pediatric anxiety disorders. JAMA Psychiatry. 2014;71(3):310–8. pmid:24477837
- 28. Wang S, Geng F, Gu M, Gu J, Shi Y, Yang Y, et al. Network analysis of childhood maltreatment and internet addiction in adolescents with major depressive disorder. BMC Psychiatry. 2024;24(1):768. pmid:39501224
- 29. He Z, Pfaff E, Guo SJ, Guo Y, Wu Y, Tao C, et al. Enriching Real-world Data with Social Determinants of Health for Health Outcomes and Health Equity: Successes, Challenges, and Opportunities. Yearb Med Inform. 2023;32(1):253–63. pmid:38147867
- 30. Guevara M, Chen S, Thomas S, Chaunzwa TL, Franco I, Kann BH, et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit Med. 2024;7(1):6. pmid:38200151
- 31. Foryciarz A, Gladish N, Rehkopf DH, Rose S. Incorporating area-level social drivers of health in predictive algorithms using electronic health record data. J Am Med Inform Assoc. 2025;32(3):595–601. pmid:39832294
- 32. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
- 33. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 785–94. https://doi.org/10.1145/2939672.2939785
- 34. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint. 2014.
- 35. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 36. Elia J, Pajer K, Prasad R, Pumariega A, Maltenfort M, Utidjian L, et al. Electronic health records identify timely trends in childhood mental health conditions. Child Adolesc Psychiatry Ment Health. 2023;17(1):107. pmid:37710303
- 37. Nagamine T, Gillette B, Kahoun J, Burghaus R, Lippert J, Saxena M. Data-driven identification of heart failure disease states and progression pathways using electronic health records. Sci Rep. 2022;12(1):17871. pmid:36284167
- 38. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci Rep. 2016;6:26094. pmid:27185194
- 39. Lin SC, Jha AK, Adler-Milstein J. Electronic Health Records Associated With Lower Hospital Mortality After Systems Have Time To Mature. Health Aff (Millwood). 2018;37(7):1128–35. pmid:29985687
- 40. Brokamp C, Wolfe C, Lingren T, Harley J, Ryan P. Decentralized and reproducible geocoding and characterization of community and environmental exposures for multisite studies. J Am Med Inform Assoc. 2018;25(3):309–14. pmid:29126118
- 41. Mishra NK, Celebi ME. An overview of melanoma detection in dermoscopy images using image processing and machine learning. 2016.
- 42. Sonka M, Hlavac V, Boyle R. Image processing, analysis and machine vision. Springer. 2013.
- 43. Le Glaz A, Haralambous Y, Kim-Dufor D-H, Lenca P, Billot R, Ryan TC, et al. Machine Learning and Natural Language Processing in Mental Health: Systematic Review. J Med Internet Res. 2021;23(5):e15708. pmid:33944788
- 44. Powers DM, Turk CC. Machine learning of natural language. Springer Science & Business Media. 2012.
- 45. Portugal I, Alencar P, Cowan D. The use of machine learning algorithms in recommender systems: A systematic review. Expert Systems with Applications. 2018;97:205–27.
- 46. Khanal SS, Prasad PWC, Alsadoon A, Maag A. A systematic review: machine learning based recommendation systems for e-learning. Educ Inf Technol. 2019;25(4):2635–64.
- 47. Bellman R. Dynamic programming. Science. 1966;153(3731):34–7. pmid:17730601
- 48. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research. 2011;12:2825–30.
- 49. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.
- 50. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement. Br J Surg. 2015;102(3):148–58. pmid:25627261
- 51. Muhammad A, Ashjan B, Ghufran M, Taghreed S, Nada A, Nada A, et al. Classification of Anxiety Disorders using Machine Learning Methods: A Literature Review. Insights Biomed Res. 2020;4(1).
- 52. Kotsilieris T, Pintelas E, Livieris I, Pintelas P. Reviewing machine learning techniques for predicting anxiety disorders. TR01-18. University of Patras. 2018.
- 53. Carpenter KLH, Sprechmann P, Calderbank R, Sapiro G, Egger HL. Quantifying Risk for Anxiety Disorders in Preschool Children: A Machine Learning Approach. PLoS One. 2016;11(11):e0165524. pmid:27880812
- 54. Wy S, Choe S, Lee YJ, Bak E, Jang M, Lee SC, et al. Decision Tree Algorithm-Based Prediction of Vulnerability to Depressive and Anxiety Symptoms in Caregivers of Children With Glaucoma. Am J Ophthalmol. 2022;239:90–7. pmid:35172169
- 55. Yang J, Chen Y, Yao G, Wang Z, Fu X, Tian Y, et al. Key factors selection on adolescents with non-suicidal self-injury: A support vector machine-based approach. Front Public Health. 2022;10:1049069. pmid:36438278
- 56. Sumathi Ms, B. Dr. Prediction of Mental Health Problems Among Children Using Machine Learning Techniques. IJACSA. 2016;7(1).
- 57. Chavanne AV, Paillère Martinot ML, Penttilä J, Grimmer Y, Conrod P, Stringaris A, et al. Anxiety onset in adolescents: a machine-learning prediction. Mol Psychiatry. 2023;28(2):639–46. pmid:36481929
- 58. Li G, Jiang L. Random Forest Algorithm-based Modelling and Neural Network Analysis Between Social Anxiety Disorder of Childhood and Parents’ Socioeconomic Attributes. In: 2023 IEEE 6th Eurasian Conference on Educational Innovation (ECEI), 2023. 222–5. https://doi.org/10.1109/ecei57668.2023.10105416
- 59. Chen R, Stewart WF, Sun J, Ng K, Yan X. Recurrent Neural Networks for Early Detection of Heart Failure From Longitudinal Electronic Health Record Data: Implications for Temporal Modeling With Respect to Time Before Diagnosis, Data Density, Data Quantity, and Data Type. Circ Cardiovasc Qual Outcomes. 2019;12(10):e005114. pmid:31610714
- 60. Penchina B, Sundaresan A, Cheong S, Martel A. Deep LSTM Recurrent Neural Network for Anxiety Classification from EEG in Adolescents with Autism. Lecture Notes in Computer Science. Springer International Publishing. 2020. p. 227–38. https://doi.org/10.1007/978-3-030-59277-6_21
- 61. de Lacy N, Ramshaw M. Predicting the onset of internalizing disorders in early adolescence using deep learning optimized with AI. medRxiv. 2023.
- 62. Siddiqui H, Rattani A, Woods NK, Cure L, Lewis RK, Twomey J, et al. A Survey on Machine and Deep Learning Models for Childhood and Adolescent Obesity. IEEE Access. 2021;9:157337–60.
- 63. Rothenberg WA, Bizzego A, Esposito G, Lansford JE, Al-Hassan SM, Bacchini D, et al. Predicting Adolescent Mental Health Outcomes Across Cultures: A Machine Learning Approach. J Youth Adolesc. 2023;52(8):1595–619. pmid:37074622
- 64. Alegría M, NeMoyer A, Falgàs Bagué I, Wang Y, Alvarez K. Social Determinants of Mental Health: Where We Are and Where We Need to Go. Curr Psychiatry Rep. 2018;20(11):95. pmid:30221308
- 65. Deferio JJ, Breitinger S, Khullar D, Sheth A, Pathak J. Social determinants of health in mental health care and research: a case for greater inclusion. J Am Med Inform Assoc. 2019;26(8–9):895–9. pmid:31329877
- 66. Egger HL, Angold A. Common emotional and behavioral disorders in preschool children: presentation, nosology, and epidemiology. J Child Psychol Psychiatry. 2006;47(3–4):313–37. pmid:16492262
- 67. Beesdo K, Knappe S, Pine DS. Anxiety and anxiety disorders in children and adolescents: developmental issues and implications for DSM-V. Psychiatr Clin North Am. 2009;32(3):483–524. pmid:19716988