Abstract
As clinical understanding of pediatric Post-Acute Sequelae of SARS-CoV-2 (PASC) develops, and hence the clinical definition evolves, it is desirable to have a method to reliably identify patients who are likely to have PASC in health systems data. In this study, we developed and validated a machine learning algorithm to classify which patients have PASC (distinguishing between Multisystem Inflammatory Syndrome in Children (MIS-C) and non-MIS-C variants) from a cohort of patients with positive SARS-CoV-2 test results in pediatric health systems within the PEDSnet EHR network. Patient features included in the model were selected from conditions, procedures, performance of diagnostic testing, and medications using a tree-based scan statistic approach. We used an XGBoost model, with hyperparameters selected through cross-validated grid search, and model performance was assessed using 5-fold cross-validation. Model predictions and feature importance were evaluated using SHapley Additive exPlanation (SHAP) values. The model provides a tool for identifying patients with PASC and an approach to characterizing PASC using diagnosis, medication, laboratory, and procedure features in health systems data. Using appropriate threshold settings, the model can be used to identify PASC patients in health systems data at higher precision for inclusion in studies or at higher recall in screening for clinical trials, especially in settings where PASC diagnosis codes are used less frequently or less reliably. Analysis of how specific features contribute to the classification process may assist in gaining a better understanding of features that are associated with PASC diagnoses.
Citation: Lorman V, Razzaghi H, Song X, Morse K, Utidjian L, Allen AJ, et al. (2023) A machine learning-based phenotype for long COVID in children: An EHR-based study from the RECOVER program. PLoS ONE 18(8): e0289774. https://doi.org/10.1371/journal.pone.0289774
Editor: Zhe He, Florida State University, UNITED STATES
Received: February 24, 2023; Accepted: July 25, 2023; Published: August 10, 2023
Copyright: © 2023 Lorman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The results reported here are based on detailed individual-level patient data compiled as part of the RECOVER program. Due to the high risk of reidentification based on the number of unique patterns in the date, patient privacy regulations prohibit us from releasing the data publicly. The data are maintained in a secure enclave, with access managed by the program coordinating center to remain compliant with regulatory and program requirements. Please direct requests to access the data, either for reproduction of the work reported here or for other purposes, to recover@chop.edu.
Funding: This research was funded by the National Institutes of Health (NIH) Agreement OT2HL161847-01 as part of the Researching COVID to Enhance Recovery (RECOVER) program of research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: Dr. Rao reports prior grant support from GSK and Biofire and is a consultant for Seqirus. Dr. Jhaveri is a consultant for AstraZeneca, Seqirus, and Dynavax; receives an editorial stipend from Elsevier and the Pediatric Infectious Diseases Society; and receives royalties from UpToDate/Wolters Kluwer. Dr. Lee serves on the PASC Advisory Board for United Health Group. Dr. Bailey has received grants from the Patient-Centered Outcomes Research Institute. All other authors have nothing to disclose. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Introduction
While long-term consequences of SARS-CoV-2 infection have been studied from the perspectives of both clinical manifestations and underlying mechanisms [1, 2], formal definitions for Post-Acute Sequelae of SARS-CoV-2 (PASC, or long COVID) are currently broad and necessarily nonspecific. Although advances have been made in describing features that characterize PASC, studies to date have suggested a heterogeneous presentation, particularly in children [3, 4]. This heterogeneity has made it difficult to form a definition for predicting or classifying children with PASC in the absence of patient-specific expert review, resulting in challenges in obtaining “gold standard” labels for cases and controls, which consequently poses a challenge to conducting large-scale research towards improving patient outcomes.
To address this gap, researchers have developed machine learning algorithms intended to classify adult patients with PASC in large clinical databases [5]. In contrast with explicit rule-based definitions, machine learning algorithms have the advantage of being able to detect complex patterns involving thousands of covariates [6].
Diagnosing PASC in the pediatric population is particularly challenging due to large differences in clinical manifestations across the age spectrum, compounded by a paucity of pediatric research to date. Multisystem Inflammatory Syndrome in Children (MIS-C), a clinically severe illness which follows SARS-CoV-2 infection and satisfies the current time-based definition for PASC, is considered a distinct entity in practice and does have a case definition, which involves laboratory evidence of inflammation and involvement of at least two organ systems among SARS-CoV-2-positive patients [7]. However, a fuller understanding of the subphenotypes of MIS-C, including the mechanisms by which they develop and their longer-term trajectories, is still emerging [8, 9]. In the case of non-MIS-C PASC, the variety of presentations, variability of methods for diagnosis, and range of treatment modalities present an even greater challenge in identifying such patients from EHR data [1–4, 10, 11].
The goal of this study is to implement and validate a machine learning model to classify patients with a PASC diagnosis (including both MIS-C and non-MIS-C variants). Our model was trained to classify patients with a PASC diagnosis code in a cohort of SARS-CoV-2-positive patients, with the model’s utility found in its ability to detect patients who are likely to have had PASC based on a large collection of clinical features in settings where the diagnosis code was not reliably present. To define patient features for use in the model from large hierarchical vocabularies of clinical codes, we employed the tree-based scan statistic [12], a data mining tool previously used for pharmacovigilance, vaccine safety surveillance, and occupational disease surveillance. In our context, the tree-based scan statistic was used to detect clusters of diagnosis, medication, procedure, and diagnostic test codes which occur disproportionately often among PASC-diagnosed patients; it detects the level of hierarchical granularity at which to cluster the codes, allowing for feature selection without expert clinical input regarding which features to use and how to cluster them. To our knowledge, this is a novel application of the tree-based scan statistic to selecting features from a hierarchical structure for use in machine learning models, and it may be of independent methodological interest.
Methods
Study population
This retrospective cohort study is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which seeks to understand, treat, and prevent PASC. For more information on RECOVER, visit https://recovercovid.org/. The study population includes EHR data from the PEDSnet network and the following institutions: Children’s Hospital of Philadelphia, Cincinnati Children’s Hospital Medical Center, Children’s Hospital of Colorado, Ann & Robert H. Lurie Children’s Hospital of Chicago, Nationwide Children’s Hospital, Nemours Children’s Health System (in Delaware and Florida), Seattle Children’s Hospital, and Stanford Children’s Health. The PEDSnet RECOVER Database Version 230316 was used with data available through March 16, 2023.
This study constitutes human subjects research; IRB approval was obtained under BRANY protocol #21-08-508 and consent and HIPAA authorization were waived.
Identifying PASC diagnoses
We identified patients with MIS-C by the presence of ICD-10-CM code M35.81, ICD-10 codes U10 or U10.9, OHDSI extension code OMOP5042964, or an EHR interface term containing the string ‘MIS-C’ or both the strings ‘multisystem’ and ‘inflam’. We identified patients with non-MIS-C subphenotypes of PASC using the ICD-10-CM diagnosis code U09.9 (available October 1, 2021 [13]) or by an EHR interface term containing the string ‘long covid’, or containing ‘post-acute’ together with ‘covid’ or ‘sars’. Based on evidence [14] and early CDC guidance [15] that the non-COVID-specific ICD-10-CM code B94.8 (“Sequelae of other specified infectious and parasitic diseases”) was used as a placeholder for long COVID prior to the introduction of U09.9, we elected to include among our cases patients who had an occurrence of this code, or of similar nonspecific post-viral infection codes in the SNOMED CT [16] vocabulary that were not attributable to other conditions, when such codes occurred following a positive SARS-CoV-2 test result.
We define PASC diagnosis as the presence of any of the above codes, including those for MIS-C, and we will use the terms ‘PASC’ or ‘PASC (any)’ to refer to it. We define patients with ‘Non-MIS-C PASC’ to be the subset of patients with a PASC diagnosis who did not have a MIS-C diagnosis, and patients with ‘MIS-C’ to be those who had a MIS-C diagnosis. Patients without PASC are defined to be those with no PASC diagnosis of any kind.
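As a concrete illustration, this labeling logic can be sketched in a few lines of Python. This is a minimal sketch, not the study’s pipeline: the `dx` table (columns `person_id`, `code`, `source_term`) is a hypothetical flattened view of EHR condition records, and the code lists are abbreviated versions of those described above.

```python
import pandas as pd

# Hypothetical, abbreviated code lists; see the full rules above.
MISC_CODES = {"M35.81", "U10", "U10.9", "OMOP5042964"}
PASC_CODES = {"U09.9"}  # plus B94.8 and similar post-viral codes per the text

def is_misc_term(term: str) -> bool:
    t = term.lower()
    return "mis-c" in t or ("multisystem" in t and "inflam" in t)

def is_pasc_term(term: str) -> bool:
    t = term.lower()
    return "long covid" in t or ("post-acute" in t and ("covid" in t or "sars" in t))

def label_patients(dx: pd.DataFrame) -> pd.Series:
    """Assign each person_id to MISC, non_MISC_PASC, or no_PASC."""
    terms = dx["source_term"].fillna("")
    misc_ids = set(dx.loc[dx["code"].isin(MISC_CODES) | terms.map(is_misc_term), "person_id"])
    pasc_ids = set(dx.loc[dx["code"].isin(PASC_CODES) | terms.map(is_pasc_term), "person_id"])
    labels = pd.Series("no_PASC", index=pd.Index(dx["person_id"].unique(), name="person_id"))
    labels[labels.index.isin(pasc_ids)] = "non_MISC_PASC"
    labels[labels.index.isin(misc_ids)] = "MISC"  # MIS-C takes precedence over non-MIS-C PASC
    return labels
```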
Cohort definition
Our cohort comprised patients less than 21 years old who had a positive SARS-CoV-2 test (antigen, serology, or reverse transcriptase-polymerase chain reaction (RT-PCR)) or a PASC or MIS-C diagnosis at any point after January 1, 2021 [7]. We used this cutoff date because it is when the MIS-C diagnosis code M35.81 first came into use. Because home viral testing became increasingly prevalent in 2022, and not all PASC-diagnosed patients in our cohort had a SARS-CoV-2 test result, we did not require one. Further, for many PASC (including MIS-C)-diagnosed patients who did have a positive SARS-CoV-2 test result, the test occurred on or near their earliest PASC diagnosis. Due to the resulting difficulties in capturing the true date of onset of SARS-CoV-2 infection in the EHR, we elected to impute the index date for all PASC-diagnosed patients by selecting a random date in the 28 to 90 days prior to their earliest PASC diagnosis. For non-PASC SARS-CoV-2-infected patients, we defined the index date as the date of the first positive test result. We further required at least two visits following the index date for all patients in our cohort.
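A minimal sketch of this index-date rule follows, assuming two per-patient series indexed by `person_id` (names are illustrative, not the study’s schema): `first_pasc_dx`, which is NaT for patients without a PASC diagnosis, and `first_pos_test`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed for reproducibility

def impute_index_date(first_pasc_dx: pd.Series, first_pos_test: pd.Series) -> pd.Series:
    """Index date: random date 28-90 days before earliest PASC diagnosis
    for cases; first positive SARS-CoV-2 test otherwise."""
    offset_days = rng.integers(28, 91, size=len(first_pasc_dx))  # 28-90 inclusive
    offsets = pd.Series(pd.to_timedelta(offset_days, unit="D"), index=first_pasc_dx.index)
    imputed = first_pasc_dx - offsets       # random date before the diagnosis
    return imputed.fillna(first_pos_test)   # controls keep their first positive test
```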
Feature selection
Our feature set consisted of person-level demographics and the presence of condition, diagnostic test, procedure, and medication codes in the EHR. Demographic variables included site, index date month and year, age at index date, sex, and race/ethnicity. All condition codes used to define the presence of PASC were excluded from the feature set. Additionally, because patients with PASC were not required to have a prior SARS-CoV-2 test whereas patients without PASC in our cohort were, we also excluded codes indicating the presence of SARS-CoV-2 testing from the diagnostic testing codes.
A priori, there are hundreds of thousands of codes in these domains, making a model which uses all of them infeasible. Furthermore, due to hierarchical relationships between codes in a given vocabulary, the most relevant features in classifying which patients have PASC can occur at varying levels of granularity. To effectively select features, defined as groups of codes, from each vocabulary, we used the tree-based scan statistic computed by the TreeScan (2.0) software [17]. Given a clinical vocabulary hierarchy (e.g. SNOMED CT) and the counts of patients in our cohort, stratified by those with and without PASC, who had each code in that hierarchy occur during the post-acute period following COVID-19, the tree-based scan statistic identifies (using a likelihood ratio statistic) clusters of codes in the hierarchy which occurred significantly more frequently among patients with PASC than those without PASC.
Since for each code the outcome for each patient was binary, we used a Bernoulli probability model, testing the null hypothesis that the proportion of patients with the outcome who had PASC is equal to the proportion of patients with PASC in the cohort as a whole. We refer the reader to the tree-based scan statistic literature for more details [12, 18].
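For intuition, the Bernoulli log-likelihood ratio that TreeScan maximizes over candidate cuts can be written down directly. The sketch below is illustrative rather than the TreeScan implementation; in TreeScan, significance of the maximizing cut is then assessed by Monte Carlo permutation rather than an asymptotic distribution.

```python
import math

def bernoulli_llr(c: int, n: int, C: int, N: int) -> float:
    """Log-likelihood ratio for one tree cut under the Bernoulli model:
    c of the n patients with a code in the cut are PASC cases, against
    C cases among N patients overall."""
    if n == 0 or n == N or c / n <= C / N:
        return 0.0  # scan only for excess case fraction inside the cut
    def xlogy(x: float, y: float) -> float:
        return x * math.log(y) if x > 0 else 0.0  # convention 0 * log(0) = 0
    inside = xlogy(c, c / n) + xlogy(n - c, (n - c) / n)
    outside = xlogy(C - c, (C - c) / (N - n)) \
        + xlogy((N - n) - (C - c), ((N - n) - (C - c)) / (N - n))
    null = xlogy(C, C / N) + xlogy(N - C, (N - C) / N)
    return inside + outside - null
```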
We used as input into the TreeScan software the SNOMED CT hierarchy for conditions, the RxNorm [19] hierarchy for medications, the Logical Observation Identifiers Names and Codes (LOINC) [20] hierarchy for labs, and a union of the International Classification of Diseases, 10th Revision, Procedure Coding System (ICD-10-PCS) [21], Healthcare Common Procedure Coding System (HCPCS) [22], and Current Procedural Terminology (CPT4) [23] hierarchies for procedures. To ensure uniform follow-up time, the cohort we used for TreeScan feature selection was a subcohort of the full cohort of our study consisting of patients who had at least 3 months of follow-up. The code occurrences used as input to TreeScan were counted during the 28–180 days following the index date.
From the TreeScan output, we then selected the branches which were 1) significant at the p < 0.01 threshold and 2) had at least 500 occurrences for conditions, 400 occurrences for procedures, 10,000 occurrences for diagnostic tests, and 3,000 occurrences for medication prescriptions. We also omitted from consideration cuts above the 5th level of the respective vocabularies for conditions, above the 3rd level for procedures, and above the 4th level for diagnostic tests, to avoid selecting clusters that were too general. These parameters were additionally chosen to limit the number of features in the model to a more computationally feasible number without decreasing model performance.
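These selection rules amount to a simple filter over TreeScan’s per-cut output. The sketch below is illustrative: the column names (`p_value`, `n_occurrences`, `level`, `domain`) and the assumption that levels are counted down from the root are ours, not TreeScan’s actual output format.

```python
import pandas as pd

MIN_OCCURRENCES = {"condition": 500, "procedure": 400,
                   "diagnostic_test": 10_000, "medication": 3_000}
# Cuts above these levels (closer to the root) are considered too general.
MIN_LEVEL = {"condition": 5, "procedure": 3, "diagnostic_test": 4}

def select_branches(cuts: pd.DataFrame) -> pd.DataFrame:
    keep = (
        (cuts["p_value"] < 0.01)
        & (cuts["n_occurrences"] >= cuts["domain"].map(MIN_OCCURRENCES))
        & (cuts["level"] >= cuts["domain"].map(MIN_LEVEL).fillna(0))  # no level cap for medications
    )
    return cuts[keep]
```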
To create the feature space for our model, five binary variables were constructed for each branch selected by the tree-based scan statistic approach to indicate whether the feature was present for each of the following time periods relative to the index date: -1 to 0 months, 0 to 1 months, 1 to 2 months, 2 to 3 months, and 3 to 6 months. In addition to condition, procedure, diagnostic test, and medication features for each of the above time periods, we also included visit counts for each of 7 types of visits (Outpatient Office, Outpatient: Test Only, Inpatient ICU, Inpatient non-ICU, Emergency Department ICU, Emergency Department non-ICU, and Other/Unknown) and the total number of laboratory tests performed over each time period. Finally, we included patient-level demographic variables consisting of sex, race/ethnicity, and month of cohort entry.
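A sketch of this windowed feature construction follows, with illustrative column names and month windows approximated in days: `events` holds one row per occurrence of a TreeScan-selected cluster, and `index_dates` is the per-patient index date series from the cohort definition above.

```python
import pandas as pd

# Month windows relative to the index date, approximated in days.
WINDOWS = {"m1_0": (-30, 0), "0_1": (0, 30), "1_2": (30, 60),
           "2_3": (60, 90), "3_6": (90, 180)}

def windowed_features(events: pd.DataFrame, index_dates: pd.Series) -> pd.DataFrame:
    """events: person_id, cluster, event_date; index_dates indexed by person_id."""
    ev = events.merge(index_dates.rename("index_date"),
                      left_on="person_id", right_index=True)
    ev["delta"] = (ev["event_date"] - ev["index_date"]).dt.days
    frames = []
    for name, (lo, hi) in WINDOWS.items():
        hit = ev[(ev["delta"] >= lo) & (ev["delta"] < hi)]
        wide = pd.crosstab(hit["person_id"], hit["cluster"]).clip(upper=1)  # 0/1 indicator
        frames.append(wide.add_suffix(f"_{name}"))
    out = pd.concat(frames, axis=1).reindex(index_dates.index, fill_value=0)
    return out.fillna(0).astype(int)
```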
Model selection and evaluation
We used an XGBoost model [24] with hyperparameters selected using cross-validated grid search with Scikit-learn [25]. To distinguish between MIS-C and non-MIS-C PASC, the model was trained to classify patients as belonging to one of three classes (MIS-C, non-MIS-C PASC, or no PASC), and the model output for each patient consisted of one probability for each class such that the three probabilities sum to 1.
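The modeling step can be sketched as follows. The hyperparameter grid and the `neg_log_loss` selection metric are illustrative assumptions (the study’s actual grid is not reproduced here); `X` is the feature matrix described above, and `y_str` holds the hypothetical class labels from the labeling sketch.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

le = LabelEncoder()
y = le.fit_transform(y_str)               # XGBoost expects integer labels 0..K-1

param_grid = {                            # illustrative grid, not the study's
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    XGBClassifier(objective="multi:softprob", eval_metric="mlogloss",
                  tree_method="hist"),
    param_grid, cv=5, scoring="neg_log_loss", n_jobs=-1)
search.fit(X, y)
proba = search.best_estimator_.predict_proba(X)   # per-class probabilities summing to 1
```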
We used five-fold cross-validation to evaluate our model. The feature selection process was performed during each training step, and hence the dimension of the feature space varies slightly between the folds. Model performance was evaluated using a variety of metrics, including recall and precision at various probability thresholds, F1 scores, area under the Receiver Operating Characteristic (ROC) curve (AUROC), and area under the Precision-Recall (PR) curve (AUPR). Because both accuracy and AUROC can be misleading measures of model performance for imbalanced classification problems (in our case, cases form about 3% of our cohort), we favored AUPR as the primary metric to evaluate our models, as it provides a concise summary of the recall-precision tradeoff across different levels of probability threshold [26].
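Continuing the sketch above, per-class AUPR and AUROC can be computed from out-of-fold predicted probabilities. Note that the study repeated feature selection within each training fold, a step omitted here for brevity; the “PASC (any)” score shown is one plausible way to combine the two PASC classes, not necessarily the study’s exact computation.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities (per-fold feature selection omitted here).
proba = cross_val_predict(search.best_estimator_, X, y, cv=5, method="predict_proba")

for k, name in enumerate(le.classes_):       # per-class one-vs-rest summaries
    y_bin = (y == k).astype(int)
    print(f"{name}: AUPR={average_precision_score(y_bin, proba[:, k]):.3f} "
          f"AUROC={roc_auc_score(y_bin, proba[:, k]):.3f}")

# One plausible "PASC (any)" score: 1 minus the no-PASC probability.
no_pasc = list(le.classes_).index("no_PASC")
p_any = 1.0 - proba[:, no_pasc]
print(f"PASC (any): AUPR={average_precision_score((y != no_pasc).astype(int), p_any):.3f}")
```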
Feature importance
To evaluate our model’s predictions and feature importance, we calculated SHapley Additive exPlanation (SHAP) values [27, 28]. SHAP values, a game-theoretic concept repurposed for machine learning, allow us to see how various features from our high-dimensional feature space contribute to determining the model’s output for each patient. Feature importance ranking is based on the mean absolute value of SHAP values for each feature.
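A minimal sketch of this computation, continuing from the fitted model above (the list-versus-array handling reflects differences between shap library versions):

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(search.best_estimator_)
sv = explainer.shap_values(X)
misc = list(le.classes_).index("MISC")
# Older shap returns one array per class; newer versions a 3-D array.
sv_misc = sv[misc] if isinstance(sv, list) else sv[:, :, misc]
mean_abs = np.abs(sv_misc).mean(axis=0)      # mean |SHAP| per feature
top20 = np.argsort(mean_abs)[::-1][:20]      # most impactful features for MIS-C
```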
Results
Cohort
Our cohort comprised 125,361 children meeting eligibility criteria for SARS-CoV-2 infection and follow-up (Table 1, Fig 1). Of these, 1,435 had a diagnosis of MIS-C, and 1,627 were diagnosed with non-MIS-C PASC. Most patients entered the cohort between November 2021 and February 2022. Serology testing in the absence of viral test results was used to identify 1.9% of the controls. Consistent with code availability, a higher proportion of patients with MIS-C diagnoses entered the cohort in early 2021 than those with non-MIS-C PASC (S1 Fig).
Tree-based scan statistic-selected features
The tree-based scan statistic approach resulted in 114 condition features, 181 diagnostic test features, 167 procedure features, and 189 medication features. The terms from the SNOMED CT ontology for which TreeScan found the greatest enrichment in patients diagnosed with PASC are shown in S2a Table. Some terms reflect specific pathologic findings (e.g. myocarditis), especially cardiovascular processes, and particularly those consistent with MIS-C. However, many of the terms with highest risk ratios are internal nodes in the SNOMED CT hierarchy and reflect the sum of contributions from multiple more specific terms. This pattern reflects the heterogeneity of presentation, particularly in patients without MIS-C, such that no single specific diagnosis best distinguishes between cases and non-cases. Results from medication data (S2d Table), diagnostic testing (S2b Table), and procedures (S2c Table) again reflect cardiovascular therapies, as well as immunomodulation, and supportive care for acute illness.
Model performance
The final algorithm was an XGBoost [24] model with a 3,312-dimensional feature space. The feature space consisted of variables coded for 5 different time windows around the time of SARS-CoV-2 infection (-1 to 0 months, 0 to 1 months, 1 to 2 months, 2 to 3 months, and 3 to 6 months) for each of 114 conditions, 181 diagnostic tests, 167 procedures, and 189 medications, along with lab counts and visit counts by visit type for each window, as well as sex, race/ethnicity, and month of cohort entry.
Using five-fold cross-validation, we found that our model classified whether patients had PASC of any kind with 85.2% AUPR and 98.4% AUROC (Table 2, Fig 2), with 66.9% recall and 91.0% precision at the p = 0.5 threshold. In classifying whether patients had MIS-C, the model achieved 94.5% AUPR and 99.8% AUROC, with 85.3% recall and 92.8% precision at the p = 0.5 threshold. In classifying non-MIS-C PASC patients, the model performed with 62.3% AUPR and 97.1% AUROC, with 40.7% recall and 79.1% precision at the p = 0.5 threshold. Our model compared favorably against random forest and regularized logistic regression models (S1 Table).
Fig 2. For each of the three outcomes (PASC (any), non-MIS-C PASC, and MIS-C), the Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves are estimated and plotted 5 times, once for each cross-validation fold.
Table 2. The table displays several performance metrics for each of the three outcomes (PASC (any), non-MIS-C PASC, and MIS-C). Accuracy, F1 score, precision, and recall were all computed at the p = 0.5 threshold. In addition to composite performance statistics, we report the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC). The curves themselves are shown and described in Fig 2.
Feature importance
SHAP summaries [27, 28] provide approximations of features’ impact on model results. Results for the 3-outcome model are presented in Fig 3, aggregated by clinical domain. Because the model output is multinomial, the x axis is expressed in terms of log odds rather than probabilities. Among demographic features, time of cohort entry was most impactful. Unsurprisingly, the clinical feature with the greatest positive impact was the presence of an inflammatory diagnosis. Diagnosis features appeared more often as positive predictors, while laboratory testing appeared more often as a negative predictor. Additional SHAP summaries for specific data domains are presented in S2a–S2g Fig.
Fig 3. The plots show the most significant features as determined by the sum of SHAP value magnitudes over all samples. For each feature, SHAP values for each patient are plotted, with color representing the feature value (e.g. red if the feature was present and blue if absent, in the case of a binary variable). For the SHAP values pictured, the x axis is interpreted as change in log odds (in particular, SHAP values are not confined to be between –1 and 1).
Discussion
We have employed machine learning to identify potential patients with PASC based on EHR data collected during clinical care. As controls substantially outnumber cases in our cohort, accuracy and AUROC tend to be inflated, so we assessed our model’s performance primarily in terms of the precision-recall curve. At the p = 0.5 threshold, our model was a high-precision classifier for PASC in general as well as for MIS-C and non-MIS-C PASC separately. Indeed, the model was able to classify patients with up to 60% recall with near-perfect precision, with a drop-off in precision below 80% beginning only around 80% recall. At this threshold, our model’s high precision could be a valuable tool for identifying patients with PASC for study cohorts, particularly patients who did not receive a diagnosis due to unavailability of specific diagnosis codes, evolving clinical understanding of PASC, or less severe presentation. At lower probability thresholds, the model is potentially useful for identifying patients who may have features of PASC for further screening, such as might be done for clinical trial recruitment.
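One way to operationalize these two use cases is to read thresholds off the precision-recall curve. The sketch below is illustrative, not the study’s tooling: it returns the most permissive probability threshold that still achieves a target precision for cohort construction, while high-recall screening would instead use a lower fixed threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, scores, target_precision: float = 0.95):
    """Lowest probability threshold whose precision still meets the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    ok = np.where(precision[:-1] >= target_precision)[0]  # precision[i] pairs with thresholds[i]
    return thresholds[ok[0]] if ok.size else None
```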
The model’s difficulties with the remaining 20% of PASC patients were driven by patients with non-MIS-C PASC. The lower recall for these patients could be a result of several factors. First, because the U09.9 diagnosis code was implemented in October 2021, there is a shorter overall duration for appropriately labelled cases to accrue in the training data. Though we attempted to mitigate this by including as cases patients who received more general post-viral disorder codes following a positive COVID test, it is possible that these more general codes were not used consistently. Additionally, non-MIS-C PASC patients had, on average, less follow-up time than patients with MIS-C. A second factor is that MIS-C is a more sharply defined condition with similarities to Kawasaki disease, an existing clinical phenotype, and thus there are features in the data related to diagnostic test utilization that align with consensus definitions of MIS-C (S2e Fig). By contrast, we see more reliance on co-occurring symptoms and conditions to define non-MIS-C PASC. These are more heterogeneous given the range of possible symptoms and presentations and likely include more characteristics that are poorly coded in discrete EHR data, such as fatigue or school difficulties.
We expect that our observed ceiling in recall around 80% may also reflect several limitations in the available training set. The primary challenge of using PASC diagnosis information in EHR data as a training standard, particularly for non-MIS-C PASC, is that the lack of a specific definition means the presence or absence of PASC-specific diagnosis codes does not constitute a gold standard for defining PASC. Moreover, patients who were seen with PASC or MIS-C may not have received the relevant diagnosis codes, either due to the recent release of the codes or the early limited understanding of PASC, especially in the case of less severe sub-phenotypes. Thus, while diagnosis codes can serve as a starting point for assessing complications due to COVID-19, there is likely under-ascertainment of PASC, which may result in false negative labels in our control population. The lack of a consensus definition may drive differences in clinical assignment of the code, and many patients may either be undiagnosed or never visit a clinician if the symptoms are not severe. On the other hand, false positives are also possible: many PASC features may reflect ongoing symptoms or conditions related to the acute COVID-19 infection, and therefore the lack of a clear boundary between acute and post-acute symptoms and conditions may have led to patients in the acute phase of illness receiving a PASC diagnosis. Additionally, unlike adult care, there are relatively few institutional markers, such as dedicated long COVID clinics, for patients with PASC, who receive their care in broader specialty settings such as cardiology or neurology. Lack of follow-up for non-MIS-C PASC patients may also limit the model’s ability to identify distinguishing characteristics. The SHAP value plots indicate that many of the predictive features of PASC occur within the first three months from diagnosis, which may reflect this limitation.
When compared with adults, PASC in children may present with a milder clinical course [29], leading to underdiagnosis in the health care setting. Some effects of PASC may have been seen in school performance or extracurricular participation, neither of which may have been captured in the EHR. Pediatric PASC may less frequently involve focal organ dysfunction that prompts specific utilization, such as laboratory testing or diagnostic imaging, which is well captured in EHR data. Additionally, most PASC manifestations in children may occur shortly following acute infection and resolve more rapidly than in adults [10]. These factors both require a distinct approach to classifying children with PASC and warrant further investigation as clinical guidelines and definitions are developed. Review of feature contribution using SHAP value visualization reveals complex patterns, especially for non-MIS-C variants of PASC. In the more straightforward MIS-C case, the diagnoses with the greatest positive impact are mobility findings pre-diagnosis and vascular malformation post-diagnosis, possibly related to testing for coronary artery aneurysms (S2d Fig). Consistent with expectation, procedure codes show a positive contribution of cardiac testing (S2g Fig), and laboratory testing for inflammation has a positive impact (S2e Fig; labelling as a rheumatoid arthritis panel reflects TreeScan’s aggregation of specific tests into umbrella terms that subsume them), while additional infection-related testing is more characteristic of unaffected patients. Oral formulations targeted at younger children contribute positively to MIS-C classification, likely representing use of aspirin prophylaxis in these patients (S2f Fig).
SHAP data for the non-MIS-C group of patients reflects their greater heterogeneity. The skew toward later cohort entry despite masking of the PASC diagnosis code itself may reflect differences in PASC natural history, or in practice patterns for evaluation and treatment (S2a Fig). As for children with MIS-C, lab utilization at the time of SARS-CoV-2 infection is not associated with caseness, but post-acute utilization is (S2b and S2c Fig). Interestingly, testing for inflammatory markers in the acute period is also a positive influence for non-MIS-C PASC classification, while evaluation for infectious diseases is not, possibly reflective of MIS-C-informed practice patterns. In contrast to MIS-C, however, respiratory medications contribute more positively to the non-MIS-C PASC scoring (S2f Fig). Similarly, early thoracic imaging has positive predictive value; while electrocardiography shows a positive effect, echocardiography does not, in contrast to MIS-C (S2g Fig). Among diagnoses, respiratory and neurologic codes were observed to exert the greatest positive impact (S2d Fig). Overall, this constellation of factors suggests that the presence of persistent respiratory problems is most characteristic of the non-MIS-C PASC group, with a lesser contribution of persistent pain.
To our knowledge, this is the first study that has applied machine learning to an exclusively pediatric population to identify patients with PASC. A recent study [5] developed a machine learning phenotype in the adult population, trained to classify whether patients were seen at a long COVID clinic. That model achieved 0.92 AUROC with 0.85 precision and 0.86 recall, the latter being slightly greater than observed here. Differences between our model and theirs include our pediatric population and the use of PASC diagnoses as an outcome rather than specialty clinic attendance, due to limited use of dedicated PASC clinics at pediatric institutions.
Of note, we believe this study is also the first to use a tree-based scan statistic for clinical feature selection in a machine learning model. The sets of possible features in each domain number in the hundreds of thousands, making reduction of dimensionality a necessary part of model development. Further, it is difficult to identify the correct level of granularity at which concepts should be clustered to define model features. This can be addressed through clinician expertise but is a time-consuming process which may incorporate bias stemming from individual clinicians’ practices or the coding practices at their institutions. However, vocabularies such as SNOMED CT or RxNorm have rich hierarchies based on physiology or chemical structure, respectively. A tree-based scan statistic approach therefore presents an appealing alternative: it takes the hierarchical structure of clinical vocabularies into account and does not require pre-specifying the level of granularity at which features should be selected.
Our study does have additional limitations, many in common with other studies which use EHR data to investigate PASC, a new condition with a broad definition. First, our model was trained to classify whether patients had a PASC diagnosis, which does not include all patients having PASC. Patients with milder manifestations may have been less likely to receive a diagnosis and thus would be less likely to be detected by our model. Second, our model was trained on data from tertiary and quaternary pediatric health systems, and so it may reflect biases in access to care as well as clinicians’ practices. Though a matched cohort design could be used to address some of these biases, it would run counter to our purpose of identifying patients likely to have PASC from routine EHR data. Third, while using data from a variety of sites may avoid overfitting the model to site-specific coding practices, there could be bias arising from coding practices across PEDSnet, as a network of pediatric health systems. A fourth potential limitation stems from our approach to imputing index dates for patients with PASC. Due to our difficulty in accurately capturing the initial SARS-CoV-2 infection date for such patients, we used a random date between 28 and 90 days prior to the first PASC diagnosis as the index date. However, this may not accurately reflect the date of the patient’s SARS-CoV-2 infection, and as a result, relevant features may appear in a different time window relative to infection than when they actually occurred. Given the short follow-up for many patients, this could lead to a model that heavily weighs data close in time to infection. Finally, as noted above, the limited availability of PASC and MIS-C diagnosis codes may not only have reduced follow-up but also led to models that focus on manifestations associated with more recent viral variants.
There is substantial potential for future work as understanding of PASC develops and more data become available. Training on data with more cases could help address the model’s potential limitation due to heterogeneity of PASC (particularly non-MIS-C PASC). Clinician review of false positives from the model could be used to determine the extent to which PASC-likely patients as determined by the model may have been undiagnosed, and review of false negatives could indicate which features the model is either misinterpreting or failing to detect. Further, our planned validation of the model on sites outside the PEDSnet network will help to evaluate the model’s generalizability. Supplementing the model with other sources of clinical data, such as unstructured chart notes, could further improve model performance. Different features may be associated with PASC over different time periods relative to COVID infection. The tree-temporal scan statistic has the potential to address this by picking out both features and time windows over which patients with PASC disproportionately have those features occur, thereby constructing a more refined feature space than the fixed windows we used.
Conclusion
We have applied machine learning methods to available EHR data for children with SARS-CoV-2 infection and subsequent diagnosis of PASC. Using appropriate threshold settings, the model can be used to identify PASC patients in health systems data at higher precision for inclusion in studies or at higher recall in screening for clinical trials, especially in settings where PASC diagnosis codes are used less frequently or less reliably. Analysis of how specific features contribute to the classification process may assist in gaining a better understanding of features that are associated with PASC diagnoses. While additional work is required to improve identification of children with uncommon manifestations of PASC, the current classifier provides a valuable research tool, especially in cases where the scale or provenance of data make it infeasible to determine each patient’s PASC status by direct report. It may also find use in health care delivery, to identify undiagnosed patients who may benefit from further screening to identify and ameliorate effects of PASC.
Supporting information
S2 Fig. a-g SHapley Additive exPlanation (SHAP) values for model features by class (MIS-C/Non-MIS-C PASC).
The plots show the most significant features as determined by the sum of SHAP value magnitudes over all samples. For each feature, SHAP values for each patient are plotted, with color representing the feature value (e.g. red if feature was present and blue if absent in case of a binary variable). The SHAP values pictured are for the 3 class classification task and the x axis is interpreted as change in log odds for the corresponding outcome (e.g. MIS-C) as opposed to change in probability (in particular, SHAP values are not confined to be between –1 and 1).
https://doi.org/10.1371/journal.pone.0289774.s002
(ZIP)
S1 Table. Model performance benchmarks.
The table below displays several performance metrics for the XGBoost model, a random forest model, and a binary logistic regression model, computed using 5-fold cross-validation with the same set of features for each model. In all three models, the outcome predicted is PASC (any). Model parameters, selected by cross-validated grid search, are displayed below each model. Any parameters not listed were set to the default values in the corresponding Python libraries.
https://doi.org/10.1371/journal.pone.0289774.s003
(DOCX)
S2 Table. a-d: TreeScan-selected feature clusters.
These tables show the TreeScan-selected cuts for conditions, labs, procedures, and medications. Each row describes the top node which characterizes the cluster. In other words, the node, together with all descendant codes, defines the feature cluster.
https://doi.org/10.1371/journal.pone.0289774.s004
(ZIP)
References
- 1. Fainardi V, Meoli A, Chiopris G, Motta M, Skenderaj K, Grandinetti R, et al. Long COVID in Children and Adolescents. Life Basel Switz 2022;12:285. pmid:35207572
- 2. Thallapureddy K, Thallapureddy K, Zerda E, Suresh N, Kamat D, Rajasekaran K, et al. Long-Term Complications of COVID-19 Infection in Adolescents and Children. Curr Pediatr Rep 2022;10:11–7. pmid:35127274
- 3. Rao S, Lee GM, Razzaghi H, Lorman V, Mejias A, Pajor NM, et al. Clinical features and burden of post-acute sequelae of SARS-CoV-2 infection in children and adolescents: an exploratory EHR-based cohort study from the RECOVER program. MedRxiv Prepr Serv Health Sci 2022:2022.05.24.22275544. pmid:35665016
- 4. Reese J, Blau H, Bergquist T, Loomba JJ, Callahan T, Laraway B, et al. Generalizable Long COVID Subtypes: Findings from the NIH N3C and RECOVER Programs. MedRxiv Prepr Serv Health Sci 2022:2022.05.24.22275398. pmid:35665012
- 5. Pfaff ER, Girvin AT, Bennett TD, Bhatia A, Brooks IM, Deer RR, et al. Identifying who has long COVID in the USA: a machine learning approach using N3C data. Lancet Digit Health 2022;4:e532–41. pmid:35589549
- 6. Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. MedRxiv Prepr Serv Health Sci 2022:2022.04.23.22274218.
- 7. HAN Archive—00432 | Health Alert Network (HAN) 2021. https://emergency.cdc.gov/han/2020/han00432.asp (accessed August 18, 2022).
- 8. Algarni AS, Alamri NM, Khayat NZ, Alabdali RA, Alsubhi RS, Alghamdi SH. Clinical practice guidelines in multisystem inflammatory syndrome (MIS-C) related to COVID-19: a critical review and recommendations. World J Pediatr 2022;18:83–90. pmid:34982402
- 9. Mahmoud S, El-Kalliny M, Kotby A, El-Ganzoury M, Fouda E, Ibrahim H. Treatment of MIS-C in Children and Adolescents. Curr Pediatr Rep 2022;10:1–10. pmid:35036079
- 10. Borch L, Holm M, Knudsen M, Ellermann-Eriksen S, Hagstroem S. Long COVID symptoms and duration in SARS-CoV-2 positive children—a nationwide cohort study. Eur J Pediatr 2022;181:1597–607. pmid:35000003
- 11. Ramakrishnan RK, Kashour T, Hamid Q, Halwani R, Tleyjeh IM. Unraveling the Mystery Surrounding Post-Acute Sequelae of COVID-19. Front Immunol 2021;12:686029. pmid:34276671
- 12. Kulldorff M, Fang Z, Walsh SJ. A tree-based scan statistic for database disease surveillance. Biometrics 2003;59:323–31. pmid:12926717
- 13. CDC. Healthcare Workers. Cent Dis Control Prev 2020. https://www.cdc.gov/coronavirus/2019-ncov/hcp/clinical-care/post-covid-public-health-recs.html (accessed August 18, 2022).
- 14. Coding Long COVID: Characterizing a new disease through an ICD-10 lens | medRxiv n.d. https://www.medrxiv.org/content/10.1101/2022.04.18.22273968v1 (accessed August 18, 2022).
- 15. CDC Announces Approval of ICD-10 Code for Post-Acute Sequelae of COVID-19. AapmrOrg n.d. https://www.aapmr.org/members-publications/member-news/member-news-details/2021/07/20/cdc-announces-approval-of-icd-10-code-for-post-acute-sequelae-of-covid-19 (accessed August 22, 2022).
- 16. SNOMED International n.d. https://www.snomed.org.
- 17. Kulldorff M. TreeScan User Guide, v2.0. 2020.
- 18. Wang SV, Maro JC, Baro E, Izem R, Dashevsky I, Rogers JR, et al. Data Mining for Adverse Drug Events With a Propensity Score-matched Tree-based Scan Statistic. Epidemiol Camb Mass 2018;29:895–903. pmid:30074538
- 19. RxNorm n.d. https://www.nlm.nih.gov/research/umls/rxnorm/index.html.
- 20. LOINC n.d. https://loinc.org.
- 21. 2023 ICD-10-PCS n.d. https://www.cms.gov/medicare/icd-10/2023-icd-10-pcs.
- 22. HCPCS-General Information n.d. https://www.cms.gov/medicare/coding/medhcpcsgeninfo.
- 23. CPT/Medicare Payment Search n.d. https://cptsearch.ama-assn.org/CptSearch/user/search/cptSearch.do.
- 24. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. https://dl.acm.org/doi/10.1145/2939672.2939785 (accessed August 18, 2022).
- 25. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res 2011;12:2825–30.
- 26. Saito T, Rehmsmeier M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015;10:e0118432. pmid:25738806
- 27.
Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst., vol. 30, Curran Associates, Inc.; 2017.
- 28.
GPUTreeShap: massively parallel exact calculation of SHAP scores for tree ensembles [PeerJ] n.d. https://peerj.com/articles/cs-880/ (accessed August 18, 2022).
- 29. Pellegrino R, Chiappini E, Licari A, Galli L, Marseglia GL. Prevalence and clinical presentation of long COVID in children: a systematic review. Eur J Pediatr 2022;181:3995–4009. pmid:36107254