Fig 1.
Workflow of the overall study.
Workflow graph detailing the overall process with details for each step. The workflow includes data extraction, cohort selection and data processing, clustering, and predictive modeling.
Fig 2.
Participants from JH-CROWN are stratified into cohorts, yielding four final cohorts: low confidence of exposure (resistant/non-resistant) and high confidence of exposure (resistant/non-resistant).
Table 1.
Summary statistics for the low-confidence and high-confidence exposure groups broken down by resistance.
Number of patients (percent of total population) is reported for categorical variables; mean (standard deviation) is presented for continuous variables.
Table 2.
Prevalence of patterns found using the MASPC method in resistant and non-resistant patients.
Five diagnostic code patterns were found with a p-value less than 0.05. Odds ratios less than 1 indicate that a pattern is more prevalent in the non-resistant cohort, whereas odds ratios greater than 1 indicate that it is more prevalent in the resistant cohort.
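As a reading aid for the odds ratios described above, the following is a minimal sketch of how an odds ratio and a two-sided exact p-value can be computed from a 2x2 table of pattern presence by resistance status. The function names and the example counts are hypothetical, and the use of Fisher's exact test is an assumption; the caption does not state which test produced the reported p-values.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table (hypothetical layout):
    a = resistant with pattern,     b = resistant without pattern,
    c = non-resistant with pattern, d = non-resistant without pattern.
    Values > 1 mean the pattern is more common among resistant patients."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value (assumed test, not confirmed by the source).
    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-7))

if __name__ == "__main__":
    # Illustrative counts only; these are not values from Table 2.
    print(odds_ratio(30, 70, 15, 85))
    print(fisher_exact_two_sided(30, 70, 15, 85))
```

Swapping the two rows of the table inverts the odds ratio, which is why values below 1 point to the non-resistant cohort.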
Fig 3.
Clustering results from the MASPC method.
Clusters form along demographic features: the three clusters consist predominantly of females, males, and children, respectively. Shown is the distribution of patients with the following patterns of ICD-10 codes: nicotine dependence [F17], depressive episode [F32], long-term drug therapy and type 2 diabetes [Z79, E11], screening for malignant neoplasms [Z12], and asthma [J45].
Fig 4.
Receiver operating characteristic (ROC) curves of the XGBoost (XGB), Random Forest (RF), and Logistic Regression (LR) models.
(a) Testing set: XGB is the best-performing model, and all three models have statistically significant AUROCs (p<0.001). (b) Household Index (HHI) testing set: XGB is again the best-performing model, but the p-values are larger (weaker statistical significance) because of the small sample size.
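To illustrate why a small test set limits the significance of an AUROC, the sketch below computes AUROC as the fraction of correctly ordered positive/negative pairs and attaches a label-permutation p-value. The permutation test is an assumption chosen for illustration, since the caption does not say how the p-values were obtained, and all names and data here are hypothetical.

```python
import random

def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count one half). Labels are 1 (positive) or 0 (negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(labels, scores, n_perm=2000, seed=0):
    """One-sided p-value: share of label shufflings whose AUROC is at least
    as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(shuffled, scores) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Hypothetical toy data: perfect separation, but only six samples.
    labels = [0, 0, 0, 1, 1, 1]
    scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    print(auroc(labels, scores), permutation_p_value(labels, scores))
```

With three positives and three negatives there are only 20 distinct label assignments, so even a perfect ranking cannot reach a p-value much below 1/20; larger test sets do not have this floor.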
Table 3.
XGBoost had the best model performance for both the testing set of the low-confidence group and the HHI testing set, with AUROCs of 0.61 and 0.62, respectively.
Fig 5.
Shapley feature importance of the XGBoost model.
Points on the right, with positive SHAP values, indicate that the feature moves the prediction toward resistance. Red represents a high value of the feature, whereas blue represents a low value. Features are sorted vertically by their mean absolute influence on the prediction. Bolded features were also identified as important by the MASPC patterns shown in Table 2.
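The vertical ordering described above can be sketched as follows, assuming a matrix of per-patient SHAP values is already available (one row per patient, one column per feature). The function name, feature names, and numbers are hypothetical and serve only to show the mean-absolute-value ranking.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least
    influential. shap_values is a list of rows, one per patient."""
    n_rows = len(shap_values)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Illustrative values only; not taken from the study.
    names = ["feature_a", "feature_b", "feature_c"]
    values = [
        [0.10, -0.40, 0.00],
        [-0.20, 0.30, 0.05],
        [0.15, -0.35, -0.05],
    ]
    for name, importance in rank_features_by_mean_abs_shap(values, names):
        print(name, round(importance, 3))
```

Taking the absolute value before averaging keeps features whose effect changes sign across patients from cancelling to zero, so the ranking reflects magnitude rather than direction.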