Factors associated with resistance to SARS-CoV-2 infection discovered using large-scale medical record data and machine learning

doi:10.1371/journal.pone.0278466

Fig 1.

Workflow of the overall study.

Workflow diagram of the overall study, with details for each step. The workflow comprises data extraction, cohort selection and data processing, clustering, and predictive modeling.

Fig 2.

Cohort selection flowchart.

Participants from JH-CROWN are stratified into four final cohorts: resistant and non-resistant patients in a low-confidence-of-exposure group, and resistant and non-resistant patients in a high-confidence-of-exposure group.

Table 1.

Summary statistics for the low-confidence and high-confidence exposure groups broken down by resistance.

Number of patients (percent of total population) is reported for categorical variables; mean (standard deviation) is presented for continuous variables.

Table 2.

Prevalence of patterns found using the MASPC method in resistant and non-resistant patients.

Five diagnostic code patterns were found with a p-value less than 0.05. Odds ratios less than 1 indicate that a pattern is more prevalent in the non-resistant cohort, whereas odds ratios greater than 1 indicate that it is more prevalent in the resistant cohort.
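The odds-ratio reading described in this caption can be illustrated with a minimal sketch. It assumes a 2x2 table of pattern presence by cohort and a Wald confidence interval on the log odds ratio; the caption does not say which test or interval the authors used, and the function name `odds_ratio_with_ci` and all counts below are hypothetical.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: resistant patients with the pattern
    b: resistant patients without the pattern
    c: non-resistant patients with the pattern
    d: non-resistant patients without the pattern
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this sketch")
    # Odds of the pattern among resistant patients divided by the odds
    # among non-resistant patients.
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual Wald approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Hypothetical counts, not taken from the study.
estimate, lower, upper = odds_ratio_with_ci(a=30, b=70, c=60, d=40)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these made-up counts the odds ratio and its whole interval fall below 1, which under the caption's convention would mean the pattern is more prevalent in the non-resistant cohort.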

Fig 3.

Clustering results from the MASPC method.

Clusters form along demographic lines: the three clusters consist mostly of females, males, and children, respectively. Shown is the distribution of patients with the following patterns of ICD-10 codes: nicotine dependence [F17], depressive episode [F32], long-term drug therapy and Type 2 diabetes [Z79, E11], screening for malignant neoplasms [Z12], and asthma [J45].

Fig 4.

Receiver operating characteristic (ROC) curves of the XGBoost (XGB), Random Forest (RF), and Logistic Regression (LR) models.

(a) Testing set: XGB is the best-performing model, and all three models have statistically significant AUROCs (p < 0.001). (b) Household index testing set: XGB is again the best-performing model, but the AUROCs are less statistically significant because of the small sample size.
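As a rough illustration of how an AUROC and its significance can be computed, and of why a small test set weakens the p-value, the sketch below computes the AUROC as the fraction of correctly ordered positive/negative pairs and estimates a one-sided p-value by shuffling labels. The permutation test is an assumption made for illustration; the caption does not state how the authors obtained their p-values. The function names and the four-sample data are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels match or beat the AUROC."""
    rng = random.Random(seed)
    observed = auroc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 1 = resistant, 0 = non-resistant.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auroc(labels, scores)
p_value = permutation_p_value(labels, scores)
print(f"AUROC = {area:.2f}, permutation p = {p_value:.3f}")
```

With only four samples, many label shufflings reach the same AUROC by chance, so the p-value stays far above 0.05 even though the AUROC is well above 0.5. The same effect, at a smaller scale, is what the caption attributes to the small household index testing set.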

Table 3.

Predictive model performance.

XGBoost performed best on both the testing set of the low-confidence group and the HHI testing set, with AUROCs of 0.61 and 0.62, respectively.

Fig 5.

Shapley feature importance of the XGBoost model.

Points on the right, with positive SHAP values, indicate that the feature moves the prediction toward resistance. Red represents a high value of the feature, whereas blue represents a low value. Features are sorted vertically by their mean absolute influence on the prediction. Features in bold were also identified as important by the MASPC patterns shown in Table 2.
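The ordering rule in this caption (features ranked by mean absolute SHAP value) can be sketched on a toy example. The code below computes exact Shapley values by enumerating feature subsets against a baseline input. This is an illustrative stand-in, not the tree-based estimator typically used with XGBoost and not the authors' code; the linear `toy_model`, the feature names, and the patient rows are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def toy_model(v):
    # Hypothetical score; positive output means "toward resistance".
    return 0.8 * v[0] - 0.5 * v[1] + 0.1 * v[2]


feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical
patients = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]             # hypothetical rows
baseline = [0, 0, 0]

all_phi = [shapley_values(toy_model, row, baseline) for row in patients]

# Rank features by mean absolute Shapley value, as in the figure's ordering.
mean_abs = [sum(abs(phi[i]) for phi in all_phi) / len(all_phi)
            for i in range(len(feature_names))]
ranking = sorted(range(len(feature_names)), key=lambda i: -mean_abs[i])
for i in ranking:
    print(f"{feature_names[i]}: mean |SHAP| = {mean_abs[i]:.3f}")
```

In this toy setup a positive value for a feature pushes the score up (toward resistance in the caption's convention) and a negative value pushes it down, while the ranking ignores sign and uses only magnitude.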
