A privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data

doi:10.1371/journal.pone.0280192

Fig 1.

Schematic overview of Fed-GLMM.

Fed-GLMM enables the joint fitting of a GLMM to EHRs from multiple sites without sharing individual-level data. In step 1, each site fits the GLMM locally to obtain initial parameter estimates. In step 2, each site calculates intermediate summary statistics evaluated at the initial values and sends them to the central analytics. For the k-th site, these summary statistics are denoted as s_k and H_k, and they are functions of the local data D_k, the common parameter value, and the site-specific parameter value. The local data D_k is composed of the local design matrix for the common fixed effect X_k, the local design matrix for the site-specific fixed effect W_k, and the local outcome vector y_k. The site-specific parameter value is composed of the values of the site-specific fixed effect and the site-specific variance parameter. In step 3, the central analytics combines all the local intermediate results to construct a surrogate global likelihood function that provides updated parameter estimates. Steps 2–3 can be repeated iteratively to keep updating the parameter estimates.
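The communication pattern in steps 2 and 3 can be illustrated with a deliberately simplified sketch. It assumes an ordinary logistic regression with common fixed effects only (no random effects and no site-specific parameters), and it assumes that s_k and H_k are the gradient and Hessian of the local log-likelihood; the surrogate likelihood that Fed-GLMM actually constructs is not reproduced here. The function names `local_summaries`, `central_update`, and `fed_fit` are hypothetical.

```python
import numpy as np

def local_summaries(X_k, y_k, beta):
    """Site side (step 2): summaries of the local logistic log-likelihood at beta.

    Assumption: s_k is the gradient and H_k the Hessian. Only these two
    aggregates leave the site, never the rows of X_k or y_k.
    """
    p = 1.0 / (1.0 + np.exp(-X_k @ beta))
    s_k = X_k.T @ (y_k - p)
    H_k = -(X_k * (p * (1.0 - p))[:, None]).T @ X_k
    return s_k, H_k

def central_update(beta, summaries):
    """Central side (step 3): combine site summaries into one Newton update."""
    s = sum(s_k for s_k, _ in summaries)
    H = sum(H_k for _, H_k in summaries)
    return beta - np.linalg.solve(H, s)

def fed_fit(sites, beta0, n_iter=5):
    """Repeat steps 2 and 3 for a fixed number of communication rounds."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        summaries = [local_summaries(X_k, y_k, beta) for X_k, y_k in sites]
        beta = central_update(beta, summaries)
    return beta
```

In this simplified setting the summed gradients and Hessians equal those of the pooled data, so each round reproduces a pooled Newton step. That exactness is a property of the toy model and should not be read as a claim about the full GLMM case.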

Fig 2.

Accuracy of Fed-GLMM and meta-analysis estimates relative to the gold-standard pooled analysis.

We compared the accuracy of Fed-GLMM with that of the standard meta-analysis by calculating the median absolute relative difference from the gold-standard pooled estimate of the coefficient of a binary exposure variable. The underlying model has a binary outcome, a binary exposure, three additional covariates, 8 site-specific fixed-effect coefficients for the normally distributed covariate, and a patient-level random intercept. The model also includes 8 site-specific variance component parameters. We considered 25 combinations of outcome and exposure prevalence, with 100 simulation replicates per combination, to assess model accuracy. Fed-GLMM showed lower relative bias after 1–2 iterations than the meta-analysis, which was highly biased in the presence of rare events.
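The accuracy metric can be sketched as follows. The caption does not give the exact formula, so this assumes the absolute relative difference is |estimate - pooled| / |pooled|, computed per simulation replicate and summarized by the median; the function name is hypothetical.

```python
import numpy as np

def median_abs_relative_difference(estimates, pooled):
    """Median over replicates of |estimate - pooled| / |pooled|.

    Assumption: one federated (or meta-analysis) estimate and one
    pooled estimate per simulation replicate, in matching order.
    """
    estimates = np.asarray(estimates, dtype=float)
    pooled = np.asarray(pooled, dtype=float)
    return float(np.median(np.abs(estimates - pooled) / np.abs(pooled)))
```

Multiplying the result by 100 would express it as a percentage.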

Fig 3.

Comparison of computation time and estimate accuracy for Fed-GLMM and meta-analysis relative to gold-standard pooled analysis with increasing computing nodes/EHR subsets.

We compared Fed-GLMM with the meta-analysis using the ratio (in percentage) of their computation time to that of the pooled analysis. For each simulation replicate, we generated a single centralized EHR dataset. The underlying model has a binary outcome, a binary exposure, three additional covariates, and a patient-level random intercept. We divided the centralized EHR data into varying numbers of subsets to be processed in parallel. Both Fed-GLMM and the meta-analysis took less than 5% of the computation time required by the pooled analysis when the number of computing nodes exceeded 20. However, the relative bias of the meta-analysis for the exposure coefficient increased with the number of subsets, while Fed-GLMM retained its accuracy relative to the pooled analysis. The points and bars represent the median and interquartile range, respectively, of the computation time and relative bias, both in percentage.

Fig 4.

Adjusted odds ratios of virtual vs. in-person visit by patient and visit characteristics.

Using a forest plot, we visualized the adjusted odds ratios obtained through Fed-GLMM for all facilities (federated setting, to demonstrate privacy preservation) and for a single facility (centralized setting, to demonstrate computational improvement). The points and bars represent the point estimates and 95% confidence intervals, respectively. Abbreviations: OR, odds ratio; NH, non-Hispanic; LEP, limited English proficiency.
