Fig 1.
An illustration of the training and testing processes of GeM-LR.
Top panel: Training starts by initializing the embedded GMM through conventional GMM fitting, without using Y. GeM-LR is then estimated by the Expectation-Maximization (EM) algorithm, whose iterations jointly optimize the embedded GMM and the LR models. Bottom panel: For each test point, testing begins by computing a weight for each LR model from the fitted GMM. The final prediction is the weighted sum of the class posterior probabilities predicted by the LR models.
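The testing process in the bottom panel can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary outcome, Gaussian components with diagonal covariance, and hypothetical parameter names (`prior`, `mean`, `var`, `coef`, `intercept`) that do not come from the paper.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def gem_lr_predict(x, components):
    """Return (P(Y = 1 | x), weights) for one test point.

    Each component is a dict with hypothetical keys:
    'prior', 'mean', 'var' for the GMM part and 'coef', 'intercept' for the LR part.
    """
    # Step 1: weight of each LR model = posterior probability of its
    # mixture component given x, computed from the fitted GMM.
    joint = [c["prior"] * gaussian_pdf_diag(x, c["mean"], c["var"]) for c in components]
    total = sum(joint)
    weights = [j / total for j in joint]

    # Step 2: class posterior probability predicted by each LR model.
    probs = [
        sigmoid(c["intercept"] + sum(b * xi for b, xi in zip(c["coef"], x)))
        for c in components
    ]

    # Step 3: final prediction is the weighted sum of the LR predictions.
    return sum(w * p for w, p in zip(weights, probs)), weights


# Illustrative parameters only; these are not fitted values from any study.
components = [
    {"prior": 0.5, "mean": [-2.0], "var": [1.0], "coef": [1.0], "intercept": 0.0},
    {"prior": 0.5, "mean": [2.0], "var": [1.0], "coef": [-1.0], "intercept": 0.0},
]
p_mid, w_mid = gem_lr_predict([0.0], components)
p_left, w_left = gem_lr_predict([-2.0], components)
```

A test point near one component's mean receives almost all of its weight from that component, so the prediction there is close to that component's LR output; a point between components blends the two.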
Fig 2.
Visualizing GeM-LR results for HVTN 505 data.
A: Scatter plot of Env IgA vs. Env gp140-specific IgA. The cutoff value of 4 for Env gp140-specific IgA is shown as a blue dotted line. B: Violin plot of HIV-1 Env IgA for the two clusters generated by GeM-LR. C: Box plots of the AUCs from five repetitions of 5-fold CV, comparing GeM-LRs and the CP methods with C = 2:3. The left panel is for methods employing Env IgA as the clustering variable, whereas the right panel is for methods utilizing Env gp140-specific IgA for clustering. D: Violin plots of ADCP (top rows) and FcγRIIa (bottom rows) by infected/uninfected outcome, stratified by GeM-LR clustering result; the y-axes show the values of ADCP and FcγRIIa, respectively. p-values are computed by the nonparametric Mann–Whitney test of whether ADCP/FcγRIIa differs between the two clusters. E: Heatmaps visualizing the regression coefficients in the three GeM-LRs.
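For readers unfamiliar with the Mann–Whitney test mentioned in panel D, the sketch below shows one way to compute it. It is an assumption-laden illustration rather than the procedure used for the figure: it uses the two-sided normal approximation with midranks for ties and no tie or continuity correction, and the input values are made up.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for sample a, approximate p-value).
    Ties receive midranks; no tie or continuity correction is applied.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n = len(pooled)

    rank_sum_a = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1, ..., j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u1, p_value


# Made-up values standing in for a marker measured in two clusters.
u_sep, p_sep = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
u_same, p_same = mann_whitney_u([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

With samples this small an exact test would normally be preferred; the normal approximation is used here only to keep the sketch short.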
Fig 3.
VAST data visualization and analysis results.
A: Visualization by PCA, t-SNE, and UMAP with the protection status and cluster membership of individuals marked by different shapes and colors (NP: not protected; P: protected). B: Bar graphs showing the point estimates (heights of the bars) and 95% confidence intervals (black lines) of CV AUCs by GeM-LR and the competing methods. C: Stacked bar chart highlighting the distribution of protection status and the two vaccines within each cluster. D: Identified discriminative features for all three clusters with the corresponding accuracy Ac(h). E-G: Scatter plots on the selected features with individuals color-coded by their cluster memberships. The x- and y-axes represent the normalized values of the corresponding variables. H: Heatmap for visualizing GeM-LR and LR regression coefficients. The regression coefficient values are color-coded from red (larger) to blue (smaller).
Fig 4.
A: PCA, t-SNE, and UMAP visualizations with individual protection status and cluster memberships represented by shapes and colors (NP: not protected; P: protected). B: Bar graphs showing the point estimates (heights of the bars) and 95% confidence intervals (black lines) of CV AUCs by GeM-LR and the competing methods. C: Stacked bar chart highlighting the distribution of protection status and the three studies within each cluster. D: Scatter plot on one of the identified discriminative features with individuals color-coded by their cluster memberships. The x- and y-axes represent the normalized values of the corresponding variables. E: Heatmap for visualizing GeM-LR and LR regression coefficients. The regression coefficient values are color-coded from red (larger) to blue (smaller).