Generative AI mitigates representation bias and improves model fairness through synthetic health data

doi:10.1371/journal.pcbi.1013080

Fig 1.

Two-dimensional representations of the acute hypotension dataset for Black patients, including marginal distributions of the principal components.

Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data. CA-GAN provides the best overall coverage of the real data distribution, while SMOTE and WGAN-GP* show reduced coverage and mode collapse. Bottom panels: t-SNE two-dimensional representation of real (red) and synthetic (blue) data for the three methods SMOTE, WGAN-GP, and CA-GAN. CA-GAN covers the real distribution more uniformly, whereas SMOTE leaves a substantial part of it uncovered (top right of the panel) and the WGAN-GP* output is almost completely separated from the real data.

Fig 2.

Two-dimensional representations of the sepsis dataset for Black patients, including marginal distributions of the principal components.

Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data. CA-GAN provides more coverage than SMOTE (especially in the top right and bottom left of the panel), while WGAN-GP* provides the lowest coverage. Bottom panels: t-SNE two-dimensional representation of real (red) and synthetic (blue) data for the three methods SMOTE, WGAN-GP, and CA-GAN. SMOTE follows an interpolation pattern, whereas CA-GAN expands into the latent space, generating authentic data points while remaining within the clusters identified by t-SNE. Data generated by WGAN-GP* fall outside the real data.

Fig 3.

Distribution plots of each variable, overlaying real and synthetic data for the acute hypotension dataset.

The distributions of the blood pressure variables (MAP, diastolic, and systolic) are captured well by our method compared with WGAN-GP* and SMOTE. CA-GAN also performs better for categorical variables, while all three methods struggle with variables that have long-tailed, non-normal distributions. Top panel: CA-GAN. Middle panel: WGAN-GP. Bottom panel: SMOTE.

Table 1.

KL-Divergence and Maximum Mean Discrepancy between the distribution of real and synthetic data for each variable of the datasets.
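
As a rough illustration of the two metrics named in this caption, the sketch below computes a histogram-based KL divergence and a biased RBF-kernel estimate of squared MMD for a single variable. The function names, the number of bins, the smoothing constant, and the kernel bandwidth `gamma` are assumptions made for this example; the estimators and settings actually used for the table are not stated here and may differ.

```python
import numpy as np

def kl_divergence(real, synth, bins=20, eps=1e-10):
    """Histogram-based KL(real || synth) for one variable (illustrative only)."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Use a shared bin range so both histograms are comparable.
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    # Normalise to probabilities and smooth to avoid log(0) and division by 0.
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def mmd_rbf(real, synth, gamma=1.0):
    """Biased estimate of squared MMD with an RBF kernel for 1-D samples."""
    x = np.asarray(real, dtype=float).reshape(-1, 1)
    y = np.asarray(synth, dtype=float).reshape(-1, 1)

    def kernel(a, b):
        # Pairwise RBF kernel matrix between two column vectors.
        return np.exp(-gamma * (a - b.T) ** 2)

    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())

# Toy data standing in for one real and one synthetic variable.
rng = np.random.default_rng(0)
real_var = rng.normal(0.0, 1.0, size=500)
shifted_var = rng.normal(3.0, 1.0, size=500)
kl_shifted = kl_divergence(real_var, shifted_var)
mmd_shifted = mmd_rbf(real_var, shifted_var)
```

For both quantities, lower values indicate that the synthetic distribution is closer to the real one, and both are zero when the two samples are identical.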

Fig 4.

Kendall’s rank correlation coefficients for the real data and the data generated with CA-GAN, WGAN-GP*, and SMOTE.

Top panel: Acute hypotension data. Bottom panel: Sepsis data.

Table 2.

Minimum Euclidean distance between real and synthetic data generated by SMOTE, WGAN-GP* and CA-GAN. No method generates exact copies of the real data.
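
A minimal sketch of how such a check could be run is given below. It assumes each record (for example, a patient's time series) is flattened into a single vector before distances are taken; that flattening and the function name `min_euclidean_distance` are assumptions of this example, not details given in the caption.

```python
import numpy as np

def min_euclidean_distance(real, synth):
    """Smallest Euclidean distance between any synthetic and any real record."""
    # Flatten each record to one vector (assumption: e.g. time x features).
    real = np.asarray(real, dtype=float).reshape(len(real), -1)
    synth = np.asarray(synth, dtype=float).reshape(len(synth), -1)
    # Pairwise differences, shape (n_synth, n_real, n_dims).
    # This is memory-hungry for large datasets; chunking would be needed there.
    diffs = synth[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return float(dists.min())

# Toy records: a value of 0 would mean a synthetic record copies a real one.
real_records = [[0.0, 0.0], [1.0, 1.0]]
synth_records = [[0.0, 3.0], [4.0, 4.0]]
closest = min_euclidean_distance(real_records, synth_records)
```

A strictly positive minimum distance is what supports the statement that no synthetic record is an exact copy of a real one.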

Table 3.

Mean prediction errors of a BiLSTM trained on real, synthetic, and augmented data for a downstream prediction task. The numbers in parentheses represent the standard deviation.

Table 4.

Results of the downstream regression task on gender-conditioned data, based on Mean Absolute Error (MAE), with standard deviation shown in brackets. Biased Real is the real dataset from which we have removed 80% of the data from female patients.
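
One way the Biased Real subset described in this caption could be constructed is sketched below. The helper `make_biased`, the boolean `is_female` mask, the random seed, and the choice to drop records uniformly at random are all assumptions of this example; the caption only states that 80% of the data from female patients was removed.

```python
import numpy as np

def make_biased(X, is_female, drop_frac=0.8, seed=0):
    """Remove a fraction of the female-patient records, keeping all others."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    is_female = np.asarray(is_female, dtype=bool)
    female_idx = np.flatnonzero(is_female)
    n_drop = int(round(drop_frac * female_idx.size))
    # Pick the records to drop uniformly at random, without replacement.
    dropped = rng.choice(female_idx, size=n_drop, replace=False)
    keep = np.ones(is_female.size, dtype=bool)
    keep[dropped] = False
    return X[keep], is_female[keep]

# Toy cohort: 50 female and 50 male records with one feature each.
X_toy = np.arange(100).reshape(100, 1)
female_mask = np.arange(100) < 50
X_biased, female_biased = make_biased(X_toy, female_mask)
```

In this toy cohort the female group shrinks from 50 to 10 records while the male group is untouched, which is the kind of representation imbalance the downstream task is then evaluated on.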

Fig 5.

Proposed architecture of our CA-GAN. The Generator and the Discriminator are two deep networks with similar structure and number of parameters.

Both employ three stacked Bidirectional LSTMs (BiLSTMs) to capture the temporal relationships of longitudinal data. They are trained together adversarially in a minimax game. Conditioning is achieved with static labels, passed as input to both networks. The Generator also takes Gaussian noise as input and generates time-series data (synthetic patients). The Discriminator evaluates the plausibility of the Generator's output compared with the real data.
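
The structure described in this caption can be sketched in PyTorch as follows. This is only an illustration under stated assumptions: the layer sizes, the choice to concatenate the static label to every time step, the mean pooling in the Discriminator, and the class and parameter names are all hypothetical, and the loss and training loop are omitted entirely.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Noise + static label -> synthetic time series (sizes are assumptions)."""
    def __init__(self, noise_dim=16, n_labels=2, hidden=32, n_features=8):
        super().__init__()
        # Three stacked bidirectional LSTM layers, as in the caption.
        self.lstm = nn.LSTM(noise_dim + n_labels, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_features)

    def forward(self, noise, labels):
        # noise: (batch, time, noise_dim); labels: (batch, n_labels), static.
        cond = labels.unsqueeze(1).expand(-1, noise.size(1), -1)
        h, _ = self.lstm(torch.cat([noise, cond], dim=-1))
        return self.out(h)

class Discriminator(nn.Module):
    """Time series + static label -> one plausibility score per sequence."""
    def __init__(self, n_features=8, n_labels=2, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features + n_labels, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, series, labels):
        cond = labels.unsqueeze(1).expand(-1, series.size(1), -1)
        h, _ = self.lstm(torch.cat([series, cond], dim=-1))
        # Mean-pool over time before scoring (an assumption of this sketch).
        return self.out(h.mean(dim=1))

# Toy forward pass: 4 synthetic patients, 10 time steps, 2 label classes.
gen, disc = Generator(), Discriminator()
z = torch.randn(4, 10, 16)                       # Gaussian noise input
y = torch.eye(2)[torch.tensor([0, 1, 0, 1])]     # one-hot static labels
fake = gen(z, y)
scores = disc(fake, y)
```

Passing the same static labels to both networks is what makes the generation conditional: the Generator can be asked for patients of a given group, and the Discriminator judges plausibility given that group.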
