Abstract
Representation bias in health data can lead to unfair decisions and compromise the generalisability of research findings. As a consequence, underrepresented subpopulations, such as those from specific ethnic backgrounds or genders, do not benefit equally from clinical discoveries. Several approaches have been developed to mitigate representation bias, ranging from simple resampling methods, such as SMOTE, to recent approaches based on generative adversarial networks (GAN). However, generating high-dimensional time-series synthetic health data remains a significant challenge. In response, we devised a novel architecture (CA-GAN) that synthesises authentic, high-dimensional time series data. CA-GAN outperforms state-of-the-art methods in a qualitative and a quantitative evaluation while avoiding mode collapse, a serious GAN failure. We perform evaluation using 7535 patients with hypotension and sepsis from two diverse, real-world clinical datasets. We show that synthetic data generated by our CA-GAN improves model fairness in Black patients as well as female patients when evaluated separately for each subpopulation. Furthermore, CA-GAN generates authentic data of the minority class while faithfully maintaining the original distribution of data, resulting in improved performance in a downstream predictive task.
Author summary
Doctors and other healthcare professionals are increasingly using Artificial Intelligence (AI) to make better decisions about patients’ diagnosis, suggest optimal treatments, and estimate patients’ future health risks. These AI systems learn from existing health data, which might not accurately reflect the health of everyone, particularly people from certain racial or ethnic groups, genders, or those with lower incomes. This can mean the AI doesn’t work as well for these groups and could even make existing health disparities worse. To address this, we have developed purpose-built AI software that can create synthetic patient data. Synthetic data created by our software mimics real patient data without actually copying it, protecting patients’ privacy. Using our synthetic data results in a dataset that is more representative of all groups and ensures that AI algorithms learn to be fairer for all patients.
Citation: Marchesi R, Micheletti N, I-Hsien Kuo N, Barbieri S, Jurman G, Osmani V (2025) Generative AI mitigates representation bias and improves model fairness through synthetic health data. PLoS Comput Biol 21(5): e1013080. https://doi.org/10.1371/journal.pcbi.1013080
Editor: Piero Fariselli
Received: October 31, 2024; Accepted: April 22, 2025; Published: May 19, 2025
Copyright: © 2025 Marchesi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying this article are publicly available in the MIMIC-III repository (https://doi.org/10.13026/C2XW26): https://physionet.org/content/mimiciii/1.4/. The source code is available at the github repository: https://github.com/nic-olo/CA-GAN.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Clinical practice is poised to benefit from developments in machine learning as data-driven digital health technologies transform health care [1]. Digital health can catalyse the World Health Organisation’s (WHO) vision of promoting equitable, affordable, and universal access to health and care [2]. However, as machine learning methods increasingly weave themselves into the societal fabric, critical issues related to fairness and algorithmic bias in decision-making are coming to light. Algorithmic bias can originate from diverse sources, including socio-economic factors, where income disparities between ethno-racial groups are reflected in algorithms deciding which patients need care [3]. Bias can also originate from the underrepresentation of particular demographics, such as ethnicity, gender, and age in the datasets used to develop machine learning models, known as health data poverty [4]. Health data poverty impedes underrepresented subpopulations from benefiting from clinical discoveries, compromising the generalisability of research findings and leading to representation bias that can compound health disparities.
The machine learning community has developed several approaches to mitigate representation bias, with data resampling being the most widely used. Oversampling generates representative synthetic data from the underrepresented subpopulation (minority class), resulting in a similar or equal representation. The Synthetic Minority Over-sampling TEchnique (SMOTE) [5] is a representative example of this method, where synthetic samples lie between a randomly selected data sample and its randomly selected neighbour (found using the k-nearest neighbour algorithm [6]). SMOTE and related methods [7–9] are popular approaches due to their simplicity and computational efficiency.
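SMOTE's core interpolation rule can be sketched in a few lines: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its k nearest neighbours. This is an illustrative sketch, not the reference implementation; the function name `smote_sample` and all parameters are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples via SMOTE-style
    interpolation: each new point lies on the segment between a
    random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # the first neighbour of each point is the point itself, so skip column 0
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neigh_idx[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # random position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(50, 4))  # toy minority class
X_syn = smote_sample(X_min, n_new=100)
print(X_syn.shape)  # (100, 4)
```

Because every synthetic point is a convex combination of two real points, SMOTE can only interpolate within the convex hull of the minority class, which is precisely the limitation discussed below.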
However, SMOTE, when used with high-dimensional time-series data, may decrease data variability and introduce correlation between samples [10–12]. In response, alternative approaches based on Generative Adversarial Networks (GAN) are gaining ground [13–17]. GANs have shown incredible results in generating realistic images [18], text [19], and speech [20] in addition to improving privacy [21]. While GANs address some of the issues of SMOTE-based approaches, the generation of high-dimensional time-series data remains a significant research challenge [22–24].
To address this challenge, we propose a new generative architecture called Conditional Augmentation GAN (CA-GAN). Our CA-GAN extends Wasserstein GAN with Gradient Penalty [25, 26], presented in the Health Gym study [27] (referred to in this paper as WGAN-GP*). However, our work has a different objective. Instead of generating new synthetic datasets, we condition our GAN to augment the minority class only, while maintaining correlations between the variables and correlations over time, in contrast to the recent work [28]. As a result, CA-GAN captures the distribution of the overall dataset, including the majority class. We compare the performance of our CA-GAN with WGAN-GP* and SMOTE in generating synthetic data of patients of an underrepresented ethnicity (Black patients in our case) as well as gender (female). We use two critical care datasets comprising acute hypotension (n=3343) and sepsis (n=4192), resulting in 7535 patients overall.
Our datasets include both categorical and continuous variables with diverse distributions and are derived from the well-studied MIMIC-III critical care database [29]. These datasets were chosen to allow direct comparison with state-of-the-art approaches; nonetheless, our architecture is data agnostic.
Our work makes the following contributions: (1) We propose a new CA-GAN architecture that addresses the shortcomings of traditional and state-of-the-art methods in generating high-dimensional, time-series, synthetic data, using two real-world datasets. (2) Our multi-metric evaluation using qualitative and quantitative methods demonstrates superior performance of CA-GAN with respect to the state-of-the-art architecture, while avoiding mode collapse, a significant GAN failure. (3) We evaluate our CA-GAN against SMOTE, a naive but cost-effective resampling method that is, however, limited in its ability to synthesise authentic data. (4) We show the impact of synthetic data in improving model performance for the minority (underrepresented) class, resulting in a fairer model between Black and White ethnicities. (5) We also show that CA-GAN can synthesise realistic clinical data of a specific ethnicity and gender, improving the performance in a downstream predictive task.
2. Results
We primarily focus on a multi-metric evaluation of synthetic data generated by our CA-GAN architecture in comparison to the data generated by the state-of-the-art WGAN-GP* architecture and the popular SMOTE approach. We provide a separate analysis of the impact of synthetic data generated by our architecture to mitigate representation bias and improve model fairness in Section Improving model fairness. Considering the significant challenges in evaluating generative models in general [30], and high-dimensional time-series data in particular [23], we adopted a holistic approach to evaluating our work based on both qualitative and quantitative methods. We present the results of the generated data, comparing the performance of the three methods in augmenting the underrepresented (minority) class, namely Black ethnicity and female gender.
2.1. Qualitative evaluation
To gain initial insights into the obtained results, we conduct a qualitative evaluation employing visual representation methods that show the coverage of synthetic data with respect to the real data. We use Principal Component Analysis (PCA) to project the real and synthetic data onto a two-dimensional space. We also use t-distributed Stochastic Neighbor Embedding (t-SNE) [31] to plot both real and synthetic datasets in a two-dimensional latent space while preserving the local neighbourhood relationships between data points. To compare the performance between the methods and ensure a consistent visualisation of the real data, we have computed a common t-SNE embedding.
In S7 Appendix we present the results of Uniform Manifold Approximation and Projection (UMAP) [32], which offers better preservation of the global structure of the dataset when compared to t-SNE. The parameters of t-SNE and UMAP are the same for all three methods as shown in S3 Appendix.
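The projection procedure described above can be sketched as follows: PCA is fitted once and both datasets are projected into the same two-dimensional space, while for t-SNE a single joint embedding is computed so that real and synthetic points remain directly comparable. The data here are random stand-ins; dimensions and perplexity are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 10))       # stand-in for real patient features
synthetic = rng.normal(size=(200, 10))  # stand-in for generated data

# PCA: fit on real data, project both sets into the same 2-D space
pca = PCA(n_components=2).fit(real)
real_2d, syn_2d = pca.transform(real), pca.transform(synthetic)

# t-SNE: a single joint embedding keeps real and synthetic comparable,
# since t-SNE has no transform() for out-of-sample points
joint = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([real, synthetic]))
real_tsne, syn_tsne = joint[:200], joint[200:]
print(real_2d.shape, syn_tsne.shape)
```

Fitting a shared embedding (rather than one per method) is what makes the visual coverage comparisons across SMOTE, WGAN-GP* and CA-GAN consistent.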
The results are illustrated in Fig 1 (acute hypotension) and Fig 2 (sepsis). Synthetic data generated by CA-GAN exhibits significant overlap with the real data, indicating our model’s ability to accurately capture the underlying structure of real data. This is especially evident in PCA, where the representations reveal that the synthetic data generated by CA-GAN provides the best overall coverage of the real data distribution. Further evidence is provided from the marginal distributions. For the acute hypotension dataset (Fig 1), both WGAN-GP* and SMOTE show evidence of mode collapse (evident also in the t-SNE plots), where synthetic data is generated from a limited space. Similarly, for the sepsis dataset (Fig 2), CA-GAN covers more of the real data distribution compared to SMOTE, while WGAN-GP* again tends towards mode collapse.
Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data, where CA-GAN provides the best overall coverage of real data distribution, while SMOTE and WGAN-GP* show evidence of reduced coverage and mode collapse. Bottom panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP*, and CA-GAN. It can be seen that CA-GAN more uniformly covers the real distribution, while SMOTE does not cover a significant part of it (top right in the panel) and WGAN-GP* coverage is almost completely separated from the real data.
Top panels: PCA two-dimensional representation of real (red) and synthetic (blue) data, where CA-GAN provides more coverage than SMOTE (especially in the top right and bottom left part of the panel), while WGAN-GP* provides the lowest coverage. Bottom panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP*, and CA-GAN. It can be seen that SMOTE follows an interpolation pattern, while CA-GAN expands into the latent space, generating authentic data points while remaining within the clusters identified by t-SNE. Data generated by WGAN-GP* fall outside of the real data.
Figs 1 and 2 in the bottom panels show the t-SNE representations of the real and synthetic data. For the acute hypotension dataset, CA-GAN more uniformly covers the distribution of real data, while SMOTE does not cover a significant part of it (top right in the panel). This is also evident from the marginal distributions. WGAN-GP* coverage is almost completely separated from the real data. For the sepsis dataset, t-SNE shows that SMOTE follows an interpolation pattern failing to expand into the latent space. In contrast, CA-GAN successfully expands the distribution into the latent space, generating authentic data points, while remaining within the clusters identified by t-SNE. Data generated by WGAN-GP* fall outside of the real data.
Figs 1 and 2 provide evidence that the state-of-the-art WGAN-GP* appears to suffer from mode collapse, a significant limitation of GANs [33]. Mode collapse occurs when the generator produces a limited variety of samples despite being trained on a diverse dataset. The generator cannot fully capture the complexity of the target distribution, limiting the variety of generated samples and resulting in repetitive output. This is because the generator can get stuck in a local minimum where a few outputs are repeatedly generated, even though the training data contains more modes that can be learned. This presents a significant challenge in generating high-quality, authentic samples, while our CA-GAN model overcomes this limitation.
The evidence that CA-GAN captures accurately the underlying structure of real data is further reinforced based on joint distribution of variables, which we show in S4 Appendix. In S7 Appendix we also present the UMAP latent representation of the data, which preserves the global structure.
Finally, we show the distribution of individual variables of synthetic data overlaid on the distribution of the real data. We use this to compare the performance of our method with the state-of-the-art WGAN-GP* as well as SMOTE as the baseline method, using the acute hypotension dataset in Fig 3 and sepsis in S1 Appendix (joint distributions are shown in S4 Appendix). The distribution of synthetic data generated by our CA-GAN exhibits the closest match to that of the real data. This close alignment is particularly evident in variables related to blood pressure, including MAP, diastolic, and systolic measurements. However, certain variables, such as urine and ALT/AST, pose challenges for all three methods. These variables have highly skewed, non-Gaussian distributions with long tails, making them difficult to transform effectively using power or logarithmic transformations. In contrast, our CA-GAN and WGAN-GP* effectively capture the distribution of categorical variables. Conversely, SMOTE encounters difficulties with several variables, including the numeric variable of urine and the categorical variable of the Glasgow Coma Score (GCS). These observations are also reflected in the quantitative evaluation in Sect 2.2. The sepsis dataset not only contains more than twice as many variables as acute hypotension, but these variables also have more complex distributions. Variables such as SGOT, SGPT, total bilirubin, maximum dose of vasopressors, and others have extremely long tails. All three methods struggle to generate these kinds of distributions and show a tendency to converge to the median value. In contrast, for categorical variables and normally distributed numerical variables, the behaviour is similar to that observed for acute hypotension.
Distribution of variables related to blood pressure (MAP, diastolic and systolic) is captured well by our method in comparison to WGAN-GP* and SMOTE. CA-GAN also performs better for categorical variables, while all three methods struggle with variables with long-tailed, non-normal distributions. Top panel: CA-GAN. Middle panel: WGAN-GP*. Bottom panel: SMOTE.
2.2. Quantitative evaluation
We used Kullback-Leibler (KL) divergence [34] to measure the similarity between the discrete density function of the real data and that of the synthetic data. For each variable v of the dataset, we calculate:

D_KL(Q_v ‖ P_v) = Σ_x Q_v(x) log( Q_v(x) / P_v(x) )
where Q_v is the true distribution of the variable and P_v is the generated distribution. The smaller the divergence, the more similar the distributions; zero indicates identical distributions. The left half of Table 1a and 1b shows the results of the KL divergence for each variable. Our CA-GAN method has the lowest median across all variables for acute hypotension and sepsis data compared to WGAN-GP* and SMOTE. This is despite the fact that SMOTE is specifically designed to maintain the distribution of the original variables.
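A per-variable KL divergence of this kind can be estimated by discretising both samples into a shared set of bins, as in this sketch (the bin count, the smoothing constant `eps`, and the function name `kl_divergence` are our illustrative choices):

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(real_v, synth_v, bins=50, eps=1e-10):
    """Discretise one variable into shared bins and estimate
    D_KL(Q_v || P_v), with Q_v the real and P_v the synthetic density.
    eps avoids division by zero in empty bins."""
    edges = np.histogram_bin_edges(np.concatenate([real_v, synth_v]), bins=bins)
    q = np.histogram(real_v, bins=edges)[0].astype(float) + eps
    p = np.histogram(synth_v, bins=edges)[0].astype(float) + eps
    return entropy(q / q.sum(), p / p.sum())  # sum q * log(q / p)

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
close = kl_divergence(x, rng.normal(size=5000))          # matched distribution
far = kl_divergence(x, rng.normal(loc=3.0, size=5000))   # shifted distribution
print(close < far)  # True: matched distributions diverge less
```

Note that KL divergence is asymmetric, so the order of arguments (real as Q_v, synthetic as P_v) matters.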
In addition, we used Maximum Mean Discrepancy (MMD) [35] to calculate the distance between the distributions based on kernel embeddings, that is, the distance of the distributions represented as elements of a reproducing kernel Hilbert space (RKHS). We used a Radial Basis Function (RBF) kernel:

k(x, y) = exp( −‖x − y‖² / (2σ²) )
with σ = 1. The right half of Table 1a and 1b shows the MMD results for SMOTE, WGAN-GP* and our CA-GAN. Again, our model has the best median performance across all variables for acute hypotension data, while for sepsis data SMOTE's median differs from ours by only 0.00028. In summary, CA-GAN performs best on the acute hypotension dataset by a wide margin, while showing comparable performance to SMOTE on the sepsis dataset. A summary of distance metrics, including median, mean, standard deviation, maximum and minimum, is shown in S5 Appendix.
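A straightforward (biased, V-statistic) estimate of squared MMD with the RBF kernel above can be written directly from its definition; the function name `mmd_rbf` and the sample sizes are our illustrative choices:

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased V-statistic estimate of squared MMD with an RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    def gram(A, B):
        # squared Euclidean distances between all pairs of rows
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
near = mmd_rbf(X, rng.normal(size=(200, 3)))           # same distribution
far = mmd_rbf(X, rng.normal(loc=2.0, size=(200, 3)))   # shifted distribution
print(near < far)  # True: closer distributions give smaller MMD
```

MMD is zero when the two distributions are identical, which makes it a natural complement to the per-variable KL divergence for comparing whole multivariate samples.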
2.3. Variable correlations
We used the Kendall rank correlation coefficient τ [36] to investigate whether synthetic data maintained the original correlations between variables found in the real data of the acute hypotension and sepsis datasets. This choice is motivated by the fact that the τ coefficient does not assume a normal distribution, an assumption violated by some of our variables, particularly in the sepsis dataset (as shown in Fig 3 and S1 Appendix). Fig 4 shows the results of Kendall’s rank correlation coefficients. For the acute hypotension dataset (top panel of Fig 4), CA-GAN captures the original variable correlations, as does SMOTE, with the former having the closest results on categorical variables, while the latter on numerical ones. WGAN-GP* shows the worst performance, accentuating correlations that do not exist in real data. Similar patterns are also obtained for the variables of patients with sepsis in the bottom panel of Fig 4. For additional insight, S8 Appendix presents the absolute difference between correlations, highlighting the same patterns.
Top panel: Acute hypotension data. Bottom panel: Sepsis data.
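A pairwise Kendall τ matrix for real and synthetic data can be computed with `scipy.stats.kendalltau`; the absolute difference between the two matrices then summarises how well correlations are preserved. The data, noise scale, and function name `kendall_matrix` below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_matrix(X):
    """Pairwise Kendall tau correlation matrix; rank-based, so it makes
    no normality assumption about the variables."""
    d = X.shape[1]
    M = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            M[i, j] = M[j, i] = kendalltau(X[:, i], X[:, j])[0]
    return M

rng = np.random.default_rng(0)
real = rng.normal(size=(300, 4))
real[:, 1] += 0.8 * real[:, 0]  # inject a known correlation
# stand-in "synthetic" data: real data plus small perturbations
synth = real + rng.normal(scale=0.1, size=real.shape)

diff = np.abs(kendall_matrix(real) - kendall_matrix(synth))
print(diff.max())  # small values = correlations well preserved
```

Heatmaps of these matrices (as in Fig 4) and of their absolute difference (as in S8 Appendix) make the comparison visual.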
2.4. Synthetic data authenticity
When generating synthetic data, the output must be a realistic representation of the original data. Still, we also need to verify that the model has not merely learned to copy the real data. GANs are prone to overfitting by memorising the real data [37]; therefore, we use Euclidean Distance (L2 Norm) to evaluate the originality of our model’s output. Our analysis shows that the smallest distance between a synthetic and a real sample is 52.6 for acute hypotension and 44.2 for sepsis, indicating that the generated synthetic data are not a mere copy of the real data. This result, coupled with the visual representation of CA-GAN (shown in Figs 1 and 2), illustrates the ability of our model to generate authentic data. SMOTE, which by design interpolates the original data points, is unable to explore the underlying multidimensional space. Therefore its generated data samples are much closer to the real ones, with a minimum Euclidean distance of 0.0023 for acute hypotension and 0.033 for sepsis. Table 2 shows a summary of the minimum Euclidean distance for each method.
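This memorisation check amounts to finding the smallest pairwise Euclidean distance from any synthetic sample to any real sample; near-zero values flag near-duplicates. A minimal sketch (the function name `min_l2_to_real` and the toy data are ours):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def min_l2_to_real(real, synthetic):
    """Smallest Euclidean (L2) distance from any synthetic sample to any
    real sample; near-zero values suggest memorised (copied) records."""
    return pairwise_distances(synthetic, real).min()

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 8))
copies = real[:10] + 1e-6                  # a memorising model: near-duplicates
novel = rng.normal(loc=0.5, size=(10, 8))  # genuinely new samples

print(min_l2_to_real(real, copies) < min_l2_to_real(real, novel))  # True
```

For time-series patients, each record would first be flattened (or compared per time step) before computing the distances.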
2.5. Improving model fairness
Having evaluated the quality of synthetic data generated by our CA-GAN, we now focus on evaluating the potential of synthetic data in improving model fairness. The machine learning literature has highlighted many examples in which class-imbalanced datasets lead to unfair decisions for the minority class [38]. This is especially important when those decisions significantly affect the health of patients and society as a whole, as is the case with our study. We measure fairness by comparing the performance of the model between subgroups, based on the approach described in [39]. In this respect, we have performed an analysis to understand the impact of synthetic data on the performance of a predictive model within Black patients and White patients separately. For this purpose, we chose the task of predicting blood lactate within each ethnic subgroup, based on our previous work [40]. Blood lactate is essential in guiding clinical decision-making, especially for patients with sepsis, and ultimately affects patients’ survival [41, 42]. Therefore, any differences between Black and White patients in lactate prediction performance would result in potentially unfair treatment decisions due to data representation bias.
Based on our work in [40], we trained a Random Forest classifier to predict whether the last lactate value in the time series of patients was above a clinically relevant threshold, using as input the previous observations of the clinical variables in our dataset. Initially, we devised a fixed test set of real data that was not seen by our generative model, and we ran our generative model five times. Using only the real (original) sepsis dataset, the predictive performance of the classifier within the Black patient cohort was an AUC of 0.569 (±0.020), in contrast to an AUC of 0.652 (±0.016) within the White patient cohort. For comparison, the overall performance of the model when including both ethnicities was an AUC of 0.611 (±0.013). Then we augmented the original Black patient cohort with synthetic data (conditioned on Black patients only) generated by our CA-GAN model, such that the representation between Black and White patients was equal. As a consequence, the predictive performance within the Black patient cohort increased to an AUC of 0.620 (±0.017), while for White patients it was an AUC of 0.643 (±0.014) and an AUC of 0.629 (±0.013) for the overall test set with both ethnicities.
Overall, synthetic data augmentation resulted in a fairer model between ethnicities. The performance gap between ethnicities was reduced from 8% (ΔAUC = 0.083±0.028) to 2% (ΔAUC = 0.025±0.015), with a statistically significant difference between the non-augmented and augmented datasets (p=0.0236).
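The fairness metric used here, the per-subgroup AUC gap (ΔAUC), can be sketched as follows. All data are random stand-ins, and the 0/1 `group` variable is an illustrative placeholder for ethnicity; only the evaluation pattern reflects the analysis above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # toy outcome
group = rng.integers(0, 2, size=1000)                  # stand-in subgroup label

# simple train/test split; in practice the test set is fixed and held out
clf = RandomForestClassifier(random_state=0).fit(X[:700], y[:700])
proba = clf.predict_proba(X[700:])[:, 1]
y_te, g_te = y[700:], group[700:]

# per-subgroup AUC and the fairness gap (delta AUC)
aucs = {g: roc_auc_score(y_te[g_te == g], proba[g_te == g]) for g in (0, 1)}
gap = abs(aucs[0] - aucs[1])
print(aucs, gap)
```

Repeating this before and after augmenting the minority subgroup with synthetic data, and averaging over several generator runs, yields the ΔAUC reduction reported above.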
2.6. Downstream regression task on ethnicity
Finally, we also sought to evaluate the ability of CA-GAN to maintain the temporal properties of time series data, considering ethnicity and gender (in the following section). Since our objective is to augment the minority class to mitigate representation bias, we wanted to verify that datasets augmented with synthetic data generated by our model can maintain or improve the predictive performance of the original data on a downstream task. Initially, we trained a Bidirectional Long Short-Term Memory (BiLSTM) with real data only as the baseline. We then trained BiLSTMs on the synthetic and augmented datasets separately, considering ethnicity and gender diversity, and evaluated their performance on a regression task. To address the inherent randomness in the models, we used CA-GAN to create five different synthetic datasets. For each of these datasets, we trained five BiLSTM models, resulting in a total of 25 BiLSTM models (5 datasets × 5 models per dataset). A complete description of how we performed these experiments is provided in S6 Appendix.
Table 3a and 3b show the mean absolute errors between the BiLSTM prediction and the actual acute hypotension and sepsis observations, respectively. In these experiments, the test sets consisted exclusively of patients from the minority class. In the first column, we show the results achieved using only the real data to make the predictions; in the second column, the results using only the synthetic data; and finally, in the third column, the results achieved by predicting with the augmented dataset, that is, with both the real and synthetic data together, considering both ethnic and gender diversity in the augmentation process. Overall, adding the synthetic data reduces the predictive error. This indicates that the temporal characteristics of the data generated by our CA-GAN model are close enough to those of the real data to maintain the original predictive performance. Thus, the augmented dataset could be used in a downstream task, mitigating the representation bias.
It should be noted that the errors in Input Fluids Total and Total Urine Output are exceptionally high compared to the other variables. This is because predicting these variables is generally challenging, stemming partly from how they are collected and recorded rather than an issue inherent to synthetic data generation.
2.7. Downstream regression task on gender-conditioned data
As a final test, we have devised an additional dataset of patient subpopulations with an artificially biased gender representation. Specifically, from the original sepsis data (defined as “Real”), we have removed 80% of the data from female patients to introduce a gender-based bias (“Biased Real” dataset). We used the biased dataset to train our architecture and generate synthetic data. To evaluate the performance of our approach, we applied a downstream regression task similar to the one presented above for the ethnicity data. A test set consisting of patients from the minority class (i.e. female patients) was kept separate for this evaluation.
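Constructing such an artificially biased dataset is a simple subsampling operation: keep all majority-gender patients and a fixed fraction of the minority gender. A pandas sketch with toy data (column names `patient_id` and `gender` are our illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "patient_id": np.arange(1000),
    "gender": rng.choice(["F", "M"], size=1000),  # toy patient roster
})

# keep all male patients but only 20% of female patients,
# i.e. remove 80% of the female data to introduce a gender bias
female = df[df["gender"] == "F"]
kept_f = female.sample(frac=0.2, random_state=0)
biased = pd.concat([df[df["gender"] == "M"], kept_f]).sort_index()

print(round(len(kept_f) / len(female), 2))  # 0.2
```

The held-out test set of female patients would be set aside before this subsampling so that evaluation is never performed on data seen (directly or via the generator) during training.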
Our results show that the performance of data generated by our model is comparable to that of real data. Namely, using the original dataset, we obtain a Mean Absolute Error (MAE) of 4.784 (0.393), while we obtain an MAE of 3.193 (0.052) using the synthetic dataset, indicating that CA-GAN generates faithful synthetic data. What is even more interesting is that combining synthetic data with the real data further reduced the MAE to 2.870 (0.220), which is lower than that of the real data alone (MAE of 3.294 (0.241)).
3. Discussion
As machine intelligence scales upwards in clinical decision-making, the risk of perpetuating existing health inequities increases significantly. This is because biased decision-making can continuously feed back into the data used to train the models, creating a vicious circle that further ingrains discrimination towards underrepresented groups. Representation bias, in particular, frequently occurs in health data, leading to decisions that may not be in the best interests of all patients, favouring specific subpopulations while treating underrepresented subpopulations unfavourably, such as those defined by characteristics including ethnicity, gender, and disability.
To address these issues, representation must be improved before algorithmic decision-making becomes integral to clinical practice. While unequal representation is a multifaceted challenge involving diverse factors such as socio-economic, cultural, systemic, and data, our work represents a step towards addressing one significant facet of this challenge: mitigating existing representation bias in health data.
We have shown that our work can generate high-quality synthetic data when evaluated against state-of-the-art architectures and traditional approaches such as SMOTE. SMOTE has notable advantages over other data generation techniques as it requires no training and can work with smaller datasets. It can mirror non-normal distributions, even if it tends to overestimate the median in long-tailed distributions. This is in contrast to GANs, which struggle with these types of distributions. However, generating authentic data remains a significant challenge for SMOTE, which is especially important when considering data confidentiality and patients’ privacy.
Through qualitative and quantitative evaluation, we have shown that CA-GAN can generate authentic data samples with high distribution coverage, avoiding mode collapse failure, while ensuring that the generated data are not copies of the real data. We have also shown that augmenting the dataset with the synthetic data generated by CA-GAN leads to lower errors in the downstream regression task. This indicates that our model can generalise well from the original data.
A notable advantage of our approach is that it uses the overall dataset, and not only the minority class, as is the case with WGAN-GP* and SMOTE. This means that CA-GAN can be applied in smaller datasets and those with highly imbalanced classes, such as rare diseases.
Furthermore, we evaluated our method on two datasets with diverse characteristics and found that our CA-GAN performed better on the acute hypotension dataset. This may be because some of the numerical variables in the sepsis dataset have long-tailed distributions, presenting a modelling challenge for all the methods. Similar challenges are observed for variables with non-normal distributions. Additionally, the sepsis dataset contained fewer data points per patient than acute hypotension (15 versus 48 observations). A shorter input sequence may have made it difficult for the BiLSTM modules to learn the underlying structure of the original data effectively, coupled with a higher number of variables (twice as many) in the sepsis dataset. We also note that the lower number of patients in the acute hypotension dataset does not impact the generative performance of our method. This ability to work with fewer data points (patients) is encouraging, given our overall goal of augmenting representation.
The task of generalising to unseen data categories is particularly challenging due to the inherent unknowns these categories represent. While a definitive strategy is yet to be developed, we believe that integrating conditional generation with metric learning, as seen in prototypical networks [43], could provide a dual advantage. This integration could not only facilitate the generation of novel data points but also offer a quantifiable framework to assess their similarity or dissimilarity to known data categories. Such an approach could extend the capabilities of CA-GAN beyond data augmentation, potentially improving the interpretability and applicability of the synthetic data generated. A future development of this work involves using our architecture to study counterfactual examples in downstream tasks [44]. Counterfactuals would make it possible to identify specific conditions that produce disparities in model prediction, improving the understanding of bias mitigation.
In terms of evaluation metrics, our current approach primarily focuses on statistical properties of the generated data. However, to ascertain the practical utility and accuracy of the synthetic data produced by CA-GAN, we recognise the importance of domain-specific validations, including performance in subpopulations [45, 46]. Inspired by the collaborative efforts outlined in [47], we are committed to exploring partnerships with experts in relevant fields such as healthcare and clinical practice to guide the development of evaluation metrics. Their input will ensure that our synthetic datasets can meet the rigorous demands of real-world applications and contribute meaningfully to the domains they are intended to serve.
Our architecture can provide a solid basis to generate privacy-preserving synthetic data and mitigate barriers to accessing clinical data. This is because we have ensured that the synthetic data generated by our CA-GAN are not a mere copy of the real data on one hand, while on the other, the synthetic data reflect the distributions of the real data. However, privacy-preserving aspects will require additional analysis, such as reconstruction attacks, which are beyond the scope of the current work.
While the CA-GAN architecture showed superior performance with respect to the state-of-the-art method as well as computationally inexpensive approaches, some limitations are present. Namely, CA-GAN may require additional optimisations to further increase performance on datasets with variables with non-Gaussian distributions and those with long-tailed distributions. Furthermore, additional analysis will be required to evaluate the generalisation capability of our architecture with datasets of different characteristics. In this respect, we aim to refine CA-GAN while exploring alternative architectures (such as Diffusion Models) to address some of these limitations. One promising direction to potentially improve the efficiency of our model is the use of convolutional neural networks (CNNs) or Temporal Convolutional Networks (TCNs). The work of Bai et al. [48] provides an empirical foundation for this approach, indicating that such network structures can rival the performance of recurrent networks for sequence modelling tasks. Additionally, prior research [49] suggests that simplifying the internal mechanisms of these recurrent units can lead to improvements in both performance and computational efficiency. These insights provide a strong impetus for our future work, where we aim to refine the CA-GAN model to harness the benefits of these alternative architectures without compromising its ability to perform conditional generation.
Finally, we are aware that the use of synthetic data may have several ethical and policy implications, including the fact that synthetic data cannot fully address the historical biases and discriminatory practices that are often reflected in the data [50]. While our work can mitigate existing representation biases, we must ensure that this does not come at the risk of disincentivising the participation of underrepresented groups or perpetuating other types of data biases [51, 52]. Lastly, while we showed the utility of synthetic data, we also note that findings should always be confirmed using real data.
4. Methods
We begin by formally stating the problem we address. We then discuss the data sources used to train our models and compare and contrast Generative Adversarial Networks (GANs) and Conditional Generative Adversarial Networks (CGANs). We also provide an in-depth analysis of the baseline model for this work, WGAN-GP*. Finally, we present the architecture (Fig 5) of our proposed Conditional Augmentation GAN (CA-GAN) and discuss its advantages over other methods.
Both the Generator and the Discriminator employ three stacked Bidirectional LSTMs (BiLSTMs) to capture the temporal relationships of longitudinal data. They are trained together adversarially, in a minimax game. Conditioning is achieved with static labels, passed as input to both networks. The Generator also takes Gaussian noise as input and generates time-series data (synthetic patients). The Discriminator evaluates the plausibility of the Generator’s output, compared with the real data.
4.1. Ethics statement
The data in MIMIC-III was previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research.
4.2. Problem formulation
Let A be a vector space of features and let a ∈ A represent a feature vector. Let L = {0, 1} be a binary distribution modifier, and let l be a binary mask extracted from L. We consider a data set D0 = {a_n} with l = 0, where individual samples are indexed by n ∈ {1, …, N}, and a data set D1 = {a_m} with l = 1, where individual samples are indexed by m ∈ {1, …, M}, with N > M. We define the training data set D as D = D0 ⋃ D1.

Our goal is to learn a density function d̂(A) that approximates the true distribution d(A) of D. We also define d̂1(A) as d̂(A) with l = 1 applied.

To balance the number of samples in D, we draw random variables X from d̂1(A) and add them to D1 until M = N. Thus, we balance out D.
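As an illustration, the balancing step can be sketched as follows. Here `sample_synthetic` is a placeholder Gaussian sampler standing in for draws from the learned density with l = 1 (a trained generator would be called instead); the dataset sizes and feature count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic(n_samples, n_features):
    # Placeholder for draws from the learned density with l = 1;
    # a trained CA-GAN generator would be called here instead.
    return rng.normal(size=(n_samples, n_features))

# Toy imbalanced dataset: N majority (l = 0) vs. M minority (l = 1) samples.
D0 = rng.normal(size=(1000, 5))   # l = 0, N = 1000
D1 = rng.normal(size=(150, 5))    # l = 1, M = 150

# Draw synthetic minority samples until both classes have N samples.
deficit = len(D0) - len(D1)
D1_balanced = np.vstack([D1, sample_synthetic(deficit, D1.shape[1])])

D = np.vstack([D0, D1_balanced])  # balanced training set
assert len(D1_balanced) == len(D0)
```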
4.3. Data sources, variables and patient population
Our analysis uses two datasets extracted from the MIMIC-III database. The detailed data pre-processing steps are outlined in our previous publication [27], and in S2 Appendix. We chose these two datasets as they have already been used in the study describing WGAN-GP* [27] making the comparison with our method fairer. We decided to test the methods for the oversampling of only one minority class, thus including only patients that belonged to the White (coded as Caucasian) or Black ethnic groups. We used a similar approach for gender, shown in Subsect 2.7.
The acute hypotension dataset comprises 3343 patients admitted to critical care; the patients were either of Black (395) or White (2948) ethnicity. Each patient is represented by 48 data points, corresponding to the first 48 hours after the admission, and 20 variables, namely nine numeric, four categorical, and seven binary variables. Details of this dataset are presented in S2 Appendix.
The Sepsis dataset comprises 4192 patients admitted to critical care of either Black (461) or White (3731) ethnicity. Each patient is represented by 15 data points, corresponding to observations taken every four hours from admission, and 44 variables, namely 35 numeric, six categorical, and three binary variables. Details of this dataset are presented in S2 Appendix.
4.4. GAN vs CGAN
The Generative Adversarial Network (GAN) [53] entails two components: a generator and a discriminator. The generator G is fed a noise vector z taken from a latent distribution p_z and outputs a sample of synthetic data. The discriminator D inputs either fake samples created by the generator or real samples x taken from the true data distribution p_data. Hence, the GAN can be represented by the following minimax loss function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst the purpose of the generator is to make samples realistic enough to fool the discriminator, i.e., to minimise $V(D, G)$. As a result of this reciprocal competition, both the generator and the discriminator improve during training.
The limitations of vanilla GAN models become evident when working with highly imbalanced datasets, where there might not be sufficient samples to train the models to generate minority-class samples. A modified version of GAN, the Conditional GAN [54], solves this problem by using labels y in both the generator and the discriminator. The additional information y divides the generation and the discrimination into different classes. Hence, the model can now be trained on the whole dataset to generate only minority-class samples. Thus, the loss function is modified as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid y)))]$$
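As a numerical illustration of the conditional value function, the sketch below evaluates the minimax objective for toy stand-in networks: the "discriminator" is a sigmoid over a linear map of the features concatenated with the label, and the "generator" simply shifts the noise by the label. These are illustrative placeholders, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy conditional discriminator and generator; real networks replace these.
w_d = rng.normal(size=6)                     # 5 features + 1 label column
D = lambda x, y: sigmoid(np.c_[x, y] @ w_d)  # D(x | y) in (0, 1)
G = lambda z, y: z + 0.1 * y                 # G(z | y): label-shifted noise

x_real = rng.normal(size=(256, 5))
y = rng.integers(0, 2, size=(256, 1)).astype(float)
z = rng.normal(size=(256, 5))

# V(D, G) = E[log D(x | y)] + E[log(1 - D(G(z | y) | y))]
v = np.mean(np.log(D(x_real, y))) + np.mean(np.log(1.0 - D(G(z, y), y)))
assert v < 0.0  # both terms are logs of probabilities, hence negative
```

The discriminator tries to push the first term towards 0 (log of probabilities near 1) while the generator tries to make the second term as negative as possible.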
GAN and CGAN, overall, share the same significant weaknesses during training, namely mode collapse and vanishing gradient [33]. In addition, as GANs were initially designed to generate images, they have been shown to be unsuitable for generating time-series [55] and discrete data samples [56].
4.5. WGAN-GP*
The WGAN-GP* introduced by Kuo et al. [27] solved many of the limitations of vanilla GANs. The model was a modified version of WGAN-GP [25, 26]; thus, it applied the Earth Mover (EM) distance [57] to the distributions, which had been shown to solve both vanishing gradient and mode collapse [58]. In addition, the model applied a gradient penalty during training, which helped to enforce the Lipschitz constraint on the discriminator efficiently. In contrast with vanilla WGAN-GP, WGAN-GP* employed soft embeddings [59, 60], which allowed the model to use inputs as numeric vectors for both binary and categorical variables, and a Bidirectional LSTM layer [61, 62], which allowed for the generation of time-series samples. While $L_D$ was kept the same, $L_G$ was modified by Kuo et al. [27] by introducing an alignment loss, which helped the model better capture the correlation among variables over time. Hence, the loss functions of WGAN-GP* are the following:

$$L_D = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big]$$

$$L_G = -\mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] + \lambda_{\mathrm{corr}} \sum_{i < j} \big| r_{\mathrm{syn}}(X^{(i)}, X^{(j)}) - r_{\mathrm{real}}(X^{(i)}, X^{(j)}) \big|$$
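To make the gradient-penalty term concrete, here is a minimal sketch using a linear critic D(x) = x·w, for which the input gradient is the constant vector w, so the penalty can be computed in closed form. The interpolation between real and synthetic samples follows Gulrajani et al. [26]; the critic and the λ value are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)                 # linear critic: D(x) = x @ w
lam = 10.0                             # penalty weight (illustrative value)

x_real = rng.normal(size=(64, 5))
x_fake = rng.normal(size=(64, 5))

# Random interpolates: x_hat = eps * x_real + (1 - eps) * x_fake
eps = rng.uniform(size=(64, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For a linear critic, grad_x D(x_hat) = w for every sample, so the
# penalty reduces to lam * E[(||w||_2 - 1)^2].
grad_norm = np.full(len(x_hat), np.linalg.norm(w))
penalty = lam * np.mean((grad_norm - 1.0) ** 2)

# Normalising w makes the critic 1-Lipschitz, driving the penalty to zero.
w_unit = w / np.linalg.norm(w)
penalty_unit = lam * (np.linalg.norm(w_unit) - 1.0) ** 2
assert penalty >= 0.0 and penalty_unit < 1e-12
```

With a neural-network critic, the gradient is obtained by automatic differentiation rather than in closed form, but the penalty term has exactly this shape.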
To calculate the alignment loss, we computed Pearson’s r correlation [63] for every unique pair of variables $X^{(i)}$ and $X^{(j)}$. We then applied the L1 loss to the differences in the correlations between $r_{\mathrm{syn}}$ and $r_{\mathrm{real}}$, with $\lambda_{\mathrm{corr}}$ representing a constant acting as a strength regulator of the loss.
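The alignment loss described above can be sketched in a few lines of NumPy: Pearson's r is computed for every unique variable pair in the real and synthetic batches, and the L1 difference between the two correlation matrices is scaled by λ_corr (the λ_corr value here is illustrative):

```python
import numpy as np

def alignment_loss(x_real, x_syn, lambda_corr=0.1):
    """L1 distance between the pairwise Pearson correlations of real
    and synthetic data, scaled by lambda_corr."""
    r_real = np.corrcoef(x_real, rowvar=False)   # variables in columns
    r_syn = np.corrcoef(x_syn, rowvar=False)
    i, j = np.triu_indices_from(r_real, k=1)     # unique pairs only
    return lambda_corr * np.sum(np.abs(r_syn[i, j] - r_real[i, j]))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 6))
assert alignment_loss(real, real) == 0.0         # identical data: zero loss
assert alignment_loss(real, rng.normal(size=(200, 6))) > 0.0
```

In training, this quantity is added to the generator loss so that the synthetic data are penalised when their inter-variable correlations drift from those of the real data.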
In their follow-up papers, Kuo et al. noted that their simulated data based on their proposed WGAN-GP* lacked diversity. In [64], the authors found that WGAN-GP* continued to suffer from mode collapse like the vanilla GAN. Similar to our own CA-GAN, the authors extended the WGAN-GP* setup with a conditional element where they externally stored features of the real data during training and replayed them to the generator sub-network at test time.
In [65], the same group of researchers also experimented with diffusion models [66] and found that diffusion models better represent binary and categorical variables. Nonetheless, they demonstrated that GAN-based models encoded less bias (in the means and variances) of the numeric variable distributions.
4.6. CA-GAN
We built our CA-GAN by conditioning the generator and the discriminator on static labels y. Hence, the updated loss functions used by our model are as follows:

$$L_D = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x} \mid y)] - \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x \mid y)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x} \mid y) \rVert_2 - 1)^2\big]$$

$$L_G = -\mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x} \mid y)] + \lambda_{\mathrm{corr}} \sum_{i < j} \big| r_{\mathrm{syn}}(X^{(i)}, X^{(j)}) - r_{\mathrm{real}}(X^{(i)}, X^{(j)}) \big|$$
Here y can be any categorical label. During training, the label y was used to differentiate the minority from the majority class, and during generation it was used to create fake samples of the minority class.
Compared to WGAN-GP*, we also increased the number of BiLSTMs from one to three in both the generator and the discriminator, as stacked BiLSTMs have been shown to better capture complex time-series [67]. In addition, we decreased the learning rate and batch size during training. An overview of the CA-GAN architecture is shown in Fig 5.
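A minimal sketch of how the conditional input to the generator can be assembled: the static label y is tiled across all time steps and concatenated with the Gaussian noise sequence, yielding the tensor that a stacked-BiLSTM generator would consume. The shapes follow the acute hypotension dataset (48 time steps); the noise dimension is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, t_steps, z_dim = 32, 48, 16        # 48 hourly observations per patient
y = rng.integers(0, 2, size=(batch, 1))   # static label (e.g. minority class)

# Gaussian noise sequence: one noise vector per time step.
z = rng.normal(size=(batch, t_steps, z_dim))

# Tile the static label across time and concatenate along the feature axis.
y_seq = np.repeat(y[:, None, :], t_steps, axis=1).astype(float)
g_input = np.concatenate([z, y_seq], axis=-1)

assert g_input.shape == (batch, t_steps, z_dim + 1)
```

The discriminator receives the same tiled label concatenated with either real or synthetic sequences, so both networks see the conditioning information at every time step.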
Supporting information
S4 Appendix. Joint distributions of variables.
https://doi.org/10.1371/journal.pcbi.1013080.s004
(PDF)
S6 Appendix. Description of downstream regression task.
https://doi.org/10.1371/journal.pcbi.1013080.s006
(PDF)
S8 Appendix. Absolute differences in correlations.
https://doi.org/10.1371/journal.pcbi.1013080.s008
(PDF)
References
- 1. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. pmid:30617339
- 2. World Health Organization. Global strategy on digital health 2020–2025. World Health Organization; 2021.
- 3. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. pmid:31649194
- 4. Ibrahim H, Liu X, Zariffa N, Morris AD, Denniston AK. Health data poverty: an assailable barrier to equitable digital health care. Lancet Digit Health. 2021;3(4):e260–5. pmid:33678589
- 5. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- 6. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967;13(1):21–7.
- 7. Han H, Wang W, Mao B. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2005. p. 878–87.
- 8. Gosain A, Sardana S. Farthest SMOTE: a modified SMOTE approach. In: Advances in Intelligent Systems and Computing. Singapore: Springer; 2019. p. 309–20.
- 9. Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. IJKESDP. 2011;3(1):4.
- 10. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14:106. pmid:23522326
- 11. Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res. 2018;61:863–905.
- 12. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- 13. Lu C, Reddy C, Wang P, Nie D, Ning Y. Multi-label clinical time-series generation via conditional GAN. arXiv preprint. 2022. https://arxiv.org/abs/2204.04797
- 14. Engelmann J, Lessmann S. Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Exp Syst Appl. 2021;174:114582.
- 15. Zheng M, Li T, Zhu R, Tang Y, Tang M, Lin L, et al. Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Inf Sci. 2020;512:1009–23.
- 16. Seibold M, Hoch A, Farshad M, Navab N, Fürnstahl P. Conditional generative data augmentation for clinical audio datasets. arXiv preprint. 2022. https://arxiv.org/abs/2203.11570
- 17. Gao X, Deng F, Yue X. Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing. 2020;396:487–94.
- 18. Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019. p. 4396–405.
- 19. de Rosa GH, Papa JP. A survey on text generation using generative adversarial networks. Pattern Recognit. 2021;119:108098.
- 20. Kong J, Kim J, Bae J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst. 2020;33:17022–33.
- 21. Savage N. Synthetic data could be better than real data. Nature. 2023. https://doi.org/10.1038/d41586-023-01445-8 pmid:37106108
- 22. Brophy E, Wang Z, She Q, Ward T. Generative adversarial networks in time series: a systematic literature review. ACM Comput Surv. 2023;55(10):1–31.
- 23. Alaa A, Van Breugel B, Saveliev ES, van der Schaar M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In: Proceedings of the International Conference on Machine Learning. PMLR. p. 290–306.
- 24. Ghosheh G, Li J, Zhu T. A review of generative adversarial networks for electronic health records: applications, evaluation measures and data sources. arXiv preprint. 2022. https://arxiv.org/abs/2203.07018
- 25. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Proceedings of the International Conference on Machine Learning. PMLR. p. 214–23.
- 26. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville A. Improved training of Wasserstein GANs. Adv Neural Inf Process Syst. 2017.
- 27. Kuo NI-H, Polizzotto MN, Finfer S, Garcia F, Sönnerborg A, Zazzi M, et al. The health gym: synthetic health-related datasets for the development of reinforcement learning algorithms. Sci Data. 2022;9(1):693. pmid:36369205
- 28. Juwara L, El-Hussuna A, El Emam K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns (N Y). 2024;5(4):100946. pmid:38645766
- 29. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. pmid:27219127
- 30. Theis L, van den Oord A, Bethge M. A note on the evaluation of generative models. In: International Conference on Learning Representations; 2016. http://arxiv.org/abs/1511.01844
- 31. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(86):2579–605.
- 32. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2020.
- 33. Goodfellow I. NIPS 2016 tutorial: generative adversarial networks. CoRR. 2017. https://arxiv.org/abs/1701.00160
- 34. Kullback S, Leibler R. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
- 35. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A kernel two-sample test. J Mach Learn Res. 2012;13(25):723–73.
- 36. Kendall MG. The treatment of ties in ranking problems. Biometrika. 1945;33:239–51. pmid:21006841
- 37. Yazici Y, Foo C-S, Winkler S, Yap K-H, Chandrasekhar V. Empirical analysis of overfitting and mode drop in GAN training. In: 2020 IEEE International Conference on Image Processing (ICIP). IEEE; 2020. p. 1651–5. https://doi.org/10.1109/icip40778.2020.9191083
- 38. Caton S, Haas C. Fairness in machine learning: a survey. ACM Comput Surv. 2024;56(7):1–38.
- 39. Ricci Lara MA, Echeveste R, Ferrante E. Addressing fairness in artificial intelligence for medical imaging. Nat Commun. 2022;13(1):4581. pmid:35933408
- 40. Mamandipoor B, Yeung W, Agha-Mir-Salim L, Stone DJ, Osmani V, Celi LA. Prediction of blood lactate values in critically ill patients: a retrospective multi-center cohort study. J Clin Monit Comput. 2022;36(4):1087–97. pmid:34224051
- 41. Rezar R, Mamandipoor B, Seelmaier C, Jung C, Lichtenauer M, Hoppe UC, et al. Hyperlactatemia and altered lactate kinetics are associated with excess mortality in sepsis: a multicenter retrospective observational study. Wiener klinische Wochenschrift. 2023;135(3):80–8.
- 42. Bruno RR, Wernly B, Binneboessel S, Baldia P, Duse DA, Erkens R, et al. Failure of lactate clearance predicts the outcome of critically ill septic patients. Diagnostics (Basel). 2020;10(12):1105. pmid:33352862
- 43. Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. Adv Neural Inf Process Syst. 2017.
- 44. Del Ser J, Barredo-Arrieta A, Díaz-Rodríguez N, Herrera F, Saranti A, Holzinger A. On generating trustworthy counterfactual explanations. Inf Sci. 2024;655:119898.
- 45. Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B. Deep ROC analysis and AUC as balanced average accuracy to improve model selection, understanding and interpretation. arXiv preprint. 2021. https://arxiv.org/abs/2103.11357
- 46. Carrington AM, Manuel DG, Fieguth PW, Ramsay T, Osmani V, Wernly B, et al. Deep ROC analysis and AUC as balanced average accuracy, for improved classifier selection, audit and explanation. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):329–41. pmid:35077357
- 47. Kuo N, Perez-Concha O, Hanly M, Mnatzaganian E, Hao B, Di Sipio M. Enriching data science and healthcare education: application and impact of synthetic datasets through the health gym project. JMIR Med Educ. 2023.
- 48. Bai S, Kolter J, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint. 2018. https://arxiv.org/abs/1803.01271
- 49. Kuo NIH, Harandi M, Fourrier N, Walder C, Ferraro G, Suominen H. An input residual connection for simplifying gated recurrent neural networks. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE; 2020. p. 1–8. https://doi.org/10.1109/ijcnn48605.2020.9207238
- 50. Johansson P, Bright J, Krishna S, Fischer C, Leslie D. Exploring responsible applications of synthetic data to advance online safety research and development. arXiv preprint. 2024. https://arxiv.org/abs/2402.04910
- 51. Velichkovska B, Gjoreski H, Denkovski D, Kalendar M, Mullan ID, Gichoya JW, et al. AI learns racial information from the values of vital signs. Cold Spring Harbor Laboratory; 2023. https://doi.org/10.1101/2023.12.11.23299819
- 52. Velichkovska B, Gjoreski H, Denkovski D, Kalendar M, Mamandipoor B, Celi LA, et al. Vital signs as a source of racial bias. Cold Spring Harbor Laboratory; 2022. https://doi.org/10.1101/2022.02.03.22270291
- 53. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S. Generative adversarial networks. arXiv preprint. 2014. https://arxiv.org/abs/1406.2661
- 54. Mirza M, Osindero S. Conditional generative adversarial nets. 2014.
- 55. Yoon J, Jarrett D, van der Schaar M. Time-series generative adversarial networks. Adv Neural Inf Process Syst. 2019.
- 56. Yu L, Zhang W, Wang J, Yu Y. SeqGAN: sequence generative adversarial nets with policy gradient. CoRR. 2016. https://arxiv.org/abs/1609.05473
- 57. Levina E, Bickel P. The Earth Mover’s distance is the Mallows distance: some insights from statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001). IEEE Computer Society. p. 251–6. https://doi.org/10.1109/iccv.2001.937632
- 58. Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks. arXiv preprint. 2017. https://arxiv.org/abs/1701.04862
- 59. Landauer TK, Foltz PW, Laham D. An introduction to latent semantic analysis. Discourse Processes. 1998;25(2–3):259–84.
- 60. Mottini A, Lheritier A, Acuna-Agost R. Airline passenger name record generation using generative adversarial networks. CoRR. 2018. https://arxiv.org/abs/1807.06657
- 61. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 62. Graves A, Fernández S, Schmidhuber J. Bidirectional LSTM networks for improved phoneme classification and recognition. In: Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications - Volume Part II. Berlin, Heidelberg: Springer-Verlag; 2005. p. 799–804.
- 63. Mukaka MM. Statistics corner: a guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24(3):69–71. pmid:23638278
- 64. Kuo N, Jorm L, Barbieri S. Generating synthetic clinical data that capture class imbalanced distributions with generative adversarial networks: example using antiretroviral therapy for HIV. arXiv preprint. 2022. https://arxiv.org/abs/2208.08655
- 65. Kuo N, Jorm L, Barbieri S. Synthetic health-related longitudinal data with mixed-type variables generated using diffusion models. arXiv preprint. 2023. https://arxiv.org/abs/2303.12281
- 66. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In: Proceedings of the International Conference on Machine Learning. PMLR. p. 2256–65.
- 67. Althelaya KA, El-Alfy E-SM, Mohammed S. Evaluation of bidirectional LSTM for short- and long-term stock market prediction. In: 2018 9th International Conference on Information and Communication Systems (ICICS). IEEE; 2018. p. 151–6. https://doi.org/10.1109/iacs.2018.8355458