Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data

Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generated using AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.


Introduction
arXiv:2310.19250v1 [cs.LG] 30 Oct 2023
Differential privacy (DP) is the standard for privacy-preserving statistical summaries [1]. Companies such as Microsoft [2], Google [3], and Apple [4], and government organizations such as the US Census [5], have successfully applied DP in machine learning and data sharing scenarios. The popularity of DP is due to its strong mathematical guarantees. Differential privacy guarantees privacy by ensuring that the inclusion or exclusion of any particular individual does not significantly change the output distribution of an algorithm.
In areas ranging from health care and humanitarian action to education and socioeconomic studies, the publication and sharing of data is crucial for informing society and enabling scientific collaboration. However, the disclosure of such data sets can often reveal private, sensitive information. Privacy-preserving data publishing aims to enable such collaborations while preserving the privacy of individual entries in the data set. Tabular/categorical data about individuals are relevant in many applications, from health care to humanitarian action. Privacy-preserving data publishing for such data can take the form of a synthetic data table that has the same schema and similar distributional properties as the real data. The aim here is to release a perturbed version of the original information, so that it can still be used for statistical analysis while the privacy of individuals in the database is preserved.
The biggest advantage of synthetic data sets is that, once released, all data analysis and machine learning tasks are performed in the same way as with real data. As noted by [6], the switch between real and synthetic data in data analysis and machine learning pipelines is seamless: the same analysis tools, libraries and algorithms are applied in the same manner to both data sets. Other privacy-preserving technologies, such as federated learning, require expertise and appropriate tools to perform data analysis and model training.
Given the potential benefits of synthetic data, understanding its impact on downstream classification tasks has become extremely important. A trend observed in recent studies is to evaluate the performance of two types of synthetic data generators: marginal-based synthesizers [7] and generative adversarial network (GAN) based synthesizers [6,8,9]. Marginal-based synthetic data generators are suitable for tabular data only, and have gained popularity after the algorithm MST won the NIST competition in 2018 [10]. Marginal-based synthesizers are so named because they learn approximate data distributions by querying noisy marginals from the real data. Notable marginal-based algorithms are MWEM PGM [11] and PrivBayes [12]. GAN-based synthesizers, on the other hand, are flexible algorithms suitable for tabular, image and other data formats. GANs learn patterns and relationships from the input data through a game, in the sense of game theory, between two machine learning models: a discriminator and a generator. Popular differentially private GAN architectures include DP-GAN [13], DP-CTGAN [14], PATE-GAN [15] and PATE-CTGAN [14].
One of the major applications of synthetic data is training machine learning models. Therefore, it is paramount to understand how exchanging real data for synthetic data impacts the performance of the trained models. By performance, we mean not only the utility of the model (its accuracy, for example) but also how well the model performs for different subgroups of the data set, that is, the fairness of the model. The impact of machine learning models on minority subgroups is an active area of research, and several works have investigated the trade-offs among model accuracy, bias, and privacy [16][17][18][19]. However, only recently has bias caused by the use of synthetic data in downstream classification received attention [7,20,21]. This problem becomes particularly relevant in the context of synthetic data sets generated with differential privacy guarantees. It is known that differential privacy can affect fairness in machine learning models [17]. Despite recent work investigating the impact of synthetic data on downstream model fairness [8,20], important questions remain unanswered:
• There is no published work that systematically studies the utility and fairness of machine learning models trained on data from several GAN-based and marginal-based synthetic tabular data generation algorithms.
• Previous studies have not evaluated machine learning models trained on synthetic data under multiple definitions of fairness.
• Previous studies always assumed that real data was available for evaluating the fairness of models trained on synthetic data. Here, we propose and evaluate a pipeline where no such assumption is necessary.

Contributions
In this work, we investigate the impacts of differentially private synthetic data on downstream classification, focusing on model utility and fairness. Our investigation focuses on two aspects of this impact:
• What is the impact on model utility when utilizing synthetic data for training machine learning models? Can synthetic data also be used to evaluate the utility of machine learning models?
• What is the impact on model fairness when utilizing synthetic data for training machine learning models? Can synthetic data be used to evaluate the fairness of machine learning models?
In our investigations we also evaluate whether there are clear differences in performance between marginal-based and GAN-based synthetic data, and whether any synthesizer algorithm produces data that clearly outperforms the others. Our research evaluates the impact of utilizing synthetic data sets for both training and testing in machine learning pipelines. We empirically compare the performance of marginal-based synthesizers and GAN-based synthesizers within the context of a machine learning pipeline. Our experiments yield a comprehensive analysis, encompassing utility and fairness metrics. Our main contributions are:
• We propose a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data.
• We present an extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models. In particular, this is the first systematic comparison of several marginal-based and GAN-based algorithms for fairness and utility of the resulting machine learning models.
• This is the first such study that includes several different definitions of fairness.
Main Findings:
1. Marginal-based synthetic data can accurately train machine learning models for tabular data. Marginal-based synthetic data can train models with utility similar to that of models trained on real data. Our experiments show that for a privacy-loss parameter ϵ > 5.0, models trained with MWEM PGM (AUC = 0.684), MST (AUC = 0.662) and PrivBayes (AUC = 0.668) provide utility very similar to models trained on real data (AUC = 0.684). Additionally, we evaluated models using synthetic data and found that marginal-based synthetic data provides a good evaluation, with synthetic data yielding AUC = 0.671 versus AUC = 0.684 measured using real data.
2. Synthetic data sets generated with MWEM PGM can be used for accurate model training and fairness evaluation in the case of tabular data. We found that MWEM PGM synthetic data can train models that achieve utility and fairness characteristics very similar to those of models trained with real data. Additionally, the synthetic data generated by the MWEM PGM algorithm behaved very similarly to real data when used to evaluate the utility and fairness of machine learning models. To our knowledge, this is the first study to show that synthetic data can present reliable behavior and serve as a potential substitute for real data sets in end-to-end machine learning pipelines.
This work significantly extends and subsumes a previous version, presented at the Machine Learning for Data: Automated Creation, Privacy, Bias Workshop at the International Conference on Machine Learning (ICML) (workshop without proceedings) [22].

Related Works
As synthetic data generation becomes standard practice for data sharing and publishing, understanding the impacts of utilizing synthetic data in machine learning pipelines is of significant importance. Although previous works have advised against using synthetic data to train and evaluate any final tools deployed in the real world [23], in very sensitive scenarios, such as human trafficking data [24], synthetic data might be the only available data for training and testing models.
The promise of synthetic data has generated interest in understanding the impacts of utilizing synthetic data in data analysis and machine learning. These works include analysing the utility of differentially private synthetic data in different tasks [25], investigating whether training models with differentially private synthetic images can increase subgroup disparities [8], the impacts different types of synthetic data can have on model fairness [20,26], the utility of synthetic data in downstream health care classification systems [7], and whether feature importance can be accurately analyzed using differentially private synthetic data [21]. All these works ultimately try to answer the same question: to what extent can we substitute real data with synthetic data, and which are the best synthetic data generation techniques for model training? However, these works leave questions unanswered. First, there has not been a systematic study of the impacts of using synthetic data sets in end-to-end machine learning pipelines, which means evaluating the use of synthetic data for both model training and model evaluation. Additionally, there has been considerable focus on image classification tasks [8,20], where the disparity in accuracy is largely attributable to class imbalance in these data sets: disadvantaged classes are also rare classes in the data set, which leads to worse performance on them. In contrast, our work studies these issues in the context of tabular data sets and in settings where the data has an intrinsic bias against sub-populations that are not necessarily rare in the data set. Moreover, our work focuses on comparing two families of data synthesis algorithms: marginal-based and GAN-based data synthesizers. While these two types of algorithms have previously been compared for utility [25], no such extensive comparative analysis exists for fairness.
We are the first to extensively study the differences between data generated by these two families of data synthesis algorithms when applied in end-to-end machine learning pipelines, for utility and multiple fairness metrics.

Preliminaries
In this section we introduce the concepts of differential privacy and algorithmic fairness. We refer the reader to [1,27,28] for a detailed explanation of these concepts. Additionally, we describe the synthetic data generation techniques and the data sets used in our experiments.

Differential privacy
Differential privacy is a rigorous privacy notion used to protect an individual's data in a data set disclosure. We present in this section the notation and definitions that we will use to describe our privatization approach. We refer the reader to [29], [30] and [31] for detailed explanations of these definitions and theorems.

Pure Differential
The privacy loss of the mechanism is defined by the parameter ϵ ≥ 0 in the case of 'pure' differential privacy and parameters ϵ, δ ≥ 0 in the case of 'approximate' differential privacy.
The definition of neighboring databases used in this paper is user-level privacy. User-level privacy defines neighboring databases as those that differ by the addition or deletion of a single user and all possible records of that user. Informally, the definition of differential privacy states that the addition or removal of a single individual in the database does not provoke significant changes in the probability of any differentially private output. Therefore, differential privacy limits the amount of information that the output reveals about any individual.
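The formal statement behind this informal reading does not appear in the text as extracted. For reference, the standard formulation (as given in [1]) is sketched below; the symbols M for the mechanism, D and D′ for neighboring databases, and S for a set of outputs are notation chosen here, not taken from the source.

```latex
% A randomized mechanism M is (\epsilon, \delta)-differentially private if,
% for all neighboring databases D, D' and all measurable output sets S,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
% Setting \delta = 0 recovers "pure" \epsilon-differential privacy.
```

Smaller values of ϵ force the two output distributions closer together, which is why ϵ is referred to as the privacy-loss parameter.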

A function f (also called a query) from a data set D ∈ 𝒟 to a result set A ⊆ 𝒜 can be made differentially private by injecting random noise into its output. The amount of noise depends on the sensitivity of the query.
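As an illustration of noise scaled to sensitivity, the sketch below uses the Laplace mechanism, one standard choice; the text above does not say which mechanism is intended. It assumes a counting query with sensitivity 1, which holds when each user contributes at most one record. The helper names `laplace_noise` and `private_count` are hypothetical, and a floating-point sampler like this one is for exposition only, not a hardened implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count; noise scale = sensitivity / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # assumption: one record per user, so a count moves by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 61, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(noisy)  # the true count is 3; the released value is 3 plus Laplace noise
```

A smaller ϵ gives a larger noise scale, so the released count is less accurate but reveals less about any single record.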

Fairness Metrics
In this section we present the definitions of two fairness metrics: Equal Opportunity [27] and Statistical Parity [28]. Given a data set W = (X, Y′, C) with binary protected attribute C (e.g., race, sex, religion), remaining decision variables X and predicted outcome Y′, and writing Y for the true label, we define Equal Opportunity and Statistical Parity as follows.
Equal Opportunity / Equality of Odds requires equal True Positive Rate (TPR) across subgroups:
Pr(Y′ = 1 | Y = 1, C = 0) = Pr(Y′ = 1 | Y = 1, C = 1),
where Y′ is the model output. Statistical Parity requires positive predictions to be unaffected by the value of the protected attribute, regardless of the true label:
Pr(Y′ = 1 | C = 0) = Pr(Y′ = 1 | C = 1).
We follow the approach of [32,33] and use the difference in Equal Opportunity (DEO) and the difference in Statistical Parity (DSP) to measure fairness.
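A minimal sketch of how DEO and DSP, the two differences reported in the experiments, could be computed from predictions. Two things are assumptions here: the protected attribute is coded as 0/1, and each metric is taken as an absolute difference (the results discussion refers to signed deltas, so the authors' sign convention may differ). The function names are hypothetical.

```python
def rate(values):
    """Mean of a list of 0/1 values; NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")

def positive_rate(y_pred, group, g):
    """Pr(Y' = 1 | C = g): share of positive predictions in group g."""
    return rate([p for p, c in zip(y_pred, group) if c == g])

def true_positive_rate(y_true, y_pred, group, g):
    """Pr(Y' = 1 | Y = 1, C = g): recall restricted to group g."""
    return rate([p for t, p, c in zip(y_true, y_pred, group) if c == g and t == 1])

def dsp(y_pred, group):
    """Difference in statistical parity (absolute value is an assumption)."""
    return abs(positive_rate(y_pred, group, 1) - positive_rate(y_pred, group, 0))

def deo(y_true, y_pred, group):
    """Difference in equal opportunity (absolute value is an assumption)."""
    return abs(true_positive_rate(y_true, y_pred, group, 1)
               - true_positive_rate(y_true, y_pred, group, 0))

# Made-up toy data: four individuals per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(dsp(y_pred, group), deo(y_true, y_pred, group))  # 0.5 0.5
```

A value of 0 for either metric means the corresponding fairness criterion holds exactly on the evaluated data.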

Differentially Private Synthetic Data Generators.
We use several differentially private (DP) synthetic data generators that have been specifically tailored for generating tabular data with the goal of enhancing their utility for learning tasks. We consider two broad categories of approaches: (i) marginal-based methods and (ii) Generative Adversarial Network (GAN) based models.

Marginal-based methods
MWEM PGM is a variation of the multiplicative weights with exponential mechanism (MWEM) algorithm, which generates synthetic data based on linear queries. The algorithm aims to produce a data distribution whose query answers are similar to those obtained when querying the real data set. The MWEM PGM variation combines probabilistic graphical models with the MWEM algorithm. The structure of the graphical model is determined by the measurements, such that no information is lost relative to a full contingency table representation.
MST is a synthetic data generation algorithm that selects 2- and 3-way marginals for measurement. It combines one principled step, which is to find the maximum spanning tree (MST) on the graph where edge weights correspond to mutual information between two attributes, with some additional heuristics to ensure that certain important attribute pairs are selected, and a final step to select triples while keeping the graph tree-like.
PrivBayes. In order to improve the utility of the generated synthetic data, [12] approximates the actual distribution of the data by constructing a Bayesian network using the correlations between the data attributes. This allows the joint distribution of the data to be factorized into marginal distributions. Next, to ensure differential privacy, noise is injected into each of the marginal distributions, and the simulated data is sampled from the approximate joint distribution constructed from these noisy marginals.
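The common building block of these methods, measuring a marginal with noise and sampling from it, can be sketched as follows. This is a deliberately simplified illustration and not MWEM PGM, MST, or PrivBayes: it measures a single marginal over a fixed attribute set with Laplace noise, whereas the real algorithms select many low-dimensional marginals and combine them. All function names are hypothetical, and sensitivity 1 assumes one record per user.

```python
import random
from collections import Counter
from itertools import product

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_marginal(rows, attrs, domains, epsilon, rng):
    """Noisy contingency table over `attrs`, post-processed into a distribution."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    cells = list(product(*(domains[a] for a in attrs)))
    # Adding or removing one record changes one cell by 1 (sensitivity 1).
    noisy = {c: max(counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
             for c in cells}
    total = sum(noisy.values())
    if total == 0:  # every cell clipped to zero: fall back to uniform
        return {c: 1.0 / len(cells) for c in cells}
    return {c: v / total for c, v in noisy.items()}

def sample_rows(dist, attrs, n, rng):
    """Draw n synthetic records from the noisy distribution."""
    cells = list(dist)
    picks = rng.choices(cells, weights=[dist[c] for c in cells], k=n)
    return [dict(zip(attrs, c)) for c in picks]

rng = random.Random(0)
domains = {"sex": ["F", "M"], "income": ["low", "high"]}
real = [{"sex": "F", "income": "low"}] * 6 + [{"sex": "M", "income": "high"}] * 4
dist = noisy_marginal(real, ["sex", "income"], domains, epsilon=1.0, rng=rng)
synthetic = sample_rows(dist, ["sex", "income"], n=5, rng=rng)
```

Clipping negative cells and normalizing are post-processing steps, so they do not consume additional privacy budget.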

GAN-based methods
Generative adversarial networks (GANs) are a type of artificial neural network used in machine learning for generating new data samples similar to a given training data set. GANs are based on a game, in the sense of game theory, between two machine learning models: a discriminator model D and a generator model G. The goal of the generator is to learn to produce realistic samples that can fool the discriminator, while the goal of the discriminator is to distinguish generated samples from real ones [13].
Conditional Tabular GAN (CTGAN) [34] is an approach for generating tabular data. CTGAN adapts GANs by addressing issues that are unique to tabular data and that conventional GANs cannot handle, such as the modeling of multivariate discrete and mixed discrete and continuous distributions. It addresses these challenges by augmenting the training procedure with mode-specific normalization, and by employing a conditional generator and training-by-sampling that allow it to explore discrete values more evenly. Applying differentially private SGD (DP-SGD) [35] in combination with CTGAN yields a DP approach for generating tabular data.
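For reference, the two-player game is usually written as the minimax objective below. This is the standard non-private formulation from the original GAN literature; the text does not state which loss the differentially private variants evaluated here optimize, and the symbols p_data and p_z (data and noise distributions) are notation introduced for this sketch.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator G minimizes V by producing samples that D scores as real.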
The PATE (Private Aggregation of Teacher Ensembles) framework [36] protects the privacy of sensitive data during training, by transferring knowledge from an ensemble of teacher models trained on partitions of the data to a student model.To achieve DP guarantees, only the student model is published while keeping the teachers private.The framework adds Laplacian noise to the aggregated answers from the teachers that are used to train the student models.CTGAN can provide differential privacy by applying the PATE framework.We call this combination PATE-CTGAN, which is similar to PATE-GAN [15], for images.The original data set is partitioned into k subsets and a DP teacher discriminator is trained on each subset.Further, instead of using one generator to generate samples, k conditional generators are used for each subset of the data.
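The noisy aggregation step described above can be sketched as a noisy majority vote. This is a minimal illustration, assuming each teacher casts one vote per query and Laplace noise is added to each vote count before taking the argmax; `noise_scale` and the function names are hypothetical, and the privacy accounting that PATE performs is omitted.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_teacher_vote(teacher_labels, classes, noise_scale, rng):
    """Return the class with the highest noisy vote count."""
    votes = Counter(teacher_labels)
    noisy = {c: votes.get(c, 0) + laplace_noise(noise_scale, rng) for c in classes}
    return max(noisy, key=noisy.get)

rng = random.Random(0)
# Ten teacher discriminators judge one generated record.
labels = ["real"] * 7 + ["fake"] * 3
student_target = noisy_teacher_vote(labels, ("real", "fake"), noise_scale=1.0, rng=rng)
```

Only the noisy winner is passed to the student, so the student never sees individual teacher outputs or the partitions they were trained on.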

Data sets
Adult data set. In the Adult data set (32561 instances), the features were categorized as follows: protected variable (C): gender (male, female); response variable (Y): income (binary); decision variables (X): the remaining variables in the data set. We map all continuous variables into categorical variables.
Prison Recidivism data set From the COMPAS data set (7214 instances), we select severity of charge, number of prior crimes, and age category to be the decision variables (X).The outcome variable (Y) is a binary indicator of whether the individual recidivated (re-offended), and race is set to be the protected variable (C).We utilize a reduced set of features as proposed in [18].
Fair Prison Recidivism data set. We construct a "fair" data set based on the COMPAS recidivism data set by employing a data preprocessing technique for learning non-discriminating classifiers from [37], which involves changing the class labels in order to remove discrimination from the data set. This approach selects examples close to the decision boundary to be either 'promoted', i.e., label flipped to the desirable class, or 'demoted', i.e., label flipped to the undesirable class (e.g., the 'recidivate' label in the COMPAS data set is the undesirable class). By flipping an equal number of positive and negative class examples, the class skew in the data set is maintained.
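A rough sketch of this kind of label flipping is shown below. It is not the exact procedure of [37]: it assumes that a ranker score estimating the probability of the desirable class is already available, that promotions are drawn from the disadvantaged group and demotions from the privileged group, and that the number of flips `n_flips` is supplied by the caller. All names are hypothetical.

```python
def massage_labels(scores, labels, group, n_flips, deprived=0, favored=1):
    """Flip up to n_flips labels in each direction; label 1 is the desirable class."""
    idx = range(len(labels))
    # Promotion candidates: deprived group, undesirable label, highest scores first.
    promote = sorted((i for i in idx if group[i] == deprived and labels[i] == 0),
                     key=lambda i: -scores[i])[:n_flips]
    # Demotion candidates: favored group, desirable label, lowest scores first.
    demote = sorted((i for i in idx if group[i] == favored and labels[i] == 1),
                    key=lambda i: scores[i])[:n_flips]
    k = min(len(promote), len(demote))  # keep the two flip counts equal
    new_labels = list(labels)
    for i in promote[:k]:
        new_labels[i] = 1
    for i in demote[:k]:
        new_labels[i] = 0
    return new_labels

# Made-up toy data.
scores = [0.9, 0.45, 0.2, 0.55, 0.8, 0.3]
labels = [1, 0, 0, 1, 1, 0]
group  = [0, 0, 0, 1, 1, 1]
fair_labels = massage_labels(scores, labels, group, n_flips=1)
print(fair_labels)  # [1, 1, 0, 0, 1, 0]
```

Because one label is promoted for every label demoted, the total number of positive labels is unchanged, which matches the class-skew property noted above.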

Experimental Evaluation
One potential outcome of synthetic data sharing is the utilization of synthetic data for training and evaluating an ML model. The trained model could be deployed without assessing its performance on real data, due to lack of data access. However, it is important to acknowledge that these trained models are ultimately applied to real data. This scenario is illustrated in Figure 1. In our experiments, we address the concern that there may be substantial disparities in performance between the evaluation phase (employing synthetic data) and the deployment phase (utilizing real data). We compare the performance of logistic regression models trained with data from differentially private synthesizers, focusing on two performance dimensions: utility and fairness. We follow the approach of [20] and use logistic regression for downstream classification evaluation to avoid another layer of stochasticity.
To assess utility, we employ the AUC-ROC metric, which quantifies the trade-off between recall (true positive rate) and false positive rate. We examine fairness from three different perspectives. Previous research [17] has indicated that differentially private machine learning models tend to perform worse on minority groups. To this end, we evaluate the decay in accuracy for the different subgroups of the protected attribute. We also measure the difference in equality of odds (DEO) and the difference in statistical parity (DSP). These metrics allow us to assess any disparities or bias in the model's predictions across different groups. Furthermore, we investigate the extent to which one can accurately assess a model utilizing synthetic data sets. Again, we evaluate two performance dimensions: utility and fairness.
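For readers less familiar with the metric, AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustrative pairwise implementation (quadratic in the number of examples), not the code used in the experiments.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5, a constant scorer
```

An AUC of 0.5 corresponds to random guessing, which is the reference point used when discussing the GAN-based results.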
We randomly divide the real data set into an 80/20 split, separating the data into generator and test data sets. We run 10 rounds of synthetic DP data generation on the 80% split (generator data), where we generate synthetic train and synthetic test data sets. We utilize the SmartNoise library¹ implementation of the synthesizers, and approximate-DP approaches use the library's default value of δ. For experiments using PrivBayes synthesizers, we use the DiffPrivLib implementation².
We train logistic regression models using the generated DP synthetic data sets. In experiments where we test the trained models on real data, model performance is evaluated on the real test data (the 20% test split from the real data). In experiments where we test the trained models on synthetic data, models are evaluated using the synthetic test data sets.
We report, for each technique and each value of the privacy-loss parameter, the mean across 10 rounds. Our experiments use three data sets: the UCI Adult data set [39], ProPublica's COMPAS recidivism data [40], and a fair COMPAS data set as defined in Section 2.4. The fair COMPAS data set provides a way to evaluate synthetic data generation performance in fair and biased versions of the same data set.
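The protocol described above can be sketched as a small harness. The synthesizer, model, and metric are passed in as callables because their real implementations (SmartNoise or DiffPrivLib synthesizers, logistic regression, AUC) are outside the scope of this sketch. The toy stand-ins at the bottom are hypothetical placeholders, and the bootstrap "synthesizer" is not differentially private.

```python
import random
from statistics import mean

def split_80_20(rows, rng):
    """Shuffle once and split into generator data (80%) and real test data (20%)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

def run_experiment(real_rows, synthesize, train, score, rounds=10, seed=0):
    """Mean score on real test data and on synthetic test data across rounds."""
    rng = random.Random(seed)
    generator_data, real_test = split_80_20(real_rows, rng)
    on_real, on_synth = [], []
    for _ in range(rounds):
        synth_train, synth_test = synthesize(generator_data, rng)
        model = train(synth_train)
        on_real.append(score(model, real_test))    # deployment-style evaluation
        on_synth.append(score(model, synth_test))  # evaluation without real data
    return mean(on_real), mean(on_synth)

# Toy stand-ins (placeholders only).
def bootstrap(rows, rng):
    return rng.choices(rows, k=len(rows)), rng.choices(rows, k=len(rows) // 4)

def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def accuracy(model, rows):
    return mean(1.0 if label == model else 0.0 for _, label in rows)

real = [(i, 1 if i % 3 == 0 else 0) for i in range(100)]
score_real, score_synth = run_experiment(real, bootstrap, train_majority, accuracy)
```

The gap between the two returned values is the quantity of interest when asking whether synthetic test data gives a faithful picture of deployment performance.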

Utility analysis of synthetic data in machine learning pipelines
We evaluate the quality of models trained with synthetic data sets by measuring AUC and accuracy of the protected class. We consider privacy-loss budgets of 0.5, 1.0, 5.0 and 10.0. We compare the AUC obtained in our experiments with the AUC measured by training models with the real (non-synthetic) Adult, COMPAS, and fair COMPAS data sets. Figure 2 (a) shows AUC for different privacy losses and different synthesizers. The plots show the variation of AUC as a function of ϵ for marginal-based and GAN-based synthesizers. The top row refers to marginal-based synthesizers. Overall, the performance of the models trained on marginal-based synthetic data is very close to that of the baseline model, trained on real data. For all three synthesizers, we see an increase in AUC as we increase ϵ. For all data sets (Adult, COMPAS and fair COMPAS), the performance of MST and MWEM PGM is similar across all values of ϵ. PrivBayes has a slightly lower performance. For ϵ > 5.0, all three synthesizers present very similar performance. For the COMPAS data set (which has a small dimension), the performance of synthetic data sets as training data is very close to the performance of the real data. The bottom row of Figure 2 (a) presents the performance of GAN-based synthetic data. The overall performance of this type of synthesizer is worse than the performance of the marginal-based synthesizers. As noted by [25], models trained on GAN-based synthetic data perform worse than models trained on marginal-based synthetic data. With AUC ≈ 0.5, they do not do much better than random guessing. Additionally, we see a much greater variance in results for the same privacy-loss budget, as shown by the large error bars. Finally, as the privacy-loss budget increases, the utility does not necessarily increase.
Although several works have assessed the performance of machine learning models trained with synthetic data sets [20,21,25], this is the first study to analyze whether synthetic data sets can be used for model assessment, and how close to reality such assessment is. In Figure 2 (b) we present plots of the variation of AUC for different values of ϵ. The plots in the first row refer to the performance of models trained on data from marginal-based synthesizers, and the plots in the second row refer to GAN-based synthesizers. By comparing the evaluation of models trained with marginal-based data in Figure 2 (a), assessment with real data, and in Figure 2 (b), assessment with synthetic data, we see that the assessment is very similar in both cases when the synthesizers are MST and MWEM PGM. When assessing with synthetic data, we notice that PrivBayes presents a large difference in assessment results for models trained on Adult and fair COMPAS synthetic data. GAN-based synthetic data presents inconsistent behavior when used for model assessment. When comparing the assessments in Figure 2 (a), assessment with real data, and (b), assessment with synthetic data, we notice that using DP-GAN synthetic data for model assessment can overestimate model AUC. Overall, GAN-based synthetic data yields assessments that are as good as random guessing.
Marginal-based synthetic data does better at training and assessing utility of models. We ranked the utility performance of all synthesizers based on two criteria: ability to generate synthetic data for model training and ability to generate synthetic data for model assessment. Table 1 shows the ranking of synthesizers when generating training and assessment data for the Adult data and the COMPAS data. The table also shows model AUC when measured with real data, AUC(R), and model AUC when measured with synthetic data, AUC(S). All table results account for synthetic data generated with privacy-loss parameter ϵ = 5.0.
MWEM PGM synthetic data outperforms all other synthetic data for both tasks: utility as training data for machine learning models and utility as evaluation data for machine learning models. Synthetic data sets generated with MWEM PGM and MST perform well, with only a small performance decay compared to real data, both when used for model training and for model assessment. For model training, comparing the AUC achieved by the model trained with the real data set (AUC = 0.892) to that of models trained with MWEM PGM data (AUC = 0.850) and MST data (AUC = 0.836), the decrease in performance is small. The synthetic data sets also perform well as assessment data. Model assessment using MST data (AUC = 0.804) and MWEM PGM data (AUC = 0.820) gives consistent results with a small decay. Although PrivBayes data performs well in model training (AUC = 0.846), there is a significant discrepancy between assessment with real data and assessment with synthetic data. We reach similar conclusions when analysing results for COMPAS data. Using GAN-based data as training data resulted in models with utility very close to random guessing, as already observed in the previous analysis, with DP-GAN synthetic data performing slightly better than the other GAN-based data sets.
Table 1. Synthesizer utility comparison. We compare and rank all synthesizers by their ability to generate quality training data and evaluation data for machine learning pipelines. The comparison accounts for synthetic data generated with privacy-loss parameter ϵ = 5.0. In addition to presenting a performance ranking for Adult and COMPAS data, we show a comparison of model AUC measured with real data, AUC(R), and model AUC measured with synthetic data, AUC(S).
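To make the size of the "small decay" concrete, the AUC figures quoted in the discussion above can be turned into two gaps per synthesizer: the training decay relative to the real-data baseline, and the gap between assessment with real and with synthetic data. The variable names are illustrative; only the numbers come from the text.

```python
real_auc = 0.892  # model trained and evaluated on real data

# (AUC measured on real data, AUC measured on synthetic data), as quoted above
quoted = {
    "MWEM PGM": (0.850, 0.820),
    "MST": (0.836, 0.804),
}

gaps = {}
for name, (auc_r, auc_s) in quoted.items():
    gaps[name] = {
        "training_decay": round(real_auc - auc_r, 3),  # cost of training on synthetic data
        "assessment_gap": round(auc_r - auc_s, 3),     # error from evaluating on synthetic data
    }

for name, g in gaps.items():
    print(name, g)
# MWEM PGM {'training_decay': 0.042, 'assessment_gap': 0.03}
# MST {'training_decay': 0.056, 'assessment_gap': 0.032}
```

Both synthesizers lose under 0.06 AUC in training, and their synthetic-data assessments understate the real-data AUC by about 0.03.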

Fairness analysis of synthetic data in machine learning pipelines
Impacts on subgroup accuracy. In the previous section, we showed that adding privacy by utilizing synthetic data sets in machine learning pipelines results in a utility decrease. We now proceed to perform a fairness analysis. In this experiment, presented in Table 2, we analyze model accuracy for different groups in the protected class to understand whether the addition of privacy to the data pipeline harms model utility more for the minority class than it does for the privileged class. Results in Table 2 refer to the Adult data set. From a fairness perspective, the overall behavior of all synthesizers is to have less accuracy decay for the protected class than for the privileged class. As observed in the utility experiments, MWEM PGM and MST are the best performing synthetic data sets for both pipeline tasks: training and evaluation. MWEM PGM presents good results for minority and privileged classes, where the model accuracy is very close to that of the baseline model, as captured by accuracy minority(R) and accuracy privileged(R) in Table 2. Additionally, evaluation with MWEM PGM synthetic data sets yields accuracy for both classes, captured by accuracy minority(S) and accuracy privileged(S), that is very close to the model evaluation done with real data.
Table 2. Accuracy comparison for different groups. The comparison accounts for synthetic data generated with privacy-loss parameter ϵ = 5.0. We show a comparison of model accuracy for the different groups measured with real data (R), and model accuracy measured with synthetic data (S).
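The per-group comparison can be sketched as follows: compute accuracy separately for each group, then subtract from the baseline to obtain the decay. The group labels and toy predictions are made up for illustration, and the function names are hypothetical.

```python
def accuracy_by_group(y_true, y_pred, group):
    """Accuracy computed separately for each value of the protected attribute."""
    result = {}
    for g in sorted(set(group)):
        idx = [i for i, c in enumerate(group) if c == g]
        result[g] = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
    return result

def accuracy_decay(baseline, candidate):
    """Per-group drop in accuracy relative to the baseline model."""
    return {g: baseline[g] - candidate[g] for g in baseline}

y_true     = [1, 0, 1, 0, 1, 0, 0, 1]
group      = ["minority"] * 4 + ["privileged"] * 4
pred_real  = [1, 0, 1, 0, 1, 0, 0, 0]  # model trained on real data (toy)
pred_synth = [1, 0, 0, 0, 1, 1, 0, 0]  # model trained on synthetic data (toy)

base = accuracy_by_group(y_true, pred_real, group)
cand = accuracy_by_group(y_true, pred_synth, group)
print(accuracy_decay(base, cand))  # {'minority': 0.25, 'privileged': 0.25}
```

A larger decay for one group than the other would indicate that the privacy mechanism harms that group's utility disproportionately.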
Impacts on statistical parity. A model presents statistical parity if the percentage of positive predictions is the same for all subgroups. The goal of the experiments in this section is to measure whether models trained with synthetic data preserve the characteristics of models trained on real data.
Our experiments measure the difference in statistical parity (DSP) of models. We measure the DSP of models using real data, DSP(R), and using synthetic data, DSP(S). We present a detailed comparison of DSP for all three data sets and all synthesizers in Table 3. We notice from our experiments that several models trained on synthetic data seem to be less biased than the model trained on real data. The MWEM PGM synthesizer presented the best utility overall, based on the results presented in the previous experiments. PATE-CTGAN, however, was ranked in 5th place in utility.
To better understand what is behind this apparent fairness provided by PATE-CTGAN, we investigate the percentage of positively labelled samples in the training data, evaluation data and predictions. We present percentages for minority and privileged classes for the Adult data in Table 4.
We observe in Table 4 that synthetic data generated with PATE-CTGAN presents very similar percentages of samples with positive labels, ≈ 5%, for each group of the protected attribute. At first sight, this seems like a data set with promising fairness capabilities. However, when training models with such data, there are no positive predictions resulting from the model scoring. The model trained with PATE-CTGAN data acts like a majority baseline classifier for all groups. The data sets generated with DP-GAN presented an accentuated disparity in the percentages of positive labels between minority and privileged classes. In the real data, 30% of the privileged class has positive labels, while only 10% of the minority class has positive labels. Although the DP-GAN synthesizer generates data where 31% of the privileged class has positive labels (a value similar to the 30% present in the real data), there is a significant decrease in the percentage of positive labels in the minority class, which is ≈ 6%. This imbalance is accentuated even further by the models trained with DP-GAN synthetic data. Model predictions resulted in over half of the samples from the privileged class being classified with positive labels (versus 20% of the minority class).
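The PATE-CTGAN observation can be made concrete with a toy sketch: a classifier that never predicts the positive class has identical (zero) positive-prediction rates for every group, so its parity gap is zero even though it is useless for the positive class. The labels below are hypothetical and only loosely mirror the 30% and 10% positive rates mentioned above; they are not the Adult data.

```python
def rate(values):
    """Fraction of 1s in a non-empty list of 0/1 values."""
    return sum(values) / len(values)

# Hypothetical toy labels: 3 of 10 privileged and 1 of 10 minority
# samples are positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
groups = ["privileged"] * 10 + ["minority"] * 10

# A degenerate classifier that never predicts the positive class.
y_pred = [0] * len(y_true)

pos_rate = {
    g: rate([p for p, gg in zip(y_pred, groups) if gg == g])
    for g in ("privileged", "minority")
}
parity_gap = pos_rate["privileged"] - pos_rate["minority"]

# Accuracy equals the share of negatives; recall on positives is zero.
accuracy = rate([int(t == p) for t, p in zip(y_true, y_pred)])
recall = rate([p for t, p in zip(y_true, y_pred) if t == 1])

print(parity_gap, accuracy, recall)
```

This is why a small parity gap should be read together with utility metrics and with the positive-label percentages reported in Table 4.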
MWEM PGM was once again the best performing synthesizer overall, as it preserves similar percentages of positive labels for all groups, 11% and 30% (compared to 11% and 30% in real data). Models trained with MWEM PGM also presented metrics similar to those of models trained with real data, and even a slight improvement in fairness.
The DSP delta presented in Table 3 compares the DSP observed when evaluating a model with real data against the DSP observed when evaluating it with synthetic data. For the Adult data set, a positive DSP delta means that evaluation with synthetic data observed fairer results than evaluation with real data. For COMPAS and fair COMPAS data, a negative DSP delta means that evaluation with synthetic data observed fairer results than evaluation with real data.

Table 3. Difference in statistical parity (DSP) of models trained with synthetic data. We measure the DSP of models using real test data, DSP(R), and synthetic test data, DSP(S). DSP delta quantifies the difference between DSP(R) and DSP(S). All synthetic data were generated using privacy-loss parameter ϵ = 5.0.
Across all data sets, models trained with MWEM PGM presented DSP metrics very similar to those of models trained with real data; this is captured by the DSP(R) metric.
Impacts on equal opportunity. Equal opportunity requires an equal true positive rate (TPR) across subgroups. The difference in equal opportunity (DEO) measures the difference between the privileged group TPR and the minority group TPR.
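The DEO definition above can be sketched in a few lines: compute the TPR within each subgroup and subtract the minority TPR from the privileged TPR. The function names and toy data are assumptions for illustration only.

```python
def true_positive_rate(y_true, y_pred, groups, group):
    """TPR within `group`: share of actual positives predicted positive."""
    hits = [p for t, p, g in zip(y_true, y_pred, groups)
            if g == group and t == 1]
    if not hits:
        raise ValueError(f"group {group!r} has no positive samples")
    return sum(hits) / len(hits)

def equal_opportunity_difference(y_true, y_pred, groups,
                                 privileged="privileged",
                                 minority="minority"):
    """DEO: privileged-group TPR minus minority-group TPR."""
    return (true_positive_rate(y_true, y_pred, groups, privileged)
            - true_positive_rate(y_true, y_pred, groups, minority))

# Hypothetical toy data: privileged TPR = 3/4, minority TPR = 1/2.
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
groups = ["privileged"] * 5 + ["minority"] * 5

deo = equal_opportunity_difference(y_true, y_pred, groups)
print(deo)
```

Unlike the statistical parity gap, this metric needs the true labels, so its value depends on whether the test labels come from real or synthetic data.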
We perform a thorough analysis to understand two points. First, what is the DEO of models trained with synthetic data sets, and how does it compare with that of models trained with real data? Second, does synthetic data preserve similar true positive rates across all subgroups?
We present in Table 5 experiment results comparing the DEO of models trained with differentially private synthetic data sets (ϵ = 5.0). These experiments are similar to the statistical parity experiments: we use real data, DEO(R), to measure the DEO of models trained on synthetic data, as well as synthetic data, DEO(S). The model trained with MWEM PGM synthetic data was the only one that presented a DEO similar to that of the baseline model, outperforming all other models trained with synthetic data. Note that our comparison, as in the DSP case, focuses on understanding which synthetic data sets can train models that behave as close as possible to models trained with real data. Models trained with MST, which presented promising utility metrics and subgroup accuracy, did not capture the DEO as well in experiments with the Adult data. For experiments with COMPAS and fair COMPAS data, MST performs better, but still worse than MWEM PGM, as we can see in Table 5.

Table 5. Difference in equal opportunity (DEO) of models trained with synthetic data. We measure the DEO of models using real test data, DEO(R), and synthetic test data, DEO(S). DEO delta quantifies the difference between DEO(R) and DEO(S). All synthetic data were generated using privacy-loss parameter ϵ = 5.0.
As we investigate the details of the variation in TPR, it becomes clear that MWEM PGM is the best technique for training models that preserve the fairness characteristics of models trained with real data. Experiments with Adult data (Figure 3) show that the difference between the privileged group TPR and the minority group TPR of models trained with MWEM PGM data is very similar to the difference between the subgroup TPRs of models trained with real data. Experiments with COMPAS data (Figure 4) are even more compelling. Not only is the difference between the subgroup TPRs of the model trained with MWEM PGM data close to that of the model trained with real data, but the true positive rates of the subgroups are also very similar to those of the model trained with real data. Figures 3 and 4 show that models trained with marginal-based synthetic data outperform models trained with GAN-based synthetic data for our tested data sets.
We make a similar analysis when evaluating how good synthetic data sets are for

Although the data sets utilized in our analysis are commonly employed in the fairness literature, extending the validity of our findings to larger-scale data sets would provide a more comprehensive understanding of the generalizability and robustness of marginal-based synthetic data approaches. Future research should focus on exploring the performance of these frameworks in real-world scenarios with diverse and extensive data sets. This would contribute to the broader applicability and reliability of synthetic data methods in various domains and facilitate a more nuanced understanding of their limitations and capabilities. Finally, extending our analysis to non-tabular data would be an interesting sequel to this work.

Conclusion
Our research comprehensively evaluates the impact of synthetic data sets for training and testing in machine learning pipelines in the case of tabular data sets. Specifically, we compare the performance of marginal-based and GAN-based synthesizers within a machine-learning pipeline and analyze various utility and fairness metrics for tabular data sets.
Our main findings are as follows: Marginal-based synthetic data demonstrated comparable utility to real data in end-to-end machine-learning pipelines. Models trained with MWEM PGM data (AUC = 0.684) provide utility very close to that of models trained on real data (AUC = 0.684). Furthermore, we show that model evaluation using synthetic data also provides similar results to evaluation using real data, for tabular data. The metrics obtained when utilizing marginal-based synthetic data (AUC = 0.671) are comparable to those obtained with real data (AUC = 0.684). Synthetic data sets generated with MWEM PGM do not increase model bias and can provide a realistic fairness evaluation. Our study reveals that MWEM PGM synthetic data can train models that achieve utility and fairness characteristics similar to those of models trained with real data. Additionally, when used to evaluate the utility and fairness of machine learning models, the synthetic data generated by the MWEM PGM algorithm exhibits behavior very similar to real data.
These findings highlight synthetic data's potential reliability and viability as a substitute for real data sets in end-to-end machine learning pipelines for tabular data. Furthermore, our research sheds light on the implications for model fairness of utilizing differentially private synthetic data for model training.
One crucial observation is that synthetic data that does well in model training might perform differently when used as evaluation data. This was the case with PrivBayes and some of the GAN-based synthetic data. This observation is important as synthetic data techniques gain acceptance as the standard data publishing approach in domains such as healthcare, humanitarian action, education, and population studies.

Fig 1. Pipeline for model training and evaluation using synthetic data. (1) We generate synthetic data sets for model training and model testing utilizing differentially private synthesizers. (2) We train models utilizing synthetic data and evaluate them on synthetic test data. Model selection is made during this phase. (3) Based on the results of the previous phase, the model is trained using synthetic data and deployed. The model is applied to real (test) data in the production phase.
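Differential Privacy. A randomized mechanism $\mathcal{M} : \mathcal{D} \to \mathcal{A}$ with database domain $\mathcal{D}$ and output set $\mathcal{A}$ is $\epsilon$-differentially private if, for any set of outputs $A \subseteq \mathcal{A}$ and neighboring databases $D, D' \in \mathcal{D}$ (i.e., $D$ and $D'$ differ in at most one entry), we have

$$\Pr[\mathcal{M}(D) \in A] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in A].$$

Approximate Differential Privacy. A randomized mechanism $\mathcal{M} : \mathcal{D} \to \mathcal{A}$ with database domain $\mathcal{D}$ and output set $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, for any set of outputs $A \subseteq \mathcal{A}$ and neighboring databases $D, D' \in \mathcal{D}$ (i.e., $D$ and $D'$ differ in at most one entry), we have

$$\Pr[\mathcal{M}(D) \in A] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in A] + \delta.$$

(a) AUC variation of models trained on synthetic data and evaluated using real data. (b) AUC variation of models trained on synthetic data and evaluated using synthetic data.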

Fig 2. Impact on utility caused by the use of differentially private synthetic data in model training and testing. In (a) we show the decay in model utility when utilizing marginal-based and GAN-based synthetic data sets for model training. In (b) we show the model utility measured when the instrument for measuring model performance is a synthetic data set.

Fig 3. True positive rate (TPR) variation of different subgroups of the protected attribute of the Adult data. The top two rows show TPR variation, for different values of privacy-loss parameter ϵ, of models trained with synthetic data and evaluated with real data. The bottom two rows show TPR variation, for different values of privacy-loss parameter ϵ, of models trained with synthetic data and evaluated with synthetic data.

Fig 4. TPR variation for the COMPAS data. We present TPR variation, for different values of ϵ, of models trained with synthetic data and evaluated with real data. We also present TPR variation, for different values of ϵ, of models trained with synthetic data and evaluated with synthetic data.
Table 3 quantifies the difference in DSP observed during model evaluation with real data and model evaluation with synthetic data. For Adult

Table 4. Ratio of samples with positive labels for each subgroup in the protected class in the Adult data. We compare the percentages present in the true labels of the real data and in the predicted labels. Analogously, we measure the percentage of samples with positive labels present in the training, testing and predicted labels for data sets generated with three distinct synthesizers: MWEM PGM, PATE-CTGAN and DP-GAN. Predictions(data1/data2) represents the prediction labels of an experiment where the model was trained with data1 and predictions were performed on data2.