
SynthCraft: An AI partner for synthetic data generation to support data access and augmentation in healthcare

  • Thomas Callender ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Visualization, Writing – original draft, Writing – review & editing

    tac68@cam.ac.uk (TC), mv472@damtp.cam.ac.uk (MvdS)

    Affiliations Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom, Cambridge Centre for AI in Medicine, University of Cambridge, Cambridge, United Kingdom

  • Anders Boyd,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliations Department of Infectious Diseases, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland, Amsterdam UMC location University of Amsterdam, Department of Infectious Diseases, Meibergdreef 9, Amsterdam, The Netherlands, Amsterdam Institute for Immunology and Infectious Diseases, Infectious Diseases, Amsterdam, The Netherlands

  • Robert Davis,

    Roles Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Cambridge Centre for AI in Medicine, University of Cambridge, Cambridge, United Kingdom, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom

  • Silas Ruhrberg Estevez,

    Roles Data curation, Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom

  • Juan M. Lavista Ferres,

    Roles Writing – review & editing

    Affiliation AI for Good Lab, Microsoft, Redmond, Washington, United States of America

  • Mihaela van der Schaar

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    tac68@cam.ac.uk (TC), mv472@damtp.cam.ac.uk (MvdS)

    Affiliations Cambridge Centre for AI in Medicine, University of Cambridge, Cambridge, United Kingdom, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom

Abstract

Access to high-quality data provides the foundation for biomedical research. But data access is often limited or challenging due to privacy constraints, whilst the data themselves may be unrepresentative or sparse. Synthetic data can support both privacy-preserving data access and advanced analytical workflows, including data augmentation or the development of digital twins. However, the use of synthetic data remains limited due to the complexity of the methods themselves, their use, and their evaluation. To address this, we developed SynthCraft, an AI tool to support the principled, transparent application of state-of-the-art synthetic data generation methods. SynthCraft couples a reinforcement learning-based reasoning engine with large language models (LLMs) to orchestrate the workflow necessary for the generation of synthetic data based on dynamic interaction with the user through natural language. We demonstrate the capability of SynthCraft with both tabular and genomic datasets: the National Health and Nutrition Examination Survey (NHANES) and The Cancer Genome Atlas (TCGA). Using SynthCraft, we analysed the privacy, statistical fidelity, and downstream utility of four different synthetic data generators, both with and without explicit privacy-preserving designs, when applied to the NHANES and TCGA datasets. We show that different generators perform differently – and that no single method was optimal – across varying use-cases and datasets. Furthermore, we demonstrate how SynthCraft can be used for data augmentation as part of a workflow to attempt to mitigate imbalances in the proportion of individuals from different ethnic backgrounds. In conclusion, a human-in-the-loop AI partner using LLMs can support the generation of synthetic datasets. Such tools could improve the quality, reproducibility, and transparency of research methods, whilst increasing their accessibility. Research into their use across different methodological areas is warranted.

Author summary

Medical research depends on access to patient data, but legitimate privacy concerns often mean access is restricted. We created SynthCraft to address this challenge. SynthCraft is an AI partner designed to help researchers generate synthetic versions of medical datasets entirely through natural language, without requiring programming skills. Synthetic data mimic the patterns seen in real datasets without containing actual patient data. However, creating and evaluating synthetic data is technically complex, requiring specialised knowledge that limits its accessibility. SynthCraft supports users through each step in the generation of synthetic data: analysing the original data, selecting appropriate generation methods, creating the synthetic data itself, and finally rigorously evaluating the results. All actions and code used by SynthCraft are recorded throughout. We demonstrated SynthCraft’s capabilities using a national health survey and a cancer genomics dataset. Models trained on our synthetic data performed comparably to those trained on real data. We also explored using synthetic data to address imbalances in ethnic representation, though we did not find that this improved model performance in these analyses. By making advanced methods accessible through natural language and ensuring transparent, reproducible workflows, such tools could transform how researchers apply state-of-the-art methods across biomedical research.

Introduction

Healthcare research is premised on access to high-quality data. Nevertheless, legitimate privacy concerns mean healthcare data are often difficult to access if not entirely unavailable [1]. When available, data may be sparse, subject to biases, or unrepresentative [2]. The implications of this are felt throughout biomedical research.

Synthetic data have emerged as a powerful approach to overcoming these problems. Rather than masking or anonymizing real data, synthetic data are generated to mirror the statistical patterns and relationships found in real datasets [2,3]. Because of this, synthetic data can support privacy-preserving data access and the development of digital twins [2,5]. When applied with care, synthetic data can also play a role in mitigating issues of fairness, bias, and data sparsity through data augmentation and adaptation [2,4]. While the benefits of synthetic data are increasingly recognised, their application in healthcare remains in its infancy [6]. Key challenges include the complexity of developing synthetic data, the speed with which synthetic data methodologies are being developed, and the lack of standardisation of the metrics by which synthetic data should be judged [7,8].

Efforts to improve the quality and accessibility of research methods have historically revolved around training, the development of software packages that abstract elements of how a method is implemented [9–11], and reporting guidelines [12]. Software packages for synthetic data have begun to bridge this gap [9–11], yet their use still demands advanced programming skills and familiarity with complex data pipelines – skills that many healthcare researchers do not possess or lack the time to apply effectively [13]. Guidelines have been used both as an educational tool and to improve the quality of research [12], but their impact has been mixed [14].

Here we introduce SynthCraft, a large language model (LLM)-based partner for synthetic data development, as a solution to these barriers. This system uses LLMs to orchestrate multiple sequential or concurrent steps to solve complex problems. Working together entirely in natural language, SynthCraft empowers a researcher with state-of-the-art software libraries to build synthetic data by progressing through a principled, stepwise approach, discussing problems and solutions as they arise. In developing this system, we first show the necessity of bespoke reasoning systems over using LLMs alone before comparing the performance of SynthCraft using different LLMs. We then demonstrate the capability of SynthCraft across tabular and genomic datasets for both data access and augmentation tasks.

Methods

Overview of SynthCraft

Designed to act as a “human-in-the-loop” partner, SynthCraft guides users through an interactive, step-by-step process (Fig 1). First, SynthCraft characterises the real dataset, identifying key features and the data structure, as well as performing exploratory data analyses. SynthCraft then engages the user to consider the most appropriate synthetic data generation methods for their circumstances, explaining the strengths and weaknesses of alternative approaches, before invoking Synthcity to generate the synthetic data itself [9]. Synthcity is an open-source package with extensive community engagement that provides a standardised interface for accessing and evaluating a comprehensive array of generators for any synthetic data use case – from synthetic data generation to managing privacy, fairness, domain adaptation and image generation [9] (Table A in S1 Text). Together, the user and partner compare the generated data and iteratively refine analyses based on user feedback. The process has been designed to minimise ambiguity, ensure methodological rigour, and quality-assure the generated synthetic data. User prompts and any code generated by SynthCraft can be saved directly whilst, on completion, a structured report detailing all steps taken, decisions made, and code run is produced to support transparency and reproducibility.

Fig 1. Overview of SynthCraft.

(A) SynthCraft is a modular framework consisting of large language model (LLM) agents with access to tools (SynthCity, Python) linked with working memory and underpinned by a reasoning system for in-context learning. Interaction with the user is through a natural language interface. An illustration of SynthCraft’s synthetic data generation pipeline is shown in (B) alongside a schematic of how the episodic multi-armed bandit reasoning approach works (to the right of the vertical line) [15,16]. At each stage of the pipeline, from data preparation and developing an analysis plan through evaluating the quality of the synthetic data generated, multiple intermediate steps may be needed. In the illustration, we have simplified for clarity, but at each step the agent has a particular state, reflecting progress towards completing the relevant task, can select appropriate actions, and then receives feedback either from interaction with the user, external tools invoked, or self-reflection [15,16]. The number of episodes – or steps – will vary depending on the use case, the dataset, and the interaction with the user. Though presented as a sequential pipeline, SynthCraft may need to return to earlier stages; for example, to trial alternative synthetic data generators if the quality of the synthetic data generated for a given task is insufficient at the evaluation stage. Subfigures adapted from refs [15,16].

https://doi.org/10.1371/journal.pdig.0001290.g001

SynthCraft is a modular LLM framework, in which a coordinating LLM agent controls worker agents with the ability to generate code and run specialised synthetic data generation tools. This framework includes several important guardrails: first, the LLMs are not responsible for writing code to generate or evaluate synthetic data, instead using Synthcity (version 0.2.12). Second, we protect against the introduction of errors at the point of generating synthetic data by ensuring the human user is asked to validate key parameters before Synthcity is invoked. More broadly, the user is kept in the loop throughout, can ask questions at any point, and can revert steps. We used GPT-5 from OpenAI (model: GPT-5; API version: “2025-08-07”; temperature: 0.5) on a secured instance of Azure as the underlying LLM in these analyses, but an alternative LLM could be simply interchanged to underpin the agents in this framework. The LLMs themselves are not privy to the real or synthetic data at any point, only prompts – the instructions provided by the user – whilst all analyses are undertaken on a user’s own device.

The collaborative, sequential decision-making process between the user and SynthCraft uses a reasoning system trained using multi-armed bandits, a type of reinforcement learning designed for complex multi-stage optimisation problems (Fig 1) [15,16]. Each stage in the process by which synthetic data are generated – the stages themselves are encoded as fixed logic – can be considered an episode consisting of one or more steps. The steps necessary for each episode develop based on the needs of the user and the characteristics of the data – i.e., adaptive learning [15]. For example, because the performance of synthetic data generation methods differs between use-cases and datasets, the generative step may require several iterative cycles before completion. At each step, an action is taken, such as running a synthetic data generator tool or producing a report. Each action is associated with both a cost and a reward. Should the task be completed, or a problem be encountered by the LLM that requires discussion with the user, a stop action occurs. Feedback received from all other actions taken within that episode informs the next action to be taken. The overall objective of the LLM is to maximise net rewards (total rewards less costs), which occurs on successful completion of the task. Further details are presented in Box 1.

BOX 1: Details of the SynthCraft reasoning engine

SynthCraft integrates a reasoning engine that guides users through the process of synthetic data generation using a transparent, structured decision framework. Inspired by the CLiMB architecture [16], this reasoning process is formalised as an episodic multi-armed bandit [15], tailored to maximise utility (output quality) while minimising interaction cost (user burden).

The reasoning engine consists of several core components [15,16]:

  • A set of tasks (episodes) necessary to complete an analysis plan.
  • Costs associated with specific actions that are returned as feedback.
  • Actions to complete the tasks, from invoking tools to asking the user for feedback.
  • A state that corresponds to all previous actions within the episode along with their costs.

Let T be the set of episode types, or subtasks, in the synthetic data pipeline, including:

  1. Dataset intake and characterisation
  2. Analytic intent elicitation
  3. Synthetic data generation with or without specific privacy guarantees
  4. Evaluation strategy selection
  5. Iterative refinement
  6. Transparent documentation

Note, data pre-processing is currently expected to be performed before data are passed to SynthCraft.

Each episode, e, corresponds to one subtask and consists of a sequence of actions (a1, a2, …) drawn from a set of actions, A [15,16]. These actions either progress the task or end the episode and engage the user. We enforce a maximum number of actions per episode, before automatically discussing with the user, to prevent infinite iteration [16].

The next action to be taken draws on the previous sequence of actions and the feedback received from each action [15]. Feedback is attributed a cost, c, of 0 or 1, and can be received from [16]:

  • external tools (e.g., the results from statistical analyses; synthetic data generators);
  • LLM self-reflection; or,
  • the user themselves (either mid-episode or at the end of the episode).

Feedback that requires the user is penalised (c = 1) to minimise user burden. At the end of each episode, there is a terminal reward, again of 1 – corresponding to approval of the actions taken by SynthCraft – or 0 if the user requires the episode to be performed differently.

The reasoning agent maintains a dynamic plan over the subtasks. A subtask, t, is considered complete when the associated episode receives a terminal reward of 1. The plan is not strictly sequential; SynthCraft can reorder subtasks based on task context and user feedback, maintaining flexibility and robustness across different use cases. The ultimate objective of SynthCraft’s reasoning system is to maximise terminal rewards by completing the analysis plan and its constituent steps efficiently with minimal user input [16]. All actions, feedback, and decisions are recorded to enable full reproducibility and auditability of the synthetic data workflow.
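The episodic loop described in Box 1 can be sketched in a few lines of Python. This is an illustrative toy, not SynthCraft’s implementation: the action names, per-action costs, and policies are hypothetical, chosen only to show how costs and the terminal reward combine into the net reward the agent maximises.

```python
# Toy sketch of one episode of the multi-armed bandit loop.
# Action names and costs are hypothetical, not SynthCraft's own.
COSTS = {"run_tool": 0, "self_reflect": 0, "ask_user": 1, "stop": 0}
MAX_ACTIONS = 8  # cap per episode to prevent infinite iteration

def run_episode(policy):
    """Take actions until 'stop' (or the cap), accumulating costs;
    return the terminal reward minus the total cost incurred."""
    state = []  # history of (action, cost) pairs within the episode
    total_cost = 0
    for _ in range(MAX_ACTIONS):
        action = policy(state)
        cost = COSTS[action]
        total_cost += cost
        state.append((action, cost))
        if action == "stop":
            break
    terminal_reward = 1  # stand-in for user approval of the episode
    return terminal_reward - total_cost

# A policy that avoids the costly 'ask_user' action keeps the net
# reward high; a policy that always asks the user is penalised.
def tool_first(state):
    return "stop" if len(state) >= 3 else "run_tool"

def always_ask(state):
    return "ask_user"

net_cheap = run_episode(tool_first)   # 1 - 0 = 1
net_costly = run_episode(always_ask)  # 1 - 8 = -7
```

The asymmetry in costs is what pushes the agent towards tools and self-reflection first, engaging the user only when necessary, as described above.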

To analyse the added value of SynthCraft over using an LLM (GPT-5) in combination with Synthcity, we performed reasoning ablation studies. In each study, we replaced the reasoning engine of SynthCraft with a coordinator agent that uses GPT-5. Here, the GPT-5 coordinating agent has access to the same tools, allowing us to analyse the impact of SynthCraft beyond using LLMs on their own when running an end-to-end workflow to generate and evaluate synthetic data. Our assessments included whether relevant steps were performed, the quality of analyses, whether errors were generated, and how or whether a human user was involved at critical stages in decision making.

Datasets

To demonstrate the capacity of SynthCraft to generate synthetic datasets across both epidemiological and genomic data, we used the National Health and Nutrition Examination Survey (NHANES) (wave 2021–2023) [17] and The Cancer Genome Atlas (TCGA) [18]. In brief, the NHANES study is a complex, stratified, multistage cluster probability sample of the civilian, noninstitutionalized population of the United States of America (USA). NHANES collects data through household interviews (for demographics, diet, tobacco use, and medical history) and a mobile examination centre (for health, dental, anthropometric, and biochemical examinations and biospecimen collection). For this study, we included the covariates age, gender, ethnicity, body mass index (BMI), total cholesterol levels, insulin levels, and self-reported myocardial infarction. We excluded participants who had missing data on these covariates, resulting in a dataset of 2,924 unique observations.

The Cancer Genome Atlas (TCGA) reflects the high-dimensional, heterogeneous data structures characteristic of multi-omics studies. TCGA is an ongoing collaboration between the National Cancer Institute and National Human Genome Research Institute in the USA with the aim of generating comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer [18]. TCGA contains multi-omic data on tumour tissue and matched normal tissues from more than 11,000 patients alongside clinical information. For this study, we included data on tumour purity in bulk RNA sequencing (using the Illumina HiSeq 2000 platform), as determined from the ABSOLUTE algorithm [19] and a previously derived gene signature [20]. We included all tumour samples for which there were no missing values in gene expression, resulting in a dataset of 9,678 samples.

Synthetic data generation and performance evaluation

We generated synthetic versions of both the NHANES and TCGA datasets using the following four data generators, selected to represent a range of generators both with and without specific privacy-preserving features, trained with their default hyperparameters: Private Aggregation of Teacher Ensembles Generative Adversarial Networks (PATE-GAN) [21], Denoising diffusion probabilistic models (DDPM) [22], Anonymization through Data Synthesis using Generative Adversarial Networks (ADS-GAN) [23], and Conditional Tabular Generative Adversarial Network (CTGAN) [24].

PATE-GAN, DDPM, and ADS-GAN are specifically privacy-preserving synthetic data generators. PATE-GAN uses differential privacy, a mathematical framework that ensures the inclusion or exclusion of a single data point does not significantly affect the output of data analysis [25]. We used a default privacy epsilon of 1 for these analyses based on previous work that has shown that this value can still provide strong privacy guarantees with minimal impact on the usefulness of the synthetic data [21]. As the epsilon is reduced, the privacy of the resulting synthetic data rises but with a trade-off in terms of its utility for downstream tasks such as prognostic modelling [21,26]. DDPM and ADS-GAN are specifically tailored to protect against re-identification attacks, where an adversary attempts to identify individuals within anonymised or pseudonymised datasets. CTGAN has no additional privacy-preserving framework embedded. Many generators will produce synthetic data that preserve privacy. However, generators that use differential privacy or protect against reidentification attacks provide users with specific, tuneable, mathematical guarantees over the privacy and fidelity of the resulting synthetic data. Such guarantees do not, on their own, reduce the need for rigorous evaluation of the privacy of the datasets generated.
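PATE-GAN’s differential-privacy machinery is involved, but the role of epsilon can be illustrated with a much simpler differentially private primitive, the Laplace mechanism. The sketch below is not how PATE-GAN works; it only demonstrates the trade-off described above: lowering epsilon increases the noise scale, strengthening privacy at the cost of accuracy.

```python
import math
import random

def dp_mean(values, epsilon, rng, lo=0.0, hi=1.0):
    """Differentially private mean via the Laplace mechanism.
    The mean of n values bounded in [lo, hi] has sensitivity
    (hi - lo) / n, so the noise scale is sensitivity / epsilon:
    a smaller epsilon means more noise and stronger privacy."""
    n = len(values)
    scale = (hi - lo) / n / epsilon
    u = rng.random() - 0.5  # inverse-CDF sampling of Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(values) / n + noise

rng = random.Random(0)
data = [0.2, 0.4, 0.6, 0.8] * 250  # 1,000 bounded values, true mean 0.5
loose = dp_mean(data, epsilon=10.0, rng=rng)   # weak privacy, little noise
strict = dp_mean(data, epsilon=0.01, rng=rng)  # strong privacy, heavy noise
```

The same tension plays out in synthetic data generation: a lower epsilon yields datasets that are harder to attack but less faithful for downstream modelling [21,26].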

To evaluate the quality, utility, and privacy of the generated data, we focused on several performance metrics (an exhaustive list of metrics and results are presented in Table B in S1 Text) [9,27,28]. For data quality, we used several metrics including the Jensen-Shannon distance, the empirical maximum mean discrepancy, and α-precision [27]. To evaluate the utility of the data for downstream tasks, we then built predictive models and measured the mean squared error (MSE) (for regression) and the area under the receiver operating characteristic curve (AUC) (for classification) on both training and out-of-distribution data. For data privacy, we measured authenticity, k-anonymity, and the identifiability score. Authenticity quantifies the percentage of generated samples that are not near‐identical copies of any real training example [27]. K-anonymity quantifies the smallest k such that, in the real data (for ground truth) or the synthetic data, every combination of quasi-identifiers appears at least k times. The identifiability score assesses how easily a synthetic record can be linked back to a specific real individual by comparing quasi-identifiers between the real and synthetic datasets.
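Two of these metrics are simple enough to sketch directly. The snippet below gives stdlib-only reference implementations of k-anonymity over a set of quasi-identifiers and the Jensen-Shannon distance between two discrete distributions; they illustrate the definitions in the text rather than Synthcity’s actual implementations, and the example rows and probabilities are illustrative.

```python
import math
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest k such that every combination of quasi-identifier
    values appears at least k times (higher k = more private)."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(combos.values())

def js_distance(p, q):
    """Jensen-Shannon distance (base 2, so bounded in [0, 1]) between
    two discrete distributions given as dicts of outcome -> probability."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):  # KL divergence from a to the mixture m
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))

rows = [{"age": "40-49", "sex": "F"}, {"age": "40-49", "sex": "F"},
        {"age": "50-59", "sex": "M"}, {"age": "50-59", "sex": "M"}]
k = k_anonymity(rows, ["age", "sex"])  # each combination appears twice: k = 2

real = {"no_mi": 0.958, "mi": 0.042}   # 4.2% outcome rate in the real cohort
synth = {"no_mi": 0.969, "mi": 0.031}  # e.g., a 3.1% synthetic outcome rate
d = js_distance(real, synth)           # small but non-zero distance
```

A distance of 0 indicates identical distributions and 1 indicates disjoint support, which is why lower values correspond to higher statistical fidelity.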

Synthetic data augmentation

In common with many research cohorts, NHANES does not have equal representation across different ethnicities. To address this, we used a four-stage augmentation pipeline: baseline enumeration; enrichment target calculation; conditional synthetic generation; and cohort integration and evaluation. This allowed us to train models on augmented datasets containing the same number of cases from each ethnic group represented. Further details can be found in the Supplementary Methods.
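A minimal sketch of the enrichment target calculation stage might look as follows. It assumes a simple up-to-the-largest-group scheme, which is a hypothetical simplification; the actual targets used are defined in the Supplementary Methods, and the toy group labels are invented.

```python
from collections import Counter

def enrichment_targets(group_labels):
    """Number of synthetic records to generate per group so that
    every group reaches the size of the largest one (hypothetical
    scheme; the Supplementary Methods define the actual targets)."""
    counts = Counter(group_labels)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}

# Toy cohort: group 'C' is under-represented, so it needs the most
# conditionally generated synthetic records before integration.
labels = ["A"] * 800 + ["B"] * 600 + ["C"] * 300
targets = enrichment_targets(labels)  # {'A': 0, 'B': 200, 'C': 500}
```

The per-group deficits would then drive conditional synthetic generation (sampling records conditioned on each group label) before the augmented cohort is integrated and evaluated.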

Statistical analysis

To compare distributions of variables between real and synthetic datasets, we calculated counts and percentages (for categorical variables) and medians and interquartile ranges (IQR) (for continuous variables) for the NHANES dataset, and means and standard deviations of gene expression for the TCGA dataset.

For the real and synthetic NHANES databases, we modelled the relationship between self-reported myocardial infarction and all other variables in the dataset using logistic regression. We obtained odds ratios (OR) and 95% confidence intervals (CI) from these models. We used ethnicity categorisations as provided by NHANES. We used a complete case analysis. We subsequently built prediction models, analysing discriminative performance (area under the receiver operating characteristic curve; AUC) in aggregate and across sub-groups using bootstrapped (1,000 runs) confidence intervals. We present results from models that were trained on synthetic data and tested on the real dataset. For the TCGA dataset, we calculated the MSE of the predicted purity. We used 5-fold cross-validation with 80% training and 20% testing data in each fold, reporting the mean error across the folds alongside the standard deviation.
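The discrimination analysis can be sketched with a stdlib-only AUC (the probability that a randomly chosen case outranks a randomly chosen non-case) and a percentile bootstrap for its confidence interval. This is a toy illustration of the procedure rather than the code SynthCraft ran, and the example labels and scores are invented.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of positive-negative pairs in which the
    positive scores higher (ties count half); this is the normalised
    Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, runs=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling cases with
    replacement and skipping degenerate one-class resamples."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(runs):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.1]
point = auc(labels, scores)  # 14 of 15 positive-negative pairs ranked correctly
lo, hi = bootstrap_auc_ci(labels, scores, runs=200)
```

Training on synthetic data and testing on the held-out real data, as in the paper, simply means fitting the model on the synthetic cohort and passing the real cohort’s labels and predicted scores into such an evaluation.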

Results

Comparison of SynthCraft against standalone LLMs

SynthCraft used chain-of-thought reasoning to adapt its workflow to generate high-quality synthetic versions of both a tabular epidemiological dataset – NHANES – and a genomic dataset – TCGA (Fig 1). A complete example of this workflow can be found in Fig A in S1 Text.

To assess the value of SynthCraft over the use of state-of-the-art LLMs (GPT-5), we performed three ablation studies with the NHANES dataset. In each study, we used an instance of GPT-5 equipped with access to the same tools as SynthCraft so that we could isolate the advantages of the reasoning engine in orchestrating an end-to-end research workflow. In none of the ablation studies was GPT-5 alone able to complete the analyses (Table 1). Different error types occurred in each study, ranging from ignoring the Synthcity tool and trying to write its own code to generate synthetic data (i.e., bypassing established methods), through attempting to generate synthetic data using all possible generators available in the Synthcity package and overwhelming the compute available, to failing to evaluate the synthetic data. Importantly, in none of the ablation studies did GPT-5 involve the human user at critical stages in the workflow, which could lead to errors and inappropriate analyses.

Table 1. Ablation studies comparing SynthCraft against GPT-5.

https://doi.org/10.1371/journal.pdig.0001290.t001

Results of synthetic data generated from NHANES using SynthCraft

We first compared the quality of the different datasets generated to the original NHANES dataset. The synthetic datasets generated with ADS-GAN, CTGAN, and DDPM had similar statistical measures of fidelity to each other (Table C in S1 Text), whilst replicating the variable distributions seen in the original dataset (Table 2). Despite consistency at an aggregate level, all three models showed some departure from the original dataset in the proportion of individuals with the outcome of interest – myocardial infarction – with 3.1%, 2.6%, and 7.4% of the synthetic cohorts generated using ADS-GAN, CTGAN, and DDPM, respectively, having the outcome relative to 4.2% of the real cohort. By contrast, the data generated by PATE-GAN had both lower statistical measures of fidelity and more divergence from the descriptive characteristics of the NHANES data, but would be considered more private, with greater k-anonymity and authenticity metrics, and the lowest identifiability scores (Table C in S1 Text).

Table 2. Comparison of variable distributions in the real and synthetically generated datasets (NHANES dataset).

https://doi.org/10.1371/journal.pdig.0001290.t002

We then modelled the relationship between the covariates and the occurrence of myocardial infarction in each dataset using logistic regression (Table 3). In keeping with the statistical measures of fidelity and descriptive statistics, regression models generated using synthetic data broadly preserved the relationships between variables and outcome seen in the real dataset. ADS-GAN and CTGAN reproduced the parameter estimates found in the real dataset most closely; however, there were notable discrepancies in the relationship between ethnic group and myocardial infarction.

Table 3. Comparison of regression parameters for self-reported myocardial infarction estimates in the real and synthetically generated datasets (NHANES dataset).

https://doi.org/10.1371/journal.pdig.0001290.t003

We subsequently analysed the discriminative performance of logistic regression models trained on synthetic data when tested on real data (Fig 2). Models trained on purely synthetic data generated using ADS-GAN and CTGAN showed near-equivalent AUCs to a model trained on the original dataset (real: 0.818, 95% confidence intervals [CI]: 0.773-0.859; ADS-GAN: 0.781, 95% CI: 0.733-0.827; CTGAN: 0.797, 95% CI: 0.746-0.847), although models trained on PATE-GAN and DDPM performed less well (Table D in S1 Text). These patterns were maintained when analysing performance by age and ethnicity sub-groups (Fig 2 and Table D in S1 Text).

Fig 2. Discriminative performance (AUC) overall and by sub-group of logistic regression models trained on the original (NHANES) dataset, purely synthetic datasets (circles), or original dataset with augmentation (squares) when tested on the original real dataset.

The discriminative performance of models trained purely on synthetic data can rival a model trained on real data, with the quality of models varying by synthetic dataset. Augmentation of the real dataset did not improve performance relative to using the real dataset for model development without augmentation. Abbreviations: ADS-GAN, Anonymization through Data Synthesis using Generative Adversarial Networks; CTGAN, conditional table generative adversarial network; DDPM, denoising diffusion probabilistic models; NHANES, National Health and Nutrition Examination Survey; PATE-GAN, Private Aggregation of Teacher Ensembles Generative Adversarial Networks; AUC, area under the curve.

https://doi.org/10.1371/journal.pdig.0001290.g002

The impact of data augmentation on model performance

Augmentation – where synthetic data are added to the original data to reduce imbalances in particular features – is an important use-case for synthetic data. We thus asked SynthCraft to undertake ethnicity-specific augmentation of the original NHANES data. However, training predictive models of self-reported myocardial infarction on these augmented datasets did not improve either overall or sex- or ethnicity-specific discriminative performance (Fig 2 and Tables D-E in S1 Text). Indeed, augmentation with synthetic data across all synthetic data generators led to reductions in the overall AUC and discriminative performance by subgroup by comparison with using the real data alone. Because the performance of models built on the original data was both high (AUC > 0.8) and consistent across subgroups, these findings are not unexpected. They do, however, highlight the need for iterative testing of synthetic data across generators and use cases, and show that although synthetic data can balance representation across groups, this does not necessarily remove bias or ensure fairness [29].

Results of synthetic data generated from TCGA using SynthCraft

Our second use case for SynthCraft was genomic data, in which models trained on synthetic data were used to predict tumour purity. For the TCGA cohort, we generated synthetic versions using PATE-GAN, ADS-GAN, CTGAN, and DDPM via the SynthCraft platform. Quality metrics for each synthetic dataset are summarised in Table F in S1 Text. We then compared the distributions of gene expression levels and tumour purity estimates between the real and synthetic cohorts, finding close alignment across all methods. To assess downstream analytical fidelity, we applied a previously identified gene signature [20] in an XGBoost regression model to predict tumour purity. Predictive performance was comparable between the real data and all synthetic datasets except for PATE-GAN, which underperformed.

This use case demonstrates that SynthCraft can equally process high-dimensional genomic data. We found similar patterns to those in NHANES, with the statistical fidelity and utility of the synthetic data varying by generator (Tables G-H in S1 Text). XGBoost regression models predicting tumour purity had comparable mean squared errors when trained on synthetic data and the original TCGA cohort, except for models trained on synthetic data generated by PATE-GAN (Fig 3 and Table H in S1 Text).

Fig 3. Scatter plots illustrate the correlation between actual and predicted tumour purity for the real dataset and four synthetic datasets.

Correlations between gene expression levels and estimates of tumour purity were broadly similar across synthetic and real datasets, but the mean squared error of an XGBoost regression model predicting tumour purity was noticeably different when using synthetic data generated with PATE-GAN. Abbreviations: ADS-GAN, Anonymization through Data Synthesis using Generative Adversarial Networks; CTGAN, conditional table generative adversarial network; DDPM, denoising diffusion probabilistic models; NHANES, National Health and Nutrition Examination Survey; PATE-GAN, Private Aggregation of Teacher Ensembles Generative Adversarial Networks; MSE, mean squared error.

https://doi.org/10.1371/journal.pdig.0001290.g003

Discussion

Despite its potential [2], the use of synthetic data in practice is complex, from the selection and training of the synthetic data generator to context-specific evaluation. We demonstrate that an AI-based partner can support the systematic use of state-of-the-art synthetic data generation methods to develop and evaluate synthetic data across both tabular and genomic datasets entirely through natural language.

Improving the adoption, accessibility, reproducibility, and quality of research methods has relied upon improved training and a proliferation of research checklists and guidelines. These approaches are inherently limited: although guidelines can provide an approach to tackling a problem, there are few mechanisms to ensure their systematic use by either researchers or publishers [14]. Furthermore, it has been suggested that nearly 95% of the time required to conduct analyses involving machine learning is spent programming, a technical debt [30] that slows the dissemination of new methodological advances and potentially restricts access to teams with sufficient resources, without necessarily contributing to scientific advance.

We show here that LLMs, specifically agent-based frameworks, could support an alternative approach [31] in which researchers interact with AI partners that encode and standardise good practice, ensuring that recommendations from guidelines and research checklists are considered and thereby improving the quality of the research performed. This approach simultaneously improves the accessibility of state-of-the-art ML methods: the user is no longer required to be an expert programmer, nor limited to synthetic data generators with which they are familiar, or to those available in a single programming language.

Agent-based frameworks build on the potential for LLMs to act as reasoning engines [32], with guardrails enforced by restricting access to pre-specified methodological software packages. This ensures that the resulting analyses are technically sound, supported by the transparent reporting of any code run. A crucial feature of this approach is collaboration between the user and the framework – human-in-the-loop iteration – which provides SynthCraft with the domain knowledge to contextualise the problem; SynthCraft is a partner that augments a researcher, not a replacement.

Both SynthCraft and the underlying Synthcity package used to generate the synthetic data are open source, such that they can be verified, improved, or even adapted by the broader research community. As new synthetic data methods become available, they can be incorporated into SynthCraft without requiring re-training of the underlying reasoning framework. We have demonstrated the creation of synthetic data for data access and augmentation, but SynthCraft is not limited to these use cases: any use case supported by the underlying Synthcity package is possible. As with the underlying LLMs, Synthcity itself could be swapped for an alternative synthetic data generation package. SynthCraft has a specific purpose: the generation of synthetic data. This could be extended by linking other agent-based frameworks, for example to create downstream prediction models, which could operate cooperatively to create end-to-end analytic pipelines.

SynthCraft has some limitations. We found that the framework can require prompting to continue to the next stage of analyses, becoming focussed on the specific task at hand. With future research into training agent-based workflows and ongoing improvement in the underlying LLMs, we expect this limitation to become less prevalent. Although the underlying LLM does not inherently have access to the data, this safeguard could be bypassed by user prompting. Controlled instances of LLMs are available from cloud services compliant with healthcare privacy regulations and should be used to add security to the system; we provide instructions on how to set up SynthCraft with this feature built in. SynthCraft can also be used with open-weight or open-source LLMs, providing choice over how the system is deployed and allowing for more granular privacy controls. In the development of SynthCraft we discussed the prototype with individuals from different professional backgrounds and levels of programming proficiency, but have not conducted a formal user study; this will be the subject of future work. Furthermore, as an open-source project, we encourage users to suggest improvements and contribute to the development of the software. In our analyses, we demonstrate the system with two different datasets (epidemiological and genomic). Future work could test how the system performs across a broader range of datasets, including datasets of different scales.

In conclusion, we present an AI partner – SynthCraft – that enables the generation of synthetic datasets and augmented synthetic datasets using natural language. Such AI partners hold promise in democratising access to state-of-the-art methodologies, whilst improving the quality and reproducibility of analyses, furthering a new approach to scientific analysis.

Supporting information

S1 Text. Supporting Information.

Table A. Summary of synthetic data generator tools available in SynthCraft. Table B. Data quality metrics in SynthCraft. Table C. Performance metrics for generated synthetic datasets (NHANES). Table D. Discrimination (AUC) by sub-group for logistic regression models trained on the real data and on synthetic data (NHANES dataset). Table E. Discrimination (AUC) by sub-group for logistic regression models trained on the real data and on real data augmented with synthetic data (NHANES dataset). Table F. Exhaustive list of performance metrics for generated synthetic datasets (TCGA dataset). Table G. Comparison of variable distributions in the real and synthetically generated datasets (TCGA purity dataset). Table H. Performance on the TCGA gene purity dataset. Table I. Ablation studies comparing SynthCraft against GPT-4o. Fig A. Workflow for the NHANES dataset.

https://doi.org/10.1371/journal.pdig.0001290.s001

(PDF)

References

  1. European Union. General Data Protection Regulation (GDPR). https://gdpr.eu/tag/gdpr/. 2018. 2022 November 22.
  2. van Breugel B, Liu T, Oglic D, van der Schaar M. Synthetic data in biomedicine via generative artificial intelligence. Nat Rev Bioeng. 2024;2(12):991–1004.
  3. Qian Z, Callender T, Cebere B, Janes SM, Navani N, van der Schaar M. Synthetic data for privacy-preserving clinical risk prediction. Sci Rep. 2024;14(1):25676. pmid:39463411
  4. Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. NPJ Digit Med. 2023;6(1):186. pmid:37813960
  5. Liu Y, Acharya UR, Tan JH. Preserving privacy in healthcare: A systematic review of deep learning approaches for synthetic data generation. Comput Methods Programs Biomed. 2025;260:108571. pmid:39742693
  6. Rujas M, Martín Gómez Del Moral Herranz R, Fico G, Merino-Barbancho B. Synthetic data generation in healthcare: A scoping review of reviews on domains, motivations, and future applications. Int J Med Inform. 2025;195:105763. pmid:39719743
  7. Pezoulas VC, Zaridis DI, Mylona E, Androutsos C, Apostolidis K, Tachos NS, et al. Synthetic data generation methods in healthcare: A review on open-source tools and methods. Comput Struct Biotechnol J. 2024;23:2892–910. pmid:39108677
  8. Arora A, Wagner SK, Carpenter R, Jena R, Keane PA. The urgent need to accelerate synthetic data privacy frameworks for medical research. Lancet Digit Health. 2025;7(2):e157–60. pmid:39603900
  9. Qian Z, Cebere BC, van der Schaar M. Synthcity: facilitating innovative use cases of synthetic data in different data modalities. arXiv. 2023. http://arxiv.org/abs/2301.07573
  10. Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc. 2018;25(3):230–8. pmid:29025144
  11. Nowok B, Raab GM, Dibben C. synthpop: Bespoke Creation of Synthetic Data in R. J Stat Soft. 2016;74(11).
  12. Enhancing the quality and transparency of health research (EQUATOR) network. https://www.equator-network.org/ 2022 December 3.
  13. Attwood TK, Blackford S, Brazas MD, Davies A, Schneider MV. A global perspective on evolving bioinformatics and data science training needs. Brief Bioinform. 2019;20(2):398–404. pmid:28968751
  14. Zamanipoor Najafabadi AH, Ramspek CL, Dekker FW, Heus P, Hooft L, Moons KGM, et al. TRIPOD statement: a preliminary pre-post analysis of reporting and methods of prediction models. BMJ Open. 2020;10(9):e041537. pmid:32948578
  15. Tekin C, van der Schaar M. Episodic Multi-armed Bandits. arXiv. 2015. http://arxiv.org/abs/1508.00641
  16. Saveliev E, Schubert T, Pouplin T, Kosmoliaptsis V, van der Schaar M. CliMB: An AI-enabled partner for clinical predictive modeling. arXiv. 2024. http://arxiv.org/abs/2410.03736
  17. Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey (NHANES). Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. 2023. https://wwwn.cdc.gov/nchs/nhanes/ContinuousNhanes/default.aspx?Cycle=2021-2023
  18. National Cancer Institute. The Cancer Genome Atlas Program (TCGA). https://www.cancer.gov/ccg/research/genome-sequencing/tcga 2022. 2025 July 21.
  19. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21. pmid:22544022
  20. Li Y, Umbach DM, Bingham A, Li Q-J, Zhuang Y, Li L. Putative biomarkers for predicting tumor sample purity based on gene expression data. BMC Genomics. 2019;20(1):1021. pmid:31881847
  21. Jordon J, Yoon J, van der Schaar M. PATE-GAN: Generating synthetic data with differential privacy guarantees. In: International Conference on Learning Representations, 2019. https://openreview.net/forum?id=S1zk9iRqF7
  22. Kotelnikov A, Baranchuk D, Rubachev I, Babenko A. TabDDPM: Modelling Tabular Data with Diffusion Models. In: Proceedings of the 40th International Conference on Machine Learning, 2023.
  23. Yoon J, Drumright LN, van der Schaar M. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN). IEEE J Biomed Health Inform. 2020;24(8):2378–88. pmid:32167919
  24. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. NeurIPS. 2019.
  25. Dwork C. Differential Privacy. Lecture Notes in Computer Science. Springer Berlin Heidelberg. 2006. 1–12.
  26. Ganev G, Oprisanu B, De Cristofaro E. Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data. In: Proceedings of the 39th International Conference on Machine Learning, 2022. 6944–59. https://proceedings.mlr.press/v162/ganev22a.html
  27. Alaa A, Van Breugel B, Saveliev ES, van der Schaar M. How Faithful is Your Synthetic Data? Sample-Level Metrics for Evaluating and Auditing Generative Models. In: Proceedings of the 39th International Conference on Machine Learning, 2022. 290–306. https://proceedings.mlr.press/v162/alaa22a.html
  28. Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol. 2020;20(1):108. pmid:32381039
  29. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A Survey on Bias and Fairness in Machine Learning. ACM Comput Surv. 2021;54(6):1–35.
  30. Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D. Hidden technical debt in machine learning systems. In: NeurIPS, 2015. https://proceedings.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
  31. Callender T, van der Schaar M. Automated machine learning as a partner in predictive modelling. Lancet Digit Health. 2023;5(5):e254–6. pmid:37100541
  32. Truhn D, Reis-Filho JS, Kather JN. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat Med. 2023;29(12):2983–4. pmid:37853138