Synthetic data governance tightens: new ERC project, WEF framework push, and fresh evidence on medical SDG tradeoffs
Daily Brief · 4 min read



daily-brief · synthetic-data · data-governance · privacy · healthcare-ai · ai-regulation

Synthetic data is moving from a tactical workaround to a governed data asset: new research programs are probing societal impacts, standards bodies are pushing frameworks, and clinical studies are quantifying the fidelity/utility/privacy/cost trade space. Data teams should expect tighter scrutiny on provenance, evaluation, and “fitness for use” claims as synthetic datasets scale with generative AI.

New project to investigate societal consequences of using synthetic data to train algorithms

The University of York announced the launch of SYNDATA, a European Research Council-funded project led by Dr. Benjamin Jacobsen. The project will examine the practical, ethical, and political consequences of using synthetic data to train AI systems, with attention to real-world deployments in sectors including healthcare and finance.

Unlike technical benchmarks that focus on model performance, SYNDATA is positioned as a large-scale social science effort to study how synthetic data changes decision-making, accountability, and power structures as organizations substitute or augment real-world data with generated data.

  • Compliance is becoming socio-technical: governance will increasingly need to cover not just privacy risk, but downstream impacts (bias, access, and accountability) when synthetic data is used in high-stakes settings.
  • Expect procurement questions to change: buyers will ask for evidence on how synthetic datasets affect outcomes, and who benefits and who loses, especially in healthcare and finance.
  • Regulatory narratives are forming now: work like this often feeds future guidance; teams should document intended use, limitations, and evaluation methods early.
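Documenting intended use, limitations, and evaluation methods early can be as simple as shipping a structured "dataset card" alongside each synthetic dataset. A minimal sketch below, assuming a hypothetical record shape (the class name, fields, and example values are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticDatasetCard:
    """Illustrative provenance record for a synthetic dataset (hypothetical fields)."""
    name: str
    generator: str                  # model family/version used to generate the data
    source_data: str                # description of the real data it was fit on
    intended_use: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)
    evaluations: dict[str, str] = field(default_factory=dict)  # metric -> method

# Hypothetical example entry
card = SyntheticDatasetCard(
    name="claims-synth-v1",
    generator="tabular generative model (placeholder)",
    source_data="de-identified claims records, 2020-2023",
    intended_use=["model prototyping", "integration testing"],
    known_limitations=["rare diagnosis codes under-represented"],
    evaluations={"privacy": "membership-inference test", "utility": "train-on-synthetic, test-on-real"},
)
print(card.name, card.intended_use)
```

Keeping this record machine-readable makes it easy to answer the procurement and audit questions above without reconstructing decisions after the fact.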

Synthetic Data: The New Data Frontier

The World Economic Forum published a briefing paper framing synthetic data as a way to address data gaps, enable AI training, and protect privacy—particularly in sensitive domains like healthcare and finance. The paper also calls for governance standards and cross-sector collaboration to support responsible use as reliance on AI grows.

For practitioners, the notable shift is the positioning: synthetic data is treated less as an experimental technique and more as infrastructure that requires shared definitions, controls, and assurance practices across organizations and jurisdictions.

  • Standards pressure is rising: if governance expectations converge, “we generated it” won’t be an acceptable risk argument without documented methods and controls.
  • Privacy claims will need substantiation: teams should be prepared to explain what privacy protection means in their synthetic data generation (SDG) approach and how it is assessed.
  • Equity becomes part of the spec: the paper’s framing ties synthetic data to innovation and fairness, implying evaluation should include representativeness and harm analysis—not only utility.

Impact of synthetic data generation for high-dimensional cross-sectional medical data: fidelity, utility, privacy, and cost considerations

In JAMIA, researchers evaluated strategies for synthetic data generation (SDG) for high-dimensional cross-sectional medical data, comparing approaches that use full datasets versus subset-based methods. The study reports that using the full high-dimensional datasets better preserves fidelity, utility, and privacy, while also being cost-effective relative to subset-based approaches.

The paper is a practical reminder that “simplifying” the input data to make SDG easier can backfire: reducing dimensionality or using subsets may degrade the synthetic output in ways that affect downstream research validity and privacy characteristics.

  • Design choice matters: SDG pipelines should justify whether they use full-feature inputs or subsets, because that decision can shift fidelity, utility, and privacy outcomes.
  • Cost is not a reason to cut corners: the findings suggest teams may not need to trade off quality and privacy to stay cost-effective.
  • Better evidence for data-sharing programs: medical data sharing platforms and training environments can point to concrete guidance when selecting SDG strategies.
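Justifying an SDG design choice means actually measuring the trade space. A minimal sketch of two such checks, using random arrays as stand-ins for real and synthetic tables (the metrics here are simple proxies, not the JAMIA study's methodology): a marginal-fidelity gap, and a nearest-neighbor distance as a coarse memorization/leakage signal.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 20))    # stand-in for a real dataset
synth = rng.normal(0.05, 1.0, size=(500, 20))  # stand-in for its synthetic counterpart

# Fidelity proxy: how closely do per-feature marginal means match?
fidelity_gap = np.abs(real.mean(axis=0) - synth.mean(axis=0)).mean()

# Privacy proxy: distance from each synthetic row to its nearest real row.
# Unusually small distances can indicate the generator copied training records.
dists = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
mean_nn_dist = dists.min(axis=1).mean()

print(f"mean marginal gap: {fidelity_gap:.3f}, mean NN distance: {mean_nn_dist:.3f}")
```

Running the same checks on a full-feature pipeline versus a subset-based one gives a concrete basis for the "which input design?" decision, rather than asserting it; a utility check (train on synthetic, evaluate on held-out real data) would complete the picture.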

Synthetic data created by generative AI poses ethical challenges

NIEHS highlighted ethical challenges associated with synthetic data created by generative AI. The piece notes synthetic data’s long history (about 60 years) but argues that new generative approaches introduce additional risks—particularly around privacy and accuracy—while also addressing data scarcity in research contexts.

The takeaway is not that synthetic data is inherently unsafe, but that “synthetic” is not synonymous with “risk-free.” As generative AI makes it easier to create and distribute synthetic datasets, ethical review and validation practices have to keep pace.

  • Accuracy is an ethics issue: in public health and environmental science, low-fidelity synthetic data can mislead analyses even if privacy risk is reduced.
  • Privacy risk doesn’t disappear: teams should treat synthetic outputs as potentially sensitive unless they have evidence-based privacy evaluation.
  • Governance needs clear use boundaries: define where synthetic data is acceptable (training, testing, education) versus where real-world validation is required.
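Use boundaries are easiest to enforce when they are encoded rather than remembered. A deny-by-default sketch, assuming a hypothetical policy table (the use-case names are examples, not a standard taxonomy):

```python
# Hypothetical policy: where synthetic data is acceptable without real-world validation.
APPROVED_USES = {
    "training": True,
    "integration-testing": True,
    "education": True,
    "regulatory-submission": False,  # requires validation on real data
    "clinical-decision": False,
}

def synthetic_data_allowed(use_case: str) -> bool:
    """Deny by default: unknown use cases require explicit governance review."""
    return APPROVED_USES.get(use_case, False)

print(synthetic_data_allowed("training"))
print(synthetic_data_allowed("clinical-decision"))
```

The design choice worth noting is the default: an unlisted use case returns False, so new uses must be reviewed and added deliberately instead of slipping through.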