Synthetic data governance gets real: new ERC study, WEF playbook, and fresh evidence from healthcare
Daily Brief · 4 min read


daily-brief · synthetic-data · data-governance · privacy · healthcare-ai · ai-regulation

Synthetic data is moving from a tactical privacy workaround to a governed asset class. New work spans (1) societal impacts and power dynamics, (2) multi-stakeholder governance guidance, (3) empirical evidence on what improves the quality of medical synthetic data generation (SDG), and (4) a reminder that “synthetic” does not automatically mean safe or accurate.

New project to investigate societal consequences of using synthetic data to train algorithms

The University of York announced the launch of SYNDATA, a European Research Council-funded project led by Dr. Benjamin Jacobsen. The project will examine the practical, ethical, and political consequences of using synthetic data to train AI systems, with attention to deployments across sectors including healthcare and finance.

Instead of focusing only on model performance, SYNDATA is positioned as a large-scale social science effort to understand how synthetic data use reshapes decision-making, accountability, and the distribution of power as generative AI increases demand for training data.

  • Compliance and policy signal: Expect more scrutiny on “who benefits” and “who is harmed” by synthetic data pipelines—beyond narrow privacy claims.
  • Procurement pressure: Data teams may be asked to document intended use, affected populations, and governance controls, not just technical generation methods.
  • Risk framing shift: Synthetic data discussions are expanding from re-identification to broader ethical and political consequences, which can influence future regulation.

Synthetic Data: The New Data Frontier

The World Economic Forum published a briefing paper positioning synthetic data as a way to address data gaps, protect privacy, and enable AI training in sensitive domains such as healthcare and finance. The paper also calls for stronger governance standards and cross-sector collaboration to support responsible adoption.

For organizations already experimenting with synthetic data, the WEF framing is less about novelty and more about standardization: how to make synthetic datasets trustworthy enough for sharing, benchmarking, and regulated use cases.

  • Governance becomes table stakes: “We synthesized it” won’t satisfy stakeholders without standards for quality, privacy protection, and appropriate use.
  • Shared language for audits: A mainstream policy/industry body pushing frameworks can accelerate common controls (documentation, validation, access rules) across vendors and sectors.
  • Equity and access implications: The paper explicitly ties synthetic data to innovation with safeguards—raising expectations that teams consider representativeness and downstream impacts, not only privacy.

Impact of synthetic data generation for high-dimensional cross-sectional medical data: fidelity, utility, privacy, and cost considerations

In JAMIA, researchers evaluated strategies for synthetic data generation (SDG) for high-dimensional, cross-sectional medical data. They found that generating synthetic data using the full high-dimensional dataset better preserves fidelity, utility, and privacy than approaches that rely on subsets of the data—and that this can also be cost-effective.

The study focuses on the trade-offs data custodians actually face: how to choose SDG strategies that keep research usefulness high while protecting sensitive patient information and managing operational cost.

  • Implementation guidance: If you’re synthesizing medical datasets, “subset then synthesize” may underperform on multiple axes compared with full high-dimensional generation.
  • Platform design: Medical data sharing and education platforms can use these findings to justify SDG configurations that better balance utility, privacy, and cost.
  • Evaluation expectations: The paper reinforces that teams should measure fidelity, utility, and privacy together—optimizing one in isolation is likely to disappoint reviewers and users.
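To make the three-axis evaluation concrete, here is a minimal sketch of measuring fidelity, utility, and privacy on the same synthetic table. The data, metric choices, and variable names are illustrative stand-ins, not the methods from the JAMIA study: fidelity as mean per-feature Kolmogorov–Smirnov distance, utility as train-on-synthetic/test-on-real (TSTR) accuracy, and privacy as a distance-to-closest-record (DCR) proxy.

```python
# Illustrative joint fidelity / utility / privacy check for tabular synthetic
# data. Metrics and thresholds are assumptions for the sketch, not the
# JAMIA paper's protocol.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-ins for real and synthetic tables (n rows x d features + binary label).
n, d = 1000, 10
X_real = rng.normal(size=(n, d))
y_real = (X_real[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
X_syn = X_real + rng.normal(scale=0.3, size=(n, d))  # crude stand-in "generator"
y_syn = y_real.copy()

# 1) Fidelity: mean per-feature Kolmogorov-Smirnov distance (lower is better).
fidelity = np.mean([ks_2samp(X_real[:, j], X_syn[:, j]).statistic
                    for j in range(d)])

# 2) Utility: train on synthetic, score on held-out real rows (TSTR).
split = n // 2
clf = LogisticRegression(max_iter=1000).fit(X_syn[:split], y_syn[:split])
utility = clf.score(X_real[split:], y_real[split:])

# 3) Privacy proxy: distance from each synthetic row to its nearest real row;
#    very small minima flag possible near-copies of real records.
nn = NearestNeighbors(n_neighbors=1).fit(X_real)
dcr = nn.kneighbors(X_syn)[0].ravel()
privacy_min_dcr = dcr.min()

print(f"fidelity (mean KS): {fidelity:.3f}")
print(f"utility (TSTR acc): {utility:.3f}")
print(f"privacy (min DCR):  {privacy_min_dcr:.3f}")
```

A production evaluation would add joint-distribution fidelity checks and a proper membership-inference test, but the sketch makes the reviewers' point: reporting any one of these numbers alone hides trade-offs in the other two.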

Synthetic data created by generative AI poses ethical challenges

NIEHS highlighted ethical challenges associated with synthetic data created by generative AI. While noting that synthetic data has a roughly 60-year history, the piece argues that newer generative approaches introduce fresh risks—especially around privacy and accuracy—even as they help address data scarcity in research.

The emphasis is pragmatic: synthetic data can be a useful tool in public health and environmental science, but it can also mislead if treated as automatically de-identified or inherently correct.

  • Accuracy is an ethics issue: If synthetic data degrades or distorts signals, downstream models and analyses can produce confidently wrong conclusions.
  • Privacy isn’t guaranteed: “Synthetic” does not equal “non-sensitive,” so teams still need explicit privacy evaluation and controls.
  • Governance scope expands: Expect ethics reviews and data governance boards to ask for documented limitations and intended-use boundaries, not just generation methods.
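The accuracy point can be demonstrated in a few lines: a naive generator that resamples each column independently reproduces every marginal distribution exactly yet destroys the joint structure, so a downstream association analysis reaches a confidently wrong conclusion. The data here are simulated purely for illustration.

```python
# Illustrative failure mode: column-wise permutation as a naive "synthesizer"
# preserves marginals but erases the exposure-outcome association.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated "real" data: exposure and outcome are strongly correlated.
exposure = rng.normal(size=n)
outcome = 0.8 * exposure + rng.normal(scale=0.6, size=n)

# Naive synthesis: permute each column independently. Every marginal
# distribution is identical to the real one, but the joint structure is gone.
syn_exposure = rng.permutation(exposure)
syn_outcome = rng.permutation(outcome)

r_real = np.corrcoef(exposure, outcome)[0, 1]
r_syn = np.corrcoef(syn_exposure, syn_outcome)[0, 1]

print(f"real correlation:      {r_real:.2f}")   # strong association
print(f"synthetic correlation: {r_syn:.2f}")    # near zero: signal destroyed
```

Any summary that only compares marginals would rate this synthetic dataset as perfect, which is exactly why documented limitations and intended-use boundaries matter.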