Synthetic data governance tightens: ERC scrutiny, WEF standards push, and new medical SDG evidence
Daily Brief · 4 min read


Tags: daily-brief, synthetic-data, data-governance, privacy, healthcare-ai, ai-regulation

Synthetic data is moving from a tactical workaround to a governed asset class. New research and policy-oriented work this week converge on the same message: the value is real, but so are the societal, clinical, and ethical tradeoffs—so teams need clearer standards and better evaluation practices.

New project to investigate societal consequences of using synthetic data to train algorithms

The University of York announced the launch of SYNDATA, a European Research Council-funded project led by Dr. Benjamin Jacobsen. The work is positioned as a social-science investigation into the practical, ethical, and political consequences of using synthetic data to train algorithms, including in high-stakes sectors such as healthcare and finance.

Rather than treating synthetic data as a purely technical privacy or scaling tool, the project frames it as a mechanism that can reshape institutions, decision-making, and power—an angle that often lags behind deployment as generative AI accelerates synthetic-data production and use.

  • Regulatory exposure will expand beyond privacy: if synthetic data changes outcomes or embeds institutional assumptions, compliance questions will include fairness, accountability, and explainability—not just de-identification.
  • Procurement and oversight may get stricter: expect more requests for documentation on how synthetic datasets were generated, validated, and governed, especially in healthcare and financial services.
  • Data teams should prepare for “societal impact” reviews: internal model risk management may need to cover synthetic-data pipelines, not just model training and inference.
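One way to get ahead of the documentation requests described above is to keep a structured provenance record for every synthetic dataset. The sketch below is purely illustrative: the class name, field names, and example values are assumptions, not a published standard or any regulator's schema.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SyntheticDatasetRecord:
    """Hypothetical provenance record for a synthetic dataset (illustrative only)."""
    name: str
    source_description: str   # what real data (if any) the generator was trained on
    generation_method: str    # e.g. a GAN variant, Bayesian network, or rule-based generator
    intended_use: str         # the use case the dataset was validated for
    validation_checks: list = field(default_factory=list)  # fidelity/utility/privacy tests run
    privacy_controls: list = field(default_factory=list)   # e.g. DP training, near-copy screening
    approved_by: str = "unreviewed"

record = SyntheticDatasetRecord(
    name="claims_synth_v1",
    source_description="de-identified 2023 claims extract",
    generation_method="tabular GAN (example)",
    intended_use="model prototyping only; not for clinical decisions",
    validation_checks=["marginal-distribution comparison", "holdout-utility test"],
    privacy_controls=["distance-to-closest-record screening"],
)
print(json.dumps(asdict(record), indent=2))
```

A record like this turns "how was this dataset generated, validated, and governed?" from an ad hoc email thread into an auditable artifact that can travel with the dataset through procurement and model-risk review.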

Synthetic Data: The New Data Frontier

The World Economic Forum published a briefing paper arguing that synthetic data can address data gaps, support privacy protection, and enable AI training—particularly in sensitive domains like healthcare and finance. Alongside the opportunity framing, the paper calls for governance standards and cross-sector collaboration to guide responsible use.

For practitioners, the notable point is the direction of travel: synthetic data is being treated as infrastructure for AI adoption, which tends to pull in standard-setting, auditability, and shared terminology across public and private sectors.

  • Standards pressure is rising: “synthetic” won’t be a sufficient label—teams will be expected to specify intended use, risk controls, and evaluation methods.
  • Privacy claims will need evidence: governance language implies organizations should be ready to justify how synthetic data protects sensitive information in practice, not just in principle.
  • Collaboration becomes a requirement: if frameworks emerge from multi-stakeholder efforts, vendor and partner alignment (contracts, SLAs, audits) will matter as much as model quality.

Impact of synthetic data generation for high-dimensional cross-sectional medical data: fidelity, utility, privacy, and cost considerations

In JAMIA, researchers evaluated synthetic data generation (SDG) strategies for high-dimensional cross-sectional medical data across fidelity, utility, privacy, and cost. They report that generating synthetic data using full high-dimensional datasets preserves fidelity, utility, and privacy better than subset-based approaches, while also being cost-effective.

The practical takeaway is not “always use more data,” but that shortcuts—like training SDG models on subsets for convenience—can degrade the very properties synthetic data is supposed to deliver. The paper adds empirical weight to design choices that data-sharing platforms and research groups routinely face.

  • Pipeline design affects privacy and usefulness: subset-based SDG may look cheaper operationally, but can underperform on fidelity/utility/privacy compared with full high-dimensional approaches.
  • Better defaults for medical data sharing: the findings support more standardized SDG configurations for research and education use cases without exposing sensitive patient information.
  • Cost is part of governance: “cost-effective” full-dataset SDG strengthens the case for building repeatable, validated synthetic-data pipelines rather than ad hoc dataset-by-dataset work.
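The full-dataset-versus-subset tradeoff can be made concrete with a toy experiment. This is a minimal sketch of the idea only, not the JAMIA paper's methodology: it uses a trivial one-feature Gaussian "generator" and a crude summary-statistic fidelity gap, both of which are assumptions for illustration.

```python
import random
import statistics

random.seed(0)

# Toy "real" dataset: one numeric feature (real SDG work is high-dimensional)
real = [random.gauss(50.0, 10.0) for _ in range(2000)]

def fit_and_sample(train, n):
    """Toy SDG model: fit a Gaussian to the training data, then sample from it."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return [random.gauss(mu, sigma) for _ in range(n)]

def fidelity_gap(synth, reference):
    """Crude fidelity metric: distance between summary statistics (lower is better)."""
    return (abs(statistics.mean(synth) - statistics.mean(reference))
            + abs(statistics.stdev(synth) - statistics.stdev(reference)))

# Train the generator on the full dataset vs. a small convenience subset
synth_full = fit_and_sample(real, 1000)
synth_subset = fit_and_sample(real[:50], 1000)

print("fidelity gap (full):  ", round(fidelity_gap(synth_full, real), 3))
print("fidelity gap (subset):", round(fidelity_gap(synth_subset, real), 3))
```

Even in this toy setting, a generator fit on a 50-row subset inherits that subset's sampling noise, which is the shortcut-degrades-the-product effect the paper measures with far richer fidelity, utility, and privacy metrics.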

Synthetic data created by generative AI poses ethical challenges

NIEHS highlighted ethical challenges associated with synthetic data created by generative AI. While noting synthetic data’s long history (described as 60 years), the piece emphasizes that generative AI changes the risk profile—particularly around privacy and accuracy—at a time when synthetic data is increasingly used to address data scarcity in research.

For teams working in public health, environmental science, or adjacent domains, the message is that “synthetic” is not a free pass: accuracy limitations, misuse, and privacy leakage remain governance concerns even when the data is artificially generated.

  • Ethics reviews will broaden: generative AI-driven synthetic data introduces new questions about accuracy, representativeness, and downstream harm—not just access enablement.
  • Privacy is still on the table: synthetic datasets can still create privacy risks, so teams should avoid treating synthetic outputs as automatically non-sensitive.
  • Public-sector expectations may harden: as agencies discuss risks openly, expect tighter guidance on disclosure, validation, and appropriate use in research workflows.
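One concrete check behind the "synthetic is not automatically non-sensitive" point is a distance-to-closest-record screen: flag synthetic rows that sit suspiciously close to a real record, which can indicate memorization by the generator. The sketch below is a minimal illustration of that heuristic under assumed toy data, not an agency-endorsed or complete privacy test.

```python
import random

random.seed(1)

def euclid(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def flag_near_copies(synthetic, real, threshold):
    """Return indices of synthetic rows whose distance to the closest real
    record falls below the threshold -- a common screen for potential
    memorization / privacy leakage in synthetic data."""
    flagged = []
    for i, s in enumerate(synthetic):
        if min(euclid(s, r) for r in real) < threshold:
            flagged.append(i)
    return flagged

real = [(random.random(), random.random()) for _ in range(200)]
# Simulate leakage: the last synthetic row is an exact copy of a real record
synthetic = [(random.random(), random.random()) for _ in range(9)] + [real[0]]

print(flag_near_copies(synthetic, real, threshold=1e-6))  # → [9]
```

A screen like this is cheap to run and easy to document, which makes it a natural candidate for the disclosure and validation guidance the NIEHS piece anticipates; in practice the threshold would be calibrated against real-to-real holdout distances rather than set by hand.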