Synthetic data is moving from a tactical privacy workaround to a governed asset class. A new ERC-funded research project, a WEF briefing, and fresh medical evidence on privacy-versus-utility tradeoffs all point to the same conclusion: “synthetic” doesn’t mean “risk-free.”
New project to investigate societal consequences of using synthetic data to train algorithms
The University of York announced the launch of SYNDATA, a European Research Council-funded project led by Dr. Benjamin Jacobsen. The project will examine the practical, ethical, and political consequences of using synthetic data to train algorithms across sectors including healthcare and finance.
SYNDATA plans to use archival research, fieldwork, and case studies to understand how synthetic data changes decision-making and how it may reshape society and power structures—an increasingly urgent question as generative AI blurs the line between “real” and “synthetic” training inputs.
- Governance pressure will shift upstream. If synthetic data affects power structures, teams should expect scrutiny not just on model outputs, but on how synthetic datasets are produced, selected, and justified for specific uses.
- “Synthetic” won’t automatically satisfy ethics reviews. Procurement and model risk processes will likely need to treat synthetic datasets as first-class data assets with documented provenance, intended use, and limitations (a minimal sketch of such a record follows this list).
- Signal for regulators and standards bodies. Work like this can inform global discussions on data ethics and algorithmic fairness where synthetic data is used to sidestep real-data access constraints.
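To make the "first-class data asset" idea concrete, here is a minimal sketch of what a provenance record for a synthetic dataset could look like. The `SyntheticDatasetCard` class, its field names, and the example values are hypothetical illustrations, not a standard proposed by SYNDATA or any regulator; they simply show the kind of metadata a procurement or model-risk review might ask for.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class SyntheticDatasetCard:
    """Hypothetical provenance record accompanying a synthetic dataset release."""
    name: str
    generator: str                 # model family or tool used to generate the data
    source_data: str               # description of the real data (if any) the generator saw
    intended_use: str              # the downstream task the release was approved for
    known_limitations: List[str] = field(default_factory=list)
    privacy_evaluations: List[str] = field(default_factory=list)  # e.g. disclosure tests run

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Example record a reviewer might expect to see (all values illustrative).
card = SyntheticDatasetCard(
    name="claims-synth-v1",
    generator="tabular GAN (internal)",
    source_data="2019-2023 insurance claims, de-identified extract",
    intended_use="stress-testing fraud models; not approved for training",
    known_limitations=["rare claim types under-represented"],
    privacy_evaluations=["membership disclosure estimate, 2024-06 audit"],
)
print(card.to_json())
```

A structured record like this also gives downstream teams something to diff and audit when a synthetic dataset is regenerated or repurposed.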
Synthetic Data: The New Data Frontier (WEF briefing paper)
The World Economic Forum published a briefing paper positioning synthetic data as a scalable response to data gaps, privacy constraints, and training needs in sensitive domains such as healthcare and finance. The paper highlights use cases including testing, personalized AI, and red-teaming, while emphasizing the need for governance to manage accuracy, equity, and privacy risks.
The throughline is pragmatic: synthetic data can expand what teams can build when real data is limited or restricted, but it requires coordination across stakeholders to avoid embedding errors, bias, or false confidence into downstream systems.
- Expect “governed synthetic data” to become a baseline requirement. The WEF framing reinforces that enterprises will be asked to demonstrate controls, not just benefits—especially in regulated workflows.
- Testing and red-teaming are becoming mainstream justifications. If you’re building evaluation pipelines, synthetic data is increasingly treated as a tool for stress tests—provided you can show it reflects relevant edge cases without leaking sensitive patterns.
- Equity and accuracy are now part of the synthetic data spec. Data teams should plan to measure representativeness and error modes, not only privacy properties, before synthetic datasets are used for training or validation (see the representativeness check sketched after this list).
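One lightweight way to start measuring representativeness is to compare per-column distributions between real and synthetic samples before the synthetic set is admitted into a training or evaluation pipeline. The sketch below uses a two-sample Kolmogorov-Smirnov test per numeric column; the column names, toy data, and the 0.1 flag threshold are assumptions for illustration, not recommendations from the WEF paper.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def flag_unrepresentative_columns(real: pd.DataFrame,
                                  synthetic: pd.DataFrame,
                                  max_ks: float = 0.1) -> pd.DataFrame:
    """Compare numeric columns of real vs. synthetic data with a KS test.

    Returns one row per shared numeric column with the KS statistic and a
    flag when the distributions diverge more than `max_ks` (an arbitrary,
    illustrative threshold).
    """
    rows = []
    shared = real.select_dtypes("number").columns.intersection(synthetic.columns)
    for col in shared:
        stat, p_value = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        rows.append({"column": col, "ks_stat": stat,
                     "p_value": p_value, "flagged": stat > max_ks})
    return pd.DataFrame(rows)


# Toy example with hypothetical columns: the synthetic "age" column drifts from the real data.
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(45, 12, 5000),
                     "balance": rng.exponential(1000, 5000)})
synth = pd.DataFrame({"age": rng.normal(52, 12, 5000),
                      "balance": rng.exponential(1000, 5000)})
print(flag_unrepresentative_columns(real, synth))
```

Marginal checks like this do not capture joint distributions or rare edge cases, so they are a floor for a representativeness review, not a substitute for task-specific evaluation.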
Impact of synthetic data generation for high-dimensional cross-sectional medical data: privacy versus utility considerations
A study published in the Journal of the American Medical Informatics Association evaluates three synthetic data generation (SDG) strategies for high-dimensional, cross-sectional medical datasets. The authors compare privacy risk, specifically membership disclosure, against data utility, and examine how results change when generating synthetic data from the full dataset versus subsets.
The paper’s core contribution is evidence-based guidance for balancing privacy and utility in medical data sharing platforms, where synthetic data is often proposed as a way to enable research while reducing exposure of patient information.
- Privacy risk is measurable, and not eliminated. Membership disclosure analysis underscores that synthetic releases can still leak information, so “synthetic” should not be treated as a blanket de-identification claim (see the membership inference sketch after this list).
- Subset vs. full-dataset generation is a design lever. The study’s comparison gives teams a practical knob to tune depending on whether the priority is utility for modeling or risk reduction for sharing.
- Compliance teams get a stronger technical basis. Quantifying privacy-utility tradeoffs supports documentation and review processes relevant to GDPR and HIPAA-aligned controls in healthcare research pipelines.
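To make the membership disclosure concern concrete, here is a minimal sketch of a distance-based membership inference check: if records used to fit the generator sit systematically closer to the synthetic data than held-out records do, the release carries membership signal. This is a generic illustration under assumed toy data, not the specific SDG strategies or the attack protocol evaluated in the JAMIA study.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score


def membership_disclosure_auc(synthetic: np.ndarray,
                              train_records: np.ndarray,
                              holdout_records: np.ndarray) -> float:
    """Score membership inference via nearest-neighbor distance to the synthetic data.

    An AUC near 0.5 means training members are no easier to identify than
    held-out non-members; values well above 0.5 indicate leakage.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    d_train, _ = nn.kneighbors(train_records)
    d_holdout, _ = nn.kneighbors(holdout_records)
    # Smaller distance to a synthetic record suggests membership, so negate distances as scores.
    scores = -np.concatenate([d_train.ravel(), d_holdout.ravel()])
    labels = np.concatenate([np.ones(len(train_records)), np.zeros(len(holdout_records))])
    return roc_auc_score(labels, scores)


# Toy data: a leaky "generator" that copies training records with small noise,
# which should push the membership AUC well above 0.5.
rng = np.random.default_rng(1)
train = rng.normal(size=(500, 20))
holdout = rng.normal(size=(500, 20))
synthetic = train + rng.normal(scale=0.05, size=train.shape)
print(f"membership AUC: {membership_disclosure_auc(synthetic, train, holdout):.2f}")
```

Running a check like this against both full-dataset and subset-based generation is one practical way to exercise the design lever the study describes: compare the membership AUC and downstream model utility for each configuration before deciding what to share.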
