LLM-driven synthetic data: new reviews, governance guidance, and ethics pressure
Daily Brief · 4 min read

daily-brief · synthetic-data · llms · healthcare-ai · data-governance · ai-privacy

Five new reads triangulate where synthetic data is headed: LLM-first generation is now mainstream in biomedicine, while governance bodies and bioethicists are converging on the same message—synthetic data reduces friction, but it doesn’t remove accountability.

A Scoping Review of Synthetic Data Generation by Language Models for Biomedical Applications

An arXiv scoping review surveys 59 studies (2020–2025) on using large language models to generate synthetic biomedical and clinical data. Prompt-based generation dominates the literature (74.6% of studies), with applications spanning EHR synthesis and synthetic radiology reports used in settings like cancer detection.

For data teams, the key signal is methodological convergence: many groups now favor prompt-driven approaches over heavier modeling pipelines, which speeds prototyping but increases reliance on careful prompt design, evaluation, and documentation when outputs feed downstream training or benchmarking.

  • Healthcare synthetic data is increasingly LLM-shaped, so validation plans should explicitly test prompt sensitivity and stability.
  • Use cases (EHR, radiology text) are high-risk; privacy and fairness claims need evidence, not assumptions.
  • Review coverage (59 studies) gives compliance and RAI leads a concrete map of what peers are actually doing.
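A prompt-sensitivity check like the one the first bullet calls for can be very simple. The sketch below is purely illustrative: the review does not prescribe any particular metric, and `fake_llm`, `prompt_stability`, the Jaccard token overlap, and the 0.6 threshold are all assumptions made for the example. It generates one output per paraphrased prompt and flags the batch if any pair of outputs diverges too much.

```python
# Hypothetical sketch: names, metric, and threshold are illustrative,
# not taken from the scoping review.
from collections import Counter

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token multisets of two outputs."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    inter = sum((ta & tb).values())
    union = sum((ta | tb).values())
    return inter / union if union else 1.0

def prompt_stability(generate, prompt_variants, threshold=0.6):
    """Generate once per paraphrased prompt; flag low pairwise overlap.

    `generate` is any callable wrapping your LLM call.
    Returns (worst pairwise overlap, passed?).
    """
    outputs = [generate(p) for p in prompt_variants]
    scores = [
        token_overlap(outputs[i], outputs[j])
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    ]
    worst = min(scores) if scores else 1.0
    return worst, worst >= threshold

# Toy stand-in for an LLM so the check itself is runnable:
fake_llm = lambda p: "synthetic discharge note for a 62 year old patient"
worst, ok = prompt_stability(fake_llm, [
    "Write a synthetic discharge note.",
    "Generate a fictional hospital discharge summary.",
])
```

In practice you would swap the token-overlap metric for something domain-appropriate (label distributions, embedding similarity, clinical-entity counts) and record the result in the validation report alongside the prompts used.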

Synthetic Data: The New Data Frontier

The World Economic Forum publishes a strategic brief positioning synthetic data as a response to data scarcity, privacy restrictions, and representativeness gaps across sectors. It offers governance recommendations and highlights use cases including healthcare, e-commerce, and child behavior modeling.

The report also stresses hybrid strategies that combine synthetic and organic data to reduce risks like model collapse and to support equity goals—guidance founders can translate into product guardrails and procurement requirements.

  • Expect buyers to ask for governance artifacts (policies, controls, evaluation) alongside model performance.
  • “Hybrid data” becomes a practical default: plan pipelines, lineage, and audits across both data types.
  • Public-sector framing can influence regulation and standards, shaping enterprise checklists.

Synthetic data created by generative AI poses ethical challenges

NIEHS features bioethicist David Resnik on the ethical challenges of GenAI-created synthetic data, placing today's surge in the context of synthetic data's 60-plus-year history. The argument: faster generation doesn't eliminate ethical duties around research integrity, misuse, and governance.

For compliance leads, this is a reminder that “no real people” is not a universal safe harbor—synthetic datasets can still encode sensitive attributes, enable harmful inferences, or be misrepresented as ground truth.

  • Ethics scrutiny is moving upstream: dataset intent, disclosure, and limitations statements will matter more.
  • Government voices can accelerate expectations for formal oversight in research and clinical contexts.
  • Teams should define acceptable-use boundaries and monitoring before synthetic data scales internally.

Synthetic Data for Artificial Intelligence and Machine Learning

SPIE’s Defense + Commercial Sensing 2025 proceedings volume compiles 13 sessions and 33 papers on synthetic data for AI/ML, reflecting both research and industry practice in high-stakes domains.

  • Peer-reviewed proceedings are a signal that synthetic data methods are maturing beyond vendor claims.
  • Defense/commercial overlap often drives tooling that later lands in enterprise ML stacks.

Examining the Expanding Role of Synthetic Data Throughout the AI Lifecycle

An ACM Digital Library qualitative study draws on 29 interviews with AI practitioners and responsible AI experts to map how synthetic data is used from training through deployment. The focus is organizational reality: where synthetic data fits, where it doesn’t, and what governance gaps persist.

  • Interview-based evidence helps teams benchmark adoption patterns and common failure modes.
  • Governance gaps show up across the lifecycle, not just at data generation time.