Healthcare synthetic data is scaling fast, but evaluation and provenance controls are still lagging. Five new pieces—from scoping reviews to enterprise frameworks—converge on the same message: adoption is real, standards are not.
A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research
An arXiv scoping review analyzed 59 studies (2020–2025) on LLM-based synthetic data in biomedical research. It finds that prompt-based generation dominates (74.6% of studies) and that unstructured text is the main modality (78%). The authors also flag persistent gaps in standardized evaluation protocols and call for transparent frameworks to support clinical adoption.
- Data teams should treat “synthetic” as a product with test suites, not a one-off augmentation step.
- Compliance leads can use the review’s findings to justify minimum evaluation and documentation requirements before clinical use.
- Founders selling synthetic pipelines will be judged on validation rigor, not just generation throughput.
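The “synthetic as a product with test suites” idea can be sketched minimally. Everything below (column names, toy data, tolerances) is an assumption for illustration, not drawn from the review:

```python
import random
import statistics

# Hypothetical example: a synthetic health dataset shipped with its own
# test suite, like any other data product. Columns and thresholds are invented.
random.seed(0)
real = [{"age": random.gauss(50, 10), "bmi": random.gauss(27, 4)} for _ in range(1000)]
synthetic = [{"age": random.gauss(51, 11), "bmi": random.gauss(26.5, 4.5)} for _ in range(1000)]

def check_schema(rows, expected_cols):
    """Every synthetic row must expose exactly the expected columns."""
    return all(set(r) == expected_cols for r in rows)

def check_mean_drift(real_rows, synth_rows, col, tol):
    """Flag a synthetic column whose mean drifts beyond a tolerance."""
    return abs(statistics.mean(r[col] for r in real_rows)
               - statistics.mean(r[col] for r in synth_rows)) <= tol

assert check_schema(synthetic, {"age", "bmi"})
assert check_mean_drift(real, synthetic, "age", tol=3.0)
assert check_mean_drift(real, synthetic, "bmi", tol=2.0)
print("synthetic data product checks passed")
```

Run on every regenerated release of the dataset, the same way unit tests run on every code change; a failed check blocks the release rather than surfacing downstream.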
Synthetic Data: The New Data Frontier
The World Economic Forum’s September 2025 report positions synthetic data as a response to data scarcity, privacy constraints, and representativeness issues across sectors. It also outlines governance recommendations for developers, organizations, and regulators, and highlights risks including model collapse and data integrity concerns. The framing is policy-forward: synthetic data is useful, but only if oversight keeps pace with deployment.
- Expect procurement and audits to ask for governance artifacts (risk assessments, intended-use statements, monitoring plans).
- The report’s “model collapse” and data-integrity language gives regulators a vocabulary for scrutinizing synthetic-heavy training loops.
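Those governance artifacts are easiest to audit when they are machine-readable. A sketch of what that could look like, assuming a made-up schema (the field names are not a standard; the report does not prescribe a format):

```python
from dataclasses import dataclass, field, asdict
import json

# Hypothetical governance record covering the artifact types procurement may
# request: intended-use statement, risk assessment, monitoring plan.
@dataclass
class SyntheticDataGovernanceRecord:
    dataset_name: str
    intended_use: str
    risks_assessed: list = field(default_factory=list)
    monitoring_plan: str = "none declared"

record = SyntheticDataGovernanceRecord(
    dataset_name="synthetic_claims_v1",            # invented example dataset
    intended_use="cohort-level modeling; not for individual-level inference",
    risks_assessed=["model collapse", "membership inference", "subgroup distortion"],
    monitoring_plan="quarterly drift review against source registry",
)
print(json.dumps(asdict(record), indent=2))
```

A record like this can live alongside the dataset and be diffed, versioned, and checked in CI, so “do we have a monitoring plan?” becomes a mechanical question rather than an email thread.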
Synthetic Data in Health Economics and Outcomes Research
A peer-reviewed HEOR-focused article reviews how synthetic data can address privacy limits, data insufficiency, and underrepresented populations—especially in rare disease research. It argues that evaluation frameworks should be tailored to HEOR contexts rather than borrowed wholesale from other ML benchmarks. For teams building evidence packages, the point is practical: utility metrics must map to downstream economic and outcomes analyses.
- HEOR stakeholders may require domain-specific validation (e.g., preserving subgroup effects) beyond generic similarity scores.
- Rare-disease programs can use synthetic data to expand analysis while keeping tighter privacy postures.
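“Preserving subgroup effects” can be made concrete as a validation check: estimate the same treatment effect in the rare subgroup on real and synthetic data and gate on the gap. The simulation, effect sizes, and tolerance below are all invented for illustration:

```python
import random
import statistics

random.seed(1)

def simulate(effect_rare, n):
    """Rows of (subgroup, treated, outcome); the rare subgroup has a larger effect."""
    rows = []
    for _ in range(n):
        sub = "rare" if random.random() < 0.2 else "common"
        treated = random.random() < 0.5
        eff = effect_rare if sub == "rare" else 1.0
        rows.append((sub, treated, random.gauss(10, 1) + (eff if treated else 0)))
    return rows

def subgroup_effect(rows, subgroup):
    """Difference in mean outcome, treated vs. untreated, within one subgroup."""
    treated = [o for s, t, o in rows if s == subgroup and t]
    control = [o for s, t, o in rows if s == subgroup and not t]
    return statistics.mean(treated) - statistics.mean(control)

real = simulate(effect_rare=3.0, n=5000)
synthetic = simulate(effect_rare=2.8, n=5000)  # generator slightly attenuates the effect

gap = abs(subgroup_effect(real, "rare") - subgroup_effect(synthetic, "rare"))
assert gap < 0.5, "synthetic data failed to preserve the rare-subgroup effect"
print(f"rare-subgroup effect gap: {gap:.2f}")
```

This is the HEOR-specific flavor of validation the article argues for: the metric is tied to the downstream economic analysis (a subgroup effect estimate), not a generic distribution-similarity score.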
SynthLLM: Breaking the AI ‘Data Wall’ with Scalable Synthetic Data
Microsoft Research Asia introduced SynthLLM, a framework to generate synthetic training data from pretraining corpora without manual labeling. The article positions it as broadly applicable across healthcare, autonomous driving, education, and code generation. The core claim is scalability: synthetic generation as a repeatable pipeline rather than a bespoke labeling effort.
- Teams should interrogate provenance: what source corpora are used, and how are licensing and sensitive-content risks handled?
- “No labeling” shifts cost away from manual annotation but increases the need for automated QA gates and drift monitoring.
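One minimal form of such a QA gate, sketched here under assumptions (this is not part of SynthLLM): compare each generated batch against a reference batch with a two-sample Kolmogorov–Smirnov statistic and hold batches that drift past a threshold.

```python
import bisect
import random

def ks_statistic(a, b):
    """Max gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(a), sorted(b)
    ecdf = lambda xs, v: bisect.bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, p) - ecdf(b, p)) for p in a + b)

random.seed(2)
reference = [random.gauss(0, 1) for _ in range(2000)]   # stand-in quality scores
new_batch = [random.gauss(0.05, 1) for _ in range(2000)]  # mild simulated drift

stat = ks_statistic(reference, new_batch)
GATE = 0.1  # assumed threshold; calibrate on real pipelines
assert stat < GATE, "drift gate tripped: hold this batch for review"
print(f"KS statistic: {stat:.3f} (gate {GATE})")
```

In practice the scores fed into the gate would come from automated scorers (perplexity, toxicity, schema validity), and the threshold would be set from historical batches rather than hard-coded.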
GenAI Synthetic Data Create Ethical Challenges for Scientists
A PNAS study highlights the integrity and ethics risks that arise when GenAI-produced synthetic data is misrepresented as real, undermining reproducibility. The paper signals the need for transparency and validation standards, especially in scientific contexts where data lineage is part of the evidence chain. The takeaway: synthetic data can be legitimate, but undisclosed synthetic data is a governance failure.
- Research orgs may need explicit labeling policies and provenance metadata as default publication requirements.
- Validation standards will likely become table stakes for journals, funders, and IRBs reviewing AI-enabled studies.
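The “provenance metadata by default” idea can be as simple as a sidecar record that declares the data synthetic and ties the claim to exact bytes. The schema below is illustrative only, not a journal or IRB standard:

```python
import hashlib
import json

data = b'{"patients": [{"age": 54, "outcome": 1}]}'  # stand-in dataset payload

provenance = {
    "is_synthetic": True,                        # explicit labeling, never implied
    "generator": "example-llm-pipeline v0.1",    # hypothetical generator name
    "source_basis": "aggregate statistics only, no row-level source data",
    "sha256": hashlib.sha256(data).hexdigest(),  # binds the declaration to these bytes
}
print(json.dumps(provenance, indent=2))
```

The content hash matters: a reviewer can verify that the labeled file is the same file used in the analysis, which is exactly the lineage question the PNAS authors raise.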
