Synthetic data is being pulled in two directions at once: technical adoption is accelerating (especially via LLMs), while governance and ethics bodies push for clearer rules on where it is safe, representative, and auditable.
A Scoping Review of Synthetic Data Generation by Language Models for Biomedical Applications
An arXiv scoping review surveys 59 studies (2020–2025) on LLM-driven synthetic data generation for biomedical and clinical research. It finds prompt-based generation dominates the literature (74.6% of studies), with use cases spanning EHR synthesis and synthetic radiology reports used for tasks like cancer detection. The paper maps methods and applications rather than pitching a single tool, which makes it useful for teams benchmarking “what’s common” versus “what’s experimental.”
- For healthcare ML teams, the 74.6% figure is a signal that prompt-based pipelines are the default baseline—so evaluations should compare against that, not only bespoke generators.
- Privacy and fairness claims need proof: the review’s breadth highlights how often synthetic data is used to address scarcity and representativeness, but governance should require measurable utility and bias checks.
- Founders selling synthetic EHR or report generation should expect buyers to ask where their approach sits relative to the surveyed methods and what failure modes were observed.
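The "measurable utility and bias checks" point above can be made concrete. Below is a minimal sketch of one such check: comparing category proportions between a real and a synthetic dataset and flagging large representation gaps. The field name, data, and 10-point tolerance are illustrative assumptions, not anything prescribed by the review.

```python
# Hedged sketch: a minimal representativeness check for a synthetic dataset.
# The field ("sex"), the toy records, and the tolerance are illustrative
# assumptions, not from the scoping review.
from collections import Counter

def proportion_gaps(real_rows, synth_rows, field):
    """Per-category proportion difference (synthetic minus real) for one field."""
    real_counts = Counter(r[field] for r in real_rows)
    synth_counts = Counter(r[field] for r in synth_rows)
    categories = set(real_counts) | set(synth_counts)
    gaps = {}
    for cat in categories:
        p_real = real_counts.get(cat, 0) / len(real_rows)
        p_synth = synth_counts.get(cat, 0) / len(synth_rows)
        gaps[cat] = p_synth - p_real
    return gaps

real = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": "M"}]
synth = [{"sex": "F"}, {"sex": "M"}, {"sex": "M"}, {"sex": "M"}]
gaps = proportion_gaps(real, synth, "sex")

# Flag any category whose share drifted by more than an (arbitrary) 10 points.
flagged = {cat: gap for cat, gap in gaps.items() if abs(gap) > 0.10}
print(flagged)
```

In practice a governance checklist would pair a representativeness check like this with a downstream utility measure (e.g., train-on-synthetic, test-on-real performance), so that a generator cannot pass review on distribution shape alone.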
Synthetic Data: The New Data Frontier
The World Economic Forum publishes a strategic brief positioning synthetic data as a response to data scarcity, privacy restrictions, and representativeness gaps across sectors. It offers governance recommendations and highlights use cases including healthcare, e-commerce, and child behavior modeling. A key framing is that synthetic data is not a universal substitute; hybrid approaches combining synthetic and organic data are emphasized to reduce risks like model collapse and to support equity goals.
- Compliance leads can treat this as a policy-aligned checklist starter: governance, documentation, and decision rights are becoming table stakes, not “nice to have.”
- Data teams should plan for hybrid datasets and monitoring—operationally, that means lineage, dataset composition tracking, and periodic re-validation as distributions drift.
- Public-sector and regulated buyers may increasingly reference WEF-style frameworks in procurement and audits, shaping vendor requirements.
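The "periodic re-validation as distributions drift" recommendation can be sketched as a scheduled check that compares a reference (real) numeric column against each new synthetic batch. The example below hand-rolls a two-sample Kolmogorov-Smirnov statistic using only the standard library; the data, column choice, and any alerting threshold you would set on the statistic are illustrative assumptions.

```python
# Hedged sketch of periodic re-validation: compute a two-sample
# Kolmogorov-Smirnov statistic between a reference (real) numeric column
# and the latest synthetic batch. Sample values are illustrative assumptions.
def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    evaluation_points = sorted(set(a) | set(b))

    def ecdf(sorted_sample, x):
        # Fraction of values <= x (linear scan is fine at sketch scale).
        return sum(v <= x for v in sorted_sample) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in evaluation_points)

reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]      # real-data column snapshot
current_batch = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # synthetic batch, no drift
shifted_batch = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]  # synthetic batch, full drift

print(ks_statistic(reference, current_batch))  # 0.0: distributions match
print(ks_statistic(reference, shifted_batch))  # 1.0: no overlap at all
```

Production monitoring would typically use a vetted implementation (e.g., `scipy.stats.ks_2samp`) with a proper significance test, and log the statistic alongside dataset lineage so re-validation results are auditable over time.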
Synthetic data created by generative AI poses ethical challenges
NIEHS bioethicist David Resnik outlines ethical challenges tied to GenAI-created synthetic data, placing today's surge in the context of more than 60 years of synthetic data use. The piece focuses on how rapidly expanding capability (e.g., systems like ChatGPT) increases both opportunity and governance risk. It's a reminder that "synthetic" doesn't automatically mean harmless: misuse and misinterpretation risks remain, especially when downstream consumers assume the data is neutral or privacy-safe by default.
- Ethics review and IRB-style thinking are likely to expand beyond human-subjects data to include synthetic datasets used in research and clinical workflows.
- Teams should document intended use, known limitations, and who may be harmed by errors—treat synthetic datasets as products with safety requirements.
- Regulators may scrutinize synthetic data claims more aggressively as GenAI lowers the barrier to producing plausible-but-wrong records.
Synthetic Data for Artificial Intelligence and Machine Learning
SPIE’s Defense + Commercial Sensing 2025 proceedings volume compiles 13 sessions and 33 papers on synthetic data for AI/ML, reflecting both research and applied practice. As peer-reviewed conference output, it signals sustained investment in synthetic data methods and evaluation in high-stakes environments. For practitioners, proceedings like this often surface emerging validation techniques and domain-specific constraints before they appear in standards.
- Defense and sensing domains tend to pressure-test edge cases; methods validated there often migrate into commercial tooling and benchmarks.
- Engineering leads can mine proceedings for evaluation patterns (scenario coverage, edge-case generation, domain shift testing) to harden their own pipelines.
Examining the Expanding Role of Synthetic Data Throughout the AI Lifecycle
An ACM study uses 29 interviews with AI practitioners and responsible AI experts to map how synthetic data is used across the lifecycle—from training through deployment. The qualitative lens highlights adoption realities: where teams rely on synthetic data, what governance gaps persist, and how “responsible AI” roles interact with engineering decisions. For organizations, this kind of evidence is useful for building internal policy that matches actual workflow rather than idealized diagrams.
- Interview-based findings can help leaders anticipate friction points: ownership of synthetic data quality, sign-off processes, and monitoring responsibilities post-deploy.
- Vendors should expect buyers to ask not just “does it work,” but “who approves it, how is it audited, and what happens when it fails in production?”
