Synthetic Data Generation in Healthcare: A Review of Methods and Impacts
Daily Brief



daily-brief · privacy · healthcare

A new healthcare-focused review catalogs synthetic data generation approaches across tabular, imaging, and omics data, and makes the case that synthetic data can expand AI development while reducing direct exposure of patient records. For data and compliance teams, the useful takeaway is less “synthetic solves privacy” and more “which method fits which data type, and what to validate before sharing or training.”

Review: Synthetic data methods in healthcare, from statistical to deep learning, across tabular, imaging, and omics

A systematic review surveys synthetic data generation in healthcare and organizes techniques into broad families (statistical, probabilistic, machine learning, and deep learning) applied to multiple healthcare data modalities, including tabular clinical datasets, medical imaging, and omics. The paper frames synthetic data as a response to two recurring blockers in healthcare AI: limited access to real-world datasets (scarcity, fragmentation) and strict privacy constraints around patient records.
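To make the method families concrete, here is a minimal sketch of the simplest one the review describes: a purely statistical generator for numeric tabular data that fits a multivariate normal and samples from it. The column names and parameters are invented for illustration; real clinical tables need per-column types, bounds, and validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real de-identified numeric table
# (columns: age, systolic BP, BMI) -- values are illustrative only.
real = rng.multivariate_normal(
    mean=[62.0, 130.0, 27.0],
    cov=[[90.0, 15.0, 4.0],
         [15.0, 180.0, 9.0],
         [4.0, 9.0, 16.0]],
    size=500,
)

# Fit: estimate the mean vector and covariance matrix from the data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Generate: draw synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mu, sigma, size=500)

# Sanity check: synthetic marginals should roughly match the real ones.
print(np.round(mu, 1))
print(np.round(synthetic.mean(axis=0), 1))
```

Deep learning approaches (GANs, VAEs, diffusion models) replace the fitted normal with a learned generator, which captures richer structure but makes privacy and fidelity harder to audit.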

The review argues synthetic data can help teams: (1) build and test models when real data access is restricted, (2) reduce privacy risk relative to distributing raw patient data, and (3) lower time and cost in research workflows such as clinical trials by generating “virtual populations,” including for rare diseases. It also highlights a fairness angle: synthetic generation can be used to rebalance datasets when underrepresented demographics drive biased model behavior, potentially improving generalizability and reducing harmful disparities, provided teams validate what the synthetic process actually changed.
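The rebalancing idea above can be sketched with a SMOTE-style heuristic: create synthetic rows for an underrepresented subgroup by interpolating between pairs of its real records. All data and group sizes below are invented for illustration; this is not the review's specific method.

```python
import numpy as np

rng = np.random.default_rng(1)

def oversample(minority: np.ndarray, n_new: int, rng) -> np.ndarray:
    """Create n_new synthetic rows by linear interpolation between
    random pairs of minority-group records (a SMOTE-like heuristic)."""
    i = rng.integers(0, len(minority), size=n_new)
    j = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weights in [0, 1]
    return minority[i] + t * (minority[j] - minority[i])

# Toy cohort: 900 majority rows vs. 100 minority rows, 4 features each.
majority = rng.normal(0.0, 1.0, size=(900, 4))
minority = rng.normal(0.5, 1.0, size=(100, 4))

# Generate enough synthetic minority rows to balance the two groups.
synthetic_minority = oversample(minority, n_new=800, rng=rng)
balanced = np.vstack([majority, minority, synthetic_minority])
print(balanced.shape)
```

As the review cautions, balancing the training table is only the first step: subgroup performance, calibration, and error modes still have to be checked on real held-out data.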

  • Method selection is now a governance decision. Choosing between statistical/probabilistic approaches vs. ML/deep learning isn’t just a modeling preference; it changes privacy risk, fidelity, and how you prove utility for tabular vs. imaging vs. omics data.
  • “Privacy-preserving” still requires evidence. The review positions synthetic data as a way to mitigate privacy exposure and support HIPAA/GDPR-aligned sharing, but operationally that means documenting intended use, running disclosure risk checks, and setting release policies—not assuming synthetic equals safe.
  • Clinical trial acceleration is a concrete near-term use case. Virtual populations can reduce cost and time, especially for rare disease scenarios; data leads should evaluate where synthetic cohorts can support feasibility studies, simulation, or protocol design without touching identifiable records.
  • Fairness work can move earlier in the pipeline. Synthetic generation can help address underrepresentation, but teams should measure downstream effects (performance by subgroup, calibration, error modes) and track whether the generator introduces artifacts that look like “improvements” but don’t hold in real-world validation.
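One concrete disclosure-risk check of the kind the takeaways call for is a nearest-neighbor distance test: flag synthetic rows that sit suspiciously close to a real record, which can indicate memorization by the generator. The threshold and data here are illustrative, not a compliance standard.

```python
import numpy as np

rng = np.random.default_rng(2)

def min_real_distances(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest real row."""
    # Pairwise distances via broadcasting: shape (n_syn, n_real).
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)

real = rng.normal(size=(200, 5))
synthetic = rng.normal(size=(200, 5))
# Simulate memorization: copy one real record into the synthetic set.
synthetic[0] = real[17]

dists = min_real_distances(synthetic, real)
threshold = 0.05  # illustrative; calibrate against real-vs-real distances
risky = np.flatnonzero(dists < threshold)
print(risky)  # the copied row (index 0) should appear here
```

In practice this would run alongside other checks (attribute disclosure, membership-inference probes) and feed the release policy rather than a single pass/fail gate.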