Synthetic Data Drives Rare Disease Research Forward — Key Findings from Recent Review
Daily Brief

A November 2025 scoping review of 118 papers (2018–2025) finds synthetic data is an increasingly common tactic in rare disease research, especially for medical imaging. The consistent message: synthetic data can ease data scarcity, but it must be rigorously validated before it informs clinical decisions.

Scoping review: Synthetic data adoption is accelerating in rare disease work, led by imaging

A November 2025 scoping review analyzed 118 peer-reviewed studies published between 2018 and 2025 on synthetic data and data augmentation for rare disease research. Across the surveyed literature, imaging datasets (including X-rays and MRIs) appeared most frequently, reflecting where synthetic generation and augmentation are easiest to operationalize (and where labels are expensive and scarce). The review also notes that deep generative models have gained traction since 2021, indicating a shift from primarily classical augmentation toward more model-driven synthesis approaches.

While the review frames synthetic data as a practical response to chronic data scarcity in rare diseases, it draws a line between “expanding datasets” and “deploying clinically.” The authors’ key caution is that synthetic data, whether produced via classical augmentation or deep generative models, needs strict validation for biological and clinical relevance. In other words, the bottleneck is no longer generating more samples; it is proving that those samples do not introduce clinically misleading artifacts or spurious correlations.

  • For ML teams: treat synthetic data as a controlled intervention, not a shortcut—pair synthetic with real data and define acceptance criteria (e.g., holdout performance on real-only test sets) before using it to support model claims.
  • For clinical and product stakeholders: the review reinforces that “bigger datasets” aren’t automatically “better datasets”; without validation, synthetic data can inflate apparent robustness while hiding failure modes that matter at the bedside.
  • For privacy and collaboration: the review highlights a growing pattern of combining synthetic data with federated learning and privacy-preserving analytics to enable multi-center rare disease studies with lower exposure risk than raw data sharing.
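
To make the first takeaway concrete, here is a minimal sketch of what an acceptance check on a real-only test set could look like. It is an illustration under stated assumptions, not a method taken from the review: the nearest-centroid "model", the toy two-feature data, the function names (`make_samples`, `evaluate_synthetic`) and the `min_gain` threshold are all hypothetical stand-ins for whatever model, data and criteria a team actually uses.

```python
import random
from statistics import mean


def make_samples(rng, n, center, label, spread=1.0):
    """Draw n toy two-feature samples around a class center (illustrative data only)."""
    return [
        ([rng.gauss(center[0], spread), rng.gauss(center[1], spread)], label)
        for _ in range(n)
    ]


def fit_centroids(samples):
    """Stand-in model: one mean feature vector (centroid) per class label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(features, centroids[label])),
    )


def accuracy(centroids, samples):
    """Fraction of samples classified correctly."""
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in samples)


def evaluate_synthetic(real_train, synthetic, real_test, min_gain=0.0):
    """Compare real-only vs. real+synthetic training on a REAL-ONLY test set.

    min_gain is a hypothetical acceptance threshold: the synthetic data is
    accepted only if it changes real-test accuracy by at least this much.
    """
    baseline = accuracy(fit_centroids(real_train), real_test)
    augmented = accuracy(fit_centroids(real_train + synthetic), real_test)
    return {
        "baseline": baseline,
        "augmented": augmented,
        "accepted": augmented - baseline >= min_gain,
    }


# Toy run: a handful of "real" cases per class plus a larger synthetic set.
rng = random.Random(0)
real_train = make_samples(rng, 5, (0, 0), 0) + make_samples(rng, 5, (2, 2), 1)
real_test = make_samples(rng, 20, (0, 0), 0) + make_samples(rng, 20, (2, 2), 1)
synthetic = make_samples(rng, 50, (0, 0), 0) + make_samples(rng, 50, (2, 2), 1)
print(evaluate_synthetic(real_train, synthetic, real_test))
```

The point of the design is that synthetic samples never enter the test set, and the acceptance threshold is fixed before the comparison is run. A single accuracy number is only a starting point; in practice the same comparison would be repeated per subgroup and per clinically relevant metric.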