Synthetic Data Revolutionizing Rare Disease Research
Daily Brief


daily-brief · research · privacy

A scoping review of 118 studies argues synthetic data can materially expand rare-disease datasets and improve model robustness. The catch: teams need biological plausibility validation, not just statistical similarity, to avoid generating clinically misleading signals.

Review of 118 studies: CTGAN and other generators help with rare-disease data scarcity

An MDPI-published scoping review (Nov. 10, 2025) synthesizing 118 studies finds synthetic data is increasingly used to address a core bottleneck in rare-disease research: small, fragmented datasets that limit analysis and model training. Across the surveyed literature, the review highlights both classical augmentation approaches and deep generative methods, explicitly including conditional tabular GANs such as CTGAN, as practical tools to expand datasets and support more robust modeling.

The review also flags a key risk: synthetic records that look “right” statistically can still be biologically implausible. It emphasizes biological plausibility validation as a required step so generated data remains clinically meaningful, rather than becoming a high-volume source of subtle artifacts that distort downstream findings.
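To make the distinction concrete, here is a minimal sketch of how a synthetic record can pass a marginal check yet fail a domain rule. The features (current age, age at diagnosis), the sample values, and the rule itself are illustrative assumptions, not examples from the review.

```python
# Hypothetical real cohort: (current_age, age_at_diagnosis) pairs.
real = [(45, 30), (60, 55), (25, 5), (70, 60)]

def within_marginals(record, cohort):
    """True if each feature falls inside the real cohort's observed range
    for that feature, i.e. the record looks fine marginally."""
    for i, value in enumerate(record):
        col = [row[i] for row in cohort]
        if not (min(col) <= value <= max(col)):
            return False
    return True

def biologically_plausible(record):
    """Domain rule (assumed): age at diagnosis cannot exceed current age."""
    current_age, age_at_dx = record
    return age_at_dx <= current_age

# (50, 40): passes both checks.
assert within_marginals((50, 40), real) and biologically_plausible((50, 40))
# (30, 58): each value is inside its marginal range, yet diagnosis at 58
# for a 30-year-old is implausible -- exactly the failure mode flagged above.
assert within_marginals((30, 58), real) and not biologically_plausible((30, 58))
```

The point of the sketch: marginal-range (or distribution-similarity) checks only look at one feature at a time, so they cannot catch violations of cross-feature constraints; those need explicit, domain-authored rules.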

  • For ML teams: synthetic data can help stabilize training and evaluation in low-N settings, but only if you validate that generated features preserve clinically plausible relationships—not just marginal distributions.
  • For privacy and governance: “more data” is not automatically safer or better; generation should be paired with documented utility and plausibility checks to reduce the chance of harmful, misleading outputs entering research pipelines.
  • For product and research leads: the review reinforces a pragmatic path forward in rare-disease research: combine augmentation and generative modeling to widen cohorts, while treating plausibility testing as a gating control before analysis or model release.
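The "gating control" idea in the bullets above can be sketched as a release check that combines a utility signal (marginal shift) with a plausibility signal (domain-rule violation rate). The thresholds, scoring choices, and rule are assumptions for illustration, not prescriptions from the review.

```python
from statistics import mean, stdev

def marginal_shift(real_col, synth_col):
    """Absolute mean difference, scaled by the real column's spread."""
    return abs(mean(real_col) - mean(synth_col)) / (stdev(real_col) or 1.0)

def violation_rate(records, rule):
    """Fraction of synthetic records breaking a domain plausibility rule."""
    return sum(not rule(r) for r in records) / len(records)

def gate(real, synthetic, rule, max_shift=0.5, max_violations=0.05):
    """Release the synthetic set only if every feature's marginal shift is
    small AND the plausibility-violation rate is below a tolerance.
    Thresholds here are illustrative defaults, not recommended values."""
    n_features = len(real[0])
    shifts = [
        marginal_shift([r[i] for r in real], [s[i] for s in synthetic])
        for i in range(n_features)
    ]
    return max(shifts) <= max_shift and violation_rate(synthetic, rule) <= max_violations

# Example with hypothetical (current_age, age_at_diagnosis) records.
real = [(45, 30), (60, 55), (25, 5), (70, 60)]
rule = lambda r: r[1] <= r[0]  # diagnosis cannot postdate current age
good = [(50, 40), (48, 35), (62, 50), (30, 10)]
bad = [(30, 58), (28, 55), (25, 60), (32, 50)]  # all violate the rule
```

In a real pipeline the rule set would come from clinical domain experts, and a failed gate would route the generator back for retraining rather than letting the data into analysis.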