Rare disease to oncology: new synthetic data work tightens methods and expands clinical use cases
Daily Brief · 4 min read

daily-brief · synthetic-data · healthcare-ai · privacy · data-governance · generative-ai

Synthetic data is moving from “can we generate it?” to “can we trust it for decisions?” Five new pieces span clinical imaging, longitudinal oncology data, and statistical inference—plus two broad research primers on methods and evaluation.

Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review

A new scoping review surveys how rare-disease researchers use data augmentation and synthetic data generation to cope with small cohorts and heterogeneous phenotypes. The authors screened 2,864 candidate studies and highlight increased use of deep generative models since 2021. They contrast rule-based approaches (often more interpretable) with deep generative techniques that can be more powerful but require stronger validation.

  • For data leads, it’s a selection framework: match technique to scarcity pattern (tiny n, missing modalities, phenotype heterogeneity) and validation burden.
  • For compliance teams, the emphasis on ethical validation and regulatory alignment is a reminder: “synthetic” doesn’t remove the need for documented risk assessment.
  • For founders, the gap is productizable: tooling that standardizes evaluation and reporting across rare-disease settings.
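The "match technique to scarcity pattern" idea in the first bullet can be sketched as a simple lookup. This is purely illustrative: the pattern names, technique families, and burden notes below are our assumptions, not a taxonomy from the review.

```python
# Illustrative mapping from scarcity pattern to a candidate technique family.
# Pattern keys and recommendations are hypothetical examples, not the
# review's taxonomy.
SELECTION = {
    "tiny_n": ("rule-based augmentation (e.g., oversampling/perturbation)",
               "lower validation burden, more interpretable"),
    "missing_modalities": ("cross-modal generative imputation",
                           "needs paired-data validation"),
    "phenotype_heterogeneity": ("conditional deep generative models",
                                "highest validation burden"),
}

def recommend(pattern: str) -> str:
    """Return a technique suggestion annotated with its validation burden."""
    technique, burden = SELECTION[pattern]
    return f"{technique} [{burden}]"
```

A data lead would extend the table with local constraints (modality, cohort size, regulatory context); the point is only that technique choice and validation burden travel together.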

Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era

This tutorial-style preprint summarizes foundations and current practice for synthetic data generation using LLMs, diffusion models, and GANs, with a focus on data mining workflows. It frames synthetic data as a response to scarcity, privacy constraints, and labeling cost, and spends meaningful time on evaluation strategies and practical frameworks. The value here is less novelty and more consolidation: a shared vocabulary for teams mixing “classic” generators with GenAI-era models.

  • Evaluation is becoming the differentiator: teams will be judged on utility, privacy, and failure modes—not on having a generator.
  • Governance programs can use this as a checklist to standardize documentation (generation method, intended use, evaluation protocol).
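The documentation checklist in the second bullet could be captured as a minimal machine-readable record. A sketch with hypothetical field names, not a schema from the preprint:

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticDatasetRecord:
    """Minimal governance record for a synthetic dataset (illustrative fields)."""
    generation_method: str      # e.g., "diffusion", "GAN", "LLM prompting"
    intended_use: str           # e.g., "classifier pretraining only"
    evaluation_protocol: str    # e.g., "train-on-synthetic-test-on-real + privacy check"
    known_limitations: list = field(default_factory=list)

    def is_documented(self) -> bool:
        # Release-ready only if every core field is filled in.
        return all([self.generation_method, self.intended_use,
                    self.evaluation_protocol])
```

Even this small a record forces the three questions the preprint emphasizes: how it was generated, what it is for, and how it was evaluated.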

Valid Inference with Imperfect Synthetic Data

A research paper proposes a hyperparameter-free estimator based on generalized method of moments to support statistically valid inference when combining real data with imperfect synthetic data, including LLM-generated samples. The target use case is low-data regimes such as computational social science, where synthetic augmentation is tempting but can distort inference. The core claim is improved empirical performance by leveraging interactions between synthetic and real data residuals.

  • For analytics teams, this tackles the hardest question: not “does the model train?” but “are conclusions valid?”
  • For regulated domains, it points toward auditable, theory-backed use of synthetic augmentation in human-subjects-adjacent work.
  • For platform builders, inference-safe pipelines could become a product category alongside privacy-safe generation.
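To see why naive pooling distorts inference, and why residual-style corrections help, here is a toy simulation in the spirit of prediction-powered-style corrections. It is not the paper's GMM estimator, and all numbers are made up: a small real sample, a large synthetic sample with a systematic bias, and paired synthetic values on the real units.

```python
import random

random.seed(0)
TRUE_MEAN, BIAS = 1.0, 0.5

# Small real sample; large synthetic sample with systematic bias.
real = [random.gauss(TRUE_MEAN, 1.0) for _ in range(50)]
synthetic = [random.gauss(TRUE_MEAN + BIAS, 1.0) for _ in range(5000)]
# Idealized paired synthetic values for each real unit, sharing the same bias.
synthetic_on_real = [x + BIAS + random.gauss(0, 0.1) for x in real]

def mean(xs):
    return sum(xs) / len(xs)

# Naive pooling inherits the synthetic bias almost entirely.
naive_pooled = mean(real + synthetic)
# Correction: use the real-vs-synthetic residual on paired units to debias
# the large synthetic sample.
corrected = mean(synthetic) + mean(real) - mean(synthetic_on_real)
```

Here `corrected` lands near the true mean while `naive_pooled` sits near the biased synthetic mean; the paper's contribution is a hyperparameter-free, moment-based version of this idea with validity guarantees.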

Using generative AI to create synthetic data

Stanford Medicine describes RoentGen, an open model that generates realistic synthetic X-rays from text descriptions, aimed at filling gaps for rare diseases and uncommon conditions. The stated goal is to support training medical AI (e.g., pneumonia or cardiomegaly detection) while reducing reliance on real patient images. The work positions synthetic imaging as both a data-availability lever and a privacy-preserving tactic.

  • Clinical ML teams can use text-to-image generation to target underrepresented findings—if evaluation demonstrates clinical plausibility and label fidelity.
  • Privacy teams still need to assess memorization and re-identification risk, even with “open” models and synthetic outputs.
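One simple memorization screen is a nearest-neighbor distance check: flag synthetic samples that sit suspiciously close to a training item. The sketch below runs on toy feature vectors; the threshold is an assumption, and real imaging pipelines would compare learned embeddings rather than raw pixels.

```python
import math

def min_nn_distance(sample, training_set):
    """Smallest Euclidean distance from one synthetic sample to any training item."""
    return min(math.dist(sample, t) for t in training_set)

def flag_possible_copies(synthetic, training_set, threshold=0.1):
    """Return indices of synthetic samples suspiciously close to training data."""
    return [i for i, s in enumerate(synthetic)
            if min_nn_distance(s, training_set) < threshold]
```

A flagged sample is not proof of memorization, but it is the kind of documented, repeatable check a privacy review can ask for before synthetic images are released.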

Longitudinal Synthetic Data Generation by Artificial Intelligence to Address Data Fragmentation in Oncology

An ASCO Journals study explores AI-generated longitudinal synthetic data to address privacy concerns, fragmentation, and scarcity in oncology research. The premise is practical: longitudinal analysis often requires linking across institutions and time, which is where sharing constraints and incomplete records bite hardest. Synthetic longitudinal datasets are presented as a way to enable analysis without exposing real patient information.

  • For oncology data teams, longitudinal synthetic data could enable method development and hypothesis testing when real linkage is blocked.
  • For compliance, it reinforces synthetic data as a scalable alternative to broad data sharing under privacy regimes including GDPR—paired with defensible evaluation.