Synthetic data: rare disease methods, valid inference, and new medical imaging generators
Daily Brief

A rare disease scoping review and two arXiv papers push synthetic data practice toward clearer method selection, stronger evaluation, and statistically valid inference.

daily-brief, synthetic-data, data-augmentation, generative-ai, privacy, healthcare-ai

Synthetic data work is converging on two hard problems: picking the right generation method for thin, messy biomedical datasets, and proving downstream analyses remain valid when synthetic records are imperfect.

Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review

A PMC scoping review maps how rare disease teams use data augmentation and synthetic data generation to cope with small cohorts and heterogeneous phenotypes. The authors screened 2,864 candidate studies and highlight increased use of deep generative models since 2021, while also comparing rule-based approaches that trade performance for interpretability. A recurring theme is that synthetic outputs still require validation, including checks for biological plausibility, not just model metrics.

  • For data leads, the review functions as a method-selection checklist for data-scarce programs, not a one-size-fits-all endorsement of deep generators.
  • Compliance teams can use the emphasis on plausibility validation to push for documented QA gates before synthetic data is shared or used for clinical-adjacent modeling.
  • Founders selling “synthetic rare disease data” should expect buyers to ask which failure modes were tested (e.g., implausible phenotype combinations), not just privacy claims.
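The documented QA gates mentioned above can be made concrete as automated plausibility checks run before any synthetic release. A minimal sketch, with invented rule names and record fields (real gates would encode clinician-specified constraints, not these placeholders):

```python
# Hypothetical QA gate for synthetic rare-disease records (illustrative only).
# Field names and rules are invented placeholders, not from the review.

def plausibility_violations(record):
    """Return the names of plausibility rules this synthetic record violates."""
    violations = []
    # Example rule: onset cannot occur after the last recorded visit.
    if record["age_at_onset"] > record["age_at_last_visit"]:
        violations.append("onset_after_last_visit")
    # Example rule: an expert-flagged implausible phenotype combination.
    if "congenital" in record["phenotypes"] and record["age_at_onset"] > 18:
        violations.append("congenital_with_adult_onset")
    return violations

def qa_gate(records, max_violation_rate=0.01):
    """Block release if too many synthetic records fail plausibility rules."""
    failed = [r for r in records if plausibility_violations(r)]
    rate = len(failed) / len(records)
    return {"violation_rate": rate, "release_ok": rate <= max_violation_rate}

batch = [
    {"age_at_onset": 2, "age_at_last_visit": 10, "phenotypes": ["congenital"]},
    {"age_at_onset": 30, "age_at_last_visit": 25, "phenotypes": []},
]
report = qa_gate(batch)  # second record violates onset_after_last_visit
```

The design point is that the gate reports a batch-level violation rate rather than silently dropping bad records, so compliance reviewers see how often the generator produces implausible outputs.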

Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era

This arXiv tutorial surveys generative approaches—LLMs, diffusion models, and GANs—positioning synthetic data as a response to data scarcity, privacy constraints, and annotation bottlenecks in data mining. It focuses on practical methodology: frameworks for generation, evaluation strategies, and application patterns. For teams adopting GenAI-era pipelines, the paper underscores that evaluation is not optional; you need task- and risk-aligned metrics rather than “looks realistic” checks.

  • Engineering teams get a consolidated map of model families and when each tends to fit (text-heavy vs. tabular vs. image settings).
  • Product owners can translate “synthetic data” into concrete workflows: reduce labeling load, bootstrap cold-start models, and run privacy-aware experiments.
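One common task-aligned evaluation consistent with the tutorial's emphasis is train-on-synthetic, test-on-real (TSTR): fit a model on synthetic records and score it on held-out real data. A toy sketch with 1-D data and a tiny nearest-centroid classifier, both stand-ins for a real pipeline:

```python
# Train-on-synthetic, test-on-real (TSTR): a task-aligned check that goes
# beyond "looks realistic". Data and model here are toy placeholders.

def centroid_fit(xs, ys):
    """Fit per-class means (a tiny stand-in for a real model)."""
    classes = sorted(set(ys))
    return {c: sum(x for x, y in zip(xs, ys) if y == c) /
               sum(1 for y in ys if y == c) for c in classes}

def centroid_predict(model, xs):
    """Assign each point to the class with the nearest centroid."""
    return [min(model, key=lambda c: abs(x - model[c])) for x in xs]

def tstr_accuracy(synthetic, real_test):
    """Train on synthetic (x, y) pairs, report accuracy on real pairs."""
    sx, sy = zip(*synthetic)
    rx, ry = zip(*real_test)
    preds = centroid_predict(centroid_fit(sx, sy), rx)
    return sum(p == t for p, t in zip(preds, ry)) / len(ry)

synthetic = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.1, "b")]
real_test = [(0.0, "a"), (0.3, "a"), (1.0, "b")]
acc = tstr_accuracy(synthetic, real_test)
```

If TSTR accuracy is far below a train-on-real baseline, the synthetic data is missing task-relevant signal regardless of how realistic individual samples look.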

Valid Inference with Imperfect Synthetic Data

An arXiv paper targets a common reality: synthetic data from LLMs is imperfect, but teams still want statistically valid inference when mixing synthetic and real data (e.g., computational social science or survey augmentation). The authors propose a hyperparameter-free estimator based on generalized method of moments, leveraging interactions between synthetic and real-data residuals, and report empirical gains. The contribution is less about prettier samples and more about defensible conclusions under imperfect synthesis.

  • Data scientists can treat synthetic data as an inference component with guarantees, not just a training-data hack.
  • For privacy programs, “mixed real + synthetic” becomes more viable when the inference step is explicitly designed for synthetic imperfections.
  • Teams should still budget for sensitivity analyses: the method assumes you can characterize the mismatch between synthetic and real distributions.
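The paper's estimator is GMM-based and hyperparameter-free; as a simplified analogue (not the authors' method), a control-variate combination of a small real sample with a larger synthetic sample shows the general idea of using synthetic data to reduce variance while letting the real data anchor the estimate:

```python
# Simplified control-variate analogue (NOT the paper's GMM estimator).
# A small real sample plus paired and extra synthetic labels estimate the
# real mean; the paired synthetic mean cancels the generator's bias.

def cv_mean(real, synth_paired, synth_extra):
    """real: outcomes on n real units; synth_paired: synthetic outcomes for
    the same n units; synth_extra: synthetic outcomes on additional units."""
    n = len(real)
    m_real = sum(real) / n
    m_pair = sum(synth_paired) / n
    m_extra = sum(synth_extra) / len(synth_extra)
    # Weight chosen from how strongly synthetic tracks real on paired units.
    cov = sum((r - m_real) * (s - m_pair)
              for r, s in zip(real, synth_paired)) / n
    var = sum((s - m_pair) ** 2 for s in synth_paired) / n
    lam = cov / var if var > 0 else 0.0
    # (m_extra - m_pair) has mean zero when extra units are exchangeable
    # with paired ones, so the correction adds no bias.
    return m_real + lam * (m_extra - m_pair)
```

When the synthetic outcomes are a constant shift of the real ones, the weight is 1 and the estimator recovers the real mean exactly; when they are uninformative noise, the weight shrinks toward 0 and the estimator falls back to the real-data mean.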

Using generative AI to create synthetic data

Stanford Medicine describes RoentGen, an open model that generates synthetic X-rays from text prompts, aimed at filling gaps for rare diseases and uncommon conditions. The article positions it as a way to train imaging models while reducing bias and improving privacy, with example tasks including pneumonia and cardiomegaly detection. The key operational point: prompt-driven generation can target underrepresented findings, but teams must validate that generated artifacts don’t create shortcut features that models learn.

  • Imaging teams can prototype “data on demand” for scarce labels, but should run artifact audits and out-of-distribution checks.
  • Open availability raises governance questions: who is accountable for downstream clinical misuse versus research use.
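A lightweight first-pass artifact audit (a hypothetical sketch, not from the Stanford article) compares low-level statistics of synthetic and real images and flags generated batches whose intensity distributions drift from the real reference; real audits would add radiologist review and proper out-of-distribution detectors:

```python
# Hypothetical artifact audit: L1 distance between normalized pixel-intensity
# histograms of real vs. synthetic images. Threshold is a placeholder.

def histogram(pixels, bins=8, lo=0.0, hi=1.0):
    """Normalized intensity histogram over [lo, hi)."""
    counts = [0] * bins
    for p in pixels:
        i = min(int((p - lo) / (hi - lo) * bins), bins - 1)
        counts[i] += 1
    return [c / len(pixels) for c in counts]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def audit(real_pixels, synth_pixels, threshold=0.2):
    """Flag a synthetic batch whose intensity profile drifts from real data."""
    d = l1_distance(histogram(real_pixels), histogram(synth_pixels))
    return {"distance": d, "flagged": d > threshold}

real = [0.1] * 10 + [0.9] * 10   # bimodal reference intensities
drifted = [0.5] * 20             # synthetic batch collapsed to mid-gray
result = audit(real, drifted)    # flagged: distribution shifted
```

Statistics like these catch gross generator failures cheaply; subtle shortcut features (e.g., a consistent texture the generator stamps on one class) still need targeted probes on the downstream classifier.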

Longitudinal Synthetic Data Generation by Artificial Intelligence to Address Privacy, Fragmentation, and Data Scarcity in Oncology Research

An ASCO Journals study explores AI-generated longitudinal synthetic datasets to address privacy concerns, fragmentation across institutions, and scarcity in oncology research. The focus is on preserving realistic patient trajectories while protecting privacy, enabling analysis and collaboration where raw data sharing is blocked. For implementers, longitudinal synthesis raises additional constraints—temporal consistency and clinically plausible transitions—not just marginal feature realism.

  • Oncology collaborations can use synthetic longitudinal data to test hypotheses and pipelines before negotiating access to identifiable records.
  • Privacy and compliance leads should require trajectory-level plausibility and re-identification risk assessments, not just record-level summaries.