Synthetic data work is splitting into two tracks: practical generation for scarce clinical domains, and methods to make downstream conclusions defensible when synthetic data is imperfect. Today’s set spans rare disease evidence mapping, new inference theory, and medical imaging and longitudinal healthcare use cases.
Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review
A scoping review in rare disease research maps how data augmentation and synthetic data generation are being used, screening 2,864 candidate studies to identify techniques, purposes, and methodological challenges. The review frames synthetic data as a response to structural scarcity in rare disease cohorts, with some studies reporting dataset expansions of 10× or more. It also flags a recurring gap: rigorous validation that generated samples remain biologically plausible rather than merely "looking right" to a model.
- For ML teams, “more data” is not the same as “more signal”; the review reinforces the need for plausibility checks and domain expert review.
- For founders, rare disease is a high-need wedge where synthetic pipelines can be productized—if validation and auditability are first-class features.
- For compliance leads, the paper underscores that governance questions shift from identifiability to scientific validity and bias amplification.
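The plausibility checks the review calls for can be made concrete. As an illustrative sketch (the review does not prescribe a specific procedure, and these thresholds and features are hypothetical), a minimal screen might combine a range check against the real cohort with a near-duplicate check, before any domain expert review:

```python
# Minimal plausibility screen for synthetic tabular records (illustrative).
# Two crude tests: 1) each synthetic feature stays within the real cohort's
# observed min/max (a weak proxy for biological plausibility), and
# 2) no synthetic record is a near-copy of a real one (memorization check).
import math

def feature_ranges(real):
    cols = list(zip(*real))
    return [(min(c), max(c)) for c in cols]

def in_range(record, ranges, tol=0.0):
    return all(lo - tol <= x <= hi + tol for x, (lo, hi) in zip(record, ranges))

def min_distance(record, real):
    return min(math.dist(record, r) for r in real)

def screen(synthetic, real, copy_eps=1e-6):
    ranges = feature_ranges(real)
    return [s for s in synthetic
            if in_range(s, ranges) and min_distance(s, real) > copy_eps]

real = [(5.1, 120.0), (6.0, 135.0), (5.5, 128.0)]   # e.g. (lab value, systolic BP)
synthetic = [(5.6, 130.0),   # plausible -> kept
             (9.9, 300.0),   # outside observed ranges -> dropped
             (5.1, 120.0)]   # exact copy of a real record -> dropped
print(screen(synthetic, real))  # [(5.6, 130.0)]
```

A real pipeline would replace the min/max check with domain-informed reference ranges; the point is that both plausibility and memorization need explicit, automated gates, not just visual inspection.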
Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era
This arXiv tutorial surveys synthetic data generation with generative models, explicitly covering LLMs, diffusion models, and GANs, along with evaluation strategies. It frames synthetic data as a practical tool for data scarcity, privacy constraints, and annotation bottlenecks, and organizes methodologies and frameworks for practitioners who need repeatable pipelines rather than one-off demos.
- Useful as a shared baseline for teams arguing about “which model class” to use and how to evaluate outputs beyond eyeballing samples.
- Highlights that evaluation is part of the system design—quality, privacy, and utility trade-offs need measurable targets.
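One widely used utility metric of the kind such surveys cover is "train on synthetic, test on real" (TSTR), compared against a train-on-real baseline (TRTR). A toy sketch, with a tiny nearest-centroid classifier standing in for the downstream model and hypothetical data:

```python
# TSTR vs. TRTR: if a model trained only on synthetic data scores close to
# one trained on real data (both tested on held-out real data), the
# synthetic set has preserved usable signal. Illustrative sketch only.
import math
from collections import defaultdict

def fit_centroids(data):                      # data: list of ((x, y), label)
    sums, counts = defaultdict(lambda: [0.0, 0.0]), defaultdict(int)
    for (x, y), label in data:
        sums[label][0] += x; sums[label][1] += y; counts[label] += 1
    return {lb: (s[0] / counts[lb], s[1] / counts[lb]) for lb, s in sums.items()}

def accuracy(centroids, data):
    hits = sum(1 for x, label in data
               if min(centroids, key=lambda lb: math.dist(x, centroids[lb])) == label)
    return hits / len(data)

real_train = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 1.1), 1), ((0.9, 1.0), 1)]
real_test  = [((0.1, 0.0), 0), ((1.1, 0.9), 1)]
synthetic  = [((0.1, 0.2), 0), ((0.0, 0.0), 0), ((1.2, 1.0), 1), ((0.8, 1.2), 1)]

trtr = accuracy(fit_centroids(real_train), real_test)  # train real, test real
tstr = accuracy(fit_centroids(synthetic), real_test)   # train synthetic, test real
print(trtr, tstr)  # a small TRTR-TSTR gap suggests the synthetic data preserves signal
```

The design point: the gap between the two numbers is a measurable target, which is exactly the kind of evaluation-by-construction the tutorial argues for over eyeballing samples.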
Valid Inference with Imperfect Synthetic Data
Another arXiv paper targets a common failure mode: teams combine small real datasets with LLM-generated synthetic records, then run standard statistics as if the data were fully observed. The authors introduce a new estimator based on the generalized method of moments (GMM) to support statistically valid conclusions in limited-data regimes when synthetic data is imperfect. The work is positioned for applications like computational social science and human-subjects research, where inference validity is as important as model accuracy.
- Moves synthetic data governance from “did we leak PII?” to “are our conclusions statistically defensible?”—critical for regulated decisions.
- For data leads, it suggests a path to using synthetic augmentation without silently invalidating confidence intervals and hypothesis tests.
- Creates a clearer contract between generation and analysis: synthetic data quality assumptions must be made explicit.
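The paper's GMM estimator is more general than anything shown here, but the core intuition of "use scarce real data to correct an imperfect generator" can be sketched with a toy, control-variate-style mean estimate (the setup and numbers below are hypothetical, not the paper's method):

```python
# Toy illustration of correcting a biased synthetic sample with real data.
# We have a small real sample, synthetic values generated for those same
# units (so their gap estimates the generator's bias), and a larger
# synthetic-only sample. Subtracting the estimated bias re-centers the
# synthetic mean, so the generator's systematic error cancels in expectation.
def corrected_mean(real, synthetic_paired, synthetic_only):
    n = len(real)
    bias = sum(s - r for r, s in zip(real, synthetic_paired)) / n
    return sum(synthetic_only) / len(synthetic_only) - bias

real             = [1.0, 1.2, 0.9, 1.1]          # scarce gold-standard data
synthetic_paired = [1.5, 1.7, 1.4, 1.6]          # generator overshoots by ~0.5
synthetic_only   = [1.6, 1.4, 1.5, 1.7, 1.5, 1.3]
print(corrected_mean(real, synthetic_paired, synthetic_only))  # 1.0
```

The naive synthetic-only mean is 1.5; the corrected estimate recovers 1.0. The harder part, which is the paper's contribution, is doing this within a moment-conditions framework so that confidence intervals and hypothesis tests remain valid.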
Using generative AI to create synthetic data
Stanford Medicine reports on RoentGen, an open model from a team led by Curtis Langlotz and Akshay Chaudhari that generates synthetic X-rays from medical descriptions. The goal is to fill data gaps for rare diseases and uncommon conditions, where real imaging examples are limited and skewed. The piece emphasizes realism, bias reduction, privacy, and enabling more responsible medical imaging AI development.
- Imaging teams can use text-to-image generation to target specific underrepresented conditions, not just “augment everything.”
- Open models raise practical questions: provenance, intended use, and how hospitals validate outputs before model training.
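"Target specific underrepresented conditions" implies a planning step before any image is generated: deciding how many synthetic examples each condition needs. A minimal sketch, with hypothetical condition names and counts (the actual RoentGen workflow is described in the Stanford report, not here):

```python
# Targeted (rather than uniform) augmentation: compute a per-condition
# synthesis budget so the combined real + synthetic dataset approaches a
# target count per class, then turn the deficits into text prompts for a
# text-to-image model. Condition names and counts are hypothetical.
def synthesis_budget(real_counts, target_per_class):
    return {cond: max(0, target_per_class - n) for cond, n in real_counts.items()}

real_counts = {"pneumothorax": 4200, "cardiomegaly": 3100, "situs inversus": 12}
budget = synthesis_budget(real_counts, target_per_class=500)
print(budget)  # {'pneumothorax': 0, 'cardiomegaly': 0, 'situs inversus': 488}

# Each needed image becomes a text prompt for the generator:
prompts = [f"chest X-ray showing {cond}"
           for cond, k in budget.items() for _ in range(k)]
print(len(prompts), prompts[0])
```

Common classes get no synthetic quota at all, which is the point: generation effort goes where the real distribution is thinnest, rather than amplifying what is already abundant.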
Longitudinal Synthetic Data Generation by Artificial Intelligence to Address Privacy, Fragmentation, and Data Scarcity
In JCO Clinical Cancer Informatics (an ASCO journal), researchers explore AI-generated longitudinal synthetic data to address privacy concerns, fragmented clinical records, and scarcity in clinical research. The focus is on generating realistic time-series patient trajectories that preserve utility for analysis while maintaining patient privacy.
- Longitudinal generation is where many synthetic approaches break—temporal consistency matters for outcomes, progression, and treatment effects.
- For multi-site studies, synthetic longitudinal datasets can reduce sharing friction, but only if utility is demonstrated for the intended endpoints.
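Temporal consistency, the failure mode called out above, is also checkable. As an illustrative sketch (thresholds, units, and the trajectory format are hypothetical, not from the paper), a screen might require forward-moving visit times and bounded per-day change in a tracked value:

```python
# Temporal-consistency screen for synthetic patient trajectories.
# Each trajectory is a list of (day, lab_value) visits; we require strictly
# increasing visit times and a bounded rate of change between visits.
def is_consistent(traj, max_daily_change=5.0):
    for (t0, v0), (t1, v1) in zip(traj, traj[1:]):
        if t1 <= t0:                                     # time must move forward
            return False
        if abs(v1 - v0) / (t1 - t0) > max_daily_change:  # implausible jump
            return False
    return True

ok     = [(0, 10.0), (7, 12.0), (30, 18.0)]
jumpy  = [(0, 10.0), (1, 40.0)]    # 30-unit change in one day -> rejected
backwd = [(0, 10.0), (0, 11.0)]    # duplicate timestamp -> rejected
print([is_consistent(t) for t in (ok, jumpy, backwd)])  # [True, False, False]
```

Real trajectory validation would also need clinically grounded constraints (e.g., treatment before response, no events after recorded death), but even simple rate-of-change gates catch the degenerate sequences that per-record fidelity metrics miss.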
