Synthetic data work this week clusters around three practical needs: filling rare-disease and imaging gaps, making mixed real+synthetic inference statistically defensible, and generating longitudinal patient trajectories without exposing PHI.
Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review
A scoping review surveys how augmentation and synthetic generation are being used in rare disease research to counter small cohorts and heterogeneous phenotypes. The authors screened 2,864 candidate studies and highlight increased use of deep generative models since 2021, comparing them with simpler rule-based approaches that are easier to interpret but still require careful validation.
The throughline is that “more data” is not the same as “usable evidence”: biological plausibility checks and domain validation are repeatedly flagged as the gating factor for deployment in biomedical ML.
- Data teams get a selection framework: match technique to scarcity type (tiny n vs. missing modalities) and validation burden.
- Compliance leads can treat plausibility testing as a control, not an afterthought, when synthetic data is used for model development.
- Founders selling synthetic data into biomed should expect buyers to ask for validation protocols, not just privacy claims.
Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era
This arXiv tutorial consolidates foundations and recent advances in generative models for synthetic data in data mining, spanning LLMs, diffusion models, and GANs. It focuses on practical drivers (data scarcity, privacy constraints, and annotation cost) and outlines methodologies, evaluation strategies, and application patterns.
- Useful as a shared reference for teams standardizing evaluation (utility, privacy, and failure modes) across model families.
- Helps engineering leads compare “text-first” LLM approaches vs. classic tabular/image generators based on downstream workload.
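One widely used utility evaluation that a standardized protocol would likely include is Train-on-Synthetic, Test-on-Real (TSTR): fit a downstream model on synthetic samples only and score it on held-out real data. The sketch below is my own minimal NumPy illustration (a tiny logistic regression; all names and data are synthetic toys), not a protocol from the tutorial.

```python
import numpy as np

def tstr_accuracy(synth_X, synth_y, real_X, real_y, lr=0.1, steps=500):
    """Train-on-Synthetic, Test-on-Real: fit logistic regression by
    gradient descent on synthetic samples, score accuracy on real data."""
    w = np.zeros(synth_X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(synth_X @ w + b)))  # sigmoid predictions
        g = p - synth_y                                # logistic-loss gradient
        w -= lr * synth_X.T @ g / len(synth_y)
        b -= lr * g.mean()
    pred = (real_X @ w + b) > 0
    return (pred == real_y.astype(bool)).mean()

rng = np.random.default_rng(1)

def sample(n):
    """Two well-separated Gaussian classes in 2-D (toy data)."""
    X0 = rng.normal(-1.5, 1.0, size=(n, 2))
    X1 = rng.normal(1.5, 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.r_[np.zeros(n), np.ones(n)]

synth_X, synth_y = sample(200)  # stands in for generator output
real_X, real_y = sample(200)    # stands in for the held-out real set
acc = tstr_accuracy(synth_X, synth_y, real_X, real_y)
```

If the generator is faithful, TSTR accuracy should approach the accuracy of training on real data; a large gap is a concrete, model-family-agnostic failure signal.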
Valid Inference with Imperfect Synthetic Data
This paper targets a common reality: synthetic data is imperfect, especially when produced by LLMs, yet teams still want valid statistical inference when mixing synthetic and real samples. The authors propose a hyperparameter-free estimator based on generalized method of moments (GMM) and show empirical gains by leveraging interactions between synthetic and real-data residuals.
For applied groups (surveys, social science, product research), the contribution is less about generating better synthetic data and more about not over-claiming results when synthetic records are part of the pipeline.
- Gives practitioners a path to defensible inference in low-data regimes where synthetic augmentation is tempting but risky.
- Creates pressure for vendors to document how their synthetic generation affects downstream estimators, not only model accuracy.
- Supports privacy-preserving workflows where only limited real data can be accessed, but conclusions still need guarantees.
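To make the stakes concrete, here is a toy illustration of why naive pooling over-claims and how a moment-based correction helps. This is emphatically not the paper's GMM estimator; it is a simple debiasing in the same spirit (akin to prediction-powered inference), with all quantities and the generator-bias setup invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n_real, n_synth = 50, 5000
true_mean = 10.0
bias = 1.5  # the generator systematically overshoots (assumed, unknown to us)

real = rng.normal(true_mean, 2.0, n_real)                # small real sample
synth = rng.normal(true_mean + bias, 2.0, n_synth)       # large synthetic sample
# Synthetic counterparts for the real units, sharing the generator's bias:
synth_for_real = real + rng.normal(bias, 0.5, n_real)

naive = synth.mean()  # inherits the generator's bias (~11.5 here)
# Moment correction: estimate the bias from the real/synthetic residuals
# and subtract it from the large synthetic sample's mean.
debiased = synth.mean() + (real - synth_for_real).mean()
```

The naive estimate is off by roughly the generator bias no matter how many synthetic records are drawn; the residual-based correction recovers the true mean using only the small real sample. The paper's contribution is a general, hyperparameter-free way to get such corrections with valid confidence intervals.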
Using generative AI to create synthetic data
Stanford Medicine describes RoentGen, an open model that generates synthetic X-rays from text descriptions. The stated goal is to address data gaps in rare diseases and uncommon conditions, supporting training for imaging AI tasks (including examples like pneumonia or cardiomegaly detection), while reducing bias and improving privacy posture.
- Text-to-image generation can target specific edge cases (rare findings) that are underrepresented in hospital archives.
- Open models raise operational questions: governance for prompt libraries, provenance tracking, and clinical validation gates.
Longitudinal Synthetic Data Generation by Artificial Intelligence to Address Privacy, Fragmentation, and Data Scarcity in Oncology Research
This ASCO Journals study explores AI-generated longitudinal synthetic data for oncology research, aiming to mitigate privacy concerns, fragmentation, and limited access to complete patient histories. It emphasizes constructing realistic patient trajectories while preserving privacy, positioning synthetic longitudinal datasets as a collaboration layer when direct data sharing is constrained.
- Longitudinal realism is the hard part; teams should test whether temporal correlations and treatment sequences remain plausible.
- Enables multi-site studies when data use agreements block pooling, but still requires clear disclosure of synthetic provenance.
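The temporal-plausibility test suggested above can be made concrete with a simple statistic: compare the average lag-1 autocorrelation of synthetic trajectories against the real cohort. The sketch below is my own (AR(1) toy trajectories, not the study's data or method); a shuffled-visits cohort shows what "longitudinally implausible" looks like.

```python
import numpy as np

def lag1_autocorr(trajectories):
    """Mean lag-1 autocorrelation across trajectories
    (rows = patients, columns = ordered visits)."""
    x = trajectories - trajectories.mean(axis=1, keepdims=True)
    num = (x[:, :-1] * x[:, 1:]).sum(axis=1)
    den = (x ** 2).sum(axis=1)
    return (num / den).mean()

rng = np.random.default_rng(3)

def ar1_cohort(n_patients, n_visits, rho):
    """Toy longitudinal cohort: each patient's values follow an AR(1)."""
    t = rng.normal(size=(n_patients, n_visits))
    for j in range(1, n_visits):
        t[:, j] = rho * t[:, j - 1] + np.sqrt(1 - rho**2) * t[:, j]
    return t

real = ar1_cohort(200, 24, rho=0.8)        # strongly autocorrelated labs
shuffled = rng.permuted(real, axis=1)      # same marginals, temporal order destroyed

real_ac = lag1_autocorr(real)
shuffled_ac = lag1_autocorr(shuffled)
```

The shuffled cohort passes any marginal-distribution check yet has near-zero autocorrelation, which is exactly the failure mode longitudinal synthetic data must be tested for; treatment-sequence ordering deserves an analogous check.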
