Three product moves signal where synthetic data is landing in 2026: inside clean rooms for cross-org ML, inside DevOps platforms for test data at scale, and inside open-source pipelines for LLM dataset production. The common thread is operationalizing privacy claims with controls, metrics, and deployment models that fit regulated environments.
AWS Clean Rooms ML adds privacy-enhancing synthetic dataset generation
AWS announced a new synthetic dataset generation feature for AWS Clean Rooms ML aimed at training regression and classification models on sensitive, collaborative data without exposing individual records. The workflow generates synthetic datasets intended to preserve statistical patterns while reducing re-identification risk via configurable privacy thresholds and membership inference attack protections.
AWS says generation typically completes within hours and includes quality reporting that covers fidelity (including KL-divergence) and privacy scores, positioning the feature as both an ML-enablement tool and a compliance artifact for collaborative analytics use cases.
- Clean-room + synthetic is becoming the default pattern for joint modeling (marketing, fraud, medical research) where raw sharing is politically or legally blocked.
- Configurable thresholds and MIA defenses move privacy from “policy” to “system setting,” which is easier to audit and harder to bypass.
- Metrics like KL-divergence plus privacy scoring give data leads a way to standardize acceptance criteria (fidelity vs. risk) across partners and projects.
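Standardizing acceptance criteria around a fidelity metric like KL-divergence can be made concrete with a small gate. The sketch below is illustrative only, not AWS's actual quality-report computation: it compares per-column category histograms from a real and a synthetic dataset and fails the dataset when divergence exceeds a threshold (the `KL_THRESHOLD` value and the example counts are hypothetical).

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, smoothing=1e-9):
    """KL(P || Q) over a shared discrete support, with additive smoothing
    so a category missing from the synthetic data doesn't yield infinity."""
    support = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(support)
    q_total = sum(q_counts.values()) + smoothing * len(support)
    kl = 0.0
    for x in support:
        p = (p_counts.get(x, 0) + smoothing) / p_total
        q = (q_counts.get(x, 0) + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical acceptance gate for one categorical column.
real = Counter({"approved": 700, "declined": 250, "review": 50})
synth = Counter({"approved": 680, "declined": 270, "review": 50})

KL_THRESHOLD = 0.05  # assumed criterion agreed across partners
divergence = kl_divergence(real, synth)
assert divergence < KL_THRESHOLD, f"fidelity gate failed: KL={divergence:.4f}"
```

A gate like this gives partners a shared, auditable pass/fail definition instead of per-project judgment calls; in practice the threshold would be negotiated per column and paired with the privacy score.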
Perforce Delphix AI brings synthetic test data generation into the DevOps perimeter
Perforce Software announced Delphix AI, an embedded language model within the Delphix DevOps Data Platform that automatically generates synthetic data for development and testing environments. Perforce positions the model as trained on Delphix’s proprietary data privacy IP and designed to run entirely within an organization’s IT perimeter, including air-gapped, CPU-only deployments.
The pitch: produce enterprise-grade synthetic data that is customizable, referentially intact, and aligned to testing requirements—without external cloud connectivity or dedicated data science staffing.
- In-perimeter generation reduces exfiltration risk and shortens security review cycles for regulated teams that can’t ship data to third-party services.
- Embedding synthetic data generation into DevOps treats it as a repeatable delivery mechanism (like environments and builds), not a one-off privacy project.
- Referential integrity matters operationally: it’s the difference between “fake rows” and test data that actually exercises end-to-end application logic.
- Governance pressure is rising (Perforce cites Gartner: “60% of data and analytics leaders will face critical failures managing synthetic data by 2027”), making integrated controls more attractive than ad hoc scripts.
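The referential-integrity point can be sketched in a few lines. This is not Delphix's implementation; it is a minimal illustration of the property itself: child rows draw their foreign keys only from generated parent keys, so joins and cascades in the application under test behave as they would against production data (table and column names are hypothetical).

```python
import random

random.seed(7)  # deterministic test data for repeatable runs

def synth_customers(n):
    """Parent table: generated customer rows with synthetic attributes."""
    return [{"customer_id": i, "tier": random.choice(["basic", "pro"])}
            for i in range(1, n + 1)]

def synth_orders(customers, n):
    """Child table: every foreign key is drawn from the generated parents,
    never invented independently -- this is what keeps joins intact."""
    ids = [c["customer_id"] for c in customers]
    return [{"order_id": i, "customer_id": random.choice(ids),
             "amount_cents": random.randint(100, 50_000)}
            for i in range(1, n + 1)]

customers = synth_customers(50)
orders = synth_orders(customers, 500)

# Referential-integrity check: every order joins back to a real parent row.
parent_keys = {c["customer_id"] for c in customers}
assert all(o["customer_id"] in parent_keys for o in orders)
```

Rows generated column-by-column without this constraint produce orphaned foreign keys, and end-to-end tests that exercise joins, constraints, or cascading deletes fail for reasons unrelated to the code under test.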
Red Hat SDG Hub tees up modular, open-source synthetic pipelines for LLM datasets
Red Hat introduced SDG Hub, an open-source framework for building, composing, and scaling synthetic data pipelines for large language models. The toolkit focuses on modular building blocks for generation and filtering—supporting workflows from raw documents to structured data to instruction datasets—so teams can assemble pipelines rather than reinvent them.
Red Hat also points to integration with Red Hat OpenShift AI (tech preview) for running validated or custom pipelines at scale, with future updates planned around RAG evaluation and teacher-model comparisons.
- Open, composable pipelines can standardize synthetic-data operations (validation, monitoring, filtering) across teams that otherwise build bespoke LLM data factories.
- Production-readiness signals (async execution, Pydantic validation, monitoring) target the gap between research notebooks and governed data workflows.
- Platform integration shifts the buying center: synthetic LLM data becomes an MLOps/platform concern, not just an applied research task.
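The composable-block pattern can be illustrated without reproducing SDG Hub's actual API (which is not shown here). The sketch below, using only the standard library, chains generation stages (document chunking, a stubbed Q/A drafter standing in for a teacher model) with a filter stage into one pipeline from raw documents to instruction records; all function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Record:
    instruction: str
    response: str

def chunk_block(docs: Iterable[str]) -> List[str]:
    """Generation stage: split raw documents into candidate passages."""
    return [p.strip() for d in docs for p in d.split("\n\n") if p.strip()]

def qa_block(passages: Iterable[str]) -> List[Record]:
    """Generation stage: a real pipeline would call a teacher model here;
    this stub drafts instruction/response pairs from a template."""
    return [Record(instruction=f"Summarize: {p[:40]}", response=p)
            for p in passages]

def min_length_filter(records: Iterable[Record], n: int = 20) -> List[Record]:
    """Filter stage: drop records whose response is too short to be useful."""
    return [r for r in records if len(r.response) >= n]

def run_pipeline(docs, blocks):
    """Compose blocks by feeding each stage's output to the next."""
    data = docs
    for block in blocks:
        data = block(data)
    return data

dataset = run_pipeline(
    ["First paragraph with enough text to keep.\n\nok"],
    [chunk_block, qa_block, min_length_filter],
)
```

The value of the pattern is that each block is independently testable and swappable: a team can replace the stubbed drafter with a teacher-model call or add a deduplication filter without touching the rest of the pipeline, which is the operational standardization the bullet above describes.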
