OECD flags synthetic data privacy risks as California and Texas push AI disclosure and governance
Daily Brief · 4 min read


The OECD published guidance that highlights privacy and quality risks in synthetic data, including susceptibility to re-identification attacks and the possibility of "model collapse" when models are trained on their own generated outputs.

Tags: daily-brief, synthetic-data, privacy, differential-privacy, ai-governance, data-governance

Two OECD reports bookend the same message: synthetic data can reduce exposure to personal data, but it does not automatically eliminate privacy risk. Meanwhile, US states are moving toward disclosure and governance rules that will force teams to document training data and AI system controls with more rigor.

OECD Report Highlights Privacy Risks in Synthetic Data Generation

The OECD published a report on AI, data governance, and privacy that explicitly calls out synthetic data as a growing input to AI training—and a growing source of misunderstood risk. The report highlights that synthetic datasets can remain susceptible to re-identification attacks, and it also raises the possibility of “model collapse” over time when models are trained on their own generated outputs.

For teams using synthetic data to sidestep privacy constraints, the OECD framing is clear: synthetic is a technique, not a guarantee. Risk depends on how data is generated, what it preserves, and what adversaries can infer from releases or downstream models.

  • Re-identification risk means you still need privacy testing (e.g., membership inference and linkage-style evaluations), not just a “synthetic” label in the data catalog.
  • “Model collapse” risk is a governance issue: you may need provenance controls to prevent synthetic-on-synthetic feedback loops in long-lived pipelines.
  • Procurement and compliance teams should treat synthetic data generators as high-impact components that require documentation and review, not as a blanket exemption from privacy obligations.
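One linkage-style evaluation the first bullet alludes to is a distance-to-closest-record (DCR) check: measure how close each synthetic record sits to its nearest real record, and flag verbatim or near-verbatim copies. The sketch below is a minimal illustration on toy two-column records; the datasets, features, and the zero-distance threshold are all invented for the example, and a production check would use domain-appropriate distances and near-match thresholds.

```python
# Sketch: a linkage-style privacy check for synthetic data using
# distance-to-closest-record (DCR). Datasets and thresholds here are
# illustrative assumptions, not a method prescribed by the OECD report.
import math
import random

def dcr(synthetic, real):
    """For each synthetic record, Euclidean distance to its nearest real record."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

random.seed(0)
# Toy "real" records (e.g., normalized age, income) plus a synthetic set
# that copies five real rows verbatim -- the failure mode this check flags.
real = [(random.random(), random.random()) for _ in range(100)]
synthetic = [(random.random(), random.random()) for _ in range(95)] + real[:5]

distances = dcr(synthetic, real)
copied = sum(1 for d in distances if d == 0.0)  # exact memorization
print(f"exact copies found: {copied}")  # -> 5
```

In practice the interesting cases are near-duplicates rather than exact copies, so teams typically compare the DCR distribution of the synthetic set against a real holdout set rather than counting zeros.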

California's AB 2013 Requires Generative AI Developers to Disclose Training Data

California Assembly Bill 2013 (AB 2013) requires developers of generative AI systems to publicly disclose information about the data used to train their models. Per the bill summary, the requirement takes effect January 1, 2026.

Even before enforcement, the operational impact is immediate: disclosure obligations push organizations toward cleaner training-data inventories, clearer licensing positions, and repeatable reporting workflows—especially where training mixes first-party data, third-party data, and synthetic data.

  • Training-data transparency requirements can turn “we don’t know what’s in the corpus” into a legal and reputational liability.
  • Synthetic data won’t automatically simplify disclosures; teams may still need to explain source datasets, generation methods, and intended use.
  • Data governance tooling (lineage, dataset registries, license metadata) becomes a compliance dependency, not an engineering nice-to-have.
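The dataset-registry dependency above can be made concrete with a minimal registry entry. The schema below is hypothetical, written only to show the kind of fields (source, license, generation method) a disclosure workflow needs to capture; AB 2013 defines its own disclosure items, not this dataclass.

```python
# Sketch: a minimal training-dataset registry entry capturing fields that
# disclosure rules tend to require. Field names and records are illustrative
# assumptions, not the statutory disclosure schema.
from dataclasses import dataclass, asdict

@dataclass
class DatasetRecord:
    name: str
    source: str                    # "first-party" | "third-party" | "synthetic"
    license: str                   # license or terms governing use
    collected: str                 # date range the data covers
    generation_method: str = ""    # required context for synthetic sets
    contains_personal_data: bool = False

registry = [
    DatasetRecord("support-tickets-2023", "first-party", "internal",
                  "2023-01..2023-12", contains_personal_data=True),
    DatasetRecord("synth-faq-pairs-v2", "synthetic", "internal", "2024-06",
                  generation_method="LLM paraphrase of support-tickets-2023"),
]

# With a registry in place, a disclosure export becomes a mechanical transform.
disclosure = [asdict(r) for r in registry]
print(len(disclosure))  # -> 2
```

Note that the synthetic entry still points back to its source dataset: a "synthetic" label alone does not answer the disclosure question of where the underlying data came from.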

Texas Enacts Responsible AI Governance Act to Regulate AI Systems

Texas’ Responsible Artificial Intelligence Governance Act (TRAIGA) establishes a state-level framework for regulating AI systems. The summary describes prohibitions on certain harmful uses and the creation of a Texas Artificial Intelligence Council.

For organizations deploying AI in regulated or consumer-facing contexts, TRAIGA signals continued fragmentation: multiple states are defining their own governance expectations, which will affect model documentation, risk controls, and acceptable-use boundaries—regardless of whether training data is real, de-identified, or synthetic.

  • State-level governance frameworks can require policy and control mapping across jurisdictions, increasing the cost of “one model, many markets.”
  • Prohibitions on harmful uses elevate the importance of downstream guardrails (monitoring, red-teaming, abuse prevention) beyond data privacy alone.
  • Internal AI councils and review boards may need to broaden scope to include synthetic data generation as part of system-level risk assessment.
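The cross-jurisdiction mapping in the first bullet can be sketched as a simple coverage check: compare the controls a model actually has against per-state requirement sets and surface the gaps. The requirement names and state mappings below are invented for illustration; real obligations come from the statutes themselves.

```python
# Sketch: gap analysis of one model's controls against per-state requirement
# sets. Requirement labels and state assignments are illustrative assumptions.
REQUIREMENTS = {
    "CA": {"training-data-disclosure"},
    "TX": {"prohibited-use-review", "ai-council-reporting"},
}

model_controls = {"training-data-disclosure", "prohibited-use-review"}

gaps = {state: reqs - model_controls for state, reqs in REQUIREMENTS.items()}
gaps = {state: missing for state, missing in gaps.items() if missing}
print(gaps)  # -> {'TX': {'ai-council-reporting'}}
```

Even this toy version shows why "one model, many markets" gets expensive: every new jurisdiction adds a requirement set that must be mapped, implemented, and re-checked on each model release.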

OECD Explores Synthetic Data with Differential Privacy for AI Testing

In a separate OECD report on sharing trustworthy AI models with privacy-enhancing technologies, the organization discusses generating synthetic data using AI models with differential privacy. One example described is creating artificial facial images for testing, aiming to simulate different ethnic groups while avoiding the use of real-world personal data.

The key technical point is the pairing: synthetic data generation plus differential privacy is presented as a way to reduce exposure to identifiable individuals while still supporting evaluation needs such as coverage across demographic attributes. It positions synthetic data as part of a broader PET stack rather than a standalone solution.

  • Differential privacy can provide a more formal privacy layer than ad hoc “anonymization” claims, but it requires careful parameterization and validation.
  • Using synthetic faces for testing highlights a pragmatic use case: evaluation and QA where representative diversity matters, but collecting real images is high-risk.
  • Expect more scrutiny on whether synthetic datasets actually preserve the properties needed for fairness and performance testing—without leaking sensitive traits.
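The differential-privacy half of the pairing can be illustrated with the Laplace mechanism: release a noisy demographic histogram (which a generator could then be conditioned on) instead of exact counts. The epsilon value, group names, and counts below are illustrative assumptions, and the OECD report does not prescribe this particular construction.

```python
# Sketch: releasing a demographic histogram under differential privacy via
# the Laplace mechanism. Epsilon and the counts are illustrative assumptions.
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_histogram(counts, epsilon, rng):
    # Each individual changes one bin by at most 1, so the L1 sensitivity
    # is 1 and the per-bin noise scale is 1/epsilon.
    scale = 1.0 / epsilon
    return {k: max(0.0, v + laplace_noise(scale, rng)) for k, v in counts.items()}

rng = random.Random(42)
true_counts = {"group_a": 120, "group_b": 85, "group_c": 40}
noisy = dp_histogram(true_counts, epsilon=1.0, rng=rng)
for group in true_counts:
    print(group, round(noisy[group], 1))
```

This is where the "careful parameterization" caveat bites: a smaller epsilon means stronger privacy but noisier counts, so the released histogram may no longer support the fairness and coverage analyses the synthetic data was meant to enable.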