Two new arXiv papers pull synthetic-data teams in opposite but complementary directions: one argues that generative releases can still leak training-set information through distributional structure, while the other shows how encrypted training workflows may reduce trust requirements for data holders. Together, they shift the conversation from "is it synthetic?" to "what are the actual attack surfaces and controls?"
When Privacy Isn't Synthetic: Hidden Data Leakage in Generative AI Models
An arXiv paper warns that synthetic data generated by modern models can still expose information about the underlying training set, even when the release is positioned as privacy-preserving. The authors describe a black-box membership inference attack that requires neither model internals nor access to the real dataset. Instead, an attacker repeatedly queries the generator, clusters the synthetic outputs, and treats cluster medoids and their neighborhoods as proxies for high-density regions of the original data manifold.
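To make the attack surface concrete, here is a minimal sketch of the query-and-cluster idea: sample heavily from a black-box generator, then score candidate records by how dense the synthetic distribution is around them. The generator stand-in, the `membership_score` heuristic, and all parameters below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic(n):
    # Stand-in for repeated black-box generator queries: a mixture whose
    # dense mode leaks the neighborhood of a training record near (2, 2).
    member_mode = rng.normal(loc=2.0, scale=0.1, size=(n // 2, 2))
    background = rng.normal(loc=0.0, scale=1.0, size=(n - n // 2, 2))
    return np.vstack([member_mode, background])

def membership_score(candidate, synthetic, k=25):
    # Density proxy: mean distance from the candidate to its k nearest
    # synthetic samples. A smaller distance means a denser synthetic
    # neighborhood, which the attack reads as higher membership likelihood.
    d = np.linalg.norm(synthetic - candidate, axis=1)
    return -np.sort(d)[:k].mean()

synthetic = sample_synthetic(2000)
in_record = np.array([2.0, 2.0])    # sits inside the dense mode
out_record = np.array([-3.0, 3.0])  # far from any synthetic mass

# The record near the dense synthetic region scores higher.
assert membership_score(in_record, synthetic) > membership_score(out_record, synthetic)
```

The point of the sketch is that nothing here touches the model or the real data: density in the synthetic output alone is the signal, which is why record-level memorization tests can miss it.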
Across healthcare, finance, and other sensitive domains, the paper reports measurable membership leakage driven by overlap between real and synthetic data distributions. The authors say this persists even when the generator is trained with differential privacy or other noise mechanisms, and argue that current protections may focus too narrowly on sample memorization rather than neighborhood-level distributional inference.
- Privacy reviews for synthetic data releases may need to test black-box querying and cluster-based inference, not just direct record memorization.
- Differential privacy alone may not close every practical leakage path if downstream attackers can exploit structural overlap in the learned distribution.
- Teams publishing synthetic data in regulated domains should revisit release criteria, red-team methods, and claims made to customers or regulators.
FHAIM: Fully Homomorphic AIM for Private Synthetic Data Generation
A second arXiv paper tackles a different weak point in synthetic data pipelines: the need to trust a service provider with raw sensitive data during training. The authors introduce FHAIM, which they describe as the first fully homomorphic encryption framework for training a marginal-based synthetic data generator on encrypted tabular data. The system adapts the AIM algorithm to run in an FHE setting, with the private data remaining encrypted throughout training.
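The core intuition behind encrypted marginal-based generation is that marginals are just counts, and counts are sums of indicator values, so a homomorphically additive scheme lets a server aggregate them without ever seeing plaintext records. The toy below uses a deliberately tiny Paillier instance (additively homomorphic only; FHAIM relies on fully homomorphic encryption and the actual AIM algorithm, neither of which is reproduced here), with insecure parameters chosen purely for readability.

```python
import math
import random

random.seed(0)

# Insecure toy parameters; real deployments use ~2048-bit moduli.
p, q = 1009, 1013
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # valid because we fix the generator g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add_encrypted(c1, c2):
    # Homomorphic addition: multiplying ciphertexts adds plaintexts.
    return (c1 * c2) % n2

# Data holder: encrypt a per-record indicator for a binary column.
column = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
enc_ones = [encrypt(v) for v in column]

# Untrusted server: aggregate the marginal count while it stays encrypted.
enc_count = encrypt(0)
for c in enc_ones:
    enc_count = add_encrypted(enc_count, c)

count = decrypt(enc_count)
print(count)  # 7, i.e. sum(column)
```

In FHAIM's setting the released statistics would additionally carry calibrated noise so that outputs satisfy differential privacy; the toy above omits that release step and shows only the encrypted-aggregation layer.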
The paper positions FHAIM as a way to support synthetic data generation in sectors where data is locked behind privacy and regulatory constraints, including healthcare, education, and finance. According to the authors, FHAIM preserves AIM's performance while keeping runtimes feasible, and releases outputs only with differential privacy guarantees. In practice, that makes it a notable attempt to reduce both disclosure risk and the trust burden in synthetic-data-generation-as-a-service (SDG-as-a-service) workflows.
- Encrypted training could make synthetic-data-as-a-service more viable for organizations that cannot hand over plaintext data to vendors.
- The combination of FHE during computation and differential privacy at release points toward a more layered control model for sensitive tabular data.
- Feasible runtime claims matter: privacy-preserving architectures only change procurement decisions if they are operationally usable.
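The "differential privacy at release" layer in that control model is, at its simplest, noise calibrated to query sensitivity. The sketch below is a generic Laplace-mechanism illustration, not AIM's actual mechanism (AIM selects and measures marginals adaptively, typically with Gaussian noise under concentrated-DP accounting); the function names and parameters are assumptions for illustration.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def release_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1, so Laplace noise with
    # scale 1/epsilon makes this single release epsilon-DP.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
noisy = release_count(500, 1.0, rng)  # 500 plus unit-scale Laplace noise
```

Layering this on top of encrypted computation is exactly the point the bullets above make: encryption controls who sees the data during training, while the noise at release controls what the published outputs can reveal.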
