OpenMined’s updated guidance argues HIPAA can permit generating synthetic datasets from protected health information—so long as the synthesis workflow is controlled and the resulting data has low re-identification risk. For healthcare data teams, that reframes synthetic data as a practical path to model development and data sharing, but it also concentrates compliance obligations in the pipeline.
HIPAA footing for synthetic data gets clearer—controls and re-ID testing become the linchpin
On Nov. 5, 2025, OpenMined updated its guidance to state that HIPAA allows organizations to use PHI to create synthetic datasets when appropriate safeguards are in place. The guidance emphasizes that once data is properly synthesized and re-identification risk is low, the synthetic output can be treated as HIPAA-exempt, which changes how teams approach secondary use, collaboration, and vendor sharing.
The practical takeaway is less about “synthetic is automatically safe” and more about process: teams need to document how synthesis is performed, apply privacy-preserving techniques, and validate that the output does not create meaningful re-identification risk. The guidance also implies a shift in compliance burden upstream: while the synthetic dataset may be shareable, the environment that touches PHI to generate it must still operate under HIPAA expectations (e.g., access controls and audit logging).
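To make the validation step concrete, here is a minimal sketch of one simple re-identification check. It flags synthetic rows whose quasi-identifier combination matches exactly one record in the source data. The column names (`zip3`, `birth_year`, `sex`) and the function names are hypothetical, and the guidance does not prescribe this or any particular test; a real program would likely combine several metrics and set acceptance thresholds with privacy and legal review.

```python
from collections import Counter

# Hypothetical quasi-identifier columns, chosen only for illustration.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def qi_key(record, columns=QUASI_IDENTIFIERS):
    """Return the tuple of quasi-identifier values for one record (a dict)."""
    return tuple(record[c] for c in columns)


def reid_risk_report(source_records, synthetic_records, columns=QUASI_IDENTIFIERS):
    """Count synthetic rows that match a source record that is unique on `columns`.

    This is a heuristic signal, not a legal determination of low risk.
    """
    source_counts = Counter(qi_key(r, columns) for r in source_records)
    # A synthetic row is flagged when exactly one source record shares
    # its quasi-identifier combination.
    flagged = [
        r for r in synthetic_records
        if source_counts.get(qi_key(r, columns), 0) == 1
    ]
    total = len(synthetic_records)
    return {
        "synthetic_rows": total,
        "rows_matching_source_unique": len(flagged),
        "match_rate": len(flagged) / total if total else 0.0,
    }
```

A team might run `reid_risk_report` inside the controlled environment before release and store the resulting report alongside the synthesis documentation, so the evidence for the low-risk claim travels with the dataset.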
- Unblocks model development and collaboration: Data and ML teams can train models on synthetic datasets and share them on a clearer HIPAA footing, reducing legal friction for internal experimentation and external partnerships.
- Moves risk to engineering decisions: The “HIPAA-exempt” outcome hinges on defensible re-ID risk reduction, which means privacy engineering (testing, documentation, and controls) becomes part of the deliverable—not an afterthought.
- Creates demand for compliant synthesis infrastructure: Because PHI is still handled during generation, organizations may need HIPAA-grade pipelines (role-based access, audit logs, controlled environments) even if the final dataset is broadly shareable.
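The pipeline controls named in the last bullet can be sketched in a few lines. The example below is an illustration under stated assumptions, not a compliant implementation: the role names, permission strings, and the `authorize` and `audit_event` helpers are all hypothetical, and a production system would rely on the organization's identity provider and tamper-resistant log storage.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS = {
    "synthesis_engineer": {"read_phi", "run_synthesis"},
    "ml_researcher": {"read_synthetic"},
}

# Attach a handler (file, syslog, etc.) and set the level to INFO in real
# use; without that configuration these records are not written anywhere.
audit_log = logging.getLogger("phi_audit")


def audit_event(user, role, action, allowed):
    """Build a structured audit record and send it to the audit logger."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "allowed": allowed,
    }
    audit_log.info(json.dumps(event))
    return event


def authorize(user, role, action):
    """Role-based check; every attempt is logged, whether allowed or denied."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_event(user, role, action, allowed)
    return allowed
```

Logging denied attempts as well as granted ones is a deliberate choice here: the audit trail then shows who tried to reach PHI, not only who succeeded.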
