Synthetic data is increasingly used to train AI—especially in healthcare—but experts warn that many projects lack the validation and ethical guardrails needed to protect privacy and ensure reliable results.
Nature: Synthetic data adoption is outpacing validation and ethical review
Nature reports that synthetic data are being used more frequently to train AI systems, with healthcare a major driver because real-world clinical data can be scarce, sensitive, and hard to share. These datasets are generated algorithmically to mimic statistical patterns in real data and are already being applied in areas like medical imaging, including models for X-ray interpretation.
The article flags two gaps: (1) weak or inconsistent validation of models trained on synthetic data against real-world benchmarks, and (2) uneven ethical oversight. It notes that some institutions are waiving ethical review processes typically required for research involving human data, raising concerns about whether individuals whose information contributed to the original data are adequately protected. Zisis Kozlakidis of the World Health Organization emphasizes the need to validate AI trained on synthetic data and to disclose generation methods and assumptions so others can independently assess the work. The piece also highlights the risk of "model collapse," in which models trained repeatedly on machine-generated data drift away from the real distribution and degrade over successive generations, particularly when synthetic data are used without careful checks.
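The model-collapse risk can be made concrete with a toy simulation. This is a minimal sketch, not anything from the article: it assumes a simple Gaussian "dataset," fits a generator (here just mean and standard deviation) to it, samples a new synthetic dataset from the fit, and repeats. Each refit loses a little information, so spread tends to erode over generations.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy run is reproducible

def fit_and_resample(data, n):
    """Fit a Gaussian 'generator' to data, then sample n synthetic points."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

n = 100
real = [random.gauss(0.0, 1.0) for _ in range(n)]  # toy "real" data
initial_sd = statistics.pstdev(real)

# Recursively train each generation's "model" on the previous
# generation's synthetic output, with no fresh real data.
synthetic = real
for _ in range(2000):
    synthetic = fit_and_resample(synthetic, n)

final_sd = statistics.pstdev(synthetic)
# Each refit underestimates spread slightly on average, so over many
# generations the synthetic distribution narrows: final_sd < initial_sd.
```

The shrinkage per generation is tiny, which is the point: the degradation is invisible without the kind of periodic real-world checks the article calls for.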
- Validation has to be operational, not rhetorical. If synthetic data are used because real data are hard to access, teams still need a plan for real-world benchmarking (even if limited) to detect drift and failure modes before deployment.
- Transparency is becoming a technical requirement. Disclosing how synthetic data were generated (methods, assumptions, constraints) enables independent validation and makes it easier for downstream users to judge fitness-for-purpose.
- “Synthetic” doesn’t automatically mean “no human subjects risk.” If ethical review is waived, privacy and compliance teams should push for explicit re-identification risk controls and documented governance, not informal assurances.
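The first takeaway—make validation operational—is often implemented as a "train on synthetic, test on real" (TSTR) check. The sketch below is a hypothetical illustration on a toy linear task, not a method from the article: the slightly mis-specified synthetic generator (slope 1.8 instead of 2.0) stands in for a fidelity gap, and the comparison against a train-on-real baseline is what surfaces it.

```python
import random
import statistics

random.seed(1)  # reproducible toy example

def make_data(n, slope=2.0, intercept=1.0, noise=0.5):
    """Generate a toy 1-D regression dataset y = slope*x + intercept + noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + random.gauss(0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Closed-form simple linear regression (least squares)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    b = cov / var
    return my - b * mx, b

def mse(xs, ys, a, b):
    return statistics.fmean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Held-out "real" benchmark, plus a synthetic training set whose
# generator is deliberately biased (assumed fidelity gap).
real_x, real_y = make_data(200)
synth_x, synth_y = make_data(200, slope=1.8)

a_s, b_s = fit_line(synth_x, synth_y)  # train on synthetic
a_r, b_r = fit_line(real_x, real_y)    # train on real (baseline)

tstr_mse = mse(real_x, real_y, a_s, b_s)  # synthetic-trained, real-tested
trtr_mse = mse(real_x, real_y, a_r, b_r)  # real-trained baseline
gap = tstr_mse - trtr_mse  # the gap a TSTR check is meant to surface
```

Even a limited real-data benchmark like this makes the synthetic generator's bias measurable, and disclosing the generator's assumptions (here, the slope) is exactly what lets a downstream reviewer interpret that gap.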
