Differential Privacy — Definition and Explanation
Differential privacy provides a mathematical bound on how much any individual's record can influence published datasets or model outputs. Learn the definition, the epsilon parameter, DP-SGD, and applications to AI.
Differential privacy is a mathematical privacy framework that limits the information any single individual's data contributes to a published dataset or model output, providing a formal, quantifiable privacy guarantee.
Differential privacy (DP) is a mathematical framework for measuring and limiting privacy risk in data publishing and machine learning. Formalized by Dwork, McSherry, Nissim, and Smith in 2006, it is now the standard for formal privacy guarantees in published datasets and AI systems.
A mechanism M is ε-differentially private if, for any two datasets D and D' differing in a single record, and any set of outputs S, the probability ratio Pr[M(D) ∈ S] / Pr[M(D') ∈ S] is bounded by e^ε. In plain terms: no single individual's data can shift the probability of any output by more than a factor of e^ε.
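The classic mechanism satisfying this definition is Laplace noise addition: a count query changes by at most 1 when one record is added or removed (sensitivity 1), so adding Laplace noise with scale 1/ε yields an ε-DP answer. A minimal sketch (the function names `laplace_noise` and `dp_count` are illustrative, not from any particular library):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(data, predicate, epsilon: float) -> float:
    """ε-DP count query: a count has sensitivity 1, so noise scale is 1/ε."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 31, 45, 52, 38, 29, 61, 47]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Each released answer is randomized, so repeated queries consume additional privacy budget; in practice the total ε across all queries is what the guarantee covers.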
Smaller values of ε provide stronger privacy but typically reduce the utility of the output — this is the fundamental DP privacy-utility trade-off. Values of ε between 0.1 and 10 are common in practice, with context determining acceptable bounds.
DP-SGD and Neural Network Training
DP-SGD (Differentially Private Stochastic Gradient Descent), introduced by Abadi et al. (2016), applies differential privacy to neural network training. It clips per-sample gradients to bound sensitivity, then adds calibrated Gaussian noise before each parameter update. DP-SGD is used to train models on sensitive data while providing a formal ε-differential privacy guarantee over the training set.
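The two mechanics — per-sample clipping and calibrated Gaussian noise — can be sketched for a single update step. This is a toy version on a linear model with squared loss, not the production algorithm (frameworks such as Opacus or TensorFlow Privacy handle the details, including the privacy accounting); `dp_sgd_step` and its parameter names are illustrative:

```python
import math
import random

def dp_sgd_step(weights, batch, clip_norm, noise_mult, lr):
    """One DP-SGD update for a linear model with squared loss.

    Each per-example gradient is clipped to L2 norm <= clip_norm, the
    clipped gradients are summed, Gaussian noise with std
    noise_mult * clip_norm is added, and the result is averaged —
    following the structure of Abadi et al. (2016).
    """
    d = len(weights)
    grad_sum = [0.0] * d
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        g = [2 * (pred - y) * xi for xi in x]  # per-example gradient
        norm = math.sqrt(sum(gi * gi for gi in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        grad_sum = [s + gi * scale for s, gi in zip(grad_sum, g)]
    sigma = noise_mult * clip_norm
    noisy = [(s + random.gauss(0.0, sigma)) / len(batch) for s in grad_sum]
    return [w - lr * g for w, g in zip(weights, noisy)]
```

Clipping bounds each example's contribution (the sensitivity), which is what lets the added Gaussian noise be calibrated to a formal (ε, δ) guarantee accumulated over training steps by a privacy accountant.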
Differential Privacy and Synthetic Data
Synthetic data generation can be combined with differential privacy to produce datasets with formal privacy guarantees. PATE-GAN (Jordon et al., 2019) is a foundational approach combining GAN-based generation with DP. However, DP-constrained synthetic data generation typically introduces a utility cost — higher fidelity synthetic data generally requires larger ε budgets.
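PATE-GAN itself is too involved to sketch briefly, but the underlying idea — generate new records from a DP-protected summary of the data rather than from the raw data — can be illustrated with a much simpler noisy-histogram generator (the function `dp_synthetic` is a hypothetical illustration, not PATE-GAN):

```python
import math
import random
from collections import Counter

def laplace(scale: float) -> float:
    """Sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_synthetic(records, epsilon, n_samples):
    """Toy ε-DP synthetic data for a categorical column.

    Adding/removing one record changes one histogram bin by 1
    (sensitivity 1), so Laplace(1/ε) noise on each count makes the
    histogram ε-DP; synthetic records are then resampled from it.
    """
    counts = Counter(records)
    noisy = {k: max(0.0, c + laplace(1.0 / epsilon)) for k, c in counts.items()}
    keys, weights = zip(*noisy.items())
    return random.choices(keys, weights=weights, k=n_samples)

records = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
synthetic = dp_synthetic(records, epsilon=1.0, n_samples=100)
```

The utility cost is visible even here: at small ε the noise can swamp rare categories entirely, which is the low-dimensional analogue of the fidelity loss DP-constrained generative models face.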