Gartner expects synthetic customer data generation to shift from niche to default, with 75% of businesses using generative AI for it by 2026, up from under 5% in 2023. If synthetic data becomes the training substrate for customer-facing agents, governance and measurement move from “nice-to-have” to operational controls.
Gartner: 75% of Businesses Will Use GenAI for Synthetic Customer Data by 2026
Gartner forecasts that by 2026, 75% of businesses will use generative AI to create synthetic customer data—up from less than 5% in 2023. The NVIDIA write-up frames this as a practical response to two constraints that data teams hit quickly: limited “real” customer data in low-resource settings, and the inability to freely share proprietary or sensitive datasets across teams, vendors, and development environments.
The same piece positions synthetic data as increasingly central for training “agentic AI” systems: AI agents expected to operate across workflows, tools, and customer interactions. The implication is that synthetic data won’t just be a privacy workaround; it will become a core input for model development and evaluation wherever real customer data is scarce, restricted, or too risky to move.
- Governance has to cover synthetic, not just real data. If synthetic customer data becomes widely used, teams need explicit policies for provenance, labeling (synthetic vs. real), acceptable use, and retention—otherwise synthetic datasets will quietly enter analytics and model pipelines without controls.
- Quality and bias become measurable requirements, not assumptions. Scaling synthetic generation increases the chance of distribution drift, missing edge cases, and amplified bias; data leads should define validation gates (utility metrics, slice-based performance, and bias checks) before synthetic data is allowed into training or testing.
- Transparency matters for audits and downstream consumers. Compliance and risk teams will need documentation that explains how synthetic customer data was produced, what real data (if any) was used to condition it, and how privacy risks were assessed—especially when agents are trained on or evaluated with synthetic records.
- Agentic systems raise the bar on “representativeness.” Agents tend to fail in long-tail operational scenarios, so synthetic data programs should prioritize scenario coverage and adversarial/edge-case generation, not just row-level realism.
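The labeling and provenance point above can be made concrete with a small metadata sketch. This is a hypothetical schema, not anything from the Gartner forecast or the NVIDIA write-up; the field names (`generator`, `conditioned_on`, and so on) are assumptions about what an audit trail might record.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class SyntheticProvenance:
    """Hypothetical provenance record for a synthetic dataset (illustrative fields)."""
    dataset_id: str
    generator: str        # model or tool that produced the records
    conditioned_on: str   # real dataset (if any) used to condition generation
    is_synthetic: bool = True
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def tag_records(records, provenance):
    """Attach provenance metadata to each record so synthetic rows stay
    labeled as synthetic in downstream analytics and training pipelines."""
    meta = asdict(provenance)
    return [{**record, "_provenance": meta} for record in records]

prov = SyntheticProvenance("cust-v1", "gen-model-x", "crm_sample_2024")
tagged = tag_records([{"id": 1}, {"id": 2}], prov)
```

The design choice here is that the label travels with the data itself rather than living in a separate catalog, which makes it harder for synthetic records to quietly enter pipelines unmarked.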
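A validation gate of the kind the quality bullet describes can be sketched with a per-column drift check. The threshold and the use of a two-sample Kolmogorov–Smirnov statistic are assumptions for illustration; real gates would typically also cover slice-based model performance and bias metrics.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    def ecdf(xs, v):
        return bisect.bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in a + b)

def validate_synthetic(real, synthetic, max_drift=0.10):
    """Gate a synthetic batch: every column's drift vs. the real data
    must stay under max_drift (an assumed policy value, not a standard).

    real / synthetic: dicts mapping column name -> list of numeric values.
    Returns (passed, per-column drift report)."""
    report = {col: ks_statistic(real[col], synthetic[col]) for col in real}
    return all(d <= max_drift for d in report.values()), report

# Toy check: a faithful synthetic column passes, a shifted one is rejected.
random.seed(0)
real = {"spend": [random.gauss(100, 15) for _ in range(1000)]}
good = {"spend": [random.gauss(100, 15) for _ in range(1000)]}
bad  = {"spend": [random.gauss(140, 15) for _ in range(1000)]}
ok_good, _ = validate_synthetic(real, good)
ok_bad, _ = validate_synthetic(real, bad)
```

Running the gate before admitting a batch into training or testing turns "quality" from an assumption into a pass/fail decision with an auditable report.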
