Gartner Predicts 80% of AI Training Data Will Be Synthetic by 2028
Daily Brief


Gartner projects a steep shift toward synthetic training corpora over the next three years as teams hit an “AI data wall” of scarce, ethically usable real-world data. For builders, the takeaway is operational: synthetic generation and validation are becoming core parts of the training-data stack, not a niche privacy workaround.

Gartner projects 80% synthetic AI training data by 2028

A Pure Storage analysis citing Gartner research says synthetic data will represent 80% of AI training data by 2028, up from roughly 5% five years earlier (reported Nov. 10, 2025). The driver is what Microsoft Research has described as the AI “data wall”: organizations can’t reliably source enough high-quality, ethically obtained real-world data to keep scaling model development.

The Pure Storage post argues synthetic data is increasingly positioned as a performance and cost lever, not just a compliance tactic. It references MIT's Data to AI Lab, claiming models trained with synthetic datasets can improve accuracy by up to 3 percentage points versus real data alone (example: 60% vs 57%). It also points to adoption examples including J.P. Morgan using synthetic data in fraud detection, Waymo simulating 20 billion miles of driving scenarios daily, and healthcare teams training diagnostic systems on synthetic patient records to reduce HIPAA exposure.

  • Data engineering becomes the bottleneck: If synthetic data becomes the dominant training input, teams need high-throughput generation pipelines, dataset versioning, and quality gates that behave more like “data manufacturing” than traditional ETL.
  • Validation moves to the center: More synthetic data increases the need for measurable utility checks (task performance), bias monitoring, and leakage/memorization testing—especially when synthetic records are derived from sensitive sources.
  • Privacy posture can improve—if controls are real: Synthetic data can reduce direct handling of regulated data, but only when generation methods, access controls, and auditability are strong enough to satisfy internal risk teams and external regulators.
  • Competitive differentiation shifts: If competitors can generate targeted synthetic edge cases faster (rare fraud patterns, long-tail driving scenes, underrepresented patient cohorts), model quality and iteration speed may increasingly track synthetic capability, not raw data access.
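The quality-gate and validation ideas above can be sketched as a small pre-training check. This is a minimal illustration, assuming simple tabular records; the function names, thresholds, and metrics (exact-match leakage, total variation distance on one field) are illustrative choices, not from any specific tool — production pipelines would add near-duplicate detection, task-level utility evaluation, and bias monitoring.

```python
# Minimal sketch of a synthetic-data quality gate for tabular records.
# Assumption: records are flat dicts; thresholds are placeholder values.
from collections import Counter

def leakage_rate(synthetic, real):
    """Fraction of synthetic records that exactly duplicate a real record
    (a crude memorization check; real pipelines also test near-matches)."""
    real_set = {tuple(sorted(r.items())) for r in real}
    hits = sum(tuple(sorted(s.items())) in real_set for s in synthetic)
    return hits / len(synthetic)

def marginal_distance(synthetic, real, field):
    """Total variation distance between synthetic and real marginal
    distributions of one field -- a cheap fidelity proxy."""
    def dist(rows):
        counts = Counter(r[field] for r in rows)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    p, q = dist(synthetic), dist(real)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

def quality_gate(synthetic, real, field, max_leak=0.01, max_tvd=0.2):
    """Pass/fail gate run on each generation batch before it enters training."""
    return (leakage_rate(synthetic, real) <= max_leak
            and marginal_distance(synthetic, real, field) <= max_tvd)

# Toy usage: synthetic batch matches the real marginal and copies no record.
real = [{"amount": "low", "label": 0}, {"amount": "high", "label": 1}]
synthetic = [{"amount": "low", "label": 1}, {"amount": "high", "label": 0}]
print(quality_gate(synthetic, real, "amount"))  # prints True
```

Treating checks like these as hard gates, versioned alongside the datasets they approve, is what makes the "data manufacturing" framing above operational rather than aspirational.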