SynthLLM scales synthetic tokens; WEF calls for synthetic data governance
Daily Brief · 3 min read

daily-brief · synthetic-data · llm-training · data-governance · privacy · provenance

Microsoft Research says synthetic data can deliver predictable LLM gains at massive scale, while the World Economic Forum warns that synthetic data’s growing role in AI makes governance, provenance, and traceability non-optional.

SynthLLM: Breaking the AI "data wall" with scalable synthetic data

Microsoft Research Asia introduced SynthLLM, a system designed to generate synthetic training data at scale from pretraining corpora. The team reports that synthetic data follows "rectified scaling laws" for LLMs, which makes performance improvements more predictable as synthetic token counts increase.

In the reported results, performance gains remain predictable up to 300 billion synthetic tokens. Microsoft positions the approach as applicable in data-constrained or sensitive domains—including healthcare, autonomous driving, and education—where access to sufficient real-world data can be limited by privacy, cost, or operational constraints.

  • Planning and budgeting: If scaling behavior is predictable up to 300B tokens, teams can treat synthetic token generation as an engineering lever with measurable ROI, not an ad-hoc augmentation step.
  • Privacy and access constraints: Generating synthetic data from pretraining corpora offers a path to expand training material when direct use of real-world datasets is restricted—especially relevant for healthcare workflows.
  • Evaluation pressure: “Predictable gains” raises the bar for internal benchmarking: data teams will need tight eval suites to verify improvements and catch regressions or domain drift as synthetic mixes change.
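To make the "engineering lever" framing concrete, here is a minimal sketch of how a team might fit a scaling curve to its own measurements and extrapolate to a planned token budget. This assumes a simple power-law form `loss ≈ a * tokens**(-b)` and uses made-up illustrative numbers; it is not SynthLLM's published formula.

```python
import numpy as np

# Illustrative (invented) measurements: synthetic tokens trained on
# (in billions) vs. validation loss at that budget.
tokens = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
loss = np.array([2.80, 2.55, 2.40, 2.28, 2.19])

# Assume a power law loss ≈ a * tokens**(-b). Taking logs gives a
# straight line, so an ordinary least-squares fit recovers a and b.
slope, intercept = np.polyfit(np.log(tokens), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate to a hypothetical 300B-token budget.
predicted_loss = a * 300.0 ** (-b)
print(f"fitted exponent b = {b:.3f}")
print(f"predicted loss at 300B synthetic tokens = {predicted_loss:.3f}")
```

A fit like this is only as good as the assumption that the curve keeps holding past the measured range, which is exactly why the evaluation-pressure point above matters: extrapolations should be checked against held-out runs, not trusted on their own.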

Artificial intelligence and the growth of synthetic data

The World Economic Forum argues synthetic data is moving beyond a niche privacy technique into a broader AI innovation driver—used to fill data gaps and enable new scenarios. At the same time, the piece notes that synthetic data can blur the line between synthetic and real data, increasing the need for stronger controls as adoption accelerates.

The WEF brief urges business leaders to implement robust governance, traceability, and provenance systems for synthetic data. The emphasis is less on whether synthetic data will be used, and more on whether organizations can prove where it came from, how it was generated, and how it should be handled across the AI lifecycle.

  • Governance becomes infrastructure: As synthetic data is treated like a core asset (not a one-off privacy workaround), teams need durable policies and tooling for provenance, access control, and lifecycle management.
  • Compliance and auditability: Traceability and provenance help demonstrate privacy protection and regulatory alignment when synthetic datasets are used in model training and downstream decisioning.
  • Risk management: If synthetic data becomes harder to distinguish from real data, organizations need clear labeling/handling rules to prevent misuse, misrepresentation, or inappropriate sharing.
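As a concrete illustration of the provenance-and-labeling idea above, here is a minimal sketch of what a machine-readable provenance record for a synthetic dataset might look like. The field names, values, and hashing scheme are assumptions for illustration, not a published standard.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SyntheticDataProvenance:
    """Hypothetical traceability record attached to a synthetic dataset."""
    dataset_id: str
    generator_model: str   # model that produced the synthetic records
    source_corpus: str     # real-world corpus the generation drew on
    generation_date: str   # ISO 8601 date
    is_synthetic: bool     # explicit label keeping real/synthetic distinct
    handling_policy: str   # e.g. "internal-training-only"

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the record, usable in an audit trail."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

# Example record (all values illustrative).
record = SyntheticDataProvenance(
    dataset_id="synth-clinical-notes-v1",
    generator_model="example-llm-7b",
    source_corpus="de-identified-notes-2024",
    generation_date="2025-01-15",
    is_synthetic=True,
    handling_policy="internal-training-only",
)
print(record.fingerprint())
```

Keeping records like this alongside each dataset is one way to answer the questions the WEF brief raises: where the data came from, how it was generated, and how it may be handled across the AI lifecycle.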