A new arXiv paper proposes SWiRL, a step-wise reinforcement learning approach that uses synthetic, multi-step reasoning trajectories to improve LLM performance on hard reasoning benchmarks. The reported gains suggest a path to better accuracy with less dependence on expensive, human-labeled chain-of-thought data.
SWiRL: step-wise RL + synthetic trajectories for multi-step reasoning
An arXiv paper (Nov 10, 2025) introduces Step-Wise Reinforcement Learning (SWiRL), which combines synthetic data generation with reinforcement learning to improve LLM performance on multi-step reasoning tasks. Instead of treating an entire solution as one monolithic sequence to optimize, SWiRL iteratively generates reasoning trajectories and breaks them into sub-trajectories, enabling step-wise RL optimization.
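To make the decomposition concrete, here is a minimal sketch of splitting one trajectory into per-step training examples. All names (`Step`, `decompose`) are illustrative assumptions; the paper's exact data format and RL objective are not reproduced here.

```python
# Hypothetical sketch of SWiRL-style trajectory decomposition.
# Each sub-trajectory pairs the growing context with the next step,
# so an RL objective can score every step, not only the final answer.
from dataclasses import dataclass

@dataclass
class Step:
    """One reasoning step: the action taken and the context the model saw."""
    context: str   # prompt plus all prior steps
    action: str    # the step the model generated (thought, query, or answer)

def decompose(prompt: str, steps: list[str]) -> list[Step]:
    """Split a full trajectory into one training example per step."""
    examples = []
    context = prompt
    for action in steps:
        examples.append(Step(context=context, action=action))
        context = context + "\n" + action
    return examples

trajectory = ["Search: largest US state", "Observation: Alaska", "Answer: Alaska"]
subs = decompose("Q: What is the largest US state?", trajectory)
# Three sub-trajectories; each later one carries the full prior context.
```

The point of the decomposition is that each intermediate step becomes its own optimization target, rather than having the whole solution judged as one sequence.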
In the reported experiments, SWiRL achieved accuracy gains of +11.1% to +21.5% on GSM8K and HotPotQA relative to baseline models. The paper also reports cross-task transfer: training on HotPotQA produced a 16.9% improvement in zero-shot GSM8K performance, indicating that the learned reasoning improvements can generalize beyond the training task.
- Lower labeling burden for reasoning-heavy domains: Data teams can generate synthetic reasoning trajectories to support RL fine-tuning, reducing reliance on costly expert annotations for multi-step tasks.
- Reusable assets across tasks: The reported HotPotQA→GSM8K transfer suggests synthetic reasoning datasets (and the training recipe around them) may be reusable across related reasoning workloads, improving ROI on data generation.
- Privacy posture can improve alongside accuracy: Synthetic trajectories can reduce exposure to sensitive source data when tuning domain models, which is especially relevant for regulated environments where raw examples are hard to share or retain.
- Engineering implication: Step-wise optimization implies more granular reward design and evaluation; teams adopting SWiRL should plan for instrumentation that can score intermediate steps, not just final answers.
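The instrumentation point above can be sketched as a per-step scoring hook. Everything here is a hypothetical stand-in (the `step_judge` and `final_check` callables are toy placeholders, not SWiRL's actual reward model):

```python
# Illustrative sketch of step-wise reward instrumentation: emit one reward
# per intermediate step plus a final-answer check, instead of a single
# end-of-episode score. All interfaces here are assumptions.
from typing import Callable

def score_trajectory(
    steps: list[str],
    step_judge: Callable[[str], float],   # scores one intermediate step (e.g. a learned reward model)
    final_check: Callable[[str], bool],   # verifies the final answer (e.g. exact match)
) -> list[float]:
    """Return one reward per step, not just a single final score."""
    rewards = [step_judge(s) for s in steps[:-1]]
    rewards.append(1.0 if final_check(steps[-1]) else 0.0)
    return rewards

# Toy usage: reward intermediate steps that gather evidence, and the
# final step only if it matches the gold answer.
steps = ["Search: capital of France", "Observation: Paris", "Answer: Paris"]
rewards = score_trajectory(
    steps,
    step_judge=lambda s: 1.0 if s.startswith(("Search", "Observation")) else 0.0,
    final_check=lambda s: s == "Answer: Paris",
)
# rewards == [1.0, 1.0, 1.0]
```

In practice the step judge would be a learned model or rubric, but the engineering takeaway is the shape of the interface: evaluation pipelines need to accept and log a reward vector per trajectory, not a single scalar.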
Reported benchmark lift: +11.1% to +21.5% accuracy on GSM8K and HotPotQA (arXiv, Nov 10, 2025).
