A new arXiv paper proposes SWiRL, a step-wise reinforcement learning approach that uses synthetic, multi-step reasoning trajectories to improve LLM performance on hard reasoning benchmarks. The reported gains suggest a path to better accuracy with less dependence on expensive, human-labeled chain-of-thought data.
SWiRL: step-wise RL + synthetic trajectories for multi-step reasoning
An arXiv paper (Nov 10, 2025) introduces Step-Wise Reinforcement Learning (SWiRL), which combines synthetic data generation with reinforcement learning to improve LLM performance on multi-step reasoning tasks. Instead of treating an entire solution as one monolithic sequence to optimize, SWiRL iteratively generates reasoning trajectories and breaks them into sub-trajectories, enabling step-wise RL optimization.
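To make the decomposition concrete, here is a minimal sketch of splitting one trajectory into per-step training examples. All names (`Step`, `decompose`) are illustrative assumptions; the paper's exact data format and RL objective are not reproduced here.

```python
# Hypothetical sketch of SWiRL-style trajectory decomposition.
# Each sub-trajectory pairs the growing context with the next step,
# so an RL objective can score every step, not only the final answer.
from dataclasses import dataclass

@dataclass
class Step:
    """One reasoning step: the action taken and the context the model saw."""
    context: str   # prompt plus all prior steps
    action: str    # the step the model generated (thought, query, or answer)

def decompose(prompt: str, steps: list[str]) -> list[Step]:
    """Split a full trajectory into one training example per step."""
    examples = []
    context = prompt
    for action in steps:
        examples.append(Step(context=context, action=action))
        context = context + "\n" + action
    return examples

trajectory = ["Search: largest US state", "Observation: Alaska", "Answer: Alaska"]
subs = decompose("Q: What is the largest US state?", trajectory)
# Three sub-trajectories; each later one carries the full prior context.
```

The point of the decomposition is that each intermediate step becomes its own optimization target, rather than having the whole solution judged as one sequence.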
In the reported experiments, SWiRL achieved accuracy gains of +11.1% to +21.5% on GSM8K and HotPotQA relative to baseline models. The paper also reports cross-task transfer: training on HotPotQA produced a 16.9% improvement in zero-shot GSM8K performance, indicating that the learned reasoning improvements can generalize beyond the training task.
- Lower labeling burden for reasoning-heavy domains: Data teams can generate synthetic reasoning trajectories to support RL fine-tuning, reducing reliance on costly expert annotations for multi-step tasks.
- Reusable assets across tasks: The reported HotPotQA→GSM8K transfer suggests synthetic reasoning datasets (and the training recipe around them) may be reusable across related reasoning workloads, improving ROI on data generation.
- Privacy posture can improve alongside accuracy: Synthetic trajectories can reduce exposure to sensitive source data when tuning domain models, which is especially relevant for regulated environments where raw examples are hard to share or retain.
- Engineering implication: Step-wise optimization implies more granular reward design and evaluation; teams adopting SWiRL should plan for instrumentation that can score intermediate steps, not just final answers.
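The instrumentation point above can be sketched as a per-step scoring hook. Everything here is a hypothetical stand-in (the `step_judge` and `final_check` callables are toy placeholders, not SWiRL's actual reward model):

```python
# Illustrative sketch of step-wise reward instrumentation: emit one reward
# per intermediate step plus a final-answer check, instead of a single
# end-of-episode score. All interfaces here are assumptions.
from typing import Callable

def score_trajectory(
    steps: list[str],
    step_judge: Callable[[str], float],   # scores one intermediate step (e.g. a learned reward model)
    final_check: Callable[[str], bool],   # verifies the final answer (e.g. exact match)
) -> list[float]:
    """Return one reward per step, not just a single final score."""
    rewards = [step_judge(s) for s in steps[:-1]]
    rewards.append(1.0 if final_check(steps[-1]) else 0.0)
    return rewards

# Toy usage: reward intermediate steps that gather evidence, and the
# final step only if it matches the gold answer.
steps = ["Search: capital of France", "Observation: Paris", "Answer: Paris"]
rewards = score_trajectory(
    steps,
    step_judge=lambda s: 1.0 if s.startswith(("Search", "Observation")) else 0.0,
    final_check=lambda s: s == "Answer: Paris",
)
# rewards == [1.0, 1.0, 1.0]
```

In practice the step judge would be a learned model or rubric, but the engineering takeaway is the shape of the interface: evaluation pipelines need to accept and log a reward vector per trajectory, not a single scalar.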
Reported benchmark lift: +11.1% to +21.5% accuracy on GSM8K and HotPotQA (arXiv, Nov 10, 2025).
