← All Posts

Teaching LLMs to Plan Before They Act

If you have ever watched a language model reason its way through a hard math problem, you have probably seen it wander. The chain of thought starts off promising, circles back on itself, re-derives something it already knew, and occasionally talks itself out of a correct intermediate result. The final answer may still be right, but the path there is long, redundant, and hard to trust.

Our ICML 2026 paper, Plan Then Action, starts from a simple diagnosis of why this happens: autoregressive generation is local. At every step the model decides only what token comes next, so the reasoning process is essentially a sequence of small, greedy decisions. There is no global plan — nothing that commits the model to a strategy before it starts executing one. Tree search and reinforcement learning can partially compensate, but they are expensive and still operate over the same token-level process.

The key idea

The approach we propose, PTA-GRPO (Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization), separates deciding what to do from doing it — and trains both. It works in two stages. First, we use stronger LLMs to distill long chains of thought into compact, high-level guidance — a short statement of the strategy, not the steps — and fine-tune the model to produce that guidance before its detailed reasoning. Second, we apply a guidance-aware reinforcement learning method that jointly optimizes both the final answer and the quality of the high-level plan itself, so the plan is not decorative: it is rewarded for actually steering the reasoning well.

The results were consistent in a way that RL-for-reasoning results often are not. Across MATH, AIME 2024, AIME 2025, and AMC, and across base models from LLaMA3.2-3B up to Qwen3-14B, planning first yields stable improvements — the gains do not depend on one lucky model or benchmark.

Some reflection

What I find satisfying about this result is how familiar it feels from software engineering. We never tell students to start typing code and hope a design emerges; design comes first, then implementation. It turns out the same discipline helps a model: a small amount of explicit structure, stated before the work begins, prevents a lot of wandering during the work.

There is also a responsible-AI angle that motivates us beyond the benchmark numbers. A plan is an artifact you can read. When a model commits to its strategy up front, you get a compact, inspectable statement of intent — which is far easier to audit than ten thousand tokens of stream-of-consciousness. We think this kind of structured reasoning is a step toward models whose behavior can be anticipated, not just observed. That theme — making AI systems inspectable by construction — runs through much of what the lab does.

This work was a broad collaboration led by Zhihao Dou, with our M.S. student Towsif Raiyan among the contributors. The paper will be presented at ICML 2026 in Seoul this July.

Pointers

← All Posts