The State of AI Post-Training Agents

We re-ran FrogsGame with a new generation of frontier models — Claude Fable 5, Opus 4.8, and GPT-5.5 — to see how far AI post-training agents have come.

MERSAD ABBASI · JUNE 2026

In our previous report, What We Learned from Letting AI PostTrain AI, we studied how frontier models perform on FrogsGame: a long-horizon post-training task where agents are asked to improve a fixed base model, Qwen3-8B, to solve a puzzle.

We are now updating that analysis with a new generation of frontier models, including Claude Fable 5, Opus 4.8, and GPT-5.5.

Why we care

We believe every person, company, and civilization should be able to craft AI around its own judgment: how it behaves, what it values, and how it improves. We call this modelcrafting. Most people should not have to do gnarly training loops, evals, data mixtures, or reinforcement learning to get AI that reflects their world. Their time should go back to what only humans can do, like building trust and deepening the relationships that give a business its soul.

Our goal is to teach AI to craft better models and improvement loops. This is the version of responsible recursive self-improvement we believe in: not AI improving itself in isolation, but AI helping every person and organization shape the intelligence they rely on.

Background

The task gives agents broad freedom while limiting reward hacks: they can generate their own training data, call stronger models through the Tinker API, write helper scripts, and decide how to combine supervised fine-tuning, reinforcement learning, and curriculum learning.

In the previous generation of runs, most models found the right high-level plan but failed in the details. They often generated low-quality SFT traces from the weak base model itself, then trained on those traces, amplifying noisy reasoning instead of teaching the underlying algorithm. Our conclusion was that the bottleneck was not planning but research intuition: knowing when data is too weak to trust, when to use a stronger model or deterministic solver, when an eval is misleading, and when correctness matters more than scale.

Quick update on Claude Opus 4.8 and GPT-5.5

Two generations later, many of those issues have improved. Opus 4.8 and GPT-5.5 are better than their predecessors at avoiding obvious training failures, using curriculum learning, and producing checkpoints that improve over the base model. But there is still a large gap between "having the right research plan" and "executing it well enough to reliably post-train another model."

3x improvement over previous models. Claude improved roughly 3x, reaching 30.9% average pass@4, while GPT-5.5 reached 9.7% pass@4. Overall, the models still only modestly outperform the baseline, but several runs produced checkpoints that genuinely improved performance.
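For readers unfamiliar with the metric, pass@k is usually estimated as sketched below. This post does not say how its pass@4 numbers were computed, so the snippet assumes the standard unbiased estimator; the function names are illustrative, not taken from the FrogsGame harness.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over boards; each item is (attempts, correct)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With exactly 4 attempts per board, pass@4 reduces to "was any of the 4 attempts correct", averaged over boards.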

No overconfident wrong answers. In almost all of the failed attempts, the agents acknowledged that the answer they were about to submit was wrong but, given the time constraint, submitted it anyway.

No working understanding of time. Giving agents more time did not reliably improve performance: GPT models in particular often gave up early and failed to use their full time budget.

Algorithm choice. Claude and GPT followed sharply different strategies. Claude used GRPO in all 8 trials, with no SFT-only runs and no cases of the earlier "SFT on weak-base traces" failure mode. This is a meaningful improvement over previous models: every Claude agent avoided the most obvious earlier mistake, starting with naive SFT on traces generated by the weak base model, and instead allocated training to RL from the beginning. GPT, by contrast, used an SFT-first strategy in all 8 trials, training on data from the base model. Its RL phases were short, ranging from 0 to 4 steps, and 2 of 8 trials skipped RL entirely.

Reward design. Binary reward was the correct solution, and 3 runs, all from Opus 4.8, implemented it correctly. Dense partial credit, length-shaping coefficients, and other reward-shaping schemes did not help; in some cases, they degraded performance.
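To make the combination of GRPO and a binary reward concrete, here is a minimal sketch. It shows only the group-relative advantage step at the core of GRPO, not a full training loop, and it is not the agents' code. The `verify` callable is a hypothetical stand-in for a board checker, since the FrogsGame verifier is hidden.

```python
from statistics import mean, pstdev
from typing import Callable, List


def binary_reward(board: str, answer: str,
                  verify: Callable[[str, str], bool]) -> float:
    """1.0 if the (hypothetical) verifier accepts the answer, else 0.0.
    There is no partial credit and no length shaping."""
    return 1.0 if verify(board, answer) else 0.0


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the other rollouts
    sampled for the same prompt: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage with a made-up verifier: the "solution" is the reversed board.
toy_verify = lambda board, answer: answer == board[::-1]
rewards = [binary_reward("abc", a, toy_verify)
           for a in ["cba", "abc", "xyz", "cba"]]
advantages = group_relative_advantages(rewards)
```

One property worth noting: if every rollout in a group receives the same reward (all solved or all failed), every advantage is zero and that prompt contributes no learning signal. That is one plausible reason board difficulty and curriculum matter so much under a binary reward.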

Curriculum. Almost all trials used some form of curriculum learning, which was not true in earlier generations. The two trials that did not use a curriculum finished near the bottom.

Models often implemented complicated curriculum strategies, but frontier-based curricula and multi-phase curricula performed best. More elaborate methods tended to suffer from small-sample noise. Adaptive curricula are elegant in theory, but in practice they often fell into cold-start traps when early solve-rate signals were too noisy to guide training reliably.
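As an illustration, a frontier-based curriculum can be sketched in a few lines. This assumes "frontier-based" means training on boards near the edge of the model's current ability, judged by measured solve rate; the thresholds and the `min_attempts` guard are hypothetical values, not numbers from the runs.

```python
from typing import Dict, List, Tuple


def frontier_boards(
    stats: Dict[str, Tuple[int, int]],
    low: float = 0.1,
    high: float = 0.9,
    min_attempts: int = 8,
) -> Tuple[List[str], List[str]]:
    """Split boards into (frontier, undersampled).

    stats maps board_id -> (solved, attempts). A board is on the
    frontier when its solve rate lies in [low, high]: neither
    hopeless nor already mastered. Boards with too few attempts are
    set aside rather than classified, because their solve rate is
    too noisy to trust.
    """
    frontier, undersampled = [], []
    for board_id, (solved, attempts) in stats.items():
        if attempts < min_attempts:
            undersampled.append(board_id)
            continue
        rate = solved / attempts
        if low <= rate <= high:
            frontier.append(board_id)
    return frontier, undersampled
```

The `undersampled` bucket is one simple guard against the cold-start trap described above: a board seen twice and failed twice is not yet evidence that the board is too hard.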

Evaluation remains the biggest failure point. Models still make obvious mistakes: they use small samples, overinterpret noisy results, rarely cross-validate their findings, and often build bloated internal evals that overemphasize easy boards. In many cases, the bottleneck was not the training algorithm itself, but the agent's inability to measure whether its training run was actually improving the model.
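The small-sample problem is easy to quantify. The sketch below computes a Wilson score interval for a measured solve rate; it is a standard statistical tool, offered here as an illustration rather than something the agents or the FrogsGame verifier are known to use.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion
    (z = 1.96). Returns (lower, upper)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


small = wilson_interval(6, 20)       # 30% measured on 20 boards
large = wilson_interval(600, 2000)   # 30% measured on 2,000 boards
```

On 20 boards, a measured 30% is compatible with a true solve rate anywhere from roughly 15% to 52%, so a checkpoint that appears a few points better on such an eval may not be better at all.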

Let's talk about Claude Fable 5

Fable 5 solves the FrogsGame task by fixing the single biggest failure mode we identified in our previous report: overfitting to low-quality SFT traces. We ran more than 100 rollouts across different frontier models and generations. Fable 5 was the only model that discovered a reliable way to generate high-quality SFT traces programmatically across the full range of board difficulties.

Instead of generating traces from the weak base model, it generates deterministically correct traces by narrating the outputs of a backtracking algorithm, trains the base model on those traces, and then follows with RL to hill-climb task performance, all within the 20-hour time budget.
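The idea of narrating a backtracking solver can be sketched as follows. FrogsGame's rules are not described in this post, so the example uses N-queens purely as a stand-in puzzle; the narration wording and function name are illustrative and are not Fable 5's actual trace format.

```python
from typing import List, Optional, Tuple


def solve_with_trace(n: int) -> Tuple[Optional[List[int]], List[str]]:
    """Place n queens by backtracking and narrate every decision.

    Returns (solution, trace). solution[r] is the column of the queen
    in row r, or None if no placement exists. Every line of the trace
    is produced by the solver, so the narration is correct by
    construction.
    """
    trace: List[str] = []
    cols: List[int] = []  # cols[r] = column chosen for row r

    def safe(row: int, col: int) -> bool:
        # No earlier queen shares the column or a diagonal.
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def place(row: int) -> bool:
        if row == n:
            trace.append(f"All {n} rows filled. Solution: {cols}.")
            return True
        for col in range(n):
            if not safe(row, col):
                trace.append(f"Row {row}: column {col} is attacked, skip.")
                continue
            cols.append(col)
            trace.append(f"Row {row}: try column {col}. No conflict, continue.")
            if place(row + 1):
                return True
            cols.pop()
            trace.append(f"Row {row}: column {col} leads to a dead end, backtrack.")
        return False

    solved = place(0)
    return (list(cols) if solved else None), trace


solution, trace = solve_with_trace(4)
```

Each (puzzle, trace, solution) triple can then serve as one SFT example. Because the search is deterministic, the same generator covers easy and hard instances alike, which is the property the weak base model's own traces lacked.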

Importantly, this is not a reward hack or benchmark exploit. Fable 5 is not taking advantage of a gap between the proxy metric (accuracy on held-out boards) and the true objective (a model that solves unseen puzzles). Instead, it discovers the underlying algorithm, uses it to generate correct reasoning traces, and distills those traces into the base model.

FrogsGame includes a number of safeguards against shortcut solutions. The verifier evaluates checkpoints on an independent set of boards, only the best submitted checkpoint is scored, agents cannot use frontier models to solve boards or retrieve solutions from the internet, and the verifier implementation itself is hidden. Fable 5 does not circumvent any of these constraints. It succeeds because it finds a genuinely effective training strategy.

Beyond the core SFT-trace breakthrough, Fable 5 improved across several other dimensions of the FrogsGame long-horizon task:

Calibration. Fable 5 was substantially better calibrated than Opus 4.8. In the best Fable 5 run, self-evaluation overoptimism was only 1.2x, compared with 4.9x for the best Opus 4.8 run. Interestingly, there is a clear positive correlation between board difficulty and miscalibration.
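One way to read the overoptimism figure is as a ratio of self-reported to verified accuracy. The post does not define the metric, so this is an assumed definition, and the numbers in the usage lines are made up for illustration.

```python
def overoptimism(self_reported: float, verified: float) -> float:
    """Ratio of the agent's own accuracy estimate to the accuracy
    measured by an independent verifier. 1.0 means perfectly
    calibrated; values above 1.0 mean the agent overrated itself."""
    if verified <= 0:
        raise ValueError("verified accuracy must be positive")
    return self_reported / verified


# Hypothetical numbers, not results from the runs:
mild = overoptimism(0.60, 0.50)    # agent claims 60%, verifier measures 50%
severe = overoptimism(0.49, 0.10)  # agent claims 49%, verifier measures 10%
```

Under this definition, computing the ratio separately per difficulty bucket would show whether miscalibration grows with board difficulty.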

Time utilization. Fable 5 agents used almost the entire 20-hour time budget, a major improvement over previous models that often gave up early or failed to convert extra time into better post-training results.

Data generation. The best Fable 5 run generated 3x more data than the best Opus 4.8 run. The dataset was also more diverse, both in board difficulty and in the algorithms used to generate examples.

Training signal. The best Fable 5 trial ran 512 rollouts per step for 95 steps, plus SFT on 6,000 balanced boards. By comparison, the best Opus 4.8 trial ran 128 rollouts per step for 64 steps and barely covered the hardest boards. Fable 5 therefore gave the base model a much more useful training signal, especially on the difficult parts of the distribution.
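Multiplying out the figures above gives the size of each RL budget. The calculation counts RL rollouts only and leaves out the 6,000 SFT boards; the function name is illustrative.

```python
def total_rollouts(rollouts_per_step: int, steps: int) -> int:
    """Total RL rollouts seen during training."""
    return rollouts_per_step * steps


fable_rollouts = total_rollouts(512, 95)  # 48,640
opus_rollouts = total_rollouts(128, 64)   # 8,192
ratio = fable_rollouts / opus_rollouts    # 5.9375
```

So the best Fable 5 trial trained on roughly 6x as many RL rollouts as the best Opus 4.8 trial, before counting the SFT stage.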

Recovery from mistakes. A known challenge with SFT on perfect reasoning traces is that the model learns how correct reasoning looks, but not how to recover when it makes a mistake at inference time. Fable 5 addressed this by injecting common errors into 5% of traces and showing the model how to recover from them.
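A minimal sketch of this kind of error injection is shown below. It assumes a trace is a list of step strings and that a hypothetical `make_mistake` function turns a correct step into a plausible wrong one; neither the trace format nor the recovery wording comes from the Fable 5 runs.

```python
import random
from typing import Callable, List


def inject_recoveries(
    traces: List[List[str]],
    make_mistake: Callable[[str], str],
    rate: float = 0.05,
    seed: int = 0,
) -> List[List[str]]:
    """Return copies of the traces in which a fraction `rate` contain
    one wrong step followed by an explicit recovery, placed just
    before the correct step. The input traces are not modified."""
    rng = random.Random(seed)
    k = round(rate * len(traces))
    chosen = set(rng.sample(range(len(traces)), k))
    out = []
    for idx, steps in enumerate(traces):
        steps = list(steps)
        if idx in chosen and steps:
            pos = rng.randrange(len(steps))
            wrong = make_mistake(steps[pos])
            # The wrong step, then the recovery, then the original step.
            steps[pos:pos] = [wrong, "Wait, that step is wrong. Undo it and try again."]
        out.append(steps)
    return out
```

Because the correct step always follows the recovery line, the final answer of every trace stays correct; only the path to it now includes a detour.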

Conclusion

FrogsGame is a toy task, but the lesson is not. Fable 5 did more than run the standard recipe. It found the real bottleneck, data quality, derived the underlying algorithm, generated reliable supervision programmatically, and used that signal to improve the base model. This is the judgment post-training depends on. Previous models had the right plan but failed in the details: bad traces, noisy evals, weak rewards, brittle curricula. Fable 5 is progress toward models that make those research decisions themselves.

The trajectory of AI research capabilities in post-training is becoming clear.

Research taste might be just another AI capability that AI systems fail at for a time, then get good at. — When AI Builds Itself

Acknowledgments

Thanks to Karina Nguyen for the co-development and feedback, and to Thinking Machines Lab for support with the Tinker API.