SkillOpt treats a compact natural-language skill document as the trainable state of a frozen language agent, then learns that document through rollouts, reflection, bounded edits, and held-out validation gates.
The SkillOpt training loop: rollout evidence, optimizer-side reflection, bounded skill edits, validation gating, and the exported reusable skill.
The target model executes tasks with the current skill and records scored trajectories.
The optimizer analyzes success and failure minibatches to find reusable procedures.
Candidate add, delete, and replace operations are merged and ranked under a budget.
The candidate skill is kept only if it improves held-out selection performance.
SkillOpt makes the skill document itself the optimization target. The target model, backend, and harness stay fixed; the procedure that guides evidence gathering, tool use, verification, and output formatting evolves.
Instead of fine-tuning a model or hand-maintaining prompts, SkillOpt runs the frozen agent on scored batches, asks a separate optimizer model to propose structured edits, and accepts a candidate only when validation performance improves.
The base model stays unchanged throughout the optimization loop.
A separate model proposes structured skill edits based on rollout evidence.
Operations are merged and ranked under a textual edit budget.
Edits are accepted only when they improve held-out selection performance.
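The four properties above can be read as one propose-and-test loop. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `Rollout`, `optimize_skill`, `run_agent`, and `propose` are hypothetical names, the skill is a plain string, and the gate is a strict improvement in mean held-out selection score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    """One scored trajectory (hypothetical record layout)."""
    task_id: str
    trajectory: str  # messages, tool calls, verifier feedback
    score: float     # final task score


def mean_score(rollouts: List[Rollout]) -> float:
    return sum(r.score for r in rollouts) / len(rollouts)


def optimize_skill(
    skill: str,
    run_agent: Callable[[str, List[str]], List[Rollout]],  # frozen target model
    propose: Callable[[str, List[Rollout]], str],          # separate optimizer model
    train_batches: List[List[str]],
    selection_tasks: List[str],
) -> Tuple[str, float]:
    """Return the best skill found and its held-out selection score."""
    best_skill = skill
    best_score = mean_score(run_agent(best_skill, selection_tasks))
    for batch in train_batches:
        rollouts = run_agent(best_skill, batch)    # rollout evidence
        candidate = propose(best_skill, rollouts)  # reflection + bounded edit
        cand_score = mean_score(run_agent(candidate, selection_tasks))
        if cand_score > best_score:                # validation gate
            best_skill, best_score = candidate, cand_score
    return best_skill, best_score
```

Passing the target and optimizer as callables keeps the model, backend, and harness outside the loop, so only the skill text changes between iterations.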
The loop deliberately mirrors a learning algorithm: rollout evidence acts like a forward pass, reflection acts like a language-level backward pass, and the textual learning rate bounds how far the skill can move.
Rollout batches capture messages, tool calls, verifier feedback, task metadata, and final scores.
Failures and successes are reflected separately so edits correct recurring errors while preserving working behavior.
An edit budget functions as a textual learning rate, preventing useful rules from being overwritten by broad rewrites.
Rejected edits, the slow update, and the optimizer-side meta skill provide longer-horizon feedback without adding to the deployed skill.
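One way to picture the edit budget is as a cap on how many operations survive merging and ranking in a single update. The sketch below assumes the skill is a list of rule strings, that duplicate proposals are merged by summing a `support` count, and that the budget counts operations. `EditOp`, `merge_and_rank`, `apply_ops`, and the `support` ranking signal are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EditOp:
    kind: str          # "add", "delete", or "replace"
    target: str = ""   # existing rule to delete or replace (unused for "add")
    text: str = ""     # new rule text (unused for "delete")
    support: int = 1   # assumed ranking signal: rollouts backing this edit


def merge_and_rank(ops: List[EditOp], budget: int) -> List[EditOp]:
    """Merge identical proposals, rank by support, keep at most `budget`."""
    merged: Dict[Tuple[str, str, str], EditOp] = {}
    for op in ops:
        key = (op.kind, op.target, op.text)
        if key in merged:
            prev = merged[key]
            merged[key] = EditOp(op.kind, op.target, op.text, prev.support + op.support)
        else:
            merged[key] = op
    # sorted() is stable, so ties keep their proposal order.
    ranked = sorted(merged.values(), key=lambda o: o.support, reverse=True)
    return ranked[:budget]


def apply_ops(rules: List[str], ops: List[EditOp]) -> List[str]:
    """Apply the surviving operations to a copy of the skill's rules."""
    out = list(rules)
    for op in ops:
        if op.kind == "add" and op.text not in out:
            out.append(op.text)
        elif op.kind == "delete" and op.target in out:
            out.remove(op.target)
        elif op.kind == "replace" and op.target in out:
            out[out.index(op.target)] = op.text
    return out
```

With a small budget, a broad rewrite cannot land in one step: only the best-supported operations are applied, and the remaining rules stay as they were.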
The table below reports the gain of the final SkillOpt skill over no-skill execution on held-out test splits, across target models and execution harnesses.
| Target model | Harness | SearchQA | Sheet | Office | DocVQA | LiveMath | ALFWorld | Avg gain |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | Direct chat | +9.6 | +38.9 | +39.0 | +12.4 | +29.3 | +11.9 | +23.5 |
| GPT-5.4 | Direct chat | +6.2 | +21.1 | +12.8 | +13.6 | +7.2 | +15.6 | +12.8 |
| GPT-5.4-mini | Direct chat | +4.3 | +11.4 | +26.7 | +16.5 | +4.8 | +12.7 | +12.7 |
| GPT-5.4-nano | Direct chat | +19.0 | +8.2 | +33.7 | +49.4 | +4.0 | +35.1 | +24.9 |
| GPT-5.2 | Direct chat | +11.2 | +18.9 | +21.5 | +16.5 | +15.2 | +16.4 | +16.6 |
| Qwen3.5-4B | Direct chat | +3.1 | +14.6 | +15.2 | +2.1 | +29.6 | +50.7 | +19.2 |
| Qwen3.6-35B-A3B | Direct chat | +7.6 | +9.3 | +1.2 | +3.8 | +10.4 | +22.4 | +9.1 |
| GPT-5.5 | Codex | +5.5 | +57.5 | +12.8 | +5.0 | +28.0 | N/A | +21.8 |
| GPT-5.5 | Claude Code | +4.0 | +58.3 | +13.9 | +3.5 | +13.3 | N/A | +18.6 |
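The Avg gain column is consistent with an unweighted mean over the benchmarks that have a reported value, with N/A cells skipped. This is inferred from the numbers in the table rather than stated in the text. The short check below reproduces two rows, with `avg_gain` as a hypothetical helper.

```python
from typing import List, Optional


def avg_gain(gains: List[Optional[float]]) -> float:
    """Unweighted mean over reported gains; None stands for an N/A cell."""
    present = [g for g in gains if g is not None]
    return sum(present) / len(present)


# Rows copied from the table: SearchQA, Sheet, Office, DocVQA, LiveMath, ALFWorld.
gpt_54_nano_direct = [19.0, 8.2, 33.7, 49.4, 4.0, 35.1]
gpt_55_claude_code = [4.0, 58.3, 13.9, 3.5, 13.3, None]

print(f"{avg_gain(gpt_54_nano_direct):.1f}")  # 24.9
print(f"{avg_gain(gpt_55_claude_code):.1f}")  # 18.6
```

Because the Codex and Claude Code rows average over five benchmarks instead of six, their Avg gain is not directly comparable to the Direct chat rows.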
The paper isolates the optimizer components that keep skill learning stable: enough evidence, bounded textual updates, rejected-edit feedback, slow update, and optimizer-side memory.
| Component | Setting | SearchQA | Spreadsheet | LiveMath |
|---|---|---|---|---|
| Learning rate | lr=4 default | 87.1 | 77.5 | 61.3 |
| Learning rate | without lr | 84.6 | 75.7 | 57.3 |
| Rejected buffer | with buffer | 87.1 | 77.5 | 61.3 |
| Rejected buffer | without buffer | 85.5 | 72.9 | 58.9 |
| Update memory | meta skill + slow update | 87.1 | 77.5 | 61.3 |
| Update memory | without both | 86.3 | 55.0 | 59.7 |
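To make the ablation easier to read, the snippet below computes how far each ablated setting falls below the default configuration, using only the scores in the table. The dictionary layout and the `drops` helper are illustrative.

```python
from typing import Dict

# Default configuration (lr=4, with buffer, meta skill + slow update).
DEFAULT: Dict[str, float] = {"SearchQA": 87.1, "Spreadsheet": 77.5, "LiveMath": 61.3}

ABLATIONS: Dict[str, Dict[str, float]] = {
    "without lr": {"SearchQA": 84.6, "Spreadsheet": 75.7, "LiveMath": 57.3},
    "without buffer": {"SearchQA": 85.5, "Spreadsheet": 72.9, "LiveMath": 58.9},
    "without meta skill + slow update": {"SearchQA": 86.3, "Spreadsheet": 55.0, "LiveMath": 59.7},
}


def drops(ablated: Dict[str, float]) -> Dict[str, float]:
    """Score lost relative to the default, rounded to one decimal."""
    return {bench: round(DEFAULT[bench] - score, 1) for bench, score in ablated.items()}


for name, scores in ABLATIONS.items():
    print(name, drops(scores))
```

The largest single drop is on Spreadsheet when both the meta skill and the slow update are removed (77.5 to 55.0, a loss of 22.5), while removing the learning rate costs the most on LiveMath (61.3 to 57.3, a loss of 4.0).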
Textual learning rates prevent destructive rewrites while keeping enough plasticity to learn new procedures.
Held-out selection turns reflection into propose-and-test optimization rather than unconditional self-editing.
Rejected edits become negative feedback, helping the optimizer avoid repeating harmful directions.
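A rejected-edit buffer can be sketched as a small bounded memory that the optimizer sees on later steps. The version below is an assumption-laden illustration: `RejectedBuffer`, its default capacity, the strict-improvement gate, and the prompt wording are all hypothetical and not taken from the paper.

```python
from collections import deque
from typing import Deque


class RejectedBuffer:
    """Bounded memory of edits that failed the held-out gate (hypothetical)."""

    def __init__(self, maxlen: int = 8) -> None:
        # Oldest entries are dropped first once the buffer is full.
        self._items: Deque[str] = deque(maxlen=maxlen)

    def record(self, edit_summary: str, cand_score: float, best_score: float) -> bool:
        """Return True if the candidate passes the gate; otherwise remember it."""
        accepted = cand_score > best_score
        if not accepted:
            self._items.append(
                f"{edit_summary} (selection {cand_score:.2f} vs {best_score:.2f})"
            )
        return accepted

    def as_prompt_section(self) -> str:
        """Text to include in the optimizer prompt; empty if nothing was rejected."""
        if not self._items:
            return ""
        lines = "\n".join(f"- {s}" for s in self._items)
        return "Previously rejected edits, avoid repeating:\n" + lines
```

The buffer lives only on the optimizer side, so it can grow or be discarded without changing what the target model sees at deployment.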
This ALFWorld run uses GPT-5.4-mini as the frozen target model and GPT-5.5 as the optimizer model. The plot tracks train rollout and held-out selection scores.
SkillOpt exports a compact best_skill.md. The paper tests whether that artifact transfers across model sizes, execution harnesses, and nearby benchmarks without further target-side optimization.
Cross-model: GPT-5.4 LiveMath skill transferred to GPT-5.4-nano on LiveMathBench.
Cross-harness: Codex-trained SpreadsheetBench skill transferred into Claude Code.
Self-optimizer: GPT-5.4-nano used as its own optimizer improved SpreadsheetBench over baseline.
Deployment: The target model consumes only the final skill, not optimizer memory.
A stronger optimizer model gives the largest gains, but the loop is not merely distillation from a stronger model. Even matched target-as-optimizer settings can discover useful edits when the update is constrained, buffered, and validated.
If you find SkillOpt useful, please cite the arXiv preprint below.
@misc{yang2026skilloptexecutivestrategyselfevolving,
title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills},
author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
year={2026},
eprint={2605.23904},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.23904},
}