Text-space optimization for frozen agents

SkillOpt Executive Strategy for Self-Evolving Agent Skills

SkillOpt treats a compact natural-language skill document as the trainable state of a frozen language agent, then learns that document through rollouts, reflection, bounded edits, and held-out validation gates.

Target Models 7
Benchmarks 6
Harnesses Codex + Claude Code
01

The core loop at a glance.

The SkillOpt training loop: rollout evidence, optimizer-side reflection, bounded skill edits, validation gating, and the exported reusable skill.

Rollout

The target model executes tasks with the current skill and records scored trajectories.

Reflect

The optimizer analyzes success and failure minibatches to find reusable procedures.

Edit

Candidate add, delete, and replace operations are merged and ranked under a budget.

Gate

The candidate skill is kept only if it improves held-out selection performance.

01 / Core Idea

Train the procedure, not the weights.

SkillOpt makes the skill document itself the optimization target. The target model, backend, and harness stay fixed; the procedure that guides evidence gathering, tool use, verification, and output formatting evolves.

A skill is external state for an agent.

Instead of fine-tuning a model or hand-maintaining prompts, SkillOpt runs the frozen agent on scored batches, asks a separate optimizer model to propose structured edits, and accepts a candidate only when validation performance improves.

1

Frozen target model

The base model stays unchanged throughout the optimization loop.

2

Optimizer model

A separate model proposes structured skill edits based on rollout evidence.

3

Add / delete / replace edits

Operations are merged and ranked under a textual edit budget.

4

Held-out gate

Edits are accepted only when they improve held-out selection performance.

02 / Method

A training loop for natural-language skills.

The loop deliberately mirrors a learning algorithm: rollout evidence acts like a forward pass, reflection acts like a language-level backward pass, and the textual learning rate bounds how far the skill can move.

Evidence

Rollout batches capture messages, tool calls, verifier feedback, task metadata, and final scores.

Minibatches

Failures and successes are reflected separately so edits correct recurring errors while preserving working behavior.

Bounded Edits

An edit budget functions as a textual learning rate, preventing useful rules from being overwritten by broad rewrites.

Memory

Rejected edits, slow update, and optimizer-side meta skill provide longer-horizon feedback without bloating deployment.

03 / Main Results

SkillOpt improves GPT and Qwen target models.

The table reports main-result gains across target models and execution harnesses, comparing no-skill execution with the final SkillOpt skill on held-out test splits.

Target model Harness SearchQA Sheet Office DocVQA LiveMath ALFWorld Avg gain
GPT-5.5Direct chat+9.6+38.9+39.0+12.4+29.3+11.9+23.5
GPT-5.4Direct chat+6.2+21.1+12.8+13.6+7.2+15.6+12.8
GPT-5.4-miniDirect chat+4.3+11.4+26.7+16.5+4.8+12.7+12.7
GPT-5.4-nanoDirect chat+19.0+8.2+33.7+49.4+4.0+35.1+24.9
GPT-5.2Direct chat+11.2+18.9+21.5+16.5+15.2+16.4+16.6
Qwen3.5-4BDirect chat+3.1+14.6+15.2+2.1+29.6+50.7+19.2
Qwen3.6-35B-A3BDirect chat+7.6+9.3+1.2+3.8+10.4+22.4+9.1
GPT-5.5Codex+5.5+57.5+12.8+5.0+28.0N/A+21.8
GPT-5.5Claude Code+4.0+58.3+13.9+3.5+13.3N/A+18.6
52/52 Best or tied-best in every model x benchmark
Method comparison SkillOpt clears the strongest baseline on every benchmark
04 / Ablations

The controls are doing real work.

The paper isolates the optimizer components that keep skill learning stable: enough evidence, bounded textual updates, rejected-edit feedback, slow update, and optimizer-side memory.

Component Setting SearchQA Spreadsheet LiveMath
Learning ratelr=4 default87.177.561.3
Learning ratewithout lr84.675.757.3
Rejected bufferwith buffer87.177.561.3
Rejected bufferwithout buffer85.572.958.9
Update memorymeta skill + slow update87.177.561.3
Update memorywithout both86.355.059.7

Bounded

Textual learning rates prevent destructive rewrites while keeping enough plasticity to learn new procedures.

Gated

Held-out selection turns reflection into propose-and-test optimization rather than unconditional self-editing.

Buffered

Rejected edits become negative feedback, helping the optimizer avoid repeating harmful directions.

05 / Skill Evolution

A typical run turns failures into concrete operating rules.

This ALFWorld run uses GPT-5.4-mini as the frozen target model and GPT-5.5 as the optimizer model. The plot tracks train rollout and held-out selection scores.

ALFWorld / train-sel evolution

Train rollout Selection gate
base step 1 step 2 step 3 slow step 4
Run setup Target model: GPT-5.4-mini. Optimizer model: GPT-5.5.
Selection rule Candidate edits are accepted only when held-out selection improves the current best score.
Outcome The selected skill improves final ALFWorld test hard score from 70.9% to 85.8%.
06 / Transfer

The exported skill behaves like a reusable artifact.

SkillOpt exports a compact best_skill.md. The paper tests whether that artifact transfers across model sizes, execution harnesses, and nearby benchmarks without further target-side optimization.

+15.2

GPT-5.4 LiveMath skill transferred to GPT-5.4-nano on LiveMathBench.

Cross-model
+31.8

Codex-trained SpreadsheetBench skill transferred into Claude Code.

Cross-harness
+10.4

GPT-5.4-nano used as its own optimizer improved SpreadsheetBench over baseline.

Self-optimizer
1 file

The target model consumes only the final skill, not optimizer memory.

Deployment

A stronger optimizer model gives the largest gains, but the loop is not merely distillation from a stronger model. Even matched target-as-optimizer settings can discover useful edits when the update is constrained, buffered, and validated.

07 / Citation

Cite the paper.

If you find SkillOpt useful, please cite the arXiv preprint below.

BibTeX
@misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills}, 
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}, 
}