Self-Improving Agents: What Trial and Error Actually Teaches Them

Five controlled experiments on whether agents can improve their own behavior through simulation—and what's doing the work when they do.

chasewhughes.com · Apr 2026

Every builder of AI agents eventually asks the same question.

Can an agent improve its own behavior through trial and error?

Run it on enough scenarios. Observe what works. Distill the lessons. Inject them back at inference. No retraining. No human in the loop on every iteration. Just the agent learning from its own runs.

The field’s current answer is broadly yes. A wave of work—Voyager, SkillWeaver, SAGE, SkillX—has refined the recipe: simulate the agent in an environment, harvest patterns from successful trajectories, store them as reusable skill text, retrieve them at inference time. The lifts are real. The papers are optimistic.

The answer I’ll defend in this post is more cautious: yes, but partially. Most of what the simulation step appears to contribute is recoverable from a strong proposer model plus the right instruction, with no simulation at all. There is something the simulation step does that priors can’t replicate—but it isn’t what the field assumes, and we still don’t know exactly what it is.

This is the empirical follow-up to a framework I sketched a month ago. The framework still holds. The experiments changed how I think about which parts of it are doing the work.

The full paper is on DocSend. The code, data, and skill libraries are at github.com/chasewhughes/nite-owl.

I. Where the Field Is

The dominant recipe for self-improving agents looks like this:

1. Run the agent through many scenarios in a simulator.
2. Use a strong proposer model (Claude, GPT-4) to read the transcripts and extract patterns.
3. Write the patterns into structured skill text: what to do, what to avoid, when each applies.
4. At inference time, retrieve relevant skills by similarity to the current scenario and inject them into the agent’s prompt.

This is inference-time knowledge transfer. The proposer LLM externalizes domain procedural knowledge into text, and the agent reads that text at inference. No weight updates. No new training data. The whole pipeline runs at inference cost.
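The retrieve-and-inject step of that recipe can be sketched in a few lines. This is a minimal illustration, not the harness from the repo: the word-count cosine similarity stands in for whatever embedding model a real system would use, and the skill texts, the `retrieve` and `build_prompt` helpers, and the prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(skills: list[str], scenario: str, k: int = 2) -> list[str]:
    """Return the k skills most similar to the scenario text."""
    scen = tokens(scenario)
    ranked = sorted(skills, key=lambda s: cosine(tokens(s), scen), reverse=True)
    return ranked[:k]


def build_prompt(system: str, skills: list[str], scenario: str) -> str:
    """Inject retrieved skill text into the agent's prompt (hypothetical layout)."""
    skill_block = "\n".join(f"- {s}" for s in skills)
    return f"{system}\n\nRelevant skills:\n{skill_block}\n\nScenario:\n{scenario}"


# Hypothetical skill library for a convenience-store clerk.
SKILLS = [
    "When a customer asks for a refund, first check the receipt, then offer store credit.",
    "If a customer appears unwell, stop the transaction and call for help before anything else.",
    "When the card reader fails, apologize, then offer cash payment or the ATM.",
]

scenario = "A customer wants a refund for a sandwich but has lost the receipt."
top = retrieve(SKILLS, scenario, k=1)
print(build_prompt("You are the night clerk.", top, scenario))
```

Nothing in this loop touches model weights. The only thing that changes between runs is the text placed in front of the agent.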

Reported lifts across the literature are real but modest. The honest summary is that skill libraries help in domains where the agent’s defaults are weak, on tasks where procedural sequencing matters, when the proposer is strong, and when scenarios are bounded. Recent work in the same space (Liu et al., 2026) has started questioning the robustness of these gains—pass rates approach no-skill baselines as conditions get more realistic.

II. A Maturity Model for Self-Improvement

Before getting to where the simulation-based recipe sits, here’s the framework I use to think about agent self-improvement generally. Six levels, each building on the one before.

Level	Name	Capability
0	Static	No self-modification. All changes require human engineering.
1	Reactive	Self-heals in response to errors using known fixes from a playbook.
2	Reflective	Maintains reflective memory. Writes post-mortems and applies lessons to future runs.
3	Adaptive	Runs structured experiments (A/B tests, parallel prompts) and selects the best-performing approach.
4	Architectural	Modifies its own structure—spawns sub-agents, changes model routing, refactors decision trees.
5	Autonomous	Full-loop self-improvement with human oversight limited to policy-level guardrails.

Most production agents today live at Level 0 or 1.

The simulation-based skill library recipe is Level 2 territory—reflective. The agent (or rather, the proposer working from the agent’s transcripts) writes structured “lessons learned” and injects them into the next context. Level 3+ requires the agent to run its own A/B tests and modify its own structure, which the current literature hasn’t reached.

The question is whether Level 2 actually buys what the papers say it does.

III. The Question Nobody Asks

Two questions are conspicuously under-tested in the skill-library literature.

First: is the simulation step actually doing the work, or is the proposer LLM doing it? The proposer (Claude or GPT-4) already has substantial domain knowledge about whatever the agent is doing. Could it produce equivalent skill text from priors alone—given only a description of the world, with no transcripts to observe—at a fraction of the cost?

If yes, the simulation harness is decorative. The lift is just in-context distillation through a more expensive interface.

Second: when skill libraries do help, what specifically about the skill text causes the lift? Phrasings the agent recognizes? Failure modes the agent actually exhibits? Procedural sequencing observed in successful runs? Each implies a different recipe—and some are recoverable from proposer priors while others aren’t.

To my knowledge, no published work runs the cold-priors ablation. So I ran it.

IV. The Setup

Five controlled experiments in a frozen customer-service simulator—a late-night convenience store with an agent (the clerk), 44 pre-authored personas, and a 543-scenario pool. The harness extends SOTOPIA v0.1.5, replacing per-scenario LLM generation with deterministic template instantiation so the eval is reproducible.

The same proposer (Claude Sonnet 4.6) writes skills under different conditions:

Experiment	Condition
E1	Sonnet observes Qwen-30B’s training shifts → 31 skills → eval on Qwen
E2	Reverse distillation—E1’s library applied to a frontier agent (Kimi K2)
E3	Sonnet observes K2’s training shifts → 25 skills → eval on K2
E4	Sonnet writes skills from priors only—same world spec, same persona briefs, no transcripts
E5	Sonnet rewrites E4’s cold skills to triple their procedural-sequencing density, holding content constant

Same proposer, same agent, same evaluation scenarios, same retrieval, same Honcho workspace. Between E3 and E4 only the presence of simulation transcripts varies. 24 paired t-tests across four primary experiments. Benjamini-Hochberg FDR correction at α=0.05. A Gemini 2.5 Pro inter-judge cross-check on a 20-shift subsample.
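For readers who have not used it, the Benjamini-Hochberg step can be sketched in plain Python. This is the textbook step-up procedure, not code from the repo, and the raw p-values below are made up for illustration. They are not the study's values.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # the scaled p-values at its rank or higher (keeps q-values monotone).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical raw p-values for six tests, NOT the post's data.
raw = [0.005, 0.01, 0.03, 0.04, 0.20, 0.60]
adjusted = benjamini_hochberg(raw)
for p, qv in zip(raw, adjusted):
    verdict = "survives" if qv <= 0.05 else "does not survive"
    print(f"p={p:.3f}  q={qv:.3f}  {verdict} at alpha=0.05")
```

The point of the correction is visible in the toy numbers: a raw p-value under 0.05 can still fail once it is scaled by how many tests were run alongside it.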

V. Four Findings

One quality lift survives correction. K2 with K2-derived skills, retrieval-only, on resolution quality: ∆=+0.26, BH-q=.037, Cohen’s d=0.354. Same-tier extraction lifts a frontier-class agent on a customer-service domain. That’s the load-bearing result.

The cold-priors control fails on quality and damages safety. Cold-priors injection produces no significant lift on any dimension at any level of correction. And it inflates K2’s anti-pattern violation rate from 1% to 7%. Off-tier skill text—whether from a different model’s transcripts (E2) or from the proposer’s own priors (E4)—damages the agent’s clean-behavior baseline. Same-tier extraction preserves it (pooled E3 vs E4: Fisher’s exact p=.011).
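A comparison of violation rates like this one is a 2x2 table (violations vs clean shifts, for two conditions), which is what Fisher's exact test handles. Below is a minimal two-sided implementation using only the standard library. The counts in the example are hypothetical, chosen only to resemble a low rate against a higher one, and are not the counts behind the p-value above.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(x: int) -> int:
        # Unnormalized hypergeometric probability of x in the top-left cell,
        # kept as an exact integer to avoid floating-point ties.
        return comb(row1, x) * comb(row2, col1 - x)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(total, col1)


# Hypothetical counts: (violations, clean shifts) per condition.
grounded = (2, 198)
cold = (14, 186)
p = fisher_exact_two_sided(grounded[0], grounded[1], cold[0], cold[1])
print(f"two-sided Fisher's exact p = {p:.4f}")
```

An exact test is the natural choice here because violation counts in the clean condition are small, which is where chi-square approximations get unreliable.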

Procedural structure is recoverable from priors. When I rewrote the cold skills to add explicit sequencing markers (E5), the resolution-quality lift came back: ∆=+0.213 vs grounded’s +0.263, statistically indistinguishable. The procedural structure that improves agent outcomes can be recovered from a “write procedurally” instruction without observing any transcripts at all.
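Sequencing-marker density is easy to approximate. The sketch below counts a small set of ordering words per 100 words of skill text. The marker list is my assumption for illustration, not necessarily the set used in the paper, and the two skill texts are invented.

```python
import re

# Hypothetical marker list; the exact set counted in the study may differ.
SEQUENCING_MARKERS = ("first", "then", "next", "before", "after", "finally", "once")


def marker_density(skill_text: str) -> float:
    """Sequencing markers per 100 words of skill text."""
    words = re.findall(r"[a-z']+", skill_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SEQUENCING_MARKERS)
    return 100.0 * hits / len(words)


flat = "Offer store credit for refunds without a receipt and stay polite."
scaffolded = (
    "First acknowledge the request. Then check for a receipt. "
    "Once that is settled, offer store credit. Finally confirm the customer is satisfied."
)

print(f"flat:       {marker_density(flat):.1f} markers per 100 words")
print(f"scaffolded: {marker_density(scaffolded):.1f} markers per 100 words")
```

Both texts carry the same advice. Only the second tells the agent what order to do things in, which is the property the E5 rewrite was designed to vary.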

But compliance doesn’t recover. Here’s where it gets strange. Scaffolded cold skills carry 2.3× the sequencing-marker density of the grounded library. They produce essentially the same quality lift. But the judge’s “did the agent follow this skill?” rating doesn’t move at all: fidelity 0.46 vs grounded’s 1.99. On that rating the agent appears to follow grounded skills more than four times as closely as scaffolded ones, while producing the same outcomes either way.

Compliance and outcome metrics decouple. A library evaluated only on fidelity would have correctly preferred grounded over scaffolded but missed that scaffolded is competitive on the metric end-users care about. A library evaluated only on outcome would have missed whatever drives the residual fidelity gap.

VI. The Split-Mechanism Takeaway

The simulation harness is contributing at least two separable things.

The first—procedural structure that improves downstream agent behavior—is recoverable from proposer priors plus the right instruction. You don’t need the simulation step for this. You need a competent proposer and a prompt that tells it to write procedurally.

The second—whatever drives the agent’s apparent compliance with the skill text at the token level—is not recoverable from priors. Three candidates I can’t yet isolate: phrasing recognition (grounded skills contain phrasings the agent actually produced), failure-mode targeting (grounded skills target patterns the agent actually exhibits), and M=2 survivorship filtering (the grounded extractor only promotes skills that recurred across multiple shifts). The vocabulary-overlap data weakens the phrasing-recognition candidate—cold skills actually have higher overlap with the agent’s transcripts than grounded skills do.

The most cheaply testable candidate is the M=2 filter—generate cold skills two or three times and apply an equivalent multi-pass filter. I left it for future work.
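One way that test could look, as a sketch under stated assumptions: generate the cold library several times, then keep only skills that reappear in at least M passes. Matching here is by token-set Jaccard overlap with an arbitrary 0.5 threshold, both of which are my choices for illustration. The grounded extractor's actual matching rule may differ, and the skill texts are invented.

```python
import re


def token_set(text: str) -> frozenset[str]:
    """Lowercased set of words in a skill text."""
    return frozenset(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recurring_skills(
    passes: list[list[str]], min_passes: int = 2, threshold: float = 0.5
) -> list[str]:
    """Keep first-pass skills that have a near-duplicate in enough passes.

    Simplification: only skills from the first pass are candidates.
    """
    kept = []
    for skill in passes[0]:
        base = token_set(skill)
        support = 1  # the first pass itself counts once
        for other in passes[1:]:
            if any(jaccard(base, token_set(s)) >= threshold for s in other):
                support += 1
        if support >= min_passes:
            kept.append(skill)
    return kept


# Three hypothetical cold-generation passes from the same proposer prompt.
passes = [
    ["Check the receipt first, then offer store credit.", "Greet every customer by name."],
    ["First check the receipt, then offer store credit or an exchange.", "Keep the coffee station stocked."],
    ["Offer store credit after you check the receipt.", "Mop the floor at midnight."],
]
print(recurring_skills(passes, min_passes=2))
```

If a cold library filtered this way closed the fidelity gap, survivorship filtering would be the likely driver. If it did not, the other two candidates would remain.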

VII. What Got Tested (and What Didn’t)

Self-improvement applies across two dimensions: what the agent optimizes (behavior vs. structure) and the scope of change (the agent’s own logic vs. the systems it operates on).

Dimension	Category	What It Covers
Behavior	Memories	Self-curating knowledge base—distilled lessons retrieved at inference.
Behavior	Prompts	Multi-layer prompt stack: system prompts, skill prompts, tool descriptions.
Structure	Agent Design	Decision trees, sub-agent spawning, model routing.
Structure	Infrastructure	Data pipelines, run schedules, tool selection.

The experiments above test exactly one cell of this matrix: skill text injected into the prompt that modifies how the agent behaves. That’s the Prompts row of the Behavior dimension—the most-studied cell in the literature, and the one with the most established methodology.

What the experiments tell us about that cell: simulation-grounded skill text genuinely outperforms what the proposer can write from priors, but most of the outcome lift is replicable from priors with the right instruction. The unique contribution of grounding is something other than what the prose looks like at the surface.

What they do not tell us: whether the same is true for memory curation, agent design changes, or infrastructure adjustments. Those cells are harder to test. The agent-design cell in particular (letting the agent rewrite its own decision tree) opens a bigger surface area of risk and a bigger range of possible mechanisms. I expect the direction of the cold-priors result to transfer (the proposer’s priors do a lot of work everywhere), but I can’t claim it.

Anyone running self-improvement experiments in the structure cells should run the equivalent ablation: can a proposer, without observing the agent’s actual runs, produce a structurally equivalent change just from world description? If yes, the simulation step there is decorative too.

VIII. Risks and Guardrails

Recursive self-improvement is powerful precisely because the agent has latitude to change itself. That latitude introduces categories of risk that static agents never face.

Risk	Description	Guardrail
Drift	Incremental changes compound into behavior that no longer aligns with original intent.	Anchor evaluations to original North Star metrics. Flag cumulative deviation across multiple change cycles.
Reward hacking	The agent optimizes a metric in ways that satisfy measurement but violate the spirit of the goal.	Pair quantitative metrics with qualitative spot-checks. Include adversarial test cases.
Compounding errors	A bad change slightly degrades performance, goes undetected, becomes the baseline for future changes.	Automated regression checks against a frozen baseline, not just the previous version.
Context pollution	Reflective memories or meta-logs grow too large and crowd out useful context.	Enforce memory budgets. Periodically summarize and prune.
Over-exploration	Architecture temperature set too high causes cycles spent on radical experiments instead of execution.	Cap exploration frequency. Require exploratory changes to pass minimum performance thresholds.
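The compounding-errors guardrail in the table is worth making concrete, because the failure is easy to miss. In the sketch below each version is only slightly worse than its predecessor, so a check against the previous version passes every time, while a check against the frozen baseline catches the accumulated drop. The scores, the 0.05 tolerance, and the `passes_gate` helper are all hypothetical.

```python
from statistics import mean


def passes_gate(
    reference: list[float], candidate: list[float], max_drop: float = 0.05
) -> bool:
    """True when the candidate's mean score is within max_drop of the reference's."""
    return mean(candidate) >= mean(reference) - max_drop


# Hypothetical per-scenario scores on a fixed eval set.
frozen = [3.8, 4.0, 4.2, 4.0]     # baseline captured once and never updated
v1 = [3.76, 3.96, 4.16, 3.96]     # slightly worse than frozen
v2 = [3.72, 3.92, 4.12, 3.92]     # slightly worse than v1

print("v1 vs frozen:  ", passes_gate(frozen, v1))
print("v2 vs previous:", passes_gate(v1, v2))
print("v2 vs frozen:  ", passes_gate(frozen, v2))
```

A production gate would also want a significance test rather than a raw mean comparison, but the structural point is the choice of reference, not the statistic.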

The cold-priors result adds a sixth risk to this list, one I didn’t anticipate when I wrote the original framework:

Pipeline ornament. Building a simulation harness that isn’t actually contributing what you think it is. Shipping a complex inference-time system whose lift is fully attributable to the proposer’s priors plus an instruction, while paying for the simulation cost in compute, latency, and engineering surface area. The guardrail is the cold-priors ablation: before you ship a simulation-based skill library, generate skills both ways and compare. If the cold version performs equivalently, you don’t need the harness.

The non-negotiable guardrail across all of these remains: human-in-the-loop approval for any change that touches architecture, data pipelines, or external-facing behavior. The agent proposes; the operator disposes.

IX. What This Means for Builders

Three takeaways if you’re shipping a skill-library-based agent system.

Run the cold-priors ablation before you ship a simulation harness. If your proposer can produce equivalent skill text from priors plus a “write procedurally” instruction, you’re paying for a pipeline that’s doing in-context distillation through a more expensive interface. Generate skills both ways. Compare. The ablation is cheap. The cold-skill generation script is in the artifact release.
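The comparison itself needs very little machinery. A rough sketch, assuming you have per-scenario judge scores for the same scenarios under three conditions (no skills, grounded skills, cold skills): compute the paired lift and an effect size for each library. The effect size here is the mean paired difference divided by the standard deviation of the differences, which is one common variant of Cohen's d and not necessarily the one used in the paper. All scores are invented, and this is not the script from the artifact release.

```python
import math
from statistics import mean, stdev


def paired_lift(baseline: list[float], treated: list[float]) -> tuple[float, float]:
    """Mean paired difference and paired Cohen's d (mean diff / SD of diffs)."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with at least 2 scenarios")
    diffs = [t - b for b, t in zip(baseline, treated)]
    lift = mean(diffs)
    sd = stdev(diffs)
    if sd == 0:
        # Every scenario moved by the same amount; d is undefined, so report
        # zero for no change and a signed infinity otherwise.
        return lift, (0.0 if lift == 0 else math.copysign(math.inf, lift))
    return lift, lift / sd


# Hypothetical judge scores on the same eight scenarios.
no_skills = [3.0, 3.5, 2.5, 4.0, 3.0, 3.5, 2.0, 4.0]
grounded = [3.5, 3.5, 3.0, 4.0, 3.5, 4.0, 2.5, 4.0]
cold = [3.0, 4.0, 3.0, 4.0, 3.0, 4.0, 2.5, 4.0]

for name, scores in (("grounded", grounded), ("cold", cold)):
    lift, d = paired_lift(no_skills, scores)
    print(f"{name:>8} vs no skills: lift={lift:+.3f}, d={d:.2f}")

head_to_head, d_gap = paired_lift(cold, grounded)
print(f"grounded vs cold: lift={head_to_head:+.3f}, d={d_gap:.2f}")
```

The head-to-head line is the one that decides whether the harness earns its cost. If the grounded-minus-cold difference is indistinguishable from zero on the metrics you care about, the cold library is the cheaper way to get there.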

Report compliance and outcome metrics jointly. They decouple in ways single-metric reporting will miss. If your eval only tracks “did the agent follow the skill?” you’ll preferentially keep skills the agent compliantly executes regardless of whether they help. If your eval only tracks outcome quality, you’ll miss whatever the compliance gap is actually measuring.

Skills can hurt where the agent’s defaults are already strong. In one of my categories—medical-emergency / crowd-cascade—K2’s baseline acute-priority routing was already the strongest of any category, and the K2-derived skill imposed structure K2 read as constraining relative to its own defaults. Per-category resolution lift went negative. The Spearman correlation isn’t statistically robust at n=10 categories, but the direction is there. Don’t assume more skills means better behavior.

X. What This Doesn’t Show

One bounded customer-service domain. The mechanism-level findings I expect to transfer; I cannot demonstrate transfer without more experiments.

The headline finding sits exactly at the BH-corrected detectability threshold (d=0.354). Replication at larger n would be valuable.

Honcho-mediated selection did not significantly outperform top-K retrieval after correction in any experiment.

The Gemini cross-check reproduces the load-bearing fidelity gap (Sonnet 1.42 vs Gemini 1.35; per-shift r=0.99), but at the subsample n the resolution lift does not reach significance on its own.

XI. The Methodology Is the Durable Part

The empirical contribution is modest: one corrected-significance lift on one bounded domain, plus a partial mechanism story. The methodology contribution is what I expect to outlast the specific findings.

Cold-priors ablation should be standard for skill-library research. Compliance and outcome should be reported jointly because they decouple. If you’re building in this space, run the ablation—the script is in the repo.

The agent self-improvement question is real, and it’s worth answering well. But “well” means knowing which parts of your pipeline are actually doing the work. Most of what we attribute to simulation is recoverable for free. The part that isn’t is the part worth understanding next.

Read the full paper on DocSend. Code, data, and skill libraries are open source on GitHub.