Lessons in Building Agents with Self-Recursive Improvement

A framework for agents that evaluate, modify, and evolve their own behavior—including the self-improvement.skill library.

chasewhughes.com · Mar 2026

I. The Problem Statement

Agents are hard to get right the first time. The gap between a working demo and a reliable production system is vast, and closing it requires sustained, iterative improvement. The question is not whether to improve—it’s who does the improving, and how.

Why Current Approaches Fall Short

  • Reactive by default: Error-driven methods (debugging, exception handling) are reactive. They only fire after something breaks and miss the broader optimization surface.
  • Human-limited: Human-dependent methods (observability dashboards, manual log review) create a bottleneck. A recursive agent can review every run and trace, so it can surface improvement vectors that a team of engineers would not have time to observe.
  • Incomplete coverage: Narrow-scope tools (prompt auto-optimizers, single-metric tuners) improve one dimension while ignoring the rest of the system.

Why Recursive Self-Improvement Changes the Equation

  • Learns the long tail of edge cases that surface in production but never appear in design phases.
  • Memories and design patterns carry over even when the underlying architecture changes—swapping an LLM, migrating a pipeline, or restructuring agent topology.
  • Model agnosticism: by encoding improvement logic into a self-improvement.skill library, the system’s intelligence persists independently of the base model.

Key Research Foundations

Reflexion & Linguistic Reinforcement: Research on the Reflexion framework shows that agents can improve across repeated attempts by maintaining a reflective textual memory. Instead of updating weights, the agent writes a post-mortem (“I tried X, it failed because of Y, next time I should try Z”) and injects it into the next context window.
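As a minimal sketch of the reflective-memory idea, the snippet below stores post-mortems in the "tried X, failed because Y, next time Z" shape and prepends the most recent ones to the next prompt. The names `Reflection` and `build_context` and the `max_notes` cap are hypothetical, chosen for illustration; they are not taken from the Reflexion paper or the skill library.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    """One post-mortem: what was tried, why it failed, what to do next."""
    attempted: str
    failure_reason: str
    next_step: str

    def render(self) -> str:
        return (f"I tried {self.attempted}; it failed because "
                f"{self.failure_reason}; next time I should {self.next_step}.")


def build_context(task: str, reflections: list[Reflection], max_notes: int = 3) -> str:
    """Prepend the most recent reflections to the task prompt."""
    recent = reflections[-max_notes:] if max_notes > 0 else []
    if not recent:
        return f"Task: {task}"
    notes = "\n".join(f"- {r.render()}" for r in recent)
    return f"Lessons from earlier attempts:\n{notes}\n\nTask: {task}"
```

Capping the number of injected notes is one simple way to keep the memory from crowding out the task itself.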

Exploration vs. Exploitation: Self-improving agents can get stuck in local maxima—finding a “good enough” approach and never testing alternatives. Incorporating a “temperature for architecture”—where the agent is occasionally forced to try a radically different tool or logic path—prevents stagnation and surfaces better strategies.
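One way to read "temperature for architecture" is as an epsilon-greedy choice: usually pick the best-known strategy, but with some probability force a different one. The sketch below assumes that interpretation; the function name `choose_strategy`, the score dictionary, and the example tool names are hypothetical.

```python
import random


def choose_strategy(scores: dict[str, float], temperature: float,
                    rng: random.Random) -> str:
    """Return the best-scoring strategy, except with probability
    `temperature` return a uniformly chosen alternative instead."""
    best = max(scores, key=scores.get)
    others = [name for name in scores if name != best]
    if others and rng.random() < temperature:
        return rng.choice(others)  # Forced exploration.
    return best                    # Exploitation.


scores = {"tool_a": 0.82, "tool_b": 0.74, "tool_c": 0.51}
choice = choose_strategy(scores, temperature=0.1, rng=random.Random(0))
```

A temperature of 0 never explores, and a temperature of 1 always picks something other than the current best.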

II. Tactical Implementation

These are the operational principles for implementing self-recursive improvement in practice.

Design Principles

| Principle | Description |
| --- | --- |
| Metrics first | Define objective, measurable metrics as North Stars, brainstormed collaboratively with the agent (e.g., “Reduce API tokens by 20% without losing accuracy”). |
| Human-in-the-loop gate | All architectural changes require human sign-off before deployment. Self-improvement introduces ambiguity and second-order effects that demand manual review. |
| Observability as fuel | The agent needs raw access to its own logs, trace data, error reports, and past versions. Observability is the fuel for improvement. |
| Holistic scope | Don’t limit self-improvement to prompting or architecture. The agent should be able to touch the data pipeline, execution schedules, tool selection, and any other lever that affects outcomes. |
| The meta-log | Maintain a log of the agent’s review methods, decisions, and changes so it can measure the impact of its own adjustments over time. |

Versioning & Rollback

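If the agent modifies its own prompts, skills, or architecture, there must be an explicit mechanism for reverting changes. Sandbox tests catch immediate breakage, but slow degradation only shows up over time; the meta-log, which records each change alongside its measured impact, is what captures it.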

  • Every self-modification should produce a versioned snapshot of the prior state.
  • Automated regression checks should compare performance before and after each change over a defined window.
  • Rollback should be one-step: if a change degrades any tracked metric beyond a threshold, the agent or operator can instantly revert.
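The three requirements above can be sketched as a snapshot stack plus a threshold check. This is a minimal illustration under stated assumptions: `VersionedConfig`, `degraded`, the 5% relative threshold, and the convention that higher metric values are better are all hypothetical choices, not part of the published skill.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class VersionedConfig:
    """Holds the agent's mutable state plus a stack of prior snapshots."""
    state: dict
    history: list[dict] = field(default_factory=list)

    def apply(self, changes: dict) -> None:
        """Snapshot the current state, then apply the change."""
        self.history.append(copy.deepcopy(self.state))
        self.state.update(changes)

    def rollback(self) -> None:
        """Revert to the most recent snapshot in one step."""
        if not self.history:
            raise RuntimeError("no snapshot to roll back to")
        self.state = self.history.pop()


def degraded(before: dict[str, float], after: dict[str, float],
             threshold: float = 0.05) -> list[str]:
    """Return metrics (higher is better) whose relative drop exceeds `threshold`."""
    return [name for name, old in before.items()
            if old > 0 and (old - after.get(name, 0.0)) / old > threshold]


cfg = VersionedConfig({"system_prompt": "v1", "model": "model-a"})
cfg.apply({"system_prompt": "v2"})
if degraded({"accuracy": 0.90}, {"accuracy": 0.80}):
    cfg.rollback()  # State is back to the "v1" prompt.
```

Deep-copying the state before each change keeps later in-place edits from mutating older snapshots.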

III. Patterns of Recursive Logic

| Pattern | Trigger | Mechanism | Example |
| --- | --- | --- | --- |
| Self-Healing (Pipeline) | Production error detected | Integrates with error monitoring (Sentry) and GitHub to auto-detect bugs, generate fixes, and deploy PRs | Agent acts as first-responder to production errors |
| Self-Healing (Script) | Specific script failure | Immediate trigger when external factors break execution | A website changes its HTML structure; agent rewrites the scraper |
| Introspection (Immediate) | Batch run completes | Runs sample batches and generates instant post-mortem reports | Agent runs, evaluates, and iterates in a tight loop |
| Introspection (Scheduled) | Cron / time-based | Periodic “Frontier Agent” reviews of weekly or monthly logs | Identifies trends and implements long-term strategic improvements |
| A/B Testing | Checkpoint reached | Parallel testing of multiple prompt or tool sequence versions | Evaluated against quantitative metrics at a fixed window |
| Reflective Memory | Failure occurs | Writes structured “note to self” injected into next context (Reflexion pattern) | “Supplier X uses nested tables: extract with Tool B, not Tool A” |
| Architecture Temperature | Periodic forced trigger | Forces the agent to try a radically different tool or logic path | Escapes local maxima and discovers better strategies |
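To make the A/B Testing pattern concrete, here is a minimal sketch of choosing a winner once each variant has completed a fixed window of runs. The function name `pick_winner`, the `control` label, and the `min_runs` and `min_gain` defaults are hypothetical; a production version would likely add a proper significance test rather than a fixed margin.

```python
from statistics import mean


def pick_winner(results: dict[str, list[float]], min_runs: int = 30,
                min_gain: float = 0.02, control: str = "control") -> str:
    """Keep the control unless a challenger with a full window beats it by `min_gain`."""
    baseline = mean(results[control])
    winner, best = control, baseline
    for name, scores in results.items():
        if name == control or len(scores) < min_runs:
            continue  # Skip variants that have not completed the window.
        avg = mean(scores)
        if avg - baseline >= min_gain and avg > best:
            winner, best = name, avg
    return winner


results = {
    "control": [0.70] * 30,
    "variant_a": [0.75] * 30,
    "variant_b": [0.90] * 5,   # Promising, but too few runs to count.
}
winner = pick_winner(results)
```

Requiring a minimum margin over the control biases the agent toward stability when variants are effectively tied.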

IV. What the Agent Improves

Self-improvement applies across two dimensions: what the agent optimizes (its behavior vs. its structure) and the scope of change (internal to the agent vs. external systems it operates on).

| Dimension | Category | What It Covers |
| --- | --- | --- |
| Behavior | Memories | Builds and references a self-curating knowledge base. Distills raw RAG into synthesized kernel memories. Prunes conflicting or redundant entries. |
| Behavior | Prompts | Adjusts instructions across the full hierarchy: global system prompts, skill-specific prompts, tool-description prompts. Self-adjusts tone, verbosity, and formatting. |
| Structure | Agent Design | Decides when a task is too complex and spawns sub-agents. Modifies its own decision tree and selects models per sub-task based on observed performance. |
| Structure | Infrastructure | Adjusts data pipelines, run schedules, and tool selection. Researches and integrates new APIs or modifies internal tools to expand capability. |

Behavior: What the Agent Does

1. Memories — The Self-Curating Knowledge Base

  • Builds and references memories beyond static prompt instructions, retrieved at inference time.
  • Distillation: Moves beyond raw RAG to synthesized lessons (kernel memories), storing distilled insights rather than raw documents.
  • Pruning: Identifies and prunes conflicting or redundant memories to save context space and reduce noise.
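A minimal sketch of the pruning step, under two assumptions that are not in the original: memories that share a topic key are treated as conflicting (the newest one wins), and the remaining entries are ranked by how often they were used. The `Memory` fields and the `prune` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str          # Topic the lesson is about, e.g. "supplier_x".
    lesson: str
    created_at: int   # Run counter; larger means newer.
    hits: int = 0     # How often the memory was retrieved and used.


def prune(memories: list[Memory], budget: int) -> list[Memory]:
    """Keep the newest memory per key, then the most-used ones up to `budget`."""
    newest: dict[str, Memory] = {}
    for m in memories:
        if m.key not in newest or m.created_at > newest[m.key].created_at:
            newest[m.key] = m  # A newer lesson on the same topic supersedes the old one.
    ranked = sorted(newest.values(), key=lambda m: (m.hits, m.created_at), reverse=True)
    return ranked[:budget]


memories = [
    Memory("supplier_x", "use Tool A", created_at=1, hits=5),
    Memory("supplier_x", "use Tool B, not Tool A", created_at=2, hits=1),
    Memory("rate_limit", "back off 2s on HTTP 429", created_at=3, hits=4),
    Memory("misc", "rarely relevant note", created_at=4, hits=0),
]
kept = prune(memories, budget=2)
```

Here the older "use Tool A" lesson is dropped as superseded, and the unused note falls outside the budget.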

2. Prompts — The Multi-Layer Stack

  • Adjusts instructions across the entire agent hierarchy: global system prompts, skill-specific prompts, and tool-description prompts.
  • Persona tuning: Self-adjusts tone, verbosity, and formatting based on user feedback cycles.
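The multi-layer stack can be pictured as three separately editable layers that are assembled at call time, so a self-modification touches only one layer. The sketch below is an assumption about how such a stack might be represented; `PromptStack` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PromptStack:
    global_prompt: str
    skill_prompts: dict[str, str] = field(default_factory=dict)
    tool_descriptions: dict[str, str] = field(default_factory=dict)

    def render(self, skill: str, tools: list[str]) -> str:
        """Assemble the layers from most general to most specific."""
        parts = [self.global_prompt]
        if skill in self.skill_prompts:
            parts.append(self.skill_prompts[skill])
        parts += [f"Tool {t}: {self.tool_descriptions[t]}"
                  for t in tools if t in self.tool_descriptions]
        return "\n\n".join(parts)


stack = PromptStack(
    global_prompt="You are a careful extraction agent.",
    skill_prompts={"invoices": "Return totals as decimals."},
    tool_descriptions={"tool_b": "Parses nested HTML tables."},
)
prompt = stack.render("invoices", ["tool_b"])

# A persona-tuning change edits only the skill layer.
stack.skill_prompts["invoices"] = "Return totals as decimals. Be terse."
```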

Structure: How the Agent Is Built

3. Agent Design — The Architect Pattern

  • Structural refactoring: The agent decides when a task is too complex and spawns sub-agent patterns to handle it.
  • Logic routing: Modifies its own decision tree (e.g., “Always use Tool A before attempting Tool B”).
  • Model selection: Selects different models for different sub-tasks based on observed performance.
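Model selection based on observed performance can be sketched as a small router that records outcomes per sub-task and model, then picks the model with the best success rate once it has enough samples. `ModelRouter`, `min_samples`, and the model names are hypothetical and used only for illustration.

```python
from collections import defaultdict


class ModelRouter:
    """Tracks observed success per (sub-task, model) and routes to the best one."""

    def __init__(self, default_model: str, min_samples: int = 5):
        self.default_model = default_model
        self.min_samples = min_samples
        # (task, model) -> [successes, runs]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, task: str, model: str, success: bool) -> None:
        stat = self._stats[(task, model)]
        stat[0] += int(success)
        stat[1] += 1

    def select(self, task: str) -> str:
        """Return the model with the highest success rate, or the default if data is thin."""
        rates = {model: s / n for (t, model), (s, n) in self._stats.items()
                 if t == task and n >= self.min_samples}
        if not rates:
            return self.default_model
        return max(rates, key=rates.get)


router = ModelRouter(default_model="model-large")
for _ in range(5):
    router.record("classify", "model-small", success=True)
for i in range(5):
    router.record("classify", "model-large", success=i < 3)
```

With these records, `router.select("classify")` prefers the smaller model, while an unseen sub-task falls back to the default.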

4. Infrastructure — The DevOps Agent

  • Data pipelines: Adjusts which data is ingested, how it is preprocessed and cleaned, and how it is surfaced.
  • Run schedules: For agents with scheduled runs or triggers, modifies the frequency, pattern, or timing based on data availability or cost-efficiency.
  • Tool evolution: Researches and integrates new third-party APIs, or modifies internally built tools to expand capability.

V. Risks & Guardrails

Recursive self-improvement is powerful precisely because the agent has latitude to change itself. That latitude introduces categories of risk that static agents never face.

| Risk | Description | Guardrail |
| --- | --- | --- |
| Drift | Incremental changes compound into behavior that no longer aligns with original intent. | Anchor all evaluations to the original North Star metrics. Flag cumulative deviation across multiple change cycles. |
| Reward hacking | The agent optimizes a metric in ways that satisfy the measurement but violate the spirit of the goal. | Pair quantitative metrics with qualitative spot-checks. Include adversarial test cases in the evaluation rubric. |
| Compounding errors | A bad change that slightly degrades performance goes undetected and becomes the baseline for future changes. | Automated regression checks after every change. Compare against a frozen baseline, not just the previous version. |
| Context pollution | Reflective memories or meta-logs grow too large and start crowding out useful context. | Enforce memory budgets. Periodically summarize and prune the meta-log and reflective memories. |
| Over-exploration | Architecture temperature set too high causes the agent to spend cycles on radical experiments instead of executing. | Cap exploration frequency. Require that exploratory changes still pass minimum performance thresholds. |
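The guardrail for compounding errors is easiest to see with numbers. In the sketch below, each change costs about one point of accuracy, which stays inside a two-point tolerance when compared with the previous version but not when compared with the frozen baseline. The function name, the tolerance, and the score series are hypothetical illustrations.

```python
def regression_flags(frozen: float, previous: float, current: float,
                     tolerance: float = 0.02) -> dict[str, bool]:
    """Flag drops (higher is better) versus the previous version and the frozen baseline."""
    return {
        "vs_previous": previous - current > tolerance,
        "vs_frozen": frozen - current > tolerance,
    }


# Accuracy after each of four successive self-modifications.
history = [0.90, 0.89, 0.88, 0.87, 0.86]
frozen_baseline = history[0]

flags = regression_flags(frozen_baseline, previous=history[-2], current=history[-1])
# Each step looks harmless against its predecessor, but the
# cumulative drop against the frozen baseline is flagged.
```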

The non-negotiable guardrail across all of these: human-in-the-loop approval for any change that touches architecture, data pipelines, or external-facing behavior. The agent proposes; the operator disposes.
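A minimal sketch of that gate, assuming each proposed change is tagged with the scopes it touches. The `Scope` values, the `GATED` set, and `may_deploy` are hypothetical names; treating prompt-only and memory-only changes as ungated is an assumption of this sketch, and an operator could equally gate every scope.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    PROMPT = "prompt"
    MEMORY = "memory"
    ARCHITECTURE = "architecture"
    DATA_PIPELINE = "data_pipeline"
    EXTERNAL_BEHAVIOR = "external_behavior"


# Scopes that always require a named human approver.
GATED = {Scope.ARCHITECTURE, Scope.DATA_PIPELINE, Scope.EXTERNAL_BEHAVIOR}


def may_deploy(scopes: set[Scope], approved_by: Optional[str]) -> bool:
    """A change touching any gated scope deploys only with a named approver."""
    if scopes & GATED:
        return bool(approved_by)
    return True


blocked = may_deploy({Scope.PROMPT, Scope.DATA_PIPELINE}, approved_by=None)
allowed = may_deploy({Scope.PROMPT, Scope.DATA_PIPELINE}, approved_by="operator")
```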

VI. The Evolution Maturity Model

A framework for assessing where an agent sits on the self-improvement spectrum. Each level builds on the capabilities of the one before it.

| Level | Name | Capability | Example |
| --- | --- | --- | --- |
| 0 | Static | No self-modification. All changes require human engineering. | Traditional rule-based system or hardcoded LLM pipeline. |
| 1 | Reactive | Self-heals in response to errors. Applies known fixes from a playbook. | Auto-retries with modified parameters when an API call fails. |
| 2 | Reflective | Maintains reflective memory. Writes post-mortems and applies lessons to future runs. | After a parsing failure, logs “use Tool B for nested tables” and applies it next time. |
| 3 | Adaptive | Runs structured experiments (A/B tests, parallel prompts) and selects the best-performing approach. | Tests three prompt variants over 100 runs and adopts the winner. |
| 4 | Architectural | Modifies its own structure: spawns sub-agents, changes model routing, refactors decision trees. | Detects that a task exceeds single-agent complexity and creates a multi-agent pipeline. |
| 5 | Autonomous | Full-loop self-improvement with human oversight limited to policy-level guardrails. | Agent identifies a new data source, builds the ingestion pipeline, integrates it, and validates against metrics, all with a single human approval gate. |

Most production agents today operate at Level 0 or 1. The framework in this article targets Levels 2–4, with Level 5 as the long-term horizon.

VII. The self-improvement.skill

The sections above reference a self-improvement.skill library as the encoding mechanism for recursive improvement. Below is a sketch of what that skill contains and how it operates.

The full implementation is open source: github.com/hughes7370/self-improvement

What It Is

A structured skill file that any agent can load to initiate and maintain a self-improvement loop. It defines the evaluation criteria, the improvement workflow, and the guardrails—independent of the base model.

Core Components

| Component | Purpose |
| --- | --- |
| Evaluation rubric | The quantitative and qualitative metrics the agent optimizes toward, defined collaboratively between the agent and its operator. |
| Reflection template | The structured template for post-mortem analysis after failures or suboptimal runs (the Reflexion pattern). |
| Change proposal format | The checklist and shadow-test protocol that every proposed change must pass before deployment. |
| Meta-log schema | Tracks what was changed, why, the expected impact, and the actual impact. |
| Exploration trigger | The mechanism for periodic forced exploration of alternative approaches (architecture temperature). |
| Rollback protocol | Instructions for snapshotting state before changes and reverting if metrics degrade. |
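The meta-log schema lends itself to a small record type. The sketch below covers the four things the schema is said to track (what changed, why, expected impact, actual impact); the class name, field names, and the `impact_gap` helper are hypothetical and may not match the schema in the open-source repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MetaLogEntry:
    change_id: str
    what_changed: str
    why: str
    expected_impact: dict[str, float]                 # Metric -> expected delta.
    actual_impact: Optional[dict[str, float]] = None  # Filled in after measurement.

    def impact_gap(self) -> dict[str, float]:
        """Actual minus expected delta per metric; empty until measured."""
        if self.actual_impact is None:
            return {}
        return {metric: round(self.actual_impact.get(metric, 0.0) - expected, 6)
                for metric, expected in self.expected_impact.items()}

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


entry = MetaLogEntry(
    change_id="chg-001",
    what_changed="Shortened the global system prompt",
    why="Reduce API tokens without losing accuracy",
    expected_impact={"tokens": -0.20, "accuracy": 0.0},
)
entry.actual_impact = {"tokens": -0.15, "accuracy": -0.01}
gap = entry.impact_gap()
```

Recording the expected impact before deployment is what lets the agent later judge how well it predicts the effect of its own changes.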

How It Operates

  1. Assess: Runs the evaluation rubric against recent performance data.
  2. Identify: Selects the highest-impact improvement vector.
  3. Propose: Drafts a specific change with its expected impact and writes it to the meta-log.
  4. Test: Shadow-tests the change in a sandbox against known-good baselines.
  5. Deploy: Deploys the change only if it passes the shadow test and the human approval gate.
  6. Measure: Tracks post-deployment metrics against the expected impact and rolls back if degradation is detected.
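The six steps can be sketched as one function whose dependencies are passed in as callables, which keeps the loop independent of any particular model or tooling. Everything here is a hypothetical illustration: the names `Proposal` and `improvement_cycle`, the single scalar score, and the outcome labels are assumptions, not the interface of the open-source skill.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    description: str
    expected_gain: float


def improvement_cycle(
    assess: Callable[[], float],
    propose: Callable[[float], Proposal],
    sandbox_score: Callable[[Proposal], float],
    approve: Callable[[Proposal], bool],
    deploy: Callable[[Proposal], None],
    rollback: Callable[[], None],
    meta_log: list[dict],
) -> str:
    """Run one pass of the loop and return the outcome label."""
    baseline = assess()                       # 1. Assess
    proposal = propose(baseline)              # 2-3. Identify and propose
    entry = {"change": proposal.description, "expected": proposal.expected_gain}
    meta_log.append(entry)
    if sandbox_score(proposal) < baseline:    # 4. Test against the baseline
        entry["outcome"] = "rejected_in_sandbox"
    elif not approve(proposal):               # 5. Human approval gate
        entry["outcome"] = "rejected_by_operator"
    else:
        deploy(proposal)
        entry["actual"] = assess() - baseline  # 6. Measure
        if entry["actual"] < 0:
            rollback()
            entry["outcome"] = "rolled_back"
        else:
            entry["outcome"] = "deployed"
    return entry["outcome"]
```

In a real agent, `approve` would block on an operator decision and `assess` would aggregate metrics over a defined window rather than return a single number.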

This document consolidates raw design notes, research-informed methodology, and structural review feedback into a single working reference. The self-improvement.skill framework is open source on GitHub.