Agent harness

An agent harness is the runtime layer around an AI model. It decides what context the model sees, which tools it can call, how permissions work, where memory lives, and how the agent recovers from mistakes. Do not treat the harness as plumbing. For coding agents and workflow agents, the harness often determines reliability more than the raw model choice.

The useful mental model

A good agent system externalizes capabilities that should not live only inside model weights.
| Layer | What it externalizes | Examples |
| --- | --- | --- |
| Memory | Durable state and retrieval | project notes, user preferences, task history, vector search |
| Skills | Reusable procedures | browser control, code review, image generation, deployment helpers |
| Protocols | Communication contracts | MCP, JSON schemas, typed tool responses, error formats |
| Harness | Runtime coordination | context assembly, tool routing, approvals, logs, retries, evals |
This model comes from the paper “Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering.” The important engineering question is:
Which capability should stay implicit in the model, and which capability should become an explicit external component?
If the answer affects reliability, auditability, reuse, or cost, it usually belongs outside the model.

What the harness owns

The harness should make these decisions explicit:
  • which files, pages, tools, and memories enter context
  • when old context is summarized, dropped, or preserved
  • how tool calls are validated before execution
  • which actions require user approval
  • how tool errors are represented and retried
  • how model effort, latency, and cost trade off
  • how runs are logged for later debugging
A thin harness can work, but an implicit harness becomes hard to debug. If behavior changes and nobody knows whether the cause is the model, prompt, context cache, tool layer, or product default, the harness is under-instrumented.
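As a rough illustration of making a few of these decisions explicit, here is a minimal sketch of a harness that validates arguments, gates risky tools behind approval, returns errors in one structured shape, and logs every attempt. The names `ToolSpec`, `Harness`, and the error strings are hypothetical and chosen for this example only; a real harness would likely use a schema library and persistent logs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Hypothetical tool entry: the callable plus the rules the harness enforces."""
    func: Callable[..., Any]
    arg_types: dict              # argument name -> required Python type
    needs_approval: bool = False


@dataclass
class Harness:
    tools: dict                              # tool name -> ToolSpec
    approve: Callable[[str, dict], bool]     # injected so approvals can be tested
    log: list = field(default_factory=list)  # one entry per attempted call

    def call(self, name: str, args: dict) -> dict:
        result = self._run(name, args)
        # Log successes and failures alike so a run can be debugged later.
        self.log.append({"tool": name, "args": args, **result})
        return result

    def _run(self, name: str, args: dict) -> dict:
        spec = self.tools.get(name)
        if spec is None:
            return {"ok": False, "error": "unknown_tool"}
        # Validate before execution: every declared argument must have the right type.
        for arg, expected in spec.arg_types.items():
            if not isinstance(args.get(arg), expected):
                return {"ok": False, "error": f"invalid_arg:{arg}"}
        if spec.needs_approval and not self.approve(name, args):
            return {"ok": False, "error": "approval_denied"}
        try:
            return {"ok": True, "value": spec.func(**args)}
        except Exception as exc:
            # Tool failures come back as data, in the same shape as other errors.
            return {"ok": False, "error": f"tool_failed:{type(exc).__name__}"}
```

Because every outcome has the same `{"ok": ..., "error": ...}` shape, the model and the logs see one error format regardless of which check failed.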

Design rules

Keep tools small and inspectable

Prefer narrow tools with clear inputs and outputs. A tool that does one thing is easier for the model to call correctly and easier for humans to audit. For local workflows, CLI tools are often enough. For reusable cross-client integration, use a protocol boundary such as MCP. For stable product backends, a direct API is still the simplest option.

Preserve reasoning-critical context

Context pruning is not just a cost optimization. It changes behavior. If an agent made tool calls or file edits based on earlier reasoning, the harness must preserve enough rationale for later turns to continue coherently. Otherwise the agent may repeat itself, forget why it chose a path, or choose tools that do not fit the task.
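One way to picture this is a pruning step that drops old turns by default but always keeps the ones later work depends on. The sketch below assumes the harness has already marked such turns with a `rationale` flag; the `Turn` type, the flag, and the `prune` function are hypothetical, and how a turn gets flagged is left open.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str                 # "user", "assistant", or "tool"
    text: str
    rationale: bool = False   # True when a later tool call or edit depends on this turn


def prune(history: list, keep_recent: int) -> list:
    """Keep the last `keep_recent` turns plus every older turn marked as rationale."""
    cutoff = max(len(history) - keep_recent, 0)
    kept = []
    for index, turn in enumerate(history):
        if index >= cutoff or turn.rationale:
            kept.append(turn)   # original order is preserved
    return kept
```

The design choice is that recency alone never decides what is dropped: a plan written twenty turns ago survives as long as it is still marked as the reason for later actions.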

Represent durable state as files

Long-running agent workflows need state outside chat history. Plain files are often the simplest durable memory layer because humans can inspect, edit, diff, and version them. A useful pattern is a workspace that separates intent, resources, work products, and learning records:
workspace/
├── MISSION.md              # why this work or learning track exists
├── RESOURCES.md            # trusted sources and references
├── NOTES.md                # preferences and scratch notes
├── lessons/                # one self-contained learning unit per file
├── reference/              # compressed long-term reference material
└── learning-records/       # what changed in the user's capability
The exact names can change. The important rule is that state should be inspectable, restartable, and scoped. For agent-assisted learning, this lets the agent teach the next step from the learner’s current capability instead of from a generic syllabus. For engineering work, the same pattern applies to onboarding notes, migration logs, incident follow-ups, and task journals.
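A small scaffold script can make the layout restartable. The following is a sketch, assuming the file and directory names shown above; the starter text in each file and the `init_workspace` helper are placeholders invented for this example.

```python
from pathlib import Path

# Names follow the layout above; the starter text is a placeholder.
FILES = {
    "MISSION.md": "# Mission\n",
    "RESOURCES.md": "# Resources\n",
    "NOTES.md": "# Notes\n",
}
DIRS = ("lessons", "reference", "learning-records")


def init_workspace(root: Path) -> list:
    """Create any missing pieces of the layout and return what was created."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, starter in FILES.items():
        path = root / name
        if not path.exists():   # never overwrite state a human or agent wrote
            path.write_text(starter, encoding="utf-8")
            created.append(name)
    for name in DIRS:
        path = root / name
        if not path.exists():
            path.mkdir()
            created.append(name + "/")
    return created
```

Running it twice is safe: the second call creates nothing and leaves existing notes untouched, which is the property a resumed session needs.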

Make retrieval deterministic when correctness matters

Do not ask an agent to “figure it out” through brittle websites, scattered databases, or undocumented one-off scripts when the task needs exact results. Give it a deterministic retrieval layer instead.
Anthropic’s biology-agent case study is a good example. Scientific agents were asked to retrieve viral sequence data from NCBI Virus. The durable lesson was not that one model won a benchmark; it was that reliability improved dramatically when the workflow added gget virus, a deterministic retrieval layer with a narrower interface.
For high-stakes retrieval, prefer:
  • typed query functions over free-form browsing
  • stable IDs, versions, and date filters
  • machine-checkable counts or checksums
  • explicit provenance for every returned record
  • validation steps before downstream analysis
  • documented error cases instead of silent best-effort output
This is the same harness principle as tool design: make the execution surface predictable, then let the model plan around it.
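To make the list above concrete, here is a minimal sketch of a typed query function over a local snapshot. It does not use gget or any NCBI API; the `Query` and `Result` types, the record fields (`accession`, `taxon_id`, `released`), and the `retrieve` function are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    taxon_id: int          # stable ID instead of a free-text name
    released_before: str   # ISO date (YYYY-MM-DD), so string comparison is chronological


@dataclass(frozen=True)
class Result:
    accessions: tuple
    count: int             # machine-checkable count
    checksum: str          # SHA-256 of the matched records in canonical order
    source: str            # provenance: which snapshot the records came from


def retrieve(query: Query, snapshot: list, source: str) -> Result:
    """Filter a local snapshot deterministically; fail loudly when nothing matches."""
    matches = sorted(
        (r for r in snapshot
         if r["taxon_id"] == query.taxon_id and r["released"] < query.released_before),
        key=lambda r: r["accession"],   # fixed order, independent of input order
    )
    if not matches:
        # Documented error case instead of a silent empty result.
        raise LookupError(f"no records for taxon {query.taxon_id} before {query.released_before}")
    payload = json.dumps(matches, sort_keys=True).encode("utf-8")
    return Result(
        accessions=tuple(r["accession"] for r in matches),
        count=len(matches),
        checksum=hashlib.sha256(payload).hexdigest(),
        source=source,
    )
```

Two calls with the same query and the same snapshot return equal `Result` values even if the snapshot rows arrive in a different order, so a downstream step can compare counts and checksums before analysis begins.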

Make effort settings visible

Reasoning effort is a product decision, not only a model parameter. Lower effort can reduce latency and token use, but it can also lower quality on hard coding tasks. Expose the current effort level, make it easy to change, and avoid silently changing defaults for complex workflows.
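A sketch of what "visible" could mean in practice: the harness records not only the effort level but also who chose it, and can print both. The level names, `EffortSetting`, `resolve_effort`, and `status_line` are hypothetical and not tied to any particular model API.

```python
from dataclasses import dataclass
from typing import Optional

EFFORT_LEVELS = ("low", "medium", "high")   # hypothetical level names


@dataclass(frozen=True)
class EffortSetting:
    level: str
    source: str   # "user" or "default", so the UI and logs can show who chose it


def resolve_effort(user_choice: Optional[str], product_default: str) -> EffortSetting:
    """Prefer the user's explicit choice; fall back to the product default otherwise."""
    if user_choice is not None:
        level, source = user_choice, "user"
    else:
        level, source = product_default, "default"
    if level not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {level!r}")
    return EffortSetting(level, source)


def status_line(setting: EffortSetting) -> str:
    """Text a UI or log line could show for the current run."""
    return f"reasoning effort: {setting.level} (set by {setting.source})"
```

Recording the source alongside the level means a changed product default shows up in logs as a changed value with source "default", rather than as an unexplained quality shift.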

Treat system prompts as code

System prompt edits can change quality as much as code changes. Review them with the same discipline:
  • run per-model evals
  • use ablations to test individual instructions
  • roll out gradually
  • keep an audit trail
  • gate model-specific instructions to the intended model
Prompt brevity rules are especially risky for coding agents. If the agent is forced to be terse between tool calls, it may lose useful planning and verification behavior.
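Two of the practices above, gating model-specific instructions and running ablations, can be sketched with a prompt built from keyed instructions. The `Instruction` type, the `only_for` field, and the model names used with it are hypothetical; the eval that would score each variant is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instruction:
    key: str
    text: str
    only_for: Optional[str] = None   # hypothetical model name; None means all models


def build_prompt(instructions: list, model: str, drop=frozenset()) -> str:
    """Join the instructions that apply to `model`, skipping any keys in `drop`."""
    lines = [
        item.text
        for item in instructions
        if item.key not in drop and item.only_for in (None, model)
    ]
    return "\n".join(lines)


def ablations(instructions: list, model: str) -> dict:
    """One prompt variant per active instruction, each with that instruction removed."""
    active = [item for item in instructions if item.only_for in (None, model)]
    return {item.key: build_prompt(instructions, model, drop={item.key}) for item in active}
```

Each variant returned by `ablations` can be run through the same per-model eval as the full prompt; an instruction whose removal improves the score is a candidate for deletion or rewording.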

Test the public harness

Internal builds can hide production-only issues. Dogfood the exact public build, public defaults, and public context behavior. For code review or coding-agent evals, include the repositories and files the agent would actually need. A review that lacks cross-repo context can miss the bug even when the model is capable of finding it.

Failure modes to watch

Anthropic’s April 2026 Claude Code postmortem is a useful case study because the reported degradation came from product and harness changes, not the base API model. Common failure modes:
  • Default effort regression: a latency-driven default can reduce perceived intelligence on hard work.
  • Context cache bug: pruning or clearing reasoning history at the wrong time can make the agent forgetful and repetitive.
  • Over-tight prompt constraint: broad brevity instructions can reduce coding quality.
  • Eval blind spot: internal evals may pass while real user workflows fail.
  • Build mismatch: staff may test a different harness than users run.
The practical lesson is simple: when an agent gets worse, debug the whole runtime, not just the model.

A quick diagnostic checklist

When an agent behaves oddly, ask:
  1. Did the model or effort level change?
  2. Did the system prompt, tool description, or skill instructions change?
  3. Did context pruning, summarization, or cache behavior change?
  4. Did tool schemas or return formats change?
  5. Does the agent have the same file and repo context as before?
  6. Are users running the same harness build that was evaluated internally?
  7. Do logs show repeated tool calls, missing rationale, or retries without new information?
This checklist helps separate model regressions from harness regressions.
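Parts of the checklist can be automated if each run records a small manifest and a tool-call log. The sketch below assumes a manifest is a flat dictionary (for example model, effort, prompt hash, build) and a log entry has `tool` and `args` keys; both shapes and both function names are assumptions for this example.

```python
import json
from collections import Counter


def changed_fields(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for every manifest field that differs between two runs."""
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }


def repeated_calls(log: list, min_repeats: int = 2) -> list:
    """List (tool, args_json, count) for identical calls seen at least `min_repeats` times."""
    counts = Counter(
        (entry["tool"], json.dumps(entry["args"], sort_keys=True)) for entry in log
    )
    return [(tool, args, n) for (tool, args), n in counts.items() if n >= min_repeats]
```

`changed_fields` covers the "did X change?" questions in one diff, and `repeated_calls` flags the identical-call loops mentioned in the last item.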

When to invest in a stronger harness

Start simple. Add structure when the work becomes repeated, risky, or expensive. A stronger harness is worth it when:
  • agents edit code or production data
  • multiple tools need shared state
  • users resume long-running sessions
  • approvals and audit logs matter
  • the same workflow runs across teams
  • evaluation needs to compare behavior across model versions
If the task is one-off and low-risk, a prompt plus a few tools may be enough. If the task is long-running and stateful, invest in harness design early.

Last modified on June 9, 2026