Harness engineering: what Anthropic and OpenAI are actually doing with their agents
An agent that goes off the rails twenty minutes into a complex task isn’t a model problem. It’s an infrastructure problem. The model is doing exactly what it was asked to do — it just wasn’t given the right environment to succeed.
That environment is the harness: everything surrounding the LLM call that makes it effective on long, complex tasks. Feedback loops, context management, handoff artifacts between sessions, specialized agents, an environment the agent can actually read and reason about.
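The pieces listed above can be sketched as one loop. This is a minimal illustration, not either team's implementation: the model call is a stub and every function name here is hypothetical.

```python
# Minimal harness-loop sketch: generate, run an external check, persist a
# handoff artifact. All names are hypothetical; the model call is a stub.

def call_model(prompt: str) -> str:
    """Stand-in for an LLM call; a real harness would hit an API here."""
    return f"draft for: {prompt}"

def check_output(output: str) -> str:
    """Stand-in for an external feedback loop (tests, linters, an evaluator)."""
    return "ok" if output else "empty output"

def run_feedback_loop(task: str, max_rounds: int = 3) -> dict:
    notes = []  # handoff artifact: survives across sessions, unlike raw context
    output = ""
    for round_no in range(max_rounds):
        output = call_model(f"{task}\nprior notes: {notes}")
        feedback = check_output(output)
        notes.append(f"round {round_no}: {feedback}")
        if feedback == "ok":
            break
    return {"output": output, "handoff": notes}
```

The important design choice is that `notes` lives outside the model's context window: it is what the next session reads instead of replaying the whole transcript.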
Anthropic and OpenAI published their write-ups within weeks of each other. Two teams, two different problems, conclusions that largely overlap.
What the two articles cover
Anthropic’s piece starts from two recurring failure modes on long tasks: context anxiety (the model starts rushing and cutting corners as its context window fills up) and self-evaluation bias (an agent asked to review its own work will almost always respond positively, even when the output is mediocre). Their answer: a three-agent architecture — planner, generator, evaluator — where the evaluator is deliberately separate from the generator and tuned to be skeptical, with a Playwright MCP server that lets it test the running application the way a real user would. The benchmark: a 2D retro game maker, solo agent vs. full harness. Solo: 20 min, $9, a game UI with a broken play mode and no visible error. With the harness: 6h, $200, a working application with a sprite editor, a level editor, and a playable game with a built-in Claude integration.
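The planner/generator/evaluator split can be sketched in a few lines. This is a toy under loud assumptions: the three model calls are canned stubs (the evaluator "fails" the first attempt to simulate a skeptical review), and none of the names come from Anthropic's code. What the sketch preserves is the structural point: the evaluator sees only the artifact plus a skeptical rubric, never the generator's context.

```python
# Toy planner/generator/evaluator loop. All roles are stubbed; the evaluator
# is skeptical by construction and approves only on the retry.

SKEPTICAL_RUBRIC = (
    "Assume the work is broken until proven otherwise. "
    "Exercise the app like a real user and report concrete failures."
)

def llm(role: str, prompt: str) -> str:
    """Stand-in for a model call with a role-specific system prompt."""
    if role == "planner":
        return "1. build UI\n2. wire play mode"
    if role == "generator":
        return f"code for: {prompt}"
    # evaluator stub: fail the first attempt, pass the retry
    return "FAIL: play mode crashes" if "attempt 0" in prompt else "PASS"

def run_harness(task: str, max_attempts: int = 3) -> list[str]:
    log = []
    for step in llm("planner", task).splitlines():
        for attempt in range(max_attempts):
            artifact = llm("generator", step)
            # The evaluator gets the artifact and the rubric only -- not the
            # generator's rationale, which is what invites self-evaluation bias.
            verdict = llm("evaluator", f"{SKEPTICAL_RUBRIC}\n{artifact}\nattempt {attempt}")
            log.append(f"{step} / attempt {attempt}: {verdict}")
            if verdict == "PASS":
                break
    return log
```

In a real harness the evaluator's stub would be replaced by a separate model session driving the app through something like the Playwright MCP server the article describes.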
OpenAI’s piece documents a more radical experiment: shipping a million lines of code with three engineers, zero lines written by hand, in a few months — around 1,500 PRs merged, roughly 3.5 PRs per engineer per day. Their core insight is different but complementary: an agent can only work with what it can read. Anything living in a Slack thread, a Google Doc, or someone’s head is invisible to it. Their answer: a repo structured as a reference system, a short AGENTS.md (~100 lines) acting as a table of contents pointing to detailed sources elsewhere, and architectural constraints enforced mechanically through custom linters rather than through documentation alone.
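"Enforced mechanically through custom linters" can be made concrete with a small example. The rule below (UI code may not import the database layer) and every name in it are invented for illustration; the point is that an architectural constraint becomes a check an agent trips over immediately, instead of a sentence in a doc it may never read.

```python
# Hypothetical architectural lint: flag imports that cross a forbidden
# layer boundary. The layers and the rule are made up for illustration.
import ast

FORBIDDEN = {"ui": {"db"}}  # ui code may not import the db layer

def check_imports(source: str, layer: str) -> list[str]:
    """Return one violation message per import crossing a forbidden boundary."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            if name.split(".")[0] in FORBIDDEN.get(layer, set()):
                violations.append(f"line {node.lineno}: {layer} imports {name}")
    return violations
```

Run in CI, a check like this gives the agent the same kind of fast, unambiguous feedback a failing test does, which is exactly the property documentation lacks.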
Why the convergence matters
Both teams reach the same conclusion from opposite directions: the model alone isn’t enough, but the engineer’s value is no longer about writing code — it’s about designing the environment the agent works in. Anthropic puts it well: every harness component encodes an assumption about what the model can’t do on its own. When the model improves, that assumption can become false — and the component becomes overhead to remove.
Both articles are worth reading in full for the implementation details. The generator-evaluator loop, structuring a repo for agent readability, where the harness-or-not threshold actually sits — those are the questions worth unpacking next.