Token Economy: Three Tools Attacking the Problem Differently

Context windows keep growing. Gemini, Claude — everyone’s at 1M tokens now. And yet agents running in production saturate their context within an hour of intensive work. Not a paradox: a race where data volume systematically outpaces window size.

Three recent tools attack this problem at different levels of the stack. No universal solution, distinct philosophies, different tradeoffs.

The Real Problem Isn’t Token Cost

Cost is what you tell the finance team. The actual problem is reasoning quality.

A model with 180k of 200k tokens used doesn’t reason as well as one with 40k. Attention dilutes. System instructions from the start of the session carry less weight. The agent starts to forget, not raw facts, but constraints, priorities, the thread of what it was supposed to be doing.

You can observe this in practice: in a long debugging session, an agent that has accumulated 50 rounds of git status, cat file.ts, and repeated stack traces starts proposing solutions it already tried. Not because it’s broken. Because the signal is buried in noise.

Token reduction isn’t a cost optimization with a positive side effect on quality. It’s a reliability question first.

Three Levels of Attack

These approaches aren’t interchangeable because they don’t act on the same layer.

graph TD
    subgraph "Layer 3 — Context History"
        DCP["DCP\nDynamic Context Pruning\nPrunes accumulated history"]
    end
    subgraph "Layer 2 — Tool Outputs"
        RTK["RTK\nRust Token Killer\nFilters outputs before injection"]
    end
    subgraph "Layer 1 — Model Responses"
        CAV["Caveman\nConstrains LLM verbosity"]
    end

    CAV --> RTK --> DCP

RTK intercepts tool outputs before they enter the context. DCP works on what’s already in the context window. Caveman acts on what the model produces in response. Three distinct intervention points that don’t substitute for each other.

RTK: Compress What Tools Report

RTK (Rust Token Killer) is a CLI proxy written in Rust. It sits between the agent and system commands: git status becomes rtk git status, and a shell hook (rtk-rewrite.sh) makes that rewrite transparent. The agent calls git status, RTK intercepts, filters, and returns a leaner output.

40+ commands covered. A concrete case: cargo test typically generates about 25k tokens of output (passing tests, warnings, timings, coverage stats), which RTK reduces to about 2.5k, a 90% cut. What remains: failing tests, errors, nothing else. Startup takes under 10ms.

The filtering is semantic, not mechanical. RTK doesn’t blindly truncate to N lines. It understands that in test output, what matters is what failed. In git log, the last 50 commits matter less than the HEAD diff. In application logs, duplicate lines add nothing after the first occurrence.

Exit codes are preserved. The agent keeps the success/failure signal, often the only information it needs to decide what to do next. The rest was padding.
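
To make the idea concrete, here is a minimal Python sketch of that kind of filter. It is not RTK's implementation (RTK is written in Rust and its per-command rules are not shown in this article): the names filter_test_output and run_filtered and the keyword list are hypothetical, chosen for illustration. The sketch keeps only lines that look like failures or errors, drops repeated lines after the first occurrence, and hands back the child's exit code untouched.

```python
import subprocess

# Hypothetical markers for "lines that matter". A real filter would use
# rules tailored to each command rather than one shared keyword list.
SIGNAL_MARKERS = ("FAILED", "failed", "error", "panicked")


def filter_test_output(raw: str) -> str:
    """Keep failure/error lines, dropping exact duplicates after the first."""
    seen = set()
    kept = []
    for line in raw.splitlines():
        if not any(marker in line for marker in SIGNAL_MARKERS):
            continue  # passing tests, timings, and similar noise
        if line in seen:
            continue  # a repeated line adds nothing after its first occurrence
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def run_filtered(cmd: list[str]) -> tuple[int, str]:
    """Run a command, filter its combined output, preserve its exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, filter_test_output(proc.stdout + proc.stderr)
```

Calling run_filtered(["cargo", "test"]) would return the exit code alongside the reduced text, so the caller can branch on success or failure without reading the output at all. Keyword matching is the weak point of this sketch: it is closer to the "mechanical" filtering the article contrasts with RTK's semantic approach, and it only serves to show where the filter sits.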

Compatible with Claude Code, Cursor, Windsurf — any agent that invokes shell commands.

→ github.com/rtk-ai/rtk

DCP: Manage What Accumulates

DCP (Dynamic Context Pruning) isn’t a proxy. It’s a plugin that acts directly inside the context window, on history that’s already there.

Two mechanisms run side by side. The first is automatic: deduplication of identical tool calls (if the agent called ls three times with the same result, DCP keeps one), purging the inputs of failed tool calls after N turns, and removing redundant writes. It runs continuously without intervention.
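
The deduplication step can be sketched in a few lines of Python. This is an illustration, not DCP's code: the ToolCall record and the dedupe_tool_calls function are hypothetical, and keeping the most recent copy (rather than the first) is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical history record: one tool invocation and its output."""
    tool: str
    args: str
    output: str


def dedupe_tool_calls(history: list[ToolCall]) -> list[ToolCall]:
    """Drop earlier copies of calls with identical tool, args, and output.

    The most recent copy is kept (an assumption of this sketch), so the
    surviving entry sits closest to the agent's current position.
    """
    seen = set()
    kept_reversed = []
    for call in reversed(history):
        if call in seen:  # frozen dataclass: equal fields hash identically
            continue
        seen.add(call)
        kept_reversed.append(call)
    return list(reversed(kept_reversed))
```

Note that a call with the same arguments but a different output is not a duplicate here: if ls returns a new file the second time, both entries survive, because the change itself is information.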

The second is more interesting: the agent itself gets a compress tool it can invoke to summarize finished blocks of context. The agent decides when a portion of history can be crystallized into a dense summary, freeing space for what’s ahead. Compressions can nest for very long sessions.
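
A rough sketch of what such a compress operation does to the history, under stated assumptions: the Message type and the compress signature below are hypothetical and do not reflect DCP's actual tool schema. The summary text comes from the caller, which in this setup is the agent itself.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical history entry; is_summary marks a compressed block."""
    text: str
    is_summary: bool = False


def compress(history: list[Message], start: int, end: int, summary: str) -> list[Message]:
    """Replace history[start:end] with a single summary message.

    The range may include earlier summaries, which is how compressions
    nest over a very long session. The input list is left unmodified.
    """
    if not 0 <= start < end <= len(history):
        raise ValueError("invalid range")
    return history[:start] + [Message(summary, is_summary=True)] + history[end:]
```

Nesting falls out of the design: a later call whose range covers an existing summary folds it into a denser one, so the history can keep shrinking without a separate mechanism.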

RTK prevents garbage from entering. DCP handles what’s already there and can’t always be avoided. Not the same problem.

Current limitation: DCP is OpenCode-specific. The protection system (turn protection, file patterns, protected tools) is configured via dcp.jsonc and integrates into OpenCode’s infrastructure. Not yet portable to other agents.

→ github.com/Opencode-DCP/opencode-dynamic-context-pruning

Caveman: Compress What the AI Says

55,000 GitHub stars in a few weeks. That number says something about accumulated developer frustration with LLM verbosity.

Caveman is a multi-agent skill/plugin that forces the model to respond in ultra-compressed style. Three levels: lite (short responses), full (prose eliminated, essential structure), ultra (telegraphic style). There’s even a 文言文 mode — Classical Chinese — that pushes compression to the extreme by exploiting that language’s semantic density.

The advertised benchmark: 65% reduction in output tokens on real Claude API calls. A March 2026 study adds a counterintuitive angle: forcing brevity improves accuracy by 26 points on certain benchmarks. Verbosity isn’t a proxy for reasoning quality — sometimes it masks it.

LLM verbosity is a training artifact, not a necessity. Models were trained to produce long, structured responses because human annotators scored exhaustive answers favorably. In production inside an agent, that inheritance becomes a bug: every verbose response is context consumed for what follows. Caveman forces the model to unlearn the habit.

Caveman also includes an MCP middleware (caveman-shrink) that compresses MCP tool descriptions — an often-overlooked angle, since tool descriptions can represent several thousand fixed tokens per call. Compatible with 30+ agents, one-liner install.
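
To see why tool descriptions matter, here is a small Python sketch of the fixed overhead they add. Everything in it is an assumption for illustration: the 4-characters-per-token estimate is a crude heuristic, and keeping only the first sentence is a naive stand-in for compression, not how caveman-shrink works.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def shrink_description(description: str) -> str:
    """Naive stand-in for compression: keep only the first sentence."""
    first, sep, _rest = description.partition(". ")
    return first + "." if sep else description


def fixed_overhead(descriptions: list[str], calls: int) -> int:
    """Tokens spent re-sending tool descriptions over a number of calls."""
    return calls * sum(estimate_tokens(d) for d in descriptions)
```

Because the descriptions are sent with every call, any saving per description is multiplied by the number of calls in the session, which is what makes this angle worth the effort.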

→ github.com/juliusbrussee/caveman

The Uncomfortable Question

Where does this race end?

The honest answer: the data volume generated by an agent in an intensive session outpaces window growth. Going from 200k to 1M tokens doesn’t solve anything if the session generates 800k tokens of logs, stack traces, and verbose responses in two hours.
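
The arithmetic behind that claim, as a short Python sketch. It assumes a constant generation rate and no pruning at all, which is a simplification; hours_to_saturation is a hypothetical helper name.

```python
def hours_to_saturation(window_tokens: int, tokens_per_hour: float) -> float:
    """Hours until a session's accumulated tokens fill the window."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return window_tokens / tokens_per_hour


# Figures from the paragraph above: 800k tokens generated in two hours.
RATE = 800_000 / 2  # 400k tokens per hour

for window in (200_000, 1_000_000):
    print(f"{window:>9,} tokens: {hours_to_saturation(window, RATE):.1f} h")
```

At that rate a 200k window fills in half an hour and a 1M window in two and a half hours. A window five times larger buys five times the runway and nothing more; the generation rate is the variable that has to change.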

Then there’s latency. An 800k-token context, even if the model technically handles it, has a real inference-time cost at every step. This isn’t just about money or reasoning quality. It’s about whether the thing is actually usable.

There’s a third angle that rarely comes up in technical discussions: CO2. Every inference consumes energy. An agent processing 800k tokens per turn, dozens of times an hour, has a real carbon footprint. Every developer deploying agents in production can reduce that concretely, without waiting for datacenters to run on renewables. Fewer tokens means less compute, less energy.

Will these tools migrate to the model side, or stay in infrastructure? You can imagine models that natively manage context compression, that know what to keep and what to discard in their own history. Some experimental architectures point in that direction. For now, external tooling remains necessary. RTK, DCP, and Caveman are three operational answers to a problem model providers haven’t solved yet.

Contributor

David Micheneau

Agentic AI Engineer
