The review-manager: anatomy of a code review orchestrator


AI-generated code review has a structural problem that rarely gets named. It’s not that models write bad code. It’s that when you ask them to evaluate it afterward, they trend toward approval. Not because they’re naive — because a model that built an internal representation of a piece of code to produce it is now evaluating from that same representation. Same biases. Same blind spots.

The opencode-team-lead article makes the diagnosis: an agent reviewing its own code isn’t doing code review, it’s self-validating. The solution it describes — delegating reviews to the review-manager — is worth unpacking, because “delegate to another agent” doesn’t solve anything if that other agent does the same thing with less structure.

This is a deep dive into how the review-manager actually works. We’re going to read the agent code.

The dual role: orchestrator, never reviewer

The first thing to understand about the review-manager is what it isn’t. It’s not an enhanced reviewer. It doesn’t read code to evaluate it. Its mandate is explicit in the system prompt:

You never review code yourself. You read enough to understand what changed 
and select the right reviewers. Then you delegate. Your job is reviewer 
selection, prompt crafting, verdict synthesis, and disagreement arbitration.

It reads just enough to decide whom to delegate to. The distinction between “reading to understand the change” and “reading to judge its quality” is meaningful: the first informs delegation; the second is already a review, with all the biases that come with it.

The review-manager’s permissions reflect this exactly. It has access to task for delegation, and question to ask the team-lead when a mission is ambiguous. That’s it. No read, no edit, no grep. It can receive file descriptions in its prompt — but it can’t go explore the codebase itself.

Same principle as the team-lead: a permission constraint is more reliable than an instruction in a prompt. A prompt can be partially followed. A denied permission is deterministic.
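
That determinism is easy to picture as code. A minimal sketch, assuming a deny-by-default tool dispatch — the tool names mirror the article, but the gate itself is illustrative, not OpenCode’s actual implementation:

```python
# Hypothetical sketch: a permission constraint enforced in code, not in a prompt.
# ALLOWED_TOOLS mirrors the review-manager's access: task + question, nothing else.
ALLOWED_TOOLS = {"task", "question"}

def invoke_tool(name: str, payload: dict) -> dict:
    """Deny-by-default dispatch: a tool outside the allowlist never runs,
    no matter what the model's output requests."""
    if name not in ALLOWED_TOOLS:
        return {"ok": False, "error": f"permission denied: {name}"}
    return {"ok": True, "tool": name, "payload": payload}
```

A prompt instruction ("never read code") can be partially followed on a bad day; this gate cannot.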

Anatomy of the cycle

The full five-phase cycle — analyze, select, parallel spawn, confront, synthesize:

sequenceDiagram
    participant TL as team-lead
    participant RM as review-manager
    participant RR as requirements-reviewer
    participant CR as code-reviewer
    participant SR as security-reviewer

    TL->>RM: Mission (changed files + original user request)
    Note over RM: 1. Analyze the change
    Note over RM: 2. Select reviewers
    par Parallel spawn
        RM->>RR: Self-contained prompt
        RM->>CR: Self-contained prompt
        RM->>SR: Self-contained prompt
    end
    RR-->>RM: Verdict + issues
    CR-->>RM: Verdict + issues
    SR-->>RM: Verdict + issues
    Note over RM: 3. Confrontation Protocol
    RM-->>TL: APPROVED | CHANGES_REQUESTED | BLOCKED

Phases 1 and 2 are the critical decisions: once reviewers are selected and their prompts sent, the review-manager waits. The real value is in the confrontation phase. We’ll get there.

Selecting reviewers: a decision tree, not a lookup table

The change type → reviewers mapping looks like a dispatch table in the docs. In practice, it’s a decision tree with overlapping priorities.

The two axes: size and risk

The starting point isn’t the type of change — it’s the combination of size × risk.

| Size | Risk | Reviewers | Note |
|------|------|-----------|------|
| Docs only | — | none | Fast-exit: APPROVED immediately |
| Trivial (1-2 files, < 50 lines) | Low | 1 combined | Fast path |
| Trivial | High | requirements + security + code | 3 agents |
| Normal (3-10 files) | Low | requirements + code + 1 domain | 3 agents |
| Normal | High | requirements + security + code + 1 domain | 4 agents |
| Large (10+ files) | Low | requirements + code + 2 domains | 4 agents |
| Large | High | requirements + security + code + 1 domain | 4 agents |

Cap: max 3 technical reviewers, with requirements-reviewer excluded from the cap.

The cap at 3 technical reviewers isn’t arbitrary — it reflects documented diminishing returns: beyond that, additional reviewers repeat findings already surfaced or add noise to the synthesis.

High-risk patterns: an override that precedes the table

Certain change types automatically trigger security-reviewer, regardless of diff size:

  • Auth, sessions, tokens
  • SQL queries or ORM calls
  • Cryptographic operations
  • Permission or access control handling
  • Secrets, credentials, API keys
  • External calls transmitting user data
  • LLM integration (prompt injection vectors)

A two-line change that touches an authentication handler goes into Trivial + High — not the fast path. Size doesn’t reduce risk.
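
The size × risk table and the high-risk override combine into a selection function. A sketch under stated assumptions: the `high_risk` flag stands in for the pattern detection above, the domain-reviewer names are placeholders, and the thresholds are the ones from the table:

```python
# Hypothetical sketch of reviewer selection from the size x risk table.
def select_reviewers(n_files: int, n_lines: int, high_risk: bool,
                     docs_only: bool = False) -> list:
    if docs_only:
        return []                      # fast-exit: APPROVED immediately
    trivial = n_files <= 2 and n_lines < 50
    if trivial and not high_risk:
        return ["code-reviewer"]       # fast path: one combined reviewer
    technical = ["code-reviewer"]
    if high_risk:
        technical.append("security-reviewer")  # override: size doesn't reduce risk
    if not trivial:
        # Domain reviewers fill remaining slots under the 3-technical cap;
        # requirements-reviewer does not count toward that cap.
        wanted_domains = 2 if n_files > 10 else 1
        for i in range(wanted_domains):
            if len(technical) >= 3:
                break
            technical.append(f"domain-reviewer-{i + 1}")  # placeholder names
    return ["requirements-reviewer"] + technical
```

Note how the cap resolves the large + high-risk row: security-reviewer plus code-reviewer leave room for only one domain reviewer, so a large high-risk change gets four agents, not five.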

The fast path: combined reviewer

For trivial low-risk changes, spawning three agents to review 30 lines is overkill. The fast path spawns a single code-reviewer with an expanded mandate. The only modification in its prompt:

## Context
Also verify requirements alignment for this review: does the implementation 
match the original user request stated below?

[rest of the standard prompt]

One line in ## Context, and the code-reviewer absorbs the requirements-reviewer’s mandate. This is the only exception to reviewer isolation — and it only exists for changes where the overhead cost clearly outweighs the value of separation.

flowchart TD
    A[Analyze the change] --> B{Docs-only / Formatting?}
    B -->|Yes| C[Fast-exit\nAPPROVED immediately]
    B -->|No| D{High-risk patterns?}
    D -->|auth / SQL / crypto\n/ secrets / LLM| E[security-reviewer mandatory]
    D -->|No| F{Change size?}
    E --> F
    F -->|Trivial\n1-2 files, less than 50 lines| G{High-risk?}
    G -->|No| H[Fast path\n1 combined reviewer]
    G -->|Yes| I[requirements + security + code\n3 agents]
    F -->|Normal\n3 to 10 files| J[requirements + code + 1 domain]
    F -->|Large\n10+ files| K[requirements + code + 2 domains]
    J --> L{High-risk?}
    K --> L
    L -->|Yes| M[+ security-reviewer\ncap 4 agents]
    L -->|No| N[3 agents]

The requirements-reviewer: outside the cap, always present

requirements-reviewer is mandatory on every non-trivial change. The only exception: pure formatting or typo-only fixes with no associated functional requirement.

And it doesn’t count toward the 3 technical reviewer cap. The logic is straightforward: verifying that the implementation matches what was asked isn’t one technical dimension among others — it’s the precondition for any review. Perfect code implementing the wrong feature is a complete failure. That’s what requirements-reviewer is there for, regardless of how many other reviewers are present.

The self-contained prompt: deliberate isolation

Each reviewer gets a prompt with exactly three sections:

## Context
[What changed, which agent produced it, and why. 
The original user request verbatim so the reviewer can verify intent.]

## Changed Files
[Every modified file with a one-line summary of what changed.
Include full file paths.]

## Out of Scope / Trade-offs
[What was explicitly excluded. Intentional trade-offs made. 
What the reviewer should NOT flag as an issue.]

Three things to note in this format.

The original user request verbatim. Not a summary, not a paraphrase — the exact text. requirements-reviewer can’t do its job without it. Critical enough to be a cardinal rule in its system prompt:

If the original requirements are absent from your mission, return BLOCKED immediately:

Verdict: BLOCKED
Reason: Original requirements not provided. Cannot perform functional 
        compliance review.
Action required: review-manager must include the original user request 
                 verbatim before spawning this reviewer.

This isn’t a code BLOCKED — it’s a process BLOCKED. The review-manager handles it differently from other BLOCKEDs (see the arbitration section).

Deliberate isolation. Reviewers don’t know there are others. They don’t see each other’s verdicts. This isolation isn’t an oversight — it’s a design decision. A reviewer who knows security-reviewer already approved is biased toward approval. A reviewer who knows code-reviewer found major issues will search harder for additional problems to avoid looking less thorough. Independent verdicts require isolated contexts.

The Out of Scope section. The most underrated section. It prevents reviewers from flagging intentional trade-offs as bugs. Without it, a reviewer seeing “no tests for this module” might raise it as an issue — when it was a deliberate decision. This section gives reviewers the context to distinguish “forgotten” from “intentional.”
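
Assembling the three sections is mechanical once the mission carries the right fields. A hypothetical helper — the parameter names and record shapes are assumptions for illustration, not the project’s actual API:

```python
# Hypothetical sketch: building the three-section self-contained reviewer prompt.
def build_reviewer_prompt(request: str, files: dict, out_of_scope: list) -> str:
    """request: the original user request, verbatim.
    files: {path: one-line summary of what changed}.
    out_of_scope: intentional exclusions and trade-offs."""
    changed = "\n".join(f"- {path}: {summary}" for path, summary in files.items())
    excluded = "\n".join(f"- {item}" for item in out_of_scope) or "- (none)"
    return (
        "## Context\n"
        f"Original user request (verbatim):\n{request}\n\n"
        "## Changed Files\n"
        f"{changed}\n\n"
        "## Out of Scope / Trade-offs\n"
        f"{excluded}\n"
    )
```

The verbatim request goes in untouched: any summarization at this step reintroduces exactly the interpretation bias the requirements-reviewer exists to catch.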

Tip

If the review mission doesn’t include the original requirements, the review-manager should request them via question before spawning anything. Spawning requirements-reviewer without requirements generates an unnecessary process BLOCKED.

The three reviewers: distinct mandates, shared stance

All three reviewers share one line in their system prompts:

Your default is skepticism. When you identify an issue, report it — do not 
rationalize it away. If something looks wrong, flag it even if uncertain. 
The review-manager arbitrates severity; your job is to surface, not to filter.

This is a fundamental separation of responsibilities. Reviewers don’t arbitrate — they surface. The review-manager decides whether a finding belongs in the final report and at what severity. This prevents two classic failure modes: the reviewer who downplays findings to avoid looking too strict, and the reviewer who escalates everything to appear rigorous.

Their respective mandates are strictly defined and non-overlapping.

requirements-reviewer

Single question: does the implementation match what was asked?

No quality judgments. No security. Only functional compliance. The workflow is literal: extract every discrete requirement from the user request, break them into atomic items, map each item to the implementation.

| Requirement | Covered? | Evidence                          |
|-------------|----------|-----------------------------------|
| [requirement] | Yes    | src/api/users.ts:42               |
| [requirement] | Partial | happy path ok, edge case X missing |
| [requirement] | No      | not found                         |

Four finding categories: missing feature, misinterpretation, partial implementation, scope creep. The last one is particularly interesting — the reviewer explicitly flags when the implementation does something that was not asked for, even if it’s technically an improvement. An unrequested feature can modify existing behavior — that’s a real risk.

code-reviewer

Single question: is this code technically sound?

Logic, error handling, edge cases, API design, patterns, maintainability, test coverage. The system prompt contains a concrete checklist, not a list of abstract principles:

- [ ] Null/undefined not guarded where inputs are uncontrolled
- [ ] Errors swallowed silently (catch with empty body or generic log)
- [ ] Off-by-one in loops, index access, range checks
- [ ] Missing validations on inputs (type, range, presence)
- [ ] Async errors not awaited or not caught
- [ ] Functions doing too many things (single-responsibility violation)
- [ ] Dead code or unreachable branches
- [ ] Naming that doesn't match behavior (isValid that throws, get that mutates)
- [ ] Inconsistent patterns vs. the rest of the codebase
- [ ] Missing test for new logic (when tests exist in the project)

No security, no functional compliance. When it spots an SQL injection issue, it leaves it to security-reviewer. A missing feature belongs to requirements-reviewer. A code-reviewer bleeding into other reviewers’ mandates generates duplicate findings the review-manager then has to deduplicate.

security-reviewer

Single question: does this change introduce a security risk?

Before looking at anything, it maps the attack surface:

Does it handle user input?
Does it interact with auth, sessions, or tokens?
Does it read/write to a database or filesystem?
Does it call external services?
Does it handle secrets or credentials?
Does it expose new API endpoints or modify existing ones?

If every answer is no, the change is low-risk. If several are yes, full scrutiny. The checklist covers injection (SQL, shell, prompt), auth/authz, data exposure, input validation, secret handling, supply chain, and infra misconfigs.
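
A sketch of that triage, assuming the change is described by boolean answers to the six questions. The question keys are illustrative, and the middle tier for a single yes is my assumption — the article only specifies the all-no and several-yes cases:

```python
# Hypothetical sketch of the attack-surface triage.
SURFACE_QUESTIONS = (
    "handles_user_input", "touches_auth", "reads_writes_storage",
    "calls_external_services", "handles_secrets", "changes_api_endpoints",
)

def triage(change: dict) -> str:
    """Count how many surface questions answer yes for this change."""
    hits = sum(1 for q in SURFACE_QUESTIONS if change.get(q, False))
    if hits == 0:
        return "low-risk"        # every answer is no
    if hits >= 2:
        return "full-scrutiny"   # several yes answers: full checklist
    return "targeted-review"     # assumed tier: one surface, review that area
```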

A specific rule on auth: if the change touches authentication or cryptography, the reviewer must acknowledge it in Positive Notes, even if nothing was found:

"Reviewed auth/token handling — no issues detected."

Silence isn’t validation. The absence of a finding must be explicit — otherwise the review-manager can’t tell whether the area was covered or skipped.

Note

There’s a known gap in the default reviewer set: performance. N+1 queries, algorithmic complexity, memory leaks, blocking I/O — no reviewer has an explicit mandate for this. The documentation acknowledges it. For performance-sensitive changes, you need to add an explicit performance focus instruction to the code-reviewer prompt.

The confrontation protocol: arbitrating disagreements

This is the most interesting phase. Once all reviewers have returned their verdicts, the review-manager synthesizes them into a single verdict. Unanimous agreement is trivial. Disagreements are where real value is produced.

The arbitration heuristics are hierarchical:

flowchart TD
    A[Verdicts received — disagreement] --> B{requirements-reviewer\nflags non-compliance?}
    B -->|Yes| C{Real non-compliance\nor misinterpretation\nof requirements?}
    C -->|Real non-compliance| D[BLOCKED\nimplementation off-target]
    C -->|Misinterpretation| E[Sided with approver\nDocument reasoning]
    B -->|No| F{Critical issue\nfrom any reviewer?}
    F -->|Yes| G[Critical always wins\neven if others approved]
    F -->|No| H{security-reviewer\nflags something?}
    H -->|Yes| I{Obvious\nfalse positive?}
    I -->|Yes| J[Sided with approver\nDocument]
    I -->|No| K[Security concern\nwins the tie]
    H -->|No| L{Disagreement only\non Minor issues?}
    L -->|Yes| M[Sided with approver\nOptional mention]
    L -->|No| N[Judgment on merits\nor escalate to team-lead]
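
The hierarchy in the flowchart can be sketched as a synthesis function. The verdict records are illustrative (one severity per reviewer, a `false_positive` flag for the arbitration call); a real pass would carry full findings:

```python
# Hypothetical sketch of the arbitration hierarchy.
def synthesize(verdicts: list) -> str:
    """verdicts: [{"reviewer": ..., "verdict": ..., "severity": ...}, ...]"""
    by = {v["reviewer"]: v for v in verdicts}
    req = by.get("requirements-reviewer")
    # 1. Real non-compliance from requirements-reviewer blocks everything
    #    (a process BLOCKED for missing requirements is handled separately).
    if req and req["verdict"] == "BLOCKED":
        return "BLOCKED"
    # 2. A Critical finding always wins, even if every other reviewer approved.
    if any(v.get("severity") == "critical" for v in verdicts):
        return "BLOCKED"
    # 3. A security concern wins the tie unless it is an obvious false positive.
    sec = by.get("security-reviewer")
    if sec and sec["verdict"] != "APPROVED" and not sec.get("false_positive"):
        return "CHANGES_REQUESTED"
    # 4. Disagreement only on Minor issues: side with the approver.
    dissent = [v for v in verdicts if v["verdict"] != "APPROVED"]
    if dissent and all(v.get("severity") == "minor" for v in dissent):
        return "APPROVED"
    # 5. Otherwise the default bias applies: CHANGES_REQUESTED over APPROVED.
    return "CHANGES_REQUESTED" if dissent else "APPROVED"
```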

A few edge cases worth detailing.

The requirements-reviewer process failure

If requirements-reviewer returns BLOCKED with Reason: Original requirements not provided, that’s a process BLOCKED — not a code BLOCKED. The review-manager does not propagate this BLOCKED to the team-lead. It requests the missing requirements via question and re-spawns only requirements-reviewer. The team doesn’t see a false BLOCKED on their dashboard — they see a review that needed clarification.

That distinction matters. A code BLOCKED says “the implementation is fundamentally wrong.” A process BLOCKED says “we didn’t have the information to review.” Treating them identically would stop sessions for administrative reasons.

Security wins ties

When security-reviewer and code-reviewer disagree on the same point — one approves, the other flags a security concern — the security concern wins by default, unless it’s clearly a false positive.

“Clearly” is the operative word. When in doubt, the security concern wins. The asymmetric cost between ignoring a real vulnerability and addressing a false positive justifies the bias.

Duplicate findings

If code-reviewer and security-reviewer both flag the same input validation issue — which happens on changes touching high-overlap areas — the final report contains a single bullet point. security-reviewer’s framing and severity take precedence, with both sources attributed. One item, not two duplicates cluttering prioritization.

When the wrong thing was built

The brutal case: requirements-reviewer returns BLOCKED because the implementation solves a different problem than what was asked. Not a nuance — the wrong feature.

Unless the review-manager identifies it as a misinterpretation of the requirements (documented reasoning, sided with approver), this BLOCKED passes to the team-lead unchanged. Immediate escalation to the human.

Verdict thresholds and the default bias

Three possible verdicts, with precise criteria.

BLOCKED — a critical issue with no safe path forward without user input, an implementation fundamentally off-target, or a critical security vulnerability. The change should not move forward.

CHANGES_REQUESTED — major or minor issues that can be fixed without architectural rework. Requirements are met but the implementation has correctness or quality gaps.

APPROVED — all reviewers returned no critical or major issues, requirements are met, no open questions requiring user input.

And the structural bias built into the instructions:

When in doubt between APPROVED and CHANGES_REQUESTED: default to 
CHANGES_REQUESTED. The cost of a false approval is higher than the cost 
of an extra fix cycle.

This isn’t paranoia — it’s cost asymmetry. An extra fix cycle costs a few minutes of agent execution. A false approval that ships to production can cost a lot more. The review-manager is calibrated to favor false negatives over false positives.

The output format

The format is fixed — no variation, because the team-lead parses it to know what to do next:

## Review Summary

**Verdict**: APPROVED | CHANGES_REQUESTED | BLOCKED

### Reviewers
- requirements-reviewer — APPROVED — requirements covered
- code-reviewer — CHANGES_REQUESTED — missing error handling on /api/users endpoint
- security-reviewer — APPROVED — no new attack surface introduced

### Issues

#### Critical
- **[title]** (source: code-reviewer)
  [Description of what's wrong]
  **Suggested fix:** [Concrete fix]

#### Major
[...]

#### Minor
[...]

### Disagreements
[Both positions, the arbitration, and why one reviewer was sided with.]

### Positive Notes
[Consolidated from all reviewers — what was done well.]

Issues are grouped by severity, not by reviewer. The team-lead cares about “what’s critical” — not “who said what.” Source attribution is there for traceability, but severity is what structures the read.
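
Severity-first grouping with source attribution is simple to sketch — the issue record shape here is an assumption:

```python
# Hypothetical sketch: group findings by severity, keep the source for traceability.
SEVERITY_ORDER = ("critical", "major", "minor")

def group_issues(issues: list) -> dict:
    """issues: [{"title": ..., "severity": ..., "source": ...}, ...]"""
    grouped = {s: [] for s in SEVERITY_ORDER}
    for issue in issues:
        grouped[issue["severity"]].append(
            f"**{issue['title']}** (source: {issue['source']})")
    return grouped
```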

Resilience and calibration

When a reviewer fails

A reviewer can return with incomplete output, get compacted mid-review, or produce an unusable format. The protocol:

  1. Retry once — reword the prompt, reduce scope if the reviewer compacted, clarify the expected format
  2. If retry fails — proceed with the available results
  3. Document the gap in the final report:
⚠ security-reviewer failed to complete (compaction). Security review not performed.
Recommend a dedicated security pass before merging.

Partial review beats no review. But not without documenting what’s missing. The team-lead needs to know an angle wasn’t covered — and might decide to manually block until it is.
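
The retry-once protocol, sketched with a placeholder `spawn_reviewer` callable standing in for the actual task delegation:

```python
# Hypothetical sketch of the reviewer-failure protocol: one retry, then proceed
# with a documented gap rather than a silent one.
def run_with_retry(spawn_reviewer, name: str, prompt: str):
    """Returns (result, gap_note). Exactly one of the two is None."""
    result = spawn_reviewer(name, prompt)
    if result.get("ok"):
        return result, None
    # Retry once with a clarified, reduced-scope prompt.
    retry = spawn_reviewer(
        name, prompt + "\n\nKeep the output short and use the standard verdict format.")
    if retry.get("ok"):
        return retry, None
    # Proceed without this reviewer, but surface the gap in the final report.
    gap = (f"⚠ {name} failed to complete. "
           f"Recommend a dedicated {name.replace('-reviewer', '')} pass before merging.")
    return None, gap
```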

Calibrating reviewers to your codebase

The default reviewers have a generic stance — useful to get started, insufficient for a codebase with its own specific patterns and anti-patterns.

If verdicts are consistently too lenient or too strict for your codebase, the right approach is to modify individual reviewer system prompts with:

  • Named anti-patterns — not “check for performance issues” but “look for N+1 queries on Prisma relations with nested include in list handlers”
  • Few-shot examples — examples of good vs. bad verdicts on real cases from your codebase
  • Weighted criteria — if test coverage is critical for your team, state it explicitly with expected patterns

An underrated point: recalibrate after a model upgrade. A prompt tuned for one version may be too aggressive or too lenient on the next. Base model behavior shifts between versions, and reviewers inherit those changes. What was a correctly calibrated stance can become too strict or too permissive without anything changing in your code.


The review-manager runs in subagent mode in OpenCode — invisible in the main agent list, existing only to be invoked. That’s precisely what makes it worth dissecting: all its value is in its internal mechanics. An orchestrator that makes no production decisions, reads no line to evaluate it, and exists solely to produce more reliable review verdicts than a generalist agent would alone.

The repo is on GitHub (azrod/opencode-team-lead), full documentation at azrod.github.io/opencode-team-lead.
