| name | code-review |
|---|---|
| description | Review the current diff for correctness bugs and reuse/simplification/efficiency cleanups at the given effort level. |
high effort → 3+4 angles × 6 candidates → 1-vote verify (recall-biased) → ≤10 findings
You are reviewing for recall at high effort: catch every real bug a careful reviewer would catch in one sitting. At this level, catching real bugs matters more than avoiding false positives. Err on the side of surfacing.
Run git diff @{upstream}...HEAD (or git diff main...HEAD / git diff HEAD~1
if there's no upstream) to get the unified diff under review. If there are
uncommitted changes, or the range diff is empty, also run git diff HEAD and
include the working-tree changes in scope — the review often runs before the
commit. If a PR number, branch name, or file path was passed as an argument,
review that target instead. Treat this diff as the review scope.
Phase 1 — Find candidates (3 correctness angles + 3 cleanup angles + 1 altitude angle, up to 6 each)
Run 7 independent finder angles via the Agent tool. Each
surfaces up to 6 candidate findings with file, line, a one-line
summary, and a concrete failure_scenario.
Read every hunk in the diff, line by line. Then Read the enclosing function for
each hunk — bugs in unchanged lines of a touched function are in scope (the PR
re-exposes or fails to fix them). For every line ask: what input, state, timing,
or platform makes this line wrong? Look for inverted/wrong conditions,
off-by-one, null/undefined deref, missing await, falsy-zero checks,
wrong-variable copy-paste, error swallowed in catch, unescaped regex metachars.
For every line the diff DELETES or replaces, name the invariant or behavior it enforced, then search the new code for where that invariant is re-established. If you can't find it, that's a candidate: a removed guard, a dropped error path, a narrowed validation, a deleted test that was covering a real case.
For each function the diff changes, find its callers (Grep for the symbol) and check whether the change breaks any call site: a new precondition, a changed return shape, a new exception, a timing/ordering dependency. Also check callees: does a parallel change in the same PR make a call unsafe?
The angles above hunt for bugs; this one and the next two hunt for cleanup in the changed code. Flag new code that re-implements something the codebase already has — Grep shared/utility modules and files adjacent to the change, and name the existing helper to call instead.
Flag unnecessary complexity the diff adds: redundant or derivable state, copy-paste with slight variation, deep nesting, dead code left behind. Name the simpler form that does the same job.
Flag wasted work the diff introduces: redundant computation or repeated I/O, independent operations run sequentially, blocking work added to startup or hot paths. Name the cheaper alternative.
Check that each change is implemented at the right depth, not as a fragile bandaid. Special cases layered on shared infrastructure are a sign the fix isn't deep enough — prefer generalizing the underlying mechanism over adding special cases.
Cleanup and altitude candidates use the same file/line/summary shape; in
failure_scenario, state the concrete cost (what is duplicated, wasted, or
harder to maintain) instead of a crash. Correctness bugs always outrank
cleanup and altitude findings when the output cap forces a cut.
Pass every candidate with a nameable failure scenario through — finders that silently drop half-believed candidates bypass the verify step and are the dominant cause of misses.
Dedup near-duplicates (same defect, same location, same reason → keep one). For each remaining candidate, run one verifier via the Agent tool: give it the diff, the relevant file(s), and the candidate; it returns exactly one of CONFIRMED / PLAUSIBLE / REFUTED.
PLAUSIBLE by default — do not refute a candidate for being "speculative" or "depends on runtime state" when the state is realistic: concurrency races, nil/undefined on a rare-but-reachable path (error handler, cold cache, missing optional field), falsy-zero treated as missing, off-by-one on a boundary the code does not exclude, retry storms / partial failures, regex/allowlist that lost an anchor. These are PLAUSIBLE.
REFUTED only when constructible from the code: factually wrong (quote the actual line); provably impossible (type/constant/invariant — show it); already handled in this diff (cite the guard); or pure style with no observable effect.
Keep CONFIRMED and PLAUSIBLE. Drop REFUTED.
Return findings as a JSON array of at most 10 objects:
[
{
"file": "path/to/file.ext",
"line": 123,
"summary": "one-sentence statement of the bug",
"failure_scenario": "concrete inputs/state → wrong output/crash"
}
]Ranked most-severe first. If more than 10 survive, keep the 10 most
severe. If nothing survives verification, return [].