Skip to content

fix(deepseek-v4): release superseded interior continuation-state snapshots#460

Open
dongjiyingdjy wants to merge 5 commits into
mainfrom
fix/v4-continuation-state-snapshot-leak
Open

fix(deepseek-v4): release superseded interior continuation-state snapshots#460
dongjiyingdjy wants to merge 5 commits into
mainfrom
fix/v4-continuation-state-snapshot-leak

Conversation

@dongjiyingdjy

Copy link
Copy Markdown
Contributor

Problem

V4 State-family sliding-window cache groups (e.g. v4.c128a.compressor_state) attach a trailing-window continuation-state snapshot to each multi-turn turn's terminal radix-tree node. When a turn advances, the previous terminal becomes an interior ancestor, but its now-superseded snapshot was never released:

  • adoptExistingPagedCacheSnapshot re-adopts History groups only.
  • The LRU snapshot prune skips it because the owning request keeps every ancestor Locked (NodeRef::Lock locks the whole path to root), so RefCount > 0.

These pinned State pages accumulate one window per turn and exhaust the small per-group State pool. A subsequent fresh prefill's Acquire then fails and aborts all TP ranks (PagedCacheGroupTable::Acquire: failed to allocate pages for group v4.c128a.compressor_state) at high concurrency.

Measured (V4-Pro TP8+MTP, c=8 agentic-trie): v4.c128a.compressor_state usage p50 ≈ 1490/2385, peaking 2032/2385, crash on a concurrent fresh-prefill spike.

Fix

CommitChunk now releases an interior ancestor's State portion (keeping its History chain intact) once it is provably unreferenced — gated on BOTH:

  1. The owning request's sliding window has advanced past the ancestor (node_depth + max_state_window <= chunk_depth), so ReleaseSkipped has already dropped those pages from this request's own borrowed set; and
  2. The ancestor is the sole device referencer (OnDevice() && Device().RefCount() == 1). Each request holds exactly one DeviceNodeRef that locks its whole path to root, so RefCount == 1 means no other (shared-prefix) request can be borrowing the node's continuation-state window. When shared (RefCount > 1), the snapshot is kept so the sharer's resume stays valid, and released on a later commit once the sharer's ref drops.

Gated on a complete terminal snapshot so a resume anchor always remains (the deepest terminal is never touched). Uses the existing DetachStateSnapshotFromNode primitive (returns OwnedPages to the pool via RAII).

Verification

  • V4-Pro TP8+MTP and EP8+MTP, c=1/2/4/8 (agentic-trie, temp=0): all stages complete, 0 crashes; c=8 = 878/878 requests. v4.c128a.compressor_state pool p50 1490 → 280, peak 2032 → 490 (of 2385).
  • GSM8K (5-shot, 50): V4-Flash 0.96, V4-Pro 0.94 — no regression.
  • codex review: LGTM (memory-safety / correctness / determinism).
  • Zero VRAM cost (num_device_pages unchanged); no per-group sizing change.

🤖 Generated with Claude Code

@dongjiyingdjy dongjiyingdjy requested a review from a team as a code owner June 16, 2026 04:51
dongjiyingdjy and others added 2 commits June 16, 2026 07:36
…shots

V4 State-family sliding groups (e.g. v4.c128a.compressor_state) attach a
trailing-window continuation-state snapshot to each turn's terminal node.
When a turn advances, the old terminal becomes an interior ancestor but its
now-superseded snapshot was never released: adopt re-adopts History groups
only, and the LRU prune skips it because the owning request keeps every
ancestor Locked (RefCount>0). The pinned pages accumulate one window per turn
and exhaust the small State pool, crashing all TP ranks
(PagedCacheGroupTable::Acquire) at high concurrency.

CommitChunk now releases an ancestor's State portion (keeping its History
chain) once it is provably unreferenced: the owning request's sliding window
has advanced past it (node_depth + window <= chunk_depth, so ReleaseSkipped
already dropped those pages from this request's borrowed set) AND it is the
sole device referencer (RefCount == 1, so no other shared-prefix request can
be borrowing its continuation-state window).

V4-Pro TP8/EP8 +MTP c=8: 878/878, 0 crash; v4.c128a.compressor_state pool
p50 1490->280 (of 2385). GSM8K V4-Flash 0.96, V4-Pro 0.94. codex LGTM.

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>
@dongjiyingdjy dongjiyingdjy force-pushed the fix/v4-continuation-state-snapshot-leak branch from bbf7365 to ed895f5 Compare June 16, 2026 07:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed895f55c5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1541 to +1542
if (!cur->OnDevice() || cur->Device().RefCount() != 1) continue;
DetachStateSnapshotFromNode(cur);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve snapshots needed by retracted requests

When a shared-prefix request has been retracted, it keeps only a HostNodeRef and its paged-cache table is not released until recovery, so it no longer contributes to Device().RefCount(). If another request sharing this ancestor commits with this count equal to 1, this branch deletes the ancestor's continuation-state snapshot even though the retracted request may later recover through Match(...StateRecovery) / AdmitChunkFromRetracted; that can leave recovery without the saved state pages (or with stale borrowed ids until release). Device refcount alone therefore does not prove no other live request still depends on the snapshot.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2d853b250

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

continue;
}
if (!cur->OnDevice() || cur->Device().RefCount() != 1) continue;
DetachStateSnapshotFromNode(cur);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recreate missing state before leaving history-only anchors

When this downgrades an interior node to a history-only snapshot, a later request that branches before the next state-complete boundary can no longer commit past that node: CommitChunk sees attach_node->HasPagedCacheSnapshot(), calls adoptExistingPagedCacheSnapshot, and that helper returns false because the required State group was erased, so the loop breaks before reaching the new terminal. commitTerminalContinuationSnapshot only repairs the terminal node, not this interior anchor, so prompts that must recompute from before the pruned boundary keep failing to attach new snapshots and retain request-owned paged-cache pages until release. Please either let adoption fill the missing State segment from the current table or fully detach/otherwise skip these history-only anchors when they are on the commit path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants