fix(deepseek-v4): release superseded interior continuation-state snapshots #460
dongjiyingdjy wants to merge 5 commits into
…shots

V4 State-family sliding groups (e.g. v4.c128a.compressor_state) attach a trailing-window continuation-state snapshot to each turn's terminal node. When a turn advances, the old terminal becomes an interior ancestor but its now-superseded snapshot was never released: adopt re-adopts History groups only, and the LRU prune skips it because the owning request keeps every ancestor Locked (RefCount>0). The pinned pages accumulate one window per turn and exhaust the small State pool, crashing all TP ranks (PagedCacheGroupTable::Acquire) at high concurrency.

CommitChunk now releases an ancestor's State portion (keeping its History chain) once it is provably unreferenced: the owning request's sliding window has advanced past it (node_depth + window <= chunk_depth, so ReleaseSkipped already dropped those pages from this request's borrowed set) AND it is the sole device referencer (RefCount == 1, so no other shared-prefix request can be borrowing its continuation-state window).

V4-Pro TP8/EP8 +MTP c=8: 878/878, 0 crash; v4.c128a.compressor_state pool p50 1490->280 (of 2385). GSM8K V4-Flash 0.96, V4-Pro 0.94. codex LGTM.

Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: lightseek-bot <243258330+lightseek-bot@users.noreply.github.com>
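The two-part release condition described in the commit message can be sketched as a small predicate. This is only an illustration under stated assumptions: `AncestorView` and `CanReleaseStateSnapshot` are hypothetical names that do not appear in the PR, and the struct fields merely stand in for `OnDevice()` and `Device().RefCount()`.

```cpp
#include <cassert>

// Hypothetical, simplified view of an ancestor node on the commit path.
struct AncestorView {
  int node_depth;        // token depth at which the node ends
  bool on_device;        // stands in for OnDevice()
  int device_ref_count;  // stands in for Device().RefCount()
};

// True when the ancestor's State snapshot may be released:
//  1. the owning request's sliding window has advanced past the node, and
//  2. the owning request is the only device referencer.
inline bool CanReleaseStateSnapshot(const AncestorView& node, int window,
                                    int chunk_depth) {
  const bool window_passed = node.node_depth + window <= chunk_depth;
  const bool sole_referencer = node.on_device && node.device_ref_count == 1;
  return window_passed && sole_referencer;
}
```

With a window of 128, a node ending at depth 128 becomes releasable once the chunk depth reaches 256, provided no second request holds a device ref on it.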
bbf7365 to ed895f5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed895f55c5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
if (!cur->OnDevice() || cur->Device().RefCount() != 1) continue;
DetachStateSnapshotFromNode(cur);
Preserve snapshots needed by retracted requests
When a shared-prefix request has been retracted, it keeps only a HostNodeRef and its paged-cache table is not released until recovery, so it no longer contributes to Device().RefCount(). If another request sharing this ancestor commits with this count equal to 1, this branch deletes the ancestor's continuation-state snapshot even though the retracted request may later recover through Match(...StateRecovery) / AdmitChunkFromRetracted; that can leave recovery without the saved state pages (or with stale borrowed ids until release). Device refcount alone therefore does not prove no other live request still depends on the snapshot.
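The reviewer's concern can be made concrete with a toy model of the two kinds of reference. This is a sketch only: `NodeRefs`, `host_refs`, and both function names are hypothetical, the `OnDevice()` half of the quoted check is omitted for brevity, and the stricter variant is one possible reading of the comment rather than the change the PR actually adopted.

```cpp
#include <cassert>

// Hypothetical reference counts for one shared ancestor node.
struct NodeRefs {
  int device_refs;  // live requests holding a device-side ref
  int host_refs;    // retracted requests holding only a host-side ref (assumed counter)
};

// The refcount half of the check quoted in the diff: device refs only.
inline bool SoleDeviceReferencer(const NodeRefs& n) {
  return n.device_refs == 1;
}

// A stricter, hypothetical variant that also requires no retracted dependents.
inline bool NoOtherDependents(const NodeRefs& n) {
  return n.device_refs == 1 && n.host_refs == 0;
}
```

For a node with one device ref and one retracted sharer, the first check allows the release while the second does not, which is the gap the comment describes.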
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2d853b250
  continue;
}
if (!cur->OnDevice() || cur->Device().RefCount() != 1) continue;
DetachStateSnapshotFromNode(cur);
Recreate missing state before leaving history-only anchors
When this downgrades an interior node to a history-only snapshot, a later request that branches before the next state-complete boundary can no longer commit past that node: CommitChunk sees attach_node->HasPagedCacheSnapshot(), calls adoptExistingPagedCacheSnapshot, and that helper returns false because the required State group was erased, so the loop breaks before reaching the new terminal. commitTerminalContinuationSnapshot only repairs the terminal node, not this interior anchor, so prompts that must recompute from before the pruned boundary keep failing to attach new snapshots and retain request-owned paged-cache pages until release. Please either let adoption fill the missing State segment from the current table or fully detach/otherwise skip these history-only anchors when they are on the commit path.
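The failure mode described above can be modeled as a walk that stops at the first history-only anchor. This is a rough sketch under assumptions: `PathNode`, `AdoptExisting`, and `NodesCommitted` are hypothetical stand-ins, and the real `CommitChunk` loop is certainly more involved than this.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-node snapshot flags along the commit path (root to terminal).
struct PathNode {
  bool has_snapshot;  // stands in for HasPagedCacheSnapshot()
  bool has_state;     // whether the State group is still present in the snapshot
};

// Models adoption: it succeeds only if the State group is present.
inline bool AdoptExisting(const PathNode& n) { return n.has_state; }

// Returns how many nodes the commit loop gets past before breaking.
inline std::size_t NodesCommitted(const std::vector<PathNode>& path) {
  std::size_t committed = 0;
  for (const PathNode& n : path) {
    // A history-only anchor (snapshot present, State erased) stops the walk.
    if (n.has_snapshot && !AdoptExisting(n)) break;
    ++committed;
  }
  return committed;
}
```

In this model a three-node path whose middle node is history-only commits just one node, so the new terminal is never reached. Either remedy suggested in the comment (filling the missing State segment, or skipping such anchors) would let the walk continue.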
Problem
V4 State-family sliding-window cache groups (e.g. v4.c128a.compressor_state) attach a trailing-window continuation-state snapshot to each turn's terminal radix-tree node in a multi-turn conversation. When a turn advances, the previous terminal becomes an interior ancestor, but its now-superseded snapshot was never released:
- adoptExistingPagedCacheSnapshot re-adopts History groups only.
- The LRU prune skips it because the owning request keeps every ancestor Locked (NodeRef::Lock locks the whole path to root), so RefCount > 0.
These pinned State pages accumulate one window per turn and exhaust the small per-group State pool. A subsequent fresh prefill's Acquire then fails and aborts all TP ranks (PagedCacheGroupTable::Acquire: failed to allocate pages for group v4.c128a.compressor_state) at high concurrency.
Measured (V4-Pro TP8+MTP, c=8 agentic-trie): v4.c128a.compressor_state usage p50 ≈ 1490/2385, peaking 2032/2385, crash on a concurrent fresh-prefill spike.
Fix
CommitChunk now releases an interior ancestor's State portion (keeping its History chain intact) once it is provably unreferenced, gated on BOTH:
- The owning request's sliding window has advanced past the node (node_depth + max_state_window <= chunk_depth), so ReleaseSkipped has already dropped those pages from this request's own borrowed set; and
- The owning request is the node's sole device referencer (OnDevice() && Device().RefCount() == 1). Each request holds exactly one DeviceNodeRef that locks its whole path to root, so RefCount == 1 means no other (shared-prefix) request can be borrowing the node's continuation-state window. When shared (RefCount > 1), the snapshot is kept so the sharer's resume stays valid, and released on a later commit once the sharer's ref drops.
Gated on a complete terminal snapshot so a resume anchor always remains (the deepest terminal is never touched). Uses the existing DetachStateSnapshotFromNode primitive (returns OwnedPages to the pool via RAII).
Verification
- v4.c128a.compressor_state pool p50 1490 → 280, peak 2032 → 490 (of 2385).
- … num_device_pages unchanged); no per-group sizing change.
🤖 Generated with Claude Code
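The Fix section mentions that detached pages return to the pool via RAII. A minimal sketch of that pattern, and of why pinning one window per turn exhausts a small pool, might look like the following. `PagePool`, `OwnedPagesSketch`, and `FreeAfterTurns` are hypothetical names, the pool tracks only a free-page count, and the page numbers used with it are illustrative rather than taken from the PR.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical fixed-size page pool that only tracks a free-page count.
class PagePool {
 public:
  explicit PagePool(int capacity) : free_(capacity) {}
  bool TryTake(int n) {
    if (n > free_) return false;
    free_ -= n;
    return true;
  }
  void GiveBack(int n) { free_ += n; }
  int free_pages() const { return free_; }

 private:
  int free_;
};

// Move-only handle; its pages go back to the pool when it is destroyed.
class OwnedPagesSketch {
 public:
  OwnedPagesSketch(PagePool& pool, int count) : pool_(&pool), count_(count) {}
  OwnedPagesSketch(OwnedPagesSketch&& other) noexcept
      : pool_(other.pool_), count_(std::exchange(other.count_, 0)) {}
  OwnedPagesSketch(const OwnedPagesSketch&) = delete;
  OwnedPagesSketch& operator=(const OwnedPagesSketch&) = delete;
  OwnedPagesSketch& operator=(OwnedPagesSketch&&) = delete;
  ~OwnedPagesSketch() { pool_->GiveBack(count_); }

 private:
  PagePool* pool_;
  int count_;
};

// Simulates `turns` turns that each pin `window_pages` for a new terminal
// snapshot. Returns the free-page count at the end, or -1 if the pool ran out.
inline int FreeAfterTurns(int capacity, int window_pages, int turns,
                          bool release_superseded) {
  PagePool pool(capacity);
  std::vector<OwnedPagesSketch> snapshots;
  snapshots.reserve(static_cast<std::size_t>(turns));
  for (int t = 0; t < turns; ++t) {
    // Simplification: drop the superseded snapshot before taking the new one.
    if (release_superseded) snapshots.clear();
    if (!pool.TryTake(window_pages)) return -1;  // pool exhausted
    snapshots.emplace_back(pool, window_pages);
  }
  return pool.free_pages();
}
```

With a 10-page pool and 4-page windows, three turns exhaust the pool when nothing is released, while releasing each superseded snapshot keeps usage at a single window.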