fix(scheduler): publish prefix to radix tree during prefill for non-hybrid models#381
fix(scheduler): publish prefix to radix tree during prefill for non-hybrid models#381qywu wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8373813b7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count); | ||
| auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, | ||
| std::move(pages_to_insert)); | ||
| auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert)); |
There was a problem hiding this comment.
Preserve page hashes when publishing unfinished prefixes
When enable_l3_storage is on, this mid-flight device insert creates non-hybrid prefix nodes without any page_hashes; when the request later finishes, FinishEvent::apply computes alloc_count from DevicePagesFromRoot(...) and skips the normal insert that would have set page_hashes_ because these pages are already in the radix tree. After those nodes are written back to host, Scheduler::CalcRollingHash(apply_match=true) seeds suffix hashes from result.host.last_node->PageHashes(), so prompts extending this cached prefix get hashes computed from an empty seed and L3 prefetch/backup lookups miss for prefixes published by this new path.
Useful? React with 👍 / 👎.
|
@codex review it |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8373813b7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count); | ||
| auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, | ||
| std::move(pages_to_insert)); | ||
| auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert)); |
There was a problem hiding this comment.
Preserve L3 page hashes when publishing prefixes
When non-hybrid L3 storage is enabled, this mid-flight device insert creates the radix-tree nodes before FinishEvent has the page_hashes_ computed from the request. A later FinishEvent insert cannot fill them in because KVPrefixCache::Insert skips SetPageHashes for nodes that already have device pages, so host-written nodes keep empty PageHashes; CalcRollingHash(..., apply_match=true) then seeds suffix hashes with "" after a host match and generates wrong L3 keys, causing prefetch/backup misses for shared prefixes. Please defer this insert for L3 or make the finish path backfill page hashes on already-present nodes.
Useful? React with 👍 / 👎.
|
Update — gated the mid-flight publish for sliding-window-attention models ( The first commit regressed gpt-oss-120b GPQA-diamond ~0.71 → 0.547 (B200 + MI355). gpt-oss is non-hybrid + sliding-window attention; publishing a prefix mid-flight (before the request finishes) corrupts SWA prefix reuse. Fix: add |
|
SWA coverage: only gpt-oss (non-hybrid) and deepseek-v4 (hybrid) use sliding-window attention; the gate catches both. gpt-oss falls back to finish-only publish (this fix); deepseek-v4 is unaffected, since hybrid models publish via their own (already window-correct) cache. No SWA model is missed. This PR is the conservative fix (disable mid-flight publish for non-hybrid SWA). Giving SWA models correct and mid-flight prefix caching via a windowed paged-cache-group — the mechanism deepseek-v4 already uses (clamping the radix match to the window) — will be a follow-up PR. |
ce9f2f6 to
bf05db9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf05db9f9d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // now rather than only at FinishEvent. A null kv_prefix_cache (passed by the scheduler | ||
| // for sliding-window-attention models) disables this mid-flight publish, so SWA models | ||
| // fall back to the finish-only publish whose prefix reuse is known-correct. | ||
| KVPrefixCache* kv = (hybrid_cache != nullptr) ? &hybrid_cache->GetKVPrefixCache() : kv_prefix_cache; |
There was a problem hiding this comment.
Honor the SWA guard when a hybrid cache exists
For sliding-window models that also create a HybridPrefixCache (the scheduler does this whenever paged-cache groups or a prefix-cache adjunct are configured, e.g. the DeepSeek V4 SWA/state groups), schedulePrefill/scheduleDecode pass nullptr as the KV cache, but this line still selects hybrid_cache->GetKVPrefixCache() and publishes mid-flight. In the scheduler paths I checked, that leaves the newly documented corrupt SWA prefix-reuse path enabled for hybrid/SWA models; the helper needs an explicit skip signal rather than relying on a null base KV pointer.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This should be fixed now. The current implementation no longer relies on passing a null base KV cache to suppress publication. InsertPrefixCache() now takes an explicit enable_midflight_publish flag and returns before selecting hybrid_cache->GetKVPrefixCache() when the SWA path is not safe.
The scheduler policy is:
- non-SWA: publish mid-flight as before
- plain SWA / gpt-oss: publish only up to the configured sliding-window cap
- hybrid history-only SWA: do not publish mid-flight
- hybrid paged-cache SWA / DeepSeek V4: publish only when the paged-cache adjunct has sliding-window
Stategroups, withHybridPrefixCache::Match()still enforcing snapshot/window correctness
I also added regression coverage for the unsafe hybrid-history SWA case and for capped plain-SWA decode-result publication.
…on models Publish a request's prompt-prefix KV into the radix tree at the prefill->decode transition (not only at FinishEvent) for non-hybrid models, so concurrent same-prefix requests (RL rollouts, shared chat-template prefixes) reuse it. Rename InsertHybridCache->InsertPrefixCache; publish via the base KV prefix cache when there is no hybrid cache (hybrid path unchanged; Mamba checkpoint stays hybrid-only). Thread kv_prefix_cache_ into SchedulePrefillEvent/ScheduleDecodeEvent. Sliding-window-attention models (gpt-oss) are excluded: mid-flight SWA prefix reuse corrupts outputs (regressed gpt-oss GPQA 0.71->0.547). Add has_sliding_window to SchedulerConfig (from hf_config.sliding_window); SWA models pass a null kv_prefix_cache so the publish is skipped and they fall back to finish-only. deepseek-v4 (hybrid SWA) is unaffected. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
bf05db9 to
d5d3fc1
Compare
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
6d6e6a4 to
bee0920
Compare
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0bc9c7b6de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # models use their windowed adjunct snapshots. gpt-oss stores an | ||
| # inclusive HF window and converts it to TokenSpeed's exclusive | ||
| # attention window inside the model. | ||
| sliding_window = getattr(hf_config, "sliding_window", None) |
There was a problem hiding this comment.
Detect nested sliding windows before enabling publish
Fresh evidence is that this new scheduler-side detection only reads hf_config.sliding_window, while the DeepSeek V4 cache spec resolves the same value from either hf_config or hf_config.text_config. For checkpoints that put sliding_window under text_config, has_sliding_window stays false, and Scheduler::enableMidflightPrefixPublish() returns true before checking for a paged-cache adjunct/window state, re-enabling mid-flight publication for hybrid SWA configurations that this change is trying to guard. Please mirror the nested lookup here before deriving has_sliding_window.
Useful? React with 👍 / 👎.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
f036ee5 to
c08c1a8
Compare
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Problem
For non-hybrid models (Qwen3, Llama — everything except DeepSeek-V4/Mamba), a request's prompt-prefix KV is published into the shared device radix tree only at
FinishEvent. The mid-flight publish helperInsertHybridCacheearly-returns whenhybrid_prefix_cache_ == nullptr(every non-hybrid model).So when a burst of concurrent requests shares a prefix — RL rollouts with N samples per prompt, or a shared chat-template/system prefix — they all prefill before any finishes, and each
Matches a radix tree still empty for that prefix → ~0% prefix-cache reuse. SGLang publishes the prefix during prefill (cache_unfinished_req), so siblings reuse it (~26% on the same workload).Observed in slime RL rollouts:
rollout/prefix_cache_hit_rate0.001 vs 0.256 for SGLang; engine prefill logs showed#cached-token: 0in 130/131 prefills.Fix
Rename
InsertHybridCache→InsertPrefixCacheand publish the freshly-computed prefix through the base KV prefix cache when there is no hybrid cache (hybrid path unchanged: still viahybrid_cache->GetKVPrefixCache(); the Mamba checkpoint publication stays hybrid-only). The published node is pinned via the request'sDeviceNodeRefso it is not evicted while the request is still using it. Threadkv_prefix_cache_intoSchedulePrefillEventandScheduleDecodeEvent, which previously only heldhybrid_prefix_cache_(the reason they couldn't publish for non-hybrid models).Validation
Standalone Qwen2.5-0.5B engines, identical test (request A held in a long
ignore_eosdecode; request B prefills the same prompt while A is still decoding), only the scheduler differs:#cached-token 0#cached-token 0— re-prefills all 338 tokens#cached-token 0#cached-token 320— reuses A's prefix, prefills only 48B reusing A's prefix while A is unfinished is impossible under the old finish-only publish, confirming the mid-flight publish. Generation stays correct; compiles clean (scikit-build-core).
Files
tokenspeed-scheduler/csrc/fsm/forward_events.{h,cpp}tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp