fix(scheduler): publish prefix to radix tree during prefill for non-hybrid models#381

Open
qywu wants to merge 8 commits into
mainfrom
qywu/prefix-cache-mid-flight

@qywu qywu commented Jun 8, 2026

Collaborator

Problem

For non-hybrid models (Qwen3, Llama — everything except DeepSeek-V4/Mamba), a request's prompt-prefix KV is published into the shared device radix tree only at FinishEvent. The mid-flight publish helper InsertHybridCache early-returns when hybrid_prefix_cache_ == nullptr (every non-hybrid model).

So when a burst of concurrent requests shares a prefix (RL rollouts with N samples per prompt, or a shared chat-template/system prefix), they all prefill before any of them finishes, and each request's Match runs against a radix tree that is still empty for that prefix, giving ~0% prefix-cache reuse. SGLang publishes the prefix during prefill (cache_unfinished_req), so siblings reuse it (~26% on the same workload).

Observed in slime RL rollouts: rollout/prefix_cache_hit_rate 0.001 vs 0.256 for SGLang; engine prefill logs showed #cached-token: 0 in 130/131 prefills.
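
To make the burst effect concrete, here is a minimal sketch, assuming a toy cache that stores whole token lists and matches by longest common prefix. The names `longest_cached_prefix` and `burst_hit_rate` are hypothetical and are not part of the scheduler; the real radix tree works on pages, not on individual tokens.

```python
def longest_cached_prefix(tokens, published):
    """Length of the longest prefix of `tokens` shared with any published sequence."""
    best = 0
    for cached in published:
        n = 0
        for a, b in zip(tokens, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best


def burst_hit_rate(requests, publish_during_prefill):
    """Hit rate for a burst in which every request prefills before any request finishes."""
    published = []  # stand-in for the shared radix tree
    pending = []    # prefixes that only become visible at finish time
    cached_tokens = total_tokens = 0
    for tokens in requests:
        cached_tokens += longest_cached_prefix(tokens, published)
        total_tokens += len(tokens)
        (published if publish_during_prefill else pending).append(tokens)
    published.extend(pending)  # the finish-only publish lands after the burst has prefilled
    return cached_tokens / total_tokens


# Four samples of one prompt: a 100-token shared prefix plus one distinct token each.
shared = list(range(100))
burst = [shared + [1000 + i] for i in range(4)]
finish_only = burst_hit_rate(burst, publish_during_prefill=False)
mid_flight = burst_hit_rate(burst, publish_during_prefill=True)
print(f"finish-only: {finish_only:.3f}, mid-flight: {mid_flight:.3f}")
```

In this toy model the finish-only policy never hits, because nothing is published until the whole burst has prefilled, while the mid-flight policy lets every sibling after the first reuse the shared prefix.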

Fix

Rename InsertHybridCache → InsertPrefixCache and publish the freshly-computed prefix through the base KV prefix cache when there is no hybrid cache (hybrid path unchanged: still via hybrid_cache->GetKVPrefixCache(); the Mamba checkpoint publication stays hybrid-only). The published node is pinned via the request's DeviceNodeRef so it is not evicted while the request is still using it. Thread kv_prefix_cache_ into SchedulePrefillEvent and ScheduleDecodeEvent, which previously only held hybrid_prefix_cache_ (the reason they couldn't publish for non-hybrid models).
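
The pinning requirement can be illustrated with a small sketch, assuming an LRU-style cache in which each node carries a pin count. `PinnedLRU` is a hypothetical stand-in written for this note; it does not reproduce DeviceNodeRef or the real eviction policy.

```python
from collections import OrderedDict


class PinnedLRU:
    """Toy LRU cache whose entries cannot be evicted while their pin count is non-zero."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = OrderedDict()  # key -> pin count, least recently used first

    def publish(self, key, pin=True):
        if key not in self.nodes:
            self._make_room()
            self.nodes[key] = 0
        self.nodes.move_to_end(key)
        if pin:
            self.nodes[key] += 1  # the publishing request holds a reference

    def unpin(self, key):
        self.nodes[key] -= 1  # the request finished and released its reference

    def _make_room(self):
        while len(self.nodes) >= self.capacity:
            victim = next((k for k, pins in self.nodes.items() if pins == 0), None)
            if victim is None:
                raise MemoryError("every node is pinned")
            del self.nodes[victim]


cache = PinnedLRU(capacity=2)
cache.publish("A")             # published mid-flight and pinned by request A
cache.publish("B", pin=False)
cache.publish("C", pin=False)  # evicts B: A is older but still pinned
survived_while_pinned = "A" in cache.nodes
cache.unpin("A")               # request A finishes
cache.publish("D", pin=False)  # A is now the oldest unpinned node and is evicted
evicted_after_unpin = "A" not in cache.nodes
```

Without the pin, the third publish would have evicted A's pages while A was still decoding on them.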

Validation

Standalone Qwen2.5-0.5B engines, identical test (request A held in a long ignore_eos decode; request B prefills the same prompt while A is still decoding), only the scheduler differs:

| | A (fresh prompt) | B (prefills while A still decoding) |
| --- | --- | --- |
| before | #cached-token 0 | #cached-token 0 (re-prefills all 338 tokens) |
| after | #cached-token 0 | #cached-token 320 (reuses A's prefix, prefills only 48) |

B reusing A's prefix while A is unfinished is impossible under the old finish-only publish, confirming the mid-flight publish. Generation stays correct; compiles clean (scikit-build-core).

Files

  • tokenspeed-scheduler/csrc/fsm/forward_events.{h,cpp}
  • tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp

@qywu qywu requested a review from a team as a code owner June 8, 2026 07:22

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8373813b7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

  OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count);
- auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages,
-                                                                                   std::move(pages_to_insert));
+ auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert));

P2: Preserve page hashes when publishing unfinished prefixes

When enable_l3_storage is on, this mid-flight device insert creates non-hybrid prefix nodes without any page_hashes; when the request later finishes, FinishEvent::apply computes alloc_count from DevicePagesFromRoot(...) and skips the normal insert that would have set page_hashes_ because these pages are already in the radix tree. After those nodes are written back to host, Scheduler::CalcRollingHash(apply_match=true) seeds suffix hashes from result.host.last_node->PageHashes(), so prompts extending this cached prefix get hashes computed from an empty seed and L3 prefetch/backup lookups miss for prefixes published by this new path.

qywu commented Jun 8, 2026

Collaborator Author

@codex review it

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8373813b7f

  OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count);
- auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages,
-                                                                                   std::move(pages_to_insert));
+ auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert));

P2: Preserve L3 page hashes when publishing prefixes

When non-hybrid L3 storage is enabled, this mid-flight device insert creates the radix-tree nodes before FinishEvent has the page_hashes_ computed from the request. A later FinishEvent insert cannot fill them in because KVPrefixCache::Insert skips SetPageHashes for nodes that already have device pages, so host-written nodes keep empty PageHashes; CalcRollingHash(..., apply_match=true) then seeds suffix hashes with "" after a host match and generates wrong L3 keys, causing prefetch/backup misses for shared prefixes. Please defer this insert for L3 or make the finish path backfill page hashes on already-present nodes.
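
The seeding problem described above can be sketched as follows, assuming a rolling hash in which each page hash is derived from the previous page hash plus the page's tokens. The helper `rolling_page_hashes`, the SHA-256 choice, and the payload layout are assumptions made for illustration; the actual CalcRollingHash implementation and L3 key format are not shown in this thread.

```python
import hashlib


def rolling_page_hashes(pages, seed=""):
    """Chain page hashes: each hash covers the previous hash plus the page's tokens."""
    hashes = []
    prev = seed
    for page in pages:
        payload = prev + "|" + ",".join(map(str, page))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


prefix_pages = [[1, 2, 3, 4], [5, 6, 7, 8]]
suffix_pages = [[9, 10, 11, 12]]

# Reference keys: hash the whole prompt from the root.
full = rolling_page_hashes(prefix_pages + suffix_pages)

# After a prefix match, suffix hashing resumes from the matched node's last page hash.
good_suffix = rolling_page_hashes(suffix_pages, seed=full[len(prefix_pages) - 1])

# If the matched node never had its page hashes set, the seed is empty.
bad_suffix = rolling_page_hashes(suffix_pages, seed="")

print(good_suffix == full[2:], bad_suffix == full[2:])
```

With the correct seed the suffix keys agree with the keys computed from the root; with an empty seed they diverge, which is how a node published without page hashes would lead to L3 lookup misses for any prompt extending it.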

@qywu qywu marked this pull request as draft June 8, 2026 15:50

qywu commented Jun 8, 2026

Collaborator Author

Update — gated the mid-flight publish for sliding-window-attention models (2a4fe3a).

The first commit regressed gpt-oss-120b GPQA-diamond ~0.71 → 0.547 (B200 + MI355). gpt-oss is non-hybrid + sliding-window attention; publishing a prefix mid-flight (before the request finishes) corrupts SWA prefix reuse. ut-runtime-prefix-cache-e2e (full-attention) and hybrid/MLA evals were unaffected.

Fix: add has_sliding_window to SchedulerConfig (derived from hf_config.sliding_window, mirroring ModelRunner); SWA models skip the mid-flight publish and fall back to the prior finish-only publish (baseline was 0.707). Full-attention non-hybrid models keep the mid-flight reuse; hybrid (DeepSeek-V4) is unchanged. Local: wheel builds, 180/180 scheduler gtests pass. Needs the eval CI to re-run to confirm gpt-oss is back to ~0.70.

qywu commented Jun 8, 2026

Collaborator Author

SWA coverage: only gpt-oss (non-hybrid) and deepseek-v4 (hybrid) use sliding-window attention; the gate catches both. gpt-oss falls back to finish-only publish (this fix); deepseek-v4 is unaffected, since hybrid models publish via their own (already window-correct) cache. No SWA model is missed.

This PR is the conservative fix (disable mid-flight publish for non-hybrid SWA). Giving SWA models correct mid-flight prefix caching via a windowed paged-cache group, the mechanism deepseek-v4 already uses (clamping the radix match to the window), will be a follow-up PR.

@qywu qywu force-pushed the qywu/prefix-cache-mid-flight branch 2 times, most recently from ce9f2f6 to bf05db9 Compare June 8, 2026 17:04
@qywu qywu marked this pull request as ready for review June 8, 2026 17:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf05db9f9d

// now rather than only at FinishEvent. A null kv_prefix_cache (passed by the scheduler
// for sliding-window-attention models) disables this mid-flight publish, so SWA models
// fall back to the finish-only publish whose prefix reuse is known-correct.
KVPrefixCache* kv = (hybrid_cache != nullptr) ? &hybrid_cache->GetKVPrefixCache() : kv_prefix_cache;

P1: Honor the SWA guard when a hybrid cache exists

For sliding-window models that also create a HybridPrefixCache (the scheduler does this whenever paged-cache groups or a prefix-cache adjunct are configured, e.g. the DeepSeek V4 SWA/state groups), schedulePrefill/scheduleDecode pass nullptr as the KV cache, but this line still selects hybrid_cache->GetKVPrefixCache() and publishes mid-flight. In the scheduler paths I checked, that leaves the newly documented corrupt SWA prefix-reuse path enabled for hybrid/SWA models; the helper needs an explicit skip signal rather than relying on a null base KV pointer.
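
The selection issue can be reduced to a few lines. This is a sketch only: the two functions are hypothetical Python stand-ins for the C++ ternary quoted above and for a variant with an explicit skip signal, using `None` in place of null pointers and strings in place of cache objects.

```python
def select_cache_implicit(hybrid_kv, base_kv):
    """Mirrors the reviewed ternary: the hybrid cache wins whenever it exists."""
    return hybrid_kv if hybrid_kv is not None else base_kv


def select_cache_explicit(hybrid_kv, base_kv, enable_midflight_publish):
    """Variant with an explicit flag checked before any cache is selected."""
    if not enable_midflight_publish:
        return None  # skip the mid-flight publish entirely
    return hybrid_kv if hybrid_kv is not None else base_kv


# Hybrid SWA model: the scheduler passes a null base cache, hoping to disable the publish.
leaked = select_cache_implicit("hybrid-kv", None)
guarded = select_cache_explicit("hybrid-kv", None, enable_midflight_publish=False)
print(leaked, guarded)
```

A null base pointer only suppresses the publish when there is no hybrid cache to fall back on, so the skip decision has to be carried by its own signal.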

Collaborator Author


This should be fixed now. The current implementation no longer relies on passing a null base KV cache to suppress publication. InsertPrefixCache() now takes an explicit enable_midflight_publish flag and returns before selecting hybrid_cache->GetKVPrefixCache() when the SWA path is not safe.

The scheduler policy is:

  • non-SWA: publish mid-flight as before
  • plain SWA / gpt-oss: publish only up to the configured sliding-window cap
  • hybrid history-only SWA: do not publish mid-flight
  • hybrid paged-cache SWA / DeepSeek V4: publish only when the paged-cache adjunct has sliding-window State groups, with HybridPrefixCache::Match() still enforcing snapshot/window correctness

I also added regression coverage for the unsafe hybrid-history SWA case and for capped plain-SWA decode-result publication.
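
The four-case policy listed above can be written as a small decision function. This is a sketch under assumptions: `ModelTraits`, its field names, and `publish_limit` are hypothetical, and reading "publish only up to the configured sliding-window cap" as "publish at most `cap` prefix tokens" is an interpretation, not something the thread confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTraits:
    has_sliding_window: bool
    has_hybrid_cache: bool
    has_windowed_state_groups: bool = False  # paged-cache adjunct with SWA State groups


def enable_midflight_publish(traits):
    """Decide whether a prefix may be published before the request finishes."""
    if not traits.has_sliding_window:
        return True  # non-SWA: publish mid-flight
    if not traits.has_hybrid_cache:
        return True  # plain SWA: publish, subject to the cap in publish_limit
    # Hybrid SWA: safe only when windowed State groups exist (history-only is skipped).
    return traits.has_windowed_state_groups


def publish_limit(traits, prefix_len, sliding_window_cap):
    """Number of prefix tokens eligible for publication (assumed meaning of the cap)."""
    if traits.has_sliding_window and not traits.has_hybrid_cache:
        return min(prefix_len, sliding_window_cap)
    return prefix_len


full_attention = ModelTraits(has_sliding_window=False, has_hybrid_cache=False)
plain_swa = ModelTraits(has_sliding_window=True, has_hybrid_cache=False)
hybrid_history_only = ModelTraits(has_sliding_window=True, has_hybrid_cache=True)
hybrid_paged = ModelTraits(True, True, has_windowed_state_groups=True)

decisions = [enable_midflight_publish(t)
             for t in (full_attention, plain_swa, hybrid_history_only, hybrid_paged)]
print(decisions, publish_limit(plain_swa, prefix_len=500, sliding_window_cap=128))
```

Keeping the decision in one function makes each row of the policy directly testable, which matches the regression cases mentioned for hybrid-history SWA and capped plain-SWA publication.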

…on models

Publish a request's prompt-prefix KV into the radix tree at the prefill->decode transition (not only at FinishEvent) for non-hybrid models, so concurrent same-prefix requests (RL rollouts, shared chat-template prefixes) reuse it. Rename InsertHybridCache->InsertPrefixCache; publish via the base KV prefix cache when there is no hybrid cache (hybrid path unchanged; Mamba checkpoint stays hybrid-only). Thread kv_prefix_cache_ into SchedulePrefillEvent/ScheduleDecodeEvent.

Sliding-window-attention models (gpt-oss) are excluded: mid-flight SWA prefix reuse corrupts outputs (regressed gpt-oss GPQA 0.71->0.547). Add has_sliding_window to SchedulerConfig (from hf_config.sliding_window); SWA models pass a null kv_prefix_cache so the publish is skipped and they fall back to finish-only. deepseek-v4 (hybrid SWA) is unaffected.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the qywu/prefix-cache-mid-flight branch from bf05db9 to d5d3fc1 Compare June 8, 2026 18:10
@qywu qywu requested a review from zhyncs June 8, 2026 21:11
@qywu qywu marked this pull request as draft June 8, 2026 21:19
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the qywu/prefix-cache-mid-flight branch from 6d6e6a4 to bee0920 Compare June 8, 2026 22:28
@qywu qywu marked this pull request as ready for review June 8, 2026 22:59
Signed-off-by: Qingyang Wu <willqywu@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bc9c7b6de

# models use their windowed adjunct snapshots. gpt-oss stores an
# inclusive HF window and converts it to TokenSpeed's exclusive
# attention window inside the model.
sliding_window = getattr(hf_config, "sliding_window", None)

P2: Detect nested sliding windows before enabling publish

Fresh evidence is that this new scheduler-side detection only reads hf_config.sliding_window, while the DeepSeek V4 cache spec resolves the same value from either hf_config or hf_config.text_config. For checkpoints that put sliding_window under text_config, has_sliding_window stays false, and Scheduler::enableMidflightPrefixPublish() returns true before checking for a paged-cache adjunct/window state, re-enabling mid-flight publication for hybrid SWA configurations that this change is trying to guard. Please mirror the nested lookup here before deriving has_sliding_window.
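
A nested lookup of the kind requested here could look like the sketch below. The helper name `resolve_sliding_window` is hypothetical, and `SimpleNamespace` objects stand in for real Hugging Face config objects; how the value is then wired into `has_sliding_window` is assumed rather than taken from the PR.

```python
from types import SimpleNamespace


def resolve_sliding_window(hf_config):
    """Return sliding_window from the top-level config, else from text_config, else None."""
    window = getattr(hf_config, "sliding_window", None)
    if window is None:
        text_config = getattr(hf_config, "text_config", None)
        # getattr on None falls through to the default, so a missing text_config is fine.
        window = getattr(text_config, "sliding_window", None)
    return window


top_level = SimpleNamespace(sliding_window=128)
nested = SimpleNamespace(text_config=SimpleNamespace(sliding_window=4096))
full_attention = SimpleNamespace()

windows = [resolve_sliding_window(c) for c in (top_level, nested, full_attention)]
has_sliding_window = [w is not None for w in windows]
print(windows, has_sliding_window)
```

With only the top-level `getattr`, the nested checkpoint would be classified as full-attention, which is the case the review comment warns about.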

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the qywu/prefix-cache-mid-flight branch from f036ee5 to c08c1a8 Compare June 8, 2026 23:35
qywu added 2 commits June 9, 2026 00:05
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>