fix(scheduler): publish prefix to radix tree during prefill for non-hybrid models by qywu · Pull Request #381 · lightseekorg/tokenspeed

qywu · 2026-06-08T07:22:40Z

Problem

For non-hybrid models (Qwen3, Llama, and everything else except DeepSeek-V4/Mamba), a request's prompt-prefix KV is published into the shared device radix tree only at FinishEvent. The mid-flight publish helper InsertHybridCache returns early when hybrid_prefix_cache_ == nullptr, which is the case for every non-hybrid model.

So when a burst of concurrent requests shares a prefix (RL rollouts with N samples per prompt, or a shared chat-template/system prefix), they all prefill before any of them finishes, and each Match runs against a radix tree that does not yet contain that prefix, giving ~0% prefix-cache reuse. SGLang publishes the prefix during prefill (cache_unfinished_req), so siblings reuse it (~26% on the same workload).
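To make the timing issue concrete, here is a minimal Python sketch of a burst in which every request prefills before any request finishes. `ToyPrefixCache` and `run_burst` are hypothetical stand-ins for illustration, not TokenSpeed or SGLang code, and matching is done per token rather than per page to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    """Stand-in for the shared radix tree: remembers published token sequences."""
    published: list = field(default_factory=list)

    def match(self, tokens):
        # Longest common prefix (in tokens) with any published sequence.
        best = 0
        for seq in self.published:
            n = 0
            for a, b in zip(seq, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def publish(self, tokens):
        self.published.append(list(tokens))


def run_burst(prompts, publish_during_prefill):
    """Prefill all requests before any finishes; return cached-token counts."""
    cache = ToyPrefixCache()
    cached = []
    for prompt in prompts:                 # prefill phase, in arrival order
        cached.append(cache.match(prompt))
        if publish_during_prefill:
            cache.publish(prompt)          # mid-flight publish
    if not publish_during_prefill:
        for prompt in prompts:             # finish phase: too late for siblings
            cache.publish(prompt)
    return cached


shared = list(range(8))                    # shared system/chat-template prefix
prompts = [shared + [100 + i] for i in range(4)]
finish_only = run_burst(prompts, publish_during_prefill=False)
mid_flight = run_burst(prompts, publish_during_prefill=True)
print(finish_only, mid_flight)
```

With finish-only publication no sibling ever sees the shared prefix, while publishing during prefill lets every request after the first reuse it.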

Observed in slime RL rollouts: rollout/prefix_cache_hit_rate 0.001 vs 0.256 for SGLang; engine prefill logs showed #cached-token: 0 in 130/131 prefills.

Fix

Rename InsertHybridCache → InsertPrefixCache and publish the freshly-computed prefix through the base KV prefix cache when there is no hybrid cache (hybrid path unchanged: still via hybrid_cache->GetKVPrefixCache(); the Mamba checkpoint publication stays hybrid-only). The published node is pinned via the request's DeviceNodeRef so it is not evicted while the request is still using it. Thread kv_prefix_cache_ into SchedulePrefillEvent and ScheduleDecodeEvent, which previously only held hybrid_prefix_cache_ (the reason they couldn't publish for non-hybrid models).
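The publish-and-pin behaviour described above can be sketched with a toy page-granular prefix tree. This is only an illustration under stated assumptions: `ToyRadixTree`, its pin counter, and `evict_unpinned` are hypothetical names, the pin counter stands in for whatever DeviceNodeRef does internally, and only full pages are treated as publishable.

```python
class Node:
    def __init__(self):
        self.children = {}   # page (tuple of tokens) -> Node
        self.pins = 0        # number of in-flight requests holding this node


class ToyRadixTree:
    def __init__(self, page_size):
        self.page_size = page_size
        self.root = Node()

    def _pages(self, tokens):
        # Only full pages are publishable; a trailing partial page is dropped.
        full = len(tokens) - len(tokens) % self.page_size
        return [tuple(tokens[i:i + self.page_size])
                for i in range(0, full, self.page_size)]

    def insert(self, tokens):
        """Publish all full pages and pin the path; returns the pinned nodes."""
        node, path = self.root, []
        for page in self._pages(tokens):
            node = node.children.setdefault(page, Node())
            node.pins += 1
            path.append(node)
        return path

    def unpin(self, path):
        for node in path:
            node.pins -= 1

    def match(self, tokens):
        """Number of tokens covered by already-published full pages."""
        node, matched = self.root, 0
        for page in self._pages(tokens):
            if page not in node.children:
                break
            node = node.children[page]
            matched += self.page_size
        return matched

    def evict_unpinned(self):
        """Toy eviction: drop every subtree whose root node is not pinned."""
        def prune(node):
            for page in list(node.children):
                child = node.children[page]
                if child.pins == 0:
                    del node.children[page]
                else:
                    prune(child)
        prune(self.root)


tree = ToyRadixTree(page_size=4)
prompt_a = list(range(10))              # 2 full pages + 2 leftover tokens
ref_a = tree.insert(prompt_a)           # mid-flight publish, pinned by request A
tree.evict_unpinned()                   # eviction pressure while A still decodes
hit_while_pinned = tree.match(list(range(8)) + [99, 98])   # sibling request B
tree.unpin(ref_a)                       # A finishes and releases its reference
tree.evict_unpinned()
hit_after_evict = tree.match(prompt_a)
```

While request A holds its reference the published pages survive eviction and B matches them; once A releases the reference the same pages become evictable.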

Validation

Standalone Qwen2.5-0.5B engines, identical test (request A held in a long ignore_eos decode; request B prefills the same prompt while A is still decoding), only the scheduler differs:

| | A (fresh prompt) | B (prefills while A still decoding) |
|---|---|---|
| before | `#cached-token 0` | `#cached-token 0`; re-prefills all 338 tokens |
| after | `#cached-token 0` | `#cached-token 320`; reuses A's prefix, prefills only 48 |

B reusing A's prefix while A is unfinished is impossible under the old finish-only publish, confirming the mid-flight publish. Generation stays correct; compiles clean (scikit-build-core).

Files

tokenspeed-scheduler/csrc/fsm/forward_events.{h,cpp}
tokenspeed-scheduler/csrc/scheduler/operations/forward.cpp

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8373813b7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-08T07:25:45Z

    OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count);
-    auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages,
-                                                                                       std::move(pages_to_insert));
+    auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert));


Preserve page hashes when publishing unfinished prefixes

When enable_l3_storage is on, this mid-flight device insert creates non-hybrid prefix nodes without any page_hashes; when the request later finishes, FinishEvent::apply computes alloc_count from DevicePagesFromRoot(...) and skips the normal insert that would have set page_hashes_ because these pages are already in the radix tree. After those nodes are written back to host, Scheduler::CalcRollingHash(apply_match=true) seeds suffix hashes from result.host.last_node->PageHashes(), so prompts extending this cached prefix get hashes computed from an empty seed and L3 prefetch/backup lookups miss for prefixes published by this new path.

qywu · 2026-06-08T07:27:12Z

@codex review it

chatgpt-codex-connector · 2026-06-08T07:30:42Z

    OwnedPages pages_to_insert = local_kv_allocator->TakeFirst(new_page_count);
-    auto insert_result = hybrid_cache->GetKVPrefixCache().Insert<ResourceType::Device>(full_paged_tokens, prefix_pages,
-                                                                                       std::move(pages_to_insert));
+    auto insert_result = kv->Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(pages_to_insert));


Preserve L3 page hashes when publishing prefixes

When non-hybrid L3 storage is enabled, this mid-flight device insert creates the radix-tree nodes before FinishEvent has the page_hashes_ computed from the request. A later FinishEvent insert cannot fill them in because KVPrefixCache::Insert skips SetPageHashes for nodes that already have device pages, so host-written nodes keep empty PageHashes; CalcRollingHash(..., apply_match=true) then seeds suffix hashes with "" after a host match and generates wrong L3 keys, causing prefetch/backup misses for shared prefixes. Please defer this insert for L3 or make the finish path backfill page hashes on already-present nodes.
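The hash-seeding concern can be illustrated with a small Python sketch of chained per-page hashes. The hash construction here (SHA-256 over the previous hash plus the page tokens) and the name `rolling_page_hashes` are assumptions made for illustration; the actual CalcRollingHash implementation is not shown in this thread.

```python
import hashlib


def rolling_page_hashes(tokens, page_size, seed=""):
    """Chain one hash per full page; each hash depends on the previous one."""
    hashes, prev = [], seed
    full = len(tokens) - len(tokens) % page_size
    for start in range(0, full, page_size):
        page = tokens[start:start + page_size]
        payload = prev + "|" + ",".join(map(str, page))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


PAGE = 4
prompt = list(range(12))                        # three full pages
reference = rolling_page_hashes(prompt, PAGE)   # keys a cold request would compute

# Two pages were matched in the cache; only the suffix page still needs a hash.
suffix = prompt[2 * PAGE:]
good = rolling_page_hashes(suffix, PAGE, seed=reference[1])  # node kept its hashes
bad = rolling_page_hashes(suffix, PAGE, seed="")             # node has empty hashes
print(good == reference[2:], bad == reference[2:])
```

Seeding the suffix from the matched node's last page hash reproduces the keys a cold request would compute, whereas an empty seed yields different keys, which is the kind of lookup miss the review describes.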

qywu · 2026-06-08T16:05:03Z

Update — gated the mid-flight publish for sliding-window-attention models (2a4fe3a).

The first commit regressed gpt-oss-120b GPQA-diamond ~0.71 → 0.547 (B200 + MI355). gpt-oss is non-hybrid + sliding-window attention; publishing a prefix mid-flight (before the request finishes) corrupts SWA prefix reuse. ut-runtime-prefix-cache-e2e (full-attention) and hybrid/MLA evals were unaffected.

Fix: add has_sliding_window to SchedulerConfig (derived from hf_config.sliding_window, mirroring ModelRunner); SWA models skip the mid-flight publish and fall back to the prior finish-only publish (baseline was 0.707). Full-attention non-hybrid models keep the mid-flight reuse; hybrid (DeepSeek-V4) is unchanged. Local: wheel builds, 180/180 scheduler gtests pass. Needs the eval CI to re-run to confirm gpt-oss is back to ~0.70.

qywu · 2026-06-08T16:24:29Z

SWA coverage: only gpt-oss (non-hybrid) and deepseek-v4 (hybrid) use sliding-window attention; the gate catches both. gpt-oss falls back to finish-only publish (this fix); deepseek-v4 is unaffected, since hybrid models publish via their own (already window-correct) cache. No SWA model is missed.

This PR is the conservative fix (disable mid-flight publish for non-hybrid SWA). Giving SWA models prefix caching that is both correct and mid-flight, via a windowed paged-cache-group (the mechanism deepseek-v4 already uses, clamping the radix match to the window), will be a follow-up PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf05db9f9d

chatgpt-codex-connector · 2026-06-08T17:09:12Z

+    // now rather than only at FinishEvent. A null kv_prefix_cache (passed by the scheduler
+    // for sliding-window-attention models) disables this mid-flight publish, so SWA models
+    // fall back to the finish-only publish whose prefix reuse is known-correct.
+    KVPrefixCache* kv = (hybrid_cache != nullptr) ? &hybrid_cache->GetKVPrefixCache() : kv_prefix_cache;


Honor the SWA guard when a hybrid cache exists

For sliding-window models that also create a HybridPrefixCache (the scheduler does this whenever paged-cache groups or a prefix-cache adjunct are configured, e.g. the DeepSeek V4 SWA/state groups), schedulePrefill/scheduleDecode pass nullptr as the KV cache, but this line still selects hybrid_cache->GetKVPrefixCache() and publishes mid-flight. In the scheduler paths I checked, that leaves the newly documented corrupt SWA prefix-reuse path enabled for hybrid/SWA models; the helper needs an explicit skip signal rather than relying on a null base KV pointer.

This should be fixed now. The current implementation no longer relies on passing a null base KV cache to suppress publication. InsertPrefixCache() now takes an explicit enable_midflight_publish flag and returns before selecting hybrid_cache->GetKVPrefixCache() when the SWA path is not safe.

The scheduler policy is:

- non-SWA: publish mid-flight as before
- plain SWA / gpt-oss: publish only up to the configured sliding-window cap
- hybrid history-only SWA: do not publish mid-flight
- hybrid paged-cache SWA / DeepSeek V4: publish only when the paged-cache adjunct has sliding-window State groups, with HybridPrefixCache::Match() still enforcing snapshot/window correctness

I also added regression coverage for the unsafe hybrid-history SWA case and for capped plain-SWA decode-result publication.
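As a reading aid, the policy stated in this comment can be written as a small decision function. This is a sketch of the four cases as worded here, not the scheduler's code: `ModelTraits`, `PublishDecision`, and `midflight_publish_policy` are hypothetical names, and later commits in this PR change the behaviour for some models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTraits:                      # hypothetical container, not a TokenSpeed type
    has_sliding_window: bool
    has_hybrid_cache: bool = False
    has_paged_cache_adjunct: bool = False
    adjunct_has_swa_state_groups: bool = False


@dataclass(frozen=True)
class PublishDecision:
    enabled: bool
    token_cap: Optional[int] = None     # None means "no cap"


def midflight_publish_policy(traits, sliding_window=None):
    if not traits.has_sliding_window:
        return PublishDecision(True)                            # non-SWA
    if not traits.has_hybrid_cache:
        return PublishDecision(True, token_cap=sliding_window)  # plain SWA, capped
    if traits.has_paged_cache_adjunct and traits.adjunct_has_swa_state_groups:
        return PublishDecision(True)                            # hybrid paged-cache SWA
    return PublishDecision(False)                               # hybrid history-only SWA


non_swa = midflight_publish_policy(ModelTraits(has_sliding_window=False))
plain_swa = midflight_publish_policy(
    ModelTraits(has_sliding_window=True), sliding_window=128)
history_only = midflight_publish_policy(
    ModelTraits(has_sliding_window=True, has_hybrid_cache=True))
paged_swa = midflight_publish_policy(
    ModelTraits(has_sliding_window=True, has_hybrid_cache=True,
                has_paged_cache_adjunct=True, adjunct_has_swa_state_groups=True))
```

Returning an explicit decision object, rather than signalling "skip" through a null cache pointer, matches the intent of the explicit enable_midflight_publish flag mentioned above.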

…on models

Publish a request's prompt-prefix KV into the radix tree at the prefill->decode transition (not only at FinishEvent) for non-hybrid models, so concurrent same-prefix requests (RL rollouts, shared chat-template prefixes) reuse it.

Rename InsertHybridCache->InsertPrefixCache; publish via the base KV prefix cache when there is no hybrid cache (hybrid path unchanged; Mamba checkpoint stays hybrid-only). Thread kv_prefix_cache_ into SchedulePrefillEvent/ScheduleDecodeEvent.

Sliding-window-attention models (gpt-oss) are excluded: mid-flight SWA prefix reuse corrupts outputs (regressed gpt-oss GPQA 0.71->0.547). Add has_sliding_window to SchedulerConfig (from hf_config.sliding_window); SWA models pass a null kv_prefix_cache so the publish is skipped and they fall back to finish-only. deepseek-v4 (hybrid SWA) is unaffected.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bc9c7b6de

chatgpt-codex-connector · 2026-06-08T23:31:58Z

+        # models use their windowed adjunct snapshots. gpt-oss stores an
+        # inclusive HF window and converts it to TokenSpeed's exclusive
+        # attention window inside the model.
+        sliding_window = getattr(hf_config, "sliding_window", None)


Detect nested sliding windows before enabling publish

Fresh evidence is that this new scheduler-side detection only reads hf_config.sliding_window, while the DeepSeek V4 cache spec resolves the same value from either hf_config or hf_config.text_config. For checkpoints that put sliding_window under text_config, has_sliding_window stays false, and Scheduler::enableMidflightPrefixPublish() returns true before checking for a paged-cache adjunct/window state, re-enabling mid-flight publication for hybrid SWA configurations that this change is trying to guard. Please mirror the nested lookup here before deriving has_sliding_window.
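One way the requested nested lookup could look is sketched below. The helper names `resolve_sliding_window` and `has_sliding_window` are hypothetical, the configs are plain `SimpleNamespace` objects rather than real Hugging Face config classes, and treating any non-None value as "has a sliding window" is an assumption.

```python
from types import SimpleNamespace


def resolve_sliding_window(hf_config):
    """Return sliding_window from the top-level config, else from text_config."""
    value = getattr(hf_config, "sliding_window", None)
    if value is None:
        text_config = getattr(hf_config, "text_config", None)
        # getattr on None simply falls through to the default.
        value = getattr(text_config, "sliding_window", None)
    return value


def has_sliding_window(hf_config):
    # Assumption: any non-None window marks the model as SWA.
    return resolve_sliding_window(hf_config) is not None


flat = SimpleNamespace(sliding_window=128)
nested = SimpleNamespace(text_config=SimpleNamespace(sliding_window=4096))
full_attention = SimpleNamespace()
print(has_sliding_window(flat), has_sliding_window(nested),
      has_sliding_window(full_attention))
```

A lookup that reads only the top-level attribute would report the `nested` config as having no window, which is the gap the review points out.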

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu requested a review from a team as a code owner June 8, 2026 07:22

chatgpt-codex-connector Bot reviewed Jun 8, 2026

qywu marked this pull request as draft June 8, 2026 15:50

qywu force-pushed the qywu/prefix-cache-mid-flight branch 2 times, most recently from ce9f2f6 to bf05db9 Compare June 8, 2026 17:04

qywu marked this pull request as ready for review June 8, 2026 17:06

chatgpt-codex-connector Bot reviewed Jun 8, 2026

qywu force-pushed the qywu/prefix-cache-mid-flight branch from bf05db9 to d5d3fc1 Compare June 8, 2026 18:10

qywu requested a review from zhyncs June 8, 2026 21:11

zhyncs requested review from SimonCqk, dongjiyingdjy and wangbo981016 June 8, 2026 21:18

qywu marked this pull request as draft June 8, 2026 21:19

fix(scheduler): allow safe SWA mid-flight prefix publish

bee0920

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu force-pushed the qywu/prefix-cache-mid-flight branch from 6d6e6a4 to bee0920 Compare June 8, 2026 22:28

qywu added 2 commits June 8, 2026 15:42

Merge branch 'main' into qywu/prefix-cache-mid-flight

70cd61a

style(scheduler): format SWA publish changes

4deb07e

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu marked this pull request as ready for review June 8, 2026 22:59

fix(runtime): avoid ragged MHA prefix split with sinks

0bc9c7b

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector Bot reviewed Jun 8, 2026

fix(runtime): preserve midflight prefixes for mixed SWA models

c08c1a8

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu force-pushed the qywu/prefix-cache-mid-flight branch from f036ee5 to c08c1a8 Compare June 8, 2026 23:35

qywu added 2 commits June 9, 2026 00:05

fix(runtime): disable gpt-oss midflight prefix publish

c9809f7

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

fix(runtime): disable Kimi K2.5 midflight prefix publish

bd9ad99

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

LorrinWWW mentioned this pull request Jun 9, 2026

perf: chunked-prefill prefix cache update for non-hybrid models #22

Open

qywu wants to merge 8 commits into main from qywu/prefix-cache-mid-flight
