Skip to content

fix(cache): Coarsely fence the compute stream behind the host loadback stream on.#370

Open
LorrinWWW wants to merge 2 commits into
mainfrom
jue/loadback-decode-graph-fence
Open

fix(cache): Coarsely fence the compute stream behind the host loadback stream on.#370
LorrinWWW wants to merge 2 commits into
mainfrom
jue/loadback-decode-graph-fence

Conversation

@LorrinWWW

Copy link
Copy Markdown
Contributor

Summary

retraction→reload→resume-decode (and while loadback is in flight), since a captured decode graph can't honor the per-layer wait_until whose events live on the non-captured load_stream, allowing the replay to read KV mid-reload.

Test Plan

@LorrinWWW LorrinWWW requested a review from a team as a code owner June 6, 2026 18:27

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9297222c87

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +786 to +787
if recovery_with_loadback or host_exec.ack_load_queue:
self.model_executor.execution_stream.wait_stream(host_exec.load_stream)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid fencing eager loadbacks behind the full load stream

When an EXTEND/MIXED iteration submits a LoadBackOp, host_exec.ack_load_queue already contains the just-enqueued loadback because submit_plan()/flush() ran above, so this condition serializes the entire execution_stream behind load_stream even though recovery_with_loadback is false. Those eager forwards can use the layerwise consumers set immediately above to overlap compute with per-layer reloads; waiting for the whole loadback here removes that overlap and can significantly stall long prefill/mixed recovery batches. If the intent is only to fence prior in-flight loadbacks plus pure-decode graph replays, this needs to distinguish pre-existing acks from the current eager loadback.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant