fix(cache): Coarsely fence the compute stream behind the host loadback stream on.#370
fix(cache): Coarsely fence the compute stream behind the host loadback stream on.#370LorrinWWW wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9297222c87
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if recovery_with_loadback or host_exec.ack_load_queue: | ||
| self.model_executor.execution_stream.wait_stream(host_exec.load_stream) |
There was a problem hiding this comment.
Avoid fencing eager loadbacks behind the full load stream
When an EXTEND/MIXED iteration submits a LoadBackOp, host_exec.ack_load_queue already contains the just-enqueued loadback because submit_plan()/flush() ran above, so this condition serializes the entire execution_stream behind load_stream even though recovery_with_loadback is false. Those eager forwards can use the layerwise consumers set immediately above to overlap compute with per-layer reloads; waiting for the whole loadback here removes that overlap and can significantly stall long prefill/mixed recovery batches. If the intent is only to fence prior in-flight loadbacks plus pure-decode graph replays, this needs to distinguish pre-existing acks from the current eager loadback.
Useful? React with 👍 / 👎.
Summary
retraction→reload→resume-decode (and while loadback is in flight), since a captured decode graph can't honor the per-layer wait_until whose events live on the non-captured load_stream, allowing the replay to read KV mid-reload.
Test Plan