feat(engine): Sleep / Wake Up API (release/resume_memory_occupation)#393
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c2a4dffeaf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.state = PauseState.UNPAUSED | ||
| self.released = False |
There was a problem hiding this comment.
Keep scheduler paused while memory is released
After a release has drained, released_tags can still contain weights or kv_cache; calling the existing public resume_scheduler() path then reaches this handler and flips the pause state back to UNPAUSED without remapping those regions through resume_memory_occupation(). In that scenario the next admitted request or DP idle forward can touch unmapped/discarded GPU memory, so scheduler resume should reject or remain paused while the memory controller is still sleeping.
Useful? React with 👍 / 👎.
| not_released = [t for t in tags if t not in self.released_tags] | ||
| if not_released: | ||
| self._send.send_pyobj( | ||
| ResumeMemoryOccupationReqOutput( | ||
| success=False, message=f"tags not released: {not_released!r}" | ||
| ) | ||
| ) | ||
| return |
There was a problem hiding this comment.
Resume only the tags that were actually released
When a caller releases a subset, for example release_memory_occupation(tags=["weights"]), a later resume_memory_occupation() with the default tags=None is documented as resuming the previously released tags but _normalize_tags(None) expands to both tags. This makes not_released include kv_cache and returns failure, leaving the released weights unrestored unless the caller redundantly remembers the exact subset; None should be resolved from self.released_tags on resume.
Useful? React with 👍 / 👎.
|
Update: deepseek_v4 KV region wrapping added (was listed as a follow-on).
Validated on nv2 (b300): deepseek_v4 imports cleanly with the real compiled kernel/scheduler env; the MHA sleep/wake E2E regression still passes all cases (release frees + Remaining follow-ons: HTTP endpoints (external smg-grpc packages) and a TP/DP multi-GPU run of the idle-forward gate. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b22488d568
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| reset = getattr(self.scheduler, "reset_prefix_cache", None) | ||
| if callable(reset): | ||
| reset() | ||
| else: |
There was a problem hiding this comment.
Do not release KV when prefix cache cannot be reset
When prefix caching is enabled and the scheduler object does not expose reset_prefix_cache, this path still lets release_memory_occupation(tags containing kv_cache) succeed after only logging. The repo's FlushCacheReqInput handler is also ack-only (python/tokenspeed/runtime/engine/request_handler.py:182-185), and a repo-wide search finds no in-repo reset_prefix_cache implementation, so retained prefix-cache entries can point at remapped/zeroed KV pages after wake and produce stale cache hits. Please either call the actual scheduler cache-clear API or fail/disable the KV release in this context instead of continuing.
Useful? React with 👍 / 👎.
| pool = getattr(self.model_executor, "token_to_kv_pool", None) | ||
| if pool is not None and hasattr(pool, "clear_kv_buffers"): | ||
| pool.clear_kv_buffers() |
There was a problem hiding this comment.
Clear the draft KV pool after wake
In speculative decoding runs (--speculative-draft-model-path), ModelExecutor keeps a separate draft_token_to_kv_pool (python/tokenspeed/runtime/execution/model_executor.py:186-188) whose allocations are also tagged as kv_cache, but wake repair only zeros the target pool here. After resume_memory_occupation(tags=["kv_cache"]), the draft pool remains remapped with garbage, so the next draft forward can read stale/padding KV data; include draft_token_to_kv_pool in the repair path when it exists.
Useful? React with 👍 / 👎.
qywu
left a comment
There was a problem hiding this comment.
These two docs/superpowers/ files (~1,700 of the PR's ~2,500 added lines) are Claude Code workflow artifacts rather than project docs. Recommend dropping them so the PR is just the sleep/wake implementation + tests; the design/results already live in the PR description. Details inline on each file.
| @@ -0,0 +1,1359 @@ | |||
| # Sleep / Wake Up API Implementation Plan | |||
There was a problem hiding this comment.
Please remove this file from the PR. It's a Claude Code superpowers planning/results artifact (1,359 lines) — point-in-time process scratch, not project documentation. A few reasons:
docs/superpowers/doesn't exist onmain;docs/is a curated VitePress site (guides/,serving/,configuration/, …). This drops planning scratch into a published docs tree.- It won't be maintained as the code evolves, so it will drift and mislead.
- Planning notes + validation results like this belong in the PR description (where most of it already is) or a tracking issue, not committed under
docs/.
If there's durable content here worth keeping, fold the essentials into a maintained doc or the module docstrings instead.
| @@ -0,0 +1,335 @@ | |||
| # Sleep / Wake Up API — Design | |||
There was a problem hiding this comment.
Same ask as the plan doc — please remove this 335-line design spec from the PR. It's a generated superpowers/specs artifact and the first thing under docs/superpowers/, which isn't part of the existing published docs/ site.
If any of the design rationale is durable and worth keeping for future maintainers, distill it into the memory_occupation.py / pause.py module docstrings or a short maintained page under the existing docs/ structure. As a standalone point-in-time spec it'll go stale against the code.
qywu
left a comment
There was a problem hiding this comment.
test/runtime/test_io_struct_memory_occupation.py only tests @dataclass defaults/assignment (pure data holders, no behavior), and every contract it touches is already covered functionally by test_memory_occupation_controller.py. Recommend removing it — detail inline.
| @@ -0,0 +1,29 @@ | |||
| # SPDX-License-Identifier: Apache-2.0 | |||
There was a problem hiding this comment.
Recommend removing this file. These six dataclasses (ReleaseMemoryOccupationReqInput/Output, ResumeMemoryOccupationReqInput/Output, IsSleepingReqInput/Output) are pure data holders — no __post_init__, no validation, no methods — so all four tests just assert that Python's @dataclass stores values and applies declared defaults:
ReleaseMemoryOccupationReqInput().tags is None/...Output().success is Truetest that a field defaulted toNone/Truehas that value.(success=False, message="x").message == "x"tests assignment.assert IsSleepingReqInput() is not Noneis tautological — a constructor never returnsNone.
They're change-detectors: editing a dataclass forces editing the test in lockstep, with no independent verification, and they can't catch a real bug because there's no behavior to break.
Every meaningful contract here is already covered by test_memory_occupation_controller.py, which constructs the same dataclasses where the values actually drive behavior — tags=None ⇒ all (lines 74/91/130/155), selective tags=["weights"]/["kv_cache"] (105–116), success True/False (85/99/123/131/142/149), is_sleeping transitions (86/100/111/116). A renamed field or changed default would fail there too, with real meaning.
(Fully agree with the AGENTS.md instinct to test changed code — that coverage just belongs in the controller/behavior tests, where it already lives. The controller, adapter, and GPU E2E tests are the ones doing real work.)
… fail-closed KV release, draft-pool repair - pause: reject a scheduler-level resume while GPU memory is released, and stop writing `released` there — set_released() (memory controller) is now its sole writer, so the control plane can't desync the data-plane flag (#1). - memory_occupation: default resume(None) wakes exactly the released tags, not all valid tags, so a partial (e.g. weights-only) release round-trips (#2). - memory_occupation + event_loop: fail-closed kv_cache release. When prefix caching is on and the scheduler has no prefix-cache reset (none exists today), releasing kv_cache/all is rejected up front rather than orphaning stale cache entries on wake. Capability is declared once at construction (kv_cache_release_allowed), not duck-typed per call (#3). - event_loop: repair every KV pool after wake (target + draft) via a shared _kv_pools() accessor, so spec-decode runs don't read garbage draft KV (#4). - docker: install libnuma1 (runtime dep of torch_memory_saver, the memory saver behind sleep/wake; the base runner image lacks it). - tests: unit coverage for #1/#2/#3; GPU suite rewritten to cover all fixes plus the fail-closed path; v4 + GPU tests set enable_prefix_caching=False and disable_kvstore=True (KV release requires both). - docs: document fail-closed KV release and the prefix-caching/kvstore coupling. GPU-validated on B200 against a from-source rebuilt image: full release frees ~18 GB, token-identical generation across a sleep cycle, #1 scheduler-resume reject, #2 default-resume of a partial release, and #3 fail-closed KV release all pass. (#4 draft-pool path is unit-reasoned; GPU smoke test is opt-in.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
b22488d to
3610625
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 361062591a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if req.tags is None: | ||
| tags = [t for t in VALID_TAGS if t in self.released_tags] |
There was a problem hiding this comment.
Reject resumes while release is still draining
When a concurrent resume_memory_occupation() arrives before the pending release drain completes, released_tags is still empty, so tags=None is resolved to []; the method later reports success and calls _pause.set_released(False) without cancelling the pending drain. The original drain can then run _finish_release() and free weights/KV after the resume succeeded, so new work may be admitted or the caller may proceed believing the engine is awake. Please reject or cancel resumes while self._pause.is_drain_pending.
Useful? React with 👍 / 👎.
| self.is_sleeping_communicator = _Communicator( | ||
| self.engine_core_client.send_to_scheduler, server_args.mapping.attn.dp_size | ||
| ) |
There was a problem hiding this comment.
Match is_sleeping fanout to DP control recipients
In attention-DP runs with TP>1, DataParallelController.send_control_message() sends control messages to self.workers[::self.control_message_step] with control_message_step=tp_size, while workers is indexed by DP rank; for example dp_size=4,tp_size=2 sends the request to only two workers. This new communicator still waits for attn.dp_size IsSleepingReqOutputs, so Engine.is_sleeping() can hang forever after receiving only the subset of replies. Either send the request to every DP worker or make the expected fanout match the controller's actual recipients.
Useful? React with 👍 / 👎.
… fail-closed KV release, draft-pool repair - pause: reject a scheduler-level resume while GPU memory is released, and stop writing `released` there — set_released() (memory controller) is now its sole writer, so the control plane can't desync the data-plane flag (#1). - memory_occupation: default resume(None) wakes exactly the released tags, not all valid tags, so a partial (e.g. weights-only) release round-trips (#2). - memory_occupation + event_loop: fail-closed kv_cache release. When prefix caching is on and the scheduler has no prefix-cache reset (none exists today), releasing kv_cache/all is rejected up front rather than orphaning stale cache entries on wake. Capability is declared once at construction (kv_cache_release_allowed), not duck-typed per call (#3). - event_loop: repair every KV pool after wake (target + draft) via a shared _kv_pools() accessor, so spec-decode runs don't read garbage draft KV (#4). - docker: install libnuma1 (runtime dep of torch_memory_saver, the memory saver behind sleep/wake; the base runner image lacks it). - tests: unit coverage for #1/#2/#3; GPU suite rewritten to cover all fixes plus the fail-closed path; v4 + GPU tests set enable_prefix_caching=False and disable_kvstore=True (KV release requires both). GPU-validated on B200 against a from-source rebuilt image: full release frees ~18 GB, token-identical generation across a sleep cycle, #1 scheduler-resume reject, #2 default-resume of a partial release, and #3 fail-closed KV release all pass. (#4 draft-pool path is unit-reasoned; GPU smoke test is opt-in.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
3610625 to
aeddd4c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aeddd4c479
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| with self.memory_saver_adapter.region(): | ||
| # Tag as "kv_cache", no CPU backup: KV is discarded on sleep and rebuilt | ||
| # after wake (paging overwrites; clear_kv_buffers zeros the remapped pages). | ||
| with self.memory_saver_adapter.region(tag="kv_cache", enable_cpu_backup=False): |
There was a problem hiding this comment.
Keep KV pointer metadata out of discard region
In MHA pools this kv_cache region covers not only k_buffer/v_buffer but also the k_data_ptrs, v_data_ptrs, data_ptrs, and data_strides tensors built just below; because the region is later paused with enable_cpu_backup=False, resuming kv_cache leaves those metadata tensors' contents unspecified. On MHA runs that use the host/retraction path (MemoryExecutor reads device_pool.k_data_ptrs/v_data_ptrs) or _kv_copy_config, the next copy kernel can read invalid addresses/strides even though _kv_repair_after_wake() only zeros KV buffers. Please allocate or rebuild this metadata outside the discarded KV region.
Useful? React with 👍 / 👎.
| bad = [t for t in tags if t not in VALID_TAGS] | ||
| if bad: | ||
| return None, f"invalid tags: {bad!r}; valid: {list(VALID_TAGS)}" | ||
| return [t for t in VALID_TAGS if t in tags], None |
There was a problem hiding this comment.
Reject empty release tag lists
When a caller passes tags=[], normalization returns an empty list with no error, so handle_release() still drains and _finish_release() calls set_released(True) without pausing any tag or adding anything to released_tags. The scheduler then stays PAUSED_ALL while is_sleeping reports false, which is a confusing stuck state for an otherwise valid typed API input; treat an empty list as invalid or as a no-op instead of arming the release drain.
Useful? React with 👍 / 👎.
aeddd4c to
d5a8fa9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5a8fa9d42
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not self.released_tags: | ||
| # Fully awake → resume scheduling. | ||
| self._pause.set_released(False) |
There was a problem hiding this comment.
Preserve scheduler pause on no-op memory resume
When resume_memory_occupation() is called while no memory tags are released, req.tags is None resolves to an empty tags list, but this branch still calls set_released(False), which sets the shared PauseController state to UNPAUSED. In a scheduler that was intentionally paused via pause_scheduler(mode="keep"/"wait"), a no-op memory resume therefore re-admits work without a matching resume_scheduler() call; only clear the release-owned pause state after actually resuming at least one released tag.
Useful? React with 👍 / 👎.
d5a8fa9 to
d3e0cb8
Compare
Complete TokenSpeed's SGLang-style data-plane sleep/wake: release GPU memory (offload weights to CPU, discard KV) and resume it without restarting the process, driven by the RL/RLHF loop. Composes with the control-plane pause/resume (PR #346) — release auto-pauses + drains the scheduler, then frees memory region by region. - io_struct: tags + success on memory-occupation reqs; IsSleeping types. - memory-saver adapter: thread tag + enable_cpu_backup through region/pause/ resume (weights -> CPU backup, kv_cache -> discard). - pause: generalize the deferred reply into _PendingDrain(on_drained, on_cancelled); reject a scheduler resume while memory is released, so only resume_memory_occupation clears `released` (set_released is its sole writer). - MemoryOccupationController: release/resume by tag, released_tags tracking, is_sleeping. Default resume(None) wakes exactly what is asleep (not all tags). Fail-closed kv_cache release when prefix caching is on and the scheduler has no prefix-cache reset (kv_cache_release_allowed, decided once at construction) — discarded KV would otherwise orphan stale prefix-cache entries on wake. - event_loop: wire the controller; DP idle-forward gate while released; repair every KV pool after wake (target + draft) via a shared _kv_pools() accessor. - client/engine API: return release/resume outputs; add is_sleeping RPC. - weights + kv_cache (incl. deepseek_v4) wrapped in region(tag=...). - build: pin torch_memory_saver==0.0.9.post1 (import-guarded); docker installs libnuma1 (its runtime dep, absent from the base runner image). - tests: pause + memory-occupation unit suites; GPU integration cases (release/resume, token-identical sleep cycle, RL multi-stage, error paths, fail-closed KV release, scheduler-resume reject); V4 DP4 case. KV release requires enable_prefix_caching=False (=> disable_kvstore=True). Python/engine API only; HTTP deferred (needs external grpc pkgs). GPU-validated on B200/B300 from a source-rebuilt image: full release frees ~18 GB, token-identical generation across a sleep cycle, multi-stage tag flow, error paths, scheduler-resume reject, and fail-closed KV release all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
d3e0cb8 to
b59c2f6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b59c2f692b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self._kv_repair() | ||
| if not self.released_tags: | ||
| # Fully awake → resume scheduling. | ||
| self._pause.set_released(False) |
There was a problem hiding this comment.
Preserve existing scheduler pause on wake
When memory release is used after the scheduler was already paused and drained, for example pause_scheduler(mode="wait") followed by release_memory_occupation(tags=["weights"]), this unconditional wake path calls set_released(False), which sets the shared pause state to UNPAUSED. Because the release path never records that the scheduler was paused before the memory sleep, resume_memory_occupation() can re-admit buffered/new generation without the caller ever invoking resume_scheduler(), breaking the scheduler-level pause contract.
Useful? React with 👍 / 👎.
|
@qywu thank you for the comments. I have resolved them. Please take another look. |
Summary
Completes TokenSpeed's half-ported SGLang-style data-plane sleep/wake: wires
release_memory_occupation/resume_memory_occupation/is_sleepingend-to-end with selectiveweights/kv_cachetags. A release auto-pauses + drains the scheduler (control plane) and only then frees GPU memory viatorch_memory_saver(data plane); resume re-maps, repairs the KV cache, and unpauses. Primary driver: the RL/RLHF loop (free the GPU for the trainer between rollouts, push fresh weights, resume).Pure-Python — the C++ scheduler
.sois untouched.Before this PR (half-ported state)
--enable-memory-saver, thetorch_memory_saveradapter, region-wrapped weights/KV, the engine API stubs, and client communicators existed — but the scheduler-side dispatch was absent,memory_saver.pause/resumewas never called, and the io_structtagsfield was missing (so the engine API would haveTypeError'd).What changed
PauseController— generalized the deferred reply into a_PendingDrain(on_drained, on_cancelled)action + areleasedflag, so "release after drain" reuses the proven async-drain path (pause/resume behavior unchanged).MemoryOccupationController(new) — GPU-memory orchestration: tag→memory_saver.pause/resume,released_tagsbookkeeping (partial wake, double-release/not-released rejection), prefix-cache reset on release, KV repair on wake.request_handlerdispatch for Release/Resume/IsSleeping;event_loopconstructs the controller and skips DPexecute_idle_forwardwhilereleased(weights are unmapped);BaseTokenToKVPool.clear_kv_buffers()zeros remapped KV.region(tag, enable_cpu_backup)pass-through; weights taggedenable_cpu_backup=True(byte-exact restore), KVFalse(discard).release/resume_memory_occupation(tags)+is_sleeping(); client returnssuccessoutputs;torch_memory_saver==0.0.9.post1pinned (import-guarded).Validation
torch_memory_saver.--enable-memory-saver), all cases pass:is_sleeping→True; resume restored to within ~2 MiBsuccess=FalseScoped out (follow-ons)
tokenspeed-smg-grpc-proto/-servicerpackages (pause/resume is Python-only for the same reason).del enable_memory_saver; needed for the real serving model (validated on a small MHA model first).Design:
docs/superpowers/specs/2026-06-07-sleep-wakeup-api-design.md· Plan + results:docs/superpowers/plans/2026-06-07-sleep-wakeup-api.md🤖 Generated with Claude Code