feat(engine): Sleep / Wake Up API (release/resume_memory_occupation) #393

Merged
HJSang merged 1 commit into main from hejian/sleep-wakeup-api
Jun 15, 2026

Conversation

HJSang commented Jun 9, 2026
Collaborator

Summary

Completes TokenSpeed's half-ported SGLang-style data-plane sleep/wake: wires release_memory_occupation / resume_memory_occupation / is_sleeping end-to-end with selective weights/kv_cache tags. A release auto-pauses + drains the scheduler (control plane) and only then frees GPU memory via torch_memory_saver (data plane); resume re-maps, repairs the KV cache, and unpauses. Primary driver: the RL/RLHF loop (free the GPU for the trainer between rollouts, push fresh weights, resume).

Pure-Python — the C++ scheduler .so is untouched.

Before this PR (half-ported state)

--enable-memory-saver, the torch_memory_saver adapter, region-wrapped weights/KV, the engine API stubs, and client communicators existed — but the scheduler-side dispatch was absent, memory_saver.pause/resume was never called, and the io_struct tags field was missing (so the engine API would have TypeError'd).

What changed

  • PauseController — generalized the deferred reply into a _PendingDrain(on_drained, on_cancelled) action + a released flag, so "release after drain" reuses the proven async-drain path (pause/resume behavior unchanged).
  • MemoryOccupationController (new) — GPU-memory orchestration: tag→memory_saver.pause/resume, released_tags bookkeeping (partial wake, double-release/not-released rejection), prefix-cache reset on release, KV repair on wake.
  • Wiring: request_handler dispatch for Release/Resume/IsSleeping; event_loop constructs the controller and skips DP execute_idle_forward while released (weights are unmapped); BaseTokenToKVPool.clear_kv_buffers() zeros remapped KV.
  • Adapter / regions: region(tag, enable_cpu_backup) pass-through; weights tagged enable_cpu_backup=True (byte-exact restore), KV False (discard).
  • Surface — Engine release/resume_memory_occupation(tags) + is_sleeping(); client returns success outputs; torch_memory_saver==0.0.9.post1 pinned (import-guarded).
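
The backup-versus-discard split in the Adapter / regions bullet can be illustrated with a toy model. This is only a sketch: `ToyRegion` and `ToyMemorySaver` are hypothetical names, plain `bytearray`s stand in for GPU memory, and none of this is the torch_memory_saver API. The zero-fill on resume models "remap, then clear_kv_buffers()"; real remapped pages hold unspecified contents until they are cleared.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyRegion:
    """One tagged allocation; a stand-in for GPU memory (hypothetical)."""
    data: bytearray
    enable_cpu_backup: bool
    size: int = 0
    backup: Optional[bytes] = None


class ToyMemorySaver:
    """Toy model of pause/resume by tag; NOT the torch_memory_saver API."""

    def __init__(self) -> None:
        self.regions: Dict[str, ToyRegion] = {}

    def allocate(self, tag: str, payload: bytes, enable_cpu_backup: bool) -> None:
        self.regions[tag] = ToyRegion(bytearray(payload), enable_cpu_backup, len(payload))

    def pause(self, tag: str) -> None:
        region = self.regions[tag]
        # Keep a host-side copy only for regions that asked for one.
        region.backup = bytes(region.data) if region.enable_cpu_backup else None
        region.data = bytearray()  # models unmapping the device pages

    def resume(self, tag: str) -> None:
        region = self.regions[tag]
        if region.backup is not None:
            region.data = bytearray(region.backup)  # byte-exact restore
        else:
            # Contents were discarded; the toy hands back zeroed pages.
            region.data = bytearray(region.size)
        region.backup = None


saver = ToyMemorySaver()
saver.allocate("weights", b"\x01\x02\x03", enable_cpu_backup=True)
saver.allocate("kv_cache", b"\x09\x09", enable_cpu_backup=False)
for tag in ("weights", "kv_cache"):
    saver.pause(tag)
for tag in ("weights", "kv_cache"):
    saver.resume(tag)
```

After the cycle the toy weights are identical to what was allocated, while the toy KV region comes back the same size but zeroed.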

Validation

  • 26 unit tests (local + real GPU env), incl. adapter tag pass-through against the real torch_memory_saver.
  • Live E2E on B300 (Qwen2-0.5B, --enable-memory-saver), all cases pass:
    • release freed 26.7 GiB, is_sleeping→True; resume restored to within ~2 MiB
    • token-identical generation across the sleep cycle with CUDA graphs (weights restored byte-exact)
    • RL multi-stage tag flow (release both → resume weights [still sleeping] → resume kv_cache [awake])
    • error paths return success=False
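
The multi-stage tag flow and the error paths listed above can be traced with a small state model. This is a sketch under assumptions: `ToySleepState` is a hypothetical class that only tracks `released_tags`, and it treats `is_sleeping` as "any tag still released", which is how the "resume weights [still sleeping]" step reads.

```python
VALID_TAGS = ("weights", "kv_cache")


class ToySleepState:
    """Hypothetical bookkeeping only; no GPU memory is touched."""

    def __init__(self) -> None:
        self.released_tags = set()

    @property
    def is_sleeping(self) -> bool:
        return bool(self.released_tags)

    def release(self, tags=None) -> bool:
        tags = list(VALID_TAGS) if tags is None else list(tags)
        if any(t in self.released_tags for t in tags):
            return False  # double release is rejected
        self.released_tags.update(tags)
        return True

    def resume(self, tags) -> bool:
        tags = list(tags)
        if any(t not in self.released_tags for t in tags):
            return False  # resuming a tag that is not released is rejected
        self.released_tags.difference_update(tags)
        return True


state = ToySleepState()
trace = []
trace.append((state.release(), state.is_sleeping))               # release both
trace.append((state.resume(["weights"]), state.is_sleeping))     # still sleeping
trace.append((state.resume(["kv_cache"]), state.is_sleeping))    # awake
trace.append((state.resume(["weights"]), state.is_sleeping))     # error path
```

Each `trace` entry is a `(success, is_sleeping)` pair, so the last entry shows a failed call that leaves the state untouched.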

Scoped out (follow-ons)

  • HTTP endpoints — require new RPCs in the external tokenspeed-smg-grpc-proto/-servicer packages (pause/resume is Python-only for the same reason).
  • deepseek_v4 KV region wrapping — currently del enable_memory_saver; needed for the real serving model (validated on a small MHA model first).
  • TP/DP multi-GPU run of the idle-forward gate (logic in place; single-GPU validated).

Design: docs/superpowers/specs/2026-06-07-sleep-wakeup-api-design.md · Plan + results: docs/superpowers/plans/2026-06-07-sleep-wakeup-api.md

🤖 Generated with Claude Code

@HJSang HJSang requested a review from a team as a code owner June 9, 2026 04:08

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2a4dffeaf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +249 to +250
self.state = PauseState.UNPAUSED
self.released = False

P1: Keep scheduler paused while memory is released

After a release has drained, released_tags can still contain weights or kv_cache; calling the existing public resume_scheduler() path then reaches this handler and flips the pause state back to UNPAUSED without remapping those regions through resume_memory_occupation(). In that scenario the next admitted request or DP idle forward can touch unmapped/discarded GPU memory, so scheduler resume should reject or remain paused while the memory controller is still sleeping.
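
One way to get the behavior this comment asks for is a guard in the scheduler-resume path. The sketch below is an assumption, not the PR's code: `ToyPauseController` and its message text are hypothetical, and only the names `PauseState.UNPAUSED`, `PAUSED_ALL`, `released`, and `set_released` are borrowed from the thread.

```python
from enum import Enum, auto
from typing import Tuple


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED_ALL = auto()


class ToyPauseController:
    """Hypothetical control-plane model with a data-plane `released` flag."""

    def __init__(self) -> None:
        self.state = PauseState.UNPAUSED
        self.released = False

    def set_released(self, released: bool) -> None:
        # Only the memory controller is expected to call this.
        self.released = released
        self.state = PauseState.PAUSED_ALL if released else PauseState.UNPAUSED

    def resume_scheduler(self) -> Tuple[bool, str]:
        if self.released:
            # Refuse to unpause while regions are still unmapped.
            return False, "memory is released; call resume_memory_occupation first"
        self.state = PauseState.UNPAUSED
        return True, ""


ctl = ToyPauseController()
ctl.set_released(True)
rejected = ctl.resume_scheduler()
state_after_reject = ctl.state
ctl.set_released(False)
accepted = ctl.resume_scheduler()
```

Keeping `set_released()` as the single writer of `released` means the control plane cannot clear the flag on its own.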

Comment on lines +117 to +124
not_released = [t for t in tags if t not in self.released_tags]
if not_released:
    self._send.send_pyobj(
        ResumeMemoryOccupationReqOutput(
            success=False, message=f"tags not released: {not_released!r}"
        )
    )
    return

P2: Resume only the tags that were actually released

When a caller releases a subset, for example release_memory_occupation(tags=["weights"]), a later resume_memory_occupation() with the default tags=None is documented as resuming the previously released tags but _normalize_tags(None) expands to both tags. This makes not_released include kv_cache and returns failure, leaving the released weights unrestored unless the caller redundantly remembers the exact subset; None should be resolved from self.released_tags on resume.
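
The resolution rule suggested here fits in one function. A minimal sketch, assuming a hypothetical helper `resolve_resume_tags` that returns a `(tags, error)` pair; the error text mirrors the snippet under review.

```python
VALID_TAGS = ("weights", "kv_cache")


def resolve_resume_tags(requested, released_tags):
    """Return (tags, error); exactly one of the two is None."""
    if requested is None:
        # Default: wake exactly what is asleep, not every valid tag.
        return [t for t in VALID_TAGS if t in released_tags], None
    not_released = [t for t in requested if t not in released_tags]
    if not_released:
        return None, f"tags not released: {not_released!r}"
    # Keep a canonical order regardless of how the caller listed the tags.
    return [t for t in VALID_TAGS if t in requested], None


default_after_partial = resolve_resume_tags(None, {"weights"})
explicit_wrong_tag = resolve_resume_tags(["kv_cache"], {"weights"})
```

With this rule a weights-only release followed by a default resume round-trips without the caller having to remember the subset.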

HJSang commented Jun 9, 2026
Collaborator Author

Update: deepseek_v4 KV region wrapping added (was listed as a follow-on).

  • deepseek_v4.py now allocates its KV buffers (swa / compressed / compressor_state / indexer / indexer_state) inside region(tag="kv_cache", enable_cpu_backup=False) instead of discarding enable_memory_saver.
  • BaseTokenToKVPool.clear_kv_buffers() extended to zero V4's buffer names after wake.

Validated on nv2 (b300): deepseek_v4 imports cleanly with the real compiled kernel/scheduler env; the MHA sleep/wake E2E regression still passes all cases (release frees + is_sleeping→True, token-identical resume with CUDA graphs, RL multi-stage, error paths). A full DeepSeek-V4-Pro model run is pending free multi-GPU capacity (the box's free GPUs currently have only ~25 GB each; V4 needs ~4 GPUs × hundreds of GB).

Remaining follow-ons: HTTP endpoints (external smg-grpc packages) and a TP/DP multi-GPU run of the idle-forward gate.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b22488d568

Comment on lines +1196 to +1199
reset = getattr(self.scheduler, "reset_prefix_cache", None)
if callable(reset):
    reset()
else:

P1: Do not release KV when prefix cache cannot be reset

When prefix caching is enabled and the scheduler object does not expose reset_prefix_cache, this path still lets release_memory_occupation(tags containing kv_cache) succeed after only logging. The repo's FlushCacheReqInput handler is also ack-only (python/tokenspeed/runtime/engine/request_handler.py:182-185), and a repo-wide search finds no in-repo reset_prefix_cache implementation, so retained prefix-cache entries can point at remapped/zeroed KV pages after wake and produce stale cache hits. Please either call the actual scheduler cache-clear API or fail/disable the KV release in this context instead of continuing.
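
A fail-closed variant could decide the capability once, up front, and reject the release before anything is freed. The sketch below is hypothetical: `ToyKVReleaseGate`, its constructor parameters, and the message are illustrative, and only the flag name `kv_cache_release_allowed` is taken from the follow-up commit message in this thread.

```python
from typing import Sequence, Tuple


class ToyKVReleaseGate:
    """Hypothetical gate; decides KV-release capability at construction."""

    def __init__(self, enable_prefix_caching: bool, has_prefix_cache_reset: bool) -> None:
        # Releasing KV is safe if nothing caches prefixes, or if the cache
        # can actually be reset before the pages are discarded.
        self.kv_cache_release_allowed = (not enable_prefix_caching) or has_prefix_cache_reset

    def check_release(self, tags: Sequence[str]) -> Tuple[bool, str]:
        if "kv_cache" in tags and not self.kv_cache_release_allowed:
            return False, "kv_cache release rejected: prefix cache cannot be reset"
        return True, ""


gate = ToyKVReleaseGate(enable_prefix_caching=True, has_prefix_cache_reset=False)
kv_result = gate.check_release(["weights", "kv_cache"])
weights_result = gate.check_release(["weights"])
```

A weights-only release still goes through under this gate, because only discarded KV pages can leave prefix-cache entries stale.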

Comment on lines +1208 to +1210
pool = getattr(self.model_executor, "token_to_kv_pool", None)
if pool is not None and hasattr(pool, "clear_kv_buffers"):
    pool.clear_kv_buffers()

P2: Clear the draft KV pool after wake

In speculative decoding runs (--speculative-draft-model-path), ModelExecutor keeps a separate draft_token_to_kv_pool (python/tokenspeed/runtime/execution/model_executor.py:186-188) whose allocations are also tagged as kv_cache, but wake repair only zeros the target pool here. After resume_memory_occupation(tags=["kv_cache"]), the draft pool remains remapped with garbage, so the next draft forward can read stale/padding KV data; include draft_token_to_kv_pool in the repair path when it exists.
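
The repair path could iterate over every pool that exists instead of hard-coding the target pool. A minimal sketch with stand-in classes: `ToyPool`, `ToyExecutor`, `kv_pools`, and `repair_kv_after_wake` are hypothetical, and only the attribute names `token_to_kv_pool`, `draft_token_to_kv_pool`, and `clear_kv_buffers` come from the comment above.

```python
from typing import Iterator, List, Optional


class ToyPool:
    """Stand-in KV pool; a list of floats plays the role of device buffers."""

    def __init__(self, n: int) -> None:
        self.buf: List[float] = [1.0] * n  # pretend this is post-remap garbage

    def clear_kv_buffers(self) -> None:
        self.buf = [0.0] * len(self.buf)


class ToyExecutor:
    def __init__(self, target: ToyPool, draft: Optional[ToyPool] = None) -> None:
        self.token_to_kv_pool = target
        self.draft_token_to_kv_pool = draft


def kv_pools(executor) -> Iterator[ToyPool]:
    """Yield every KV pool that exists and can be cleared."""
    for name in ("token_to_kv_pool", "draft_token_to_kv_pool"):
        pool = getattr(executor, name, None)
        if pool is not None and hasattr(pool, "clear_kv_buffers"):
            yield pool


def repair_kv_after_wake(executor) -> int:
    cleared = 0
    for pool in kv_pools(executor):
        pool.clear_kv_buffers()
        cleared += 1
    return cleared


spec_decode = ToyExecutor(ToyPool(4), ToyPool(2))
plain = ToyExecutor(ToyPool(4))
cleared_spec = repair_kv_after_wake(spec_decode)
cleared_plain = repair_kv_after_wake(plain)
```

A shared accessor keeps the "which pools exist" decision in one place, so later KV consumers do not need their own `getattr` probes.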

@qywu qywu self-requested a review June 12, 2026 08:08

@qywu qywu left a comment

Collaborator

These two docs/superpowers/ files (~1,700 of the PR's ~2,500 added lines) are Claude Code workflow artifacts rather than project docs. Recommend dropping them so the PR is just the sleep/wake implementation + tests; the design/results already live in the PR description. Details inline on each file.

@@ -0,0 +1,1359 @@
# Sleep / Wake Up API Implementation Plan

Collaborator

Please remove this file from the PR. It's a Claude Code superpowers planning/results artifact (1,359 lines) — point-in-time process scratch, not project documentation. A few reasons:

  • docs/superpowers/ doesn't exist on main; docs/ is a curated VitePress site (guides/, serving/, configuration/, …). This drops planning scratch into a published docs tree.
  • It won't be maintained as the code evolves, so it will drift and mislead.
  • Planning notes + validation results like this belong in the PR description (where most of it already is) or a tracking issue, not committed under docs/.

If there's durable content here worth keeping, fold the essentials into a maintained doc or the module docstrings instead.

Collaborator Author

updated

@@ -0,0 +1,335 @@
# Sleep / Wake Up API — Design

Collaborator

Same ask as the plan doc — please remove this 335-line design spec from the PR. It's a generated superpowers/specs artifact and the first thing under docs/superpowers/, which isn't part of the existing published docs/ site.

If any of the design rationale is durable and worth keeping for future maintainers, distill it into the memory_occupation.py / pause.py module docstrings or a short maintained page under the existing docs/ structure. As a standalone point-in-time spec it'll go stale against the code.

@qywu qywu left a comment

Collaborator

test/runtime/test_io_struct_memory_occupation.py only tests @dataclass defaults/assignment (pure data holders, no behavior), and every contract it touches is already covered functionally by test_memory_occupation_controller.py. Recommend removing it — detail inline.

@@ -0,0 +1,29 @@
# SPDX-License-Identifier: Apache-2.0

Collaborator

Recommend removing this file. These six dataclasses (ReleaseMemoryOccupationReqInput/Output, ResumeMemoryOccupationReqInput/Output, IsSleepingReqInput/Output) are pure data holders — no __post_init__, no validation, no methods — so all four tests just assert that Python's @dataclass stores values and applies declared defaults:

  • ReleaseMemoryOccupationReqInput().tags is None / ...Output().success is True test that a field defaulted to None/True has that value.
  • (success=False, message="x").message == "x" tests assignment.
  • assert IsSleepingReqInput() is not None is tautological — a constructor never returns None.

They're change-detectors: editing a dataclass forces editing the test in lockstep, with no independent verification, and they can't catch a real bug because there's no behavior to break.

Every meaningful contract here is already covered by test_memory_occupation_controller.py, which constructs the same dataclasses where the values actually drive behavior — tags=None ⇒ all (lines 74/91/130/155), selective tags=["weights"]/["kv_cache"] (105–116), success True/False (85/99/123/131/142/149), is_sleeping transitions (86/100/111/116). A renamed field or changed default would fail there too, with real meaning.

(Fully agree with the AGENTS.md instinct to test changed code — that coverage just belongs in the controller/behavior tests, where it already lives. The controller, adapter, and GPU E2E tests are the ones doing real work.)

Collaborator Author

updated

@qywu qywu left a comment

Collaborator

HJSang added a commit that referenced this pull request Jun 13, 2026
… fail-closed KV release, draft-pool repair

- pause: reject a scheduler-level resume while GPU memory is released, and stop
  writing `released` there — set_released() (memory controller) is now its sole
  writer, so the control plane can't desync the data-plane flag (#1).
- memory_occupation: default resume(None) wakes exactly the released tags, not
  all valid tags, so a partial (e.g. weights-only) release round-trips (#2).
- memory_occupation + event_loop: fail-closed kv_cache release. When prefix
  caching is on and the scheduler has no prefix-cache reset (none exists today),
  releasing kv_cache/all is rejected up front rather than orphaning stale cache
  entries on wake. Capability is declared once at construction
  (kv_cache_release_allowed), not duck-typed per call (#3).
- event_loop: repair every KV pool after wake (target + draft) via a shared
  _kv_pools() accessor, so spec-decode runs don't read garbage draft KV (#4).
- docker: install libnuma1 (runtime dep of torch_memory_saver, the memory saver
  behind sleep/wake; the base runner image lacks it).
- tests: unit coverage for #1/#2/#3; GPU suite rewritten to cover all fixes plus
  the fail-closed path; v4 + GPU tests set enable_prefix_caching=False and
  disable_kvstore=True (KV release requires both).
- docs: document fail-closed KV release and the prefix-caching/kvstore coupling.

GPU-validated on B200 against a from-source rebuilt image: full release frees
~18 GB, token-identical generation across a sleep cycle, #1 scheduler-resume
reject, #2 default-resume of a partial release, and #3 fail-closed KV release
all pass. (#4 draft-pool path is unit-reasoned; GPU smoke test is opt-in.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
@HJSang HJSang force-pushed the hejian/sleep-wakeup-api branch from b22488d to 3610625 Compare June 13, 2026 21:38

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 361062591a

Comment on lines +131 to +132
if req.tags is None:
    tags = [t for t in VALID_TAGS if t in self.released_tags]

P2: Reject resumes while release is still draining

When a concurrent resume_memory_occupation() arrives before the pending release drain completes, released_tags is still empty, so tags=None is resolved to []; the method later reports success and calls _pause.set_released(False) without cancelling the pending drain. The original drain can then run _finish_release() and free weights/KV after the resume succeeded, so new work may be admitted or the caller may proceed believing the engine is awake. Please reject or cancel resumes while self._pause.is_drain_pending.
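
The race described here disappears if resume checks for a pending drain before it resolves any tags. The sketch below is an assumption about shape only: `ToyMemoryController`, `begin_release`, and `finish_release` are hypothetical, and `is_drain_pending` borrows its name from the comment.

```python
from typing import List, Optional, Tuple

VALID_TAGS = ("weights", "kv_cache")


class ToyMemoryController:
    """Hypothetical model of a release that completes only after a drain."""

    def __init__(self) -> None:
        self.released_tags = set()
        self._pending: Optional[List[str]] = None

    @property
    def is_drain_pending(self) -> bool:
        return self._pending is not None

    def begin_release(self, tags: List[str]) -> None:
        self._pending = list(tags)  # nothing is freed until the drain finishes

    def finish_release(self) -> None:
        self.released_tags.update(self._pending or [])
        self._pending = None

    def resume(self, tags: Optional[List[str]] = None) -> Tuple[bool, str]:
        if self.is_drain_pending:
            # Rejecting here stops a "successful" resume from being
            # followed by the drain freeing memory anyway.
            return False, "release is still draining"
        if tags is None:
            tags = [t for t in VALID_TAGS if t in self.released_tags]
        self.released_tags.difference_update(tags)
        return True, ""


ctl = ToyMemoryController()
ctl.begin_release(["weights"])
early = ctl.resume()   # arrives before the drain completes
ctl.finish_release()
late = ctl.resume()    # arrives after the release has landed
```

Cancelling the pending drain would be the other option the comment mentions; this sketch only shows the reject path.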

Comment on lines +181 to +183
self.is_sleeping_communicator = _Communicator(
    self.engine_core_client.send_to_scheduler, server_args.mapping.attn.dp_size
)

P2: Match is_sleeping fanout to DP control recipients

In attention-DP runs with TP>1, DataParallelController.send_control_message() sends control messages to self.workers[::self.control_message_step] with control_message_step=tp_size, while workers is indexed by DP rank; for example dp_size=4,tp_size=2 sends the request to only two workers. This new communicator still waits for attn.dp_size IsSleepingReqOutputs, so Engine.is_sleeping() can hang forever after receiving only the subset of replies. Either send the request to every DP worker or make the expected fanout match the controller's actual recipients.
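
The mismatch is plain slice arithmetic. The sketch below reproduces the `dp_size=4, tp_size=2` case from the comment; `control_recipients` is a hypothetical helper that models `workers[::control_message_step]` on a list of DP ranks, not the controller's real code.

```python
from typing import List


def control_recipients(dp_size: int, tp_size: int) -> List[int]:
    """DP ranks that would receive a control message under the stride rule."""
    workers = list(range(dp_size))   # indexed by DP rank, per the comment
    control_message_step = tp_size
    return workers[::control_message_step]


dp_size, tp_size = 4, 2
recipients = control_recipients(dp_size, tp_size)
replies_awaited = dp_size                # what the communicator waits for
replies_possible = len(recipients)       # what can actually arrive
would_hang = replies_awaited > replies_possible
```

Here `recipients` is `[0, 2]`, so a communicator sized to `dp_size` waits for four replies when only two can arrive.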

HJSang added a commit that referenced this pull request Jun 13, 2026
… fail-closed KV release, draft-pool repair

- pause: reject a scheduler-level resume while GPU memory is released, and stop
  writing `released` there — set_released() (memory controller) is now its sole
  writer, so the control plane can't desync the data-plane flag (#1).
- memory_occupation: default resume(None) wakes exactly the released tags, not
  all valid tags, so a partial (e.g. weights-only) release round-trips (#2).
- memory_occupation + event_loop: fail-closed kv_cache release. When prefix
  caching is on and the scheduler has no prefix-cache reset (none exists today),
  releasing kv_cache/all is rejected up front rather than orphaning stale cache
  entries on wake. Capability is declared once at construction
  (kv_cache_release_allowed), not duck-typed per call (#3).
- event_loop: repair every KV pool after wake (target + draft) via a shared
  _kv_pools() accessor, so spec-decode runs don't read garbage draft KV (#4).
- docker: install libnuma1 (runtime dep of torch_memory_saver, the memory saver
  behind sleep/wake; the base runner image lacks it).
- tests: unit coverage for #1/#2/#3; GPU suite rewritten to cover all fixes plus
  the fail-closed path; v4 + GPU tests set enable_prefix_caching=False and
  disable_kvstore=True (KV release requires both).

GPU-validated on B200 against a from-source rebuilt image: full release frees
~18 GB, token-identical generation across a sleep cycle, #1 scheduler-resume
reject, #2 default-resume of a partial release, and #3 fail-closed KV release
all pass. (#4 draft-pool path is unit-reasoned; GPU smoke test is opt-in.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
@HJSang HJSang force-pushed the hejian/sleep-wakeup-api branch from 3610625 to aeddd4c Compare June 13, 2026 22:25

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aeddd4c479

with self.memory_saver_adapter.region():
# Tag as "kv_cache", no CPU backup: KV is discarded on sleep and rebuilt
# after wake (paging overwrites; clear_kv_buffers zeros the remapped pages).
with self.memory_saver_adapter.region(tag="kv_cache", enable_cpu_backup=False):

P2: Keep KV pointer metadata out of discard region

In MHA pools this kv_cache region covers not only k_buffer/v_buffer but also the k_data_ptrs, v_data_ptrs, data_ptrs, and data_strides tensors built just below; because the region is later paused with enable_cpu_backup=False, resuming kv_cache leaves those metadata tensors' contents unspecified. On MHA runs that use the host/retraction path (MemoryExecutor reads device_pool.k_data_ptrs/v_data_ptrs) or _kv_copy_config, the next copy kernel can read invalid addresses/strides even though _kv_repair_after_wake() only zeros KV buffers. Please allocate or rebuild this metadata outside the discarded KV region.

bad = [t for t in tags if t not in VALID_TAGS]
if bad:
    return None, f"invalid tags: {bad!r}; valid: {list(VALID_TAGS)}"
return [t for t in VALID_TAGS if t in tags], None

P2: Reject empty release tag lists

When a caller passes tags=[], normalization returns an empty list with no error, so handle_release() still drains and _finish_release() calls set_released(True) without pausing any tag or adding anything to released_tags. The scheduler then stays PAUSED_ALL while is_sleeping reports false, which is a confusing stuck state for an otherwise valid typed API input; treat an empty list as invalid or as a no-op instead of arming the release drain.
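
Treating an empty list as invalid is a small change to normalization. A minimal sketch, assuming a hypothetical standalone `normalize_release_tags` and an illustrative error string for the empty case; the invalid-tag branch follows the snippet under review, and `tags=None` meaning "all tags" follows the test descriptions elsewhere in this thread.

```python
VALID_TAGS = ("weights", "kv_cache")


def normalize_release_tags(tags):
    """Return (tags, error); exactly one of the two is None."""
    if tags is None:
        return list(VALID_TAGS), None
    if len(tags) == 0:
        # Reject up front so no drain is armed for a release of nothing.
        return None, "tags must not be empty"
    bad = [t for t in tags if t not in VALID_TAGS]
    if bad:
        return None, f"invalid tags: {bad!r}; valid: {list(VALID_TAGS)}"
    return [t for t in VALID_TAGS if t in tags], None


empty = normalize_release_tags([])
reordered = normalize_release_tags(["kv_cache", "weights"])
unknown = normalize_release_tags(["optimizer"])
```

The alternative the comment offers, treating `[]` as a no-op that replies immediately, would avoid the error but still must not pause the scheduler.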

@HJSang HJSang force-pushed the hejian/sleep-wakeup-api branch from aeddd4c to d5a8fa9 Compare June 13, 2026 23:31

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5a8fa9d42

Comment on lines +153 to +155
if not self.released_tags:
    # Fully awake → resume scheduling.
    self._pause.set_released(False)

P2: Preserve scheduler pause on no-op memory resume

When resume_memory_occupation() is called while no memory tags are released, req.tags is None resolves to an empty tags list, but this branch still calls set_released(False), which sets the shared PauseController state to UNPAUSED. In a scheduler that was intentionally paused via pause_scheduler(mode="keep"/"wait"), a no-op memory resume therefore re-admits work without a matching resume_scheduler() call; only clear the release-owned pause state after actually resuming at least one released tag.

@lightseek-bot lightseek-bot requested a review from qywu June 14, 2026 00:03
@HJSang HJSang force-pushed the hejian/sleep-wakeup-api branch from d5a8fa9 to d3e0cb8 Compare June 14, 2026 00:24
Complete TokenSpeed's SGLang-style data-plane sleep/wake: release GPU memory
(offload weights to CPU, discard KV) and resume it without restarting the
process, driven by the RL/RLHF loop. Composes with the control-plane
pause/resume (PR #346) — release auto-pauses + drains the scheduler, then frees
memory region by region.

- io_struct: tags + success on memory-occupation reqs; IsSleeping types.
- memory-saver adapter: thread tag + enable_cpu_backup through region/pause/
  resume (weights -> CPU backup, kv_cache -> discard).
- pause: generalize the deferred reply into _PendingDrain(on_drained,
  on_cancelled); reject a scheduler resume while memory is released, so only
  resume_memory_occupation clears `released` (set_released is its sole writer).
- MemoryOccupationController: release/resume by tag, released_tags tracking,
  is_sleeping. Default resume(None) wakes exactly what is asleep (not all tags).
  Fail-closed kv_cache release when prefix caching is on and the scheduler has
  no prefix-cache reset (kv_cache_release_allowed, decided once at construction)
  — discarded KV would otherwise orphan stale prefix-cache entries on wake.
- event_loop: wire the controller; DP idle-forward gate while released; repair
  every KV pool after wake (target + draft) via a shared _kv_pools() accessor.
- client/engine API: return release/resume outputs; add is_sleeping RPC.
- weights + kv_cache (incl. deepseek_v4) wrapped in region(tag=...).
- build: pin torch_memory_saver==0.0.9.post1 (import-guarded); docker installs
  libnuma1 (its runtime dep, absent from the base runner image).
- tests: pause + memory-occupation unit suites; GPU integration cases
  (release/resume, token-identical sleep cycle, RL multi-stage, error paths,
  fail-closed KV release, scheduler-resume reject); V4 DP4 case. KV release
  requires enable_prefix_caching=False (=> disable_kvstore=True).

Python/engine API only; HTTP deferred (needs external grpc pkgs). GPU-validated
on B200/B300 from a source-rebuilt image: full release frees ~18 GB,
token-identical generation across a sleep cycle, multi-stage tag flow, error
paths, scheduler-resume reject, and fail-closed KV release all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
@HJSang HJSang force-pushed the hejian/sleep-wakeup-api branch from d3e0cb8 to b59c2f6 Compare June 14, 2026 00:29

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b59c2f692b

self._kv_repair()
if not self.released_tags:
    # Fully awake → resume scheduling.
    self._pause.set_released(False)

P2: Preserve existing scheduler pause on wake

When memory release is used after the scheduler was already paused and drained, for example pause_scheduler(mode="wait") followed by release_memory_occupation(tags=["weights"]), this unconditional wake path calls set_released(False), which sets the shared pause state to UNPAUSED. Because the release path never records that the scheduler was paused before the memory sleep, resume_memory_occupation() can re-admit buffered/new generation without the caller ever invoking resume_scheduler(), breaking the scheduler-level pause contract.
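
One possible fix is to record, at release time, whether the release itself caused the pause, and to undo only that on wake. This is a sketch of the idea rather than the merged behavior: `ToyPauseController`, the boolean `paused` flag, and `_paused_by_release` are all hypothetical simplifications of the real pause states.

```python
class ToyPauseController:
    """Hypothetical model: a user-level pause must survive a sleep cycle."""

    def __init__(self) -> None:
        self.paused = False
        self.released = False
        self._paused_by_release = False

    def pause_scheduler(self) -> None:
        self.paused = True

    def resume_scheduler(self) -> bool:
        if self.released:
            return False  # memory is still released
        self.paused = False
        return True

    def set_released(self, released: bool) -> None:
        if released:
            # Remember whether this release is what paused the scheduler.
            self._paused_by_release = not self.paused
            self.paused = True
        else:
            if self._paused_by_release:
                self.paused = False  # undo only the pause we introduced
            self._paused_by_release = False
        self.released = released


user_paused = ToyPauseController()
user_paused.pause_scheduler()
user_paused.set_released(True)
user_paused.set_released(False)

auto_paused = ToyPauseController()
auto_paused.set_released(True)
auto_paused.set_released(False)
```

In the first case the scheduler stays paused until `resume_scheduler()` is called; in the second the wake alone re-admits work, matching the auto-pause behavior from the PR summary.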

HJSang commented Jun 14, 2026
Collaborator Author

@qywu thank you for the comments. I have resolved them. Please take another look.

@HJSang HJSang merged commit 424c55f into main Jun 15, 2026
33 of 37 checks passed
@HJSang HJSang deleted the hejian/sleep-wakeup-api branch June 15, 2026 04:00