Motivation
The current diffusion worker/engine implementation has made good progress: the engine/worker split is clear, diffusion lifecycle cleanup has improved, and several engine cleanup PRs are already tracking readability and telemetry work. However, a review of the current main branch and PR #4282 surfaced a few remaining reliability and scalability gaps that are not fully covered by existing issues.
The most important gap is distributed control-plane correctness: some all-rank diffusion RPCs can report success based only on rank 0 even when non-rank-0 workers fail. This can silently leave workers in inconsistent state after control operations such as LoRA add/remove, sleep, wake, or future extension RPCs.
Current status
What looks healthy today:
DiffusionEngine owns request lifecycle, scheduling, postprocessing, and response shaping.
DiffusionWorker owns device/model execution, with DiffusionModelRunner holding most model-specific execution logic.
Multiprocess diffusion execution uses a broadcast/control channel and a rank-0 result path.
Sleep/wake support and cleanup lifecycle have improved through prior work.
Remaining problems:
All-rank RPC failure visibility
MultiprocDiffusionExecutor.collective_rpc() executes some RPCs on all workers but expects only one response, normally from rank 0.
Non-rank-0 workers catch and log exceptions locally, but those failures are not propagated to the caller.
The caller can receive success even if rank 1+ failed, leaving distributed worker state divergent.
This is especially risky for LoRA mutation, sleep/wake, and any future all-rank stateful control RPC.
Implicit diffusion output contract
DiffusionEngine.step() currently handles loose dict outputs with keys such as video, audio, actions, custom_output, fps, and audio_sample_rate.
Multi-prompt output splitting and batch-scoped metadata handling live in the same broad method.
This works for current models but becomes brittle as image/video/audio/action outputs and model-specific postprocess paths grow.
Residual shutdown edge
close() can return after the background thread join timeout while scheduler/executor shutdown is deferred. That may be intentional, but the engine state and resource ownership should be explicit so callers know whether the engine is fully closed, failed, or pending retry cleanup.
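To make the "explicit state" idea concrete, here is a minimal sketch of a close path that reports which of the three outcomes occurred. It is an illustration only: `EngineCloseState`, `close_with_state`, and the `shutdown` callback are hypothetical names and do not reflect the actual DiffusionEngine API.

```python
import enum
import threading
from typing import Callable


class EngineCloseState(enum.Enum):
    """Hypothetical states a caller could observe after close()."""
    CLOSED = "closed"                # thread joined and shutdown completed
    CLOSE_PENDING = "close_pending"  # join timed out; shutdown deferred
    FAILED = "failed"                # shutdown itself raised


def close_with_state(
    thread: threading.Thread,
    shutdown: Callable[[], None],
    join_timeout: float,
) -> EngineCloseState:
    """Join the background thread, then shut down, reporting the outcome."""
    thread.join(timeout=join_timeout)
    if thread.is_alive():
        # Assumption: resources stay owned by the engine until a later retry.
        return EngineCloseState.CLOSE_PENDING
    try:
        shutdown()
    except Exception:
        return EngineCloseState.FAILED
    return EngineCloseState.CLOSED
```

With a return value like this, a caller can distinguish a deferred shutdown from a completed one instead of inferring it from logs.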
Worker API contract cleanup
Some worker APIs have unclear contracts; for example, sleep() is annotated like a boolean operation but returns byte counts.
LoRA public methods assume a LoRA manager exists; models without one can fail via AttributeError rather than returning a clear unsupported-operation result.
Needs compatibility shims for existing model postprocess functions.
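As a rough illustration of the worker API contract points above, the sketch below gives sleep() a return type that matches what it returns and makes LoRA calls fail with a dedicated error when no manager exists. Everything here is hypothetical (`SketchWorker`, `SleepResult`, `UnsupportedOperationError`, and the manager's `add` method); it is not the DiffusionWorker implementation, and raising versus returning a result object is a design choice left open.

```python
from dataclasses import dataclass
from typing import Any, Optional


class UnsupportedOperationError(RuntimeError):
    """Hypothetical error for operations the loaded model cannot perform."""


@dataclass
class SleepResult:
    # Bytes released by sleep(); replaces an ambiguous bool-like annotation.
    freed_bytes: int


class SketchWorker:
    """Toy stand-in for a worker; not the real DiffusionWorker."""

    def __init__(self, lora_manager: Optional[Any] = None, cached_bytes: int = 0):
        self._lora_manager = lora_manager
        self._cached_bytes = cached_bytes

    def sleep(self) -> SleepResult:
        # Pretend to release cached memory and report how much was freed.
        freed, self._cached_bytes = self._cached_bytes, 0
        return SleepResult(freed_bytes=freed)

    def add_lora(self, name: str) -> bool:
        if self._lora_manager is None:
            # Explicit failure instead of an AttributeError on None.
            raise UnsupportedOperationError(
                "add_lora is unsupported: this model has no LoRA manager"
            )
        self._lora_manager.add(name)
        return True
```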
Related issues and PRs
DiffusionEngine cleanup: concurrency, telemetry, CPU utilization, and step readability.
DiffusionEngine.
Proposed fix options
Option A: Minimal correctness fix for all-rank RPCs
Add per-rank success/error aggregation for all-rank collective_rpc() calls while preserving the current request protocol.
Possible shape:
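For discussion purposes, per-rank aggregation could look roughly like the sketch below. It is a self-contained toy, not the MultiprocDiffusionExecutor code: `RankResult`, `CollectiveRpcError`, `run_on_rank`, and `aggregate` are hypothetical names, and the real transport between workers and the caller is not modeled.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class RankResult:
    rank: int
    ok: bool
    value: Any = None
    error: Optional[str] = None


class CollectiveRpcError(RuntimeError):
    """Raised when any rank fails or does not respond (hypothetical)."""

    def __init__(self, failures: List[RankResult]):
        self.failures = failures
        detail = "; ".join(f"rank {f.rank}: {f.error}" for f in failures)
        super().__init__(f"collective RPC failed on {len(failures)} rank(s): {detail}")


def run_on_rank(rank: int, fn: Callable[[int], Any]) -> RankResult:
    """Worker side: capture the exception instead of only logging it."""
    try:
        return RankResult(rank, True, value=fn(rank))
    except Exception as exc:
        return RankResult(rank, False, error=f"{type(exc).__name__}: {exc}")


def aggregate(results: List[RankResult], world_size: int) -> Any:
    """Caller side: succeed only if every rank reported success."""
    by_rank = {r.rank: r for r in results}
    failures = [r for r in results if not r.ok]
    for missing in sorted(set(range(world_size)) - set(by_rank)):
        failures.append(RankResult(missing, False, error="no response"))
    if failures:
        raise CollectiveRpcError(sorted(failures, key=lambda r: r.rank))
    # Preserve the existing convention of returning the rank-0 value.
    return by_rank[0].value
```

The caller still receives the rank-0 value on success, so existing call sites would not need to change; only the failure path becomes visible.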
Pros:
Cons:
Option B: Typed control-plane result envelope
Introduce a structured result envelope for diffusion RPCs, for example:
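One possible shape for such an envelope is sketched below, purely as a starting point for discussion. The names `RpcStatus`, `RankOutcome`, and `RpcResultEnvelope`, and the choice of status values, are assumptions and not an existing API. Unlike a raise-on-failure approach, the envelope is returned to the caller, who decides how to react.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List, Optional


class RpcStatus(str, Enum):
    OK = "ok"
    ERROR = "error"
    UNSUPPORTED = "unsupported"


@dataclass
class RankOutcome:
    rank: int
    status: RpcStatus
    payload: Any = None
    error: Optional[str] = None


@dataclass
class RpcResultEnvelope:
    method: str
    outcomes: List[RankOutcome] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # An empty envelope is not a success: at least one rank must answer.
        return bool(self.outcomes) and all(
            o.status is RpcStatus.OK for o in self.outcomes
        )

    def failed_ranks(self) -> List[int]:
        return [o.rank for o in self.outcomes if o.status is not RpcStatus.OK]
```

A caller could then branch on `envelope.ok` and log `envelope.failed_ranks()` without parsing exception strings.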
Pros:
Cons:
Option C: Align with broader stage-runtime control-plane refactor
Fold the fix into the direction of #3855, where stage runtime and distributed replica ownership are being cleaned up.
Pros:
Cons:
Option D: Typed diffusion output/postprocess envelope
Introduce a typed output object for diffusion postprocess/model outputs, separate from the RPC fix. For example:
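A minimal sketch of what a typed output could look like follows. The field names mirror the dict keys that DiffusionEngine.step() handles today (video, audio, actions, custom_output, fps, audio_sample_rate); the class name `DiffusionOutput`, the field types, and the `from_legacy_dict` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class DiffusionOutput:
    # Payload types are left as Any because the real tensor/array types
    # are model-specific and not specified here.
    video: Any = None
    audio: Any = None
    actions: Any = None
    custom_output: Any = None
    fps: Optional[float] = None
    audio_sample_rate: Optional[int] = None

    @classmethod
    def from_legacy_dict(cls, raw: Dict[str, Any]) -> "DiffusionOutput":
        """Convert a loose dict output, rejecting keys the envelope does not know."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise KeyError(f"unrecognized diffusion output keys: {sorted(unknown)}")
        return cls(**raw)
```

A helper like `from_legacy_dict` would let existing postprocess functions keep returning dicts during a migration while surfacing misspelled or unexpected keys early.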
Pros:
DiffusionEngine.step() branching.
Cons:
Recommendation
Prioritize Option A first to close the distributed correctness hole. Then decide whether Option B should be the stable control-plane contract or whether it should be folded into #3855. In parallel, #2703/#2694 can continue improving output splitting, but a typed output envelope should be considered before more model-specific dict keys are added to DiffusionEngine.step().
Discussion period
Please use this issue for design discussion from June 13, 2026 through June 27, 2026. After that, we should decide whether to: