Skip to content

[RFC/Bug] Improve diffusion worker/engine control-plane reliability and output contracts #4400

@hsliuustc0106

Description

@hsliuustc0106

Motivation

The current diffusion worker/engine implementation has made good progress: the engine/worker split is clear, diffusion lifecycle cleanup has improved, and several engine cleanup PRs are already tracking readability and telemetry work. However, a review of the current main branch and PR #4282 surfaced a few remaining reliability and scalability gaps that are not fully covered by existing issues.

The most important gap is distributed control-plane correctness: some all-rank diffusion RPCs can report success based only on rank 0 even when non-rank-0 workers fail. This can silently leave workers in inconsistent state after control operations such as LoRA add/remove, sleep, wake, or future extension RPCs.

Current status

What looks healthy today:

  • DiffusionEngine owns request lifecycle, scheduling, postprocessing, and response shaping.
  • DiffusionWorker owns device/model execution, with DiffusionModelRunner holding most model-specific execution logic.
  • Multiprocess diffusion execution uses a broadcast/control channel and a rank-0 result path.
  • Sleep/wake support and cleanup lifecycle have improved through prior work.

Remaining problems:

  1. All-rank RPC failure visibility

    • MultiprocDiffusionExecutor.collective_rpc() executes some RPCs on all workers but expects only one response, normally from rank 0.
    • Non-rank workers catch/log exceptions locally, but failures are not propagated to the caller.
    • The caller can receive success even if rank 1+ failed, leaving distributed worker state divergent.
    • This is especially risky for LoRA mutation, sleep/wake, and any future all-rank stateful control RPC.
  2. Implicit diffusion output contract

    • DiffusionEngine.step() currently handles loose dict outputs with keys such as video, audio, actions, custom_output, fps, and audio_sample_rate.
    • Multi-prompt output splitting and batch-scoped metadata handling live in the same broad method.
    • This works for current models but becomes brittle as image/video/audio/action outputs and model-specific postprocess paths grow.
  3. Residual shutdown edge

    • PR Fix diffusion engine cleanup lifecycle #3494 fixed several important cleanup lifecycle bugs and is already merged.
    • A remaining edge is that close() can return after the background thread join timeout while scheduler/executor shutdown is deferred. That may be intentional, but the engine state and resource ownership should be explicit so callers know whether the engine is fully closed, failed, or pending retry cleanup.
  4. Worker API contract cleanup

    • Some worker APIs have unclear contracts, for example sleep() is annotated like a boolean operation while returning byte counts.
    • LoRA public methods assume a LoRA manager exists; models without one can fail via AttributeError rather than returning a clear unsupported-operation result.

Related issues and PRs

Proposed fix options

Option A: Minimal correctness fix for all-rank RPCs

Add per-rank success/error aggregation for all-rank collective_rpc() calls while preserving the current request protocol.

Possible shape:

  • Each worker sends a small RPC status envelope for all-rank control calls.
  • Rank 0 or the executor aggregates all statuses.
  • The caller receives success only if every rank succeeded.
  • Any rank failure includes rank id, method name, and traceback/error string.

Pros:

  • Lowest-risk path.
  • Directly fixes the highest-priority correctness issue.
  • Can be covered with unit tests by forcing a non-rank-0 RPC failure.

Cons:

  • Keeps the existing rank-0-centric result path and may add special cases.

Option B: Typed control-plane result envelope

Introduce a structured result envelope for diffusion RPCs, for example:

@dataclass
class DiffusionRPCResult:
    ok: bool
    method: str
    rank_results: dict[int, RankRPCResult]
    value: Any | None = None

Pros:

  • Makes all-rank vs single-rank semantics explicit.
  • Gives future control APIs a safer contract.
  • Easier to extend for sleep/wake/LoRA/health-check status.

Cons:

  • Moderate migration cost across executor, worker, and stage proc layers.

Option C: Align with broader stage-runtime control-plane refactor

Fold the fix into the direction of #3855, where stage runtime and distributed replica ownership are being cleaned up.

Pros:

  • Better long-term architecture.
  • Can unify diffusion and LLM control-plane behavior.
  • Avoids adding another short-lived protocol layer.

Cons:

  • Larger scope.
  • Does not immediately protect current all-rank diffusion RPCs unless prioritized inside that PR.

Option D: Typed diffusion output/postprocess envelope

Introduce a typed output object for diffusion postprocess/model outputs, separate from the RPC fix. For example:

@dataclass
class DiffusionPostprocessResult:
    video: Any | None = None
    audio: Any | None = None
    actions: Any | None = None
    custom_output: dict[str, Any] | None = None
    fps: float | None = None
    audio_sample_rate: int | None = None

Pros:

Cons:

  • Needs compatibility shims for existing model postprocess functions.

Recommendation

Prioritize Option A first to close the distributed correctness hole. Then decide whether Option B should be the stable control-plane contract or whether it should be folded into #3855. In parallel, #2703/#2694 can continue improving output splitting, but a typed output envelope should be considered before more model-specific dict keys are added to DiffusionEngine.step().

Discussion period

Please use this issue for design discussion from June 13, 2026 through June 27, 2026. After that, we should decide whether to:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is neededhigh priorityhigh priority issue, needs to be done asap

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions