[RFC/Bug] Improve diffusion worker/engine control-plane reliability and output contracts

## Motivation

The current diffusion worker/engine implementation has made good progress: the engine/worker split is clear, diffusion lifecycle cleanup has improved, and several engine cleanup PRs are already tracking readability and telemetry work. However, a review of the current `main` branch and PR #4282 surfaced a few remaining reliability and scalability gaps that are not fully covered by existing issues.

The most important gap is distributed control-plane correctness: some all-rank diffusion RPCs can report success based only on rank 0 even when non-rank-0 workers fail. This can silently leave workers in inconsistent state after control operations such as LoRA add/remove, sleep, wake, or future extension RPCs.

## Current status

What looks healthy today:

- `DiffusionEngine` owns request lifecycle, scheduling, postprocessing, and response shaping.
- `DiffusionWorker` owns device/model execution, with `DiffusionModelRunner` holding most model-specific execution logic.
- Multiprocess diffusion execution uses a broadcast/control channel and a rank-0 result path.
- Sleep/wake support and cleanup lifecycle have improved through prior work.

Remaining problems:

1. **All-rank RPC failure visibility**
   - `MultiprocDiffusionExecutor.collective_rpc()` executes some RPCs on all workers but expects only one response, normally from rank 0.
   - Non-rank workers catch/log exceptions locally, but failures are not propagated to the caller.
   - The caller can receive success even if rank 1+ failed, leaving distributed worker state divergent.
   - This is especially risky for LoRA mutation, sleep/wake, and any future all-rank stateful control RPC.

2. **Implicit diffusion output contract**
   - `DiffusionEngine.step()` currently handles loose dict outputs with keys such as `video`, `audio`, `actions`, `custom_output`, `fps`, and `audio_sample_rate`.
   - Multi-prompt output splitting and batch-scoped metadata handling live in the same broad method.
   - This works for current models but becomes brittle as image/video/audio/action outputs and model-specific postprocess paths grow.

3. **Residual shutdown edge**
   - PR #3494 fixed several important cleanup lifecycle bugs and is already merged.
   - A remaining edge is that `close()` can return after the background thread join timeout while scheduler/executor shutdown is deferred. That may be intentional, but the engine state and resource ownership should be explicit so callers know whether the engine is fully closed, failed, or pending retry cleanup.

4. **Worker API contract cleanup**
   - Some worker APIs have unclear contracts, for example `sleep()` is annotated like a boolean operation while returning byte counts.
   - LoRA public methods assume a LoRA manager exists; models without one can fail via `AttributeError` rather than returning a clear unsupported-operation result.

## Related issues and PRs

- #2335 tracks broader `DiffusionEngine` cleanup: concurrency, telemetry, CPU utilization, and step readability.
- #2409 is an open broad refactor for queue-backed engine loop, request/RPC ownership, metrics, and step materialization.
- #2703 is an open focused cleanup for metrics, recursive CPU offload, batched audio slicing, and metadata scope.
- #2694 is a similar/open phase-2 cleanup around recursive CPU offload and batched output splitting.
- #3494 is merged and fixed several cleanup lifecycle defects in `DiffusionEngine`.
- #2022 is merged and introduced Omni sleep mode and ACK protocol.
- #2224 is open for sleep/wake HTTP API exposure.
- #3855 is open and refactors stage runtime and distributed replica control-plane ownership.
- #4282 adds/adjusts diffusion engine/worker paths for Cosmos3/OpenPI and highlighted the need for clearer action/custom output ownership.

## Proposed fix options

### Option A: Minimal correctness fix for all-rank RPCs

Add per-rank success/error aggregation for all-rank `collective_rpc()` calls while preserving the current request protocol.

Possible shape:

- Each worker sends a small RPC status envelope for all-rank control calls.
- Rank 0 or the executor aggregates all statuses.
- The caller receives success only if every rank succeeded.
- Any rank failure includes rank id, method name, and traceback/error string.

Pros:

- Lowest-risk path.
- Directly fixes the highest-priority correctness issue.
- Can be covered with unit tests by forcing a non-rank-0 RPC failure.

Cons:

- Keeps the existing rank-0-centric result path and may add special cases.

### Option B: Typed control-plane result envelope

Introduce a structured result envelope for diffusion RPCs, for example:

```python
@dataclass
class DiffusionRPCResult:
    ok: bool
    method: str
    rank_results: dict[int, RankRPCResult]
    value: Any | None = None
```

Pros:

- Makes all-rank vs single-rank semantics explicit.
- Gives future control APIs a safer contract.
- Easier to extend for sleep/wake/LoRA/health-check status.

Cons:

- Moderate migration cost across executor, worker, and stage proc layers.

### Option C: Align with broader stage-runtime control-plane refactor

Fold the fix into the direction of #3855, where stage runtime and distributed replica ownership are being cleaned up.

Pros:

- Better long-term architecture.
- Can unify diffusion and LLM control-plane behavior.
- Avoids adding another short-lived protocol layer.

Cons:

- Larger scope.
- Does not immediately protect current all-rank diffusion RPCs unless prioritized inside that PR.

### Option D: Typed diffusion output/postprocess envelope

Introduce a typed output object for diffusion postprocess/model outputs, separate from the RPC fix. For example:

```python
@dataclass
class DiffusionPostprocessResult:
    video: Any | None = None
    audio: Any | None = None
    actions: Any | None = None
    custom_output: dict[str, Any] | None = None
    fps: float | None = None
    audio_sample_rate: int | None = None
```

Pros:

- Reduces `DiffusionEngine.step()` branching.
- Makes image/video/audio/action ownership explicit.
- Complements #2703/#2694 output-splitting cleanup.

Cons:

- Needs compatibility shims for existing model postprocess functions.

## Recommendation

Prioritize Option A first to close the distributed correctness hole. Then decide whether Option B should be the stable control-plane contract or whether it should be folded into #3855. In parallel, #2703/#2694 can continue improving output splitting, but a typed output envelope should be considered before more model-specific `dict` keys are added to `DiffusionEngine.step()`.

## Discussion period

Please use this issue for design discussion from **June 13, 2026 through June 27, 2026**. After that, we should decide whether to:

- land a minimal all-rank RPC aggregation fix,
- fold the work into #3855,
- or split the effort into separate RPC-contract and output-contract PRs.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC/Bug] Improve diffusion worker/engine control-plane reliability and output contracts #4400

Motivation

Current status

Related issues and PRs

Proposed fix options

Option A: Minimal correctness fix for all-rank RPCs

Option B: Typed control-plane result envelope

Option C: Align with broader stage-runtime control-plane refactor

Option D: Typed diffusion output/postprocess envelope

Recommendation

Discussion period

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[RFC/Bug] Improve diffusion worker/engine control-plane reliability and output contracts #4400

Description

Motivation

Current status

Related issues and PRs

Proposed fix options

Option A: Minimal correctness fix for all-rank RPCs

Option B: Typed control-plane result envelope

Option C: Align with broader stage-runtime control-plane refactor

Option D: Typed diffusion output/postprocess envelope

Recommendation

Discussion period

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions