[Bug] QwenImage RoPE `txt_seq_lens` uses `mask.sum()` instead of padded embed width (misaligned with diffusers) #4443

### Your current environment

## Summary

vLLM-Omni QwenImage pipelines set `txt_seq_lens` from `prompt_embeds_mask.sum(dim=1)` when preparing generation context / `prepare_encode()`. Diffusers (latest `main`) instead derives RoPE text length from the **padded encoder hidden-states width** (`encoder_hidden_states.shape[1]`) inside `QwenImageTransformer2DModel.forward()` via `compute_text_seq_len_from_mask()`.

When prompt embeddings are wider than the number of valid (non-padding) tokens, vLLM-Omni builds a **too-short** text RoPE table. This is common under **continuous batching (CB)** after in-place padding, in RL training with fixed-width collation, and for any caller that supplies pre-padded embeds. The shorter table changes attention numerics relative to the diffusers/PyTorch reference and can inflate rollout–trainer KL in RL setups.

Verified on **vllm-omni `main` @ `a693ae67`** against **diffusers `main` @ `d1f8e55c3`**.

## Affected code (vllm-omni)

All QwenImage pipeline variants set `txt_seq_lens` from `mask.sum()` in their generation-context helpers:

- `vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py` (`_prepare_generation_context`, used by `forward()`, `diffuse()`, and `prepare_encode()`)
- `vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit.py`
- `vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_edit_plus.py`
- `vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image_layered.py`

```python
# Current (vllm-omni):
txt_seq_lens = prompt_embeds_mask.sum(dim=1).tolist() if prompt_embeds_mask is not None else None
negative_txt_seq_lens = (
    negative_prompt_embeds_mask.sum(dim=1).tolist() if negative_prompt_embeds_mask is not None else None
)
```

`prepare_encode()` copies these values onto `DiffusionRequestState.txt_seq_lens` / `negative_txt_seq_lens`. The stepwise CB path (`InputBatch`) reuses them but **does not refresh them** after `_prepare_request_prompt_field()` pads `prompt_embeds` / masks on the request state in place.

## Reference behavior (diffusers latest main)

`diffusers.models.transformers.transformer_qwenimage.compute_text_seq_len_from_mask()` returns the encoder tensor width for RoPE:

```python
batch_size, text_seq_len = encoder_hidden_states.shape[:2]
# ...
return text_seq_len, per_sample_len, encoder_hidden_states_mask
```

`QwenImageTransformer2DModel.forward()` uses that width:

```python
text_seq_len, _, encoder_hidden_states_mask = compute_text_seq_len_from_mask(
    encoder_hidden_states, encoder_hidden_states_mask
)
image_rotary_emb = self.pos_embed(img_shapes, max_txt_seq_len=text_seq_len, device=hidden_states.device)
```

The diffusers **pipeline** does not pass `txt_seq_lens` into the transformer; RoPE length is inferred from the padded tensor geometry.

## Why this harms precision

`QwenEmbedRope` in vllm-omni slices text frequencies with:

```python
max_len = max(txt_seq_lens)
txt_freqs = self.pos_freqs[max_vid_index : max_vid_index + max_len, ...]
```

Joint attention applies those frequencies to **all** `encoder_hidden_states.shape[1]` text positions (`qwen_image_transformer.py`).

If `max(txt_seq_lens) < encoder_hidden_states.shape[1]`:

1. The RoPE table is too short for the padded embedding width.
2. On the native CPU RoPE path this can raise a sequence-length `RuntimeError` during rotary application.
3. When execution proceeds (or on other backends), positions beyond `max(txt_seq_lens)` do not receive the same RoPE as diffusers — attention logits and denoiser outputs diverge from the reference.

**Concrete example:** 200 valid tokens in a tensor padded to width 1058 → vllm-omni uses RoPE length 200, diffusers uses 1058.

For positions `0..199` the frequency *values* match, because both slices start at the same offset into `pos_freqs`. Positions `200..1057`, however, have no RoPE entry in vllm-omni, while diffusers applies positional encoding across the full padded width. In diffusers, padding tokens are masked out in attention but are still indexed by RoPE.
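The 200-vs-1058 example can be reproduced with a toy model of the slice. This is only a sketch: `pos_freqs` is modeled as a plain list of floats rather than the real frequency tensor, and the table size (4096) and `max_vid_index` value (32) are arbitrary assumptions chosen for illustration.

```python
# Toy stand-in for `pos_freqs`: one float per position instead of a real tensor.
# The table size and `max_vid_index` are arbitrary illustrative values.
pos_freqs = [float(i) for i in range(4096)]
max_vid_index = 32


def slice_txt_freqs(txt_len: int) -> list:
    """Mimic the slice `pos_freqs[max_vid_index : max_vid_index + max_len]`."""
    return pos_freqs[max_vid_index : max_vid_index + txt_len]


valid_tokens, padded_width = 200, 1058

short = slice_txt_freqs(valid_tokens)   # length taken from mask.sum()
full = slice_txt_freqs(padded_width)    # length taken from the padded width

# Same starting offset, so the first 200 rows are identical.
prefix_matches = full[:valid_tokens] == short

# Text positions 200..1057 have no row in the short table.
missing = padded_width - len(short)

print(len(short), len(full), prefix_matches, missing)  # 200 1058 True 858
```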

## CB vs non-CB

### Non-CB (`forward()` / `diffuse()`)

Stock `encode_prompt()` usually produces tensors where `prompt_embeds.shape[1] == prompt_embeds_mask.sum()` for a single naturally-encoded prompt (no extra right-padding). In that narrow case `mask.sum()` and `shape[1]` agree and the bug is latent.

The mismatch appears whenever embedding width exceeds valid token count, e.g.:

- Caller-supplied `prompt_embeds` padded to a fixed `max_model_len`
- Batched / collated training tensors with fixed sequence width
- Any path that right-pads embeddings without making `mask.sum() == shape[1]`

### Continuous batching (CB)

1. `prepare_encode()` stores `txt_seq_lens = [mask.sum()]`.
2. `InputBatch._prepare_request_prompt_field()` may pad `state.prompt_embeds` / masks to a shared `target_seq_len` **without updating** `state.txt_seq_lens`.
3. `denoise_step()` passes the stale `input_batch.txt_seq_lens` into the transformer.

The mismatch is often hidden when a short request is batched with a longer one, because `max(txt_seq_lens)` over the stored values then equals the padded width. The failure mode appears when `max(txt_seq_lens)` stays below the padded embed width (e.g. a single padded request, a fixed-width training pad, or stale per-request values).
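The difference between the hidden and the exposed case comes down to simple arithmetic. In the sketch below, `rope_gap` is a hypothetical helper (not part of vllm-omni) that counts how many padded text positions fall outside the RoPE table implied by the stored `txt_seq_lens`; the lengths 200 and 1058 reuse the example above.

```python
from typing import Sequence


def rope_gap(stored_txt_seq_lens: Sequence[int], padded_width: int) -> int:
    """Hypothetical helper: padded positions not covered by max(txt_seq_lens)."""
    return max(0, padded_width - max(stored_txt_seq_lens))


# A short request batched with a long one: the long request's stored length
# already equals the shared padded width, so the gap is zero and the bug hides.
mixed_batch_gap = rope_gap([200, 1058], padded_width=1058)

# A single request padded to a fixed width: the stored value is stale.
single_padded_gap = rope_gap([200], padded_width=1058)

print(mixed_batch_gap, single_padded_gap)  # 0 858
```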

### CFG (classifier-free guidance)

Both branches are affected independently — positive and negative `txt_seq_lens` / `negative_txt_seq_lens` are each computed via `mask.sum()` in `_prepare_generation_context()` and passed through:

- non-CB: `cfg_parallel.py::diffuse()` (`positive_kwargs` / `negative_kwargs`)
- CB: `denoise_step()` → `_build_denoise_kwargs()` → `predict_noise_maybe_with_cfg()`

Short negative prompts with padded width exhibit the same RoPE length mismatch as the positive branch.

## Suggested fix

Align with diffusers / the verl-omni stepwise workaround:

```python
# In _prepare_generation_context() (all QwenImage pipeline variants):
txt_seq_lens = [int(prompt_embeds.shape[1])] * int(prompt_embeds.shape[0])
if negative_prompt_embeds is not None:
    negative_txt_seq_lens = [int(negative_prompt_embeds.shape[1])] * int(negative_prompt_embeds.shape[0])
else:
    negative_txt_seq_lens = None
```

Optionally also refresh `state.txt_seq_lens` inside `_prepare_request_prompt_field()` after CB padding so request state stays self-consistent.
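As a rough illustration of the optional refresh, the sketch below pads a request in place and then updates its stored length. `ToyRequestState` and `pad_and_refresh` are hypothetical stand-ins, not the real `DiffusionRequestState` or `_prepare_request_prompt_field()`, and the embeddings are modeled as a flat list with one entry per text position.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRequestState:
    """Hypothetical stand-in for the per-request state (not the real class)."""
    prompt_embeds: List[int]   # one placeholder entry per text position
    prompt_mask: List[int]     # 1 = valid token, 0 = padding
    txt_seq_lens: List[int]


def pad_and_refresh(state: ToyRequestState, target_seq_len: int) -> None:
    """Right-pad in place, then refresh txt_seq_lens to the padded width."""
    pad = target_seq_len - len(state.prompt_embeds)
    if pad < 0:
        raise ValueError("target_seq_len is shorter than the existing embeds")
    state.prompt_embeds.extend([0] * pad)
    state.prompt_mask.extend([0] * pad)
    # The step the issue reports as missing: keep the stored length in sync.
    state.txt_seq_lens = [len(state.prompt_embeds)]


state = ToyRequestState([7] * 200, [1] * 200, [200])
pad_and_refresh(state, 1058)
print(state.txt_seq_lens, sum(state.prompt_mask))  # [1058] 200
```

The mask still records 200 valid tokens, so attention masking is unaffected; only the RoPE length follows the padded width.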

A longer-term alternative: stop threading `txt_seq_lens` through the pipeline and infer the RoPE length inside `QwenImageTransformer2DModel.forward()`, the way diffusers does.
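A minimal sketch of that alternative, assuming a hypothetical helper `infer_text_rope_len` that receives the shape of `encoder_hidden_states` as a tuple. It is not the diffusers implementation; it only illustrates that a caller-supplied `txt_seq_lens` would no longer be consulted. The hidden size of 64 in the usage line is a toy value.

```python
from typing import Optional, Sequence, Tuple


def infer_text_rope_len(
    encoder_hidden_states_shape: Tuple[int, ...],
    txt_seq_lens: Optional[Sequence[int]] = None,
) -> int:
    """Hypothetical helper: derive the RoPE text length from tensor geometry.

    `txt_seq_lens` is accepted only so existing call sites keep working;
    it is deliberately ignored.
    """
    if len(encoder_hidden_states_shape) < 2:
        raise ValueError("expected a (batch, seq_len, ...) shape")
    return int(encoder_hidden_states_shape[1])


# A stale per-request value of 200 no longer affects the result.
rope_len = infer_text_rope_len((1, 1058, 64), txt_seq_lens=[200])
print(rope_len)  # 1058
```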

## Impact

- Numerical divergence vs diffusers / PyTorch reference for padded prompts
- RL training: elevated `actor/ppo_kl` at step 1 when comparing vLLM rollout to trainer log-prob recomputation (reported downstream in verl-omni stepwise adapters; fixed there by overriding `txt_seq_lens`, but stock vllm-omni `prepare_encode()` remains affected)

## Environment

- vllm-omni: `main` @ `a693ae67` (2026-06-15)
- diffusers: `main` @ `d1f8e55c3`
- CPU verification: conda env `torch_2.9.0`


### Your code version

<details>
<summary>The commit id or version of vllm</summary>

```text

```
</details>
<details>
<summary>The commit id or version of vllm-omni</summary>

```text

```
</details>


### 🐛 Describe the bug

see above

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.
