[Diffusion] Resolve Anima precision drift via diffusers' apply_rotary_emb

akshatvishu · akshatvishu · commit 9c638d1fa402 · 2026-06-23T01:23:25.000+05:30
Uses diffusers' apply_rotary_emb to upcast RoPE calculations to float32,
resolving the bfloat16 numerical drift vs the reference pipeline.
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
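
To illustrate why the upcast matters, here is a minimal pure-Python sketch, not the diffusers implementation. It emulates bfloat16 rounding with `struct` and compares one rotary rotation computed with every intermediate rounded to bfloat16 against the same rotation computed in high precision and rounded once at the end. The helper names (`to_bf16`, `rotate_bf16`, `rotate_upcast`, `mean_abs_errors`) and the sample inputs are hypothetical and chosen only for illustration. It models a single scalar rotation, not the accumulation across transformer blocks or the effect of CFG.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (via float32, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rotate_bf16(x_real: float, x_imag: float, theta: float) -> float:
    """Real part of the rotation with every intermediate rounded to bfloat16."""
    cos, sin = to_bf16(math.cos(theta)), to_bf16(math.sin(theta))
    return to_bf16(to_bf16(x_real * cos) + to_bf16(-x_imag * sin))


def rotate_upcast(x_real: float, x_imag: float, theta: float) -> float:
    """Same rotation in high precision, rounded to bfloat16 once at the end."""
    return to_bf16(x_real * math.cos(theta) - x_imag * math.sin(theta))


def mean_abs_errors(n: int = 2000) -> tuple[float, float]:
    """Mean absolute error of both paths against a double-precision reference."""
    err_bf16 = err_upcast = 0.0
    for i in range(n):
        theta = 0.37 * i
        # Inputs start as bfloat16 values, like query/key activations.
        x_real = to_bf16(math.sin(1.3 * i))
        x_imag = to_bf16(math.cos(0.7 * i))
        exact = x_real * math.cos(theta) - x_imag * math.sin(theta)
        err_bf16 += abs(rotate_bf16(x_real, x_imag, theta) - exact)
        err_upcast += abs(rotate_upcast(x_real, x_imag, theta) - exact)
    return err_bf16 / n, err_upcast / n


if __name__ == "__main__":
    bf16_err, upcast_err = mean_abs_errors()
    print(f"all-bfloat16 path: {bf16_err:.3e}, upcast path: {upcast_err:.3e}")
```

The upcast path rounds only once, so its per-element error is bounded by half a bfloat16 ulp of the result, while the all-bfloat16 path adds rounding error from the cos/sin tables and from each intermediate product.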
diff --git a/benchmarks/diffusion/README.md b/benchmarks/diffusion/README.md
@@ -149,74 +149,3 @@ batch may still pay compile or CUDA-graph capture cost.
 
 For a Qwen-Image continuous-batching replay example, see
 [`performance_dashboard/qwen_image_serving_performance.md`](./performance_dashboard/qwen_image_serving_performance.md).
-
-## 4. Anima Native Single-File Benchmarking
-
-Native Anima is benchmarked as a text-to-image model through the same serving
-benchmark entrypoint. Unlike standard HuggingFace model IDs, Anima serves the
-raw single-file transformer checkpoint and loads non-denoiser components from a
-Diffusers-layout component directory.
-
-Download the official Anima checkpoint and components first. The commands below
-use `/path/to/models` as a placeholder; replace it with any local directory that
-has enough space for the checkpoint and component files.
-
-```bash
-mkdir -p /path/to/models/anima-official
-mkdir -p /path/to/models/anima-components
-
-hf download circlestone-labs/Anima \
-    split_files/diffusion_models/anima-base-v1.0.safetensors \
-    --local-dir /path/to/models/anima-official
-
-hf download circlestone-labs/Anima-Base-v1.0-Diffusers \
-    --local-dir /path/to/models/anima-components
-
-CHECKPOINT=/path/to/models/anima-official/split_files/diffusion_models/anima-base-v1.0.safetensors
-COMPONENTS=/path/to/models/anima-components
-```
-
-Run these commands from the vLLM-Omni repository in the Python environment or
-container where vLLM-Omni is installed.
-
-Start the server with the checkpoint as `--model` and pass the component
-directory through `--diffusers-load-kwargs`:
-
-```bash
-vllm serve "$CHECKPOINT" \
-    --omni \
-    --port 8099 \
-    --model-class-name AnimaPipeline \
-    --diffusers-load-kwargs "{\"components_path\":\"$COMPONENTS\"}"
-```
-
-Then run the standard diffusion serving benchmark:
-
-```bash
-python3 benchmarks/diffusion/diffusion_benchmark_serving.py \
-    --base-url http://localhost:8099 \
-    --endpoint /v1/chat/completions \
-    --model "$CHECKPOINT" \
-    --task t2i \
-    --dataset random \
-    --num-prompts 10 \
-    --max-concurrency 1 \
-    --warmup-requests 1 \
-    --warmup-concurrency 1 \
-    --width 1024 \
-    --height 1024 \
-    --num-inference-steps 50
-```
-
-This matches the Diffusers baseline defaults for Anima: 1024x1024, 50 denoising
-steps, `max_sequence_length=512`, one image per prompt, empty negative prompt,
-and CFG scale 4.0 from the default guider. Do not pass `guidance_scale` through
-the benchmark unless you are intentionally measuring a non-default CFG setting.
-
-Native Anima currently supports baseline single-GPU execution. Cache-DiT,
-TeaCache, CPU offload, layer-wise offload, quantization, TP/SP, CFG parallel,
-HSDP, and step execution are not supported by `AnimaPipeline` yet.
-
-Anima uses the default single diffusion stage for local single-file checkpoint
-discovery when `--model-class-name AnimaPipeline` is provided; no deploy config
-is required.
diff --git a/vllm_omni/diffusion/models/anima/anima_transformer.py b/vllm_omni/diffusion/models/anima/anima_transformer.py
@@ -7,7 +7,7 @@
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
-from diffusers.models.embeddings import Timesteps
+from diffusers.models.embeddings import Timesteps, apply_rotary_emb
 from diffusers.models.modeling_outputs import Transformer2DModelOutput
 from vllm.model_executor.model_loader.weight_utils import default_weight_loader
 
@@ -32,13 +32,11 @@
 }
 
 
-def _apply_rotary_emb(hidden_states, image_rotary_emb):
-    cos, sin = image_rotary_emb
-    cos = cos[None, :, None, :].to(device=hidden_states.device, dtype=hidden_states.dtype)
-    sin = sin[None, :, None, :].to(device=hidden_states.device, dtype=hidden_states.dtype)
-    x_real, x_imag = hidden_states.reshape(*hidden_states.shape[:-1], 2, -1).unbind(-2)
-    x_rotated = torch.cat([-x_imag, x_real], dim=-1)
-    return hidden_states * cos + x_rotated * sin
+# NOTE: We import and use diffusers' `apply_rotary_emb` instead of a custom native implementation
+# to prevent numerical drift in bfloat16. Diffusers upcasts queries, keys, and rotary frequency
+# tensors to float32 before computing the rotation, and casts back to bfloat16 at the end.
+# Performing the entire computation in bfloat16 accumulates precision errors across the 28
+# transformer blocks, which is heavily amplified by Classifier-Free Guidance (CFG).
 
 
 class CosmosPatchEmbed(nn.Module):
@@ -235,8 +233,10 @@ def _attention(self, hidden_states, encoder_hidden_states=None, attention_mask=N
         key = self.norm_k(key)
 
         if image_rotary_emb is not None:
-            query = _apply_rotary_emb(query, image_rotary_emb)
-            key = _apply_rotary_emb(key, image_rotary_emb)
+            # We use diffusers' apply_rotary_emb to leverage its internal float32 rotation upcasting
+            # logic, resolving the bfloat16 cumulative precision drift vs. the reference pipeline.
+            query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1, use_real_unbind_dim=-2)
+            key = apply_rotary_emb(key, image_rotary_emb, sequence_dim=1, use_real_unbind_dim=-2)
 
         attn_metadata = AttentionMetadata(attn_mask=attention_mask) if attention_mask is not None else None
         hidden_states = self.attn(query, key, value, attn_metadata=attn_metadata)