vllm-project
diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/serving/videos_api.md b/docs/serving/videos_api.md
Lines changed: 44 additions & 3 deletions
diff --git a/docs/user_guide/quantization/fp8.md b/docs/user_guide/quantization/fp8.md
Lines changed: 44 additions & 0 deletions
diff --git a/examples/offline_inference/cosmos3/inputs/v2v.json b/examples/offline_inference/cosmos3/inputs/v2v.json
Lines changed: 14 additions & 0 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 6 additions & 0 deletions
diff --git a/recipes/README.md b/recipes/README.md
Lines changed: 2 additions & 2 deletions
@@ -33,7 +33,7 @@ th {
 | `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
-| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
+| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, V2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
 | `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
 | `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |
 
@@ -67,8 +67,9 @@ curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `input_reference` | file | null | Uploaded reference image for image-to-video requests |
-| `image_reference` | string | null | JSON-encoded reference image payload; do not combine with `input_reference` |
+| `input_reference` | file | null | Uploaded reference image or video for image-to-video/video-to-video requests |
+| `image_reference` | string | null | JSON-encoded reference image payload; do not combine with `input_reference` or `video_reference` |
+| `video_reference` | string | null | JSON-encoded reference video payload; do not combine with `input_reference` or `image_reference` |
 | `width` | integer | model default | Output video width |
 | `height` | integer | model default | Output video height |
 | `num_frames` | integer | model default | Number of generated frames |
@@ -80,6 +81,8 @@ curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
 | `flow_shift` | number | null | Scheduler flow-shift value |
 | `true_cfg_scale` | number | null | True CFG scale when supported by the model |
 | `seed` | integer | null | Random seed for reproducibility |
+| `generate_sound` | boolean | false | Request model-generated audio for video models that support sound generation |
+| `sound_duration` | number | null | Duration in seconds for generated audio; defaults to generated video duration |
 | `negative_prompt` | string | null | Text describing what to avoid in the generated video |
 | `enable_frame_interpolation` | boolean | null | Enable post-generation frame interpolation |
 | `frame_interpolation_exp` | integer | null | Interpolation exponent; `1=2x`, `2=4x`, and so on |
@@ -123,6 +126,41 @@ curl -s http://localhost:8091/v1/videos \
   -F "fps=16"
 ```
 
+### Video-to-Video
+
+For models that support video conditioning, upload the reference video with
+`input_reference`:
+
+```bash
+curl -s http://localhost:8091/v1/videos \
+  -F "prompt=continue this motion with consistent subjects and lighting" \
+  -F "input_reference=@input.mp4;type=video/mp4" \
+  -F "width=1280" \
+  -F "height=720" \
+  -F "num_frames=80" \
+  -F "fps=16"
+```
+
+You can also pass a JSON-safe video URL or `data:video/...;base64,...` payload
+through `video_reference`. Do not send `video_reference` together with
+`input_reference` or `image_reference`.
+
+```bash
+curl -s http://localhost:8091/v1/videos \
+  -F "prompt=continue this motion with consistent subjects and lighting" \
+  -F 'video_reference={"video_url":"https://example.com/input.mp4"}' \
+  -F "width=1280" \
+  -F "height=720" \
+  -F "num_frames=80" \
+  -F "fps=16"
+```
+
+JSON references currently support `image_url`/`video_url`; `file_id` references
+are not implemented yet. Models may expose additional V2V controls through
+`extra_params`. For example, Cosmos3 supports
+`condition_frame_indexes_vision` and `condition_video_keep` to select which
+decoded reference frames are used as clean conditioning.
+
 ### Synchronous Generation
 
 ```bash
@@ -146,7 +184,10 @@ export VLLM_OMNI_STORAGE_PATH=/var/tmp/vllm-omni-videos
 
 ## Model-Specific Examples
 
-For complete text-to-video and image-to-video walkthroughs, see:
+For complete text-to-video, image-to-video, and model-specific video-to-video
+walkthroughs, see:
 
 - [Text-to-Video](../user_guide/examples/online_serving/text_to_video.md)
 - [Image-to-Video](../user_guide/examples/online_serving/image_to_video.md)
+- [Cosmos3 recipes](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md)
+  for model-specific video-to-video examples and conditioning controls
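
The "do not combine" rule for `input_reference`, `image_reference`, and `video_reference` can be checked on the client before a request is sent. The sketch below is illustrative only: `build_video_form` is a hypothetical helper, not part of vLLM-Omni, and it only assembles the multipart form fields described in the parameter table. How the server validates these fields is not shown in this diff.

```python
import json


def build_video_form(prompt, *, input_reference=None, image_url=None,
                     video_url=None, **params):
    """Assemble form fields for POST /v1/videos (hypothetical client helper).

    At most one of input_reference, image_url, or video_url may be set,
    mirroring the "do not combine" rule in the parameter table.
    """
    chosen = [name for name, value in (
        ("input_reference", input_reference),
        ("image_reference", image_url),
        ("video_reference", video_url),
    ) if value is not None]
    if len(chosen) > 1:
        raise ValueError(f"conflicting reference fields: {', '.join(chosen)}")

    fields = {"prompt": prompt}
    if input_reference is not None:
        # Path of a local file; the caller uploads it as a multipart file part.
        fields["input_reference"] = input_reference
    if image_url is not None:
        fields["image_reference"] = json.dumps({"image_url": image_url})
    if video_url is not None:
        fields["video_reference"] = json.dumps({"video_url": video_url})
    # Remaining parameters (width, height, num_frames, fps, ...) go as strings.
    fields.update({key: str(value) for key, value in params.items()})
    return fields


form = build_video_form(
    "continue this motion with consistent subjects and lighting",
    video_url="https://example.com/input.mp4",
    width=1280, height=720, num_frames=80, fps=16,
)
print(form["video_reference"])
```

Raising on the client keeps a conflicting request from being uploaded at all, which matters most when `input_reference` is a large video file.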
@@ -28,6 +28,50 @@ in deep DiT blocks.
 Legend: `✅` supported, `❌` unsupported, `⭕` not verified in this
 guide. FP8 on Ampere may use a weight-only path where available.
 
+### Faster FP8 GEMM on Blackwell (quack)
+
+On Blackwell (SM 100+), vLLM runs FP8 linears through the FlashInfer kernel, which
+applies the bias in a separate kernel after the GEMM. On the small GEMMs in video
+DiTs, this bias add is a significant overhead. Installing the optional `quack` kernel
+lets vLLM-Omni fuse `alpha * (A @ B) + bias` into a single CuteDSL GEMM, removing
+that overhead (e.g. HunyuanVideo-1.5 FP8 goes from slower than BF16 to faster).
+
+```bash
+# CUDA 12.9
+pip install 'vllm-omni[quack]'
+
+# CUDA 13.x
+pip install 'quack-kernels[cu13]' --extra-index-url https://download.pytorch.org/whl/cu130
+```
+
+It is enabled automatically once installed (no flag needed) and is **Blackwell-only**:
+on Hopper/Ada the CUTLASS FP8 kernel already fuses bias, so quack is not used there.
+Set `VLLM_OMNI_USE_QUACK_FP8=0` to force the FlashInfer path. If `quack-kernels` is
+not installed, FP8 still works; it just stays on the unfused FlashInfer path.
+
+#### Compile cache and warmup
+
+quack JIT-compiles its kernel once per distinct GEMM shape (tens of seconds, longer
+the first time across all autotuned configs). The compiled `.o` files are cached on
+disk and reused on later runs, so this is a one-time cost, **not** a per-request one.
+
+vLLM-Omni points that cache at `~/.cache/vllm_omni/quack` (override with
+`QUACK_CACHE_DIR`) instead of quack's default under `/tmp`, so it survives restarts.
+In containers, set `QUACK_CACHE_DIR` to a mounted/persistent path (or bake the
+populated cache into the image) so the first cold start does not recompile. The
+engine's startup dummy run already exercises the kernels, so with a warm cache the
+first real request is fast.
+
+To pre-warm specific shapes (e.g. at image build time):
+
+```python
+from vllm_omni.quantization.quack_fp8 import warmup_quack_fp8
+# (M, K, N) per linear; M = number of tokens for your resolution/frame count
+warmup_quack_fp8([(14040, 2048, 6144), (14040, 2048, 2048)])
+```
+
+> The PyPI package is `quack-kernels` (imported as `quack`); plain
+> `pip install quack` installs an unrelated statistics library. Requires CUDA 12.9+
+> and Python 3.12.
+
 ## Model Type Support
 
 ### Diffusion Model (Qwen-Image, Wan2.2)
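
The `M` dimension in the warmup shapes above is the number of tokens the DiT processes, which depends on the output resolution, the frame count, and the model's VAE compression and patch size. A minimal sketch of that arithmetic follows. The helper names `dit_token_count` and `warmup_shapes` and the default compression factors are assumptions chosen for illustration; they are not taken from this diff, so check the model's config for the real values.

```python
def dit_token_count(height, width, num_frames, *,
                    spatial_compression=16, temporal_compression=4,
                    patch_size=(1, 1, 1)):
    """Estimate the DiT sequence length (the GEMM M dimension).

    Assumes a causal video VAE that maps F frames to (F - 1) // t + 1 latent
    frames and divides height and width by the spatial compression factor.
    All default factors are illustrative assumptions.
    """
    pt, ph, pw = patch_size
    latent_frames = (num_frames - 1) // temporal_compression + 1
    latent_h = height // spatial_compression
    latent_w = width // spatial_compression
    return (latent_frames // pt) * (latent_h // ph) * (latent_w // pw)


def warmup_shapes(m, kn_pairs):
    """Pair one token count with each linear layer's (K, N) dimensions."""
    return [(m, k, n) for k, n in kn_pairs]


m = dit_token_count(480, 832, 33)
shapes = warmup_shapes(m, [(2048, 6144), (2048, 2048)])
print(shapes)
```

With these assumed factors, a 480x832 request with 33 frames gives `M = 14040`, which happens to match the shapes in the pre-warm example. The resulting list has the `(M, K, N)` layout that `warmup_quack_fp8` takes.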
 
@@ -0,0 +1,14 @@
+{
+    "prompt": "A robotic arm, primarily white with black joints and cables, is shown in a clean, modern indoor setting with a white tabletop. The arm, equipped with a gripper holding a small, light green pitcher, is positioned above a clear glass containing a reddish-brown liquid and a spoon. The robotic arm is in the process of pouring a transparent liquid into the glass. To the left of the pitcher, there is an opened jar with a similar reddish-brown substance visible through its transparent body. In the background, a vase with white flowers and a brown couch are partially visible, adding to the contemporary ambiance. The lighting is bright, casting soft shadows on the table. The robotic arm's movements are smooth and controlled, demonstrating precision in its task. As the video progresses, the robotic arm completes the pour, leaving the glass half-filled with the reddish-brown liquid. The jar remains untouched throughout the sequence, and the spoon inside the glass remains stationary. The other robotic arm on the right side also stays stationary throughout the video. The final frame captures the robotic arm with the pitcher finishing the pour, with the glass now filled to a higher level, while the pitcher is slightly tilted but still held securely by the gripper.",
+    "negative_prompt": "",
+    "vision_path": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4",
+    "height": 720,
+    "width": 1280,
+    "num_frames": 189,
+    "num_inference_steps": 35,
+    "guidance_scale": 6.0,
+    "flow_shift": 10.0,
+    "fps": 24,
+    "condition_frame_indexes_vision": [0, 1],
+    "condition_video_keep": "first"
+}
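
An input file like this one can be sanity-checked offline before a long generation run. The sketch below uses a shortened copy of the JSON and a hypothetical `check_v2v_input` helper. The set of required keys and the non-negative check on `condition_frame_indexes_vision` are assumptions made for illustration, not documented Cosmos3 behavior.

```python
import json

# Keys this sketch treats as mandatory (an assumption, not a documented schema).
REQUIRED_KEYS = {"prompt", "vision_path", "height", "width", "num_frames", "fps"}


def check_v2v_input(raw):
    """Parse a V2V input JSON string and return (config, duration_seconds)."""
    config = json.loads(raw)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    indexes = config.get("condition_frame_indexes_vision", [])
    if any(not isinstance(i, int) or i < 0 for i in indexes):
        raise ValueError("condition frame indexes must be non-negative integers")
    # Clip length implied by the frame count and frame rate.
    duration = config["num_frames"] / config["fps"]
    return config, duration


raw = """{
    "prompt": "A robotic arm pours liquid into a glass.",
    "vision_path": "robot_pouring.mp4",
    "height": 720, "width": 1280,
    "num_frames": 189, "fps": 24,
    "condition_frame_indexes_vision": [0, 1],
    "condition_video_keep": "first"
}"""
config, duration = check_v2v_input(raw)
print(f"{config['width']}x{config['height']}, {duration:.3f} s")
```

For the values in this file, 189 frames at 24 fps corresponds to a 7.875-second clip.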
@@ -90,6 +90,12 @@ minicpmo = [
     "stepaudio2-minicpmo",
 ]
 
+# Optional Blackwell-only fused-bias FP8 GEMM; package is `quack-kernels` (imported
+# as `quack`), not the unrelated `quack`. See docs/user_guide/quantization/fp8.md.
+quack = [
+    "quack-kernels>=0.3.11",
+]
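
Because the distribution name (`quack-kernels`) differs from the import name (`quack`), it can help to confirm which package is actually installed. The sketch below uses only the standard library. `quack_kernels_version` and `quack_fp8_requested` are hypothetical helpers that approximate the two conditions stated in the FP8 docs (package installed, `VLLM_OMNI_USE_QUACK_FP8` not set to `0`). They do not reproduce vLLM-Omni's real selection logic and leave out the Blackwell architecture check.

```python
import os
from importlib import metadata


def quack_kernels_version():
    """Return the installed quack-kernels version, or None if it is absent.

    Looking up the distribution name avoids mistaking the unrelated `quack`
    package for the kernel library.
    """
    try:
        return metadata.version("quack-kernels")
    except metadata.PackageNotFoundError:
        return None


def quack_fp8_requested(env=None):
    """Return True if quack-kernels is installed and not disabled by env var.

    Treating an unset variable as enabled is an assumption based on the docs'
    statement that the kernel is used automatically once installed.
    """
    env = os.environ if env is None else env
    if env.get("VLLM_OMNI_USE_QUACK_FP8", "1") == "0":
        return False
    return quack_kernels_version() is not None


print("quack-kernels version:", quack_kernels_version())
print("quack FP8 path requested:", quack_fp8_requested())
```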
+
 docs = [
     "mkdocs>=1.5.0",
     "mkdocs-api-autonav",
 
@@ -36,8 +36,8 @@ recipes/
 | [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
 | [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
 | [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
-| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text to video with sound, action policy  | 1x H200 141GB / B300 |
-| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) / Action policy | 8x H200/H100/A100 / 2x H200 / B300 |
+| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video, video-to-video generation, text to video with sound, action policy | 1x H200 141GB / B300 |
+| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V / V2V generation (+ optional audio) / Action policy | 8x H200/H100/A100 / 2x H200 / B300 |
 | [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
 | [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
 | [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |