Commit e715745

Merge branch 'main' into bench/moss-tts
2 parents af1b2fc + b5352a5 commit e715745

21 files changed

Lines changed: 1592 additions & 108 deletions

docs/models/supported_models.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -33,7 +33,7 @@ th {
3333
| `ZImagePipeline` | Z-Image | `Tongyi-MAI/Z-Image-Turbo` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
3434
| `WanPipeline` | Wan2.1-T2V, Wan2.2-T2V, Wan2.2-TI2V | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.1-T2V-14B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers`, `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
3535
| `WanImageToVideoPipeline` | Wan2.2-I2V | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
36-
| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
36+
| `Cosmos3OmniDiffusersPipeline` | Cosmos3 T2I, T2V, I2V, V2V, T2V with sound, action policy | `nvidia/Cosmos3-Nano` | ✅︎ | | | |
3737
| `WanSpeechToVideoPipeline` | Wan2.2-S2V | `Wan-AI/Wan2.2-S2V-14B` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
3838
| `Wan22VACEPipeline` | Wan2.1-VACE | `Wan-AI/Wan2.1-VACE-1.3B-diffusers`, `Wan-AI/Wan2.1-VACE-14B-diffusers` | ✅︎ | ✅︎ | ✅︎ | ✅︎ |
3939
| `LTX2Pipeline` | LTX-2-T2V | `Lightricks/LTX-2` | ✅︎ | ✅︎ | | |

docs/serving/videos_api.md

Lines changed: 44 additions & 3 deletions
@@ -67,8 +67,9 @@ curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
6767

6868
| Parameter | Type | Default | Description |
6969
|-----------|------|---------|-------------|
70-
| `input_reference` | file | null | Uploaded reference image for image-to-video requests |
71-
| `image_reference` | string | null | JSON-encoded reference image payload; do not combine with `input_reference` |
70+
| `input_reference` | file | null | Uploaded reference image or video for image-to-video/video-to-video requests |
71+
| `image_reference` | string | null | JSON-encoded reference image payload; do not combine with `input_reference` or `video_reference` |
72+
| `video_reference` | string | null | JSON-encoded reference video payload; do not combine with `input_reference` or `image_reference` |
7273
| `width` | integer | model default | Output video width |
7374
| `height` | integer | model default | Output video height |
7475
| `num_frames` | integer | model default | Number of generated frames |
@@ -80,6 +81,8 @@ curl -L "http://localhost:8091/v1/videos/${video_id}/content" -o output.mp4
8081
| `flow_shift` | number | null | Scheduler flow-shift value |
8182
| `true_cfg_scale` | number | null | True CFG scale when supported by the model |
8283
| `seed` | integer | null | Random seed for reproducibility |
84+
| `generate_sound` | boolean | false | Request model-generated audio for video models that support sound generation |
85+
| `sound_duration` | number | null | Duration in seconds for generated audio; defaults to generated video duration |
8386
| `negative_prompt` | string | null | Text describing what to avoid in the generated video |
8487
| `enable_frame_interpolation` | boolean | null | Enable post-generation frame interpolation |
8588
| `frame_interpolation_exp` | integer | null | Interpolation exponent; `1=2x`, `2=4x`, and so on |
@@ -123,6 +126,41 @@ curl -s http://localhost:8091/v1/videos \
123126
-F "fps=16"
124127
```
125128

129+
### Video-to-Video
130+
131+
For models that support video conditioning, upload the reference video with
132+
`input_reference`:
133+
134+
```bash
135+
curl -s http://localhost:8091/v1/videos \
136+
-F "prompt=continue this motion with consistent subjects and lighting" \
137+
-F "input_reference=@input.mp4;type=video/mp4" \
138+
-F "width=1280" \
139+
-F "height=720" \
140+
-F "num_frames=80" \
141+
-F "fps=16"
142+
```
143+
144+
You can also pass a JSON-safe video URL or `data:video/...;base64,...` payload
145+
through `video_reference`. Do not send `video_reference` together with
146+
`input_reference` or `image_reference`.
147+
148+
```bash
149+
curl -s http://localhost:8091/v1/videos \
150+
-F "prompt=continue this motion with consistent subjects and lighting" \
151+
-F 'video_reference={"video_url":"https://example.com/input.mp4"}' \
152+
-F "width=1280" \
153+
-F "height=720" \
154+
-F "num_frames=80" \
155+
-F "fps=16"
156+
```
157+
158+
JSON references currently support `image_url`/`video_url`; `file_id` references
159+
are not implemented yet. Models may expose additional V2V controls through
160+
`extra_params`. For example, Cosmos3 supports
161+
`condition_frame_indexes_vision` and `condition_video_keep` to select which
162+
decoded reference frames are used as clean conditioning.
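
The mutual-exclusion rule for reference inputs can be checked on the client before a request is sent. The sketch below is illustrative only: `build_reference_fields` is a hypothetical helper, not part of vLLM-Omni, and it assumes the `video_url`/`image_url` keys shown above are the only JSON fields a reference payload needs.

```python
import json


def build_reference_fields(video_url=None, image_url=None, upload_path=None):
    """Return multipart form fields for at most one reference source.

    Hypothetical client-side helper (not part of vLLM-Omni). It mirrors the
    documented rule that `input_reference`, `image_reference`, and
    `video_reference` must not be combined in one request.
    """
    chosen = [
        name
        for name, value in (
            ("video_reference", video_url),
            ("image_reference", image_url),
            ("input_reference", upload_path),
        )
        if value is not None
    ]
    if len(chosen) > 1:
        raise ValueError(f"send only one reference field, got: {chosen}")

    if video_url is not None:
        # JSON-encoded string, matching the curl example's form value.
        return {"video_reference": json.dumps({"video_url": video_url})}
    if image_url is not None:
        return {"image_reference": json.dumps({"image_url": image_url})}
    if upload_path is not None:
        # The caller attaches the file itself; only the path is recorded here.
        return {"input_reference": upload_path}
    return {}  # no reference: a plain text-to-video request


fields = build_reference_fields(video_url="https://example.com/input.mp4")
print(fields["video_reference"])
```

Rejecting the combination locally gives a clearer error than a failed request, but the server remains the authority on which combinations are accepted.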
163+
126164
### Synchronous Generation
127165

128166
```bash
@@ -146,7 +184,10 @@ export VLLM_OMNI_STORAGE_PATH=/var/tmp/vllm-omni-videos
146184

147185
## Model-Specific Examples
148186

149-
For complete text-to-video and image-to-video walkthroughs, see:
187+
For complete text-to-video, image-to-video, and model-specific video-to-video
188+
walkthroughs, see:
150189

151190
- [Text-to-Video](../user_guide/examples/online_serving/text_to_video.md)
152191
- [Image-to-Video](../user_guide/examples/online_serving/image_to_video.md)
192+
- [Cosmos3 recipes](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md)
193+
for model-specific video-to-video examples and conditioning controls

docs/user_guide/quantization/fp8.md

Lines changed: 44 additions & 0 deletions
@@ -28,6 +28,50 @@ in deep DiT blocks.
2828
Legend: `` supported, `` unsupported, `` not verified in this
2929
guide. FP8 on Ampere may use a weight-only path where available.
3030

31+
### Faster FP8 GEMM on Blackwell (quack)
32+
33+
On Blackwell (SM 100+), vLLM runs FP8 linears through the FlashInfer kernel, which
34+
applies the bias as a separate kernel after the GEMM. On the small GEMMs in video
35+
DiTs this bias add is a significant overhead. Installing the optional `quack` kernel
36+
lets vLLM-Omni fuse `alpha * (A @ B) + bias` into a single CuteDSL GEMM, recovering
37+
that overhead (e.g. HunyuanVideo-1.5 FP8 goes from slower-than-BF16 to faster).
38+
39+
```bash
40+
# CUDA 12.9
41+
pip install 'vllm-omni[quack]'
42+
43+
# CUDA 13.x
44+
pip install 'quack-kernels[cu13]' --extra-index-url https://download.pytorch.org/whl/cu130
45+
```
46+
47+
It is enabled automatically once installed (no flag needed) and is **Blackwell-only**:
48+
on Hopper/Ada the CUTLASS FP8 kernel already fuses bias, so quack is not used there.
49+
Set `VLLM_OMNI_USE_QUACK_FP8=0` to force the FlashInfer path. If `quack-kernels` is
50+
not installed, FP8 still works — it just keeps the unfused FlashInfer path.
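
The selection rule in this paragraph can be summarized as a small predicate. This is a sketch under stated assumptions, not the vLLM-Omni implementation: `should_use_quack_fp8` is a hypothetical name, "Blackwell (SM 100+)" is taken to mean a CUDA compute-capability major version of 10 or higher, and any value other than `0` for `VLLM_OMNI_USE_QUACK_FP8` is treated as enabled.

```python
import importlib.util
import os


def should_use_quack_fp8(sm_major, env=None, quack_installed=None):
    """Hypothetical predicate mirroring the documented selection rule.

    sm_major: compute-capability major version (assumed 10+ for Blackwell).
    env: mapping of environment variables; defaults to os.environ.
    quack_installed: override for tests; by default probes for the
        importable `quack` module shipped by the `quack-kernels` package.
    """
    env = os.environ if env is None else env
    if quack_installed is None:
        quack_installed = importlib.util.find_spec("quack") is not None

    if env.get("VLLM_OMNI_USE_QUACK_FP8", "1") == "0":
        return False  # explicit opt-out forces the FlashInfer path
    if sm_major < 10:
        return False  # Hopper/Ada: CUTLASS FP8 already fuses the bias
    return quack_installed  # missing package: unfused FlashInfer path


print(should_use_quack_fp8(10, env={}, quack_installed=True))
```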
51+
52+
#### Compile cache and warmup
53+
54+
quack JIT-compiles its kernel once per distinct GEMM shape (tens of seconds, longer
55+
the first time across all autotuned configs). The compiled `.o` files are cached on
56+
disk and reused on later runs, so this is a one-time cost — **not** per request.
57+
58+
vLLM-Omni points that cache at `~/.cache/vllm_omni/quack` (override with
59+
`QUACK_CACHE_DIR`) instead of quack's default under `/tmp`, so it survives restarts.
60+
In containers, set `QUACK_CACHE_DIR` to a mounted/persistent path — or bake it into
61+
the image — so the first cold start does not recompile. The engine's startup dummy
62+
run already exercises the kernels, so with a warm cache the first real request is fast.
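
To confirm which directory a deployment will use, the lookup order described above can be reproduced in a few lines. This is a minimal sketch: `resolve_quack_cache_dir` is a hypothetical helper, and it assumes an empty `QUACK_CACHE_DIR` is treated the same as an unset one.

```python
import os
from pathlib import Path


def resolve_quack_cache_dir(env=None, home=None):
    """Hypothetical helper: QUACK_CACHE_DIR if set, else the documented default."""
    env = os.environ if env is None else env
    override = env.get("QUACK_CACHE_DIR")
    if override:
        return Path(override)
    home = Path.home() if home is None else Path(home)
    return home / ".cache" / "vllm_omni" / "quack"


print(resolve_quack_cache_dir())
```

In a container, the printed path should sit on a mounted or image-baked volume; otherwise the compiled kernels are lost when the container is recreated.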
63+
64+
To pre-warm specific shapes (e.g. at image build time):
65+
66+
```python
67+
from vllm_omni.quantization.quack_fp8 import warmup_quack_fp8
68+
# (M, K, N) per linear; M = number of tokens for your resolution/frame count
69+
warmup_quack_fp8([(14040, 2048, 6144), (14040, 2048, 2048)])
70+
```
71+
72+
> The PyPI package is `quack-kernels` (imported as `quack`); plain `pip install
73+
> quack` is an unrelated statistics library. Requires CUDA 12.9+ and Python 3.12.
74+
3175
## Model Type Support
3276

3377
### Diffusion Model (Qwen-Image, Wan2.2)
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
{
2+
"prompt": "A robotic arm, primarily white with black joints and cables, is shown in a clean, modern indoor setting with a white tabletop. The arm, equipped with a gripper holding a small, light green pitcher, is positioned above a clear glass containing a reddish-brown liquid and a spoon. The robotic arm is in the process of pouring a transparent liquid into the glass. To the left of the pitcher, there is an opened jar with a similar reddish-brown substance visible through its transparent body. In the background, a vase with white flowers and a brown couch are partially visible, adding to the contemporary ambiance. The lighting is bright, casting soft shadows on the table. The robotic arm's movements are smooth and controlled, demonstrating precision in its task. As the video progresses, the robotic arm completes the pour, leaving the glass half-filled with the reddish-brown liquid. The jar remains untouched throughout the sequence, and the spoon inside the glass remains stationary. The other robotic arm on the right side also stays stationary throughout the video. The final frame captures the robotic arm with the pitcher finishing the pour, with the glass now filled to a higher level, while the pitcher is slightly tilted but still held securely by the gripper.",
3+
"negative_prompt": "",
4+
"vision_path": "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4",
5+
"height": 720,
6+
"width": 1280,
7+
"num_frames": 189,
8+
"num_inference_steps": 35,
9+
"guidance_scale": 6.0,
10+
"flow_shift": 10.0,
11+
"fps": 24,
12+
"condition_frame_indexes_vision": [0, 1],
13+
"condition_video_keep": "first"
14+
}
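
A quick sanity check on a recipe like the one above is to derive the clip length from `num_frames` and `fps`. The sketch below copies only the relevant fields; treating duration as `num_frames / fps` and requiring non-negative conditioning indexes are assumptions made for illustration, not rules confirmed by the recipe.

```python
import json

# Subset of the recipe fields shown above.
recipe = json.loads("""
{
  "num_frames": 189,
  "fps": 24,
  "condition_frame_indexes_vision": [0, 1],
  "condition_video_keep": "first"
}
""")

# Assumed relation: clip duration in seconds is frame count over frame rate.
duration_s = recipe["num_frames"] / recipe["fps"]

# Assumed sanity check: conditioning frame indexes are non-negative integers.
indexes = recipe["condition_frame_indexes_vision"]
if any(not isinstance(i, int) or i < 0 for i in indexes):
    raise ValueError(f"unexpected conditioning indexes: {indexes}")

print(f"{duration_s:.3f} s")  # prints "7.875 s"
```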

pyproject.toml

Lines changed: 6 additions & 0 deletions
@@ -90,6 +90,12 @@ minicpmo = [
9090
"stepaudio2-minicpmo",
9191
]
9292

93+
# Optional Blackwell-only fused-bias FP8 GEMM; package is `quack-kernels` (imported
94+
# as `quack`), not the unrelated `quack`. See docs/user_guide/quantization/fp8.md.
95+
quack = [
96+
"quack-kernels>=0.3.11",
97+
]
98+
9399
docs = [
94100
"mkdocs>=1.5.0",
95101
"mkdocs-api-autonav",

recipes/README.md

Lines changed: 2 additions & 2 deletions
@@ -36,8 +36,8 @@ recipes/
3636
| [`LTX/LTX-2.md`](./LTX/LTX-2.md) | Text-to-video and image-to-video serving | 1x H200 141GB |
3737
| [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
3838
| [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
39-
| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video generation, text to video with sound, action policy | 1x H200 141GB / B300 |
40-
| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V generation (+ optional audio) / Action policy | 8x H200/H100/A100 / 2x H200 / B300 |
39+
| [`cosmos3/Cosmos3-Nano.md`](./cosmos3/Cosmos3-Nano.md) | Text-to-image, text-to-video, image-to-video, video-to-video generation, text to video with sound, action policy | 1x H200 141GB / B300 |
40+
| [`cosmos3/Cosmos3-Super.md`](./cosmos3/Cosmos3-Super.md) | 64B T2I / T2V / I2V / V2V generation (+ optional audio) / Action policy | 8x H200/H100/A100 / 2x H200 / B300 |
4141
| [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
4242
| [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
4343
| [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |
