[WAN2.2-S2V] Add server API for image + audio by xuechendi · Pull Request #3394 · vllm-project/vllm-omni

xuechendi · 2026-05-06T22:16:12Z

Purpose

Server

VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Wan-AI/Wan2.2-S2V-14B --omni \
  --model-class-name WanS2VPipeline \
  --tensor-parallel-size 2 \
  --flow-shift 3.0 \
  --vae-use-slicing --vae-use-tiling \
  --port 8091

Client

no_proxy=127.0.0.1 \
curl -X POST "http://localhost:8091/v1/videos/sync" \
  -F "prompt=A person singing" \
  -F "image_reference=https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png" \
  -F "audio_reference=https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3" \
  -F "width=832" -F "height=480" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.5" \
  -F "fps=16" \
  --output "s2v_480p_serve.mp4"
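For callers who prefer Python over curl, the same request might look like the sketch below. The endpoint path and form field names are copied from the curl command above; the helper names `build_s2v_form` and `generate_video`, the default timeout, and the use of the third-party `requests` package are assumptions made for illustration, not part of this PR.

```python
def build_s2v_form(prompt, image_url, audio_url, width=832, height=480,
                   num_inference_steps=40, guidance_scale=4.5, fps=16):
    """Return the form fields from the curl example, all encoded as strings."""
    return {
        "prompt": prompt,
        "image_reference": image_url,
        "audio_reference": audio_url,
        "width": str(width),
        "height": str(height),
        "num_inference_steps": str(num_inference_steps),
        "guidance_scale": str(guidance_scale),
        "fps": str(fps),
    }


def generate_video(base_url, form, output_path, timeout=1800):
    """POST the form to the sync endpoint and write the returned bytes to disk."""
    import requests  # third-party; imported lazily so build_s2v_form has no dependencies

    # (None, value) makes requests send plain multipart fields, like `curl -F`.
    files = {name: (None, value) for name, value in form.items()}
    response = requests.post(f"{base_url}/v1/videos/sync", files=files, timeout=timeout)
    response.raise_for_status()
    with open(output_path, "wb") as fh:
        fh.write(response.content)
    return output_path
```

A call such as `generate_video("http://localhost:8091", build_s2v_form("A person singing", image_url, audio_url), "s2v_480p_serve.mp4")` would then mirror the curl invocation, assuming the server from the previous section is running.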

Test Plan

Offline Test

  CUDA_VISIBLE_DEVICES=0,2 VLLM_WORKER_MULTIPROC_METHOD=spawn HF_HOME=/mnt/data \
  python examples/offline_inference/speech_to_video/speech_to_video.py \
    --model /mnt/data/hub/models--Wan-AI--Wan2.2-S2V-14B \
    --image "Five Hundred Miles.png" --audio "Five Hundred Miles.MP3" \
    --prompt "A person singing" --height 480 --width 832 --num-frames 81 \
    --num-inference-steps 40 --fps 16 --tensor-parallel-size 2 \
    --vae-use-slicing --vae-use-tiling --output s2v_offline_test.mp4

Online Serving Test

TP=2 PORT=8091 bash examples/online_serving/speech_to_video/run_server.sh

IMAGE_URL="https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png" \
AUDIO_URL="https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3" \
bash examples/online_serving/speech_to_video/run_curl_speech_to_video.sh

Test Result

| Metric | Offline Value | Online Value |
| --- | --- | --- |
| Time | 291.6s (Generation time) | 188.7s (Request time) |
| Output / Status | 78 frames @ 16 fps (4.9s) + audio 5.0s @ 44100 Hz | 200 OK |

| Steps | Offline | Online |
| --- | --- | --- |
| 1-4 (warmup) | 14-17s/step | 14-17s/step |
| 5-31 (cache_dit) | 2-6s/step | 2-6s/step |
| 32-40 (tail) | 8-13s/step | 5-6s/step |
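As a rough sanity check on these numbers, the per-step ranges can be summed into lower and upper bounds for the denoising loop and compared with the reported totals. This is only a sketch: it treats the reported times as if they covered the 40 steps alone, which is an assumption, since the totals may also include VAE decoding, encoding, or request overhead.

```python
# (phase, number of steps, fastest s/step, slowest s/step), copied from the table above
OFFLINE = [("warmup", 4, 14, 17), ("cache_dit", 27, 2, 6), ("tail", 9, 8, 13)]
ONLINE = [("warmup", 4, 14, 17), ("cache_dit", 27, 2, 6), ("tail", 9, 5, 6)]


def step_time_bounds(phases):
    """Sum per-phase best and worst cases into (low, high) seconds."""
    low = sum(steps * fastest for _name, steps, fastest, _slowest in phases)
    high = sum(steps * slowest for _name, steps, _fastest, slowest in phases)
    return low, high


def clip_seconds(num_frames, fps):
    """Duration of a clip with the given frame count and frame rate."""
    return num_frames / fps


print(step_time_bounds(OFFLINE))  # (182, 347); the reported 291.6s falls inside
print(step_time_bounds(ONLINE))   # (155, 284); the reported 188.7s falls inside
print(clip_seconds(78, 16))       # 4.875, which rounds to the reported 4.9s
```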

offline - 40 steps

s2v_offline_40steps_cache_dit.mp4

online - 40 steps

s2v_serve_40steps_cache_dit.mp4

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6253aa433

gcanlin

Any open standard about the API design, such as OpenAI?

hsliuustc0106

Review Summary

Validated: CI green (DCO, pre-commit, build 3.11/3.12, docs). The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner.

Issues to address:

Temp file leak (see inline). NamedTemporaryFile(delete=False) writes the uploaded audio to disk but the temp file is never cleaned up. Neither _run_video_generation_job nor the sync endpoint has cleanup logic for audio_path. Each request will leave a temp audio file on disk. Consider adding cleanup in a finally block or using a context manager.
No audio validation. The endpoint accepts any bytes as input_audio with no checks for file type, MIME type, size limits, or duration. An unsupported format will fail deep in the pipeline with a confusing error. Validate the audio format and size at the API layer.
No test results. The PR description has empty "Test Plan" and "Test Result" sections. For an API change that adds a new form field, please provide at minimum a curl invocation showing the endpoint works, and confirm the model produces output with the uploaded audio.
Mergeable is UNKNOWN — may need a rebase against main.
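The temp file cleanup suggested above could follow the pattern sketched here. The helper names `save_upload_to_temp` and `run_with_temp_audio` and the default `.mp3` suffix are hypothetical; the actual fix in `api_server.py` may be structured differently.

```python
import os
import tempfile


def save_upload_to_temp(data: bytes, suffix: str = ".mp3") -> str:
    """Persist uploaded bytes and return the path; the caller owns cleanup."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def run_with_temp_audio(data: bytes, generate):
    """Call generate(audio_path) and always remove the temp file afterwards."""
    audio_path = save_upload_to_temp(data)
    try:
        return generate(audio_path)
    finally:
        # Runs both when generate() returns and when it raises.
        try:
            os.unlink(audio_path)
        except FileNotFoundError:
            pass  # already removed by someone else; nothing to do
```

In an async endpoint the same `try`/`finally` shape applies, since a cancelled task also unwinds through the `finally` block.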

Reviewed by Claude Code with deepseek-v4-pro

Copilot

Pull request overview

Adds support for supplying an uploaded audio file alongside a reference image to the OpenAI-style video serving endpoints, enabling Wan2.2-S2V “speech-to-video” style requests via the server API.

Changes:

Thread an optional audio_path through the video serving stack and attach it to prompt["multi_modal_data"].
Extend the FastAPI multipart form parser to accept input_audio and persist it to a temporary file for downstream consumption.
Document online serving usage for Wan2.2-S2V (server + curl client example).
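The conditional construction of `multi_modal_data` described in the first item can be illustrated with a small sketch. The function name `build_video_prompt` and the `"image"` and `"audio"` keys are assumptions for illustration; the real key names in `serving_video.py` are not shown in this thread.

```python
def build_video_prompt(prompt, image=None, audio_path=None):
    """Build a prompt dict, attaching multi_modal_data only for inputs that exist."""
    request = {"prompt": prompt}
    multi_modal_data = {}
    if image is not None:
        multi_modal_data["image"] = image  # hypothetical key name
    if audio_path is not None:
        multi_modal_data["audio"] = audio_path  # hypothetical key name
    if multi_modal_data:
        # Text-only requests carry no multi_modal_data entry at all.
        request["multi_modal_data"] = multi_modal_data
    return request
```

Compared with assigning the dict directly, this keeps text-to-video requests free of empty or `None` entries that downstream code would otherwise have to filter out.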

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `vllm_omni/entrypoints/openai/serving_video.py` | Accepts `audio_path` and forwards it via `multi_modal_data` into the diffusion prompt. |
| `vllm_omni/entrypoints/openai/api_server.py` | Adds `input_audio` form field parsing, writes uploads to a temp file, and passes `audio_path` into generation paths. |
| `recipes/Wan-AI/Wan2.2-S2V.md` | Adds “Online Serving” documentation with a server command and curl example including `input_audio`. |

 async def _parse_video_form(
    raw_request: Request,
    prompt: str = Form(...),
    input_reference: UploadFile | None = File(default=None),
+    input_audio: UploadFile | None = File(default=None),
    image_reference: str | None = Form(default=None),


xuechendi · 2026-05-11T23:03:10Z

Any open standard about the API design, such as OpenAI?

The OpenAI API can take either a file:

curl https://api.openai.com/v1/videos \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sora-2-pro",
    "prompt": "Animate the person in the reference image to sing the lyrics from the audio file with realistic facial expressions and matching mouth movements.",
    "image_input": "file-img_98765",
    "audio_input": "file-aud_12345",
    "resolution": "1080p",
    "motion_bucket": 5
  }'

or a URL:

curl https://api.openai.com/v1/videos \
  -d '{
    "model": "sora-2",
    "image_url": "https://your-site.com/character.jpg",
    "audio_url": "https://your-site.com/voice-clip.mp3",
    "prompt": "Animate the character to speak the audio."
  }'

Since the existing attribute name for the image already differs from OpenAI's ('image_input' vs 'input_reference'), should we refactor to follow OpenAI here?

hsliuustc0106 · 2026-05-12T01:12:28Z

Just a quick reminder: please check whether online and offline modes output the same image.

hsliuustc0106

Review Summary

Validated: All CI gates green (DCO, pre-commit, build 3.11/3.12, docs). PR is mergeable. The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner. Examples and recipe docs are well-structured.

These issues were flagged in the prior review round (on the previous commit) and have not been addressed:

Blocking issues:

Temp file leak. NamedTemporaryFile(delete=False) writes uploaded audio to disk but the temp file is never cleaned up — not on success, not on error, not on cancellation. _run_video_generation_job calls _cleanup_video() for the output file but has no equivalent for audio_path. create_video_sync also has no cleanup. Each request leaks an orphan file on disk. Add os.unlink(audio_path) in a finally block in both paths.
Empty audio silently ignored. When input_audio is provided but the file has zero bytes, the code silently sets audio_path = None. This produces a misleading downstream error instead of a clear 400 response. Return an explicit 400 when input_audio is present but empty.
No test evidence. The test plan and test result sections are empty. Please provide at minimum a curl command showing the endpoint produces output with an uploaded audio file, and confirm online and offline modes produce consistent results (as previously requested).
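The empty-upload check requested above might look like the following sketch. `BadRequest` and `read_audio_upload` are hypothetical names; in the FastAPI server the equivalent would presumably be raising `HTTPException(status_code=400, ...)` from the form parser.

```python
class BadRequest(Exception):
    """Stand-in for an HTTP 400 error raised at the API layer."""

    status_code = 400


def read_audio_upload(data):
    """Validate uploaded audio bytes; `data` is None when the field is absent."""
    if data is None:
        return None  # no audio supplied, which is a valid request
    if len(data) == 0:
        # Fail fast with a clear message instead of silently dropping the input.
        raise BadRequest("input_audio was provided but is empty")
    return data
```

Distinguishing "absent" from "present but empty" is the point of the check: only the first case should fall through to generation without audio.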

Reviewed by Claude Code with glm-5.1

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

hsliuustc0106

shall we update the doc for api server as well?

- Update docs/serving/videos_api.md with audio_reference parameter and Speech-to-Video example section
- Add unit tests for decode_audio_url (data URL, HTTP URL, invalid URL, suffix sanitization)
- Add integration test for audio_reference form parameter in video server
- Sanitize user-controlled MIME extension in decode_audio_url to prevent path traversal via crafted data URLs

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
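The suffix sanitization mentioned in this commit can be sketched as follows. This is not the code from `decode_audio_url`; `safe_audio_suffix`, its `.bin` fallback, and the 8-character cap are assumptions chosen to show the idea of never letting a user-supplied MIME subtype contribute path separators or dots to a temp file name.

```python
import re


def safe_audio_suffix(mime_type: str, default: str = ".bin", max_len: int = 8) -> str:
    """Derive a file suffix from a MIME type, keeping only ASCII letters and digits."""
    subtype = mime_type.split("/", 1)[-1]  # text after the first slash
    subtype = subtype.split(";", 1)[0]     # drop parameters such as ";base64"
    cleaned = re.sub(r"[^A-Za-z0-9]", "", subtype)[:max_len].lower()
    return f".{cleaned}" if cleaned else default


print(safe_audio_suffix("audio/wav;base64"))        # .wav
print(safe_audio_suffix("audio/../../etc/passwd"))  # .etcpassw
print(safe_audio_suffix(""))                        # .bin
```

A stricter variant would map known MIME types to a fixed allowlist of extensions and reject everything else; the character filter above is the minimal form of the same idea.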

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi · 2026-06-09T01:02:52Z

shall we update the doc for api server as well?

@hsliuustc0106, addressed the comments by adding new tests and updating the doc for the API server.

hsliuustc0106 · 2026-06-10T12:28:35Z


tests/entrypoints/openai_api/test_video_server.py::test_delete_in_progress_job_cancels_task_and_removes_metadata - assert False
--
  | [2026-06-10T01:22:34Z]  +  where False = wait(timeout=2.0)
  | [2026-06-10T01:22:34Z]  +    where wait = <threading.Event at 0x7f5d418782c0: unset>.wait
  | [2026-06-10T01:22:34Z]  +      where <threading.Event at 0x7f5d418782c0: unset> = <openai_api.test_video_server.BlockingVideoHandler object at 0x7f5d418797f0>.started
  | [2026-06-10T01:22:34Z] = 1 failed, 1541 passed, 3 skipped, 1131 deselected, 48 warnings in 237.74s (0:03:57) =
  | [2026-06-10T01:22:38Z] sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
  | [2026-06-10T01:22:42Z] 🚨 Error: The command exited with status 1

hsliuustc0106 · 2026-06-10T15:05:13Z

is this ci failure related to this PR?

…d_removes_metadata Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi · 2026-06-10T21:36:13Z

@hsliuustc0106, the failing test is fixed in b8cdb5f.
The fix adds a new argument, reference_audio=None, to generate_video_bytes in tests/entrypoints/openai_api/test_video_server.py.
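One plausible reading of the failure, sketched below, is that the test's stub handler rejected the new keyword argument before it could set its `started` event, so `wait(timeout=2.0)` returned False. The class names, the `reference_image` parameter, and `call_like_server` are hypothetical; only `generate_video_bytes`, `started`, and `reference_audio=None` come from this thread.

```python
import threading


class OldBlockingHandler:
    """Stub whose signature predates the audio argument."""

    def __init__(self):
        self.started = threading.Event()

    def generate_video_bytes(self, request, reference_image=None):
        self.started.set()
        return b""


class FixedBlockingHandler(OldBlockingHandler):
    """Same stub, now accepting reference_audio like the real handler."""

    def generate_video_bytes(self, request, reference_image=None, reference_audio=None):
        self.started.set()
        return b""


def call_like_server(handler):
    """Mimic a caller that always passes reference_audio, then wait on `started`."""
    try:
        handler.generate_video_bytes("req", reference_image=None, reference_audio=None)
    except TypeError:
        pass  # the stub body never ran, so `started` stays unset
    return handler.started.wait(timeout=0.1)


print(call_like_server(OldBlockingHandler()))    # False
print(call_like_server(FixedBlockingHandler()))  # True
```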

Tested test_video_server.py locally on an A100:

tests/entrypoints/openai_api/test_video_server.py::test_async_video_generation_bypasses_base64 PASSED [  1%]
tests/entrypoints/openai_api/test_video_server.py::test_async_video_generation_with_audio_bypasses_base64 PASSED [  3%]
tests/entrypoints/openai_api/test_video_server.py::test_t2v_video_generation_form PASSED [  4%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_form PASSED [  6%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_resizes_input_to_requested_dimensions PASSED [  8%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_with_image_reference_form PASSED [  9%]
tests/entrypoints/openai_api/test_video_server.py::test_v2v_video_generation_form PASSED [ 11%]
tests/entrypoints/openai_api/test_video_server.py::test_v2v_video_generation_with_video_reference_form PASSED [ 12%]
tests/entrypoints/openai_api/test_video_server.py::test_decode_video_bytes_can_keep_last_frames PASSED [ 14%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_uses_v2v_condition_frames PASSED [ 16%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_preserves_action_frames PASSED [ 17%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_caps_condition_frames_to_output_frames PASSED [ 19%]
tests/entrypoints/openai_api/test_video_server.py::test_s2v_video_generation_with_audio_reference_form PASSED [ 20%]
tests/entrypoints/openai_api/test_video_server.py::test_seconds_defaults_fps_and_frames PASSED [ 22%]
tests/entrypoints/openai_api/test_video_server.py::test_size_param_sets_width_height PASSED [ 24%]
tests/entrypoints/openai_api/test_video_server.py::test_sampling_params_pass_through PASSED [ 25%]
tests/entrypoints/openai_api/test_video_server.py::test_frame_interpolation_params_pass_to_diffusion_sampling_params PASSED [ 27%]
tests/entrypoints/openai_api/test_video_server.py::test_default_sampling_params_apply_to_video_requests PASSED [ 29%]
tests/entrypoints/openai_api/test_video_server.py::test_request_params_override_default_video_sampling_params PASSED [ 30%]
tests/entrypoints/openai_api/test_video_server.py::test_worker_fps_multiplier_is_applied_to_async_encoding PASSED [ 32%]
tests/entrypoints/openai_api/test_video_server.py::test_audio_sample_rate_comes_from_model_config PASSED [ 33%]
tests/entrypoints/openai_api/test_video_server.py::test_video_job_persists_profiler_metadata PASSED [ 35%]
tests/entrypoints/openai_api/test_video_server.py::test_video_generation_response_exposes_action_payload PASSED [ 37%]
tests/entrypoints/openai_api/test_video_server.py::test_video_job_persists_action_metadata PASSED [ 38%]
tests/entrypoints/openai_api/test_video_server.py::test_action_extraction_accepts_unbatched_action PASSED [ 40%]
tests/entrypoints/openai_api/test_video_server.py::test_missing_handler_returns_503 PASSED [ 41%]
tests/entrypoints/openai_api/test_video_server.py::test_missing_prompt_returns_422 PASSED [ 43%]
tests/entrypoints/openai_api/test_video_server.py::test_video_generation_rejects_model_mismatch PASSED [ 45%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_size_parse_returns_422 PASSED [ 46%]
tests/entrypoints/openai_api/test_video_server.py::test_rejects_input_reference_and_image_reference_together PASSED [ 48%]
tests/entrypoints/openai_api/test_video_server.py::test_rejects_image_reference_and_video_reference_together PASSED [ 50%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_seconds_returns_422 PASSED [ 51%]
tests/entrypoints/openai_api/test_video_server.py::test_negative_prompt_and_seed_pass_through PASSED [ 53%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_lora_returns_400 PASSED [ 54%]
tests/entrypoints/openai_api/test_video_server.py::test_unsupported_image_reference_file_id_returns_400 PASSED [ 56%]
tests/entrypoints/openai_api/test_video_server.py::test_unsupported_video_reference_file_id_returns_400 PASSED [ 58%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_uploaded_input_reference_returns_400 PASSED [ 59%]
tests/entrypoints/openai_api/test_video_server.py::test_video_request_validation PASSED [ 61%]
tests/entrypoints/openai_api/test_video_server.py::test_list_videos_supports_order_after_and_limit PASSED [ 62%]
tests/entrypoints/openai_api/test_video_server.py::test_delete_completed_job_removes_file_and_metadata PASSED [ 64%]
tests/entrypoints/openai_api/test_video_server.py::test_delete_in_progress_job_cancels_task_and_removes_metadata PASSED [ 66%]
tests/entrypoints/openai_api/test_video_server.py::test_video_response_file_extension_is_robust PASSED [ 67%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_merged_into_extra_args PASSED [ 69%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_none_by_default PASSED [ 70%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_invalid_json PASSED [ 72%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_merged_with_existing_extra_args PASSED [ 74%]
tests/entrypoints/openai_api/test_video_server.py::test_sample_solver_forwarded_via_extra_params PASSED [ 75%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_allows_inline_action PASSED [ 77%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_t2v_returns_video_bytes PASSED [ 79%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_t2v_returns_profiler_headers PASSED [ 80%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_i2v_returns_video_bytes PASSED [ 82%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_i2v_with_image_reference PASSED [ 83%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_v2v_returns_video_bytes PASSED [ 85%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_missing_handler_returns_503 PASSED [ 87%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_missing_prompt_returns_422 PASSED [ 88%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_rejects_both_references PASSED [ 90%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_generation_error_returns_500 PASSED [ 91%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_does_not_create_store_entry PASSED [ 93%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_sampling_params_pass_through PASSED [ 95%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_frame_interpolation_params_pass_to_sampling_params PASSED [ 96%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_default_sampling_params_apply_to_video_requests PASSED [ 98%]
tests/entrypoints/openai_api/test_video_server.py::test_worker_fps_multiplier_is_applied_to_sync_encoding PASSED [100%]
...
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
======================= 62 passed, 28 warnings in 7.26s ========================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

hsliuustc0106

lgtm

Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

xuechendi requested a review from hsliuustc0106 as a code owner May 6, 2026 22:16

chatgpt-codex-connector Bot reviewed May 6, 2026

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

gcanlin reviewed May 7, 2026

hsliuustc0106 reviewed May 7, 2026

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

hsliuustc0106 requested review from SamitHuang, Copilot, gcanlin, wtomin and yuanheng-zhao May 8, 2026 13:04

hsliuustc0106 reviewed May 8, 2026

Comment thread recipes/Wan-AI/Wan2.2-S2V.md Outdated

Copilot started reviewing on behalf of hsliuustc0106 May 8, 2026 13:05

Copilot AI reviewed May 8, 2026

xuechendi force-pushed the wan2_2-s2v_serve branch from d6253aa to e46394a June 4, 2026 20:24

xuechendi requested review from Gaohan123, david6666666, tzhouam and ywang96 as code owners June 4, 2026 20:24

hsliuustc0106 requested changes Jun 4, 2026

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

xuechendi force-pushed the wan2_2-s2v_serve branch from e46394a to 9f80ae2 June 5, 2026 16:49

xuechendi requested review from Isotr0py, RuixiangMa, ZJY0516 and princepride as code owners June 5, 2026 16:49

xuechendi force-pushed the wan2_2-s2v_serve branch 2 times, most recently from 5bb37a0 to 08e4171 June 8, 2026 22:51

Add necessary changes to support image + audio

f482dbb

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the wan2_2-s2v_serve branch from 08e4171 to f482dbb June 8, 2026 23:10

hsliuustc0106 reviewed Jun 8, 2026

Comment thread vllm_omni/entrypoints/openai/video_api_utils.py

hsliuustc0106 reviewed Jun 8, 2026

Comment thread vllm_omni/entrypoints/openai/api_server.py

xuechendi requested a review from yenuo26 as a code owner June 9, 2026 00:46

apply sanitizes to ext

9143c3b

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

hsliuustc0106 added the "ready" label (label to trigger buildkite CI) Jun 10, 2026

hsliuustc0106 removed the "ready" label (label to trigger buildkite CI) Jun 10, 2026

Fix test_video_server.py::test_delete_in_progress_job_cancels_task_an…

b8cdb5f

…d_removes_metadata Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Merge branch 'main' into wan2_2-s2v_serve

7e6251d

hsliuustc0106 added the "ready" label (label to trigger buildkite CI) Jun 10, 2026

hsliuustc0106 approved these changes Jun 11, 2026

hsliuustc0106 merged commit 5414f78 into vllm-project:main Jun 11, 2026
7 of 8 checks passed

Nughm3 pushed a commit to Nughm3/vllm-omni that referenced this pull request Jun 18, 2026

[WAN2.2-S2V] Add server API for image + audio (vllm-project#3394)

319548c

Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
