[WAN2.2-S2V] Add server API for image + audio#3394

Merged
hsliuustc0106 merged 5 commits into
vllm-project:mainfrom
xuechendi:wan2_2-s2v_serve
Jun 11, 2026

@xuechendi xuechendi commented May 6, 2026

Contributor

Purpose

Server

VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Wan-AI/Wan2.2-S2V-14B --omni \
  --model-class-name WanS2VPipeline \
  --tensor-parallel-size 2 \
  --flow-shift 3.0 \
  --vae-use-slicing --vae-use-tiling \
  --port 8091

Client

no_proxy=127.0.0.1 \
curl -X POST "http://localhost:8091/v1/videos/sync" \
  -F "prompt=A person singing" \
  -F "image_reference=https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png" \
  -F "audio_reference=https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3" \
  -F "width=832" -F "height=480" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.5" \
  -F "fps=16" \
  --output "s2v_480p_serve.mp4"
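
For scripted use, the same request can be sent from Python. This is a minimal standard-library sketch, assuming the `/v1/videos/sync` endpoint and the form fields shown in the curl command above. The helper names (`build_s2v_fields`, `encode_multipart`, `generate_video`) and the timeout value are illustrative and are not part of vllm-omni.

```python
import urllib.request
import uuid


def build_s2v_fields(prompt, image_url, audio_url, width=832, height=480,
                     num_inference_steps=40, guidance_scale=4.5, fps=16):
    """Return the form fields from the curl example, all as strings."""
    return {
        "prompt": prompt,
        "image_reference": image_url,
        "audio_reference": audio_url,
        "width": str(width),
        "height": str(height),
        "num_inference_steps": str(num_inference_steps),
        "guidance_scale": str(guidance_scale),
        "fps": str(fps),
    }


def encode_multipart(fields, boundary=None):
    """Encode plain text fields as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def generate_video(base_url, fields, out_path, timeout=1800):
    """POST the form and write the response body to out_path.

    The timeout is an assumed value, chosen to exceed the multi-minute
    request times that long generations can take.
    """
    body, content_type = encode_multipart(fields)
    request = urllib.request.Request(
        f"{base_url}/v1/videos/sync",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(out_path, "wb") as out_file:
            out_file.write(response.read())
```

With the server from the previous block running, `generate_video("http://localhost:8091", build_s2v_fields("A person singing", image_url, audio_url), "s2v_480p_serve.mp4")` would mirror the curl call.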

Test Plan

Offline Test

  CUDA_VISIBLE_DEVICES=0,2 VLLM_WORKER_MULTIPROC_METHOD=spawn HF_HOME=/mnt/data \
  python examples/offline_inference/speech_to_video/speech_to_video.py \
    --model /mnt/data/hub/models--Wan-AI--Wan2.2-S2V-14B \
    --image "Five Hundred Miles.png" --audio "Five Hundred Miles.MP3" \
    --prompt "A person singing" --height 480 --width 832 --num-frames 81 \
    --num-inference-steps 40 --fps 16 --tensor-parallel-size 2 \
    --vae-use-slicing --vae-use-tiling --output s2v_offline_test.mp4

Online Serving Test

TP=2 PORT=8091 bash examples/online_serving/speech_to_video/run_server.sh
IMAGE_URL="https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.png" \
AUDIO_URL="https://raw.githubusercontent.com/Wan-Video/Wan2.2/main/examples/Five%20Hundred%20Miles.MP3" \
bash examples/online_serving/speech_to_video/run_curl_speech_to_video.sh

Test Result

| Metric | Offline Value | Online Value |
| --- | --- | --- |
| Time | 291.6s (Generation time) | 188.7s (Request time) |
| Output / Status | 78 frames @ 16 fps (4.9s) + audio 5.0s @ 44100 Hz | 200 OK |

| Steps | Offline | Online |
| --- | --- | --- |
| 1-4 (warmup) | 14-17s/step | 14-17s/step |
| 5-31 (cache_dit) | 2-6s/step | 2-6s/step |
| 32-40 (tail) | 8-13s/step | 5-6s/step |

offline - 40 steps

s2v_offline_40steps_cache_dit.mp4

online - 40 steps

s2v_serve_40steps_cache_dit.mp4

@xuechendi xuechendi requested a review from hsliuustc0106 as a code owner May 6, 2026 22:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d6253aa433

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

@gcanlin gcanlin left a comment

Collaborator

Any open standard about the API design, such as OpenAI?

@hsliuustc0106 hsliuustc0106 left a comment

Collaborator

Review Summary

Validated: CI green (DCO, pre-commit, build 3.11/3.12, docs). The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner.

Issues to address:

  1. Temp file leak (see inline). NamedTemporaryFile(delete=False) writes the uploaded audio to disk but the temp file is never cleaned up. Neither _run_video_generation_job nor the sync endpoint has cleanup logic for audio_path. Each request will leave a temp audio file on disk. Consider adding cleanup in a finally block or using a context manager.

  2. No audio validation. The endpoint accepts any bytes as input_audio with no checks for file type, MIME type, size limits, or duration. An unsupported format will fail deep in the pipeline with a confusing error. Validate the audio format and size at the API layer.

  3. No test results. The PR description has empty "Test Plan" and "Test Result" sections. For an API change that adds a new form field, please provide at minimum a curl invocation showing the endpoint works, and confirm the model produces output with the uploaded audio.

  4. Mergeable is UNKNOWN — may need a rebase against main.
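
The validation requested in item 2 could look roughly like the following. This is a sketch only: the allowed suffixes, the size limit, and the names `validate_audio_upload` and `AudioValidationError` are assumptions for illustration and do not come from the PR.

```python
from pathlib import Path

# Hypothetical policy values; the PR does not specify any limits.
ALLOWED_AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}
MAX_AUDIO_BYTES = 50 * 1024 * 1024  # 50 MiB


class AudioValidationError(ValueError):
    """Raised for uploads the API layer should reject with a 4xx response."""


def validate_audio_upload(filename, data):
    """Check file type and size before the audio reaches the pipeline.

    Returns the normalized (lowercase) suffix on success.
    """
    suffix = Path(filename or "").suffix.lower()
    if suffix not in ALLOWED_AUDIO_SUFFIXES:
        raise AudioValidationError(
            f"unsupported audio type: {suffix or '<none>'}"
        )
    if len(data) > MAX_AUDIO_BYTES:
        raise AudioValidationError(
            f"audio is {len(data)} bytes; limit is {MAX_AUDIO_BYTES}"
        )
    return suffix
```

Checking the suffix only is a weak test of the actual format; sniffing the file header or attempting a decode would be stricter, at the cost of more work per request.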


Reviewed by Claude Code with deepseek-v4-pro

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Comment thread recipes/Wan-AI/Wan2.2-S2V.md Outdated

Copilot AI left a comment

Contributor

Pull request overview

Adds support for supplying an uploaded audio file alongside a reference image to the OpenAI-style video serving endpoints, enabling Wan2.2-S2V “speech-to-video” style requests via the server API.

Changes:

  • Thread an optional audio_path through the video serving stack and attach it to prompt["multi_modal_data"].
  • Extend the FastAPI multipart form parser to accept input_audio and persist it to a temporary file for downstream consumption.
  • Document online serving usage for Wan2.2-S2V (server + curl client example).
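
The "conditional construction" of `multi_modal_data` mentioned in the reviews can be pictured as below. This is an illustrative sketch, not the code in serving_video.py: the function name `build_prompt` and the `"image"` and `"audio"` key names are assumptions.

```python
def build_prompt(text, image=None, audio_path=None):
    """Attach multi_modal_data only for the inputs that were provided."""
    multi_modal_data = {}
    if image is not None:
        multi_modal_data["image"] = image  # assumed key name
    if audio_path is not None:
        multi_modal_data["audio"] = audio_path  # assumed key name

    prompt = {"prompt": text}
    if multi_modal_data:
        # Text-only requests carry no multi_modal_data key at all.
        prompt["multi_modal_data"] = multi_modal_data
    return prompt
```

Building the dict conditionally keeps text-to-video and image-to-video requests unchanged when no audio is supplied.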

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vllm_omni/entrypoints/openai/serving_video.py | Accepts audio_path and forwards it via multi_modal_data into the diffusion prompt. |
| vllm_omni/entrypoints/openai/api_server.py | Adds input_audio form field parsing, writes uploads to a temp file, and passes audio_path into generation paths. |
| recipes/Wan-AI/Wan2.2-S2V.md | Adds “Online Serving” documentation with a server command and curl example including input_audio. |

Comment thread vllm_omni/entrypoints/openai/api_server.py
Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Comment on lines 2476 to 2481
async def _parse_video_form(
    raw_request: Request,
    prompt: str = Form(...),
    input_reference: UploadFile | None = File(default=None),
    input_audio: UploadFile | None = File(default=None),
    image_reference: str | None = Form(default=None),
@xuechendi

Contributor Author

Any open standard about the API design, such as OpenAI?

The OpenAI API can take either a file

curl https://api.openai.com/v1/videos \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sora-2-pro",
    "prompt": "Animate the person in the reference image to sing the lyrics from the audio file with realistic facial expressions and matching mouth movements.",
    "image_input": "file-img_98765",
    "audio_input": "file-aud_12345",
    "resolution": "1080p",
    "motion_bucket": 5
  }'

or url

curl https://api.openai.com/v1/videos \
  -d '{
    "model": "sora-2",
    "image_url": "https://your-site.com/character.jpg",
    "audio_url": "https://your-site.com/voice-clip.mp3",
    "prompt": "Animate the character to speak the audio."
  }'

Since the existing attribute name for the image already differs from OpenAI's ('image_input' vs 'input_reference'), should we refactor to follow OpenAI here?

@hsliuustc0106

Collaborator

Just a quick reminder: please check whether online and offline modes output the same image.

@hsliuustc0106 hsliuustc0106 left a comment

Collaborator

Review Summary

Validated: All CI gates green (DCO, pre-commit, build 3.11/3.12, docs). PR is mergeable. The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner. Examples and recipe docs are well-structured.

These issues were flagged in the prior review round (on the previous commit) and have not been addressed:

Blocking issues:

  1. Temp file leak. NamedTemporaryFile(delete=False) writes uploaded audio to disk but the temp file is never cleaned up — not on success, not on error, not on cancellation. _run_video_generation_job calls _cleanup_video() for the output file but has no equivalent for audio_path. create_video_sync also has no cleanup. Each request leaks an orphan file on disk. Add os.unlink(audio_path) in a finally block in both paths.

  2. Empty audio silently ignored. When input_audio is provided but the file has zero bytes, the code silently sets audio_path = None. This produces a misleading downstream error instead of a clear 400 response. Return an explicit 400 when input_audio is present but empty.

  3. No test evidence. The test plan and test result sections are empty. Please provide at minimum a curl command showing the endpoint produces output with an uploaded audio file, and confirm online and offline modes produce consistent results (as previously requested).
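
Blocking issues 1 and 2 can be sketched together as follows. This is a synchronous, standard-library illustration under assumed names (`save_audio_to_temp`, `run_with_audio`, `EmptyAudioError`); the real handlers are async FastAPI endpoints, where the same `try`/`finally` shape also runs on task cancellation.

```python
import os
import tempfile


class EmptyAudioError(ValueError):
    """Maps to an HTTP 400 in the API layer (assumed convention)."""
    status_code = 400


def save_audio_to_temp(data, suffix=".wav"):
    """Persist uploaded audio bytes; reject empty uploads explicitly."""
    if not data:
        raise EmptyAudioError("input_audio was provided but is empty")
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def run_with_audio(data, generate):
    """Call generate(audio_path) and always remove the temp file after."""
    audio_path = save_audio_to_temp(data)
    try:
        return generate(audio_path)
    finally:
        # Executes on success and on any exception raised by generate().
        try:
            os.unlink(audio_path)
        except FileNotFoundError:
            pass  # Already removed elsewhere; nothing left to clean up.
```

Raising before the temp file is created means the empty-upload path has nothing to clean up, and the caller sees a clear error instead of a missing-audio failure deep in the pipeline.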


Reviewed by Claude Code with glm-5.1

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
@xuechendi xuechendi force-pushed the wan2_2-s2v_serve branch 2 times, most recently from 5bb37a0 to 08e4171 Compare June 8, 2026 22:51
Signed-off-by: Chendi Xue <chendi.xue@intel.com>

@hsliuustc0106 hsliuustc0106 left a comment

Collaborator

shall we update the doc for api server as well?

Comment thread vllm_omni/entrypoints/openai/video_api_utils.py
Comment thread vllm_omni/entrypoints/openai/api_server.py
- Update docs/serving/videos_api.md with audio_reference parameter and
  Speech-to-Video example section
- Add unit tests for decode_audio_url (data URL, HTTP URL, invalid URL,
  suffix sanitization)
- Add integration test for audio_reference form parameter in video server
- Sanitize user-controlled MIME extension in decode_audio_url to prevent
  path traversal via crafted data URLs

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
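
The suffix sanitization described in this commit message addresses data URLs whose MIME subtype is attacker-controlled. A minimal sketch of the idea follows; it is not the actual `decode_audio_url` code, and the function name, the allowed-character pattern, and the `.wav` fallback are assumptions.

```python
import re

# Accept only short alphanumeric subtypes such as "wav", "mp3", "mpeg".
_SAFE_SUBTYPE = re.compile(r"[a-z0-9]{1,8}")


def suffix_from_data_url(url, default=".wav"):
    """Derive a temp-file suffix from a data URL without trusting its MIME type."""
    if not url.startswith("data:"):
        return default
    header = url[len("data:"):].split(",", 1)[0]  # e.g. "audio/mpeg;base64"
    mime = header.split(";", 1)[0].strip().lower()
    if not mime.startswith("audio/"):
        return default
    subtype = mime[len("audio/"):]
    if not _SAFE_SUBTYPE.fullmatch(subtype):
        # Rejects separators and dots, e.g. "../../etc/passwd".
        return default
    return "." + subtype
```

For example, `data:audio/mp3;base64,...` yields `.mp3`, while a crafted `data:audio/../../etc/passwd;base64,...` falls back to the default suffix instead of leaking path components into the temp file name.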
@xuechendi xuechendi requested a review from yenuo26 as a code owner June 9, 2026 00:46
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi

Contributor Author

shall we update the doc for api server as well?

@hsliuustc0106, addressed the comments by adding new tests and updating the API server doc.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Jun 10, 2026
@hsliuustc0106

Collaborator

tests/entrypoints/openai_api/test_video_server.py::test_delete_in_progress_job_cancels_task_and_removes_metadata - assert False
--
  | [2026-06-10T01:22:34Z]  +  where False = wait(timeout=2.0)
  | [2026-06-10T01:22:34Z]  +    where wait = <threading.Event at 0x7f5d418782c0: unset>.wait
  | [2026-06-10T01:22:34Z]  +      where <threading.Event at 0x7f5d418782c0: unset> = <openai_api.test_video_server.BlockingVideoHandler object at 0x7f5d418797f0>.started
  | [2026-06-10T01:22:34Z] = 1 failed, 1541 passed, 3 skipped, 1131 deselected, 48 warnings in 237.74s (0:03:57) =
  | [2026-06-10T01:22:38Z] sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
  | [2026-06-10T01:22:42Z] 🚨 Error: The command exited with status 1


@hsliuustc0106 hsliuustc0106 removed the ready label to trigger buildkite CI label Jun 10, 2026
@hsliuustc0106

Collaborator

is this ci failure related to this PR?

…d_removes_metadata

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi

Contributor Author

@hsliuustc0106, fixed the failing test in b8cdb5f.
The test stub generate_video_bytes in tests/entrypoints/openai_api/test_video_server.py needed a new argument, reference_audio=None.
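
One plausible reading of why the missing argument surfaced as `started.wait(timeout=2.0)` returning `False`: if the server passes `reference_audio` as a keyword and the stub does not accept it, the call raises `TypeError` before the stub sets its `started` event. The classes below are hypothetical stand-ins for illustration, not the test's actual `BlockingVideoHandler`.

```python
import threading


class OldStubHandler:
    """Stub written before the audio parameter existed."""

    def __init__(self):
        self.started = threading.Event()

    def generate_video_bytes(self, prompt):
        self.started.set()
        return b""


class UpdatedStubHandler(OldStubHandler):
    """Stub that accepts the new keyword, as in the described fix."""

    def generate_video_bytes(self, prompt, reference_audio=None):
        self.started.set()
        return b""


def call_like_server(handler):
    """Call the stub with the new keyword and report whether it started."""
    try:
        handler.generate_video_bytes("A person singing", reference_audio=None)
    except TypeError:
        pass  # In a background task this error would not reach the test.
    return handler.started.wait(timeout=0.1)
```

Under this assumption, `call_like_server(OldStubHandler())` reports that the stub never started, while the updated stub starts immediately.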

Tested test_video_server.py locally on an A100:

tests/entrypoints/openai_api/test_video_server.py::test_async_video_generation_bypasses_base64 PASSED [  1%]
tests/entrypoints/openai_api/test_video_server.py::test_async_video_generation_with_audio_bypasses_base64 PASSED [  3%]
tests/entrypoints/openai_api/test_video_server.py::test_t2v_video_generation_form PASSED [  4%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_form PASSED [  6%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_resizes_input_to_requested_dimensions PASSED [  8%]
tests/entrypoints/openai_api/test_video_server.py::test_i2v_video_generation_with_image_reference_form PASSED [  9%]
tests/entrypoints/openai_api/test_video_server.py::test_v2v_video_generation_form PASSED [ 11%]
tests/entrypoints/openai_api/test_video_server.py::test_v2v_video_generation_with_video_reference_form PASSED [ 12%]
tests/entrypoints/openai_api/test_video_server.py::test_decode_video_bytes_can_keep_last_frames PASSED [ 14%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_uses_v2v_condition_frames PASSED [ 16%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_preserves_action_frames PASSED [ 17%]
tests/entrypoints/openai_api/test_video_server.py::test_cosmos3_reference_video_limit_caps_condition_frames_to_output_frames PASSED [ 19%]
tests/entrypoints/openai_api/test_video_server.py::test_s2v_video_generation_with_audio_reference_form PASSED [ 20%]
tests/entrypoints/openai_api/test_video_server.py::test_seconds_defaults_fps_and_frames PASSED [ 22%]
tests/entrypoints/openai_api/test_video_server.py::test_size_param_sets_width_height PASSED [ 24%]
tests/entrypoints/openai_api/test_video_server.py::test_sampling_params_pass_through PASSED [ 25%]
tests/entrypoints/openai_api/test_video_server.py::test_frame_interpolation_params_pass_to_diffusion_sampling_params PASSED [ 27%]
tests/entrypoints/openai_api/test_video_server.py::test_default_sampling_params_apply_to_video_requests PASSED [ 29%]
tests/entrypoints/openai_api/test_video_server.py::test_request_params_override_default_video_sampling_params PASSED [ 30%]
tests/entrypoints/openai_api/test_video_server.py::test_worker_fps_multiplier_is_applied_to_async_encoding PASSED [ 32%]
tests/entrypoints/openai_api/test_video_server.py::test_audio_sample_rate_comes_from_model_config PASSED [ 33%]
tests/entrypoints/openai_api/test_video_server.py::test_video_job_persists_profiler_metadata PASSED [ 35%]
tests/entrypoints/openai_api/test_video_server.py::test_video_generation_response_exposes_action_payload PASSED [ 37%]
tests/entrypoints/openai_api/test_video_server.py::test_video_job_persists_action_metadata PASSED [ 38%]
tests/entrypoints/openai_api/test_video_server.py::test_action_extraction_accepts_unbatched_action PASSED [ 40%]
tests/entrypoints/openai_api/test_video_server.py::test_missing_handler_returns_503 PASSED [ 41%]
tests/entrypoints/openai_api/test_video_server.py::test_missing_prompt_returns_422 PASSED [ 43%]
tests/entrypoints/openai_api/test_video_server.py::test_video_generation_rejects_model_mismatch PASSED [ 45%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_size_parse_returns_422 PASSED [ 46%]
tests/entrypoints/openai_api/test_video_server.py::test_rejects_input_reference_and_image_reference_together PASSED [ 48%]
tests/entrypoints/openai_api/test_video_server.py::test_rejects_image_reference_and_video_reference_together PASSED [ 50%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_seconds_returns_422 PASSED [ 51%]
tests/entrypoints/openai_api/test_video_server.py::test_negative_prompt_and_seed_pass_through PASSED [ 53%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_lora_returns_400 PASSED [ 54%]
tests/entrypoints/openai_api/test_video_server.py::test_unsupported_image_reference_file_id_returns_400 PASSED [ 56%]
tests/entrypoints/openai_api/test_video_server.py::test_unsupported_video_reference_file_id_returns_400 PASSED [ 58%]
tests/entrypoints/openai_api/test_video_server.py::test_invalid_uploaded_input_reference_returns_400 PASSED [ 59%]
tests/entrypoints/openai_api/test_video_server.py::test_video_request_validation PASSED [ 61%]
tests/entrypoints/openai_api/test_video_server.py::test_list_videos_supports_order_after_and_limit PASSED [ 62%]
tests/entrypoints/openai_api/test_video_server.py::test_delete_completed_job_removes_file_and_metadata PASSED [ 64%]
tests/entrypoints/openai_api/test_video_server.py::test_delete_in_progress_job_cancels_task_and_removes_metadata PASSED [ 66%]
tests/entrypoints/openai_api/test_video_server.py::test_video_response_file_extension_is_robust PASSED [ 67%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_merged_into_extra_args PASSED [ 69%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_none_by_default PASSED [ 70%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_invalid_json PASSED [ 72%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_merged_with_existing_extra_args PASSED [ 74%]
tests/entrypoints/openai_api/test_video_server.py::test_sample_solver_forwarded_via_extra_params PASSED [ 75%]
tests/entrypoints/openai_api/test_video_server.py::test_extra_params_allows_inline_action PASSED [ 77%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_t2v_returns_video_bytes PASSED [ 79%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_t2v_returns_profiler_headers PASSED [ 80%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_i2v_returns_video_bytes PASSED [ 82%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_i2v_with_image_reference PASSED [ 83%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_v2v_returns_video_bytes PASSED [ 85%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_missing_handler_returns_503 PASSED [ 87%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_missing_prompt_returns_422 PASSED [ 88%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_rejects_both_references PASSED [ 90%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_generation_error_returns_500 PASSED [ 91%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_does_not_create_store_entry PASSED [ 93%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_sampling_params_pass_through PASSED [ 95%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_frame_interpolation_params_pass_to_sampling_params PASSED [ 96%]
tests/entrypoints/openai_api/test_video_server.py::test_sync_default_sampling_params_apply_to_video_requests PASSED [ 98%]
tests/entrypoints/openai_api/test_video_server.py::test_worker_fps_multiplier_is_applied_to_sync_encoding PASSED [100%]
...
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
======================= 62 passed, 28 warnings in 7.26s ========================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Jun 10, 2026

@hsliuustc0106 hsliuustc0106 left a comment

Collaborator

lgtm

@hsliuustc0106 hsliuustc0106 merged commit 5414f78 into vllm-project:main Jun 11, 2026
7 of 8 checks passed
Nughm3 pushed a commit to Nughm3/vllm-omni that referenced this pull request Jun 18, 2026
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>