[WAN2.2-S2V] Add server API for image + audio#3394
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d6253aa433
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
gcanlin
left a comment
There was a problem hiding this comment.
Any open standard about the API design, such as OpenAI?
hsliuustc0106
left a comment
There was a problem hiding this comment.
Review Summary
Validated: CI green (DCO, pre-commit, build 3.11/3.12, docs). The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner.
Issues to address:
-
Temp file leak (see inline).
NamedTemporaryFile(delete=False)writes the uploaded audio to disk but the temp file is never cleaned up. Neither_run_video_generation_jobnor the sync endpoint has cleanup logic foraudio_path. Each request will leave a temp audio file on disk. Consider adding cleanup in afinallyblock or using a context manager. -
No audio validation. The endpoint accepts any bytes as
input_audiowith no checks for file type, MIME type, size limits, or duration. An unsupported format will fail deep in the pipeline with a confusing error. Validate the audio format and size at the API layer. -
No test results. The PR description has empty "Test Plan" and "Test Result" sections. For an API change that adds a new form field, please provide at minimum a curl invocation showing the endpoint works, and confirm the model produces output with the uploaded audio.
-
Mergeable is UNKNOWN — may need a rebase against main.
Reviewed by Claude Code with deepseek-v4-pro
There was a problem hiding this comment.
Pull request overview
Adds support for supplying an uploaded audio file alongside a reference image to the OpenAI-style video serving endpoints, enabling Wan2.2-S2V “speech-to-video” style requests via the server API.
Changes:
- Thread an optional
audio_paththrough the video serving stack and attach it toprompt["multi_modal_data"]. - Extend the FastAPI multipart form parser to accept
input_audioand persist it to a temporary file for downstream consumption. - Document online serving usage for Wan2.2-S2V (server + curl client example).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
vllm_omni/entrypoints/openai/serving_video.py |
Accepts audio_path and forwards it via multi_modal_data into the diffusion prompt. |
vllm_omni/entrypoints/openai/api_server.py |
Adds input_audio form field parsing, writes uploads to a temp file, and passes audio_path into generation paths. |
recipes/Wan-AI/Wan2.2-S2V.md |
Adds “Online Serving” documentation with a server command and curl example including input_audio. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| async def _parse_video_form( | ||
| raw_request: Request, | ||
| prompt: str = Form(...), | ||
| input_reference: UploadFile | None = File(default=None), | ||
| input_audio: UploadFile | None = File(default=None), | ||
| image_reference: str | None = Form(default=None), |
OpenAI api is can either use file or url Since the existing attribute_name for image is already different from openAI - 'image_input' vs 'input_reference', should be refactoring and follow openAI here? |
|
just a quick remind: please check whether online&offline mode output the same image |
d6253aa to
e46394a
Compare
hsliuustc0106
left a comment
There was a problem hiding this comment.
Review Summary
Validated: All CI gates green (DCO, pre-commit, build 3.11/3.12, docs). PR is mergeable. The input_audio → audio_path threading through the API layer is consistent across all call sites. The multi_modal_data refactor from direct dict assignment to conditional construction is correct and cleaner. Examples and recipe docs are well-structured.
These issues were flagged in the prior review round (on the previous commit) and have not been addressed:
Blocking issues:
-
Temp file leak.
NamedTemporaryFile(delete=False)writes uploaded audio to disk but the temp file is never cleaned up — not on success, not on error, not on cancellation._run_video_generation_jobcalls_cleanup_video()for the output file but has no equivalent foraudio_path.create_video_syncalso has no cleanup. Each request leaks an orphan file on disk. Addos.unlink(audio_path)in afinallyblock in both paths. -
Empty audio silently ignored. When
input_audiois provided but the file has zero bytes, the code silently setsaudio_path = None. This produces a misleading downstream error instead of a clear 400 response. Return an explicit 400 wheninput_audiois present but empty. -
No test evidence. The test plan and test result sections are empty. Please provide at minimum a curl command showing the endpoint produces output with an uploaded audio file, and confirm online and offline modes produce consistent results (as previously requested).
Reviewed by Claude Code with glm-5.1
e46394a to
9f80ae2
Compare
5bb37a0 to
08e4171
Compare
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
08e4171 to
f482dbb
Compare
hsliuustc0106
left a comment
There was a problem hiding this comment.
shall we update the doc for api server as well?
- Update docs/serving/videos_api.md with audio_reference parameter and Speech-to-Video example section - Add unit tests for decode_audio_url (data URL, HTTP URL, invalid URL, suffix sanitization) - Add integration test for audio_reference form parameter in video server - Sanitize user-controlled MIME extension in decode_audio_url to prevent path traversal via crafted data URLs Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@hsliuustc0106 ,fixed comments by adding new tests and update doc for api server |
|
|
is this ci failure related to this PR? |
…d_removes_metadata Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
@hsliuustc0106 . Fix the failed issue by b8cdb5f Have locally tested with A100 - test_video_server.py |
Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Server
Client
Test Plan
Offline Test
Online Serving Test
Test Result
offline - 40 steps
s2v_offline_40steps_cache_dit.mp4
online - 40 steps
s2v_serve_40steps_cache_dit.mp4
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)