
feat(video) Generalize multimodal runtime support and add Qwen3.5 video #354

Draft
yechank-nvidia wants to merge 5 commits into main from yechan/mm-video-support



@yechank-nvidia

Collaborator

Summary

This PR generalizes TokenSpeed's multimodal runtime path and adds Qwen3.5 video support.

Previously the multimodal runtime was mostly image-oriented: image encoder CUDA graph wrapping, image grid metadata, and image-only M-RoPE assumptions. This PR makes the encoder CUDA graph layer modality-aware and adds the missing runtime pieces needed for SMG to send precomputed video multimodal inputs to TokenSpeed.

Main changes:

  • Add Qwen3.5 video M-RoPE handling for video_grid_thw.
  • Split Qwen3.5 video RoPE metadata to match HF get_rope_index behavior while keeping the encoder input grid unchanged.
  • Generalize multimodal encoder CUDA graphs through EncoderCudaGraphWrapper + VisionEncoderCudaGraphAdapter.
  • Support CUDA graph wrapping for both image_encoder and video_encoder.
  • Add configurable video encoder CUDA graph metadata sequence limit via TOKENSPEED_MM_VIDEO_ENCODER_CUDA_GRAPH_MAX_SEQUENCES_PER_BATCH.
  • Clean up multimodal input lifecycle for SHM-backed features.
  • Avoid hashing SHM-backed multimodal tensors inside TokenSpeed; SMG is expected to provide the content hash or pad value for them instead.
  • Add scalar/cached M-RoPE delta handling to reduce decode-side multimodal overhead.
  • Add multimodal timing hooks via TOKENSPEED_LOG_MM_TIMING.
  • Keep Kimi/Qwen vision paths aligned with the generalized encoder CUDA graph plumbing.
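
To make the RoPE metadata split concrete, here is a minimal sketch. It assumes the split expands each video_grid_thw row (t, h, w) into t per-frame rows (1, h, w) for position-id computation, which is how HF's Qwen3-VL get_rope_index treats video grids, while the encoder keeps receiving the original row. The helper name split_video_grid_for_rope is hypothetical and is not taken from this PR.

```python
from typing import List, Tuple

Grid = Tuple[int, int, int]  # (t, h, w) in patch units


def split_video_grid_for_rope(video_grid_thw: List[Grid]) -> List[Grid]:
    """Expand each (t, h, w) row into t rows of (1, h, w).

    Hypothetical helper: assumes per-frame rows are what an HF-style
    get_rope_index expects for video inputs. The input list is not modified,
    so it can still be passed to the encoder unchanged.
    """
    rope_grid: List[Grid] = []
    for t, h, w in video_grid_thw:
        rope_grid.extend([(1, h, w)] * t)
    return rope_grid


# Two videos: one with 2 temporal patches, one with a single temporal patch.
encoder_grid = [(2, 4, 6), (1, 8, 8)]
rope_grid = split_video_grid_for_rope(encoder_grid)
```

The total patch count is the same in both views; only the row granularity used for M-RoPE differs.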

This is intended to pair with the SMG multimodal payload PR, where SMG sends itemized precomputed multimodal tensors for image/video requests.

Test Plan

Validated with SMG + TokenSpeed using Qwen3.5-VL:

  • Text-only request works end-to-end.
  • Image request works end-to-end.
  • Video request works end-to-end.
  • Verified video output is meaningful, not garbage, on short video prompts.
  • Ran image/video benchmark sweeps with aiperf while CUDA graph was enabled.
  • Ran accuracy-oriented checks with sampled image/video eval tasks during bring-up.
  • Rebased onto latest origin/main; working tree is clean.
  • Ran syntax-only compile checks for touched Python files after rebase because py_compile could not write to an existing __pycache__ path due to permissions.
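
For reference, a syntax-only check that never touches __pycache__ can be done with the built-in compile(), which builds a code object in memory and writes nothing to disk. This is a sketch of one way to run such a check, not necessarily the exact command used here; syntax_check and the demo files are illustrative names.

```python
import pathlib
import tempfile
from typing import List, Tuple


def syntax_check(paths: List[pathlib.Path]) -> List[Tuple[str, str]]:
    """Return (path, error) pairs for files that fail to compile."""
    failures: List[Tuple[str, str]] = []
    for path in paths:
        source = path.read_text(encoding="utf-8")
        try:
            # compile() only parses and compiles in memory; no .pyc is written.
            compile(source, str(path), "exec")
        except SyntaxError as exc:
            failures.append((str(path), f"line {exc.lineno}: {exc.msg}"))
    return failures


# Demo on two throwaway files: one valid, one with a syntax error.
with tempfile.TemporaryDirectory() as tmp:
    good = pathlib.Path(tmp) / "good.py"
    bad = pathlib.Path(tmp) / "bad.py"
    good.write_text("x = 1\n", encoding="utf-8")
    bad.write_text("def broken(:\n", encoding="utf-8")
    demo_failures = syntax_check([good, bad])
```

In a real checkout the path list would be the touched files, for example those reported by git diff --name-only against the base branch.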

"SHM-backed multimodal items must carry content hash or "
"pad_value before TokenSpeed consumes them"
)
self.hash = hash_feature(self.feature)

Contributor


Do we always use the gateway-provided hash value? If so, we could consider dropping the in-engine hash fallback entirely and deleting multimodal/hash.py, to avoid confusion.
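
To make the question concrete, here is a simplified sketch of the guard as it reads from the excerpt above: SHM-backed items are rejected when they carry neither a hash nor a pad value, and only other items fall back to in-engine hashing. MultimodalItem, shm_backed, and ensure_hash are hypothetical names, and hash_feature below is a stand-in, since the real implementation is not shown in this thread.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def hash_feature(feature: bytes) -> int:
    """Stand-in for the in-engine hash; the real hash_feature is not shown here."""
    return int.from_bytes(hashlib.sha256(feature).digest()[:8], "little")


@dataclass
class MultimodalItem:  # hypothetical, simplified item type
    feature: bytes
    shm_backed: bool = False
    hash: Optional[int] = None
    pad_value: Optional[int] = None

    def ensure_hash(self) -> None:
        # Anything the gateway already supplied is trusted as-is.
        if self.hash is not None or self.pad_value is not None:
            return
        # SHM-backed features are never hashed in-engine.
        if self.shm_backed:
            raise ValueError(
                "SHM-backed multimodal items must carry content hash or "
                "pad_value before TokenSpeed consumes them"
            )
        # Fallback path that the review comment proposes removing.
        self.hash = hash_feature(self.feature)


local_item = MultimodalItem(feature=b"pixels")
local_item.ensure_hash()
```

If the gateway always supplies the hash, the final fallback line is unreachable in practice, which is the case for deleting it.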

)
if max_video_metadata_sequences is not None:
max_video_metadata_sequences = max(1, max_video_metadata_sequences)
return {

Contributor


image_encoder and video_encoder share the same structure (both go through visual.forward_blocks over the same (64,4096) token buckets). Building separate wrappers doubles GPU memory usage. Would it be possible to share a single set of graphs here?
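
One possible shape for that sharing, as a sketch only: key captured graphs by token bucket alone and ignore the modality, so image and video requests that land in the same bucket reuse one capture. This assumes both paths really have identical static shapes and metadata per bucket, which this thread does not settle. SharedBucketGraphCache, the bucket sizes, and the capture callable are all hypothetical; no real CUDA graph is captured here.

```python
import bisect
from typing import Any, Callable, Dict, List


class SharedBucketGraphCache:
    """Cache captured graphs per token bucket, shared across modalities."""

    def __init__(self, buckets: List[int], capture: Callable[[int], Any]) -> None:
        self._buckets = sorted(buckets)
        self._capture = capture  # stand-in for the real graph capture
        self._graphs: Dict[int, Any] = {}

    def bucket_for(self, num_tokens: int) -> int:
        """Return the smallest bucket that fits num_tokens."""
        idx = bisect.bisect_left(self._buckets, num_tokens)
        if idx == len(self._buckets):
            raise ValueError(f"{num_tokens} tokens exceeds the largest bucket")
        return self._buckets[idx]

    def get(self, modality: str, num_tokens: int) -> Any:
        # modality is accepted for call-site symmetry but is deliberately
        # not part of the cache key.
        bucket = self.bucket_for(num_tokens)
        if bucket not in self._graphs:
            self._graphs[bucket] = self._capture(bucket)
        return self._graphs[bucket]


captured: List[int] = []


def fake_capture(bucket: int) -> Dict[str, int]:
    captured.append(bucket)
    return {"bucket": bucket}


cache = SharedBucketGraphCache([64, 128, 256, 4096], fake_capture)
image_graph = cache.get("image", 100)
video_graph = cache.get("video", 120)
```

If the video path needs extra per-sequence metadata buffers, those would have to live outside the shared graph or be sized for the larger of the two modalities.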


2 participants