feat(video) Generalize multimodal runtime support and add Qwen3.5 video by yechank-nvidia · Pull Request #354 · lightseekorg/tokenspeed

yechank-nvidia · 2026-06-04T15:34:17Z

Summary

This PR generalizes TokenSpeed's multimodal runtime path and adds Qwen3.5 video support.

Previously the multimodal runtime was mostly image-oriented: image encoder CUDA graph wrapping, image grid metadata, and image-only M-RoPE assumptions. This PR makes the encoder CUDA graph layer modality-aware and adds the missing runtime pieces needed for SMG to send precomputed video multimodal inputs to TokenSpeed.

Main changes:

Add Qwen3.5 video M-RoPE handling for video_grid_thw.
Split Qwen3.5 video RoPE metadata to match HF get_rope_index behavior while keeping the encoder input grid unchanged.
Generalize multimodal encoder CUDA graphs through EncoderCudaGraphWrapper + VisionEncoderCudaGraphAdapter.
Support CUDA graph wrapping for both image_encoder and video_encoder.
Add configurable video encoder CUDA graph metadata sequence limit via TOKENSPEED_MM_VIDEO_ENCODER_CUDA_GRAPH_MAX_SEQUENCES_PER_BATCH.
Clean up multimodal input lifecycle for SHM-backed features.
Avoid hashing SHM-backed multimodal tensors inside TokenSpeed unless SMG already provided the content hash / pad value.
Add scalar/cached M-RoPE delta handling to reduce decode-side multimodal overhead.
Add multimodal timing hooks via TOKENSPEED_LOG_MM_TIMING.
Keep Kimi/Qwen vision paths aligned with the generalized encoder CUDA graph plumbing.
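
For illustration, here is a minimal sketch of the video RoPE metadata split listed above. It assumes the split expands each `(t, h, w)` row of `video_grid_thw` into `t` rows of `(1, h, w)` for position computation only, which is how HF's Qwen3-VL `get_rope_index` handles video grids. Whether Qwen3.5 matches this exactly is an assumption, and `split_video_grid_for_rope` is a hypothetical name rather than a function from this PR.

```python
from typing import List, Tuple

Grid = Tuple[int, int, int]  # (t, h, w) in patch units


def split_video_grid_for_rope(video_grid_thw: List[Grid]) -> List[Grid]:
    """Expand each (t, h, w) video grid into t per-frame grids of (1, h, w).

    Hypothetical helper: it returns a new list used only for RoPE position
    metadata and never mutates the grid that is fed to the vision encoder.
    """
    rope_grids: List[Grid] = []
    for t, h, w in video_grid_thw:
        if t < 1:
            raise ValueError(f"temporal size must be >= 1, got {t}")
        # One (1, h, w) entry per temporal slice of the original grid.
        rope_grids.extend([(1, h, w)] * t)
    return rope_grids


encoder_grid = [(2, 4, 6), (1, 8, 8)]
rope_grid = split_video_grid_for_rope(encoder_grid)
print(rope_grid)
```

The encoder still receives `encoder_grid` as-is; only the metadata used to compute M-RoPE positions is expanded.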

This is intended to pair with the SMG multimodal payload PR, where SMG sends itemized precomputed multimodal tensors for image/video requests.
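
For the configurable video sequence limit listed under Main changes, a rough sketch of how the environment variable could be read. Only the variable name and the clamp to a minimum of 1 come from this PR; the helper name `read_max_video_metadata_sequences`, the treatment of an unset or empty value as "no limit configured", and the integer parsing are assumptions.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "TOKENSPEED_MM_VIDEO_ENCODER_CUDA_GRAPH_MAX_SEQUENCES_PER_BATCH"


def read_max_video_metadata_sequences(
    env: Mapping[str, str] = os.environ,
) -> Optional[int]:
    """Return the configured limit, or None when the variable is unset.

    Hypothetical helper: values below 1 are clamped to 1, mirroring the
    max(1, ...) guard that appears in the diff.
    """
    raw = env.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        return None
    value = int(raw)  # raises ValueError for non-integer input
    return max(1, value)


print(read_max_video_metadata_sequences({ENV_NAME: "0"}))
```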

Test Plan

Validated with SMG + TokenSpeed using Qwen3.5-VL:

Text-only request works end-to-end.
Image request works end-to-end.
Video request works end-to-end.
Verified video output is meaningful, not garbage, on short video prompts.
Ran image/video benchmark sweeps with aiperf while CUDA graph was enabled.
Ran accuracy-oriented checks with sampled image/video eval tasks during bring-up.
Rebased onto latest origin/main; working tree is clean.
Ran syntax-only compile checks for touched Python files after rebase because py_compile could not write to an existing __pycache__ path due to permissions.

chenht2022 · 2026-06-08T15:37:48Z

+                    "SHM-backed multimodal items must carry content hash or "
+                    "pad_value before TokenSpeed consumes them"
+                )
            self.hash = hash_feature(self.feature)


Do we always use the gateway-provided hash value? If so, we could consider dropping the in-engine hash fallback entirely and deleting multimodal/hash.py, so the code does not suggest a path that is never taken.
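
To make the question concrete, a simplified sketch of the guard as it reads from the diff: SHM-backed items are rejected unless a hash or pad value is already attached, and only non-SHM items reach the in-engine fallback. `MultimodalItem`, `ensure_hash`, the `shm_backed` flag, the `ValueError` type, and the SHA-256 based `hash_feature` below are stand-ins for illustration, not the actual TokenSpeed definitions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def hash_feature(feature: bytes) -> int:
    """Stand-in for the in-engine fallback hash (assumed, not the real one)."""
    return int.from_bytes(hashlib.sha256(feature).digest()[:8], "big")


@dataclass
class MultimodalItem:
    feature: bytes
    shm_backed: bool = False
    hash: Optional[int] = None
    pad_value: Optional[int] = None

    def ensure_hash(self) -> None:
        # Nothing to do when the gateway already supplied either value.
        if self.hash is not None or self.pad_value is not None:
            return
        if self.shm_backed:
            # Hashing here would force a read of the shared-memory tensor.
            raise ValueError(
                "SHM-backed multimodal items must carry content hash or "
                "pad_value before TokenSpeed consumes them"
            )
        # Fallback path: only reachable for non-SHM items without a hash.
        self.hash = hash_feature(self.feature)


item = MultimodalItem(feature=b"pixels")
item.ensure_hash()
print(item.hash is not None)
```

If the gateway always attaches a hash, the last line of `ensure_hash` is dead code, which is the case for removing the fallback.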

chenht2022 · 2026-06-08T16:01:29Z

+        )
+        if max_video_metadata_sequences is not None:
+            max_video_metadata_sequences = max(1, max_video_metadata_sequences)
+        return {


image_encoder and video_encoder share the same structure (both go through visual.forward_blocks over the same (64,4096) token buckets). Building separate wrappers doubles GPU memory usage. Would it be possible to share a single set of graphs here?
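
A rough sketch of what sharing could look like: one pool keyed by token bucket only, so image and video requests reuse the same captured graph. `SharedEncoderGraphPool` and `capture_fn` are hypothetical, the bucket sizes are placeholders, and no real CUDA graph is captured here. It also assumes the two modalities have identical static inputs, which may not hold if the video path needs its own metadata sequence limit.

```python
from typing import Callable, Dict, Tuple


class SharedEncoderGraphPool:
    """Hypothetical pool: one captured graph per token bucket, any modality."""

    def __init__(self, capture_fn: Callable[[int], object], buckets: Tuple[int, ...]):
        self._capture_fn = capture_fn
        self._buckets = tuple(sorted(buckets))
        self._graphs: Dict[int, object] = {}

    def bucket_for(self, num_tokens: int) -> int:
        # Pick the smallest bucket that fits the request.
        for bucket in self._buckets:
            if num_tokens <= bucket:
                return bucket
        raise ValueError(f"{num_tokens} tokens exceed the largest bucket")

    def get(self, modality: str, num_tokens: int) -> object:
        # The modality is deliberately not part of the cache key.
        bucket = self.bucket_for(num_tokens)
        if bucket not in self._graphs:
            self._graphs[bucket] = self._capture_fn(bucket)
        return self._graphs[bucket]


captures = []


def fake_capture(bucket: int) -> object:
    # Stand-in for capturing visual.forward_blocks at a given bucket size.
    captures.append(bucket)
    return object()


pool = SharedEncoderGraphPool(fake_capture, buckets=(64, 4096))
image_graph = pool.get("image_encoder", 50)
video_graph = pool.get("video_encoder", 60)
print(image_graph is video_graph, len(captures))
```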

yechank-nvidia added 5 commits June 4, 2026 08:19

Support Qwen3.5 video M-RoPE

9b5f185

Generalize multimodal encoder CUDA graphs

e17f4f0

Optimize multimodal SHM feature lifecycle

663565b

Optimize multimodal M-RoPE decode positions

542bbcf

Clean up multimodal runtime plumbing

c7df817

chenht2022 reviewed Jun 8, 2026

yechank-nvidia wants to merge 5 commits into main from yechan/mm-video-support
