feat(video) Generalize multimodal runtime support and add Qwen3.5 video #354
Draft
yechank-nvidia wants to merge 5 commits into
chenht2022
reviewed
Jun 8, 2026
"SHM-backed multimodal items must carry content hash or "
"pad_value before TokenSpeed consumes them"
)
self.hash = hash_feature(self.feature)
Contributor
Do we always use the gateway-provided hash value? If so, we could drop the in-engine hash fallback entirely and delete multimodal/hash.py, so the code does not suggest a path that is never taken.
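A minimal sketch of what removing the fallback could look like, assuming every item arrives with a gateway-provided hash or a pad_value. `MultimodalItem`, `content_hash`, and `require_identity` are hypothetical names used only for illustration; they are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalItem:
    """Hypothetical stand-in for the engine's multimodal item type."""

    feature: bytes
    content_hash: Optional[int] = None  # assumed to be set by the gateway
    pad_value: Optional[int] = None

    def require_identity(self) -> None:
        # With no in-engine fallback, a missing hash is reported as a
        # caller error instead of being recomputed from the feature.
        if self.content_hash is None and self.pad_value is None:
            raise ValueError(
                "multimodal items must carry content hash or "
                "pad_value before TokenSpeed consumes them"
            )


# An item with a gateway hash passes; one without either field is rejected.
MultimodalItem(feature=b"\x00", content_hash=1234).require_identity()
```

Failing loudly here would make a gateway that stops sending hashes visible immediately, at the cost of rejecting any caller that still relies on the fallback.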
)
if max_video_metadata_sequences is not None:
max_video_metadata_sequences = max(1, max_video_metadata_sequences)
return {
Contributor
image_encoder and video_encoder share the same structure (both go through visual.forward_blocks over the same (64,4096) token buckets). Building separate wrappers doubles GPU memory usage. Would it be possible to share a single set of graphs here?
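One way to picture the sharing, as a rough sketch only: both modality wrappers look up graphs in a single cache keyed by token bucket, so each bucket is captured once. `SharedGraphCache`, `ModalityEncoder`, and the `capture` callback are hypothetical, and the sketch assumes image and video inputs really are shape-compatible within a bucket, which would need to be confirmed against the actual wrappers.

```python
from typing import Callable, Dict, List


class SharedGraphCache:
    """Hypothetical cache holding one captured graph per token bucket."""

    def __init__(self, capture: Callable[[int], object]) -> None:
        self._capture = capture  # placeholder for real CUDA graph capture
        self._graphs: Dict[int, object] = {}

    def get(self, bucket: int) -> object:
        # Capture lazily, and only the first time a bucket is requested.
        if bucket not in self._graphs:
            self._graphs[bucket] = self._capture(bucket)
        return self._graphs[bucket]


class ModalityEncoder:
    """Hypothetical per-modality wrapper that borrows the shared cache."""

    def __init__(self, modality: str, cache: SharedGraphCache) -> None:
        self.modality = modality
        self.cache = cache

    def graph_for(self, bucket: int) -> object:
        return self.cache.get(bucket)


captured: List[int] = []


def fake_capture(bucket: int) -> object:
    # Records which buckets were captured; returns a dummy graph handle.
    captured.append(bucket)
    return object()


cache = SharedGraphCache(fake_capture)
image_encoder = ModalityEncoder("image", cache)
video_encoder = ModalityEncoder("video", cache)

# Both modalities receive the same graph object for the same bucket.
same = image_encoder.graph_for(64) is video_encoder.graph_for(64)
```

If the video path needs extra per-sequence metadata buffers, those could stay per-modality while the captured graphs themselves are shared.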
Summary
This PR generalizes TokenSpeed's multimodal runtime path and adds Qwen3.5 video support.
Previously the multimodal runtime was mostly image-oriented: image encoder CUDA graph wrapping, image grid metadata, and image-only M-RoPE assumptions. This PR makes the encoder CUDA graph layer modality-aware and adds the missing runtime pieces needed for SMG to send precomputed video multimodal inputs to TokenSpeed.
Main changes:
- video_grid_thw.
- get_rope_index behavior while keeping the encoder input grid unchanged.
- EncoderCudaGraphWrapper + VisionEncoderCudaGraphAdapter.
- image_encoder and video_encoder.
- TOKENSPEED_MM_VIDEO_ENCODER_CUDA_GRAPH_MAX_SEQUENCES_PER_BATCH.
- TOKENSPEED_LOG_MM_TIMING.
This is intended to pair with the SMG multimodal payload PR, where SMG sends itemized precomputed multimodal tensors for image/video requests.
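As a hedged illustration of how the video sequence limit might be read, the sketch below parses the environment variable and applies the same `max(1, ...)` clamp that appears in the reviewed diff. That the variable holds an integer, that an unset value means "no limit", and the helper name `read_max_video_sequences` are all assumptions, not confirmed by the PR.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "TOKENSPEED_MM_VIDEO_ENCODER_CUDA_GRAPH_MAX_SEQUENCES_PER_BATCH"


def read_max_video_sequences(
    env: Mapping[str, str] = os.environ,
) -> Optional[int]:
    """Hypothetical helper: return the clamped limit, or None if unset."""
    raw = env.get(ENV_NAME)
    if raw is None or not raw.strip():
        return None  # assumption: unset or blank means no explicit limit
    # Same clamp as in the diff: never allow fewer than one sequence.
    return max(1, int(raw))


limit = read_max_video_sequences({ENV_NAME: "0"})
```

Passing the environment as a mapping keeps the helper easy to test without mutating `os.environ`.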
Test Plan
Validated with SMG + TokenSpeed using Qwen3.5-VL:
- origin/main; working tree is clean.
- py_compile could not write to an existing __pycache__ path due to permissions.