perf(deepseek-v4): pre-compile deep_gemm JIT kernels at startup #398
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9fa0e4959
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    "head_dim": head_dim,
    "kv_dim": kv_lora_rank,
    "kv_scale_dim": kv_lora_rank // mxfp4_block_size,
Use indexer dimensions for MQA logits warmup
For the DeepSeek V4 prefill indexer, these shapes are derived from the attention MLA dimensions (head_dim/kv_lora_rank) instead of the indexer dimensions: DeepseekV4Indexer builds q from config.index_n_heads * config.index_head_dim and the MXFP4 cache stores packed value/scale bytes from index_head_dim. On V4 Flash those differ from the MLA values (e.g. 128 index head dim vs 512 head/kv rank), so this warms/compiles the wrong fp8_fp4_mqa_logits shapes and the first real prefill can still JIT after the CI warmup request was removed.
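A minimal sketch of the shape derivation this comment asks for, assuming a config that exposes `index_n_heads` and `index_head_dim` (both named in the comment) and an MXFP4 block size of 32. `IndexerConfig`, `indexer_warmup_shapes`, the `num_heads` key, and the head count used in the example are hypothetical and only illustrate taking the dimensions from the indexer rather than from the MLA `head_dim`/`kv_lora_rank`.

```python
from dataclasses import dataclass

MXFP4_BLOCK_SIZE = 32  # assumed here; the PR reads this from mxfp4_block_size


@dataclass
class IndexerConfig:
    """Hypothetical stand-in for the config fields the comment names."""
    index_n_heads: int
    index_head_dim: int


def indexer_warmup_shapes(cfg: IndexerConfig, block_size: int = MXFP4_BLOCK_SIZE) -> dict:
    """Derive warmup dims from the indexer, not from the MLA dimensions."""
    if cfg.index_head_dim % block_size != 0:
        raise ValueError("index_head_dim must be a multiple of the MXFP4 block size")
    return {
        "num_heads": cfg.index_n_heads,
        "head_dim": cfg.index_head_dim,
        "kv_dim": cfg.index_head_dim,
        "kv_scale_dim": cfg.index_head_dim // block_size,
    }


# 128 is the index head dim the comment cites for V4 Flash; 64 heads is illustrative.
shapes = indexer_warmup_shapes(IndexerConfig(index_n_heads=64, index_head_dim=128))
```

With these inputs the scale dimension is 128 // 32 = 4, whereas deriving it from a 512-wide MLA rank would give 16 and therefore a different kernel shape.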
Reviewed commit: 573e4b3f9a
    q_scales = torch.ones(
        num_tokens,
        num_heads,
        dtype=torch.float32,
        device=device,
Mirror the FP4 MQA runtime argument contract
This synthetic prefill call does not produce the same DeepGEMM-ready tensors that the runtime passes from deepseek_v4_prepare_indexer_q_mxfp4: the FP4 q scale tensor here is float32, while the FP4 MQA API expects the packed scale tensor (int32), and the later weights/cu_*/clean_logits arguments are also shaped or flagged differently from the runtime call. On DeepSeek V4 Flash this warmup will hit DeepGEMM host assertions before compiling the indexer kernel, so startup fails (or the removed CI warmup request no longer covers this path) instead of warming the JIT.
Move deep_gemm JIT warmup into tokenspeed-kernel and add warmup for prefill-path kernels (tf32_hc_prenorm_gemm, fp8_fp4_mqa_logits, fp8_gemm_nt) that were previously compiled on the first real request.

On CI runners with a cold deep_gemm cache, the first prefill request triggered ~8 cubin compilations (~14s on B200, ~34s on CI), blocking the worker event loop and causing smg health probe timeouts. Now all cubins are compiled during server startup (~4s overhead), reducing first-request latency from 14.6s to 0.32s.

Changes:
- New warmup.py in tokenspeed-kernel with warmup_mega_moe_jit(), warmup_prefill_jit(), warmup_fp8_gemm_nt(), and warmup_fp8_gemm_nt_from_model()
- Generic post_quant_warmup hook in model loader for fp8_gemm_nt warmup after quant weight processing
- Remove 300-token warmup request from V4-Flash CI eval configs (no longer needed)

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Reviewed commit: 64378d8508
    a = torch.zeros(max_tokens, k, dtype=torch.float8_e4m3fn, device=device)
    a_scales = torch.ones(
        max_tokens, k // block_size, dtype=torch.float32, device=device
    )
    b = torch.zeros(n, k, dtype=torch.float8_e4m3fn, device=device)
    b_scales = torch.ones(
        n // block_size, k // block_size, dtype=torch.float32, device=device
Use runtime scale layouts for FP8 GEMM warmup
On SM100 DeepSeek V4 FP8 block-scale layers, _use_deep_gemm_fp8 is only enabled after fp8.py converts weight scales to the UE8M0 DeepGEMM layout, and runtime activation scales are generated with scale_ue8m0=True before deep_gemm_mm_fp8_blockscale calls fp8_gemm_nt. This warmup bypasses that path and passes plain float32 row-major a_scales/b_scales directly to fp8_gemm_nt, so it exercises a different scale contract (or hits DeepGEMM assertions) instead of compiling the kernels serving will actually use.
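A minimal sketch of what a UE8M0 scale is, assuming the usual convention of an unsigned 8-bit exponent with bias 127 and scales rounded up to the next power of two. The helper names are hypothetical, and this does not reproduce DeepGEMM's tensor layout or packing; it only illustrates why plain float32 scales are a different contract from the one the runtime path produces.

```python
import math

UE8M0_BIAS = 127  # assumed exponent bias for the 8-bit exponent-only format


def to_ue8m0(scale: float) -> int:
    """Encode a positive scale as an exponent byte, rounding up to a power of two."""
    if not (scale > 0.0) or math.isinf(scale):
        raise ValueError("scale must be a positive finite number")
    # frexp returns scale = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    # ceil(log2(scale)): exact powers of two have mantissa == 0.5.
    log2_ceil = exponent - 1 if mantissa == 0.5 else exponent
    # Clamp to the representable range, leaving 255 unused.
    return min(max(log2_ceil + UE8M0_BIAS, 0), 254)


def from_ue8m0(byte: int) -> float:
    """Decode an exponent byte back to the power-of-two scale it represents."""
    return math.ldexp(1.0, byte - UE8M0_BIAS)


encoded = to_ue8m0(3.0)        # 3.0 rounds up to 4.0 = 2**2
decoded = from_ue8m0(encoded)
```

A float32 scale of 3.0 and its UE8M0 counterpart (4.0) are not interchangeable, which is one way a warmup that passes raw float32 scales can end up exercising a different kernel path than serving does.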
Force-pushed: 64378d8 to bceb897
The warmup request via smg ensures the worker is fully registered before evalscope sends concurrent requests. Without it, heavy prefill blocks health probes and smg marks the worker unavailable.

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Reviewed commit: 53ccbcbdf3
        logger.warning("deep_gemm mega_moe symbols unavailable, skipping warmup")
        return

    token_counts = _token_count_sweep(max_num_tokens)
Warm the intermediate MegaMoE token buckets
When this sweep is used for fp8_fp4_mega_moe, the raw num_tokens is not the M dimension that selects the expert tile: after routing it is roughly num_tokens * top_k / num_experts per expert. For common DeepSeek V4 settings like 256 experts and top_k=8 with an 8192-token buffer, _token_count_sweep() only drives raw counts up to 256 and then 8192, so it warms about 1..8 and 256 tokens per expert but skips the intermediate 16/32/64/128 buckets that are hit by 512/1024/2048/4096-token prefills. Those requests can still JIT the MegaMoE kernel under the EP barrier even though startup warmup ran.
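The arithmetic in this comment can be checked with a short sketch. It assumes uniform routing, so each expert sees roughly `num_tokens * top_k / num_experts` rows; the function names are hypothetical and not part of the PR.

```python
def tokens_per_expert(num_tokens: int, top_k: int, num_experts: int) -> int:
    """Approximate per-expert M under uniform routing, rounded up."""
    return (num_tokens * top_k + num_experts - 1) // num_experts


def raw_tokens_for_bucket(per_expert_m: int, top_k: int, num_experts: int) -> int:
    """Smallest raw token count that reaches a given per-expert M."""
    return (per_expert_m * num_experts + top_k - 1) // top_k


TOP_K, NUM_EXPERTS = 8, 256  # the settings quoted in the comment

# Prefill sizes the comment says are not covered by the raw sweep.
prefill_sizes = [512, 1024, 2048, 4096]
per_expert = [tokens_per_expert(n, TOP_K, NUM_EXPERTS) for n in prefill_sizes]

# Raw counts a sweep would need in order to hit those per-expert buckets.
needed_raw = [raw_tokens_for_bucket(m, TOP_K, NUM_EXPERTS) for m in per_expert]
```

Sweeping on the per-expert M and mapping back to raw token counts, rather than sweeping raw counts directly, is one way to make sure the intermediate 16/32/64/128 buckets are visited.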
The V4-Flash decode sparse indexer calls fp8_fp4_paged_mqa_logits and get_paged_mqa_logits_metadata, which compile distinct cubins from the ragged prefill fp8_fp4_mqa_logits and were never warmed at startup. They JIT-compiled inline on the first decode (the metadata kernel re-compiles per 32-aligned batch bucket), stalling the engine long enough that smg's gRPC health probe timed out and evicted the worker, which caused the flaky V4-Flash GSM8K eval 404s.

Add _warmup_fp8_fp4_paged_mqa_logits to warmup_prefill_jit, gated on the indexer's index_n_heads/index_head_dim (replicated, not TP-split), sweeping every 32-aligned decode batch bucket up to the runtime ceiling (max_cudagraph_capture_size / max_num_seqs) so the metadata kernel is fully covered. Runs before CUDA-graph capture, so no inline JIT remains on the decode hot path.

Verified: GSM8K 5-shot N=200 = 0.96; server logs "Warmed up 2 deep_gemm prefill kernel families" with the paged kernels resolved before serving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
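A minimal sketch of the 32-aligned bucket sweep described in this commit message. `align` and `decode_batch_buckets` are hypothetical names; the real warmup also has to build the kernel arguments for each bucket, which is omitted here.

```python
from typing import List


def align(n: int, multiple: int) -> int:
    """Round n up to the next multiple."""
    return (n + multiple - 1) // multiple * multiple


def decode_batch_buckets(max_decode_tokens: int, bucket: int = 32) -> List[int]:
    """Every bucket-aligned batch size up to the aligned runtime ceiling."""
    if max_decode_tokens < 1:
        raise ValueError("max_decode_tokens must be positive")
    ceiling = align(max_decode_tokens, bucket)
    return list(range(bucket, ceiling + 1, bucket))


# A ceiling of 100 decode tokens needs four metadata-kernel variants.
buckets = decode_batch_buckets(100)
```

Because the metadata kernel is keyed on the aligned batch size, warming one representative size per bucket is enough to cover every batch size that maps to it.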
Reviewed commit: 46b736d175
    if getattr(module, "_use_deep_gemm_fp8", False) and not getattr(
        module, "is_bmm", False
Add warmup for the BMM fp8_einsum path
When DeepSeek V4 attention uses the FP8 BMM output projection, wo_a.is_bmm is set and runtime calls deep_gemm.fp8_einsum in _project_attention_output, but this scanner explicitly drops those layers and there is no other warmup call for fp8_einsum in the new module. In that configuration the first real attention output projection still JIT-compiles on the serving path despite startup warmup, so the new warmup does not cover all DeepGEMM kernels it enables for V4.
Port vLLM's M-value enumeration (_optimal_warmup_m_values: wave-boundary + block_m-multiple M, unioned with 16-step block_m values) and apply it in warmup_fp8_gemm_nt, _warmup_tf32_hc_prenorm_gemm, and warmup_mega_moe_jit. The old 16-step-only sweep missed many (N, K, block_m) tiles that the eval prefills hit, so those kernels JIT-compiled inline during the eval, stalling the engine and tripping smg's gRPC health probe.

On a cold deep_gemm cache this cuts inline prefill compiles from 17 to 5 (the mega_moe grouped GEMM, mega_moe, and tf32 compressor families are now fully covered). Startup warmup is correspondingly slower (cold ~330s, within the 1800s readiness timeout). GSM8K 5-shot N=50 = 0.94 (warmup is numerics-neutral). The residual cold-start inline JIT (FP4 linear GEMM + ragged indexer) is a known follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
Reviewed commit: 2338d0f113
    post_quant_warmup = getattr(model, "post_quant_warmup", None)
    if callable(post_quant_warmup):
        post_quant_warmup()
Invoke DeepGEMM model warmup for sharded loads
When loading DeepSeek V4 from sharded-state checkpoints, this loader never calls model.load_weights(), and the only call sites for warmup_deep_gemm() / _warmup_prefill_jit() are inside DeepseekV4ForCausalLM.load_weights (checked with rg warmup_deep_gemm). This new hook only runs post_quant_warmup(), which warms FP8 linear GEMMs, so sharded loads still skip the new prefill and MegaMoE DeepGEMM startup warmups and can hit first-request JITs despite this commit’s warmup path.
Reviewed commit: 3672551709
    post_quant_warmup = getattr(model, "post_quant_warmup", None)
    if callable(post_quant_warmup):
        post_quant_warmup()
Run post-quant warmup on the target device
When CPU offloading is enabled, the preceding device_loading_context blocks intentionally move CPU parameters to target_device only for processing and then restore them to CPU before this new hook runs. post_quant_warmup() eventually uses next(model.parameters()).device to allocate the DeepGEMM warmup tensors, so an offloaded DeepSeek V4 FP8 load can try to invoke CUDA DeepGEMM with CPU tensors and fail during startup instead of just warming the kernels.
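One possible shape for a fix, sketched under assumptions: the loader passes its `target_device` to the hook explicitly instead of letting the hook infer a device from the model's parameters. The hook signature with a device argument, `OffloadedModel`, and `run_post_quant_warmup` are hypothetical; the hook in the PR takes no arguments. Devices are plain strings here so the sketch stays dependency-free.

```python
class OffloadedModel:
    """Hypothetical model whose parameters were restored to CPU after processing."""

    def __init__(self) -> None:
        self.param_device = "cpu"   # where next(model.parameters()) would point
        self.warmed_on = None

    def post_quant_warmup(self, device: str) -> None:
        # A real hook would allocate the DeepGEMM warmup tensors on `device`.
        self.warmed_on = device


def run_post_quant_warmup(model, target_device: str) -> bool:
    """Call the optional hook with the loader's target device; report whether it ran."""
    hook = getattr(model, "post_quant_warmup", None)
    if not callable(hook):
        return False
    hook(target_device)
    return True


model = OffloadedModel()
ran = run_post_quant_warmup(model, "cuda:0")
```

Passing the device in keeps the warmup independent of where offloaded parameters happen to live when the hook fires.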
Under MTP/speculative decoding the verify step flattens bs*num_draft_tokens into the decode indexer's num_tokens, so get_paged_mqa_logits_metadata (smxx_paged_mqa_logits_metadata, JIT-keyed on align(num_tokens, 32)) hits batch buckets up to align(max_num_seqs * num_draft_tokens, 32). The startup warmup ceiling lacked the speculative multiplier, so those buckets JIT-compiled inline on the first verify, stalling the engine and tripping smg's gRPC health probe (the flaky V4-Flash-MTP gsm8k eval).

Expose speculative_num_draft_tokens in global_server_args_dict and scale the warmup max_decode_tokens by it (gated on speculative_algorithm; no-op when speculation is off). The paged-logits cubin is unchanged: verify routes through the decode indexer with next_n=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
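The ceiling computation described here can be sketched as follows. The function and parameter names are hypothetical; only the formula align(max_num_seqs * num_draft_tokens, 32) and the gating on speculative_algorithm come from the commit message.

```python
from typing import Optional


def align(n: int, multiple: int) -> int:
    """Round n up to the next multiple."""
    return (n + multiple - 1) // multiple * multiple


def warmup_max_decode_tokens(
    max_num_seqs: int,
    speculative_algorithm: Optional[str] = None,
    num_draft_tokens: int = 1,
) -> int:
    """Aligned decode-token ceiling; scaled by draft tokens only when speculation is on."""
    multiplier = num_draft_tokens if speculative_algorithm else 1
    return align(max_num_seqs * multiplier, 32)


# Illustrative values: 50 sequences, 4 draft tokens per sequence.
plain_ceiling = warmup_max_decode_tokens(50)
mtp_ceiling = warmup_max_decode_tokens(50, speculative_algorithm="MTP", num_draft_tokens=4)
```

In this example the ceiling grows from 64 to 224, so a warmup that stops at the unscaled value would leave the buckets between them to compile on the first verify step.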
Summary
- Move deep_gemm JIT warmup into tokenspeed-kernel (warmup.py) and add warmup for prefill-path kernels (tf32_hc_prenorm_gemm, fp8_fp4_mqa_logits, fp8_gemm_nt) that were previously compiled on the first real request

Changes
| File | Change |
| --- | --- |
| tokenspeed-kernel/.../deep_gemm/warmup.py | warmup_mega_moe_jit(), warmup_prefill_jit(), warmup_fp8_gemm_nt(), warmup_fp8_gemm_nt_from_model() |
| python/.../models/deepseek_v4.py | post_quant_warmup hook |
| python/.../model_loader/loader.py | post_quant_warmup hook after quant weight processing |
| test/ci/eval/deepseek-v4-flash-*.yaml | |

Root cause analysis
deep_gemm compiles one cubin per unique (N, K, block_m) combination. The block_m tile is selected at runtime based on M. Existing warmup only covered mega_moe tiles; compressor/indexer/attention projection kernels had no warmup. The _token_count_sweep generates 21 M values (one per block_m step from 16 to 256) to trigger compilation of all tile variants in ~4 seconds.

Test plan
- eval-deepseek-v4-flash-gsm8k / b200-4gpu should pass without the manual warmup request

🤖 Generated with Claude Code