perf(deepseek-v4): pre-compile deep_gemm JIT kernels at startup by dongjiyingdjy · Pull Request #398 · lightseekorg/tokenspeed

dongjiyingdjy · 2026-06-09T08:41:39Z

Summary

- Move deep_gemm JIT warmup into tokenspeed-kernel (`warmup.py`) and add warmup for the prefill-path kernels (`tf32_hc_prenorm_gemm`, `fp8_fp4_mqa_logits`, `fp8_gemm_nt`) that were previously compiled on the first real request.
- On CI runners with a cold deep_gemm cache, the first prefill request triggered ~8 cubin compilations (~14s on B200, ~34s on CI), blocking the worker event loop and causing smg health probe timeouts → server marked dead → CI eval fails.
- All cubins are now compiled during server startup (~4s overhead), reducing first-request latency from 14.6s to 0.32s.
- Remove the 300-token warmup request from the V4-Flash CI eval configs (no longer needed).

Changes

| File | Change |
| --- | --- |
| `tokenspeed-kernel/.../deep_gemm/warmup.py` | New: `warmup_mega_moe_jit()`, `warmup_prefill_jit()`, `warmup_fp8_gemm_nt()`, `warmup_fp8_gemm_nt_from_model()` |
| `python/.../models/deepseek_v4.py` | Rewire warmup calls to tokenspeed-kernel; add `post_quant_warmup` hook |
| `python/.../model_loader/loader.py` | Generic `post_quant_warmup` hook after quant weight processing |
| `test/ci/eval/deepseek-v4-flash-*.yaml` | Remove manual 300-token warmup request |

Root cause analysis

deep_gemm compiles one cubin per unique (N, K, block_m) combination. The block_m tile is selected at runtime based on M. Existing warmup only covered mega_moe tiles; compressor/indexer/attention projection kernels had no warmup. The _token_count_sweep generates 21 M values (one per block_m step from 16 to 256) to trigger compilation of all tile variants in ~4 seconds.
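The tile-selection effect described above can be illustrated with a small sketch. Everything here is hypothetical: `token_count_sweep_sketch` is not the PR's `_token_count_sweep`, and `toy_block_m` is an assumed stand-in for deep_gemm's real tile heuristic. It only shows why sweeping M is needed to touch every `(N, K, block_m)` key for a fixed `(N, K)`.

```python
def token_count_sweep_sketch(max_num_tokens: int, step: int = 16,
                             ceiling: int = 256) -> list[int]:
    """Hypothetical M sweep: one value per `step` up to `ceiling`, plus the max."""
    values = set(range(step, min(ceiling, max_num_tokens) + 1, step))
    values.add(max_num_tokens)  # also cover the largest buffer the server allocates
    return sorted(values)


def toy_block_m(m: int) -> int:
    """Assumed heuristic, for illustration only: smallest listed tile covering M."""
    for tile in (16, 32, 64, 128, 256):
        if m <= tile:
            return tile
    return 256


def cubin_keys(n: int, k: int, m_values: list[int]) -> set[tuple[int, int, int]]:
    """Distinct (N, K, block_m) keys a sweep would touch under the toy heuristic."""
    return {(n, k, toy_block_m(m)) for m in m_values}


sweep = token_count_sweep_sketch(8192)
keys = cubin_keys(4096, 7168, sweep)
```

A single warmup call at one M would touch only one of these keys; the sweep touches all five tile variants of the toy heuristic.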

Test plan

Cold deep_gemm cache → server startup compiles 151 cubins → first 300-token request: 0.32s, 0 new cubins
GSM8K 50-sample eval: 0.94 accuracy (V4-Flash EP=4)
Code review: 4 findings fixed (env var scope, missing loader hooks, dead code, NameError guard)
CI: eval-deepseek-v4-flash-gsm8k / b200-4gpu should pass without the manual warmup request

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9fa0e4959

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T08:44:32Z

+                    "head_dim": head_dim,
+                    "kv_dim": kv_lora_rank,
+                    "kv_scale_dim": kv_lora_rank // mxfp4_block_size,


Use indexer dimensions for MQA logits warmup

For the DeepSeek V4 prefill indexer, these shapes are derived from the attention MLA dimensions (head_dim/kv_lora_rank) instead of the indexer dimensions: DeepseekV4Indexer builds q from config.index_n_heads * config.index_head_dim and the MXFP4 cache stores packed value/scale bytes from index_head_dim. On V4 Flash those differ from the MLA values (e.g. 128 index head dim vs 512 head/kv rank), so this warms/compiles the wrong fp8_fp4_mqa_logits shapes and the first real prefill can still JIT after the CI warmup request was removed.
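The mismatch the review describes can be made concrete with a short sketch. The 128 and 512 values come from the comment above; the MXFP4 block size of 32 and the helper name `mqa_logits_shapes` are assumptions for illustration, not values read from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MqaLogitsShapes:
    head_dim: int
    kv_dim: int
    kv_scale_dim: int


def mqa_logits_shapes(head_dim: int, mxfp4_block_size: int) -> MqaLogitsShapes:
    """Derive warmup shapes from one head dimension (hypothetical helper)."""
    return MqaLogitsShapes(
        head_dim=head_dim,
        kv_dim=head_dim,
        kv_scale_dim=head_dim // mxfp4_block_size,
    )


MXFP4_BLOCK_SIZE = 32  # assumption: one scale per 32 packed values

from_mla = mqa_logits_shapes(512, MXFP4_BLOCK_SIZE)      # attention MLA dims
from_indexer = mqa_logits_shapes(128, MXFP4_BLOCK_SIZE)  # indexer dims
```

Because the two shape sets differ, a kernel warmed with `from_mla` would not match what the indexer requests at runtime.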

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 573e4b3f9a

chatgpt-codex-connector · 2026-06-09T21:34:17Z

+            q_scales = torch.ones(
+                num_tokens,
+                num_heads,
+                dtype=torch.float32,
+                device=device,


Mirror the FP4 MQA runtime argument contract

This synthetic prefill call does not produce the same DeepGEMM-ready tensors that the runtime passes from deepseek_v4_prepare_indexer_q_mxfp4: the FP4 q scale tensor here is float32, while the FP4 MQA API expects the packed scale tensor (int32), and the later weights/cu_*/clean_logits arguments are also shaped or flagged differently from the runtime call. On DeepSeek V4 Flash this warmup will hit DeepGEMM host assertions before compiling the indexer kernel, so startup fails (or the removed CI warmup request no longer covers this path) instead of warming the JIT.

Move deep_gemm JIT warmup into tokenspeed-kernel and add warmup for prefill-path kernels (tf32_hc_prenorm_gemm, fp8_fp4_mqa_logits, fp8_gemm_nt) that were previously compiled on the first real request.

On CI runners with cold deep_gemm cache, the first prefill request triggered ~8 cubin compilations (~14s on B200, ~34s on CI), blocking the worker event loop and causing smg health probe timeouts. Now all cubins are compiled during server startup (~4s overhead), reducing first-request latency from 14.6s to 0.32s.

Changes:
- New warmup.py in tokenspeed-kernel with warmup_mega_moe_jit(), warmup_prefill_jit(), warmup_fp8_gemm_nt(), and warmup_fp8_gemm_nt_from_model()
- Generic post_quant_warmup hook in model loader for fp8_gemm_nt warmup after quant weight processing
- Remove 300-token warmup request from V4-Flash CI eval configs (no longer needed)

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 64378d8508

chatgpt-codex-connector · 2026-06-10T14:06:27Z

+        a = torch.zeros(max_tokens, k, dtype=torch.float8_e4m3fn, device=device)
+        a_scales = torch.ones(
+            max_tokens, k // block_size, dtype=torch.float32, device=device
+        )
+        b = torch.zeros(n, k, dtype=torch.float8_e4m3fn, device=device)
+        b_scales = torch.ones(
+            n // block_size, k // block_size, dtype=torch.float32, device=device


Use runtime scale layouts for FP8 GEMM warmup

On SM100 DeepSeek V4 FP8 block-scale layers, _use_deep_gemm_fp8 is only enabled after fp8.py converts weight scales to the UE8M0 DeepGEMM layout, and runtime activation scales are generated with scale_ue8m0=True before deep_gemm_mm_fp8_blockscale calls fp8_gemm_nt. This warmup bypasses that path and passes plain float32 row-major a_scales/b_scales directly to fp8_gemm_nt, so it exercises a different scale contract (or hits DeepGEMM assertions) instead of compiling the kernels serving will actually use.
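To illustrate why a plain float32 scale and a UE8M0 scale are different contracts, here is a minimal pure-Python sketch of an 8-bit exponent-only encoding. It assumes the usual E8M0 convention (bias 127, code 255 reserved) and assumes scales are rounded up to the next power of two; the function names are hypothetical and this is not DeepGEMM's packing code, which also arranges the bytes in its own layout.

```python
import math


def to_ue8m0(scale: float) -> int:
    """Encode a positive scale as a biased power-of-two exponent (rounded up)."""
    if not (scale > 0.0 and math.isfinite(scale)):
        raise ValueError("scale must be a positive finite float")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    # mantissa is in [0.5, 1); exactly 0.5 means scale is already a power of two.
    ceil_log2 = exponent - 1 if mantissa == 0.5 else exponent
    return max(0, min(254, ceil_log2 + 127))  # clamp; 255 is assumed reserved


def from_ue8m0(code: int) -> float:
    """Decode the biased exponent back to a float scale."""
    return 2.0 ** (code - 127)


code = to_ue8m0(3.0)
decoded = from_ue8m0(code)
```

A float32 scale of 3.0 and its UE8M0 round trip (4.0) are not the same value, which is one reason warming with float32 scales does not exercise the same path as serving.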

The warmup request via smg ensures the worker is fully registered before evalscope sends concurrent requests. Without it, heavy prefill blocks health probes and smg marks the worker unavailable.

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 53ccbcbdf3

chatgpt-codex-connector · 2026-06-10T18:50:33Z

+        logger.warning("deep_gemm mega_moe symbols unavailable, skipping warmup")
+        return
+
+    token_counts = _token_count_sweep(max_num_tokens)


Warm the intermediate MegaMoE token buckets

When this sweep is used for fp8_fp4_mega_moe, the raw num_tokens is not the M dimension that selects the expert tile: after routing it is roughly num_tokens * top_k / num_experts per expert. For common DeepSeek V4 settings like 256 experts and top_k=8 with an 8192-token buffer, _token_count_sweep() only drives raw counts up to 256 and then 8192, so it warms about 1..8 and 256 tokens per expert but skips the intermediate 16/32/64/128 buckets that are hit by 512/1024/2048/4096-token prefills. Those requests can still JIT the MegaMoE kernel under the EP barrier even though startup warmup ran.
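The arithmetic in this comment can be checked with a few lines. The sketch uses the review's own figures (256 experts, top_k=8) and the approximation it states, tokens per expert ≈ num_tokens * top_k / num_experts; the function names are hypothetical.

```python
def tokens_per_expert(num_tokens: int, top_k: int, num_experts: int) -> float:
    """Approximate per-expert M after routing, assuming a uniform spread."""
    return num_tokens * top_k / num_experts


def raw_tokens_for_bucket(per_expert: int, top_k: int, num_experts: int) -> int:
    """Raw token count that lands on a given per-expert bucket (same assumption)."""
    return per_expert * num_experts // top_k


TOP_K, NUM_EXPERTS = 8, 256

per_expert = {raw: tokens_per_expert(raw, TOP_K, NUM_EXPERTS)
              for raw in (256, 512, 1024, 2048, 4096, 8192)}
missing_raw = [raw_tokens_for_bucket(b, TOP_K, NUM_EXPERTS)
               for b in (16, 32, 64, 128)]
```

Under these numbers a raw count of 256 reaches only 8 tokens per expert, so the intermediate buckets require raw counts of 512, 1024, 2048, and 4096 to be included in the sweep.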

The V4-Flash decode sparse indexer calls fp8_fp4_paged_mqa_logits and get_paged_mqa_logits_metadata, which compile distinct cubins from the ragged prefill fp8_fp4_mqa_logits and were never warmed at startup. They JIT-compiled inline on the first decode (the metadata kernel re-compiles per 32-aligned batch bucket), stalling the engine long enough that smg's gRPC health probe timed out and evicted the worker -- the flaky V4-Flash GSM8K eval 404s.

Add _warmup_fp8_fp4_paged_mqa_logits to warmup_prefill_jit, gated on the indexer's index_n_heads/index_head_dim (replicated, not TP-split), sweeping every 32-aligned decode batch bucket up to the runtime ceiling (max_cudagraph_capture_size / max_num_seqs) so the metadata kernel is fully covered. Runs before CUDA-graph capture, so no inline JIT remains on the decode hot path.

Verified: GSM8K 5-shot N=200 = 0.96; server logs "Warmed up 2 deep_gemm prefill kernel families" with the paged kernels resolved before serving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
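A minimal sketch of the bucket sweep this commit message describes, assuming the metadata kernel is keyed on the token count rounded up to a multiple of 32. `align` and `decode_batch_buckets` are hypothetical helpers, not the PR's functions.

```python
def align(value: int, multiple: int) -> int:
    """Round `value` up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def decode_batch_buckets(max_decode_tokens: int, bucket: int = 32) -> list[int]:
    """Every `bucket`-aligned size up to the aligned runtime ceiling."""
    return list(range(bucket, align(max_decode_tokens, bucket) + 1, bucket))


buckets = decode_batch_buckets(100)
```

Warming one batch size per entry in `buckets` would cover each distinct key that a decode batch of up to 100 tokens can produce under this assumption.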

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 46b736d175

chatgpt-codex-connector · 2026-06-11T04:24:50Z

+        if getattr(module, "_use_deep_gemm_fp8", False) and not getattr(
+            module, "is_bmm", False


Add warmup for the BMM fp8_einsum path

When DeepSeek V4 attention uses the FP8 BMM output projection, wo_a.is_bmm is set and runtime calls deep_gemm.fp8_einsum in _project_attention_output, but this scanner explicitly drops those layers and there is no other warmup call for fp8_einsum in the new module. In that configuration the first real attention output projection still JIT-compiles on the serving path despite startup warmup, so the new warmup does not cover all DeepGEMM kernels it enables for V4.
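One way to avoid silently dropping the BMM layers is to have the scanner return both groups so each can be routed to its own warmup. The sketch below is hypothetical: only the `_use_deep_gemm_fp8` and `is_bmm` attribute names come from the diff, and the module objects are plain stand-ins rather than real layers.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def split_deep_gemm_modules(
    named_modules: Iterable[tuple[str, Any]],
) -> tuple[list[str], list[str]]:
    """Partition DeepGEMM FP8 layers into (gemm_names, bmm_names)."""
    gemm_names: list[str] = []
    bmm_names: list[str] = []
    for name, module in named_modules:
        if not getattr(module, "_use_deep_gemm_fp8", False):
            continue  # layer does not use the DeepGEMM FP8 path at all
        if getattr(module, "is_bmm", False):
            bmm_names.append(name)   # would need an fp8_einsum-style warmup
        else:
            gemm_names.append(name)  # covered by the fp8_gemm_nt warmup
    return gemm_names, bmm_names


toy_modules = [
    ("attn.wq", SimpleNamespace(_use_deep_gemm_fp8=True)),
    ("attn.wo_a", SimpleNamespace(_use_deep_gemm_fp8=True, is_bmm=True)),
    ("norm", SimpleNamespace()),
]
gemm_layers, bmm_layers = split_deep_gemm_modules(toy_modules)
```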

Port vLLM's M-value enumeration (_optimal_warmup_m_values: wave-boundary + block_m-multiple M, unioned with 16-step block_m values) and apply it in warmup_fp8_gemm_nt, _warmup_tf32_hc_prenorm_gemm, and warmup_mega_moe_jit. The old 16-step-only sweep missed many (N, K, block_m) tiles the eval prefills hit, so those kernels JIT-compiled inline during the eval -- stalling the engine and tripping smg's gRPC health probe.

On a cold deep_gemm cache this cuts inline prefill compiles from 17 to 5 (the mega_moe grouped GEMM, mega_moe, and tf32 compressor families are now fully covered). Startup warmup is correspondingly slower (cold ~330s, within the 1800s readiness timeout). GSM8K 5-shot N=50 = 0.94 (warmup is numerics-neutral). The residual cold-start inline JIT (FP4 linear GEMM + ragged indexer) is a known follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
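The three ingredients named in this commit message (wave-boundary M values, block_m multiples, and a 16-step sweep) can be combined roughly as follows. This is an illustrative sketch only: it is not vLLM's `_optimal_warmup_m_values` nor the code in this PR, and the tile sizes, fixed `block_n`, wave count, and SM count are all assumptions.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def warmup_m_values_sketch(max_tokens: int, n: int, num_sms: int,
                           block_ms: tuple[int, ...] = (64, 128, 256),
                           block_n: int = 128, max_waves: int = 4,
                           step: int = 16, step_ceiling: int = 256) -> list[int]:
    """Union of a fixed-step sweep, block_m multiples, and wave-boundary M values."""
    # 1) Fixed-step sweep over small M.
    values = set(range(step, min(step_ceiling, max_tokens) + 1, step))
    n_blocks = ceil_div(n, block_n)  # tiles along N for the assumed block_n
    for block_m in block_ms:
        # 2) Every multiple of this tile height.
        values.update(range(block_m, max_tokens + 1, block_m))
        # 3) M at which the tile grid fills `wave` full waves of SMs.
        for wave in range(1, max_waves + 1):
            m = wave * num_sms * block_m // n_blocks
            if 1 <= m <= max_tokens:
                values.add(m)
    return sorted(values)


m_values = warmup_m_values_sketch(max_tokens=1024, n=4096, num_sms=148)
```

The union is denser than a 16-step sweep alone, which is consistent with the longer startup warmup reported above.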

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 2338d0f113

chatgpt-codex-connector · 2026-06-11T06:28:51Z

+        post_quant_warmup = getattr(model, "post_quant_warmup", None)
+        if callable(post_quant_warmup):
+            post_quant_warmup()


Invoke DeepGEMM model warmup for sharded loads

When loading DeepSeek V4 from sharded-state checkpoints, this loader never calls model.load_weights(), and the only call sites for warmup_deep_gemm() / _warmup_prefill_jit() are inside DeepseekV4ForCausalLM.load_weights (checked with rg warmup_deep_gemm). This new hook only runs post_quant_warmup(), which warms FP8 linear GEMMs, so sharded loads still skip the new prefill and MegaMoE DeepGEMM startup warmups and can hit first-request JITs despite this commit’s warmup path.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 3672551709

chatgpt-codex-connector · 2026-06-11T07:56:57Z


+            post_quant_warmup = getattr(model, "post_quant_warmup", None)
+            if callable(post_quant_warmup):
+                post_quant_warmup()


Run post-quant warmup on the target device

When CPU offloading is enabled, the preceding device_loading_context blocks intentionally move CPU parameters to target_device only for processing and then restore them to CPU before this new hook runs. post_quant_warmup() eventually uses next(model.parameters()).device to allocate the DeepGEMM warmup tensors, so an offloaded DeepSeek V4 FP8 load can try to invoke CUDA DeepGEMM with CPU tensors and fail during startup instead of just warming the kernels.
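One defensive option for the situation described here is to check where the parameters live before invoking the hook. The sketch below is hypothetical and uses plain strings for devices so it runs without a GPU; skipping is only one possible policy, and running the hook while the parameters are still on the target device is another.

```python
from typing import Any, Callable


def run_post_quant_warmup(model: Any, target_device: str,
                          device_of: Callable[[Any], str]) -> bool:
    """Run model.post_quant_warmup() only if the model sits on target_device.

    Returns True when the hook ran, False when it was absent or skipped.
    """
    hook = getattr(model, "post_quant_warmup", None)
    if not callable(hook):
        return False
    if device_of(model) != target_device:
        # Parameters were restored to CPU (offloading); skip instead of
        # handing CPU tensors to a CUDA-only kernel.
        return False
    hook()
    return True


class ToyModel:
    """Stand-in for a loaded model; `device` is a plain string here."""

    def __init__(self, device: str) -> None:
        self.device = device
        self.warmed = False

    def post_quant_warmup(self) -> None:
        self.warmed = True


on_gpu, offloaded = ToyModel("cuda:0"), ToyModel("cpu")
ran_gpu = run_post_quant_warmup(on_gpu, "cuda:0", lambda m: m.device)
ran_cpu = run_post_quant_warmup(offloaded, "cuda:0", lambda m: m.device)
```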

Under MTP/speculative decoding the verify step flattens bs*num_draft_tokens into the decode indexer's num_tokens, so get_paged_mqa_logits_metadata (smxx_paged_mqa_logits_metadata, JIT-keyed on align(num_tokens, 32)) hits batch buckets up to align(max_num_seqs * num_draft_tokens, 32). The startup warmup ceiling lacked the speculative multiplier, so those buckets JIT-compiled inline on the first verify -- stalling the engine and tripping smg's gRPC health probe (the V4-Flash-MTP gsm8k flaky).

Expose speculative_num_draft_tokens in global_server_args_dict and scale the warmup max_decode_tokens by it (gated on speculative_algorithm; no-op when speculation is off). The paged-logits cubin is unchanged -- verify routes through the decode indexer with next_n=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
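The ceiling computation described in this commit message, align(max_num_seqs * num_draft_tokens, 32) with the multiplier applied only when speculation is on, can be sketched as below. The function and parameter names are hypothetical; only the formula and the gating on the speculative algorithm come from the text.

```python
from typing import Optional


def align(value: int, multiple: int) -> int:
    """Round `value` up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def warmup_max_decode_tokens(max_num_seqs: int,
                             speculative_algorithm: Optional[str],
                             num_draft_tokens: Optional[int]) -> int:
    """Warmup ceiling for the decode indexer, scaled by the draft-token count."""
    # No-op multiplier when speculation is off or the count is unset.
    multiplier = (num_draft_tokens or 1) if speculative_algorithm else 1
    return align(max_num_seqs * multiplier, 32)


plain = warmup_max_decode_tokens(64, None, 4)
with_mtp = warmup_max_decode_tokens(64, "MTP", 4)
```

With 64 sequences and 4 draft tokens the ceiling grows from 64 to 256, which is the range of buckets the earlier warmup left uncovered.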

dongjiyingdjy requested a review from a team as a code owner June 9, 2026 08:41

chatgpt-codex-connector Bot reviewed Jun 9, 2026

chatgpt-codex-connector Bot reviewed Jun 10, 2026

dongjiyingdjy force-pushed the perf/deep-gemm-jit-warmup branch from 64378d8 to bceb897 Compare June 10, 2026 14:06

dongjiyingdjy and others added 2 commits June 10, 2026 15:03

Merge branch 'main' into perf/deep-gemm-jit-warmup

53ccbcb

chatgpt-codex-connector Bot reviewed Jun 10, 2026

chatgpt-codex-connector Bot reviewed Jun 11, 2026

Merge branch 'main' into perf/deep-gemm-jit-warmup

3672551

chatgpt-codex-connector Bot reviewed Jun 11, 2026

lightseek-bot approved these changes Jun 11, 2026

lightseek-bot merged commit f13b16b into main Jun 11, 2026
62 of 70 checks passed

lightseek-bot deleted the perf/deep-gemm-jit-warmup branch June 11, 2026 21:57

dongjiyingdjy mentioned this pull request Jun 12, 2026

perf(deepseek-v4): dense deep_gemm warmup M-sweep + fp8_einsum coverage #427

Merged
