
perf(deepseek-v4): pre-compile deep_gemm JIT kernels at startup #398

Merged
lightseek-bot merged 7 commits into main from perf/deep-gemm-jit-warmup on Jun 11, 2026

Conversation

@dongjiyingdjy

Contributor

Summary

  • Move deep_gemm JIT warmup into tokenspeed-kernel (warmup.py) and add warmup for prefill-path kernels (tf32_hc_prenorm_gemm, fp8_fp4_mqa_logits, fp8_gemm_nt) that were previously compiled on the first real request
  • On CI runners with cold deep_gemm cache, the first prefill request triggered ~8 cubin compilations (~14s on B200, ~34s on CI), blocking the worker event loop and causing smg health probe timeouts → server marked dead → CI eval fails
  • Now all cubins are compiled during server startup (~4s overhead), reducing first-request latency from 14.6s to 0.32s
  • Remove 300-token warmup request from V4-Flash CI eval configs (no longer needed)

Changes

| File | Change |
| --- | --- |
| tokenspeed-kernel/.../deep_gemm/warmup.py | New: warmup_mega_moe_jit(), warmup_prefill_jit(), warmup_fp8_gemm_nt(), warmup_fp8_gemm_nt_from_model() |
| python/.../models/deepseek_v4.py | Rewire warmup calls to tokenspeed-kernel; add post_quant_warmup hook |
| python/.../model_loader/loader.py | Generic post_quant_warmup hook after quant weight processing |
| test/ci/eval/deepseek-v4-flash-*.yaml | Remove manual 300-token warmup request |

Root cause analysis

deep_gemm compiles one cubin per unique (N, K, block_m) combination. The block_m tile is selected at runtime based on M. Existing warmup only covered mega_moe tiles; compressor/indexer/attention projection kernels had no warmup. The _token_count_sweep generates 21 M values (one per block_m step from 16 to 256) to trigger compilation of all tile variants in ~4 seconds.
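The idea of sweeping M to touch every block_m tile can be sketched as below. This is a minimal illustration, not the PR's `_token_count_sweep`: the function name, the step of 16, and the cap of 256 are assumptions read off the "block_m step from 16 to 256" wording, and the real helper's exact value list is not shown in this thread.

```python
def sketch_token_count_sweep(
    max_num_tokens: int, step: int = 16, cap: int = 256
) -> list[int]:
    """Hypothetical sweep: one M per block_m step up to `cap`, plus the buffer maximum."""
    # One representative M for every multiple of `step` that fits in the buffer.
    values = {m for m in range(step, cap + 1, step) if m <= max_num_tokens}
    # Always exercise the largest shape the server can see.
    values.add(max_num_tokens)
    return sorted(values)


if __name__ == "__main__":
    print(sketch_token_count_sweep(8192))
```

Each returned M would be passed to the kernel once at startup so that the corresponding (N, K, block_m) cubin is already cached when real traffic arrives.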

Test plan

  • Cold deep_gemm cache → server startup compiles 151 cubins → first 300-token request: 0.32s, 0 new cubins
  • GSM8K 50-sample eval: 0.94 accuracy (V4-Flash EP=4)
  • Code review: 4 findings fixed (env var scope, missing loader hooks, dead code, NameError guard)
  • CI: eval-deepseek-v4-flash-gsm8k / b200-4gpu should pass without the manual warmup request

🤖 Generated with Claude Code

@dongjiyingdjy dongjiyingdjy requested a review from a team as a code owner June 9, 2026 08:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9fa0e4959

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +169 to +171
"head_dim": head_dim,
"kv_dim": kv_lora_rank,
"kv_scale_dim": kv_lora_rank // mxfp4_block_size,


[P1] Use indexer dimensions for MQA logits warmup

For the DeepSeek V4 prefill indexer, these shapes are derived from the attention MLA dimensions (head_dim/kv_lora_rank) instead of the indexer dimensions: DeepseekV4Indexer builds q from config.index_n_heads * config.index_head_dim and the MXFP4 cache stores packed value/scale bytes from index_head_dim. On V4 Flash those differ from the MLA values (e.g. 128 index head dim vs 512 head/kv rank), so this warms/compiles the wrong fp8_fp4_mqa_logits shapes and the first real prefill can still JIT after the CI warmup request was removed.
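A rough sketch of the suggested fix, deriving the warmup shapes from the indexer fields rather than the MLA fields. `SketchConfig` and `mqa_logits_warmup_shapes` are hypothetical names, the `index_n_heads` value of 64 is a placeholder, and the MXFP4 block size of 32 is an assumption; whether the runtime further halves `kv_dim` for FP4 packing is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical subset of the model config used by the warmup."""
    head_dim: int        # MLA head dim (not used for the indexer)
    kv_lora_rank: int    # MLA KV rank (not used for the indexer)
    index_n_heads: int   # indexer head count
    index_head_dim: int  # indexer head dim


def mqa_logits_warmup_shapes(cfg: SketchConfig, mxfp4_block_size: int = 32) -> dict:
    """Build fp8_fp4_mqa_logits warmup shapes from the indexer dimensions."""
    return {
        "num_heads": cfg.index_n_heads,
        "head_dim": cfg.index_head_dim,
        "kv_dim": cfg.index_head_dim,
        "kv_scale_dim": cfg.index_head_dim // mxfp4_block_size,
    }


if __name__ == "__main__":
    # 128 vs 512 are the values quoted in the review; 64 heads is a placeholder.
    cfg = SketchConfig(head_dim=512, kv_lora_rank=512, index_n_heads=64, index_head_dim=128)
    print(mqa_logits_warmup_shapes(cfg))
```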


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 573e4b3f9a


Comment on lines +274 to +278
q_scales = torch.ones(
    num_tokens,
    num_heads,
    dtype=torch.float32,
    device=device,


[P1] Mirror the FP4 MQA runtime argument contract

This synthetic prefill call does not produce the same DeepGEMM-ready tensors that the runtime passes from deepseek_v4_prepare_indexer_q_mxfp4: the FP4 q scale tensor here is float32, while the FP4 MQA API expects the packed scale tensor (int32), and the later weights/cu_*/clean_logits arguments are also shaped or flagged differently from the runtime call. On DeepSeek V4 Flash this warmup will hit DeepGEMM host assertions before compiling the indexer kernel, so startup fails (or the removed CI warmup request no longer covers this path) instead of warming the JIT.


Move deep_gemm JIT warmup into tokenspeed-kernel and add warmup for
prefill-path kernels (tf32_hc_prenorm_gemm, fp8_fp4_mqa_logits,
fp8_gemm_nt) that were previously compiled on the first real request.

On CI runners with cold deep_gemm cache, the first prefill request
triggered ~8 cubin compilations (~14s on B200, ~34s on CI), blocking
the worker event loop and causing smg health probe timeouts. Now all
cubins are compiled during server startup (~4s overhead), reducing
first-request latency from 14.6s to 0.32s.

Changes:
- New warmup.py in tokenspeed-kernel with warmup_mega_moe_jit(),
  warmup_prefill_jit(), warmup_fp8_gemm_nt(), and
  warmup_fp8_gemm_nt_from_model()
- Generic post_quant_warmup hook in model loader for fp8_gemm_nt
  warmup after quant weight processing
- Remove 300-token warmup request from V4-Flash CI eval configs
  (no longer needed)

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 64378d8508


Comment on lines +353 to +359
a = torch.zeros(max_tokens, k, dtype=torch.float8_e4m3fn, device=device)
a_scales = torch.ones(
    max_tokens, k // block_size, dtype=torch.float32, device=device
)
b = torch.zeros(n, k, dtype=torch.float8_e4m3fn, device=device)
b_scales = torch.ones(
    n // block_size, k // block_size, dtype=torch.float32, device=device


[P1] Use runtime scale layouts for FP8 GEMM warmup

On SM100 DeepSeek V4 FP8 block-scale layers, _use_deep_gemm_fp8 is only enabled after fp8.py converts weight scales to the UE8M0 DeepGEMM layout, and runtime activation scales are generated with scale_ue8m0=True before deep_gemm_mm_fp8_blockscale calls fp8_gemm_nt. This warmup bypasses that path and passes plain float32 row-major a_scales/b_scales directly to fp8_gemm_nt, so it exercises a different scale contract (or hits DeepGEMM assertions) instead of compiling the kernels serving will actually use.
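To make the difference between a plain float32 scale and a UE8M0 scale concrete, here is a small sketch of the encoding itself: a power-of-two scale stored as a biased 8-bit exponent (bias 127, with 255 reserved for NaN). The helper names are hypothetical, rounding up is an assumption, and this does not reproduce DeepGEMM's actual packed or transposed tensor layout, which is not shown in this thread.

```python
import math


def to_ue8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its biased exponent byte."""
    if not math.isfinite(scale) or scale <= 0:
        raise ValueError("scale must be a positive finite number")
    # scale == mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(scale)
    # ceil(log2(scale)): exact powers of two have mantissa == 0.5
    log2_ceil = exponent - 1 if mantissa == 0.5 else exponent
    biased = log2_ceil + 127
    if not 0 <= biased <= 254:  # 255 encodes NaN in E8M0
        raise ValueError("scale out of UE8M0 range")
    return biased


def from_ue8m0(byte: int) -> float:
    """Decode a UE8M0 byte back to the power-of-two scale it represents."""
    return 2.0 ** (byte - 127)


if __name__ == "__main__":
    for s in (0.5, 1.0, 3.0):
        b = to_ue8m0(s)
        print(s, "->", b, "->", from_ue8m0(b))
```

A warmup that feeds float32 scales where the runtime feeds this kind of exponent-only scale is exercising a different argument contract, which is the concern raised above.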


@dongjiyingdjy dongjiyingdjy force-pushed the perf/deep-gemm-jit-warmup branch from 64378d8 to bceb897 Compare June 10, 2026 14:06
dongjiyingdjy and others added 2 commits June 10, 2026 15:03
The warmup request via smg ensures the worker is fully registered
before evalscope sends concurrent requests. Without it, heavy
prefill blocks health probes and smg marks the worker unavailable.

Signed-off-by: jiyingd <jiyingd@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 53ccbcbdf3


logger.warning("deep_gemm mega_moe symbols unavailable, skipping warmup")
return

token_counts = _token_count_sweep(max_num_tokens)


[P1] Warm the intermediate MegaMoE token buckets

When this sweep is used for fp8_fp4_mega_moe, the raw num_tokens is not the M dimension that selects the expert tile: after routing it is roughly num_tokens * top_k / num_experts per expert. For common DeepSeek V4 settings like 256 experts and top_k=8 with an 8192-token buffer, _token_count_sweep() only drives raw counts up to 256 and then 8192, so it warms about 1..8 and 256 tokens per expert but skips the intermediate 16/32/64/128 buckets that are hit by 512/1024/2048/4096-token prefills. Those requests can still JIT the MegaMoE kernel under the EP barrier even though startup warmup ran.
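The arithmetic in this comment can be checked with a few lines. The sketch assumes perfectly uniform routing (real routing is uneven) and uses the 256-expert, top_k=8, 8192-token figures quoted above; the function names are illustrative only.

```python
def tokens_per_expert(num_tokens: int, top_k: int = 8, num_experts: int = 256) -> float:
    """Average tokens routed to each expert, assuming uniform routing."""
    return num_tokens * top_k / num_experts


def unwarmed_prefills(prefill_sizes, sweep_cap: int = 256, max_num_tokens: int = 8192):
    """Prefill sizes whose raw token count a sweep of 1..sweep_cap plus max never drives."""
    return [n for n in prefill_sizes if n > sweep_cap and n != max_num_tokens]


if __name__ == "__main__":
    sizes = [256, 512, 1024, 2048, 4096, 8192]
    for n in sizes:
        print(f"{n:>5} tokens -> ~{tokens_per_expert(n):g} per expert")
    print("not warmed:", unwarmed_prefills(sizes))
```

Under these assumptions the 512, 1024, 2048, and 4096-token prefills land on roughly 16, 32, 64, and 128 tokens per expert, which are exactly the buckets the comment says are skipped.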


The V4-Flash decode sparse indexer calls fp8_fp4_paged_mqa_logits and
get_paged_mqa_logits_metadata, which compile distinct cubins from the ragged
prefill fp8_fp4_mqa_logits and were never warmed at startup. They JIT-compiled
inline on the first decode (the metadata kernel re-compiles per 32-aligned
batch bucket), stalling the engine long enough that smg's gRPC health probe
timed out and evicted the worker -- the cause of the flaky V4-Flash GSM8K eval 404s.

Add _warmup_fp8_fp4_paged_mqa_logits to warmup_prefill_jit, gated on the
indexer's index_n_heads/index_head_dim (replicated, not TP-split), sweeping
every 32-aligned decode batch bucket up to the runtime ceiling
(max_cudagraph_capture_size / max_num_seqs) so the metadata kernel is fully
covered. Runs before CUDA-graph capture, so no inline JIT remains on the
decode hot path.

Verified: GSM8K 5-shot N=200 = 0.96; server logs "Warmed up 2 deep_gemm
prefill kernel families" with the paged kernels resolved before serving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
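The bucket sweep described in this commit message can be sketched as follows. This is an illustration under assumptions, not the code in `_warmup_fp8_fp4_paged_mqa_logits`: `align` and `decode_batch_buckets` are hypothetical helpers, and only the 32-token alignment is taken from the message.

```python
def align(value: int, multiple: int = 32) -> int:
    """Round `value` up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def decode_batch_buckets(max_decode_tokens: int, multiple: int = 32) -> list[int]:
    """Every aligned batch bucket from `multiple` up to the aligned ceiling."""
    return list(range(multiple, align(max_decode_tokens, multiple) + 1, multiple))


if __name__ == "__main__":
    # Each bucket would be run once at startup so the metadata kernel is compiled for it.
    print(decode_batch_buckets(100))
```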

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 46b736d175


Comment on lines +509 to +510
if getattr(module, "_use_deep_gemm_fp8", False) and not getattr(
    module, "is_bmm", False


[P2] Add warmup for the BMM fp8_einsum path

When DeepSeek V4 attention uses the FP8 BMM output projection, wo_a.is_bmm is set and runtime calls deep_gemm.fp8_einsum in _project_attention_output, but this scanner explicitly drops those layers and there is no other warmup call for fp8_einsum in the new module. In that configuration the first real attention output projection still JIT-compiles on the serving path despite startup warmup, so the new warmup does not cover all DeepGEMM kernels it enables for V4.
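One way to address this would be for the scanner to route BMM layers to a separate warmup list instead of dropping them. The sketch below only shows that partitioning; `split_deep_gemm_modules` is a hypothetical name, the module names other than `wo_a` are placeholders, and the actual `fp8_einsum` warmup call is deliberately left out because its arguments are not shown in this thread.

```python
from types import SimpleNamespace


def split_deep_gemm_modules(named_modules):
    """Partition DeepGEMM FP8 modules into plain-GEMM and BMM (einsum) warmup lists."""
    gemm_names, einsum_names = [], []
    for name, module in named_modules:
        if not getattr(module, "_use_deep_gemm_fp8", False):
            continue  # not served by DeepGEMM, nothing to warm
        if getattr(module, "is_bmm", False):
            einsum_names.append(name)  # would need an fp8_einsum warmup
        else:
            gemm_names.append(name)    # covered by the fp8_gemm_nt warmup
    return gemm_names, einsum_names


if __name__ == "__main__":
    modules = [
        ("wq_b", SimpleNamespace(_use_deep_gemm_fp8=True)),
        ("wo_a", SimpleNamespace(_use_deep_gemm_fp8=True, is_bmm=True)),
        ("norm", SimpleNamespace()),
    ]
    print(split_deep_gemm_modules(modules))
```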


Port vLLM's M-value enumeration (_optimal_warmup_m_values: wave-boundary +
block_m-multiple M, unioned with 16-step block_m values) and apply it in
warmup_fp8_gemm_nt, _warmup_tf32_hc_prenorm_gemm, and warmup_mega_moe_jit.

The old 16-step-only sweep missed many (N, K, block_m) tiles the eval prefills
hit, so those kernels JIT-compiled inline during the eval -- stalling the
engine and tripping smg's gRPC health probe. On a cold deep_gemm cache this
cuts inline prefill compiles from 17 to 5 (the mega_moe grouped GEMM, mega_moe,
and tf32 compressor families are now fully covered). Startup warmup is
correspondingly slower (cold ~330s, within the 1800s readiness timeout).

GSM8K 5-shot N=50 = 0.94 (warmup is numerics-neutral). The residual cold-start
inline JIT (FP4 linear GEMM + ragged indexer) is a known follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 2338d0f113


Comment on lines +615 to +617
post_quant_warmup = getattr(model, "post_quant_warmup", None)
if callable(post_quant_warmup):
    post_quant_warmup()


[P2] Invoke DeepGEMM model warmup for sharded loads

When loading DeepSeek V4 from sharded-state checkpoints, this loader never calls model.load_weights(), and the only call sites for warmup_deep_gemm() / _warmup_prefill_jit() are inside DeepseekV4ForCausalLM.load_weights (checked with rg warmup_deep_gemm). This new hook only runs post_quant_warmup(), which warms FP8 linear GEMMs, so sharded loads still skip the new prefill and MegaMoE DeepGEMM startup warmups and can hit first-request JITs despite this commit’s warmup path.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


Reviewed commit: 3672551709



post_quant_warmup = getattr(model, "post_quant_warmup", None)
if callable(post_quant_warmup):
    post_quant_warmup()


[P2] Run post-quant warmup on the target device

When CPU offloading is enabled, the preceding device_loading_context blocks intentionally move CPU parameters to target_device only for processing and then restore them to CPU before this new hook runs. post_quant_warmup() eventually uses next(model.parameters()).device to allocate the DeepGEMM warmup tensors, so an offloaded DeepSeek V4 FP8 load can try to invoke CUDA DeepGEMM with CPU tensors and fail during startup instead of just warming the kernels.
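One possible shape for a fix is to hand the hook the device explicitly instead of letting it infer one from the model's parameters. The sketch below is hypothetical: `run_post_quant_warmup`, the `device` keyword on the hook, and `FakeModel` are all invented for illustration and are not the loader's or model's actual signatures.

```python
def run_post_quant_warmup(model, target_device) -> bool:
    """Call the model's optional post_quant_warmup hook on an explicit device.

    Returns True if the hook existed and was called.
    """
    hook = getattr(model, "post_quant_warmup", None)
    if not callable(hook):
        return False
    # Pass the device explicitly so offloaded (CPU-resident) parameters
    # cannot steer warmup tensor allocation onto the CPU.
    hook(device=target_device)
    return True


class FakeModel:
    """Stand-in model that records which device the warmup was asked to use."""

    def __init__(self):
        self.warmed_on = None

    def post_quant_warmup(self, device):
        self.warmed_on = device


if __name__ == "__main__":
    model = FakeModel()
    print(run_post_quant_warmup(model, "cuda:0"), model.warmed_on)
```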


Under MTP/speculative decoding the verify step flattens bs*num_draft_tokens
into the decode indexer's num_tokens, so get_paged_mqa_logits_metadata
(smxx_paged_mqa_logits_metadata, JIT-keyed on align(num_tokens, 32)) hits
batch buckets up to align(max_num_seqs * num_draft_tokens, 32). The startup
warmup ceiling lacked the speculative multiplier, so those buckets were
JIT-compiled inline on the first verify -- stalling the engine and tripping
smg's gRPC health probe (the flaky V4-Flash-MTP GSM8K eval).

Expose speculative_num_draft_tokens in global_server_args_dict and scale the
warmup max_decode_tokens by it (gated on speculative_algorithm; no-op when
speculation is off). The paged-logits cubin is unchanged -- verify routes
through the decode indexer with next_n=1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
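The ceiling calculation described in this commit message can be sketched as below. `warmup_decode_ceiling` is a hypothetical helper; only the gating on `speculative_algorithm`, the `speculative_num_draft_tokens` multiplier, and the 32-token alignment come from the message, and the example values are placeholders.

```python
from typing import Optional


def warmup_decode_ceiling(
    max_num_seqs: int,
    speculative_algorithm: Optional[str] = None,
    speculative_num_draft_tokens: int = 1,
    bucket: int = 32,
) -> int:
    """Largest aligned decode bucket the warmup should cover."""
    tokens = max_num_seqs
    if speculative_algorithm:
        # Verify flattens bs * num_draft_tokens into the indexer's num_tokens.
        tokens *= speculative_num_draft_tokens
    # Round up to the next multiple of `bucket`.
    return -(-tokens // bucket) * bucket


if __name__ == "__main__":
    print(warmup_decode_ceiling(48))            # speculation off
    print(warmup_decode_ceiling(48, "MTP", 4))  # speculation on
```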
@lightseek-bot lightseek-bot merged commit f13b16b into main Jun 11, 2026
62 of 70 checks passed
@lightseek-bot lightseek-bot deleted the perf/deep-gemm-jit-warmup branch June 11, 2026 21:57
