test(ci): add DeepSeek-V4-Flash MTP AIME25 eval by dongjiyingdjy · Pull Request #461 · lightseekorg/tokenspeed

dongjiyingdjy · 2026-06-16T07:29:12Z

Summary

Adds an AIME25 accuracy CI gate for DeepSeek-V4-Flash with MTP speculative decode, modeled on the existing V4 GSM8K serve pattern and the Kimi AIME25 eval pattern (test/ci/eval/deepseek-v4-flash-mtp-evalscope-gsm8k.yaml, test/ci/eval/kimi-k2.5-nvfp4-evalscope-aime25.yaml).

New file: test/ci/eval/deepseek-v4-flash-mtp-evalscope-aime25.yaml (per-commit + manual, b200-4gpu).

Key design points:

Thinking mode is required. The eval --generation-config carries extra_body.chat_template_kwargs={reasoning_effort: high, thinking: true}. With it V4-Flash reaches ~0.96 on AIME25; without it ~0.5.
Sizing for long reasoning. Chains run up to max_tokens=65536, so the serve is launched with --max-model-len 80000 --max-total-tokens 163840 (vs. 4096/16384 for the GSM8K serve).
MTP enables batch 16. --speculative-algorithm MTP --speculative-num-steps 3 keeps tokens flowing (like Kimi's EAGLE3), so --eval-batch-size 16 runs without the long-stream read-timeout stalls a non-MTP serve hits at that concurrency. timeout=3600 guards the longest single-request streams.
score_threshold: 0.93.
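
For illustration, the thinking-mode payload and the sizing arithmetic can be sketched as below. This is a minimal sketch, not the contents of the new YAML: the dict only mirrors the keys named above (max_tokens, extra_body.chat_template_kwargs), placing max_tokens in the same generation config is an assumption, and prompt_headroom is a hypothetical helper.

```python
import json

# Values quoted in the design points above.
MAX_TOKENS = 65536      # upper bound on generated (reasoning + answer) tokens
MAX_MODEL_LEN = 80000   # --max-model-len on the serve side

# Assumed shape of the generation config; only the extra_body keys are
# confirmed by the PR description.
generation_config = {
    "max_tokens": MAX_TOKENS,
    "extra_body": {
        "chat_template_kwargs": {
            "reasoning_effort": "high",
            "thinking": True,
        }
    },
}


def prompt_headroom(max_model_len: int, max_tokens: int) -> int:
    """Tokens left for the prompt once the full generation budget is reserved."""
    return max_model_len - max_tokens


if __name__ == "__main__":
    print(json.dumps(generation_config))
    print(prompt_headroom(MAX_MODEL_LEN, MAX_TOKENS))
```

With the values above, 80000 - 65536 leaves 14464 tokens of context for the prompt when a request uses its whole generation budget.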

Test Plan

Validated on B200×4:

AIME25 (MTP, batch 16): 28/30 = 0.9333 — 30/30 completed cleanly, 0 timeouts, MTP avg_accept_len ≈ 2.82, wall time ~9 min.
GSM8K: 0.94 (flexible + strict) on the same MTP serve — model healthy.
pipeline.py scan discovers the job; YAML + pre-commit pass; codex review clean (no correctness issues).
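
To make the margin explicit: with 30 AIME25 problems and score_threshold 0.93, the validated run sits exactly on the smallest passing count. The sketch below works through that arithmetic. It assumes the gate compares accuracy with >= against the threshold, which the PR does not state, and passes_gate and min_correct are hypothetical helpers.

```python
def passes_gate(correct: int, total: int, threshold: float) -> bool:
    """Assumed gate rule: accuracy must be at least the threshold."""
    return correct / total >= threshold


def min_correct(total: int, threshold: float) -> int:
    """Smallest number of correct answers that still passes the gate."""
    for correct in range(total + 1):
        if passes_gate(correct, total, threshold):
            return correct
    raise ValueError("threshold cannot be met")


if __name__ == "__main__":
    print(passes_gate(28, 30, 0.93))  # validated run: 28/30
    print(passes_gate(27, 30, 0.93))  # one more miss: 27/30 = 0.9
    print(min_correct(30, 0.93))
```

Under that assumption the gate tolerates at most two wrong answers out of 30, so a single additional miss relative to the validated run would fail the job.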

Add an AIME25 accuracy gate for DeepSeek-V4-Flash with MTP speculative decode, modeled on the V4 GSM8K serve and Kimi AIME25 eval patterns.

- Thinking mode (reasoning_effort=high, thinking=true) is required for V4-Flash to reach ~0.96 on AIME25; reasoning runs up to max_tokens=65536, so max-model-len/max-total-tokens are raised to 80000/163840.
- MTP (--speculative-algorithm MTP --speculative-num-steps 3) keeps tokens flowing, allowing eval-batch-size 16 without the long-stream read-timeout stalls a non-MTP serve hits at that concurrency.
- Validated on B200x4: AIME25 28/30=0.9333 (30/30 clean, 0 timeouts), GSM8K 0.94. score_threshold 0.93.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba53d10d53

chatgpt-codex-connector · 2026-06-17T06:21:38Z

+server:
+  command: >-
+    ts serve
+    --load-format instanttensor


Use a supported load format

In this ts serve path, --load-format is routed to the engine because it is a prepare_server_args flag, and the engine argparse choices in python/tokenspeed/runtime/utils/server_args.py only allow auto, pt, safetensors, npcache, dummy, or extensible (also documented in docs/configuration/server.md). With instanttensor here, the new per-commit eval exits during argument parsing before the readiness probe can ever pass, so the CI job is unusable unless this is changed to a supported format or support is added first.
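
The failure mode described here can be reproduced with a standalone argparse sketch. This is not tokenspeed's actual parser: the choices list is copied from the comment above, while build_parser and load_format_is_accepted are hypothetical names used only for illustration.

```python
import argparse

# Choices as listed in the review comment; the parser itself is a stand-in.
SUPPORTED_LOAD_FORMATS = [
    "auto", "pt", "safetensors", "npcache", "dummy", "extensible",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ts-serve-sketch")
    parser.add_argument(
        "--load-format",
        choices=SUPPORTED_LOAD_FORMATS,
        default="auto",
    )
    return parser


def load_format_is_accepted(value: str) -> bool:
    """Return False if argparse would exit while parsing this value."""
    try:
        build_parser().parse_args(["--load-format", value])
    except SystemExit:
        # argparse reports an invalid choice by printing usage and exiting,
        # which is why the server would never reach its readiness probe.
        return False
    return True


if __name__ == "__main__":
    for fmt in ("safetensors", "instanttensor"):
        print(fmt, load_format_is_accepted(fmt))
```

A check along these lines could run in a pre-flight step so that an unsupported flag value is caught before the job waits on the readiness probe.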

dongjiyingdjy and others added 2 commits June 16, 2026 07:28

Merge branch 'main' into ci/deepseek-v4-mtp-aime25

ba53d10

chatgpt-codex-connector Bot reviewed Jun 17, 2026

dongjiyingdjy wants to merge 2 commits into main from ci/deepseek-v4-mtp-aime25
