test(ci): add DeepSeek-V4-Flash MTP AIME25 eval #461
Add an AIME25 accuracy gate for DeepSeek-V4-Flash with MTP speculative decode, modeled on the V4 GSM8K serve and Kimi AIME25 eval patterns.

- Thinking mode (reasoning_effort=high, thinking=true) is required for V4-Flash to reach ~0.96 on AIME25; reasoning runs up to max_tokens=65536, so max-model-len/max-total-tokens are raised to 80000/163840.
- MTP (--speculative-algorithm MTP --speculative-num-steps 3) keeps tokens flowing, allowing eval-batch-size 16 without the long-stream read-timeout stalls a non-MTP serve hits at that concurrency.
- Validated on B200x4: AIME25 28/30=0.9333 (30/30 clean, 0 timeouts), GSM8K 0.94. score_threshold 0.93.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba53d10d53
    server:
      command: >-
        ts serve
        --load-format instanttensor
In this ts serve path, --load-format is routed to the engine because it is a prepare_server_args flag, and the engine argparse choices in python/tokenspeed/runtime/utils/server_args.py only allow auto, pt, safetensors, npcache, dummy, or extensible (also documented in docs/configuration/server.md). With instanttensor here, the new per-commit eval exits during argument parsing before the readiness probe can ever pass, so the CI job is unusable unless this is changed to a supported format or support is added first.
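The failure mode described in this comment can be reproduced in isolation. The sketch below is not the project's parser: it is a stand-in `argparse` parser (hypothetical names `build_parser` and `parse_exit_code`) that uses only the choice list quoted in the review, to show that a value outside `choices` makes argparse exit with status 2 before any server code runs.

```python
import argparse

# Choices as listed in the review comment. The parser below is a stand-in,
# not the real prepare_server_args / server_args.py implementation.
SUPPORTED_LOAD_FORMATS = ["auto", "pt", "safetensors", "npcache", "dummy", "extensible"]


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser with a restricted --load-format flag."""
    parser = argparse.ArgumentParser(prog="ts-serve-sketch")
    parser.add_argument("--load-format", choices=SUPPORTED_LOAD_FORMATS, default="auto")
    return parser


def parse_exit_code(argv: list[str]) -> int:
    """Return 0 if argv parses, otherwise the exit status argparse raises."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse reports invalid choices via parser.error(), which exits with 2.
        return int(exc.code)
    return 0


if __name__ == "__main__":
    print(parse_exit_code(["--load-format", "safetensors"]))    # accepted
    print(parse_exit_code(["--load-format", "instanttensor"]))  # rejected at parse time
```

If the real engine parser behaves like this stand-in, the serve process would terminate during argument parsing, which is consistent with the readiness probe never passing.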
Summary
Adds an AIME25 accuracy CI gate for DeepSeek-V4-Flash with MTP speculative decode, modeled on the existing V4 GSM8K serve pattern and the Kimi AIME25 eval pattern (test/ci/eval/deepseek-v4-flash-mtp-evalscope-gsm8k.yaml, test/ci/eval/kimi-k2.5-nvfp4-evalscope-aime25.yaml).

New file: test/ci/eval/deepseek-v4-flash-mtp-evalscope-aime25.yaml (per-commit + manual, b200-4gpu).

Key design points:
- --generation-config carries extra_body.chat_template_kwargs={reasoning_effort: high, thinking: true}. With it V4-Flash reaches ~0.96 on AIME25; without it ~0.5.
- Reasoning runs up to max_tokens=65536, so the serve uses --max-model-len 80000 --max-total-tokens 163840 (vs GSM8K's 4096/16384).
- --speculative-algorithm MTP --speculative-num-steps 3 keeps tokens flowing (like Kimi's EAGLE3), so --eval-batch-size 16 runs without the long-stream read-timeout stalls a non-MTP serve hits at that concurrency. timeout=3600 guards the longest single-request streams.
- score_threshold: 0.93.

Test Plan
Validated on B200×4:
- AIME25 28/30 = 0.9333 (30/30 clean, 0 timeouts); GSM8K 0.94.
- avg_accept_len ≈ 2.82, wall time ~9 min.
- pipeline.py scan discovers the job; YAML + pre-commit pass; codex review clean (no correctness issues).
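For reviewers weighing how tight the gate is, the arithmetic behind score_threshold 0.93 and the 80000-token context limit can be checked with a short sketch. It assumes the gate passes when score >= threshold and that max-model-len bounds prompt plus generated tokens; neither is confirmed by this PR, and the function names are illustrative only.

```python
from fractions import Fraction

NUM_PROBLEMS = 30                    # AIME25 size, per the 28/30 result
SCORE_THRESHOLD = Fraction(93, 100)  # score_threshold from the new YAML
MAX_TOKENS = 65536                   # generation cap from the PR
MAX_MODEL_LEN = 80000                # --max-model-len from the PR


def passes_gate(correct: int, total: int = NUM_PROBLEMS) -> bool:
    """Assumed gate rule: pass when accuracy is at least the threshold."""
    return Fraction(correct, total) >= SCORE_THRESHOLD


def min_correct_to_pass(total: int = NUM_PROBLEMS) -> int:
    """Smallest number of correct answers that still clears the gate."""
    return next(k for k in range(total + 1) if passes_gate(k, total))


def prompt_headroom() -> int:
    """Tokens left for the prompt if a response uses the full max_tokens."""
    return MAX_MODEL_LEN - MAX_TOKENS


if __name__ == "__main__":
    print(passes_gate(28))         # the validated run, 28/30
    print(passes_gate(27))         # one more miss, 27/30 = 0.9
    print(min_correct_to_pass())
    print(prompt_headroom())
```

Under these assumptions the validated 28/30 run sits exactly at the minimum passing count, so a single additional wrong answer would fail the gate.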