
test(ci): add DeepSeek-V4-Flash MTP AIME25 eval#461

Open
dongjiyingdjy wants to merge 2 commits into main from ci/deepseek-v4-mtp-aime25


@dongjiyingdjy

Contributor

Summary

Adds an AIME25 accuracy CI gate for DeepSeek-V4-Flash with MTP speculative decode, modeled on the existing V4 GSM8K serve pattern and the Kimi AIME25 eval pattern (test/ci/eval/deepseek-v4-flash-mtp-evalscope-gsm8k.yaml, test/ci/eval/kimi-k2.5-nvfp4-evalscope-aime25.yaml).

New file: test/ci/eval/deepseek-v4-flash-mtp-evalscope-aime25.yaml (per-commit + manual, b200-4gpu).

Key design points:

  • Thinking mode is required. The eval --generation-config carries extra_body.chat_template_kwargs={reasoning_effort: high, thinking: true}. With it, V4-Flash reaches ~0.96 on AIME25; without it, ~0.5.
  • Sizing for long reasoning. Reasoning chains run up to max_tokens=65536, so the serve uses --max-model-len 80000 --max-total-tokens 163840 (vs. GSM8K's 4096/16384).
  • MTP enables batch 16. --speculative-algorithm MTP --speculative-num-steps 3 keeps tokens flowing (like Kimi's EAGLE3), so --eval-batch-size 16 runs without the long-stream read-timeout stalls that a non-MTP serve hits at that concurrency. timeout=3600 guards the longest single-request streams.
  • score_threshold: 0.93.
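
The thinking-mode payload and the context sizing from the design points above can be sketched in Python. The `extra_body.chat_template_kwargs` nesting and the numeric values come from this PR; the variable names, the placement of `max_tokens` in the same JSON object, and the reading of `--max-model-len` as a per-request prompt-plus-output budget are assumptions for illustration, not the contents of the new YAML file.

```python
import json

# Values quoted in the PR description.
MAX_TOKENS = 65536      # longest reasoning chain the eval allows
MAX_MODEL_LEN = 80000   # --max-model-len on the serve side

# Assumed shape of the JSON passed via --generation-config; only the
# extra_body.chat_template_kwargs nesting is stated in the PR.
generation_config = {
    "max_tokens": MAX_TOKENS,
    "extra_body": {
        "chat_template_kwargs": {"reasoning_effort": "high", "thinking": True},
    },
}

encoded = json.dumps(generation_config, sort_keys=True)

# Assumption: max-model-len bounds prompt + generated tokens per request,
# so this is the room left for the AIME25 prompt itself.
prompt_headroom = MAX_MODEL_LEN - MAX_TOKENS

print(encoded)
print(prompt_headroom)
```

Under that reading, a request that uses the full 65536-token generation budget still leaves 14464 tokens for the prompt.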

Test Plan

Validated on B200×4:

  • AIME25 (MTP, batch 16): 28/30 correct = 0.9333; all 30 requests completed cleanly with 0 timeouts; MTP avg_accept_len ≈ 2.82; wall time ~9 min.
  • GSM8K: 0.94 (flexible + strict) on the same MTP serve — model healthy.
  • pipeline.py scan discovers the job; YAML + pre-commit pass; codex review clean (no correctness issues).
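
To make the margin between the validated score and `score_threshold: 0.93` concrete, here is a minimal sketch of the arithmetic. It assumes the gate passes when accuracy is greater than or equal to the threshold; the PR does not state the comparison rule, and the function names are hypothetical.

```python
def passes_gate(correct: int, total: int, threshold: float) -> bool:
    """Assumed gate rule: pass when accuracy is at least the threshold."""
    return correct / total >= threshold


def min_correct_to_pass(total: int, threshold: float) -> int:
    """Smallest number of correct answers that clears the gate."""
    for correct in range(total + 1):
        if passes_gate(correct, total, threshold):
            return correct
    raise ValueError("threshold cannot be reached with this many problems")


# AIME25 has 30 problems in this eval; 0.93 is the PR's score_threshold.
needed = min_correct_to_pass(30, 0.93)
print(needed, passes_gate(28, 30, 0.93), passes_gate(27, 30, 0.93))
```

Under that assumption, the validated 28/30 run sits exactly at the minimum passing count: one more miss (27/30 = 0.90) would fail the gate.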

dongjiyingdjy and others added 2 commits June 16, 2026 07:28
Add an AIME25 accuracy gate for DeepSeek-V4-Flash with MTP speculative
decode, modeled on the V4 GSM8K serve and Kimi AIME25 eval patterns.

- Thinking mode (reasoning_effort=high, thinking=true) is required for
  V4-Flash to reach ~0.96 on AIME25; reasoning runs up to max_tokens=65536,
  so max-model-len/max-total-tokens are raised to 80000/163840.
- MTP (--speculative-algorithm MTP --speculative-num-steps 3) keeps tokens
  flowing, allowing eval-batch-size 16 without the long-stream read-timeout
  stalls a non-MTP serve hits at that concurrency.
- Validated on B200x4: AIME25 28/30=0.9333 (30/30 clean, 0 timeouts),
  GSM8K 0.94. score_threshold 0.93.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: jiyingd <87510204+dongjiyingdjy@users.noreply.github.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba53d10d53


server:
  command: >-
    ts serve
    --load-format instanttensor


P1: Use a supported load format

In this ts serve path, --load-format is routed to the engine because it is a prepare_server_args flag, and the engine argparse choices in python/tokenspeed/runtime/utils/server_args.py only allow auto, pt, safetensors, npcache, dummy, or extensible (also documented in docs/configuration/server.md). With instanttensor here, the new per-commit eval exits during argument parsing before the readiness probe can ever pass, so the CI job is unusable unless this is changed to a supported format or support is added first.
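
The failure mode described here can be reproduced with a standalone `argparse` sketch. It is not the code in python/tokenspeed/runtime/utils/server_args.py: the parser name and helper functions are hypothetical, and only the list of choices is taken from the review comment above.

```python
import argparse

# Choices as listed in the review comment; the real parser may differ.
LOAD_FORMATS = ["auto", "pt", "safetensors", "npcache", "dummy", "extensible"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the engine's argument parser."""
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument("--load-format", choices=LOAD_FORMATS, default="auto")
    return parser


def is_accepted(load_format: str) -> bool:
    """Return True if argparse accepts the value, False if it exits."""
    try:
        build_parser().parse_args(["--load-format", load_format])
    except SystemExit:
        # argparse prints a usage error and exits on a value outside
        # `choices`, before any server startup code would run.
        return False
    return True


print(is_accepted("safetensors"), is_accepted("instanttensor"))
```

If the real parser restricts `--load-format` in the same way, the process would exit at this point and never reach the readiness probe, which matches the behavior the review describes.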
