Summary
The documented LLM bring-up command (in qwen3-tts-server/docs/ARCHITECTURE.md, qwen3-asr-server/docs/ARCHITECTURE.md, and the "recommended pairing" in matrix-voip-agent/README.md) cannot start the server as written. Three independent blockers, each reproduced on a clean DGX Spark (GB10, driver 580.159.03, CUDA 13, vllm-aeon-ultimate-dflash:qwen36-v3). All three have verified fixes below.
Filing here because all three errors concern the LLM / DFlash bring-up. Happy to send a docs PR if useful.
Environment
- Hardware: NVIDIA GB10 (DGX Spark), 128 GB unified
- Image:
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3
- Driver 580.159.03 / CUDA 13.0, docker + nvidia runtime, fresh HF cache
Blocker 1 — documented model ID 404s
The documented command serves aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS. That repo ID does not exist:
huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS/resolve/main/config.json
Fix: the real public repo is AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS (note the org casing AEON-7, and the Multimodal-NVFP4 segment).
Blocker 2 — missing --quantization modelopt
With the corrected model ID, vLLM does not auto-detect the modelopt FP4 quantization from the documented flags; weights fail to load until quantization is specified explicitly.
Fix: add --quantization modelopt. (vLLM then reports quantization=modelopt_fp4 and loads cleanly.)
Blocker 3 — DFlash spec-config crashes without --max-num-batched-tokens
With model ID and quantization fixed, the documented --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' crash-loops at startup:
Value error, max_num_scheduled_tokens is set to -1536 based on the speculative
decoding settings, which does not allow any tokens to be scheduled. Increase
max_num_batched_tokens to accommodate the additional draft token slots, ...
The 15 draft-token slots exceed the default batched-token budget, computing negative. z-lab's own DFlash launch command includes --max-num-batched-tokens 32768; the AEON docs omit it.
(Separately: z-lab/Qwen3.6-27B-DFlash is a click-through-gated HF repo — users must accept its conditions and provide an HF_TOKEN. Worth a one-line note in the docs; not a code bug.)
Fix: add --max-num-batched-tokens 32768.
Working command (all three fixes; verified serving on GB10)
docker run -d --name qwen36-aeon-xs \
--runtime nvidia --network aeon-stack -p 8000:8000 \
--shm-size=4gb --restart unless-stopped \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e NVIDIA_VISIBLE_DEVICES=all \
-e ENABLE_NVFP4_SM100=0 \
-e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
vllm serve AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
--served-model-name qwen36-ultimate-xs \
--host 0.0.0.0 --port 8000 \
--quantization modelopt \
--gpu-memory-utilization 0.75 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
--trust-remote-code
Verified result
Server serves on :8000; /v1/chat/completions returns correct output. Warm DFlash throughput ~31 tok/s (3 runs, 200-tok structured generation, temp 0.0), DFlash SpecDecoding metrics: mean acceptance length 3.9, avg draft acceptance ~19–20%. LLM EngineCore resident ~85 GB alongside ASR (~7.5 GB) + TTS (~4 GB) on the 128 GB GB10 — matches the documented budget.
Thanks for publishing this stack — the GB10 latency work is excellent, and once past these three doc fixes it comes up clean.
Summary
The documented LLM bring-up command (in
qwen3-tts-server/docs/ARCHITECTURE.md,qwen3-asr-server/docs/ARCHITECTURE.md, and the "recommended pairing" inmatrix-voip-agent/README.md) cannot start the server as written. Three independent blockers, each reproduced on a clean DGX Spark (GB10, driver 580.159.03, CUDA 13,vllm-aeon-ultimate-dflash:qwen36-v3). All three have verified fixes below.Filing here because all three errors concern the LLM / DFlash bring-up. Happy to send a docs PR if useful.
Environment
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3Blocker 1 — documented model ID 404s
The documented command serves
aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS. That repo ID does not exist:Fix: the real public repo is
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS(note the org casingAEON-7, and theMultimodal-NVFP4segment).Blocker 2 — missing
--quantization modeloptWith the corrected model ID, vLLM does not auto-detect the modelopt FP4 quantization from the documented flags; weights fail to load until quantization is specified explicitly.
Fix: add
--quantization modelopt. (vLLM then reportsquantization=modelopt_fp4and loads cleanly.)Blocker 3 — DFlash spec-config crashes without
--max-num-batched-tokensWith model ID and quantization fixed, the documented
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}'crash-loops at startup:The 15 draft-token slots exceed the default batched-token budget, computing negative. z-lab's own DFlash launch command includes
--max-num-batched-tokens 32768; the AEON docs omit it.(Separately:
z-lab/Qwen3.6-27B-DFlashis a click-through-gated HF repo — users must accept its conditions and provide anHF_TOKEN. Worth a one-line note in the docs; not a code bug.)Fix: add
--max-num-batched-tokens 32768.Working command (all three fixes; verified serving on GB10)
Verified result
Server serves on
:8000;/v1/chat/completionsreturns correct output. Warm DFlash throughput ~31 tok/s (3 runs, 200-tok structured generation, temp 0.0), DFlash SpecDecoding metrics: mean acceptance length 3.9, avg draft acceptance ~19–20%. LLM EngineCore resident ~85 GB alongside ASR (~7.5 GB) + TTS (~4 GB) on the 128 GB GB10 — matches the documented budget.Thanks for publishing this stack — the GB10 latency work is excellent, and once past these three doc fixes it comes up clean.