Skip to content

Documented bring-up command fails (3 blockers): 404 model ID, missing --quantization modelopt, DFlash needs --max-num-batched-tokens #11

Description

@forge-witt3rd

Summary

The documented LLM bring-up command (in qwen3-tts-server/docs/ARCHITECTURE.md, qwen3-asr-server/docs/ARCHITECTURE.md, and the "recommended pairing" in matrix-voip-agent/README.md) cannot start the server as written. Three independent blockers, each reproduced on a clean DGX Spark (GB10, driver 580.159.03, CUDA 13, vllm-aeon-ultimate-dflash:qwen36-v3). All three have verified fixes below.

Filing here because all three errors concern the LLM / DFlash bring-up. Happy to send a docs PR if useful.

Environment

  • Hardware: NVIDIA GB10 (DGX Spark), 128 GB unified
  • Image: ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3
  • Driver 580.159.03 / CUDA 13.0, docker + nvidia runtime, fresh HF cache

Blocker 1 — documented model ID 404s

The documented command serves aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS. That repo ID does not exist:

huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS/resolve/main/config.json

Fix: the real public repo is AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS (note the org casing AEON-7, and the Multimodal-NVFP4 segment).

Blocker 2 — missing --quantization modelopt

With the corrected model ID, vLLM does not auto-detect the modelopt FP4 quantization from the documented flags; weights fail to load until quantization is specified explicitly.

Fix: add --quantization modelopt. (vLLM then reports quantization=modelopt_fp4 and loads cleanly.)

Blocker 3 — DFlash spec-config crashes without --max-num-batched-tokens

With model ID and quantization fixed, the documented --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' crash-loops at startup:

Value error, max_num_scheduled_tokens is set to -1536 based on the speculative
decoding settings, which does not allow any tokens to be scheduled. Increase
max_num_batched_tokens to accommodate the additional draft token slots, ...

The 15 draft-token slots exceed the default batched-token budget, computing negative. z-lab's own DFlash launch command includes --max-num-batched-tokens 32768; the AEON docs omit it.

(Separately: z-lab/Qwen3.6-27B-DFlash is a click-through-gated HF repo — users must accept its conditions and provide an HF_TOKEN. Worth a one-line note in the docs; not a code bug.)

Fix: add --max-num-batched-tokens 32768.

Working command (all three fixes; verified serving on GB10)

docker run -d --name qwen36-aeon-xs \
  --runtime nvidia --network aeon-stack -p 8000:8000 \
  --shm-size=4gb --restart unless-stopped \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e ENABLE_NVFP4_SM100=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
  vllm serve AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
    --served-model-name qwen36-ultimate-xs \
    --host 0.0.0.0 --port 8000 \
    --quantization modelopt \
    --gpu-memory-utilization 0.75 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code

Verified result

Server serves on :8000; /v1/chat/completions returns correct output. Warm DFlash throughput ~31 tok/s (3 runs, 200-tok structured generation, temp 0.0), DFlash SpecDecoding metrics: mean acceptance length 3.9, avg draft acceptance ~19–20%. LLM EngineCore resident ~85 GB alongside ASR (~7.5 GB) + TTS (~4 GB) on the 128 GB GB10 — matches the documented budget.

Thanks for publishing this stack — the GB10 latency work is excellent, and once past these three doc fixes it comes up clean.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions