Documented bring-up command fails (3 blockers): 404 model ID, missing --quantization modelopt, DFlash needs --max-num-batched-tokens

## Summary

The documented LLM bring-up command (in `qwen3-tts-server/docs/ARCHITECTURE.md`, `qwen3-asr-server/docs/ARCHITECTURE.md`, and the "recommended pairing" in `matrix-voip-agent/README.md`) **cannot start the server as written.** Three independent blockers, each reproduced on a clean DGX Spark (GB10, driver 580.159.03, CUDA 13, `vllm-aeon-ultimate-dflash:qwen36-v3`). All three have verified fixes below.

Filing here because all three errors concern the LLM / DFlash bring-up. Happy to send a docs PR if useful.

## Environment

- Hardware: NVIDIA **GB10** (DGX Spark), 128 GB unified
- Image: `ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`
- Driver 580.159.03 / CUDA 13.0, docker + nvidia runtime, fresh HF cache

## Blocker 1 — documented model ID 404s

The documented command serves `aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS`. That repo ID does not exist:

```
huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/aeon-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-MTP-XS/resolve/main/config.json
```

**Fix:** the real public repo is `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS` (note the org casing `AEON-7`, and the `Multimodal-NVFP4` segment).

## Blocker 2 — missing `--quantization modelopt`

With the corrected model ID, vLLM does not auto-detect the modelopt FP4 quantization from the documented flags; weights fail to load until quantization is specified explicitly.

**Fix:** add `--quantization modelopt`. (vLLM then reports `quantization=modelopt_fp4` and loads cleanly.)

## Blocker 3 — DFlash spec-config crashes without `--max-num-batched-tokens`

With model ID and quantization fixed, the documented `--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}'` crash-loops at startup:

```
Value error, max_num_scheduled_tokens is set to -1536 based on the speculative
decoding settings, which does not allow any tokens to be scheduled. Increase
max_num_batched_tokens to accommodate the additional draft token slots, ...
```

The 15 draft-token slots exceed the default batched-token budget, computing negative. z-lab's own DFlash launch command includes `--max-num-batched-tokens 32768`; the AEON docs omit it.

(Separately: `z-lab/Qwen3.6-27B-DFlash` is a click-through-gated HF repo — users must accept its conditions and provide an `HF_TOKEN`. Worth a one-line note in the docs; not a code bug.)

**Fix:** add `--max-num-batched-tokens 32768`.

## Working command (all three fixes; verified serving on GB10)

```bash
docker run -d --name qwen36-aeon-xs \
  --runtime nvidia --network aeon-stack -p 8000:8000 \
  --shm-size=4gb --restart unless-stopped \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e ENABLE_NVFP4_SM100=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
  vllm serve AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
    --served-model-name qwen36-ultimate-xs \
    --host 0.0.0.0 --port 8000 \
    --quantization modelopt \
    --gpu-memory-utilization 0.75 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --enable-auto-tool-choice --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code
```

## Verified result

Server serves on `:8000`; `/v1/chat/completions` returns correct output. Warm DFlash throughput **~31 tok/s** (3 runs, 200-tok structured generation, temp 0.0), DFlash SpecDecoding metrics: mean acceptance length **3.9**, avg draft acceptance **~19–20%**. LLM EngineCore resident **~85 GB** alongside ASR (~7.5 GB) + TTS (~4 GB) on the 128 GB GB10 — matches the documented budget.

Thanks for publishing this stack — the GB10 latency work is excellent, and once past these three doc fixes it comes up clean.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documented bring-up command fails (3 blockers): 404 model ID, missing --quantization modelopt, DFlash needs --max-num-batched-tokens #11

Summary

Environment

Blocker 1 — documented model ID 404s

Blocker 2 — missing `--quantization modelopt`

Blocker 3 — DFlash spec-config crashes without `--max-num-batched-tokens`

Working command (all three fixes; verified serving on GB10)

Verified result

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Documented bring-up command fails (3 blockers): 404 model ID, missing --quantization modelopt, DFlash needs --max-num-batched-tokens #11

Description

Summary

Environment

Blocker 1 — documented model ID 404s

Blocker 2 — missing --quantization modelopt

Blocker 3 — DFlash spec-config crashes without --max-num-batched-tokens

Working command (all three fixes; verified serving on GB10)

Verified result

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Blocker 2 — missing `--quantization modelopt`

Blocker 3 — DFlash spec-config crashes without `--max-num-batched-tokens`