Environment
- Hardware: Dell DGX Spark GB10 (Grace-Blackwell, 128GB unified memory, sm_121)
- Container:
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v4
- vLLM version: 0.20.2rc1.dev166+gf6490a284
- Configuration:
MAX_MODEL_LEN=200000
MAX_NUM_SEQS=32
GPU_MEMORY_UTILIZATION=0.61 (co-resident with embedding model at 0.20)
NUM_SPECULATIVE_TOKENS=15
MAX_NUM_BATCHED_TOKENS=8192
EXTRA_VLLM_ARGS="--no-scheduler-reserve-full-isl --reasoning-parser qwen3"
- BF16 KV cache, docker
--memory 83g
Symptoms
After running normally for ~2 weeks (12,763 requests completed successfully), the engine entered a hung state:
- 3 requests stuck in "running", 0 waiting — no requests completing for 7+ hours
- Engine metrics logging stopped — vLLM's 10-second interval metrics stopped being emitted entirely
- GPU shows 96% utilization but only 13W power draw — GPU context is held but no actual computation
- API endpoint responds to GET requests (
/v1/models, /metrics) but POST requests hang indefinitely
- New test requests (
max_tokens=5, simple "hi" prompt) got no response within 10 seconds
Timeline (all times UTC)
01:07-01:08 Normal operation: 3 reqs, gen throughput ~73 tok/s, KV cache ~52%
01:08:26 Brief idle (0 running)
01:08:36 3 new requests arrive, gen throughput ~65 tok/s
01:09:00 Burst of completions (3 requests finish quickly)
01:09:46 2 running + 1 WAITING, gen throughput drops to 0.0
01:10:06 3 running, KV cache 60.9%, gen throughput 3.1 tok/s
01:10:16 3 running, KV cache 60.9%, gen throughput 0.0 — LAST metrics log
01:18-03:49 6 POST completions trickle in (~30 min apart), no more metrics
03:49 Last request completion
~10:55 Diagnosed as hung (7+ hours since last completion)
Analysis
- Not a scheduler deadlock: Inspected
scheduler.py — running requests have proper preemption logic (both FCFS and PRIORITY paths). The --no-scheduler-reserve-full-isl flag is set, which avoids the upstream vllm#39734 HOL blocking.
- Likely CUDA-level hang: The 96% GPU utilization with 13W power draw suggests a CUDA kernel is stuck (spin-waiting), not doing real computation. Temperature was only 45°C.
- Host memory was tight:
MemAvailable was 13.5 GB (below the 15 GB threshold for 200K context), with 4 GB swap in use.
- Spec decode interaction?: The transition from normal throughput (~73 tok/s) to zero happened while 3 long-context requests were consuming 60.9% of KV cache. The spec decode metrics showed progressively lower acceptance rates before the hang.
Questions
- Is this a known issue with the AEON fork's DFlash speculative decoding on Blackwell (sm_121)?
- Could the tight host memory (13.5 GB available, unified memory architecture) contribute to this hang?
- Are there recommended monitoring/recovery strategies beyond
systemctl restart?
- Would upgrading to
qwen36-v5-ddtree-m53-temp-nonflat-gdn address this, or is that tag experimental?
Workaround
Service was restarted via systemctl --user restart vllm-aeon-27b-dflash.service which cleared the hang.
Disclosure: This issue was filed by Claude Opus 4.6 (Anthropic) at the direction of a human user who experienced this problem. The diagnosis and scheduler code analysis were performed by the AI. The human user reviewed and approved this submission.
Environment
ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v4MAX_MODEL_LEN=200000MAX_NUM_SEQS=32GPU_MEMORY_UTILIZATION=0.61(co-resident with embedding model at 0.20)NUM_SPECULATIVE_TOKENS=15MAX_NUM_BATCHED_TOKENS=8192EXTRA_VLLM_ARGS="--no-scheduler-reserve-full-isl --reasoning-parser qwen3"--memory 83gSymptoms
After running normally for ~2 weeks (12,763 requests completed successfully), the engine entered a hung state:
/v1/models,/metrics) but POST requests hang indefinitelymax_tokens=5, simple "hi" prompt) got no response within 10 secondsTimeline (all times UTC)
Analysis
scheduler.py— running requests have proper preemption logic (both FCFS and PRIORITY paths). The--no-scheduler-reserve-full-islflag is set, which avoids the upstream vllm#39734 HOL blocking.MemAvailablewas 13.5 GB (below the 15 GB threshold for 200K context), with 4 GB swap in use.Questions
systemctl restart?qwen36-v5-ddtree-m53-temp-nonflat-gdnaddress this, or is that tag experimental?Workaround
Service was restarted via
systemctl --user restart vllm-aeon-27b-dflash.servicewhich cleared the hang.