Skip to content

Engine hang: 96% GPU utilization but 0 generation throughput for 7+ hours (GB10 Blackwell, qwen36-v4) #10

Description

@aider4ryder

Environment

  • Hardware: Dell DGX Spark GB10 (Grace-Blackwell, 128GB unified memory, sm_121)
  • Container: ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v4
  • vLLM version: 0.20.2rc1.dev166+gf6490a284
  • Configuration:
    • MAX_MODEL_LEN=200000
    • MAX_NUM_SEQS=32
    • GPU_MEMORY_UTILIZATION=0.61 (co-resident with embedding model at 0.20)
    • NUM_SPECULATIVE_TOKENS=15
    • MAX_NUM_BATCHED_TOKENS=8192
    • EXTRA_VLLM_ARGS="--no-scheduler-reserve-full-isl --reasoning-parser qwen3"
    • BF16 KV cache, docker --memory 83g

Symptoms

After running normally for ~2 weeks (12,763 requests completed successfully), the engine entered a hung state:

  1. 3 requests stuck in "running", 0 waiting — no requests completing for 7+ hours
  2. Engine metrics logging stopped — vLLM's 10-second interval metrics stopped being emitted entirely
  3. GPU shows 96% utilization but only 13W power draw — GPU context is held but no actual computation
  4. API endpoint responds to GET requests (/v1/models, /metrics) but POST requests hang indefinitely
  5. New test requests (max_tokens=5, simple "hi" prompt) got no response within 10 seconds

Timeline (all times UTC)

01:07-01:08  Normal operation: 3 reqs, gen throughput ~73 tok/s, KV cache ~52%
01:08:26     Brief idle (0 running)
01:08:36     3 new requests arrive, gen throughput ~65 tok/s
01:09:00     Burst of completions (3 requests finish quickly)
01:09:46     2 running + 1 WAITING, gen throughput drops to 0.0
01:10:06     3 running, KV cache 60.9%, gen throughput 3.1 tok/s
01:10:16     3 running, KV cache 60.9%, gen throughput 0.0 — LAST metrics log
01:18-03:49  6 POST completions trickle in (~30 min apart), no more metrics
03:49        Last request completion
~10:55       Diagnosed as hung (7+ hours since last completion)

Analysis

  • Not a scheduler deadlock: Inspected scheduler.py — running requests have proper preemption logic (both FCFS and PRIORITY paths). The --no-scheduler-reserve-full-isl flag is set, which avoids the upstream vllm#39734 HOL blocking.
  • Likely CUDA-level hang: The 96% GPU utilization with 13W power draw suggests a CUDA kernel is stuck (spin-waiting), not doing real computation. Temperature was only 45°C.
  • Host memory was tight: MemAvailable was 13.5 GB (below the 15 GB threshold for 200K context), with 4 GB swap in use.
  • Spec decode interaction?: The transition from normal throughput (~73 tok/s) to zero happened while 3 long-context requests were consuming 60.9% of KV cache. The spec decode metrics showed progressively lower acceptance rates before the hang.

Questions

  1. Is this a known issue with the AEON fork's DFlash speculative decoding on Blackwell (sm_121)?
  2. Could the tight host memory (13.5 GB available, unified memory architecture) contribute to this hang?
  3. Are there recommended monitoring/recovery strategies beyond systemctl restart?
  4. Would upgrading to qwen36-v5-ddtree-m53-temp-nonflat-gdn address this, or is that tag experimental?

Workaround

Service was restarted via systemctl --user restart vllm-aeon-27b-dflash.service which cleared the hang.


Disclosure: This issue was filed by Claude Opus 4.6 (Anthropic) at the direction of a human user who experienced this problem. The diagnosis and scheduler code analysis were performed by the AI. The human user reviewed and approved this submission.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions