Engine hang: 96% GPU utilization but 0 generation throughput for 7+ hours (GB10 Blackwell, qwen36-v4)

## Environment

- **Hardware**: Dell DGX Spark GB10 (Grace-Blackwell, 128GB unified memory, sm_121)
- **Container**: `ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v4`
- **vLLM version**: 0.20.2rc1.dev166+gf6490a284
- **Configuration**:
  - `MAX_MODEL_LEN=200000`
  - `MAX_NUM_SEQS=32`
  - `GPU_MEMORY_UTILIZATION=0.61` (co-resident with embedding model at 0.20)
  - `NUM_SPECULATIVE_TOKENS=15`
  - `MAX_NUM_BATCHED_TOKENS=8192`
  - `EXTRA_VLLM_ARGS="--no-scheduler-reserve-full-isl --reasoning-parser qwen3"`
  - BF16 KV cache, docker `--memory 83g`

## Symptoms

After running normally for ~2 weeks (12,763 requests completed successfully), the engine entered a hung state:

1. **3 requests stuck in "running"**, 0 waiting — no requests completing for 7+ hours
2. **Engine metrics logging stopped** — vLLM's 10-second interval metrics stopped being emitted entirely
3. **GPU shows 96% utilization but only 13W power draw**, which suggests the GPU context is held while no actual computation takes place
4. **API endpoint responds to GET requests** (`/v1/models`, `/metrics`) but **POST requests hang indefinitely**
5. **New test requests** (`max_tokens=5`, simple "hi" prompt) got no response within 10 seconds
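
The utilization/power mismatch in symptom 3 can be checked mechanically. The following is a minimal sketch, not something the container ships: the function names and the thresholds (90% utilization, 20 W) are assumptions chosen for illustration, and some platforms report `[N/A]` for individual `nvidia-smi` fields, which the parser treats as missing data.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "utilization.gpu,power.draw,temperature.gpu"


def parse_gpu_sample(csv_line):
    """Parse one 'util, power, temp' CSV line; unreadable fields become None."""
    def to_float(field):
        try:
            return float(field.strip())
        except ValueError:  # e.g. "[N/A]" when the platform does not report it
            return None

    fields = csv_line.split(",")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {csv_line!r}")
    util, power, temp = (to_float(f) for f in fields)
    return {"util_pct": util, "power_w": power, "temp_c": temp}


def looks_stalled(sample, util_min=90.0, power_max=20.0):
    """True when utilization is high but power draw is near idle.

    Both thresholds are illustrative assumptions, not measured limits.
    """
    util, power = sample["util_pct"], sample["power_w"]
    if util is None or power is None:
        return False  # not enough data to judge
    return util >= util_min and power <= power_max


def read_gpu_sample():
    """Query the first GPU once via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout
    return parse_gpu_sample(out.strip().splitlines()[0])
```

Calling `looks_stalled(read_gpu_sample())` on a schedule would flag the state observed here (96%, 13 W). A single sample can be misleading, so requiring several consecutive positives before acting is probably safer.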

## Timeline (all times UTC)

```
01:07-01:08  Normal operation: 3 reqs, gen throughput ~73 tok/s, KV cache ~52%
01:08:26     Brief idle (0 running)
01:08:36     3 new requests arrive, gen throughput ~65 tok/s
01:09:00     Burst of completions (3 requests finish quickly)
01:09:46     2 running + 1 WAITING, gen throughput drops to 0.0
01:10:06     3 running, KV cache 60.9%, gen throughput 3.1 tok/s
01:10:16     3 running, KV cache 60.9%, gen throughput 0.0 — LAST metrics log
01:18-03:49  6 POST completions trickle in (~30 min apart), no more metrics
03:49        Last request completion
~10:55       Diagnosed as hung (7+ hours since last completion)
```

## Analysis

- **Not a scheduler deadlock**: An inspection of `scheduler.py` shows that running requests have proper preemption logic on both the FCFS and PRIORITY paths. The `--no-scheduler-reserve-full-isl` flag is set, which avoids the head-of-line (HOL) blocking described in upstream vllm#39734.
- **Likely CUDA-level hang**: The 96% GPU utilization with 13W power draw suggests a CUDA kernel is stuck (spin-waiting), not doing real computation. Temperature was only 45°C.
- **Host memory was tight**: `MemAvailable` was 13.5 GB (below the 15 GB threshold for 200K context), with 4 GB swap in use.
- **Spec decode interaction?**: The transition from normal throughput (~73 tok/s) to zero happened while 3 long-context requests were consuming 60.9% of KV cache. The spec decode metrics showed progressively lower acceptance rates before the hang.
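
The host memory observation can be reproduced with a small check against `/proc/meminfo`. This is a sketch under stated assumptions: the helper names are hypothetical, the 15 GB threshold is the one quoted above, and "GB" is computed as 1024² kB because `/proc/meminfo` reports kibibytes.

```python
def parse_meminfo(text):
    """Return /proc/meminfo values in kB, keyed by field name."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Name: value' shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(meminfo, min_available_gb=15.0):
    """Summarize available memory and swap use against a threshold."""
    kb_per_gb = 1024 ** 2
    available_gb = meminfo["MemAvailable"] / kb_per_gb
    swap_used_kb = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    return {
        "available_gb": round(available_gb, 2),
        "swap_used_gb": round(swap_used_kb / kb_per_gb, 2),
        "below_threshold": available_gb < min_available_gb,
    }


def read_memory_pressure(path="/proc/meminfo"):
    """Read the live values from the host (Linux only)."""
    with open(path) as f:
        return memory_pressure(parse_meminfo(f.read()))
```

Logging the result of `read_memory_pressure()` alongside the engine metrics would show whether available memory was already falling in the minutes before throughput dropped to zero.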

## Questions

1. Is this a known issue with the AEON fork's DFlash speculative decoding on Blackwell (sm_121)?
2. Could the tight host memory (13.5 GB available, unified memory architecture) contribute to this hang?
3. Are there recommended monitoring/recovery strategies beyond `systemctl restart`?
4. Would upgrading to `qwen36-v5-ddtree-m53-temp-nonflat-gdn` address this, or is that tag experimental?

## Workaround

The service was restarted via `systemctl --user restart vllm-aeon-27b-dflash.service`, which cleared the hang.
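
Because `/metrics` kept responding during the hang, the restart could in principle be automated by a watchdog that looks for running requests with no token progress. The sketch below is an illustration, not a tested recovery tool: the metric names `vllm:num_requests_running` and `vllm:generation_tokens_total`, the `http://localhost:8000/metrics` URL, and the 600-second threshold are all assumptions and should be checked against what this build actually exposes.

```python
import re
import subprocess
import time
import urllib.request

# Assumed metric names; verify against the /metrics output of your build.
RUNNING = "vllm:num_requests_running"
GENERATED = "vllm:generation_tokens_total"


def metric_sum(text, name):
    """Sum all samples of one metric in Prometheus text format, or None."""
    pattern = re.compile(r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)")
    total, found = 0.0, False
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            total += float(match.group(1))
            found = True
    return total if found else None


class StallDetector:
    """Flags a stall when requests are running but no tokens are generated."""

    def __init__(self, stall_after_s=600.0):
        self.stall_after_s = stall_after_s
        self.last_tokens = None
        self.last_progress_at = None

    def update(self, metrics_text, now):
        running = metric_sum(metrics_text, RUNNING) or 0.0
        tokens = metric_sum(metrics_text, GENERATED)
        progressed = tokens != self.last_tokens
        if running == 0 or progressed or self.last_progress_at is None:
            # Idle, making progress, or first sample: reset the stall clock.
            self.last_tokens = tokens
            self.last_progress_at = now
            return False
        return now - self.last_progress_at >= self.stall_after_s


def run_watchdog(url="http://localhost:8000/metrics", interval_s=30.0):
    """Poll the metrics endpoint and restart the unit once a stall is seen."""
    detector = StallDetector()
    while True:
        with urllib.request.urlopen(url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        if detector.update(text, time.monotonic()):
            subprocess.run(
                ["systemctl", "--user", "restart", "vllm-aeon-27b-dflash.service"],
                check=True,
            )
            return
        time.sleep(interval_s)
```

With a 600-second threshold this would also have fired during the 01:18-03:49 window, when completions were arriving roughly 30 minutes apart. Whether that counts as a hang or as severe degradation is a policy choice, and the threshold can be raised accordingly.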

---

> **Disclosure**: This issue was filed by Claude Opus 4.6 (Anthropic) at the direction of a human user who experienced this problem. The diagnosis and scheduler code analysis were performed by the AI. The human user reviewed and approved this submission.
