|
1 | 1 | # llama.cpp-dgx |
2 | 2 |
|
| 3 | +> ## ⚠️ DEPRECATED — use upstream `ggml-org/llama.cpp` instead |
| 4 | +> |
| 5 | +> **Status:** As of 2026-05-25, upstream llama.cpp has surpassed this fork for our target workload (Qwopus3.6-27B + NVFP4 + speculative decode on GB10 / SM 12.1). |
| 6 | +> |
| 7 | +> ### Why upstream now wins |
| 8 | +> |
| 9 | +> 1. **Native MTP speculative decode** (`--spec-type draft-mtp`, PR [#22673](https://github.com/ggml-org/llama.cpp/pull/22673) + #23269 + #23461). Co-trained drafter delivers 45-85% acceptance vs 13-20% with our DFlash + post-hoc drafter. |
| 10 | +> 2. **NVFP4 + MTP scale tensors** ([#23563](https://github.com/ggml-org/llama.cpp/pull/23563), 2026-05-23). Closes the last gap that forced us to fork. |
| 11 | +> 3. **Stable VRAM footprint**. v5 + DFlash leaks ~2-3 GB/hour into the draft KV pool (positions are written every tree step but never compacted), reaching OOM in days under sustained traffic. Upstream pre-allocates and reuses cleanly: 30 GB GPU compute pool stays flat over hours. |
| 12 | +> 4. **Lower system memory**. Upstream: ~44-67 GB total system used (with cache-ram 16 GB lazy-filled and reclaimable). v5: 60-78 GB and growing. |
| 13 | +> 5. **Practical decode parity**. Stock + MTP (n_max=5) sustains 13-26 t/s on Qwopus3.6-27B-Abl-NVFP4; v5 + DFlash sustains 11-15 t/s and degrades on long-context multi-slot. |
| 14 | +> |
| 15 | +> ### Where this fork is still useful |
| 16 | +> |
| 17 | +> - **TurboQuant (TQ3_0 / TQ3_4S) KV cache and weights** — upstream does not expose `tq3_0` as a `--cache-type-v` value yet. If you need ~12% extra KV bandwidth savings on a memory-bound decode, this fork keeps the kernels (`ggml/src/ggml-cuda/tq3-prefill.cuh`, `fattn-vec-instance-tq3_*`). |
| 18 | +> - **DFlash external drafter integration** (`--dflash` + `--dflash-draft`) — relevant only if you have a target architecture without MTP layers and a separately trained DFlash drafter that matches it. |
| 19 | +> - **Spiritbuun KV cache fixes** and the GDN chunked kernel tuned for 99 KB SM 12.1 shared memory budget — both already merged or being merged upstream; this fork carries the original variants for archival reference. |
| 20 | +> |
| 21 | +> ### Migration |
| 22 | +> |
| 23 | +> ```bash |
| 24 | +> # Drop-in replacement |
| 25 | +> git clone https://github.com/ggml-org/llama.cpp ~/llama-cpp-stock |
| 26 | +> cd ~/llama-cpp-stock |
| 27 | +> cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120;121" -DGGML_CUDA_FA_ALL_QUANTS=ON |
| 28 | +> cmake --build build --target llama-server llama-quantize llama-cli -j$(nproc) |
| 29 | +> |
| 30 | +> # Replace --dflash --dflash-draft ... with: |
| 31 | +> --spec-type draft-mtp |
| 32 | +> ``` |
| 33 | +> |
| 34 | +> See [croll83/jarvis/infrastructure/gb10/](https://github.com/croll83/jarvis/tree/main/infrastructure/gb10) for the consolidated GB10 deployment (systemd service + cmdline). |
| 35 | +> |
| 36 | +> --- |
| 37 | +
|
| 38 | +
|
| 39 | +
|
3 | 40 | > **Fork of [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1).** |
4 | 41 |
|
5 | 42 | [](https://github.com/ggml-org/llama.cpp) |
|
0 commit comments