Commit 9131f2e

jarvis committed
docs: deprecate fork in favor of upstream ggml-org/llama.cpp
Upstream now has:
- Native MTP speculative decode (PRs #22673, #23269, #23461, #23563)
- NVFP4 + MTP scale tensors (#23563 merged 2026-05-23)
- Stable VRAM (no draft KV leak vs this fork)

Decode parity: upstream + MTP sustains 13-26 t/s on Qwopus3.6-27B-NVFP4, this fork + DFlash sustains 11-15 t/s and grows VRAM ~2-3 GB/hour.

Kept useful: TQ3_0 KV cache (not in upstream yet), DFlash external drafter integration (for non-MTP targets), Spiritbuun KV cache fixes (some merged upstream).
1 parent a203b4a commit 9131f2e

1 file changed

Lines changed: 37 additions & 0 deletions

README.md

@@ -1,5 +1,42 @@
 # llama.cpp-dgx

+> ## ⚠️ DEPRECATED — use upstream `ggml-org/llama.cpp` instead
+>
+> **Status:** As of 2026-05-25, upstream llama.cpp has surpassed this fork for our target workload (Qwopus3.6-27B + NVFP4 + speculative decode on GB10 / SM 12.1).
+>
+> ### Why upstream now wins
+>
+> 1. **Native MTP speculative decode** (`--spec-type draft-mtp`, PR [#22673](https://github.com/ggml-org/llama.cpp/pull/22673) + #23269 + #23461). Co-trained drafter delivers 45-85% acceptance vs 13-20% with our DFlash + post-hoc drafter.
+> 2. **NVFP4 + MTP scale tensors** ([#23563](https://github.com/ggml-org/llama.cpp/pull/23563), 2026-05-23). Closes the last gap that forced us to fork.
+> 3. **Stable VRAM footprint**. v5 + DFlash leaks ~2-3 GB/hour into the draft KV pool (positions are written every tree step but never compacted), reaching OOM in days under sustained traffic. Upstream pre-allocates and reuses cleanly: 30 GB GPU compute pool stays flat over hours.
+> 4. **Lower system memory**. Upstream: ~44-67 GB total system used (with cache-ram 16 GB lazy-filled and reclaimable). v5: 60-78 GB and growing.
+> 5. **Practical decode parity**. Stock + MTP (n_max=5) sustains 13-26 t/s on Qwopus3.6-27B-Abl-NVFP4; v5 + DFlash sustains 11-15 t/s and degrades on long-context multi-slot.
+>
+> ### Where this fork is still useful
+>
+> - **TurboQuant (TQ3_0 / TQ3_4S) KV cache and weights** — upstream does not expose `tq3_0` as a `--cache-type-v` value yet. If you need ~12% extra KV bandwidth savings on a memory-bound decode, this fork keeps the kernels (`ggml/src/ggml-cuda/tq3-prefill.cuh`, `fattn-vec-instance-tq3_*`).
+> - **DFlash external drafter integration** (`--dflash` + `--dflash-draft`) — relevant only if you have a target architecture without MTP layers and a separately trained DFlash drafter that matches it.
+> - **Spiritbuun KV cache fixes** and the GDN chunked kernel tuned for 99 KB SM 12.1 shared memory budget — both already merged or being merged upstream; this fork carries the original variants for archival reference.
+>
+> ### Migration
+>
+> ```bash
+> # Drop-in replacement
+> git clone https://github.com/ggml-org/llama.cpp ~/llama-cpp-stock
+> cd ~/llama-cpp-stock
+> cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120;121" -DGGML_CUDA_FA_ALL_QUANTS=ON
+> cmake --build build --target llama-server llama-quantize llama-cli -j$(nproc)
+>
+> # Replace --dflash --dflash-draft ... with:
+> --spec-type draft-mtp
+> ```
+>
+> See [croll83/jarvis/infrastructure/gb10/](https://github.com/croll83/jarvis/tree/main/infrastructure/gb10) for the consolidated GB10 deployment (systemd service + cmdline).
+>
+> ---
+
+
 > **Fork of [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1).**

 [![Upstream](https://img.shields.io/badge/upstream-ggml--org%2Fllama.cpp-blue)](https://github.com/ggml-org/llama.cpp)
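
The Migration step in the README says to replace `--dflash --dflash-draft ...` with `--spec-type draft-mtp`. For a launch script that already carries a long argument list, that swap could be sketched as a small filter. This is only an illustration: the function name `migrate_args` is hypothetical, and it assumes `--dflash` is a bare switch while `--dflash-draft` takes exactly one value, which the README does not state explicitly.

```bash
#!/usr/bin/env bash
# Sketch: rewrite a fork-era argument list for upstream llama-server.
# ASSUMPTIONS (not confirmed by the README):
#   --dflash        is a bare switch with no value
#   --dflash-draft  takes exactly one value (the drafter path)
migrate_args() {
  local out=() skip_next=0 arg
  for arg in "$@"; do
    if (( skip_next )); then
      skip_next=0                     # this arg was the value of --dflash-draft
      continue
    fi
    case "$arg" in
      --dflash) ;;                    # drop the bare switch
      --dflash-draft) skip_next=1 ;;  # drop the flag and remember to drop its value
      --dflash-draft=*) ;;            # drop the --flag=value spelling
      *) out+=("$arg") ;;             # keep everything else untouched, in order
    esac
  done
  out+=(--spec-type draft-mtp)        # the replacement named in the README
  printf '%s\n' "${out[@]}"           # one argument per line
}
```

Calling `migrate_args -m model.gguf --dflash --dflash-draft drafter.gguf --port 8080` prints the surviving arguments one per line, followed by `--spec-type` and `draft-mtp`. Printing one argument per line keeps values that contain spaces intact when the output is read back with `mapfile -t`.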

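
The README quotes a leak of roughly 2-3 GB/hour for the fork and a flat footprint for upstream. One way to check a figure like that on your own deployment is to log memory samples over time and compute the average growth rate. The sketch below assumes a hypothetical two-column log of `<unix_seconds> <used_MiB>` lines; how the samples are collected is left open, and the function name `growth_gb_per_hour` is illustrative.

```bash
#!/usr/bin/env bash
# Sketch: average memory growth in GB/hour between the first and last sample.
# ASSUMPTION: input lines look like "<unix_seconds> <used_MiB>" (hypothetical format).
growth_gb_per_hour() {
  awk '
    NR == 1 { t0 = $1; m0 = $2 }      # remember the first sample
            { t1 = $1; m1 = $2 }      # keep overwriting so the last sample wins
    END {
      if (NR < 2 || t1 == t0) {
        print "need at least two samples at different times" > "/dev/stderr"
        exit 1
      }
      # MiB -> GiB, seconds -> hours
      printf "%.2f\n", ((m1 - m0) / 1024) / ((t1 - t0) / 3600)
    }
  ' "$@"
}
```

It reads from the files given as arguments, or from standard input when none are given, for example `growth_gb_per_hour vram.log`. A result near zero over several hours would match the flat behaviour described for upstream, while a steady positive value points to a leak like the one described for the fork.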