vllm-gb10

Reproducible vLLM Docker image for the NVIDIA DGX Spark (GB10 / sm_121a). Every input - CUDA base image, PyTorch stack, NCCL, FlashInfer, vLLM - is pinned by commit SHA, digest, exact version, or lockfile hash. The same versions.env always produces the same image.

Hardware: DGX Spark (GB10 SoC) only. The image targets linux/arm64 with TORCH_CUDA_ARCH_LIST=12.1a. It will not run on x86 or other GPU architectures.

Quick start

Pull the latest release and serve a model:

docker pull ghcr.io/timothystewart6/vllm-gb10:latest

docker run --rm -it \
  --gpus all \
  --ipc=host \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/timothystewart6/vllm-gb10:latest \
  vllm serve <model> --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.7
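
Once the container is up, it can take a while before the model is loaded and the server accepts requests. The sketch below polls the server until it answers. It assumes the OpenAI-compatible `/v1/models` route that `vllm serve` exposes and the port 8000 used in the command above; `wait_for_vllm` is a hypothetical helper name, not something shipped in the image.

```shell
# Hypothetical helper: poll the server until /v1/models responds or we give up.
# Arguments: base URL (default http://localhost:8000), number of attempts (default 60).
wait_for_vllm() (
  base_url="${1:-http://localhost:8000}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl return non-zero on HTTP errors as well as connection errors.
    if curl --silent --fail --max-time 2 "${base_url}/v1/models" >/dev/null 2>&1; then
      exit 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "server at ${base_url} did not respond after ${attempts} attempts" >&2
  exit 1
)
```

Run `wait_for_vllm http://localhost:8000 120` in a second terminal; it returns 0 as soon as the route responds. Large models may need a higher attempt count.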

For a pinned version, see the releases page, which lists the full component table and the immutable tag for each build.

What's in the image

Each release page lists the exact versions of every component. Key stack:

| Component | Pinned by |
| --- | --- |
| CUDA base image | digest (sha256:...) |
| vLLM | git commit SHA |
| PyTorch / TorchVision / TorchAudio / Triton | exact version |
| NCCL | git commit SHA (built from source) |
| FlashInfer | git commit SHA (built from source) |
| vllm-rs Rust frontend | built from source (axum HTTP server + PyO3 tool-parser module) |
| bitsandbytes, accelerate | exact version (4-bit/8-bit quantization and HuggingFace model loading) |
| Ray, uv, and other runtime deps | lockfile hash |

All pins live in versions.env. All lockfiles live in locks/.
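
To see which pins a checkout carries, the `_REF` and `_COMMIT` entries can be listed directly. This is a minimal sketch that assumes versions.env uses plain `KEY=value` lines, which the variable names suggest but this README does not confirm; `list_pins` is a hypothetical helper name.

```shell
# Hypothetical helper: print every *_REF and *_COMMIT line from an env-style file.
# Assumes one KEY=value pair per line; comments and other keys are skipped.
list_pins() (
  file="${1:-versions.env}"
  grep -E '^[A-Z0-9_]+_(REF|COMMIT)=' "$file"
)
```

Calling `list_pins` with no argument reads versions.env from the current directory.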

Known limitations

See the issues tab for tracked upstream compatibility gaps.

Image tags

Each build publishes four tags:

| Tag | Notes |
| --- | --- |
| v0.24.0-gb10.0 | Canonical, immutable. vLLM version + stack revision. |
| v0.24.0-cu13.2-torch2.11-gb10.0 | Same image - adds CUDA and PyTorch versions for quick scanning. |
| latest | Mutable - always points at the most recent green build of main. |
| sha-<short_sha> | Immutable, tied to the exact Git commit that produced it. |

gb10.<N> increments when any non-vLLM input changes (CUDA, PyTorch, NCCL, FlashInfer, etc.) on the same vLLM version. It resets to 0 when VLLM_REF bumps. There is intentionally no bare v0.24.0 tag - it would be mutable.
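
The numbering rule can be illustrated with a short sketch. The helpers below are hypothetical and are not the logic in scripts/bump.sh; they only restate the rule above: keep the vLLM version and add one to `<N>` when the version is unchanged, otherwise start again at 0.

```shell
# Hypothetical helper: split a tag such as v0.24.0-gb10.0 into
# "<vllm_version> <stack_revision>". Fails for tags without a -gb10.<N> suffix.
parse_gb10_tag() (
  tag="$1"
  case "$tag" in
    v*-gb10.*) ;;
    *) echo "not a gb10 tag: $tag" >&2; exit 1 ;;
  esac
  version="${tag%%-*}"        # everything before the first hyphen
  revision="${tag##*-gb10.}"  # everything after the last "-gb10."
  case "$revision" in
    ''|*[!0-9]*) echo "bad revision in tag: $tag" >&2; exit 1 ;;
  esac
  echo "$version $revision"
)

# Hypothetical helper: compute the next canonical tag.
# Arguments: current tag, vLLM version of the new build.
next_gb10_tag() (
  parsed="$(parse_gb10_tag "$1")" || exit 1
  version="${parsed% *}"
  revision="${parsed#* }"
  if [ "$2" = "$version" ]; then
    echo "${version}-gb10.$((revision + 1))"  # same vLLM, new stack revision
  else
    echo "$2-gb10.0"                          # vLLM bumped, revision resets
  fi
)
```

For example, `next_gb10_tag v0.24.0-gb10.0 v0.24.0` prints `v0.24.0-gb10.1`, while passing a different vLLM version as the second argument yields a tag ending in `-gb10.0`.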

Bumping versions

  1. Edit one or more _REF lines in versions.env on a branch
  2. Open a pull request - the run-bump.yaml workflow picks it up, runs scripts/bump.sh on the DGX Spark runner, and commits the resolved _COMMIT SHAs, updated GB10_BUILD, and regenerated lockfiles back to your branch
  3. Review the diff that CI committed, then merge
  4. A green build on main publishes updated image tags to GHCR and creates a GitHub Release automatically

You do not need to SSH into the Spark or run anything locally.
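
Step 1 can be scripted. The sketch below rewrites a single `_REF` line in place; it assumes versions.env holds one `KEY=value` pair per line, and `set_ref` is a hypothetical helper, not a script from this repository. The branch name and new ref in the usage comment are placeholders.

```shell
# Hypothetical helper: replace the value of one KEY=value line in an env-style file.
# Arguments: file, key, new value. Fails if the key is not present.
set_ref() (
  file="$1"; key="$2"; value="$3"
  grep -q "^${key}=" "$file" || { echo "no ${key} line in ${file}" >&2; exit 1; }
  tmp="$(mktemp)"
  # Rewrite only the line that starts with "KEY="; pass every other line through.
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; next }
    { print }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Usage (placeholders in angle brackets):
#   git checkout -b <branch>
#   set_ref versions.env VLLM_REF <new-ref>
#   git commit -am "Bump VLLM_REF" && git push -u origin <branch>
# Then open the pull request and let CI commit the resolved pins.
```

Only the `_REF` line is edited by hand; the `_COMMIT` values, GB10_BUILD, and lockfiles are left for the workflow to regenerate, as described in step 2.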

CI also triggers on changes to Dockerfile, locks/, scripts/, and checksums/.

Contributing

See CONTRIBUTING.md. Security issues: SECURITY.md.

License

MIT - see LICENSE.
