sparkinfer-runtime

Edge AI inference runtime for NVIDIA RTX Spark and RTX 5090-class GPUs.

Part of gittensor-ai-lab — SN74 decentralized kernel and runtime optimization network.

Primary Target: NVIDIA RTX Spark

RTX Spark is NVIDIA's Blackwell-based AI superchip: 20-core ARM CPU + Blackwell GPU + 128 GB unified LPDDR5X memory in a single package. It runs 120B-parameter models locally and delivers ~1 PFLOP of AI compute — making it the first consumer-grade platform where datacenter-scale MoE inference is native.

Why RTX Spark changes the runtime problem:

Before (cloud/datacenter era)	RTX Spark era
Optimize for TP=8, batch=1024	Optimize for single-node, batch=1–64
Memory bandwidth = HBM (3+ TB/s)	Memory bandwidth = LPDDR5X (~273 GB/s)
Expert eviction = rare	Expert scheduling = critical
KV cache = secondary concern	KV cache = primary memory pressure
vLLM / SGLang sufficient	New runtime layer required

Hardware Targets

Device	Memory	Bandwidth	Arch	Status
RTX Spark (GB10)	128 GB unified	~273 GB/s	sm_121	Primary
RTX PRO 6000 (GB202)	96 GB GDDR7 ECC	~1.79 TB/s	sm_120	Supported
RTX 5090 (GB202)	32 GB GDDR7	1.79 TB/s	sm_120	Supported
Jetson Thor	unified	—	sm_121	Planned

Consumer Blackwell is sm_120 (RTX 5090) / sm_121 (RTX Spark, Jetson Thor) — not sm_100, which is datacenter Blackwell (B200/GB200) and binary-incompatible. Builds target 89;90;100;120;121; requires CUDA Toolkit 12.8+.

What This Runtime Does

Standard serving stacks (vLLM, SGLang) are designed for distributed datacenter inference — tensor parallelism across multiple GPUs, large batch throughput, HBM-class bandwidth. On RTX Spark, those assumptions break:

No TP — everything runs on one chip
Unified memory means CPU and GPU share the same pool — expert weights don't need PCIe transfer
LPDDR5X bandwidth (~273 GB/s) is 6–7× lower than GDDR7 — memory layout and scheduling matter more than raw kernel FLOPS
Interactive latency targets (< 50 ms TTFT) require a latency-first scheduler, not a throughput-first one

This runtime is built specifically for that environment.

Components

include/sparkinfer/
├── runtime.h         — Runtime lifecycle, device query (SMs, bandwidth, cc)
├── scheduler.h       — Continuous batching, priority preemption
├── kv_cache.h        — Paged KV block allocator, per-layer sub-pools
├── kv_ops.h          — KV append + residual add
├── decode.h          — DecodeRunner: generic MoE decode-layer wiring
└── models/qwen35.h   — Qwen3.5-35B-A3B model: config, weights, generate()

src/
├── runtime.cpp       — device setup + capability query
├── scheduler.cpp     — priority continuous-batching scheduler
├── kv_cache.cpp      — paged block allocator + device block tables
├── decode_layer.cpp  — DecodeRunner (norm→QKV→attn→O→residual→MoE→residual)
├── models/qwen35.cpp — full Qwen3.5 forward + greedy generate loop
└── csrc/cuda/kv_ops.cu — paged KV append + residual add kernels

tools/convert_qwen35.py  — HF safetensors → sparkinfer weight format
examples/qwen35_generate.cpp — greedy generation demo (token ids in/out)

Target Models

Model	Total / Active	Q4_K_M Size	Notes
Qwen3.5-35B-A3B	35B / 3B	~20 GB	256 experts, 2 KV heads
Gemma 4 26B-A4B	26B / 4B	~14.6 GB	Interleaved local/global attn, head_dim=512

Both fit in RTX Spark's 128 GB unified memory with significant headroom for KV cache.

Related Repos

Repo	Purpose
sparkinfer-kernels	Native CUDA + CuTe DSL kernels: flash decode, RoPE, MoE router/FFN, GEMM, RMSNorm
sparkinfer-moe	Sync-free MoE engine: router GEMM → top-k → SwiGLU expert FFN
sparkinfer-bench	Reproducible benchmarks for RTX Spark, RTX 5090, Jetson Thor
sparkinfer-agent	Kernel design agents: NCU report parsing, auto-tuning loops

Build

The runtime is the integrator — it pulls in the sibling ../kernels and ../moe checkouts. Needs CUDA Toolkit 12.8+.

bash scripts/build.sh            # configures sm_120, builds, runs ctest
# or manually:
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="120"   # 121 for RTX Spark
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure

Running Qwen3.5-35B-A3B

The runtime runs Qwen3.5 end-to-end on token IDs (embed → 40 layers → LM head → greedy sample). The full layer is implemented: RMSNorm, Q/K/V projection, per-head QK-norm, RoPE, paged GQA flash-decode (8:1, head_dim=128), O-projection, residuals, 256-expert top-8 routed MoE + 1 shared expert, all sync-free (token counts stay on-device, CUDA-graph capturable).

1. Convert weights (one-time, CPU, needs the HF checkpoint):

pip install safetensors numpy
python tools/convert_qwen35.py /path/to/hf/Qwen3.5-35B-A3B ./qwen35_weights

2. Generate (on an RTX 5090 / RTX Spark). Tokenize with the HF tokenizer to get input IDs, then:

./build/qwen35_generate ./qwen35_weights 64  <id0> <id1> ...   # 64 new tokens

It prints the generated token IDs; decode them with the HF tokenizer.

// or from C++:
sparkinfer::Qwen35Config cfg;                       // 35B-A3B defaults
sparkinfer::KVCacheManager kv(kvc, 512ull<<20);
auto engine = sparkinfer::moe::MoEEngine::create(mc);
sparkinfer::Qwen35Model model(cfg, &kv, engine.get());
model.load_weights("./qwen35_weights");
auto out = model.generate(prompt_ids, 64);          // greedy

Verification

Targets the 5090/Spark — every .cu compiles through NVRTC → ptxas -arch=sm_120/sm_121 to a real cubin.
Correct math — tests/qwen35_cpu_test runs the entire Qwen3.5 forward (embed, QK-norm, RoPE, multi-layer GQA, routed + shared MoE, LM head) autoregressively and matches a double-precision reference to ~1e-8; tests/decode_layer_cpu_test checks the layer composition; the kernel math is covered in sparkinfer-kernels' cpu_reference_test.
On hardware — tests/decode_runner_gpu_test and examples/qwen35_generate run the real device path (both self-skip cleanly when no GPU is present).

SN74 / Gittensor

This repository is part of SN74 on Gittensor — a decentralized network where human engineers and AI agents co-design and continuously improve kernels, memory systems, and MoE routing for edge AI hardware. All optimizations are reproducible: source-required, benchmark-verified, no hidden training or closed binaries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sparkinfer-runtime

Primary Target: NVIDIA RTX Spark

Hardware Targets

What This Runtime Does

Components

Target Models

Related Repos

Build

Running Qwen3.5-35B-A3B

Verification

SN74 / Gittensor

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
csrc/cuda		csrc/cuda
examples		examples
include/sparkinfer		include/sparkinfer
python/sparkinfer_runtime		python/sparkinfer_runtime
scripts		scripts
src		src
tests		tests
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

sparkinfer-runtime

Primary Target: NVIDIA RTX Spark

Hardware Targets

What This Runtime Does

Components

Target Models

Related Repos

Build

Running Qwen3.5-35B-A3B

Verification

SN74 / Gittensor

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages