blackwell-geforce-nvfp4-gemm

NVFP4 inference on Blackwell GeForce / RTX PRO 6000 — patches, methodology, and benchmarks.

RTX 5090, 5080, 5070 Ti, and RTX PRO 6000 GPUs (SM120) share a unique architecture that is not a subset of datacenter Blackwell (SM100). This repository documents the patches needed to run NVFP4 quantized models at full speed on these GPUs, and explains why each patch is necessary.

Try it now

# Tested model (MoE, 168 tok/s on single RTX PRO 6000):
# https://huggingface.co/sakamakismile/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

docker run -d --rm \
  --runtime nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -v /path/to/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4:/models/current:ro \
  -p 8090:8090 \
  lna-lab/gemma4-inference:latest \
  --model /models/current --port 8090 --max-model-len 4096

The SM120 Chimera

SM120 is a hybrid architecture that borrows from three generations but matches none of them:

Feature	Source	SM120 Implementation
MMA Instructions	SM80 (Ampere)	`mma.sync.aligned.m16n8k32` — warp-level, register-to-register
TMA (Tensor Memory Access)	SM90 (Hopper)	Async bulk GMEM→SMEM loads
FP4/FP8 Block Scaling	SM100 (B200)	`mxf8f6f4.block_scale` in MMA instruction
Tensor Memory (TMEM)	SM100 only	Does not exist
UMMA / tcgen05	SM100 only	Does not exist
Shared Memory	Unique	99 KB/SM (vs 228 KB on SM100)
Cluster Multicast	SM100	Not supported (1×1×1 only)

The critical insight: SM120 uses SM80-era mma.sync instructions, not SM100's tcgen05.mma. Any kernel written for SM100 (including DeepGEMM, upstream CUTLASS SM100 collectives, and WGMMA-based Flash Attention) will fail to compile or crash at runtime on SM120.

What this means in practice

sm_120a target rejects tcgen05.* instructions at the ptxas level
sm_100f cubins fail to load on SM120 hardware (driver rejects binary)
SM120 GEMM tiles must fit 99 KB shared memory (not SM100's 228 KB)
No tensor memory means no UMMA descriptors — all data goes SMEM → registers → MMA

Patches

This repository contains the patches needed to make vLLM + FlashInfer + CUTLASS run NVFP4 inference correctly on SM120. Each patch addresses a specific gap in upstream SM120 support.

Core GEMM Patches

#	Patch	What it fixes
1	FlashInfer SM120 Grouped GEMM	MoE grouped GEMM tile sizes restricted to 99 KB. Without this, M128/M256 tiles crash on SM120.
2	QUTLASS SM120 Matmul	SM120-optimized MXFP4 dense matmul using warp-level MMA + ldmatrix.
3	CutlassExpertsFp4 SM120	Adds `is_device_capability_family(120)` to FP4 MoE support check.
4	FlashInferExperts SM120	Adds `is_device_capability(120)` to FlashInfer CUTLASS MoE dispatch.

Attention Patches

#	Patch	What it fixes
5	Flash Attention SM120	SM120 subclass using SM80 MMA with 99 KB SMEM tile sizing.
6	FlashInfer TMA FMHA SM120	Pre-built FMHA kernels for SM120 (64×128, 64×64 tiles).
7	FlashInfer MLA XQA SM120	SM120 uses `xqa` backend for MLA (not `trtllm-gen`).

Infrastructure Patches

#	Patch	What it fixes
8	FlashInfer JIT NVCC Flags	`sm120a` and `sm120f` gencode flags for JIT compilation.
9	FlashInfer FP4 Quantization JIT	SM120 target for runtime activation quantization kernels.
10	PyTorch Inductor Fixes	7 bug fixes for piecewise CUDA graph + NVFP4.
11	quack SM120 GEMM	Custom warp-level GEMM (BF16/FP16, based on CUTLASS example 79).
12	Marlin W4A8-FP8 SM120	Enables Marlin FP8 kernels on SM120.

Benchmark Results

Single GPU, single request, max_tokens=256:

Model	Format	tok/s	GPU
Qwen3.6-35B-A3B MoE	NVFP4	175	RTX PRO 6000 ×1
Gemma4-26B-A4B MoE	NVFP4	160	RTX PRO 6000 ×1
Qwen3.5-27B Dense	NVFP4	57	RTX PRO 6000 ×1
Gemma4-31B Dense	NVFP4	51	RTX PRO 6000 ×1

Architecture Deep Dive

See docs/sm120-architecture.md for the full technical analysis, including:

ISA comparison: SM80 vs SM90 vs SM100 vs SM120
Shared memory capacity and tile size constraints
MMA instruction encoding and register layout
Why DeepGEMM (tcgen05-based) cannot work on SM120
CUTLASS SM120 collective builder internals

Quantization

The NVFP4 models linked above were quantized using llm-compressor with the same patched environment:

# recipe.yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
      scheme: NVFP4

Key: quantize in the same CUDA 13.0 + patched environment to ensure the FP4 weight layout matches the SM120 kernel expectations.

Requirements

GPU: NVIDIA RTX 5090, 5080, 5070 Ti, RTX PRO 6000, or any SM120/SM121 GPU
CUDA: 13.0+
Driver: 580+
Docker recommended (pre-built container: lna-lab/gemma4-inference:latest)

Related Work

CUTLASS examples/87_blackwell_geforce_gemm_blockwise — NVIDIA's official SM120 GEMM example
DeepGEMM — SM100-only FP8 GEMM kernels (incompatible with SM120)
vLLM — Upstream inference engine
FlashInfer — Attention and MoE kernels

License

Apache-2.0

Built at Lna-Lab — a seaside personal AI research lab.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
comment_drafts		comment_drafts
docs		docs
patches		patches
DEEPSEEK_V4_SM120_REPORT.md		DEEPSEEK_V4_SM120_REPORT.md
Dockerfile.kimi-k26		Dockerfile.kimi-k26
Dockerfile.kimi-k26-sglang		Dockerfile.kimi-k26-sglang
Dockerfile.kimi-k26-sglang-kt		Dockerfile.kimi-k26-sglang-kt
LICENSE		LICENSE
README.md		README.md
entrypoint-kimi-k26.sh		entrypoint-kimi-k26.sh
patch-nvfp4-oracle-sm120.py		patch-nvfp4-oracle-sm120.py
patch-sglang-kt-api-compat.py		patch-sglang-kt-api-compat.py
patch-sglang-kt-mask-fix.py		patch-sglang-kt-mask-fix.py
patch-sglang-kt-num-experts.py		patch-sglang-kt-num-experts.py
patch-sglang-nvfp4a16-moe.py		patch-sglang-nvfp4a16-moe.py
patch-sglang-offloader-nvfp4.py		patch-sglang-offloader-nvfp4.py
run-kimi-k26.sh		run-kimi-k26.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

blackwell-geforce-nvfp4-gemm

Try it now

The SM120 Chimera

What this means in practice

Patches

Core GEMM Patches

Attention Patches

Infrastructure Patches

Benchmark Results

Architecture Deep Dive

Quantization

Requirements

Related Work

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

blackwell-geforce-nvfp4-gemm

Try it now

The SM120 Chimera

What this means in practice

Patches

Core GEMM Patches

Attention Patches

Infrastructure Patches

Benchmark Results

Architecture Deep Dive

Quantization

Requirements

Related Work

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages