Releases: vipshop/cache-dit
v1.5.0 Major Release
🚀 Cache-DiT v1.5.0 Release Notes
Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases
📋 Overview
Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.
✨ Core Highlights
1. 💎 SVDQuant W4A4 Quantization
Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.
📊 PTQ (Post-Training Quantization)
Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:
- INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
- NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
Performance (FLUX.2-klein-4B, 1024×1024, L20):
| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |
|---|---|---|---|
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |
- Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
- End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
- PSNR > 29 dB, near-lossless visual quality
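The ratios quoted above follow directly from the table. As a quick sanity check, the arithmetic can be reproduced with the short sketch below; the dictionary and function names are illustrative only and are not part of Cache-DiT.

```python
# Figures copied from the FLUX.2-klein-4B table above (L20, 1024x1024).
bf16 = {"latency_s": 2.13, "weight_gib": 7.22}
svdq_int4 = {"latency_s": 1.24, "weight_gib": 2.28}
svdq_int4_compile = {"latency_s": 1.02, "weight_gib": 2.28}


def ratio(baseline: float, value: float) -> float:
    """Return how many times smaller (or faster) `value` is than `baseline`."""
    return baseline / value


weight_compression = ratio(bf16["weight_gib"], svdq_int4["weight_gib"])
speedup = ratio(bf16["latency_s"], svdq_int4["latency_s"])
speedup_compile = ratio(bf16["latency_s"], svdq_int4_compile["latency_s"])

print(f"weight compression: {weight_compression:.1f}x")
print(f"speedup: {speedup:.1f}x, with compile: {speedup_compile:.1f}x")
```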
NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):
| Stage | Latency (s) | Speedup | Memory (GiB) |
|---|---|---|---|
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |
⚡ DQ (Dynamic Quantization)
Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):
- identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
- weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
- few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto / stable_auto / power / log / rank / top / fixed). Supports few_shot_auto_compile for deferred compilation after quantization.
DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
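To make the identity strategy more concrete, here is a minimal NumPy sketch of the general idea: split a weight matrix into a low-rank branch (via SVD) plus a 4-bit residual. It is an illustration under stated assumptions, not Cache-DiT's implementation. It fake-quantizes weights only (real W4A4 also quantizes activations and uses packed kernels), and the helper names and the synthetic weight matrix are hypothetical.

```python
import numpy as np


def quantize_int4_per_row(x: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit fake-quantization with one scale per row."""
    scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values, kept in float for comparison


def svd_lowrank_plus_int4(weight: np.ndarray, rank: int):
    """Split `weight` into a rank-`rank` branch plus a 4-bit residual."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    right = vt[:rank, :]           # shape (rank, in_features)
    residual = weight - left @ right
    return left, right, quantize_int4_per_row(residual)


# Synthetic weight: a strong rank-8 component plus small noise (an assumption
# chosen so that the low-rank branch has something to absorb).
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
weight += 0.01 * rng.normal(size=(64, 64))

left, right, residual_q = svd_lowrank_plus_int4(weight, rank=8)
err_plain = np.abs(quantize_int4_per_row(weight) - weight).mean()
err_svdq = np.abs(left @ right + residual_q - weight).mean()
print(f"plain int4 error: {err_plain:.4f}, low-rank + int4 error: {err_svdq:.6f}")
```

Because the low-rank branch absorbs the large-magnitude structure, the residual has a much smaller dynamic range and loses less when rounded to 4 bits.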
🔧 SVDQ Converter CLI
New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:
cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
--save-dir ./FLUX.2-klein-4B-svdq \
--quant-type svdq-int4-r128-dq
Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.
🔀 Fused MLP
New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.
🔗 Parallelism Compatibility (Cache-DiT Exclusive)
SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.
⚙️ Quantization Configuration Enhancements
- Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
- Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
- FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
- TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
- Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
📦 cache-dit-cu13 Pre-built Wheel
Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.
2. 💾 Bucket-style Layerwise CPU Offload
Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.
Core Design:
- Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
- Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
- Persistent Bins: Distribute the persistent budget evenly across the target sequence.
- Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
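The bucket pipeline described above can be illustrated with a toy latency model. The sketch below is pure Python and does not touch CUDA streams or Cache-DiT; the function names and the per-bucket timings are hypothetical. It only shows why prefetching the next bucket during compute hides most of the transfer cost.

```python
from typing import List, Sequence


def make_buckets(layers: Sequence[str], bucket_size: int) -> List[List[str]]:
    """Split layers into contiguous buckets of at most `bucket_size` layers."""
    return [list(layers[i:i + bucket_size]) for i in range(0, len(layers), bucket_size)]


def serial_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """Every bucket waits for its own transfer before it runs."""
    return sum(c + t for c, t in zip(compute, transfer))


def overlapped_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """Prefetch bucket i+1 while bucket i executes (prefetch depth of one)."""
    total = transfer[0]  # the first transfer cannot be hidden behind compute
    for i, c in enumerate(compute):
        next_transfer = transfer[i + 1] if i + 1 < len(transfer) else 0.0
        total += max(c, next_transfer)  # advance when both have finished
    return total


layers = [f"block_{i}" for i in range(8)]
buckets = make_buckets(layers, bucket_size=4)
compute = [1.0] * len(buckets)   # hypothetical per-bucket compute time (s)
transfer = [0.8] * len(buckets)  # hypothetical per-bucket H2D time (s)

print(buckets)
print("serial:", serial_latency(compute, transfer))
print("overlapped:", overlapped_latency(compute, transfer))
```

In this model only the first transfer is exposed as long as each transfer is shorter than the compute it overlaps with. That is the regime the persistent and prefetch controls are meant to help reach.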
Performance (FLUX.1-dev, L20):
| Config | Memory | Latency |
|---|---|---|
| No offload | ~38 GiB | 23.4s |
| Diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |
Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.
torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.
CLI Quick Start:
python3 -m cache_dit.generate flux \
--layerwise-offload --layerwise-async-transfer \
--layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
--layerwise-persistent-bins 4 --layerwise-prefetch-limit \
--layerwise-max-inflight-prefetch-bytes 8gib --compile
3. 🌩️ Ray Wrapper (Transparent Distributed Inference)
The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.
Two Wrapper Levels:
| Level | Description | Best For |
|---|---|---|
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution | Full feature support (cache, quant, parallelism), simplest, fastest. |
| Transformer Wrapper | Only the transformer runs on Ray workers | Lightweight, but slightly slower |
Key Features:
- ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
- ray_use_compile: Automatic per-worker compilation.
- ray_runtime_env: Custom module import handling via PYTHONPATH.
- Supports all parallelism strategies: TP, Ulysses, Ring.
- LoRA support: fuse before enabling (TP requires fused LoRAs).
Performance (FLUX.2-klein-base-9B):
| Config | Latency |
|---|---|
| Baseline (single GPU) | 47.41s |
| Ray TP=2 + compile | 24.57s |
Minimal Example:
cache_dit.enable_cache(
pipe,
parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged
4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)
DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.
Mathematical Principle: The cached feature stream is modeled as a linear dynamical system, $x_{k+1} \approx A x_k$, whose modes evolve exponentially from step to step.
TaylorSeer vs DMD:
| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
|---|---|---|
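| Basis | Polynomial terms | Exponential modes |
| Extrapolation | Diverges as the extrapolation horizon grows | Bounded when the mode magnitudes satisfy $\lvert\lambda\rvert \le 1$ |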
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |
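For readers unfamiliar with DMD, the following NumPy sketch shows the textbook projected-DMD recipe for extrapolating one step ahead from a short history of snapshots. It is a generic illustration and not Cache-DiT's calibrator. The function name is hypothetical, and its `rank` and `ridge` arguments only loosely mirror dmd_rank and dmd_ridge, whose exact semantics are not assumed here.

```python
import numpy as np


def dmd_extrapolate(snapshots: np.ndarray, steps: int = 1, rank: int = 2,
                    ridge: float = 1e-8) -> np.ndarray:
    """Predict the feature `steps` ahead of the last snapshot.

    snapshots: array of shape (history, dim), uniformly spaced in time.
    """
    x = snapshots[:-1].T  # (dim, history - 1): states at step k
    y = snapshots[1:].T   # (dim, history - 1): states at step k + 1
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    u, s, vt = u[:, :rank], s[:rank], vt[:rank]  # SVD truncation
    inv_s = s / (s ** 2 + ridge)                 # ridge-regularized 1 / s
    a_tilde = u.T @ y @ vt.T @ np.diag(inv_s)    # reduced (rank, rank) operator
    z = u.T @ snapshots[-1]                      # project the last snapshot
    for _ in range(steps):
        z = a_tilde @ z                          # advance in reduced coordinates
    return u @ z                                 # lift back to feature space


# Synthetic feature stream: a decaying rotation embedded in 16 dimensions.
rng = np.random.default_rng(0)
theta = 0.3
rotation = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                           [np.sin(theta), np.cos(theta)]])
basis = rng.normal(size=(16, 2))
z = np.array([1.0, 0.5])
states = []
for _ in range(7):
    states.append(basis @ z)
    z = rotation @ z
states = np.array(states)

prediction = dmd_extrapolate(states[:6], steps=1)
max_error = np.abs(prediction - states[6]).max()
print(f"max abs error of the one-step prediction: {max_error:.2e}")
```

Because the synthetic stream really is linear with eigenvalue magnitude 0.9, the prediction matches the held-out snapshot up to the small ridge perturbation. Real diffusion features are only approximately linear, so the fit will be looser in practice.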
Usage:
cache_dit.enable_cache(
pipe,
cache_config=DBCacheConfig(...),
calibrator_config=DMDCalibratorConfig(
dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
),
)
CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6
🔧 Other Enhancements
🧩 FLUX.2-klein-kv Series
📈 CUDA Graph
v1.3.12
What's Changed
- chore: update installation guide by @DefTruth in #1035
- chore: Update README.md by @DefTruth in #1036
- Update README.md by @DefTruth in #1037
- chore: update installation guide by @DefTruth in #1038
- feat: config yaml support svdq dq/few-shot by @DefTruth in #1040
- [2/N]: config yaml support svdq dq/few-shot by @DefTruth in #1042
- chore: update why not svqd nvfp4 for sm100 by @DefTruth in #1043
- chore: hightlight bucket-style layerwise offload by @DefTruth in #1044
- chore: Update README.md by @DefTruth in #1045
- svdq: add fused gelu mlp/proj pass by @DefTruth in #1047
- docs: add fused gelu mlp/proj docs by @DefTruth in #1048
Full Changelog: v1.3.11...v1.3.12
v1.3.11
v1.3.10
What's Changed
- feat: add MindIE-SD as optional NPU attention and compilation backend by @blian6 in #1004
- chore: simplify attn backend auto select by @DefTruth in #1024
- Fix Python version mismatch in setup.py by @FNGarvin in #1025
- offload: extract copy stream pool and split init by @DefTruth in #1026
- feat: support svdquant nvfp4 ptq/dq by @DefTruth in #1029
- chore: Update README.md by @DefTruth in #1030
- whl: cache-dit-cu13 pkg w/ svdq kernels by @DefTruth in #1031
- whl: fix build_releases.sh tool by @DefTruth in #1032
- whl: fix build_releases.sh tool by @DefTruth in #1033
New Contributors
Full Changelog: v1.3.9...v1.3.10
v1.3.9
What's Changed
- ray: refactor ray wrapper impl by @DefTruth in #1016
- parallel: deprecated native diffusers backend by @DefTruth in #1017
- ray: allow pass init_fn to ray wrapper by @DefTruth in #1019
- docs: add torch.compile section to offload docs by @DefTruth in #1020
- API: remove dup ray api call by @DefTruth in #1021
- ray: simplify ray wrapper dispatch by @DefTruth in #1022
- ray: soft check for ray path by @DefTruth in #1023
Full Changelog: v1.3.8...v1.3.9
v1.3.8
What's Changed
- CLI: allow 8-steps lora for qwen-image edit lightning by @DefTruth in #1011
- skills: add triton-kernel skill by @DefTruth in #1013
- feat: make layerwise offload compatible w/ compile by @DefTruth in #1014
- ray: fix custom components serialize by @DefTruth in #1015
Full Changelog: v1.3.7...v1.3.8
v1.3.7
What's Changed
- docs: update ray wrapper docs by @DefTruth in #1005
- docs: update ray wrapper docs by @DefTruth in #1006
- ray: pass runtime env to workers by @DefTruth in #1007
- chore: update ray wrapper docs by @DefTruth in #1008
- ray: disable dashboard by default by @DefTruth in #1009
Full Changelog: v1.3.6...v1.3.7
v1.3.6
What's Changed
- chore: update cache-dit arch by @DefTruth in #932
- bc: deprecated serving module by @DefTruth in #933
- chore: suppress torch compile tuning logs by @DefTruth in #934
- compile: enabled descent_tuning by default by @DefTruth in #935
- docs: update quantization docs by @DefTruth in #937
- quant: add quantize backend enum by @DefTruth in #938
- kernel: refactor ops register by @DefTruth in #939
- chore: fix vllm-omni docs links by @DefTruth in #940
- examples: add cuda graph option by @DefTruth in #942
- chore: fix utils log info by @DefTruth in #943
- chore: add cuda graph usage docs by @DefTruth in #944
- chore: add cuda graph usage to overview by @DefTruth in #945
- CLI: add compile full-graph option by @DefTruth in #946
- chore: fix fullgraph param typo by @DefTruth in #947
- docs: add more cuda graph perf results by @DefTruth in #948
- docs: add more cuda graph perf results by @DefTruth in #949
- docs: update cuda graphs docs by @DefTruth in #950
- chore: allow cuda graph for dynamic compile by @DefTruth in #951
- feat: support cuda graph + fp8 rowwise by @DefTruth in #952
- chore: hotfix for mkdocs broken by @DefTruth in #953
- [1/N] feat: support svdquant w4a4 - kernels & skills by @DefTruth in #954
- pytest: fast_svd mode for testing by @DefTruth in #955
- [2/N] feat: streaming quantize for svdquant by @DefTruth in #956
- [3/N] feat: PTQ workflow for svdquant by @DefTruth in #957
- SKILL: add ptq-workflow-integration skill by @DefTruth in #958
- pytest: separate kernels and quantization tests by @DefTruth in #959
- chore: add docs strings to codebase by @DefTruth in #960
- chore: add svdq e2e example and format code by @DefTruth in #961
- SKILL: add Cute-DSL/CUDA/CUTLASS skills by @DefTruth in #962
- chore: update docs by @DefTruth in #963
- kernel: tune svdq w4a4 gemm stage/blk size for Ada by @DefTruth in #966
- kernel: unified ops register policy by @DefTruth in #967
- bench: refactor cache-dit bench by @DefTruth in #968
- svdquant: fast svd decompose, ~18x speedup by @DefTruth in #969
- [2/N] tune svdq w4a4 gemm for ada by @DefTruth in #970
- bc: refactor distributed codebase by @DefTruth in #971
- kernel: add cute-dsl based merge-attn-states kernel by @DefTruth in #973
- feat: extend SVDQ PTQ -> SVDQ DQ by @DefTruth in #974
- fix: support 3D input/output for W4A4 linear by @DefTruth in #975
- chore: support svdq-calib option in examples by @DefTruth in #976
- kernel: add cute-dsl based fp8 comm kernels by @DefTruth in #977
- [1/N] feat: support cute-dsl based svdquant w4a4 by @DefTruth in #978
- feat: support svdq-dq few shot by @DefTruth in #979
- chore: update svdq-dq few shot docs by @DefTruth in #980
- feat: support layerwise cpu offload by @DefTruth in #981
- [2/N] feat: support layerwise offload by @DefTruth in #982
- [3/N] feat: support layerwise offload by @DefTruth in #983
- [4/N] feat: support layerwise offload by @DefTruth in #984
- chore: unified all2all/ring comm api by @DefTruth in #985
- chore: refactor async ulysses codebase by @DefTruth in #986
- remove cutedsl based svdq kernels by @DefTruth in #987
- fix tensor parallel register import error by @DefTruth in #988
- feat: support sub cp_plan for context parallel by @DefTruth in #989
- chore: fix attention dispatch comments by @DefTruth in #990
- community: add tensorrt-llm x cache-dit link by @DefTruth in #991
- deps: use uv to install deps by @DefTruth in #992
- chore: update docs by @DefTruth in #993
- chore: add layerwise offload to overview by @DefTruth in #994
- chore: update layerwise offload cli quick start by @DefTruth in #995
- attention: fix sage-attn backend dispatch by @DefTruth in #996
- chore: add exclude-layers param to ptq example by @DefTruth in #997
- attention: separate attn backends by @DefTruth in #998
- svdq: support converter cli for dq workflow by @DefTruth in #999
- chore: fix typo by @DefTruth in #1000
- chore: revise quantization example in README by @DefTruth in #1001
- chore: update README by @DefTruth in #1002
- feat: support ray wrapper by @DefTruth in #1003
Full Changelog: v1.3.5...v1.3.6
v1.3.5
Quantization
Low-bits Quantization
Overview
Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.
| quantization type | description | devices |
|---|---|---|
| float8_per_row | quantize weights and activations to float8 (dynamic quantization) with rowwise method. (recommended) | >=sm89, Ada, Hopper or newer |
| float8_per_tensor | quantize weights and activations to float8 (dynamic quantization) with tensorwise method. | >=sm89, Ada, Hopper or newer |
| float8_per_block | quantize weights and activations to float8 block-wise (dynamic quantization), which can provide better precision; activation block size: (1, 128), weight block size: (128, 128) | >=sm89, Ada, Hopper or newer |
| float8_weight_only | quantize only weights to float8, keep activations in full precision | >=sm89, Ada, Hopper or newer |
| int8_per_row | quantize weights and activations to int8 (dynamic quantization) with rowwise method. | >=sm80, Ampere or newer |
| int8_per_tensor | quantize weights and activations to int8 (dynamic quantization) with tensorwise method. | >=sm80, Ampere or newer |
| int8_weight_only | quantize only weights to int8, keep activations in full precision | >=sm80, Ampere or newer |
| int4_weight_only | quantize only weights to int4, keep activations in full precision | >=sm90, Hopper or newer, TMA required |
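The device column of the table can be turned into a small lookup. The sketch below is a hypothetical helper, not a Cache-DiT API. It encodes only the minimum compute capabilities listed above and ignores the additional TMA requirement noted for int4_weight_only.

```python
from typing import Dict, List, Tuple

# Minimum compute capability (major, minor) per quantization type,
# transcribed from the table above.
MIN_CAPABILITY: Dict[str, Tuple[int, int]] = {
    "float8_per_row": (8, 9),
    "float8_per_tensor": (8, 9),
    "float8_per_block": (8, 9),
    "float8_weight_only": (8, 9),
    "int8_per_row": (8, 0),
    "int8_per_tensor": (8, 0),
    "int8_weight_only": (8, 0),
    "int4_weight_only": (9, 0),
}


def supported_quant_types(capability: Tuple[int, int]) -> List[str]:
    """Return the quant types whose minimum capability is met."""
    return [name for name, minimum in MIN_CAPABILITY.items() if capability >= minimum]


print(supported_quant_types((8, 0)))  # an Ampere-class device
print(supported_quant_types((8, 9)))  # an Ada-class device
```

On a real machine, the capability tuple can be obtained from `torch.cuda.get_device_capability()`.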
FP8 Quantization
Currently, TorchAO has been fully integrated into Cache-DiT as the backend for online quantization. You can quantize a model by calling the quantize API directly or by passing a QuantizeConfig to the enable_cache API (recommended).
For GPUs with low memory capacity, we recommend float8_per_row or float8_per_block, as these methods cause almost no loss in precision. Supported quantization types include:
- float8_per_row: quantize both weights and activations to float8 (dynamic quantization) with rowwise method.
- float8_per_tensor: quantize both weights and activations to float8 (dynamic quantization) with tensorwise method.
- float8_per_block: quantize weights and activations to float8 block-wise (dynamic quantization), which can provide better precision; activation block size: (1, 128), weight block size: (128, 128). NOT supported for distributed inference for now.
- float8_weight_only: quantize only weights to float8, keep activations in full precision.
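The difference between the per-tensor, per-row, and per-block variants listed above comes down to how many scale factors are stored. The NumPy sketch below illustrates only the scale shapes. It assumes the float8 e4m3 format (maximum finite value 448) and one common convention where the scale is amax divided by that maximum. The function names are hypothetical and this is not the TorchAO implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in float8 e4m3 (assumed format)


def per_tensor_scale(x: np.ndarray) -> np.ndarray:
    """One scale for the whole tensor."""
    return np.abs(x).max(keepdims=True) / FP8_E4M3_MAX


def per_row_scale(x: np.ndarray) -> np.ndarray:
    """One scale per row."""
    return np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX


def per_block_scale(x: np.ndarray, block: tuple) -> np.ndarray:
    """One scale per (block_rows, block_cols) tile; shape must divide evenly."""
    rows, cols = x.shape
    block_rows, block_cols = block
    tiles = np.abs(x).reshape(rows // block_rows, block_rows,
                              cols // block_cols, block_cols)
    return tiles.max(axis=(1, 3)) / FP8_E4M3_MAX


rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 256))
weight = rng.normal(size=(256, 256))

print(per_tensor_scale(activations).shape)            # one scale in total
print(per_row_scale(activations).shape)               # one scale per row
print(per_block_scale(activations, (1, 128)).shape)   # activation tiles
print(per_block_scale(weight, (128, 128)).shape)      # weight tiles
```

Finer granularity stores more scales but lets each scale track the local dynamic range more closely, which is why the block-wise variant can provide better precision.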
Here are some examples of how to use quantization with cache-dit. You can directly specify the quantization config in the enable_cache API.
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
# quant_type: float8_per_row, float8_per_tensor, float8_per_block, float8_weight_only,
# int8_per_row, int8_per_tensor, int8_weight_only, int4_weight_only, etc.
# Pass a QuantizeConfig to the `enable_cache` API.
cache_dit.enable_cache(
pipe, cache_config=DBCacheConfig(), # w/ default
parallelism_config=ParallelismConfig(ulysses_size=2),
quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)
Users can also specify different quantization configs for different components. For example, quantize the transformer to float8_per_row and the text encoder to float8_weight_only.
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
cache_dit.enable_cache(
pipe, cache_config=DBCacheConfig(), # w/ default
parallelism_config=ParallelismConfig(ulysses_size=2),
quantize_config=QuantizeConfig(
components_to_quantize={
"transformer": {
"quant_type": "float8_per_row",
"exclude_layers": ["embedder", "embed"],
},
"text_encoder": {
"quant_type": "float8_weight_only",
"exclude_layers": ["lm_head"],
}
}
),
)
Or, directly call the quantize API for more fine-grained control.
import cache_dit
from cache_dit import QuantizeConfig
cache_dit.quantize(
pipe.transformer,
quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)
cache_dit.quantize(
pipe.text_encoder,
quantize_config=QuantizeConfig(quant_type="float8_weight_only"),
)
Please also enable torch.compile for better performance with quantization.
import torch
import cache_dit
cache_dit.set_compile_configs()
pipe.transformer = torch.compile(pipe.transformer)
pipe.text_encoder = torch.compile(pipe.text_encoder)
Users can set exclude_layers in QuantizeConfig to exclude some sensitive layers that are not robust to quantization, e.g., embedding layers. Layers that contain any of the keywords in the exclude_layers list will be excluded from quantization. For example:
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
cache_dit.enable_cache(
pipe, cache_config=DBCacheConfig(), # w/ default
parallelism_config=ParallelismConfig(ulysses_size=2),
quantize_config=QuantizeConfig(
quant_type="float8_per_row",
exclude_layers=["embedder", "embed"],
),
)
By default, quant_type="float8_per_row" for better precision. Users can set it to "float8_per_tensor" to use per-tensor quantization for better performance on some hardware.
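The keyword rule for exclude_layers described above (a layer is skipped if its name contains any listed keyword) can be sketched in a few lines. The helper and the layer names below are hypothetical illustrations and do not reflect Cache-DiT internals.

```python
from typing import Iterable, List


def layers_to_quantize(layer_names: Iterable[str],
                       exclude_layers: Iterable[str]) -> List[str]:
    """Keep only layers whose name contains none of the exclude keywords."""
    keywords = list(exclude_layers)
    return [name for name in layer_names
            if not any(keyword in name for keyword in keywords)]


# Illustrative layer names only.
names = [
    "x_embedder",
    "time_text_embed.linear_1",
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.ff.net.0.proj",
]
print(layers_to_quantize(names, ["embedder", "embed"]))
```

Because matching is by substring, a short keyword such as "embed" already covers "embedder"; listing both is harmless but redundant.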
Regional Quantization
Cache-DiT also supports regional quantization, which allows users to quantize only the repeated blocks in a transformer. This can help balance precision and efficiency. Users can specify the blocks to be quantized via the regional_quantize and repeated_blocks arguments in QuantizeConfig. For example, to quantize the repeated blocks of the FLUX.2 transformer:
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig
cache_dit.enable_cache(
pipe, cache_config=DBCacheConfig(), # w/ default
parallelism_config=ParallelismConfig(ulysses_size=2),
quantize_config=QuantizeConfig(
quant_type="float8_per_row",
# Default (True), only quantize the repeated blocks in transformer if the repeated_blocks is
# specified. If set to False, the whole transformer will be quantized.
regional_quantize=True,
# Specify the block names for the transformer; cache-dit will automatically find the repeated
# blocks and quantize them in place. The block names can be found in the model architecture, e.g.,
# for FLUX.2, the block names are "Flux2TransformerBlock" and "Flux2SingleTransformerBlock".
repeated_blocks=['Flux2TransformerBlock', 'Flux2SingleTransformerBlock'],
# If omitted, repeated_blocks is detected automatically from the diffusers transformer class:
# default repeated_blocks = transformer._repeated_blocks if it exists, else None (quantize
# the whole transformer).
),
)
FP8 Per-Tensor Fallback
The per_tensor_fallback option in Cache-DiT's quantization configuration allows users to enable a fallback mechanism for layers that do not support float8 per-row or per-block quantization. This is particularly useful in scenarios where tensor parallelism is applied, and certain layers (e.g., those applied with RowwiseParallel) may encounter memory layout mismatch errors when quantized to float8 per-row.
When per_tensor_fallback is set to True, if a layer cannot be quantized to float8 per-row or per-block, it will automatically fall back to float8 per-tensor quantization instead of raising an error. This ensures that the quantization process can continue smoothly without interruption, while still providing the benefits of reduced precision for supported layers.
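The fallback behaviour described above amounts to a try-then-retry pattern. The sketch below illustrates that control flow only. The exception class, the helper, and the stand-in quantize function are all hypothetical and do not correspond to Cache-DiT or TorchAO APIs.

```python
from typing import Callable


class LayoutMismatchError(RuntimeError):
    """Hypothetical error raised when per-row scales do not fit a layer's layout."""


def quantize_with_fallback(layer_name: str,
                           quantize_fn: Callable[[str, str], None],
                           per_tensor_fallback: bool = True) -> str:
    """Try per-row float8 first; optionally retry with per-tensor float8."""
    try:
        quantize_fn(layer_name, "float8_per_row")
        return "float8_per_row"
    except LayoutMismatchError:
        if not per_tensor_fallback:
            raise  # surface the error when the fallback is disabled
        quantize_fn(layer_name, "float8_per_tensor")
        return "float8_per_tensor"


def fake_quantize(layer_name: str, quant_type: str) -> None:
    # Hypothetical stand-in: pretend layers named "to_out" are row-parallel
    # and therefore reject per-row quantization.
    if quant_type == "float8_per_row" and "to_out" in layer_name:
        raise LayoutMismatchError(layer_name)


print(quantize_with_fallback("attn.to_q", fake_quantize))
print(quantize_with_fallback("attn.to_out", fake_quantize))
```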
To enable this feature, set the per_tensor_fallback flag to True (the default) in the QuantizeConfig passed to the enable_cache API. Only float8 quantization is supported for now. For example:
import cac...