Releases: vipshop/cache-dit

v1.5.0 Major Release

16 Jun 04:27
eb0ec99

🚀 Cache-DiT v1.5.0 Release Notes

Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:

  • INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
  • NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
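The "INT4 packing" step in the pipeline above can be sketched in plain Python. This is a minimal illustration, assuming symmetric quantization with one scale per group and two codes per byte (low nibble first). The function names and the byte layout are hypothetical and are not taken from Cache-DiT's kernels; the SVD low-rank step is not shown.

```python
def quantize_int4(values):
    """Symmetric quantization of a group of floats to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into bytes (assumed layout: low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


weights = [0.70, -0.30, 0.12, -0.04]
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)
restored = [c * scale for c in unpack_int4(packed, len(weights))]
print(codes, packed.hex(), restored)
```

Four weights occupy two bytes after packing, and each restored value differs from the original by at most half a quantization step.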

Performance (FLUX.2-klein-4B, 1024×1024, L20):

| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |
|---|---|---|---|
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |

  • Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
  • End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
  • PSNR > 29 dB, near-lossless visual quality
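The compression and speedup figures can be reproduced from the table values. The short check below uses only the numbers reported above; the variable names are illustrative.

```python
# Values copied from the FLUX.2-klein-4B / L20 table above.
bf16 = {"latency_s": 2.13, "weight_gib": 7.22}
int4 = {"latency_s": 1.24, "weight_gib": 2.28}
int4_compile = {"latency_s": 1.02, "weight_gib": 2.28}

compression = bf16["weight_gib"] / int4["weight_gib"]
speedup = bf16["latency_s"] / int4["latency_s"]
speedup_compile = bf16["latency_s"] / int4_compile["latency_s"]

print(f"weight compression: {compression:.1f}x")
print(f"speedup (INT4): {speedup:.1f}x")
print(f"speedup (INT4 + compile): {speedup_compile:.1f}x")
```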

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

| Stage | Latency (s) | Speedup | Memory (GiB) |
|---|---|---|---|
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |

⚡ DQ (Dynamic Quantization)

Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):

  • identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
  • weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
  • few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto/stable_auto/power/log/rank/top/fixed). Supports few_shot_auto_compile for deferred compilation after quantization.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.

⚙️ Quantization Configuration Enhancements

  • Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
  • Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
  • FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
  • TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
  • Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.


2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.

Core Design:

  • Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
  • Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
  • Persistent Bins: Distribute the persistent budget evenly across the target sequence.
  • Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
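To see why prefetching the next bucket helps, the following toy timing model compares "every layer waits for its own transfer" with a bucket pipeline. It is a sketch under simplifying assumptions (uniform per-layer transfer and compute cost, one bucket prefetched ahead, no persistent buckets or stream pools) and is not Cache-DiT's scheduler. All function names are hypothetical.

```python
def make_buckets(layer_ids, bucket_size):
    """Split the layer sequence into small contiguous buckets."""
    return [layer_ids[i:i + bucket_size] for i in range(0, len(layer_ids), bucket_size)]


def sequential_time(buckets, transfer_ms, compute_ms):
    """No overlap: each layer is transferred, then computed."""
    return sum(len(b) * (transfer_ms + compute_ms) for b in buckets)


def pipelined_time(buckets, transfer_ms, compute_ms):
    """Overlap: bucket i+1 is prefetched while bucket i computes."""
    if not buckets:
        return 0
    total = len(buckets[0]) * transfer_ms  # the first transfer cannot be hidden
    for i, bucket in enumerate(buckets):
        compute = len(bucket) * compute_ms
        has_next = i + 1 < len(buckets)
        prefetch = len(buckets[i + 1]) * transfer_ms if has_next else 0
        total += max(compute, prefetch)  # the slower of the two bounds this stage
    return total


buckets = make_buckets(list(range(8)), bucket_size=4)
print(buckets)
print(sequential_time(buckets, transfer_ms=3, compute_ms=5))
print(pipelined_time(buckets, transfer_ms=3, compute_ms=5))
```

In this model the transfer cost disappears from every stage where compute takes at least as long as the next bucket's prefetch, which leaves only the first bucket's transfer exposed.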

Performance (FLUX.1-dev, L20):

| Config | Memory | Latency |
|---|---|---|
| No offload | ~38 GiB | 23.4s |
| Diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |

Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.

torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.

CLI Quick Start:

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.

Two Wrapper Levels:

| Level | Description | Best For |
|---|---|---|
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution | Full feature support (cache, quant, parallelism), simplest, fastest. |
| Transformer Wrapper | Only the transformer runs on Ray workers | Lightweight, but slightly slower |

Key Features:

  • ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
  • ray_use_compile: Automatic per-worker compilation.
  • ray_runtime_env: Custom module import handling via PYTHONPATH.
  • Supports all parallelism strategies: TP, Ulysses, Ring.
  • LoRA support: fuse before enabling (TP requires fused LoRAs).

Performance (FLUX.2-klein-base-9B):

| Config | Latency |
|---|---|
| Baseline (single GPU) | 47.41s |
| Ray TP=2 + compile | 24.57s |

Minimal Example:

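import cache_dit
from cache_dit import ParallelismConfig

# `pipe` is an already-loaded diffusers pipeline.
cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Inference code is unchanged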

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via one economy SVD, then eigendecomposed for extrapolation via $\lambda^k$. Unlike TaylorSeer's polynomial extrapolation (diverges as $t^n \to \infty$), DMD is bounded when $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
|---|---|---|
| Basis | $Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$ | $Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$ |
| Extrapolation | Diverges as $t^n \to \infty$ | Bounded when $\lvert\lambda\rvert \leq 1$ |
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |
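The difference between the two bases shows up even in one dimension. The sketch below is a scalar special case, assuming a single feature whose propagator $A$ reduces to one number $a = \lambda$ fitted by ridge-regularized least squares. The real calibrator works on snapshot matrices with an SVD, which is not shown, and all function names here are hypothetical.

```python
def fit_scalar_propagator(history, ridge=1e-8):
    """Least-squares fit of y[t+1] ~= a * y[t], with a small ridge term."""
    num = sum(y * y_next for y, y_next in zip(history[:-1], history[1:]))
    den = sum(y * y for y in history[:-1]) + ridge
    return num / den


def dmd_extrapolate(history, steps, ridge=1e-8):
    """Exponential-basis forecast: y[T + k] ~= a**k * y[T]."""
    a = fit_scalar_propagator(history, ridge)
    return history[-1] * a ** steps


def taylor_first_order(history, steps):
    """Polynomial-basis forecast using a first-order finite difference."""
    slope = history[-1] - history[-2]
    return history[-1] + steps * slope


history = [0.5 ** t for t in range(6)]  # a decaying stream: 1, 0.5, 0.25, ...
a = fit_scalar_propagator(history)
dmd_pred = dmd_extrapolate(history, steps=10)
taylor_pred = taylor_first_order(history, steps=10)
truth = 0.5 ** 15

print(f"fitted a = {a:.6f}")
print(f"truth = {truth:.3e}, DMD = {dmd_pred:.3e}, Taylor = {taylor_pred:.3e}")
```

On this decaying stream the fitted propagator stays below 1 in magnitude, so the exponential forecast remains bounded and close to the true value, while the linear extrapolation overshoots past zero.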

Usage:

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6


🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series

  • TP + compile integration (#888)
  • fp8 per-row + TP support (#896)
  • Async Ulysses support (#877)

📈 CUDA Graph

  • Full CUDA Graph support (#942-#952)
  • CUDA Graph + fp8 rowwi...

v1.3.12

09 Jun 05:08
95b0800

What's Changed

Full Changelog: v1.3.11...v1.3.12

v1.3.11

04 Jun 10:05
8bfe26c

What's Changed

Full Changelog: v1.3.10...v1.3.11

v1.3.10

04 Jun 06:37
5659e42

What's Changed

New Contributors

Full Changelog: v1.3.9...v1.3.10

v1.3.9

27 May 03:10
cdacd96

What's Changed

Full Changelog: v1.3.8...v1.3.9

v1.3.8

25 May 09:15
929041e

What's Changed

Full Changelog: v1.3.7...v1.3.8

v1.3.7

12 May 07:39
a0737f8

What's Changed

Full Changelog: v1.3.6...v1.3.7

v1.3.6

11 May 02:11
cdc0430

What's Changed

Full Changelog: v1.3.5...v1.3.6

v1.3.5 Quantization

30 Mar 08:13
a5022a5

Low-bits Quantization

Overview

Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.

| Quantization type | Description | Devices |
|---|---|---|
| float8_per_row | Quantize weights and activations to float8 (dynamic quantization), row-wise. Recommended. | >=sm89, Ada, Hopper or newer |
| float8_per_tensor | Quantize weights and activations to float8 (dynamic quantization), tensor-wise. | >=sm89, Ada, Hopper or newer |
| float8_per_block | Quantize weights and activations to float8 (dynamic quantization), block-wise, for better precision. Activation block size: (1, 128); weight block size: (128, 128). | >=sm89, Ada, Hopper or newer |
| float8_weight_only | Quantize only weights to float8; keep activations in full precision. | >=sm89, Ada, Hopper or newer |
| int8_per_row | Quantize weights and activations to int8 (dynamic quantization), row-wise. | >=sm80, Ampere or newer |
| int8_per_tensor | Quantize weights and activations to int8 (dynamic quantization), tensor-wise. | >=sm80, Ampere or newer |
| int8_weight_only | Quantize only weights to int8; keep activations in full precision. | >=sm80, Ampere or newer |
| int4_weight_only | Quantize only weights to int4; keep activations in full precision. | >=sm90, Hopper or newer, TMA required |

FP8 Quantization

TorchAO is fully integrated into Cache-DiT as the backend for online quantization. To quantize a model, either call the quantize API directly or pass a QuantizeConfig to the enable_cache API (recommended).

For GPUs with low memory capacity, we recommend float8_per_row or float8_per_block, as these methods cause almost no loss in precision. The supported quantization types are:

  • float8_per_row: quantize both weights and activations to float8 (dynamic quantization) with rowwise method.
  • float8_per_tensor: quantize both weights and activations to float8 (dynamic quantization) with tensorwise method.
  • float8_per_block: quantize both weights and activations to float8 (dynamic quantization) block-wise, which can provide better precision. Activation block size: (1, 128); weight block size: (128, 128). Not yet supported for distributed inference.
  • float8_weight_only: quantize only weights to float8, keep activations in full precision.
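The row-wise and tensor-wise methods differ in how many scale factors are computed. The sketch below shows only the scale computation, assuming the e4m3 float8 format (largest finite value 448) and absolute-max scaling. It does not perform the float8 cast, and it is not TorchAO's implementation; the helper names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def per_tensor_scale(matrix):
    """One scale shared by the whole tensor."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / FP8_E4M3_MAX


def per_row_scales(matrix):
    """One scale per row, so a large row does not dominate small rows."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in matrix]


def range_utilization(row, scale):
    """Fraction of the float8 range that the scaled row actually occupies."""
    return max(abs(v) for v in row) / scale / FP8_E4M3_MAX


activations = [
    [0.5, -1.0, 0.25],       # small-magnitude row
    [100.0, -448.0, 50.0],   # row containing an outlier
]
tensor_scale = per_tensor_scale(activations)
row_scales = per_row_scales(activations)

print(range_utilization(activations[0], tensor_scale))   # small row, shared scale
print(range_utilization(activations[0], row_scales[0]))  # small row, own scale
```

With a shared scale, the small row uses well under 1% of the representable range, which is the precision loss that row-wise scaling avoids.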

Here are some examples of how to use quantization with cache-dit. You can directly specify the quantization config in the enable_cache API.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

# quant_type: float8_per_row, float8_per_tensor, float8_per_block, float8_weight_only, 
# int8_per_row, int8_per_tensor, int8_weight_only, int4_weight_only, etc.
# Pass a QuantizeConfig to the `enable_cache` API.
cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)

Users can also specify different quantization configs for different components. For example, quantize the transformer to float8_per_row and the text encoder to float8_weight_only.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        components_to_quantize={
            "transformer": {
                "quant_type": "float8_per_row",
                "exclude_layers": ["embedder", "embed"],
            },
            "text_encoder": {
                "quant_type": "float8_weight_only",
                "exclude_layers": ["lm_head"],
            }
        }
    ),
)

Or, directly call the quantize API for more fine-grained control.

import cache_dit
from cache_dit import QuantizeConfig

cache_dit.quantize(
    pipe.transformer, 
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)
cache_dit.quantize(
    pipe.text_encoder, 
    quantize_config=QuantizeConfig(quant_type="float8_weight_only"),
)

Please also enable torch.compile for better performance with quantization.

import torch
import cache_dit

cache_dit.set_compile_configs()
pipe.transformer = torch.compile(pipe.transformer)
pipe.text_encoder = torch.compile(pipe.text_encoder)

Users can set exclude_layers in QuantizeConfig to exclude some sensitive layers that are not robust to quantization, e.g., embedding layers. Layers that contain any of the keywords in the exclude_layers list will be excluded from quantization. For example:

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        quant_type="float8_per_row",
        exclude_layers=["embedder", "embed"],
    ),
)
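The keyword rule described above (a layer is skipped if its name contains any entry of exclude_layers) amounts to a substring match. The sketch below illustrates that rule only; the helper name and the layer names are illustrative and are not taken from Cache-DiT or any specific model.

```python
def select_layers(layer_names, exclude_layers):
    """Keep the layer names that contain none of the exclusion keywords."""
    return [
        name for name in layer_names
        if not any(keyword in name for keyword in exclude_layers)
    ]


# Illustrative layer names only.
layer_names = [
    "x_embedder",
    "time_text_embed.linear_1",
    "transformer_blocks.0.attn.to_q",
    "single_transformer_blocks.3.proj_mlp",
]
kept = select_layers(layer_names, exclude_layers=["embedder", "embed"])
print(kept)
```

Because matching is by substring, the keyword "embed" already covers "embedder"; listing both is harmless but redundant under this rule.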

By default, quant_type="float8_per_row" for better precision. Users can set it to "float8_per_tensor" to use per-tensor quantization for better performance on some hardware.

Regional Quantization

Cache-DiT also supports regional quantization, which quantizes only the repeated blocks in a transformer. This helps balance precision against efficiency. Users select the blocks to quantize via the regional_quantize and repeated_blocks arguments in QuantizeConfig. For example, to quantize the repeated blocks of the FLUX.2 transformer:

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        quant_type="float8_per_row",
        # Defaults to True: quantize only the repeated blocks of the transformer when
        # repeated_blocks is specified. Set to False to quantize the whole transformer.
        regional_quantize=True,
        # Class names of the repeated blocks. Cache-DiT finds the matching modules and
        # quantizes them in place. The names come from the model architecture, e.g.,
        # "Flux2TransformerBlock" and "Flux2SingleTransformerBlock" for FLUX.2.
        repeated_blocks=['Flux2TransformerBlock', 'Flux2SingleTransformerBlock'],
        # If repeated_blocks is omitted, it is detected from the diffusers transformer class:
        # transformer._repeated_blocks if it exists, else None (quantize the whole
        # transformer).
    ),
)

FP8 Per-Tensor Fallback

The per_tensor_fallback option in Cache-DiT's quantization configuration allows users to enable a fallback mechanism for layers that do not support float8 per-row or per-block quantization. This is particularly useful in scenarios where tensor parallelism is applied, and certain layers (e.g., those applied with RowwiseParallel) may encounter memory layout mismatch errors when quantized to float8 per-row.

When per_tensor_fallback is set to True, if a layer cannot be quantized to float8 per-row or per-block, it will automatically fall back to float8 per-tensor quantization instead of raising an error. This ensures that the quantization process can continue smoothly without interruption, while still providing the benefits of reduced precision for supported layers.
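The fallback behaviour described above can be pictured as a try/except around the per-row path. This is a control-flow sketch only: the exception type, the dictionary-based layer description, and both functions are hypothetical stand-ins, not Cache-DiT or TorchAO APIs.

```python
class LayoutMismatchError(RuntimeError):
    """Hypothetical error raised when per-row quantization cannot be applied."""


def quantize_per_row(layer):
    # Assumption for illustration: row-wise-sharded TP layers reject per-row scales.
    if layer.get("tp_style") == "rowwise":
        raise LayoutMismatchError(f"{layer['name']}: memory layout mismatch")
    return "float8_per_row"


def quantize_layer(layer, per_tensor_fallback=True):
    """Try per-row first; fall back to per-tensor instead of failing."""
    try:
        return quantize_per_row(layer)
    except LayoutMismatchError:
        if not per_tensor_fallback:
            raise  # without the fallback, the error propagates
        return "float8_per_tensor"


layers = [
    {"name": "attn.to_q", "tp_style": "colwise"},
    {"name": "attn.to_out", "tp_style": "rowwise"},
]
plan = {layer["name"]: quantize_layer(layer) for layer in layers}
print(plan)
```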

To enable this feature, set the per_tensor_fallback flag to True (the default) in the QuantizeConfig passed to the enable_cache API. The fallback currently applies to float8 quantization only. For example:

import cac...

v1.3.4

27 Mar 03:14
99aade7

hotfix