16 Jun 04:27

DefTruth

eb0ec99

Latest

🚀 Cache-DiT v1.5.0 Release Notes

Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:

INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
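
The idea behind the "SVD low-rank decomposition → INT4 packing" steps can be illustrated with a toy example. The sketch below is not Cache-DiT's implementation: it uses pure Python, a rank-1 branch found by power iteration, and simulated ("fake") per-row INT4 quantization, and all function names are hypothetical. It only shows why keeping a low-rank branch in full precision and quantizing the residual can reduce error compared with quantizing the weight directly.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triplet(w, iters=100):
    """Leading singular triplet (sigma, u, v) of w via power iteration."""
    wt = transpose(w)
    v = [1.0] * len(w[0])
    sigma, u = 0.0, []
    for _ in range(iters):
        u = matvec(w, v)
        u_norm = norm(u)
        u = [x / u_norm for x in u]
        v = matvec(wt, u)
        sigma = norm(v)
        v = [x / sigma for x in v]
    return sigma, u, v

def fake_quant_int4(m):
    """Symmetric per-row INT4: round to integers in [-8, 7], then dequantize."""
    out = []
    for row in m:
        scale = (max(abs(x) for x in row) / 7) or 1.0
        out.append([max(-8, min(7, round(x / scale))) * scale for x in row])
    return out

def frobenius_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

# Toy weight: a dominant rank-1 component plus a small deterministic perturbation.
rows, cols = 4, 6
w = [[(i + 1) * (j + 1) + 0.1 * math.sin(3 * i + 7 * j + 1) for j in range(cols)]
     for i in range(rows)]

# Baseline: quantize the full-precision weight directly.
err_direct = frobenius_error(w, fake_quant_int4(w))

# Low-rank branch kept in full precision, residual quantized to INT4.
sigma, u, v = top_singular_triplet(w)
low_rank = [[sigma * ui * vj for vj in v] for ui in u]
residual = [[x - l for x, l in zip(rw, rl)] for rw, rl in zip(w, low_rank)]
q_res = fake_quant_int4(residual)
w_hat = [[l + q for l, q in zip(rl, rq)] for rl, rq in zip(low_rank, q_res)]
err_svdq = frobenius_error(w, w_hat)

print(f"direct INT4 error: {err_direct:.4f}")
print(f"low-rank + INT4 residual error: {err_svdq:.4f}")
```

Because the low-rank branch absorbs the large-magnitude structure, the residual has a much smaller dynamic range and its 4-bit grid is correspondingly finer. The real workflow also uses calibration statistics and activation quantization, which this sketch omits.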

Performance (FLUX.2-klein-4B, 1024×1024, L20):

Stage	Latency (s)	Memory (GiB)	Transformer Weight (GiB)
BF16 baseline	2.13	17.32	7.22
SVDQuant INT4	1.24	12.39	2.28
SVDQuant + compile	1.02	12.39	2.28

Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
PSNR > 29 dB, near-lossless visual quality
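
The ratios quoted above follow directly from the table. A minimal check, using only the figures reported for FLUX.2-klein-4B on L20 (the variable names are illustrative):

```python
# Figures copied from the FLUX.2-klein-4B / L20 table above.
bf16 = {"latency_s": 2.13, "weight_gib": 7.22}
int4 = {"latency_s": 1.24, "weight_gib": 2.28}
int4_compile = {"latency_s": 1.02, "weight_gib": 2.28}

def ratio(baseline: float, value: float) -> float:
    """Return how many times smaller `value` is than `baseline`."""
    return baseline / value

weight_compression = ratio(bf16["weight_gib"], int4["weight_gib"])
speedup = ratio(bf16["latency_s"], int4["latency_s"])
speedup_compile = ratio(bf16["latency_s"], int4_compile["latency_s"])

print(f"weight compression: {weight_compression:.2f}x")
print(f"speedup: {speedup:.2f}x, with compile: {speedup_compile:.2f}x")
```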

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

Stage	Latency (s)	Speedup	Memory (GiB)
BF16 baseline	0.97	1.00×	17.32
NVFP4 PTQ	0.58	1.69×	12.50
NVFP4 + compile	0.47	2.05×	12.50

⚡ DQ (Dynamic Quantization)

Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):

identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto/stable_auto/power/log/rank/top/fixed). Supports few_shot_auto_compile for deferred compilation after quantization.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
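
For readers unfamiliar with the metric, PSNR is conventionally defined as 10·log10(MAX² / MSE) between a reference and a test image. The sketch below implements that standard definition on flat pixel lists. How the release notes' PSNR values were measured (reference image, value range) is not stated here, so the 8-bit range and the sample pixels are assumptions.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 8-bit pixels: every value is off by exactly one level (MSE = 1).
reference_pixels = [0, 64, 128, 255]
quantized_pixels = [1, 63, 129, 254]
value_db = psnr(reference_pixels, quantized_pixels)
print(f"PSNR: {value_db:.2f} dB")
```

Higher is better; each 10 dB drop corresponds to a tenfold increase in mean squared error.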

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.

⚙️ Quantization Configuration Enhancements

Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
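
As an illustration of what assigning quant types "by name pattern" could look like, here is a small sketch using glob patterns from the standard library. The plan format, the layer names, and the `resolve_quant_type` helper are all hypothetical; the actual `precision_plan` syntax should be taken from the Cache-DiT quantization docs.

```python
from fnmatch import fnmatch
from typing import Dict, Optional

def resolve_quant_type(
    layer_name: str, plan: Dict[str, str], default: Optional[str] = None
) -> Optional[str]:
    """Return the quant type of the first pattern that matches the layer name."""
    for pattern, quant_type in plan.items():
        if fnmatch(layer_name, pattern):
            return quant_type
    return default

# Hypothetical plan: attention projections in FP8, feed-forward layers in SVDQ INT4.
plan = {
    "*.attn.*": "float8_per_row",
    "*.ff.*": "svdq_int4_r128",
}

# Hypothetical layer names, for illustration only.
for name in ["blocks.0.attn.to_q", "blocks.0.ff.net.0", "blocks.0.norm1"]:
    print(name, "->", resolve_quant_type(name, plan))
```

In this sketch, first-match-wins ordering keeps the behavior predictable when patterns overlap, and unmatched layers fall through to `default` (here, left unquantized).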

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.

2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.

Core Design:

Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
Persistent Bins: Distribute the persistent budget evenly across the target sequence.
Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
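
The bucket pipeline can be pictured as a simple schedule: while bucket i executes, bucket i+1 is being copied to the GPU. The sketch below only models that ordering in pure Python. It does not use CUDA streams, and the helper names and layer names are hypothetical rather than Cache-DiT internals.

```python
from typing import Iterator, List, Optional, Tuple

def make_buckets(layers: List[str], bucket_size: int) -> List[List[str]]:
    """Split the layer sequence into small contiguous buckets."""
    return [layers[i:i + bucket_size] for i in range(0, len(layers), bucket_size)]

def schedule(buckets: List[List[str]]) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (bucket to execute, bucket to prefetch in the meantime)."""
    for i in range(len(buckets)):
        nxt = i + 1 if i + 1 < len(buckets) else None
        yield i, nxt

layers = [f"block_{i}" for i in range(10)]      # hypothetical layer names
buckets = make_buckets(layers, bucket_size=4)   # bucket size chosen for illustration
plan = list(schedule(buckets))

for run, prefetch in plan:
    target = buckets[prefetch] if prefetch is not None else "nothing"
    print(f"execute {buckets[run]} | prefetch {target}")
```

In a real implementation the prefetch would be an asynchronous host-to-device copy on a separate stream, so its cost is hidden behind the compute of the current bucket whenever the transfer finishes first.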

Performance (FLUX.1-dev, L20):

Config	Memory	Latency
No offload	~38 GiB	23.4s
Diffusers sequential	~1 GiB	335s
Layerwise (transfer=4, persistent=32, bins=4)	~16 GiB	24.6s

Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.

torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.

CLI Quick Start:

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.

Two Wrapper Levels:

Level	Description	Best For
Pipeline Wrapper (recommended)	Ray manages the entire pipeline execution	Full feature support (cache, quant, parallelism), simplest, fastest.
Transformer Wrapper	Only the transformer runs on Ray workers	Lightweight, but slightly slower

Key Features:

ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
ray_use_compile: Automatic per-worker compilation.
ray_runtime_env: Custom module import handling via PYTHONPATH.
Supports all parallelism strategies: TP, Ulysses, Ring.
LoRA support: fuse before enabling (TP requires fused LoRAs).

Performance (FLUX.2-klein-base-9B):

Config	Latency
Baseline (single GPU)	47.41s
Ray TP=2 + compile	24.57s

Minimal Example:

import cache_dit
from cache_dit import ParallelismConfig

# `pipe` is an already-loaded pipeline.
cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via one economy SVD, then eigendecomposed for extrapolation via $\lambda^k$. Unlike TaylorSeer's polynomial extrapolation (diverges as $t^n \to \infty$), DMD is bounded when $\lvert\lambda\rvert \leq 1$.
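
For reference, the textbook "exact DMD" procedure that matches this description can be written as follows. This is the standard formulation, not necessarily the exact variant implemented in Cache-DiT; the symbols $X$, $X'$, $U_r$, $\Sigma_r$, $V_r$, $\tilde{A}$, $W$, $\Lambda$ and the snapshot count $m$ are notation introduced here.

```latex
\begin{aligned}
X &= [\,Y_0, Y_1, \dots, Y_{m-1}\,], \qquad X' = [\,Y_1, Y_2, \dots, Y_m\,] \\
X &\approx U_r \Sigma_r V_r^{*} \quad \text{(rank-$r$ economy SVD)} \\
\tilde{A} &= U_r^{*} X' V_r \Sigma_r^{-1}, \qquad \tilde{A} W = W \Lambda, \quad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r) \\
\Phi &= X' V_r \Sigma_r^{-1} W, \qquad b = \Phi^{\dagger} Y_m \\
Y_{m+k} &\approx \Phi \Lambda^{k} b
\end{aligned}
```

Since $\Lambda^k$ only raises each eigenvalue to the $k$-th power, every mode with $\lvert\lambda_i\rvert \leq 1$ stays bounded for all $k$, which is the property contrasted with polynomial extrapolation above.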

TaylorSeer vs DMD:

Aspect	TaylorSeer (Polynomial)	DMD (Exponential)
Basis	$Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$	$Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$
Extrapolation	Diverges as $t^n \to \infty$	Bounded when $\lvert\lambda\rvert \leq 1$
Snapshots needed	2+ (1st order)	≥ 4 uniformly spaced
Best for	DiT-class denoising (DDPM)	Flow-matching generators (Hunyuan3D, etc.)
Noise sensitivity	Low	Moderate (SVD truncation suppresses noise)
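
The difference in extrapolation behavior is easy to see in the scalar case. The sketch below fits a one-dimensional propagator by least squares (a stand-in for the SVD-based fit) and compares it with a first-order finite-difference extrapolation on a geometrically decaying sequence. The function names and the small `ridge` term are illustrative assumptions, not Cache-DiT's calibrator code.

```python
def fit_scalar_propagator(history, ridge=1e-8):
    """Least-squares fit of a in y[t+1] ~ a * y[t]; ridge avoids division by zero."""
    num = sum(nxt * cur for cur, nxt in zip(history, history[1:]))
    den = sum(cur * cur for cur in history[:-1]) + ridge
    return num / den

def extrapolate_exponential(history, steps):
    """Exponential-basis prediction: last value times a**steps."""
    a = fit_scalar_propagator(history)
    return history[-1] * a ** steps

def extrapolate_first_order(history, steps):
    """First-order polynomial prediction from the last finite difference."""
    slope = history[-1] - history[-2]
    return history[-1] + steps * slope

history = [0.5 ** t for t in range(4)]       # 1, 0.5, 0.25, 0.125
steps = 5
truth = 0.5 ** (len(history) - 1 + steps)    # value 5 steps past the last snapshot
dmd_like = extrapolate_exponential(history, steps)
taylor_like = extrapolate_first_order(history, steps)

print(f"truth={truth:.6f} exponential={dmd_like:.6f} first-order={taylor_like:.6f}")
```

On this sequence the exponential fit recovers the decay rate and stays close to the true value, while the first-order prediction keeps following the last slope and overshoots past zero.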

Usage:

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6

🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series

TP + compile integration (#888)
fp8 per-row + TP support (#896)
Async Ulysses support (#877)

📈 CUDA Graph

Full CUDA Graph support (#942-#952)
CUDA Graph + fp8 rowwi...

Contributors

FNGarvin and blian6

09 Jun 05:08

DefTruth

v1.3.12

95b0800

v1.3.12

What's Changed

chore: update installation guide by @DefTruth in #1035
chore: Update README.md by @DefTruth in #1036
Update README.md by @DefTruth in #1037
chore: update installation guide by @DefTruth in #1038
feat: config yaml support svdq dq/few-shot by @DefTruth in #1040
[2/N]: config yaml support svdq dq/few-shot by @DefTruth in #1042
chore: update why not svqd nvfp4 for sm100 by @DefTruth in #1043
chore: hightlight bucket-style layerwise offload by @DefTruth in #1044
chore: Update README.md by @DefTruth in #1045
svdq: add fused gelu mlp/proj pass by @DefTruth in #1047
docs: add fused gelu mlp/proj docs by @DefTruth in #1048

Full Changelog: v1.3.11...v1.3.12

Contributors

DefTruth

04 Jun 10:05

DefTruth

v1.3.11

8bfe26c

v1.3.11

What's Changed

chore: update release tools by @DefTruth in #1034

Full Changelog: v1.3.10...v1.3.11

Contributors

DefTruth

04 Jun 06:37

DefTruth

v1.3.10

5659e42

v1.3.10

What's Changed

feat: add MindIE-SD as optional NPU attention and compilation backend by @blian6 in #1004
chore: simplify attn backend auto select by @DefTruth in #1024
Fix Python version mismatch in setup.py by @FNGarvin in #1025
offload: extract copy stream pool and split init by @DefTruth in #1026
feat: support svdquant nvfp4 ptq/dq by @DefTruth in #1029
chore: Update README.md by @DefTruth in #1030
whl: cache-dit-cu13 pkg w/ svdq kernels by @DefTruth in #1031
whl: fix build_releases.sh tool by @DefTruth in #1032
whl: fix build_releases.sh tool by @DefTruth in #1033

New Contributors

@blian6 made their first contribution in #1004
@FNGarvin made their first contribution in #1025

Full Changelog: v1.3.9...v1.3.10

Contributors

DefTruth, FNGarvin, and blian6

27 May 03:10

DefTruth

v1.3.9

cdacd96

v1.3.9

What's Changed

ray: refactor ray wrapper impl by @DefTruth in #1016
parallel: deprecated native diffusers backend by @DefTruth in #1017
ray: allow pass init_fn to ray wrapper by @DefTruth in #1019
docs: add torch.compile section to offload docs by @DefTruth in #1020
API: remove dup ray api call by @DefTruth in #1021
ray: simplify ray wrapper dispatch by @DefTruth in #1022
ray: soft check for ray path by @DefTruth in #1023

Full Changelog: v1.3.8...v1.3.9

Contributors

DefTruth

25 May 09:15

DefTruth

v1.3.8

929041e

v1.3.8

What's Changed

CLI: allow 8-steps lora for qwen-image edit lightning by @DefTruth in #1011
skills: add triton-kernel skill by @DefTruth in #1013
feat: make layerwise offload compatible w/ compile by @DefTruth in #1014
ray: fix custom components serialize by @DefTruth in #1015

Full Changelog: v1.3.7...v1.3.8

Contributors

DefTruth

12 May 07:39

DefTruth

v1.3.7

a0737f8

v1.3.7

What's Changed

docs: update ray wrapper docs by @DefTruth in #1005
docs: update ray wrapper docs by @DefTruth in #1006
ray: pass runtime env to workers by @DefTruth in #1007
chore: update ray wrapper docs by @DefTruth in #1008
ray: disable dashboard by default by @DefTruth in #1009

Full Changelog: v1.3.6...v1.3.7

Contributors

DefTruth

11 May 02:11

DefTruth

v1.3.6

cdc0430

v1.3.6

What's Changed

chore: update cache-dit arch by @DefTruth in #932
bc: deprecated serving module by @DefTruth in #933
chore: suppress torch compile tuning logs by @DefTruth in #934
compile: enabled descent_tuning by default by @DefTruth in #935
docs: update quantization docs by @DefTruth in #937
quant: add quantize backend enum by @DefTruth in #938
kernel: refactor ops register by @DefTruth in #939
chore: fix vllm-omni docs links by @DefTruth in #940
examples: add cuda graph option by @DefTruth in #942
chore: fix utils log info by @DefTruth in #943
chore: add cuda graph usage docs by @DefTruth in #944
chore: add cuda graph usage to overview by @DefTruth in #945
CLI: add compile full-graph option by @DefTruth in #946
chore: fix fullgraph param typo by @DefTruth in #947
docs: add more cuda graph perf results by @DefTruth in #948
docs: add more cuda graph perf results by @DefTruth in #949
docs: update cuda graphs docs by @DefTruth in #950
chore: allow cuda graph for dynamic compile by @DefTruth in #951
feat: support cuda graph + fp8 rowwise by @DefTruth in #952
chore: hotfix for mkdocs broken by @DefTruth in #953
[1/N] feat: support svdquant w4a4 - kernels & skills by @DefTruth in #954
pytest: fast_svd mode for testing by @DefTruth in #955
[2/N] feat: streaming quantize for svdquant by @DefTruth in #956
[3/N] feat: PTQ workflow for svdquant by @DefTruth in #957
SKILL: add ptq-workflow-integration skill by @DefTruth in #958
pytest: separate kernels and quantization tests by @DefTruth in #959
chore: add docs strings to codebase by @DefTruth in #960
chore: add svdq e2e example and format code by @DefTruth in #961
SKILL: add Cute-DSL/CUDA/CUTLASS skills by @DefTruth in #962
chore: update docs by @DefTruth in #963
kernel: tune svdq w4a4 gemm stage/blk size for Ada by @DefTruth in #966
kernel: unified ops register policy by @DefTruth in #967
bench: refactor cache-dit bench by @DefTruth in #968
svdquant: fast svd decompose, ~18x speedup by @DefTruth in #969
[2/N] tune svdq w4a4 gemm for ada by @DefTruth in #970
bc: refactor distributed codebase by @DefTruth in #971
kernel: add cute-dsl based merge-attn-states kernel by @DefTruth in #973
feat: extend SVDQ PTQ -> SVDQ DQ by @DefTruth in #974
fix: support 3D input/output for W4A4 linear by @DefTruth in #975
chore: support svdq-calib option in examples by @DefTruth in #976
kernel: add cute-dsl based fp8 comm kernels by @DefTruth in #977
[1/N] feat: support cute-dsl based svdquant w4a4 by @DefTruth in #978
feat: support svdq-dq few shot by @DefTruth in #979
chore: update svdq-dq few shot docs by @DefTruth in #980
feat: support layerwise cpu offload by @DefTruth in #981
[2/N] feat: support layerwise offload by @DefTruth in #982
[3/N] feat: support layerwise offload by @DefTruth in #983
[4/N] feat: support layerwise offload by @DefTruth in #984
chore: unified all2all/ring comm api by @DefTruth in #985
chore: refactor async ulysses codebase by @DefTruth in #986
remove cutedsl based svdq kernels by @DefTruth in #987
fix tensor parallel register import error by @DefTruth in #988
feat: support sub cp_plan for context parallel by @DefTruth in #989
chore: fix attention dispatch comments by @DefTruth in #990
community: add tensorrt-llm x cache-dit link by @DefTruth in #991
deps: use uv to install deps by @DefTruth in #992
chore: update docs by @DefTruth in #993
chore: add layerwise offload to overview by @DefTruth in #994
chore: update layerwise offload cli quick start by @DefTruth in #995
attention: fix sage-attn backend dispatch by @DefTruth in #996
chore: add exclude-layers param to ptq example by @DefTruth in #997
attention: separate attn backends by @DefTruth in #998
svdq: support converter cli for dq workflow by @DefTruth in #999
chore: fix typo by @DefTruth in #1000
chore: revise quantization example in README by @DefTruth in #1001
chore: update README by @DefTruth in #1002
feat: support ray wrapper by @DefTruth in #1003

Full Changelog: v1.3.5...v1.3.6

Contributors

DefTruth

30 Mar 08:13

DefTruth

v1.3.5

a5022a5

v1.3.5 Quantization

Low-bits Quantization

Overview

Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.

quantization type	description	devices
float8_per_row	quantize weights and activations to float8 (dynamic quantization) with rowwise method. (recommended)	>=sm89, Ada, Hopper or newer
float8_per_tensor	quantize weights and activations to float8 (dynamic quantization) with tensorwise method.	>=sm89, Ada, Hopper or newer
float8_per_block	quantize weights and activations block-wise to float8 (dynamic quantization), which can provide better precision; activation block size: (1, 128), weight block size: (128, 128)	>=sm89, Ada, Hopper or newer
float8_weight_only	quantize only weights to float8, keep activations in full precision	>=sm89, Ada, Hopper or newer
int8_per_row	quantize weights and activations to int8 (dynamic quantization) with rowwise method.	>=sm80, Ampere or newer
int8_per_tensor	quantize weights and activations to int8 (dynamic quantization) with tensorwise method.	>=sm80, Ampere or newer
int8_weight_only	quantize only weights to int8, keep activations in full precision	>=sm80, Ampere or newer
int4_weight_only	quantize only weights to int4, keep activations in full precision	>=sm90, Hopper or newer, TMA required
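
The difference between the rowwise and tensorwise methods comes down to how many scale factors are used. The sketch below simulates symmetric INT8 quantization in pure Python with one scale per tensor versus one scale per row. It is a conceptual illustration with hypothetical helper names and made-up values, not the TorchAO kernels, and it uses integers rather than the float8 format for simplicity.

```python
def quant_dequant(values, scale, qmax=127):
    """Round to the integer grid defined by `scale`, clamp, and map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def per_tensor(matrix, qmax=127):
    """One scale shared by the whole matrix."""
    scale = max(abs(v) for row in matrix for v in row) / qmax
    return [quant_dequant(row, scale, qmax) for row in matrix]

def per_row(matrix, qmax=127):
    """One scale per row, so small-magnitude rows keep their resolution."""
    out = []
    for row in matrix:
        scale = max(abs(v) for v in row) / qmax
        out.append(quant_dequant(row, scale, qmax))
    return out

def row_error(original, approx):
    return max(abs(x - y) for x, y in zip(original, approx))

# Hypothetical weight with one small-magnitude row and one large-magnitude row.
w = [[0.01, -0.02, 0.015],
     [50.0, -100.0, 75.0]]

err_tensor_row0 = row_error(w[0], per_tensor(w)[0])
err_row_row0 = row_error(w[0], per_row(w)[0])
print(f"row 0 error, per-tensor: {err_tensor_row0:.6f}")
print(f"row 0 error, per-row:    {err_row_row0:.6f}")
```

With a single tensor-wide scale, the small row is rounded entirely to zero, whereas a per-row scale preserves it. This is the usual reason rowwise quantization is preferred for precision, at the cost of storing more scales.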

FP8 Quantization

Currently, TorchAO is fully integrated into Cache-DiT as the backend for online quantization. You can quantize a model either by calling the quantize API directly or by passing a QuantizeConfig to the enable_cache API (recommended).

For GPUs with low memory capacity, we recommend float8_per_row or float8_per_block, as these methods cause almost no loss in precision. Supported quantization types include:

float8_per_row: quantize both weights and activations to float8 (dynamic quantization) with rowwise method.
float8_per_tensor: quantize both weights and activations to float8 (dynamic quantization) with tensorwise method.
float8_per_block: quantize weights and activations block-wise to float8 (dynamic quantization), which can provide better precision; activation block size: (1, 128), weight block size: (128, 128). Not yet supported for distributed inference.
float8_weight_only: quantize only weights to float8, keep activations in full precision.

Here are some examples of how to use quantization with cache-dit. You can directly specify the quantization config in the enable_cache API.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

# quant_type: float8_per_row, float8_per_tensor, float8_per_block, float8_weight_only, 
# int8_per_row, int8_per_tensor, int8_weight_only, int4_weight_only, etc.
# Pass a QuantizeConfig to the `enable_cache` API.
cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)

Users can also specify different quantization configs for different components. For example, quantize the transformer to float8_per_row and the text encoder to float8_weight_only.

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        components_to_quantize={
            "transformer": {
                "quant_type": "float8_per_row",
                "exclude_layers": ["embedder", "embed"],
            },
            "text_encoder": {
                "quant_type": "float8_weight_only",
                "exclude_layers": ["lm_head"],
            }
        }
    ),
)

Or, directly call the quantize API for more fine-grained control.

import cache_dit
from cache_dit import QuantizeConfig

cache_dit.quantize(
    pipe.transformer, 
    quantize_config=QuantizeConfig(quant_type="float8_per_row"),
)
cache_dit.quantize(
    pipe.text_encoder, 
    quantize_config=QuantizeConfig(quant_type="float8_weight_only"),
)

Please also enable torch.compile for better performance with quantization.

import torch
import cache_dit

cache_dit.set_compile_configs()
pipe.transformer = torch.compile(pipe.transformer)
pipe.text_encoder = torch.compile(pipe.text_encoder)

Users can set exclude_layers in QuantizeConfig to exclude some sensitive layers that are not robust to quantization, e.g., embedding layers. Layers that contain any of the keywords in the exclude_layers list will be excluded from quantization. For example:

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        quant_type="float8_per_row",
        exclude_layers=["embedder", "embed"],
    ),
)
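
The keyword rule described above can be read as a substring test on each layer name. The sketch below shows that reading in isolation. The `should_quantize` helper and the layer names are hypothetical, and the library's actual matching logic may differ in detail.

```python
from typing import Iterable, List

def should_quantize(layer_name: str, exclude_layers: Iterable[str]) -> bool:
    """A layer is skipped if its name contains any of the exclude keywords."""
    return not any(keyword in layer_name for keyword in exclude_layers)

# Hypothetical layer names, for illustration only.
layer_names: List[str] = [
    "x_embedder",
    "time_text_embed.linear_1",
    "transformer_blocks.0.attn.to_q",
    "single_transformer_blocks.3.proj_mlp",
]

exclude = ["embedder", "embed"]
kept = [name for name in layer_names if should_quantize(name, exclude)]
print(kept)
```

Under a substring rule, a short keyword such as "embed" already covers "embedder", so keywords should be chosen with care to avoid excluding more layers than intended.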

By default, quant_type="float8_per_row" for better precision. Users can set it to "float8_per_tensor" to use per-tensor quantization for better performance on some hardware.

Regional Quantization

Cache-DiT also supports regional quantization, which lets users quantize only the repeated blocks in a transformer. This can help balance precision and efficiency. Users can specify the blocks to be quantized via the regional_quantize and repeated_blocks arguments in QuantizeConfig. For example, to quantize the repeated blocks of the FLUX.2 transformer:

import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig, QuantizeConfig

cache_dit.enable_cache( 
    pipe, cache_config=DBCacheConfig(), # w/ default
    parallelism_config=ParallelismConfig(ulysses_size=2),
    quantize_config=QuantizeConfig(
        quant_type="float8_per_row",
        # Default (True): quantize only the repeated blocks in the transformer if repeated_blocks is
        # specified. If set to False, the whole transformer will be quantized.
        regional_quantize=True, 
        # Specify the block names for the transformer; cache-dit will automatically find the repeated
        # blocks and quantize them in place. The block names can be found in the model architecture, e.g.,
        # for FLUX.2, the block names are "Flux2TransformerBlock" and "Flux2SingleTransformerBlock".
        repeated_blocks=['Flux2TransformerBlock', 'Flux2SingleTransformerBlock'],
        # repeated_blocks will be detected automatically from the diffusers transformer class, namely:
        # default repeated_blocks = transformer._repeated_blocks if it exists, else None (quantize
        # the whole transformer).
    ),
)

FP8 Per-Tensor Fallback

The per_tensor_fallback option in Cache-DiT's quantization configuration allows users to enable a fallback mechanism for layers that do not support float8 per-row or per-block quantization. This is particularly useful in scenarios where tensor parallelism is applied, and certain layers (e.g., those applied with RowwiseParallel) may encounter memory layout mismatch errors when quantized to float8 per-row.

When per_tensor_fallback is set to True, if a layer cannot be quantized to float8 per-row or per-block, it will automatically fall back to float8 per-tensor quantization instead of raising an error. This ensures that the quantization process can continue smoothly without interruption, while still providing the benefits of reduced precision for supported layers.

To enable this feature, set the per_tensor_fallback flag to True (the default) in the QuantizeConfig when calling the enable_cache API. Currently, only float8 quantization is supported. For example:

import cac...

27 Mar 03:14

DefTruth

v1.3.4

99aade7

v1.3.4

hotfix

Releases: vipshop/cache-dit

v1.5.0 Major Release
