Releases: vipshop/cache-dit
v1.3.3
v1.3.2
hotfix release for fp8 per-row quantization w/ tensor parallel
Full Changelog: v1.3.1...v1.3.2
v1.3.1
What's Changed
- chore: update load configs docs by @DefTruth in #867
- fix: skip fp8 quantize linear w/ bias in tp by @DefTruth in #869
- chore: add quick start flags for quantize by @DefTruth in #871
- chore: update pypi download badge by @DefTruth in #872
- bugfix: remove un-supported quantize type by @DefTruth in #873
- feat: expand quantize config by @DefTruth in #874
- feat: support async ulysses for flux2 series by @DefTruth in #877
- chore: cleanup patch functors codes by @DefTruth in #878
- chore: fix docs typo by @DefTruth in #879
- chore: safe import metrics funcs by @DefTruth in #880
- chore: update quantization docs by @DefTruth in #881
- chore: use rel imports for calibrators by @DefTruth in #882
- chore: suppress torchao warnings by @DefTruth in #883
- chore: add tune alias for max-autotune by @DefTruth in #884
- remove manually graph break in cache blocks by @DefTruth in #885
- docs: format docs by @DefTruth in #886
- docs: fix typos by @DefTruth in #887
- [1/N] feat: support flux2-klein kv - tp + compile by @DefTruth in #888
- chore: cleanup tp utils codes by @DefTruth in #890
- chore: fix api docs typo by @DefTruth in #891
- chore: add mcc usage docs by @DefTruth in #892
- chore: update mcc usage docs by @DefTruth in #893
- chore: add mcc to cache-dit arch by @DefTruth in #894
- chore: update mcc docs by @DefTruth in #895
- [2/N] feat: support fp8 per-row + tp for flux2-klein kv by @DefTruth in #896
- quant: add float8 linear check by @DefTruth in #898
- docs: format docs by @DefTruth in #899
- deps: bump up torch to 2.11.0 by @DefTruth in #900
- quant: refactor torchao backend impl by @DefTruth in #901
- feat: support regional quantization by @DefTruth in #902
- chore: change docs highlight color by @DefTruth in #903
- chore: optimize quant stats summary by @DefTruth in #904
- kernel: register comm kernels as torch ops by @DefTruth in #905
- kernel: refactor custom triton kernels by @DefTruth in #907
- [2/N] kernel: refactor custom triton kernels by @DefTruth in #908
- [3/N] kernel: refactor custom triton kernels by @DefTruth in #909
- quant: refactor quantize api, deprecated kwargs by @DefTruth in #910
- [2/N] quant: refactor quantize api, deprecated kwargs by @DefTruth in #911
- chore: suppress diffusers torchao warnings by @DefTruth in #912
- chore: fix load configs docs typo by @DefTruth in #913
- chore: optimize quant ctx summary by @DefTruth in #914
Full Changelog: v1.3.0...v1.3.1
v1.3.0: USP, 2D/3D Parallel, FP8 Blockwise, ...
v1.3.0 Major Release: USP, 2D/3D Parallel, FP8 Blockwise, ...
Cache-DiT v1.3.0 is a major release after v1.2.0. The major changes include:
- cache-dit-generate command line tool
- Optimize VAE Parallel comm, use batched isend/irecv
- 2D/3D Parallelism: Hybrid CP (USP) + TP, e.g., SP2 + TP2
- Support USP (hybrid ulysses and ring attention)
- New models support: GLM-Image, FLUX.2-Klein, Helios, FireRed-Image-Edit, and more.
- Support passing a quantize_config to the enable_cache API
- Support loading cache, parallelism and quantization configs from YAML (see docs)
- FP8 Blockwise dynamic quantization support
- AMD GPUs support
- ...
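To make the "SP2 + TP2" notation above concrete, the sketch below maps four global ranks onto a 2D grid and lists which ranks would communicate in each sequence-parallel (SP) and tensor-parallel (TP) group. The helper name `build_2d_groups` and the row-major layout (TP ranks adjacent) are assumptions for illustration only; they are not taken from Cache-DiT and may differ from the layout it actually uses.

```python
# Illustrative only: maps global ranks onto a 2D (SP x TP) grid.
# `build_2d_groups` is a hypothetical helper, and the row-major layout
# (TP ranks adjacent) is an assumption, not Cache-DiT's confirmed layout.

def build_2d_groups(world_size: int, sp_size: int, tp_size: int):
    """Return (sp_groups, tp_groups), each a list of rank lists."""
    if sp_size * tp_size != world_size:
        raise ValueError("sp_size * tp_size must equal world_size")
    # Row i of the grid holds ranks [i * tp_size, ..., i * tp_size + tp_size - 1].
    grid = [
        [sp * tp_size + tp for tp in range(tp_size)]
        for sp in range(sp_size)
    ]
    # Ranks in one row hold the same sequence shard and split the weights.
    tp_groups = grid
    # Ranks in one column hold the same weight shard and split the sequence.
    sp_groups = [list(col) for col in zip(*grid)]
    return sp_groups, tp_groups


sp_groups, tp_groups = build_2d_groups(world_size=4, sp_size=2, tp_size=2)
print("TP groups:", tp_groups)
print("SP groups:", sp_groups)
```

With this layout, each rank belongs to exactly one TP group and one SP group, which is what lets the two kinds of parallelism be combined on the same set of GPUs.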
Full Changelog: v1.2.0...v1.3.0
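For readers new to the quantization terms in these notes, the sketch below contrasts per-row and blockwise scale computation for FP8 (e4m3) dynamic quantization, where scales are derived from the tensor values at run time. It is a pure-Python illustration under stated assumptions: the function names and the tiny 2x2 block size are hypothetical, rounding to real FP8 values is omitted, and none of it reflects Cache-DiT's or torchao's actual implementation.

```python
# Illustrative only: per-row vs. blockwise scales for FP8 (e4m3) quantization.
# Function names and block sizes are hypothetical; rounding to FP8 is omitted.

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def per_row_scales(matrix):
    """One scale per row: the row's max |x| divided by the FP8 range."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in matrix]


def blockwise_scales(matrix, block_rows, block_cols):
    """One scale per (block_rows x block_cols) tile of the matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_of_scales = []
        for c0 in range(0, n_cols, block_cols):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            row_of_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_of_scales)
    return scales


weights = [
    [0.5, -1.0, 224.0, 8.0],
    [2.0, 4.0, -16.0, 448.0],
]
row_scales = per_row_scales(weights)
block_scales = blockwise_scales(weights, block_rows=2, block_cols=2)
print("per-row scales:", row_scales)
print("blockwise scales:", block_scales)
```

In this toy matrix the large values sit in the right-hand columns, so the blockwise variant gives the left-hand block a much smaller scale than either per-row scale, which preserves more precision for the small values there.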
v1.2.3
What's Changed
- feat: support 🔥FireRed-Image-Edit-1.0 by @DefTruth in #797
- misc: support custom input height/width by @DefTruth in #799
- chore: support compile repeated blocks in examples by @DefTruth in #800
- chore: add cache-dit arch by @DefTruth in #802
- chore: update cache-dit arch by @DefTruth in #803
- chore: update cache-dit arch by @DefTruth in #804
- chore: update cache-dit arch by @DefTruth in #805
- chore: update cache-dit arch by @DefTruth in #806
- chore: update cache-dit arch by @DefTruth in #807
- chore: update cache-dit arch by @DefTruth in #809
- fix tp flat mesh broken for torch < 2.10 by @DefTruth in #810
- chore: only logging at rank 0 by default by @DefTruth in #812
- chore: add env docs by @DefTruth in #813
Full Changelog: v1.2.2...v1.2.3
v1.2.2
What's Changed
- fix load config docs typo by @DefTruth in #778
- chore: rename hybrid parallel backend by @DefTruth in #779
- feat: add an extend context parallel api by @DefTruth in #780
- chore: set save_ctx as False for ring p2p by @DefTruth in #782
- chore: add flux2-klein edit examples by @DefTruth in #783
- fix ring lse fp32 convert error by @DefTruth in #785
- feat: support cache for glm-image by @DefTruth in #787
- chore: reset rdt as 0.12 in examples for better precision by @DefTruth in #789
- chore: update badges by @DefTruth in #790
- feat: ring attn w/ npu_fia for ascend npu by @luren55 in #792
- feat: support tensor parallel for glm-image by @DefTruth in #794
Full Changelog: v1.2.1...v1.2.2
v1.2.1 USP, 2D/3D Parallel
🎉 The v1.2.1 release is ready. Major updates include: Ring Attention w/ batched P2P, USP (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥USP + TP), and reduced VAE-P communication overhead.
# Hybrid 2D/3D Parallelism in Cache-DiT is fully compatible w/ torch.compile,
# Cache Acceleration, Text Encoder Parallelism, VAE Parallelism and more.
torchrun --nproc_per_node=8 -m cache_dit.generate flux2 --config parallel_2d.yaml --compile
torchrun --nproc_per_node=8 -m cache_dit.generate flux2 --config parallel_3d.yaml --compile
torchrun --nproc_per_node=8 -m cache_dit.generate --parallel ulysses_tp --cache --compile
What's Changed
- [chore] Align torch generator with example by @BBuf in #723
- Fix generator bug in cache-dit by @BBuf in #724
- examples: allow custom generator device by @DefTruth in #726
- examples: allow custom warmup-steps by @DefTruth in #727
- docs: add latest news by @DefTruth in #728
- docs: fix docs format by @DefTruth in #729
- fix selected metrics print by @66RING in #730
- docs: add flux examples to tp docs by @DefTruth in #731
- fix ltx-2 i2v example by @DefTruth in #734
- Update README.md by @DefTruth in #735
- chore: allow use default steps for scm by @DefTruth in #736
- [chore] support gpu generator in server by @BBuf in #737
- docs: update download badge by @DefTruth in #738
- Refine profiler and serving docs by @BBuf in #739
- example image-path support url by @BBuf in #742
- fix UAA broken while using joint attn by @DefTruth in #743
- compile: avoid graph break for UAA by @DefTruth in #744
- refactor configs yml in examples by @DefTruth in #745
- relax npu attention import by @DefTruth in #747
- feat: add set_attn_backend api by @DefTruth in #748
- docs: update quick start by @DefTruth in #749
- fix ring attn w/ native backend in torch 2.10 by @DefTruth in #750
- feat: NPU FA support attention mask by @zhangtao0408 in #751
- feat: add cache-dit-generate cli tool by @DefTruth in #752
- docs: update ascend npu examples by @DefTruth in #753
- feat: support ring attn p2p comm by @DefTruth in #754
- feat: support USP -> Ulysses + Ring by @DefTruth in #755
- fix npu import error w/o triton by @DefTruth in #756
- chore: use batched isend/irecv for vae-p by @DefTruth in #757
- feat: tile batched p2p comm for vae-p by @DefTruth in #758
- docs: update example installation by @DefTruth in #760
- reduce comm overhead for vae-p by @DefTruth in #762
- [chore] Fix FLUX2 Ulysses Anything NCCL Hang by @BBuf in #761
- [2/N] reduce comm overhead for vae-p by @DefTruth in #763
- misc: fix sglang diffusion docs link by @DefTruth in #764
- feat: support hybrid CP/SP + TP by @DefTruth in #765
- chore: use _cp_rank for cp_config & fix docs by @DefTruth in #766
- fix hybrid parallel docs by @DefTruth in #768
- chore: fix api docs typo by @DefTruth in #769
- fix api docs typo by @DefTruth in #770
- feat: support latest z-image in examples by @DefTruth in #771
- chore: add show case for parallel vae by @DefTruth in #772
- chore: update docs by @DefTruth in #773
- [2/N] update docs part-2 by @DefTruth in #774
- docs: add ComfyUI-CacheDiT link by @DefTruth in #775
- feat: load config support hybrid parallel by @DefTruth in #777
New Contributors
- @66RING made their first contribution in #730
- @zhangtao0408 made their first contribution in #751
Full Changelog: v1.2.0...v1.2.1
v1.2.0 Major Release: NPU, TE-P, VAE-P, CN-P, ...
Overviews
v1.2.0 is a major release after v1.1.0. It introduces many updates that further improve the ease of use and performance of Cache-DiT. We sincerely thank the contributors of Cache-DiT. The main updates are:
- 🎉New Models Support
- 🎉Request level cache context
- 🎉HTTP Serving Support
- 🎉Context Parallelism Optimization
- 🎉Text Encoder Parallelism
- 🎉Auto Encoder (VAE) Parallelism
- 🎉ControlNet Parallelism
- 🎉Ascend NPU Support
- 🎉Community Integration.
🔥New Models Support
- Qwen-Image:
- Image: Qwen-Image-2512, Qwen-Image-Layered
- Edit: Qwen-Image-Edit-2511, Qwen-Image-Edit-2509
- ControlNet: Qwen-Image-ControlNet, Qwen-Image-ControlNet-Inpainting
- Qwen-Image-Lightning: Qwen-Image-Lightning series, Qwen-Image-Edit-Lightning series
- Wan: Wan 2.1 VACE, Wan 2.2 VACE.
- Z-Image: Z-Image-Turbo, Z-Image-Turbo-Fun-ControlNet-2.0, Z-Image-Turbo-Fun-ControlNet-2.1
- FLUX.2: FLUX.2-dev, FLUX.2-Klein-4B, FLUX.2-Klein-base-4B, FLUX.2-Klein-9B, FLUX.2-Klein-base-9B
- LTX-2: LTX-2-I2V, LTX-2-T2V by @BBuf
- Ovis-Image: Ovis-Image
- LongCat-Image: LongCat-Image, LongCat-Image-Edit
- Nunchaku INT4 Models: Z-Image-Turbo, Qwen-Image-Edit-2511
🔥Request level cache context
If each user request needs a different num_inference_steps instead of a fixed value, use the refresh_context API: before running inference for a request, update the cache context with the actual number of steps. Please refer to 📚run_cache_refresh as an example.
import cache_dit
from cache_dit import DBCacheConfig
from diffusers import DiffusionPipeline
# Init cache context with num_inference_steps=None (default)
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")
pipe = cache_dit.enable_cache(pipe.transformer, cache_config=DBCacheConfig(num_inference_steps=None))
# Assume num_inference_steps is 28, and we want to refresh the context
cache_dit.refresh_context(pipe.transformer, num_inference_steps=28, verbose=True)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary
# Update the cache context with new num_inference_steps=50.
cache_dit.refresh_context(pipe.transformer, num_inference_steps=50, verbose=True)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary
# Update the cache context with new cache_config.
cache_dit.refresh_context(
    pipe.transformer,
    cache_config=DBCacheConfig(
        residual_diff_threshold=0.1,
        max_warmup_steps=10,
        max_cached_steps=20,
        max_continuous_cached_steps=4,
        # The cache settings should all be located in the cache config
        # if cache config is provided. Otherwise, we will skip it.
        num_inference_steps=50,
    ),
    verbose=True,
)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary
🔥HTTP Serving Support
- Built-in HTTP serving deployment support with simple REST APIs by @BBuf: deploy cache-dit models over an HTTP API for text-to-image, image editing, multi-image editing, and text/image-to-video generation.
🔥Context Parallelism Optimization
- UAA: Ulysses Anything Attention: support any sequence length and any head num by @DefTruth @gameofdimension @tingkuanpei
- Async Ulysses CP: support Async Ulysses QKV Projection for FLUX.1, FLUX.2, Z-Image, Qwen-Image by @DefTruth
- Async FP8 Ulysses: support async FP8 all2all comm for ulysses by @triple-mu
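The "any sequence length and any head num" point above comes down to partitioning a count that may not divide evenly across ranks. The sketch below shows one common way to do that, giving the remainder to the lowest ranks. The helper names are hypothetical, and this is only an illustration of the idea, not the way UAA is implemented in Cache-DiT.

```python
# Illustrative only: uneven partitioning of tokens or heads across ranks.
# Helper names are hypothetical; this is not Cache-DiT's UAA implementation.

def uneven_split_sizes(total: int, world_size: int):
    """Split `total` items over `world_size` ranks; low ranks get the remainder."""
    base, rem = divmod(total, world_size)
    return [base + (1 if rank < rem else 0) for rank in range(world_size)]


def shard_bounds(total: int, world_size: int):
    """Return the half-open [start, end) index range owned by each rank."""
    bounds, start = [], 0
    for size in uneven_split_sizes(total, world_size):
        bounds.append((start, start + size))
        start += size
    return bounds


seq_sizes = uneven_split_sizes(total=4097, world_size=4)   # tokens per rank
head_sizes = uneven_split_sizes(total=30, world_size=4)    # heads per rank
print("sequence shards:", seq_sizes, shard_bounds(4097, 4))
print("head shards:", head_sizes)
```

Because shard sizes differ between ranks, any all-to-all exchange built on such a split has to communicate per-rank sizes rather than assume equal chunks.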
🔥Text Encoder Parallelism
cache-dit currently supports text encoder parallelism for the T5Encoder, UMT5Encoder, Llama, Gemma 1/2/3, Mistral, Mistral-3, Qwen-3, Qwen-2.5 VL, Glm and Glm-4 model series, which covers almost 🔥ALL pipelines in diffusers.
Users can set the extra_parallel_modules parameter in parallelism_config (when using Tensor Parallelism or Context Parallelism) to specify additional modules that need to be parallelized beyond the main transformer, e.g., text_encoder in Flux2Pipeline. It can further reduce the per-GPU memory requirement and slightly improve the inference performance of the text encoder.
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig
# Transformer Tensor Parallelism + Text Encoder Tensor Parallelism
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        tp_size=2,
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder], # FLUX.2
        },
    ),
)
🔥Auto Encoder (VAE) Parallelism
cache-dit currently supports auto encoder (VAE) parallelism for the AutoencoderKL, AutoencoderKLQwenImage, AutoencoderKLWan, and AutoencoderKLHunyuanVideo series, which covers almost 🔥ALL pipelines in diffusers. It can further reduce the per-GPU memory requirement and slightly improve the inference performance of the auto encoder. To enable it, set the extra_parallel_modules parameter in parallelism_config, for example:
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig
# Transformer Context Parallelism + Text Encoder Tensor Parallelism + VAE Data Parallelism
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder, pipe.vae], # FLUX.1
        },
    ),
)
🔥ControlNet Parallelism
cache-dit also supports ControlNet parallelism for specific models, such as Z-Image-Turbo with ControlNet. To enable it, set the extra_parallel_modules parameter in parallelism_config, for example:
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig
# Transformer Context Parallelism + Text Encoder Tensor Parallelism
# + VAE Data Parallelism + ControlNet Context Parallelism
cache_dit.enable_cache(
    pipe,
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        # case: Z-Image-Turbo-Fun-ControlNet-2.1
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder, pipe.vae, pipe.controlnet],
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache.py
🔥Ascend NPU Support
Cache-DiT now provides native support for Ascend NPU (by @gameofdimension @luren55 @DefTruth). Theoretically, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT’s optimization technologies, including:
- Hybrid Cache Acceleration (DBCache, DBPrune, TaylorSeer, SCM and more)
- Context Parallelism (w/ Extended Diffusers' CP APIs, UAA, Async Ulysses, ...)
- Tensor Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Text Encoder Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Auto Encoder (VAE) Parallelism (w/ Data or Tile Parallelism, avoid OOM)
- ControlNet Parallelism (w/ Context Parallelism for ControlNet module)
- Built-in HTTP serving deployment support with simple REST APIs
Please refer to Ascend NPU Supported Matrix for more details.
🔥Community Integration
- 🔥Ascend NPU x Cache-DiT
- 🎉Diffusers x Cache-DiT
- 🎉SGLang Diffusion x Cache-DiT
- 🎉vLLM-Omni x Cache-DiT
- 🎉Nunchaku x Cache-DiT
- 🎉SD.Next x Cache-DiT
- 🎉stable-diffusion.cpp x Cache-DiT
- 🎉jetson-containers x Cache-DiT
Full Changelogs
- chore: Update README.md by @DefTruth in #442
- feat: support step compute mask by @DefTruth in #444
- bugfix: fix bench distill cfg mismatch by @DefTruth in #445
- chore: update step mask docs by @DefTruth in #446
- chore: Update User_Guide.md by @DefTruth in #447
- chore: update README by @DefTruth in #448
- chore: update step mask example by @DefTruth in #449
- chore: hightlight SCM - step computation mask by @DefTruth in #450
- chore: hightlight SCM - step computation mask by @DefTruth in #451
- chore: hightlight SCM - step computation mask by @DefTruth in https://github.com/vipshop...
v1.1.10
New Models Supported
LongCat-Image, LongCat-Image-Edit, Z-Image-Turbo-ControlNet, Z-Image-Turbo Nunchaku, Qwen-Image-Edit-2511, Qwen-Image-Layered
What's Changed
- Simplify CLI: Make task argument optional by @BBuf in #600
- chore: fix extra path compare by @DefTruth in #603
- CI: Add build_wheel CI by @DefTruth in #604
- feat: support cache for LongCat-Image by @e1ijah1 in #602
- feat: Serving support LORA by @BBuf in #601
- CI: Add Forward Pattern CPU CI Tests by @DefTruth in #605
- chore: Update README.md by @DefTruth in #606
- Fix typo by @BBuf in #607
- feat: z-image-controlnet 🔥4x speedup! by @DefTruth in #608
- fix lora path mismatch in examples by @DefTruth in #609
- feat: support TP and CP for longcat-image by @DefTruth in #610
- misc: fix typo by @DefTruth in #612
- feat: support 🔥Qwen-Image-Edit-2511 by @DefTruth in #614
- feat: support 🔥Qwen-Image-Layered by @DefTruth in #615
- chore: simplify parallelism dispatch by @DefTruth in #616
- chore: simplify quantize dispatch by @DefTruth in #617
- chore: refactor kernels module by @DefTruth in #618
- ci: add refresh context ci tests by @DefTruth in #619
- misc: add device info to example summary by @DefTruth in #621
- feat: support ⚡️Z-Image-Turbo Nunchaku by @DefTruth in #623
- [chore] Improve error_logging in serving tp_worker.py by @BBuf in #627
- chore: support more alias for quant types by @DefTruth in #628
- chore: fix alias rev map for quant types by @DefTruth in #629
- chore: lazy import check for quantize api by @DefTruth in #630
- chore: add more compile flags setting by @DefTruth in #631
- [Bug] Apply --attn backend in single-GPU examples by @BBuf in #633
- Bump up to v1.1.10 by @DefTruth in #634
Full Changelog: v1.1.9...v1.1.10
v1.1.9
What's Changed
- feat: uaa avoid extra memory IO access by @triple-mu in #551
- chore: simplify quantize flags in example utils by @DefTruth in #553
- chore: fix quantize flags in example by @DefTruth in #554
- chore: fix quantize & TP conflicts for wan by @DefTruth in #556
- feat: support serving text2video by @BBuf in #555
- chore: Update SERVING Doc and FAQ Doc by @BBuf in #557
- chore: qwen edit lightning cp/tp examples by @DefTruth in #559
- feat: support ovis-image context parallel by @DefTruth in #560
- feat: serving support image2video by @BBuf in #558
- chore: add collect_env script by @DefTruth in #562
- Add pre-commit and GitHub Actions CI by @DefTruth in #564
- chore: refactor parallelism for better reusability by @DefTruth in #565
- chore: Update vLLM-Omni integration by @SamitHuang in #566
- feat: add pipe quant config for serving by @nono-Sang in #563
- News: 🔥vLLM-Omni x Cache-DiT ready! by @DefTruth in #567
- feat: enable custom attn backend for TP by @DefTruth in #568
- feat: support TP for many text encoder by @DefTruth in #569
- fix qwen-edit-lightning examples by @DefTruth in #571
- fix get_text_encoder_from_pipe by @DefTruth in #572
- fix: handle general compile options in example utils by @DefTruth in #573
- chore: reduce un-popular examples by @DefTruth in #574
- feat: add text_encoder tp for serving by @nono-Sang in #570
- chore: simplify example by @DefTruth in #575
- chore: make unified examples by @DefTruth in #576
- chore: fix vllm-omni docs link by @DefTruth in #577
- chore: optimize examples default path mapping by @DefTruth in #579
- chore: fix vllm-omni docs link by @DefTruth in #580
- feat: support Ovis-Image tensor parallel by @DefTruth in #582
- chore: fix typo in User_Guide.md by @DefTruth in #583
- chore: fail fast TP validation for attn heads by @CPFLAME in #581
- fix patch functor for multi transformers by @DefTruth in #586
- chore: add qwen image controlnet example by @DefTruth in #588
- chore: update docs by @DefTruth in #590
- feat: register fa3 backend for context parallel by @nono-Sang in #589
- chore: support separate quant-type for text encoder by @DefTruth in #591
- hotfix for fa3 backend import error by @DefTruth in #593
- chore: fix typo in README.md by @DefTruth in #594
- chore: set save_ctx to False for inference by @nono-Sang in #596
- fix flux examples model path mismatch by @DefTruth in #597
New Contributors
- @SamitHuang made their first contribution in #566
- @nono-Sang made their first contribution in #563
- @CPFLAME made their first contribution in #581
Full Changelog: v1.1.8...v1.1.9