
[Model] Add circlestone-labs/Anima #4083

Open
akshatvishu wants to merge 11 commits into vllm-project:main from akshatvishu:anima

Conversation

@akshatvishu

@akshatvishu akshatvishu commented Jun 2, 2026

Contributor

Resolves #3658

Adds native diffusion support for circlestone-labs/Anima, a Cosmos-style text-to-image model distributed as a single-file safetensors checkpoint. The new AnimaPipeline loads the transformer/text-conditioner weights directly from the checkpoint, converts original Cosmos-style transformer keys when needed, and loads non-denoiser components such as the text encoder, tokenizers, VAE and scheduler from a Diffusers-layout components directory.

Native Anima currently targets baseline single-GPU execution. TP, SP, CFG-parallel, HSDP, Cache-DiT/TeaCache, quantization, CPU/layerwise offload and step execution are not supported yet.

Key Changes

  1. Native Anima Pipeline

    • Adds vllm_omni/diffusion/models/anima/ with AnimaPipeline,
      AnimaTransformer3DModel, and AnimaTextConditioner.
    • Supports direct local single-file safetensors loading and strict native
      module weight loading.
    • Implements prompt encoding, true CFG handling, denoising, VAE decode, and
      Anima-specific post-processing.
  2. Single-file Diffusion Loading

    • Adds diffusers_single_file handling for Diffusers adapter pipelines.
    • Auto-detects local .safetensors/.ckpt single-file checkpoints.
    • Maps Anima single-file aliases, including AnimaModularPipeline, to the
      native AnimaPipeline.
    • Allows default single-stage config selection for local single-file Anima
      serving without requiring a deploy config.
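The single-file routing described in item 2 can be pictured with a minimal sketch. The two suffixes and the alias names come from this PR description; the helper names `is_single_file_checkpoint` and `resolve_pipeline_name` are hypothetical and are not the PR's actual functions. The sketch inspects only the path suffix and does not check that the file exists.

```python
from pathlib import Path

# Suffixes treated as single-file checkpoints (taken from the PR description).
SINGLE_FILE_SUFFIXES = {".safetensors", ".ckpt"}

# Alias names that should route to the native pipeline (from the PR description).
ANIMA_SINGLE_FILE_ALIASES = {"AnimaPipeline", "AnimaModularPipeline"}


def is_single_file_checkpoint(model: str) -> bool:
    """Hypothetical helper: True if the path looks like a single-file checkpoint."""
    return Path(model).suffix.lower() in SINGLE_FILE_SUFFIXES


def resolve_pipeline_name(model_class_name: str) -> str:
    """Hypothetical helper: map Anima aliases onto the native pipeline name."""
    if model_class_name in ANIMA_SINGLE_FILE_ALIASES:
        return "AnimaPipeline"
    return model_class_name
```

Under these assumptions, a Diffusers-layout directory is left to the normal loading path, while a `.safetensors` file requesting `AnimaModularPipeline` ends up on the native `AnimaPipeline`.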

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e64bd1902


pipeline.enable_vae_tiling()

self._pipeline = pipeline
self._accept_call_kwargs = set(inspect.signature(pipeline.__call__).parameters.keys())


P1: Preserve ModularPipeline runtime kwargs

When the native Anima path is used, pipeline is a Diffusers ModularPipeline, whose __call__ signature is generic (state, output, **kwargs) rather than listing model inputs like prompt, height, or num_inference_steps. Caching that signature here makes _build_call_kwargs() later reject and drop the actual request fields, so a normal text-to-image request reaches the modular blocks without the required prompt and fails before generation. For modular pipelines this needs to allow block input names (or accept all kwargs) instead of using inspect.signature(pipeline.__call__) directly.
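The failure mode described here can be reproduced with the standard library alone. The sketch below is one possible fix under stated assumptions, not necessarily what the PR adopted: `build_call_kwargs` is a hypothetical stand-in for `_build_call_kwargs()`, whose real body is not shown in this thread, and the two pipeline classes are toy substitutes for a fixed-signature pipeline and a generic `(state, output, **kwargs)` one.

```python
import inspect


def build_call_kwargs(call, request_fields):
    """Keep only the request fields that `call` can accept.

    If `call` declares **kwargs, every field is forwarded, because the
    named parameters alone do not describe what the callee consumes.
    """
    params = inspect.signature(call).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        return dict(request_fields)
    return {k: v for k, v in request_fields.items() if k in params}


class FixedSignaturePipeline:
    # Toy pipeline that lists its model inputs explicitly.
    def __call__(self, prompt, height=1024, width=1024):
        return prompt


class GenericSignaturePipeline:
    # Toy pipeline whose __call__ is generic, as described in the review.
    def __call__(self, state=None, output=None, **kwargs):
        return kwargs


request = {"prompt": "a cat", "height": 512, "seed": 0}
fixed_kwargs = build_call_kwargs(FixedSignaturePipeline().__call__, request)
generic_kwargs = build_call_kwargs(GenericSignaturePipeline().__call__, request)
```

With a plain membership test against the parameter names, the generic pipeline would receive none of `prompt`, `height`, or `seed`; checking for a `VAR_KEYWORD` parameter first lets them through.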


@akshatvishu

Contributor Author

Doing a major refactoring, not ready for review yet!

@timzsu

timzsu commented Jun 13, 2026

Contributor

Hi @akshatvishu, may I ask when this PR will be ready?

@akshatvishu akshatvishu changed the title [WIP] Add circlestone-labs/Anima [Model] Add circlestone-labs/Anima Jun 13, 2026
@akshatvishu

akshatvishu commented Jun 13, 2026

Contributor Author

@timzsu It's ready for review!

The benchmarking code is included temporarily to validate this port. Once we're happy with the implementation, I'll run the benchmarks against the native Diffusers implementation and remove the benchmarking code before merging!

@timzsu

timzsu commented Jun 13, 2026

Contributor

Hi @akshatvishu, is it possible to split the performance optimizations from the model support? The current PR is too big (>3k lines) and hard to review. I suggest keeping the first PR as an integration with no extra optimizations. Then you can create separate PRs for offloading, quantization, and cache based on it.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
@akshatvishu

akshatvishu commented Jun 13, 2026

Contributor Author

@timzsu Done. I've split the performance optimizations out of this PR, so it now contains only the baseline model integration! All the optimization work (offloading, quantization, and cache-related changes) has been removed and will be raised separately in follow-up PRs.

I’ve also squashed the remaining changes into a single commit.

@akshatvishu

akshatvishu commented Jun 13, 2026

Contributor Author

Baseline validation

The baseline run completed successfully on a single MI300X (ROCm, via the official Docker image) with the following configuration:
Prompt:
official art, 2girls, hatsune miku, kasane teto, metal gear (series), @ shinkawa youji, twintails, blue hair, drill hair, red hair, fighting stance, kneeling, aiming, handgun, holding gun, suppressor, sneaking suit, profile, projected inset

Negative prompt:
worst quality, low quality, score_1, score_2, score_3, artist name

  • BF16
  • 50 inference steps
  • Configured image size: 1024x1024

Results

  • Engine initialization: 28.58 s
  • Model loading and initialization: 13.18 s
  • End-to-end generation latency: 5.55 s
  • Diffusion time: 5.24 s
  • Post-processing time: 24.31 ms
  • Peak GPU memory: 10.52 GB reserved
  • Peak allocated GPU memory: 9.50 GB
  • Throughput: approximately 11.6 steps/s
[Image: anima_baseline]

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)

  • BF16
  • 25 inference steps
  • Configured image size: 1024x1024
[Image: anima_anime_girl]

@akshatvishu

Contributor Author

Here is where we currently stand with the Anima integration:

1. Checkpoint Loading & Architecture Alignment

Single-file checkpoints currently bypass stage-config discovery, meaning the caller must explicitly provide --model-class-name and --diffusion-load-format diffusers_single_file (or the equivalent configuration). This is intentional, but it means vLLM-Omni cannot infer the model type from the checkpoint alone.

Since Anima is the first native diffusers_single_file integration in Omni:
Should we keep this as an explicit loading contract? Or should we introduce a registry-based detection path for known single-file models? I’d like to align on the intended long-term design before finalizing this path.
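To make the second option concrete, here is a minimal sketch of what a registry-based detection path could look like. Everything in it is hypothetical: the registry, the function names, and especially the marker prefix, since the real Anima checkpoint key names are not listed in this thread. It only illustrates the shape of the idea, which is to match a checkpoint's key names against registered detectors and fall back to the explicit contract when the match is missing or ambiguous.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# Hypothetical registry: pipeline name -> predicate over checkpoint key names.
_SINGLE_FILE_DETECTORS: Dict[str, Callable[[Set[str]], bool]] = {}


def register_single_file_model(name: str, detector: Callable[[Set[str]], bool]) -> None:
    """Register a detector that recognizes one single-file model family."""
    _SINGLE_FILE_DETECTORS[name] = detector


def detect_single_file_model(keys: Iterable[str]) -> Optional[str]:
    """Return the unique registered pipeline name matching `keys`, else None.

    None means "unknown or ambiguous", in which case the caller would still
    have to pass the model class explicitly.
    """
    key_set = set(keys)
    matches = [name for name, detect in _SINGLE_FILE_DETECTORS.items() if detect(key_set)]
    return matches[0] if len(matches) == 1 else None


# Illustrative marker only: the real Anima key names may differ.
register_single_file_model(
    "AnimaPipeline",
    lambda keys: any(k.startswith("hypothetical_anima_prefix.") for k in keys),
)
```

Key names can usually be listed without loading tensors (for example with safetensors' `safe_open(...).keys()`), so a detector of this kind would not need to read the full checkpoint.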

2. Module Sharing & Code Reuse

I reviewed the Cosmos3 integration to see if we could share modules, similar to how Ming-TTS and Ming-Omni align.

While Anima is based on nvidia/Cosmos-Predict2-2B-Text2Image, its transformer structure, text-conditioning path, checkpoint conversion and execution assumptions are not directly compatible with the existing Cosmos3 implementation.

Unlike Ming-TTS and Ming-Omni (which share clear boundaries through common audio components/utilities), I don't see an equivalent reusable boundary for Anima right now. Therefore, I've kept the implementation separate for now and plan to extract smaller utilities only when a concrete second consumer arrives.


Next Steps

Optimization support is already tracked as follow-up work. I’ll run the parity and performance benchmarks as soon as we align on the integration design above.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>

# Conflicts:
#	vllm_omni/diffusion/diffusion_engine.py
#	vllm_omni/diffusion/registry.py
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Comment thread vllm_omni/diffusion/data.py Outdated
Comment on lines +68 to +70
else:
    if hasattr(diffusers, model_class_name):
        return getattr(diffusers, model_class_name)

Collaborator

This branch seems to be dead code

Contributor Author

Removed!

Comment on lines +36 to +58
_NATIVE_SINGLE_FILE_DIFFUSION_MODELS = {"AnimaPipeline"}
_ANIMA_SINGLE_FILE_ALIASES = {"AnimaPipeline", "AnimaModularPipeline"}


def _diffusers_pipeline_module_name(model_class_name):
    base_name = model_class_name
    for suffix in ("ModularPipeline", "Pipeline"):
        if base_name.endswith(suffix):
            base_name = base_name[: -len(suffix)]
            break
    if not base_name:
        return None

    chars = []
    for index, char in enumerate(base_name):
        if char.isupper() and index > 0:
            chars.append("_")
        chars.append(char.lower())
    return "vllm_omni.diffusion.models." + "".join(chars)


def _resolve_diffusers_pipeline_cls(model_class_name):
    if hasattr(diffusers, model_class_name):

Collaborator

Why do we need these changes specifically for AnimaPipeline?

@akshatvishu akshatvishu Jun 20, 2026

Contributor Author

Done! I kept only the Anima single-file alias handling. Anima's HF single-file checkpoint uses AnimaModularPipeline, but vLLM-Omni loads it through the native AnimaPipeline because the denoiser and text-conditioner weights need custom splitting and key conversion.


def _load_native_denoiser_components(self, state_dict=None):
    if state_dict is None:
        import os

Collaborator

Redundant import os. We should also organize the imports: move them to the module-level imports at the top unless that would trigger circular imports or there is some other special case.

Contributor Author

Removed!

"extra_pos_embed_type": None,
}

_COSMOS_2_TRANSFORMER_RENAMES = {

Collaborator

Why do we need a Cosmos 2 rename mapping to load this model's components?

@akshatvishu akshatvishu Jun 20, 2026

Contributor Author

The raw Anima checkpoint uses the original training names for a Cosmos-style denoiser. I renamed this to an Anima original-checkpoint conversion and added a short comment explaining it!
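For readers unfamiliar with this kind of conversion, a rename mapping is typically applied as a substring rewrite over the checkpoint's keys. The following is a minimal sketch under stated assumptions: the function name, the `net.` prefix, and the `x_embedder` to `patch_embed` rule are illustrative placeholders, not the PR's actual mapping, which is not shown in this thread.

```python
def convert_checkpoint_keys(state_dict, renames, prefix_to_strip=""):
    """Return a new dict whose keys use the target module names.

    `renames` maps old substrings to new substrings and is applied in
    insertion order. Values are passed through untouched.
    """
    converted = {}
    for key, value in state_dict.items():
        if prefix_to_strip and key.startswith(prefix_to_strip):
            key = key[len(prefix_to_strip):]
        for old, new in renames.items():
            key = key.replace(old, new)
        if key in converted:
            # Two source keys collapsing onto one target key is a mapping bug.
            raise ValueError(f"duplicate key after conversion: {key}")
        converted[key] = value
    return converted


# Hypothetical rule and prefix, for illustration only.
EXAMPLE_RENAMES = {"x_embedder": "patch_embed"}
example = convert_checkpoint_keys(
    {"net.x_embedder.weight": 1, "net.blocks.0.w": 2},
    EXAMPLE_RENAMES,
    prefix_to_strip="net.",
)
```

Raising on a key collision pairs naturally with the strict native weight loading mentioned in the PR description, because a silent overwrite would otherwise surface only as a missing-key error later.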

for key, value in sampling.__dict__.items():
    if value is None:
        continue
    if key == "guidance_scale" and not getattr(sampling, "guidance_scale_provided", False):

Collaborator

This affects other diffusers-adapter models, not just Anima. Please check whether it's necessary, and whether there is a better way to handle it within the model-specific scope.

Contributor Author

Removed! Anima now handles guidance_scale only inside AnimaPipeline

Comment on lines +65 to +79
_ANIMA_TRANSFORMER_CONFIG = {
    "in_channels": 16,
    "out_channels": 16,
    "num_attention_heads": 16,
    "attention_head_dim": 128,
    "num_layers": 28,
    "mlp_ratio": 4.0,
    "text_embed_dim": 1024,
    "adaln_lora_dim": 256,
    "max_size": (128, 240, 240),
    "patch_size": (1, 2, 2),
    "rope_scale": (1.0, 4.0, 4.0),
    "concat_padding_mask": True,
    "extra_pos_embed_type": None,
}

Collaborator

Check if it's suitable to be migrated into vllm_omni/transformers_utils/configs/

Contributor Author

Moved the Anima transformer config next to AnimaTransformer3DModel instead. I did not move it to transformers_utils/configs/ because it is not a HF AutoConfig config.
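As a rough sanity check on what the fixed config implies, the sketch below derives two quantities from it. It rests on two assumptions that are not stated in this thread: that the transformer's hidden size is num_attention_heads times attention_head_dim (the usual convention for this style of model), and that a still image is a single latent frame with a spatial VAE scale factor of 8, the fallback value discussed in this review. The helper names are hypothetical.

```python
# Subset of the config values quoted in the review comment.
ANIMA_TRANSFORMER_CONFIG = {
    "num_attention_heads": 16,
    "attention_head_dim": 128,
    "patch_size": (1, 2, 2),
}
VAE_SCALE_FACTOR = 8  # assumption: fallback value discussed in this thread


def hidden_size(config):
    """Assumed convention: hidden size = heads * head_dim."""
    return config["num_attention_heads"] * config["attention_head_dim"]


def tokens_per_image(config, height, width, vae_scale_factor=VAE_SCALE_FACTOR):
    """Number of spatial patches for one latent frame of the given image size."""
    _, patch_h, patch_w = config["patch_size"]
    latent_h = height // vae_scale_factor
    latent_w = width // vae_scale_factor
    return (latent_h // patch_h) * (latent_w // patch_w)


print(hidden_size(ANIMA_TRANSFORMER_CONFIG))                   # 2048
print(tokens_per_image(ANIMA_TRANSFORMER_CONFIG, 1024, 1024))  # 4096
```

Under these assumptions, the 1024x1024 validation runs above correspond to a 128x128 latent and 64x64 = 4096 image tokens per forward pass.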

Comment on lines +180 to +194
def _infer_text_conditioner_config(state_dict):
    model_dim = state_dict["blocks.0.self_attn.q_proj.weight"].shape[0]
    source_dim = state_dict["blocks.0.cross_attn.k_proj.weight"].shape[1]
    target_vocab_size, target_dim = state_dict["embed.weight"].shape
    attention_head_dim = state_dict["blocks.0.self_attn.q_norm.weight"].shape[0]
    num_layers = 1 + max(int(key.split(".")[1]) for key in state_dict if key.startswith("blocks."))

    return {
        "source_dim": source_dim,
        "target_dim": target_dim,
        "model_dim": model_dim,
        "num_layers": num_layers,
        "num_attention_heads": model_dim // attention_head_dim,
        "target_vocab_size": target_vocab_size,
    }

Collaborator

Can we use unified, consistent configs rather than inferring them? (For now we have both approaches, which is somewhat inconsistent.)

@akshatvishu akshatvishu Jun 20, 2026

Contributor Author

Removed! The text-conditioner now uses a fixed ANIMA_TEXT_CONDITIONER_CONFIG next to its component class.

Comment on lines +334 to +336
self.vae_scale_factor = (
    2 ** len(self.vae.temperal_downsample) if hasattr(self.vae, "temperal_downsample") else 8
)

Collaborator

In get_anima_post_process_func we have a hardcoded vae_scale_factor = 8. Consider assigning it consistently in both places.

@akshatvishu akshatvishu Jun 20, 2026

Contributor Author

Hardcoded ANIMA_VAE_SCALE_FACTOR = 8 to match the postprocess API, which lacks VAE access. The runtime still uses the loaded VAE's scale factor if available, falling back to 8.
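One way to keep the two call sites consistent is to route both through a single resolver. The sketch below assumes that shape; `resolve_vae_scale_factor` is a hypothetical helper name, while `ANIMA_VAE_SCALE_FACTOR` and the `temperal_downsample` attribute (spelled as in the quoted code) come from this thread. The `SimpleNamespace` objects are toy stand-ins for a loaded VAE.

```python
from types import SimpleNamespace

ANIMA_VAE_SCALE_FACTOR = 8  # constant named in the reply above


def resolve_vae_scale_factor(vae=None):
    """Use the loaded VAE's downsampling stages if present, else the constant."""
    stages = getattr(vae, "temperal_downsample", None)
    if stages is not None:
        return 2 ** len(stages)
    return ANIMA_VAE_SCALE_FACTOR


# Runtime path: a VAE object is available.
runtime_factor = resolve_vae_scale_factor(
    SimpleNamespace(temperal_downsample=[False, True, True])
)

# Post-process path: no VAE access, so the constant is used.
postprocess_factor = resolve_vae_scale_factor()
```

With three downsampling stages the runtime value is 2 ** 3 = 8, which agrees with the constant used when no VAE is available.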

Address Anima review feedback by removing dead Diffusers class
resolution code, keeping native Anima single-file routing explicit,
moving Anima component configs next to their model classes and making
VAE scale-factor handling consistent between postprocess and runtime.

Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…_emb

Uses diffusers' apply_rotary_emb to upcast RoPE calculations to float32,
resolving the bfloat16 numerical drift vs the reference pipeline.
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
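The idea behind that last commit, computing the rotation in float32 and casting back only at the end, can be sketched without any deep-learning framework. This is an illustration under stated assumptions, not the diffusers implementation: it uses NumPy, takes float16 as a stand-in for bfloat16 (which NumPy does not provide natively), and uses the rotate-half layout, which may differ from the layout Anima uses. The function name is hypothetical.

```python
import numpy as np


def apply_rotary_emb_upcast(x, cos, sin):
    """Rotate-half RoPE computed in float32, returned in x's original dtype."""
    orig_dtype = x.dtype
    x32 = x.astype(np.float32)
    half = x32.shape[-1] // 2
    x1, x2 = x32[..., :half], x32[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)
    out = x32 * cos.astype(np.float32) + rotated * sin.astype(np.float32)
    # Cast back once, after all arithmetic is done in float32.
    return out.astype(orig_dtype)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float16)  # low-precision activations
angles = rng.uniform(0.0, 2.0 * np.pi, size=(4, 4))
cos = np.concatenate([np.cos(angles), np.cos(angles)], axis=-1)
sin = np.concatenate([np.sin(angles), np.sin(angles)], axis=-1)
out = apply_rotary_emb_upcast(x, cos, sin)
```

Because each (x1, x2) pair is only rotated, the per-row norm of `out` should match that of `x` up to the final rounding step, which makes a convenient check that the upcast path is wired correctly.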

Development

Successfully merging this pull request may close these issues.

[New Model]: circlestone-labs/Anima

3 participants