Replies: 1 comment
Short answer: partially yes. Here's a breakdown by model:

Qwen / Qwen2.5 → ✅ Officially supported
As of v2.11.0, AirLLM officially supports the Qwen and Qwen2.5 model families. Just use:

    from airllm import AutoModel
    model = AutoModel.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

Qwen3 → Qwen3 uses a similar transformer architecture to Qwen2.5.

QwQ-32B → QwQ-32B uses the Qwen2.5 architecture under the hood, so it should be compatible with AirLLM v2.11.0+. Try:

    model = AutoModel.from_pretrained("Qwen/QwQ-32B")

DeepSeek V3 / R2 → ❌ Not officially supported
DeepSeek V3 is a 671B MoE model with a custom architecture (Multi-head Latent Attention + expert routing). AirLLM currently doesn't list it as supported, and its MoE design makes layer-by-layer sharding non-trivial. You'd likely hit errors. For DeepSeek, tools like llama.cpp with GGUF or Ollama are more practical right now.

About RAM (16/32/64GB):
AirLLM's bottleneck is actually disk I/O and VRAM, not system RAM. A 70B model needs ~140GB of disk space after sharding. 16–64GB of system RAM is fine as a buffer, but you still need at least 4–8GB VRAM. CPU-only inference is supported since v2.10.1 but is very slow.
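For a rough sanity check on the ~140GB figure, here is a minimal sketch of the arithmetic. It assumes the weights are stored at 16-bit precision (2 bytes per parameter) and ignores tokenizer files and shard metadata. The function name `estimate_weights_gb` is made up for illustration and is not part of AirLLM.

```python
def estimate_weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough on-disk size of the model weights, in decimal gigabytes.

    Assumption: every parameter takes `bytes_per_param` bytes
    (2.0 for fp16/bf16); file-format overhead is ignored.
    """
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the model names mentioned in this thread.
    for name, size_b in [("70B", 70), ("Qwen2.5-72B", 72), ("QwQ-32B", 32)]:
        print(f"{name}: ~{estimate_weights_gb(size_b):.0f} GB at 16-bit")
```

This estimates disk space only. It says nothing about speed, which depends on how quickly the layer shards can be read back from disk.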
It would be really great if 16/32/64GB of RAM could handle SOTA models.