A curated list of open-source AI projects that run well on CPU, no GPU required.
Perfect for makers, indie developers, and local AI experiments.
Last updated: June 2026 — featuring projects from 2025–2026
- Language Models
- Inference Engines & Runtimes
- Image Generation and Editing
- Voice and Audio
- Computer Vision
- Small Models
- AI Assistants & Agents
- Coding Assistants
- Document & Knowledge (RAG)
- Agentic Workflows & Platforms
- Embeddings & Vector Databases
- Creative AI
- Development Tools
- Tips for CPU
- Ollama — "Docker for local LLMs." Run Llama, Mistral, Gemma, DeepSeek with one command. Excellent CPU support, easy CLI, model library.
- GPT4All — Simple interface to run LLMs locally with several CPU-optimized models included.
- LM Studio — Beautiful GUI app for local LLMs, automatically supports CPU execution.
- Text Generation WebUI — Powerful web interface for LLMs with extensive CPU mode options.
- Jan.ai — ChatGPT-like interface that runs 100% offline with a clean, modern UI.
- LocalAI — OpenAI-compatible API for running local models. Drop-in replacement that also supports vision, voice, image gen — no GPU required.
- Kobold.cpp — Lightweight inference engine for GGUF models with built-in web UI.
- Open WebUI — Self-hosted, offline ChatGPT-style interface for Ollama, with RAG, web search, and multi-user support.
- llama.cpp — The gold standard for CPU-optimized LLM inference in C/C++. Powers Ollama, LM Studio, and most local LLM tools.
- llamafile — Mozilla's single-file LLM executable. Distribute and run LLMs as a standalone binary — no installation, no dependencies, no GPU required. Built on llama.cpp, supports CPU inference out of the box.
- mistral.rs — Fast, flexible LLM inference engine in Rust, built on Candle. Run any Hugging Face model or GGUF file with zero config — prebuilt CPU binaries for Linux/Windows and CPU Docker images mean no GPU or CUDA toolkit needed. Smart in-situ quantization (GGUF, GPTQ, AWQ) and hardware-aware tuning optimize for your CPU.
- BitNet — Microsoft's official inference framework for 1-bit LLMs. Extremely efficient on CPU.
- eLLM — Rust-based inference engine that claims to run LLMs faster on CPU than on GPU through aggressive optimization.
- Krasis — Hybrid LLM runtime focusing on efficient execution of larger models on consumer hardware (CPU + limited VRAM).
- IPEX-LLM — Accelerate local LLM inference on Intel CPUs, iGPUs, and NPUs. Seamless integration with llama.cpp, Ollama, HF Transformers.
- ONNX Runtime — Cross-platform ML inference acceleration with CPU-optimized execution providers (OpenVINO, XNNPACK, CoreML).
- OpenVINO — Intel's optimization toolkit for high-performance CPU inference across vision, language, and audio models.
- LLM-D — Kubernetes-native framework for distributed LLM inference serving, built around vLLM.
- CTranslate2 — Fast inference engine for Transformer models. Powers Faster Whisper, optimized for CPU with backends such as Intel MKL and oneDNN.
- Trillim — Local AI stack for CPUs: CLI, Python SDK, and FastAPI server for BitNet and Bonsai (1-bit/ternary) bundles. Includes speech-to-text, text-to-speech, and image generation support.
- Stable Diffusion (CPU mode) — Image generation model that works on CPU (slower but functional).
- Diffusion Bee — User-friendly GUI for Stable Diffusion on macOS, fully CPU compatible.
- InvokeAI — Professional Stable Diffusion interface with excellent CPU support.
- ComfyUI — Node-based UI for image AI pipelines, supports CPU workflow.
- Fooocus — Simplified Stable Diffusion, easier to use than ComfyUI.
- Real-ESRGAN — AI image upscaler, fast on CPU with great results.
- GFPGAN — Restores and improves old/blurry faces in photos, runs efficiently on CPU.
- Upscayl — Cross-platform AI image upscaler with simple GUI. Works great on CPU.
- Whisper.cpp — Highly optimized C/C++ port of OpenAI's Whisper for CPU speech recognition. Among the fastest Whisper implementations on CPU.
- Faster Whisper — Up to 4x faster than original Whisper using CTranslate2. Excellent CPU performance.
- Piper TTS — Fast, local text-to-speech with small voice models (5-20MB). Note: archived but still functional.
- Sherpa-ONNX — Comprehensive speech processing toolkit powered by ONNX Runtime. Speech-to-text, TTS, speaker diarization, VAD, keyword spotting — all on CPU. Cross-platform (x86, ARM, RISC-V, Android, iOS, Raspberry Pi).
- Supertonic — Lightning-fast, on-device, multilingual TTS running natively via ONNX. Python, JS, Rust, Swift bindings.
- MOSS-TTS-Nano — Ultra-compact (0.1B params) multilingual TTS from OpenMOSS. Runs in real time on a 4-core CPU, supports Chinese + English + more, with ONNX CPU inference and voice cloning. Apache-2.0.
- Coqui TTS — Open source text-to-speech engine with many voices and languages. CPU efficient.
- CosyVoice — Multi-lingual large voice generation model from FunAudioLLM. Supports voice cloning.
- Amphion — Open-MMLab's toolkit for Audio, Music, and Speech Generation. Reproducible research with CPU mode.
- Vosk — Offline speech recognition, very lightweight (50MB models).
- Bark (Suno) — Realistic voice generation from text with CPU mode available.
- Qwen3-TTS — Pure C inference engine for Qwen3-TTS. No Python, no PyTorch — just C and BLAS. Supports 0.6B/1.7B models.
- RVC (Voice Conversion) — Real-time voice conversion, CPU compatible.
- Demucs — Separate music into vocals/instruments (CPU mode available).
- MusicGen — Generate music from text descriptions (CPU mode supported).
- MusicGPT — Generate music based on natural language prompts. Runs locally on CPU.
- acestep.cpp — Local AI music generation server with browser UI, powered by GGML. Describe a song + optional lyrics and get stereo 48kHz audio. Runs on CPU via BLAS-accelerated GGML backend with a dedicated CPU build script.
- FunMusic — Fundamental toolkit for music generation, part of the FunAudioLLM ecosystem.
- OpenCV + DNN — Industry-standard vision framework with neural networks, fully CPU capable.
- Ultralytics YOLO — YOLOv8, v9, v10+ with `device=cpu`. Real-time object detection on CPU.
- MediaPipe — Google's library for hand, face, pose, and body tracking on CPU.
- FaceX — Full face stack running entirely in the browser via WebAssembly. Detection, 576-point 3D mesh, recognition, anti-spoof. Zero server needed.
- ONNX Models — Collection of pre-trained, state-of-the-art ONNX models for vision, text, and audio.
- Phi-3 Mini / Phi-4 Mini (3.8B) — Microsoft's ultra-efficient models with excellent quality for their size.
- Gemma 2 (2B) / Gemma 3 (1B, 4B) — Google's compact models, very fast on CPU.
- TinyLlama (1.1B) — Compact Llama-architecture model, runs on 4GB RAM.
- Mistral 7B — Best quality/size ratio, quantized versions run smoothly.
- LFM (Liquid Foundation Models) — Liquid AI's open-weight models with hybrid architecture (convolution + attention). Efficient on CPU, laptops, and edge devices.
- Qwen 2.5 / 3 (3B/7B) — Alibaba's efficient multilingual models. Qwen3 brings improved reasoning.
- DeepSeek-V2-Lite — Efficient Mixture-of-Experts model (16B total, about 2.4B active parameters per token), strong with quantized GGUF.
- StableLM 3B — Stability AI's compact yet capable model.
- SmolLM2 (135M-1.7B) — HuggingFace's tiny models for on-device and CPU inference.
- Stable Diffusion 1.5 — Classic version, lighter than v2/XL.
- TinySD — Distilled version, 50% smaller than SD 1.5.
- SSD-1B — Segmind's distilled SDXL model (about 1.3B parameters), 50% smaller and 60% faster than SDXL.
- Whisper Tiny/Base — 39M/74M params for speech transcription.
- Piper TTS voices — Tiny 5-20MB models for fast local TTS.
- MobileNetV3 — 5.4M params, image classification.
- YOLOv8n (nano) — Smallest YOLO, 3M params for object detection.
- EfficientNet-Lite — Lightweight classification models optimized for CPU.
- all-MiniLM-L6-v2 — 22M params, fast text embeddings.
- BGE-small-en-v1.5 — 33M params, excellent for retrieval.
- gte-small — 33M params, strong English embeddings (Alibaba).
- Cline — Autonomous coding agent as an SDK, IDE extension, or CLI assistant. Works with local LLMs via Ollama/LM Studio.
- smolagents — HuggingFace's barebones library for agents that think in code. Supports local transformers and Ollama models, runs entirely on CPU.
- Open Interpreter — Code-executing AI assistant (works with local LLMs).
- AutoGPT — Autonomous AI agent (supports local models).
- CrewAI — Multi-agent orchestration framework. Deploy autonomous agents that collaborate on complex tasks.
- LangGraph — Stateful, graph-based agent orchestration framework from LangChain.
- Dify — Production-ready platform for agentic workflow development. Visual builder + built-in RAG.
- Flowise — Drag-and-drop visual tool to build LLM apps and AI agents. Self-host with Ollama.
- RAGFlow — Leading open-source RAG engine with agent capabilities. Deep document understanding.
- AnythingLLM — Chat with your documents (PDFs, text), supports local models.
- PrivateGPT — Ask questions to your documents 100% offline.
- CrewAI — Multi-agent orchestration for role-playing AI teams.
- Cline — Autonomous coding agent. VS Code extension + CLI + SDK. Supports Ollama, LM Studio, and any OpenAI-compatible local backend.
- Continue.dev — Open-source VS Code / JetBrains copilot. Use local models (Qwen 2.5 Coder, DeepSeek Coder) for autocomplete and chat.
- Aider — Terminal-based AI pair programming with git integration. Works with local LLMs.
- Tabby — Self-hosted GitHub Copilot alternative. Code completion on CPU with StarCoder models.
- OpenCode — The most-starred open-source AI coding agent of 2026. Designed for fast local development workflows.
- Crush — Terminal-based agentic coding assistant from Charm. Auto-discovers local models from Ollama, LM Studio, litellm, and any OpenAI-compatible backend — run it fully offline on CPU. LSP-enhanced, MCP-extensible, cross-platform (macOS, Linux, Windows, BSD).
- RAGFlow — Deep document understanding RAG engine. PDF, DOCX, Excel — with agentic retrieval.
- Dify — Full-featured LLM app platform with built-in RAG pipeline, knowledge base, and agentic workflow.
- AnythingLLM — All-in-one desktop app for document-grounded conversations and private knowledge bases.
- PrivateGPT — Offline Q&A over your documents (PDFs, text, code).
- MinerU — Transforms complex documents (PDF, HTML, scans) into clean Markdown/JSON for RAG pipelines.
- Docling — IBM's document understanding library. Parses PDF, DOCX, PPTX, images and more into structured Markdown/JSON with layout preservation. Runs fully on CPU via ONNX Runtime with dedicated CPU-only installation.
- VelociRAG — Lightning-fast RAG for AI agents. ONNX-powered, 4-layer fusion, MCP server. No PyTorch needed.
- RAG-Anything — All-in-one RAG framework with multiple retrieval strategies.
- LightRAG — Graph-based retrieval-augmented generation system. Indexes text into entity-relation graphs for efficient retrieval. Uses lightweight local embedding models and works with any local LLM backend (Ollama, llama.cpp). [EMNLP 2025].
- LlamaIndex — Data framework for connecting LLMs to external data sources (APIs, databases, documents).
- Dify — Production-ready platform for building AI agents and workflows. Visual pipeline builder, RAG, MCP support, multi-model.
- Flowise — Low-code/no-code platform to build LLM apps, chatbots, and agents visually.
- n8n — Advanced workflow automation with native AI capabilities and MCP nodes.
- Langflow — Visual framework for building multi-agent and RAG applications.
- Haystack — End-to-end NLP framework for building search, QA, and RAG pipelines.
- Chroma — Lightweight, embedded vector database. Runs entirely on CPU, perfect for local RAG.
- Weaviate — Open-source vector search engine with hybrid search (vector + keyword). Runs on CPU.
- Qdrant — High-performance vector database with rich filtering. CPU-friendly for moderate scale.
- FAISS — Meta's library for efficient similarity search and dense vector clustering. CPU-optimized.
- Voyager — Spotify's approximate nearest neighbor search library. Lightweight and fast on CPU.
- zvec — Alibaba's lightweight, in-process vector database. Blazing-fast similarity search with dense + sparse vectors, full-text search, and hybrid retrieval. Embedded library — no servers, no config, runs on CPU anywhere your code runs. Python, Node.js, Go, Rust SDKs.
- Amphion — Audio, music, and speech generation toolkit. TTS, SVC, music gen — all on CPU.
- MusicGen — Generate music from text descriptions (CPU mode supported).
- FunMusic — Music generation toolkit from FunAudioLLM.
- Diarize — Speaker diarization — "who spoke when?" CPU-only, no API keys, 8x faster than real-time.
- llama.cpp — CPU-optimized inference for LLaMA and compatible models.
- Roop — One-click face swap tool (CPU compatible).
- Transformers (Hugging Face) — Load and run any model with CPU backend (`device="cpu"`).
- Transformers.js — HuggingFace's Transformers for the browser. Run NLP, vision, and audio models directly in JavaScript — no server, no GPU. Powered by ONNX Runtime WebAssembly.
- ONNX Runtime — Accelerate ML inference on CPU with optimizations (XNNPACK, OpenVINO, CoreML).
- OpenVINO — Intel's optimization toolkit for CPU inference across any model.
- BitNet — Official framework for 1-bit LLM inference. Revolutionary efficiency.
- LMDeploy — Model compression and deployment toolkit for efficient CPU serving.
- CTranslate2 — Fast transformer inference on CPU. Powers Faster Whisper and many production systems.
- MLX — Apple's ML framework optimized for Apple Silicon (M-series chips with unified memory). Excellent for local inference.
- Candle — HuggingFace's minimalist ML framework for Rust with CPU-first design. Run LLMs, vision models, and more locally with zero GPU dependency.
- llmfit — Rust CLI tool that detects your hardware and finds the best LLMs for your RAM, CPU, and GPU. One command to right-size models — scores quality, speed, fit, and context for hundreds of models. Supports Ollama, llama.cpp, MLX, LM Studio backends.
- 🔧 Use smaller model versions (`tiny`, `small`, `mini`, `nano`)
- ⚡ Apply quantization (Q4, Q5, Q8) to reduce RAM usage by roughly 50-75% compared with FP16 weights
- 🧩 Use optimized runtimes: GGUF/GGML, ONNX Runtime, or OpenVINO
- 🚀 Enable multi-threading to utilize all CPU cores
- 📉 Reduce resolution/steps in image generation for faster results
- 🔄 Start with batch size 1 for CPU inference (larger batches rarely help once the cores are already saturated)
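The multi-threading tip can be sketched in a few lines of Python. This is a minimal illustration, not any particular runtime's API: `pick_threads` and `thread_env` are hypothetical helper names, and leaving one logical core free is an assumption you may want to change. The three environment variables are the ones read by OpenMP-based runtimes, Intel MKL, and OpenBLAS respectively.

```python
import os


def pick_threads(reserve: int = 1) -> int:
    """Return a thread count that leaves `reserve` logical cores free (assumed policy)."""
    logical = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, logical - reserve)


def thread_env(n: int) -> dict[str, str]:
    """Environment variables read by common CPU math backends."""
    return {
        "OMP_NUM_THREADS": str(n),       # OpenMP-based runtimes
        "MKL_NUM_THREADS": str(n),       # Intel MKL
        "OPENBLAS_NUM_THREADS": str(n),  # OpenBLAS
    }


n_threads = pick_threads()
# Set these before importing numpy, torch, or similar libraries,
# because most backends read them once at load time.
os.environ.update(thread_env(n_threads))
print(f"Using {n_threads} threads")
```

Whether logical or physical core counts work best depends on the workload, so it is worth benchmarking a couple of values rather than trusting the default.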
| Size | Parameters | RAM Needed | Speed on CPU | Use Case |
|---|---|---|---|---|
| Tiny | < 1B | 2-4GB | ⚡⚡⚡⚡ | Testing, edge devices |
| Small | 1-3B | 4-8GB | ⚡⚡⚡ | Daily use, chatbots |
| Medium | 3-7B | 8-16GB | ⚡⚡ | Quality balance |
| Large | 7B+ | 16GB+ | ⚡ | Best quality (slower) |
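As a rough way to use the table, the sketch below picks the largest size tier whose minimum RAM fits on a machine. The tier names and minimum RAM figures are copied from the table; the function name `largest_tier` and the 2 GB of headroom reserved for the operating system are assumptions made for illustration.

```python
# (tier name, parameter range, minimum RAM in GB), taken from the table above
TIERS = [
    ("Tiny", "< 1B", 2),
    ("Small", "1-3B", 4),
    ("Medium", "3-7B", 8),
    ("Large", "7B+", 16),
]


def largest_tier(ram_gb: float, headroom_gb: float = 2.0):
    """Return the biggest tier that fits after reserving headroom, or None."""
    usable = ram_gb - headroom_gb  # headroom for OS and other apps (assumed value)
    fitting = [tier for tier in TIERS if tier[2] <= usable]
    return fitting[-1] if fitting else None


for ram in (4, 8, 16, 32):
    tier = largest_tier(ram)
    label = f"{tier[0]} ({tier[1]})" if tier else "nothing comfortably"
    print(f"{ram} GB RAM -> {label}")
```

Treat the result as a starting point: context length and quantization level shift real memory use in both directions.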
- GGUF (Q4_K_M) — Best for llama.cpp/Ollama (4-bit), excellent quality/size ratio
- GPTQ — Good compression with decent inference speed
- AWQ — Better quality than GPTQ at the same model size
- ONNX — Cross-platform optimization, works with many frameworks
- BitNet (1-bit) — Next-gen extreme quantization, 90% size reduction
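- 1-bit LLMs are here: Microsoft's BitNet b1.58 models use ternary (about 1.58-bit) weights and still deliver surprisingly good quality. The bitnet.cpp project reports CPU speedups of roughly 1.4x to 6x over full-precision baselines, depending on model size and CPU architecture.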
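- MoE models save compute, not RAM: Mixture-of-Experts (DeepSeek, Qwen3-MoE) activate only a fraction of parameters per token, so generation is faster than with a dense model of the same total size, but every expert still has to fit in memory.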
- Apple Silicon is a local-inference powerhouse: MLX and the llama.cpp Metal backend run on the integrated GPU of M1/M2/M3/M4 Macs, which shares unified memory with the CPU, for speeds close to a discrete GPU.
- Hybrid CPU/GPU runtimes: Tools like Krasis automatically split models across available hardware.
- WebAssembly AI: Run models directly in the browser (FaceX, Transformers.js) — zero install.
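The size figures in the lists above can be sanity-checked with simple arithmetic: weight storage is roughly parameters times bits per weight, divided by 8. The sketch below uses nominal bit widths (an assumption, since real GGUF files store extra scale data) and a hypothetical MoE model with 16B total and 2.4B active parameters to show that memory follows the total parameter count while per-token compute follows the active count.

```python
from dataclasses import dataclass

# Nominal bits per weight. Real quantized files carry extra scale data,
# so Q4_K_M, for example, lands somewhat above 4 bits in practice.
BITS = {"FP16": 16.0, "Q8": 8.0, "Q5": 5.0, "Q4": 4.0, "ternary": 1.58}


def weight_gb(params_billion: float, bits: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8


def reduction_vs_fp16(bits: float) -> float:
    """Fraction of FP16 weight storage saved at the given bit width."""
    return 1 - bits / BITS["FP16"]


@dataclass
class MoEModel:
    total_params_b: float   # all experts; these must be held in memory
    active_params_b: float  # parameters actually used per token

    def weight_gb(self, bits: float) -> float:
        return weight_gb(self.total_params_b, bits)

    def active_fraction(self) -> float:
        return self.active_params_b / self.total_params_b


# Dense 7B model at each nominal bit width
for name, bits in BITS.items():
    print(f"{name:8s} {weight_gb(7, bits):6.2f} GB  saves {reduction_vs_fp16(bits):.0%}")

# Hypothetical MoE model: 16B total parameters, 2.4B active per token
moe = MoEModel(total_params_b=16.0, active_params_b=2.4)
print(f"MoE at Q4: {moe.weight_gb(BITS['Q4']):.1f} GB of weights, "
      f"{moe.active_fraction():.0%} of parameters active per token")
```

These numbers cover weights only; the KV cache and runtime buffers add to the total, more so at long context lengths.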
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Run a small model
ollama run phi3:mini
# For transcription with Whisper
git clone https://github.com/ggerganov/whisper.cpp
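cd whisper.cpp
cmake -B build && cmake --build build --config Release
# Download the base model into models/ggml-base.bin
sh ./models/download-ggml-model.sh base
./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav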
# For local coding assistant (Continue)
# Install the VS Code extension, point it to Ollama/local model

Contributions are welcome! Please read the contribution guidelines first.
- Add new projects that work well on CPU
- Fix broken links or outdated information
- Improve documentation and examples
- Keep star counts and descriptions up to date
This list is licensed under CC0 1.0 Universal and follows the Awesome format.
If you find this list helpful, please consider giving it a star on GitHub!
Made with ❤️ by the community | Last updated: June 2026