Load your documents into a context cache once, then ask as many questions as you like while the cache is alive, with the document tokens billed at a ~90% discount.
CAG is the cache-first cousin of RAG. Instead of chunking documents, embedding them, and retrieving fragments per query, CAG loads the whole knowledge base into the model's context cache once and reuses it on every call. For knowledge bases that fit in Gemini's long context window, this is simpler (no retrieval step to get wrong) and dramatically cheaper on repeated queries.
This app shows it end to end: upload documents → build a real Gemini context cache → ask questions → watch the savings add up. A Compare mode runs each question both with and without the cache so you can see the cost and latency difference live.
- See it in action
- RAG vs CAG
- How the caching actually works
- Features
- Quickstart
- API reference
- Configuration
- Project layout
- Testing
- Notes & caveats
┌─ Gemini CAG ──────────────┐ ┌─ Compare mode: CAG vs full-context ──────────────┐
│ Knowledge base │ │ ⚡ CAG 77% cheaper · 1.8× faster │
│ ▸ report.pdf ✓ cached │ ├──────────────────────┬───────────────────────────┤
│ 8,123 tokens 59:41 ⏳ │ │ ⚡ CAG (cached) │ 📦 Full-context (no cache)│
│ │ │ The report covers… │ The report covers… │
│ Cumulative savings │ │ 🧠 8,000 cached │ ✏️ 8,200 fresh │
│ $0.004320 │ │ $0.000650 │ $0.002810 │
│ 77% saved vs no-cache │ └──────────────────────┴───────────────────────────┘
└───────────────────────────┘
📸 Maintainer tip: drop a real screenshot or GIF at docs/demo.png and replace this block. A visual of the savings dashboard and compare mode makes the strongest first impression.
| | RAG (Retrieval-Augmented Generation) | CAG (Cache-Augmented Generation) |
|---|---|---|
| Knowledge prep | Chunk → embed → store in a vector DB | Upload files → create one context cache |
| Per-query work | Embed query → vector search → stitch context | Reference the cache by name |
| Moving parts | Embedder, vector DB, chunker, reranker | Just the model + a cache handle |
| Failure mode | Retrieval misses the relevant chunk | Knowledge base must fit the context window |
| Cost driver | Re-sends retrieved chunks each call | Cached tokens billed at a deep discount |
| Best when | Corpus is huge / changes constantly | Corpus is bounded and queried repeatedly |
CAG isn't a universal replacement for RAG — it shines when a bounded knowledge base (a contract, a manual, a codebase, a research bundle) is queried many times.
┌──────────────────────────────────────────────┐
1. Upload files │ client.files.upload(...) → Gemini Files API │
└──────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────┐
2. Build cache │ client.caches.create( │
(ONCE) │ model, contents=[files], ttl="3600s") │
│ → cachedContents/abc123 (KV state stored) │
└──────────────────────────────────────────────┘
│
┌──────────────────────────────────────────────┐
3. Ask (N times) │ client.models.generate_content( │
│ contents=question, │
│ config=GenerateContentConfig( │
│ cached_content="cachedContents/abc")│
│ ) → document tokens reused at ~90% off │
└──────────────────────────────────────────────┘
Only the question is billed as fresh input on each call; the documents come from the cache. The app reads usage_metadata.cached_content_token_count from each response to compute exactly what you saved versus re-sending the documents.
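The three steps above can be sketched as two small functions. This is a minimal sketch, not the app's actual code: the helper names `build_cache` and `ask` are hypothetical, the `config` dictionaries assume the google-genai SDK accepts plain dicts in place of its config classes, and a `MagicMock` stands in for `genai.Client` so the example runs without an API key.

```python
from unittest.mock import MagicMock


def build_cache(client, model, paths, ttl="3600s"):
    """Steps 1-2: upload each file, then create one cache holding all of them."""
    files = [client.files.upload(file=path) for path in paths]
    cache = client.caches.create(
        model=model,
        config={"contents": files, "ttl": ttl},
    )
    return cache.name  # a handle such as "cachedContents/abc123"


def ask(client, model, cache_name, question):
    """Step 3: send only the question; the documents are referenced by name."""
    response = client.models.generate_content(
        model=model,
        contents=question,
        config={"cached_content": cache_name},
    )
    return response.text, response.usage_metadata.cached_content_token_count


# Offline stand-in for genai.Client (assumption: values below are made up).
fake = MagicMock()
fake.caches.create.return_value.name = "cachedContents/abc123"
fake.models.generate_content.return_value.text = "The report covers..."
fake.models.generate_content.return_value.usage_metadata.cached_content_token_count = 8000

cache_name = build_cache(fake, "gemini-2.5-flash", ["report.pdf"])
answer, cached_tokens = ask(fake, "gemini-2.5-flash", cache_name, "What does it cover?")
```

Against the real service, `fake` would be replaced by a client created with `genai.Client(api_key=...)`. The cache is built once and `ask` is called for every question, which is where the repeated-query savings come from.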
A real run from this app (a document of roughly 8,000 tokens, default gemini-2.5-flash):
| | Cached prompt | Fresh prompt | Cost / query | vs no-cache |
|---|---|---|---|---|
| CAG | 8,000 tokens | 200 tokens | $0.000650 | ↓ 77% cheaper |
| Full-context | 0 | 8,200 tokens | $0.002810 | baseline |
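The figures in the table can be reproduced with simple per-token arithmetic. The sketch below is illustrative only: the three rates and the 140-token answer length are assumptions chosen because they match the table, not values read from cag/cost.py.

```python
# Assumed USD rates per 1M tokens (illustrative; not read from cag/cost.py).
INPUT_RATE = 0.30
CACHED_RATE = 0.03   # about 90% below the fresh input rate
OUTPUT_RATE = 2.50


def query_cost(cached_tokens, fresh_tokens, output_tokens):
    """Cost of one call: cached and fresh prompt tokens are priced separately."""
    return (
        cached_tokens * CACHED_RATE
        + fresh_tokens * INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000


OUTPUT_TOKENS = 140  # assumed answer length, identical for both runs

cag = query_cost(cached_tokens=8_000, fresh_tokens=200, output_tokens=OUTPUT_TOKENS)
full = query_cost(cached_tokens=0, fresh_tokens=8_200, output_tokens=OUTPUT_TOKENS)
savings_pct = (1 - cag / full) * 100

print(f"CAG ${cag:.6f} | full-context ${full:.6f} | {savings_pct:.1f}% cheaper")
# → CAG $0.000650 | full-context $0.002810 | 76.9% cheaper
```

Under these assumptions the overall saving is about 77% rather than 90% because the discount applies only to the cached document tokens; the question and the generated answer are billed at full price in both runs.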
- 🧠 True context caching: uses client.caches.create + cached_content, not just a big prompt.
- 💰 Live savings dashboard: cumulative cost, cached tokens, and % saved vs no-cache across the session.
- ⚖️ Compare mode: answers each question with and without the cache, side by side, with the cost/latency delta.
- ⏳ Cache lifecycle: TTL countdown, one-click extend, and delete to stop storage charges.
- 📄 Multi-format uploads: PDF, TXT, MD, DOCX, CSV, XLSX, PPTX, JSON (validated server-side).
- 🔌 Official google-genai SDK, model configurable via env (gemini-2.5-flash by default).
- ✅ Tested: 27 unit/integration tests, ~93% coverage on the core package, SDK fully mocked.
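The TTL countdown in the feature list comes down to subtracting the current time from the cache's expiry. A minimal sketch, assuming the expiry is available as a timezone-aware `datetime`; the helper name `ttl_remaining` is hypothetical and not taken from the app.

```python
from datetime import datetime, timedelta, timezone


def ttl_remaining(expire_time, now=None):
    """Return the time left before expiry as "MM:SS", clamped at "00:00"."""
    now = now or datetime.now(timezone.utc)
    seconds = max(0, int((expire_time - now).total_seconds()))
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


created = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
expires = created + timedelta(seconds=3600)  # one hour, the default TTL

# 19 seconds after creation, 3581 seconds are left.
label = ttl_remaining(expires, now=created + timedelta(seconds=19))
print(label)  # → 59:41
```

Extending the cache would move `expires` forward, and the same function would keep working unchanged.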
git clone https://github.com/makieali/gemini-cag-demo.git
cd gemini-cag-demo
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# edit .env → set GEMINI_API_KEY (free key at https://aistudio.google.com/apikey)
python app.py # → http://localhost:5069
Then in the browser: drop a document → "Build context cache" → ask away. Toggle Compare mode to watch CAG beat full-context on every query.
Or with Docker:
cp .env.example .env # set GEMINI_API_KEY
docker compose up --build
The browser UI is a thin layer over a small JSON API you can drive directly:
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/kb | Upload documents (files[]) and build a context cache |
| GET | /api/kb | Current knowledge-base status + TTL remaining |
| POST | /api/kb/extend | Extend the cache TTL |
| DELETE | /api/kb | Delete the cache (stops storage cost) |
| POST | /api/ask | Ask a question using the cache → answer + usage + cost |
| POST | /api/compare | Run the question with and without the cache |
| GET | /api/stats | Cumulative savings for the session |
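The endpoints above can also be called from Python using only the standard library. The sketch below is a hedged example for /api/ask: the function names are hypothetical, the base URL assumes the local port 5069 used by python app.py, and the request body assumes the single "question" field.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5069"  # assumed local dev address


def build_ask_request(question, base_url=BASE_URL):
    """Build (but do not send) the POST /api/ask request."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        f"{base_url}/api/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request to a running server and decode the JSON reply."""
    with urlopen(build_ask_request(question)) as resp:
        return json.load(resp)


# Building the request needs no server; calling ask() does.
req = build_ask_request("What does the contract say about termination?")
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without the app running.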
Example: ask a question via curl
curl -s localhost:5069/api/ask \
-H 'Content-Type: application/json' \
-d '{"question":"What does the contract say about termination?"}' | jq
# → { "answer": "...", "used_cache": true,
# "usage": { "cached_tokens": 8000, "fresh_prompt_tokens": 200, ... },
#     "cost": { "total_cost": 0.00065, "savings_pct": 76.9 }, ... }
All settings come from the environment (see .env.example):
| Variable | Default | Description |
|---|---|---|
| GEMINI_API_KEY | (none) | Required. Your Gemini API key. |
| GEMINI_MODEL | gemini-2.5-flash | Any caching-capable model. |
| CACHE_TTL_SECONDS | 3600 | How long a context cache lives. |
| SECRET_KEY | dev key | Flask session secret. Set a real one in prod. |
| FLASK_DEBUG | false | Enable Flask debug mode. |
| MAX_CONTENT_LENGTH_MB | 32 | Max upload size. |
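One way such environment-based settings might be loaded is shown below. This is a sketch under assumptions, not the contents of config.py: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    gemini_model: str
    cache_ttl_seconds: int
    flask_debug: bool
    max_content_length_mb: int


def load_settings(env=os.environ):
    """Read settings from a mapping, applying the documented defaults."""
    key = env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=key,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "3600")),
        flask_debug=env.get("FLASK_DEBUG", "false").lower() == "true",
        max_content_length_mb=int(env.get("MAX_CONTENT_LENGTH_MB", "32")),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"GEMINI_API_KEY": "test-key", "CACHE_TTL_SECONDS": "600"})
```

Failing fast on a missing API key surfaces misconfiguration at startup rather than on the first upload.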
gemini-cag-demo/
├── app.py # Flask routes (thin) — validate, delegate, serialize
├── config.py # Env-based config
├── cag/ # All Gemini/CAG logic (no Flask imports)
│ ├── cache_manager.py # create / get / extend / delete context caches
│ ├── generation.py # cached vs full-context queries + streaming
│ ├── cost.py # pricing + cached-vs-uncached cost accounting
│ ├── analytics.py # cumulative session savings (immutable)
│ ├── files.py # upload + validation
│ ├── store.py # in-memory per-session KB store
│ └── models.py # frozen dataclasses
├── templates/index.html
├── static/{css,js}
└── tests/ # pytest, SDK mocked
The cag/ package has no Flask dependency — you can import it into a CLI, a notebook, or another framework.
pip install -r requirements.txt
pytest # 27 passed, ~93% coverage
Tests mock the Gemini SDK at the client boundary (tests/conftest.py injects a fake google.genai), so the suite runs offline with no API key. The full HTTP journey (build cache → ask → compare → stats → delete, plus validation failures) is covered in tests/test_api.py.
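Injecting a fake module at the import boundary can look roughly like this. It is an illustrative sketch of the general technique, not the project's conftest.py: `install_fake_genai` is a hypothetical helper, and a real pytest fixture would more likely use `monkeypatch.setitem(sys.modules, ...)` so the change is undone after each test.

```python
import sys
import types
from unittest.mock import MagicMock


def install_fake_genai():
    """Register a stand-in google.genai module and return its fake client."""
    fake_client = MagicMock()
    fake_client.caches.create.return_value = types.SimpleNamespace(
        name="cachedContents/test"
    )

    genai = types.ModuleType("google.genai")
    genai.Client = MagicMock(return_value=fake_client)

    # Reuse an already-imported "google" package if there is one.
    google = sys.modules.get("google") or types.ModuleType("google")
    google.genai = genai
    sys.modules["google"] = google
    sys.modules["google.genai"] = genai
    return fake_client


fake_client = install_fake_genai()

from google import genai  # resolves to the fake module registered above

client = genai.Client(api_key="unused")
cache = client.caches.create(model="gemini-2.5-flash", config={})
```

Because the code under test only ever sees the `Client` object, every SDK call can be asserted on through `fake_client` without any network access.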
- Pricing is approximate. Rates live in cag/cost.py and may lag the official Gemini pricing. The dashboard illustrates relative savings; verify absolute numbers yourself.
- Minimum cache size. Gemini requires ~2,048 tokens of content before a cache can be created (the exact minimum depends on the model); very small documents can't be cached.
- State is in-memory. Knowledge bases are stored per-session in process memory (cag/store.py). For multi-process or production use, back it with Redis or a database; the rest of the app only touches the small SessionStore interface.
- Caches incur a storage cost for every hour they live. Delete caches you're done with (the UI does this for you).
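Because storage is billed by the hour, a cache only pays off if it is queried often enough. The rough break-even estimate below uses assumed rates (they are illustrative, not taken from cag/cost.py or verified against current Gemini pricing) and ignores the one-time cost of creating the cache.

```python
import math

# Assumed USD rates (illustrative; check current Gemini pricing).
INPUT_RATE = 0.30    # per 1M fresh input tokens
CACHED_RATE = 0.03   # per 1M cached input tokens
STORAGE_RATE = 1.00  # per 1M cached tokens per hour of storage


def break_even_queries(doc_tokens, hours=1.0):
    """Fewest queries within `hours` whose savings cover the storage charge."""
    storage_cost = doc_tokens * STORAGE_RATE * hours / 1_000_000
    saved_per_query = doc_tokens * (INPUT_RATE - CACHED_RATE) / 1_000_000
    return math.ceil(storage_cost / saved_per_query)


queries = break_even_queries(8_000)
print(queries)  # → 4
```

With these assumed rates the document size cancels out, so roughly four queries per cached hour are needed before the cache is cheaper than re-sending the documents. Below that, deleting the cache early is the cheaper choice.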
MIT © 2026 Muhammad Ali