⚡ Gemini CAG Demo

Cache-Augmented Generation with the Google Gemini API

Load your documents into a context cache once, then ask as many questions as you like while the cache is alive. On every call the cached document tokens are billed at a ~90% discount.

CAG is the cache-first cousin of RAG. Instead of chunking documents, embedding them, and retrieving fragments per query, CAG loads the whole knowledge base into the model's context cache once and reuses it on every call. For knowledge bases that fit in Gemini's long context window, this is simpler (no retrieval step to get wrong) and dramatically cheaper on repeated queries.

This app shows it end to end: upload documents → build a real Gemini context cache → ask questions → watch the savings add up. A Compare mode runs each question both with and without the cache so you can see the cost and latency difference live.

🎬 See it in action

┌─ Gemini CAG ──────────────┐  ┌─ Compare mode: CAG vs full-context ──────────────┐
│ Knowledge base            │  │  ⚡ CAG 77% cheaper · 1.8× faster                │
│  ▸ report.pdf  ✓ cached   │  ├──────────────────────┬───────────────────────────┤
│  8,123 tokens   59:41 ⏳  │  │ ⚡ CAG (cached)       │ 📦 Full-context (no cache)│
│                           │  │ The report covers…   │ The report covers…        │
│ Cumulative savings        │  │ 🧠 8,000 cached      │ ✏️ 8,200 fresh            │
│   $0.004320               │  │ $0.000650            │ $0.002810                 │
│   77% saved vs no-cache   │  └──────────────────────┴───────────────────────────┘
└───────────────────────────┘

📸 Maintainer tip: drop a real screenshot or GIF at docs/demo.png and replace this block — a visual of the savings dashboard + compare mode makes the strongest first impression.

📊 RAG vs CAG, in one table

Aspect	RAG (Retrieval-Augmented Generation)	CAG (Cache-Augmented Generation)
Knowledge prep	Chunk → embed → store in a vector DB	Upload files → create one context cache
Per-query work	Embed query → vector search → stitch context	Reference the cache by name
Moving parts	Embedder, vector DB, chunker, reranker	Just the model + a cache handle
Failure mode	Retrieval misses the relevant chunk	Knowledge base must fit the context window
Cost driver	Re-sends retrieved chunks each call	Cached tokens billed at a deep discount
Best when	Corpus is huge / changes constantly	Corpus is bounded and queried repeatedly

CAG isn't a universal replacement for RAG — it shines when a bounded knowledge base (a contract, a manual, a codebase, a research bundle) is queried many times.

🔧 How the caching actually works

                     ┌──────────────────────────────────────────────┐
  1. Upload files    │  client.files.upload(...)  → Gemini Files API │
                     └──────────────────────────────────────────────┘
                                       │
                     ┌──────────────────────────────────────────────┐
  2. Build cache     │  client.caches.create(                       │
     (ONCE)          │      model, contents=[files], ttl="3600s")   │
                     │  → cachedContents/abc123   (KV state stored)  │
                     └──────────────────────────────────────────────┘
                                       │
                     ┌──────────────────────────────────────────────┐
  3. Ask (N times)   │  client.models.generate_content(             │
                     │      contents=question,                      │
                     │      config=GenerateContentConfig(           │
                     │          cached_content="cachedContents/abc")│
                     │  )  → document tokens reused at ~90% off      │
                     └──────────────────────────────────────────────┘

Only the question is billed as fresh input on each call; the documents come from the cache. The app reads usage_metadata.cached_content_token_count from each response to compute exactly what you saved versus re-sending the documents.
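The three steps in the diagram could be sketched roughly as follows with the google-genai SDK. This is a minimal sketch under stated assumptions, not the app's actual code: the helper names `build_cache` and `ask_cached` are hypothetical, and the config is passed as plain dicts, which the SDK accepts in place of its typed config classes.

```python
def build_cache(client, paths, model="gemini-2.5-flash", ttl_seconds=3600):
    """Upload each file, then create ONE context cache that holds them all."""
    files = [client.files.upload(file=path) for path in paths]
    cache = client.caches.create(
        model=model,
        config={"contents": files, "ttl": f"{ttl_seconds}s"},
    )
    return cache.name  # a handle of the form "cachedContents/..."


def ask_cached(client, cache_name, question, model="gemini-2.5-flash"):
    """Ask one question; only the question travels as fresh input."""
    response = client.models.generate_content(
        model=model,
        contents=question,
        config={"cached_content": cache_name},
    )
    usage = response.usage_metadata
    cached = usage.cached_content_token_count or 0
    # prompt_token_count includes the cached tokens, so subtract them
    # to get the part of the prompt billed at the full input rate.
    fresh = (usage.prompt_token_count or 0) - cached
    return {
        "answer": response.text,
        "cached_tokens": cached,
        "fresh_prompt_tokens": fresh,
    }
```

With a real client (`from google import genai; client = genai.Client(api_key=...)`), calling `build_cache` once and `ask_cached` many times mirrors steps 2 and 3 of the diagram.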

A real run from this app (a document of roughly 8,000 tokens, default gemini-2.5-flash):

Mode	Cached prompt	Fresh prompt	Cost / query	vs no-cache
CAG	8,000 tokens	200 tokens	$0.000650	↓ 77% cheaper
Full-context	0 tokens	8,200 tokens	$0.002810	baseline
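The arithmetic behind the table can be sketched as below. The rates in `Rates` are round placeholder values chosen for illustration, not official Gemini pricing and not necessarily what cag/cost.py uses. The function names are hypothetical too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens. Placeholder numbers, NOT official pricing."""
    input_per_m: float = 1.00
    output_per_m: float = 4.00
    cached_discount: float = 0.90  # the ~90% discount on cached input


def query_cost(fresh_tokens, cached_tokens, output_tokens, rates=Rates()):
    """Cost of one call: fresh input + discounted cached input + output."""
    cached_rate = rates.input_per_m * (1 - rates.cached_discount)
    total = (
        fresh_tokens * rates.input_per_m
        + cached_tokens * cached_rate
        + output_tokens * rates.output_per_m
    )
    return total / 1_000_000


def savings_pct(cag_cost, full_cost):
    """Percentage saved by the cached call relative to the uncached one."""
    if full_cost <= 0:
        return 0.0
    return (full_cost - cag_cost) / full_cost * 100


# The two per-query costs reported in the table above:
print(round(savings_pct(0.000650, 0.002810), 1))  # 76.9
```

The overall saving (about 77%) is lower than the ~90% cache discount because the fresh question tokens and the output tokens are still billed at their full rates.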

✨ Features

🧠 True context caching — uses client.caches.create + cached_content, not just a big prompt.
💰 Live savings dashboard — cumulative cost, cached tokens, and % saved vs no-cache across the session.
⚖️ Compare mode — answers each question with and without the cache, side by side, with the cost/latency delta.
⏳ Cache lifecycle — TTL countdown, one-click extend, and delete to stop storage charges.
📄 Multi-format uploads — PDF, TXT, MD, DOCX, CSV, XLSX, PPTX, JSON (validated server-side).
🔌 Official google-genai SDK, model configurable via env (gemini-2.5-flash by default).
✅ Tested — 27 unit/integration tests, ~93% coverage on the core package, SDK fully mocked.
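As an illustration of the server-side upload validation mentioned above, an extension allow-list might look like this. It is a sketch only: the extension set is taken from the feature list, while `is_allowed` and `split_uploads` are hypothetical names, and the real checks in cag/files.py may do more (for example size or content checks).

```python
from pathlib import Path

# Formats named in the feature list above.
ALLOWED_EXTENSIONS = frozenset(
    {".pdf", ".txt", ".md", ".docx", ".csv", ".xlsx", ".pptx", ".json"}
)


def is_allowed(filename: str) -> bool:
    """Accept a file only if its final extension is on the allow-list."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def split_uploads(filenames):
    """Partition uploaded names into (accepted, rejected), keeping order."""
    accepted = [name for name in filenames if is_allowed(name)]
    rejected = [name for name in filenames if not is_allowed(name)]
    return accepted, rejected
```

Checking only the last suffix means a name like `report.pdf.exe` is rejected, while `Report.PDF` is accepted regardless of case.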

🚀 Quickstart (60 seconds)

git clone https://github.com/makieali/gemini-cag-demo.git
cd gemini-cag-demo

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# edit .env → set GEMINI_API_KEY (free key at https://aistudio.google.com/apikey)

python app.py        # → http://localhost:5069

Then in the browser: drop a document → "Build context cache" → ask away. Toggle Compare mode to watch CAG beat full-context on every query.

🐳 Docker

cp .env.example .env   # set GEMINI_API_KEY
docker compose up --build

🔌 API reference

The browser UI is a thin layer over a small JSON API you can drive directly:

Method	Endpoint	Purpose
`POST`	`/api/kb`	Upload documents (`files[]`) and build a context cache
`GET`	`/api/kb`	Current knowledge-base status + TTL remaining
`POST`	`/api/kb/extend`	Extend the cache TTL
`DELETE`	`/api/kb`	Delete the cache (stops storage cost)
`POST`	`/api/ask`	Ask a question using the cache → answer + usage + cost
`POST`	`/api/compare`	Run the question with and without the cache
`GET`	`/api/stats`	Cumulative savings for the session

Example: ask a question via curl

curl -s localhost:5069/api/ask \
  -H 'Content-Type: application/json' \
  -d '{"question":"What does the contract say about termination?"}' | jq
# → { "answer": "...", "used_cache": true,
#     "usage": { "cached_tokens": 8000, "fresh_prompt_tokens": 200, ... },
#     "cost":  { "total_cost": 0.00065, "savings_pct": 76.9 }, ... }
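The same call can be made from Python using only the standard library. This is a sketch that assumes the endpoint and JSON body shown in the curl example; `build_ask_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5069"  # default address from the Quickstart


def build_ask_request(question, base_url=BASE_URL):
    """Build the POST /api/ask request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url=BASE_URL, timeout=60):
    """Send the question and return the decoded JSON response."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because knowledge bases are stored per session, a script will likely need to carry the session cookie from the upload call over to later requests. That detail is left out of this sketch.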

⚙️ Configuration

All settings come from the environment (see .env.example):

Variable	Default	Description
`GEMINI_API_KEY`	—	Required. Your Gemini API key.
`GEMINI_MODEL`	`gemini-2.5-flash`	Any caching-capable model.
`CACHE_TTL_SECONDS`	`3600`	How long a context cache lives.
`SECRET_KEY`	dev key	Flask session secret — set a real one in prod.
`FLASK_DEBUG`	`false`	Enable Flask debug mode.
`MAX_CONTENT_LENGTH_MB`	`32`	Max upload size.
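One way to read these variables into a typed, immutable object is sketched below. The variable names and defaults come from the table above. The `Config` class, the `load_config` function, and the placeholder secret are assumptions and may not match config.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    gemini_api_key: str
    gemini_model: str
    cache_ttl_seconds: int
    secret_key: str
    flask_debug: bool
    max_content_length_mb: int


def load_config(env=None):
    """Build a Config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return Config(
        gemini_api_key=api_key,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "3600")),
        # Placeholder dev secret; set a real SECRET_KEY in production.
        secret_key=env.get("SECRET_KEY", "dev-only-change-me"),
        flask_debug=env.get("FLASK_DEBUG", "false").lower() in {"1", "true", "yes"},
        max_content_length_mb=int(env.get("MAX_CONTENT_LENGTH_MB", "32")),
    )
```

Failing fast on a missing API key surfaces the problem at startup rather than on the first upload.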

📁 Project layout

gemini-cag-demo/
├── app.py               # Flask routes (thin) — validate, delegate, serialize
├── config.py            # Env-based config
├── cag/                 # All Gemini/CAG logic (no Flask imports)
│   ├── cache_manager.py # create / get / extend / delete context caches
│   ├── generation.py    # cached vs full-context queries + streaming
│   ├── cost.py          # pricing + cached-vs-uncached cost accounting
│   ├── analytics.py     # cumulative session savings (immutable)
│   ├── files.py         # upload + validation
│   ├── store.py         # in-memory per-session KB store
│   └── models.py        # frozen dataclasses
├── templates/index.html
├── static/{css,js}
└── tests/               # pytest, SDK mocked

The cag/ package has no Flask dependency — you can import it into a CLI, a notebook, or another framework.

🧪 Testing

pip install -r requirements.txt
pytest                       # 27 passed, ~93% coverage

Tests mock the Gemini SDK at the client boundary (tests/conftest.py injects a fake google.genai), so the suite runs offline with no API key. The full HTTP journey — build cache → ask → compare → stats → delete, plus validation failures — is covered in tests/test_api.py.
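Injecting a fake `google.genai` can be done with the standard library alone. The sketch below shows one possible approach and is not necessarily how tests/conftest.py does it. The `fake_genai` helper and the fake cache name are hypothetical.

```python
import sys
import types
from contextlib import contextmanager
from unittest.mock import MagicMock, patch


@contextmanager
def fake_genai(cache_name="cachedContents/test"):
    """Temporarily replace google.genai with a stub; yields the fake client."""
    fake_client = MagicMock(name="FakeGeminiClient")
    fake_client.caches.create.return_value.name = cache_name

    genai_module = types.ModuleType("google.genai")
    genai_module.Client = MagicMock(return_value=fake_client)

    google_module = types.ModuleType("google")
    google_module.genai = genai_module

    # patch.dict restores the original sys.modules entries on exit.
    stubs = {"google": google_module, "google.genai": genai_module}
    with patch.dict(sys.modules, stubs):
        yield fake_client


with fake_genai() as fake:
    from google import genai  # resolves to the stub, no network needed

    client = genai.Client(api_key="unused")
    cache = client.caches.create(model="gemini-2.5-flash", config={"ttl": "3600s"})
    print(cache.name)  # cachedContents/test
```

In a pytest suite the same context manager could be wrapped in a fixture so every test gets a fresh fake client.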

📝 Notes & caveats

Pricing is approximate. Rates live in cag/cost.py and may lag the official Gemini pricing. The dashboard illustrates relative savings — verify absolute numbers yourself.
Minimum cache size. Gemini requires a minimum amount of content (~2,048 tokens here) before a cache can be created, so very small documents can't be cached. The exact threshold varies by model; check the current Gemini docs.
State is in-memory. Knowledge bases are stored per-session in process memory (cag/store.py). For multi-process or production use, back it with Redis or a database — the rest of the app only touches the small SessionStore interface.
Caches incur a storage cost per hour they live. Delete caches you're done with (the UI does this for you).
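To make the in-memory caveat concrete, here is what a small store interface with a swappable backend might look like. The method names and the `KnowledgeBase` fields are assumptions made for illustration; the real SessionStore in cag/store.py may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class KnowledgeBase:
    cache_name: str   # e.g. "cachedContents/abc123"
    token_count: int


class SessionStore(Protocol):
    """Hypothetical interface: anything with these methods can back the app."""

    def get(self, session_id: str) -> Optional[KnowledgeBase]: ...
    def put(self, session_id: str, kb: KnowledgeBase) -> None: ...
    def delete(self, session_id: str) -> None: ...


class InMemorySessionStore:
    """Process-local store; its state is lost on restart and not shared."""

    def __init__(self):
        self._items = {}

    def get(self, session_id):
        return self._items.get(session_id)

    def put(self, session_id, kb):
        self._items[session_id] = kb

    def delete(self, session_id):
        self._items.pop(session_id, None)
```

A Redis- or database-backed class with the same three methods could then replace `InMemorySessionStore` without changes to the callers.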

📄 License

Built to demonstrate Gemini's context caching API. ⭐ it if it helped.
