A local FastAPI application that answers questions over your own documents using Retrieval-Augmented Generation (RAG). Upload files through a browser UI (or the API), ask multi-turn questions, and get answers grounded in retrieved chunks with source citations.
- Hybrid retrieval: dense embeddings (semantic) plus BM25 (lexical), merged with Reciprocal Rank Fusion (RRF) by default.
- Configurable retrieval: switch between `semantic`, `lexical`, and `hybrid` modes via environment variables.
- Multi-turn chat: the API accepts conversation history so follow-up questions keep context.
- Source citations: responses include file names, page numbers, and reranker confidence (%) when re-ranking is enabled.
- Document upload: index `.docx`, `.pdf`, `.txt`, `.md`, and `.pptx` files.
- In-memory index: simple setup for local development and prototyping (no external vector database).
flowchart TB
subgraph START["1 · Server startup"]
MAIN["main.py<br/>Uvicorn :8000"] --> APP["FastAPI app<br/>app/app.py"]
APP --> LIFE["Lifespan hook"]
LIFE --> CLEAR["Clear context_files/"]
LIFE --> RESET0["reset_chain()<br/>no index in memory"]
end
subgraph UPLOAD["2 · Upload documents"]
UI1["Browser UI<br/>frontend/static/app.js"] -->|POST /upload| UP_EP["upload_files()"]
UP_EP --> SAVE["Save to RAG_UPLOADED_DIR<br/>(default: context_files/)"]
SAVE --> RESET1["reset_chain()<br/>index invalidated"]
end
subgraph INDEX["3 · Index documents"]
UI2["Click Index documents"] -->|POST /index| IDX_EP["index_documents()"]
IDX_EP --> BUILD["build_rag_chain()<br/>rag/rag.py"]
BUILD --> STORE["In-memory RagWrapper<br/>vector store + BM25 + agent"]
end
subgraph CHAT["4 · Chat / query"]
UI3["User asks question<br/>+ conversation history"] -->|POST /query| Q_EP["query()"]
Q_EP --> RESP["RagWrapper.get_response()"]
RESP --> OUT["Answer + source citations"]
OUT --> UI4["Render reply + source pills<br/>(file · page · confidence %)"]
end
START --> UPLOAD
UPLOAD --> INDEX
INDEX --> CHAT
flowchart LR
FILES[("context_files/<br/>pdf · docx · txt · md · pptx")] --> LOAD["Load & parse<br/>LangChain loaders"]
LOAD --> CHUNK["Chunk<br/>1000 chars · 200 overlap"]
CHUNK --> EMB["Embed chunks<br/>gemini-embedding-2-preview"]
CHUNK --> BM25["Build BM25 index<br/>same chunks"]
EMB --> VDB[("InMemoryVectorStore<br/>cosine similarity")]
BM25 --> LEX[("BM25Retriever")]
VDB --> BUNDLE["RetrieverBundle<br/>mode · k · fetch_k · lexical_weight"]
LEX --> BUNDLE
CHUNK --> AGENT["LangChain agent<br/>create_agent()"]
LLM["Gemini gemini-2.5-flash-lite"] --> AGENT
MW["Prompt middleware<br/>dynamic system prompt"] --> AGENT
BUNDLE --> WRAP["RagWrapper<br/>agent + retriever"]
AGENT --> WRAP
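
The indexing diagram above names 1000-character chunks with a 200-character overlap. The sketch below shows what that windowing means in plain Python. It is an illustration only: `chunk_text` is a hypothetical helper, and the project itself most likely relies on a LangChain text splitter, which may also split on separators rather than at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sizes = [len(c) for c in chunk_text("x" * 2500)]
print(sizes)  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.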
flowchart TB
REQ["Request<br/>question + history"] --> MSGS["Build messages list<br/>history + current question"]
MSGS --> CHAIN["get_chain() → RagWrapper"]
CHAIN --> Q["Extract last user message"]
Q --> RET["retrieve_documents()<br/>src/retrieval/hybrid.py"]
RET --> MODE{RAG_RETRIEVAL_MODE}
MODE -->|semantic| SEM["Vector search<br/>top-k cosine"]
MODE -->|lexical| LEX["BM25 keyword search<br/>top-k"]
MODE -->|hybrid| SEM2["Semantic branch<br/>fetch 2×k candidates"]
MODE -->|hybrid| LEX2["Lexical branch<br/>fetch 2×k candidates"]
SEM2 --> RRF["RRF merge<br/>weights: 1.0 semantic · lexical_weight BM25"]
LEX2 --> RRF
RRF --> TOPK["Top-k candidates"]
SEM --> TOPK
LEX --> TOPK
TOPK --> RERANK["Optional rerank + score filter<br/>src/retrieval/rerank.py"]
RERANK --> EMPTY{Chunks after filter?}
EMPTY -->|no| NODOCS["Return: no relevant docs<br/>(skip LLM)"]
EMPTY -->|yes| FINAL["Final chunks"]
FINAL --> CITE["documents_to_sources()<br/>dedupe by file + page · confidence_pct"]
FINAL --> STATE["Agent invoke<br/>state: messages + retrieved_docs"]
STATE --> PROMPT["Prompt middleware<br/>system_prompt.txt + chunk text"]
PROMPT --> GEN["Gemini generates answer"]
GEN --> RES["QueryResponse<br/>answer + sources"]
NODOCS --> RES
| Layer | Key files | Role |
|---|---|---|
| Entry | `main.py` | Starts Uvicorn on port 8000 |
| API | `app/app.py`, `app/schemas.py` | Routes: `/`, `/upload`, `/index`, `/index/status`, `/query`, `/health` |
| Frontend | `frontend/templates/index.html`, `frontend/static/app.js` | Upload UI, index button, multi-turn chat, citation pills |
| RAG core | `rag/rag.py` | Load → chunk → embed → build agent → `RagWrapper` |
| Retrieval | `src/retrieval/hybrid.py`, `src/retrieval/rerank.py` | Semantic / lexical / hybrid (RRF); optional Discovery Engine rerank and score filtering |
| Prompt | `src/prompt/prompt_manager.py` | Injects retrieved chunks into system prompt |
| LLM | `src/llm/gemini.py` | `gemini-2.5-flash-lite` via LangChain |

- Start server: `python3 main.py`; the upload directory is cleared and no index exists yet.
- Upload: the UI or `POST /upload` saves supported files to `context_files/`; any existing index is cleared.
- Index: the UI or `POST /index` loads files, chunks them, embeds them, builds the BM25 index and vector store, and creates the in-memory `RagWrapper`.
- Ask: the UI or `POST /query` sends the question plus prior turns (`history` does not include the current question).
- Retrieve once: `RagWrapper` runs hybrid, semantic, or lexical retrieval on the latest user message, then optionally reranks and filters by score.
- Generate: if no chunks pass the score filter, the API returns a fixed “no relevant documents” message without calling the LLM. Otherwise, retrieved text is injected into the system prompt and the Gemini agent produces the answer.
- Respond: the API returns `{ answer, sources }`; the UI shows source pills (file · p. N · confidence %).
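
The Ask, Retrieve, Generate, and Respond steps can be sketched as a single function. This is a simplified illustration, not the code in `rag/rag.py`: `answer_query`, the `retrieve` and `generate` callables, and the chunk dictionary shape are all assumptions made for the example.

```python
from typing import Callable

NO_DOCS_MESSAGE = "No relevant documents were found for your question."

Message = dict[str, str]
Chunk = dict[str, object]


def answer_query(
    question: str,
    history: list[Message],
    retrieve: Callable[[str], list[Chunk]],
    generate: Callable[[list[Message], list[Chunk]], str],
) -> dict:
    # history holds prior turns only, so the current question is appended here.
    messages = [*history, {"role": "user", "content": question}]
    # Only the latest user message drives retrieval.
    query = next(m["content"] for m in reversed(messages) if m["role"] == "user")
    chunks = retrieve(query)  # retrieval runs once per query
    if not chunks:
        # Nothing survived the score filter: skip the LLM entirely.
        return {"answer": NO_DOCS_MESSAGE, "sources": []}
    answer = generate(messages, chunks)
    # The same chunks feed both the prompt and the citation list.
    sources = [{"file": c["file"], "page": c.get("page")} for c in chunks]
    return {"answer": answer, "sources": sources}
```

Passing the retriever and the LLM in as callables keeps the control flow testable without a Gemini key.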
Key behaviors:

- Retrieval runs once per query in `RagWrapper.get_response()`: the same chunks feed both the LLM prompt and the citation list.
- Multi-turn chat: prior turns go in `history`; only the latest user message drives retrieval.
- Hybrid mode (default): semantic and BM25 each fetch candidates, then RRF merges them into one pool. With `RAG_RERANK_ENABLED=true`, Discovery Engine re-ranks that pool, then a score filter trims weak tail hits before chunks reach the LLM and citations.
- Rerank score filtering: after reranking, chunks must pass two rules (configurable via env): an absolute floor (`RAG_RERANK_MIN_SCORE`, default `0.12`) and a gap cutoff (`RAG_RERANK_GAP_RATIO`, default `0.45`; chunks stop being kept once the next score falls below 45% of the previous kept score). If even the top chunk is below the floor, retrieval returns nothing and the user sees “No relevant documents were found for your question.”
- Citation confidence: when reranking is on, each source includes `confidence_pct` (rerank score × 100, rounded) in the API and UI.
- Everything is in-memory: indexes are rebuilt on each `POST /index`; uploads invalidate the index until re-indexing.
- Strict grounding: `system_prompt.txt` instructs the model to answer only when confident in the retrieved context.

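
The two rerank filtering rules (absolute floor and gap cutoff) can be sketched as follows. This is a minimal reading of the rules as described here, not the code in `src/retrieval/rerank.py`; `filter_by_rerank_score` and the `(chunk, score)` tuple shape are hypothetical.

```python
def filter_by_rerank_score(
    scored: list[tuple[str, float]],
    min_score: float = 0.12,   # RAG_RERANK_MIN_SCORE
    gap_ratio: float = 0.45,   # RAG_RERANK_GAP_RATIO, 0 disables the gap rule
) -> list[tuple[str, float]]:
    """Keep the head of the ranking until a score fails the floor or the gap rule."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept: list[tuple[str, float]] = []
    for chunk, score in ranked:
        if score < min_score:
            break  # sorted descending, so every later score is below the floor too
        if kept and gap_ratio > 0 and score < gap_ratio * kept[-1][1]:
            break  # score dropped below gap_ratio of the previous kept score
        kept.append((chunk, score))
    return kept


hits = [("a", 0.90), ("b", 0.80), ("c", 0.30), ("d", 0.05)]
print([c for c, _ in filter_by_rerank_score(hits)])               # ['a', 'b']
print([c for c, _ in filter_by_rerank_score(hits, gap_ratio=0)])  # ['a', 'b', 'c']
```

In the first call, `c` is dropped because 0.30 is below 0.45 × 0.80 = 0.36. With the gap rule disabled, only `d` falls under the 0.12 floor.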
Semantic search uses cosine similarity over embedding vectors (exact brute-force search in memory). This is appropriate for small corpora; for very large indexes, it would be better to move to an approximate nearest-neighbor store (e.g. HNSW or IVF in FAISS, Qdrant, or pgvector).
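
To make the cost of exact search concrete, here is a brute-force cosine top-k in plain Python. It is a sketch of the technique only; the app itself delegates this to the in-memory vector store, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as unrelated


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Score every stored vector against the query: O(n) work per query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print([i for i, _ in top_k([1.0, 0.0], vectors, k=2)])  # [0, 2]
```

Every query touches every chunk embedding, which is why this approach stops scaling once the corpus grows large.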
| Path | Role |
|---|---|
| `main.py` | Entrypoint: runs Uvicorn with reload |
| `app/` | FastAPI routes, request/response schemas, static frontend mount |
| `rag/` | RAG pipeline: document loading, chunking, vector store, chain lifecycle |
| `src/llm/` | LLM provider interface and Gemini implementation |
| `src/prompt/` | Dynamic prompt middleware and `system_prompt.txt` |
| `src/retrieval/` | Hybrid retrieval (semantic, BM25, RRF merge) and Discovery Engine rerank + score filtering |
| `frontend/` | HTML template and static UI assets |
| `context_files/` | Uploaded documents used for retrieval (gitignored) |

- Python 3.10+
- A Google AI API key with access to Gemini embedding and chat models

With conda:

conda create -n rag-chatbot python=3.11 -y
conda activate rag-chatbot
pip install -e .

Or with venv:

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Create a `.env` file at the repository root (or export variables in your shell).

| Variable | Required | Default | Description |
|---|---|---|---|
| `GOOGLE_API_KEY` | Yes | none | API key for Gemini embeddings and chat |
| `RAG_UPLOADED_DIR` | No | `context_files/` | Directory for uploaded and indexed documents |
| `RAG_RETRIEVAL_MODE` | No | `hybrid` | `semantic`, `lexical`, or `hybrid` |
| `RAG_LEXICAL_WEIGHT` | No | `0.5` | Weight of the BM25 branch in hybrid RRF merge (semantic branch is 1.0) |
| `RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL` | No | `8` | Final chunk count after retrieval (passed to `build_rag_chain`) |
| `RAG_NUMBER_OF_CHUNKS_PER_BRANCH` | No | `16` | Candidates fetched per branch before RRF merge (when rerank is off) |
| `RAG_RERANK_ENABLED` | No | `false` | Re-rank first-stage hits with Discovery Engine when `true` |
| `GOOGLE_CLOUD_PROJECT` | When reranking | none | GCP project id for Discovery Engine Ranking API (ADC auth) |
| `RAG_RERANK_MODEL` | No | `semantic-ranker-default@latest` | Discovery Engine ranking model |
| `RAG_RERANK_LOCATION` | No | `global` | Discovery Engine location for the ranking config |
| `RAG_RERANK_CANDIDATES` | No | `max(2×k, k+4)` | First-stage pool size before re-ranking (max 200) |
| `RAG_RERANK_MIN_SCORE` | No | `0.12` | Drop chunks below this rerank score (0–1); if the top chunk fails, keep none |
| `RAG_RERANK_GAP_RATIO` | No | `0.45` | Stop keeping chunks when the next score falls below this ratio of the previous kept score; set to `0` to disable |
| `ENABLE_PRINT_DEBUG` | No | `False` | Log retrieval and message debug output when true |
| `LOGFIRE_ENABLED` | No | `true` | Send traces to Logfire when `true` |
| `LOGFIRE_SERVICE_NAME` | No | `rag-chatbot` | Service name shown in Logfire |
| `LOGFIRE_USER_ID` | No | `dev-user` | User id attached to every trace (placeholder until auth exists) |
| `LOGFIRE_INSTRUMENT_LANGCHAIN` | No | `true` | Export LangChain/LangGraph spans via OpenTelemetry |

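
As an illustration of how the retrieval variables and their defaults fit together, the sketch below reads them into a small settings object. It is not the project's own loader: `RetrievalSettings`, `load_settings`, and the set of strings accepted as "true" are assumptions, and `k` is taken to mean `RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

MAX_RERANK_CANDIDATES = 200  # documented cap for RAG_RERANK_CANDIDATES


@dataclass(frozen=True)
class RetrievalSettings:
    mode: str
    lexical_weight: float
    final_k: int
    per_branch: int
    rerank_enabled: bool
    rerank_candidates: int
    rerank_min_score: float
    rerank_gap_ratio: float


def _as_bool(value: str) -> bool:
    # Assumption: the spellings the app accepts as "true" are not documented.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> RetrievalSettings:
    env = os.environ if env is None else env
    mode = env.get("RAG_RETRIEVAL_MODE", "hybrid").strip().lower()
    if mode not in {"semantic", "lexical", "hybrid"}:
        raise ValueError(f"unsupported RAG_RETRIEVAL_MODE: {mode!r}")
    k = int(env.get("RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL", "8"))
    default_pool = max(2 * k, k + 4)  # documented default pool size
    pool = int(env.get("RAG_RERANK_CANDIDATES", str(default_pool)))
    return RetrievalSettings(
        mode=mode,
        lexical_weight=float(env.get("RAG_LEXICAL_WEIGHT", "0.5")),
        final_k=k,
        per_branch=int(env.get("RAG_NUMBER_OF_CHUNKS_PER_BRANCH", "16")),
        rerank_enabled=_as_bool(env.get("RAG_RERANK_ENABLED", "false")),
        rerank_candidates=min(pool, MAX_RERANK_CANDIDATES),
        rerank_min_score=float(env.get("RAG_RERANK_MIN_SCORE", "0.12")),
        rerank_gap_ratio=float(env.get("RAG_RERANK_GAP_RATIO", "0.45")),
    )
```

With no variables set, `load_settings({})` yields hybrid mode, 8 final sources, and a rerank pool of 16 (`max(2×8, 8+4)`).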
Example .env:
GOOGLE_API_KEY=your_key_here
RAG_UPLOADED_DIR=context_files
RAG_RETRIEVAL_MODE=hybrid
RAG_LEXICAL_WEIGHT=0.5
RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL=8
RAG_NUMBER_OF_CHUNKS_PER_BRANCH=16
# Optional Discovery Engine re-ranking (requires ADC: gcloud auth application-default login)
# RAG_RERANK_ENABLED=true
# GOOGLE_CLOUD_PROJECT=your-gcp-project-id
# RAG_RERANK_CANDIDATES=24
# RAG_RERANK_MIN_SCORE=0.12
# RAG_RERANK_GAP_RATIO=0.45
ENABLE_PRINT_DEBUG=false

`RAG_RETRIEVAL_MODE` accepts three values:

- `semantic`: embedding similarity only (cosine via in-memory vector store)
- `lexical`: BM25 keyword search only
- `hybrid`: run both, merge ranks with RRF (recommended)

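
A weighted RRF merge for hybrid mode can be sketched as below. The semantic branch weight of 1.0 and the lexical weight come from this README; the smoothing constant `rrf_c = 60` is the value conventionally used with RRF and is an assumption here, as is the function name. The actual merge lives in `src/retrieval/hybrid.py` and may differ.

```python
from collections import defaultdict


def weighted_rrf(
    semantic_ids: list[str],
    lexical_ids: list[str],
    lexical_weight: float = 0.5,  # RAG_LEXICAL_WEIGHT
    k: int = 8,                   # final number of sources
    rrf_c: int = 60,              # assumed smoothing constant
) -> list[str]:
    """Merge two ranked id lists: each branch adds weight / (rrf_c + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for weight, ranking in ((1.0, semantic_ids), (lexical_weight, lexical_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rrf_c + rank)
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:k]


semantic = ["a", "b", "c"]
lexical = ["c", "d", "a"]
print(weighted_rrf(semantic, lexical))                      # ['a', 'c', 'b', 'd']
print(weighted_rrf(semantic, lexical, lexical_weight=2.0))  # ['c', 'a', 'd', 'b']
```

Raising the lexical weight moves BM25's top hit `c` ahead of the semantic favorite `a`, and lifts the lexical-only hit `d` above the semantic-only hit `b`.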
The app sends traces to Logfire when enabled:
- Install the SDK: `pip install logfire`
- Authenticate and select a project: `logfire auth`, then `logfire projects use <project>`
- Install the FastAPI and HTTPX extras: `pip install uvicorn 'logfire[fastapi]'` and `pip install 'logfire[httpx]'`. `langchain-google-genai` (`ChatGoogleGenerativeAI`, `GoogleGenerativeAIEmbeddings`) uses HTTPX under the hood for chat and embedding calls to Google’s API.
- Run the server: `python3 main.py`

What is instrumented:
- HTTP routes: `POST /query`, `/index`, `/upload`, etc. (`/health` and `/static/*` are excluded)
- Outbound HTTP: Gemini API calls via HTTPX
- LangChain: agent, retrieval, and LLM spans when `LOGFIRE_INSTRUMENT_LANGCHAIN=true` (default)
- User context: `user_id` on every span via baggage (`dev-user` by default; override with `LOGFIRE_USER_ID`)

Set LOGFIRE_ENABLED=false to disable sending traces without removing the dependency.
Filter by user in Logfire Live view: attributes->>'user_id' = 'dev-user'
From the repository root:
python3 main.py

Then open:
- UI: http://127.0.0.1:8000/
- API docs: http://127.0.0.1:8000/docs
On startup, the app clears the upload directory and resets the in-memory index so each run starts with an empty document set. Upload files again after restarting the server.
`/health` returns `{"status": "ok"}`.

`POST /upload` uploads one or more files for indexing. Supported extensions: `.docx`, `.pdf`, `.txt`, `.md`, `.pptx`.
Response:
{
"saved": ["report.pdf", "notes.txt"],
"skipped": ["image.png"]
}

Re-uploading clears the current index; call `POST /index` again before querying.

`/index/status` returns whether documents are indexed and how many supported files are on disk.
Response:
{
"indexed": true,
"file_count": 2
}

`POST /index` builds the in-memory index from uploaded files. It is required before querying.
Response:
{
"documents": 2,
"chunks": 18
}

`POST /query` asks a question over the indexed documents.
Request:
{
"question": "What is the refund policy?",
"history": [
{ "role": "user", "content": "Tell me about billing." },
{ "role": "assistant", "content": "Billing is handled monthly..." }
]
}

`history` contains prior turns in order and does not include the current question.
Response:
{
"answer": "Refunds are available within 30 days...",
"sources": [
{
"file": "policy.pdf",
"path": "/path/to/context_files/policy.pdf",
"page": 3,
"confidence_pct": 87
}
]
}

`confidence_pct` is present when re-ranking is enabled (Discovery Engine score × 100, rounded). It is omitted when reranking is off.
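
The shape of the `sources` list (deduplicated by file and page, with `confidence_pct` only when a rerank score exists) can be sketched like this. The function name matches the one shown in the query flow, but the body below is an illustrative re-implementation: the chunk dictionary keys (`path`, `page`, `rerank_score`) and the `null` page for unpaged files are assumptions.

```python
import os


def documents_to_sources(chunks: list[dict]) -> list[dict]:
    """Collapse chunks into one citation per (file, page) pair, in ranked order."""
    sources: list[dict] = []
    seen: set[tuple] = set()
    for chunk in chunks:
        key = (chunk["path"], chunk.get("page"))
        if key in seen:
            continue  # a better-ranked chunk from the same page is already cited
        seen.add(key)
        source = {
            "file": os.path.basename(chunk["path"]),
            "path": chunk["path"],
            "page": chunk.get("page"),
        }
        score = chunk.get("rerank_score")
        if score is not None:  # omitted entirely when reranking is off
            source["confidence_pct"] = round(score * 100)
        sources.append(source)
    return sources
```

Because the input is already ranked, keeping the first chunk per page means each citation carries the highest score seen for that page.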
When no chunks pass the rerank score filter, the response is:
{
"answer": "No relevant documents were found for your question.",
"sources": []
}

Typical workflow:

- Start the server: `python3 main.py`
- Open http://127.0.0.1:8000/
- Upload documents in the UI (or call `POST /upload`)
- Click Index documents (or call `POST /index`)
- Ask questions in the chat UI or via `POST /query`
- Inspect source pills under each assistant reply (file, optional page, optional confidence %)

Known limitations:

- In-memory only: the vector store and BM25 index live in process memory, are rebuilt when you call `POST /index`, and are invalidated by new uploads until you re-index. Not suitable for large production corpora without swapping in a persistent vector database.
- Exact vector search: no HNSW/IVF indexing; every query compares against all chunk embeddings. Fast enough for small document sets.
- Fresh start on boot: `context_files/` is emptied when the server starts; persist files elsewhere if you need them across restarts.
- Single-provider LLM: defaults to Gemini via `langchain-google-genai`; other providers can be wired through `src/llm/base.py`.

The system prompt is very strict and instructs the LLM to answer only when it is 100% sure, so the model may decline to answer even when the documents contain the answer. In hybrid mode, the cause can be how the retrieved sources are ranked, scored, and merged. If the query mentions a very specific keyword, the lexical branch probably holds the better passages, but they can be dropped when RRF narrows the results to the final number of sources, for two main reasons:

- The semantic branch outweighs the lexical one by default during RRF, because `RAG_LEXICAL_WEIGHT=0.5`.
- `RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL` is too low: the relevant chunk is ranked poorly, just past the cutoff, so it is dropped.

Possible adjustments:

- Raise `RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL` so that poorly ranked but relevant chunks can enter the pool.
- Raise `RAG_LEXICAL_WEIGHT` (e.g. 1.0 or higher) so BM25 matches for proper names compete fairly in RRF.
- Retrieve more per branch before merging (raise `RAG_NUMBER_OF_CHUNKS_PER_BRANCH` or `RAG_RERANK_CANDIDATES`, then RRF/rerank down to k).
- If answers cite weak sources, raise `RAG_RERANK_MIN_SCORE` or raise `RAG_RERANK_GAP_RATIO` so low-confidence tail chunks are dropped before the LLM prompt.
- If valid queries return “No relevant documents were found”, lower `RAG_RERANK_MIN_SCORE` or set `RAG_RERANK_GAP_RATIO=0` to disable the cliff cutoff.
- Detect named keywords in the query and either boost the lexical branch or filter for chunks containing those tokens. This could be driven by a curated list of entities for each topic.
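
The last idea, boosting the lexical branch when the query names a known entity, could look roughly like this. Nothing of the sort exists in the project as described; `pick_lexical_weight`, the boosted weight of 1.5, and the single-token matching are all assumptions chosen for illustration.

```python
import re


def pick_lexical_weight(
    query: str,
    curated_entities: list[str],
    default_weight: float = 0.5,   # RAG_LEXICAL_WEIGHT default
    boosted_weight: float = 1.5,   # hypothetical boost when an entity is named
) -> tuple[float, list[str]]:
    """Return the lexical weight to use and the curated entities found in the query."""
    tokens = set(re.findall(r"\w+", query.lower()))
    # Single-token, case-insensitive match; multi-word entities would need phrase matching.
    hits = sorted(e for e in curated_entities if e.lower() in tokens)
    return (boosted_weight if hits else default_weight), hits


print(pick_lexical_weight("What did Kowalski sign?", ["Kowalski", "Acme"]))
# (1.5, ['Kowalski'])
```

The returned weight would then be passed to the RRF merge for that one query, leaving the configured default untouched for queries without a named entity.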