RAG Chatbot

A local FastAPI application that answers questions over your own documents using Retrieval-Augmented Generation (RAG). Upload files through a browser UI (or the API), ask multi-turn questions, and get answers grounded in retrieved chunks with source citations.

Features

Hybrid retrieval — dense embeddings (semantic) plus BM25 (lexical), merged with Reciprocal Rank Fusion (RRF) by default.
Configurable retrieval — switch between semantic, lexical, and hybrid modes via environment variables.
Multi-turn chat — the API accepts conversation history so follow-up questions keep context.
Source citations — responses include file names, page numbers, and reranker confidence (%) when re-ranking is enabled.
Document upload — index .docx, .pdf, .txt, .md, and .pptx files.
In-memory index — simple setup for local development and prototyping (no external vector database).

How it works

High-level lifecycle

flowchart TB
    subgraph START["1 · Server startup"]
        MAIN["main.py<br/>Uvicorn :8000"] --> APP["FastAPI app<br/>app/app.py"]
        APP --> LIFE["Lifespan hook"]
        LIFE --> CLEAR["Clear context_files/"]
        LIFE --> RESET0["reset_chain()<br/>no index in memory"]
    end

    subgraph UPLOAD["2 · Upload documents"]
        UI1["Browser UI<br/>frontend/static/app.js"] -->|POST /upload| UP_EP["upload_files()"]
        UP_EP --> SAVE["Save to RAG_UPLOADED_DIR<br/>(default: context_files/)"]
        SAVE --> RESET1["reset_chain()<br/>index invalidated"]
    end

    subgraph INDEX["3 · Index documents"]
        UI2["Click Index documents"] -->|POST /index| IDX_EP["index_documents()"]
        IDX_EP --> BUILD["build_rag_chain()<br/>rag/rag.py"]
        BUILD --> STORE["In-memory RagWrapper<br/>vector store + BM25 + agent"]
    end

    subgraph CHAT["4 · Chat / query"]
        UI3["User asks question<br/>+ conversation history"] -->|POST /query| Q_EP["query()"]
        Q_EP --> RESP["RagWrapper.get_response()"]
        RESP --> OUT["Answer + source citations"]
        OUT --> UI4["Render reply + source pills<br/>(file · page · confidence %)"]
    end

    START --> UPLOAD
    UPLOAD --> INDEX
    INDEX --> CHAT

Indexing pipeline (`POST /index`)

flowchart LR
    FILES[("context_files/<br/>pdf · docx · txt · md · pptx")] --> LOAD["Load & parse<br/>LangChain loaders"]
    LOAD --> CHUNK["Chunk<br/>1000 chars · 200 overlap"]
    CHUNK --> EMB["Embed chunks<br/>gemini-embedding-2-preview"]
    CHUNK --> BM25["Build BM25 index<br/>same chunks"]

    EMB --> VDB[("InMemoryVectorStore<br/>cosine similarity")]
    BM25 --> LEX[("BM25Retriever")]

    VDB --> BUNDLE["RetrieverBundle<br/>mode · k · fetch_k · lexical_weight"]
    LEX --> BUNDLE

    CHUNK --> AGENT["LangChain agent<br/>create_agent()"]
    LLM["Gemini gemini-2.5-flash-lite"] --> AGENT
    MW["Prompt middleware<br/>dynamic system prompt"] --> AGENT

    BUNDLE --> WRAP["RagWrapper<br/>agent + retriever"]
    AGENT --> WRAP
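
The chunking step in the diagram above (1000 characters with 200 characters of overlap) can be pictured as a plain sliding window. This is a minimal sketch, not the project's code: `chunk_text` is a hypothetical name, and the real splitter in `rag/rag.py` may break on separators rather than at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, each window starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # prints [1000, 1000, 900]
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some text twice.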

Query pipeline (each `POST /query`)

flowchart TB
    REQ["Request<br/>question + history"] --> MSGS["Build messages list<br/>history + current question"]
    MSGS --> CHAIN["get_chain() → RagWrapper"]

    CHAIN --> Q["Extract last user message"]
    Q --> RET["retrieve_documents()<br/>src/retrieval/hybrid.py"]

    RET --> MODE{RAG_RETRIEVAL_MODE}

    MODE -->|semantic| SEM["Vector search<br/>top-k cosine"]
    MODE -->|lexical| LEX["BM25 keyword search<br/>top-k"]
    MODE -->|hybrid| SEM2["Semantic branch<br/>fetch 2×k candidates"]
    MODE -->|hybrid| LEX2["Lexical branch<br/>fetch 2×k candidates"]
    SEM2 --> RRF["RRF merge<br/>weights: 1.0 semantic · lexical_weight BM25"]
    LEX2 --> RRF
    RRF --> TOPK["Top-k candidates"]
    SEM --> TOPK
    LEX --> TOPK

    TOPK --> RERANK["Optional rerank + score filter<br/>src/retrieval/rerank.py"]
    RERANK --> EMPTY{Chunks after filter?}
    EMPTY -->|no| NODOCS["Return: no relevant docs<br/>(skip LLM)"]
    EMPTY -->|yes| FINAL["Final chunks"]

    FINAL --> CITE["documents_to_sources()<br/>dedupe by file + page · confidence_pct"]
    FINAL --> STATE["Agent invoke<br/>state: messages + retrieved_docs"]

    STATE --> PROMPT["Prompt middleware<br/>system_prompt.txt + chunk text"]
    PROMPT --> GEN["Gemini generates answer"]
    GEN --> RES["QueryResponse<br/>answer + sources"]
    NODOCS --> RES

Component map

Layer	Key files	Role
Entry	`main.py`	Starts Uvicorn on port 8000
API	`app/app.py`, `app/schemas.py`	Routes: `/`, `/upload`, `/index`, `/index/status`, `/query`, `/health`
Frontend	`frontend/templates/index.html`, `frontend/static/app.js`	Upload UI, index button, multi-turn chat, citation pills
RAG core	`rag/rag.py`	Load → chunk → embed → build agent → `RagWrapper`
Retrieval	`src/retrieval/hybrid.py`, `src/retrieval/rerank.py`	Semantic / lexical / hybrid (RRF); optional Discovery Engine rerank and score filtering
Prompt	`src/prompt/prompt_manager.py`	Injects retrieved chunks into system prompt
LLM	`src/llm/gemini.py`	`gemini-2.5-flash-lite` via LangChain

Main steps

Start server — python3 main.py; the upload directory is cleared and no index exists yet.
Upload — UI or POST /upload saves supported files to context_files/; any existing index is cleared.
Index — UI or POST /index loads files, chunks them, embeds them, builds BM25 + vector store, and creates the in-memory RagWrapper.
Ask — UI or POST /query sends the question plus prior turns (history does not include the current question).
Retrieve once — RagWrapper runs hybrid/semantic/lexical retrieval on the latest user message, optionally reranks and filters by score.
Generate — If no chunks pass the score filter, the API returns a fixed “no relevant documents” message without calling the LLM. Otherwise, retrieved text is injected into the system prompt and the Gemini agent produces the answer.
Respond — API returns { answer, sources }; the UI shows source pills (file · p. N · confidence %).
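
The Ask, Retrieve, Generate, and Respond steps above can be sketched as a single function. This is an illustration under stated assumptions: `answer_question`, the `retrieve` and `generate` callables, and the chunk dictionary shape are all hypothetical stand-ins for `RagWrapper.get_response()` and its collaborators.

```python
from typing import Callable

NO_DOCS_MESSAGE = "No relevant documents were found for your question."


def answer_question(
    question: str,
    history: list[dict],
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[list[dict], list[dict]], str],
) -> dict:
    """Retrieve once, short-circuit when nothing survives, otherwise generate."""
    chunks = retrieve(question)  # only the latest user message drives retrieval
    if not chunks:
        # Nothing passed the score filter: return the fixed message, skip the LLM.
        return {"answer": NO_DOCS_MESSAGE, "sources": []}
    messages = history + [{"role": "user", "content": question}]
    answer = generate(messages, chunks)  # the same chunks feed prompt and citations
    sources = [{"file": c["file"], "page": c.get("page")} for c in chunks]
    return {"answer": answer, "sources": sources}


# Stub collaborators, only to show the two paths.
empty = answer_question("q", [], lambda q: [], lambda m, c: "unused")
found = answer_question(
    "q", [], lambda q: [{"file": "policy.pdf", "page": 3}], lambda m, c: "Refunds..."
)
print(empty["sources"], found["sources"])
```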

Design notes

Retrieval runs once per query in RagWrapper.get_response() — the same chunks feed both the LLM prompt and the citation list.
Multi-turn chat — prior turns go in history; only the latest user message drives retrieval.
Hybrid mode (default) — semantic and BM25 each fetch candidates, then RRF merges to a pool. With RAG_RERANK_ENABLED=true, Discovery Engine re-ranks that pool, then a score filter trims weak tail hits before chunks reach the LLM and citations.
Rerank score filtering — after reranking, chunks must pass two rules (configurable via env): an absolute floor (RAG_RERANK_MIN_SCORE, default 0.12) and a gap cutoff (RAG_RERANK_GAP_RATIO, default 0.45 — stop when the next score falls below 45% of the previous kept score). If even the top chunk is below the floor, retrieval returns nothing and the user sees “No relevant documents were found for your question.”
Citation confidence — when reranking is on, each source includes confidence_pct (rerank score × 100, rounded) in the API and UI.
Everything is in-memory — indexes are rebuilt on each POST /index; uploads invalidate the index until re-indexing.
Strict grounding — system_prompt.txt instructs the model to answer only when confident in the retrieved context.
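
The two rerank filter rules described above (absolute floor and gap cutoff) can be sketched as follows. This is a simplified illustration that works on bare scores; `filter_by_score` is a hypothetical name, and the actual logic in `src/retrieval/rerank.py` may differ in detail.

```python
def filter_by_score(
    scores: list[float], min_score: float = 0.12, gap_ratio: float = 0.45
) -> list[float]:
    """Keep the leading run of scores that passes the floor and the gap rule."""
    kept: list[float] = []
    for score in sorted(scores, reverse=True):
        if score < min_score:
            break  # sorted descending, so every later score is below the floor too
        if kept and gap_ratio > 0 and score < gap_ratio * kept[-1]:
            break  # cliff: score fell below gap_ratio times the previous kept score
        kept.append(score)
    return kept


print(filter_by_score([0.87, 0.80, 0.30, 0.10]))               # prints [0.87, 0.8]
print(filter_by_score([0.87, 0.80, 0.30, 0.10], gap_ratio=0))  # prints [0.87, 0.8, 0.3]
print(filter_by_score([0.10, 0.05]))                           # prints []
```

In the first call, 0.30 is cut because it is below 0.45 × 0.80 = 0.36. With the gap rule disabled, it survives and only the floor removes 0.10.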

Semantic search uses cosine similarity over embedding vectors (exact brute-force search in memory). This is appropriate for small corpora; very large indexes would be better served by an approximate nearest-neighbor store (e.g. HNSW or IVF in FAISS, Qdrant, or pgvector).
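
To make "exact brute-force search" concrete, here is a minimal pure-Python sketch that scores the query against every stored vector. The function names are hypothetical and the vectors are toy values; the project itself relies on InMemoryVectorStore for this step.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # one full pass, then a sort
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], vectors, k=2))  # prints [0, 2]
```

Every query touches every vector, so cost grows linearly with the number of chunks. That is the trade-off an approximate index avoids.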

Project structure

Path	Role
`main.py`	Entrypoint — runs Uvicorn with reload
`app/`	FastAPI routes, request/response schemas, static frontend mount
`rag/`	RAG pipeline — document loading, chunking, vector store, chain lifecycle
`src/llm/`	LLM provider interface and Gemini implementation
`src/prompt/`	Dynamic prompt middleware and `system_prompt.txt`
`src/retrieval/`	Hybrid retrieval (semantic, BM25, RRF merge) and Discovery Engine rerank + score filtering
`frontend/`	HTML template and static UI assets
`context_files/`	Uploaded documents used for retrieval (gitignored)

Requirements

Python 3.10+
A Google AI API key with access to Gemini embedding and chat models

Installation

Option A: Conda

conda create -n rag-chatbot python=3.11 -y
conda activate rag-chatbot
pip install -e .

Option B: venv

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Configuration

Create a .env file at the repository root (or export variables in your shell).

Variable	Required	Default	Description
`GOOGLE_API_KEY`	Yes	—	API key for Gemini embeddings and chat
`RAG_UPLOADED_DIR`	No	`context_files/`	Directory for uploaded and indexed documents
`RAG_RETRIEVAL_MODE`	No	`hybrid`	`semantic`, `lexical`, or `hybrid`
`RAG_LEXICAL_WEIGHT`	No	`0.5`	Weight of the BM25 branch in hybrid RRF merge (semantic branch is `1.0`)
`RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL`	No	`8`	Final chunk count after retrieval (passed to `build_rag_chain`)
`RAG_NUMBER_OF_CHUNKS_PER_BRANCH`	No	`16`	Candidates fetched per branch before RRF merge (when rerank is off)
`RAG_RERANK_ENABLED`	No	`false`	Re-rank first-stage hits with Discovery Engine when `true`
`GOOGLE_CLOUD_PROJECT`	When reranking	—	GCP project id for Discovery Engine Ranking API (ADC auth)
`RAG_RERANK_MODEL`	No	`semantic-ranker-default@latest`	Discovery Engine ranking model
`RAG_RERANK_LOCATION`	No	`global`	Discovery Engine location for the ranking config
`RAG_RERANK_CANDIDATES`	No	`max(2×k, k+4)`	First-stage pool size before re-ranking (max 200)
`RAG_RERANK_MIN_SCORE`	No	`0.12`	Drop chunks below this rerank score (0–1); if the top chunk fails, keep none
`RAG_RERANK_GAP_RATIO`	No	`0.45`	Stop keeping chunks when the next score falls below this ratio of the previous kept score; set to `0` to disable
`ENABLE_PRINT_DEBUG`	No	`False`	Log retrieval and message debug output when `true`
`LOGFIRE_ENABLED`	No	`true`	Send traces to Logfire when `true`
`LOGFIRE_SERVICE_NAME`	No	`rag-chatbot`	Service name shown in Logfire
`LOGFIRE_USER_ID`	No	`dev-user`	User id attached to every trace (placeholder until auth exists)
`LOGFIRE_INSTRUMENT_LANGCHAIN`	No	`true`	Export LangChain/LangGraph spans via OpenTelemetry

Example .env:

GOOGLE_API_KEY=your_key_here
RAG_UPLOADED_DIR=context_files
RAG_RETRIEVAL_MODE=hybrid
RAG_LEXICAL_WEIGHT=0.5
RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL=8
RAG_NUMBER_OF_CHUNKS_PER_BRANCH=16
# Optional Discovery Engine re-ranking (requires ADC: gcloud auth application-default login)
# RAG_RERANK_ENABLED=true
# GOOGLE_CLOUD_PROJECT=your-gcp-project-id
# RAG_RERANK_CANDIDATES=24
# RAG_RERANK_MIN_SCORE=0.12
# RAG_RERANK_GAP_RATIO=0.45
ENABLE_PRINT_DEBUG=false
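
The defaults in the table above, including the derived default for RAG_RERANK_CANDIDATES, can be read as in the following sketch. `load_settings` and the keys of the returned dictionary are hypothetical, and clamping the pool to 200 (rather than rejecting larger values) is an assumption about how the documented maximum is enforced.

```python
import os
from collections.abc import Mapping

VALID_MODES = {"semantic", "lexical", "hybrid"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read retrieval settings, falling back to the documented defaults."""
    mode = env.get("RAG_RETRIEVAL_MODE", "hybrid")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown RAG_RETRIEVAL_MODE: {mode!r}")
    k = int(env.get("RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL", "8"))
    # Documented default pool size before re-ranking: max(2*k, k+4).
    pool = int(env.get("RAG_RERANK_CANDIDATES", str(max(2 * k, k + 4))))
    return {
        "mode": mode,
        "k": k,
        "per_branch": int(env.get("RAG_NUMBER_OF_CHUNKS_PER_BRANCH", "16")),
        "lexical_weight": float(env.get("RAG_LEXICAL_WEIGHT", "0.5")),
        "rerank_enabled": env.get("RAG_RERANK_ENABLED", "false").lower() == "true",
        "rerank_candidates": min(pool, 200),  # assumption: clamp to the stated max
        "rerank_min_score": float(env.get("RAG_RERANK_MIN_SCORE", "0.12")),
        "rerank_gap_ratio": float(env.get("RAG_RERANK_GAP_RATIO", "0.45")),
    }


settings = load_settings({})  # empty mapping: every value falls back to its default
print(settings["k"], settings["rerank_candidates"])  # prints 8 16
```

With the default k of 8, the pool default is max(16, 12) = 16. For a small k such as 3, the `k+4` term wins and the pool is 7.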

Retrieval modes

semantic — embedding similarity only (cosine via in-memory vector store)
lexical — BM25 keyword search only
hybrid — run both, merge ranks with RRF (recommended)
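
As an illustration of the hybrid mode, weighted Reciprocal Rank Fusion gives each document a score of weight / (c + rank) per branch and sums the branches. The sketch below is not the code in `src/retrieval/hybrid.py`: `rrf_merge` is a hypothetical name, documents are reduced to string ids, and the constant c = 60 is the value commonly used for RRF, assumed here because the project's constant is not stated.

```python
from collections import defaultdict


def rrf_merge(
    semantic: list[str],
    lexical: list[str],
    lexical_weight: float = 0.5,
    k: int = 8,
    c: int = 60,
) -> list[str]:
    """Merge two ranked id lists with weighted RRF and return the top k ids."""
    scores: dict[str, float] = defaultdict(float)
    # The semantic branch has a fixed weight of 1.0; BM25 uses lexical_weight.
    for weight, ranking in ((1.0, semantic), (lexical_weight, lexical)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(rrf_merge(["a", "b", "c"], ["c", "d", "a"]))  # prints ['a', 'c', 'b', 'd']
```

Documents found by both branches ("a" and "c") rise above those found by only one, and a document seen only by BM25 ("d") ranks last at the default weight of 0.5.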

Observability (Logfire)

The app sends traces to Logfire when enabled:

Install the SDK: pip install logfire
Authenticate and select a project: logfire auth, then logfire projects use <project>
Install the FastAPI and HTTPX extras: pip install uvicorn 'logfire[fastapi]' and pip install 'logfire[httpx]'. The HTTPX extra is needed because langchain-google-genai (ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings) uses HTTPX under the hood for chat and embedding calls to Google’s API.
Run the server: python3 main.py

What is instrumented:

HTTP routes — POST /query, /index, /upload, etc. (/health and /static/* are excluded)
Outbound HTTP — Gemini API calls via HTTPX
LangChain — agent, retrieval, and LLM spans when LOGFIRE_INSTRUMENT_LANGCHAIN=true (default)
User context — user_id on every span via baggage (dev-user by default; override with LOGFIRE_USER_ID)

Set LOGFIRE_ENABLED=false to disable sending traces without removing the dependency.

Filter by user in Logfire Live view: attributes->>'user_id' = 'dev-user'

Run

From the repository root:

python3 main.py

Then open:

UI: http://127.0.0.1:8000/
API docs: http://127.0.0.1:8000/docs

On startup, the app clears the upload directory and resets the in-memory index so each run starts with an empty document set. Upload files again after restarting the server.

API

`GET /health`

Returns {"status": "ok"}.

`POST /upload`

Upload one or more files for indexing. Supported extensions: .docx, .pdf, .txt, .md, .pptx.

Response:

{
  "saved": ["report.pdf", "notes.txt"],
  "skipped": ["image.png"]
}

Re-uploading clears the current index; call POST /index again before querying.

`GET /index/status`

Returns whether documents are indexed and how many supported files are on disk.

Response:

{
  "indexed": true,
  "file_count": 2
}

`POST /index`

Build the in-memory index from uploaded files. Required before querying.

Response:

{
  "documents": 2,
  "chunks": 18
}

`POST /query`

Ask a question over the indexed documents.

Request:

{
  "question": "What is the refund policy?",
  "history": [
    { "role": "user", "content": "Tell me about billing." },
    { "role": "assistant", "content": "Billing is handled monthly..." }
  ]
}

history contains prior turns in order and does not include the current question.
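
A client might build this request and carry the history forward as in the sketch below. It uses only the standard library; `build_query_request` and `append_turn` are hypothetical helpers, and the base URL assumes the default local server from the Run section.

```python
import json
from urllib.request import Request


def build_query_request(
    question: str, history: list[dict], base_url: str = "http://127.0.0.1:8000"
) -> Request:
    """Build a POST /query request; history holds prior turns only."""
    payload = {"question": question, "history": history}
    return Request(
        f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def append_turn(history: list[dict], question: str, answer: str) -> list[dict]:
    """Return a new history that includes the turn that was just completed."""
    return history + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]


history: list[dict] = []
req = build_query_request("What is the refund policy?", history)
# Sending it requires a running server with an index built:
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     reply = json.load(resp)
# history = append_turn(history, "What is the refund policy?", reply["answer"])
```

The current question is added to the history only after the reply arrives, so the next request never repeats it in both fields.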

Response:

{
  "answer": "Refunds are available within 30 days...",
  "sources": [
    {
      "file": "policy.pdf",
      "path": "/path/to/context_files/policy.pdf",
      "page": 3,
      "confidence_pct": 87
    }
  ]
}

confidence_pct is present when re-ranking is enabled (Discovery Engine score × 100, rounded). Omitted when reranking is off.
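
The way sources are deduplicated by file and page, and how confidence_pct is derived, can be sketched as follows. This is a hypothetical re-implementation for illustration (`to_sources` and the chunk dictionary keys are assumptions, not the project's `documents_to_sources()`); it also assumes chunks arrive best-first, so the first score seen for a page wins.

```python
import os


def to_sources(chunks: list[dict]) -> list[dict]:
    """Collapse chunks into one citation per (path, page) pair."""
    seen: set[tuple] = set()
    sources = []
    for chunk in chunks:
        key = (chunk["path"], chunk.get("page"))
        if key in seen:
            continue  # a better-ranked chunk from the same page is already cited
        seen.add(key)
        source = {"file": os.path.basename(chunk["path"]), "path": chunk["path"]}
        if chunk.get("page") is not None:
            source["page"] = chunk["page"]
        if chunk.get("rerank_score") is not None:
            # Only present when re-ranking is enabled: score x 100, rounded.
            source["confidence_pct"] = round(chunk["rerank_score"] * 100)
        sources.append(source)
    return sources


chunks = [
    {"path": "/data/policy.pdf", "page": 3, "rerank_score": 0.874},
    {"path": "/data/policy.pdf", "page": 3, "rerank_score": 0.41},
    {"path": "/data/notes.txt"},
]
print(to_sources(chunks))
```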

When no chunks pass the rerank score filter, the response is:

{
  "answer": "No relevant documents were found for your question.",
  "sources": []
}

Typical workflow

Start the server: python3 main.py
Open http://127.0.0.1:8000/
Upload documents in the UI (or call POST /upload)
Click Index documents (or call POST /index)
Ask questions in the chat UI or via POST /query
Inspect source pills under each assistant reply (file, optional page, optional confidence %)

Limitations

In-memory only — the vector store and BM25 index live in process memory and are rebuilt when you call POST /index (or after new uploads, until you re-index). Not suitable for large production corpora without swapping in a persistent vector database.
Exact vector search — no HNSW/IVF indexing; every query compares against all chunk embeddings. Fast enough for small document sets.
Fresh start on boot — context_files/ is emptied when the server starts; persist files elsewhere if you need them across restarts.
Single-provider LLM — defaults to Gemini via langchain-google-genai; other providers can be wired through src/llm/base.py.

Troubleshooting

The chatbot does not find the relevant documentation

The system prompt is strict and instructs the LLM to answer only when it is fully confident in the retrieved context. In hybrid mode, the problem may come from how the retrieved sources are ranked, scored, and merged. If the query contains a very specific keyword, the lexical branch will probably surface the best passages, but they can be dropped when RRF narrows the results down to the final number of sources, for two main reasons:

By default the semantic branch outweighs the lexical one in RRF, because RAG_LEXICAL_WEIGHT=0.5 while the semantic weight is 1.0.
RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL is too low: the relevant chunk ranks just below the cutoff and is dropped.

Practical fixes

Raise RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL so that relevant but poorly ranked chunks can enter the pool.
Raise RAG_LEXICAL_WEIGHT (e.g. 1.0 or higher) so BM25 matches on proper names and exact keywords compete fairly in RRF.
Retrieve more per branch before merging: raise RAG_NUMBER_OF_CHUNKS_PER_BRANCH (or RAG_RERANK_CANDIDATES when reranking is on), then let RRF or the reranker cut the pool down to k.
If answers cite weak sources, raise RAG_RERANK_MIN_SCORE or raise RAG_RERANK_GAP_RATIO so low-confidence tail chunks are dropped before the LLM prompt. A higher gap ratio cuts the list at smaller score drops.
If valid queries return “No relevant documents were found”, lower RAG_RERANK_MIN_SCORE; as described in Design notes, the floor is the rule that can reject the top chunk. If relevant chunks further down the list are being cut, lower RAG_RERANK_GAP_RATIO or set it to 0 to disable the cliff cutoff.
Detect named entities or exact keywords in the query and, when one is present, boost the lexical branch (or switch to lexical-only), or filter for chunks that contain those tokens. This could rely on a curated list of entities for each topic.
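
The last fix, boosting the lexical branch when the query mentions a known entity, might look like the sketch below. Everything here is hypothetical: the function name, the curated entity set, and the boosted weight of 1.5 are illustrative choices, and only single-word entities are matched.

```python
import re


def lexical_weight_for(
    query: str, entities: set[str], base: float = 0.5, boosted: float = 1.5
) -> float:
    """Return a higher BM25 weight when the query contains a curated entity."""
    tokens = set(re.findall(r"\w+", query.lower()))
    known = {entity.lower() for entity in entities}
    return boosted if tokens & known else base


curated = {"Logfire", "Qdrant"}  # example list; a real one would be domain-specific
print(lexical_weight_for("How do I enable Logfire tracing?", curated))  # prints 1.5
print(lexical_weight_for("How do refunds work?", curated))              # prints 0.5
```

The returned value would stand in for RAG_LEXICAL_WEIGHT on a per-query basis, which keeps the semantic-leaning default for ordinary questions.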
