pablogarciaprado/rag-chatbot

RAG Chatbot

A local FastAPI application that answers questions over your own documents using Retrieval-Augmented Generation (RAG). Upload files through a browser UI (or the API), ask multi-turn questions, and get answers grounded in retrieved chunks with source citations.

Features

  • Hybrid retrieval — dense embeddings (semantic) plus BM25 (lexical), merged with Reciprocal Rank Fusion (RRF) by default.
  • Configurable retrieval — switch between semantic, lexical, and hybrid modes via environment variables.
  • Multi-turn chat — the API accepts conversation history so follow-up questions keep context.
  • Source citations — responses include file names, page numbers, and reranker confidence (%) when re-ranking is enabled.
  • Document upload — index .docx, .pdf, .txt, .md, and .pptx files.
  • In-memory index — simple setup for local development and prototyping (no external vector database).

How it works

High-level lifecycle

flowchart TB
    subgraph START["1 · Server startup"]
        MAIN["main.py<br/>Uvicorn :8000"] --> APP["FastAPI app<br/>app/app.py"]
        APP --> LIFE["Lifespan hook"]
        LIFE --> CLEAR["Clear context_files/"]
        LIFE --> RESET0["reset_chain()<br/>no index in memory"]
    end

    subgraph UPLOAD["2 · Upload documents"]
        UI1["Browser UI<br/>frontend/static/app.js"] -->|POST /upload| UP_EP["upload_files()"]
        UP_EP --> SAVE["Save to RAG_UPLOADED_DIR<br/>(default: context_files/)"]
        SAVE --> RESET1["reset_chain()<br/>index invalidated"]
    end

    subgraph INDEX["3 · Index documents"]
        UI2["Click Index documents"] -->|POST /index| IDX_EP["index_documents()"]
        IDX_EP --> BUILD["build_rag_chain()<br/>rag/rag.py"]
        BUILD --> STORE["In-memory RagWrapper<br/>vector store + BM25 + agent"]
    end

    subgraph CHAT["4 · Chat / query"]
        UI3["User asks question<br/>+ conversation history"] -->|POST /query| Q_EP["query()"]
        Q_EP --> RESP["RagWrapper.get_response()"]
        RESP --> OUT["Answer + source citations"]
        OUT --> UI4["Render reply + source pills<br/>(file · page · confidence %)"]
    end

    START --> UPLOAD
    UPLOAD --> INDEX
    INDEX --> CHAT

Indexing pipeline (POST /index)

flowchart LR
    FILES[("context_files/<br/>pdf · docx · txt · md · pptx")] --> LOAD["Load & parse<br/>LangChain loaders"]
    LOAD --> CHUNK["Chunk<br/>1000 chars · 200 overlap"]
    CHUNK --> EMB["Embed chunks<br/>gemini-embedding-2-preview"]
    CHUNK --> BM25["Build BM25 index<br/>same chunks"]

    EMB --> VDB[("InMemoryVectorStore<br/>cosine similarity")]
    BM25 --> LEX[("BM25Retriever")]

    VDB --> BUNDLE["RetrieverBundle<br/>mode · k · fetch_k · lexical_weight"]
    LEX --> BUNDLE

    CHUNK --> AGENT["LangChain agent<br/>create_agent()"]
    LLM["Gemini gemini-2.5-flash-lite"] --> AGENT
    MW["Prompt middleware<br/>dynamic system prompt"] --> AGENT

    BUNDLE --> WRAP["RagWrapper<br/>agent + retriever"]
    AGENT --> WRAP

Query pipeline (each POST /query)

flowchart TB
    REQ["Request<br/>question + history"] --> MSGS["Build messages list<br/>history + current question"]
    MSGS --> CHAIN["get_chain() → RagWrapper"]

    CHAIN --> Q["Extract last user message"]
    Q --> RET["retrieve_documents()<br/>src/retrieval/hybrid.py"]

    RET --> MODE{RAG_RETRIEVAL_MODE}

    MODE -->|semantic| SEM["Vector search<br/>top-k cosine"]
    MODE -->|lexical| LEX["BM25 keyword search<br/>top-k"]
    MODE -->|hybrid| SEM2["Semantic branch<br/>fetch 2×k candidates"]
    MODE -->|hybrid| LEX2["Lexical branch<br/>fetch 2×k candidates"]
    SEM2 --> RRF["RRF merge<br/>weights: 1.0 semantic · lexical_weight BM25"]
    LEX2 --> RRF
    RRF --> TOPK["Top-k candidates"]
    SEM --> TOPK
    LEX --> TOPK

    TOPK --> RERANK["Optional rerank + score filter<br/>src/retrieval/rerank.py"]
    RERANK --> EMPTY{Chunks after filter?}
    EMPTY -->|no| NODOCS["Return: no relevant docs<br/>(skip LLM)"]
    EMPTY -->|yes| FINAL["Final chunks"]

    FINAL --> CITE["documents_to_sources()<br/>dedupe by file + page · confidence_pct"]
    FINAL --> STATE["Agent invoke<br/>state: messages + retrieved_docs"]

    STATE --> PROMPT["Prompt middleware<br/>system_prompt.txt + chunk text"]
    PROMPT --> GEN["Gemini generates answer"]
    GEN --> RES["QueryResponse<br/>answer + sources"]
    NODOCS --> RES

Component map

| Layer | Key files | Role |
| --- | --- | --- |
| Entry | main.py | Starts Uvicorn on port 8000 |
| API | app/app.py, app/schemas.py | Routes: /, /upload, /index, /index/status, /query, /health |
| Frontend | frontend/templates/index.html, frontend/static/app.js | Upload UI, index button, multi-turn chat, citation pills |
| RAG core | rag/rag.py | Load → chunk → embed → build agent → RagWrapper |
| Retrieval | src/retrieval/hybrid.py, src/retrieval/rerank.py | Semantic / lexical / hybrid (RRF); optional Discovery Engine rerank and score filtering |
| Prompt | src/prompt/prompt_manager.py | Injects retrieved chunks into system prompt |
| LLM | src/llm/gemini.py | gemini-2.5-flash-lite via LangChain |

Main steps

  1. Start server: python3 main.py; the upload directory is cleared and no index exists yet.
  2. Upload — UI or POST /upload saves supported files to context_files/; any existing index is cleared.
  3. Index — UI or POST /index loads files, chunks them, embeds them, builds BM25 + vector store, and creates the in-memory RagWrapper.
  4. Ask — UI or POST /query sends the question plus prior turns (history does not include the current question).
  5. Retrieve once: RagWrapper runs hybrid/semantic/lexical retrieval on the latest user message, then optionally reranks and filters by score.
  6. Generate — If no chunks pass the score filter, the API returns a fixed “no relevant documents” message without calling the LLM. Otherwise, retrieved text is injected into the system prompt and the Gemini agent produces the answer.
  7. Respond — API returns { answer, sources }; the UI shows source pills (file · p. N · confidence %).
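
The chunking in step 3 (1000 characters with a 200-character overlap, per the indexing diagram) can be illustrated with a plain sliding window. This is only a sketch: `chunk_text` is a hypothetical helper, and the project's actual LangChain splitter may prefer paragraph or sentence boundaries rather than cutting at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the repository's splitter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Example: a 2500-character document produces three overlapping chunks.
sample = "".join(chr(97 + i % 26) for i in range(2500))
chunks = chunk_text(sample)
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still fully present in at least one chunk.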

Design notes

  • Retrieval runs once per query in RagWrapper.get_response() — the same chunks feed both the LLM prompt and the citation list.
  • Multi-turn chat — prior turns go in history; only the latest user message drives retrieval.
  • Hybrid mode (default) — semantic and BM25 each fetch candidates, then RRF merges to a pool. With RAG_RERANK_ENABLED=true, Discovery Engine re-ranks that pool, then a score filter trims weak tail hits before chunks reach the LLM and citations.
  • Rerank score filtering — after reranking, chunks must pass two rules (configurable via env): an absolute floor (RAG_RERANK_MIN_SCORE, default 0.12) and a gap cutoff (RAG_RERANK_GAP_RATIO, default 0.45 — stop when the next score falls below 45% of the previous kept score). If even the top chunk is below the floor, retrieval returns nothing and the user sees “No relevant documents were found for your question.”
  • Citation confidence — when reranking is on, each source includes confidence_pct (rerank score × 100, rounded) in the API and UI.
  • Everything is in-memory — indexes are rebuilt on each POST /index; uploads invalidate the index until re-indexing.
  • Strict grounding: system_prompt.txt instructs the model to answer only when confident in the retrieved context.
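
The two rerank filtering rules described above (absolute floor and gap cutoff) could look roughly like the following. This is a minimal sketch based only on the description in this README; `filter_by_score` is a hypothetical name, and the real logic in src/retrieval/rerank.py may differ in details such as tie handling.

```python
def filter_by_score(scores: list[float],
                    min_score: float = 0.12,
                    gap_ratio: float = 0.45) -> list[float]:
    """Keep the leading scores that pass the floor and the gap cutoff.

    `scores` must be sorted in descending order (reranker output order).
    Hypothetical helper; defaults mirror RAG_RERANK_MIN_SCORE and
    RAG_RERANK_GAP_RATIO as documented in this README.
    """
    kept: list[float] = []
    for score in scores:
        if score < min_score:
            break  # absolute floor: this chunk and everything after it is dropped
        if kept and gap_ratio > 0 and score < gap_ratio * kept[-1]:
            break  # gap cutoff: score fell below ratio * previous kept score
        kept.append(score)
    return kept


strict = filter_by_score([0.87, 0.80, 0.30, 0.25, 0.05])
no_gap = filter_by_score([0.87, 0.80, 0.30, 0.25, 0.05], gap_ratio=0)
nothing = filter_by_score([0.10, 0.05])
```

With the defaults, the third score (0.30) is below 45% of 0.80, so only the first two chunks survive. Setting the ratio to 0 leaves the floor as the only rule, and a top score under the floor yields an empty result, which corresponds to the "no relevant documents" response.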

Semantic search uses cosine similarity over embedding vectors (exact brute-force search in memory). This is appropriate for small corpora; for very large indexes, it would be better to move to an approximate nearest-neighbor index (e.g. HNSW or IVF, as available in FAISS, Qdrant, or pgvector).
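
To make "exact brute-force search" concrete, here is a small standard-library sketch that scores a query against every stored vector and keeps the top k. The function names are hypothetical, and the in-memory vector store used by the project does this internally; the point is only that cost grows linearly with the number of chunks.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Compare the query with every vector and return (index, score) for the best k."""
    scored = [(index, cosine(query, vec)) for index, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], vectors, k=2)
```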

Project structure

Path Role
main.py Entrypoint — runs Uvicorn with reload
app/ FastAPI routes, request/response schemas, static frontend mount
rag/ RAG pipeline — document loading, chunking, vector store, chain lifecycle
src/llm/ LLM provider interface and Gemini implementation
src/prompt/ Dynamic prompt middleware and system_prompt.txt
src/retrieval/ Hybrid retrieval (semantic, BM25, RRF merge) and Discovery Engine rerank + score filtering
frontend/ HTML template and static UI assets
context_files/ Uploaded documents used for retrieval (gitignored)

Requirements

Installation

Option A: Conda

conda create -n rag-chatbot python=3.11 -y
conda activate rag-chatbot
pip install -e .

Option B: venv

python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Configuration

Create a .env file at the repository root (or export variables in your shell).

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| GOOGLE_API_KEY | Yes | | API key for Gemini embeddings and chat |
| RAG_UPLOADED_DIR | No | context_files/ | Directory for uploaded and indexed documents |
| RAG_RETRIEVAL_MODE | No | hybrid | semantic, lexical, or hybrid |
| RAG_LEXICAL_WEIGHT | No | 0.5 | Weight of the BM25 branch in hybrid RRF merge (semantic branch is 1.0) |
| RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL | No | 8 | Final chunk count after retrieval (passed to build_rag_chain) |
| RAG_NUMBER_OF_CHUNKS_PER_BRANCH | No | 16 | Candidates fetched per branch before RRF merge (when rerank is off) |
| RAG_RERANK_ENABLED | No | false | Re-rank first-stage hits with Discovery Engine when true |
| GOOGLE_CLOUD_PROJECT | When reranking | | GCP project id for Discovery Engine Ranking API (ADC auth) |
| RAG_RERANK_MODEL | No | semantic-ranker-default@latest | Discovery Engine ranking model |
| RAG_RERANK_LOCATION | No | global | Discovery Engine location for the ranking config |
| RAG_RERANK_CANDIDATES | No | max(2×k, k+4) | First-stage pool size before re-ranking (max 200) |
| RAG_RERANK_MIN_SCORE | No | 0.12 | Drop chunks below this rerank score (0–1); if the top chunk fails, keep none |
| RAG_RERANK_GAP_RATIO | No | 0.45 | Stop keeping chunks when the next score falls below this ratio of the previous kept score; set to 0 to disable |
| ENABLE_PRINT_DEBUG | No | False | Log retrieval and message debug output when true |
| LOGFIRE_ENABLED | No | true | Send traces to Logfire when true |
| LOGFIRE_SERVICE_NAME | No | rag-chatbot | Service name shown in Logfire |
| LOGFIRE_USER_ID | No | dev-user | User id attached to every trace (placeholder until auth exists) |
| LOGFIRE_INSTRUMENT_LANGCHAIN | No | true | Export LangChain/LangGraph spans via OpenTelemetry |
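
As an illustration of how the retrieval-related variables and their defaults fit together, the sketch below reads them from a mapping such as `os.environ`. The `RetrievalSettings` class, `load_settings` function, and the set of strings treated as "true" are assumptions made for this example; they are not taken from the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalSettings:  # hypothetical container, not the project's own class
    mode: str
    lexical_weight: float
    k: int
    fetch_k: int
    rerank_enabled: bool
    rerank_candidates: int
    rerank_min_score: float
    rerank_gap_ratio: float


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; the app may accept others.
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env=None) -> RetrievalSettings:
    """Read retrieval settings, falling back to the defaults in the table."""
    env = os.environ if env is None else env
    mode = env.get("RAG_RETRIEVAL_MODE", "hybrid").strip().lower()
    if mode not in {"semantic", "lexical", "hybrid"}:
        raise ValueError(f"unsupported RAG_RETRIEVAL_MODE: {mode!r}")

    k = int(env.get("RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL", "8"))
    # Documented default pool size is max(2×k, k+4), capped at 200.
    candidates = int(env.get("RAG_RERANK_CANDIDATES", max(2 * k, k + 4)))

    return RetrievalSettings(
        mode=mode,
        lexical_weight=float(env.get("RAG_LEXICAL_WEIGHT", "0.5")),
        k=k,
        fetch_k=int(env.get("RAG_NUMBER_OF_CHUNKS_PER_BRANCH", "16")),
        rerank_enabled=_as_bool(env.get("RAG_RERANK_ENABLED", "false")),
        rerank_candidates=min(candidates, 200),
        rerank_min_score=float(env.get("RAG_RERANK_MIN_SCORE", "0.12")),
        rerank_gap_ratio=float(env.get("RAG_RERANK_GAP_RATIO", "0.45")),
    )


defaults = load_settings({})
```

Passing an explicit dictionary instead of the process environment makes the defaults easy to check in isolation.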

Example .env:

GOOGLE_API_KEY=your_key_here
RAG_UPLOADED_DIR=context_files
RAG_RETRIEVAL_MODE=hybrid
RAG_LEXICAL_WEIGHT=0.5
RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL=8
RAG_NUMBER_OF_CHUNKS_PER_BRANCH=16
# Optional Discovery Engine re-ranking (requires ADC: gcloud auth application-default login)
# RAG_RERANK_ENABLED=true
# GOOGLE_CLOUD_PROJECT=your-gcp-project-id
# RAG_RERANK_CANDIDATES=24
# RAG_RERANK_MIN_SCORE=0.12
# RAG_RERANK_GAP_RATIO=0.45
ENABLE_PRINT_DEBUG=false

Retrieval modes

  • semantic — embedding similarity only (cosine via in-memory vector store)
  • lexical — BM25 keyword search only
  • hybrid — run both, merge ranks with RRF (recommended)
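
For readers unfamiliar with Reciprocal Rank Fusion, the hybrid merge can be sketched as a weighted sum of reciprocal ranks. This is an assumption-laden illustration: `rrf_merge` is a hypothetical function, and the smoothing constant of 60 is the value commonly used for RRF, not one confirmed by this README. Only the branch weights (1.0 for semantic, RAG_LEXICAL_WEIGHT for BM25) come from the documentation above.

```python
from collections import defaultdict


def rrf_merge(semantic: list[str],
              lexical: list[str],
              lexical_weight: float = 0.5,
              k: int = 8,
              rrf_constant: int = 60) -> list[str]:
    """Merge two ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    Each list is ordered best-first. rrf_constant=60 is an assumed value.
    """
    scores: dict[str, float] = defaultdict(float)
    for weight, ranking in ((1.0, semantic), (lexical_weight, lexical)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (rrf_constant + rank)
    ranked = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return ranked[:k]


semantic_hits = ["a", "b", "c"]
lexical_hits = ["c", "d", "a"]
default_merge = rrf_merge(semantic_hits, lexical_hits, k=3)
lexical_heavy = rrf_merge(semantic_hits, lexical_hits, lexical_weight=2.0, k=3)
```

Chunks found by both branches ("a" and "c") rise to the top in either setting, while increasing the lexical weight lets the BM25-only chunk "d" displace the semantic-only chunk "b".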

Observability (Logfire)

The app sends traces to Logfire when enabled:

  1. Install SDK: pip install logfire
  2. Authenticate and select a project: logfire auth then logfire projects use <project>
  3. Install the FastAPI and HTTPX extras: pip install uvicorn 'logfire[fastapi]' and pip install 'logfire[httpx]'. The HTTPX extra is needed because langchain-google-genai (ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings) uses HTTPX under the hood for chat and embedding calls to Google’s API.
  4. Run the server: python3 main.py

What is instrumented:

  • HTTP routes: POST /query, /index, /upload, etc. (/health and /static/* are excluded)
  • Outbound HTTP — Gemini API calls via HTTPX
  • LangChain — agent, retrieval, and LLM spans when LOGFIRE_INSTRUMENT_LANGCHAIN=true (default)
  • User context: user_id on every span via baggage (dev-user by default; override with LOGFIRE_USER_ID)

Set LOGFIRE_ENABLED=false to disable sending traces without removing the dependency.

Filter by user in Logfire Live view: attributes->>'user_id' = 'dev-user'

Run

From the repository root:

python3 main.py

Then open http://127.0.0.1:8000/ in your browser.

On startup, the app clears the upload directory and resets the in-memory index so each run starts with an empty document set. Upload files again after restarting the server.

API

GET /health

Returns {"status": "ok"}.

POST /upload

Upload one or more files for indexing. Supported extensions: .docx, .pdf, .txt, .md, .pptx.

Response:

{
  "saved": ["report.pdf", "notes.txt"],
  "skipped": ["image.png"]
}

Re-uploading clears the current index; call POST /index again before querying.
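
The saved/skipped split in the response can be pictured as a simple extension check. The sketch below is illustrative only: `partition_uploads` is a hypothetical helper, and treating extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".docx", ".pdf", ".txt", ".md", ".pptx"}


def partition_uploads(filenames: list[str]) -> dict[str, list[str]]:
    """Split file names into those that would be saved and those skipped."""
    saved: list[str] = []
    skipped: list[str] = []
    for name in filenames:
        # Assumption: extension matching ignores case.
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            saved.append(name)
        else:
            skipped.append(name)
    return {"saved": saved, "skipped": skipped}


result = partition_uploads(["report.pdf", "notes.txt", "image.png"])
```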

GET /index/status

Returns whether documents are indexed and how many supported files are on disk.

Response:

{
  "indexed": true,
  "file_count": 2
}

POST /index

Build the in-memory index from uploaded files. Required before querying.

Response:

{
  "documents": 2,
  "chunks": 18
}

POST /query

Ask a question over the indexed documents.

Request:

{
  "question": "What is the refund policy?",
  "history": [
    { "role": "user", "content": "Tell me about billing." },
    { "role": "assistant", "content": "Billing is handled monthly..." }
  ]
}

history contains prior turns in order and does not include the current question.
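
The relationship between `history`, the current question, and the text used for retrieval can be sketched as follows. Both function names are hypothetical; the sketch only mirrors what the query pipeline diagram describes (history plus current question form the message list, and the last user message drives retrieval).

```python
def build_messages(question: str, history: list[dict[str, str]]) -> list[dict[str, str]]:
    """Append the current question to the prior turns, preserving their order."""
    messages = [{"role": turn["role"], "content": turn["content"]} for turn in history]
    messages.append({"role": "user", "content": question})
    return messages


def last_user_message(messages: list[dict[str, str]]) -> str:
    """Return the most recent user message, i.e. the retrieval query."""
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    raise ValueError("no user message in conversation")


history = [
    {"role": "user", "content": "Tell me about billing."},
    {"role": "assistant", "content": "Billing is handled monthly..."},
]
messages = build_messages("What is the refund policy?", history)
retrieval_query = last_user_message(messages)
```

Because only the latest user message is used for retrieval, a follow-up such as "And for annual plans?" is searched as written; the earlier turns influence the answer through the chat context, not through the retrieved chunks.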

Response:

{
  "answer": "Refunds are available within 30 days...",
  "sources": [
    {
      "file": "policy.pdf",
      "path": "/path/to/context_files/policy.pdf",
      "page": 3,
      "confidence_pct": 87
    }
  ]
}

confidence_pct is present when re-ranking is enabled (Discovery Engine score × 100, rounded). Omitted when reranking is off.
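
A rough picture of how retrieved chunks could be turned into the `sources` list, deduplicating by file and page and converting the rerank score to a percentage, is given below. It is a sketch under assumptions: the chunk dictionaries, the function name, and the choice to keep the first (highest-ranked) chunk per file and page are illustrative, and the `path` field is left out for brevity.

```python
def documents_to_sources_sketch(chunks: list[dict]) -> list[dict]:
    """Collapse chunks into one source entry per (file, page) pair.

    Chunks are expected best-first; `score` is the rerank score in [0, 1]
    or None when reranking is off.
    """
    sources: dict[tuple, dict] = {}
    for chunk in chunks:
        key = (chunk["file"], chunk.get("page"))
        if key in sources:
            continue  # assumption: keep the first, highest-ranked hit per page
        entry = {"file": chunk["file"]}
        if chunk.get("page") is not None:
            entry["page"] = chunk["page"]
        if chunk.get("score") is not None:
            entry["confidence_pct"] = round(chunk["score"] * 100)
        sources[key] = entry
    return list(sources.values())


sources = documents_to_sources_sketch([
    {"file": "policy.pdf", "page": 3, "score": 0.874},
    {"file": "policy.pdf", "page": 3, "score": 0.61},
    {"file": "notes.txt", "page": None, "score": 0.41},
])
without_rerank = documents_to_sources_sketch([{"file": "policy.pdf", "page": 3, "score": None}])
```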

When no chunks pass the rerank score filter, the response is:

{
  "answer": "No relevant documents were found for your question.",
  "sources": []
}

Typical workflow

  1. Start the server: python3 main.py
  2. Open http://127.0.0.1:8000/
  3. Upload documents in the UI (or call POST /upload)
  4. Click Index documents (or call POST /index)
  5. Ask questions in the chat UI or via POST /query
  6. Inspect source pills under each assistant reply (file, optional page, optional confidence %)

Limitations

  • In-memory only: the vector store and BM25 index live in process memory and are rebuilt whenever you call POST /index; new uploads invalidate the index until you re-index. Not suitable for large production corpora without swapping in a persistent vector database.
  • Exact vector search — no HNSW/IVF indexing; every query compares against all chunk embeddings. Fast enough for small document sets.
  • Fresh start on boot: context_files/ is emptied when the server starts; persist files elsewhere if you need them across restarts.
  • Single-provider LLM — defaults to Gemini via langchain-google-genai; other providers can be wired through src/llm/base.py.

Troubleshooting

The chatbot does not find the relevant documentation

The system prompt is very strict and instructs the LLM to answer only when it is fully confident in the retrieved context. In hybrid mode, the problem may come from how the retrieved sources are ranked, scored, and merged. If the query mentions a very specific keyword, the lexical branch probably holds the better passages, but they can be dropped when RRF narrows the pool down to the final number of sources, for two main reasons:

  • By default the semantic branch outweighs the lexical one during RRF, because RAG_LEXICAL_WEIGHT=0.5 while the semantic weight is fixed at 1.0.
  • RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL is too low: the relevant chunk ranks just below the cutoff and is discarded.
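
A short worked calculation shows how strong the first effect can be. It assumes the usual weighted RRF formula with a smoothing constant of c = 60; neither the exact formula nor the constant is confirmed by this README, so treat the numbers as indicative only.

```latex
\[
\mathrm{score}(d) = \sum_{b \in \{\mathrm{sem},\,\mathrm{lex}\}} \frac{w_b}{c + \mathrm{rank}_b(d)},
\qquad w_{\mathrm{sem}} = 1.0,\quad w_{\mathrm{lex}} = \texttt{RAG\_LEXICAL\_WEIGHT}
\]
\[
\text{lexical-only chunk at rank 1: } \frac{0.5}{60 + 1} \approx 0.0082
\qquad
\text{semantic-only chunk at rank 16: } \frac{1.0}{60 + 16} \approx 0.0132
\]
```

Under these assumptions, the best BM25-only hit scores lower than the worst of the 16 semantic candidates, so it cannot reach a top-8 result unless the semantic branch also retrieves it. With a lexical weight of 1.0 the same chunk would score 1/61 ≈ 0.0164 and tie with the top semantic hit.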

Practical fixes

  • Raise RAG_FINAL_NUMBER_OF_SOURCES_AFTER_RETRIEVAL so that relevant but poorly ranked chunks can enter the pool.
  • Raise RAG_LEXICAL_WEIGHT (e.g. 1.0 or higher) so BM25 matches for proper names compete fairly in RRF.
  • Retrieve more per branch before merging (raise RAG_NUMBER_OF_CHUNKS_PER_BRANCH or RAG_RERANK_CANDIDATES, then RRF/rerank down to k).
  • If answers cite weak sources, raise RAG_RERANK_MIN_SCORE or raise RAG_RERANK_GAP_RATIO so low-confidence tail chunks are dropped before the LLM prompt (a higher ratio cuts the tail sooner; 0 disables the gap cutoff).
  • If valid queries return “No relevant documents were found”, lower RAG_RERANK_MIN_SCORE or set RAG_RERANK_GAP_RATIO=0 to disable the cliff cutoff.
  • Detect proper names or other distinctive keywords in the query and either boost the lexical branch or filter for chunks containing those tokens. This could be driven by a curated list of entities for each topic.

About

LLM-powered chat application with retrieval over custom documents for grounded, context-aware responses.
