SajidAli8015/multimodal-rag-system

Multilingual & Multimodal RAG System

A multilingual and multimodal Retrieval-Augmented Generation (RAG) system that enables organizations to query internal documents using AI while keeping data secure.

The system processes text, tables, and images from documents and generates grounded answers with citations using modern LLMs.


📌 Project Overview

This project implements an AI-powered document assistant capable of understanding complex documents and answering questions based on their content.

Key capabilities include:

  • 📄 Processing complex document formats (PDF, DOCX, PPTX, spreadsheets)
  • 🖼 Understanding visual elements such as charts, diagrams, and scanned text
  • 🌍 Multilingual querying (English & Arabic)
  • 🔍 Hybrid semantic + keyword retrieval
  • 🤖 LLM-generated answers grounded in source documents

Typical use cases include:

  • Legal document assistants
  • Enterprise knowledge search
  • Research document analysis
  • Internal company documentation assistants

🎥 Demo

User Interface

Multimodal RAG Chat Interface

The system provides a web interface where users can:

  • Upload documents
  • Ask questions about their content
  • View grounded answers with citations
  • Manage chat sessions

🎯 Features

Core AI Capabilities

  • Retrieval-Augmented Generation (RAG) for grounded responses
  • Multilingual support (English and Arabic)
  • Multimodal understanding (text + images)
  • Hybrid search (semantic + keyword retrieval)
  • Persistent document knowledge base
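
The README does not say how the semantic and keyword result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an illustration only. The function name and the chunk ids are hypothetical, and the project may combine scores differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the value commonly used for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers, best match first.
semantic_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, so a passage that matches both the meaning and the exact terms of a question is preferred over one that matches only one of them.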

Document Processing

Supported formats:

  • PDF
  • DOCX
  • PPTX
  • HTML
  • CSV
  • XLSX
  • Markdown

User Experience

  • ChatGPT-style conversation interface
  • Persistent chat sessions
  • Real-time streaming responses
  • Document management interface
  • REST API + WebSocket support

🌍 Supported Languages

| Language | Support |
| --- | --- |
| English | Full support |
| Arabic | Full support |
| Other languages | Supported via multilingual embeddings |

Language detection is performed automatically.
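
How detection is implemented is not described here. As a rough illustration, a script-based heuristic can separate Arabic from English input by counting characters. The function name `detect_language` is hypothetical, and this sketch cannot tell English apart from other Latin-script languages.

```python
def detect_language(text: str) -> str:
    """Return "ar", "en", or "unknown" based on which script dominates."""
    # Characters in the main Arabic Unicode block (U+0600 to U+06FF).
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # ASCII letters stand in for "English" in this simplified sketch.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "unknown"
    return "ar" if arabic > latin else "en"


print(detect_language("What are the main points?"))
print(detect_language("ما هي النقاط الرئيسية؟"))
```

A production system would more likely rely on a dedicated language-identification library, but the idea of routing a query by its detected language is the same.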


📦 Requirements

  • Python 3.10+
  • Conda or Virtualenv
  • Google Gemini API key
  • At least 8 GB RAM recommended

🚀 Quick Start

1. Create Environment

conda create -n RAG python=3.10 -y
conda activate RAG

pip install -r requirements.txt

2. Configure API Key

# Copy example config
cp .env.example .env

Edit .env and add your Gemini API key:

GEMINI_API_KEY=your_actual_key_here

Get your API key at: https://makersuite.google.com/app/apikey

3. Run the System

Option A: Full Stack (Backend + Frontend) - Recommended

python run_all.py

Option B: Separate Servers (For Development)

# Terminal 1: Backend
python run_backend.py

# Terminal 2: Frontend
python run_frontend.py

Option C: Python API Only (For Integration)

from src.rag_system import MultilingualRAG

# Initialize with image processing enabled
rag = MultilingualRAG(
    process_images=True,  # Enable Gemini Vision
    chunking_strategy="structural"
)

# Ingest documents (with image processing)
num_chunks, num_images = rag.ingest_document("path/to/document.pdf")
print(f"Created {num_chunks} chunks, processed {num_images} images")

# Query
response = rag.query("What are the main points?", language="Auto", top_k=10)
print(response["answer"])
print(response["citations"])

🔄 Example Workflow

1️⃣ Upload documents (PDF, DOCX, PPTX)

2️⃣ The system extracts:

  • text
  • tables
  • images
  • metadata

3️⃣ Content is split into semantic chunks

4️⃣ Each chunk is converted into vector embeddings

5️⃣ Stored in ChromaDB vector database

6️⃣ User asks a question

7️⃣ System retrieves the most relevant document chunks

8️⃣ The Gemini LLM generates a grounded answer

9️⃣ Response includes citations pointing to source documents
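
The workflow above can be sketched end to end with toy components. In this illustration a bag-of-words vector stands in for the embedding model, a Python list stands in for ChromaDB, and the prompt is only built, not sent to an LLM. All function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-5: chunks with source metadata, embedded and kept in a list
# (the list stands in for the vector database).
chunks = [
    {"text": "The contract term is three years.", "source": "contract.pdf", "page": 2},
    {"text": "Payment is due within 30 days of invoice.", "source": "contract.pdf", "page": 5},
    {"text": "The office is closed on public holidays.", "source": "handbook.docx", "page": 1},
]
index = [(embed(c["text"]), c) for c in chunks]


def retrieve(question, top_k=2):
    """Step 7: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(question, hits):
    """Steps 8-9: number each chunk so the LLM can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({c['source']}, p. {c['page']}) {c['text']}"
        for i, c in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


hits = retrieve("When is payment due?")
prompt = build_prompt("When is payment due?", hits)
print(prompt)
```

Numbering the retrieved chunks in the prompt is what lets the generated answer point back to a specific source and page.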

📋 System Architecture

Important: The frontend requires the backend to be running. It makes HTTP/WebSocket API calls to the backend rather than instantiating RAG components directly, which keeps a single source of truth and a clean separation of concerns.

Frontend (Gradio)
        │
        ▼
Backend API (FastAPI)
        │
        ▼
RAG Pipeline
 ├── Document Parsing
 ├── Image Processing
 ├── Embedding Generation
 ├── Vector Database (ChromaDB)
 └── LLM Answer Generation (Gemini)

🛠️ Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Document Parser | Docling 2.72.0 | Unified PDF/DOCX/PPTX/HTML/CSV/XLSX/MD parsing |
| Image Extraction (PDF) | PyMuPDF (fitz) | Direct PDF binary image extraction |
| Image Extraction (DOCX) | python-docx | DOCX relationship-based extraction |
| Image Extraction (PPTX) | python-pptx | PPTX shape-based extraction |
| Image Understanding | Google Gemini 2.5 Flash | OCR, classification, visual descriptions |
| Chunking | 4 Strategies | Fixed-size, Structural, Recursive, Legal |
| Embeddings | intfloat/multilingual-e5-large | 1024-dim multilingual embeddings (93 languages) |
| Vector Store | ChromaDB 0.4.22 | Persistent vector storage with metadata filtering |
| LLM | Google Gemini 2.5 Flash | Answer generation with citations |
| Backend | FastAPI | REST API + WebSocket server |
| Frontend | Gradio 4.44.0 | Modern web UI with chat interface |
| Sessions | SQLite + ChromaDB | Persistent conversation storage |
| Config | Pydantic Settings | Environment-based configuration |
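
One detail worth knowing about the E5 model family: its model card asks that every input start with "query: " or "passage: ", including non-English text. Whether this project's embedder already does so is not stated here, so the helpers below are a hypothetical sketch of that convention rather than the project's code.

```python
def as_passages(texts):
    """Prefix document chunks the way E5 expects at indexing time."""
    return [f"passage: {text}" for text in texts]


def as_query(text):
    """Prefix a user question the way E5 expects at search time."""
    return f"query: {text}"


print(as_passages(["Payment is due within 30 days."]))
print(as_query("When is payment due?"))

# With sentence-transformers installed, the prefixed strings would be
# encoded along these lines (not executed here; downloads the model):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("intfloat/multilingual-e5-large")
#   vectors = model.encode(as_passages(chunks), normalize_embeddings=True)
```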

Image Processing Pipeline

  1. Extract: PyMuPDF extracts images from PDFs with page context
  2. Analyze: Gemini Vision performs:
    • Text extraction (OCR)
    • Image classification (chart, diagram, table, etc.)
    • Detailed visual descriptions (4000+ characters)
  3. Embed: E5-Large creates semantic embeddings
  4. Store: ChromaDB indexes images as searchable chunks
  5. Retrieve: Images surface in search results alongside text
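
To make the "images as searchable chunks" step concrete, the sketch below flattens an image analysis into plain text plus metadata, which can then be embedded like any other chunk. The `ImageAnalysis` class, the field names, and the metadata keys are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ImageAnalysis:
    """Hypothetical container for what the vision model returns."""
    ocr_text: str
    image_type: str
    description: str


def image_to_chunk(analysis, source, page):
    """Flatten an image analysis into a text chunk plus metadata."""
    text = (
        f"[Image: {analysis.image_type}]\n"
        f"Description: {analysis.description}\n"
        f"Text in image: {analysis.ocr_text}"
    )
    metadata = {
        "source": source,
        "page": page,
        "chunk_type": "image",
        "image_type": analysis.image_type,
    }
    return {"text": text, "metadata": metadata}


analysis = ImageAnalysis(
    ocr_text="Revenue 2023: 4.2M",
    image_type="chart",
    description="Bar chart comparing yearly revenue.",
)
chunk = image_to_chunk(analysis, source="report.pdf", page=7)
print(chunk["text"])
```

Keeping the page number and image type in metadata allows a citation to point at the figure itself and allows filtering by chunk type at query time.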

⚙️ Configuration

All settings are configurable via .env:

# API Configuration
GEMINI_API_KEY=your_key_here

# Embedding Model
EMBEDDING_MODEL=intfloat/multilingual-e5-large
EMBEDDING_DEVICE=cpu  # Options: cpu, cuda, mps

# Chunking Strategy
CHUNKING_STRATEGY=structural  # Options: structural, legal, recursive, fixed_size
CHUNK_SIZE=1500
CHUNK_OVERLAP=250

# Vector Store
VECTOR_STORE_TYPE=chromadb
VECTOR_STORE_PATH=./data/vector_store

# Retrieval
RETRIEVAL_MODE=hybrid  # Options: semantic, hybrid
TOP_K_RESULTS=10  # Default chunks to retrieve (1-20 range in UI)
SIMILARITY_THRESHOLD=0.7

# Image Processing (NEW!)
PROCESS_IMAGES=true  # Enable/disable Gemini Vision for images

# Generation
MAX_OUTPUT_TOKENS=1024
TEMPERATURE=0.3

# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/rag_system.log
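
The project loads these values through Pydantic Settings. The standard-library sketch below is not that code. It only illustrates the same idea (read `KEY=value` pairs, drop inline comments, coerce types, validate) for the retrieval settings. The class name is hypothetical, and enforcing the 1-20 range outside the UI is an assumption.

```python
from dataclasses import dataclass


def parse_env(text):
    """Parse KEY=value lines, skipping blanks, comment lines, and inline comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # Treat " #" as the start of an inline comment (unquoted values only).
        values[key.strip()] = raw.split(" #", 1)[0].strip()
    return values


@dataclass
class RetrievalSettings:
    retrieval_mode: str
    top_k_results: int
    similarity_threshold: float

    @classmethod
    def from_env(cls, env):
        settings = cls(
            retrieval_mode=env.get("RETRIEVAL_MODE", "hybrid"),
            top_k_results=int(env.get("TOP_K_RESULTS", "10")),
            similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", "0.7")),
        )
        if settings.retrieval_mode not in {"semantic", "hybrid"}:
            raise ValueError(f"unknown RETRIEVAL_MODE: {settings.retrieval_mode}")
        # Assumption: apply the UI's 1-20 range to the config value as well.
        if not 1 <= settings.top_k_results <= 20:
            raise ValueError("TOP_K_RESULTS must be between 1 and 20")
        return settings


sample = """
# Retrieval
RETRIEVAL_MODE=hybrid  # Options: semantic, hybrid
TOP_K_RESULTS=10  # Default chunks to retrieve (1-20 range in UI)
SIMILARITY_THRESHOLD=0.7
"""
settings = RetrievalSettings.from_env(parse_env(sample))
print(settings)
```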

Chunking Strategies

  1. structural (Default) ⭐

    • Best for: Reports, manuals, technical docs
    • Preserves headings and section hierarchy
    • Intelligent splitting for large sections
  2. legal

    • Best for: Legal documents, contracts, regulations
    • Detects articles, clauses, sections
    • Preserves cross-references
  3. recursive

    • Best for: Mixed content with nested structure
    • Hierarchical separators
    • Falls back gracefully
  4. fixed_size

    • Best for: Simple text, chat logs
    • Sentence-boundary aware
    • Fast and predictable
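
As an illustration of sentence-boundary-aware chunking with overlap, the sketch below packs whole sentences into chunks and carries trailing sentences into the next chunk. It is a minimal stand-in, not the project's `fixed_size` implementation. The function name is hypothetical, and the defaults mirror `CHUNK_SIZE` and `CHUNK_OVERLAP` from the configuration.

```python
import re


def chunk_by_sentences(text, chunk_size=1500, chunk_overlap=250):
    """Pack whole sentences into chunks of at most chunk_size characters.

    Each new chunk starts with the trailing sentences of the previous one,
    up to chunk_overlap characters. A single sentence longer than
    chunk_size is kept whole rather than cut mid-sentence.
    """
    # Split after Latin or Arabic sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?؟])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            overlap, size = [], 0
            for prev in reversed(current):
                if size + len(prev) > chunk_overlap:
                    break
                overlap.insert(0, prev)
                size += len(prev) + 1
            # Drop the overlap if it would push the new chunk over the limit.
            if len(" ".join(overlap + [sentence])) > chunk_size:
                overlap = []
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_by_sentences(text, chunk_size=35, chunk_overlap=15):
    print(repr(chunk))
```

The overlap means a sentence near a chunk boundary appears in both neighbours, which reduces the chance that an answer is split across two chunks and missed by retrieval.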

Image Extraction

  • 🔄 Multi-Format Support:
    • PDFs: PyMuPDF (direct binary extraction)
    • DOCX: python-docx (relationship-based extraction)
    • PPTX: python-pptx (shape-based extraction)
  • 📄 Supported Formats: PDF, DOCX, PPTX, PPT, DOC
  • Context Preservation: Captures surrounding text (pages for PDF, slides for PPTX)
  • 🎯 Location Tracking: Page numbers (PDF) or slide numbers (PPTX)
  • ⚡ Reliable Extraction: Bypasses Docling's broken image API

Image Analysis

  • 🤖 OCR Text Extraction: Reads all visible text from images
  • 🏷️ Automatic Classification: Identifies image types (chart, diagram, table, scanned_text, photo, screenshot, illustration, map, etc.)
  • Visual Descriptions: Generates detailed 4000+ character descriptions of visual content
  • 🔍 Semantic Search: Images are embedded and searchable alongside text

Usage

Enable in UI:

✅ Process Images (OCR & Understanding)

Enable via API:

curl -X POST "http://localhost:8000/documents/upload?process_images=true" \
  -F "file=@document.pdf"

Enable in Python:

rag = MultilingualRAG(process_images=True)
num_chunks, num_images = rag.ingest_document("document.pdf")

Disable if needed: Set PROCESS_IMAGES=false in .env to skip image processing.
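
For callers who prefer Python over curl, the upload request shown above can be built with the standard library alone. This sketch only constructs the request and does not send it. The helper name is hypothetical, and the shape of the backend's response is not documented here, so no response handling is shown.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request


def build_upload_request(base_url, filename, content, process_images=True):
    """Build (but do not send) a multipart POST equivalent to the curl call."""
    boundary = uuid.uuid4().hex
    query = urlencode({"process_images": str(process_images).lower()})
    # One form field named "file", matching curl's -F "file=@document.pdf".
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return Request(
        f"{base_url}/documents/upload?{query}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_upload_request("http://localhost:8000", "document.pdf", b"%PDF-1.4 ...")
print(request.get_method(), request.full_url)
# To send it against a running backend: urllib.request.urlopen(request)
```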

📂 Project Structure

RAG/
├── src/
│   ├── rag_system.py                      # Core RAG orchestration
│   ├── ingestion/
│   │   ├── document_loader_docling.py     # Docling integration + image pipeline
│   │   ├── pymupdf_extractor.py           # PyMuPDF image extraction (NEW!)
│   │   └── image_processor.py             # Gemini Vision integration (NEW!)
│   ├── preprocessing/
│   │   └── chunking_strategies.py         # 4 chunking strategies
│   ├── embeddings/
│   │   └── embedding_generator.py         # E5-Large embedder
│   ├── retrieval/
│   │   ├── vector_store.py                # ChromaDB interface
│   │   └── retriever.py                   # Hybrid retrieval
│   ├── generation/
│   │   ├── llm_generator.py               # Gemini integration
│   │   └── answer_generator.py            # Answer generation with citations
│   ├── api/
│   │   ├── main.py                        # FastAPI backend (12+ endpoints)
│   │   ├── models.py                      # Request/response models
│   │   └── session_manager.py             # Chat session management (NEW!)
│   └── frontend/
│       └── gradio_app.py                  # Gradio UI with chat interface
├── config/
│   └── settings.py                        # Pydantic settings
├── data/
│   ├── documents/                         # Input documents
│   ├── vector_store/                      # ChromaDB persistence
│   ├── chat_sessions.db                   # SQLite session storage (NEW!)
│   └── debug/                             # Debug logs
├── tests/
│   ├── test_chunking.py
│   ├── test_rag.py
│   ├── test_api.py
│   └── test_multimodal.py                 # Image processing tests (NEW!)
├── run_backend.py                         # Backend runner
├── run_frontend.py                        # Frontend runner
├── run_all.py                             # Full stack runner
├── requirements.txt                       # Python dependencies
├── .env                                   # Configuration (not in git)
├── .env.example                           # Example configuration
└── README.md                              # This file

🧪 Testing

# Run all tests
pytest tests/ -v

# Test specific components
pytest tests/test_chunking.py -v
pytest tests/test_rag.py -v
pytest tests/test_api.py -v

# Test with coverage
pytest --cov=src tests/

Chat Sessions

The system includes persistent chat session management:

Features

  • 💾 Persistent Storage: Sessions saved to SQLite database
  • 🔄 Conversation Memory: LLM remembers previous questions and answers
  • 📚 Session Management: Create, load, and delete chat sessions
  • 🎨 Modern UI: ChatGPT-like interface with session sidebar
  • Survives Restarts: Sessions persist across system restarts

Quick Start

  1. Click "➕ New Chat" to start a fresh conversation
  2. Ask questions - the system remembers context
  3. Load previous sessions from the dropdown
  4. Sessions automatically save to ./data/chat_sessions.db
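
A minimal sketch of how chat sessions could be persisted in SQLite follows. The table names, columns, and helper functions are hypothetical and are not taken from `session_manager.py`. The example uses an in-memory database, and passing a path such as `./data/chat_sessions.db` would write to disk instead.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the session database and create the (hypothetical) schema."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS sessions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            session_id INTEGER NOT NULL REFERENCES sessions(id),
            role TEXT NOT NULL CHECK (role IN ('user', 'assistant')),
            content TEXT NOT NULL
        );
        """
    )
    return conn


def new_session(conn, title):
    """Create a session row and return its id."""
    cur = conn.execute("INSERT INTO sessions (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


def add_message(conn, session_id, role, content):
    """Append one message to a session."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()


def load_history(conn, session_id):
    """Return (role, content) pairs in the order they were written."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return rows.fetchall()


conn = open_store()
sid = new_session(conn, "Contract questions")
add_message(conn, sid, "user", "When is payment due?")
add_message(conn, sid, "assistant", "Within 30 days of invoice [1].")
history = load_history(conn, sid)
print(history)
```

Loading the stored history and passing it back to the LLM with each new question is one way the "remembers context" behaviour can be achieved.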

📈 Future Roadmap

  • Video understanding
  • Audio transcription
  • Additional LLM providers
  • Multi-user authentication
  • Incremental document indexing
  • Advanced visual analytics

Status: Production Ready ✅
