Multilingual & Multimodal RAG System

A multilingual and multimodal Retrieval-Augmented Generation (RAG) system that enables organizations to query internal documents using AI while keeping data secure.

The system processes text, tables, and images from documents and generates grounded answers with citations using modern LLMs.

📌 Project Overview

This project implements an AI-powered document assistant capable of understanding complex documents and answering questions based on their content.

Key capabilities include:

📄 Processing complex document formats (PDF, DOCX, PPTX, spreadsheets)
🖼 Understanding visual elements such as charts, diagrams, and scanned text
🌍 Multilingual querying (English & Arabic)
🔍 Hybrid semantic + keyword retrieval
🤖 LLM-generated answers grounded in source documents

Typical use cases include:

Legal document assistants
Enterprise knowledge search
Research document analysis
Internal company documentation assistants

🎥 Demo

User Interface

The system provides a web interface where users can:

Upload documents
Ask questions about their content
View grounded answers with citations
Manage chat sessions

🎯 Features

Core AI Capabilities

Retrieval-Augmented Generation (RAG) for grounded responses
Multilingual support (English and Arabic)
Multimodal understanding (text + images)
Hybrid search (semantic + keyword retrieval)
Persistent document knowledge base
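
The hybrid search listed above combines a semantic ranking with a keyword ranking. The README does not say how the two are merged, so the following is only a minimal sketch that assumes Reciprocal Rank Fusion (RRF), a common fusion method. The function names and the toy keyword scorer are hypothetical, and the semantic ranking is assumed to come from the vector store.

```python
from collections import Counter


def keyword_scores(query, docs):
    """Toy keyword scorer: count occurrences of each query term per document."""
    terms = query.lower().split()
    scores = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        scores.append(sum(counts[term] for term in terms))
    return scores


def rank(scores):
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document earns 1 / (k + rank) per ranking."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + position + 1)
    # Sort by fused score, breaking ties by document index for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
```

RRF only needs ranks, not raw scores, so it avoids having to normalize cosine similarities against keyword counts. That is one reason it is a frequent default for this kind of retrieval.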

Document Processing

Supported formats:

PDF
DOCX
PPTX
HTML
CSV
XLSX
Markdown

User Experience

ChatGPT-style conversation interface
Persistent chat sessions
Real-time streaming responses
Document management interface
REST API + WebSocket support

🌍 Supported Languages

Language	Support
English	Full support
Arabic	Full support
Other languages	Supported via multilingual embeddings

Language detection is performed automatically.
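
The detector used by the project is not shown in this README. As a rough illustration of how English and Arabic queries could be told apart, here is a sketch that compares character counts by Unicode script. The function name and the returned labels are assumptions.

```python
def detect_language(text: str) -> str:
    """Guess "Arabic" or "English" by counting characters per script."""
    # Characters in the basic Arabic Unicode block (U+0600 to U+06FF).
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # ASCII letters stand in for English here.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "unknown"
    return "Arabic" if arabic > latin else "English"
```

A script count like this cannot separate languages that share an alphabet (for example English and French), so a real system would likely rely on a trained language identifier.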

📦 Requirements

Python 3.10+
Conda or Virtualenv
Google Gemini API key
At least 8 GB RAM recommended

🚀 Quick Start

1. Create Environment

conda create -n RAG python=3.10 -y
conda activate RAG

pip install -r requirements.txt

2. Configure API Key

# Copy example config
cp .env.example .env

Edit .env and add your Gemini API key:

GEMINI_API_KEY=your_actual_key_here

Get your API key at: https://makersuite.google.com/app/apikey

3. Run the System

Option A: Full Stack (Backend + Frontend) - Recommended

python run_all.py

Backend API: http://localhost:8000
Frontend UI: http://localhost:7860
API Documentation: http://localhost:8000/docs
WebSocket endpoint: ws://localhost:8000/ws/query

Option B: Separate Servers (For Development)

# Terminal 1: Backend
python run_backend.py

# Terminal 2: Frontend
python run_frontend.py

Option C: Python API Only (For Integration)

from src.rag_system import MultilingualRAG

# Initialize with image processing enabled
rag = MultilingualRAG(
    process_images=True,  # Enable Gemini Vision
    chunking_strategy="structural"
)

# Ingest documents (with image processing)
num_chunks, num_images = rag.ingest_document("path/to/document.pdf")
print(f"Created {num_chunks} chunks, processed {num_images} images")

# Query
response = rag.query("What are the main points?", language="Auto", top_k=10)
print(response["answer"])
print(response["citations"])

🔄 Example Workflow

1️⃣ Upload documents (PDF, DOCX, PPTX)

2️⃣ The system extracts:

text
tables
images
metadata

3️⃣ Content is split into semantic chunks

4️⃣ Each chunk is converted into vector embeddings

5️⃣ Embeddings are stored in the ChromaDB vector database

6️⃣ User asks a question

7️⃣ System retrieves the most relevant document chunks

8️⃣ The Gemini LLM generates a grounded answer

9️⃣ Response includes citations pointing to source documents
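
Steps 7 to 9 above can be illustrated with a small sketch of how retrieved chunks might be turned into a grounded prompt plus a citation list. The `Chunk` fields, the prompt wording, and the citation dictionary layout are all hypothetical; the project's actual prompt is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file name of the originating document
    page: int


def build_grounded_prompt(question, chunks):
    """Number each retrieved chunk so the LLM can cite it as [n]."""
    context_lines = []
    citations = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{number}] {chunk.text}")
        citations.append({"id": number, "source": chunk.source, "page": chunk.page})
    prompt = (
        "Answer using only the numbered sources below. Cite sources as [n].\n\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}"
    )
    return prompt, citations
```

Keeping the citation list separate from the prompt lets the caller map each `[n]` marker in the generated answer back to a file and page.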

📋 System Architecture

Important: The frontend requires the backend to function. It makes HTTP/WebSocket API calls to the backend rather than instantiating RAG components directly, which keeps a single source of truth and a clean separation of concerns.

Frontend (Gradio)
        │
        ▼
Backend API (FastAPI)
        │
        ▼
RAG Pipeline
 ├── Document Parsing
 ├── Image Processing
 ├── Embedding Generation
 ├── Vector Database (ChromaDB)
 └── LLM Answer Generation (Gemini)

🛠️ Technology Stack

Component	Technology	Purpose
Document Parser	Docling 2.72.0	Unified PDF/DOCX/PPTX/HTML/CSV/XLSX/MD parsing
Image Extraction (PDF)	PyMuPDF (fitz)	Direct PDF binary image extraction
Image Extraction (DOCX)	python-docx	DOCX relationship-based extraction
Image Extraction (PPTX)	python-pptx	PPTX shape-based extraction
Image Understanding	Google Gemini 2.5 Flash	OCR, classification, visual descriptions
Chunking	4 Strategies	Fixed-size, Structural, Recursive, Legal
Embeddings	intfloat/multilingual-e5-large	1024-dim multilingual embeddings (93 languages)
Vector Store	ChromaDB 0.4.22	Persistent vector storage with metadata filtering
LLM	Google Gemini 2.5 Flash	Answer generation with citations
Backend	FastAPI	REST API + WebSocket server
Frontend	Gradio 4.44.0	Modern web UI with chat interface
Sessions	SQLite + ChromaDB	Persistent conversation storage
Config	Pydantic Settings	Environment-based configuration
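
One detail worth knowing about the E5 embedding family: its model card asks for a "query: " prefix on search queries and a "passage: " prefix on indexed text. Whether this project's embedder adds these prefixes is not stated here, so the helpers below are only a sketch with hypothetical names. The plain-Python cosine similarity shows how two embedding vectors would be compared.

```python
import math


def as_query(text: str) -> str:
    """Prefix a user question the way E5 models expect for queries."""
    return f"query: {text}"


def as_passage(text: str) -> str:
    """Prefix a document chunk the way E5 models expect for indexed text."""
    return f"passage: {text}"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)
```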

Image Processing Pipeline

Extract: PyMuPDF extracts images from PDFs with page context
Analyze: Gemini Vision performs:
- Text extraction (OCR)
- Image classification (chart, diagram, table, etc.)
- Detailed visual descriptions (4000+ characters)
Embed: E5-Large creates semantic embeddings
Store: ChromaDB indexes images as searchable chunks
Retrieve: Images surface in search results alongside text
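
To make the Analyze and Store steps more concrete, here is a sketch of how one image analysis result might be flattened into a text chunk that can be embedded like any other. The `ImageAnalysis` fields, the chunk layout, and the metadata keys are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ImageAnalysis:
    image_type: str   # e.g. "chart", "diagram", "scanned_text"
    ocr_text: str     # text read from the image, may be empty
    description: str  # visual description produced by the vision model


def image_to_chunk(analysis, source, page):
    """Combine type, description, and OCR text into one searchable chunk."""
    parts = [f"[Image: {analysis.image_type}]", analysis.description.strip()]
    ocr_text = analysis.ocr_text.strip()
    if ocr_text:
        parts.append(f"Text in image: {ocr_text}")
    metadata = {
        "source": source,
        "page": page,
        "content_type": "image",
        "image_type": analysis.image_type,
    }
    return {"text": "\n".join(part for part in parts if part), "metadata": metadata}
```

Storing the image type in metadata would allow filtering, for example retrieving only charts, while the combined text lets the image match ordinary semantic queries.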

⚙️ Configuration

All settings are configurable via .env:

# API Configuration
GEMINI_API_KEY=your_key_here

# Embedding Model
EMBEDDING_MODEL=intfloat/multilingual-e5-large
EMBEDDING_DEVICE=cpu  # Options: cpu, cuda, mps

# Chunking Strategy
CHUNKING_STRATEGY=structural  # Options: structural, legal, recursive, fixed_size
CHUNK_SIZE=1500
CHUNK_OVERLAP=250

# Vector Store
VECTOR_STORE_TYPE=chromadb
VECTOR_STORE_PATH=./data/vector_store

# Retrieval
RETRIEVAL_MODE=hybrid  # Options: semantic, hybrid
TOP_K_RESULTS=10  # Default chunks to retrieve (1-20 range in UI)
SIMILARITY_THRESHOLD=0.7

# Image Processing (NEW!)
PROCESS_IMAGES=true  # Enable/disable Gemini Vision for images

# Generation
MAX_OUTPUT_TOKENS=1024
TEMPERATURE=0.3

# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/rag_system.log
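
The project reads these values through Pydantic Settings. As a dependency-free illustration of the same idea, the sketch below parses `.env`-style text and validates the retrieval options. The class and function names are hypothetical, and applying the 1-20 range to the config value (the README states it only for the UI) is an assumption.

```python
from dataclasses import dataclass


def parse_env_text(text):
    """Parse KEY=VALUE lines, dropping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        # Naive comment stripping: this would break values that contain '#'.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


@dataclass(frozen=True)
class RetrievalSettings:
    mode: str
    top_k: int
    similarity_threshold: float


def load_retrieval_settings(values):
    """Convert raw strings to typed settings, using the README's defaults."""
    mode = values.get("RETRIEVAL_MODE", "hybrid")
    if mode not in {"semantic", "hybrid"}:
        raise ValueError(f"Unsupported RETRIEVAL_MODE: {mode}")
    top_k = int(values.get("TOP_K_RESULTS", "10"))
    if not 1 <= top_k <= 20:
        raise ValueError("TOP_K_RESULTS must be between 1 and 20")
    threshold = float(values.get("SIMILARITY_THRESHOLD", "0.7"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("SIMILARITY_THRESHOLD must be between 0 and 1")
    return RetrievalSettings(mode, top_k, threshold)
```

Validating at load time means a typo such as `RETRIEVAL_MODE=hybird` fails at startup instead of silently changing retrieval behavior.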

Chunking Strategies

structural (Default) ⭐
- Best for: Reports, manuals, technical docs
- Preserves headings and section hierarchy
- Intelligent splitting for large sections
legal
- Best for: Legal documents, contracts, regulations
- Detects articles, clauses, sections
- Preserves cross-references
recursive
- Best for: Mixed content with nested structure
- Hierarchical separators
- Falls back gracefully
fixed_size
- Best for: Simple text, chat logs
- Sentence-boundary aware
- Fast and predictable
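
As an illustration of the fixed_size strategy, the sketch below packs whole sentences into chunks of roughly `chunk_size` characters and carries trailing sentences forward as overlap. It is a simplified stand-in, not the code in chunking_strategies.py, and the sentence splitter is a basic regular expression.

```python
import re


def split_sentences(text):
    """Split after ., !, ? or the Arabic question mark when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]


def chunk_fixed_size(text, chunk_size=1500, chunk_overlap=250):
    """Group sentences into chunks near chunk_size characters, with overlap."""
    chunks = []
    current = []
    current_len = 0
    for sentence in split_sentences(text):
        if current and current_len + len(sentence) + 1 > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentences (up to chunk_overlap characters) forward.
            overlap = []
            overlap_len = 0
            for previous in reversed(current):
                if overlap_len + len(previous) > chunk_overlap:
                    break
                overlap.insert(0, previous)
                overlap_len += len(previous) + 1
            current = overlap
            current_len = sum(len(s) + 1 for s in current)
        current.append(sentence)
        current_len += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because sentences are never cut, a single sentence longer than `chunk_size` becomes its own oversized chunk. A production chunker would need a fallback for that case.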

Image Extraction

🔄 Multi-Format Support:
- PDFs: PyMuPDF (direct binary extraction)
- DOCX: python-docx (relationship-based extraction)
- PPTX: python-pptx (shape-based extraction)
📄 Supported Formats: PDF, DOCX, PPTX, PPT, DOC
📍 Context Preservation: Captures surrounding text (pages for PDF, slides for PPTX)
🎯 Location Tracking: Page numbers (PDF) or slide numbers (PPTX)
⚡ Reliable Extraction: Bypasses Docling's broken image API

Image Analysis

🤖 OCR Text Extraction: Reads all visible text from images
🏷️ Automatic Classification: Identifies image types (chart, diagram, table, scanned_text, photo, screenshot, illustration, map, etc.)
📝 Visual Descriptions: Generates detailed 4000+ character descriptions of visual content
🔍 Semantic Search: Images are embedded and searchable alongside text

Usage

Enable in UI:

✅ Process Images (OCR & Understanding)

Enable via API:

curl -X POST "http://localhost:8000/documents/upload?process_images=true" \
  -F "file=@document.pdf"

Enable in Python:

rag = MultilingualRAG(process_images=True)
num_chunks, num_images = rag.ingest_document("document.pdf")

Disable if needed: Set PROCESS_IMAGES=false in .env to skip image processing.
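
For callers who prefer Python over curl, the upload call shown above can be built with the standard library alone. This sketch only constructs the request; the helper name is hypothetical, and the form field name `file` is taken from the curl example.

```python
import urllib.request
import uuid


def build_upload_request(filename, file_bytes,
                         base_url="http://localhost:8000", process_images=True):
    """Build a multipart/form-data POST matching the curl upload example."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    flag = "true" if process_images else "false"
    return urllib.request.Request(
        f"{base_url}/documents/upload?process_images={flag}",
        data=head + file_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With the backend running, passing the returned object to `urllib.request.urlopen` would send the upload. The shape of the response body is not documented here, so it is best inspected via the API documentation page.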

📂 Project Structure

RAG/
├── src/
│   ├── rag_system.py                      # Core RAG orchestration
│   ├── ingestion/
│   │   ├── document_loader_docling.py     # Docling integration + image pipeline
│   │   ├── pymupdf_extractor.py           # PyMuPDF image extraction (NEW!)
│   │   └── image_processor.py             # Gemini Vision integration (NEW!)
│   ├── preprocessing/
│   │   └── chunking_strategies.py         # 4 chunking strategies
│   ├── embeddings/
│   │   └── embedding_generator.py         # E5-Large embedder
│   ├── retrieval/
│   │   ├── vector_store.py                # ChromaDB interface
│   │   └── retriever.py                   # Hybrid retrieval
│   ├── generation/
│   │   ├── llm_generator.py               # Gemini integration
│   │   └── answer_generator.py            # Answer generation with citations
│   ├── api/
│   │   ├── main.py                        # FastAPI backend (12+ endpoints)
│   │   ├── models.py                      # Request/response models
│   │   └── session_manager.py             # Chat session management (NEW!)
│   └── frontend/
│       └── gradio_app.py                  # Gradio UI with chat interface
├── config/
│   └── settings.py                        # Pydantic settings
├── data/
│   ├── documents/                         # Input documents
│   ├── vector_store/                      # ChromaDB persistence
│   ├── chat_sessions.db                   # SQLite session storage (NEW!)
│   └── debug/                             # Debug logs
├── tests/
│   ├── test_chunking.py
│   ├── test_rag.py
│   ├── test_api.py
│   └── test_multimodal.py                 # Image processing tests (NEW!)
├── run_backend.py                         # Backend runner
├── run_frontend.py                        # Frontend runner
├── run_all.py                             # Full stack runner
├── requirements.txt                       # Python dependencies
├── .env                                   # Configuration (not in git)
├── .env.example                           # Example configuration
└── README.md                              # This file

🧪 Testing

# Run all tests
pytest tests/ -v

# Test specific components
pytest tests/test_chunking.py -v
pytest tests/test_rag.py -v
pytest tests/test_api.py -v

# Test with coverage
pytest --cov=src tests/

💬 Chat Sessions

The system includes persistent chat session management:

Features

💾 Persistent Storage: Sessions saved to SQLite database
🔄 Conversation Memory: LLM remembers previous questions and answers
📚 Session Management: Create, load, and delete chat sessions
🎨 Modern UI: ChatGPT-like interface with session sidebar
🔁 Survives Restarts: Sessions persist across system restarts

Quick Start

Click "➕ New Chat" to start a fresh conversation
Ask questions - the system remembers context
Load previous sessions from the dropdown
Sessions automatically save to ./data/chat_sessions.db
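
To show what SQLite-backed session persistence can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and function names are assumptions, and the real schema in session_manager.py may differ. The default path uses an in-memory database so the example leaves no file behind.

```python
import sqlite3


def open_session_store(path=":memory:"):
    """Open (or create) the message table; pass a file path to persist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )
    return conn


def add_message(conn, session_id, role, content):
    """Append one message to a session; the 'with' block commits it."""
    with conn:
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def load_history(conn, session_id):
    """Return a session's messages in the order they were added."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [{"role": role, "content": content} for role, content in rows]


def delete_session(conn, session_id):
    """Remove every message that belongs to a session."""
    with conn:
        conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))
```

Loading the history and passing it back to the LLM with each new question is what gives the conversation memory described above.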

📈 Future Roadmap

Video understanding
Audio transcription
Additional LLM providers
Multi-user authentication
Incremental document indexing
Advanced visual analytics

Status: Production Ready ✅
