A multilingual and multimodal Retrieval-Augmented Generation (RAG) system that enables organizations to query internal documents using AI while keeping data secure.
The system processes text, tables, and images from documents and generates grounded answers with citations using modern LLMs.
This project implements an AI-powered document assistant capable of understanding complex documents and answering questions based on their content.
Key capabilities include:
- Processing complex document formats (PDF, DOCX, PPTX, spreadsheets)
- Understanding visual elements such as charts, diagrams, and scanned text
- Multilingual querying (English & Arabic)
- Hybrid semantic + keyword retrieval
- LLM-generated answers grounded in source documents
Typical use cases include:
- Legal document assistants
- Enterprise knowledge search
- Research document analysis
- Internal company documentation assistants
The system provides a web interface where users can:
- Upload documents
- Ask questions about their content
- View grounded answers with citations
- Manage chat sessions
- Retrieval-Augmented Generation (RAG) for grounded responses
- Multilingual support (English and Arabic)
- Multimodal understanding (text + images)
- Hybrid search (semantic + keyword retrieval)
- Persistent document knowledge base
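Hybrid search has to merge a semantic ranking with a keyword ranking. This README does not say how the project fuses the two, so the sketch below uses reciprocal rank fusion (RRF), a common choice, purely as an illustration. The function name, the `k=60` constant, and the chunk ids are assumptions, not part of the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher fused score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["c3", "c1", "c7"]  # hypothetical ids from vector search
keyword_hits = ["c1", "c9", "c3"]   # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behavior hybrid retrieval is meant to provide.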
Supported formats:
- PDF
- DOCX
- PPTX
- HTML
- CSV
- XLSX
- Markdown
- ChatGPT-style conversation interface
- Persistent chat sessions
- Real-time streaming responses
- Document management interface
- REST API + WebSocket support
| Language | Support |
|---|---|
| English | Full support |
| Arabic | Full support |
| Other languages | Supported via multilingual embeddings |
Language detection is performed automatically.
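The README does not name the detector it uses. As a rough illustration only, English and Arabic can be told apart by checking which script dominates the letters in a query. The function below is a hypothetical heuristic, not the project's implementation, and it cannot identify any other language.

```python
def detect_language(text: str) -> str:
    """Return "ar" if Arabic-script letters dominate, else "en".

    A heuristic stand-in: it only separates Arabic script from
    everything else and is not a general language identifier.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"  # no letters at all: fall back to English
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return "ar" if arabic / len(letters) > 0.5 else "en"


print(detect_language("What are the main points?"))
print(detect_language("\u0645\u0627 \u0647\u064a \u0627\u0644\u0646\u0642\u0627\u0637"))
```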
- Python 3.10+
- Conda or Virtualenv
- Google Gemini API key
- At least 8 GB of RAM recommended
conda create -n RAG python=3.10 -y
conda activate RAG
pip install -r requirements.txt

# Copy example config
cp .env.example .env

Edit .env and add your Gemini API key:
GEMINI_API_KEY=your_actual_key_here

Get your API key at: https://makersuite.google.com/app/apikey

python run_all.py

- Backend API: http://localhost:8000
- Frontend UI: http://localhost:7860
- API Documentation: http://localhost:8000/docs
- WebSocket endpoint: ws://localhost:8000/ws/query
# Terminal 1: Backend
python run_backend.py
# Terminal 2: Frontend
python run_frontend.py

from src.rag_system import MultilingualRAG

# Initialize with image processing enabled
rag = MultilingualRAG(
    process_images=True,  # Enable Gemini Vision
    chunking_strategy="structural"
)

# Ingest documents (with image processing)
num_chunks, num_images = rag.ingest_document("path/to/document.pdf")
print(f"Created {num_chunks} chunks, processed {num_images} images")

# Query
response = rag.query("What are the main points?", language="Auto", top_k=10)
print(response["answer"])
print(response["citations"])

1. Upload documents (PDF, DOCX, PPTX)
2. The system extracts:
   - text
   - tables
   - images
   - metadata
3. Content is split into semantic chunks
4. Each chunk is converted into vector embeddings
5. Embeddings are stored in the ChromaDB vector database
6. User asks a question
7. System retrieves the most relevant document chunks
8. The Gemini LLM generates a grounded answer
9. Response includes citations pointing to source documents
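The retrieval half of these steps can be sketched in a few lines. This is a toy model under loud assumptions: word counts stand in for the E5 embeddings, a Python list stands in for ChromaDB, and the sample chunks and file names are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Steps 3-5: chunks are embedded and stored together with source metadata.
chunks = [
    {"text": "Invoices are due within 30 days.", "source": "policy.pdf, p. 2"},
    {"text": "Employees accrue 20 vacation days per year.", "source": "handbook.docx, p. 5"},
]
index = [(embed(c["text"]), c) for c in chunks]


def retrieve(question, top_k=1):
    """Steps 6-7: embed the question and return the closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


# Steps 8-9: the retrieved text and its source would be passed to the LLM,
# which is how the answer can cite where each statement came from.
hits = retrieve("How many vacation days do employees get?")
print(hits[0]["source"])
```

Keeping the source alongside each chunk at indexing time is what makes citations possible later; the generator only needs to echo metadata it was handed.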
Important: The frontend requires the backend to function. It makes HTTP/WebSocket API calls rather than instantiating RAG components directly, which keeps a single source of truth and a clean separation of concerns.
Frontend (Gradio)
        │
        ▼
Backend API (FastAPI)
        │
        ▼
RAG Pipeline
 ├── Document Parsing
 ├── Image Processing
 ├── Embedding Generation
 ├── Vector Database (ChromaDB)
 └── LLM Answer Generation (Gemini)
| Component | Technology | Purpose |
|---|---|---|
| Document Parser | Docling 2.72.0 | Unified PDF/DOCX/PPTX/HTML/CSV/XLSX/MD parsing |
| Image Extraction (PDF) | PyMuPDF (fitz) | Direct PDF binary image extraction |
| Image Extraction (DOCX) | python-docx | DOCX relationship-based extraction |
| Image Extraction (PPTX) | python-pptx | PPTX shape-based extraction |
| Image Understanding | Google Gemini 2.5 Flash | OCR, classification, visual descriptions |
| Chunking | 4 Strategies | Fixed-size, Structural, Recursive, Legal |
| Embeddings | intfloat/multilingual-e5-large | 1024-dim multilingual embeddings (93 languages) |
| Vector Store | ChromaDB 0.4.22 | Persistent vector storage with metadata filtering |
| LLM | Google Gemini 2.5 Flash | Answer generation with citations |
| Backend | FastAPI | REST API + WebSocket server |
| Frontend | Gradio 4.44.0 | Modern web UI with chat interface |
| Sessions | SQLite + ChromaDB | Persistent conversation storage |
| Config | Pydantic Settings | Environment-based configuration |
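One practical detail of the E5 embedding family: the models are trained with a `query: ` prefix on search queries and a `passage: ` prefix on indexed text, and omitting them tends to hurt retrieval quality. Whether this project's embedder already adds them is not stated here, so the helper names below are hypothetical.

```python
def as_passage(chunk_text: str) -> str:
    """Format a document chunk for indexing with an E5 model."""
    return f"passage: {chunk_text.strip()}"


def as_query(question: str) -> str:
    """Format a user question for retrieval with an E5 model."""
    return f"query: {question.strip()}"


# With sentence-transformers installed, the formatted strings would be
# encoded roughly like this (not executed here):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("intfloat/multilingual-e5-large")
#   vectors = model.encode([as_passage(t) for t in texts],
#                          normalize_embeddings=True)

print(as_query("  What are the main points? "))
print(as_passage("Invoices are due within 30 days."))
```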
- Extract: PyMuPDF extracts images from PDFs with page context
- Analyze: Gemini Vision performs:
  - Text extraction (OCR)
  - Image classification (chart, diagram, table, etc.)
  - Detailed visual descriptions (4000+ characters)
- Embed: E5-Large creates semantic embeddings
- Store: ChromaDB indexes images as searchable chunks
- Retrieve: Images surface in search results alongside text
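To make the "images as searchable chunks" idea concrete, here is a minimal sketch of how one analyzed image might be flattened into text plus metadata before embedding. The `ImageAnalysis` class, the field names, and the `chunk_type` key are all hypothetical; the project's actual data model may differ.

```python
from dataclasses import dataclass


@dataclass
class ImageAnalysis:
    """Hypothetical container for what the vision model returns."""
    ocr_text: str
    image_type: str  # e.g. "chart", "diagram", "table"
    description: str
    page: int


def image_to_chunk(analysis: ImageAnalysis, source: str) -> dict:
    """Flatten an image analysis into a text chunk plus metadata."""
    text = (
        f"[Image: {analysis.image_type}]\n"
        f"Description: {analysis.description}\n"
        f"Text in image: {analysis.ocr_text}"
    )
    metadata = {"source": source, "page": analysis.page, "chunk_type": "image"}
    return {"text": text, "metadata": metadata}


chunk = image_to_chunk(
    ImageAnalysis("Q1 42%, Q2 58%", "chart", "Bar chart of quarterly share.", 3),
    source="report.pdf",
)
print(chunk["metadata"])
```

Because the result is ordinary text, it can go through the same embedding and storage path as any other chunk, and the page number in the metadata lets a citation point back to the image.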
All settings are configurable via .env:
# API Configuration
GEMINI_API_KEY=your_key_here
# Embedding Model
EMBEDDING_MODEL=intfloat/multilingual-e5-large
EMBEDDING_DEVICE=cpu # Options: cpu, cuda, mps
# Chunking Strategy
CHUNKING_STRATEGY=structural # Options: structural, legal, recursive, fixed_size
CHUNK_SIZE=1500
CHUNK_OVERLAP=250
# Vector Store
VECTOR_STORE_TYPE=chromadb
VECTOR_STORE_PATH=./data/vector_store
# Retrieval
RETRIEVAL_MODE=hybrid # Options: semantic, hybrid
TOP_K_RESULTS=10 # Default chunks to retrieve (1-20 range in UI)
SIMILARITY_THRESHOLD=0.7
# Image Processing (NEW!)
PROCESS_IMAGES=true # Enable/disable Gemini Vision for images
# Generation
MAX_OUTPUT_TOKENS=1024
TEMPERATURE=0.3
# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/rag_system.log
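The project reads these values with Pydantic Settings. As a dependency-free illustration of the same idea, the sketch below parses a few of the variables from a mapping and validates one relationship between them. The `Settings` class here is a standard-library stand-in, not the project's `config/settings.py`, and it assumes the `.env` values have already been loaded into the environment.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    chunking_strategy: str
    chunk_size: int
    chunk_overlap: int
    top_k_results: int
    similarity_threshold: float
    process_images: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, using the README values as defaults."""
    env = os.environ if env is None else env
    settings = Settings(
        chunking_strategy=env.get("CHUNKING_STRATEGY", "structural"),
        chunk_size=int(env.get("CHUNK_SIZE", "1500")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "250")),
        top_k_results=int(env.get("TOP_K_RESULTS", "10")),
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", "0.7")),
        process_images=_as_bool(env.get("PROCESS_IMAGES", "true")),
    )
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings


settings = load_settings({"CHUNK_SIZE": "800", "PROCESS_IMAGES": "false"})
print(settings)
```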
- structural (Default)
  - Best for: Reports, manuals, technical docs
  - Preserves headings and section hierarchy
  - Intelligent splitting for large sections
- legal
  - Best for: Legal documents, contracts, regulations
  - Detects articles, clauses, sections
  - Preserves cross-references
- recursive
  - Best for: Mixed content with nested structure
  - Hierarchical separators
  - Falls back gracefully
- fixed_size
  - Best for: Simple text, chat logs
  - Sentence-boundary aware
  - Fast and predictable
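As an illustration of what "sentence-boundary aware" fixed-size chunking with overlap can look like, here is a minimal sketch. It is not the code in `chunking_strategies.py`; the function name and the simple punctuation-based sentence splitter are assumptions, and the defaults mirror `CHUNK_SIZE` and `CHUNK_OVERLAP` from the configuration.

```python
import re


def chunk_sentences(text, chunk_size=1500, chunk_overlap=250):
    """Pack whole sentences into chunks of at most chunk_size characters.

    Each new chunk starts with trailing sentences of the previous chunk
    totalling at most chunk_overlap characters. A single sentence longer
    than chunk_size becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried, used = [], 0
            for prev in reversed(current):
                if used + len(prev) > chunk_overlap:
                    break
                carried.insert(0, prev)
                used += len(prev)
            # Drop carried sentences if they would overflow the new chunk.
            while carried and len(" ".join(carried + [sentence])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = chunk_sentences(
    "Alpha one. Beta two. Gamma three. Delta four.",
    chunk_size=25,
    chunk_overlap=12,
)
print(demo)
```

The overlap repeats the last sentence of each chunk at the start of the next, so a fact that straddles a chunk boundary is still retrievable from at least one chunk.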
- Multi-Format Support:
  - PDFs: PyMuPDF (direct binary extraction)
  - DOCX: python-docx (relationship-based extraction)
  - PPTX: python-pptx (shape-based extraction)
- Supported Formats: PDF, DOCX, PPTX, PPT, DOC
- Context Preservation: Captures surrounding text (pages for PDF, slides for PPTX)
- Location Tracking: Page numbers (PDF) or slide numbers (PPTX)
- Reliable Extraction: Bypasses Docling's broken image API
- OCR Text Extraction: Reads all visible text from images
- Automatic Classification: Identifies image types (chart, diagram, table, scanned_text, photo, screenshot, illustration, map, etc.)
- Visual Descriptions: Generates detailed 4000+ character descriptions of visual content
- Semantic Search: Images are embedded and searchable alongside text
Enable in UI:
✅ Process Images (OCR & Understanding)

Enable via API:
curl -X POST "http://localhost:8000/documents/upload?process_images=true" \
  -F "file=@document.pdf"

Enable in Python:
rag = MultilingualRAG(process_images=True)
num_chunks, num_images = rag.ingest_document("document.pdf")

Disable if needed:
Set PROCESS_IMAGES=false in .env to skip image processing.
RAG/
├── src/
│   ├── rag_system.py                     # Core RAG orchestration
│   ├── ingestion/
│   │   ├── document_loader_docling.py    # Docling integration + image pipeline
│   │   ├── pymupdf_extractor.py          # PyMuPDF image extraction (NEW!)
│   │   └── image_processor.py            # Gemini Vision integration (NEW!)
│   ├── preprocessing/
│   │   └── chunking_strategies.py        # 4 chunking strategies
│   ├── embeddings/
│   │   └── embedding_generator.py        # E5-Large embedder
│   ├── retrieval/
│   │   ├── vector_store.py               # ChromaDB interface
│   │   └── retriever.py                  # Hybrid retrieval
│   ├── generation/
│   │   ├── llm_generator.py              # Gemini integration
│   │   └── answer_generator.py           # Answer generation with citations
│   ├── api/
│   │   ├── main.py                       # FastAPI backend (12+ endpoints)
│   │   ├── models.py                     # Request/response models
│   │   └── session_manager.py            # Chat session management (NEW!)
│   └── frontend/
│       └── gradio_app.py                 # Gradio UI with chat interface
├── config/
│   └── settings.py                       # Pydantic settings
├── data/
│   ├── documents/                        # Input documents
│   ├── vector_store/                     # ChromaDB persistence
│   ├── chat_sessions.db                  # SQLite session storage (NEW!)
│   └── debug/                            # Debug logs
├── tests/
│   ├── test_chunking.py
│   ├── test_rag.py
│   ├── test_api.py
│   └── test_multimodal.py                # Image processing tests (NEW!)
├── run_backend.py                        # Backend runner
├── run_frontend.py                       # Frontend runner
├── run_all.py                            # Full stack runner
├── requirements.txt                      # Python dependencies
├── .env                                  # Configuration (not in git)
├── .env.example                          # Example configuration
└── README.md                             # This file
# Run all tests
pytest tests/ -v
# Test specific components
pytest tests/test_chunking.py -v
pytest tests/test_rag.py -v
pytest tests/test_api.py -v
# Test with coverage
pytest --cov=src tests/

The system includes persistent chat session management:
- Persistent Storage: Sessions saved to SQLite database
- Conversation Memory: LLM remembers previous questions and answers
- Session Management: Create, load, and delete chat sessions
- Modern UI: ChatGPT-like interface with session sidebar
- Survives Restarts: Sessions persist across system restarts

Using chat sessions:
- Click "New Chat" to start a fresh conversation
- Ask questions; the system remembers context
- Load previous sessions from the dropdown
- Sessions are saved automatically to ./data/chat_sessions.db
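For readers curious how SQLite-backed conversation memory can work, here is a minimal sketch using the standard `sqlite3` module. The table layout and function names are assumptions made for illustration; the schema actually used by `session_manager.py` is not documented here.

```python
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Open the session database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return conn


def add_message(conn, session_id, role, content):
    """Append one message to a session."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def load_history(conn, session_id):
    """Return a session's messages in the order they were written."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [{"role": role, "content": content} for role, content in rows]


# An in-memory database keeps the example side-effect free; passing a file
# path such as "./data/chat_sessions.db" is what makes sessions persist.
conn = open_store()
session_id = str(uuid.uuid4())
add_message(conn, session_id, "user", "What are the main points?")
add_message(conn, session_id, "assistant", "The report has three sections.")
history = load_history(conn, session_id)
print(len(history))
```

Loading a previous session then amounts to reading its history back and including it in the prompt, which is how the model can appear to remember earlier turns.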
- Video understanding
- Audio transcription
- Additional LLM providers
- Multi-user authentication
- Incremental document indexing
- Advanced visual analytics
Status: Production Ready ✅