Repository: rag_on_pdf
A simple, local RAG (Retrieval-Augmented Generation) pipeline that lets you upload PDF documents, extract their text, generate embeddings with Google's EmbeddingGemma 300M model, store them in ChromaDB, and run semantic search with natural-language queries.
Perfect for students and developers who want to understand how RAG works by building it from scratch on their local computer.
- Semantic Search: Find relevant content by meaning, not just keywords
- PDF Processing: Extract and process text from PDF documents
- Local Embeddings: Uses Google's EmbeddingGemma 300M model (runs entirely on your machine)
- Persistent Storage: ChromaDB stores embeddings locally (no cloud required)
- Smart Chunking: Automatically splits documents into 2-sentence paragraphs for optimal search
- GPU Support: Automatically detects and uses a GPU if available
- User-Friendly UI: Built with Streamlit for easy interaction
- Privacy-First: All processing happens locally - your documents never leave your computer
RAG (Retrieval-Augmented Generation) combines:
- Retrieval: Finding relevant information from documents
- Augmented Generation: Using that information to provide better answers
This app demonstrates a simple RAG pipeline:
PDF → Extract Text → Create Chunks → Generate Embeddings → Store in Vector DB
↓
User Query → Generate Query Embedding → Search Vector DB → Retrieve Relevant Chunks
Before you begin, ensure you have:
- Python 3.8+ installed
- ~2GB free disk space (for the embedding model)
- A Hugging Face account (free; sign up at https://huggingface.co/join)
- Basic Python knowledge (helpful but not required)
Clone this repository or download the files:
git clone https://github.com/mrgehlot/rag_on_pdf.git
cd rag_on_pdf
Or if you have SSH set up:
git clone git@github.com:mrgehlot/rag_on_pdf.git
cd rag_on_pdf
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
This will install:
- streamlit: Web interface
- PyMuPDF: PDF text extraction
- transformers: Hugging Face transformers library
- chromadb: Vector database
- torch: PyTorch for model inference
- nltk: Natural language processing
- python-dotenv: Environment variable management
- datasets: Dataset utilities
- huggingface_hub: Hugging Face model hub
Important: Before running the app, you must accept Google's usage license for the EmbeddingGemma model.
- Create a Hugging Face account (if you don't have one):
  - Visit https://huggingface.co/join
  - Sign up for a free account
- Accept the model license:
  - Go to the EmbeddingGemma 300M model card
  - Make sure you're logged in
  - Click the button to review and accept Google's usage license
  - The acceptance is processed immediately
Note: If you skip this step, you'll get an error when trying to load the model. The error message will guide you to accept the license.
The app will automatically download required NLTK data on first run. If you encounter issues, you can manually download:
import nltk
nltk.download('punkt_tab')
streamlit run pdf_search_app.py
The app will open in your default web browser at http://localhost:8501
- Initialize Model & Database:
  - Click the "Initialize Model & Database" button in the sidebar
  - On first run, this downloads the EmbeddingGemma 300M model (a one-time download; allow for the ~2GB of free disk space listed under "Before you begin")
  - Wait for the initialization to complete
- Upload PDF:
  - Click "Choose a PDF file" in the main area
  - Select your PDF document
  - Click "Process PDF"
  - Wait for processing to complete (a progress bar shows the status)
- Search:
  - Enter your natural language query in the search box
  - Adjust the number of results (1-10) using the slider
  - Click "Search"
  - View relevant paragraphs ranked by similarity
- "What is the main topic of this document?"
- "Explain machine learning concepts"
- "What are the key findings?"
- "Summarize the introduction"
.
├── pdf_search_app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── README.md # This file
├── chroma_db/ # ChromaDB storage (created automatically)
└── .env # Environment variables (optional)
The following diagram illustrates the complete RAG pipeline architecture:
graph TD
A[User Uploads PDF] --> B[Extract Text from PDF]
B --> C[Split into Sentences using NLTK]
C --> D[Create Paragraphs<br/>2 sentences each]
D --> E[Generate Embeddings<br/>Gemma 300M Model]
E --> F[Store in ChromaDB<br/>Vector Database]
G[User Query] --> H[Generate Query Embedding<br/>Same Gemma 300M Model]
H --> I[Search ChromaDB<br/>Cosine Similarity]
F --> I
I --> J[Retrieve Top-K Results]
J --> K[Display Relevant Paragraphs]
style A fill:#e1f5ff
style G fill:#e1f5ff
style E fill:#fff4e1
style H fill:#fff4e1
style F fill:#e8f5e9
style I fill:#e8f5e9
style K fill:#f3e5f5
Document Processing (Indexing):
- PDF → Text Extraction
- Text → Sentence Splitting
- Sentences → Paragraph Chunking
- Paragraphs → Embedding Generation
- Embeddings → Vector Storage (ChromaDB)
Query Processing (Retrieval):
- User Query → Query Embedding
- Query Embedding → Similarity Search
- Similarity Search → Top-K Results
- Results → Display to User
- Uses PyMuPDF to extract text from PDF files
- Handles multi-page documents
- Uses NLTK for intelligent sentence tokenization
- Handles complex cases (abbreviations, decimals, etc.)
- Creates paragraphs of 2 sentences each
- Balances context with granularity for optimal search
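The sentence-grouping step above can be sketched in a few lines. The name `create_paragraphs` and its `sentences_per_paragraph` parameter come from the customization note for pdf_search_app.py, but the body below is an assumption about how such a helper might work, not the app's actual code. The sentence list is hard-coded so the sketch runs without NLTK; in the app the sentences would come from the NLTK tokenizer.

```python
from typing import List


def create_paragraphs(sentences: List[str], sentences_per_paragraph: int = 2) -> List[str]:
    """Group consecutive sentences into fixed-size chunks (hypothetical body)."""
    if sentences_per_paragraph < 1:
        raise ValueError("sentences_per_paragraph must be at least 1")
    paragraphs = []
    for start in range(0, len(sentences), sentences_per_paragraph):
        # Take the next window of sentences and drop any that are empty.
        group = [s.strip() for s in sentences[start:start + sentences_per_paragraph]]
        group = [s for s in group if s]
        if group:
            paragraphs.append(" ".join(group))
    return paragraphs


sentences = [
    "RAG retrieves relevant text.",
    "It then passes that text to a generator.",
    "Embeddings map text to vectors.",
    "Similar texts get nearby vectors.",
    "A final odd sentence forms its own chunk.",
]
paragraphs = create_paragraphs(sentences)
for paragraph in paragraphs:
    print(paragraph)
```

With five sentences and the default of two per paragraph, the last chunk holds a single sentence. Larger values give each chunk more context but make each search hit less precise.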
- Uses Google's EmbeddingGemma 300M model to generate embeddings
- Converts text to 768-dimensional vectors
- Runs locally on your CPU or GPU
- Stores embeddings in ChromaDB (local vector database)
- Each PDF gets its own collection
- Persistent storage (survives app restarts)
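Giving each PDF its own collection means turning a filename into a valid collection name. The sketch below is one possible way to do that. The helper `collection_name_for` is hypothetical and is not taken from pdf_search_app.py. ChromaDB validates collection names and the exact rules depend on the installed version, so this sketch assumes a conservative subset: lowercase letters, digits, underscores and hyphens, at most 63 characters, alphanumeric at both ends.

```python
import hashlib
import re
from pathlib import Path


def collection_name_for(pdf_filename: str, max_length: int = 63) -> str:
    """Derive a stable, conservative collection name from a PDF filename (hypothetical helper)."""
    stem = Path(pdf_filename).stem.lower()
    # Replace every run of disallowed characters with a single underscore.
    slug = re.sub(r"[^a-z0-9_-]+", "_", stem).strip("_-")
    # A short hash of the full filename keeps names distinct when two
    # different filenames reduce to the same slug.
    digest = hashlib.sha1(pdf_filename.encode("utf-8")).hexdigest()[:8]
    prefix = "pdf_"
    room = max_length - len(prefix) - len(digest) - 1
    slug = slug[:room].strip("_-")
    return f"{prefix}{slug}_{digest}" if slug else f"{prefix}{digest}"


name = collection_name_for("My Report (v2).pdf")
print(name)
```

With a ChromaDB client in hand, the result could be passed to `client.get_or_create_collection(name=...)`, so re-uploading the same file reuses its existing collection.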
- Converts query to embedding using the same model
- ChromaDB finds most similar embeddings using cosine similarity
- Returns top-k most relevant paragraphs
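To make the ranking step concrete, here is a minimal brute-force sketch of cosine-similarity top-k search. It only illustrates the idea: ChromaDB uses its own index rather than a linear scan, and the function names and the toy 3-dimensional vectors (standing in for the 768-dimensional embeddings) are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors: the first and third chunks point in nearly the same direction as the query.
chunks = [
    ("Chunk about neural networks", [0.9, 0.1, 0.0]),
    ("Chunk about cooking pasta", [0.0, 0.2, 0.9]),
    ("Chunk about gradient descent", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]
results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Cosine similarity compares direction rather than length, which is why a short query can still match a longer paragraph on the same topic.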
The app automatically detects and uses GPU if available. To check:
import torch
print(torch.cuda.is_available())  # True if GPU available
To modify the number of sentences per paragraph, edit pdf_search_app.py:
paragraphs = create_paragraphs(sentences, sentences_per_paragraph=2)  # Change 2 to desired number
By default, ChromaDB stores data in ./chroma_db. To change this:
import chromadb
from chromadb.config import Settings

client = chromadb.PersistentClient(
    path="./your_custom_path",  # Change this
    settings=Settings(anonymized_telemetry=False)
)
Solution: The app will try to download NLTK data automatically. If it fails:
import nltk
nltk.download('punkt_tab')
Solution:
- Make sure you've accepted the model license on Hugging Face
- Confirm you're logged in to your Hugging Face account
- Try logging in via command line:
huggingface-cli login
Solution:
- The app will automatically fall back to CPU
- For large PDFs, process in smaller batches
- Close other GPU-intensive applications
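Processing in smaller batches, as suggested above, can be sketched as follows. The idea is to embed a fixed number of paragraphs at a time so that only one batch is held by the model at once. The helper `batched`, the batch size of 4, and the stand-in `fake_embed` function are all hypothetical; in the app, the real embedding call would take the place of `fake_embed`.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for the real model: returns a 1-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in batch]


paragraphs = [f"paragraph {i}" for i in range(10)]
embeddings: List[List[float]] = []
for batch in batched(paragraphs, batch_size=4):
    # Only this batch is passed to the model at a time.
    embeddings.extend(fake_embed(batch))

print(len(embeddings))
```

If memory errors persist, lowering the batch size trades some speed for a smaller peak memory footprint.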
Solution:
- Check your internet connection
- Ensure you have enough free disk space (the prerequisites recommend ~2GB for the embedding model)
- Try downloading manually from Hugging Face
Solutions:
- Use GPU if available (much faster)
- Reduce chunk size for faster embedding generation
- Process smaller PDFs first to test
- Model Size: ~300M parameters
- Memory Usage: ~1-2GB RAM
- Processing Speed:
- CPU: ~1-2 seconds per paragraph
- GPU: ~0.1-0.5 seconds per paragraph
- Storage: ~10-50MB per PDF (depending on size)
- Research: Search through academic papers and documents
- Documentation: Find relevant sections in technical documentation
- Study: Quickly find information in textbooks and study materials
- Knowledge Base: Build a searchable knowledge base from PDFs
- Learning: Understand how RAG and semantic search work
- Change Chunk Size: Try 1, 3, or 5 sentences per paragraph
- Multiple PDFs: Process and search across multiple documents
- Different Models: Try other embedding models from Hugging Face
- Add Metadata: Store page numbers, dates, or other metadata
- Hybrid Search: Combine keyword search with semantic search
- Streamlit Documentation
- EmbeddingGemma 300M Model Card
- ChromaDB Documentation
- NLTK Documentation
- Hugging Face Transformers
This is a learning project! Feel free to:
- Fork the repository
- Add new features
- Improve the UI
- Fix bugs
- Share your experiments
This project uses:
- EmbeddingGemma 300M: Licensed under Google's Gemma Terms
- Code: Feel free to use and modify for learning purposes
- Google DeepMind for the EmbeddingGemma model
- Hugging Face for model hosting and transformers library
- ChromaDB for the vector database
- Streamlit for the web framework
- Start Small: Test with a small PDF first
- Experiment: Try different chunk sizes and queries
- Read the Code: Understanding the code is more valuable than just running it
- Break Things: Don't be afraid to modify and experiment
- Ask Questions: RAG is complex - it's okay to have questions!
If you encounter issues:
- Check the Troubleshooting section
- Review error messages carefully
- Ensure all prerequisites are met
- Check that you've accepted the model license
For questions, issues, or contributions:
- Email: abhigelot123@gmail.com
- Repository: rag_on_pdf
- Issues: Please open an issue on the repository for bug reports or feature requests
Happy Searching!
Built with ❤️ for students learning RAG and semantic search.