docs: finalize project documentation and professional polish

Krish Senthil · Krish Senthil · commit dd0566e435be · 2026-02-04T20:50:00.000-08:00
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -1,21 +1,36 @@
 # Development Guide
 
-## Project Goals
-Building a local, real-time RAG system for personal documents.
+## Project Structure
+* **FolderWatcher.java**: Java listener that watches the filesystem for changes and launches the Python backend as a child process.
+* **ingest.py**: The primary ingestion engine. Handles document loading, text splitting, embedding generation, and FAISS index management.
+* **ask.py**: The user interface. Implements the RAG chain using the vector store and the Gemini LLM.
+* **docs/**: The designated folder for document monitoring.
+* **faiss_index/**: Local persistence directory for the vectorized document mappings.
 
-## Architecture
-- Vector Store: FAISS (chosen for local persistence)
-- Embeddings: Google Gemini
+## Technical Architecture
 
-## Retrieval Strategy
-Using k=3 for nearest neighbor search to balance context and token usage.
+### 1. File Monitoring (Java)
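+We use Java's `WatchService` (part of the `java.nio.file` package) because it can rely on native OS change notifications (inotify on Linux, for example) instead of repeatedly scanning the directory. On platforms where the JDK has no native backend, it falls back to polling internally. Event-driven watching generally costs less than polling a directory from Python, especially on large directory trees.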
 
-## Multi-format Support
-Added loaders for Markdown and Plain Text to increase versatility.
+### 2. Semantic Search (FAISS)
+FAISS (Facebook AI Similarity Search) was selected for its speed and simplicity. It is an in-process library that stores the index in local files, so no separate vector database service (a hosted one such as Pinecone, or Chroma running in server mode) is required. This keeps the index portable and stored on the local machine.
 
-## Concurrency and Locking
-Using 'portalocker' to ensure index integrity when multiple files are added at once.
+### 3. AI Pipeline (Gemini)
+- **Embedding Model**: `text-embedding-004` generates dense 768-dimensional vectors.
+- **LLM**: `gemini-3-flash-preview` is used for the generation stage. Its high-speed inference and large context window make it ideal for RAG applications.
 
-## Performance Optimization
-- Added 'RecursiveCharacterTextSplitter' to ensure optimal chunk sizes for embedding quality.
-- Updated LLM to 'gemini-3-flash-preview' for significantly lower latency in the CLI.
+## Key Implementation Challenges
+
+### Concurrency
+During mass document imports, multiple instances of the ingestion script may attempt to write to the FAISS index simultaneously. We implemented file-level locking using the `portalocker` library to prevent index corruption.
+
+### Rate Limiting
+The Google AI free tier enforces requests-per-minute limits. The ingestion script retries rate-limited requests with exponential backoff so that large batches of documents can finish indexing.
+
+### Document Fragmentation
+To improve search accuracy, we use `RecursiveCharacterTextSplitter`. It breaks documents into overlapping chunks (up to 2,000 characters with a 200-character overlap), which reduces the chance that semantic context is lost at a split boundary.
+
+## Future Roadmap
+- Integration of OCR for scanned PDF support.
+- Web-based dashboard for visual document management.
+- Support for remote cloud storage listeners (S3/GCS).
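Note on the Document Fragmentation section of DEVELOPMENT.md: the effect of the 2,000/200 overlap setting can be illustrated with a minimal sketch. This is a plain fixed-width sliding window, not the separator-aware logic of `RecursiveCharacterTextSplitter`, and the function name `split_with_overlap` is hypothetical rather than something taken from `ingest.py`.

```python
def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-width chunks; each chunk repeats the last
    `overlap` characters of the previous chunk.

    Illustrative only: the real splitter also prefers to cut at paragraph
    and sentence separators, which this sketch ignores.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 5,000-character document yields three chunks starting at offsets 0, 1,800 and 3,600, so a sentence that straddles a cut point appears whole in at least one chunk as long as it is shorter than the overlap.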
diff --git a/README.md b/README.md
@@ -0,0 +1,43 @@
+# mySearch: Personal Document Intelligence System
+
+mySearch is a real-time document indexing and retrieval-augmented generation (RAG) system. It combines a Java filesystem watcher with a Python RAG pipeline built on Google Gemini models to provide a searchable knowledge base backed by a local index.
+
+## Key Features
+
+* Automated Monitoring: Java's WatchService monitors the filesystem and detects new content shortly after it is written.
+* Broad Format Support: Seamlessly indexes PDF, DOCX, TXT, MD, CSV, and Excel files.
+* AI-Driven Retrieval: Built on Gemini 3 Flash (`gemini-3-flash-preview`) and text-embedding-004 for semantic search.
+* Local Performance: Utilizes FAISS for high-speed vector similarity searches and local metadata persistence.
+* Resilient Design: Includes robust concurrency handling and automatic API rate-limit management.
+
+## Technologies Used
+
+* Java (NIO WatchService): Serves as the high-efficiency file system listener.
+* Python (LangChain): Acts as the orchestration layer for the RAG pipeline.
+* Google Gemini 3 Flash (`gemini-3-flash-preview`): The primary large language model (LLM) used for generating answers.
+* text-embedding-004: Embedding model that produces 768-dimensional vectors for semantic matching.
+* FAISS: Local vector similarity search library used as the vector store.
+* Portalocker: Provides file-level locking to maintain index stability during parallel indexing.
+
+## Setup and Installation
+
+1. Initialize Environment:
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   pip install langchain-google-genai langchain-community faiss-cpu pypdf langchain-classic pandas openpyxl python-docx portalocker
+   ```
+
+2. Configure API Credentials:
+   ```bash
+   export GOOGLE_API_KEY="your_api_key_here"
+   ```
+
+3. Start the FolderWatcher and the query interface as described under Basic Usage.
+
+## Basic Usage
+
+To index and search your documents:
+1. Place your desired documents (PDF, Word, etc.) into the docs/ directory.
+2. Make sure FolderWatcher is running; it triggers indexing automatically when files appear in docs/.
+3. Execute ask.py to start the terminal-based query interface and ask questions about your library.
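Note on the "automatic API rate-limit management" feature listed in README.md: a retry loop with exponential backoff could look roughly like the sketch below. It is an assumption about the general pattern, not the code in `ingest.py`; `RateLimitError`, `retry_with_backoff`, and all parameter names are hypothetical, and a real client would raise its own exception type.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit error an API client raises."""


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func(); on RateLimitError wait base_delay * 2**attempt seconds
    (capped at max_delay, plus up to 10% random jitter) and try again.

    The last failure is re-raised once max_attempts calls have been made.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would wrap the rate-limited step, for example `retry_with_backoff(lambda: embed_batch(chunks))`, where `embed_batch` is again a hypothetical name. The jitter keeps several ingestion processes from retrying at the same instant.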
diff --git a/docs/lorem_ipsum_template.pdf b/docs/lorem_ipsum_template.pdf