Commit dd0566e

Krish Senthil authored and committed
docs: finalize project documentation and professional polish
1 parent 39774b6 commit dd0566e

3 files changed

Lines changed: 72 additions & 14 deletions

DEVELOPMENT.md

Lines changed: 29 additions & 14 deletions
@@ -1,21 +1,36 @@
11
# Development Guide
22

3-
## Project Goals
4-
Building a local, real-time RAG system for personal documents.
3+
## Project Structure
4+
* **FolderWatcher.java**: Java listener that watches the filesystem for changes and runs the Python backend as a subprocess.
5+
* **ingest.py**: The primary ingestion engine. Handles document loading, text splitting, embedding generation, and FAISS index management.
6+
* **ask.py**: The terminal-based query interface. Implements the RAG chain using the FAISS vector store and the Gemini LLM.
7+
* **docs/**: The designated folder for document monitoring.
8+
* **faiss_index/**: Local directory where the FAISS index is persisted.
59

6-
## Architecture
7-
- Vector Store: FAISS (chosen for local persistence)
8-
- Embeddings: Google Gemini
10+
## Technical Architecture
911

10-
## Retrieval Strategy
11-
Using k=3 for nearest neighbor search to balance context and token usage.
12+
### 1. File Monitoring (Java)
13+
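We use Java's `WatchService` (part of the `java.nio.file` package) because it can rely on native OS file-change notifications instead of repeatedly scanning the directory. Where native notifications are available, this avoids the cost of polling a directory from Python, especially on large filesystems. Note that the stock JDK implementation on macOS falls back to polling internally, so detection latency there is higher.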
1214

13-
## Multi-format Support
14-
Added loaders for Markdown and Plain Text to increase versatility.
15+
### 2. Semantic Search (FAISS)
16+
FAISS (Facebook AI Similarity Search) was selected for its speed and simplicity. It is a library that stores vectors in local files, so it needs no separate database server or hosted service such as Pinecone. This keeps the index portable and on the user's machine; document text is still sent to the Gemini API for embedding and answer generation.
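
As a rough illustration of what a similarity search returns, the sketch below does a brute-force nearest-neighbour lookup in plain Python. It is not FAISS code and is not taken from `ingest.py` or `ask.py`; the function name `nearest_neighbors` and the toy vectors are hypothetical. An exact (flat) FAISS index answers the same question using optimized native code.

```python
import heapq
import math


def nearest_neighbors(query, vectors, k=3):
    """Return (distance, index) pairs for the k stored vectors closest to query.

    Brute-force Euclidean search: every stored vector is compared to the query.
    """
    scored = ((math.dist(query, vec), i) for i, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional "embeddings"; real embeddings here have 768 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.0]]
closest = nearest_neighbors([0.0, 0.0], stored, k=2)
```

In the RAG chain, the indices returned by such a search identify the document chunks passed to the LLM as context.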
1517

16-
## Concurrency and Locking
17-
Using 'portalocker' to ensure index integrity when multiple files are added at once.
18+
### 3. AI Pipeline (Gemini)
19+
- **Embedding Model**: `text-embedding-004` generates dense 768-dimensional vectors.
20+
- **LLM**: `gemini-3-flash-preview` handles the generation stage. Its fast inference and large context window suit RAG workloads.
1821

19-
## Performance Optimization
20-
- Added 'RecursiveCharacterTextSplitter' to ensure optimal chunk sizes for embedding quality.
21-
- Updated LLM to 'gemini-3-flash-preview' for significantly lower latency in the CLI.
22+
## Key Implementation Challenges
23+
24+
### Concurrency
25+
During mass document imports, multiple instances of the ingestion script may attempt to write to the FAISS index simultaneously. We implemented file-level locking using the `portalocker` library to prevent index corruption.
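
The locking pattern can be sketched as follows. To stay dependency-free, this sketch uses the standard-library `fcntl.flock` (POSIX only) rather than `portalocker`. The helper names `index_lock` and `append_entry` and the `.lock` file suffix are assumptions for illustration, not the code in `ingest.py`.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "a") as handle:
        # Blocks until every other process holding the lock has released it.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def append_entry(index_path, entry):
    """Hypothetical stand-in for 'load index, add vectors, save index'."""
    with index_lock(index_path + ".lock"):
        with open(index_path, "a", encoding="utf-8") as f:
            f.write(entry + "\n")
```

With `portalocker`, the `portalocker.Lock(path, timeout=...)` context manager plays the role of `index_lock` and also works on Windows.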
26+
27+
### Rate Limiting
28+
The Google AI free tier enforces requests-per-minute limits. The ingestion script retries failed requests with exponential backoff, so large batches of documents are eventually indexed even when individual requests are throttled.
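
A minimal sketch of retry with exponential backoff is shown below. The function name `retry_with_backoff`, the default attempt count and delays, and the small random jitter are assumptions for illustration; the actual values used by the ingestion script are not stated here.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

In practice `retry_on` would be narrowed to the rate-limit exception raised by the API client, so that unrelated errors fail immediately instead of being retried.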
29+
30+
### Document Fragmentation
31+
To improve search accuracy, we use `RecursiveCharacterTextSplitter`. It breaks documents into overlapping chunks (2,000 characters with a 200-character overlap), so text that straddles a split boundary appears in both neighbouring chunks and its context is not lost.
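
The effect of the overlap can be seen in a simplified fixed-window chunker. This is not `RecursiveCharacterTextSplitter`, which additionally prefers to split on separators such as paragraph and line breaks; the function name `chunk_text` is hypothetical, and only the 2000/200 sizes come from the text above.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size windows where neighbours share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
pieces = chunk_text(sample)
```

For the 5,000-character sample this yields three chunks, and the last 200 characters of each chunk are repeated at the start of the next.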
32+
33+
## Future Roadmap
34+
- Integration of OCR for scanned PDF support.
35+
- Web-based dashboard for visual document management.
36+
- Support for remote cloud storage listeners (S3/GCS).

README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
# mySearch: Personal Document Intelligence System
2+
3+
mySearch is a real-time document indexing and retrieval-augmented generation (RAG) system. A Java file watcher detects changes on disk, and a Python pipeline built on Google Gemini models indexes the documents and answers questions about them, with the search index stored locally.
4+
5+
## Key Features
6+
7+
* Automated Monitoring: Java's WatchService uses native OS file notifications, where available, to detect new content shortly after it appears.
8+
* Broad Format Support: Seamlessly indexes PDF, DOCX, TXT, MD, CSV, and Excel files.
9+
* AI-Driven Retrieval: Built on Gemini 3 Flash (`gemini-3-flash-preview`) and `text-embedding-004` for semantic search.
10+
* Local Performance: Uses FAISS for fast vector similarity search, with the index persisted on local disk.
11+
* Resilient Design: Uses file locking to protect the index during concurrent ingestion and retries with exponential backoff when API rate limits are hit.
12+
13+
## Technologies Used
14+
15+
* Java (NIO WatchService): Serves as the high-efficiency file system listener.
16+
* Python (LangChain): Acts as the orchestration layer for the RAG pipeline.
17+
* Google Gemini 3 Flash (`gemini-3-flash-preview`): The large language model (LLM) used for generating answers.
18+
* text-embedding-004: Google embedding model that produces the 768-dimensional vectors used for semantic search.
19+
* FAISS: Local vector similarity search library, used here as the on-disk vector index.
20+
* Portalocker: Provides file-level locking to maintain index stability during parallel indexing.
21+
22+
## Setup and Installation
23+
24+
1. Initialize Environment:
25+
```bash
26+
python -m venv venv
27+
source venv/bin/activate
28+
pip install langchain-google-genai langchain-community faiss-cpu pypdf langchain-classic pandas openpyxl python-docx portalocker
29+
```
30+
31+
2. Configure API Credentials:
32+
```bash
33+
export GOOGLE_API_KEY="your_api_key_here"
34+
```
35+
36+
3. Start the monitoring and search services as described in the Basic Usage section below.
37+
38+
## Basic Usage
39+
40+
To index and search your documents:
41+
1. Place your desired documents (PDF, Word, etc.) into the docs/ directory.
42+
2. Make sure FolderWatcher is running so that new files are indexed automatically.
43+
3. Run ask.py to start the terminal-based query interface and ask questions about your documents.

docs/lorem_ipsum_template.pdf

1.37 KB
Binary file not shown.
