|
1 | 1 | # Development Guide |
2 | 2 |
|
3 | | -## Project Goals |
4 | | -Building a local, real-time RAG system for personal documents. |
| 3 | +## Project Structure |
| 4 | +* **FolderWatcher.java**: Java listener that watches the filesystem for changes and runs the Python backend as a subprocess whenever a change is detected.
| 5 | +* **ingest.py**: The primary ingestion engine. Handles document loading, text splitting, embedding generation, and FAISS index management. |
| 6 | +* **ask.py**: The user interface. Implements the RAG chain using the vector store and the Gemini LLM. |
| 7 | +* **docs/**: The designated folder for document monitoring. |
| 8 | +* **faiss_index/**: Local persistence directory for the vectorized document mappings. |
5 | 9 |
|
6 | | -## Architecture |
7 | | -- Vector Store: FAISS (chosen for local persistence) |
8 | | -- Embeddings: Google Gemini |
| 10 | +## Technical Architecture |
9 | 11 |
|
10 | | -## Retrieval Strategy |
11 | | -Using k=3 for nearest neighbor search to balance context and token usage. |
| 12 | +### 1. File Monitoring (Java) |
| 13 | +We use Java's `WatchService` (part of the `java.nio.file` package) because it relies on native OS file-change notifications where the platform provides them. This avoids repeatedly rescanning the directory from Python, a cost that grows with the number of files being watched.
12 | 14 |
|
13 | | -## Multi-format Support |
14 | | -Added loaders for Markdown and Plain Text to increase versatility. |
| 15 | +### 2. Semantic Search (FAISS) |
| 16 | +FAISS (Facebook AI Similarity Search) was selected for its speed and simplicity. It stores vectors locally without requiring a dedicated database server or a hosted service such as Pinecone, which keeps the system entirely portable and private.
15 | 17 |
|
16 | | -## Concurrency and Locking |
17 | | -Using 'portalocker' to ensure index integrity when multiple files are added at once. |
| 18 | +### 3. AI Pipeline (Gemini) |
| 19 | +- **Embedding Model**: `text-embedding-004` generates dense 768-dimensional vectors. |
| 20 | +- **LLM**: `gemini-3-flash-preview` is used for the generation stage. Its fast inference and large context window suit RAG applications, where the retrieved chunks are passed to the model alongside the question.
18 | 21 |
|
19 | | -## Performance Optimization |
20 | | -- Added 'RecursiveCharacterTextSplitter' to ensure optimal chunk sizes for embedding quality. |
21 | | -- Updated LLM to 'gemini-3-flash-preview' for significantly lower latency in the CLI. |
| 22 | +## Key Implementation Challenges |
| 23 | + |
| 24 | +### Concurrency |
| 25 | +During mass document imports, multiple instances of the ingestion script may attempt to write to the FAISS index simultaneously. We implemented file-level locking using the `portalocker` library to prevent index corruption. |
| 26 | + |
| 27 | +### Rate Limiting |
| 28 | +The Google AI free tier enforces requests-per-minute limits. The ingestion script retries rate-limited requests with exponential backoff, so that large batches of documents are eventually indexed successfully.
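
The retry behaviour described above could look roughly like the following. This is a minimal sketch, not the code in `ingest.py`: the helper name `retry_with_backoff`, its parameters, and the default delays are all assumptions chosen for illustration, and a real version would catch only the rate-limit error raised by the API client rather than every exception.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Run call() and retry on failure, doubling the wait after each attempt.

    Hypothetical helper: the name, defaults, and the `sleep` hook (useful for
    testing without real waits) are illustrative, not taken from ingest.py.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Waits grow as 1s, 2s, 4s, ... and are capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter keeps parallel workers from
            # retrying at exactly the same moment.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would wrap the embedding request in a zero-argument function, for example `retry_with_backoff(lambda: embed(batch))`, where `embed` stands in for whatever function issues the API request.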
| 29 | + |
| 30 | +### Document Fragmentation |
| 31 | +To improve search accuracy, we use `RecursiveCharacterTextSplitter`. This breaks documents into overlapping chunks (2000 characters with a 200-character overlap), so that semantic context isn't lost at the boundaries of a split.
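
To show what the overlap buys, here is a simplified fixed-window splitter using the same sizes. It is a stand-in for illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this sketch does not attempt, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=2000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration: unlike RecursiveCharacterTextSplitter, this
    cuts at fixed offsets and ignores paragraph or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the default sizes, the last 200 characters of each chunk reappear as the first 200 characters of the next, so a sentence that straddles a boundary is still intact in at least one chunk when it is shorter than the overlap.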
| 32 | + |
| 33 | +## Future Roadmap |
| 34 | +- Integration of OCR for scanned PDF support. |
| 35 | +- Web-based dashboard for visual document management. |
| 36 | +- Support for remote cloud storage listeners (S3/GCS). |