Add Vector Store Service for GitHub Issue Similarity Search

## Overview

Implement a new microservice that provides semantic search over GitHub issues using vector embeddings. This service will enable users to find similar issues and their associated labels, improving issue triaging and duplicate detection workflows.

## Motivation

Currently, the application lacks the ability to semantically search through historical GitHub issues. By implementing a vector store service with retrieval and reranking capabilities, users will be able to:

- Find similar issues to avoid creating duplicates
- Discover relevant labels based on similar historical issues
- Improve issue classification accuracy
- Leverage historical issue data for better triaging decisions

## Requirements

### Functional Requirements

1. **Vector Storage**: Persist GitHub issue embeddings with metadata (title, body, labels, state, etc.)
2. **Semantic Search**: Accept natural language queries and return semantically similar issues
3. **Reranking**: Refine initial retrieval results using a cross-encoder model for improved relevance
4. **Label Discovery**: Return labels from similar issues to assist with issue classification
5. **Batch Indexing**: Support indexing multiple issues efficiently
6. **Data Persistence**: Maintain vector store data across container restarts
7. **CSV Data Loading**: Load and index issues from CSV files in the `data/` folder on initialization

### Non-Functional Requirements

1. **Performance**: 
   - Search queries should complete in < 2 seconds for collections up to 50,000 issues
   - Support concurrent requests from multiple users
2. **Scalability**: Handle at least 10,000 GitHub issues initially, with room to scale to the 50,000-issue collections covered by the latency target above
3. **Integration**: Follow existing service architecture patterns (FastAPI, Docker, environment-based configuration)
4. **Resource Efficiency**: Reasonable memory footprint suitable for containerized deployment

## Technical Specification

### Data Source

The service must load existing issue data from CSV files located in the `data/` folder. These CSV files have the following schema:
```csv
title,body,label,url
```

**Field Descriptions:**
- `title`: Issue title (string)
- `body`: Issue description/body text (string)
- `label`: Issue label(s) - may contain multiple comma-separated labels or a single label (string)
- `url`: GitHub URL of the issue (string)

**Requirements:**
- All CSV files in `data/` should be loaded during container initialization
- The service should handle multiple CSV files and merge them into a single collection
- Parsing should handle multi-line fields and escaped quotes properly
- Empty or malformed rows should be logged and skipped gracefully
- Progress should be logged during the indexing process
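
The parsing requirements above can be illustrated with the standard-library `csv` module. This is a minimal sketch rather than the service's actual loader: the function name `iter_valid_rows` and the rule that a row needs a non-empty `title` and `url` are assumptions made for the example.

```python
import csv
import logging
from typing import Iterator, TextIO

logger = logging.getLogger(__name__)

EXPECTED_COLUMNS = ["title", "body", "label", "url"]


def iter_valid_rows(handle: TextIO, source: str = "<csv>") -> Iterator[dict]:
    """Yield well-formed rows and log the ones that are skipped."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(
            f"{source}: expected header {EXPECTED_COLUMNS}, got {reader.fieldnames}"
        )
    for row in reader:
        # DictReader stores surplus cells under the None key and fills
        # missing cells with None; both indicate a malformed row.
        malformed = None in row or any(value is None for value in row.values())
        if malformed or not row["title"].strip() or not row["url"].strip():
            logger.warning(
                "%s line %d: skipping malformed or empty row", source, reader.line_num
            )
            continue
        yield row
```

The `csv` reader handles quoted multi-line fields and doubled quotes (`""`) on its own. When reading real files, open them with `newline=""` so that embedded line breaks reach the reader unchanged.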

### Service Architecture

The service should follow the existing microservice pattern with:

- **Vector Database**: ChromaDB for vector storage and similarity search
- **Embedding Model**: Sentence-transformers for generating embeddings (e.g., `all-mpnet-base-v2`)
- **Reranker Model**: Cross-encoder for result refinement (e.g., `ms-marco-MiniLM-L-6-v2`)
- **API Framework**: FastAPI with Pydantic models for request/response validation
- **Containerization**: Docker with dedicated Dockerfile following project conventions

### Docker Compose Integration

Add a new service to `docker-compose.yml`:
```yaml
vector-store:
  build:
    context: .
    dockerfile: dockerfiles/dockerfile.vectorstore
  container_name: vector-store-service
  ports:
    - "${VECTOR_STORE_PORT}:8001"
  environment:
    - HOST=${VECTOR_STORE_HOST}
    - PORT=8001  # fixed container port; VECTOR_STORE_PORT only changes the host side of the mapping
    - CHROMA_DATA_PATH=/data/chroma
    - DATA_DIR=/app/data
  volumes:
    - vector_store_data:/data/chroma
    - ./data:/app/data:ro  # Mount data directory as read-only
  restart: unless-stopped
```

Update the `app` service to include:
```yaml
environment:
  - VECTOR_STORE_BASE_URL=${DOCKER_VECTOR_STORE_BASE_URL}
depends_on:
  vector-store:
    condition: service_started
```

Add volume definition:
```yaml
volumes:
  vector_store_data:
```

### API Endpoints

Implement the following REST endpoints:

#### 1. Health Check
```
GET /health
```
Returns service status and collection statistics.

#### 2. Index Issues
```
POST /index
Content-Type: application/json

{
  "issues": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "open|closed",
      "created_at": "ISO8601 timestamp",
      "metadata": {
        "number": int,
        "url": "string",
        "user": "string",
        "comments": int
      }
    }
  ]
}
```

#### 3. Search Similar Issues
```
POST /search
Content-Type: application/json

{
  "query": "string",
  "top_k": 10,
  "rerank": true,
  "rerank_top_k": 5,
  "filter_labels": ["optional", "label", "filter"]
}

Response:
{
  "results": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "string",
      "score": float,
      "metadata": {}
    }
  ],
  "query": "string",
  "total_results": int
}
```
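
The request body above implies a few validation rules. The sketch below expresses them with a standard-library dataclass so it runs without dependencies; in the service itself these would be Pydantic models, as the architecture section states. The specific constraints (non-empty query, `rerank_top_k` no larger than `top_k`) and the `result_limit` helper are assumptions, not part of the spec.

```python
from dataclasses import dataclass, field


@dataclass
class SearchRequest:
    query: str
    top_k: int = 10
    rerank: bool = True
    rerank_top_k: int = 5
    filter_labels: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed rules: the spec does not state these limits explicitly.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.rerank and not 1 <= self.rerank_top_k <= self.top_k:
            raise ValueError("rerank_top_k must be between 1 and top_k")

    @property
    def result_limit(self) -> int:
        """Number of results the response should contain."""
        return self.rerank_top_k if self.rerank else self.top_k
```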

#### 4. Get Issue by ID
```
GET /issue/{issue_id}
```

#### 5. Clear Collection (Admin)
```
DELETE /collection
```

#### 6. Reindex from CSV Files
```
POST /reindex
```
Clears the collection and reloads all CSV files from the data directory. Useful for updating the index when CSV files change.

### File Structure

Create the following directory structure:
```
services/vector_store/
├── vector_store_api.py          # Main FastAPI application
├── requirements.txt             # Python dependencies
├── load_csv_data.py             # CSV data loading utility
└── index_github_issues.py       # Utility script for batch indexing from GitHub API

dockerfiles/
└── dockerfile.vectorstore       # Container definition

data/
├── issues_1.csv                 # Issue data files (example)
├── issues_2.csv
└── ...
```


### Environment Variables

Add to `.env`:
```env
VECTOR_STORE_PORT=8001
VECTOR_STORE_HOST=0.0.0.0
DOCKER_VECTOR_STORE_BASE_URL=http://vector-store:8001
```
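
Inside the container, the service sees `HOST`, `PORT`, `CHROMA_DATA_PATH`, and `DATA_DIR` (the names set in the Compose service definition). One way to read them is sketched below; the `Settings` class, the `load_settings` function, and the fallback defaults are illustrative assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    chroma_data_path: str
    data_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read service settings from the environment, with fallback defaults."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8001")),
        chroma_data_path=env.get("CHROMA_DATA_PATH", "/data/chroma"),
        data_dir=env.get("DATA_DIR", "/app/data"),
    )
```

Passing the environment in as a mapping keeps the function easy to test without touching `os.environ`.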

## Implementation Details

### CSV Data Loading

Create a `load_csv_data.py` module that:

1. **Discovers CSV Files**: Scans the `data/` directory for all `.csv` files
2. **Parses CSV Data**: Uses pandas to read CSV files with proper handling of:
   - Multi-line fields
   - Quoted strings
   - Various encodings (UTF-8, latin-1, etc.)
   - Missing values
3. **Generates Issue IDs**: Creates unique IDs from URLs or generates UUIDs if needed
4. **Handles Labels**: Parses label fields which may contain:
   - Single labels: `"bug"`
   - Multiple comma-separated labels: `"bug,memory"`
5. **Batches Indexing**: Processes issues in batches (e.g., 100 at a time) for memory efficiency
6. **Logs Progress**: Reports loading progress and any errors encountered

Example implementation structure:
```python
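import glob
import hashlib
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def load_issues_from_csv(data_dir: str) -> list[dict]:
    """Load all issues from CSV files in data directory."""
    all_issues = []
    for csv_file in sorted(glob.glob(f"{data_dir}/*.csv")):  # sorted for a stable load order
        try:
            df = pd.read_csv(csv_file, encoding="utf-8", dtype=str, keep_default_na=False)
        except UnicodeDecodeError:
            # latin-1 decodes any byte sequence, so it works as a last resort.
            df = pd.read_csv(csv_file, encoding="latin-1", dtype=str, keep_default_na=False)
        if not {"title", "body", "label", "url"} <= set(df.columns):
            logger.error("Skipping %s: unexpected columns %s", csv_file, list(df.columns))
            continue
        # Rows without a title or URL cannot be indexed or identified.
        valid = df[(df["title"].str.strip() != "") & (df["url"].str.strip() != "")]
        logger.info("%s: kept %d of %d rows", csv_file, len(valid), len(df))
        for row in valid.to_dict("records"):
            all_issues.append({
                "id": hashlib.md5(row["url"].encode()).hexdigest()[:16],
                "title": row["title"],
                "body": row["body"],
                "labels": [l.strip() for l in row["label"].split(",") if l.strip()],
                "metadata": {"url": row["url"]},
            })
    return all_issues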
```
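
Label handling and batching (steps 4 and 5 above) are small enough to sketch on their own. The helper names `parse_labels` and `batched` are hypothetical, and the sketch assumes that label names never contain commas themselves.

```python
from typing import Iterable, Iterator, Optional


def parse_labels(raw: Optional[str]) -> list[str]:
    """Split a label cell into a de-duplicated list, preserving order."""
    if not raw:
        return []
    labels: list[str] = []
    for part in raw.split(","):
        label = part.strip()
        if label and label not in labels:
            labels.append(label)
    return labels


def batched(items: Iterable[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `size` items."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, chunk.
        yield batch
```

Because `batched` is a generator, only one batch is held in memory at a time, which is the point of step 5.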

### Startup Behavior

On service startup, the application should:

1. Initialize embedding and reranker models
2. Connect to ChromaDB
3. Check if collection is empty
4. If empty, automatically load and index all CSV files from `data/`
5. If not empty, skip automatic loading (use `/reindex` endpoint if refresh needed)
6. Log summary statistics (total issues indexed, time taken, etc.)
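
The decision logic in steps 3 to 6 could look roughly like the sketch below. The `IssueStore` protocol and `InMemoryStore` are stand-ins invented for the example so the logic can run without ChromaDB; the real service would wrap its collection behind a similar interface. The `force` flag is an assumed way to share the code path with `/reindex`.

```python
import logging
import time
from typing import Callable, Protocol

logger = logging.getLogger(__name__)


class IssueStore(Protocol):
    def count(self) -> int: ...
    def clear(self) -> None: ...
    def add(self, issues: list[dict]) -> None: ...


def ensure_indexed(
    store: IssueStore, load_issues: Callable[[], list[dict]], force: bool = False
) -> int:
    """Index CSV issues if the store is empty, or always when force is set."""
    if store.count() > 0 and not force:
        logger.info("Collection has %d issues; skipping CSV load", store.count())
        return 0
    if force:
        store.clear()
    started = time.perf_counter()
    issues = load_issues()
    if issues:
        store.add(issues)
    logger.info("Indexed %d issues in %.2fs", len(issues), time.perf_counter() - started)
    return len(issues)


class InMemoryStore:
    """Stand-in for the vector store, used only to exercise the logic."""

    def __init__(self) -> None:
        self.issues: dict[str, dict] = {}

    def count(self) -> int:
        return len(self.issues)

    def clear(self) -> None:
        self.issues.clear()

    def add(self, issues: list[dict]) -> None:
        self.issues.update({issue["id"]: issue for issue in issues})
```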

### Retrieval Strategy

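1. **Initial Retrieval**: Use sentence-transformers to encode the query and retrieve the top-k candidates by cosine similarity. ChromaDB collections default to squared L2 distance, so either create the collection with `hnsw:space` set to `cosine` or normalize the embeddings, in which case L2 and cosine distance produce the same ranking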
2. **Reranking** (optional): Apply cross-encoder model to rerank candidates for improved relevance
3. **Filtering**: Support filtering results by specific labels if requested
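
To make the three stages concrete, here is a dependency-free sketch that does brute-force cosine retrieval over an in-memory list. It is not how the service would run in practice (ChromaDB performs the nearest-neighbour search); the `search` function, the `embedding` key on each issue, and the injected `embed` and `rerank_score` callables are all assumptions for illustration.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    issues: list[dict],
    embed: Callable[[str], Vector],
    top_k: int = 10,
    rerank_score: Optional[Callable[[str, str], float]] = None,
    rerank_top_k: int = 5,
    filter_labels: Optional[list[str]] = None,
) -> list[dict]:
    """Filter by label, retrieve top_k by cosine similarity, then optionally rerank."""
    wanted = set(filter_labels or [])
    # Filter first so top_k is not used up by issues that would be dropped later.
    candidates = [i for i in issues if not wanted or wanted & set(i["labels"])]
    query_vec = embed(query)
    scored = [
        {**i, "score": cosine_similarity(query_vec, i["embedding"])} for i in candidates
    ]
    scored.sort(key=lambda i: i["score"], reverse=True)
    top = scored[:top_k]
    if rerank_score is None:
        return top
    # A cross-encoder scores each (query, document) pair jointly.
    reranked = [
        {**i, "score": rerank_score(query, f"{i['title']}\n{i['body']}")} for i in top
    ]
    reranked.sort(key=lambda i: i["score"], reverse=True)
    return reranked[:rerank_top_k]
```

Applying the label filter before retrieval is a design choice in this sketch: filtering afterwards can return fewer than `top_k` results even when enough matching issues exist.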

### Data Indexing

#### From CSV Files
The primary data source is CSV files in the `data/` folder. These should be loaded automatically on first run.

#### From GitHub API (Optional)
Provide a utility script (`index_github_issues.py`) that:
- Fetches issues from GitHub's REST API
- Handles pagination and rate limiting
- Filters out pull requests
- Batches requests to the vector store API
- Supports authentication via GitHub tokens

Example usage:
```bash
python services/vector_store/index_github_issues.py \
  --owner pytorch \
  --repo pytorch \
  --token $GITHUB_TOKEN \
  --max-issues 5000
```
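
GitHub's issues endpoint returns pull requests alongside issues, and marks them with a `pull_request` key. A sketch of the filtering and conversion step follows; `to_index_payload` is a hypothetical helper, and deriving `id` from the issue URL hash is an assumption chosen so that API-loaded and CSV-loaded issues would share the same ID scheme.

```python
import hashlib


def to_index_payload(api_items: list[dict]) -> list[dict]:
    """Convert GitHub REST API issue objects into /index issue entries."""
    issues = []
    for item in api_items:
        # Pull requests carry a "pull_request" key; plain issues do not.
        if "pull_request" in item:
            continue
        url = item["html_url"]
        issues.append({
            "id": hashlib.md5(url.encode()).hexdigest()[:16],
            "title": item["title"],
            "body": item.get("body") or "",  # body is null when no description was given
            "labels": [label["name"] for label in item.get("labels", [])],
            "state": item["state"],
            "created_at": item["created_at"],
            "metadata": {
                "number": item["number"],
                "url": url,
                "user": (item.get("user") or {}).get("login", ""),
                "comments": item["comments"],
            },
        })
    return issues
```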

### Model Selection Rationale

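- **Embedding Model**: `all-mpnet-base-v2` is a general-purpose sentence-transformers model that produces 768-dimensional embeddings with reasonable computational requirements. Input beyond its maximum sequence length (384 word pieces) is truncated, so long issue bodies are only partially embedded
- **Reranker**: `ms-marco-MiniLM-L-6-v2` is a 6-layer MiniLM cross-encoder trained on MS MARCO passage ranking. Because it scores only the top-k retrieved candidates, its latency cost grows with `top_k` rather than with collection size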

Alternative models can be configured if needed for specific domains or languages.

### ID Generation Strategy

Since CSV files may not include explicit IDs, implement one of these strategies:

1. **Hash-based IDs**: Generate deterministic IDs from URL: `hashlib.md5(url.encode()).hexdigest()[:16]`
2. **URL-based IDs**: Extract issue number from GitHub URL
3. **UUID-based IDs**: Generate random UUIDs (not recommended as they're not reproducible)

**Recommendation**: Use hash-based IDs from URLs for deterministic, reproducible indexing.
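
The first two strategies can be sketched as follows. The function names `hash_id` and `url_id` are hypothetical, and including the owner and repository in the URL-based ID is an assumption made so that issue numbers from different repositories do not collide.

```python
import hashlib
import re
from typing import Optional

ISSUE_URL = re.compile(r"github\.com/([^/]+)/([^/]+)/issues/(\d+)")


def hash_id(url: str) -> str:
    """Deterministic 16-hex-character ID derived from the issue URL."""
    return hashlib.md5(url.encode()).hexdigest()[:16]


def url_id(url: str) -> Optional[str]:
    """Return owner/repo#number, or None when the URL is not a GitHub issue URL."""
    match = ISSUE_URL.search(url)
    return f"{match[1]}/{match[2]}#{match[3]}" if match else None
```

Truncating the digest to 16 hex characters leaves 64 bits, which makes accidental collisions negligible at the collection sizes targeted here. MD5 is used only as a fingerprint, not for security.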

## Acceptance Criteria

- [ ] Docker service builds successfully and starts without errors
- [ ] Service integrates with existing docker-compose infrastructure
- [ ] **All CSV files in `data/` folder are automatically loaded and indexed on first startup**
- [ ] **CSV parsing correctly handles the schema: title, body, label, url**
- [ ] **Progress is logged during CSV loading with statistics (files processed, issues indexed, errors encountered)**
- [ ] **Empty or malformed CSV rows are skipped gracefully with appropriate logging**
- [ ] All API endpoints return correct responses with proper error handling
- [ ] Vector embeddings are persisted across container restarts
- [ ] Search returns semantically relevant results ranked by similarity
- [ ] Documentation includes usage examples for indexing and searching
- [ ] Models are downloaded at build time to avoid runtime delays
- [ ] Health check endpoint provides useful diagnostic information including number of indexed issues
- [ ] `/reindex` endpoint successfully clears and reloads data from CSV files

## Testing Recommendations

1. **Unit Tests**: 
   - Test CSV parsing with various edge cases (multi-line, quotes, empty fields)
   - Test embedding generation, search logic, and API validation
   - Test label parsing for single and multiple labels
2. **Integration Tests**: 
   - Test with docker-compose, verify cross-service communication
   - Test with sample CSV files in `data/` folder
   - Verify data persistence across container restarts
3. **Performance Tests**: 
   - Measure search latency with varying collection sizes
   - Measure CSV loading time for large datasets
4. **Quality Tests**: 
   - Evaluate search relevance on a curated set of test queries
   - Verify that labels from similar issues are relevant
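
For the performance tests, a small timing harness may be enough to check the latency target. The sketch below is one possible shape: `measure_latency` is a hypothetical helper, and `search` stands for any callable that issues one query (for example a function that posts to `/search`).

```python
import math
import statistics
import time
from typing import Callable, Sequence


def measure_latency(
    search: Callable[[str], object], queries: Sequence[str], repeats: int = 3
) -> dict:
    """Time each query `repeats` times and summarize the latencies in seconds."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            started = time.perf_counter()
            search(query)
            samples.append(time.perf_counter() - started)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }
```

Comparing the `p95` value against the 2-second requirement at several collection sizes gives a more robust signal than a single timed request.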

## Sample CSV Data for Testing

Create a `data/test_issues.csv` file with sample data:
```csv
title,body,label,url
"Memory leak in training loop","When training for extended periods, memory usage continuously increases without being released. This eventually leads to OOM errors.","bug,memory","https://github.com/example/repo/issues/123"
"Add support for distributed training","It would be great to have built-in support for multi-GPU and multi-node training using PyTorch DDP.","enhancement,feature-request","https://github.com/example/repo/issues/124"
"Documentation: Getting started guide incomplete","The getting started guide is missing information about installing dependencies on Windows.","documentation,good-first-issue","https://github.com/example/repo/issues/125"
```

## UI Integration Notes

The UI service should integrate this functionality by:

1. Adding a "Find Similar Issues" button/feature in the issue creation/viewing interface
2. Displaying similar issues with their labels when a user describes a new issue
3. Suggesting relevant labels based on the labels of similar historical issues
4. Providing a confidence score or relevance indicator for each similar issue

Example integration code:
```python
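import os

import requests

# Set for the app service in docker-compose.yml
VECTOR_STORE_BASE_URL = os.environ["VECTOR_STORE_BASE_URL"]

def find_similar_issues(issue_text: str, top_k: int = 5) -> list[dict]:
    """Return the top_k most similar issues after reranking."""
    response = requests.post(
        f"{VECTOR_STORE_BASE_URL}/search",
        json={
            "query": issue_text,
            "top_k": max(10, top_k),  # retrieve at least as many candidates as are returned
            "rerank": True,
            "rerank_top_k": top_k,
        },
        timeout=10,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["results"]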
```
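
Label suggestion (item 3 above) could be built on top of the search results with a simple score-weighted vote. This is a sketch under assumptions: `suggest_labels` is a hypothetical helper, and it presumes that `score` is non-negative and higher for better matches. If the reranker returns unbounded scores, they would need rescaling first.

```python
from collections import defaultdict


def suggest_labels(results: list[dict], max_labels: int = 3) -> list[tuple[str, float]]:
    """Rank labels by the summed scores of the similar issues that carry them."""
    totals: dict[str, float] = defaultdict(float)
    for issue in results:
        for label in issue.get("labels", []):
            totals[label] += issue["score"]
    # Highest total first; ties are broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:max_labels]
```

The summed score can double as the relevance indicator mentioned in item 4, since labels shared by several close matches rise to the top.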

## Future Enhancements (Out of Scope)

- Multi-repository support with separate collections
- Incremental indexing via GitHub webhooks
- CSV file watching for automatic reindexing on file changes
- Advanced filtering (date ranges, author, comment count)
- Multilingual support with language-specific models
- A/B testing different embedding models
- Query expansion and reformulation
- Analytics dashboard for search quality metrics
- Export indexed data back to CSV format

## Resources

- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Sentence Transformers Documentation](https://www.sbert.net/)
- [GitHub REST API - Issues](https://docs.github.com/en/rest/issues/issues)
- [Pandas CSV Reading](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

## Estimated Effort

- **Core Implementation**: 8-12 hours
- **CSV Loading Implementation**: 3-4 hours
- **Testing & Documentation**: 3-4 hours
- **Integration with UI**: 2-3 hours
- **Total**: ~16-23 hours

## Priority

Medium-High - This feature significantly improves issue management workflows and leverages existing historical data.

## Notes for Implementation

- The `data/` directory should be mounted as read-only in the container for security
- Consider implementing a checksum or timestamp mechanism to detect when CSV files have been updated
- Handle the case where the `data/` directory is empty or contains no CSV files gracefully
- Provide clear error messages if CSV files don't match the expected schema
- Consider adding a configuration option to disable automatic CSV loading on startup (for development or when using only the API)
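
The checksum idea in the second note could be sketched as a fingerprint over all CSV files. `data_fingerprint` is a hypothetical helper; where the last fingerprint is stored is left open, although it would have to be somewhere writable such as the ChromaDB volume, because `data/` is mounted read-only.

```python
import hashlib
from pathlib import Path


def data_fingerprint(data_dir: str) -> str:
    """SHA-256 over the names and contents of all CSV files in data_dir."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).glob("*.csv")):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator so name and content cannot run together
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to keep memory use flat for large files.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()
```

An empty directory still yields a valid fingerprint, so the same check covers the case where no CSV files are present. Comparing the stored value with the current one at startup would indicate whether a reindex is needed.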

Add Vector Store Service for GitHub Issue Similarity Search #6