Overview
Implement a new microservice that provides semantic search over GitHub issues using vector embeddings. This service will enable users to find similar issues and their associated labels, improving issue triaging and duplicate detection workflows.
Motivation
Currently, the application lacks the ability to semantically search through historical GitHub issues. By implementing a vector store service with retrieval and reranking capabilities, users will be able to:
- Find similar issues to avoid creating duplicates
- Discover relevant labels based on similar historical issues
- Improve issue classification accuracy
- Leverage historical issue data for better triaging decisions
Requirements
Functional Requirements
- Vector Storage: Persist GitHub issue embeddings with metadata (title, body, labels, state, etc.)
- Semantic Search: Accept natural language queries and return semantically similar issues
- Reranking: Refine initial retrieval results using a cross-encoder model for improved relevance
- Label Discovery: Return labels from similar issues to assist with issue classification
- Batch Indexing: Support indexing multiple issues efficiently
- Data Persistence: Maintain vector store data across container restarts
- CSV Data Loading: Load and index issues from CSV files in the data/ folder on initialization
Non-Functional Requirements
- Performance:
  - Search queries should complete in < 2 seconds for collections up to 50,000 issues
  - Support concurrent requests from multiple users
- Scalability: Handle at least 10,000 GitHub issues initially, with room to scale
- Integration: Follow existing service architecture patterns (FastAPI, Docker, environment-based configuration)
- Resource Efficiency: Reasonable memory footprint suitable for containerized deployment
Technical Specification
Data Source
The service must load existing issue data from CSV files located in the data/ folder. These CSV files have the following schema:
title,body,label,url
Field Descriptions:
- title: Issue title (string)
- body: Issue description/body text (string)
- label: Issue label(s), either a single label or multiple comma-separated labels (string)
- url: GitHub URL of the issue (string)
Requirements:
- All CSV files in data/ should be loaded during container initialization
- The service should handle multiple CSV files and merge them into a single collection
- Parsing should handle multi-line fields and escaped quotes properly
- Empty or malformed rows should be logged and skipped gracefully
- Progress should be logged during the indexing process
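The parsing requirements above can be illustrated with a minimal sketch. It uses Python's standard csv module as a stand-in for pandas (which the spec calls for elsewhere) because it handles quoted multi-line fields and doubled quotes without extra options. The function name parse_issue_rows and the rule that a row needs both a title and a URL are assumptions made for illustration.

```python
import csv
import io
import logging

logger = logging.getLogger("csv_loader")

REQUIRED_COLUMNS = ("title", "body", "label", "url")


def parse_issue_rows(csv_text: str) -> list[dict]:
    """Parse CSV text, skipping rows that lack a title or URL (assumed rule)."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")

    rows = []
    for row in reader:
        # Short rows yield None for absent fields, so normalize to "".
        title = (row.get("title") or "").strip()
        url = (row.get("url") or "").strip()
        if not title or not url:
            logger.warning("Skipping malformed row near line %d", reader.line_num)
            continue
        rows.append({
            "title": title,
            "body": row.get("body") or "",
            "label": row.get("label") or "",
            "url": url,
        })
    return rows
```

Raising on a missing column while only logging a bad row keeps schema problems loud and individual data problems non-fatal, which matches the "logged and skipped gracefully" requirement.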
Service Architecture
The service should follow the existing microservice pattern with:
- Vector Database: ChromaDB for vector storage and similarity search
- Embedding Model: Sentence-transformers for generating embeddings (e.g., all-mpnet-base-v2)
- Reranker Model: Cross-encoder for result refinement (e.g., ms-marco-MiniLM-L-6-v2)
- API Framework: FastAPI with Pydantic models for request/response validation
- Containerization: Docker with dedicated Dockerfile following project conventions
Docker Compose Integration
Add a new service to docker-compose.yml:
vector-store:
  build:
    context: .
    dockerfile: dockerfiles/dockerfile.vectorstore
  container_name: vector-store-service
  ports:
    - "${VECTOR_STORE_PORT}:8001"
  environment:
    - HOST=${VECTOR_STORE_HOST}
    - PORT=${VECTOR_STORE_PORT}
    - CHROMA_DATA_PATH=/data/chroma
    - DATA_DIR=/app/data
  volumes:
    - vector_store_data:/data/chroma
    - ./data:/app/data:ro  # Mount data directory as read-only
  restart: unless-stopped
Update the app service to include:
environment:
  - VECTOR_STORE_BASE_URL=${DOCKER_VECTOR_STORE_BASE_URL}
depends_on:
  vector-store:
    condition: service_started
Add volume definition:
volumes:
  vector_store_data:
API Endpoints
Implement the following REST endpoints:
1. Health Check
Returns service status and collection statistics.
2. Index Issues
POST /index
Content-Type: application/json
{
  "issues": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "open|closed",
      "created_at": "ISO8601 timestamp",
      "metadata": {
        "number": int,
        "url": "string",
        "user": "string",
        "comments": int
      }
    }
  ]
}
3. Search Similar Issues
POST /search
Content-Type: application/json
{
  "query": "string",
  "top_k": 10,
  "rerank": true,
  "rerank_top_k": 5,
  "filter_labels": ["optional", "label", "filter"]
}
Response:
{
  "results": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "string",
      "score": float,
      "metadata": {}
    }
  ],
  "query": "string",
  "total_results": int
}
4. Get Issue by ID
5. Clear Collection (Admin)
6. Reindex from CSV Files
Clears the collection and reloads all CSV files from the data directory. Useful for updating the index when CSV files change.
File Structure
Create the following directory structure:
services/vector_store/
├── vector_store_api.py # Main FastAPI application
├── requirements.txt # Python dependencies
├── load_csv_data.py # CSV data loading utility
└── index_github_issues.py # Utility script for batch indexing from GitHub API
dockerfiles/
└── dockerfile.vectorstore # Container definition
data/
├── issues_1.csv # Issue data files (example)
├── issues_2.csv
└── ...
Environment Variables
Add to .env:
VECTOR_STORE_PORT=8001
VECTOR_STORE_HOST=0.0.0.0
DOCKER_VECTOR_STORE_BASE_URL=http://vector-store:8001
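Inside the container, the service sees the HOST, PORT, CHROMA_DATA_PATH and DATA_DIR variables set in the Compose snippet. One possible way to read them is sketched below. The Settings class and load_settings function are hypothetical names, and the fallback defaults simply mirror the values shown in this document.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    chroma_data_path: str
    data_dir: str


def load_settings(env=os.environ) -> Settings:
    """Read service configuration, falling back to the values used in this spec."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8001")),
        chroma_data_path=env.get("CHROMA_DATA_PATH", "/data/chroma"),
        data_dir=env.get("DATA_DIR", "/app/data"),
    )
```

Passing the environment as a parameter keeps the function easy to unit test with a plain dict.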
Implementation Details
CSV Data Loading
Create a load_csv_data.py module that:
- Discovers CSV Files: Scans the data/ directory for all .csv files
- Parses CSV Data: Uses pandas to read CSV files with proper handling of:
  - Multi-line fields
  - Quoted strings
  - Various encodings (UTF-8, latin-1, etc.)
  - Missing values
- Generates Issue IDs: Creates unique IDs from URLs or generates UUIDs if needed
- Handles Labels: Parses label fields, which may contain a single label (e.g., "bug") or multiple comma-separated labels (e.g., "bug,memory")
- Batches Indexing: Processes issues in batches (e.g., 100 at a time) for memory efficiency
- Logs Progress: Reports loading progress and any errors encountered
Example implementation structure:
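import hashlib
from pathlib import Path

import pandas as pd


def load_issues_from_csv(data_dir: str) -> list:
    """Load all issues from CSV files in data directory."""
    all_issues = []
    for csv_file in sorted(Path(data_dir).glob("*.csv")):
        # dtype=str and keep_default_na=False keep empty cells as "" instead of NaN
        df = pd.read_csv(csv_file, encoding="utf-8", dtype=str, keep_default_na=False)
        for row in df.to_dict("records"):
            if not row.get("title") or not row.get("url"):
                continue  # skip empty or malformed rows
            all_issues.append({
                # Deterministic ID derived from the issue URL
                "id": hashlib.md5(row["url"].encode()).hexdigest()[:16],
                "title": row["title"],
                "body": row.get("body", ""),
                # "bug,memory" -> ["bug", "memory"]
                "labels": [l.strip() for l in row.get("label", "").split(",") if l.strip()],
                "url": row["url"],
            })
    return all_issues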
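The batching step from the list above could look like the following sketch. The helper name batched is an assumption, and the default of 100 comes from the example figure given in this section.

```python
from typing import Iterator


def batched(items: list, batch_size: int = 100) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch can then be embedded and written to the vector store in turn, with a progress log line per batch, so that only one batch of embeddings is held in memory at a time.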
Startup Behavior
On service startup, the application should:
- Initialize embedding and reranker models
- Connect to ChromaDB
- Check if collection is empty
  - If empty, automatically load and index all CSV files from data/
  - If not empty, skip automatic loading (use the /reindex endpoint if a refresh is needed)
- Log summary statistics (total issues indexed, time taken, etc.)
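The startup decision above can be sketched independently of ChromaDB by passing the storage operations in as callables. Everything here is hypothetical naming: count_indexed, load_issues and index_issues stand in for whatever the real collection and CSV loader expose.

```python
import time
from typing import Callable


def initialize_index(
    count_indexed: Callable[[], int],
    load_issues: Callable[[], list],
    index_issues: Callable[[list], None],
) -> dict:
    """Index CSV data only when the collection is empty; return summary stats."""
    started = time.perf_counter()
    existing = count_indexed()
    if existing > 0:
        # Collection already populated: skip automatic loading.
        return {"loaded": 0, "total": existing, "skipped": True,
                "seconds": time.perf_counter() - started}
    issues = load_issues()
    if issues:
        index_issues(issues)
    return {"loaded": len(issues), "total": count_indexed(), "skipped": False,
            "seconds": time.perf_counter() - started}
```

The returned dictionary carries the figures the last step asks to log (issues indexed and time taken).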
Retrieval Strategy
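- Initial Retrieval: Use sentence-transformers to encode the query and retrieve the top-k candidates by vector similarity. ChromaDB uses L2 distance by default, which produces the same ranking as cosine similarity only when the embeddings are unit-normalized, so either normalize the embeddings or configure the collection to use cosine distance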
- Reranking (optional): Apply cross-encoder model to rerank candidates for improved relevance
- Filtering: Support filtering results by specific labels if requested
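The three steps above can be sketched in pure Python over an in-memory list, with no ChromaDB or model dependency. This is an illustration of the control flow only: rerank_fn is a stand-in for the cross-encoder, the "embedding" key is a hypothetical field, and treating filter_labels as "match any of these labels" is an assumption, since the spec does not say whether any or all labels must match.

```python
import math
from typing import Callable, Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: list[float],
    issues: list[dict],
    top_k: int = 10,
    rerank_fn: Optional[Callable[[dict], float]] = None,
    rerank_top_k: int = 5,
    filter_labels: Optional[list[str]] = None,
) -> list[dict]:
    """Filter by label, retrieve top_k by cosine similarity, optionally rerank."""
    candidates = issues
    if filter_labels:
        wanted = set(filter_labels)
        candidates = [i for i in issues if wanted & set(i["labels"])]

    # Stage 1: vector similarity over the (filtered) candidates.
    scored = sorted(
        ((cosine_similarity(query_vec, i["embedding"]), i) for i in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    if rerank_fn is None:
        return [{**issue, "score": score} for score, issue in scored]

    # Stage 2: rescore only the retrieved candidates, then keep the best few.
    reranked = sorted(
        ((rerank_fn(issue), issue) for _, issue in scored),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [{**issue, "score": score} for score, issue in reranked[:rerank_top_k]]
```

Reranking touches only the top_k retrieved candidates, which is why top_k should be at least as large as rerank_top_k.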
Data Indexing
From CSV Files
The primary data source is CSV files in the data/ folder. These should be loaded automatically on first run.
From GitHub API (Optional)
Provide a utility script (index_github_issues.py) that:
- Fetches issues from GitHub's REST API
- Handles pagination and rate limiting
- Filters out pull requests
- Batches requests to the vector store API
- Supports authentication via GitHub tokens
Example usage:
python services/vector_store/index_github_issues.py \
--owner pytorch \
--repo pytorch \
--token $GITHUB_TOKEN \
--max-issues 5000
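One detail worth illustrating is the pull request filter: GitHub's REST issues endpoint also returns pull requests, which carry a "pull_request" key. The sketch below converts one API item into the issue shape expected by POST /index and drops pull requests. The function name to_index_payload is hypothetical, and deriving the ID from html_url follows the hash-based strategy recommended in this document rather than anything the script is known to do.

```python
import hashlib
from typing import Optional


def to_index_payload(item: dict) -> Optional[dict]:
    """Convert one GitHub REST API issue object to the /index issue schema.

    Returns None for pull requests, which the issues endpoint also returns.
    """
    if "pull_request" in item:
        return None
    return {
        "id": hashlib.md5(item["html_url"].encode()).hexdigest()[:16],
        "title": item["title"],
        "body": item.get("body") or "",  # body is null for issues without a description
        "labels": [label["name"] for label in item.get("labels", [])],
        "state": item["state"],
        "created_at": item["created_at"],
        "metadata": {
            "number": item["number"],
            "url": item["html_url"],
            "user": (item.get("user") or {}).get("login", ""),
            "comments": item.get("comments", 0),
        },
    }
```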
Model Selection Rationale
- Embedding Model: all-mpnet-base-v2 provides excellent semantic understanding with reasonable computational requirements
- Reranker: ms-marco-MiniLM-L-6-v2 offers strong reranking performance with minimal latency impact
Alternative models can be configured if needed for specific domains or languages.
ID Generation Strategy
Since CSV files may not include explicit IDs, implement one of these strategies:
- Hash-based IDs: Generate deterministic IDs from URL: hashlib.md5(url.encode()).hexdigest()[:16]
- URL-based IDs: Extract issue number from GitHub URL
- UUID-based IDs: Generate random UUIDs (not recommended as they're not reproducible)
Recommendation: Use hash-based IDs from URLs for deterministic, reproducible indexing.
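A short sketch of the first two strategies follows. The helper names hash_id and issue_number and the URL pattern are assumptions; the hashing expression is the one given above.

```python
import hashlib
import re
from typing import Optional

# Matches URLs such as https://github.com/<owner>/<repo>/issues/<number>
ISSUE_URL_RE = re.compile(r"github\.com/([^/]+)/([^/]+)/issues/(\d+)")


def hash_id(url: str) -> str:
    """Deterministic 16-character ID derived from the issue URL."""
    return hashlib.md5(url.encode()).hexdigest()[:16]


def issue_number(url: str) -> Optional[int]:
    """Extract the issue number from a GitHub issue URL, or None if absent."""
    match = ISSUE_URL_RE.search(url)
    return int(match.group(3)) if match else None
```

Issue numbers are unique only within a single repository, so the hash of the full URL is the safer key when CSV files from several repositories are merged into one collection.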
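Acceptance Criteria
- CSV files in the data/ folder are automatically loaded and indexed on first startup
- The /reindex endpoint successfully clears and reloads data from CSV files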
Testing Recommendations
- Unit Tests:
  - Test CSV parsing with various edge cases (multi-line, quotes, empty fields)
  - Test embedding generation, search logic, and API validation
  - Test label parsing for single and multiple labels
- Integration Tests:
  - Test with docker-compose, verify cross-service communication
  - Test with sample CSV files in data/ folder
  - Verify data persistence across container restarts
- Performance Tests:
  - Measure search latency with varying collection sizes
  - Measure CSV loading time for large datasets
- Quality Tests:
  - Evaluate search relevance on a curated set of test queries
  - Verify that labels from similar issues are relevant
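For the performance tests, a small timing harness may be enough to check the 2-second target. The sketch below is one possible shape. The name measure_latency is hypothetical, and search_fn is any callable that runs a single query (for example, a wrapper around POST /search).

```python
import statistics
import time
from typing import Callable


def measure_latency(search_fn: Callable[[str], object], queries: list[str]) -> dict:
    """Time each query and report median and worst-case latency in seconds."""
    if not queries:
        raise ValueError("at least one query is required")
    timings = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return {
        "count": len(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }
```

Running it against collections of different sizes gives the latency-versus-size figures the performance tests call for.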
Sample CSV Data for Testing
Create a data/test_issues.csv file with sample data:
title,body,label,url
"Memory leak in training loop","When training for extended periods, memory usage continuously increases without being released. This eventually leads to OOM errors.","bug,memory","https://github.com/example/repo/issues/123"
"Add support for distributed training","It would be great to have built-in support for multi-GPU and multi-node training using PyTorch DDP.","enhancement,feature-request","https://github.com/example/repo/issues/124"
"Documentation: Getting started guide incomplete","The getting started guide is missing information about installing dependencies on Windows.","documentation,good-first-issue","https://github.com/example/repo/issues/125"
UI Integration Notes
The UI service should integrate this functionality by:
- Adding a "Find Similar Issues" button/feature in the issue creation/viewing interface
- Displaying similar issues with their labels when a user describes a new issue
- Suggesting relevant labels based on the labels of similar historical issues
- Providing a confidence score or relevance indicator for each similar issue
Example integration code:
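import os

import requests

# Base URL of the vector store service, e.g. http://vector-store:8001
VECTOR_STORE_BASE_URL = os.environ["VECTOR_STORE_BASE_URL"]


def find_similar_issues(issue_text: str, top_k: int = 5):
    response = requests.post(
        f"{VECTOR_STORE_BASE_URL}/search",
        json={
            "query": issue_text,
            "top_k": max(10, top_k),  # retrieve at least as many candidates as are returned
            "rerank": True,
            "rerank_top_k": top_k,
        },
        timeout=10,
    )
    response.raise_for_status()  # surface HTTP errors before reading the body
    return response.json()["results"]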
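Label suggestion can be built on top of the search results. The sketch below ranks labels by the summed score of the similar issues that carry them. The function name suggest_labels is hypothetical, and the weighting assumes scores are non-negative with higher meaning more relevant, which may not hold for raw cross-encoder outputs.

```python
from collections import defaultdict


def suggest_labels(results: list[dict], max_labels: int = 3) -> list[tuple[str, float]]:
    """Rank labels by the summed score of the similar issues that carry them."""
    totals = defaultdict(float)
    for issue in results:
        for label in issue.get("labels", []):
            totals[label] += float(issue.get("score", 0.0))
    # Highest total first; ties broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_labels]
```

The per-label totals can double as the relevance indicator shown next to each suggested label in the UI.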
Future Enhancements (Out of Scope)
- Multi-repository support with separate collections
- Incremental indexing via GitHub webhooks
- CSV file watching for automatic reindexing on file changes
- Advanced filtering (date ranges, author, comment count)
- Multilingual support with language-specific models
- A/B testing different embedding models
- Query expansion and reformulation
- Analytics dashboard for search quality metrics
- Export indexed data back to CSV format
Resources
Estimated Effort
- Core Implementation: 8-12 hours
- CSV Loading Implementation: 3-4 hours
- Testing & Documentation: 3-4 hours
- Integration with UI: 2-3 hours
- Total: ~16-23 hours
Priority
Medium-High - This feature significantly improves issue management workflows and leverages existing historical data.
Notes for Implementation
- The data/ directory should be mounted as read-only in the container for security
- Consider implementing a checksum or timestamp mechanism to detect when CSV files have been updated
- Handle the case where the data/ directory is empty or contains no CSV files gracefully
- Provide clear error messages if CSV files don't match the expected schema
- Consider adding a configuration option to disable automatic CSV loading on startup (for development or when using only the API)
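The checksum idea from these notes could be implemented along the following lines. This is a sketch under assumptions: data_fingerprint is a hypothetical name, and where the previous fingerprint is stored (for example, next to the ChromaDB data) is left open.

```python
import hashlib
from pathlib import Path


def data_fingerprint(data_dir: str) -> str:
    """Return a SHA-256 digest over the names and contents of all CSV files."""
    digest = hashlib.sha256()
    # Sorted order keeps the digest stable across runs and filesystems.
    for path in sorted(Path(data_dir).glob("*.csv")):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Comparing the current fingerprint with the one saved at the last indexing run shows whether a reindex is needed. An empty directory still yields a valid digest, so the "no CSV files" case needs no special handling here.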