
Add Vector Store Service for GitHub Issue Similarity Search #6

Description

@peppocola

Overview

Implement a new microservice that provides semantic search over GitHub issues using vector embeddings. This service will enable users to find similar issues and their associated labels, improving issue triaging and duplicate detection workflows.

Motivation

Currently, the application lacks the ability to semantically search through historical GitHub issues. By implementing a vector store service with retrieval and reranking capabilities, users will be able to:

  • Find similar issues to avoid creating duplicates
  • Discover relevant labels based on similar historical issues
  • Improve issue classification accuracy
  • Leverage historical issue data for better triaging decisions

Requirements

Functional Requirements

  1. Vector Storage: Persist GitHub issue embeddings with metadata (title, body, labels, state, etc.)
  2. Semantic Search: Accept natural language queries and return semantically similar issues
  3. Reranking: Refine initial retrieval results using a cross-encoder model for improved relevance
  4. Label Discovery: Return labels from similar issues to assist with issue classification
  5. Batch Indexing: Support indexing multiple issues efficiently
  6. Data Persistence: Maintain vector store data across container restarts
  7. CSV Data Loading: Load and index issues from CSV files in the data/ folder on initialization

Non-Functional Requirements

  1. Performance:
    • Search queries should complete in < 2 seconds for collections up to 50,000 issues
    • Support concurrent requests from multiple users
  2. Scalability: Handle at least 10,000 GitHub issues initially, with room to scale
  3. Integration: Follow existing service architecture patterns (FastAPI, Docker, environment-based configuration)
  4. Resource Efficiency: Reasonable memory footprint suitable for containerized deployment

Technical Specification

Data Source

The service must load existing issue data from CSV files located in the data/ folder. These CSV files have the following schema:

title,body,label,url

Field Descriptions:

  • title: Issue title (string)
  • body: Issue description/body text (string)
  • label: Issue label(s) - may contain multiple comma-separated labels or a single label (string)
  • url: GitHub URL of the issue (string)

Requirements:

  • All CSV files in data/ should be loaded during container initialization
  • The service should handle multiple CSV files and merge them into a single collection
  • Parsing should handle multi-line fields and escaped quotes properly
  • Empty or malformed rows should be logged and skipped gracefully
  • Progress should be logged during the indexing process
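The parsing requirements above can be sketched with Python's standard csv module, which handles quoted multi-line fields and doubled quotes. This is only an illustration: the function name parse_issue_rows and the rule that a usable row needs a non-empty title and url are assumptions, not part of the spec.

```python
import csv
import io
import logging

logger = logging.getLogger("csv_loader")

REQUIRED_COLUMNS = ("title", "body", "label", "url")


def parse_issue_rows(stream):
    """Yield cleaned rows from an open CSV text stream, skipping malformed ones.

    Assumption: a row is usable only if it has a non-empty title and url.
    """
    reader = csv.DictReader(stream)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")

    for record_no, row in enumerate(reader, start=1):
        title = (row.get("title") or "").strip()
        url = (row.get("url") or "").strip()
        if not title or not url:
            logger.warning("Skipping malformed record %d", record_no)
            continue
        yield {
            "title": title,
            "body": (row.get("body") or "").strip(),
            "label": (row.get("label") or "").strip(),
            "url": url,
        }


# One valid record with a multi-line, quote-escaped body, then one empty record.
sample = io.StringIO(
    'title,body,label,url\n'
    '"Crash on start","Line one\nLine two with ""quotes""","bug",'
    '"https://github.com/example/repo/issues/1"\n'
    ',,,\n'
)
rows = list(parse_issue_rows(sample))
```

Raising on a missing column while only logging bad rows separates a file that does not match the schema from an otherwise valid file that contains a few bad records.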

Service Architecture

The service should follow the existing microservice pattern with:

  • Vector Database: ChromaDB for vector storage and similarity search
  • Embedding Model: Sentence-transformers for generating embeddings (e.g., all-mpnet-base-v2)
  • Reranker Model: Cross-encoder for result refinement (e.g., ms-marco-MiniLM-L-6-v2)
  • API Framework: FastAPI with Pydantic models for request/response validation
  • Containerization: Docker with dedicated Dockerfile following project conventions

Docker Compose Integration

Add a new service to docker-compose.yml:

vector-store:
  build:
    context: .
    dockerfile: dockerfiles/dockerfile.vectorstore
  container_name: vector-store-service
  ports:
    - "${VECTOR_STORE_PORT}:8001"
  environment:
    - HOST=${VECTOR_STORE_HOST}
    - PORT=8001  # container always listens on 8001; VECTOR_STORE_PORT only sets the host port
    - CHROMA_DATA_PATH=/data/chroma
    - DATA_DIR=/app/data
  volumes:
    - vector_store_data:/data/chroma
    - ./data:/app/data:ro  # Mount data directory as read-only
  restart: unless-stopped

Update the app service to include:

environment:
  - VECTOR_STORE_BASE_URL=${DOCKER_VECTOR_STORE_BASE_URL}
depends_on:
  vector-store:
    condition: service_started

Add volume definition:

volumes:
  vector_store_data:

API Endpoints

Implement the following REST endpoints:

1. Health Check

GET /health

Returns service status and collection statistics.

2. Index Issues

POST /index
Content-Type: application/json

{
  "issues": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "open|closed",
      "created_at": "ISO8601 timestamp",
      "metadata": {
        "number": int,
        "url": "string",
        "user": "string",
        "comments": int
      }
    }
  ]
}

3. Search Similar Issues

POST /search
Content-Type: application/json

{
  "query": "string",
  "top_k": 10,
  "rerank": true,
  "rerank_top_k": 5,
  "filter_labels": ["optional", "label", "filter"]
}

Response:
{
  "results": [
    {
      "id": "string",
      "title": "string",
      "body": "string",
      "labels": ["string"],
      "state": "string",
      "score": float,
      "metadata": {}
    }
  ],
  "query": "string",
  "total_results": int
}

4. Get Issue by ID

GET /issue/{issue_id}

5. Clear Collection (Admin)

DELETE /collection

6. Reindex from CSV Files

POST /reindex

Clears the collection and reloads all CSV files from the data directory. Useful for updating the index when CSV files change.

File Structure

Create the following directory structure:

services/vector_store/
├── vector_store_api.py          # Main FastAPI application
├── requirements.txt             # Python dependencies
├── load_csv_data.py            # CSV data loading utility
└── index_github_issues.py       # Utility script for batch indexing from GitHub API

dockerfiles/
└── dockerfile.vectorstore       # Container definition

data/
├── issues_1.csv                # Issue data files (example)
├── issues_2.csv
└── ...

Environment Variables

Add to .env:

VECTOR_STORE_PORT=8001
VECTOR_STORE_HOST=0.0.0.0
DOCKER_VECTOR_STORE_BASE_URL=http://vector-store:8001
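Inside the container, the service sees the variables set in the compose file (HOST, PORT, CHROMA_DATA_PATH, DATA_DIR). A minimal sketch of reading them follows. The Settings class, the load_settings helper, and the default values are assumptions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    chroma_data_path: str
    data_dir: str


def load_settings(env=None) -> Settings:
    """Build settings from environment variables.

    The defaults are assumptions chosen to match the compose file above.
    """
    env = os.environ if env is None else env
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8001")),
        chroma_data_path=env.get("CHROMA_DATA_PATH", "/data/chroma"),
        data_dir=env.get("DATA_DIR", "/app/data"),
    )


# Passing a plain dict makes the function easy to exercise without touching os.environ.
settings = load_settings({"PORT": "9000"})
```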

Implementation Details

CSV Data Loading

Create a load_csv_data.py module that:

  1. Discovers CSV Files: Scans the data/ directory for all .csv files
  2. Parses CSV Data: Uses pandas to read CSV files with proper handling of:
    • Multi-line fields
    • Quoted strings
    • Various encodings (UTF-8, latin-1, etc.)
    • Missing values
  3. Generates Issue IDs: Creates unique IDs from URLs or generates UUIDs if needed
  4. Handles Labels: Parses label fields which may contain:
    • Single labels: "bug"
    • Multiple comma-separated labels: "bug,memory"
  5. Batches Indexing: Processes issues in batches (e.g., 100 at a time) for memory efficiency
  6. Logs Progress: Reports loading progress and any errors encountered

Example implementation structure:

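import hashlib
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)

def load_issues_from_csv(data_dir: str) -> list:
    """Load all issues from CSV files in data directory."""
    all_issues = []

    for csv_file in sorted(Path(data_dir).glob("*.csv")):
        # dtype=str and keep_default_na=False keep empty cells as "" instead of NaN
        df = pd.read_csv(csv_file, encoding='utf-8', dtype=str, keep_default_na=False)
        for row in df.itertuples(index=False):
            if not row.title.strip() or not row.url.strip():
                logger.warning("Skipping malformed row in %s", csv_file)
                continue
            all_issues.append({
                # Deterministic ID derived from the issue URL
                "id": hashlib.md5(row.url.encode()).hexdigest()[:16],
                "title": row.title,
                "body": row.body,
                # Split comma-separated labels and drop empty entries
                "labels": [l.strip() for l in row.label.split(",") if l.strip()],
                "url": row.url,
            })
        logger.info("Loaded %s (%d issues so far)", csv_file, len(all_issues))

    return all_issues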
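Step 5 in the list above (batched indexing) could be sketched with a small generator. The helper name iter_batches is hypothetical; the default of 100 follows the example batch size mentioned in the list.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


sizes = [len(b) for b in iter_batches(range(250), batch_size=100)]
```

Each batch can then be embedded and written to the collection in one call, so only one batch of embeddings is held in memory at a time.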

Startup Behavior

On service startup, the application should:

  1. Initialize embedding and reranker models
  2. Connect to ChromaDB
  3. Check if collection is empty
  4. If empty, automatically load and index all CSV files from data/
  5. If not empty, skip automatic loading (use /reindex endpoint if refresh needed)
  6. Log summary statistics (total issues indexed, time taken, etc.)
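Steps 3 to 6 can be sketched as a single function. This assumes the collection object exposes a count() method (ChromaDB collections do). The load_issues and index_issues callables and the FakeCollection class are hypothetical stand-ins used only to exercise the logic.

```python
import logging
import time

logger = logging.getLogger("vector_store.startup")


def ensure_indexed(collection, load_issues, index_issues) -> int:
    """Index CSV data only when the collection is empty.

    Returns the number of issues indexed during this call.
    """
    existing = collection.count()
    if existing > 0:
        logger.info("Collection already holds %d issues; skipping CSV load", existing)
        return 0

    started = time.perf_counter()
    issues = load_issues()
    index_issues(issues)
    logger.info("Indexed %d issues in %.2fs", len(issues), time.perf_counter() - started)
    return len(issues)


class FakeCollection:
    """In-memory stand-in for a vector store collection (illustration only)."""

    def __init__(self):
        self.items = []

    def count(self):
        return len(self.items)


fake = FakeCollection()
first = ensure_indexed(fake, lambda: [{"id": "a"}, {"id": "b"}], fake.items.extend)
second = ensure_indexed(fake, lambda: [{"id": "c"}], fake.items.extend)
```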

Retrieval Strategy

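  1. Initial Retrieval: Use sentence-transformers to encode the query and retrieve top-k candidates based on cosine similarity. ChromaDB defaults to L2 distance, so either configure the collection to use cosine distance or normalize the embeddings, in which case L2 distance yields the same ranking as cosine similarity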
  2. Reranking (optional): Apply cross-encoder model to rerank candidates for improved relevance
  3. Filtering: Support filtering results by specific labels if requested
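The three steps above can be illustrated with a pure-Python stand-in. A brute-force cosine scan replaces the ChromaDB query, and rerank_fn replaces the cross-encoder score. The function names, the dict layout, and the choice to filter by label before retrieval are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, issues, top_k=10, rerank_fn=None, rerank_top_k=5, filter_labels=None):
    """Filter by label, retrieve top_k by cosine similarity, then optionally rerank.

    issues: list of dicts with "embedding" and "labels" keys.
    rerank_fn(issue) -> float stands in for a cross-encoder relevance score.
    """
    candidates = issues
    if filter_labels:
        wanted = set(filter_labels)
        candidates = [i for i in candidates if wanted & set(i["labels"])]

    scored = [(cosine_similarity(query_vec, i["embedding"]), i) for i in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:top_k]

    if rerank_fn is not None:
        reranked = [(rerank_fn(issue), issue) for _, issue in top]
        reranked.sort(key=lambda pair: pair[0], reverse=True)
        top = reranked[:rerank_top_k]

    return [{**issue, "score": score} for score, issue in top]


issues = [
    {"id": "1", "labels": ["bug"], "embedding": [1.0, 0.0]},
    {"id": "2", "labels": ["docs"], "embedding": [0.0, 1.0]},
    {"id": "3", "labels": ["bug"], "embedding": [0.7, 0.7]},
]
nearest = [r["id"] for r in search([1.0, 0.1], issues, top_k=2)]
bug_only = [r["id"] for r in search([0.0, 1.0], issues, top_k=3, filter_labels=["bug"])]
reranked = [
    r["id"]
    for r in search(
        [1.0, 0.1], issues, top_k=2,
        rerank_fn=lambda issue: 1.0 if issue["id"] == "3" else 0.0,
        rerank_top_k=1,
    )
]
```

Retrieving more candidates than are finally returned (top_k larger than rerank_top_k) gives the reranker room to promote an item that the embedding search ranked lower.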

Data Indexing

From CSV Files

The primary data source is CSV files in the data/ folder. These should be loaded automatically on first run.

From GitHub API (Optional)

Provide a utility script (index_github_issues.py) that:

  • Fetches issues from GitHub's REST API
  • Handles pagination and rate limiting
  • Filters out pull requests
  • Batches requests to the vector store API
  • Supports authentication via GitHub tokens

Example usage:

python services/vector_store/index_github_issues.py \
  --owner pytorch \
  --repo pytorch \
  --token $GITHUB_TOKEN \
  --max-issues 5000
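GitHub's REST issues listing also returns pull requests, which are marked by a pull_request key on the item. A sketch of the filtering and conversion step follows. The function name to_index_payload and the use of the issue number as the ID are assumptions; the output shape follows the /index request body defined earlier.

```python
def to_index_payload(api_items):
    """Convert GitHub REST issue objects into an /index request body.

    Items carrying a "pull_request" key are pull requests and are dropped.
    """
    issues = []
    for item in api_items:
        if "pull_request" in item:
            continue
        issues.append({
            "id": str(item["number"]),  # assumption: issue number used as ID
            "title": item["title"],
            "body": item.get("body") or "",  # body may be null in the API response
            "labels": [label["name"] for label in item.get("labels", [])],
            "state": item["state"],
            "created_at": item["created_at"],
            "metadata": {
                "number": item["number"],
                "url": item["html_url"],
                "user": (item.get("user") or {}).get("login", ""),
                "comments": item.get("comments", 0),
            },
        })
    return {"issues": issues}


# Made-up items in the shape of the REST response, for illustration only.
sample_items = [
    {
        "number": 7, "title": "Crash", "body": None, "state": "open",
        "created_at": "2024-01-01T00:00:00Z",
        "html_url": "https://github.com/example/repo/issues/7",
        "labels": [{"name": "bug"}], "user": {"login": "alice"}, "comments": 2,
    },
    {
        "number": 8, "title": "Fix crash", "body": "Patch", "state": "open",
        "created_at": "2024-01-02T00:00:00Z",
        "html_url": "https://github.com/example/repo/pull/8",
        "labels": [], "user": {"login": "bob"}, "comments": 0,
        "pull_request": {},
    },
]
payload = to_index_payload(sample_items)
```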

Model Selection Rationale

  • Embedding Model: all-mpnet-base-v2 produces 768-dimensional embeddings and provides strong general-purpose semantic matching at moderate computational cost
  • Reranker: ms-marco-MiniLM-L-6-v2 offers strong reranking performance with minimal latency impact

Alternative models can be configured if needed for specific domains or languages.

ID Generation Strategy

Since CSV files may not include explicit IDs, implement one of these strategies:

  1. Hash-based IDs: Generate deterministic IDs from URL: hashlib.md5(url.encode()).hexdigest()[:16]
  2. URL-based IDs: Extract issue number from GitHub URL
  3. UUID-based IDs: Generate random UUIDs (not recommended as they're not reproducible)

Recommendation: Use hash-based IDs from URLs for deterministic, reproducible indexing.
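The first two strategies can be sketched as follows. The helper names hash_id and issue_number and the URL pattern are assumptions for illustration.

```python
import hashlib
import re

# Matches URLs such as https://github.com/<owner>/<repo>/issues/<number>
_ISSUE_URL = re.compile(r"github\.com/([^/]+)/([^/]+)/issues/(\d+)")


def hash_id(url: str) -> str:
    """Deterministic 16-character ID derived from the issue URL."""
    return hashlib.md5(url.encode()).hexdigest()[:16]


def issue_number(url: str):
    """Return the issue number from a GitHub issue URL, or None if it does not match."""
    match = _ISSUE_URL.search(url)
    return int(match.group(3)) if match else None


url = "https://github.com/example/repo/issues/123"
first_id = hash_id(url)
number = issue_number(url)
```

An issue number alone is unique only within one repository, so the hash of the full URL is the safer key when CSV files mix issues from several repositories.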

Acceptance Criteria

  • Docker service builds successfully and starts without errors
  • Service integrates with existing docker-compose infrastructure
  • All CSV files in data/ folder are automatically loaded and indexed on first startup
  • CSV parsing correctly handles the schema: title, body, label, url
  • Progress is logged during CSV loading with statistics (files processed, issues indexed, errors encountered)
  • Empty or malformed CSV rows are skipped gracefully with appropriate logging
  • All API endpoints return correct responses with proper error handling
  • Vector embeddings are persisted across container restarts
  • Search returns semantically relevant results ranked by similarity
  • Documentation includes usage examples for indexing and searching
  • Models are downloaded at build time to avoid runtime delays
  • Health check endpoint provides useful diagnostic information including number of indexed issues
  • /reindex endpoint successfully clears and reloads data from CSV files

Testing Recommendations

  1. Unit Tests:
    • Test CSV parsing with various edge cases (multi-line, quotes, empty fields)
    • Test embedding generation, search logic, and API validation
    • Test label parsing for single and multiple labels
  2. Integration Tests:
    • Test with docker-compose, verify cross-service communication
    • Test with sample CSV files in data/ folder
    • Verify data persistence across container restarts
  3. Performance Tests:
    • Measure search latency with varying collection sizes
    • Measure CSV loading time for large datasets
  4. Quality Tests:
    • Evaluate search relevance on a curated set of test queries
    • Verify that labels from similar issues are relevant

Sample CSV Data for Testing

Create a data/test_issues.csv file with sample data:

title,body,label,url
"Memory leak in training loop","When training for extended periods, memory usage continuously increases without being released. This eventually leads to OOM errors.","bug,memory","https://github.com/example/repo/issues/123"
"Add support for distributed training","It would be great to have built-in support for multi-GPU and multi-node training using PyTorch DDP.","enhancement,feature-request","https://github.com/example/repo/issues/124"
"Documentation: Getting started guide incomplete","The getting started guide is missing information about installing dependencies on Windows.","documentation,good-first-issue","https://github.com/example/repo/issues/125"

UI Integration Notes

The UI service should integrate this functionality by:

  1. Adding a "Find Similar Issues" button/feature in the issue creation/viewing interface
  2. Displaying similar issues with their labels when a user describes a new issue
  3. Suggesting relevant labels based on the labels of similar historical issues
  4. Providing a confidence score or relevance indicator for each similar issue

Example integration code:

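import os

import requests

# Base URL of the vector store service, e.g. http://vector-store:8001
VECTOR_STORE_BASE_URL = os.environ["VECTOR_STORE_BASE_URL"]

def find_similar_issues(issue_text: str, top_k: int = 5):
    response = requests.post(
        f"{VECTOR_STORE_BASE_URL}/search",
        json={
            "query": issue_text,
            # Retrieve at least as many candidates as will be returned after reranking
            "top_k": max(10, top_k),
            "rerank": True,
            "rerank_top_k": top_k,
        },
        timeout=10,
    )
    response.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return response.json()['results']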
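Label suggestion (point 3 above) could be derived from the search results by weighting each label with the scores of the issues that carry it. This is one possible heuristic, not a specified algorithm. The function name suggest_labels is hypothetical, and the sketch assumes scores are non-negative similarities; raw cross-encoder scores can be negative and would need rescaling first.

```python
from collections import defaultdict


def suggest_labels(results, max_labels=3):
    """Rank labels by the summed score of the similar issues carrying them."""
    totals = defaultdict(float)
    for result in results:
        for label in result.get("labels", []):
            totals[label] += float(result.get("score", 0.0))
    # Highest total first; ties broken alphabetically for stable output
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:max_labels]]


example = [
    {"labels": ["bug", "memory"], "score": 0.9},
    {"labels": ["bug"], "score": 0.6},
    {"labels": ["enhancement"], "score": 0.4},
]
suggested = suggest_labels(example, max_labels=2)
```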

Future Enhancements (Out of Scope)

  • Multi-repository support with separate collections
  • Incremental indexing via GitHub webhooks
  • CSV file watching for automatic reindexing on file changes
  • Advanced filtering (date ranges, author, comment count)
  • Multilingual support with language-specific models
  • A/B testing different embedding models
  • Query expansion and reformulation
  • Analytics dashboard for search quality metrics
  • Export indexed data back to CSV format

Resources

Estimated Effort

  • Core Implementation: 8-12 hours
  • CSV Loading Implementation: 3-4 hours
  • Testing & Documentation: 3-4 hours
  • Integration with UI: 2-3 hours
  • Total: ~16-23 hours

Priority

Medium-High - This feature significantly improves issue management workflows and leverages existing historical data.

Notes for Implementation

  • The data/ directory should be mounted as read-only in the container for security
  • Consider implementing a checksum or timestamp mechanism to detect when CSV files have been updated
  • Handle the case where the data/ directory is empty or contains no CSV files gracefully
  • Provide clear error messages if CSV files don't match the expected schema
  • Consider adding a configuration option to disable automatic CSV loading on startup (for development or when using only the API)
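The checksum idea in the notes above could be sketched as a single fingerprint over all CSV files. The helper name data_fingerprint is hypothetical, and where the previous fingerprint is stored (for example alongside the ChromaDB data) is left open.

```python
import hashlib
from pathlib import Path


def data_fingerprint(data_dir) -> str:
    """Return a SHA-256 digest over the names and contents of all CSV files.

    Files are read whole, which is simple but assumes they fit in memory.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).glob("*.csv")):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Comparing the stored fingerprint with a fresh one at startup shows whether a reindex is needed. An empty directory still produces a valid digest, so the case of no CSV files needs no special handling here.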
