Domain-Specific Embedding Model for MTEB

Goal: Build a domain-specialized text embedding model that tops the MTEB leaderboard in a specific subcategory (medical, legal, scientific, or code retrieval).

Approach: Apply NVIDIA's published recipe from the NV-Embed papers -- synthetic data generation + hard negative mining + contrastive fine-tuning -- to a small base model that fits on a free Colab T4 GPU.

Why This Works

Most top MTEB models are general-purpose. They spread their capacity across dozens of tasks and domains. A focused model trained on high-quality, domain-specific synthetic data with carefully mined hard negatives can beat much larger general models on domain-specific benchmarks.

The NVIDIA NV-Embed recipe (ICLR 2025 Spotlight) demonstrated these training techniques at scale on a general-purpose model. We apply the same principles -- two-stage contrastive instruction-tuning, hard negative mining, synthetic query generation -- but target a single domain to maximize per-task performance.

Approach

  1. Domain Analysis: Programmatically identify which MTEB subcategories have the most room for improvement
  2. Synthetic Data Generation: Use an LLM to generate query-document pairs from domain corpora (PubMed, arXiv, case law, or code docs)
  3. Hard Negative Mining: Combine BM25 lexical matching with embedding similarity to find challenging negatives
  4. Contrastive Fine-Tuning: Train with InfoNCE loss + in-batch negatives + mined hard negatives
  5. Evaluation and Submission: Run MTEB evaluation, compare against baselines, submit to leaderboard
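To make step 3 concrete, here is a minimal, dependency-free sketch of hybrid hard negative mining. It is not the repository's `mine_hard_negatives` implementation. The function name `mine_negatives_sketch`, the min-max normalization, the weighted-sum fusion, and the `alpha` weight are all assumptions chosen for illustration, and the embeddings are passed in precomputed rather than produced by a real model.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def mine_negatives_sketch(query_tokens, query_vec, docs_tokens, doc_vecs,
                          positive_idx, top_k=2, alpha=0.5):
    """Return indices of the top_k non-positive documents by fused score.

    alpha weights the lexical score; (1 - alpha) weights the dense score.
    This fusion rule is an assumption, not the repository's method.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, vec) for vec in doc_vecs])
    fused = [alpha * lex + (1 - alpha) * den for lex, den in zip(lexical, dense)]
    candidates = [i for i in range(len(docs_tokens)) if i != positive_idx]
    candidates.sort(key=lambda i: fused[i], reverse=True)
    return candidates[:top_k]
```

Documents that score highly on both signals but are not the labeled positive are the ones most likely to confuse the model, which is why they are useful as negatives. In practice the candidates usually need a further filter to drop false negatives, that is, passages that actually do answer the query.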

Base Models

| Model | Params | Embedding Dim | Context Length (tokens) | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | 256 | Fast baseline, easy to fine-tune |
| BAAI/bge-base-en-v1.5 | 109M | 768 | 512 | Strong general-purpose base |
| nvidia/llama-nemotron-embed-1b-v2 | 1B | 2048 | 8192 | SOTA architecture, Colab Pro recommended |

The default path uses BAAI/bge-base-en-v1.5 for free Colab compatibility. The 1B Nemotron model is available if you have Colab Pro (A100 GPU).

Results

Results will be added after training and evaluation.

| Task | Baseline | Ours | Delta | Current SOTA |
|---|---|---|---|---|
| TBD | TBD | TBD | TBD | TBD |

Quickstart

1. Clone and Install

git clone https://github.com/ManasVardhan/mteb-domain-embeddings.git
cd mteb-domain-embeddings
pip install -r requirements.txt

2. Run Notebooks Sequentially on Colab

The project is designed as four sequential Colab notebooks:

  1. notebooks/01_domain_analysis.ipynb - Analyze MTEB tasks, pick the best domain to target
  2. notebooks/02_synthetic_data_gen.ipynb - Generate training data from domain corpora
  3. notebooks/03_train_embedding.ipynb - Fine-tune the embedding model
  4. notebooks/04_evaluate_submit.ipynb - Evaluate on MTEB and submit to leaderboard

Each notebook is self-contained with Colab setup cells (Drive mounting, dependency installation, checkpointing).
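As an illustration of what such a setup cell might look like, the sketch below picks a checkpoint directory that lives on Google Drive when running inside Colab and falls back to a local folder otherwise. The helper name `checkpoint_dir` and the folder layout are assumptions, not code taken from the notebooks. `drive.mount` from `google.colab` is the standard Colab call for mounting Drive.

```python
import os


def checkpoint_dir(project="mteb-domain-embeddings", local_root="."):
    """Return a directory for model checkpoints, creating it if needed.

    Inside Colab the directory is placed on Google Drive so checkpoints
    survive a session reset. The folder layout is a hypothetical choice.
    """
    try:
        from google.colab import drive  # importable only inside Colab
    except ImportError:
        root = local_root
    else:
        drive.mount("/content/drive")
        root = "/content/drive/MyDrive"
    path = os.path.join(root, project, "models")
    os.makedirs(path, exist_ok=True)
    return path
```

Saving to Drive matters on the free tier because a session can end before training finishes, and anything written only to the Colab VM's local disk is lost at that point.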

3. Or Use the Python Modules Directly

from src.data_utils import generate_synthetic_pairs, mine_hard_negatives
from src.train_utils import ContrastiveTrainer, InfoNCELoss
from src.eval_utils import run_mteb_evaluation, compare_results
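
The signatures of these helpers are not documented in this README, so no call examples are given here. To show what an InfoNCE objective with in-batch and mined hard negatives computes, the following framework-free sketch spells out the arithmetic. It is not the repository's `InfoNCELoss`. The function name `info_nce_sketch` and the temperature of 0.05 are assumptions, and a real training loop would use a tensor library rather than Python lists.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_sketch(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE loss over a batch of embedding vectors.

    queries[i] is paired with positives[i]. Every other positive in the
    batch serves as an in-batch negative, and hard_negatives[i] is a
    (possibly empty) list of mined negative vectors for query i.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, q_i in enumerate(q):
        # Logits against all positives: index i is the true pair.
        logits = [dot(q_i, p_j) / temperature for p_j in p]
        # Append logits against this query's own hard negatives.
        logits += [dot(q_i, normalize(n)) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)
```

The loss for each query is the negative log-probability of its true positive under a softmax over all candidates, so adding harder negatives raises the loss until the model learns to separate them.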

Project Structure

mteb-domain-embeddings/
  README.md
  PLAN.md
  requirements.txt
  .gitignore
  notebooks/
    01_domain_analysis.ipynb
    02_synthetic_data_gen.ipynb
    03_train_embedding.ipynb
    04_evaluate_submit.ipynb
  src/
    __init__.py
    data_utils.py
    train_utils.py
    eval_utils.py
  data/           # Generated training data (gitignored)
  models/         # Model checkpoints (gitignored)
  results/        # Evaluation results (gitignored)

Cost Estimates

| Resource | Cost | Notes |
|---|---|---|
| Colab T4 GPU | Free | 12h session limit, ~15GB VRAM |
| Colab A100 GPU | ~$10/month (Pro) | Needed for 1B param models |
| Synthetic data gen (OpenRouter) | ~$2-5 | ~10K query-doc pairs via Llama 3 |
| Synthetic data gen (Colab local) | Free | Slower, use Gemma 2B or Phi-3-mini |
| HuggingFace Hub | Free | Model hosting and leaderboard submission |

Citation

If you use this work, please cite:

@misc{vardhan2025domainembeddings,
  author = {Manas Vardhan},
  title = {Domain-Specific Embedding Model for MTEB},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/ManasVardhan/mteb-domain-embeddings}
}

License

MIT
