Goal: Build a domain-specialized text embedding model that tops the MTEB leaderboard in a specific subcategory (medical, legal, scientific, or code retrieval).
Approach: Apply NVIDIA's published recipe from the NV-Embed papers -- synthetic data generation + hard negative mining + contrastive fine-tuning -- to a small base model that fits on a free Colab T4 GPU.
Most top MTEB models are general-purpose: they spread their capacity across dozens of tasks and domains. The bet behind this project is that a focused model, trained on high-quality domain-specific synthetic data with carefully mined hard negatives, can beat much larger general models on domain-specific benchmarks.
The NVIDIA NV-Embed recipe (ICLR 2025 Spotlight) showed that these training techniques work at scale for a generalist embedding model. We apply the same principles -- two-stage contrastive instruction-tuning, hard negative mining, synthetic query generation -- but target a single domain to maximize per-task performance.
- Domain Analysis: Programmatically identify which MTEB subcategories have the most room for improvement
- Synthetic Data Generation: Use an LLM to generate query-document pairs from domain corpora (PubMed, arXiv, case law, or code docs)
- Hard Negative Mining: Combine BM25 lexical matching with embedding similarity to find challenging negatives
- Contrastive Fine-Tuning: Train with InfoNCE loss + in-batch negatives + mined hard negatives
- Evaluation and Submission: Run MTEB evaluation, compare against baselines, submit to leaderboard
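The hard negative mining and contrastive loss steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/`: the function names (`hybrid_hard_negatives`, `info_nce`), the min-max score fusion, the `alpha` weight, and the temperature value are all hypothetical choices made for the example, and the embeddings are toy vectors rather than model outputs.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def min_max(xs):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 for _ in xs] if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def hybrid_hard_negatives(query_tokens, query_vec, docs_tokens, doc_vecs,
                          positive_idx, k=2, alpha=0.5):
    """Hypothetical helper: rank non-positive docs by a BM25 + cosine blend."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    candidates = [i for i in range(len(docs_tokens)) if i != positive_idx]
    return sorted(candidates, key=lambda i: combined[i], reverse=True)[:k]

def info_nce(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """InfoNCE for one query: -log softmax of the positive over all candidates."""
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, v) / temperature for v in negative_vecs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy corpus: index 0 is the positive document for the query.
docs = [
    "aspirin reduces fever and pain".split(),
    "aspirin overdose causes bleeding".split(),
    "ibuprofen reduces fever".split(),
    "contract law governs agreements".split(),
]
doc_vecs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]
query = "does aspirin reduce fever".split()
query_vec = [1.0, 0.0, 0.0]

hard = hybrid_hard_negatives(query, query_vec, docs, doc_vecs, positive_idx=0)
loss = info_nce(query_vec, doc_vecs[0], [doc_vecs[i] for i in hard])
print(hard, round(loss, 3))
```

In a real training batch, the positives belonging to the other queries would be appended to `negative_vecs` as in-batch negatives. The mined negatives matter because they contribute far more to the loss than an unrelated document such as the contract-law example, which the model already separates easily.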
| Model | Params | Embedding Dim | Context Length | Notes |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 384 | 256 | Fast baseline, easy to fine-tune |
| BAAI/bge-base-en-v1.5 | 109M | 768 | 512 | Strong general-purpose base |
| nvidia/llama-nemotron-embed-1b-v2 | 1B | 2048 | 8192 | SOTA architecture, Colab Pro recommended |
The default path uses BAAI/bge-base-en-v1.5 for free Colab compatibility. The 1B Nemotron model is available if you have Colab Pro (A100 GPU).
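A back-of-the-envelope estimate helps explain the split between the free T4 and the A100. The sketch below assumes full fp32 fine-tuning with Adam (4 bytes for weights, 4 for gradients, 8 for the two optimizer moments, so 16 bytes per parameter) and ignores activations entirely. Real usage depends on batch size, sequence length, and whether mixed precision or parameter-efficient methods are used, so treat the numbers as a rough lower bound rather than a measurement. The parameter counts and the ~15 GB T4 figure are the approximate values quoted in this README.

```python
# Rough fp32 + Adam footprint, ignoring activations:
# 4 B weights + 4 B gradients + 8 B Adam moments = 16 bytes per parameter.
BYTES_PER_PARAM = 16
T4_VRAM_GB = 15  # approximate usable VRAM on a free Colab T4

# Approximate parameter counts from the model table.
MODELS = {
    "all-MiniLM-L6-v2": 22e6,
    "BAAI/bge-base-en-v1.5": 109e6,
    "nvidia/llama-nemotron-embed-1b-v2": 1e9,
}

def training_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimated GB for weights, gradients, and optimizer state only."""
    return num_params * bytes_per_param / 1e9

def fits_on_t4(num_params):
    return training_memory_gb(num_params) < T4_VRAM_GB

for name, params in MODELS.items():
    print(f"{name}: ~{training_memory_gb(params):.2f} GB, "
          f"fits on T4 before activations: {fits_on_t4(params)}")
```

Under these assumptions the 1B model already needs about 16 GB before any activations are stored, while bge-base needs under 2 GB and leaves most of the T4 free for batches.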
Results will be added after training and evaluation.
| Task | Baseline | Ours | Delta | Current SOTA |
|---|---|---|---|---|
| TBD | TBD | TBD | TBD | TBD |
git clone https://github.com/ManasVardhan/mteb-domain-embeddings.git
cd mteb-domain-embeddings
pip install -r requirements.txt

The project is designed as four sequential Colab notebooks:
- notebooks/01_domain_analysis.ipynb: Analyze MTEB tasks, pick the best domain to target
- notebooks/02_synthetic_data_gen.ipynb: Generate training data from domain corpora
- notebooks/03_train_embedding.ipynb: Fine-tune the embedding model
- notebooks/04_evaluate_submit.ipynb: Evaluate on MTEB and submit to leaderboard
Each notebook is self-contained with Colab setup cells (Drive mounting, dependency installation, checkpointing).
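Because free Colab sessions can end mid-run, resuming from the most recent checkpoint is the part of this setup most worth getting right. The helper below is a hypothetical sketch, not the notebooks' actual code: it assumes checkpoints are saved as `checkpoint-<step>` directories (the naming Hugging Face's `Trainer` uses), and the notebooks may use a different layout.

```python
import re
from pathlib import Path

def latest_checkpoint(checkpoint_dir):
    """Return the `checkpoint-<step>` sub-directory with the highest step, or None."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"checkpoint-(\d+)")
    best_step, best_path = -1, None
    for path in root.iterdir():
        match = pattern.fullmatch(path.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

After mounting Drive in a setup cell, a notebook could call this on its checkpoint folder (for example a path under `/content/drive/MyDrive/`, which is only an illustrative location) and start from scratch when it returns `None`.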
from src.data_utils import generate_synthetic_pairs, mine_hard_negatives
from src.train_utils import ContrastiveTrainer, InfoNCELoss
from src.eval_utils import run_mteb_evaluation, compare_results

mteb-domain-embeddings/
    README.md
    PLAN.md
    requirements.txt
    .gitignore
    notebooks/
        01_domain_analysis.ipynb
        02_synthetic_data_gen.ipynb
        03_train_embedding.ipynb
        04_evaluate_submit.ipynb
    src/
        __init__.py
        data_utils.py
        train_utils.py
        eval_utils.py
    data/       # Generated training data (gitignored)
    models/     # Model checkpoints (gitignored)
    results/    # Evaluation results (gitignored)
| Resource | Cost | Notes |
|---|---|---|
| Colab T4 GPU | Free | 12h session limit, ~15GB VRAM |
| Colab A100 GPU | ~$10/month (Pro) | Needed for 1B param models |
| Synthetic data gen (OpenRouter) | ~$2-5 | ~10K query-doc pairs via Llama 3 |
| Synthetic data gen (Colab local) | Free | Slower, use Gemma 2B or Phi-3-mini |
| HuggingFace Hub | Free | Model hosting and leaderboard submission |
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (ICLR 2025 Spotlight)
- MTEB: Massive Text Embedding Benchmark
- MMTEB: Massive Multilingual Text Embedding Benchmark
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
If you use this work, please cite:
@misc{vardhan2025domainembeddings,
author = {Manas Vardhan},
title = {Domain-Specific Embedding Model for MTEB},
year = {2025},
publisher = {GitHub},
url = {https://github.com/ManasVardhan/mteb-domain-embeddings}
}

License: MIT