Domain-Specific Embedding Model for MTEB

Goal: Build a domain-specialized text embedding model that tops the MTEB leaderboard in a specific subcategory (medical, legal, scientific, or code retrieval).

Approach: Apply NVIDIA's published recipe from the NV-Embed papers -- synthetic data generation + hard negative mining + contrastive fine-tuning -- to a small base model that fits on a free Colab T4 GPU.

Why This Works

Most top MTEB models are general-purpose. They spread their capacity across dozens of tasks and domains. A focused model trained on high-quality, domain-specific synthetic data with carefully mined hard negatives can beat much larger general models on domain-specific benchmarks.

The NVIDIA NV-Embed recipe (ICLR 2025 Spotlight) showed how effective these training techniques are at scale for a general-purpose model. We apply the same principles -- two-stage contrastive instruction-tuning, hard negative mining, synthetic query generation -- but target a single domain to maximize per-task performance.

Approach

1. Domain Analysis: Programmatically identify which MTEB subcategories have the most room for improvement.
2. Synthetic Data Generation: Use an LLM to generate query-document pairs from domain corpora (PubMed, arXiv, case law, or code docs).
3. Hard Negative Mining: Combine BM25 lexical matching with embedding similarity to find challenging negatives.
4. Contrastive Fine-Tuning: Train with InfoNCE loss + in-batch negatives + mined hard negatives.
5. Evaluation and Submission: Run MTEB evaluation, compare against baselines, submit to leaderboard.
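
The hard negative mining step above can be illustrated with a small, dependency-free sketch. It is not the repository's `mine_hard_negatives`; the function names, the 50/50 score blend (`alpha`), and the positive-relative cutoff (`max_pos_ratio`) are assumptions chosen for illustration. It scores every document with BM25 and with cosine similarity, blends the two normalized scores, and drops candidates that score almost as high as the positive, since those are likely unlabeled positives rather than true negatives.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 (non-negative IDF variant) of each doc against the query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def mine_hard_negatives_sketch(query, query_vec, docs, doc_vecs, positive_idx,
                               k=2, alpha=0.5, max_pos_ratio=0.95):
    """Return indices of the k hardest negatives (hypothetical helper)."""
    lexical = min_max(bm25_scores(query, docs))
    dense_raw = [cosine(query_vec, v) for v in doc_vecs]
    dense = min_max(dense_raw)
    pos_sim = dense_raw[positive_idx]
    candidates = []
    for i in range(len(docs)):
        if i == positive_idx:
            continue
        # Assumed filter: skip docs nearly as similar as the positive,
        # because they may be unlabeled positives (false negatives).
        if dense_raw[i] > max_pos_ratio * pos_sim:
            continue
        candidates.append((alpha * lexical[i] + (1 - alpha) * dense[i], i))
    candidates.sort(reverse=True)
    return [i for _, i in candidates[:k]]

# Toy data: 2-d vectors stand in for real embeddings.
docs = [
    "aspirin reduces fever and relieves mild pain",   # the labeled positive
    "aspirin dosage for dogs and cats",
    "ibuprofen relieves pain and inflammation",
    "contract law governs binding agreements",
]
doc_vecs = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.0, 1.0]]
negatives = mine_hard_negatives_sketch(
    "does aspirin relieve pain", [1.0, 0.0], docs, doc_vecs, positive_idx=0
)
print(negatives)
```

On this toy input the two medical documents are selected and the unrelated legal document is not, which is the point of the step: negatives that share vocabulary or meaning with the query are more informative than random ones.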

Base Models

Model	Params	Embedding Dim	Context Length	Notes
all-MiniLM-L6-v2	22M	384	256	Fast baseline, easy to fine-tune
BAAI/bge-base-en-v1.5	109M	768	512	Strong general-purpose base
nvidia/llama-nemotron-embed-1b-v2	1B	2048	8192	SOTA architecture, Colab Pro recommended

The default path uses BAAI/bge-base-en-v1.5, which fits on the free Colab tier (T4 GPU). The 1B Nemotron model is an option if you have Colab Pro (A100 GPU).
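
A rough back-of-the-envelope estimate shows why the 1B model is a poor fit for a T4. The sketch below assumes full fine-tuning with an Adam-style optimizer in fp32, which costs about 16 bytes per parameter (weights, gradients, and two optimizer moments at 4 bytes each) before any activation memory. The helper name is hypothetical, the parameter counts are the nominal figures from the table above, and mixed precision, 8-bit optimizers, or adapter methods such as LoRA would change the numbers.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Approximate GPU memory for weights + gradients + Adam moments (fp32).

    Excludes activations, which grow with batch size and sequence length.
    """
    return num_params * bytes_per_param / 1024**3

models = [
    ("all-MiniLM-L6-v2", 22e6),
    ("BAAI/bge-base-en-v1.5", 109e6),
    ("nvidia/llama-nemotron-embed-1b-v2", 1e9),
]
for name, params in models:
    print(f"{name}: ~{training_memory_gb(params):.1f} GB before activations")
```

Under these assumptions bge-base needs under 2 GB for parameters and optimizer state, leaving most of a T4 for activations and large contrastive batches, while the 1B model alone would consume roughly the whole card.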

Results

Results will be added after training and evaluation.

Task	Baseline	Ours	Delta	Current SOTA
TBD	TBD	TBD	TBD	TBD

Quickstart

1. Clone and Install

git clone https://github.com/ManasVardhan/mteb-domain-embeddings.git
cd mteb-domain-embeddings
pip install -r requirements.txt

2. Run Notebooks Sequentially on Colab

The project is designed as four sequential Colab notebooks:

notebooks/01_domain_analysis.ipynb - Analyze MTEB tasks, pick the best domain to target
notebooks/02_synthetic_data_gen.ipynb - Generate training data from domain corpora
notebooks/03_train_embedding.ipynb - Fine-tune the embedding model
notebooks/04_evaluate_submit.ipynb - Evaluate on MTEB and submit to leaderboard

Each notebook is self-contained with Colab setup cells (Drive mounting, dependency installation, checkpointing).

3. Or Use the Python Modules Directly

from src.data_utils import generate_synthetic_pairs, mine_hard_negatives
from src.train_utils import ContrastiveTrainer, InfoNCELoss
from src.eval_utils import run_mteb_evaluation, compare_results
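
For orientation, the objective named in the imports above can be written out in a few lines. This is a minimal sketch of InfoNCE with in-batch negatives and mined hard negatives, not the repository's `InfoNCELoss`; the function name, argument layout, and the temperature of 0.05 are assumptions. For query i, the positive of every other query in the batch and all hard negatives in the batch act as negatives.

```python
import math

def info_nce_loss(query_vecs, pos_vecs, hard_neg_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    query_vecs:    list of B query embeddings
    pos_vecs:      list of B positive embeddings (pos_vecs[i] matches query i)
    hard_neg_vecs: list of B lists of hard-negative embeddings
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    queries = [normalize(v) for v in query_vecs]
    # Candidate pool shared by all queries: B positives, then all hard negatives.
    candidates = [normalize(v) for v in pos_vecs]
    candidates += [normalize(v) for negs in hard_neg_vecs for v in negs]

    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[[0.7, 0.7]], [[0.7, 0.7]]]

aligned = info_nce_loss(queries, positives, hard_negatives)
swapped = info_nce_loss(queries, positives[::-1], hard_negatives)
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

The loss is small when each query is closest to its own positive and large when the pairing is wrong. A lower temperature sharpens the softmax and puts more weight on the hardest negatives.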

Project Structure

mteb-domain-embeddings/
  README.md
  PLAN.md
  requirements.txt
  .gitignore
  notebooks/
    01_domain_analysis.ipynb
    02_synthetic_data_gen.ipynb
    03_train_embedding.ipynb
    04_evaluate_submit.ipynb
  src/
    __init__.py
    data_utils.py
    train_utils.py
    eval_utils.py
  data/           # Generated training data (gitignored)
  models/         # Model checkpoints (gitignored)
  results/        # Evaluation results (gitignored)

Cost Estimates

Resource	Cost	Notes
Colab T4 GPU	Free	12h session limit, ~15GB VRAM
Colab A100 GPU	~$10/month (Pro)	Needed for 1B param models
Synthetic data gen (OpenRouter)	~$2-5	~10K query-doc pairs via Llama 3
Synthetic data gen (Colab local)	Free	Slower, use Gemma 2B or Phi-3-mini
HuggingFace Hub	Free	Model hosting and leaderboard submission

Key References

Citation

If you use this work, please cite:

@misc{vardhan2025domainembeddings,
  author = {Manas Vardhan},
  title = {Domain-Specific Embedding Model for MTEB},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/ManasVardhan/mteb-domain-embeddings}
}

License

MIT
