Evaluates the Oblivion memory framework against the LongMemEval benchmark for long-term memory in conversational AI.
The evaluation pipeline consists of two phases:
- Preparation (memory injection): Processes LongMemEval conversation sessions through Oblivion's memory framework, storing memories via different preparation strategies.
- Query: Evaluates Oblivion's ability to answer questions about the stored conversations, computing retrieval metrics (NDCG, recall, etc.).
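
To make the retrieval metrics in the query phase concrete, here is a minimal sketch of recall@k and binary-relevance NDCG@k. The function names, the session-ID inputs, and the assumption that ranked IDs are unique are illustrative only; the benchmark's own `metrics/` package may compute these differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # Each relevant item at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four retrieved sessions, two of which are relevant.
ranked = ["s3", "s1", "s7", "s2"]
relevant = ["s1", "s2"]
print(recall_at_k(ranked, relevant, k=3))  # 0.5: only s1 is in the top 3
print(ndcg_at_k(ranked, relevant, k=3))
```

Recall only counts whether relevant sessions were retrieved, while NDCG also rewards ranking them near the top, which is why the two are usually reported together.
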
The preparation strategies are inspired by the Recognizer module in the Oblivion framework, adapted for batch processing of LongMemEval sessions:
| Strategy | Description |
|---|---|
| `fastest` | Minimal processing, direct memory storage |
| `lme_like` | Mirrors the original LongMemEval baseline approach |
| `lme_like_userfact` | LME-like with user fact extraction |
| `compress` | Compresses conversations before storage |
| `lossless` | Full conversation preservation |
| `mapreduce` | Map-reduce style memory consolidation |

- Install dependencies: `poetry install --extras lme-benchmark`
- LLM credentials: Azure (`.keys.ini`) or the `OPENAI_API_KEY` env var
- Dataset: Initialize the LongMemEval submodule: `git submodule update --init data/benchmarks/longmemeval`
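
Before a full run, it can help to confirm that one of the credential sources is actually present. The sketch below is a hypothetical pre-flight check, not part of the benchmark code: the helper name, the assumption that `.keys.ini` sits in the current working directory, and the order in which the two sources are checked are all illustrative.

```python
import os
from pathlib import Path


def find_llm_credentials(keys_file=".keys.ini", env=None):
    """Return 'azure', 'openai', or None depending on which credentials exist."""
    env = os.environ if env is None else env
    # Assumption: the presence of the keys file indicates Azure credentials.
    if Path(keys_file).is_file():
        return "azure"
    # An empty OPENAI_API_KEY is treated the same as an unset one.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


if __name__ == "__main__":
    source = find_llm_credentials()
    print(source or "no LLM credentials found")
```
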

Run the evaluation:

    # Dry-run (validates config, no LLM calls)
    python -m experiments.longmemeval_benchmark.run_evaluation \
      --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml \
      --dry-run

    # Full evaluation
    python -m experiments.longmemeval_benchmark.run_evaluation \
      --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml

Directory layout:

    longmemeval_benchmark/
    ├── run_evaluation.py # CLI entry point
    ├── runner/ # Core pipeline
    │ ├── benchmark_runner.py # LongMemEvalOblivionRunner
    │ ├── preparation_pipeline.py
    │ ├── query_pipeline.py
    │ ├── context_assembly.py
    │ ├── config_factory.py
    │ ├── memory_helpers.py
    │ └── prompts.py
    ├── preparation/ # Memory preparation strategies
    │ ├── strategies/ # Strategy implementations
    │ └── prompts/ # Per-strategy prompt templates
    ├── metrics/ # Retrieval metrics + cost tracking
    ├── cache/ # Preparation cache + resume
    ├── analysis/ # Pipeline trace + exclusions
    ├── llm/ # Async structured calls, throttling
    ├── models/ # Pydantic data models
    └── config/ # YAML experiment configurations

See config/README.md for details on the YAML configuration format.
Key config fields:

- `split`: LongMemEval data split
- `model`: LLM model name
- `provider`: `azure`, `openai`, or `internal`
- `preparation_method`: Strategy name (see table above)
- `preparation_cache_id`: Reuse a previous preparation run (set to `PLACEHOLDER_RUN_PREPARATION_FIRST` by default)
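
The fields above can be sanity-checked before launching a run. The following is a rough sketch that operates on an already-parsed config dictionary; `check_config` is a hypothetical helper, treating all five fields as required is an assumption, and the `split` and `model` values in the example are placeholders rather than real option names. The actual schema is described in config/README.md.

```python
PROVIDERS = {"azure", "openai", "internal"}
PREPARATION_METHODS = {
    "fastest", "lme_like", "lme_like_userfact",
    "compress", "lossless", "mapreduce",
}
# Assumption: all five key fields are treated as required here.
REQUIRED_FIELDS = (
    "split", "model", "provider",
    "preparation_method", "preparation_cache_id",
)
PLACEHOLDER_CACHE_ID = "PLACEHOLDER_RUN_PREPARATION_FIRST"


def check_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    provider = config.get("provider")
    if provider is not None and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    method = config.get("preparation_method")
    if method is not None and method not in PREPARATION_METHODS:
        problems.append(f"unknown preparation_method: {method}")
    if config.get("preparation_cache_id") == PLACEHOLDER_CACHE_ID:
        problems.append("preparation_cache_id is still the placeholder; run preparation first")
    return problems


# Hypothetical config; the split and model values are placeholders.
example = {
    "split": "<your-split>",
    "model": "<your-model>",
    "provider": "openai",
    "preparation_method": "compress",
    "preparation_cache_id": PLACEHOLDER_CACHE_ID,
}
for problem in check_config(example):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a single pass report everything that needs fixing.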