LongMemEval Benchmark

Evaluates the Oblivion memory framework against the LongMemEval benchmark for long-term memory in conversational AI.

Overview

The evaluation pipeline consists of two phases:

1. Preparation (memory injection): Processes LongMemEval conversation sessions through Oblivion's memory framework, storing memories via one of several preparation strategies.
2. Query: Evaluates Oblivion's ability to answer questions about the stored conversations and computes retrieval metrics (NDCG, recall, etc.).
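For readers unfamiliar with the retrieval metrics named above, here is a minimal sketch of recall@k and binary-relevance NDCG@k. It is not the code in `metrics/`; the function names, the session IDs, and the choice of binary relevance are assumptions made for illustration.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results.

    Assumes ranked_ids contains no duplicates.
    """
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A relevant item at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    # The ideal ranking places every relevant item first, up to k positions.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four retrieved session IDs, two of which are relevant.
ranked = ["s3", "s1", "s7", "s2"]
relevant = {"s1", "s2"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5: only "s1" is in the top 3
print(ndcg_at_k(ranked, relevant, k=3))
```

Recall ignores where in the top k a relevant item lands, while NDCG rewards placing it earlier, which is why the two are usually reported together.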

Preparation Strategies

The preparation strategies are inspired by the Recognizer module in the Oblivion framework, adapted for batch processing of LongMemEval sessions:

| Strategy | Description |
| --- | --- |
| `fastest` | Minimal processing, direct memory storage |
| `lme_like` | Mirrors the original LongMemEval baseline approach |
| `lme_like_userfact` | LME-like with user fact extraction |
| `compress` | Compresses conversations before storage |
| `lossless` | Full conversation preservation |
| `mapreduce` | Map-reduce style memory consolidation |

Prerequisites

Install dependencies: `poetry install --extras lme-benchmark`
LLM credentials: an Azure key file (`.keys.ini`) or the `OPENAI_API_KEY` environment variable

Dataset: Initialize the LongMemEval submodule:

git submodule update --init data/benchmarks/longmemeval

Running

# Dry-run (validates config, no LLM calls)
python -m experiments.longmemeval_benchmark.run_evaluation \
  --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml \
  --dry-run

# Full evaluation
python -m experiments.longmemeval_benchmark.run_evaluation \
  --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml

Directory Structure

longmemeval_benchmark/
├── run_evaluation.py        # CLI entry point
├── runner/                  # Core pipeline
│   ├── benchmark_runner.py  # LongMemEvalOblivionRunner
│   ├── preparation_pipeline.py
│   ├── query_pipeline.py
│   ├── context_assembly.py
│   ├── config_factory.py
│   ├── memory_helpers.py
│   └── prompts.py
├── preparation/             # Memory preparation strategies
│   ├── strategies/          # Strategy implementations
│   └── prompts/             # Per-strategy prompt templates
├── metrics/                 # Retrieval metrics + cost tracking
├── cache/                   # Preparation cache + resume
├── analysis/                # Pipeline trace + exclusions
├── llm/                     # Async structured calls, throttling
├── models/                  # Pydantic data models
└── config/                  # YAML experiment configurations

Configuration

See config/README.md for details on the YAML configuration format.

Key config fields:

split: LongMemEval data split
model: LLM model name
provider: azure, openai, or internal
preparation_method: Strategy name (see table above)
preparation_cache_id: Reuse a previous preparation run (set to PLACEHOLDER_RUN_PREPARATION_FIRST by default)
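The key fields above can be sanity-checked before launching a run. The sketch below is an illustration only, not the project's own validation: the function `find_config_problems`, the flat dictionary layout, the treatment of all five fields as required, and the sample values are assumptions. The real schema is described in config/README.md.

```python
from typing import Any, List, Mapping

# Values taken from this README; whether the real loader enforces them is an assumption.
PLACEHOLDER_CACHE_ID = "PLACEHOLDER_RUN_PREPARATION_FIRST"
KNOWN_PROVIDERS = {"azure", "openai", "internal"}
KNOWN_STRATEGIES = {
    "fastest", "lme_like", "lme_like_userfact",
    "compress", "lossless", "mapreduce",
}
# Assumption: every key field listed in this README is required.
REQUIRED_FIELDS = (
    "split", "model", "provider",
    "preparation_method", "preparation_cache_id",
)


def find_config_problems(config: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems found in a (hypothetical) flat config dict."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        if field not in config:
            problems.append(f"missing field: {field}")

    provider = config.get("provider")
    if provider is not None and provider not in KNOWN_PROVIDERS:
        problems.append(f"unknown provider: {provider!r}")

    method = config.get("preparation_method")
    if method is not None and method not in KNOWN_STRATEGIES:
        problems.append(f"unknown preparation_method: {method!r}")

    # The placeholder name suggests that a preparation run has to exist
    # before its cache can be reused, so it is flagged here.
    if config.get("preparation_cache_id") == PLACEHOLDER_CACHE_ID:
        problems.append("preparation_cache_id is still the placeholder")
    return problems


# Hypothetical config values, for illustration only.
example = {
    "split": "example_split",
    "model": "example-model",
    "provider": "openai",
    "preparation_method": "compress",
    "preparation_cache_id": PLACEHOLDER_CACHE_ID,
}
for problem in find_config_problems(example):
    print(problem)
```

Returning a list instead of raising on the first error lets every problem surface in one pass, in the same spirit as the `--dry-run` flag.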
