LongMemEval Benchmark

Evaluates the Oblivion memory framework against the LongMemEval benchmark for long-term memory in conversational AI.

Overview

The evaluation pipeline consists of two phases:

  1. Preparation (memory injection): Processes LongMemEval conversation sessions through Oblivion's memory framework, storing memories via different preparation strategies.
  2. Query: Evaluates Oblivion's ability to answer questions about the stored conversations, computing retrieval metrics (NDCG, recall, etc.).
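For readers unfamiliar with the retrieval metrics named above, the following is a minimal sketch of recall@k and binary-relevance NDCG@k. It is not taken from the benchmark's metrics/ package; the function names, the use of session IDs as retrieval units, and the binary relevance assumption are all illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results.

    Assumes ranked_ids contains no duplicates.
    """
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    # Each relevant item at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    # The ideal ranking places every relevant item first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: two relevant sessions, retrieved at positions 2 and 4.
ranked = ["s3", "s1", "s7", "s2"]
relevant = {"s1", "s2"}
print(recall_at_k(ranked, relevant, k=2))
print(ndcg_at_k(ranked, relevant, k=4))
```

Recall only asks whether the relevant sessions were retrieved at all, while NDCG also rewards placing them near the top of the ranking, which is why the two are usually reported together.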

Preparation Strategies

The preparation strategies are inspired by the Recognizer module in the Oblivion framework, adapted for batch processing of LongMemEval sessions:

Strategy            Description
fastest             Minimal processing, direct memory storage
lme_like            Mirrors the original LongMemEval baseline approach
lme_like_userfact   LME-like with user fact extraction
compress            Compresses conversations before storage
lossless            Full conversation preservation
mapreduce           Map-reduce style memory consolidation

Prerequisites

  1. Install dependencies: poetry install --extras lme-benchmark
  2. LLM credentials: provide Azure credentials in .keys.ini, or set the OPENAI_API_KEY environment variable for OpenAI
  3. Dataset: Initialize the LongMemEval submodule:
    git submodule update --init data/benchmarks/longmemeval
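The credential and dataset prerequisites can be checked before launching a run. The helper below is a hypothetical preflight sketch, not part of the benchmark code: the function name check_prerequisites is invented, and it assumes that .keys.ini lives at the repository root and that an uninitialized submodule shows up as a missing or empty directory.

```python
from pathlib import Path
from typing import List, Mapping


def check_prerequisites(repo_root: Path, env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []

    # Credentials: either an Azure key file or the OpenAI environment variable.
    # Assumption: .keys.ini sits at the repository root.
    has_azure = (repo_root / ".keys.ini").is_file()
    has_openai = bool(env.get("OPENAI_API_KEY"))
    if not (has_azure or has_openai):
        problems.append("no LLM credentials: add .keys.ini or set OPENAI_API_KEY")

    # Dataset: treat a missing or empty submodule directory as uninitialized.
    dataset_dir = repo_root / "data" / "benchmarks" / "longmemeval"
    if not dataset_dir.is_dir() or not any(dataset_dir.iterdir()):
        problems.append(
            "dataset missing: run "
            "git submodule update --init data/benchmarks/longmemeval"
        )

    return problems
```

Calling check_prerequisites(Path.cwd(), os.environ) from the repository root (after importing os) returns an empty list when both checks pass. It does not verify that the Poetry extras are installed or that the credentials are valid.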

Running

# Dry-run (validates config, no LLM calls)
python -m experiments.longmemeval_benchmark.run_evaluation \
  --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml \
  --dry-run

# Full evaluation
python -m experiments.longmemeval_benchmark.run_evaluation \
  --config experiments/longmemeval_benchmark/config/toy_3samples/config.yaml

Directory Structure

longmemeval_benchmark/
├── run_evaluation.py        # CLI entry point
├── runner/                  # Core pipeline
│   ├── benchmark_runner.py  # LongMemEvalOblivionRunner
│   ├── preparation_pipeline.py
│   ├── query_pipeline.py
│   ├── context_assembly.py
│   ├── config_factory.py
│   ├── memory_helpers.py
│   └── prompts.py
├── preparation/             # Memory preparation strategies
│   ├── strategies/          # Strategy implementations
│   └── prompts/             # Per-strategy prompt templates
├── metrics/                 # Retrieval metrics + cost tracking
├── cache/                   # Preparation cache + resume
├── analysis/                # Pipeline trace + exclusions
├── llm/                     # Async structured calls, throttling
├── models/                  # Pydantic data models
└── config/                  # YAML experiment configurations

Configuration

See config/README.md for details on the YAML configuration format.

Key config fields:

  • split: LongMemEval data split
  • model: LLM model name
  • provider: azure, openai, or internal
  • preparation_method: Strategy name (see table above)
  • preparation_cache_id: ID of a previous preparation run to reuse (defaults to PLACEHOLDER_RUN_PREPARATION_FIRST)
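To make the relationship between these fields concrete, here is a minimal sketch of how they could be modeled and checked. It is an illustration only: the project's actual models live in models/ and are described as Pydantic models, whereas this sketch uses a standard-library dataclass, and the class name BenchmarkConfig, its methods, and the example split and model values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Values taken from this README; the constant names themselves are illustrative.
PLACEHOLDER_CACHE_ID = "PLACEHOLDER_RUN_PREPARATION_FIRST"
KNOWN_PROVIDERS = {"azure", "openai", "internal"}
KNOWN_STRATEGIES = {
    "fastest", "lme_like", "lme_like_userfact",
    "compress", "lossless", "mapreduce",
}


@dataclass
class BenchmarkConfig:
    """Hypothetical stand-in for the key fields of a YAML experiment config."""
    split: str
    model: str
    provider: str
    preparation_method: str
    preparation_cache_id: str = PLACEHOLDER_CACHE_ID

    def validate(self) -> List[str]:
        """Return a list of validation errors; empty means the fields look sane."""
        errors = []
        if self.provider not in KNOWN_PROVIDERS:
            errors.append(f"unknown provider: {self.provider!r}")
        if self.preparation_method not in KNOWN_STRATEGIES:
            errors.append(f"unknown preparation_method: {self.preparation_method!r}")
        return errors

    def has_prepared_cache(self) -> bool:
        """True once the placeholder has been replaced with a real run ID."""
        return self.preparation_cache_id != PLACEHOLDER_CACHE_ID


# Placeholder split and model names, not real dataset or model identifiers.
cfg = BenchmarkConfig(
    split="example_split",
    model="example-model",
    provider="openai",
    preparation_method="lme_like",
)
print(cfg.validate())
print(cfg.has_prepared_cache())
```

The placeholder default suggests that the preparation phase is expected to run first and that its resulting ID is then copied into the config; that reading is an inference from the placeholder's name, so check config/README.md for the actual behavior.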