Benchmark Datasets

This directory contains optional git submodules for benchmark datasets used by the experiment scripts.

Available Datasets

| Dataset | Type | Cases | Categories | Splits | License | Path |
|---|---|---|---|---|---|---|
| LongMemEval (Wu et al., 2025) | Static | 500 | 6 | Oracle, S | MIT | data/benchmarks/longmemeval/ |
| GoodAI-LTM (Castillo-Bolado et al., 2024) | Dynamic | 33 | 7 | Isolated, 2K, 32K | MIT | data/benchmarks/goodai-ltm/ |

Both datasets are released under the permissive MIT license, which allows commercial use.

Experiment Scripts

This repository includes custom benchmark runners for both datasets in the experiments/ directory. These are not the original benchmark scripts — they are reimplementations tailored for evaluating the Oblivion memory framework.

GoodAI-LTM Benchmark (experiments/goodai_ltm_benchmark/)

Our implementation differs from the original GoodAI-LTM codebase in several ways:

  • Filler token batching: The original sends the entire filler trivia blob as a single message. Our version splits filler into sub-batches (default: 16 Q&A pairs per batch) to avoid overloading the memory recognizer, while maintaining the same total filler token count.
  • Vanilla LLM baseline fix: The original benchmark's Vanilla LLM baseline was limited to the last k messages (a small fixed window), which unfairly penalizes it at longer context settings. Our implementation uses the full conversation history up to the model's token limit (default: 120K tokens), providing a fairer comparison.
  • LTM Agent enhancements: The LTM Agent baseline includes improved query generation and retrieval quality. These changes increase API cost and latency but produce higher normalized scores than the original implementation.
  • Trivia data source: Filler trivia is loaded from goodai-ltm-benchmark/data/trivia/trivia.json (the submodule). The submodule must be initialized for filler generation to work.
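The first two differences above can be illustrated with a minimal Python sketch. It is not the code in experiments/goodai_ltm_benchmark/: the function names `split_filler` and `truncate_history` are hypothetical, and the whitespace-based token counter is a stand-in assumption for whatever tokenizer the runner actually uses. Only the defaults (16 Q&A pairs per batch, a 120K-token limit) come from the description above.

```python
from typing import Callable, List, Sequence, Tuple

QAPair = Tuple[str, str]  # (question, answer); hypothetical representation


def split_filler(pairs: Sequence[QAPair], batch_size: int = 16) -> List[List[QAPair]]:
    """Split filler Q&A pairs into consecutive sub-batches.

    Every pair is kept and order is preserved, so the total amount of
    filler is unchanged; only the message boundaries differ.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pairs[i:i + batch_size]) for i in range(0, len(pairs), batch_size)]


def truncate_history(
    messages: Sequence[str],
    max_tokens: int = 120_000,
    # Assumption: whitespace splitting stands in for a real tokenizer.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep the most recent messages whose combined token count fits the limit."""
    kept: List[str] = []
    total = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept
```

With 40 filler pairs and the default batch size, `split_filler` yields batches of 16, 16, and 8 pairs. `truncate_history` drops whole messages from the oldest end rather than cutting a message in half, which is one possible design choice among several.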

LongMemEval Benchmark (experiments/longmemeval_benchmark/)

The LongMemEval evaluation pipeline is a custom implementation built specifically for Oblivion. It does not depend on or import from the original longmemeval Python package. All memory preparation strategies, query pipelines, and metric computations are self-contained within the experiments/longmemeval_benchmark/ directory.

The pipeline uses the LongMemEval dataset files (JSON sessions and questions) from the submodule but implements its own preparation strategies inspired by the Oblivion Recognizer module.

Shared data loading utilities live in experiments/longmemeval_data_utils/, which is used by both the benchmark and the ablation experiments.

Dataset Placement

Benchmark scripts expect dataset files at:

data/
└── benchmarks/
    ├── longmemeval/     # LongMemEval dataset (git submodule)
    └── goodai-ltm/      # GoodAI-LTM benchmark data (git submodule)
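A quick way to confirm this layout before launching an experiment is a small pre-flight check. The sketch below is a hypothetical helper, not part of the repository; it relies only on the two paths listed above and on the fact that an uninitialized git submodule leaves an empty directory behind.

```python
from pathlib import Path
from typing import List

# Paths taken from the layout above, relative to the repository root.
EXPECTED_DATASET_DIRS = (
    "data/benchmarks/longmemeval",
    "data/benchmarks/goodai-ltm",
)


def missing_datasets(repo_root: Path) -> List[str]:
    """Return the expected dataset directories that are absent or empty."""
    missing = []
    for rel in EXPECTED_DATASET_DIRS:
        path = repo_root / rel
        # An uninitialized submodule shows up as an empty directory.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_datasets(Path.cwd()):
        print(f"Missing dataset: {rel} (try: git submodule update --init {rel})")
```

Run from the repository root, the script prints one line per dataset that still needs to be initialized and nothing when both are in place.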

Setup

Submodules are optional. Each one is needed only if you plan to run the benchmark experiments that use its dataset.

To clone with submodules (at initial clone time):

git clone --recurse-submodules <repo-url>

To initialize submodules after cloning:

git submodule update --init

To initialize only a specific submodule:

git submodule update --init data/benchmarks/longmemeval
git submodule update --init data/benchmarks/goodai-ltm