Benchmark Datasets

This directory contains optional git submodules for benchmark datasets used by the experiment scripts.

Available Datasets

| Dataset | Type | Cases | Categories | Splits | License | Path |
| --- | --- | --- | --- | --- | --- | --- |
| LongMemEval (Wu et al., 2025) | Static | 500 | 6 | Oracle, S | MIT | `data/benchmarks/longmemeval/` |
| GoodAI-LTM (Castillo-Bolado et al., 2024) | Dynamic | 33 | 7 | Isolated, 2K, 32K | MIT | `data/benchmarks/goodai-ltm/` |

Both datasets are released under the permissive MIT license, which allows commercial use.

Experiment Scripts

This repository includes custom benchmark runners for both datasets in the `experiments/` directory. These are not the original benchmark scripts; they are reimplementations tailored for evaluating the Oblivion memory framework.

GoodAI-LTM Benchmark (`experiments/goodai_ltm_benchmark/`)

Our implementation differs from the original GoodAI-LTM codebase in several ways:

- Filler token batching: The original sends the entire filler trivia blob as a single message. Our version splits the filler into sub-batches (default: 16 Q&A pairs per batch) to avoid overloading the memory recognizer while keeping the total filler token count unchanged.
- Vanilla LLM baseline fix: The original benchmark limits the Vanilla LLM to the last k messages (a small fixed window), which unfairly penalizes the baseline at longer context settings. Our implementation uses the full conversation history up to the model's token limit (default: 120K tokens), providing a fairer comparison.
- LTM Agent enhancements: Our LTM Agent baseline uses improved query generation and retrieval. These changes increase API cost and latency but produce higher normalized scores than the original implementation.
- Trivia data source: Filler trivia is loaded from `goodai-ltm-benchmark/data/trivia/trivia.json` (in the submodule). The submodule must be initialized for filler generation to work.
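
As a rough illustration of the first two differences, the sketch below splits a list of filler Q&A pairs into sub-batches of 16 and trims a conversation history to a token budget. It is not the code in `experiments/goodai_ltm_benchmark/`: the names `split_into_batches` and `trim_history` are hypothetical, the whitespace-based token count is a crude stand-in for a real tokenizer, and dropping the oldest messages first is an assumption about how the limit is enforced.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int = 16) -> Iterator[list[T]]:
    """Yield consecutive sub-batches; every item is kept, so the total is unchanged."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def trim_history(
    messages: list[str],
    max_tokens: int = 120_000,
    # Assumption: whitespace word count approximates tokens (a real tokenizer would differ).
    count_tokens: Callable[[str], int] = lambda m: len(m.split()),
) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


# 40 hypothetical Q&A pairs become batches of 16, 16 and 8.
pairs = [(f"Q{i}", f"A{i}") for i in range(40)]
batches = list(split_into_batches(pairs))
print([len(b) for b in batches])  # [16, 16, 8]
```

Because the batches are contiguous slices, concatenating them reproduces the original filler, which is what keeps the total filler token count the same.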

LongMemEval Benchmark (`experiments/longmemeval_benchmark/`)

The LongMemEval evaluation pipeline is a custom implementation built specifically for Oblivion. It does not depend on or import from the original longmemeval Python package. All memory preparation strategies, query pipelines, and metric computations are self-contained within the experiments/longmemeval_benchmark/ directory.

The pipeline uses the LongMemEval dataset files (JSON sessions and questions) from the submodule but implements its own preparation strategies inspired by the Oblivion Recognizer module.

Shared data-loading utilities live in `experiments/longmemeval_data_utils/` and are used by both the benchmark and the ablation experiments.

Dataset Placement

Benchmark scripts expect dataset files at:

data/
└── benchmarks/
    ├── longmemeval/     # LongMemEval dataset (git submodule)
    └── goodai-ltm/      # GoodAI-LTM benchmark data (git submodule)
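
A quick way to confirm this layout before running an experiment is to check that each directory exists and is non-empty, since an uninitialized submodule typically leaves an empty directory behind. The helper below is a minimal sketch, not part of the repository; `missing_datasets` is a hypothetical name, and it assumes it is run from the repository root.

```python
from pathlib import Path

# Paths taken from the layout above, relative to the repository root.
EXPECTED_DIRS = ("data/benchmarks/longmemeval", "data/benchmarks/goodai-ltm")


def missing_datasets(repo_root: Path, expected=EXPECTED_DIRS) -> list[str]:
    """Return the expected dataset directories that are absent or empty."""
    missing = []
    for rel in expected:
        path = repo_root / rel
        # A directory with no entries usually means the submodule was never initialized.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_datasets(Path.cwd()):
        print(f"{rel} is missing or empty; try: git submodule update --init {rel}")
```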

Setup

Submodules are optional. Each one is needed only if you plan to run the corresponding benchmark experiments.

To clone with submodules (at initial clone time):

git clone --recurse-submodules <repo-url>

To initialize submodules after cloning:

git submodule update --init

To initialize only one submodule, run the command for the dataset you need:

git submodule update --init data/benchmarks/longmemeval
git submodule update --init data/benchmarks/goodai-ltm
