This directory contains optional git submodules for benchmark datasets used by the experiment scripts.
| Dataset | Type | Cases | Categories | Splits | License | Path |
|---|---|---|---|---|---|---|
| LongMemEval (Wu et al., 2025) | Static | 500 | 6 | Oracle, S | MIT | data/benchmarks/longmemeval/ |
| GoodAI-LTM (Castillo-Bolado et al., 2024) | Dynamic | 33 | 7 | Isolated, 2K, 32K | MIT | data/benchmarks/goodai-ltm/ |

Both datasets are released under the permissive MIT license, which allows commercial use.
This repository includes custom benchmark runners for both datasets in the experiments/ directory.
These are not the original benchmark scripts; they are reimplementations tailored to evaluating
the Oblivion memory framework.
Our implementation differs from the original GoodAI-LTM codebase in several ways:
- Filler token batching: The original sends the entire filler trivia blob as a single message. Our version splits filler into sub-batches (default: 16 Q&A pairs per batch) to avoid overloading the memory recognizer, while maintaining the same total filler token count.
- Vanilla LLM baseline fix: The original benchmark's Vanilla LLM sees only the last k messages (a small fixed window), which unfairly penalizes the baseline at longer context settings. Our implementation passes the full conversation history, up to the model's token limit (default: 120K tokens), for a fairer comparison.
- LTM Agent enhancements: The LTM Agent baseline includes improved query generation and retrieval quality. These changes increase API cost and latency but produce higher normalized scores than the original implementation.
- Trivia data source: Filler trivia is loaded from goodai-ltm-benchmark/data/trivia/trivia.json (in the submodule). The submodule must be initialized for filler generation to work.
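
The first two differences can be illustrated with a minimal Python sketch. This is not the code in experiments/: the function names `batch_filler` and `truncate_history`, the (question, answer) tuple layout, and the whitespace-based token count are all assumptions made for illustration. Only the defaults (16 Q&A pairs per batch, a 120K-token limit) come from the list above.

```python
from typing import Callable, List, Sequence, Tuple

QAPair = Tuple[str, str]  # hypothetical layout: (question, answer)


def batch_filler(pairs: Sequence[QAPair], batch_size: int = 16) -> List[List[QAPair]]:
    """Split filler Q&A pairs into consecutive sub-batches.

    Every pair appears exactly once and in order, so the total amount of
    filler is the same as when it is sent as a single message.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pairs[i:i + batch_size]) for i in range(0, len(pairs), batch_size)]


def count_tokens(text: str) -> int:
    """Crude whitespace token count (assumption); a real runner would use the model's tokenizer."""
    return len(text.split())


def truncate_history(
    messages: Sequence[str],
    token_limit: int = 120_000,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep the most recent messages whose combined token count fits the limit."""
    kept: List[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that does not fit.
    for message in reversed(messages):
        cost = counter(message)
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


pairs = [(f"Q{i}", f"A{i}") for i in range(40)]
print([len(batch) for batch in batch_filler(pairs)])  # [16, 16, 8]
```

Dropping whole messages from the oldest end, as done here, is only one possible truncation policy; the actual runner may handle the limit differently.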
The LongMemEval evaluation pipeline is a custom implementation built specifically for Oblivion.
It does not depend on or import from the original longmemeval Python package. All memory
preparation strategies, query pipelines, and metric computations are self-contained within the
experiments/longmemeval_benchmark/ directory.
The pipeline uses the LongMemEval dataset files (JSON sessions and questions) from the submodule but implements its own preparation strategies inspired by the Oblivion Recognizer module.
Shared data loading utilities live in experiments/longmemeval_data_utils/, which is used by both
the benchmark and the ablation experiments.
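
As a rough illustration of reading a dataset file straight from the submodule, without the original longmemeval package, the sketch below loads one JSON file. The helper name `load_questions` is hypothetical and is not taken from experiments/longmemeval_data_utils/; it also assumes the file holds a top-level JSON list of question records.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_questions(dataset_file: Path) -> List[Dict[str, Any]]:
    """Load a JSON dataset file that is assumed to contain a list of records."""
    if not dataset_file.is_file():
        raise FileNotFoundError(
            f"{dataset_file} not found; is the longmemeval submodule initialized?"
        )
    with dataset_file.open(encoding="utf-8") as handle:
        records = json.load(handle)
    # Assumption: one record per question, stored as a top-level JSON list.
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON list in {dataset_file}")
    return records
```

A caller would pass the path of a dataset file under data/benchmarks/longmemeval/; the exact file names depend on the submodule contents.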
Benchmark scripts expect dataset files at:

    data/
    └── benchmarks/
        ├── longmemeval/    # LongMemEval dataset (git submodule)
        └── goodai-ltm/     # GoodAI-LTM benchmark data (git submodule)

Submodules are optional. They are only needed if you plan to run specific benchmark experiments.
To clone with submodules (at initial clone time):

    git clone --recurse-submodules <repo-url>

To initialize submodules after cloning:

    git submodule update --init

To initialize only a specific submodule:

    git submodule update --init data/benchmarks/longmemeval
    git submodule update --init data/benchmarks/goodai-ltm
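
Because the submodules are optional, a benchmark script may want to check that the data is present before it starts. The sketch below is one possible check, not part of the repository: the helper name `submodule_status` is hypothetical, and it relies on the heuristic that an uninitialized submodule is normally left as an empty directory.

```python
from pathlib import Path
from typing import Dict

# Paths from the directory layout above, relative to the repository root.
SUBMODULE_PATHS = {
    "longmemeval": Path("data/benchmarks/longmemeval"),
    "goodai-ltm": Path("data/benchmarks/goodai-ltm"),
}


def submodule_status(repo_root: Path) -> Dict[str, bool]:
    """Report which submodule directories exist and contain at least one entry.

    Emptiness is a heuristic for "not initialized", not a definitive check.
    """
    status: Dict[str, bool] = {}
    for name, relative in SUBMODULE_PATHS.items():
        directory = repo_root / relative
        status[name] = directory.is_dir() and any(directory.iterdir())
    return status


if __name__ == "__main__":
    for name, ready in submodule_status(Path(".")).items():
        hint = "ok" if ready else f"missing; run: git submodule update --init {SUBMODULE_PATHS[name].as_posix()}"
        print(f"{name}: {hint}")
```

Run from the repository root, this prints one line per dataset and repeats the matching init command for any submodule that appears to be missing.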