This directory contains benchmark evaluation scripts and ablation experiments for the Oblivion memory framework.
The GoodAI Long-Term Memory Benchmark (Castillo-Bolado et al., 2024) is a dynamic benchmark with 33 test cases across 7 categories. It evaluates memory systems by interleaving test content with filler trivia tokens, simulating realistic long-term memory pressure at configurable context lengths (1K–500K tokens).
```shell
# Install dependencies
poetry install --extras "goodai-benchmark"

# Run Oblivion agent on 32K benchmark
python -m experiments.goodai_ltm_benchmark.run_benchmark \
    benchmark=32k agent=oblivion

# Run Vanilla LLM baseline
python -m experiments.goodai_ltm_benchmark.run_benchmark \
    benchmark=32k agent=vanilla_llm
```

See goodai_ltm_benchmark/README.md for detailed usage, example commands, configuration parameters, and hyperparameter sweep documentation.
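To compare the Oblivion agent against the baseline in one pass, the two 32K invocations can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the `run_agents` function and the `DRY_RUN` variable are hypothetical names introduced here, and only the `benchmark=32k`, `agent=oblivion`, and `agent=vanilla_llm` overrides come from the commands in this README. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run (or just print) the 32K benchmark for each agent.
# DRY_RUN is an assumption of this sketch, not a flag of the benchmark runner.
run_agents() {
  local dry_run="${DRY_RUN:-1}"
  local agent
  for agent in oblivion vanilla_llm; do
    local cmd=(python -m experiments.goodai_ltm_benchmark.run_benchmark
               benchmark=32k "agent=${agent}")
    if [ "$dry_run" = "1" ]; then
      # Print the command instead of executing it.
      echo "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

run_agents
```

Set `DRY_RUN=0` to launch the runs for real once the `goodai-benchmark` extras are installed.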
LongMemEval (Wu et al., 2025) is a static benchmark with 500 test cases across 6 categories, evaluating long-term conversational memory through oracle and systematic splits. The evaluation pipeline in this repository is a custom implementation and does not depend on the original longmemeval package.
```shell
# Install dependencies
poetry install --extras "lme-benchmark"

# Initialize dataset submodule
git submodule update --init data/benchmarks/longmemeval
```

See longmemeval_benchmark/README.md for pipeline documentation, configuration, and preparation strategies.
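Because the dataset lives in an optional submodule, it can help to confirm that it is initialized before starting an evaluation. The sketch below relies on the documented behavior of `git submodule status`, which prefixes the line with `-` when a submodule has not been initialized. The `submodule_initialized` helper is a hypothetical name introduced for illustration; the repository may provide its own check.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds if a `git submodule status` line describes
# an initialized submodule (non-empty and not prefixed with "-").
submodule_initialized() {
  local status_line="$1"
  [[ -n "$status_line" && "$status_line" != -* ]]
}

SUBMODULE_PATH="data/benchmarks/longmemeval"

# An empty result (not a git repo, unknown path) is treated as uninitialized.
status_line="$(git submodule status -- "$SUBMODULE_PATH" 2>/dev/null || true)"

if submodule_initialized "$status_line"; then
  echo "initialized: $SUBMODULE_PATH"
else
  echo "not initialized; run: git submodule update --init $SUBMODULE_PATH"
fi
```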
| Directory | Description | Install Extras |
|---|---|---|
| `goodai_ltm_benchmark/` | GoodAI-LTM benchmark runner, baselines, hyperparameter sweeps, Streamlit UI | `goodai-benchmark` |
| `longmemeval_benchmark/` | LongMemEval evaluation framework (preparation + query pipeline) | `lme-benchmark` |
| `longmemeval_data_utils/` | Shared LongMemEval data loading and preparation utilities | `lme-benchmark` |
| `longmemeval_ablation_experiments/` | Ablation experiments: decayer temperature, hallucination analysis, error analysis | `goodai-benchmark` |
Both benchmark datasets are managed as optional git submodules under data/benchmarks/. See data/benchmarks/README.md for dataset details, licenses, and setup instructions.