Oblivion Experiments

This directory contains benchmark evaluation scripts and ablation experiments for the Oblivion memory framework.

Benchmarks

GoodAI-LTM Benchmark

The GoodAI Long-Term Memory Benchmark (Castillo-Bolado et al., 2024) is a dynamic benchmark with 33 test cases across 7 categories. It evaluates memory systems by interleaving test content with filler trivia tokens, simulating realistic long-term memory pressure at configurable context lengths (1K–500K tokens).

# Install dependencies
poetry install --extras "goodai-benchmark"

# Run Oblivion agent on 32K benchmark
python -m experiments.goodai_ltm_benchmark.run_benchmark \
    benchmark=32k agent=oblivion

# Run Vanilla LLM baseline
python -m experiments.goodai_ltm_benchmark.run_benchmark \
    benchmark=32k agent=vanilla_llm
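
To compare both agents on the same benchmark configuration, the two commands above can be wrapped in a small loop. This is a minimal sketch and not part of the repository: the `run_agents` function and the `DRY_RUN` variable are hypothetical, and only the `benchmark=32k`, `agent=oblivion`, and `agent=vanilla_llm` values shown above are assumed to exist. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
# Hypothetical helper (not part of the repository): run several agents
# on one benchmark configuration, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_agents() {
    benchmark="$1"
    shift
    for agent in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "python -m experiments.goodai_ltm_benchmark.run_benchmark benchmark=${benchmark} agent=${agent}"
        else
            python -m experiments.goodai_ltm_benchmark.run_benchmark \
                "benchmark=${benchmark}" "agent=${agent}" || return 1
        fi
    done
}
```

For example, `DRY_RUN=1 run_agents 32k oblivion vanilla_llm` prints the two commands listed above, and dropping `DRY_RUN=1` runs them in sequence.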

See goodai_ltm_benchmark/README.md for detailed usage, example commands, configuration parameters, and hyperparameter sweep documentation.

LongMemEval Benchmark

LongMemEval (Wu et al., 2025) is a static benchmark with 500 test cases across 6 categories, evaluating long-term conversational memory through oracle and systematic splits. The evaluation pipeline in this repository is a custom implementation that does not depend on the original longmemeval package.

# Install dependencies
poetry install --extras "lme-benchmark"

# Initialize dataset submodule
git submodule update --init data/benchmarks/longmemeval
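
Before running the pipeline, it can help to confirm that the dataset submodule was actually checked out. `git submodule status` prefixes each line with `-` when the submodule is not initialized, `+` when the checked-out commit differs from the one recorded in the superproject, and `U` when it has merge conflicts. The sketch below interprets that prefix; the function name `submodule_state` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: classify one line of `git submodule status` output
# by its leading status character.
submodule_state() {
    case "$1" in
        "") echo "unknown" ;;        # no output: path is not a registered submodule
        -*) echo "uninitialized" ;;  # run `git submodule update --init <path>`
        +*) echo "out-of-sync" ;;    # checked-out commit differs from the recorded one
        U*) echo "conflict" ;;       # submodule has merge conflicts
        *)  echo "initialized" ;;    # leading space: checked out at the recorded commit
    esac
}

# Example usage from the repository root:
#   submodule_state "$(git submodule status data/benchmarks/longmemeval)"
```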

See longmemeval_benchmark/README.md for pipeline documentation, configuration, and preparation strategies.

Directory Structure

| Directory | Description | Install Extras |
|---|---|---|
| goodai_ltm_benchmark/ | GoodAI-LTM benchmark runner, baselines, hyperparameter sweeps, Streamlit UI | goodai-benchmark |
| longmemeval_benchmark/ | LongMemEval evaluation framework (preparation + query pipeline) | lme-benchmark |
| longmemeval_data_utils/ | Shared LongMemEval data loading and preparation utilities | lme-benchmark |
| longmemeval_ablation_experiments/ | Ablation experiments: decayer temperature, hallucination analysis, error analysis | goodai-benchmark |

Benchmark Datasets

Both benchmark datasets are managed as optional git submodules under data/benchmarks/. See data/benchmarks/README.md for dataset details, licenses, and setup instructions.