
BEAM

How Kimetsu scores on BEAM, the ten-ability long-term-memory benchmark, at the 100K and 1M token buckets.

BEAM probes ten memory abilities (information extraction, multi-session reasoning, knowledge update, temporal reasoning, abstention, contradiction resolution, event ordering, instruction following, preference following, summarization) over conversations from 128K to 10M tokens, graded by an LLM judge against per-question rubrics. The kbench beam driver ingests each conversation into a fresh brain, retrieves per probe, answers with the Codex reader, and judges against the rubric.
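The four driver stages above (ingest, retrieve, answer, judge) can be sketched as a small loop. This is a toy illustration only, not the kbench implementation: `ToyBrain`, `Probe`, `run_conversation`, the word-overlap ranking, and the word-count token estimate are all hypothetical stand-ins, and the reader and judge are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Probe:
    ability: str   # e.g. "information extraction"
    question: str
    rubric: str    # what the judge checks the answer against


@dataclass
class ToyBrain:
    """Hypothetical stand-in memory: stores turns, ranks by word overlap."""
    turns: list[str] = field(default_factory=list)

    def ingest(self, turn: str) -> None:
        self.turns.append(turn)

    def retrieve(self, query: str, budget_tokens: int) -> list[str]:
        words = set(query.lower().split())
        ranked = sorted(
            self.turns,
            key=lambda t: len(words & set(t.lower().split())),
            reverse=True,
        )
        picked, used = [], 0
        for turn in ranked:
            cost = len(turn.split())  # crude token estimate (assumption)
            if used + cost > budget_tokens:
                break
            picked.append(turn)
            used += cost
        return picked


def run_conversation(
    turns: list[str],
    probes: list[Probe],
    reader: Callable[[str, list[str]], str],
    judge: Callable[[str, str], bool],
    budget_tokens: int,
) -> list[tuple[str, bool]]:
    brain = ToyBrain()  # fresh brain per conversation
    for turn in turns:
        brain.ingest(turn)
    results = []
    for probe in probes:
        context = brain.retrieve(probe.question, budget_tokens)
        answer = reader(probe.question, context)
        results.append((probe.ability, judge(answer, probe.rubric)))
    return results
```

In the real pipeline the reader is the Codex backend and the judge is an LLM scoring against the rubric; here any two functions with matching signatures will do.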

Results: 100K bucket

Run on the full 100K bucket (20 conversations, 400 probes) with the v2.5 graph-lite backend and a 96k-token retrieval budget.

ability                     accuracy
contradiction resolution    40/40 (100%)
preference following        38/40 (95%)
information extraction      36/40 (90%)
summarization               34/40 (85%)
temporal reasoning          31/40 (77.5%)
multi-session reasoning     30/40 (75%)
instruction following       29/40 (72.5%)
knowledge update            24/40 (60%)
abstention                  18/40 (45%)
event ordering              13/40 (32.5%)
overall                     293/400 (73.3%)
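The overall row is just the sum of the per-ability counts. A minimal sketch of that arithmetic, using the counts from the table above (the function name `overall` is illustrative, not part of kbench):

```python
# Correct answers per ability in the 100K bucket, copied from the table.
correct_100k = {
    "contradiction resolution": 40,
    "preference following": 38,
    "information extraction": 36,
    "summarization": 34,
    "temporal reasoning": 31,
    "multi-session reasoning": 30,
    "instruction following": 29,
    "knowledge update": 24,
    "abstention": 18,
    "event ordering": 13,
}
PROBES_PER_ABILITY = 40  # 400 probes across ten abilities


def overall(correct: dict[str, int], per_ability: int) -> tuple[int, int, float]:
    """Return (total correct, total probes, accuracy in percent)."""
    total = sum(correct.values())
    n = per_ability * len(correct)
    return total, n, 100 * total / n


total, n, pct = overall(correct_100k, PROBES_PER_ABILITY)
print(f"{total}/{n} = {pct}%")
```

The exact figure is 73.25%, which the table reports as 73.3%.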

Three notes on what drives the result:

  • Graph ranking moved the number without any model calls. The flat baseline scored 62.3% on the same set. Graph-lite lets a probe's top hits pull in their edge-connected neighbours with hop-decayed relevance, lifting the abilities that stitch facts across sessions (multi-session reasoning 35% → 75%, preference following 72.5% → 95%).
  • Global abilities need budget. Contradiction resolution, summarization, and temporal reasoning need the whole arc. At 48k tokens they scored near zero; at 96k they reach 100%, 85%, and 77.5% respectively. The fix is a knob, not a redesign.
  • The reader and judge are LLMs; the memory is not. The pipeline feeding the reader makes zero model calls. The two laggards, abstention (45%) and event ordering (32.5%), are reader-behavior limits, not retrieval gaps.
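The first note describes top hits pulling in edge-connected neighbours with hop-decayed relevance. One way such an expansion could look is sketched below. It is an assumption-laden illustration, not the graph-lite code: the function name, the 0.5 decay factor, the two-hop limit, and the rule of keeping the best score per node are all choices made here for the example.

```python
from collections import deque


def expand_with_hop_decay(
    seed_scores: dict[str, float],
    edges: dict[str, list[str]],
    decay: float = 0.5,   # assumed per-hop multiplier
    max_hops: int = 2,    # assumed expansion limit
) -> list[tuple[str, float]]:
    """Rank seeds plus their neighbours; a neighbour h hops from a seed
    inherits seed_score * decay**h, keeping the best score it receives."""
    scores = dict(seed_scores)
    for seed, seed_score in seed_scores.items():
        frontier = deque([(seed, 0)])
        seen = {seed}
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue  # do not expand past the hop limit
            for neighbour in edges.get(node, ()):
                if neighbour in seen:
                    continue
                seen.add(neighbour)
                inherited = seed_score * decay ** (hops + 1)
                if inherited > scores.get(neighbour, 0.0):
                    scores[neighbour] = inherited
                frontier.append((neighbour, hops + 1))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# A direct hit "a" pulls in "b" (1 hop) and "c" (2 hops); "d" is too far.
ranked = expand_with_hop_decay(
    {"a": 1.0}, {"a": ["b"], "b": ["c"], "c": ["d"]}
)
```

Nothing in this expansion calls a model: it is plain graph traversal over stored edges, which matches the model-free claim in the notes above.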

Reproduce with kbench beam --dataset beam-100k.json --reader-backend codex; the converter that builds the dataset from the BEAM repo ships in the bench tooling.

Results: 1M bucket

A 1M-token conversation exceeds any context window, so this bucket tests retrieval ranking: the 96k budget surfaces ~10% of the conversation. Run on 15 of the 35 conversations (300 probes), uniform 96k budget.

ability                     accuracy
contradiction resolution    27/30 (90%)
knowledge update            26/30 (86.7%)
preference following        25/30 (83.3%)
information extraction      24/30 (80%)
summarization               23/30 (76.7%)
instruction following       20/30 (66.7%)
multi-session reasoning     20/30 (66.7%)
temporal reasoning          15/30 (50%)
event ordering              9/30 (30%)
abstention                  9/30 (30%)
overall                     198/300 (66.0%)

  • 66.0% edges mem0's self-reported 62% at the matched bucket. Caveats in How Kimetsu compares.
  • Global and temporal abilities degrade with scale, as expected. When the conversation is ~10x the budget, whole-arc tasks lose ground that local-fact tasks keep.
  • The buckets are not a controlled A/B: different conversations, and the 100K run used a smaller budget for the localized abilities. Each overall is a standalone, reproducible figure.
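To see which abilities move most between the two buckets, the per-ability accuracies can be differenced directly. The sketch below uses only the counts from the two tables; the variable names are illustrative. Because the buckets are not a controlled A/B, the deltas are descriptive rather than a measured scaling effect.

```python
# Correct counts per ability: 100K bucket (out of 40) and 1M bucket (out of 30).
correct_100k = {
    "contradiction resolution": 40, "preference following": 38,
    "information extraction": 36, "summarization": 34,
    "temporal reasoning": 31, "multi-session reasoning": 30,
    "instruction following": 29, "knowledge update": 24,
    "abstention": 18, "event ordering": 13,
}
correct_1m = {
    "contradiction resolution": 27, "knowledge update": 26,
    "preference following": 25, "information extraction": 24,
    "summarization": 23, "instruction following": 20,
    "multi-session reasoning": 20, "temporal reasoning": 15,
    "event ordering": 9, "abstention": 9,
}

# Change in accuracy from 100K to 1M, in percentage points.
delta = {
    ability: 100 * (correct_1m[ability] / 30 - correct_100k[ability] / 40)
    for ability in correct_100k
}

for ability, points in sorted(delta.items(), key=lambda kv: kv[1]):
    print(f"{ability:26s} {points:+.1f}")
```

On these figures temporal reasoning shows the largest drop (27.5 points), while knowledge update is the only ability that scores higher at 1M than at 100K.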

Reproduce with kbench beam --dataset beam-1m.json --limit 15 --reader-backend codex. The 10M bucket is future work: a faithful run needs the write-time distiller in the loop rather than raw per-turn ingest. mem0 reports 48.6% there; Cognee reports 67%.
