BEAM

How Kimetsu scores on BEAM, the ten-ability long-term-memory benchmark, at the 100K and 1M token buckets.

BEAM probes ten memory abilities (information extraction, multi-session reasoning, knowledge update, temporal reasoning, abstention, contradiction resolution, event ordering, instruction following, preference following, summarization) over conversations from 128K to 10M tokens, graded by an LLM judge against per-question rubrics. The kbench beam driver ingests each conversation into a fresh brain, retrieves per probe, answers with the Codex reader, and judges against the rubric.
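
The four driver steps described above can be sketched as a small loop. This is a minimal illustration, not the kbench implementation: `run_conversation` and every callable it takes (`new_brain`, `ingest`, `retrieve`, `read`, `judge`) are hypothetical names, and the toy stand-ins replace the real reader and judge with plain string checks.

```python
from collections import defaultdict


def run_conversation(turns, probes, new_brain, ingest, retrieve, read, judge,
                     budget_tokens=96_000):
    """Score one conversation; returns {ability: (correct, total)}.

    All callables are hypothetical stand-ins for the real driver's stages.
    """
    brain = new_brain()                       # fresh brain per conversation
    for turn in turns:
        ingest(brain, turn)                   # assumed: one turn at a time
    tally = defaultdict(lambda: [0, 0])
    for ability, question, rubric in probes:
        context = retrieve(brain, question, budget_tokens)
        answer = read(question, context)      # an LLM reader in the real run
        tally[ability][0] += int(judge(answer, rubric))  # an LLM judge in the real run
        tally[ability][1] += 1
    return {ability: tuple(counts) for ability, counts in tally.items()}


# Toy stand-ins so the sketch runs without any model or the real driver.
demo = run_conversation(
    turns=["I moved to Oslo in May.", "My cat is named Miso."],
    probes=[("information extraction", "cat", "Miso"),
            ("knowledge update", "Oslo", "Bergen")],
    new_brain=list,
    ingest=lambda brain, turn: brain.append(turn),
    retrieve=lambda brain, question, budget: [t for t in brain if question in t],
    read=lambda question, context: " ".join(context),
    judge=lambda answer, rubric: rubric in answer,
)
print(demo)
```

Passing the stages in as callables keeps the memory side (ingest and retrieve) separable from the model side (read and judge).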

Results: 100K bucket

The full 100K bucket (20 conversations, 400 probes), v2.5 graph-lite backend, 96k retrieval budget.

ability	accuracy
contradiction resolution	40/40 (100%)
preference following	38/40 (95%)
information extraction	36/40 (90%)
summarization	34/40 (85%)
temporal reasoning	31/40 (77.5%)
multi-session reasoning	30/40 (75%)
instruction following	29/40 (72.5%)
knowledge update	24/40 (60%)
abstention	18/40 (45%)
event ordering	13/40 (32.5%)
overall	293/400 (73.3%)
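
The overall row can be rechecked from the per-ability counts. The sketch below copies the counts from the table above; the function and variable names are illustrative only.

```python
# Correct answers per ability, copied from the 100K table (40 probes each).
CORRECT_100K = {
    "contradiction resolution": 40,
    "preference following": 38,
    "information extraction": 36,
    "summarization": 34,
    "temporal reasoning": 31,
    "multi-session reasoning": 30,
    "instruction following": 29,
    "knowledge update": 24,
    "abstention": 18,
    "event ordering": 13,
}
PROBES_PER_ABILITY = 40


def overall_accuracy(correct, probes_per_ability):
    """Return (total correct, total probes, accuracy in percent)."""
    total_correct = sum(correct.values())
    total_probes = probes_per_ability * len(correct)
    return total_correct, total_probes, 100.0 * total_correct / total_probes


total_correct, total_probes, pct = overall_accuracy(CORRECT_100K, PROBES_PER_ABILITY)
print(f"{total_correct}/{total_probes} ({pct:.2f}%)")
```

The exact figure is 73.25%, which the table reports as 73.3%.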

Three notes on what drives the result:

Graph ranking moved the number without any model calls. The flat baseline scored 62.3% on the same set. Graph-lite lets a probe's top hits pull in their edge-connected neighbours with hop-decayed relevance, lifting the abilities that stitch facts across sessions (multi-session 35% → 75%, preference 72.5% → 95%).
Global abilities need budget. Summarization, contradiction resolution, and temporal reasoning need the whole arc. At 48k tokens they scored near zero; at 96k they reach 85% / 100% / 77.5% respectively. The fix is a knob, not a redesign.
The reader and judge are LLMs; the memory is not. The pipeline feeding the reader makes zero model calls. The two laggards, abstention (45%) and event ordering (32.5%), are reader-behavior limits, not retrieval gaps.
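
The hop-decayed expansion mentioned in the first note can be sketched as a bounded breadth-first walk from each top hit. This is a minimal illustration under stated assumptions, not Kimetsu's ranking code: the function name, the `decay` and `max_hops` parameters, and the choice to keep the maximum score per node (rather than sum contributions) are all hypothetical.

```python
from collections import deque


def expand_with_hop_decay(seed_scores, edges, decay=0.5, max_hops=2):
    """Spread each seed's relevance to its neighbours, scaled by decay**hops.

    seed_scores: {node: relevance} for the probe's top hits.
    edges: {node: [neighbour, ...]} adjacency lists.
    A node reached from several seeds keeps its best score (an assumption).
    """
    scores = dict(seed_scores)
    for seed, seed_score in seed_scores.items():
        queue = deque([(seed, 0)])
        visited = {seed}
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not walk past the hop limit
            for neighbour in edges.get(node, ()):
                if neighbour in visited:
                    continue
                visited.add(neighbour)
                candidate = seed_score * decay ** (hops + 1)
                if candidate > scores.get(neighbour, 0.0):
                    scores[neighbour] = candidate
                queue.append((neighbour, hops + 1))
    # Highest score first; ties broken by node name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = expand_with_hop_decay(
    seed_scores={"a": 1.0, "b": 0.8},
    edges={"a": ["c"], "c": ["d"], "b": ["d"]},
)
print(ranked)
```

In this toy graph, `d` is two hops from `a` (score 0.25) but one hop from `b` (score 0.4), so it ranks on the stronger of the two paths. No model is called at any point, which matches the model-free claim in the note.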

Reproduce with kbench beam --dataset beam-100k.json --reader-backend codex; the converter that builds the dataset from the BEAM repo ships in the bench tooling.

Results: 1M bucket

A 1M-token conversation exceeds any context window, so this bucket tests retrieval ranking: the 96k budget surfaces ~10% of the conversation. Run on 15 of the 35 conversations (300 probes), uniform 96k budget.

ability	accuracy
contradiction resolution	27/30 (90%)
knowledge update	26/30 (86.7%)
preference following	25/30 (83.3%)
information extraction	24/30 (80%)
summarization	23/30 (76.7%)
instruction following	20/30 (66.7%)
multi-session reasoning	20/30 (66.7%)
temporal reasoning	15/30 (50%)
event ordering	9/30 (30%)
abstention	9/30 (30%)
overall	198/300 (66.0%)

66.0% edges out mem0's self-reported 62% at the matched bucket. See How Kimetsu compares for the caveats.
Global and temporal abilities degrade with scale, as expected. When the conversation is ~10x the budget, whole-arc tasks lose ground that local-fact tasks keep.
The buckets are not a controlled A/B: different conversations, and the 100K run used a smaller budget for the localized abilities. Each overall is a standalone, reproducible figure.
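
The "~10%" and "~10x" figures follow from dividing the budget by the conversation length. The sketch below treats each bucket's nominal size as the conversation length, which is a simplifying assumption; real conversations vary around those sizes.

```python
def budget_coverage(budget_tokens, conversation_tokens):
    """Fraction of a conversation that fits in the retrieval budget (capped at 1)."""
    return min(1.0, budget_tokens / conversation_tokens)


BUDGET_TOKENS = 96_000
# Nominal bucket sizes, used here as stand-ins for actual conversation lengths.
NOMINAL_SIZES = {"100K": 100_000, "1M": 1_000_000}

for bucket, size in NOMINAL_SIZES.items():
    share = budget_coverage(BUDGET_TOKENS, size)
    print(f"{bucket}: budget covers {share:.1%}, conversation is {size / BUDGET_TOKENS:.1f}x the budget")
```

At the 100K bucket the budget holds nearly the whole conversation, so ranking matters little; at 1M roughly nine tenths of the conversation must be left out, which is why the notes frame this bucket as a test of retrieval ranking.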

Reproduce with kbench beam --dataset beam-1m.json --limit 15 --reader-backend codex. The 10M bucket is future work: a faithful run needs the write-time distiller in the loop rather than raw per-turn ingest. mem0 reports 48.6% there; Cognee reports 67%.
