Kimetsu
Memory Benchmark

Retrieval & correctness

Kimetsu's retrieval and correctness numbers: recall, MRR, latency, stale-hit suppression, and contradiction resolution, all reproducible from the shipped CLI.

Retrieval quality

On a 100-memory / 210-case dataset seeded from real exported memories, covering six case types (keyword, paraphrase, oblique, confusable, in-domain-no-answer, multi-answer):

| metric | value |
| --- | --- |
| recall@4 | 0.949 (default reranker), up to 0.975 |
| MRR | 0.914 (default), up to 0.933 |
| latency | ~138 ms per retrieval + rerank |

The default reranker (ms-marco-tinybert-l-2-v2) is the fastest reranked combination; the quality-best rerankers reach recall@4 0.975 / MRR 0.933 at higher latency. Swap the embedder and the reranker with one config key each, then re-run the evaluation on your own corpus.
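For readers who want to sanity-check the two quality metrics by hand, here is a minimal Python sketch. It is not Kimetsu's harness code. It assumes recall@k is scored as a hit rate (a case counts if any relevant memory appears in the top k), that MRR uses the first relevant memory anywhere in the ranked list, and that cases with no relevant memory are skipped; the shipped bench may score these differently. The function names and the sample cases are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Case = Tuple[Sequence[str], Set[str]]  # (ranked memory ids, relevant memory ids)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Hit-rate style recall: 1.0 if any relevant id is in the top k, else 0.0."""
    return 1.0 if any(mid in relevant for mid in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, mid in enumerate(ranked, start=1):
        if mid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: List[Case], k: int = 4) -> Dict[str, float]:
    """Average both metrics over the cases that have at least one answer."""
    # Assumption: no-answer cases are excluded from these two averages.
    scored = [(ranked, rel) for ranked, rel in cases if rel]
    n = len(scored)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in scored) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in scored) / n,
    }


# Hypothetical cases, for illustration only.
cases: List[Case] = [
    (["m1", "m2", "m3", "m4"], {"m1"}),        # relevant at rank 1
    (["m9", "m2", "m3", "m4"], {"m3"}),        # relevant at rank 3
    (["m5", "m6", "m7", "m8", "m1"], {"m1"}),  # relevant at rank 5: misses the top 4
]
result = evaluate(cases, k=4)
print(result)
```

The third case shows why the two numbers diverge: it contributes nothing to recall@4 but still adds 1/5 to the MRR sum under the assumptions above.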

Memory correctness (v2.5)

v2.5 ("The best memory") added a temporal validity model, automatic contradiction resolution, and validity-aware retrieval. We measure two things a plain vector store cannot do, on a correctness dataset of knowledge-update, contradiction, and temporal cases:

  • stale-hit rate: how often a superseded / outdated memory still shows up in the top-k. Lower is better.
  • resolution accuracy: on contradiction and knowledge-update cases, how often the current / correct memory outranks the stale one. Higher is better.

| metric | before (flat retrieval) | v2.5 | change |
| --- | --- | --- | --- |
| stale-hit rate | 0.500 | 0.091 | −82% (relative) |
| resolution accuracy | 0.364 | 0.909 | +0.545 (absolute) |
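The two definitions above can be sketched in a few lines of Python. This is an illustration of the metrics as described on this page, not the harness implementation. The `CorrectnessCase` record, its field names, and the rule that a memory missing from the results ranks last are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CorrectnessCase:
    ranked: List[str]  # retrieved memory ids, best first
    current_id: str    # the memory that is currently valid
    stale_id: str      # the superseded / outdated memory


def stale_hit(case: CorrectnessCase, k: int) -> bool:
    """True if the stale memory still appears in the top k."""
    return case.stale_id in case.ranked[:k]


def resolved(case: CorrectnessCase) -> bool:
    """True if the current memory strictly outranks the stale one."""
    def rank(mid: str) -> int:
        # Assumption: a memory that was not retrieved ranks after everything.
        return case.ranked.index(mid) if mid in case.ranked else len(case.ranked)
    return rank(case.current_id) < rank(case.stale_id)


def correctness_metrics(cases: List[CorrectnessCase], k: int = 4) -> Dict[str, float]:
    n = len(cases)
    return {
        "stale_hit_rate": sum(stale_hit(c, k) for c in cases) / n,
        "resolution_accuracy": sum(resolved(c) for c in cases) / n,
    }


# Hypothetical cases, for illustration only.
cases = [
    CorrectnessCase(["new", "old", "x"], "new", "old"),  # stale hit, resolved
    CorrectnessCase(["old", "new"], "new", "old"),       # stale hit, not resolved
    CorrectnessCase(["new", "x", "y"], "new", "old"),    # no stale hit, resolved
    CorrectnessCase(["x", "y"], "new", "old"),           # neither memory retrieved
]
metrics = correctness_metrics(cases)
print(metrics)
```

Note that the two metrics are not complements: a case can avoid a stale hit and still fail resolution if the current memory is not retrieved either.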

A plain semantic store returns both the old and the new fact because cosine similarity does not track recency or supersession. As a result, a stale fact surfaces in the top-k about half the time (0.500), and the current memory outranks the stale one in only about a third of contradiction and knowledge-update cases (0.364). With v2.5, superseded facts are excluded from default retrieval (they remain queryable for history), and a new memory that contradicts an old one is resolved automatically by confidence × recency. The losing memory is marked invalidated-as-of rather than deleted, so its lineage is preserved.
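To make the resolution rule concrete, here is a rough Python sketch of scoring two contradicting memories by confidence × recency and invalidating the loser without deleting it. Everything specific in it is an assumption for illustration: the `Memory` fields, the exponential recency decay with a 30-day half-life, and the `retrievable` filter are hypothetical and are not taken from Kimetsu's implementation.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Memory:
    id: str
    text: str
    confidence: float                        # 0.0 .. 1.0
    created_at: datetime
    invalid_as_of: Optional[datetime] = None
    superseded_by: Optional[str] = None      # lineage pointer to the winner


def recency_weight(created_at: datetime, now: datetime,
                   half_life_days: float = 30.0) -> float:
    """Assumed decay: the weight halves every `half_life_days`."""
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def resolve(a: Memory, b: Memory, now: datetime) -> Tuple[Memory, Memory]:
    """Return (winner, loser); the loser is invalidated, not destroyed."""
    def score(m: Memory) -> float:
        return m.confidence * recency_weight(m.created_at, now)
    winner, loser = (a, b) if score(a) >= score(b) else (b, a)
    loser = replace(loser, invalid_as_of=now, superseded_by=winner.id)
    return winner, loser


def retrievable(memories: List[Memory], include_history: bool = False) -> List[Memory]:
    """Default retrieval skips invalidated memories; history queries keep them."""
    return [m for m in memories if include_history or m.invalid_as_of is None]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
old = Memory("m1", "deploys go through Jenkins", 0.9, now - timedelta(days=60))
new = Memory("m2", "deploys go through GitHub Actions", 0.8, now - timedelta(days=1))

winner, loser = resolve(old, new, now)
print(winner.id, loser.invalid_as_of, loser.superseded_by)
print([m.id for m in retrievable([winner, loser])])
```

In this example the older memory has the higher confidence but loses because its recency weight has decayed to 0.25, which is the behavior the confidence × recency product is meant to capture.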

No regression

The correctness work did not cost retrieval quality. The v2.0 retrieval baseline is unchanged in v2.5: on the 18-memory / 100-case set, recall@4 0.977 / MRR 0.941 before and after.

Cost

On a recorded 16-task Terminal-Bench slice, runs with the brain cost about 13× less per win than the no-brain baseline ($0.19 vs $2.47), measured on Claude Code at Claude pricing. See docs/ROI-METHODOLOGY.md for the methodology and the kimetsu brain roi ledger for per-memory savings on your own work.
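The 13× figure follows directly from the two reported per-win costs. The short Python check below reproduces it; the `cost_per_win` helper reflects an assumed definition (total spend across all runs divided by the number of wins) and is not taken from docs/ROI-METHODOLOGY.md.

```python
def cost_per_win(total_cost_usd: float, wins: int) -> float:
    """Assumed definition: spend on all runs, wins and losses, divided by wins."""
    if wins <= 0:
        raise ValueError("cost per win is undefined with zero wins")
    return total_cost_usd / wins


# Per-win costs reported above for the 16-task slice.
with_brain_usd = 0.19
no_brain_usd = 2.47

ratio = no_brain_usd / with_brain_usd
print(f"{ratio:.1f}x cheaper per win")
```

Dividing by wins rather than by tasks means that failed runs still count toward cost, so a configuration that fails often is penalized even if each individual run is cheap.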

How to reproduce

# retrieval quality + correctness metrics (semantic build, cached models)
kimetsu brain bench --dataset <fixture>.json \
  --embedders jina-v2-base-code --rerankers ms-marco-tinybert-l-2-v2

# the summary table reports recall@2/4, MRR, latency, and (when the fixture
# has temporal/contradiction cases) stale_hit_rate + resolution_accuracy.

The eval fixtures live in the bench tooling; the harness (kimetsu brain bench) ships in the CLI, so you can run the same metrics against your own exported memories.
