Benchmarks

Published results, grouped by benchmark. Each row keeps its backbone LLM, embedder, source, and a trust badge, because a memory score is only as meaningful as the pipeline that produced it and the party that measured it. The context-window baseline, where listed, shows how far naive prompt-stuffing gets.

Independent: A neutral party ran it.

Self-reported: The framework's own vendor reported it.

Unverified: Source is neutral but not yet reproduced.
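
For readers who want to work with these rows programmatically, here is a minimal sketch of one way to model a row and its trust badge. It assumes rows are tab-separated in the column order used by the tables on this page (Framework, Value, Backbone, Embedder, Trust, Source, Date). The names `Trust`, `Result`, `parse_row`, and `SAMPLE` are hypothetical and not part of any published tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

MISSING = "\u2014"  # the dash placeholder the tables use for "not stated"


class Trust(Enum):
    INDEPENDENT = "Independent"
    SELF_REPORTED = "Self-reported"
    UNVERIFIED = "Unverified"


@dataclass(frozen=True)
class Result:
    framework: str
    value: float
    metric: str
    backbone: Optional[str]
    embedder: Optional[str]
    trust: Trust
    source: str
    date: str


def parse_row(line: str) -> Result:
    """Parse one tab-separated table row (assumed column order, see lead-in)."""
    framework, value, backbone, embedder, trust, source, date = line.split("\t")
    number, metric = value.split(" ", 1)  # e.g. "75.8 accuracy"
    return Result(
        framework=framework,
        value=float(number),
        metric=metric,
        backbone=None if backbone == MISSING else backbone,
        embedder=None if embedder == MISSING else embedder,
        trust=Trust(trust),
        source=source.rstrip(" \u2197"),  # drop the trailing link arrow
        date=date,
    )


# A row copied from the LoCoMo table on this page.
SAMPLE = "MemOS\t75.8 accuracy\tGPT-4o-mini\t\u2014\tSelf-reported\tMemOS (MemTensor et al.) \u2197\t2025-07-04"
row = parse_row(SAMPLE)
print(row.framework, row.value, row.trust.value)
```

Keeping the badge as an enum rather than a free string means a misspelled or unknown badge raises a `ValueError` at parse time instead of silently forming its own group.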

LoCoMo

32K-context era (ACL 2024) · 1,982 questions

Long-term, multi-session conversational recall (single-hop, multi-hop, open-domain, temporal).

Caveats

Average context length is modest by 2026 standards; a 'dump everything into the prompt' baseline now scores competitively.
Does not explicitly score knowledge updates.

Framework	Value	Backbone	Embedder	Trust	Source	Date
ByteRover	96.1 accuracy	Gemini 3 Flash (curation/query) + Gemini 3.1 Pro (justifier)	—	Self-reported	ByteRover team (Nguyen et al.) ↗	2026-04-02
Mem0	92.5 accuracy	—	—	Self-reported	Mem0 ↗	2026-04-01
ByteRover	92.2 accuracy	Gemini 3 Flash (curation/judge) + Gemini 3 Pro (answer/justifier, best run)	—	Self-reported	ByteRover ↗	2026-02-27
Honcho	89.9 accuracy	—	—	Self-reported	Honcho (Plastic Labs) ↗	2026-05-26
MIRIX	85.38 accuracy	gpt-4.1-mini	—	Self-reported	MIRIX (Wang & Chen) ↗	2025-07-10
Memori	81.95 accuracy	—	—	Self-reported	Memori (MemoriLabs) ↗	2026-05-28
MemOS	75.8 accuracy	GPT-4o-mini	—	Self-reported	MemOS (MemTensor et al.) ↗	2025-07-04
Letta (MemGPT)	74 accuracy	gpt-4o-mini	text-embedding-3-large	Self-reported	Letta (MemGPT authors — Packer, Wooders et al.) ↗	2025-08-12
LiCoMemory	67.2 accuracy	gpt-4o-mini	BGE-M3	Self-reported	LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗	2025-11-03
Mem0	66.88 accuracy	—	—	Independent	Hindsight/Vectorize (competitor re-run) ↗	2026-04-02
LiCoMemory	62.99 accuracy	Llama-3.1-70B-Instruct-Turbo	BGE-M3	Self-reported	LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗	2025-11-03
Mem0	54.68 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
A-MEM	48.59 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
A-MEM	48.38 accuracy	gpt-4o-mini	—	Independent	MIRIX (Wang & Chen) — competitor re-run ↗	2025-07-10
Zep (Graphiti)	44.76 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
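
The table above lists the same framework under different trust badges, which makes the gap between vendor and third-party numbers easy to quantify. The sketch below does this for the three Mem0 rows, with values copied from the table. The helper names `best_by_trust` and `reporting_gap` are hypothetical. Note that these runs differ in backbone, embedder, and harness, so the gap is not a like-for-like measurement of vendor optimism.

```python
# (trust badge, score) pairs for Mem0, copied from the LoCoMo table above.
MEM0_LOCOMO = [
    ("Self-reported", 92.5),
    ("Independent", 66.88),
    ("Independent", 54.68),
]


def best_by_trust(rows):
    """Return the highest score seen for each trust badge."""
    best = {}
    for trust, score in rows:
        best[trust] = max(score, best.get(trust, score))
    return best


def reporting_gap(rows):
    """Best self-reported score minus best independent score, in points."""
    best = best_by_trust(rows)
    return round(best["Self-reported"] - best["Independent"], 2)


print(reporting_gap(MEM0_LOCOMO))  # 25.62
```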

LongMemEval

32K-context era (2024) · 500 questions

Multi-session recall, including knowledge updates, across 500 questions.

Caveats

As with LoCoMo, large modern context windows weaken it as an isolated test of memory.
LongMemEval-S (~103k tokens) fits inside a 128k context window, so a full-context baseline can solve much of it without memory — 'borderline' saturation risk per Jiang et al., 'Anatomy of Agentic Memory' (arXiv:2602.19320, 2026).
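
The saturation argument in the caveats above comes down to simple arithmetic, sketched here. The token counts are the approximate figures quoted on this page (~103k for LongMemEval-S, ~1M for BEAM). The function `fits_in_context` and its `reserve_tokens` parameter (room set aside for the question and the answer) are hypothetical names used for illustration.

```python
def fits_in_context(corpus_tokens: int, window_tokens: int, reserve_tokens: int = 0) -> bool:
    """True if the whole benchmark history fits in the prompt, plus any reserved room."""
    return corpus_tokens + reserve_tokens <= window_tokens


WINDOW = 128_000          # context window quoted in the caveat
LONGMEMEVAL_S = 103_000   # approximate, per the caveat
BEAM_1M = 1_000_000       # approximate, per the BEAM (1M) description

print(fits_in_context(LONGMEMEVAL_S, WINDOW))  # True
print(fits_in_context(BEAM_1M, WINDOW))        # False
print(round(LONGMEMEVAL_S / WINDOW, 2))        # 0.8
```

At roughly 80% of the window, LongMemEval-S leaves limited headroom, which is consistent with the 'borderline' label: reserving 30,000 tokens for instructions and output, for example, would already push it over.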

Framework	Value	Backbone	Embedder	Trust	Source	Date
agentmemory	95.2 recall	—	all-MiniLM-L6-v2	Self-reported	rohitg00 (agentmemory authors) ↗	2026-05-20
Mem0	94.4 accuracy	—	—	Self-reported	Mem0 ↗	2026-04-01
Hindsight	91.4 accuracy	Gemini 3 Pro	—	Self-reported	Hindsight (Vectorize) ↗	2026-04-02
Honcho	90.4 accuracy	—	—	Self-reported	Honcho (Plastic Labs) ↗	2026-05-26
Zep (Graphiti)	90.2 accuracy	gpt-5.4 (reasoning=medium)	—	Self-reported	Zep ↗	2026-05-28
RetainDB	79 accuracy	gpt-5.4	—	Self-reported	RetainDB ↗	2026-03-01
MemOS	77.8 accuracy	GPT-4o-mini	—	Self-reported	MemOS (MemTensor et al.) ↗	2025-07-04
LiCoMemory	73.8 accuracy	gpt-4o-mini	BGE-M3	Self-reported	LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗	2025-11-03
Zep (Graphiti)	71.2 accuracy	GPT-4o	—	Independent	Hindsight/Vectorize (competitor re-run) ↗	2026-04-02
LiCoMemory	69.2 accuracy	Llama-3.1-70B-Instruct-Turbo	BGE-M3	Self-reported	LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗	2025-11-03
Zep (Graphiti)	63.8 accuracy	GPT-4o	—	Self-reported	Zep ↗	2026-02-01
Mem0	62.6 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
Zep (Graphiti)	58.6 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
A-MEM	55 accuracy	gpt-4o-mini	BGE-M3	Independent	LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗	2025-11-03
Mem0	49 accuracy	GPT-4o	—	Independent	Zep (competitor harness) ↗	2026-02-01
MIRIX	43.49 accuracy	GPT-4o-mini	—	Independent	MemOS (MemTensor) — competitor re-run ↗	2025-07-04

BEAM (1M)

ICLR 2026

Long-term memory across ~1M-token conversations spanning multiple domains.

Caveats

Built specifically to escape the context-window-rot that affects LoCoMo/LongMemEval.

Framework	Value	Backbone	Embedder	Trust	Source	Date
Mem0	64.1 accuracy	—	—	Self-reported	Mem0 ↗	2026-04-01
Context-window baseline	64.1 accuracy	—	—	Unverified	Mem0 (benchmark summary) ↗	2026-03-01

BEAM (10M)

ICLR 2026

Long-term memory stressed to ~10M-token scale.

Caveats

Hardest tier; scores drop sharply, exposing real retention limits.

Framework	Value	Backbone	Embedder	Trust	Source	Date
Hindsight	64.1 accuracy	—	—	Self-reported	Hindsight (Vectorize) ↗	2026-04-02
Mem0	48.6 accuracy	—	—	Self-reported	Mem0 ↗	2026-04-01
Cognee	0.67 accuracy	—	—	Self-reported	cognee maintainers (README Benchmarks section) ↗	2026-06-28