Benchmarks
Published results, grouped by benchmark. Each row keeps its backbone LLM, embedder, source, and a trust badge — because a memory score is only as meaningful as the pipeline and the party that measured it. The context-window baseline shows how far naive prompt-stuffing gets.
- Independent: A neutral party ran it.
- Self-reported: The framework's own vendor reported it.
- Unverified: Source is neutral but not yet reproduced.
LoCoMo
32K-context era (ACL 2024) · 1,982 questions

Long-term, multi-session conversational recall (single-hop, multi-hop, open-domain, temporal).
Caveats
- Average context length is modest by 2026 standards; a 'dump everything into the prompt' baseline now scores competitively.
- Does not explicitly score knowledge updates.
| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| ByteRover | 96.1 accuracy | Gemini 3 Flash (curation/query) + Gemini 3.1 Pro (justifier) | — | Self-reported | ByteRover team (Nguyen et al.) ↗ | 2026-04-02 |
| Mem0 | 92.5 accuracy | — | — | Self-reported | Mem0 ↗ | 2026-04-01 |
| ByteRover | 92.2 accuracy | Gemini 3 Flash (curation/judge) + Gemini 3 Pro (answer/justifier, best run) | — | Self-reported | ByteRover ↗ | 2026-02-27 |
| Honcho | 89.9 accuracy | — | — | Self-reported | Honcho (Plastic Labs) ↗ | 2026-05-26 |
| MIRIX | 85.38 accuracy | gpt-4.1-mini | — | Self-reported | MIRIX (Wang & Chen) ↗ | 2025-07-10 |
| Memori | 81.95 accuracy | — | — | Self-reported | Memori (MemoriLabs) ↗ | 2026-05-28 |
| MemOS | 75.8 accuracy | GPT-4o-mini | — | Self-reported | MemOS (MemTensor et al.) ↗ | 2025-07-04 |
| Letta (MemGPT) | 74 accuracy | gpt-4o-mini | text-embedding-3-large | Self-reported | Letta (MemGPT authors — Packer, Wooders et al.) ↗ | 2025-08-12 |
| LiCoMemory | 67.2 accuracy | gpt-4o-mini | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗ | 2025-11-03 |
| Mem0 | 66.88 accuracy | — | — | Independent | Hindsight/Vectorize (competitor re-run) ↗ | 2026-04-02 |
| LiCoMemory | 62.99 accuracy | Llama-3.1-70B-Instruct-Turbo | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗ | 2025-11-03 |
| Mem0 | 54.68 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
| A-MEM | 48.59 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
| A-MEM | 48.38 accuracy | gpt-4o-mini | — | Independent | MIRIX (Wang & Chen) — competitor re-run ↗ | 2025-07-10 |
| Zep (Graphiti) | 44.76 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
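
As a rough illustration of why every row keeps its trust badge, the sketch below takes the three Mem0 rows from the LoCoMo table and compares the best self-reported score with the best independent one. The `Result` class and the `best_by_trust` helper are hypothetical names introduced here for illustration; they are not part of any framework listed. The runs use different backbones and embedders, so the gap is indicative only, not a controlled comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row, reduced to the fields needed here (hypothetical type)."""
    framework: str
    value: float  # accuracy as printed in the table
    trust: str    # "Independent", "Self-reported", or "Unverified"
    source: str


# Mem0 rows copied from the LoCoMo table.
LOCOMO_MEM0 = [
    Result("Mem0", 92.5, "Self-reported", "Mem0"),
    Result("Mem0", 66.88, "Independent", "Hindsight/Vectorize"),
    Result("Mem0", 54.68, "Independent", "LiCoMemory"),
]


def best_by_trust(rows):
    """Return the highest-scoring row for each trust badge."""
    best = {}
    for row in rows:
        if row.trust not in best or row.value > best[row.trust].value:
            best[row.trust] = row
    return best


best = best_by_trust(LOCOMO_MEM0)
gap = best["Self-reported"].value - best["Independent"].value
print(f"Self-reported minus independent: {gap:.2f} points")
```

Grouping by badge before comparing keeps a vendor's own number from being read side by side with a third-party re-run as if both came from the same pipeline.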
LongMemEval
32K-context era (2024) · 500 questions

Multi-session recall, including knowledge updates.
Caveats
- Like LoCoMo, large modern context windows weaken it as an isolation test of memory.
- LongMemEval-S (~103k tokens) fits inside a 128k context window, so a full-context baseline can solve much of it without memory — 'borderline' saturation risk per Jiang et al., 'Anatomy of Agentic Memory' (arXiv:2602.19320, 2026).
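
The saturation caveat comes down to simple arithmetic, sketched below. The `headroom` function is a hypothetical helper, and the sketch assumes "128k" means 128,000 tokens (some models define it as 131,072) and uses the approximate sizes quoted on this page: about 103k tokens for LongMemEval-S and about 1M tokens for BEAM (1M).

```python
def headroom(corpus_tokens: int, window_tokens: int) -> int:
    """Tokens left in the window after loading the whole corpus.

    A negative result means the corpus overflows the window, so a
    full-context baseline cannot see all of it at once.
    """
    return window_tokens - corpus_tokens


WINDOW_128K = 128_000      # assumption: "128k" taken as 128,000 tokens
LONGMEMEVAL_S = 103_000    # approximate, from the caveat above
BEAM_1M = 1_000_000        # approximate, from the BEAM (1M) description

print(headroom(LONGMEMEVAL_S, WINDOW_128K))  # 25000: fits, with room for the question
print(headroom(BEAM_1M, WINDOW_128K))        # -872000: overflows the window
```

A positive headroom means a model can read the entire history in one prompt, which is why a high score on that benchmark does not by itself show that the memory layer did the work.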
| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| agentmemory | 95.2 recall | — | all-MiniLM-L6-v2 | Self-reported | rohitg00 (agentmemory authors) ↗ | 2026-05-20 |
| Mem0 | 94.4 accuracy | — | — | Self-reported | Mem0 ↗ | 2026-04-01 |
| Hindsight | 91.4 accuracy | Gemini 3 Pro | — | Self-reported | Hindsight (Vectorize) ↗ | 2026-04-02 |
| Honcho | 90.4 accuracy | — | — | Self-reported | Honcho (Plastic Labs) ↗ | 2026-05-26 |
| Zep (Graphiti) | 90.2 accuracy | gpt-5.4 (reasoning=medium) | — | Self-reported | Zep ↗ | 2026-05-28 |
| RetainDB | 79 accuracy | gpt-5.4 | — | Self-reported | RetainDB ↗ | 2026-03-01 |
| MemOS | 77.8 accuracy | GPT-4o-mini | — | Self-reported | MemOS (MemTensor et al.) ↗ | 2025-07-04 |
| LiCoMemory | 73.8 accuracy | gpt-4o-mini | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗ | 2025-11-03 |
| Zep (Graphiti) | 71.2 accuracy | GPT-4o | — | Independent | Hindsight/Vectorize (competitor re-run) ↗ | 2026-04-02 |
| LiCoMemory | 69.2 accuracy | Llama-3.1-70B-Instruct-Turbo | BGE-M3 | Self-reported | LiCoMemory (Huang et al., HKUST/Huawei/CUHK-SZ/WeBank) ↗ | 2025-11-03 |
| Zep (Graphiti) | 63.8 accuracy | GPT-4o | — | Self-reported | Zep ↗ | 2026-02-01 |
| Mem0 | 62.6 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
| Zep (Graphiti) | 58.6 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
| A-MEM | 55 accuracy | gpt-4o-mini | BGE-M3 | Independent | LiCoMemory (Huang et al., HKUST et al.) — competitor re-run ↗ | 2025-11-03 |
| Mem0 | 49 accuracy | GPT-4o | — | Independent | Zep (competitor harness) ↗ | 2026-02-01 |
| MIRIX | 43.49 accuracy | GPT-4o-mini | — | Independent | MemOS (MemTensor) — competitor re-run ↗ | 2025-07-04 |
BEAM (1M)
ICLR 2026

Long-term memory across ~1M-token conversations spanning multiple domains.
Caveats
- Built specifically to escape the context-window rot that affects LoCoMo and LongMemEval.
| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| Mem0 | 64.1 accuracy | — | — | Self-reported | Mem0 ↗ | 2026-04-01 |
| Context-window baseline | 64.1 accuracy | — | — | Unverified | Mem0 (benchmark summary) ↗ | 2026-03-01 |
BEAM (10M)
ICLR 2026

Long-term memory stressed to ~10M-token scale.
Caveats
- Hardest tier; scores drop sharply, exposing real retention limits.
| Framework | Value | Backbone | Embedder | Trust | Source | Date |
|---|---|---|---|---|---|---|
| Hindsight | 64.1 accuracy | — | — | Self-reported | Hindsight (Vectorize) ↗ | 2026-04-02 |
| Mem0 | 48.6 accuracy | — | — | Self-reported | Mem0 ↗ | 2026-04-01 |
| Cognee | 0.67 accuracy | — | — | Self-reported | cognee maintainers (README Benchmarks section) ↗ | 2026-06-28 |