A structured-retrieval memory architecture. 90% on LongMemEval-S at half the tokens of comparable systems. No embeddings, 98% retrieval accuracy.
- 90.4% LongMemEval-S top score
- ~15k median tokens, half of top systems
- 98% retrieval accuracy
- Top-3 distinct system on the leaderboard
TL;DR
Within a chat the AI starts to know you. It picks up your style, remembers what you said earlier, follows the workflow you built together. But context windows have limits: the chat slows down, costs more per turn, and eventually hits the cap, so you start a new chat and begin from scratch, over and over.
And the old chats pile up. You remember solving this exact problem a few weeks back, but you can't find which chat it was in, and the memory features bolted onto these products barely remember more than your name.
If you try staying organised with a chat per topic, it's the same issue: every new chat starts back at zero, killing productivity.
AI should be the first tool that actually learns you over time, but none of it does.
The opposite of everything above. An AI that knows you the moment you open it, picks up wherever you left off no matter how long ago that was, remembers what you said and how you said it, reuses workflows you refined together so you never have to explain them again, catches patterns in your life you haven't noticed yet.
Think JARVIS or Rick's garage AI - an AI that is actually yours and gets sharper the longer you use it.
That has been the goal from day one and I've been building toward it for 6 months alongside my first year of uni, solo.
I started with embeddings and centroid clustering. Retrieval worked, but the answers felt flat, not personalised to me. I'd ask about a topic and get semantically similar messages back, raw and unfiltered, and the answers missed the nuance I was looking for: the insights about me, the patterns, how I think, what I prefer, all the factors that should have shaped the responses. Embeddings do hit, but it's search, not memory; something always felt missing. And adding an LLM layer on top to extract the insight means you're running an LLM call alongside embeddings anyway, which defeats the point and is inefficient cost-wise.
Next I went fully agentic: no embeddings, let the model decide what to store and what to fetch with tool calls. This approach was bounded by latency and cost. To balance them I wanted cheaper, faster models, and they couldn't handle tool calling reliably enough: too many retries, wasted tokens and wrong fetches. The loop was flexible but slow and fragile, and agentic systems like this are everywhere now, trading reliability for flexibility you don't actually need.
Then I realised that with the right structure, memory retrieval becomes a one-hop problem: you don't need an agent loop if every retrieval is a single decision from a map. Store memories in the correct shape, give the model a compact index of everything, and let it pick what to load in one pass. One decision and one fetch, no chains of tool calls, no embeddings, no search. That insight is what made the whole system work, and it's what keeps prompts bounded, fast and cheap as memory grows.
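To make that concrete, here is a minimal sketch of the one-hop shape in Python; the map, the topics and the `call_llm` stand-in are illustrative only, not the production code.

```python
# Minimal sketch of one-hop retrieval: one LLM decision over a compact map,
# then one fetch. call_llm() is a stand-in, not a real model call.

def call_llm(prompt: str) -> str:
    # Pretend the model read the map and picked the relevant topic.
    return "projects/c137"

MEMORY = {
    "projects/c137": "Memory engine build log, benchmark runs, next milestones.",
    "uni/algorithms": "Dynamic programming revision, prefers worked examples.",
    "health/gym": "Push/pull/legs split, tracking bench progress.",
}

def memory_map() -> str:
    # The compact index: topic names only, never the stored content itself.
    return "\n".join(f"- {topic}" for topic in MEMORY)

def retrieve(query: str) -> str:
    # One decision: the model names what to load from the map.
    topic = call_llm(f"Memory map:\n{memory_map()}\n\nQuery: {query}\nPick one topic.")
    # One fetch: load exactly that slice for the answering model.
    return MEMORY.get(topic, "")

print(retrieve("how is my current project going?"))
```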
LongMemEval-S is the standard benchmark for memory systems and what you'll see referenced in every other paper. 500 questions across ~57M tokens of conversation data, broken into these categories:
| Category | What it tests |
|---|---|
| knowledge-update | Did it track when facts changed (you moved cities, switched jobs) |
| multi-session | Can it connect info spread across separate conversations |
| single-session-user | Does it remember what you said |
| single-session-assistant | Does it remember what it said |
| single-session-preference | Does it remember preferences you explicitly stated and apply them to new questions |
| temporal-reasoning | Can it reason about when things happened and order events |
It also tests abstention: whether the model forces a hallucinated answer or knows when it doesn't know. Confident hallucinations are one of my biggest issues when I use AI, so I took extra care with this.
For the bench, and in production, Grok 4.1 Fast (non-reasoning) handles ingestion and retrieval: storing into the structure and fetching from it. The answering models I ran were GPT-4o, the official benchmark model, along with Gemini 3.1 Pro and Gemini 3 Flash. The answerer only ever sees a lean retrieved slice rather than the full history, so it can be swapped without blowing up cost.
- Gemini 3.1 Pro: 90.4%
- Gemini 3 Flash: 88.6%
- GPT-4o: 82.8%
- Retrieval accuracy: 98%
| System | Model | Overall |
|---|---|---|
| Mastra OM | gpt-5-mini | 94.87% |
| Mastra OM | gemini-3-pro-preview | 93.27% |
| Hindsight | gemini-3-pro-preview | 91.40% |
| c137 | gemini-3.1-pro-preview | 90.40% |
| Mastra OM | gemini-3-flash-preview | 89.20% |
| Hindsight | GPT-OSS 120B | 89.00% |
| c137 | gemini-3-flash-preview | 88.60% |
| EmergenceMem Internal* | gpt-4o | 86.00% |
| Supermemory | gemini-3-pro-preview | 85.20% |
| Supermemory | gpt-5 | 84.60% |
| Mastra OM | gpt-4o | 84.23% |
| Hindsight | GPT-OSS 20B | 83.60% |
| c137 | gpt-4o | 82.80% |
| EmergenceMem Simple | gpt-4o | 82.40% |
| Oracle | gpt-4o | 82.40% |
| Supermemory | gpt-4o | 81.60% |
| Mastra RAG (topK 20) | gpt-4o | 80.05% |
| Zep | gpt-4o | 71.20% |
| Full context | gpt-4o | 60.20% |
Eight distinct memory systems on the board. By best score per system: Mastra 1st, Hindsight 2nd, c137 3rd. My Pro run used gemini-3.1-pro-preview (released after Mastra and Hindsight ran their numbers); the Flash and 4o numbers are matched.
Leaderboard data sourced from Mastra Research. * EmergenceMem's 86.00% is reported for an "Internal" configuration and is not publicly reproducible. Public configs include Simple (82.40%) and Simple Fast (79.00%).
I know that since c137 is closed source these results are hard to trust, so I built a viewer where you can see all 500 questions, the sessions seeded from the prompt, and my final answer for each model. You can jump to the ones it got wrong, with a breakdown of whether each was a retrieval miss or the model not using the data that was there, and see exactly where the data sits in the prompt.
c137 is a top-3 distinct system on the leaderboard alongside Mastra and Hindsight. Per answering model, the breakdown is 3rd on Pro, 2nd on Flash, and 3rd on GPT-4o (matching the 82.4% oracle ceiling). Almost everyone else on this list is building agent infra; c137 is the only consumer app. It hits these scores on roughly half the token budget of the top-scoring systems, and that matters to me more than the leaderboard position does.
GPT-4o is a strong result. c137 on 4o scores 82.8%, matching the oracle ceiling of 82.4% (within noise on 500 questions), which is the theoretical max where the model is handed the exact gold-standard memory for each question with no retrieval at all. Sitting at the oracle line means retrieval is not the bottleneck: the architecture lands the right context as reliably as a perfect oracle would.
Across all three models retrieval accuracy is 98%: only 10 questions out of 500 had the ground-truth context absent from every model's retrieved slice. The rest of the failures are model errors, where retrieval landed the right context but the answerer still got it wrong. Per-model retrieval is inspectable in the bench viewer, and every failure is tagged as a retrieval miss or a model error. That 98% is the score ceiling in the system's current state, and prompt refinements on the answerer push the scores toward it; the architecture has no gaps.
| Category | Flash | Pro | GPT-4o |
|---|---|---|---|
| knowledge-update | 97.4% (76/78) | 97.4% (76/78) | 87.2% (68/78) |
| multi-session | 82.7% (110/133) | 85.0% (113/133) | 79.7% (106/133) |
| single-session-user | 97.1% (68/70) | 94.3% (66/70) | 95.7% (67/70) |
| single-session-assistant | 94.6% (53/56) | 94.6% (53/56) | 91.1% (51/56) |
| single-session-preference | 73.3% (22/30) | 83.3% (25/30) | 63.3% (19/30) |
| temporal-reasoning | 85.7% (114/133) | 89.5% (119/133) | 77.4% (103/133) |
The per-category breakdown is strong across the board. Single-session-preference is the lowest, but it is also the smallest test (only 30 questions), so each wrong answer moves the percentage a lot. Pro edges Flash in most categories, with a small regression on single-session-user; I played with prompts to fix it, but any tweak that recovered those points dropped something else, so I settled on this as the best compromise point.
Abstention is when the model is asked a question whose answer is not in memory; it tests whether the model will hallucinate or correctly say it does not have the information.
| Abstention | Flash | Pro | GPT-4o |
|---|---|---|---|
| Correctly refused to answer | 86.7% (26/30) | 96.7% (29/30) | 90.0% (27/30) |
For me this is one of the most important things: I need to be able to trust what my assistant says. This is mainly prompt behaviour, but model choice does carry weight here, and Pro has the best balance (29/30). I ran the bench a few times and Flash and 4o moved between 26 and 28 across runs; both are still strong refusal rates, and the variance is expected with only 30 abstention questions. In practice on Pro the hallucination rate is very low: if the model doesn't know, it says it doesn't know.
Competitive scores at a fraction of the prompt cost. Mastra reports ~30k average tokens injected per question; Hindsight uses embeddings with a reranker, which tends to dump a lot of context. Embedding-based systems in general grow prompts linearly as memory grows. c137 has a median of ~15k tokens, a range of ~9k to ~37k and a P25-P75 of ~13k to ~16k: half the tokens for comparable scores. Over months of usage this is the number that compounds.
Of those ~15k tokens: 5k are static system instructions (prefix cached), ~2k user model, ~2k tail anchor for context adherence and ~8k dynamic retrieved content, the only piece that varies per query. So the actual information the model loads fresh each turn averages just ~8k tokens, which keeps responses fast, caching effective and cost stable as memory grows.
[Figure: Prompt anatomy, ~15k median tokens: system (cached), user model, dynamic context (retrieved per query), tail (static).]
[Figure: Pipeline overview. Stage 1 Retrieve (Grok 4.1 Fast) → Stage 2 Answer (swappable) → Stage 3 Store (Grok 4.1 Fast), with a background Healer that reviews stage 1 and stage 3 decisions.]
If you think about memory in the simplest way, it breaks into three distinct problems: retrieval, applying what was retrieved, and storage. For retrieval you need to know what's relevant. You can do that with embeddings, but they are blind and the result is a search engine, not memory; you need an LLM. That only works with structured storage, which is why stages 1 and 3 need to work together. Using the same model for both yielded the best results when I experimented with the system.
The reason I moved off vectors entirely is that embeddings give you recall but not memory. If you ask "what did we talk about last week", a vector search can find semantically similar messages, but it cannot tell you that you always prefer structured plans over freeform advice, or that when you say "the usual format" you mean bullet points with no headers. Those are learned patterns, not retrievable facts. Structured storage captures them because the model writes them down explicitly during stage 3 as facts, workflows and micro-insights, whereas a vector database would need you to ask exactly the right query to surface that information, and even then it returns raw messages, not distilled understanding.
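As a rough illustration of what "written down explicitly" means, structured storage records could be sketched like this; the field names and examples are simplified stand-ins, not the real schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative record types for structured storage: distilled understanding
# written down at store time, rather than raw messages recalled at query time.

@dataclass
class Fact:
    topic: str
    text: str
    source: str = "user"                     # facts only come from user messages
    recorded: date = field(default_factory=date.today)

@dataclass
class Workflow:
    name: str
    steps: list[str]

@dataclass
class MicroInsight:
    text: str                                # a learned pattern, not a quote

examples = [
    Fact(topic="writing", text='"the usual format" means bullet points, no headers'),
    Workflow(name="weekly review", steps=["collect notes", "summarise wins", "plan next week"]),
    MicroInsight(text="prefers structured plans over freeform advice"),
]
for record in examples:
    print(record)
```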
The core data structure is the memory map, a compact index of everything in storage that every stage sees.
It lists topic areas with descriptions, session counts and ledger sizes; the model sees the shape of the entire memory in ~2k tokens and picks what to load. Facts are bucketed into topics, so stage 1 only sees topic names and subtopics, keeping the prompt lean.
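A simplified sketch of how such an index can be rendered; the topics, counts and layout here are illustrative, not the production map format.

```python
# Rendering a compact memory map: the model sees the shape of memory, not its contents.
topics = [
    {"name": "projects/c137", "desc": "memory engine design and benchmarks", "sessions": 67, "ledger": 210},
    {"name": "uni/algorithms", "desc": "CS coursework and revision", "sessions": 41, "ledger": 120},
    {"name": "health/gym", "desc": "training plan and progress", "sessions": 12, "ledger": 33},
]

def render_memory_map(topics: list[dict]) -> str:
    return "\n".join(
        f"{t['name']} | {t['desc']} | sessions: {t['sessions']} | ledger: {t['ledger']}"
        for t in topics
    )

print(render_memory_map(topics))
```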
[Figure: Data hierarchy and 1-hop retrieval. Stage 1 sees the memory map and the facts map, not the structure beneath them; cold domains keep their description but collapse their groups, reachable via a conditional 2nd hop. Example query "how is my current project going?": stage 1 picks actions, and the retrieved pieces join the always-present user-level context in the stage 2 prompt. One decision, one fetch.]
All three stages output structured XML rather than using tool calling. I found XML more reliable than actual tool schemas with faster models, and the deterministic parsing keeps things predictable. Here is a sample of what the stage 1 and stage 3 outputs look like:
[Sample output viewer: Stage 1 - Retrieval and Stage 3 - Storage tabs.]
Stage 1 outputs an opening sentence first, which gets streamed to the user, then the actions block. Stage 3 outputs silently in the background. Both give a reason for every action, which becomes important later with the healer.
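As an illustration of why deterministic parsing stays simple, here is a simplified stage-1-style action block and its parse; the tag names and attributes are representative of the general shape, not the exact production schema.

```python
import xml.etree.ElementTree as ET

# Illustrative stage 1 output: an opener to stream, then actions with reasons.
stage1_output = """
<response>
  <opener>Let me pull up where your project left off.</opener>
  <actions>
    <load_topic name="projects/c137" reason="query is about current project status"/>
    <load_facts topic="projects/c137" reason="recent milestones are stored as facts"/>
  </actions>
</response>
"""

root = ET.fromstring(stage1_output)
print("stream:", root.findtext("opener").strip())
for action in root.find("actions"):
    # Every action carries a reason, which the healer can review later.
    print(action.tag, dict(action.attrib))
```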
Stage 1 takes in cached system instructions, the fact map, the memory map, recent conversation logs and the user's query, then scans both maps and decides what to load for the answering model, picking topic areas across either map, fetching facts and running full text search across all knowledge sources. It also outputs an opening sentence that gets streamed to the user while stage 2 processes, masking the retrieval wait.
The opener's stream rate is dynamically calculated from its length so it always buys enough time for stage 2's first token to land. The handover is invisible, and the user sees one coherent response that starts streaming within 1 to 2 seconds.
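A back-of-the-envelope sketch of that pacing, assuming character-level streaming and a rough estimate of stage 2's TTFT; the clamp values are illustrative.

```python
# Stretch the opener's streaming over the expected stage 2 wait so the
# first stage 2 token lands just as the opener finishes.

def chars_per_second(opener: str, expected_stage2_ttft: float) -> float:
    rate = len(opener) / max(expected_stage2_ttft, 0.1)
    return min(max(rate, 15.0), 80.0)        # clamp to stay readable

opener = "Let me pull up where your project left off."
print(f"{chars_per_second(opener, expected_stage2_ttft=1.7):.1f} chars/sec")
```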
I use Grok 4.1 Fast (non-reasoning) here. It has the lowest hallucination rate with consistently fast TTFT. I originally thought tokens per second was the limiting factor and was using Groq with weaker models, but TTFT matters more for this stage since the model only needs to output a small action block, not a long response. Once I realised that, Grok opened up as an option and the quality jumped.
Latency holds steady because prompts stay bounded as memory grows: stage 1 takes about 1.6 seconds, stage 2 about 1.7 seconds, and stage 3 about 1.4 seconds in the background, so the user does not wait for it. The opener from stage 1 masks the retrieval wait, and total user-facing latency is around 3 seconds, the same after months of usage as on day one, because the architecture caps what the model has to read each turn.
[Figure: Latency timeline.]
Stage 2 takes in cached system instructions, all the context stage 1 retrieved, the user model, and tail instructions that help the model weight context evenly rather than leaning toward one section. It then answers the user's question using that retrieved context. The model here is swappable; I found Gemini models are the best at using context while staying cost efficient. A search tool acts as a fallback when retrieval still misses something, dumping related facts and surrounding context into the prompt.
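A simplified sketch of that prompt assembly order; the section markers and tail wording are illustrative, not the real prompt.

```python
# Assembling the stage 2 prompt: a stable cached prefix, a dynamic middle that
# varies per query, and a tail anchor that balances attention across sections.

def build_stage2_prompt(system: str, retrieved: str, user_model: str, query: str) -> str:
    tail = "Weigh every section above evenly; do not lean on any single source."
    return "\n\n".join([
        system,                                                     # static, prefix-cached
        f"<retrieved_context>\n{retrieved}\n</retrieved_context>",  # the only part that changes
        f"<user_model>\n{user_model}\n</user_model>",
        f"<question>\n{query}\n</question>",
        tail,                                                       # static tail anchor
    ])

print(build_stage2_prompt(
    "You are a personal assistant with long-term memory.",
    "Ledger: project c137 benchmark run finished, viewer shipped.",
    "Prefers direct answers and being challenged on ideas.",
    "How is my current project going?",
))
```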
Stage 3 takes in the memory map, the active topic area context and the full exchange that just happened. It picks an existing or new topic area to store in and strips the exchange down into a lean ledger entry, preserving all detail but dropping waffle. Facts are only created from user messages, since that is confirmed truth; anything from the AI response is not guaranteed to be something the user cares about, and it lives in the ledger anyway. Stage 3 also detects workflows refined over many turns and stores micro-insights about the user.
The model deliberately does not see all existing facts, as that does not scale; it only sees topics and subtopics and is allowed to store duplicates freely. A separate deduplication service handles merging, which keeps the model from withholding facts it thinks are redundant and keeps tokens low. The overall focus with this pipeline is making things as easy for the models as possible and taking as much work away from them as I can.
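A toy sketch of what an out-of-band dedup pass does; the real service merges semantically, this just shows the shape with exact-match normalisation.

```python
# The storage model writes duplicates freely; a background pass merges them so
# prompts stay small without the model second-guessing what to keep.

def normalise(fact: str) -> str:
    return " ".join(fact.lower().split()).rstrip(".")

def dedupe(facts: list[str]) -> list[str]:
    seen: dict[str, str] = {}
    for fact in facts:
        seen.setdefault(normalise(fact), fact)   # keep the first phrasing
    return list(seen.values())

facts = [
    "Prefers bullet points with no headers.",
    "prefers bullet points with no headers",
    "Studying CS, first year.",
]
print(dedupe(facts))   # two facts survive
```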
Most memory systems are linear pipelines. Input goes in, output comes out, nothing reviews itself. c137 has feedback loops. Every stage gives a reason for every action it takes, and a background service called the healer reads those reasons to check the work. It triggers after an activity threshold and handles deduping, merging facts, validating placement and checking nothing important was missed. Stages can also flag things they think might be wrong for the healer to review. The stages talk to each other and review each other; memory is a living system.
Alongside the healer there's another background step that consolidates user facts and micro-insights into the user model itself, the picture of who you are and how you work that stage 2 always sees. It re-runs as new facts and insights come in, so the user model stays compact and current rather than just growing.
The healer was not used in the benchmark, as I wanted to test the raw pipeline without post-processing, and since only 10 questions failed due to retrieval misses the system holds without it. It was originally built when I was using weaker models that made poor storage decisions; while less necessary after upgrading to Grok, it remains an important part of the system for long-term integrity.
This is the part no one else seems to properly break down, and I think it's the most important. Strong benchmark scores are great, but if the system does not scale in tokens and cost then they do not matter. The whole point of this architecture is that prompt size does not balloon as memory accumulates.
Every component has a hard cap. The memory map caps at 5k tokens, and if it grows past that the least active topics get trimmed out; identity facts cap at 100 per user; topical facts cap at 100 per topic, with the oldest demoted to cold storage when the cap is hit. None of these grow unbounded.
When topics or facts move to cold they are no longer visible to stage 1, which means they are not directly accessible, so a conditional call is used between stages 1 and 2: if stage 1 decides the info might be in cold, collapsed data it can check the cold area, and for facts whose topic area is cold it can load those facts and pass them through as actions for stage 2. This technically becomes a two-hop problem, but agentic flexibility is still not required since a simple conditional call handles it, and nothing is ever deleted, just moved further from the hot path.
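A compact sketch of cap-and-demote plus the conditional second hop; the per-topic cap follows the numbers above, the rest is a simplified stand-in.

```python
# Hard caps keep hot storage bounded; overflow demotes to cold rather than
# deleting, and a conditional second hop reaches cold when stage 1 asks for it.

FACTS_PER_TOPIC_CAP = 100

hot: dict[str, list[str]] = {"projects/c137": [f"fact {i}" for i in range(100)]}
cold: dict[str, list[str]] = {}

def store_fact(topic: str, fact: str) -> None:
    facts = hot.setdefault(topic, [])
    facts.append(fact)
    while len(facts) > FACTS_PER_TOPIC_CAP:
        cold.setdefault(topic, []).append(facts.pop(0))   # oldest demoted, never deleted

def retrieve_facts(topic: str, check_cold: bool) -> list[str]:
    facts = list(hot.get(topic, []))
    if check_cold:                                        # the conditional second hop
        facts += cold.get(topic, [])
    return facts

store_fact("projects/c137", "fact 100")
print(len(hot["projects/c137"]), len(cold["projects/c137"]))    # 100 hot, 1 cold
print(len(retrieve_facts("projects/c137", check_cold=True)))    # 101 with the hop
```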
Conversation logs would grow unbounded too if left raw, so they use a rolling compression system: the last 10 exchanges are always present verbatim, every 10 older rows get compressed into an L1 summary block, and every 10 L1 blocks get compressed into an L2 block covering 100 rows, so token count scales logarithmically rather than linearly. The compression model is instructed that every number, name, date and measurement in the input must appear in the output, so specific details survive, and any gap between the last compressed block and the recent window is captured as overflow rows so nothing gets lost. Any detail that does get compressed away is still covered by the topical facts, which always retain full detail.
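A rough sketch of the rolling window maths; `summarise` stands in for the compression model and the block sizes follow the numbers above.

```python
# The last 10 exchanges stay verbatim; older rows collapse 10:1 into L1 summary
# blocks, and every 10 L1 blocks collapse into one L2 block covering 100 rows.

RECENT, L1_SIZE, L2_SIZE = 10, 10, 10

def summarise(rows: list[str]) -> str:
    # Stand-in for the compression model, which must carry over every number,
    # name, date and measurement from its input.
    return f"[summary of {len(rows)} rows]"

def compress(rows: list[str]) -> list[str]:
    older, recent = rows[:-RECENT], rows[-RECENT:]
    full = len(older) - len(older) % L1_SIZE
    l1 = [summarise(older[i:i + L1_SIZE]) for i in range(0, full, L1_SIZE)]
    overflow = older[full:]                   # gap rows kept as-is, nothing lost
    full_l1 = len(l1) - len(l1) % L2_SIZE
    l2 = [summarise(l1[i:i + L2_SIZE]) for i in range(0, full_l1, L2_SIZE)]
    return l2 + l1[full_l1:] + overflow + recent

rows = [f"exchange {i}" for i in range(247)]
print(len(compress(rows)))   # 22 prompt rows instead of 247
```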
[Figure: Compression layers.]
[Chart: Sessions ingested vs Stage 2 input tokens. More memory, same prompt size.]
Proof from the benchmark: 500 questions, each with 35 to 62 sessions of conversation history ingested, bucketed by session count. The meaningfully sampled middle buckets (40-44 n=86, 45-49 n=255, 50-54 n=143, 55-59 n=12) show median Stage 2 tokens hovering around 14-15k regardless of how much memory accumulated. The endpoints (35-39 and 60-62, n=2 each) are noise and shouldn't be read as a trend. The architecture selects what is relevant rather than dumping everything, so prompts stay bounded where embedding-based systems scale linearly. LongMemEval-S caps at ~60 sessions; whether this holds across years of continuous use is what I am building a bigger benchmark to test (see Bench Gaps below).
Average token budget per stage
| Stage | Model | Input | Breakdown |
|---|---|---|---|
| S1 Retrieve | Grok 4.1 Fast | ~6k | 3k cached system + map + recent logs + query |
| S2 Answer | Swappable | ~15k | 5k cached system + 2k user model + 8k dynamic + 2k tail |
| S3 Store | Grok 4.1 Fast | ~8k | 3k cached system + map + active context |
Average cost per question - full pipeline
| Component | Flash | Pro | GPT-4o |
|---|---|---|---|
| S1 + S3 (Grok) | $0.0022 | $0.0022 | $0.0022 |
| S2 (Answerer) | $0.0077 | $0.0306 | $0.0378 |
| Total | $0.0099 | $0.0328 | $0.0400 |
Since the answering model only sees the lean retrieved context, it can be swapped without blowing up cost. The pipeline stages on Grok are fixed overhead and the answerer is the variable: you could put a more expensive model on stage 2 and the total stays manageable, because 15k input is 15k input regardless of the model.
A few months in, c137 has picked up things about me I never told it. It learnt them over time from repeated patterns in how I talk and work. That I trust it more when it pushes back and disagrees with me than when it agrees. That when I revise I go for understanding over memorisation, properly working through things rather than just committing facts. That I land on solutions when I'm away from the work, right before sleep or when I'm not thinking about it. Even the small stuff, like the fact I keep coming back to sushi no matter what I'm eating that month. None of this was fed in deliberately.
Then it acts on all of it, every response shaped by what it knows without me touching a setting. Bring an idea and it actually challenges me because that's the mode I trust, ask about a CS topic and the answer comes formatted for understanding by default, get stuck on something late and it suggests I sleep on it before pushing through because it knows that's when I solve things. Even the small stuff, ask what to eat and it pitches sushi. None of this is configurable, it just happens because the AI has built up a real picture of who I am.
Ask any other AI the same questions and you get a blank slate or a generic guess you have to correct three times before it lands. That's the difference between an AI that actually knows you and one that doesn't, and once you've used the first kind going back feels broken.
10 questions out of 500 failed across all three models due to retrieval misses, meaning the context needed to answer was not in the prompt at all. This is the main thing I need to work on, and it comes down to how stage 3 stores and how stage 1 searches. Some of these were edge cases like computing totals across multiple conversations, where individual numbers were stored but the sum was not; others were cases where an older value got overwritten and the previous state was lost. These are solvable with better storage prompts, and I have ideas for how to handle them. Every retrieval miss is tagged in the bench viewer so you can see exactly which ones.
The gap between current scores and the 98% retrieval ceiling lives in stage 2: the answerer sometimes misuses context that retrieval lands correctly. This is prompt work, not architecture; each round of stage 2 iteration has moved scores up and there is more room to push. Getting closer to 98% is a matter of continuing that work, and nothing structural needs changing for it.
These benchmarks test factual recall, but that is only one component of memory. They do not test whether the AI adapts its responses to you, or whether it naturally reuses workflows and preferences you stated before without you having to repeat yourself when doing a task again. Does the assistant actually feel like it knows you, or is it just retrieving facts? The user model and personality system I built are designed for exactly this, but there is no standard way to measure it yet.
They also do not properly test token and cost scaling, which as I showed above is critical for a system meant to be used long term. LongMemEval-S tests with around 50 sessions of history, which is a good start but does not cover years of usage. LOCOMO tests longer periods, but still not long enough to really stress-test how memory holds up over years of continuous use with thousands of sessions and growing topic areas.
What I want to build is a benchmark that scores memory across preference adaptation, workflow reuse, forgetting and contradiction handling, on 1 year, 5 year and 10 year conversation histories with thousands of sessions per user. Open source so anyone can run it against any system including mine. This is the benchmark that tells you whether the AI actually knows you, not whether it can surface a fact from a log.
The work splits into a few bricks from here.
Brick 2 - Wider use cases
A character app on the same memory engine, leaning into model personality and behavioural controls so characters actually grow with you instead of resetting every conversation. Different surface, same engine.
Brick 3 - API and improved scores
Closing the gap to the 98% retrieval ceiling is stage 2 prompt work, no architectural changes. I will publish results across more answering models and ship a public API so anyone can run their own LongMemEval pass against c137 and reproduce the numbers. Closed source is the right critique today; the API is the answer.
Brick 4 - Agentic context
Mapped Memory was designed for chat. Tool-calling agents change the shape of the problem: larger and noisier context per turn, transient state mixed with persistent state. Adapting the architecture to that is the next expansion, with its own benchmarks.
Brick 5 - A new benchmark
LongMemEval tests recall well, but not the things memory actually has to do day to day: preference adaptation, workflow reuse, silent forgetting, contradiction handling, or token scaling over years rather than sessions. It will be open source so anyone can run any system through it, including mine. This is the bench that tells you whether an AI knows you.
More beyond that, but those are the next ones I am committing to publicly. The end goal hasn't changed since day one: an AI that actually knows you, not a search engine over your past messages or a chatbot with a long context window, but something that understands how you think and how you work and gets better the longer you use it. This is brick 1.