A structured-retrieval memory architecture. 90% on LongMemEval-S at half the tokens of comparable systems. No embeddings, 98% retrieval accuracy.
- 90.4% LongMemEval-S top score
- ~15k median tokens, half of top systems
- 98% retrieval accuracy
- Top-3 distinct system on the leaderboard
TL;DR
Within a chat the AI starts to know you. It picks up your style, remembers what you said earlier, follows the workflow you built together. But context windows have limits: the chat slows down, costs more per turn, and eventually hits the cap, so you start a new chat and begin from scratch, over and over.
And the old chats pile up. You remember solving this exact problem a few weeks back, but you can't find which chat it was in, and the memory features bolted onto these products barely remember more than your name.
If you try staying organised with a chat per topic, it's the same issue: every new chat starts back at zero, killing productivity.
AI should be the first tool that actually learns you over time, but none of it does.
The opposite of everything above. An AI that knows you the moment you open it, picks up wherever you left off no matter how long ago that was, remembers what you said and how you said it, reuses workflows you refined together so you never have to explain them again, catches patterns in your life you haven't noticed yet.
Think JARVIS or Rick's garage AI - an AI that is actually yours and gets sharper the longer you use it.
That has been the goal from day one and I've been building toward it for 6 months alongside my first year of uni, solo.
I started with embeddings and centroid clustering. Retrieval worked, but the answers felt flat, not personalised to me. I'd ask about a topic and get semantically similar messages back, raw and unfiltered, and the answers missed the nuance I was looking for: the insights about me, the patterns, how I think, what I prefer, all the factors that should have shaped the responses. Embeddings do hit, but it's search, not memory; something always felt missing. And adding an LLM layer on top to extract the insight means you're running an LLM call alongside embeddings anyway, which defeats the point and is inefficient cost-wise.
Next I went fully agentic: no embeddings, let the model decide what to store and what to fetch with tool calls. This approach was bounded by latency and cost. To balance them I wanted cheaper, faster models, and they couldn't handle tool calling reliably enough: too many retries, wasted tokens and wrong fetches. The loop was flexible but slow and fragile, and agentic systems like this are everywhere now, trading reliability for flexibility you don't actually need.
Then I realised that with the right structure, memory retrieval becomes a one-hop problem: you don't need an agent loop if every retrieval is a single decision from a map. Store memories in the correct shape, give the model a compact index of everything, and let it pick what to load in one pass. One decision and one fetch, no chains of tool calls, no embeddings, no search. That insight is what made the whole system work, and it's what keeps prompts bounded, fast and cheap as memory grows.
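To make that concrete, here is a minimal sketch of the one-hop shape in Python; the map, the topics and the `call_llm` stand-in are illustrative only, not the production code.

```python
# Minimal sketch of one-hop retrieval: one LLM decision over a compact map,
# then one fetch. call_llm() is a stand-in, not a real model call.

def call_llm(prompt: str) -> str:
    # Pretend the model read the map and picked the relevant topic.
    return "projects/c137"

MEMORY = {
    "projects/c137": "Memory engine build log, benchmark runs, next milestones.",
    "uni/algorithms": "Dynamic programming revision, prefers worked examples.",
    "health/gym": "Push/pull/legs split, tracking bench progress.",
}

def memory_map() -> str:
    # The compact index: topic names only, never the stored content itself.
    return "\n".join(f"- {topic}" for topic in MEMORY)

def retrieve(query: str) -> str:
    # One decision: the model names what to load from the map.
    topic = call_llm(f"Memory map:\n{memory_map()}\n\nQuery: {query}\nPick one topic.")
    # One fetch: load exactly that slice for the answering model.
    return MEMORY.get(topic, "")

print(retrieve("how is my current project going?"))
```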
LongMemEval-S is the standard benchmark for memory systems and what you'll see referenced in every other paper. 500 questions across ~57M tokens of conversation data, broken into these categories:
| Category | What it tests |
|---|---|
| knowledge-update | Did it track when facts changed (you moved cities, switched jobs) |
| multi-session | Can it connect info spread across separate conversations |
| single-session-user | Does it remember what you said |
| single-session-assistant | Does it remember what it said |
| single-session-preference | Does it remember preferences you explicitly stated and apply them to new questions |
| temporal-reasoning | Can it reason about when things happened and order events |
It also tests abstention: whether the model forces a hallucinated answer or knows when it doesn't know. Confident hallucinations are one of my biggest issues when I use AI, so I took extra care with this.
For the bench, and in production, Grok 4.1 Fast (non-reasoning) handles ingestion and retrieval: storing into the structure and fetching from it. The answering models I ran were GPT-4o, the official benchmark model, along with Gemini 3.1 Pro and Gemini 3 Flash. The answerer only ever sees a lean retrieved slice rather than the full history, so it can be swapped without blowing up cost.
- Gemini 3.1 Pro: 90.4%
- Gemini 3 Flash: 88.6%
- GPT-4o: 82.8%
- Retrieval accuracy: 98%
| System | Model | Overall |
|---|---|---|
| Mastra OM | gpt-5-mini | 94.87% |
| Mastra OM | gemini-3-pro-preview | 93.27% |
| Hindsight | gemini-3-pro-preview | 91.40% |
| c137 | gemini-3.1-pro-preview | 90.40% |
| Mastra OM | gemini-3-flash-preview | 89.20% |
| Hindsight | GPT-OSS 120B | 89.00% |
| c137 | gemini-3-flash-preview | 88.60% |
| EmergenceMem Internal* | gpt-4o | 86.00% |
| Supermemory | gemini-3-pro-preview | 85.20% |
| Supermemory | gpt-5 | 84.60% |
| Mastra OM | gpt-4o | 84.23% |
| Hindsight | GPT-OSS 20B | 83.60% |
| c137 | gpt-4o | 82.80% |
| EmergenceMem Simple | gpt-4o | 82.40% |
| Oracle | gpt-4o | 82.40% |
| Supermemory | gpt-4o | 81.60% |
| Mastra RAG (topK 20) | gpt-4o | 80.05% |
| Zep | gpt-4o | 71.20% |
| Full context | gpt-4o | 60.20% |
Eight distinct memory systems on the board. By best score per system: Mastra 1st, Hindsight 2nd, c137 3rd. My Pro run used gemini-3.1-pro-preview (released after Mastra and Hindsight ran their numbers); the Flash and 4o numbers are matched.
Leaderboard data sourced from Mastra Research. * EmergenceMem's 86.00% is reported for an "Internal" configuration and is not publicly reproducible. Public configs include Simple (82.40%) and Simple Fast (79.00%).
I know that since c137 is closed source these results are hard to trust, so I built a viewer where you can see all 500 questions, the sessions seeded from the prompt, and my final answer for each model. You can jump to the ones it got wrong, with a breakdown of whether each was a retrieval miss or the model not using the data that was there, and see exactly where the data sits in the prompt.
c137 is a top-3 distinct system on the leaderboard alongside Mastra and Hindsight. Per answering model, the breakdown is 3rd on Pro, 2nd on Flash, and 3rd on GPT-4o (matching the 82.4% oracle ceiling). Almost everyone else on this list is building agent infra; c137 is the only consumer app. It hits these scores on roughly half the token budget of the top-scoring systems, and that matters to me more than the leaderboard position does.
GPT-4o is a strong result. c137 on 4o scores 82.8%, matching the oracle ceiling of 82.4% (within noise on 500 questions), which is the theoretical max where the model is handed the exact gold-standard memory for each question with no retrieval at all. Sitting at the oracle line means retrieval is not the bottleneck: the architecture lands the right context as reliably as a perfect oracle would.
Across all three models retrieval accuracy is 98%: only 10 questions out of 500 had the ground-truth context absent from every model's retrieved slice. The rest of the failures are model errors, where retrieval landed the right context but the answerer still got it wrong. Per-model retrieval is inspectable in the bench viewer, and every failure is tagged as a retrieval miss or a model error. That 98% is the score ceiling in the system's current state, and prompt refinements on the answerer push the scores toward it; the architecture has no gaps.
| Category | Flash | Pro | GPT-4o |
|---|---|---|---|
| knowledge-update | 97.4% (76/78) | 97.4% (76/78) | 87.2% (68/78) |
| multi-session | 82.7% (110/133) | 85.0% (113/133) | 79.7% (106/133) |
| single-session-user | 97.1% (68/70) | 94.3% (66/70) | 95.7% (67/70) |
| single-session-assistant | 94.6% (53/56) | 94.6% (53/56) | 91.1% (51/56) |
| single-session-preference | 73.3% (22/30) | 83.3% (25/30) | 63.3% (19/30) |
| temporal-reasoning | 85.7% (114/133) | 89.5% (119/133) | 77.4% (103/133) |
The per-category breakdown is strong across the board. Single-session-preference is the lowest, but it is also the smallest test (only 30 questions), so each wrong answer moves the percentage a lot. Pro edges Flash in most categories, with a small regression on single-session-user; I played with prompts to fix it, but any tweak that recovered those points dropped something else, so I settled on this as the best compromise point.
Abstention is when the model is asked a question whose answer is not in memory; it tests whether the model will hallucinate or correctly say it does not have the information.
| Abstention | Flash | Pro | GPT-4o |
|---|---|---|---|
| Correctly refused to answer | 86.7% (26/30) | 96.7% (29/30) | 90.0% (27/30) |
For me this is one of the most important things: I need to be able to trust what my assistant says. This is mainly prompt behaviour, but model choice does carry weight here, and Pro has the best balance (29/30). I ran the bench a few times and Flash and 4o moved between 26 and 28 across runs; both are still strong refusal rates, and the variance is expected with only 30 abstention questions. In practice on Pro the hallucination rate is very low: if the model doesn't know, it says it doesn't know.
Competitive scores at a fraction of the prompt cost. Mastra reports ~30k average tokens injected per question; Hindsight uses embeddings with a reranker, which tends to dump a lot of context. Embedding-based systems in general grow prompts linearly as memory grows. c137 has a median of ~15k tokens, a range of ~9k to ~37k and a P25-P75 of ~13k to ~16k: half the tokens for comparable scores. Over months of usage this is the number that compounds.
Of those ~15k tokens: 5k are static system instructions (prefix cached), ~2k user model, ~2k tail anchor for context adherence and ~8k dynamic retrieved content, the only piece that varies per query. So the actual information the model loads fresh each turn averages just ~8k tokens, which keeps responses fast, caching effective and cost stable as memory grows.
[Figure: Prompt anatomy, ~15k median tokens: system (cached), user model, dynamic context (retrieved per query), tail (static).]
[Figure: Pipeline overview. Stage 1 Retrieve (Grok 4.1 Fast) → Stage 2 Answer (swappable) → Stage 3 Store (Grok 4.1 Fast), with a background Healer that reviews stage 1 and stage 3 decisions.]
If you think about memory in the simplest way, it breaks into three distinct problems: retrieval, applying what was retrieved, and storage. For retrieval you need to know what's relevant. You can do that with embeddings, but they are blind and the result is a search engine, not memory; you need an LLM. That only works with structured storage, which is why stages 1 and 3 need to work together. Using the same model for both yielded the best results when I experimented with the system.
The reason I moved off vectors entirely is that embeddings give you recall but not memory. If you ask "what did we talk about last week", a vector search can find semantically similar messages, but it cannot tell you that you always prefer structured plans over freeform advice, or that when you say "the usual format" you mean bullet points with no headers. Those are learned patterns, not retrievable facts. Structured storage captures them because the model writes them down explicitly during stage 3 as facts, workflows and micro-insights, whereas a vector database would need you to ask exactly the right query to surface that information, and even then it returns raw messages, not distilled understanding.
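As a rough illustration of what "written down explicitly" means, structured storage records could be sketched like this; the field names and examples are simplified stand-ins, not the real schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative record types for structured storage: distilled understanding
# written down at store time, rather than raw messages recalled at query time.

@dataclass
class Fact:
    topic: str
    text: str
    source: str = "user"                     # facts only come from user messages
    recorded: date = field(default_factory=date.today)

@dataclass
class Workflow:
    name: str
    steps: list[str]

@dataclass
class MicroInsight:
    text: str                                # a learned pattern, not a quote

examples = [
    Fact(topic="writing", text='"the usual format" means bullet points, no headers'),
    Workflow(name="weekly review", steps=["collect notes", "summarise wins", "plan next week"]),
    MicroInsight(text="prefers structured plans over freeform advice"),
]
for record in examples:
    print(record)
```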
The core data structure is the memory map, a compact index of everything in storage that every stage sees.
It lists topic areas with descriptions, session counts and ledger sizes; the model sees the shape of the entire memory in ~2k tokens and picks what to load. Facts are bucketed into topics, so stage 1 only sees topic names and subtopics, keeping the prompt lean.
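A simplified sketch of how such an index can be rendered; the topics, counts and layout here are illustrative, not the production map format.

```python
# Rendering a compact memory map: the model sees the shape of memory, not its contents.
topics = [
    {"name": "projects/c137", "desc": "memory engine design and benchmarks", "sessions": 67, "ledger": 210},
    {"name": "uni/algorithms", "desc": "CS coursework and revision", "sessions": 41, "ledger": 120},
    {"name": "health/gym", "desc": "training plan and progress", "sessions": 12, "ledger": 33},
]

def render_memory_map(topics: list[dict]) -> str:
    return "\n".join(
        f"{t['name']} | {t['desc']} | sessions: {t['sessions']} | ledger: {t['ledger']}"
        for t in topics
    )

print(render_memory_map(topics))
```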
[Figure: Data hierarchy and 1-hop retrieval. Stage 1 sees the memory map and the facts map, not the structure beneath them; cold domains keep their description but collapse their groups, reachable via a conditional 2nd hop. Example query "how is my current project going?": stage 1 picks actions, and the retrieved pieces join the always-present user-level context in the stage 2 prompt. One decision, one fetch.]
All three stages output structured XML rather than using tool calling. I found XML more reliable than actual tool schemas with faster models, and the deterministic parsing keeps things predictable. Here is a sample of what the stage 1 and stage 3 outputs look like:
[Sample output viewer: Stage 1 - Retrieval and Stage 3 - Storage tabs.]
Stage 1 outputs an opening sentence first, which gets streamed to the user, then the actions block. Stage 3 outputs silently in the background. Both give a reason for every action, which becomes important later with the healer.
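As an illustration of why deterministic parsing stays simple, here is a simplified stage-1-style action block and its parse; the tag names and attributes are representative of the general shape, not the exact production schema.

```python
import xml.etree.ElementTree as ET

# Illustrative stage 1 output: an opener to stream, then actions with reasons.
stage1_output = """
<response>
  <opener>Let me pull up where your project left off.</opener>
  <actions>
    <load_topic name="projects/c137" reason="query is about current project status"/>
    <load_facts topic="projects/c137" reason="recent milestones are stored as facts"/>
  </actions>
</response>
"""

root = ET.fromstring(stage1_output)
print("stream:", root.findtext("opener").strip())
for action in root.find("actions"):
    # Every action carries a reason, which the healer can review later.
    print(action.tag, dict(action.attrib))
```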
Stage 1 takes in cached system instructions, the fact map, the memory map, recent conversation logs and the user's query, then scans both maps and decides what to load for the answering model, picking topic areas across either map, fetching facts and running full text search across all knowledge sources. It also outputs an opening sentence that gets streamed to the user while stage 2 processes, masking the retrieval wait.
The opener's stream rate is dynamically calculated from its length so it always buys enough time for stage 2's first token to land. The handover is invisible, and the user sees one coherent response that starts streaming within 1 to 2 seconds.
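A back-of-the-envelope sketch of that pacing, assuming character-level streaming and a rough estimate of stage 2's TTFT; the clamp values are illustrative.

```python
# Stretch the opener's streaming over the expected stage 2 wait so the
# first stage 2 token lands just as the opener finishes.

def chars_per_second(opener: str, expected_stage2_ttft: float) -> float:
    rate = len(opener) / max(expected_stage2_ttft, 0.1)
    return min(max(rate, 15.0), 80.0)        # clamp to stay readable

opener = "Let me pull up where your project left off."
print(f"{chars_per_second(opener, expected_stage2_ttft=1.7):.1f} chars/sec")
```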
I use Grok 4.1 Fast (non-reasoning) here. It has the lowest hallucination rate with consistently fast TTFT. I originally thought tokens per second was the limiting factor and was using Groq with weaker models, but TTFT matters more for this stage since the model only needs to output a small action block, not a long response. Once I realised that, Grok opened up as an option and the quality jumped.
Latency holds steady because prompts stay bounded as memory grows: stage 1 takes about 1.6 seconds, stage 2 about 1.7 seconds, and stage 3 about 1.4 seconds in the background, so the user does not wait for it. The opener from stage 1 masks the retrieval wait, and total user-facing latency is around 3 seconds, the same after months of usage as on day one, because the architecture caps what the model has to read each turn.
[Figure: Latency timeline.]
Stage 2 takes in cached system instructions, all the context stage 1 retrieved, the user model, and tail instructions that help the model weight context evenly rather than leaning toward one section. It then answers the user's question using that retrieved context. The model here is swappable; I found Gemini models are the best at using context while staying cost efficient. A search tool acts as a fallback when retrieval still misses something, dumping related facts and surrounding context into the prompt.
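A simplified sketch of that prompt assembly order; the section markers and tail wording are illustrative, not the real prompt.

```python
# Assembling the stage 2 prompt: a stable cached prefix, a dynamic middle that
# varies per query, and a tail anchor that balances attention across sections.

def build_stage2_prompt(system: str, retrieved: str, user_model: str, query: str) -> str:
    tail = "Weigh every section above evenly; do not lean on any single source."
    return "\n\n".join([
        system,                                                     # static, prefix-cached
        f"<retrieved_context>\n{retrieved}\n</retrieved_context>",  # the only part that changes
        f"<user_model>\n{user_model}\n</user_model>",
        f"<question>\n{query}\n</question>",
        tail,                                                       # static tail anchor
    ])

print(build_stage2_prompt(
    "You are a personal assistant with long-term memory.",
    "Ledger: project c137 benchmark run finished, viewer shipped.",
    "Prefers direct answers and being challenged on ideas.",
    "How is my current project going?",
))
```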
Stage 3 takes in the memory map, the active topic area context and the full exchange that just happened. It picks an existing or new topic area to store in and strips the exchange down into a lean ledger entry, preserving all detail but dropping waffle. Facts are only created from user messages, since that is confirmed truth; anything from the AI response is not guaranteed to be something the user cares about, and it lives in the ledger anyway. Stage 3 also detects workflows refined over many turns and stores micro-insights about the user.
The model deliberately does not see all existing facts, as that does not scale; it only sees topics and subtopics and is allowed to store duplicates freely. A separate deduplication service handles merging, which keeps the model from withholding facts it thinks are redundant and keeps tokens low. The overall focus with this pipeline is making things as easy for the models as possible and taking as much work away from them as I can.
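A toy sketch of what an out-of-band dedup pass does; the real service merges semantically, this just shows the shape with exact-match normalisation.

```python
# The storage model writes duplicates freely; a background pass merges them so
# prompts stay small without the model second-guessing what to keep.

def normalise(fact: str) -> str:
    return " ".join(fact.lower().split()).rstrip(".")

def dedupe(facts: list[str]) -> list[str]:
    seen: dict[str, str] = {}
    for fact in facts:
        seen.setdefault(normalise(fact), fact)   # keep the first phrasing
    return list(seen.values())

facts = [
    "Prefers bullet points with no headers.",
    "prefers bullet points with no headers",
    "Studying CS, first year.",
]
print(dedupe(facts))   # two facts survive
```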
Most memory systems are linear pipelines. Input goes in, output comes out, nothing reviews itself. c137 has feedback loops. Every stage gives a reason for every action it takes, and a background service called the healer reads those reasons to check the work. It triggers after an activity threshold and handles deduping, merging facts, validating placement and checking nothing important was missed. Stages can also flag things they think might be wrong for the healer to review. The stages talk to each other and review each other; memory is a living system.
Alongside the healer there's another background step that consolidates user facts and micro-insights into the user model itself, the picture of who you are and how you work that stage 2 always sees. It re-runs as new facts and insights come in, so the user model stays compact and current rather than just growing.
The healer was not used in the benchmark, as I wanted to test the raw pipeline without post-processing, and since only 10 questions failed due to retrieval misses the system holds without it. It was originally built when I was using weaker models that made poor storage decisions; while less necessary after upgrading to Grok, it remains an important part of the system for long-term integrity.
This is the part no one else seems to properly break down, and I think it's the most important. Strong benchmark scores are great, but if the system does not scale in tokens and cost then they do not matter. The whole point of this architecture is that prompt size does not balloon as memory accumulates.
Every component has a hard cap. The memory map caps at 5k tokens, and if it grows past that the least active topics get trimmed out; identity facts cap at 100 per user; topical facts cap at 100 per topic, with the oldest demoted to cold storage when the cap is hit. None of these grow unbounded.
When topics or facts move to cold they are no longer visible to stage 1, which means they are not directly accessible, so a conditional call is used between stages 1 and 2: if stage 1 decides the info might be in cold, collapsed data it can check the cold area, and for facts whose topic area is cold it can load those facts and pass them through as actions for stage 2. This technically becomes a two-hop problem, but agentic flexibility is still not required since a simple conditional call handles it, and nothing is ever deleted, just moved further from the hot path.
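A compact sketch of cap-and-demote plus the conditional second hop; the per-topic cap follows the numbers above, the rest is a simplified stand-in.

```python
# Hard caps keep hot storage bounded; overflow demotes to cold rather than
# deleting, and a conditional second hop reaches cold when stage 1 asks for it.

FACTS_PER_TOPIC_CAP = 100

hot: dict[str, list[str]] = {"projects/c137": [f"fact {i}" for i in range(100)]}
cold: dict[str, list[str]] = {}

def store_fact(topic: str, fact: str) -> None:
    facts = hot.setdefault(topic, [])
    facts.append(fact)
    while len(facts) > FACTS_PER_TOPIC_CAP:
        cold.setdefault(topic, []).append(facts.pop(0))   # oldest demoted, never deleted

def retrieve_facts(topic: str, check_cold: bool) -> list[str]:
    facts = list(hot.get(topic, []))
    if check_cold:                                        # the conditional second hop
        facts += cold.get(topic, [])
    return facts

store_fact("projects/c137", "fact 100")
print(len(hot["projects/c137"]), len(cold["projects/c137"]))    # 100 hot, 1 cold
print(len(retrieve_facts("projects/c137", check_cold=True)))    # 101 with the hop
```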
Conversation logs would grow unbounded too if left raw, so they use a rolling compression system: the last 10 exchanges are always present verbatim, every 10 older rows get compressed into an L1 summary block, and every 10 L1 blocks get compressed into an L2 block covering 100 rows, so token count scales logarithmically rather than linearly. The compression model is instructed that every number, name, date and measurement in the input must appear in the output, so specific details survive, and any gap between the last compressed block and the recent window is captured as overflow rows so nothing gets lost. Any detail that does get compressed away is still covered by the topical facts, which always retain full detail.
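A rough sketch of the rolling window maths; `summarise` stands in for the compression model and the block sizes follow the numbers above.

```python
# The last 10 exchanges stay verbatim; older rows collapse 10:1 into L1 summary
# blocks, and every 10 L1 blocks collapse into one L2 block covering 100 rows.

RECENT, L1_SIZE, L2_SIZE = 10, 10, 10

def summarise(rows: list[str]) -> str:
    # Stand-in for the compression model, which must carry over every number,
    # name, date and measurement from its input.
    return f"[summary of {len(rows)} rows]"

def compress(rows: list[str]) -> list[str]:
    older, recent = rows[:-RECENT], rows[-RECENT:]
    full = len(older) - len(older) % L1_SIZE
    l1 = [summarise(older[i:i + L1_SIZE]) for i in range(0, full, L1_SIZE)]
    overflow = older[full:]                   # gap rows kept as-is, nothing lost
    full_l1 = len(l1) - len(l1) % L2_SIZE
    l2 = [summarise(l1[i:i + L2_SIZE]) for i in range(0, full_l1, L2_SIZE)]
    return l2 + l1[full_l1:] + overflow + recent

rows = [f"exchange {i}" for i in range(247)]
print(len(compress(rows)))   # 22 prompt rows instead of 247
```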
[Figure: Compression layers.]
[Chart: Sessions ingested vs Stage 2 input tokens. More memory, same prompt size.]
Proof from the benchmark: 500 questions, each with 35 to 62 sessions of conversation history ingested, bucketed by session count. The meaningfully sampled middle buckets (40-44 n=86, 45-49 n=255, 50-54 n=143, 55-59 n=12) show median Stage 2 tokens hovering around 14-15k regardless of how much memory accumulated. The endpoints (35-39 and 60-62, n=2 each) are noise and shouldn't be read as a trend. The architecture selects what is relevant rather than dumping everything, so prompts stay bounded where embedding-based systems scale linearly. LongMemEval-S caps at ~60 sessions; whether this holds across years of continuous use is what I am building a bigger benchmark to test (see Bench Gaps below).
Average token budget per stage
| Stage | Model | Input | Breakdown |
|---|---|---|---|
| S1 Retrieve | Grok 4.1 Fast | ~6k | 3k cached system + map + recent logs + query |
| S2 Answer | Swappable | ~15k | 5k cached system + 2k user model + 8k dynamic + 2k tail |
| S3 Store | Grok 4.1 Fast | ~8k | 3k cached system + map + active context |
Average cost per question - full pipeline
| Component | Flash | Pro | GPT-4o |
|---|---|---|---|
| S1 + S3 (Grok) | $0.0022 | $0.0022 | $0.0022 |
| S2 (Answerer) | $0.0077 | $0.0306 | $0.0378 |
| Total | $0.0099 | $0.0328 | $0.0400 |
Since the answering model only sees the lean retrieved context, it can be swapped without blowing up cost. The pipeline stages on Grok are fixed overhead and the answerer is the variable: you could put a more expensive model on stage 2 and the total stays manageable, because 15k input is 15k input regardless of the model.
A few months in, c137 has picked up things about me I never told it. It learnt them over time from repeated patterns in how I talk and work. That I trust it more when it pushes back and disagrees with me than when it agrees. That when I revise I go for understanding over memorisation, properly working through things rather than just committing facts. That I land on solutions when I'm away from the work, right before sleep or when I'm not thinking about it. Even the small stuff, like the fact I keep coming back to sushi no matter what I'm eating that month. None of this was fed in deliberately.
Then it acts on all of it, every response shaped by what it knows without me touching a setting. Bring an idea and it actually challenges me because that's the mode I trust, ask about a CS topic and the answer comes formatted for understanding by default, get stuck on something late and it suggests I sleep on it before pushing through because it knows that's when I solve things. Even the small stuff, ask what to eat and it pitches sushi. None of this is configurable, it just happens because the AI has built up a real picture of who I am.
Ask any other AI the same questions and you get a blank slate or a generic guess you have to correct three times before it lands. That's the difference between an AI that actually knows you and one that doesn't, and once you've used the first kind going back feels broken.
10 questions out of 500 failed across all three models due to retrieval misses, meaning the context needed to answer was not in the prompt at all. This is the main thing I need to work on, and it comes down to how stage 3 stores and how stage 1 searches. Some of these were edge cases like computing totals across multiple conversations, where individual numbers were stored but the sum was not; others were cases where an older value got overwritten and the previous state was lost. These are solvable with better storage prompts, and I have ideas for how to handle them. Every retrieval miss is tagged in the bench viewer so you can see exactly which ones.
The gap between current scores and the 98% retrieval ceiling lives in stage 2: the answerer sometimes misuses context that retrieval lands correctly. This is prompt work, not architecture; each round of stage 2 iteration has moved scores up and there is more room to push. Getting closer to 98% is a matter of continuing that work, and nothing structural needs changing for it.
These benchmarks test factual recall, but that is only one component of memory. They do not test whether the AI adapts its responses to you, or whether it naturally reuses workflows and preferences you stated before without you having to repeat yourself when doing a task again. Does the assistant actually feel like it knows you, or is it just retrieving facts? The user model and personality system I built are designed for exactly this, but there is no standard way to measure it yet.
They also do not properly test token and cost scaling, which as I showed above is critical for a system meant to be used long term. LongMemEval-S tests with around 50 sessions of history, which is a good start but does not cover years of usage. LOCOMO tests longer periods, but still not long enough to really stress-test how memory holds up over years of continuous use with thousands of sessions and growing topic areas.
What I want to build is a benchmark that scores memory across preference adaptation, workflow reuse, forgetting and contradiction handling, on 1 year, 5 year and 10 year conversation histories with thousands of sessions per user. Open source so anyone can run it against any system including mine. This is the benchmark that tells you whether the AI actually knows you, not whether it can surface a fact from a log.
The work splits into a few bricks from here.
Brick 2 - Wider use cases
A character app on the same memory engine, leaning into model personality and behavioural controls so characters actually grow with you instead of resetting every conversation. Different surface, same engine.
Brick 3 - API and improved scores
Closing the gap to the 98% retrieval ceiling is stage 2 prompt work, no architectural changes. I will publish results across more answering models and ship a public API so anyone can run their own LongMemEval pass against c137 and reproduce the numbers. Closed source is the right critique today; the API is the answer.
Brick 4 - Agentic context
Mapped Memory was designed for chat. Tool-calling agents change the shape of the problem: larger and noisier context per turn, transient state mixed with persistent state. Adapting the architecture to that is the next expansion, with its own benchmarks.
Brick 5 - A new benchmark
LongMemEval tests recall well, but not the things memory actually has to do day to day: preference adaptation, workflow reuse, silent forgetting, contradiction handling, or token scaling over years rather than sessions. It will be open source so anyone can run any system through it, including mine. This is the bench that tells you whether an AI knows you.
More beyond that, but those are the next ones I am committing to publicly. The end goal hasn't changed since day one: an AI that actually knows you, not a search engine over your past messages or a chatbot with a long context window, but something that understands how you think and how you work and gets better the longer you use it. This is brick 1.