Memory v2: What 45,000 Stars Were Actually Asking For


A celebrity-backed AI memory project hit 45,000 GitHub stars in under a week and briefly became the focal point for a much larger argument about what AI memory systems actually are. Public discussion quickly split between the launch story and the implementation reality: a ChromaDB collection, a SQLite triple store, a spatial metaphor, and benchmark claims that turned out to be narrower than many readers assumed. A widely circulated public code review argued that the advertised contradiction-detection behaviour was not implemented in the relevant code path.

This article is not about that project. It is about why 45,000 developers starred it anyway. That number is a referendum on the state of AI memory infrastructure, and if you read it as one, it says something specific about what the field is actually asking for. This article covers what that signal was, where the viral project fell short of it, the nine architectural properties that a real memory v2 has to carry, and what that shape starts to look like in MinnsDB.

The Signal, Not the Artifact

The current AI memory market makes three bets, and they are bets about architecture, not about commercial model. Any of them can be self-hosted or run as a service. What matters is how memory is structured underneath.

The first bet is "let the LLM manage its own memory," and Letta (formerly MemGPT) is the cleanest example of it. The model is handed read and write tools over a hierarchy of memory blocks, and at every turn it decides which facts belong in the small hot block it sees directly versus the larger archival store it has to query. Memory is a program the model edits.

The second bet is "extract facts early and trust the abstraction," and Mem0 is the cleanest example of it. An LLM decides at write time that user_prefers_postgres is the fact worth keeping, the conversation that explained why gets discarded, and every later query has to route through whichever schema the extractor happened to pick.

The third bet, which almost every open-source project in this space has taken, is "wrap a database someone else wrote." This usually means a few hundred lines of Python sitting on top of ChromaDB, Qdrant, Weaviate, Neo4j, or some combination of them, calling their APIs and adding orchestration on top. Zep/Graphiti sits here too, as a temporal knowledge graph layer running on top of Neo4j or FalkorDB, which are engines it does not own. Cognee and most of its peers sit here as well. The orchestration these projects add can be genuinely useful, but it is still composition on top of engines they do not own, and the ceiling is wherever those engines' APIs stop. The viral project is a less careful example of the same pattern. It is a ChromaDB collection plus a SQLite triple store plus orchestration code that does not own a single byte of the storage or indexing it depends on.

Each bet loses something.

The first loses whatever the model did not decide was worth filing. If the LLM never promoted a fact into its hot block, that fact is gone from the hot path until some future retrieval round happens to surface it, and the resulting agent has exactly the memory the model was feeling reflective enough to write.

The second loses resolution. You cannot ask a question the extractor did not anticipate, because the conversation that would have answered it no longer exists in the store.

The third loses the ability to implement anything the wrapped engines do not already implement. If Chroma does not do bi-temporal filtering at the index level, your Python wrapper does not either. If Neo4j's transaction model does not match the supersession semantics you need, you cannot fix it from outside. A wrapper is a tenant of the engines it wraps, and every architectural property it wants to add has to fit through their public API or be simulated on top of it. Simulation is where performance and correctness both leak.

None of this is dismissal. These were good v1 directions, and each one moved the field. Letta pushed the idea that memory is a long-running stateful program the model actively edits, not a pile of chunks. Mem0 made agent memory a first-class category instead of a LangChain subpackage, and made "extract structured facts at write time" a real architectural stance people could argue about. Zep/Graphiti pulled temporal reasoning and hybrid retrieval into mainstream RAG vocabulary and proved that temporal knowledge graphs were a viable memory substrate at all. Cognee and its peers showed how far orchestration on top of existing engines could actually get you. The viral project, for all its overclaims, took the single most important stand in the whole category out loud: store everything, then make it findable.

They are all v1 in the honest sense. Each one moved the conversation, and each one is now bumping into the ceiling of the abstraction it chose or the engine it built on. The v2 comes next, and it will have to own the stack in a way none of them do.

The viral project landed its shot by saying, out loud, that the first two bets were wrong. 45,000 stars is what happens when a sentence like "store everything, then make it findable" lands in a market that has been under-served in three directions at once.

The artifact that shipped under that sentence is not the v2. The signal is.

Where the Viral Project Fell Short of Its Own Thesis

Five gaps, framed architecturally rather than as benchmark gotchas.

The knowledge graph is a SQLite triple store. Three columns: subject, predicate, object. There is no notion of when a triple became true, when it stopped being true, or how the database learned it. For a system whose stated thesis is to preserve the full history of what was said, having no temporal dimension on the relationships themselves is an inversion. The conversations are preserved, but the facts derived from them are stateless.

There is no multi-hop traversal. Retrieval is a single-collection vector search with metadata filters. "Find all the papers my advisor's former students published on retrieval" is not expressible as one query. You can retrieve chunks about the advisor, chunks about the students, chunks about papers, and hope a language model stitches them at read time. That hope is not an architecture.

"Contradiction detection" is claimed but unlocated in the code. Independent code review argued that the advertised behaviour is not wired up in the knowledge-graph path. This matters less as a factual correction and more as a tell. Building real contradiction handling requires time-indexed edges, supersession semantics, and a query language that can express "as of when." A system that lacks all three will not bolt contradiction detection on as a feature.

The spatial metaphor reduces to metadata filtering. Wings, Rooms, and Halls are a nice pedagogical frame. In the implementation they become three ChromaDB metadata fields. Nothing structural emerges from the geometry. There is no traversal across rooms, no cost model for which wing to expand, no promotion of a concept from one hall to another as its evidence accumulates. The metaphor is a naming convention for vector-DB filters.

The benchmark that made the launch viral measured something other than what most readers took the headline number to mean. LongMemEval's primary metric is a generated answer judged by GPT-4. The headline score being circulated was a retrieval-style top-k result, not an end-to-end answer-quality result. The two numbers live in different categories. For comparison, published end-to-end scores on the same benchmark are in the 70s and low 80s: EverMemOS at 83.0%, TiMem at 76.88%, Zep/Graphiti at 71.2%. None of that means the retrieval score was useless. It means the launch marketing was answering a different question than the one the field cares about.

These five gaps rhyme. They are the same gap, told five ways: the project built a retrieval layer and called it a memory system. Memory is what happens around retrieval: the temporal structure, the multi-hop reach, the supersession of things that used to be true, the observer that notices when a fact changes. Retrieval is a step. Memory is a substrate.

What Memory v2 Actually Needs

Nine architectural properties, derived from the gaps above and from what the mainstream research line has been converging on. The first one is a meta-constraint. It determines whether the other eight can be implemented correctly at all.

1. Own the stack

The memory system owns its storage format, its index structures, its transaction model, and its query planner. It does not import Chroma and hope. It does not shell out to Neo4j and translate. It does not compose three vendor engines with a Python orchestration layer and call the composition a database.

This sounds like an aesthetic preference and is actually a correctness constraint. Bi-temporal semantics, supersession chains, delta broadcasts, and multi-representation retrieval all cross-cut the storage layer. They cannot be implemented as wrappers because they require primitives the wrapped engines do not expose. A wrapper can add features the engine already has and rename them. A substrate can add features the engine never had. Memory v2 is a substrate.

2. Bi-temporal edges

Every relationship carries two time axes: when it was true in the world (valid_from, valid_until) and when the database learned it (created_at). Neither axis can be collapsed into the other. You need valid-time to answer "where did Alice live in 2024." You need transaction-time to answer "what did our system believe last Tuesday." A memory system that supports one without the other cannot answer questions about its own history.

3. Supersession, not deletion

When Alice moves from London to Berlin, the London edge does not get overwritten. It gets a valid_until timestamp equal to the moment Berlin became true. Both edges remain in the store. Current-state queries filter to edges where valid_until IS NULL. Point-in-time queries filter to edges where valid_from <= t and either valid_until IS NULL or t < valid_until, so that edges which are still open are included. This is the only honest way to give a language model the answer to "what did you used to think."
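The supersession rule above can be sketched in a few lines of Rust. This is a minimal illustration, not MinnsDB's implementation: the `Store` type, its method names, and the use of plain strings and integer timestamps are all assumptions made for the example, and it assumes facts arrive in valid-time order.

```rust
/// A minimal bi-temporal edge. Field names follow the article; types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Edge {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub valid_from: u64,          // valid time: when the fact became true
    pub valid_until: Option<u64>, // valid time: None means still true
    pub created_at: u64,          // transaction time: when the store learned it
}

/// Hypothetical in-memory store, for illustration only.
#[derive(Default)]
pub struct Store {
    edges: Vec<Edge>,
}

impl Store {
    /// Record a single-valued fact. Any open edge for the same
    /// (subject, predicate) is closed at `valid_from`, never deleted.
    pub fn assert_functional(
        &mut self,
        subject: &str,
        predicate: &str,
        object: &str,
        valid_from: u64,
        now: u64,
    ) {
        for e in self.edges.iter_mut() {
            if e.subject == subject && e.predicate == predicate && e.valid_until.is_none() {
                e.valid_until = Some(valid_from);
            }
        }
        self.edges.push(Edge {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            object: object.to_string(),
            valid_from,
            valid_until: None,
            created_at: now,
        });
    }

    /// Current-state query: the edge whose valid_until is still open.
    pub fn current(&self, subject: &str, predicate: &str) -> Option<&str> {
        self.edges
            .iter()
            .find(|e| e.subject == subject && e.predicate == predicate && e.valid_until.is_none())
            .map(|e| e.object.as_str())
    }

    /// Point-in-time query: valid_from <= t and (open or t < valid_until).
    pub fn as_of(&self, subject: &str, predicate: &str, t: u64) -> Option<&str> {
        self.edges
            .iter()
            .find(|e| {
                e.subject == subject
                    && e.predicate == predicate
                    && e.valid_from <= t
                    && e.valid_until.map_or(true, |until| t < until)
            })
            .map(|e| e.object.as_str())
    }

    /// Number of edges retained, including superseded ones.
    pub fn len(&self) -> usize {
        self.edges.len()
    }
}
```

Asserting London at time 100 and Berlin at time 200 leaves two edges in the store; `current` returns Berlin, while `as_of(..., 150)` still returns London.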

4. Multi-hop traversal as a first-class query

Some questions do not live in any single document. "What API does our project use that's built by the company Steve used to work at?" expands to a three-hop path:

Steve -> worked_at -> company -> builds -> API -> used_by -> project

In a triple store or vector collection, answering this is an iterative-retrieval loop with a prompt holding it together. In a graph query language it is one statement. The difference is not performance. It is whether the answer can be structurally verified.
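To make "structurally verified" concrete, here is a small Rust sketch of following a fixed predicate sequence over an in-memory edge map. The `Graph` type and `follow` method are hypothetical names for this example and stand in for what a graph query language would compile to; every returned path is a chain of edges that actually exist.

```rust
use std::collections::HashMap;

/// Hypothetical adjacency map: (source, predicate) -> targets.
#[derive(Default)]
pub struct Graph {
    out: HashMap<(String, String), Vec<String>>,
}

impl Graph {
    /// Add one directed, labelled edge.
    pub fn add(&mut self, source: &str, predicate: &str, target: &str) {
        self.out
            .entry((source.to_string(), predicate.to_string()))
            .or_default()
            .push(target.to_string());
    }

    /// Follow `predicates` in order from `start`. Only paths that complete
    /// every hop are returned, so each result is backed by stored edges.
    pub fn follow(&self, start: &str, predicates: &[&str]) -> Vec<Vec<String>> {
        let mut paths = vec![vec![start.to_string()]];
        for predicate in predicates {
            let mut next = Vec::new();
            for path in &paths {
                let last = path.last().expect("paths are never empty");
                let key = (last.clone(), predicate.to_string());
                if let Some(targets) = self.out.get(&key) {
                    for target in targets {
                        let mut extended = path.clone();
                        extended.push(target.clone());
                        next.push(extended);
                    }
                }
            }
            paths = next;
        }
        paths
    }
}
```

Calling `follow("steve", &["worked_at", "builds", "used_by"])` returns only the chains where all three hops exist; a company API that no project uses is dropped at the last hop instead of being left for a language model to rule out.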

5. A deterministic write path

There has to be a way to ingest without calling an LLM on every message. The viral project got this right, and it is worth naming. If your write path depends on a paid API, your memory system has a cloud dependency whether or not the read path does. A v2 substrate should support both an LLM-assisted ingestion cascade and a rules-driven path that mines existing transcripts at zero marginal cost. The LLM cascade adds richness. The deterministic path adds durability.

6. Hybrid retrieval with a rerank stage

Dense vector search alone misses rare tokens. BM25 alone misses paraphrase. Structural traversal alone misses the cases where the relationship does not exist in the graph yet but does exist in the underlying text. A v2 memory layer should compose all three and let a learned reranker sort the final candidates.

This is no longer a research position. Every credible system in the field has moved here. It is also the property that most obviously cannot be bolted on from outside. Fusing three retrieval legs with shared filter predicates requires a query planner that sees all three.
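As a rough illustration of fusing three legs under one shared filter, the sketch below uses reciprocal rank fusion (RRF). RRF is a simple, well-known stand-in chosen for this example; it is not the learned reranker described above, and the function name and signature are assumptions.

```rust
use std::collections::HashMap;

/// Fuse ranked candidate lists with reciprocal rank fusion:
/// score(d) = sum over legs of 1 / (k + rank of d in that leg).
/// The same `allowed` predicate is applied to every leg before ranking,
/// which is the "shared filter" part of the argument.
pub fn rrf_fuse(
    legs: &[Vec<&str>],
    k: f64,
    allowed: impl Fn(&str) -> bool,
) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for leg in legs {
        let mut rank = 0usize;
        for &id in leg {
            if !allowed(id) {
                continue; // filtered candidates do not consume a rank
            }
            rank += 1;
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + rank as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

When the three legs live in three separate engines, the `allowed` predicate has to be re-expressed in each engine's filter dialect and kept in sync by hand. In a single planner it is one closure.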

7. Reactive subscriptions

Agents that observe their memory instead of polling it are not a nice-to-have. They are the difference between "the assistant noticed your calendar changed" and "the assistant runs a cron job over the entire graph every minute." When a node or edge is mutated, the database should broadcast a delta on a channel. Subscribers compile a trigger set so irrelevant deltas get rejected in O(1). This is how you build an agent that stays aware without burning tokens on redundant reads.
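One way to get the O(1) rejection is to compile each subscription into a bitmask over edge-type tags. The sketch below assumes tags fit in 0..=63 and uses the standard library's `mpsc` channel; the `Delta`, `Trigger`, and `Broadcaster` types are hypothetical and only illustrate the idea.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Simplified delta carrying only an edge-type tag (an assumption for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Delta {
    EdgeAdded { edge_type_tag: u8 },
    EdgeRemoved { edge_type_tag: u8 },
}

impl Delta {
    fn tag(&self) -> u8 {
        match self {
            Delta::EdgeAdded { edge_type_tag } | Delta::EdgeRemoved { edge_type_tag } => {
                *edge_type_tag
            }
        }
    }
}

/// Trigger set compiled at subscribe time: one bit per edge-type tag.
pub struct Trigger {
    mask: u64,
}

impl Trigger {
    pub fn for_tags(tags: &[u8]) -> Self {
        let mut mask = 0u64;
        for &t in tags {
            assert!(t < 64, "this sketch supports tags 0..=63 only");
            mask |= 1u64 << t;
        }
        Trigger { mask }
    }

    /// Constant-time relevance check: a shift and a bit test.
    pub fn matches(&self, delta: &Delta) -> bool {
        let tag = delta.tag();
        tag < 64 && (self.mask >> tag) & 1 == 1
    }
}

#[derive(Default)]
pub struct Broadcaster {
    subscribers: Vec<(Trigger, Sender<Delta>)>,
}

impl Broadcaster {
    pub fn subscribe(&mut self, trigger: Trigger) -> Receiver<Delta> {
        let (tx, rx) = channel();
        self.subscribers.push((trigger, tx));
        rx
    }

    /// Deliver a delta only to subscribers whose trigger matches it.
    pub fn publish(&self, delta: Delta) {
        for (trigger, tx) in &self.subscribers {
            if trigger.matches(&delta) {
                let _ = tx.send(delta); // a dropped receiver is ignored here
            }
        }
    }
}
```

A subscriber registered for one tag never wakes for deltas carrying another; the rejection costs a shift and a mask per subscriber, independent of graph size.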

8. An ontology you can both define and evolve

Property behaviors (is this predicate functional, meaning one value per subject? symmetric? transitive? append-only?) belong in a schema, not in code. If they live in code, then every new domain requires a deploy. A v2 system should load OWL/RDFS Turtle at startup and let you edit it at runtime. It should also infer candidate behaviors from observed data so that the schema grows with the graph instead of lagging behind it.

9. Local-first, single binary

The simplest of the nine, and the least negotiable. If the system cannot run with no external services (no Postgres, no vector DB sidecar, no cloud API, no hosted coordinator), then it is not a local memory system. It is a client for someone else's memory system. The viral project got this right about local ingestion and wrong about what its headline metric actually showed. A real v2 gets it right end to end.

These nine properties are not novel in isolation. Temporal graphs exist. Reactive subscriptions exist. Hybrid retrieval exists. The missing thing is a system that has all nine at once, in one process, written as one thing rather than assembled from parts, that you can put on a laptop. The community signal is not asking for a new idea. It is asking for the integration, and the integration is what gets lost when every project in the space ships as a wrapper around engines it does not own.

A Direction Worth Watching

MinnsDB is a temporal database written in Rust. It is not the v2. It is, at the architectural level, aligned with each of the nine properties above, which makes it a useful artifact for reasoning about what that shape looks like concretely. This section walks the mapping.

Owning the stack

MinnsDB is not a Python library sitting in front of Chroma, Pinecone, Qdrant, Weaviate, or Neo4j. It is the vector index, the keyword index, the graph store, and the query engine compiled into one program.

A typical RAG memory layer today looks like pip install chromadb for dense retrieval, a second service for BM25, maybe Neo4j in Docker if you need relationships, and LangChain or LlamaIndex holding the three together. Four dependencies, four network hops per query, four places your data can silently fall out of sync. MinnsDB replaces that stack with one binary and one file on disk. The embeddings, the text they came from, the facts extracted out of the text, and the relationships between those facts all live in the same place, reachable in one query.

The real payoff is the questions you can then ask. Give me vector hits whose source fact was still true last Tuesday. Only rank chunks whose source entity has more than three incoming citations. Find semantic matches, but only inside the subgraph of things Alice worked on. None of those are writable when your vectors live in one database and your facts live in another, because the information each layer needs about the other never crosses the API boundary between them. A wrapper can only combine what the wrapped engines already return. A substrate can ask questions that cross layers, because inside a substrate there are no layers to cross.

Bi-temporal edges and supersession

Every edge in the MinnsDB graph carries valid_from: Option<u64> and valid_until: Option<u64>, alongside the monotonic created_at. When a new fact contradicts an existing single-valued predicate (location:lives_in, work:employer), the old edge is not removed. It receives a valid_until equal to the timestamp of the contradicting event, and the new edge takes over with valid_until = None. Current-state queries filter to valid_until IS NULL; point-in-time queries filter with a timestamp. The graph never forgets that Alice used to live in London.

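pub struct GraphEdge {
    pub id: EdgeId,
    pub source: NodeId,
    pub target: NodeId,
    pub edge_type: EdgeType,
    pub weight: f32,
    pub confidence: f32,
    // Valid time: when the fact became true in the world.
    pub valid_from: Option<u64>,
    // Valid time: when it stopped being true; None means still current.
    pub valid_until: Option<u64>,
    // Transaction time: monotonic, when the database learned the fact.
    pub created_at: u64,
    // ...
}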

Multi-hop traversal

MinnsQL's MATCH clause expresses the three-hop question above as one statement. Path length, predicate constraints, and temporal windows are all first-class. Traversal is bounded (10,000 visited nodes max and a 30 second deadline) so that a pathological query cannot take the process down. Allen's Interval Algebra predicates (overlap, meets, precedes, covers) are available at the predicate layer, which means "did Alice change jobs around the same time she moved" compiles to a graph query rather than an application-layer diff.
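The traversal bounds can be illustrated with a breadth-first search that carries a visited-node budget and a wall-clock deadline. The two default limits come from the paragraph above; the function, the error type, and the adjacency representation are assumptions for this sketch and say nothing about how MinnsDB implements its own checks.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum TraversalError {
    NodeBudgetExceeded,
    DeadlineExceeded,
}

pub struct Limits {
    pub max_visited: usize,
    pub deadline: Duration,
}

impl Default for Limits {
    /// Defaults mirror the figures quoted in the text.
    fn default() -> Self {
        Limits {
            max_visited: 10_000,
            deadline: Duration::from_secs(30),
        }
    }
}

/// Breadth-first traversal that fails fast instead of running unbounded.
pub fn bounded_bfs(
    adj: &HashMap<u32, Vec<u32>>,
    start: u32,
    limits: &Limits,
) -> Result<Vec<u32>, TraversalError> {
    let started = Instant::now();
    let mut visited = HashSet::from([start]);
    let mut order = vec![start];
    let mut queue = VecDeque::from([start]);

    while let Some(node) = queue.pop_front() {
        if started.elapsed() > limits.deadline {
            return Err(TraversalError::DeadlineExceeded);
        }
        if let Some(neighbors) = adj.get(&node) {
            for &next in neighbors {
                // `insert` returns false for nodes already seen, so cycles terminate.
                if visited.insert(next) {
                    if visited.len() > limits.max_visited {
                        return Err(TraversalError::NodeBudgetExceeded);
                    }
                    order.push(next);
                    queue.push_back(next);
                }
            }
        }
    }
    Ok(order)
}
```

Returning an error rather than a truncated result is a design choice in this sketch: the caller can tell "no path" apart from "gave up looking."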

Two-phase writes that handle contradictions

Conversation ingestion in MinnsDB runs in two phases. Phase one writes single-valued facts first, so that contradictions with existing single-valued edges trigger supersession immediately. Phase two writes multi-valued facts with cascade dependency metadata, so that changing one fact can invalidate dependents in a single atomic window. The ordering is not cosmetic. It is what keeps the graph temporally consistent under concurrent ingestion. An earlier post in this series, Why Merge Is the Hardest Operation in a Temporal Knowledge Graph, covers the downstream case where two nodes get identified as the same entity and every edge, every index, and every live subscription has to be reconciled in one pass.

Deterministic and LLM-assisted ingestion, both

The ingestion pipeline can run with a three-call LLM cascade (entity extraction -> relationship discovery -> structured fact formation) or with a deterministic path that takes pre-tokenized events and writes graph deltas directly. The LLM cascade adds entity resolution and confidence scoring. The deterministic path lets you replay a transcript into the graph with no cloud dependency at all. The two paths converge on the same internal write API, so the rest of the system (subscriptions, the query engine, the ontology layer) does not know or care which ingestion mode produced an edge.

Hybrid retrieval in one call

Dense vector search, BM25 keyword search, and graph traversal all run through the same query engine. You write one statement, you get results fused across all three legs, and the three legs share the same filters, the same time window, and the same scope. This is the difference from the usual LangChain pattern where you hit Chroma, hit Elasticsearch, walk your graph, and then write Python to merge the three result sets without losing the filters you meant to apply. A previous post, Pre-Compiled Filter Sets and Query-Time Specialization in Temporal Databases, walks a concrete case where a five-value filter against 100,000 rows was doing 500,000 redundant checks per query and dropped to zero after the constants were evaluated once at plan time instead of once per row. The same trick applies to temporal filters.
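The arithmetic behind that example is 5 filter values × 100,000 rows = 500,000 per-row evaluations. The referenced post is not reproduced here, so the sketch below is only one plausible shape of the idea: per-constant work (represented by a hypothetical `normalize` function) is done once at plan time, and the scan itself does a set lookup per row.

```rust
use std::collections::HashSet;

/// Stand-in for any per-constant work (parsing, normalising, coercion).
/// The counter records how many times that work is performed.
fn normalize(value: &str, evaluations: &mut usize) -> String {
    *evaluations += 1;
    value.trim().to_lowercase()
}

/// Naive scan: every constant is re-evaluated for every row.
/// Returns (matching rows, constant evaluations).
pub fn naive_scan(rows: &[String], constants: &[&str]) -> (usize, usize) {
    let mut evaluations = 0;
    let mut matches = 0;
    for row in rows {
        let mut hit = false;
        for c in constants {
            hit |= normalize(c, &mut evaluations) == *row;
        }
        if hit {
            matches += 1;
        }
    }
    (matches, evaluations)
}

/// Filter whose constants were evaluated once, at plan time.
pub struct CompiledFilter {
    allowed: HashSet<String>,
    pub plan_time_evaluations: usize,
}

impl CompiledFilter {
    pub fn compile(constants: &[&str]) -> Self {
        let mut n = 0;
        let allowed: HashSet<String> = constants.iter().map(|c| normalize(c, &mut n)).collect();
        CompiledFilter {
            allowed,
            plan_time_evaluations: n,
        }
    }

    /// Per-row work is a single hash lookup; no constant is re-evaluated.
    pub fn scan(&self, rows: &[String]) -> usize {
        rows.iter()
            .filter(|row| self.allowed.contains(row.as_str()))
            .count()
    }
}
```

With five constants and 100,000 rows, the naive scan performs 500,000 constant evaluations and the compiled filter performs five, all before the first row is read.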

Reactive subscriptions

When something in the graph changes, MinnsDB broadcasts a delta on a channel that any subscriber can listen to. Each subscription registers a compact trigger at subscribe time ("wake me when an edge of type work:employer is added or removed"), and the broadcast layer drops irrelevant deltas in one cheap check, so your subscriber only wakes up for changes it actually cares about. No polling loop, no re-running your whole retrieval pipeline every minute just in case something moved.

When a node is removed, subscribers receive one NodeRemoved event plus one EdgeRemoved event per incident edge, bundled into a single batch so the observer sees the cascade as one atomic effect:

// Field types are omitted in this excerpt.
pub enum GraphDelta {
    NodeAdded    { node_id, node_type_disc, generation },
    NodeRemoved  { node_id, node_type_disc, generation },
    EdgeAdded    { edge_id, source, target, edge_type_tag, generation },
    EdgeRemoved  { edge_id, source, target, edge_type_tag, generation },
    // ...
}

The cascade is structural: the delete path collects all outgoing and incoming edges of the removed node, appends one EdgeRemoved per edge to the batch, and broadcasts the whole batch in a single send so that subscribers see either all of the effect or none of it. The test harness exercises this cascade directly. delete_node(alice) produces one NodeRemoved plus one EdgeRemoved per incident edge, in one batch, with a generation range. This is the shape a v2 needs: observe the graph, do not poll it.
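A compact sketch of that delete path follows. It is an illustration under assumptions, not MinnsDB's code: the field types, the tuple representation of edges, the choice to put NodeRemoved first in the batch, and the per-delta generation bump are all invented for the example.

```rust
use std::sync::mpsc::Sender;

/// Simplified deltas; field types are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum GraphDelta {
    NodeRemoved { node_id: u64, generation: u64 },
    EdgeRemoved { edge_id: u64, source: u64, target: u64, generation: u64 },
}

/// Hypothetical graph holding edges as (edge_id, source, target).
pub struct Graph {
    edges: Vec<(u64, u64, u64)>,
    generation: u64,
}

impl Graph {
    pub fn new(edges: Vec<(u64, u64, u64)>) -> Self {
        Graph { edges, generation: 0 }
    }

    pub fn edge_count(&self) -> usize {
        self.edges.len()
    }

    /// Remove a node and every incident edge, then broadcast the whole
    /// cascade as ONE batch so observers never see a partial effect.
    pub fn delete_node(&mut self, node: u64, tx: &Sender<Vec<GraphDelta>>) {
        let mut batch = Vec::new();
        self.generation += 1;
        batch.push(GraphDelta::NodeRemoved {
            node_id: node,
            generation: self.generation,
        });

        // Split edges into those touching the node (incoming or outgoing) and the rest.
        let (incident, kept): (Vec<_>, Vec<_>) = std::mem::take(&mut self.edges)
            .into_iter()
            .partition(|&(_, source, target)| source == node || target == node);
        self.edges = kept;

        for (edge_id, source, target) in incident {
            self.generation += 1;
            batch.push(GraphDelta::EdgeRemoved {
                edge_id,
                source,
                target,
                generation: self.generation,
            });
        }

        // Single send: subscribers receive all of the cascade or none of it.
        let _ = tx.send(batch);
    }
}
```

Deleting a node with two incident edges yields exactly one message on the channel, containing three deltas whose generations form a contiguous range.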

OWL/RDFS ontology loaded from Turtle

Property behaviors live in data/ontology/*.ttl. The files are grouped by domain: location.ttl, work.ttl, relationship.ttl, financial.ttl, health.ttl, preference.ttl, routine.ttl, education.ttl. A property descriptor marks a predicate as functional, symmetric, transitive, append-only, or cascade-inducing. The registry loads these at startup and the graph pipeline consults the registry rather than a hardcoded switch statement when deciding whether a new fact supersedes an old one.

An ontology discovery pass observes edge patterns and proposes new behaviors. If 90% of subjects of a predicate have exactly one active edge, the system proposes marking it functional. High-confidence proposals can auto-apply. The rest are held for review.
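The 90% rule can be sketched as a single pass over the active edges. The function name, the `(subject, predicate)` tuple input, and the configurable threshold are assumptions for illustration; how MinnsDB actually scores and applies proposals is not shown in the text.

```rust
use std::collections::HashMap;

/// Propose predicates that look functional: at least `threshold` of their
/// subjects have exactly one active edge. Input is (subject, predicate)
/// for each currently active edge.
pub fn propose_functional<'a>(
    active_edges: &[(&'a str, &'a str)],
    threshold: f64,
) -> Vec<&'a str> {
    // predicate -> subject -> number of active edges
    let mut counts: HashMap<&'a str, HashMap<&'a str, usize>> = HashMap::new();
    for &(subject, predicate) in active_edges {
        *counts
            .entry(predicate)
            .or_default()
            .entry(subject)
            .or_insert(0) += 1;
    }

    let mut proposals: Vec<&'a str> = counts
        .iter()
        .filter(|(_, by_subject)| {
            let single = by_subject.values().filter(|&&n| n == 1).count();
            // by_subject is never empty: an entry exists only once a subject was seen.
            single as f64 / by_subject.len() as f64 >= threshold
        })
        .map(|(predicate, _)| *predicate)
        .collect();
    proposals.sort(); // deterministic output order
    proposals
}
```

With a threshold of 0.9, a predicate like `lives_in` where every subject has one active edge is proposed, while `knows`, where subjects routinely have several, is not.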

Single binary on disk

The whole thing ships as one executable. No Docker compose file with five services, no external vector database, no external graph database, no message queue sidecar, no JVM, no Python runtime. The graph, the vector index, the keyword index, the query engine, the subscription broadcast, the ontology, and the conversation ingestion pipeline are all one program writing to one data directory on disk. It starts in under a second on a laptop. This is what "own the stack" looks like from the outside: you run one thing, and that one thing is your memory layer.

Structural memory benchmark

On StructMemEval, a benchmark for structured memory tasks, MinnsDB scored 70%. The next-best system scored 27%. This is not a claim about LongMemEval or ConvoMem. Those are different benchmarks measuring different capabilities, and swapping them in would be the same category error the viral launch made. It is a claim about one specific axis: structural memory tasks that require temporal reasoning and multi-hop retrieval. That is the axis the nine-property list is pointing at.

"Store Everything" Is Not the Answer; It Is the Question

The most important thing the viral launch said out loud was "store everything, then make it findable." The most important thing it got wrong was believing the second half of that sentence reduces to vector retrieval.

Storing verbatim text plus an embedding index is a lossy system. The lossiness moves from write time (which is what the viral project correctly objected to in Mem0 and friends) to read time, where a language model has to reconstruct structure on every query from chunks that never knew they were connected. The user has traded one form of information loss for another. Write-time extraction loses what the extractor did not anticipate. Read-time reconstruction loses whatever the retriever did not rank high enough to include in the context window. Neither is a substrate.

A substrate is what you get when the relationships between the things you stored are themselves stored, queryable, temporal, and observable, and when the system holding them owns every layer from the on-disk page format to the query planner. Vector search still exists in a substrate (it is the dense leg of the hybrid retriever), but it no longer has to carry the weight of answering multi-hop, temporally-scoped, contradiction-aware questions. The graph carries those. The vectors help find entry points into the graph. The rerank resolves ambiguity at the edge of both. None of those three legs can be a black box to the others, which is why the substrate cannot be assembled from wrappers around engines written by three different teams for three different purposes.

The viral project stored the text on top of ChromaDB and SQLite. The v2 needs to build the substrate around it, and needs to own the substrate end to end.

Closing

45,000 stars in a week is not an endorsement of a particular architecture. It is a measurement of how acute the demand is for a memory system that does not force a choice between "let the LLM manage it," "trust the extractor," and "wrap a database someone else wrote." The project that eventually satisfies that demand will have all nine properties together in one process, written as one thing rather than stitched from wrappers, and the measurement that matters will not be retrieval recall on a curated benchmark. It will be whether an agent running on a laptop can answer a question a week from now that references something it learned a month ago, through three hops, with one of the intermediate facts having been superseded last Tuesday.

That is the question memory v2 has to answer. The next project that actually ships it will not need a celebrity launch.
