Agent Memory

Also known as: LLM memory, long-term memory, agent state

TL;DR

Agent memory is how an agent persists information across turns and sessions. Short-term memory lives in the context window; long-term memory lives in an external store (vector DB, structured records, files).

An LLM by itself has no memory. Every call starts from scratch: the model reads its context window and produces a response, with no record kept between calls. For an agent that needs to remember earlier turns of a conversation, learn from yesterday’s session, or accumulate facts about a user, memory has to be built around the model.

[Figure: Agent memory. Short-term is what the model sees; long-term is everything else. Short-term panel: a rolling context window (cap 8K tokens) in which the agent answers "remind me what we shipped last week" by calling a SEARCH_MEMORY tool, recalling episode #147 ("shipped reranker v2 on tue"), and giving a grounded reply that cites it. Long-term panel: a retrievable episodic store (N ≈ 12K episodes); the query is embedded, cosine similarity ranks every episode, and the top-K are eligible for retrieval. Short-term memory is free to read but bounded by context length; long-term memory is cheap to store but costly and lossy to recall.]

Short-term vs long-term

The dominant frame splits memory into two scales:

  • Short-term memory: fits in the context window. Conversation history, scratch-pad reasoning, recent tool results. Cheap to read (just prompt tokens) but bounded by the context length and degraded by "lost in the middle" attention effects past ~10K tokens.
  • Long-term memory: externalized to a vector store, structured database, or files. The agent retrieves from it when needed. Effectively unbounded, but every read is a retrieval problem with its own recall and precision concerns.

Production agents almost always have both. Short-term carries the immediate conversation; long-term carries facts that should survive sessions.
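As a rough illustration of the short-term side, here is a minimal sketch of a rolling window with a token budget. The names `ShortTermMemory` and `count_tokens` are hypothetical, and the word-count tokenizer is only a stand-in for whatever tokenizer the model actually uses.

```python
from collections import deque


def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class ShortTermMemory:
    """Rolling window of recent turns, capped by an approximate token budget."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.turns = deque()   # (role, text) pairs currently in the window
        self.evicted = []      # turns pushed out; candidates for long-term storage

    def total_tokens(self):
        return sum(count_tokens(text) for _, text in self.turns)

    def add(self, role, text):
        self.turns.append((role, text))
        # Drop the oldest turns until the window fits the budget again,
        # always keeping at least the newest turn.
        while self.total_tokens() > self.max_tokens and len(self.turns) > 1:
            self.evicted.append(self.turns.popleft())

    def render(self):
        """Text that would be placed in the prompt."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)
```

Turns that fall out of the window are collected in `evicted` rather than discarded, so a long-term store can pick them up.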

Long-term memory shapes

Three dominant patterns, often combined:

  • Vector memory. Each turn, fact, or document is embedded and stored in a vector index. At read time, the agent queries semantically (“anything about pricing concerns from this user”) and retrieves the top-K. Best for fuzzy, semantic recall.
  • Structured memory. Facts the agent must access exactly: user preferences, account state, configuration. Stored as rows or key-value records. Read by exact lookup, not similarity.
  • File-based memory. Notes, drafts, working documents the agent maintains. Useful when humans need to inspect or edit memory between sessions, or when the memory is itself a deliverable (a research notebook, a planning doc).

Single-store memory architectures usually fail one of these intents — semantic recall over a SQL table is fragile, exact lookup over a vector store is expensive.
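The contrast between similarity search and exact lookup can be sketched in a few lines. This is an illustration under loud assumptions: `embed` is a toy bag-of-words function standing in for a real embedding model, and `VectorMemory` and `StructuredMemory` are hypothetical names, not any library's API. File-based memory is left out for brevity.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    """Fuzzy recall: rank every stored item by similarity to the query."""

    def __init__(self):
        self.items = []  # (text, embedding) pairs

    def write(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


class StructuredMemory:
    """Exact recall: a fact is either stored under its key or it is not."""

    def __init__(self):
        self.facts = {}

    def set(self, key, value):
        self.facts[key] = value

    def get(self, key):
        return self.facts.get(key)
```

A query such as `search("pricing concerns")` returns the nearest stored texts even when the wording differs, while `get("plan")` returns exactly one value or nothing.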

What makes memory work

Same answer as for RAG: retrieval quality. A vector memory with poor recall is an agent with confident amnesia: it will produce answers without the relevant past, never knowing it missed something. The fixes are the same ones that fix RAG: high recall over the memory index, reranking the candidates, and calibration for “is this actually relevant or am I padding context with noise.”

Memory writes are also hard

Naive approaches store every turn, exploding the index with redundant noise. Smarter approaches summarize, deduplicate, and consolidate — each of which is itself an LLM call with its own error mode. A common pattern: write raw turns immediately, run a periodic consolidation pass that merges and dedupes.
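The write-then-consolidate pattern might look roughly like the sketch below. It is deliberately simplified: `MemoryLog` is a hypothetical name, and the consolidation step only removes near-verbatim duplicates by normalizing text, whereas a real pass would likely use an LLM or embedding similarity to merge and summarize.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants collide."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    return " ".join(cleaned.split())


class MemoryLog:
    def __init__(self):
        self.raw = []            # appended immediately on every turn
        self.consolidated = []   # deduplicated memories used for retrieval

    def write(self, text):
        """Cheap write path: no model call, no lookup."""
        self.raw.append(text)

    def consolidate(self):
        """Periodic pass: fold raw turns into the consolidated store, skipping duplicates.

        Returns the number of raw turns processed.
        """
        seen = {normalize(t) for t in self.consolidated}
        for text in self.raw:
            key = normalize(text)
            if key and key not in seen:
                seen.add(key)
                self.consolidated.append(text)
        processed = len(self.raw)
        self.raw.clear()
        return processed
```

Keeping the write path free of model calls means a turn is never lost to a failed summarization; errors in consolidation only affect the derived store.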

Storing every turn is trivial — append to a vector index, write to a log, commit a row. The hard problem is surfacing the right past turn at the right future moment, when the present context contains no obvious anchor to it.

This is why agent memory is fundamentally a retrieval problem and inherits the same failure modes as RAG. The query at recall time is whatever the current conversation is “about”, which is often unrelated to the literal text of the relevant past turn. A conversation about latency optimization may need a memory of yesterday’s discussion of “p99 spikes during rollouts”: different vocabulary, semantically connected. Embedding similarity gets you partway; reranking with a cross-encoder gets you most of the rest.
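The two-stage shape described above (cheap first-stage retrieval, then a more careful rerank with a relevance cutoff) can be sketched as follows. Both scorers are injected as plain functions; `overlap` and `jaccard` are toy stand-ins for an embedding similarity and a cross-encoder respectively, and every name here is hypothetical.

```python
def overlap(query, memory):
    """Toy first-stage scorer: number of shared words (stand-in for embedding similarity)."""
    return float(len(set(query.lower().split()) & set(memory.lower().split())))


def jaccard(query, memory):
    """Toy second-stage scorer in [0, 1] (stand-in for a cross-encoder reranker)."""
    q, m = set(query.lower().split()), set(memory.lower().split())
    union = q | m
    return len(q & m) / len(union) if union else 0.0


def retrieve_then_rerank(query, memories, first_stage, reranker,
                         k_candidates=20, k_final=3, min_score=0.0):
    """Pick k_candidates with the cheap scorer, rerank them, keep the best k_final.

    Memories whose rerank score is below min_score are dropped, so the
    context is not padded with weak matches.
    """
    candidates = sorted(memories, key=lambda m: first_stage(query, m), reverse=True)
    candidates = candidates[:k_candidates]
    scored = sorted(((reranker(query, m), m) for m in candidates),
                    key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:k_final] if score >= min_score]
```

The `min_score` cutoff only makes sense if the reranker's scores are comparable across queries, which is where calibration comes in.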

The corollary: memory quality scales with the same tools that scale RAG quality. Teams that treat memory as “write everything to a vector store” hit a quality ceiling early.

Cognitive science has a useful split. Episodic memory is “what happened” — a specific past event with timing, actors, context. Semantic memory is “what’s true” — facts decoupled from any specific event. Agents need both, and the right storage shape differs.

Episodic memory benefits from temporal indexing and verbatim storage: the agent may need to replay a conversation, cite a previous decision, or detect a contradiction with what it said before. Vector indexes work, with timestamp-aware retrieval and a recency bias.
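One way to express a recency bias is to decay each episode's similarity score by its age. The sketch below assumes a multiplicative half-life decay and a precomputed `similarity` field; both are illustrative choices rather than anything the text above prescribes, and `Episode` and `recency_weighted` are hypothetical names.

```python
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 3600


@dataclass
class Episode:
    text: str
    timestamp: float   # seconds since epoch when the episode was recorded
    similarity: float  # similarity to the current query, computed elsewhere


def recency_weighted(episodes, now, half_life_s=WEEK_SECONDS, k=3):
    """Rank episodes by similarity, halving the score for every half-life of age."""

    def score(ep):
        age = max(0.0, now - ep.timestamp)
        return ep.similarity * 0.5 ** (age / half_life_s)

    return sorted(episodes, key=score, reverse=True)[:k]
```

With a one-week half-life, a two-week-old episode needs four times the similarity of a fresh one to outrank it.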

Semantic memory benefits from canonicalization and deduplication: “the user’s plan is Pro” should exist in exactly one place, not as twelve embeddings of similar phrasings. Structured stores (key-value, graph, SQL) dominate here. The retrieval pattern is exact lookup, not similarity.
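Canonicalization can be as simple as normalizing the key before writing, so that every phrasing of a fact lands on the same record. A minimal sketch, with a hypothetical `FactStore` keyed by (subject, attribute) and a last-write-wins policy chosen purely for illustration:

```python
class FactStore:
    """One record per (subject, attribute); later writes overwrite earlier ones."""

    def __init__(self):
        self._facts = {}

    @staticmethod
    def _key(subject, attribute):
        # Normalize so "Plan", "plan " and "PLAN" all address the same fact.
        return (subject.strip().lower(), attribute.strip().lower())

    def upsert(self, subject, attribute, value, source_episode=None):
        # source_episode optionally links the fact back to the event it came from.
        self._facts[self._key(subject, attribute)] = {
            "value": value,
            "source": source_episode,
        }

    def lookup(self, subject, attribute):
        record = self._facts.get(self._key(subject, attribute))
        return record["value"] if record else None

    def __len__(self):
        return len(self._facts)
```

Keeping a `source` pointer lets the agent cite the episode a fact came from without storing the fact itself in episodic form.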

Most production failures involve mixing the two — storing facts in episodic form (so the agent can’t look them up exactly) or events in semantic form (so the agent loses the timing).

Agent memory is a retrieval system masquerading as a feature; most memory failures are retrieval failures in disguise.

Memory feels like an LLM with amnesia not because storage is hard, but because retrieval is.

Go further

Vector store, structured DB, or files for long-term memory?

All three, for different things. Vector store for fuzzy semantic recall ('find conversations about pricing'). Structured DB for facts the agent must look up exactly ('user's plan tier'). Files for things humans need to inspect or edit. Production agents combine them; single-store memory is usually a sign of underbuilt memory architecture.

What do I do when the conversation outgrows the context window?

Compress old turns into summaries that stay in the window, and offload the raw turns to long-term memory for selective retrieval when needed. 'Lost in the middle' attention degradation makes naive concatenation worse than compression past ~10K tokens of history.
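A minimal sketch of that compaction step, assuming a `summarize` callable supplied by the caller (in practice an LLM call; here any function from a list of turns to a string). The name `compact_history` and the summary prefix are hypothetical.

```python
def compact_history(turns, keep_recent, summarize):
    """Split history into (window, archive).

    window  : a summary of the older turns followed by the most recent turns verbatim
    archive : the older raw turns, to be written to long-term memory
    """
    split = max(0, len(turns) - keep_recent)
    old, recent = list(turns[:split]), list(turns[split:])
    if not old:
        return recent, []
    summary = "Summary of earlier conversation: " + summarize(old)
    return [summary] + recent, old
```

For example, with four turns and `keep_recent=2`, the window holds one summary line plus the last two turns, and the first two turns go to the archive for later retrieval.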

Why is recall the hard part of memory, not storage?

Storing every turn is trivial; surfacing the right past turn at the right future moment is hard. It's a retrieval problem — and just as in RAG, retrieval quality (recall, reranking, calibration) is what determines whether memory feels useful or feels like an LLM with amnesia.
