An LLM has no memory. Each API call is independent — the model has zero recollection of your last conversation. "Memory" in agent products is something you build by deciding what context to include in each call. Get this wrong and your agent feels stupid (forgets the user's name two messages later) or expensive (sends the entire history of all conversations every time).
Four memory layers cover essentially every real product need. Most teams over-engineer this and reach for a memory library when 50 lines of their own code would suffice.
The four layers
- Working memory. The current conversation's message history. "I just told you my name is Alice." Lasts for the conversation.
- Session memory. Compressed summary of a single session. "This user is troubleshooting a refund." Written when the session ends; survives into later sessions.
- User memory. Things known about the user across sessions. "Alice is on the Pro plan, lives in Taipei, prefers Traditional Chinese." Lasts forever.
- Episodic memory. Specific past events the user might reference later. "On March 3, Alice asked about API rate limits and we resolved it with X." Searchable, not always loaded.
These aren't sequential layers — they're different abstractions you maintain in parallel.
Layer 1: working memory (the easy one)
This is just the message array you send with each API call:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Nice to meet you, Alice"},
    {"role": "user", "content": "What's my name?"},
]
The model trivially answers "Alice" because the answer is in its context.
Working memory has one real challenge: the context window fills up. After 50 turns of a conversation involving large tool outputs, you might be at 80% of your context budget. Three solutions:
- Summarize older turns. Replace the first 30 turns with a 200-token summary, and re-run the summary every N turns (see the sketch after this list).
- Drop tool results once consumed. A retrieval result from turn 4 isn't needed at turn 20.
- Use prompt caching. With Claude or Gemini, the system prompt + early conversation can be cached so re-sending it costs less. (Still uses context though.)
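A minimal sketch of the first option. Everything here assumes a hypothetical complete() helper that wraps your chat-completion API:

def compact_history(messages, complete, keep_recent=10):
    # Replace older turns with one summary message; keep the tail verbatim.
    # `complete` is a stand-in for whatever chat-completion call you use.
    system, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = complete(
        "Summarize this conversation in under 200 tokens. "
        "Keep names, decisions, and open questions.\n\n" + transcript
    )
    return [system, {"role": "user", "content": f"[Earlier turns, summarized]\n{summary}"}] + recent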
Layer 2: session memory (the easy one done well)
When a session ends (user closes chat, you start a new conversation), don't throw the whole history away. Compress it:
# `llm.summarize` is a stand-in for whatever completion call you use.
session_summary = llm.summarize(
    "Summarize this conversation in 200 tokens. "
    "Focus on: what the user wanted, what was resolved, what's pending. "
    f"\n\nConversation:\n{full_transcript}"
)
store_in_db(user_id, session_id, session_summary)
Next time the user shows up, load their last three session summaries and inject them into the system prompt as "recent context." The agent gains a sense of continuity without you re-sending 50,000 tokens.
This single feature — session summaries injected into next session's context — gets you 80% of the "feels like it remembers me" effect with 100 lines of code.
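The load side is just as small. A sketch, assuming a db handle with a hypothetical fetch_summaries query:

def build_system_prompt(base_prompt, user_id, db):
    # Hypothetical query: newest session summaries first, capped at three.
    summaries = db.fetch_summaries(user_id, limit=3)
    if not summaries:
        return base_prompt
    recent = "\n".join(f"- {s}" for s in summaries)
    return f"{base_prompt}\n\nRecent context from past sessions:\n{recent}"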
Layer 3: user memory (where teams overthink)
Things about the user that should persist forever: name, plan tier, language preference, work context, stated goals, learned facts ("the user is a pediatrician," "the user is allergic to peanuts").
Two implementation approaches:
Approach A: structured profile
CREATE TABLE user_profile (
    user_id UUID PRIMARY KEY,
    display_name TEXT,
    preferred_language TEXT,
    occupation TEXT,
    notes TEXT[]  -- free-form facts the user mentioned
);
When the user mentions a fact, the agent uses a tool like remember(category, fact) to add to the profile. At session start, you load the profile into the system prompt.
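A sketch of what that tool could look like. The schema follows the JSON-schema tool format most chat APIs accept, and the two db helpers are hypothetical:

REMEMBER_TOOL = {
    "name": "remember",
    "description": "Store a durable fact about the user.",
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string",
                         "enum": ["name", "language", "occupation", "note"]},
            "fact": {"type": "string"},
        },
        "required": ["category", "fact"],
    },
}

def handle_remember(db, user_id, category, fact):
    # Structured fields go on the profile row; anything else becomes a note.
    if category == "note":
        db.append_note(user_id, fact)
    else:
        db.update_profile_field(user_id, category, fact)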
Pros: structured, queryable, predictable. The user can see and edit their profile.
Cons: requires you to predict what fields matter. Doesn't capture everything.
Approach B: free-form memory + retrieval
Store facts as embeddings in a vector store. At each turn, retrieve the top-K most relevant memories.
relevant_memories = vector_store.search(current_query, top_k=5)
system_prompt += "\nRelevant memories:\n" + "\n".join(relevant_memories)
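To see the whole pattern end to end, here is a dependency-free sketch; embed() is whatever embeddings call you use, and similarity is plain cosine:

import math

class MemoryStore:
    def __init__(self, embed):
        self.embed = embed  # callable: str -> list[float], supplied by you
        self.items = []     # list of (vector, text) pairs

    def add(self, text):
        self.items.append((self.embed(text), text))

    def search(self, query, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]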
Pros: captures arbitrary facts. Scales to thousands of memories per user.
Cons: retrieval can miss relevant memories or surface irrelevant ones. Harder to debug. Storage grows unbounded if you don't have a forgetting policy.
What works in practice
A hybrid (see the sketch after this list):
- Structured profile for important facts (plan tier, language, name, top 3 stated goals). 5-10 fields.
- Free-form notes table for anecdotal facts ("prefers concise answers," "is a pediatrician"). Limit to ~50 most recent or use embedding search if larger.
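At session start, the hybrid collapses into one prompt fragment. A sketch with hypothetical db helpers:

def load_user_memory(db, user_id, max_notes=50):
    profile = db.get_profile(user_id)  # the structured row, as a dict
    notes = db.get_recent_notes(user_id, limit=max_notes)
    lines = [f"{field}: {value}" for field, value in profile.items() if value]
    if notes:
        lines.append("Notes: " + "; ".join(notes))
    return "Known about this user:\n" + "\n".join(lines)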
Don't reach for Mem0 / Letta / Zep until you've shipped this and hit a real limit.
Layer 4: episodic memory
This is the deepest layer. Past conversations the user might want to refer back to: "What did we decide about the API rate limit issue last month?"
Implementation (a sketch follows the list):
- Index every past conversation (full or summarized) as a document with metadata (date, user_id, topic).
- When the agent gets a query, run retrieval against this index.
- If matches are found, surface them as context: "On 2026-03-03 you discussed... [summary]"
- Optionally, allow the agent to ask follow-up questions about specific past sessions.
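A sketch of that flow, reusing the MemoryStore sketch from Layer 3 and folding the metadata into the indexed text so plain retrieval can surface it:

def index_session(store, session_date, topic, summary):
    # One document per past session; date and topic ride along in the text.
    store.add(f"[{session_date}] ({topic}) {summary}")

def recall_episodes(store, query, top_k=3):
    hits = store.search(query, top_k=top_k)
    if not hits:
        return ""
    return "Possibly relevant past sessions:\n" + "\n".join(f"- {h}" for h in hits)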
Most products don't need Layer 4. Customer support agents and personal assistants benefit. General chatbots don't — the cost (storage, retrieval latency, eval complexity) outweighs the rare "do you remember when..." moment.
The 2026 framework landscape
Memory frameworks worth knowing:
- Mem0. Most popular. Automatic memory extraction + storage + retrieval, with SDKs in Python and TS. Can be dropped into LangChain or used standalone. Uses an LLM to extract facts from conversations and store them.
- Letta (formerly MemGPT). Treats memory as a hierarchical OS — "main memory" vs "archival memory." More academically interesting, more setup.
- Zep. Production-focused, with stronger eval tooling and a dashboard. Worth it for serious customer-facing agents.
- LangChain Memory / LlamaIndex Memory. Built-in modules in those frameworks.
My take: for a single-product solo team, write your own. For a team of 5+ shipping multiple agent products, use Mem0 to standardize. For high-stakes (legal, medical, financial advice agents), evaluate Zep with their team.
Common mistakes
- Storing too much. Every passing comment ends up in long-term memory. Six months later the agent thinks the user really cares about cats because they mentioned one once. Filter aggressively at storage time.
- Forgetting to forget. Users change, and old facts become wrong. Add timestamps and decay weighting (see the sketch after this list); allow users to clear memories.
- Trusting facts the agent extracted. A stored "user is 30 years old" may really have been the user saying "I think you're 30 years old." LLM extraction makes mistakes. Show users their stored memories and let them correct them.
- No privacy controls. Users in 2026 expect to see, edit, and delete their memory. "What does this app remember about me?" must be answerable, ideally as a UI.
- Loading too much per turn. If you stuff 5000 tokens of memory into every turn, you're paying for it on every API call. Be selective; load only relevant memories.
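For the decay weighting mentioned above, one simple option is to halve a memory's retrieval score for every fixed interval of age. The 90-day half-life below is an arbitrary assumption:

from datetime import datetime, timezone

HALF_LIFE_DAYS = 90  # assumption: a memory's weight halves every 90 days

def decayed_score(similarity, stored_at):
    # `stored_at` must be a timezone-aware datetime.
    age_days = (datetime.now(timezone.utc) - stored_at).days
    return similarity * 0.5 ** (age_days / HALF_LIFE_DAYS)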
A pragmatic stack for solo developers
Here's what I run for personal projects and recommend for solo founders in 2026:
- Working memory: just the message array; summarize after 30 turns.
- Session memory: a 200-token summary per session, stored in Postgres.
- User memory: a 5-field structured profile plus the 50 most recent free-form notes.
- Episodic memory: skip it until a user explicitly asks for it.
This is maybe 200 lines of code. It covers 90% of what "memory" should do for most products. When you hit a wall, evaluate frameworks.
When NOT to build memory
- Single-turn product (search, tool use without back-and-forth). No memory needed.
- Stateless API endpoint. Memory makes inputs/outputs depend on hidden state. Hard to test.
- High-privacy domain. If users assume their conversation is forgotten when the page closes, building memory creates expectations and risks. Default to not remembering and let users opt in.
Further reading
- MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023) — the original Letta paper.
- Mem0 docs — practical extraction patterns.
- Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — the Stanford paper that popularized agent memory.
- Look up: episodic memory, retrieval-augmented memory, fact extraction, memory consolidation.