TL;DR: Memory in agents is just a way of expanding the context window. Here's the simple mental model that makes it practical.
Memory = Context Window Expansion
In agent design, “memory” is a word that sounds complicated but maps to something concrete: getting information into the context window that wouldn’t otherwise fit.
A 256k context window sounds large until you try to fit the world’s knowledge into it. More importantly, you shouldn’t try — attention spreads thin over large contexts. Finding a tiny detail inside a massive blob of text is like finding a footnote inside an encyclopedia. The model can do it, but not reliably.
The Early Problem
Early LLMs had tiny context windows: a few thousand tokens was typical (GPT-3 had 2,048). With system prompt, guidance, and conversation history taking up space, you had almost nothing left for the actual knowledge that helps the model produce good answers.
This forced a key design decision: most knowledge lives outside the context window. You retrieve what you need, when you need it.
RAG (Retrieval-Augmented Generation) is the canonical solution; knowledge graphs are another. Both are just strategies for deciding what to pull in and when.
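To make "retrieve what you need, when you need it" concrete, here is a minimal sketch. It assumes plain keyword overlap as the relevance score, which is a stand-in for the embedding search or graph lookup a real system would use. The names tokenize, retrieve, and build_prompt are hypothetical.

```python
# Hypothetical sketch: keyword overlap stands in for a real
# embedding search or knowledge-graph lookup.
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k documents that share the most words with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    # Keep only documents with at least one shared word, best match first.
    relevant = [(score, doc) for score, doc in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in relevant[:top_k]]


def build_prompt(query: str, documents: list[str]) -> str:
    """Put only the retrieved snippets into the context, then the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The point of the sketch is the shape, not the scoring: the full document store stays outside the window, and only the top few matches are spent on context.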
The File System Mental Model
Here’s the simplest way to think about agent memory: it’s a text file.
Think about what you can do with a text file in an OS:
- Read — load it into context when relevant
- Append — add new information without overwriting
- Overwrite — replace when the old content is no longer valid
- Concat — merge multiple memory sources
That’s the complete set of memory operations you need. No magic required.
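The four operations above map almost directly onto standard file calls. This is a minimal sketch using Python's pathlib; the function names and the one-entry-per-line layout are assumptions for illustration, not the repo's actual API.

```python
from pathlib import Path


def read_memory(path: Path) -> str:
    """Read: load the file's text so it can be placed in the context."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def append_memory(path: Path, entry: str) -> None:
    """Append: add a new entry without touching existing content."""
    with path.open("a", encoding="utf-8") as f:
        f.write(entry.rstrip("\n") + "\n")


def overwrite_memory(path: Path, content: str) -> None:
    """Overwrite: replace content that is no longer valid."""
    path.write_text(content, encoding="utf-8")


def concat_memories(paths: list[Path]) -> str:
    """Concat: merge several memory files into one block of context."""
    parts = [read_memory(p) for p in paths]
    # Skip empty or missing files so the merged text has no blank gaps.
    return "\n".join(part.strip() for part in parts if part.strip())
```

A missing file reads as an empty string here, so a first session with no memories yet needs no special case.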
Triggers and Lifecycle
Every memory system needs a trigger — some event that causes the agent to create, update, or read memory.
In my implementation, the trigger is simple: end of session. After each session, the agent:
- Scans existing memories for relevance
- Summarizes what happened: key actions taken, what worked, what failed
- Writes a new memory entry or updates an existing one
This creates a persistent record of agent experience. Over multiple sessions, the agent builds up a structured history of its own performance.
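The end-of-session steps can be sketched as a single handler. Everything here is an assumption made for illustration: the SessionLog fields, the topic-keyed dictionary used as the memory store, and the summarize function, which only formats text where a real agent would likely call a model.

```python
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Hypothetical record of one session; field names are assumptions."""
    topic: str
    actions: list[str] = field(default_factory=list)
    worked: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)


def summarize(log: SessionLog) -> str:
    """Stand-in for an LLM summarization call: formats the log as text."""
    return (
        f"actions: {'; '.join(log.actions) or 'none'} | "
        f"worked: {'; '.join(log.worked) or 'none'} | "
        f"failed: {'; '.join(log.failed) or 'none'}"
    )


def on_session_end(log: SessionLog, memories: dict[str, list[str]]) -> str:
    """Trigger handler: scan, summarize, then write or update memory."""
    summary = summarize(log)
    # Scan: "relevance" is reduced to an exact topic match (an assumption).
    if log.topic in memories:
        memories[log.topic].append(summary)  # update an existing entry
        return "updated"
    memories[log.topic] = [summary]  # write a new entry
    return "created"
```

Calling on_session_end twice with the same topic creates the entry the first time and extends it the second, which is how the history accumulates across sessions.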
Memory as Training Data
Here’s where it gets interesting.
The summaries your agent writes are exactly the kind of data you’d want for fine-tuning. Key decisions made, correct choices, wrong turns, recovery patterns — this is behavioral signal in a clean format.
If you can afford to fine-tune (or when fine-tuning costs drop further), the memory log from a well-designed agent becomes a natural training dataset. Your model starts embodying the patterns of whoever built the agent.
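As a rough illustration of how a memory log could become a dataset, the sketch below turns entries into JSONL lines. The "task" and "summary" keys on each entry and the prompt/completion record shape are assumptions; fine-tuning formats differ between providers, so check the one you actually use.

```python
import json


def to_training_records(entries: list[dict]) -> list[str]:
    """Turn memory entries into JSONL lines of prompt/completion pairs."""
    lines = []
    for entry in entries:
        record = {
            # Hypothetical schema: the task becomes the prompt and the
            # agent's own summary becomes the target text.
            "prompt": f"Task: {entry['task']}",
            "completion": entry["summary"],
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Each returned string is one line of a JSONL file, so writing the dataset is a matter of joining the list with newlines.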
Evaluation Closes the Loop
In production, you don’t just write memories blindly. You run an evaluation pass after each session:
- Did the agent achieve the goal?
- Were the actions efficient?
- Were any tools misused?
Only memories that pass evaluation get committed to long-term storage. Bad runs get flagged for review, not reinforced.
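The gate described above can be sketched as a small function sitting between the summary and the store. The Evaluation fields, the max_steps threshold used as a proxy for efficiency, and the two in-memory lists are all assumptions for illustration; how each check gets answered (human review, heuristics, a grader model) is left open.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Answers to the three checks; the field names are assumptions."""
    goal_achieved: bool
    steps_taken: int
    tool_errors: int


def passes(evaluation: Evaluation, max_steps: int = 20) -> bool:
    """Pass only if the goal was met, the run was efficient, and no tool was misused."""
    return (
        evaluation.goal_achieved
        and evaluation.steps_taken <= max_steps
        and evaluation.tool_errors == 0
    )


def commit_or_flag(
    summary: str,
    evaluation: Evaluation,
    long_term: list[str],
    review_queue: list[str],
) -> str:
    """Commit passing runs to long-term storage; flag the rest for review."""
    if passes(evaluation):
        long_term.append(summary)
        return "committed"
    review_queue.append(summary)
    return "flagged"
```

Failed runs are kept in the review queue rather than dropped, so they stay available for inspection without being reinforced.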
This is the same loop that makes humans better at their jobs: do, reflect, evaluate, adjust.
The Full Picture
Session starts → Load relevant memories into context → Agent executes task using skills + context
Session ends → Summarize what happened → Evaluate quality → Write/update memory → (Optional) Flag data for fine-tuning
Simple. No exotic architecture required. The complexity is in the evaluation step: deciding what counts as a good run is the hardest part.
GitHub
Implementation is at github.com/Czhang0727. The memory system is the simplest module in the repo — a reminder that the best designs usually are.
The next post covers Hermes — a real-world agent hitting the limits of this design and what I built to fix it.
Auth_Verified: 2026.05.10
