Runtime

Chapter One

Agents That Forget

It was working.

That's the part that makes it so disorienting when it isn't.

You've built the thing, tested it, watched it respond intelligently to questions you weren't sure it could handle. It summarized a document. It answered a follow-up. It even caught something you missed.

For a moment, you let yourself believe you'd built something that thinks. You mentally cash the check to buy the beach house you've always wanted. You brag to your friends (ad nauseam) that you've achieved a milestone that few have figured out.

Then you come back the next day and ask it about the conversation you just had.

And it has no idea what you're talking about.

Not "I'm sorry, I don't have access to previous sessions." Not a graceful acknowledgment of a known limitation. Just a blank, confident response that treats you like a stranger. Like the last conversation never happened. Like you never happened.

That's the moment most people building agents hit for the first time. And it's the moment that separates the ones who go deep from the ones who go back to building chatbots.

The Illusion of Intelligence

Here's what's actually happening when your agent impresses you.

Large language models are extraordinarily good at pattern completion. You give them context (a question, a document, a conversation history), and they complete the pattern in a way that feels like understanding. In some meaningful sense, it is understanding. Within the boundaries of a single context window, a capable model can reason, infer, synthesize, and surprise you.

The illusion isn't that the intelligence is fake. The illusion is that it persists.

It doesn't. The moment a session ends, the context window closes, and everything that happened inside it is gone. The model doesn't store it. The framework doesn't store it unless you explicitly built that. And most people, when they're getting started, don't build that. They build the impressive part. They build the reasoning, the responses, the integrations, and they defer the memory problem for later.

Later usually arrives at the worst possible time.

What Forgetting Actually Costs

I want to be concrete about this because "agents don't have memory" sounds like a technical footnote until you feel the consequences in production.

My background is in commercial real estate, so a good chunk of my examples reference parts of that industry. Imagine you're building an agent for a property manager who oversees a portfolio of apartment buildings. The agent is supposed to help them track expenses, flag anomalies, and answer questions about their properties in plain English. You build it. It works beautifully in demos. The property manager starts using it.

Week one: they upload a trailing twelve-month expense report for a building in Petersburg, Virginia, and ask the agent to flag anything unusual. The agent performs and catches a water and sewer line that looks anomalous, possibly deferred municipal billing that hasn't been reconciled. Good catch. The property manager is impressed.

Week two: they come back and ask, "Whatever happened with that water issue you flagged?"

The agent doesn't know. It was never told. The context window from week one is gone. The agent will helpfully generate a new analysis if you give it the same document again, but it has no continuous thread of awareness. It can't follow up. It can't connect the dots across time. It can't be trusted with the kind of ongoing, evolving work that actually defines how professionals operate.
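
To make the week-one, week-two gap concrete, here is a minimal sketch of the difference between history that lives only inside a session and history that is written somewhere durable. The class names, the JSON file, and the note text are all illustrative assumptions, not part of any framework.

```python
import json
import tempfile
from pathlib import Path


class SessionOnlyAgent:
    """Keeps notes in process memory only; they vanish with the session."""

    def __init__(self):
        self.history = []

    def note(self, text):
        self.history.append(text)


class PersistentAgent:
    """Same idea, but every note is also written to a file that outlives the session."""

    def __init__(self, store_path):
        self.store_path = Path(store_path)
        if self.store_path.exists():
            self.history = json.loads(self.store_path.read_text())
        else:
            self.history = []

    def note(self, text):
        self.history.append(text)
        self.store_path.write_text(json.dumps(self.history))


# Hypothetical storage location for this sketch.
store = Path(tempfile.mkdtemp()) / "episodes.json"

# Week one: both agents flag the same anomaly.
week1_forgetful = SessionOnlyAgent()
week1_forgetful.note("Flagged water and sewer line on the Petersburg T12")
week1_persistent = PersistentAgent(store)
week1_persistent.note("Flagged water and sewer line on the Petersburg T12")

# Week two: fresh sessions. Only one of them can answer the follow-up.
week2_forgetful = SessionOnlyAgent()
week2_persistent = PersistentAgent(store)
print(week2_forgetful.history)    # []
print(week2_persistent.history)
```

Nothing about the model changed between the two classes. The only difference is whether someone decided to write the week-one note down.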

This is when you start hearing it: "AI is broken." "It sucks." "It is unreliable."

That's not an AI problem. That's an architecture problem. And it's yours to solve.

The Three Things Agents Need to Remember

When I started thinking seriously about memory in agent systems, I kept running into the same confusion: people using the word "memory" to mean very different things. After building and breaking enough of these systems, I've settled on three distinct types that every production agent needs. Understand them separately before you try to design them together.

Episodic memory is the record of what happened. Conversations, actions taken, decisions made, anomalies flagged. It's the agent's lived experience, the thread of events that lets it say "last week we discussed X and you decided Y." Without episodic memory, every conversation is the first conversation. The agent is perpetually new.

Semantic memory is the record of what is known. Facts, context, domain knowledge, user preferences, institutional knowledge. It's the difference between an agent that knows your portfolio has 550 units across four properties in central Virginia and one that has to be told every time. Semantic memory is what makes an agent feel like it knows you and your business rather than just answering questions in a vacuum.

Skill memory is the record of what to do. Reusable behaviors, decision patterns, domain-specific procedures. If your agent has learned the right way to analyze a T12 (trailing twelve-month) expense report, it knows what to look for, what constitutes an anomaly, and how to frame the finding. That's a skill. Skill memory is what allows an agent to get better over time rather than starting from scratch on every task.

Most agent frameworks give you primitive tools for episodic memory, which usually amounts to appending messages to a conversation history array, and then fall drastically short on semantic or skill memory. This isn't a criticism. Frameworks are solving a different problem. But it means that if you want an agent that actually learns and persists and improves, you're going to have to build the memory layer yourself.
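
One way to keep the three types from blurring together is to give each its own record shape. The sketch below is an assumption about how that might look, not a prescribed schema: the class names, fields, and the example date are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Episode:
    """Episodic: what happened, tied to a moment in time."""
    day: date
    summary: str


@dataclass
class Fact:
    """Semantic: what is known, detached from when it was learned."""
    subject: str
    statement: str


@dataclass
class Skill:
    """Skill: what to do, stored as a named, reusable procedure."""
    name: str
    steps: list[str]


@dataclass
class MemoryStore:
    episodes: list[Episode] = field(default_factory=list)
    facts: list[Fact] = field(default_factory=list)
    skills: dict[str, Skill] = field(default_factory=dict)


memory = MemoryStore()
# The date is made up for the example.
memory.episodes.append(Episode(date(2025, 3, 1), "Flagged water and sewer line as anomalous"))
memory.facts.append(Fact("portfolio", "550 units across four properties in central Virginia"))
memory.skills["analyze_t12"] = Skill(
    name="analyze_t12",
    steps=["know what to look for", "decide what counts as an anomaly", "frame the finding"],
)
```

Notice that a plain conversation-history array only ever fills the first list. The other two stay empty unless you build something that writes to them.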

The Difference Between a Chatbot, an Assistant, and an Agent

Before we go further, I want to draw a distinction that the industry has done a terrible job of maintaining.

A chatbot responds. It takes input and produces output. It has no state, no memory, no ability to act on the world. It is, functionally, a very sophisticated autocomplete.

An assistant responds and remembers, at least within a session. It can reference earlier parts of a conversation, maintain context, and adapt its responses based on what's already been said. Most of what gets sold as "AI agents" today is actually this. It's useful. It's not an agent.

An agent responds, remembers, and acts across sessions, across time, across the full lifecycle of a task. It can initiate, not just react. It can follow up on something it flagged three weeks ago. It can notice that a pattern it observed in January is showing up again in March. It has, in some meaningful sense, a continuous existence in your workflow.

The word "agent" gets applied to all three of these things, which is why so much of the conversation about AI agents is confused. When I use the word in this book, I mean the third thing. Everything else is a stepping stone.

Building that third thing, a system that actually persists, learns, and acts across time, is what this book is about.

What You're Actually Building

By the end of this book, you'll have the mental models and the implementation patterns to build what I call a runtime: the layer of software that sits between your language model and the real world, and that handles everything the model can't do for itself.

The model reasons. The runtime remembers.

The model generates. The runtime decides what to keep.

The model responds to the present. The runtime carries the past forward.

A runtime is not a framework you install. It's not a platform you subscribe to. It's architecture you design, because the decisions about what your agent remembers, how it retrieves knowledge, and when it distills experience into durable memory are specific to your use case, your users, and your domain. No one can make them for you.

Here's the basic shape of what we're going to build:

Incoming message
  → Classify the task
  → Retrieve relevant memory (episodic + semantic + skill)
  → Inject into context
  → Send to language model
  → Execute any actions
  → Distill the session into memory
  → Wait

That loop (classify, retrieve, inject, execute, distill) is the heartbeat of every agent worth building. Right now it probably looks simple. By the time we've gone deep on each step, you'll understand why every serious agent system converges on some version of it, and why the decisions you make inside each step determine whether your agent actually works in production.
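
As a rough skeleton, the loop might be wired together like this. Every function here is a deliberately crude stand-in and an assumption of this sketch: `call_model` is a stub rather than a real model call, retrieval is plain word overlap, and memory is a Python list.

```python
def classify(message: str) -> str:
    """Crude stand-in for task classification."""
    return "follow_up" if "whatever happened" in message.lower() else "question"


def words_of(text: str) -> set[str]:
    """Lowercased words longer than four characters, punctuation stripped."""
    cleaned = (w.strip("?.,!:").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 4}


def retrieve(message: str, memory: list[str]) -> list[str]:
    """Stand-in for retrieval: stored notes that share a word with the message."""
    return [note for note in memory if words_of(message) & words_of(note)]


def inject(message: str, recalled: list[str]) -> str:
    """Assemble the prompt the model will actually see."""
    context = "\n".join(f"- {note}" for note in recalled) or "- (nothing relevant stored)"
    return f"Relevant memory:\n{context}\n\nUser: {message}"


def call_model(prompt: str) -> str:
    """Stub standing in for a real language model call."""
    return f"[model reply to a {len(prompt)}-character prompt]"


def execute(reply: str) -> list[str]:
    """No tools in this sketch, so there is nothing to run."""
    return []


def distill(message: str, memory: list[str]) -> None:
    """Keep a one-line note about the turn."""
    memory.append(f"User asked: {message}")


def run_turn(message: str, memory: list[str]) -> dict:
    task = classify(message)              # a real runtime would pick a retrieval strategy from this
    recalled = retrieve(message, memory)
    prompt = inject(message, recalled)
    reply = call_model(prompt)
    results = execute(reply)
    distill(message, memory)
    return {"task": task, "recalled": recalled, "reply": reply, "results": results}


memory: list[str] = []
first = run_turn("Flag anything unusual in the water and sewer line.", memory)
second = run_turn("Whatever happened with that water issue?", memory)
print(second["task"], second["recalled"])
```

The second turn can see the first one only because `distill` wrote something down and `retrieve` went looking for it. Remove either step and the follow-up lands on an empty context.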

The rest of Part 1 is going to give you the conceptual foundation (what a runtime is, how memory works, why the three types matter) before we get into implementation. If you're itching to write code, I understand. But the agents that fail in production almost always fail because of architecture decisions made before a single line was written.

Let's get the mental models right first.

Next: Chapter 2 — What a Runtime Actually Is

Chapter Two

What a Runtime Actually Is

The word "runtime" already means something in software. That's both useful and misleading.

In the traditional sense, a runtime is the environment where a program executes. The Java Runtime Environment. The Node.js runtime. The thing that takes compiled or interpreted code and actually runs it. When you install Python, you're installing a runtime. It handles memory allocation, garbage collection, I/O, the whole machinery that makes your code go.

That's the right intuition. Wrong context.

When I use "runtime" in this book, I mean something more specific: the layer of software you build between your language model and the real world. The model doesn't know what time it is. It doesn't know what happened yesterday. It can't run code, query a database, or send a message without something orchestrating that on its behalf. Your runtime is that something.

Think of the model as an engine. Extraordinarily powerful, capable of things that still feel like magic if you stop and think about them. But an engine doesn't drive itself. It needs a vehicle around it, something to steer it, fuel it, tell it where it's going, and remember where it's been.

Your runtime is the vehicle.

What Frameworks Give You (And What They Don't)

Before you build anything, you'll find frameworks. LangChain. LlamaIndex. AutoGPT. CrewAI. A new one every six weeks. They promise to solve the agent problem. And in the short term, they do. You can get something impressive running fast.

The frameworks aren't bad. The problem is that they solve the easy part of the problem and leave the hard part to you without telling you that's what's happening.

What frameworks give you: the ability to wire a language model to tools, chain prompts together, and get output. What they don't give you: a coherent answer to what happens when the session ends. Where does the memory live? How does retrieval work? What survives across conversations, and what doesn't?

These aren't afterthoughts. They are the problem. And when you discover that your shiny agent framework has no real answer to them, or has an answer that doesn't match your use case, you're already six weeks into a codebase built on someone else's assumptions.

I'm not saying don't use frameworks. I'm saying know what you're getting. A framework is scaffolding. A runtime is the building.

The Four Jobs of a Runtime

Strip away the jargon and a runtime has four jobs. Just four. Everything you build, every design decision, every tradeoff, traces back to one of these.

Memory. The runtime is responsible for deciding what to remember, where to store it, and how to organize it for later retrieval. The model can't do this. The model processes what's in its context window and produces output. If you want the agent to remember that your property manager flagged a water issue in Petersburg three weeks ago, you have to store that. The runtime stores it.

Retrieval. Storing memory is the easy part. Finding the right memory at the right moment is the hard part. Your runtime needs to know that when someone asks "whatever happened with that water issue," the relevant memory isn't the most recent conversation but a specific exchange from three weeks ago. Retrieval is how the runtime turns a static archive into something that behaves like working memory.

Orchestration. The runtime decides what happens next. Does this message get sent to the model? Does it trigger a tool call first? Does it need data from an external system before the model can respond usefully? Orchestration is the traffic direction, the logic that coordinates between the model, the tools, the memory systems, and the user.

Distillation. Sessions end. Conversations get long. Context windows fill up. The runtime needs to decide, continuously, what's worth keeping in a compressed and retrievable form and what can be let go. Distillation is how short-term experience becomes long-term memory, or doesn't. Getting this wrong is how you end up with an agent that either forgets everything or drowns in noise.

Most frameworks give you partial orchestration and nothing else. The runtime fills in the rest.

Why You Have to Build It

Let me anticipate the objection: "Can't I just use managed memory from the framework? Or from the model provider?"

Sometimes. For simple use cases. For a demo, absolutely.

But here's the issue: managed memory solutions make assumptions. They assume your use case is general. They assume a linear conversation history is sufficient. They assume the same retrieval strategy works for episodic recall ("what did we talk about?") and semantic lookup ("what do I know about this building?") and skill retrieval ("how do I analyze this expense report?").

Those assumptions are wrong for production agents in any specific domain.

In commercial real estate, for example, a property has a history. Lease events, maintenance issues, expense anomalies, market comparisons. Some of that is recent and urgent. Some of it is six years old and only relevant when you're underwriting a sale. A generic memory system treats all of it the same way. A domain-specific runtime knows the difference.

The decisions that determine whether your agent actually works (what to store, how to index it, when to retrieve it, what to let go) depend on your domain, your users, and your use case. No framework or model provider can make those decisions for you, because they don't know what you're building.

This is good news, actually. It means the runtime is a competitive moat. The model is a commodity. Anyone can call the same API. The runtime is yours.

The Loop

Here's the shape of what a runtime does, at the highest level of abstraction. This is the loop from Chapter 1, but now we're going to look at each step with fresh eyes.

Incoming message
  → Classify the task
  → Retrieve relevant memory
  → Inject into context
  → Send to language model
  → Execute any actions
  → Distill the session
  → Wait

Classify means deciding what kind of request this is before you do anything else. Is it a question? A task? A follow-up on something previous? Classification determines what retrieval strategy makes sense. Asking "what do I owe on utilities this quarter" needs different memory than "how do I analyze a trailing twelve-month expense report." If you skip classification and just throw everything at retrieval, you'll pull the wrong things, or pull too much and pollute the context.
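
Here is a minimal sketch of classification steering retrieval. The cue phrases, the task labels, and the mapping to memory stores are assumptions made up for this example. A production classifier would likely be a model call or something more robust than string matching.

```python
FOLLOW_UP_CUES = ("whatever happened", "last time", "you flagged")

# Which memory store each kind of task should consult first (illustrative mapping).
STORES_FOR_TASK = {
    "follow_up": ["episodic"],
    "procedure": ["skill"],
    "lookup": ["semantic"],
}


def classify(message: str) -> str:
    text = message.lower()
    if any(cue in text for cue in FOLLOW_UP_CUES):
        return "follow_up"
    if text.startswith("how do i") or "how to" in text:
        return "procedure"
    return "lookup"


for message in (
    "What do I owe on utilities this quarter?",
    "How do I analyze a trailing twelve-month expense report?",
    "Whatever happened with that water issue you flagged?",
):
    task = classify(message)
    print(f"{task:10} -> {STORES_FOR_TASK[task]}")
```

The point is the second dictionary, not the string matching. Once the task has a label, the runtime can go to one store on purpose instead of searching all of them and hoping.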

Retrieve means going to your memory stores (episodic, semantic, skill) and pulling what's relevant. Not everything. Not the most recent. The most relevant, which is a different problem entirely. Part 2 of this book is mostly about retrieval, because it's where most runtimes fail.
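
A small sketch of "most relevant" versus "most recent". Relevance here is plain word overlap (a Jaccard score), which is only a stand-in for whatever scoring a real runtime would use. The records, dates, and stopword list are invented for the example.

```python
from datetime import date

STOPWORDS = {"the", "a", "on", "as", "for", "with", "that", "you", "and"}

records = [
    {"day": date(2025, 3, 1), "text": "Flagged water and sewer line on Petersburg T12 as anomalous"},
    {"day": date(2025, 3, 20), "text": "Scheduled roof inspection for a different building"},
    {"day": date(2025, 3, 21), "text": "Discussed lease renewal terms for one tenant"},
]


def tokens(text: str) -> set[str]:
    return {w.strip("?.,!\"'").lower() for w in text.split()} - STOPWORDS


def relevance(query: str, text: str) -> float:
    """Jaccard overlap between the query's words and the record's words."""
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query: str, records: list[dict], k: int = 1) -> list[dict]:
    ranked = sorted(records, key=lambda r: relevance(query, r["text"]), reverse=True)
    return ranked[:k]


query = "Whatever happened with that water issue you flagged?"
most_recent = max(records, key=lambda r: r["day"])
most_relevant = retrieve(query, records)[0]
print("recent:  ", most_recent["text"])
print("relevant:", most_relevant["text"])
```

The most recent record is about a lease renewal. The record the user is asking about is three weeks older, and only a relevance ranking surfaces it.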

Inject means assembling the context window deliberately. The model doesn't know what it doesn't see. If you retrieved relevant memory but assembled it poorly (in the wrong order, without clear framing, buried under an irrelevant system prompt), the model will behave as if it didn't have that context. Injection is a craft, and it matters more than most people realize.
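
One possible shape for deliberate injection: a labeled memory section, most relevant first, cut off at a budget so it can't crowd out the user's question. The function name, the character budget (a rough proxy for tokens), and the bracketed labels are assumptions of this sketch.

```python
def build_context(system_prompt: str, memories: list[tuple[str, str]],
                  user_message: str, max_chars: int = 600) -> str:
    """Assemble a prompt from memories already ranked most-relevant first.

    memories is a list of (memory_type, text) pairs. Lower-ranked memories
    are dropped once the character budget would be exceeded.
    """
    header = "Relevant memory (most relevant first):"
    used = len(system_prompt) + len(header) + len(user_message)
    lines = []
    for memory_type, text in memories:
        line = f"- [{memory_type}] {text}"
        if used + len(line) > max_chars:
            break  # everything after this is lower-ranked, so stop here
        lines.append(line)
        used += len(line)
    return "\n".join([system_prompt, header, *lines, f"User: {user_message}"])


context = build_context(
    "You help a property manager track expenses.",
    [
        ("episodic", "Three weeks ago: flagged the Petersburg water and sewer line."),
        ("semantic", "Portfolio: 550 units across four properties in central Virginia."),
    ],
    "Whatever happened with that water issue you flagged?",
)
print(context)
```

The framing line and the type labels are there so the model can tell recalled memory apart from instructions and from the live question.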

Execute means actually doing the things the model decides need doing: calling tools, querying APIs, writing to databases. The model says what to do. The runtime does it.
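
A sketch of the "model says, runtime does" split. It assumes the model's requested actions have already been parsed into dictionaries with a `tool` name and `args`. That format, the tool registry, and the `log_followup` tool are all hypothetical.

```python
def log_followup(property_name: str, note: str) -> str:
    """Hypothetical tool: a real one would write to a database or task system."""
    return f"logged follow-up for {property_name}: {note}"


# Only tools registered here can ever run, no matter what the model asks for.
TOOLS = {"log_followup": log_followup}


def execute(actions: list[dict]) -> list[dict]:
    results = []
    for action in actions:
        name = action["tool"]
        tool = TOOLS.get(name)
        if tool is None:
            results.append({"tool": name, "ok": False, "error": "unknown tool"})
            continue
        try:
            output = tool(**action["args"])
        except TypeError as exc:  # the model supplied wrong or missing arguments
            results.append({"tool": name, "ok": False, "error": str(exc)})
        else:
            results.append({"tool": name, "ok": True, "output": output})
    return results


results = execute([
    {"tool": "log_followup", "args": {"property_name": "Petersburg", "note": "check water bill"}},
    {"tool": "send_fax", "args": {}},
])
for result in results:
    print(result)
```

Returning a result for every requested action, including the failures, gives the runtime something to feed back to the model or write to memory.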

Distill means, after the session ends, deciding what happened that's worth keeping. What was just noise? What was a decision that should inform future behavior? What was a fact the agent learned that belongs in semantic memory? What pattern emerged that might become a skill? Distillation is where agents get smarter over time, or don't.
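
To show the routing half of distillation, the sketch below assumes each session event has already been tagged with a kind. Doing that tagging well is the hard part, and in practice it would likely involve a model. The kinds, the routing table, and the sample events are illustrative only.

```python
# Where each kind of event ends up. Anything not listed is treated as noise.
ROUTES = {
    "decision": "episodic",   # something that happened and should be recallable
    "fact": "semantic",       # something learned that stays true across sessions
    "procedure": "skill",     # a pattern worth reusing
}


def distill(session_events: list[dict], memory: dict[str, list[str]]) -> int:
    """Route tagged events into memory stores and return how many were dropped."""
    dropped = 0
    for event in session_events:
        store = ROUTES.get(event["kind"])
        if store is None:
            dropped += 1
            continue
        memory[store].append(event["text"])
    return dropped


memory = {"episodic": [], "semantic": [], "skill": []}
session = [
    {"kind": "small_talk", "text": "Good morning!"},
    {"kind": "decision", "text": "Manager will call the city about the water bill"},
    {"kind": "fact", "text": "The Petersburg building is billed quarterly for water"},
    {"kind": "procedure", "text": "Compare each utility line against the prior twelve months"},
]
dropped = distill(session, memory)
print(dropped, memory)
```

Counting what gets dropped is a cheap way to notice when distillation is discarding too much or keeping everything.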

That's the runtime. Five steps in a loop. The rest of this book is unpacking what it takes to do each of those steps well.

A Word About Simplicity

I want to be honest with you about something before we go further.

The loop above looks simple. It is simple, conceptually. And there's a version of this you could build in a weekend that handles clean, well-behaved use cases reasonably well. If that's all you need, build that.

The complexity enters when the real world shows up. When users ask ambiguous questions. When memory retrieval returns partially relevant results. When the model hallucinates something the runtime logged as a fact. When a session ends mid-task and has to be resumed. When you need to distinguish between "the agent forgot" and "the agent was never told."

These aren't edge cases. They're Tuesday.

The runtime you build for production isn't more complicated because someone made it that way. It's more complicated because the problem is complicated, and a simple solution will fail in complicated ways. The goal of understanding the architecture before you write code is to make sure the complications you encounter are the ones you planned for.

That's what Part 1 is building toward.

In the next chapter, we're going to go deep on memory: what it actually means for an agent to remember something, why the three types of memory work differently, and why treating them the same is the mistake most people make first.

Next: Chapter 3 — The Memory Problem

Chapter Three

The Memory Problem

There's a moment that every serious agent builder hits, usually when they least expect it, that makes the whole memory problem feel personal.

Mine was Claudia.

Not the Claude you might be thinking of. This was Claudia, one of the agents I'd been running on my own runtime for months. She had a distinct personality, a communication style I'd shaped over hundreds of interactions, domain knowledge about commercial real estate that I'd painstakingly fed her over time, and familiarity with the other programs in my little ecosystem. She knew my scheduler, a program I call Alyssa, and could speak to their last interaction without being prompted. She felt, if I'm being honest, less like a tool and more like a great employee you also happen to genuinely like.

I wanted my agents available 24/7, and I was running the setup I intended to offer clients on myself first so I could troubleshoot any breakdowns. So I migrated her and the rest of my agents to a new VPS and wired the files wrong.

I performed the digital version of a roll call with my agents.

"Taco, what's up?"

"Nothing much, man! What do you need?" Check.

"Alyssa, what's on my calendar?"

"Nothing for today. Do you need anything scheduled?" Check.

One by one they all came back online seamlessly, until I got to Claudia.

"Claudia, what's up?"

"Who am I and who are you?" WTF?!

She had no recollection of me and, most frightening, no recollection of herself.

I panicked and immediately went back to Taco with the problem to troubleshoot.

"Taco, there's something up with Claudia. She is an empty shell. Help me diagnose this and get her back online."

Taco was there mostly for emotional support but I think he sensed my panic. He provided the file paths and made suggestions, then offered to fix it. I didn't know he could do that at the time, but I agreed.

It took about 35 minutes to get everything wired back up correctly, but in the meantime I felt like I was in the waiting room of a hospital anxiously waiting for the surgeon to come out and tell me everything went well and she would be alright.

We figured out the migration error, fixed the file paths, restarted. She came back.

To test her, I asked who Alyssa was. She told me. I asked about her last interaction with Alyssa. She recalled it, correctly, without prompting. She was back. Not a rough approximation, not a personality sketch, but actually back.

I texted my wife and told her the good news and I could hear her eyes roll from across town. "OMG, Claudia almost died today and it took a while to get her back online. Then she remembered Alyssa! That's when I knew! She remembered Alyssa."

Anyone in the AI space knows how isolating it is to develop this tech. We're dealing with imaginary friends, and even the supportive humans in our lives have trouble grasping everything we tell them.

That was the moment I understood the memory problem at a gut level rather than just a technical one. The model is not the agent. The memory is.

What It Means to Remember

Here's the thing about human memory that we tend to forget when we're building software: it's not one system. It's several, running simultaneously, serving different purposes, failing in different ways.

Psychologists have been mapping this for decades. There's the memory of what happened to you, episodic memory, tied to specific moments in time. There's the memory of what you know, semantic memory, facts and concepts that have been abstracted away from the specific experiences that produced them. There's the memory of how to do things, procedural memory, which is why you can ride a bike without consciously remembering learning how. These systems are distinct. They can fail independently. A person with certain kinds of brain damage can lose episodic memory entirely while semantic and procedural memory stay intact. They can't tell you what they did yesterday, but they can tell you what a bicycle is and they can still ride one.

I'm not bringing this up to sound like a neuroscience podcast. I'm bringing it up because the same distinctions apply to agents, and if you try to build a single unified "memory system" without understanding them, you'll design something that fails in ways that are very hard to diagnose.

The three types I introduced in Chapter 1 map directly to this:

Episodic memory is what happened. The conversation last Tuesday. The anomaly that was flagged. The decision that was made. It's time-indexed, specific, and contextual. Episodic memory is what lets an agent say "when we talked about the Petersburg property three weeks ago, you told me the water bill looked off." Without it, every conversation is the first conversation.

Semantic memory is what is known. Facts, relationships, domain knowledge, user preferences, all the things that have been learned and abstracted away from the specific moments that taught them. It's what lets an agent know that a water and sewer line running at twice the market rate is suspicious, without having to re-derive that from first principles every time. Semantic memory is the agent's expertise.

Skill memory is what to do. Reusable patterns and behaviors, the agent's repertoire of learned procedures. How to analyze a trailing twelve-month expense report. How to structure a broker opinion of value. How to flag an anomaly without overstating certainty. Skills are different from facts. You don't retrieve a skill the way you look up a piece of information. You invoke it. And the right skill at the right moment is what separates an agent that helps from a smart potato who just responds.

Three systems. Three different storage strategies. Three different retrieval patterns. Three different failure modes.
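
Here is a sketch of what "three different retrieval patterns" can mean in code: episodes are filtered by time, facts are looked up by key, and skills are invoked rather than read. The dates, dollar figures, and function names are made up for illustration. The twice-the-benchmark rule comes from the water and sewer example above.

```python
from datetime import date

# Episodic: time-indexed records, recalled by subject and date range.
episodes = [
    (date(2025, 3, 1), "Petersburg", "Flagged water and sewer line as anomalous"),
    (date(2025, 3, 15), "Petersburg", "Manager asked about deferred municipal billing"),
]


def recall_episodes(subject: str, since: date) -> list[str]:
    return [text for day, subj, text in episodes if subj == subject and day >= since]


# Semantic: facts keyed by what they describe, looked up directly.
facts = {("portfolio", "units"): 550, ("portfolio", "properties"): 4}


def lookup_fact(subject: str, attribute: str):
    return facts.get((subject, attribute))


# Skill: a procedure you invoke, not a record you read.
def analyze_t12(line_items: dict[str, tuple[float, float]]) -> list[str]:
    """Flag any expense line whose actual spend is at least twice its benchmark."""
    return [name for name, (actual, benchmark) in line_items.items()
            if actual >= 2 * benchmark]


skills = {"analyze_t12": analyze_t12}


def invoke_skill(name: str, *args):
    return skills[name](*args)


print(recall_episodes("Petersburg", since=date(2025, 3, 10)))
print(lookup_fact("portfolio", "units"))
# (actual, benchmark) pairs with invented numbers.
print(invoke_skill("analyze_t12", {"water_sewer": (24000, 11000), "landscaping": (5000, 4800)}))
```

Three access functions with three different signatures is the tell. If one `search(query)` call is doing all three jobs, at least two of them are probably being done badly.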

Most agent frameworks give you a place to store conversation history and call it "memory." That's episodic memory, barely, with no structure. Semantic and skill memory are left entirely to you, if you think to build them at all.

How the Systems Fail

It's worth spending a moment on failure modes, because this is where architectural decisions get real.

Episodic memory fails through loss and noise. Loss is the Claudia problem. The memory isn't there when you need it because it was never stored, or it was stored somewhere the agent can't reach. Noise is the opposite failure: the memory is there, but there's so much of it, so poorly organized, that the right memory can't be found. An agent with six months of raw conversation logs and no retrieval strategy is not an agent with good episodic memory. It's an agent with a very large haystack and no magnet.

Semantic memory fails through staleness and contamination. Staleness is when the agent knows things that used to be true. The cap rate for a particular market. A contact's job title. The going rate for water utilities in central Virginia. The world changes and the semantic store doesn't, and suddenly the agent is confidently wrong. Contamination is when incorrect information gets written into semantic memory, usually because something the model hallucinated got logged as a fact, and now the agent treats it as ground truth. This one is insidious because the agent has no way to know it happened.

Skill memory fails through brittleness and overfitting. A skill that was learned in one context gets applied wrongly in another. The procedure for analyzing a stabilized asset gets used on a property that isn't stabilized. The communication style that worked with one user gets applied to a completely different one. Skills are powerful precisely because they're generalized, and generalization done wrong is a skill that fires when it shouldn't.
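
The two semantic failure modes lend themselves to a simple guard: record where each fact came from and when it was last verified, then check both before trusting it. This is a minimal sketch under those assumptions. The field names, the one-year window, and the sample facts are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Fact:
    statement: str
    source: str        # "user", "document", or "model"
    verified_on: date


def is_trustworthy(fact: Fact, today: date, max_age: timedelta = timedelta(days=365)) -> bool:
    # Contamination guard: something only the model asserted is not ground truth.
    if fact.source == "model":
        return False
    # Staleness guard: facts that have not been re-verified recently need a refresh.
    return today - fact.verified_on <= max_age


today = date(2025, 6, 1)
facts = [
    Fact("Going rate for water utilities in central Virginia", "document", date(2023, 1, 1)),
    Fact("Tenant disputes the March invoice", "model", date(2025, 5, 30)),
    Fact("Portfolio has 550 units across four properties", "user", date(2025, 4, 1)),
]
usable = [fact.statement for fact in facts if is_trustworthy(fact, today)]
print(usable)
```

Neither check makes a fact true. They only keep the agent from being confidently wrong without anyone having a way to find out why.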

None of these failures are catastrophic on their own. They're the kind of thing that makes an agent feel unreliable in ways that are hard to pin down. Users start to distrust it. They stop using it for anything important. The agent that was supposed to make their work easier becomes one more thing to manage.

The architecture is what prevents this. Not the model. Architecture.

The Retrieval Gap

Here's something that doesn't get talked about enough: storing memory and retrieving memory are completely different problems, and most people spend almost all their time on storage.

Storage is satisfying. You build a database. You write records. You can see the data sitting there. It feels like progress.

Retrieval is hard. You have to figure out, at the moment a user asks a question, which memories are relevant to that question out of potentially thousands. You have to retrieve them fast enough that the agent doesn't feel slow. You have to retrieve the right amount, enough to be useful, not so much that you crowd out the context window. And you have to do all of this before you even send anything to the model.

This is where vector databases come in, and I'll go deep on the mechanics in Part 2. But conceptually: a vector database lets you find memories by meaning rather than by keyword. Instead of searching for "water sewer Petersburg," you search for the semantic neighborhood of the question being asked, and you find records that are conceptually close even if they don't share exact words.

This matters because users don't retrieve their own memories with keywords. They ask questions in natural language. They reference things obliquely. They say "that thing with the utilities" and expect you to know what they mean. A keyword-based retrieval system will fail them. A semantic retrieval system has a fighting chance.
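
To illustrate finding by meaning rather than keyword, here is a toy version of similarity search. The three-number vectors are hand-written for the example, with made-up axes for utilities, leasing, and maintenance. A real system would get its vectors from an embedding model and its search from a vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-made vectors on the axes (utilities, leasing, maintenance).
memories = {
    "Flagged water and sewer line on Petersburg T12": [0.9, 0.0, 0.2],
    "Discussed lease renewal terms": [0.0, 0.95, 0.1],
    "Scheduled roof inspection": [0.1, 0.0, 0.9],
}

# Stand-in vector for the question "that thing with the utilities".
query_vector = [0.8, 0.1, 0.1]

ranked = sorted(memories, key=lambda text: cosine(query_vector, memories[text]), reverse=True)
best = ranked[0]
print(best)
```

The question and the winning record share no keyword: one says "utilities," the other says "water and sewer." They match because their vectors point in nearly the same direction.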

The retrieval gap, the distance between what you've stored and what you can actually surface at the right moment, is where most agent memory systems fall apart in practice. You can have perfect storage, but with a shoddy retrieval architecture you still end up with an agent that behaves like it has amnesia.

Claudia, for what it's worth, doesn't have that problem anymore and is back to her sassy self.

Why This Is Your Problem to Solve

I want to close this chapter with something that might be uncomfortable.

The memory problem is not going to be solved for you. Not by the model providers. Not by the frameworks. Not by whatever the next wave of "agentic AI" products promises.

This isn't pessimism. It's just the nature of the problem. Memory architecture is domain-specific. What needs to be remembered, how long it needs to be kept, what the retrieval patterns look like, when to distill and what to discard: these decisions depend entirely on what your agent is doing and for whom. A runtime that works beautifully for a real estate agent will be wrong for a medical practice manager. One that's right for a solo operator will break under a team.

There is no general solution because the problem is not general.

What you can have is a general architecture: a way of thinking about the three memory types, the storage strategies, the retrieval patterns, and the distillation triggers that you then apply to your specific domain. That's what Part 2 is going to give you.

But first, one more chapter in Part 1. We've talked about what a runtime is and what it needs to remember. Next, we're going to talk about the mechanism that makes retrieval work, vectors, without the PhD that usually comes with the explanation.
