
AI's Memory Wall: Why Compute Grew 60,000x But Memory Only 100x (PLUS My 8 Principles to Fix It)

By AI News & Strategy Daily | Nate B Jones

Summary

Topics Covered

  • AI Memory Worsens with Intelligence
  • Stateless Design Blocks Useful State
  • Semantic Search Fails Relevance
  • Forgetting Enables Precision
  • Memory Demands Multiple Architectures

Full Transcript

Memory is perhaps the biggest unsolved problem in AI, and it is one of the only problems in AI that is getting worse, not better. As we get better and better at intelligence, we get worse at memory, relatively speaking. In fact, there's a name for it in the model maker community. It's called the memory wall.

We are not improving the hardware capabilities of our memory systems nearly as fast as we are improving the ability of those chips to compute, to do LLM inference. That generates a growing gap between our intelligence capabilities and our memory capabilities. Don't worry, we won't stay at the hardware level for long.

I want to go through with you the core issues that we see as builders, as users of AI, as designers of AI systems. What is the root of the memory problems we experience? Whether we're at a systems design level, at a usage level, or even just using ChatGPT, why are memory problems so sticky and hard to untangle? Why have we not seen better solutions in the market? I think there are good reasons for that. And then once we go through those root causes, how can we start to think about solving them? How can we think about solving them as users? How can we think about solving them as builders? So I'm going to go through six root causes, and then we're going to flip the script and I'm going to go through eight principles for building a solution, because I want you to walk away from this feeling empowered to actually design better memory systems. I don't want you to wait around for someone in Silicon Valley to make a pitch and get funded for this. You can design your own solution here.

So the key thing to keep in mind through this whole conversation is that AI systems are stateless by design, but useful intelligence requires state. Every conversation is stateless, meaning it starts from zero. The model has parametric knowledge, the weights we talk about in a model, but it doesn't have episodic memory. It does not remember what happened to you. And I'm sorry, but the ten or eleven sentences of very lossy memory that ChatGPT has right now, or the ability to search conversations that Claude has right now, is not good enough for that. You have to reconstruct your context every single time. This is not a bug, actually. It is an intentional architecture. It is a design for statelessness, because the model makers want the model to be maximally useful at solving the next problem, the problem in front of you. And they cannot presume that state matters. It doesn't always matter.
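
To make that concrete, here is a minimal sketch of what "reconstructing your context every single time" looks like in practice. The `call_model` function and the message format are generic placeholders, not any specific vendor's API; the point is simply that nothing persists between calls unless you send it again.

```python
# Minimal sketch: a stateless chat API keeps no memory between calls.
# Anything the model should "remember" has to travel with every request.
# call_model() is a hypothetical stand-in for any chat-completions-style endpoint.

def call_model(messages: list[dict]) -> str:
    """Placeholder for a vendor API call; returns the assistant's reply."""
    raise NotImplementedError

def ask(question: str, context_library: str, history: list[dict]) -> str:
    messages = [
        # Re-inject your curated context on every single call...
        {"role": "system", "content": f"Relevant context:\n{context_library}"},
        # ...along with any prior turns you want the model to "remember".
        *history,
        {"role": "user", "content": question},
    ]
    reply = call_model(messages)
    # The server forgets everything; the caller owns the state.
    history.extend([{"role": "user", "content": question},
                    {"role": "assistant", "content": reply}])
    return reply
```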

So the promise of memory features is that vendors are going to magically solve this by making the system stateful in ways that are useful to you. But this creates a whole host of new problems, because statefulness is not the same for all of us. What should it remember? Is it passive accumulation? Is it active curation? How long should it remember? Is it persistent forever? Does it ever go stale? Does it drop off after 30 days? When do you retrieve it? Do you retrieve it when it's relevant, sort of like Claude does? Do you retrieve it all the time, and potentially it's noisy in the context window? How do you update it? This is one of the biggest problems with LLMs. People tell me they'll put their wiki into a retrieval-augmented generation system, and I'm like, when was the last time you updated your wiki? If it's not updated, how do you overwrite it? How do you append data to it? How do you change data? These are not implementation details. They are fundamental questions about what memory is and what its purpose is when we do work.

Memory matters because we humans are able to quickly and fluidly negotiate between stateless brainstorming, the wild stuff where we don't need a lot of our past memory, and very stateful work. LLMs are not good at that. Loading that context is very, very hard right now.

So why is this so persistent? We've talked a little bit about how the promise is hard to fulfill, but what are some of the root causes that make it hard for vendors to do this?

Number one, the relevance problem is one of the gnarliest unsolved problems out there. What's relevant actually changes based on the task that you're doing. Are you planning? Are you executing? The phase of your work: are you just exploring, or are you refining? The scope you're in: is it personal or is it a project? I know someone who is in the healthcare industry, and they have to be very careful, because if they were to ever ask for health advice, the memory retrieval within ChatGPT would pull up work stuff, and they are afraid that in the same context, if they pull up a work thing, their personal health data will leak in, because it will all look like health data. So the scope matters. And what has changed since the last time you talked? The state delta is what we would call that. If you come back and say this is a new version, does it really understand that it's a new version or not?

Semantic similarity, which is what retrieval-augmented generation depends on, is just a proxy. It is a proxy for relevance. It is not a true solution. Finding similar documents works until you need to find the document where we decided X, and that's very specific. Or ignore everything about client A right now, but pay attention to clients B, C, and D. Or please only pay attention to what we've decided since October 12th. These are all things that we humans can understand and execute on when we go and manually retrieve information. But for AI using semantic search, it's just not the right tool for that job. There's no general algorithm for relevance, no magic relevance solve that the AI can depend on. You need to use human judgment about task context. And that means fairly complicated architectures to accomplish a specific memory task, not just better embeddings in a RAG memory system. And that, by the way, is one of the big reasons why these one-stop-shop vendors often struggle with real implementations.

Number two, the persistence-precision trade-off is a massive issue with memory systems. If you store everything, retrieval becomes very noisy and very expensive. You jam up your context window. If you store selectively, you're going to lose information that you need later. If you let the system decide what to keep, it often optimizes for something you didn't ask it to. Maybe it optimizes for recency. Maybe it optimizes for frequency. Maybe it optimizes for statistical saliency versus actual importance. And if you wonder what statistical saliency is: have you ever tried having an argument with ChatGPT or Claude or Gemini about the fact that it's emphasizing the wrong thing in something it's writing? That is saliency. That's a saliency defect.

Human memory is actually, funnily enough, very good at this, through the technology of forgetting. We use incredibly lossy compression with emotional and importance weighting. We've actually done studies on human memory, and it turns out that with practice you can get better and better at recalling specific things. But if you choose not to recall something that happened to you, you're just going to lose it. And what's interesting is that it seems to be a database-keys issue for us. I realize someone in the comments is going to be a neuroscientist and rightly take me to town, but my understanding of the reading is that you have to be able to remember the equivalent of a database key to retrieve the memory. If you can do that, the memory becomes accessible again. But your short-term memory, so to speak, is very lossy, and so you lose the database keys if you can't persist them with intent, if you don't intend to remember them. That is why your childhood memories can be very accessible, but what happened last Thursday? You're sitting there and you're like, did we eat out or did we not eat out? Which day did we go to the movies? It's not because you have a profound issue with memory. It's because your brain is desperately compressing information to make it useful to you and has dumped those database keys. And when you go to the effort of remembering, you're literally retrieving the database keys to get the memory back. Forgetting is a useful technology for us. That's the point.

AI systems don't have any of that. They either accumulate or they purge, but they do not decay. That moment of, did I go to the movie? Oh yeah, it was that movie. Who was that character? Oh, now I'm recovering the key and I'm able to get it back. The memory has decayed into a lossy approximation held by the memory key, but I can recover it if I put effort into it. We have nothing like that in AI. That is a uniquely human technology, and it's funny, but we have to think about forgetting as a technology when we talk about memory.
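
We have nothing like this out of the box, but if you build your own memory layer, you can approximate "decay instead of purge." Here is a minimal sketch, entirely my own assumption about one way to do it: score each record by importance and recency, demote low scorers to a cheap cold archive keyed by a short summary (the "database key"), and only rehydrate them when that key is deliberately recalled.

```python
import math, time

HALF_LIFE_DAYS = 30.0  # assumed decay rate; tune per memory type

def decay_score(importance: float, last_access_ts: float, now: float | None = None) -> float:
    """Importance weighted by exponential recency decay, not a hard purge."""
    now = now or time.time()
    age_days = (now - last_access_ts) / 86_400
    return importance * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

def compact(records: list[dict], keep_threshold: float = 0.2) -> tuple[list[dict], dict[str, dict]]:
    """Keep hot records; demote the rest to a cold archive keyed by a short summary."""
    hot, cold = [], {}
    for r in records:
        if decay_score(r["importance"], r["last_access"]) >= keep_threshold:
            hot.append(r)
        else:
            cold[r["summary_key"]] = r   # recoverable later, like an effortful human recall
    return hot, cold
```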

Number three, the single context window assumption. Vendors often try to solve memory by making context windows bigger. But volume is not the issue; structure is the problem. A million-token context window is not a usable million-token context window if it's full of unsorted context. That is worse than a tightly curated 10,000 tokens. The model still has to find what matters, parse the relevance, and ignore the noise. You have not solved the problem by expanding the context window. You have simply made your problem more expensive, sometimes substantially more expensive. I know people who make calls without budgeting them and then ask, "Why is my API bill high?" I'm like, your API bill is high because you're stuffing the context window and just throwing queries against it. It does not work well, and it is also very expensive.
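
To put rough numbers on "substantially more expensive": at an assumed price of $3 per million input tokens (purely illustrative, check your vendor's actual pricing), the gap between a stuffed window and a curated one compounds quickly.

```python
# Illustrative only: assumed $3 per 1M input tokens, 1,000 calls per day.
PRICE_PER_M_TOKENS = 3.00
CALLS_PER_DAY = 1_000

def daily_cost(tokens_per_call: int) -> float:
    return tokens_per_call / 1_000_000 * PRICE_PER_M_TOKENS * CALLS_PER_DAY

print(daily_cost(1_000_000))  # stuffed window:  $3,000/day
print(daily_cost(10_000))     # curated context: $30/day
```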

The real solution requires multiple context streams with different life cycles and retrieval patterns. It is hard. You have to design it. It breaks the mental model of "just talk to the AI." That is why there is no one-size-fits-all solution.

Issue number four is the portability problem. Every single vendor builds proprietary memory layers, because they think, in their pitch deck, that memory is a moat. I get it; it makes sense on a pitch deck. ChatGPT memory, Claude recall, Cursor memory banks: these are not inherently interoperable. Users will invest time building up memory in a given system, and the model makers like that, because it makes the switching cost real. You can't port what ChatGPT knows about you to Claude, your memory is locked in, and so on. The problem here is a problem of the commons. This behavior from vendors and model makers and tool builders encourages users to leave memory to the tool rather than encouraging them to build a proper context library. And I get it from a product design perspective, because how many users are really going to build a proper context library? But if we reframe it and say portability is a first-class problem, users should inherently be able to be multi-model. I think that's more relevant. And maybe from a consumer standpoint you don't care, because ChatGPT has 800 million users and dwarfs everything else. One, that's not entirely true, because Gemini is, I think, closing in on half a billion now. But the other reason is that from a business perspective, you have to be multi-model. It is a liability to be single-model. And so if you're building business memory systems, you must solve the portability problem. The issue is that any given vendor is not incentivized to make that truly portable either; they want to make it proprietary to them. And then you have the same bottleneck, but now you're on a vendor who may not be as well funded as the model maker. And so it becomes a house of cards.

Number five, the passive accumulation fallacy. Most memory features assume you just use your AI normally and it will figure out what to remember. That is the default mental model of users, and so that's the assumption memory features build around. But this fails, because the system cannot distinguish a preference from a fact. It cannot easily tell project-specific context from evergreen context; I've often seen those mixed up. It doesn't automatically know when old information is stale. If you've ever wondered why ChatGPT or Claude or Perplexity comes back and talks about old AI models as if they are current today, that is the same issue: it can't tell when old information is stale, and it optimizes for continuity, not for correctness. This is the keep-the-conversation-going issue. Useful memory fundamentally requires active curation. You have to decide what to keep, what to update, and what to discard. And that is work. And so vendors promise passive solutions, because active curation, they are told, does not scale as a product. I think we have to start by framing that problem better, because it turns out passive accumulation doesn't solve for it either. And this is still a big enough problem that it costs us billions of dollars at the enterprise level, and it's extremely frustrating for users both personally and professionally. The answer cannot be "there is no answer" or "we'll fake the answer."

Finally, number six on the root cause side, and then we'll get to the solve; it'll feel better. Memory is actually multiple problems, and that's part of why it's so hard. I hope you're getting that idea by now. When people say AI memory, what they really mean is any number of things. Preferences: how I like things done; that could be a key-value pair that's persistent. Facts: what's true about particular things or entities; that can be structured, and it might need updates. Knowledge: domain expertise; that can be parametric, embedded in the weights, but it might not be right, and then what do you do? Episodic memory: conversational, temporal, ephemeral knowledge. And procedural memory: have we solved this before? If episodic memory is what we've discussed in the past, procedural memory is how we solved this problem in the past. Those are also different things, and so in procedural memory you have exemplars, you have successes and failures. Every single memory type needs different system design to handle storage, retrieval, and update patterns.
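
Here's one way to write that taxonomy down, a sketch under my own assumptions about reasonable defaults rather than anything the vendors ship. The value is in forcing yourself to name a storage shape, a life cycle, and an update rule for each memory type.

```python
# One hypothetical mapping of memory types to storage, life cycle, and update rules.
MEMORY_TYPES = {
    "preference": {"storage": "key-value",               "lifecycle": "permanent",       "update": "overwrite"},
    "fact":       {"storage": "structured / relational",  "lifecycle": "until stale",     "update": "versioned edit"},
    "knowledge":  {"storage": "parametric + curated docs","lifecycle": "slow-changing",   "update": "review and replace"},
    "episodic":   {"storage": "event log",                "lifecycle": "ephemeral/decay", "update": "append-only"},
    "procedural": {"storage": "tagged exemplars",         "lifecycle": "evergreen",       "update": "add successes and failures"},
}
```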

If you feel like you're getting a headache here, you're not alone. This is why we don't have a good solve, and this is why I want to lay out principles for a solve in the next section. But it starts with being honest about the problem. Treating this problem as one problem guarantees you are going to solve none of the real problems well.

And that is why memory is a persistent issue, in fact a worsening issue, in the AI community. Vendors are fundamentally treating this as a solve for infrastructure, not a solve for architecture. And so bigger windows and better embeddings and cross-chat search scale, but they don't solve the problem structurally. And users keep expecting passive solutions because, frankly, they're sold passive solutions. There's an expectations issue here. "Just remember what matters" is not something you can expect to work, but we're told that it will work. So if memory requires architecture and users want magic, the gap between what's promised, what's delivered, and what's needed has never been bigger. We have a memory wall of our own, beyond the chip level, in how we design our systems. And it won't get solved if we solve the wrong problem.

So let's say you've gone through all of this and you want to solve memory correctly. I am going to give you principles that work whether you are a power user in the chat at home who wants to build something yourself, because this absolutely works for that, or whether you are designing larger systems. It turns out the principles for memory are fractal, because the problem is fractal. We have the same kinds of memory issues as individual power users in a chat as we do when we are designing agentic systems. So, the principles that work. There are going to be eight of these. Settle in; it's going to be fun.

Number one: memory is an architecture. It is not a feature. You cannot wait for vendors to solve this. I think you get this idea, so we won't spend too long here. Every tool will have memory capabilities, but if you leave it to tools, they will solve different slices. You need principles that work across all of them, and you need to architect memory as a standalone layer that works across your whole tool set.

Principle two: separate by life cycle, not by convenience. As an example, you need to separate personal preferences, which can be permanent, from project facts, which can be temporary, and those should be separated from session or conversation state, which can be ephemeral. Mixing different life cycles, mixing permanent with temporary with ephemeral, just breaks memory. The discipline lies in keeping these apart cleanly. And again, this works if you're in chat, and it works if you're designing agentic systems. If you have a permanent personal preference, it can be as simple as a very disciplined update where you go into the system rules, the system prompt for ChatGPT, and say, "This is what you need to know about me. These are my personal preferences." Model makers are starting to expose that more, because they want it, but they don't tell you how to use it properly. When I observe how people actually use that "tell me about yourself" box, it is absolutely a mix of personal preferences and ephemeral stuff and project facts, because no one has taught them to use it better. And if you're designing agentic systems, it gets more complex, but it's the same separation of concerns. You have to separate out what the permanent facts are in the situation, what the project-specific facts are, and what is session state.

Principle number three: match storage to the query pattern. That means you're going to need multiple stores, because different questions require different retrieval. In the chat situation I gave you, ChatGPT can retrieve the memory if it's a system-prompt kind of thing; it just calls it into the context window, it's super simple, and most people would never think of it as memory, but that's what it is. If you're designing an agentic system, it is understanding the difference between, for example, "what is my style," which could be a key-value because it's a written style of some sort; "what is the client ID," which should be structured or relational data; "what similar work have we done," which could be semantic or vector storage; and "what did we do last time," which should be event logs. Those are four different types of data: key-value data, structured data, semantic data, event logs. Trying to do all of these in one storage pattern is going to fail. And that is why, when people say, "We have our data lake and it's going to be a RAG," I'm like, why? Why is it going to be a RAG? Have you heard the word RAG repeated a hundred times like a magic spell for memory? It does not work that way. You need to match storage to the query pattern. Otherwise, you just have a very expensive data dump.
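
Here is a minimal routing sketch of those four query patterns. The store interfaces (`kv`, `sql`, `vectors`, `events`) are placeholders for whatever you actually run, a key-value cache, a relational database, a vector index, an append-only log; the point is that the question, not habit, picks the store.

```python
# Hypothetical interfaces: kv, sql, vectors, and events stand in for real backends.

def answer_memory_query(question: str, kv, sql, vectors, events):
    if question == "what is my style?":
        return kv.get("style")                                           # key-value lookup
    if question.startswith("what is the client id for "):
        name = question.removeprefix("what is the client id for ").rstrip("?")
        return sql.one("SELECT id FROM clients WHERE name = ?", (name,)) # relational lookup
    if question.startswith("what similar work have we done"):
        return vectors.search(question, top_k=5)                         # semantic recall
    if question.startswith("what did we do last time"):
        return events.latest(kind="project_run")                         # event log, ordered by time
    raise ValueError("No store matches this query pattern yet")
```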

Principle number four: mode-aware context beats volume, hands down. More context is not better context. Planning conversations need breadth; they need space for alternatives and comparables. Brainstorming conversations are similar to planning conversations; you need to be able to range. Execution conversations, execution workflows in agentic situations, need precision; they need precise constraints. Retrieval strategy needs to match your task type. You cannot just sit there and think to yourself, okay, I'm going to have a brainstorming conversation and it's going to be incredibly precise, and just hope that it works. This is why I talk about prompting so much. What prompting effectively does is give an AI context that is mode-aware, so that it can be in the right mode. That's super effective for chat users. But guess what? If you're designing agentic systems, it is your responsibility to architect mode awareness into the system, so that it knows this is an execution environment, that precision matters, and that it is audited and evaluated on precision.
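
A sketch of what mode awareness can look like as configuration; the modes and the numbers are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class RetrievalPolicy:
    top_k: int                 # how many memories to pull
    similarity_floor: float    # how strict the match must be
    include_constraints: bool  # inject hard constraints / ground truth?

# Breadth for planning and brainstorming, precision for execution.
MODE_POLICIES = {
    "brainstorm": RetrievalPolicy(top_k=20, similarity_floor=0.3, include_constraints=False),
    "planning":   RetrievalPolicy(top_k=12, similarity_floor=0.5, include_constraints=True),
    "execution":  RetrievalPolicy(top_k=4,  similarity_floor=0.8, include_constraints=True),
}

def retrieve_for(mode: str, query: str, vector_store):
    policy = MODE_POLICIES[mode]
    hits = vector_store.search(query, top_k=policy.top_k)   # hypothetical store API
    return [h for h in hits if h.score >= policy.similarity_floor], policy
```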

precision. Principle number five, you need to build portable as a first class object. You need to build portable and

object. You need to build portable and not platform dependent. Your memory

layer needs to survive vendor changes.

It needs to survive tool changes. It

needs to survive model changes. If chat

GPT changes their pricing, if Claude adds a feature, your context library should be retrievable regardless. And

that is something that almost nobody can say right now. And the people who are doing it tend to be designing very large scale agentic AI systems at the enterprise level. But this is a lesson

enterprise level. But this is a lesson that we all need to take with us. I

think it is a best practice. It is sort of like keeping a go bag next to the door in case you need to get out in case of I don't know something happens to your house. You need to have something

your house. You need to have something that is portable that carries relevant memory that you can use to have productive conversations with another AI. I fully admit there is not an

AI. I fully admit there is not an outof-box solution for this. There are

people who are power users who configure obsidian to do this right as a note-taking app and they tie it into AI and it becomes a portable platform independent way of handling this. There

are people who use notion for this. The

thing that is a common trait is that they are obsessed with making sure the memor is configured correctly for them and the AI has to come in and be queried correctly or called correctly to engage

with a piece of the memory that matters.

Whether that is a key value piece like what's my style or a semantic search like what similar work have we done together. A good data structure accounts

together. A good data structure accounts for that. Principle number six
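
One hedged sketch of a portable context library: plain files on disk, organized by life cycle, assembled into a context block you can paste or pipe into any model. The folder names are my assumptions; the property that matters is that nothing here lives inside a vendor's memory feature.

```python
from pathlib import Path

LIBRARY = Path("context-library")   # plain Markdown/text files, synced however you like

# Assumed layout:
#   context-library/evergreen/   - permanent preferences, style, standing facts
#   context-library/projects/    - per-project briefs and constraints
#   context-library/exemplars/   - tagged examples of past work that went well

def build_context(project: str | None = None, max_chars: int = 20_000) -> str:
    sections = sorted(LIBRARY.glob("evergreen/*.md"))
    if project:
        sections += sorted(LIBRARY.glob(f"projects/{project}/*.md"))
    text = "\n\n".join(p.read_text() for p in sections)
    return text[:max_chars]          # crude budget; curate rather than rely on truncation

# Usage: paste build_context("acme-rebrand") into ChatGPT, Claude, or your own agent.
```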

Principle number six: compression is curation. Do not upload 40 pages hoping the AI extracts what matters. I see people do this when they overload the context window and ask for an analysis of a report. You need to do the compression work. You need to, either in a separate LLM call or in your own work, write the brief, identify the key facts that matter, and state the constraints. This is where judgment lives, and if you don't delegate it, you will be happier with the precision and context awareness of the response. Memory is bound up in how we humans touch the work. There are ways to use AI to amplify and expand your judgment: you can use a precise prompt to extract information in a structured way from 40 pages of data, and then, in a separate piece of work, figure out what to do with that data. But it remains on you to make sure the facts are correct, the constraints are real, and the precision work you're asking AI to do with that data is the correct precision work. The judgment in compression is human judgment. It may be human judgment that you amplify with AI, but it remains human judgment.

Principle number seven: retrieval needs verification. Semantic search will recall well but fail on specifics; it recalls topics and themes well. You need to pair fuzzy retrieval techniques like RAG search with exact verification where facts must be correct. You should have a two-stage retrieval path: recall candidates, then verify against some kind of ground truth. This is especially important in situations where you have policy, financial facts, or legal facts that you need to validate. Something like this is exactly why a very prominent fine was leveled against a major consulting firm in the last two weeks; I think the fine came to close to half a million dollars, because they could not verify facts around court cases in a document they prepared, they hallucinated them, and they didn't catch them. Retrieval failed. And because the LLM is designed to keep the conversation going, it just inserted something plausible, and nobody caught it. You need to be able to verify retrieval against ground truth. Now, if it's a small task, that might be the human at the other end of the chat; it's just a step that needs doing. If it's a large agentic system, it is the exact same fractal principle, but you need to do it in an automatic way, using an AI agent for evals.
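
A minimal two-stage sketch: fuzzy recall first, then every citation-like claim is checked against an exact-lookup source before it is allowed through. The store interfaces and the `extract_citations` helper are hypothetical.

```python
# Stage 1: fuzzy recall. Stage 2: exact verification against ground truth.

def extract_citations(text: str) -> list[str]:
    """Placeholder: pull case numbers / policy IDs / figures out of a draft."""
    raise NotImplementedError

def retrieve_and_verify(query: str, vector_store, registry) -> dict:
    candidates = vector_store.search(query, top_k=10)         # recall: topics and themes
    draft = "\n".join(c.text for c in candidates)

    unverified = [
        cite for cite in extract_citations(draft)
        if not registry.exists(cite)                          # exact lookup: does it exist at all?
    ]
    return {
        "draft": draft,
        "unverified_claims": unverified,
        "safe_to_use": len(unverified) == 0,                  # block, don't hope
    }
```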

Principle number eight: memory compounds through structure. Random accumulation does not compound; it just creates noise. Just adding stuff doesn't compound. If we added memories randomly, the way we experience them in life, and had no lossiness, no ability to forget, we would not be able to function as people. Forgetting is a technology for us. In the same way, structured memory is a technology for LLM systems. Evergreen context goes one place, versioned prompts go another place, tagged exemplars go another place. At a small scale, yes, you can do this; people are doing it with Obsidian, with Notion, with other systems as individuals. And yes, you can scale it as a business; it's the same principle. You let each interaction build without degradation if you have structured memory. Otherwise, you just have random accumulation. Otherwise, you have the pile of transcripts you never got to, and you're like, well, this is data, we're logging it, it's probably good. It's going to be random accumulation. It creates noise. You're not going to have structured memory.

These are the principles that work. They work whether you are a power user with ChatGPT or a developer building agentic systems. Frankly, they are guideposts for you if you are evaluating vendors in the memory space. These are tool-agnostic principles. They're designed to scale with complexity, and they're designed to give you keys that solve the memory problem, because they make context persist reliably without the brittleness we see with current AI systems.

So my challenge to you as we wrap up this video: we've gone through root causes, we've gone through why memory is a hard problem, and we've gone through eight principles for how to solve this memory issue. Please take memory seriously. The reason it matters now is that if you solve memory now, you have an agentic AI edge. These systems are going to get cheaper and more powerful, but you can't assume they're magically going to solve for memory. As I said at the beginning, there's a chip-level issue here. It is a hard, hard problem. If they don't magically solve for it, and you take responsibility for memory and build it yourself in the way that works for you, you are starting the timer earlier than everybody else around you on getting memory that is functional across a long-term engagement with AI. Because you have to start to think: we're in year two of the AI revolution. Wouldn't it be great to have memory that goes back to year two when you are working with AI systems in 10 years, in 15 years, in 20 years? Everybody else is going to have memory that started much later, and they're going to lose that discipline, that acceleration, that ability to manage deep work over time that AI is going to be capable of with proper memory structures. So there is a moment here for you to think about and put in place a memory structure that works. Don't lose the opportunity. This is a complex one, but it's on you and me and all of us together to build memory systems that handle our own needs, whether that's personal needs or professional needs. I know you can do it. Drop in the comments how you're doing it, because I think we should all crowdsource it.
