
AI just got Elephant Memory - Hands on with the Wildest AI Updates

By MattVidPro

Summary

## Key takeaways

- **Crea Node Agent Auto-Builds Workflows**: Node Agent in Crea AI builds complex workflows by stringing together models and APIs from a simple prompt, such as combining three photos into different angles. Workflows are fully modifiable, and the AI can extend them, for example turning image nodes into a video pipeline. [01:02], [01:37]
- **First Local AI Battle Royale Game**: Hugo's 70M-parameter world model runs a real-time multiplayer Doom deathmatch locally in the browser, with coherent world generation despite fuzzy hallucinations. Players battle in an AI-generated game world with a minimap and quick respawns. [02:18], [03:20]
- **MSA Enables 100M-Token Memory**: Memory Sparse Attention grows ultra-long memory natively into the attention mechanism, scaling to 100 million tokens on two GPUs with less than 9% stability loss. It uses document-based sparse retrieval, chunkwise pooling, and tiered storage to bypass quadratic compute limits. [11:20], [14:47]
- **Notebook LM Cinematic Paper Explainers**: Notebook LM's new cinematic video feature generates an 8-minute immersive overview of the MSA paper from uploaded PDFs, using engaging visuals and storytelling. It unpacks complex ideas like human-scale memory being equivalent to roughly 200 million tokens. [11:46], [12:25]
- **Claw Router Optimizes LLM Costs**: Claw Router saves costs for AI agents by scoring prompts across 15 dimensions to route each one to the best of 44 LLMs, instead of always hitting the most expensive API. [09:04], [09:23]
- **Claude Skill Builds Godot Games**: The GoDoGenen Claude Code skill orchestrates full Godot projects with planning and execution skills, generating scenes, scripts, assets, and 3D models, and using visual QA on screenshots to fix the game. [10:04], [10:57]

Topics Covered

  • AI Agents Auto-Build Modifiable Workflows
  • World Models Enable Real-Time Multiplayer Games
  • Nano Banana Beats Microsoft Image Gen
  • Claw Router Optimizes LLM Costs
  • MSA Scales LLMs to 100M Tokens Natively

Full Transcript

What's going on, everyone? I hope you're all having a fantastic Friday. Welcome back to the MattVidPro YouTube channel. We've got a great deal to talk about today. I've gathered all of the most intriguing AI research, demos, and hands-on experiences I could find, and the first one is by Crea AI.

Let me preface this by saying that Crea AI has a fantastic track record. I rarely hear people complain about this company, and they're doing very well despite a lot of competitors. This is a website that lets you generate AI images and AI videos and build workflows, and it's very well known for its real-time AI image and video generation. But what they have today is Node Agent. Crea Nodes lets you build all kinds of workflows by stringing together various models into your own custom pipeline. What Node Agent does is live on the right-hand side and just straight up build those workflows inside of Crea Nodes for you. You can see the prompt here is just to combine these three photos. It goes ahead and imports them, then connects them all up to various Nano Banana APIs to output all these different angles.

Two standout things I'm noticing right away. First, once it generates all of the working nodes for your task, the result is fully modifiable, down to every single detail. You can change the model or the prompt, detach nodes, reattach new ones, and whatever you end up with, the AI can still work with. That's the second thing I noticed: as you can see right here, the user highlights everything and gives the simple prompt to create a video, and it immediately branches off of these nodes, creating new pipelines.

I think this is for professional AI creatives, people who need complex workflows but very much value their time. If you're serious about creativity with AI, it is possible to create decent-quality AI shorts. Right now, the barriers are traditionally swapping between a thousand tabs and handling every workflow yourself, but this looks like a good solution, especially if you already have a Crea plan. If you don't already use Crea, you're going to need at least a Pro plan. 35 bucks isn't terrible, and Crea is a good site, but that's still a decently thick barrier to entry. So, let's talk about something you can try for free in your browser.

Hugo has built the first battle royale running locally in a world model. Don't expect anything mind-blowing here, but: 70 million parameters, real-time multiplayer, and customizable levels. It looks like the model is trained almost exclusively on Doom. And here's the site. We're going to give it a quick test drive. Let's do the Doom deathmatch. You're actually battling against real players also on the site, so your opponents aren't AI, but the whole game world is an AI world model.

I'm going to jump in with quick play here, and you can see I am in the game. I've got that classic Doom shotgun and I can move around, and yeah, it actually does look very much like real Doom. But some guy clearly spawned in there and already got me. I've got a different weapon this time. Let's see if I can get this guy; I think he's hiding behind a wall over here. Oh my. Okay, I don't know what just happened there. I think there are some hallucinatory effects going on. Where are the other players? The world is coherent, but it's fuzzy. It's fuzzy and strange. Oh, there's a guy. I think I got him. Okay, there's another one. Oh, he got me. But it just throws you right back in.

There's actually a minimap on the side, so you can see that there is some sort of coherent representation that the AI has to adhere to as it's generating the gameplay in real time. And it's very impressive that all of this is streamed. That low resolution and that tiny model are really what's making this possible right now. It's cool. This is the first time I've ever played a fully AI-generated game that is multiplayer. Rudimentary, sure, but overall a really cool little tech demo. Oh, let's see, can we get this guy? Oh my gosh, how did anyone ever aim in actual Doom? This is insanity. Okay, this guy appears to be invincible or something. It's interesting: the players just kind of look like creepy, shady blobs of hallucination. Oh, I've got a different weapon now. Let's see if I can get this guy.

So, you get the idea. There are also customizable levels and game modes. This is dipping our toe in the water of what generative AI video games could one day be. It's fuzzy, wrinkly, and unclear now, but one day this concept is going to evolve into something much, much more than what we have access to today. In my last video, I also checked out an open-source world model. These architectures are being developed and experimented with. You're not playing an AI-hallucinated GTA 5 anytime soon, but never say never.
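At a high level, a playable world model like this runs an autoregressive frame loop: at each tick it takes the recent frames plus the player's input and predicts the next frame, then feeds its own output back in. Here is a heavily simplified sketch, with a stand-in `predict_frame` where the real neural network would go; nothing here reflects Hugo's actual implementation:

```python
import numpy as np

H, W = 64, 64          # tiny frames, in the spirit of a small world model
CONTEXT = 4            # how many past frames condition the next prediction

def predict_frame(history: list[np.ndarray], action: str) -> np.ndarray:
    """Stand-in for the learned world model: real systems run a trained
    video model here; we just nudge pixels so the loop is runnable."""
    base = np.mean(history, axis=0)
    shift = {"left": -1, "right": 1}.get(action, 0)
    return np.roll(base, shift, axis=1)

# Autoregressive gameplay loop: the model's own outputs become its inputs.
frames = [np.zeros((H, W), dtype=np.float32) for _ in range(CONTEXT)]
for action in ["forward", "left", "left", "right", "fire"]:
    nxt = predict_frame(frames[-CONTEXT:], action)
    frames.append(nxt)   # in the real demo, this frame is streamed to you

print(len(frames))  # → 9
```

The "fuzzy hallucinations" fall out of this structure: any error in one predicted frame is fed back as input to the next, so small artifacts can compound until something off-distribution happens on screen.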

Shared by Wild Minder, this paper also focuses on AI-generated game worlds and gaming. It's similar to what we just saw, but more limited and more advanced at the same time. Probably one of the most impressive things is the very precise action control through complex and tangled keyboard-and-mouse inputs, but it's likely this is made possible by that CS:GO training data. They claim long-horizon sequences, but that caps out at 10 seconds at 20 frames per second. So, not true long horizon. It's long horizon for where this technology stands right now, as brand new, but in terms of long horizon, this is actually technically less capable than what we just tested. I've got to say, though, the game worlds look consistent. They are higher resolution, and it maintains 3D shapes very impressively as you look around. So this is cool, this is promising, but I would love to see a full open-source release, because right now the GitHub just has a readme.

Next up, Microsoft has released an update to its image generator. Not sure if you knew they even had one, but here's MAI Image 2. Honestly, it appears to be a pretty great model, ranked number five on the arena, and these cherry-picked sample images are pretty great. Good skin tones. However, I know this isn't going to beat Nano Banana 2 in coherency, and honestly, for me, also in photorealism. But similar to Nano Banana, this model does appear to be pretty strong with text and creating graphics. I think these examples right here are tasteful: smart use of color, not overblown. Sometimes Nano Banana can definitely overdo things a little bit. If these images look like something you pursue often in image gen, I recommend checking this model out. But if you care about dominance in coherence and the ability to follow instructions exactly, Nano Banana 2 and Nano Banana Pro are really hard to beat.

I'm going to go ahead and give this a try with a quick infographic prompt, and we'll put it head-to-head with Nano Banana 2. Okay, and here is our result. Yeah, it isn't a Nano Banana 2. I wanted a theoretical lemon character, like a video game character, based on the anatomy of a lemon, so it had to be creative and come up with real names. We have neural pulp interface, acidic core power, fiber optic stem. That's all cool stuff, but you can see there's not too much detail. There are no blurbs or descriptions, and the art style feels very SDXL, if you know what I mean. Infographics are not the strong suit of this model. The Nano Banana Pro output has far more detail: synthetic leaf antenna receives wireless data and solar power, external structure, lens assembly with aroma injection, integrated haptic motor, synthetic leaf antenna. You can see there are still some dupes, but it did also include the cutout, which is really cool.

In my last video, we talked about updates to Google's AI Studio, what they call the vibe coding interface. It felt very much like an in-browser version of Antigravity, toned down quite a bit but still highly capable. So, what else do they want to bring to the table? All of this is claimed by Logan Kilpatrick to be coming in the next few weeks, but I would take that with a grain of salt; I think we can expect a few of these over the next few weeks. We're looking at a design mode, perhaps inspired by the recent Stitch update we also took a look at in my last video, which had a strong focus on design, though I assume this would somehow be more generalized for producing apps, programs, and games. Figma integration, Google Workspace integration, better GitHub support. That GitHub support apparently is a huge deal. A planning mode, which echoes what I was talking about yesterday: Antigravity has a planning mode, and it works very well. That feature, at the end of the day, is inspired by Claude Code, and it works beautifully, forcing the LLM to take a look at the whole situation, write out a plan, and then execute it systematically. The plan literally exists as a file that it has to reference. Immersive UI, which we don't know what that is. Agents: they already consider what they just shipped to be an agent, but I imagine maybe sub-agents could be spawned off of that one. Multiple chats per app, good to see. Simplified deploys and G1 support. I wish the best of luck to the Google team. The more apps we have like this, the better: places people can go and get started for completely free, not just to learn the structure of app creation, but to learn how to interact with LLMs in order to produce something. There is so much to be said about prompting, and learning to communicate with these models in order to bridge the gap between human and code is massive. And these guys are far from the only people pursuing it.

Next up, let's talk about this open-source project, Claw Router. Obviously, this is designed to be integrated with AI agents, especially OpenClaw. It is designed to save costs by effectively routing prompts to the correct LLM. How is it actually accomplishing this? Well, it weighs a score across 15 dimensions to give you the best bang for your buck. I'm not going to come out and say that this works 100% of the time, because I doubt that it does. But I think it's very much worth a try, because my typical solution to this problem is to just constantly hit the most expensive API so I'm always getting the best model, and that's inefficient. Now, since this is open source, the default values are customizable, and you can choose from over 44 models. I think these picks aren't bad at all, but I could definitely see the use case where maybe I would swap Sonnet 4.6 or Opus 4.6 out. Regardless, for all of you running agents out there, this is definitely something to look into. Especially if you run a business and you have employees that use agents, there could be some real cost savings to be had.
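The core idea of cost-aware routing can be sketched in a few lines: score the prompt along several dimensions, then pick the cheapest model whose capability covers that score. The dimensions, weights, and model table below are made up for illustration; they are not Claw Router's actual defaults:

```python
# Hypothetical prompt-routing sketch: weighted difficulty score -> model tier.
WEIGHTS = {"length": 0.2, "code": 0.4, "reasoning": 0.4}  # illustrative dims

MODELS = [  # (name, capability ceiling, $ per 1M tokens) - made-up numbers
    ("small-fast", 0.3, 0.15),
    ("mid-tier", 0.6, 1.00),
    ("frontier", 1.0, 15.00),
]

def score(prompt: str) -> float:
    """Combine simple per-dimension features into one difficulty score."""
    text = prompt.lower()
    feats = {
        "length": min(len(prompt) / 2000, 1.0),
        "code": 1.0 if "```" in prompt or "def " in prompt else 0.0,
        "reasoning": 1.0 if any(w in text for w in ("prove", "derive", "plan")) else 0.0,
    }
    return sum(WEIGHTS[k] * feats[k] for k in WEIGHTS)

def route(prompt: str) -> str:
    s = score(prompt)
    # Cheapest model whose capability ceiling covers the difficulty score.
    return next(name for name, cap, _ in MODELS if s <= cap)

print(route("what's the capital of France?"))               # → small-fast
print(route("derive the gradient and write def train():"))  # → frontier
```

A real router scores far more carefully (15 dimensions, per the project), but the shape of the decision, cheap scoring up front so expensive models are reserved for hard prompts, is the whole trick.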

I've got another project, also open source: GoDoGenen, a Claude Code skill that allows it to build complete Godot 4 projects. And if you didn't know, Godot is a potent and open-source game engine for both 2D and 3D games. You can actually generate Godot games from scratch with Claude Code itself or Google Antigravity, but I'm telling you right now, with these skills it's going to be a lot more effective. Two Claude Code skills orchestrate the whole pipeline: one plans, and then one executes. Each task spawns in a fresh context. It can do real projects with proper scene trees, scripts, and asset organization. It can generate assets in 2D and can even do textures, and Tripo 3D can convert images to 3D models. It's got custom-built language references for all 850-plus Godot classes, which is going to compensate for the lack of GDScript knowledge inside the LLMs. Best of all, it does a visual QA that closes the loop: it will capture actual screenshots from the running game and analyze them in order to fix the game. I really want to try this out, but it's out of the scope of today's video. Depending on how good the games it creates are, this could be just way too much fun, and seriously powerful.
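The plan, execute, screenshot, fix cycle the skill describes is a classic agent feedback loop. Here's a hypothetical sketch of that control flow; all the function names are placeholders standing in for LLM calls and engine automation, not the skill's real API:

```python
# Hypothetical plan/execute/visual-QA loop for an LLM game-building agent.
def plan(goal: str) -> list[str]:
    return [f"create scene for: {goal}", "attach player script", "add enemy"]

def execute(task: str, project: dict) -> None:
    project.setdefault("files", []).append(task)   # stand-in for real codegen

def screenshot(project: dict) -> str:
    # Stand-in for running the game and capturing a frame for analysis.
    return "ok" if len(project["files"]) >= 3 else "missing nodes"

def fix(project: dict, issue: str) -> None:
    project["files"].append(f"fix: {issue}")

def build(goal: str, max_rounds: int = 3) -> dict:
    project: dict = {}
    for task in plan(goal):            # planning skill
        execute(task, project)         # execution skill, fresh context per task
    for _ in range(max_rounds):        # visual QA closes the loop
        issue = screenshot(project)
        if issue == "ok":
            break
        fix(project, issue)
    return project

print(len(build("platformer")["files"]))  # → 3
```

The interesting design choice is that the QA signal comes from screenshots of the running game rather than from the code alone, so the agent catches failures that never show up as script errors.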

Next up, let's talk about this paper: MSA, Memory Sparse Attention. Explained in one sentence, it enables large models to natively have ultra-long memory, not through external retrieval add-ons, not through brute-force window expansion, but by directly growing memory into the attention mechanism itself, trained end to end. The author goes on to explain a little bit more, but I thought instead we could use Notebook LM's new feature to explain it for us. If you don't know, Notebook LM is a must-use AI tool by Google. Put in PDFs, images, YouTube links, websites, all as sources, and it can break them down, make flashcards, infographics, you name it. Their latest feature, cinematic video overviews, is now rolled out to 100% of Pro users, so paid users only. I downloaded the MSA paper, tossed it right into Notebook LM, and generated the cinematic video. There are a few different formats you can do, like a structured explainer or a brief overview, but this cinematic one is supposed to be a rich, immersive experience that unpacks complex ideas through engaging visuals and storytelling. So let's see if we can learn about MSA: scaling memory sparse attention to 100 million tokens. Okay, it generated an 8-minute-long video. That's pretty hefty.

>> Cognitive scientists estimate that human functional memory holds the equivalent of about 200 million tokens. As we accumulate knowledge, this network expands continuously to accommodate new information. Current artificial neural networks hit a rigid ceiling before reaching that scale. Even at the frontier of AI research, the effective context windows of large language models typically collapse around the 1 million token mark. Beyond this point, models lose the ability to recall specific details from earlier in the sequence. To bridge this gap, researchers from Peking University and the Shanda Group initiated a project specifically to break through the 1 million token barrier.

The team avoided methods like LoRA, or low-rank adaptation, which updates model parameters by adding a smaller set of trainable weights. While LoRA internalizes knowledge, it is vulnerable to catastrophic forgetting: when a model is forced to learn conflicting information, the updates can overwrite the weights associated with previous memories. They also moved past external retrieval-augmented generation, or RAG. RAG pipelines pull external text into the prompt window based on a search query. Because these systems rely on discrete chunks of text rather than the model's native mathematical representations, they hit a semantic ceiling that limits complex reasoning. True human-scale memory requires an architecture built directly into the model's latent space, allowing it to process information natively rather than searching an external database.

Standard dense self-attention faces a fundamental scaling flaw known as quadratic compute complexity. In this architecture, every new token must be mathematically compared against every historical token in the sequence to determine relevance. This exhaustive matching process causes the key and value cache, the matrices that store the model's historical state, to balloon in size until it shatters the memory limits of the hardware. Alternative architectures like linear attention or recurrent neural networks attempt to compress history into fixed-size mathematical states. This lossy compression forces the model to summarize its memory, which inevitably leads to the loss of fine-grained details over extreme context lengths.
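To see why dense attention "shatters the memory limits of the hardware", it helps to do the KV-cache arithmetic. The model dimensions below are typical illustrative values, not numbers from the paper:

```python
# Back-of-envelope KV-cache size for dense attention (illustrative dims).
layers = 32
kv_heads = 8
head_dim = 128
bytes_per = 2            # fp16/bf16 element size
tokens = 100_000_000     # the 100M-token regime MSA targets

# Each token stores one key AND one value vector per layer.
kv_bytes = tokens * layers * kv_heads * head_dim * 2 * bytes_per
print(f"{kv_bytes / 1e12:.1f} TB")   # → 13.1 TB

# Compute is worse: generating one new token means comparing it against
# every cached token, so per-step cost grows linearly with context and
# total cost grows quadratically.
comparisons_for_one_new_token = tokens
```

Thirteen terabytes for a single uncompressed cache is far beyond any GPU node, which is why sparse retrieval over a compressed representation is the only way to reach this scale.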

Memory Sparse Attention, or MSA, provides a new framework to achieve linear computational complexity. This allows the model to scale its context length while maintaining the high precision of standard attention. MSA enables the model to process context lengths up to 100 million tokens while running on standard hardware with only two GPUs. In testing, MSA demonstrated exceptional stability, showing less than a 9% loss. Achieving this stability required a total re-engineering of the model's data routing, positional encoding, and memory storage protocols.

Solving the 100 million token puzzle relies on restructuring the physical mechanics of the attention mechanism, moving beyond the limits of raw compute density. In its deeper layers, MSA replaces exhaustive token matching with a document-based sparse retrieval system embedded directly into the model's internal processing flow. The architecture adds a third projection alongside the standard key and value matrices. This specialized router key is used to index and locate information without requiring the model to look at every individual token. To manage the data volume, the model segments document hidden states into fixed blocks of 64 tokens. A process called chunkwise mean pooling shrinks these segments into highly compact latent representations. This reduces the number of points the model has to search.

When a user submits a query, the model automatically generates a specialized routing vector to represent that specific question. The model performs a cosine similarity search, scanning the compressed router-key cache with the query vector to calculate exact relevance scores for every document in the bank. The model uses these scores to isolate the top-k most relevant documents, usually the top 16, from the massive memory bank. The computationally heavy attention process then runs exclusively on this isolated fraction of the data, ignoring millions of irrelevant tokens. Substituting exhaustive calculations with compressed latent routing allows MSA to bypass the compute penalty usually associated with massive contexts.
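The retrieval step just described, chunkwise mean pooling followed by cosine scoring against router keys and top-k selection, can be sketched with NumPy. The dimensions, the document-level aggregation of chunk keys, and k are illustrative; this is a toy of the mechanism, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, block = 32, 64                      # hidden size, tokens per chunk
docs = [rng.normal(size=(rng.integers(2, 6) * block, d)) for _ in range(100)]

# Chunkwise mean pooling: pool each 64-token block, then (our simplification)
# average the block keys into one compact router key per document.
router_keys = np.stack([doc.reshape(-1, block, d).mean(axis=(0, 1))
                        for doc in docs])            # shape (100, d)

def top_k_docs(query_vec: np.ndarray, k: int = 16) -> np.ndarray:
    # Cosine similarity between the routing vector and every router key.
    q = query_vec / np.linalg.norm(query_vec)
    keys = router_keys / np.linalg.norm(router_keys, axis=1, keepdims=True)
    scores = keys @ q
    return np.argsort(scores)[::-1][:k]              # indices of best docs

hits = top_k_docs(rng.normal(size=d))
print(hits.shape)   # → (16,)
# Full attention then runs only over these 16 documents' keys and values.
```

The point of the pooling is that the expensive dense comparison happens over 100 compact vectors here (millions of chunks at real scale) instead of over every individual token.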

Beyond compute limits, the researchers faced an extrapolation problem. Most models are trained on short context lengths but are deployed to handle much longer sequences. Standard global positional encoding fails at scale because it assigns a strictly increasing ID number to every sequential token. As context reaches millions of tokens, these ID numbers far exceed the range the model encountered during its training. When forced to process these massive positional values, the model's internal math breaks down, leading to incoherent outputs.

MSA uses document-wise RoPE as a mathematical remedy. RoPE, short for rotary positional embedding, is a method for encoding token positions. In this framework, the position ID counter resets to zero at the start of every single document. This isolated counting method decouples a document's positional math from the total volume of the surrounding memory bank. The model then applies a specific global RoPE offset to the active query and any new tokens it generates. This offset maintains the causal dependency required for the model to generate coherent sentences while integrating facts from multiple isolated documents. Resetting these internal counters allows MSA to be trained on just 64,000 tokens while flawlessly extrapolating to read 100 million.
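The position-ID bookkeeping is easy to illustrate: global encoding counts 0..N across everything, while document-wise encoding restarts at every document boundary and gives the active query a fixed offset. A toy version (the offset scheme here is our guess at the idea, not the paper's exact formula):

```python
def global_ids(doc_lens: list[int]) -> list[int]:
    # Standard scheme: one ever-growing counter across all documents.
    return list(range(sum(doc_lens)))

def document_wise_ids(doc_lens: list[int], query_len: int,
                      query_offset: int = 64_000) -> list[int]:
    """Reset the counter at each document start; place the active query at
    a fixed global offset so causal order is preserved (toy version)."""
    ids: list[int] = []
    for n in doc_lens:
        ids.extend(range(n))                      # restart at 0 per document
    ids.extend(range(query_offset, query_offset + query_len))
    return ids

docs = [50_000, 60_000, 30_000]                   # documents in the memory bank
print(max(global_ids(docs)))                       # → 139999
print(max(document_wise_ids(docs, query_len=32)))  # → 64031
```

With the global counter, the largest position ID grows with the whole memory bank and eventually exceeds anything seen in training; with per-document resets, no position ID ever exceeds the longest single document plus the query offset, so a model trained at 64K lengths never sees an out-of-range value.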

Deployment of these models is limited by the physical memory bandwidth of modern GPUs. Storing the compressed historical state for 100 million tokens requires approximately 169 GB of memory. This schematic shows the constraint: a standard node with two A800 GPUs provides 160 GB of VRAM. The hardware rejects a 169 GB cache, resulting in an immediate memory overflow before the model weights even load.

MSA uses a tiered storage strategy called memory parallel to bypass this physical limit. The architecture separates lightweight routing keys from the heavy content data. This allows the model to perform its initial search using only a fraction of the total data. The lightweight router keys are loaded directly into the GPU's VRAM, enabling instantaneous distributed scoring across millions of documents. The massive bulk of the context, the content keys and values, is offloaded into the host CPU's system DRAM, which has far higher capacity than the GPU. Once the GPU identifies the relevant documents, the system fetches only those specific matrices from the CPU to the GPU for the final attention calculation. The model pre-calculates and caches these compressed representations offline, avoiding the need for massive recalculations for every new user query. Divorcing context capacity from GPU memory limits turns lifetime-scale memory from a theoretical concept into a deployable reality.
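The tiered layout can be mimicked in plain Python: keep only the small router-key table in "fast" memory, leave the bulky K/V matrices in a "slow" store, and fetch just the selected documents. The sizes and the fast/slow split below are illustrative stand-ins for VRAM versus host DRAM:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_docs, doc_tokens = 32, 1_000, 256

# "Slow" tier (stands in for host DRAM): full K and V per document.
slow_store = {i: rng.normal(size=(doc_tokens, 2, d)) for i in range(n_docs)}

# "Fast" tier (stands in for VRAM): one compact router key per document,
# precomputed offline so queries never touch the heavy data to score.
fast_keys = np.stack([kv.mean(axis=(0, 1)) for kv in slow_store.values()])

def attend(query: np.ndarray, k: int = 16) -> list[np.ndarray]:
    # Score every document cheaply in the fast tier...
    scores = fast_keys @ query
    top = np.argsort(scores)[::-1][:k]
    # ...then fetch only the winners' heavy K/V from the slow tier.
    return [slow_store[int(i)] for i in top]

fetched = attend(rng.normal(size=d))
heavy_mb = sum(kv.nbytes for kv in slow_store.values()) / 1e6
light_mb = fast_keys.nbytes / 1e6
print(len(fetched), f"{light_mb:.2f} MB fast vs {heavy_mb:.0f} MB slow")
```

Even in this toy, the fast tier is hundreds of times smaller than the slow one, which is the same ratio trick that lets a 169 GB cache live mostly in DRAM while scoring stays on the GPU.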

>> Okay, towards the end it started to have a few issues. Wow, I am really impressed by this Notebook LM upgrade. It's definitely producing more vibrant graphics. I think it's using code to generate some of these, at least, and then also Veo 3, and obviously Nano Banana for a lot of the imagery. Wow, I really wonder what the pipeline is under the hood. There were some skips, some weird cuts that just didn't make sense. In terms of the Notebook LM gen, though, I think we got most of the understanding of the paper through. This is the kind of architectural development that we want to see in the AI space. This is a true solution to the limited memory issues that we have with today's LLMs. I'm interested to see this sort of thing in the wild. Looks like this is going to be released open source as well: a scalable, end-to-end trainable latent memory framework. No code, no model weights yet, but apparently they are coming soon. The graph really does not lie. This is looking like possibly a real solution to memory. No more compacting your conversation. Awesome. I'm glad to see this is going to be released open source and isn't just a paper.

I'd like to thank you all so much for watching today's video. Wow, so many projects that just give you hands directly on the wheel. Let's see what we can do with AI. Let's see which barriers can be broken. Every week I say to myself, what am I going to build this week? And I never know what it's going to be, because all these projects come out and it's like, oh, here it is: Doom running in a world model locally. I want to start doing some live streams where I show off a lot of these little AI projects I mess around with and demonstrate them. I end up doing a lot of cool things behind the scenes, but they never make it into a full video. So, I'd like to do a live stream just going through all of those, like ROM hacking with Claude and the Unity MCP. It could be fun. That Godot 4 game engine skill is really intriguing to me; there might be a video there. Have a great one, everyone. I'll see you in the next video. Goodbye.
