
Moonlake: Interactive, Multimodal World Models — with Chris Manning and Fan-yun Sun

By Latent Space

Summary

Topics Covered

  • Video generators are not world models
  • Structure not scale: five orders of magnitude more efficient
  • Language as humanity's secret weapon over chimpanzees
  • Can you interact with the world and see consequences?
  • Benchmarking world models is fundamentally unsolved

Full Transcript

I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models, I think it's for everything, including text-based models, right? Because, you know, in the early days it seemed very easy to have good benchmarks, because we could do things like question-answering benchmarks. But these days, so much of what people are wanting to do is nothing like that, right? If you're wanting to get some recommendations about which backpack would be best for your trip in Europe next month, it's not so easy to come up with a benchmark. And it's the same problem with these world models.

Before we get into today's episode, I just have a small message for listeners.

Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content.

We've been approached by sponsors on an almost daily basis. But fortunately,

enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way.

But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you. And it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise you we'll never stop working to make the show even better. Now, let's get into it.

Okay, we're back in the studio with Moonlake's, uh, two leads. I guess there are other founders as well, but, uh, Fan-yun Sun and Chris Manning, welcome to the studio.

Thanks. Thanks for having us.

You guys have, you know, burst onto the scene with a really refreshing new take on world models. Um, I would just want to, I guess, ask how the two of you came together. Chris, you're a legend in NLP and just AI in general. Uh, you're his grad student, I guess.

Actually my co-founder.

Oh yeah.

I should give a lot of credit to my co-founder Sharon. Um, she was actually working with Professor Fein, and then she ended up working with, um, Ron and Chris Manning here, and so I got connected to Chris initially through my co-founder.

What is Moonlake? What is... um, actually, I'm also very curious about the name, but why go into world models?

So I was working a lot with Nvidia Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents, or embodied AI agents.

And then there were two observations, one in academia and one in industry. In industry, folks like Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluation or for training the robots, um, or policies or models. And in academia the same thing is happening. More specifically, when I was working with Nvidia on the synthetic-data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. But then, like I said, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data.

It was very clear to us that, okay, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means they need interactive data, and the demand for those types of data is growing exponentially. But everybody is sort of thinking about it from a pure, say, video generation perspective or something else, whereas we feel like the true opportunity is actually building reasoning models that can do these things the way humans do these things today. So that's a little bit on the genesis of Moonlake.

And I think the reason I got into world models was partly a philosophical take on the world, where I, like, you know, believe the simulation theory and stuff like that. But on the other hand, it's really just like, oh, there's an opportunity there that I feel like nobody's pursuing the way I think it should be done.

I can say a little bit about that. Yeah.

So the overall goal is the pursuit of artificial intelligence, and, you know, most of my career has been doing that in the language space, and that's been just extremely productive, as we all know from the story of the last few years. I don't have to tell you how much we've achieved with large language models. But although they have been extremely effective for language and general intelligence, it's clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with, more than just language. And then the question is how to do it. Um, and that's despite, you know, a huge investment in the computer vision space; as a research field, computer vision has for decades been far, far larger than the language space, actually.

I mean, I think it's fair to say that, you know, vision understanding sort of stalled out, right? You got to object recognition and then progress just wasn't being made, right? If you look at any of these vision-language models, it's the language that's doing 90% of the work, and the vision barely works.

And so there's really an interesting research question as to why that is. And at heart, um, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection with a more symbolic layer of abstracted understanding of visual domains, which isn't in the mainstream vision models, which are still trying to operate at the surface level of pixels.

I think in one of your blog posts you put it as structure, not scale. Is that, uh, a general thesis?

Yeah. Well, scale is good too. Lots of data is good as well. But nevertheless, you want the structure, to be able to learn much more efficiently.

Yeah. The other thing I really liked also was you put out an example of what your kind of reasoning traces look like, right? Which... "distill" is the word that comes to mind. I don't even think that's a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, um, and what have you. But that is the kind of example that involves, let's call it, spatial reasoning, world-model reasoning, as compared to normal LLM reasoning. But also, taking a step back, how do you guys define world models? You know, a lot of people see, like, okay, you can do diffusion, you can do video generation. You guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models. Um, you have a pretty structural definition there, but for the general audience that doesn't super follow the space, right? What's the difference between what we see from, like, a video generation model versus a world gen, a simulator? How do you kind of paint that?

Yeah. So I think this is actually a little bit subtle, because, you know, people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing, we've sort of solved understanding the world, because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, an understanding of how objects can move, of what the consequences of different actions are, and that's what's really needed for spatial intelligence. So a term we sometimes use is that you need action-conditioned world models: you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it, and in particular that becomes hard over longer time scales. So if you're simply, you know, trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the consequences, the likely consequences, of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.

Yeah. The question comes where you want to have more structure than is available in just predicting the next token. Um, and typically, well, let's call it the experience of the last five years has been that that is just washed away by scale, right? So what is the right middle ground here, where you don't ignore the bitter lesson but you also can be more efficient than what we're doing today? You know, one possibility is: look, if we just collect masses and masses and masses of video data, this problem will be solved. Um, under certain assumptions that could be true, but there are sort of multiple avenues in which it could not be true.

The first is that what's really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you're simply collecting observational video data, which is the easy stuff to collect when you're sort of mining online videos, you don't actually know the actions that are being taken to see how the video is changing. And so if you're never collecting actions directly, and you're having to try and infer them from what happened in the observed video, that's not impossible, but it's very hard, and it's not really established that you can get that to work at any scale yet. And so there's a lot of premium on collecting action-conditioned video data, which is part of why there's been a lot of interest in using simulation, so that you can be collecting data where you do know the actions, which is in quite limited supply.

But there's also, in the limit of as much data as you could possibly have... you know, maybe the problem is eventually solvable, but even though we collect huge amounts of text data, text data is always at a great level of abstraction, right? Language is a human-designed abstracted representation where there's meaning in each token, and it's representing an abstraction of the world. Right? As soon as you're describing someone as a professor, and as soon as you're saying that they're condescending, right, you know, these are very abstracted descriptions of the world; it's not at the pixel level of what you're observing. And so to get to that kind of degree of abstraction starting from pixels is orders of magnitude of extra data and processing. And so although, you know, we absolutely want to get as much data as possible, use the bitter lesson, nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress a lot more quickly. And that's the bet here.

And so you could just say that's only wanting to be able to, you know, do it more efficiently, do it more quickly, do it more cheaply. But I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, you know, yes, we have these high-resolution eyes and we can look and see a scene like a video, but all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed, right? You're doing fairly fine processing of exactly what you're focusing on, but as soon as it's away from that focus, "yeah, there's another guy over there", you're sort of only processing, top-down, this very abstracted semantic description of the world around you. And so, you know, that's what human beings are doing. They're working with semantic abstractions. And so I think it is just the right representation, because we also have other goals. We want to be able to do, you know, real-time worlds, and that means there's a limit to how much processing you can do, and we want to do long-term planning and consistency, and again that favors abstraction. I mean, I guess there was actually a recent blog post that came out from our friends at Physical Intelligence, and, you know, they were sort of heading in the same direction. And they were saying, oh, a model...

model." Yeah. To maintain a long-term memory of

Yeah. To maintain a long-term memory of what's happening in the world so we can do longer term. We're actually storing text of what is um, you know, been

happening in the world, right? It's not

such a successful strategy of trying to keep it all at a pixel level. And yeah,

I mean, you can see it in video models like that temporal consistency. We're at

a scale of train on, you know, all the video data we have. We have it for maybe 30 seconds, a few minutes. That's not

the same as a game state played for half an hour, right? Um, I thought you guys break it down pretty well. You have a you have a blog post about uh building multimodal worlds with an agent. I don't

know if you guys want to talk about this. This is one of the things I read.

this. This is one of the things I read.

I thought Yeah, the thing I talked about with the reasoning chain. Yeah.

reasoning chain. Yeah.

So, there are different phases to this. It seems like it's more of an agent, a scaffold. Uh, a very different approach than just, you know, typing in a prompt, where you don't have the same consistency. Also, for people that are listening, you know, I would highly recommend reading it. It breaks down the problem in a different light, right? So, like, what do you need to consider when you're talking about video, world, game models, right? What do you need to consider? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one.

Yeah. Um, actually I wanted to add a little bit on our previous point, which is... things change quickly. I do feel like sometimes people confuse things: like, oh, we're taking a method with abstraction, so that means we don't believe in the bitter lesson. That's just false, right? We do believe in the bitter lesson, but then I feel like the question we always discuss is: what is the right abstraction level today? The analogy I like to make is, let's just say we can encode and decode, represent, all images, videos, and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model, as opposed to a next-token prediction model, where it's just like, okay, it's natively multimodal. But it's like, well, yeah, to Chris's point, it's the scale and compute you need to achieve that. Um, so that's why we always come back to: okay, what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we're actually just reasoning about the world, and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.

Um, yeah, it's like you're improving the encoder of whatever you're trying to model, and a better representation would just represent the important things in less space.

Yeah.

And which would just be more efficient.

Yeah.

Um, so yeah, I fully agree that it is not antagonistic to the bitter lesson.
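The scale gap between raw pixels and an abstracted representation can be made concrete with back-of-the-envelope arithmetic. All the numbers below are illustrative assumptions (the resolution, frame rate, and token counts are not from Moonlake), but they show why "five orders of magnitude" is a plausible figure:

```python
# Raw pixels: one minute of uncompressed 720p RGB video at 30 fps.
width, height, channels, fps, seconds = 1280, 720, 3, 30, 60
raw_bytes = width * height * channels * fps * seconds  # ~5 GB per minute

# Abstracted description: assume a dense symbolic/textual trace of the
# same minute runs ~2,000 tokens at ~4 bytes per token.
abstract_bytes = 2_000 * 4

ratio = raw_bytes / abstract_bytes
# Count the decimal orders of magnitude separating the two.
orders_of_magnitude = len(str(int(ratio))) - 1
```

Compression narrows the raw-pixel side considerably in practice, but even a 100x video codec leaves a gap of several orders of magnitude between pixels and a semantic trace.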

I do want to mention one more thing. Um, are there any philosophical differences with the JEPA stuff that, uh, Yann LeCun is working on? I've got to go there. You're mentioning, like, some latent abstraction. I'm like, okay, fine, let's talk about it, right? It's the elephant in the room.

Yeah, there are philosophical differences. Um, Yann LeCun is a dear friend of mine. Um, but he has never appreciated the power of language in particular, or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually, and there are no words, symbols, or math in his head. Maybe that's true of Yann. It's certainly not the way I think. Um, but at any rate, you know, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low-bit-rate communication mechanism between humans; it doesn't have much other utility, and it's far inferior to the high-bit-rate video that comes in through your eyes.

And I think he's fundamentally missing a number of important things there, right? Think of this evolutionary argument, looking at animals, right? The closest analogies are the things with chimps, right? So chimpanzees, you know, have fairly similar brains to human beings. They have great vision systems. They have great memory systems. They've got, you know, better short-term memories than we do. They can plan. They can build primitive tools. But, you know, humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level which just gave this sort of vaulting of what could be done with the intelligence in brains. So the

philosopher Dan Dennett refers to language as a cognitive tool, and argues that, you know, humans, unique among the creatures in the world, have managed to build their own cognitive tools. Language is the famous first example, but other things like mathematics and programming languages are also cognitive tools. They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more, and we use that for spatial representation and intelligence and planning and gameplay as well. So we believe, and this is, you know, underlying the specific technologies that Moonlake is building, that symbolic representations are powerful, and you want to use them in your understanding of the visual world, when you want a causal understanding, when you want to maintain long-term consistency and prediction. And, you know, as I understand it, that's just not in Yann LeCun's worldview, so I think that's the fundamental philosophical difference. Um, then there's the specific model he's been advancing, JEPA. I mean, that's a reasonable research bet as a direction to head in for building out a model of the visual world. To my mind, it's sort of one reasonable research bet; it's not really established that it's the best one that everyone should be following, at least as developed at scale at Meta.

But it's not just vision, right? I mean, JEPA is, you know, just joint-embedding prediction; it can be applied to anything, really, and people have done it. If the argument is that there is a latent representation that is probably more suited to the task, then why not let machines learn it for us instead of predefining it at all? And isn't something like a JEPA-shaped thing the right answer, and if not, why not?

So I think there's a part of JEPA that's right, which is that you do want to have a joint embedding that gives you a consistent model of the world, and Yann's argument is that you can never get that from autoregressive language models because they're sort of left-to-right, turning out one token at a time. I guess this is where we're in, um, you know, the researcher arguments of the field. You know, I'm not actually convinced that's right, because although the token production is this autoregressive process that's heading, you know, left to right (I guess it doesn't have to be left to right, we could have right-to-left Arabic, but anyway, in a sequence of tokens), although that's true, all of the weights of the model that are internal to the transformer, they are a joint model of the model's understanding of the world. And so I think you can think of the weights of the model as a form of joint representation, and therefore it is plausible to think that that could be the basis of a world model which avoids Yann's objections.

I think I follow, and obviously that touches on what Moonlake eventually ends up doing as well, right? Which is hard to tell, because you put out the end results, but we don't know the inputs that go into it. So that's something we have to figure out over time.

Yeah.

I mean I guess this kind of breaks down some of the outputs. Do you want to walk us through it?

Yeah. Uh, so this really just walks us through the reasoning traces. Like, okay, let's just say we want to build a world; in this context it's really just a game demo that shows the variety of interactions this world model can build. And yeah, it's really just the reasoning traces of, okay, you're prompted to create a bowling game: how did it achieve what you saw, that level of causality, interaction, and consistency? Right, um, so yeah, this is almost just an example of reasoning traces, very detailed.

Very, very detailed. But you've got to... like, you don't even realize it, right? When a video is generated, what happens when a ball strikes a pin? Right, so first there's audio in that: audio triggers happen, the score increments, uh, the world changes, pins have to start dropping, there's a timer that goes on. Um, you know, it's very similar to how we're now used to reasoning for language models: there's a whole state of what happens, so geometry, physics, all this stuff, and then there's kind of that single prompt. So asset, um, physicalization, all this stuff. It's a nice view to see what's going on.
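A toy version of the game state that a bowling reasoning trace has to enumerate might look like the following. The class and event names are illustrative assumptions, not Moonlake's actual trace format; the point is that one interaction (ball strikes pins) fans out into several coupled consequences that a pure video model has no explicit representation of:

```python
class BowlingWorld:
    def __init__(self):
        self.pins_standing = 10
        self.score = 0
        self.events = []  # audio/visual triggers a renderer would consume

    def ball_strikes(self, pins_hit: int):
        """One interaction, several consequences: pins drop, the score
        increments, an audio cue fires, and a cleared lane resets."""
        pins_hit = min(pins_hit, self.pins_standing)
        self.pins_standing -= pins_hit          # world changes: pins drop
        self.score += pins_hit                  # score tracks pins down
        self.events.append("pin_crash_audio")   # audio trigger
        if self.pins_standing == 0:
            self.reset_frame()

    def reset_frame(self):
        # A reset means a new frame: all pins stand back up.
        self.pins_standing = 10
        self.events.append("pins_reset")

w = BowlingWorld()
w.ball_strikes(7)
w.ball_strikes(3)  # clears the remaining pins, triggering a reset
```

Every line of this hand-written logic corresponds to something the reasoning trace has to derive from the prompt "create a bowling game".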

I think Sun is also too polite to point out that both Google's Genie demos as well as World Labs's Marble do not have interactive worlds.

That's the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context I want to learn how to bowl. And then you can say, okay, what is important when it comes to learning how to bowl? Okay, maybe it's that I need to understand the basics of physics, and how to throw the ball at the pins. I want to know that when it resets, it's a new game. So, basically, you know that when you pick up the ball, that ball is going to cause the pins to fall down. You know that what's important to this particular bowling game is the score, and you know that the score corresponds to the number of pins that fell down. Um, so if it's a model that knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it doesn't actually allow you to learn what you set out to learn within the world model, right? And I think this is really just one example showing the advantages of the approach we're taking over most of, let's call it, the zeitgeist today when people talk about

quote-unquote world models, right? So it sort of seems like the question to ask when there's a world model is: can I not only just wander around the world and look at the beautiful graphics, can I interact with the objects in the world and see the right consequences of actions?

And can you also understand what the consequences would be if you did something, right? So it's not just, okay, there's one thing, and if I pick it up something will happen; there are 50 options, and I can infer what would happen if I do any of them, right? So it's very different when you can actually see it, play around with it.

There are two cheeky elements of that. I mean, the sort of, I guess, less ambitious one is: let's really establish for listeners why this is fundamentally different from writing Unity code, right? Like, just creating a model to translate a prompt into Unity code.

So there is an underlying physics engine; um, in that sense there are some overlapping things with Unity. But the way we think about it is that physics engines, or tools, or code are cognitive tools, borrowing Chris's term, right? Tools that the model can deploy as means to an end. So today maybe you say, okay, in this particular context we care about physics, we care about long-term causality and consequences; then yes, we deploy a physics engine. And then maybe tomorrow we say, okay, we're training, let's say, drones, where we only care about fluid dynamics and the visual aspect of the world; then maybe the model doesn't actually have to use a physics engine, or maybe it employs other types of representations or physics engines to achieve the task. So yes, writing code for Unity is sort of similar to a tool that a model can employ. But our goal is for the model to take a representation-conditioned reasoning approach, or process. Yeah.

Internally.

Yeah. Using these things is just like general tool calls, right? Which I think is very interesting. The other, more ambitious one is some kind of recursive element where it becomes multiplayer, right? Here there's a single-player element; you're not modeling any other people involved, and that is a whole other thing.

But in fact, we can already do multiplayer.

Oh yeah? Okay, I haven't seen any.

You just prompt our model to say, hey, configure it to multiplayer, and then it'll do it; you'll be able to configure multiplayer.

Great persistency database for you.

Easy.

Yeah. So what are some of the current limitations, where we're at? So there's one approach of, okay, scale up video predictors, and obviously there are data issues with approaches like that. Is it data constraints? What are the next steps? Is it real time? So there's one side of, you know, writing an agent to write Unity code, but, okay, I want to be streaming a game in real time, I want to have characters that are also agentic. Where do we kind of see this scaling up?

Right, yeah, there's definitely a data constraint: the more data, the better this reasoning model can basically act as humans do, operating a variety of tools and software to build whatever is necessary. And then there's a sort of fidelity constraint, which we're actually solving with another model, Rey, which we can talk about later. Um, but it's like, well, it's not as easy to get to photorealism with the approach we're taking. Um, but we think there are better solutions to that, which we can dive into later.

The one thing you note here is that it's a diffusion model, right? So there are a few approaches: diffusion, Gaussian splatting... um, yeah, so Rey, the diffusion model, do you guys want to introduce it?

Yeah, totally. So within our world modeling framework, we think there are two models that we train, right? There's the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency, and the logic and determinism of the world. And then Rey is our bet on saying, okay, while that model can take care of all these things we just talked about, its limitation compared to existing, say, video models is that it doesn't have as high a pixel fidelity right out of the gate, right? And Rey is to say, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles, or arbitrary styles you want. So this model is almost to say: hey, I'm going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.

And yeah, for example, right? You kept the KL divergence, where...

No, no. I mean, this is a classic, like, um, "how you don't stray too far from the source material" thing; you kept the KL, which is kind of cool.

Yeah, yeah. I mean, and the difference is, and Sun was pointing at this, it's in one way a more difficult path but a better path. Typically, the diffusion models are producing the whole scene, and it looks lovely, but there isn't spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, the understanding of the consequences of worlds. Whereas this is taking a path where it assumes an abstracted semantic model of the world, the world state, and then the diffusion model is being used on top of that to produce the high-quality graphics.
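The two-model split described here can be sketched as a pipeline. This is an illustrative assumption about the shape of the design, not the actual Moonlake architecture: a persistent symbolic state handles causality and consistency, and a swappable renderer (standing in for the Rey diffusion model) only maps that state to pixels in some target style.

```python
def step_world(state: dict, action: str) -> dict:
    """Reasoning-model side: deterministic, persistent state transitions.
    (A toy stand-in; the real model would be learned.)"""
    new = dict(state)
    if action == "pick_up_ball":
        new["holding_ball"] = True
    elif action == "throw" and state.get("holding_ball"):
        new["holding_ball"] = False
        new["pins_down"] = state.get("pins_down", 0) + 6
    return new

def render(state: dict, style: str) -> str:
    """Renderer side: a stand-in for the diffusion model. Its only job
    is the pixel distribution (style); the depicted state never changes."""
    return f"[{style}] holding_ball={state.get('holding_ball', False)}"

s = {"holding_ball": False, "pins_down": 0}
s = step_world(s, "pick_up_ball")
s = step_world(s, "throw")

# The same persistent world state can be restyled arbitrarily:
photoreal = render(s, "photorealistic")
cartoon = render(s, "cartoon")
```

The design choice this illustrates is the separation of concerns: restyling cannot break persistency or interactivity, because the renderer never writes to the world state.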

Is there an intended practical or business use for this, or is it more a demonstration of capabilities?

We actually believe this is going to be the next paradigm of rendering. So it's going to replace how rasterizers work. It's going to replace DLSS today, because it not only has this pixel prior that's learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people's desire when they play GTA, right? Like all the mods, all the people adding perfect lighting and all this.

So skins for worlds let's call it skins that's called you can call it skins you can call it customization you can play it how you want right? Yeah,

exactly. And another thing we pointed out specifically in this blog is the programmability of it. What this means is that, historically, the renderer is always a derivative of the game state: you say, okay, here's the game state, I'm rendering out a frame. But here I'm saying this renderer can actually be part of the gameplay loop. I can say something along the lines of: upon getting 10 apples, my weapon of choice, my bullets, are going to turn into apples. And that's possible because we can dynamically have certain game states trigger preconditions to the renderer, such that the rendering is now part of the game loop too. One thing is just the appearance, but the second thing is that there are novel interactions that are possible because this renderer now actually has a prior of the world.
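The renderer-in-the-gameplay-loop idea can be sketched in a few lines. This is an illustrative toy, not Moonlake's actual API; every name here is made up for the example.

```python
# Illustrative sketch (not Moonlake's real API): a "programmable renderer"
# where render parameters become a function of game state, so state changes
# can trigger visual rules and the renderer joins the gameplay loop.
from dataclasses import dataclass

@dataclass
class GameState:
    apples: int = 0

# Each rule maps a predicate over the game state to a render-style override.
RULES = [
    (lambda s: s.apples >= 10, {"projectile_skin": "apple"}),
]

def render_params(state: GameState) -> dict:
    """Classically params are a fixed derivative of state; here rules rewrite them."""
    params = {"style": "photorealistic", "projectile_skin": "bullet"}
    for predicate, override in RULES:
        if predicate(state):
            params.update(override)  # the renderer reacts to gameplay state
    return params

print(render_params(GameState(apples=3)))   # bullets stay bullets
print(render_params(GameState(apples=10)))  # bullets now render as apples
```

The design point is only that the render style is computed from game state by arbitrary rules, rather than being a fixed function of it.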

It is up to the artists to figure out what to do with it.

It is up to the creators, yes. And I also think that's another big argument we're making, and the reason we're taking the bet we're taking: a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions. For example, in the context of gaming it's obviously my creative intent, but in the context of embodied AI it's, oh, I take this foundation policy and I want to fine-tune it to deploy in my house. So you want a layer where a human can say, "Here's the distribution of things I want to create to achieve my goal."

And I think 3D graphics as it is today is basically the layer for people to say, "Hey, what do I care about in this world?" It allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just generating something arbitrary from prompts.

You know, it's one of those things where I think you're going to build up a series of models, right? This is probably the highest-utility or highest-frequency one, where you can immediately drop this in on any game and you don't need anything else you guys do, but I could see that. I think the human intent part is something people are not even used to, because we're so used to static worlds, worlds that just don't react. You're kind of blowing my mind right now. I wonder if you've talked to people at GDC, and what are they going to do with it?

Yeah. The stance we take on this front is that we're not going to be more creative than our users.

Um but we want to make sure that we're building things in a way that really allows them to express their intent.

The thing you said about "here's the distribution that I want": I think text may be too low-bandwidth to really express that, because I'm probably just going to want to drop in a bunch of reference assets and have you figure it out from there.

You probably want to do a mixture of both, right? You throw in a few images: I want it this style, I want it to look like this. It's a mixture, right?

I think it's a mixture. There's clearly a visual component to this, and not everything can be text, because of course you want to give a visual look. But there's also a massive amount of the overall picture of the look of the world and the behavior of things that you can express in a few words of text, which would be very time-consuming and difficult to do via visual means. So I think, yeah, you want a combination of both.

So one question I have is: how do we go about evaluating world models? There are many axes, right? One is, okay, I have preferences, how well do we adhere to prompts. One is the simulation: is there core logic that's broken? We know how to evaluate diffusion, there's fidelity and so on, but what are some of the challenges that most people probably aren't thinking about?

Yeah, I think this is a great question and probably one of the hardest questions in world models, because it always comes back to what you're building this world model for, and depending on your end goal and purpose, the evaluation should differ. In the context of games, the most direct way of measuring is how much time people are actually spending in this world you create. And if your goal is, for example, in the context we just talked about, deploying an embodied AI agent, then your end metric is: after training in these worlds you generate, how robust is the policy when you actually deploy it to the target environment? But it's hard to measure these end metrics. So today people have what I call proxy metrics that try to approximate what we really care about, the end metrics, but frankly it's different for every use case. Um, yeah.

Which seems like quite a challenge, right? In language models or video models, image models, your benchmarks are proxies. People aren't actually asking instruction-following or tool-use questions; they're proxies of how well it will do downstream. But for this, should companies have their own individual benchmarks outside of games? If you think of video production, movies, things like that that also want to use world models, should they internalize their own proxies? Is this something you guys do? Where does that come in?

I think this whole space is extremely difficult as things are emerging now. And it's not only for world models; I think it's for everything, including text-based models. Because in the early days it seemed very easy to have good benchmarks, since we could do things like question answering benchmarks: can you answer the question based on these documents, and various other kinds of logical reasoning or math. And there are visual equivalents, things like object recognition. But these are small component tasks, and these days so much of what people want to do, also with language models, is nothing like that. You're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for you for your trip in Europe next month, and it's not the same kind of thing. It's not so easy to come up with a benchmark for whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right?

And it's the same problem with these world models. If we take the game design case, well, success is that a game designer can produce what they are imagining in a reasonable amount of time. That's really the macro task, but that's a very hard thing to turn into a benchmark, and I think a lot of this is actually going to turn into people voting with their feet. I guess that's what's happening at the large language model level, when people are choosing to use GPT-5 or Gemini or Claude: individuals are trying out these different models and deciding, oh, I like the kind of answers GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right?

It's a lot of vibe checking.

I realize that, but it's actually whether people feel it's giving them utility in what they want, right?

And the interesting thing there is that a lot of people prefer the visual, right? "This looks pretty," which is not the objective of what this is for. If a game designer is working on something, they care about the game engine, the state; it can look like whatever, you can fix that up later. Or you can have a really good game state and quickly edit it into 20 different versions that keep state. So that's a really important distinction, and it speaks to Moonlake's strength, right?

Yeah. I mean, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay, and a lot of the time that doesn't actually even require great visuals. There are lots of very successful games with relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals and the game sucks, right? So keeping those two axes apart is really important in thinking about what matters in a world model for different uses.

This conversation is reminding me of some game review and fiction discussions I've had in my non-AI-related life. Some people might know Brandon Sanderson, a very famous fiction author who's also a big game reviewer, and he's a big fan of video games where you change one thing about what you might normally assume about the world. For example, Baba Is You, I don't know if you've come across that, where the rules change as you play the game, or games where you can do things like reverse time selectively or change gravity selectively. And this also reminds me of the kinds of world models created by authors. Ted Chiang is my typical example: he'll take the world you know today but change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask: is it easy to create alternative rules that don't exist, where you change one thing and then run a whole bunch of people through it to see if it works?

My first answer will be that that seems a lot easier and more conceivable to do using technology like Moonlake's than with some of the other world models out there, where Sun can actually make it happen. I'll let him give the second answer.

I guess for you, you're constrained by the game engine tooling, right? At the end of the day, that's the thought partner you have. If I ask for something where it's never allowed to reverse time, or gravity only ever works one way, then, well, that's it. But sometimes gravity might change.

But it's a lot easier to change with code, as opposed to a model learned primarily on data of the real world and of virtual worlds. Take Genie, for example: it's trained on a lot of real-world data and a lot of virtual gaming data. Maybe it's easy to say, okay, I want to change the visuals or the time period of the world, but you can't change gravity, for example, or how light bounces. Everything comes down to code being a better way to execute it.

But the models aren't that diverse and creative, right? You can say, "Okay, make gravity slower," and it can do that, but it's limited to how you represent it in text. They're only going to do a few iterations, whereas programmatically, if there's a game engine under the hood, you can kind of go wild. So one of the limitations of most models is that they're very overtrained to one style, and extracting diversity is pretty difficult; at least that's something we've seen. Are there examples you have in mind where existing models would make something easier that isn't done with code? Certain types of creative intent, or state transitions? Clipping: other world models are very good at clipping through things, my legs clipping through a rock, because it's just bad. You would have to struggle very hard with your stuff to actually make that happen. Which I think is maybe a topic you actually prepared on: Gaussian splatting versus the other stuff.

Yeah. For those not super familiar: there's Gaussian splatting, there's diffusion; what works, what scales up? I feel like in February 2024, when Sora 1 came out, the blog post was literally titled, and we bring it up every day, "Video generation models as world simulators." It's super bitter-lesson-pilled. A lot of it is emergence, right? Not to go through their whole blog post, but basically their whole thing was that as you scale up, all this consistency stuff just kind of solves itself. It's a very simple premise: they just scaled up diffusion. And from there, that was Feb 2024, it's already been two years, which is basically five years in AI time. How much more AI time do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot; this is back to the beginning discussion of what's appropriate for the time, and that seems like your approach, right?

Yeah. The point I'm trying to make is that there are many different types of world simulators. Having a world simulator that can produce pixel coherency is very useful for games and marketing and all these things, but it's not as useful as people think when it comes to causal reasoning, when it comes to embodied AI. And yeah, that title is true; we're not saying it's not a great world simulator. But in the blog we wrote, the bet is more that there's going to be a disproportionately large share of value in real-world tasks and virtual tasks where high-resolution pixel fidelity is not needed. And yes, video models have their value.

Yeah, this is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three-body problem in a deterministic world, whereas the video models would just approximate it: good enough.

Yeah.

Right. There's some point at which your approach kind of runs into "well, you now have to simulate the world, please, thank you very much." And you're trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things.

Yeah. I think the interesting, more technical question here actually is where you draw the boundary between what's handled with, let's say, the diffusion prior and what's handled with the symbolic prior. And this boundary can actually be fluid. Maybe what you're trying to get at is: people are saying pixel prior for everything, but what we're saying is, there's a boundary we draw where we think it provides the most economic value for the domains and things we care about today. And it's something we actually do internally all the time: given new things we learn about the world, or other knowledge we acquire in the process of developing the models, should we still maintain this line exactly as it is today, or should we move it a little bit left or a little bit right? Sometimes we realize that customers or folks want certain things that are better handled with the pixel prior as opposed to the symbolic prior.

Yeah. Your skin thing is an example of moving it right. Or left; I don't know which way left and right are.

Yeah, the revery model, yes. Actually, we have a few iterations of them. They're actually at slightly different...

I know, you should do that. That's a cool dimension to show.

Yeah. Is quantum mechanics the diffusion prior of our world? It's like that's the boundary of classical versus quantum mechanics, right? At one point God plays dice, and at the other he doesn't. I don't know if you want to weigh in, Chris, but I think generally physics is better with a symbolic prior.

Even quantum physics?

Even quantum physics, yeah. This is starting to get into MLST territory, as I call it, where he likes to get philosophical. We're quite friendly.

I mean, we need to get to the singularity.

I heard some of that.

No, no, I think that is actually really helpful. And, man, I just want you to productize this. As a product guy, as a gamer, as a researcher, it's cool; this is theoretical, and you have a very good way of thinking about these things, but I just want to see you express it.

I do think, fundamentally, when you leave open new tools, like using human intent to shape how you render, well, artists are going to take two to three years to figure out what to do with this, and you just don't know. But I think this gives a much more approachable and controllable world, with the beauty of NLP, that will enable it to be adopted and used, and we're very hopeful about that.

Yeah. Yeah.

Yeah, I mean, we are very focused on commercialization, in the sense that we really believe in the data flywheel approach, where we put this in the hands of the creators and the users, and then they will teach us what capability or model should improve. That's why we actually have products in beta, focusing on gaming.

What's the adjacent thing to gaming?

Embodied AI, basically. So maybe I'll start with where we see the platform in three years: the users would tell us what they want to achieve. The end goal could be, hey, I want to make something to teach my kids the value of humility. Or it could be, hey, I want to fine-tune my drones to be really good at rescue situations. Or it could be vacuum robots: I want to train my manipulation or vacuum robot to be very robust to my office, to navigate very robustly within my office. Whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want.

Yeah.

Right. Maybe for the purpose of games, it's just the end simulation and that's the end product. For certain policies, it's: I can train them within these environments and then help you see where your policy is failing or not.

So in that case, much more of a training tool than an evaluation tool?

Training and evaluation, both, right? Same thing. I think it's just this world model that allows people to train any policy that can act in any multimodal environment.

Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I'll just put it generally, because that's obviously a key problem a lot of people face when training agents in these environments, and, I don't know, can you solve it?

I think not necessarily. To the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I don't know if Sun's got any thoughts, but I don't think that's really being solved.

The other thing that comes to mind is that you could just build a better Sora as a video generation model, right?

Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that's it. It would be better on some domains, right? Like on consistency, or on whether something exists for sure versus doesn't, right? So, is your question more... I'm just riffing on what you can build with the stuff that you have. I do think the mind, or the academic mind, goes immediately to training and evaluation, but art tends to take unusual directions; you might end up somewhere else.

Okay, yeah, but the question is whether you can use this piece of software to develop compelling gameplay, and I don't think you can take Sora and produce compelling gameplay. If you want to have a world that you can wander around in a bit, you're good. But what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things persist, with the long-term history of your gameplay influencing future actions? I think there's just nothing there for that.

Yeah, I do tend to agree; I'm just trying to test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred, and you do end up basically producing a two-hour movie as part of your game.

No, honestly, there are so many applications in adjacent markets that our world model can go into.

Yeah.

But yeah, it's fun to riff on. Although on the execution side we need to stay focused: okay, what are the capabilities we want to unlock over time, and there's a roadmap for that. But if we're just riffing on the possibilities, I feel like it's endless.

Yeah, it's classic. The embedding for "possibility" and "endless" in my mind is very close.

Yeah, I do want to focus on one weird choice. I don't know if it's weird; maybe I've got something here. Audio, right? You could have just said no audio. Audio, in my mind, has a lot of recursion, whereas in video you can just do ray casting, which is computationally much simpler. Audio just seems way harder. I don't know if you want to comment on the spatial 3D audio problem. Did you really have to do it? I guess you do, to be immersive, but a lot of people treat it as, well, you just stick a TTS model on top of it.

Well, there's a lot more to game audio than just speech, right? It's not just TTS.

TTS, SFX, BGM, spatial in my mind, echoes.

Yeah.

And reflections. And I don't even know what else; I don't know what other problems are in this space.

Yeah, I think this point is more pointing to the benefits of using a game engine as a tool available to the model, because part of the spatial audio comes from the code underlying the simulation. And while we do give our model access to other types of audio models as tools, none of them would be spatial, I think. But that's exactly the point: we're giving our model an abstraction, a suite of tools, such that it's able to achieve that. You can argue that spatial audio is an emergence out of the tools and abstraction we provide to the agents. And I think that's the beauty of this approach: a lot of things, kind of like how humanity has built technology, are like Lego blocks that build on top of each other, and it's the same thing here. There are going to be things that just emerge from being able to put these things together in combinatorially interesting ways.

So this integrated audio model exploits the understanding and semantics of the Moonlake world, whereas in general, for the gen AI video models, there's no actual integration across to audio at all, right? Someone might stick some music or a soundscape or whatever else on top of their video so it's not a silent video, but they're in no way connected into a consistent world model, and there's nothing that says, okay, an action is happening in the video, therefore there should be a sound coming from this part of the visual field.

Yeah.

Is that different than Sora 2? Does it not have audio, or at least not spatial audio?

It doesn't.

No, I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and tried to do the lip sync.

I mean, I've seen, okay, generate a dog at the beach reacting to a big wave and moving around.

It's definitely worth trying: have the dog move away from the camera and see if the sound goes down or it doesn't, right? Because they don't have spatial audio.
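The earlier point about spatial audio falling out of the simulation's code can be sketched as pure geometry over the world state. Again, an illustrative toy under simple assumptions (inverse-distance falloff, 2D positions), not any engine's or Moonlake's real audio pipeline.

```python
# Toy sketch of spatial audio derived from symbolic world state: gain and pan
# are pure geometry over emitter/listener positions. Illustrative only.
import math

def spatialize(emitter, listener, ref_dist=1.0):
    """Return (gain, pan) for a mono source; pan runs -1 (left) to +1 (right)."""
    dx = emitter[0] - listener[0]
    dz = emitter[1] - listener[1]
    dist = math.hypot(dx, dz)
    gain = min(1.0, ref_dist / max(dist, ref_dist))  # inverse-distance falloff
    pan = max(-1.0, min(1.0, dx / max(dist, 1e-9)))  # left/right from geometry
    return gain, pan

# A dog moving away from the camera: the gain drops with distance,
# which a video model with no world state has no mechanism to guarantee.
near_gain, _ = spatialize((0.0, 2.0), (0.0, 0.0))
far_gain, _ = spatialize((0.0, 20.0), (0.0, 0.0))
print(near_gain, far_gain)  # 0.5 0.05
```

Because the positions come from the simulation, consistency between what you see and what you hear is enforced by construction rather than learned.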

We do want, basically, the model we're training to work towards the goal of having a combined latent representation across all these different modalities, such that you can reason across them. For example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. That type of capability is what we want our model to be able to reason with, and that's the reason we're taking this multimodal reasoning approach. We want this combined latent space that can...

Yeah. Oh, you said "latent space." We like that here. We have to play the bell every time someone says "latent space."

You've got to train a Daredevil one, where it's only audio but you have to work out where everything is.
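The combined-latent-space idea above can be illustrated with a toy sketch: per-modality encoders projecting into one shared space, where cross-modal reasoning becomes nearest-neighbor retrieval. The random matrices stand in for trained encoders, so the retrieved index is meaningless here; only the mechanism is the point.

```python
# Toy sketch of a combined latent space across modalities: two encoders
# project audio and video features into one shared space, and cross-modal
# reasoning becomes retrieval. Random matrices stand in for trained encoders.
import numpy as np

rng = np.random.default_rng(0)
W_audio = rng.normal(size=(8, 4))    # audio features -> shared 4-d latent
W_video = rng.normal(size=(16, 4))   # video features -> shared 4-d latent

def embed(x, W):
    z = x @ W
    return z / np.linalg.norm(z)     # unit norm so dot product = cosine similarity

audio_clip = rng.normal(size=8)          # e.g. a car skidding away
video_clips = rng.normal(size=(5, 16))   # candidate video trajectories

za = embed(audio_clip, W_audio)
zv = np.stack([embed(v, W_video) for v in video_clips])

# "Which video best matches this sound?" as nearest neighbor in shared space.
best = int(np.argmax(zv @ za))
print(best)
```

With trained encoders (as in CLIP-style contrastive setups), the nearest neighbor would be the video that actually depicts the sound's trajectory.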

Cool. I think that was about it for our Moonlake coverage. I do think we have a couple of Chris Manning questions on IR, and any other attention topics or NLP topics.

Okay, go ahead.

Oh, no. I mean, it's just fun. We talked a bit about how you guys met, but you were basically the godfather of NLP, per se, right? You spent a whole career going from early embeddings to early attention; you did 2015 attention for machine translation, everything. You had information retrieval, so RAG before RAG. We just want to shout that out and admire a lot of that. So what prompted the switch over to world models? How did all that come about?

To some extent, the answer is the enthusiasms and creativity of students, but there's a bit of a history there. Clearly most of my career has been doing stuff with language, and how I got into research was thinking, ah, it's just so amazing how humans can produce speech and understand each other in real time, and somehow they manage to learn languages when they're kids; how could this possibly happen? So starting off, I was very focused on language. But as it got into the 2010s, I'd been working on question answering, and then I started to get interested in visual question answering, and that was an area where it was very noticeable that the visual understanding was bad. These were the days when it seemed like there was almost no visual understanding; you were just getting answers that came from the prior. So if you asked how many people are sitting at the table, it always answered "two," regardless of how many people you could see in the picture. So it seemed like, ah, these models actually aren't able to get semantic information out of images.

So I was interested in that problem and tried to work more on it, and that required knowing more about what's happening in vision and how you can represent visual information. And then there started to be this revolution of generative AI images, and I had students that started looking at that. Before the era of Moonlake, I was also working with Demi Guo, who founded Pika.

And Ian, obviously, with GANs.

Yeah, though Ian was never my student. But I was very aware, for the whole decade there, of Ian with GANs.

Yeah. I mean, Ian was a Stanford undergrad. But Richard of you.com, I believe he was your student.

Yeah, and there were links across at that stage as well. There were several papers in that era; I mean, Andrej Karpathy was a PhD student at the same time as Richard, and so there was some joint language-vision work in that era as well. It seems kind of ancient by modern standards, but we were trying to go from sort of textual dependency graphs to visual scenes.

At a time when GloVe embeddings really took over from a lot of TF-IDF, one-hot encoding, all that. The early vision-language models we saw were LLaVA-style adapters, right? Technically it's still just embedding latent space: let's add images, let's mix modalities. And that's one of the things you put out there too, right?

Yeah.

Yeah.

Yeah. Well, thank you for all of that. Thank you for advancing the world on world modeling. I honestly do think that if people deeply understand everything we just covered, they will see what's coming. And I think you guys have made some really significant contributions here. What are you hiring for? Where do people find you? We agreed that the CTA was a hiring call.

Yeah. I mean, don't we have AGI? You don't need engineers anymore, right?

Yeah. On the model side, we are actually striving towards a self-improving system. But what that means is that we need people to set up the self-improving system. More specifically, people who have the intersection of knowledge across code generation and computer vision and graphics. That's the core research background we look for within our team, and the majority of the team today do have both backgrounds.

both backgrounds. Um when you say computer vision and graphics, are they the same thing or is it computer vision one thing, graphics another thing? How intertwined are they?

another thing? How intertwined are they?

They're intertwined but different. Yeah.

And I think, you know, this relates to some of the themes that we've been talking about: the more explicit underlying world models that are being constructed inside Moonlake really draw on the computer graphics tradition. And so it's then combining that with the visual understanding side of computer vision.

Got it. All right. So if you've written a game engine, you should come talk to us, right?

Oh yeah, definitely. But I do think that the line is increasingly blurred these days, where it's like, if you have a general understanding of vision and graphics.

I think for your standards it is. Uh, for me it feels like vision is, you know, something I'll leave to the big labs. Graphics, I can get that you would want to do from more first principles, but for vision there are so many vision models off the shelf that I can take, though probably not good enough for your purposes.

I see. I see. If you're making that distinction, then maybe we care a little bit more about having graphics knowledge.

Yeah. Exactly. Exactly.

Um, it could be, you know, sometimes a hiring call can be as simple as: if you know the answer to blah, you should talk to me. Like the sort of core known hard problem in your world.

Ah, I see. Yeah. In that case, definitely: if you've written a game engine before, if you've RL'd a variety of coding models on different objectives, many of those. Yeah,

if you've done multimodal latent space alignment. I intentionally included Latent Space again.

Our poor editor has to edit that out every time. Uh, yeah, latent space alignment.

Honestly, is it that hard?

Well, there are some scripts out there that I've saved for the day I someday have to do it, but I haven't had to do it.

It's done, right?

I think, yeah, there are versions of that that are done. But we are aligning audio, text, language, and video, right? And basically, we have these world models that are able to act as agents in these worlds, extract long-horizon videos, and encode that back into the models to self-improve. So it's an insanely exciting but also technically challenging problem. For people who want to do their life's best work, you know, it's the place.
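[Editor's note: for listeners wondering what "multimodal latent space alignment" looks like concretely, below is a minimal NumPy sketch of the CLIP-style symmetric contrastive (InfoNCE) objective commonly used to pull paired embeddings from two modalities into a shared space. This is an illustrative toy under assumed shapes, temperature, and synthetic data, not Moonlake's actual training code.]

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of emb_a and row i of emb_b are assumed to describe the same
    underlying clip (e.g., a video and its transcript)."""
    # L2-normalize so dot products become cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature      # (B, B) similarity matrix
    n = logits.shape[0]
    idx = np.arange(n)                  # matching pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()              # -log p(match)

    # Average both directions: a -> b retrieval and b -> a retrieval.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))
video = text + 0.1 * rng.normal(size=(8, 64))   # nearly aligned pairs
loss_aligned = info_nce(text, video)
loss_random = info_nce(text, rng.normal(size=(8, 64)))  # unrelated pairs
```

Well-aligned pairs drive the loss toward zero, while unrelated embeddings hover near log of the batch size; in practice the projection heads producing `emb_a` and `emb_b` would be trained to minimize this loss.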

How big are you guys? Where are you guys based?

We're currently based in Sato, although we're moving up to SF. Um, we're about 18 folks right now.

My ending question was going to be: what is the name? What's behind the name?

Oh.

Um, very cool graphics and design, by the way.

Actually, at the time when we started the company, we were thinking a lot about how to make a company name that gives people the vibe of, like, OpenAI, but with almost Industrial Light & Magic vibes, because we care about creativity and using that as a funnel to solve AGI. So then we brainstormed a lot around, like, DreamWorks, right, and Industrial Light & Magic, and so there's a space of things that we feel are very semantically close to the company's identity.

Yeah.

And then it ended up being Moonlake, partly because of the DreamWorks vibe, you know, the DreamWorks moon.

Exactly. Yeah.

Um, so that was a little bit of that inspiration, and then the moon was basically about the reflection. The reflection part also implies the self-improvement loop that we really believe in, and that's the path towards multimodal general intelligence.

So that's that. I'll leave it there.

I love a good name. I love a good name.

This is great.

It's a very good name.

It's very good lore. I'm glad I asked the question. I will also say, you know, one of my favorite books, or biographies, ever is Creativity, Inc., Ed Catmull's story about Pixar and how he was rejected as a Disney animation artist, so then he went into computing and brute-forced his way back into Disney.

Yeah. And Walt Disney is also, like, one of my favorite founders. His story, at the time, was like, "Okay, I'm going to create this immersive park." People didn't even have the technology to create it virtually, but, you know what, let's just build it physically so that people can experience it.

So he's the first world modeler.

I tell people that theme parks are world models, too.

Yeah, yeah. I mean, you know, It's a Small World, or the Epcot Center with all the little replicas of the countries. Yeah, those are very interesting. Um, okay. Well, thank you. We've covered a huge amount. Thank you for your time, and thank you for inspiring us.

Thank you for having us.

Fun chatting. Yeah, it's been a good time.
