Moonlake: Interactive, Multimodal World Models — with Chris Manning and Fan-yun Sun
By Latent Space
Summary
Topics Covered
- Video generators are not world models
- Structure not scale: five orders of magnitude more efficient
- Language as humanity's secret weapon over chimpanzees
- Can you interact with the world and see consequences?
- Benchmarking world models is fundamentally unsolved
Full Transcript
I think this whole space is extremely difficult as things are emerging now. And it's not only for world models, it's for everything, including text-based models, right? Because in the early days it seemed very easy to have good benchmarks, because we could do things like question answering benchmarks. But these days, so much of what people want to do is nothing like that. If you're wanting to get some recommendations about which backpack would be best for your trip in Europe next month, it's not so easy to come up with a benchmark. And it's the same problem with these world models.
Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. I just have one favor to ask of you: the single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise we'll never stop working to make the show even better. Now, let's get into it.
Okay, we're back in the studio with Moonlake's two leads. I guess there are other founders as well, but Sun and Chris Manning, welcome to the studio.
Thanks. Thanks for having us.
You guys have burst onto the scene with a really refreshing new take on world models. I just want to ask how the two of you came together. Chris, you're a legend in NLP and AI in general. And you're his grad student, I guess?
Actually my co-founder.
Oh yeah.
I should give a lot of credit to my co-founder Sharon. She was actually working with Professor Fein, and then she ended up working with Ron and Chris Manning here, and so I got connected to Chris initially through my co-founder.
What is Moonlake? Actually, I'm also very curious about the name, but why go into world models?
So I was working a lot with Nvidia Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents, or embodied AI agents. And there were two observations, one in academia and one in industry. In industry, folks like Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluating or training the robots, policies, or models. In academia, the same thing is happening. More specifically, when I was working with Nvidia on the synthetic data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, it's actually as useful as real-world data when it comes to multimodal pre-training. But like I said, there are a lot of dollars being paid out to external vendors to manually curate these types of data. It was very clear to us that on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means they need interactive data, and the demand for that type of data is growing exponentially. But everybody is thinking about it from a pure, say, video generation perspective, whereas we feel the true opportunity is building reasoning models that can do these things the way humans do them today. So that's a little bit on the genesis of Moonlake. And I think the reason I got into world models was partly a philosophical take on the world, where I, you know, believe the simulation theory and stuff like that, but on the other hand, it's really just that there's an opportunity there that I feel like nobody's pursuing the way I think it should be done.
I can say a little bit about that. Yeah.
So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space. That's been extremely productive, as we all know from the story of the last few years; I don't have to tell you how much we've achieved with large language models. But although they have been extremely effective for ramping up language and general intelligence, it's clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space, as a research field computer vision has for decades actually been far, far larger than the language space. I mean, I think it's fair to say that vision understanding sort of stalled out, right? You got to object recognition and then progress just wasn't being made. If you look at any of these vision language models, it's the language that's doing 90% of the work, and the vision barely works. And so there's really an interesting research question as to why that is. At heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection to a more symbolic layer of abstracted understanding of visual domains, which isn't in the mainstream vision models, which are still trying to operate at the surface level of pixels.
In one of your blog posts you put it as structure, not scale. Is that a general thesis?
Yeah. Well, scale is good too. Lots of data is good as well. But nevertheless, you want the structure, to be able to learn much more efficiently.
Yeah. The other thing I really liked was that you put out an example of what your reasoning traces look like. "Distill" is the word that comes to mind, though I don't even think that's a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you. That's the kind of example that involves, let's call it, spatial reasoning, world model reasoning, as compared to normal LLM reasoning. But also, taking a step back, how do you define world models? A lot of people see, okay, you can do diffusion, you can do video generation. You've put out quite a few blog posts, and you put out an essay recently about efficient world models, which we can even pull up. You have a pretty structural definition there, but for the general audience that doesn't closely follow the space, what's the difference between what we see from a video generation model and a world generator, a simulator? How do you paint that?
Yeah. So I think this is actually a little bit subtle, because people look at these amazing generative AI video models, Sora or Veo 3, one of these things, and they think Genie, they think, oh, this is amazing, we've solved understanding the world, because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals aren't actually accompanied by an understanding of the 3D world, of how objects can move, of what the consequences of different actions are, and that's what's really needed for spatial intelligence. A term we sometimes use is that you need action-conditioned world models: you only actually have a world model if you can predict, given that some action is taken, what is going to change in the world because of it. And in particular, that becomes hard over longer time scales. If you're simply trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the likely consequences of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.
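The "action-conditioned" idea can be made concrete with a toy sketch. Everything here is invented for illustration (it is not Moonlake's code or model): the point is only that a world model is a function of both state and action, so different actions yield different predicted futures, which a frames-only video predictor has no way to express.

```python
# Toy illustration of an action-conditioned world model: it maps
# (state, action) -> next state, so you can ask counterfactuals like
# "what changes if I push the box?". A plain video predictor maps past
# frames -> next frame with no action input, so it cannot answer that.

State = tuple[int, int]   # (agent_x, box_x) on a 1-D track

def world_model(state: State, action: str) -> State:
    agent, box = state
    if action == "right":
        if agent + 1 == box:          # pushing into the box moves it too
            return (agent + 1, box + 1)
        return (agent + 1, box)
    if action == "left":
        return (agent - 1, box)
    return state                      # "wait" changes nothing

# Rolling forward under different action sequences gives different
# consequences; that conditional structure is the whole point.
s = (0, 2)
for a in ["right", "right", "right"]:
    s = world_model(s, a)
print(s)   # (3, 4): the agent walked into the box and pushed it
```

Rolling such a model forward for many steps is what makes longer-horizon questions ("what happens minutes from now if I do X?") answerable at all.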
Yeah. The question becomes: where do you want more structure than is available in just predicting the next token? Typically, let's call it the experience of the last five years, that has just been washed away by scale, right? So what is the right middle ground here, where you don't ignore the bitter lesson but can also be more efficient than what we're doing today?
You know, one possibility is: look, if we just collect masses and masses and masses of video data, this problem will be solved. Under certain assumptions that could be true, but there are multiple avenues in which it could not be true.
The first is that what's really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you're simply collecting observational video data, which is the easy stuff to collect when you're mining online videos, you don't actually know the actions that are being taken to see how the video is changing. If you're never collecting actions directly and you have to try to infer them from what happened in the observed video, that's not impossible, but it's very hard, and it's not really established that you can get that to work at any scale yet. So there's a lot of premium on collecting action-conditioned video data, which is part of why there's been a lot of interest in using simulation, so that you can be collecting data where you do know the actions; that data is in quite limited supply. But there's also the limit of as much data as you could possibly have. Maybe the problem is eventually solvable, but even though we collect huge amounts of text data, text data is always at a great level of abstraction. Language is a human-designed abstracted representation where there's meaning in each token, representing an abstraction of the world. As soon as you're describing someone as a professor, or saying that they're condescending, these are very abstracted descriptions of the world; it's not at the level of what you're observing as pixels. To get to that kind of degree of abstraction starting from pixels takes orders of magnitude of extra data and processing. So although we absolutely want to get as much data as possible and use the bitter lesson, nevertheless, if there are ways you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress a lot more quickly. And that's the bet here.
And you could say that's only wanting to do it more efficiently, more quickly, more cheaply. But I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, yes, we have these high-resolution eyes and we can look and see a scene like a video, but all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed. You're doing fairly fine processing of exactly what you're focusing on, but as soon as it's away from that, yeah, there's another guy over there, you're only processing top-down this very abstracted semantic description of the world around you. So that's what human beings are doing: they're working with semantic abstractions. And I think it is just the right representation, because we also have other goals. We want to be able to do real-time worlds, which means there's a limit to how much processing you can do, and we want to do long-term planning and consistency, and again that favors abstraction. I mean, I guess there was actually a recent blog post that came out from our friends at Physical Intelligence, and they were sort of heading in the same direction.
And they were saying: to maintain a long-term memory of what's happening in the world so we can do longer-term planning, we're actually storing text of what has been happening in the world, right? It's not such a successful strategy to try to keep it all at the pixel level. And you can see it in video models, that temporal consistency: at the scale of training on all the video data we have, we have it for maybe 30 seconds, a few minutes. That's not the same as a game state played for half an hour, right?
I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I don't know if you want to talk about this; this is one of the things I read.
Yeah, the thing I talked about with the reasoning chain. Yeah.
So, there are different phases to this. It seems like it's more of an agent, a scaffold. A very different approach than just typing in a prompt, where you don't get the same consistency. For people that are listening, I would highly recommend reading it. It breaks down the problem in a different light: what do you need to consider when you're talking about world or game models? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one.
Yeah. Actually, I wanted to add a little bit to our previous point. I do feel like sometimes people confuse things: oh, you're taking a method with abstraction, that means you don't believe in the bitter lesson. That's just false. We do believe in the bitter lesson, but the question we always discuss is: what is the right abstraction level today? The analogy I like to make is, let's say we can encode, decode, and represent all images, videos, and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model, as opposed to a next-token prediction model, where it's natively multimodal. But, to Chris's point, look at the scale and compute you need to achieve that. So that's why we always come back to: what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we're actually just reasoning about the world, reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
Yeah, it's like you're improving the encoder of whatever you're trying to model, and a better representation would just represent the important things in less space.
Yeah.
And that would just be more efficient.
Yeah.
So yeah, I fully agree that it is not antagonistic to the bitter lesson.
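The bytes-versus-tokens point can be made concrete with a rough back-of-envelope comparison. The numbers and assumptions below are mine, not from the episode: raw pixels arrive at vastly higher byte rates than the abstracted text describing the same scene, which is why next-byte prediction over raw video is so much more expensive than next-token prediction over language.

```python
# Back-of-envelope (invented numbers): bytes per second produced by
# each modality. Assumes 1080p RGB video at 30 fps, uncompressed, and
# speech transcribed at ~5 tokens/second, ~4 bytes of UTF-8 per token.

video_bytes_per_sec = 1920 * 1080 * 3 * 30      # raw pixels, no codec
text_bytes_per_sec = 5 * 4                      # transcribed speech

ratio = video_bytes_per_sec / text_bytes_per_sec
print(f"{video_bytes_per_sec:,} vs {text_bytes_per_sec} bytes/s")
print(f"raw video is roughly {ratio:.0e}x denser than its transcript")
```

Compression and tokenizers change the constants, but the gap of many orders of magnitude between pixel-level and symbol-level representations survives any reasonable choice of assumptions.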
I do want to mention one more thing. Are there any philosophical differences with the JEPA stuff that Yann LeCun is working on? I've got to go there. You're mentioning latent abstraction, and I'm like, okay, fine, let's talk about it, right? It's the elephant in the room.
Yeah, there are philosophical differences. Yann LeCun is a dear friend of mine, but he has never appreciated the power of language in particular, or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually and there are no words, symbols, or math in his head. Maybe that's true of Yann; it's certainly not the way I think. But at any rate, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low bit rate communication mechanism between humans: it doesn't have much other utility and it's far inferior to the high bit rate video that comes in your eyes. And I think he's fundamentally missing a number of important things there.
Think of the evolutionary argument, looking at animals. The closest analogies are things like chimps. Chimpanzees have fairly similar brains to human beings. They have great vision systems. They have great memory systems; they've got better short-term memories than we do. They can plan. They can build primitive tools. But humans are massively ahead in what we understand about the world, what we can plan, what we can build. Essentially, what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level which just vaulted what could be done with the intelligence in brains. The philosopher Dan Dennett refers to language as a cognitive tool and argues that humans, uniquely among the creatures in the world, have managed to build their own cognitive tools. Language is the famous first example, but other things like mathematics and programming languages are also cognitive tools. They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more. We use that for spatial representation and intelligence and planning and gameplay as well. So we believe, and this underlies the specific technologies that Moonlake is building, that symbolic representations are powerful and you want to use them in your understanding of the visual world, when you want a causal understanding, when you want to maintain long-term consistency and prediction. As I understand it, that's just not in Yann LeCun's worldview, so I think that's the fundamental philosophical difference.
Then there's the specific model he's been advancing, JEPA. That's a reasonable research bet as a direction to head for building out a model of the visual world. To my mind, it's one reasonable research bet; it's not really established that it's the best one that everyone should be following, even developed at scale at Meta.
But it's not just vision, right? JEPA is just joint embedding prediction; it can be applied to anything really, and people have done it. If the argument is that there is a latent representation that is probably more suited to the task, then why not let machines learn it for us instead of predefining it at all? Isn't something like a JEPA-shaped thing the right answer, and if not, why not?
So I think there's a part of JEPA that's right, which is that you do want to
have a joint embedding that gives you a consistent model of the world. Yann's argument is that you can never get that from autoregressive language models because they're going left to right, turning out one token at a time. I guess this is where we get into the researcher arguments of the field. I'm not actually convinced that's right, because although token production is this autoregressive process heading left to right (it doesn't have to be left to right; we could have right-to-left Arabic; anyway, in a sequence of tokens), all of the weights of the model that are internal to the transformer are a joint model of the model's understanding of the world. So I think you can think of the weights of the model as a form of joint representation, and therefore it is plausible that that could be the basis of a world model which avoids Yann's objections.
I think I follow, and obviously that will touch on what Moonlake eventually ends up doing as well, right? It's hard to tell, because you put out the end results, but we don't know the inputs that go into them. So that's something we have to figure out over time.
Yeah.
I mean, I guess this kind of breaks down some of the outputs. Do you want to walk us through it?
Yeah. So this really just walks through the reasoning traces. Let's say we want to build a world; in this context it's really just a game demo that shows the variety of interactions this world model can build. It's really the reasoning traces of: okay, you're prompted to create a bowling game, so how did it achieve the level of causality, interaction, and consistency that you saw? So this is almost just an example of a very detailed reasoning trace.
Very, very detailed. And you don't even realize it, right? When a video is generated, what happens when a ball strikes a pin? First there's audio, so audio triggers fire, the score increments, the world changes, pins have to start dropping, there's a timer that goes on. It's very similar to how we're now used to reasoning for language models: there's a whole state of what happens, so geometry, physics, all this stuff, and then there's that single prompt, so asset physicalization, all this stuff. It's a nice view to see what's going on.
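A hedged sketch of the kind of game state and event triggers the bowling example describes may help. The class and trigger names below are invented for illustration; this is not Moonlake's actual trace format, only a picture of how "ball strikes pin" fans out into score, audio, and reset logic.

```python
# Illustrative sketch: world state plus event triggers, as in the
# bowling reasoning trace (audio cue fires, score increments, pins
# drop, rack resets). All names are invented for this example.

class BowlingWorld:
    def __init__(self):
        self.pins_standing = 10
        self.score = 0
        self.events = []          # audio cues, UI updates, etc.

    def on_ball_strikes_pins(self, pins_hit: int):
        """Trigger fired by the physics layer on ball/pin collision."""
        pins_hit = min(pins_hit, self.pins_standing)
        self.pins_standing -= pins_hit              # pins start dropping
        self.score += pins_hit                      # score increments
        self.events.append("play_sound:pin_crash")  # audio trigger
        if self.pins_standing == 0:
            self.events.append("reset_rack")        # new frame begins
            self.pins_standing = 10

world = BowlingWorld()
world.on_ball_strikes_pins(7)
world.on_ball_strikes_pins(3)
print(world.score, world.pins_standing)   # 10 10 (spare, rack reset)
```

The point of the trace is that all of this causal bookkeeping is reasoned out explicitly rather than hoped for as an emergent property of frame prediction.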
I think Sun is also too polite to point out that both Google's Genie demos as well as World Labs' Marble do not have interactive worlds.
That's the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context I want to learn how to bowl. And then you can ask: what is important when it comes to learning how to bowl? Maybe it's: I need to understand the basics of the physics, and I want to know that when it resets, it's a new game. You know that picking up the ball and throwing it will cause the pins to fall down. You know that what's important to this particular bowling game is the score, and that the score corresponds to the number of pins that fell down. If a model knows what a bowling game looks like but doesn't actually allow you to practice over and over again, and to understand what it takes to get a high score, then it doesn't actually allow you to learn what you set out to learn within the world model, right? And I think this is really just one example showing the advantages of the approach we're taking over most of, let's call it, the zeitgeist today when people talk about quote-unquote world models.
So it seems like the question to ask when there's a world model is: can I not only wander around the world and look at the beautiful graphics, but can I interact with the objects in the world and see the right consequences of actions?
And you also understand what the consequences would be if you do something, right? It's not just that there's one thing where, if I pick it up, something will happen; there are 50 options and I can infer what would happen if I do any of them. So it's very different when you can actually see it and play around with it.
There are two cheeky elements of that. The less ambitious one is: let's really establish for listeners why this is fundamentally different from writing Unity code, right? Just creating a model to translate a prompt into Unity code.
So there is an underlying physics engine.
In that sense, there are some overlapping things with Unity, but the way we think about it is that physics engines, tools, or code are cognitive tools, borrowing Chris's term: tools that the model can deploy as means to an end. So today, maybe you say, okay, in this particular context we care about physics, we care about the long-term causality and consequences, so yes, we employ a physics engine. And maybe tomorrow we say, okay, we're training, let's say, drones, where we only care about fluid dynamics and the visual aspect of the world; then maybe the model doesn't have to use a physics engine, or it employs other types of representations or physics engines to achieve the task. So yes, writing code for Unity is similar to a tool that a model can employ. But our goal is for the model to take a representation-conditioned reasoning approach, or process.
Internally.
Yeah. Using these things as just general tool calls, right? Which I think is very interesting. The other, more ambitious element is some kind of recursive element where it becomes multiplayer, right? Here there's a single-player element; you're not modeling any other people involved, and that's a whole other thing.
But in fact, we can already do multiplayer.
Oh yeah? Okay, I haven't seen any.
You just prompt our model to say, hey, configure it to multiplayer, and then you'll be able to configure multiplayer.
Great, a persistency database for you.
Easy.
Yeah. So what are some of the current limitations, where we're at? There's one approach of, okay, scale up video predictors, and obviously there are data issues with approaches like this. Is it data constraints? What are the next steps? Is it real time? There's one side of writing an agent to write Unity code, but, okay, I want to be streaming a game in real time, I want characters that are also agentic. Where do we see this scaling up?
Yeah, there's definitely a data constraint: the more data, the better this reasoning model can basically act as humans do, operating a variety of tools and software to build whatever is necessary. And then there's a fidelity constraint, which we're actually solving with another model, Rey, which we can talk about later. It's not as easy to get to photorealism with the approach we're taking, but we think there are better solutions to that, which we can dive into later.
The one thing you note here is that it's a diffusion model, right? There are a few approaches: diffusion, Gaussian splatting. So Rey is a diffusion model; do you want to introduce it?
Yeah, totally. Within our world modeling framework, there are two models that we train. There's the multimodal reasoning model that we just talked about, which essentially handles the causality, the persistency, and the logic and determinism of the world. And then Rey is our bet. While that model can take care of all the things we just talked about, its limitation compared to existing video models is that it doesn't have as high a pixel fidelity right out of the gate. Rey says: hey, we can take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles, or arbitrary styles you want. So this model is almost saying, I'm going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.
Yeah.
And yeah, the example, right? You kept the KL divergence low?
No, no. I mean, this is a classic: how you don't stray too far from the source material. You kept the KL, which is kind of cool.
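Since the KL divergence only gets name-checked here, a minimal refresher may help. This is my own worked example, not from Moonlake's blog: keeping D_KL(source || restyled) small is one standard way to formalize "don't stray too far from the source material."

```python
import math

# The KL divergence measures how far one distribution strays from
# another; a small value means the restyled output's distribution
# stays close to the source's. Toy histograms invented for the example.

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

source   = [0.7, 0.2, 0.1]   # e.g. a coarse colour histogram
restyled = [0.6, 0.3, 0.1]   # close to the source -> small KL
drifted  = [0.1, 0.2, 0.7]   # strayed far -> large KL

print(round(kl_divergence(source, restyled), 4))
print(round(kl_divergence(source, drifted), 4))
```

In practice, the term shows up as a penalty in the training objective rather than a hard check, but the intuition is the same: small KL, faithful restyle.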
Yeah. And the difference is, and Sun was pointing at this, it's in one way a more difficult path but a better path. Typically the diffusion models produce the whole scene, and it looks lovely, but there isn't spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, the understanding of the consequences in worlds. This is taking a path where it assumes an abstracted semantic model of the world, the world state, and then the diffusion model is used on top of that to produce the high quality graphics.
Is there an intended practical or business use for this or is it like a like a demonstration of capabilities?
We actually believe that this is going to be the next paradigm of rendering. So
it's going to replace how rest Riser is.
It's going to replace DLSS today because it not only has these pixel prior that's learned from the world such that you can literally play any game in photorealistic styles which is a lot of people's desire when they do GTA right like um
all the mods all the people adding perfect lighting and all this.
So skins for worlds let's call it skins that's called you can call it skins you can call it customization you can play it how you want right? Yeah,
exactly. And I think another thing that we really pointed out specifically in this blog is the programmability of it. Right? So what this means is that, well, historically a renderer is always a derivative of the game state. Right? You're saying, okay, here's the game state, I'm rendering out a frame. But here I'm saying, actually, this renderer can be part of the gameplay loop. I can say something along the lines of: upon getting 10 apples, my weapon of choice, my bullets, are going to turn into apples. And that's possible because we can basically dynamically have certain game states trigger the preconditions to the renderer, such that the rendering is now part of the game loop too. One thing is to just say, okay, it's the appearance. But the second thing is also to say there are these novel interactions that are now possible because this renderer actually has a prior of the world.
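(To make that concrete: a minimal sketch of a renderer participating in the game loop. All names here are hypothetical, not Moonlake's actual API; the point is just that a rule keyed on game state can change what gets drawn.)

```python
from dataclasses import dataclass

@dataclass
class GameState:
    apples_collected: int = 0
    projectile_skin: str = "bullet"

# A precondition -> render-effect rule, evaluated inside the game loop.
# The renderer is no longer a pure function of state: its rules can
# mutate how the world is drawn (and even what the projectile looks like).
@dataclass
class RenderRule:
    condition: callable
    effect: callable

def apples_to_bullets(state: GameState) -> None:
    state.projectile_skin = "apple"

rules = [RenderRule(lambda s: s.apples_collected >= 10, apples_to_bullets)]

def render_tick(state: GameState) -> str:
    # Check every rule each frame; fire its effect when the game state matches.
    for rule in rules:
        if rule.condition(state):
            rule.effect(state)
    return f"drawing projectiles as {state.projectile_skin}s"

state = GameState(apples_collected=10)
print(render_tick(state))  # drawing projectiles as apples
```

A real system would presumably route this through the learned renderer rather than a string, but the shape of the loop — state triggers precondition, precondition reconfigures rendering — is the same.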
It is up to the artists to figure out what to do with it.
It is up to the creators. Yes. And I
also think that's actually another big argument that we're making, and the reason that we're taking the bet we're taking, is that a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions. Right? So, for example, in the context of gaming, it's obviously my creative intent. But maybe in the context of embodied AI, it's like, oh, I take this foundational policy and I want to actually fine-tune it to deploy in my house. So you want to have a layer where a human can say, "Oh, here's the distribution of things I want to create to achieve my goal."
And I think 3D graphics as it is today is basically the layer for people to say, "Hey, what do I care about in this world?" And it allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, "Hey, I'm going to generate something arbitrary," and it's just prompts.
You know, it's one of those things where I think you're going to build up a series of models, right? This is probably the highest-utility or heaviest-frequency one, I don't know what to call it, where, yeah, you can immediately drop this in on any game and you don't need anything else that you guys do. But I could see that. I think the human intent is something that people are not even used to, because we're so used to static worlds, or worlds that just don't react, or, I don't know... you're kind of blowing my mind right now. I wonder if you've talked to people at GDC, and what are they going to do with it?
Yeah. Now the stance that we take on this front is like we're not going to be more creative than our users.
Um but we want to make sure that we're building things in a way that really allows them to express their intent.
The thing that you said about "here's the distribution that I want": I think text may be too low of a bandwidth to really demonstrate, because, you know, I'm probably just going to want to drop in a bunch of reference assets, and then you can figure it out from there.
You probably want to do a mixture of both, right? Like, you throw in a few images: I want it this style, I want it to look like this. It's a mixture, right?
I think it's a mixture. I mean, yeah, there's clearly a visual component of this, and it's not that everything can be text, because of course you want to give a visual look. But there's also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text, and that would be very time-consuming and difficult to do via visual means. So I think, yeah, you want a combination of both.
So one question I kind of have is, how do we go about evaluating world models? There are many axes, right? One is, okay, I have preferences: how well do we adhere to prompts? One is the simulation: is there core logic that's broken? Coming from diffusion, we know how to evaluate that: there's fidelity, there's stuff like that. But what are some of the challenges that most people probably aren't thinking about?
Yeah, I think this is a great question, and probably one of the hardest questions in world models, because I think it always comes back to what you are building this world model for, and depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people are actually spending in this world that you create. And if your goal is, say, in the context that we just talked about, deploying an embodied AI agent, then your end metric is: after training in these worlds that you generate, how robust is it when you actually deploy to the target environment? But then, you know, it's hard to measure these end metrics. So today people have what I call proxy metrics, which basically try to measure what we really care about, the end metrics, but frankly it's different for every use case. Um, yeah.
Which seems like quite a challenge, right? Like, in language models or video models, image models, your benchmarks are proxies, right? People aren't actually asking instruction-following or tool-use questions; those are proxies of how well it will do downstream. But for this, should companies have their own individual benchmarks outside of games? If you think of stuff like video production, movies, stuff like that, that also wants to use world models, should they sort of internalize their own proxies? Is this something you guys do? Where does that kind of...
I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models; I think it's for everything, including text-based models, right? Because, you know, in the early days it seemed very easy to have good benchmarks, because we could do things like question-answering benchmarks: could you answer the question based on these documents, and various other kinds of pieces of logical reasoning or math? And there are sort of visual equivalents of things like object recognition, right? But these are small component tasks. These days, so much of what people are wanting to do, also with language models, is nothing like that, right? You're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for you for your trip in Europe next month. And it's not the same kind of thing, right? It's not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right? So,
and it's the same problem with these world models. So if we take the game design case, well, success is that a game designer can produce what they are imagining in a reasonable amount of time. That's really the kind of macro task, but, you know, that's a very hard thing to turn into a benchmark. And I think a lot of this is actually going to turn into people voting with their feet, right? I mean, I guess that's what's happening, you know, at the large language model level, right? When people are choosing to use, you know, GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or, no, I feel like I get more accurate detail from Claude, right?
It's a lot of vibe checking.
I realize that, but it's actually whether people feel it's giving them utility in what they want. Right?
And the interesting thing there is, a lot of people prefer the visual, right? This looks pretty. Which is not the objective of what this is for, right? If a game designer is working on something, they care about the game engine, the state. It can look like whatever; you can fix that up later. Or you can have a really good game state, and you can quickly edit it into 20 different versions that keep state, right? So that's a really important distinction, and it speaks to Moonlake's strength, right?
So, yeah, I mean, you know, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay, and, you know, a lot of the time that doesn't actually even require great visuals. I mean, there are just lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals and the game sucks, right? So keeping those two axes apart is really important in thinking about what's important in a world model for different uses.
This conversation is reminding me of some game review and fiction discussions I've had in my sort of non-AI-related life. Some people might know Brandon Sanderson, a very famous fiction author, who is a big game reviewer, and he's a big fan of video games where you change one thing about what you might normally assume about the world. For example, Baba Is You, I don't know if you've come across that, where the rules change as you play the game; and also games where you can do things like reverse time selectively, or change gravity selectively. And I think this also reminds me of other kinds of world models that are created by authors, where Ted Chiang is my typical example: he'll take the world that you know today, change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask: is it easy to create alternative rules that don't exist, where you change one thing, and then run a whole bunch of people through it to see if it works?
My first answer would be that that seems a lot easier and more conceivable to do using technology like Moonlake's than with some of the other world models out there, where Sun can actually make it happen. I'll let him give the second answer.
I guess, for you, you're constrained by the game engine tool, right? Like, at the end of the day, that's the thought partner that you have. If I ask for something where, say, it's never allowed to reverse time, or gravity only ever works one way, then, well, that's it. But sometimes gravity might change.
But it's a lot easier to change with code, as opposed to with a model that is learned primarily on data of the real world and virtual worlds. Like, for example, Genie, right? It's actually trained on a lot of real-world data and a lot of virtual gaming data, and it's hard to say... well, maybe it's easy to say, okay, I want to change the visuals, or the time period, of the world, but you can't change gravity, for example, or light bounces, right? Everything comes down to: code is a better way to execute it.
But the models aren't that diverse and creative, right? You can say, "Okay, make gravity slower." It can do that, but it's limited to your representation of how you text it out, right? They're only going to do a few iterations, whereas programmatically, you know, if there's a game engine under the hood, you can kind of go wild, right? So one of the, I don't know, one of the limitations of most models is that they're very overtrained to one style, right? And extracting diversity is pretty difficult; at least, that's something we've seen.
I mean, are there examples you have in mind where, for existing models, it would be easier to do something that's not using code? Like certain types of creative intent, or state transitions?
Clipping. Other world models are very good at clipping through things: my legs clipping through a rock, because, you know, it's just bad. You would have to struggle very hard with your stuff to actually make that happen. Which I think is maybe a topic that you actually prepared on: Gaussian splatting versus the other stuff.
Yeah. Yeah. Just for those not super familiar, right? There's Gaussian splatting, there's diffusion: what works, what scales up? I feel like in February, when Sora 1 came out, the blog post was literally titled, we'll bring it up, "Video generation models as world simulators." It's super bitter-lesson-pilled. A lot of it is emergence, right? So, not to go through their whole blog post, basically their whole thing was: as you scale up, all this consistency, all this stuff just kind of solves itself. It's a very simple premise, right? They just scaled up diffusion. And from there, you know, this is Feb 2024. It's already been two years, which is basically five years in AI time. How much more time do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot, right? This is back to the beginning discussion of what's appropriate for the time, and that seems like your approach, right?
Yeah. The point I'm trying to make is that there are very many different types of world simulators, and having a world simulator that can produce pixel coherency is very useful for games and, you know, marketing and all these things, but it's not as useful as people think when it comes to causal reasoning, when it comes to embodied AI. And yeah, that title is true; we're not saying it's not a great world simulator. But actually, in the blog that we wrote, the bet is more so that there is going to be a disproportionately large share of value in real-world tasks and virtual tasks where high-resolution pixel fidelity is not needed. And yes, video models have their values.
Yeah, this is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three-body problem in a deterministic world, whereas the video models would just approximate it. Good enough.
Yeah.
Right. Like, there's some point at which your approach kind of runs into: well, you now have to simulate the world, please, thank you very much. And you're trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things.
Yeah. No, I mean, I think the interesting, or more technical, question here actually is: where do you draw the boundary between what's handled with, let's say, a diffusion prior and what's handled with a symbolic prior?
Yes. And right, this boundary can actually be fluid. Like, maybe what you're trying to get at is: people are saying pixel prior for everything, but what we're saying is, okay, there's a boundary that we draw, where we think it provides the most economical value for the domains and things that we care about today. And I actually do think, and it's something that we do internally all the time: given new equations that we learn, or new elements of the world, or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today, or should we move it a little bit left or a little bit right? Right? Like, sometimes we realize that, oh, maybe customers or folks want certain things that are better handled with a pixel prior as opposed to a symbolic prior.
Yeah. Your skin thing is an example of moving it right. Yeah.
Um, or left. I don't know which way is left or right.
Yeah. Yeah. Yeah. No, the Reverie model. Yes. Actually, we have a few iterations of them. They're actually at slightly different...
I know. You should do that. That's a cool dimension to show.
Yeah. Is quantum mechanics the diffusion prior of our world, right? Like, that's the boundary of classical mechanics versus quantum, right? At one point God plays dice, and at the other point he doesn't. I don't know. Chris, I don't know if you want to weigh in, but I think, generally, I feel like physics is better with a symbolic prior.
Even quantum physics?
Even quantum physics, yeah.
This is starting to become MLST territory, is what I call it, where he likes to get philosophical. We're quite friendly.
I mean, we need to get... we need to get singularity.
I heard some of that.
No, no, I think that is actually really helpful. And, man, I just want you to productize this. Like, as a product guy, as a gamer, as a researcher... it's cool, this is theoretical, and you have a very good, I don't know, way of thinking about these things, but I just want to see you, you know, express it.
I do think, fundamentally, when you leave open new tools, like, okay, use human intent to incorporate it into how you render, well, artists are going to have to take like two to three years to figure out what to do with this, and you just don't know.
But I think, you know, this gives a much more approachable and controllable world, with the beauty of NLP, that will enable it to be adopted and used, and we're very hopeful about that.
Yeah. Yeah.
Yeah. I mean, we are very focused actually on commercialization, in the sense that we do really believe in the data flywheel approach, where we put this in the hands of the creators and the users, and then they will teach us what capability of the model should improve. And that's why we actually have, you know, products in beta.
Yeah. Focusing on gaming, what's the adjacent thing to gaming?
Embodied AI, basically. So maybe I'll start with where we see the platform in three years, which is: the users would tell us what they want to achieve. The end goal could be, hey, I want to make something to teach my kids the value of humility. Or it could be, hey, I want to fine-tune my drones to be really good at rescue situations. It could be vacuum robots: I want to train my manipulation robot, or my vacuum robot, to be very robust to my office, to navigate very robustly within my office. But whatever end goal you want, our world model will say: okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want. Yeah.
Right. Maybe for the purpose of games, it's just the end simulation, and that's the end product. For certain policies, it's: I can train it within these environments and then help you see where your policy is failing or not, and then, you know... so in that case it's much more of a training tool than...
Training, evaluation, both, right? Sure, same thing. I think it's just this world model that allows people to train any policy that can act in any multimodal environment.
Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I'll just put it generally, because that's obviously a key problem that a lot of people face when training agents in these environments, and, I don't know, can you solve it?
I think not necessarily. I mean, to the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I don't know if Sun's got any thoughts, but I don't think that's really being solved.
The other thing that comes to mind is: you could just build a better Sora as a video generation model, right? Because then you would move the diffusion side a bit further to the right... I think, if I got the directionality correct. And that's it.
It's better on some domains, right? Like on consistency over time, or whether something exists versus doesn't, right? So, yeah, is your question more like... I'm just riffing on what you can build, you know, with the stuff that you have. I do think that the mind, or the academic mind, goes immediately to training and evaluation, but art tends to take unusual directions; you might end up...
Okay, yeah, but the question is: can you use this piece of software to develop compelling gameplay? And I don't think you can take Sora and produce compelling gameplay, right? If you want to have a world that you can wander around in a bit, you're good, but what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things stay, you know, with the long-term history of your gameplay influencing future actions? I think there's just nothing there for that.
Yeah, I do tend to agree. I'm just trying to sort of test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game.
Um, no, honestly, there are so many applications in adjacent markets that our world model can go into.
Yeah.
But yeah, it's sort of fun to riff on. Although, on the execution side, we need to stay focused: okay, what are the capabilities we want to unlock over time, and there's a roadmap for that. But yeah, if we're just riffing on the possibilities, I feel like it's endless.
Yeah, it's like classic... the embeddings for "possibility" and "endless" in my mind are very close.
Yeah, I do want to focus on one, like, weird choice. I don't know if it's weird; maybe I've got something here. Audio, right? You could have just said no audio, and audio in my mind has a lot of recursion, whereas in video you can just do ray casting, which is computationally much simpler. Audio just seems way harder. I don't know if you want to comment on the spatial 3D audio problem. Did you really have to do it? I guess you do, to be immersive, but a lot of people do treat it as: well, you just stick a TTS model on top.
Well, there's a lot more to game audio than just speech, right? It's not just TTS.
TTS, SFX, BGM, spatial audio in my mind, echoes.
Yeah.
And reflections. And I don't even know what else; I don't know what other problems are in this space.
Yeah, I think this point is sort of more pointing to the benefits of using a game engine as a tool that's available to the model, right? Because part of the spatial audio comes from the code that is underlying the simulation. And while we do give our model access to other types of audio models as tools, none of them would be spatial, I think.
Right. But that's exactly the point: we're giving our model an abstraction, or a suite of tools, such that it's able to achieve that. And you can argue that spatial audio is an emergence out of the tools and abstractions that we provide to the agents. And I think that's the beauty of this approach: a lot of things, kind of like how humanity has built technology, are like Lego blocks that build on top of each other, and it's the same thing here. There are going to be things that just sort of emerge from being able to put these things together in combinatorially interesting ways.
Right. So this integrated audio model exploits the understanding and semantics of the Moonlake world, whereas in general, for the Gen AI video models, there's no actual integration across to audio at all, right? Someone might stick some music or a soundscape or whatever else on top of their video so it's not a silent video, but they're in no way connected into a consistent world model, and there's nothing that says: okay, an action is happening in the video, therefore there should be a sound coming from this part of the visual field.
Yeah.
Is that different than Sora 2? Does it not have audio? Not to say it doesn't have spatial audio...
It doesn't. No, I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync.
I mean, I've seen... okay: generate a dog at the beach, reacting to a big wave, moving around.
It's definitely worth trying: have the dog move away from the camera and see if the sound goes down or it doesn't, right? Because they don't have spatial audio.
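(The "dog walks away, sound gets quieter" check boils down to distance-based attenuation, which a state-aware engine gets almost for free because it knows where the listener and the source are. A minimal sketch using the common inverse-distance model; this is a generic approximation, not any specific engine's implementation.)

```python
import math

def spatial_gain(listener, source, ref_dist=1.0):
    """Inverse-distance attenuation: gain is 1.0 within the reference
    distance, then falls off as ref_dist / distance (halving each time
    the distance doubles)."""
    d = math.dist(listener, source)
    return min(1.0, ref_dist / max(d, ref_dist))

# Dog barking at the beach: walk it away from the camera and the gain drops.
listener = (0.0, 0.0, 0.0)
for x in (1.0, 2.0, 4.0, 8.0):
    print(f"distance {x:>4}: gain {spatial_gain(listener, (x, 0.0, 0.0)):.3f}")
```

A pixel-only video model has no such listener/source geometry to consult, which is why the "walk away" test is a decent probe for whether any world state sits behind the audio.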
We do want to, basically... our model, the one we're training, is basically towards the goal of having a combined latent representation across all these different modalities, right? Such that you can reason across these different modalities. So, for example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. And that type of capability is what we want our model to be able to reason with, right? And that's the reason that we're taking this multimodal reasoning approach. We want this combined latent space that can...
Yeah. Oh, you said latent space. We like that here. We have to play the bell every time that someone says "latent space."
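(One standard recipe for the "combined latent representation" idea is to project each modality into a shared embedding space and compare with cosine similarity, CLIP-style. The sketch below is purely illustrative: untrained random projections, not Moonlake's actual architecture. Training would pull matched audio/video pairs toward similarity +1.)

```python
import math
import random

random.seed(0)

# Toy per-modality "encoders": random linear projections of audio (8-dim)
# and video (16-dim) features into one shared 4-dim latent space.
def make_proj(in_dim, out_dim):
    return [[random.gauss(0, 1) for _ in range(out_dim)] for _ in range(in_dim)]

def embed(x, W):
    # x @ W, then L2-normalize so a dot product is cosine similarity.
    z = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    n = math.sqrt(sum(v * v for v in z))
    return [v / n for v in z]

W_audio, W_video = make_proj(8, 4), make_proj(16, 4)
audio_clip = [random.gauss(0, 1) for _ in range(8)]    # e.g. a car skidding away
video_clip = [random.gauss(0, 1) for _ in range(16)]   # the matching trajectory

za, zv = embed(audio_clip, W_audio), embed(video_clip, W_video)
similarity = sum(a * v for a, v in zip(za, zv))        # in [-1, 1]
print(f"cross-modal cosine similarity: {similarity:.3f}")
```

The "close your eyes and extrapolate the skid" intuition corresponds to decoding from one modality's embedding into another's, which only works if both live in the same latent space.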
Uh, no. You've got to train a Daredevil one, where it's only audio but you have to work out where everything is.
Cool. I think that was about it for our Moonlake coverage. I do think we have a couple of Chris Manning questions on IR, and just any other sort of attention topics or NLP topics.
Okay, go ahead.
Oh, no. I mean, yeah, it's just fun. You know, we talked a bit about how you guys met, but you basically were, like, the godfather of NLP, per se, right? You spent a whole career from early embeddings to early attention; you did 2015 attention for machine translation and everything. You had information retrieval, so RAG before RAG, you know. We just want to shout that out and admire a lot of that. So what prompted the switch over to world models? How did all that come about?
Some answer to it is the enthusiasm and creativity of students, but there's a bit of a history there, right? So, clearly, most of my career has been doing stuff with language, and, you know, how I got into research was thinking, ah, this is just so amazing, how humans can produce speech and understand each other in real time, and somehow they manage to learn languages when they're kids. How could this possibly happen? So, yeah, starting off, I was very focused on language. But, you know, as it got into the 2010s,
I started... you know, I'd been working on question answering, and then I started to get interested in visual question answering. And that was an area where it was very noticeable that the visual understanding was bad. Right? These were the days when it sort of seemed like there was almost no visual understanding; you were just getting answers that came from the prior. So, you know, if you asked how many people are sitting at the table, it always answered two, regardless of how many people you could see in the picture. And so it seemed like, ah, these models actually aren't able to get semantic information out of images. And so I was interested in that problem and tried to work more on that. And that required knowing more about what's happening in vision and how you can represent visual information. And then there started to be this revolution of doing generative AI for images, and then I had students that started looking at that. Before the era of Moonlake, I was also working with Demi Guo, who founded Pika.
And Ian, obviously, with GANs.
Yeah. Though Ian was never my student. But yeah, I was very aware, for the whole decade there, of Ian with GANs.
Yeah. And I mean, Ian was a Stanford undergrad. But yeah, Richard, of You.com, I believe he was your student.
Um, yeah, and, you know, there were links across at that stage as well.
So, I mean, you know, there were several papers in that era of doing... I mean, Andrej Karpathy was a PhD student at the same time as Richard, and so there was some joint language-vision work in that era as well. You know, it seems kind of ancient by modern standards, but yeah, we were trying to go from sort of textual dependency graphs to visual scenes at the time.
The GloVe embeddings really took over from a lot of TF-style one-hot encoding, all that. And the early vision-language models we saw were like LLaVA-style adapters, right? It's technically still just embedding latent space: let's add image, let's mix modality. And that's one of the things you super put out there too, right?
Yeah.
Yeah.
Yeah. Well, thank you for all of that. Thank you for advancing the world on world modeling. I honestly do think that if people deeply understand everything we just covered, they will see what's coming. And I think you guys have, you know, made some really significant contributions here. What are you hiring for? Where do people find you? You know, we agreed that the CTA was a hiring call.
Yeah. I mean, don't we have AGI? You don't need engineers anymore, right?
Yeah. On the model side, we are actually striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. So, more specifically, people who have the intersection of knowledge within code generation and computer vision and graphics, right? That's sort of the core research background that we look for within our team, and the majority of the team today do have both backgrounds.
Um, when you say computer vision and graphics, are they the same thing, or is computer vision one thing and graphics another thing? How intertwined are they?
They're intertwined but different. Yeah.
And I think, you know, this relates to some of the themes that we've been talking about that the more explicit
underlying world models that are being constructed inside Moon Lake really draw
on the computer graphics tradition. And
so it's then combining that with the visual understanding of vision.
Got it. All right. So if you've written a game engine, come talk to us, right?
Oh yeah, definitely. But I do think the line is increasingly blurred these days, as long as you have a general understanding of vision and graphics.
For me it feels like vision is something I'd leave to the big labs. Graphics I can see you'd want to do from more first principles, but for vision there are so many off-the-shelf models I could take, though probably not good enough for your purposes.
I see. If you're making that distinction, then maybe we care a little bit more about having graphics knowledge.
Yeah. Exactly. Exactly.
Sometimes a hiring call can be as simple as: if you know the answer to blah, you should talk to me. The sort of core known hard problem in your world.
Ah, I see. Yeah, in that case: definitely if you've written a game engine before, if you've RL'ed a variety of coding models on different objectives, many of those. Or if you've done multimodal latent space alignment. I intentionally included latent space again.
Our poor editor has to edit it every time. Uh, yeah, latent space alignment.
Honestly, is it that hard?
Well, there are some scripts out there that I've saved for the day I someday have to do it, but I haven't had to do it.
It's done, right?
I think, yeah, there are versions of that that are done. But we are aligning audio, text, language, and video, right? And basically we have these world models that are able to act as agents in these worlds, extract long-horizon videos, and encode that back into the models to self-improve. So it's an insanely exciting but also technically challenging problem. For people who want to do their life's best work, you know, this is the place.
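(As an aside for readers: the "multimodal latent space alignment" mentioned above is commonly done with a CLIP-style contrastive objective, where paired embeddings from two modalities are pulled together and mismatched pairs pushed apart. The transcript doesn't specify Moonlake's method; this is a generic, minimal NumPy sketch of a symmetric InfoNCE loss, with the function name and batch shapes chosen for illustration.)

```python
import numpy as np

def info_nce_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    a, b: (batch, dim) arrays where a[i] and b[i] embed the same sample
    in two modalities (e.g. video and text).
    """
    # L2-normalize so the dot product is cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))              # matching pairs lie on the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal target
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # symmetric: align a -> b and b -> a
    return (xent(logits) + xent(logits.T)) / 2

# Correctly paired batches should score a lower loss than shuffled ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(info_nce_loss(emb, emb) < info_nce_loss(emb, emb[::-1]))  # True
```

Minimizing this loss over both encoders is what makes, say, a video clip and its transcript land near each other in the shared latent space.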
How big are you guys? Where are you guys based?
We're currently based in Sato, although we're moving up to SF. Um, we're about 18 folks right now.
My ending question was going to be: what is the name? What's behind the name?
Oh.
Very cool graphics and design, by the way.
Actually, at the time when we started the company, we were thinking a lot about how to make a company name that gives people the vibe of OpenAI, but with almost Industrial Light & Magic vibes, because we care about creativity and using that as a funnel to solve AGI. So we brainstormed a lot around DreamWorks, Industrial Light & Magic, and so on; there's basically a space of things that we feel are very semantically close to the company's identity.
Yeah.
And then it ended up being Moonlake, partly because of the DreamWorks vibe, you know, the DreamWorks moon and lake.
Exactly. Yeah.
So that was a little bit of the inspiration, and then the moon was basically about the reflection. The reflection part also implies the self-improvement loop that we really believe in, and that's the path towards multimodal general intelligence.
So that's that. I'll leave it there.
I love a good name. I love a good name.
This is great.
It's a very good name.
It's very good lore. I'm glad I asked the question. I will also say, one of my favorite books or biographies ever is Creativity, Inc., with Ed Catmull's story about Pixar: how he was rejected as a Disney animation artist, so he went into computing and brute-forced his way back into Disney.
Yeah. And Walt Disney is also one of my favorite founders. His story at the time was like, "Okay, I'm going to create this immersive park." People didn't even have the technology to create it virtually, but, you know what, let's just build it physically so that people can.
So he's the first world modeler.
No, I'll tell people that theme parks are world models, too.
Yeah. I mean, you know, It's a Small World, or Epcot with all the little replicas of the countries. Those are very interesting. Okay. Well, thank you. We've covered a huge amount. Thank you for your time, and thank you for inspiring us.
Thank you for having us.
Fun chatting. Yeah, it's been a good time.