“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI
By a16z
Summary
## Key takeaways
- **AI's Cambrian Explosion Beyond Text**: We are currently experiencing a Cambrian explosion in AI, where not only text but also pixels, videos, and audio are being integrated into AI applications and models, signifying a rapid expansion of possibilities. [01:21]
- **Compute Power: The Unsung Hero of AI**: The exponential growth in computational power over the last decade has been a critical, often underestimated, driver of AI advancements, enabling models trained in days in 2012 to be trained in minutes today. [07:28], [08:03]
- **Data Drives Models, Not Just Algorithms**: While algorithmic breakthroughs are important, the true power of AI is unleashed when data drives the models, as demonstrated by the massive bet on ImageNet, which scaled data far beyond previous norms. [06:05], [06:44]
- **Spatial Intelligence: AI's Next Fundamental Frontier**: Spatial intelligence, enabling machines to perceive, reason, and act in 3D space and time, is as fundamental as language and represents the next critical frontier for AI advancement, moving beyond 1D representations. [19:01], [27:29]
- **From Static Scenes to Dynamic 3D Worlds**: The evolution of AI in computer vision is moving from recognizing static objects and scenes to generating dynamic, interactive 3D worlds, a progression that requires a native 3D representation to unlock new media and applications. [35:35], [33:18]
- **Deep Tech Platform for Spatial Intelligence**: World Labs is building a deep tech platform, focusing on solving fundamental problems in spatial intelligence to serve diverse use cases, from virtual world generation to augmented reality and robotics. [41:06], [40:43]
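The compute claim in the second takeaway (training runs going from days in 2012 to minutes today) is easy to sanity-check. A minimal Python sketch, where the per-GPU speedup factor is a hypothetical round number chosen to be consistent with the "in the thousands" figure quoted later in the conversation:

```python
# Sanity check of the compute comparison discussed in the transcript:
# AlexNet trained for ~6 days on two GTX 580s in 2012; the speaker
# estimates the same run at "just under five minutes" on one GB200.
gpu_minutes_2012 = 6 * 24 * 60 * 2   # six days of training across two GPUs

# Hypothetical per-GPU raw-compute factor, "in the thousands" per the
# conversation; ~3456x reproduces the five-minute figure exactly.
speedup_per_gpu = 3456

minutes_on_gb200 = gpu_minutes_2012 / speedup_per_gpu
print(minutes_on_gb200)              # 5.0
```

The exact factor depends on which precision and datasheet numbers you compare, so treat the 3456 as illustrative rather than a measured benchmark.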
Topics Covered
- The AI Cambrian Explosion: Beyond Text to Pixels, Video, and Audio
- The Recipe for AI Magic: Data, Compute, and Algorithms
- The Data-Driven AI Revolution: From Thousands to Internet Scale
- From Objects to Worlds: The Next Frontier of AI
- Spatial Intelligence: The Next Operating System for AR/VR
Full Transcript
visual spatial intelligence is so
fundamental it's as fundamental as
language we've got these ingredients compute deeper understanding of data and
we've got some advancement of algorithms
we are in the right moment to really
make a bet and to focus and just unlock
that over the last two years we've seen
this kind of massive Rush of consumer AI
companies and technology and it's been
quite wild but you've been doing this
now for decades and so maybe walk
through a little bit about how we got
here kind of like your key contributions
and insights along the way so it is a
very exciting moment right just zooming
back AI is in a very exciting moment I
personally have been doing this for for
two decades plus and you know we have
come out of the last AI winter we have
seen the birth of modern AI then we have
seen deep learning taking off showing us
possibilities like playing chess but
then we're starting to see the the the
deepening of the technology and the
industry um adoption of uh of some of
the earlier possibilities like language
models and now I think we're in the
middle of a Cambrian explosion in almost
a literal sense because now in addition
to texts you're seeing pixels videos
audios all coming out with possible AI
applications and models so it's very
exciting moment I know you both so well
and many people know you both so well
because you're so prominent in the field
but not everybody like grew up in AI so
maybe it's kind of worth just going
through like your quick backgrounds just
to kind of level set the audience yeah
sure so I first got into AI uh at the
end of my undergrad uh I did math and
computer science for undergrad at
keltech that was awesome but then
towards the end of that there was this
paper that came out that was at the time
a very famous paper the cat paper um
from Quoc Le and Andrew Ng and others who were at Google Brain at the time and
that was like the first time that I came
across this concept of deep learning um
and to me it just felt like this amazing
technology and that was the first time
that I came across this recipe that
would come to define the next like more
than decade of my life which is that you
can get these amazingly powerful
learning algorithms that are very
generic couple them with very large
amounts of compute couple them with very
large amounts of data and magic things
started to happen when you combine those
ingredients so I I first came across
that idea like around 2011 2012-ish and
I just thought like oh my God this is
this is going to be what I want to do so
it was obvious you got to go to grad
school to do this stuff and then um sort
of saw that Fei-Fei was at Stanford one of
the few people in the world at the time
who was kind of on that on that train
and that was just an amazing time to be
in deep learning and computer vision
specifically because that was really the
era when this went from these first nascent
bits of technology that were just
starting to work and really got
developed and spread across a ton of
different applications so then over that
time we saw the beginning of language
modeling we saw the beginnings of
discriminative computer vision you could
take pictures and understand what's in
them in a lot of different ways we also
saw some of the early bits of what we
would now call generative modeling
generating images generating text a lot
of those core algorithmic pieces
actually got figured out by the academic
Community um during my PhD years like
there was a time I would just like wake
up every morning and check the new
papers on arXiv and just be ready it
was like unwrapping presents on
Christmas that like every day you know
there's going to be some amazing new
discovery some amazing new application
or algorithm somewhere in the world what
happened is in the last two years
everyone else in the world kind of came
to the same realization using AI to get
new Christmas presents every day but I
think for those of us that have been in
the field for a decade or more um we've
sort of had that experience for a very
long time obviously I'm much older than
Justin I I come to AI through a
different angle which is from physics
because my undergraduate uh background
was physics but physics is the kind of
discipline that teaches you to ask audacious questions and think about what are the remaining mysteries of the world of course in physics it's the atomic world you know the universe and all that but
somehow that kind of training and thinking got me into the audacious question that really captured my own imagination which is intelligence so I did my PhD in AI and computational neuroscience at Caltech so
Justin and I actually didn't overlap but
we share um the same alma mater at Caltech oh and the same adviser at Caltech yes same adviser your undergraduate adviser and my PhD adviser Pietro Perona and my PhD time
which is similar to your your your PhD
time was when AI was still in the winter
in the public eye but it was not in the
winter in my eye because it's that
pre-spring hibernation there's so much life
machine learning statistical modeling
was really gaining uh gaining power and
I think I was one of the native generation in machine learning and AI whereas I look at Justin's generation as the native deep learning generation so machine learning was the precursor
of deep learning and we were
experimenting with all kinds of models
but one thing came out at the end of my
PhD and the beginning of my assistant professor years there was an overlooked element of AI that is
mathematically important to drive
generalization but the whole field was
not thinking that way and it was Data
because we were thinking about um you
know the intricacy of Bayesian models or whatever you know um uh kernel
methods and all that but what was
fundamental that my students and my lab
realized probably uh earlier than most
people is that if you if you let Data
Drive models you can unleash the kind of
power that we haven't seen before and
that was really the reason we went on a pretty crazy bet on ImageNet which is you know just forget about any scale we're seeing now which was thousands of data points at that point uh the NLP community had their own data sets I remember the UC Irvine data set or some data set in NLP it was small compared to that the vision community had their data sets but all on the order of thousands or tens of thousands we were like we need to drive it to internet scale and luckily it was
also the the the coming of age of
Internet so we were riding that wave and
that's when I came to Stanford so these
epochs are what we often talk about like
ImageNet is clearly the epoch that created you know or at least made popular and viable computer vision and for the GenAI wave we talk about two kinds of core unlocks one is like the Transformers paper which is attention we talk about Stable Diffusion is that a
fair way to think about this which is
like there's these two algorithmic
unlocks that came from Academia or
Google and like that's where everything
comes from or has it been more
deliberate or have there been other kind
of big unlocks that kind of brought us
here that we don't talk as much about
yeah I I think the big unlock is compute
like I know the story of AI is often the story of compute but even no matter how
much people talk about it I I think
people underestimate it right and the
amount of the amount of growth that
we've seen in computational power over
the last decade is astounding the first
paper that's really credited with the
like Breakthrough moment in computer
vision for deep learning was AlexNet um which was a 2012 paper where a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms that Fei-Fei had been working on the types of algorithms that they'd been working on in grad school that AlexNet was a 60 million
parameter deep neural network um and it
was trained for six days on two GTX 580s
which was the top consumer card at the
time which came out in 2010 um so I was
looking at some numbers last night just
to you know put these in perspective the
newest the latest and greatest from
Nvidia is the gb200 um do either of you
want to guess how much raw compute
Factor we have between the GTX 580 and
the GB200 shoot no go for it it's uh it's in the thousands so I ran the numbers last night that training run of six days on two GTX 580s if you scale it it comes out to just under five minutes on a single GB200 Justin is making a really good point the 2012 AlexNet paper on the ImageNet challenge is
literally a very classic Model and that
is the convolutional neural network model and that was published in the 1980s the first paper I remember as a graduate student learning that and it more or less also has six seven layers practically the only difference between AlexNet and that ConvNet what's the difference is the GPUs the two GPUs and the deluge of data yeah well so that's
what I was going to go which is like so
I think most people now are familiar
with like quote the bitter lesson and
what the bitter lesson says is if you make an
algorithm don't be cute yeah just make
sure you can take advantage of available
compute because the available compute
will show up right on the other hand there's another narrative um which seems to me to be just as credible which is that it's actually new data sources that unlock deep learning right like ImageNet is a great example but like a
lot of people like self attention is
great from Transformers but they'll also
say this is a way you can exploit human
labeling of data because like it's the
humans that put the structure in the
sentences and if you look at CLIP
they'll say well like we're using the
internet to like actually like have
humans use the alt tag to label images
right and so like that's a story of data
that's not a story of compute and so is
it just is the answer just both or is
like one more than the other or I think
it's both but you're hitting another
really good point so I think there's
actually two eras that to me feel quite distinct in the algorithmics here so like the ImageNet era is actually the era of supervised learning um so in the era of supervised learning you have a lot of data but you don't know how to use data on its own like the expectation of ImageNet and other data sets of that time period was that we're going to get a lot of images but we need people to label every one and for all of the training data that we're going to train on a human labeler has looked at every one and said something about that image yeah um and the big algorithmic
unlock was learning how to train on things that don't require human-labeled data
as the naive person in the room that
doesn't have an AI background it seems
to me if you're training on human data
like the humans have labeled it it's
just not explicit I knew you were gonna say that Martin I knew that yes philosophically that's a really important question but that actually is more true of language than pixels fair enough yeah 100% but I do think it's an important thing it's learned it's just more implicit than explicit yeah it's still human labeled the distinction is that
for for this supervised learning era um
our learning tasks were much more
constrained so like you would have to
come up with this ontology of Concepts
that we want to discover right if you're
doing ImageNet like Fei-Fei and your students at the time spent a lot of time thinking about you know which thousand categories should be in the ImageNet challenge other data sets of that time like the COCO data set for object detection they thought really hard about which 80 categories to put in
there so let's walk to GenAI um so when I was doing my PhD before that I took machine learning from Andrew Ng and then I took like Bayesian something very complicated from Daphne Koller and it was very
complicated for me a lot of that was
just predictive modeling y um and then
like I remember the whole kind of vision
stuff that you unlock but then the
generative stuff is shown up like I
would say in the last four years which
is to me very different like you're not
identifying objects you're not you know
predicting something you're generating
something and so maybe kind of walk
through like the key unlocks that got us
there and then why it's different and if
we should think about it differently and
is it part of a Continuum is it not it
is so interesting even during my
graduate time generative models were there we wanted to do generation but nobody remembers even with the uh letters and uh numbers we were trying to do some you know Geoff Hinton had papers on generation we were thinking about how to
generate and in fact if you do have if
you think from a probability
distribution point of view you can
mathematically generate it's just
nothing we generate would ever impress
anybody right so this concept of
generation mathematically theoretically
is there but nothing worked so then I do
want to call out Justin's PhD and Justin
was saying that he got enamored by Deep
learning so he came to my lab and Justin's entire PhD is a story almost a mini story of the trajectory of the uh
field he started his first project in
data I forced him to he didn't like
it so in retrospect I learned a lot of
really useful things I'm glad you say
that now so we moved Justin to um to
deep learning and the core problem there
was taking images and generating words
well actually there were I think three discrete phases here on this trajectory so the first one was actually matching images and words right like we have an image we have words and can we say how much they align so actually my first paper of my PhD and my first academic publication ever was on image retrieval with scene graphs and then we went into the generation
uh taking pixels generating words and
Justin and Andrej uh really worked on
that but that was still a very very
lossy way of of of generating and
getting information out of the pixel
world and then in the middle Justin went
off and did a very famous piece of work
and it was the first time that uh
someone made it real time right yeah
yeah so so the story there is there was
this paper that came out in 2015 a
neural algorithm of artistic style led
by Leon Gatys and it was like the paper
came out and they showed like these
these real world photographs that they
had converted into Van Gogh style and like
we are kind of used to seeing things
like this in 2024 but this was in 2015
so this paper just popped up on archive
one day and it like blew my mind like I
just got this like GenAI brainworm like in
my brain in like 2015 and it like did
something to me and I thought like oh my
God I need to understand this algorithm
I need to play with it I need to make my
own images into Van Gogh so then I like
read the paper and over a long weekend I
reimplemented the thing and got it to
work it was a very actually very simple
algorithm um so like my implementation
was like 300 lines of Lua because at the time um this was pre-PyTorch so we were using Lua Torch um but it was
like very simple algorithm but it was
slow right so it was an optimization based thing every image you want to generate you need to run this optimization loop run this gradient descent loop for every image that you generate
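The per-image optimization Justin describes can be caricatured with a scalar toy sketch. Nothing here is the real Gatys et al. algorithm: the "image" is a single number and the two loss terms are made-up stand-ins for the content and style losses. The point is the workflow, where every output image needs its own gradient descent loop:

```python
# Toy stand-in for optimization-based style transfer: minimize a
# weighted sum of squared distances to a "content" target and a
# "style" target, one gradient descent loop per output "image".
def stylize(content, style, style_weight=0.5, lr=0.1, steps=200):
    x = content                      # initialize the output from the content
    for _ in range(steps):
        # gradient of (x - content)^2 + style_weight * (x - style)^2
        grad = 2 * (x - content) + 2 * style_weight * (x - style)
        x -= lr * grad
    return x

# One full optimization run per image; this per-image cost is why the
# original method was slow and feed-forward approximations were a big speedup.
result = stylize(content=1.0, style=0.0)   # converges to 2/3
```

In the real algorithm `x` is a full image tensor and the losses are computed from deep network features, but the per-image loop has the same shape.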
the images were beautiful but I just wanted it to be faster and Justin just did it and it was actually I think your first taste of an academic work having an industry impact a bunch of people had seen this
artistic style transfer stuff at the
time and me and a couple others at the
same time came up with different ways to
speed this up yeah um but mine was the
one that got a lot of traction right so
I was very proud of Justin but there's one more thing I was very proud of Justin for to connect to GenAI is that before the world understood GenAI Justin's last piece of uh work in his PhD which I knew about because I was forcing you to do it that one was fun was actually uh inputting language and getting a whole picture out it's one of the first GenAI works it was using GANs which were so hard to use but the problem is that we were not ready to use a natural piece of language so Justin you heard he worked on scene graphs so we had to input a scene graph language structure so you know the sheep the grass the sky in a graph way it literally was one of our photos right and then he and another very good uh master's student Agrim they got that GAN to work so so you can see from
data to matching to style transfer to generative uh images we're starting to see you asked if this is an abrupt change for people like us it's a continuum that was already happening but for the world the results are more abrupt so I read your book and for those
that are listening it's a phenomenal
book like I I really recommend you read
it and it seems for a long time like a
lot of you and I'm talking to you Fei-Fei
like a lot of your research has been you
know and your direction has been towards
kind of spatial stuff and pixel stuff
and intelligence and now you're doing
World labs and it's around spatial
intelligence and so maybe talk through
like you know is this been part of a
long journey for you like why did you
decide to do it now is it a technical
unlock is it a personal unlock just kind
of like move us from that kind of milieu of AI research to World Labs sure for me
um it is both personal and intellectual
right my entire you talk about my book
my entire intellectual journey is really
this passion to seek North Stars but
also believing that those North Stars are
critically important for the advancement
of our field so at the beginning
I remembered after graduate school I
thought my Northstar was telling stories
of uh images because for me that's such
an important piece of visual intelligence that's part of what you call AI or AGI but when Justin and Andrej did that I was like oh my God that was my life's dream what do I do next so it came a
lot faster I thought it would take a
hundred years to do that so um but
visual intelligence is my passion
because I do believe for every
intelligent uh
being like people or robots or some
other form um knowing how to see the
world reason about it interact in it
whether you're navigating or manipulating or making things you can even build civilization upon it
visual spatial intelligence is so
fundamental it's as fundamental as
language possibly more ancient and and
more fundamental in certain ways so so
it's very natural for me that um World Labs' North Star is to unlock spatial intelligence the moment to me is right to do it like Justin was saying we've got these ingredients we've got compute we've got a much deeper understanding of data way deeper than the ImageNet days you know uh compared to those days we're so much more sophisticated and we've got some advancement of algorithms including co-founders in World Labs like Ben Mildenhall and uh Christoph Lassner they were at the cutting edge of NeRF so we are in
the right moment to really make a bet
and to focus and just unlock that so I
just want to clarify for for folks that
are listening to this which is so you
know you're starting this company World Labs spatial intelligence is kind of how
you're generally describing the problem
you're solving can you maybe try to
crisply describe what that means yeah so
spatial intelligence is about machines' ability to perceive reason and act in 3D space and time to
understand how objects and events are
positioned in 3D space and time how
interactions in the world can affect
those 3D and 4D positions over space time um and both sort of perceive reason about generate and interact with it really take the machine out of the mainframe or out of the data center and put it out
into the world and understanding the 3D
4D world with all of its richness so to
be very clear are we talking about the
physical world or are we just talking
about an abstract notion of world I
think it can be both I think it can be
both and that encompasses our vision
long term even if you're generating
worlds even if you're generating content
um doing that positioned in 3D uh has a lot of benefits um or if
you're recognizing the real world being
able to put 3D understanding into the
into the real world as well is part of
it great so I mean Ju Just for everybody
listening like the two other co-founders
Ben M Hall and Kristoff lner are
absolute Legends in the field at the at
the same level these four decided to
come out and do this company now and so
I'm trying to get dig to like like why
now is the the the right time yeah I
mean this is Again part of a longer
Evolution for me but like really after
PhD when I was really wanting to develop
into my own independent researcher for my later career I was just
thinking what are the big problems in Ai
and computer vision um and the
conclusion that I came to about that
time was that the previous decade had
mostly been about understanding data
that already exists um but the next
decade was going to be about
understanding new data and if we think
about that the data that already exists
was all of the images and videos that
maybe existed on the web already and the
next decade was going to be about
understanding new data right like people have smartphones those smartphones have cameras those cameras have new sensors those cameras are positioned in the 3D world it's not
just you're going to get a bag of pixels
from the internet and know nothing about
it and try to say if it's a cat or a dog
we want to treat images as
universal sensors to the physical world
and how can we use that to understand
the 3D and 4D structure of the world um
either in physical spaces or or or
generative spaces so I made a pretty big
pivot post PhD into 3D computer vision
predicting 3D shapes of objects with
some of my colleagues at FAIR at the
time then later I got really enamored by
this idea of learning 3D structure
through 2D right because we talk about
data a lot it's it's um you know 3D data
is hard to get on its own um but there's a very strong mathematical connection here um our 2D
images are projections of a 3D World and
there's a lot of mathematical structure
here we can take advantage of so even if
you have a lot of 2D data a lot of people have done amazing work to figure out how you can back out the 3D structure of the world from large quantities of 2D observations
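The mathematical connection here, 2D images as projections of a 3D world, can be sketched with a toy pinhole-camera example (the focal length, baseline, and 3D point below are made-up numbers): project one 3D point into two horizontally offset cameras, then back its depth out from the disparity between the two 2D observations.

```python
# Toy pinhole-camera example of "backing out" 3D structure from 2D
# observations; all quantities are hypothetical numbers for illustration.
f = 500.0                  # focal length, in pixels
baseline = 0.1             # distance between the two cameras, in meters
X, Y, Z = 0.3, 0.2, 2.0    # a 3D point, in the left camera's frame

# Each camera only ever sees a 2D projection of the 3D point.
x_left = f * X / Z
x_right = f * (X - baseline) / Z

# Stereo triangulation: disparity = f * baseline / Z, so depth falls
# right out, and with depth the full 3D position is recoverable.
disparity = x_left - x_right
Z_hat = f * baseline / disparity   # recovers Z = 2.0
X_hat = x_left * Z_hat / f         # recovers X = 0.3
```

Real multi-view reconstruction is far harder than this toy because the correspondence between pixels across views is unknown, which is exactly the difficulty Fei-Fei raises below when discussing classical 3D reconstruction.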
um and then in 2020 you asked about breakthrough moments there was a really
big breakthrough Moment One from our
co-founder Ben Mildenhall at the time with his paper NeRF Neural Radiance Fields
and that was a very simple very clear
way of backing out 3D structure from 2D
observations that just lit a fire under
this whole Space of 3D computer vision I
think there's another aspect here that
maybe people outside the field don't
quite understand is that that was also a time
when large language models were starting
to take off so a lot of the stuff with
language modeling actually had gotten
developed in Academia even during my PhD
I did some early work with Andrej Karpathy on language modeling in 2014 I still remember LSTMs RNNs GRUs like this was pre-Transformer um but uh
then at some point like around the GPT-2 time you couldn't really do those kinds of models anymore in academia because they took way more resourcing but there was one really
interesting thing about the NeRF approach that Ben came up with you could train these in an hour or a couple of hours on a single GPU so I think
at that time there was a dynamic here that happened which is that I think a lot of academic researchers ended up focusing on a lot of these problems because there was core algorithmic stuff to figure out and because you could actually do a lot without a ton of compute and you could get state-of-the-art results on a single GPU because of those dynamics um a lot of researchers in academia were moving to
think about what are the core
algorithmic ways that we can advance
this area as well uh then I ended up
chatting with Fei-Fei more and I realized that we were actually she's very convincing she's very convincing well there's that but like you know we talk about trying to figure out your own independent research trajectory from your adviser well it turns out we ended up kind of converging on similar things okay well from my end I
want to talk to the smartest person I I
call Justin there's no question about it
uh I do want to talk about a very
interesting technical um uh issue or or
technical uh story of pixels that most
people who work in language don't realize is that in the earlier era in the field of computer
vision those of us who work on pixels
we actually have a long history in an area of research called reconstruction 3D reconstruction which you know dates back to the 70s you can take photos because humans have two eyes right so it generally starts with stereo photos and then you try to triangulate the geometry and uh make a 3D shape out of it it is a really really hard problem to this day it's not fundamentally solved because there's the correspondence problem and all that and then so
this whole field which is an older way of thinking about 3D has been going on and it has been making really good progress but when NeRF happened in the context of generative methods in the context of diffusion models suddenly reconstruction and generation started to really merge and now like
within really a short period of time in
the field of computer vision it's hard
to talk about reconstruction versus
generation anymore we suddenly have a
moment where if we see something or if
we imagine something both can converge
towards generating it right right and
that's just to me a a really important
moment for computer vision but most
people missed it because we're not
talking about it as much as llms right
so in pixel space there's reconstruction
where you reconstruct
like a scene that's real and then if you
don't see the scene then you use
generative techniques right so these
things are kind of very similar
throughout this entire conversation
you're talking about languages and
you're talking about pixels so maybe
it's a good time to talk about how spatial intelligence and what you're working on contrasts with language approaches which of course are very popular now like is it complementary is it orthogonal yeah I think they're complementary I don't mean to be too leading here like maybe just contrast them like everybody says listen I know OpenAI and I know GPT and I know multimodal
models and a lot of what you're talking
about is like they've got pixels and
they've got languages and like doesn't
this kind of do what we want to do with
spatial reasoning yeah so I think to do
that you need to open up the Black Box a
little bit of how these systems work
under the hood um so with language
models and the multimodal language
models that we're seeing nowadays
their underlying representation under the hood is a one-dimensional representation we talk about context lengths we talk about Transformers we talk about sequences attention fundamentally their representation of the world is one-dimensional so these things fundamentally operate on a one-dimensional sequence of tokens so
this is a very natural representation
when you're talking about language
because written text is a
one-dimensional sequence of discrete letters so that kind of underlying representation is the thing that led to LLMs and now the multimodal LLMs that
we're seeing now you kind of end up
shoehorning the other modalities into
this underlying representation of a 1D
sequence of tokens um now when we move
to spatial intelligence it's kind of
going the other way where we're saying
that the three-dimensional nature of the
world should be front and center in the
representation so from an algorithmic
perspective that opens up the door for
us to process data in different ways to
get different kinds of outputs out of it
um and to tackle slightly different
problems so even at a coarse level you kind of look at it from the outside and you say oh multimodal LLMs can look at images too well they can but I think that
they don't have that fundamental 3D
representation at the heart of their
approaches I totally agree with Justin I
think talking about the 1D versus
fundamental 3D representation is one of the most core differentiations the other thing is slightly philosophical but it's really important to me at least language is fundamentally a purely generated signal there's no language out there you don't go out in nature and find words written in the sky for you whatever data you're fed you pretty much can just somehow regurgitate with enough generalizability the same data out and that's language to language and
but the 3D world is not like that there is a 3D world out there that follows the laws of
physics that has its own structures due
to materials and and many other things
and to to fundamentally back that
information out and be able to represent
it and be able to generate it is just
fundamentally quite a different
problem we will be borrowing um similar
ideas or useful ideas from language and
llms but this is fundamentally
philosophically to me a different
problem right so language is 1D and probably a bad representation of the physical world because it's been generated by humans and it's probably lossy there's a whole other modality of generative AI models which is pixels and these are 2D images and 2D video and
like one could say that like if you look
at a video it looks you know you can see
3D stuff because like you can pan a
camera or whatever it is and so like how
would like spatial intelligence be
different than say 2D video here when I
think about this it's useful to
disentangle two things um one is the
underlying representation and then two
is kind of the the user facing
affordances that you have um and here's
where where you can get sometimes
confused because um fundamentally we see
2D right like our retinas are 2D
structures in our bodies and we've got
two of them so like fundamentally our
visual system some perceives 2D images
um but the problem is that depending on
what representation you use there could
be different affordances that are more
natural or less natural so even if you
are at the end of the day you might be
seeing a 2D image or a 2d video um your
brain is perceiving that as a projection
of a 3D World so there's things you
might want to do like move objects
around move the camera around um in
principle you might be able to do these
with a purely 2D representation and
model but it's just not a fit to the
problems that you're the model to do
right like modeling the 2D projections
of a dynamic 3D world is is a function
that probably can be modeled but by
putting a 3D representation Into the
Heart of a model there's just going to
be a better fit between the kind of
representation that the model is working
on and the kind of tasks that you want
that model to do so our bet is that by
threading a little bit more 3D
representation under the hood that'll
enable better affordances for for users
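The distinction being drawn here, that a 2D image is a projection of a 3D world and that the projection discards structure, can be made concrete with a toy pinhole-camera sketch. This is a minimal illustrative example, not World Labs code; the `project` function and the sample points are assumptions chosen for illustration:

```python
# Toy illustration: why a 2D image is a lossy projection of a 3D world.
# A pinhole camera maps a 3D point (x, y, z) in camera coordinates to a
# 2D image point (f*x/z, f*y/z). Every 3D point along the same ray from
# the camera center lands on the same pixel, so depth is discarded.

def project(point3d, focal=1.0):
    """Pinhole projection of a 3D point (camera coordinates) to 2D."""
    x, y, z = point3d
    return (focal * x / z, focal * y / z)

# Two different 3D points on the same ray from the camera center...
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)   # same direction, twice as far away

# ...produce the identical 2D observation:
assert project(near) == project(far)  # both map to (0.25, 0.5)
```

Because infinitely many 3D scenes produce the same 2D observation, a purely 2D model has to infer the lost structure from pixels, whereas a model that keeps the 3D points distinct can re-render them from any viewpoint, which is the "better fit" for affordances like moving the camera or moving objects.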
This also goes back to the North Star for me. Why is it spatial intelligence, and why is it not flat pixel intelligence? Because I think the arc of intelligence has to lead to what Justin calls affordances. If you look at evolution, the arc of intelligence eventually enables animals, and especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this 3D world. Translating that into a piece of technology, that native 3D-ness is fundamentally important for the floodgate of possible applications, even if the serving of some of them looks 2D; it's innately 3D.

I think this is actually a very subtle and incredibly critical point, so it's worth digging into, and a good way to do that is to talk about use cases. Just to level-set: we're talking about creating a technology, call it a model, that can do spatial intelligence. In the abstract, what might that look like? And a little more concretely, what would be the potential use cases you could apply it to?

I think
there are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time. One that I'm really excited about is world generation. We're all used to text-to-image generators, and we're starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing two-second clip. But I think you could imagine leveling this up and getting 3D worlds out. One thing we could imagine spatial intelligence helping us with in the future is up-leveling these experiences into 3D, where you're not getting just an image or just a clip out, but a fully simulated, vibrant, interactive 3D world.

Maybe for gaming?

Maybe for gaming, maybe for virtual photography; you name it. If you got this to work, there'd be a million applications.

For education.

For education, yeah. One of my views is that, in some sense, this enables a new form of media, because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time. As a result, the place people drive this technological ability is video games. We do have the ability, as a society, to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its form today is games that can be sold for $70 apiece to millions and millions of people, to recoup the investment. If we had the ability to create these same virtual, interactive, vibrant 3D worlds cheaply, you could see a lot of other applications, because if you bring down the cost of producing that kind of content, people are going to use it for other things. What if you could have a personalized 3D experience that's as good, as rich, and as detailed as one of these AAA video games that cost hundreds of millions of dollars to produce, but catered to a very niche thing that maybe only a couple of people would want? That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm.

When I think about a
world, I actually think about things that are not just scene generation; I think about things like movement and physics. In the limit, is that included? And second, if I'm interacting with it, are there semantics? By that I mean: if I open a book, are there pages, and are there words in them, and do they mean something? Are we talking about a full-depth experience, or are we talking about a kind of static scene?

I think you'll see a progression of this technology over time. This is really hard stuff to build, so the static problem is a little bit easier. But in the limit, we want this to be fully dynamic, fully interactable, all the things you just said.

I mean, that's the definition of spatial intelligence.

Yeah. So there is going to be a progression; we'll start with more static, but everything you've said is on the roadmap of spatial intelligence.

I mean, this is kind
of in the name of the company itself, World Labs: the name is about building and understanding worlds. This is actually a little bit inside baseball; I realized after we told people the name that they don't always get it, because in computer vision, and in reconstruction and generation, we often make a distinction, a delineation, about the kinds of things you can do. The first level is objects: a microphone, a cup, a chair. These are discrete things in the world, and a lot of the ImageNet-style work that Fei-Fei did was about recognizing objects in the world. Then, leveling up, the next level beyond objects I think of as scenes; scenes are compositions of objects. Now we've got this recording studio, with a table and microphones and people in chairs: some composition of objects. But we envision worlds as a step beyond scenes. Scenes are maybe individual things, but we want to break the boundaries: step up from the table, go out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.

Another thing that's really
exciting is just the phrase "new media." With this technology, the boundary between the real world and the virtual, imagined, augmented, or predicted world becomes blurry. The real world is 3D, so in the digital world you have to have a 3D representation to even blend with the real world; you cannot have a 2D, or a 1D, representation and interface with the real 3D world in an effective way. With this, that's unlocked, so the use cases can be quite limitless.

Right. So the first use case Justin was talking about would be the generation of a virtual world for any number of purposes; the one you're just alluding to would be more augmented reality?

Yes. Just around
the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing." It was almost like they stole our...

But we're spatial intelligence.

So spatial computing needs spatial intelligence.

That's exactly right. We don't know what hardware form it will take: goggles, glasses, contact lenses. But that interface between the true real world and what you can do on top of it, whether it's to augment your capability to work on a piece of machinery and fix your car even if you're not a trained mechanic, or just to be in a Pokémon GO Plus-plus for entertainment, suddenly this piece of technology is going to be, basically, the operating system for AR, VR, and mixed reality.

In the limit, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff you're seeing, and maybe help you out with tasks in your daily life. But I'm
also really excited about this blend between virtual and physical. That becomes really critical: if you have the ability to understand what's around you in real time, in perfect 3D, it actually starts to deprecate large parts of the real world as well. Right now, how many differently sized screens do we all own for different use cases? Too many, right? You've got your phone, your iPad, your computer monitor, your TV, your watch. These are all basically different-sized screens, because they need to present information to you in different contexts and in different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it kind of deprecates the need for all of those; ideally, it just seamlessly blends the information you need to know in the moment with the right mechanism for giving it to you.

Another
huge case of being able to blend the digital virtual world with the 3D physical world is for agents to be able to do things in the physical world. Humans can use these mixed-reality devices to do things; like I said, I don't know how to fix a car, but if I have to, I can put on these goggles or glasses and suddenly be guided through it. But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world, but their compute, their brain, by definition, is the digital world. So what connects the learning to the behaving, between a robot's brain and the real world? It has to be spatial intelligence.

So you've talked
about virtual worlds, you've talked about more of an augmented reality, and now you've just talked about the purely physical world, which would basically be used for robotics. For any company, that would be a very large charter, especially if you're going to get into each of these different areas. So how do you think about the idea of deep tech versus any of these specific application areas?

We see ourselves as a deep tech company, a platform company that provides models that can serve different use cases.

Of these three, is there any one you think is more natural early on, one that people can expect the company to lean into?

I think it suffices to say the devices are not totally ready. I actually got my first VR headset in grad school, and it was one of those transformative technology experiences: you put it on and you're like, oh my God, this is crazy. I think a lot of people have that experience the first time they use VR, so I've been excited about this space for a long time. And I love the Vision Pro; I stayed up late to order one of the first ones the day it came out. But I think the reality is that it's just not there yet as a platform for mass-market appeal, so as a company we'll very likely move into a market that's more ready than that.

I
think there can sometimes be simplicity in generality. We have this notion of being a deep tech company: we believe there are some fundamental underlying problems that need to be solved really well, and that, if solved really well, can apply to a lot of different domains. We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.

So this is a lot of technology to build, it seems to me.

Yeah, I think it's a really hard problem. People who are not directly in the AI space sometimes see AI as one undifferentiated mass of talent, and those of us who have been here longer realize that a lot of different kinds of talent need to come together to build anything in AI, and this in particular. We've talked a little bit about the data problem, and a little bit about some of the algorithms I worked on during my PhD, but there's a lot of other stuff needed too. You need really high-quality, large-scale engineering; you need really deep understanding of the 3D world; and there are actually a lot of connections with computer graphics, because graphics has been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how to find the absolute top-of-the-world experts in each of the different subdomains necessary to build this really hard thing.

When I thought about how
we could form the best founding team for World Labs, I knew it had to start with a group of phenomenal multidisciplinary founders. Of course, Justin was a natural choice for me; Justin, cover your ears, was one of my best students and one of the smartest technologists I know. But there were two other people I had known by reputation, one of whom Justin had even worked with, whom I was drooling over. One is Ben Mildenhall; we talked about his seminal work on NeRF. The other is Christoph Lassner, who has a strong reputation in the computer graphics community, and who had the foresight to work on a precursor of the Gaussian splatting representation for 3D modeling five years before Gaussian splatting took off. When we talked about the potential possibility of working with Christoph Lassner, Justin just jumped out of his chair.

Ben and Christoph are legends.

And maybe just quickly talk
about how you thought about the build-out of the rest of the team, because again, there's a lot to build here and a lot to work on, not just in AI or graphics, but in systems and so forth.

Yeah. This is what, so far, I'm personally most proud of: the formidable team. I've had the privilege of working with the smartest young people in my entire career, from the top universities, being a professor at Stanford, but the kind of talent we've put together here at World Labs is just phenomenal; I've never seen this concentration. And I think the biggest differentiating element is that we're believers in spatial intelligence. All of the multidisciplinary talents, whether it's systems engineering, machine learning, ML infrastructure, generative modeling, data, or graphics; all of us, whether through our personal research journeys, technology journeys, or even personal hobbies, believe that spatial intelligence has to happen at this moment, with this group of people. That's how we really formed our founding team, and that focus of energy and talent is really just humbling to me. I just love it.

So I know you've been guided by a
North Star. The thing about North Stars is that you can't actually reach them, because they're in the sky, but they're a great way to get guidance. So how will you know when you've accomplished what you set out to accomplish? Or is this a lifelong thing that will continue indefinitely?

First of all, there are real North Stars and virtual North Stars; sometimes you can reach virtual North Stars.

Fair enough.

Good enough in the world model, exactly. Like I said, I thought one of my North Stars, storytelling from images, would take a hundred years, and Justin and Andrej, in my opinion, solved it for me. So we could get to our North Star. For me, it will be when so many people and so many businesses are using our models to unlock their needs for spatial intelligence; that's the moment I'll know we've reached a major milestone.

Actual deployment, actual impact.

Actually,
yeah, I don't think we're ever going to get there. I think this is such a fundamental thing: the universe is a giant, evolving, four-dimensional structure, and spatial intelligence writ large is understanding that in all of its depths and figuring out all the applications of it. We have a particular set of ideas in mind today, but I think this journey is going to take us places we can't even imagine right now.

The magic of good technology is that technology opens up more possibilities and unknowns, so we will keep pushing, and the possibilities will keep expanding.

Brilliant. Thank you, Justin; thank you, Fei-Fei. This was fantastic.

Thank you, Martin.

Thank you so much for listening to the a16z podcast. If you've made it this far, don't forget to subscribe so that you're the first to get our exclusive video content, or check out this video that we've hand-selected for you.