
“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

By a16z

Summary

Key Takeaways

  • **AI's Cambrian Explosion Beyond Text**: We are currently experiencing a Cambrian explosion in AI, where not only text but also pixels, videos, and audio are being integrated into AI applications and models, signifying a rapid expansion of possibilities. [01:21]
  • **Compute Power: The Unsung Hero of AI**: The exponential growth in computational power over the last decade has been a critical, often underestimated, driver of AI advancements, enabling models trained in days in 2012 to be trained in minutes today. [07:28], [08:03]
  • **Data Drives Models, Not Just Algorithms**: While algorithmic breakthroughs are important, the true power of AI is unleashed when data drives the models, as demonstrated by the massive bet on ImageNet, which scaled data far beyond previous norms. [06:05], [06:44]
  • **Spatial Intelligence: AI's Next Fundamental Frontier**: Spatial intelligence, enabling machines to perceive, reason, and act in 3D space and time, is as fundamental as language and represents the next critical frontier for AI advancement, moving beyond 1D representations. [19:01], [27:29]
  • **From Static Scenes to Dynamic 3D Worlds**: The evolution of AI in computer vision is moving from recognizing static objects and scenes to generating dynamic, interactive 3D worlds, a progression that requires a native 3D representation to unlock new media and applications. [35:35], [33:18]
  • **Deep Tech Platform for Spatial Intelligence**: World Labs is building a deep tech platform, focusing on solving fundamental problems in spatial intelligence to serve diverse use cases, from virtual world generation to augmented reality and robotics. [41:06], [40:43]

Topics Covered

  • The AI Cambrian Explosion: Beyond Text to Pixels, Video, and Audio
  • The Recipe for AI Magic: Data, Compute, and Algorithms
  • The Data-Driven AI Revolution: From Thousands to Internet Scale
  • From Objects to Worlds: The Next Frontier of AI
  • Spatial Intelligence: The Next Operating System for AR/VR

Full Transcript

visual spatial intelligence is so

fundamental it's as fundamental as

language we've got these ingredients

compute deeper understanding of data and

we've got some advancement of algorithms

we are in the right moment to really

make a bet and to focus and just unlock

[Music]

that over the last two years we've seen

this kind of massive Rush of consumer AI

companies and technology and it's been

quite wild but you've been doing this

now for decades and so maybe walk

through a little bit about how we got

here kind of like your key contributions

and insights along the way so it is a

very exciting moment right just zooming

back AI is in a very exciting moment I

personally have been doing this for

two decades plus and you know we have

come out of the last AI winter we have

seen the birth of modern AI then we have

seen deep learning taking off showing us

possibilities like playing chess but

then we're starting to see the

deepening of the technology and the

industry um adoption of some of

the earlier possibilities like language

models and now I think we're in the

middle of a Cambrian explosion in almost

a literal sense because now in addition

to texts you're seeing pixels videos

audios all coming out with possible AI

applications and models so it's a very

exciting moment I know you both so well

and many people know you both so well

because you're so prominent in the field

but not everybody like grew up in AI so

maybe it's kind of worth just going

through like your quick backgrounds just

to kind of level set the audience yeah

sure so I first got into AI uh at the

end of my undergrad uh I did math and

computer science for undergrad at

Caltech that was awesome but then

towards the end of that there was this

paper that came out that was at the time

a very famous paper the cat paper um

from Quoc Le and Andrew Ng and others that

were at Google brain at the time and

that was like the first time that I came

across this concept of deep learning um

and to me it just felt like this amazing

technology and that was the first time

that I came across this recipe that

would come to define the next like more

than a decade of my life which is that you

can get these amazingly powerful

learning algorithms that are very

generic couple them with very large

amounts of compute couple them with very

large amounts of data and magic things

started to happen when you combine those

ingredients so I I first came across

that idea like around 2011 2012-ish and

I just thought like oh my God this is

this is going to be what I want to do so

it was obvious you got to go to grad

school to do this stuff and then um sort

of saw that Fei-Fei was at Stanford one of

the few people in the world at the time

who was kind of on that train

and that was just an amazing time to be

in deep learning and computer vision

specifically because that was really the

era when this went from these first nascent

bits of technology that were just

starting to work and really got

developed and spread across a ton of

different applications so then over that

time we saw the beginning of language

modeling we saw the beginnings of

discriminative computer vision you could

take pictures and understand what's in

them in a lot of different ways we also

saw some of the early bits of what we

would now call generative modeling

generating images generating text a lot

of those core algorithmic pieces

actually got figured out by the academic

Community um during my PhD years like

there was a time I would just like wake

up every morning and check the new

papers on arXiv and just be ready it

was like unwrapping presents on

Christmas that like every day you know

there's going to be some amazing new

discovery some amazing new application

or algorithm somewhere in the world what

happened is in the last two years

everyone else in the world kind of came

to the same realization using AI to get

new Christmas presents every day but I

think for those of us that have been in

the field for a decade or more um we've

sort of had that experience for a very

long time obviously I'm much older than

Justin I come to AI through a

different angle which is from physics

because my undergraduate uh background

was physics but physics is the kind of

discipline that teaches you to think

audacious questions and think about

what is the remaining mystery of the

world of course in physics it's the atomic

world you know the universe and all that but

somehow that kind of training in thinking

got me into the audacious question that

really captured my own imagination which

is intelligence so I did my PhD in AI

and computational neuroscience at Caltech so

Justin and I actually didn't overlap but

we share

the same alma mater at Caltech oh and

the same adviser at Caltech yes same

adviser your undergraduate adviser and my

PhD adviser Pietro Perona and my PhD time

which is similar to your your your PhD

time was when AI was still in the winter

in the public eye but it was not in the

winter in my eye because it's that

pre-spring hibernation there's so much life

machine learning statistical modeling

was really gaining power and

I think I was one of the native

generation in machine learning and AI

whereas I look at Justin's generation as

the native deep learning generation so

machine learning was the precursor

of deep learning and we were

experimenting with all kinds of models

but one thing came out at the end of my

PhD and the beginning of my assistant

professorship

there was an

overlooked element of AI that is

mathematically important to drive

generalization but the whole field was

not thinking that way and it was Data

because we were thinking about um you

know the intricacy of Bayesian models or

or whatever you know um uh kernel

methods and all that but what was

fundamental that my students and my lab

realized probably uh earlier than most

people is that if you let data

drive models you can unleash the kind of

power that we haven't seen before and

that was really the reason we

went on a pretty

crazy bet on ImageNet which is you know

just forget about the scale we're

seeing now back then it was thousands of data

points the NLP community

had their own data sets I remember the UC

Irvine data set or some data set in

NLP it was small in comparison the vision

community had their data sets but all on

the order of thousands or tens of

thousands we were like we need to drive it

to internet scale and luckily it was

also the coming of age of the

internet so we were riding that wave and

that's when I came to Stanford so these

epochs are what we often talk about like

ImageNet is clearly the epoch that created you

know or at least like maybe made like

popular and viable computer vision and

the GenAI wave we talk about two kinds of

core unlocks one is like the

Transformers paper which is attention we

talk about Stable Diffusion is that a

fair way to think about this which is

like there's these two algorithmic

unlocks that came from Academia or

Google and like that's where everything

comes from or has it been more

deliberate or have there been other kind

of big unlocks that kind of brought us

here that we don't talk as much about

yeah I think the big unlock is compute

like I know the story of AI is often the

story of compute but no matter how

much people talk about it I think

people underestimate it right and the

amount of growth that

we've seen in computational power over

the last decade is astounding the first

paper that's really credited with the

like breakthrough moment in computer

vision for deep learning was AlexNet um

which was a 2012 paper where a deep

neural network did really well on the

ImageNet challenge and just blew away

all the other algorithms that Fei-Fei had been

working on the types of algorithms that

they'd been working on in grad

school that AlexNet was a 60 million

parameter deep neural network um and it

was trained for six days on two GTX 580s

which was the top consumer card at the

time which came out in 2010 um so I was

looking at some numbers last night just

to you know put these in perspective the

newest latest and greatest from

NVIDIA is the GB200 um do either of you

want to guess how much raw compute

factor we have between the GTX 580 and

the GB200 shoot no what go for it it's

uh it's in the thousands so I ran the

numbers last night that training run of six

days on two GTX 580s if you scale it

comes out to just under five minutes on

a single GB200
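
A quick back-of-the-envelope check of that claim, using only the numbers quoted in this conversation (six days on two GTX 580s, about five minutes on one GB200); a minimal sketch, not an exact benchmark:

```python
# Sanity-check the scaling claim using only figures quoted in the talk.
train_minutes = 6 * 24 * 60      # six days of training, in minutes
gpu_minutes = train_minutes * 2  # two GTX 580s -> total GPU-minutes of work

# If one GB200 does the same work in ~5 minutes, the implied raw
# per-GPU speedup is:
speedup = gpu_minutes / 5        # ~3456x, i.e. "in the thousands"
print(f"implied per-GPU compute factor: ~{speedup:.0f}x")
```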

Justin is making a really good point the 2012

AlexNet paper on the ImageNet challenge is

literally a very classic model and that

is the convolutional neural network model

and that was published in the 1980s the

first paper I remember as a graduate

student learning that and it more or

less also has six seven layers

practically the only difference between

AlexNet and that ConvNet what's the

difference is the GPUs the two GPUs and

the deluge of data yeah well so that's

where I was going to go which is like so

I think most people now are familiar

with like quote the bitter lesson and

the bitter lesson says if you make an

algorithm don't be cute yeah just make

sure you can take advantage of available

compute because the available compute

will show up right and so on the other

hand there's another narrative um which

seems to me to be like just as credible

which is like it's actually new data

sources that unlock deep learning right

like ImageNet is a great example but a

lot of people say self-attention from

Transformers is great but they'll also

say this is a way you can exploit human

labeling of data because like it's the

humans that put the structure in the

sentences and if you look at CLIP

they'll say well like we're using the

internet to like actually like have

humans use the alt tag to label images

right and so like that's a story of data

that's not a story of compute and so is

it just is the answer just both or is

like one more than the other or I think

it's both but you're hitting another

really good point so I think there's

actually two eras that to me feel quite

distinct in the algorithmics here so

like the ImageNet era is actually the

era of supervised learning um so in the

era of supervised learning you have a

lot of data but you don't know how to

use data on its own like the expectation

of ImageNet and other data sets of that time

period was that we're going to get a lot

of images but we need people to label

every one and all of the training data

that we're going to train on like a

person a human labeler has looked at

every one and said something about that

image yeah um and the big algorithmic

unlock was learning how to train on things

that don't require human-labeled data as

as the naive person in the room that

doesn't have an AI background it seems

to me if you're training on human data

like the humans have labeled it it's

just not explicit I knew you were gonna

say that Martin I knew that yes

philosophically that's a really

important question but that actually is

more try language than pixels fair

enough yeah 100 yeah yeah yeah yeah yeah

but I do think it's an important

thinked learn itel just more implicit

than explicit yeah it's still it's still

human labeled the distinction is that

for for this supervised learning era um

our learning tasks were much more

constrained so like you would have to

come up with this ontology of Concepts

that we want to discover right if you're

doing ImageNet like Fei-Fei and your

students at the time spent a lot of time

thinking about you know which thousand

categories should be in the ImageNet

challenge other data sets of that time

like the COCO data set for object

detection like they thought really hard

about which 80 categories we put in

there so let's walk to GenAI so

when I was doing my PhD before

you came I took machine

learning from Andrew Ng and then I took

like Bayesian something very complicated

from Daphne Koller and it was very

complicated for me a lot of that was

just predictive modeling um and then

like I remember the whole kind of vision

stuff that you unlock but then the

generative stuff has shown up like I

would say in the last four years which

is to me very different like you're not

identifying objects you're not you know

predicting something you're generating

something and so maybe kind of walk

through like the key unlocks that got us

there and then why it's different and if

we should think about it differently and

is it part of a Continuum is it not it

is so interesting even during my

graduate time generative models were there

we wanted to do generation nobody

remembers even with the letters and

numbers we were trying to do some you

know Jeff Hinton had papers on generation

we were thinking about how to

generate and in fact if

you think from a probability

distribution point of view you can

mathematically generate it's just

nothing we generated would ever impress

anybody right so this concept of

generation mathematically theoretically

is there but nothing worked so then I do

want to call out Justin's PhD and Justin

was saying that he got enamored by Deep

learning so he came to my lab Justin's

entire PhD is a story almost a mini

story of the trajectory of the

field he started his first project in

data I forced him to he didn't like

it so in retrospect I learned a lot of

really useful things I'm glad you say

that now so we moved Justin to

deep learning and the core problem there

was taking images and generating words

well actually there

were I think three discrete

phases here on this trajectory so the

first one was actually matching images

and words right like we have

we have an image we have words and can

we say how much they align so actually

my first paper both of my PhD and like

ever my first academic publication ever

was the image retrieval with scene

graphs and then we went into the generative

phase taking pixels generating words and

Justin and Andrej really worked on

that but that was still a very very

lossy way of generating and

getting information out of the pixel

world and then in the middle Justin went

off and did a very famous piece of work

and it was the first time that uh

someone made it real time right yeah

yeah so the story there is there was

this paper that came out in 2015 a

neural algorithm of artistic style led

by Leon Gatys and it was like the paper

came out and they showed like these

real world photographs that they

had converted into van Gogh style and like

we are kind of used to seeing things

like this in 2024 but this was in 2015

so this paper just popped up on arXiv

one day and it like blew my mind like I

just got this GenAI brainworm in

my brain in like 2015 and it like did

something to me and I thought like oh my

God I need to understand this algorithm

I need to play with it I need to make my

own images into van Gogh so then I like

read the paper and over a long weekend I

reimplemented the thing and got it to

work it was actually a very simple

algorithm um so like my implementation

was like 300 lines of Lua because at the

time it was Lua

this was pre-PyTorch so

we were using Lua Torch um but it was

like very simple algorithm but it was

slow right because it was an

optimization-based thing every image you

want to generate you need to run this

optimization loop this gradient descent

loop for every image that you generate
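
What Justin describes here is the optimization-based style transfer of Gatys et al.; below is a minimal PyTorch sketch of that loop, for illustration only (his original was roughly 300 lines of Lua Torch, and the layer choices and weights here are common defaults, not his exact settings):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.to(device).eval()

STYLE_LAYERS = {0, 5, 10, 19, 28}  # conv layers whose Gram matrices define style
CONTENT_LAYER = 21                 # conv layer whose activations define content

def features(x):
    style, content = [], None
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            style.append(x)
        if i == CONTENT_LAYER:
            content = x
    return style, content

def gram(x):
    _, c, h, w = x.shape             # assumes batch size 1
    f = x.view(c, h * w)
    return (f @ f.t()) / (c * h * w)  # style statistics of a feature map

# content_img / style_img: (1, 3, H, W) tensors; random stand-ins here.
content_img = torch.rand(1, 3, 256, 256, device=device)
style_img = torch.rand(1, 3, 256, 256, device=device)

with torch.no_grad():
    style_targets = [gram(s) for s in features(style_img)[0]]
    content_target = features(content_img)[1]

# The point made above: every output image needs its own gradient
# descent loop over the pixels, which is why the method was slow.
img = content_img.clone().requires_grad_(True)
opt = torch.optim.Adam([img], lr=0.02)
for step in range(300):
    opt.zero_grad()
    style_feats, content_feat = features(img)
    loss = F.mse_loss(content_feat, content_target)
    for s, t in zip(style_feats, style_targets):
        loss = loss + 1e3 * F.mse_loss(gram(s), t)
    loss.backward()
    opt.step()
```

The speedup Justin mentions next amortizes this per-image loop into a single feed-forward network, which is what made it real time.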

the images were beautiful but I just

wanted it to be faster and Justin

just did it and it was actually I think

your first taste

of an academic work having an industry

impact a bunch of people had seen this

artistic style transfer stuff at the

time and me and a couple others at the

same time came up with different ways to

speed this up yeah um but mine was the

one that got a lot of traction right so

I was very proud of Justin but there's

one more thing I was very proud of

Justin to connect to GenAI is that before

the world understood GenAI Justin's last

piece of work in his PhD which I

knew about because I was forcing you

to do it that one was fun that was

actually input

language and getting a whole picture out

it's one of the first GenAI works it's

using GANs which were so hard to use but

the problem is that we were not ready to

use a natural piece of language so

Justin you heard he worked on scene

graphs so we had to input a scene graph

language structure so you know the sheep

the grass the sky in a graph way

it literally was one of our photos right
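
The scene-graph input she describes can be pictured as structured objects plus relationships rather than free-form language; a hypothetical sketch (the exact schema in the scene-graph-to-image work differed):

```python
# Hypothetical scene-graph input: objects plus pairwise relationships.
scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        ("sheep", "standing on", "grass"),
        ("sky", "above", "grass"),
    ],
}
# A generative model (a GAN, in the work described here) conditions on
# this graph to lay out and render a complete image.
```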

and then he and another very good

master's student Agrim they got that

GAN to work so you can see from

data to matching to style transfer to

generative images we're starting

to see you asked if this is an abrupt

change for people like us it's already

happening as a continuum but for the world

the results are more

abrupt so I read your book and for those

that are listening it's a phenomenal

book like I really recommend you read

it and it seems for a long time like a

lot of you and I'm talking to you Fei-Fei

like a lot of your research has been you

know and your direction has been towards

kind of spatial stuff and pixel stuff

and intelligence and now you're doing

World labs and it's around spatial

intelligence and so maybe talk through

like you know has this been part of a

long journey for you like why did you

decide to do it now is it a technical

unlock is it a personal unlock just kind

of like move us from that kind of milieu of

AI research to World Labs sure for me

it is both personal and intellectual

right my entire you talk about my book

my entire intellectual journey is really

this passion to seek North Stars but

also believing that those North Stars are

critically important for the advancement

of our field so at the beginning

I remember after graduate school I

thought my North Star was telling stories

of uh images because for me that's such

an important piece of visual intelligence

that's part of what you call AI or AGI

but when Justin and Andrej did that I was

like oh my God that was my life's

dream what do I do next so it came a

lot faster I thought it would take a

hundred years to do that so um but

visual intelligence is my passion

because I do believe for every

intelligent uh

being like people or robots or some

other form um knowing how to see the

world reason about it interact in it

whether you're navigating or

manipulating or making things you can

even build civilization upon it

visual spatial intelligence is so

fundamental it's as fundamental as

language possibly more ancient and

more fundamental in certain ways so

it's very natural for me that World

Labs' North Star is to unlock

spatial intelligence the moment to me is

right to do it like Justin was saying

compute we've got these ingredients

we've got compute we've got a much

deeper understanding of data way deeper

than in the ImageNet days you know

compared to those days we're so

much more sophisticated and we've got

some advancement of algorithms including

co-founders in World Labs like Ben Mildenhall

and Christoph Lassner they were at

the cutting edge of NeRF that we are in

the right moment to really make a bet

and to focus and just unlock that so I

just want to clarify for for folks that

are listening to this which is so you

know you're starting this company World

Labs spatial intelligence is kind of how

you're generally describing the problem

you're solving can you maybe try to

crisply describe what that means yeah so

spatial intelligence is about machines

ability to perceive reason and act

in 3D space and time to

understand how objects and events are

positioned in 3D space and time how

interactions in the world can affect

those 3D and 4D positions over

space time

um and both sort of perceive reason

about generate interact with really take

the machine out of the mainframe or out

of the data center and putting it out

into the world and understanding the 3D

4D world with all of its richness so to

be very clear are we talking about the

physical world or are we just talking

about an abstract notion of world I

think it can be both I think it can be

both and that encompasses our vision

long term even if you're generating

worlds even if you're generating content

um doing that positioned in 3D with

3D uh has a lot of benefits um or if

you're recognizing the real world being

able to put 3D understanding into the

into the real world as well is part of

it great so I mean just for everybody

listening like the two other co-founders

Ben Mildenhall and Christoph Lassner are

absolute legends in the field at

the same level these four decided to

come out and do this company now and so

I'm trying to dig into why

now is the right time yeah I

mean this is again part of a longer

evolution for me but like really after

PhD when I was really wanting to develop

into my own independent researcher

for my later career I was just

thinking what are the big problems in Ai

and computer vision um and the

conclusion that I came to about that

time was that the previous decade had

mostly been about understanding data

that already exists um but the next

decade was going to be about

understanding new data and if we think

about that the data that already exists

was all of the images and videos that

maybe existed on the web already and the

next decade was going to be about

understanding new data right like people

have smartphones

smartphones have cameras those

cameras have new sensors those cameras

are positioned in the 3D world it's not

just you're going to get a bag of pixels

from the internet and know nothing about

it and try to say if it's a cat or a dog

we want to treat images as

universal sensors to the physical world

and how can we use that to understand

the 3D and 4D structure of the world um

either in physical spaces or

generative spaces so I made a pretty big

pivot post PhD into 3D computer vision

predicting 3D shapes of objects with

some of my colleagues at FAIR at the

time then later I got really enamored by

this idea of learning 3D structure

through 2D right because we talk about

data a lot it's um you know 3D data

is hard to get on its own um but

there's a very strong

mathematical connection here um our 2D

images are projections of a 3D world and

there's a lot of mathematical structure

here we can take advantage of so even if

you have a lot of 2D data

a lot of people have done

amazing work to figure out how can you

back out the 3D structure of the world

from large quantities of 2D observations
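
The mathematical structure being referred to is, at its core, perspective projection; a minimal sketch (intrinsics and pose below are made-up illustrative values):

```python
import numpy as np

# A 2D image point is a projection of a 3D world point: x ~ K [R | t] X.
K = np.array([[500.0,   0.0, 320.0],   # focal lengths and principal point
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera rotation (identity here)
t = np.array([0.0, 0.0, 4.0])          # camera translation

def project(X):
    """Project a 3D world point to 2D pixel coordinates."""
    x_cam = R @ X + t                  # world -> camera frame
    x_hom = K @ x_cam                  # camera frame -> homogeneous pixels
    return x_hom[:2] / x_hom[2]        # perspective divide

print(project(np.array([0.5, -0.2, 1.0])))   # one 2D observation
# Multi-view reconstruction runs this backwards: from many such 2D
# observations in posed cameras, solve for the 3D structure.
```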

um and then in 2020 you asked about

breakthrough moments there was a really

big breakthrough moment from our

co-founder Ben Mildenhall at the time

with his paper NeRF Neural Radiance Fields

and that was a very simple very clear

way of backing out 3D structure from 2D

observations that just lit a fire under

this whole space of 3D computer vision
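
The core idea he's crediting can be sketched in a few lines: a small network maps 3D positions to color and density, and pixels are rendered by compositing samples along camera rays. This toy version is illustrative only, not the original implementation (which also uses positional encoding and view direction):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # RGB (3) + density sigma (1)
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])

def render_ray(model, origin, direction, n=64, near=2.0, far=6.0):
    """Volume-render one ray: sample points, alpha-composite by density."""
    ts = torch.linspace(near, far, n)
    pts = origin + ts[:, None] * direction           # samples along the ray
    rgb, sigma = model(pts)
    alpha = 1.0 - torch.exp(-sigma * (far - near) / n)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), trans[:-1]])   # light reaching each sample
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)       # composited pixel color

# Training (not shown) optimizes the MLP so rendered rays match the
# pixels of posed 2D photos -- backing 3D structure out of 2D images.
model = TinyNeRF()
pixel = render_ray(model, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```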

I think there's another aspect here that

maybe people outside the field don't

quite understand which is that it was also a time

when large language models were starting

to take off so a lot of the stuff with

language modeling actually had gotten

developed in Academia even during my PhD

I did some early work with Andrej

Karpathy on language modeling in 2014

LSTMs I still remember LSTMs RNNs GRUs

like this was pre-Transformer um but

then at some point like

around the GPT-2 time you couldn't

really do those kind of models anymore

in academia because they took way

more resources but there was one really

interesting thing the NeRF

approach that Ben came up with like

you could train these in an hour a

couple hours on a single GPU so I think

at that time there was a

dynamic here that happened which is that

I think a lot of academic researchers

ended up focusing on a lot of these

problems because there was core

algorithmic stuff to figure out and

because you could actually do a lot

without a ton of compute and you could

get state-of-the-art results on a single

GPU because of those dynamics

a lot of

researchers in academia were moving to

think about what are the core

algorithmic ways that we can advance

this area as well uh then I ended up

chatting with Fei-Fei more and I realized that

we were actually she's very convincing

she's very convincing well there's that

but like you know we talk about

trying to like figure out your own

independent research trajectory from your

adviser well it turns out we ended up

converging on

similar things okay well from my end when I

want to talk to the smartest person I

call Justin there's no question about it

uh I do want to talk about a very

interesting technical issue or

technical story of pixels that most

people who work in language don't realize

is that in the pre-GenAI era in the field of computer

vision those of us who work on pixels

we actually have a long history in an

area of research called reconstruction

3D reconstruction which is you know it

dates back to the 70s you know you can

take photos because humans have two eyes

right so it generally starts with stereo

photos and then you try to triangulate

the geometry and uh make a 3D shape out

of it it is a really really hard problem

to this day it's not fundamentally

solved because there's the

correspondence problem and all that so

this whole field which is an older way of

thinking about 3D has been going around

and it has been making really good

progress but when

NeRF happened in the context of

generative methods in the context of

diffusion models

suddenly reconstruction and generation

started to really merge and now like

within really a short period of time in

the field of computer vision it's hard

to talk about reconstruction versus

generation anymore we suddenly have a

moment where if we see something or if

we imagine something both can converge

towards generating it right right and

that's just to me a really important

moment for computer vision but most

people missed it because we're not

talking about it as much as llms right

so in pixel space there's reconstruction

where you reconstruct

like a scene that's real and then if you

don't see the scene then you use

generative techniques right so these

things are kind of very similar

throughout this entire conversation

you're talking about languages and

you're talking about pixels so maybe

it's a good time to talk about how like

spatial intelligence and what you're

working on

contrasts with language approaches which

of course are very popular now like is

it complementary is it orthogonal yeah I

think they're complementary I

don't mean to be too leading here like

maybe just contrast them like everybody

says like listen I know OpenAI

and I know GPT and I know multimodal

models and a lot of what you're talking

about is like they've got pixels and

they've got languages and like doesn't

this kind of do what we want to do with

spatial reasoning yeah so I think to do

that you need to open up the Black Box a

little bit of how these systems work

under the hood um so with language

models and the multimodal language

models that we're seeing nowadays

their underlying

representation under the hood is a

one-dimensional representation we talk

about context lengths we talk about

Transformers we talk about sequences

attention fundamentally their

representation of the world is

one-dimensional so these things

fundamentally operate on a

one-dimensional sequence of tokens so

this is a very natural representation

when you're talking about language

because written text is a

one-dimensional sequence of discrete

letters so that kind of underlying

representation is the thing that led to

llms and now the multimodal llms that

we're seeing now you kind of end up

shoehorning the other modalities into

this underlying representation of a 1D

sequence of tokens
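
To make the contrast concrete, an illustrative comparison of the two underlying representations being discussed (all shapes here are made up):

```python
import numpy as np

# LLMs, including multimodal ones: everything is shoehorned into a
# one-dimensional sequence of token embeddings.
seq_len, d_model = 2048, 4096
tokens = np.zeros((seq_len, d_model))   # (position, features): 1D structure

# A 3D-native representation keeps space front and center, e.g. a voxel
# grid over x, y, z (one simple choice among many possible 3D structures).
X, Y, Z, C = 64, 64, 64, 16
voxels = np.zeros((X, Y, Z, C))         # (x, y, z, features): 3D structure

# An image fed to an LLM becomes just more 1D tokens; a voxel keeps its
# neighbors in all three spatial dimensions.
print(tokens.shape, voxels.shape)
```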

now when we move to spatial intelligence it's kind of

going the other way where we're saying

that the three-dimensional nature of the

world should be front and center in the

representation so from an algorithmic

perspective that opens up the door for

us to process data in different ways to

get different kinds of outputs out of it

um and to tackle slightly different

problems so even at a coarse level

you kind of look from the outside and you say

oh multimodal LLMs can look at images too

well they can but I think

they don't have that fundamental 3D

representation at the heart of their

approaches I totally agree with Justin I

think talking about the 1D versus

fundamental 3D representation is one of

the most core differentiators the other

thing is slightly philosophical but

it's really important for me at least

is language is fundamentally a purely

generated signal there's no language out

there you don't go out into nature and

there's words written in the sky for you

whatever data you feed in you pretty

much can just somehow regurgitate with

enough

generalizability the same data

out and that's language to language and

but the 3D world is not like that there is a 3D

world out there that follows laws of

physics that has its own structures due

to materials and many other things

and to fundamentally back that

information out and be able to represent

it and be able to generate it is just

fundamentally quite a different

problem we will be borrowing um similar

ideas or useful ideas from language and

LLMs but this is fundamentally

philosophically to me a different

problem right so language is 1D and

probably a bad representation of the

physical world because it's been

generated by humans and it's probably

lossy there's a whole other modality

of generative AI models which are pixels

and these are 2D images and 2D video and

like one could say that like if you look

at a video it looks you know you can see

3D stuff because like you can pan a

camera or whatever it is and so like how

would like spatial intelligence be

different than say 2D video here when I

think about this it's useful to

disentangle two things um one is the

underlying representation and then two

is kind of the user-facing

affordances that you have um and here's

where you can sometimes get

confused because um fundamentally we see

2D right like our retinas are 2D

structures in our bodies and we've got

two of them so like fundamentally our

visual system sort of perceives 2D images

um but the problem is that depending on

what representation you use there could

be different affordances that are more

natural or less natural so even if you

are at the end of the day you might be

seeing a 2D image or a 2D video um your

brain is perceiving that as a projection

of a 3D world so there's things you

might want to do like move objects

around move the camera around um in

principle you might be able to do these

with a purely 2D representation and

model but it's just not a fit to the

problems that you're the model to do

right like modeling the 2D projections

of a dynamic 3D world is is a function

that probably can be modeled but by

putting a 3D representation Into the

Heart of a model there's just going to

be a better fit between the kind of

representation that the model is working

on and the kind of tasks that you want

that model to do so our bet is that by

threading a little bit more 3D

representation under the hood that'll

enable better affordances for users

and this also goes back to the North Star

for me you know why is it spatial

intelligence why is it not flat pixel

intelligence is because I think the arc

of intelligence has to go to what Justin

calls affordances and the arc of

intelligence if you look at evolution

right the arc of intelligence eventually

enables animals and humans especially

humans as intelligent animals to move

around the world interact with it create

civilization create life make a

sandwich whatever you do in this 3D

world and translating that into a

piece of technology that native 3D-ness

is fundamentally important for the

floodgate of possible

applications even if for some of them

the serving looks 2D

it's innately 3D um to me I think

this is actually a very subtle yeah and

incredibly critical point and so I think

it's worth digging into and a good way

to do this is talking about use cases

and so just to level set this we're

talking about generating a technology

let's call it a model that can do

spatial intelligence so maybe in the

abstract what might that look like kind

of a little bit more concretely what

would be the potential use cases that

you could apply this to so I think

there's a couple different

kinds of things we imagine these

spatially intelligent models able to do

over time um and one that I'm really

excited about is world generation we're

all used to something like a

text-to-image generator or starting to see

text-to-video generators where you put in

a prompt and out pops an

amazing image or an amazing two-second clip

um but I think you could imagine

leveling this up and getting 3D worlds

out so one thing that we could imagine

spatial intelligence helping us with in

the future are upleveling these

experiences into 3D where we're not

getting just an image out or just a clip

out but you're getting out a full

simulated but vibrant and interactive 3D

World for gaming maybe for gaming right

maybe for gaming maybe for virtual

photography like you name it I

think even if you got this to work

there'd be a million

applications for education yeah for

education I mean I guess one of

my things is that like in some

sense this enables a new form of media

right because we already have the

ability to create virtual interactive

worlds um but it costs hundreds of

hundreds of millions of dollars

and a ton of development time and as a

result the place where

people drive this technological ability

is video games right because if we do

have the ability as a society to create

amazingly detailed virtual interactive

worlds that give you amazing experiences

but because it takes so much labor to do

so then the only economically viable use

of that technology in its form today is

games that can be sold for $70 a

piece to millions and millions of people

to recoup the investment if we had the

ability to create these same virtual

interactive vibrant 3D worlds um you

could see a lot of other applications of

this right because if you bring down

that cost of producing that kind of

content then people are going to use it

for other things what if you could have

an interactive sort of personalized

3D experience that's as good and as rich

as detailed as one of these AAA video

games that cost hundreds of millions of

dollars to produce but it could be

catered to like this very Niche thing

that only maybe a couple people would

want that particular thing that's not a

particular product or a particular road

map but I think that's a vision of a new

kind of media that would be enabled by

um spatial intelligence in the

generative realms if I think about a

world I actually think about things that

are not just scene generation I think

about stuff like movement and physics

and so like in the limit is that

included absolutely and then the second one is

if I'm interacting with it

are there semantics and I mean

by that like if I open a book are there

like pages and are there words in it and

do they mean something like are we talking

a full-depth experience or are we

talking about like kind of a static

scene I think you'll see a progression of

this technology over time this is really

hard stuff to build so I think the

static problem is a little

bit easier um but in the limit I think

we want this to be fully Dynamic fully

interactable all the things that you

just said I mean that's the definition

of spatial intelligence yeah so there

is going to be a progression we'll start

with more static but everything you've

said is in the road map of

spatial intelligence I mean this is kind

of in the name of the company itself

World Labs um like the world is about

building and understanding worlds and

like this is actually a little bit

inside baseball I realized after we told

the name to people they don't always get

it because in computer vision and

reconstruction and generation we often

make a distinction or a delineation

about the kinds of things you can do um

and kind of the first level is objects

right like a microphone a cup a chair

like these are discrete things in the

world um and a lot of the ImageNet-style

stuff that Fei-Fei worked on was about

recognizing objects in the world then

leveling up the next level above objects I

think of is scenes like scenes are

compositions of objects like now we've

got this recording studio with a table

and microphones and people in chairs at

some composition of objects but then

like we envision worlds as a step

beyond scenes right like scenes are kind

of maybe individual things but we want

to break the boundaries go outside the

door like step up from the table walk

out from the door walk down the street

and see the cars buzzing past and see

like the leaves on the tree

moving and be able to interact with

those things another thing that's really

exciting is just to mention the word New

Media with this technology the boundary

between the real world and the virtual imagined

world or augmented world or predicted

world is all blurry there

the real world is 3D right so in the

digital world you have to have a

3D representation to even blend with the

real world you know you cannot have a 2D

you cannot have a 1D to be able to

interface with the real 3D World in an

effective way and with this it unlocks

it so the use cases can be

quite limitless because of this right so

the first use case that Justin was

talking about would be like the

generation of a virtual world for any

number of use cases one that you're just

alluding to would be more of an

augmented reality right yes just around

the time World Labs was being

formed the Vision Pro was released by Apple

and they used the term spatial

computing it's almost like they

stole

our thunder but we're spatial intelligence so

spatial Computing needs spatial

intelligence that's exactly right so we

don't know what hardware form it will

take whether it will be goggles glasses

or contact lenses but that interface

between the true real world and what you

can do on top of it whether it's to help

you to augment your capability to work

on a piece of machinery and fix your car

even if you are not a trained mechanic

or to just be in a Pokémon Go++ for

entertainment suddenly this piece of

technology is going to be the

operating system basically for

AR VR and mixed reality in the limit like what does

an AR device need to do it's this thing

that's always on it's with you

it's looking out into the world so it

needs to understand the stuff that

you're seeing um and maybe help you out

with tasks in your daily life but

I'm also really excited about this blend

between virtual and physical that

becomes really critical if you have the

ability to understand what's around you

in real time in perfect 3D then it

actually starts to deprecate large parts

of the real world as well like right now

how many differently sized screens do we

all own for different use cases too many

right you've got your phone

you've got your iPad

you've got your computer monitor you've

got your TV

you've got your watch like these are all

basically differently sized screens because

they need to present information to you

in different contexts and in

different positions but if you've got

the ability to seamlessly blend virtual

content with the physical world it kind

of deprecates the need for all of those

it just ideally seamlessly Blends

information that you need to know in the

moment with the right mechanism

of giving you that information another

huge use case of being able to blend the

digital virtual world with the 3D

physical world is for enabling agents to

be able to do things in the physical

world and if humans use these mixed reality

devices to do things like I said I don't

know how to fix a car but if I have to I

put on these goggles or glasses and

suddenly I'm guided to do that but there

are other types of Agents namely robots

any kind of robots not just humanoid and

uh their interface by definition is the

3D world but their compute their

brain by definition is the digital world

so what connects that from the learning

to the behaving between a robot brain and

the real world it has to be

spatial intelligence so you've talked

about virtual worlds you've talked about

kind of more of an augmented reality and

now you've just talked about the purely

physical world basically which would be

used for robotics um for any company

that would be like a very large Charter

especially if you're going to get into

each one of these different areas so how

do you think about the idea of deep

tech versus any of these specific

application areas we see ourselves as a

deep tech company as the platform

company that provides models

that can serve different use cases of

these three is there any one that you

think is kind of more natural early on

that people can kind of expect the

company to lean into or is it I think

it suffices to say the devices are not

totally ready actually I got my first VR

headset in grad school um and just like

that's one of these transformative

technology experiences you put it on

you're like oh my God like this is crazy

and I think a lot of people have that

experience the first time they use VR um

so I've been excited about this space

for a long time and I love the Vision

Pro like I stayed up late to order one

of the first ones like the first day it

came out um but I think the reality is

it's just not there yet as a platform

for mass market appeal so very likely as

a company we will move into a

market that's more ready then I

think there can sometimes be simplicity

in generality right like we

have this notion of being a deep tech

company we believe that there are some

underlying fundamental problems that

need to be solved really well and if

solved really well can apply to a lot of

different domains we really view this

long arc of the company as building and

realizing the dreams of spatial

intelligence writ large so this is a lot of

technology to build it seems to me yeah

I think it's a really hard problem um I

think sometimes from people who are not

directly in the AI space they just see

it as like AI as one undifferentiated

mass of talent um and for those of

us who have been here for

longer you realize that there's a lot of

different kinds of

talent that need to come together to

build anything in AI in particular

this one we've talked a little bit about

the data problem we've talked a

little bit about some of the algorithms

that I worked on during my PhD

but there's a lot of other stuff we need

to do this too um you need really high

quality large scale engineering you need

really deep understanding of the

3D world and there's

actually a lot of connections with

computer graphics um because they've

been kind of attacking a lot of the same

problems from the opposite

direction so when we think about Team

Construction we think about how do we

find the absolute

best experts in the world at each

of these different subdomains that are

necessary to build this really hard

thing when I thought about how

we form the best founding team for World

Labs it has to start with a group

of phenomenal multidisciplinary founders

and of course Justin is natural for me

Justin cover your ears as one of my

best students and one of the smartest

technologists but there are

two other people I have known by

reputation and one of them Justin

even worked with that I was drooling for

right one is Ben Mildenhall we talked about

his seminal work on NeRF but another

person is Christoph Lassner who is

renowned in the community of computer

graphics and especially he had the

foresight of working on a precursor of

the Gaussian splatting representation for

3D modeling five years right before

Gaussian splatting took off and when

we talked about the

potential possibility of working with

Christoph Lassner Justin just jumped out of

his chair Ben and Christoph are

legends and maybe just quickly talk

about kind of like how you thought about

the build out of the rest of the team

because again like it's you know there's

a lot to build here and a lot to work on

not just in kind of AI or Graphics but

like systems and so forth yeah um this

is what so far I'm personally most proud

of is the formidable team I've had the

privilege of working with the smartest

young people in my entire career right

from the top universities being a

professor at Stanford but the kind of

talent that we put together here

at World Labs is just phenomenal I've

never seen the concentration and I think

the biggest

differentiating um element here is that

we're believers in spatial

intelligence all of the

multidisciplinary talents whether it's

systems engineering machine

learning infra to you know uh generative

modeling to data to you know Graphics

all of us whether it's our personal

research journey or technology

journey or even personal hobby we

believe that spatial intelligence has to

happen at this moment with this group of

people and uh that's how we really formed

our founding team and that focus

of energy and talent is really

just humbling to me I just love

it so I know you've been guided by a

North Star so something about North Stars

is like you can't actually reach

them because they're in the sky but it's

a great way to have guidance so how will

you know when you've accomplished what

you've set out to accomplish or is this

a lifelong thing that's going to

continue kind of infinitely first of all

there's real northstars and virtual

North Stars sometimes you can reach

virtual northstars fair enough good

enough in the world in the world model

exactly like I said I thought one of my

Northstar that would take a 100 years

with storytelling of images and uh

Justin and Andre you know in my opinion

solved it for me so um so we could get

to our Northstar but I think for me is

when so many people and so many

businesses are using our models to

unlock their um needs for spatial

intelligence and that's the moment I

know we have reached a major Milestone

actual deployment actual impact actually

yeah I I don't think going to get there

um I I think that this is such a

fundamental thing like the universe is a

giant evolving four-dimensional

structure and spatial intelligence writ

large is just understanding that in all

of its depths and figuring out all the

applications to that so I think that

we have a particular set of

ideas in mind today but I

think this journey is going to take us

places that we can't even imagine right

now the magic of good technology is that

technology opens up more possibilities

and unknowns so we will be pushing

and then the possibilities will be

expanding brilliant thank you Justin

thank you Fei-Fei this was fantastic thank

you Martin thank you Martin thank you so

much for listening to the a16z podcast

if you've made it this far don't forget

to subscribe so that you are the first

to get our exclusive video content or

you can check out this video that we've

hand selected for you
