The age of multimodality: Insights from the frontier
By Reka AI
Summary
Topics Covered
- Humans Seamlessly Fuse Multimodal Senses
- Text Frontier Hits Data Ceiling
- Shift to Video Data Abundance
- Four Pillars Build World Models
- Multimodal AI Transforms Robotics
Full Transcript
Well, good morning everyone, and thank you for being here. Today's talk is about multimodality and why I think it represents a fundamental shift in how we build intelligence, not just an incremental update.
But in order to understand the importance of multimodality, I'd like to ground it by first discussing the highest level of intelligence that we know to date, which is in this room: human intelligence. I woke up this morning in my hotel room, still a bit jet-lagged, to the sound of my alarm clock; it's 3:00 a.m. in California right now. I smelled tea brewing. I saw the sunlight through my window, and I could feel the texture of the floor with my feet. So we live in this very rich multimodal world. Our brain continuously processes a constant stream of sights, sounds, smells, and signals, and somehow it weaves all of this information together seamlessly. For example, if we hear a horn and see a car, we know the horn comes from the car, which might be approaching us too fast. This is a very beautiful, highest form of intelligence, and we take it for granted.
Now, in computer science, one of our major goals, one of our holy grails, is to try to replicate this intelligence in a silicon-based system. In our quest to do that, we literally turn sand into silicon chips, and we build intricate systems and mechanisms to try to replicate this intelligence and inject it into a machine.
Traditionally, there have been two schools of thought on how we're supposed to do that. The first is based on rules and logic: people tried to program rules into the system. For example, if you're given a question like "Hey, what's your name?", how are you supposed to answer? You build the rules and logic into the system. But people quickly realized that while this works really well at small scale, it cannot produce a general-purpose system, because in practice there are just too many possible scenarios to face. For example, how would you respond to a friend who jokes sarcastically? Do you really need to hand-design every case, like operating in a rainy condition as opposed to a sunny one?
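To make that concrete, here is a minimal sketch of what a rule-based dialogue system looks like; the patterns and responses are hypothetical illustrations, not any particular historical system.

```python
import re

# A toy rule-based "chatbot": each rule maps a hand-written pattern to a
# canned response. The system only knows what its designer anticipated.
RULES = [
    (re.compile(r"what('?s| is) your name", re.I), "My name is RuleBot."),
    (re.compile(r"how are you", re.I), "I am fine, thank you."),
]

def respond(utterance: str) -> str:
    for pattern, reply in RULES:
        if pattern.search(utterance):
            return reply
    # The failure mode described above: no rule covers sarcasm, rain,
    # or anything else the designer did not foresee.
    return "I don't understand."

print(respond("Hey, what's your name?"))        # -> My name is RuleBot.
print(respond("Oh great, ANOTHER rainy day."))  # -> I don't understand.
```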
Now, a different camp, a different group of people, took a completely different approach to producing this intelligence. They built a system that can learn. Instead of programming rules into the system, you build a machine that can learn from a lot of data. For the longest time, this approach also didn't work very well, because while yes, we have a lot of data, we didn't really know how to make the best use of it, or even what kind of learning system to build that could make use of it.
But then there was a breakthrough. We saw the first sign of a general-purpose system with the invention of GPT. It works really well in a lot of cases, and it turns out that what works is remarkably simple: you feed essentially the entire internet to this machine, and then you have GPT. Obviously this takes a ton of compute, a massive amount that could power maybe an entire small village. You throw all that data into this model, this system, and it works really well. It can answer your questions, draft your emails, debug code, solve math problems, and it even helped me draft this presentation. So this is the kind of system that we have today.
This is the text frontier: a frontier built entirely on text. And then the natural next step after that: we realized this chatbot is great, it creates a way for us to interact with the machine in natural language, in multiple languages in fact, and we decided we wanted it to be able to do more, so we gave it access to tools. Now we have created agents. It's not just a chatbot; it becomes an agent that can interact with its environment. For example, if you give it access to the web and ask, "What's the weather like in Lisbon today?", it can check the web and tell you it's a bit cloudy. Going a step further, you can ask much more complex questions. For example: go find the latest research on reinforcement learning, or a topic of your choice, and create a one-page summary for me; synthesize all the information and what the consensus is about the best approaches in the field today. So this is what we have today, and it's amazing, it's remarkable.
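To illustrate the shift from chatbot to agent, here is a minimal sketch of a tool-use loop; `call_llm`, `get_weather`, and the `TOOL:`/`FINAL:` protocol are all hypothetical placeholders, not any real provider API.

```python
def get_weather(city: str) -> str:
    # Placeholder for a real web or weather-API lookup.
    return "a bit cloudy"

TOOLS = {"get_weather": get_weather}

def call_llm(prompt: str) -> str:
    # Placeholder: a real model decides whether to answer directly
    # or to request a tool call, usually via structured messages.
    if "weather" in prompt.lower() and "Observation:" not in prompt:
        return "TOOL:get_weather:Lisbon"
    return "FINAL:It's a bit cloudy in Lisbon today."

def agent(question: str) -> str:
    reply = call_llm(question)
    while reply.startswith("TOOL:"):
        _, name, arg = reply.split(":", 2)
        observation = TOOLS[name](arg)  # act on the environment
        reply = call_llm(f"{question}\nObservation: {observation}")
    return reply.removeprefix("FINAL:")

print(agent("What's the weather like in Lisbon today?"))
```

The key point is the loop: the model can request an action, observe the result, and fold that observation back into its answer.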
But I think it also has limitations: we have hit a ceiling. And the ceiling is there because we have run out of internet data. We have fed it the entire internet. Sure, we continue to produce new content on the internet, but the rate at which we write new content is just not fast enough to match the rate at which we want the system to improve. We want it to be better next year, and the year after; how much more content can we really produce in a year? So that's the first problem. The second problem is that there is now also so much AI-generated content on the internet that it's almost as if this powerful system is learning from its own output, its own biases, hallucinations, and mistakes. This data pollution makes the learning process less clean and less robust.
Now, where do we go from here? What is the next step in capability? We believe the answer is in the abundance of multimodal data. Every second, an enormous amount of multimodal data is being produced: videos uploaded to the internet, footage taken by the camera on your phone, cameras in buildings, sensors, microphones. Our vision, the idea, is to move from learning from the internet to learning from the physical world by using all of this multimodal data.
Now this raises a different question: how are you going to learn from all of this multimodal data? Obviously we need a different kind of system, one that can take as input not just text, which the previous frontier was built on, but also images, video, audio, and other sensors. And once you have all of this data, there is another question: how do you make the best use of it in this new multimodal frontier?
We have four key pillars in how we leverage this multimodal data. The first, similar to the previous approach, is that we still train on internet text: trillions and trillions of text tokens, predicting the next word. It's impressive what kinds of abilities can emerge from just predicting the next word; we've seen it in the past. This gives you systems that can converse with you in natural language. We think language is still the key scaffolding for abstract thought: it lets the model learn logic and reasoning, and it is a repository of the world's knowledge that is available on the internet. So this is the linguistic backbone of the model. Complete a paragraph, predict the next token: that is still key.
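As a rough illustration of that objective, here is a minimal next-token training step in PyTorch; the toy embedding-plus-linear model is a stand-in for the real transformer trained on trillions of tokens.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict the next token

logits = model(inputs)  # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # one gradient step of "predict the next word"
```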
The next thing we want to do is connect this language with vision. In this step, what we ask the AI to do during learning is to describe what is in an image or a video. In this example, you'll see: okay, this is an image of a traffic jam, it's in an urban setting, and it seems like it's just after the rain. So the AI learns to connect the concepts it has learned from training on trillions of tokens of internet data to pixels: to identify, hey, this is what a traffic jam looks like, this is what a car looks like. It builds a vocabulary that connects your language to pixels. So this is the next step.
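A minimal sketch of that captioning objective, with toy stand-ins for the vision encoder and the language model: condition next-token prediction on image features, so words get tied to pixels.

```python
import torch
import torch.nn as nn

dim, vocab_size = 64, 1000
vision_encoder = nn.Linear(3 * 32 * 32, dim)     # toy: flattened pixels -> feature
token_embed = nn.Embedding(vocab_size, dim)
decoder_head = nn.Linear(dim, vocab_size)        # toy language-model head

image = torch.rand(1, 3 * 32 * 32)               # stand-in for a real photo
caption = torch.randint(0, vocab_size, (1, 16))  # "a traffic jam after rain ..."

# Prepend the image feature to the caption embeddings, then train the
# model to predict each next caption token given the image.
img_feat = vision_encoder(image).unsqueeze(1)                    # (1, 1, dim)
seq = torch.cat([img_feat, token_embed(caption[:, :-1])], dim=1)
logits = decoder_head(seq)                                       # (1, 16, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), caption.reshape(-1)
)
loss.backward()
```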
And then the third pillar is learning how to detect and locate objects. So you're not just describing, hey, this is an image or video of a person speaking on stage, or, in this case, a construction zone with a bunch of people building some scaffolding; you also teach the model to locate or detect exactly where the person is. That essentially means predicting the coordinates of where the person is. You can ask, where is the helmet, or, in this case, show me the hole. What this enables is a model that also has some sort of spatial understanding, because it doesn't just describe an image; it can locate and detect objects in it.
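One common way to set this up, sketched below, is to serialize normalized box coordinates as plain text for the model to emit; the tag format here is an assumption for illustration, and different labs use different conventions.

```python
# Grounding as text: the training target for "where is the helmet?" is a
# string containing the label and its normalized bounding box.

def box_to_target(label: str, box: tuple) -> str:
    # box = (x_min, y_min, x_max, y_max), normalized to [0, 1]
    x0, y0, x1, y1 = box
    return f"<obj>{label}</obj><box>{x0:.3f},{y0:.3f},{x1:.3f},{y1:.3f}</box>"

print(box_to_target("helmet", (0.412, 0.130, 0.518, 0.244)))
# <obj>helmet</obj><box>0.412,0.130,0.518,0.244</box>
```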
And the last thing we do is ask the model to reconstruct videos, usually in a different style. In this process, you allow the model to learn some basic physics concepts. It learns that when people talk, they tend to move their hands, or that when you drop something, an object, an apple, it falls to the ground. Even though the model does not know the equation of gravity, it understands the concept. It would know that waves crash on the shore, or that ripples propagate outward. These kinds of basic concepts.
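As an illustration of why reconstruction teaches dynamics, here is a toy next-frame prediction step; the single-convolution predictor is purely illustrative, not a real video architecture.

```python
import torch
import torch.nn as nn

frames = torch.rand(1, 8, 3, 32, 32)         # (batch, time, channels, h, w)
context, target = frames[:, :-1], frames[:, -1]

predictor = nn.Conv2d(7 * 3, 3, kernel_size=3, padding=1)
stacked = context.reshape(1, 7 * 3, 32, 32)  # stack context frames as channels
prediction = predictor(stacked)              # guess the next frame

loss = nn.functional.mse_loss(prediction, target)
loss.backward()  # to predict well, the model must internalize rough dynamics
```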
Now, obviously, when you combine all of this together, it is not easy, and there are a lot of challenges, in particular around two things. First, if you want to learn from a lot more data, you naturally need more compute: text is only some portion of the data you want to train on, and on top of that you have all the other multimodal data, all these videos, images, and audio, that you want to learn from. Second, you have to find the right data: what kind of data do you want to train on, and what kind of annotations do you want to provide to the model, for example for locating an object or changing a video to a different style? You need to clean the data and align the data. For example, if you want the model to learn that when a glass shatters on the floor, this is the sound it makes, you need to find a video that actually contains that sound, among all the videos with background music or nearby conversation. So you need to do a lot of data cleaning.
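A sketch of what one such cleaning step might look like; the scoring functions are hypothetical placeholders for the trained audio-visual models a real pipeline would use.

```python
# Keep only clips whose audio plausibly matches the visual event, e.g. the
# actual shatter of a glass rather than overlaid background music.

def audio_visual_match(clip: dict) -> float:
    # Placeholder: a real system embeds the audio track and the frames
    # and returns a similarity score in [0, 1].
    return clip["av_similarity"]

def has_overlaid_music(clip: dict) -> bool:
    # Placeholder for a music/speech/effects classifier on the audio track.
    return clip["music_detected"]

def clean(clips: list, threshold: float = 0.8) -> list:
    return [
        c for c in clips
        if audio_visual_match(c) >= threshold and not has_overlaid_music(c)
    ]
```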
But if you combine all of them together, when it works, it works beautifully. And this is our vision: this is our world model. It means moving, as I said, from just learning from the internet to learning from the physical world. And the idea is not only to have these models that understand the physical world deployed in data centers. Yes, we have models that can process global data in data centers, but the idea is also to have hyper-efficient versions of these models that can run everywhere, on your devices: your phone, your car, your camera, maybe your smart glasses as well. So this is multimodal AI that you can deploy anywhere, and this is not just science fiction.
The applications are already transforming multiple industries. For example, in robotics: you can build on top of this model and tell your robot, hey, pick up the blue box on the right-hand side. The point is that the robot does not have to be programmed to pick up one specific thing; it understands your natural-language command.
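A sketch of that pattern: a vision-language model grounds the command to coordinates, much like the detection pillar above, and the controller acts on them. The `vlm` function and its output format are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class PickAction:
    label: str
    box: tuple  # normalized (x_min, y_min, x_max, y_max)

def vlm(image, command: str) -> PickAction:
    # Placeholder: a real model would ground "the blue box on the right"
    # to actual pixel coordinates in the camera image.
    return PickAction("blue box", (0.70, 0.40, 0.85, 0.60))

action = vlm(image=None, command="pick up the blue box on the right")
print(action)  # the controller would then move the gripper to action.box
```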
In other domains, like defense and security: for example, I could ask my home camera system, hey, send me a notification on my phone if there is somebody acting suspiciously. Or in disaster response: when you fly a drone over a disaster area, it would be able to report back, hey, I've identified something that looks like a dog behind a pile of rubble. This could be the difference between life and death. You might want to check it out.
And then in other areas, in automotive as well. For cars, this goes beyond self-driving: you could analyze footage, so that if somebody is driving and suddenly brakes, you can tell whether it's because they are driving recklessly or because a kid suddenly jumped into the street and they just wanted to be really cautious. Analyzing driver behavior and safety, these kinds of applications.
And then in media and entertainment: being able to search over millions of hours of footage and pinpoint exactly, hey, this is the moment I'm looking for, maybe when somebody scores a goal in soccer, or when somebody gets red-carded. Being able to predict whether a short-form video is likely to go viral, whether it is suitable for demographics in Europe or in North America, whether it is suitable for kids, or whether it poses some sort of brand-safety risk. These kinds of applications.
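A sketch of moment search over footage with a shared embedding space; the bag-of-words `embed` below is a toy stand-in for a trained multimodal encoder that would embed the actual frames and audio.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding so the demo is self-contained.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pretend these are embeddings of clips from millions of hours of footage.
clips = ["crowd waves flags", "player scores a goal", "referee shows a red card"]
index = np.stack([embed(c) for c in clips])

query = embed("the moment somebody scores a goal")
scores = index @ query                # cosine similarity against every clip
print(clips[int(np.argmax(scores))])  # -> "player scores a goal"
```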
And then lastly, in wearable devices and the internet of things. How often, for example, do you go to the grocery store with your shopping list, look at the shelf, and not manage to find exactly the item you're looking for? If you wear these glasses, maybe they could identify it for you: ah, okay, this is actually the item on your grocery shopping list. So yeah, this is the frontier, the frontier that we're all building on together. And this is what we envision. Thank you.