Google DeepMind robotics lab tour with Hannah Fry
By Google DeepMind
Summary
## Key takeaways

- **Open Lab, Robust Vision**: The models are now trained with much more robust visual backbones, so lighting and backgrounds matter far less. The visual generalization part of the problem is much more solved than it was four years ago. [01:16], [01:23]
- **VLAs Enable Action Generalization**: We developed VLAs, vision-language-action models, putting actions on the same footing as vision and language tokens so the models can figure out new sequences of actions for new situations. We've seen massive improvements in action generalization. [02:28], [02:48]
- **1.5 Agents Chain Long Horizons**: The agent can orchestrate smaller moves into longer-horizon tasks, like checking the London weather and then packing a bag. This brings another layer of intelligence to do the full task instead of short-horizon actions. [03:04], [03:20]
- **Thinking Before Acting Boosts Performance**: The thinking component makes the robot output its thoughts before taking an action, forcing it to consider what it's about to do. This improves generalization and performance, just like chain of thought in language models. [03:54], [04:05]
- **Gemini Layer Handles Unseen Objects**: With a Gemini layer on top, the robot interacts with completely new objects, like a stress ball or a green pear-shaped container it has never seen, generalizing to open instructions such as "put the pink blob inside the green pear". [08:10], [09:24]
- **Data Limits Robot Capabilities**: Robots take a lot of data to learn tasks; we need a breakthrough to learn more efficiently, because physical interaction data is limited, unlike the internet. Learning from human manipulation videos could increase how capable robots are. [15:39], [16:23]
Topics Covered
- VLMs Unlock Robot Generalization
- Agents Chain Long-Horizon Tasks
- Thinking Boosts Robot Performance
- Robot Data Limits Scale
Full Transcript
Welcome to Google DeepMind: The Podcast, with me, your host, Hannah Fry. Now, you might remember that earlier this year I got to sit down with Carolina Parada, who is the head of robotics at Google DeepMind, and she was talking all about taking Gemini's multimodal reasoning and embedding it into a physical body. And since we were coming to California for this trip to see what Google DeepMinders are doing on this side of the Atlantic, obviously the robotics lab was top of the list.
Now, you have to remember that these aren't those fancy pre-programmed robots doing backflips, right? This is something completely different. These robots are open-ended. They understand the instructions that you give them and are able to flexibly respond and adapt to an unlimited number of tasks. Now, our tour guide for the day is Kanishka Rao, who is the director of robotics at Google DeepMind.
I haven't been into a DeepMind robotics lab since, I think, 2021.
>> Oh, okay.
>> I mean, already it looks quite different. You haven't got the privacy screens.
>> Yeah, they've gone.
>> They've gone. Yeah.
>> You don't need them anymore.
>> No. I mean, we have the whole lab here in the open.
>> Is it that they're more capable of focusing?
>> Uh, yeah. The models are now trained with much more robust visual backbones, so we don't care as much about the lighting or the backgrounds. The visual generalization part of the problem is much more solved than it was four years ago.
>> Big improvements.
>> Big improvements, yeah. Okay. There have been a few big breakthroughs in robotics in the last couple of years, and we're excited to show those today.
>> Yeah. I mean, it might only be four years, but it's basically an ocean of time in terms of what's changed.
>> Robotics looks very different than it did four years ago.
>> What are the big changes then? I mean, large language models, multimodal models?
>> Yeah. So basically we want robots to be general, and to be general for human use these robots must be able to understand general-purpose human concepts. The big breakthrough in the last few years has been building robotics on top of these other, bigger models, these large vision language models, and it turns out they have a great understanding of general world things. So the latest robot models are now built on top of that, and we're seeing incredible improvements in how they generalize to new scenes, new visuals, and new instructions. So yeah, robotics is way more general than it was a few years ago.
>> Because I was talking to Carolina earlier this year, and she was saying that it's not even just vision language models to perceive the scene around it, but also to plan the actions that it's doing.
>> Yeah. So basically we developed these things called VLAs, which are vision, language, and action models. What we did with those is we took actions, the physical actions the robot is doing in the world, and put them on the same footing as the vision and language tokens. So now these models can model these sequences and, given a new situation, figure out what new sequence of actions to do there. We call this action generalization, and even in this we've seen massive improvements in the last few years.
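To make the idea of putting actions on the same footing as vision and language tokens concrete, here is a minimal sketch, assuming a simple per-dimension binning scheme; the vocabulary size, bin count, and function names are illustrative placeholders, not the actual Gemini Robotics implementation.

```python
# A minimal sketch (not the Gemini Robotics implementation) of the core VLA idea:
# continuous robot actions are discretized into tokens that share a vocabulary
# with text and image tokens, so one sequence model can predict them.
import numpy as np

TEXT_VOCAB_SIZE = 32_000      # illustrative size of the text/vision vocabulary
ACTION_BINS = 256             # bins per action dimension (assumption)
ACTION_DIMS = 7               # e.g. 6-DoF end-effector delta + gripper

def actions_to_tokens(action: np.ndarray, low=-1.0, high=1.0) -> list[int]:
    """Map each continuous action dimension to a discrete token id placed
    after the text vocabulary, so actions live in the same token space
    as words and image patches."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * (ACTION_BINS - 1)).astype(int)
    return [TEXT_VOCAB_SIZE + d * ACTION_BINS + int(b) for d, b in enumerate(bins)]

def tokens_to_actions(tokens: list[int], low=-1.0, high=1.0) -> np.ndarray:
    """Invert the mapping at inference time to recover motor commands."""
    bins = np.array([(t - TEXT_VOCAB_SIZE) % ACTION_BINS for t in tokens])
    return low + bins / (ACTION_BINS - 1) * (high - low)

# Example: a small end-effector move plus gripper close becomes 7 "action words".
action = np.array([0.02, -0.01, 0.05, 0.0, 0.0, 0.1, 1.0])
tokens = actions_to_tokens(action)
print(tokens)                      # token ids the sequence model would predict
print(tokens_to_actions(tokens))   # approximately recovers the original action
```

In this scheme the sequence model predicts action tokens after the image and instruction tokens, and a controller decodes them back into motor commands, which is what lets the same modeling machinery generalize across vision, language, and action.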
So in the previous release, you saw robots doing short-horizon things, like picking objects up and placing them somewhere else, or unzipping a bag. But really, to be useful to humans, you want longer-horizon things. And there we now have an agent that can orchestrate some of these smaller moves into a much longer-horizon task. Like, if you want to pack your luggage for London, you want to first look up the weather in London. So this agent can check the weather, decide what you need, and then even pack your bag for you.
>> So it's like you've got this fundamental, foundational model, and then you're building on top of it, layer upon layer, until you can chain sequences of actions together to do a long, complex task.
>> Yeah. And it makes it way more useful, because you don't want just that short-horizon thing. What you really want is for a robot to do the full thing for you. So this agent really brings that other layer of intelligence to the whole thing.
>> And this is 1.5.
>> Yep. So there are two capabilities in 1.5. We have the agent component, and then we have the thinking component. And "thinking" is a word that's been used a lot. For robotics purposes here, what we're doing is making the robot think about the action it's about to take before it takes it. It'll output its thoughts, and then it will take the action. And just this act of outputting its thoughts makes it more general and more performant, because we're forcing it to think about what it's going to do before it does it.
>> Because you see this in language models, right? Like, "take a deep breath before answering", or chain of thought; those ideas do actually improve performance. But it's the same in robotics?
>> It's the same principle that we're applying to robotics and physical actions.
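As an illustration of the think-before-acting pattern described here, the sketch below shows one way a policy's single output stream can carry a thought and an action, with only the action being executed. The THOUGHT/ACTION format and the stub policy are assumptions made for the example, not the real model interface.

```python
# Illustrative "think, then act" sketch (not the actual Gemini Robotics 1.5 API):
# the policy is asked to emit a short thought before its action, and only the
# action part is sent to the motors. The thought is trained/prompted output,
# analogous to chain-of-thought in language models.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # natural-language reasoning, shown to humans, never executed
    action: str    # low-level command string the controller parses

def fake_policy(observation: str, instruction: str) -> str:
    """Stand-in for the VLA. A real model would return one token stream
    containing both the thought and the action."""
    return ("THOUGHT: The red cloth is dark, so it belongs in the dark bin.\n"
            "ACTION: pick(red_cloth); place(dark_bin)")

def parse_step(model_output: str) -> Step:
    thought, action = "", ""
    for line in model_output.splitlines():
        if line.startswith("THOUGHT:"):
            thought = line.removeprefix("THOUGHT:").strip()
        elif line.startswith("ACTION:"):
            action = line.removeprefix("ACTION:").strip()
    return Step(thought=thought, action=action)

step = parse_step(fake_policy("camera frame", "sort the laundry"))
print("robot is thinking:", step.thought)   # what the lab display shows
print("robot will do:   ", step.action)     # what actually gets executed
```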
>> Isn't that weird? Just some of these emergent properties are just so weird.
>> Yeah. I mean, for robots, even basic manipulation tasks are really difficult. We do these tasks naturally, intuitively, without thinking about them, but for robots it's hard. So getting it to think about these actions before it does them truly helps the robots.
>> Amazing. Okay. Well, obviously I want to see one. Can we go see one of these?
>> Let's go. So let's take a look at the Aloha robots here. It's going to pack us a lunch with some very dexterous moves, and it'll do a long-horizon task, as Kaden was wondering about.
>> Thank you.
>> So, this is going to pack a lunch box. And this is one of our most difficult tasks, because it needs millimeter-level precision to grab the ziplock bag in the correct way.
>> Yeah.
>> And then it'll try to get the bread into that tiny spot.
>> And it's just all visual servoing.
>> Oh my gosh, I'm so impressed. I mean, as soon as I said the word "impressed", it started flailing slightly.
>> It gets stage fright. Yeah.
>> And does it correct itself?
>> It'll keep trying.
>> Hey, you've got lots of cameras pointing at you. I understand the stress.
>> The first time I went into a DeepMind robotics lab was maybe 2017 or so.
>> Okay.
>> And at that point, they had, you know, the big Lego for toddlers. All they were trying to do was stack blocks, one on top of the other. And honestly, the pile of discarded, broken Lego in the corner was illustrative of just how difficult that was. But this, I mean, this idea of millimeter precision for the bag.
>> Wow. Look at that.
>> Nice. Okay.
>> No. No way.
>> I'm so impressed.
>> Try from the top. Give it another go.
>> You want to see the bread and the ziploc?
>> I'll try to do the... Okay.
Oh, that is so almost almost almost.
Wow.
>> Yeah.
>> That's amazing. That's amazing.
>> Because if it crushed that too hard, you wouldn't be able to close it.
>> Yep. And if it's too soft, you're not going to be able to either.
>> Some more stuff.
>> I mean, that was easy, the chocolate bar. And now the grapes. Is it going to squash a grape?
>> Almost certainly. That's some grape juice going on there.
>> This is really impressive. So this is the dexterity in action, just how precise it can get.
>> Okay.
>> And then it's going to try to close it, I think. Yeah. So this just learns from the data how to do this. This is just end to end.
>> But exactly end to end, as you say? Like, this is just visual?
>> Just vision and actions.
>> And what kind of data is it learning from? I mean, do you have... it's not going to do the zip, is it?
>> Let's find out.
>> What kind of data do you give it? Is this based on just allowing the robot to try lots of things, or are you simulating?
>> So this is actually done via teleoperation. We kind of embody the robot and do the task with the robot, and it learns from that perspective. And it is going to... so it can pack you some lunches.
>> So you've demonstrated to it what it means to do the task correctly.
>> Yep.
>> I see.
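A minimal sketch of that training recipe, behavior cloning from teleoperated demonstrations: the operator's actions become supervised targets for an end-to-end policy. The toy data, feature sizes, and linear policy below are placeholders, not DeepMind's actual setup.

```python
# Minimal behavior-cloning sketch (illustrative, not DeepMind's training code):
# teleoperation gives pairs of (observation, operator action); the policy is
# trained end to end to reproduce the operator's action from the observation.
import numpy as np

rng = np.random.default_rng(0)

# Toy "demonstrations": 500 flattened camera observations and the 7-D action
# the human operator took at that moment while driving the robot.
obs = rng.normal(size=(500, 64))             # stand-in for image features
true_policy = rng.normal(size=(64, 7))       # the operator's (unknown) behavior
actions = obs @ true_policy + 0.01 * rng.normal(size=(500, 7))

# Behavior cloning is just supervised regression from observation to action.
W = np.zeros((64, 7))
lr = 0.1
for step in range(500):
    pred = obs @ W
    grad = obs.T @ (pred - actions) / len(obs)   # gradient of mean-squared error
    W -= lr * grad

print("imitation error:", float(np.mean((obs @ W - actions) ** 2)))
```

A real system would replace the linear map with a large vision-language-action model and the toy arrays with camera frames and teleoperated trajectories, but the learning signal, matching the demonstrator's actions, is the same idea.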
>> All right. Thanks. That was so cool.
>> I sort of want to give you a high five but your hands are quite pointy.
>> Yeah. Not these ones. Okay. So we saw dexterity here. Let's take a look at another demo, where we'll showcase the generalization capabilities of these robots. We talked about how VLMs are general world-understanders; we'll see that on a robot now.
>> Because that one was a task that it does over and over and over again.
>> It's less about the dexterity here; it's more about the generalization. So Colleen here is going to show us what the robots can do in a more general setting.
>> So, here we have our robot running a general policy, so it can interact with the objects, and you can just speak to it, because we have a Gemini layer on top. So, for example: hey, can you put the blue block into the blue tray?
>> I am putting the blue block into the blue tray.
>> It's chatting while it does it.
>> Yeah. So if you want to ask it to do something, it's a push-to-hold mic.
>> Yeah. Can you put the uh green block in the orange tray, but do it as Batman would?
>> Nice.
>> I cannot perform actions as a specific character. However, I can put the green block in the orange tray for you.
>> Fine.
>> I am now moving the green block into the orange tray.
>> Okay, that's cool. So, if this is completely generalizable... okay, so I have a stress ball.
>> Okay.
>> That I travel with, and it's never seen this before.
>> Yeah. So, if I put that in the scene.
>> Okay.
>> Um, and that's a pot, right?
>> Right. This is a little container and it lifts open.
>> Okay. Let's try this.
>> Open the lid of the green pear.
>> I'm getting started on opening the lid of the green pear.
>> It's going to be tricky.
>> That is difficult, right? That's small.
>> Amazing.
Place the pink blob inside the green pear.
>> I'm working on placing the pink blob inside the green pear.
>> Difficult.
>> Oh, nice.
>> Go on. Go on.
I want it to succeed so much. Oh,
squishy.
Yay. Okay. Put the green pear lid back on the pot. Yes.
Yes.
>> So impressed.
>> You know what? You look like a proud parent.
>> No, it really feels that way sometimes.
>> Amazing. This is amazing. It's never done that before.
>> Yeah, that stress ball is something it has never seen before.
>> Yeah, the open-endedness of this is really extraordinary.
>> Yeah. So now we can chain together some of these short tasks into a long-horizon task, and it becomes way more useful, because as you saw, with a short-horizon task it can only do parts of it. Once you can string them together to do something more impressive and longer-term, then we get more useful tasks. So instead of instruction, instruction, instruction, you can just tell it to do something.
>> You can ask for some high-level thing, and then an orchestrator will break that down into smaller instructions for the VLA, and it will do the whole thing for you end to end. And we can see that now here.
>> Okay. I'm in San Francisco and I don't know the rules about sorting trash. Can you look it up for me and then tidy up?
>> In San Francisco, you're required to separate your waste into three categories: recyclables, compostables, and trash, each with its own color-coded bin.
Nice.
Oh no.
Oh yes. Yes. Yes.
Wow.
>> Now I will put the rubbish into the black bin.
>> So it's chaining the...
>> Yeah, you can see how the agent can orchestrate a few more tasks and make it way more useful.
>> So in terms of the architecture, then, how does it work? I mean, do you have a separate system sitting on top that's giving instructions?
>> Yeah, we have two systems. One is our ER model, which is better at reasoning, and that orchestrates the other model, which is our VLA, which does the physical actions. So both of these come together to do these long-horizon tasks.
>> Okay. So the VLA being the vision-language-action model, and then the ER model being...
>> A VLM, just a vision language model, that's designed to be better at these kinds of tasks.
>> It's doing the reasoning.
>> Exactly.
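A rough sketch of that two-model split, under a deliberately simplified interface: an orchestrator (standing in for the ER model) decomposes one high-level request into short-horizon instructions, and each instruction is handed to the VLA to execute. The function names, the hard-coded plan, and the pretend web lookup are illustrative only.

```python
# Sketch of the two-model architecture described above (names and interfaces
# are assumptions, not the actual Gemini Robotics API): an orchestrator /
# reasoning model turns one high-level request into short-horizon steps, and
# the VLA executes each step with vision and actions.
from typing import Callable

def er_orchestrator(request: str) -> list[str]:
    """Stand-in for the embodied-reasoning (ER) model: plans, can consult
    tools such as web search, and emits short-horizon instructions."""
    if "trash" in request:
        rules = "SF separates recyclables, compostables, and landfill"  # e.g. from a search tool
        return [f"# {rules}",
                "put the bottle in the blue recycling bin",
                "put the apple core in the green compost bin",
                "put the wrapper in the black landfill bin"]
    return [request]

def run_agent(request: str, vla_execute: Callable[[str], None]) -> None:
    """Chain the short-horizon skills into one long-horizon task."""
    for step in er_orchestrator(request):
        if step.startswith("#"):
            print("context:", step)       # reasoning/context, not a motor command
        else:
            vla_execute(step)             # each call is one short-horizon VLA episode

run_agent("look up San Francisco's sorting rules and tidy up the trash",
          vla_execute=lambda s: print("VLA executing:", s))
```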
>> I think if we're going for the full science fiction future, though, you don't want just arms. You want the full humanoid.
>> You want the full humanoid.
>> Let's go check out the humanoid lab.
>> Okay. Yes, please.
>> All right.
>> So here we have a robot that will sort laundry for us. It'll put the dark clothes in the dark bin and the white clothes in the white bin. This is Stephanie and Michael, who are going to run the demo.
>> And the cool thing is you can just read the thoughts of the robot as it's doing it, and you'll see what it is thinking. This is our thinking-and-acting model, where it'll first think and then take the action.
>> You get an insight into its brain.
>> Yes. You can look at what it's thinking now.
>> So, this is every time step, is it?
>> Yep.
>> I got you.
>> You want to throw in a few more clothes?
>> Absolutely.
>> Go for it.
>> Let's do it.
>> Let's get a red one in there.
>> So, do not put that in the white one. Thank you.
>> So, do you have a system sitting on top of it that's making these decisions? I mean, how does it work? Is it hierarchical?
>> This one is pure end to end.
>> It's thinking and acting in the same model. There's no hierarchy. So it's very closed loop.
>> Okay.
>> The red cloth from the table.
>> Into the black box. Beautiful. Nice.
>> Beautiful. I mean, I would probably wash that separately, but you do you.
Okay. So then, if this is end to end, how do you extract this information? Is it just outputting actions?
>> So the beauty of this is it's outputting both its thinking and its actions. Think about how Gemini outputs its thinking before it outputs the response to the user. This is doing something similar.
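To illustrate the contrast with the orchestrated demo, here is a toy closed-loop version of a single thinking-and-acting model: the same policy is queried at every control step, and its output carries both a displayed thought and an executed action, with no separate planner in the loop. The scene strings and the stub policy are invented for the example.

```python
# Closed-loop sketch of a single thinking-and-acting model (illustrative; real
# outputs would come from the VLA, not this stub): at every control step the
# same model sees the latest observation and emits a thought plus an action.
import itertools

def observe(t: int) -> str:
    """Stand-in for the camera; returns a toy scene description."""
    return ["red cloth on table", "red cloth in gripper", "dark bin below gripper"][t % 3]

def think_and_act(observation: str) -> tuple[str, str]:
    """Stand-in for the end-to-end model: one forward pass yields both
    the displayed thought and the executed action."""
    if "on table" in observation:
        return "The red cloth is dark laundry.", "grasp(red_cloth)"
    if "in gripper" in observation:
        return "I should carry it to the dark bin.", "move_to(dark_bin)"
    return "Release above the dark bin.", "open_gripper()"

for t in itertools.count():
    thought, action = think_and_act(observe(t))   # re-planned from fresh pixels
    print(f"t={t}  thought: {thought:40s} action: {action}")
    if action == "open_gripper()" or t >= 5:
        break
```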
>> Oh yeah.
>> No, it's actually truly exciting. It's a different way of doing robotics that is very exciting to us. Yeah.
>> All right. So here we have the robot, where we showcase its generalization capabilities. And this is Kana, one of the researchers working on it.
>> So, let's just see what he can pick up, and maybe he can pick something up and put it in one of these things.
>> Um, I'd like the plant in the basket.
>> Also, all these objects are not seen by the robot during training.
>> So, they are completely new.
>> Many of them we bought just yesterday.
>> Oh, really?
>> That's true. We went to Target and bought a bunch of things yesterday.
>> So, this is about how the robot can handle completely new objects, things it's never seen before. Here we go. Go on, do it.
Oh, that is quite tricky to pick up, isn't it?
>> Yeah, it's like kind of sliding away.
Okay.
>> I'm just not sure if...
>> Nice. The scruff of its neck.
You did it.
>> What? What next?
>> Okay. I'd like the Doritos in the hexagon.
>> And you can move it as it's trying to do it. And you can see it. Yeah. Trick it.
Okay.
Oh, I hope you weren't planning to eat those.
>> The amazing thing is, okay, it's still a bit slow, and it doesn't get it right 100% of the time, but you can see that it's on the right path, right? I think that's what feels very different to the last time I came to one of these labs.
>> Yeah, you can see the intention behind the actions, and it's generally trying to do the things that you're asking it for.
>> Do you feel as though the stuff that you're doing now isn't going to be thrown away and scrapped for a whole new technique, or do you feel like you're building the foundational blocks?
>> No. Yeah. I think these are the foundational blocks that will lead to the final picture of robotics. So we'll just have to build on top of this.
>> In that building on top, do you think it needs another revolution? Like, do we need another architecture, or do you think that we've got enough already?
>> You know, I think we need at least one more big breakthrough. Even now, these robots take a lot of data to learn these tasks. So we need a breakthrough where they can learn more efficiently with data.
>> So do you think that's the only limiting factor, then? Do you think if you had many, many more orders of magnitude of data, like you do with large language models or visual language models, this would be sorted?
>> There is one hypothesis that that's all you need. If you can collect that much robot data, then we're done; we can pack it up. But there's still a long tail of problems to solve. They have to be safe, they have to really master the task. So there are still challenges, but the core of the problem is still robot data, this physical interaction data, what it feels like to do all of this stuff. It's just limited; it's not as big as the internet.
>> So right now we still have to collect all this experience on robots, but there is a lot of manipulation data collected by humans, humans posting videos about how to do anything. We should be able to learn from that at some point and really increase how capable robots are. This is very unstructured; solving robotics, general manipulation, is a very unstructured problem.
>> Yeah. And completely open-ended in terms of the type of things you could potentially ask it to do.
>> Amazing. I'm so impressed. Well done.
Sometimes these robots are a little bit on the slow side, right? Sometimes they're a bit clunky. But you have to remember that this idea of having a robot that can understand semantics, that can get a contextual view of the scene in front of it, that can reason through complex tasks: this was completely inconceivable just a few years ago. And okay, there may still be some way to go, but the progress here is really limited by the amount of data that we have on physical interactions in the real world. Solve that, go through that barrier, and I don't think you're just going to be watching robots sort laundry. I think we could be on the cusp of a genuine robot revolution. You have been watching and listening to Google DeepMind: The Podcast. If you enjoyed this little taste of the future, then please do subscribe on YouTube so you won't miss an episode. See you next time.