Embodied Intelligence: Leslie Kaelbling
By MIT Schwarzman College of Computing
Summary
Topics Covered
- Make Tea in Any Kitchen Globally
- Planning Trades Space for Time
- Learn Skills from One Demo
- Decompose Big Problems Small
Full Transcript
Awesome.
OK.
Can you hear me all right?
So thanks for coming to this event.
So that was a perfect setup.
So Pulkit does all this great work with real detailed, closed-loop, down-to-the-hardware sensorimotor stuff.
And I'm going to come at the problem from maybe the opposite direction.
So OK, so I want to understand AI.
And how do we make intelligent robots?
But I want to think now at the very high level.
An example to crystallize the types of problems I want to think about is making tea in anybody's kitchen.
I actually gave this as a homework assignment in a seminar one time, to go to somebody else's house and make tea without asking any questions.
So to do that, you have to understand a ton of stuff, right?
You have to understand something about getting water, making it hot, finding leaves, putting them in the water.
OK, so how could we make a robot that could go to anybody's house in the whole world and make tea?
So that's a problem I'd like to solve.
And it stresses a whole bunch of things, right?
We have to somehow have systems that are kind of flexible, robust, adaptive in their behavior.
They have to have a pretty long horizon view, right?
It might take a fair amount of work to make the tea in somebody else's house.
So the problem that I think about-- robot hardware is really interesting and important.
That's not the part of the problem that I work on.
So I assume that some of my colleagues who make awesome hardware have made hardware.
But then I, you and I, let's say, have the problem of making the software.
And the question is, how are we going to do that?
And I think it's really important-- Pulkit brought this up, and I'll just reinforce it a little bit-- that there are costs and considerations that happen in two places, right?
So one thing to think about is the work that happens, let's say, in the factory-- let's say I'm making household robots that I'm going to sell to you.
So there's work I have to do to make the robots while I have them there in the factory.
And then these robots are going to go out, and they're going to go to your house.
And they're going to mess things up and break dishes and stuff.
And so then there's also considerations about, how long does that take?
How much trouble do they cause, and so on?
Can they adapt?
Do they perform well?
So the question is, how do we think about the problem of making software for intelligent robots, accounting both for the costs in the factory and for the fact that they're going to operate in a really wide variety of situations?
So a classic method, and this is the thing that-- I forgot that Pulkit was going right before me, right?
A classic thing is to build some set of simulators, build some distribution of objectives, design some beautiful and general-purpose reinforcement learning algorithms, implement a big version, train it in simulation, and profit, right?
So we can make systems this way.
And they're very good at the things that they do.
But sometimes-- so it's kind of a plausible approach.
But the amount of data and time and simulation you need really scales with the complexity of the problem.
And so although I think that this approach could in principle arrive at general-purpose AI, let's say, I'm too impatient to wait for that approach to get to general-purpose AI.
So I'm going to talk about a different strategy.
And what I want to think about is making systems that can learn from very little experience, very little data.
And one strategy for doing that is to design systems that are compositional, so that we can build or learn small parts and put them together to solve a variety of other problems. Wilhelm von Humboldt, who studied language, said that language lets us "make infinite use of finite means."
The finite words we use can make infinitely many sentences.
So what we're doing in our research group, at least some of us, is taking what looks like kind of an old-school approach.
The idea is to build in some fairly general-purpose compositional representations and inference mechanisms, which traditionally operated on hand-built models.
But instead of hand-built models, we're working on learning the components that can be composed using these general systems in order to solve very big, long-horizon problems. So I'm just going to walk you through some examples, some pieces and parts of our work.
We have a very old-school robot architecture in a way.
But what's a little bit different about it is that there are some parts that are mostly learned.
You'll notice the yellow boxes in this picture correspond to perception and to motor control loops.
So those things in the current world are almost completely done very nicely by learning methods.
But there are going to be some algorithmic methods built in here and some learning that supports, that interacts with the algorithmic methods, which is what I'm going to focus on.
So first, let me talk briefly about planning.
So Pulkit talked about planning.
So that was good.
So planning.
What is planning?
One way to think about planning comes from Karl Popper, a philosopher of science, who called it letting your hypotheses die in your stead.
So instead of actually jumping off the cliff, you can imagine what it would be like and possibly decide not to do it, right?
So the ability to project alternative strategies forward and evaluate them is really important as a computational strategy for deciding how to behave.
So in the current world, it's interesting.
A lot of the work you read in reinforcement learning focuses on policies, right?
A policy is somehow a direct mapping from states to actions.
And it can be very fast to compute online, which makes it excellent.
But a thing that makes policies maybe difficult and expensive to acquire sometimes is that they effectively have stored in them a precompiled solution to every possible problem that could come up, right?
So if I have a policy, I observe the situation, and I know what to do.
Now, there are some situations, though, I would guess that maybe you wouldn't know what to do instantaneously.
Maybe you haven't thought about what happens if a tidal wave comes into this room.
But maybe you could use your cognitive apparatus to think a little bit and decide what to do about it.
So policies are fast to execute, but they are sometimes very big because they would have to cover all possible things that could happen.
Planning, generally speaking, trades space for time.
It may take longer to react, but it's easier to make a smallish computational object that can actually cover a broader range of things.
So space and time trade-off is a kind of a trade-off between learning or building policies and learning or building planning systems. OK.
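That space-for-time tradeoff can be made concrete with a toy sketch (everything here, the 1-D world and the names, is invented for illustration, not from the talk): the policy precompiles an answer for every state up front, while the planner stores almost nothing and searches forward at decision time.

```python
from collections import deque

# Toy 1-D world: states 0..9, goal is state 9, actions are -1/+1 steps.
STATES = range(10)
GOAL = 9

def step(state, action):
    return min(max(state + action, 0), 9)

# A policy precompiles an answer for EVERY state: big in space, instant in time.
policy = {s: +1 if s < GOAL else 0 for s in STATES}

# A planner stores nothing per-state; it searches on demand:
# small in space, but it pays computation at decision time.
def plan(start, goal):
    frontier, parent = deque([start]), {start: (None, None)}
    while frontier:
        s = frontier.popleft()
        if s == goal:                      # reconstruct the action sequence
            actions = []
            while parent[s][0] is not None:
                s, a = parent[s][0], parent[s][1]
                actions.append(a)
            return list(reversed(actions))
        for a in (-1, +1):
            s2 = step(s, a)
            if s2 not in parent:
                parent[s2] = (s, a)
                frontier.append(s2)
    return None

print(policy[3])      # immediate lookup
print(plan(3, GOAL))  # computed on demand
```

The policy answers instantly but had to be built for all ten states; the planner only ever thinks about the situation it is actually in.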
So in robot planning, I'm not going to talk about this in detail, but there's beautiful mathematical study of the underlying spaces of robot control.
What's interesting about controlling robots is that some part of what they do is to move in continuous space.
And some part of what they do is discrete.
Like, the choice of whether I'm holding the remote or not is a discrete choice.
So I make lots of discrete choices in combination with lots of continuous choices when I'm planning for a robot.
But the planning algorithms are beautiful.
In general, they take advantage of what we understand about geometry and kinematics and so on.
And so for even large suites of problems that are challenging for reinforcement learning, we can just address them directly using analytical methods and planning.
So there's some things we know the math for and some things we don't.
And one of my positions is that if you know the math for how to do something, maybe you should do it that way and then learn the parts that you don't know.
OK.
So that kind of idea, that if you had analytical models of your system, you could make long-horizon plans to solve problems is good.
But it raises the immediate question: well, sure, but where do you get the object models?
And so now I'm going to spend some time talking about how we can learn the models that support this style of work.
So one system that we built a while ago that's kind of fun to talk about is M0M, Manipulation with Zero Models.
So here, the assumption was that we have a robot.
We understand its kinematics.
And the robot already knows roughly the idea of how to pick up and put down objects.
But it's going to operate with objects it's never seen before.
And so we're going to use some pretrained neural network models to do things like segmentation, to predict how you can grasp something, to complete the shape of an object having only seen the front part of it.
We're going to take that understanding of the objects that the robot's looking at, put that into a planner, and use that whole thing to govern the behavior of the robot.
So we can construct a policy.
It's still a policy in the sense that it takes in images, and it generates motor torques.
But what goes on inside is that it builds its own mental model of the shapes of the objects in front of it.
It takes a goal description actually written in first-order logic.
I'll show you some examples.
Given what it sees about the world and the goal, it infers what actions it ought to take, and it takes them.
So here's an example.
This is the favorite example.
And I have to show it to you in pieces.
So let me just get it set up.
OK.
So here's a case where we've asked the robot to put all the objects on the blue mat.
So the robot can see from its own personal perspective.
The only object it can see is that cracker box.
So these are new objects.
It doesn't know about these objects in particular, but it can see the cracker box.
And it says, OK, I will make a plan to put all the objects that I can see on the mat.
So it does that.
It makes the plan.
And then it starts to execute it.
It picks up the object that it sees, puts it on the mat, takes another look at the table, and says, oh, snap.
There's actually more objects.
So then, but that's OK because the planner understands geometry and kinematics and all this stuff.
And it just makes a new plan.
It makes a new plan to put all these objects back on the mat.
So there's no learning in the planning itself. There was learning involved in the perception, but none in the planning.
It's just reasoning about space and kinematics and objects and does that.
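The loop just described, plan for what you can see, execute, look again, and replan, can be sketched like this (the object names and the toy "reveal" model are invented for illustration; the real system plans over geometry and kinematics):

```python
# Sketch of the observe-plan-execute-replan loop: plan over the objects
# currently visible; executing may reveal occluded objects; then replan.

def run(world):
    on_mat, log = set(), []
    while True:
        seen = set(world["visible"])          # observe
        todo = sorted(seen - on_mat)          # plan: one move per object seen
        if not todo:                          # goal holds for everything observed
            return on_mat, log
        for obj in todo:                      # execute the current plan
            on_mat.add(obj)
            log.append(obj)
            # Moving an object may reveal previously occluded ones.
            world["visible"] |= world.get("revealed_by", {}).get(obj, set())

world = {
    "visible": {"cracker_box"},
    "revealed_by": {"cracker_box": {"mustard_bottle", "soup_can"}},
}
done, order = run(world)
print(order)
```

The first plan only covers the cracker box; once it moves, the newly visible objects trigger a second round of planning, just as in the demo.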
And it can solve a bunch of other kinds of problems. Here, we told it to put a yellow thing on the blue mat.
So it's moving some stuff out of the way because it couldn't reach the yellow thing unless it moved the stuff out of the way.
It makes some mistakes, but it persists.
And it can put the yellow thing on the blue mat.
Here, we told it to put all the objects in the bowl of the color that is closest to it, right?
So again, we give it goals in a logical specification.
And it looks at the things it sees, and it knows what to do.
So we get enormous generalization from the built-in underlying structure of reasoning and understanding things about geometry.
This one is just kind of entertaining because it's supposed to put all the objects on the mat.
But it turns out there are more objects than it had actually anticipated.
OK.
And it can work on other robots and so on.
OK.
So that was dealing with not having perfect models of the objects in the world.
Another thing that we want to think about is, what about adding new skills?
So actually, Pulkit gave us beautiful examples of a robot learning new skills.
And then the question is, how can we incorporate them into the robot's general ability to behave?
So if you think about, for instance-- imagine the robot learns a skill for pouring.
But now, what we need to learn are the conditions under which that skill is going to work out, because maybe I learned an awesome controller, but I have to know, as a function of the shapes of the things I'm pouring into and out of, or the gain of the controller, and so on, whether it's going to work.
So we did some work where we describe-- we try to learn a description of when a skill will work effectively.
And we treat this as a Gaussian process regression problem.
What's important about this is if you're trying to learn how well a skill works, especially in the real world, you don't want to do it too many times.
So here, we have the robot trying to learn something about its own pouring and scooping and so on.
And we had to pick a lot of chickpeas up off the floor.
So we are very carefully using Bayesian experiment design methods to keep the number of experiments down.
But what this lets us do in the end is take a skill that we have learned and the package that we've learned that says, what has to be true in order to use this skill, and take those skills and just add them into the general repertoire of the robot without further learning, right?
So it's local: we do local learning of the properties of this skill and when it works.
We could do local learning of the properties of another skill and when it works.
And then we combine them together using the general-purpose planning mechanism in order to solve completely new problems completely de novo.
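In spirit (and only in spirit; this is a crude stand-in for the Gaussian process regression and Bayesian experiment design just mentioned, with an invented ground-truth function and invented numbers), the active-learning loop looks like this: under a stationary kernel, predictive uncertainty grows with distance from past experiments, so each new trial is placed where the data so far say the least.

```python
import math

def success_prob(width):
    # Hidden ground truth for the demo: pouring works best for mid-sized openings.
    return math.exp(-((width - 5.0) ** 2) / 4.0)

def rbf(a, b, scale=2.0):
    return math.exp(-((a - b) ** 2) / (2 * scale ** 2))

def predict(x, data):
    """Kernel-weighted mean plus a distance-based uncertainty proxy."""
    weights = [rbf(x, xi) for xi, _ in data]
    mean = sum(w * y for w, (_, y) in zip(weights, data)) / sum(weights)
    uncertainty = 1.0 - max(weights)   # far from all past trials -> near 1
    return mean, uncertainty

candidates = [i * 0.5 for i in range(1, 20)]   # opening widths to consider
data = [(2.0, success_prob(2.0))]              # one seed experiment
for _ in range(5):                             # only five more real trials
    x = max(candidates, key=lambda c: predict(c, data)[1])  # most uncertain
    data.append((x, success_prob(x)))          # run the (simulated) experiment

best = max(candidates, key=lambda c: predict(c, data)[0])
print(len(data), best)
```

Six experiments total, and the learned precondition model already points near the true optimum; the point of the experiment-design machinery is exactly to keep that trial count (and the chickpeas on the floor) down.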
I'll show you-- I think the next example that's going to come up is a nice one.
Here, this robot has-- the learned skills are pushing and pouring.
And we asked it to put stuff in this red bowl.
The red bowl's all the way over there.
The cup with the stuff in it is all the way over here.
The planner just used geometric reasoning to decide to push this over here and then pour into it.
OK.
Now we can grow our skill set by making combinations of things we've already seen.
I want you to watch this video carefully.
OK.
I know.
I thought that was pretty cool.
So what did we learn from that video?
We kind of learned that you could do a thing, that there was a trick or a strategy that works out.
But you weren't completely surprised.
It wasn't magic.
You could use your understanding of physics and force and torque and friction and so on to say, oh, yeah, I see.
I kind of get how that would work.
So you can learn things from one demonstration if you're already pretty-- if you already have a pretty good theory of what's going on in the world.
And so we wanted to take that idea and think about, what could we learn from one demonstration?
So this is not as exciting as that.
But here's this robot.
We give it one demonstration.
It knows it's trying to put the red block in the green bin.
But we do this demonstration.
And that's kind of interesting.
So what did we do?
We used the hook to bring the block closer so that the robot could pick it up and put it in the bin.
We could have represented that demonstration as a thousand tiny torque steps.
But then it's very hard to generalize that or really learn from it.
Another way to represent that demonstration in our head is in terms of the contacts that are being made between the objects and how they're changing.
So there's a change of picking up the hook.
And there's a change of contact mode when the hook touches the object.
There's different changes in contacts.
So there's a handful of those.
Not too many.
That's still kind of too many to plan with.
What we would like to do is take a demonstration like that and package it up as an operator, as the idea, as the concept of using a hook to pull something closer to me.
And if you can get ahold of that, using a hook to pull something closer to me, then you can take that concept and reuse it, and you can plan efficiently with that concept.
So the details I'm not going to get into.
But roughly, we have built a system that can take the one demonstration, extract that contact sequence that's necessary, and then practice, right?
You still have to practice how to grab the hook and how to pull it toward you.
Even a bucket-truck operator probably couldn't learn from that one example of what that person did with the bucket truck; they'd probably have to practice it a little bit to do it.
But if we do this, we can learn-- we can learn the trick, and we can learn to apply the trick very computationally efficiently.
And most importantly, we can take that learned trick and compose it with other tricks and other things that we know.
So for instance, having learned to use a hook to grab something, well, the planner can use a hook to grab a hook to grab a thing because it all just composes in the way that, let's say, language composes.
Or this is a more complicated example, where it's trying to keep the block from sliding down that inclined plane.
And it figures out that it can use the hook in two different ways.
It's going to use the hook to pull the object to itself.
And then it's going to use it, actually, to put it at the bottom of the plane to impede the block from sliding all the way off.
So that'll take a little while.
We probably don't have time to watch it.
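The compression described above, from a thousand tiny torque steps down to a handful of contact-mode changes packaged as a reusable operator, can be sketched like this (the trajectory and the contact labels are invented for illustration):

```python
# Sketch: compress a dense demonstration into its contact-mode changes,
# then package that sequence as a named, reusable operator.

def contact_changes(trajectory):
    """Keep only the timesteps where the set of contacts changes."""
    modes, last = [], None
    for contacts in trajectory:
        if contacts != last:
            modes.append(contacts)
            last = contacts
    return modes

# A thousand torque-level steps collapse to a handful of contact modes.
demo = (
    [frozenset()] * 300                                         # reach for the hook
    + [frozenset({("hand", "hook")})] * 400                     # grasp, carry the hook
    + [frozenset({("hand", "hook"), ("hook", "block")})] * 200  # hook touches block
    + [frozenset({("hand", "hook")})] * 100                     # block pulled close, released
)

operator = {"name": "pull_closer_with_tool", "modes": contact_changes(demo)}
print(len(demo), len(operator["modes"]))
```

It is the short mode sequence, not the dense trajectory, that gets handed to the planner, which is what lets the trick compose with everything else the robot knows.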
There's some other interesting tricks that you can teach a robot.
Like, if you have a big, fat hand, it's hard to pick an object up unless you slide it over to the side.
So sliding something to the side so you can pick it up is a trick.
Here's another trick for picking up something if you have a big, fat hand.
Maybe you wouldn't have anticipated that.
But observing it, you'd be like, oh, I see.
Yeah, that works.
So these are all things where you see one demo, and you're like, ah, I get it.
We can do it with a real robot, with a banana, with a tennis ball, and so on.
OK.
I just want to mention a couple of other ingredients-- I've got two minutes-- that go into trying to make an overall system that could be as capable as a general-purpose household robot.
Right.
So another really important thing that almost nobody is working on-- I don't know if anybody here wants a research topic.
Like, I think this is really important-- is building a memory of what you've seen and done and what the state of the world is like.
So this is a problem that we've worked on.
I could ask you, where's your favorite coffee mug?
And you could tell me something about that.
But how is it that you aggregated the history of your coffee mug over time?
Maybe you say, oh, well, I think I left it on the counter, but I'm not sure.
Or maybe, is it a picture?
Probably not as a picture.
Maybe your housemate has moved it.
So learning how to remember and make predictions about the objects in your life is an important problem.
Another important problem is scaling up.
So it's common in computer science to say, oh, well, I have an algorithm that can solve problems of this size.
And what I really need to do is work on making my algorithm better so I can solve problems of this size.
But eventually, that's not going to work anymore.
You have a small brain.
You have a big problem.
You could only solve problems of a certain size.
And so rather than thinking about taking really big problems and solving the whole thing, what I like to think about is taking really big problems and mapping them into a whole sequence of small problems by decomposing them compositionally, maybe according to a temporal hierarchy, maybe according to spatial decompositions, and so on, so that I can solve with my little brain a whole sequence of small problems instead of one giant problem.
So that's another really, I think, important focus.
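That strategy, a small solver applied to a chain of subproblems it can actually handle, can be sketched in a toy form (the budget and the 1-D "world" are invented for illustration):

```python
# A "small brain" solver with a strict horizon budget still reaches a
# distant goal once the big problem is split into budget-sized subgoals.

BUDGET = 10   # the most steps our little solver can reason about at once

def small_solve(start, goal):
    """A solver that refuses problems bigger than its budget."""
    if abs(goal - start) > BUDGET:
        raise ValueError("problem too big for my little brain")
    step = 1 if goal >= start else -1
    return [step] * abs(goal - start)

def decompose(start, goal):
    """Temporal decomposition: a chain of subgoals, each within budget."""
    subgoals, here = [], start
    while here != goal:
        here = min(here + BUDGET, goal) if goal > here else max(here - BUDGET, goal)
        subgoals.append(here)
    return subgoals

def solve(start, goal):
    plan, here = [], start
    for sub in decompose(start, goal):
        plan += small_solve(here, sub)   # each subproblem fits the budget
        here = sub
    return plan

print(len(solve(0, 47)))
```

The solver alone can never handle the 47-step problem; the decomposition turns it into five problems it handles easily.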
Then the other thing everyone asked me about-- I can't have a conversation with anyone ever without being asked about foundation models.
So I'll say a word about that.
So it used to be, maybe if you thought about making a robot software system to do some stuff, you would get a bunch of different functional modules and wire them together, a mapping system and a planner and so on.
And I think the way that things are going to start going for us is that we're going to go buy a bunch of models now, maybe, instead of modules, design some basic architecture, maybe with some of the algorithms and methods that Pulkit and I have been talking about, but also taking advantage of existing large, pretrained models to do a variety of different kinds of things.
And we have to think about, how do we put all these things together to do a job?
So one thing that I think is interesting to think about is, what are the kinds of models that we could use in a general-purpose robot system?
So there's some, I think, that we could buy in a general marketplace-- so models of how the physical world works, models of scene understanding.
How is it that we can take images of the world and build a representation internally of what we're actually looking at?
Models of other agents, models of culture and know-how.
Some of this stuff is in current LLMs, but some is not.
There's also other models that maybe would come with your robot.
Like, how does your robot's particular sensors and effectors work?
And then there are models, really, that the robot has to build itself because it's the model of your house.
So we're going to have to have models at different scales, of different sizes and granularities.
So some will be trained on more data.
Some will be more or less general.
And we'll have to think about the different roles that these models can play for us, right?
So there are things like predicting what will happen if I do this.
There are things that are more about giving you advice.
What should I do in this situation?
Things that maybe help you solve the problem of mapping the big problem into small ones.
Hopefully, they'll come with quantified uncertainty so we'll know who to believe at which moment.
A lot of people are working on that, but we are not there yet.
So I don't know.
So language models have a lot of information in them.
They're fun to play with.
I think, right now, the main thing is that you should trust but verify.
So one thing that's interesting is that you can ask an LLM to give you some advice about how to do a thing.
I tried asking it about what to do about stove fires.
That turned out not to be so effective, right?
It said something like, use a fire extinguisher if it's safe.
And then, well, when is it safe to use a fire extinguisher?
It says, if your fire is small and contained.
And what makes a fire small and contained?
Well, it's when it's safe to use a fire extinguisher.
OK.
OK.
But they're much more useful in other kinds of circumstances.
But the point is that they can't tell you how to move your motors.
And they can't tell you to put your shoes on before you go outside.
And they can't tell you a million things like that.
And so I think what we, especially people who are trying to make general-purpose robots have to think about, is, really, what are the strengths?
What is the knowledge that's embedded in an LLM?
How can we use it when it's available?
But how can we test it also against our own understanding of the world and try to simulate?
If I asked you for some advice on how to cook banana bread, you would give me some advice.
But I would not execute it without thinking it through.
And I think we want to maybe think about how robots can do that too.
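That "trust but verify" stance can be sketched as a filter: take advice from an opaque advisor, but check each suggested step against the robot's own model before executing anything. (The advisor here is a hard-coded stub standing in for an LLM, and the precondition model is invented for illustration.)

```python
# Filter an advisor's suggested plans through the robot's own world model.

def advisor(goal):
    # Plausible-sounding advice; the second plan skips a step on purpose.
    return {
        "make tea": [["boil water", "add leaves", "pour water"],
                     ["pour water", "add leaves"]],   # forgets to boil
    }[goal]

# The robot's own model: what must already hold before each step.
PRECONDITIONS = {
    "boil water": set(),
    "add leaves": set(),
    "pour water": {"boil water"},
}

def verified(plan):
    """Mentally execute the plan; reject it at the first unmet precondition."""
    done = set()
    for step in plan:
        if not PRECONDITIONS[step] <= done:
            return False
        done.add(step)
    return True

plans = [p for p in advisor("make tea") if verified(p)]
print(plans)
```

The advisor's fluent but broken suggestion is caught by simulation before it ever reaches the motors, which is the "thinking it through" step in the banana-bread example.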
So OK, with that, I will say I think I would like to get to humanlike generalization ability.
I think we get that by combinatorially and efficiently reusing things that we learn by building in some architectural constraints.
And the work I showed you was the work of a bunch of students and colleagues.
And you can watch the robot mess up while you ask me questions.
Thank you.
[APPLAUSE]
Thank you so much.
I think we have about five minutes for questions.
So if anyone would like to ask first-- I think you're supposed to come to the mic.
Oh.
Or maybe we could do a little line at the mic if-- that probably would be easier.
Yeah, or there's a mic over there too.
Anyway yes.
Thank you for the talk.
Sure.
I was thinking, as you were talking, about these combinations of planners and the mental model, which, as I understand it, is basically like a physics simulation in a computational--
Or maybe more abstract.
I mean, yeah.
But right.
Think of it as a physics simulation for now.
Yeah.
OK, that was my question exactly.
What about things that-- or plans or actions that are more abstract and can't be high-definition physics simulations, or you're not sure what's going to happen, and you don't have enough information, so you have to plan abstractly?
Like, how do you represent those?
Yeah.
So we do, actually, a ton of work on building abstractions exactly because you can't plan at this very low granularity.
It's just the horizon's too long.
And what are you going to do?
And so the trick-- I mean, so the thing about contact-- the trick is figuring out, what are the important events in a trajectory of the world, right?
And so for robot interaction with objects, one kind of event that's important is basically discrete changes in the object's state or discrete changes in robot-object interaction.
So that gives us one idea about how to chunk the world up.
If we can chunk the world up a little bit and say under what circumstances will the state of this remote relative to me change-- like, oh, I opened my fingers a certain amount, or I fling it in the air-- that gives us-- I can begin to carve the world up.
I can make some discrete state kind of partitions.
And then I can think about what actions will cause the state transitions.
And so we do learn abstract, discrete, kind of symbolic but still probabilistic models of how the world could change.
And we plan in those models.
And sometimes, we just use those models.
And sometimes, we verify those plans back down in a more fine-grained model.
So there's a lot of moving levels of abstraction.
And that's really important to being efficient.
Yeah.
OK.
Thank you.
Sure.
Hey.
Thank you again.
Sure.
So I wanted-- I was wondering how important the equivariance of your sensory representations is for the success in multitask settings.
The equivariance of the sensory representations.
So, OK, I believe in an objective reality external to my sensors.
That is to say, like, this chair is a thing, and it has a shape.
And then it has physical matter and stuff.
So my goal in trying to deal with sensory stuff is actually to try to recover, to the extent I can, a model of the physical reality.
So those representations, if I think of an object in terms of its pose, let's say, in six dimensions, and I understand it as a pose, then it has a bunch of equivariances.
But maybe I'm not getting at your question.
Yeah.
So if you want to solve a single task, then your representation may be invariant to certain things that don't matter.
And you'll be totally fine.
But as soon as you want to use the same model, like the foundation models you mentioned, for solving multiple tasks, then it can certainly happen that something the representation was invariant to for one task suddenly matters for another task, simply because the underlying physics matters in a different way.
So you don't want what we in neuroscience call an exemplar representation in the feature space; rather, you want this equivariance. You want the representations to be spread out, similar to a good variational autoencoder, for example.
In my dreams, I would have a pretty general-purpose representation of objects, but again, like in 3D, right?
And then to the abstraction point that I mentioned before, given a task, I would project my general-purpose representation into a lower-dimensional representation that surfaces the aspects of the problem that are relevant to the task that I'm trying to solve.
And when I talked about mapping big problems into small problems as a general strategy for solving really big problems, that's an example.
Perfect.
Right?
Thank you very much.
So what I was trying to say is, I think one of the things that we learn as children and into adulthood is what aspects of the world matter when we're trying to solve a problem of a certain kind.
And if we can learn that, we can use that to map big problems into small ones.
Perfect.
Thank you.
Sure.
Right.
Unfortunately, I think we're out of time for additional questions.
OK.
So another round of applause for Dr. Kaelbling.
Thanks.
[APPLAUSE]