Embodied Intelligence: Pulkit Agrawal
By MIT Schwarzman College of Computing
Summary
## Key takeaways

- **LLMs Plan but Fail Execution**: Google's robot uses LLMs for common sense reasoning like finding a Coke can and sponge to clean a spill, but fails to physically throw the can or wipe effectively. [00:47], [01:25]
- **Teleop Lacks Robust Generalization**: Teleoperation enables complex tasks like inserting a battery, but requires precise object placement within drawn lines and lacks generalization without more data collection. [04:38], [05:03]
- **Teacher-Student Sim-to-Real Reorientation**: Train RL policy in simulation using privileged state like object pose and velocity, then distill to vision-based student policy for in-hand reorientation of novel complex shapes with one camera. [10:03], [11:12]
- **Multi-Task Framework Scales Behaviors**: Collect robotic data across many tasks to train multi-task models, accelerating controllers for new tasks via the same codebase applied to manipulation and locomotion. [03:23], [13:26]
- **Pareto RL Automates Reward Design**: Two policies compete: one maximizes task performance and the other minimizes energy, constrained to match performance, yielding simple rewards that replicate hand-tuned behaviors like robust locomotion. [19:20], [20:35]
- **NeRF Scans Enable Kitchen Policies**: Scan real kitchens with NeRF, add articulations for simulation, learn policies like opening microwaves or shelving mugs, then deploy robustly to reality despite disturbances. [21:17], [22:28]
Topics Covered
- LLMs Plan but Robots Fail Execution
- Sim-to-Real Beats Teleop Scalability
- Teacher-Student Distillation Enables Reorientation
- Pareto Games Automate Reward Design
Full Transcript
Great.
Perfect.
Well, good afternoon, everyone.
I have the fun and the onus of giving the first talk.
It's post-lunch.
I hope I'll keep you awake and tell you about embodied intelligence and my views on it.
So one of the things that I and many other people in the field are after is what one could describe as physical intelligence, which is a system that can be easily taught any manipulation or locomotion task a human can perform.
Now there have been great advances in large language models and large vision models.
And what does this mean for embodied intelligence and robotics?
So I'm going to play you this clip that Google had some time back.
And the question is, if I spill my Coke on the table, how would you throw it away and bring me something to help clean?
And the robot comes up with a plan using the latest and greatest advances in LLMs. It says, I would find the Coke can, pick it up, go to the trash can.
But then it cannot throw it.
Then it realizes to clean the spill, it should pick up a sponge.
So it goes, finds a sponge.
So what you're seeing is that the system is able to perform some of the common sense reasoning that you would need in a robotic system.
But when it comes to actually doing the task, this is what happens, right?
So what this means is that one of the challenges which remains is, how do we endow these embodied intelligence systems with the ability to execute sensorimotor tasks?
And this has been a topic which has been widely studied, and we have all seen these very impressive videos of robots doing amazing things.
So while these videos are amazing, it does take teams of people and time to code a particular behavior.
It might be months to a year, maybe more, to get one behavior.
So if we are interested in coding up many behaviors or many tasks, the scheme currently is just not scalable.
So that leads us to the goal of being able to quickly and easily design diverse and complex behaviors that work well in less structured settings.
And the one question you can ask is, can we use the same scheme that we had in vision language and speech?
In these domains we had tons of data.
But the problem is in robotics, we don't have the data already lying on the internet that we can download and train these large models.
So what are people doing, right?
So there's one line of work which is trying to bootstrap learning from passive data, for example, data available in terms of YouTube, by leveraging models that people have already trained, for example, language models, vision models, putting them all together.
And this is an exciting area.
But for today, I'll focus on really collecting robotic data.
And what do I mean by robotic data is we pick up a task.
And we say, do I have a controller which is going to take in observations and map to actions?
And if we can design a framework which allows me to do this for many, many tasks and very quickly, then, essentially, I can train a multi-task model, where over here T_i indicates a particular task.
And once I have this multi-task model which I can start training, this in turn would help me design controllers for new tasks faster.
So the core question really is, how can I take a new task and design a controller very fast?
By very fast, I mean human time, wall clock time, the amount of effort which goes into it.
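To make the multi-task framing above concrete, here is a minimal sketch of a task-conditioned controller: a single model that maps an observation plus a task identifier T_i to actions. Everything here is hypothetical, purely for illustration: the class, the dimensions, and the tiny random network are not from any specific codebase.

```python
import numpy as np

class MultiTaskPolicy:
    """Illustrative task-conditioned controller: (observation, task id) -> action."""

    def __init__(self, obs_dim, act_dim, num_tasks, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # The task identity T_i enters as a one-hot vector appended to the observation.
        in_dim = obs_dim + num_tasks
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, act_dim))
        self.num_tasks = num_tasks

    def act(self, obs, task_id):
        task_onehot = np.zeros(self.num_tasks)
        task_onehot[task_id] = 1.0
        x = np.concatenate([obs, task_onehot])
        h = np.tanh(x @ self.W1)
        return np.tanh(h @ self.W2)   # actions bounded in [-1, 1]

# One model serves several tasks; data from all of them trains shared weights,
# which is what would let a new, similar task start from a better prior.
policy = MultiTaskPolicy(obs_dim=4, act_dim=2, num_tasks=3)
action = policy.act(np.zeros(4), task_id=1)
```

The design choice being illustrated is only the interface: controllers for many tasks share one set of weights, conditioned on which task is being executed.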
So I think in the field, there are pretty much two approaches which have emerged to tackle this sensorimotor problem.
So one of them is collecting large demonstrations.
And there are many groups at TRI, at Stanford, and at Google which have done teleoperation and showed some impressive tasks that they are able to perform.
So the good news about teleoperating the robot is that you can pretty much have a robot do many different things.
But if you look closely at the kind of demonstrations that have been shown, for example, over here, putting a battery in this socket-- you see this white line which has been drawn over here.
And this will only work if the object is placed around this white line.
So the point being these systems can do complex tasks.
But they're not as robust as we would want them to be.
They're not generalizing.
And you could make them generalize.
And what that would mean is you have to go and collect more data.
So the data effort over here comes down to how much human time and human effort is required in collecting the data.
And this is what people end up doing.
They have these two arms, and they're trying to control the other two arms to collect more data.
So the other approach which has become quite popular is sim-to-real reinforcement learning.
And there have been good results on a broad variety of platforms. This is a photo from a nature paper showing how automatic drones or drones learned with reinforcement learning can beat the fastest human player.
There are a bunch of works on quadruped locomotion and also on dexterous manipulation.
And while this sim-to-real reinforcement learning can get good generalization and is more robust than what you get by collecting data manually, we cannot scale this to many, many tasks because there are a couple of challenges.
For example, you can only go to tasks that one can simulate.
And there are other issues which come in.
So in a nutshell, I think the question, really, we should be asking is, if we have the approach of either teleoperating the robot or this approach of going to a simulator, collecting data in the simulator, what is the quantum which really matters?
And in some ways, that is amount of human effort and time which goes into setting up these systems so that we can come to a controller.
So in line with the theme of this session, I'll fixate on this sim-to-real reinforcement learning and tell you where we are, highlighting some of the work that we have done, and what are some of the issues which are preventing us from scaling as fast as we would like to scale up.
So for example, one thing we would want to do is to automate many of the tasks that the human hand can do today.
And the core part of many of these tasks is that there is motion between the hand and the object.
And that is what distinguishes this from purely picking up an object and placing it.
So OpenAI had this nice work back in 2017 where they were demonstrating a hand can rotate this cube.
And they had a camera looking at a hand.
But if you actually look at the setup, this is what it looked like.
It had so many cameras to make this happen so that the state of the object and the hand can be measured accurately.
So very good for a lab demo, but not really practical.
And furthermore, this system only worked for one particular object, which was the cube.
So it's not generalizing and seems impractical.
So the one thing we set out to do is, how can we get to a reorientation system which can work for diverse and complex shapes and also generalizes to new ones while having a setup which is much more practically deployable, for example, having a single camera look at this four-finger manipulator manipulate an object?
And this was work led by Tao Chen in the lab.
And the way it works is we go in a simulator.
And we collect large amounts of data.
And this is possible because we can run many, many environments in parallel.
And what you see is the hand trying to manipulate many different objects.
So ideally, what we want, the end result, is a policy which can take in a vision observation or a point cloud coming from a camera where the fingers are, and then output actions so that we can complete this reorientation task.
Now, it turns out that if you try to train a reinforcement learning policy from a high-dimensional observation space, it's actually quite hard to do it.
And the second thing is doing rendering in simulation can become quite challenging.
So these kinds of policies just do not work out.
The framework is there's a camera getting observations.
There's some rewards that we define, which is rt.
And the rewards might be how close we are to the target.
And we want to maximize the sum of rewards.
It turns out it's very hard to train these policies.
But in simulation, you can play god and leverage the fact that we know all the information about our environment and the object.
For example, we know the object pose.
We know the object velocity.
And this helps us construct a reduced state space.
By that, I mean this is much more lower dimensional than the observation space on the bottom.
Now, if you study machine learning, one of the basic things is that if your observation space is small, it's much easier to learn a mapping than if your observation space is much larger.
So by leveraging this, we first train a policy using reinforcement learning, which we can do now easily.
And then what we do is we distill this policy to work from sensory observations.
So step one, use reinforcement learning.
Train the policy to do reorientation.
But this policy we cannot deploy because it assumes access to object velocity, pose, and other information which would need to be inferred from sensory data.
So to make that possible, we take sensory data, but we get targets coming from the first policy and just do supervised learning to learn the policy on the bottom.
And this framework is called the teacher-student framework.
And then we can take the second policy and deploy it on our test scenarios.
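The two-step teacher-student recipe just described can be sketched in a few lines. This is a toy, not the actual training code: the teacher, which in the real pipeline is trained with reinforcement learning on privileged state such as object pose and velocity, is stood in by a fixed linear map, and the sensor model is a random projection. Only the distillation step, supervised regression of the student onto the teacher's actions, is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM, ACT_DIM = 6, 32, 2

# Stand-in teacher: in the real pipeline this policy is trained with RL on
# privileged state. A fixed linear map suffices to illustrate distillation.
W_teacher = rng.normal(size=(STATE_DIM, ACT_DIM))

def teacher(state):
    return state @ W_teacher

# Toy sensor model: the observation is a fixed projection of the state plus
# noise. In simulation we can "play god" and read out both at once.
P = rng.normal(size=(STATE_DIM, OBS_DIM))

def simulate_batch(n):
    state = rng.normal(size=(n, STATE_DIM))
    obs = state @ P + 0.01 * rng.normal(size=(n, OBS_DIM))
    return state, obs

# Student: sees only the sensory observation; trained by plain supervised
# regression onto the teacher's actions (the distillation step).
W_student = np.zeros((OBS_DIM, ACT_DIM))
lr = 1e-3
for _ in range(2000):
    state, obs = simulate_batch(64)
    target = teacher(state)              # labels come from the teacher
    pred = obs @ W_student
    W_student -= lr * obs.T @ (pred - target) / len(obs)

# At deployment only the student runs, from observations alone.
state, obs = simulate_batch(256)
err = np.abs(obs @ W_student - teacher(state)).mean()
print(f"mean action error after distillation: {err:.4f}")
```

The point of the split is that the hard RL problem is solved in the small privileged state space, and the high-dimensional observation space only ever sees an easy supervised problem.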
So here's an example where there's a camera looking at the object.
And this is the orientation that we want to get the object in.
So let's see what the system ends up doing.
And what you find is the system, it orients this object and takes it to the target orientation.
Now, these are objects that the system has not seen before.
Here's another example of trying to orient the object to the goal orientation.
So the system is not perfect.
It is still imprecise in terms of how precisely it can rotate the object.
But at least we are showing evidence that we can go to complex objects and perform this in-hand manipulation task.
Now, once we can start doing it, we can start leveraging these controllers for more tasks, like for example, peeling a vegetable over here, where the vegetable needs to be rotated.
So now I can have this vegetable that I can reorient, stop when I see the unpeeled part.
And then there is a second hand which comes and peels it.
At the time we created this video, this was being teleoperated.
But now we have automatic systems which can also do the peeling.
So I think this opens up a nice space for actually picking up tools and trying to use them for bimanual tasks, which could enable interesting advances in manufacturing but also help robots become more versatile.
Here is just another example of applying it on a different vegetable to make the point that we are not limited to just one or two vegetables, but we can bring in new ones.
Not ginger yet.
But that's where we are going.
Can you give a ginger and then have a robot actually peel it?
Now, the good news with the framework I just described to you, which is collecting data in simulation and transferring to reality, is that we can use the same code base, pretty much the same framework, and apply it to different tasks.
So I'll show you some results now in locomotion being pioneered by Gabe Margolis, who is a PhD student in my lab.
So what we've shown is that the policies that we end up learning through sim-to-real learning end up being quite robust because we can emulate many different scenarios in simulation.
Here's a robot slipping over oil, but it remains robust.
Once the robot can walk, we can make the task be more challenging.
For example, now it needs to kick the soccer ball.
So it needs to balance on three legs while using one of its legs to dribble.
And the human over here is pretty much a nuisance, trying to get the ball away from the robot.
These policies, again, by the virtue of collecting lots of data, are robust so that the robot falls down, it will get up, and continue dribbling the ball.
One thing I didn't realize before I moved to Boston is how much fun snow can be for robots.
So we can take robots out and really test their robustness in scenarios which can become quite challenging.
So here is our robot trying to dribble in snow.
It's struggling.
But at least it's trying to get its job done.
Now, once we can dribble, we can also put an arm on the robot and have the arm go around.
So here, we are teleoperating the system while Tiffany is telling the way the end effector should go.
And the robot uses its body to follow and make the end effector pose be possible.
Now, we can do not just positions, but also forces, where over here, Gabe will try to pull the robot.
And what you'll see is the robot will adjust its hips to try to increase the amount of force it can apply.
And now Gabe is really struggling.
Now, once you can do force control-- oops.
I had some videos, but I think I skipped one of them.
It means that giving demos becomes very easy because you can just-- with one finger, now you can move the gripper of the robot.
And it will follow what you are doing.
So what I have shown you is some examples in dexterous manipulation and locomotion where we are able to design behaviors using the same framework.
So these tasks are complex.
And we are showing that they are working in less structured settings.
But there are still challenges in how quick and easy is it to design these behaviors.
So some of the challenges, really, are that one needs to spend substantial effort in designing the reward function and the environment.
The second challenge is, what can we actually simulate?
And the third challenge is, how safe are these controllers when we actually deploy them?
So first, I'll chat about this problem of reward design.
So when I was telling you about this robot which can run in different scenarios, what you would expect is that I gave a reward function which expressed the velocity that we want in x-y and maybe how fast I want the robot to turn.
But this turns out to be insufficient.
And people typically add more terms to it, for example, things like, what should be the velocity in the z-axis?
The base should not be too low.
So you're trying to help the robot discover particular behaviors.
And then there are some terms which are trying to make the robot's behavior smooth by penalizing energy.
Now, coming up with these terms and these coefficients is where people spend quite some time.
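A hand-shaped reward of the kind just described might look like the sketch below. Every term, threshold, and coefficient here is hypothetical, but it conveys why this is tedious: each term exists to coax out or smooth a behavior, and each coefficient is tuned by trial and error.

```python
import numpy as np

# Illustrative hand-shaped locomotion reward. All values are hypothetical;
# real controllers often carry a dozen such terms, each tuned by hand.
COEFFS = {
    "lin_vel_tracking": 1.0,    # follow the commanded x-y velocity
    "yaw_tracking":     0.5,    # follow the commanded turn rate
    "z_velocity":      -2.0,    # penalize bouncing up and down
    "base_height":     -1.0,    # keep the base from sinking too low
    "energy":          -0.005,  # penalize |torque * joint velocity| (smoothness)
}

def shaped_reward(state, command):
    r = 0.0
    vel_err = state["lin_vel"][:2] - command["lin_vel"]
    r += COEFFS["lin_vel_tracking"] * np.exp(-np.sum(vel_err ** 2))
    r += COEFFS["yaw_tracking"] * np.exp(-(state["yaw_rate"] - command["yaw_rate"]) ** 2)
    r += COEFFS["z_velocity"] * state["lin_vel"][2] ** 2
    r += COEFFS["base_height"] * max(0.0, 0.25 - state["base_height"])
    r += COEFFS["energy"] * np.sum(np.abs(state["torques"] * state["joint_vel"]))
    return r
```

Because the terms interact, changing any one coefficient usually means retuning several others, which is exactly the effort the talk wants to remove.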
So we would ideally want to get rid of this.
So how can we reduce this effort?
So what I'm going to tell you is an approach by which we can go from this old reward function to a reward function which looks much simpler but still retains many of the same properties.
So how do we go about it?
So let's first consider the setting where someone came up with these terms, but we still want to find these reward coefficients automatically instead of someone setting them by hand.
Maybe I have two minutes?
OK.
So now consider this problem setup where I want to maximize my task performance.
And I want to minimize the amount of energy consumption which is happening.
And then I want to trade them off with this parameter lambda, which says how much I prefer one term versus the other term.
Now, I can plot energy versus the task performance.
And what I want is to be very performant at the minimal energy expenditure.
So now if we end up choosing different lambdas, this ends up corresponding to different dots over here.
So ideally, what I want to find is what people call the Pareto optimal, which means that if I am getting a particular performance and a particular energy, I cannot improve my performance without increasing my energy expenditure.
Those are the kind of points that we would like to find.
So you could do it by doing a hyperparameter search on lambda.
But that ends up being quite expensive.
And this is what people end up doing by hand.
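The brute-force alternative being described is a sweep over lambda followed by a Pareto filter. In the sketch below the training run is replaced by a made-up closed form; a real sweep would launch one full RL run per lambda, which is what makes it expensive. The Pareto filter itself is the standard dominance check.

```python
import numpy as np

def train_with_lambda(lam):
    # Hypothetical outcome of training with reward = performance - lam * energy.
    # Larger lam means less effort: less energy spent but worse performance.
    effort = 1.0 / (1.0 + lam)
    perf = 1.0 - (1.0 - effort) ** 2
    energy_used = effort ** 2
    return perf, energy_used

# One (pretend) training run per lambda value in the sweep.
points = [train_with_lambda(lam) for lam in np.logspace(-2, 2, 9)]
points.append((0.5, 0.9))   # a clearly wasteful run, to show the filter working

def pareto_front(points):
    # Keep points where no other point achieves at least the same performance
    # for no more energy, with at least one strict improvement.
    front = []
    for p, e in points:
        dominated = any(p2 >= p and e2 <= e and (p2 > p or e2 < e)
                        for p2, e2 in points)
        if not dominated:
            front.append((p, e))
    return front

front = pareto_front(points)
print(f"{len(front)} of {len(points)} runs are Pareto-optimal")
```

Each kept point is one of the dots on the energy-versus-performance plot the talk describes; the cost is that every dot is a separate training run.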
So we tried tackling this issue.
And the way we do it is to say, well, let's learn two policies and let there be a game between these two policies.
So one policy goes and tries to maximize performance.
And the other policy is going to minimize energy.
And hopefully, when these policies are fighting, we're going to find something close to the Pareto optimal.
So we have one policy, which is trying to minimize the energy, and the second policy, which is just trying to maximize speed.
And then we couple them with a constraint.
And the constraint pretty much says that the policy which is minimizing energy should get as much reward as the policy which is only trying to maximize speed.
So you're pushing the speed policy to get more and more reward, but you're never letting the energy policy be less performant.
So we're trying to push energy down while still maintaining performance.
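Here is a toy version of that two-policy game, with scalar "policies" and a Lagrange multiplier enforcing the constraint; none of this is the actual training setup, it only illustrates the mechanics. Only a[0] affects task performance, while a[1] purely wastes energy, so the constrained player should hold performance while driving a[1] to zero, something the pure-performance player has no incentive to do.

```python
import numpy as np

def performance(a):
    return -(a[0] - 1.0) ** 2        # best possible performance is 0, at a[0] = 1

def energy(a):
    return a[0] ** 2 + a[1] ** 2

a_speed = np.array([0.0, 0.7])       # player 1: maximizes performance only
a_energy = np.array([0.0, 0.7])      # player 2: minimizes energy under a constraint
lam = 1.0                            # Lagrange multiplier on the constraint
lr, lr_lam = 0.05, 0.1

for _ in range(2000):
    # Speed player: plain gradient ascent on performance. Note it never
    # touches a_speed[1]; it has no reason to stop wasting energy.
    a_speed[0] += lr * (-2.0 * (a_speed[0] - 1.0))
    # Energy player: gradient descent on energy + lam * constraint violation,
    # where the constraint is performance(a_energy) >= performance(a_speed).
    grad0 = 2.0 * a_energy[0] + lam * 2.0 * (a_energy[0] - 1.0)
    grad1 = 2.0 * a_energy[1]
    a_energy -= lr * np.array([grad0, grad1])
    # Multiplier grows while the energy player underperforms the speed player.
    lam = max(0.0, lam + lr_lam * (performance(a_speed) - performance(a_energy)))

print(f"speed player: {a_speed}, energy player: {a_energy}")
```

At convergence the energy player matches the speed player's performance at strictly lower energy, which is the Pareto-style outcome the game is designed to find without hand-tuning a trade-off coefficient.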
Now, if we end up doing this, we can replicate many of the behaviors I was showing, for example, fast spinning over here or, again, going on the ice with this controller.
And we don't need any other reward shaping except these energy terms and the reward terms. So this controller is robust.
So the robot will also come to a standstill if it is given a zero velocity command, because it is trying to minimize energy.
And if you see we increase the velocity slightly, it moves its legs very little because, again, it's trying to conserve energy.
But if you increase velocity, it will pick up and start moving again.
So that's something which we are very excited about.
And I'll just end with the last part.
I mean the environment design part.
The other challenge is someone has to set up the simulation environment and get assets over there.
So in a recent work, what we're exploring is, could we scan the real world with advances which have happened in NeRF and other techniques and then add articulations in them, put them into a simulator, and then do learning in simulation and, again, roll them out in reality?
So imagine that you have your kitchen.
And you want a robot to work very well in your kitchen.
Can you scan your kitchen, put it into a simulator, learn policies, and put it back in reality?
So here's an example of this process which we did in a kitchen in CSAIL.
And then the goal is to open the microwave.
And Marcel is doing these other things while the robot is moving just to show the robustness which the system has in performing the task while other things are happening in the background.
In another example over here, it needs to place the mug on the shelf.
And then we would move the shelf.
You see the mug hit the shelf.
And it is moving.
You move the mug back again.
And the robot will still go and grab it and place it there.
So with this, I'll end the talk.
If you have more questions about other aspects, about safety, about what we can simulate, we have some exciting ongoing work.
And I'll be happy to chat with you.
So thank you so much for your attention.
We're going to have to cut it off at 1:30.
But if just a couple of questions before then, if there's anyone-- Hi, thank you.
Thank you for the talk.
I was wondering, how does this approach generalize to other tasks, or learning one policy for a certain task and then being able to take subskills that it learned in that task to learn another policy more efficiently?
How does that work?
And what would you need to make that happen?
Yeah, so that's something that we have not done yet.
But for example, what you would do is you would condition the policy on some identification of the task.
And that would be the first step that you would do.
And then if you do that and train the individual policies on multiple tasks first, you can distill them all down into a single model.
So in the same way as you would expect generalization to happen from machine learning, you would expect the same kind of generalization to happen that now if you encounter a similar task, then the amount of data you might need is going to be lower.
There's a whole different question of how do we plan using the skills that we already have.
And I think that's a slightly different conversation because now you need to know, when can I execute a particular skill?
And when can I not execute a particular skill?
So I think there are formal ways of doing it, in which Leslie is an expert.
And I think there are more machine learning ways of doing it, where, for example, in simulation, you can try to learn the distribution of where the skill can start and where the skill will end.
So I think that those are some of the things that we are looking to explore.
Maybe one more quick question?
Yeah.
So I am a little interested in the safety bit if you have a 30,000-foot overview.
So 30,000-foot overview is we are trying to find test cases which are going to break our controller.
And the whole-- so this is the same problem which happens even with LLMs. For example, people want to find prompts which are going to make the LLM not give the right answers or give toxic responses or racist responses.
So now the whole point is, can we automate these test cases instead of humans needing to come up with it?
So essentially, we are developing machine-learning methods which could try to come up with a diverse set of test cases where a system fails so that we can use it as training data.
So a kind of adversarial sort of thing?
Kind of, yes.
Well, thank you very much.
Perfect.
Well, thank you, everyone.