
comma ai | COMMA CON 2025 | Yassine Yousfi | Look Ma, No Labels | Head of Machine Learning

By george hotz archive

Summary

## Key takeaways

- **End-to-End Scales with Data, Not Engineers**: End-to-end machine learning uses general purpose methods that scale with compute and data, not lines of code or numbers of engineers, avoiding the need to hire labelers or write more code to improve performance. [04:00], [04:36]
- **Synthetic Data Fixes Masked Faces Fast**: During 2020 lockdowns, Comma created synthetic data by estimating face poses and landmarks with ML models, then overlaying masks from Google images to train driver monitoring without hiring models or artists. [11:43], [13:05]
- **Future Signals Enable End-to-End Labels**: Driver monitoring uses future events like overrides or standstill to heuristically label readiness, covering a small data percentage but providing strong signals for training without manual labeling. [15:55], [18:04]
- **Transformers Simulate Lateral Control**: Transformers predict next states from past observations, states, and actions to simulate lateral control across 300+ cars, enabling testing without real vehicles and handling quirks like tire inflation or road wetness. [22:59], [24:42]
- **World Models Generate Driving Experience**: World models predict next frames and actions from past inputs to simulate environments, allowing on-policy training in the data center that scales with data, remains physically accurate, and generalizes out-of-distribution. [27:59], [33:01]
- **End-to-End Generalizes to Robotics**: The same world model approach trains Billy the robot for indoor navigation and an SO ARM 100 for manipulation like grabbing coffee pods, using open-sourced datasets without hardcoding behaviors like traffic lights. [37:47], [41:31]

Topics Covered

  • End-to-End Scales with Compute, Not Code
  • Synthetic Data Fixes Out-of-Distribution Fast
  • Heuristics from Future Enable Self-Labeling
  • Transformers Predict Car Dynamics Across Models
  • World Models Simulate Superhuman Driving

Full Transcript

Hi, thank you so much, Maxim and Shane. Amazing talk from the openpilot team. Next, I'm very excited to welcome Yassine, our head of machine learning. He will be here momentarily once he gets miked up, so I'll give a few housekeeping notes before we get started with our next talk. There is still coffee in the back, so feel free to grab a coffee. Lunch will be served at about 12:45 in the entrance where everyone came in.

Hello everyone. Hi.

Let me see, let's check the clicker. The clicker is almost AGI-level technology. All right, it works. We've reached AGI.

Hi. My name is Yassine, from the autonomy team at comma, and today we're going to talk about machine learning. I'm here to talk to you about not just any kind of machine learning, but a very exciting style of machine learning that we call end-to-end. We're all familiar with it by now, everybody's talking about end-to-end, it's all the craze, all the cool kids are doing end-to-end machine learning. So I'm here to spill the tea and tell you what end-to-end machine learning actually is.

If you're here, you probably already know what openpilot is. We ship openpilot, an open source L2 ADAS system that runs on the comma 3X. It has tons of users, many of whom are here, I think, and a lot of supported cars, more than 300. And recently Linus Tech Tips said it was amazing, so if you trust Linus Tech Tips, I think it's amazing.

As a famous person said while looking at a car, "it's all computers." What I like to say is, well, it's more than that: what we ship is all machine learning. We ship this device to you called the comma 3X. It has cameras all around, it runs in a car that has some sensors and some actuators, and everything gets blended into this crazy linear algebra that we call machine learning, which then drives your car. It's magical, right? It's amazing. I think it's cool.

Well, it's especially cool because we are doing it end to end, and I'm here to explain to you what end-to-end means to us and why it's important.

In general, what we call end-to-end is what the literature calls general purpose methods. You want these methods to scale with compute and data. You don't want them to scale with lines of code, and you don't want them to scale with the number of engineers. You don't want to be in a situation where you need to ship a new feature or improve your system and the only solution is "I need to write more code," or "wait, I need to hire a labeler," or "I need to hire an engineer." That's not where you want to be. You want to be in a state where, when you need to make the model better, when you need to improve performance, the best way to do it is to make the model bigger, train it for longer, train it on more data. That's what we call general purpose methods. This little guy over there is Rich Sutton, and that's a photo of our data center, built by our friends at tiny corp; they have a booth right over there if you want to visit, with some cool hardware on display. So that's our data center: we want to scale with data and compute, not with people and lines of code.

As I said, all the cool kids are doing end-to-end, right? Even people who were not doing end-to-end a few years ago are now talking about it as if it were the only way to do it, and as if they had always done it. Which is fine; it's good for people to get convinced and change their opinions. So yeah, the cool kids are doing end-to-end. Tesla is doing end-to-end. Even Waymo is doing end-to-end. Robotics companies are doing end-to-end as well. Everybody's doing end-to-end.

But then you see these kinds of job postings: a data labeler job posting. Well, that's odd. That's not really end-to-end, is it? Even outside of robotics, even outside of the self-driving space, you see jobs like evaluator experts, prompt specialists, optimizers. There's a lot of hand coding still. But these models are end-to-end, right? Well, how you train them is very important. The models themselves can be end-to-end, but the way you train them might not be. You might still need effort, you might still need people, you might still need intermediary steps to train these models, which kind of defeats the idea of end-to-end.

But sometimes you see things like this, which are great. Sometimes you see these in the news: AI achieves silver medal in the Olympiad. That's great. These are the methods that I'd like to call end-to-end, because they don't really rely on human data, human labelers, prompt specialists, prompt engineers, or evaluators. These models were trained using reinforcement learning. They were trained using experience, not data labeled by people.

Why is this important? A few years ago, we were doing prediction. 2016, 2018 was all about ImageNet: predicting class labels, predicting lane lines, predicting this and that. Well, now it's the grown-up stuff. Now you need to take action. Now you need to act on your predictions, and this is what makes it hard. It's not just about saying things about the world; it's about interacting with the world, taking actions, observing what happens, and being robust to that.

Welcome to the era of experience. This is where we're at, and this is what we are doing right now as a company. We are building driving agents that act in the world, that act in a car, and we want them to learn how to drive from experience too, because that's how people learn and that's how you reach superhuman capabilities.

Again, the case for end-to-end is comma itself. This is a photo from the last COMMA_CON, two years ago. As you can see, not a lot of people in this photo; this was the entire company, and you get to meet us all today. I think this is still the entire company at comma today. So yeah, not a lot of people. End-to-end, general methods, and the bitter lesson in general are the reason a company like comma exists, the reason we get to exist in a world where other people are throwing money out the window on labelers and things like that, while we try to solve it in a principled way, in a way that scales with compute, data, and experience.

So, thanks to the bitter lesson.

Enough general and philosophical talk. I'm going to dive into some concrete problems and concrete examples of how these general, end-to-end methods were used and are used by comma to solve real problems. I want to start with driver monitoring. DM, as we like to call it, is how we make sure that the driver in the car is alert, not sleeping, not using their phone, while the comma 3X, the ADAS system, is driving. Because it's an L2 system, we need the driver, the user, to be alert and ready to take over at all times.

I'm sorry to bring this up this early, bad memories, but I want to start with this and get it over with, so we get the bad memories out of the picture. Do you all remember 2020? Tough year, right? Well, we had this thing called surgical masks, face masks, and suddenly people started wearing masks all the time, even in their cars: Uber drivers and so on. That wasn't common before; people didn't use to wear masks in their cars. So face masks were out of the distribution of the training data of the DM machine learning models, and we had to fix that fast. Everybody was locked down. We couldn't bring in labelers or take photos of people, the way people usually do it. So we had to be a little creative.

This was done by Wishing, who is over there. An easy way to do it is to create synthetic data procedurally: not using an artist, not asking someone to paste a mask onto a face in Photoshop, not hiring models to wear a mask and taking pictures of them. No, we use our existing data and some easy-to-implement general methods to estimate the face pose. This, again, was done with a machine learning model that was already trained; we're not using a labeler to give us the face pose. Same thing with the landmark detection: we delineate where the face is, and then, using these two sources of information plus a mask image you can download from Google Images, you just composite the two images, and there you go, you have a new dataset of someone wearing a mask while driving. That's great. I like to bring this example up because you could solve it in a completely different way. If you didn't want to use a general method, you would probably solve it by hiring a Photoshop artist, or by hiring some models, or by waiting for people to start wearing masks in more of your data. But we had to act fast, so we went the general way, and that worked out great.
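As an illustration, here is a minimal sketch of that kind of procedural augmentation: warp an RGBA mask image onto the lower face using estimated landmarks. The landmark names and mask handling are assumptions for the example, not comma's actual pipeline.

```python
# Minimal sketch of procedural mask augmentation (assumed landmark format; not comma's pipeline).
import cv2
import numpy as np

def overlay_mask(face_img, landmarks, mask_rgba):
    """Paste a surgical-mask PNG (with alpha channel) over the lower half of a face.

    landmarks: dict of pixel coords with keys 'left_cheek', 'right_cheek', 'chin',
    'nose_bridge' (assumed output of an off-the-shelf face landmark model).
    """
    dst = np.float32([landmarks["left_cheek"], landmarks["right_cheek"],
                      landmarks["chin"], landmarks["nose_bridge"]])
    h, w = mask_rgba.shape[:2]
    src = np.float32([[0, h * 0.3], [w, h * 0.3], [w / 2, h], [w / 2, 0]])

    # Warp the mask image so its key points land on the face landmarks.
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(mask_rgba, M, (face_img.shape[1], face_img.shape[0]))

    # Alpha-blend the warped mask onto the original frame.
    alpha = warped[..., 3:4].astype(np.float32) / 255.0
    out = face_img.astype(np.float32) * (1 - alpha) + warped[..., :3].astype(np.float32) * alpha
    return out.astype(np.uint8)
```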

Another example I'd like to bring up (it looks a little creepy here) is what we used to call the eye shutter. The models didn't really detect that well when people were closing their eyes, and we didn't have a lot of data of eyes that were shut, because when you shut your eyes it happens pretty fast, so the data doesn't really have that class balance, as we used to call it. So again, you get the eyelid landmarks using a neural network; there are plenty of them online that you can download, or you can train your own. Then, using some simple morphological operations, you can just close the eye, and there you go: you have a photo of a closed eye from a photo of an open eye. And again, this is a really simple, general method that scales with data. You inflate your dataset very easily, without calling in Photoshop artists.
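A crude sketch of the same idea, with the caveat that instead of the exact morphological trick from the talk, this version simply paints over the eye region from surrounding skin, given eyelid landmarks from any off-the-shelf model.

```python
# Rough "eye shutter" augmentation sketch (assumed landmark format; a stand-in
# for the morphological eyelid operations described in the talk).
import cv2
import numpy as np

def close_eye(img, eye_landmarks):
    """Synthesize an approximately closed eye by inpainting the eye region with nearby skin.

    eye_landmarks: (N, 2) array of pixel coordinates around one eye.
    """
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(eye_landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    # Dilate a little so lashes and eye corners are covered too.
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=2)
    # Fill the masked region from the surrounding skin.
    return cv2.inpaint(img, mask, 5, cv2.INPAINT_TELEA)
```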

For example, more recently, there are these things called VLMs, vision language models, where you can give them a prompt and an image and they interact with you. With some very light prompt engineering, you can put in a photo of someone driving while using their phone, ask the VLM to say yes if the person is distracted from driving because they're using their phone, and the VLM knows Adeeb is using his phone. So yes, this is a more recent way to deal with this, and we're experimenting with how to use larger and more capable models that are trained on tons and tons of data to automatically label some of our training set, which is what I like to call vibe labeling. That's a live photo of Wishing over there.
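For concreteness, here is a minimal sketch of that kind of vibe labeling against an OpenAI-compatible vision endpoint; the model name and prompt are placeholders, not what comma actually uses.

```python
# Minimal "vibe labeling" sketch (assumes an OpenAI-compatible vision API; placeholder model name).
import base64
from openai import OpenAI

client = OpenAI()

def label_phone_usage(jpeg_path: str) -> bool:
    """Ask a VLM whether the driver in this frame is distracted by a phone."""
    with open(jpeg_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Answer only YES or NO: is the driver distracted because they are using a phone?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```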

Something more interesting about DM is that we also know the future, and that's the case with our driving data too: you know the future. You know whether someone is going to take over or not in the future, you know whether they're going to override the steering; we call that an override, or a takeover. So if the person took over in the future, that means they were ready to take control, right? Because they did take control in the future. Or say they're just sitting around in traffic: usually you're not really aware of what's happening around you when you're stuck in traffic. You can look at your phone, you can look down, you can be distracted. So the probability that you're distracted or not ready to take control when you're sitting at a standstill is generally higher. And the rest, we just mask out. Obviously, it's not just these two cases that exist; there are tons and tons of other cases. But we can label a small amount of our data using these heuristics that come from the future, or from other sensors around the car, and the other stuff we just don't care about. Even with the really, really small percentage of data this covers, we can find some interesting examples, like the one I'm going to show you here.
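A minimal sketch of those future-derived labels, with hypothetical field names standing in for comma's log format:

```python
# Label driver readiness from what happens *after* each frame (hypothetical log fields).
import numpy as np

MASKED, NOT_READY, READY = -1, 0, 1

def label_readiness(t, speeds, override_times, horizon_s=5.0):
    """t: per-frame timestamps (s); speeds: per-frame vehicle speed (m/s);
    override_times: timestamps (s) at which the driver overrode / took over."""
    labels = np.full(len(t), MASKED, dtype=np.int8)
    override_times = np.asarray(override_times)
    for i, ti in enumerate(t):
        # Driver took over shortly after this frame -> they were ready to take control.
        if np.any((override_times > ti) & (override_times < ti + horizon_s)):
            labels[i] = READY
        # Sitting at a standstill -> much more likely to be distracted / not ready.
        elif speeds[i] < 0.1:
            labels[i] = NOT_READY
        # Everything else stays masked and contributes no training signal.
    return labels
```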

I don't know if you can see it, but this is a photo of me driving. Am I ready to take control? That's an interesting question. I'm looking at the road, it looks normal, and yes, I am ready to take control, because in the future I did take control: I did override. So that's a good example. And here, am I ready to take control or not? Obviously not. I'm looking down at my phone, and I don't look that interested in what's happening around me. Well, yes, because I'm at a gas station waiting for someone to come back to the car. So in this case, no, I'm not ready to take control, and the reason we know is the standstill. We call these end-to-end labels, and they cover a really, really small percentage of our data, but it's enough to get some really good signals and some interesting samples from our dataset.

All of this was shipped and is maintained by Wishing, who is over here. The idea is that you can ship really good systems and really reliable machine learning models with a very, very small team, if you go the route of general methods and if you don't fall into the trap of adding features or improving the system with labelers and other things that don't really scale. So yes: shipped and maintained by Wishing.

The next example is what we call lateral control simulation. I forgot to ask: who here is an openpilot user? Oh, nice. Who's heard about tuning? Okay, cool. So, what's the deal with lateral control in general? These are from a really nice forum run by our friends at Sunnypilot, who are here and have a booth, so you're welcome to go visit them. "This model has been fantastic on my Hyundai/Kia/Genesis angle-steering car." "Finally drove this model and the long; both perform really well. Smooth turns, some hugging, but nothing too crazy." "I can't use this model because it hugs too much." "To me, this model is okay, but it kind of turns too much and too quickly. I don't like it; with my Toyota, lateral is not good." "I actually thought this was a step up."

[snorts]

Yeah. So the thing is, we support a lot of cars, more than 300 car models, and every car and every model has its own quirks and features, and these quirks and features materialize in the lateral control. The idea is that we control most of our cars using torque: you put torque on the steering wheel, and then you observe a lateral acceleration, and the difficulty is the predictability of this function. How does this translate to that? It depends on so many things: your car, your model, how much your tires are inflated, the roll of the road, whether the road is wet or not, how old your steering rack is. It's really complex, and it's a really difficult thing to model. That's why you hear about things like tuning, hugging, turn cutting, things like "oh, this performed well on my car," and then someone else says, "well, this didn't perform on my car." It's partly because of this.

So how do we solve this? We could do it for every car: we could have a Python file per car and try to hard-code this with some Python code, or maybe some C++ code. Or we could just write it as one big machine learning problem. You have your input observations and your input states, and we're going to come back to slides shaped like this a lot in this presentation. You have inputs that are observations over time, states over time, and actions over time in the past, in your context, and what you want to output is the next state. Given what you observed and the actions you took, what are you going to observe next? And we all know the only way to solve machine learning problems is the transformer architecture, right? That's how we do it. That's how the cool kids do it.

What's a transformer? I'm not going to talk about it much today, but it's a stack of layers that you repeat a number of times, and you get to repeat them a lot. The more you repeat them, and the bigger these layers become, the more parameters, or weights, you end up with, and the bigger the model. That's how you hear things like "this model is three billion parameters, this one is five billion parameters": it's essentially how many of these layers you have. And the transformer (not this one, the machine learning architecture) is trained to predict a sequence of numbers shifted by one. It takes a sequence of context numbers, shifts them by one, and essentially tries to predict the future, which is really cool, because that's exactly what we want to do, right? We have a sequence of observations and states, and we want to predict the next state. It's great.

And it magically works. The red line follows the green line, or the cyan line. That's great. So yeah, we train these big models using our datasets, they predict whatever we're interested in, and that is wonderful.
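Here is a minimal sketch of that shifted-by-one dynamics model, assuming hypothetical tensor shapes and using a stock PyTorch encoder as a stand-in for comma's actual architecture:

```python
# Dynamics-transformer sketch: past observations, states, and actions in; next state out.
import torch
import torch.nn as nn

class DynamicsTransformer(nn.Module):
    def __init__(self, obs_dim, state_dim, act_dim, d_model=256, n_layers=6):
        super().__init__()
        self.embed = nn.Linear(obs_dim + state_dim + act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, state_dim)  # e.g. next lateral acceleration

    def forward(self, obs, state, action):
        # obs/state/action: (batch, T, dim), time-aligned past context.
        x = self.embed(torch.cat([obs, state, action], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(x.shape[1]).to(x.device)
        h = self.encoder(x, mask=causal)
        return self.head(h)  # output at step t is the prediction for the state at t+1

# Shifted-by-one training target:
#   pred = model(obs[:, :-1], state[:, :-1], action[:, :-1])
#   loss = torch.nn.functional.mse_loss(pred, state[:, 1:])
```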

Now, how do we use this? In the autonomy team, we have this rule: no testing in cars. It's inefficient. You have to go downstairs, you have to start the car. Sometimes the car is off. Sometimes you need to pull the branch. Oh wait, I don't have the right key, someone took the key. Oh, someone took the car, the car is in the car wash. So yeah, it's inefficient; you don't want to do that. What this enables us to do is take a segment, say driving data from a Toyota, and simulate it as if it were coming from a Hyundai, or from another car. Then you can see how the model behaves on different cars, using different noise models, as we call them, that come from different platforms, which is really cool. Here you can see one segment being replayed; I think it was originally from a Kia EV6, and then we changed the noise model so that it matches a Bolt, or a Corolla. You can just swap the car without going downstairs.
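A minimal sketch of that idea, assuming a hypothetical dynamics-model interface and log fields rather than comma's internal tooling:

```python
# Replay a logged drive through a dynamics model trained on a *different* car platform.
def replay_with_noise_model(predict_next, log, context=100):
    """predict_next(obs, states, torques) -> next lateral state, for the other car."""
    states = list(log["lat_accel"][:context])            # seed with the real history
    for t in range(context, len(log["steer_torque"])):
        nxt = predict_next(
            log["road_roll"][t - context:t],              # observations
            states[-context:],                            # simulated states so far
            log["steer_torque"][t - context:t],           # the torque commands actually sent
        )
        states.append(nxt)                                # roll the simulation forward
    return states
```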

This was shipped and is maintained by Casper and David, who are over there. They have a booth where you can actually drive with a noise model from... which car? Oh, you can pick the car. So you can drive using a joystick and observe what the model observes every day. It's a really fun booth, with a steering wheel and a TV, so you get to experience what the model experiences in real life.

All right, now to driving models. This is the model that drives the car, the model that outputs a desired curvature and a desired acceleration given the camera views. Recently we wrote a paper called "Learning to Drive from a World Model," and it really cements, really shows, how we want to solve, how we are currently solving, the problem of training driving agents end-to-end from experience. This is a screenshot from our blog post, where we describe in detail how we do this, and I encourage you to go check it out. It's on our blog, you can also read the paper and look at the slides, and even the talk is published. We have a really nice interactive blog post where you can click and play and interact with the world model that we train.

What's a world model? A world model is something of a buzzword right now. Everybody's talking about it, the way people talk about end-to-end. A world model is a model that takes in previous states, previous observations, and previous actions, and predicts the next state: what's going to happen to the world you're observing, given what happened in the past and given an action. And it's really important, because some really smart people are saying that it's important.

That's it.

[snorts]

No, it's important because it allows us to interact with the world and generate these experiences, to give the model the opportunity to interact with the world without actually interacting with the real world, because that would be expensive. You don't want to take an untrained model, put it in a car, tell it to go try things, come back with some data, do some gradient descent on that data, and send the model back out. You don't want that. You want some kind of simulator, some kind of world model, that lets you do this in the data center instead of the real world.

Our world model is essentially two models. One of them takes in the observations and compresses them into what we call a latent space, a latent vector. This compression is important because we use that compressed feature set as the set of observations. The world model itself, which is a diffusion model, is then trained to predict the next state, to predict the next latent vectors. There are details about how these models are trained, using the latest tricks and the latest cool stuff in machine learning, which I'm not going to go into today. If you're interested, the paper describes in detail how all of these models are trained.

Again, inputs and outputs. The input is the states, essentially the camera feeds, the fcam and the ecam, and the action is what we call the pose, or the position update, which is a vector of translation and rotation, a six-degrees-of-freedom input. The output of the model is the next frame: what is the world going to look like, given the past and the action I took? Things move around, you move around, and it should predict that accurately. But we also want this world model to tell us something about driving, not just about the world, because this is a very special type of model, what we call an action world model. We are asking the world model to also tell us how to drive in this world, in this future world that it is going to imagine: what is the best path, what is the best curvature, and what is the best acceleration to take, given what you did in the past and given what the future is going to look like. So this is a special type of model, and it's future-anchored, to give it some more recovery pressure. Again, all the details of why and how we do this are in the paper, so I encourage you to go check that out.

We use these world models as a simulator. As I said before, you take past frames and a past action, and the driving simulator, the world model, predicts the next frame and also the next action. You can use that next action however you want: you can use it as ground truth to train a different model, which is what we do, or you can feed it back into the model and run it autoregressively to do what we call a rollout of a simulation.
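Here is a minimal sketch of that loop: roll the world model forward autoregressively and use its recommended action as the training target for a small causal policy. The `world_model.step` and `policy` interfaces are hypothetical, duck-typed stand-ins; the real setup is in the paper.

```python
# Autoregressive rollout + action distillation sketch (hypothetical interfaces).
import torch
import torch.nn.functional as F

def rollout_and_distill(world_model, policy, optimizer, frames, actions, steps=20):
    frames, actions = list(frames), list(actions)
    for _ in range(steps):
        # Given the past, the world model imagines the next frame and recommends an action.
        next_frame, expert_action = world_model.step(frames, actions)

        # The policy is causal: it only sees the past, as it would on device.
        pred_action = policy(frames)

        # Distillation: the world model's recommended action is the ground truth.
        loss = F.mse_loss(pred_action, expert_action.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # On-policy: feed the imagined frame and the policy's own action back in.
        frames.append(next_frame)
        actions.append(pred_action.detach())
    return frames
```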

And these models are great. They scale with data, as you can see. LPIPS is a metric of how good the image looks, so the lower the LPIPS, the better. More layers, better model; bigger dataset, better model, which is what we want. Great. Rich Sutton's happy.

Also, and this is very important to us: when you train a model that is going to drive a car, you want it to be physically accurate. If you ask it to move 0.5 meters to the right, you need it to actually move 0.5 meters to the right, not 25. You don't want to wing this, right? This is driving, it's serious stuff. When you're making a video game and you ask the player to move a little bit to the right, you don't really care exactly how far it moves. But in driving, you really need to be physically accurate. So we have a benchmark test that starts in the lane center and asks the model to move to the right, and then you know exactly how much the model moved, measured using a different model. So you can check whether the world model is physically accurate, and with a lot of care about how you train and use these models, you can make them physically accurate.
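A minimal sketch of that benchmark, with the simulator and the offset-measurement model passed in as hypothetical callables (the real measurement uses a separate model, as described above):

```python
# Physical-accuracy check sketch: command 0.5 m to the right, measure what actually happened.
import numpy as np

def lateral_accuracy_error(simulate, measure_lateral_offset, context_frames,
                           target_m=0.5, horizon=40):
    # Constant "move right" pose updates: 6-DoF (x, y, z, roll, pitch, yaw) per step.
    step = np.array([0.0, target_m / horizon, 0.0, 0.0, 0.0, 0.0])
    actions = np.tile(step, (horizon, 1))

    frames = simulate(context_frames, actions)              # world-model rollout
    achieved = measure_lateral_offset(frames[0], frames[-1])
    return abs(target_m - achieved)                         # error in meters
```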

They also generalize out of distribution, which is pretty cool. Look at this crazy swerve. Wow, I hope no one did this in our dataset. I think no one did; you're all very responsible. As you can see, we really ask it to go very, very far off the road, onto the shoulder, and it responds to that pretty accurately. So we say that it generalizes out of distribution.

And if this plays... yes, this is a very long video, so I'm not going to make you watch eight minutes of rollouts, as we call them. This is essentially what the world model's simulated videos look like. All of these videos are simulated; they're not real. The top one is the frame corresponding to the narrow view camera, which we call the fcam, and the bottom frame, also simulated, corresponds to the wide angle camera. You can see how consistent the views are and how consistent the movement looks between these two frames. They're generated by the same model: one model generates both frames in one shot, so you can see how consistent they are. There's a little installation over there on the TV where you can see this loop, with some fun captions that we all wrote for each video. So I'm not going to make you watch all of it now; if you want to see some examples, they're on the TV back there, near the coffee stand.

And we use this world model, which generates simulated worlds and simulated environments, to train on-policy and to generate experience. We described this in our paper. Essentially, as I said earlier, we start with past frames and an action; the driving simulator, the world model, gives us the next frame and the next action. We feed the next frame back in, and we use the action the world model recommends ("hey, this is what I recommend you do") as ground truth for a different model that we call the policy. We use it to train the policy, and we do this over and over again; the policy gets better and better at driving, and that's how we ship the models in openpilot right now. No secrets. This is just a slightly fancier way to describe what I just said. Shipped and maintained by Harald, Armon, myself, and other people from the autonomy team. So yeah, that was driving.

Now, this translates really easily to other types of robotics problems.

Driving is a robotics problem, a very simple one, where you actuate using just two numbers: curvature (or torque) and acceleration. But things can get a little messier for general autonomy. Do you all know Billy? Who knows Billy here? Oh, cool, a few fans. Billy is our friend, a comma body. The comma body looks like this: it's a comma 3X on a stick on a hoverboard. It's a robot.

We used the same approach I just described for driving for general indoor navigation problems. The goal is to make Billy roam around the office, avoid objects, and, if it hits an object or gets stuck, recover autonomously, and obviously avoid collisions. We open sourced a dataset of indoor navigation using the comma body, using Billy, in our office. It's a few gigabytes of roaming around the office, and it's open source. The problem here is simple. You have input observations, the camera inputs, and you can also look at the wheel speeds, which is nice to have. The actions are the WASD inputs, for the gamers; the gamers are right here. And the output is again: what's the next frame going to look like, what's the world going to look like given these past observations and actions, and also what's the next action you should take.
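For illustration, here is one possible encoding of that WASD action space for a differential-drive body; the exact encoding is an assumption, so check the released dataset for the real one.

```python
# Hypothetical WASD -> wheel-speed mapping for a two-wheeled body (not the dataset's exact encoding).
WASD = {
    "w": (1.0, 1.0),    # forward: both wheels forward
    "s": (-1.0, -1.0),  # backward
    "a": (-0.5, 0.5),   # rotate left: wheels in opposite directions
    "d": (0.5, -0.5),   # rotate right
}

def keys_to_wheel_speeds(pressed, max_speed=0.5):
    """Map the set of currently pressed keys to (left, right) wheel speed commands."""
    left = right = 0.0
    for key in pressed:
        dl, dr = WASD.get(key, (0.0, 0.0))
        left += dl
        right += dr
    # Clip so combined keys (e.g. "w" + "a") stay within the speed limit.
    clip = lambda v: max(-1.0, min(1.0, v)) * max_speed
    return clip(left), clip(right)
```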

So, as you can see, this kind of works. It recovers. This is our kitchen. This is the openpilot lounge. You can see it moves around, bumps into people, tries to recover. Here you go, it backs up. This is all end to end: a model that is trained to predict the future and to predict the next action, and it recovers.

This is a short way of showing how these approaches don't really hardcode things. I don't know if I've said this before, I probably said it in a previous presentation, but nowhere in our code will you find traffic lights or cones, only in our tests. Some of our tests try to find segments, data with cones, and interesting ways to evaluate the model. But in our training code, we don't hardcode anything about driving. Everything is learned by the model. All the behavior, all the driving, is done implicitly. We don't hardcode lane lines; we don't hardcode any of that. So if we just swap the dataset for a dataset of indoor navigation with a robot, it just magically works. Again, general methods work and generalize to different kinds of problems, which is great.

We also open sourced this model. Yeah, this model is open source, so if you have a comma body, you can just try it out at home and see if it avoids your coffee machine, or your fancy couch.

Now, can you do this for manipulation? This is an SO-ARM100, an open source, 3D-printed arm. Same thing: we tried to do this, and we open sourced a dataset of something I like to call the coffee test, which is just grabbing a coffee pod, a Keurig pod, and putting it in this little tray. You can train a model (I didn't do this, but you can) using this dataset to do this autonomously with the arm.

Okay. This was shipped and is maintained by Armon, who is also here, Casper, and myself.

And while I'm here, I'd like to advertise something that we haven't published yet, but will very soon. I know you guys like compression a lot; I don't know why people love compression. So we're doing a new compression challenge. We've done one in the past that we called the commavq challenge. This one is called the comma2k19 compression challenge, and it's essentially a new style of compression problem, so stay tuned for that. It's going to be a very fun challenge for everyone who likes to do some coding and some challenges. The idea is to compress videos, make the video size small, but with a very different goal: not making the image look good, but preserving semantic content and temporal dynamics. That's it. You can make the image look as bad as you want, but as long as, when you put that image into a segmentation net or into a pose net, it predicts the right stuff, you can do whatever you want to that image. As long as the segmentation model and the pose model are fooled by your compression, you're good to go. And the cool thing with this challenge is that we open source the segnet that we use, and we also open source a pose net. So it's going to come very soon; stay tuned. We're open sourcing all of this for the challenge, and you get to play a little bit with the compression, plot things like rate-distortion curves, and do sciency stuff if you want. So yeah, stay tuned for the comma2k19 compression challenge, and also stick around for more talks about autonomy, research, and machine learning. Harald is going to talk later about how we bootstrap very useful robots today, and Mitchell is going to talk about building a million-mile dataset.

Any questions?

>> Thank you, Yassine. All right. Amazing.

>> [applause]

>> We're going to take some questions from the audience. Do I have any questions? Hands raised.

>> Um, how do you evaluate your world model? I mean, it must be getting better every day. Like, how do you evaluate what it's getting better on?

>> Yes. So we have tests for the world model. We describe them in detail in the paper, but in general, what you care about is image quality, how good the image looks as output from the world model, and how good the video is, not just the image, because we do this autoregressively. You want the world model to generate good videos, not just good images. Those are the first two things. And then you also want the world model to be physically accurate, so we have this other part where you control the world model using some fake actions and then you see how well it responds to those fake actions. So we have these two kinds of world model evaluation.

>> All right, more questions.

>> Hi. So I assume that your world model is stereo and temporally consistent, because I think it has to be to be useful. But you guys use two mismatched sensors and lenses. So I guess my question would be: how much of that is generalizable if you wanted to change the sensor or lens in a future comma device? How much of your model could be reused, and how much of it would you just have to remake from scratch?

>> Yes, that's a very good question. So yeah, most of the models will need to be retrained if you add a new sensor. But we also do some kinds of pre-training of the world model that are independent of the number of sensors, things like that, or of the temporal context and how big the temporal context is. So there are things we can change in our architecture today that don't require retraining everything, but yes, adding a new sensor, adding a new observation vector to the state, will need retraining, which is fine.

Another question.

>> Hey, did you have a version of the experiment where the world model is just predicting the actions and not necessarily the visual cues? And does it really help to have both the visual cues and the actions predicted together?

>> The short answer is yes, it does help. You can train a model that doesn't even predict the images and only outputs the future actions. But yes, outputting the images and the actions together makes the model more accurate, and it generally makes it more intelligent, basically.

>> Yes.

>> We can take a few more questions. Any more questions for Yassine?

>> So is the next frame prediction also happening during inference on device, or is that just to train a world model for... okay, for pre-training, I guess.

>> Yes. So the next frame prediction is not done on device. It's in the back end, and we use it just to train the policy model, which is a much smaller model. You can see it as a little model, a small model, that plays inside the dreams of this bigger model, the world model, and gets distilled, learns a little bit from the big model. That policy model is what we ship.

>> And that one does not predict the next frame?

>> Yeah, it only predicts the next action.

>> Thank you.

>> Any more questions?

>> When you were using the vision language model to determine whether a person has a phone or not, how accurate was it? For example, if a person had a hat, or didn't have anything.

>> Yes. So we used it in a very specific case where it was very easy to label, because it was just asking about a phone, whether a phone was in the frame or not. It was pretty accurate, from what I recall, because you can see the phone in the frame; it's pretty easy to see. Some people do have some very creative ways to hide their phones, but as long as you can see it, the VLM usually sees it pretty easily. But yeah, you have to be careful with the prompt, with the question that you ask. The question we asked in this case was very simple: was there phone usage or not? But if you ask, for example, "is the person looking at the road?", well, I don't know. Is it the road they're looking at, or the dash, or the steering wheel? You don't really know, and the VLMs don't know these kinds of things either. But for phone usage, or I think sleep detection as well, these vision language models do very well.

Thank you.

>> We'll take one more question.

>> Thank you. You mentioned you have this world model that is predicting the action, but then you're doing some kind of pseudo-distillation with this on-policy smaller model. So I assume the bigger model would benefit from data from all cars, but then you could distill according to each separate car that you want to deploy on, so the smaller model would only see data from that specific car model.

>> Mhm.

>> Is that the case?

>> We could do that. We don't do that.

>> Then what's the benefit of retraining an on-policy model? Is it just for inference speed reasons?

>> Well, the first reason is that the world model is not causal. Not causal means that it needs the future; that's what I wrote on my slide, what I called future anchoring. So the world model cannot be shipped on device for this first, very fundamental reason: you don't know the future when you drive. You need to distill it into a causal model, one that doesn't know the future and can act. That's the first reason. The second reason is, yeah, it's too big. Even if we had a causal model, we would not be able to ship it on the device, so we still need to distill it into a smaller model. And why don't we do this per car? Well, it doesn't really make sense. You want a policy model to know how to drive in general. This is not a car-specific problem; it's a driving, scene-understanding problem. The policy should really be independent and robust to these different types of cars and makes and models, and then the controls layer is what deals with the differences. So yeah.

>> There was one last question.

>> One more question.

>> Yeah.

>> Would having more real-world data, as opposed to synthetic data, improve model quality? And is there a way to assess that? Did you do that?

>> Yeah. So the world model benefits from real-world data. Yes, the more real-world data you train it with, the better the world model is, and then we use that world model to generate synthetic data for the policy. So real-world data is still very, very useful, because we use it to train these world models, which are very big and very data hungry. So yeah, definitely.

>> Thank you so much, Yassine.

>> Thanks.

[applause]

>> We'll be able to take more questions after all the talks. We'll have a Q&A with all of the teams, so you'll have the whole autonomy team on the stage. If you have burning questions, you can keep those for later.
