Waymo: The future of autonomous driving with Vincent Vanhoucke

By Google DeepMind

Summary

## Key takeaways

- **Simplest Yet Hardest Robot**: Autonomous driving is the simplest robotics problem because you only need to predict two numbers: steering direction and acceleration or deceleration. But this hides the deep complexity of understanding the environment, predicting human behaviors, and following road rules, making it a very social robot embedded in the unpredictable real world. [04:09], [04:39]
- **Sensor Fusion for Robust Perception**: Cameras provide human-like vision but struggle with depth; LiDAR excels at depth and geometry but lacks color and semantics; radar detects speed and range effectively. By fusing these diverse, flawed sensors, like merging conflicting eye views, the AI creates a cohesive, redundant picture of the environment, enhancing safety through belief updates even in low-light conditions. [06:24], [09:01]
- **Closed-Loop Learning Challenge**: You cannot train autonomous driving behaviors in isolation due to the closed-loop problem, where actions influence other agents' responses, leading to error accumulation like the DAgger problem in robotics. Simulating entire interactions in a faithful environment allows learning stable, real-world behaviors, akin to training language models to stay on topic in long conversations. [16:44], [17:29]
- **Semantics Drive Behavior Prediction**: Categorizing agents semantically, distinguishing a car from an emergency vehicle or snow from a rock, enables accurate predictions of behaviors, like a cat darting unpredictably or yielding to police activity. This deep understanding, learned over 15 years, informs safe decisions, such as giving bicycles proper passing width, avoiding issues like early DARPA tumbleweed confusion. [19:25], [20:27]
- **Safer Than Human Drivers**: Autonomous vehicles should act like the most normal, boring driver to blend in and avoid exploitation, but adopt a more conservative posture than humans, who often tailgate too closely or overlook hidden risks like occluded pedestrians. Waymo achieves 88% fewer severe-injury accidents, an order of magnitude safer, by surpassing average human risk analysis. [34:49], [37:45]
- **No New Breakthroughs Needed**: Autonomous driving has evolved through five generations of AI over 30 years, from robotics to transformers and foundation models, reaching a point where current technology suffices for mainstream adoption without fundamental new breakthroughs. The path forward relies on innovations like world models and large-scale simulations to make it practical worldwide. [50:26], [51:05]

Topics Covered

  • Autonomy Simplest Yet Hardest Robotics
  • Sensor Fusion Beats Single Flaws
  • Simulations Enable Closed-Loop Learning
  • Semantics Unlock Agent Predictions
  • Boring Normal Driving Maximizes Safety

Full Transcript

Welcome back to Google DeepMind the podcast with me, your host, Hannah Fry. Now, the

dream of autonomous vehicles has been a science fiction hope for decades, but now they are finally here because Waymo, a driverless car company that is headquartered in Mountain View, has got autonomous taxis available to hire across a number of American cities. They're very noticeable on the streets. They are these

big white cars with lots of sensors on top, but crucially, nobody sitting behind the steering wheel. But this journey to make driverless cars that are reliable and safe has been an incredibly complex one. And so today I get to talk to Waymo's distinguished engineer, Vincent Vanhoucke, for the podcast.

But before we catch up in the studio, we thought we would take the Waymo for a spin. Okay, all right. So

we're off. What's interesting is you sort of forget that there's no one driving the car. I sort of looked up and gave myself a bit of a shock. It just feels like being in a normal car. Being in a Waymo is very exciting for the first five minutes of your ride. And then it gets, like it's very zen. It is very zen. Now, this is interesting because there's roadworks going on.

Yes. And you see... Yeah, the cones. But also, it was sort of making its mind up about which direction it was going to go in. Because that's the thing, actually, the conflicting information. You've got the original street markings, the signs and so on. Then you've got the temporary roadworks thing. Yes. But

you could even have another layer on top of that, which is, you know, perhaps a person who's indicating to ignore all of that. Yes, you have... Cones are a big signal, right? Being able to sort of understand from the series of cones that it means that there is a lane. A trajectory. Right. Yeah. I mean,

you're getting a really accurate view. I sort of keep expecting to see it slip up in some way, like it's going to make a little mistake, but it's getting it absolutely spot on. Another aspect here is that we... integrate the information over time.

So we don't make a new observation at every point in time. We take the observation, what we know, and then we update our belief of the cars over time.

And we find this estimate and change our estimate as we acquire more information from all the sensors over time. So you're doing that at every time step? Mm-hmm. But are

you doing it for the entire scene simultaneously, or are you doing it for each individual object? For everything. For everything. For everything. It's all a continuous stream of information that comes in and gets integrated into our belief space of the car. It's just a big Bayesian machine, basically. Yes. Amazing.
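To make the "big Bayesian machine" idea concrete, here is a toy sketch, illustrative only and not Waymo's stack, of integrating observations over time: a one-dimensional Kalman-style filter that keeps a Gaussian belief over a tracked object's position and refines it with each noisy sensor reading. All names and noise values are hypothetical.

```python
# Toy illustration of integrating observations over time (not Waymo code):
# a 1-D Kalman-style filter keeping a Gaussian belief (mean, variance)
# over a tracked object's position, updated as noisy readings arrive.

def predict(mean, var, motion, motion_var):
    """Propagate the belief forward in time: uncertainty grows."""
    return mean + motion, var + motion_var

def update(mean, var, measurement, measurement_var):
    """Fold in one noisy sensor reading: uncertainty shrinks."""
    k = var / (var + measurement_var)          # Kalman gain
    return mean + k * (measurement - mean), (1 - k) * var

# Belief about a car roughly 10 m ahead, very uncertain at first.
mean, var = 10.0, 25.0
for z in [11.2, 10.8, 11.0, 11.1]:             # successive range readings
    mean, var = predict(mean, var, motion=0.0, motion_var=0.5)
    mean, var = update(mean, var, z, measurement_var=4.0)
    print(f"belief: {mean:.2f} m, variance {var:.2f}")  # uncertainty shrinks
```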

Are there other spaces that you think will eventually have the autonomous driving capability? I think it's going to be prevalent in every transportation sector; there will be an option to have autonomy, whether it's trucking, whether it's personal cars. In fact, we picked the ride-hailing space because it's hard, it's unforgiving. There is no driver in the seat, in the front seat, and it has a lot of complexities that may not exist in other scenarios. So the technology is very generalizable and can be used in many, many other places. There is still a lot of work to do. Sure.

There is still a ton of work to do to get there.

Welcome to the podcast, Vincent. Thanks for having me. I know you've worked at Google for a number of years, previously on robotics. How does the driverless car problem differ from a more generic robotics problem? Well, in some ways, the autonomous driving problem is the simplest robotics problem. You have basically two things you need to do. You have

to know if you're going to turn left or right. That's one number. And then

you have to know if you're going to accelerate or decelerate. That's two numbers. In

most robotics problems, you have to predict hundreds of numbers to figure out all the degrees of freedom of your robot. This is the simplest robot that has only two degrees of freedom, but that hides all the complexity of the actual problem. Predicting those two numbers is actually a very deep and hard

problem. You have to understand the environment. You have to understand the people that are around you, around the car, how they're going to behave, what the environment is going to look like in the future. You have to predict the rules of the road, what you're allowed to do, what you're not allowed to do.

And the mix of all this makes the problem hard. Conceptually, it is a robotics problem. Those are robots, but they're very social robots. And they're also embedded in the real world, which I imagine could be quite humbling. The real world is extremely challenging to work in. The expectation

in a lot of robotic contexts is that you have a robot in an environment that you more or less control, or that you have a reasonable expectation about what the other agents in that environment will do. Like a factory floor, for example, where you have total control over it. Yeah. We have to basically understand and mesh with the environment, be respectful of the

people that live in it, blend into the environment as best we can, so that we can serve the public, thrive, and have the freedom to operate. Okay, so in terms of choosing those two numbers, I mean, first the car has to perceive the world around it.

So I guess, you know, before it plans what to do next. So I guess if we start there then, in terms of the perception, I mean, Waymo has a number of different sensors. You've got cameras, LiDAR and radar. What are the benefits of each of those and perhaps where do they struggle more as well? Yeah, the different

sensors have different strengths and weaknesses. A camera is basically like your eye, right? You see the world as a human would see, but it gives you maybe slightly less depth information, until you actually put in multiple cameras and then you can reason about depth. In contrast, LiDAR is very good at sensing depth; that's what it does. LiDAR is basically a laser that you shoot out; it bounces off

of objects and comes back, giving you an estimate of how far the objects are. They don't see in colour, right? So they only give you geometric information. They give you a lot less about the semantics of the scene. Those lasers also bounce off things quite easily, don't they? Yes, they reflect off of things like polished metal and things like this. Which there's quite a lot of on the road. There's quite

a lot of it on the road. So that can be a disadvantage in the sense that it adds noise to the signal, or it can also mean you can get to see behind corners. We didn't talk about the radar. Radar is very good at sensing speed. And so the relative speed between the other agents and the car is a really important signal for us to understand are we at risk of colliding

with something? And radar gives you quite, I mean, you can use radar to sense quite far out as well, right? As opposed to visual, which is just in your immediate vicinity. Yeah, they have a much longer range. Cameras are going to be obstructed by, you know, the cars in front of you and the scene. The radars can look much, much further afield. What I think is more important is that it adds a different piece of information to the context. And when you want to

fuse information from different sources, you want to have information that comes from different places, right? The analogy is you have two eyes.

If your two eyes gave you the same information exactly, you would not be able to perceive depth. It's the discrepancy between the two that gives you that extra bit of information. So similarly with LiDAR and camera and radar, they give you very different pieces of information with different strengths and weaknesses, and then the role of the AI is to fuse them into a cohesive picture of the environment. There is

something interesting there, though. You have these three different senses, all of which have strengths and weaknesses. All of them are flawed in some way. And yet, because they're flawed in different ways, you can build a bigger picture of the scene. But what do you do when they disagree? Like, who gets the ultimate say between those three? It's

not like they're voting. It's a merger of the different information. The example I like is, again, going back to your eyes, your left eye basically tells your brain that your nose is on the right. Your right eye tells your brain that your nose is on the left. It's not like one eye is going to win that contest, right? You basically fuse in your brain the different information

that are slightly conflicting, but that actually give you a global picture of the scene in front of you. A great example is what happens at nighttime.

Imagine it's really dark out there. Your camera just sees a wall of black. You

don't have a lot of information. That's a place where the LiDAR is really useful, because the LiDAR doesn't care if it's day or night. It just keeps shooting its lasers and figuring out how far objects are. So that complementarity really is what drives the safety there. Safety is a lot about redundancy. Safety really

comes from taking different sources of information, never entirely trusting them 100%, and merging the evidence based on the different pieces of hints of information that you get, such that you can have an overall system that you can trust that has a much higher degree of fidelity. We often say

there's only one way to be right, there are many ways to be wrong, right?

If your different sensors are wrong in different ways, you know that there is something not right about the information that you get. But once the sensors start agreeing about a picture of the world, you're pretty sure it's going to be right. So is

it almost like the car is sort of updating its belief, as it were, about what the scene is around it at any time? It's very literally that, in the sense that the mathematical formulation is a belief update, and you can prove that the more information you add that comes from

different sources, you only improve the overall estimate of your belief.

Even if the information is noisy, you're only adding information.
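That claim, that fusing even noisy information from independent sources can only sharpen the estimate, has a compact form for Gaussian measurements: weight each source by its precision (inverse variance), and the fused variance is never larger than the best single sensor's. A minimal sketch, with made-up sensor noise figures:

```python
# Precision-weighted fusion of independent Gaussian range estimates
# (illustrative numbers; not Waymo's actual sensor models).

def fuse(estimates):
    """estimates: list of (mean, variance). Returns fused (mean, variance)."""
    precisions = [1.0 / var for _, var in estimates]
    fused_var = 1.0 / sum(precisions)            # always <= the smallest variance
    fused_mean = fused_var * sum(m / v for m, v in estimates)
    return fused_mean, fused_var

camera = (21.0, 9.0)   # poor depth: high variance
lidar  = (19.5, 0.3)   # excellent geometry: low variance
radar  = (19.8, 1.0)   # good range: moderate variance

mean, var = fuse([camera, lidar, radar])
print(f"fused: {mean:.2f} m, variance {var:.3f}")
# The fused variance (~0.22) is below even the LiDAR's alone (0.3):
# the noisy camera still adds information rather than subtracting it.
```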

And the fusion that happens between the different sensors is really the crux of enabling a safe perception stack. We can, for example, hide one of the cameras. The system will be okay with that. If there is, like, dirt that accumulates on one camera, we can still understand the scene. You don't want to have a system that's brittle, that will just, you know, collapse if a single sensor is

providing erroneous information. You want something robust. And the diversity is really what brings the robustness. That idea of having all three, though, I mean, particularly the LiDAR, because there are, you know, people who do work on driverless cars without the addition of LiDAR.

Do you think that to get fully autonomous vehicles you absolutely need all three? The current state of evidence that we have is that it looks like you can get to human-level performance by just using a camera. And the

proof of that is that, you know, people use their eyes and, you know, they don't have a fancy radar in their brain, and they can drive just well enough.

What we're seeing is that people want to know that those cars are safe beyond what the average driver would be able to provide, because it's a new technology and also because we can, right? We have proven that we can improve the safety posture of cars on the roads, so let's do it. Like, this is actually a valuable thing that we're adding to society. Just going back for a minute to that idea of the car having a belief of what's going on around it.

Is it constructing a 3D model of the world as it goes? It is constructing 3D models of the world. It's looking at the geometric information that it obtains from its different sensors. This is very useful for two things. One is that planning is easier when you have a 3D space that you can use to reason about, right? You want to be saying, hey, I want to avoid hitting this. I want to turn right here because that's what the rules of the road are telling me, or this is the route that I want to take. Also, having a 3D representation of the world enables you to simulate

the environment. And that's a critical piece: we are leaning really hard on simulation as a way to validate that our driver is safe and behaving the way we want. We've driven billions of miles in simulation, many orders of magnitude more than we've driven on the roads. And

having a simulator that's very faithful and close to reality is what enables us to make the technology advance fast and validate it offline before we actually have to do the testing on the road. Okay, so I see how in simulation you're sort of making a prediction about what the car will do

next and then have it interacting with your simulated environment. But inside the car itself, as it's out on the roads, you're still making predictions about what will happen at the next time frame. Are there some elements of simulation in that too? Yes, yes. So predicting what other agents on the road will do or might do is a very important piece of that equation. You want

to know that, for example, a pedestrian that is on the sidewalk, is it likely that they're going to jump in front of the car? Or are they just walking straight? Are they trying to cross at the crosswalk, or are they staying there because it's not their turn? Other cars, are they going to try to drive through an intersection at a stop sign or not? All of this reasoning about the other agents is a very important part for the car to be able to

make its own decisions. Driving is inherently a social thing. And what's interesting is that we tend to model those interactions as little bits of conversations. Literally, it's visual motion conversations. You know, I move forward. What will this other car do? This car stops. Okay, I can go. Or

this car goes, then I'm going to have to stop, right? It's literally modeled as a visual motion conversation, very similar to what you would do in a conversational agent. And so there are lots of parallels between conversational AI and the autonomous driving problem that we can leverage and learn from and improve the technology based on. Because that was always one of the big questions: if you have two autonomous cars that come to a stop sign at precisely the same time, what happens? Which one yields? Do you end up in this

situation of a stalemate? And that's the solution, right? It's that you're predicting what the other car will do and making a small intervention and continuing to update your belief about the situation. Yeah. And it's important because one of the very interesting aspects of learning or AI for autonomous driving is the closed loop problem. And by

that I mean, you cannot learn your behavior in isolation.

You have to learn your behavior in the context of the other agents on the road. And so the only way, or one of the only ways, to do that is to imagine what you would do, unroll what the world would do in response in a simulator, and then feed that information back. And so evaluating and training your model in a closed loop enables you to learn the

kind of behaviors that you would actually observe in a real world environment. We know

from robotics in general that there's what's called the DAgger problem. The

DAgger problem is that very often, if you just optimize for open-loop behavior, you end up in a place where your system tries to do the best it can at every step, but every little error that it makes just accumulates over time.

And so if you want to learn in a way that doesn't have an accumulation of error, you have to simulate the entire environment and feed that back into your system. That makes it very, very complex. Again, there is a similar analogy in language models: if you have a long conversation and you don't train your system right, your conversation might veer into really weird, kind of idiosyncratic territory, because the model is not being trained to stay on topic. It's very much the

same problem of staying on topic when you're having a long conversation as it is to drive in a way that is stable and gets you to the place that you want eventually.
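The DAgger idea (Dataset Aggregation, Ross et al., 2011) behind this can be sketched schematically. This is the generic algorithm, not Waymo's training code; `env`, `expert`, and `policy` are hypothetical stand-ins:

```python
# Schematic DAgger loop, illustrating why closed-loop training fixes error
# accumulation: the learner visits states produced by ITS OWN mistakes, and
# the expert labels those states. All objects here are hypothetical stubs.

def dagger(env, expert, policy, iterations=10, horizon=200):
    dataset = []                                      # aggregated (state, action) pairs
    for _ in range(iterations):
        state = env.reset()
        for _ in range(horizon):
            action = policy.act(state)                # roll out the LEARNER, not the expert
            dataset.append((state, expert.act(state)))  # expert labels the visited state
            state = env.step(action)                  # so drifting off the expert's path
                                                      # still yields useful training data
        policy.fit(dataset)                           # retrain on everything seen so far
    return policy
```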

It does have those elements of a game to it. I mean, I'm thinking about playing chess: there's no point in learning how to play chess without an opponent, because playing in isolation is sort of worthless in a lot of ways. Except that you're playing with many, many, many players simultaneously. It's sort

of a wonder that you've got as far as you have when you consider how complex this problem is. As always, simulating the environment has been at the core of solving this problem. It's really the core of the development and the study of the problem, that building a simulation that is very faithful to the real

world and that enables you to reason about these closed-loop problems is a key component to making it work. How much of your prediction about what another agent in the space is going to do comes down to your categorization of what that agent is? I mean, I'm thinking, for example, I was watching some videos this morning and there was a Waymo going past a cat and the cat was

sort of curled up, could have been a football, right? Like, do you have to categorize what it is before you predict how it might behave? We don't necessarily have to categorize it exactly, but if we know it's a cat, then we can make better predictions about what its behavior might be. A cat will, you know, change

direction very quickly and go in a random direction. They will not get on the crosswalk to cross the street, right? So all of the different agents, we try to categorize them as best we can and to predict, essentially, their behavior. It's also important so that we can make decisions about how we should behave in terms of road rules. So we want to know, this is a bike, this is somebody on a bicycle, they're in the bike lane. We can

pass them. We need to give them that much width to pass them safely. So all the semantic information about the agents is very important. I

think the long-term learning from 15 years of autonomous driving has been the semantics of a scene. What the agents are is extremely important to reason about. We want to know that a car is not just

a box, it's not just a cube on the road. A car can be an emergency vehicle, and we need to yield to that. If there are many emergency vehicles somewhere, probably we don't wanna go through there, because there is police activity or something like that. So all this kind of deep semantics really matters to being able

to operate the car. Yeah, because I remember in the DARPA challenges, sort of the earliest days of driverless cars, there was a big problem with tumbleweed, I seem to remember.

Oh, yes. Not knowing what it was, just seeing that there was an object in front that you could have driven through. It wasn't potentially going to be damaging. Do

you still end up in that situation? I'm thinking about snow here, for example. Do

you find yourself in situations where there are objects that are in the way, but they're not things that should necessarily affect the planning of the vehicle? Yes, so snow is a great example. It's something that we started working on seriously because we're trying to expand in the more northern cities. So we've done a lot of testing in

the Sierras over the last year, and snow is a typical example of here is something that is big, massive, potentially on the roads, but you have to reason about and say, okay, this is snow, so the right thing for me to do is to drive through it. Unless it's a big pile of snow and you can't... But

if it's a reasonable pile of snow, just what you experience in normal driving, you want to cross through that snow. If it were a rock, you wouldn't do that. So categorizing things at a fine grain like this and understanding what you can or cannot do is really part of the equation, and it's necessary to be able to drive in those conditions. I mean, these ambitions to have driverless cars, they really predated the big changes of large language models. And we're talking here about like

understanding the context of a scene and the semantics of it. How much have the advances that have been made in multimodal models fed directly into driverless cars? Quite a

bit. It's funny, I was reflecting a few weeks ago on, do you know when the first transcontinental autonomous drive happened? There was a transcontinental drive where they were at about 98% autonomy, so almost fully autonomous, driving about 60 miles per hour.

That happened 30 years ago this summer. Wow. That was in 1995.

It took 30 years from the proof of concept of autonomous driving to where we are today. And what's fascinating is that it took basically not only a lot of work, but multiple generations of machine learning and AI to really

get to a level of performance that was necessary. And the modern AI revolution that we're seeing today, for the last few years, is but the last of those, right? There have been several over the past decades that we've experienced and gone through. The modern AI world, what it's opening up

for autonomous driving is really this idea that you can get at the semantics of a scene, essentially zero-shot, without having to train the model specifically for it. If I show Gemini a picture of an accident scene on the road, Gemini will tell us this is an accident scene.

This is not something I need to train specifically. This is something that can be learned from what we refer to as world knowledge. Or

more prosaic things like, what do emergency vehicles look like in Tokyo or in London? We don't necessarily have that knowledge built into our driver a priori, because we've never driven in those areas until very recently. But those large AI models have that knowledge built in. So the key is how do you leverage that in a way that is robust, in a way that basically

provides the right level of information for the car to be able to operate. Well,

so "how do you?" is the next obvious question, because I mean, you're talking about lots of different elements here. You've got like the integration of the sensor data, you've got the categorization of the semantics of the scene, you've got the prediction of how different objects might behave in the future. I mean, how do you put all of that together? Yeah. So one cheat is that we can do all of that in the cloud first. So we can build essentially a very large driver in the cloud, a very large model that incorporates all that information, all the sensor information, all the experience that we have from driving millions of miles, all the data

that comes from various sources that provides us with world knowledge. And the benefit of that is that when you do that in the cloud, you don't really have the same operational constraints that you would have on a car. It can be slow, it can take a lot of memory, take a lot of compute, it can also,

not necessarily meet all the real-time constraints that the car requires. But once

you have that teacher driver, you can use that to basically instruct the onboard system, teach the onboard system based on that supervision that you provide from the cloud-based driver and distill all that information onto the onboard system, which itself can... be different, can have different operational constraints,

different compute constraints, different architectural constraints, and so on and so forth. So that's

one path to, you know, bringing the power of very powerful AI onto the car without really having to just shove it all in the car. So let me make sure I understand that then. So you have this sort of giant model in the cloud that sort of gives you like a solution space, as it were, of like, this is what good driving looks like. And then when you're in the car, you

just have to work out which bit of that solution space you're in in that particular moment in time. So we don't do that in real time, to be clear.

This is not, you know, the car is on the road and it's going to ask the driver in the cloud, you know, what do I do next? No, no, that doesn't work. We need to have the driver be self-contained on the car and, you know, be completely autonomous and independent. Not rely on internet access.

That's right. Relying on internet connectivity would not be a good safety posture.

But we do that offline, meaning that when we train the models that are used in the car, we query that large model in the cloud as an oracle to tell us, you know, what would be the ideal thing to do, and then essentially back-propagate that onto the onboard system.
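That teacher-student pattern is essentially knowledge distillation. A minimal sketch under assumed names (`teacher`, `student`, `scenes`), not Waymo's actual pipeline: the large cloud model labels logged scenes offline with target action distributions, and the small onboard model is trained to match them.

```python
import torch
import torch.nn.functional as F

# Schematic offline distillation (hypothetical models and data, not Waymo's):
# a large cloud "teacher" labels logged scenes with target action
# distributions; a small onboard "student" is trained to match them.

def distill_step(teacher, student, scenes, optimizer, temperature=2.0):
    with torch.no_grad():                          # teacher runs offline, frozen
        teacher_logits = teacher(scenes)           # e.g. scores over candidate actions
    student_logits = student(scenes)
    # Soft targets: the student matches the teacher's full distribution,
    # not just its single best action.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```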

I mean, it sounds like you're stepping closer towards the sort of Gemini version where there's like, there is the giant model and everyone's sort of tapping into the main model. Is that sort of the big aim? Yeah, it's closer to this. What's been interesting in the journey is that, as I mentioned earlier, the driving problem is not that far removed conceptually from the dialogue problem that LLMs solve.

It is a visual dialogue or a motion dialogue with multiple agents. And because we can frame it that way, I mean, literally the math is the same, right? So we basically train a model that has very much the same properties as a Gemini model, and we are able to leverage all the same techniques that a Gemini model

applies to the problem, including how do you scale it? How do you provide it the right level of supervision? And all those questions are very, very similar. I'm also wondering, I mean, end-to-end is something that people are talking about quite a lot now. Is there an ambition that you could tokenize sensor data in the same way that you can tokenize language? That's already very much what a multimodal model like Gemini does, right? You tokenize all the images and the sensor inputs and

pass them on to a language model as abstract tokens that basically act like words. So the machinery under every perception system fundamentally is compatible with this idea of tokenization.
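A ViT-style patch tokenizer gives a feel for how a camera frame becomes a sequence of "words". This is a toy sketch with hypothetical shapes, not how Waymo or Gemini actually tokenize sensor data:

```python
import numpy as np

# Toy ViT-style tokenization of a sensor frame (illustrative only):
# split an image into fixed-size patches, flatten each patch, and project it
# to an embedding. The result is a sequence of vectors that a transformer
# treats exactly like word tokens.

def tokenize_frame(frame, patch=16, d_model=256, seed=0):
    h, w, c = frame.shape                       # e.g. a 224x224 RGB camera frame
    patches = (
        frame.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch * patch * c)    # one flat vector per patch
    )
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((patch * patch * c, d_model))
    return patches @ projection                 # (num_tokens, d_model)

tokens = tokenize_frame(np.zeros((224, 224, 3)))
print(tokens.shape)                             # (196, 256): 196 "visual words"
```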

And then the question is, what kind of tokens do you want to pass around? Do you

want tokens that are very abstract? Or do you want to pass around information that is very concrete? The abstract information can potentially be richer in some ways. Information that is more concrete gives you the power of simulating that state in a much

more direct way. Actually, simulating down to the pixel level is something that is within reach. There's a lot of work right now going towards world models. So once

you can do sensor generation in a controllable way, that opens up the capability to simulate the entire rollout of an autonomous car. And

that's nascent. That's kind of where a lot of the technology is headed. Is

it going to solve the entire problem? You know, time will tell. But that's really what the edge of the research is at right now. I'm thinking here about the sim-to-real gap that roboticists in particular talk about a lot of the time. I mean,

how realistic are those simulations that you're creating? Because presumably there'll be certain things, you know, in an environment that are broadly not necessary for the purposes of driving. The

fidelity of the simulation is very, very important, but not necessarily this kind of visual fidelity that you're talking about. We want geometric fidelity.

We want to make sure that the physics of the environment are respected. But

like you said, the visuals really matter too. One of the things that we do with our simulator is that we're able to take examples of driving and then re-simulate them in different conditions: at night, in snow, in the morning, in the evening, by simply just basically using AI to

adapt the visuals and turn a spring scene into a winter scene, turn it into a summer scene or a night scene, and really augment the amount of data and the different conditions in which we can simulate the entire environment. So is there an ambition then that at some point in the future you will be able to have a driverless car that just purely takes in the sensor data and, rather than sort of spitting out all these, maybe not intermediate steps, but things that give you some way of exploring what's going on behind the scenes, just purely outputs speed and direction? There is another angle to this story, which

is the safety angle and the validation, right? There are safety rules that you and me can relate to. It's basically things like, don't collide with anything.

It's don't run a red light. It's respect priority and road rules in general. All of those rules are very concrete. They're not

abstract tokens in token space, right? So you want to be able to enforce those rules, both the road rules and the safety rules, in a way that can be reasoned about, can be explicit, can be guaranteed. So having a concrete representation of the environment, of what the driver is going to do, has a huge amount of

benefits in terms of providing a way of expressing a safety envelope that you wouldn't have necessarily if all of the reasoning of the car happens in a completely abstract fashion.
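One way to picture a safety envelope layered on a learned planner (a sketch under assumed names, not Waymo's safety framework): candidate trajectories come out of the model, and explicit, auditable checks veto any that violate a concrete rule.

```python
# Schematic safety envelope (illustrative, not Waymo's framework): explicit,
# human-readable predicates filter the learned planner's candidate
# trajectories. The `world` query methods here are hypothetical.

RULES = [
    ("no_collision",   lambda traj, world: world.min_clearance(traj) > 0.0),
    ("no_red_light",   lambda traj, world: not world.crosses_red_light(traj)),
    ("yield_priority", lambda traj, world: world.respects_priority(traj)),
]

def safe_candidates(candidates, world):
    """Keep only trajectories that every concrete rule can sign off on."""
    approved = []
    for traj in candidates:
        if all(check(traj, world) for _, check in RULES):
            approved.append(traj)
    return approved   # the planner then picks the best approved trajectory
```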

Just on the idea about things being concrete, those safety rules that you listed there, right? No collisions, what are the other ones you have? Don't run a red light.

Don't run a red light, sure, yeah, exactly. Are those hard-coded then into the Waymos?

They're provided as the guidance that we provide to the car. We have a safety framework that basically encodes the kind of information about what it is to do proper driving. And we want to be convinced as engineers that the encoding of those rules is something that the driver will meet at every

point in time. Let me just dig into the human side a little bit more.

Should an autonomous vehicle behave like a human driver? It's a really good question. One

thing that we've learned over the years is that you want to be basically the most normal car on the roads. You don't necessarily want to be more timid than other drivers on the roads, because then people will pick up on that difference and actually abuse the car. Not necessarily out of mischief or anything like that, just because,

you know, if they know that a Waymo car will always be more timid than they are, they will... want to drive in front of it. If on the opposite side you're more aggressive than the average driver, then you're disruptive to the flow of traffic or you violate other people's expectations. They don't expect an autonomous car to be

raging on the road, right? So the sweet spot is really if you act like the most boring, normal driver on the road. It turns out it's also the safest. It's the best safety posture to have. We don't necessarily want to reproduce everything a human driver does. In fact, what we find is that very often, humans are not very good at risk analysis. Right. They will do things that, if you do the math, are not necessarily safe. Give an example.

I mean, a lot of people will tailgate at distances that are much smaller than what is recommended by people who've done the analysis, say, you know, you should be at this many meters away from the rear of the car in front of you if you are at that speed. And I'm very sure that most of those

guidelines, people look at them and like, that feels extreme, that feels too much. And

it's not what, in practice, people will want to do on the road necessarily. When we look at the data, we understand that there are things about which we should be a lot more cautious. People also don't necessarily reason about what could happen when it's not in front of their eyes, right? So you have, you know, a big truck occluding a pedestrian crossing; you have to think about the fact that it's very possible a pedestrian will be crossing through there. If it's not in sight, it's not always that people think about all the possibilities of what could be happening. So the safety posture that we need to have and how we reason about those kinds of situations tend to be a bit more conservative than humans'. And

I think that's part of why, you know, the numbers today show that, compared to the human drivers in the same kind of environments that we're driving in, we're basically around an order of magnitude safer. The rate of accidents that lead to severe injury is about 88% lower

with Waymo cars. And that gap really we can attain by not just doing what every human would do, not by hitting the average, but by also having a more conservative safety posture. I mean, does driving change depending on where you are? I think you're doing some testing in Japan at the moment, aren't you? Is

there a different style of driving in different countries? Yes, expectations are going to change depending on where you are. And again, in the spirit of being as inconspicuous as possible, we will need to adapt the driver to the local conditions, simply because the rules of the road are different in different

places, like even in the US. Some states let you turn right on red, some states don't. And there are a lot of expectations that are built into not just the road rules, but also common practices. In practice, what we've seen, at least in the US, is that people play up the differences in driving a lot more than they are there in reality. Really? I think a lot of people are proud of their local way of driving, or proud to say that people in their local environment are terrible drivers or are aggressive drivers and things like that.

At the end of the day, this is still a very homogeneous kind of environment.

I'm talking about continental US. I see, I see. Phoenix versus LA. So Japan is, first of all, driving on the other side of the road, right? A very simple, you know, massive road rule change that has an impact on how you behave, has an impact on how pedestrians behave, because you will look left

versus looking right. Can you not just switch it? Yeah, to a first-degree approximation, it's a mirror image of the world. Switch the camera around. But there are also a lot more expectations from driving. Anecdotally, I don't have hard data on this, but anecdotally, there are a lot more agents on the road gesturing at cars

in Japan. It's a lot more of a normal part of traffic. It's not

just in emergency situations. And understanding those gestures well and what they are and maybe how they're codified slightly differently may be a factor.

Well, a lot of that, I guess, comes down to, I mean, the sort of the physical size of the cars that you're in, the like shape of the roads, the kind of density of traffic. But I am wondering about the hand signals, though, because even if I started driving in Japan, I think I would have an intuitive sense of what hand signals might mean. How do you build those into the car?

I mean, hand signals in particular are quite difficult, aren't they? They are a topic in and of itself, right? We need to understand, and often, you know, we say hand signals, but really it's a whole body language. There is a lot of information that gets conveyed that is not just, you know, how the hand moves. It's

the entire... For example, people at night shining lights to tell you where to go or things like that. We have a lot of that data and we can evaluate how we interpret that data and measure our effectiveness at understanding hand signals and improve on the system

using data-driven methods, using heuristics. Just coming here, I saw a construction worker stopping a Waymo with their hand in front because there was a truck that was backing up. And people were very confident that that hand signal was going to be interpreted properly by the Waymo. That's a

huge amount of trust that people are placing in the system, and we need to meet that trust. Yeah, absolutely. So I am curious though, how do you teach a driverless car? In the situations where it's acting like the average driver, how do you use that as an instruction? I mean, what's the reward function that you're going for here? Like what are you optimizing? There are different optimization

functions. One basic optimization function is just imitation learning, right? So this is the bread-and-butter learning that every robot learning starts from, right? It's how you build a baseline system: basically, you take in the data from human drivers and imitate exactly what they do on average. That

takes you to some level. Just like for LLMs, just doing imitation learning on human text is the baseline that you start from when you build a language model. On top of that, you tend to have things like reinforcement learning. You

tend to have learning from data, learning from human annotations and things like this. So you can imagine a similar setup in which, beyond just the human imitation, we would provide signals to the driver about areas where we don't think humans are optimal, and hence what a better solution is.
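A minimal sketch of that layering, with hypothetical stand-ins rather than Waymo's actual objective: a behavior-cloning term as the baseline, plus a penalty on outcomes where humans are known to be sub-optimal, such as tailgating, measured on a closed-loop simulated rollout. The `sim_rollout` interface is assumed for illustration.

```python
# Schematic training objective (hypothetical, not Waymo's): imitation as the
# baseline, plus extra signals where human demonstrations are sub-optimal.

def trajectory_loss(predicted, human, sim_rollout, min_gap_m=30.0):
    # 1) Imitation: match what human drivers did on average (behavior cloning).
    imitation = sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(human)

    # 2) Correction: penalize behaviors we believe humans get wrong, e.g.
    #    following too closely, measured in a closed-loop simulated rollout.
    tailgating = sum(
        max(0.0, min_gap_m - gap) for gap in sim_rollout.following_gaps()
    )
    return imitation + 0.1 * tailgating          # the weighting is illustrative
```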

And then we can simulate that. We do a lot of simulations. We also simulate a lot of the extreme cases that we never want to observe on the road. So that's another angle that is very important: there are a lot of things that we hope to never see happen. And so we cannot learn from humans in those scenarios, because we just don't see them. Like multi-car

pileups. Exactly. So if there is a multi-car pileup in front of you, what do you do? We won't have that data. We hope to never have that data, but we simulate those kinds of pileups, and then we learn from that what is the right behavior for the car. Let me ask you about safety then. Have there been incidents involving the Waymos? Yeah, we are not perfect. The

world around us is not perfect. So there have been incidents. Whenever there is an incident, we look at it very carefully. We replay it in simulation. We

always look at: are there ways that we could have done things differently? Are there ways we could have mitigated the issue, irrespective of responsibility? And we try to build in a lot more sort of defensive measures. Even when

there is not an incident, if there is something that makes us think that potentially there could be an incident, we try to mitigate that too. Simulation is a great tool for this because it enables us to provide counterfactuals. So we

simulate what happens if, in this instance, the other driver was drunk, and so their reaction time was a lot slower than what it was.

Would we have mitigated the issue? Would we have been able to perform in a way that would have avoided an incident in general?
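In code, that counterfactual replay pattern might look like the following sketch (hypothetical `scenario` and `simulate` API, not Waymo's simulator): replay a logged incident many times with one factor perturbed, and check that the planner avoids contact in every variant.

```python
# Schematic counterfactual testing (hypothetical simulator API, not Waymo's):
# replay one logged scenario while degrading the other driver's reaction time,
# and check that our planner still avoids a collision in every variant.

def counterfactual_sweep(scenario, simulate, reaction_times_s):
    failures = []
    for rt in reaction_times_s:
        variant = scenario.with_agent_reaction_time(rt)   # e.g. a drunk driver
        outcome = simulate(variant)
        if outcome.collision:
            failures.append(rt)
    return failures

# failures = counterfactual_sweep(logged_incident, simulate,
#                                 reaction_times_s=[0.8, 1.5, 2.5, 4.0])
# An empty list means the issue would have been mitigated in every variant.
```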

So we take those kinds of learnings very seriously and incorporate them into the driver in a continuous way. So these simulations then, you're using them, I guess, to validate that the model that the car is using is the best that it can be?

It goes beyond just validating. We also use simulation as a way of providing feedback on the driver. So there is a training loop that basically leverages simulation and so feeds back this information to improve the driver itself. So it's not just passively looking at whether we meet the performance criteria that the simulation

is providing us with, but it's also using that signal as something we can back-propagate into the model so that the driver itself improves. When it comes to rolling out into a new location, I know there's rumors of you going to London quite soon. How much mapping do you have to do before you can put cars without drivers on the roads? We like to have a map because a map gives us a prior on what we should expect to be there. There is usually information that is also more relevant to us that is not necessarily directly available on

Google Maps, for example. We like to map where there are speed bumps. Right. We try to see that there is a speed bump and slow down, but if it's not visible, for example, if it's at night or something like that, we have the knowledge that there might be a speed bump. And all of that gives us a way of having a higher level of confidence about the

information that the car should expect to see in an environment. At the end of the day, the car makes its own decisions. We don't ever assume that the map is correct, but it's like an extra piece of evidence that it's adding to change its belief of what the best thing to do next is. So then how different are these different areas? I'm thinking of like freeway driving compared to inner city driving.

Different types of roads have different challenges. A simple example: if you're on the surface streets and something wrong happens, typically you can just stop. It may not be optimal, it may not be the right thing to do, but if push comes to shove, there is the option of pulling over and stopping.

On a freeway, you don't want to be stopping in the middle of traffic. This

is not the safety-first option. So you want to be able to reason about exceptional circumstances in a very different way. The

speed, the fact that you have a car that is at a much higher velocity, means that you really need to anticipate much further ahead what may be happening. You have to reason about traffic lanes in a different way. You often have traffic stacking up at exits of

freeways. It's not necessarily the kind of thing that you... experience on surface streets. I've

been taking a lot of rides on freeways lately and experienced, you know, the Bay Area freeways. In Waymo? In Waymo, yeah. So... Because that's not available to the public yet? It's not available to the public yet, but this is one of the dimensions that we're working on to enable. I mean, look, I'm biased, right? So I'm

going to ask you about London. But I think London feels, well, at least qualitatively, very different from the American streets in terms of the density of the cars and traffic, in terms of the number of road users, in terms of how narrow the streets are. I mean, is it just a case of picking up what you've got here and then moving it to a new location? Or are there a

whole other range of edge cases that you have to be concerned about that just don't come up when you're in the American cities? So, what's interesting about San Francisco is that it's actually the second densest city in the US, believe it or not.

After New York. After New York. So by US standards, it's actually a pretty complex and dense environment to be in. Is it on par with London? I know it's not. London has a lot more very narrow and complex streets. The question is, is it fundamentally harder, or is it the same thing, just with different tightness of tolerances and things like this? Are there things that are materially, fundamentally different about

it? And the traffic is awful in London. You can't drive anywhere fast.

I mean, that's true. It's a very slow-moving problem. It's slow. I'm reading some fantastic novels right now that are set in London, and they spend half of their time in the novel in traffic. It's really sad. What breakthroughs do you think have to happen then before fully autonomous driving becomes fully mainstream? What are the

barriers that remain? I don't think there is any need for any fundamentally new breakthroughs. I think we're in the right generation of technology. I'm not saying we have solved it. I'm saying autonomous driving has a history that dates back 30 years. It has gone through, I want to say, five different generations of technology along the way. It started from the perspective of: an autonomous car is a robot, let's solve it the robotics way. And then, you know, machine learning and computer vision came in. Transformers

came in, behavior modeling, foundation models came in. I feel that today's the day, like we are in the right moment. There are a lot of innovations that are coming down, things like world models, things like large-scale foundation models. But that's the present in some ways, like

in the sense it's on the horizon. I don't think we need another jump for it to become practical in the real world. So I think it's the moment. Like

I really feel like autonomous driving is happening and we've got to make it happen.

Do you think that we'll get to a stage where driverless cars are just completely ubiquitous, or do you think that there will always be a place for human drivers?

It's a really good question. I could picture a future in which, you know, your grandkids, my grandkids ask us, hey, is it true that in your day, we used to drive by hand? That sounds scary. That sounds dangerous.

So it's entirely possible that we're going towards a future where the experience of driving by hand is no longer the norm and most of the driving happens automatically. Whether this future will be realized, I think, depends on a number of factors. I think cost and accessibility are one of them. Adapting the

infrastructure... I don't think it's going to be soon, but it's a possible future. Absolutely fascinating. Vincent, thank you so much for joining me. Thanks for having me.

It's particularly telling, the comment that Vincent made at the end there, that the path to autonomous vehicles working all over the world is laid out. And okay, sure, there's loads more work to do, but we don't need another big revolution in artificial intelligence to get there. The stepping stones are in place. We've got semantic understanding of a scene. We've got good simulations. And if you start to put all of that learning together in the right way, you won't just replicate human driving, you'll surpass it. You have been listening to Google DeepMind, the podcast with me, your host, Hannah Fry. If you've enjoyed this episode from the back of a Waymo, then please do leave us a review on your favourite podcast platform, or subscribe and like and all of the other things, comment, etc. on YouTube. And as ever, we have got plenty more incredibly interesting topics coming up later in the series. So please do join us again. Thank you.
