
Tesla AI Day in 23 Minutes (Supercut) (2022)

By Tesla Daily

Summary

Topics Covered

  • Optimus Could End Scarcity and Poverty
  • Tesla Trained 75,000 Neural Network Models in One Year
  • Auto-Labeling Generates Scenes 1,000x Faster Than Artists
  • Tesla Is a Hardcore AI Company, Not Just a Carmaker
  • Dojo Outperforms GPUs by 3x at Lower Cost

Full Transcript

Welcome to Tesla AI Day 2022. We've got some really exciting things to show you; I think you'll be pretty impressed. I do want to set some expectations with respect to our Optimus robot. As you know, last year it was just a person in a robot suit, but we've come a long way, and compared to that, I think it's going to be very impressive. We're going to talk a lot about our progress in AI and Autopilot, as well as our progress with Dojo. So, should we bring out the bot? Before we do that, we have one little bonus tip for the day: this is actually the first time we've tried this robot without any backup support, no cranes, no mechanical mechanisms, no cables, nothing.

[Music]

We'll show you some videos now of the robot doing a bunch of other things. Yeah, we wanted to show a little bit more of what we've done over the past few months with the bot: just walking around and dancing on stage, just humble beginnings.

But you can see the Autopilot neural networks running as-is, just retrained for the bot, directly on that new platform. That's my watering can. When you see a rendered view, that's the world the robot sees, so it's very clearly identifying objects, like: this is the object it should pick up. We used the same process as we did for Autopilot, to collect data and train neural networks that we then deployed on the robot. That's an example that illustrates the upper body a little bit more, something that we'll try to nail down over the next few months, I would say, to perfection. Absolutely.

So, what you saw was what we call Bumble C. That's our sort of rough development robot using semi-off-the-shelf actuators, but we've actually gone a step further than that already. The team's done an incredible job, and we actually have an Optimus bot with fully Tesla-designed actuators, battery pack, control system, everything. It wasn't quite ready to walk, but I think it will walk in a few weeks. We wanted to show you the robot, something that's actually fairly close to what will go into production, and show you all the things it can do.

So let's bring it out. Here you're seeing Optimus with the degrees of freedom that we expect to have in Optimus production unit one, which is the ability to move all the fingers independently and to have a thumb with two degrees of freedom, so it has opposable thumbs and both left and right hands. It's able to operate tools and do useful things. Our goal is to make a useful humanoid robot as quickly as possible, and we've also designed it using the same discipline that we use in designing the car, which is to say, to design it for manufacturing, such that it's possible to make the robot in high volume at low cost with high reliability. Optimus is designed to be an extremely capable robot, but made in very high volume, probably ultimately millions of units, and it is expected to cost much less than a car, I would say probably less than twenty thousand dollars.

The potential, like I said, really boggles the mind, because you have to ask: what is an economy? An economy is sort of productive entities times their productivity, capita times output per capita. At the point at which there is not a limitation on capita, it's not clear what an economy even means. An economy becomes quasi-infinite. This means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it. It's very important that the corporate entity that makes this happen is something that the public can properly influence, and so I think the Tesla structure is ideal for that.

All right, so you've seen a couple of robots today. Let's do a quick timeline recap. That robot that came out and did the little routine for you guys, we had that built within six months, and we've been working on software integration and hardware upgrades over the months since then. But in parallel, we've also been designing the next generation, this one over here.

So this guy is rooted in the foundation of the vehicle design process. We're leveraging all of those learnings that we already have; using that vehicle design foundation, we're taking it from concept through design and analysis, and then build and validation. Along the way we're going to optimize for things like cost and efficiency, because those are critical metrics to take this product to scale eventually. In the middle of our torso (actually, it is the torso) we have our battery pack, sized at 2.3 kilowatt-hours, which is perfect for about a full day's worth of work.
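
For a rough sense of what that battery figure implies, here is a back-of-the-envelope power budget, assuming a "full day's work" means roughly an eight-hour shift (an assumption on my part; the talk does not state the shift length):

```python
# Back-of-the-envelope average power draw implied by a 2.3 kWh pack lasting
# a working day. The 8-hour shift length is an assumption, not from the talk.
pack_wh = 2.3 * 1000          # 2.3 kWh in watt-hours
shift_hours = 8
print(pack_wh / shift_hours)  # ~288 W average draw over the shift
```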

Going on to sort of our brain: it's not in the head, but it's pretty close. Also in our torso we have our central computer. As you know, Tesla already ships Full Self-Driving computers in every vehicle we produce. We want to leverage both the Autopilot hardware and software for the humanoid platform, but because it's different in requirements and in form factor, we're going to change a few things. It's still going to do everything that a human brain does: processing vision data, making split-second decisions based on multiple sensory inputs, and also communications. To support communications it's equipped with wireless connectivity as well as audio support, and it also has hardware-level security features, which are important to protect both the robot and the people around the robot.

So, can we utilize our capabilities and methods from the automotive side to inform the robot? Since we had crash software, we used the same software here: we can make it fall down. The purpose of this is to make sure that if it falls down (ideally it doesn't), it's only superficial damage, so it can dust itself off and get on with the job it's been given.

Our actuator is able to lift a half-ton, nine-foot concert grand piano. For our robotic hand design we were inspired by biology: we have five fingers and an opposable thumb. Our fingers are driven by metallic tendons that are both flexible and strong. We have the ability to complete wide-aperture power grasps while also being optimized for precision gripping of small, thin, and delicate objects.

All right, so all those cool things we showed earlier in the video were made possible in just a matter of a few months, thanks to the amazing work that we've done on Autopilot over the past few years. Most of those components ported quite easily over to the bot's environment. If you think about it, we're just moving from a robot on wheels to a robot on legs, so some of those components are pretty similar, and some others require more heavy lifting. For example, our computer vision neural networks were ported directly from Autopilot to the bot's situation. We're also trying to find ways to improve those occupancy networks using work done on neural radiance fields, to get really great volumetric rendering of the bot's environment, for example here, machinery that the bot might have to interact with.

Another interesting problem to think about is indoor environments, which are mostly absent of GPS signal: how do you get the bot to navigate to its destination? We've been training more neural networks to identify high-frequency features, key points within the bot's camera streams, and track them across frames over time as the bot navigates through its environment, and we're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's working.
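
As a rough illustration of that idea, here is a minimal sketch of key-point tracking for pose estimation using off-the-shelf OpenCV primitives; this classical pipeline stands in for the learned key-point networks described in the talk, and the function and parameter names are illustrative.

```python
# Minimal sketch: track sparse key points across frames and accumulate the
# bot's pose from them. Classical OpenCV primitives stand in for Tesla's
# learned key-point networks; names and parameters are illustrative.
import cv2
import numpy as np

def track_pose(frames, K):
    """frames: iterator of grayscale images; K: 3x3 camera intrinsics."""
    prev = next(frames)
    pts_prev = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                                       qualityLevel=0.01, minDistance=7)
    poses = [np.eye(4)]                      # accumulated 4x4 bot poses
    for frame in frames:
        # Track last frame's key points into the new frame (optical flow).
        pts_next, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, pts_prev, None)
        good_prev = pts_prev[status.ravel() == 1]
        good_next = pts_next[status.ravel() == 1]
        # Relative rotation/translation from the 2D-2D correspondences
        # (translation is only known up to scale without depth or stereo).
        E, mask = cv2.findEssentialMat(good_next, good_prev, K, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, good_next, good_prev, K, mask=mask)
        T = np.eye(4)
        T[:3, :3], T[:3, 3] = R, t.ravel()
        poses.append(poses[-1] @ T)
        prev, pts_prev = frame, good_next.reshape(-1, 1, 2)
    return poses
```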

And this is a video of the motion control code running in our simulator, showing the evolution of the robot's walk over time. As you can see, we started quite slowly in April and started accelerating as we unlocked more joints and deeper, more advanced techniques like arm balancing over the past few months.

So hopefully by now you guys have a good idea of what we've been up to over the past few months. We're starting to have something that's usable, but it's far from being useful. There's still a long and exciting road ahead of us. I think the first thing, within the next few weeks, is to get Optimus at least at par with Bumble C, the other bot prototype you saw earlier, and probably beyond. We're also going to start focusing on a real use case at one of our factories, and we're really going to try to nail this down and iron out all the elements needed to deploy this product in the real world: as I was mentioning earlier, indoor navigation, graceful fall management, or even servicing, all components needed to scale this product up. I don't know about you, but after seeing what we've shown tonight, I'm pretty sure we can get this done within the next few months or years, make this product a reality, and change the entire economy. So I would like to thank the entire Optimus team for their hard work over the past few months. I think it's pretty amazing that all of this was done in barely six or eight months. Thank you very much.

Hi, I'm Ashok. I lead the Autopilot team alongside Milan. This time around last year, we had roughly 2,000 cars driving our FSD Beta software. Since then, we have significantly improved the software's robustness and capability, such that we have now shipped it to 160,000 customers as of today. For example, we trained 75,000 neural network models in just the last year; that's roughly a model every eight minutes coming out of the team. We then evaluate them on our large clusters, and we shipped 281 of those models that actually improved the performance of the car. This pace of innovation is happening throughout the stack: the planning software, the infrastructure, the tools, even hiring. Everything is progressing to the next level.

Let's use this intersection scenario to dive straight into how we do the planning and decision-making in Autopilot. We are approaching this intersection from a side street, and we have to yield to all the crossing vehicles. Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk. Now we need to yield to this pedestrian, yield to the vehicles from the right, and also understand the relation between the pedestrian and the vehicle on the other side of the intersection. There are a lot of these inter-object dependencies that we need to resolve in a quick glance, and humans are really good at this: we look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one. The same framework extends to objects behind occlusions.

We use the video feed from eight cameras to generate the 3D occupancy of the world. The blue mask here corresponds to what we call the visibility region: it basically gets blocked at the first occlusion you see in the scene. We consume this visibility mask to generate what we call ghost objects, which you can see on the top left. Now, if you model the spawn regions and the state transitions of these ghost objects correctly, and if you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors.
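
To make the ghost-object idea concrete, here is a minimal sketch assuming a simple existence-likelihood model; the spawn/decay rates, field names, and deceleration numbers are made up for illustration and are not Tesla's implementation.

```python
# Minimal sketch of the "ghost object" idea: objects hypothesized behind
# occlusions carry an existence likelihood, and the control response is
# scaled by that likelihood. Rates and numbers here are made up.
from dataclasses import dataclass

@dataclass
class GhostObject:
    position_m: float            # distance ahead along our path, in meters
    existence_likelihood: float  # 0..1, how likely something is really there

def spawn_ghosts(occluded_regions):
    """Hypothesize an object at each occluded region bordering our path."""
    return [GhostObject(position_m=r, existence_likelihood=0.5)
            for r in occluded_regions]

def update(ghost, visible_now, dt, decay_rate=0.8, growth_rate=0.2):
    """State transition: decay the likelihood once the region becomes visible
    and empty; grow it slowly while the region stays occluded."""
    if visible_now:
        ghost.existence_likelihood *= max(0.0, 1.0 - decay_rate * dt)
    else:
        ghost.existence_likelihood = min(
            1.0, ghost.existence_likelihood + growth_rate * dt)
    return ghost

def control_decel(ghost, full_decel_mps2=3.0):
    """Blend braking as a function of existence likelihood: a confident ghost
    right behind the occlusion gets a near-full response."""
    return full_decel_mps2 * ghost.existence_likelihood
```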

Now I'll pass it on to Phil to describe more about how we generate these occupancy networks. The occupancy network takes video streams from all eight of our cameras as inputs and produces a single unified volumetric occupancy in vector space: directly, for every 3D location around our car, it predicts the probability of that location being occupied or not.
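
As a toy illustration of that output format, here is a minimal PyTorch sketch of an occupancy head that maps a fused scene feature to a grid of per-voxel occupancy probabilities; the grid size, layer shapes, and the omitted multi-camera fusion are placeholders, not Tesla's architecture.

```python
# Toy occupancy head: decode a fused multi-camera scene feature into a voxel
# grid of occupancy probabilities. Shapes and sizes are illustrative only.
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    def __init__(self, feat_dim=256, grid=(100, 100, 8)):
        super().__init__()
        self.grid = grid
        x, y, z = grid
        # Decode a fused scene feature into one logit per voxel.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, x * y * z),
        )

    def forward(self, fused_features):             # (batch, feat_dim)
        logits = self.decoder(fused_features)      # (batch, X*Y*Z)
        return torch.sigmoid(logits).view(-1, *self.grid)  # per-voxel P(occupied)

# The fused feature would come from per-camera backbones across the eight
# video streams; that fusion step is omitted here.
probs = OccupancyHead()(torch.randn(2, 256))       # (2, 100, 100, 8), values in [0, 1]
```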

Let's talk about some training infrastructure. We've seen a couple of videos, four or five I think, and we care about and worry about a lot more clips than that. Looking at the occupancy networks, just Phil's videos: it takes 1.4 billion frames to train the network you just saw. If you have a hundred thousand GPUs, it would take one hour, but if you have one GPU, it would take a hundred thousand hours. That is not a humane time period that you can wait for your training job to run.
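
Taking the quoted figure of 100,000 GPU-hours at face value, here is a quick sketch of how the wall-clock time scales with GPU count, assuming ideal linear scaling (real jobs scale worse because of communication overhead):

```python
# Back-of-the-envelope for the figures quoted above, assuming ideal linear
# scaling across GPUs. Real scaling is worse due to communication overhead.
total_gpu_hours = 100_000          # stated: 1 GPU -> 100,000 hours

for num_gpus in (1, 1_000, 10_000, 100_000):
    wall_clock_hours = total_gpu_hours / num_gpus
    print(f"{num_gpus:>7} GPUs -> {wall_clock_hours:>9.1f} hours "
          f"(~{wall_clock_hours / 24:.1f} days)")
# 1 GPU -> 100,000 hours (~4,167 days); 10,000 GPUs (the training portion of
# Tesla's clusters) -> about 10 hours.
```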

We want to ship faster than that, so that means you're going to need to go parallel; you need more compute for that, which means you're going to need a supercomputer. This is why we've built three supercomputers in-house, comprising 14,000 GPUs, where we use 10,000 GPUs for training and around 4,000 GPUs for auto-labeling. I could go on and on; I just touched on two projects that we have internally, but this is actually part of a huge, continuous effort to optimize the compute that we have in-house. Accumulating and aggregating all these optimizations, we now train occupancy networks twice as fast, just because it's twice as efficient, and if we add in a bunch more compute and go parallel, we can now train this in hours instead of days. With that, I'd like to hand it off to the biggest user of compute, John.

Hi everybody, my name is John Emmons. I lead the Autopilot vision team. I'm going to cover two topics with you today: the first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road.

Ultimately, what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network. There's no additional step here for converting dense predictions into sparse ones; this is just a direct, unfiltered output of the network.
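
For intuition, here is a minimal sketch of what "a set of lanes and their connectivities" could look like as a sparse data structure; the field names and the tiny example graph are illustrative, not Tesla's actual output format.

```python
# Illustrative sketch of a lane graph as "lanes plus connectivity",
# the kind of sparse output described above. Field names are made up.
from dataclasses import dataclass, field

@dataclass
class LaneSegment:
    lane_id: int
    centerline: list                                   # (x, y) points in vehicle frame
    successors: list = field(default_factory=list)     # ids of connected lanes

# A tiny example graph: lane 0 forks into lanes 1 and 2 at an intersection.
lanes = {
    0: LaneSegment(0, [(0, 0), (10, 0)], successors=[1, 2]),
    1: LaneSegment(1, [(10, 0), (20, 0)]),             # continue straight
    2: LaneSegment(2, [(10, 0), (15, 5), (15, 15)]),   # left turn
}

def reachable(lanes, start_id):
    """All lane segments reachable from start_id by following connectivity."""
    seen, stack = set(), [start_id]
    while stack:
        lane = stack.pop()
        if lane not in seen:
            seen.add(lane)
            stack.extend(lanes[lane].successors)
    return seen

print(reachable(lanes, 0))   # {0, 1, 2}
```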

Okay, so I talked a little bit about lanes; I'm going to briefly touch on how we model and predict the future paths and other semantics of objects. I'm just going to go really quickly through two examples. In the video on the right here, we've got a car that's actually running a red light and turning in front of us. What we do to handle situations like this is we predict a set of short-time-horizon future trajectories for all objects. We can use these to anticipate the dangerous situation here and apply whatever braking and steering action is required to avoid a collision.
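
Here is a minimal sketch of that pattern, using constant-velocity extrapolation as a stand-in for the learned short-horizon trajectory predictions; the time horizon and safety radius are made-up numbers.

```python
# Minimal sketch of "short-time-horizon future trajectories" used to
# anticipate a collision. Constant-velocity extrapolation stands in for
# learned trajectory prediction; horizon and radius are illustrative.
import numpy as np

def predict_trajectory(pos, vel, horizon_s=2.0, dt=0.1):
    """Roll an object's state forward, returning (T, 2) future positions."""
    steps = int(horizon_s / dt)
    t = np.arange(1, steps + 1)[:, None] * dt
    return pos + vel * t

def needs_braking(ego_traj, agent_traj, safety_radius_m=2.0):
    """True if any predicted future puts the agent inside our safety bubble."""
    dists = np.linalg.norm(ego_traj - agent_traj, axis=1)
    return bool((dists < safety_radius_m).any())

# A red-light runner cutting across our path from the side:
ego = predict_trajectory(np.array([0.0, 0.0]), np.array([10.0, 0.0]))
runner = predict_trajectory(np.array([15.0, -8.0]), np.array([0.0, 6.0]))
print(needs_braking(ego, runner))   # True: trigger braking/steering response
```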

Putting it all together, the Autopilot vision stack predicts more than just the geometry and kinematics of the world; it also predicts a rich set of semantics, which enables safe and human-like driving.

Let's talk about auto-labeling. We have several kinds of auto-labeling frameworks to support various types of networks, but today I'd like to focus on the lanes network here. This machinery easily scales as long as we have available compute and trip data. About 50 trips were newly auto-labeled from this scene, and some of them are shown here, from different vehicles.

So this is how we capture and transform the space-time slices of the world into the network's supervision.

Take, for example, the simulated scene playing behind me: a complex intersection from Market Street in San Francisco. It would take two weeks for artists to complete, and for us that is painfully slow. However, I'm going to talk about using these automated ground truth labels along with some brand new tooling that allows us to procedurally generate this scene, and many like it, in just five minutes. That's an amazing thousand times faster than before, and this really sets us up for size and scale.
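
That thousand-times figure roughly checks out if the two weeks of artist time are read as working hours rather than wall-clock time (an assumption on my part; the talk does not specify):

```python
# Rough check of the "thousand times faster" claim, assuming the two weeks
# of artist time means ten 8-hour working days (an assumption, not stated).
artist_minutes = 10 * 8 * 60                 # 4,800 minutes of artist work
procedural_minutes = 5                       # stated: five minutes per scene
print(artist_minutes / procedural_minutes)   # ~960, i.e. roughly 1,000x
```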

As you can see on the map behind us, we can easily generate most of San Francisco's city streets, and this didn't take years or even months of work, but rather two weeks by one person. And now, to come full circle: because we generated all these tile sets from ground truth data that contain all the weird intricacies of the real world, we can combine that with procedural visual and traffic variety to create limitless, targeted data for the network to learn from. That concludes the sim section; I'll pass it to Kate to talk about how we can use all this data to improve Autopilot.

Thank you. This data engine framework applies to all our signals, whether they're 3D or multi-cam video, whether the data is human-labeled, auto-labeled, or simulated, and whether it's an offline model or an online model. Tesla is able to do this at scale because of the fleet advantage, the infrastructure that our team has built, and the labeling resources that feed our networks.

To train on all this data, we need a massive amount of compute, so I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform.

I'm frequently asked: why is a car company building a supercomputer for training? This question fundamentally misunderstands the nature of Tesla. At its heart, Tesla is a hardcore technology company. Tonight we're going to talk a little bit about Dojo and give you an update on what we've been able to do over the last year. Last year we showcased our first functional training tile, and at that time we already had workloads running on it. Since then, the team here has been working hard and diligently to deploy this at scale. We've made amazing progress and hit a lot of milestones along the way, and of course we've had a lot of unexpected challenges, but this is where our fail-fast philosophy has allowed us to push our boundaries.

By focusing on density at every level, we can realize the vision of a single accelerator. Starting with the uniform nodes on our custom D1 die, we can connect them together in our fully integrated training tile, and then seamlessly connect them across cabinet boundaries to form our Dojo accelerator. All together, we can house two full accelerators in our ExaPOD for a combined one exaflop of ML compute. Altogether, this amount of technology and integration has only ever been done a couple of times in the history of compute. Next, we'll see how software can leverage this to accelerate performance.

A die-local reduction is followed by a global reduction towards the middle of the tile; then the reduced value is broadcast, radiating from the middle, accelerated by the hardware's broadcast facility. This operation takes only five microseconds on 25 Dojo dies; the same operation takes 150 microseconds on 24 GPUs. This is an order-of-magnitude improvement over GPUs.

So how do we do on these two networks? The results we're about to see were measured on multi-die systems for both the GPU and Dojo, but normalized to per-die numbers. On our auto-labeling network, we're already able to surpass the performance of an A100 with our current hardware running on our older-generation VRMs. On our production hardware with our newer VRMs, that translates to doubling the throughput of an A100, and our models showed that with some key compiler optimizations we could get to more than 3x the performance of an A100. We see even bigger leaps on the occupancy network: almost 3x with our production hardware,

with room for more. And this Dojo tile costs less than one of these GPU boxes. What it really means is that networks that took more than a month to train now take less than a week.

So we started with a hardware design that breaks through traditional integration boundaries, in service of our vision of a single giant accelerator, and we've seen how the compiler and ingest layers build on top of that hardware. After proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target: our high-arithmetic-intensity auto-labeling networks. Today those occupy 4,000 GPUs over 72 GPU racks. With our dense compute and our high performance, we expect to provide the same throughput with just four Dojo cabinets, and these four Dojo cabinets will be part of our first ExaPOD that we plan to build by quarter one of 2023.
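
From the figures quoted above, a rough per-cabinet equivalence follows for this particular auto-labeling workload (a workload-specific throughput comparison, not a general benchmark):

```python
# Quick arithmetic on the stated throughput parity for the auto-labeling
# workload; this is not a general Dojo-vs-GPU benchmark.
gpus_today = 4_000
gpu_racks_today = 72
dojo_cabinets = 4

print(gpus_today / dojo_cabinets)       # ~1,000 GPUs of throughput per cabinet
print(gpu_racks_today / dojo_cabinets)  # ~18 GPU racks replaced per cabinet
```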

This alone will more than double Tesla's auto-labeling capacity. The first ExaPOD is part of a total of seven ExaPODs that we plan to build in Palo Alto, right here across the wall. And we have a display cabinet from one of these ExaPODs for everyone to look at: six tiles densely packed on a tray, 54 petaflops of compute, 640 gigabytes of high-bandwidth memory, with power and host to feed it.

We really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics, actuators, and try to really shift the perception of the company. A lot of people think we're just a car company, or that we make cool cars, whatever, but most people have no idea that Tesla is arguably the leader in real-world AI, hardware and software, and that we're building what is arguably some of the most radical computer architecture since the Cray-1 supercomputer. And I think if you're interested in developing some of the most advanced technology in the world, technology that's going to really affect the world in a positive way, Tesla is the place to be.
