Lecture 1: Introduction -- CS294-158 SP24 Deep Unsupervised Learning -- UC Berkeley Spring 2024
By Pieter Abbeel
Summary
Topics Covered
- Brain Demands Unsupervised Learning
- LeCun's Cake: Unsupervised Base
- Unsupervised Equals Intelligent Compression
- Diffusion Outscales GAN Mode Collapse
- Mask 90% to Unlock Vision SSL
Full Transcript
Okay, 2:10, let's get rolling. Welcome to Deep Unsupervised Learning, first lecture. Let's start with the instructor team. I'm Pieter Abbeel, I'm the professor teaching the course, and we have three co-instructors with me here. Maybe the three of you can briefly stand up, maybe even here in front of the camera, and do a quick self-intro. Philipp, why don't you go first, then Kevin, then Wilson.

"Hi everyone, my name is Philipp. I'm a PhD student with Professor Pieter Abbeel, and I work on a variety of different topics, but largely on real-world robot learning, crossing reinforcement learning, imitation learning, and supervised learning. So yeah, hope we'll have a great semester." Thanks, Phil.

"Hey everyone, I'm Kevin, also a PhD student with Pieter and Sergey, and I work on reinforcement learning algorithms and generative modeling, which we'll talk about in the class." Great, thanks, Kevin.

"Hey everyone, I'm also one of Pieter's PhD students, and I do research in generative models, more specifically stuff like video and language generation." You actually took the class as an undergrad here? "Yeah, I took the class, and I was a TA the semester after." But it's been a while. The last time we taught the class was 2020, so four years ago.

It's pretty interesting, because I was looking at the slides from four years ago, and I saw I put a lot of motivation in there; especially five years ago, a lot of motivation for why people should study deep unsupervised learning, because I thought people weren't interested enough in deep unsupervised learning and I should really motivate them. And now I feel like I need to do the opposite. Or, maybe slightly more precisely, I want to make sure in this lecture to make clear what we are actually planning to do in this class, because it has become so popular that in principle you could do a million different things in a deep unsupervised learning class. So I want to make sure you know what we are going to do here, rather than maybe what you imagine we could be doing.

A couple of logistics. Communication: there's a website
and that's the URL; it went up, I think, yesterday. We should be putting all information up there that's needed for the class. We'll typically put the slides there before lecture. Not today; it's the first lecture, and it's always a bit of a transition period when the semester starts. But starting next week we plan to put up the slides before lecture, so you can take notes on them if you want, or look at them a little bit ahead of time. It has the schedule, everything. If you see anything that seems off, let us know; it could be that there's still a little typo here or there, so just let us know.

Some announcements. We use Ed for communication. I've actually never used it; it's the first time I'm going to be using it, but it seems the standard thing to use for Berkeley classes currently, so I'm excited to be using it. Please start there if you have questions, because then others can see your questions and help answer them, instead of us getting separate questions from everybody, where the questions might have a lot of overlap. By the way, if you're registered or on the wait list, we already put you into the forum for this class; if you are neither, then you have to put yourself in. So Ed is preferred if you have questions for us. Just to be clear, not during lecture; during lecture, just raise your hand and ask questions that way. But outside of lecture, start there rather than email. If it's something you think is more suitable for email, feel free to email the staff list, or if you think it's specific to any one of us, email just us individually. But try to use Ed if
possible. Office hours will start next week. For me, it'll be after lecture. So, lecture is 2:00 to 5:00 on Thursdays; it's a long slot, three hours, and most lectures are not that long. We'll take a pretty big break in the middle, a 15-20 minute break. We'll even have snacks, to have some reinforcement for the second half of lecture. And that's actually quite deliberate: I think one of the big things you can get out of the class is getting to know other students in the class who are working on similar topics. You can exchange ideas and come up with new ideas together, and I think having the long lecture slot with a break in the middle, where you get to hang out and talk to each other, can be very good for that purpose. I'll do my office hours right after class, so that'll be a very long stretch if you stick around for it: it'll be 5:00 to 6:00. Wilson, Kevin, and Philipp will announce their office hours sometime next week. We're still figuring out how to do it: we might have it so one TA is in charge of one homework and then has all the office hours leading up to that homework's due time, or maybe we'll have regular weekly office hours. We're still sorting that out.

For homework, by the way, TA office hours are the best venue. Even though I'd love to be able to help you with homework (every winter break I dream of a semester where I actually have the time to work through the homework myself and then be able to help you), the reality is I just can't; I don't know the details well enough. But these three will, so for homework-specific questions, go to them. For anything else, all of us should be able to help you. And I think for a lot of us, one of the favorite things is to talk with you about your projects: what you are doing research on, how you think it might connect with things in the course, and how things in the course can help you be more effective in whatever you are doing research-wise. Or maybe your research is squarely in unsupervised learning and is directly feeding off the course.

Admission to the course: it's a
bit of a challenge that there is more demand than there are spots, and I think that's just going to be the reality: some people are not going to get in. What we'll do to nudge things towards the best possible situation is this: for people who don't have a strong homework one submitted, or no homework one at all, we'll ask the registrar's office to drop them from the class. Now obviously, if something really crazy comes up and you can't do homework one because of some external circumstances, reach out to us and let us know, and we can think about it. But to us it's a sign that you're either not suitable for the class or you just don't care enough about the class to submit a strong homework; we'll drop you, and we'll let people in from the wait list. Also, if just after today, you're registered for the class and you realize, "oh my god, I thought it was going to be what all these VCs talk about with generative AI, we're going to be the next big startup out of this class": it's not exactly targeted that way. So then maybe also drop the class and make room for others. There could be many reasons you think the class is not for you after today's lecture. Don't wait till the drop deadline, because that's inconvenient for everybody else; just drop when you know you're going to drop. Even if you don't make it into the class, you're welcome to audit, and you're welcome to submit homework; that's all good, you just won't get the credit for having taken the class. And as we move people into the class, it could be undergrads who are good fits for the class, or PhD students; I mean, it's really more about what you know and what you're capable of than exactly what degree you're pursuing. Typically PhD students are the more natural fits, but not always. So we'll see what happens. Any questions about registration?

Yes? "Are there any quantifiable cutoffs for what a strong homework one means? And also, how
does this relate to whether you're fully registered versus wait-listed versus not?" So, we'll grade everybody's submission: whether you're wait-listed or registered, if you submit your homework, it will be graded. Then, what is a strong homework one? The way we design them, essentially they should be solvable for somebody who's taking the class. It's not that a strong homework means solving half the homework; a strong homework means that you essentially solve everything. That doesn't mean there can't be a small mistake here or there, but it can't be that there's a whole part of the homework that you didn't do. We're also here to help you; I mean, we want you to learn, and it's not that you need to be able to do this without any help. Come to office hours, ask questions. But if you use all the resources available and all the time you're willing to commit to it, and still end up having big parts of your homework, say one substantial section, that are just blank or completely off, then likely we would drop you if we see that there are many people on the wait list with much stronger submissions. Think of it maybe in numbers: let's say you've got to score 90% or something on homework one. But don't worry about whether you have 98 or 99; we're not going to rank people and then go by the exact ranking. Roughly, with 90% as a threshold, you're good.

Yes? "What should we expect as the deadline for homework one?" Coming up
in a moment; great question. Yeah, I have a slide on all the deadlines.

Okay, let me tell you a bit about the syllabus. Today is the intro; we're not going to cover actual materials today. We're going to cover motivation for the class, why this class might be worth taking, and we will cover some logistics. But then next week we'll start diving in with autoregressive models, which is the first type of model we'll cover. The week after, we'll do flow models, then we'll do latent variable models, then we'll do GANs / implicit models, and then we'll do diffusion models and final project discussion. So those will be, in some sense, the five main types of models that we'll cover.

From there, we'll look at other types of learning. So, five main generative models; then we'll look at self-supervised learning, non-generative representation learning. We'll look at strengths and weaknesses of everything covered so far, because part of what's hard about the field, I would say, is that it's a bit disjoint. It's not so easy to directly compare a VAE with an autoregressive model; there are pros and cons to each. The field hasn't fully converged, and maybe we'll never converge on just one thing. Clearly, language is more converged onto autoregressive models and vision is more converged onto diffusion models right now, but this can change over time. In fact, the last time we taught the course, diffusion models didn't exist; they still had to be invented later that semester.

Semi-supervised learning, unsupervised distribution alignment: these are somehow not as hot topics these days, but I think they're very important. Part of the beauty of this class, by the way (same for flow models), is that we have the time to cover some things that are less hot today but, with some innovation, can become the next breakthrough if you find the right innovation.

Compression: compression is not typically a machine learning topic, but it turns out everything in unsupervised learning is effectively tied back to compression, because the core hypothesis in unsupervised learning is that you're modeling data. And what does it mean to model data well? What does it mean to do compression? It's the same thing: finding the patterns in the data and, based on that, representing it in a more compact way. In machine learning we don't just care about compactness; we want it to be semantically meaningful. So there's a little bit of an extra twist to the notion of compression there, but it's very, very related.

Spring break week: no lecture,
naturally. Then we'll do language models, a dedicated lecture. Obviously some things will come back there from earlier (language models are large autoregressive models), so some things will come back, but we'll dive much deeper into the specifics of language models. Then we'll have a midterm; I think I have a separate slide on that, so I'll wait to say more about it. We'll do multimodal models and video generation; that lecture, I'll definitely remind you, will be in a different location. We'll be in the auditorium upstairs, a slightly bigger room. Then we'll have a lecture on AI for science, and we'll have representation learning for reinforcement learning. Note that these two lectures are in the same week, so there's going to be a Thursday and a Friday lecture. A pretty wild week, because you have three hours on Thursday and another three hours on Friday, but we won't have a lecture the next week. So in the last week of the normal semester there won't be a lecture; I moved it forward onto the Friday before that. Week 15, no lecture: RRR week. Then week 16, your final project reports will be due, along with a video presentation submission. So instead of doing an in-class sequence of presentations (which I found, at the end of the semester, students are so busy that it's hard to find a time to attend the other students' presentations, so it's often students just jumping in, doing their presentation, and jumping back out to go study for their final or another project), we're just going to take submissions. We'll watch the videos, and we're going to share the videos internally with the class, unless there's a specific reason you don't want your video shared, so anybody who's interested in a topic can go watch the video presentation on that topic that was done in the class. Any questions on the schedule?

Yeah? "Will lecture 13 be in the same classroom, at the same time as usual?" I don't know yet. I've put in a request for a room, but I haven't heard back yet. Very good question, and a good reminder; it is on my to-do list to sort it out. I asked them two days ago, and it's probably a busy beginning of the semester for them to respond. One other thing about lectures: they're all recorded. There's a Camtasia recording running here, which sees whoever is standing in front of the laptop and records the screen. I can't guarantee it works out every time; sometimes these things crash and then there's no proper recording, but hopefully it usually works out, and then we'll put these online. "Do we have a lecture on the same day as the midterm?" Yes; the midterm will be very short, and I'll say more about that soon.

Homework: we'll have homework one go out next week Thursday, due 13 days later,
and then 14 days after that, homework two will go out. So essentially, every two weeks a homework will go out. If you look at a previous offering, what you'll see has changed is that there's now no homework on flow models. If you care about them, and you want to learn more about them and do a homework on them, you can go to the 2020 website and do the homework from there, for your own good, not for the class. I think it's still interesting, but we only want to give you four homeworks, and we think right now it's better to give you a homework on diffusion models rather than flow models, just the way things are shaping up. But again, these things can change. You could come back here a year from now and say, "what a big mistake, you should really have done flow models, they're way more important." Hard to predict, but this is what we're going to do. So that puts us roughly halfway through the semester, and then the idea is that from there, a lot of your time outside of class is spent on your final project.

For homework, you can discuss things, but you need to write your own code and make your own submission. You can be late, with a late policy, but not too much: at most four days. Why do we have this policy of at most four days? Logistically, essentially: one, we want to be able to get started on grading. Two, we think you can learn a lot from looking at solutions, but you learn more if not a lot of time has passed before you get to see the solutions. So after four days we will release solutions; you can look at those, learn from them, and hopefully that's a good learning experience. But at the same time, that means you can't submit after those four days. Again, if there are really exceptional circumstances, as always, don't be afraid to reach out; we can think about it. But that's the standard model we're going to use.

So, the midterm. We have a midterm during lecture, at the beginning of lecture, on April
11th. I think in the end, what's most important is what you learn from the class, what you take away, and what hopefully you put into your final project along the way. But the reason I like to do a midterm is that it gives you an opportunity to be forced to really study the materials. You're all so busy, and if nobody forces you to study the materials, probably you just have something else fighting for your time and winning out. It's not meant to be excruciating, or to weed you out in any way. It's going to be on the topics covered through the week before. We will provide a document with questions and answers: essentially, we will go through the lecture slides, pick the most important derivations according to us, write a question (let's say, derive the variational lower bound for a VAE), and then put the answer below it. You'll have anywhere from 10 to 20 pages of that kind of material, and when the midterm comes, we'll ask you two or three of those. Obviously, if you fail to study those two or three, you might do pretty poorly; but hopefully, since it's only 10 to 20 pages you have to study, it should be easy to just study everything. And it's on us to make sure that these are the 10 to 20 most important things from the class.
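For concreteness, here is a sketch of the kind of derivation meant by "derive the variational lower bound for a VAE", using the standard encoder/decoder notation ($q_\phi$, $p_\theta$, prior $p(z)$), which is assumed here rather than taken from these slides. Applying Jensen's inequality to the log-likelihood gives the ELBO:

```latex
\log p_\theta(x)
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right]
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  \;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\middle\|\, p(z) \right)
```

The right-hand side is the reconstruction term minus the KL regularizer, which is what VAE training maximizes.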
You could in principle memorize it. I don't think that's the best way to study; ideally you understand it, and that way you're able to rederive it. But nobody is stopping you from memorizing it either, and often memorizing can help you understand; it's kind of an intertwined process. In fact, language models just kind of memorize and then understand, so I guess it's just what you're supposed to do. Okay, any questions about the midterm?

All right, the final project. Scope: the idea is that you
explore and push the boundaries in unsupervised learning. There will be a proposal. It could be a proposal plus evaluation of new algorithms or architectures, an investigation of an application of unsupervised learning, benchmarking unsupervised learning, something related to compression, studying synergies between unsupervised learning and other types of learning, and so forth. What's acceptable is pretty broad, but it really has to involve the topics in the class. Ideally, it could be the foundation for a future conference paper. There's no expectation that by the time you submit the final project report it's already at the level of a conference paper, but hopefully there are initial experiments that show the promise that this could really become a conference paper with an extra push behind it.

We actually encourage you to come up with your own idea if you can. It'll likely be more original, because you have your own background, your own knowledge that's different from ours. That said, we're also happy to brainstorm ideas together, and often going back and forth can lead to new things that neither one of us would have come up with individually.

The main reason I started teaching this class five years ago was for these projects: I wanted more unsupervised learning projects to happen at Berkeley, and some really good ones have happened. One that sticks with me: Roshan Rao, a student in the first offering of the class, did a project on sequence modeling for biology, wrote one of the first papers on essentially protein property prediction based on pre-training on just sequence modeling, and released the TAPE benchmark as a NeurIPS paper for that. This is all before AlphaFold. From there, he actually went to the Facebook / Meta AI bio group, and since then that team has left Meta and started a company that's essentially trying to build the foundation models for bio for the future. That would be a great trajectory; if any one of you goes on that trajectory, I'd love it. I mean, there are many other great trajectories, but really try to do something exciting. If something doesn't work, but you have a lot of evidence that you tried things that are meaningful and they didn't work, that's more interesting, to me at least, than if you did something pretty boring and, yeah, it worked, but it was kind of boring and too close to what other people have done. So try to pick something exciting, and surprising if it works out.
Timeline: project proposals are due February 28th, so you have a bit of time. You'll put your proposal in a Google Doc so we can easily give you feedback; about a week later, we will have iterated with you in that Google Doc and hopefully arrived at a project proposal that we're all happy with. Then a three-page milestone is due in April; we'll also do a Google Doc for ease of feedback. The idea here is that it forces you to get started. It's very easy to wait till the end of the semester if there's nothing forcing you to get started earlier, so you'll have to start about a month and a half before the end of the semester, or even a bit earlier, to get something in there. It doesn't need to be anywhere near final, but you should have some results, some initial investigations, that you can report on. May 10th, everything will be due; I believe, I'd have to check, but I think that is the Friday of finals week.
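As a quick illustration of how the grade weights below combine: only the 60/10/30 split and the rough 90% A-cutoff are from the lecture; the helper function and the example scores are made up.

```python
def course_score(homework: float, midterm: float, project: float) -> float:
    """Weighted course score: 60% homework, 10% midterm, 30% final project.

    All inputs are percentages in [0, 100].
    """
    return 0.6 * homework + 0.1 * midterm + 0.3 * project

# Hypothetical student: 95% homework average, 80% midterm, 90% project.
print(course_score(95, 80, 90))  # 92.0, an A under the rough 90% cutoff
```

Note how heavily the homework component dominates: a weak midterm only moves the total by a few points.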
Grading logistics: 60% homework, 10% midterm, 30% final project. And then for the letter grades, there's something on the website that says what the letter grade is, based on the grades that you get; I believe it's probably 90% and above is an A, and then from there, every 5% loses a partial letter grade. Do you need to attend class? No hard requirement. Obviously there are not enough seats, so it could even be uncomfortable to sit on the floor for three hours. But I would say it's highly recommended, and that goes back to the opportunity to learn from each other. As I said at the beginning of lecture, we'll have a pretty large break right in the middle: a good opportunity to learn from each other, and often new projects arise from that. Actually, Roshan's project that I mentioned: he got to know some of the students he did the project with in the class. There were bio students taking the class, he was a CS student, and then they started working on it together.

This is only the third offering of the course, so there will be some rough edges along the way. Bear with us, and give feedback: if there's something you think might be better done a different way, let us know. That doesn't mean we can always get around to doing it, but we'll take notes, and what we can do, we will do. Okay, so that's logistics. Any questions about logistics before we start with more content? In the
back? Yeah, so we're trying to record: Camtasia is running right now, which is screen-capturing software, and it can capture the screen at higher resolution than, let's say, Zoom would. We'll try every time; maybe sometimes it crashes and doesn't work, but hopefully every time we'll have a recording. And then we'll hopefully post it later that evening or so.

A couple more questions. Yes? "Are the final projects completely independent, or...?" Oh, good question: up to three people. I would recommend forming teams of two or three, because often, when there is more than one of you, you have more of a back and forth on ideas and you can arrive at more interesting things. We do expect a little bit of linear scaling, so if there are two or three of you, you're supposed to have a more substantial achievement in your project than when there's only one of you.

Yes? [a question about where to find the course materials] Yeah, so we will link the website from the Ed forum later today, so then you can find the website where all the materials will be posted. For submitting homework, I guess we'll sort that out when we release the first homework; by the time it comes out next week, we'll hopefully have set up the way to receive your homework. Thank you.

Yes? "For undergrads, should we fill out a form, or just turn in the first homework?" Just turn in the first homework, and we'll take things from there.
All right, so let's switch from logistics to content. What is deep unsupervised learning? It's about capturing rich patterns in raw data with deep networks, in a label-free way. The label-free part really matters; it's the original motivation for unsupervised learning, because labeling is time-consuming, and hence it's nice if you don't have to do it. But there are actually additional motivations beyond it being label-free, which we'll get to soon.

There are roughly two sub-areas in unsupervised learning. There are generative models, which try to recreate the raw data distribution, and there is self-supervised learning, which tends to be puzzle tasks: you have your data, you remove some part of the data, or some view of the data, in some way or other, and then the model is supposed to fill it back in. If it can fill that back in, then presumably it understands something about the data.

Generative models recreate the raw data distribution. There's a lot of talk about generative AI, and I was thinking about how I would really define what we're trying to cover in this course for the generative AI part. The way I think of it is: it's still neural nets, right? It's still a neural net; you have an input, you generate an output. Even a generative model is a neural network, and it looks a lot like supervised learning in many cases. So what's really different about it? The way I think of it as being different is that in a lot of the typical work in supervised learning, there's a very clear, deterministic solution: you have an image and there's a clear label that needs to be assigned to it, cat or dog; or you have a self-driving-car scene and every pixel needs to be street, pavement, pedestrian, and so forth. So the output is, in some sense, deterministic: it's clear what you're supposed to do. Whereas in generative models, we're trying to model pretty complex output distributions with the neural network, and that's actually hard to do, modeling complicated distributions. That's why there are actually five different lectures, each a three-hour lecture, covering five different ways of effectively representing complex distributions with neural networks. So that's the way I would think about it: generative models, at least in this class, are models that can represent complicated probability distributions, well beyond the kind where everything is concentrated on one output.

Self-supervised learning is the puzzle tasks. There's been an interesting back and forth: initially, in unsupervised
learning, generative models were what people were excited about; then it became all self-supervised learning for a while; and now it's swinging back to more generative models. Who knows where it'll be next year; I think it's good to know both of them.

So why do we care? I think this might be the most interesting motivation of them all. I hope you all know Geoff Hinton; if you don't, please know him now. Geoff Hinton is essentially the godfather of deep learning. He's been working on deep learning since the 1970s and had to wait for it to have a breakthrough till 2012. So it's a 40-year career of working on the thing that he thought was going to be right, and after 40 years he proved himself right, because he was the one, with his students, who had the big breakthrough on image recognition with AlexNet. Now, here is his motivation for unsupervised learning. The brain has about 10^14 synapses. Synapses in the brain are connections between neurons, and they're essentially the learnable part of the brain: whether neurons are connected or not is the variation in what our brain is. As a little baby, nothing is really wired up to know anything, and then when we're older, we start knowing things. 10^14 synapses, and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second. Because if we assume that the brain is meant to be fully utilized, then we do need to get 10^5, essentially, bits in some sense per second to fully utilize our brain. Now, there are other candidate explanations for why the brain is so large: it could be that a larger brain can learn faster than a smaller brain, even if not that many bits have to be learned; that's something Geoff has also talked about. I think both probably have some truth to them, but this is a very clear sign that supervised learning is not enough for how humans essentially use their brains; there's a lot more going on beyond supervised
learning. Yann LeCun has said a similar thing: you need a tremendous amount of information to build machines that have common sense and generalize. And I think this gets into another trend that we've seen recently in AI, which is foundation models. Foundation models are trained on such a large amount of data that they generalize much better than any models before. It used to be that you'd have a model for, let's say, recognizing the sentiment of a paragraph (is this a positive or a negative review of a product?), and you had a dedicated language model to do that; and it was better back then to train a dedicated language model for sentiment analysis than to train one model to do everything. But that's changed: foundation models, trained on all the data, outperform the specialized models now. And I would argue, in some sense, that that's the closest we're getting to common sense with current machines: by training on such a wide swath of data that somehow everything starts to fall within distribution, at least a little bit.

So, Yann LeCun presented this cake, now known as LeCun's cake, at his NeurIPS keynote in 2016. The point he was making was that back in 2016, all the excitement at NeurIPS was about reinforcement learning, and he was trying to make the point that people should maybe be a little less excited about reinforcement learning (not unexcited, just maybe a little less) and pay more attention to unsupervised learning. His point was that essentially you need a lot of data, and the only way to get a lot of data is unsupervised learning. So that's the foundation of the cake; it's most of the volume and mass of the cake. The icing is the supervised learning, and then the cherry is the reinforcement learning. It's pretty interesting: if you look at ChatGPT, it's exactly this model. ChatGPT is trained on the entire internet to get the foundational knowledge; supervised fine-tuning is the icing on the cake; then RLHF, reinforcement learning from human feedback, is the cherry. LeCun's cake essentially predicted, six years ahead of time, how AI would shape up into its current, at least, most capable form.

Another way you might get motivated for unsupervised learning is this notion of ideal intelligence. So, ideal intelligence, people argue, is all about
compression: finding all the patterns in your data. That's what it means to be smart; you understand all the patterns that are there. So finding all the patterns means finding a short description of the raw data, meaning low Kolmogorov complexity. So what is Kolmogorov complexity? The Kolmogorov complexity of a certain data set is, in some sense, the size of the shortest computer program that can produce that data set as its output. Now, the simplest way to do it (though it's rarely the winning size) is to just store the entire data set and have a program that says "print", and then it prints out the entire data set. That's a very simple program, but it's rarely going to be the one that wins. If you had a completely random string with no pattern in it, though, that might be the shortest program you could possibly produce. But for many real-world data sets, there are other, more efficient ways to describe the data, and Kolmogorov complexity measures that.
the way the way to think of this in neural net land is that you should think of the size of the neural net as essentially the size of your program the
neural network is the program because instead of writing code you're training a neural network so the neural network is the program and so here some's asking what is the most compact neural net
Network that can regenerate my data and that should be the one that understands the data the best and hence U best at let's say generalizing hopefully there are some subtleties there by the way
smallest NE net doesn't need to mean smallest number of parameters it could mean something else it could mean that you have low Precision parameters but large number of parameters and maybe that's a better way to represent things
than having higher Precision parameters but less parameters um so keep that in mind shortest code length um in principle
should also allow for optimal inference, and Solomonoff induction is a bit more focused on that side of things. This is extensible to optimal action-making agents, which is called AIXI. AIXI essentially looks at this notion: if you want to build an agent that solves problems, what is the smallest agent I can build that can solve these problems? That's the counterpart of recreating your data in unsupervised learning, but now in reinforcement
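The shortest-program quantity above is uncomputable exactly, but an off-the-shelf compressor gives a crude, computable upper bound on description length. A toy sketch (zlib is just a stand-in here, not Kolmogorov complexity itself):

```python
import os
import zlib

def description_length(data: bytes) -> int:
    """Bytes needed for a crude 'program' that prints `data`: its
    zlib-compressed form. This upper-bounds Kolmogorov complexity."""
    return len(zlib.compress(data, level=9))

patterned = b"abc" * 10_000      # highly regular: a short description exists
random_ish = os.urandom(30_000)  # patternless: 'print the literal bytes' wins

print(description_length(patterned))   # tiny compared to the 30,000 raw bytes
print(description_length(random_ish))  # roughly the raw size, or slightly more
```

The patterned string compresses to a few dozen bytes while the random one barely compresses at all, which mirrors the print-the-whole-data-set argument above.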
learning. Here is not quite a motivation, but very related. In fact Ilya Sutskever, chief scientist and co-founder of OpenAI, talked about this just about half a year ago at the Simons Institute here. Assume we pre-train in an unsupervised way on a data distribution D1 and then fine-tune on a data distribution D2. Then, if D1 and D2 are related, compressing D2 conditioned on already knowing D1 should be more efficient than compressing D2 outright: the additional effort should be less, given the effort you've already spent to compress D1, if there's any kind of relationship between them.
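That conditional-compression claim can be illustrated with an ordinary compressor standing in for a learned model; the overlapping byte strings below are purely a toy construction:

```python
import os
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for description length."""
    return len(zlib.compress(data, level=9))

# Toy 'distributions': D2 shares most of its content with D1.
d1 = os.urandom(20_000)
d2 = d1[:10_000] + os.urandom(100)

cost_outright = c(d2)               # compressing D2 from scratch
cost_given_d1 = c(d1 + d2) - c(d1)  # extra cost of D2 once D1 is known
print(cost_outright, cost_given_d1)
```

Because D2 overlaps D1, its conditional cost is far below its standalone cost, exactly the "additional effort should be less" intuition.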
So pre-training on D1 should allow you to learn faster on D2. This is, let's say, not a super mathematically precise argument, but an intuitive one for why unsupervised learning should help when all you care about later is supervised learning. Even if all you care about is labeling every pixel in your self-driving car's scene, unsupervised learning might still be the right thing to start with. By the way, I remember these kinds of motivations from the early days of OpenAI, when it was just ten of us: Ilya, John, Andrej, a couple of my students, myself. Back then supervised learning was always winning; on anything you cared about, supervised learning was always winning, and we actually had to think hard about these aspects to decide that we should work on unsupervised learning. Part of the motivation was that there's more data, so clearly that's going to help. But the other part was that people would say, well, it doesn't matter, because that data is not what you want; you just need to label enough supervised data and you'll get things done. Our argument for why we should still work on unsupervised learning was things like this. Today that argument is mostly empirical: you pre-train and then fine-tune, and you get better results. That wasn't the case back then when you started with unsupervised pre-training. Aside from theoretical interest, deep unsupervised learning has many powerful applications. You can generate novel data, you can do conditional synthesis, you can do compression with it. By the way, compression hasn't really been commercialized yet; I'm still wondering when that might happen. Effectively, how much you can compress a data set is based on how good a model you have for the probability distribution that generated your data set, and the bound is the entropy of that distribution; that's how much you can compress your data. So if you have a better model, you should be able to compress better.
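For a known discrete distribution, that bound is just Shannon entropy; a minimal sketch:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(p) in bits: the lower bound on average code length
    per symbol for any lossless code under distribution p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A skewed source compresses better than a uniform one over the same alphabet:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.90, 0.05, 0.03, 0.02]
print(entropy_bits(uniform))  # 2.0 bits/symbol
print(entropy_bits(skewed))   # ~0.62 bits/symbol
```

A better model is one whose predicted distribution assigns higher probability to the data you actually see, which pushes the achievable code length down toward this entropy bound.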
So far, it turns out that these neural nets are kind of large, and once you account for the actual Kolmogorov complexity, including the size of the neural net and the amount of inference compute required to decode, it hasn't been winning yet. You could imagine a future where, if compute chips become even larger and cheaper, you have a massive neural network decoding all the movies that you watch, and very few bits have to come over the pipe, because it can decode from a very highly compressed representation compared to what's possible today. You can also improve almost any downstream task by doing unsupervised or self-supervised pre-training. Production-level impact: Google search powered by BERT was the first one; that's why I'm highlighting it here. I think that was the first real production-level impact of unsupervised learning. And it gives flexible building blocks for other things: some of the architectures that were invented for unsupervised learning are now reused in reinforcement learning and in supervised learning. So there are a lot of reasons you could be interested in what we're going to cover here. So let me dive
into it. Actually, let's see how far along we are. Yeah, let's not take a break right now, let's keep going just a little bit. This is part of where generative modeling started, at least for image generation. It was called deep belief nets back then; a different name, but the same kind of idea: you train a neural network, a specific type of neural network in this case that has since fallen out of favor. These were some of the early results showing that you can train these neural nets to generate images that look like the images in your training set. At the time this was pretty surprising to people. It was still hard to correctly classify cats and dogs at the time, because that didn't really happen until ImageNet and AlexNet in 2012, so the fact that, while people were still struggling to build an image classifier for cats and dogs, other people were generating digits, it was pretty surprising that that was
possible. The variational autoencoder made one of the big steps forward after neural networks broke through in 2012, and we'll look a lot at the VAE a couple of lectures from now. GANs, by Ian Goodfellow and collaborators in 2014, were the first sign that these neural nets were likely to generate realistic images in the foreseeable future. Until GANs, if you look back at the VAE, everything looked kind of smoothed out, and that's just how these models tended to be: they try to predict things that are on average kind of right, but being right on average is not exactly right. It's like when the answer to a question is either yes or no: saying something in the middle is never going to be correct. Same thing here, it was just smoothing things out too much. GANs started resolving that. This is still low resolution; that was just a compute
limitation at the time. Exciting story here, by the way. Ian was having drinks with friends in a bar in Montreal, and, well, what do you do when you're an AI student? You have drinks, you talk about AI. And they were wondering: how is it possible that these neural nets can't create proper images? Even a neural net can see that these images aren't sharp. And Ian thought: wait, if even a neural net can see that these previous images are not that sharp, I should just use a neural net to give feedback to the generator neural net, set up a feedback loop, and see what happens. He coded it up the same night, started seeing a real sign of life, and a little later wrote the paper. Interestingly, well, here's another interesting backstory about GANs: Alec Radford, first author of this paper,
this was the first time, with the bedroom data set here, that GANs were showing actual high-quality images. People said this is not just a sign that this will be possible soon; this is now becoming possible. Alec, I believe, dropped out of college to play with AI and had some good success. He wrote this paper, and then OpenAI recruited him. I remember that we recruited him; I was like, okay, he built the best generative model to date, back in 2015. It was interesting at the time that these things were possible: you just find the paper on arXiv and based on that you decide to recruit somebody. Alec, by the way, was obviously a great recruit, because Alec Radford is the one who started the whole GPT sequence of works at OpenAI. GPT-1, well, in some sense it started as an LSTM, but then GPT-2 and -3, they all started with Alec. Another interesting story there: he actually let a model train while he was away for a while, and it kept improving, and he realized, wait, we've never been training long enough, clearly we should be training much longer. That also helped inspire the much longer training sessions. So sometimes, they say, going away and not killing your process is the way to go
too. These are some images of faces. We as humans tend to be very sensitive to artifacts in faces, so obviously here we're seeing signs of life, but we don't yet see this as realistic. Super-resolution GANs came soon thereafter. Alexei Efros and his students here did things with GANs that were very entertaining: you could essentially recolor subjects in images, in this case the horse becomes a zebra. And then DeepMind did the first very, very large GAN training, and this was the first time, this was 2018, that GANs were showing very realistic images. Now, some of the in-between ones are not super realistic, because it's interpolating between different ones, but you can see that when it's not interpolating, it is very realistic. So, pretty amazing. At the time people essentially thought this is it, you know, GANs are the way to generate images in the best possible way, and a lot of iterations were done. Nvidia especially did a lot of work scaling up even more, getting even more impressive results over time. StyleGAN came out of Nvidia; these faces become kind of indistinguishable from real human faces, and that was also 2018. Then something happened that was kind of surprising in some sense: even though GANs were all the rage and
everybody thought this was how we were going to solve image generation, diffusion models came onto the scene. You might wonder why even bother coming up with diffusion models at the time. The challenge people ran into with GANs was that even though everything was very realistic, it wasn't covering the entire distribution that the data had; it was focusing on specific modes of the distribution rather than having great coverage of the entire distribution. And so the open question at the time was: can you come up with a new approach, or improve GANs, that's fine too, but can you come up with something that both generates realistic images and has good coverage, that doesn't effectively leave out a lot of the data? Diffusion models showed a clear sign of life in this paper in 2020, and they're
also the models powering almost all image generation today. So I'll just put up some fun examples, at least some of the ones that I find the most fun, all done with diffusion models. These were not in the original diffusion paper; the ones I show here are from DALL-E, from OpenAI. "A masterful oil painting of a Persian exotic cat discovering their astounding crypto losses while checking their phone." Why is this an interesting prompt? Well, partially because it's entertaining, obviously, but it's also interesting because it's targeting something that would not be in the training data. And if the AI, in this case this diffusion model, can generate a good response to this, it means it understands something beyond replicating what's in the data: it should understand something about how to combine cats with phones and astounding crypto losses, which probably means being not particularly happy. And here's what it comes up with. Then here's another one of my
favorites: "A Victorian man struggles with his addiction to TikTok." Same gist: there's some humor to it, and there's also this notion that it's not going to be in the data. Victorian men did not have access to TikTok; that was the 1800s in England, and nobody had anything like TikTok at the time. And here's what it comes up with. It actually gets it really well: the guy is holding an alcohol flask, suggesting some kind of addiction generally speaking, red cheeks, looking pretty sad, looking at his phone, and it's all set
in the Victorian era. Here's another one of my favorites: "Darth Vader realizing he's forgotten to add an attachment to the email." Darth Vader never forgot to add attachments, but we can imagine what it would look like, and again the model is capable of doing this: it looks like a business traveler sitting on the hotel room bed firing off some emails before going to dinner, and then making a mistake. These were all from OpenAI's DALL-E model. Here's, oh, the prompt should have come first, but this one is from Google's Imagen: "A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat." And it creates a photorealistic image in this case. A lot of models have come since. So OpenAI's DALL-E and Google's Imagen came
out first. Soon thereafter Stable Diffusion came out; Stability AI has been hosting that, and Midjourney has been building off of it on their Discord servers. In fact, one of the key people on the Imagen team is Jonathan Ho, who was a PhD student here, wrote the denoising diffusion probabilistic models paper, then went to Google and did this work as well as the Imagen Video work. And then he left Google to start a company called Ideogram, because Google was not putting its models out there; it just became frustrating for him that everybody was saying DALL-E is so much fun, Midjourney is so much fun, while he had built this amazing Imagen model and nobody was having fun with it, it was just sitting behind closed doors at Google. So he started Ideogram, which is also one of the top providers of text-to-image right
now. So in the remainder of today's lecture we'll look at a few more pretty exciting examples of progress in unsupervised learning over the last few years. Kevin will be presenting those, and then at the very end I'll wrap up the class with the last slide. One quick thing that came up during the break: it turns out the forum we use doesn't allow self sign-up; we have to add you, it seems. So if you want to get access, you need to start here. If you're already on the class roster or already on the waitlist, you should already be on the Ed forum, but if you're not, email us here and we can add you to the forum, and from there you can find everything else, because we'll post the class website and everything else there. But yeah, if you're stuck getting started with the whole process, email us here and we'll help you get going. Okay, cool. Yeah, so for the remainder of
the lecture, we're just going to go over some fun things that unsupervised learning can do today. A lot of what we looked at in the past here, in the last few slides, was image-based, and here's one place where we go into a different domain. So WaveNet here asked the question: okay, we can generate images, what can we do in other domains? And it turns out we can actually generate audio signals as well, and the method they used is actually pretty similar to what we were doing back in the pixel domain. (These over here are diffusion models, so they're from a bit later, but WaveNet builds off PixelCNN, which we'll cover in the autoregressive part.) It basically just predicts the audio step by step: if you take the raw audio samples and predict the next one conditioned on everything in the past, you get a generative model of the entire audio waveform. And WaveNet showed that if you do this in the right, scalable way, it works.
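That step-by-step factorization, p(x) = Π_t p(x_t | x_<t), can be sketched with a stand-in conditional model. This toy replaces WaveNet's dilated-convolution network with a made-up rule; only the autoregressive sampling loop is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_next(history):
    """Stand-in for the learned conditional p(x_t | x_<t) over 256 quantized
    audio levels; here, a made-up rule preferring to stay near the last level."""
    last = history[-1] if history else 128
    logits = -0.05 * (np.arange(256) - last) ** 2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Autoregressive sampling: draw each 8-bit sample conditioned on all past ones.
samples = []
for t in range(1000):
    samples.append(int(rng.choice(256, p=p_next(samples))))

print(len(samples), min(samples), max(samples))
```

In the real model the conditional comes from a trained network and the loop runs at tens of thousands of samples per second of audio, which is exactly why the "right, scalable way" matters.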
And of course these things are getting better over time. This is AudioCraft, which comes out of Meta, from 2023, so just last year, and it's basically building off the same technique: if we do autoregressive prediction, we can generate the samples one by one. Here they use a few tricks. Instead of predicting things in the raw audio space, like WaveNet did, in AudioCraft everything is first tokenized. This is a trick we'll see recurring: if you tokenize things first, predicting becomes a lot easier. So tokenize everything first, here they tokenize, for example, text and audio into the same space, and then you can just concatenate, text first, audio after, predict it all autoregressively, and you basically get audio generation. And here they show, okay, we can do it for music, we
can generate songs, and we can do it for other sounds as well. Text-to-speech is also something we can do with unsupervised learning. I guess in this case it is somewhat supervised, and there are some labels used to connect the two modalities, but a lot of what's going on here, such as the embeddings, happens in an unsupervised way. This Tacotron work is one of the landmark works that said, okay, we can take in text, do some processing on it, and get out these spectrograms, which are basically just a way of representing audio, and then use a model built off WaveNet to turn that into a raw audio file. And these days text-to-speech has gotten really good. There's a company called ElevenLabs that has been making this better; we can do things like generating speech in different styles, so we have different personalities here, for example. And it turns out that converting between styles is also pretty easy. It's a tough problem to go from text directly to sound, because the text doesn't actually contain all the information needed to make the sound; there are things like intonation and expressivity that aren't in the text. But with, for example, voice conversion, we can convert from one style to another. This is something fun that some of you might have seen in the past; it gained popularity last year, people using these generators to talk in the voices of other people. One of the famous ones is these presidents talking about things they would never actually talk about, but it is pretty funny. So that's text-to-speech.
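The spectrogram intermediate that Tacotron-style systems predict is easy to sketch with a plain short-time Fourier transform; the parameters below are toy choices, not the actual Tacotron mel-scale frontend:

```python
import numpy as np

def magnitude_spectrogram(audio, frame_len=256, hop=128):
    """Toy magnitude spectrogram: the kind of time-frequency representation a
    TTS model predicts before a vocoder (e.g. WaveNet) turns it into audio."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)  # shape: (num_frames, frame_len // 2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz:
t = np.arange(16000) / 16000.0
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

The energy concentrates in the frequency bin nearest 440 Hz, which is why a picture of the spectrogram is such a readable stand-in for the raw waveform.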
And then video is also kind of a new domain. I think video generation is just starting to work. I know Wilson is working on it, some other people are working on it, but basically: can we do generation in video space the same way that generation in image space is working really well? Here are some examples using GANs: if we just have data sets of videos, we can optimize a generative model over them using the same GAN format that we used back in images. And then there are also methods based on diffusion models. This Emu Video does a step-by-step process. This is actually text-to-video; the text is not shown, but you describe the scene first. The lesson learned in this paper is that if you factorize the distribution, it works a lot better. So what you do here is: given the text, first predict the first frame, and that's the whole image generation problem in itself; then, given the first frame and the text, generate the rest of the keyframes; and then generate the in-between frames. It turns out just factorizing things that way works really well, and we can get things that are at least consistent. I think consistency in videos is still kind of an unsolved problem; it's getting there, things are
getting more consistent as new works come out, but you can still see artifacts here, where this frame doesn't actually connect to the other frame physically. VideoPoet is another, kind of concurrent, work on the same problem, text to video, and in VideoPoet they take the same lesson as in the audio generation: let's just tokenize everything. The same way we do it in language models, we're going to tokenize images into image tokens, and then we have our text tokens and our image tokens and we can predict it all autoregressively. And if you scale this up the right way, we start to get video generation that makes sense. I think these are pretty interesting because they show the same sort of properties that we saw back in images: these videos are obviously not real, this is not actually in the data distribution, but we have the nice kind of interpolation that we saw earlier in image space, now
applying to other domains. In sound as well: we're able to generate, for example, modern songs played with a classical instrument, which aren't actually in the data set, but we can guess what they might sound like. And there are some applications for these things beyond just generating videos. These videos here are cool, but it's kind of hard to think of what we're actually going to use them for. Well, one answer is that if we can simulate the world, that gives us something very useful if we want to do control. If we want a robot to plan, to imagine what would happen if it tried to do a certain thing, and we have a good video simulation of the world, we can just roll that out; we don't have to run the experiment in real life, we can run it in the simulator. So as these models get better and better, we're going to start seeing more downstream applications of them. The video model is trained in an unsupervised way, that's the pre-training that goes into the world model, and then later we can do, for example, reinforcement learning or planning on top of these models. And then there's text. So
I won't go into this too much, because we're going to have a whole lecture on language modeling later and I don't want to get too deep into language-model-specific stuff, but language models are a form of unsupervised learning. This is char-rnn: Andrej Karpathy wrote this little blog post back in 2015 where he said, okay, what if we have an RNN, a recurrent network, that just predicts each character after another? This was before we knew that these things would scale very well, but you could still generate pretty fun things, some made-up stories, for example. And these character-level RNNs were actually pretty powerful, in that you could train them on things that are not entirely natural language.
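The predict-the-next-character loop from char-rnn can be sketched in miniature. Here a hypothetical order-1 count model stands in for the RNN, since only the autoregressive loop is the point:

```python
import random
from collections import Counter, defaultdict

# Toy training text for a character-level model (a bigram table instead of an
# RNN, purely to show the predict-next-character loop).
text = "hello world, hello there, hello again. "
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def sample_next(ch, rng):
    """Sample the next character from the empirical conditional p(next | ch)."""
    chars, freqs = zip(*counts[ch].items())
    return rng.choices(chars, weights=freqs)[0]

rng = random.Random(0)
out = ["h"]
for _ in range(40):
    out.append(sample_next(out[-1], rng))
print("".join(out))  # babbles in the style of the training text
```

An RNN replaces the bigram table with a learned function of the entire history, but the sampling loop, feed the output back in as the next input, is exactly the same.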
For example, we can generate LaTeX, we can generate code. And this is a lesson we'll see again in language modeling: some people like to say that language contains a lot more than just words, it contains patterns of thought, and there are arguments that language models can serve as foundation models for things that are not just language. Here are some very early results showing that we can generate math-looking text. I don't know if this math actually makes sense, it might be making it up, but it looks right from a
distance. And then GPT-2, OpenAI's next attempt at scaling these models up, showed that with the right prompting you can actually get these foundation models, these large unsupervised models, to do the things we want them to do; I think that's the main innovation that came out of GPT-2. Whereas in the early "generate math" or "generate text" days we didn't really know how to guide them, they could generate from the distribution of text they were trained on, but we didn't know how to get them to do exactly what we want, GPT-2 showed that you can just tell it what to do. And this is a theme we'll see a little later: there's a shift from training supervised models specifically on what we want, to training these giant models of the whole distribution, and as long as we constrain them at the top, with the prompt, what comes after naturally has to follow certain instructions. So GPT-2 is able to do some nice things like generate stories, and it can argue against you. I think this one is saying, yeah, recycling is not good for the world; we don't really want it to be saying this, but it can. And then this is ChatGPT. GPT-2 was around in 2019; since then the models have scaled up even larger, although the principles are essentially the same. But we can kind
of see why it's so powerful to learn these unsupervised models: the same lesson we saw here, that a little bit of prompting gets what we want out, keeps improving the better our learned distribution matches. With language models today you can do things like ask for JSON output and get nicely structured files, and this is opening up a lot of avenues: we can use this to generate things that are readable by computers, not only by humans. Another nice thing language models can do today is long-form summarization. As the context lengths go up, that is, as the length of text the models can actually handle in memory goes up, we can start to do more understanding-based tasks. Back in GPT-2 the prompts were usually a sentence long, and most of the output came from the model, essentially because the context limit of the model was pretty small; you couldn't really go much longer than this. But now, it's cut off over there, but there's a big article about, for example, what unsupervised learning is, and the models can actually just summarize it. So we're entering a stage where the models themselves can be conditioned in these very high-bandwidth ways, even if the output itself is only a little bit. And these models are all available for use, so you can try them; I'm sure many of you have heard of these things before, they've been all over. But yeah, things are scaling, and they're working. So yeah, a little bit about
compression as well. This is a paper from, I think, 2017, but basically it shows how generative models can be used for compression, and again we'll have a lecture on this later. The basic idea is that hand-designed compression methods like JPEG assume some form for the natural image space. JPEG assumes that pixels are naturally smooth, and if you assume this, you can turn images into fewer bits than if you assume nothing. Compression using a generative model is basically doing the same thing, but instead of assuming this kind of natural-image structure, in terms of, say, smoothness or neighboring pixels, we just learn it from data. So we say: okay, if all we care about are images like, say, the million images in this training set, then I only need a smaller number of bits, because I'm assuming I won't have images of noise, or of graphics if it's a 3D data set. And so we can get compression numbers that are better, because we make stricter assumptions. One argument against these models is that some of them are lossy, but JPEG is already lossy to begin with, and for a lot of applications lossy is perfectly fine; all video codecs are essentially lossy, because it doesn't make sense to encode everything when most frames are nearly the same. If your YouTube video doesn't load, you'll see things like this. But if we assume an even stricter constraint on the distribution, we can get results like this, where the same number of bits gives you a much crisper image, because we've learned which images are likely and which ones
are. Things in 3D are also working. This is a paper from 2020 that uses NeRFs. NeRFs are a way of representing a 3D scene implicitly, as the output of a neural network. The first NeRF paper just trained them by taking pictures, potentially of one object, and this paper, which I think is either a GAN-based NeRF or a VAE-based NeRF, says: okay, if we have a representation of a 3D model, let's just learn that distribution, let's learn to generate things. So here we have models of, like, a car, a chair, and because these models are actually 3D models, you can move the camera around and see them from different angles. And again, a lot of the power of these methods comes from the fact that they are unsupervised: you don't need labels, and in some cases we don't even need the 3D models themselves; as long as you have pictures of them, it's enough to reconstruct the model.
[Student question, partially inaudible, about how the likelihood of these models is measured.] Yeah, it's a great question, and I don't know the answer off the top of my head. I could make a guess, but I think you would get the correct answer reading the work. Let me jump in for a moment: for some generative models it's a bit harder to measure, but for the ones that optimize likelihood, it's literally the likelihood. The average log loss directly translates into bits; you might have to apply a factor of log 2 to get it in bits rather than in nats, but it's the same metric. Because with optimal compression, assuming you use an optimal compressor, you get to compress your data down to the entropy of the distribution: that's the number of bits you need, and that corresponds to the average negative log prob of all your data points. Entropy is the average negative log prob, and so your average log prob is exactly what's being measured there. And so bits per byte here means average log prob per byte: an image will have a red byte, a green byte, and a blue byte to represent the three colors at each location, so it's measuring the average log prob per image and then normalizing by the number of bytes in the image to get this number.
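The nats-to-bits bookkeeping just described is a one-line conversion; a small sketch (the 2.0-nats figure below is made up for illustration):

```python
import math

def bits_per_byte(avg_nll_nats_per_byte: float) -> float:
    """Convert a model's average negative log-likelihood (nats per byte, the
    usual training-loss units) into bits per byte, the compression number."""
    return avg_nll_nats_per_byte / math.log(2)

# e.g. an average loss of 2.0 nats per byte:
print(bits_per_byte(2.0))  # ~2.885 bits per byte, vs. 8 bits for raw storage
```

So a lower log loss reads directly as a better achievable compression rate for data drawn from the modeled distribution.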
by the number of bytes that are in the image to get you this number uh yeah so basically that's that's one application of these models do some compression and and like Peter
said earlier if you actually look at the the how to measure the compression it it requires an optimal encoder and an optimal decoder so that kind of motivates the fact that the more powerful your model is the better you
should be able to actually Reach This bound on the how how you can compress a data set yeah and the optimal since the optimal compressor is not actually possible don't they usually use the
normalized compression distance which comes from the right so I guess yeah one way you could view these these are all lower bounds like we don't know the true compression of the data set because we
don't know if our encoder or decoder are optimal um I guess you can say that they're like Optimal is in terms of if your if your com complexity is how many
parameters your your network is this is it's not the best actually you can get in that parameter C because we're training them there could be a better solution in that parameter space but it's it's what we get if we try to push
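Since bits per byte is just the model's average log loss expressed in different units, the conversion is a one-liner. Below is a small illustrative sketch; the loss value and the helper name `bits_per_byte` are made up for illustration, not taken from any particular model or paper.

```python
import math

# Sketch of the bits-per-byte metric described above: a likelihood
# model's average negative log loss, converted from nats to bits, is
# the code length an optimal compressor built on that model achieves.

def bits_per_byte(nll_nats_per_byte: float) -> float:
    """Convert an average negative log likelihood (nats/byte) to bits/byte."""
    return nll_nats_per_byte / math.log(2)  # 1 nat = 1/ln(2) bits

# Example: a (made-up) model with an average loss of 2.08 nats per byte.
nll = 2.08
bpb = bits_per_byte(nll)
print(f"{bpb:.2f} bits per byte")  # prints "3.00 bits per byte"

# A uniform model over bytes needs log2(256) = 8 bits per byte, so
# anything below 8 reflects structure the model has captured.
compression_ratio = 8.0 / bpb
print(f"~{compression_ratio:.1f}x compression vs. raw bytes")
```

The same arithmetic underlies "bits per dim" numbers reported for image models; only the normalizing unit changes.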
3D generation is also getting better. This is what we could do in 2020, and this is 2022; again, similar techniques. The interesting thing about 3D is that we can assume even more; it's in a sense even more unsupervised than 2D images. With 2D images we at least have data sets of 2D images; in 3D we sometimes don't even have data sets of 3D models, and all we have are 2D pictures of the world. But we can assume certain things about them: if you take a few pictures of an object from different perspectives, we know that light reflects in certain ways, and there is essentially one model that actually gives rise to those views. By writing down small assumptions like this we can generate 3D models even though we don't have 3D data: we make one assumption and then do our unsupervised learning under it.

And finally, there are domains that are less visual but arguably more important. AlphaFold is one of them: it figured out how to do, essentially, protein structure prediction. I think there is some supervision at the end, but a lot of the power comes from the unsupervised representation learning that happens in the protein space before that.

So what came before was: here are some cool applications of what we can do with
unsupervised learning. Now, a bit of motivation for why unsupervised learning is important: even if you have a specific problem you want to solve, it's sometimes better to do large-scale pre-training first. This is from the foundation models paper from Stanford, and a lot of the field these days is moving to this setting, where we do large-scale unsupervised pre-training and then adapt to what we actually care about. The adaptation can take many forms: fine-tuning the network, prompting, even zero-shot or few-shot use. This paradigm is a big shift from five or ten years ago, and it's now what seems to be working, at least in terms of generalizing to real problems.

We'll show some examples of how this pre-training formulation has played out. This is from the GPT paper, I believe, and it shows that these language models can do sentiment detection, I think with either a linear head or some other small adaptation; they know how to do sentiment detection off the bat.
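As a toy sketch of this adapt-with-a-small-head idea, here is a linear probe on frozen features. Everything here is an assumption for illustration: the "frozen encoder" is just a fixed random projection standing in for a real pretrained network, and the data is synthetic.

```python
import numpy as np

# Pre-train-then-adapt, sketched: keep a (pretend) pretrained encoder
# frozen and fit only a small linear head on top of its features.

rng = np.random.default_rng(0)

def frozen_encoder(x, W_enc):
    """Stand-in for a frozen pretrained encoder (never updated below)."""
    return np.tanh(x @ W_enc)

# Toy binary "sentiment" data: the label depends on the first input dim.
n, d_in, d_feat = 200, 16, 32
X = rng.normal(size=(n, d_in))
y = (X[:, 0] > 0).astype(float)

W_enc = rng.normal(size=(d_in, d_feat))  # frozen weights
feats = frozen_encoder(X, W_enc)

# Linear head: logistic regression trained by gradient descent.
w, b = np.zeros(d_feat), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
    w -= 1.0 * (feats.T @ (p - y) / n)
    b -= 1.0 * np.mean(p - y)

acc = np.mean((p > 0.5) == (y == 1))
print(f"linear-probe training accuracy: {acc:.2f}")
```

The point of the sketch is the division of labor: all the capacity sits in the frozen encoder, and adaptation touches only `w` and `b`.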
A lot of things now build on this, where we pre-train with, for example, a BERT loss or, as we'll see later, a GPT loss, and then solve these benchmarks at a much higher rate. We'll also see how we actually measure how good a language model is; in language model space the measurement is actually pretty broad, with all these different kinds of tasks, like code generation, reading comprehension, maybe solving exams. It really makes the point that unsupervised learning produces models that can handle all these varied domains they were never specifically trained for.

And then there's prompting. We won't go too much into it, but it's also an adaptation method, and there are a lot of interesting strategies now for extracting what we want out of the big model. We know these models have ingested all this knowledge, we know that at some level they've represented it correctly, or to a really good degree, and the question is just how to get it out; some of the strategies are messy. This is from OpenAI's guide, and the fact that they have a guide on this shows it's something people are at least thinking a lot about.

And finally, we'll see the same
thing in vision. This is, I think, from 2018, using contrastive learning, another unsupervised method, and it shows that when we transfer from unlabeled data, even on tasks that have traditionally used supervised data, performance is still pretty good. And masked autoencoders, for example, which came out in 2021, also show results on these data sets where traditionally you would train a supervised method; it turns out that if you pre-train this big autoencoder on a lot of data, it works better.

Okay, so this is the last slide, just an overview of everything before, but I think one nice motivating thing about
unsupervised learning is that it seems to be the method that scales best as data comes in. We saw LeCun's cake earlier, where most of the volume is unsupervised learning, and that's because it's just so much easier to collect raw data than human labels; human labels can be wrong, they can be multimodal, they can have all sorts of weird properties. The fact that essentially the same methods, such as autoregressive models or other things we'll learn later in the class, can solve all these different domains should tell us that there's something correct about how we're doing these things.

Thank you, Kevin. I want to quickly highlight one
thing that we went over just a little fast, this slide here. This is from 2020, and I want to highlight the item on the right. There was a bet between two professors here, Alyosha Efros and Jitendra Malik. The formal version of the bet says: if, by the first day of autumn of 2015, a method exists that can match or beat the performance of R-CNN on Pascal VOC detection, so a detection task rather than classification, without the use of any extra human annotation, that is, with unsupervised pre-training, effectively, then Mr. Malik promises to buy Mr. Efros one gelato: two scoops, one chocolate, one vanilla. So what was going on here? Alyosha Efros challenged Jitendra and said, I think unsupervised learning will win, and Jitendra said, well, let's make a bet around it: you tell me when it's going to win. Clearly Alyosha was optimistic; he put it at autumn 2015, and it didn't happen until this paper here, the CPC v2 paper, in autumn 2020. So five years later, but it did happen. I think it's interesting to see some of these things play out: for the longest time one approach keeps winning, while there are other approaches people are already thinking about that, projected into the future, would be the ones winning in the long run; we're just not there yet, we need to figure out some details, we need to figure out scale, and once that's figured out it will actually be better than the current way of doing things. So Jitendra got the gelato from Alyosha in this case, but in the long run Alyosha was, I guess, right that unsupervised learning is a more powerful
pre-training approach.

So in summary, unsupervised learning is rapidly advancing as a field, thanks to compute, certainly, to deep learning engineering practice getting better, to data sets, and, these days, to lots of people working on it. It's not just a topic of academic interest anymore; it used to be, but definitely not anymore. Today language modeling, image generation, and vision-language multimodal pre-training are all working really well and have production-level impact, with BERT for Google Search having been the first example, but obviously many more now. What is true now may not be true
even a year from now, and I'll give you two examples to make this concrete. Example one: we just talked about how self-supervised pre-training was way worse than supervised learning on computer vision tasks like detection and segmentation, until, it turns out, fall 2019, when CPC v2 came out; now it's better. Example two: representation learning for vision through masking didn't work. People said, why not just mask like in language, where you just mask things out; nothing worked, or nothing worked all that well, until Kaiming He and his collaborators made it work in November 2021. The word on the street was that masking doesn't work in vision, but it turns out now it's what works best. So these things are all in flux, and that's one of the things I want you to keep in mind. I would love to see final projects where
you challenge some of the common wisdoms of today, and ideally you have a reason for it. Maybe the reason here could have been: well, if it works for language, why can it not at least work reasonably well for vision? Maybe it won't be the best, but at least it should work reasonably well. What was the key thing Kaiming did that nobody else did? Well, first, he's a very good experimenter, so he iterated over many things; he's the best at that. But what ultimately mattered is that he masked out 80 to 90% of the image, and it turns out that was key. Before that, people were masking out 10 to 20%, and with 10 to 20% masked out the task turns out to be too easy. Remember the earlier slide with Geoff Hinton saying the brain must be doing a lot of work, having on the order of 10^5 synapses per second available to be trained. Well, Kaiming effectively said: let's put the network to work, make it work harder than just filling back in 10 to 20%, make it somehow create the 90% that was left out; and all of a sudden the results are great. The Vision Transformer architecture helps make masking easier to handle in training, and hence it also helped make this possible, but of the contributing factors I think the biggest one, conceptually, is masking way more than anybody had ever done before.
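A minimal sketch of that masking step, not the actual MAE code: split the image into patches, keep a small random subset, and the model would be asked to reconstruct the rest. The patch size of 16 and the 90% mask ratio are assumptions chosen to mirror the lecture's numbers.

```python
import numpy as np

# Illustrative masked-autoencoder-style masking: turn an image into a
# sequence of patches and let only a random ~10% survive as input.

rng = np.random.default_rng(0)

def mask_patches(img, patch=16, mask_ratio=0.9):
    """Return (visible_patches, visible_idx, num_patches)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # Flatten the image into a sequence of (patch*patch*c)-dim vectors.
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    visible_idx = rng.permutation(n)[:n_keep]  # random subset survives
    return patches[visible_idx], visible_idx, n

img = rng.random((224, 224, 3))
visible, idx, n = mask_patches(img)
print(f"{len(idx)} of {n} patches visible "
      f"({100 * (1 - len(idx) / n):.0f}% masked)")
# prints "20 of 196 patches visible (90% masked)"
```

At a 10 to 20% mask ratio most of the sequence would be visible and reconstruction is easy interpolation; at 90% the model has to synthesize most of the image, which is the harder task the lecture credits for making vision masking work.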
Autoregressive models, flows, VAEs, GANs, diffusion models, the ones we'll cover in the first five weeks of class: I think they still have significant room for surprising capabilities, meaning that if you scale them up in a domain where people haven't done it yet, maybe video, which is still a very early domain, maybe robotics, which is also still early going, bio data, other sciences data, even just applying the current ideas can likely give very surprising results. So it's a great time to work on them. But I also think the core of unsupervised learning might still have some major innovations ahead. The example I gave earlier: the last offering of this class was in 2020, which is a while ago now, but back then we had no lecture on diffusion models, because we hadn't written the denoising diffusion probabilistic models paper yet, and at the time people thought of GANs as the best thing for image generation. Since then that's obviously changed, and I think more changes could be ahead.

So I really challenge you to think hard about all the methods we'll be presenting and see if you can somehow find a way to improve them by thinking very hard about how they are set up. Often, and of course I'll try to give more color when we get into specific ones, the devil is in the details in these things. Something is presented, the big picture makes sense, it seems like this is going to be the best, and maybe it even is the best as presented; but then you go to the specifics of how it's actually written out in the equations, and there's already a bit of a gap from the motivation to what the equations actually say, and then maybe the implementation introduces another gap, because what's implemented is not exactly the math. All these gaps along the way are places where you could see an opportunity to make things better, and maybe invent the next generation, or next iteration, of these models. That would be a pretty great outcome from this class.

All right, that's it for today. Thank you, and see you all next week.