Lecture 1: Introduction -- CS294-158 SP24 Deep Unsupervised Learning -- UC Berkeley Spring 2024
By Pieter Abbeel
Summary
Topics Covered
- Brain Demands Unsupervised Learning
- LeCun's Cake: Unsupervised Base
- Unsupervised Equals Intelligent Compression
- Diffusion Outscales GAN Mode Collapse
- Mask 90% to Unlock Vision SSL
Full Transcript
Okay, 2:10, let's get rolling. Welcome to Deep Unsupervised Learning, first lecture. Let's start with the instructor team. I'm Pieter Abbeel, I'm the professor teaching the course, and we have three co-instructors with me here. Maybe the three of you can briefly stand up, maybe even here in front of the camera, and do a quick self-intro. Philipp, why don't you go first, then Kevin, then Wilson.

"Hi everyone, my name is Philipp. I'm a PhD student with Professor Pieter Abbeel, and I work on a variety of different topics, but largely on real-world robot learning, crossing reinforcement learning, imitation learning, and supervised learning. So yeah, hope we'll have a great semester." Thanks, Phil.

"Hey everyone, I'm Kevin, also a PhD student with Pieter and Sergey, and I work on reinforcement learning algorithms and generative modeling, which we'll talk about in the class." Great, thanks, Kevin.

"Hey everyone, I'm also one of Pieter's PhD students, and I do research in generative models, more specifically stuff like video and language generation." You actually took the class as an undergrad here? "Yeah, I took the class, and I was a TA the semester after." But it's been a while. The last time we taught the class was 2020, so four years ago.

It's pretty interesting, because I was looking at the slides from four years ago, and I saw I put a lot of motivation in there; especially five years ago, a lot of motivation for why people should study deep unsupervised learning, because I thought people weren't interested enough in deep unsupervised learning and I should really motivate them. And now I feel like I need to do the opposite. Or, maybe slightly more precisely, I want to make sure in this lecture to make clear what we are actually planning to do in this class, because it has become so popular that in principle you could do a million different things in a deep unsupervised learning class. So I want to make sure you know what we are going to do here, rather than maybe what you imagine we could be doing.

A couple of logistics. Communication: there's a website
and that's the URL; it went up, I think, yesterday. We should be putting all information up there that's needed for the class. We'll typically put the slides there before lecture. Not today; it's the first lecture, and it's always a bit of a transition period when the semester starts. But starting next week we plan to put up the slides before lecture, so you can take notes on them if you want, or look at them a little bit ahead of time. It has the schedule, everything. If you see anything that seems off, let us know; it could be that there's still a little typo here or there, so just let us know.

Some announcements. We use Ed for communication. I've actually never used it; it's the first time I'm going to be using it, but it seems the standard thing to use for Berkeley classes currently, so I'm excited to be using it. Please start there if you have questions, because then others can see your questions and help answer them, instead of us getting separate questions from everybody, where the questions might have a lot of overlap. By the way, if you're registered or on the wait list, we already put you into the forum for this class; if you are neither, then you have to put yourself in. So Ed is preferred if you have questions for us. Just to be clear, not during lecture; during lecture, just raise your hand and ask questions that way. But outside of lecture, start there rather than email. If it's something you think is more suitable for email, feel free to email the staff list, or if you think it's specific to any one of us, email just us individually. But try to use Ed if
possible. Office hours will start next week. For me, it'll be after lecture. So, lecture is 2:00 to 5:00 on Thursdays; it's a long slot, three hours, and most lectures are not that long. We'll take a pretty big break in the middle, a 15-20 minute break. We'll even have snacks, to have some reinforcement for the second half of lecture. And that's actually quite deliberate: I think one of the big things you can get out of the class is getting to know other students in the class who are working on similar topics. You can exchange ideas and come up with new ideas together, and I think having the long lecture slot with a break in the middle, where you get to hang out and talk to each other, can be very good for that purpose. I'll do my office hours right after class, so that'll be a very long stretch if you stick around for it: it'll be 5:00 to 6:00. Wilson, Kevin, and Philipp will announce their office hours sometime next week. We're still figuring out how to do it: we might have it so one TA is in charge of one homework and then has all the office hours leading up to that homework's due time, or maybe we'll have regular weekly office hours. We're still sorting that out.

For homework, by the way, TA office hours are the best venue. Even though I'd love to be able to help you with homework (every winter break I dream of a semester where I actually have the time to work through the homework myself and then be able to help you), the reality is I just can't; I don't know the details well enough. But these three will, so for homework-specific questions, go to them. For anything else, all of us should be able to help you. And I think for a lot of us, one of the favorite things is to talk with you about your projects: what you are doing research on, how you think it might connect with things in the course, and how things in the course can help you be more effective in whatever you are doing research-wise. Or maybe your research is squarely in unsupervised learning and is directly feeding off the course.

Admission to the course: it's a
bit of a challenge that there is more demand than there are spots, and I think that's just going to be the reality: some people are not going to get in. What we'll do to nudge things towards the best possible situation is this: for people who don't have a strong homework one submitted, or no homework one at all, we'll ask the registrar's office to drop them from the class. Now obviously, if something really crazy comes up and you can't do homework one because of some external circumstances, reach out to us and let us know, and we can think about it. But to us it's a sign that you're either not suitable for the class or you just don't care enough about the class to submit a strong homework; we'll drop you, and we'll let people in from the wait list. Also, if just after today, you're registered for the class and you realize, "oh my god, I thought it was going to be what all these VCs talk about with generative AI, we're going to be the next big startup out of this class": it's not exactly targeted that way. So then maybe also drop the class and make room for others. There could be many reasons you think the class is not for you after today's lecture. Don't wait till the drop deadline, because that's inconvenient for everybody else; just drop when you know you're going to drop. Even if you don't make it into the class, you're welcome to audit, and you're welcome to submit homework; that's all good, you just won't get the credit for having taken the class. And as we move people into the class, it could be undergrads who are good fits for the class, or PhD students; I mean, it's really more about what you know and what you're capable of than exactly what degree you're pursuing. Typically PhD students are the more natural fits, but not always. So we'll see what happens. Any questions about registration?

Yes? "Are there any quantifiable cutoffs for what a strong homework one means? And also, how
does this relate to whether you're fully registered versus wait-listed versus not?" So, we'll grade everybody's submission: whether you're wait-listed or registered, if you submit your homework, it will be graded. Then, what is a strong homework one? The way we design them, essentially they should be solvable for somebody who's taking the class. It's not that a strong homework means solving half the homework; a strong homework means that you essentially solve everything. That doesn't mean there can't be a small mistake here or there, but it can't be that there's a whole part of the homework that you didn't do. We're also here to help you; I mean, we want you to learn, and it's not that you need to be able to do this without any help. Come to office hours, ask questions. But if you use all the resources available and all the time you're willing to commit to it, and still end up having big parts of your homework, say one substantial section, that are just blank or completely off, then likely we would drop you if we see that there are many people on the wait list with much stronger submissions. Think of it maybe in numbers: let's say you've got to score 90% or something on homework one. But don't worry about whether you have 98 or 99; we're not going to rank people and then go by the exact ranking. Roughly, with 90% as a threshold, you're good.

Yes? "What should we expect as the deadline for homework one?" Coming up
in a moment; great question. Yeah, I have a slide on all the deadlines.

Okay, let me tell you a bit about the syllabus. Today is the intro; we're not going to cover actual materials today. We're going to cover motivation for the class, why this class might be worth taking, and we will cover some logistics. But then next week we'll start diving in with autoregressive models, which is the first type of model we'll cover. The week after, we'll do flow models, then we'll do latent variable models, then we'll do GANs / implicit models, and then we'll do diffusion models and final project discussion. So those will be, in some sense, the five main types of models that we'll cover.

From there, we'll look at other types of learning. So, five main generative models; then we'll look at self-supervised learning, non-generative representation learning. We'll look at strengths and weaknesses of everything covered so far, because part of what's hard about the field, I would say, is that it's a bit disjoint. It's not so easy to directly compare a VAE with an autoregressive model; there are pros and cons to each. The field hasn't fully converged, and maybe we'll never converge on just one thing. Clearly, language is more converged onto autoregressive models and vision is more converged onto diffusion models right now, but this can change over time. In fact, the last time we taught the course, diffusion models didn't exist; they still had to be invented later that semester.

Semi-supervised learning, unsupervised distribution alignment: these are somehow not as hot topics these days, but I think they're very important. Part of the beauty of this class, by the way (same for flow models), is that we have the time to cover some things that are less hot today but, with some innovation, can become the next breakthrough if you find the right innovation.

Compression: compression is not typically a machine learning topic, but it turns out everything in unsupervised learning is effectively tied back to compression, because the core hypothesis in unsupervised learning is that you're modeling data. And what does it mean to model data well? What does it mean to do compression? It's the same thing: finding the patterns in the data and, based on that, representing it in a more compact way. In machine learning we don't just care about compactness; we want it to be semantically meaningful. So there's a little bit of an extra twist to the notion of compression there, but it's very, very related.

Spring break week: no lecture,
naturally. Then we'll do language models, a dedicated lecture. Obviously some things will come back there from earlier (language models are large autoregressive models), so some things will come back, but we'll dive much deeper into the specifics of language models. Then we'll have a midterm; I think I have a separate slide on that, so I'll wait to say more about it. We'll do multimodal models and video generation; that lecture, I'll definitely remind you, will be in a different location. We'll be in the auditorium upstairs, a slightly bigger room. Then we'll have a lecture on AI for science, and we'll have representation learning for reinforcement learning. Note that these two lectures are in the same week, so there's going to be a Thursday and a Friday lecture. A pretty wild week, because you have three hours on Thursday and another three hours on Friday, but we won't have a lecture the next week. So in the last week of the normal semester there won't be a lecture; I moved it forward onto the Friday before that. Week 15, no lecture: RRR week. Then week 16, your final project reports will be due, along with a video presentation submission. So instead of doing an in-class sequence of presentations (which I found, at the end of the semester, students are so busy that it's hard to find a time to attend the other students' presentations, so it's often students just jumping in, doing their presentation, and jumping back out to go study for their final or another project), we're just going to take submissions. We'll watch the videos, and we're going to share the videos internally with the class, unless there's a specific reason you don't want your video shared, so anybody who's interested in a topic can go watch the video presentation on that topic that was done in the class. Any questions on the schedule?

Yeah? "Will lecture 13 be in the same classroom, at the same time as usual?" I don't know yet. I've put in a request for a room, but I haven't heard back yet. Very good question, and a good reminder; it is on my to-do list to sort it out. I asked them two days ago, and it's probably a busy beginning of the semester for them to respond. One other thing about lectures: they're all recorded. There's a Camtasia recording running here, which sees whoever is standing in front of the laptop and records the screen. I can't guarantee it works out every time; sometimes these things crash and then there's no proper recording, but hopefully it usually works out, and then we'll put these online. "Do we have a lecture on the same day as the midterm?" Yes; the midterm will be very short, and I'll say more about that soon.

Homework: we'll have homework one go out next week Thursday, due 13 days later,
and then 14 days after that, homework two will go out. So essentially, every two weeks a homework will go out. If you look at a previous offering, what you'll see has changed is that there's now no homework on flow models. If you care about them, and you want to learn more about them and do a homework on them, you can go to the 2020 website and do the homework from there, for your own good, not for the class. I think it's still interesting, but we only want to give you four homeworks, and we think right now it's better to give you a homework on diffusion models rather than flow models, just the way things are shaping up. But again, these things can change. You could come back here a year from now and say, "what a big mistake, you should really have done flow models, they're way more important." Hard to predict, but this is what we're going to do. So that puts us roughly halfway through the semester, and then the idea is that from there, a lot of your time outside of class is spent on your final project.

For homework, you can discuss things, but you need to write your own code and make your own submission. You can be late, with a late policy, but not too much: at most four days. Why do we have this policy of at most four days? Logistically, essentially: one, we want to be able to get started on grading. Two, we think you can learn a lot from looking at solutions, but you learn more if not a lot of time has passed before you get to see the solutions. So after four days we will release solutions; you can look at those, learn from them, and hopefully that's a good learning experience. But at the same time, that means you can't submit after those four days. Again, if there are really exceptional circumstances, as always, don't be afraid to reach out; we can think about it. But that's the standard model we're going to use.

So, the midterm. We have a midterm during lecture, at the beginning of lecture, on April
11th. I think in the end, what's most important is what you learn from the class, what you take away, and what hopefully you put into your final project along the way. But the reason I like to do a midterm is that it gives you an opportunity to be forced to really study the materials. You're all so busy, and if nobody forces you to study the materials, probably you just have something else fighting for your time and winning out. It's not meant to be excruciating, or to weed you out in any way. It's going to be on the topics covered through the week before. We will provide a document with questions and answers: essentially, we will go through the lecture slides, pick the most important derivations according to us, write a question (let's say, derive the variational lower bound for a VAE), and then put the answer below it. You'll have anywhere from 10 to 20 pages of that kind of material, and when the midterm comes, we'll ask you two or three of those. Obviously, if you fail to study those two or three, you might do pretty poorly; but hopefully, since it's only 10 to 20 pages you have to study, it should be easy to just study everything. And it's on us to make sure that these are the 10 to 20 most important things from the class.
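For concreteness, here is a sketch of the kind of derivation meant by "derive the variational lower bound for a VAE", using the standard encoder/decoder notation ($q_\phi$, $p_\theta$, prior $p(z)$), which is assumed here rather than taken from these slides. Applying Jensen's inequality to the log-likelihood gives the ELBO:

```latex
\log p_\theta(x)
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right]
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  \;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\middle\|\, p(z) \right)
```

The right-hand side is the reconstruction term minus the KL regularizer, which is what VAE training maximizes.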
You could in principle memorize it. I don't think that's the best way to study; ideally you understand it, and that way you're able to rederive it. But nobody is stopping you from memorizing it either, and often memorizing can help you understand; it's kind of an intertwined process. In fact, language models just kind of memorize and then understand, so I guess it's just what you're supposed to do. Okay, any questions about the midterm?

All right, the final project. Scope: the idea is that you
explore and push the boundaries in unsupervised learning. There will be a proposal. It could be a proposal plus evaluation of new algorithms or architectures, an investigation of an application of unsupervised learning, benchmarking unsupervised learning, something related to compression, studying synergies between unsupervised learning and other types of learning, and so forth. What's acceptable is pretty broad, but it really has to involve the topics in the class. Ideally, it could be the foundation for a future conference paper. There's no expectation that by the time you submit the final project report it's already at the level of a conference paper, but hopefully there are initial experiments that show the promise that this could really become a conference paper with an extra push behind it.

We actually encourage you to come up with your own idea if you can. It'll likely be more original, because you have your own background, your own knowledge that's different from ours. That said, we're also happy to brainstorm ideas together, and often going back and forth can lead to new things that neither one of us would have come up with individually.

The main reason I started teaching this class five years ago was for these projects: I wanted more unsupervised learning projects to happen at Berkeley, and some really good ones have happened. One that sticks with me: Roshan Rao, a student in the first offering of the class, did a project on sequence modeling for biology, wrote one of the first papers on essentially protein property prediction based on pre-training on just sequence modeling, and released the TAPE benchmark as a NeurIPS paper for that. This is all before AlphaFold. From there, he actually went to the Facebook / Meta AI bio group, and since then that team has left Meta and started a company that's essentially trying to build the foundation models for bio for the future. That would be a great trajectory; if any one of you goes on that trajectory, I'd love it. I mean, there are many other great trajectories, but really try to do something exciting. If something doesn't work, but you have a lot of evidence that you tried things that are meaningful and they didn't work, that's more interesting, to me at least, than if you did something pretty boring and, yeah, it worked, but it was kind of boring and too close to what other people have done. So try to pick something exciting, and surprising if it works out.
Timeline: project proposals are due February 28th, so you have a bit of time. You'll put your proposal in a Google Doc so we can easily give you feedback; about a week later, we will have iterated with you in that Google Doc and hopefully arrived at a project proposal that we're all happy with. Then a three-page milestone is due in April; we'll also do a Google Doc for ease of feedback. The idea here is that it forces you to get started. It's very easy to wait till the end of the semester if there's nothing forcing you to get started earlier, so you'll have to start about a month and a half before the end of the semester, or even a bit earlier, to get something in there. It doesn't need to be anywhere near final, but you should have some results, some initial investigations, that you can report on. May 10th, everything will be due; I believe, I'd have to check, but I think that is the Friday of finals week.
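As a quick illustration of how the grade weights below combine: only the 60/10/30 split and the rough 90% A-cutoff are from the lecture; the helper function and the example scores are made up.

```python
def course_score(homework: float, midterm: float, project: float) -> float:
    """Weighted course score: 60% homework, 10% midterm, 30% final project.

    All inputs are percentages in [0, 100].
    """
    return 0.6 * homework + 0.1 * midterm + 0.3 * project

# Hypothetical student: 95% homework average, 80% midterm, 90% project.
print(course_score(95, 80, 90))  # 92.0, an A under the rough 90% cutoff
```

Note how heavily the homework component dominates: a weak midterm only moves the total by a few points.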
Grading logistics: 60% homework, 10% midterm, 30% final project. And then for the letter grades, there's something on the website that says what the letter grade is, based on the grades that you get; I believe it's probably 90% and above is an A, and then from there, every 5% loses a partial letter grade. Do you need to attend class? No hard requirement. Obviously there are not enough seats, so it could even be uncomfortable to sit on the floor for three hours. But I would say it's highly recommended, and that goes back to the opportunity to learn from each other. As I said at the beginning of lecture, we'll have a pretty large break right in the middle: a good opportunity to learn from each other, and often new projects arise from that. Actually, Roshan's project that I mentioned: he got to know some of the students he did the project with in the class. There were bio students taking the class, he was a CS student, and then they started working on it together.

This is only the third offering of the course, so there will be some rough edges along the way. Bear with us, and give feedback: if there's something you think might be better done a different way, let us know. That doesn't mean we can always get around to doing it, but we'll take notes, and what we can do, we will do. Okay, so that's logistics. Any questions about logistics before we start with more content? In the
back? Yeah, so we're trying to record: Camtasia is running right now, which is screen-capturing software, and it can capture the screen at higher resolution than, let's say, Zoom would. We'll try every time; maybe sometimes it crashes and doesn't work, but hopefully every time we'll have a recording. And then we'll hopefully post it later that evening or so.

A couple more questions. Yes? "Are the final projects completely independent, or...?" Oh, good question: up to three people. I would recommend forming teams of two or three, because often, when there is more than one of you, you have more of a back and forth on ideas and you can arrive at more interesting things. We do expect a little bit of linear scaling, so if there are two or three of you, you're supposed to have a more substantial achievement in your project than when there's only one of you.

Yes? [a question about where to find the course materials] Yeah, so we will link the website from the Ed forum later today, so then you can find the website where all the materials will be posted. For submitting homework, I guess we'll sort that out when we release the first homework; by the time it comes out next week, we'll hopefully have set up the way to receive your homework. Thank you.

Yes? "For undergrads, should we fill out a form, or just turn in the first homework?" Just turn in the first homework, and we'll take things from there.
All right, so let's switch from logistics to content. What is deep unsupervised learning? It's about capturing rich patterns in raw data with deep networks, in a label-free way. The label-free part really matters; it's the original motivation for unsupervised learning, because labeling is time-consuming, and hence it's nice if you don't have to do it. But there are actually additional motivations beyond it being label-free, which we'll get to soon.

There are roughly two sub-areas in unsupervised learning. There are generative models, which try to recreate the raw data distribution, and there is self-supervised learning, which tends to be puzzle tasks: you have your data, you remove some part of the data, or some view of the data, in some way or other, and then the model is supposed to fill it back in. If it can fill that back in, then presumably it understands something about the data.

Generative models recreate the raw data distribution. There's a lot of talk about generative AI, and I was thinking about how I would really define what we're trying to cover in this course for the generative AI part. The way I think of it is: it's still neural nets, right? It's still a neural net; you have an input, you generate an output. Even a generative model is a neural network, and it looks a lot like supervised learning in many cases. So what's really different about it? The way I think of it as being different is that in a lot of the typical work in supervised learning, there's a very clear, deterministic solution: you have an image and there's a clear label that needs to be assigned to it, cat or dog; or you have a self-driving-car scene and every pixel needs to be street, pavement, pedestrian, and so forth. So the output is, in some sense, deterministic: it's clear what you're supposed to do. Whereas in generative models, we're trying to model pretty complex output distributions with the neural network, and that's actually hard to do, modeling complicated distributions. That's why there are actually five different lectures, each a three-hour lecture, covering five different ways of effectively representing complex distributions with neural networks. So that's the way I would think about it: generative models, at least in this class, are models that can represent complicated probability distributions, well beyond the kind where everything is concentrated on one output.

Self-supervised learning is the puzzle tasks. There's been an interesting back and forth: initially, in unsupervised
learning, generative models were what people were excited about; then it became all self-supervised learning for a while; and now it's swinging back to more generative models. Who knows where it'll be next year; I think it's good to know both of them.

So why do we care? I think this might be the most interesting motivation of them all. I hope you all know Geoff Hinton; if you don't, please know him now. Geoff Hinton is essentially the godfather of deep learning. He's been working on deep learning since the 1970s and had to wait for it to have a breakthrough till 2012. So it's a 40-year career of working on the thing that he thought was going to be right, and after 40 years he proved himself right, because he was the one, with his students, who had the big breakthrough on image recognition with AlexNet. Now, here is his motivation for unsupervised learning. The brain has about 10^14 synapses. Synapses in the brain are connections between neurons, and they're essentially the learnable part of the brain: whether neurons are connected or not is the variation in what our brain is. As a little baby, nothing is really wired up to know anything, and then when we're older, we start knowing things. 10^14 synapses, and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second. Because if we assume that the brain is meant to be fully utilized, then we do need to get 10^5, essentially, bits in some sense per second to fully utilize our brain. Now, there are other candidate explanations for why the brain is so large: it could be that a larger brain can learn faster than a smaller brain, even if not that many bits have to be learned; that's something Geoff has also talked about. I think both probably have some truth to them, but this is a very clear sign that supervised learning is not enough for how humans essentially use their brains; there's a lot more going on beyond supervised
learning. Yann LeCun has said a similar thing: you need a tremendous amount of information to build machines that have common sense and generalize. And I think this gets into another trend that we've seen recently in AI, which is foundation models. Foundation models are trained on such a large amount of data that they generalize much better than any models before. It used to be that you'd have a model for, let's say, recognizing the sentiment of a paragraph (is this a positive or a negative review of a product?), and you had a dedicated language model to do that; and it was better back then to train a dedicated language model for sentiment analysis than to train one model to do everything. But that's changed: foundation models, trained on all the data, outperform the specialized models now. And I would argue, in some sense, that that's the closest we're getting to common sense with current machines: by training on such a wide swath of data that somehow everything starts to fall within distribution, at least a little bit.

So, Yann LeCun presented this cake, now known as LeCun's cake, at his NeurIPS keynote in 2016. The point he was making was that back in 2016, all the excitement at NeurIPS was about reinforcement learning, and he was trying to make the point that people should maybe be a little less excited about reinforcement learning (not unexcited, just maybe a little less) and pay more attention to unsupervised learning. His point was that essentially you need a lot of data, and the only way to get a lot of data is unsupervised learning. So that's the foundation of the cake; it's most of the volume and mass of the cake. The icing is the supervised learning, and then the cherry is the reinforcement learning. It's pretty interesting: if you look at ChatGPT, it's exactly this model. ChatGPT is trained on the entire internet to get the foundational knowledge; supervised fine-tuning is the icing on the cake; then RLHF, reinforcement learning from human feedback, is the cherry. LeCun's cake essentially predicted, six years ahead of time, how AI would shape up into its current, at least, most capable form.

Another way you might get motivated for unsupervised learning is this notion of ideal intelligence. So, ideal intelligence, people argue, is all about
compression: finding all the patterns in your data. That's what it means to be smart; you understand all the patterns that are there. So finding all the patterns means finding a short description of the raw data, meaning low Kolmogorov complexity. So what is Kolmogorov complexity? The Kolmogorov complexity of a certain data set is, in some sense, the size of the shortest computer program that can produce that data set as its output. Now, the simplest way to do it (though it's rarely the winning size) is to just store the entire data set and have a program that says "print", and then it prints out the entire data set. That's a very simple program, but it's rarely going to be the one that wins. If you had a completely random string with no pattern in it, though, that might be the shortest program you could possibly produce. But for many real-world data sets, there are other, more efficient ways to describe the data, and Kolmogorov complexity measures that.
the way the way to think of this in neural net land is that you should think of the size of the neural net as essentially the size of your program the
neural network is the program because instead of writing code you're training a neural network so the neural network is the program and so here some's asking what is the most compact neural net
Network that can regenerate my data and that should be the one that understands the data the best and hence U best at let's say generalizing hopefully there are some subtleties there by the way
smallest NE net doesn't need to mean smallest number of parameters it could mean something else it could mean that you have low Precision parameters but large number of parameters and maybe that's a better way to represent things
than having higher Precision parameters but less parameters um so keep that in mind shortest code length um in principle
should also allow for optimal inference, and Solomonoff induction is a bit more focused on that side of things. This is extensible to optimal action-making agents, which is called AIXI. AIXI essentially looks at this notion: if you want to build an agent that solves problems, what is the smallest agent I can build that can solve these problems? That's the counterpart of recreating your data in unsupervised learning, but now in reinforcement
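The shortest-program quantity above is uncomputable exactly, but an off-the-shelf compressor gives a crude, computable upper bound on description length. A toy sketch (zlib is just a stand-in here, not Kolmogorov complexity itself):

```python
import os
import zlib

def description_length(data: bytes) -> int:
    """Bytes needed for a crude 'program' that prints `data`: its
    zlib-compressed form. This upper-bounds Kolmogorov complexity."""
    return len(zlib.compress(data, level=9))

patterned = b"abc" * 10_000      # highly regular: a short description exists
random_ish = os.urandom(30_000)  # patternless: 'print the literal bytes' wins

print(description_length(patterned))   # tiny compared to the 30,000 raw bytes
print(description_length(random_ish))  # roughly the raw size, or slightly more
```

The patterned string compresses to a few dozen bytes while the random one barely compresses at all, which mirrors the print-the-whole-data-set argument above.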
learning. Here is not quite a motivation, but very related. In fact Ilya Sutskever, chief scientist and co-founder of OpenAI, talked about this just about half a year ago at the Simons Institute here. Assume we pre-train in an unsupervised way on a data distribution D1 and then fine-tune on a data distribution D2. Then, if D1 and D2 are related, compressing D2 conditioned on already knowing D1 should be more efficient than compressing D2 outright: the additional effort should be less, given the effort you've already spent to compress D1, if there's any kind of relationship between them.
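That conditional-compression claim can be illustrated with an ordinary compressor standing in for a learned model; the overlapping byte strings below are purely a toy construction:

```python
import os
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for description length."""
    return len(zlib.compress(data, level=9))

# Toy 'distributions': D2 shares most of its content with D1.
d1 = os.urandom(20_000)
d2 = d1[:10_000] + os.urandom(100)

cost_outright = c(d2)               # compressing D2 from scratch
cost_given_d1 = c(d1 + d2) - c(d1)  # extra cost of D2 once D1 is known
print(cost_outright, cost_given_d1)
```

Because D2 overlaps D1, its conditional cost is far below its standalone cost, exactly the "additional effort should be less" intuition.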
So pre-training on D1 should allow you to learn faster on D2. This is, let's say, not a super mathematically precise argument, but an intuitive one for why unsupervised learning should help when all you care about later is supervised learning. Even if all you care about is labeling every pixel in your self-driving car's scene, unsupervised learning might still be the right thing to start with. By the way, I remember these kinds of motivations from the early days of OpenAI, when it was just ten of us: Ilya, John, Andrej, a couple of my students, myself. Back then supervised learning was always winning; on anything you cared about, supervised learning was always winning, and we actually had to think hard about these aspects to decide that we should work on unsupervised learning. Part of the motivation was that there's more data, so clearly that's going to help. But the other part was that people would say, well, it doesn't matter, because that data is not what you want; you just need to label enough supervised data and you'll get things done. Our argument for why we should still work on unsupervised learning was things like this. Today that argument is mostly empirical: you pre-train and then fine-tune, and you get better results. That wasn't the case back then when you started with unsupervised pre-training. Aside from theoretical interest, deep unsupervised learning has many powerful applications. You can generate novel data, you can do conditional synthesis, you can do compression with it. By the way, compression hasn't really been commercialized yet; I'm still wondering when that might happen. Effectively, how much you can compress a data set is based on how good a model you have for the probability distribution that generated your data set, and the bound is the entropy of that distribution; that's how much you can compress your data. So if you have a better model, you should be able to compress better.
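For a known discrete distribution, that bound is just Shannon entropy; a minimal sketch:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(p) in bits: the lower bound on average code length
    per symbol for any lossless code under distribution p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A skewed source compresses better than a uniform one over the same alphabet:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.90, 0.05, 0.03, 0.02]
print(entropy_bits(uniform))  # 2.0 bits/symbol
print(entropy_bits(skewed))   # ~0.62 bits/symbol
```

A better model is one whose predicted distribution assigns higher probability to the data you actually see, which pushes the achievable code length down toward this entropy bound.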
So far, it turns out that these neural nets are kind of large, and once you account for the actual Kolmogorov complexity, including the size of the neural net and the amount of inference compute required to decode, it hasn't been winning yet. You could imagine a future where, if compute chips become even larger and cheaper, you have a massive neural network decoding all the movies that you watch, and very few bits have to come over the pipe, because it can decode from a very highly compressed representation compared to what's possible today. You can also improve almost any downstream task by doing unsupervised or self-supervised pre-training. Production-level impact: Google search powered by BERT was the first one; that's why I'm highlighting it here. I think that was the first real production-level impact of unsupervised learning. And it gives flexible building blocks for other things: some of the architectures that were invented for unsupervised learning are now reused in reinforcement learning and in supervised learning. So there are a lot of reasons you could be interested in what we're going to cover here. So let me dive
into it. Actually, let's see how far along we are. Yeah, let's not take a break right now, let's keep going just a little bit. This is part of where generative modeling started, at least for image generation. It was called deep belief nets back then; a different name, but the same kind of idea: you train a neural network, a specific type of neural network in this case that has since fallen out of favor. These were some of the early results showing that you can train these neural nets to generate images that look like the images in your training set. At the time this was pretty surprising to people. It was still hard to correctly classify cats and dogs at the time, because that didn't really happen until ImageNet and AlexNet in 2012, so the fact that, while people were still struggling to build an image classifier for cats and dogs, other people were generating digits, it was pretty surprising that that was
possible. The variational autoencoder made one of the big steps forward after neural networks broke through in 2012, and we'll look a lot at the VAE a couple of lectures from now. GANs, by Ian Goodfellow and collaborators in 2014, were the first sign that these neural nets were likely to generate realistic images in the foreseeable future. Until GANs, if you look back at the VAE, everything looked kind of smoothed out, and that's just how these models tended to be: they try to predict things that are on average kind of right, but being right on average is not exactly right. It's like when the answer to a question is either yes or no: saying something in the middle is never going to be correct. Same thing here, it was just smoothing things out too much. GANs started resolving that. This is still low resolution; that was just a compute
limitation at the time. Exciting story here, by the way. Ian was having drinks with friends in a bar in Montreal, and, well, what do you do when you're an AI student? You have drinks, you talk about AI. And they were wondering: how is it possible that these neural nets can't create proper images? Even a neural net can see that these images aren't sharp. And Ian thought: wait, if even a neural net can see that these previous images are not that sharp, I should just use a neural net to give feedback to the generator neural net, set up a feedback loop, and see what happens. He coded it up the same night, started seeing a real sign of life, and a little later wrote the paper. Interestingly, well, here's another interesting backstory about GANs: Alec Radford, first author of this paper,
this was the first time, with the bedroom data set here, that GANs were showing actual high-quality images. People said this is not just a sign that this will be possible soon; this is now becoming possible. Alec, I believe, dropped out of college to play with AI and had some good success. He wrote this paper, and then OpenAI recruited him. I remember that we recruited him; I was like, okay, he built the best generative model to date, back in 2015. It was interesting at the time that these things were possible: you just find the paper on arXiv and based on that you decide to recruit somebody. Alec, by the way, was obviously a great recruit, because Alec Radford is the one who started the whole GPT sequence of works at OpenAI. GPT-1, well, in some sense it started as an LSTM, but then GPT-2 and -3, they all started with Alec. Another interesting story there: he actually let a model train while he was away for a while, and it kept improving, and he realized, wait, we've never been training long enough, clearly we should be training much longer. That also helped inspire the much longer training sessions. So sometimes, they say, going away and not killing your process is the way to go
too. These are some images of faces. We as humans tend to be very sensitive to artifacts in faces, so obviously here we're seeing signs of life, but we don't yet see this as realistic. Super-resolution GANs came soon thereafter. Alexei Efros and his students here did things with GANs that were very entertaining: you could essentially recolor subjects in images, in this case the horse becomes a zebra. And then DeepMind did the first very, very large GAN training, and this was the first time, this was 2018, that GANs were showing very realistic images. Now, some of the in-between ones are not super realistic, because it's interpolating between different ones, but you can see that when it's not interpolating, it is very realistic. So, pretty amazing. At the time people essentially thought this is it, you know, GANs are the way to generate images in the best possible way, and a lot of iterations were done. Nvidia especially did a lot of work scaling up even more, getting even more impressive results over time. StyleGAN came out of Nvidia; these faces become kind of indistinguishable from real human faces, and that was also 2018. Then something happened that was kind of surprising in some sense: even though GANs were all the rage and
everybody thought this was how we were going to solve image generation, diffusion models came onto the scene. You might wonder why even bother coming up with diffusion models at the time. The challenge people ran into with GANs was that even though everything was very realistic, it wasn't covering the entire distribution that the data had; it was focusing on specific modes of the distribution rather than having great coverage of the entire distribution. And so the open question at the time was: can you come up with a new approach, or improve GANs, that's fine too, but can you come up with something that both generates realistic images and has good coverage, that doesn't effectively leave out a lot of the data? Diffusion models showed a clear sign of life in this paper in 2020, and they're
also the models powering almost all image generation today. So I'll just put up some fun examples, at least some of the ones that I find the most fun, all done with diffusion models. These were not in the original diffusion paper; the ones I show here are from DALL-E, from OpenAI. "A masterful oil painting of a Persian exotic cat discovering their astounding crypto losses while checking their phone." Why is this an interesting prompt? Well, partially because it's entertaining, obviously, but it's also interesting because it's targeting something that would not be in the training data. And if the AI, in this case this diffusion model, can generate a good response to this, it means it understands something beyond replicating what's in the data: it should understand something about how to combine cats with phones and astounding crypto losses, which probably means being not particularly happy. And here's what it comes up with. Then here's another one of my
favorites: "A Victorian man struggles with his addiction to TikTok." Same gist: there's some humor to it, and there's also this notion that it's not going to be in the data. Victorian men did not have access to TikTok; that was the 1800s in England, and nobody had anything like TikTok at the time. And here's what it comes up with. It actually gets it really well: the guy is holding an alcohol flask, suggesting some kind of addiction generally speaking, red cheeks, looking pretty sad, looking at his phone, and it's all set
in the Victorian era. Here's another one of my favorites: "Darth Vader realizing he's forgotten to add an attachment to the email." Darth Vader never forgot to add attachments, but we can imagine what it would look like, and again the model is capable of doing this: it looks like a business traveler sitting on the hotel room bed firing off some emails before going to dinner, and then making a mistake. These were all from OpenAI's DALL-E model. Here's, oh, the prompt should have come first, but this one is from Google's Imagen: "A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat." And it creates a photorealistic image in this case. A lot of models have come since. So OpenAI's DALL-E and Google's Imagen came
out first. Soon thereafter Stable Diffusion came out; Stability AI has been hosting that, and Midjourney has been building off of it on their Discord servers. In fact, one of the key people on the Imagen team is Jonathan Ho, who was a PhD student here, wrote the denoising diffusion probabilistic models paper, then went to Google and did this work as well as the Imagen Video work. And then he left Google to start a company called Ideogram, because Google was not putting its models out there; it just became frustrating for him that everybody was saying DALL-E is so much fun, Midjourney is so much fun, while he had built this amazing Imagen model and nobody was having fun with it, it was just sitting behind closed doors at Google. So he started Ideogram, which is also one of the top providers of text-to-image right
now. So in the remainder of today's lecture we'll look at a few more pretty exciting examples of progress in unsupervised learning over the last few years. Kevin will be presenting those, and then at the very end I'll wrap up the class with the last slide. One quick thing that came up during the break: it turns out the forum we use doesn't allow self sign-up; we have to add you, it seems. So if you want to get access, you need to start here. If you're already on the class roster or already on the waitlist, you should already be on the Ed forum, but if you're not, email us here and we can add you to the forum, and from there you can find everything else, because we'll post the class website and everything else there. But yeah, if you're stuck getting started with the whole process, email us here and we'll help you get going. Okay, cool. Yeah, so for the remainder of
the lecture, we're just going to go over some fun things that unsupervised learning can do today. A lot of what we looked at in the past here, in the last few slides, was image-based, and here's one place where we go into a different domain. So WaveNet here asked the question: okay, we can generate images, what can we do in other domains? And it turns out we can actually generate audio signals as well, and the method they used is actually pretty similar to what we were doing back in the pixel domain. (These over here are diffusion models, so they're from a bit later, but WaveNet builds off PixelCNN, which we'll cover in the autoregressive part.) It basically just predicts the audio step by step: if you take the raw audio samples and predict the next one conditioned on everything in the past, you get a generative model of the entire audio waveform. And WaveNet showed that if you do this in the right, scalable way, it works.
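That step-by-step factorization, p(x) = Π_t p(x_t | x_<t), can be sketched with a stand-in conditional model. This toy replaces WaveNet's dilated-convolution network with a made-up rule; only the autoregressive sampling loop is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_next(history):
    """Stand-in for the learned conditional p(x_t | x_<t) over 256 quantized
    audio levels; here, a made-up rule preferring to stay near the last level."""
    last = history[-1] if history else 128
    logits = -0.05 * (np.arange(256) - last) ** 2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Autoregressive sampling: draw each 8-bit sample conditioned on all past ones.
samples = []
for t in range(1000):
    samples.append(int(rng.choice(256, p=p_next(samples))))

print(len(samples), min(samples), max(samples))
```

In the real model the conditional comes from a trained network and the loop runs at tens of thousands of samples per second of audio, which is exactly why the "right, scalable way" matters.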
And of course these things are getting better over time. This is AudioCraft, which comes out of Meta, from 2023, so just last year, and it's basically building off the same technique: if we do autoregressive prediction, we can generate the samples one by one. Here they use a few tricks. Instead of predicting things in the raw audio space, like WaveNet did, in AudioCraft everything is first tokenized. This is a trick we'll see recurring: if you tokenize things first, predicting becomes a lot easier. So tokenize everything first, here they tokenize, for example, text and audio into the same space, and then you can just concatenate, text first, audio after, predict it all autoregressively, and you basically get audio generation. And here they show, okay, we can do it for music, we
can generate songs, and we can do it for other sounds as well. Text-to-speech is also something we can do with unsupervised learning. I guess in this case it is somewhat supervised, and there are some labels used to connect the two modalities, but a lot of what's going on here, such as the embeddings, happens in an unsupervised way. This Tacotron work is one of the landmark works that said, okay, we can take in text, do some processing on it, and get out these spectrograms, which are basically just a way of representing audio, and then use a model built off WaveNet to turn that into a raw audio file. And these days text-to-speech has gotten really good. There's a company called ElevenLabs that has been making this better; we can do things like generating speech in different styles, so we have different personalities here, for example. And it turns out that converting between styles is also pretty easy. It's a tough problem to go from text directly to sound, because the text doesn't actually contain all the information needed to make the sound; there are things like intonation and expressivity that aren't in the text. But with, for example, voice conversion, we can convert from one style to another. This is something fun that some of you might have seen in the past; it gained popularity last year, people using these generators to talk in the voices of other people. One of the famous ones is these presidents talking about things they would never actually talk about, but it is pretty funny. So that's text-to-speech.
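The spectrogram intermediate that Tacotron-style systems predict is easy to sketch with a plain short-time Fourier transform; the parameters below are toy choices, not the actual Tacotron mel-scale frontend:

```python
import numpy as np

def magnitude_spectrogram(audio, frame_len=256, hop=128):
    """Toy magnitude spectrogram: the kind of time-frequency representation a
    TTS model predicts before a vocoder (e.g. WaveNet) turns it into audio."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames)  # shape: (num_frames, frame_len // 2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz:
t = np.arange(16000) / 16000.0
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

The energy concentrates in the frequency bin nearest 440 Hz, which is why a picture of the spectrogram is such a readable stand-in for the raw waveform.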
And then video is also kind of a new domain. I think video generation is just starting to work. I know Wilson is working on it, some other people are working on it, but basically: can we do generation in video space the same way that generation in image space is working really well? Here are some examples using GANs: if we just have data sets of videos, we can optimize a generative model over them using the same GAN format that we used back in images. And then there are also methods based on diffusion models. This Emu Video does a step-by-step process. This is actually text-to-video; the text is not shown, but you describe the scene first. The lesson learned in this paper is that if you factorize the distribution, it works a lot better. So what you do here is: given the text, first predict the first frame, and that's the whole image generation problem in itself; then, given the first frame and the text, generate the rest of the keyframes; and then generate the in-between frames. It turns out just factorizing things that way works really well, and we can get things that are at least consistent. I think consistency in videos is still kind of an unsolved problem; it's getting there, things are
getting more consistent as new works come out, but you can still see artifacts here, where this frame doesn't actually connect to the other frame physically. VideoPoet is another, kind of concurrent, work on the same problem, text to video, and in VideoPoet they take the same lesson as in the audio generation: let's just tokenize everything. The same way we do it in language models, we're going to tokenize images into image tokens, and then we have our text tokens and our image tokens and we can predict it all autoregressively. And if you scale this up the right way, we start to get video generation that makes sense. I think these are pretty interesting because they show the same sort of properties that we saw back in images: these videos are obviously not real, this is not actually in the data distribution, but we have the nice kind of interpolation that we saw earlier in image space, now
applying to other domains. In sound as well: we're able to generate, for example, modern songs played with a classical instrument, which aren't actually in the data set, but we can guess what they might sound like. And there are some applications for these things beyond just generating videos. These videos here are cool, but it's kind of hard to think of what we're actually going to use them for. Well, one answer is that if we can simulate the world, that gives us something very useful if we want to do control. If we want a robot to plan, to imagine what would happen if it tried to do a certain thing, and we have a good video simulation of the world, we can just roll that out; we don't have to run the experiment in real life, we can run it in the simulator. So as these models get better and better, we're going to start seeing more downstream applications of them. The video model is trained in an unsupervised way, that's the pre-training that goes into the world model, and then later we can do, for example, reinforcement learning or planning on top of these models. And then there's text. So
I won't go into this too much, because we're going to have a whole lecture on language modeling later and I don't want to get too deep into language-model-specific stuff, but language models are a form of unsupervised learning. This is char-rnn: Andrej Karpathy wrote this little blog post back in 2015 where he said, okay, what if we have an RNN, a recurrent network, that just predicts each character after another? This was before we knew that these things would scale very well, but you could still generate pretty fun things, some made-up stories, for example. And these character-level RNNs were actually pretty powerful, in that you could train them on things that are not entirely natural language.
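The predict-the-next-character loop from char-rnn can be sketched in miniature. Here a hypothetical order-1 count model stands in for the RNN, since only the autoregressive loop is the point:

```python
import random
from collections import Counter, defaultdict

# Toy training text for a character-level model (a bigram table instead of an
# RNN, purely to show the predict-next-character loop).
text = "hello world, hello there, hello again. "
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def sample_next(ch, rng):
    """Sample the next character from the empirical conditional p(next | ch)."""
    chars, freqs = zip(*counts[ch].items())
    return rng.choices(chars, weights=freqs)[0]

rng = random.Random(0)
out = ["h"]
for _ in range(40):
    out.append(sample_next(out[-1], rng))
print("".join(out))  # babbles in the style of the training text
```

An RNN replaces the bigram table with a learned function of the entire history, but the sampling loop, feed the output back in as the next input, is exactly the same.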
For example, we can generate LaTeX, we can generate code. And this is a lesson we'll see again in language modeling: some people like to say that language contains a lot more than just words, it contains patterns of thought, and there are arguments that language models can serve as foundation models for things that are not just language. Here are some very early results showing that we can generate math-looking text. I don't know if this math actually makes sense, it might be making it up, but it looks right from a
distance. And then GPT-2, OpenAI's next attempt at scaling these models up, showed that with the right prompting you can actually get these foundation models, these large unsupervised models, to do the things we want them to do; I think that's the main innovation that came out of GPT-2. Whereas in the early "generate math" or "generate text" days we didn't really know how to guide them, they could generate from the distribution of text they were trained on, but we didn't know how to get them to do exactly what we want, GPT-2 showed that you can just tell it what to do. And this is a theme we'll see a little later: there's a shift from training supervised models specifically on what we want, to training these giant models of the whole distribution, and as long as we constrain them at the top, with the prompt, what comes after naturally has to follow certain instructions. So GPT-2 is able to do some nice things like generate stories, and it can argue against you. I think this one is saying, yeah, recycling is not good for the world; we don't really want it to be saying this, but it can. And then this is ChatGPT. GPT-2 was around in 2019; since then the models have scaled up even larger, although the principles are essentially the same. But we can kind
of see why it's so powerful to learn these unsupervised models: the same lesson we saw here, that a little bit of prompting gets what we want out, keeps improving the better our learned distribution matches. With language models today you can do things like ask for JSON output and get nicely structured files, and this is opening up a lot of avenues: we can use this to generate things that are readable by computers, not only by humans. Another nice thing language models can do today is long-form summarization. As the context lengths go up, that is, as the length of text the models can actually handle in memory goes up, we can start to do more understanding-based tasks. Back in GPT-2 the prompts were usually a sentence long, and most of the output came from the model, essentially because the context limit of the model was pretty small; you couldn't really go much longer than this. But now, it's cut off over there, but there's a big article about, for example, what unsupervised learning is, and the models can actually just summarize it. So we're entering a stage where the models themselves can be conditioned in these very high-bandwidth ways, even if the output itself is only a little bit. And these models are all available for use, so you can try them; I'm sure many of you have heard of these things before, they've been all over. But yeah, things are scaling, and they're working. So yeah, a little bit about
compression as well. This is a paper from, I think, 2017, but basically it shows how generative models can be used for compression, and again we'll have a lecture on this later. The basic idea is that hand-designed compression methods like JPEG assume some form for the natural image space. JPEG assumes that pixels are naturally smooth, and if you assume this, you can turn images into fewer bits than if you assume nothing. Compression using a generative model is basically doing the same thing, but instead of assuming this kind of natural-image structure, in terms of, say, smoothness or neighboring pixels, we just learn it from data. So we say: okay, if all we care about are images like, say, the million images in this training set, then I only need a smaller number of bits, because I'm assuming I won't have images of noise, or of graphics if it's a 3D data set. And so we can get compression numbers that are better, because we make stricter assumptions. One argument against these models is that some of them are lossy, but JPEG is already lossy to begin with, and for a lot of applications lossy is perfectly fine; all video codecs are essentially lossy, because it doesn't make sense to encode everything when most frames are nearly the same. If your YouTube video doesn't load, you'll see things like this. But if we assume an even stricter constraint on the distribution, we can get results like this, where the same number of bits gives you a much crisper image, because we've learned which images are likely and which ones
are. Things in 3D are also working. This is a paper from 2020 that uses NeRFs. NeRFs are a way of representing a 3D scene implicitly, as the output of a neural network. The first NeRF paper just trained them by taking pictures, potentially of one object, and this paper, which I think is either a GAN-based NeRF or a VAE-based NeRF, says: okay, if we have a representation of a 3D model, let's just learn that distribution, let's learn to generate things. So here we have models of, like, a car, a chair, and because these models are actually 3D models, you can move the camera around and see them from different angles. And again, a lot of the power of these methods comes from the fact that they are unsupervised: you don't need labels, and in some cases we don't even need the 3D models themselves; as long as you have pictures of them, it's enough to reconstruct the model.
[Student question, partially inaudible, about how the likelihood of these models is measured.] Yeah, it's a great question, and I don't know the answer off the top of my head. I could make a guess, but I think you would get the correct answer reading the work. Let me jump in for a moment: for some generative models it's a bit harder to measure, but for the ones that optimize likelihood, it's literally the likelihood. The average log loss directly translates into bits; you might have to apply a factor of log 2 to get it in bits rather than in nats, but it's the same metric. Because with optimal compression, assuming you use an optimal compressor, you get to compress your data down to the entropy of the distribution: that's the number of bits you need, and that corresponds to the average negative log prob of all your data points. Entropy is the average negative log prob, and so your average log prob is exactly what's being measured there. And so bits per byte here means average log prob per byte: an image will have a red byte, a green byte, and a blue byte to represent the three colors at each location, so it's measuring the average log prob per image and then normalizing by the number of bytes in the image to get this number.
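The nats-to-bits bookkeeping just described is a one-line conversion; a small sketch (the 2.0-nats figure below is made up for illustration):

```python
import math

def bits_per_byte(avg_nll_nats_per_byte: float) -> float:
    """Convert a model's average negative log-likelihood (nats per byte, the
    usual training-loss units) into bits per byte, the compression number."""
    return avg_nll_nats_per_byte / math.log(2)

# e.g. an average loss of 2.0 nats per byte:
print(bits_per_byte(2.0))  # ~2.885 bits per byte, vs. 8 bits for raw storage
```

So a lower log loss reads directly as a better achievable compression rate for data drawn from the modeled distribution.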
by the number of bytes that are in the image to get you this number uh yeah so basically that's that's one application of these models do some compression and and like Peter
said earlier if you actually look at the the how to measure the compression it it requires an optimal encoder and an optimal decoder so that kind of motivates the fact that the more powerful your model is the better you
should be able to actually Reach This bound on the how how you can compress a data set yeah and the optimal since the optimal compressor is not actually possible don't they usually use the
normalized compression distance which comes from the right so I guess yeah one way you could view these these are all lower bounds like we don't know the true compression of the data set because we
don't know if our encoder or decoder are optimal um I guess you can say that they're like Optimal is in terms of if your if your com complexity is how many
parameters your your network is this is it's not the best actually you can get in that parameter C because we're training them there could be a better solution in that parameter space but it's it's what we get if we try to push
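Since bits per byte is just the model's average log loss expressed in different units, the conversion is a one-liner. Below is a small illustrative sketch; the loss value and the helper name `bits_per_byte` are made up for illustration, not taken from any particular model or paper.

```python
import math

# Sketch of the bits-per-byte metric described above: a likelihood
# model's average negative log loss, converted from nats to bits, is
# the code length an optimal compressor built on that model achieves.

def bits_per_byte(nll_nats_per_byte: float) -> float:
    """Convert an average negative log likelihood (nats/byte) to bits/byte."""
    return nll_nats_per_byte / math.log(2)  # 1 nat = 1/ln(2) bits

# Example: a (made-up) model with an average loss of 2.08 nats per byte.
nll = 2.08
bpb = bits_per_byte(nll)
print(f"{bpb:.2f} bits per byte")  # prints "3.00 bits per byte"

# A uniform model over bytes needs log2(256) = 8 bits per byte, so
# anything below 8 reflects structure the model has captured.
compression_ratio = 8.0 / bpb
print(f"~{compression_ratio:.1f}x compression vs. raw bytes")
```

The same arithmetic underlies "bits per dim" numbers reported for image models; only the normalizing unit changes.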
3D generation is also getting better. This is what we could do in 2020, and this is 2022; again, similar techniques. The interesting thing about 3D is that we can assume even more; it's in a sense even more unsupervised than 2D images. With 2D images we at least have data sets of 2D images; in 3D we sometimes don't even have data sets of 3D models, and all we have are 2D pictures of the world. But we can assume certain things about them: if you take a few pictures of an object from different perspectives, we know that light reflects in certain ways, and there is essentially one model that actually gives rise to those views. By writing down small assumptions like this we can generate 3D models even though we don't have 3D data: we make one assumption and then do our unsupervised learning under it.

And finally, there are domains that are less visual but arguably more important. AlphaFold is one of them: it figured out how to do, essentially, protein structure prediction. I think there is some supervision at the end, but a lot of the power comes from the unsupervised representation learning that happens in the protein space before that.

So what came before was: here are some cool applications of what we can do with
unsupervised learning. Now, a bit of motivation for why unsupervised learning is important: even if you have a specific problem you want to solve, it's sometimes better to do large-scale pre-training first. This is from the foundation models paper from Stanford, and a lot of the field these days is moving to this setting, where we do large-scale unsupervised pre-training and then adapt to what we actually care about. The adaptation can take many forms: fine-tuning the network, prompting, even zero-shot or few-shot use. This paradigm is a big shift from five or ten years ago, and it's now what seems to be working, at least in terms of generalizing to real problems.

We'll show some examples of how this pre-training formulation has played out. This is from the GPT paper, I believe, and it shows that these language models can do sentiment detection, I think with either a linear head or some other small adaptation; they know how to do sentiment detection off the bat.
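As a toy sketch of this adapt-with-a-small-head idea, here is a linear probe on frozen features. Everything here is an assumption for illustration: the "frozen encoder" is just a fixed random projection standing in for a real pretrained network, and the data is synthetic.

```python
import numpy as np

# Pre-train-then-adapt, sketched: keep a (pretend) pretrained encoder
# frozen and fit only a small linear head on top of its features.

rng = np.random.default_rng(0)

def frozen_encoder(x, W_enc):
    """Stand-in for a frozen pretrained encoder (never updated below)."""
    return np.tanh(x @ W_enc)

# Toy binary "sentiment" data: the label depends on the first input dim.
n, d_in, d_feat = 200, 16, 32
X = rng.normal(size=(n, d_in))
y = (X[:, 0] > 0).astype(float)

W_enc = rng.normal(size=(d_in, d_feat))  # frozen weights
feats = frozen_encoder(X, W_enc)

# Linear head: logistic regression trained by gradient descent.
w, b = np.zeros(d_feat), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
    w -= 1.0 * (feats.T @ (p - y) / n)
    b -= 1.0 * np.mean(p - y)

acc = np.mean((p > 0.5) == (y == 1))
print(f"linear-probe training accuracy: {acc:.2f}")
```

The point of the sketch is the division of labor: all the capacity sits in the frozen encoder, and adaptation touches only `w` and `b`.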
A lot of things now build on this, where we pre-train with, for example, a BERT loss or, as we'll see later, a GPT loss, and then solve these benchmarks at a much higher rate. We'll also see how we actually measure how good a language model is; in language model space the measurement is actually pretty broad, with all these different kinds of tasks, like code generation, reading comprehension, maybe solving exams. It really makes the point that unsupervised learning produces models that can handle all these varied domains they were never specifically trained for.

And then there's prompting. We won't go too much into it, but it's also an adaptation method, and there are a lot of interesting strategies now for extracting what we want out of the big model. We know these models have ingested all this knowledge, we know that at some level they've represented it correctly, or to a really good degree, and the question is just how to get it out; some of the strategies are messy. This is from OpenAI's guide, and the fact that they have a guide on this shows it's something people are at least thinking a lot about.

And finally, we'll see the same
thing in vision. This is, I think, from 2018, using contrastive learning, another unsupervised method, and it shows that when we transfer from unlabeled data, even on tasks that have traditionally used supervised data, performance is still pretty good. And masked autoencoders, for example, which came out in 2021, also show results on these data sets where traditionally you would train a supervised method; it turns out that if you pre-train this big autoencoder on a lot of data, it works better.

Okay, so this is the last slide, just an overview of everything before, but I think one nice motivating thing about
unsupervised learning is that it seems to be the method that scales best as data comes in. We saw LeCun's cake earlier, where most of the volume is unsupervised learning, and that's because it's just so much easier to collect raw data than human labels; human labels can be wrong, they can be multimodal, they can have all sorts of weird properties. The fact that essentially the same methods, such as autoregressive models or other things we'll learn later in the class, can solve all these different domains should tell us that there's something correct about how we're doing these things.

Thank you, Kevin. I want to quickly highlight one
thing that we went over just a little fast, this slide here. This is from 2020, and I want to highlight the item on the right. There was a bet between two professors here, Alyosha Efros and Jitendra Malik. The formal version of the bet says: if, by the first day of autumn of 2015, a method exists that can match or beat the performance of R-CNN on Pascal VOC detection, so a detection task rather than classification, without the use of any extra human annotation, that is, with unsupervised pre-training, effectively, then Mr. Malik promises to buy Mr. Efros one gelato: two scoops, one chocolate, one vanilla. So what was going on here? Alyosha Efros challenged Jitendra and said, I think unsupervised learning will win, and Jitendra said, well, let's make a bet around it: you tell me when it's going to win. Clearly Alyosha was optimistic; he put it at autumn 2015, and it didn't happen until this paper here, the CPC v2 paper, in autumn 2020. So five years later, but it did happen. I think it's interesting to see some of these things play out: for the longest time one approach keeps winning, while there are other approaches people are already thinking about that, projected into the future, would be the ones winning in the long run; we're just not there yet, we need to figure out some details, we need to figure out scale, and once that's figured out it will actually be better than the current way of doing things. So Jitendra got the gelato from Alyosha in this case, but in the long run Alyosha was, I guess, right that unsupervised learning is a more powerful
pre-training approach.

So in summary, unsupervised learning is rapidly advancing as a field, thanks to compute, certainly, to deep learning engineering practice getting better, to data sets, and, these days, to lots of people working on it. It's not just a topic of academic interest anymore; it used to be, but definitely not anymore. Today language modeling, image generation, and vision-language multimodal pre-training are all working really well and have production-level impact, with BERT for Google Search having been the first example, but obviously many more now. What is true now may not be true
even a year from now, and I'll give you two examples to make this concrete. Example one: we just talked about how self-supervised pre-training was way worse than supervised learning on computer vision tasks like detection and segmentation, until, it turns out, fall 2019, when CPC v2 came out; now it's better. Example two: representation learning for vision through masking didn't work. People said, why not just mask like in language, where you just mask things out; nothing worked, or nothing worked all that well, until Kaiming He and his collaborators made it work in November 2021. The word on the street was that masking doesn't work in vision, but it turns out now it's what works best. So these things are all in flux, and that's one of the things I want you to keep in mind. I would love to see final projects where
you challenge some of the common wisdoms of today, and ideally you have a reason for it. Maybe the reason here could have been: well, if it works for language, why can it not at least work reasonably well for vision? Maybe it won't be the best, but at least it should work reasonably well. What was the key thing Kaiming did that nobody else did? Well, first, he's a very good experimenter, so he iterated over many things; he's the best at that. But what ultimately mattered is that he masked out 80 to 90% of the image, and it turns out that was key. Before that, people were masking out 10 to 20%, and with 10 to 20% masked out the task turns out to be too easy. Remember the earlier slide with Geoff Hinton saying the brain must be doing a lot of work, having on the order of 10^5 synapses per second available to be trained. Well, Kaiming effectively said: let's put the network to work, make it work harder than just filling back in 10 to 20%, make it somehow create the 90% that was left out; and all of a sudden the results are great. The Vision Transformer architecture helps make masking easier to handle in training, and hence it also helped make this possible, but of the contributing factors I think the biggest one, conceptually, is masking way more than anybody had ever done before.
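A minimal sketch of that masking step, not the actual MAE code: split the image into patches, keep a small random subset, and the model would be asked to reconstruct the rest. The patch size of 16 and the 90% mask ratio are assumptions chosen to mirror the lecture's numbers.

```python
import numpy as np

# Illustrative masked-autoencoder-style masking: turn an image into a
# sequence of patches and let only a random ~10% survive as input.

rng = np.random.default_rng(0)

def mask_patches(img, patch=16, mask_ratio=0.9):
    """Return (visible_patches, visible_idx, num_patches)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # Flatten the image into a sequence of (patch*patch*c)-dim vectors.
    patches = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    visible_idx = rng.permutation(n)[:n_keep]  # random subset survives
    return patches[visible_idx], visible_idx, n

img = rng.random((224, 224, 3))
visible, idx, n = mask_patches(img)
print(f"{len(idx)} of {n} patches visible "
      f"({100 * (1 - len(idx) / n):.0f}% masked)")
# prints "20 of 196 patches visible (90% masked)"
```

At a 10 to 20% mask ratio most of the sequence would be visible and reconstruction is easy interpolation; at 90% the model has to synthesize most of the image, which is the harder task the lecture credits for making vision masking work.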
Autoregressive models, flows, VAEs, GANs, diffusion models, the ones we'll cover in the first five weeks of class: I think they still have significant room for surprising capabilities, meaning that if you scale them up in a domain where people haven't done it yet, maybe video, which is still a very early domain, maybe robotics, which is also still early going, bio data, other sciences data, even just applying the current ideas can likely give very surprising results. So it's a great time to work on them. But I also think the core of unsupervised learning might still have some major innovations ahead. The example I gave earlier: the last offering of this class was in 2020, which is a while ago now, but back then we had no lecture on diffusion models, because we hadn't written the denoising diffusion probabilistic models paper yet, and at the time people thought of GANs as the best thing for image generation. Since then that's obviously changed, and I think more changes could be ahead.

So I really challenge you to think hard about all the methods we'll be presenting and see if you can somehow find a way to improve them by thinking very hard about how they are set up. Often, and of course I'll try to give more color when we get into specific ones, the devil is in the details in these things. Something is presented, the big picture makes sense, it seems like this is going to be the best, and maybe it even is the best as presented; but then you go to the specifics of how it's actually written out in the equations, and there's already a bit of a gap from the motivation to what the equations actually say, and then maybe the implementation introduces another gap, because what's implemented is not exactly the math. All these gaps along the way are places where you could see an opportunity to make things better, and maybe invent the next generation, or next iteration, of these models. That would be a pretty great outcome from this class.

All right, that's it for today. Thank you, and see you all next week.