Stanford CS234 Reinforcement Learning | Introduction to Reinforcement Learning | 2024 | Lecture 1
By Stanford Online
Summary
Topics Covered
- RL Essential for Intelligence
- ChatGPT Powered by RLHF
- RL Core: Delayed Consequences
- Exploration and Censored Information
- Reward Hacking Undermines Optimization
Full Transcript
hi everyone, we're going to go ahead and get started. I'm Emma Brunskill, and I'm delighted to welcome you to Reinforcement Learning, CS234. This is a brief overview of the class and what we're going to be covering today, and I just want to start by saying that probably everyone's heard of reinforcement learning these days, and that wasn't true about 10 or 15 years ago,
but you can describe what is happening in reinforcement learning by a pretty simple statement which is the idea of an automated agent learning through
experience to make good decisions now that's a pretty simple statement to say um it sort of encapsulates a lot of what me and my lab and many many others have been trying to work on for the last 10
to 15 years um but it's sort of deceptively simple because it involves a lot of different really challenging and important things so the first is that any sort of
General agenda to try to achieve General artificial intelligence has to include the ability to make decisions there's been absolutely enormous progress in what we would call sort of perceptual um
machine learning things like being able to perceive faces or cats or like identify cars and we call that often uh perceptual machine learning because it
focuses on trying to say identify something but of course in reality what we're all trying to do all the time is also to make decisions based on our perception and based on our information
we're receiving and and so it's critical if we think about what it means to be intelligent to consider how to make decisions and not just any decisions but what it means to have good
decisions this sort of question over how can we learn to make decisions particularly faced uh by uncertainty and limited data has been a central question that people have been thinking about at
least since the 1950s, particularly pioneered by the ideas of Richard Bellman, and we'll hear a lot more about Bellman's equation, which many of you might have seen before, later in this lecture or the next lecture. So there's one sort of argument for studying reinforcement learning, which is that it's an essential part of intelligence; it has to be part of a
general agenda of artificial intelligence and so we should study it to try to understand what it means to be intelligent and that certainly for me is one of the really big motivations is to I think there's just a lot of
fundamental questions about what is the data needed to learn to make good decisions. But there's another really good motivation to study reinforcement learning, which is that it's practical and it allows us to solve problems we'd like
to solve so in particular over the last roughly decade there started to be a lot of really impressive successes of using reinforcement learning to tackle
problems or to get unprecedented performance in a lot of really important domains so the first one is the board game go so who here plays
go? Okay, a few people, maybe; maybe not. You can talk to the people that raised their hands. So it's an incredibly popular board game; it's also an incredibly hard board game, it's far harder than chess, and it was considered a really long outstanding question of artificial intelligence. But roughly, I guess about eight years ago now, eight to nine years ago, there was a team at DeepMind, which was still a fairly small
organization at that point that thought that they could make significant Headway at teaching AI agents to be able to play go and the idea in this case is that
we're going to combine the ideas of reinforcement learning and Monte Carlo tree search, which is something we're going to hear about later in this class, to create a system that played Go
better than any humans in the world and so there's even a movie now about sort of one of the seminal um games in that sort of endeavor and how humans felt
about that, and how the creators of the AI systems felt about that. But this feat was achieved far earlier than what people expected, and one of the key reasons for that was using
reinforcement learning. Another really interesting place that we've seen progress in using reinforcement learning to tackle incredible challenges is in fusion science. Fusion is a potential approach for trying to tackle the huge energy issues that we have and trying to transition to more sustainable options, and one of the challenges here (and I'm not a fusion expert) is to manipulate and sort of control things within a vessel. And so the reinforcement learning question in this case is: how do you command the coil controllers in order to manipulate this into different types of shapes? And so this was a Nature paper from two years ago where they showed you could use deep reinforcement learning techniques to accomplish this in a way that was far more flexible than
had previously been imagined. One of my favorite examples of the applications of reinforcement learning comes from a pretty recent important case, which is COVID testing. So this was a system that was deployed in Greece; they had limited resources and they were trying to understand who you should test in order to help control the epidemic, because as many of you may know there's a lot of free movement within Europe and there were a lot of transitions, and they were trying to think about how to leverage their resources in a data-driven way, because of course the epidemic was changing. And so this was a beautiful paper by a Stanford graduate, Hamsa Bastani, and her colleagues (she's a professor over at Penn now) that used reinforcement learning to really quickly do this, and it was deployed, so Greece used this for
their testing at the border. But perhaps the most famous example recently is ChatGPT. So I think that, as many of you might know, natural language processing has had incredible successes over the last decade, and there was a lot of work trying to use Transformers to make really, really capable natural language systems. But up to around, you know, I guess like a year and a half ago, most of that work was not known to the broader public; so even though we were getting these amazing advances in natural language processing, it wasn't at the state yet where like everybody was using it. And so the key idea of ChatGPT was to use reinforcement learning to create vastly more capable
systems. And I like to talk about ChatGPT not just because it's perhaps the most well-known success for reinforcement learning, but also because it exhibits a lot of the different technical challenges and questions that we're going to be covering in this class. So let's just walk through, at a very high level, how the ChatGPT system works in terms of training. So the first thing it
does is it does what we would probably call Behavior cloning or imitation learning and we'll be covering that in this class and we'll be talking more about it
even in this lecture. So what did it do? So again, just to remind you, I suspect everybody in this class probably uses ChatGPT multiple times a day, or Claude or Gemini or one of the other
um large language model systems but just in case you have not um the idea in this case is you might have some sort of prompt or task you want your language system to do like explain reinforcement learning to a
six-year-old, and then someone gives a response like: we give treats and punishments to teach, etc. You can try this out with ChatGPT and see how well you think it explains it,
and then what that was treated as is sort of a direct supervised learning problem so just trying to take that input and then to produce that output and we will call that also imitation
learning or behavior cloning in this class and we'll talk about why so that was the first step and this is sort of what people have been doing in uh natural language processing and the systems were good but they weren't that
good so the next idea was to try to explicitly think about utility or rewards like how good were these um particular labels or these particular outputs so here we're going to actually
build a model we're going to build a model of a reward which relates to model-based reinforcement learning and the way we're going to do
this, or the way they did this, is we collect preference data: we ask people to compare or rank across different outputs, and then we use that to learn a preference model. And we're going to cover that in this class; that's going
to be one of the differences to this class compared to a couple years ago that um I think preference-based reward signals are really important and very powerful and so we're going to be covering that in this class this term so
in this case they would learn a reward and again don't worry if you haven't if you're not familiar with what rewards are and stuff we'll go through all of that I just want to give you a high level sort of sense of how chat GPT is related to some of the things we're going to cover in the class so they
learned a reward signal and then they did reinforcement learning using that learned reward signal; so now they're going to do reinforcement learning, and this is called
RLHF, because it is reinforcement learning from human feedback. And I'll just note here that this was not the first time this idea was introduced; it had been introduced maybe about four to five years before this, for sort of simulated robotics tasks, but ChatGPT demonstrated that this really made a
huge difference in performance, and so I think that's a really nice example of the types of ideas that we're going to be covering, as well as sort of the incredible successes that are possible now.
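To make the reward-modeling step she just described a bit more concrete, here is a minimal sketch in PyTorch (this is not OpenAI's actual implementation; the model, feature dimensions, and data are illustrative placeholders). It fits a scalar reward so that the response people preferred scores higher than the one they rejected, using a standard pairwise preference loss.

```python
# Minimal sketch of learning a reward model from preference comparisons.
# Illustrative only: a real system would use a transformer over (prompt, response)
# text rather than random feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)   # stand-in for a large model with a scalar head

    def forward(self, features):
        return self.score(features).squeeze(-1)  # scalar reward per (prompt, response)

def preference_loss(model, preferred, rejected):
    # Push r(preferred) above r(rejected): loss = -log sigmoid(r_pref - r_rej)
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy usage with made-up "features" standing in for encoded (prompt, response) pairs.
model = RewardModel(feature_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(model, preferred, rejected)
loss.backward()
optimizer.step()
```

The learned reward is then what the reinforcement learning step optimizes against, which is the "reinforcement learning from human feedback" part just described.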
Now, even before ChatGPT came along, there was starting to be a huge interest in reinforcement learning. So, some of you know, we have an optional textbook for the class; it's
by Sutton and Barto. Richard Sutton is from Canada and is one of the sort of founders of the field, and when I started in reinforcement learning and I would go give my talks at conferences, it used to
be like me and Rich and 30 other people nobody cared about reinforcement learning um I mean except for you know a few of us did because we thought it was really amazing but as you can see like sort of through like the 2000s which is
when I I was getting my training around here um there just weren't that many papers and the community wasn't nearly as large but this nice paper by Peter
Henderson (the y-axis is the number of papers) shows that there's been this sort of enormous increase in interest, and I think a lot of this was really due to
the fact that kind of around here there were some amazing successes on the Atari video games, where people showed that you could learn directly from pixel input to make decisions, and
then there started to be the successes in AlphaGo, and then there became more and more successes. So it is an incredible time for reinforcement learning; this curve has continued to go up. However, I
think it's also important to notice that there are also a number of skeptics. So there was a pretty famous talk by Yann LeCun in 2016 at one of the major machine learning conferences, NeurIPS. So Yann LeCun, for those of you who don't know him, is one of the sort of seminal figures in neural network research; he has won the Turing Award; he is an amazing, amazing researcher. So he gave a keynote at NeurIPS, I believe it was a keynote, he certainly gave a very famous talk there, where he was talking about the role of different types of machine
learning questions and at sub areas in terms of making progress on machine learning and he very famously talked about machine learning as a cake and so he said that the main cake is really
unsupervised learning, and that's really going to be the body, the most important aspect of machine learning, things like representation learning from unlabeled data; that's really going to be the core, that's where we're going to
have huge amounts of data and we're going to make a lot of progress and then supervised learning was the icing so that's still pretty important it's like very important part of a cake at least in my opinion um it doesn't have as much
we don't have as much supervised learning um and it's sort of this additional one and then he argued that reinforcement learning was just the Cherry now you know cherries are
important um but uh but not nearly as much perhaps as the rest of the cake and he went on and talked about some places where he thought that RL still might have a role but it was considered a
really important talk because what he was sort of demonstrating is that reinforcement learning was having a part to play in machine learning but maybe only a very minor part now I think it'd be interesting to talk to him today I
haven't talked to him recently so I don't know what his current opinion is um but I think it's a really important thing to think about like where are all of these different techniques important and where will we be able to to make the
most progress in terms of advancing Ai and so with that we're going to try and do our first poll which is about why you guys want to take this class um so so we'll look through these you'll have to bear with us a little bit with um we
had a few technical difficulties that we're working with CTL on, but it should work out. So if you go to either the first link in Ed, or you go to this HTTP link; if you have any issues, like it asks do you want to be registered, or if it's hanging, just skip the registration and refresh, that should all sort it out; and then just enter your SUNet ID as your screen name, and just take a second and write down a bit about why you want to take this
class and it could be anything it could be that you're really curious about something it could be because you're doing an internship and they told you you had to take something about reinforcement learning any of the any of the things are
fine, just take a minute or two. ... Thanks for all the great reasons. I
will talk about some of those when I talk about also what we're going to cover today and try to address why I think that a lot of the things people are bringing up are things that we're going to be touching upon so I think if
we want to think about um I think it's really important to start thinking about what is what is reinforcement learning about because if we understand what it's about then we know what sort of uh types of questions we're interested in this
space and we also understand what sort of applications it might be helpful for um though of course your creativity is unlimited so you can see what you might come up with other ideas that people may not have thought of for applying RL but
the four things that people typically think about when they think about reinforcement learning as a discipline, and as what reinforcement learning involves, are optimization, delayed consequences,
exploration and generalization so the first is optimization um and the optimization aspect is really just saying that we're thinking about the best way to make
decisions, which means that we explicitly have to have some notion of utility. An example of this would be something like finding the minimum-distance route between two
cities given a network of roads this means you can directly compare different solutions because if one solution has a smaller distance than the other it is strictly preferred so there are many many important optimization questions
and reinforcement learning because it is concerned with making good decisions cares about us being able to rank or you know decide across those different ones the second one is delayed
consequences the idea being that the decisions that we make now can affect things far later so maybe saving for retirement now has some uh immediate cost but leads to some significant
benefit later or maybe there's something you can do early in a video game that later has a lot of benefit there are two reasons why delayed consequences is challenging one is for the reason of
planning. Many of you, actually, raise your hand if you've taken AI at Stanford; okay, so about half of you; so you probably saw planning in AI, and planning is the idea that even when we
understand how the world works it might be really complicated to try to decide what the optimal thing is to do so you could think of this like chess the rules are known um it's still really complicated to think about what's the
right thing to do so when the decisions you make involve reasoning not just about the immediate outcomes but the longer term ramifications these sort of planning problems are even harder but the other reason this is
really hard is when we're learning meaning that like we don't know how the world works and we're trying to understand how um through direct experience so when we're learning um temporal credit assignment is hard
meaning that if you take some action now and later on you receive a good outcome or a bad outcome, how do you figure out which of your decisions caused that good or bad later result. This happens all the time to us as humans, right? Like, how do you know why you got into Stanford? Well, I don't know, was it because you won a coloring contest when you were six, because you scored well on the SAT, because you went to a good high school, or you wrote a really good essay? It's really hard to understand this; in some cases it may be impossible. But when we're getting to make repeated decisions, it's really important that we can start to use the prior experience to
figure out which decisions were important or led to good outcomes so that we can repeat them. So that's one of the reasons why this is hard. Exploration is one of my favorite
things um in terms of reinforcement learning and the idea of this is that the agent can only learn about the world through direct experience so it's like
trying to learn to ride a bike by trying and failing and trying again um and through that direct experiencing learning the right way to ride a bike
and the key idea about this is that um information is censored in that you only get to learn about what you try so for example right now you don't know how much worse your life would be
if you were at MIT; I went to MIT for grad school, MIT is also a great place, but you generally can't ever understand what that counterfactual life would have been
like, right? It's one of the central challenges. It's also a huge challenge in causal inference, which is another big interest of mine and something my lab works on. So there's this general challenge that you only get to learn
about the actual things that you do as an agent or as a human as an agent Etc um and so the question is how do you use that experience to figure out how to make good decisions so as a concrete
example of this you can imagine you're a company and you give some promotion to all your customers you can't know what it would have been like if you didn't give the promotion to those customers and even if you can give it to One
customer and not another, they are not the same people. So I can't rewind and say, Dilip, who is our head TA this time: I'm not going to give you the promotion, let's see how that world would have worked out. That's one of the central
challenges so we'll talk a lot about exploration later because it's one of the key things that is different compared to many prior approaches and generalization um has to
do with this question of really wanting to solve really big challenging problems so we'll talk a lot about what decision policies are but in general you can just think of them as a mapping from experience to to a decision and you
might think in those cases you could just pre-program it so like if your robot goes down the hallway if it hits the end of the hallway turn left but let's think about um a video game which we can think of as just sort
of generally having some input image. So let's imagine that it's something like 300 by 400, and let's say we have at least 256 different colors. So now the set of images we could see is at least 256 to the 300 times 400; so that's at least the space of images, and that's probably an underestimate. And now we get to think about what we would do in each of those different scenarios, so the combinatorics are completely mind-blowing and we can't write these down in a table. This is why we would need something like a deep neural network, or something else, in order for us to try to make decisions in these realistic settings, which are extremely large in terms of the number of scenarios we want to make decisions on.
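As a quick sanity check on those numbers (a back-of-the-envelope calculation, not anything from the lecture slides):

```python
import math

# 300 x 400 pixels, at least 256 values per pixel, as in the lecture's example.
num_pixels = 300 * 400                          # 120,000 pixels
log10_images = num_pixels * math.log10(256)     # log10 of 256^120000
print(f"256^{num_pixels} is roughly 10^{log10_images:.0f}")   # ~10^288989 possible images
```

No table could enumerate a decision for each of those inputs, which is exactly why function approximation such as a deep neural network is needed.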
So you've probably seen all of these ideas, or at least most of them, in other classes for other types of AI or machine learning, so I think it's useful just to contrast what reinforcement learning is
doing compared to these other ones. So the first is AI planning. In AI planning, generally we're doing some form of optimization, trying to minimize a distance or something like that; we are often trying to handle delayed consequences; and those are the two main things, though we might also have to do generalization if the size of the space is really large. Okay, so RL in general will involve all of these, so this is how those would compare. If we think about something like
supervised learning supervised learning does involve learning so we learn from data you know whether something's a cat or not and we have to do
generalization so we have those two things and again this is going to be compared to reinforcement learning which has all of those in contrast to supervised learning where you get the correct labels and
unsupervised learning, we don't get any labels, but we're still learning from experience and we're still trying to do generalization. Now the next thing, and this has become a really popular thing,
is to think about whether we can map um reinforcement learning to imitation learning we talked about this really briefly um about chat GPT and we'll talk about a lot more in the
course so in imitation learning or behavior cloning or reducing reinforcement learning to supervised learning we generally assume that we get access to expert trajectories so this could be like
someone saying what they would do in response to those prompts, it could be someone driving a car and then you want to mimic their behavior, or some other similar example. So then the idea is that we get input demonstrations of good policies, and that allows us to reduce reinforcement learning back to supervised learning. So we're sort of taking this and we're
reducing it back to here now I think in general the idea of reductions is incredibly powerful um for those of you that have taken CS Theory classes that's what we do all the time
we reduce things to SAT or other things like that, and in general I think in computer science it's one of its strengths that we think of how can we reduce one problem to another and then inherit all the progress that's been made on that
problem. So in this way, reinforcement learning is similar to other aspects of computer science, in that we will often try to reduce reinforcement learning to other problems; this is particularly done in the theoretical aspects of reinforcement learning.
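As an illustration of that reduction, here is a minimal behavior-cloning sketch (the data and model are toy placeholders, not any particular system): expert demonstrations become ordinary labeled examples, and fitting a classifier to them gives a policy.

```python
# Behavior cloning as plain supervised learning on (observation, expert action) pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
observations = rng.normal(size=(500, 4))                                     # stand-in for recorded sensor readings
expert_actions = (observations[:, 0] + observations[:, 1] > 0).astype(int)   # stand-in for what the expert did

policy = LogisticRegression().fit(observations, expert_actions)   # the "reduction" to supervised learning
action = policy.predict(observations[:1])                         # the cloned policy's decision for an observation
```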
Yeah? Oh, whenever you ask a question, just because I'm going to try and learn names, could you say your name please? Yeah, my name is... So just to
be clear um imitation learning then isn't like a separate technique it's just an application of supervised learning to like the specific reinforcement learning context it's good question so I think some of you I mean
there's a lot of techniques that think about when you're doing imitation learning specifically for kind of decision data you can just think of it just reducing it back um if you want to
do imitation learning where you might recover like the reward function we'll talk more about that soon and others then you may need to use other types of techniques as well but like the most straightforward aspect is just to say
I've got demonstrations I'm going to ignore um sort of like this uh delayed consequences aspect and exploration and I'm just going to reduce it
back. Yeah, and name first please? What do you mean by input demonstrations of good policies, what does that mean? Great question, so let me give you an example. So people have thought a lot about this; maybe one of the
first examples of this or one of the first really public examples of this was for driving like at least what you could do is you could I could drive a car it could record everything that I do in terms of controlling the steering wheel
and then we could if I'm a good driver they could say that's a good demonstration so instead of the car trying to learn from itself how to steer the wheel in order to say successfully Drive you could have humans drive it and
it could try to figure out at each point what like how should I steer the wheel in order to um have good behavior so the idea is that you actually have access
already to good demonstrations of what is a good policy. Yeah, name first please? What do you exactly mean by optimization? Okay, good question. So what I mean is that when we do imitation learning from good trajectories, we are assuming that
like we want to do well so we want to actually get a good policy so imitation learning normally we're not normally trying to imitate bad performance um but you could think of this as sort of
reinforcement learning but without the exploration part, because it's not trying to pick its own data. ...have the optimization? Yeah, so I think, because we normally
don't have the notion of uh utility in those so you might say this is a cat or it's not a cat it's not like a good picture of a cat or not whereas in decisions we often have a real valued
scalar value of, like, you know, it was a 0.7-good decision. Yeah, name first please? ...optimize? Yes, so we do often, we always have loss functions, and that's a great point, but in those cases there's not normally a utility. So you could maybe have some smooth notion there of how well you match
like a stochastic policy, a stochastic output, there; but for many of those it would be more, if it's like did you say it was a cat or not, you would have a binary 0-1
loss yeah so does that mean if you have the data for imitation learning it's like almost always better than reinforcement learning you're kind of like avoiding
the you can like directly learn what is good say that again so like if you have the data for imitation learning like you have someone actually driving the car
does that mean that um you'll probably learn a better policy than reinforcement learning or that's almost always great question so we'll get into that so the question was if you can hear that is if you have good demonstrations say of
driving behavior that you're using for imitation learning, can that be better than reinforcement learning? It will depend on your reinforcement learning algorithm; in general, reinforcement learning should always be able to
equal or exceed the performance of imitation learning. Yeah? So can you explain the difference between IL and RLHF? Yes, great question. So in imitation learning, what we would have, and this was the first part, you would say: people give me, given a prompt, I look on the internet and I assume that those were good. So on the internet I see if someone said, like, how to explain reinforcement learning to a six-year-old, this is what they said back, and so I just train on those. What RLHF
said is that well you know the internet's a big place probably not all of it is good answers so now let's actually ask people which of these two responses they prefer and now we're going to try to do reinforcement learning on that to actually get to a
better policy. Yeah? I'd like to ask: so AlphaGo actually discovered some Go strategies that are not invented by humans, that we have never experienced before; so doesn't that mean that if we apply imitation learning too much, it might actually hinder the model's capability to explore what is actually good, beyond what humans have thought of? Absolutely, and actually I think this is on the next slide, let's go, good, okay, perfect. So this turns to: where are some of the places that you might hope that
reinforcement learning would be better than these other strategies so one of them is where you don't have examples of desired Behavior this is exactly like the example that was just brought up if you want to go beyond Human Performance
you cannot rely on Human Performance just to do imitation learning because you're not going to be able to go get better than it so there are a lot of application areas I think particularly in areas like healthcare or education
and others where we think we can go beyond Human Performance um and so in those cases reinforcement learning because it's trying to optimize performance um could go beyond it could be a particularly useful technique and
others where you don't have any existing data for a task so there might be something where you think of it as um a decision-making problem but you don't have prior data you need to learn from scratch um and you want to directly
optimize so that's another place where reinforcement learning can be very powerful another category is interesting because in some ways it's also kind of a reduction technique and this is the
question of places where you have an enormous search or optimization problem with delayed outcomes. So there have been a number of examples of work doing this from DeepMind which have been really extremely elegant. So what I put up here is AlphaTensor; if you haven't heard of it, it's a faster way to do matrix multiplication, which is kind of mind-blowing. So what they did is they
said all right there's standard ways to do matrix multiplication this comes up all the time could we learn an algorithm that would be better at matrix multiplication not me as like a
scientist try to write down an algorithm have a an agent learn one and they they showed yes and the way they did that was with reinforcement learning um and they've done this in other cases too
like learning faster sorting algorithms. So I think this is a pretty incredible frontier; the idea is saying could we have AI actually be inventing new algorithms. And one of the ways that they framed it here, and you can think of AlphaGo as similar, is that it was a really, really, really large search problem, and the challenge with really large search problems is that
even there we may not have great techniques for solving them and so it's sort of a reduction you can think of people taking a planning problem and trying to reduce it to a reinforcement learning problem to make it more
tractable. So that's pretty wild; most of the time we think of sort of RL or Go being reduced in the other direction, or involving planning, but here in some ways you can think of these as either adversarial planning problems or expectimax problems that are being reduced back to learning as a way to just more efficiently go through the search space. So those are two of the areas that I think are particularly promising in terms of why
reinforcement learning is still a really practical and really important area to think about. I think I saw a question back, but maybe you, yeah. Oh, what was your name? For AlphaTensor, is it like, it's faster but within some error of the correct matrix product, faster but with some error, or do you actually get the correct value? No, you get the correct value, which
is wild yeah yeah so no it's just better yeah and one of the really clever things they had to think of in this case was how do you know that the answer is correct how could you like provably
verify that so yeah incredibly elegant all right now we're going to go quickly through some course Logistics um before starting to dive into some content and feel free to interrupt me
throughout this or anything else if you have other questions. So in terms of the content, we're going to start off by talking about Markov decision processes and planning, and then we're going to talk
about model-free policy evaluation and model-free control; don't worry if you don't know what I mean by model, I'll specify it. Then we're going to jump into policy search; policy search is things like proximal policy optimization and REINFORCE and other approaches; some of you might have already seen related ideas, say in robotics, if you've taken them,
and then I'm highlighting here that this is one of the important differences compared to Prior years so we're going to do um a a deep dive into offline reinforcement learning offline here
meaning that we have a fixed amount of data and we want to learn from it to get a good decision policy and during this we're going to talk about reinforcement learning from Human feedback and direct preference
optimization so that's um going to be a new that's going to be a new sort of third part of the course that we haven't done assignments on before so I think that'll be pretty exciting and we'll also talk about exploration and do
Advanced topics so the high level learning goals of the class is that by the end of class you should be able to define the key features of reinforcement learning you should be able to given an application
specify how you would write that down as a reinforcement learning problem, as well as whether or not you think it would be good to use RL for it; that you can implement in code common RL algorithms; and that you understand
the theoretical and empirical approaches for evaluating the quality of an RL algorithm so as you could probably imagine from those papers going up there's going to be there's going to be continued progress in this field there's
going to be a huge number of different RL algorithms and so one of the key things that I hope to talk about is sort of how do you evaluate and compare them um which might vary depending on the application area you care
about so the way that the course is structured is that we'll have live lectures we'll have three homeworks we'll have a midterm we'll have a multiple choice quiz we'll do a final project um and
then we'll have what I call check or refresh your understanding exercises, which we'll be going through on Poll Everywhere, and we'll have problem sessions, which are optional; problem sessions are
a great chance to think more about the conceptual and the theoretical aspects of the class um and they'll be held starting next week so one of the main application areas I think about a lot is education I
think education is one of the greatest tools we have to try to address poverty um and inequality and so I'm really interested in evidence uh to think about how do we educate effectively so with
respect to that I wanted to share this paper that came out I guess almost a decade ago to now where they did a study to look at how people who are taking massive open online courses how they
spent their time and how that related to their learning outcomes, and what they found is that if you do more activities there seemed to be a six times larger learning benefit compared to watching videos or
reading and um you might think this is just based on time but it wasn't in fact it seemed like students spent less time per activity than reading a page and I bring this up because sometimes I have people who come talk to me right before
the midterm and they say, I've rewatched your lectures like three times, you know, what else can I do? And while I am flattered that they want to watch the lectures multiple times, I really highly recommend you don't do that, and that instead you spend time doing problems, or going through problems from the sessions, going through the homework, going through the check your understandings; it's far
more effective and efficient in general so in general engage practice um particularly forced recall where you have to sort of think about things um uh
without checking the answers is shown to be very effective for learning um and so to achieve the class learning goals I encourage you to spend as much time as you can or the time you have available
for the course on those types of directly engaging activities rather than more passive ones like reading or watching. Yeah, name
first um do you have a time frame for when the um problem sessions will be held great question we will announce those um by the end of tomorrow and for those ones we know it's
like impossible to coordinate schedules so um if you can't make it we encourage you to come in person but if you can't make it we also release all the materials and the videos afterwards okay um I will highlight I guess just
also on this too, and I saw several people asking about this; well, just to go back to this part and allow me to cover it: so several people mentioned that
they were excited about having some more theoretical aspects um this class does involve Theory uh it is um perhaps there's probably more Theory I think probably than the normal um machine
learning and AI classes, probably a little bit more, and not as much as like an advanced seminar on theory. So normally most problem sets will have like one theory question, and if you're not
familiar with some of the sort of theoretical you know theoretical techniques um totally fine you can come to problem sessions you don't have to have any prior background um in doing proofs to be able to succeed another
thing people asked about was Monte Carlo tree search; several people brought up reinforcement learning from human feedback, we will be talking about that; some people asked about multi-agents, we're going to be thinking about Monte Carlo tree search and other ways to have multiple agents that are making decisions; and a number of people said they wanted to get up to speed on sort of the latest ideas in reinforcement learning so they could read papers or do
things in their applications, and I think this is all very relevant to that. The final thing is just: we have five wonderful TAs who will be supporting you. The main ways to get information about the class are to go to the website or go to Ed; we'll be releasing our office hours by the end of tomorrow and we'll start them for the rest of the week; and all of you are completely capable of succeeding in the
course and we're here to help yeah yeah thank you going back to the course topic slide yeah do some of those topics include model based approaches as
well? Yeah, so great question. So when we first start talking here we'll talk about models at the beginning, and particularly when we're defining Markov decision processes, and then we will
likely be talking again about them more when we get into the offline approach. There are a lot of really interesting questions; we'll get into the fact that there are a lot of different representations you
can use for reinforcement learning and there's a lot of questions over which to use when or when you combine them and in particular where do errors propagate in the different types of representations in terms of leading to error in the
final decisions you make; but model-based reinforcement learning can certainly be a really powerful tool. Any other questions on the logistics? All right, so let's start to dive into
the material all right we're going to start with a refresher exercise so um raise your hand if you've seen reinforcement learning at least a little bit in the past okay so most people not all um if
you haven't if everything I am about to say doesn't make sense don't worry we're going to cover it um but I like to kind of get a a gauge in case people are like I've seen all of this before for the very beginning of the course um so this
is going to be a refresh exercise we're going to do it on Ed um I'll put the link up again or you can go to Ed it'll be the second link so here's the question we're going to think about how would we formulate a particular problem
as a reinforcement learning problem or as a Markoff decision process so one of the first application areas to use reinforcement learning for Education used it in roughly the following way not
exactly the idea was that you would have a student that didn't know a set of topics let's here just consider addition which we'll assume is an easier topic for people to learn and subtraction which we're going to assume is harder
imagine the beginning the student doesn't know either of these things and what the AI tutor agent can do is they can provide practice problems they can provide subtraction uh practice problems
or they can provide addition practice problems and what happens is the AI agent gets a reward of plus one if the agent if the student gets the problem right and they get a minus one if the student gets the problem
wrong. And so what I'd like you to think about here is: to model this as a decision process, what would the state space be, the action space, the reward model? If you've taken classes with Markov decision processes before and you don't remember, it's totally fine to look it up, you know, and refresh your memory; this is not a test. I'd like you to write down sort of, you know, what would a dynamics model represent in this case, and then in
particular what would a policy to optimize the expected discounted sum of rewards do in this case for how I've set up this scenario so I'd like you to write down
your answers and enter them in Ed, and then we're going to do some small group discussion in about 5 minutes. And if you're not familiar with these particular words like state space, etc.,
it's still fine just to think about, you know, given what I've told you about the reward for an agent, what might happen in this case. Ah, okay, sorry, you might have to switch to Ed. All right, try to enter in something, it's okay if you're not sure, and then turn to someone near you and compare what you did. ... All right, we're going to come back;
hopefully, I heard a lot of really fruitful discussions. So let's see, I know at least one group I talked to had a great idea for what the state space could be; do you guys want to share what your state space was, and maybe tell your name as well? Oh sure, the state space could be just like a pair of numbers, like natural numbers or any kind of numbers, of like how good the student is at addition and how good the student is at subtraction. Yeah, so you could imagine
something which is mastery at addition and subtraction; so you could imagine something like this, where you just have like a vector pair, where it's like maybe they're 0.9 close to mastery for addition and like 0.4 close to mastery for subtraction. This is not the only way you could write it down, there are lots of choices for the state space, but that would certainly be one reasonable
one those are challenging in some ways because you can't directly observe them but it's a pretty natural way to write it down um and in fact there are commercial systems that essentially do that where they have like um for those
of you familiar with hidden Markov models, it's basically a hidden Markov model over whether someone has mastered something or not. Can we have a different type of state space that they wrote
down? Yeah. We talked about, we basically wanted (oh, could you say your name first please?) the knowledge that the student has, and also maybe the questions that have already been asked, to like capture the environment, the
current environment that were at um so we like I guess this is a better representation of capturing the knowledge the student has we were thinking of also just like the history of questions and students answers
whether they got it right or not; I guess that's harder to represent. No, that's beautiful, that's exactly right. Yeah, that's exactly it; so that was the other one I was hoping people might come up with, which is the idea of
this just being a history, like a history of all the previous questions you've asked, given, or all the questions the robot has given the person and what they've responded. So you could imagine it's like observation, question, reward, dot dot dot. And in fact those two representations here, the history and how good the student is, can, depending on your representation, be exactly isomorphic; so sometimes this can be a sufficient statistic to capture that history. And as was pointed out, one of the challenges with histories is that they grow unboundedly, so if you want to have like your neural network be predicting something, you might be able to use something like an LSTM, or you might want to summarize the state. So those are both great ideas for what the
states could be; there's not a right answer, both of them would be great, but there are also other ones. The actions: I heard many people share what the actions are; someone want to tell me what they put in there? I know you guys mentioned what the action space was. Sure, just whether you give an addition or subtraction problem. Exactly, so these are just like what the agent can actually do, the teaching agent: an addition
question or subtraction and the reward model is plus one if the student gets it right I saw some questions um about what
a Dynamics model is and um inside of the responses people were putting on the form what I mean by a Dynamics model here and we'll talk a lot more about this is what happens to the state of the
student after a question is given so in this case and I talked to some people about this who had you know had a great understanding of this already the idea would be sort of how does either that
history change after you give um an a question to the student or how does this the sort of internal knowledge of the student change so the hope would be as long as this sort of you know curriculum is vaguely reasonable that after you
give the student an addition question they now know more about addition you know or they're more likely to have mastered addition so that would be sort of the this idea of there being a
dynamics process where you start in one state, you take an action, and you transition to a new state afterwards, and we'll talk a lot more about that.
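To tie the pieces of the exercise together, here is a minimal sketch of that tutoring problem as a decision process (the mastery numbers and learning rates are made up for illustration; this is not the system from the paper she mentions). The state is a pair of mastery estimates, the actions are which kind of problem to give, the reward is +1 or -1 for a correct or incorrect answer, and the dynamics bump up mastery of whichever topic was practiced.

```python
import random

LEARN_GAIN = {"addition": 0.2, "subtraction": 0.1}   # assume the easier topic is learned faster

def step(state, action):
    """One practice problem: returns (next_state, reward)."""
    p_add, p_sub = state
    p_correct = p_add if action == "addition" else p_sub
    reward = 1 if random.random() < p_correct else -1      # +1 if right, -1 if wrong
    # dynamics: practicing a topic increases mastery of that topic
    if action == "addition":
        p_add = min(1.0, p_add + LEARN_GAIN["addition"])
    else:
        p_sub = min(1.0, p_sub + LEARN_GAIN["subtraction"])
    return (p_add, p_sub), reward

state = (0.6, 0.3)                     # student starts closer to mastery on addition
state, reward = step(state, "addition")
```

Notice that a reward-maximizing agent in this toy model is pulled toward the topic the student already does well on, which is exactly the failure mode discussed next.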
Now, what is the challenge with this particular representation? Yeah, can you say your name first please? Depending on your implementation, there's a risk that the agent just gives really easy
problems yeah in fact that's exactly exactly right and in fact that's exactly what we think will happen so we think that um an agent that is maximizing its
reward should only give easy questions so in this paper which I took the inspiration from for this example it
was very close to this where they tried to pick not um correctness but how long it took people to do problems and so if the student took less time to do problems um which isn't necessarily bad
in itself it might indicate some notion of fluency the agent got more reward but of course what that means is that you should just give really easy questions that will take the student no time to do because then the agent can get lots and
lots and lots of reward um and this is probably not what the intend like the designers of this system to try to help students learn things intended um they probably actually wanted the students to
learn both addition and subtraction um but I bring this up because this is an example of what is often called reward hacking where the reward that we specify does not necessarily um provide the
behavior that we really hope to achieve and we will talk a lot more about this in this case it's a fairly simple example where we can see it fairly quickly but there are a lot of cases where it's a lot more subtle to
understand whether or not the system really will do what you hope it will do um and we'll talk about that more throughout the course all right great so we're going to now just start to talk about sort of
sequential decision-making more broadly, and some of this will be review for some of you, but I think it's useful to go through and refresh our memories. So the idea in sequential decision-making under uncertainty is that we're
going to have an agent that is taking decisions or actions so I'm going to use actions and decisions interchangeably which are going to interact in the world and then they're
going to get back some sort of observation and reward signal so in in the first example I just gave you it's like the agent um you know provides a problem to the student and then they see whether the
student gets that correct or incorrect, and then they also use that information to get a reward, so sort of getting reward and feedback. And the goal in this case is for the agent to select actions to maximize the total expected future reward, meaning both the immediate reward they get now as well as the rewards they're going to get over time, and this generally is often going to involve balancing long-term and short-term rewards.
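Written out (using notation that will be defined properly later in the course, and hedged here as the standard convention: $r_t$ is the reward at time $t$ and $\gamma$ a discount factor), the objective she is describing, the expected discounted sum of rewards from the earlier exercise, is:

$$\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t \;\middle|\; \pi\right], \qquad 0 \le \gamma \le 1.$$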
So there are lots and lots of examples; I'll just go through a couple of them just to give you a sense. So one is something like web advertising; in this case, you know, Amazon for example
might choose like a web ad to show you, or a product to suggest to you; they might observe things like view time and whether or not you click on the ad, whether or not you make a purchase; and the goal in this case could probably be
for them to optimize either click time or view time or Revenue in the context of something like robotics the control space or the decision space might be something like
how to move a joint um and then the feedback that the the robot might get back might be something like a camera image of a kitchen and perhaps they just get a plus one if there are no more dishes on the
counter now just quick question could this potentially be a reward hacked specification I see some Smiles what could
happen? Yeah? Sorry, the robot could just push everything off the counter. Which, I will say, you know, it's tempting, right, you're like I'm just going to make it all go away, but in fact it does not
solve the problem and now you just have broken dishes and food on the floor um so that would not be a good thing to do so yeah this would be probably not a great reward to put you probably want a reward more like that the dishes are inside of the dishwasher and finally
clean so not just that they were put in there but actually that you ran the dishwasher so this would be a second example of of a setting another would be something like blood pressure control
where you could imagine that the agent gives recommendations like exercise or medication the feedback is things like blood pressure and then you would Define some reward like maybe plus one if
you're in a healthy range else some sort of you know sloping penalty for being outside of the healthy range all right so all of these are nice examples of the numerous ways where we often try to make sequences of decisions
under uncertainty. In general, we're going to assume that we have a finite series of time steps, so we're not going to be thinking about continuous time in this class (lots of interesting things there, we're not going to cover it). What we're going to assume is that the agent is making a series of decisions; so we're going to think of there being a series of time steps, like you know 1 minute, 2 minutes, 3 minutes, 4 minutes; the agent will take an action, the world will update given that action and emit an observation and reward, and then the agent receives that, updates, and then makes another decision, and we just close this loop; it's a feedback cycle.
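A minimal sketch of that loop in code (the environment and agent here are toy placeholders, not any particular library's API):

```python
import random

class ToyEnv:
    """Stand-in world: given an action, it updates and emits an observation and a reward."""
    def step(self, action):
        observation = random.random()                 # whatever the agent gets to sense
        reward = 1.0 if action == "good" else 0.0     # made-up reward signal
        return observation, reward

class RandomAgent:
    """Placeholder agent: picks actions at random (no learning yet)."""
    def act(self, observation):
        return random.choice(["good", "bad"])

env, agent = ToyEnv(), RandomAgent()
observation, total_reward = None, 0.0
for t in range(10):                                   # a finite series of time steps
    action = agent.act(observation)                   # agent takes an action
    observation, reward = env.step(action)            # world updates, emits observation + reward
    total_reward += reward                            # the agent's goal: maximize this in expectation
```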
In this case, as we sort of just talked about at a high level, we can think of there being histories, which are sequences of past actions, rewards, and observations up to the present time point. So the history h_t would consist of all the previous actions of the agent, the observations it receives, and the rewards it's
got in general this is something you could use to make decisions you could just keep track of everything you've experienced so far and then condition on that to try to make your next decision but we often are going to
assume that there's some sort of sufficient statistic that we can use to summarize the history; it will be much more practical in many cases. Yeah? Oh sorry, is the observation basically like the history? And what's your name? So the observation in this case would be something like the immediate information you get back after the last action; so in the case of the
student it would have been whether they get the last problem correct or not so just like a single time step and then the history would be everything like up to this time Point good
question. So in particular, often to make things tractable, and because often in reality it's not a terrible assumption, we're going to normally make the Markov assumption, and the idea is that we're going to try to come up with some sort of informative information state that is a sufficient statistic of the history, so we don't have to keep around all of the prior history of everything the agent's ever done or seen or gotten reward for. And what we say is that a state s_t is Markov if and only if the probability of going to the next state, given the current state and action, is the same as if you'd conditioned on the whole entire history. So another way to say this, which I think is kind of a nice evocative idea (this is not from me, it's from others), is that the future is independent of the past given the present. That means if you have a rich representation of your current state, you don't have to think about the previous history. And of course in general this will be true if you make s_t equal to h_t, but in general we're going to be thinking often of sort of projecting down to a much smaller state space. So for example you might say, well, I could think about someone's blood pressure from all of time, but maybe it's sufficient just to think of their blood pressure over the last like two hours in order to make my next decision.
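Written as an equation (standard notation: $s_t$ is the state, $a_t$ the action, and $h_t$ the full history up to time $t$), the Markov property she just stated is:

$$p(s_{t+1} \mid s_t, a_t) \;=\; p(s_{t+1} \mid h_t, a_t),$$

so conditioning on the compact state gives the same predictions as conditioning on everything that came before.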
Yeah? Is there a difference between state and observation? Great question, yes, in general. So I'll give you a particular example: Atari, which is these video games that DeepMind learned an agent to play. What
their state was in that case was the last four frames; so not just the last frame, the last four frames. Does anyone have an idea why you might want four frames instead of one? Yeah? Maybe, like, so
you can see like if there's momentum to an object already moving. Exactly, it gives you velocity and acceleration. Yeah, so there are a number of cases where you might think that there are parts of the state that really depend on temporal differences, and then in those cases you're going to want more than just the immediate state. Great questions.
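A small sketch of that frame-stacking idea (illustrative only; DeepMind's actual Atari preprocessing differs in its details): keep the last four frames in a buffer and treat the stack as the state, so quantities like velocity are recoverable.

```python
from collections import deque
import numpy as np

FRAME_SHAPE = (84, 84)            # assumed downsampled frame size, for illustration
frames = deque(maxlen=4)          # only the 4 most recent frames are kept

def make_state(new_frame):
    frames.append(new_frame)
    while len(frames) < 4:        # at the start of an episode, pad with copies of the first frame
        frames.append(new_frame)
    return np.stack(list(frames)) # state has shape (4, 84, 84)

state = make_state(np.zeros(FRAME_SHAPE))
```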
All right, so why is this Markov assumption popular? It's used all the time; it's simple; it can often be satisfied, as we were just discussing, if you use some history as part of the state; generally there are many cases where you can just use the most recent state, not always, but in
many cases; and it has huge implications for computational complexity, data required, and resulting performance. What I mean by the resulting performance is that in many of these cases, just like in a lot of statistics and machine learning, there will be trade-offs between bias and variance, and so there'll be a trade-off too between using states that are really small and easy for us to work with but aren't really able to capture the complexity of the world and the applications we care about, so that it might be fast to learn with those sorts of representations but ultimately performance is poor. So there'll often be trade-offs between the expressive power of our representations versus how long it takes us to learn. Right, so one of the big questions when we talk about sequential decision-making processes is: is the state Markov, and is the world
partially observable? So, partially... oh, yeah? My question is that, doesn't the Markov assumption make this reward attribution problem somehow like harder? Why? All right, good question. Well, I don't know, I guess you could imagine it might make it easier or harder; there's still the question of: you
might only get periodic rewards and you still would have to figure out which decisions caused you to get to a state where you got those rewards yeah so let me think of it so you might have a case where the reward
might be a function of your current state um yeah uh let me think if I can think of a um a good example okay so
let's say maybe, um, you want to run a marathon, and you get a plus 100 if you make it. The Boston Marathon is a competitive marathon to get into, so you get a plus 100 if you can qualify for Boston. And you do a lot of different things in your training regime: you eat healthy and you sleep and you train, and, um, you get zero reward for any of that, and then on the day of your race you see if you qualify for Boston. So your reward for getting into Boston only depends on that current state, but you don't know which of those decisions, was it that you ate well, was it that you slept, was it that you trained every week for 17 weeks, caused you to get to the state in which you qualified for Boston. Um, and so that's independent of the Markov assumption in that case, because
you still have the question of what series of decisions allowed you to get to a state that achieved High reward great question so another thing is whether the world is partially observable we will
mostly not be talking about this in this class. Mykel Kochenderfer has a great class where he talks about this a lot. But this does relate to the case we talked about with students. So for students, one way
you could think about that is that there's some latent state that you can't directly access which is whether or not they know addition or they know subtraction but you get noisy observations when they do problems where
they get it right or get it wrong and the reason it's noisy is because you know all of us make mistakes on addition sometimes whereas I have complete faith that everyone here actually knows how to do addition um and sometimes you might
guess right even if you don't know it. So the idea is that it's latent, you don't directly get to observe it. Um, this comes up, uh, in a lot of robotics problems too, so I'll just give a quick example here. If you have a robot that uses a laser rangefinder, these little arrows or lasers, to figure out its environment, it could have like 180° of laser range finders, and what it's getting back is just the distance in all these different angles till where it hits a wall. So as you can imagine, many rooms would look identical, so any room that has kind of the same dimensions would look identical to that robot, and it wouldn't be able to tell, is it on the third floor or the second floor. So that would be a partially observable case, where it can't uniquely identify its state based on this observation.
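To put that in symbols (this notation isn't introduced in the lecture, it's just the standard partially observable setup): instead of seeing the state s_t directly, the agent receives an observation o_t drawn from an observation model, and has to act based on the history of observations:

o_t \sim p(o_t \mid s_t)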
So we won't talk too much about that, but it's important to know about. Another thing is whether the dynamics are deterministic or stochastic. So there are many cases where things are close to deterministic, like if I, um, put down a piece on a go board, it goes there. But there are other things that we often treat as stochastic, like when I flip a coin I don't know whether it's going to be heads or tails. So that'll be an important decision. And then the final thing is whether the actions influence only immediate reward, um, or reward in the next state.
So as an example of this, you might imagine you were making a policy for what ad to show to people, and you just imagine for each person coming onto your website you show an ad, and then they go away and they either buy something or they don't. Um, a bandit would be a case where you just have a series of customers coming in, and so whether or not, um, I showed a particular ad to one person and he clicks on it or not does not impact whether or not Ellen comes along and engages with an ad. So
that's a case where it impacts your immediate reward but not the next state. We can talk more about that. All right, let's think about a particular sort of running example, we'll think of a Mars Rover. So the Mars Rover is a Markov decision process. We imagine that Mars is really small, we only have seven places on Mars, so in this case we would have the state is the location of the Rover, which is one of seven discrete locations. We could have actions called try left and try right, meaning that our Rover is not perfect, so sometimes it tries to go in a direction and it doesn't succeed. And let's imagine that we have rewards, which is, there are some interesting field sites, and so if you spend time over here you get a plus one and if you spend time over here you get a plus 10, and else you get zero reward. So this would be a particular case where we could think of there being these states and these actions, um, and rewards (a small sketch of this setup is written out below).
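Here's a minimal sketch of that seven-state Mars Rover setup in code. Placing the +1 at S1 and the +10 at S7 matches the reward vector described a bit later in the lecture; the names are otherwise just illustrative.

```python
# Seven discrete locations on our tiny Mars.
states = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

# Two imperfect actions: the rover *tries* to move, but may fail.
actions = ["try_left", "try_right"]

# Reward as a function of the current state only (one common convention):
# +1 at S1, +10 at S7, 0 everywhere else.
reward = {s: 0.0 for s in states}
reward["S1"] = 1.0
reward["S7"] = 10.0

print(reward["S7"])   # 10.0
```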
So when we think of, um, a Markov decision process, we think of there being a dynamics model, um, and a reward model. So in particular, the dynamics model is going to tell us how the state evolves as we make decisions. We will not always have direct access to this, but the idea is that in the world there is some dynamics process, and things are changing as we make decisions. So in particular, we generally want to allow for stochastic systems, meaning that given we're currently in a state and we take a particular action, what is the distribution over our next
states that we might reach. So for example, I'm that Mars Rover, and, um, I'm going to try to go to the right. It might be that I can go to the right with like 50% probability, but I'm not a very accurate Rover, and so 50% of the time I go to the left, or maybe I stay in the same location. So this dynamics model just specifies what actually is the distribution
of outcomes that can happen in the world when I make a decision. The reward model predicts the immediate reward, which is, if I'm in this state and I take the action, what is my expected reward. I want to highlight here that there are different conventions: you could have the reward be a function only of the current state, excuse me, it could be a function of the state and the action you take, or it could be a function of the state, the action you take, and the next state you reach. You'll see all of these conventions in reinforcement learning papers. Probably the most common one is this one, um, but we'll try just to be specific whenever we're using it so it's clear, and you can always ask me or ask any of the TAs if it's not clear.
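For reference, the dynamics model and the three reward conventions just mentioned can be written like this (standard notation, not anything specific to these slides):

p(s_{t+1} = s' \mid s_t = s, a_t = a)

R(s) = E[r_t \mid s_t = s], \quad R(s, a) = E[r_t \mid s_t = s, a_t = a], \quad R(s, a, s') = E[r_t \mid s_t = s, a_t = a, s_{t+1} = s']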
So let's think about sort of what a stochastic Mars Rover model would be. So I've written down a particular choice for the reward, and let's imagine that part of the dynamics model is the following: if I start in state S1 and I try to go to the right, then I have some probability of going to S2, else I have some probability of staying here. What I want to be clear about here, and this relates to the question before about, um, models, is that this is like the agent's idea of how the world works. It doesn't have to be how the world actually works. So what I told you on the previous slides is, imagine that in this world, in reality, this gives you plus one and this gives you plus 10 in terms of the reward; that's how the world actually works. But the agent might have the wrong model of how the world works, because it only learns about the world through its experiences, or it just might have a bad model. So this is an example of sort of a model-based Markov system, where the agent would have a particular representation of the dynamics model and a particular assumption about how the rewards work (a small sketch of such a dynamics model is below).
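As an illustration, here is one way to write down a stochastic dynamics model for the rover and sample from it. The specific probabilities (e.g. a 50% chance that "try right" succeeds from S1) are placeholders along the lines of the example in the lecture, not the numbers on the slide.

```python
import random

# Agent's (possibly wrong) model of the dynamics:
# dynamics[state][action] -> list of (next_state, probability) pairs.
# Illustrative numbers: "try_right" from S1 succeeds half the time, else the rover stays put.
dynamics = {
    "S1": {
        "try_right": [("S2", 0.5), ("S1", 0.5)],
        "try_left":  [("S1", 1.0)],            # nowhere to go on the left edge
    },
    # ... entries for S2..S7 would follow the same pattern
}

def sample_next_state(state, action):
    """Draw a next state from the model's distribution p(s' | s, a)."""
    next_states, probs = zip(*dynamics[state][action])
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("S1", "try_right"))   # "S2" or "S1"
```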
In these settings, we have a policy. A decision policy is just going to be a mapping from states to actions; it's like an if-then table. If it's deterministic, we just have a single action that we would take in a particular state, like maybe we always show this one ad to a particular customer. Or we could have a stochastic policy where we randomize this, so this would be something like, oh, when this customer shows up, I show, you know, a vacation ad or I show a board game ad with, you know, 90% probability versus 10%. Both types of policies are really common (a small sketch of both is below).
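Here's a minimal sketch of both kinds of policy over the Mars Rover states, assuming the same state and action names as the earlier sketch. The "always try right" deterministic policy matches the example that comes up next; the stochastic one just uses made-up probabilities.

```python
import random

states = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

# Deterministic policy: one action per state (here, always try to go right).
deterministic_policy = {s: "try_right" for s in states}

# Stochastic policy: a distribution over actions per state (illustrative numbers).
stochastic_policy = {s: {"try_right": 0.9, "try_left": 0.1} for s in states}

def act(policy, state):
    """Pick an action: directly for a deterministic policy, by sampling otherwise."""
    choice = policy[state]
    if isinstance(choice, str):
        return choice
    actions, probs = zip(*choice.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, "S4"))   # "try_right"
print(act(stochastic_policy, "S4"))      # "try_right" about 90% of the time
```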
It can depend in part on what sort of domain you're in and whether you're trying to learn from that experience. Okay, so let's see what that would look like in this case. So for the Mars Rover, you could say that no matter where it is, it always just tries to go right. So that would just be one example of a policy you could have, and it just requires you to specify for every single
state, what is the action you would take, or what is the distribution over actions you would take. So in this sort of setting we're normally interested in two m... oh yeah, yeah, question? So it's making decisions based on the state that it's in; can it learn to switch between different types of policy, so not just different actions based on the state, but also switch to checking the past state or the future state, the same way that in deep learning it tries a bunch of different functions? Can it do that or can it not do that? Remind me your name? Um, yeah, so great question. It will, in general. In general when we're learning, it'll change its policy a lot over time, so it might start with a particular policy and then over time it
will explore lots of different policies in trying to search for something that's good that's a great question and that relates to what I was just putting here which is two of the central questions we're going to talk a lot about
particularly at the beginning is evaluation and control evaluation says someone gives you a fixed policy and you want to know how good it is like maybe your boss says hey I think this is the right way to advertise to customers and
we're going to make a lot of money and you go out and you just deploy that particular decision policy and you see how much money you make so that would be evaluation control is you actually want
to find the best policy and so in general to actually find the best policy we're going to have to do a lot of trial and error and we want to do that in a strategic efficient way so we can quickly learn what that good policy
is so in general we're going to be talking about things I just want to highlight we're going to sort of build up in complexity in terms of the type of problems we're talking about so we're going to be thinking about both like
planning and control and sort of thinking about uh how complicated these spaces are okay so we're going to think about
evaluation and control because evaluation is often a subpart of doing control if you know how good a policy is you may be able to improve it and then we're going to talk about tabular and
function approximation methods because we're going to want to be able to solve really large problems and then we're going to talk about both planning and
learning. In planning, we're going to assume someone gives us that dynamics model and that reward model and the state and action space, and we're just going to try to find a really good policy. And in learning, we're going to
actually have to control the decisions we make to give us information that allows us to identify an optimal policy. All right, so what we're going to start with is sort of the simplest of these settings, where we're going to assume that we have a finite set of states and actions, we're given models of the world, meaning someone writes down for us what those look like, and we want to evaluate the performance of a given decision policy and then compute the optimal policy, and we can think of this really as AI planning.
Okay, so to think about how this works, we're going to start with Markov processes and then build up to MDPs, and this is relevant because it turns out you can think of evaluation as basically being a Markov reward process. Okay, so how does a Markov chain work? And just raise your hand if you've seen Markov chains before. Awesome, okay, so most people have, which is great. So this is a memoryless random process; there are no rewards yet. Um, there's a finite set of states in this case, and we have a dynamics model, and if it's just a finite set of states we can just write this down as a matrix. Okay, it just says what's the probability of going to the next state given the previous state. And so you could have, say, a Markov chain transition matrix for our Mars Rover case, and if you want to get an episode you just sample. So let's say you always touch down at state S4; you just sample episodes from that particular chain (a small sketch of this sampling is below).
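Here's a minimal sketch of sampling episodes from a Markov chain transition matrix over the seven rover states. The matrix entries are made-up placeholders, not the numbers from the slide; note that each row sums to one, but the columns need not (which relates to the question coming up next). The episode always starts in S4, as in the example.

```python
import numpy as np

# Transition matrix P: P[i, j] = probability of moving from state i to state j.
# Illustrative numbers only; every row sums to 1, the columns do not.
P = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.4, 0.3, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.4, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.4, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
])

def sample_episode(start_state=3, length=10, rng=np.random.default_rng(0)):
    """Sample a trajectory of state indices (index 3 corresponds to S4)."""
    states = [start_state]
    for _ in range(length):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_episode())   # e.g. [3, 4, 3, 2, ...]
```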
Yeah? Um, all rows and columns add to one? All of the... um, what's your name? Yeah, so all of the rows have to sum to one. Then is it a coincidence that the columns do too? Yeah, okay, yeah, I was thinking just now that I should have changed that; it's a good question, yeah, because... and we'll see also why that's important later. Okay, a Markov reward process is a Markov chain plus rewards. So same as before, but now we have a reward function that tells us how good each of those states is, okay, and we're also going to have a discount factor, and I'll talk about that in a second. We still have no actions. Um, and we can express R as a vector, so in this we could imagine our Markov reward process where we have a plus one in S1, a plus 10 in S7, and zero in all other states. Okay, in this case, this is where we start to see the ideas that are going to be really useful for decision
processes, which is, we can start to think about how good particular trajectories are. So we're going to have a horizon, and you're going to see this in your homework too, which is the number of time steps in each episode. It could be infinite or it could be finite; it's basically how many time steps you get to make decisions. And the return, which we're going to call G_t, is just going to be the discounted sum of rewards from the current time step to the end of the horizon, and a value function in this case is just going to be the expected return. In general this is not going to be the same as the actual return unless you just have a deterministic process, because the idea is that you're going to have stochasticity in the trajectories you reach, and because of that you're going to get different rewards.
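Writing those two definitions out in the usual notation (\gamma is the discount factor; the sum of rewards runs to the end of the horizon):

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots

V(s) = E[G_t \mid s_t = s]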
So you might wonder, if you haven't seen it before, why do we have this discount factor thing, so that we're weighing, um, earlier rewards more than later rewards. One is that it's just mathematically really convenient; it's going to help us not sum to infinity, particularly if we have an infinite number of time steps where we can make decisions. Um, and it turns out humans often act as if there is a discount factor; often we sort of implicitly weigh future rewards less than immediate rewards, um, and this is true for organizations too. And if the episode lengths are always finite, you can always use gamma equal to one, meaning you don't have to make a large discount. But when you have infinite horizons, it's generally important to make this less than one so your rewards don't blow up. Part of that is because it's really hard to compare infinities, so it's hard to say that this policy that has infinite reward is better than this other policy that has infinite reward, whereas you can keep everything bounded if you have a gamma less than one.
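One way to see the "keep everything bounded" point: if every per-step reward satisfies |r_t| \le R_max and \gamma < 1, then the geometric series bounds the infinite-horizon return:

|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}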
All right, next time we will start to talk about how we actually can compute the value of these types of Markov reward processes and then start to connect it to decision processes. I'll see you on Wednesday, thanks.