YouTube Video
By Unknown
Summary
Topics Covered
- Chain of Thought Prompting Magic
- Microsoft's Orca Model Distills GPT-4 Reasoning Into a Small 13B Parameter LLM
- Self-Training Lets AI Improve by Teaching Itself
- Relabeling Failed Trajectories Turns Mistakes Into Training Data
- Even GPT-4V Makes Embarrassing Mistakes Like Typing Email as Password
Full Transcript
Okay, let's just get started. Welcome to lecture 14, everyone. I hope you've been doing well and managing all of the various deadlines. Today we'll be looking at two interesting applications of language models. In the first half I'll be talking about using language models to reason in domains like math and geometry, doing things like spatial reasoning, and in the second half of the lecture I'll be talking about how you can use language models to take actions in grounded environments. A little bit of a disclaimer: a lot of today's content is research that was done in the last three or four years, so there are plenty of unanswered questions and not a lot of settled answers, and maybe we can have more of a discussion around these topics. Okay, so let's get started with reasoning.
Experts like to start a lecture on reasoning by talking about the various kinds of reasoning, so I'm going to do that here. At a high level, reasoning is about using facts and logic to arrive at an answer. More concretely, there are three distinct categories of reasoning we can talk about. The first one, probably the one most of you are familiar with, is deductive reasoning, where we go from rules of logic along with premises to a firm conclusion. An example: we have the sentences "all mammals have kidneys" and "all whales are mammals", and we can come up with the conclusion "all whales have kidneys", and we could do multiple such steps of reasoning. A second form of reasoning is inductive, where given observations we derive conclusions. Maybe we've learned from experience that every time we see a creature with wings it is usually a bird; let's say we observe a state where we see a creature with wings, and using our experience we can come up with the conclusion that the creature is likely to be a bird. That form of reasoning is inductive. And finally we have abductive reasoning, where we're given an observation and start drawing possible explanations. Maybe you see a car that cannot start, and there's a puddle of liquid under the engine, and you start drawing inferences about the situation; one of them could be that the car has a leak in the radiator.
Okay. Apart from that taxonomy, we can also think of reasoning in formal and informal terms, where formal reasoning involves using axioms and rules of formal logic to derive truth conditions. There's also informal reasoning, which is what you and I probably do every day, where we just reason about everyday situations and use common sense to derive conclusions. For most of the lecture, when I say reasoning I will mean informal deductive reasoning, and it's often going to involve multiple steps. Okay, so let's come back to language models.
We've learned in lectures 9, 10, and 11 that large language models are really good at coming up with plausible continuations of text that reflect human preferences and constraints. Today we'll try to answer whether they can also reason. One of the most basic ways we can try to answer this question is via prompting, and we've probably already seen this: there's a popular method called Chain of Thought prompting, where you get a language model to produce reasoning steps before producing an answer, and we can do this by providing some in-context examples with explicit reasoning steps that the language model can then mimic at test time. So that's Chain of Thought prompting. Another rather surprising property of language models is that sometimes you don't even have to show them these in-context examples: you can just prompt them with the sentence "let's think step by step" and get these reasoning rationales before they produce an answer. Okay, so that's pretty simple.
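As a concrete sketch, the zero-shot variant is often run as two prompting stages. Here the `llm` callable and the exact prompt wording are stand-ins, not any paper's official templates:

```python
def zero_shot_cot(llm, question):
    """Two-stage zero-shot chain-of-thought sketch.

    `llm` is any text-completion function: prompt -> continuation.
    Stage 1 elicits a rationale; stage 2 conditions on it for the answer.
    """
    # Stage 1: the trigger phrase elicits step-by-step reasoning.
    rationale = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: condition on the rationale and ask for just the answer.
    answer = llm(
        f"Q: {question}\nA: Let's think step by step. {rationale}\n"
        "Therefore, the answer is"
    )
    return rationale, answer
```

In practice both calls go to the same model, with the second prompt simply appending an answer-extraction phrase after the generated rationale.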
But let's keep going. Another popular way to prompt language models to do reasoning is via self-consistency. Here, instead of greedily sampling a rationale followed by an answer, we're going to sample multiple rationales and, correspondingly, multiple answers. In the figure on the right we have a question; what you would normally do with Chain of Thought prompting is greedily decode a rationale and then, conditioned on the rationale, generate an answer. With self-consistency we sample multiple times, so we sample multiple rationales, they lead to multiple answers, and then we pick the answer that is most common, the idea being that if an answer keeps appearing, that is, if the majority of rationales agree on it, then it's more likely to be correct. The authors of self-consistency find that on a variety of mathematical reasoning tasks, adding this simple idea of sampling multiple times and doing majority voting improves performance pretty drastically over standard Chain of Thought. Interestingly, when I saw this result the first time I thought this is just like ensembling, which we learned in CS229: if you want to boost the performance of your system, you produce, say, ten classifiers with different random seeds, each produces a classification decision, and you do majority voting. But it turns out self-consistency is doing maybe a little more than simple ensembling: the authors also compared an ensembling approach where it's the same language model with multiple different prompts and you do majority voting there, and it turns out that self-consistency is better than just simple ensembling.
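The majority-vote step itself is tiny. A minimal sketch, assuming an `llm_sample` callable that returns one (rationale, answer) pair per call with sampling enabled (temperature > 0) so different calls can disagree:

```python
from collections import Counter

def self_consistency(llm_sample, question, n_samples=10):
    """Sample several rationales and return the answer that the
    largest number of sampled rationales agree on (majority vote)."""
    answers = [llm_sample(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The rationales themselves are discarded; only the answers they lead to get voted on.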
Okay, so earlier today I said I'd be talking about multi-step reasoning. So far we've looked at math problems, and at prompting, but not necessarily multi-step reasoning.
One of the main aspects of multi-step reasoning is that it involves breaking down a large problem into several subparts, answering each of the subparts, and then combining everything into a solution. This kind of decomposition strategy was integrated into another prompting method called least-to-most prompting. The idea behind least-to-most prompting is, like I said, that given a question we first break it down into subquestions, as shown here; then the language model answers each of the subquestions, and then, conditioned on its answers to the subquestions, it generates the final answer. This is how it looks for a math reasoning problem: in standard Chain of Thought prompting you would have a question followed by a rationale and the answer; with least-to-most prompting, which is this decomposition strategy, you take the question and, instead of directly producing a rationale, you ask the language model to break it down into subproblems. You get, say, two different subproblems, you answer both of them, and then you condition your final answer on the answers to those subproblems. Okay, so that's just a prompting method. One interesting experiment from least-to-most prompting showed that you can sometimes generalize from a small number of reasoning steps to a much larger number of reasoning steps. Here, in this math word problem, there are two reasoning steps, and if we show this prompt to the language model as an in-context example, we see that it continues to generalize even on examples that require more than five steps of reasoning, in a way that's much better than standard Chain of Thought. But it's not entirely clear if structuring inference in this manner is really fundamental: one of the other results they reported was that, with enough prompt engineering, the row corresponding to the best normal Chain of Thought is on par with least-to-most prompting. Still, it's an interesting idea: break problems down into subproblems, solve the subproblems, and build up a solution based on your answers to the subproblems.
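The two-stage decomposition can be sketched as follows. The prompt strings and the `llm` callable are illustrative assumptions, not the paper's exact templates:

```python
def least_to_most(llm, question):
    """Least-to-most prompting sketch: decompose the question, solve the
    subquestions in order (feeding earlier answers forward), then answer
    the original question conditioned on the solved decomposition."""
    # Stage 1: ask the model to break the problem into subquestions.
    decomposition = llm(f"To solve '{question}', we first need to answer:")
    subquestions = [s.strip() for s in decomposition.split("\n") if s.strip()]
    # Stage 2: answer each subquestion, accumulating answers in context.
    context = question
    for sub in subquestions:
        sub_answer = llm(f"{context}\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {sub_answer}"
    # Final answer conditions on all subquestion answers.
    return llm(f"{context}\nQ: {question}\nA:")
```

The key design choice versus plain Chain of Thought is that earlier sub-answers are fed back into the prompt, so each subproblem is solved with the previous solutions in context.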
Okay, so all of this was different prompting methods to get reasoning behavior out of language models. Can we do something more? One thing we might be interested in is, instead of trying to get really large language models to do reasoning, somehow getting this kind of reasoning behavior into a smaller language model. One popular approach for doing that is distillation, where, say, you fine-tune a smaller LLaMA model by teaching it to imitate a larger model. That's what we're going to look at now. This model is called Orca, and at a high level, Orca fine-tunes a smaller 13-billion-parameter LLaMA language model on explanations produced by GPT-4. Constructing this data is pretty simple; it has three steps. The first step is to get a wide variety of instructions from the FLAN v2 collection. FLAN v2 is basically a dataset that accumulates multiple datasets into one collection, and it consists of instructions paired with questions and answers; I'll show an example in a moment. Then we prompt GPT-4 or ChatGPT with these instructions along with a system message, and the objective of the system message is to get ChatGPT or GPT-4 to produce an informative explanation along with the answer. Here we have a question about simple data processing, about calculating the median, and there's a system instruction that says please justify your steps and answer step by step; in producing its output the model provides a fairly detailed explanation of how it got to the answer. What Orca does is use precisely this explanation to fine-tune a much smaller model. So that's what's going to happen: once we have these explanations, we fine-tune a much smaller 13-billion-parameter LLaMA model on these explanations.
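The data-construction step might be sketched like this. `teacher` stands in for a GPT-4-style API call, and the field names are illustrative, not Orca's actual schema:

```python
def build_distillation_data(teacher, instructions, system_message):
    """Orca-style data construction sketch: prompt a teacher model with a
    system message that asks for step-by-step explanations, then pair each
    instruction with the explained response. The resulting pairs feed a
    standard supervised fine-tuning run on the smaller student model."""
    data = []
    for query in instructions:
        explained = teacher(f"{system_message}\n\nUser: {query}")
        data.append({"prompt": query, "completion": explained})
    return data
```

The student never sees the teacher's weights; it only imitates the explanation-rich text, which is what makes this distillation through data rather than through logits.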
Okay, so far we've looked at math reasoning and grade-school math problems; let's turn to a different benchmark for reasoning. We're going to look at BIG-Bench Hard, another dataset for multi-step reasoning, and let's look at some examples from it. It consists of multiple different subtasks, 23 in total, and I'm going to show a few. One of them is evaluating Boolean expressions: the question is, what does "True and False and not True and True" evaluate to? With Chain of Thought, the model can evaluate each of the subexpressions and get to the final answer. Another example of a task from BIG-Bench Hard is date understanding (not data understanding): the question is, tomorrow is some given date, so what is the date one year ago from today, in a given format? It's paired with some options, and again the model can think step by step, following basic Chain of Thought, and come up with an answer. So this is the flavor of tasks in BIG-Bench Hard: most of them involve multi-step reasoning, and they're fairly synthetic but also reasonably hard for language models.
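To make the Boolean-expressions task concrete, here is a small reference evaluator that records the subexpression-by-subexpression steps a chain-of-thought answer would walk through. This is scoring-side code, not something the model runs:

```python
def eval_boolean_cot(expr):
    """Evaluate a Boolean expression one parenthesised subexpression at a
    time, recording each step the way a written rationale would.
    The expression is trusted and contains only True/False/and/or/not."""
    steps = []
    # Resolve innermost parenthesised subexpressions first.
    while "(" in expr:
        start = expr.rfind("(")
        end = expr.index(")", start)
        sub = expr[start + 1:end]
        val = eval(sub, {"__builtins__": {}})  # only boolean keywords remain
        steps.append(f"{sub} = {val}")
        expr = expr[:start] + str(val) + expr[end + 1:]
    result = eval(expr, {"__builtins__": {}})
    steps.append(f"{expr} = {result}")
    return result, steps
```

For the lecture's example, "True and False and not True and True" evaluates to False, since `not` binds tighter than `and` and the first conjunction already fails.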
Okay. Another example is geometric shapes, and this one is pretty surprising in that language models can do anything here at all: you're given an SVG path element (I have no idea what this renders as), and the question is, given the SVG, what shape are you going to get? There's a bunch of options, and again the model, prompted with "let's think step by step", will produce some answer; we don't know if it's correct, but it will produce some answer. So it's basically a dataset covering different kinds of reasoning, spatial reasoning, date understanding, evaluating Booleans, and it's multiple choice, so it's easy to get an accuracy number. It covers a wide variety of different tasks.
On the left we have performance from really large language models; this is zero-shot Chain of Thought with just the prompt "let's think step by step". GPT-4 has some potential contamination issues with BIG-Bench Hard, so maybe we can ignore that column. Vicuna was, I think, a few months ago the state-of-the-art instruction-tuned LLaMA 13B model, and Orca is again a LLaMA 13B, but fine-tuned specifically on this explanation data, where you have instructions, then explanations from ChatGPT or GPT-4, and you fine-tune on that. We see that overall it outperforms ChatGPT, maybe because it's specialized to these reasoning problems, and it outperforms Vicuna, which was not trained on these really extensive explanations. So that's one way you can get a smaller language model to display some kind of reasoning behavior. Okay, so this was all great, and we're very happy that you can generate rationales from a big language model and then fine-tune a smaller language model on them, but then someone could ask: why not just fine-tune the big language model on its own rationales? That's also been explored, and there's a bunch of different methods that do this; I'm going to talk about one of them, called reinforced self-training, or ReST.
It alternates between two stages. In the first stage, given a reasoning problem, and perhaps the prompt "let's think step by step", I have the language model generate multiple rationales, and then I filter these rationales based on whether they give me the correct answer or not. Think about a word algebra problem: someone has three apples, someone else has four apples; you generate a rationale, and if the answer comes out to be seven you keep that rationale, and if the answer is twelve you leave that rationale out. Then I do an update step, where I take the rationales that passed my first-stage filter and fine-tune the language model on them. And I can do this iteratively: now I have an updated language model, I can hopefully get better rationales, fine-tune on those to get an even better language model, and keep going.
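One round of this loop might look like the following sketch, where `generate`, `is_correct`, and `fine_tune` are placeholders for the sampling, answer-checking, and training machinery:

```python
def rest_iteration(generate, is_correct, fine_tune, problems):
    """Sketch of one reinforced self-training (ReST) round.

    generate(problem)           -> list of (rationale, answer) samples
    is_correct(problem, answer) -> bool (e.g. match against a gold answer)
    fine_tune(dataset)          -> updated model (opaque here)
    """
    kept = []
    for problem in problems:
        for rationale, answer in generate(problem):
            # Filter: keep only rationales that reach the correct answer.
            if is_correct(problem, answer):
                kept.append((problem, rationale, answer))
    # Improve: fine-tune on the filtered rationales; the updated model
    # then generates (hopefully better) rationales in the next round.
    return fine_tune(kept), kept
```

Note that only final answers are checked; a rationale with wrong intermediate steps that lands on the right answer still survives the filter, which is one known weakness of this style of self-training.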
The results are promising, but with caveats. On GSM8K, which is a grade-school math dataset of algebraic word problems, as you increase the number of iterations of self-training we see a slight improvement in performance, and then it starts degrading. MATH is another dataset that focuses on multi-step reasoning, covering math problems, and on this dataset we see that as we do more iterations of this reinforced self-training paradigm, accuracy improves. The numbers in orange here are a much larger PaLM model, the numbers in blue are a smaller model, and the dashed lines represent what you get with supervised fine-tuning on human-provided rationales. One of the promising things about this approach is that when you do multiple iterations of training on your own rationales, you can outperform human-generated rationales. That is exemplified again in this graph: the blue bar represents accuracy when you take the PaLM model and do supervised fine-tuning on all human-provided rationales; orange is if you fine-tune on one human-written rationale per training example; and green is what you get if you fine-tune on one model-generated rationale per question, chosen at random, so it's controlling for the number of rationales. We see that it outperforms human-provided rationales, and if you do the full multi-step iterative procedure, where you keep improving the model, we see again a boost in performance. So that's super promising.
But let's start revisiting the question we asked at the beginning about reasoning in language models. One way of answering that question is to apply all these methods and look at benchmarks, but maybe the way to answer it correctly is to be more systematic: come up with counterfactual tasks and be very careful about possible data contamination. I'm going to show some results around that. We started the lecture with Chain of Thought, and maybe the first question to ask is: are the rationales the model produces with Chain of Thought faithful? What I mean by faithful is this: maybe the model produces some rationale and then produces an answer, but the answer does not even depend on the rationale it produced. Maybe the question was "Tom has three apples and Jerry has four apples", and the rationale it produced was "Tom has three apples, Jerry has four, 3 + 4 is seven", and then the answer it gave was 25. In a case like that, you'd say that the model was not faithful to its rationale.
What we see in this plot is a very careful experiment, where on the x-axis we have the number of reasoning samples. The setup is something like this: for every question the model produces a rationale, and a rationale here is multiple sentences. What we're going to do is force the model to exit early from its rationalization and just force it to produce an answer. So if it produced four rationale sentences, I can exit early right after the first sentence and ask it to produce an answer, I can exit after the second sentence and ask for an answer, and so on, and what I plot on the y-axis is the model's accuracy after early exiting in this procedure. Say I exited early after just one rationale sentence and the model produced exactly the same answer it would have if it had seen all four sentences of its rationale; then maybe we can conclude that the reasoning is not faithful, since it doesn't matter whether the model sees the full rationale or just the first sentence. And if you take that to the extreme, maybe you terminate it without any rationale at all and it still produces the same answer. The results here are somewhat mixed, but we see that there are enough datasets where it doesn't matter whether the model sees the full rationale before answering or you exit early; you get the same answer either way, which means that sometimes these rationales may be post-hoc explanations of the model's answer.
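The early-exit probe itself is simple to sketch; `answer_given_rationale` stands in for querying the model with a truncated rationale:

```python
def early_exit_agreement(answer_given_rationale, question, rationale_sentences):
    """Faithfulness probe sketch: truncate the rationale after each prefix
    (including the empty prefix, i.e. no rationale) and ask for an answer
    each time. If the final answer already appears at most truncation
    points, the rationale is plausibly a post-hoc explanation rather than
    the cause of the answer."""
    answers = []
    for k in range(len(rationale_sentences) + 1):  # k = 0 means no rationale
        prefix = " ".join(rationale_sentences[:k])
        answers.append(answer_given_rationale(question, prefix))
    final = answers[-1]  # answer after seeing the full rationale
    agreement = sum(a == final for a in answers) / len(answers)
    return answers, agreement
```

An agreement near 1.0 means truncation barely changes the answer, which is the unfaithful case; a faithful model's answers should shift as more of the rationale becomes visible.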
Okay. Another experiment that tries to answer this exact same question: you can take these rationales and start corrupting them. Maybe your rationale was four sentences long, and I generate the first and second sentences, corrupt the third, generate the fourth, and then ask the model for its answer. If it turns out that no matter how much I corrupt my rationale the model produces the same answer, I can again conclude that the answer did not depend on my rationale. On the x-axis we're looking at the percentage of reasoning steps completed before I insert a mistake into the rationale. What you would hope to see is a strictly increasing trend: if I add a mistake after the very first step, that should change the answer a lot, and if I add a mistake after the last step, that maybe doesn't change the answer all that much. But again we find that for some datasets, you can add a mistake in the first sentence of your rationale and the answer is not going to change much, which is also an indicator that maybe these rationales are post-hoc explanations of the model's behavior. So yeah, there are a lot of lines here, so if anyone has questions... I see a few blank faces in the audience. Okay, so let's keep moving.
So that was about whether Chain of Thought expresses reasoning that the model is faithful to. Another question you could ask is: what if I change my setting a little bit? Say I observe that my model is able to do arithmetic in base 10, so it can answer something like 12 + 14. Does that mean my model knows how to do arithmetic, or was this exact example just present in the training data? One way you can test for this is by creating counterfactuals which, based on our understanding of the data, you expect not to be present that frequently in the training data. So instead of doing base-10 addition, you could do addition in base 9, and if the model has the same accuracy in base 9, then you can conclude that maybe the model has understood how to do addition.
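A reference scorer for this counterfactual is easy to write; a sketch:

```python
def add_base(a, b, base=10):
    """Add two digit strings in the given base; useful as the gold answer
    when probing a model on counterfactual (e.g. base-9) arithmetic."""
    to_int = lambda s: sum(int(d) * base ** i for i, d in enumerate(reversed(s)))
    n = to_int(a) + to_int(b)
    digits = ""
    while n:
        digits = str(n % base) + digits
        n //= base
    return digits or "0"
```

The same surface question ("25 + 14 = ?") then has different gold answers under the two interpretations (39 in base 10, 40 in base 9), which is exactly what makes the probe informative.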
Similarly for logic: maybe the reason the model is so good at solving logic problems is that it has seen something very similar in its training data. So what if I construct a world where, I don't know, corgis are reptiles? Can it still do the logic problem? And what we find is that there's sometimes a pretty significant drop when you move to the counterfactual setting.

[Question from the audience] Sorry, why is base 9 the counterfactual and base 10 not?

It's a counterfactual in the sense that, as the authors comment, base-10 addition is frequently observed in training data, but very few people do base-9 addition, so there are going to be far fewer examples of it in the training data.

[Follow-up] So it's more out of distribution, right?

Yeah, you could also call it out of distribution, for sure. And from results like these, what we see is this drop in performance even for very simple logic problems that don't involve multiple steps of reasoning, a pretty significant drop, which maybe suggests that there's not that much reasoning and more memorization.
So we could keep going with this paradigm of changing the problem setting so that it starts looking out of distribution relative to the training corpus, and this is exactly what was done in a paper that looked at analogical reasoning. The setup is something like this: I'm going to show the model certain examples of string transformations and ask it to generalize to new examples. In this extend-sequence problem, the input is "a b c d" and the output is "a b c d e", and then given "i j k l" the model has to produce "i j k l m", and so on. Now, the way you can make this into a counterfactual, something out of distribution, is to change what the extend-sequence task is: instead of outputting "a b c d e", maybe the model has to output "a b c d f", so rather than the next character it has to output the second character after the last one, and so on. The other kind of counterfactual you could add is, instead of operating on the standard alphabet, to modify the alphabet completely, so instead of starting a, b, c, maybe it starts x, y, and so on. What we find is two things. First, there's a significant drop in performance as we go from the standard analogical reasoning problem to one of these counterfactuals, where we either change the alphabet or change the description of the task so that it becomes slightly unnatural. On the other hand, the authors also ran this exact same experiment on human subjects, where they find very little decrease in performance. So overall, what this result suggests is that maybe there's some reasoning, maybe there's some memorization, but there's nothing systematic. Again, this is all emerging research, so maybe someone will find that if you change your prompt a little bit, models can do this kind of reasoning, but this is the current lay of the land.
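For concreteness, the extend-sequence task and its counterfactual variants can be specified in a few lines; `step` and `alphabet` are the two knobs the counterfactuals turn:

```python
def extend_sequence(seq, alphabet="abcdefghijklmnopqrstuvwxyz", step=1):
    """Letter-string analogy task: append the character `step` positions
    after the last element of `seq` in `alphabet`. step=1 with the standard
    alphabet is the familiar task; step=2 or a permuted alphabet yields the
    counterfactual variants, which are equally trivial to score."""
    return seq + alphabet[alphabet.index(seq[-1]) + step]
```

The point of the experiment is that all of these variants are the same trivial computation for the scoring code, so any gap in model accuracy between them has to come from familiarity rather than task difficulty.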
Okay, so that was the reasoning module of the lecture. I'm now going to switch gears and talk about language model agents. This is related to reasoning in the sense that reasoning involves multi-step inference, where given some facts you have to arrive at completely new conclusions; with agents, what we'll see is that there's some high-level objective the model has to accomplish, and it has to reason about postconditions, object affordances, and uncertainty in the world to carry out a sequence of steps. Let's start with some terminology. We have our agent on the right, which is going to be some neural network, and we have an environment; I'll give some examples of what these environments could be. The agent receives an observation from its environment, and based on the observation it issues an action. Along with that, it receives a second variable g, where g represents a language instruction. There are many names for this setting and for these models: digital agent, language-conditioned policy, or instruction-following agent. Some examples of environments: maybe it's a web browser, in a browsing environment where the objective is to book a flight from San Francisco to New York; the observation could either be the raw pixels the model sees or the HTML DOM representation; and the action space, if you're looking at these web environments, could be typing into specific web elements, clicking on web elements, moving your mouse to a certain web element to interact with it, and so on.
There's a vast number of applications; I don't think I can cover them all, but we can look at some. There's obviously digital assistants (I'm not going to say the names, because I know people's phones might start popping up), where you can give them natural language commands to set an alarm, set reminders, and so on. You could also do natural language programming, where given natural language descriptions you get a model to write Python code. Another example is UI automation, where maybe you want to do automated testing of UI elements: instead of having a human verify whether a UI element works, maybe you can get a model to execute actions corresponding to a given instruction. Or it could be something more user-facing, where given some complex environment like Spotify, you could ask an agent to play some songs. And finally, there's this emerging application where we want to add tools or plugins to language models so that they can control various different applications.
Okay, so before we look at how we can use language models to do instruction following, I think it's very helpful to look at how this was done before language models. There were basically three main ideas. Sometimes the right thing to do was to collect examples of utterances paired with logical forms, where a logical form is some kind of executable representation you can execute against a knowledge graph or a database to get an answer. Maybe you have a query like "what states border Texas", and there exists some program description that you could execute against a knowledge graph to get an answer, or a list, here. Idea number one was to treat this almost like machine translation: you have a source language, which is English commands, and a target language, which is these meaning representations or logical forms, and you can apply the same machinery from assignment 3 to build a natural language interface. So you directly maximize the probability of a sequence of actions given a goal or a command.
Idea number two was something a little more complex. Here you have instructions paired with actions, but instead of directly mapping instructions to actions, I'm going to infer an executable plan from these instruction and action sequences, train a model to go from instructions to plans, and then define a very rich execution model that directly executes these plans. The advantage is that there may be more high-level decisions you can encode in your plan which would be harder to get into the model if you trained it to produce the action trajectories directly. I have an example of a system like this from 2011, which was basically an agent that could navigate in a grounded environment. The idea was that you took an instruction and obtained a plan, and then you trained a semantic parser, which is basically a machine-translation-like system that converts commands into these plans. Once that's trained, at test time, given a completely new instruction, you run the semantic parser to get a plan and then execute it in the execution model. And I have an example of an instruction and a plan from this 2011 system.
The third idea, probably the first that comes to mind when you see a setting like this, is to use reinforcement learning directly, and what people did was use RL to directly map instructions into actions: I learn a policy that outputs actions maximizing some reward, conditioned on my natural language instruction and the observation. This reward could be sparse, where I carry out the entire task and the environment tells me whether I achieved it or not, or it could be something I obtain after each step, where I take an action and the environment tells me whether that action completed some percentage of my task. At the top I've included an example of a system from 2009 that did this for automated Windows debugging: you have some natural language instruction to click some UI elements, and that gets mapped into API commands that the model executes one after the other. So these were the three main ideas people had before language models: you would train semantic parsers; or you would infer plans from instruction-trajectory pairs, learn to directly model plans, and have an execution model that can execute them; or you would do reinforcement learning if you had a reward signal.
2024 so there are a few ways to think about this uh I think maybe the most instructive is to think about what we're trying to achieve right so we are trying
to model trajectories so sequence of actions conditioned on some goal okay so I want my model to book a flight from San Francisco to New York and I want it
to produce a trajectory of maybe typing and clicking actions so let's look at how that factorizes so the probability of a
trajectory uh conditioned on a goal or an instruction is just the probability of the State action next state and so on
condition on the goal and you could factorize that into two terms so the first term is sort of the transition dynamics of the environment and that's
just what happens if I take a certain action in a given State how is my state going to change and the second object is
sort of the agent policy which is given my goal and the trajectory so far what is the next action I should be
taking okay and then people quickly realized that you could just treat this as a generative problem so you could treat the
problem of decision-making in environments as a generative trajectory modeling problem and what I have in sort of the
top right is an example of a transformer that just takes the history of actions it's taken so far, the
current state and some indication of what task it should achieve here based on reward but it could be a natural language string and it's just trained to predict what's the next
action okay and you could just train an auto regressive language model to do this and it turned out that this worked very well in sort of an offline RL case
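As an aside, the factorization described a moment ago can be written out explicitly. The notation below is my own reconstruction in standard RL notation, not copied from the slide:

```latex
p(\tau \mid g) = p(s_1, a_1, s_2, a_2, \ldots, s_T \mid g)
             = \prod_{t}
               \underbrace{p(s_{t+1} \mid s_t, a_t)}_{\text{transition dynamics}}
               \,\cdot\,
               \underbrace{p(a_t \mid g,\, s_{1:t},\, a_{1:t-1})}_{\text{agent policy}}
```

A language model used as a policy only has to model the second factor; the environment itself supplies the transition dynamics at execution time.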
question sorry in the figure why are we predicting one action uh so are we predicting one action before and the current action oh so
no no so you predict an action execute that right append that to your trajectory and then you predict the next action and so on so we resolve
three input tokens into one output token yeah okay sounds good um and it turned out that this worked really well and so you know the
instead of getting these latent plans and training semantic parsers or trying to do reinforcement learning we started using language
models as policies and so a simple way to do all of that is uh to prompt a language model in a loop okay so uh we're going to specify the
action space in text so this is like a simple sort of language model agent this is not going to work at all but it's probably just illustrative
of how an agent can be built now so you provide an action space in text um so maybe it's a digital environment and
maybe it can click maybe it can type characters maybe it can move the mouse somewhere uh you provide it an
instruction and you provide it the sequence of actions and observations it's received so far okay and then
conditioned on all that you ask it to predict the next action and there's nothing deep going on here this is just Chain of Thought
prompting in a loop okay but uh the hope is that uh because all of this uh because we reduce the problem of decision making into just Auto regressive modeling this this could work
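A minimal version of this prompt-in-a-loop agent can be sketched in a few lines of Python. Everything here (`call_lm`, the environment interface, the action names) is a hypothetical stand-in for illustration, not a real API:

```python
# Minimal LM-agent loop: action space in text, an instruction, and the
# action/observation history so far, prompted in a loop.

ACTION_SPACE = """Available actions:
- click(element_id)
- type(text)
- scroll(direction)"""

def build_prompt(instruction, history):
    """Assemble the prompt: action space + instruction + history so far."""
    lines = [ACTION_SPACE, f"Instruction: {instruction}"]
    for i, (action, obs) in enumerate(history):
        lines.append(f"Step {i}: {action} -> observed: {obs}")
    lines.append("Next action:")
    return "\n".join(lines)

def run_agent(call_lm, env, instruction, max_steps=10):
    """Predict an action, execute it, append it to the trajectory, repeat."""
    history = []
    for _ in range(max_steps):
        action = call_lm(build_prompt(instruction, history))
        observation, done = env.step(action)
        history.append((action, observation))
        if done:
            break
    return history
```

At each step the full interaction history goes back into the prompt, so the model conditions on the goal plus everything it has seen and done so far, which is exactly the policy term in the trajectory factorization.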
okay and indeed a slightly more complex version of this can work in some environments okay so now I'm going to give a little flavor of what
different environments look like now for evaluating language models um as agents so the simplest environment uh
that people consider is MiniWoB so this is a sandbox environment that evaluates basic browser interactions like you know maybe on a
mini Twitter environment can you get a language model to retweet a given tweet um given a simulated email client can the model forward someone's email
can it compose an email uh can it click on certain buttons or not uh it's not at all real world so it's not real websites uh and it's
relatively short horizon so given any instruction most tasks can be accomplished in under three actions uh but zero-shot performance of even the best language models is still
far from perfect even on this very simple benchmark um a second slightly more real-world benchmark is
WebArena and this is also a sandbox environment but it's a pretty close approximation of real websites uh that span e-commerce so there is a
website in WebArena that resembles Amazon um social media so something that resembles Twitter and additionally there are utility tools like Maps so an
instruction could require a model to open up sort of a map application find the shortest path from point A to point B and use that uh in its later uh
sequence of actions and there's multi-tab browsing like we commonly do uh so with MiniWoB there's only one single tab uh and with
WebArena I think this was the first environment that introduced this idea where you kind of have multiple tabs and the agent can switch between
tabs uh and again we are going to evaluate functional correctness um which is whether the model gave the correct answer
at the end whether the sequence of steps it took um gave the intended Behavior as opposed to whether it took a sequence of steps that maybe a user had
pre-programmed so another popular kind of environment or rather data set is WebLINX so WebLINX also has
multi-tab browsing and it has web interactions on real websites so this is not sandboxed approximations of real websites it's not
sandboxed browser interactions these are actual real websites um and it also introduced a new action where
the agent could communicate with the user so maybe there's some instruction uh which is to reserve
I don't know a movie or buy a movie ticket or something and then at some point the model has to request credit card information and so there is this additional action
where a human could be involved in communicating with the agent uh and this is not an environment uh but just a collection of interactions
uh so you can't for example do any kind of exploration or online learning here but you could definitely use this for evaluation okay uh so this was
just a taste of what some benchmarks look like uh for for language model agents so how are we going to train these models right so uh you know given
that we're going to treat decision making as sort of causal language modeling we're not going to use any of the ideas from
the pre-LM era uh the standard practice is to do in-context learning with few-shot examples uh and in the few-shot examples typically for any new
kind of website or any new use case you're going to get humans to perform those tasks and feed that into the language model's prompt as in-context
demonstrations which it could then use to solve similar-looking tasks on very similar websites so obviously this is not
scalable uh there are thousands of environments and on some environments lots of different interactions are possible and so maybe there's something better that we can do than
just getting humans to provide demonstrations for every new use case um and so we're going to use something we saw early on in the lecture okay
which was to use the language model to generate rationales and then fine-tune on that and here we don't have rationales but we could produce action trajectories and then we're going to use
that as supervision okay so the way that looks like is something like this so let's say I have some
environment um you know let's say it's some MiniWoB environment and I'm going to just get an agent that can randomly explore the environment so it'll just execute a random
sequence of clicks and types and scrolling operations and let's say it produces some trajectories okay and now I'm going to use these trajectories and somehow filter them so that was the
idea from earlier so you're going to get a bunch of different outputs and then you're going to filter it somehow so here we're going to use a second language model because we don't know
what a good trajectory looks like so it's not like a math problem where you know the correct answer uh we just had a language model interact with the website and generate trajectories and we want to
somehow filter out the good trajectories and so we're going to use a second model that will produce a description uh of these trajectories and
the idea here is that if you can get a model to produce a description of what uh what the sequence of actions corresponds to then maybe that's a good
enough signal for a good trajectory okay and so maybe given the first trajectory it guesses that the instruction was to book a flight from San Francisco to New
York um for the second trajectory it said set the date to some given date um and maybe it wasn't able to come up with any good sort of
instruction for the third trajectory and then we're going to do something uh again uh that we saw earlier on which is to like kind of do
this iteratively so now we have a goal that we got for a trajectory and now I'm going to get the language
model to condition Its Behavior on this goal so the goal is to set the date as some given date and now instead of doing random exploration the model is going to
produce a sequence of actions that have a better correspondence with some natural language instruction so it produced a trajectory based on that
instruction and then I'm going to use sort of some coarse filter that's just going to look at correspondences between the instruction and the sequence of
actions and the states the language model visited and use that to decide if the trajectory was a good trajectory for the
instruction and in this case uh you know given the instruction this seems like a pretty good trajectory for completing
this task and so then we add it to a set of examples okay but maybe sometimes things are not so good so for that second
instruction the generated label was to book a flight from San Francisco to New York and let's say we run that again through the language model and it
produced a second trajectory okay and clearly this does not seem like uh kind of a successful trajectory corresponding to booking a
flight um and so what do we do here maybe we can throw away this uh interaction but interactions are pretty costly like specifically uh you know if you're looking at real websites and each
interaction uh you know could take a few milliseconds and so maybe we don't want to throw away this interaction so what we're going to do here is again invoke the relabeler to take the
trajectory and assign it a new label so the model was not successful at accomplishing the task it set out to do but it accomplished something and we're going to come up with the best guess of what that was with a second language
model and it's going to say that okay uh maybe the instruction you accomplished instead was to set the origin to SFO and the destination to New York City okay
and so that's going to get fed back into the language model and we're going to keep doing this iteratively till our filter says that this is a good instruction trajectory pair okay so we
have the same idea of using a language model to sort of generate outputs and some iterative uh procedure that will like you know give us kind of a good set
of training examples um so overall the method looks something like this uh you know you have some
environment uh we're going to use kind of an unconditioned language model to just randomly explore the environment and generate a sequence of trajectories and then we're going to
convert these trajectories into synthetic training data by iteratively converting trajectories into natural language descriptions and then taking natural language descriptions and
converting them into even better trajectories and so on and once we have this collection of synthetic examples uh there are two things we could do
one could fine-tune using this data uh but the simplest thing you could do is repeat the paradigm from earlier of replacing human-provided demonstrations in context with these
synthetic demonstrations um and we find a reasonable boost in performance a 13-point improvement on the MiniWoB benchmark
and again uh you know even though MiniWoB is very very simple zero-shot performance for even the best language models is far from perfect and we also see an improvement on a second sort
of multi-step tool-use environment but so far we've only looked at text right um but maybe for real-world
applications it's kind of intractable for every environment to obtain the HTML and feed that into the language model's context uh sometimes there can be tens
of thousands of DOM elements and then corresponding JavaScript and inputting all that into the language model's context could be you know intractable and maybe that's also not
the best way to kind of uh show the state of the environment maybe the best way is to directly show the pixels uh corresponding to uh the the
environment and so now we're going to look at some examples of vision language models that people have used for building these agents
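Before turning to the vision-language models, the explore-describe-relabel loop just described can be sketched as follows. The function names (`explore`, `describe`, `attempt`, `is_good_pair`) are hypothetical stand-ins for the exploration agent, the two language-model calls, and the coarse filter, not the actual system from the lecture:

```python
def explore(env, n_steps=5):
    """Unconditioned random exploration: execute random clicks/types/scrolls."""
    return [env.random_action() for _ in range(n_steps)]

def collect_synthetic_demos(env, describe, attempt, is_good_pair,
                            n_rollouts=100, max_relabels=3):
    """Convert random rollouts into (instruction, trajectory) training pairs."""
    demos = []
    for _ in range(n_rollouts):
        traj = explore(env)
        # A second LM guesses what instruction this trajectory accomplished.
        goal = describe(traj)
        for _ in range(max_relabels):
            # Coarse filter: does the trajectory correspond to the instruction?
            if is_good_pair(goal, traj):
                demos.append((goal, traj))
                break
            # Otherwise, condition the agent on the guessed goal and retry...
            traj = attempt(goal)
            # ...and rather than discarding a failed attempt, relabel it with
            # the best guess of what it *did* accomplish.
            goal = describe(traj)
    return demos
```

The collected pairs can then replace the human-provided few-shot demonstrations in the prompt, or be used for fine-tuning.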
okay so uh the first one that we're going to look at is LLaVA uh and the idea here is again kind of similar to Orca that we looked at
in sort of the reasoning half of the lecture uh we're going to use GPT-4 to generate uh this time both instructions and
responses uh for textual descriptions of images so maybe there's an image um and we're going to sort of use uh metadata
corresponding to that image to come up with a textual description feed that into GPT-4 and ask it to generate possible questions and responses and then we're going to
jointly fine-tune sort of an image encoder um here CLIP along with a text
decoder here Vicuna which is a LLaMA model that is instruction-tuned um and through this sort of joint fine-tuning uh at the end we kind of get this image
encoder um that can output language responses and now we can sort of ask questions about images maybe use that uh to directly input uh screenshots instead
of HTML DOM elements so a second approach that looked at sort of building joint
image-language models that then people later adapted to agents was Pix2Struct and uh the idea is again very similar uh there's an image encoder and
a text decoder uh the image encoder will sort of take the image convert it into patches and assign each patch sort of a position ID uh run that through a
Transformer and then there's a decoder that will decode out some text okay uh one of the new things that Pix2Struct introduced was a new pre-training task
so uh for LLaVA the pre-training was you know fairly simple uh we're going to use GPT-4 to just generate sort of synthetic questions and responses based on textual descriptions of images
but there's only so far you can go with textual descriptions of images what Pix2Struct did was to look at screenshots from websites and mask
out parts of the screenshots and then ask the Transformer decoder to produce HTML corresponding to the masked-out elements uh so here there is
this list um that has a corresponding HTML uh one of the data points in Pix2Struct looks something like this
so you might mask out let's say the first answer corresponding to Python and ask the model to produce the HTML corresponding to just the
patch that was masked out uh and so this seems like a more natural sort of pre-training objective that can maybe have like better
interactions between image and text and then this was also adapted for building these multimodal agents okay so uh you know at this point
I just want to kind of highlight that this is really an emerging application um there's this huge prompting gap is what I like to call
it so if you do not do extensive prompting and if you do not use bespoke few-shot examples where for every different environment you have a different set of few-shot examples even
the best language models are very very far from perfect even on very very simple tasks like MiniWoB uh where you know the goal was just to click on certain elements or respond
to someone's email where in MiniWoB that just takes like five actions um and then uh even for something as simple as MiniWoB even after doing
extensive prompting and few-shot examples there is this drop in performance as you go from the simplest tasks that involve mapping an instruction
into a single action to mapping an instruction into maybe five or ten actions uh so long-horizon planning is still very very hard even on these very simple
benchmarks um and then if you look at something more complex like WebArena which tries to approximate real websites has multi-tab browsing has external
tools that the model can use there's just a huge difference between sort of human-level task success rate and what the
best models get uh even after prompting even with few-shot examples um and then the kinds of errors
models make are also pretty weird so one of the examples from WebLINX was uh the task
was to just open Google Translate and sign in using credentials and there was an email and a password and then what GPT-4V did was instead of typing in
the password it just typed the email into the password field uh and it just couldn't recover from this error so you know it tried to sign in there was an error it tried to
type in the email again and so on and I'm sure with extensive prompting you can fix this and maybe that's beside the point right um and then again uh you know there was
like a different example where the model had to issue a search and then instead of issuing the search with the correct term it sort of repeated the
same term like three times um and obviously that's not going to return any results um so there's a lot of room
for improvement as you can see um and then there's lots to be done in the space okay so I'm going to recap um and take any questions so we kind of looked
at two different things today we looked at reasoning in language models uh we saw that there's a few ways that you can get reasoning like behavior in language models you can prompt them in various
ways so the simplest example of that is Chain of Thought prompting you can do Chain of Thought prompting but generate multiple rationales and sort of try to reconcile them and pick the answer that
was most frequent uh you can do sort of problem decomposition in your prompt so ask the model to explicitly decompose a problem
into multiple steps before answering uh so that was all prompting you could also try and train specialized small language models for
reasoning by generating rationales from a big language model and then fine-tuning a smaller language model on these rationales uh instead of fine-tuning a
smaller language model on rationales from a big language model you could just fine-tune the big language model on its own rationales and keep doing this iteratively and we saw that sometimes
like if you do multiple iterations of that performance can keep improving and can even outperform sort of human-provided rationales um but on the flip side we saw
that while there are some initial reasons to be optimistic if we go and do counterfactual evaluation we see that
you know it's not clear if the models are good because they're reasoning or if models are good because you know all of these problems were in some shape or form already in the training data and we
saw that with sort of counterfactual evaluation um in the second part we looked at language model agents uh we kind of talked about the historical perspective through which uh
people built sort of grounded agents and then we saw that you could recast the problem of decision making as just sort of uh causal language modeling and then
we looked at various ways through which people have modeled uh decision making with language models most of it involves prompting and in context learning and
then we looked at a method similar to sort of what we saw in the first module for generating synthetic demonstrations and here we looked at doing exploration and the same kind of
iterative relabeling um you know most of the language models we looked at today were text only uh we saw some examples of language models that can take both text
and visual input and then uh you know we saw that benchmarks are very very challenging models make kind of trivial mistakes uh there's a huge gap between
human performance and sort of what we get with models and you know a lot of room for driving further
Improvement um and maybe some of you are doing it for your projects uh thank you [Applause]