Language Agents (Shunyu Yao PhD Defense)
By Shunyu Yao
Summary
Topics Covered
- Digital Automation: A New Frontier for AI Agents
- Language Agents: Reasoning Like Humans
- WebShop: A Scalable Benchmark for Real-World Tasks
- ReAct: Combining Reasoning and Action for Generalization
- Tree of Thoughts: Enhancing Deliberate Reasoning
Full Transcript
stuff that you want to talk about. I still remember the very first time that Shunyu and I met: he told me he had done some prior research on computer vision, but for some reason wanted to explore language and reinforcement learning. So we sat together and came up with the idea that, hey, there's this new thing gaining traction called language models. This was still in the days of GPT-2; it was not super mainstream, but the models were getting better and generating more and more coherent text. So we thought: why not think about using language models as a way of enhancing our RL agents, specifically for the very constrained domain of text games? If you've played these text adventure games, you know that you have to read states that are written in text, take actions in text, and so on. So we started working on that, and COVID hit within a couple of months, so the last four months of that paper were pretty much finished over Zoom. Little did I know at that point, and I'm not sure how much you knew, that much of that paper would end up influencing Shunyu's work down the line, both in terms of building more realistic benchmarks for these kinds of language agents, and also some of the methodological work that he will talk about.
So not just the works that ended up forming a core part of the thesis, but also serving as a resurgence for this idea of language-capable agents, which has caused quite some excitement in both academia and industry these days. That's a kind of trajectory summary, but apart from that, Shunyu has been such a creative force; it's been a pleasure working with him through the years. Some of the work that he won't talk about here actually touched upon very different kinds of interdisciplinary ideas, and he has really taken those interdisciplinary ideas to heart. And not just that: Shunyu has been a real team player, as some of the folks in the lab can attest. He's been a great friend to several of the other students in the lab, as well as a mentor to some of the junior students, really helping them find interesting research ideas to pursue. Fun fact: we all got to attend his wedding a couple of years ago, and it was unique for me, at least, to actually hand him the ring before he took his vows. At that point he even mentioned that being with his wife provided some reprieve from frustrating research, but I'm happy that he did end up doing more interesting work after that, so hopefully it ended up being a good thing. I'll stop here; let's see what Shunyu has to say.

All right, thanks Karthik for the introduction. You saved me some time, since you covered my introduction part and my acknowledgement part, so I can say less in my talk. Okay, so I'm very happy to be here and talk about my PhD thesis on language agents. Let's get
started. We know that building autonomous agents that interact with the world has been a central goal of artificial intelligence, and throughout its history we have developed many methods and tasks. At a very high level, existing methods can be categorized into two kinds: those using manual rule design, and those using intensive learning. And at a very high level, existing environments can also be categorized into two kinds: those interacting with humans, and those interacting with the physical environment. This talk is about introducing a new kind of agent and a new kind of environment, but before that, I want to quickly motivate why we need fundamentally new methods and tasks for autonomous agents.
On the methodological side, if you think about designing rules to build agents, or using imitation or reinforcement learning to build agents, either way it takes intensive, domain-specific effort, and those agents cannot really generalize beyond their rule coverage or training distribution. Those methods are also very hard to build: it takes experts like us, and even for experts it's actually very hard, which shuts out most researchers outside AI or outside computer science. On the environment side, I think there has been a fundamental dilemma between scalability and practicality. If you think about practical agents like chatbots or autonomous vehicles, it's very challenging to collect rewards at scale from humans or physical environments. That's why we often turn to more scalable game or simulation environments, where we can have unlimited reward signals; however, what is learned there proves very hard to transfer to practical settings, which is a problem. So today I want to talk about my research that tries to address these two
fundamental challenges for building autonomous agents. First, I want to talk about my work that created a new type of AI benchmark based on large-scale, real-world digital environments such as the internet, code, or computer software. I call those tasks digital automation, and solving them would have tremendous practical value. These real-world digital environments also provide us a scalable way to collect reward signals and evaluation, and they prove very challenging for traditional rule-based or learning-based agents, and even for large models, because they require reasoning over real-world language and over open-ended
actions. Those challenges motivated me to build a new type of agent that I call language agents, where the key idea is to reason to act, similar to how we humans think about a situation in our minds before taking actions. I will show how this new type of agent combines ideas from language models and traditional agents, and how they prove much more general and generalizable, in the sense that they can work across various domains and can take just one or two examples to generalize to new scenarios, outperforming traditional agents that use hundreds of thousands or millions of learning samples. My work on digital automation and language agents has led to various follow-up methods and industrial products, but as the field grows, it often becomes hard to understand what is going on in this complex field, as follow-up methods become more and more complex, or even ad hoc. So in the last part of my talk, I want to shift from individual empirical tasks and methods to a more principled conceptual framework that tries to unify various agents and provide guidance for future directions toward autonomous agents. So let's start with the first part, in which I talk about benchmarking
agents with digital automation. What I mean by digital automation is essentially what we do every day on the computer. As researchers, you can think of the tasks we try to solve: filing reimbursement reports, writing code, debugging, running experiments, finding papers, writing reviews, arguing with a reviewer. All those things that we do on a computer. If all those things could be automated, you can imagine there would be tremendous practical value, and everybody could graduate from their PhD in three years. But there has been little progress toward such digital agents; if you think about Siri, it practically can do nothing. The reason is that existing methods really struggle to reason over real-world language (think of a complicated paper or a piece of code), and existing methods cannot really make decisions over open-ended actions over long horizons (think of searching a query on Google, or writing a piece of code: those decisions are extremely open-ended). But solving those challenges is not just key for building digital agents. If we want physical agents like robots to act in the wild, to navigate cities or human society, to plan out complex tasks, or to coordinate with humans or other agents using natural language, we also need to solve those underlying research challenges. So I wanted to solve those challenges, but at the time, all the benchmarks that used language
looked something like this. All those existing agent benchmarks reflect none of the challenges I mentioned above: they feature simulation environments, small action spaces, synthetic tasks, and short-horizon problems. So even if we solve those benchmarks, it's unknown how that leads to solving the more practical challenges. Now you may ask: if the goal is to solve those very hard, practical digital tasks, why not directly benchmark on the web or on the computer, using those environments as the benchmark? It turns out the key bottleneck is evaluation. For example, the WebGPT work from OpenAI tries to solve a very practical task: given a question, they build an agent that interacts with a browser to come up with a very long answer. This is a very practical task that interacts with a large-scale digital environment; however, as you can see, the answer here is very hard to evaluate. As a result, they have to rely on professional annotators to label which answers are good and which are bad, and it turns out to be more of a demo for agents than a benchmark. So the point of making a benchmark based on those digital environments is not just about building large, complex environments or having really challenging tasks; one of the key bottlenecks is how to build automatic reward functions. This motivated WebShop, and
here is a demo of WebShop. The idea is that you are given an instruction to buy a certain type of item, and WebShop is a large-scale website environment where the agent needs to act like a human: clicking buttons, searching queries, and reading real-world language and images. It has to decide when to buy the item and when to explore more; if it explores, whether to search more or click more items; and if it searches more, how to generate a query reformulation, and so on and so forth. At the end of the day, the agent needs to decide on an item, click some buttons to customize the product, and finally click the Buy Now button, upon which the episode ends and a reward between zero and one is given to the agent, indicating how many attributes in the instruction are satisfied by the chosen product. To build this environment, we actually scraped over a million real-world Amazon products and their associated real-world language and images. But why, out of all possible web tasks, choose shopping? The key reason is that shopping is a very well-defined task, where it's possible for us to synthesize an automatic reward based on attribute matching between instruction and product. And as you can see, solving this task is quite challenging, because it requires not only understanding of language and images, but also long-horizon decision making over an open-ended action space.
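The reward just described, a score between zero and one based on attribute matching, can be sketched roughly as follows. This is a hypothetical simplification for illustration, not WebShop's actual scoring code (which also accounts for options and price):

```python
def attribute_reward(instruction_attrs, product_attrs):
    """Sketch of a WebShop-style reward: the fraction of attributes
    requested in the instruction that the chosen product satisfies,
    yielding a score in [0, 1]."""
    if not instruction_attrs:
        return 1.0  # nothing was requested, so any product satisfies it
    matched = sum(1 for a in instruction_attrs if a in product_attrs)
    return matched / len(instruction_attrs)

# e.g. the instruction asks for a red, cotton, machine-washable shirt,
# and the chosen product matches two of the three requested attributes
reward = attribute_reward(
    {"red", "cotton", "machine washable"},
    {"red", "cotton", "slim fit"},
)
```

Because the score is computed mechanically from the product metadata, every episode gets a reward with no human annotator in the loop, which is exactly what makes the benchmark scalable.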
At the time, in 2021, we tried to combine all the techniques we could: pre-trained image models, language models, imitation and reinforcement learning. But even the best agent we could build reached less than half of human performance. One analysis we did at the time that is insightful was to look at the average episode length for those agents, and we found that for humans to succeed, it's actually a very long-horizon task: humans search different queries, check different products, carefully look over all the options, and then make decisions. But this kind of long-horizon decision making is still very challenging for artificial agents. Another very exciting thing about WebShop is that we trained agents on WebShop and then deployed them to real-world environments like Amazon or eBay, and we found that those agents achieve similar performance. This kind of sim-to-real transfer is very exciting, and it's very different from traditional agent tasks like robotic simulation. Here the key driving force of the sim-to-real transfer is that we're using real-world language to build the simulation environment, and that enables transfer to real environments with similar language
semantics. Since WebShop, this direction of web interaction has become very popular. There have been various methods exploring various aspects of this holistic task of web interaction, including visual understanding, language understanding, decision making, exploration, planning, reinforcement learning, and so on. There have been better, or more comprehensive, benchmarks that try to cover more domains, and WebShop has been used as a testbed for various industrial developments. But I think beyond web interaction, WebShop created, or inspired, this new direction of using real-world digital tasks to benchmark agents. Another domain of this flavor is coding. We know that coding has been a very traditional task for natural language processing, studied for decades. However, if you look at some of the most popular coding benchmarks today, like HumanEval, they usually look something like this: very synthetic toy problems that can be solved within a single file, or a single method, or even a single line. When HumanEval was proposed back in 2021, it was really hard; the best model could only solve around 26% of the tasks. But within two years it has become apparently too easy, and the best methods already reach 95% accuracy.
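To make the contrast concrete, a HumanEval-style problem looks like a short, self-contained function that the model must complete from a docstring. The example below is illustrative in that style, not an actual item from the benchmark:

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case.

    >>> is_palindrome("Level")
    True
    >>> is_palindrome("hello")
    False
    """
    # A one-line completion suffices -- exactly the kind of
    # single-function, single-file task HumanEval measures.
    t = s.lower()
    return t == t[::-1]
```

Nothing here requires reading a codebase, using tools, or making sequential decisions; the whole problem fits in one prompt and one completion.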
So in this work, SWE-bench, we again had this philosophy of directly using real-world digital tasks. Here the task input is a GitHub repository and an actual issue in that repository, and the output is a file diff that can be applied to the repo to resolve the issue. One thing that's good about this benchmark is that we can directly collect unit tests from real-world pull requests to serve as a scalable and faithful evaluation method. Can any of you guess the accuracy of state-of-the-art large language models on this coding task? Okay, you're spoiling my second part, but: we tried to use large language models to solve it in a sequence-to-sequence manner, and they perform terribly; even the best model can only solve about 2% of the tasks. That's kind of intuitive, because if I asked you as a human to read over hundreds of thousands of lines of code and then autoregressively type out each token to solve the task, it would be too hard. Fundamentally, we need a way to break the problem down into steps and use execution feedback to interact with the code and solve the task; at least, that's the way human programmers solve it.
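The evaluation recipe described above, applying the model's generated diff to the repository and then running the unit tests collected from the pull request, might be sketched like this. The repo path, patch file, and test command here are hypothetical placeholders, and this is not SWE-bench's actual harness:

```python
import subprocess

def evaluate_patch(repo_dir, patch_file, test_cmd):
    """Apply a model-generated diff and run the repo's unit tests.
    Returns True iff the patch applies cleanly and the tests pass."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # malformed or conflicting diff: automatic failure
    tests = subprocess.run(test_cmd, cwd=repo_dir, shell=True)
    return tests.returncode == 0
```

Because the pass/fail signal comes from the repository's own tests, the evaluation scales with no human judgment required, mirroring the automatic-reward idea from WebShop.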
To quickly summarize this part: I think digital automation can be seen as a new frontier for benchmarking, training, and evaluating autonomous agents. It has tremendous practical value, and it provides a scalable environment where there's a chance for scalable learning. However, the key bottleneck is often this problem of evaluation: if we can figure out a way to scalably evaluate agents on those tasks, and set the tasks up in a scalably measurable way, then it will be a great environment. And those digital automation tasks require sequential decision making over open-ended actions, which has been a key research challenge that neither language models used in a sequence-to-sequence manner nor traditional learning agents can really solve. Solving this challenge requires a fundamentally new type of agent. This motivated the second part of my talk, in which I'll discuss how we can use large language models to
build language agents that can reason to act. I'm sure all of you are very familiar with language models by now, but to quickly give a recap: language models were developed to solve a very simple task, generating text, and they do so in a very simple way, predicting the next token autoregressively. Things changed in 2020, when GPT-3 showed that those models can solve not just text generation but actually a variety of NLP tasks, if you give them a prompt providing task instructions and a few examples. Since then, there has been a lot of progress in this direction of language models. For example, follow-up research found that if you not only specify the task instruction, the input, and the output, but also give a demonstration of how to bridge the input and output with reasoning, then the language model can generalize better and solve a diversity of question answering tasks. So here, what we call reasoning in natural language processing, or at least in this context of LLMs, is this ability to derive new information, to update the internal context, and to bridge the input and output to solve the question.
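The kind of reasoning demonstration just described, commonly known as chain-of-thought prompting, can be made concrete with a toy prompt like this. The first exemplar is the well-known tennis-ball example from the chain-of-thought literature; the second question and the `llm` call are hypothetical:

```python
# A minimal chain-of-thought prompt: the exemplar bridges the input and
# output with intermediate reasoning, which the model is encouraged to
# imitate when answering the new question.
prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.

Q: A library has 120 books and lends out 45. How many books remain?
A:"""

# response = llm(prompt)  # hypothetical LLM call; a good completion
# would reason "120 - 45 = 75. The answer is 75."
```

Without the worked reasoning in the exemplar, the model would be asked to jump straight from question to answer; with it, the intermediate steps become part of the generated context.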
However, there have been some challenges for LLM reasoning. Let me give a very simple example: if you had a lot of money, you would probably ask a question like this, and if you give it to GPT-4, it will give you a wrong answer. The reason is, first, that language models are trained and then deployed, but their knowledge is frozen at training time. So even the biggest and best language models like GPT-4 have limited knowledge, and that knowledge cannot be updated. Also, language models are not very good at a lot of things; for example, here you can see GPT-4 make a calculation error. At the time, various people tried to find limitations of LLM reasoning and devise piecewise solutions to each problem: you can add retrieval to solve the knowledge problem, or you can fine-tune for the calculation problem. But at a very high level, I think the root issue for all of those problems is that language models are trained to generate text; they're not really trained to act. So if you could find a way to make those models take actions to interact with external environments, for example interacting with Google to get new information and knowledge, or interacting with a calculator to patch up the calculation, it might serve as a unifying way to solve these various problems. However, making language models take actions is really hard, because that's not what they're trained to do; especially if the task environment's actions look very different from their training distribution, it's very hard for language models to generate actions to solve this decision-making
problem. However, as humans, we can solve various tasks even with very little experience, and the way we do so is by thinking in our minds. If you go to a new website, you don't need to leverage hundreds of thousands of examples of pattern recognition to solve the task; you can actually reason in your head, in language, about what you need and how to solve the task. Combining these two sides of motivation led to ReAct, which is really a new paradigm of agents that uses language models to generate both reasoning traces and actions. I will show how this synergy of reasoning and acting leads to very simple ways of building agents that are general across domains and generalizable, meaning they can solve various tasks using just one or two examples. But before I talk about those empirical details, I want to quickly illustrate the fundamental conceptual difference between traditional agents and language agents. All traditional agents,
at a high level, are trying to solve the task of interacting with an environment, where you have an action space defined by the environment: your action space might be up, left, right, fixed by the environment. At each step, the task of the agent is, given the context, meaning all the observations and actions up until this moment, to predict the next action based on that context so as to maximize future reward. I think the key essence of ReAct is that we're augmenting the action space with thinking, or reasoning, and thinking is a very special type of action. First, a thought can be any language sequence, so it's an infinite space; but also, it doesn't incur any external feedback from the environment. If you think something in your mind, it doesn't change the external environment. What it does is change the agent's internal context: since the thought goes into the context of the agent, this thinking action can influence future decision making through influencing the context. In other words, thinking is a special action because it's not changing the external world, but it is changing the internal
world. Okay, this sounds a little complicated, but how do we actually use ReAct in practice? It's actually very simple. All you need to do is demonstrate an example task: as a human, you literally just write down what you think and how you act to solve the task, along with the environmental observations. Given this ReAct trajectory as an example, you can condition on a new task, and conditioned on this prompt, the language model generates an initial thought and action for the new task. The action is fed into the external environment to get an observation, and then the thought, action, and observation are appended to the context of the model to generate the next thought and action, and so on and so forth.
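The loop just described can be sketched in a few lines. Here `llm` and `env` are hypothetical stand-ins for a language model API and a task environment, not the paper's actual code, and the `finish`/`Thought:`/`Action:` conventions are illustrative:

```python
def react_loop(llm, env, prompt, max_steps=10):
    """Minimal ReAct sketch: the model alternates free-form thoughts
    (which only update the context) with actions (which hit the
    environment and return observations appended to the context)."""
    context = prompt  # task instructions plus example trajectories
    for _ in range(max_steps):
        thought = llm(context + "\nThought:")
        action = llm(context + f"\nThought: {thought}\nAction:")
        if action.startswith("finish"):
            return action  # the agent decides it is done
        observation = env.step(action)  # only actions get env feedback
        context += (f"\nThought: {thought}\nAction: {action}"
                    f"\nObservation: {observation}")
    return None
```

Note how the thought never touches `env`: its only effect is on `context`, which is exactly the "internal action" idea from the previous section.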
Obviously, you can use more than one example, and if you have many examples, you can even fine-tune the language model on those example ReAct trajectories. So it's a really flexible mechanism for using models. And because today's language models are getting really good, you can even use zero examples. What I'm showing here is a zero-shot prompt for ReAct: essentially, you just tell the language model that it is an agent and that it should do this task using certain types of actions; you give a description of the task, the action space, and the format of the generation, and then the new task, and the language model can use this information to solve it. Let's see how this very simple zero-shot prompt can solve a
fairly hard question for NLP models. Given this prompt, a state-of-the-art language model, in this case GPT-4, generates this thought: okay, I need to search for the market caps of the three companies to understand how much it would cost to buy them. So it searches their market caps, and Google returns noisy information, but reasoning is used to organize it: I already have all the market caps, I just need to add them up. This reasoning then incurs the next action, which is to use the search engine as a calculator to add those numbers up, and the search engine gives back the number. Finally, reasoning gives a way to stop the task and say: I have the answer now, and the answer is that it's not enough. As you can see, it's a very intuitive way to solve this fairly hard question-answering task, and I would argue that, as humans, that's probably what we would do. To make the task slightly harder, what I tried is to adversarially give some hard observations, to force the model to try different strategies. So instead of giving all the ground-truth information, here I give an adversarial observation: nothing is found, there's no Google passage about these market caps. And what's interesting
is that reasoning gives a very flexible way to adjust the plan: if I couldn't search for all three companies at once, maybe I can just search one at a time. So the reasoning incurs a new action, to search one company at a time. But then I give an adversarial observation again: instead of the market cap, I give the stock price of Apple. What's interesting is that reasoning is again used to readjust the plan: according to common sense, this is a stock price, not a market cap, but market cap can be calculated as share price times the number of shares, so I can instead search for the total number of shares, for example. And the ReAct agent continues to solve the task in this way. So I think the point here is that it's not just actions providing reasoning with external knowledge, feedback, or tools; reasoning is also guiding the actions, by making a plan, adjusting the plan, and tracking the task progress. There really is this tight
synergy that benefits both sides. We benchmarked ReAct on a wide range of tasks, from question answering to fact verification to playing games to even shopping online, and what we find is that ReAct consistently outperforms reasoning-only or acting-only baselines, demonstrating this synergy of reasoning and acting. But what I'm most excited about is how it's able to solve the digital automation tasks I proposed in the first part. On WebShop, ReAct uses just a single prompting example to reach a performance of 40%, outperforming the best RL agent at the time, which used 100,000 training samples. And as you can see, if you remove the reasoning part from the example, the performance degrades a lot, which shows the importance of combining reasoning and acting. On SWE-bench, if we use ReAct to decompose the task into multiple steps and leverage execution feedback, the performance is much better: without using any examples, a zero-shot ReAct agent can solve more than 10% of the tasks, on par with state-of-the-art industrial products. This is very exciting: it means we can directly deploy those agents on GitHub and actually solve real-world problems. This is joint work with colleagues from our lab, and our paper
will hopefully be coming out soon. So, ReAct was proposed in 2022, and it has already had a lot of impact. It has been the basis of literally all the follow-up methods for language agents. It has enabled a lot of interdisciplinary research, not only in AI and not only in computer science but even outside computer science, in fields like chemistry, creative art, or mathematics. And it has enabled a lot of industry applications and startups that are trying to solve various problems using AI agents. One of the most interesting applications of ReAct that I've found is in chemical discovery, where scientists used ReAct to build a digital agent that tries to discover new chemicals. What they do is give ReAct some tools, to train models or search information, along with a lot of data, and the agent actually uses those digital tools and data to try to come up with new discoveries. What's really cool is that the reasoning of this ReAct agent is connected to physical actions in the wet lab: the agent is connected to the physical world, and it synthesized new chemical compounds and actually solved some new problems. So I'm very excited about the many applications of language agents, especially scientific discovery. Even for the tasks hardest for humans, there can still be a lot of things we can automate and
accelerate. I think it's a good time now to take a pause. From the very simple idea of next-token prediction, we have language models that generate text; then we scale them up and see that they can solve a wide range of NLP tasks; and methods like ReAct enable those models to solve more than just NLP tasks: if you connect them to the web, to code, to robotics, to wet labs, you can solve an even wider range of tasks. So I think it's good to reflect on this incredible journey and ask a fundamental question: is this mechanism of next-token prediction enough for general problem solving? Because that's what we've seen so far: by scaling up and by connecting to various environments, it seems to solve more and more general tasks. To this fundamental question, our work Tree of Thoughts actually gives a negative
answer. And we give this negative answer not through a very complicated, hard task, but through a very simple one: this Game of 24, where the idea is that you want to combine four numbers with arithmetic to get 24. You can give a state-of-the-art language model like GPT-4 some examples and reasoning traces; however, it turns out GPT-4 is not really good at this task: the success rate is only around 4%. If you look at the task, the reason it's so hard for GPT-4 is that next-token prediction is a linear mechanism that makes token-level decisions from left to right, one by one. But for a task like Game of 24, the initial tokens can be very hard to decide: you really need to look ahead to understand whether starting with 10 is good or starting with 6 is good, and it's really hard to decide that immediately at the token level. And once you generate some wrong tokens, you cannot backtrack. For example, in this case GPT-4 generates "10 ×", but then the task has already failed, because no matter what follows "10 ×", there's no way to combine the rest of the three numbers to get 24.
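To see why look-ahead matters here, consider a small brute-force solver for the Game of 24 (my own illustrative sketch, not the paper's code). It combines two numbers at a time and backtracks from dead ends, which is precisely what a left-to-right, token-by-token decoder cannot do:

```python
from fractions import Fraction

def solve24(nums):
    """Brute-force Game of 24: pick any two numbers, combine them with
    an arithmetic operator, and recurse on the shrunken list,
    backtracking whenever a branch cannot reach 24."""
    def dfs(vals, steps):
        if len(vals) == 1:
            return steps if vals[0] == 24 else None
        for i in range(len(vals)):
            for j in range(len(vals)):
                if i == j:
                    continue
                a, b = vals[i], vals[j]
                rest = [vals[k] for k in range(len(vals)) if k not in (i, j)]
                ops = [("+", a + b), ("-", a - b), ("*", a * b)]
                if b != 0:
                    ops.append(("/", a / b))  # exact rational division
                for op, res in ops:
                    found = dfs(rest + [res], steps + [f"{a} {op} {b} = {res}"])
                    if found is not None:
                        return found  # commit only once a full solution exists
        return None
    return dfs([Fraction(n) for n in nums], [])
```

A first step is only kept if the search below it succeeds; a greedy decoder, by contrast, must commit to "10 ×" before knowing whether any continuation works.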
I think the takeaway from this failure case is that this mechanism of autoregressive inference has a fundamental flaw: it's not really good at deliberate reasoning, and it cannot look ahead or backtrack. So, how can we solve this
fundamental flaw of next-token prediction? We took inspiration from human cognition. Daniel Kahneman, in his book Thinking, Fast and Slow, proposed that human cognition has two systems: a fast, automatic System 1 that handles simple, automatic tasks, and a System 2 that imposes control over System 1 to solve slower, more deliberate tasks. If we draw a parallel between System 1 and next-token prediction, it seems that the way to fix it is that we probably also need some algorithm to impose control over that mechanism, to solve more deliberate reasoning problems. And search seems to be a natural choice for such a control mechanism; it's also one of the oldest ideas in AI. Back in the 1950s we had, for example,
the General Problem Solver from Newell and Simon. But you may ask: search has been around for decades; why haven't we applied it to these kinds of natural language reasoning problems? I think the reason is what I mentioned for ReAct: the space of thinking is combinatorial and infinite, and there's no feedback. If you think about a classical domain of search, for example chess, what you have is a small, well-defined action space, and there's a well-defined rule set with which you can actually simulate external feedback. These two pieces enable you to design or learn evaluation heuristics, and then search lets you solve those tasks. However, as I mentioned in the ReAct part, the space of thoughts, the space of reasoning, is totally different: first, it's combinatorially infinite, and second, there is no external feedback; no matter what you think, there is no ground-truth external observation in return. These two features make it really hard to enumerate or evaluate
thought as a a type of actions however the key idea of the tree of thought is that uh language is not just infinite andoral it's actually
compositional meaning that there are schematically coherent units of test and U I want to give this uh I want to use this geme of 24 as example to illustrate
a point right so on my extrem uh one way you can solve this task is you can search in a tree of tokens so that will make it very easy to uh generate what's
the next step because the next step only there four or five choices but as I mentioned it's extremely hard to evaluate whether this step is good or
not on the Other Extreme you can think of a searching in a bandit or you can think of a search in this AB Bandit of all the possible
thoughts uh the benefit is that it's extremely easy to uh evaluate whether a s is good or not right so uh you can just look at whether the final number is
24 or not but it's extremely hard to generate a good because uh if you generate a good St then the task is already
solved so in this task it seems like the right level of balance is to search in this space of intermediate equations where uh you can relatively easily
generate you know a bunch of valid equations to to drive the task forward and then it's relatively easy to evaluate whether each intermediate equation is good so really what is the
thought is a problem specific TR of design for different problems and the key idea of tree of thought is that if we design the thought space right then we can use LM to systemically generate
evaluate thoughts and search in this space of thought so what we did for this G4 is that uh it's kind of like a soft level uh bre search process or beam search
process where uh we use a very simple generation prompt to come up as a bunch of different equations and then we use a evaluation prompt to assign a score to each of the possible equations and then
we keep only the most promising s and then we search again so obviously uh this evaluation prompt cannot be profit and it doesn't have to be Profit just like any search heris sixs it just have
to be good enough to bias the search into promised directions and what we found is that uh the Chain of Thought approach can only solve 4% of the task with G4 but having this systematic
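The thought-level beam search just described can be sketched in a few lines. This is my own hedged sketch, not code from the Tree of Thoughts paper: `propose` and `evaluate` stand in for the generation and evaluation prompts, and are stubbed here with toy functions.

```python
import heapq

def tot_beam_search(root, propose, evaluate, is_solution,
                    beam_width=5, max_depth=3):
    """Thought-level beam search in the style of Tree of Thoughts.

    propose(state)  -> candidate next thoughts (a generation prompt in ToT)
    evaluate(state) -> heuristic score         (an evaluation prompt in ToT)
    """
    frontier = [root]
    for _ in range(max_depth):
        # Expand every state in the current beam.
        candidates = [nxt for state in frontier for nxt in propose(state)]
        for state in candidates:
            if is_solution(state):
                return state
        # The evaluator need not be perfect, only good enough to
        # bias the search toward promising branches.
        frontier = heapq.nlargest(beam_width, candidates, key=evaluate)
    return None

# Toy example: reach 24 from 3, with "+1" and "*2" as the thought steps.
found = tot_beam_search(3,
                        propose=lambda s: [s + 1, s * 2],
                        evaluate=lambda s: -abs(24 - s),
                        is_solution=lambda s: s == 24)
print(found)  # 24, via 3 -> 6 -> 12 -> 24
```

In the real system, both `propose` and `evaluate` are LLM calls over partial equations; the search scaffolding around them stays this simple.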
In this paper of Tree of Thoughts, we actually tried a wide range of different tasks, not just this kind of toy search task like Game of 24, but also tasks involving natural language, like crosswords or creative writing. We found that across various tasks with various features, we can flexibly define the space of thoughts and use this modular mechanism to generate thoughts, evaluate thoughts, and search over thoughts, and it leads to a consistent performance gain across various domains. So it has become one of the most popular and widely used language agent methods, and it has enabled a bunch of follow-up applications, where you can not just evaluate tasks but also try to simulate humans, simulate other agents, or even simulate yourself, and try to tackle some more practical tasks like jailbreaking language models, product recommendation, or auction games.

To quickly summarize this part: I think the fundamental essence of language agents is that we can think of reasoning as an internal action for agents. What ReAct shows is that reasoning and acting can be complementary and can help each other to solve various tasks, and what Tree of Thoughts shows is that if we think of reasoning as action, then some of the classical approaches to planning over actions, such as tree search, can be readily applied to improve reasoning. So language agents address some of the key limitations of language models and traditional agents: they ground language models with external feedback and internal control, and unlike traditional agents, they enable few-shot generalization to act in totally new domains.
In the last part of my talk, I want to shift from those empirical tasks and methods and try to build a more overarching conceptual framework for everything we have discussed so far. The motivation for a conceptual work, instead of an empirical one, is that we already have too many empirical works. Starting from language models, people have tried various ways to use the models through prompt engineering, and we have come up with mechanisms for reasoning and acting, but suddenly there are hundreds of papers flowing on Twitter and arXiv every day, with various concepts and various mechanisms for using those LMs to build various systems for various tasks. The field is growing very fast, and it is becoming very overwhelming. But underlying all those empirical works, I think some of the fundamental conceptual questions haven't been answered. How do we even make sense of the various LM systems? How do we understand and compare them when they define very different terms and concepts and are grounded in very different tasks? And where should the field be going? If we cannot build a good conceptual understanding, it is very hard to make sense of what we should do next.

I want to draw a parallel to about a hundred years ago, when I think there was a similar situation for circuits. In the 1920s and 1930s, we had developed various analog and digital circuits, for telegraphy, for various sorts of equation solving, and so on, but it had become overwhelming: underlying all those circuits, what is the principle, how do we understand them, and where should circuits be going? Something like the von Neumann architecture actually gave a very good characterization of circuits, and under the von Neumann architecture it became very clear what the computational essence of a circuit is and where the field should be going; the field went in the direction of general-purpose computing circuits, which we now call computers. So I think we also need an overarching framework for agents, and that could be a guiding principle toward general-purpose autonomous agents.
We built that not from scratch, but from some classical insights: we draw on the concepts of production systems and cognitive architectures from cognitive science and symbolic AI. A production system is essentially just a set of rules, where each rule has a condition and an action, and if the symbolic condition is met, the action is triggered. What I show here is essentially a production system for an air conditioner, and you can argue that's possibly one of the simplest AI agents you can build conceptually.
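As a hedged sketch of that idea, a production system is just an ordered list of condition-action rules; the thermostat thresholds below are my own invented example, matching the air-conditioner slide only in spirit:

```python
# A production system: an ordered list of (condition, action) rules.
# The first rule whose condition matches the current state fires.
RULES = [
    (lambda s: s["temp"] > 26, "cool"),
    (lambda s: s["temp"] < 18, "heat"),
    (lambda s: True,           "idle"),  # default rule: do nothing
]

def step(state):
    """Fire the first matching rule and return its action."""
    for condition, action in RULES:
        if condition(state):
            return action

print(step({"temp": 30}))  # cool
print(step({"temp": 22}))  # idle
```

The whole agent is the rule list plus this matching loop, which is what makes it such a clean conceptual baseline.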
The reason I brought up this kind of alien concept of production systems is that it was actually the basis of symbolic AI agents seven decades ago, and I think the evolution of symbolic agents went through a similar process, where systems became more and more complicated to solve more and more complicated tasks. But around the 1980s, cognitive scientists and AI pioneers like Newell proposed cognitive architectures as a framework to modularize and build complex symbolic AI agents using cognitive inspirations, and that framework guided the development of symbolic AI agents.

So how does this relate to today's language agents? The key insight is that we can think of a language model as a huge, stochastic production system: it has far more rules than any production system we have ever built manually, and it is also stochastic. Once we draw this analogy between language models and production systems, it becomes immediate that those cognitive architectures can guide the development of language agents. So we proposed this framework of Cognitive Architectures for Language Agents, which is short for CoALA, and on a very high level it tries to characterize agents using three concepts.

The first concept is memory. Usually, empirically, we don't think about memory; we just think about the context window of the language model and how we manipulate the string in the context window. But increasingly we see that for complicated tasks you cannot possibly store all the information in the context window: there has to be a way to manage your context, and you also want a way to learn from past experience and use that information to help your future tasks. So I think the conceptualization is that you can think of where an agent stores information as memory devices: you have a short-term working memory, which right now is roughly the context window of the language model, but you can also have various long-term memories. We can think of the language model's weights as a type of procedural memory, which we can update through gradient descent. But in particular, we have seen some new ways of learning: you can add task trajectories to an episodic memory and retrieve them later for better task performance; you can reflect or derive knowledge and add it to some data store, and retrieve it later to help task performance; and you can even update the code of the agent itself, which can also be a way to improve task performance.

I think all of that can be categorized under the second concept, the action space. The action space, like I said, is a trivial concept for traditional agents, because it is defined by the environment. But because language agents now have those internal memories, we can have systematic reasoning actions to update the short-term memory, systematic retrieval and learning actions to read and write long-term memory, and finally grounding actions to interact with the external environment, which is similar to traditional agents.

Lastly, once you have a space of actions, how do you make a decision? That goes to the concept of a decision-making procedure, which is about how you use reasoning and retrieval to choose an action. So on a very high level, what we are trying to do is characterize each agent using only three concepts: memory, action space, and decision-making.
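A minimal sketch of that three-part decomposition might look like the following. All class and method names here are my own illustration of the CoALA concepts (working vs. long-term memory, internal vs. grounding actions, a decision loop), not code from the paper:

```python
class LanguageAgent:
    """Toy CoALA-style agent: memory + action space + decision procedure."""

    def __init__(self, llm):
        self.llm = llm                # weights ~ procedural memory
        self.working_memory = []      # short-term: roughly the context window
        self.episodic_memory = []     # long-term: past trajectories

    # --- internal actions ---
    def reason(self, prompt):
        """Reasoning action: write a new thought into working memory."""
        thought = self.llm(prompt, self.working_memory)
        self.working_memory.append(thought)
        return thought

    def retrieve(self, query):
        """Retrieval action: read from long-term memory."""
        return [m for m in self.episodic_memory if query in m]

    def learn(self, trajectory):
        """Learning action: write to long-term memory."""
        self.episodic_memory.append(trajectory)

    # --- decision procedure + grounding action ---
    def act(self, env, observation):
        """Observe, reason, then execute one grounding action."""
        self.working_memory.append(observation)
        action = self.reason("What should I do next?")
        return env.step(action)
```

A real agent would plug an actual LLM call into `llm` and a real environment into `env`; the point is only that memory reads and writes, and reasoning itself, are actions chosen by the same decision loop as external acts.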
Well, how is that useful? One use of this framework is that there are hundreds of papers, and those different methods work on very different tasks using very different concepts, but through the lens of CoALA it becomes immediately clear what the computational essence of each method is, and we can immediately compare different agents across different tasks and understand the trend of the whole field. One thing we can see is that we have made a lot of progress on grounding those agents in various environments, robots, web, code, games, and so on, but this concept of long-term memory is only getting started: only recent work has begun to explore how to learn beyond gradient descent, where we can write various pieces of information into a language store and retrieve them later for task performance. And this concept of decision-making is also only getting started: Tree of Thoughts is some of the earliest work that tries to go beyond just generating one action, and instead generates multiple actions, evaluates each one, or even simulates their consequences internally. So I think there will be a lot of exciting future directions in terms of long-term memory and decision-making, and that seems to be where the field is going.
So far, I have talked about a bunch of work around the central topic of language agents, and I think the way they connect to each other is that first we propose benchmarks, and the benchmarks have new challenges that motivate new methods, and by summarizing over all the benchmarks and methods we get a conceptual framework, and this framework can in turn inspire better methods and tasks in the future. I have also worked on a bunch of different topics during my PhD, and I think this framework of language agents, along with interdisciplinary inspirations, inspires some of my future work. In the last ten minutes, I will talk about maybe two directions for language agents that I find exciting.

One is that training models for agents will be a very exciting direction, because so far what we see is that somebody trains a model for some purpose, and then we use it for various agent tasks. GPT-3 and GPT-4 were not trained to be agents; they were trained to answer questions or do dialogue. But what we see recently, in this recent work FireAct, is that if you collect a bunch of these language agent trajectories, you can fine-tune the model, and you can use a very small open-source model to outperform larger proprietary models. I think it is very important to establish this model-agent synergy, similar to how GPUs and deep learning methods co-evolved with each other: GPUs were not invented for deep learning, they were invented for playing games, and then we discovered they can be used for deep learning, and then those deep learning methods in turn inspired the future design of GPUs, and they co-evolved together. Language models are not really trained to be agents; they are trained to generate text, but then we happened to find all those amazing usages of the models. I think now is the time for those amazing usages to inspire new insights and new data to train open-source models, and they can perhaps provide the next training tokens for model learning, given that Internet data is running out.
Another direction I am personally very excited about is how we can go beyond just imitating existing human knowledge, and instead discover new knowledge and even teach humans in turn. I think language agents have the potential to be the best professor, in the sense that they can personalize to each student, adapt to each student, and give very personalized education, and they have this broad set of knowledge. They also have the potential to be the best PhD student in the world, because they could not only read hundreds of papers every year, they could read millions of papers every day, and I am sure there are a lot of ideas, concepts, and inspirations out there; if you can just combine two random areas, you may find a new idea. I think there are a lot of ideas waiting if we can just read more papers.

Through the lens of CoALA, I think today's agents are not ready for these yet. If you think about personalized education, it really requires a flexible long-term memory that stores all the interactions with the student and personalizes according to that, because you don't want to fine-tune a model for each student. And through the lens of CoALA, you can see that for scientific discovery, one of the bottlenecks is decision-making: as a PhD student, what we do in research is extremely open-ended, so we have to rely on intrinsic motivations like curiosity. How to incorporate those kinds of mechanisms into the decision-making of language agents is, I think, one of the most important problems.

I think it is a good time to stop here and go to the actually important part of the talk, which is the acknowledgements.
First, I want to thank my committee. I want to thank Danqi for recruiting all the great students, through whom I found my great friends; even though I haven't worked with her personally, by interacting with her great students I think I learned a lot. I want to thank Tom for all the classical insights; I had been wanting to work with Tom even before I entered grad school, and I am very fortunate to finally work with him toward the end of my PhD. It has been extremely helpful to learn from him, and I hope all this inspiration can help us build better agents one day. I want to thank Sanjeev for the retreat, which has been clearing my mind this week, and for the retweet, which is promoting my thesis. Sanjeev said on Twitter that this is great, and that has been a great compliment for me: years ago I read his theoretical CS textbook, and he is like a CS hero for me, so it is hard to imagine that one day he would say my thesis is great. I want to thank Ben for making me feel old: it is surprising to find that my examiner is only maybe two years older than me, and that is when I knew I am too old and need to graduate. I also want to thank Tat, who I think is on Zoom right now, remotely watching the talk; thanks for watching my talk. He actually gave me some of the best advice for giving talks, and even though I am not his student, he shared all his job talk experience with me. I really appreciate that.

Next, I want to thank my advisor and my friend. Karthik already gave a spoiler alert about how we met and how this story happened, so I will just skip that. I think it is a shame that through all of my PhD I have almost no pictures with him, so I will definitely take a picture with him afterwards. Yesterday I tried to find a picture with Karthik, and it turned out we only have two or three pictures taken together, but I actually found a great one where we are together: he was my ring holder at my wedding. So it is like papers: you don't need a lot of them, you just need a few high-quality good ones. I think Karthik has not just been the best advisor for me; as this photo illustrates, he has also been like a best friend, a best big brother, and also literally a best man. I really appreciate the free environment that you gave us and all the support, and it has been a great five years.

In contrast to the professors, I actually have a lot of pictures with students, but I am a man of simplicity, so I will just put one picture here as a representation. This was also taken on my wedding day, when all my Princeton friends came to my wedding and took this picture together; I think that was probably one of the happiest days of my life. I really appreciate Princeton, not just all the great professors, but also all the great students who have been my friends: they have diverse interests, they are very friendly, and they have been not just my labmates or peers but also my lifelong friends. I really appreciate this support and the great atmosphere that Princeton has provided. And lastly, I want to thank all my collaborators. Maybe I will take some questions now.
Q: What do you think about the direction of teaching the model to search by itself? Meaning, instead of having a value estimator, the model acts and then reflects on its own actions, and maybe instead of backtracking by deleting steps, it just allocates more computation.

A: I think there has been some follow-up work where people essentially combine ReAct and Tree of Thoughts, and the idea is that you can search not only in the space of thoughts, but in the joint space of thoughts and actions. For example, if you are trying to solve a really hard question, you can actually open various web browser tabs and make decisions not only over what you think, but also over what you do and on which web page. I think there is a trade-off between how big your action space is and how hard decision-making is: obviously, a bigger action space means you are capable of doing more things, but it also means that choosing over the actions, the decision-making, is harder. And I think one of the key bottlenecks for this kind of combination of tree search and language models is that language models are not very good at self-evaluation yet, and that is exactly why we need to train models for agents, to have this synergy. Empirically, what people have tried is to generate a bunch of different trajectories, select the good parts, and then fine-tune the model on top of them. You can think of that as a very shallow form of reinforcement learning, where you only use the good data to do imitation learning. I think that has been a very promising direction to improve where the model is not good at being an agent: evaluation and decision-making are not trained as part of the language model, but maybe once we discover these usages, we can in turn train better models to suit them.
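That "shallow RL" recipe, sampling trajectories and imitating only the successful ones (sometimes called rejection-sampling fine-tuning), can be sketched as follows; the function names and the reward threshold are my own illustrative choices:

```python
def filter_trajectories(rollout, tasks, reward, threshold=1.0):
    """Keep only high-reward trajectories as fine-tuning data.

    rollout(task) -> a trajectory (e.g. a ReAct-style thought/action trace)
    reward(traj)  -> scalar score from the environment or a checker
    """
    data = []
    for task in tasks:
        traj = rollout(task)
        if reward(traj) >= threshold:
            data.append(traj)  # imitation-learn on the good data only
    return data

# Toy stand-ins: a "trajectory" is just the task doubled, and the
# reward accepts multiples of four.
good = filter_trajectories(lambda t: t * 2, range(6),
                           lambda tr: 1.0 if tr % 4 == 0 else 0.0)
print(good)  # [0, 4, 8]
```

The surviving trajectories would then be passed to an ordinary supervised fine-tuning loop, so only the reward filter distinguishes this from plain imitation learning.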
Let's see, is there any question online?

Q (online): Yes, I do have a question. Thanks, it's a really good talk. I have a question on the framework part, because it is very good that you put this into a really big picture, with the large language model as part of that picture. What do you think is the bottleneck of that structure? Like with a CPU, the processing speed is really fast, but sometimes the bottleneck is the memory part. What do you think about that?

A: Empirically, in terms of where the field is going, like I said, I think there is this bottleneck of long-term memory: we haven't explored enough how to do flexible context management beyond just putting everything into the context window of the language model. And I think there is this bottleneck of decision-making. Like I answered in the last question, decision-making is extremely hard, especially when your action space is very large, and how to internally simulate all the possible scenarios and self-evaluate, I think that is probably still a bottleneck.

Q: Okay, thanks.
Q: I have a high-level question. Do you think we will have an end-to-end agent system one day? There is an analogy to autonomous driving: the new FSD system changed the hand-coded planning system to a neural system, and it is very impressive; it can even understand when a pedestrian waves the car on, and the car just passes. So what are your thoughts on this kind of more complicated, componentized system versus an end-to-end one?

A: That's a great question; I have an answer in the last paragraph of the CoALA paper, but I think the short answer is that we can always think of the agent as a neural network plus some code that uses the neural network, and essentially, the more capable the neural network is, perhaps the less code or structure we need to use it. With CoALA, what we are trying to propose is sort of a minimal framework of the essential components wrapping around this big neural network, but obviously, as the neural network becomes more and more capable, I think empirically we will need less and less complex scaffolding, and that is why I think it is very important to stick to very simple but general methods like ReAct. Perhaps this is also a way to collect data to better distill into the neural network: we perhaps need to first have this scaffolding around the neural network to discover what it is capable of doing, generate scalable data, and then maybe that data can in turn increase the capability of the model, and hopefully we can have less structure along the way.

But beyond the empirical goal of building AI and building useful systems, I think this framework is also very interesting if you want to understand intelligence as a science, or care about cognitive science, because it provides an understanding of what is essential. At the end of the paper, I imagine a scenario where maybe one day GPT-10 can simulate everything as a chain of thought: long-term memory, decision-making, all the complicated things, just through next-token prediction in one context window with 100 million tokens. But even then, I think it will be interesting to analyze which components of the neural network are doing those different functions, compare that to human intelligence, and use that as a way to establish some understanding of intelligence.

Are there more questions online?

Q (online): Yes, I have one question, can I ask?

A: Sure, go ahead.
Q (online): What are your thoughts on continual learning of LLM agents, or just LLMs? By continual learning, I mean adapting to different domains, or adapting to an evolution in the domain, like the domain keeps changing with time.

A: I think, as this picture shows, if you think about long-term memory, and you think of learning as just writing into long-term memory, then you essentially have two ways of learning: you can either learn by updating the neural network with gradient descent, or you can learn by adding more knowledge or information into the textual part of your memory. I feel like they might have complementary strengths. For tasks where you can collect a lot of examples and there are repetitive patterns, gradient descent might be the better way to go. But imagine a more open-ended task, like doing PhD research: you only do it once, and you learn along the way. At least for me, as a PhD student, you learn all this knowledge, all this experience, all these skills, and somehow they are there, probably like textual knowledge in my brain, so that if a subject comes up, I have this knowledge in my brain that I can retrieve to help me. For me as a human, learning is this continual experience where I am not just updating muscle memory but also this kind of high-level language memory, and I think that part is still not there yet for today's language agents. That is probably one of the most important directions.

Q: Oh, thank you.

Q: Okay, thanks, great talk.
There's a question in chat which I thought was pretty good, so I will put it to you. The question is: what do you see as the most likely way that language agents fail to maintain real-world traction in the coming years, or otherwise turn out not as significant as you anticipate? Maybe another way to think about it: within this CoALA framework, between memory, action space, and decision-making, what do you think is the bottleneck, or the most challenging aspect of the framework?

A: I think your first question is asking about the domains where these agents might most likely be deployed in the near future, and the second question is what the bottleneck of the framework is.

Q: Yes, and I guess they might go hand in hand, because if there is a big bottleneck, you would not see a real-world application.

A: Yeah. My personal opinion is that maybe one of the biggest domains for the application of these language agents is exactly the digital automation task that I proposed: how to automate various tasks on computers, on the Internet, on this large-scale digital infrastructure that we have built and that we work and entertain on every day. The reason is that it is both scalable and practical. Obviously, there are also other applications, like deploying these language and vision ideas to robots, or as chatbots, but in terms of the potential impact and the opportunity for improvement, I think we will see a lot of exciting applications for those computer-using agents. And the bottleneck, like I said before, is this idea of long-term memory and decision-making, and that not only requires us to understand what is going on and what to do, but a very important part is also how to better train models once we understand what is lacking and what we should improve on: how to better retrieve and learn, and how to better make decisions by self-evaluating and self-simulating.

Q: Awesome, thank you.
Q: (partially inaudible question about the safety of agent systems)

A: I think safety is probably one of the most important topics for AI in general, but especially for these language agents, because if you think about the danger of a language model, it is mostly about generating hateful or biased speech, but a language agent's safety concern can be much bigger, because it could be connected to actual code, or the web, or even robots, and once it has the ability to affect the world, the safety concern is much greater. For the safety problem, we first need a very good abstraction of what the agent even is, and I think the concepts of memory, action space, and decision-making can actually be important for safety purposes. Once we define modular memory devices, we can systematically define write access and access control for each memory: maybe this user can only access this memory, and this model can only access that memory. And only when we define a very clear action space can we have some kind of estimate of the worst possible scenario, the worst thing that could be done: is the model actually able to change its own code, or is the agent actually able to control your whole computer? You need to first define the action space and then estimate the consequences. And I think decision-making is something interesting here: even if we can have clearly defined decision-making, I guess the empirical way to go is more of a human-in-the-loop approach, where we incrementally build trust by interacting with the agents over time. I don't think you want a fully autonomous agent deployed on your computer immediately, doing whatever it wants without your permission; I think it might be a more gradual process where trust is built in.
Q: In your talk, ReAct and Tree of Thoughts are two extremes, in the sense that ReAct has one thought and then acts, and Tree of Thoughts has a lot of thoughts. How should we think about how much thinking there should be?

A: That is a great question, and I think right now, empirically, we don't have a good solution yet. What happens is that for different tasks, you need to define the right amount of thought, the format of thought, and the proper length of thought to solve the given task. For example, for question answering, you obviously don't need a lot of search, so the ReAct format is good, and for more search-oriented problems like Game of 24, we obviously design the agent to have more search and then more action. But I think if we want these agents to be more general-purpose and autonomous along the way, one critical issue is how the agent can make decisions over how much it wants to think; that is sort of a meta-reasoning question. I don't have a good solution yet, but I think drawing from human inspirations might be a way to go.
My other question is: in your first part you focused on language as a way in which we can evaluate these models, and then you subsequently talked about reasoning. I have a question about whether you think there is something special about language that allows it to play both of these roles. One role is that it allows you to define complex worlds, and this is something which allows us to evaluate the models of interest. But when you're thinking about using language for reasoning, it seems like that's something which is different from the way that, say, generative models
or video models do video reasoning. Is there something special about language? Yeah, in my opinion there are two things that are special about language that make language models and language agents different from traditional agents. Language is the general-purpose representation that we use to do various things, and that means, first, that language can be a very unifying representation for various domains: we can make web navigation a text game, we can make robotic manipulation a text game, we can make dialogue a text game, and we can make a video game also a text game. This ability to represent various domains and various observations and actions is one part of what's special about language.
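The "everything is a text game" point above can be sketched as one minimal interface that web navigation, robotics, or dialogue could all implement. The class and method names here are illustrative, not from any particular library:

```python
# Minimal sketch: any domain with text observations and text actions
# fits a single "text game" interface.

class TextEnv:
    def reset(self) -> str:
        """Return the initial text observation."""
        raise NotImplementedError

    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply a text action; return (observation, reward, done)."""
        raise NotImplementedError

class WebNavEnv(TextEnv):
    """Web navigation cast as a text game: pages become text, clicks become commands."""
    def reset(self) -> str:
        return "You are on a search page. Actions: search[query]"

    def step(self, action: str) -> tuple[str, float, bool]:
        if action.startswith("search["):
            return "Results: item A, item B. Actions: click[item]", 0.0, False
        return "Nothing happens.", 0.0, False

env = WebNavEnv()
obs = env.reset()
obs, reward, done = env.step("search[red shoes]")
```

A robotics or dialogue environment would subclass the same interface, which is what lets one language agent act across all of them.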
But language is not only a general representation for those things; it's also a medium of thought and communication for humans. So secondly, I think what's special about language is that you can actually have this mechanism of reasoning, which is kind of like self-communication, and which in a way supports a new kind of generalization beyond just learning through lots of samples or manual rule design. You can think of manual rule design as being written in a symbolic domain-specific language, and you can think of those neural agents, whether trained by imitation learning or reinforcement learning, as reasoning in a neural domain-specific language. But what's special about natural language is that it is a predefined, general-purpose language, and we have rich priors that we can operate over, and that boosts generalization. Again talking about ReAct: it seems like there are two
components there, the model and the external tool APIs. What do you think the relative cost of calling the tool APIs will be versus doing the reasoning? The Google here isn't Gemini, it's Google's search API. Yeah, you're asking what the cost of taking actions is versus the cost of doing the reasoning. So OpenAI could be paying two costs here: one, they're paying to serve the language model, and two, they're paying Google to get the search results. Yeah, what is the ratio of these costs? I think empirically most of the cost is associated with the reasoning part, which is using the language model API to generate all that language; usually the tool is relatively cheap. I remember, for example, that this Google search API is much cheaper than the OpenAI API. I think an interesting empirical question is that those language model APIs are designed in a stateless manner, meaning that when I generate all those thoughts and then those observations, if I do this locally I can cache those keys and values, the Transformer states, and I don't have to recompute everything. But if I have a trajectory with 100 steps, that means I actually have to recompute the same prepended context 100 times. I think that's a very interesting and important systems problem that we need to solve; maybe we need a stateful API.
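The stateless-versus-stateful point above can be made concrete with a toy token count: re-encoding the full growing context each step costs quadratically in the number of steps, while reusing cached keys and values is linear. The function names and numbers are illustrative only:

```python
# Toy cost model for the stateless-API problem described above.

def stateless_tokens(step_tokens: int, n_steps: int) -> int:
    """Tokens processed if the whole context is re-encoded at every step."""
    total, context = 0, 0
    for _ in range(n_steps):
        context += step_tokens  # context grows by one step's thoughts/observations
        total += context        # a stateless API re-processes the entire context
    return total

def cached_tokens(step_tokens: int, n_steps: int) -> int:
    """Tokens processed if previous keys/values (Transformer states) are cached."""
    return step_tokens * n_steps  # only the new tokens are encoded each step

# A 100-step trajectory with ~100 tokens added per step:
print(stateless_tokens(100, 100))  # 505000
print(cached_tokens(100, 100))     # 10000
```

The roughly 50x gap for a 100-step trajectory is why caching the prepended context matters so much for long-horizon agents.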
You mentioned language is very special for reasoning. Is there any other type of reasoning that you think could not possibly be covered by language? Language models are very complicated and basically black boxes; we try to reason in our way of thinking and describe our findings in language, but is there anything that exists that language cannot capture? I think that's a great question, and the answer is yes: obviously language cannot reason over everything, and I think there
is a need to incorporate multimodal capabilities into this framework of language agents. Recently there have been advances, I think, along two directions of multimodality. One direction is how to better turn images or videos into language; think of GPT-4V, for example. On the other hand, there are mechanisms to better turn language into other modalities; think of DALL-E or Sora. I think it's important to think about how to incorporate both types of mechanisms into this framework of language agents, because the first type can be seen as a form of multimodal reasoning: if you want to work beyond the domain of language, you need to reason over images or whatever modality the environment provides, and obviously, for perceptual details, you need some multimodal capabilities. I think what's interesting is that you
can also think of things like DALL-E or Sora as a way to build better world modeling for decision making. In Tree of Thoughts, for example, what you're simulating is just language-based dynamics or games; but if you're trying to solve a more physical task, maybe you can think of DALL-E or Sora as implementing a way to build better internal simulation, and that can maybe enable better decision making. You can imagine: if I kick the ball this way, what would the result look like? I can render the video, look at it, and see how good the outcome is. I think that could be a very interesting way to explore those multimodal models.
Earlier you mentioned that there might be differences in how much thinking would be best for a reasoning agent. Could you imagine a scenario where a ReAct agent has another reasoning agent that helps it decide how it should act? How do you think about having multiple agents? Yeah, that is also a great question and a very important future direction that I didn't cover. We actually have this follow-up work to ReAct called Reflexion, where the idea is: suppose a ReAct agent makes some mistakes; due to the nature of autoregression, it might be very hard for the agent itself to correct its own mistakes, because usually you don't have a lot of self-correction data on the internet to train those language models. So what you can do is have an external critic or evaluator that checks the whole trajectory and gives some feedback: okay, this is wrong, this is wrong, maybe try again with a different approach. And it's shown to significantly increase performance.
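The Reflexion-style loop described above can be sketched with toy stand-ins for the model calls: act, get verbal feedback from an external evaluator, then retry with that feedback in memory. The actor and evaluator here are hypothetical placeholders, not the actual Reflexion implementation:

```python
# Toy sketch of a Reflexion-style retry loop with an external evaluator.
# Toy task: the agent must learn (from feedback) to reverse its input string.

def actor(task: str, reflections: list[str]) -> str:
    """Stand-in for a ReAct agent; a real actor would prompt an LLM with the
    task plus the accumulated reflections."""
    if any("reverse" in note for note in reflections):
        return task[::-1]
    return task  # naive first attempt

def evaluator(task: str, attempt: str) -> tuple[bool, str]:
    """External critic: checks the trajectory outcome and gives verbal feedback."""
    if attempt == task[::-1]:
        return True, "correct"
    return False, "Wrong output; the task was to reverse the input."

def reflexion_loop(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []
    attempt = ""
    for _ in range(max_trials):
        attempt = actor(task, reflections)
        ok, feedback = evaluator(task, attempt)
        if ok:
            return attempt
        reflections.append(feedback)  # verbal feedback persists across trials
    return attempt

print(reflexion_loop("agent"))  # tnega
```

The key design choice is that the correction signal comes from outside the actor, as verbal feedback stored across trials, rather than expecting the autoregressive model to catch its own mistakes mid-trajectory.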
I think multi-agent is a very interesting future direction, but my personal opinion is that we need to be careful how we define even a single agent, because empirically a lot of the so-called multi-agent methods are essentially just multi-prompt methods, where you have multiple prompts trying to carry out a dialogue and so on. But if you think about human agents, an agent is more than just GPT-4 or whatever language model plus a system prompt, right? Each of us has our own long-term memory and our own decision-making procedures; each agent should be much richer than just a language model and a prompt. So I think if the capability of each agent goes beyond a certain threshold, then it will be very interesting to see how things emerge as we scale up the number of agents. A parallel would be: if you have a team of incapable people, the more people you have, the less likely you are to achieve the task; but if you have a room of smart people, the more people you have, probably the more likely you are to solve more complex tasks. But I think
today's single agent isn't really good enough yet, so my personal opinion is that we should make the single agent work well first, and then we can scale up.