[S1E10] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
By AITalk Echo
Summary
Topics Covered
- LLMs Plan But Cannot Execute
- Agents Need Iterative Sensing-Action Loops
- OSWorld Enables Real Computer Tasks
- Current VLMs Fail Screenshot Control
Full Transcript
I'll start. I have started the recording, so you're free to begin.

Okay, thank you all for attending this seminar, or small talk, about our recent work. I'm excited to share our recent work called OSWorld; it is about benchmarking multimodal agents for open-ended tasks in real computer environments. It went up on arXiv about two weeks ago, and you may have seen people posting about it and sharing their views. Here we want to go into more of the technical details and analysis, so I hope you enjoy it, and feel free to ask any questions.

Let me start with a small introduction, using an example from my advisor Tao: IKEA furniture assembly. For students who study abroad, or anyone starting a new home, one of the first things they do is order some furniture from IKEA. They receive a box containing a small booklet, the assembly instructions, and some pieces of wood, and over the next few hours they assemble those pieces into a chair, a table, or a bed by following the simple instructions. The instructions combine modalities, illustrations and text, as shown on the left. This is actually an interesting topic that we, and the whole NLP community, focus on: planning. The booklet is essentially a detailed annotation of step-by-step plans, and on the left is the tool set you may be provided, such as a hammer and other tools you can use to assemble the chair. People are intelligent enough to ground these instructions and carry out the assembly: they can turn the pages of illustrations and put everything together into the finished furniture.
Well, the same thing happens with computer tasks in the digital world. We also run into scenarios where we want to make some small change or perform some operation on our computer that we are not familiar with, or have forgotten how to do. For example: how do I change the desktop background on my laptop, say a MacBook? On macOS, the first thing you might do is search for documentation, official or unofficial, on the web. From the search results and websites you get some clues, and grounding on those pages you move your mouse, press your keyboard, carry out the operations through mouse and keyboard, and finish the task. You end up with the new wallpaper, and the task is done.

So we were wondering whether large language models or vision-language models can be used for this kind of task as well. The basic first step is to simply ask GPT-4 or ChatGPT: how do I change my Mac desktop background? Its answer is very detailed, but it is only a text-based response: it tells us to first click on the Apple menu, second open the wallpaper and screen saver settings, third you will find such-and-such, and so on through steps four and five until you are finished. It is the same with the IKEA chair: if you ask ChatGPT how to assemble an IKEA chair, it will also give you a plan, or even detailed steps, that look quite reasonable, and a normal user can probably draw some insight from the response and try to finish the task themselves. So it seems that ChatGPT is able to generate step-by-step plans that could be useful.
Whenever we give it instructions, it can generate a fairly fruitful response. Maybe that is due to its inherent ability: it is trained on web text, it stores that knowledge in its weights, and it can output it according to your scenario. But what it cannot do is execute the task on your Mac by grounding the plan into actions. It can only describe the task; you are the one who is supposed to carry it out, and it cannot do it for you. So you should not expect, and OpenAI may never have aimed its alignment at, ChatGPT or GPT-4 becoming the actual executor of this kind of agent task. It cannot hold your mouse and keyboard to change the wallpaper for you, and it certainly cannot hold your hand, or drive a humanoid robot or some other form of robotics, to finish the furniture assembly.
So here we proposed an interesting direction that we call XLang, executable language grounding. We are wondering: what if we build agents based on large language models and vision-language models that take human instructions and ground them in a real environment by outputting actions? The actions can range from SQL or Python code (maybe a SELECT * FROM table), to API calls (what OpenAI previously called plugins, maybe a shopping API with some parameters such as an item name), to web or app control, such as clicking on a coordinate with x equal to 100 and y equal to 100, and even robotics control, such as grasping something with speed equal to 1.0 and some force value. We can then execute these actions in environments such as databases, web apps, mobile, desktop, or even the physical world, get a new observation, and iterate again and again. Of course, the agent can also pick up tools from a tool kit, such as a SQL interpreter, a Python interpreter, a robotic arm, a calculator, Hugging Face APIs, or other models. Through this formulation we can solve challenging tasks that we could not model before with a single inference, what we would call a single-observation, single-action (input-output) setting. That is what XLang aims to do.
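To make the loop concrete, here is a minimal sketch of the sense-act formulation described above. The agent/environment interface and the action formats shown in the comments are illustrative only, not the actual XLang or OSWorld APIs.

```python
# Illustrative sketch of the executable-language-grounding loop.
# `agent` and `env` are assumed to expose act/reset/step; none of these names
# come from the released OSWorld code.
def run_episode(agent, env, instruction, max_steps=15):
    observation = env.reset(instruction)      # e.g. screenshot, web page, sensor readings
    for _ in range(max_steps):
        # The agent may emit different kinds of executable actions, e.g.:
        #   "SELECT * FROM table"                    -- SQL / Python code
        #   {"api": "shop", "params": {...}}         -- an API ("plugin") call
        #   {"click": {"x": 100, "y": 100}}          -- web / app control
        #   {"grasp": {"speed": 1.0, "force": 5.0}}  -- robot control
        action = agent.act(instruction, observation)
        observation, done = env.step(action)  # execute, then observe again
        if done:
            break
    return observation
```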
Now, just a bit of background: what is an intelligent agent? Here is a screenshot from the classic AI textbook (Artificial Intelligence: A Modern Approach), which describes an agent as something that takes percepts one at a time and maps the percept sequence to a sequence of discrete actions. That is, it uses sensors to take in percepts and acts on the environment through its effectors. Concretely, the sensors could be a camera, screenshots, or radar, depending on your application scenario, whether you want a digital agent, a self-driving car, or a humanoid robot. Its core is the "brain", which may be served by large language models, vision-language models, or some neuro-symbolic system we design ourselves later. It also has effectors, maybe robotic arms or code interpreters. And the environment, as mentioned, depends on the application scenario: computer, mobile, or the physical world.

In the robotics community, a paper called Code as Policies made progress by formatting actions as code to finish certain tasks, and we extend that direction to become more general, as mentioned before, so it can cover broader scenarios such as data, web, mobile, and the physical world.

So what challenges does an agent face, and what abilities does it need? First, it needs to understand user intent from instructions. Second, it needs to utilize tools to expand its capabilities. It also needs to explore complex, dynamic environments, plan and reason over multiple steps, and follow feedback to do self-debugging. An agent needs all of these abilities to be capable enough for real-world scenarios. Our recent works on advancing natural language interfaces and language models as agents are shown below, ranging from Instructor and Binder to Lemur, OpenAgents, and Text2Reward, and today we will talk about OSWorld. OSWorld is joint work by many authors and collaborators from different institutions.
We propose autonomous agents for computer tasks, and computer tasks often involve multiple apps and interfaces. Take this example: we want to "update the bookkeeping sheet with my recent transactions over the past few days in the provided folder". The left side shows the initial state of the environment: the spreadsheet is open and contains the previous records of my transactions. The agent first needs to minimize this window, then check the desktop of the computer, where it will find some files recording recent transactions. It has to click them one by one, perceive them clearly, look into the details and extract the information, and finally go back and fill that information into the sheet by predicting accurate coordinates, clicks, and typing. So it is quite challenging.

The main challenge for the benchmark is that there was no real, scalable, interactive environment for modeling this kind of real-world scenario. Consider two prior works. The first is Mind2Web, which we would call a static benchmark: it only contains demonstrations without an executable environment, so there is no execution-based evaluation and it cannot support interactive learning or real-world exploration. Another work, also on web agents, is WebArena. It does provide environments built from some sampled scenarios, but the environments are limited to specific apps or domains because it focuses only on websites; it simplifies the agent's observation and action spaces, and its limited task scope cannot support evaluation of complex, real computer tasks. So with OSWorld we propose the first scalable, real computer environment, which can serve as a unified multimodal agent environment for evaluating open-ended computer tasks that involve arbitrary apps and interfaces across operating systems.
Here we will dive into more details. The task formulation is as follows. First, we have a task instruction, for example "update the bookkeeping sheet with my recent transactions over the past few days in the provided folder". We feed this instruction to the agent, and the agent is responsible for interacting with the environment, which is a real computer environment with different operating systems (such as Ubuntu, Windows, or macOS) and real apps: the file manager and other system utilities, VLC for video playing, GIMP for image editing, Chrome for web browsing, Thunderbird for email, LibreOffice for spreadsheets and presentations, and VS Code for coding. It also spans interfaces, both the GUI (graphical user interface) and the CLI (command-line interface). Once the agent decides it has completed or failed its task, or it reaches the maximum number of interaction steps, we take the final state of the environment, and our evaluation script decides whether the task was accomplished successfully or failed. Later we will give more details on how we do each of these steps.
One thing that should be specially noted is the observation and action spaces, which make the setting more realistic but also more challenging for the agent: the input is based on screenshots and the accessibility tree, and the agent predicts raw mouse and keyboard actions as output at each turn of the interaction. There is some formal definition here that we will skip for the moment. The action space is, for example, clicking on a certain pixel of the screen, such as click(300, 540) with the right button. The observation space is the raw screenshot, which is what a human perceives, plus the accessibility tree, which is what a human with a disability might rely on to operate the computer.
Next, let me explain in more detail what happens in this environment to enable these real computer tasks for agent training, evaluation, and further learning or anything else you might be interested in. The first step: given a computer task instruction, such as the bookkeeping-sheet update we mentioned before, we have a task initial-state setup config, shown here and marked with different highlight colors. You can see that it will, for example, download a bookkeeping sheet and open it. We also have an evaluation config, with evaluator functions and expected results, which we will discuss later. With this initial environment-state setup, our task interpreter interprets the download and open actions, executes them in the environment, and prepares the initial state. So although the computer starts out empty, with nothing on it, our interpreter turns this initial-state setup into the state you see on the left of this page.
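As a concrete illustration, here is a rough sketch of what such a task config could look like; the field names, paths, and URLs are hypothetical and may not match the released OSWorld schema exactly.

```python
# Hypothetical task config: instruction + initial-state setup + evaluator.
task_example = {
    "instruction": ("Update the bookkeeping sheet with my recent transactions "
                    "over the past few days in the provided folder."),
    # Initial-state setup, interpreted inside the VM by the task interpreter.
    "config": [
        {"type": "download",
         "parameters": {"url": "https://example.com/bookkeeping.xlsx",  # placeholder URL
                        "path": "/home/user/Desktop/bookkeeping.xlsx"}},
        {"type": "open",
         "parameters": {"path": "/home/user/Desktop/bookkeeping.xlsx"}},
    ],
    # Execution-based evaluation: fetch the resulting file from the VM and
    # compare it against a golden reference under task-specific rules.
    "evaluator": {
        "func": "compare_table",
        "result": {"type": "vm_file", "path": "/home/user/Desktop/bookkeeping.xlsx"},
        "expected": {"type": "cloud_file", "path": "https://example.com/golden.xlsx"},
        "options": {"rules": ["<task-specific comparison rules>"]},
    },
}
```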
The next step: the agent takes the task instruction as input, and the environment offers its observation, the screenshot and the accessibility tree. Here is an example. The screenshot can be the raw screenshot, without any bounding boxes. As you may know, there is a method called Set-of-Mark that draws bounding boxes on each button or text box, which can anchor the multimodal model's attention and make it perform better, so we also show an annotated example for demonstration; in any case it augments the screenshot. For the accessibility tree, the raw tree contains a huge number of tokens, so it needs some trimming to significantly reduce the token count. That is a research direction in itself, which we leave to others; here we just use some heuristics to retrieve the useful elements, such as buttons that are clickable and text boxes that can take input. You can see more details in our code.
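Here is a minimal sketch of that kind of heuristic pruning, assuming an XML-serialized accessibility tree; the tag and attribute names are illustrative and not the exact filter used in the OSWorld code.

```python
# Keep only accessibility-tree nodes a user could plausibly interact with:
# clickable buttons, editable text fields, links, and other named controls.
import xml.etree.ElementTree as ET

INTERACTIVE_TAGS = {"push-button", "text", "entry", "menu-item", "check-box", "link"}

def prune_a11y_tree(xml_string, max_nodes=300):
    root = ET.fromstring(xml_string)
    kept = []
    for node in root.iter():
        tag = node.tag.split("}")[-1]               # strip any XML namespace
        name = (node.get("name") or "").strip()
        editable = node.get("editable") == "true"   # attribute name is an assumption
        if name and (tag in INTERACTIVE_TAGS or editable):
            kept.append(f"{tag} '{name}' at {node.get('screencoord', '?')}")
        if len(kept) >= max_nodes:
            break
    return "\n".join(kept)
```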
The agent is then responsible for predicting mouse and keyboard actions, and the environment executes them. As shown here, the action space covers essentially the full set of things a human can do on a computer; we formulate the actions that way, and we want the agent to generate them that way as well. This means that if the agent is powerful enough, it will not be blocked by our action space or observation space: it will be capable of generating any action a human can perform, to finish any task a human can finish.

Let's look at the right side first. The action space includes move to, click, right click, press, hotkey, scroll, drag to, key down, key up, plus three special actions we add: WAIT, FAIL, and DONE. WAIT means the agent decides it should wait a few seconds for some action on the computer to take effect. FAIL means the agent thinks the task is infeasible, for example because some feature of the software is outdated, or because of a hallucination on the human user's side, where they believe a function exists that actually does not. DONE means the agent decides the task is finished. There are also two examples on the left: it can click on some point (chrome_x, chrome_y) to open the Chrome app, and, in the second example, it can type some commands in the terminal and then hit Enter to execute them. As a full example, for "monitor the system CPU for 30 seconds and output the results" the agent outputs a sequence: first click on (terminal_x, terminal_y), then click on (focus_x, focus_y), then type some commands, and it is done. So that is the action space and observation space of the environment we provide.
After interacting again and again, and finally finishing all these steps, we reach the final step: evaluation. This is very important; in an executable-language setting we need not only a realistic environment but also reliable evaluation. So how do we evaluate? This is something special that distinguishes us from other benchmarks: we wrote one task-specific evaluation script for nearly every single task. These example-wise evaluation scripts make our choice of examples much more flexible and diverse. Previously you might use an exact-match metric for open-domain QA, or execution accuracy (pass@1 or pass@5) for code generation, but now we are handling a more complex, perhaps the broadest possible, kind of task, the computer task, which involves many more ways of modeling the evaluation, since a task can touch anything on the computer.

For example, consider the task "can you help me clean up my computer by getting rid of all the tracking things that Amazon may have saved". To decide whether the agent has finished this task, we have an eval function that checks the cookies: we first get the cookie data from the environment, which is a virtual machine, and then we use rules to check whether the cookie rows whose domain matches amazon.com have been deleted. That is one evaluation script. For a second example, such as "rename sheet1 to some name, make a copy, and place the copy ...", we use a different strategy for this sheet-handling task: we first get the resulting file from the VM, then we get a golden file from the cloud, we compare the two through our compare_table function, taking in some special requirements that we call rules, and we output the final result.
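Sketching the first example: assuming the environment has already copied Chrome's cookie database out of the VM, the check might look like this (helper names and paths are illustrative, not the exact OSWorld evaluator).

```python
# Simplified sketch of a per-example evaluator for the "delete Amazon tracking
# data" task. Assumes the Chrome "Cookies" SQLite file has already been pulled
# out of the virtual machine by the environment's getter.
import sqlite3

def check_amazon_cookies_deleted(cookie_db_path):
    conn = sqlite3.connect(cookie_db_path)
    try:
        rows = conn.execute(
            "SELECT host_key FROM cookies WHERE host_key LIKE ?",
            ("%amazon.com%",),
        ).fetchall()
    finally:
        conn.close()
    # Rule: success only if no amazon.com cookie rows remain.
    return 1.0 if not rows else 0.0
```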
Putting all of that together, you can get a rough picture of our benchmark. In total we collected 369 real-world computer tasks, or computer task environments, involving real web and desktop apps in open domains, including OS-level file I/O, professional and office work, and daily workflows. They cover not only the GUI but also the CLI and their combinations. Each task example is carefully annotated with a real-world task instruction from real users and an initial-state setup config to simulate a human's work in progress, because when you need a digital agent to help you finish a task, you often do not start from scratch right after booting the computer; you are in the middle of some work. We try to mimic that scenario to make the tasks more realistic. Third, we also have custom execution-based evaluation scripts, as mentioned before, very carefully annotated and very reliable for judging whether a task is finished or not. We have more statistics, shown here, and you can check our website and paper later.
So what distinguishes us among all these benchmarks? We can break it down into a few aspects, comparing against previous agent benchmarks such as GAIA, AgentBench, and InterCode, as well as web-agent and GUI-operation benchmarks ranging from earlier work such as MiniWoB to recent work such as WorkArena, VisualWebArena, WebArena, and OmniACT. One distinguishing point is the number and nature of the tasks: some of those benchmarks have more than 10,000 data examples, but most of them are generated from one, or maybe a couple of hundred, templates with different values filled into slots. Our tasks are not built that way; every task in our benchmark is distinct. Our executable environment is focused on the computer, which models the whole digital task space in a unified way: you do not need to treat code, the web, or mobile in isolation; you can use our environment and handle them all together. We also highlight the scalability of the environment: we do not only provide the benchmark, we provide a paradigm by which you can scale the environment yourself. You annotate an initial-state setup config and an execution-based evaluation script, and then you can annotate any example you need in our environment, or in some later environment built the same way. And of course, multimodal foundation models, maybe next year or later this year, will become much more powerful and better support this kind of digital agent or other agents. We are also distinctive in being cross-app, so the focus is on real scenarios; we have intermediate initial states; and lastly we have more than 100 evaluation functions, which others may not have.
To demonstrate the difficulty of the OSWorld environment and benchmark, we did a human evaluation. Compared with the previous, very well-known web agent benchmark WebArena, our benchmark contains many examples that take a human more than 10 minutes to finish; the median time for our tasks is over 100 seconds, while it is only about half a minute for WebArena, and human accuracy on our benchmark is also much lower than on theirs.
Next, let's talk about how current LLM- and VLM-based agent baselines do on this benchmark. We use a prompting approach: we wrote an instruction, shown on the right, that tells the language model or vision-language model to act as a digital agent, that it can output pyautogui code, what it can and cannot do, plus additional knowledge such as the computer's password and other information to control the format of the actions. After setting this up, we carefully chose models: from the open-source side, Mixtral (the mixture-of-experts model) and CogAgent, which is specially optimized for high-resolution scenarios; and from the most powerful closed-source models, GPT-4, Claude 3 Opus, and Gemini Pro, in their vision and text versions. We fix the temperature at 1.0 and top-p at 0.9, and we provide the most recent three observation-action pairs as history context at each step; we call this history encoding, and later we will show whether it matters.
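A minimal sketch of this prompting setup, using an OpenAI-style chat API as a stand-in; the system prompt text, model name, and client are placeholders rather than the exact baseline code.

```python
# Placeholder for the real instruction prompt described above.
SYSTEM_PROMPT = "You are an agent operating a computer; reply with pyautogui code ..."

def build_messages(instruction, history, current_obs, max_history=3):
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # "History encoding": keep only the last `max_history` observation/action pairs.
    for obs, action in history[-max_history:]:
        messages.append({"role": "user", "content": obs})
        messages.append({"role": "assistant", "content": action})
    messages.append({"role": "user",
                     "content": f"Task: {instruction}\nObservation:\n{current_obs}"})
    return messages

def query_model(client, messages):
    # Sampling hyperparameters fixed as described in the talk.
    return client.chat.completions.create(
        model="gpt-4-vision-preview",   # one of the evaluated models (placeholder choice)
        messages=messages,
        temperature=1.0,
        top_p=0.9,
    )
```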
For perception, we have four settings. The first is the plain accessibility tree, which only requires a model with text input. The second is the screenshot, which is the same perception a human has: we take the screen capture and the model proposes actions. The third is screenshot plus accessibility tree, to see whether, and to what extent, this more complete modeling helps. We also test Set-of-Mark, which draws bounding boxes on the screenshot; in that setting we additionally input detailed information, a table of the coordinates and sizes of the bounding boxes, together with the annotated screenshot, to see whether Set-of-Mark works on these OSWorld tasks.
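A sketch of that Set-of-Mark style annotation with Pillow; the element format and styling are illustrative, not the exact OSWorld implementation.

```python
# Draw a numbered bounding box for each interactive element on the screenshot
# and build a small text table of (index, coordinates, size, name) to feed to
# the model alongside the annotated image.
from PIL import Image, ImageDraw

def annotate_screenshot(screenshot_path, elements, out_path="som.png"):
    """elements: list of dicts like {"name": "OK button", "bbox": (x, y, w, h)}."""
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    table_lines = ["index\tx\ty\twidth\theight\tname"]
    for i, el in enumerate(elements, start=1):
        x, y, w, h = el["bbox"]
        draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=2)
        draw.text((x + 2, y + 2), str(i), fill=(255, 0, 0))
        table_lines.append(f"{i}\t{x}\t{y}\t{w}\t{h}\t{el['name']}")
    img.save(out_path)
    return "\n".join(table_lines)   # the text table given to the model
```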
The results are shown on the right, and there are a few takeaways. First, LLMs and VLMs are still far from being digital agents on a real computer: the best result among all these models and settings is only around 10%, and that 10% largely comes from the easiest part of our benchmark, tasks where you may not really need the agent to model the GUI and can get by with generating a few steps of code. The second takeaway is that agent performance follows a different distribution from human performance across task types. As shown below, overall human performance is over 60%, and it is fairly consistent across the subsets: OS, office, daily, professional, and multi-app workflow. Some tasks are genuinely difficult even for normal humans who are unfamiliar with them, but human performance stays consistent. The language models and vision-language models, however, get significantly lower results on subsets such as office and workflow. This suggests a distribution shift, or an ability gap, between current models and humans, and it may point to mechanisms and directions for further research.
The third takeaway is that the effectiveness of the accessibility tree and Set-of-Mark varies across models. You can see that Gemini Pro, GPT-4, and Claude 3 Opus behave quite differently under the different input settings; some models may not be good at using Set-of-Mark as perception, while others excel when given only the accessibility tree. The fourth takeaway is that VLM agents in the screenshot-only setting show very low performance, even though that should be the ultimate configuration in the long run. Across the four settings, screenshot-only gives the lowest results: under the other settings the best models, such as GPT-4V, reach performance above 10%, while with screenshots alone they all struggle, around 5% down to nearly zero. From our observations, even GPT-4V cannot predict accurate coordinates; you can only watch it wander around, click randomly on the screen, and produce nonsensical actions. But as we said, this is the ultimate configuration, since humans do these tasks exactly that way; for now, though, vision-language models have little ability to do these tasks from screenshots alone. We also did some further analysis of the language model and vision-language model baselines.
The first finding is that higher screenshot resolution typically leads to better performance. In this analysis we use the GPT-4V model with two input settings, screenshot-only and Set-of-Mark, and we down-sample the screenshot at different ratios before feeding it to the vision-language model. Except for one special point, which we suspect is because the pre-training data contains many bounding boxes around that size, making it a sweet spot for GPT-4V, you can see a clear trend: the more you down-sample, making the image smaller, the lower the performance you get.
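The down-sampling itself is straightforward; here is a sketch with Pillow (the specific ratios used in the experiments are not reproduced here).

```python
# Resize the raw screenshot by a given ratio before sending it to the VLM.
from PIL import Image

def downsample(screenshot_path, ratio, out_path="downsampled.png"):
    img = Image.open(screenshot_path)
    new_size = (int(img.width * ratio), int(img.height * ratio))
    img.resize(new_size, Image.LANCZOS).save(out_path)
    return out_path
```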
We also did an analysis related to context length: how many tokens does the accessibility tree consume if we formulate the task in a pure-text way, or as a mixture of accessibility tree and screenshot? We computed the distribution over a sample of tasks, and the result is that to cover at least 90% of the accessibility trees, your model needs to support at least about 6K tokens just for that perception input. We also found that a longer text-based trajectory history improves performance, unlike screenshot-only history, but it poses efficiency challenges. Concretely, in the GPT-4V setting we feed in more observation-action pairs from the history, the last step, the step before that, and so on, and with a longer history-encoding trajectory we get better performance. Combined with the accessibility-tree token distribution, this means you need more efficient, longer-context models to support this kind of usage.
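One way to reproduce that kind of measurement is to tokenize each pruned tree and look at the 90th percentile; the sketch below uses tiktoken's cl100k_base encoding as a stand-in for whichever tokenizer the evaluated model actually uses.

```python
# Estimate the context budget needed to hold most accessibility trees.
import tiktoken

def a11y_token_percentile(a11y_trees, percentile=0.9):
    enc = tiktoken.get_encoding("cl100k_base")
    counts = sorted(len(enc.encode(t)) for t in a11y_trees)
    idx = min(int(len(counts) * percentile), len(counts) - 1)
    return counts[idx]   # e.g. roughly 6K tokens at the 90th percentile in the talk
```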
The next thing is that we also ran experiments not only on the Ubuntu operating system but also on Windows, and we found that the performance of vision-language models across the different operating systems is strongly correlated. This implies that insights and methodologies developed with the OSWorld (Ubuntu) benchmark can be transferred to Windows environments with a high degree of reliability. Currently we have only released the Ubuntu image; in the very near future we will also release the Windows one, though it may need some activation (licensing) on the user's side.
The next result: we applied some perturbations to the environment along different axes. We change the positions of windows from time to time to see whether that affects the agent; we change the sizes of windows, for example not minimizing them but dragging a border to make them much smaller; and we add clutter by opening irrelevant windows and applications to create distractions. We found that current VLM agents are not robust to UI layout changes and noise. We originally chose a set of tasks the agent is good at, with around 50% accuracy, and after these perturbations accuracy drops significantly, especially for the window-size change, where we see a drop of nearly 40%. You can see the paper for more interesting analysis; we did a lot of analysis in this paper.
And in this work, surprisingly, we do see some successful cases from the LLM and VLM agent baselines. Here is an example: extract the subtitles from a video. The agent successfully opened the terminal and ran ffmpeg code to extract them from the given video. This requires some software knowledge, coding knowledge, GUI interaction ability, and even some memory and exploration, and it was done by a current digital agent.
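For reference, the command the agent ran was along these lines (file names are hypothetical); it is shown here wrapped in Python for consistency, though the agent typed an equivalent command into the terminal through its keyboard actions.

```python
# Extract the first subtitle stream from a video with ffmpeg.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "video.mp4",   # input video (hypothetical name)
     "-map", "0:s:0",               # select the first subtitle stream
     "subtitles.srt"],              # write it out as an SRT file
    check=True,
)
```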
Still, as we showed, the overall performance is quite low, so there are many challenging points we can try to overcome in the near future. So thank you for listening. Here are some future research directions you may be interested in, and if you have questions we can discuss more, since time is about up.

All right, thanks Tianbao for the exciting work and the talk, it was very detailed. Okay, I'll stop the recording and then we can start the QA session.