
[S1E10] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

By AITalk Echo

Summary

Topics Covered

  • LLMs Plan But Cannot Execute
  • Agents Need Iterative Sensing-Action Loops
  • OSWorld Enables Real Computer Tasks
  • Current VLMs Fail Screenshot Control

Full Transcript

I'll start. I have started the recording, so you're free to begin. Okay, thank you all for attending this seminar, a small talk about our recent work called OSWorld: benchmarking multimodal agents for open-ended tasks in real computer environments. It went up on arXiv about two weeks ago, and you may have seen some people sharing it and posting their views. Here we want to go into more technical details and analysis in this small talk, so I hope you enjoy it, and feel free to ask any questions.

Let's start with a small introduction. This is an example from my advisor, Tao.

Take IKEA furniture assembly as an example. For students who study abroad, or anyone setting up a new home, one of the first things they do is order some furniture from IKEA. They receive a box that contains a small booklet with the assembly instructions and a set of parts, and what they do next is assemble those parts into a chair, a table, or a bed within a few hours, following the simple instructions. The instructions are presented in a few ways that combine modalities, including images and text, as shown on the left. This is actually an interesting topic that we, and the whole NLP community, focus on: it is called planning, and it is essentially a detailed annotation of step-by-step plans. Also shown is the tool set: you may be provided with some tools, such as a hammer, that you can use to assemble the chair. People are intelligent enough to ground themselves in these instructions and carry out the assembly; they can turn the pages of illustrations and put everything together into the assembled furniture.

Well, the same thing happens with computer tasks in the digital world.

We also see scenarios where we want to make some small change or perform some operation on our computer that we are not familiar with, or that we have forgotten how to do. For example: how do I change the desktop background on my laptop, say a MacBook? In the macOS environment, maybe the first thing you do is search for some documentation, official or unofficial, on the web. From the search results and websites you get some clues, you ground yourself in that web page, you move your mouse and press your keyboard, you perform the operations through the mouse and keyboard, and you finish the task. You end up with the new wallpaper, and the task is done.

So we are wondering whether large language models or vision-language models can be used for these tasks as well. The basic thing we do is ask GPT-4 or ChatGPT: how do I change my Mac desktop background? Its answer is very detailed, but it is only a text-based response. It tells us: first, click on the Apple menu; second, click on the desktop and screen saver settings; third, you will find such-and-such; and after steps four and five you are finished. The same goes for the IKEA chair: if you ask ChatGPT how to assemble an IKEA chair, it will also give you some plans, even detailed steps, that seem quite reasonable, and a normal human user can probably draw some insights from the response and try to finish the task by themselves. So it seems that ChatGPT is able to generate step-by-step plans that could be useful whenever we provide it with some instructions.

Maybe this is inherited from its training: it is trained on web data that stores this knowledge inside its weights, so it can output it according to your scenario. But what these models cannot do is execute the task on your Mac by grounding the plans into actions. They can only describe the task; you are still the one who has to do it yourself, and ChatGPT cannot do it for you. You can never expect, and OpenAI may not intend, ChatGPT or GPT-4 to be aligned toward becoming the actual executor of such agent tasks. It cannot hold your mouse and keyboard to help you change to the new wallpaper, and it certainly cannot hold your hand, or drive a humanoid robot or some other form of robotics, to finish the furniture assembly.

So here we propose an interesting direction that we call XLang, executable language grounding.

We are wondering: what if we can build agents, based on large language models and vision-language models, that take human instructions and ground those instructions in the real environment to output actions? The actions can vary from SQL or Python code, maybe a SELECT COUNT(*) FROM some table, to API calls, which OpenAI previously called plugins, maybe a shopping API with some parameters such as an item name. They can also be web or app controls, such as clicking on a coordinate with x equal to 100 and y equal to 100, or even robotics controls, such as grasping something with a certain speed and force. We can then execute these actions in environments such as databases, web apps, mobile, desktop, or even the physical world, get a new observation, and iterate again and again. Of course we can also call tools from a toolbox, such as a SQL interpreter, a Python interpreter, a robotic arm, a calculator, the Hugging Face API, or other model weights. Through that we can solve the challenging tasks that we could not model with the previous single-inference formulation, what we call the single-observation, single-action, input-output way, and solve them in this new formulation. That is what XLang would like to do.
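For illustration only, here is a minimal sketch of the kinds of action formats such an agent might emit for different target environments. The payloads and field names are invented for this writeup and are not an actual XLang interface.

```python
# Illustrative only: hypothetical action formats an executable-language-grounding
# agent might emit, depending on the target environment.
sql_action = "SELECT COUNT(*) FROM orders;"                    # database environment

api_action = {                                                 # plugin / API-call style
    "name": "shopping_search",                                 # hypothetical endpoint
    "parameters": {"query": "standing desk", "limit": 5},
}

gui_action = {"type": "click", "x": 100, "y": 100}             # web / app control

robot_action = {"type": "grasp", "speed": 1.0, "force": 0.5}   # robotics control

for action in (sql_action, api_action, gui_action, robot_action):
    # In a real loop, the environment would execute the action and return
    # a new observation for the next step.
    print(action)
```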

Just a bit of background knowledge: what is an intelligent agent? Here is a screenshot from the artificial intelligence textbook, which says that an agent perceives one percept at a time and maps the percept sequence to a sequence of discrete actions. It uses sensors to take in percepts and performs actions on the environment. To be more concrete, the sensors may be a camera, screenshots, or radar, depending on your application scenario, whether you want a digital agent, a self-driving car, or a humanoid robot. Its core is the brain, which may be served by large language models, vision-language models, or some neuro-symbolic component that we design ourselves in the future. It also has effectors, maybe robotic arms or code interpreters. And the environment, as mentioned, depends on your application scenario: a computer, a mobile device, or the physical world.

In the robotics community, a paper called Code as Policies has made some progress by formatting actions as code to finish tasks. We extend this, as mentioned before, and focus on directions where it can cover broader scenarios such as data, web, mobile, and the physical world.

So what challenges arise, and what abilities does such an agent need? First, it needs the ability to extract the user's intent from instructions. Second, it needs the ability to utilize tools to expand its capacities. It also needs to explore complex environments with multi-step planning and reasoning, and it needs to follow feedback and do self-debugging. The agent needs all of these abilities to be capable enough to face real-world scenarios. Our recent works on advancing natural language interfaces and language models as agents are shown below, ranging from Instructor and Binder to Lemur, OpenAgents, and Text2Reward, and today we will talk about OSWorld.

OSWorld is a joint work by many authors and collaborators from different institutions.

We propose autonomous agents for computer tasks, and computer tasks often involve multiple apps and interfaces. Take this example: we want to update the bookkeeping sheet with my recent transactions over the past few days in the provided folder. The left side is the initial state of the environment: the spreadsheet is open and it has the previous records of my transactions. The agent first needs to minimize this window and check the desktop of the computer, where it will find some recent transaction records. It will click them one by one, perceive them clearly, look into the details, and extract the information. Finally it has to return to the sheet and fill in this information by predicting accurate coordinates and the correct clicks and keystrokes. So it is quite challenging.

But the main challenge for a benchmark is that there was no real, scalable, interactive environment for modeling this kind of real-world scenario. We mention two prior works. The first is Mind2Web, which we would call a static benchmark: it only contains demonstrations without an executable environment, so it has no execution-based evaluation, and it cannot support interactive learning or real-world exploration. Another work, also on web agent scenarios, is WebArena. It does provide environments built from some sampled scenarios, but the environments are limited to specific apps or domains because it focuses only on the web itself; it simplifies the agent's observation and action spaces, and its limited test scope cannot support the evaluation of complex, real computer tasks.

So with OSWorld we propose the first scalable, real computer environment, which can serve as a unified multimodal agent environment for evaluating open-ended computer tasks that involve arbitrary apps and interfaces across operating systems. Now let us dive into more details. The task formulation is shown here.

First, we have a task instruction, for example the one shown here: update the bookkeeping sheet with my recent transactions over the past few days in the provided folder. We input this task instruction to the agent, and the agent is responsible for interacting with the environment, which is a real computer environment with different operating systems, such as Ubuntu or Windows, and with apps such as the file manager and other system utilities, VLC for video playing, GIMP for image editing, Chrome for web browsing, Thunderbird for email, spreadsheet and presentation software, and VS Code for coding. It also spans interfaces, both the GUI (graphical user interface) and the CLI (command-line interface). After the agent decides that it has failed or finished the task, either because it draws its own conclusion or because it reaches the maximum number of steps in the setting, we take the final state of the environment, and our evaluation script decides whether the task was accomplished successfully or failed.

Later we will give more details about how we handle each of these steps. One thing that should be specially noted is that OSWorld uses observation and action spaces that make it more realistic but also more challenging for the agent: it takes screenshots and the accessibility tree as input, and it predicts raw mouse and keyboard actions as output for each turn of the agent interaction.

Here are some formal definitions, which we will skip for the moment. What we will show here is the action space: for mouse clicking, for example, the action is a click on a certain pixel of the screen, such as click(300, 540) with the right button. And the observation space is the raw screenshot, which is what a human perceives, together with the accessibility tree, which is what a human with a disability might rely on to operate the computer.

Next, we will go into more detail about what happens in this environment to enable these kinds of real computer tasks for agent training, evaluation, and further learning, or whatever else you may be interested in. The first step is that, given a computer task instruction, such as the bookkeeping-sheet update we mentioned before, we have a task initial-state setup config, as shown here and marked with different highlight colors. For example, you can see that we will download a bookkeeping sheet and open it. We also have an evaluator config with result getters and functions, which we will discuss later. With this initial environment state setup, our task interpreter interprets these download and open actions and executes them in the environment to prepare the initial state. The computer starts out empty, with nothing on it, and our interpreter turns this initial-state setup into the state that you see on the left of the slide.
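As a rough illustration of such a task definition, the sketch below shows an instruction, an initial-state setup, and an evaluator spec. The field names and the URL are simplified placeholders, not the exact OSWorld config schema.

```python
# Simplified sketch of a task definition. Field names and URLs are illustrative.
task_config = {
    "instruction": "Update the bookkeeping sheet with my recent transactions "
                   "over the past few days in the provided folder.",
    "setup": [
        {"type": "download",
         "url": "https://example.com/bookkeeping.xlsx",      # hypothetical source
         "path": "~/Desktop/bookkeeping.xlsx"},
        {"type": "open", "path": "~/Desktop/bookkeeping.xlsx"},
    ],
    "evaluator": {
        "func": "compare_table",                              # example-specific checker
        "expected": {"type": "cloud_file", "path": "gold/bookkeeping.xlsx"},
        "rules": {"sheet": "Sheet1"},
    },
}

# A task interpreter would walk over task_config["setup"] and replay each step
# inside the virtual machine to reproduce the initial state.
```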

In the next step, the agent takes the task instruction as input, and the environment offers its observation, with the screenshot and the accessibility tree. We have an example here: the screenshot may be the raw screenshot, without the bounding boxes shown here. As you may know, there is currently a method called Set-of-Mark that draws bounding boxes on each button or text box, which may help direct the multimodal model's attention and make it perform better, so we also have an example of that for demonstration; in any case, it augments the screenshot. As for the accessibility tree, the raw tree contains millions of tokens, so it needs some trimming to significantly reduce the token count. That trimming is a research direction in itself, which we leave to others; here we just use some heuristics to retrieve the genuinely useful elements, such as buttons that are clickable or text boxes that can take input. You can see more details in our code.
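As a sketch of the kind of heuristic filtering described here, assuming a generic list of accessibility-tree nodes rather than OSWorld's actual parser:

```python
# Sketch: keep only accessibility-tree nodes that an agent is likely to act on,
# e.g. clickable buttons and editable text fields, to cut the token count.
INTERACTIVE_ROLES = {"button", "link", "menu-item", "text-box", "combo-box"}

def trim_tree(nodes):
    """nodes: list of dicts like {'role': ..., 'name': ..., 'states': [...]}."""
    kept = []
    for node in nodes:
        clickable = node.get("role") in INTERACTIVE_ROLES
        editable = "editable" in node.get("states", [])
        visible = "invisible" not in node.get("states", [])
        if visible and (clickable or editable):
            kept.append({"role": node["role"], "name": node.get("name", "")})
    return kept

example = [
    {"role": "button", "name": "Save", "states": ["enabled"]},
    {"role": "panel", "name": "", "states": []},
    {"role": "text-box", "name": "Amount", "states": ["editable"]},
]
print(trim_tree(example))  # keeps the Save button and the Amount text box
```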

The agent is responsible for predicting the mouse and keyboard actions, and the environment will execute those actions. As we show here, the action space covers the full capabilities of what any human can do on the computer; we formulate the actions that way and we want the agent to generate them that way as well. This means that if the agent is powerful enough, it will not be blocked by our action space and observation space: it is capable of generating any kind of action a human being could take, to finish any task a human being could finish. Let us first look at the right side of the action space here: it includes moveTo, click, rightClick, press, hotkey, scroll, dragTo, keyDown, and keyUp, plus three special actions that we add, called WAIT, FAIL, and DONE. WAIT means the agent decides it should wait a few seconds for some action on the computer to take effect. FAIL means the agent thinks the task is infeasible to accomplish; this can happen, for example, when some feature of the software is outdated, or when the human user hallucinates functionality that they believe exists but actually does not. And DONE means the agent decides the task is finished. Here are also two examples on the left side: the agent can click on coordinates (chrome_x, chrome_y), that is, click on a point to open the Chrome app, and in the second example it can type some commands in the terminal and then hit Enter to execute them.

As a full example, consider the task: monitor the system CPU for 30 seconds and output the results. The agent outputs a sequence: first, click on (terminal_x, terminal_y); second, click on (focus_x, focus_y); then type some commands; and then it is done. So that is the action space and observation space of the environment we provide.
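A rough sketch of what such a predicted action sequence could look like as pyautogui code. The coordinates and the exact shell command are invented for illustration; in the benchmark the agent has to predict them from the screenshot.

```python
import time
import pyautogui

# Hypothetical coordinates predicted from the screenshot.
pyautogui.click(120, 740)             # click the terminal icon to open a terminal
time.sleep(1.0)                       # WAIT-style pause for the window to appear
pyautogui.click(640, 400)             # click inside the terminal to focus it
# Monitor CPU usage (one sample per second for 30 seconds) and write the results.
pyautogui.typewrite("top -b -d 1 -n 30 > cpu_log.txt\n", interval=0.05)
# The agent would then emit its special DONE action once the task is finished.
```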

After the agent interacts with the environment again and again and finally finishes all its steps, we reach the final step: evaluation. This is very important. In executable language grounding we need not only a realistic environment but also reliable evaluation. So how do we do evaluation? This is something special that distinguishes us from other benchmarks: we wrote one task-specific evaluation script for nearly every single task. With these example-wise evaluation scripts, our choice of examples can be much more flexible and diverse. Previously you might have a single exact-match metric for open-domain QA, or execution accuracy such as pass@1 or pass@5 for code generation tasks, but now we are handling a more complex, and perhaps the most general, kind of task, the computer task, which involves many more different forms of evaluation, since the tasks can range over everything on the computer.

Here, for example, consider this task: can you help me clean up my computer by getting rid of all the tracking things that Amazon may have saved? To decide whether the agent has finished this task, we have an evaluation function that checks the cookies: we first get the cookies from the environment, which is a virtual machine, and second we use some rules to check whether, inside these cookies, the rows whose domain field matches amazon.com have been deleted. That is what this evaluation script does. For the second example, such as rename Sheet 1 to some name, make a copy, and place the copy somewhere, we have a different evaluation strategy for this sheet-handling task: we first get the resulting file, then we get a golden file from the cloud, we compare the two through our compare-table function, we take into account some special requirements, which we call the rules here, and we output the final result.
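A rough sketch of what such per-example evaluation functions can look like. The helpers that fetch state from the virtual machine (get_vm_cookies, load_sheet, load_golden_sheet) are hypothetical placeholders, not OSWorld's actual API, and the table comparison assumes pandas DataFrames.

```python
# Sketch of two example-specific evaluators.

def check_amazon_cookies_deleted(get_vm_cookies):
    """Pass if no cookie rows remain for the amazon.com domain."""
    cookies = get_vm_cookies()  # e.g. rows pulled from the browser's cookie DB in the VM
    return all("amazon.com" not in row["domain"] for row in cookies)

def compare_table(load_sheet, load_golden_sheet, rules):
    """Pass if the agent-produced sheet matches the golden sheet under the given rules."""
    result = load_sheet(rules["result_path"])       # pandas DataFrame from the VM
    golden = load_golden_sheet(rules["golden_url"])  # pandas DataFrame from the cloud
    cols = rules.get("columns", golden.columns)      # only compare the columns the task cares about
    return result[cols].equals(golden[cols])
```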

Putting all of this together, you can get a rough picture of our benchmark. As a result, we collected 369 real-world computer tasks, or you could say computer environments, that involve real web and desktop apps in open domains, including OS, office, daily, professional, and multi-app workflow tasks, etc. It contains not only the GUI but also the CLI and their combination. Each task example is carefully annotated with a real-world task instruction from real users and an initial-state setup config to simulate human work in progress, because when you need a digital agent to help you finish a task, you often do not start from the very beginning right after booting your computer; you are in the middle of some work. We try to mimic that scenario to make our agent tasks more realistic. Third, we also have custom execution-based evaluation scripts, as mentioned before, very carefully annotated and very reliable for evaluating whether a task is finished or not. We have more statistics as shown here, and you can check our website and paper later.

So what distinguishes us among all these benchmarks? We break it down into a few different aspects, ranging over previous agent benchmarks such as GAIA, AgentBench, and InterCode, and the web-agent-focused and GUI-operation-focused benchmarks, from earlier work such as MiniWoB to recent work such as WorkArena, VisualWebArena, WebArena, and OmniACT. To distinguish OSWorld systematically: first, we have a relatively large number of tasks. You can see that some benchmarks have more than 10,000 data examples, but most of those are generated from one, or maybe a couple of hundred, templates with different value slots filled in. Our tasks are not built that way; every task in our benchmark is unique. Second, our executable environment is focused on the computer, which models the whole digital task space in a unified way: you do not need to treat code, web, or mobile in isolation, because in our environment you can handle them all together. We also highlight the scalability of the environment: we provide not only a benchmark but a paradigm, where you annotate an initial-state setup config and an execution-based evaluation script, and in that way you can annotate any example you need in our environment, or in some later environment built the same way. And of course, maybe next year or later this year, there will be much more powerful multimodal foundation models to support this kind of digital agent or other agents. We are also distinguished by being cross-app, which fits real scenarios more closely, by having intermediate initial states, and lastly by having more than 100 evaluation functions, which others may not have.

To demonstrate the difficulty of our OSWorld environment and benchmark, we did a human evaluation. Compared with the previous, very well-known web agent benchmark WebArena, our benchmark contains many examples that require a human more than 10 minutes to finish; the median time of our tasks is over 100 seconds, while it is only about half a minute for WebArena, and human accuracy on our tasks is much lower as well.

Next we will talk about how current LLM- and VLM-based agent baselines do on this benchmark. We use a prompting approach: we wrote an instruction, as shown on the right, that tells the language model or vision-language model to act as a digital agent that outputs pyautogui code, with constraints on what it can and cannot do, plus additional information such as my computer's password and other details, to control the format of the actions. With this input, we carefully chose language models and vision-language models: from the open-source side, Mixtral (the mixture-of-experts version) and CogAgent, which is specially optimized for high-resolution scenarios, and from the most powerful closed-source models, GPT-4, Claude 3 Opus, and Gemini Pro, in their vision and text versions. We fixed the temperature at 1.0 and top-p at 0.9, and we provide the most recent three observation-action pairs as history for each step. We call this history encoding, and later we will show whether it matters.
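A minimal sketch of the history-encoding idea. The prompt wording and field names below are invented for illustration, not the exact prompt used in the paper.

```python
# Keep only the most recent k observation-action pairs in the prompt.
HISTORY_LENGTH = 3

def build_prompt(system_instruction, history, current_observation):
    """history: list of (observation_summary, action_code) tuples, oldest first."""
    recent = history[-HISTORY_LENGTH:]
    lines = [system_instruction]
    for i, (obs, act) in enumerate(recent, 1):
        lines.append(f"[Previous step {i}] Observation: {obs}")
        lines.append(f"[Previous step {i}] Action: {act}")
    lines.append(f"[Current] Observation: {current_observation}")
    lines.append("Predict the next pyautogui action:")
    return "\n".join(lines)
```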

For the input perception, we compare four settings. The first is the accessibility tree only, which only requires a model with text input ability. The second is the screenshot only, which is the same perception a human has: we take a screen snapshot and predict the actions from it. The third is the screenshot plus the accessibility tree, to see whether this complete combination of modalities helps, and to what extent. And we also test Set-of-Mark, which draws bounding boxes on the screenshot; we additionally input detailed information, a table of the coordinates and sizes of the bounding boxes, together with the annotated screenshot, to see whether Set-of-Mark works on these OSWorld tasks.

The results are shown here, along with some takeaways. First, large language models and vision-language models are still far from being digital agents on real computers: the most advanced result among all these settings and all these models is only about 10%, and that 10% is actually the easiest part of our environment, for example tasks where the agent can essentially just generate a few steps of code to finish. The second takeaway is that agent performance is distributed differently from human performance across the different types of computer tasks. As shown below, human performance overall is over 60%, and it is consistent across the subsets such as OS, office, daily, professional, and multi-app workflow, all around that level. Some tasks are genuinely difficult, maybe too difficult for an ordinary human who is not familiar with them, but human performance is still consistent. For the language models and vision-language models, however, you can see that they get significantly lower results on subsets such as office and workflow. This suggests some distribution shift, or some shift in abilities, between current vision-language and language models and humans, and it may point to mechanisms and directions for further research.

The third takeaway is that the effectiveness of the accessibility tree and Set-of-Mark varies across models. You can see that Gemini Pro, GPT-4, and Claude 3 Opus behave quite differently under the different input settings; we find that some models may not be good at taking Set-of-Mark as their perception, while some models may excel when given only the accessibility tree. The fourth takeaway is that VLM agents in the screenshot-only setting show very low performance, even though it should be the ultimate configuration in the long run. If you look at the four settings, the screenshot-only one gives the lowest results: with the other settings, the best models such as GPT-4V get performance over 10%, while with screenshots only they all struggle, with performance from around 5% down to even zero. According to our observations, even GPT-4V cannot predict accurate coordinates; you can only watch it wander around, click randomly on the screen, and make nonsensical actions. Still, as we said, this is the ultimate configuration, since it is how human beings do these tasks, but for now vision-language models have essentially no ability on these tasks from screenshots alone.

We also did some result analysis of the language model and vision-language model baselines. The first takeaway is that higher screenshot resolution typically leads to improved performance. The setting is that we take the GPT-4V model with two input configurations, screenshot-only and Set-of-Mark, and we downsample the screenshot at different ratios before feeding it to the vision-language model. Except for one special point, which we suspect is because a lot of the pretraining data has bounding boxes around that size, so it may happen to be optimal for GPT-4V, you can see a clear trend: the more you downsample, making the image smaller, the lower the performance you get.
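A sketch of the kind of downsampling step described here, using Pillow; the scale factors are examples, not the exact ratios from the paper.

```python
from PIL import Image

def downsample(screenshot_path, scale):
    """Resize a screenshot by the given scale factor before sending it to the VLM."""
    img = Image.open(screenshot_path)
    new_size = (int(img.width * scale), int(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

# e.g. compare full resolution against progressively smaller inputs
for scale in (1.0, 0.75, 0.5, 0.25):
    small = downsample("screenshot.png", scale)
    # ...encode `small` and query the vision-language model here
```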

We further analyzed the context length, namely how many tokens the accessibility tree consumes if we formulate the task in a purely textual way, or as a mixture of the accessibility tree and the screenshot. We computed the distribution over some sampled tasks, and the result is that to hold at least 90% of the accessibility trees, your model needs to support at least 6K, that is 6,000, tokens of input for that perception. Also, a longer text-based trajectory history improves performance, unlike a screenshot-only history, but it poses efficiency challenges. What we mean is that in the Set-of-Mark setting of the GPT-4V model, when we input more observation-action pairs from the history, the last step, the step before that, and so on, so a longer history-encoding trajectory, we get better performance. Combined with the token distribution of the accessibility tree, this means you need more efficient models with larger token capacity, that is, long-context models, to support this kind of usage.

The next thing is that we also did some experiments not only on the Ubuntu operating system but also on the Windows operating system, and we found that the performance of vision-language models across different operating systems is strongly correlated. This implies that the insights and methodologies developed with the OSWorld benchmark can be transferred to Windows environments with a high degree of reliability. Currently we have only released the Ubuntu image, but in the very near future we will also release the Windows one, though it may need some activation from the user's side.

The next result is that we applied some perturbations to the environment along different dimensions. We may change the positions of the windows from time to time to see whether that has an effect; we may also change the size of the windows, for example minimizing some windows, or dragging a border so a window is not minimized but as small as it can be; and we may add some clutter, opening irrelevant windows and applications to create distractions. We found that current VLM agents are not robust to UI layout changes and noise. We originally chose a set of tasks the agent was relatively good at, with around 50% accuracy, and after these perturbations the accuracy drops significantly, especially for the window-size perturbation, where we see a drop of nearly 40%. You can see the paper for more interesting analysis; we did a lot of analysis in this work.

Surprisingly, we also see some successful cases from the large language model and vision-language model agent baselines. Here is an example of extracting the subtitles from a video: the agent successfully opened the terminal and ran ffmpeg code to extract them from the video. This requires some software knowledge, coding knowledge, some GUI interaction ability, and even some memory and exploration, but it was accomplished by the current digital agent.
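For reference, a command along these lines can extract an embedded subtitle track with ffmpeg; the file names are hypothetical, and here it is sketched as the kind of terminal command the agent might type, wrapped in Python.

```python
import subprocess

# Hypothetical file names; "-map 0:s:0" selects the first subtitle stream of the input.
subprocess.run(
    ["ffmpeg", "-i", "video.mkv", "-map", "0:s:0", "subtitles.srt"],
    check=True,
)
```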

Still, as we showed, the overall performance is quite low, so there are many challenging points that we can try to overcome in the near future. Thank you for listening. Here are some future research directions you may be interested in, and if you have any questions we can discuss more, since time is almost up.

All right, thanks Timo for the exciting work and the talk, it was very detailed. Okay, I'll stop the recording and then we can start the QA session.
