Andrew Ng Explores the Rise of AI Agents and Agentic Reasoning | BUILD 2024
By fOx Hsiao
Summary
Topics Covered
- The real AI opportunity is the application layer
- Build 20 prototypes, keep what works
- Agentic workflows beat bigger models
- AI agents unlock trapped visual data
Full Transcript
Please welcome Andrew Ng.
[Applause] Thank you. It's such a good time to be a builder, so I'm excited to be back
here at Snowflake BUILD. What I'd like to do today is share with you where I think some of AI's biggest opportunities are. You may have heard me say
that I think AI is the new electricity. That's because AI is a general-purpose technology, like electricity. If I ask you what electricity is good for, it's always hard to answer because it's good for so
many different things, and new AI technology is creating a huge set of opportunities for us to build new applications that weren't possible before. People often ask me, "Hey Andrew,
where are the biggest AI opportunities?" This is what I think of as the AI stack. At the lowest level are the semiconductors, and then on top of that a lot of the cloud infrastructure, including of
course Snowflake, and then on top of that are many of the foundation model trainers and models. It turns out that a lot of the media hype, excitement, and social media buzz has
been on these layers of the stack, the new technology layers. Whenever there's a new technology like generative AI, a lot of the buzz is on these technology layers, and there's nothing wrong with
that. But I think that, almost by definition, there's another layer of the stack that has to work out even better, and that's the application layer, because we need the applications to generate
even more value and even more revenue so that they can really afford to pay the technology providers below. So I spend a lot of my time thinking about AI applications, and I think that's where
a lot of the best opportunities will be to build new things. One of the trends that has been growing for the last couple of years, in no small part because of generative AI, is
faster and faster machine learning model development. In particular, generative AI is letting us build things faster than ever before. Take the problem
of, say, building a sentiment classifier: taking text and deciding whether it expresses positive or negative sentiment, for reputation monitoring, say. A typical workflow using supervised learning might
be that it takes a month to get some labeled data, then you train an AI model, which might take a few months, and then you find a cloud service or something
to deploy on, which takes another few months. So for a long time, very valuable AI systems might take good AI teams six to 12 months to build, and there's nothing wrong with that. I
think many people create very valuable AI systems this way. But with generative AI, there are certain classes of applications where you can write a prompt in days and
then deploy it in, again, maybe days. What this means is that there are a lot of applications that used to take me, and used to take very good AI teams, months to build that today you can build
in maybe 10 days or so. This opens up the opportunity to experiment, build new prototypes, and ship new AI products. That's certainly the
prototyping aspect of it, and these are some of the consequences of this trend: fast experimentation is becoming a more promising path to
invention. Previously, if it took six months to build something, then we had better study it, make sure there is user demand, have product managers look at it and document it, and then spend all that effort to build it, and hopefully it turns
out to be worthwhile. But now, for fast-moving AI teams, I see a design pattern where you can say: you know what, it takes us a weekend to throw together a prototype, so let's build 20 prototypes and see what
sticks, and if 18 of them don't work out, we'll just ditch them and stick with what works. So fast iteration and fast experimentation are becoming a new path
to inventing new user experiences. One interesting implication is that evaluations, or evals for short, are becoming a bigger bottleneck for how we build things. It
turns out that back in the supervised learning world, if you were collecting 10,000 data points anyway to train a model, then if you needed to collect an extra 1,000 data points for testing, it was
fine: it was an extra 10% increase in cost. But for a lot of large language model based apps, where there's no need to have any training data, if you made me slow down to collect a thousand test examples, boy,
that seems like a huge bottleneck. So the new development workflow often feels as if we're building and collecting data more in parallel rather than sequentially: we'll build a
prototype, and then as it becomes more important, and as robustness and reliability become more important, we gradually build up that test data in parallel. But I see exciting innovations
to be had still in how we build evals. What I'm seeing as well is that the prototyping of machine learning has become much faster, but building a software application has lots of steps: there's the
product work, the design work, the software integration work, a lot of plumbing work, and then after deployment, DevOps and LLMOps. Some of those other pieces are becoming faster,
but they haven't become faster at the same rate that the machine learning modeling part has become faster. So you take a process and one piece of it becomes much faster. What I'm seeing
is that prototyping is now really, really fast, but when you take a prototype into robust, reliable production, with guardrails and so on, those other steps still take some time. The interesting
dynamic I'm seeing is that the fact that the machine learning part is so fast is putting a lot of pressure on organizations to speed up all of those other parts as well. So that's been exciting
progress for our field. In terms of how machine learning development is speeding things up, I think the mantra "move fast and break things" got a bad
rap because, you know, it broke things. I think some people interpret this to mean we shouldn't move fast, but I disagree with that. I think the better mantra is
"move fast and be responsible." I'm seeing a lot of teams able to prototype quickly, and evaluate and test robustly, without shipping anything out to the wider world that could cause damage or
cause meaningful harm. I'm finding smart teams able to build really quickly and move really fast, but also do this in a very responsible way, and I find it exhilarating that you can build things
and ship things in a responsible way much faster than ever before. Now, there's a lot going on in AI, and of all the things going on in AI, in
terms of technical trends, the one trend I'm most excited about is agentic AI workflows. So if you ask what's the one most important AI technology to pay
attention to, I would say it's agentic AI. When I started saying this, near the beginning of this year, it was a bit of a controversial
statement, but now the term "AI agents" has become so widely used, by technical and non-technical people, that it has become a little bit of a hype
term. So let me just share with you how I view AI agents and why I think they're important, approaching this from a technical perspective. The way that most of us use
large language models today is with what is called zero-shot prompting, and that roughly means we give it a prompt and ask it to write an essay or
write an output for us. It's a bit like going to a person, or in this case going to an AI, and asking it to type out an essay for us by writing from the
first word to the last word all in one go, without ever using backspace, just writing from start to finish like that. It turns out we don't do our best writing this way. But despite
the difficulty of being forced to write this way, large language models do, you know, not bad, pretty well. Here's what an agentic workflow is like. To generate an essay, we ask an AI to
first write an essay outline, and ask it: do you need to do some web research? If so, let's download some web pages and put them into the context of the large language model. Then let's write the first draft, and then let's read the first draft and
critique it and revise the draft, and so on. This workflow looks more like doing some thinking or some research, and then some revision, and then going back to do more thinking and more research,
and by going around this loop over and over, it takes longer, but it results in a much better work output. In some teams I work with, we apply this agentic
workflow to processing complex, tricky legal documents, or to healthcare diagnosis assistance, or to very complex compliance with government
paperwork. Many times I'm seeing this drive much better results than was ever possible. One thing I'm going to focus on in this presentation, which I'll talk about later, is the rise of visual AI, where
agentic workflows are letting us process image and video data. But I'll get back to that later. It turns out that there are benchmarks that
seem to show agentic workflows deliver much better results. This is the HumanEval benchmark, which is a benchmark from OpenAI that measures
a large language model's ability to solve coding puzzles like this one. My team collected some data, and it turns out that on this benchmark, I think it was the pass@k
metric, GPT-3.5 got 48% right on this coding benchmark. GPT-4 is a huge improvement:
67%. But the improvement from GPT-3.5 to GPT-4 is dwarfed by the improvement from GPT-3.5 to GPT-3.5 using an agentic workflow,
which gets up to about 95%. And GPT-4 with an agentic workflow also does much better. It turns out that
in the way builders build agentic reasoning or agentic workflows into their applications, there are, I want to say, four major design patterns, which are reflection, tool use, planning, and
multi-agent collaboration. To demystify agentic workflows a little bit, let me quickly step through what these workflows mean. I find that agentic workflows sometimes seem a
little bit mysterious until you actually read through the code for one or two of these and go, "Oh, that's it? That's really cool, but that's all it takes?" But let me just step through, for
concreteness, what reflection with LLMs looks like. I might start off prompting an LLM, a coder agent LLM, so maybe the system message sets its
role to be a coder and write code. So you tell it, please write code for a certain task, and the LLM may generate code. Then it turns out that you can construct a prompt that takes the code
that was just generated, copy-paste the code back into the prompt, and ask it: here's some code intended for a task; examine this code and critique it. It turns out that if you prompt the same
LLM this way, it may sometimes find some problems with it or make some useful suggestions for how to improve the code. Then you prompt the same LLM with the
feedback and ask it to improve the code, and it may come up with a new version. And, maybe foreshadowing tool use, if you can have the LLM run some unit tests and give the feedback from the unit tests back
to the LLM, then that can be additional feedback to help it iterate further to improve the code. It turns out that this type of reflection workflow is not magic; it doesn't solve all
problems, but it will often take the baseline level of performance and lift it to a better level of performance. It turns out also that this type of workflow, where we're thinking of
prompting an LLM to critique its own output and use its own criticism to improve it, maybe also foreshadows multi-agent planning or multi-agent workflows, where you can
prompt an LLM to sometimes play the role of a coder and sometimes play the role of a critic, to review the code. So it's actually the same
conversation, but we can prompt the LLM differently, to tell it sometimes to work on the code and sometimes to try to make helpful suggestions, and this results in improved performance. So this
is the reflection design pattern. The second major design pattern is tool use, in which a large language model can be prompted to generate a request for an
API call, to have it decide when it needs to search the web, or execute code, or take other actions like issuing a customer refund, sending an email, or pulling up a calendar entry. So tool use is a major
design pattern that is letting large language models make function calls, and I think this is expanding what we can do with these agentic workflows. Real quick, here's a planning, or reasoning,
design pattern, in which, if you were to give a fairly complex request, say, generate an image where a girl is reading a book, and so on, then an LLM (this example is adapted from the HuggingGPT paper)
can look at the picture and decide to first use an OpenPose model to detect the pose, and then after that generate a picture of a girl, after that
describe the image, and after that use text-to-speech, or TTS, to generate the audio. So in planning, you have an LLM look at a complex request and pick a sequence of
actions to execute in order to deliver on a complex task. And then lastly, multi-agent collaboration is that design pattern I alluded to, where instead of
prompting an LLM to just do one thing, you prompt the LLM to play different roles at different points in time, so the different simulated agents interact with each other and come
together to solve a task. I know that some people may wonder: if you're using one LLM, why do you need to make this one LLM play the role of
multiple agents? Many teams have demonstrated significantly improved performance for a variety of tasks using this design pattern. It turns out that if you have an LLM
specialize on different tasks, maybe one at a time, and have them interact, many teams seem to really get much better results using this. I feel like maybe there's
an analogy to running jobs on a processor, on a CPU: why do we need multiple processes? It's all the same processor at the end of the day. But we found that having multiple threads or
processes is a useful abstraction for developers to take a task and break it down into subtasks, and I think multi-agent collaboration is a bit like that too. If you have a big task, then thinking of hiring a bunch of agents to do different
pieces of the task and then interact sometimes helps a developer build complex systems to deliver a good result. So I think with these four major agentic
design patterns, agentic reasoning workflow design patterns, we have a huge space to play with to build rich agents to do things that, frankly,
were just not possible even a year ago. One aspect of this I'm particularly excited about is the rise of not just large language
model based agents, but large multimodal model based agents. So, given an image
like this, if you wanted to use an LMM, a large multimodal model, you could actually do zero-shot prompting, and that's a bit like telling it, you know, take a glance at the image and just tell
me the output. For simple image tasks that's okay; you can actually have it look at the image and give you the numbers of the runners or something. But it turns out
just as with large language model based agents, large multimodal model based agents can do better with an iterative workflow, where you can approach this problem step by step: detect the
faces, detect the numbers, put it together. With this more iterative workflow, you can actually get an agent to do some planning and testing: write code, plan, test,
write code, and come up with a more complex plan, articulated or expressed in code, to deliver on more complex tasks. So what I'd like to do is
show you a demo of some work that Dan Maloney and I and the Landing AI team have been working on, on building agentic
workflows for visual AI tasks. So if we switch to my laptop.
Let me have an image here of a soccer game, or football game, and I'm going to say, let's see, "count the
players in the field." Oh, and just for fun: if you're not sure how to prompt it, after uploading an image, this little light bulb here gives some suggested prompts you may ask. But let
me run this: "count the players on the field." What this kicks off is a process that actually runs for a couple of minutes to think through how
to write code, in order to come up with a plan to give an accurate result for counting the number of players in the field. This is actually a little bit complex because you don't want the players in the background, just the ones in the field. I
already ran this earlier, so we'll just jump to the result. It says the code has detected seven players on the field, and
I think that's right: one, two, three, four, five, six, seven. And if I were to zoom in to the model output now, one, two, three, four, five, six, seven, I think that's
actually right. And part of the output of this is that it has also generated code that you can run over
and over. It actually generated Python code that, if you want, you can run over and over on a large collection of images.
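
Running generated code over a whole collection of images might look roughly like the sketch below. It assumes the generated code can be wrapped as a function that takes an image path and returns a count; `count_players` and `run_over_images` are hypothetical names used for illustration, not the actual Vision Agent output.

```python
from pathlib import Path
from typing import Callable, Dict

# File extensions treated as images in this sketch (an assumption).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def run_over_images(image_dir: str, analyze: Callable[[str], int]) -> Dict[str, int]:
    """Apply `analyze` to every image file in `image_dir` and collect the results."""
    results: Dict[str, int] = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = analyze(str(path))
    return results


def count_players(image_path: str) -> int:
    """Hypothetical placeholder for the generated detection code."""
    raise NotImplementedError("replace with the code generated for your task")
```

Calling `run_over_images("some_folder", count_players)` would then return a mapping from file name to count once the placeholder is replaced with real code.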
And I think this is exciting because there are a lot of companies and teams that actually have a lot of visual AI data, a lot of images, a
lot of videos, kind of stored somewhere, and until now it's been really difficult to get value out of this data. So for a lot of small teams or large
businesses with a lot of visual data, visual AI capabilities like the Vision Agent let you take all this data previously shoved somewhere in blob storage and get real value out of
this. I think this is a big transformation for AI. Here's another example. This is, given a video (this is another soccer game,
or football game): "given a video, split the video into clips of 5 seconds, find the clip where a goal is being scored, and display a frame." I ran this already because it takes a
little bit of time to run. This will generate code and evaluate code for a while, and this is the output. It says true,
10 to 15, so it thinks there's a goal scored around here, and there you go, that's the goal.
And also, as instructed, it extracted some of the frames associated with this. So it's really useful for processing video data. And maybe
here's one last example of the Vision Agent: you can also ask it for a program to split the input video into small chunks every 6 seconds, describe each chunk, and store the
information in a pandas DataFrame, along with clip name, start time, and end time, and return the pandas DataFrame. So this is a way to look at video data that you may have and
generate metadata for it that you can then store in Snowflake or somewhere, to then build other applications on top of. But just to show
you the output of this: clip name, start time, end time. And it has actually written code here that you can then run
elsewhere if you want, put in a Streamlit app or something, that you can then use to write a lot of text
descriptions for this. Using this capability of the Vision Agent to help write code, my team at Landing AI
actually built this little demo app that uses code from the Vision Agent. So instead of us needing to write the code, we had the Vision Agent write the code to build
this metadata, and then it indexes a bunch of videos. So let's see, let's say browsing, "skier airborne." I actually ran this earlier; hope it works.
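
The metadata-and-index idea can be illustrated with a small sketch. It assumes a caller-supplied `describe` function that returns a text description for a time range (in the demo that role is played by a vision model), and it ranks clips with a simple word-overlap score as a stand-in, since the talk does not say how similarity is actually computed. All names here (`build_clip_rows`, `search`, `describe`) are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]


def build_clip_rows(duration_s: float, chunk_s: float,
                    describe: Callable[[float, float], str]) -> List[Row]:
    """Split [0, duration_s) into fixed-length chunks and describe each one."""
    rows: List[Row] = []
    start = 0.0
    index = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        rows.append({
            "clip_name": f"clip_{index:03d}",
            "start_time": start,
            "end_time": end,
            "description": describe(start, end),
        })
        start = end
        index += 1
    return rows


def _tokens(text: str) -> Set[str]:
    """Lowercase word set used by the toy similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(rows: List[Row], query: str, top_k: int = 3) -> List[Tuple[float, Row]]:
    """Rank clips by Jaccard word overlap between query and description."""
    q = _tokens(query)
    scored: List[Tuple[float, Row]] = []
    for row in rows:
        d = _tokens(str(row["description"]))
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, r) for s, r in scored[:top_k] if s > 0]
```

The list of rows has the clip name, start time, end time, and description columns mentioned in the talk, and can be passed to `pandas.DataFrame(rows)` if a DataFrame is wanted.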
So what this demo shows is that we already ran the code to take the video, split it into chunks, and store the metadata, and then when I do a search for "skier airborne,"
it shows the clips that have high similarity. The parts marked here in green have high similarity. Well, this is getting my
heart rate up, seeing them do that. Oh, here's another one. Whoa. All right. And the green parts of the timeline show where the skier is
airborne. Let's see, "gray wolf at night." I actually find it pretty fun, when you have a collection of video, to index it and then just browse through. Here's a gray wolf at night, and
this timeline in green shows where there's a gray wolf at night. If I jump to a different part of the video, there's a bunch of other stuff as well, right there, that's not a gray wolf at night. So
that's pretty cool. Let's see, just one last example. I've actually been on the road a
lot, so if you're looking for your luggage, this black luggage, right, there's this. But it turns out there's actually a lot of black luggage. So if you want your luggage, let's say "black
luggage with rainbow strap," since there's a lot of black luggage out there, then there, right: black luggage
with rainbow strap. So there are a lot of fun things to do. And I think the nice thing about this is that the work needed to build applications like this is lower
than ever before. So let's go back to the slides. In terms of AI opportunities, I spoke
a bit about agentic workflows, and how that is changing the AI stack is as follows. It turns out that in addition to
this stack that I showed, there's actually a new, emerging agentic orchestration layer. There are orchestration layers like LangChain that have been around for a while that are also becoming
increasingly agentic, through LangGraph for example. This new agentic orchestration layer is also making it easier for developers to build applications on top, and I hope that
Landing AI's Vision Agent is another contribution to this that makes it easier for you to build visual AI applications to process all this image and video data that possibly you had but
that was really hard to get value out of until more recently. So, before I end, let me share with you what I think are maybe four of the most important AI trends. There's a lot going on in AI; it's impossible to
summarize everything in one slide. If you had to make me pick the one most important trend, I would say it's agentic AI, but here are four other things I think are worth paying attention to.
First, it turns out agentic workflows need to read a lot of text or images and generate a lot of text, so we say they generate a lot of tokens, and there are exciting efforts to speed up token
generation, including semiconductor work by SambaNova, Cerebras, Groq, and others, and a lot of software and other types of hardware work as well. This will make agentic workflows work much better. The second trend
I'm excited about: today's large language models started off being optimized to answer human questions and human-generated instructions, things like "Why did Shakespeare write
Macbeth?" or "Explain why Shakespeare wrote Macbeth." These are the types of questions that large language models are often asked to answer on the internet. But agentic workflows call for other operations, like tool use. So the fact that
large language models are often now tuned explicitly to support tool use, or that just a couple of weeks ago Anthropic released a model that can support computer use, I think these exciting
developments create a lot of lift, create a much higher ceiling for what we can now get agentic workflows to do, with large language models that are tuned not
just to answer human queries but tuned explicitly to fit into these iterative agentic workflows. Third, data engineering's importance is rising,
particularly with unstructured data. It turns out that a lot of the value of machine learning was in structured data, kind of tables of numbers, but with gen AI we're much better than ever before at
processing text and images and video and maybe audio, and so the importance of data engineering is increasing in terms of how to manage your unstructured data and the metadata for that, and
deployment to get the unstructured data where it needs to go to create value. So that will be a major effort for a lot of large businesses. And then lastly, I think we've all seen that the text processing revolution has
already arrived. The image processing revolution is in a slightly earlier phase, but it is coming, and as it comes, many people and many businesses will be able to get a lot more value out of their
visual data than was ever possible before. I'm excited because I think that will significantly increase the space of applications we can build as
well. So, just to wrap up: this is a great time to be a builder. Gen AI is letting us experiment faster than ever, agentic AI is expanding the set of things that are now possible, and there are just
so many new applications that we can now build, in visual AI or not in visual AI, that just weren't possible ever before. If you're interested in checking out the
visual AI demos that I ran, please go to va.landing.ai. The exact demos that I ran, you'll be able to try out yourself online, and get the code and run the code
yourself in your own applications. So with that, let me say thank you all very much, and please also join me in welcoming Elsa back onto the stage. Thank you.
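
The reflection pattern described in the talk (write code, critique it, revise it using the critique) can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: `llm` is an assumed stand-in for any function that sends a prompt to a language model and returns its text reply, and the prompt wording and the `rounds` parameter are hypothetical.

```python
from typing import Callable


def reflect_and_revise(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Generate code, then alternate critique and revision prompts."""
    # First pass: ask the model, in the coder role, for an initial version.
    code = llm(f"You are a coder. Please write code for this task:\n{task}")
    for _ in range(rounds):
        # Paste the generated code back into a prompt and ask for a critique.
        critique = llm(
            "Here is some code intended for a task. "
            "Examine this code and critique it.\n"
            f"Task: {task}\nCode:\n{code}"
        )
        # Feed the critique back and ask for an improved version.
        code = llm(
            "Improve the code using the feedback.\n"
            f"Task: {task}\nCode:\n{code}\nFeedback:\n{critique}"
        )
    return code
```

Unit test output, as mentioned in the talk, could be appended to the feedback text in the same way before the revision prompt.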