Highly recommended: "10x more practical than every Claude tutorial you've scrolled through." This isn't about prompt tricks; it's the complete methodology Stanford actually teaches engineers for building reliable AI systems from scratch. Absolutely the most productive thing you can do!
By TimesSliced
Summary
Topics Covered
- LLMs are notoriously hard to control: the Tay disaster
- Context windows have a hidden attention problem
- The centaur versus cyborg model: how humans delegate to AI
- Fine-tuning is a trap: the Slack bot cautionary tale
- The paradigm shift from deterministic to fuzzy engineering
Full Transcript
And now we're going one level beyond, into what it would look like if you were building agentic AI systems at work, in a startup or a company. It's probably one of the more practical lectures. Again, the goal is not to build a product end to end in the next hour or so, but rather to walk through the techniques that AI engineers have cracked, figured out, or are exploring, so that after the class you have a breadth of view across different prompting techniques, agentic workflows, multi-agent systems, and evals. When you want to dive deeper, you have the background to dive deeper and learn faster. Okay,
let's try to make it as interactive as possible, as usual. When we look at the agenda, it's going to start with the core idea behind challenges and opportunities for augmenting LLMs. We start from a base model: how do we maximize the performance of that base model? Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper: if we were to get our hands under the hood and do some fine-tuning, what would it look like? I'm not a fan of fine-tuning, and I'll explain why; I try to avoid fine-tuning as much as possible.
Then we'll do section four on retrieval-augmented generation, or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. We're going to unpack what a RAG is and how it works, then the different methods within RAG, and then we'll talk about agentic AI workflows. I'll define the term; Andrew Ng is, call it, one of the first who named this trend agentic AI workflows, so we'll look at the definition Andrew Ng gives to agentic workflows and then start seeing examples. Section six is very practical. It's a case study where we'll think about an agentic workflow, and I'll ask you to measure whether the agent actually works, and we'll brainstorm how to measure whether an agentic workflow is working the way you want it to work. There's a whole family of methods, called evals, that solve that problem. Then we'll look briefly at multi-agent workflows, and then we can have a sort of open-ended discussion where I'll share some thoughts on what's next in AI, and I'm looking forward to hearing from you all on that one as well. Okay,
so let's get started with the problem of augmenting LLMs. An open-ended question for you: you are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model?
Yes.
Lacks some domain knowledge. You're perfectly right. We had a group of students a few years ago, it was not LLM-related, but they were building an autonomous farming vehicle that had a camera underneath taking pictures of crops to determine if a crop is sick or not, whether it should be thrown away or used. And that data set is not a data set you find out there. The base model, or a pre-trained computer vision model, would of course lack that knowledge. What else?
Yes.
Okay, just to repeat for people online: you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality. And in fact, yes, the distribution of the real world might differ from the training set, as we've seen with GANs, and that might create an issue with pre-trained models, although pre-trained LLMs are getting better at handling all sorts of data inputs.
Yes? Lacks current information: the LLM is not up to date. And in fact, you're right. Imagine you had to retrain your LLM from scratch every couple of months.
One story I found funny, from probably five or more years ago: during his first presidency, President Trump one day tweeted "covfefe." You remember that tweet? Just "covfefe." It was probably a typo, or the phone was in his pocket, I don't know. But that word did not exist, and the models Twitter was running at the time could not recognize it, so the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word "covfefe," and the model was confused: what does that mean, where should we show it, to whom should we show it? It's an example of how, nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match the new trends and understand the new words out there. You often hear Gen Z slang like "mid" or whatever; I don't know all of them, but you probably want a way to allow the LLM to understand those trends without retraining it from scratch.
Yeah. What else?
It's trained to have a breadth of knowledge, and that might limit it. Yes: it might be trained on a breadth of knowledge but fail, or not perform adequately, on a narrow task that is very well defined. Think about enterprise applications: you need high precision, high fidelity, low latency, and maybe the model is not great at that specific thing. It might do fine, but just not well enough, and you might want to augment it in a certain way. Yeah.
So it makes the model a lot heavier, a lot slower.
So maybe it has a lot of broad domain knowledge that might not be needed for your application, and you're using a massive, heavy model when you're actually only using 2% of its capability. You're perfectly right. You might not need all of it, so you might find ways to prune the model, quantize it, modify it. All of these are good points. I'm going to add a few more as well. LLMs are very difficult to control. Your last point is actually an example of that: you want to control the LLM to use a part of its knowledge, but it doesn't; it in fact gets confused. We've seen that in history. In 2016, Microsoft created a notorious Twitter bot, Tay, that learned from users and quickly became a racist jerk. Microsoft ended up removing the bot 16 hours after launching it; the community was really fast at determining that this was a racist bot. You can empathize with Microsoft in the sense that it is genuinely hard to control an LLM. They might have done a better job of qualifying it before launching, but it is really hard to control an LLM. Even more recently, there's a tweet from Sam Altman from last
November, where there was this debate between Elon Musk and Sam Altman over whose LLM is the left-wing or right-wing propaganda machine, and they were hating on each other's LLMs. That tells you, at the end of the day, that even those two teams, Grok and OpenAI, which are probably the best-funded teams with a lot of talent, are not doing a great job of controlling their LLMs. And from time to time, if you hang out on X, you might see screenshots of users interacting with an LLM and the LLM saying something really controversial or racist, something that would not be considered great by social standards, I guess. That tells you the model is really hard to control.
The second aspect is something you mentioned earlier: LLMs may underperform on your task. That might include specific knowledge gaps, such as medical diagnosis. If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it. And in fact, something we haven't mentioned as a group: sources. You want the answer sourced specifically; you have a hard time believing something unless you have the actual source of the research that backs it up. Then there are inconsistencies in style and format. Imagine you're building a legal agentic AI workflow. Legal has a very specific way of writing and reading, where every word counts. If you're negotiating a large contract, every word on that contract might mean something else when it comes to court, so it's very important that you use an LLM that is very good at it. The precision matters. Then there's task-specific understanding, such as doing classification in a niche field. I pulled an example here: let's say a biotech product is trying to use an LLM to categorize user reviews into positive, neutral, or negative. Maybe for that company, something that would typically be considered a negative review is actually considered a neutral review, because the NPS of that industry tends to be way lower than in other industries. That's task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization it wants. We'll see an example of how to solve that problem in a second. And then limited context handling: a lot of AI applications, especially in the enterprise, require data that has a lot of context. To give you a simple example, knowledge management is an important space; enterprises buy a lot of knowledge management tools. When you go to your drive and you have all your documents, ideally you could have an LLM running on top of that drive: you ask any question, it immediately reads thousands of documents and answers "what was our Q4 sales performance?" with "it was X dollars," and it finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. You will have to augment it.
Does that make sense? The other aspect of context windows is that they are in fact limited. If you look at the context windows of models from the last five years, even the best models today range somewhere in the hundreds of thousands of tokens of input, max. Just to give you a sense, 200,000 tokens is roughly two books. That's about how much you can upload and have it read, and you can imagine that when you're dealing with video understanding or heavier data files, that is of course an issue. So you might have to chunk the data, embed it, or find other ways to get the LLM to handle larger contexts.
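The "chunk it" idea can be sketched in a few lines. This is a minimal word-boundary chunker; the 4-characters-per-token ratio is a common rule of thumb and an assumption here, since a real system would use the model's actual tokenizer.

```python
# Rough token-budget chunker. Assumption: ~4 characters per token; swap in
# the model's real tokenizer for accurate counts in production.

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split `text` into chunks of at most ~max_tokens tokens,
    breaking on word boundaries so no word is split in half."""
    max_chars = max_tokens * 4  # heuristic: 1 token ~ 4 characters
    words = text.split()
    chunks, current = [], ""
    for word in words:
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)  # current chunk is full; start a new one
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

corpus = "the quick brown fox jumps over the lazy dog " * 200
chunks = chunk_text(corpus, max_tokens=50)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk can then be embedded and indexed separately, which is exactly the setup RAG builds on later in the lecture.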
The attention mechanism is also powerful but problematic, because it does not do a great job of attending over very large contexts. There is actually an interesting problem, call it a benchmark, named needle in a haystack. To test whether your LLM is good at putting attention on a very specific fact within a large corpus, researchers might randomly insert into a book one sentence that states a certain fact, such as "Arun and Max are having coffee at Blue Bottle," in the middle of the Bible, let's say, or some other very long text. Then you ask the LLM, "what were Arun and Max having at Blue Bottle?" and you see if it remembers that it was coffee. It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated.
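The benchmark described above is simple to set up. Here is a minimal harness: it builds the haystack and the grading check; the actual model call is the part you would plug in, so no LLM function is assumed here.

```python
# Minimal needle-in-a-haystack harness. The model call itself is left out;
# this only constructs the test case and grades a candidate answer.
import random

NEEDLE = "Arun and Max are having coffee at Blue Bottle."
QUESTION = "What were Arun and Max having at Blue Bottle?"

def build_haystack(filler_sentences: list[str], needle: str, seed: int = 0) -> str:
    """Insert the needle at a random position inside a long corpus."""
    rng = random.Random(seed)
    sentences = filler_sentences[:]
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def grade(answer: str) -> bool:
    """Pass if the model recalled the key fact."""
    return "coffee" in answer.lower()

filler = [f"Filler sentence number {i}." for i in range(1000)]
haystack = build_haystack(filler, NEEDLE)
prompt = f"{haystack}\n\nQuestion: {QUESTION}"  # what you would send to the model
print(grade("They were having coffee."))
```

Real needle-in-a-haystack evaluations sweep both corpus length and needle position, since models often recall facts near the start or end of the context better than those in the middle.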
So again, this is a limiting factor for LLMs. We'll talk about RAG in a second, but I want to preview it: there are debates around whether RAG is the right long-term approach for AI systems. As a high-level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt to answer a question. It has lots of applications; knowledge management is an example. Imagine you have your drive again, but every document is compressed into a representation, and the LLM has access to that lower-dimensional representation. The debate this tweet outlines is: in theory, if we have infinite compute, then RAG is useless, because you can just read a massive corpus immediately and answer your question. But even in that case, latency might be an issue. Imagine the time it takes for an AI to read your entire drive every single time you ask a question. It doesn't make sense. So RAG has other advantages beyond even accuracy. On top of that, the sourcing matters as well: RAG allows you to cite sources. We'll talk about all that later. But there are
there's always this debate in the community about whether a certain method is actually future-proof, because in practice, as compute power doubles every year or so, some of the methods we're learning right now might not be relevant three years from now; we don't know, essentially. The analogy he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. When you search on a search engine, you still find sources of information, and in the background there are very detailed traversal algorithms that rank and find the specific links that might be best to present to you. Versus, imagine if you had to read the entire web every single time you run a search query, without being able to narrow to a certain portion of the space; that, again, would not be reasonable.
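The embed-retrieve-prompt mechanism previewed above fits in a short sketch. The bag-of-words "embedding" and the prompt wording are illustrative assumptions only; real systems use learned embeddings and a vector database, but the control flow is the same.

```python
# Toy retrieval-augmented generation (RAG) pipeline: embed documents,
# retrieve the closest one to the query, prepend it to the prompt.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def build_prompt(query: str, docs: list[str]) -> str:
    context = retrieve(query, docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

docs = [
    "Q4 sales performance was 12 million dollars.",
    "The cafeteria menu changes every Monday.",
    "Our hiring plan targets 40 engineers next year.",
]
print(build_prompt("What was our Q4 sales performance?", docs))
```

Because the retrieved document travels with the prompt, the answer can also cite it, which is the sourcing advantage mentioned above.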
Okay. When we're thinking of improving an LLM, the easiest way to think of it is two dimensions. One dimension is improving the foundation model itself: for example, we move from GPT-3.5 Turbo to GPT-4 to GPT-4o to GPT-5. Each of those is supposed to improve the base model. GPT-5 is another debate, because it sort of packages other models within itself, but if you're thinking about 3.5, 4, and 4o, that's really what it is: the pre-trained model improves, and so you should see your performance improve on your tasks. The other dimension is that we can actually engineer around the LLM in a way that makes it better. You can simply prompt GPT-4o. You can chain some prompts and improve the prompt, and it will improve the performance; that's been shown. You can put a RAG around it, an agentic workflow around it, even a multi-agent system around it. That is another dimension for improving performance. So that's how I want you to think about it: which LLM am I using, and then how can I maximize the performance of that LLM? This lecture is about the vertical axis. Those are the methods that we will see together.
Sounds good for the introduction. So let's move to prompt engineering. I'm going to start with an interesting study, just to motivate why prompt engineering matters. There is a study from Harvard Business School and others, also involving Wharton, that took a subset of BCG consultants, individual contributors, and split them into three groups. One group had no access to AI. One group had access to, I think it was, GPT-4. And one group had access to the LLM plus training on how to prompt better. Then they observed the performance of these consultants across a wide variety of tasks. There are a few things they noticed that I thought were interesting. One is something they call the jagged frontier, meaning that certain tasks consultants do fall beyond the jagged frontier: AI is not good enough, it's not improving human performance, and in fact it's making it worse. And some tasks are within the frontier, meaning AI is significantly improving the performance, the speed, and the quality of the consultant. Many tasks fell within and many fell without, and they shared their insights, but the TL;DR is: there is a frontier within which AI is absolutely helping, and one where they call out this behavior of "falling asleep at the wheel," where people relied on AI for a task that was beyond the frontier, and it ended up going worse because the human was not reviewing the outputs carefully enough. They did note that the group that was trained on prompt engineering did better than the group that was not, which also motivates why this lecture matters, so that you're within that group afterwards. One other insight was the centaur.
They noticed that consultants had a tendency to work with AI in one of two ways, and you might find yourself part of one of these groups. Centaurs are mythical creatures that are half human, half... half what? Horses. Yeah, half horse. Those were individuals who would divide and delegate. They might give a pretty big task to the AI. Imagine you're working on a PowerPoint, which consultants are known to do: you might write a very long prompt on how you want the PowerPoint done, let the AI work for some time, and come back when it's done. Others would act as cyborgs. Cyborgs are fully blended bionic humans, a human augmented with robotic parts. Those individuals would not fully delegate a task; they would work super quickly with the model, back and forth. I find that a lot of students actually work more like cyborgs than centaurs, while maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. That's just something good to keep in mind. Also, a lot of companies will tell you, "oh, we're hiring prompt engineers," that it's a career. I don't buy that. I think it's just a skill that everybody should have. You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career.
So let's talk about basic prompt design principles. I'm giving you a very simple prompt here: "Summarize this document," and the document is uploaded alongside it. The model doesn't have much context around what the summary should cover, how long it should read, what it should talk about, etc. You can actually improve this prompt by writing something like: "Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policy makers." That's already better, right? You're sharing the audience, so it's going to tailor the summary to the audience. You're saying you want five bullet points, focused only on key findings. That's a better prompt, you would argue. How could you make this prompt even better? What are other techniques you've heard of, or tried yourself, that could make this one-shot prompt better?
Yeah.
Okay: write examples. So you mean, "here is an example of a great summary." Yeah, you're right. That's a good idea.
Very popular technique: act like a renewable energy expert giving a talk at Davos, let's say. Yeah, that's great. Someone said: tell it "you're good at it," "you're the best in the world at this." Yeah, actually, these things work. It's funny, but it does work to say "act like XYZ." It's a very popular prompt template; we'll see a few examples. What else could you do?
Yes.
I personally like to say: critique your own logic.
Okay.
Critique your own logic. So you're using reflection: you might produce one output, then ask the model to critique it, and then feed the critique back. Yeah, we'll see that. That's a great one; that's the one that probably works best among these, typically, but we'll see some examples. What else?
Uh yeah.
down into steps.
Okay. Break the task down into steps. Do
you know what that is called?
No.
Okay: chain of thought. This is actually a popular method that research has shown improves performance. You give a clear instruction and also encourage the model to think step by step: "Approach the task step by step and do not skip any step." Then you give it some steps, such as: step one, identify the three most important findings; step two, explain how each finding impacts renewable energy policy; step three, write the five-bullet summary with each point addressing a finding; etc. I linked the paper that popularized chain of thought. Chain of thought is very popular right now, especially in AI startups that are trying to control their LLMs.
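The step-by-step pattern above can be wrapped in a small helper so the same scaffolding is reused across tasks. The boilerplate wording is an illustrative assumption, not a fixed standard.

```python
# Build a chain-of-thought style prompt from a task and explicit steps.

def chain_of_thought_prompt(task: str, steps: list[str]) -> str:
    """Return a prompt that instructs the model to work step by step."""
    lines = [
        task,
        "Approach the task step by step and do not skip any step.",
    ]
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: {step}")
    return "\n".join(lines)

prompt = chain_of_thought_prompt(
    "Summarize this 10-page scientific paper on renewable energy.",
    [
        "Identify the three most important findings.",
        "Explain how each finding impacts renewable energy policy.",
        "Write a five-bullet summary, each point addressing a finding.",
    ],
)
print(prompt)
```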
Okay, to go back to your examples about "act like XYZ": what I like to do, and Andrew also talks about this, is to look at other people's prompts. Online you have a lot of prompt repositories, free on GitHub. In fact, I linked the awesome prompts repo on GitHub, where you have so many examples of great prompts that engineers have built; they said "it works great for us" and published it online. A lot of them start with "act as": act as a Linux terminal, act as an English translator, act as a position interviewer, etc. The advantage of a prompt template is that you can actually put it in your code and scale it across many user requests. Let me give you an example from Workera, where it evaluates skills. Some of you have taken the assessments already, and it tries to personalize to the user. In an enterprise HR system you might have: Jane is a product manager, level three, she is in the US, and her preferred language is English. That metadata can be inserted into a prompt template that we personalize for Jane. And similarly for Joe, whose preferred language is Spanish, it will tailor it to Joe. That's called a prompt template.
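In code, a prompt template is just a string with holes that per-user metadata fills in. The field names and wording below are illustrative assumptions; they are not taken from any real product's prompt.

```python
# Minimal prompt template personalized from HR-style metadata.

TEMPLATE = (
    "Act as a helpful career mentor.\n"
    "The user is {name}, a {role} (level {level}) based in {country}.\n"
    "Always answer in {language}.\n"
    "User question: {question}"
)

def render(template: str, **metadata: str) -> str:
    """Fill the template with per-user metadata pulled from, say, an HR system."""
    return template.format(**metadata)

jane_prompt = render(
    TEMPLATE,
    name="Jane", role="product manager", level="3",
    country="US", language="English",
    question="How do I prepare for a skills assessment?",
)
print(jane_prompt)
```

The same template then serves Joe by passing `language="Spanish"`, which is what lets one prompt scale across many user requests.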
Do the foundation models use prompt templates, or is that something you have to add?
So the question is: do the foundation models use prompt templates, or do you have to integrate one yourself? The foundation models probably use a system prompt that you don't see. When you type on ChatGPT, it is possible, it's not public, that OpenAI behind the scenes has something like "act like a very helpful assistant for this user, and by the way, here are your memories about the user that we kept in a database" (you can actually check your memories), and then your prompt goes underneath, and then the generation starts. So they're probably using something like that, but it doesn't mean you can't add one yourself. If you think about a prompt template for the Workera example I was showing, maybe when you call OpenAI it starts with "act like a helpful assistant," and then underneath it's "act like a great AI mentor that helps people in their career," and OpenAI's own template also has "follow the instructions from the creator," or something like that. It's possible. Yeah.
Questions about prompt templates?
Again, I would encourage you to go and read examples of prompts. Some of them are quite thoughtful.
Let's talk about zero-shot versus few-shot prompting; it came up earlier. Here's an example, again going back to categorizing product reviews. Let's say we're working on a task where the prompt is "classify the tone of this sentence as positive, negative, or neutral," and then you paste the review, which is "the product is fine but I was expecting more."

If I were to survey the room, I would bet some of you would say it's negative and some of you would say it's neutral, because you have a first part that is relatively positive, "it's fine," and then a second part, "I was expecting more," which is relatively negative. So where do you land? This can be a subjective question: maybe in one industry this would be considered amazing, and in another it would be considered really bad, because people are used to really flourishing reviews. The way you can align the model to your task is by converting that zero-shot prompt (zero-shot refers to the fact that it isn't given any examples) into a few-shot prompt, where the model is given a set of examples in the prompt to align it with what you want it to do. The example here: you paste the same prompt as before with the user review, and then you add, "here are examples of tone classifications: 'This exceeded my expectations completely.' Positive. 'It's okay but I wish it had more features.' Negative. 'The service was adequate, neither good nor bad.' Neutral. Now classify the tone of this sentence." The model then says negative, and the reason it says negative is of course likely the second example, "it's okay but I wish it had more features," which we told the model was negative. The model is aligned now with your expectations.
Few-shot prompts are very popular, and in fact, at AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompt in their codebase. You can think of that as almost building a data set, but instead of building a separate data set, like we've seen with supervised fine-tuning, and then fine-tuning the model on it, you're putting it directly in the prompt. It turns out that's probably faster if you want to experiment quickly, because you don't touch the model parameters; you just update your prompts. And if they're text examples, you can concatenate many examples into a single prompt. At some point it will be too long and you will not have the necessary context window, but it's a pretty strong approach that's quick to align an LLM.
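Keeping labeled examples in code and assembling them into the prompt, as described above, looks like this. The exact prompt wording is an illustrative assumption.

```python
# Build a few-shot classification prompt from a growing list of
# human-labeled examples kept in the codebase.

EXAMPLES = [
    ("This exceeded my expectations completely.", "Positive"),
    ("It's okay but I wish it had more features.", "Negative"),
    ("The service was adequate, neither good nor bad.", "Neutral"),
]

def few_shot_prompt(review: str, examples: list[tuple[str, str]]) -> str:
    lines = ["Here are examples of tone classifications:"]
    for text, label in examples:
        lines.append(f'"{text}" -> {label}')
    lines.append("Now classify the tone of this sentence as Positive, Negative, or Neutral:")
    lines.append(f'"{review}"')
    return "\n".join(lines)

prompt = few_shot_prompt("The product is fine but I was expecting more.", EXAMPLES)
print(prompt)
```

Adding a newly labeled review is a one-line append to `EXAMPLES`, which is why this loop is so much faster to iterate on than fine-tuning.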
So the question was: is there any research on how long the prompt can be before the model essentially loses itself or stops following instructions? There is, but the problem is that the research is outdated every few months because models get better, so I don't know where the state of the art is. You can probably find it online in benchmarks. I'll give you an example: on the Workera product, for some of you that have tried it, there's a voice conversation where you're asked to explain, say, what a prompt is, you explain, and there's a scoring algorithm. We know that after eight turns the model loses itself, because you always paste in the previous user responses, and it just starts going wild. So a technique we use in the background is to create chapters of the conversation. Maybe one chapter is the first eight turns, and then you start over from another prompt: you can summarize the first part of the conversation, insert the summary, and then keep going. Those are engineering hacks that engineers have figured out in the background. Yeah, because eight turns makes a prompt quite long, actually.
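The chaptering trick can be sketched as a small function: once the conversation exceeds a turn budget, collapse the older turns into a summary and keep only the recent ones. The `summarize` function here is a hypothetical stand-in for an LLM summarization call, replaced by a trivial placeholder so the flow is runnable.

```python
# "Chapters" for a long conversation: summarize old turns, keep recent ones.

MAX_TURNS = 8  # budget before the model starts to lose the thread

def summarize(turns: list[str]) -> str:
    """Placeholder summarizer; a real system would call an LLM here."""
    return f"[Summary of {len(turns)} earlier turns]"

def build_context(history: list[str]) -> list[str]:
    """Return the turns to paste into the next prompt."""
    if len(history) <= MAX_TURNS:
        return history
    older, recent = history[:-MAX_TURNS], history[-MAX_TURNS:]
    return [summarize(older)] + recent

history = [f"turn {i}" for i in range(1, 12)]  # 11 turns so far
context = build_context(history)
print(len(context), context[0])
```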
Let's move on to chaining. Chaining is the most popular technique out of everything we've seen so far in prompt engineering. It's not chain of thought; chain of thought, as we've seen, is "think step by step: step one, step two, step three, do not skip any step." This is different. This is chaining separate prompts to improve performance. Here's what it looks like. You take a single-step prompt such as: "Read this customer review and write a professional response that acknowledges their concern, explains the issue, and offers a resolution," and then you paste the customer review, which is "I ordered a laptop, it arrived 3 days late, the packaging was damaged, very disappointing. I needed it urgently for work." The output is an email that is immediately given to you by the LLM after it reads the prompt.

So this might work, but it might be hard to control, because think about it: there are multiple steps you have listed, and everything is embedded in the same prompt. If you wanted to debug step by step and know which step is weaker, you couldn't; you would have everything mixed together. One advantage of chaining is that you separate the prompts so you can debug them separately, and it also makes it easier to improve your workflow. Let's say the first prompt is: extract the key issues. "Identify the key concerns mentioned in this customer review." Paste the customer review. Second prompt: "Using these issues" (you paste back the issues), "draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution." Then prompt number three: write the full response. "Using the outline, write the professional response," and then you get your final output.
So in theory, you can tell me the second approach is better than the first one. But what you can notice is that we can actually test those three prompts separately from each other and determine if we will get the most gains out of engineering the first prompt, the second one, or the third one. We now have three prompts that are independent from each other. And maybe if the outline was better, the performance of the email (the open rate maybe, or the user satisfaction with the response) will actually get higher. And so chaining improves performance, but most importantly it helps you control your workflow and debug it more seamlessly.
Yes. So if we know that the three prompts independently work very well, if we combine them into one prompt and we highlight that step-by-step thinking process, do we get the same result, or do we still have to do that?
So let me try to rephrase. You say, let's say we look at the first prompt, which has all three tasks built into that prompt. What exactly do you mean? You mean, if we evaluate the output and we measure some user satisfaction, et cetera, why don't we just modify that prompt and essentially see how it improves user satisfaction?
Yeah, it's not process.
I see. So why do we need the three steps?
Yeah. I mean, think about it. The intermediate output is what you want to see. If I'm debugging the first approach, the way I would do it is I would capture user insights: here's the email, how good was the response? Thumbs up, thumbs down. Was your issue resolved? Thumbs up, thumbs down. Those would tell me how good my prompt is, and I can engineer that prompt, optimize it, and I would probably drive some gains. But I will not easily be able to trace back to what the problem was. While in the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. For example, if I look at prompt two and I look at the outline and I see the outline is actually meh, it's not great, then I think I can get a lot of gains out of the outline. Or the outline is actually really good, but the last prompt doesn't do a good job at translating it into an email. So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good; in fact, it doesn't follow our vocabulary internally. Then I know the third prompt is where I would get the most gains.
So that's what it allows me to do: have intermediate steps to review. Yeah.
Are there any latency concerns?
We'll talk about it. Are there any latency concerns? Yes. In certain applications you don't want to use a chain, or you don't want to use a long chain, because it adds latency. We'll talk about that later. Good point.
So practically, this is what chaining complex prompts looks like. You have your first prompt with your first task. The output is pasted into the second prompt with the second task being defined. That output is then pasted into the third prompt with the third task being defined, and so on. That's what it looks like in practice.
Super.

We'll talk more later about testing your prompts, but there are methods now to do it, and we'll see later in this lecture, with our case study, how we can test our prompts. But here is an example of how you might do it. You might have a summarization prompt that is the baseline; it's a single prompt. You might have a refined summarization, which is a modified version of that prompt, or a workflow with a chain. And then you have your test case, which is the input that you want to summarize, let's say, and then you have the generated output, and you can have humans go and rate these outputs. And you would notice that the baseline is better or worse than the refined prompt. Of course, this manual approach takes time, but it's a good way to start, and usually the advice is: get hands-on at the beginning, because you would quickly notice some issues and it will give you better intuition on what tweaks can lead to better performance. However, if you wanted to scale that system across many products, many parts of your codebase, you might want to find a way to do that automatically, without asking humans to review and grade summaries, right? One approach is to use platforms; at Workera, our team uses a platform called promptfoo that allows you to automate part of this testing. In a nutshell, it can allow you to run the same prompt with five different LLMs immediately and put everything in a table that makes it super easy for a human to grade, let's say. Or alternatively, it might allow you to define LLM judges.
LLM judges can come in different flavors. For example, I can have an LLM judge that does a pairwise comparison. What the LLM is asked to do is: here are two summaries, just tell me which one is better than the other. And that can be used as a proxy for how good the summarization baseline versus the refined version is. Another way to do an LLM judge is single-answer grading: here's a summary, grade it from one to five. And then you can go even deeper and do a reference-guided pairwise comparison, or you also add a rubric. You say a five is when a summary is below 100 characters (I'm just making this up), mentions at least three key points that are distinct, and starts with a first sentence that displays the overview and then goes into the detail; that's a great summary, five out of five. Zero is when the LLM failed to summarize and actually was very verbose, let's say. So you put a rubric behind it and you have an LLM judge following the rubric. Of course, you can now pair different techniques. You can do few-shot for the rubric: you can actually give examples of a five out of five, a four out of five, a three out of five, because now you know multiple techniques. Okay, does that make sense?
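A single-answer rubric grader of the kind just described can be sketched like this. The rubric wording is made up for illustration, and `call_llm` is a stand-in (here it always answers "4") so the example runs without an API key.

```python
# Sketch of an LLM judge doing single-answer grading against a rubric.
# `call_llm` is a stand-in for a real LLM API call.

RUBRIC = (
    "Score the summary from 0 to 5.\n"
    "5 = under 100 characters, mentions at least three distinct key "
    "points, and opens with an overview sentence.\n"
    "0 = failed to summarize, or was very verbose.\n"
    "Answer with a single integer."
)

def call_llm(prompt: str) -> str:
    # Stand-in: a real judge model would actually read the summary.
    return "4"

def judge_summary(summary: str) -> int:
    # Rubric + summary in, integer score out.
    prompt = RUBRIC + "\n\nSummary:\n" + summary + "\n\nScore:"
    return int(call_llm(prompt).strip())
```

A pairwise judge works the same way, except the prompt contains two candidate summaries and asks for "A" or "B".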
Yeah. Okay. So that was the second section, on prompt engineering, the first line of optimization. Now, let's say you've exhausted all your chances for prompt engineering and you're thinking about actually touching the model, modifying its weights, fine-tuning it. I was telling you I'm not a fan of fine-tuning. There are a few reasons why. One, it typically requires substantial labeled data to fine-tune, although there are now approaches that are getting better at fine-tuning that actually look more like few-shot prompting. The two are sort of merging, although one modifies the weights and the other doesn't. Two, fine-tuned models may also overfit to specific data (we're going to see a funny example, actually), losing their general-purpose utility. So you might fine-tune a model, and when someone asks a pretty generic question, it doesn't do well anymore. It might do well on your task, so it might be relevant or not. And then it's time- and cost-intensive. That's my main problem. And at work, we steer away from fine-tuning as much as possible, because by the time you're done fine-tuning your model, the next model is out, and it's actually beating your fine-tuned version of the previous model. So I would steer away from fine-tuning as much as you can. The advantage of the prompt engineering methods we've seen is that you can put the next best pre-trained model directly in your code, and it will update everything immediately. Fine-tuning doesn't work like that.
There are cases, though, where it still makes sense: if the task requires repeated high-precision outputs, such as legal or scientific explanation, and if the general-purpose LLM struggles with domain-specific language. So let's look at a quick example together, from Ross Lazerowitz, I think from a couple of years ago, September 2023, where Ross tried Slack fine-tuning. He looked at a lot of Slack messages within his company and thought, I'm going to fine-tune a model that speaks like us, operates like us, because this is how we work, right? This is the data that represents how people work at the company. And so he actually went ahead and fine-tuned the model, and gave it a prompt like, hey (he was delegating to the model), write a 500-word blog post on prompt engineering. And the model responded: I shall work on that in the morning.
And then he tries to push the model a little further and says: it's morning now. And the model said: I'm writing right now. It's 6:30 a.m. here. Write it now. Okay, please. Okay, I shall write it now. I actually don't know what you would like me to say about prompt engineering. I can only describe the process. The only thing that comes to mind for a headline is: how do we build a prompt? It's kind of a funny example of fine-tuning gone wrong. He was thinking, I want the model to speak like us at work, and it ended up acting like people, not actually following instructions.
So, one example of why I would steer away from fine-tuning.
Super.
Let's talk about RAG. RAG is important; it's important to know it's out there and to at least have the basics. It's a very common interview question, by the way. If you go interview for a job, they might ask you to explain in a nutshell, to a five-year-old, what RAG is, and hopefully after this you'll be able to do it. So, we've seen some of the challenges with standalone LLMs. Those challenges include the context window being small, and the fact that it's hard to remember details within a large context window. Knowledge gaps: the cutoff dates you mentioned earlier; the model might be trained up to a date, and then it cannot follow the trends or be up to date. Hallucinations: there are some fields, think about medical diagnosis, where hallucinations are very costly. You can't afford a hallucination. Even in education: imagine deploying a model for US youth education, and it hallucinates and teaches millions of people something completely wrong. It's a problem. And then, lack of sources. A lot of fields love sources. Research fields love sources. Education loves sources. Legal loves sources as well. And the pre-trained LLM doesn't do a good job of sourcing; in fact, if you have tried to find sources on a plain LLM, it hallucinates a lot. It makes up research papers; it just lists completely fake stuff. So how do we solve that? With RAG. RAG integrates with external knowledge sources: databases, documents, APIs. It ensures that answers are more accurate, up to date, and grounded, because you can actually update your documents. Your drive is always up to date; I mean, ideally you're always pushing new documents to it, and when you query, what is our Q4 performance in sales, hopefully the last board deck is in the drive and it can read the last board deck. And more developer control: we'll see why. RAG allows for targeted customization without actually requiring retraining of the model. In fact, you don't touch the model with RAG. It's really a technique that is put on top of the model. To see an example of RAG: this is a question-answering application where we're in the medical field, and a user is asking a query, what are the side effects of drug X?
It's an important question. You can't hallucinate. You need to source. You need to be up to date: maybe there is a new update to that drug that is now in the database, and you need to read that. So RAG is a great example of what you would want to use here. The way it works is: you have your knowledge base of a bunch of documents. What you do is use an embedding model to embed those documents into lower-dimensional representations. For example, if the document is a long PDF, you might read the PDF, understand it, and then embed it. We've seen plenty of embedding approaches together (triplet loss, etc., you remember), so imagine one of them here for LLMs, embedding those documents into a lower-dimensional representation. If the representation is too small, you will lose information. If it's too big, you will add latency, right? It's a trade-off. You will typically store those representations in a database called a vector database. There are a lot of vector database providers out there; I haven't listed them here, but I can share some afterwards. The vector database essentially stores those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric. So what you do is you also embed, usually with the same algorithm, the user prompt, and you run a retrieval process, which essentially says: based on the embedding of the user query and the vector database, find the relevant documents based on the distance between those embeddings.
Once you've found the relevant documents, you pull them and then you add them to the user query, with a system prompt or a prompt template on top. So the prompt template can be: answer the user query based on this list of documents; if the answer is not in the documents, say "I don't know." That's your prompt template, where the user query is pasted, the documents are pasted, and then your output should be what you want, because it's now grounded in the documents. You can also add to this prompt template: tell me the exact page, chapter, and line of the document that was relevant, and in fact link it as well, just to be more precise.
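The whole vanilla loop (embed, store, retrieve, build a grounded prompt) can be sketched end to end. A real system would use a learned embedding model and a vector database; here a bag-of-words vector and a brute-force cosine search stand in so the example is self-contained and runnable.

```python
# Toy sketch of the vanilla RAG pipeline described above.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: word counts instead of a dense learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Drug X side effects: nausea, headache, dizziness.",
    "Refund policy: refunds are available within 30 days of purchase.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector database"

def retrieve(query: str, k: int = 1) -> list:
    # Rank documents by distance to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    docs = "\n".join(retrieve(query))
    # The prompt template: ground the answer and allow "I don't know".
    return (
        "Answer the user query based on the documents below. "
        "If the answer is not in the documents, say 'I don't know'.\n"
        "Documents:\n" + docs + "\nQuery: " + query
    )
```

The final string is what gets sent to the LLM, with the retrieved document pasted in as grounding.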
Any questions on RAG? That's a simple, vanilla RAG.
Yes. How do the embeddings still retain information about what's on which page and where?
The question is: do the document embeddings still retain information about the location of the information within the document, especially in big documents?
Great question. We'll get to it in a second, because you're right that the vanilla RAG might not do a good job with very large documents. Let's say, you know when you open a medication box and you have this gigantic white paper with all the information, and it's very long? Maybe a vanilla RAG would not cut it. So what people have figured out is a bunch of techniques to improve RAG, and chunking is a very popular one. You might actually store in the vector database the embedding of the full document, and on top of that you will also store a chapter-level vector, and when you retrieve, you retrieve the document and you retrieve the chapter, and that allows you to be more precise with the sourcing. It's one example.
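The two-granularity indexing just described can be sketched as follows. The `embed` "vector" and the entry layout are made-up stand-ins for a real embedding model and vector database schema.

```python
# Sketch of chunk-level indexing: store one chapter-level vector per
# section alongside the document-level vector, so retrieval can point
# at the exact chapter. `embed` is a stand-in for a real embedding model.

def embed(text: str) -> frozenset:
    # Stand-in "vector": the set of lowercase words.
    return frozenset(text.lower().split())

def index_document(doc_id: str, chapters: dict) -> list:
    full_text = " ".join(chapters.values())
    # One entry for the whole document...
    entries = [{"doc": doc_id, "chapter": None, "vec": embed(full_text)}]
    # ...plus one entry per chapter, for more precise sourcing.
    for name, text in chapters.items():
        entries.append({"doc": doc_id, "chapter": name, "vec": embed(text)})
    return entries

# Usage: a medication leaflet indexed at both granularities.
entries = index_document(
    "leaflet",
    {"dosage": "take one tablet daily", "side effects": "nausea headache"},
)
```

A retrieval hit on a chapter entry then carries both the document id and the chapter name, which is what makes the precise sourcing possible.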
Another technique that's popular is HyDE, hypothetical document embeddings, where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the user query does not actually look like your documents. For example, the user query might be: what are the side effects of drug X? when the vectors in the vector database represent very long documents. So how do you guarantee that the query embedding is going to be close to the document embedding? What they do is use the user query to generate a fake, hallucinated document. They embed that document, and then they compare it to the vectors in the vector database. Does that make sense? So for example, the user says, "What are the side effects of drug X?" That is given to another prompt that says, "Based on this user query, generate a five-page report answering the user query." It generates a potentially completely fake answer. You embed that, and it will likely be closer to the document that you're looking for. Yeah, it's one example of a RAG approach.
Again, the purpose of this lecture is not to go through all of these and explain every single method that has been discovered for RAG, but I just wanted to show you how much research has been done between 2020 and 2025 in RAG, and how many branches of research you now have that you can learn from. The survey paper is linked in the slides, by the way, and I'll share them after the lecture.
Super.
So, we've made some progress. Hopefully now you feel like, if you were to start an LLM application, you know how to do better prompts, you know how to do chains, you know how to do fine-tuning. You also know how to do retrieval, and you have the baggage of techniques that you can go and read about, find the codebase, pull the code, vibe-code it; you have the breadth. Now, the next set of topics we're going to see is around the question of how we could extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step autonomous workflows. And this is where we get into proper agentic AI.
So let's talk about agentic AI workflows, towards autonomous and specialized systems. Then we'll talk about evals. Then we'll see multi-agent systems. And we'll end with a few thoughts on what's next in AI.
So, Andrew actually coined the term agentic AI workflows. And his reason was that a lot of companies say agents, agents, agents everywhere. If you go and work at these companies, you would notice that they mean very different things by "agent." Some people actually have a prompt and they call it an agent. Other people have a very complex multi-agent system, and they call it an agent. And so calling everything an agent doesn't do it justice. So Andrew says, let's call them agentic workflows, because in practice it's a bunch of prompts, with tools, with additional resources, API calls, that are ultimately put in a workflow, and you can call that workflow agentic. So it's all about the multi-step process to complete the task.
Also, calling it an agentic workflow allows us not to mix it up with what I called an agent in the last lecture on reinforcement learning, because in RL an agent has a very specific definition: it interacts with an environment, passes from one state to another, has a reward and an observation. You remember that chart, right? So here's an example of how we move from a one-step prompt to a multi-step agentic workflow. Let's say a user query on a product chatbot is: what is your refund policy? And the response, using RAG, says: refunds are available within 30 days of purchase. And maybe it can even link to the policy document. That's what we learned so far.
Instead, an agentic workflow can function like this. The user says, "Can I get a refund for my order?" And in the agentic workflow, the agent retrieves the refund policy using RAG. The agent then follows up with the user: can you provide your order number? Then the agent queries an API to check the order details, and finally it comes back to the user and confirms: your order qualifies for a refund; the amount will be processed in 3 to 5 business days. This is much more thoughtful than the first version, which is sort of vanilla, right?
So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one.
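The four-step refund workflow just described can be sketched as a control flow. The retrieval step, the order API, and the user interaction are all stubbed, and names like `retrieve_refund_policy` and `check_order` are made up for illustration, so the flow is runnable on its own.

```python
# Sketch of the multi-step agentic refund workflow described above.

def retrieve_refund_policy() -> str:
    # Stand-in for the RAG step that pulls the policy document.
    return "Refunds are available within 30 days of purchase."

def check_order(order_number: str) -> dict:
    # Stand-in for the order-details API the agent would query.
    return {"order": order_number, "days_since_purchase": 12}

def refund_agent(ask_user) -> str:
    policy = retrieve_refund_policy()                        # step 1: RAG
    number = ask_user("Can you provide your order number?")  # step 2: follow-up
    details = check_order(number)                            # step 3: API call
    if details["days_since_purchase"] <= 30:                 # step 4: decide
        return ("Your order qualifies for a refund. The amount will be "
                "processed in 3 to 5 business days.")
    return "Sorry, this order is outside our policy: " + policy

# Usage: the callback stands in for the chat turn that asks the user.
reply = refund_agent(lambda question: "A1234")
```

In a real agent, an LLM would decide which of these tools to call and in what order; here the sequence is hard-coded to show the shape of the workflow.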
There are plenty of specialized agentic workflows out there. If you hang out in SF, you've probably seen a bunch of billboards: AI software engineer, AI skills mentor (which you've interacted with in the class), AI SDR, AI lawyers, AI specialized cloud engineer. It would be a stretch to say that everything works, but there's work being done towards that. I'm not personally a fan of putting a face behind those things. I think it's gimmicky, and I think a few years from now actually very few products will have a human face behind them. But it might be a marketing tactic for some startups.
It's more scary than it is engaging, frankly. Okay, I want to talk about the paradigm shift. That's especially useful if you're a software engineer or you're planning to be one, because software engineering as a discipline is shifting; or at least, the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done.
So here's the paradigm shift between traditional software and agentic AI software. The first part is the way you handle data. Traditional software deals with structured data: you have JSONs, you have databases. They're passed in a very structured manner through a data engineering pipeline and then used to display something on a certain interface. The user might fill a form that is then retrieved and pasted into the database. All of that, historically, has been structured data. Now, more and more companies are handling free-form text and images, and all of that requires dynamic interpretation to transform an input into an output. The software itself used to be deterministic. Now you have a lot of software that is fuzzy, and fuzzy software creates so many issues. I mean, imagine if you let your users ask anything on your website. The chance that it breaks is tremendous. The chance that you're attacked is tremendous. It's really, really complicated; more complicated than people make it seem on Twitter. Fuzzy engineering is truly hard. You might get hate as a company because one user did something that you authorized them to do that ended up breaking the database; we've seen that with many companies in the last couple of years. So it takes a very specialized engineering mindset to do fuzzy engineering, but also to know when you need to be deterministic.
The other thing I'd call out is that with agentic AI software, you sort of want to think about your software as if you're a manager. You're familiar with the monolith or microservices approaches in software, where you structure your software in different boxes that can talk to each other, and it allows teams to debug one section at a time. Now the equivalent with agentic AI is that you think as a manager. You think: okay, if I were to delegate my product to be done by a group of humans, what would those roles be? Would I have a graphic designer that puts together a chart and then sends it to a marketing manager that converts it into a nice blog post, that then gives it to the performance marketing expert that publishes the blog post and then optimizes and A/B tests, then to a data scientist that analyzes the data, puts forward hypotheses, and validates or invalidates them? That's how you would typically think if you're building agentic AI software, when the equivalent in traditional software might be completely different. It might be: we have a data engineering box right here that handles all our data engineering. And then here we have the UI/UX stuff; everything UI/UX related goes here. And companies might structure it in very different ways. And here's the business logic that we care about, and there are five engineers working on the business logic.
business logic. Let's say okay uh testing and debugging is also very different um and we'll talk about it in the next section.
The other thing that I feel matters is that with AI in engineering, the cost of experimentation is going down drastically, and so I feel people should be more comfortable throwing away code. In traditional software engineering, you probably don't throw away code a ton; you build code that is solid and bulletproof, and then you update it over time. Whereas we've seen AI companies be more comfortable throwing away code. Which has advantages in terms of the speed at which you move, but also disadvantages in terms of the quality of your software, in that it can break more.
Okay. So anyway, I just wanted to do an aside on the paradigm shift from deterministic to fuzzy engineering.
Oh, and actually I can give you an example from Workera that we learned probably over the last 12 months. If you've used Workera, you might have seen that the interface sometimes asks you multiple-choice questions, sometimes multiple-select, and sometimes drag-and-drop, ordering, matching, whatever. Those are examples of deterministic item types, meaning you answer the question on a multiple choice, there's one correct answer; it's fully deterministic. On the other hand, you sometimes have voice questions, where you go through a role play, or voice-plus-coding questions, where your code is being read by the interface. Those are fuzzy, meaning the scoring algorithm might actually make mistakes, and those mistakes might be costly. And so companies have to figure out a human-in-the-loop system, which you might have seen with the appeal feature at the end. At the end of the assessment, there's an appeal feature that allows you to say: I want to appeal, because I want to challenge what the agent said about my answer, because I thought I was better than what the agent thought. And then you bring a human into the loop that can fix the agent, that can tell the agent: actually, you were too harsh on this person's answer. That's an example of a fuzzy engineered system that adds a human in the loop to make it more aligned. And so if you're building a company, I would encourage you to think about: what can I get done with determinism? Let's get that done. And then the fuzzy stuff: I want to do fuzzy because it allows more interaction, more back and forth, but I need to put guardrails around it. And how am I going to design those guardrails? Pretty much. All right.
Here's another example, from enterprise workflows, which are likely to change due to agentic AI. This is a paper from McKinsey, I believe from last year, where they looked at a financial institution and observed that it often spends one to four weeks to create a credit risk memo. Here's the process. A relationship manager gathers data from more than 15 sources on the borrower, loan type, and other factors. Then the relationship manager and the credit analyst collaboratively analyze the data from these sources. Then the credit analyst typically spends 20 hours or more writing a memo, and then goes back to the relationship manager. They give feedback, and then they go through this loop again and again, and it takes a long time to get a credit memo out.
They then ran a research study where they changed the process. They said gen AI agents could actually cut time by 20 to 60% on credit risk memos, and the process changed to this: the relationship manager works directly with the gen AI agent system and provides the relevant materials it needs to produce the memo. The agent subdivides the project into tasks that are assigned to specialist agents, gathers and analyzes the data from multiple sources, and drafts a memo. Then the relationship manager and the credit analyst sit down together, review the memo, give feedback to the agent, and are done in 20 to 60% less time. And so this is an example where you're actually not changing the human stakeholders; you're just changing the process and adding gen AI to reduce the time it takes to get a credit memo out. It turns out that if you're an enterprise with 100,000 employees (and there are a lot of enterprises with 100,000 employees out there), you are currently under pressure to redesign your workflows. If you actually pull the job descriptions from the HR system and interpret them, and you also pull the business process workflows that you have encoded in your drive, you can find gains in multiple places, and in the next few years you're probably going to see workflows being optimized to add gen AI. Even if that happens, the hardest part is changing people. This is great in theory, but now let's try to fit that second workflow to 10,000 credit risk analysts and relationship managers. My guess is it will take years, 10 to 20 years, to get this actually done at scale within an organization, because change is so hard. It's so hard to rewire business workflows and job descriptions, to incentivize people to do things differently and be different, and to train them. So this is where the world is going, but it's going to take a long time, I think.
Okay. Then I want to talk about how an agent actually works and what its core components are. Imagine a travel booking AI agent. That's an easy example you've all thought about. I still haven't been able to get an agent to book a trip for me, or I was scared because it was going to book a very expensive or very long trip. But in theory, you can have a travel booking agent that has prompts. We've seen prompts; we know the methods to optimize them. That travel agent also has a context management system, which is essentially the memory of what it knows about the user. That context management system might include a core memory, or working memory, and an archival memory. The difference within memory is that not every memory needs to be fast to access. Think about it: you're on a product and the first question is, "Hi, what's your name?" And I say my name is Keon. That's probably going to sit in the working memory, because every time the agent talks to me it's going to want to use my name, right? But then maybe the second question is, "Keon, what's your birthday?" and I give it my birthday. Does it need my birthday every day? Probably not. So it's probably going to park it in long-term memory, the archival memory. Those memories are slower to access; they're farther down the stack. That structure lets the agent determine what's working memory and what's long-term memory, which makes retrieval super fast. Think about it: when you interact with GPT, you feel that it's very personal at times, right? You feel like it understands you. Now imagine that every time you call it, it has to read all the memories. That's a very burdensome cost, because it happens every time you talk to it. So you want the working memory to be highly optimized. If it takes 3 seconds to look in the memory, every interaction with your LLM is going to take 3 seconds, which you don't want.
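To make the working-versus-archival split concrete, here's a minimal sketch in Python. It's a toy illustration with hypothetical names (`AgentMemory`, `remember`, `recall`), not any particular framework's API; real context management systems typically add eviction, summarization, and vector search over the archive.

```python
# A minimal sketch of a two-tier context management system.
# All names here are hypothetical, for illustration only.

class AgentMemory:
    def __init__(self, core_limit=5):
        self.core = {}       # working memory: small, read on every turn
        self.archive = {}    # archival memory: larger, fetched on demand
        self.core_limit = core_limit

    def remember(self, key, value, frequent=False):
        # Facts needed on every turn (e.g. the user's name) go to core
        # memory; everything else (e.g. a birthday) is parked in the archive.
        if frequent and len(self.core) < self.core_limit:
            self.core[key] = value
        else:
            self.archive[key] = value

    def context_for_prompt(self):
        # Only core memory is injected into every prompt, keeping the
        # per-call cost low; archival memory needs an explicit lookup.
        return dict(self.core)

    def recall(self, key):
        # Slower path: check core first, then fall back to the archive.
        return self.core.get(key, self.archive.get(key))

memory = AgentMemory()
memory.remember("name", "Keon", frequent=True)
memory.remember("birthday", "1990-01-01")  # archived, not in every prompt
```

The point of the split is exactly the cost argument above: `context_for_prompt` stays cheap on every turn, while `recall` pays the slower archival lookup only when it's actually needed.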
Anyway, then you have the tools. The tools can include APIs: a flight search API, hotel booking API, car rental API, weather API, and a payment processing API. Typically you want to tell your agent how each API works. It turns out that LLMs are very good at reading API documentation. So you give it the API documentation, it reads the JSON, it learns what a GET request looks like and the format it needs to send, and then it pushes a request in that format and retrieves something.
Does that make sense, those different components?
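One common way to hand those tools to an agent is to describe each one in a machine-readable schema and render that documentation into the system prompt, so the model knows what a valid call looks like before it emits one. The sketch below is illustrative; the tool names and parameter descriptions are made up for this travel example.

```python
# Hypothetical tool schemas for the travel agent. Each entry carries the
# documentation the LLM reads to learn the request format.

TOOLS = [
    {
        "name": "flight_search",
        "description": "Search flights between two airports.",
        "parameters": {
            "origin": "IATA code, e.g. SFO",
            "destination": "IATA code, e.g. JFK",
            "date": "ISO date, e.g. 2025-12-15",
        },
    },
    {
        "name": "hotel_booking",
        "description": "Find hotels near a landmark within a budget.",
        "parameters": {
            "city": "city name",
            "near": "landmark to search around",
            "max_price": "nightly budget in USD",
        },
    },
]

def tool_documentation(tools):
    # Rendered into the system prompt: one line per tool, listing its
    # name, parameters, and what it does.
    lines = []
    for t in tools:
        params = ", ".join(t["parameters"])
        lines.append(f"{t['name']}({params}): {t['description']}")
    return "\n".join(lines)
```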
Anthropic also talks about resources. Resources are data sitting somewhere that you might let your agent read. For example, if you're building your startup, you have a CRM. A CRM has data in it, and you want to do lookups in that data. You would probably provide a lookup tool, give access to the resource, and it will do lookups whenever you want, super fast.
This type of architecture can be built with different degrees of autonomy, from least to most autonomous. Let me give you a few examples. Least autonomous: you've hardcoded the steps. Say I tell the travel agent: first identify the intent, then look up in the database this customer's history with us and their preferences, then go to the flight API, and so on. I hardcode the steps. Semi-autonomous: I might hardcode the tools, but I'm not going to hardcode the steps. I tell the agent, "You act like a travel agent, your task is to help the person book a trip, and these are the tools you have access to." I'm not hardcoding the steps, just the tools. Most autonomous: the agent decides the steps and can create its own tools. That's where you might actually give the agent access to a code editor. The agent might be able to ping any API on the web, perform web searches, even write code to display data to the user or to perform calculations: "I'm going to calculate the fastest route from San Francisco to New York and which one might be most appropriate for what the user is looking for. I want to calculate the distance between the airport and this hotel versus that hotel, and I'm going to write code to do that." So it's fully autonomous from that perspective.
Okay.
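A rough sketch of the two lower ends of that spectrum, with hypothetical `llm` and `tools` stand-ins. In the hardcoded version the developer fixes the sequence of calls and the LLM only fills in the blanks; in the semi-autonomous version only the tool list is fixed and the model chooses the steps itself.

```python
# Least autonomous: the control flow is hardcoded by the developer.
# `llm` is a callable and `tools` a dict of callables -- both hypothetical.

def book_trip_hardcoded(request, llm, tools):
    intent = llm(f"Extract the travel intent from: {request}")
    history = tools["customer_db"](intent)      # step 2: always runs
    flights = tools["flight_search"](intent)    # step 3: always runs
    return llm(f"Draft an itinerary using {flights} and {history}")

# Semi-autonomous: the tools are fixed, but the LLM decides the steps.
# This entire prompt is the "program"; there is no hardcoded sequence.
SEMI_AUTONOMOUS_PROMPT = """You act like a travel agent. Your task is to
help the person book a trip. You have access to these tools:
customer_db, flight_search, hotel_booking, payment. Decide yourself
which tools to call and in what order."""
```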
So remember those keywords: memory, prompts, tools, etc. Now, I presented the flight API, but it does not have to be an API. You've probably heard the term MCP, or Model Context Protocol, which was coined by Anthropic. I pasted the seminal article on MCP at the bottom of this slide, but let me explain in a nutshell why these differ. In the API case, you would actually teach your LLM to ping an API. You'd say, "This is how you ping this API and this is the data it will send back," and you'd have to do that in a one-off manner. You'd have to give the API documentation for your flight API, your hotel booking API, your car rental API, and then give tools for your model to communicate with those APIs.
That doesn't scale very well. MCP, by contrast, is really about putting a system in the middle that makes it simpler for your LLM to communicate with that endpoint. For instance, you might have an MCP server and an MCP client when you're trying to communicate with that travel database or the flight API. Your agent can just communicate with it and say, "Hey, what do you need in order to give me flight information?" The server responds, "I'd like you to tell me the origin, the destination, and what you're looking for at a high level." "Okay, let me get back to you with my requirements." "Oh, you forgot to tell me your budget." "Oh, let me give you my budget," and so on. It's agent-to-agent communication, which allows more scalability: you don't need to hardcode everything. Companies have published their MCPs out there, and your agent can communicate with them and figure out how to get the data it needs. Does that make sense?
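Here is a toy illustration of that negotiation. To be clear, this is not the real MCP SDK or wire format; it just mimics the back-and-forth described above, where the client first asks the server what it needs instead of being pre-programmed with the API's documentation. All names are hypothetical.

```python
# Toy illustration of the MCP idea: negotiate requirements at runtime
# rather than hardcoding each endpoint's format.

class FlightMCPServer:
    REQUIRED = ["origin", "destination", "date"]

    def describe_requirements(self):
        # "What do you need in order to give me flight information?"
        return self.REQUIRED

    def query(self, params):
        missing = [k for k in self.REQUIRED if k not in params]
        if missing:
            # "You forgot to tell me your budget" -- ask for more.
            return {"status": "missing", "fields": missing}
        return {"status": "ok", "flights": ["AF83", "UA990"]}  # dummy data

def mcp_client_roundtrip(server, known_facts):
    # The agent discovers what is needed, then supplies what it knows.
    needed = server.describe_requirements()
    params = {k: known_facts[k] for k in needed if k in known_facts}
    return server.query(params)
```

The key difference from the raw-API case is that nothing about the request format is baked into the client; if the server later requires a `budget` field, the same client code keeps working and simply gets asked for the missing field.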
Yeah.
Oh, sorry, you were writing something. Yes, it's not a strict issue. Ultimately the question is whether it achieves much, because if an API has to be updated, the MCP has to be updated too; is that what you're saying? Yes, that's correct. But at least it allows the agent to go back and forth and figure out what the requirements are. At the end of the day, ideally, if you're a startup, you have some documentation and you automatically have an agent or an LLM workflow that reads that documentation and updates the code accordingly. But I agree, it's not fully autonomous.
I've seen that there are some security issues. Why is that?
Which security issues specifically?
So, are there security issues with MCPs?
Think about it this way: MCPs, depending on the data you get access to, might have different requirements, lower stakes or higher stakes. I'm not an expert on the full range, but many MCPs have authentication, so you might need a code to talk to them, just as you would with an API key. That's a good question; I'm not an expert on the security of these systems, but we can look into it.
Any other questions on what we've seen: agentic workflows, APIs, tools, MCPs, memory? All of this is in progress. Even memory is not a solved problem by any means; it's actually pretty hard to get right. Yes?
So you don't need a new protocol; it just makes it easier to access the API?
Exactly. Is MCP about efficiency or about accessing more data? It's about efficiency. Say you have a coding agent with an MCP client, and there are multiple MCP servers exposed out there. That agent can communicate very efficiently with them and find what it needs. It's a more efficient process than publishing APIs and documenting how to ping them and what the protocol is. It's not about the data being exposed, because ultimately you control the data that is exposed. Depending on how the MCP is built, my guess is you probably expose yourself to other risks, because your MCP server can see pretty much any input from another LLM, and so it has to be robust.
But yeah, super. So let's look at an example of a step-by-step workflow for the travel agent. Step one, the user says: "I want to plan a trip to Paris from December 15th to 20th, with flights, hotels near the Eiffel Tower, and an itinerary of must-visit places." That's the task given to the travel agent. Step two, the agent plans the steps: find flights using the flight search API to get options for December 15th, search hotels, generate recommendations for places to visit, validate preferences and budget, and book the trip with the payment processing API. That's just the planning, by the way. Step three, execute the plan: use your tools and combine the results. Step four, proactive user interaction and booking: it might make a first proposal to the user, ask the user to validate or invalidate, and then repeat the planning and execution process. And finally, it might update the memory. It might say, "I just learned through this interaction that the user only likes direct flights; next time I'll only offer direct flights." Or, "I noticed the user is fine with three-star or four-star hotels, and they don't want to go above budget."
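Those steps can be sketched as a single loop. `llm`, `tools`, and `memory` below are hypothetical stand-ins for real components; a production agent would add the user-validation round trip and error handling.

```python
# Sketch of the plan / execute / propose / update-memory loop described
# above. All names are illustrative, not a real framework.

def run_travel_agent(user_request, llm, tools, memory):
    # Step 2: the agent plans which tools to call and with what arguments.
    plan = llm("plan: " + user_request)
    # Step 3: execute the plan by calling each tool in the plan.
    results = [tools[step["tool"]](step["args"]) for step in plan]
    # Step 4: combine the results into a proposal the user can validate.
    proposal = llm("propose: " + repr(results))
    # Step 5: update memory with what this interaction taught us,
    # e.g. "the user only likes direct flights".
    memory.append(llm("lesson: " + user_request))
    return proposal
```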
So hopefully that makes sense by now. My question for you is: how would you know if this works? And if you had such a system running in production, how would you improve it?
Yeah.
So that's one example: let users rate their experience at the end. That would be an end-to-end test, right? You're looking at the user experience across the steps and asking how good it was, say from 1 to 5. It's a good way. And if you learn that a user says 1, how do you improve the workflow?
Okay, so you would go down a tree and say: you said 1, what was your issue? And the user says the prices were too high, let's say, and then you would go back and fix that specific tool or prompt. Yeah. Okay, any other ideas?
Yeah, good. That's a good insight: separate the LLM-related stuff from the non-LLM, deterministic stuff. The deterministic stuff you might be able to fix more objectively, essentially. What else?
So give me an example of an objective issue you could notice and how you would fix it, versus a subjective issue.
Yeah.
Okay. So let's say there's the same flight, but one option is cheaper than the other. The more expensive one is objectively worse, and you can capture that almost automatically. Yeah. So you could actually build objective evals that are tracked across your users, and then run an analysis afterward. For the objective stuff, we might notice that our agentic AI workflow is bad with pricing: it just doesn't read prices well, because it always gives the more expensive option. You're perfectly right. How about the subjective stuff?
Yeah.
Do you choose a direct or an indirect flight if the indirect one is a little cheaper?
Yeah, good one. Do you choose a direct flight or an indirect flight if the indirect is cheaper but the direct is more comfortable? That's a good one. So how would you capture that information? Let's say this is used by thousands of users.
Could you feed something in about the user preferences?
Well, you could build a dataset that has some of that information. You build, say, 10 prompts where the user is asking specifically for direct flights, saying "I prefer direct flights because I care about my time." Then you look at the output, you provide an example of a good output, and you're able to capture the performance of your agentic workflow on this specific eval: does it prioritize correctly, is it price-conscious, is it comfort-conscious? What about the tone? Let's say the LLM right now is not very friendly. How would you notice that, and how would you fix it?
Yeah.
Have a test user run prompts and see if there's something wrong with them?
Okay, have a test user run the prompts and see if something is wrong. Tell me about the last step: how would you notice that something is wrong?
Yeah, I agree with your approach: have LLM judges that evaluate the response against a rubric of what politeness looks like. In this case you could actually start with error analysis. You have a thousand users; you pull up 20 user interactions and read through them, and you might notice at first sight that the LLM seems very rude. It's super short in its answers and not very helpful. You notice that manually with your error analysis, then you go to the next stage: you actually put evals behind it. You say, "I'm going to create a set of LLM judges that look at the user interaction and rate how polite it is, and I'm going to give them a rubric." Then you flip your LLM: instead of using GPT-4 you use Grok, and instead of Grok you use Llama. You run those three LLMs side by side, give the outputs to your LLM judges, and get your subjective score at the end: "Model X was more polite on average."
Perfectly right. That's an example of an eval that is very specific and lets you choose between LLMs. You could also do the same eval but fix the LLM and change the prompt: instead of saying "act like a travel agent," you say "act like a helpful travel agent," and then you see the influence of that word on your eval with the LLMs as judges. Does that make sense? Okay.
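That model-comparison setup might look something like the following sketch, where `call_model` and `judge` are hypothetical wrappers around real model APIs and the rubric text is made up for illustration.

```python
# Sketch of an LLM-as-judge comparison: run the same prompts through
# several models, score each reply against a politeness rubric, average.

POLITENESS_RUBRIC = """Rate 1-5: greets the user, apologizes for
problems, is never curt, offers next steps."""

def compare_models(prompts, models, call_model, judge):
    scores = {}
    for model in models:
        ratings = []
        for prompt in prompts:
            reply = call_model(model, prompt)
            # The judge LLM rates the reply against the fixed rubric.
            ratings.append(judge(POLITENESS_RUBRIC, prompt, reply))
        scores[model] = sum(ratings) / len(ratings)
    return scores  # e.g. one average politeness score per model
```

The same harness works for the prompt experiment mentioned above: fix the model, vary the system prompt ("travel agent" vs. "helpful travel agent"), and compare the judge scores.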
Super. So let's move forward and do a case study with evals, and then we're almost done for today. Let's say your product manager asks you to build an AI agent for customer support. Okay, where do you start? Here is an example of the user prompt: "I need to change my shipping address for order blah blah blah. I moved to a new address." So where do you start if I give you that project?
Yes.
So: do some research, look at benchmarks for how different models perform at customer support, and then pick a model. That's what you mean? Yeah, it's true, you could do that. What else could you do? Yeah.
Okay, I like that: try to decompose the different tasks it will need, and try to guess which ones will be more of a struggle, which should be fuzzy, which should be deterministic. You're right.
Similar to what you said, that's what I would recommend as well. I would sit down with a customer support agent for a day or two and decompose the tasks they're going through. I'd ask them where they struggle and how much time each task takes. That's usually where you want to start: task decomposition. So let's say we've done that work and we have this list. I'm simplifying, but the human customer support agent typically would: extract the key info, then look up the database to retrieve the customer record, then check the policy (are we allowed to update the address, or is it a fixed data point?), then draft the response email, and send the email.
Okay, so we've decomposed the task. Once you've decomposed it, how do you design your agentic workflow? Yes.
Which method are we going to use for each task? So, to repeat: you look at the task decomposition, get an instinct for what's fuzzy and what's deterministic, and then determine which line will be an LLM one-shot, which will require RAG, which will require a tool, which will require memory. You start designing that map completely. Right, that's also what I would recommend.
You might actually draft it and say: okay, I take the user prompt. The first step of my task decomposition was "extract information"; that seems to be a vanilla LLM task. You can guess that a vanilla LLM would be good enough at extracting that the user wants to change their address, and here is the order number and the new address; you probably don't need much technology there beyond the LLM. The next step feels like it needs a tool, because you actually have to look up the database and also update the address. So that might be a custom tool you build for the LLM, to say "let me connect you to that database," or you give access to the resource with an MCP. After that, you probably need an LLM again to draft the email, but you would paste in the confirmation that the address has been updated from X to Y, and then the LLM drafts the answer. And of course, not to forget, you might need a tool to send the email: you might need to post something for the email to go out, and then you get the output. Does that make sense? So, exactly what you described.
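Wiring those four steps together might look like the following sketch. `llm`, `db`, and `send_email` are hypothetical stand-ins, and the policy check is simplified to a single flag; a real system would handle extraction failures and missing records.

```python
# Sketch of the customer-support flow: vanilla LLM extraction, a
# database tool, an LLM draft, and an email tool. Names are illustrative.

def handle_address_change(message, llm, db, send_email):
    # Step 1: a vanilla LLM is usually enough to extract the key info.
    info = llm(f"Extract order_id and new_address from: {message}")
    # Step 2: tool call -- look up the record and check the policy.
    record = db.lookup(info["order_id"])
    if not record["address_editable"]:
        return "Policy: address is fixed for this order."
    # Step 3: deterministic update, then paste the confirmation into
    # the prompt so the drafted email cannot hallucinate the outcome.
    db.update_address(info["order_id"], info["new_address"])
    draft = llm(f"Draft a polite email confirming the address change "
                f"from {record['address']} to {info['new_address']}")
    # Step 4: tool call to actually send the email.
    send_email(record["email"], draft)
    return draft
```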
Okay, moving to the next step. We've decomposed our task and designed an agentic workflow around it. It took us five minutes; in practice it would take you longer if you're building your startup on this. You want to make sure your task decomposition is accurate and your design is accurate, and then there's a lot of work to do on every tool to optimize it, its latency, its cost. But now we want to know whether it works, and I'm going to assume that you have LLM traces. LLM traces are very important. Actually, if you're interviewing with an AI startup, I'd recommend you ask them in the interview process: do you have LLM traces? Because if they don't, it's pretty hard to debug an LLM system; you don't have visibility into the chain of complex prompts that were called and where the bug is. It's a basic part of an AI startup's stack to have LLM traces. So let's assume you have traces. How would you know if your system works? I'm going to summarize some of the things I heard earlier. You gave me an example of an end-to-end metric: look at user satisfaction at the end. You can also take a component-based approach, where you look at, say, the database update tool and manually do an error analysis. You might see: oh, the tool always forgets to update the email field, it just fails at the write; I'm going to fix that, and it's pretty much deterministic. Or: when it pings the system that is supposed to send the email, it doesn't send it in the right format and it breaks at that point. Again, you can fix that. Or: the LLM doesn't do a great job drafting the email, it's not very polite. So you can look component by component, which is actually easier to debug than looking at it end to end. You would probably do a mix of both.
Another way to look at it is objective versus subjective. An objective example: the LLM extracted the wrong order ID. The user said "my order ID is X," and when the LLM looked it up in the database, it used the wrong order ID. This is objectively wrong; you can actually write Python code that checks the alignment between what the user mentioned and what was actually used in the database lookup. You also have subjective stuff, which we talked about, where you probably want either human raters or LLMs as judges; those are very relevant for subjective evals.
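That deterministic check might look like the following sketch. The trace fields (`user_message`, `db_lookup_order_id`) and the `ORD-` ID format are assumptions for illustration; your trace schema will differ.

```python
# A deterministic eval over hypothetical trace records: flag any trace
# where the order ID the agent used differs from the one the user gave.

import re

def order_id_mismatches(traces):
    failures = []
    for t in traces:
        # What the user actually said, e.g. "my order ID is ORD-1234".
        stated = re.search(r"ORD-\d+", t["user_message"])
        if stated and stated.group() != t["db_lookup_order_id"]:
            failures.append(t["trace_id"])
    return failures
```

Because it's pure string matching, this eval runs across every production trace for free, which is exactly what makes objective checks so much cheaper than judged or human-rated ones.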
And finally, you will find yourself with both quantitative and qualitative evals. Quantitative would be the percentage of successful address updates, or latency: you could track latency per component and see which one is slowest. Say sending the email takes 5 seconds and that's too long; you'd notice that at the component level or across the full workflow, and then decide where to optimize latency and how. And then qualitative: you might do some error analysis and look at where the hallucinations are, where the tone mismatches are, whether users are confused and by what. That's more qualitative, and typically it takes a more white-glove approach. So here's what it could look like. I gave you some examples, but you would build evals to determine, objectively and subjectively, component-based and end-to-end, quantitatively and qualitatively, where your LLM is failing and where it's doing well. Does that give you a sense of the kinds of things you could do to improve that agentic workflow?
Super. That was our case study on evals. We're not going to delve deeper into it, but hopefully it gave you a sense of what you can do with LLM judges: objective, subjective, component-based, end-to-end, etc. Last section: multi-agent workflows. You might ask: why do we need a multi-agent workflow when the workflow already has multiple steps, already calls the LLM multiple times, already gives it tools? Why multiple agents? So many people are talking about multi-agent systems online, and it's not even a new thing, frankly; multi-agent systems have been around for a long time. The main advantage of a multi-agent system is parallelism: is there something I wish I could run in parallel, mostly independently, maybe with some syncs in the middle? That's where you want a multi-agent system. The other advantage some companies get from multi-agent systems is that an agent can be reused. Say a company has an agent built for design. That agent can be used by the marketing team and by the product team, and now you're optimizing one agent that has multiple stakeholders who can communicate with it and benefit from its performance.
Actually, I'm going to ask you a question; take maybe a minute to think about it. Let's say you were building smart home automation for your apartment or your house. What agents would you want to build? Write it down, and in a minute I'll ask you to share some of the agents you would build. Also think about how you would put a hierarchy between these agents, how you would organize them, and who should communicate with whom. Okay, take a minute for that. Be creative, too, because I'm going to ask for all of your agents, and maybe you have an agent nobody else has thought of.
Okay, let's get started. Who wants to give me a set of agents they'd want for their smart home? Yes.
The first is a set of agents that track my movements in the house and gather information about it. Another agent receives that information and adjusts the room temperature, and another one tracks energy.
Okay, let me repeat. You have roughly four agents. One tracks biometrics: where you are in the home, where you're moving, how you're moving; it knows your location. The second determines the temperature of the rooms and has the ability to change it. The third tracks energy efficiency and might give feedback on energy usage, and maybe it has control over the temperature as well, or the gas or the water; it might cut your water at some point. And then you have an orchestrator agent. What exactly is the orchestrator doing? Okay, it passes instructions. So is that the agent that communicates mainly with the user? Okay, so if I'm coming back home and I say I want the oven preheated, I communicate with the orchestrator and it funnels that to another agent. Sounds good. That's an example of what I'd call a hierarchical multi-agent system.
What else? What would you add to that? Yeah.
An agent that controls access at the most granular level: entering a room, opening a computer, the minimum kind of action, with permissions per person.
Oh, I like that. That's a really good one. Let me summarize: you have a security agent that determines whether you can enter or not, and when you enter, it understands who you are and gives you a certain set of permissions that might differ depending on whether you're a parent or a kid. You might have access to certain cars and not others, or the kid cannot open the fridge, something like that. I like that. And it does feel like a complex enough workflow that you want a specific agent tied to it. I agree.
What else? Yes.
An agent that decides whether or not to keep your blinds open. And another one as well.
That's really good, actually. You mentioned two. One is an agent with access to external APIs that can understand the weather out there, the wind, the sun, and that has control over certain devices at home (temperature, blinds, things like that) and also understands your preferences for them. That does feel like a good use case, because you could give that to the orchestrator, but it might lose itself doing too much. And these problems are tied together: the outdoor temperature from the weather API might influence how you want the temperature inside, and so on. The second one, which I also like, is an agent that looks at your fridge and what's inside. It might have access to a camera in the fridge, know your preferences, and also have access to an e-commerce API to order Amazon groceries ahead of time. And maybe the orchestrator is the communication line with the user, but it communicates with that agent to get the order done. Yeah, I like those. Those are all really good examples. Here is the list I had up there: climate control, lighting, security, energy management, entertainment, a notification agent (alerts about system updates, energy saving), and an orchestrator. So you mentioned all of them. We didn't talk about the different interaction patterns, but you do have different ways to organize a multi-agent system: flat, hierarchical. It sounds like this would be hierarchical, and I agree. The reason is UI/UX: I would rather talk only to the orchestrator than have to go to a specialized application for each thing, so the orchestrator should be responsible for that. I would probably go for a hierarchical setup here. But you might also add some connections between agents, as in a flat, all-to-all system. For example, with climate control and energy, you might want to connect those two and allow them to speak with each other. When you allow agents to speak with each other, it is basically an MCP-style protocol, by the way: you treat the agent exactly like a tool. Here is how you interact with this agent, here is what it can tell you, here is what it needs from you, essentially.
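A minimal version of that hierarchical setup, with illustrative agent names and a keyword-based router standing in for the LLM intent classifier an actual orchestrator would use:

```python
# Toy hierarchical multi-agent setup for the smart-home example: the
# user talks only to the orchestrator, which routes to specialists.

class SpecialistAgent:
    def __init__(self, name, skills):
        self.name, self.skills = name, skills

    def handle(self, request):
        return f"{self.name} handling: {request}"

class Orchestrator:
    def __init__(self, agents):
        self.agents = agents

    def route(self, request):
        # A real system would use an LLM to classify intent; here we
        # just keyword-match against each agent's declared skills.
        for agent in self.agents:
            if any(skill in request for skill in agent.skills):
                return agent.handle(request)
        return "No agent can handle this."

home = Orchestrator([
    SpecialistAgent("climate", ["temperature", "oven", "blinds"]),
    SpecialistAgent("security", ["door", "lock", "camera"]),
])
```

A flat, all-to-all variant would let `climate` and an `energy` agent call each other's `handle` directly, which is the agent-as-tool pattern mentioned above.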
Okay, super. And then, without going into the details, there are advantages to multi-agent workflows versus, you know, single agents, such as debugging: it's easier to debug a specialized agent than to debug an entire system. Parallelization as well: it's easier to have things run in parallel, um, and you can save time. Um, you know, there are some advantages to doing that, and I leave you with the slide if you want to go deeper. Super. So we've learned so many techniques to optimize LLMs, from prompts to chains to fine-tuning, retrieval, um, and to multi-agent systems as well. And then, just to end on, um, a couple of trends I want you to watch. Uh
I think next week is Thanksgiving. Is that it? Is Thanksgiving break? No, the week after. Okay. Well, ahead of the Thanksgiving break, so if you're traveling, you can think about these things. Um, what's next in AI? I wanted to call out a couple of trends.
So Ilya Sutskever, one of the OGs of, uh, you know, LLMs, um, and, you know, OpenAI co-founder, um, raised that question about: are we plateauing or not? You know, the question of, are we going to see in the coming years LLMs sort of not improve as fast as we've seen in the past. It's
been the feeling in the community probably that you know the last version of GPT um did not bring the level of performance that people were expecting
although it did make it so much easier to use for consumers because you don't need to interact with different models.
It's all under the same hood. So it
seems that it's progressing, um, but the plateau is unclear. The way I would think about it is, um, the LLM scaling laws tell us that if we continue to improve compute and energy, then LLMs should continue to improve, but at some point it's going to plateau.
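To make that scaling-law intuition concrete: loss is often modeled as a power law in compute plus an irreducible floor, so each additional 10x of compute buys a smaller absolute improvement. The constants below are invented for illustration, not a real fit:

```python
# Toy Chinchilla-style power law: L(C) = E + A / C**alpha.
# E (irreducible loss), A, and alpha are made-up illustrative constants.
E, A, alpha = 1.7, 10.0, 0.3

def predicted_loss(compute: float) -> float:
    return E + A / compute**alpha

# Improvement from each successive 10x jump in compute.
gains = []
for c in [1e3, 1e4, 1e5, 1e6]:
    gains.append(predicted_loss(c) - predicted_loss(10 * c))

# Gains shrink with every 10x, and loss can never drop below E:
# that diminishing return is the "plateau".
print(gains)
```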
So what's going to take us to the next step? Um, it's probably architecture search. Still, a lot of LLMs, even if we don't understand what's under the hood, are probably transformer-based today. But
we know that the human brain does not operate the same way. There's just
certain things that we do that are much more efficient, much faster. We don't
need as much data. So theoretically, we have so much to learn in terms of architecture search that we haven't figured out. It's not a surprise that you see those labs hire so many engineers. Because it is possible that in the next few years, you're going to have thousands of engineers trying to figure out the different engineering hacks and tactics and architecture searches that are going to lead to
better models. And one of them suddenly will find the next transformer, and it will reduce by 10x the need for compute and the need for energy. Um, you know, it's sort of, if you've read Isaac Asimov's Foundation series, um, individuals can have an amazing impact on the future because of their decisions. You know, whoever discovered transformers had a
tremendous impact on the direction of AI. I think we're going to see more of that in the coming years, where some group of researchers that is iterating fast might discover certain things that would suddenly unlock that plateau and take us to the next step, and it's going
to continue to improve like that. And so it doesn't surprise me that there are so many companies hiring engineers right now to figure out those hacks and those techniques. Um, the other set of gains that we might see is from multimodality. So the way to think about it is, we've had LLMs first text-based, and then we've added imaging, and today, you know, models are very good
at images, they're very good at text. It turns out that being good at images and being good at text makes the whole model better. So the fact that you're good at understanding a cat image makes you better at text as well for a cat. Now
you add another modality like audio or video the whole system gets better. So
you're better at writing about a cat if you know what a cat sounds like. If you
can look at a cat on an image as well.
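A toy sketch of the shared-embedding intuition behind that cross-modal transfer (all vectors below are made up for illustration; real multimodal encoders learn them):

```python
import numpy as np

# Made-up embeddings in a shared space: multimodal models map the text,
# image, and audio of the same concept to nearby vectors.
emb = {
    "text:cat":  np.array([0.90, 0.10, 0.00]),
    "image:cat": np.array([0.80, 0.20, 0.10]),
    "audio:cat": np.array([0.85, 0.15, 0.05]),
    "text:car":  np.array([0.00, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The sound of a cat lands closer to the text "cat" than to the text
# "car", so what the model learns from audio can help it with text.
print(cosine(emb["audio:cat"], emb["text:cat"]))
print(cosine(emb["audio:cat"], emb["text:car"]))
```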
Does that make sense? So we see gains that are transferred from one modality to another. And that might lead to the pinnacle of robotics, where all these modalities come together and suddenly the robot is better at running away from a cat, because it understands what a cat is, what it sounds like, what it looks like, etc. Does that make sense? Um, the other one is multiple methods working in harmony. In the Tuesday lectures, we've seen supervised learning, unsupervised learning, self-supervised learning, reinforcement learning, prompt engineering, RAG, etc. If you look at
um, how babies learn, um, it is probably a mix of those different approaches. Like, a baby, um, might have some meta-learning, meaning, you know, it has some survival instinct that is encoded in the DNA, most likely. Um, and that's like the baby's pre-training, if you will. On top
of that, uh, the mom or the dad, um, is pointing at stuff and saying, "Bad, good, bad, good, supervised learning."
On top of that, the baby's falling on the ground and getting hurt, and that's a reward signal for reinforcement learning. On top of that, the baby's observing other people doing stuff, or other babies, you know, doing stuff: unsupervised learning. You see what I mean? We're probably a mix of all these methods. And, um, I think that's where the trend is going, is where those methods that you've seen in CS230 come together in order to build an AI system
that learns fast, is low latency, is cheap, energy efficient, and makes the most out of all of these methods. Um
finally, and this is especially true at Stanford, um, you have research going on that you would consider human-centric and some research that is non-human-centric. By human-centric, I should say human approaches that are modeled after the brain, and approaches that are not modeled after humans, because it turns out that the human body is very
limiting. And so if you actually only do research on what the human brain looks like, you're probably missing out on compute and energy and stuff like that that you can optimize even beyond neuronal connections in the brain. But
you still can learn a lot from the human brain. And that's why there are professors running labs right now that try to understand how backpropagation works for humans. And in
fact, it's probably that we don't have backpropagation. We don't use backpropagation. We only do forward propagation, let's say. So this type of stuff is interesting research that I would encourage you to read if you're curious about the direction of AI.
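One concrete flavor of forward-only learning is weight perturbation (an SPSA-style update): estimate the gradient from two extra forward passes instead of backpropagating. A toy sketch, with a made-up one-layer model and illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(w, x):
    # One tiny tanh layer; the only operation we ever run is a forward pass.
    return np.tanh(x @ w)

def loss(w, x, y):
    return float(np.mean((forward(w, x) - y) ** 2))

# Synthetic regression task generated from a hidden "true" weight vector.
x = rng.normal(size=(64, 4))
w_true = rng.normal(size=(4, 1))
y = np.tanh(x @ w_true)

w = np.zeros((4, 1))
initial = loss(w, x, y)

eps, lr = 1e-2, 0.1
for _ in range(3000):
    d = rng.normal(size=w.shape)   # random perturbation direction
    # Directional derivative from two forward passes -- no backprop.
    g = (loss(w + eps * d, x, y) - loss(w - eps * d, x, y)) / (2 * eps)
    w -= lr * g * d                # unbiased (but noisy) gradient step

final = loss(w, x, y)
print(initial, final)
```

This is far less sample-efficient than backpropagation, which is exactly why the question of what biological learning does instead is interesting research.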
Um, and then finally, um, one thing that's going to be pretty clear, I call it out all the time, but it's the velocity at which things are moving. You're noticing part of the reason we're giving you a breadth in CS230 is because these methods are changing so fast. So, I don't want to bother going and teaching you method number 17 on RAG that optimizes the RAG, because in two years you're not going to need it, you know. So I would rather you think about what is the breadth of things you want to understand, and when you need it, you are sprinting and learning the exact thing you need faster, because the half-life of skills is so low, you know. You want to come out of the class with a good breadth and then have the ability to go deep whenever you need after the class, and so that's sort of how the class is designed as well. Um, yeah, that's it for today.