
Stanford CS230 | Autumn 2025 | Lecture 8: Reinforcement Learning

By Stanford Online

Summary

## Key takeaways

- **Base LLMs Lack Domain Knowledge**: Pre-trained models like GPT lack specific domain knowledge, such as identifying sick crops from a custom farming dataset not found in public training data. [04:12], [04:36]
- **LLMs Fail on Current Trends**: LLMs are not up to date and struggle with new words like Trump's 'covfefe' tweet or Gen Z slang such as 're' or 'mid', which would otherwise require retraining. [05:25], [06:34]
- **Hard to Control LLMs**: LLMs are difficult to control, as shown by Microsoft's 2016 Twitter bot turning racist within 16 hours and ongoing debates between the Grok and OpenAI teams on political bias. [08:05], [08:42]
- **Prompt Training Boosts Performance**: A BCG study showed consultants trained in prompt engineering outperformed those with just GPT-4 access, unlike untrained AI users who fell 'asleep at the wheel' on tough tasks. [18:29], [19:58]
- **Chaining Enables Debugging**: Chaining prompts into separate steps, like extracting issues, drafting outlines, and writing responses, allows independent testing and debugging for better control. [33:28], [34:29]
- **Avoid the Fine-Tuning Trap**: Fine-tuning is time-costly and often obsolete by new base model releases; a Slack fine-tune example produced lazy responses like 'I shall work on that in the morning.' [42:25], [43:52]

Topics Covered

  • Base LLMs Fail Enterprise Precision
  • Prompt Engineering Beats No Training
  • Chaining Enables Debuggable Workflows
  • Avoid Fine-Tuning Trap
  • Agentic Workflows Shift Engineering

Full Transcript

Hi everyone, welcome to another lecture for CS230 deep learning. Today we're

going to talk about enhancing large language model applications, and I call this lecture beyond LLMs. Um, it has a lot of newer content, and, uh, the idea behind this lecture is: we started to learn about neurons, and then we learned about layers, and then we learned about deep neural networks, and then we learned a little bit about how to structure projects in CS230, and now we're going one level beyond, into what it would look like if you were building agentic AI systems at work, in a startup, in a company. Um, and it's probably one of the more practical lectures. Again, the goal is not to build a product end to end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, are exploring, so that after the class, you have sort of the breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, evals, and when you want to dive deeper, you have the baggage to dive deeper and learn faster about it. Okay,

let's try to make it as interactive as possible as usual. Um, when we look at the agenda, the agenda is going to start

with the core idea behind challenges and opportunities for augmenting LLMs. So, we start from a base model. How do we maximize the performance of that base

model? Um, then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper: if we were to get our hands under the hood and do some fine-tuning, what would it look like? I'm not a fan of fine-tuning, and I talk a lot about that, but I'll explain why. I try to avoid fine-tuning as much as possible.

Um, and then we'll do a section four on retrieval augmented generation, or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. We're going to sort of unpack what a RAG is and how it works, and then the different methods within RAGs, and then we'll talk about agentic AI workflows. Um, I'll define it. Um, Andrew Ng is one of the, call it, first ones to have called this trend agentic AI workflows, and so we look at the definition that Andrew gives to agentic workflows, and then we start seeing examples. Section six is very practical. It's a case study where we will think about an agentic workflow, and I ask you to measure if the agent actually works, and we brainstorm how we can measure if an agentic workflow is working the way you want it to work. There's a family of methods, called evals, that solve that problem. Uh, and then we'll look briefly at multi-agent workflows, and then we can have a sort of open-ended discussion where I'll share some thoughts on what's next in AI. Um, and I'm looking forward to hearing from you all as well on that one.

Okay, so let's get started uh with the problem of augmenting LLMs. So, open-ended question for you. You are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model?

>> Yes.

>> Lacks some domain knowledge. You're perfectly right. You know, we had a group of students a few years ago (it was not LLM related), but they were building an autonomous farming device or vehicle that had a camera underneath taking pictures of crops to determine if the crop is sick or not, if it should be thrown away or if it should be used. And that data set is not a data set you find out there, and the base model, or a pre-trained computer vision model, would lack that knowledge of course. What else?

Yes.

>> Okay. So, just to repeat for people online, you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality. And in fact, yes, the distribution of the real world might differ from the training set, as we've seen with GANs, and that might create an issue with pre-trained models. Although pre-trained LLMs are getting better at, you know, handling all sorts of data inputs. Uh, yes?

>> It lacks current information.

>> It lacks current information. Uh, the LLM is not up to date. And in fact, you're right. Imagine you have to retrain your LLM from scratch every couple of months.

Uh, one story that I found funny, um, it's from probably three years ago or maybe more, five years ago, where, um, during his first presidency, President

Trump one day tweeted 'covfefe.' You remember that tweet or no? Just 'covfefe.' And it was probably a typo, or his phone was in his pocket, I don't know. But that word did not exist. The language models that Twitter was running at the time could not recognize that word. And so, uh, the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word 'covfefe,' and the model was so confused on, you know, what does that mean, where should we show it, to whom should we show it. It is an example of how, nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match the new trends and understand the new words out there. I mean, you know, you oftentimes hear Gen Z words like 're' or 'mid' or whatever. I don't know all of them, but you probably want to find a way that can allow the LLM to understand those trends without retraining the LLM from scratch. Yeah. What else?

>> It's trained to have a breadth of knowledge, and if you wanted to do something specialized, it might not...

>> Yeah. It might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well defined. Think about enterprise applications: in an enterprise application you need high precision, high fidelity, low latency, and maybe the model is not great at that specific thing. It might do fine, but just not good enough, and you might want to augment it in a certain way. Yeah.

[clears throat] >> So it makes the model a lot heavier, a lot slower.

>> So maybe it has a lot of broad domain knowledge that might not be needed for your application, and so you're using a massive, heavy model when you actually are only using 2% of the model's capability. You're perfectly right. You might not need all of it. So you might find ways to prune the model, quantize it, modify it. All of these are good points. I'm going to add a few more as well.

Um, LLMs are very difficult to control. Uh, your last point is actually an example of that. You want to control the LLM to use a part of its knowledge, but it's not; it's in fact getting confused. Uh, we've seen that in history.

confused. Uh, we've seen that in history. In 2016, Microsoft created a

history. In 2016, Microsoft created a notorious Twitter bot that learned from users and it quickly became a racist

jerk. Microsoft ended up removing the

jerk. Microsoft ended up removing the bot 16 hours after launching it. The

community was really fast at determining that this was a racist bot. Um, and and you know, you can empathize with Microsoft in the sense that it is actually hard to control an LLM. they

might have done a better job to qualify before launching but it is really hard to control an even more recently this is

a tweet from Samman last November um where there was this debate between Elon Musk and Samman on whose LLM is the

left-wing propaganda machine or the right-wing propaganda machine and they were hating on each other's LLMs but that tells you at the end of the day that even those two teams Grock and

OpenAI, which are probably the best funded team with a lot of talent, are not doing a great job at controlling their LLMs, you know.

And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs and the LLM saying

something really controversial or or or racist or, you know, something that would not not be considered great [laughter] by social standards, I guess.

And uh and that tells you that the model is really hard to to control.

Um the second aspect of it is something that you've mentioned earlier. LLMs may

underperform in your task. Um and that might include specific knowledge gaps such as medical diagnosis. If you're

doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it. And in

fact, something that we haven't mentioned as a group is sources. In some fields the answer needs to be sourced specifically: you have a hard time believing something unless you have the actual source of the research that backs it up. Um, inconsistencies in style and format. So, imagine you're building a legal AI agentic workflow. Uh, legal has a very specific way to write and read, where every word counts. You know, if you're negotiating a large contract, every word on that contract might mean something else when it comes to court. And so it's very important that you use an LLM that is very good at it. The precision matters. And then, you know, task-specific understanding, such as doing a classification in a niche field. Here I

pulled an example where you know let's say a biotech product is trying to use an LLM to categorize user reviews into

positive, neutral, or negative. Um, you know, maybe for that company, something that would typically be considered a negative review is actually considered a neutral review, because the NPS of that industry tends to be way lower than other industries, let's say. That's task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization that it wants. We will see an example of how to solve that problem in a second. And then, limited context handling. Um, a lot of AI applications, especially in the enterprise, require data that has a lot of context. Just to give you a simple example, knowledge management is an important space; enterprises buy a lot of knowledge management tools. When you go on your drive and you have all your documents, ideally you could have an LLM running on top of that drive. You can ask any question and it will immediately read thousands of documents and answer: what was our Q4 performance in sales? Um, it was X dollars. Uh, it finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. You will have to augment it.

Does that make sense?

>> [laughter] >> Uh the other aspect around context windows is they are in fact limited. If

you look at the context windows of the models from the last 5 years um even the best models today will range in context

window or number of tokens it can take as input um somewhere in the hundreds of thousands of tokens max. Just to give

you a sense, 200,000 tokens is roughly two books. Yeah. So that's how much you can upload, uh, and it can read, pretty much, and you can imagine that when you're dealing with video understanding

or heavier data files that is of course an issue.

So [snorts] you might have to chunk it, you might have to embed it, you might have to find other ways to get uh the LLM to handle larger contexts.

Um, the attention mechanism is also powerful but problematic, because it does not do a great job at attending in very large contexts. There is actually an interesting problem called needle in a haystack. It's an AI problem, or call it a benchmark, where, in order to test if your LLM is good at putting attention on a very specific fact within a large corpus, researchers might randomly insert in a book one sentence that outlines a certain fact, such as 'Arun and Max are having coffee at Blue Bottle,' in the middle of the Bible, let's say, or some other very long text. Um, and then you ask the LLM, what were Arun and Max having, you know, at Blue Bottle, and you see if it remembers that it was coffee. It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated.
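To make that concrete, here is a minimal sketch (not from the lecture) of how you could build a needle-in-a-haystack style test yourself; `call_llm` is a placeholder for whatever chat-completion API you use, and the corpus, needle, and question are illustrative.

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real chat-completion call from your provider's SDK."""
    raise NotImplementedError

def build_haystack_test(corpus: str, needle: str, question: str, answer: str):
    # Insert the needle sentence at a random position inside a very long corpus.
    sentences = corpus.split(". ")
    pos = random.randint(0, len(sentences))
    sentences.insert(pos, needle)
    haystack = ". ".join(sentences)

    prompt = (
        f"{haystack}\n\n"
        f"Question: {question}\n"
        "Answer using only the text above."
    )
    return prompt, answer

# Hide one fact in a long text and check whether the model recalls it.
prompt, expected = build_haystack_test(
    corpus="(a very long text, for example an entire book) " * 1000,
    needle="Arun and Max are having coffee at Blue Bottle.",
    question="What were Arun and Max having at Blue Bottle?",
    answer="coffee",
)
# passed = expected.lower() in call_llm(prompt).lower()
```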

So again, this is a limiting factor for LLMs. We'll talk about RAG in a second, but I want to preview it: there are debates around whether RAG is the right long-term approach for AI systems. So, as a high-level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt to answer a question. It has lots of applications; knowledge management is an example. So imagine you have your drive again, but every document is sort of compressed in representation, and the LLM has access to that lower-dimensional representation. Um, the debate that this tweet from Yu outlines is that, in theory, if we have infinite compute, then RAG is useless, because you can just read a massive corpus immediately and answer your question. But even in that case, latency might be an issue. Imagine the time it takes for an AI to read all of your drive every single time you ask a question. It doesn't make sense. So, RAG has other advantages beyond even the accuracy. On top of that, the sourcing matters as well: RAG allows you to source. We'll talk about all that later. But there's always this debate in the community whether a certain method is actually future-proof, because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now; we don't know, essentially.

[snorts] Um, you know, and the analogy that he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. You know, when you search on a search engine, you still find sources of information, and in fact, in the background, there are very detailed traversal algorithms that rank and find the specific links that might be the best to present to you. Versus, imagine you had to read the entire web every single time you're doing a search query, without being able to narrow to a certain portion of the space; that might again not be reasonable.

Okay, when we're thinking of improving LLMs, the easiest way to think of it is two dimensions. One dimension is we are going to improve the foundation model itself. So, for example, we move from GPT-3.5 Turbo to GPT-4 to GPT-4o to GPT-5. Each of those is supposed to improve the base model. GPT-5 is another debate, because it's sort of packaging other models within itself, but you know, if you're thinking about 3.5, 4, and 4o, that's really what it is. The pre-trained model improves, and so you should see your performance improve on your tasks. Um, but the other dimension is we can actually engineer around and leverage the LLM in a way that makes it better. So you can simply prompt GPT-4o. You can chain some prompts and improve the prompt, and it will improve the performance; it's been shown. Uh, you can even put a RAG around it. You can put an agentic workflow around it. You can even put a multi-agent system around it. And that is another dimension for you to improve performance. So that's how I want you to think about it: which LLM am I using, and then how can I maximize the performance of that LLM?

This lecture is about the vertical axis. Those are the methods that we will see together.

Sounds good for the introduction. So

let's move to prompt engineering. Um I'm

going to start with an interesting study just to motivate why prompt engineering matters. Um, there is a study from, uh, you know, HBS, UPenn, as well as Harvard Business School and others, also involving Wharton, that took a subset of BCG consultants, individual contributors, and split them into three groups. One group had no access to AI. One group had access to, I think it was, GPT-4. Um, and then one group had access to the LLM but also a training on how to prompt better. Um, and then they observed the performance of these consultants across a wide variety of tasks. There are a few things that they noticed that I thought were interesting. One is something they call the jagged frontier, um, meaning that certain tasks that consultants are doing fall beyond the jagged frontier, meaning AI is not good enough; it's not improving human performance. In fact, it's actually making it worse. Um, and some tasks are within the frontier, meaning that AI is actually significantly improving the performance, the speed, the quality of the consultant. Um, many tasks fall within and many tasks fall outside, and they shared their insights, but the TL;DR is there is a frontier within which AI is absolutely helping, and one where they call out this behavior of 'falling asleep at the wheel,' where people relied on AI on a task that was beyond the frontier, and in fact it ended up going worse because the human was not reviewing the outputs carefully enough.

Um, they did note that the group that was trained was the best, uh, better than the group that was not trained on prompt engineering, which also motivates why this lecture matters, um, so that you're within that group afterwards. One other insight was the centaurs and the cyborgs. They noticed that consultants had the tendency to work with AI in one of two ways, and you might yourself be part of one of these groups. The centaurs are mythical creatures that are half human, half, I think, half... like what? Horses. Yeah, horses. Half horse, half something. Um, and those were individuals that would divide and delegate. They might give a pretty big task to the AI. So imagine you're working on a PowerPoint, which consultants are known to do. Um, you might actually write a very long prompt on how you wanted to do your PowerPoint and then let it work for some time and then come back and it's done. Whereas others would act as cyborgs. Cyborgs are fully blended bionic humans, a human augmented with robotic parts. Uh, and those individuals would not fully delegate a task. They would actually work super quickly with the model, like, back and forth. I find that a lot of students are actually working more like cyborgs than centaurs. But maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. Yeah, that's just something good to keep in mind.

Also, a lot of companies would tell you, oh, we're hiring prompt engineers, etc. It's a career. I don't buy that. I think it's just a skill that everybody should have. Uh, you're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career.

Um, so let's talk about basic prompt design principles. I'm giving you a very simple prompt here: 'Summarize this document,' and then the document is uploaded alongside it. And the model does not have much context around, you know, what should be in the summary, how long the summary should be, what it should talk about, etc. You can actually improve these prompts, um, by doing, you know, something like: 'Summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policy makers.' That's already better, right? You're sharing the audience, and it's going to tailor it to the audience. You're saying that you want five bullet points and you want it to focus only on key findings. You know, that's a better prompt, you would argue. Um, how could you make this prompt even better? What are other techniques that you've heard of or tried yourself that could make this one-shot prompt better?

>> Yeah.

>> Okay. Right. Right. Examples. So, you mean: here is an example of a great summary. Yeah, you're right. That's a good idea.

>> Tell it to be like someone, act like you are...

>> Very popular technique: act like a renewable energy expert giving a conference at Davos, let's say. Yeah, that's great.

>> Sound like you're really good at it, like, 'you are the best in the world at this, explain...' [laughter]

>> Yeah, actually, I mean, these things work. It's funny, but it does work to say 'act like XYZ.' It's a very popular prompt template. But we'll see a few examples. What else could you do?

>> Yes.

>> I personally like to say critique your own project.

>> Okay.

>> Critique your own project. So you're

using reflection. So you might actually do one output and then ask it to critique it and then give it back. Yeah,

we'll see that. That's a great one.

That's the one that probably works best among those, typically, but we'll see some examples. What else?

Yeah.

>> Break it into steps.

>> Okay, break the task down into steps. Do you know what that is called?

>> No.

>> Okay. Chain of thought. So, um, this is actually a popular method that's been shown in research to improve performance. You could actually give a clear instruction and also encourage the model to think step by step: 'Approach the task step by step and do not skip any step.' And then you give it some steps, such as: step one, identify the three most important findings. Step two, explain how each finding impacts renewable energy policy. Step three, write the five-bullet summary with each point addressing a finding, um, etc. So, chain of thought: I linked the paper from 2023 that popularized chain of thought. Chain of thought is very, very popular right now, especially in AI startups that are trying to control their LLMs.
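As a concrete illustration, here is a minimal sketch of the kind of chain-of-thought prompt described above; the wording is illustrative, and `call_llm` stands in for whatever LLM API you use.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

COT_PROMPT = """Summarize this 10-page scientific paper on renewable energy in five
bullet points, focusing on key findings and implications for policy makers.

Approach the task step by step and do not skip any step:
Step 1: Identify the three most important findings.
Step 2: Explain how each finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, with each point addressing a finding.

Paper:
{paper_text}
"""

def summarize_with_cot(paper_text: str) -> str:
    # The explicit steps nudge the model to reason before writing the final bullets.
    return call_llm(COT_PROMPT.format(paper_text=paper_text))
```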

Okay, [snorts] to go back to your examples about 'act like XYZ': what I like to do, and Andrew also talks about that, is to look at other people's prompts. And in fact, online you have a lot of prompt repositories for free on GitHub. In fact, I linked the Awesome ChatGPT Prompts repo on GitHub, where you have so many examples of great prompts that engineers have built. They said, 'it works great for us,' and they published it online. And a lot of them start with 'act as': you know, act as a Linux terminal, act [laughter] as an English translator, act like a position interviewer, etc. The advantage of a prompt template is that you can actually put it in your code and scale it for many user requests. So let me give you an example from Workera, um, you know, where Cara evaluates skills (some of you have taken the assessments already) and tries to personalize it to the user. Um, and in fact, in an HR system in an enterprise, you might have: Jane is a product manager, level three, she is in the US, and her preferred language is English. And actually, that metadata can be inserted in a prompt template that we personalize for Jane, and similarly for Joe, whose preferred language is Spanish. It will tailor it to Joe, and that's called a prompt template.

>> Yeah.

>> Do the foundation models already use something like that, or do you have to add it yourself?

>> So, uh, the question is: do the foundation models use prompt templates, or do you have to integrate it yourself? So the foundation models probably use a system prompt that you don't see. Like, when you type on ChatGPT, it is possible (it's not public) that OpenAI behind the scenes has something like 'act like a very helpful assistant for this user, and by the way, here are your memories about the user that we kept in a database.' You can actually check your memories. And then your prompt goes underneath, and then the generation starts. So probably they're using something like that, but it doesn't mean you can't add one yourself. So in fact, if you think about a prompt template for the Workera example I was showing, maybe it starts, when you call OpenAI, with 'act like a helpful assistant,' and then underneath it's like 'act like a great AI mentor that helps people in their career,' and OpenAI's prompt template also has 'follow the instructions from the creator' or something like that, you know. It's possible. Yeah.

Any questions about prompt templates?

Again, I would encourage you to go and read examples of prompts. Some of them are quite thoughtful.
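For example, a prompt template with HR metadata slotted in might look like the sketch below; the field names and the template wording are hypothetical, not an actual production template.

```python
# Minimal prompt-template sketch: user metadata from an HR system is inserted
# into a reusable template so the same code scales to many users.
MENTOR_TEMPLATE = (
    "Act like a great AI mentor that helps people in their career.\n"
    "The user is {name}, a {role} (level {level}) based in {country}.\n"
    "Always answer in {language}.\n\n"
    "User question: {question}"
)

def build_prompt(user: dict, question: str) -> str:
    return MENTOR_TEMPLATE.format(**user, question=question)

jane = {"name": "Jane", "role": "product manager", "level": 3,
        "country": "the US", "language": "English"}
joe = {"name": "Joe", "role": "product manager", "level": 2,
       "country": "Spain", "language": "Spanish"}

print(build_prompt(jane, "How do I prepare for a promotion?"))
print(build_prompt(joe, "How do I prepare for a promotion?"))
```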

Um, let's talk about zero-shot versus few-shot prompting. It came up earlier. Here's an example. Again, going back to the categorization of product reviews. Let's say that we're working on a task where the prompt is: classify the tone of this sentence as positive, negative, or neutral. And then you paste the review, which is: 'The product is fine, but I was expecting more.'

If I were to survey the room, I would bet that some of you would say it's negative and some of you would say it's neutral, because you actually have a first part that is relatively positive, 'it's fine,' and then the second part, 'I was expecting more,' which is relatively negative. So where do you land? This can be a subjective question, and maybe in one industry this would be considered amazing, and in another one it would be considered really bad, because people are used to really flourishing reviews. And so the way you can actually align the model to your task is by converting that zero-shot prompt (zero-shot refers to the fact that it's not being given any example) into a few-shot prompt, where the model is given in the prompt a set of examples to align it to what you want it to do. So the example here is: you again paste the same prompt as before with the user review, and then you add, 'Here are examples of tone classifications. This exceeded my expectations completely: positive. It's okay, but I wish it had more features: negative. The service was adequate, neither good nor bad: neutral. Now classify the tone of this sentence.' And the model then says negative. And the reason it says negative, of course, is likely because of the second example, which was 'it's okay, but I wish it had more features,' which we told the model was negative. The model is aligned now with your expectations.

Few-shot prompts are very popular. Um, and in fact, for AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompt in their codebase. Um, you can think of that as almost building a data set, but instead of actually building a separate data set, like we've seen with supervised fine-tuning, and then fine-tuning the model on it, you're just putting it directly in the prompt. And it turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters. You just update your prompts. And you know, if it's text examples, you can actually concatenate many examples in a single prompt. At some point it will be too long and you will not have the necessary context window. But it's a pretty strong approach that is quick to align an LLM.
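Here is a minimal sketch of turning the zero-shot tone classifier into a few-shot one by concatenating labeled examples into the prompt; the labels follow the lecture's example, and `call_llm` is a placeholder for your LLM API.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

# Few-shot examples that encode how *this* company wants tones labeled.
FEW_SHOT_EXAMPLES = [
    ("This exceeded my expectations completely.", "positive"),
    ("It's okay, but I wish it had more features.", "negative"),
    ("The service was adequate, neither good nor bad.", "neutral"),
]

def classify_tone(review: str) -> str:
    examples = "\n".join(f'"{text}" -> {label}' for text, label in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify the tone of the sentence as positive, negative, or neutral.\n"
        "Here are examples of tone classifications:\n"
        f"{examples}\n\n"
        f'Sentence: "{review}"\n'
        "Tone:"
    )
    return call_llm(prompt).strip().lower()

# classify_tone("The product is fine, but I was expecting more.")  # expected: negative
```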

Okay. Yes.

>> Is there research on how long the prompt can be until it starts...?

>> So the question was: is there any research on how long the prompt can be before the model essentially loses itself or doesn't follow instructions anymore? There is, uh, but the problem is that the research is outdated every few months, because models get better. Um, and so I don't know where the state of the art is. You can probably find it online on benchmarks. Like, we see that, and I'll give an example: on the Workera product you have a voice conversation, for some of you that have tried it, where you're asked, 'explain what is a prompt,' and then you explain, and then there's a scoring algorithm behind it. We know that after eight turns the model loses itself, because you always paste in the previous user responses; it just starts going wild. And so the technique we use in the background is we actually create chapters of the conversation. Maybe one chapter is the first eight turns, and then you actually start over from another prompt. You can summarize the first part of the conversation, insert the summary, and then keep going. You know, those are engineering hacks that engineers might have figured out in the background. Yeah. Because, yeah, eight turns makes a prompt quite long actually.
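A minimal sketch of that chaptering hack: once the conversation passes a turn budget, summarize the older turns and continue from the summary. The turn limit of eight follows the lecture's example; `call_llm` is a placeholder.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

MAX_TURNS = 8  # past this, the model tends to lose the thread

def respond(history: list[str], user_message: str) -> tuple[list[str], str]:
    history = history + [f"User: {user_message}"]
    if len(history) > MAX_TURNS:
        # Close the "chapter": compress the old turns into a short summary
        # and start a fresh prompt from that summary.
        summary = call_llm("Summarize this conversation so far:\n" + "\n".join(history))
        history = [f"Summary of earlier conversation: {summary}",
                   f"User: {user_message}"]
    reply = call_llm("\n".join(history) + "\nAssistant:")
    return history + [f"Assistant: {reply}"], reply
```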

[snorts] Uh let's move on to chaining.

Uh, chaining is the most popular technique out of everything we've seen so far in prompt engineering.

Uh, it's not chain of thought. So chain of thought, we've seen, is think step by step: step one, step two, step three, do not skip any step. This is different. This is chaining complex prompts to improve performance. And this is what it looks like. Um, you take a single-step prompt such as: 'Read this customer review and write a professional response that acknowledges their concern, explains the issue, and offers a resolution.' And then you paste the customer review, which is: 'I ordered a laptop, it arrived 3 days late, the packaging was damaged. Very disappointing. I needed it urgently for work.' And then the output is an email that is immediately given to you by the LLM after it reads the prompt.

Um, so this might work, but it might be hard to control, you know, because think about it: there are multiple steps that you have listed, and everything is embedded in the same prompt. And if you wanted to debug step by step and know which step is weaker, you couldn't. You would have everything mixed together. So one advantage of chaining is, you know, you would separate the prompts so that you can debug them separately, and it will also lead to an easier way to improve your workflow. Um, let's say a first prompt is: extract the key issues. 'Identify the key concerns mentioned in this customer review.' Paste the customer review. Second prompt: 'Using these issues' (so you paste back the issues), 'draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution.' And then prompt number three: write the full response. So, 'using the outline, write the professional response,' and then you get your final output.

So, in theory, you can't tell me, oh, the second approach is better than the first one at first glance. But what you can notice is that we can actually test those three prompts separately from each other and determine if we will get the most gains out of engineering and optimizing the first prompt, or the second one, or the third one. We now have three prompts that are independent from each other. And you know, maybe if the outline were better, the performance of the email (how high the open rate will be, or the user satisfaction with the response) would actually get higher, you know. And so chaining improves performance, but most importantly, it helps you control your workflow and debug it more seamlessly.

>> Yes.

So [snorts] if we know that the three prompts independently work very well, if we combine them into one prompt and we highlight that step-by-step thinking process, do we on average get the same quality of output, or do we still have to do that breakdown?

>> So let me try to rephrase what you're saying. Let's say we look at the first prompt, which has all three tasks built into that prompt. Um, what exactly do you mean? You mean, like, if we evaluate the output and we measure some user insight, satisfaction, etc., why don't we just modify that prompt and essentially see how it improves user satisfaction?

>> Yeah, instead of the three-step process.

>> I see. So, why do we need the three steps?

>> Yeah.

>> Yeah. I mean, think about it. The intermediate output is what you want to see. Like, if I'm debugging the first approach, the way I would do it is I would capture user insights. Like, here's the email, how good was the response? Thumbs up, thumbs down. Um, was your issue resolved? Thumbs up, thumbs down. Those would tell me how good my prompt is. And I can engineer that prompt, optimize it, and I would probably drive some gains. Um, but I will not easily be able to trace back to what the problem was. While in the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. For example, if I look at prompt two and I look at the outline and I see the outline is actually meh, it's not great, then I think I can get a lot of gains out of the outline. Um, or the outline is actually really good, but the last prompt doesn't do a good job at translating it into an email. So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good; in fact, it doesn't follow our vocabulary internally. Then I know the third prompt is where I would get the most gains. So that's what it allows me to do: have intermediate steps to review.

>> Yeah.

>> Are there any latency...?

>> We'll talk about it. Are there any latency concerns? Yes. Um, in certain applications, you don't want to use a chain, or you don't want to use a long chain, because it adds latency. We'll talk about that later. Good point.

So practically, this is what chaining complex prompts looks like. You have your first prompt with your first task. Its output is pasted into the second prompt, with the second task being defined. That output is then pasted into the third prompt, with the third task being defined, and so on. That's what it looks like in practice.
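In code, that chain could look like the following sketch: three separate prompts whose intermediate outputs you can log and debug independently. `call_llm` is a placeholder for your LLM API.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

def respond_to_review(review: str) -> dict:
    # Prompt 1: extract the key issues.
    issues = call_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review
    )
    # Prompt 2: draft an outline from those issues.
    outline = call_llm(
        "Using these issues, draft an outline for a professional response that "
        "acknowledges concerns, explains possible reasons, and offers a resolution:\n"
        + issues
    )
    # Prompt 3: write the full response from the outline.
    email = call_llm(
        "Using the outline below, write the professional response:\n" + outline
    )
    # Returning the intermediate steps makes each stage testable on its own.
    return {"issues": issues, "outline": outline, "email": email}
```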

Super. Um, we'll talk more later about testing your prompts, but there are methods now to do it, and we'll see later in this lecture, with our case study, how we can test our prompts. Uh, but here is an example of how you might do it. Um, you might have a summarization workflow, you know, a prompt that is the baseline; it's a single prompt. You might have a refined summarization, which is a modified version of this prompt, or a workflow with a chain, you know. Um, and then you have your test case, which is the input that you want to summarize, let's say, and then you have the generated output, and you can have humans go and rate these outputs. And you would notice that the baseline is better or worse than the refined prompt. Of course, this manual approach takes time. Um, but it's a good way to start, and usually the advice is to get hands-on at the beginning, because you would quickly notice some issues and it will give you better intuition on what tweaks can lead to better performance.

However, if you wanted to scale that system across many products, many parts of your codebase, you might want to find a way to do that automatically, without asking humans to review and grade summaries, right? Um, one approach is to use platforms: at Workera, our team uses a platform called Promptfoo that allows you to actually automate part of this testing. Um, in a nutshell, what it does is it can allow you to run the same prompt with five different LLMs immediately and put everything in a table that makes it super easy for a human to grade, let's say. Or alternatively, it might allow you to define LLM judges.

LLM judges can come in different flavors. For example, I can have an LLM judge that does a pairwise comparison. So, what the LLM is asked to do is: here are two summaries, just tell me which one is better than the other one. Um, that's what the LLM does. And that can be used as a proxy for how good the summarization baseline is versus the refined version. Another way to do an LLM-as-a-judge is single-answer grading. So: here's a summary, grade it from one to five, you know. And then you can go even deeper and do a reference-guided pairwise comparison, or you also add a rubric. You say a five is when a summary is below 100 characters (I'm just making that up), mentions at least three key points that are distinct, and starts with a first sentence that gives the overview and then goes into the detail. That's a great summary, a five out of five. Zero is: the LLM failed to summarize and actually was very verbose, let's say. And so you put a rubric behind it, and you have an LLM judge applying the rubric. Of course, you can now pair different techniques. You can do a few-shot for the rubric: you can actually give examples of fives out of five, fours out of five, threes out of five, because now you know multiple techniques. Okay, [clears throat] does that make sense?
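Here is a minimal sketch of the two judge flavors mentioned above, pairwise comparison and rubric-based single-answer grading; the rubric text is the made-up one from the lecture, and `call_llm` is a placeholder.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

def pairwise_judge(source: str, summary_a: str, summary_b: str) -> str:
    prompt = (
        "Here are two summaries of the same document. Answer only 'A' or 'B': "
        "which summary is better?\n\n"
        f"Document:\n{source}\n\nSummary A:\n{summary_a}\n\nSummary B:\n{summary_b}"
    )
    return call_llm(prompt).strip()

RUBRIC = (
    "Score 5: under 100 characters, mentions at least three distinct key points, "
    "and the first sentence gives the overview before the details.\n"
    "Score 0: fails to summarize or is very verbose."
)

def rubric_judge(source: str, summary: str) -> str:
    prompt = (
        f"Grade this summary from 0 to 5 using the rubric below.\n{RUBRIC}\n\n"
        f"Document:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )
    return call_llm(prompt).strip()
```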

Yeah. Okay. So that was the second section on prompt engineering uh or the first line of optimization. Now, let's

say you've exhausted all your chances for prompt engineering and you're thinking about actually touching the model, modifying its weights, or fine-tuning it. As I was telling you, I'm not a fan of fine-tuning. There are a few reasons why. Um, one, it typically requires substantial labeled data to fine-tune, although now there are approaches to fine-tuning that are getting better and look more like few-shot prompting than fine-tuning; it's sort of merging, although one modifies the weights and the other doesn't. Um, fine-tuned models may also overfit to specific data (we're going to see a funny example actually), losing their general-purpose utility. So you might fine-tune a model, and actually, when someone asks a pretty generic question, it doesn't do well anymore. You know, it might do well on your task, so it might be relevant or not. And then it's time and cost intensive. That's my main problem. And you know, at Workera, we steer away from fine-tuning as much as possible, because by the time you're done fine-tuning your model, the next model is out, and it's actually beating your fine-tuned version of the previous model. So I would steer away from fine-tuning as much as you can. The advantage of the prompt engineering methods we've seen is you can put the next best pre-trained model directly in your code; it will update everything immediately. Fine-tuning doesn't work like that.

[laughter] There are advantages, though, where it still makes sense: if the task requires repeated, high-precision output, such as legal or scientific explanation, and if the general-purpose LLM struggles with domain-specific language. So, uh, let's look at a quick example together, um, which is an example from Ross Lazerowitz, um, I think it was a couple of years ago, September 2023, where Ross tried to do Slack fine-tuning. So, he looked at a lot of Slack messages within his company, and he was like, I'm going to fine-tune a model that speaks like us or operates like us, because this is how we work, right? This is the data that represents how people work at the company. Um, and so he actually went ahead and fine-tuned the model. Um, gave it a prompt like, hey (he was delegating to the model), write a 500-word blog post on prompt engineering, and the model responded, 'I shall work on that in the morning.' Um, and then he tries to push the model a little further and say, it's morning now. Um, and the model said, 'I'm writing right now. It's 6:30 a.m. here.' Uh, write it now. Okay, please. [laughter] 'Okay, I shall write it now. I actually don't know what you would like me to say about prompt engineering. I can only describe the process. The only thing that comes to mind for a headline is: how do we build prompts?' You know, it's kind of a funny example of fine-tuning, because it's true that it went wrong. Like, he was thinking, I want the model to speak like us at work, and it ended up acting like people and not actually following instructions. So, one example of why I would steer away from fine-tuning.

Super. Uh, let's talk about RAGs. Um, RAG is important. It's important to know it's out there and to at least have the basics. It's a very common interview question, by the way. If you go interview for a job, they might ask you to explain, in a nutshell, to a 5-year-old, what a RAG is, and hopefully after this you'll be able to do it. Um, so, uh, we've seen some of the challenges with standalone LLMs. Those challenges include the context window being small, the fact that it's hard to remember details within a large context window, um, knowledge gaps, you know, the cutoff dates you mentioned earlier. The model might be trained up to a date, and then it cannot follow the trends or be up to date. Hallucinations: there are some fields, think about medical diagnosis, where hallucinations are very costly. You can't afford a hallucination. You know, even in education, imagine deploying a model for US youth education and it hallucinates and teaches millions of people something completely wrong. It's a problem. Um, and then lack of sources. Uh, a lot of fields love sources. Research fields love sources. Education loves sources. Legal loves sources as well. And so, um, the pre-trained LLM doesn't do a good job of sourcing. And in fact, if you have tried to find sources on a plain LLM, it actually hallucinates a lot. It makes up research papers. It just lists completely fake stuff. Um, so how do we solve that? Um, with a RAG. RAG integrates with external knowledge sources: databases, documents, APIs. It ensures that answers are more accurate, up to date, and grounded, because you can actually update your documents. Your drive is always up to date; I mean, ideally you're always pushing new documents to it. And when you query, 'what is our Q4 performance in sales,' hopefully the last board deck is in the drive and it can read the last board deck. Yeah. [snorts] Um, and more developer control. We'll see why. Um, RAGs allow for targeted customization without actually requiring the retraining of the model. In fact, you don't touch the model with RAGs. It's really a technique that is put on top of the model. So, to

see an example of a RAG: this is a question-answering application where we're in the medical field, and a user is asking a query: 'What are the side effects of drug X?' This is an important question. You can't hallucinate. You need to source. You need to be up to date. Maybe there is a new update to that drug that is now in the database, and you need to read that. So a RAG is a great example of what you would want to use here.

The way it works is: you have your knowledge base of a bunch of documents. Um, what you do is you use an embedding to embed those documents into lower-dimensional representations. So, for example, if the document is a long PDF, uh, you might, you know, read the PDF, understand it, and then embed it. We've seen plenty of embedding approaches together, triplet loss, etc., you remember. So imagine one of them here, for LLMs, is embedding those documents into a lower representation. If the representation is too small, you will lose information. If it's too big, you will add latency, right? It's a trade-off. Um, you will typically store those representations in a database called a vector database. There are a lot of vector database providers out there. Um, you know, I think I've listed a couple that are very common. No, I haven't listed them, but I can share afterwards. Um, the vector database is essentially storing those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric.

So, um, what you do is you also embed, usually with the same algorithm, the user prompt, and you run a retrieval process, which is essentially saying: based on the embedding from the user query and the vector database, find the relevant documents based on the distance between those embeddings. Once you've found the relevant documents, you pull them and then you add them to the user query, with a system prompt or a prompt template on top. So the prompt template can be: 'Answer the user query based on this list of documents. If the answer is not in the documents, say I don't know.' That's your prompt template, where the user query is pasted, the documents are pasted, and then your output should be what you want, because it's now grounded in the documents. You can also add to this prompt template: 'Tell me the exact page, chapter, and line of the document that was relevant,' and in fact link it as well, just to be more precise.
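Putting those pieces together, here is a minimal vanilla-RAG sketch: embed the documents, embed the query, retrieve by cosine similarity, and ground the answer in the retrieved text. The `embed` and `call_llm` functions are placeholders for whatever embedding model and LLM you use, and the in-memory list stands in for a real vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real chat-completion call."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    # "Vector database": here just a list of (document, embedding) pairs.
    return [(doc, embed(doc)) for doc in documents]

def rag_answer(query: str, index: list[tuple[str, np.ndarray]], k: int = 3) -> str:
    q = embed(query)  # embed the query with the same model as the documents
    top = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)[:k]
    context = "\n\n".join(doc for doc, _ in top)
    prompt = (
        "Answer the user query based on the documents below. "
        "If the answer is not in the documents, say 'I don't know'.\n\n"
        f"Documents:\n{context}\n\nQuery: {query}"
    )
    return call_llm(prompt)
```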

Any questions on RAGs? That's a simple, vanilla RAG. Yeah, yes?

>> Do the document embeddings still retain information about what's on what page and what paragraph?

>> The question is: do the document embeddings still retain the information about the location of the information within that document, especially in big documents? Um, great question. We'll get to it in a second, because you're right that the vanilla RAG might not do a good job with very large documents. So let's say, uh, you know, when you open a medication box and you have this gigantic white paper with all the information and it's very long, maybe a vanilla RAG would not cut it. So what people have figured out is a bunch of techniques to improve RAGs, and in fact, chunking is a great technique that is very popular. So you might actually store in the vector database the embedding of the full document, and on top of that, you will also store a chapter-level vector, you know. And when you retrieve, you retrieve the document and you retrieve the chapter, and that allows you to be more precise with the sourcing. It's one example.
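A minimal sketch of that idea: index each document at two granularities (whole document and chapter) so retrieval can point at the more precise chunk. Again, `embed` is a placeholder for a real embedding model, and the entry format is illustrative.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model."""
    raise NotImplementedError

def index_document(doc_id: str, chapters: dict[str, str]) -> list[dict]:
    """Store one embedding for the full document plus one per chapter."""
    entries = [{"doc": doc_id, "chapter": None,
                "vector": embed(" ".join(chapters.values()))}]
    for name, text in chapters.items():
        entries.append({"doc": doc_id, "chapter": name, "vector": embed(text)})
    return entries

def retrieve(query_vec: np.ndarray, entries: list[dict]) -> dict:
    # Return the closest entry; if it is a chapter, the answer can cite it directly.
    def score(entry):
        v = entry["vector"]
        return float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return max(entries, key=score)
```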

Um, another technique that's popular is HyDE, hypothetical document embeddings, where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the user query actually does not look like your documents. For example, the user query might be 'What are the side effects of drug X?' when actually, in the vector database, the vectors represent very long documents. So how do you guarantee that the query embedding is going to be close to the document embedding? What they do is they use the user query to generate a fake, hallucinated document. They embed that document and then they compare it to the vectors in the vector database. Does that make sense? So for example, the user says, 'What are the side effects of drug X?' This is given to another prompt that says, 'Based on this user query, generate a five-page report answering the user query.' It generates a potentially completely fake answer. You embed that, and it will likely be closer to the document that you're looking for. Yeah, it's one example of a RAG approach.
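In code, the HyDE trick is a small change to the retrieval step: generate a hypothetical answer document first, and embed that instead of the raw query. `call_llm` and `embed` are placeholders as before.

```python
import numpy as np

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model."""
    raise NotImplementedError

def hyde_query_vector(query: str) -> np.ndarray:
    # Ask the model to hallucinate a document that *would* answer the query...
    fake_doc = call_llm(
        "Based on this user query, write a short report answering it "
        "(it does not need to be factually correct):\n" + query
    )
    # ...and embed the fake document, which looks more like the stored documents
    # than the short query does.
    return embed(fake_doc)
```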

Again, the purpose of this lecture is not to go through all of these branches and explain every single method that has been discovered for RAGs, but I just wanted to show you how much research has been done between 2020 and 2025 in RAGs, and how many branches of research you now have that you can learn from. The survey paper is linked in the slides, by the way, and I'll share them after the lecture.

>> [laughter] >> Super.

So, uh, we've made some progress.

Hopefully now you feel like, if you were to start an LLM application, you know how to do better prompts, you know how to do chains, you know how to do fine-tuning, uh, you also know how to do retrieval, and you have the baggage of techniques that you can go and read about, find the codebase, pull the code, vibe code it, but you have the breadth. Now, uh, the next set of topics we're going to see is around the question of how we could extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step autonomous workflows. Yeah. And this is where we get into proper agentic AI.

So let's talk about agentic AI workflows: towards autonomous and specialized systems. Then we'll talk about evals. Then we'll see multi-agent systems. And we'll end with a few little thoughts on what's next in AI.

So, um, Andrew Ng actually coined the term agentic AI workflows. Um, and his reason was that a lot of companies say, let's say, agents, agents, agents everywhere. Agents everywhere. If you go and work at these companies, you would notice that they mean very different things by 'agent.' Some people actually have a prompt and they call it an agent. You know, other people have a very complex multi-agent system; they call it an agent. And so calling everything an agent doesn't do it justice. So Andrew says, let's call it agentic workflows, because in practice it's a bunch of prompts, with tools, with additional resources, API calls, that ultimately are put in a workflow, and you can call that workflow agentic. So it's all about the multi-step process, um, to complete the task.

Also, calling it an agentic workflow allows us not to mix it up with what I called an agent in the last lecture on reinforcement learning, because in RL, an agent has a very specific definition: it interacts with an environment, passes from one state to the other, has a reward and an observation. You remember that chart, right?

So, um, here's an example of how we move from a one-step prompt to a multi-step agentic workflow. Let's say a user queries a product, 'what is your refund policy,' on a chatbot. Um, and the response, using a RAG, says refunds are available within 30 days of purchase, and maybe the RAG can even link to the policy document. That's what we learned so far. Um, instead, an agentic workflow can function like this. The user says, 'Can I get a refund for my order?' And the response via the agentic workflow is: the agent retrieves the refund policy using a RAG. The agent then follows up with the user and says, 'Can you provide your order number?' Um, then the agent queries an API to check the order details, and finally it comes back to the user and confirms: 'Your order qualifies for a refund. The amount will be processed in 3 to 5 business days.' This is much more thoughtful than the first version, which is sort of vanilla, right?

So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one?
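A minimal sketch of that refund workflow as a small multi-step agent: the order API, its fields, and the helper names are hypothetical, and `retrieve_policy` stands in for the RAG step described above.

```python
def retrieve_policy(question: str) -> str:
    """Placeholder for the RAG step that fetches the refund policy."""
    return "Refunds are available within 30 days of purchase."

def ask_user(prompt: str) -> str:
    """Placeholder: in a chatbot this would wait for the user's next message."""
    return input(prompt + " ")

def check_order_api(order_id: str) -> dict:
    """Placeholder for a real order-management API call."""
    return {"order_id": order_id, "days_since_purchase": 12}

def refund_workflow(user_message: str) -> str:
    # Step 1: ground the agent in the policy document (RAG).
    policy = retrieve_policy(user_message)
    # Step 2: follow up with the user for missing information.
    order_id = ask_user("Can you provide your order number?")
    # Step 3: call an API (a tool) to check the order details.
    order = check_order_api(order_id)
    # Step 4: decide and respond.
    if order["days_since_purchase"] <= 30:
        return ("Your order qualifies for a refund. "
                "The amount will be processed in 3 to 5 business days.")
    return f"Sorry, this order is outside the refund window. Policy: {policy}"

# refund_workflow("Can I get a refund for my order?")
```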

Uh, there are plenty of specialized agentic workflows out there. You know, you've heard of them, and if you hang out in SF, you probably see a bunch of billboards: you know, the AI software engineer, the AI skills mentor you've interacted with in the class through Workera, the AI SDR, AI lawyers, the specialized AI cloud engineer. You know, it would be a stretch to say that everything works, but there's work being done towards that. Yeah,

I'm not personally a fan of putting a face behind those things. I think it's gimmicky, and I think a few years from now actually very few products will have a human face behind them. Uh, but it might be a marketing tactic from some startups.

It's more scary than it is engaging, frankly. Um, okay, I want to talk about the paradigm shift. Uh, that's especially useful, let's say, if you're a software engineer or you're planning to be a software engineer, because software engineering as a discipline is sort of shifting, or at least the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done.

So here's the paradigm shift between traditional software and agentic AI software. The first one is uh the way you handle data. Traditional software deals with structured data. You have JSONs, you have databases. They're passed in a very structured manner through a data engineering pipeline and then displayed on a certain interface. The user might fill a form that is then retrieved and stored in the database. All of that historically has been structured data.

Now more and more companies are handling uh free-form text, images, um, and all of that requires dynamic interpretation to transform an input into an output. Um, the software itself used to be deterministic. Now you have a lot of software that is fuzzy, and fuzzy software creates so many issues. I mean, imagine if you let your user ask anything on your website. The chances that it breaks are tremendous. The chances that you're attacked are tremendous. It's really, really complicated. It's more complicated than people make it seem uh on Twitter. [snorts] Um, fuzzy engineering is truly hard. Yeah, you might get hate as a company because one user did something that you authorized them to do that ended up breaking the database, and, you know, we've seen that with many companies in the last couple of years. So it takes a very specialized engineering mindset to do fuzzy engineering, but also to know when you need to be deterministic.

Um, the other thing I'd call out is that with agentic AI software, um, you sort of want to think about your software like a manager. So you're familiar with the monolith or, you know, microservices approaches in software, where you structure your software in different, you know, boxes that can talk to each other, and it allows teams to debug one section at a time. You know, now the equivalent with agentic AI is you think as a manager. So you think, okay, if I were to delegate my product to be done by a group of humans, what would those roles be? Would I have a graphic designer that then, you know, puts together a chart and then sends it to a marketing manager that converts it into a nice blog post, that then gives it to the performance marketing expert that then publishes the blog post and then optimizes and A/B tests, then to a data scientist that analyzes the data and then puts up hypotheses and validates them or invalidates them? That's how you would typically think if you're building agentic AI software, when actually the equivalent of that in traditional software might be completely different. It might be: we have a data engineer box right here that handles all our data engineering. And then here we have the UI/UX stuff, everything UI/UX related goes here. And, you know, companies might structure it in very different ways. And here's the business logic that we want to care about, and there are five engineers working on the business logic, let's say.

business logic. let's say okay [snorts and laughter] uh testing and debugging is also very different um and we'll talk about it in the next section.

Uh, the other thing that I feel matters is that with AI in engineering, the cost of experimentation is going down drastically, and so people, I feel, should be more comfortable throwing away code. You know, in traditional software engineering you probably don't throw away code a ton: you build code and it's solid and it's bulletproof, and then you update it over time. Whereas we've seen AI companies be more comfortable throwing away code. Yeah. Which has advantages in terms of the speed at which you move, but also disadvantages in terms of the quality of your software, in that it can break more.

No.

Okay. So anyway, just wanted to do an aside on the paradigm shift from deterministic to fuzzy engineering.

Um, oh, and actually I can give you an example from uh from Workera that we learned probably over the last 12 months. If you've used Workera, you might have seen that the interface sometimes asks you multiple choice questions, and sometimes it asks you multiple select, and sometimes it asks you drag and drop, ordering, matching, whatever, right? Those are examples of deterministic item types, meaning you answer the question on a multiple choice, there's one correct answer, it's fully deterministic. On the other hand, you sometimes have voice questions, uh where you go through a role play, or you have voice plus coding questions where your code is being read by the interface, or whatever. Those are fuzzy, meaning the scoring algorithm might actually make mistakes, and those mistakes might be costly. And so companies have to figure out a human-in-the-loop system, which you might have seen with the appeal feature at the end.

So at the end of the assessment, you have an appeal feature that allows you to say, I want to appeal the agent, because I want to challenge what the agent said on my answer, because I thought I did better than what the agent thought. And then you bring a human in the loop that can fix the agent, can tell the agent, actually you were too harsh on this person's answer.

Um, and you know, that's an example of a fuzzy engineered system that then adds a human in the loop to make it more aligned. And so if you're building a company, I would encourage you to think about what can I get done with determinism, and let's get that done. And then the fuzzy stuff: I want to do fuzzy because it allows more interaction, it allows more back and forth, but I need to put guardrails around it, and how am I going to design those guardrails, pretty much.

All right, here's another example, uh from enterprise workflows, um, which are likely to change due to agentic AI. Um, this is a paper from McKinsey, uh, I believe from last year, where they looked at a financial institution and they said that, you know, we observe that they often spend one to four weeks to create a credit risk memo, and here is the process. A relationship manager um gathers data from 15 or more sources on the borrower, loan type, other factors. Then the relationship manager and the credit analyst collaboratively analyze that data from these sources. Then the credit analyst typically spends, you know, 20 hours or more writing a memo and then goes back to the relationship manager. They give feedback, and then they go through this loop again and again, and uh it takes a long time to get a credit memo out.

And then they ran a research study where they changed the process. They said gen AI agents could actually cut time um by 20 to 60% on credit risk memos, and the process has changed: the relationship manager directly works with the gen AI agent system and provides the relevant materials it needs to produce the memo. The agent subdivides the project into tasks that are assigned to specialist agents, gathers and analyzes the data from multiple sources, drafts a memo. Then the relationship manager and the credit analyst sit down together, review the memo, give feedback to the agent, and within, you know, 20 to 60% less time are done. And so this is an example where you're actually not changing the human stakeholders, you're just changing the process and adding gen AI to uh reduce the time it takes to get a credit memo out.

It turns out that, um, imagine you're an enterprise and you have, you know, 100,000 employees, and there are a lot of enterprises with 100,000 employees out there. Uh, you are currently under crisis in terms of redesigning your workflows. You know, it turns out that if you actually pull the job descriptions from the HR system and you interpret them, and you also pull the business process workflows that you have encoded in your drive, you actually can find gains in multiple places, and uh in the next few years you're probably going to see workflows being more optimized to add gen AI.

Um, even if that happens, the hardest part is changing people. This is great in theory, but now let's try to fit that second workflow to 10,000 credit risk analysts and relationship managers. My guess is it will take years. It will take 10, 20 years to get this actually done at scale within an organization, because change is so hard, you know: so hard to rewire business workflows, job descriptions, incentivize people to do different and be different, and train them. And so, you know, this is what the world is going towards, but it's going to take a long time, I think.

Um, okay, then I want to talk about how the agent actually works, and what are the core components of an agent?

Um, imagine a travel booking AI agent. That's an easy example you've all thought about. I still haven't been able to get an agent to book a trip for me, or I was scared because it was going to book a very expensive or long trip. But in theory, um, you can have a travel booking agent that has prompts. So the prompts, we've seen them, we know the methods to optimize those prompts. That travel agent also has a context management system, which is essentially the memory of what it knows about the user. That context management system might include a core memory or working memory, and an archival memory. Okay.

What the difference is um within memory is that not every memory needs to be fast to access. Like, think about it: you're born on a product and the first question is, hi, what's your name, and I say my name is Kian. That's probably going to sit in the working memory, because every time the agent is going to talk to me, it's going to want to use my name, right? But then maybe the second question is, Kian, what's your birthday, and I give it my birthday. Does it need my birthday every day? Probably not. So it's probably going to park it in the long-term memory or the archival memory, and those memories are slower to access, they're farther down the stack. And, you know, that structure allows the agent to determine what's the working memory and what's the long-term memory, and that makes it easier for the agent to retrieve super fast. Because think about it: when you interact with GPT, you feel that it's very personal at times, right? You feel like it understands you. Um, imagine every time you call it, it has to read the memories, right? And that can be costly. It's a very burdensome cost because it happens every time you talk to it. So you want to be highly optimized with the working memory. You know, if it takes 3 seconds to look in the memory, every time you're going to talk to your LLM, it's going to take 3 seconds, which you don't want.
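Here is a minimal sketch of that working-memory vs. archival-memory split. The class and its placement policy are illustrative assumptions, not the API of any particular agent framework.

```python
# Toy tiered memory: a small working memory that is injected into every prompt,
# plus a larger archival memory that is only searched when something is missing.

class AgentMemory:
    def __init__(self, working_capacity: int = 5):
        self.working = {}    # fast path: prepended to every request
        self.archival = {}   # slow path: searched on demand
        self.capacity = working_capacity

    def remember(self, key: str, value: str, frequent: bool = False) -> None:
        """Facts needed on every turn (the user's name) go to working memory;
        rarely needed facts (a birthday) are parked in archival memory."""
        if frequent and len(self.working) < self.capacity:
            self.working[key] = value
        else:
            self.archival[key] = value

    def prompt_context(self) -> str:
        """Only working memory is paid for on every single LLM call."""
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def recall(self, key: str):
        """Fallback lookup that may be slower (e.g. a vector or database search)."""
        return self.working.get(key, self.archival.get(key))


memory = AgentMemory()
memory.remember("name", "Kian", frequent=True)   # used on every turn
memory.remember("birthday", "June 1st")          # rarely needed
print(memory.prompt_context())    # -> "name: Kian"
print(memory.recall("birthday"))  # -> "June 1st"
```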

So anyway, then you have the tools. The tools can include uh APIs like a flight search API, hotel booking API, car rental API, weather API, and then the payment processing API. And typically, you would want to tell your agent how that API works. It turns out that agents, or LLMs I should say, are very good at reading API documentation. So you give it the API documentation and it reads the JSON, and it reads what a GET request looks like and this is the format that I need to push, and then it pushes it in that format, let's say, and then it retrieves something.
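As a sketch of what "give the agent the API documentation as a tool" can look like, here is a hypothetical tool description in the JSON-schema style many LLM APIs use for tool calling; the field names and the stubbed API are illustrative, not any one vendor's exact format.

```python
# Hypothetical tool description for a flight-search API, in the JSON-schema
# style commonly used for LLM tool calling. The model reads this the way it
# would read API documentation and emits arguments in the declared format.

flight_search_tool = {
    "name": "search_flights",
    "description": "Search for flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin":      {"type": "string", "description": "IATA code, e.g. SFO"},
            "destination": {"type": "string", "description": "IATA code, e.g. CDG"},
            "date":        {"type": "string", "description": "Departure date, YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}


def search_flights(origin: str, destination: str, date: str) -> list:
    """Stub for the real API call (a GET request against the flight API in practice)."""
    return [{"flight": "XY123", "origin": origin, "destination": destination,
             "date": date, "price_usd": 640}]


# When the model emits a tool call such as
#   {"tool": "search_flights", "arguments": {"origin": "SFO", ...}}
# the application executes it and feeds the result back into the next prompt.
print(search_flights("SFO", "CDG", "2025-12-15"))
```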

Does that make sense? Those different components, you know. Anthropic also talks about resources. Resources are data that is sitting somewhere that you might let your agent read. For example, if you're building your startup, you have a CRM. A CRM has data in it and you want to do lookups in that data. You will probably give a lookup tool and you will give access to the resource, and it will do lookups whenever you want. Super fast.

Uh, this type of architecture can be built with different degrees of autonomy, from the least autonomous to the most autonomous, and I'll give you a few examples.

Less autonomous would be: you've hardcoded the steps. So let's say I tell the travel agent, first identify the intent, then um look up in the database the history of this customer with us and their preferences, then go to the write API, blah blah blah. I would hardcode the steps. Okay, that's the least autonomous.

The semi-autonomous is: I might hardcode the tools, but I'm not going to hardcode the steps. So I'm going to tell the agent, um, you act like a travel agent, and uh your task is to help the person book a trip, and these are the tools that you have accessible to yourself. And so I'm not hardcoding the steps. I'm just hardcoding the tools that you have access to.

Uh, the more autonomous is: the agent decides the steps and can create the tools. So that's where you might actually give access to a code editor to the agent, and the agent might actually be able to ping any API on the web, perform some web search. It might even be able to create some code to display data to the user. It might even be able to perform some calculations like, oh, I'm going to calculate the fastest route to get from San Francisco to New York, um, and which one might be the most appropriate for what the user is looking for. And then I want to calculate the distance between the airport and that hotel versus that hotel, and I'm going to write code to do that. So it's actually fully autonomous from that perspective.

Okay.
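A rough sketch of how those levels can differ in code. The "LLM" here returns canned responses and the tool is a stub, so this only illustrates who decides the steps; it is not a real agent framework.

```python
# Toy illustration of the autonomy levels above. The "LLM" returns canned
# strings so the control flow is visible; nothing here calls a real model or API.

def fake_llm(prompt: str) -> str:
    # Stand-in for a chat-model call.
    if "Tool result" in prompt:
        return "FINAL: Here are your flight options."
    return "CALL search_flights SFO CDG 2025-12-15"


def search_flights(origin: str, destination: str, date: str) -> list:
    return [{"flight": "XY123", "price_usd": 640}]


# 1) Least autonomous: the developer hardcodes the steps and their order.
def hardcoded_pipeline(user_request: str) -> str:
    flights = search_flights("SFO", "CDG", "2025-12-15")     # fixed step
    return fake_llm(f"Tool result: {flights}")               # fixed step


# 2) Semi-autonomous: the tools are fixed, but the model decides when to call them.
def tool_loop(user_request: str, max_steps: int = 5) -> str:
    context = f"You are a travel agent. Tool: search_flights. Task: {user_request}"
    for _ in range(max_steps):
        action = fake_llm(context)
        if action.startswith("FINAL:"):
            return action
        _, _, origin, dest, date = action.split()            # parse the tool call
        context += f"\nTool result: {search_flights(origin, dest, date)}"
    return "Stopped after max_steps."


# 3) Most autonomous: the model also plans its own steps and can write and run
#    new code (e.g. to compute airport-to-hotel distances), which is why it
#    needs sandboxing and guardrails.

print(hardcoded_pipeline("Plan a trip to Paris"))
print(tool_loop("Plan a trip to Paris"))
```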

So yeah, remember those keywords: memory, uh, prompts, tools, etc. Now, I presented the flight API, but it does not have to be an API. Uh, you probably have heard the term MCP, or Model Context Protocol, that was coined by Anthropic. I pasted the seminal article on MCP at the bottom of this slide. But let me explain in a nutshell why those things would differ. Um, in the API case, um, you would actually teach your LLM to ping an API. So you would say, this is how you ping this API and this is the data that it will send you back, and um you would have to do that in a one-off manner. So you would have to build, or sort of give, the API documentation of your flight API, your hotel booking API, your car rental API, and then you would give tools for your model to communicate with those APIs.

Um, it doesn't scale very well, um, you know, versus MCP. MCP, um, it's really about, you know, putting a system in the middle, sort of, that would make it simpler for your LLM to communicate with that endpoint. So for instance, um, you might, you know, have an MCP server and MCP client where you're trying to communicate with that travel database or the flight API through MCP, and uh your agent might actually just communicate with it and say, hey, what do you need in order to give me more flight information, and that server will respond with, I would like you to tell me where is the origin flight, where is the destination, and what you're looking for at a high level, this is my requirement. Okay, let me get back to you with my requirements. Oh, you forgot to tell me your budget, whatever. Oh, let me give you my budget, etc. Um, and uh, it's agent-to-agent communication, which allows more scalability. You don't need to hardcode everything. Companies have displayed their MCPs out there, and your agent can communicate with them and figure out how to get the data it needs. Does that make sense?

>> Yeah.

>> Oh, sorry, like, does it really help? If something only changes in the API rather than the agent, you can just rewrite that.

>> Yes, is it not just shifting the issue?

>> Uh, I think it is. Ultimately the question is, isn't it just shifting the issue, because anyway, if an API has to be updated, the MCP has to be updated, so what do you say, right? Yes, that's correct, but at least, um, it allows the agent to sort of go back and forth and figure out what the requirements are. But at the end of the day, ideally, if you're a startup, you have some documentation and you automatically have an agent or an LLM workflow that reads that documentation and updates the code accordingly, you know. But I agree, it's not something that is fully autonomous. Yeah. Yeah.

>> Why is that?

>> Which security specifically?

>> Yeah. So are there security issues with MCPs?

>> So think about it this way. MCPs, depending on the data that you get access to, might have different requirements, lower stakes or higher stakes. I'm not an expert on, you know, the full range. But it wouldn't surprise me that, um, you know, when you expose an MCP, a lot of MCPs have authentication. So, you know, you might actually need a code or a key to actually talk to it, just like you would with an API. Um,

yeah, but that's a good question. I'm,

you know, I'm not an expert at the security of these systems, but, you know, we can look into it.

Any other questions on what we've seen with the agentic workflows, APIs, tools, MCPs, memory? All of that is in progress. So even memory is not a solved problem by any means. It's pretty hard actually to get right. Yes,

>> You don't need it to access the API; technically you can engineer your way to achieving the same thing from the API.

>> Exactly. Exactly. Yeah. Is MCP about efficiency or accessing more data? It's about efficiency. It's like, you know, let's say you have a coding agent, um, and, you know, it has an MCP client and there are multiple MCP servers that are exposed out there. Um, that agent can communicate very efficiently with them and find what it needs. Um, and it's a more efficient process than actually displaying the APIs on that side and how to ping them and what the protocol is, you know. But, you know, it's not about the data that is being exposed, because ultimately you control the data that is being exposed. You probably, you know, depending on how the MCP is built, my guess is you probably expose yourself to other risks, because your MCP server can see pretty much any input from another LLM, and so it has to be robust.
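Before moving on, here is a purely conceptual toy of the idea from the MCP discussion above. This is not the actual MCP wire protocol or SDK, just an illustration of the difference between hardcoding how each API is called and letting the client discover what a server offers.

```python
# Conceptual toy only -- NOT the real MCP specification. The point is the shape:
# the client asks the server what tools exist and what inputs they need, instead
# of the developer hardcoding every endpoint and payload format.

class ToyToolServer:
    """Stands in for an MCP-style server wrapping, say, a flight API."""

    def list_tools(self) -> list:
        # The server advertises its tools and the inputs each one requires.
        return [{"name": "search_flights",
                 "required": ["origin", "destination", "date"]}]

    def call_tool(self, name: str, arguments: dict):
        if name == "search_flights":
            return [{"flight": "XY123", "price_usd": 640, **arguments}]
        raise ValueError(f"unknown tool: {name}")


class ToyAgentClient:
    """The agent side: discovers tools at runtime instead of hardcoding them."""

    def __init__(self, server: ToyToolServer):
        self.server = server
        self.tools = {t["name"]: t for t in server.list_tools()}

    def use(self, name: str, **arguments):
        missing = [k for k in self.tools[name]["required"] if k not in arguments]
        if missing:
            # In the back-and-forth described above, the agent would now ask
            # the user (or the LLM) to fill these in: "you forgot your budget..."
            raise ValueError(f"missing required arguments: {missing}")
        return self.server.call_tool(name, arguments)


client = ToyAgentClient(ToyToolServer())
print(client.use("search_flights", origin="SFO", destination="CDG", date="2025-12-15"))
```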

But yeah, super. Uh, so let's look at an example of a step-by-step workflow for the travel agent. So let's say the user says, I want to plan a trip um to Paris from December 15th to 20th, um with flights, hotels near the Eiffel Tower, and then an itinerary of must-visit places. That's the task to the travel agent. Step two, the agent plans the steps. So it says, I'm going to find flights: use the flight search API uh to get options for December 15th. Search hotels, generate recommendations for places to visit, validate preferences, um, budget, etc. Book the trip with the payment processing API.

That's just the planning, by the way. Step three, execute the plan: use your tools, combine the results, and then proactive user interaction and booking. It might make a first proposal to the user and ask the user to validate or invalidate, and then may repeat that planning and execution process. And then finally, it might actually update the memory. It might say, "Oh, I just learned through this interaction that the user only likes direct flights. Next time I'll only give direct flights." Or, I notice the user is fine with three-star or four-star hotels, and in fact they don't want to go above budget, or something like that.

Um, so that hopefully makes sense by now on, you know, how you might do that. My question for you is, uh, how would you know if this works, and if you had such a system running in production, how would you [clears throat] improve it?

Yeah.

>> So that's an example: let users rate their experience at the end. Uh, that would be an end-to-end test, right? You're looking at the user experience through the steps and saying how good was it, from one to five, let's say. Yeah, it's a good way. And then if you learn that a user says one, how do you improve the workflow?

>> Okay, so you would go down a tree and say, "Okay, you said one, uh, what was your issue?" And then the user says, uh, the prices were too high, let's say, and then you would go back and fix that specific uh tool or prompt. Yeah.

Okay. Any other ideas?

Yeah, good. So that's a good insight.

Separate the LLM-related stuff from the non-LLM-related stuff, the deterministic stuff. The deterministic stuff, you might be able to fix it, you know, more objectively essentially. Yeah, what else?

So, give me an example of an objective issue that you can notice and how you would fix it versus a subjective issue.

>> Yeah.

The flight which is cheaper, a directive, that's...

>> Okay, so let's say you say there's the same flight but one is cheaper than the other. Let's say it's objectively worse, and so you can capture that almost automatically, yeah.

>> So you could actually build evals that are objective, that are tracked across your users, and you might actually run an analysis after and see, for the objective stuff: we noticed that our agentic AI workflow is bad with pricing. It just doesn't read prices as well, because it always gives a more expensive option. Yeah, you're perfectly right. How about the subjective stuff?

>> Yeah.

>> Like do you choose a direct or indirect flight if the indirect is a little bit cheaper?

>> Yeah, good one. Do you choose a direct flight or an indirect flight if the indirect is cheaper but the direct is more comfortable? Um, yeah, that's a good one actually. Um, so how would you capture that information? Let's say this is used by thousands of users.

>> Um, could you feed something in about us?

>> Uh, could you feed something in? Yeah, I mean, could you feed something in uh about the user preferences? Well, you could build a data set that has some of that information. So you build 10 prompts where the user is asking specifically for direct, is saying I prefer direct flights because I care about my time, let's say. And then you look at the output, and you actually give the example of a good output, and you probably are able to capture the performance of your agentic workflow on this specific eval: does it prioritize correctly, does it understand the preference, is it price conscious essentially, and comfort conscious.
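A tiny sketch of what that hand-built eval set could look like. The prompts and the pass criterion are made up, and the substring check is only a placeholder; in practice you would grade each case against the known good output with a rubric or an LLM judge.

```python
# Tiny hand-built eval set for the "respects stated preferences" behavior.
# Prompts and the grading rule are illustrative; a real eval would compare
# against a reference good output or use an LLM judge with a rubric.

EVAL_SET = [
    {"prompt": "I prefer direct flights because I care about my time. SFO to JFK, Dec 15.",
     "must_mention": "direct"},
    {"prompt": "I'm on a tight budget, cheapest option from SFO to JFK on Dec 15 please.",
     "must_mention": "cheapest"},
    # ... roughly 10 cases covering price-conscious vs. comfort-conscious users
]


def run_eval(agent_fn) -> float:
    """Fraction of cases where the agent's answer reflects the stated preference."""
    passed = 0
    for case in EVAL_SET:
        answer = agent_fn(case["prompt"]).lower()
        passed += case["must_mention"] in answer
    return passed / len(EVAL_SET)


# Example with a trivial fake agent that always pushes direct flights:
print(run_eval(lambda prompt: "Here are three direct flights under $400."))  # -> 0.5
```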

What about the tone? Let's say the LLM right now is not very friendly. How would you notice that, and how would you fix it?

>> Yeah.

>> Test users, and, like, run prompts and see if there's something wrong with...

>> Okay, have a test user run the prompt and see if there's something wrong with that. Tell me about the last step. How would you notice that something is wrong?

>> So have a couple of evaluators evaluate the response and see if it's, like, satisfactory.

>> Yeah, I agree with your approach. Have LLM judges that evaluate the response against a certain rubric of what politeness looks like. So here, in this case, you could actually start uh with error analysis. So you start: you have a thousand users and, you know, you can pull up 20 user interactions and read through them, and you might notice at first sight the LLM seems to be very rude. You know, it's just super, super short in its answers and it's not very helpful. Um, you notice that with your error analysis, manually. Then you go to the next stage. You actually put evals behind it. You say, "I'm going to create a set of um LLM judges that are going to look at the user interaction and are going to rate how polite it is, and I'm going to give it a rubric. Then what I'm going to do is I'm going to flip my LLM. Instead of using GPT-4, I'm going to use Grok. And instead of using Grok, I'm going to use Llama. And then I'm going to run those three LLMs side by side, give it to my LLM judges, and then get my subjective score at the end to say, 'Oh, X model was more polite on average.'"

Yeah, perfectly right. That's an example of an eval that is very specific and allows you to choose between LLMs. You could actually do the same eval across LLMs, but fix the LLM and change the prompt. You actually, instead of saying act like a travel agent, you say act like a helpful travel agent, and then you see the influence of that word on your eval with the LLMs as judges. Does that make sense?
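As a rough sketch, here is what that LLM-as-judge loop can look like. The judge call is stubbed, and the rubric, scale, and model names are placeholders; in practice `call_llm` would be replaced by whatever chat-completion client you actually use.

```python
# Sketch of an LLM-as-judge politeness eval across candidate models.
# `call_llm` is a stub standing in for a real chat-completion client;
# the rubric wording and the 1-5 scale are illustrative.

POLITENESS_RUBRIC = ("Rate the assistant reply from 1 (rude, curt) to 5 "
                     "(warm, helpful). Answer with a single integer.")


def call_llm(model: str, prompt: str) -> str:
    # Stub: a real implementation would call the provider's API for `model`.
    return "4"


def judge_politeness(reply: str, judge_model: str = "some-judge-model") -> int:
    prompt = f"{POLITENESS_RUBRIC}\n\nAssistant reply:\n{reply}"
    return int(call_llm(judge_model, prompt))


def compare_models(replies_by_model: dict) -> dict:
    """Average politeness per candidate model (e.g. GPT-4 vs Grok vs Llama),
    judged against the same rubric so the comparison is apples to apples."""
    return {model: sum(judge_politeness(r) for r in replies) / len(replies)
            for model, replies in replies_by_model.items()}


print(compare_models({
    "model-a": ["No.", "Here is your flight. Anything else I can help with?"],
    "model-b": ["Happy to help! Your refund is on its way, have a great day."],
}))
```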

sense? Okay. Uh super. So let's let's move forward and do a case study with eval and then we're we're almost done um for today. Uh let's say your product

for today. Uh let's say your product managers manager asks you to build an AI agent for customer support.

Okay, where do you start? And here is an example of the user prompt: I need to change my shipping address for order blah blah blah, I moved to a new address. So where do you start if I'm giving you that project, you know?

>> Yes.

>> So, do some research, see benchmarks and how different models perform at customer support, and then pick a model. That's what you mean. Yeah. It's true, you could do that. What else could you do? Yeah.

>> Okay. Yeah, I like that. Try to decompose the different tasks that it will need and try to guess which ones will be more of a struggle, which ones should be fuzzy, which ones should be deterministic. Yeah, you're right.

>> To sit down for like a day or two with a customer and see how they do the task, probably.

>> Yeah, similar to what you said. That's what I would recommend as well. You say, I would sit down with a customer support agent for a day or two and I would decompose the task they're going through. I will ask them where they struggle, how much time it takes. Yes, that's usually where you want to start: with task decomposition. So let's say we've done that work and we have this list. I'm simplifying, but the human customer support agent typically would extract info, then look up in the database to retrieve the customer record, then check the policy, you know, are we allowed to update the address or is it a fixed data point, um, and then draft the response email and send the email. Okay, so we've decomposed that task. Once you've decomposed that task, ask:

how do you design your agentic workflow?

>> Yes, for each step, which method we're going to use, or whatever; in each task, what are you going to use for resources.

>> Exactly. So, to repeat: you're going to look at the decomposition of tasks, get an instinct of what's fuzzy and what's deterministic, and then determine which line is going to be an LLM one-shot, which one will require maybe a RAG, which one will require a tool, which one will require memory, and so on. So you will start designing that map.

Completely right. That's also what I would recommend. You might actually uh draft it and say, okay, I take the user prompt, um, and the first step of my task decomposition was extract information. That seems to be a vanilla LLM: you can guess that the vanilla LLM would probably be good enough at extracting that the user wants to change their address, and this is the order number, and this is the new address. You probably don't need too much technology there other than the LLM. Um, the next step, it feels like you need a tool, because you're actually going to have to look up in the database and also update the address. So that might be a tool, and you might have to build a custom tool for the LLM, to say let me connect you to that database, or let me give you access to that resource with an MCP. Yeah. After that, you probably need an LLM again to draft the email, but you would probably pass a confirmation. You pass a confirmation that your address has been updated from X to Y, and then the LLM will draft an answer. And of course, just to not forget, you might need a tool to send the email. You know, you might actually need to, you know, post something for the email to actually go out, and then you'll get the output. Does that make sense? So, exactly what you described. [sighs]
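A compact sketch of that chain: LLM steps for extraction and drafting, deterministic tools for the database update and the email send. All the functions are stubbed stand-ins so the chain runs end to end; the order ID, address, and email are made up.

```python
# Sketch of the customer-support chain designed above. The two "LLM" steps and
# the two deterministic tools are stubbed so the whole chain runs end to end.

CUSTOMERS = {"ORD-42": {"email": "user@example.com", "address": "1 Old Road"}}


def extract_request(message: str) -> dict:
    # Step 1 (an LLM call in practice): pull out the order ID and new address.
    return {"order_id": "ORD-42", "new_address": "123 New St, Palo Alto"}


def update_address(order_id: str, new_address: str) -> str:
    # Step 2 (deterministic tool): look up the record and update it, policy permitting.
    old = CUSTOMERS[order_id]["address"]
    CUSTOMERS[order_id]["address"] = new_address
    return f"shipping address updated from '{old}' to '{new_address}'"


def draft_email(confirmation: str) -> str:
    # Step 3 (an LLM call in practice): draft a polite reply around the confirmation.
    return f"Hi! Good news: your {confirmation}. Let us know if we can help with anything else."


def send_email(to: str, body: str) -> None:
    # Step 4 (deterministic tool): hand off to the email system.
    print(f"EMAIL to {to}: {body}")


request = extract_request("I need to change my shipping address for order ORD-42.")
confirmation = update_address(request["order_id"], request["new_address"])
send_email(CUSTOMERS[request["order_id"]]["email"], draft_email(confirmation))
```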

Okay, now moving to the next step. Once we have decomposed our tasks, then we have designed an agentic workflow around it. It took us five minutes. In practice, it would take you more if you're building your startup on that. You want to make sure your task decomposition is accurate, your design is accurate here. And then you can have a lot of work done on every tool and optimize it for latency and cost. But let's say now we want to know if it works, you know, and I'm going to assume that you have LLM traces. LLM

traces are very important. Actually, if

you're interviewing with an AI startup, I would recommend that in the interview process you ask them, do you have LLM traces? Because if they don't have LLM traces, it is pretty hard to debug an LLM system, you know, because you don't have visibility on the chain of complex prompts that were called and where the bug is, and, you know, so it's a basic sort of part of an AI startup stack to have LLM traces. [laughter] So let's assume you have traces. How would you know if your system works? You know, I'm going to summarize some of the things I heard earlier. Um, you gave us an example of an end-to-end metric: you look at the user satisfaction at the end. Um, you can also do a component-based approach where you actually will look at the tool, the

database updates, and you will manually do an error analysis and see, oh, the tool actually always forgets to update the email, it just fails at writing, you know, and I'm going to fix that. This is deterministic, pretty much. Um, or, um, you know, when it tries to send the email and ping the system that is supposed to send the email, it doesn't send it in the right format, and so it bugs at that point. Again, you could fix that. Um, the draft of the email: the LLM doesn't do a great job, it's not very polite at drafting the email, you know. So you could look at it component by component, and it's actually easier to debug than to look at it end to end. You'll probably

do a mix of both. Um, another way to look at it is what is objective versus what is subjective. So, for example, an objective example would be: the LLM um extracted the wrong order ID. You know, the user said my order ID is X, and the LLM, when it actually looked up in the database, it used the wrong order ID. This is objectively wrong. You can actually write Python code that checks just the alignment between what the user mentioned and what was actually passed to the database for the lookup.
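For instance, a check along these lines (the trace format and field names are invented for the sketch; in practice these records would come straight out of your LLM tracing system):

```python
# Objective eval over LLM traces: did the database lookup use the same order ID
# the user mentioned? The trace format and field names are made up for the sketch.

import re


def mentioned_order_id(text: str):
    match = re.search(r"ORD-\d+", text)
    return match.group(0) if match else None


def order_id_aligned(trace: dict) -> bool:
    """True if the lookup used exactly the order ID the user gave."""
    return mentioned_order_id(trace["user_message"]) == trace["db_lookup"]["order_id"]


traces = [
    {"user_message": "Please change the address on order ORD-42.",
     "db_lookup": {"order_id": "ORD-42"}},
    {"user_message": "I want a refund for order ORD-77.",
     "db_lookup": {"order_id": "ORD-17"}},   # objectively wrong lookup
]

accuracy = sum(order_id_aligned(t) for t in traces) / len(traces)
print(f"order-ID alignment across traces: {accuracy:.0%}")   # -> 50%
```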

You also have subjective stuff which we talked about where you probably want to do either human rating or LLM as judges.

It's very relevant for subjective evals.

[snorts] Um, and finally you will find yourself having quantitative evals and more qualitative evals. So quantitative will be percentage of successful address updates. The latency: you could actually track the latency component-based and see which one is the slowest. Let's say sending the email takes 5 seconds, you know, it's too long, let's say. You would notice that component-based or in the full workflow, and then you will decide where am I optimizing my latency and how am I going to do that. And then finally, qualitative: you might actually do some error analysis and look at, you know, where are the hallucinations, um, where are the tone mismatches, uh, you know, are the users confused and by what they're confused. You know, that would be more qualitative, and typically it would take more, you know, white-glove approaches to do that. Okay, so here's what it could look like. I gave you some examples, but you would build evals to determine objectively, subjectively, component-based, end-to-end based, and then quantitatively and qualitatively, where is your LLM failing and where is it doing well.

Does that give you a sense of the type of stuff you could do to fix and improve that agentic workflow?

Super. Well, that was our case study on evals. We're not going to delve deeper into it, but hopefully it gave you a sense of the type of stuff you can do with LLM judges, with, you know, objective, subjective, component-based, end-to-end, etc. Um, last section: multi-agent workflows.

So you might ask, uh, hey, uh, why do we need a multi-agent workflow when the workflow already has multiple steps, already calls the LLM multiple times, already gives it tools? Why do we need multiple agents? And so many people are talking about multi-agent systems online. It's not even a new thing, frankly; I mean, multi-agent systems have been around for a long time. The main advantage of a multi-agent system is going to be parallelism. It's like, is there something that I wish I could run in parallel, sort of independently, but maybe there are some syncs in the middle? That's where you want to put a multi-agent system: it's when it's parallel. The other advantage that some companies um have with multi-agent systems is an agent can be reused. So let's say in a company you have an agent that's been built for design. That agent can be used in the marketing team and it can be used in the product team, you know, and so now you're optimizing an agent which has multiple stakeholders that can communicate with it and benefit from its uh performance.

Um actually I'm going to ask you a question and take a few uh maybe a minute to think about it. Let's say you were uh building smart home automation

for your apartment or your home. What

agents would you want to build? Yeah,

Write it down, and then I'm going to ask you in a minute to share some of the agents that you would build. Also, think about how you would put a hierarchy between these agents, or how you would organize them, or who should communicate with whom. Okay. Okay. Take a minute for that. Be creative also, because I'm gonna ask all of your agents, and maybe you have an agent that nobody has thought of.

Okay, let's get started. Who wants to give me a set of agents that you would want for your smart home? Yes.

>> So, uh, the first is like a set of agents that track my movements in the house and log information about my house. Another agent receives that information and adjusts the room temperature, and another...

>> Okay, so let me repeat. You have four agents, I think, roughly: one that tracks biometrics, like where are you in the home, where you're moving, how you're moving, things like that. That sort of knows your location. The second one um determines the temperature of the rooms and has the ability to change it. The third one tracks energy efficiency and might give feedback on energy usage, and might, I don't know, maybe it has control over the temperature as well, I don't know actually, or the gas or the water, might cut your water at some point. And then you have an orchestrator

agent. What exactly is the orchestrator agent doing?

>> Instructions.

>> Okay. Passes instructions. So is that the agent that communicates mainly with the user?

>> Yep.

>> Okay. So if I'm coming back home and I'm saying I want the oven to be preheated, I communicate with the orchestrator and then it would funnel to another agent. Okay, sounds good. Yeah, so that's an example of, I want to say, a hierarchical um multi-agent system.

Um, what else? Any other ideas? What would you add to that? Yeah.

>> A minimal action that you can do. Imagine entering a room, or just entering a computer, or just opening, the minimum kind of action. You have like a lot of agents per [clears throat] and then depending on who it is and all the context you have.

>> Oh, I like that, that's a really good one. So let me summarize: you have a security agent that determines if you can enter or not, and when you enter, it understands who you are, and then it gives you certain sets of permissions that might be different depending on if you're a parent or, you know, you might have access to certain cars and not others, or the kid cannot open the fridge, or, I don't know, something like that. Yeah. Okay, I like that. That's a good one. Yeah. And it does feel like it's a complex enough workflow where you want a specific workflow tied to that. I agree. [snorts]

>> What else?

>> Yes. Continuing on the ambient stuff, you can get more complicated. So, energy savings with what's kept open as well; from the grocery store, understanding what's in your fridge or not, who to reach out to.

>> Well, that's really good actually. So you mentioned two of them. One is maybe an agent that has access to external APIs, that can understand the weather out there, the wind, the sun, and then has control over certain devices at home, temperature, blinds, things like that, and also understands your preferences for it. That does feel like it's a good use case, because you could give that to the orchestrator, but it might lose itself because it's doing too much. And also these problems are tied together, like the temperature outdoors from the weather API might influence the temperature inside, how you want it, etc. And then the second one, which I also like, is you might have an agent that looks at your fridge and what's inside, and it might actually have access to the camera in the fridge, for example, um, and know your preferences, and also have access to the e-commerce API to order Amazon groceries ahead of time. Um, I agree, and maybe the orchestrator will be the communication line with the user, but it might communicate with that agent um in order to get it done. Uh, yeah, I like those. So those are all uh really good examples here.

Here is the list I had um up there. So: climate control, lighting, security, energy management, entertainment, a notification agent with alerts about the system updates, energy saving, and orchestrator. So all of them

you mentioned actually. Um, and then we didn't talk about the different interaction patterns, but you do have different ways to organize a multi-agent system: flat, hierarchical. It sounds like this would be hierarchical. I agree. And the reason is UI/UX: I would rather have to only talk to the orchestrator rather than have to go to a specialized application to do something. It feels like the orchestrator could be responsible for that. And so I agree, I would probably go for a hierarchical setup here. But maybe you might also add some connections between other agents, like in the flat system where it's all-to-all, for example, uh, with climate control and energy, if you want to connect those two. You might actually allow them to speak with each other. When you allow agents to speak with each other, it is basically an MCP protocol, by the way. So you treat the agent like a tool, exactly like a tool. Here is how you interact with this agent. Here is what it can tell you. Here is what it needs from you, essentially.
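A toy sketch of that hierarchical setup: the orchestrator is the only agent the user talks to, and it routes to specialist agents, which it treats like tools. The agent names and the keyword routing are purely illustrative; a real orchestrator would route with an LLM, and each specialist would have its own prompts, tools, and memory.

```python
# Toy hierarchical multi-agent setup for the smart-home example. Specialists are
# plain functions and routing is keyword-based so the structure is visible; in
# practice the orchestrator would use an LLM to decide which agent to delegate to.

def climate_agent(request: str) -> str:
    return "Setting the living-room temperature to 21C."

def security_agent(request: str) -> str:
    return "Front door locked and cameras armed."

def grocery_agent(request: str) -> str:
    return "Milk is low; added it to the shopping list."

SPECIALISTS = {
    "temperature": climate_agent, "heat": climate_agent,
    "lock": security_agent, "door": security_agent,
    "fridge": grocery_agent, "groceries": grocery_agent,
}


def orchestrator(request: str) -> str:
    """Single entry point for the user; delegates to the right specialist agent."""
    for keyword, agent in SPECIALISTS.items():
        if keyword in request.lower():
            return agent(request)
    return "Sorry, I don't have an agent for that yet."


print(orchestrator("It's cold, can you turn up the heat?"))
print(orchestrator("Did you lock the front door?"))
```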

Okay, super. And then, without going into the details, there are advantages to multi-agent workflows versus, you know, single agents, such as debugging: it's easier to debug a specialized agent than to debug an entire system. Parallelization as well: it's easier to have things run in parallel, um, and you can gain time. Um, you know, there are some advantages to doing that. And I leave you with this slide if you want to go deeper.

Super. So we've learned so many techniques to optimize LLMs, from prompts to chains to fine-tuning, retrieval, um, and to multi-agent systems as well. And then, just to end on um a couple of trends I want you to watch. Uh,

I think next week is Thanksgiving. Is

that it? Is Thanksgiving break? No, the

week after. Okay. Well, ahead of the Thanksgiving break, so if you're traveling, you can think about these things. Um, on what's next in AI, I wanted to call out a couple of trends.

Um, so Ilya Sutskever, one of the OGs of, uh, you know, LLMs, um, and, you know, an OpenAI co-founder, um, raised that question about are we plateauing or not? You know, the question of are we going to see, in the coming years, LLMs sort of not improve as fast as we've seen in the past. It's been the feeling in the community, probably, that, you know, the last version of GPT um did not bring the level of performance that people were expecting, although it did make it so much easier to use for consumers, because you don't need to interact with different models, it's all under the same hood. So it seems that it's progressing, um, but the plateau

is unclear. The way I would think about it is, um, the LLM scaling laws tell us that if we continue to improve compute and energy, then LLMs should continue to improve, but at some point it's going to plateau. So what's going to take us to the next step? It's probably architecture search. A lot of LLMs, even if we don't understand what's under the hood, are probably transformer-based today, but we know that the human brain does not operate the same way. There are just certain things that we do that are much more efficient, much faster, we don't need as much data. So theoretically we have so much to learn in terms of architecture search that we haven't figured out. It's not a surprise that you see those labs hire so many engineers, because it is possible that in the next few years you're going to have thousands of engineers trying to figure out the different engineering hacks and tactics and architecture searches that are going to lead to better models, and one of them suddenly will find the next transformer, and it will reduce by 10x the need for compute and the need for

energy. Um, you know, it's sort of, if you've read Isaac Asimov's, uh, Foundation series, um, individuals can have an amazing impact on the future because of their decisions. You know, whoever discovered transformers had a tremendous impact on the direction of AI. I think we're going to see more of that in the coming years, where some group of researchers that is iterating fast might discover certain things that would suddenly unlock that plateau and take us to the next step, and it's going to continue to improve like that. And so it doesn't surprise me that there are so many companies hiring engineers right now to figure out those hacks and those

techniques. Um the other set of gains that we might see is from multimodality.

So the way to think about it is: we've had LLMs first text-based, and then we've added images, and today, you know, models are very good at images.

They're very good at text. Turns out

that being good at images and being good at text makes the whole model better. So

the fact that you're good at understanding a cat image makes you better at text as well for a cat. Now

you add another modality like audio or video, the whole system gets better. So

you're better at writing about a cat if you know what a cat sounds like if you can look at a cat on an image as well.

Does that make sense? So we see gains that are translated from one modality to another. And that might lead to the pinnacle of robotics, where all these modalities come together and suddenly the robot is better at running away from a cat, because it understands what a cat is, what it sounds like, what it looks like, etc. Does that make sense? Um, the

other one is the multiple methods working in harmony. In the Tuesday lectures, we've seen supervised learning, unsupervised learning, self-supervised learning, reinforcement learning, prompt engineering, RAGs, etc. If you look at um how babies learn, um, it is probably a mix of those different approaches. Like, a baby um might have some meta-learning, meaning, you know, it has some survival instinct that is encoded in the DNA most likely, um, and that's like the baby's pre-training, if you will. On top of that, uh, the mom or the dad um is pointing at stuff and saying bad, good, bad, good: supervised learning. On top of that, the baby's falling on the ground and getting hurt, and that's a reward signal for reinforcement learning. On top of that, the baby's observing other people doing stuff, or other babies, you know, doing stuff: unsupervised learning. You see what I mean? We're probably a mix of all these methods, and um I think that's where the trend is going, where those methods that you've seen in CS230 come together in order to build an AI system that learns fast, is low latency, is cheap, energy efficient, and makes the most out of all of these methods.

Um, finally, and this is especially true at Stanford, um, you have research going on that you would consider human-centric and some research that is non-human-centric. By human-centric, I should say approaches that are modeled after the brain, and approaches that are not modeled after humans, because it turns out that the human body is very limiting. And so if you actually only do research on what the human brain looks like, you're probably missing out on compute and energy and stuff like that that you can optimize even beyond neuronal connections in the brain. But you still can learn a lot from the human brain. And that's why there are professors that are running labs right now that try to understand how backpropagation works for humans. And in fact, it's probably that we don't have backpropagation. We don't use backpropagation; we only do forward propagation, let's say. So this type of stuff is interesting research that I would encourage you to read if you're curious about the direction of AI.

Um, and then finally, one thing that's going to be pretty clear, I call it out all the time, but it's the velocity at which things are moving. You're noticing that part of the reason we're giving you a breadth in CS230 is because these methods are changing so fast. So I don't want to bother going and teaching you the number 17 method on RAG that optimizes the RAG, because in two years you're not going to need it, you know. So I would rather you think about what is the breadth of things you want to understand, and when you need it, you are sprinting and learning the exact thing you need faster, because the half-life of skills is so low, you know. You want to come out of the class with a good breadth and then have the ability to go deep whenever you need after the class, and so that's sort of how this class is designed as well. Um, yeah, that's it for today. So thank you, um, thank you for participating.
