
Build Hour: AgentKit

By OpenAI

Summary

## Key takeaways

- **AgentKit simplifies complex agent building**: Previously, building agents involved complex coding for orchestration, tool connection, and UI development, often leading to breaking changes and manual prompt optimization. AgentKit introduces a visual builder, versioned workflows, a connector registry for secure tool integration, built-in evals, and automated prompt optimization. [01:08], [01:51]
- **Visual Agent Builder for custom workflows**: Agent Builder allows users to visually design agent workflows, connect tools, automate and optimize prompts, and add guardrails. Workflows can be hosted independently or within OpenAI, enabling integration beyond traditional chat applications via webhooks. [04:50], [05:26]
- **ChatKit for embeddable, customizable agent UIs**: ChatKit provides a customizable chat UI that can be hosted independently or with OpenAI, allowing developers to match brand guidelines, including color schemes, font families, and starter prompts. It enables rich widget outputs, moving beyond traditional text or markdown responses. [21:27], [22:31]
- **Evals for testing and optimizing agent performance**: The Evals platform integrates with Agent Builder to test individual nodes and assess end-to-end workflow performance. It supports creating datasets, trace grading, and automated prompt optimization, enabling developers to identify issues and improve agent reliability at scale. [24:53], [26:50]
- **Real-world adoption by major companies**: Companies like Ramp, Rippling, HubSpot, Carlyle, and Bain are using AgentKit. Ramp built a procurement agent prototype 70% faster, HubSpot enhanced their AI assistant, and Carlyle and Bain saw a 25% efficiency gain in their eval datasets. [37:38], [39:00]

Topics Covered

  • Agent Kit simplifies complex AI orchestration and deployment.
  • Why are evaluations crucial for trusting AI agents at scale?
  • Automated prompt optimization slashes development time for agents.
  • Quality of evaluation data trumps mere quantity.
  • Optimize agent performance with specialized sub-agent architectures.

Full Transcript

All right. Hi everyone. Welcome to

OpenAI Build Hours. I'm Tasha, product

marketing manager on the platform team.

Really excited to introduce our speakers

for today. So, myself kicking things

off, uh, Summer from our applied AI team

on the startup side, and Henry who runs

product for the platform team.

Awesome. So, as a reminder, our goal

here with build hours is to empower you

builders with the best practices, tools,

and AI expertise to scale your company,

your products, and your vision with

OpenAI's APIs and models. Um, you can

see the schedule down here at the link

below openai.com/buildours.

Awesome. Our agenda for today. So, I

will quickly go over agent kit, which we

launched just a couple weeks ago at

Devday. Um, then hand it off to Samarth

for an agent kit demo. Henry will then

run us through evals, which really help

bring those agents to life and let us

trust them at scale. Um if we have time

we'll go over a couple real world

examples and then definitely leaving

time for Q&A at the end. So feel free to

add your questions as we go through.

Awesome. So let's do a quick snapshot of what building with agents has been like for the last several months or even year. Uh, it used to be super

complex. Uh orchestration was hard. You

had to write it all in code. Uh if you

wanted to update the version, it would

sometimes introduce breaking changes. Uh

if you wanted to connect tools securely,

you had to write custom code to do so.

Um and then running evals required you

to manually extract data from one system

into a separate eval platform, daisy

chaining all of these separate systems

together to make sure you could actually

trust those agents at scale. Um prompt

optimization was slow uh and manual. And

then on top of all of that, you had to

build UI to bring those agents to life.

And that takes another several weeks or

months to build. So basically it was in

massive need of a huge upgrade which is

what we're doing here. Um so with agent

kit we hope that we've made some

incremental improvements to how you can

build agents. Um now workflows can be

built visually with a visual workflow

builder. It's versioned, so no breaking changes are introduced.

Um there's an admin center uh called the

connector registry where you can safely

uh connect data and tools and we have

built-in evals into the platform that

even includes third-party model support.

Um as Samarth will show us in a bit,

there's an automated prompt optimization

tool as well uh which makes it really

easy to perfect those prompts uh

automatically rather than trial and

error yourself manually. Um and then

finally we have chatkit which is a

customizable UI.

Cool. So bringing it all together, this

is sort of the agent kit tech stack. At

the bottom we have agent builder uh

where you can choose which models to

deploy the agents with. Connect tools um

write and automate and optimize those

prompts. Add guardrail so that the

agents perform as you would expect them

to even when they get um unexpected

queries. Uh, deploy that to chatkit, which you can host yourself or with OpenAI,

and then optimize those agents at scale

in the real world with real world data

from real humans by observing uh and

optimizing how they perform uh through

our eval platform.

Cool. So we're already seeing a bunch of

startups and Fortune 500s and everything

in between using agents to build a

breadth of use cases. Some of the more

popular ones that we're seeing are

things like customer support agents to

triage and answer chatbased customer

support tickets, sales assistants

similar to the one that we'll actually

demo today, um internal productivity

tools like the ones that we use at

OpenAI to help teams across the board um

work smarter and faster and reduce

duplicate work. Uh, knowledge assistants,

and even doing research like document

research or general research. And the

screenshot here on the right is just a

few uh templates that we have in the

agent builder that show some of the

major use cases that we're already

powering.

Okay, so um let's make this all real

with a real world example. Uh, a common

challenge that businesses face is

driving and increasing revenue. Let's

say that your sales team is too busy

outbounding to prospects, building

relationships, meeting with customers.

We want to build a go-to market

assistant to help save sales time and

increase revenue. And with that, I will

kick it over to Samarth to show us how to

do it.

>> Great. One of the biggest questions that

we get at OpenAI is how do we use OpenAI

within OpenAI? Um, and hopefully this

kind of rolls the curtain a little back

so you can take a peek at how we

actually build some of our go-to-market assistants. Um, we'll cover a few

different topics today like uh maybe the

agents that are capable of uh doing data

analysis, lead qualification as well as

outbound email generation. Um so what

I'll do here is move over and share.

Great. So we're actually on our Atlas

browser. Um feel free to download that.

I had a fantastic time using it these

past few weeks and um I think it saved

me hours if not uh you know days worth

of time doing some things sometimes and

uh um I'm a big fan. Uh okay so we'll

get started and when we get into the

agent builder platform the first thing

that we really see um is a start node

and the agent node. Um you can think of

the agent as the atomic particle within

you know the workflow that you go in and

construct and behind it is the agents

SDK which actually powers the entirety

of agent builder. Whenever we build

these agent builder workflows, um it

doesn't have to live within the OpenAI

platform. Uh you can copy this code,

host this on your own, and you might

want to even, you know, take this beyond

traditional chat applications and do

things uh like being able to trigger

these via web hooks. So for this

example, um we have three agents in mind

that we're looking to build out. the

data analysis one where we'll pull from Databricks, a lead qualification one

where we'll scour the internet for

additional details and outbound email

generation um where we want to maybe

qualify an email with things on a

product or a marketing campaign that

we're launching. Sound good?

>> That sounds great. I'm on board.

>> Okay, great. So, we'll get started by

building our first agent here. Since we

have uh three different types of use

cases in mind for what we're actually

trying to build, um what we want to do

is use a very traditional architectural

pattern using a triage agent. So the way

that we think about this is that agents

are really good at doing specialized

tasks. So if we break down this question

to um you know the proper sub agent, we

might be able to get better responses.

So for this first agent, let's call this

a question classifier.

Typing is hard.

I'll copy over the prompt that we've put in here. I'll just take a quick peek

at what this looks like. And really what

we're doing here is asking the model to

qualify or classify a question as either

a qualification, a data, or an email

type of question. Really, the idea here

is that we can then route this query

depending on what the model selected as

what its output should be. Um, and

rather than having a traditional text

output, what we want to do here is

actually force the model to output in a

schema that we recognize and can use for

the rest of the workflow. So let's call the variable that the model will output "category" and select the type as enum. What this

means is the model will only output a

selection uh from the list that we

provide here. So um from my prompt I had

the email agent, the data agent and the

qualification agent.
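To make that classifier pattern concrete outside the visual builder, here is a minimal sketch of the same triage idea using the Python Agents SDK, assuming its `Agent`/`Runner` interface and Pydantic-based structured outputs; the prompt text and category names are illustrative placeholders, not the exact ones from the demo.

```python
# Minimal sketch of the triage/classifier node, assuming the Python Agents SDK.
# pip install openai-agents pydantic
from enum import Enum

from pydantic import BaseModel
from agents import Agent, Runner


class Category(str, Enum):
    data = "data"
    qualification = "qualification"
    email = "email"


class Classification(BaseModel):
    category: Category  # the enum forces the model to pick only from this list


# Placeholder prompt; the real classifier prompt is not shown in the transcript.
question_classifier = Agent(
    name="Question classifier",
    instructions=(
        "Classify the user's question as 'data', 'qualification', or 'email' "
        "and return only the category."
    ),
    output_type=Classification,
)

result = Runner.run_sync(question_classifier, "Show me the top 10 accounts")
category = result.final_output.category
print(category)  # e.g. Category.data; downstream routing can branch on this value
```

In the visual builder, the set-state node and the conditional branch play the role of the if/else you would otherwise write around `category` in code.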

>> Great.

>> And real quick uh how did you write the

prompt? Did you write that all yourself

or I know the importance of prompt and

steering the agent. How did you come up

with that? I think writing prompts is

one of the most cumbersome things that

we can do. Um I there's a lot of time

spent spinning wheels on what actually

matters when you're capturing that

initial prompt. And I think um one of

the most key ways that I write prompts

myself is to use ChatGPT and GPT-5 to be able to create my v0 of the prompts.

Um, within agent builder itself, you can

actually go in and, uh, edit the prompt

or create prompts from scratch to be

able to use as, uh, the bare bones for

what you might, you know, spin on in the

future for your agent workflows. Um, for

now, we'll leave it as the one that, uh,

we pasted in here, but we'll in the rest

of this workflow, we'll take a peek at

what using that actually looks like.

Great. Um, so now that we've actually

got the output, um, agent builder

actually allows us to make this very

stateful. So for example I have a um a

set state icon here. Sorry just

again drag and dropping also can be

difficult. Um so what we want to do here

is take that output value from the

previous stage and assign that to a new

variable such that the rest of this

workflow is able to reference it. Um

we'll call this category again. Um and

assign no default value for now. Um,

using that same value, I can now

conditionally branch to either the data

analysis agent or the rest of my

workflow to handle maybe additional

steps I want to do prior to executing

the email um or the data qualification

use case or the customer qualification

use case. Um, what we'll do here is drag

this agent in

and we'll set that the we'll set the

conditional statement here to say um if

the state category is equal to data.

Let's see. Oh, it looks like I spelled

it wrong.

>> Debugging.

Great.

>> As you can see, there's helpful hints

where we were actually able to see um

what actually went wrong and be able to

really quickly go back and debug that.

So here in this case, we want to see if it's a data question and route to that separate agent,

and if it's not we'll probably use um

additional logic to go in and scour the

internet for those um you know inbound

leads that we want to qualify or an

email that we want to write. Um let's

stick with the data analysis agent for

now and go over what it's like to

actually go in and connect to external

sources within agent builder and largely

agents SDK. Um what I want to do here is

actually instruct the model on how to

use Databricks and create queries that it can use, um, in concert with an

MCP server. So what we've done here is

uh added a tool for the model to be able

to go and access this MCP server and

query Databricks however it sees fit. Um, if my query is really hard and might require, you know, joins, Databricks and GPT-5 would be able to use those

together to then be able to create a

concise query. Um, so since I've built

my own server for now, um, I'll add it

here. And let's call this I'll add my

URL first. Um, I'll call this the Databricks MCP server.

Um, and what I'll do here is actually

choose the authentication pattern. You

can also select no authentication. Um,

but for things that are protected

resources or might live within

authenticated platforms, you might want

to use something like a personal access

token to go do that last mile of

federation. So, in this case, I'll use a personal access token I created within my Databricks instance

and hit create here. Let's give it a

second to pull up the tools. And we can

see that a fetch tool is actually

submitted here. Um what this allows us

to do is select a subset of the

functions that are actually allowed to

the MCP server um to really allow the

model to not get overwhelmed with the

choices of potential actions that it can

take. So, I'll add that tool there.
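As a rough code analog of the Databricks connection described here, an agent can carry a hosted MCP tool. This is a hedged sketch only: the server URL, label, and token handling are placeholders, and the exact config fields are assumptions based on the Responses API MCP tool rather than the demo's setup.

```python
# Sketch of attaching a custom MCP server to an agent, assuming the Python Agents SDK's
# HostedMCPTool wrapper around the Responses API "mcp" tool type.
# The URL, env var, and allowed tool name below are placeholders.
import os

from agents import Agent, HostedMCPTool

databricks_mcp = HostedMCPTool(
    tool_config={
        "type": "mcp",
        "server_label": "databricks_mcp_server",
        "server_url": "https://example-databricks-mcp.internal/mcp",  # hypothetical URL
        # Personal access token for that "last mile of federation"; header shape is an assumption.
        "headers": {"Authorization": f"Bearer {os.environ['DATABRICKS_PAT']}"},
        # Restrict the model to a subset of the server's functions so it isn't overwhelmed.
        "allowed_tools": ["fetch"],
        # Ask for user consent before the tool actually runs, as shown in the demo.
        "require_approval": "always",
    }
)

data_analysis_agent = Agent(
    name="Data analysis agent",
    instructions="Construct and iterate on queries against Databricks via the MCP server.",
    model="gpt-5",  # a reasoning model, since we want it to iterate on queries
    tools=[databricks_mcp],
)
```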

Oops. Um and I'll also um I'll go back.

One thing I might have missed here is

actually setting the model. What I want

to do is make this really snappy. And so

what I can do is choose a non-reasoning

model there. But for this one, I really

want the model to iterate on these

queries and react to the results that come back to the agent. And so, uh, what we'll do here is

do a quick test query to make sure the

piping works. So maybe I'll say um show

me the top 10 accounts.

That should be good enough. Um, and what

we can see is the model actually

stepping through the individual stages

of this workflow. So in the beginning,

you can see that it classified this

question as a data question, saved that

state and then routed. Um, we can see

that when it reached that agent and

decided to use that tool, it actually

asked us for consent to be able to go

and take that action. You can configure

that logic on the front end to be able

to handle how to actually show to the

user, hey, the model actually wants to

go and uh select an action there. Um,

with MCP you're able to do both read and

write actions. And we have a few of

these MCP servers out of the box. Think

like Gmail. Um, we have a ton more uh

out of the box that you're able to

connect to.

>> SharePoint. Totally. Um, and so here we

can see that the the model is actually,

you know, thinking about how to

construct that query. And we can see

that we can see a response here. We

didn't ask for the model to really

format this result for us, but we can

actually really quickly do that with

this agent itself. by just asking the

model to say um I would like the results

to be in natural language and just by

you know spinning on um the generate

button within agent builder itself

you're able to provide these inline

changes depending on the results that

you see in real time

>> super cool

>> cool u so the next thing I want to do is

actually create another agent to do some

of that research that we were mentioning

that might be useful for something like

generating an email or

uh qualifying a lead. So, we'll call

this the information gathering agent.

Looks like it's stuck here. I might have

to give it a quick refresh in a moment.

See,

platform's a bit buggy.

Great. Um,

cool. So, we're at this information

gathering agent and what we want to do

is tell the model uh how to actually go

and search the internet for the leads

that we want. Particularly, we're

looking for a subset of the information

that might be publicly available for a

company. So, think about like the

company legal name, the number of

employees they have, the company

description, maybe their annual revenue,

as well as their geography. Um, and what

we want to do here again is use a

structured output to define what our

output should look like when the model

goes and um searches the internet for

this. This gives us a good mapping and

for the model itself to know what to

look for when it's writing these queries

and we're able to then uh you know

instruct the model in terms of the way

that it should search across the internet.

Great. Um what we want to do here is

also change the output format for uh the

schema that we want to enter. Maybe we

want to put the the fields that we

previously just showed into a structured

output format. You can also add

descriptions um in the in the

properties, but for now we're going to

leave those blank.
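As a code analog of this information gathering node, here is a hedged sketch with the Python Agents SDK: a web-search-enabled agent that returns the company fields mentioned above as a structured output. The field names and instructions are illustrative, not the exact schema from the demo.

```python
# Sketch of the information gathering agent: web search plus a structured output schema.
# Assumes the Python Agents SDK's WebSearchTool and Pydantic-based output types.
from pydantic import BaseModel
from agents import Agent, Runner, WebSearchTool


class CompanyProfile(BaseModel):
    legal_name: str
    employee_count: int | None
    description: str
    annual_revenue: str | None
    geography: str


information_gathering_agent = Agent(
    name="Information gathering agent",
    instructions=(
        "Search the web for publicly available facts about the company the user names "
        "and fill in every field of the output schema."
    ),
    tools=[WebSearchTool()],
    output_type=CompanyProfile,
)

profile = Runner.run_sync(information_gathering_agent, "OpenAI").final_output
print(profile.legal_name, profile.geography)
```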

Great. So now that when the model goes

to this information gathering agent,

it will hit this uh agent, search the

internet, and output in the format that

we're looking for. Cool. Um since we

saved the state of the query routing

in the beginning, we can go ahead and

reference this again um when we're when

we're going to um route again via email

or to the lead generation and lead

enhancement agent. So what we'll do here

is set this equal to email and then

otherwise we'll just route it to the

other agent.

>> Awesome. Yeah. And the sub-agent

architecture is great because it means

that you get better quality results a

bit faster than you would just using one

general purpose agent

which is helpful for actually having

impact and helping the sales team be

more productive.

>> Um what we'll do here is paste in a

prompt for this email agent. Um, but

really the highlight for for this for

the email agent is that we're looking to

generate emails that are not just from,

you know, information from the query or

from the the internet, but we also want

to upload files that might map to the

way that we're actually thinking about

building uh emails in general for

marketing campaigns. So, what you may

have in this case is something like PDFs

that contain information on what the

campaign is. Maybe you have other PDFs

that contain information of how you

should write emails. Um, all of this is

really useful information for the model

in order to spec out what that email

should actually look like. Um, so what

we'll do here is add a tool to actually

go and search these files. You can

attach vector stores that you may have

already um to the workflow and be able

to use those out of the box. You're also

able to add these via API. Um, but for

now, what we'll do is just drag in a

couple files that we have. Um, we have

one that's a standard operating

procedure for how to write emails. And

we have another document on a a

potential promotion that this sample

company has. Um, and what we've done is

allowed the model to then go in and

search the vector store for this type of

information in order to actually go and

generate that email. um on the lead

enhancement agent. Um instead of writing

one ourselves, let's pretend like we

have, uh, like a general segmentation of the market that we want to actually

assign various account executives to. So

in this case, what we want to do is

essentially um be able to output a quick

schematic of how we're going to do that

assigning process depending on the

information that was gathered from the

internet. And without writing a prompt,

um, agent builder will be able to output

an entire, um, you know, version of that

prompt as a starting point.
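The file search setup described above for the email agent can also be sketched in code: attach a file search tool pointed at a vector store holding the email SOP and campaign PDFs. The vector store ID below is a placeholder, and creating that store (via the dashboard or the Files/Vector Stores API) is assumed to have happened already.

```python
# Sketch of the outbound email agent grounded in uploaded documents,
# assuming the Python Agents SDK's FileSearchTool. The vector store ID is a placeholder.
from agents import Agent, FileSearchTool

email_agent = Agent(
    name="Outbound email agent",
    instructions=(
        "Draft an outbound email. Follow the email-writing standard operating procedure "
        "and reference the current marketing promotion found in the attached documents."
    ),
    tools=[
        FileSearchTool(
            vector_store_ids=["vs_campaign_docs_placeholder"],  # SOP + promotion PDFs
            max_num_results=5,
        )
    ],
)
```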

>> Super cool.

>> Great. Um, before I move away from agent

builder and and show like this working

end to end, what I wanted to show is

that agent builder doesn't just support

text and structured output formats, we

also support really rich widgets. So,

what this looks like in practice is that

uh, we can instead of outputting text or

JSON, upload a widget. And I'll show you

in a little bit what it looks like to

actually create a widget and use a

widget. But we can actually go in and

upload a widget file itself. Um, so I'll

drag this in here. Or maybe I have to

Great. So we can see a quick preview of

what this widget looks like. Rather than

just outputting in text and maybe, you

know, traditionally like ChatGPT, what

you see is like a markdown formatted

result. we want to maybe render

something richer such that if you do

host this on your own on your on your

own website, you're able to have that

multimodal um component as well. So what

we'll do here is create this component.

And now if I say draft an email to

should we use OpenAI? OpenAI, about

>> great so we can see that it went to the

information gathering agent um since

we've given access to the web search

tool. From the reasoning, did we do that? Let me make sure that I did that.

>> May have skipped.

>> Might have skipped that step.

>> there we go Great.

>> So again, sorry,

>> I was just gonna say I love that you can

test the workflow live here and debug it

like we're doing before going to

production.

>> Totally. And the really nice thing is as

you run questions through this workflow,

we save the traces of exactly how the

model has executed um you know various

queries and then more holistically the

way that the workflow has orchestrated.

So this is really rich information as

you're continuing to iterate on your

workflow. And Henry will touch on this a

ton, but the ability to really peel back

the curtain and see how the model is

thinking about this and then assign

graders I think really allows you to

scale out this process of evaluations as

well.

>> Yeah.

>> So great. Looks like here it's searching

for Lumaf Fleet. Um we'll let this run

for a little bit and see see what

happens at the end. Um,

okay. Looks like it might take a little

bit to do that. We we'll get back to

that one. Um, end to end. So, what we've

built here is essentially an agent that

allows you to do three different things.

The first one is that it allows you to go and query Databricks, being able to pull in, um, you know, information that might live behind some form of an information wall, and pull that into the agent workflow itself. Um, and then alternatively, being able to write, uh, emails, and then also qualify inbound that you might get from customers.

a workflow that you can then host within

um, chatkit, which we'll cover, or you

can take this out and use it in your own

codebase to handle um, what those chat

workflows actually look like.

>> Super cool. Um, one of the questions I

was wondering was uh, what's the

difference between pulling a tool from

the left-hand sidebar and, like, drag and

dropping that in as a node as opposed to

adding that tool into the um, agent node

specifically?

>> Totally great question. So, um, when I

added like the search tool to the

information gathering agent, I've

allowed the model to determine if it

should actually go in and use that tool.

Sometimes I always want the tool to run

prior to an agent actually getting that

information. So, I can add one of these

nodes to be able to ensure that the

model is actually doing this action,

prior to the agent actually receiving

that information.
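One way to picture that node-versus-tool distinction in plain code: when the tool is attached to the agent, the model decides whether to call it; when it is a separate node, the step always runs first and its result is injected into the agent's input. A minimal, library-agnostic sketch of the "always run first" pattern, where fetch_company_facts is a hypothetical helper standing in for a web search or MCP node:

```python
# Sketch of "tool as a node": the lookup always runs before the agent sees the request,
# instead of letting the agent decide whether to call it.
from agents import Agent, Runner


def fetch_company_facts(company: str) -> str:
    """Deterministic pre-step; in the demo this would be the web search or MCP node."""
    return f"(facts about {company} gathered by the upstream node)"


lead_agent = Agent(
    name="Lead enhancement agent",
    instructions="Assign the lead to a segment using the facts provided in the input.",
)

company = "ExampleCo"
facts = fetch_company_facts(company)  # always executed, not model-chosen
result = Runner.run_sync(lead_agent, f"Company: {company}\nFacts: {facts}")
print(result.final_output)
```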

>> Makes a ton of sense. Yeah. So, agent

kit then, I feel like, is a good combination of deterministic and, if you want, non-deterministic outcomes.

>> Yeah.

>> Cool.

>> Great. Um I want to pivot to so we built

this amazing workflow. Now we want to go

in and deploy it. Um I think one of the

most fantastic things that we released

at our most recent dev day was the

ability to go in and host these

workflows that you've built. So using the, uh, workflow ID that we've gone

in and built, we're able to actually

power these chat interfaces that uh

might require a ton of engineering

otherwise to support things like

reasoning models as well as being able

to support um you know complex agent

architectures and the handoffs that you

might want to show to users. Um what

this looks like in production is that

you're able to match the entirety of

your uh your brand guidelines to the

actual chat interface that you're

building. and we'll take a peek at how

some of our real customers are using

this today. Um, but really the I wanted

to highlight the fact that you can, you

know, entirely customize this to, you

know, the color scheme, the font

families, as well as the starter prompts

that your users might go in and use. Um,

say for example, we have a workflow that

looks at our utility bills, um, where we

might want it to go and connect to an

MCP server, pull up your billing

history, analyze those past bills, uh,

and then be able to, uh, show a really

rich widget to the user. The entirety of

that process and the customization of

what the user sees is entirely uh

configurable through chatkit. So here in

the question, how's my energy usage?

Rather than just showing a traditional

text response, we see a really rich

graph that allows you to visualize the

output.

>> This is super cool. Yeah. And I think

for our use case uh example, just to

drive it home, one of the widgets that

we have available that maybe you'll show

us shortly is uh an email widget. So, if

you wanted the agent to actually draft an email to OpenAI, which I think it's

still researching information for

because there's so much public

information out there. Um, and then

sales can just click to have that button

uh to have that email sent to the

customer.

>> Totally. Yeah. Let's take a look at a

few what those widgets could be. So,

we've released a gallery where you can

take a peek at some of the ones that we

think are really cool. You can also

click into these and see what the code

is to actually build these. But what I

think is really cool is being able to

generate these through

natural language. Like for example, if I

wanted to have an email uh component or

widget that I wanted to mock up that um

contains some specific brand guidelines

or um formatting in a way of that widget

that really appealed to my brand. Um I'm

totally able to do that via natural

language. Um and so using this you can

then export that into agent builder and

then show that UI when uh agent builder

invokes that that widget in chatkit.

>> Amazing.

>> Great. Um before moving it to Henry, I

wanted to show an example of what this

looks like in real life. Um we have a

website here that renders a globe, or a picture of the earth. And what

we want to do is be able to control this

globe that we have uh via natural

language. So where should we go today,

Tasha?

>> Well, I think our next dev day exchange

is in Bangalore. So I'm going to say

India.

>> Let's go to India.

So what we should see here is another

agent builder powered workflow. But we

can see how not only did um a widget

populate on the right side, we actually

were able to control the JavaScript that

was rendered on the on the actual

website itself. So being able to have

this customizability and portability

into the websites and browsers that you

use every day is something that um we

find really fascinating with Chatkit.

>> That was the fastest trip to India I've

ever taken.

>> Totally. Um awesome.

So we covered um the the build side as

well as the deploy side into chatkit. Um

what really is the most important part

and you know the the hardest part of a

lot of building agents is the evaluate

part.

>> Yep. That's how we know that we can

trust the agents uh in real world

scenarios in production at scale with

all of the glorious uh and weird edge

cases that come up. So with that, I'd

love to hand it over to our friend in

the UK uh Henry who can walk us through

an evals demo.

Thank you so much, Tasha and Samarth. And

hi everyone. I'm Henry. I'm one of the

product managers who worked on agent

kit. Um and so today I want to talk a

little bit about how once you've built

that agent, once you've got that

workflow, um and you've defined it in

the visual builder, I want to talk about

how you can test it. I want to talk

first about how you can test an

individual node and get confident that

that specific agent or that specific

node is going to perform as you want it

to. Cuz ultimately your agent is only as

good as its weakest link. like you need

every single component to be dialed in

and performing how you want it to. Once

you've got every one of those nodes in a

place that you're comfortable with, you

then want to be able to assess the

endto-end performance. And for that, you

can look at traces, but traces are hard

to interpret. And so now we have a trace

grading experience, too, that allows you

to take those traces and evaluate them

at scale. So, let me pull up my screen

and start talking you through a bit of a

demo and show how we can uh how we can

do this.

So, here you can see an agent that I

built. This is based on a real example

from one of our financial services

customers. This takes an input of a

company name. It assesses is this a

public or a private company and it

completes a series of analyses on that

company before ultimately writing a

report for one of the professional

investors of that company to review. So,

as I mentioned, you have a whole bunch

of agents here and every single one of

these agents needs to perform well and

needs to perform as you want it to. And

so how do you get confident in the

performance that's going to do that? How

do you get visibility and um and kind of

transparency into how it's going to

perform? So when you're defining this

agent and you're looking into one of

these nodes, you can see there's an

evaluate button here in the bottom

right. So we click that evaluate button

that's going to take that agent node

which has a prompt, it has tools, it has

a model assigned and it's going to open

it in a data set. So here you can see

this data sets UI and this allows you to

visually build a simple eval. And so I'm

going to now attach um just a couple of

rows of data into this eval. You can see

a company name and then you can see some

ground truth revenue and income figures

as well. So I've imported that to this

data set and that's going to allow us to

run this eval. So here you can see

everything that was passed through from

the visual builder. You've got the

model, you've got the tool of web

search, you've got the system prompt and

the user message that we had assigned.

And then you can additionally see

this data that I uploaded. So this is

just three rows, a couple of company

names, and then some ground truth values

for the revenue and income figures that

our web search tool should return for

those um those companies. So what I can

do now, I can run the generation. So

this is obviously the first stage of any

eval is to run generation and then once

you've completed the generation then you

complete the evaluation stage. So while

that generation is running I want to

show how we can attach columns and so

here we can add new columns for let's

say ratings where we can attach a thumbs

up and thumbs down rating and then let's

additionally add columns for free text

feedback.

So this is where I can attach kind of a

free text annotation. uh maybe I'm happy

with something, maybe I want to attach

some kind of longer form feedback on

that data as well. And so what you can

see now is that this output is coming

through. And if I click into this, I can

tab through these generations that have

been completed. So you can see here I

was asked to complete some analysis of

Amazon, of Apple, and then of Meta as well, which is still running. And I can scroll

through that and I can see the

generation that was completed. So what I

can then do is I can attach these free

text labels or attach these annotations

sorry that I just created. So I can say

this one's good. I can say maybe this

one's bad. I can say maybe this one's

good. And then I can attach feedback. I

can maybe say this is too long for

example.

Now once I've done those annotations I

can also add graders. So, let me add a

grader here. And I'm going to just

create a simple grader that's going to

evaluate a financial analysis. And it's

going to require that this financial

analysis contains upside and downside

arguments that it considers competitors,

that it ends with a buy, sell, or hold

rating. So, I'm going to save that and

I'm going to run it. And this is now

going to run through uh in fact, let me

just let me just change that. Okay,

let's just leave that. So that's now

going to run through and complete those

kind of grader ratings. So that's

going to take a little while to run

through because we've got a lot of data

in there. So I'm going to tab over to a

data set that I created earlier where

you can see these graders have now

completed. If I click into these, I can

see the rationale. I can see why the

grader has given the result that it has

done. So here you can see for example uh

this grader has failed because there's

no explicit recommendation and there's

no competitor comparison. So what we

could do at this point and now just

maybe recap even where we are here we've

got those generations that have been

completed. We've got all those

annotations and we've got all these

grader outputs. What do you do at this

point? How do you make your agent

better? So one thing you can do is just

do some manual prompt engineering and

try and find patterns in that data and

then try and rewrite your prompt. That

obviously takes a long time and requires

you to find those patterns and to spend

a bunch of time, you know, trying to

solve them. What we see as a better

solution is automated prompt

optimization. So you can see here

there's this new optimize button. So if

I click that, it's going to open a new

prompt tab in this data set and that's

where we're going to automate the

rewriting of the prompt. And this is how

you save yourself having to do that

manual prompt engineering every time. So

this is where we're taking those

annotations, we're taking those grader outputs, and we're taking the

prompt itself and we're using that to

suggest a new prompt. And again, this

will take a minute or two to run

through. So I'm going to tap here to one

that I made earlier. And you can see

here the rewritten prompt that completes

a fundamental financial analysis but is

much more thorough and complete than the

initial kind of pretty scrappy and rough

prompt that I had completed.

So that's an overview of how you can

take that single node from that agent

builder and how you can robustly

evaluate that single agent. But we're

not building a single agent here. This

is a multi- aent system and we want to

test every one of the nodes

individually. But ultimately what we

care about is that endto-end

performance. So how do we get confident

in that? How do we test that? So as

Samarth mentioned these agents emit

traces and here you can see some example

traces from when I've previously run

this agent. So clicking through this I

can see every span. I can click into

every span and I can start to identify

uh you know what happened when this

agent ran. Now as I'm clicking through

this I might start to notice problems.

For example, here you can see I there's

a bunch of sources that have been pulled

by the web search tool. For example,

CNBC and Barons. Maybe we don't want

these third party sources to be cited.

Maybe we want only first party

authoritative sources. So, we should say

web search

sources should be first party only.

Let's just run that with GPT-5 Nano

so it's nice and fast. And then as I

click through more of these, I might

find additional problems. Let's say we

identify another pattern that the end

result doesn't contain a buy, sell, or hold rating. So we say the end result needs to contain a clear buy, sell, or hold rating. And again, I'm building up these

requirements that I can then run over

specific traces. And now this set of

requirements you can think of as like a

grader rubric. And this grader rubric

is built up with a series of criteria

that define a good agent. And then once

I've got that set of criteria built up

and I've tested it on a couple of

traces, I can then click this grade all

button at the top here. And this is

going to export the set of traces that

I've scoped this to. So in this example,

just these five traces. And it's going

to take the set of graders that I've

defined on the right. and it's going to

open that in a new eval. And this allows

you to assess a very large number of

traces at scale because clicking through

every one of these traces and trying to

find problems doesn't work that well. It

takes a lot of time. It doesn't scale

well. But instead, you can run these

trace graders over a very large number

of traces. And that will help you

identify just the spans that are

problematic and just the traces that you

want to dive into. So that was an

overview of how we have this kind of

embedded eval experience that is tightly

integrated with the agent builder. Um I

also just wanted to flash a couple of

best practices that we've seen from

working with a large number of customers

now uh on this platform and a couple of

lessons that we've learned. Um first

starting simple: you know, don't overcomplicate things, but do start early.

have a handful of inputs and a simple

grader that you define right at the

start of the project instead of leaving

evals right to the last minute as, like, "I'm just about to ship this thing, I better do some testing," which I know some people do. It's much better to, like, start early, embed evals, and do kind of eval

driven development where you're

rigorously testing your prototypes

finding problems in the prototypes and

then quantitatively measuring your

improvement as you hill climb against

your eval. It's a much better way to build a product and likely to result in

higher performance.

Secondly, using human data. It's really

hard just coming up with hypothetical

inputs, using LLMs to generate synthetic

inputs. You'll probably get much better

performance if you get real user data,

real inputs from real end users because

that captures a lot of the messiness of

the real world. And then finally, make

sure you invest a bunch of time

annotating generations and aligning your

LLM graders because this is how you make

sure that your subject matter expertise

is really encoded into the system so

that your graders are actually

representing what you want your product

to do.
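To make the grader idea above concrete, here is a small, hedged sketch of an LLM-as-judge grader like the one built in the demo: it checks a financial analysis against a rubric (upside and downside arguments, competitor comparison, a buy/sell/hold rating) and returns a pass/fail with a rationale. It uses the OpenAI Python SDK's structured-output parsing directly, purely to illustrate the rubric concept, rather than the Evals product API shown on screen; the model name is an example.

```python
# Illustrative LLM-as-judge grader for the financial-analysis rubric described in the demo.
# Uses the OpenAI Python SDK's responses.parse helper with a Pydantic schema.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Grade the financial analysis. It passes only if it (1) contains upside and downside "
    "arguments, (2) considers competitors, and (3) ends with a clear buy, sell, or hold rating."
)


class GradeResult(BaseModel):
    passed: bool
    rationale: str  # the "why", mirroring the rationale shown for each grader result


def grade(analysis: str) -> GradeResult:
    response = client.responses.parse(
        model="gpt-5-mini",  # example model choice
        input=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": analysis},
        ],
        text_format=GradeResult,
    )
    return response.output_parsed


result = grade("Apple looks strong; revenue grew, but competition from ... Recommendation: hold.")
print(result.passed, result.rationale)
```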

So that was a high level overview um of

our product. This is all in GA. So we'd

love for you to to give it a spin and

please let us know uh any feedback at

all. And with that, I'll pass back over

to Tasha and uh Sam.

>> Awesome. Thanks, Henry. I feel like we

could do a whole hour session on evals.

That was awesome. Um, one quick question

for you actually before uh you step out

is um how large of an eval data set do

you recommend? We got this from chat. Is

it uh 100, a thousand, 10? How do you

know what the right uh data set size is

to get the results you want?

>> Yeah. So, the best thing to do is to get

started early. And so even like 10 to 20

examples goes a long way. And having um

having that set of data in there to just

test your application against is is

really helpful. So even just you know 10

20 a couple of dozen uh rows is helpful.

And then as you get closer to production

clearly the more is the more is better.

But it's really, you know, I wouldn't

think of it as a a question of just how

many rows because there's kind of a

quality times quantity uh multiplier

that you have to um have to consider

here. Having, you know, 50 rows of

really high quality inputs that are very

representative of a large set of user

problems and then having graders that

are really aligned with the behavior that you want to see, that can

perform phenomenally. Whereas if you use

an LLM to generate a thousand rows of

synthetic inputs, it's not going to be

that helpful. So I'd say the quality is

almost more important than just the

quantity.

>> Yeah,

>> that makes a lot of sense. Yeah.

>> Yeah. And just to add on top of that,

like one of the questions that we get a

ton of is like how do we create a

diverse data set to run evals from,

especially if you haven't put a lot of

this tooling into production already.

Um, when we're building our go to market

assistant, our engineering team that

actually supports those workflows sits

right next to our go to market team to

understand what subject matter experts

are actually asking or curious about.

This allows us to build a good, diverse set of questions that we continue to optimize on every iteration. Um, we're

capturing the nuances and the real

queries that people are actually

interacting with.

>> Super cool. Um, awesome. Well, thanks

Henry. So, with that, I'd love to cover

a couple real world examples and then

we'll leave some time for Q&A at the

end. So, um, our first one here is a

short video of a procurement agent that

RAMP built. Um, so they used chatkit to

actually visualize uh this UI to the

person requesting a software. They used

agent builder on the back end to

actually orchestrate the agent flow. Um,

and they used evals to make sure that it

would work uh at scale in production. So

while this isn't live on their platform

yet, um, we hope that it will be in the

near future. And that was a quick run

through of um, what they actually built

and and the prototype. Um, awesome. So,

uh, RAMP with the agent kit stack, uh,

was able to build this prototype 70%

faster, which I think is pretty amazing,

uh, equivalent to like two engineering

sprints instead of two quarters. Um,

Rippling, I actually think you worked on

this project a little bit. Do you want

to maybe share what they built and how

it went?

>> Yeah, totally. We were initially

thinking about like how we can spec this

out through the agents SDK and um one of

the hard challenges was like getting

that alignment between subject matter

experts as well as you know the ability

to build workflows that were logically

sound and so we really sat with them to

understand what was their real go to

market use cases and be able to work

backwards from there. Um chatting with

their team I think it was a pleasure to

use a tool like agent builder and we we

got a ton of really uh good feedback on

next versions that we're looking to roll

out.

>> That's awesome. Um similarly HubSpot who

has been doing a lot of amazing uh work

in the AI space they used uh chatkit to

enhance their uh Breeze AI assistant. If

you want to actually advance um awesome

thanks all good. Uh so yeah they saved

weeks of front end time like we

mentioned at the start building agents

from start to finish is super

time-consuming because of each of the

complex steps involved. So if we can

even help with just one of those um

numerous steps, the UI uh aspect in this

case, that's um that's a useful lift. Uh

and then finally, Carlyle and Bain,

which were two uh amazing evals

customers of ours. So um they were able

to see a 25% efficiency gain um in their

eval data set, which is fantastic. Um

cool. Okay, so maybe to round it out

before we go over to Q&A, um when we

launched Agent Kit, these are some of

our early um customers who built on the

product. And you'll see that Agent Kit's

currently powering tech stacks at

startups, Fortune 500s, everything in

between. Um these are the different

types of agents. There's a bunch of uh

breadth of use cases here from uh work

assistants to a procurement agent,

policy agents. Um, Albertson's the large

grocery retailer has a merchandising

intelligence agent. Um, Bain code

modernization. So, really cool to see

just the wide range of use cases here.

Awesome. With that, we can go to Q&A

from the chat.

Uh, maybe do you want to go to the next

slide?

>> Cool. Okay. So, how can I add for loop blocks? Samarth, you want to take that

one?

>> Yeah, good question. So we don't have a

for loop but uh we do have a while loop

that's available within agent builder

you're able to actually be able to um

conditionally continuously run different

agent workflows depending on if a

completion criteria has met. Um

obviously with the agents SDK you can

take it out into a codebase and then

orchestrate that on your own. Maybe use our interpretation of that as, like, a v0. Uh, but instead of a for loop, we do support while loops, such

that you're able to actually iterate um

throughout the workflow until that uh

end criteria has been met.
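Taken out into the Agents SDK, the while-loop behavior described here reduces to an ordinary loop around the runner with a completion check. A minimal sketch, where the completion criterion and the iteration cap are illustrative choices, not part of the product:

```python
# Sketch of a while-loop node in plain code: keep running the agent until a
# completion criterion is met (or a safety cap is hit).
from agents import Agent, Runner

refiner = Agent(
    name="Query refiner",
    instructions="Improve the draft. Reply with 'DONE:' followed by the final text once it meets the bar.",
)

draft = "First attempt at the outbound email."
for _ in range(5):  # cap iterations so the loop always terminates
    output = Runner.run_sync(refiner, draft).final_output
    if output.startswith("DONE:"):   # completion criterion
        draft = output.removeprefix("DONE:").strip()
        break
    draft = output                   # otherwise feed the result back in

print(draft)
```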

>> Hopefully that helps. Um what else have

we got? How does Agent Kit compare to

the agents SDK?

>> Um I would say that agent kit so far is

uh so well I I'll back up a bit. Agent

kit is a suite of products that we've

tried to opinionate as to the most

useful tools that we at OpenAI find find

uh from our day-to-day as we build

agents. Um agents SDK powers the

entirety of agent kit and most of

everything that you can do within agent

kit you're also able to do within the

agents SDK or it's via uh available via

an API. Um so far uh we're continuing to

roll out a ton of these changes to make

that parity happen a little bit more

closer. Um but we imagine in the future

that agent kit will also contain um you

know some features that allow you to

extend the ability to host these

workflows um on the cloud. And so rather

than using like traditional chatkit

implementations uh you could also

trigger these workflows via an API as

well. Um this allows you to essentially

host the agents SDK on the cloud. Yeah.

>> Very cool. Yeah. And I would say um

agent builder is like the equivalent of

the agents SDK functionally but it's the

canvas, visual-based way to actually

orchestrate those agents whereas agents

SDK is like the jump straight in

straight into the code version of it. Um

so yeah very cool.
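For reference, the code-first equivalent of a single Agent Builder node in the Agents SDK is only a few lines. This is a generic hello-world sketch, not the exported workflow code; the agent name and instructions are illustrative.

```python
# Smallest possible Agents SDK program: one agent, one run.
from agents import Agent, Runner

assistant = Agent(
    name="GTM assistant",
    instructions="Answer go-to-market questions concisely.",
)

print(Runner.run_sync(assistant, "Draft a one-line opener for a prospecting email.").final_output)
```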

Uh how do you build out of the box MCP

servers versus building your own?

>> Yeah totally. So we have a few MCP

servers. So we support uh remote MCP

servers which means that the MCP servers

have to be hosted on the cloud or um

hosted on the publicly available

internet to some degree. Uh when we're

building our own MCP servers, a lot of

the considerations that we have around

authentication require us to build our

own MCP servers. That said, a lot of the

providers that you use every day like

think Gmail, your calendar, etc. Those

all have out of the box connections

likely that you're able to just paste in

an API key and get started with all the

tools that we support. Some of these I

think um you know we don't have full

capabilities to do things like write. So

for example, if you want to write an

email via the Gmail API, I don't believe

that is currently supported. So you

might want to spin up your own MCP

server there. Um the thing I really like

about MCP is that it allows for that

authentication and blackbox is what that

flow actually looks like. So whether you

want to bring your own personal access

token or go through something like OAuth

and then pass in that last token that

you get to uh the MCP server, both are

totally great options to be able to

authenticate to secured sources.

>> Cool. We have any more questions?

>> Yes.

>> When do you recommend a classifier agent

with branching logic to different

agents?

>> Yeah, I think this is a great question.

It's one that we get a ton because um

as you add more tooling and instructions

to a model, what we've seen is that the

performance generally deteriorates. Um

imagine a world where you had a 100

tools, right? Allowing the model to

select which one of those 100 tools

becomes increasingly difficult. Um more

realistically, you might not have a 100

tools, but you might have 20. And each

agent or each use case for an agent

might use those tools in entirely

different ways. So one way that I like

to think about agents is that I like to

stratify the logic for what is a core

competency for this agent. What are the

net set of tools that I want this agent

to use and only in that specific type of

way. The moment I start confusing the

model and how to invoke these tools, how

to interpret the instruction with the

context of those tools, I like to branch

off to a different agent. So in the in

the cases that we had um where uh you

know we were looking for three different

uh GTM use cases maybe the email agent

that we're building you know that

outputs a widget is not the best one to

also do lead qualification. So um those

use cases where you're maybe using even

the same tools but uh you want to

structure the outputs a little bit

differently you want the model to

interpret the outputs a little

differently um it's good to branch out

to different agents.
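The branching pattern described here is also what the Agents SDK expresses with handoffs: a triage agent with a narrow job hands off to specialized agents that carry their own tools and instructions. A hedged sketch, with agent names and instructions as illustrative placeholders:

```python
# Sketch of a triage agent that hands off to specialized agents instead of
# carrying every tool and instruction itself. Assumes the Agents SDK's handoffs parameter.
from agents import Agent, Runner

email_agent = Agent(
    name="Email agent",
    instructions="Draft outbound emails using the campaign context.",
)
qualification_agent = Agent(
    name="Qualification agent",
    instructions="Qualify inbound leads and assign a segment.",
)

triage_agent = Agent(
    name="Triage agent",
    instructions="Decide whether the request is about emails or lead qualification, then hand off.",
    handoffs=[email_agent, qualification_agent],
)

result = Runner.run_sync(triage_agent, "Is Acme Corp a good fit for the enterprise tier?")
print(result.final_output)
```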

>> Cool.

Alrighty. Uh, can we use agent kit for a

multimodal use case especially for

analyzing images and files?

>> Totally. So, um, this is a great use

case for agent kit. We do support file

inputs within that preview section that

we covered. You're able to even play

around in the playground with uploading

files. Um, I what I find really

interesting is that like we propagate

this behavior to chatkit as well or

chatkit propagates that behavior to

agent builder as well where if you

upload files within chatkit that is also

passed into hosted agent builder

backends.

>> Oh, super cool. Yeah.

>> Okay. So, we are at the end here. We

would love to leave you with a few

resources if you're interested in

exploring more. Um, agent kit docs, a

super helpful place to

get started. Um, we also released a

cookbook the other week um that walks

you through a very similar use case to

the one that we showed today um in a bit

more detail even uh chat studio if you

want to play around with chatkit and see

how you can customize it. Um, and then

finally, uh, to learn more about

upcoming build hours and past build

hours, the build hours repo on GitHub.

Awesome. Uh, and with that, I think

we're at a close. If you want to, um,

Right. Okay. Upcoming build hours, we

have two, uh, agent RFT. So, building on

what we talked about today. How do you

actually customize models for tool

calling and custom graders and things

like that? Um, that will be November

5th. So, really excited to build on

today's session um, with that next

session. And then on December 3rd, agent

memory patterns. Um, so hope to see you

at both of those. You can uh get more

information about registering at this

link.

>> Awesome. Well, that's it. Thank you so

much for putting this awesome demo

together. It was super fun. Um, yeah.

Thank you all for watching and I hope

you have fun building agents.
