Inside Anthropic’s Bet on Claude Agents that Work While You Sleep | Jess Yan

By Peter Yang

Full Transcript

We've really evolved from agents being prompting loops to agents being autonomous, self-discovering, and long-running actors. We set them tasks overnight, and then we wake up and the backlog is resolved and bugs are squashed. All of that is 10,000 times easier because of all the agents that we have internally. My personal favorite is a predictive model that, based off various attributes of the customer and the product, can predict whether this customer is going to return. And it's able to produce this really rich level of insight in just minutes. The limits of what we can achieve will really be based on how much we can delegate at once, more so than what our personal capacities are.

Hey everyone. My guest today is Jess, product lead at Anthropic for Claude Managed Agents. I'm really excited to get Jess to demo how to build an agent from scratch, maybe talk about how Anthropic uses agents internally, and maybe even just talk about what an agent even is. So welcome, Jess.

Thanks for having me, Peter.

Yeah, it's great to have you. So everyone's talking about agents and the agent tech stuff and everything. So let me ask you: how would you define what an agent is, and what are the main components of an agent?

Man, what a loaded question. So once upon a time, agents were really just prompting loops where you were just trying to get questions and responses in a loop. I think that's really evolved towards permissioning and access to third-party systems, internal tooling, and sensitive data. That level of access now requires permissioning, observability, and steering, which it didn't when it was just question and answer.

Got it.

And so the underlying components are still the model, the system prompt and behavioral instructions, and the actual harness driving the loop. But the sophistication of what we are asking agents to achieve is higher, so that has made the sophistication of the harness higher as well.

Okay. Yeah. Because, you know, agents can use tools, they can have memory, right? There's all kinds of stuff.

Yeah.

And why don't we also just define what a harness is? What is a harness?

Yeah, the harness is the core scaffolding around the model that gives it the ability to run those tools and to call its memory, and to know when to ask for human-in-the-loop input versus just continuing to execute on its task. So the harness is really what elevates us from the sort of random sampling of just tokens in and tokens out to actual actionable products.

And let me throw you a curveball, actually. Do you think the model should be developed with the harness? Like, are they kind of joined at the hip?

I think I am quite biased, but I also think that it is impossible to get the maximum possible performance without tying together the harness and the model. Now, the components of the harness, and maybe the thickness of the harness, will change over time as models get more and more capable. However, when we test our models and when we are assessing their performance, we always have to test them in conjunction with a harness. And are we going to test them with all the different harnesses of the world? We're going to select the harnesses that we have built. So there is an aspect of the necessity of building models: you have to be testing them with harnesses, and that sort of keeps them paired together.
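To make the harness idea from this exchange concrete, here is a minimal sketch of a loop that runs tools, keeps a memory of what happened, and pauses for human approval on selected tools. Everything in it is an assumption for illustration: `ModelTurn`, `Harness`, and `scripted_model` are hypothetical names, and the "model" is a scripted stand-in rather than a real model call. It is not how Claude Managed Agents is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ModelTurn:
    """One hypothetical model response: either a tool call or a final answer."""
    tool: Optional[str] = None
    argument: str = ""
    final_answer: Optional[str] = None


@dataclass
class Harness:
    model: Callable[[list], ModelTurn]            # stand-in for a real model call
    tools: dict                                   # tool name -> function(str) -> str
    needs_approval: set = field(default_factory=set)
    approve: Callable[[str, str], bool] = lambda tool, arg: False  # human-in-the-loop hook
    memory: list = field(default_factory=list)

    def run(self, task: str, max_steps: int = 10) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(max_steps):
            turn = self.model(self.memory)
            if turn.final_answer is not None:
                return turn.final_answer
            if turn.tool not in self.tools:
                # Feed the mistake back so the model can correct itself.
                self.memory.append(f"error: unknown tool {turn.tool}")
                continue
            if turn.tool in self.needs_approval and not self.approve(turn.tool, turn.argument):
                self.memory.append(f"denied: {turn.tool}")
                continue
            result = self.tools[turn.tool](turn.argument)
            self.memory.append(f"{turn.tool} -> {result}")
        return "stopped: step budget exhausted"


def scripted_model(memory: list) -> ModelTurn:
    # Toy policy: call the tool once, then answer with the last thing it saw.
    if not any(line.startswith("word_count") for line in memory):
        return ModelTurn(tool="word_count", argument="agents that work while you sleep")
    return ModelTurn(final_answer=memory[-1])


tools = {"word_count": lambda text: str(len(text.split()))}
answer = Harness(model=scripted_model, tools=tools).run("count the words")
```

The step budget and the approval hook are the two places where this toy harness decides between continuing on its own and stopping for a person, which is the distinction Jess draws above.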

Okay. Like, you know, you test them with Claude Cowork and Claude Code and maybe some third-party harnesses, right?

Yeah, yeah. And of course we run against open source evals and whatnot, but at a certain point, every single model distribution now is through a harness, so we also need to be testing them through harnesses as well.

Okay. Now let's talk about your product. What is Managed Agents, and how is it different from just me talking to the Messages API?

Yeah, so Managed Agents is really the evolution of where we see task orchestration going. Earlier I talked about how we've really evolved from agents being prompting loops to agents being autonomous, self-discovering, long-running actors with access to lots of third-party systems and a need for both permissions and guardrails. Claude Managed Agents was developed with that in mind. Rather than just being a prompting loop that we've added different capabilities onto, this is a pre-built harness and companion infrastructure that allows an agent to run complex tasks at scale. For us, the core motivation behind Claude Managed Agents is that the return on effort for building an agent should be extremely high. So we wanted to build easy-to-stack primitives and easy-to-use, flexible developer APIs with out-of-the-box infrastructure. All of that should be really low effort, but then you should be able to delegate hugely complex work that might have taken you days, weeks, or months to actually execute.

Okay. So it takes care of a lot of the infra stuff that you have to build to even get an agent to run. Is that kind of it?

Yes. Exactly.

Got it. Okay. All right. Well, then without further ado, do you want to show us how easy or hard it is to build a managed agent?

Yeah, absolutely. Hopefully, we find it easy. All right. Here I am in our Claude Console, and I have a pre-built agent that I've already configured. Here you see the core components of this agent. One is the model selection. This continues to be what drives the intelligence layer underneath the agent. Here is the system prompt, which is the raw text that the model gets to define its behavior, its guardrails, and a high-level awareness of the kinds of tasks that you will orchestrate to it. I've given it access to a built-in tool set that we ship with every Claude Managed Agent, and I've given it the ability to interact with its file system and produce results. I've actually set its permissions to always allow each of these tools, but we also have the flexibility to configure these as ask permissions, keeping a human in the loop for any of these actions.

In this particular case, I didn't grant it any skills, but here's where I would grant it skills as well.
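The components Jess walks through in the console (model, system prompt, tools with per-tool permissions, optional skills) can be pictured as a small configuration object. The sketch below is only a mental model: `AgentConfig`, `Permission`, the tool names, and the model string are all hypothetical and are not the actual Claude Managed Agents API or schema. One tool is set to "ask" here purely to show the human-in-the-loop option; in the demo every tool was set to always allow.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    ALWAYS_ALLOW = "always_allow"  # the agent may call the tool without asking
    ASK = "ask"                    # a human must approve each call


@dataclass
class AgentConfig:
    model: str                      # which model drives the agent
    system_prompt: str              # behavior, guardrails, kinds of tasks
    tool_permissions: dict          # tool name -> Permission
    skills: list = field(default_factory=list)  # optional, empty in the demo

    def tools_needing_human(self) -> list:
        """Names of tools that keep a human in the loop."""
        return sorted(
            name for name, perm in self.tool_permissions.items()
            if perm is Permission.ASK
        )


# Hypothetical values for illustration only.
analyst = AgentConfig(
    model="example-model",
    system_prompt="You are a careful data analyst. Write results as HTML reports.",
    tool_permissions={
        "read_file": Permission.ALWAYS_ALLOW,
        "write_file": Permission.ALWAYS_ALLOW,
        "run_code": Permission.ASK,
    },
)
```

Calling `analyst.tools_needing_human()` lists the tools that would pause for approval, which is a quick way to audit how autonomous a given configuration is.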

So this is an agent to analyze data, kind of like a data analyst.

Exactly. Yes.

Got it. Okay.

So this particular agent runs analysis on a fictitious grocery store called Just in Time. I give it an initial prompt with the data schemas and guidance on how to actually execute the task, and I also give it a file, a really large multi-million-line file, for it to analyze as well. Throughout the course of the session run, you can see it running all of its tools and calling the model selectively to run the analysis. At the very end, it outputs HTML files that can be rendered in the browser so that you can see the results of the analysis. All of these particular events are the actions that the model is taking. You can see that I'm actually not steering it except for the initial event, and at the very end, these outputs are produced for me.

And what is the initial prompt for this particular conversation? Is it just like, go analyze this huge file?

Yeah, you could think of it as being as basic as that, but I had a very specific schema that I wanted it to be aware of. For example, I wanted it to know exactly what the structure of the data set was beforehand. That way I can front-load the initial exploration that I would ordinarily have had to do. And then I wanted to break down the steps into discrete segments so that I could get very predictable outputs at the end of the task, because these agents are, you know, randomized actors. So you do need to be somewhat prescriptive sometimes if you want to have very predictable outputs.
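The approach described here, front-loading the schema and spelling out discrete steps, can be sketched as a small prompt builder. The function name, the schema contents, the step wording, and the file name below are all made up for illustration; the actual prompt used in the demo is not shown in the episode.

```python
from __future__ import annotations


def build_initial_prompt(schema: dict, steps: list, data_path: str) -> str:
    """Assemble a task prompt that states the schema up front and numbers the steps."""
    lines = [f"Analyze the data file at {data_path}.", "", "Data schema:"]
    for table, columns in schema.items():
        cols = ", ".join(f"{name} ({dtype})" for name, dtype in columns.items())
        lines.append(f"- {table}: {cols}")
    lines += ["", "Complete these steps in order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)


# Hypothetical schema and steps, loosely modeled on the grocery-store demo.
prompt = build_initial_prompt(
    schema={"orders": {"order_id": "int", "customer_id": "int", "product": "str"}},
    steps=[
        "Summarize order patterns",
        "Profile shoppers",
        "Fit a return-prediction model",
    ],
    data_path="orders.csv",
)
```

Numbering the steps also gives each stage a name to point at later, which helps when tracing which step of a run went wrong.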

Okay. And this will help with debugging and stuff too, right? Because they can figure out which step.

And to your point about debugging, directly in Claude Console we have an agent that runs and analyzes the full session history associated with the agent. So after I've run this, I can use this debug agent to look for areas where I could have improved this agent even more.

What is the output again? You said there's like an HTML file.

Yeah. So it produces three different HTML files, and I can pull this up right now, actually.

So, okay, basically the components are: there's a model, there's a prompt, right? There's tool access, and then there's skills that are optional.

Yes.

For tool access, if I want to hook up to my internal database or something, there's like a whole set of processes for that, right?

Yes. Yeah. So one of the ways in which you can hook up an agent to a third-party database or a third-party system is through MCP. MCP exposes a standardized way to communicate with external services, and it includes an authentication layer in front. That allows you to safely access these internal services that you might not want just anyone to access.

Why don't you just briefly talk about, if I didn't use Claude Managed Agents, if I had to set this up from scratch, what kind of scaffolding has to go into this thing to build it from scratch?

Yeah. So I think that if you're working with a raw prompting loop, then all of your work is highly synchronous, and you are constantly dependent on the prior request that you strung together completing successfully in order to get to the next step. I think that worked in a world where we were just asking the chatbots to write haikus for us and we were doing very simple tasks, and then over time that has become increasingly unscalable. So if I delegated a really large task to an agent and, for example, something went wrong in that first initial message that I sent, and either I dropped the message or it was slightly off from my expectations, then my ability to pivot my integration and to handle that gracefully is significantly lower. So it's important for us to evolve towards these more self-running, self-recovering agent loops that can recover from errors, recover from going slightly off course, resteer themselves back, and then just keep you in the loop as they're doing so, so that you're aware of their process.

When you say self-recovery, you mean like they run into an error, they can debug their own error, they can do some searches and figure it out, right?

Yes. Yeah.

Got it.

And then even more baseline than that, if they produce an output that is unexpected, but they're aware of what a good output is supposed to look like, they can revise their thinking and really adjust their course of action. That is very difficult to wire together in just a raw prompting loop.
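A rough sketch of the self-recovery behavior described above: each attempt sees feedback from earlier failures, errors are caught rather than ending the run, and an output check decides whether to accept or retry. The names `run_with_recovery` and `flaky_step` are hypothetical, and the failing step is a toy that fails in a fixed order so the example is deterministic. A real agent loop would put a model call where `attempt` is.

```python
from __future__ import annotations

from typing import Callable


def run_with_recovery(
    attempt: Callable[[list], str],
    is_good: Callable[[str], bool],
    max_attempts: int = 3,
):
    """Retry a step, feeding prior errors and bad outputs back in as feedback."""
    feedback: list = []
    for n in range(1, max_attempts + 1):
        try:
            output = attempt(feedback)
        except Exception as exc:
            # Recover from an error instead of dropping the whole task.
            feedback.append(f"attempt {n} raised {type(exc).__name__}: {exc}")
            continue
        if is_good(output):
            return output, feedback
        # The step ran but drifted off course; record why and try again.
        feedback.append(f"attempt {n} produced an unexpected output: {output!r}")
    raise RuntimeError("no acceptable output; escalate to a human\n" + "\n".join(feedback))


def flaky_step(feedback: list) -> str:
    # Toy step: errors first, then returns junk, then succeeds.
    if len(feedback) == 0:
        raise ValueError("column 'price' not found")
    if len(feedback) == 1:
        return "n/a"
    return "mean_price=4.20"


result, log = run_with_recovery(flaky_step, is_good=lambda out: out.startswith("mean_price="))
```

The returned `log` is what keeps a person in the loop: it records each recovery so the run can be reviewed afterward.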

All right. Let's go back. Do you think the output is generated now, or is it still working?

Yeah. All right. So this agent produces three primary artifacts. First, an analysis of the products, so just an overall high-level inspection of common order patterns in shopping carts. It also produces an analysis of the shoppers, with heat maps on when they're shopping and these really interesting radar charts. And then lastly, my personal favorite is a predictive model that, based off of various attributes of the customer and the products, can predict whether this customer is going to return. So all of this was just with simple prompting and access to Python packages in the agent's environment, and it's able to produce this really rich level of insight in just minutes.

And so the prompt has some stuff about the format of the three reports that we want and the data that we want, right?

Yes. So what I put into the system prompt is general performance optimizations, because I do want this to be a general agent that I can reuse across a lot of different data sets. And then what I put into the initial prompt I sent to it is that highly specific schema discovery and the actual task description on how to run its analysis in sequence.

Well, okay. Let me ask you this. So now you have a bunch of traces and a bunch of outputs. How do we build evals for this agent? How do we know it's not going off the rails?

Yeah, evals are definitely the toughest part about building agents today, because I think that traditional eval development is evolving as the tasks get more complex. The traditional eval setup of, you know, here's a set of initial prompts and here's how we want the agent to produce results in response, that still works. I think that we're also seeing folks do more sophisticated things, like replays of more complex multi-turn kinds of interactions with agents, or A/B testing different versions by sending the same string of user interactions and seeing how that ultimately changes the responses. I think another trend that we're also enabling in Claude Managed Agents is a built-in eval loop, where if the agent itself knows to grade its outputs, then you can actually pull the eval directly into the agent's work rather than having it be outside of the course of the session.

Okay, this is kind of what I do in Claude Code. It spits out an output, and I try to get another agent to run the eval, and then if the eval sucks, I try to get the other agent to work again, basically.

Yeah, when you can have agents evaluating their own work, potentially in separate context windows to avoid bias, then you're always going to get a better output.

And just real quick on evals: do you guys do pass/fail evals, or do you do scoring evals, or all kinds of evals?

We do a mix. There's definitely binary pass/fail. There's definitely scoring, which is more like LLM-as-judge and applying more of a letter-grading type approach. And then there's also triggering evals, for things like making sure that this type of action is actually triggered in general. For example, with skills, something that we worked on very early on was making sure that skills triggered properly at the right time, because the whole point behind skills is, you know, progressive disclosure.
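The three eval styles mentioned here (binary pass/fail, graded scoring, and trigger checks) can be sketched as three small functions over a recorded run. This is an illustration under assumptions: `Transcript`, the action names, and the grade thresholds are invented, and the numeric score passed to `letter_grade` is assumed to come from some judge (for example an LLM-as-judge) that is not implemented here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Transcript:
    """A recorded agent run: its final output and the actions it took."""
    output: str
    actions: list


def pass_fail(run: Transcript, must_contain: str) -> bool:
    """Binary check: did the output include the required content?"""
    return must_contain in run.output


def letter_grade(score: float) -> str:
    """Map a judge's 0-to-1 score to a letter (thresholds are arbitrary)."""
    for threshold, letter in ((0.9, "A"), (0.8, "B"), (0.7, "C")):
        if score >= threshold:
            return letter
    return "F"


def triggered(run: Transcript, expected_action: str) -> bool:
    """Trigger check: did the agent take the expected action at all?"""
    return expected_action in run.actions


# Hypothetical recorded run.
run = Transcript(
    output="<html>report</html>",
    actions=["read_file", "load_skill:charts", "write_file"],
)
checks = {
    "has_html": pass_fail(run, "<html>"),
    "grade": letter_grade(0.85),
    "charts_skill_triggered": triggered(run, "load_skill:charts"),
}
```

The trigger check looks at what the agent did rather than what it wrote, which is why it suits questions like whether a skill loaded at the right time.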

Oh, yeah. It wasn't that good at that. I think that's gotten better over time. I used to have to manually trigger the skills with slash commands.

Yeah.

But it's gotten a little bit better over time. Got it. Okay. I feel like as these agents become more autonomous, right, and they can do longer-running tasks, it's more about what is your goal and what's the outcome that you want to have. Do you think that kind of changes the prompting and everything? [laughter]

Yeah, it definitely changes things. I think that once upon a time there were structured outputs, right? We told an agent specifically, your output must adhere rigidly to this structured code formatting.

Yeah.

And then we would string together a lot of glue to make sure that these big blobs of JSON structured data turned into something beautifully rendered in the browser. I think that as models have become more capable and as harnesses have evolved, the outcome has become the structured output. The outcome is sort of a meta structured output, where we don't need to tell the agent anymore, okay, this is the exact structure of your response, and I will glue all of these together and create something rich and interactive. We're just skipping straight ahead and saying, let's build this rich and interactive thing, and this is my tastemaker's assessment of what good would look like. And because we now have the infrastructure for these agents to run autonomously, the agents can actually self-correct along the way rather than relying on us to string together all these intermediate outputs.

Okay, so it's not like, you know, here's five sections you've got to adhere to, here's the character count, here's what you've got to do. I think it's actually kind of harder now, because what's an example of a tastemaker output? Like, hey, you've got to make it beautiful and interactive, or what?

Yeah, we see people using it for slide generation and content creation. I think that's one area where outcome optimization is really useful, just visual artifacts and editorial content. The other place where we see it being useful is, for example, in the predictive model that I just showed. Let's say that I needed it to achieve a specific score of 90% on an accuracy benchmark. I've run tests with this agent before to try to always optimize for building a model that will hit that score, and it's iterated until it got there.

Okay. Got it. That makes sense. Yeah, it makes sense to try to get the agents to do their own loops first before the human eyes have to come into play, you know, just make as much progress as possible.

Yeah.
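The "iterate until it hits the score" pattern from this exchange can be sketched as a loop that proposes a candidate, scores it, and stops once a target is met or a round budget runs out. Everything here is hypothetical: `iterate_until` is an invented helper, the candidates are just integers standing in for model settings, and the scores are placeholder numbers chosen to make the example deterministic, not real benchmark results.

```python
from __future__ import annotations

from typing import Callable


def iterate_until(
    target: float,
    propose: Callable[[list], int],
    evaluate: Callable[[int], float],
    max_rounds: int = 5,
):
    """Keep proposing and scoring candidates until one meets the target."""
    best_candidate, best_score = None, float("-inf")
    history: list = []
    for _ in range(max_rounds):
        candidate = propose(history)      # the agent picks what to try next
        score = evaluate(candidate)       # the outcome check, e.g. held-out accuracy
        history.append((candidate, score))
        if score > best_score:
            best_candidate, best_score = candidate, score
        if best_score >= target:
            break
    return best_candidate, best_score, len(history)


# Placeholder scores for illustration only; they are not measured results.
PLACEHOLDER_SCORES = {1: 0.71, 2: 0.84, 3: 0.91, 4: 0.93}

best, score, rounds = iterate_until(
    target=0.90,
    propose=lambda history: len(history) + 1,   # toy strategy: try the next setting
    evaluate=lambda candidate: PLACEHOLDER_SCORES[candidate],
)
```

The round budget matters as much as the target: without it, a goal the agent cannot reach would keep the loop running indefinitely.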

Well, let me switch gears a little bit. Let's talk about how you guys use agents internally at Anthropic. Maybe you can't screen share Slack, but I'm curious, for Anthropic PMs and technical staff, what kind of help and leverage have agents given you guys internally during your work?

Yeah, for me it's really been about depth, and I think that access to our codebase has been the biggest unlock for me. One, it helps me just manage state more easily. Rather than poking a bunch of engineers on what they're doing, I can just track the PRs directly and see which ones are merged, which ones are deployed. I think there's also an aspect of, I deeply understand and interact with my product so much more than I've ever been able to in the past, because it's so much faster for me to either prototype an agent on Claude Managed Agents, because I can use Claude Code, or just interrogate the codebase on exactly how everything is working. So anything from going into a customer RFP and filling out all of their security check marks, to diagnosing problems in the field, unblocking users who are having trouble, or helping them scale by helping them understand our specific architecture, all of that is 10,000 times easier because of all the agents that we have internally.

Okay, so let's kind of break that down into a few things. Do you come into work on Monday and be like, hey Claude, what did my engineers ship last week? [laughter] Do you ask that kind of question?

For me, I have some scheduled runs going that summarize activity. But I still do a lot of the deep dives on a more ad hoc basis, based off of the questions I'm getting or the pitches I need to prepare for or the customer conversations I'm about to have.

Okay. Got it. So you're going to an enterprise to talk about Managed Agents and you need to personalize the pitch for that company or something, right?

Yeah.

And do you guys have like a feedback loop? I know some people are active on Twitter, even though it's pretty toxic, but how do you get feedback for your product? Do you have like an enterprise Slack group or something?

I do have agents monitoring our Slack channels. I sit in a bunch of Slack channels with a bunch of customers. I love talking to customers directly, but for those I'm not able to talk to directly, it is useful to have agents summarizing the activity that's happening there. I think that we have started to evolve towards thinking about agents as always on. You should be able to tag them anywhere, but they also should be proactively surfacing things for you in the way that a coworker truly would. And so I think there are two aspects in which our agent usage is really powerful. One is the level of access to data we give it. And two is the interaction style that we expect of our agents, which is that it should be humanlike. It should be proactive and not just reactive.

Is proactive through triggered events and cron jobs, or how do you make it proactive?

Yeah, proactive is on triggered events and cron jobs, and continuously refreshing the data that it has available, so that if it is long-running and it's constantly slurping up information, it is as up to date as you are. That shouldn't be just on an ad hoc basis. That should be proactive.

Got it. Yeah. It's got to have the most up-to-date context, right?

Yeah.
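As a rough picture of the two proactive mechanisms named above, scheduled (cron-style) runs and event triggers, the sketch below decides which agent runs should start at a given moment. The names `ScheduledRun` and `runs_to_start`, the run names, the event names, and the dates are all invented for illustration; a real deployment would use an actual scheduler and event source.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScheduledRun:
    name: str
    every: timedelta      # how often the run should happen
    last_run: datetime    # when it last happened

    def is_due(self, now: datetime) -> bool:
        return now - self.last_run >= self.every


def runs_to_start(schedules: list, events: list, triggers: dict, now: datetime) -> list:
    """Combine time-based and event-based triggers into one ordered run list."""
    due = [s.name for s in schedules if s.is_due(now)]
    fired = [triggers[e] for e in events if e in triggers]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(due + fired))


# Arbitrary example values.
now = datetime(2025, 1, 6, 9, 0)
schedules = [
    ScheduledRun("weekly_summary", timedelta(days=7), datetime(2024, 12, 30, 9, 0)),
    ScheduledRun("daily_digest", timedelta(days=1), datetime(2025, 1, 6, 8, 0)),
]
triggers = {"slack_mention": "answer_question"}
to_start = runs_to_start(schedules, ["slack_mention", "unrelated_event"], triggers, now)
```

Here the weekly summary is due because a full week has passed, the daily digest is not, and the Slack mention fires its own run regardless of the clock.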

So let me ask this question. How do you compare your conversations with Claude and agents versus your conversations with coworkers? What happens more throughout the day? [laughter]

Interesting. That's a great question.

Yeah.

I feel like I talk to Claude more, I'll be honest. [laughter] It is true. I think particularly when I'm in a new space, I find myself spending a ton of therapy time with Claude, just trying to wrap my arms around a thornier concept. And I think that it uplevels my conversations with my teammates, because I'm able to come to a conversation with a true opinion and a lot of baseline research done very quickly. So I'm not asking the, you know, "please spin me up" questions. We're able to engage at a deeper level.

I mean, I guess that's probably expected at Anthropic, right? Because you're encouraged to use Claude. Do you guys pull up Claude in a decision meeting or a team meeting and try to, you know... [laughter]

Yeah. [laughter] Yes.

You do that?

Yeah. Claude is a really good neutral judge for certain things. We have an API review Claude, so basically if we're really stuck at an impasse on how we want to shape certain components of our API, then we do tag in Claude to tell us when our biases are getting the better of us.

Okay.

But all of our primitives are definitely there to allow for this kind of interaction. It's sort of like agent-to-agent communication.

All right. So, just to wrap this section, how would you say you typically spend your day? I guess there's no real typical day, right? But do you talk to a lot of enterprise customers? Do you do a lot of roadmap planning? Are you shipping stuff yourself? How do you typically spend your day?

Yeah, I think that my day now is spent a lot more in the customer discovery and integration journey process than it has been before, and I think that's awesome. I think it's because, frankly, we're all moving so fast that, like I said, the kind of conversations that we're all having have been seriously upleveled. Previously, if I spent this much time in customer conversations, it might be like, oh, please debug this tiny little thing, some problem that I've debugged a hundred times. But now, because they have agents and because we have agents and because we're pushing the boundaries, the conversations are now like, okay, how would we build this super futuristic thing together, what are our principles around it, and how can we push that forward in the next two weeks? [laughter] Those kinds of conversations are really exciting for me, and I really do love spending time with our customers. I also spend a ton of time prototyping. One thing I love about Managed Agents is that it's so easy to spin up an agent. Because my work changes day to day and never really looks the same week over week, I need to be able to spin up an agent that is perfectly suited for the specific task of the moment. And it's okay if I throw that agent away. It doesn't have to be the most beautifully productized thing in the world.

Okay.

And so I spend a lot of time bashing on my own product by basically automating my own work. It takes me maybe half an hour to spin up an agent, but I try to have a different one going every couple of weeks.

Can you give an example?

Yeah, I'll give an example. So we had a waitlist for some of our advanced features, and it was like a 4,000-organization-long waitlist, and it was filled with invalid entries and duplicates and all the kind of stuff that you get on a traditional published web form.

And I know that this waitlist is only really going to be relevant for the next few weeks, until we get this feature out to the public and make it self-serve. So I just needed to spin up an agent that would automate my next few weeks of work with this waitlist: parsing out all the invalid entries, and assessing which ones are the highest likelihood to convert and actually give really high-value feedback. I basically gave it access to our internal systems and our databases and whatnot to make that assessment and figure out who to pull in off the waitlist on a daily basis. That's just a few weeks of work, and there's no point in building something super shiny for it. So having really good, easy-to-use, pre-built infrastructure that just automates building an agent is a huge unlock for me, and I can take those kinds of tasks and just repeat them.

Okay. So basically this agent looks at a big waitlist, cleans it up, and then sends invites to people on the waitlist. Is that kind of it?

Yeah. Like people who are highest likelihood to be high-value testers for us.

Got it. People who are like the most active Twitter complainers, basically. No, I'm just kidding. [laughter]
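The cleanup half of the waitlist task described above (dropping invalid entries and duplicates) is the kind of step that can be sketched in a few lines. The field names, the sample entries, and the deliberately loose email pattern below are assumptions for illustration; the scoring of which organizations are most likely to convert depended on internal data and is not attempted here.

```python
from __future__ import annotations

import re

# Deliberately loose pattern: something@something.tld with no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_waitlist(entries: list) -> list:
    """Drop entries with a missing org or malformed email, then dedupe by email."""
    seen = set()
    cleaned = []
    for entry in entries:
        email = entry.get("email", "").strip().lower()
        org = entry.get("org", "").strip()
        if not org or not EMAIL_RE.match(email):
            continue  # invalid entry
        if email in seen:
            continue  # duplicate signup
        seen.add(email)
        cleaned.append({"org": org, "email": email})
    return cleaned


# Made-up sample entries.
raw = [
    {"org": "Acme", "email": "Dev@Acme.com"},
    {"org": "Acme", "email": "dev@acme.com "},   # duplicate after normalizing
    {"org": "", "email": "x@y.com"},             # missing organization
    {"org": "Globex", "email": "not-an-email"},  # malformed email
    {"org": "Initech", "email": "pm@initech.io"},
]
valid = clean_waitlist(raw)
```

Deterministic steps like this are cheap to hand to an agent as plain code, leaving the judgment-heavy part, deciding who to invite, to the model.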

Okay. I don't have to be a company to use this product, right? I think you were going to show me how to do this in Claude Code, or can I just use this as an individual?

Yeah. As long as you have an API key with us, you should be able to use it. In fact, we have a ton of usage from individuals. It actually kind of surprised me, but I think that we're seeing a lot of individuals just automate their lives with Claude Managed Agents.

But when should I use this product versus just trying to build a skill or a weekly cron job or something? You said there's a lot of individuals using it, right?

Yeah. Keep in mind that these are really long-running, cloud-hosted sessions. Anything that's running directly in your Claude Code is bound by the constraints of your laptop and when it's on. So using Claude Managed Agents basically pushes that all to the cloud. It increases the capacity of the work that you're able to do and also the longevity of it.

Okay. So stuff like uh maybe like some sort of online competitive research or like um like what are what are people using it for uh in in their personal lives?

Yeah. I have a friend who's a new parent who basically has her child's entire hourly schedule of feedings and tummy time and all of the things that new parents have to worry about basically fully managed by Claude. And then she also has these fridge monitors and grocery management agents running as well.

So yeah, I don't actually have insight into what people are doing at scale because, you know, we redact a lot of that information internally. But from anecdotal usage, honestly, just using these as your personal assistant. You know, they're the ultimate customizable agents, right? You can use them for whatever you need.

Like, I'll tell you my bias. I feel like sometimes when companies set up agents, they make it too complicated right off the bat, you know? They want to have orchestrators. They want to have all kinds of stuff. And then my bias is just like, why don't you just build one agent and see if it actually works and people actually use it? But, you know, you talk to a lot of enterprises. What are some best practices to roll this out?

Yeah. I think that a lot of enterprises make the immediate jump to, like, how could I automate this crazy 20-team workflow that would have required a lot of cross-cutting coordination and these multi-month processes? I mean, super ambitious, very exciting. But I do think that there is something really valuable about just, okay, how do we unlock the individual? How do we make any individual on any team feel exponentially more powerful? If one Peter and one Jess is suddenly like, wait, I don't need to make dependency requests because I have these agents that are able to extend the kind of work that I'm able to do, do the design work that previously I would have had to request a human for, etc., then you've supercharged one individual, right? You might not have completely eviscerated an entire multi-quarter process, like a compliance process that everyone hates, but you've instilled the kernel of creativity and autonomy within the individuals of your organization.

So starting there, and starting by just getting your individual employees to raise the ceiling of what they can do creatively and what they can ship in isolation, is the first starting point. And then from there you can start working on these multi-team mega processes and start raising the ceiling of complexity there. But there is a huge amount of value that's unlocked simply by making everybody feel like they have their own power to develop products as sort of a one-person startup.

Like, you have a bunch of one-person startups inside a large company, basically.

Yeah. Yeah.

Uh, that's interesting. Okay. Yeah. So basically, do you just let anyone make their own individual agent? Or, I think it's probably best practice to actually give some spotlight examples, right, of people who actually know what they're doing, you know?

Yeah. I think giving people templates and then letting them iterate freely off those templates is always a good place to start. You know, you avoid the writer's block of, like, what do I do? But then you give them the creativity to iterate.

Cool. This is very good. And I think getting the agent into the hands of actual users quickly, maybe not all users but at least some beta users, you know, that's kind of where the rubber meets the road, right? So, like, don't go too crazy on evals before you even get it in the hands of any users. Just do that.

Yeah. The vibe testing is honestly the most important first step. And at a certain point, you outgrow the vibe testing because you can't really do, you know, aggregate vibe signals at scale, and that's when you do evals.

Okay. Okay. It's hard to present the vibe testing. You need a way to quantify the vibes. I guess you can have a bunch of quotes and stuff, but it's hard.

Yeah.

Yeah.

Um, okay, cool. So I guess last question. I mean, this stuff is moving so fast. Where do you think this stuff is going to go? Let's just say three or six months from now. Do you think we're going to, you know, before I go to bed, be like, "Hey, just take care of everything for me," and [laughter] then when I wake up, it's done?

Yeah, honestly, to some degree, these long-running agents are kind of doing that. We set them tasks overnight, and then we wake up and, you know, the backlog is resolved and bugs are squashed. But I think that we will see the workday becoming, sort of... the limits of what we can achieve will really be based off of how much we can delegate at once, more so than what our personal capacities are, because we are going to be able to increasingly lean on these agents as partners. So that's one thing.

I think on an industry level, what I'm really fascinated by is I'm starting to see that vertical SaaS is just becoming increasingly specialized. And so the idea that you would have an accounting agent versus a healthcare agent, that is starting to become more and more narrow as people realize that, you know, the models are getting smarter, and so broad domain expertise is sort of there. And so the real value-add is these incredibly specific and niche use cases. And so it's really interesting because I think that getting the right agent to do the job now is getting so specialized and tailored that the shared thing now is the context patterns and the task orchestration patterns, more so than, okay, this is a canonical way to build a finance agent or a healthcare agent.

So for example, maybe instead of a general accounting agent, it's like an accounting agent for solopreneurs or something, like, just more...

Yeah. Yeah. Everything is just becoming incredibly specialized, particularly as people are able to build products, you know, for themselves and then scale them externally. So what we're seeing is, as you realize that you can now build software for these hyper-specific use cases, people are scaling those things. And so that means that the kind of products that we're seeing distributed are increasingly verticalized.

Yeah, I think this is actually really interesting. I actually have more questions about this, just real quick. Um, because I feel like I can build a product, like an AI product for accounting for solopreneurs or something, but I feel like someone can just build a skill that's just as good. Like, [laughter] it's kind of hard to think about a long-standing SaaS product, you know, around this.

Yeah. I don't really have a good answer, but yeah, I think that the products that will sort of survive this transformation are the ones that meet their users where their workflows are, right? So it's about being, one, hyper-specific to the tasks at hand and adaptable, I guess, beyond those specific tasks. But then two, being exactly where you need it to be. And so that always-on kind of agent pattern is definitely important because, you know, you want that agent to pop up at the right time, but it also needs to be in the discoverable place where you would expect those workflows to be handled.

Increasingly that is just, like, in Claude Code or some of these apps, right? Like, I don't want to navigate to a website and fill a bunch of forms. I feel like I want to bring my personal agent with all my context, and I don't want to get it to go talk to someone's accounting agent. Just, like, go figure it out, [laughter] you know what I mean?

Yeah, wherever your work lives. And so, you know, for a lot of engineering teams that will be increasingly in Claude Code, and everyone sort of is becoming an engineering team. But also, even if we think about Vercel's Chat SDK and how, you know, they had the vision to realize that everything is chat now, because agents interact best in chat, and these interactions with our colleagues are getting more and more compressed because the speed that we're iterating at is so much faster, right? And so I do think that, I mean, it sounds kind of basic, but to meet users where they are, it has to live in Claude Code and it has to live in chat.

Yeah. It's got to live in chat, and the chat has to be connected to all your personal context. It's not just like a random chat window on a website, you know.

Yeah. Well, fascinating world, fascinating world we're living in. [laughter]

Yeah. Um, but this is super helpful, Jess. Where can people find you or learn more about Claude Managed Agents?

Yeah. So definitely use the Claude Code skill to learn more about Claude Managed Agents. We also have traditional artisanal documentation that I assure you humans still read at platform.claude.com.

And I'm Jess double_yan on Twitter if anyone wants to reach me.

All right, Jess. Well, this has been a fascinating conversation. I hope the agents don't take over everything, but I look forward to having them save time on work we don't want to do.

Yeah. Thanks so much.
