
How People Actually Use AI Agents

By The AI Daily Brief: Artificial Intelligence News

Summary

Key takeaways

  • Median Claude Code turn: 45 seconds. The median turn lasts around 45 seconds, and that's been fairly consistent over the past several months. [04:36], [04:48]
  • 99.9th percentile jumps to 45 minutes. Average turn duration at that percentile jumped from 25 minutes to 45 minutes between the Sonnet 4.5 launch and the Opus 4.5 launch. [05:09], [05:32]
  • Experienced users double auto-approval. New users use full auto-approval roughly 20% of the time, which roughly doubles to 40% for more experienced users. [06:42], [06:58]
  • Experienced users interrupt twice as much. Newer users interrupt Claude around 5% of the time, while more experienced users interrupt it around 9% of the time, almost double. [07:16], [07:31]
  • Claude asks for clarification twice as often. For turns with high goal complexity, humans interrupted Claude 7.1% of the time, while Claude asked for clarification more than double that, 16.4% of the time. [09:00], [09:24]
  • Over 50% of tool calls are non-engineering. Even at this early stage, with coding and engineering tasks being the clear breakout, more than 50% of tool calls fall outside the software engineering domain. [10:34], [10:54]

Topics Covered

  • METR Measures Human Task Time
  • Real Autonomy Hits 45-Minute Peak
  • Experienced Users Trust More
  • Agents Expand Beyond Engineering
  • Autonomy Equals Permission Plus Ability

Full Transcript

Today we're discussing a new study from Anthropic that, while nominally about agent autonomy, is actually much more about how people are using AI agents in practice.

Welcome back to the AI Daily Brief.

Today we are looking at a new Anthropic study on agent autonomy. It's called "Measuring AI Agent Autonomy in Practice."

And in many ways it ends up actually being a case study in how agent behavior is changing. After reading it, I couldn't help but feel like it was a profile of a changing market, where more and more of the tasks are moving outside of coding or engineering, and more and more of the agentic work is being done by people who are not themselves engineers.

Now, to set this up, I think it's useful to have as a comparison the most frequently discussed study on agent autonomy. That is, of course, the METR study, the chart of which I'm sure you've seen before, that measures AI's ability to complete long tasks. The metric that they created is basically a measurement of the duration of a task that AI can complete at a certain level of success. It is not, and this is something that people frequently get wrong, a direct measure of how long an AI agent can work for. Instead, it is a measure of the duration of the task as it would take a human. So when, for example, GPT-5.2 High comes in at 5 hours, that's not that GPT-5.2 High took 5 hours to complete a task. It's how long that task would have taken a human.

What's more, METR has two success metrics, 50% success and 80% success, neither of which would be sufficient performance for a real-world context. In other words, you're not going to keep an employee around who completes tasks at a 50% success rate. Still, I've always thought that this METR metric was really valuable. In my estimation, it doesn't matter so much whether 50% or 80% success is the core number; it's that it's consistent and applied consistently over time to different models.

So, ultimately, what is this trying to get at? Well, it's trying to measure agent autonomy. And why does autonomy matter? Autonomy matters because it shapes what agents can do. The more autonomous an agent is, the greater its capability to complete long-duration tasks with high success rates, and the wider and more complex the array of use cases it can be valuable for. That matters on an individual level, in terms of what work you can outsource to an agent; on an org level, in terms of which sets of tasks or which entire functions can be identified; and on a societal level, as it has a big impact when it comes to the job disruption conversation.

METR is a very valuable and oft-cited metric. Indeed, last year during the height of the bubble times, people joked that this chart was carrying the entire industry on its back, as it was the one thing that suggested there was no plateau in progress, which was maybe the chief piece of evidence that the bubbleists were looking for.

And yet, there are of course limitations to their methodology. As Anthropic puts it, the METR evaluation captures what a model is capable of in an idealized setting, with no human interaction and no real-world consequences. And that, of course, is not how people actually use agents in practice. To understand how people use agents in practice, one of the best places to look is Claude Code.

For all intents and purposes, I think one can argue that Claude Code is the first agent with product-market fit. In fact, many people have noted that Claude Code is better thought of not as a coding tool per se, but instead as a code-enabled general-purpose agent. And that brings us to the Anthropic study, "Measuring AI Agent Autonomy in Practice."

Now, although Anthropic has access to pretty unique data in this regard, there are still some challenges. First of all, there's the question of the definition of an agent. Since this is a constant source of debate, Anthropic decided to go with a definition that is, as they put it, conceptually grounded and operationalizable: an agent is an AI system equipped with tools that allow it to take actions. As they point out, studying the tools that agents use tells us a great deal about what they are doing in the wild.

In terms of sources, they pulled from the public API as well as Claude Code. And going back to this idea of tools, for the public API data they say, "rather than attempting to infer our customers' agents' architectures, we instead perform our analysis at the level of individual tool calls." They write, "this simplifying assumption allows us to make grounded, consistent observations about real-world agents even as the context in which those agents are deployed varies significantly." The limitation they note is that they have to analyze actions in isolation rather than understand how those individual actions combine into a larger whole.

The second source of data is Claude Code. And what makes Claude Code super valuable for this study is that, because it is their own product, they can understand an entire agent workflow from start to finish. The challenge, of course, is that it doesn't necessarily have the same diversity of use cases as their API traffic.

Now, one last note on the methodology. When trying to figure out how long agents actually run for without human involvement in Claude Code, they're using turn duration: basically, how much time elapses between when Claude starts working and when it stops. One note they make is that most Claude Code turns are very short. The median turn lasts around 45 seconds, and that's been fairly consistent over the past several months.

Instead, then, they look at the signal at the very end of the long tail: basically, the 99.9th percentile turn duration, with the argument being that these are the most advanced users, or at least the most advanced use cases, and in that way are more likely to reveal what the far end of the capability set really is.

So looking at that 99.9th percentile turn duration, there are two really interesting phenomena over the past few months. In the period between October and January, basically from when Sonnet 4.5 launched through when Opus 4.5 launched in November, average turn duration at that percentile jumped from 25 minutes to 45 minutes. Interestingly, they note that the increase is smooth across model releases, suggesting that autonomy is not purely a function of model capability. And indeed, I think that's one of the big themes of this research: when we try to understand agent autonomy, we have to think beyond just the model to the entire context in which a model operates, including the human interactive context.
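To make the median-versus-tail distinction concrete, here is a minimal sketch of how those two statistics relate, using made-up turn durations (a synthetic long-tailed distribution, not Anthropic's data):

```python
import random

# Hypothetical turn durations in seconds: most turns are short,
# but a long tail of rare turns runs for many minutes.
random.seed(0)
turns = [random.expovariate(1 / 65) for _ in range(100_000)]

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[rank]

median = percentile(turns, 50)    # the "typical" turn
p999 = percentile(turns, 99.9)    # the extreme tail the study focuses on
print(f"median turn: {median:.0f} s")
print(f"99.9th percentile turn: {p999 / 60:.1f} min")
```

The point of the sketch is that a ~45-second median and a tail measured in tens of minutes can coexist in the same distribution, which is why the study reads the tail, not the median, as the capability signal.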

The second really interesting period in this chart is the last six weeks or so, when there was actually a bit of a dip backwards, from the peak of over 45 minutes down to something closer to 40. They identify two theories for why what was previously a pretty smooth curve has now leveled out and in fact gone down a little bit. The first is a shift in what projects people were using Claude Code for. The argument is basically that over the holidays, people had a broader range of exploratory things they were doing for their own gratification or hobbies, whereas when they came back, they had, as they put it, more tightly circumscribed work tasks. The second piece, however, is that between January and mid-February the Claude Code user base doubled, which is obviously a phenomenon that we've been tracking closely here. A doubling like that is naturally going to bring with it a more diverse user base that's going to reshape the distribution a little bit.

And indeed, maybe the most interesting thing about this study to me is not just the raw measure of capability, but the human interaction measures. A lot of this story is the difference between new users and power users. One of the interesting findings is that users at the beginning of their Claude Code journey use the full auto-approval features less than more experienced users. New users use full auto-approval roughly 20% of the time, which roughly doubles to 40% for more experienced users. Claude Code's default settings require users to manually approve each action, and so Anthropic suspects that what we're seeing is a steady accumulation of trust. At the beginning, you approve things each time, and then, as you dial in your settings and start to learn to trust the model, you give it that auto-approval more frequently.

At the same time, approving actions isn't the only way that people supervise Claude Code. Users can also interrupt Claude while it's working to reorient it or give it feedback. And that kind of follows the opposite pattern: newer users interrupt Claude around 5% of the time, while more experienced users interrupt it around 9% of the time, almost double. Now, one part of this might just be a shift in where people put the burden of oversight. If new users are approving each action before it's taken, maybe they don't need to interrupt Claude as much, whereas when those experienced users use auto-approval more liberally, there's more of a context for them to step in. However, there also might be a sort of learned experience here as well. They write, "The higher interrupt rate may also reflect active monitoring by users who have more honed instincts for when their intervention is needed," with the idea being that the new users simply don't know when to intervene as much.

I think one comparison here is that if you view AI as sort of a junior employee, it earns trust over time. That's the shift from the 20% to 40% auto-approval rate.

But as you get more comfortable with it, you also intervene more, checking in on the work as it's happening and reorienting to make sure you get the most out of things, rather than just waiting to see the end product and judging its success at that point.

Now, although these measures are about the human intervention, this is not a static number across models. In other words, model capability does impact this. Anthropic writes that from August to December of last year, as Claude Code's success rate on internal users' most challenging tasks doubled, the average number of human interventions per session decreased from 5.4 to 3.3.

Basically, as the models get better, users grant Claude more autonomy and achieve better outcomes while needing to intervene less.

Now, when it comes to autonomy, we're talking about an interaction set in a conversation between the model, the harness (Claude Code), and the humans using that model. Human intervention is only one of the directions in which autonomy can unfold in practice. Claude, as they write, is an active participant too, stopping to ask for clarification when it's unsure how to proceed. Anthropic found that as task complexity increased, Claude Code would ask for clarification more often, and more frequently than humans actually chose to interrupt it. For example, for turns where there was high goal complexity, humans interrupted Claude 7.1% of the time, while Claude asked for clarification more than double that, 16.4% of the time. That compares to minimal goal complexity, where humans interrupted 5.5% of the time, with Claude asking for clarification 6.6% of the time. In other words, the gap between how much humans intervene and how much Claude asks for clarification increases alongside the complexity of the task.

However, these aren't exactly direct measures, as humans interrupt Claude and Claude interrupts itself for different reasons. The number one reason that humans interrupt Claude is to provide missing context or corrections. That's 32% of the time, about a third.

17% of the time, it was because Claude was slow or hanging, with every other reason being much less frequent. In terms of when Claude stops itself, the most common reason, at a little above a third (35%), is to present the user with a choice between different approaches. Which is interesting, because that's not really a knock on its own autonomy, in the sense that it doesn't necessarily need that information to proceed (it could theoretically just make the decision for itself), but rather a way to better align with humans up front.

Now, the one other really interesting chart is the chart of which domains agents are deployed in. As you might expect, especially given that this is anchored by Claude Code, software engineering represents around half of the tool calls overall. And although the other categories are all below 10%, they kind of read like a map of where agentic automation is likely to come next. Back-office automation is number two at 9.1%, followed by marketing and copywriting at 4.4%, sales and CRM at 4.3%, and finance and accounting at 4.0%. It is notable that even at this early stage, with coding and engineering tasks being the clear breakout, you're still already seeing more than 50% of tool calls, in other words, more than 50% of agentic use cases, being outside of that software engineering domain.
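Taking the category shares quoted here at face value, the "more than 50% non-engineering" claim can be sanity-checked in a few lines. Note that the software engineering figure below is an assumed value standing in for "around half," and the remaining categories are lumped into an "other" bucket:

```python
# Rounded tool-call shares quoted above (percent). "around half" for
# software engineering is assumed to be 49.0 for illustration; the
# long tail of smaller categories is lumped into "other".
shares = {
    "software engineering": 49.0,  # assumed stand-in for "around half"
    "back-office automation": 9.1,
    "marketing and copywriting": 4.4,
    "sales and CRM": 4.3,
    "finance and accounting": 4.0,
}
shares["other"] = 100.0 - sum(shares.values())

non_engineering = 100.0 - shares["software engineering"]
print(f"non-engineering share: {non_engineering:.1f}%")  # -> 51.0%
```

The arithmetic is trivial, but it makes the structure of the claim clear: as long as software engineering is anything under half, the combined non-engineering categories, named and unnamed, are the majority of tool calls.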

This is a pretty simple study overall, but a really valuable complement, in my estimation, to the METR study, as it moves away from the realm of the theoretical and into the realm of what people are actually using agents for and how they're actually interacting with them.

There are a few interesting implications that people picked up on. David Hendrickson wrote, "What's most surprising from the paper is that real world AI agents are currently given much less autonomy than they could technically handle." In other words, we had to go to the 99.9th percentile to really see what Claude could do, despite the fact that the median turn is just 45 seconds. We've talked a lot on the show about a capability overhang, and it looks like this is another example of that in practice, even with some of the most advanced tools in the space.

Another interesting takeaway is about a shift in our thinking about autonomy, from purely model capability to this more complex view of model capability plus human interactive state. Yong Riu writes, "Autonomy is not just steps taken. It is permission, scope, and ability to change state."

The other thing people are exploring is, based on all this, what they actually want the interactive mode to look like in the future. Richie on X, for example, writes, "Need a Claude Code mode that isn't exactly dangerously-skip-permissions, but can skip pointless 'do you want to proceed' questions, and at the same time doesn't nuke my entire database and family tree." Lorenzo responds, "What you want is competent autonomy: Claude can skip pointless prompts while respecting blast radius boundaries so dev stays sane and prod stays intact."

Now, one thing to watch for is how much the emphasis in the next set of developments is improved interactions, or a totally different paradigm of long-duration autonomy. In a recent podcast with Lenny, OpenAI's Sherwin Wu argued, as the AI Daily Brief put it, that the next leap in AI isn't just smarter models but long-duration autonomy. While today's tools are optimized for short bursts, tomorrow's tools will be agents you dispatch for 6-plus hours of independent work. Right now, as Anthropic shows, that certainly isn't how people are using these tools, but it does appear that things are evolving fast.

Overall, a very valuable study and a great way to see what's happening in practice. For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and until next time, peace.
