
Live coding session with Boris Cherny and Jarred Sumner

By Claude

Summary

Topics Covered

  • AI Can Now Outsource the Writing
  • Adversarial Code Review Between Bots
  • LLMs Eliminate the Switching Cost Tax
  • Model 4.7 Enables True Hill Climbing
  • Every Bottleneck Becomes the Next Automation Target

Full Transcript

Please welcome to the stage the head of Claude Code at Anthropic, Boris Cherny, and the creator of Bun at Anthropic, Jarred Sumner.

Hello. All right, so this is a developer conference. We're going to be doing a little bit of talking, but mostly we're just going to be, like, coding. So this is for the developers in the room.

I'm going to start by talking a little bit about how Bun uses Claude Code to build and maintain Bun, and also kind of how our setup works, because it's a slightly more advanced setup than what's common today.

But first I'm going to get a few agents running to just fix some GitHub issues. This is classic Jarred, doing work during a talk.

So, in Bun's repo, every time somebody submits an issue, we have a Claude bot automatically run and try to reproduce the issue. You can see this person has this issue with side effects, and this is one of the most recent issues, and we can see that RoboBun, which is our bot, went and managed to reproduce the issue and submitted a PR automatically.

All these PRs always have tests; it's one of the actual hard requirements before it can submit a PR. And so the challenge here is: does this code look correct? One of the things we do to check that is, does the test fail in the previous version of Bun and pass in this debug branch? And the bot actually can't submit a PR without that

being the case. And so, just to make sure I understand: every single issue that goes up in the Bun issue tracker, you have RoboBun automatically try to reproduce it before anyone looks at it? Yeah. And this saves a lot of time, because we have so many open GitHub issues. It really moves the challenge from fixing and debugging the issue to: is this the right thing to merge? Like, is this the right fix? How

good is it? Is this doing like 100% of PRs? Is it like 10%? We can go to the insights and go to contributors. And if we go last three months, and this is specifically to main, we can see that RoboBun is now a bigger contributor to Bun than I am. And that's with merging not all of its PRs, for sure. You can see we have a lot of PRs open right now. The challenge is really: how do we know we can merge the PR? And that's the test. And then

the other thing that's really interesting about this is we have automatic code review bots that run, and they go back and forth. So like CodeRabbit leaves a comment, and then RoboBun leaves a comment, and then they go back and forth, and CodeRabbit did the...

I love this. And it also marks the comments as resolved when it's done. And you can see they actually went a lot; there's a lot of back and forth here, like 30 comments or something. And so you're using a combination of agents: this is like Claude Code review, and then also CodeRabbit, and you're using them together. Yeah. I think basically CodeRabbit is good for kind of stylistic issues, things like making sure it follows the CLAUDE.md, and then the Claude Code review is really good at: here's this really subtle edge case that would have taken me like 30 minutes of reading all the code and having all the context to figure out. You need the full context to really understand.

And I think basically it's really hard to actually have all this automation without code review that is in the loop, with Claude there replying, or not just replying but actually fixing.

And that's also a big part of what used to take so much time, like why PRs would take so long to merge: you'd have to check out the branch locally, fix a lint error, run the linter locally, then push it back up. There's all this switching cost that's constantly there.

And so I think this is an especially good use case for LLMs, because otherwise it just takes up so much time to ship. And I guess especially for the Bun codebase, because it's systems code, it's very easy to repro an issue and then see if the issue is fixed. This is kind of back to what we were talking about before, with this kind of verification loop: it's all systems code, so it's really like a test case on a particular architecture, and you can essentially repro or verify anything. Yeah, one thing that makes it easier in Bun's codebase is that it's a CLI tool, so we don't need to run a browser to test things. But you could also just have something set up to take a screenshot or record a video, those sorts of things. In Bun's case, we don't need that, at least not yet. There's a couple of things we could do that for; we have some front-end stuff where that would be nice.
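The PR gate described earlier, where RoboBun's new test must fail on the previous build and pass on the fix branch, can be sketched roughly like this. The types and names here are illustrative, not Bun's actual tooling; in the real pipeline the two runs would shell out to the old and new Bun binaries.

```typescript
// Illustrative sketch of the PR gate: a fix may only become a PR if its new
// test FAILS on the baseline build and PASSES on the patched build. The two
// run results are passed in so the decision logic is visible.
type TestRun = { exitCode: number };

function canSubmitPR(onBaseline: TestRun, onPatched: TestRun): boolean {
  const reproducesBug = onBaseline.exitCode !== 0; // test fails on old build
  const fixVerified = onPatched.exitCode === 0;    // test passes on the branch
  return reproducesBug && fixVerified;
}

// A test that passes on both builds proves nothing about the fix:
console.log(canSubmitPR({ exitCode: 1 }, { exitCode: 0 })); // true
console.log(canSubmitPR({ exitCode: 0 }, { exitCode: 0 })); // false
```

The failing baseline run is the key step: it proves the new test actually exercises the bug, not just the happy path.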

But, yeah, I think the direction that's really interesting here is that it saves so much time. And this is specifically for Bun, but the more generalizable thing, because most products are not open source, is: instead of an issue, maybe the starting point is a customer support ticket. So you could imagine automatically passing customer support tickets to a Claude bot, having it go try to reproduce the issue and submit a PR, and then having code review go back and forth. That's where I think, for a lot of companies, it becomes a lot more impactful, because it just saves so much developer time. We should think of some kind of name for this pattern. It's like adversarial code review or something like that. Yeah, I don't know.

But I do think there are a few other things about this where, if you just do this, it doesn't quite work. The very first step is to make sure the development environment is set up. I think this has been talked about before, but CLAUDE.md is very important, because otherwise it's going to submit PRs that don't quite make sense for you to merge. So, for example, we very much emphasize in Bun's codebase that it runs this special command to do the build.

And this both builds and runs the command, forwarding the arguments, because that's one confusing thing: since Bun has to be compiled, you want to make sure it's running the actual changes and not a stale debug build. We also go into a lot of detail about how to run tests, how to write tests, and where to put the tests.

And a lot of: here are all the issues that we've run into previously. Basically, the pattern here is that every time you find yourself repeating something, it should probably go in the CLAUDE.md. Because the question now is, how do you make it maintainable to have lots of Claudes running all the time? And to do that, it needs to be written down. It needs to be documented.

So a really small detail: to make sure that Claude sees the error message, we make it print the error message before the less informative assertions. So it's sort of like, you have Claude write a test, and the test is bad or something about it doesn't work, and you see this repeated once or twice, and then you just add it to the CLAUDE.md so that every time in the future it writes a test, it does it correctly the first time. So this is like compound engineering. And then it's also helpful to give an overview of where all the folders are, how the code is laid out, and about dependencies.
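A hedged sketch of what a CLAUDE.md along these lines might contain; every command name and path below is illustrative, not Bun's actual setup:

```markdown
# CLAUDE.md (illustrative sketch)

## Building and running
- Always build and run through `./scripts/run-debug.sh <args>` (hypothetical
  name): it rebuilds first and forwards arguments, so you never test against
  a stale debug build.

## Tests
- Every fix needs a test. Run tests against the debug build, never a release
  build.
- Print the full error output before any terse assertion, so the real failure
  reason is visible in the logs.

## Layout
- `src/`: the runtime. `test/`: tests that run under the debug build.
- Note where each subsystem lives and which dependencies it pulls in.
```

The point is not the exact contents but the habit: anything you find yourself repeating to the agent gets written down here.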

Another thing I think is interesting is making sure that it can read your CI errors and build logs. You want to set up the agent to be able to do the full loop: writing the code, testing that the code works, monitoring CI, and reading all the errors, so that by the time it gets to a person, everything is set up. The ideal is that you read the code and you have very clear indications that you can be high-confidence to merge it. And the only way for that to be true is if it is set up for success.

It's interesting. I remember when we first met, you were talking about your vision of everyone being able to run hundreds of agents in parallel and how that would work, and I feel like I didn't really get it at the time. Now I'm running hundreds of agents every single night, and I feel like I'm finally there. But this is a thing you've been thinking about for a long time, so it feels like this is the setup needed to scale up agents way more: you need the self-verification so that agents can run autonomously.
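The full loop just described, write code, run the checks, read the errors, retry, can be sketched as a small control flow. The steps are injected here so the shape is visible; a real agent would shell out to git, the test runner, and CI instead. All names are illustrative.

```typescript
// Hedged sketch of the self-verification loop: the agent edits, verifies,
// and feeds its own errors back in until everything is green or it gives up.
type StepResult = { ok: boolean; errors: string };

async function fullLoop(
  writeCode: (feedback: string) => Promise<void>,
  runChecks: () => Promise<StepResult>, // tests + CI combined, for brevity
  maxAttempts: number,
): Promise<boolean> {
  let feedback = "";
  for (let i = 0; i < maxAttempts; i++) {
    await writeCode(feedback);        // agent edits based on the last errors
    const result = await runChecks(); // the self-verification step
    if (result.ok) return true;       // green: ready for a human to review
    feedback = result.errors;         // feed the errors back in and retry
  }
  return false; // escalate to a person
}

// Demo: a fake agent that "fixes" the build on its second attempt.
let attempts = 0;
fullLoop(
  async () => { attempts++; },
  async () => ({ ok: attempts >= 2, errors: "test failed" }),
  5,
).then((green) => console.log(green, attempts)); // true 2
```

Without the verification half of this loop, scaling up the number of agents just scales up the number of unreviewed, unverified diffs.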

Yeah, this has gone through many iterations in Bun's codebase. We previously just had a Discord bot where I could @-mention the bot and it would spin up a container. It didn't have the CI stuff, it didn't have the code review stuff, and it's so much better now, especially with Opus 4.7.

All this stuff is getting so much better. Oh yeah, we can also check on how it's going. It looks like it created the first PR, and it wrote tests. So maybe while we look at this, I'm curious, just to get a show of hands from people in the room: as you think about your development process, raise your hand if it looks something like this, where you have a bunch of terminal windows or desktop tabs and you're kind of pasting in issues. Okay.

So that's maybe half of people. And then what about something more like RoboBun, where it's closing the loop a little more, like the next level of abstraction? Starting to get there?

Yeah. I think it's not surprising, because model capabilities are just getting there. I think 4.7 is the first model where it's really felt like it's able to do this. In the past, maybe you could do it with a bunch of scaffolding: you just throw a bunch of tokens at it and it can kind of work. But now it's efficient enough that you can actually do this day to day. Yeah. Let's see.

So the first PR is there. Let's see if it did any others. Okay, it did two PRs.

This looks very plausible. That's cool. And this before/after, do you tell it to do that? Sometimes it does it. It's pretty good about knowing when it should do that, like when it's a string formatting thing. It also kept the Bun style of the label, which is good, because Node's style is slightly different.

Let's see, does this change look good?

Yeah. Mostly what I'm thinking right now is: it did this, and this is good, because you don't want to write one byte at a time, you want to write in chunks. And then it used saturation, but I don't like that; it shouldn't have to do that. Does anyone here actually know Zig? Sort of. This is how I feel looking at this. And you can see all the patterns from the CLAUDE.md: await using, and then that pattern of resolving all the promises at the same time. And so what's your workflow? When

you see something like this, are you usually going in and commenting, or are you just going to wait for code review to come in and drop a comment? It usually depends on how complicated it is. This one is actually pretty simple; I feel pretty high confidence that if the tests pass, I would probably merge this. But I still would wait for the code review, at least the Claude code review one, to run, just in case.

Because what I really like about that is it will find things that aren't in the diff, from tracing the control flow, which is what you want from a human reviewing it: somebody who has a lot of context, who can think through all the edge cases this might run into. And the signal-to-noise ratio is pretty good; maybe something like 10% of the time it's wrong. Compare that to how it used to be with other code review products we've tried, where you basically had to ignore most of what it said. That's pretty

cool. How long has something like this worked? Is it a latest-model thing, or have you had RoboBun and this kind of automated repro, automated fixing, this whole pipeline, for a while? How long has that actually been possible? We can probably see this in a chart somewhere. That's

kind of a lot of commits, but I think that might not be on main.

That might be the Rust thing.

Yeah, I heard Bun is going to be written in Rust soon. Is that right? I don't know. I just have a Claude running, and we'll see what happens. But you can see the volume of commits there was kind of lower, and then it's definitely gone up a lot.

Now really the bottleneck is: do I feel good about merging this? Am I confident that its changes are correct? And that's new, because it used to be that the code wasn't good enough. What do you think is left? What's it going to take, is there a missing tool or a missing model capability or model version, before you feel like RoboBun can fully close the loop? An issue comes in, and the fix goes out automatically. I think it needs a little bit more; it takes a lot of time to verify the changes are correct. This was kind of already true, like, whenever a person pushes up a PR.

But I think the challenge is like, how do we make sure to communicate sufficient proof that the changes are correct?

Or making it easier to roll back things. I think those are kind of the two directions. But I think, for the majority of simple issues, we should probably be pressing merge a lot more, and the bottleneck now is actually CI: fully running the code and making sure all the test stuff works. I think it's basically there for those. The large projects are still non-trivial, but I've also been doing some pretty large PRs lately with Claude, not as much with RoboBun, but with Claude Code. We recently added support for a built-in image processing library to Bun; I could probably pull up the PR. That was Claude, and we did a bunch of follow-up PRs too.

Yeah, it's interesting, because I think when I look at different people using Claude Code, everyone is at a different level of sophistication or adoption of this. And I think for me the hardest thing is that the model changes very often, so I have to constantly retune and recalibrate to what it can do. And as an engineer that's hard, because it's a very weird technology. It's the first technology I've used that's like that.

And I sort of feel like the way you do it is ahead of how the Claude Code team does it for Claude Code itself. And to me, the way the Claude Code team does it is actually very automated, but this is even further ahead. This is almost like full liftoff, fully closed loop. In the last two weeks, we've added an HTTP/3 server to Bun. There's a PR for an HTTP/2 server. There's fetch support for HTTP/3 and HTTP/2. There's this image processing API. There's the ongoing Rust rewrite, which may not ship. That's the most ambitious one I've done so far. And

"done" is too strong a word, because it's very much not done. And so even something like this, this is a benchmark. So Claude ran this benchmark for you? Yeah, Claude ran this benchmark. This ran separately, on a Linux box. And I was like, make it faster than sharp, and that is basically what it did. I gave it a few ideas, like: oh, you could try to read this code in JavaScriptCore to figure out how to avoid cloning the typed array when it's not strictly necessary. But it then went and did it and figured it out, and, yeah, it's pretty crazy, honestly, because none of this would have worked several months ago.

Yeah, within Anthropic, within an AI lab, you call this kind of thing hill climbing. It's this idea that if you give the model some sort of metric, and you give it a way to verify its result, you can just make it iterate and keep going until it hits that metric. And this is something 4.7, I think, is uniquely good at. I think it's really underutilized, because it's the first model that's actually very good at that. If you give it a target, a way to improve the performance, and a way to measure, it'll just keep going until it's done, if you let it go in auto mode.
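The hill-climbing recipe described here, a metric plus a verifier plus iteration, can be sketched with a toy metric. In practice the metric would be a real benchmark and the candidate steps would be model-proposed code changes; everything below is illustrative.

```typescript
// Hedged sketch of hill climbing: keep taking candidate steps as long as
// some candidate measurably improves the metric, stop when none does.
function hillClimb(
  score: (x: number) => number,     // the measurable metric (lower is better)
  propose: (x: number) => number[], // candidate next steps from here
  start: number,
  maxIters: number,
): number {
  let best = start;
  for (let i = 0; i < maxIters; i++) {
    // Verify each candidate against the metric; keep only a real improvement.
    const next = propose(best).find((c) => score(c) < score(best));
    if (next === undefined) break; // no candidate improves: stop climbing
    best = next;
  }
  return best;
}

// Toy metric: distance from 10. Candidates: step left or right by 1.
const result = hillClimb(
  (x) => Math.abs(x - 10),
  (x) => [x - 1, x + 1],
  0,
  100,
);
console.log(result); // 10
```

The verifier is what makes the loop safe to run unattended: without a measurable score, "keep going until it's done" has no stopping condition.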

Yeah. And you can also see this is another case where the code review comments were really, really helpful, because in this PR there are like a hundred comments or something, and it's just going and fixing everything. It goes on for a while. In the meantime, you're just working on something else. Yeah, this was not the thing I was 100% focused on; I was maybe 10% focused on this. I was doing like five things at once.

And this definitely wasn't possible like six months ago.

Even three months ago. It's very recent that this is doable. Okay.

So how are our sessions doing? Yeah, so we have one PR there. There's almost

another PR coming up, it looks like, pretty soon. This

one should be the trickiest one.

It mostly looks good, though.

It looks plausible based on these changes. I wouldn't exactly do it this way; I think we need a better, more optimized way to do this, because that's a lot of checks. And looking at your setup here, you mostly use the CLI? Yeah. And do you always use auto mode for permissions?

Yeah, and before that I used dangerously-skip-permissions. Oh no. It can delete stuff if you do that. I think I'm not supposed to recommend that. But it's just not fun to wait for Claude to press approve, because you go off and do something else and then it's just been sitting there. That's why auto mode is really good: it's actually a real way to fix that instead of just trusting.

And the little composer is stuck to the bottom of the screen, so you're using no-flicker mode. Yeah, I'm using no flicker. Honestly, I think we should just make that the default, because it's so much better. You can see I can scroll really fast; you could scroll fast before, but sometimes there would be a flicker, and now there's not. Have folks tried no-flicker mode for the CLI? Yeah, a few people? We launched it on April Fools. In hindsight, it came across as a joke a little bit. But if you set CLAUDE_CODE_NO_FLICKER=1, just set that environment variable, we totally rewrote the renderer that's running in the CLI. It's using virtualized scrolling and virtualized selection, which means constant memory usage, constant CPU usage, and also some

nice stuff: if Jarred types, he can actually click around the composer. You can actually click, and mouse events work, which is pretty crazy for a terminal. So I'm just also having it monitor the PR.

And you can see it ran some commands, and then it's going to go to sleep for 20 minutes and wake back up. 20 minutes is probably a little bit too long, but it's okay. And is that using a loop or something? I think so. Yeah. And then, let's see, how else is it doing? The other ones are still going; apparently it fixed an extra bug as well.
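The monitor loop just described, check the PR, sleep, wake back up, is a simple polling pattern. A rough sketch, with the status source and sleep injected for clarity (a real agent would call the GitHub API and actually wait):

```typescript
// Hedged sketch of "monitor the PR": poll a status check, sleep, repeat
// until the PR is green or failed. All names here are illustrative.
type PRStatus = "pending" | "green" | "failed";

async function monitorPR(
  fetchStatus: () => Promise<PRStatus>,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 20 * 60 * 1000, // 20 minutes, as in the talk
): Promise<PRStatus> {
  while (true) {
    const status = await fetchStatus();
    if (status !== "pending") return status; // done: act on the result
    await sleep(intervalMs); // go to sleep, wake back up, check again
  }
}

// Demo with a fake status source that turns green on the third poll.
let polls = 0;
const fake = async (): Promise<PRStatus> => (++polls < 3 ? "pending" : "green");
monitorPR(fake, async () => {}).then((s) => console.log(s, polls)); // green 3
```

Injecting the sleep function also makes the loop testable without waiting 20 minutes per poll.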

Okay, so it's been, what, like 20 or 25 minutes. How many PRs have we gotten? Three PRs. That's not bad. I

think we'll get a fourth one once it finishes running the test. In

the meantime, RoboBun is still running and generating even more PRs. Yeah. Every time somebody submits an issue, it tries to reproduce it. Yeah. I kind of feel like the way Claude Code makes you think is: every time there's a new bottleneck, you have to automate that bottleneck, and then there's always some other bottleneck after, and you move on to that. It started with writing code as the bottleneck, and now that's no longer the bottleneck. Then verification and running tests was the bottleneck, and that's no longer the bottleneck. Now there's maybe a deeper layer of verification. What do you

think are the bottlenecks remaining? It's definitely this deeper layer of verification. I feel like the bottleneck after that is going to be planning: what should we do, what should we not do, and what is the right way to fix this. Ideally, Claude would be smart enough, or we could trust Claude enough, to merge the PRs by itself. In certain projects you could probably do that, and just have it be completely automatic. It's not yet there for Bun, but I think it'd be really cool if we had the tooling to feel confident enough to do that. So right now, RoboBun doesn't build features?

It doesn't do feature requests yet? That's true, yeah. It doesn't do feature requests automatically, but we do also use it for that sometimes. We can @-mention it in either Discord or Slack, and it will try to implement the feature.

So sometimes when people are like, hey, bun is missing this thing, then I just mention the bot and maybe like an hour later, there's a PR.

A bunch of times, somebody's tweeted at me something like this: can you fix this bug or whatever? And that's basically what I do, and then I reply with a link to the PR. Should we add a RoboBun account on Twitter?

So it can do feature requests, but I'm hesitant for it to implement literally everything anybody asks for in a GitHub issue, because that's kind of a lot. In some ways it's kind of crazy to put something like an image processing library inside Bun, but we talk about engineering taste, and there's an element of taste that goes into that. Like, you felt that's a good idea, and we're not sure yet if Claude is at the point where it would also think it's a good idea, but at some point in the future it'll get there. Yes. And I do think PRs have become suggestions. It used to be that you'd feel bad if you didn't merge a co-worker's PR, because they put work into it. But you don't have to feel bad when it's Claude.

So if the PR is wrong, for whatever reason, you can just not merge it. But it does mean the bar for what you merge becomes: should it be there? Because there's also a difference with people: you don't want people to feel bad about their lost work.

So in some ways, it does actually end up raising the bar for what you decide to merge. Yeah, it's interesting. As the bottlenecks move, the dynamics change a little bit. It's sort of like having to trust each other, having to trust people on the team. This kind of changes a little bit. Now it's a little bit more about do we have the right automation and do we trust automation

as a group? Yeah. So I think we're almost at time. Is there maybe one last thing we want to show people, where we can check in on the progress we've made? I don't think so.

Yeah, it's still going on that fourth PR. It's going back and forth: found a bug, then fixed the bug. Looks like it's about to submit the PR now.

We'll have one more. This is the cool thing about auto mode: I can let Claude run for hours and hours at a time. I run it almost every night; I'll have a bunch of Claudes running in auto mode. Before this, it just didn't work, because it always got stuck at some kind of permission request. That was crazy.

Okay, so it's pushing it. It's about to submit the PR.

That sounds like the right fix, too. This has been an issue that's been open for a long time. Okay.

And we got a PR.

Yeah, we can go to this issue and see how many upvotes it has. 20. Yeah. That's kind of a lot.

Cool. Maybe we can pause there. But to me, this is just such a cool vision of where engineering is going, I think, for everyone in this room. And, you know, we're going to see this first, we're going to have to figure it out first, and then everyone else is going to have to figure it out. So, like we were talking about this morning, just excited to be on this journey together. You can see we haven't figured everything out yet, but I think the mode we're in is just constantly experimenting, constantly trying to see what the next bottleneck is so we can solve it. It's very exciting. It's so cool.
