
Claude Code: Anthropic's CLI Agent

By Latent Space


Topics Covered

  • Why build Claude Code in a terminal?
  • How do you predict model capabilities in 3 months?
  • Do the simplest thing first in AI products?
  • When will models need less human input?
  • How does Claude Code boost engineering productivity?

Full Transcript

Hey everyone, welcome to the latest Latent Space podcast.

This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host swyx, founder of Smol AI.

Hey, and today we're in the studio with Cat Wu and Boris Cherny. Welcome.

Thanks for having us. Thank you. Uh Cat, you and I know each other from before.

I just realized: Dagster as well, and then Index Ventures, and now Anthropic.

Exactly. Um, it's so cool to see a friend that you know from before now working at Anthropic and shipping really cool stuff.

And Boris, you're a celebrity, because we were just having you outside getting coffee and people recognized you from your video.

Oh wow. Right. That's new.

Wasn't that wasn't that neat? Um, yeah.

I definitely I had that experience like once or twice in the last few weeks.

Yeah, it was surprising. Yeah.

Well thank you for making the time.

We're here to talk about Claude Code.

Most people probably have heard of it; we think quite a few people have tried it. But let's get a crisp upfront definition: what is Claude Code?

Yeah, so Claude Code is Claude in the terminal. So, you know, Claude has a bunch of different interfaces.

There's desktop, there's web, and Claude Code runs in your terminal. And because it runs in the terminal, it has access to a bunch of stuff that you just don't get if you're running on the web or on desktop or whatever.

So it can run bash commands.

It can see all of the files in the current directory, and it does all of that agentically. And I guess maybe the question under the question is: where did this idea come from? Part of it was we just wanted to learn how people use agents. We're doing this with the CLI form factor because coding is kind of a natural place where people use agents today, and there's kind of product-market fit for this thing. But yeah, it started as this crazy research project, and obviously it's kind of bare bones and simple. But yeah, it's an agent in your terminal.
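For readers who haven't tried it, the basic flow looks roughly like this (a minimal sketch; the npm package name is the one Anthropic published at launch, and details may have changed since):

```bash
# Install the CLI globally
npm install -g @anthropic-ai/claude-code

cd my-project   # run it from the repo you want Claude to work in
claude          # starts the interactive agent session in your terminal
```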

That's how the best stuff starts. Yeah.

How did it start? Did you have a master plan to build Claude Code?

There's no master plan.

Uh when I joined Anthropic, I was experimenting with different ways to use the model kind of in different places.

And the way I was doing that was through the public API, the same API that everyone else has access to. And one of the really weird experiments was this Claude that runs in a terminal. And I was using it for kind of weird stuff.

I was using it to like look at what music I was listening to and react to that and then you know like screenshot my you know video player and explain what's happening there and things like this.

And this was like kind of a pretty quick thing to build and it was pretty fun to play around with. And then at some point I gave it access to the terminal and the ability to code and suddenly it just felt very useful like I was using this thing every day. It kind of expanded from there.

We gave the core team access, and they all started using it every day, which was pretty surprising. And then we gave all the engineers and researchers at Anthropic access, and pretty soon everyone was using it every day. I remember we had this DAU chart for internal users, and I was just watching it, and it was vertical for days. And we were like: all right, there's something here. We've got to give this to external people so everyone else can try this too. And yeah, that's where it came from.

And were you also working with Boris already? Or did this come out and start growing, and then you were like, okay, we need to make this a team, so to speak?

" Yeah, the original team was Boris, Sid and Ben.

And over time, as more people were adopting the tool, we felt like, okay, we really have to invest in supporting it, because all our researchers are using it, and this is like our one lever to make them really productive.

And so at that point I was using Claude Code to build some visualizations.

I was analyzing a bunch of data, and sometimes it's super useful to spin up a Streamlit app and see all the aggregate stats at once, and Claude Code made it really, really easy to do.

So I think I sent Boris like a bunch of feedback and at some point Boris was like do you want to just work on this?

And so that's how it happened.

It was actually a little more than that on my side. You were sending all this feedback, and at the same time we were looking for a PM. We were looking at a few people, and I remember telling the manager: hey, I want Cat.

I'm sure people are curious what the process is within Anthropic to graduate one of these projects. So you have a lot of growth, then you get a PM. When did you decide, okay, it's ready to be opened up?

Generally at Anthropic we have this product principle of doing the simple thing first, and I think the way we build product is really based on that principle.

So you kind of staff things as little as you can and keep things as scrappy as you can because the constraints are actually pretty helpful.

And for this case, we wanted to see some signs of product market fit before we scaled it.

Yeah, I imagine so. We're putting out the MCP episode this week, and I imagine MCP also now has a team around it in much the same way.

It is now very much officially an Anthropic product. So I'm kind of curious, for Cat: how do you view PMing something like this?

I guess you're sort of grooming the roadmap.

You're listening to users, and the velocity is something I've never seen coming out of Anthropic.

I think I PM with a pretty light touch.

Um I think Boris and the team are like extremely strong product thinkers and for the vast majority of the features on our road map, it's actually just like people building the thing that they wish that the product had.

So very little actually is top-down.

I feel like I'm mainly there to like clear the path if anything gets in the way and just make sure that we're all good to go from like a legal marketing, etc. perspective. Yeah.

And then, in terms of the very broad or long-term roadmap, I think the whole team comes together and thinks about: okay, what do we think models will be really good at in 3 months? And let's just make sure that what we're building is really compatible with the future of what models are capable of.

I'd be interested to double-click on this.

what will models be good at in 3 months?

Because I think that's something that people always say to think about when building AI products, but nobody knows how to think about it, because everyone's just like: it's generically getting better all the time, we're getting AGI soon, so don't bother, you know. How do you calibrate 3 months of progress?

I think if you look back historically, we tend to ship models every couple of months or so.

So 3 months is just like an arbitrary number that I picked. I think the direction that we want our models to go in is being able to accomplish more and more complex tasks with as much autonomy as possible.

And so this includes things like making sure that the models are able to explore and find the right information that they need to accomplish a task.

Making sure that models are thorough in accomplishing every aspect of a task. Making sure the models can like compose different tools together effectively.

Yeah, these are directions we care about. Yeah, and I guess, coming back to Code: this kind of approach affected the way that we built Code too, because we know that if we wanted some product that has very broad product-market fit today, we would build, you know, a Cursor or a Windsurf or something like this. These are awesome products that so many people use every day.

I use them. Um, that's not the product that we want to build.

We want to build something that's kind of much earlier on that curve and something that will maybe be a big product, you know, a year from now or, you know however much time from now as the model improves.

And that's why code runs in a terminal.

It's a lot more bare bones.

You have raw access to the model because we didn't spend time building all this kind of nice UI and scaffolding on top of it.

When it comes to the harness, so to speak, and the things you want to put around it: there's maybe prompt optimization. So obviously I use Cursor every day, and there's a lot going on in Cursor beyond my prompt, for optimization and whatnot. But I know you recently released compacting-context features and all that. How do you decide how thick it needs to be on top of the CLI, since that's kind of the shared interface? And at what point do you decide between: okay, this should be part of Claude Code, versus this is just something for the IDE people to figure out, for example?

Yeah, there are kind of three layers at which we can build something. So, being an AI company, the most natural way to build anything is to just build it into the model and have the model do the behavior.

The next layer is probably scaffolding on top, such as Claude Code itself. And the layer after that is using Claude Code as a tool in a broader workflow.

So, to compose stuff in: for example, a lot of people use Claude Code with tmux to manage a bunch of windows and a bunch of sessions happening in parallel.

we don't need to build all of that in.
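A minimal sketch of that composition, assuming nothing beyond tmux and the claude binary (session names are arbitrary):

```bash
# Several Claude Code sessions running in parallel, one per tmux session
tmux new-session -d -s claude-a 'claude'   # agent working on task A
tmux new-session -d -s claude-b 'claude'   # agent working on task B
tmux attach -t claude-a                    # hop between them as needed
```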

Compact is sort of this thing that kind of has to live in the middle, because it's something that we want to work whenever you use Claude Code.

You shouldn't have to pull in extra tools on top of it.

And rewriting memory in this way isn't something the model can do today.

So, you have to use a tool for it.

And so it kind of has to live within that middle area. We tried a bunch of different options for compacting, you know, like rewriting old tool calls, and truncating old messages but not new messages.

And then in the end we actually just did the simplest thing, which is: ask Claude to summarize the previous messages and just return that, and that's it.

And it's funny: when the model is so good, the simple thing usually works.

You don't have to over-engineer.

Yeah, we do that for Claude Plays Pokémon too, which is kind of interesting, to see that pattern reemerging.
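Conceptually, the compaction step is just that summarize-and-continue move. A rough sketch of the idea in headless mode; `transcript.txt` here is a stand-in for the session history, not an internal file, and this is not the actual implementation:

```bash
# Summarize the conversation so far, then carry on from the summary alone
claude -p "Summarize this conversation so the work can continue from the
summary alone. Keep file paths, decisions, and open tasks." \
  < transcript.txt > summary.txt
```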

And then you have the CLAUDE.md file for the more user-driven memories, so to speak. It's kind of the equivalent of maybe Cursor rules, I would say.

Yeah. And CLAUDE.md is another example of this idea of, you know, do the simple thing first. We had all these crazy ideas about memory architectures, and, you know, there's so much literature about this.

There are so many different external products about this, and we wanted to be inspired by all that stuff, but in the end the thing we did was ship the simplest thing, which is: it's a file that has some stuff in it, and it's auto-read into context. And there are now a few versions of this file.

You can put it in the root, or you can put it in child directories, or you can put it in your home directory, and we'll read all of these in kind of different ways. But yeah, the simplest thing that could work.
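To make the shape concrete, here's a hypothetical CLAUDE.md; the contents are invented, but the mechanics (auto-read from the repo root, child directories, or your home directory) are as Boris describes:

```bash
# A hypothetical CLAUDE.md at the repo root
cat > CLAUDE.md <<'EOF'
# Project notes for Claude
- Run tests with: bun test
- Use src/lib/fetch.ts for network calls, never the built-in fetch
- Prefer small PRs; run the linter before committing
EOF
```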

I'm sure you're familiar with Aider, which is another thing that people in our Discord loved; and then when Claude Code came out, the same people loved Claude Code. Any thoughts on inspiration that you took from it, things you did differently, maybe design principles where you went a different way?

Yeah, this is actually related to the moment I got AGI-pilled.

Okay, so maybe I can tell that story.

Yeah. So Aider inspired this internal tool that we used to have at Anthropic called Clyde. So Clyde is, you know, CLI Claude, and that's the predecessor to Claude Code. It's kind of this research tool that's written in Python.

It takes like a minute to start up.

It's like very much written by researchers.

It's not a polished product. And when I first joined Anthropic, I was putting up my first pull request. You know, I hand-wrote this pull request, because I didn't know any better. And my boot camp buddy at the time, Adam Wolff, was like: you know, actually, maybe instead of handwriting it, just ask Clyde to write it.

And I was like, okay, I guess so.

It's an AI lab. Maybe there's some, you know, capability I didn't know about.

And so I start up this like terminal tool.

And it took like a minute to start up. And I asked Claude: hey, you know, here's the description.

Can you make a PR for me? And after a few minutes of chugging along, it made a PR, and it worked. And I was just blown away, because I had no idea. I just had no clue that there were tools that could do this kind of thing. I thought that, you know, single-line autocomplete was the state of the art before I joined. And that's the moment where I got AGI-pilled, and yeah, that's where Claude Code came from.

So yeah, Aider inspired Clyde, which inspired Claude Code.

So, very much a big fan of Aider. It's an awesome product.

I think people are interested in comparing and contrasting, obviously. To you, obviously, this is the house tool; you work on it. People are interested in figuring out how to choose between tools. There's the Cursors of the world, there's the Devins of the world, there's Aider, and there's Claude Code, and we can't try everything all at once. My question would be: where do you place it in the universe of options?

Well, you could ask Claude to rank all these tools, and I wonder what it would say.

No self-favoring at all. Claude plays... Claude plays engineer.

I don't know. We use all these tools in-house too; we're big fans of all this stuff. Claude Code is obviously a little different from some of these other tools in that it's a lot more raw. Like I said, there isn't this kind of big, beautiful UI on top of it. It's raw access to the model; it's as raw as it gets. So if you want to use a power tool that lets you access the model directly and use Claude for automating big workloads (for example, if you have a thousand lint violations and you want to start a thousand instances of Claude, have each fix one, and then make a PR), then Claude Code is a pretty good tool.

Got it. It's a tool for power workloads, for power users. And I think that's kind of where it fits.

Yeah, it's the idea of parallel versus single-path, one way to think about it. The IDE is really focused on what you want to do, versus Claude Code, which you see more as requiring less supervision.

You can kind of spin up a lot of them.

Is that the the right mental model?

Yeah. And there's some people at Anthropic that have been racking up like thousands of dollars a day with this kind of automation. Most people don't do anything like that, but you totally could do something like that. Yeah.

We think of it as a Unix utility, right? So the same way that you would compose, you know, grep or cat or awk or something like that, you can compose Claude Code into workflows.
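In that spirit, both the piping and the fan-out look something like this (the violations file and the prompts are placeholders):

```bash
# Pipe data in like any Unix tool (headless -p mode reads stdin)
git diff | claude -p "Review this diff and flag anything suspicious"

# Fan out, per the thousand-lint-violations example above
while read -r file; do
  claude -p "Fix the lint violations in $file and open a PR" &
done < violations.txt
wait
```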

The cost thing is interesting. Do people pay internally or do you get free?

If you work at Anthropic, you can just run this thing as much as you want every day.

It's free internally. Nice.

Yeah. I I think if everybody had it for free, it would be huge.

Because, I mean, if I think about it: I pay Cursor 20 bucks a month.

I use millions and millions of tokens in Cursor; that would cost me a lot more in Claude Code.

And so I think a lot of people that I've talked to don't actually understand how much it costs to do these things. And they'll do a task and they're like: oh, that cost 20 cents.

I can't believe I paid that much.

" How do you think going back to like the product side too? It's like how much do you think of that being your responsibility to try and make it more efficient versus that's not really what we're trying to do with the tool?

We really see Claude Code as the tool that gives you the smartest capabilities out of the model. We do care about cost insofar as it's very correlated with latency, and we want to make sure that this tool is extremely snappy to use and extremely thorough in its work.

We want to be very intentional about all the tokens that it produces. I think we can do more to communicate the costs to users. Currently we're seeing costs around $6 per day per active user, so it does come out a bit higher over the course of a month than Cursor. But I don't think it's out of band, and that's roughly how we're thinking about it.

I would add that I think the way I think about it is it's a ROI question.

It's not a cost question.

And so if you think about an average engineer's salary (we were talking about this before the podcast), engineers are very expensive, and if you can make an engineer 50 to 70% more productive, that's worth a lot. I think that's the way to think about it.

So if you're targeting Claude Code at the most powerful end of the spectrum, as opposed to the less powerful but faster, cheaper end: people typically recommend a waterfall, right? You try the fast, simple one; that doesn't work, so you upgrade, you upgrade, you upgrade, and finally you hit Claude Code. At least for people who are token-constrained and don't work at Anthropic. And part of me wants to just fast-track all of that. I just want to fan out to everything all at once, and once I'm not satisfied with one solution, I'd just switch to the next.

I I don't know if that's real.

Yeah, we're definitely trying to make it a little easier to make Claude Code kind of the tool that you use for all the different workloads.

So, for example, we launched uh thinking recently.

So, for any kind of planning workload where you might have used other tools before, you can just ask Claude, and that'll use, you know, chain of thought to think stuff out. I think we'll get there.

Maybe we'll do it this way.

How about we recap the brief history of Claude Code? Between when you launched and now, there have been quite a few ships. How would you highlight the major ones? And then we'll get to the thinking tool.

And I think I'd have to check your Twitter to remember everything.

Um I think a big one that we've gotten a lot of requests for is web fetch. Yep.

So we worked really closely with our legal team to make sure that we shipped as secure of an implementation as possible.

So we'll only web fetch if a user directly provides a URL, whether that's in their CLAUDE.md or in their message directly, or if a URL is mentioned in one of the previously fetched URLs. And this way, enterprises can feel pretty secure about letting their developers continue to use it.

We shipped a bunch of like auto features like autocomplete where you can press tab to complete a file name or file path.

Auto compact so that users feel like they have like infinite context since we'll compact behind the scenes.

And we also shipped auto-accept, because we noticed that a lot of users were like: hey, Claude Code can figure it out.

I've developed a lot of trust in Claude Code. I want it to just autonomously edit my files, run tests, and then come back to me later.

So, those are some of the big ones.

Vim mode, custom slash commands.

People love Vim mode. So, that was a that was a top request, too.

That one went pretty viral.

Yeah. Yeah. Uh, memory, that was a recent one. So, like the hashtag to remember.

So, yeah. I mean, uh, I'd love to dive into, you know, on the technical side, any of them that was particularly challenging.

Paul from Aider always says how much of it was coded by Aider. So then the question is: how much of it was coded by Claude Code? Obviously there's some percentage, but I wonder if you have a number, like 50, 80?

Pretty high. Probably near 80, I'd say.

That's very high.

A lot of human code review, though.

Yeah, a lot of human code review. I think some of the stuff has to be handwritten, and some of the code can be written by Claude, and there's sort of a wisdom in knowing which one to pick, and what percent for each kind of task.

So usually where we start is Claude writes the code and then if it's not good, then maybe a human will dive in.

There's also some stuff where like I actually prefer to do it by hand. So, it's like, you know, intricate data model refactoring or something.

I won't leave it to Claude, because I have really strong opinions, and it's easier to just do it and experiment than it is to explain it to Claude.

So yeah, I think that nets out to maybe 80 to 90% Claude-written code overall.

Yeah, we're hearing a lot of that in our portfolio companies, more like Series A companies: 80 to 85% of the code they write is AI-generated. Yeah.

Yeah. Well, that's a whole different discussion.

On custom slash commands, I had a question. How do you think about custom slash commands and MCPs? How does this all tie together?

You know, are slash commands in Claude Code kind of an extension of MCP? Are people building things that should not be MCPs but are just kind of self-contained things in there?

How should people think about it?

Yeah I mean obviously we're big fans of MCP.

You can use MCP to do a lot of different things.

You can use it for custom tools and custom commands and all this stuff, but at the same time, you shouldn't have to use it. So if you just want something really simple and local, essentially a prompt that's been saved, just use local commands for that.

Over time something that we've been thinking a lot about is how to reexpose things in convenient ways.

So, for example, let's say you had this local command.

Could you reexpose that as an MCP prompt?

Yeah, because Claude Code is an MCP client and an MCP server. Or, let's say you pass in a custom bash tool: is there a way to re-expose that as an MCP tool?

Yeah we think generally you shouldn't have to be tied to a particular technology.

You should use whatever works for you.

Yeah, because there's stuff like Puppeteer.

I think that's a great thing to use with Claude Code, right, for testing.

There's a Puppeteer MCP server, but then people can also write their own slash commands.

And I'm curious where MCPs are going to end up: maybe each slash command leverages MCPs, but no command itself is an MCP, because it ends up being customized. I think that's what people are still trying to figure out.

It's like should this be in the runtime or in the MCP server? I think people haven't quite figured out where the line is.

Yeah, for something like Puppeteer, I think that probably belongs in MCP, because there are a few tool calls that go into that too. And so it's probably nice to encapsulate that in the MCP server.

Whereas slash commands are actually just prompts; they're not actually tools. We're thinking about how to expose more customizability options, so that people can bring their own tools or turn off some of the tools that Claude Code comes with.

But there's also some trickiness there, because we want to make sure that the tools people bring are things that Claude is able to understand, and that people don't accidentally inhibit their experience by maybe bringing a tool that is confusing to Claude.

So we're just trying to work through the UX of it.

Yeah, I'll give an example also of how this stuff connects for Claude Code.

Internally in the GitHub repo, we have this GitHub action that runs.

And the GitHub Action invokes Claude Code with a local slash command.

And the slash command is lint. So it just runs a linter using Claude. And it does a bunch of things that are pretty tricky to do with a traditional linter based on static analysis.

So for example, it'll check for spelling mistakes, but also checks that code matches comments.

It also checks that, you know, we use a particular library for network fetches instead of the built-in library.

There's a bunch of these specific things that we check that are pretty difficult to express just with lint. And in theory you could go in and, you know, write a bunch of lint rules for this. Some of it you could cover; some of it you probably couldn't.

But honestly, it's much easier to just write one bullet in markdown in a local command and just commit that.

And so what we do is: Claude runs through the GitHub Action. We invoke it with /project:lint,

which just invokes that local command.

It'll run the linter.

It'll identify any mistakes.

It'll make the code changes, and then it'll use the GitHub MCP server to commit the changes back to the PR. And so you can kind of compose these tools together.

And I think that's a lot of the way we think about Claude Code: it's just one tool in an ecosystem that composes nicely, without being opinionated about any particular piece.
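A hypothetical version of that lint command, to make it concrete. Project slash commands are markdown files under .claude/commands/, invoked as /project:&lt;name&gt;; the rules below are invented:

```bash
mkdir -p .claude/commands
cat > .claude/commands/lint.md <<'EOF'
Act as a semantic linter for the changed files:
- flag spelling mistakes in identifiers and comments
- flag comments that no longer match the code they describe
- flag network fetches that bypass our in-house fetch library
Fix what you find, then commit the changes back to the PR branch.
EOF
```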

It's interesting. I have a weird chapter in my CV: I was the CLI maintainer for Netlify, so I have a little bit of a background there. There's a decompilation of Claude Code out there, which seems to have since been taken down, but it seems like you use Commander.js and React Ink; that's the public info about this. And I'm just kind of curious: at some point you're not even building Claude Code, you're kind of building a general-purpose CLI framework that any developer can hack to their purposes.

Do you ever think about whether this level of configurability is more of a CLI framework, or some new form factor that didn't exist before?

Yeah, it's definitely been fun to hack on a really awesome CLI, because there are not that many of them.

But yeah, we're big fans of Ink. Vadim Demedes... we actually use React Ink for a lot of our projects. Oh, cool. Yeah.

Yeah. Yeah. Um, yeah, Ink is amazing.

It's sort of hacky and janky in a lot of ways. You have React, and then the renderer is just translating the React code into ANSI escape codes as the way to render.

And there's all sorts of stuff that just doesn't work at all, because ANSI escape codes are, you know, this thing that started being written in the 1970s, and there's no really great spec for it.

Every terminal is a little different.

So building in this way feels to me a little bit like building for the browser back in the day, where you had to think about Internet Explorer 6 versus Opera versus Firefox and whatever.

Like you have to think about these cross terminal differences a lot.

But yeah, big fans of Ink, because it helps abstract over that. We also use Bun. So, big fans of Bun.

It makes writing our tests and running tests much faster.

We don't use it in the runtime yet. It's not just for speed, but you tell me. Yeah.

I don't want to I don't want to put words in your mouth, but my impression is they help you ship the compilation, the executable.

Yeah, exactly. So, we use Bun to compile the code together.

Yeah. Any other pluses of Bun?

I just want to track the Bun versus Deno conversations.

Yeah, because Deno's in there, you know.

I actually haven't used Deno in a while. I remember... yeah, Ryan made it back in the day, and there were some ideas in it that I think were very cool, but it just never took off to the same degree.

Yeah, there are still a lot of cool ideas, like being able to just import from any URL instead of from npm. I think that's the dream of ESM.

Yeah, very cool. Okay. I was going to ask you about one other feature, then we can get to the thinking tool: auto-accept. I have this little thing I'm trying to develop thinking around, for trust in agents, right?

When do you say: all right, go autonomous? When do you pull the developer in? Sometimes you let the model decide; sometimes you're like: this is a destructive action, always ask me. And I'm just curious if you have any internal heuristics around when to auto-accept, and where all this is going.

We're spending a lot of time building out the permission system. Robert on our team is leading this work. We think it's really important to give developers the control to say: hey, these are the allowed permissions.

Generally, this includes stuff like the model is always allowed to read files or read anything.

And then it's up to the user to say whether it's allowed to edit files, or to run tests.

These are like probably the safest three actions.

And then there's a long list of other actions that users can either allow-list or deny-list based on regex matches against the action.
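As a sketch, per-project rules of that shape can live in a settings file; treat the exact schema and tool-pattern syntax below as illustrative rather than authoritative:

```bash
# Hypothetical allow/deny rules for a project
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "allow": ["Bash(git status)", "Bash(git diff:*)", "Edit"],
    "deny":  ["Bash(rm:*)", "WebFetch"]
  }
}
EOF
```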

Can writing a file ever be unsafe if you have version control?

I think there are a few different aspects of safety to think about, so it could be useful just to break that out a little bit.

So for file editing, it's actually less about safety, I think. Although there is still a safety risk, because what might happen is: let's say the model fetches a URL, and there's a prompt injection attack in the URL, and then the model writes malicious code to disk and you don't realize it. Although, you know, there is code review as a separate layer of protection there. But I think generally, for file writes, the model might just do the wrong thing; that's the biggest thing. And what we find is that if the model is doing something wrong, it's better to identify that earlier and correct it earlier, and then you're going to have a better time.

If you wait for the model to just go down this like totally wrong path and then correct it 10 minutes later, you're going to have a bad time.

So it's usually better to identify failures right away.

But at the same time, there are some cases where you just want to let the model go. So, for example, if Claude Code is writing tests for me, I'll just hit shift-tab to enter auto-accept mode, and just let it run the tests and iterate on the tests until they pass, because I know that's a pretty safe thing to do.

And then for some other tools, like the bash tool, it's pretty different, because Claude could run, you know, rm -rf /, and that would suck, right?

That's not a good thing. So we definitely want people to be in the loop to catch stuff like that. The model is, you know, trained and aligned to not do that, but these are non-deterministic systems, so you still want a human in the loop.

Yeah, I think that generally the way things are trending is toward less time between human input.

Did you see the METR paper?

No.

They establish a Moore's law for time between human input, basically, and the idea is that it's doubling every 3 to 7 months.

And Anthropic is currently doing super well on that benchmark: it's roughly autonomous for 50 minutes at the 50th percentile of human effort, which is kind of cool. Highly recommend it.

Yeah, I put Cursor in YOLO mode all the time and just run it. But it's vibe coding, right?

Like this is all of spade.

And there's a couple things that are interesting when you talked about alignment and the model being trained.

So I always put it in a Docker container, and I have it prefix every command with the docker compose command. And yesterday my Docker server was not started, and it was like: oh, Docker is not running, let me just run it outside of Docker. And I'm like: whoa, whoa, whoa, you should start Docker and run it in Docker. You cannot go outside.

So that is a very good example of how, you know, sometimes you think it's doing something, and then it's doing something else.
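One way to set up the kind of pinning Alessio describes is a CLAUDE.md rule; the wording below is hypothetical:

```bash
cat >> CLAUDE.md <<'EOF'
- Never run commands on the host. Prefix every shell command with:
  docker compose exec app <command>
EOF
```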

And on the review side, I would love to just chat about that more.

I think the linter part that you mentioned, maybe people skipped over it; it doesn't register the first time. But going from rule-based linting to semantic linting, I think, is great and super important.

And I think a lot of companies are trying to figure out how to do autonomous PR review, and I've not seen one that I use so far.

they're all kind of like mid.

So I'm curious how you think about closing the loop, or making that better, and figuring out, especially, what you're supposed to review, because these PRs get pretty big when you vibe code.

You know, sometimes I'm like: oh, wow. LGTM. You know, it's like: am I really supposed to read all of this? Most of it seems pretty standard, but I'm sure there are parts in there that the model would understand are kind of out of distribution, so to speak, and worth really looking at.

So yeah, I know it's a very open-ended question, but any thoughts you have would be great.

Yeah, we have some experiments where Claude is doing code review internally.

We're not super happy with the results yet.

So it's not something that we want to open up quite yet. The way we're thinking about it is: Claude Code, like I said before, is a primitive. So if you want to use it to build a code review tool, you can do this. If you want to, you know, build a security scanning or vulnerability scanning tool, you can do that.

If you want to build a semantic linter, you can do that.

And hopefully, with Claude Code, if you want to do this, it's just a few lines of code. And you can just have Claude write that code too, because Claude is really great at writing GitHub Actions. Yeah.

One thing to mention is we do have a non-interactive mode, which is how we use Claude in these situations to automate Claude Code. And a lot of the companies using Claude Code actually use this non-interactive mode.

So they'll for example say hey I have like hundreds of thousands of tests in my repo.

Some of them are out of date, some of them are flaky, and they'll send Claude Code to look at each of these tests and decide: okay, how can I update any of them?

Like should I deprecate some of them?

How do I like increase our code coverage?

So that's been a really cool way that people are using Claude Code non-interactively.

What are the best practices here? Because when it's non-interactive, it could run forever, and you're not necessarily reviewing the output of everything, right?

So I'm just kind of curious: how is it different in non-interactive mode?

What are like the most important hyperparameters or arguments to set?

Yeah. And for folks that haven't used it: non-interactive mode is just claude -p, and then you pass in the prompt in quotes, and that's all it is.

It's just the -p flag. Generally, it's best for tasks that are read-only.

That's the place where it works really well, and you don't super have to think about permissions and running forever and things like that. So, for example, a linter that runs and doesn't fix any issues. Or, for example, we're working on a thing where we use Claude with -p to generate the changelog for Claude Code.

So every PR, it's just looking over the commit history and deciding: okay, this makes it into the changelog, this doesn't. Because we know people have been requesting a changelog, so we're just getting Claude to build it.
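A sketch of what that kind of read-only headless run might look like (the prompt and output path are invented, not the team's actual setup):

```bash
# Headless changelog draft: read-only, so no special permissions needed
claude -p "Look over the commits since the last release tag and draft
user-facing changelog entries. Do not modify any files." > changelog-draft.md
```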

So: non-interactive mode is really good for read-only tasks. For tasks where you want to write, the thing we usually recommend is to pass in a very specific set of permissions on the command line.

So what you can do is pass in --allowedTools, and then you can allow a specific tool.

So, for example, not just bash wholesale, but, for example, git status or git diff.

So you just give it a set of tools that it can use, or, you know, the edit tool.

It still has the default tools: file read, grep, system tools like bash and ls, memory tools, all of those. But --allowedTools just lets you pre-accept uses in place of the permission prompt, because you don't have that in non-interactive mode. And we'd also definitely recommend that you start small. So test it on one test, make sure that it has reasonable behavior, iterate on your prompt, then scale it up to 10, make sure that it succeeds, or if it fails, analyze what the patterns of failure are, and gradually scale up from there.

So definitely don't kick off a run to fix 100,000 tests.
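For write tasks, the pre-accepted permissions look roughly like this; the flag is Claude Code's documented --allowedTools, while the prompt and tool list are illustrative:

```bash
claude -p "Fix the flaky assertions in tests/login.test.ts" \
  --allowedTools "Edit" "Bash(git status)" "Bash(git diff)"
```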

Yeah. At this point there's this tagline in my head that basically, at Anthropic, there's Claude Code generating code and then Claude Code also reviewing its own code. At some point, right, different people are setting all this up; you don't really govern that, but it's happening.

Yeah, we have to be... you know, at Anthropic there's still a human in the loop; we're reviewing. And I think for ASL this is important, for general model alignment and safety.

What's ASL?

Oh, ASL, it's the safety levels.

Yeah. Right. Right. What does it stand for?

AI Safety Level.

It's essentially... Sorry, I'm not used to the acronyms. Yeah, we have a lot of these.

But you've published stuff.

I know. I just don't know what they're called internally.

Yeah, exactly. But it's essentially: as the model gets more capable, it hits, you know... ASL-5 is kind of the highest level.

That's where, you know, the model is capable of fooling a user if it wants to, and kind of exfiltrating itself: breaking out of its container and replicating itself across other containers.

We're not Eliezer Yudkowsky... ah, like, yeah, this is where the line goes vertical.

We're at two. We're at two right now.

Yeah, we're kind of bordering on three right now. So I think at three, four, and five, you start having to think a lot more carefully about this, because hopefully the model is aligned, but in case it's not aligned, you need a human in the loop in the right ways.

The point of the thing I was thinking about was: we have, you know, VPs of engineering and CTOs listening. This is all well and good for the individual developer, but for the people who are responsible for the tech, the entire codebase, the engineering decisions: all this is going on, and I manage, like, a hundred developers; any of them could be doing any of this at this point. What do I do to manage this? How does my code review process change? How does my change management change?

We've talked to a lot of VPs and CTOs about it, and they actually tend to be quite excited, because they experiment with the tool. They download it, they ask it a few questions, and when Claude Code gives them sensible answers, they're really excited, because they're like: oh, I can understand this nuance in the codebase. And sometimes they even ship small features with Claude Code. And I think through that process of interacting with the tool, they build a lot of trust in it, and a lot of folks actually come to us and ask: how can I roll this out more broadly?

And then we'll often have sessions with, like, VPs of dev prod, and talk about these concerns around how we make sure people are writing high-quality code.

I think in general it's still very much up to the individual developer to hold themselves up to a very high standard for the quality of code that they merge.

Even if we use Claude Code to write a lot of our code, it's still up to the individual who merges it to be responsible for it being well-maintained, well-documented code that has reasonable abstractions.

And so I think that's something that will continue to happen: Claude Code isn't its own engineer that's committing code by itself.

It's still very much up to the ICs to be responsible for the code that's produced.

Yeah, I think Claude Code also makes a lot of this stuff, a lot of quality work, a lot easier.

So, for example, I have not manually written a unit test in many months, and we have a lot of unit tests. And it's because Claude writes all the tests. And, you know, before, I felt like a jerk if on someone's PR I'm like: hey, can you write a test?

And you know, they kind of know they should probably write a test and that's probably the right thing to do.

And somewhere in their head they made that trade-off where they just want to ship faster.

And so you always kind of feel like a jerk for asking, but now I always ask, because Claude can just write the test, right?

And you know there's no human work.

You just ask Claude to do it, and it writes it. And I think with writing tests becoming easier, and with writing lint rules becoming easier, it's actually much easier to have high-quality code than it was before.

What are the metrics that you believe in?

Like, a lot of people actually don't believe in 100% code coverage, because sometimes that's optimizing for the wrong thing, arguably, I don't know. But obviously you have a lot of experience with different code quality metrics: what still makes sense?

I think it's very engineering-team dependent, honestly. I wish there was a one-size-fits-all answer. For some teams, test coverage is extremely important. For other teams, type coverage is very important, especially if you're working in a very strictly typed language.

And, you know, for example, avoiding anys in JavaScript and Python.

Yeah, I think complexity kind of gets a lot of flak, but it's still honestly a pretty good metric, just because there isn't anything better in terms of ways to measure code quality. Okay.

And then productivity, obviously not lines of code, but do you care about measuring productivity?

I'm sure you do. Yeah.

You know, lines of code honestly isn't terrible.

Oh god, it's uh it has downsides.

Yeah, it's... well, lines of code is terrible for a lot of reasons.

Yes. But it's really hard to make anything better. So, it's the least terrible.

It's the least terrible.

There's like lines of code maybe like number of PRs, how green your GitHub is.

Yeah. Yeah. Yeah. The two that we're really trying to nail down are one decrease in cycle time. So, how much faster are your features shipping because you're using these tools.

So that might be something like the time between first commit and when your PR is merged.

It's very tricky to get right, but it's one of the ones that we're targeting.

The other one that we want to measure more rigorously is the number of features that you wouldn't have otherwise built. We have a lot of channels where we get customer feedback, and one of the patterns that we've seen with Claude Code is that sometimes customer support or customer success will post: hey, this app has this bug. And then sometimes, 10 minutes later, one of the engineers on that team will be like: Claude Code made a fix for it. And in a lot of those situations, when you ping them and you're like, hey, that was really cool,

" They were like, "Yeah, um without cloud code, I probably wouldn't have done that because it would have been too much of a divergence from what I was otherwise going to do.

It would have just ended up in this long backlog.

" So, this is the kind of stuff that we really want to measure more rigorously.

That was the other AGI-pilled moment for me. There was a really early version of Claude Code, many, many months ago.

And this one engineer at Anthropic, Jeremy, built a bot that looked through a particular feedback channel on Slack, and he hooked it up to Claude Code to have it automatically put up PRs with fixes for all the stuff. It couldn't fix every issue, but it fixed a lot of the issues. Was it 10%, 50%? This was early on, so I don't remember the number, but it was surprisingly high, to the point where I became a believer in this kind of workflow, and I wasn't before.

As a PM, isn't that scary too, in a way? You can build too many things. It's almost like maybe you shouldn't build that many things.

I think that's what I'm struggling with the most.

It gives you the ability to create, create, create, but then at some point you've got to support, support, support.

This is the Jurassic Park thing: your scientists were so preoccupied with whether they could... Yeah. Yeah. Exactly.

But no, we should... yeah. How do you make decisions now that the cost of actually implementing the thing is going down? As a PM, how do you decide what is actually worth doing?

Yeah, we definitely still hold a very high bar for net new features. Most of the fixes were like, hey, this functionality is broken or this like there's a weird edge case that we hadn't addressed yet.

So it was very much like smoothing out the rough edges as opposed to building something completely net new.

For net new features, I think we hold a pretty high bar that it's very intuitive to use.

The new user experience is like minimal.

It's just like obvious that it works.

We sometimes actually use Claude Code to prototype instead of writing docs.

Yeah. So you'll have like prototypes that you can play around with and that often gives us a faster feel for hey is this feature ready yet or like is this the right abstraction? Is this the right interaction pattern?

So it gets us faster to feeling really confident about a feature, but it doesn't circumvent the process of us making sure that the feature definitely fits the product vision.

It's interesting how, as it gets easier to build stuff, it changes the way that I write software. Like Cat was saying: before, I would write a big design doc, and I would think about a problem for a long time before I would build it, for some set of problems. And now I'll just ask Claude Code to prototype, like, three versions of it, and I'll try the feature and see which one I like better. And that informs me much better and much faster than a doc would have.

Yeah, I think we haven't totally internalized that transition yet in the industry.

Yeah, I feel the same the same way for some tools I build internally.

People ask me, could we do this? And I'm like I'll just Yeah, just build it.

It's like, well, I feel it feels pretty good.

We should like polish it, you know, or sometimes it's like, no, that's not.

It's comforting that your max cost... I mean, even at Anthropic, where it's theoretically unlimited, the cost is roughly $6 a day. That gives people peace of mind, because I'm like: $6 a day, fine.

$600 a day, we have to talk, like, you know. Yeah. I pay 200 bucks a month to make Studio Ghibli photos.

So it's all it's all good. That is totally worth it.

You mentioned internal tools, and that's actually a really big use case that we're seeing emerge. Because a lot of times, if you're working on something operationally intensive, if you can spin up an internal dashboard for it, or an operational tool where you can, for example, grant access to a thousand emails at once: a lot of these things don't really need a super polished design.

You kind of just need something that works.

And Claude Code's really good at those kinds of zero-to-one tasks. Like, we use Streamlit internally, and there's been a proliferation of how much we're able to visualize. And because we're able to visualize it, we're able to see patterns that we wouldn't have otherwise, if we were just looking at raw data.

Yeah. Like, I was working on this side website last week, and I just showed Claude Code the mock.

So I just took the screenshot, dragged and dropped it into the terminal, and was like: hey Claude, here's the mock, can you implement it? And it implemented it, and, you know, it sort of worked.

It was a little bit crummy, and I was like: all right, now look at it in Puppeteer, and iterate on it until it looks like the mock. And it did that three or four times, and then the thing looked like the mock.

Yeah, this was just all manual work before.

I wanted to ask about two other features of, I guess, the overall agent pieces that we mentioned.

So I'm interested in memory as well.

So we talked about auto-compact, and memory using hashtags and stuff.

My impression is that, like you say, the simplest approach works. But I'm curious if you've seen any other requests that are interesting to you, or internal hacks of memory that people have explored, that you might want to surface to others.

There's a bunch of different approaches to memory. Most of them use external stores of various sorts.

There's Chroma. Yeah, exactly.

Yeah, there are a lot of projects like that, and it's either key-value or kind of like graphs; those are the two big shapes for these.

Are you a believer in knowledge graphs for this stuff?

You know, if you had talked to me before I joined Anthropic and this team, I would have said: yeah, definitely. But now, actually, I feel everything is the model. That's the thing that wins in the end, and as the model gets better, it subsumes everything else. So, you know, at some point the model will encode its own knowledge graph; it'll encode its own KV store, if you just give it the right tools.

Yeah, but I think for the specific tools, there's still a lot of room for experimentation that we just don't know about yet.

In some ways, are we just coping for lack of context length?

Like are we doing things for memory now that if we had like a 100 million token context window we don't care about?

It's interesting.

I would love to have 100 million token context for sure.

Some people have claimed to have done it.

We don't know if it's true or not.

But I guess here's a question for you Sean.

If you took all the world's knowledge and you put it in your brain. Yeah.

And let's say, you know, there was like some treatment that you could get to make it so your brain can have any amount of context.

You have like infinite neurons.

Is that something that you would want to do or would you still want to record knowledge externally?

Putting it in my head is different from me trying to use an agent tool to do it, because I'm trying to control the agent. I'm trying to make myself unlimited, but I want to make the tools I use limited, because then I know how to control them.

And it's not even a safety argument. It's just more like: I want to know what you know, and if you don't know a thing, sometimes that's good. It's the ability to audit what went in. And I don't know if this is small-brain thinking, because this is not very bitter-lesson, but sometimes you just want to control every part of what goes into the context, and the more you just, you know, Jesus-take-the-wheel trust the model, the less idea you have of what it's paying attention to.

Yeah. Yeah. I don't know.

Did you see the mech interpretability stuff from Chris Olah and the team, from like last week? Yes.

What about it? I I wonder if something like this is the future.

So there's an easier way to audit the model itself. Mhm. And so if you want to see like what what is stored, you can just audit the model.

Yeah. The main salient thing is that they know what features activate per token, and they can tune a feature up, suppress it, whatever. But I don't know if it goes down to the individual, like, item of knowledge from context, you know. Not yet. Yeah. But I wonder, you know, maybe that's the bitter-lesson version of it. Right. Right.

Any other comments from memory? Otherwise, we can move on to planning and thinking.

We've been seeing people play around with memory in quite interesting ways, like having Claude write a logbook of all the actions that it's done, so that over time Claude develops this understanding of what your team does, what you do within your team, what your goals are, how you like to approach work.

We would love to figure out what the most generalized version of this is, so that we can share it broadly. I think with things like Claude Code, it's actually less work to implement the feature, and a lot of work to tune these features to make sure that they work well for general audiences, across a broad range of use cases.

So there's a lot of interesting stuff within memory and we just want to make sure that it works well out of the box before we share it broadly.

Agree with that. I think there's a lot more to be developed here.

I guess a related problem to memory is uh how do you get stuff into context? Knowledge base.

Knowledge base, yeah. And originally, very early versions of Claude Code actually used RAG. So we indexed the codebase, and I think we were just using Voyage, so, you know, just off-the-shelf RAG. And that worked pretty well, and we tried a few different versions of it.

There was RAG, and then we tried a few different kinds of search tools, and eventually we landed on just agentic search as the way to do stuff.

And there were two big reasons, maybe three big reasons.

So one is it outperformed everything by a lot. By a lot. And this was surprising.

In what benchmark?

Um this was just vibes. So internal vibes.

There are some internal benchmarks too, but mostly vibes. It just felt better.

Agentic RAG, meaning you just let it look things up in however many cycles it needs?

Yeah. Just using regular code searching, you know: glob, grep.

Um just regular code. Regular code search.

Yeah. Yeah. That was like one.

And then the second one was: there's this whole indexing step that you have to do for RAG, and there's a lot of complexity that comes with that, because the code drifts out of sync with the index. And then there are security issues, because this index has to live somewhere, and, you know, what if that provider gets hacked? So it's just a lot of liability for a company. Even for our codebase, it's very sensitive.

So we kind of don't want to upload it to a third-party thing.

It could be a first-party thing, but then we still have this out-of-sync issue, and agentic search just sidesteps all of that.

So essentially, at the cost of latency and tokens, you now have really awesome search without the security downsides.
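For flavor, the agent's search loop bottoms out in ordinary repo commands like these (file and symbol names invented):

```bash
grep -rn "compactContext" src --include='*.ts'  # find the call sites
ls src/context/                                 # orient in the tree
sed -n '1,40p' src/context/compact.ts           # read a candidate file
```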

Tied up with memory is planning, right? Memory is like what I like to do, and then planning is: now use those memories to come up with a plan to do these things. Or maybe put it as: memory is the past, what we already did, and the plan is what we will do, and it just crosses over at some point.

Yeah. I think the maybe slightly confusing thing from the outside is what you define as thinking. So there's extended thinking. There's the think tool. And there's kind of thinking as in planning, which is thinking before execution, and then there's thinking while you're doing, which is the think tool. Can you maybe just run people through the differences?

I'm really confused listening to you do that.

Why?

Well, it's one tool. So Claude can think if you ask it to think.

Generally, the usage pattern that works best is: you ask Claude to do a little bit of research, like use some tools, pull some code into context, and then ask it to think about it. And then it can make a plan, and, you know, do a planning step before you execute.

There are some tools that have explicit planning modes; Roo Code has this, Cline has this, and other tools have it, where you can shift between plan and act mode, or maybe a few different modes. We've sort of thought about this approach, but I think our approach to product is similar to our approach to the model, which is: bitter lesson. Just freeform, keep it really simple, keep it close to the metal. And so if you want Claude to think, just tell it to think. Be like: make a plan, think hard, don't write any code yet. And it should generally follow that. And you can do that as you go, too. So maybe there's a planning stage, and then Claude writes some code or whatever, and then you can ask it to think and plan a little bit more. You can do that anytime.
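So the "plan mode" equivalent is just a prompt, something like this (wording illustrative):

```bash
# Start a session with a planning-only instruction
claude "Read src/permissions/ and make a plan for adding deny-list
support. Think hard. Don't write any code yet."
```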

Yeah, I was reading the think tool blog post, and it said, while it sounds similar to extended thinking, it's a different concept.

Extended thinking is what Claude does before it starts generating, and the think tool is once it starts generating, how do you add a stop-and-think? Is this all done by the Claude Code harness, so people don't really have to think about the difference between the two? That's basically the idea?

Yeah, you don't have to think about it. Okay. And that is helpful, because sometimes I'm like, man, am I not thinking right?

Yeah, it is. And it's all chain of thought, actually, in Claude Code.

So we don't use the think tool.

Anytime that Claude Code does thinking, it's all chain of thought.

I had an insight.

This is again something we discussed before recording. In the Claude Plays Pokémon hackathon, we had access to Morph's branching-environments feature, which meant we could take any VM state, branch it, play it forward a little bit, and use that in the planning. And the TL;DR of yesterday was basically that it's too expensive to always do that at every point in time, but if you give it as a tool to Claude and prompt it to use that tool in certain cases, it seems to make sense.

I'm just kind of curious about your takes on sandboxing, environment branching, rewindability overall. Maybe just something you immediately brought up that I didn't think about.

Is that useful for Claude, or does Claude have no opinions about it? Yeah, I could talk for hours about this.

Claude probably can too if you ask me.

Let's get original tokens from you and then we can train Claude on that. By the way, that's like explicitly what this podcast is.

We're just generating tokens for people.

Is this is this the pre-training or the post training?

It's a pre-training data set. Like, we got to get in there. Oh man.

Yeah. How do I buy... how do I get some tokens?

Starting with sandboxing. Ideally, the thing we want is to always run code in a Docker container. Then it has freedom, and you can snapshot, you know, with other kinds of tools layered on top.

You can snapshot rewind do all this stuff.

Unfortunately, working with a Docker container for everything is just a lot of work, and most people aren't going to do it. So we want some way to simulate some of these things without having to go full container.
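For anyone who does want the full-container version, here is a minimal sketch of the idea; the base image, mount layout, and install step are assumptions about your setup, not an official recipe:

```bash
# Run Claude Code inside a throwaway container so it can act freely;
# only the mounted project directory outlives the container.
docker run -it --rm \
  -v "$PWD":/workspace -w /workspace \
  -e ANTHROPIC_API_KEY \
  node:20 \
  bash -c "npm install -g @anthropic-ai/claude-code && claude"
```

Snapshot and rewind then come from whatever container tooling you layer on top, which is exactly the part that's still a lot of work.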

There's some stuff you can do today.

So for example, something I'll do sometimes is, if I have a planning question or a research-type question, I'll ask Claude to investigate a few paths in parallel.

And you can do this today if you just ask it.

So say, you know, I want to refactor X to do Y. Can you research three separate ideas for how to do it?

Do it in parallel. Use three agents to do it.

And so in the UI, when you see a task, that's actually a sub-Claude, a sub-agent that does this.

And usually when I do something hairy, I'll ask it to investigate three times or five times or however many times in parallel, and then Claude will pick the best option and summarize it for you.
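A sketch of what that kind of request looks like as a single prompt, assuming the -p print flag; the wording and refactor subject are illustrative:

```bash
# Each "task" that appears in the UI is one of the parallel sub-agents.
claude -p "I want to refactor the config loader to support YAML. Research three
separate approaches in parallel, using three agents, then compare them and
recommend one. Don't change any code yet."
```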

But how does Claude pick the best option?

Don't you want to choose? Where's the handoff between Claude should pick versus I should be the final decider?

Um I think it depends on the problem.

You can also ask Claude to present the options to you. That probably exists at a different part of the stack than Claude Code specifically. Claude Code is a CLI, so you can use it in any environment, and it's up to you to compose it together.

Should we talk about how and when models fail? Because I think that was another hot topic for you.

I'll just leave it open: how do you observe Claude Code failing?

There's definitely a lot of room for improvement in the models which I think is very exciting.

Most of our research team actually uses Claude Code day-to-day, so it's been a great way for them to be very hands-on and experience the model failures, which makes it a lot easier for us to target those in model training and to provide better models, not just for Claude Code but for all of our coding customers.

I think one of the things about the latest Sonnet 3.7 is that it's a very persistent model.

It's very, very motivated to accomplish the user's goal, but it sometimes takes the goal very literally, and so doesn't always fulfill the implied parts of the request, because it's so narrowed in on, I must get X done.

And so we're trying to figure out, okay, how do we give it a bit more common sense, so that it knows the line between trying very hard and, no, the user definitely doesn't want that.

Yeah. Like the classic example is, "Hey, go get this test to pass." And then, you know, five minutes later it's like, "All right, well, I hardcoded everything. The test passes." And I'm like, "No, that's not what I wanted."

Hardcoded the answer. Yeah. But that's the thing: it only gets better from here. These use cases work sometimes today, not every time.

And you know, the model sometimes tries too hard, but it only gets better.

Yeah. Yeah.

Like context, for example, is a big one. A lot of times, if you have a very long conversation and you compact a few times, maybe some of your original intent isn't as strongly present as when you first started, and so the model forgets some of what you originally told it to do. So we're really excited about things like larger effective context windows, so that you can have these gnarly, hundreds-of-thousands-of-tokens-long tasks and make sure Claude Code is on track the whole way through.

That would be a huge lift, I think, not just for Claude Code but for every coding company.

Fun story from David Hershey's keynote yesterday. He actually misses the common sense of 3.5, because 3.7 is so persistent.

3.5 actually had some entertaining stories where it apparently gave up on tasks, and 3.7 just doesn't. And when Claude 3.5 gave up, it started writing formal requests to the developers of the game to fix the game, and he has some screenshots of it, which is excellent. So if you're listening to this, you can find it on the YouTube, because we'll post it. Very, very cool.

One form of failing which I wanted to capture was something you mentioned while we were getting coffee, which is that Claude Code doesn't have that much between-session memory or caching or whatever you call it. It re-forms the whole state from whole cloth every single time, so as to make the minimum assumptions about the changes that can happen in between. So how consistent can it stay, right?

Like, I think one of the failures is that it forgets what it was doing in the past unless you explicitly opt in via CLAUDE.md or whatever. Is that something you worry about?

It's definitely something we're working on.

I think our best advice now for people who want to resume across sessions is to tell Claude, hey, write down the state of this session into a text doc. Probably not the CLAUDE.md, but a different doc.

And in your new session, tell Claude to read from that doc. But we plan to build more native ways to handle this specific workflow.
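A sketch of that handoff using the -p print mode for brevity; the file name is an arbitrary choice, and an interactive session works the same way:

```bash
# End of session: have Claude serialize its working state to a scratch doc.
claude -p "Write down the state of this task, the decisions made, and the
remaining steps into SESSION_NOTES.md."

# New session later: rehydrate from the doc before continuing.
claude -p "Read SESSION_NOTES.md and continue the task from where it left off."
```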

There's a lot of different cases of this, right?

Sometimes you don't want Claude to have the context, and it's sort of like git.

Sometimes I just want a, you know, a fresh branch that doesn't have any history, but sometimes I've been working on a PR for a while and like I need all that historical context, right?

So we want to support all these cases, and it's tricky to do one-size-fits-all, but generally our approach to Claude Code is to make sure it works out of the box for people without extra configuration.

So once we get there, we'll have something. Do you see a future in which the commits play a bigger part? Like in a pull request, how did we get here?

You know, there's a lot of history in how the code has changed within the PR that could inform the model, but today the models are mostly looking at the current state of the branch. Yeah. So Claude, for some things, will actually look at the whole history.

So for example, if you tell Claude, hey, make a PR for me, it'll look at all the changes since your branch diverged from main and take all of those into account when generating the pull request message.
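In plain git terms, "all the changes since your branch diverged from main" corresponds roughly to diffing from the merge-base; this is a sketch of the equivalent commands, not Claude Code's literal internals:

```bash
git diff main...HEAD            # three-dot form: changes since the merge-base
git log main..HEAD --oneline    # the branch's commit history Claude can draw on
```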

You might notice it running git diff as you're using it. I think it's pretty good about tracking, hey, what changes have happened on this branch so far, and making sure it understands that before continuing with the task. One thing other people have done is ask Claude to commit after every change.

You can just put that in the CLAUDE.md. There are some of these power-user workflows that I think are super interesting, like people asking Claude to commit after every change so that they can rewind really easily.

Other people are asking Claude to create a worktree every time, so that they can have a few Claudes running in parallel in the same repo.
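Sketches of both power-user workflows; the CLAUDE.md wording and directory names are illustrative:

```bash
# Commit-after-every-change: a standing instruction in the repo's CLAUDE.md.
echo "After every change you make, create a git commit with a short message." >> CLAUDE.md

# Parallel Claudes: give each one its own worktree of the same repo,
# then run `claude` in each directory in separate terminals.
git worktree add ../myrepo-feature-a -b feature-a
git worktree add ../myrepo-feature-b -b feature-b
```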

I think from our point of view, we want to support all of this. So again, Claude Code is like a primitive, and it doesn't matter what your workflow is.

It should just fit in. I know that 3.5 Haiku was the number four model on Aider when it came out.

Do you see a world in which Claude Code has, like, a commit hook that uses maybe Haiku to do the linter stuff and things like that continuously, and then you have 3.7 for more?

Yeah. You could actually do this if you want. So you're saying, like, through a pre-commit hook or a GitHub Action or... Yeah.

Yeah. Yeah. Say, kind of like, run Claude Code, like the lint example that you had.

I want to run it at each commit locally like before it goes to the PR.

Yeah. So you could do this today if you want.

So if you're using Husky or whatever pre-commit hook system, or just plain git pre-commit hooks, just add a line, claude -p, and then whatever instructions you have, and that'll run every time.
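A minimal sketch of such a hook, assuming Husky-style scripts, the -p print flag, and the --model override mentioned just below; the prompt wording and model alias are illustrative:

```bash
#!/bin/sh
# .husky/pre-commit: a quick Claude pass over the staged changes.
# Keep the instruction narrow so the hook stays fast.
claude -p "Review the staged diff for lint-level problems such as unused
imports and obvious typos. Fix only trivial issues; do not change behavior." \
  --model haiku
```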

Nice. And you just specify haiku.

It's really no different, right? It's like, maybe it'll work a little worse, but it's still supported.

Yeah, you can override the model if you want.

Generally we use Sonnet.

We default to Sonnet for most everything, just because we find that it outperforms. Yep.

But yeah, you can override the model if you want. Yeah, I don't have that much money to run a commit hook on.

Just as a side note on pre-commit hooks, I've worked in places where they insisted on having pre-commit hooks.

I've worked at places where they insisted they'd never do pre-commit hooks, because they get in the way of committing and moving quickly.

I'm just kind of curious like do you have a stance or recommendation?

Oh god, that's a little bit like asking about tabs versus spaces.

But, you know, I think it is easier in some ways: if you have a breaking test, you go fix the test with Claude Code.

In other ways, it's more expensive to run this at every point.

So, like there's trade-offs. I think for me, the biggest trade-off is you want the pre-commit hook to run pretty quickly.

So that whether you're a human or a Claude, you don't have to wait, like, a minute for it all.

So, you want the fast version.

Yeah. So, generally, you know, pre-commit for our codebase should run fast.

Yeah. Yeah, it's like less than, you know, 5 seconds or so.

Like just types and lint maybe. And then more expensive stuff you can put in the GitHub action or GitLab or whatever you're using.

Agreed. I don't know.

Like, I like putting prescriptive recommendations out there, so people can take this and go, this guy said it, we should do it on our team. That's a basis for decisions.

Yeah. Yeah. Yeah. Cool.

Any other technical stories to tell? You know, we wanted to zoom out into more producty stuff, but you can get as technical as you want. I don't know, one anecdote that might be interesting: the night before the Claude Code launch, we were going through to burn down the last few issues, and the team was up pretty late trying to do this.

And one thing that had been bugging me for a while was the markdown rendering we were using. The markdown rendering in Claude Code today is beautiful, really nice rendering in the terminal, and it does bold and headings and spacing very nicely. But we had tried a bunch of off-the-shelf libraries to do it, maybe two or three or four different libraries, and nothing was quite perfect.

Sometimes the spacing was a little off between a paragraph and a list, or the text wrapping wasn't quite correct, or the colors weren't perfect.

Each one had these issues. And all these markdown renderers are very popular, they have thousands of stars on GitHub and have been maintained for many years, but they're not really built for a terminal.

And so the night before the release, at like 10 p.m., I'm like, all right, I'm going to do this. So I just asked Claude to write a markdown parser for me, and it wrote it. Zero-shot. Yeah.

It wasn't quite zero-shot, but after maybe one or two prompts, it got it.

And yeah, that's the markdown parser that's in Claude Code today.

And the reason that markdown looks so beautiful.

That's a fun one. It's interesting what the new bar is, I guess, for implementing features. Like this exact example: there are libraries out there that you'd normally reach for, that you're dissatisfied with for literally whatever reason, and you can just spin up an alternative and go off of that. Yeah.

I feel like AI has changed so much, literally in the last year. A lot of these problems, like the example we had before, a feature you might not have built before, or you might have used a library for, now you can just do it yourself. The cost of writing code is going down and productivity is going up, and we just have not internalized what that really means yet. Yeah. But I expect a lot more people are going to start doing things like this, like writing your own libraries, or just shipping every feature. Just to zoom out: you obviously do not have a separate Claude Code subscription.

I'm curious what the road map is like.

Is this just going to be a research preview for much longer? Are you going to turn it into an actual product? I know you were talking to a lot of CTOs and VPs. Or is there going to be a Claude Code enterprise?

What's the vision?

Yeah, so we have a permanent team on Claude Code. We're growing the team.

We're really excited to support Claude Code in the long run. So yeah, we plan to be around for a while.

In terms of a subscription itself, it's something we've talked about.

It depends a lot on whether or not most users would prefer that over pay as you go.

So far, pay-as-you-go has made it really easy for people to start experiencing the product, because there's no upfront commitment. And it also makes a lot more sense in a more autonomous world in which people are scripting Claude Code a lot more. But we also hear the concern around, hey, I want more price predictability if this is going to be my go-to tool.

So we're very much still in the stages of figuring that out.

I think for enterprises, given that Claude Code is very much a productivity multiplier for ICs, most ICs can adopt it directly.

We've been supporting enterprises as they have questions around security and productivity monitoring. A lot of folks see the announcement and want to learn more, so we've been engaging in those conversations.

Do you have a credible number for the productivity improvement, like for people not in Anthropic that you've talked to? Like, are we talking 30%? Some number would help justify things.

We're working on getting this, it's something we're actively working on. But anecdotally, for me, it's probably 2x my productivity.

My god. So, I'm an engineer that codes all day, every day.

For me, it's probably 2x. Yeah. I think there are some engineers at Anthropic where it's probably 10x their productivity, and then there are some people who haven't really figured out how to use it yet, and they just use it to generate commit messages or something.

That's maybe like 10%. So I think there's probably a big range, and we need to study it more.

For reference, sometimes we're in meetings together and sales or compliance or someone is like, "Hey, we really need X feature." And then Boris will ask a few questions to understand the specs, and like ten minutes later he's like, "All right, well, it's built. I'm going to merge it later. Anything else?" So it definitely feels far different from any other PM role I've had.

Do you see yourself opening that channel, of non-technical people talking to Claude Code directly, and then the instance coming to you after they've already defined and talked through what they want, and you do the code review side and implementation?

Yeah, we've actually done a fair bit of that. Like Megan, the designer on our team, she's not a coder, but she's landing pull requests.

She uses Claude Code to do it. She designs the UI.

Yeah. And she's landing PRs to our console product.

So it's not even just building on Claude Code, it's building across our product suite in our monorepo, right? Yeah. Yeah.

And similarly, our data scientist uses Claude Code to write, like, BigQuery queries. And there was some finance person who came up to me the other day and said, hey, I've been using Claude Code. And I'm like, what? How did you even get it installed? Do you know how to use git?

And they're like, yeah, yeah, I figured it out. And they're using it.

They're like, "So, quad code you can pipe in because it's a Unix utility.

" And so what they they do is they take their data, uh, put it in a CSV and then they take the they cat the CSV, pipe it into code, and then they ask it code questions about the CSV and they they've been they've been using it for that.

Yeah, that would be really useful to me, because a lot of the time somebody gives me a feature request, I rewrite it as a prompt, I put it in agent mode, and then I review the code. It would be great to have the PR waiting for me. I'm kind of useless in the first step, taking the feature request and prompting the agent to write it. I'm not really doing anything. My work really starts after the first run is done.

So I was going to say like I can see it both ways.

So, okay, maybe I'll simplify this to: in the workflow of non-technical people in the loop, should the technical person come in at the start, or come in at the end? Or come in at the end and the start, which is obviously the highest-leverage thing, because sometimes you just need the technical person to ask the right question that the non-technical person wouldn't know to ask, and that really affects the implementation.

But isn't that the bitter lesson of the model, that the model will also be good at asking the follow-up question?

Like, you know, if you're telling the model, hey... that's what you trust the model to do the least, right? Sorry.

Go ahead. Yeah. No, if you're telling the model, hey, you are the person that needs to translate this non-technical person's request.

Yeah. Yeah.

Into the best prompt for Claude Code. Yeah.

To do a first implementation. Yeah.

Like I don't know how good the model would be today.

I don't have an eval for that.

That seems like a promising direction for me.

Like, it's easier for me to review 10 PRs than it is to take 10 requests, run the agent 10 times, wait for all those runs to be done, and then review. I think the reality is somewhere in between.

We spend a lot of time shadowing users, watching people at different levels of seniority and technical depth use Claude Code. And one thing we find is that people who are really good at prompting models, whatever their context, maybe they're not even technical, but they're just really good at prompting.

They're really effective at using Claude Code. And if you're not very good at prompting, then Claude Code tends to go off the rails more and do the wrong thing.

So at the stage where models are today, it's definitely worth taking the time to learn how to prompt models well. But I also agree that maybe in a month or two months or three months, you won't need this anymore, because the bitter lesson always wins.

Please, please do it. Please do it, Anthropic.

I think there's broad interest in people forking or customizing Claude Code.

So we have to ask: why is it not open source?

We are investigating.

Ah, okay. So it's not yet.

There are a lot of trade-offs that go into it.

On one side, our team is really small, and we'd be really excited for open source contributions if it were open source. But it's a lot of work to maintain everything. I maintain a lot of open source stuff, and a lot of other people on the team do too, and it's just a lot of work. It's a full-time job managing contributions and all of that. Yeah, and I'll just point out that you can do source-available, and that solves a lot of individual use cases without going through the legal hurdles of full open source. Yeah, exactly.

I mean, I would say there's nothing that secret in the source.

And obviously it's all JavaScript, so you can just decompile it.

Decompilations are out there. Very interesting. Yeah.

And generally our approach is, you know, all the secret sauce is in the model, and this is the thinnest possible wrapper over the model.

We literally could not build anything more minimal.

This is the most minimal thing. Yeah.

So there's just not that much in it.

If there were another architecture you'd be interested in, one that is not the simplest, what would you have picked as an alternative?

You know, and we're just talking about agentic architectures here, right?

Like, there's a loop here, and it goes through, and you sort of pull in the models and tools in a relatively intuitive way.

If you were to rewrite it from scratch and choose the generationally harder path, what would that look like?

Well Boris has rewritten this. Boris and the team have rewritten this like five times.

Oh, that's a story. Yeah.

Like, Claude Code is very much the simplest thing, I think by design. Okay. So, it just got simpler.

It got simpler. It didn't go more complex.

We've rewritten it from scratch.

Yeah. Probably every three or four weeks or something. And it's like a ship of Theseus, right?

Like, every piece keeps getting swapped out, just because Claude is so good at writing its own code. Yeah.

I mean, at the end of the day, the thing where breaking changes matter is the interface.

The CLAUDE.md, MCP, blah blah, all that has to kind of stay the same unless you really have a strong reason to change it. Yeah, I think most of the changes are to make things simpler, like sharing interfaces across different components.

Because ultimately, we just want to make sure that the context given to the model is in the purest form, and that the harness doesn't get in the way of the user's intent. So a lot of that is removing things that could get in the way or confuse the model. Yeah. On the UX side, something that's been pretty tricky.

The reason we have a designer working on a terminal app is that it's actually really hard to design for a terminal. There's just not a lot of literature on this.

Like I've been doing product for a while.

So I kind of know how to build for apps and for web, and, you know, tools for engineers in terms of DevEx, but the terminal is sort of new. There are a lot of these really old terminal UIs that use curses and things like that for very sophisticated UI systems, but they all feel really antiquated by the UI standards of today.

And so it's taken a lot of work to figure out exactly how you make the app feel fresh and modern and intuitive in a terminal. Yeah. And we've had to come up with a lot of that design language ourselves.

Yeah. I mean, I'm sure you'll keep developing it over time.

Um, cool. Closing question. This is just more general. I think a lot of people are wondering, and Anthropic has, I think it's fair to say, the best brand for AI engineering, you know, developers and coding models, and now with the coding tool attached to it, it has the whole product suite of model and tool and protocol, right? And I don't think this was obvious one year ago today. When Claude 3 launched, it was more like, these are general-purpose models and all that. But Claude Sonnet really took the scene as the coding model of choice, and I think built Anthropic's brand, and you guys are now extending it.

So why is Anthropic doing so well with developers? It seems like there's just no centralized... every time I talk to Anthropic people, they're like, oh yeah, we just had this idea and we pushed it and it did well. And I'm just like, is there no centralized strategy here, or is there an overarching strategy?

Sounds like a PM question to me. I don't know.

I would say, like, Dario is not breathing down your necks going, build the best dev tools. He's just, you know, letting you do your thing.

Everyone just wants to build awesome stuff.

It's like, I feel like the model just wants to write code. Yeah, I think a lot of this trickles down from the model itself being very good at code generation.

Like we're very much building off the backs of an incredible model.

That's the only reason Claude Code is possible. I think there are a lot of answers to why the model itself is good at code, but one high-level thing: so much of the world runs on software, and there's immense demand for great software engineers. It's also something you can do almost entirely with just a laptop, or a dev box, or some hardware. So it's an environment that's very suitable for LLMs. It's an area where we feel you can unlock a lot of economic value by being very good at it.

There's like a very direct ROI there.

We do care a lot about other areas too, but I think this is just one in which the models tend to be quite good, and the team's really excited to build products on top of it. And you're growing the team, you mentioned. Who do you want to hire?

Yeah, we are. Who's like a good fit for your team? We don't have a particular profile.

So if you feel really passionate about coding and about the space, if you're interested in learning how models work and how terminals work and all these technologies involved, yeah, hit us up.

Always happy to chat. Awesome. Well, thank you for coming on.

This was fun. Thank you.

Thanks for having us. This is fun.

[Music]
