Everyone Is Sleeping on Composer 2.5
By Web Dev Cody
Summary
Topics Covered
- Context Window Burns Fast on Simple Tasks
- Accuracy Trumps Speed and Cost
- The Real Cost Difference Between Models
- Use Cheap Models for Simple Tasks Only
Full Transcript
So about a week ago, Composer 2.5 came out, and I wanted to give you my opinions on using it. I've been building with it off and on, adding features to Mission Control and also kicking it off to do refactorings on my existing projects. Over here in Mission Control, I do have Cursor CLI as an option, and I will default to Composer 2.5 for a lot of tasks because it's very fast and for the most part it's very accurate, right? It's not on par with GPT-5.5 or Opus 4.7. I think it's a little bit below, but overall, if I need to just do a quick little bug fix, this is a really great model to at least try out. If you haven't tried it before, just try it out. You're going to be blown away by the speed and the accuracy of some of these requests.
Now, like I mentioned, I'm not going to hype up this model. I think it's a great model, and for most coding tasks, like UI-related tasks, this thing works very well. I will say that for the more esoteric and complex bugs or features, the ones that really span multiple files and have complex conditional logic and stuff like that, I have found that defaulting to GPT-5.5 still works best. It's my favorite model, the one I would recommend. Honestly, if you only had one model to pick, this is it as of today. This is the model you want. This is the subscription that you want. But if you do have some additional funds, maybe you're using Cursor or Cursor CLI, Composer 2.5 is great. Check it out, play around with it. And I find myself using Opus 4.7 a lot less these days. I do use it at work all the time. I do think it's a really great model that does understand the code base. It has a million-token context window, which is also very good, but GPT-5.5 has a million-token context window as well.
In terms of benchmarks, it looks like it's almost on par with Opus 4.7 on Terminal-Bench, and GPT-5.5 still blows them out of the water. Now, for SWE-bench Multilingual, it's beating GPT-5.5, which honestly, I don't think is accurate. Many times I have to use GPT-5.5 to fix bugs and implement the hard stuff, and then I fall back on Composer 2.5 if it's something that's kind of simple. But I will say these benchmarks are often pretty accurate about how good the models are. I mean, I code with them for like a couple days straight to get a good feel for them.
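The habit described here (cheap, fast model for simple work; the more accurate model for anything complex or multi-file) can be written down as a tiny routing rule. This is only a sketch of that rule of thumb, not anything from Mission Control, Cursor, or Codex: the model labels, the `Task` fields, and the three-file threshold are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels, not real model identifiers from any CLI or API.
CHEAP_FAST = "composer-2.5"
ACCURATE = "gpt-5.5"


@dataclass
class Task:
    description: str
    files_touched: int   # rough guess at how many files the change spans
    complex_logic: bool  # esoteric bug, tricky conditional logic, etc.


def pick_model(task: Task, max_files_for_cheap: int = 3) -> str:
    """Send simple work to the cheap model and hard work to the accurate one.

    The default threshold of 3 files is an arbitrary assumption.
    """
    if task.complex_logic or task.files_touched > max_files_for_cheap:
        return ACCURATE
    return CHEAP_FAST


quick_fix = Task("settings panel redirects on close", files_touched=2, complex_logic=False)
hard_bug = Task("bug spanning many files with nested conditions", files_touched=8, complex_logic=True)

print(pick_model(quick_fix))  # composer-2.5
print(pick_model(hard_bug))   # gpt-5.5
```

In practice the judgment call is fuzzier than two fields on a dataclass, so treat the thresholds as something to tune by feel.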
And overall, I do like this model. If I only had to code with Composer 2.5, I would be pretty happy. The issue with Composer 2.5 is the context window. I find the context window to be very small. So, if you do have a very simple bug or a simple new feature you want to add in, it's great for that. Like, if you just want to add a new button and that button has to go and change some back-end code, maybe it modifies some schema, maybe it needs to run some tests or write some tests, it's a good model to pick. But the context window does have a limitation. I find that after just one prompt, I typically hit 50% of my context window, whereas with GPT-5.5 or Opus, I have a much larger window. I can keep on prompting it, and then at some point I can either compact that or throw it away. And so, I do leave that for the harder models.
Now, just to kind of demo this live, I'm going to switch over to my Mission Control project. I do use Mission Control to build out Mission Control. And then I'm going to load up a new session over here. I can switch these to Cursor CLI. We can load that up. I have a bunch of extra things I should probably delete. And now we have a Composer 2.5 model ready to go.
Okay. So, there's a small bug in this application. When I click on the settings icon, which is behind my head, you'll notice that it loads the settings just fine, but when I collapse it or press escape, it kicks me back to the main dashboard. "There's a small bug. When I load up the settings panel, it works fine, but when I press escape or collapse it, it seems to redirect me back to my main dashboard route. When the settings panel closes, we should not be redirecting the user anywhere."
Okay, again, this is a really simple bug. Any of the models will probably be able to take this and basically knock it out of the park. And I don't even need to tap into my skills. There's a ship skill that I typically use for all my prompts, so I would say, like, "use the ship skill." It just helps analyze the code base a little bit better, and all the code that's written is going to be covered with tests. Maybe I'll kick it off just to kind of demo that. But for the most part, you can one-prompt many requests just like this. For example, I used Composer to add in this GitHub branch switcher and to add in the worktrees. I do have worktree support now in Mission Control. Now, this was a really stupid example, cuz honestly, the fix was what? We made two two-line changes. I can go ahead and just look at the diff over here. We can see, you know, it just changed the settings and it changed the route.
That's it. So, now I can actually just refresh this page and we can test this out. All right, so let's just click on settings. I'm going to go ahead and collapse it, and everything is working as it should. Now, I will say there's another bug. If I hide my head again: when the settings panel is open, I can't click on this button to collapse it again. Okay, so that's another bug. I'm going to go ahead and just reuse that same session, cuz it already kind of dove into my code base and it knows about what we just did. "There's also another small bug. When I click on the settings button to collapse the settings menu, it doesn't seem to work anymore." Now, I don't think Composer broke this. I think this was just already a bug in my code base. But we can just go ahead and fire this prompt off real quick, and I think that would probably fix it real fast. But, as you can see down here, we're already at 28% of the context window.
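To put that 28% in perspective, here is a rough back-of-the-envelope sketch (not something shown in the video). It assumes every prompt eats the same share of the window, which real sessions won't do exactly. The function name `prompts_until_full` and the 80% compaction threshold are made up for illustration; the 28% and 50% figures are the ones mentioned in the transcript.

```python
import math


def prompts_until_full(fraction_per_prompt: float, compact_at: float = 1.0) -> int:
    """Whole prompts that fit before usage would exceed `compact_at`.

    Simplifying assumption: each prompt uses the same fraction of the window.
    """
    if not 0 < fraction_per_prompt <= 1:
        raise ValueError("fraction_per_prompt must be in (0, 1]")
    # The tiny epsilon guards against floating-point results like 1.9999999.
    return math.floor(compact_at / fraction_per_prompt + 1e-9)


print(prompts_until_full(0.28))                  # 3
print(prompts_until_full(0.50))                  # 2
print(prompts_until_full(0.28, compact_at=0.8))  # 2
```

So at roughly a quarter of the window per small fix, a session has room for about three such fixes before it needs compacting or restarting, and fewer if you compact early.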
From one small little bug fix, it had to dive through all my code base, it had to truly understand what was going on, and I'm already at, you know, about a fourth of the context window used up. So you can tell that for a larger feature or a larger refactoring, it kind of burns through that context window extremely fast.
All right, so let's just do a refresh. Looks like it's done. I'm going to go ahead and just click on the settings button. Now, unfortunately, it's still broken. Like, if I click on the settings and click the button, it doesn't actually collapse it. So, this is a realistic example of what I'm seeing with Composer 2.5. For some things, it just doesn't get it right, right? And you do have to come back in here and keep on re-prompting it. And I'll probably tell it again, "Hey, this still doesn't work. When I click on the settings button, it does load the settings menu, but then when I click it again, it doesn't collapse it. I will state that using escape does properly collapse it and it shows me the same route I was on previously, but for some reason, manually clicking the settings button does not properly hide the settings route/panel anymore."
These models do not one-shot everything. This is something that GPT-5.5, I guarantee you, probably would have fixed on the first or second prompt. But the cool thing about Composer 2.5 is that it's very cheap. It's almost like 10 times cheaper than using Opus, but it has like the same benchmarks as Opus. So, you can do a lot more requests at a fraction of the cost, which is really great, especially if you want to do refactoring or stuff like that and you have a lot of tests in place. Composer can kind of go through your whole code base and quickly refactor things. But I've noticed that I do have to re-prompt it a lot. Sometimes it will get my application into a broken state, and I do have to re-prompt it to get it fixed. Whereas something like GPT-5.5 often just does it correctly the first time.
And I would say that I tend to lean on models that are just accurate. If I have to wait an extra 2, 3, or 5 minutes, I'm okay with that. If the end result means I don't have to go and re-prompt it or fix something that it broke, I'm okay with waiting a little bit longer. Accuracy at this point is the most important thing.
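One way to see why waiting longer can still be a win is to separate total wall-clock time from the time a person has to be actively involved. The sketch below is purely illustrative: the per-prompt minutes and prompt counts are made-up numbers, not measurements from the video, and the two helper functions are hypothetical.

```python
def total_minutes(prompts: int, model_minutes: float, verify_minutes: float) -> float:
    """Wall-clock time: every prompt costs model time plus a manual check."""
    return prompts * (model_minutes + verify_minutes)


def human_minutes(prompts: int, verify_minutes: float) -> float:
    """Time a person spends hands-on, verifying after each prompt."""
    return prompts * verify_minutes


# Made-up scenario: a fast model that needs more rounds vs. a slower one that needs fewer.
fast = {"prompts": 4, "model_minutes": 1.0, "verify_minutes": 2.0}
slow = {"prompts": 2, "model_minutes": 4.0, "verify_minutes": 2.0}

print(total_minutes(**fast), human_minutes(fast["prompts"], fast["verify_minutes"]))  # 12.0 8.0
print(total_minutes(**slow), human_minutes(slow["prompts"], slow["verify_minutes"]))  # 12.0 4.0
```

With these assumed numbers the total time comes out the same, but the slower, more accurate model needs half the hands-on attention, which is the trade-off being described.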
Okay, so now it is totally fixed. It took about three prompts to fix this. I do want to press escape to verify that works. Okay, that works fine. I will state that the animations don't work when I click the button to close it. So, that's something we can also fix. "This is great, you fixed it, but I noticed when I click on the button, the animations don't slide the settings menu away like they do when I press escape. Can you debug and fix why that's happening?" And so, after the fourth prompt, let's go ahead and just refresh one more time, and then we're going to click on this. Look at that, it does slide away. That's good. Pressing escape works. Make sure it works on different pages. Yeah, this is all looking pretty good now. Okay, so it did take four prompts to fix it.
Now, I can't really tell you how much these prompts cost me, because I think right now it's still in the free period, which is unfortunate because it'd be nice to actually see how much this would have cost me if it was not included already. But I will say that the model is significantly cheaper.
If you look at this graph over here, the orange is Composer 2.5, the Y axis is how accurate the model is, and the X axis is the cost. So, you'll notice here Opus 4.7 extra high, which is like the default most of us are using in Claude Code, is about $7 per task. Versus Composer 2.5, which is around a dollar, maybe even 75 cents per task. And it's saying that it's even more competent than Opus 4.7, which I find great for kicking off CLI tools, for example doing background tasks or asking Composer to create a pull request. All that, you know, minutiae that happens in our day-to-day: using a cheap model that's accurate just to run a terminal command and get it done, you're going to save a lot of money and not spend a bunch of time waiting for it. Versus with Opus, you're going to spend a bunch of money on a simple task, and you're going to sit there waiting for it for like 30 seconds or a minute just to, you know, do a grep over your code base to find a single file for you. Whereas Composer can get it done extremely fast, and it's also pretty accurate in terms of its performance.
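For a sense of scale, here is the arithmetic behind those chart numbers as a small sketch. The dollar figures are the approximate per-task values read off the chart in the video, not official pricing, and the idea that each re-prompt costs as much as a full "task" is a deliberately pessimistic assumption of mine. The helper names are made up.

```python
# Per-task costs in USD, eyeballed from the chart described above.
OPUS_COST = 7.00
COMPOSER_COST_HIGH = 1.00
COMPOSER_COST_LOW = 0.75


def tasks_per_budget(budget_usd: float, cost_per_task: float) -> int:
    """How many whole tasks fit in a budget at a flat per-task cost."""
    return int(budget_usd // cost_per_task)


def worst_case_cost(cost_per_task: float, prompts: int) -> float:
    """Pessimistic assumption: every re-prompt costs as much as a full task."""
    return cost_per_task * prompts


print(round(OPUS_COST / COMPOSER_COST_HIGH, 1))   # 7.0
print(round(OPUS_COST / COMPOSER_COST_LOW, 1))    # 9.3
print(tasks_per_budget(100, OPUS_COST))           # 14
print(tasks_per_budget(100, COMPOSER_COST_HIGH))  # 100
print(worst_case_cost(COMPOSER_COST_HIGH, 4))     # 4.0
```

Even if the four prompts from the bug fix above each cost a full dollar, that is still under the roughly $7 for a single Opus task, though it ignores the extra human time spent re-prompting.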
But going back to my original point, I would probably pay the premium if every single request, like a large feature request or a plan, is implemented end to end and works perfectly the first time, right? I would pay a premium to make sure that it is accurate. And I would say that Codex, or GPT-5.5, is the model for doing that. It is about, you know, $4.50 per task, but if your tasks are actually large tasks, then I think this is still the best model. But I do like using Composer. I find myself using it all the time when I'm adding new features. I'm like, "Oh, I just need to fix a small thing." Or, "Oh, I need to add in a new feature that does X, Y, and Z." I will probably kick it off in Composer first.
Now, later on in this article they do talk about how this is basically built. Like, they used Kimi K2.5, and they did some reinforcement learning to kind of make it better.
I don't want to talk about that stuff. I just want to kind of do a benchmark against a simple bug fix like we just did. Okay, so we've verified that this is fixed now. It took us four prompts, or maybe four or five, right? So, I went ahead and committed those fixes to my branch, but what I also want to do is check out the earlier commit, right? I'm going to go back and find the commit from before I had that fix. And we're just going to go ahead and check it out here.
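One way to get a clean pre-fix copy for a second model is a separate Git worktree at the parent of the fix commit, rather than switching the main checkout. This is a sketch of that idea, not what the video does: the helper names, the placeholder SHA, and the path are hypothetical, while `git worktree add --detach <path> <commit-ish>` is a standard Git command.

```python
import subprocess
from pathlib import Path


def worktree_command(fix_commit: str, path: Path) -> list[str]:
    """Build a command that checks out the parent of `fix_commit` in a new worktree."""
    # `<commit>~1` is Git's syntax for the first parent of a commit.
    return ["git", "worktree", "add", "--detach", str(path), f"{fix_commit}~1"]


def create_pre_fix_worktree(fix_commit: str, path: Path, repo: Path = Path(".")) -> None:
    """Run the command inside `repo`; raises CalledProcessError if Git fails."""
    subprocess.run(worktree_command(fix_commit, path), cwd=repo, check=True)


# Example with a placeholder SHA (hypothetical, replace with the real fix commit):
print(worktree_command("abc1234", Path("/tmp/pre-fix")))
# create_pre_fix_worktree("abc1234", Path("/tmp/pre-fix"))
```

Note that a worktree shares history with the main repository, so the fix commit can still be found through the Git log. Telling the model not to read the logs is still needed for a fair comparison.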
And then I'm going to try the exact same prompt as we did before. I might have to go find it. Let's go here. "There's a small bug." I'm going to grab all this code. Or sorry, this text. I'm going to go and make a Codex session.
And we're going to try to fix it with the exact same prompt using GPT-5.5 extra high. Now, I'm going to paste in the exact same prompt. I do have to specify, though, "do not look at Git logs," because we already fixed it, right? And these models are smart enough to go and look at Git logs to figure out if maybe something broke along the way. This is not a 100% good comparison because, you know, you should have the exact same prompt. I should have been able to check out this SHA in isolation. But, you know, it is what it is, right? We're just trying to do a little bit of testing to make sure that these models actually do what they're supposed to.
All right, so let's just do a hard refresh and load the settings, and then I'm going to close them out. Now, the animations are still kind of broken when I close it. I think it's also when I just click on it. Let's see. Yeah. "When the settings panel is open and I press escape, the panel does animate away, which is great. But when the settings panel is open and I manually click on the settings button with my mouse, it doesn't animate away. It just disappears. Can you fix the animations to make it work consistently for both?"
Now, one thing I wish I did a little bit better was track how long that original prompt took. I will say GPT-5.5 took maybe 4 minutes to do its first bug fix, and then the second prompt may take another minute or two. But with Composer, you have to kind of prompt it more along the way. And so, I don't know, maybe at the end of the day it's still the same amount of aggregate time spent waiting for these models to run. But one requires a human in the loop much more, whereas GPT-5.5 seems to just get it done. All right, so now it's just running a bunch of checks. I'm actually going to stop this, cuz I have to end this video. I'm going to go ahead and just refresh.
We're going to see if the animations are fixed now. And they are. Okay. So again, that was a really simple bug. Composer took four prompts, and GPT-5.5 took two prompts. Overall, the amount of time elapsed was probably equivalent. I will say that Composer is probably a lot cheaper. But you do have to be more in the loop, re-prompt it, and do much more manual testing. So that could add to the actual engineering time, especially if you're not really familiar with how to verify these things manually. You have to keep on clicking through the UI to make sure it works. Whereas I've found GPT-5.5 is still super accurate. And if you do want a big feature, you kick it off as a background task, you come back, and it usually kind of works.
Okay. So that's what I wanted to talk about. I wanted to give a realistic scenario of a really simple bug. Maybe I'll make some more complex test scenarios in the future. We can kind of talk about that. I can make a video on it. But that's about it. If you guys enjoyed this, leave a comment. Have a good day. And happy coding.