Claude Opus 4.8 | First impressions
By Arena AI
Summary
Topics Covered
- Anthropic release cadence is compressing fast
- Benchmark numbers can't tell you how a model feels
- Anthropic's thinking mode barely beats non-thinking
- AI 3D generation quality is wildly inconsistent
Full Transcript
It's always an exciting day when we have a new frontier model out. Today, it's Opus 4.8.
They released some really nice-looking benchmarks, although only a few, so we really want to see how it feels to use this model on our Code Arena.
And as usual, we'll go over a bunch of different examples, and I really want to dig in and see how it really feels.
So in here we can see how it performs on a few coding benchmarks.
So, for example, SWE-Bench Pro is five percentage points higher than Opus 4.7.
That's a big jump.
Terminal-Bench 2.1 is eight percentage points higher.
Again, a massive jump, although it's not the current top model there.
So we should see a pretty big jump.
So that's what I want to explore today.
Are we going to see that big jump?
A couple of other things I want to pay attention to: do we see a big difference between reasoning and non-reasoning models for Anthropic? With previous models it felt like the reasoning and non-reasoning versions were pretty close, but now they seem to be paying a lot more attention to it.
So I really want to look out for that as well, and then see if there's anything else we're going to pick up, any particular patterns, which is always good to see.
Before we get into it, I want to point something out.
Anthropic models have been releasing at a higher frequency than before, so if we go back to the Claude 4 releases, it was 75 days to 4.1, and 50-odd days between 4.1 and 4.5.
Well, actually, it depends how you count.
So then we are all the way down to 42 days between Opus 4.7 and 4.8.
Now this is acceleration.
Make of it what you will.
But I think it's an interesting point to look out for.
What I want to do today is to run a bunch of tests that we normally run on this channel.
And the way it works, just so you have a mental picture of it, is that we go to Arena AI, we put in the prompt, and it looks something like this.
Sometimes it's much shorter, sometimes I've got like longer complicated prompts.
Then I've got two generations and then we compare them.
Normally the comparisons are side by side, but we just take the link in here as well.
It's a one-shot generation.
It is not multi-shot, based on your codebase and so on.
So adjust for that.
This is not testing for every single thing.
But the advantage of this is that we can cover a lot of ground and we can see how well the models are performing across all of these categories.
So what I want to do today is to dig into some of these generations.
And as you see, we'll have Opus 4.8 thinking, then the non-thinking version, then the same for 4.7 and 4.6.
We're going to see how the performance has changed over time.
We're not going to do this for every single one.
But I'll give you a few examples to give you a feel for how things have evolved, where things got better, where things didn't get better, and you'll form your own opinion.
So let's get into it.
So first of all, this is the voxel generation of Rome.
This is Opus 4.8 thinking here, which I think is a rather nice generation.
We have a lot of different city elements.
They're all nicely put together, which I really like.
I think this is good.
This is like a little bit not fully constructed.
And I think it could be maybe a little bit brighter, but other than that I think that's pretty good.
The non thinking version I don't know what happened.
If I switch the time of day, you can kind of see the outlines.
But something's wrong here with the way it's generated.
So I have seen that before for this generation.
So I don't know if there's something like a little bit odd.
So maybe let's put that aside.
Keep a note of that if we see anything weird.
But yeah, that wasn't a good one.
So Opus 4.7. I'll give you a feel for Opus 4.6.
Whoops, let's skip that one.
Gemini 3.1, I think quite a good generation from Gemini, and GLM 5.1, and, oh, in here.
Yeah, we've got Gemini 3.5 Flash, which very aggressively builds a bunch of other stuff.
But like what do you think?
I would say that from my perspective, if I get rid of this one, I think that's quite a good one.
I think it's complex, it's detailed.
It has quite a lot of nice elements.
So I would probably put that first.
I don't like the roof, but otherwise, looking at it, it's definitely an improvement over 4.7, which is lower fidelity and doesn't have quite as much detail as this.
It's also kind of flickering.
4.6 kind of feels very weak.
You can see how much less detail there is and there's things not connected.
We'll skip that.
Gemini is a good one, but it really feels like kind of things thrown together a little bit.
Not quite detailed enough.
And yeah, these ones are noticeably weaker.
So yeah, I would say 4.8 thinking here is very good.
So I like that generation. Good.
I'm going to skip through a bunch of them.
That's 218 tests.
So we're not going to look through all of them.
But I want to see a few that are interesting to look at and see which ones we can learn from.
Yeah.
4.8: the reasoning one is not working, but the non-reasoning one is.
The idea here is that we want to have this kind of coral reef setup, and while the fishes are quite interesting in their detail, the obvious point here is that the floor is kind of missing.
And we want to have a nice structure here.
Opus 4.7: the floor is also missing.
I don't know what's going on there, but the fishes are much less detailed.
So you can see this is an interesting improvement.
4.6: the floor is here, but much less detail as well.
So you can kind of see, okay, the floor is worse.
But if you forget about the floor, the detail in the fish is better.
So I like that. This is a good improvement in terms of trying to make the world richer.
For comparison. This is 3.1 Gemini.
I think that's a good one where we've got the nice structure on the floor, but the fishes are so much less detailed.
You can see that right.
So yeah, I think it's kind of different things that they pay attention to.
And we can see quite a low amount of detail for Gemini 3.5 Flash.
Not sure what was kind of done here.
It kind of tried certain elements, but yeah, it's not a great generation, personally.
So let's keep moving.
I quite like that one.
And the reason why this is interesting is that this is Notre-Dame de Paris.
So it has a specific, known structure, so it can align with reality or not.
And it has quite a lot of these kind of colorful elements.
I'm not a big fan of this because you see the things coming out of the frame.
This is like a little bit incoherent.
And things like this is not quite what it looks like.
I mean, things kind of coming out of it, it's not terrible, but not my favorite one.
Although the light is quite nice, I do like the light here.
So let's have a look at the non-thinking version. I think it has quite a lot of haze.
Let's turn this down.
I think the structure is better, as in the overall structure of the building.
But yeah, in the structure of the building some elements are actually good, although this is a bit weird, but the structure of the window here is definitely better.
You can see, like, this doesn't really make sense, but this one is better.
So to the question of whether thinking makes it a lot better here: maybe not so much.
I think let's let's keep an eye on that.
I think there are some better examples where we can see the difference.
4.7 for comparison.
Much worse here in terms of like floating structure and so on.
Although the actual outline is rather nice, not too bad, but yeah, floating structure.
Definitely a knock down.
4.6 thinking is kind of closer to the 4.8 thinking.
But yeah, this is kind of inverted.
Things are too close together.
I think that's not what you would expect.
And the non thinking 4.6 this is a lot worse than 4.8 I would say.
You see things kind of behind this is not very interesting at all.
So yeah, this is a noticeable jump, at least in this kind of test, between 4.6 and 4.8.
Yeah 4.5 here.
Much much worse.
So if we go from 4.5 to 4.8, it's a really, really different category of models.
Very jazzy from Gemini 3.1 here. I don't quite know what's going on, but yeah.
In here. Yeah.
It kind of tried to do quite a bit.
But yeah, that's not a great generation.
GLM is actually quite good here.
So for comparison, you know, that's a good comparison.
I think it's not quite as ambitious.
But I think it did a good job.
Yeah, and Flash.
Jesus.
Everything's going on here.
But yes, again, a very, very agentic model.
It really pushes things forward.
Structure. Not too bad I must say.
This is this is actually not terrible.
So let's keep looking at which ones are interesting.
So we'll have a few more types of generations.
But we've just looked at the more 3D kind of ones.
I want to look at a more complicated variant of 3D.
And the idea here is: can we get LLMs to create games?
The reason why I think games are interesting is that, first of all, they have this kind of 3D element, which is quite complicated already to start with.
Then they need to actually create a gameplay that makes some sense.
And this is one area where LLMs have really struggled: creating some kind of coherent game.
I mean, you can see here, I'm really struggling with this.
Like, I'm not sure what's going on here. It's not trivial.
And while I do quite like the whole world as it's created, it doesn't feel like good gameplay.
I'm not even sure what I'm meant to be doing here.
So I think this is this is interesting.
The non-thinking version, I wonder.
This is like feels actually like a nicer environment, almost like I know it's not quite as condensed, but the gameplay feels.
Yeah, that's like at least driving.
It feels like more reasonable to me.
Although, yeah, the whole kind of card here.
Although I can't yet. So maybe it is.
Yeah, it's kind of this one kind of feels nicer.
But the moment you start using it, that's not ideal.
4.5 you could see how much worse that is.
So this is not even close like this is the whole environment here is like miles behind.
It's all very rigid.
Square doesn't make any sense.
So yeah I would say that's a very noticeable improvement.
Gemini, I don't know what's going on here.
Okay, I thought Gemini is going to be good at games.
That's why I added it.
But like, this is not even making any sense.
Let's move on.
Oh. Oh, yes.
The sound.
Oh no this is bad.
Yes. So, Gemini 3.5 Flash.
Newsflash is that it likes to do sound a lot.
Yeah.
And I think, you know, you can see it's very jittery.
It's not your screen. It's here.
I think it tried a bit too much to add a lot of other stuff.
But if you kind of look at it as a freeze frame, maybe it's even a better game.
But I'm going to switch it off before my computer dies with it.
So I want to try more games.
See see if this is going to give us some more information.
This is a game which is meant to be about restoring the Sistine Chapel.
You fly on a UFO, or kind of a drone, I think that was.
And you're meant to be restoring the Sistine Chapel.
Not impressive.
You see how much I had to explain?
Would you guess this was the Sistine Chapel?
Would you guess this was restoration?
Kind of weird, but, you know, it does let you fly.
So this is better than some that I've seen before.
The non thinking version flickering.
But I do kind of like that I can fly a little bit more.
So you kind of fly in the ceiling and this.
Oh yeah okay.
This is actually slightly I know bear with me.
Like I know when you look at it, it's like, oh it's flickering.
It's not as nice, which is definitely true, but gameplay-wise that actually feels nicer to me, in that I can actually control it more.
So the non-thinking version, gameplay-wise, was better, which is weird, right?
Like I think this is kind of interesting.
4.7 looks kind of halfway there.
I was going to say it looks more like the Sistine Chapel, but then you see these kind of weird creatures there.
If you haven't seen the Sistine Chapel, this is like the top world class art.
So this is not it.
But nevertheless, kind of arguably better than 4.8 thinking in terms of the moving around.
But yeah, some kind of world environment issues for sure.
Yeah.
And again just to contrast 4.5 here is like that's quite a bit worse.
Like this is nowhere near as good.
So yeah that's that's kind of interesting as well.
GLM just for comparison to anchor a little bit.
This doesn't make any sense.
Like this is not a game.
Control is not too bad, but the rest is not coherent.
Like there's no ceiling at all.
But the controls are actually quite nice.
Yeah.
So I want to keep looking at more games.
If you guys are not bored of that, I just have so many games.
Maybe. Maybe for later.
You know, stay tuned for more videos, and I'm going to come back to some other games.
This one I think we are kind of at the wrong side.
So this one is meant to be about interacting with these kind of Van Gogh's sunflowers.
But I think we're on the wrong side, so I don't know what's happened.
I think I got stuck.
Oh, that's a shame.
I think that looked better.
So, yeah, I think that's one issue that we definitely see with these kinds of 3D generations: models put in a bit too much, and then it just doesn't perform, which is a meaningful part of creating a game as well.
Right. So.
Yeah, the moment you hit it, it kind of dies.
So definitely positive points to 4.8 thinking that it even works.
4.8 non-thinking doesn't work.
Not ideal.
4.7 thinking oh no, it doesn't work.
Maybe that's a bad game for me to show you.
If it doesn't work.
But yeah, I must say this is kind of the difficulty of creating games. For all of these, I don't know how much you can tell how much I'm struggling here, but the gameplay here is pretty bad, in that I can't really aim.
The mouse is not working.
Not ideal.
GLM for comparison doesn't really do anything.
So yeah, that doesn't really make sense.
The reason why I also bring in GLM is not like to single them out as the bad one.
It's actually because they are good.
I genuinely would say out of the open source models they are on the better side.
That's why I wanted to bring it in to calibrate.
So while I'm saying like, oh, this is bad or this is the worst, maybe apart from some generations, this is not to say that it is actually bad.
Like this is quite good.
Like we are talking about very, very different tiers of models, so don't take it the wrong way.
This is Pyramids of Giza.
So this is quite a lovely generation by 4.8.
And what we want here is this kind of beautiful creation.
Different things put together nicely I do, I do quite like this.
This is, this is nice.
I wish it was a little bit brighter, like it's during the day, but the overall construction, I would say, is really, really solid.
Could be higher detail.
But let's see what others have done.
So this is 4.8 thinking why is it so hazy.
Yeah.
This is this thing is massive.
It should be, like, about a tenth of the size.
So I'm not loving the thinking generation here versus that.
I mean, don't get me wrong, this is still very good.
But and this is like a beautiful shot here.
But I'd say the non-thinking one was better, no?
I don't know if you agree, but certainly construction-wise, the sizing is better.
What's there in the front was better as well.
So yeah this kind of haze also not great.
So interesting.
Just for comparison Gemini what's going on here?
3.1 Pro yeah, this is not great.
Yeah.
So Opus, I would say, is quite a bit better here.
3.5 Flash.
Just to look at the later model.
All of the generations kind of feel like it's building some kind of disco from the 80s, which is definitely not part of the prompt.
Oh yeah.
We can kind of calm it down a bit, but you can see, like, the level of detail is much, much higher.
So I think that's noteworthy.
This is Opus 4.7. I would say, okay, it is worse actually.
Yeah it is quite a bit worse. Things are floating.
Yeah.
So I would say between 4.8 and 4.7 there does seem to be a jump in quality here, which is good to see for sure.
And 4.6 here I would say is wow, like it shouldn't be that raised.
It's not that high.
I think Sphinx is missing.
Or maybe it's kind of embedded in.
Is that it? Yeah. And the controls.
You can see how much I'm struggling to even control this.
Yeah. This is not not that great.
So I would say, again, it feels like an uplift. And GLM for comparison as well, as a kind of open source challenger.
Not not amazing but not too bad either.
Like pretty pretty nice generation.
If you wouldn't have seen the others, I would say this is quite nice.
Certainly better than maybe some of the others that we've seen.
Like not not bad like.
Right.
I want to really look at this.
So this is the Golden Gate Bridge prompt.
The reason why I like this is that it is really complicated.
There are so many different things that it needs to bear in mind in terms of the whole design, the development, a bunch of different features, the structures.
So I find this quite interesting.
So let me show you a few generations here.
I'm just going to go between a few different ones and see what you think.
So this is the 4.8 thinking.
I think this is really quite good.
You can see the sea here, the bay, the traffic.
Maybe I would want to get it a bit higher.
But this is really, really nice.
Yeah.
The weather is reacting well.
The traffic is really quite good.
The bridge is quite good I think. Yeah.
I think it's close to the real thing.
I think this is a really excellent generation, and this is 4.8 thinking.
And I'll show you some of the previous ones that opus has been generating before and generally how bad they were.
So you'll see how good this is.
I want to see the comet, which we can do at night.
Yeah, that is okay.
Not that impressive.
I mean, okay, it can't do everything perfectly, but I would say a few small criticisms: I think the water could be a little bit nicer.
And the comet wasn't great. More traffic would have been better as well.
But other than that, this is pretty close to really excellent.
You can see Gemini 3.1 Pro here, much less detail.
Things like not quite well coordinated.
The bridge is not like terrible, but you can see the difference in traffic.
We didn't quite look at the guitar here, but you can see the concrete zooming.
That's annoying, but you can see the trucks in the distance and so on.
Here is just like squares going around.
If I look at 4.5, you can see why I'm excited, right?
So this is Opus 4.5. See how bad that was?
I don't know if it's thinking or non-thinking. Oh, actually, that is thinking.
So this is three notches up.
And now we can do this. Like, this was so bad. 4.6, pretty bad as well.
So 4.6 is a good model. Like people really like it.
And yet its generations, at least on this test, were really bad.
4.5, another example here: again, many issues. Some elements are better than others, but many issues. 4.5 again, yeah, many, many issues.
You can see for comparison 3.5 flash from Gemini.
Yeah, feels very kind of metallic almost to something but yeah very patent.
So yeah like better than some of the early opus generations.
But yeah, there's no traffic.
I think that's a big problem as well.
GLM 5.1, by the way.
Really nice for such a small open source model.
There's some issues with the traffic and so on but not as ambitious.
But like pretty nice generation.
Opus 4.7.
Again, this is the jump that we're dealing with. Some construction-like elements.
Not too bad but you can see like it's not even in the right dimension.
So yeah, definitely a nice improvement.
4.6: I included this because that was probably the best 4.6 generation, which is quite nice.
But you can see the level of ambition is nowhere near this one.
Yeah. So this was really excellent.
I don't want to oversell it.
Like I think this is a worse version of a 4.8 thinking in here that like you can see, kind of feels much more cartoony, almost like the traffic is kind of like, yeah, like a little bit of a joke.
The things are floating and so on.
So it has a lot of really nice elements to it. But yeah, the difference between the two is quite stark, I must say.
So I think maybe keep this in mind as well.
Maybe it could be that one was a lucky generation, another was unlucky generation.
So yeah, with a lot of models that definitely happens.
The stability is not quite perfect here.
Okay.
This is a little bit indulgent.
I was testing some new prompts.
So I'm going to show you guys, see what you think.
Leave a note in the comments if you want me to do more of the aquatic theme.
But I went on a holiday recently to do some whale watching, so I was playing out some of these fantasies.
You definitely don't get to see that view.
So this is one benefit of this.
And what we want here is a sperm whale which is moving around with this kind of natural behavior.
It has this kind of pulsating head, that kind of thing that we want to see.
And I think there's some elements I don't like about kind of realism and so on, but it feels very alive.
So 4.8 thinking, I think, did a pretty good job here.
4.8 non-thinking.
I think you can kind of see what like falters a bit like the construction is like, I can't even make it out.
And I think it looks less kind of natural.
I mean, not that this is super natural, but at least it's trying to do something.
4.7 thinking: it was actually quite a nice generation in here.
You can see I quite like the different patterns on the whale.
I think that is quite cool.
GLM 5.1 again for comparison. That doesn't really look like a whale, but I mean, who knows, maybe a couple more generations.
Maybe it'll get there.
I'm really starting to feel bad for GLM.
Like I don't mean to be bashing them.
This was actually picked because they're good.
Octopus: 4.8 non-thinking, this is awesome.
Like I, I have not seen an octopus.
I guess maybe I have seen, but definitely not on this trip.
But yeah, this is quite cool.
Like, I really like this.
The natural behavior, the movements.
This is a really nice generation.
4.7 thinking that is kind of creepy, right?
I mean, this is good.
I'm not not criticizing this for quality, but it's kind of interesting.
Like, I haven't seen these kinds of 3D generations do this kind of pattern.
So I think that's also really trying to do something good.
So yeah, all quite different generations.
4.8 thinking: I don't know if it was trying to be too clever and it's, like, hiding or something.
But this is kind of bad generation.
Like, the non-thinking version is so much better than the thinking one.
So that's something that I was stating at the beginning.
I'm interested in looking out for this.
Is it actually better or not?
And we can see that there were very few times when it was noticeably better, but then quite a few times when it wasn't, so I'm not quite sure what to make of it.
I think there's something to be said that I think.
Still, the thinking levels for anthropic models don't quite make as much difference as, say, OpenAI models.
You can see a huge difference for OpenAI models, whether they're thinking or not thinking.
So this is a humpback whale jumping.
We can see a few generations. Couldn't not include GPT, unfortunately.
Like not a great generation here.
I hope they get better with this kind of stuff.
But yeah, I think all of them, like none of them like amazing generations, I must say.
So this might be a good one to keep an eye on.
Again, maybe non-thinking is actually the best.
Not that this was great, but I think the non-thinking was better than the thinking, in my mind.
Although there are some elements where I think it tried to do more, but this is not that good.
So I want to maybe I'll come back to this one with better models and see what the progress is like.
So I think let's maybe skip the switches.
I think that's a bit too much and then not that interesting.
I want to have a look at some of the front end designs real quick.
So I have got a bunch of different prompts here, and we have a few slightly unusual landing pages.
So we want to see creativity.
We want to see how they approach the different problems here.
So the idea here is, like: React website, children's science museum exhibit about motion, light, and magnetism.
So let's see how creative they are here.
So this is 4.8 thinking we've got things moving.
We've got this kind of different elements.
It's kind of okay.
Like I don't know.
I don't know, what do you think?
I feel like it's a little bit like a first draft. It doesn't feel like a complete website.
So that was the thinking version.
The non thinking version.
Feels maybe less interesting in the sense that I think it tried to do more here.
I think that's quite cool.
But maybe this one almost feels more complete, although not better if you see what I mean.
I don't know, I'm struggling to decide between those two and 4.7, I don't know.
Is that better?
I feel like this is better.
I think it kind of depends whether you're tired of this kind of design style, and I totally hear you if you say, like, oh, I have seen this so many times, this is really awful.
But to me, 4.7 like if you kind of not used to that style, I think 4.7 is actually better.
That was the thinking. The non-thinking, this was noticeably worse.
I think it's more like 4.8, but kind of worse.
So not not that impressive.
And Gemini 3.5 flash here.
I'm not going to talk about too much because I think it's noticeably worse.
Let's just go over a couple more.
So we've got a vinyl record pressing plant.
Let's see what they did here. 4.8 thinking: I quite like this, this is quite a nice element.
This is this is quite tasteful, I would say, except for this, that could be like a real website I think with like a bit more content.
But this is, this could be a real website.
I feel like this is quite nice.
The non thinking version.
A little bit heavier but design is quite good as well.
Like I think this is quite good ideas, tastefully done, little animations.
It's not like over the top like to me some elements.
I think I prefer this element here versus this, but somehow the shape of it like it feels less nice here, but overall that's that's pretty good.
4.7 not to my taste a little bit.
Yeah.
This kind of haziness although that is nice.
Like I think this is good.
This is maybe a better element, but I wish it was a little bit less heavy, less of this kind of haze.
But yeah, maybe a 4.7.
I guess it's a little bit matter of taste, but 4.7 I think is quite good.
Yeah.
And 4.7 non-thinking: a little bit lighter even. Quite similar between thinking and non-thinking, I don't know.
I can go both ways here.
Not not sure.
There's obviously not a big jump between 4.7 and 4.8 here for these kinds of generations.
And maybe let's take a look at the last one and let me show you the prompt.
So this is the Toy Inventor workshop.
So: build a front-end website for a small toy inventor's workshop selling wind-up toys, mobile machines, and tiny automata.
So let's see what 4.8 thinking built for us.
And I think that's quite nice.
You can see the repeated patterns, right.
This kind of all things are going across.
And it kind of tried to do this thing, not super successfully, but overall not bad, I would say.
And when I say not bad, I mean the overall structure and layout. I've seen so much worse from other LLMs.
So I do feel like this could be a real website.
For the non-thinking one, maybe some of the structure feels a little bit more, I don't know, more condensed almost.
So yeah, I definitely prefer the thinking version over the non-thinking one here.
4.7 thinking: yeah, I'm not sure what it tried to do here.
I think maybe it tried a bit too much.
That doesn't seem to like work that well.
Yeah.
Yeah.
4.8 did a lot better on this.
5.5 High: I added it just for this example, for some reason.
Yeah.
It's like don't love the 5.5 high kind of design style.
But you can see it like it's not too bad.
Like, I think maybe in one or two generations.
It seems like they're focusing on it.
It could get better.
But for now, the alignment, like the fact that this is on three lines and that's on two lines.
Misalignments like this are definitely big knocks on the quality.
So if you look at this in here, especially this, those are big misalignments.
So yeah, definitely some work to do for GPT models.
And Gemini here is a lot worse.
It's interesting how much worse it was for the Gemini models.
So what have we learned here.
The one thing is that I really always want to encourage you to try these models out.
All the numbers come out.
God knows what's going on.
There's so much complexity: is this model better, or is that model better?
So I really want to encourage you to go into the arena and try it out.
It's really nice.
Put in your prompt, open a couple of tabs or something.
Try your prompts here.
With your own generations you'll get a much better feel.
And then in terms of the model specifically, what I wanted to see is whether there is a meaningful improvement that really warrants this.
Do these numbers kind of describe it in the right direction?
And I'm not sure it's like that big of a jump.
But certainly, once you go back to November 2024, a really important release of Opus, Opus 4.5.
Until now, we had like maybe about 150 days or something.
I would say the difference is very meaningful, that you can really see the difference.
And the jump in quality, at least on these kinds of tasks, is noticeable.
So that is quite impressive.
And the fact that the timeline is condensing, that is good to see, and the improvements are quite solid.
Is it better at everything?
When we looked at UI between 4.7 and 4.8, I didn't always love 4.8, but, you know, maybe in your task you'll find differences in some places but not in others.
Another thing we wanted to learn is how much difference is there between the thinking and the non thinking versions.
And I would say I'm still not that convinced how good the models are at thinking, which on the one hand is kind of weird, in that you can knock them and say, oh well, why?
You maybe spend a lot of tokens thinking, but maybe you're not quite getting it into the result.
But the fact that the non-thinking models do so well, this is incredible.
I don't think there are other non-thinking models that do anywhere near as well as that.
So maybe something to note.
Maybe it is worth trying if you want to save some tokens, some money, maybe going and turn down that thinking and see what you get.
Maybe for your task that makes a good difference.
So maybe worth trying.
And another is like, what else?
Are there any other patterns, any weaknesses we picked up?
And definitely the variability in some of the generations has been quite high.
So the Golden Gate Bridge one was absolutely amazing.
Another was like, okay, pretty good, but not that great.
Other than that, to be honest, it seems like a really solid all-around model release.
And sometimes you get these kind of releases and they're like really good at this thing and completely fall down on another.
We haven't seen that here.
So all in all, I'm quite happy with this release.
It does look like a meaningful step forward, with the timelines condensing and still getting meaningful improvements, which is really, really cool to see.
So I'm happy.
Go try out the models, and for your own tasks go to the battle mode.
I think that's the best way to experience it.
And I think for both models, that's where you can find them and see for yourself.
It's really important to try the models for yourself and let me know how it goes.