
OpenAI just dropped GPT-5.4 and WOW....

By Matthew Berman


Topics Covered

  • GPT-5.4 Unifies Coding and Knowledge Work
  • GPT-5.4 Tops Real-World Benchmarks
  • GPT-5.4 Masters Efficient Computer Use
  • Frontier Models Accelerate Rapidly

Full Transcript

GPT-5.4 is here, and we may actually have a new best model on the planet. I've been using it for the past week (I got early access), and yes, it is an incredible model. What OpenAI did to get to 5.4 feels very similar to what Anthropic did to get to Opus 4.6, and they're all heading in the same direction. These are models built for real-world knowledge work. These are models built for agentic tasks. And I'm going to tell you everything about it. I have a couple of demos I want to show you.

And this just might be my new main model in OpenClaw. Let me explain why. So we have Opus 4.6, and it is a good model. What makes it so good is that it is not only a world model, but it is also incredibly good at writing code. We have world knowledge. We have logic and reasoning. We have a great personality, and yes, the personality really does matter, especially if you're plugging it into your personal AI assistant, aka OpenClaw. And of course, it's incredibly good at code. It's incredibly good at agent work. It's great at browser use. It's great at computer use. All of these things it is very, very good at.

But OpenAI did not have that. GPT-5.2 was good at a lot of things, and Codex was really good at coding, but they were separate models. If you wanted to use one model for both use cases, you really couldn't; you had to choose one or the other for the appropriate use case. So coding was over here: that is what GPT-5.3 Codex was all about, and it was really good at coding. But if you wanted a personality, if you wanted writing and creativity, you went to 5.2. Remember, though, Opus had all of those things built into a single model. That is where GPT-5.4 comes in. They basically said, okay, 5.2 and GPT-5.3 Codex, go have a baby, and we're going to call it GPT-5.4.

And this is their new frontier flagship everything model. It is good at coding. It has a personality. It's good at creative writing. It's good at tool calling. It's good at agentic use cases. You can plug it in as your main model in OpenClaw. GPT-5.4 has everything. And not only that, they made it faster and more token-efficient, all baked into a single model that can serve many use cases.

And remember when Sonnet 4.6 came out and I said it was specifically built to serve knowledge workers? Well, that's kind of what 5.4 is now. It is incredibly good at the things you would maybe do with Claude Cowork: reading PDF documents, creating PowerPoints, searching the web, using the browser, using the computer. All of these things can now be done with 5.4 really well. And here's the thing, here's the last piece that the Claude family of models had that the OpenAI models did not: 1 million tokens of context. But now GPT-5.4 has it too, and that is huge. It's not cheap to use all of it, but I'll get to that in a moment. All right, so two models were dropped.
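For a rough sense of what a 1-million-token context window means in practice, here's a quick back-of-the-envelope sketch. The words-per-token ratio is a common rule of thumb, not an exact figure, and the input price is the GPT-5.4 rate quoted later in the video:

```python
CONTEXT_TOKENS = 1_000_000

# Rough rule of thumb: ~0.75 English words per token.
approx_words = int(CONTEXT_TOKENS * 0.75)

# Input price quoted later in the video: $2.50 per million tokens.
cost_to_fill_context = CONTEXT_TOKENS * 2.50 / 1_000_000

print(f"~{approx_words:,} words of context")
print(f"${cost_to_fill_context:.2f} per full-context request (input only)")
```

So a single request that fills the whole window runs a couple of dollars in input tokens alone, before any output tokens or repeated agentic turns.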

We have 5.4 Thinking and 5.4 Pro, and they put together a nice chart so we can compare the benchmarks against the older OpenAI models but also, for the first time in a while, against Anthropic and Google models. So here we go. OSWorld, which measures computer use: 75% for GPT-5.4 Thinking compared to 74% for GPT-5.3 Codex, so a tiny bump, and over here Opus 4.6 gets 72.7%. On SWE-bench Pro, 5.4 Thinking gets 57.7% versus 56.8%, so it is actually scoring higher than the codex-specific model; there is no score for Opus, but Gemini 3.1 Pro gets 54.2%. Now, it kind of sucks that these companies keep picking

and choosing which benchmarks they run against, because that makes it very difficult to compare them. All right, next we have GDPval, which is OpenAI's own benchmark measuring real-world knowledge work: the ability of these models to actually complete real knowledge work, things that will actually move the GDP of the country. Even though it is OpenAI's own benchmark, other companies do use it and run their models against it. So what do we see? Well, we have an 83% for GPT-5.4 Thinking, which is interesting, because GPT-5.4 Pro, which is technically the smarter model and much more expensive, actually gets a lower score. Versus 5.3 Codex, it is 13 points higher, and versus Opus 4.6, which scored a 78, it is five points higher. It also dominated FrontierMath. And by the way, if you want to test the latest models and use them in an OpenClaw-like environment, but don't want the headache of setting up OpenClaw yourself, go check out the sponsor of today's video, Lindy. I know the entire world is talking about OpenClaw, but OpenClaw is not yet for the entire world. It still takes a ton of handholding and a ton of security mindfulness. Personally, I've spent 5 billion tokens so far just getting it to a really good place. So there has to be a better solution for everyone, and that is where the sponsor of today's video comes in: Lindy. They've

been a great partner, so I'm excited to tell you about them. If you've been running your own agents, you know the drill: spending $600 on your own Mac Mini, paying hundreds of dollars in token costs every month, or constantly handholding them. It can be easier, and Lindy just eliminated all of that. But this isn't just for casual users; this is for hardcore automations as well. Lindy has built a personal AI assistant for people running complex workflows. It meets you wherever you are: iMessage, email, Slack, Notion, Gmail, Google Drive. It integrates with over 100 apps and learns the way you like to work. Check out Lindy's AI assistant. I'll

drop a link down below. They've been a fantastic partner. So go check them out.

It helps the channel when you let them know I sent you. Now, back to the video. According to the blog post, GPT-5.4 brings together the best of their recent advances in reasoning, coding, and agentic workflows into a single frontier model. So

just like I said, it incorporates the industry-leading coding capabilities of 5.3 Codex while also improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. Also, GPT-5.4 Thinking can now provide an upfront plan instead of just starting and going. One of the most useful features in something like Cursor is the fact that you can plan first, which allows you to guide the model rather than burning all those tokens actually building the thing and potentially going in the wrong direction. You can turn on the little plan feature right here and it will plan instead of build. And now that's built into ChatGPT. So as I said, it's really good at computer use, and it

also has incredible vision capabilities. And obviously those two things go hand in hand. So

it's excellent at writing code to operate computers via libraries like Playwright, as well as issuing mouse and keyboard commands in response to screenshots. So look at this OSWorld Verified benchmark. OSWorld basically gives these models an operating system they can operate. On the x-axis we have the number of tool calls, and on the y-axis we have the accuracy, so what you really want is to be up in the top-left corner, because that means the highest accuracy with the fewest tool calls, and that is a good thing: fewer tool calls means fewer tokens, cheaper, more efficient. As we see, GPT-5.2 is over here, and its accuracy tops out at a little under 50% with 42 tool calls. Now look over here: GPT-5.4 tops out at 75% with 15 tool calls, just much more efficient than 5.2. All right, so here's an example of it using Gmail. We can see the little cursor right there. It's going to click. Go over to Sent. It's

going to look at its sent emails. It can star them quite well. It labels them, putting each email under the right label. It can create calendar invites. All of this is super useful, but you know what actually needs to catch up? The websites and the publishers themselves, who generally block agentic use of their sites, right? They don't want scrapers, so they block stuff like this, but hopefully the publishers catch up. So here it is writing an email, sending

the email. Yeah, just super impressive. Here's bulk data entry: we have what looks to be a JSON object, and it's basically extracting the data from that and entering it here very quickly. And by the way, if you look up in the top right, you can actually see the timestamp, and it looks to be going at real-time speed. That's kind of insane. So this is not sped up at all,

if you are to believe the timestamp up here. All right, so they gave a few demos of things that GPT-5.4 built, and it's incredible. This is

possibly one of the best demos I've ever seen. It's a theme park simulation game. Look at this: you can see it has the speed control up here.

You can make things faster or slower. You can design the park. All of the assets are created. The little people walking around are obviously just little circles, so it's very simplistic, but all of the logic is built in: your funds, your guests, your happiness, cleanliness, park rating. Look at all of this. You choose what you want. You can place a new Ferris wheel or a new carousel, and people go to it. So impressive. And it says the simulation game was made with 5.4 from a single, lightly specified prompt, meaning you didn't have to give it a highly detailed prompt. Next is an RPG game, a very 2D, '90s-style RPG, and it looks excellent. All

of the assets are beautiful. You can see all the little characters here. We have

attack and turn. Yeah, so very, very cool. All right, last, and I hope you're sitting down, because this is going to be a little bit painful: the pricing. We

have GPT-5.2 at $1.75 per million input tokens; for 5.4 it's now $2.50. Frontier intelligence seems to be getting more expensive, not less. 5.2 Pro was $21 per million input tokens; ooh, now it's $30. The output price is only slightly higher: $14 for 5.2 versus $15 for the new 5.4 model. For 5.4 Pro, it's $180 per million output tokens versus $168 for 5.2 Pro. These are expensive models, don't get me wrong. You can save a lot of money by caching the input, but the output is going to be very expensive regardless of what you do. All right. So like I said, you're probably going to want to test this model out in OpenClaw and use it as your primary model.
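To make those per-token prices concrete, here's a minimal cost calculator using the rates quoted above; the sample request size (200k input tokens, 20k output tokens) is made up for illustration:

```python
# Per-million-token prices quoted in the video (USD).
PRICES = {
    "gpt-5.2":     {"input": 1.75, "output": 14.0},
    "gpt-5.4":     {"input": 2.50, "output": 15.0},
    "gpt-5.2-pro": {"input": 21.0, "output": 168.0},
    "gpt-5.4-pro": {"input": 30.0, "output": 180.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the full (uncached) input price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical agentic request: 200k tokens in, 20k tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f}")
```

Note how the output rates dominate for the Pro models, which is why input caching alone can't make them cheap.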

And the way to do that is simply to tell OpenClaw to do it.

It's really not that difficult, but here's the really important thing to remember: the way you prompt GPT-5.4 is very different from the way you prompt Opus and Claude models in general. So what you're going to want to do is look for the latest prompting guide for 5.4, and here it is. They already have documentation on it, which is fantastic. So point OpenClaw at this page, tell it to download the prompt guide, and either rewrite your prompts or create two sets of prompts, one for 5.4 and one for Opus. And it seems like we're getting new models almost every week at this point. I mean, some of my team just got

GPT-5.3 Codex; now we're getting GPT-5.4. We had Opus 4.6 right after Opus 4.5. We had Sonnet 4.6. These models are coming at lightning speed, and there's a reason: both of these companies, Anthropic and OpenAI, have completely figured out their pre-training cycle, meaning these models are going to keep coming. They're just baking in the oven, and every time they think they have enough progress, they ship a version of it. They basically cut it off and say, all right, let's ship it to the public. It's super exciting. And don't forget, less than a year ago OpenAI was really not doing well on the pre-training front. They

released GPT-4.5, which was actually a really good model, but it was massive, slow, and expensive to run, so there were a lot of problems with it. They ended up retiring it, or I think it's still available, but basically people don't use it all that much. And now the entire 5.0 family of models from OpenAI is fantastic, and they keep coming. They're efficient. They're fast. They're great.

So congratulations to OpenAI. Now, the last thing I want to go over is a couple of industry reactions. Matt Shumer has had access to GPT-5.4, similar to me, over the past week, and these are his thoughts. His general takeaway: it is the best model on the planet by far, which is a big statement. He said he primarily used Pro models, but not anymore; 5.4 Thinking is more than sufficient for all of his use cases. The

coding capabilities are ridiculous, essentially flawless, he says. That is definitely an overstatement; it is not flawless, but it is excellent. Inside Codex, it's insanely reliable, and I can attest to this.

That's where I've been testing it myself. Now, it still does have some problems, which he points out: front-end taste is far behind Opus 4.6 and Gemini 3.1 Pro, and it can still miss obvious real-world context.

So his example is: he had it plan an itinerary for a trip. At first glance, it looked perfect, but it failed to take into account that it chose locations that would be mobbed by spring breakers, so he had to rerun the prompt from scratch with more context. And number three: within OpenClaw, it kept stopping short before finishing tasks. Okay, these things should be fixed quite quickly. In fact, Sam Altman just reposted this saying, yes, we're going to fix those immediately. Flavio Adamo, also an early tester, was very impressed. Check this out: he's been testing it in early access, and it

has a million-token context window and is number one on SWE-bench. He had a big Aavely update planned for late March because a few parts of the site were still too time-consuming to pull off with previous models, and 5.4 basically one-shotted them within Codex. So he says, yes, it is excellent. Peter Steinberger, of course now an OpenAI employee, so take it with a grain of salt, but he also says

it's a very good model. The coding-specific jump is more in line with what we had from 5.0 to 5.1, but now it's unified and smarter on everything else. It writes

better docs. It is a better general-purpose agent and overall more pleasant to use. And

I will also be testing it in OpenClaw. I need to know if it has the right vibes. Does it have the right personality? I suspect it does. So that's

it. It is an incredible model. I'm going to be testing it thoroughly. If you

enjoyed this video, please consider liking and subscribing, and I'll see you in the next one.
