Claude Code + Codex = AI GOD
By Chase AI
Summary
## Key takeaways

- **OpenAI Codex now usable inside the Claude Code ecosystem**: OpenAI's Codex, considered the top competitor to Opus 4.6, can now be used directly within Anthropic's Claude Code environment, giving users access to both models in one workflow. [00:00], [00:10]
- **Adversarial review fixes an AI self-evaluation flaw**: The adversarial review feature addresses a fundamental problem with AI coding: models do a poor job evaluating their own code. Codex can review Opus's work with a 'discerning eye' that assumes things may be broken. [01:07], [08:06]
- **Codex targets seven critical attack surfaces in reviews**: During adversarial reviews, Codex pays special attention to authentication, data loss, rollbacks, race conditions, degraded dependencies, version skew, and observability gaps, issues that could derail production. [05:14], [05:29]
- **Opus and Codex find different bugs in a head-to-head comparison**: In a test on the creator's Twitter bot codebase, Opus found seven high/critical issues Codex missed, while Codex found three issues Opus missed. Both agreed the Telegram polling issue was the most critical. [07:10], [07:47]
- **Use Opus for planning, Codex for execution to save costs**: Since Codex offers a significantly better dollar-to-token ratio than Anthropic, users can use Opus 4.6 for high-level planning and Codex for execution, getting 'way more bang for your buck.' [03:01], [03:26]
- **Codex works on free ChatGPT accounts with a simple install**: The plugin installs with just a few commands from the marketplace, and usage rates are tied to your ChatGPT account, meaning it works even if you're on the free tier. [02:03], [02:29]
Topics Covered
- Codex vs Opus: Better Bang for Your Buck
- AI Models Are Bad at Reviewing Their Own Code
- The Fundamental Flaw in Single AI Systems
- Codex is a Great Value Play
Full Transcript
So we can now use Codex inside of Claude Code. OpenAI has made it happen: the number one competitor to Opus 4.6 is now something you can use inside of the Anthropic ecosystem. And this is great news for all Claude Code enjoyers, especially if you're someone who has been struggling with usage rates, because frankly, Codex gives you a way better bang for your buck in terms of dollars to credits/tokens. And so in this video, I'm going to show you how to set it up, and we're going to go through what Codex can actually do with the Claude Code harness on top of it. And, more importantly, what we can do using Claude Code with Opus 4.6 and Codex together, right? How can we play these two models off one another to get a sum that is greater than its parts?
Now, before we do the install, let's do a quick overview of what the Claude Code plugin brings us, because there's a few things. The two most important, I would argue, are the code reviews, right? The ability to essentially have it take a look at something Opus has written. That comes in two flavors. First, we have the standard Codex review, which is just a neutral, read-only review. The second is the adversarial review, which I love. This is essentially telling Codex: hey, take a look at what Opus has built, or what any coding agent has built, but have a very discerning eye. Kind of assume they screwed up and figure out what we can do to make it better. This is an awesome way to really improve our outputs, because one of the issues with Opus, and really a lot of AI models in general, is that they tend to do a bad job of evaluating their own code. This is something Anthropic talked about in their engineering blog that got released last week. So something like adversarial review? Perfect, love it. Other than that, we can also use Codex rescue, which allows us to have Codex create something all on its own, just like you would with Opus inside of Claude Code. And beyond that, there's some status stuff, like taking a look at where it is in its particular job. So let's dive in and take a look at the install.
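In plugin terms, those features map to a handful of slash commands. Here's a rough sketch of the surface (the names below are inferred from how the video describes each feature, so treat the exact spellings as assumptions and check the plugin's GitHub repo):

```
/codex:review               # standard, neutral, read-only review
/codex:adversarial-review   # review that assumes the implementation is broken
/codex:rescue               # have Codex build something on its own
/codex:status               # check where a running Codex job is in its work
```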
The install is pretty simple. You're just going to run this command to add it to the marketplace, and I'll have all these commands down in the description. Then you're going to run this plugin command to install it: codex@openai. As usual, it'll ask where you want to install it; I'm going to do user scope. Then we just need to reload the plugins to get it up and working. And lastly, we want to run codex:setup. In case you didn't realize, there's also a GitHub repo for this, which goes over all of the install commands, so I'll link that in the description as well. The usage rates are tied to your ChatGPT account, even if you're on the free account, apparently, so just understand it's going to be pulling from your Codex usage. It's going to ask if you want to install Codex: yes. From there, you log in, and that will send you to the browser, where it runs you through the authentication process.
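Putting those steps together, the whole install looks roughly like this (a sketch: the exact marketplace repo is linked in the description, and the command spellings may differ slightly from what's shown here):

```
/plugin marketplace add <codex-plugin-repo>   # add the marketplace (repo in the video description)
/plugin install codex@openai                  # install the plugin; choose user scope when prompted
# reload plugins so Claude Code picks it up, then:
/codex:setup                                  # confirms the Codex install and opens the browser login
```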
Now, there are really two obvious use cases for this Codex tool inside of Claude Code. The first is dealing with the usage limits inside of Claude Code. Normally, if you're on the Pro plan with Anthropic, or even the 5x Max, you can hit those limits very quickly, especially with some of the CLI bugs we've been seeing in the last week. If that's the case, what you might want to do is use Opus 4.6 to plan and Codex to execute. And to do that, again, it's very simple: you're just going to do Codex rescue, and from there you're going to give it the prompt. You can also specify a whole bunch of things, like all the flags you see here, including the effort level and all that. And remember, Codex, the model, is very solid, and again, the usage isn't even close to what Anthropic charges.
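As a concrete sketch of that plan-with-Opus, execute-with-Codex split (the rescue command name is as the video describes it, the effort flag is illustrative rather than confirmed, and PLAN.md is just a placeholder file name):

```
# 1. In a normal Claude Code session, have Opus 4.6 produce a plan only, e.g.:
#    "Write an implementation plan for the new scoring filter to PLAN.md. No code yet."
# 2. Then hand execution to Codex:
/codex:rescue "Implement everything in PLAN.md" --effort medium
```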
But I think the more interesting use case is what I talked about earlier, and that's the adversarial review. So let's put that to the test. I'm going to have it take a look at my Twitter engagement/research bot. This is the web app I had Claude Code build. Essentially, what it does is scan tweets in the AI space every 30 to 45 minutes. It has a quality filter. It has scoring signals based on a number of different parameters. It's connected to Supabase to make sure the tweets don't get repeated. It has a scoring system. It integrates softmax picks. Everything gets pushed to Telegram. And I also have AI built in there to help with responses. So there's a fair amount going on. On top of that, it also tracks all of my responses, so we can have a feedback loop. So this is relatively, well, it's not super complicated, but this isn't a landing page we're having it look at. We're going to see what Codex comes back with when we do an adversarial review on the code for this, right? So let's see how it does.
We'll keep it pretty open to interpretation. We're telling Codex: take a look at the codebase and let me know what you think. The first thing it does is tell us, hey, we're going to estimate the review size to determine the best mode. Then it asks: do you want to run it in the background, or do you just want to wait for the results? We're just going to wait for the results. And it's telling us the review scope includes the full codebase plus nine working-tree changes: one modified file and eight untracked files. So it knows there's a lot it needs to take a look at.
While that's working, let's talk about how adversarial review actually works. We just saw the first four parts, right? It parsed the arguments; we didn't pass any flags, so it's just going off its default settings. Then it estimated the review size, resolved the target, and collected some context. That was all that text about, hey, you know, we have these untracked changes, and this is going to take a while. Now, after those first four steps, it builds the adversarial prompt. There are seven attack surfaces it's going to pay special attention to: authentication, data loss, rollbacks, race conditions, degraded dependencies, version skew, and observability gaps. Right? So, seven things that are somewhat under the surface and could really screw us if we push this to production without having a handle on them. From there, it sends all that information back to the OpenAI server so Codex can take a look at it. And then it gives us our structured JSON output. We should expect it to look something like this, right? It will give a severity for each of its findings, whether critical, high, medium, or low, as well as recommendations and next steps. But all you have to do is sit there inside of Claude Code and wait for the response.
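Based on the fields described here and in the findings below, the structured output is shaped something like this mock-up (field names and values are illustrative, not the plugin's exact schema; severity is one of critical, high, medium, or low):

```json
{
  "findings": [
    {
      "severity": "high",
      "area": "telegram-polling",
      "issue": "Polling loop can process the same update twice after a restart",
      "files": ["src/telegram.ts"],
      "lines": "42-88",
      "impact": "Duplicate messages pushed to the channel",
      "fix": "Persist the last processed update offset before acknowledging"
    }
  ],
  "recommendations": ["Add integration tests around the polling loop"],
  "next_steps": ["Apply the fix, then re-run the adversarial review"]
}
```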
So Codex came back with four issues with our codebase, and all of them had a severity of high. I pasted this over to Excal so it's a little easier for us to go through. For each issue, it gives us the severity, the area, the actual issue, the files as well as the actual lines of code we need to take a look at, and then, importantly, the actual impact, as well as the fix. Number one, it's saying we had an issue with our dedup logic. Number two was how we were dealing with Telegram polling. Third was our schema drift. And lastly, our actual dashboard build. So this is relatively important stuff, and luckily it doesn't look like the fixes would be too difficult to implement.
But what I'm interested in is: okay, this is what Codex gave us. What would Claude give us if we asked for a similar sort of adversarial review on its own codebase? Because I think it would be kind of enlightening to see them head-to-head, and to see what Codex really does differently, because for all we know they're exactly the same and this whole video was pointless. So I'm now having Opus run the same sort of adversarial code review. I had Codex come up with the prompt. Essentially, it's just saying: hey, I want you to challenge the implementation and the design choices; here are some things I want you to evaluate; and here's the output format. So let's see what it comes back with. And here are the results, broken down. First of all, they did have one shared finding: they both agreed that the Telegram issue was a problem. This was the one issue that they both found and both rated either high or critical. Codex said it was just high, while Opus said it was critical. Now, Opus itself found seven additional issues ranked high or critical that Codex didn't. We're not saying that, just by virtue of flagging more issues, Opus was necessarily better than Codex; I'm just pointing out it found seven things we might want to look at that Codex didn't. Then, on the flip side, Codex found three issues that Opus missed. So what does this mean? If we look at this in totality, does it mean Opus is better than Codex because it found more, or that Codex is better than Opus because it narrowed down on four and didn't take us down a weird path? I think what you draw from this is kind of whatever you want to draw from it. And that is probably that there's real value in having these two systems look at it, right? It's a second pair of eyes, versus having Opus grade Opus all the time.
You know, I think there is some sort of fundamental flaw in having the same AI system do the planning, the generating, and the evaluating. And if we're able to very easily bring in Codex, especially at its price point, to do even just things like this, like an adversarial review, that's one of those great on-the-margin AI coding plays. Why not? If you're already paying for ChatGPT, if you're already paying the 20 bucks a month, and I can now bring this in and have Codex take a look at anything this simply, what's the downside, really? I mean, I don't think a quick test like this gives us any definitive answer like, oh, Codex is better than Opus. And I think that whole conversation sort of misses the point. This is just one more tool in our toolbox, and now we can use it.
So I think this is great. Now, we can get way more specific with adversarial review as well, because our prompt was pretty open-ended, and it was able to interpret it in a lot of different ways. But just based off the GitHub examples, you can get pretty specific about what you want Codex to look at.
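For example, instead of "take a look at the codebase and let me know what you think," a targeted run might look like this (an illustrative prompt, not one from the plugin's docs):

```
/codex:adversarial-review Assume the Supabase dedup path and the Telegram
polling loop are broken. Focus on race conditions and data loss; ignore style.
```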
So overall, I think this is a great addition to the Claude Code ecosystem. The more tools, the better. Especially if you're someone who either (a) is already paying for ChatGPT, or (b) is on the Anthropic Pro plan and maybe also paying for ChatGPT: 100 bucks a month might be a little much, and 200 bucks might certainly be too much. This almost gives us a middle ground between the $20 sub and the $100 sub, because Codex really is a great value play. So definitely check it out. Super easy setup. Let me know what you thought. And as always, I'll see you.