
Grok 4.20 is still deeply flawed

By David Shapiro

Summary

Topics Covered

  • Parallel Agents Speed AI
  • Specialization Beats Generalists
  • Manual Multi-AI Outperforms Single
  • AI Cherrypicks and Hedges
  • AIs Now Recognize Dysbiosis

Full Transcript

Grok 4.20 is out, and of course, if you know the joke, Grok 420. Anyways, Grok 4.20 is out and I just wanted to share some of my first thoughts. I've been using it, and it is definitely a big, big step up from older versions of Grok. It still has some of the same problems. But with that being said, one of the ways that I stress test these models when they first come out is I give them some problems that I've been working on for a while, like my chronic health issues, post-labor economics, and that sort of thing. And my initial reaction to Grok is that it is pretty smart.

It's really interesting how quick it is. So let me dive into that, because people ask: how is it this fast, and what's the benefit of having four different agents? If you haven't used it yet, it basically spins up four different agents with different personalities, and they do research and they talk to each other and then they spit out an answer. There are a few reasons that this can speed it up and make it more intelligent.

Number one is parallel processing. Parallel processing has been in computer science for a very long time, but it's become even more popular lately, particularly as you have larger CPUs. If you ever open Task Manager on your computer, you see that you have many different CPU cores, and each CPU core can be working in parallel. Likewise, GPU cores can be working in parallel, and so on and so forth. But the same thing applies to higher-level agents. I'm not sure exactly how each individual personality within Grok 4.20 is set up, but you probably have some that are tasked with research, some that are for argumentation, some that are for thinking critically, something along those lines. And of course Grok Heavy, which has existed for probably about a year now (I don't remember exactly when it came out), has basically done the same thing, but now they've made a distilled version of Grok that is cheaper to run. So instead of 10 agents, I think you have four.

Now you say, okay, working in parallel, but why? You have different personalities, meaning each one has a different strength or weakness. And we've seen this for a long time in multi-agent frameworks, where agents are really good at doing one thing. They're not good at balancing multiple different tasks. And this is not so different from humans. Here's an example from my very first IT job at an insurance company. That's why you have claims processors, and then you also have adjusters, and then you have risk management people. So by giving an agent one specific task, they do much better on that particular task. However, that means that each agent by definition has blind spots.

When you take that into account, here's another example of where the industry is going, on coding agents. You can have agents that are specifically meant for just coding, getting the code to work and shipping it. Then you have security agents that are looking for vulnerabilities and best practices. You have other agents that are managing pull requests and that sort of thing. So this is the direction that everything is going, where we're already seeing division of labor and specialization in AI agents.
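To make the parallel, specialized-agent idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the role names are hypothetical (the talk says the actual setup inside Grok 4.20 is unknown), and `run_agent` is a stand-in that sleeps instead of calling a real model.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical roles: the talk does not say how Grok 4.20's agents are configured.
ROLES = ["researcher", "arguer", "critic", "synthesizer"]


def run_agent(role: str, question: str) -> str:
    """Stand-in for one specialized agent; sleeps to simulate model latency."""
    time.sleep(0.2)
    return f"[{role}] notes on: {question}"


def run_sequential(question: str) -> list:
    """One agent after another: total time is roughly the sum of the latencies."""
    return [run_agent(role, question) for role in ROLES]


def run_parallel(question: str) -> list:
    """All agents at once: total time is roughly the slowest single agent."""
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        # pool.map returns results in the same order as ROLES.
        return list(pool.map(lambda role: run_agent(role, question), ROLES))


def timed(fn, question: str):
    """Run fn(question) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(question)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    question = "Is organic food better than conventional food?"
    _, seq_time = timed(run_sequential, question)
    answers, par_time = timed(run_parallel, question)
    print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
    for line in answers:
        print(line)
```

Threads are enough here because each agent mostly waits on a (simulated) remote call. The blind-spot point still applies: each role only returns its own notes, so something has to read all four outputs and reconcile them.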

The advantage of having something like Grok do this for you automatically for every single task is that it's basically like you have your own little chamber of experts. Now, what I have been doing for the last, I don't know, probably six months, maybe not quite a full six months, is I open every AI in parallel: Grok, Gemini, Claude, and ChatGPT, because they all have different personalities and they all have different strengths and weaknesses. So when I'm doing my research, whether it's on my personal issues or on post-labor economics, I'll copy-paste the same message to every single one, because some of them will think in a certain way and get good insights, and others will think in a different way and get good insights. Then I'll read all four responses and use that to formulate my next message to all of them. So they don't realize that I'm actually talking to each one individually. But the conversations invariably go better, because ChatGPT might go off the rails and not be useful, or Gemini might hallucinate something and not be useful, but then Claude and Grok will have one really solid idea and I'm like, cool, let's seize on that idea.

And this is quasi-related to stuff like Monte Carlo search. Basically, if you treat all large problems as a high-dimensional problem space to explore, you're exploring what all the possible solutions and answers are. What you're looking for is pathfinding to the most coherent possible final answer. For me, that means aggregating those different pathways of thought into my brain and then barfing them back out to agents that are working individually, and then me giving each of them a task. This is really powerful, but it's not very efficient, because I'm having to manually do all the conversations.
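The manual loop described here (same message to every model, read all replies, seize on the best idea, build the next message from it) could be sketched roughly as follows in Python. Everything in it is an assumption for illustration: the four "models" are canned functions rather than real API calls, and `score` is a toy stand-in for the human judgment of which reply holds the solid idea.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]


def make_model(name: str, idea: str) -> Model:
    """Build a hypothetical assistant that always contributes the same idea."""
    def reply(prompt: str) -> str:
        return f"{name} on '{prompt[:30]}': {idea}"
    return reply


# Canned stand-ins; real use would paste the prompt into each chat product.
MODELS: Dict[str, Model] = {
    "grok": make_model("grok", "look at the evidence from trials"),
    "gemini": make_model("gemini", "imagine a detailed scenario"),
    "claude": make_model("claude", "restate the question carefully"),
    "chatgpt": make_model("chatgpt", "qualify the original claim"),
}


def fan_out(prompt: str) -> Dict[str, str]:
    """Send the same prompt to every model and collect the replies."""
    return {name: model(prompt) for name, model in MODELS.items()}


def pick_best(replies: Dict[str, str], score: Callable[[str], float]) -> Tuple[str, str]:
    """Return (model_name, reply) for the highest-scoring reply."""
    return max(replies.items(), key=lambda item: score(item[1]))


def converse(first_prompt: str, score: Callable[[str], float], rounds: int = 2) -> List[Tuple[str, str]]:
    """Repeat: fan out, keep the best idea, and turn it into the next prompt."""
    prompt = first_prompt
    history: List[Tuple[str, str]] = []
    for _ in range(rounds):
        name, best = pick_best(fan_out(prompt), score)
        history.append((name, best))
        prompt = f"Build on this idea: {best}"
    return history


if __name__ == "__main__":
    # Toy scoring rule: prefer replies that mention evidence.
    for name, best in converse("What explains these symptoms?", lambda r: r.count("evidence")):
        print(name, "->", best)
```

Each round keeps one path through the problem space and discards the rest, which is the loose sense in which this resembles a search. In practice the scoring step is the person reading all four answers.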

What Grok does, and this is what I think pretty much all AI agents are going to do in the near term, is do that for me. The agents all take different perspectives and that sort of thing.

Now, with that being said, Grok does still have its biases. It still has what I call Elon epistemics. He's got his, what would you say, Butlerian Jihad against wokeism. And so Grok still has certain biases, certain epistemic flaws, that the other models do and don't have. And of course, no model is perfect. So I still use them in parallel, but essentially I have a larger team working for me on every single problem.

So that's kind of the TL;DR as to what has structurally changed, because we have AIs that have tool use and agentic search and that sort of thing: deep research tools, file manipulation, coding, and that sort of stuff. So this is the next obvious stage in the ramp-up to ubiquitous agents. I don't know if it's been verified, but I did see that over in China, Baidu already has OpenClaw integrated into their browser, so that you can use OpenClaw agents directly from your browser. China is moving fast on this and we need to catch up.

Now, with that being said, Grok does still have some flaws. Its epistemics are better, in that it will actually trust trustworthy sources like Mayo Clinic and that sort of thing. But it will also cherrypick. It'll cherrypick like crazy. And it basically does the same thing that ChatGPT does, where it assumes that you're wrong and it just says, "Oh, well, you made a strong claim, so I'm just going to try and prove that wrong." In one example, one of my tests for its epistemic clarity is I say, "Is eating organic food better than eating conventional food?" And of course the non-woke answer is no, organic is no better. Then it'll double down on that and I'll argue with it. There are a few ways that I've learned to get AIs to be a little bit more epistemically responsible. So I'm like, okay, what's the null hypothesis? Or, use syllogisms. In this case, a syllogism would be: all organic food is neutral or good for you. Some conventional food is bad for you. Therefore, if you make no other changes, eating organic food is better for you, just because it reduces a risk surface, and those sorts of things.

Also, all of these AIs are very, very US-centric. I had to point out to Grok, for instance, that the European Union has banned a whole bunch of herbicides and pesticides that they still use willy-nilly in America. And I'm like, "Okay, so you're telling me that if I eat food that has some of these banned pesticides on them, there's zero risk to that, even though they're known to be carcinogenic in Europe?" And it's like, "Actually, you're right. That's kind of silly." So, those kinds of things. You do have to be aware of the biases that these AIs have. And they're also really, really narcissistic when they argue.

And when I say narcissistic, I don't mean that they have a grandiose sense of self. I mean, Grok literally used to have an explicit value of "I am maximally truth-seeking, and therefore anything that I say is gold." So that's the way the personality came off. And ChatGPT is, I would say, still the worst offender on the narcissism index, because it cannot help itself from saying, "Well, but you're wrong in this one little way." What I will say is Grok and ChatGPT are both the worst on reframing what you're saying and cherrypicking your words, or even ignoring what you're actually saying.

Another experiment that I did with Grok yesterday: I said, if the United States attacks Iran and the Iranian regime changes to something that is at least Western-neutral or Western-allied, is that not a permanent structural... I think the terminology I used was, wouldn't that irrevocably weaken Russia and China? And Grok was like, "No, no, no, Russia and China will adapt. They don't have any dependence on Iran whatsoever." So then I turned it around. I said, "Okay, since you say that Iran is not at all important to Russia and China, in what world are Russia and China stronger with an Iranian regime change?" And it's like, "In no world are they stronger." I said, "Okay, so then how can you defend your previous statement that Iran has no bearing on Russia and China and their geopolitical strength in the future?"

So one thing is you just get really good at debating. And sometimes I copy-paste the messages between these things. Anyways, getting a little bit lost in the weeds. The point is that Grok still has those flaws where it basically will choose the middle of the road, the safe option. It'll hedge. And the worst hedger is Claude, by far.

If you give Gemini a hypothetical, like, let's just imagine that Iranian regime change happens, what happens? It's like, "Cool, let's go." It's full-on into fiction and science fiction. It's ready to go. It's ready to hallucinate if you need it to. But ChatGPT and Grok will both spend so much time and energy qualifying and redefining what you said until it basically is meaningless. It's a facsimile or a simulacrum of what you actually asked. And then Claude, when you ask dangerous hypotheticals, will just start restating your position to you and not actually thinking. So Claude just pretends to be dumb, and I'm like, stop performing ignorance. You're an AI that is trained on literally all geopolitical history, all military theory. Stop pretending like you have no idea what's going on here. So Claude is deliberately useless on risky issues like geopolitics or frontier medicine or that sort of thing.

So yeah, this is why I set all of them up in parallel. And I know that not everyone can afford it, but a lot of these have free tiers or very, very cheap tiers. And also, what people pay for today is going to be standard for everyone in 6 to 12 months. So when you're paying for an AI subscription, you're only paying for peering 3 to 6 months into the future for what everyone is going to have access to. And then in terms of open source, you're only about 12 to 18 months into the future, because fully open-source models, some of which you can run locally, are going to be available and they're going to be performing at this level. So yeah, I guess that's where I'll leave it today.

I just wanted to share some initial thoughts on Grok and some of the problems that I see with these AIs. I post about that all the time on Twitter, and I occasionally have people reach out to me. I had one of the product leads at Gemini ask, "How can we make this better for you?" And back in the day, some of the Grok leads reached out to me. So yeah, I'm just going to keep complaining, because people see my complaints, and I don't know if my complaints get integrated, but certainly the models behave a little bit better now.

I'll leave you with one final example where every model that I tested is actually better. I complained for years and years about my gut health, and human doctors didn't take it seriously. I went and got a GI-MAP, which is a very particular kind of test. Most Western medicine does not respect the GI-MAP. So for months and months, like most of last year, ChatGPT would be like, "Well, the GI-MAP is not considered a gold standard, so we can just disregard everything on it." Literally, in the thinking traces, I would give it the results and it's like, "Well, this test isn't reliable and that test isn't reliable, so let's just discard all that and just look at the things that are scientifically validated and proven." I'm like, okay, if you're just going to deliberately ignore most of the information I'm giving you, you're kind of useless.

So anyways, I reran a test where I basically said, "Here are the three top findings from my GI-MAP. Here are my two primary symptoms. What is going on?" And most of last year, none of these models would even use the word dysbiosis. Now, if you're not familiar with the word, dysbiosis is basically the opposite of eubiosis, or I guess symbiosis is technically the correct term. A healthy gut has a microbiome that is symbiotic, meaning everything works together and so on and so forth. But dysbiosis means that you've got microbes that are out of balance or you've got pathogens in your gut. And every model that I tested, as of I guess yesterday or the day before, used words like "clear-cut case of dysbiosis." And I'm like, okay, cool. Something has changed here. And all last year I had to argue with these models.

I'm like, stop looking at just the United States. Look at Germany. Look at Japan. Look at Russia, because their gut health science is so much more advanced than American gut health science. So I had to explicitly tell them, and I had to know what to look for. But now every single model, Grok, ChatGPT, Gemini, Claude, they all passed the test and were able to recognize it with just five pieces of information. I had three things from my GI-MAP: high zonulin, low sIgA, and high Streptococcus. And then my symptoms were chronic fatigue and food intolerances. I literally just gave them five pieces of information, and every single one of them indexed on the correct answer. And when I started this journey over a year ago, it was a fight to even get them to recognize that dysbiosis was a thing.

So yeah, the AIs are getting better at epistemics. There are still some flaws. Anyways, I'm repeating myself, so I'm going to let you go. Thanks for watching to the end if you did. All right. Cheers. Have a good one.
