The 2.3B AI Model that "Thinks" like a 70B (Gemma 4)
By Better Stack
Summary
Topics Covered
- Highlights from 00:00-02:10
- Highlights from 01:59-04:24
- Highlights from 04:15-06:42
- Highlights from 06:30-08:33
- Highlights from 08:33-10:17
Full Transcript
Last week Google did something unexpected. They released a truly open-source model under an Apache 2.0 license. It's called Gemma 4, and it features specialized Edge versions as small as 2.3 billion parameters that are designed to run entirely offline on devices like your iPhone, Android flagship phones, or even a Raspberry Pi. It seems like the race to build the ultimate small model is really heating up. Just a few weeks ago I did some tests on Qwen 3.5 to see how it was pushing the limits of local AI, but now Google is promising even higher intelligence density. So in this video, we're going to perform similar tests on Gemma 4 to see if this model is truly the best small model out there. It's going to be a lot of fun, so let's dive into it.
[music] So what's so unique about these new Gemma 4 models? Well, the real technical shift here is something Google calls per-layer embeddings. In traditional transformers, a token gets one embedding at the start that has to carry all its meaning through every layer. But in Gemma 4, each layer has its own set of embeddings, allowing the model to introduce new information exactly where it's needed. This is why you see the E in the E2B and E4B model names: it stands for effective parameters. While the model acts with the reasoning depth of a 5 billion parameter model, it only uses about 2.3 billion active parameters during inference. This results in a much higher intelligence density, allowing it to handle complex logic while using less than 1.5 GB of RAM. And beyond the text performance, Gemma 4 is natively multimodal. This means vision, text, and even audio are processed within the same unified architecture rather than being bolted on as separate modules. This architecture enables a new thinking mode that uses an internal reasoning chain to verify its own logic before giving you an answer. This is specifically designed to prevent the infinite loops and logic errors that often plague small models.
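As a toy illustration of the per-layer-embeddings idea described above (this is not Gemma's actual architecture, just a minimal sketch of the concept): a standard transformer injects a token's embedding once at the input, while per-layer embeddings let each layer add its own token-specific vector before its transformation runs.

```python
import random

random.seed(0)
D, LAYERS = 4, 3  # toy hidden size and layer count

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(D)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Traditional transformer: one input embedding must carry
# all of a token's meaning through every layer.
def standard_forward(token_embedding, layers):
    h = token_embedding
    for layer in layers:
        h = layer(h)
    return h

# Per-layer embeddings (conceptual): each layer injects its own
# embedding for the token, so new information can enter exactly
# where it is needed instead of only at the input.
def ple_forward(token_embedding, layers, per_layer_embs):
    h = token_embedding
    for layer, ple in zip(layers, per_layer_embs):
        h = layer(add(h, ple))  # inject this layer's embedding
    return h

# A "layer" here is just a placeholder transformation.
layers = [lambda h: [x * 0.9 for x in h] for _ in range(LAYERS)]
tok = rand_vec()
ples = [rand_vec() for _ in range(LAYERS)]

out_std = standard_forward(tok, layers)
out_ple = ple_forward(tok, layers, ples)
print(len(out_std), len(out_ple))  # both outputs keep dimension D
```

In a real model the per-layer vectors would be learned lookup tables per token, which is why they can raise effective capacity without raising the active parameter count much.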
It also ships with a 128K context window and support for over 140 languages, which should make it significantly more capable at tasks like complex OCR or
localized language identification. And
to showcase these abilities, Google released some eye-opening benchmarks. In
their internal tests, the E4B model achieved a score of 42.5% on the AIME 2026 mathematics benchmark,
which is more than double the score of much larger previous generation models.
They also demonstrated the model's agentic potential on the T2 bench, where it showed a massive jump in tool-use accuracy, and again through a feature called agent skills. Instead of just generating static text, the model was shown using native function calling to handle multi-step workflows, like querying Wikipedia for live data or building an end-to-end animal calls widget.
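The video doesn't show the underlying requests, but native function calling of this kind is typically driven by an OpenAI-style `tools` array in the chat request. Here is a minimal sketch of such a request, assuming a hypothetical `query_wikipedia` tool and a placeholder model id (neither is taken from the video):

```python
import json

# A hypothetical tool definition in the OpenAI-style function-calling
# format that many local inference servers accept.
tools = [{
    "type": "function",
    "function": {
        "name": "query_wikipedia",  # hypothetical tool name
        "description": "Look up a topic on Wikipedia and return a summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "topic": {"type": "string", "description": "Topic to look up"},
            },
            "required": ["topic"],
        },
    },
}]

request = {
    "model": "gemma-4-e4b",  # placeholder model id
    "messages": [{"role": "user", "content": "Who designed the Eiffel Tower?"}],
    "tools": tools,
}

# The model can then answer with a tool_call naming query_wikipedia,
# the client runs the tool, and the result is fed back as a message.
print(json.dumps(request)[:60])
```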
Now all of that sounds impressive, but let's try it on our own and see how it works. In my previous Qwen 3.5 video, I tested the small models by running them locally, without an internet connection, using LM Studio and Cline. I will use the same setup for testing Gemma 4. First, we have to download the models in LM Studio, then increase the available context window and start the server. We can then jump into Cline, hook up our local LM Studio server, choose the E2B model, turn off our internet connection, and begin our tests.
Last time we saw that Qwen 3.5 was quite decent at generating a simple cafe website using HTML, CSS, and JavaScript with two of its smallest-parameter models. Let's reuse the same prompt and see if Gemma 4 is just as good at this coding task. So it took the E2B model roughly 1.5 minutes to complete the task, and for a model with 2.3 billion active parameters, the results were honestly a bit underwhelming compared to Qwen's output, which used only 0.8 billion parameters. The most annoying thing was that Gemma appended the task list at the end of the HTML file as well as at the end of the CSS file, so I had to manually delete it from both files before opening the page. It also claimed it had written a JavaScript file when in fact no JS file was produced in the final output. So the E2B test results were a bit disappointing.
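For reference, the setup described above can also be exercised without Cline: LM Studio exposes an OpenAI-compatible HTTP server, by default on port 1234. A minimal sketch of sending the same kind of coding prompt, assuming a placeholder model id for the downloaded Gemma build:

```python
import json
import urllib.request

# LM Studio's default OpenAI-compatible endpoint. The model id is a
# placeholder -- use whatever name LM Studio shows for your download.
URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "gemma-4-e2b",  # placeholder id
    "messages": [
        {"role": "user",
         "content": "Generate a simple cafe website using HTML, CSS and JavaScript."},
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    # This only succeeds if an LM Studio server is actually running.
    with urllib.request.urlopen(req, timeout=120) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"]
        print(answer[:200])
except OSError as err:  # server not running, connection refused, etc.
    print("LM Studio server not reachable:", err)
```

Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client library would work here as well.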
But the situation did improve quite a lot when switching to the E4B model version. It took this version roughly 3.5 minutes to finish the task, but the end result was notably better. Maybe not in terms of design, it still looks very bland, but this version actually had working cart functionality, which none of the previous tests, for either Qwen or Gemma, were able to produce successfully. So the E4B version is already a big step up from the E2B version, but obviously no one would seriously consider using such small models for complex or serious coding. I just conducted these tests out of curiosity, to see if such a small parameter count can still produce a meaningful result for a given coding task.
All right, now let's see how Gemma 4 performs on edge devices like an iPhone. In my Qwen 3.5 video, I built a custom iOS app that was capable of running the model on the native Metal GPU using Apple's MLX framework via its Swift bindings. Although Gemma 4 is open source, there are unfortunately no MLX bindings available for this model yet that could run it on iOS with multimodal capabilities. And Google themselves are running Gemma 4 in their AI Edge Gallery app using their own inference framework, LiteRT-LM, which sadly also doesn't support iOS bindings at the moment.
So to try it out on an iPhone, our best option right now is to use their Edge Gallery app, so we're going to conduct our tests there and see how it performs. Let's go to the AI chat section. Here we are prompted to download the E2B version of Gemma 4; you also have the option to download the E4B version, but for some reason the app says I don't have sufficient space to download it, which I'm sure is not true, so maybe that's a bug in the app. But anyway, now that I've downloaded the model, we can finally start using it. Let's start by typing a simple hello. Wow, did you see how fast the response was? A lot faster than Qwen 3.5. Maybe this is the magic of the LiteRT-LM framework they're using. So now let's try the famous car wash test and see if Gemma gets it right. Wow, it gives me a really long response, and at the end we see that the final recommendation is to drive, which is correct. But I do have to take into account that it's reasoning from convenience and comfort rather than the actual logic of the puzzle, so I don't know, it kind of passes the test, but it kind of doesn't at the same time.
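The tests in the Edge Gallery app are all driven through its UI, but it's worth noting how the same kind of prompt, including the image questions coming up next, would look against an OpenAI-compatible multimodal server: images are sent as mixed-content messages. A sketch, where both the model id and the base64 image data are placeholders, not values from the video:

```python
import json

# OpenAI-style vision message, as accepted by many local multimodal
# servers. Placeholder image data -- a real request would base64-encode
# the actual picture.
image_b64 = "<BASE64-ENCODED-IMAGE>"

request = {
    "model": "gemma-4-e2b",  # placeholder id
    "messages": [{
        "role": "user",
        # Mixed content: one text part and one image part.
        "content": [
            {"type": "text", "text": "What breed is the dog in this picture?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

print(json.dumps(request)[:60])
```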
All right, now let's hop over to the ask-image section and see if Gemma can identify the dog in this picture. So it did identify that it is indeed a dog, and it gives some other details about the image, so that's pretty cool. But if I ask it what the breed of the dog is, it replies that it's a border collie, which is not true; it is actually a corgi. But I do have to say, for just over 2 billion active parameters, this response is pretty good nonetheless. Lastly, let's try the OCR test. If you watched my previous video on Qwen 3.5, you will recall that I tested it with an image containing text in Latvian, which is also my native language. Now, Gemma is touted as understanding over 140 languages, so it should pass this test easily. And yes, indeed, it does identify that the language is Latvian, and I'm surprised that most of the text is actually pretty spot-on, with some minor exceptions. I see that some words are nonexistent and some of the grammatical structures are just very bizarre, but it's still very impressive, so I'll give this test a pass. Now, this actually begs the question: can I chat with this model in Latvian? So let me try that next.
So I see that the response is actually in Latvian, but once again the grammatical structures are very bizarre, and nobody talks like that. Still, Latvian is a very small language, so it's already impressive that all that knowledge fits in such a small model. And while I'm at it, I'm going to ask it who the current US president is, to probe the knowledge cutoff of Gemma 4. It replies that it is Joe Biden. And if I then ask what its knowledge cutoff is, it tells me January 2025, which checks out. So there you have it. That is Gemma 4, the newest open-source model by Google, and I've got to be honest, this model does seem pretty good. It does what it advertises, albeit lacking some creativity in web design, but other than that, the small models, as we just saw, are more than capable of successfully completing the tasks I gave them. It's a shame we still don't have MLX bindings for this model, because I would really love to use Gemma 4 locally in a custom iOS app, but I'm sure it won't take long for Google to get that out to the public. In the meantime, I'm keeping a close eye on community projects like Swift LM, which are already working on unofficial native bindings for these models. So those are my two cents on the model. What do you think about Gemma 4? Have you tried it? Will you use it? Let us know in the comment section down below. And folks, if you like these types of technical breakdowns, please let me know by smashing that like button underneath the video, and don't forget to subscribe to our channel. This has been Andrus from Better Stack, and I will see you in the next videos.
[music]