Tiny AI Is About to Change Everything (IBM Granite 4.0)
By Better Stack
Summary
## Key takeaways

- **AI progress is in smaller, efficient models**: The excitement in AI is shifting from massive models to smaller, more efficient ones that can run on everyday hardware, signifying real engineering progress. [00:04], [00:20]
- **IBM Granite 4.0: Hybrid architecture for efficiency**: IBM's Granite 4.0 models combine transformer and Mamba layers, enabling efficient handling of long contexts (hundreds of thousands of tokens) and better performance with fewer active parameters. [01:35], [02:18]
- **Granite Nano (350M) runs offline in browser**: The smallest Granite Nano model (350M parameters) can be tested in the browser, demonstrating capabilities like tool usage and JSON formatting, even with a 2023 knowledge cutoff. [03:51], [04:47]
- **Local AI code completion on M2 MacBook**: An offline AI coding assistant was built using Transformers.js and Granite models, enabling code completion directly in the browser on a laptop without an internet connection. [05:36], [06:22]
- **Granite models: Stable, secure, and open-sourced**: IBM's Granite models are noted for their stability, security, and open-source availability, making them suitable for various projects without requiring high-end GPUs. [08:55], [09:10]
Topics Covered
- AI model releases are boring, small models are exciting.
- IBM's Granite 4.0: Hybrid architecture for efficiency and long context.
- Small models offer more capacity for less compute.
- Responsible AI and practical efficiency: offline browser apps.
- Local AI coding assistant for offline development.
Full Transcript
I got to be honest, folks. These days,
when a new giant AI model drops, it just
doesn't feel that exciting anymore. I
mean, sure, the new model probably moves
the needle a tiny bit higher on the
benchmark scale, and that's cool and
all, but is it really a big achievement
at this point? Now, what actually gets
me excited is when someone releases a
tiny model, something lighter, more
efficient, and closer to running on
everyday hardware. I'm still waiting for
that day when we will have AI models
small and efficient enough to run on a
Raspberry Pi. But until then, every
meaningful step towards smaller and
faster models feels like real
engineering progress. I recently covered
tiny models like Microsoft's BitNet and NVIDIA's Nemotron, but today we are looking at another family of compact, efficient models, IBM's Granite 4.0
series. It seems that IBM, like a
phoenix, has finally risen from the
ashes and has established itself as a
serious player in the AI space. Their
latest Granite 4 models are seriously
impressive, but not because they're
huge, but because they're small and
effective. In today's video, we're going
to look at the Granite 4 models in closer detail, see how they perform, and I will
also show you an offline code completion
app I built with these models, which you
can run directly in your browser. It's
going to be a lot of fun. So, let's dive
into it.
So, Granite 4.0, this model family is
built around a pretty interesting idea.
Instead of relying only on the classic
transformer stack, IBM mixed transformer
layers with something called Mamba
layers. Transformers are great at
language understanding, but they get
expensive as sequences get longer. Mamba
layers handle long context more
efficiently, and the Granite 4 models
were trained with very large context
windows. We are talking hundreds of
thousands of tokens. That means they can
actually hold entire documents, code
bases, technical specs, legal contracts,
or large chat histories in memory
without constantly forgetting what
happened earlier. For workflows like
RAG, coding assistance, or research
automation, that's pretty huge. But the
part that stands out the most is how
small some of these models are while
still performing well. For example,
Granite 4.0 Small sits at roughly 32
billion total parameters, but only 9
billion are actively used at any time
because of the hybrid architecture and
the use of mixture of experts. And if we
compare the benchmarks, even the 3
billion parameter Granite model beats some older 8 billion parameter models from the previous Granite generation. So you
are getting more capacity for less
compute. And these models are also
cryptographically signed. Their training
data is documented and the family is
aligned with the ISO/IEC 42001 standard for responsible AI. And that
matters if you want to deploy them
in finance, healthcare, or
government environments. Now, from a
practical standpoint, the efficiency
gains are the real story. Lower memory
usage means more people can run these
models on smaller GPUs or even CPU
optimized setups. It means faster
inference, lower latency, and lower
operational costs. And here's the
coolest part. You can even run them
locally offline in your browser using
the Transformers.js
library. Richard already made a great
video on how to get started with
Transformers.js, so be sure to check
that one out as well. But now let's see
how this actually works. So IBM has made
a little demo page on Hugging Face where you can test out the Granite models in
the browser. So let's go ahead and test
it out with the smallest model, the 350
million parameter Granite Nano. As soon
as we open the chat window, we see some example prompts we can try out. Most
of these are using the tool
functionality, which is a pretty cool
feature. If we open the tools panel, we
can see that there's a bunch of tools
that the model applies based on the
context of the prompt. And what's even
cooler is that you can modify the
response object to be a neatly formatted
React component with styling and
everything. So, for example, if I ask
what time it is, it will immediately
call the get time tool and output the
response in a neat React component. It's
kind of similar to the MCP UI library, which James covered in an excellent overview video. Okay, so this model can
also call tools which is awesome. But
let's see what its knowledge cutoff is.
So if we ask who the current US president is, you will see that
it replies with some outdated
information. And that's because this
model was trained back in 2023. So for
some newer information, it might be a
bit outdated. But nonetheless, it's
quite impressive to have such a fast,
tiny model running entirely on your
browser. For a final test, let's give it
a bunch of data and see if it can format
it in a JSON format. As you can see,
it's doing a pretty fast job of
converting it to JSON, but because of
the token limit on the demo page, the response gets cut off at the fourth customer. This is easily fixable,
however, if you're building your own AI
agent. We can just provide a larger
token limit. Okay, so I was thinking: in what kind of scenario would a browser-based offline AI model be useful?
And then I realized a pretty cool tool
would probably be an offline AI code completion assistant, which you can use, for example, when you're traveling on a plane. So I
decided to build a little proof of
concept AI coding assistant app that
works offline, uses the Transformers.js library, and runs a Granite model behind
the scenes. So this is the app and when
you first load it, it will download the
model to your local cache and store it
for future offline use. Now I can turn off my internet connection and try the app in offline mode.
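That download-once, run-offline behavior comes from Transformers.js caching the model weights in the browser after the first load. A minimal sketch of how such a load could look (the model ID below is a placeholder, not necessarily the export used in the video):

```javascript
// Transformers.js fetches the ONNX weights on first use and stores them in
// the browser cache, so later page loads work without a network connection.
//
//   import { pipeline } from '@huggingface/transformers';
//
//   // Placeholder model ID -- substitute the Granite export you actually use.
//   const complete = await pipeline('text-generation', 'placeholder/granite-onnx');
//   const [output] = await complete('function add(num1, num2) {', {
//     max_new_tokens: 64, // illustrative limit, not the app's actual setting
//   });

// The request shape used in the commented sketch above:
const REQUEST_OPTIONS = { max_new_tokens: 64 };
```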
So now if I write a function called add
with parameters num1 and num2, we
can see that the model now suggests
adding the exact next line. And if I
press tab, it will apply that suggestion
to my code. And this is all happening
locally on my M2 MacBook with no
internet connection. How cool is that?
And we can see that if we go to the next
line, it will also suggest adding the
closing bracket as well. And now if I
add an examples comment line, look at
that. The model understands we want to
include some samples of that function.
This is some solid AI code completion
right there. Let's try another example.
Let's see if it can write a Fibonacci
sequence function. And look at that. It
suggested the entire Fibonacci code
block and applied recursion as well. And
once again, if we ask for examples, it
will suggest some additional console
logs. Now, I do have to say it is not
perfect. Sometimes it suggests the same
code line as you just wrote and
sometimes it forgets to include the
closing bracket. So, this is still very
much a proof of concept. But the very
idea that I can use a local AI model
offline to create an AI coding
assistant, that's pretty cool. Lastly, I
want to show you what the code for this
little app looks like. Oh, and by the
way, folks, if you want to clone this
project and try it on your own, I've
also included the GitHub repo link in
the description below. So, on the coding
side, we have this simple HTML page with
the code textarea component. And in the main.js file, we are loading in the Granite 4.0 1 billion parameter model. I
also tried the 350 million parameter
version, but that one was too unreliable
for this project. Then we load in the
tokenizer and the model to our cache and
we use 4-bit quantization because this is the level recommended by IBM for this particular model version.
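With Transformers.js, that loading step might look roughly like the sketch below. The model ID is a placeholder, and `dtype: 'q4'` is the library's option for 4-bit quantized weights:

```javascript
// Sketch only -- needs the @huggingface/transformers package, and the first
// load downloads the weights (cached afterwards for offline use).
//
//   import { AutoTokenizer, AutoModelForCausalLM } from '@huggingface/transformers';
//
//   const MODEL_ID = 'placeholder/granite-4.0-1b-onnx'; // not the real ID
//   const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);
//   const model = await AutoModelForCausalLM.from_pretrained(MODEL_ID, {
//     dtype: 'q4', // 4-bit quantization, as recommended for this model
//   });

// The one concrete setting from the walkthrough:
const LOAD_OPTIONS = { dtype: 'q4' };
```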
Next, here's a very detailed prompt
telling the agent to only output the
next suggested lines of code and we are
using a
128 token limit, although we could
easily increase this for longer
suggestions if we wanted to. And I also
found that setting temperature to one
worked best for this project. And then
we get the raw response from the model.
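The pieces just described, plus the cleanup and debounce helpers covered next, can be sketched in plain JavaScript. Function names and the exact prompt wording are assumptions, not the repo's actual code:

```javascript
// Prompt: instruct the model to only emit the next lines of code.
function buildPrompt(code) {
  return [
    'You are a code completion engine.',
    'Only output the next suggested lines of code.',
    'Do not repeat existing code and do not add explanations.',
    '',
    code,
  ].join('\n');
}

// Generation settings mentioned in the video: 128-token limit, temperature 1.
const GENERATION_OPTIONS = { max_new_tokens: 128, temperature: 1.0 };

// Cleanup: strip markdown fences and chatty filler so only code remains.
function cleanResponse(raw) {
  return raw
    .replace(/```[a-z]*\n?/g, '')              // drop markdown code fences
    .replace(/^(Here('s| is)|Sure).*$/gim, '') // drop conversational lead-ins
    .trim();
}

// Debounce: run the completion request one second after typing stops.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```

In the app itself, `debounce(requestSuggestion, 1000)` would be wired to the textarea's `input` event, and a `keydown` listener for Tab would append the accepted suggestion to the code.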
And I also add some helper functions to
clean up the response to remove any
textual non-code artifacts. And then we
debounce this function so it runs one second after we stop typing. It
then shows the suggestion and then we
listen for a key down event for tab. And
if tab is pressed, we append the
suggestion to the current code. And
lastly, if we click outside of the
suggestion element, it just hides the
suggestion. So that's basically it. I
honestly really like the new IBM's
models despite all their little
inconsistencies. You can really feel
that they have battle tested this model
to be as stable and secure as possible.
And the fact that they've open sourced
them for everyone to use on their own
projects is super cool. And these models
are also small enough for us to actually
use them on our own projects without
needing a beefy GPU. So overall, great
job IBM. But those are just my two
cents. What do you folks think? Are you
planning to use these models in your own
future projects? Let us know in the
comments down below. And folks, if you
like these types of technical
breakdowns, be sure to smash that like
button underneath the video. And also,
don't forget to subscribe to our
channel. This has been Andress from
Better Stack, and I will see you in the
next videos.