Tiny AI Is About to Change Everything (IBM Granite 4.0)
By Better Stack
Summary
## Key takeaways

- **AI progress is in smaller, efficient models**: The excitement in AI is shifting from massive models to smaller, more efficient ones that can run on everyday hardware, signifying real engineering progress. [00:04], [00:20]
- **IBM Granite 4.0: Hybrid architecture for efficiency**: IBM's Granite 4.0 models combine transformer and Mamba layers, enabling efficient handling of long contexts (hundreds of thousands of tokens) and better performance with fewer active parameters. [01:35], [02:18]
- **Granite Nano (350M) runs offline in browser**: The smallest Granite Nano model (350M parameters) can be tested in the browser, demonstrating capabilities like tool usage and JSON formatting, even with a 2023 knowledge cutoff. [03:51], [04:47]
- **Local AI code completion on M2 MacBook**: An offline AI coding assistant was built using Transformers.js and Granite models, enabling code completion directly in the browser on a laptop without an internet connection. [05:36], [06:22]
- **Granite models: Stable, secure, and open-sourced**: IBM's Granite models are noted for their stability, security, and open-source availability, making them suitable for various projects without requiring high-end GPUs. [08:55], [09:10]
Topics Covered
- AI model releases are boring, small models are exciting.
- IBM's Granite 4.0: Hybrid architecture for efficiency and long context.
- Small models offer more capacity for less compute.
- Responsible AI and practical efficiency: offline browser apps.
- Local AI coding assistant for offline development.
Full Transcript
I got to be honest, folks. These days,
when a new giant AI model drops, it just
doesn't feel that exciting anymore. I
mean, sure, the new model probably moves
the needle a tiny bit higher on the
benchmark scale, and that's cool and
all, but is it really a big achievement
at this point? Now, what actually gets
me excited is when someone releases a
tiny model, something lighter, more
efficient, and closer to running on
everyday hardware. I'm still waiting for
that day when we will have AI models
small and efficient enough to run on a
Raspberry Pi. But until then, every
meaningful step towards smaller and
faster models feels like real
engineering progress. I recently covered
tiny models like Microsoft's BitNet and NVIDIA's Nemotron, but today we are looking at another family of compact, efficient models, IBM's Granite 4.0
series. It seems that IBM, like a
phoenix, has finally risen from the
ashes and has established itself as a
serious player in the AI space. Their
latest Granite 4 models are seriously
impressive, but not because they're
huge, but because they're small and
effective. In today's video, we're going
to look at the Granite 4 models in closer detail, see how they perform, and I will
also show you an offline code completion
app I built with these models, which you
can run directly in your browser. It's
going to be a lot of fun. So, let's dive
into it.
So, Granite 4.0, this model family is
built around a pretty interesting idea.
Instead of relying only on the classic
transformer stack, IBM mixed transformer
layers with something called Mamba
layers. Transformers are great at
language understanding, but they get
expensive as sequences get longer. Mamba
layers handle long context more
efficiently, and the Granite 4 models
were trained with very large context
windows. We are talking hundreds of
thousands of tokens. That means they can
actually hold entire documents, code
bases, technical specs, legal contracts,
or large chat histories in memory
without constantly forgetting what
happened earlier. For workflows like
RAG, coding assistance, or research
automation, that's pretty huge. But the
part that stands out the most is how
small some of these models are while
still performing well. For example,
Granite 4.0 Small sits at roughly 32
billion total parameters, but only 9
billion are actively used at any time
because of the hybrid architecture and
the use of mixture of experts. And if we
compare the benchmarks, even the 3
billion parameter Granite model beats some older 8 billion parameter models from the previous Granite generation. So you
are getting more capacity for less
compute. And these models are also
cryptographically signed. Their training
data is documented and the family is
aligned with the ISO/IEC 42001 standard for responsible AI. And that
matters if you want to deploy them
in finance, healthcare, or
government environments. Now, from a
practical standpoint, the efficiency
gains are the real story. Lower memory
usage means more people can run these
models on smaller GPUs or even CPU
optimized setups. It means faster
inference, lower latency, and lower
operational costs. And here's the
coolest part. You can even run them
locally offline in your browser using
the Transformers.js
library. Richard already made a great
video on how to get started with
Transformers.js, so be sure to check
that one out as well. But now let's see
how this actually works. So IBM has made
a little demo page on Hugging Face where you can test out the Granite models in
the browser. So let's go ahead and test
it out with the smallest model, the 350
million parameter Granite Nano. As soon
as we open the chat window, we see some example prompts we can try out. Most
of these are using the tool
functionality, which is a pretty cool
feature. If we open the tools panel, we
can see that there's a bunch of tools
that the model applies based on the
context of the prompt. And what's even
cooler is that you can modify the
response object to be a neatly formatted
React component with styling and
everything. So, for example, if I ask
what time it is, it will immediately
call the get time tool and output the
response in a neat React component. It's
kind of similar to the MCP UI library, which James covered in an excellent overview video. Okay, so this model can
also call tools which is awesome. But
let's see what its knowledge cutoff is.
So if we ask who the current US president is, you will see that
it replies with some outdated
information. And that's because this
model was trained back in 2023. So for
some newer information, it might be a
bit outdated. But nonetheless, it's
quite impressive to have such a fast,
tiny model running entirely on your
browser. For a final test, let's give it
a bunch of data and see if it can format
it in a JSON format. As you can see,
it's doing a pretty fast job of
converting it to JSON, but because of
the token limit on the demo page, the response gets cut off at the fourth customer. This is easily fixable,
however, if you're building your own AI
agent. We can just provide a larger
token limit. Okay, so I was thinking: in what kind of scenario would a browser-based offline AI model be useful?
And then I realized a pretty cool tool
would probably be an offline AI code completion assistant, which you can use, for example, when you're traveling on a plane. So I
decided to build a little proof of
concept AI coding assistant app that
works offline, uses the Transformers.js library, and runs a Granite model behind
the scenes. So this is the app and when
you first load it, it will download the
model to your local cache and store it
for future offline use. Now I can turn off my internet connection and try the app in offline mode.
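That download-once, run-offline behavior comes from Transformers.js caching the model weights in the browser after the first load. A minimal sketch of how such a load could look (the model ID below is a placeholder, not necessarily the export used in the video):

```javascript
// Transformers.js fetches the ONNX weights on first use and stores them in
// the browser cache, so later page loads work without a network connection.
//
//   import { pipeline } from '@huggingface/transformers';
//
//   // Placeholder model ID -- substitute the Granite export you actually use.
//   const complete = await pipeline('text-generation', 'placeholder/granite-onnx');
//   const [output] = await complete('function add(num1, num2) {', {
//     max_new_tokens: 64, // illustrative limit, not the app's actual setting
//   });

// The request shape used in the commented sketch above:
const REQUEST_OPTIONS = { max_new_tokens: 64 };
```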
So now if I write a function called add
with parameters num1 and num2, we
can see that the model now suggests
adding the exact next line. And if I
press tab, it will apply that suggestion
to my code. And this is all happening
locally on my M2 MacBook with no
internet connection. How cool is that?
And we can see that if we go to the next
line, it will also suggest adding the
closing bracket as well. And now if I
add an examples comment line, look at
that. The model understands we want to
include some samples of that function.
This is some solid AI code completion
right there. Let's try another example.
Let's see if it can write a Fibonacci
sequence function. And look at that. It
suggested the entire Fibonacci code
block and applied recursion as well. And
once again, if we ask for examples, it
will suggest some additional console
logs. Now, I do have to say it is not
perfect. Sometimes it suggests the same
code line as you just wrote and
sometimes it forgets to include the
closing bracket. So, this is still very
much a proof of concept. But the very
idea that I can use a local AI model
offline to create an AI coding
assistant, that's pretty cool. Lastly, I
want to show you what the code for this
little app looks like. Oh, and by the
way, folks, if you want to clone this
project and try it on your own, I've
also included the GitHub repo link in
the description below. So, on the coding
side, we have this simple HTML page with
the code textarea component. And in the main.js file, we are loading in the Granite 4.0 1 billion parameter model. I
also tried the 350 million parameter
version, but that one was too unreliable
for this project. Then we load in the
tokenizer and the model to our cache and
we use 4-bit quantization because this is the level recommended by IBM for this particular model version.
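With Transformers.js, that loading step might look roughly like the sketch below. The model ID is a placeholder, and `dtype: 'q4'` is the library's option for 4-bit quantized weights:

```javascript
// Sketch only -- needs the @huggingface/transformers package, and the first
// load downloads the weights (cached afterwards for offline use).
//
//   import { AutoTokenizer, AutoModelForCausalLM } from '@huggingface/transformers';
//
//   const MODEL_ID = 'placeholder/granite-4.0-1b-onnx'; // not the real ID
//   const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);
//   const model = await AutoModelForCausalLM.from_pretrained(MODEL_ID, {
//     dtype: 'q4', // 4-bit quantization, as recommended for this model
//   });

// The one concrete setting from the walkthrough:
const LOAD_OPTIONS = { dtype: 'q4' };
```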
Next, here's a very detailed prompt
telling the agent to only output the
next suggested lines of code and we are
using a
128 token limit, although we could
easily increase this for longer
suggestions if we wanted to. And I also
found that setting temperature to one
worked best for this project. And then
we get the raw response from the model.
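The pieces just described, plus the cleanup and debounce helpers covered next, can be sketched in plain JavaScript. Function names and the exact prompt wording are assumptions, not the repo's actual code:

```javascript
// Prompt: instruct the model to only emit the next lines of code.
function buildPrompt(code) {
  return [
    'You are a code completion engine.',
    'Only output the next suggested lines of code.',
    'Do not repeat existing code and do not add explanations.',
    '',
    code,
  ].join('\n');
}

// Generation settings mentioned in the video: 128-token limit, temperature 1.
const GENERATION_OPTIONS = { max_new_tokens: 128, temperature: 1.0 };

// Cleanup: strip markdown fences and chatty filler so only code remains.
function cleanResponse(raw) {
  return raw
    .replace(/```[a-z]*\n?/g, '')              // drop markdown code fences
    .replace(/^(Here('s| is)|Sure).*$/gim, '') // drop conversational lead-ins
    .trim();
}

// Debounce: run the completion request one second after typing stops.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```

In the app itself, `debounce(requestSuggestion, 1000)` would be wired to the textarea's `input` event, and a `keydown` listener for Tab would append the accepted suggestion to the code.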
And I also add some helper functions to
clean up the response to remove any
textual non-code artifacts. And then we
debounce this function so it runs one second after we stop typing. It
then shows the suggestion and then we
listen for a key down event for tab. And
if tab is pressed, we append the
suggestion to the current code. And
lastly, if we click outside of the
suggestion element, it just hides the
suggestion. So that's basically it. I
honestly really like the new IBM's
models despite all their little
inconsistencies. You can really feel
that they have battle tested this model
to be as stable and secure as possible.
And the fact that they've open sourced
them for everyone to use on their own
projects is super cool. And these models
are also small enough for us to actually
use them on our own projects without
needing a beefy GPU. So overall, great
job IBM. But those are just my two
cents. What do you folks think? Are you
planning to use these models in your own
future projects? Let us know in the
comments down below. And folks, if you
like these types of technical
breakdowns, be sure to smash that like
button underneath the video. And also,
don't forget to subscribe to our
channel. This has been Andress from
Better Stack, and I will see you in the
next videos.