This Shouldn’t Be Able to Run 120B Locally

By Alex Ziskind

Topics Covered

  • 120B Models Fit in Your Pocket
  • Runs GPT OSS 120B Offline at 18 Tokens/Second
  • OpenAI-Compatible API for VS Code Coding
  • Activation Locality Beats Apple Silicon

Full Transcript

If you want to carry a local and private 120 billion parameter large language model, you might need something like this: big, expensive GPUs and computers. But what if I told you that 120 billion parameters can fit in your pocket? The first time I heard of something like that, Steve Jobs was putting a thousand songs in my pocket.

Well, this is a slightly nerdier version than that. And no, I was not happy about Steve Jobs putting stuff in my pocket. Stay out of my pocket, Steve. For

the past few years, AI hardware has been moving in one direction: bigger GPUs, bigger servers, bigger clusters. And if you watch this channel, you know I've built a few of those. Now, something very different is starting to show up. This is the Tiny AI Pocket Lab. And according to the company, this thing can run models up to 120 billion parameters locally. That's a wild claim, so I need to check it out.

And that matters because one of the best-known 120 billion parameter models right now is still GPT OSS 120B. It's not the newest model, but it's well known because it's OpenAI's first open-weight model. OpenAI says GPT OSS 120B works best with at least 60 gigabytes of VRAM, and it also says it runs more efficiently on a single 80 gigabyte GPU. That means I won't even be able to run it on my expensive RTX 5090, which only has 32 gigabytes of VRAM. So I want to check out this Tiny AI Pocket Lab alongside this MacBook Neo that only has 8 gigabytes of memory. Just to give you an example, this is a 4 billion parameter model running on the MacBook, getting about 9 tokens per second. So

let's find out whether this little box can actually pull it off. First of all, this thing really lives up to its name. This is my iPhone 16, and this is this thing. I also have to weigh it. My iPhone with the case weighs 228 kilograms... grams, excuse me, 228 grams. I could have guessed that. And Tiny weighs 305 grams. The reason I'm pairing it up with this MacBook Neo is because the MacBook only has eight gigs of memory.
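To see why eight gigabytes is so limiting, and why this device carries 80, here's a back-of-the-envelope estimate of the weight-only memory an LLM needs at different precisions. This is a rough sketch: real runtimes also need room for the KV cache, activations, and the OS itself.

```python
def model_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint of an LLM, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 120B parameters at common precisions
fp16 = model_memory_gb(120, 16)    # ~240 GB at 16-bit
q4 = model_memory_gb(120, 4.25)    # ~64 GB at roughly 4-bit quantization
print(f"fp16: {fp16:.0f} GB, ~4-bit: {q4:.1f} GB")
```

At roughly 4-bit quantization, 120 billion parameters land in the 60-some-gigabyte range, which lines up with the "at least 60 gigabytes" figure, and explains why 8 gigabytes caps you at small models.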

There's no way you'll be able to run anything bigger than a four billion parameter model. Well, maybe. I'm doing a separate video on this. But one of the biggest things about large language models and running them locally is space. There's a 256 gigabyte SSD here, I'm already almost out of space, and I just have software development tools on there. Imagine loading LLMs. This thing comes with one terabyte of SSD space, so you'll be able to load up a bunch of models. It also has its own CPU, and of course it has the memory: 80 gigabytes of it, so it can run larger models. Now, to get going with this, I'm gonna connect this thing with a USB-C cable to the port labeled "to PC". It operates at up to 60 watts, with a 30 watt TDP, so you can actually power this externally by plugging in

a USB adapter into the power port. And I'm going to power it on. There's a little light right there that shows me it's on. It

shows it as a drive, and I just need to double click on this to set up the software. TinyOS, pop that open, and after login you just get to the chat screen. And right there, I'm already chatting with this thing. I have chat available, I have agents available, I have an agent store, and all these different models that I can try. There's GPT OSS 120B that I was talking about. Let's click

on that. I'm gonna try that out. And there it goes. Let's give it a little bit of a longer prompt, just so you get a sense of how fast it's going on a 120 billion parameter model. This, by the way, is the default chat interface. Now, people watching this channel might be interested in the fact that they have an SDK, not just the chat interface. That way you can interact with the models programmatically. Pip install tiny SDK, boom, that's it. Now I can use this in my programs: load up a device, load up the models in my Python code. Or I can use this right from the terminal. So I can do tiny run, OpenAI GPT OSS 120B, and now I can interact with this fully through the command line. And they weren't kidding. It's 18 tokens per second for this model. Now this

is completely offline and completely not connected to the internet. Everything is right here. You

only have to connect to the internet if you want to download new models. And

so far, these are the ones that I have available to me that I've already downloaded: GPT OSS 120B, Qwen 3 30B, Qwen 3 Coder, and image creation. We'll take a look at that. And out of the ones that are available that I haven't downloaded yet, GLM 4.7 Flash looks to be pretty interesting. So I'm going to click on Get.

Network is offline. Please check your connection. So before I can use anything else, I would need to connect this to Wi-Fi so I can download the new model. So

there I am downloading GLM 4.7 Flash. And what's nice here is that it's not going through the machine. You might think, oh, why doesn't it just use the Wi-Fi connection that's on the laptop? Well, that's the whole idea: it's its own device. We don't need to take up space on the computer. The model goes directly to the device instead of

going through the computer where you might not have enough space. There's 120B answering me.
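Speaking of those speed readouts, a figure like 18 tokens per second is easy to verify yourself: count tokens as they stream and divide by elapsed time. Here's a generic sketch; `fake_stream` is my own stand-in that simulates a model streaming at roughly 18 tokens per second, and you could swap in any real token iterator.

```python
import time

def tokens_per_second(stream) -> float:
    """Consume a token stream and report throughput."""
    start = time.perf_counter()
    count = 0
    for _token in stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

def fake_stream(n=18, delay=1 / 18):
    """Stand-in for a model: yields n tokens with a fixed gap between them."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(f"{tokens_per_second(fake_stream()):.1f} tokens/sec")
```

The same loop works against any streaming chat client: iterate over the chunks, count them, divide by wall-clock time.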

Let's ask it to write a story. And while it's thinking about it, you can see the approximate speed that it's going at. Look at our activity monitor. Look at

the memory usage here. Five gigabytes. That's pretty normal. There's no spikes or anything here.

No memory pressure. Nothing is being used on this computer so I can keep using this as is. All the processing is done on tiny pocket lab. All right. Thank

you. That's a very long story. We don't need that. Let's load up a coding model. Did my GLM Flash download yet? It's still downloading, but we do have Qwen Coder 30B. Hey camera, what are you looking at? The screen? You like what you see? I like it. Qwen Coder, done loading. Got to do something while it's loading, right? New chat, and hello. And Qwen Coder is much faster, obviously, than 120 billion; this one is only 30. And this is not a thinking model, so it answers right away. But why would I be interested in a coder model?

Well, because this is a software dev channel. So how can we use this as software developers, besides chatting with it? First of all, there's a bunch of different agents you can use right here on the left: Stable Diffusion, Chat Memos, and AI Assistant. There are a few too many agents to go into in this video, but let me know if you want to learn more about these and maybe I'll do a follow-up. There's also an agent store where you can go in, and obviously this is the kind of thing that's going to keep getting updates; more and more stuff is going to be active here. Now, here's the dashboard. This shows me how many tokens I've used, which models I used, and how many tokens per model. This is really handy because it gives you a little bit of visibility and insight into how you're using your resources. For example, if you're developing a solution locally that's going to be deployed somewhere else, you can gauge the token use and estimate costs later if you're using some third-party provider. Pretty handy. Now, one thing I

have not tried yet that I'm about to. Always scary doing this the first time on camera, but there's an API base URL and API key that will allow you to use Tiny through an OpenAI-compatible interface. And in fact, if I take that weird-looking base URL and plug it into the browser, it'll tell me exactly what's going on. So here's the /models endpoint, and yeah, that is an OpenAI-compatible endpoint. It tells me that Qwen 3 Coder is in fact loaded. So here I am in VS Code, and I've configured Qwen 3 Coder 30B to be my custom AI agent inside my editor, and now I can talk to it through my VS Code interface. What does this file do? This file is a batch script for Windows that automates the setup and execution of a Python project. That's right. So it has tool calling enabled, it has the ability to read and edit files, and so on. And the speed is not terrible. Now

I don't have to take my MacBook Pro... Here's an example of a smaller model, Qwen 3 8B, and that's pretty fast. Try again: write a story. Now, what's

funny is that the MacBook doesn't have a fan at all, but the tiny box, the Pocket Lab, makes up for that because it has its own fan. It definitely needs to cool that NPU that's doing the processing. So the only thing I hear in this office is the Tiny Pocket Lab; it's just a persistent fan going. You can definitely hear it. It's not silent, but it's not wildly annoying.
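Under the hood, that VS Code integration is just the OpenAI-compatible endpoint. Any standard client can talk to it; here's a minimal sketch with Python's standard library. The base URL, API key, and model id below are placeholders; substitute whatever your device's settings screen shows.

```python
import json
import urllib.request

BASE_URL = "http://192.168.1.50:8000/v1"  # placeholder: use the device's base URL
API_KEY = "tiny-local-key"                # placeholder: use the device's API key

def chat(prompt: str, model: str = "qwen3-coder-30b") -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for the local endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = chat("What does this file do?")
# urllib.request.urlopen(req) would send it; the response JSON has the usual
# OpenAI shape, with the answer at choices[0].message.content.
print(req.full_url)
```

Because the wire format is the standard one, the official OpenAI Python client pointed at this base URL should work the same way, which is exactly what editor integrations rely on.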

By the way, when you use this inside VS Code as a coding agent, you go through a lot of tokens. With just that couple of queries I made, I'm already over 20,000 tokens. Another win for local, right? Because you get an unlimited number.
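That's also where the dashboard's per-model token counts become useful: if you ever deploy the same workload to a metered provider, you can project costs from them. A quick sketch; the per-million-token prices here are made up for illustration, so substitute your provider's real pricing.

```python
def estimate_cost(tokens_in: int, tokens_out: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Project what a local workload would cost on a metered API, in dollars."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1e6

# Example: 20,000 prompt tokens and 80,000 completion tokens at
# hypothetical $0.50 / $2.00 per million tokens.
cost = estimate_cost(20_000, 80_000, 0.50, 2.00)
print(f"${cost:.2f}")  # → $0.17
```

Multiply by your daily query volume and the "estimate costs later" idea from the dashboard becomes a concrete number.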

GLM Flash finally downloaded, and now I can start talking to it. Why not have even more models stored on this thing that I can pull up and use anytime I want? Let's check out SD Web UI. A cat in a hat. Do

you think it's gonna draw a cat inside a hat or a cat wearing a hat? Instead, it draws a red square. It says here: please select a text-to-image model from the status bar. So you can only have one model loaded at a time. And right now I have GLM 4.7 Flash loaded, as you can see here, which makes that model available to all the different tools and agents. If I want a Stable Diffusion type of model to be available to the Stable Diffusion UI, then I need to load up Z Image Turbo or another text-to-image model, which unloads GLM 4.7 Flash. Makes sense, right?
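That load-one-model-at-a-time behavior can be sketched as a tiny slot manager. This is my own toy illustration of the policy, not the device's actual code; the model names and sizes are made up.

```python
class ModelSlot:
    """One-model-at-a-time loader, like the device's single active model slot."""

    def __init__(self, memory_gb: float):
        self.memory_gb = memory_gb
        self.loaded = None

    def load(self, name: str, size_gb: float) -> str:
        if size_gb > self.memory_gb:
            raise MemoryError(f"{name} needs {size_gb} GB, only {self.memory_gb} GB available")
        evicted = self.loaded
        self.loaded = name  # implicitly unloads whatever was there before
        if evicted:
            return f"unloaded {evicted}, loaded {name}"
        return f"loaded {name}"

slot = ModelSlot(memory_gb=80)
print(slot.load("glm-4.7-flash", 20))
print(slot.load("z-image-turbo", 12))  # swapping in the image model evicts GLM
```

Real runtimes are fancier about caching, but the core invariant is the same: at most one model occupies the memory budget at a time.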

You have a limited number of resources to work with, so unloading a model before loading a new one makes sense. All right, let's try generating that. And there it goes; it looks like Stable Diffusion. All right, any guesses? I just see a cat. I see zero hats in that picture. A cat playing guitar. Now, if you know a little bit more about Stable Diffusion than I do, then you might know what all these things do and the best options to select. But you're not me, and I don't, so I'm gonna just hit Generate here. Hey, look at that, we're getting a cat playing guitar. What a cute little pussycat. What's cool about this interface is that it actually tells you what models are available: the models you already have downloaded, the ones with the check mark, and the models that you could still get that'll work with Stable Diffusion. Now, one of the questions that comes to mind is, how is it that such a tiny little device with an NPU can compete with... I

don't know, an internal GPU, like, for example, Apple Silicon. So here's Tiny on the left running Qwen 3 8B, which is a bigger model than the one I'm running on the right; that's Qwen 3 4B. And let's take a look at how that looks.

I'm going to go enter and enter here. They're not at the same time, but close enough just to give us a sense of how quickly things are generating. Hello.

There it goes. Okay, it's thinking. The one on the right is taking a little bit longer. And you tell me, which one do you think is faster here? I think the NPU one is faster, even though it's a larger model. The one executing on the Mac is actually running MLX. Now, this is the MacBook Neo, so it's not going to be super fast like an M4 or M5 Max machine, but we're talking about really low power consumption lasting a long time. How is it that Tiny is doing that? Well, the foundation of that can be found on GitHub. Tiny AI

PowerInfer is the name of the repository. You can go check it out. This is

kind of like the backbone of the product. Look at that: GitHub trending number one repository of the day. PowerInfer is basically a CPU/GPU LLM inference engine leveraging activation locality for your device. I don't know what the heck that means, but basically it means keeping the actively used, hot parts of the model constantly active and alive while putting the less commonly used parts to sleep. It's a clever way of managing the model so that it can run faster. You can get more info on the tiny.ai website, as well as their Kickstarter campaign, where you can sign up and get this thing at a discounted rate if you're an early backer. Now, I've wanted to play with this thing ever since I saw it at CES, and I think it's really cool. This thing is not gonna be a replacement for a monster GPU rig. It's slower, and right now it's more

curated than a fully open DIY setup. That's kind of clear. But if you want a small and private low power box that can bring meaningful local AI to a much weaker laptop, or perhaps a mini PC that's less capable, that's where this starts to make a lot of sense. And yes, being on Kickstarter makes it a hard sell for some people, which is totally fair. But the idea here is genuinely interesting.

And in actual use, I think it makes a stronger case than you might expect.
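Before you go, a quick illustration of what "activation locality" actually means. Only a fraction of a model's neurons fire strongly for any given input, so a runtime that predicts the hot ones can compute just those and skip the rest. This toy ReLU layer is my own sketch of the concept, not PowerInfer's actual algorithm; with a positive input and a perfect predictor, the sparse pass matches the dense pass while touching only about half the neurons.

```python
import random

random.seed(0)

# A toy feed-forward layer: 1,000 neurons, each with one weight.
weights = [random.uniform(-1, 1) for _ in range(1000)]

def dense_forward(x, weights):
    """Evaluate every neuron: what a plain runtime does."""
    return [max(0.0, w * x) for w in weights]  # ReLU activation

def sparse_forward(x, weights, hot_indices):
    """Evaluate only predicted-hot neurons; treat the rest as zero."""
    out = [0.0] * len(weights)
    for i in hot_indices:
        out[i] = max(0.0, weights[i] * x)
    return out

x = 2.0
dense = dense_forward(x, weights)
# With ReLU and a positive input, negative-weight neurons output zero,
# so a perfect predictor marks only the positive-weight neurons as hot.
hot = [i for i, w in enumerate(weights) if w > 0]
sparse = sparse_forward(x, weights, hot)

print(dense == sparse, f"computed {len(hot)}/{len(weights)} neurons")
```

Real predictors are learned and imperfect, and real layers are matrices rather than single weights, but this is the gist of why keeping the hot subset resident can beat loading the whole model.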

Let me know your thoughts in the comments down below. I look forward to checking them out and reading them. I do read all your comments, and I think you would like this video next. Thanks for watching, and I'll see you next time.
