
Your Guide to Local AI | Hardware, Setup and Models

By Syntax

Summary

Topics Covered

  • Quantization Unlocks Massive Models Locally
  • Unified Memory Replaces Expensive GPUs
  • AMD Strix Halo Mini PC Delivers 40 Tokens/sec
  • Guardrails Essential for Local AI Coding
  • Local AI Hardware Future-Proofs Capabilities

Full Transcript

Check this out. What you are looking at is an LLM running inside my home. It's actually on that mini PC back there, and I'm getting over 40 tokens per second using this particular model to get a very decent answer to the JavaScript question that I just asked it. So, in this video, we're talking all about local AI. I am going to show you the machine that I got, talk about its specs, talk about your options when it comes to hardware for running local AI, and really just try to answer the question: can a mini PC like this replace a $200 subscription to an AI company? I'm also going to dive into what models are available and show you all the things that I've tried, from basic prompting, to hooking it up to MCP tools or tool servers, to using it inside my editor, and also some more agentic workflows, to give you my opinions on all this stuff so you can decide whether or not you want to get into this as well. That sounds good. Let's dive in. My name is CJ. Welcome to Syntax.

All right, let's talk about hardware and some key terms you should know if you're getting into this world of local AI. First up, we're talking about inference. Whenever you type a question into ChatGPT or Claude or whatever LLM you use, that is performing inference. It's essentially taking a model that's already been trained: you're passing in some new text, and it's predicting the output text. This is known as inference, and that's what I'm doing with my local AI. It's also possible to train your own model or fine-tune an existing model. That is something I have not tried yet and not something I was optimizing for when I was looking for hardware. So inference is what we're trying to do.
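To make "predicting the output text" concrete, here's a toy sketch of that loop in Python. The lookup table stands in for a real model's billions of weights; everything here is illustrative:

```python
# Toy next-token "model": a lookup table standing in for a trained network.
# Real inference runs the whole context through billions of weights instead.
TOY_MODEL = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def generate(prompt_tokens, max_new_tokens=3):
    """Greedy decoding: repeatedly predict the next token and append it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tokens[-1])
        if next_token is None:  # the model has nothing more to predict
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

That append-and-predict-again loop is all inference is; the expensive part is that each prediction involves enormous matrix math, which is where GPUs come in.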

The next thing to know is that we need GPUs for this. There's a whole series on neural networks from 3Blue1Brown, and in the video "Large Language Models Explained Briefly" they talk about how GPUs have tons of processors inside of them and can do tons of parallel processing. They're much more powerful than just a CPU because they can do so many things in parallel. And these models are basically just big old matrices of numbers, and GPUs can perform calculations across all of those matrices really, really fast.

So we're performing inference, and we need a GPU to do that. Now, for any given model that you're trying to run inference on, you need a whole lot of memory. There's an article that walks through some basic calculations you can do. Let's say you're trying to run a 70 billion parameter model. We'll talk about what the various parameter sizes mean, but the base calculation is that running this model at full precision would need 140 GB of video memory. So it's a lot. The other thing to think about is that you don't just need video RAM for the model itself (when you run these models, the whole thing has to fit into VRAM so it can process quickly); you also need to care about the key-value cache. Every prompt that you type in needs to be vectorized so that it can be run through the model, and if that had to happen from scratch for every word of every prompt, it would take a whole lot longer. So the way a lot of these systems are set up, those vectors are pre-calculated and then reused on each subsequent prompt. You also need VRAM to hold on to that cache.

So all of this mounts up to: you need a GPU, and you need a whole lot of memory. Now, we talked about this 70 billion parameter model running at 16-bit precision. There is this idea of quantization, which is essentially taking an existing model at half precision and dumbing it down a little bit, roughly rounding all of the numbers that exist inside the model. At full precision you would need 140 gigabytes of RAM, but if you dumb the model down just a little, to 8-bit or 4-bit precision, you get to the point where a 70 billion parameter model maybe only needs 70 gigabytes or 30 gigabytes of video memory, depending on the quantization. Now, if you want to learn more about quantization, there's a fantastic article from Maarten Grootendorst called "A Visual Guide to Quantization" that talks about what it is.
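The sizing math behind all of this is simple enough to sketch: the weights take roughly (parameter count) × (bits per parameter ÷ 8) bytes, so halving the precision halves the footprint. Real GGUF quants mix precisions, so actual files land near these numbers rather than exactly on them:

```python
def model_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM needed to hold the model weights alone, in GB."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# The 70B example from above at three precisions:
for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{model_vram_gb(70, bits):.0f} GB")
# 70B @ 16-bit: ~140 GB
# 70B @ 8-bit: ~70 GB
# 70B @ 4-bit: ~35 GB
```

Remember this is weights only; the key-value cache needs its own slice of VRAM on top.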

It's fascinating, but all you need to know is that the models you'll be trying to run locally come in different sizes and quantizations. A great place to start, and what I've been using to learn about these models and figure out how to run them, is Unsloth. Unsloth also has a lot of guides on fine-tuning models, that is, taking an existing model and repurposing it for something you're trying to do, but they also have a lot of guides on just running the models that they themselves have quantized. Their docs run through a bunch of models; just as an example, Qwen 3.5 is one of the best local models you can run right now.

On the page for that model, they give you a chart that shows the various precisions and the models with their various numbers of parameters (27 billion parameters, 35 billion parameters), and for each precision, how much video memory is required to load that model. So as you find these models, you're going to need a GPU with enough memory to load the model itself, plus a little more for the key-value cache. With all of that base knowledge, now you need to find yourself a GPU, and it is not cheap.

Just looking on PCPartPicker, a site where you can spec out your own machines, I sorted video cards by memory: a video card with 48 gigabytes of memory, an RTX 6000, is $7,000. Of course, it depends on where you buy it; the RTX A6000 is about $1,000 less. You could potentially run an AMD GPU with a similar amount of video memory for about half the price, but then you may run into compatibility issues, because certain platforms only run on AMD or only run on Nvidia. So right off the bat, video cards themselves are steep. If you wanted to build a machine around just that 48 GB video card, your base price is $7,000, and that's not including memory, CPU, a motherboard, and everything else. PCPartPicker, for whatever reason, only lists what are basically consumer-grade GPUs, and right now, I don't know why, I can only see cards up to 48 GB of memory there. They do make GPUs with more memory: the RTX Pro 6000 has 96 GB, but it's almost $10,000. And there are others, including enterprise-level GPUs. All of that to say: you need a GPU with a lot of memory, and GPUs with a lot of memory are very expensive.

One of the first platforms to try to make this more accessible for people running local AI was Nvidia with their DGX Spark. It's a little machine that has 128 GB of unified memory, memory that can be used by either the GPU or the CPU. This means you can have a single machine, without a really expensive discrete GPU, and run some of these models. Because that 128 GB is shared between the GPU and CPU, you can load a 70 billion parameter model and still have room for the key-value cache. And the DGX Spark is a little more approachable in terms of price: like I said, 128 GB of unified memory for around $4,000. So that's great; you could start there.

But AMD came along and brought us the Ryzen AI Max+ 395 processor, which has really good scores. In their charts they compare it to an Intel i7, which isn't really comparable because the i7 doesn't use a unified memory architecture. All that to say, their platform is very similar to the DGX Spark, uses a unified memory architecture, and is a lot more approachable in terms of price. Now, when I say unified memory, you might be thinking of Apple, because they were among the first to do it with their M1 architecture back in 2020. Essentially, with a traditional setup, you have a CPU talking to the RAM you plug into the motherboard, and a GPU with onboard video RAM baked into it, and they communicate with each other over PCIe. With the unified architecture, there is one pool of memory accessible by the CPU, the neural processing unit, and the GPU. Like I mentioned, Apple was one of the first to do this when they released the M1 in 2020, and that's why you can also run these local LLMs on Macs. But AMD is what I decided to go with, and that little machine running back there is an AMD Strix Halo machine. Strix Halo is the code name for the 395 processor with the unified architecture.

If you go to the Strix Halo wiki, they list all of the machines that run this processor. You might be familiar with the Framework Desktop; that machine runs the exact same processor, and you may have seen people getting it for running local AI. Mine is very similar, but a bit more approachable in terms of price. Right now on Micro Center, the machine that I have is selling for $2,500, so of all the hardware we've shown so far, this is probably one of the most approachable. And like I mentioned, there are other platforms with the same processor and unified memory from Minisforum, from HP, and a few others, as shown in that wiki, all a little more expensive than the one I got, which is the GMKtec EVO-X2. I bought this thing two or three months ago for $2,100, but with the increase in the price of RAM and everything else, this stuff is only getting more expensive, so I bought in early. It's still a fairly reasonable price to be able to run the kinds of things I'm going to show you. That was all the research I did. It's a whirlwind tour, and I'll provide links to everything I showed in the description if you want to look into this more. And of course, like I mentioned, you could also go with a Mac or a Mac Studio, since Mac Studios can be configured with 128, 256, or 512 GB of unified memory, but the price is a bit higher.

Now, when I bought the GMKtec EVO-X2, like I said, I paid $2,100, and at the time it would have been $1,000 more just to get the Mac Studio with the same amount of unified memory, so I went with the PC. At this point, with the price of everything increasing, it's getting to be almost the same price for the HP, the Minisforum, or the Framework Desktop, and those are probably about the same as a Mac Studio, so you can weigh those options. The other thing to consider, though, is that with a Mac you have to run macOS. There is Asahi Linux; I haven't looked into whether people are running local AI on it, but typically you would just stick with macOS, so that might limit what you can actually run. These PCs usually come with Windows, but you can wipe them, install Ubuntu or Fedora, and get access to a lot more community packages and workflows where people have been tweaking and working with this stuff to get local AI running. That was a whole lot to take in, but all of that to say, I landed on this machine here. Let me show you how I set it up.

The machine I got was the GMKtec EVO-X2, and it's a nice little machine. In terms of I/O, on the front we've got an SD card reader, a USB-C port (USB4), two USB 3.2 ports, and a 3.5 mm headphone jack. On the back, you've got the DC power in, another headphone jack, a USB 3.2 port, another USB-C port, DisplayPort, HDMI, two USB 2.0 ports, and a 2.5 gigabit Ethernet port. In terms of sizing, it's about one bread by one bread by half a bread wide. It's also got these nice little feet on the bottom so you can stand it upright, and you do want to use it that way (look at that little buddy) because of the airflow. It comes with a power cable and an HDMI cable, as well as the power brick, which is a 230 W brick running at about 19.5 volts. So, nice little machine. Let's get this thing set up.

Now, when you boot it up for the first time, you do get Windows. It actually took quite a while before it got to the initial Windows screen, and while I was waiting, I found a button on the side that changes the RGB color. Windows took forever to boot, so I didn't even go through the getting-started flow; I immediately plugged in a USB drive with the Fedora installer on it. Then I went through the BIOS and updated all of the recommended settings: performance mode, how much memory to allocate to the video card, and a few other options I found by going through the wiki and the forums.

This was actually my first time installing Fedora. I prefer Ubuntu and run it on all my machines, but Fedora is pretty cool. The install was pretty seamless, and there was also an option to enable the administration tools, which I didn't know what it would do. But the first time the machine booted up, it immediately made itself available over the web. That means I could just go back to my computer, and without even needing to SSH in, open a web dashboard and start configuring it. There's a built-in terminal there, and I can see all of the machine's status while it's running. This is actually software called Cockpit. I've run it on my Ubuntu machines, but it comes from the Fedora project, and it's really cool that you get it out of the box: the moment your install is done, you can immediately head to a browser and start configuring your machine.

your machine. Now, one of the first things I needed to do for this machine is to enable some settings for GTT, which is how Linux is able to allocate

memory for the video card or for the CPU. So, there are quite a few guides

CPU. So, there are quite a few guides out there. There's one from Jeff

out there. There's one from Jeff Gearling that helped initially, but uh this particular stricks Halo machine used different options for setting how

much video RAM should be allocatable.
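In my case this boiled down to a couple of kernel boot parameters. The exact names and values vary by kernel and machine, so treat this fragment as an illustration of the idea, not a recipe (the numbers assume a 108 GiB GTT limit):

```
# /etc/default/grub (illustrative; check the Strix Halo wiki for your kernel)
# ttm limits are counted in 4 KiB pages: 108 GiB = 108 * 1024^3 / 4096 = 28311552
GRUB_CMDLINE_LINUX="... ttm.pages_limit=28311552 ttm.page_pool_size=28311552"
```

After editing, you regenerate the GRUB config and reboot for the new limits to take effect.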

And so, once I figured out those settings, I was good to go. Essentially, we're allowing up to 108 gigabytes to be allocated to the video card, and as Jeff Geerling mentions, this is what lets it run stable; allocate any more, and it might kernel panic every now and then. So we essentially have 20 GB of RAM for the OS and everything else that's running, and up to 108 GB that we can use for all of our AI workflows. All right, our hardware is all set up. Now we need to actually run some models.

There are a lot of different ways to do this as well. Ollama is probably one of the most popular platforms; essentially, it's a CLI tool that can download models and then run them locally. There's also LM Studio, which is a desktop app; they also have a CLI tool for managing downloaded models, and they give you a ChatGPT-like interface. Then there's vLLM, which is commonly used to cluster and network computers together so you can run LLMs at scale. And then there is llama.cpp, and this is actually what I've settled on. In a lot of benchmarks, llama.cpp is one of the fastest, and they actually pioneered the format that a lot of these other tools use, the GGUF file format. So llama.cpp is what I went with, and getting it set up is pretty involved.

Fortunately, there are community packages and libraries for this. Shout out to kyuz0, who created the AMD Strix Halo toolboxes, which out of the box get you ready to go with llama.cpp on any machine that has this Ryzen AI Max+ 395 in it. All of this is based on the Toolbox standard, which works on top of Docker or Podman to package up the various dependencies and things you might need to get something running. That's exactly what kyuz0 has done with these toolboxes, because when you're running llama.cpp on the AMD Strix Halo, you're going to need special drivers and various things set up and ready to go. You can do all of that manually; without the toolbox, you'd even have to compile llama.cpp from scratch. These toolboxes come with precompiled versions, so you're just ready to go. Once you've installed the Toolbox CLI, it's just one command to spin up the toolbox, and then you can start running llama.cpp inside it, fully optimized, everything compiled, good and ready to go.
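To make that concrete, the whole flow is a couple of commands. The toolbox name and image below are placeholders, so grab the real ones from the project's readme:

```shell
# Create and enter a toolbox (image name is a placeholder; see the readme)
toolbox create llama-halo --image <image-from-the-readme>
toolbox enter llama-halo

# Inside, llama.cpp is already compiled; serve a downloaded GGUF model:
llama-server -m ~/models/my-model.gguf --ctx-size 32768 --n-gpu-layers 99 --port 8080
```

Once llama-server is up, anything on your network can talk to it over HTTP.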

Now, there are two different toolboxes to choose from: Vulkan and ROCm. Vulkan is essentially the open-source implementation of the AMD drivers, and ROCm uses the proprietary drivers, but in my tests, ROCm gives the best performance, so that's the toolbox I went with. You run that one command on your machine, and then you're ready to start running some models. In the readme for this project, they show a command that uses the Hugging Face CLI to download Qwen3 Coder 30B at BF16 precision. That's the full-precision version, and it comes in two files. But if you want to start picking your own models and running your own stuff, first, of course, install the Hugging Face CLI. This gives you access to the hf command, and then you can start pulling models from Hugging Face.

Hugging Face is the most popular place for people to host models and also download models. And since we're using llama.cpp, if you head to the Hugging Face models page, you can filter by llama.cpp, and every single model you see there will run under llama.cpp; they typically include the recommended settings for running that specific model. Now, when you're looking through these models, you will notice a lot from Unsloth. They're the ones I mentioned earlier in the video. Essentially, they create quantized versions of models and provide guides for fine-tuning them. Like I showed earlier, they basically have guides for all of these models.
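Once you've picked one of those quants, downloading it is a one-liner. The repo and file names below are just an example; copy the exact ones from the model page:

```shell
pip install -U huggingface_hub   # provides the hf command

# Download a single quantized GGUF into ./models (names are examples)
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --local-dir ./models
```

Downloading just the one quant file you need saves a lot of disk compared to cloning the whole repo.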

The next thing you need to do is actually pick a model. Once you get onto a model's page on Hugging Face, you can look at the model card and see all of its quantizations. You can see that Kimi K2.5 in this format at full precision is 2 terabytes for the model alone, and if you look at the other quantizations, the size gets smaller and smaller as you go. Now, the machine I have has 128 GB of unified memory, and only about 108 GB of that is accessible as video memory if you want to run things in a stable way, so I couldn't run any one of those. Essentially, the size of the model file is how much video RAM it's going to take up, so we can't run Kimi K2.5. There is MiniMax M2.5, and in the 3-bit quant there's a model that is 101 GB, so you could run that one. In my testing it works, but because we only have a few more gigabytes to work with after that's loaded in, your context size is limited to maybe 4,000 or so tokens. So it works, but it starts to slow down the larger your prompts get. You could potentially try the smaller quantizations, but as they get smaller, they get a little bit dumber. And then there's Qwen3 Coder 30B A3B Instruct, and this is the one that I have been using. It's been fantastic, really good. I'll show you some of my tests next, but the 4-bit quantized version is under 20 GB, so there's plenty of room for really long context windows.
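If you want to ballpark how much of the leftover VRAM a long context will eat, the key-value cache grows linearly with context length: 2 (for K and V) × layers × KV heads × head dimension × tokens × bytes per value. The architecture numbers below are placeholders, not the real Qwen3 Coder config, so plug in the values from the model card:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GB: 2 tensors (K and V) per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9

# Hypothetical mid-size model: 48 layers, 8 KV heads, head_dim 128, fp16 cache
print(f"{kv_cache_gb(48, 8, 128, 32_768):.1f} GB for a 32k context")  # 6.4 GB for a 32k context
```

So a roughly 20 GB model plus a cache like this still fits comfortably in 108 GB, which is why the smaller quants leave so much headroom.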

And you could probably even run the higher-precision ones and still have room for large context windows. So: you'll pick a tool for running your models, pick which model you want to run, take into account how big that model is and whether its format works with the platform you're running, and then you're ready to go. Now, let's talk about overall impressions for just prompting the AI. I've been using Qwen3 Coder Next and also the Qwen 3.5 model, and for one-off questions, it's pretty decent. It is technically accurate for simple questions, it can generate CLI commands, and it can answer questions about various popular JavaScript libraries. All of that works perfectly fine, especially if you hook it up to some tools like web search or documentation search. So, I got Open WebUI up and running, connected it to my llama-server, and enabled the web search tool.
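This works because llama-server exposes an OpenAI-style chat completions endpoint, which is what Open WebUI (and most other tools) speak. Here's a minimal sketch of building such a request by hand; the host, port, and model name are assumptions from my setup:

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str, model: str = "local"):
    """Build an OpenAI-style chat completions request for a llama-server endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://192.168.1.50:8080", "Explain JS closures briefly")
print(req.full_url)  # http://192.168.1.50:8080/v1/chat/completions
# urllib.request.urlopen(req) would send it once the server is up
```

Because the endpoint shape matches OpenAI's, most clients just need a base URL pointed at your machine instead of an API key.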

And so now, whenever you prompt, it can actually search the web and use the results to ground its answer in truth instead of potentially hallucinating something. I also found it very reasonable at summarizing things. I will say it's a little bit slower than something like ChatGPT or Claude, but it is running inside my house on that little machine back there, so I can take that. And a lot of times you might fire off a query and come back to it later; that's totally fine. So personally, I see myself replacing Claude, ChatGPT, and Gemini with this for my initial web searching and question asking. It works perfectly fine for that, and I don't have to worry about these AI companies training on my prompts. So for basic question answering and basic searching and summarizing from the web, this has absolutely replaced my usage of Claude, ChatGPT, and Gemini, and now I can do it from the comfort of my own home. Now, when it comes to coding, there are still some things that need to be worked out. It's not perfect.

If you want to use VS Code, you have to use the Insiders version, the pre-release build, and even then you can only use it for agentic workflows, not for autocomplete. There are plugins and extensions that work inside regular VS Code, but they essentially replace Copilot altogether. There's a tool I tried called Continue, and it basically works like Copilot, but you can customize which AI endpoint it talks to, so I can point it directly at my little machine back there that's running locally. It works exactly like you'd expect from Copilot: it has a chat window, and it has agentic workflows where it can create files. But when I tried creating projects from scratch using this particular workflow, the AI struggled at various points, so I had to intervene every now and then to fix a config issue, or tell it it was editing the wrong file or put something in the wrong place.
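For reference, pointing Continue at the local machine is just a few lines of config. Continue's config schema changes between versions, so the field names and address here are assumptions; check their docs:

```json
// ~/.continue/config.json (illustrative; verify against Continue's current docs)
{
  "models": [
    {
      "title": "Local Qwen3 Coder",
      "provider": "openai",
      "model": "qwen3-coder-30b",
      "apiBase": "http://192.168.1.50:8080/v1"
    }
  ]
}
```

The "openai" provider works here for the same reason as before: llama-server speaks the OpenAI-compatible API.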

So it really couldn't do it unguided from basic prompts. However, if you start to use more structured tooling, like spec-driven development, that absolutely keeps it in line a lot more. And that's really my guidance here if you're going to try to use this for bigger workflows: be specific about the prompts you pass it, and try to use sub-agent workflows. I also got this hooked up to opencode, and from my experience, that's the best way to work with local AI, because it's supported out of the box, it can do sub-agents, and it has a web-based dashboard so I can prompt from the couch. I can hook it up to tools like Beads for task management, or Spec Kit for the initial spec and design of what I'm trying to build, then from there create Beads tasks and have an agent orchestrator that spins up sub-agents. So if you keep guardrails around what you're asking this little AI to do, and you have lots of ways for it to verify itself (tests, linting, type checking), all of those things let it check its work instead of just guessing. And that's basically what I found when I was trying to create a project from scratch with a basic prompt: it started to break down.

Right now, you can do that kind of thing with Claude Opus 4.6 and it works perfectly fine, but some of these dumber models need checks and balances. Anything you can do that allows them to verify themselves will keep them in line a whole lot more. I found that having extensive test suites, having the spec drawn out beforehand, having it run through all of those checks before moving on to the next thing, and having it work on only one isolated piece at a time, in a try-things-out, test, repeat cycle, got it to stay aligned much more. So it requires a lot more work and a lot more managing of these agents to get things working on this little machine here. But again, you don't have to pay a monthly subscription, it's a one-time cost, and AI companies don't have access to the source code you're generating. So there are some trade-offs. To really answer the question: am I going to cancel my Claude subscription? No. Right now, I'm actually getting some really good use out of Claude Opus 4.6 for some various personal projects that I'll be talking about eventually, but the kind of work I'm doing there is just too complex for this little machine.

Now, like I mentioned, if you have the right guardrails, if you have tests and everything set up in a way that lets this little machine back there work within guidelines, you can get it working, but it's a lot more work. It's much easier for me to prompt in a more general way to something like Claude Opus 4.6 and have it figure out what I need without all the handholding. Now, I will say that the open-weight models being released by these companies, from Qwen, and GLM 4.7, and Kimi K2.5, and MiniMax 2.5, are all really decent for local models, and they're only going to get smaller and they're only going to get better. That's one of the cool things about this whole ecosystem: if you just wait a month or two, a new model will be released, and all of a sudden the same hardware you already have has better capabilities because of the models that came out. So for me, that's the exciting thing: being able to tinker with these things, and, like I mentioned earlier, being able to prompt these LLMs from the privacy of my own home without AI companies training on my prompts.

So that's all I got for you in this video. Let me know down in the comments if you're going to try this out. Also, let me know if you're already trying this out and what hardware you're working with. And if you have any questions about how I set things up or anything else like that, throw those down in the comments as well. All right, I'll see you in the next one.
