Run Claude Code with ANY Model on Runpod
By Runpod
Summary
Topics Covered
- Self-Hosted Models Slash Costs to Pennies per Hour
- Match Model Size to Task Complexity
- Self-Hosting Unlocks Security and Control
- Small Models Need Granular Direction
Full Transcript
Hey there, welcome back. Earlier this week, we walked through setting up Claude Code within a pod, which used the official Claude platform to handle your prompts. And that was a great start. But you don't actually need to use their platform if you don't want to. In fact, you can connect Claude Code to a model you host on Runpod with no lower limit. You can have a GPU running for pennies per hour to handle this task.
And that's what we're going to cover in this video: why you would want to bring your own model in the first place, setting up a pod with Ollama to handle the inference, configuring Claude to connect to that pod, and then seeing how that performs compared to the actual Claude models. So, first let's go over why you would want to do this at all. And there's a lot of good reasons.
First and foremost, cost. $10 will go a lot further when you self-host models compared to using a big foundational model. In this example, we'll be using a 20B model that's been quantized down, which means you could run it on a 3090 or an A40 for like 30 to 40 cents an hour. That'll get you 24 hours or more of all-you-can-eat, unlimited use of the pod, whereas you could blow through 10 bucks in an hour or two with the larger Claude models if you're not careful.
Second, this allows you to tune your expenditure to the level of task you have for it. If you just have simple asks like boilerplate Python scripts, you don't need Opus for that. In fact, you don't even need Haiku, honestly. These are tasks that practically any coding model can one-shot on the first try. And when you abstract the cost of a query down to a cost-per-token metric, all the Claude models are just overkill for those really simple tasks.
Third, compliance and security. You may have trade secrets or specific security requirements or tool calling where you need to be assured that you have low-level access to, or control of, the underlying operating system. And in that case, those large foundational models aren't going to be able to offer that. When you use Claude Code and bring your own model, you are connecting to an LLM engine under your direct control that you can hack and expand as you need to.
Last, you can customize the individual model to a specific use case. You can use models fine-tuned for specific domains, like a model trained heavily on Python or one optimized for data science, rather than being limited to general-purpose models.
In fact, this is going to be really important when you're working with the smallest models, because they lack a lot of overall general knowledge scope and they do best when you fine-tune them for a really specific purpose.
So, we'll demonstrate this in a pod using Ollama, because we'll need an engine that has robust Anthropic API support, which isn't as widespread as the OpenAI API format that you might be more familiar with. In my opinion, this is the quickest way to stand up a server that works well with Claude Code. To really dig into the budget-oriented nature of this process, we'll start with a 20B coding model quantized down to 4-bit, which will fit pretty comfortably into our most inexpensive GPU specs.
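To see why a 20B model at 4-bit fits on a small card, here is a rough back-of-the-envelope sketch in Python. It only estimates the size of the weights; the 25% overhead for context cache and runtime buffers and the 20 GB card size are assumptions for illustration, not figures from the video.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.25) -> bool:
    """Rough check: weights plus an assumed overhead must fit in VRAM.

    overhead_fraction is a guess covering context cache and runtime buffers;
    real usage depends on context length and the inference engine.
    """
    needed = weight_footprint_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


if __name__ == "__main__":
    assumed_vram_gb = 20.0  # assumption: a 20 GB card
    for bits in (16, 8, 4):
        size = weight_footprint_gb(20, bits)
        verdict = "fits" if fits_in_vram(20, bits, assumed_vram_gb) else "does not fit"
        print(f"20B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in {assumed_vram_gb:.0f} GB")
```

The point of the sketch is just the scaling: halving the bits per weight halves the weight footprint, which is what lets a 20B model drop into the cheapest GPU tier.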
Normally, I'd like to lean on the A40 for VRAM-intensive tasks, because it's the only spec, I believe, that actually gives you more than a gig of memory per. But we're going to actually use the A4500 at 25 cents per hour to dig deep and get into the lowest absolute cost per hour. Okay, we will go ahead into Playerbot. So, we'll just scroll down to the A4500.
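As a quick check on the budget math, here is a small Python sketch that turns an hourly rate into hours of pod time for a fixed budget. The rates are the ones quoted in the video (25 cents for the A4500, 30 to 40 cents for a 3090 or A40); actual pricing may differ, and storage costs are ignored.

```python
def hours_for_budget(budget_usd: float, rate_usd_per_hour: float) -> float:
    """How many hours of pod time a budget buys at a flat hourly rate."""
    if rate_usd_per_hour <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / rate_usd_per_hour


if __name__ == "__main__":
    budget = 10.0  # the $10 example from the video
    quoted_rates = [
        ("A4500 at $0.25/hr", 0.25),
        ("3090 or A40, low end at $0.30/hr", 0.30),
        ("3090 or A40, high end at $0.40/hr", 0.40),
    ]
    for label, rate in quoted_rates:
        print(f"{label}: {hours_for_budget(budget, rate):.1f} hours for ${budget:.0f}")
```

Because the pod is billed by the hour rather than by the token, the cost stays flat no matter how many prompts you send during that time.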
We will change the template to Ollama, and we will just give ourselves a little more container space just in case we need it.
And we will deploy our pod.
And we'll wait for that to come up. And now, while that's booting, we need to pick a model to deploy. There's one important thing to remember here: if you want the tool calling that you normally get with the Claude models, where it edits files on its own, you need to pick a model that actually supports that tool calling. While every LLM will work with Claude Code in that you can send a prompt and get an answer back, if you want the advanced capabilities, our selection will be a little more limited. But we'll use this model in our example, which has actually been fine-tuned to work with tool calling.
Once our Ollama pod is set up, we'll connect to it in the terminal and we will run the following command.
This is a version of GPT-OSS 20B that has been fine-tuned to add functionality for tool calling. For reference, I actually did try it with the base model, but only got the basic LLM functionality, which is what led me to this model. Once we've got it running, we'll just pop in a quick hello world. And boom, we've got our LLM being served for a quarter per hour.
And now we're going to select the pod that we're actually going to run Claude in. So we'll just grab whatever is convenient. We'll grab an A6000, and we'll change it to the latest version of PyTorch and get that up and running. And this is going to be the pod that we actually install Claude on.
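Before installing anything on this second pod, it can be worth confirming that the Ollama pod answers over HTTP. The following is a minimal Python sketch, not the command from the video. It assumes Ollama is listening on its default port 11434, that the port is exposed as an HTTP port on the pod so it is reachable through Runpod's proxy address, and that OLLAMA_POD_ID and OLLAMA_MODEL are hypothetical environment variables you set yourself (the model tag used in the video is not in the transcript).

```python
import json
import os
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default port; assumed to be exposed over HTTP on the pod


def proxy_base_url(pod_id: str, port: int = OLLAMA_PORT) -> str:
    """Build the Runpod proxy address for a pod's exposed HTTP port."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical variable names; nothing is sent unless both are set.
    pod_id = os.environ.get("OLLAMA_POD_ID")
    model = os.environ.get("OLLAMA_MODEL")
    if pod_id and model:
        request = build_generate_request(proxy_base_url(pod_id), model, "Say hello world.")
        with urllib.request.urlopen(request, timeout=120) as response:
            print(json.load(response)["response"])
    else:
        print("Set OLLAMA_POD_ID and OLLAMA_MODEL to send a test prompt.")
```

If this prints a reply, the model is being served and reachable from outside the pod, which is what Claude Code will need in the next steps.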
So if you watched the other video, the command to install is actually the same. And once again, we'll copy the PATH command and run that. And we will need access to a text editor on the command line. So we'll do apt-get update, then apt-get install nano.
And now, before we actually start Claude Code, we need to set some environment variables. So we're going to go to root. And then we're going to go into .claude, which is a hidden directory. And then we're going to nano settings.json.
And I'm going to copy and paste some environment variables here, which I will drop below in the description so you can do that too. And we will need to tell it to connect to our Ollama pod, so we'll need a pod ID. We can click this and then copy our pod ID, and then we can just replace the pod ID that I had there.
And then we'll just hit Ctrl+X to exit. Hit Y for yes. Hit Enter.
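The exact variables pasted in the video are in its description, not in this transcript, so the following is only a sketch of what a settings.json along these lines might look like. It assumes Claude Code reads an "env" block from settings.json, that ANTHROPIC_BASE_URL and ANTHROPIC_MODEL are the relevant variables, and that the Ollama pod is reachable through the Runpod proxy on port 11434. The pod ID and model tag are placeholders.

```python
import json
from pathlib import Path


def build_settings(pod_id: str, model: str, port: int = 11434) -> dict:
    """Build a settings dict that points Claude Code at a self-hosted endpoint.

    The variable names are assumptions; check them against the list in the
    video description before relying on them.
    """
    base_url = f"https://{pod_id}-{port}.proxy.runpod.net"
    return {
        "env": {
            "ANTHROPIC_BASE_URL": base_url,  # where requests are sent
            "ANTHROPIC_MODEL": model,        # model tag served by the pod
        }
    }


def write_settings(settings: dict, path: Path) -> None:
    """Write the settings as pretty-printed JSON, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")


if __name__ == "__main__":
    # Placeholders only; this prints the JSON rather than touching ~/.claude.
    print(json.dumps(build_settings("<your-pod-id>", "<your-model-tag>"), indent=2))
```

To apply it for real, you could call write_settings with Path.home() / ".claude" / "settings.json", which matches the file edited with nano in the video.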
Lastly, if you don't have an active Claude Pro account or an active Claude Console account with billing enabled, we need to set up a workaround so you can skip the authentication on the first setup screen. If you have one of those accounts, you don't need to do this. You can log in as normal. But if you're going in without any Claude account at all, here's what you need to do. First, let's create a file called apiKeyHelper.sh.
And we will just put in this, and then echo. We just need to create a dummy API key. Doesn't really matter what you put in here. And then we'll save. And we'll chmod this to make it executable.
And then we'll go back into settings.json. And then right above env, we'll put in apiKeyHelper, then the path to the key helper, and then save and exit.
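The workaround above can be sketched in Python as follows. This is an illustration under assumptions, not the exact files from the video: the script body (a shebang plus an echo of a dummy key) and the file name are guesses based on what is said on screen, and "apiKeyHelper" is assumed to be the settings.json key that points Claude Code at the script. The demo writes into a temporary directory so it does not touch a real configuration.

```python
import json
import os
import stat
import tempfile
from pathlib import Path

# Assumed script contents: any non-empty dummy value, per the video.
HELPER_BODY = '#!/bin/bash\necho "dummy-key"\n'


def write_helper(path: Path, body: str = HELPER_BODY) -> Path:
    """Write the helper script and mark it executable (the chmod step)."""
    path.write_text(body)
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def add_key_helper(settings: dict, helper_path: Path) -> dict:
    """Return new settings with apiKeyHelper placed first, above "env"."""
    rest = {k: v for k, v in settings.items() if k != "apiKeyHelper"}
    return {"apiKeyHelper": str(helper_path), **rest}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        helper = write_helper(Path(tmp) / "apiKeyHelper.sh")  # hypothetical file name
        existing = {"env": {"EXAMPLE_VARIABLE": "value"}}     # stand-in for the real env block
        print(json.dumps(add_key_helper(existing, helper), indent=2))
        print("executable:", os.access(helper, os.X_OK))
```

Key order in JSON does not change how the file is parsed; placing apiKeyHelper first simply mirrors where it is typed in the video.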
And now that we've started up Claude, we can see that it skipped the authentication and we are ready to begin. Because we're in the .claude folder again, we'll actually exit this, go back to workspace, and run Claude from within there. So to prove this is actually hitting our local pod instead of Claude, we'll type: which model am I speaking to? And clearly it's identifying itself as the GPT-OSS model. So we are actually not talking to Claude at all. We are talking to our pod. So now we get to actually test out the capabilities of this and see how it compares to Claude. So let's just try: create a game that I can play in the terminal with the arrow keys. Arrow keys to move the snake. Eating apples will increase the score.
RX will exit the game.
Okay, it's done. And it only took a minute to do it. When I did this in testing, it actually one-shot it. So, let's see if it will do it this time, too.
All right. So, it is moving.
It's actually really tough to play. It's not the coder's fault. It's the fact that I'm doing this on a terminal. A remote terminal. Let's see if I can get it. Come on.
Oh, and it's doing pretty well, especially for a first try.
We'll see if I can eat the apple emoji.
It's a little tricky. There are a few things that are kind of jumping out at me, but it's clear that this is an actual functional snake game.
And considering this is a very small model that's running at 4-bit, it actually did a pretty good job. Into the wall. See if it exits. And yeah, I'd say that's a pretty passable snake game. Let's see how it does with Tetris. Create a Tetris game for me in the terminal. Arrow keys move pieces. Space bar fast drops. Put a score indicator on the right side.
Okay, it's done. Let's see how it does. I hope it does something wrong so I can test its correction capabilities too.
Oh, I just realized I forgot to... This isn't my fault. I forgot to give it the rotate. I need a rotate function as well. R key will rotate. Also, slow it down some.
That was fast. Let's try it again.
It's a little more manageable.
Yeah. It's laggy, but again, that's a terminal. It's the fact I'm doing this on a remote terminal as opposed to a local terminal.
But can I at least get a line?
I swear I'm [laughter] a Tetris. Oh, there we go. We got a line. Cool. Yeah. Again, it one-shot Tetris as well. Let's see if we can do a web search. Let's switch. Do a web search and tell me five fun facts about Runpod. Because Runpod has been around for a few years, it's probably in the knowledge cutoff, but it's got outdated info in the actual model itself. So let's see if it can actually do a web search.
Oh, it can't do web searches. Why not?
"I don't have the ability to browse the web or pull fresh data from external sites."
Well, that's unfortunate. "If you have specific details or links you'd like me to look at, feel free to share them."
Okay. Well, let's give it a link and see what happens. Let's give it our main site and see what it can do.
It did get the gist of it. It did miss one very important detail, but it clearly demonstrates that it's something. This is a functionality even without Claude. So, it won't actually do web searches. You can't just give it a vague direction to say, "Oh, Google for this and tell me what you find." But if you give it a direct web address, it can actually still pull that data and summarize it and act on it. To clarify, new users can get that credit bonus, but it has to be done through the referral program. It's not for all signups, period.
Let's test the decision-making capability. So, let's try: I need a REST API. Choose the best framework. And let's see if it asks any clarifying questions or explains any trade-offs on this very vague request I'm giving it.
So, its first thought is to search a codebase, and I don't really have a codebase for it to search, but it's nice that it's trying to at least look for context. I'm really interested to see if it asks me any questions, though, because everything that's in there right now is just two games in Python. So it probably doesn't really have a whole lot to go off of.
So, it thought for 7 minutes and then eventually just errored out. For about five of those minutes, it actually was stuck at 1.8K tokens, and then it just kind of bombed out. So, I think with these small models, this is going to be a disadvantage: it requires a lot more direction on your side. You can't rely on the model's ingrained ability to do a lot of planning. You kind of need to break things up into smaller chunks, kind of like I mentioned in the last video. But in general, that's just good practice for AI coding. The more specific and granular the tasks you give it, the better your experience is going to be.
So yeah, thanks for watching. The big advantages of bringing your own model are that it's much less expensive and you get a lot more flexibility and customizability from being able to leverage those specific models. The disadvantages are that it is a lot more exacting and it does need more attention. It's not nearly as fire-and-forget as the big foundational models, and of course there's some setup required. But when it comes to just writing code, which is the main focus of an application like this, it still does that very well if you can direct it appropriately. If you have any cool things you built or questions on how this all works, feel free to drop those in the comments, and I'll see you in the next one. Thanks.