Build Self-Improving Claude Code Skills. The Results Are Crazy.
By Simon Scrapes
Summary
Topics Covered
- Skills Self-Improve Overnight via Auto Research
- Binary Assertions Enable Automated Skill Tuning
- Autonomous Loop Achieves Perfect Skill Score
- Two-Layer Self-Improvement for Skills
Full Transcript
Skills are one of the most powerful things you can build inside Claude Code for your business. But what if those skills could improve themselves overnight? I've built over 20 so far, and getting them from version one to something reliable usually takes weeks of tweaking. You run the skill, you spot something wrong, you open up the skill.md file and make a change. It's repetitive, it's slow, and it's inconsistent.

And then last week, Andrej Karpathy, part of the founding team at OpenAI and former head of AI at Tesla, shared an idea called auto research. The idea is simple: you give an AI system something to improve and one clear way to measure whether it got better. Then it just loops. It tries a change, runs a test, checks the score. If the result improves, it keeps the change; if not, it rolls it back and tries something else. And the best thing about it is that it keeps going all night, so you get to sleep and wake up to a better system. So today we're applying that exact idea to Claude Code skills. I'm going to show you how to set up a loop where your skills improve themselves automatically. Let's get straight into it.

So let's take a quick look at what Karpathy actually built. It can pretty much be summarized by three files. First, there's program.md, which is just a markdown instruction file that we give to the agent telling it what we want to test. There's a fixed data file for recording all the results. And then there's the training script that the agent actually goes in and edits. The core of the program.md file, the one we edit, really comes down to about ten lines, and that's all we need to make it our own: tune train.py with an experimental idea by directly hacking the code, i.e. make a change; run the experiment; read out the results. If the value has improved, advance the branch and keep the commit. If the value is worse, reset to where we started. And I love this line, too: "Never stop. Once the experiment loop has begun, do not pause to ask the human if you should continue. Do not ask 'should I keep going?' or 'is this a good stopping point?' The human might be asleep or gone from the computer and expects you to continue working indefinitely until you are manually stopped. You are autonomous."
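That keep-or-revert logic is easy to picture in code. Here's a minimal, self-contained Python sketch of the auto-research loop; `evaluate` and the candidate "changes" are toy stand-ins for editing train.py and running a real training job, and in the real loop "keep" and "revert" are a git commit and a git reset:

```python
# Toy sketch of Karpathy's auto-research loop: try a change, measure,
# keep it only if the metric improves, otherwise roll it back.
def auto_research_loop(candidates, evaluate):
    state = []                    # changes kept so far (stands in for the git branch)
    best = evaluate(state)        # baseline score
    for change in candidates:
        trial = state + [change]  # "tune train.py with an experimental idea"
        score = evaluate(trial)   # "run the experiment, read out the results"
        if score > best:          # improved: keep it (real loop: git commit)
            state, best = trial, score
        # else: worse, so revert to where we started (real loop: git reset)
    return state, best

# Hypothetical metric: each change shifts the score by a fixed amount.
effects = {"lower_lr": 5, "bigger_batch": -2, "warmup": 3}
state, best = auto_research_loop(list(effects),
                                 lambda s: sum(effects[c] for c in s))
print(state, best)  # keeps "lower_lr" and "warmup", reverts "bigger_batch"
```

In Karpathy's version the metric is a real validation score and each iteration actually edits the training script, but the control flow is exactly this.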
In other words: keep working until there are no additional gains to be made, or until we interrupt you. So let's talk about how we apply this directly to our skills.

Before we can improve a skill's output, we need Claude to actually use the skill. A quick recap, because we covered this in the last video: Claude reads your YAML description to decide whether the skill is relevant, and community testing found that activation was as low as 20% with vague descriptions. So the descriptions in the skill's YAML are super important. Now, Anthropic's upgraded skill-creator skill already has a built-in loop for this, and it's effectively the same pattern as Karpathy's. You give it test queries to see whether a skill activates; some should trigger the skill, some shouldn't. It runs each query multiple times, checks the trigger accuracy, proposes a better description, and then retests. You can see this directly in the improve_description.py file, which is designed to improve a skill description based on evaluation results, and that runs through the run loop, which combines the evaluation and improve-description Python files in a loop. So it keeps running and improves the description based on trigger accuracy: did Claude activate the skill at the right time, yes or no? That's basically how it works. So we already know that this part is automated and built into the skills 2.0 version. There's no need to reinvent the wheel with skill descriptions; we're just going to use Anthropic's built-in skill-creator skill.
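As a rough sketch of what that activation loop is measuring (the function names here are illustrative, not the skill creator's actual API), trigger accuracy is just the fraction of runs where activation matched the expectation:

```python
# Trigger accuracy: run each test query several times and check whether
# the skill activated when it should have (and stayed quiet when it shouldn't).
def trigger_accuracy(queries, skill_activates, runs=3):
    hits = total = 0
    for query, should_trigger in queries:
        for _ in range(runs):
            total += 1
            if skill_activates(query) == should_trigger:
                hits += 1
    return hits / total

queries = [
    ("write a LinkedIn post about automation", True),   # should trigger
    ("what's the weather tomorrow?", False),            # should not
]
# Stand-in for asking Claude: a skill that activates on every query
# gets only the positive cases right, so accuracy comes out at 0.5.
acc = trigger_accuracy(queries, lambda q: True)
```

The description-improvement loop just tries to push this number toward 1.0 by rewording the YAML description between rounds.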
But triggering reliably and producing genuinely great outputs from the skill are two different problems. The skill creator's evals, which we covered in the last video, let you test and score output quality against your own defined metrics. So we actually went ahead and tested this. We asked it to optimize my skill for making sure my copy follows the persuasive techniques listed in my persuasion toolkit reference file, which was just a reference file inside the marketing copywriting skill. Then we said: measure it on whether it always uses that reference file, whether it uses curiosity and open loops, and how often it uses proof or founder-led stories, which were both metrics inside that persuasion toolkit. We tested it by getting it to write landing page copy for my Skool community five times and scoring it against that criteria. And it was brilliant. It came up with qualitative feedback on the skill quality and even displayed it in a nice click-through dashboard. But it wasn't self-improving. So what we're adding today is making that loop run autonomously, Karpathy-style, so it improves overnight without your input.

Let's visualize the two side by side so you can see the exact framework and the shared logic between Karpathy's original loop and ours applied to skills. Karpathy's loop: read the train.py file, change a value, run a test, check the value (the metric he uses is val_bpb), then keep or revert. If the score improved, git commit it, keep it, and run the next loop. If the score dropped, git reset and start again with a different change. Ours is seriously similar: same logic, same infrastructure, but we're reading the skill.md file instead. So: read our skill and process instructions, change a value, run a test, check the pass rate, then keep or revert. The only difference is the metric we're measuring against. Karpathy uses a training value; we check the pass rate against 25 binary assertions across five tests.

So let's talk about what binary assertions are and why they're important. The word "binary" is everything here, and this is where most people get it wrong when running tests on their skills. For example, something binary looks like this: the text does not contain em dashes, or it's under 300 words, or the final line is a question. Those are all true-or-false statements. Compare that to something subjective like "does it have a compelling subject line?" That's obviously not binary, because two people can disagree on what "compelling" actually means, which means we can't automate it. Of course, we can get assistance from Claude Code to say that, based on certain frameworks, this is considered compelling, but that's not the binary true/false approach.

So here's my actual setup. Inside my skills, we've got a marketing copywriting skill, and what we need to do is set up an evals folder. This is something the skill-creator skill can do for you, or you can create it yourself. So we set up an evals folder with an evals.json inside it. And inside that evals.json, we've got 25 assertions: true-or-false binary checks that the autonomous agent can go through and verify.
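To make that concrete, here's a sketch of what such an evals.json could look like. The exact schema the skill creator uses may differ; the field names below are illustrative:

```json
{
  "tests": [
    {
      "prompt": "Write a LinkedIn post about why simple automations beat complex ones",
      "assertions": [
        "The first line is a standalone sentence, not part of a paragraph",
        "The post contains at least one specific number or statistic",
        "The final line is not a question",
        "The total word count is under 300"
      ]
    }
  ]
}
```

The only rule that matters is that every assertion can be marked true or false with no judgment call.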
For example, for the copywriting skill's first test, we feed in the prompt "write a LinkedIn post about why simple automations beat complex ones," with an expected output of a LinkedIn post following brand structure rules. It grabs those brand structure rules from our reference files, the contextual files inside the skill: a tone-of-voice guide, a persuasion toolkit, and examples of good posts. But what we're testing are things that are based entirely on the skill.md and its process. Does the first line appear as a standalone sentence and not as part of a paragraph? That gets marked true or false. Does it contain at least one specific number or statistic? Is the final line not a question? (I don't like questions as the final line of my posts.) Is the total word count under 300? You get the point. We have various tests that we run with different prompts and different assertions that come back true or false. That enables the loop to go through each assertion, validate whether it's true or false, and then make a change to the skill.md if it hasn't hit a perfect score.

And of course, you don't need to create this evals.json manually. You can just ask Claude Code to spin up an evals.json file with assertions that can be validated as true-or-false questions based on your skill.md. Then what we're effectively doing is feeding in a prompt and seeing whether the output hits those assertions. If it doesn't, we need to improve our skill.md so that Claude Code is able to follow it every single time.

After that, all you need to do to run this autonomously is say: use the skill-creator skill (we probably didn't even need to say that) and run a self-improvement loop on my copywriting skill. We point it to our evals file to evaluate each iteration. We're telling it to use the same principles, detect whether each assertion passes or fails, and return a pass/fail mark.
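Scoring an iteration then reduces to running each draft through a set of true/false checks. Here's a minimal Python sketch; these three checks mirror assertions from my evals, while a real run scores 25 of them across five prompts:

```python
# Binary assertions: each check returns strictly True or False.
def no_em_dashes(text):
    return "—" not in text

def under_300_words(text):
    return len(text.split()) < 300

def final_line_not_question(text):
    return not text.strip().splitlines()[-1].strip().endswith("?")

CHECKS = [no_em_dashes, under_300_words, final_line_not_question]

def pass_rate(draft):
    """Fraction of assertions the draft passes; this is the loop's whole metric."""
    return sum(check(draft) for check in CHECKS) / len(CHECKS)

draft = "Simple beats complex.\nOne automation saved me 7 hours a week.\nSteal it."
rate = pass_rate(draft)   # all three checks pass, so the rate is 1.0
```

Because every check is deterministic, the agent can compute this score itself after each edit to the skill.md and decide to commit or reset without asking you anything.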
If any of the assertions fail, it makes one change to the skill.md. So we're using the exact same logic we laid out in that diagram. If any fail, rerun the tests and recalculate. If the score improved, keep the change and git commit. If it dropped, git reset and make a different change. It logs everything. And we've also given it the instruction: don't ask for my permission, and keep looping until I interrupt you or you hit a perfect score.

So we ran it. On the first run, we scored 23 out of 24. As I've already mentioned, this is something like the fifth version of this marketing copywriting skill, so it had already gone through quite comprehensive iterations and changes. But you can see that on the first iteration of this test we had a 95.8% success rate. One assertion failed: the "end with a question" rule, which was actually a rule in the tone-of-voice.md but not in the skill.md, so we had contradictory information there. So it added a rule to the skill.md: LinkedIn posts must not end with a question; close with a declarative statement, a CTA, or a punchy fragment. On the second run it got a perfect score. Obviously we're talking about an example that only needed two runs to be perfect; with a skill you've just created, this will take many runs to actually refine and improve. So get Claude Code to write your assertions once, set up the loop, and you can literally let it run overnight (or run multiple agents against multiple skills) and come back the next day to skills that are structurally more sound.

So there are two layers of skill self-improvement. Layer one is the skill creator's own description-improvement loop, which improves skill activation so the skill actually triggers in the first place. Layer two is our amended Karpathy loop for skill outputs, which uses those binary true/false assertions and a score, plus autonomous improvement through a simple prompt where we ask it to use the evals and keep looping until we're happy it's met a certain criteria.

Now, a quick note on limitations. The binary loop handles structure, format, word counts, and forbidden patterns, but it does not handle tone of voice, creative quality, or whether your skill is actually using the context you've put in your reference files properly. Those still need a bit of human judgment. But if you watched the last video, you already know how to use the skill-creator tool for that: it gives you a side-by-side dashboard to review the qualitative output, write feedback, and even A/B test your reference files, while this binary loop handles the more structural stuff.

Now, if you're looking for bespoke skills to run your business, we've just launched a complete agentic operating system built on Claude Code that ties all of this, including all the skills, into one system. It has your brand memory, 18 production skills across marketing, strategy, ops, and visuals, a self-learning loop, self-maintenance, and you can access it from your phone through Telegram too. It's not a personal assistant; it's your entire business context packaged into a system that gets sharper every time you use it. Links down in the description if you want more info. And thanks so much for watching.