Claude Code Skills Just Got Their Biggest Update Yet
By Ray Amjad
Topics Covered
- Vibe-Based Skills Fail Post-Model Updates
- Capability Uplift Skills Have Retirement Dates
- A/B Test Skills Prove Real Impact
- Optimize Triggers for Reliable Skill Activation
Full Transcript
Alright, so Anthropic just made it much easier for you to make better skills for Claude Code and for Claude Cowork. And that's exactly what we will be going over in this video. But before
getting started, if you are interested, there is a sale going on right now in my Claude Code Masterclass to celebrate the 1-year birthday of Claude Code. It is the most comprehensive class on Claude Code that you will find online, and many people from hundreds of companies have taken it and have gone on to be the best Claude Code users at their companies. Okay, so right
now most people are developing Claude skills based exclusively on vibes. And what they're doing is they go through the process once with Claude Code, and then they say, hey, can you turn this into a skill? They may give it additional resources such as a blog post or an internal document or something else to help it make a better skill. And then they will
try the skill a few times to make sure that it works. They'll be like, hey, this looks good, and then ship it to the rest of the team to use or just online. And many people have made some really great skills with this approach. For example, this GitHub repo of marketing skills. But
there are a couple problems here. Firstly, whenever we have a brand new model update, it may be the case that your skill is actually no longer helping Claude Code because a lot of the ideas and functionality you encoded inside of your skill have now been encoded into the model. Or the
model can do a better job than your skill can. So by triggering your skill, you're actually holding the model back from reaching its true potential. Also, right now you don't really know whether making a change to a skill will lead to any better output. So what Anthropic did is make a whole bunch of improvements to their Skill Creator skill to make it easier for you to write skills,
see if they're actually making a difference, run evals to make sure they're being triggered reliably, and a whole bunch of other things as well. We'll go through an example later in the video, but basically you would use this brand new Skill Creator skill to help you make one. Then you may want to run your own evals to make sure it's being triggered reliably in the way you want it to be triggered, and you may want to do an A/B test to make sure the skill you just developed is actually making a difference. So Anthropic says that skills generally fall into two different categories, the first of which is capability uplift. So
essentially right now the model may not be smart enough in a certain domain. So it may not know how to handle Swift concurrency properly, or it may be really bad at filling out PDFs or making PowerPoints right now. And this kind of skill basically gives a model missing information or provides techniques or patterns that it can use to achieve whatever goal you have. So as an example,
in the Anthropic official repository, they have a bunch of skills such as handling Word documents, PDFs, and PowerPoints. And the reason Anthropic made a PDF skill is because right now the models aren't really that great at handling PDF-related tasks. Or it may be the case that Opus 4.5 or Opus 5 becomes much better at handling PDFs and you no longer need to have the skill. So usually
capability uplift skills that fill in missing information or teach the model techniques have their own retirement date, and the Skill Creator skill can help you determine whether you should get rid of a skill because the base model's capability has caught up to the level of the skill. The next category of skills encodes workflows or preferences that you have, whether because of compliance-related reasons or because your system is just designed in a certain way. A quick example would be the Windows Release skill for my application Hyperwhisper, my voice-to-text application. When I want to do a new Windows release, like after making an update, I just trigger the skill and it goes through the entire workflow I defined beforehand. But you still want to make sure that when you're developing your skill, it's being triggered reliably and doing what you would expect, especially if the skill is pretty complicated. So capability uplift can include filling out PDF
forms, using OCR to figure out which part of the form should be filled out, and also making complicated documents like PowerPoint and Word documents. Skills that encode preferences would be things like an NDA review checklist, a skill that compiles data from your different MCP servers, like PostHog and Jira, into a weekly report, or a skill that enforces a very specific flow for code reviews. For example, Anthropic found that their PDF skill struggled with forms that did not have fillable components, and it always placed things in the wrong area. So they used their brand new method to isolate the failure, make improvements to the skill, and then it finally started working consistently. All right, now let's go through a quick example of how you can use this. Firstly,
you want to go to /plugins and make sure that you have the Skill Creator plugin installed. So
searching it over here, and then we can install it on this project level in our case. And then
after restarting Claude Code, if I do /Skill Creator, I can see that I actually have two. And
one of these is an older one that I installed a couple of months ago, and that exists at my user level. So I'm going to delete this from the user-level Claude.md file, and that has now disappeared. Anyways, we can ask the Skill Creator skill, hey, can you tell me what you can do? And just to quickly compare it to what the old one says it can do: they can both make skills from scratch, but the new one can create test cases to see how the skill performs.
Improving existing skills also has some new capabilities: it can run test prompts and compare against a baseline, identify what's not working and revise instructions, and also run benchmarks for us. And finally, it can run an optimization feedback loop where it will test different skill descriptions to see which one triggers most reliably against realistic prompts.
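In spirit, each generated test case pairs a realistic prompt with a list of checkable expectations that a grader scores against. Here is a hypothetical Python sketch of that shape — the case content and the grading rule are invented for illustration, not taken from the plugin:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One skill test case: a realistic user prompt plus the
    expectations a grader checks in the final output."""
    prompt: str
    expectations: list[str] = field(default_factory=list)

# Hypothetical cases an SEO-audit skill might be tested against.
cases = [
    EvalCase(
        prompt="Run an SEO audit on example.com",
        expectations=[
            "flags missing meta descriptions",
            "checks title tags for target keywords",
        ],
    ),
]

def grade(output: str, case: EvalCase) -> float:
    """Toy grader: fraction of expectations literally present in the
    output (a real grader would use an LLM judge instead)."""
    hits = sum(1 for e in case.expectations if e in output)
    return hits / len(case.expectations)
```

The point is only the shape: prompts and expectations are data, and grading is a function of the output against that data, which is what makes the later A/B runs comparable.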
Okay, now let's go through the entire process so I can explain how it works. All right, now let's use this Skill Creator skill and say, "Make me an SEO audit skill." Press enter. And ideally
we would want to be more descriptive or give it reference documentation it can use to make a better skill, but I'll start off with a pretty simple description and rely on the underlying model's capabilities. When asking questions, it now says: should we set up test cases to verify the skill works well? And I'm going to say yes, run evals as well. Okay, so now it made us a skill, and it will make us some test cases with realistic prompts as well. These are the evals it came up with for us, and now it's going to start testing the skill to make sure that it
actually makes a difference. So it's launching 6 runs in parallel: 3 with the skill and 3 without.
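Conceptually, that with/without benchmark works something like the sketch below. This is a hypothetical Python illustration of the pattern, not Anthropic's actual implementation: `run_agent` stands in for spawning a Claude Code subagent, and the grader is kept blind to which output came from the skill-enabled run.

```python
import random

def run_agent(prompt: str, with_skill: bool) -> str:
    """Hypothetical stand-in for launching a subagent; a real harness
    would return the agent's actual output."""
    return ("thorough SEO audit with metadata checks" if with_skill
            else "brief SEO notes")

def blind_grade(output_a: str, output_b: str) -> int:
    """Toy comparator: prefers the longer report. It sees only two
    anonymous outputs, never which one had the skill loaded."""
    return 0 if len(output_a) >= len(output_b) else 1

def benchmark(prompt: str, runs_per_side: int = 3) -> int:
    """Run the prompt with and without the skill, shuffle each pair so
    the grader is blind, and count wins for the skill-enabled side."""
    skill_wins = 0
    for _ in range(runs_per_side):
        pair = [("skill", run_agent(prompt, True)),
                ("baseline", run_agent(prompt, False))]
        random.shuffle(pair)  # hide which output is which
        winner = pair[blind_grade(pair[0][1], pair[1][1])][0]
        skill_wins += winner == "skill"
    return skill_wins
```

With the toy grader above, the skill-enabled output wins every pair; a real run grades each output against the expectation list instead of comparing lengths.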
And then it will grade the results against the expectations it has in this expectation list. We can see these agents running by doing /tasks, and all of these are running, one without the skill and one with the skill. Go to any of these and we can see what's happening behind the scenes. Essentially, for each eval it spawned 2 subagents, one with the skill and one without. Then it goes back into the main session, which has a comparator. And the
comparator just compares both of the outputs, but it doesn't know which one used the skill and which one did not. This matters when making a skill because you can actually verify it's making some kind of difference in your codebase. And also, when we get a model upgrade, you can reevaluate the critical skills that you use every single day with the Skill Creator
and be like, hey, can you run an A/B test to make sure that this skill is still performing well? So
it may be the case that right now Opus 4.6 isn't really that good at SEO auditing, but Opus 5 will be. So when Opus 5 comes around, you can run this skill-comparison A/B test on any pre-existing skills that you have made and determine whether you should delete the skill, keep it, or change it. In Anthropic's case, they benchmarked PDF tasks with and without their PDF skill loaded, and they found that having the skill loaded led to a better pass rate compared to not
having it loaded. So it seems that all 6 runs have been completed, and it's launching 6 grader agents to grade them in parallel. And if we look inside our project, it made an SEO audit workspace.
This contains the first iteration it came up with using the skill, the report it produced, the grading according to the evals we had defined, and also some timing information. It repeated this 6 times. But yeah, let's wait until the grading has been completed. Essentially, there are 2 types of evals that can happen. One of them is capability evals: which output is better, like the SEO audit is better, or the PDF fields have been filled in correctly. And then
we have procedural evals. Say you made an insurance claim triage skill for your organization: you fed in all the internal documentation and then gave it 20 or 30 examples with known results that it can run evals against. For example, you could say that if the insurance claim value is greater than $10,000, it requires a police report, and if it's an injury claim, it requires medical records as well. Now with this new framework, you can give the Skill Creator skill your insurance claim triage skill, give it a bunch of examples, and have it automatically grade and make improvements to the skill. Okay,
so now it's given us the benchmark results. It says that with the skill enabled, the success rate is 13.5% higher and the average time to complete the task is 22% lower, though it also uses slightly more tokens with the skill enabled. It says there are some HTML reports that I can see in the viewer, but it hadn't opened up my browser for me; now it's opened up the report right over here, so I can see the outputs. These are basically long HTML files for the
SEO reports, and then the grade it's given each output. I can leave my own feedback here, go over to the next output, and review that as well. Once I review all of them, I can press submit all reviews. If I go to benchmarks, I can see the benchmarks we saw before, which assertions passed with the skill and without the skill, and then some final analysis
notes as well. So if I were to press submit all reviews, it would download a feedback.json file.
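The procedural evals from the insurance example boil down to rules a grader can check mechanically. A hypothetical Python sketch, with field names invented for illustration:

```python
def triage_issues(claim: dict) -> list[str]:
    """Return the procedural-eval failures for one claim.
    The rules mirror the examples above: high-value claims need a
    police report, injury claims need medical records."""
    issues = []
    if claim["value"] > 10_000 and not claim.get("police_report"):
        issues.append("missing police report")
    if claim["type"] == "injury" and not claim.get("medical_records"):
        issues.append("missing medical records")
    return issues
```

A grader can then assert that the triage skill's output matches these expected issues for each of the 20-30 labeled examples, which is what makes the automatic grade-and-improve loop possible.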
And then I would just drag and drop that feedback.json file back into Claude and say, okay, make these improvements. Now, your skill description determines whether or not a skill is going to be triggered, and there may be cases where your skill isn't being triggered reliably, in which case you can tell Claude Code, with the Skill Creator skill, to improve on the trigger. For example, I have this skill over here, and I can open up a new tab, go to Skill Creator, and say: can you optimize the description of my sync model skill to make sure it triggers more reliably? It now comes up with some example prompts of what a user would say where it should be triggered, and some prompts where it should not be triggered. All
right, so now it gave us a brand new review interface where we can review these prompts and decide whether or not the skill should trigger, and add our own queries as well. Once we're done, we can export this and drag it back into Claude Code. So, pressing enter. Now it's going to go through its optimization loop, where we have 20 different queries: it will train on 60% and keep 40% for testing. It's kind of like machine learning generally, where you have a training set and a test set. Essentially the way it works is, our queries are split into a training set and a testing set. Claude then fires every query in the training set and checks whether the skill was actually triggered.
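That train/test split is essentially classic holdout validation. A rough Python sketch of the loop's shape — hypothetical, not the plugin's actual code; `propose_description` and `would_trigger` stand in for Claude drafting a description and for firing a query to see if the skill triggers:

```python
import random

def optimize_description(queries, labels, propose_description, would_trigger,
                         max_rounds=5, train_frac=0.6):
    """Holdout loop: draft a description, measure trigger accuracy on the
    training queries, and keep 40% held out to check generalization."""
    random.shuffle(queries)
    cut = int(len(queries) * train_frac)
    train, test = queries[:cut], queries[cut:]

    best_desc, best_acc = None, -1.0
    for round_num in range(max_rounds):
        desc = propose_description(round_num)
        # Accuracy = how often the trigger decision matches the label.
        acc = sum(would_trigger(desc, q) == labels[q] for q in train) / len(train)
        if acc > best_acc:
            best_desc, best_acc = desc, acc
        if acc == 1.0:  # already triggers perfectly on the training set
            break
    # Final check on the held-out 40% the loop never trained on.
    test_acc = sum(would_trigger(best_desc, q) == labels[q] for q in test) / len(test)
    return best_desc, test_acc
```

Holding out the 40% matters because a description can be tuned until it fits the training queries; only the untouched test set tells you whether it generalizes to prompts the loop never saw.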
It doesn't actually run through the entire skill, it just checks whether it was triggered. So this
will repeat up to 5 times until it makes a better description. And you can see right now it says the description is too generic for the skill. Claude thinks it can handle these tasks without actually consulting the skill. So now it's going to write a more optimized description and then go through another cycle just to check. And in Anthropic's blog post, they found that through this process
they can get the skill to trigger more reliably. But yeah, it is pretty interesting how heavily Anthropic is investing in skills. And you may think after watching this video, like, this seems like a bunch of effort, why would I bother? But I think the reality is that if you have a skill that you're using multiple times a day and you're going to be using it for the foreseeable future,
the small amount of time that you put into making sure that A, it's actually delivering better results and B, triggering reliably is going to pay off in the long term. Because nowadays we're finding that some people's jobs are essentially being replaced by a couple of Claude skills. So
I will be running this against some of the other skills I have across my projects to make them better, and I would recommend running it on your own projects as well, especially for the skills you use most often, and for skills you made a couple of months ago, because it may be that those skills are no longer required now that the model's core behavior encodes all their functionality itself. This can also be useful when you're publishing skills
or downloading skills online, because you can actually check with the Skill Creator to make sure they're making a difference in your codebase. Anyways, if you do want to learn more about Claude Code and everything that I'm learning, then you can join my Claude Code newsletter. It will be linked down below. I basically share anything that I'm learning, thinking about, or investigating. And by signing up, you get access to a bunch of free videos from the Master Claude Code class. If you want to upgrade and access all the videos, you can use the coupon code BIRTHDAY to get a discount, because it's Claude Code's 1-year birthday. There are a bunch of classes covering pretty much every single feature in Claude Code, some about context engineering, some about my own daily workflows, like what I'm doing on a day-to-day basis, and then a
bunch of bonus techniques as well that can make sure that you're prompting more effectively and addressing common failure patterns that you may find Claude Code and other agents falling into.