I Made My Claude Skills Learn Without Going Rogue
By Mansel Scheffel
Summary
Topics Covered
- Evals Grade Skills, Refinement Teaches Them
- Raw Evidence Doesn't Always Belong in a Skill
- The Blast Radius Test for Autorefining
- Build Skills Only for Repeated Real Problems
Full Transcript
So apparently your AI is supposed to rewrite itself now. Read your Slack.
Learn your voice. Become you while you sip a flat white. Unfortunately, you've
been mis-sold. In this video, I'll show you what self-improving actually looks like under the hood so that you know how to approach it for the systems that you build for your clients. Let's get into it. So, first things first, if you don't know what a skill is, you need to go and check my other videos on skills to understand what they are before you get into this. This will be a little bit more of an advanced concept. Second
thing here, there are many ways to solve this exact problem we're talking about.
This is just one way to do it. Now that
that's out of the way, we can finally get into this video. So, a skill in its simplest form is literally just a procedure that the AI is going to follow every single time. A workflow, standard operating procedure, whatever it is. And
we do that because it saves you time. It
improves the quality over time, and it's easier to iterate for the exact same work that you're going to be doing over and over again. But there comes a problem where sometimes you end up repeating yourself because you don't take the time to address the problems
with your skill after you've actually built it. I had this problem all the time with several skills, up until the point where I actually built the skill refinement system, particularly because they were often small things where I would just solve the problem on the fly, but I did it in the system that the output was ending up in instead of just getting
Claude to go and fix it. So our goal here is to stop that stupid manual pattern that a lot of people have, and not just get to a point where our skills are refined but also get to a point where we can autorefine some of them, because I certainly don't believe you should
be autorefining all of them. More on
that in a second. So from my point of view, skill refinement turns real-world feedback into better reusable AI behavior. That is how I'm framing this whole video. So we take a rejected output over here. Perhaps the
wrong voice, maybe it was missing a rule, maybe it wrote a little bit too much AI slop. We're then chucking it into this evidence box. And once we've gathered all of our evidence into a box, we need to do several steps inside that
box to then ultimately improve our skill, with the goal being that we want a refined, clearer, stronger behavior within the specific skill that we are working on. The important thing to remember here is that we are getting real feedback from real systems inside our actual working environment and then using that to change the behavior of our skill for the next time that it runs and
therefore improve it. Something to keep in mind: evals are not refinement.
They're not the same thing, even though
there is some overlap. To me, evals grade the skill. So we might evaluate against known static examples of work while we are building the skill initially. We
would run our evals over as many examples as we could and it would grade it each time to see which one was better and we might give it feedback as to what was wrong, why didn't it match our expectations, why is it not our
definition of good. Skill refinement as a comparison to me would be something that teaches the skill over time with real world data. Like I mentioned on the previous slide, it is something that is observed through usage. The usage
informs the notes. The notes create the rules, and the rules improve the skill over time. To help ground this a little bit more, we need to look at the skill life cycle as a whole, because it really is an entire process. We don't just build something and then refine it and job done. It all starts obviously by building our skill with a very, very clear definition of done. You need to come in there at least having an understanding of what you're trying to achieve. For instance, if we're trying to write LinkedIn content out there, you're going to have an understanding of at least what type of content you want to be writing, what type of voice you might want to be using, and you would
give as much information as you can to AI up front to go and build your initial skill. Once we've done that, we would then run some evals against it where we are evaluating against the definition of done that we just created. Have we
matched what we were trying to get to?
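To make the eval step concrete, here is a minimal sketch in Python, assuming a definition of done can be written as a handful of named pass/fail checks. The check names, the 1,300-character limit, and the banned phrase are made up for illustration; in practice an AI grader would usually sit alongside simple rules like these.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One rule from a hypothetical definition of done."""
    name: str
    passes: Callable[[str], bool]


# Illustrative rules only; a real definition of done would be your own.
DEFINITION_OF_DONE = [
    Check("under_1300_chars", lambda text: len(text) <= 1300),
    Check("at_most_3_hashtags", lambda text: text.count("#") <= 3),
    Check("no_let_that_sink_in", lambda text: "let that sink in" not in text.lower()),
]


def grade(draft: str) -> dict[str, bool]:
    """Grade one draft against every check in the definition of done."""
    return {check.name: check.passes(draft) for check in DEFINITION_OF_DONE}


def pass_rate(drafts: list[str]) -> float:
    """Share of drafts that satisfy every check (0.0 for an empty list)."""
    if not drafts:
        return 0.0
    passed = sum(all(grade(draft).values()) for draft in drafts)
    return passed / len(drafts)
```

Running `pass_rate` over the same static examples before and after a change gives a rough way to see whether the skill moved closer to the definition of done.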
After we've done that to the best of our ability, we would then use it. We would
put it into real situations. Go out
there and write based on X, Y, Z. Maybe
write me 20 posts. After we've done that, we're going to be collecting feedback, the outcomes, the observations, any of the edge cases that might have come out from running it over so many iterations. After that, we then get into the refinement process because
now we have all of the data that we need in order to actually make these refinements and learn from the things that we had from the previous runs. Then
we would re-eval that to make sure that our behaviors have stuck and that the patterns are actually better than they were the first time that we set up the skill with our initial definition of done. And then finally, as a part of this, we might want to curate this. But
I'll cover curation in another video. I
wanted to bring up the skill life cycle because it is important to understand that you're not just going to be able to refine something if you've already given this thing a shitty definition of done from the minute that you started. And
for those of you wondering, yes, of course, even if you don't have a complete definition of done up front, you can use AI to actually help you understand what a good definition of done is in the first place. Many ways to research this before you dive into
refinement. So, always make sure that you're nailing the first step of this process very well because it means you're probably going to have to do less refinement down the line. And that
brings us onto our three-layer system.
Now, we'll get into the practical part in just a little bit, but first we need to look at a few more slides to understand how this system works. So our three-layer system is quite simple, although there are various intricate pieces that go inside it, and I'll break them down for you. So with
signal capture we are pulling in all of the information from our outside systems. We would then have our refinement engine which runs through various processes in order to look at all of that information and decide what
it needs to do with it before it can apply any quality changes or any refinements. Finally, as a part of this, we would need to understand the cadence behind it. So is this thing going to run daily or weekly? Is it going to be called at the end of a session inside your VS Code environment or Claude Code, or are you just going to have a weekly review that runs as a separate skill?
There are many ways to skin this cat.
You can even use hooks if you want to, but for the everyday user, hooks might not always be available. So, I think for most businesses, having something like a daily or weekly review is probably the best cadence. Then, next up, we can take
a deeper look into the skills pipeline.
When we understand what each skill is doing, it helps us understand the processes inside the system, which will make it a lot clearer for you guys. So,
like I said, the first skill that we want to run is our signal capture. And
if we take a look at what's inside there, we have our little evidence inbox. And you can see that we're just using it to gather all of that information from our outside systems over here. Rejected drafts from something that we wrote, maybe a call transcript from some Fathom sessions that we had, which could tie in with the session notes. What were the key takeaways, the questions, the updated client information? It would all go into our evidence box: customer comments, failed emails, whatever. It lives inside here so that we can use this to sort
through it and ultimately manipulate it and refine it. Next skill we have is the evidence router and that is doing exactly what it sounds like. It is
routing the evidence that we have inside our signal capture box over here. It
decides the destination of where all of this information that we have in the raw format gets sent to because not every piece of information that comes in from a system goes directly into a skill.md.
There are different things that form our skill.md. There are references, there are context files, there are memories, various things that go into our AI operating system as a whole that we might want to update. So, we wouldn't just route everything to skills. And
that's why we have this evidence router skill, because it can intelligently decide, based on the evidence that it's found inside that signal box, where to send it. For instance, if it learned a behavior about the skill that ran and
it was specifically related to skill.md
not doing something, it would then route the behavioral change to say, "Oh, hey, we need to make a change to the skill.md." If we learned some new information from some prospects that we spoke to about their company, it would say, "Hm, this sounds a lot more like context. It should probably go into the context file for this specific client."
So that the next runs that we have for anything related to that client, it takes into account that new context instead of living off of the old one.
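The routing decision can be sketched like this in Python. In the system described here the router is itself a skill that uses AI judgment, so this rule-based version is only a stand-in to show the shape of the decision. The signal labels and folder paths are assumptions for illustration, not the actual layout.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    SKILL = "skill"
    CONTEXT = "context"
    MEMORY = "memory"


@dataclass(frozen=True)
class Evidence:
    signal_type: str  # hypothetical labels: "rejected_output", "changed_fact"
    subject: str      # a skill name or a client name


def route(evidence: Evidence) -> tuple[Destination, str]:
    """Pick a destination type and file path for one piece of evidence."""
    if evidence.signal_type == "rejected_output":
        # A behavior problem in a skill belongs in that skill's own file.
        return Destination.SKILL, f"skills/{evidence.subject}/skill.md"
    if evidence.signal_type == "changed_fact":
        # New facts about a client belong in that client's context file.
        return Destination.CONTEXT, f"context/clients/{evidence.subject}.md"
    # Anything unrecognized is parked as a general note (assumed fallback).
    return Destination.MEMORY, "memory/notes.md"
```

The point of keeping the destination type separate from the path is that the later approval step can treat a skill change differently from a context change.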
You get the point by now. We are routing that raw information, sorting through all of it, and putting it in the right place so that later on when we get to the approval part, it knows exactly where it needs to go. Next up, we have the skill self-update. And this is
proposing the skill changes. It can't
just do that on a whim. Of course, it is doing this off of actual evidence.
And it's why I said in the beginning that we need to have a very clear definition of done. Because in this case, as a part of our workflow, we have this role of something called the judge.
And what it's doing here is literally judging whether the changes that are about to go through are actually of better quality or not. It is an AI gate. In
this case, just remember that the judge cannot know quality unless the skill defines what good is up front. So as a part of our skill, we have made very sure that ours has a very clear understanding of the standards that we want for whatever the skill is that we're working with. So after step three, where we have suggested any adds
or edits or things that we need to remove, we get on to step four where we update the context. And all of this stuff will end up in a folder called proposals. The reason that we do that is because we need a human gate as a part of this process. Like I said earlier, I don't think every single thing out there can just magically autorefine itself.
Despite what a lot of people on YouTube say, there's definitely a need for a human in the loop here. And we need to take a look at why now. Instead of
thinking to yourself, what kind of skills can I have running on autopilot?
It's much better to look at the gate from the perspective of what is the worst thing that can happen if the information that is autorefined is incorrect. The blast radius, if you will, because that will help ground you when you're looking at autorefining your skills. I see a lot of people saying, "Oh, you know, I have the system where it pulls in all this information from my sales calls and automatically updates all my clients. Does that about five or six times a day." And I'm like, "Okay, cool. So, how do you know if any of that information is actually accurate for the clients that you're serving?" Because
that's the first thing that goes to my mind when I have discussions with people and I can see that they're talking mostly based on hype. They don't really know what their systems are doing. So, for
me, this comes down to these three M's over here: the megaphone, the money, and the meaning. How will it impact your audience? How is it going to impact your money? And how does it impact the meaning of your systems?
Because having one wrong thing in one wrong place can destroy a whole cascading set of skills that you have.
For instance, if you change the context for your ICP, how many skills use that ICP in order to fulfill whatever it is that part of that workflow does? So, if
you had to have some form of auto refinement on your ICP and nobody's checking this thing and the AI actually made an error in judgment, you're going to screw up every single one of the workflows that rely on that. So you need to look at this from the perspective of
who is affected, what value is at stake, and does this change our promise or change our direction. If the answer is yes to any of those things, I would highly recommend pausing for a second, using a system like this, and literally just reading the proposed change and then pushing it through. It will take you 10 minutes and save you a ton of trouble.
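The three M's check can be sketched as a small gate in Python. This is a simplified illustration under my own assumptions: the field names, the "propose" and "auto_apply" labels, and the dependency map are hypothetical, and in a real setup an AI or a person would be the one answering the three questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    megaphone: bool  # does the change reach an audience?
    money: bool      # is commercial value at stake?
    meaning: bool    # does it change our promise or direction?


def decide(radius: BlastRadius) -> str:
    """Send the change to a human if any of the three M's is touched."""
    if radius.megaphone or radius.money or radius.meaning:
        return "propose"
    return "auto_apply"


def skills_affected(context_file: str, reads: dict[str, list[str]]) -> list[str]:
    """List the skills that read a given context file, e.g. the ICP."""
    return sorted(skill for skill, files in reads.items() if context_file in files)
```

Checking `skills_affected` for something like the ICP file before changing it is one way to see the cascade in advance: the longer the list, the stronger the case for a human review.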
Great. So now that we understand the consequences of what can happen if we don't have a human in the loop, we need to find some form of cadence that works for us. For me, in my cadence, I just do this thing weekly. I don't need something to run every single day, but that's just for the style of my business. You can do yours differently.
So, if we take a look at our options, we can do any of these over here.
The first one being manual. Now, if you were working with Claude and you were having a conversation one day and you noticed that it made a few mistakes, you could literally just say at the time, hey, I don't want this mistake to happen again, review our skill, take everything
that we've spoken about in here, and update our skill so it never happens again. That's the easiest way to do it.
And often that's the best way to do it because you're right there. You can just re-run it through some evals again to make sure it doesn't happen. But then there are obviously other times this can happen.
After a call is a very popular one, or at a session end. So again, when I close this, that is a session end, and a hook would then fire and it could update a whole bunch of things. You could also manipulate any of the hooks that happen in between the sessions that you have in VS Code or Claude Code. There are many of them.
I'll cover that in another video. But
for most average users out there, the weekly or daily schedule is probably going to be your best bet, as well as pulling in information from other systems as they happen in real time, which you can do either via a skill or you can just use a web hook that pushes
something down, however you want to make that thing work. But for the average user, the weekly or daily cadence is probably going to be the way to go. You
can get it to pull information down from other systems or you can just get those systems to push the information directly into the folders that we're about to take a look at. Cool. So, here we are in my environment and I'm going to run through it very quickly. I've just given Claude a very simple prompt to run
through everything I just spoke about and then stop at certain sections so that you can see how it pans out. On the
left over here, you can see we've got our evidence folder. And like I said, this is all just markdown stuff. And
each of the skills that will run as a part of our refinement loop, they'll be dumping things into certain parts here.
Remember, in our first instance, intake is where all of the raw events come in from our outside systems. It could be Fathom, it could be Notion, could be Slack, any sessions that we have where we want to pull in that outside information. So, I'm just going to hit enter on this and it's going to run through the first part over here. It's
going to generate some mock data for us and it's going to throw it into the intake folder as if it has just harvested a bunch of information from our outside systems for us. Okay, so
stage one is complete and this was just generating some mock data for us. So, we
can see here our information has now come from our outside systems. We have a draft for LinkedIn that was rejected and we have an Acme procurement call from our Fathom transcripts. So the first thing here it gives us a little bit of
information around the draft: what was written and why the user rejected it. Of
course you would want to have reasons in there giving as much information as possible when you reject something because the clearer you are the more the AI will understand what to learn from that and then it tells us why this matters and how it is going to affect
future LinkedIn posts. Then for our call, this is based off of a discovery call with Acme Robotics. It gives us key moments from the call and then it tells us why this matters. These are durable customer commercial facts, not one-off task notes. And that means it's going to change several skills. For instance, if they had a rule like any deal over 25K at Acme now requires procurement review, that means you're going to have to change your proposal, your statements of
work, and perhaps some of the ways that you actually reach out to them in the first place. It also says that Acme now requires SOC 2 and a signed DPA before production access. So, that's going to change delivery and a few other things that might affect the consultants who you'll be pushing to their site to go and actually do the work. It then also lets us know that multiple skills should know this. Our proposal generator, our meeting prep, and our sales closer, amongst a few other things. So,
capturing this raw information and distilling it where we can is very important. Next up, we have our signal capture skill. And for those of you who've been paying attention, that was the one that puts everything into our evidence box. So, we've taken all of our raw data, we've run it through the signal capture, and we've now built little evidence cards to prove that the information in there is actually valid and worthwhile going through our pipeline to make a change. So for
instance here, we've now moved out of intake into inbox, and if we look at our Acme procurement card that we just read over, it gives us all the information that we need. Our source system was Fathom,
the signal type was a changed fact, and it gives a little bit of a summary, the observed problem, and the proposed lesson that we are hoping to learn as a part of this refinement. It lists the signal that we have and then it shows the direct evidence from the raw events over here, verbatim moments. "Anything over
25K now has to go through a formal procurement review." It would have pulled that out of the transcript. "Our security
team requires a SOC 2 report and a signed DPA before any vendor touches production data. That's a hard gate."
Now, again, direct evidence from that transcript. You get the point here. We
are building that evidence box before we push it further through the system. For
the LinkedIn thing, it's pretty much the exact same process. The evidence here is just a user rejection verbatim about what the user might have said when they rejected this thing. You get the point.
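An evidence card like the one just described could be modeled roughly as follows. This is a sketch in Python, assuming the card carries the fields read out above (source system, signal type, summary, observed problem, proposed lesson, verbatim evidence). The class name, the field names, the markdown layout, and the summary, problem, and lesson wording in the example are my own guesses, not the actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceCard:
    source_system: str
    signal_type: str
    summary: str
    observed_problem: str
    proposed_lesson: str
    verbatim: list[str] = field(default_factory=list)  # direct quotes from the raw event

    def to_markdown(self) -> str:
        """Render the card as plain markdown, with quotes as blockquotes."""
        lines = [
            f"source_system: {self.source_system}",
            f"signal_type: {self.signal_type}",
            f"summary: {self.summary}",
            f"observed_problem: {self.observed_problem}",
            f"proposed_lesson: {self.proposed_lesson}",
            "",
            "Direct evidence:",
        ]
        lines += [f"> {quote}" for quote in self.verbatim]
        return "\n".join(lines) + "\n"


# Example values; only the quote is taken from the call described above.
card = EvidenceCard(
    source_system="fathom",
    signal_type="changed_fact",
    summary="Acme now requires procurement review on larger deals.",
    observed_problem="Existing client context does not mention the review.",
    proposed_lesson="Record the procurement threshold in the Acme context.",
    verbatim=["Anything over 25K now has to go through a formal procurement review."],
)
```

Keeping the verbatim quotes on the card is what lets a reviewer check the proposed lesson against what was actually said.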
Cool. Now, we're at stage three. And
just to make it very clear, you wouldn't be doing all of this stuff manually, stage by stage. That's ridiculous. It
would be doing all of this on autopilot for you to the point where we would ultimately get to our proposals to review, in case you guys were wondering. But for stage three, our goal is to route the evidence to the specific place that it needs to end up in. So we've now gone to our inbox over here and this skill has run and you can see it's edited the
files. So all of the same information is still in there, but it appended some stuff at the bottom. The router
verdict: where this thing is going and why it's going there. It knows the destination is a skill and it knows that the destination path is our LinkedIn content writer, specifically the references and the anti-AI writing
guide, because in this case this is where I rejected it for putting some form of AI slop in there. Its decision here was to propose, meaning there's going to be a human in the loop, and it's smart enough to know that because it's following those three M's that I spoke
about. In this case, a change to this is a megaphone, meaning we are broadcasting something to an audience, so a human definitely needs to review this before we make a change to it. And it did the exact same thing for our Acme Robotics.
It told us over here exactly where it needs to go. In this case, the destination type is context, and the path is the context folder, the client's name, and then Acme.md, which is all of our customer information for them. For this one, the decision is also human in the loop because again, this is going to affect broad customer work where we might have multiple systems using this information
in order to deliver services to them. In
terms of what gets stashed where, you can see we now just have our routed folder, which has the same information from our inbox for Acme procurement, which goes into context. And then our skill over here will be for LinkedIn
that gets updated as well. And at this point, we haven't made any change yet.
So we need to move on to the next stage.
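The proposal stage that comes next boils down to a reviewable diff plus a reason and a confidence score. Here is a minimal sketch in Python using the standard library's `difflib`. The function name, the dictionary keys, the file path, and the 0.9 score are assumptions for illustration; nothing is written to the skill itself, which is the whole point of the human gate.

```python
import difflib


def build_proposal(path: str, current: str, proposed: str,
                   reason: str, confidence: float) -> dict:
    """Package a suggested change as a proposal without applying it."""
    diff_lines = difflib.unified_diff(
        current.splitlines(),
        proposed.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",  # lines are joined with "\n" below
    )
    return {
        "path": path,
        "reason": reason,
        "confidence": confidence,  # would come from the judge; a plain number here
        "diff": "\n".join(diff_lines),
    }


# Example: adding one rule to a hypothetical anti-AI writing guide.
proposal = build_proposal(
    path="references/anti-ai-writing-guide.md",
    current="- Avoid hashtag walls\n",
    proposed="- Avoid hashtag walls\n- Never end a post with 'let that sink in'\n",
    reason="Rejected draft reached for a stock AI phrase.",
    confidence=0.9,
)
```

A reviewer only has to read `proposal["diff"]` and the reason, which is the same experience as reviewing a small pull request.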
And then as a final part of this process, we obviously have our proposal which has our judge in it. It goes
through this information that we have already looked at and it puts it into a proposal for us before we go ahead and make this change as our final gate. It
tells us the type of the change. In the
case of our LinkedIn writer, we are adding a very specific rule. The
reasoning behind it is because the rejected draft shows the writer reaching for the "let that sink in" AI slop. As a
part of it being a judge, it also gives us a confidence score and then we have the diff to see what is actually going to change. So if we were reading this as the human in the loop, we would be able to see the change. And if we were doing this through GitHub, it would obviously use a pull request that you would review and then accept the change, and it would go through just like a code review. And
that's it. I realize it can get quite convoluted, but it is the framework behind this that is so important to understand. There are many different things that you can manipulate as a part of the system and you certainly can do different forms of this. Hermes does
this quite well obviously but I have a problem with Hermes in that it creates skills for nearly everything when that's absolutely not necessary for me. I
prefer a constraint-based approach, meaning I only build a skill when I have a business need or a problem that arises, and I know that the work is going to be repeated, not just a one-off thing where all of a sudden an agent goes and spawns something that will just die inside my skill folder and never get used. I also think that if you use this kind of loop inside your business or you're setting this up for your clients, it's a much better way to do it because it gives them the confidence to know that their systems aren't just being filled with information that they might
not know is entirely accurate. Because
most people out there don't know how these systems work and they're being misled by products like Hermes that will auto update everything for you. And
YouTube is claiming that it is just a magic solution when in reality it's not.
It will still make mistakes. Problems
can still happen. But people aren't thinking about that because they aren't aware that they need to be looking at this stuff. Which is why, as automated as we can make this entire process in the video, we certainly need to have some point of a human gate. So I hope this video was helpful. Leave some comments down below if you have any questions. I
will get back to you. Otherwise, check
out the videos on the screen now.
They'll definitely help you in your journey. Thanks very much for watching.