L8 Principal's Agentic Engineering Workflow
By Kun Chen
Summary
Topics Covered
- AI doesn't know it codes faster than humans
- Popular skills often make agents worse
- Design tools for agents, not humans
- Be a director, not a code reviewer
- The bottleneck shifts from execution to direction
Full Transcript
Hi everyone, welcome to this video and this will be a full walkthrough of my agentic engineering workflow. My name is Kun. I was previously an L8 principal engineer worked at Meta,
workflow. My name is Kun. I was previously an L8 principal engineer worked at Meta, Microsoft and Atlassian on many large scale systems like the Bing search engine, Windows and Facebook games. In the recent couple of years, I have been building frontier coding agents at
Facebook games. In the recent couple of years, I have been building frontier coding agents at Atlassian and helped many engineering teams figure out how to use them effectively. And I
have been building heavily with agents myself and shipping 40 to 50 PRs almost every day, sometimes more. And these are all well- tested and shipped to production, not those Minecraft demos you see
more. And these are all well- tested and shipped to production, not those Minecraft demos you see people vibe code on social media. I have shaped my workflow to be both highly productive and enjoyable. Many people recently asked me what it looks like. To be honest, I did debate a lot with
enjoyable. Many people recently asked me what it looks like. To be honest, I did debate a lot with myself whether I should make this video a paid course because it does have that level of value.
But ultimately, I decided to just share it here with everyone because I want to stay focused on building products as my main business. You can see this is a bit of a long video because I'm going to walk through many fundamental concepts of agentic engineering that not only show you how I do it, but also the why and how things really work under the hood. These are not gimmicks that look cool,
but can't actually be used for real work. These are all real workflows that professionals like myself use to get real work done. By the end of this video, I want you to feel like a captain that can sail a large ship with a crew of agents working for you and do so in a stress-free and satisfying way. Largely speaking, we will be walking through these chapters. We'll start
satisfying way. Largely speaking, we will be walking through these chapters. We'll start
with assembling our ship where I will introduce the core setup. We will then talk through how we recruit and ramp up our crew mates with the right usage of memory and skills. I'll then demonstrate how we work with a single crew mate effectively. Then we'll upgrade to working with multiple crew
mates all at the same time. And lastly, we will recruit a first mate that manages a lot of the overhead for us so we can stay focused on the big picture as a captain. The very first level is to gather our gears and build our ship. Now, as we get into my workflow, something that's going to be
really hard to miss is that I do almost everything in my terminal. I know there are a lot of people who will tell you that the graphical user interface is better. It allows richer interactions and better visuals. But I think by the end of this video, I might just be able to convince you that terminal is not quite that yet. I use the terminal mostly for two very real reasons. One
is to allow my hands to almost never have to leave the keyboard. This is actually a much bigger deal than most people think because when your hands stay on the keyboard, you stay in the flow. But
if you have to move your hand to the mouse every couple of seconds, it breaks the flow and forces your brain to context switch. I know there are some GUI apps that also have great keybinds that allow you to do most things with the keyboard as well, but that's just not the primary interaction paradigm for GUI apps, and it's hard to build the discipline of hands-on keyboard when every once in
a while you still have to use the mouse. Terminal apps, on the other hand, are all designed for the keyboard. So, there's no reason for your hands to move anywhere else. The other very important
keyboard. So, there's no reason for your hands to move anywhere else. The other very important factor that drives me to use the terminal is that I can keep the exact same workflow everywhere, even on my phone. But if you really don't like the terminal, that's okay, too. I designed this video to be more about the fundamental concepts behind agentic engineering rather than the mechanics. So,
most of the things that I talk about should be applicable to GUI based workflows as well. Now,
since we are looking at a terminal here, let me share what it is. I'm using this beautiful, clean, and elegant terminal emulator you're looking at here is called Westerm. Westerm is
a highly performance terminal emulator built by a guy named Wes. It's got 26k GitHub stars and has existed for many years. I like it mostly for two reasons. One is that it's truly crossplatform.
It's pretty much the only terminal emulator I can find that can work on Windows exactly the same way it works on Mac and Linux. Right now, I mostly only work on Mac, but it was a big lifesaver when I was working for Microsoft and was forced to use Windows for work. The other reason is that it's highly customizable. You can write Lua scripts to configure pretty much everything here. Let me
highly customizable. You can write Lua scripts to configure pretty much everything here. Let me
show you my config in my files. It's all in this file called westterm.la. It's a Lua script. So,
it's not just static values. You can actually set conditions and write various kind of logic to make your config very dynamic and flexible. If I change some settings here, let's say I change the color scheme to chalk, you will see that it does a hot reload instantly, which is super handy. But I still like my rose pine moon. So, let's come back to it. I can't use
anything else. Inside of Western, I run something called T-Max. It's short for terminal multiplexer.
anything else. Inside of Western, I run something called T-Max. It's short for terminal multiplexer.
If you haven't come across this yet, it's probably easiest to just show you what this does. So, I'm
typing this command here to start a T-Max session. Now, I'm inside of T-Max. You can see not much uh is different except for that there's a bar at the top showing some information. And I still get a shell where I can type commands. But now I can split my terminal into multiple panes, as many of
them as I like. This is super useful because I can spin up an agent in one pane and spin up an editor in another uh and still have a pane to myself. So I can just run commands. I can also spin up multiple tabs uh and they are also called windows in T-Max. This is very useful for running multiple
agent sessions in parallel. The other cool thing is that uh T-Max sessions are persistent in the T-Max server. So, if I use a keyboard shortcut here to detach from T-Max, you can see I'm back in
T-Max server. So, if I use a keyboard shortcut here to detach from T-Max, you can see I'm back in the normal shell without the T-max status bar at the top. But if I type the same command to launch T-Max again, I get back to the exact same state I was in before. So, I can continue my work here.
What's even more useful is that I can connect to the same session from another device like my laptop or my phone. That's a real gamecher that's very hard to replicate without this terminal centric workflow. If you just install T-Max, by default, it doesn't have the same experience
centric workflow. If you just install T-Max, by default, it doesn't have the same experience we're showing here, uh, like the tab bar and the metadata. You'll probably need to do a bit of configuration and customize it. Let me show you my T-max config. Here it is. Most of these
settings are key binds that I have been using for many years and built into my muscle memory.
Some of these are for styling and various kind of behaviors. There are many YouTube videos that go into more details about T-Max configuration. So, I'm not going to go down the rabbit hole here for now. You just need to know that you are likely want to spend some time configuring your T-Max for
for now. You just need to know that you are likely want to spend some time configuring your T-Max for it to look good and work well for you. This text editor here is Neo Vim. It's basically the modern version of Vim. It's my favorite text editor. If you are not familiar with Vim yet, it's an editor whose main purpose is to keep your hands on the keyboard. So, if you watch my keystrokes here,
I can move the cursor up and down, left and right with keys. I can also scrub up uh or scroll down.
If I have to make edits, I can go into insert mode and start to type anything I like. There are a ton of keyboard shortcuts for doing everything you need. For example, let's say I want to delete the current line. I can just type dd and it's gone. I can undo it by typing u. Oh, and if you look at
current line. I can just type dd and it's gone. I can undo it by typing u. Oh, and if you look at the left hand side, I have relative line numbers. This line number 238 is the current line number.
And the line above shows one, which means it's one line above the current line. And the lines below as well. So let's say I want to jump to the line that says set environment. That's 11 lines above the current line. So I can just type 11K and I'm here. So once you have enough muscle memory,
you can just navigate around much more quickly than using a mouse. I also have a bunch of plugins that help me get around in neoim as well. Uh and I have keybinds for all of them like space s allows me to search or grab for the codebase. So I can just type rows and it will find all the
occurrences of rows in the current uh codebase. I can type uh space f to find files by their names.
Like if I type flake, I'll get to the flake file immediately. Working with Vim has a learning curve for sure. Um but once you get used to it, it just feels really really good. Whenever I'm in Neo Vim,
for sure. Um but once you get used to it, it just feels really really good. Whenever I'm in Neo Vim, I'm just flying like a bird and it's awesome. Okay, I have to stop here before this turns into a Vim tutorial. You can find a lot of great YouTube videos that will help you get started on Vim and become a master. Maybe one last tip from me is just how to exit. Here you go. All right,
our ship is ready to sail, but we have no crew mates yet. We're the captain. We can't
do everything by ourselves. We need to bring in agents as our crew mates. I use four different agent harnesses regularly. There is Claude, which is uh Claude Code, which is basically the only practical choice if you are using the subscription from Anthropic. Generally speaking though, it's a pretty good harness. I think it has the most sensible default experience out of the box.
It's also got a pretty rich feature set. The downside is that sometimes it's a little bit buggy and it's not as customizable as some of the other options. The next one I use a lot is codeex CLI. It's written in Rust and you can feel it's a little bit smoother than cloud code when you use
CLI. It's written in Rust and you can feel it's a little bit smoother than cloud code when you use it. It's also open source. So if you run into some problems, you can often just have Codeex inspect
it. It's also open source. So if you run into some problems, you can often just have Codeex inspect its own source code and figure out a workaround by itself. It's a bit lacking in terms of bells and whistles. And it's also not very customizable. And then there's the PI coding agent and its
whistles. And it's also not very customizable. And then there's the PI coding agent and its whole philosophy is to be minimal and highly extensible. It's great if you don't want any bloat and you'd like to tinker around and kind of make it your own. And lastly, there is open code.
I like it a lot. It's got a buttery smooth TUI and it's got good integration with pretty much every model you can find. It's also got a more complete out ofthe-box feature set than Pi. So,
if you want to use an agent harness that is model agnostic and one that you can just grab from the shelf and just go, Open Code is a pretty good choice. For the rest of this video though, I'm going to use cloud code because I know many people are already familiar with it. But I have been very strict about making my workflow agent agnostic because the landscape is changing very
very fast. Who knows which model or agent will be the best performing one next month,
very fast. Who knows which model or agent will be the best performing one next month, right? So everything I show here in the video is agent agnostic and should be applicable
right? So everything I show here in the video is agent agnostic and should be applicable regardless of which model or harness you use. Now the problem with these crew mates is that they are fresh recruits and they have no idea how we run our ship or how we like to work. We need a proper onboarding process to ramp them up. We'll do this mostly through two ways. Memory files and
skills. There are a few types of memory files. Global memory files and project level memory
skills. There are a few types of memory files. Global memory files and project level memory files. The global memory file for cloud code is at this location. And every other agent uses the
files. The global memory file for cloud code is at this location. And every other agent uses the other standard location here. So what I do is that I use this command which made clots.m MD a symbolic link to agents.md. So they both exist but under the hood they point to the same file. Here's
the content of my actual global memory file. You can see it's pretty minimal. There's only 27 lines because everything in this file gets loaded into the system prompt of every single agent session across all our projects. If we have too much content in this file, it will silently use a
lot of our tokens. I mostly write down my personal preferences here like never use mdash. Somehow AI
models are trained to use mdash by default instead of a plain dash. So now whenever I see mdash, I just feel like it's robotic and I don't like that when I need the agent to write something for me like PR descriptions. Oh, and this is a good one. When making technical decisions, don't give too much weight to development cost. Here is something interesting that you may not
know. Let me show you. If we ask a Frontier model to estimate the development cost of a project,
know. Let me show you. If we ask a Frontier model to estimate the development cost of a project, let's say I want to build a 3D firsterson shooting game that I can play locally with AI enemies. How
long do you think that will take? Let's see what Cloud will say. Okay, here it is. See,
the estimate is in days and weeks and months. But if we ask the agent to actually build it now, I can guarantee it will come back with a playable version in just a few minutes because I have done this so many times. This mismatch is happening because the models were trained from human data and that is what a typical human developer would give as the estimate. AI doesn't seem
to know it can code much faster than humans yet. So when AI is making technical decisions, it's implicitly assuming the development cost for some of the options are much higher than they actually are. This biases the model to choose cheap solutions that are often low quality,
actually are. This biases the model to choose cheap solutions that are often low quality, not scalable or hard to maintain. So I have this rule here to correct that bias.
I also said when doing bug fixes always starts with reproducing the bug in an end- to-end setting as closely aligned with how an end user would experience it as possible. AI models today by default like to write unit tests which are often not sufficient and not really covering
the product behaviors we want to guard. I found that leaning into end toend testing is a lot more reliable. Besides these preferences, I also have some interesting stuff like opinions.mmd which is
reliable. Besides these preferences, I also have some interesting stuff like opinions.mmd which is super useful. That's a slightly different topic though, so I won't go into too much detail here,
super useful. That's a slightly different topic though, so I won't go into too much detail here, but I do have a blog post explaining how that works, which I'll link here in case you're interested. Besides the global memory file, each project can also have a project level memory file.
interested. Besides the global memory file, each project can also have a project level memory file.
Let me show you one example here by going into this project called H Highbit. This is an AI Twitter app I have been working on. The project level memory file is typically stored as claw.md
or agents.md uh depending on which agent you use. I did the same thing here with a symbolic link. So
the same file is shared for both cloud and other agents. This one we're looking at here is a little bit verbose I would say. I'll probably clean this up after this. But let me show you on a high level what I put into this file. It has some context on what this project is, how the repo is laid out,
some terminology, how some of the most important components work, and how to do end to end testing uh and some conventions at the bottom. This file is a lot more verbose than the global memory file because this is basically the collective learning of all the agent sessions in this project. The way
I built this file is not by writing everything by hand, but rather that every time I saw the agent doing something wrong, I would correct it and ask it to remember to not make the same mistake again by storing the learning into this memory file. So over time, our crew mates working on this project get smarter and more experienced. You don't need any fancy memory system to do that. This
markdown file is all it takes. Over time, it does tend to get more and more bloated though. Um, one
way I reduce the size of this file is by moving some conditional information that is not always needed into a skill. For example, the end to end testing instruction here um is only needed if the agent is making changes, right? So, if I just ask the agent a question, this whole section is
totally useless and would be wasting tokens. The way to improve efficiency here is by converting this kind of conditionally useful information from the memory file into skills. I typically
just ask the agent to do this. Uh here let me do it live. I would say uh let's extract the end to end testing instructions uh instructions in our agents.md file into a project level skill. Claude
already knew how um to do this uh what skills mean and how to create them. But other agent harnesses may not understand how to do that uh out of the box. To teach your agent how to create skills, you can install a skill called skill creator which uh which was written by Anthropic. Uh you can do that
by running this command. This npx skills thing uh is a CLI from Versel that is very handy. Um, it's
basically my main tool for installing and managing skills. Uh, it supports pretty much any agent. Um,
once this skill is installed, your agent will be able to follow the rules and create new skills for you moving forward. All right, Cloud has done its work. Let's look at uh what Claude created for us. It basically uh removed a large chunk of the content from our agents.mmd file and moved that
us. It basically uh removed a large chunk of the content from our agents.mmd file and moved that into this uh this skill file. This is a good thing about skills is that it's designed for progressive disclosure. Which means when your agent starts, it only loads this tiny description field from your
disclosure. Which means when your agent starts, it only loads this tiny description field from your skills into the system prompt to know what these skills do. And only when it actually decides that it needs to use a certain skill, it will then read the rest of this file. This allows you to store a lot of the knowledge about how to do various kind of things without blowing up your system prompt
and memory file with a ton of content that uses your tokens for every single request whether the request actually needs those skills or not. One thing I do want you to know about skills is that you should generally avoid installing random skills from the internet even the ones that have a lot of GitHub stars. First of all, these skills can instruct your agent to run pretty much
anything on your machine. This is a very risky thing to do because the agent can leak your API keys or even credentials to your bank account to untrusted third parties without you knowing.
And even if we put aside the security problem, some of the skills actually degrade your agents performance. Look at this repo here called Andre Kaparthi skills which has 177,000 GitHub stars.
performance. Look at this repo here called Andre Kaparthi skills which has 177,000 GitHub stars.
That's like massive. So, it must be really good, right? I actually evaluated the skill in this repo with program bench, which tests the agent's ability to build programs end to end. And the
result shows that by using this skill, the agent will use 5% more tokens while making the results worse. And if you look closely, this skill is not even written by Andre Kapathy himself. I'm not
worse. And if you look closely, this skill is not even written by Andre Kapathy himself. I'm not
here to criticize the author of this repo, though. Uh, and many saying that being popular is not the same as actually being good. A lot of the skills being widely shared today have not been rigorously evaluated and are typically just some random guy who found something that worked for themselves
anecdotally and somehow got it to go viral. Their GitHub stars only tell you how popular they are and not whether they are actually helpful. So, as a general rule of thumb, I recommend that you do not install any skill from the internet that claims to magically make your agent perform better, but hasn't published anything rigorous that proves its claim. All right, now that we
have memory files and skills to help ramp up crew mates, it's finally time to actually start working with a crew mate and set sail. The first thing about working with a crew mate is how you talk to them. I have pretty much completely moved to voice input now. So instead of typing I will just
to them. I have pretty much completely moved to voice input now. So instead of typing I will just say explain this repo in a concise way and give me a recap of what the recent PRs have been working on. This is just so easy. There's an actual paper from Stanford that seriously compared the
on. This is just so easy. There's an actual paper from Stanford that seriously compared the efficiency and basically talking is three times faster than typing. So this is a very big boost in productivity. I also want to show you something interesting here. Um, if we go to the references
in productivity. I also want to show you something interesting here. Um, if we go to the references of this paper, look who's here. It's our guy Dario. What is the CEO of Anthropic doing here?
Apparently, if we follow this link, Dario was doing some speech recognition stuff back in 2016.
Now, we're using speech recognition technology to talk to Claude, which is also created by Dario.
What a small world. The voice input we just did uh was actually transcribed locally using this app called open super whisper. It's completely free and open source which is what I think this type of software should be. It runs the whisper model locally on your machine and do the transcription
and the quality is like really really good. So this is how I do most of my prompts. Now, the only case where I fall back to typing is when I need to give the agent a URL or a file path or something like that. Trust me, you don't want to speak a URL out loud, whether it's by yourself or with
like that. Trust me, you don't want to speak a URL out loud, whether it's by yourself or with other humans around. Now, if we come back to this prompt and let the agent run, you will see that because we asked the agent to look at recent PRs, it will need to call GitHub to fetch the data.
This is an important thing to pay attention to because agents rely on external tools like GitHub to do its tasks. The design of these external tools can greatly affect your agents performance.
Take GitHub as an example. Many people use the GitHub MCP server for accessing GitHub. However,
I ran this benchmark here that measured various kind of ways to access GitHub for the exact same tasks. Using GitHub MCP server will cost you to spend three times more on token cost and
tasks. Using GitHub MCP server will cost you to spend three times more on token cost and more than double the latency compared to using the COI. If you're using the GitHub MCP, you are pretty much wasting both time and money for no clear benefit. Now you can see there's this thing
called Axi which has the lowest cost but highest success rate. So what is it? Let me show you.
AXI is a set of design standards I authored after discovering the huge upside we can have by designing our tools to treat agents as a first class citizen and optimize for agent ergonomics. I
created 10 principles for how to make a tool highly efficient for agents. For example,
using token efficient output format can save about 40% tokens compared to using JSON. And then I built a few axis with them. Besides the GitHub axi I showed earlier, I also built ChromeDev tools axi and benchmarked it against other various browser tools. Here you can see the agent taking
less turns and using less tokens to get the same tasks done with axi. The main point here is when you give tools to your agent, do some research on their efficiency because they can greatly affect how much mileage you get out of your agents. If you want to use the axis I mentioned earlier, you can just go to this site called axi.mmd and find them in this catalog. You can just
go to their repo and find instructions for how to start using them. Speaking of this catalog, there is something called lavish axi here. This is a very important tool in my setup. I pretty
much rely on this tool for planning any kind of complex work. Let's do a real feature live and I'll show you how it works. Let me first launch Hibbit to show you what I'm trying to work on here. Hibbit is an AI tutor I'm building for kids. Um I'll just create a
test profile here. You can see here um at the top I have these two buttons what I can do and my progress. They are showing very similar content right now uh which is a problem.
my progress. They are showing very similar content right now uh which is a problem.
So um and also the UI is not very exciting or fun. So, uh, let me go back and talk to Claude.
I'll still use, uh, voice input here. I'll just say, I'd like to consolidate, um, the what I can do and my progress buttons because their functionality is very similar, and I'd like to revamp the experience there to be something more fun and exciting, like an uh, an achievement system. Come up with some options, and let's discuss. Don't use lavish.
Okay, the reason I said don't use Lavish is that I wanted to show you what the default workflow today looks like for many people. And then I'm going to show you the difference Lavish makes.
Because I already have Lavish skills installed, my cloud will automatically use Lavish for this type of question, which is why I had to tell it not to do that right now. Okay, cloud is doing its work.
Now cloud has come back with a response. Sometimes it will use its plan mode. um or sometimes I will ask you to write down the plan in the markdown file, but it's more or less the the same. It's a
wall of text I now have to read through. It's not very easy to understand what um what each option uh is actually going to look like. And if I'm not happy with some parts of it, I can't very easily tell Claude which part I'm talking about. I can't select a piece of text in the plan and say this is wrong. Now, let's try the exact same prompt with Lavish. Here it goes again. I actually don't
is wrong. Now, let's try the exact same prompt with Lavish. Here it goes again. I actually don't have to say use lavish because the agent already has the lavish skill uh that tells the agent for this type of planning it should use lavish but for demo purpose I just wanted to be explicit. Cloud
would roughly do the same things to figure out the options except at the end it would not print out that wall of text again. It will launch the browser and show me this page. Now look at this.
This is the lavish editor. The reason I named it lavish is that it's richer than a rich editor. I
almost named it filthy rich editor, but um that's just not the best sounding name. Lavish editor
basically instructed the agent to create an HTML artifact to visualize what we need to discuss. It
always uses the same design system as the current project being worked on. So this is consistent with how the hybits app actually looks. This makes it very easy to review concepts and prototypes.
see the option is laid out here. This is so much easier to understand than the huge wall of text we were looking at earlier in the terminal, right? I can also annotate and make comments on specific parts of the artifacts to give feedback to the agent. This is something that's really hard to do
with the wall of text or a markdown file. And at the bottom, there are things for me to decide on.
And I can just click on these options to make the decisions. and I just send this feedback back to the agents inside of Lavish without even having to go back to the terminal. Honestly, I can never go back to reading text in the terminal anymore. This is just like way too much more efficient. Now,
the agent has made updates to the HTML and I feel happy about this. So, I'll just tell the agent to start building uh start building and we'll end the session. We can then go back to the terminal now and see the agent uh will start to work on the implementation. Because we already clarified all the requirements in the planning phase, I typically don't need to interfere at
all during this implementation phase. I only come back to this when the agent has done. And when the agent says it's done, that's actually when things get really tricky. This is where a lot of people will spin up their editor and start reviewing the diff. The problem is AI writes code so fast and if
every piece of code requires your review then you are creating a big bottleneck on yourself because you can only review so many diffs every day. Your velocity will be hardcapped by it. And even more importantly reviewing diff is just not fun. No one says I became an engineer because I love reviewing
diffs all day. My advice here is that in order to really scale ourselves with AI, we have to think of ourselves more as an engineering manager or engineering director. Your directors most likely don't review any PRs. Yet, they can influence the quality of their team's software by creating good
culture and processes and rely on the team to carry them out. That's what we should do with AI.
What I do here when the agent says the work is done is not to start reviewing the diffs or start manually testing the changes. That's too much overhead on myself. I send the change into a pipeline I built called no mistakes. No mistakes is also free and open source. It orchestrates your
agent to execute a series of steps that takes this first pass code all the way through to a clean PR.
It would first create a branch if one doesn't exist yet and then create a commit and then take it through a pipeline in an isolated work tree. So nothing during the validation would affect your current repo. It would first understand your real intent behind the change by analyzing your agent session. Then rebase the change on top of the latest main branch on remote origin and
agent session. Then rebase the change on top of the latest main branch on remote origin and resolve merge conflicts up front. Then starts an adversarial review in its own fresh context window. This is where most problems get caught and obvious problems will get self-corrected but
window. This is where most problems get caught and obvious problems will get self-corrected but ambiguous ones that have product implications will be escalated to us humans for a decision. After
review, it also tries to test the change end to end against the original intent and this step will actually record evidence that proves the change is working that we can then later on look at to gain more confidence. It will then do a documentation pass of updating all relevant documentation to
more confidence. It will then do a documentation pass of updating all relevant documentation to reflect the latest change and also finally make sure there's no linking problems before pushing the branch to remote and raise a PR. The no mistakes pipeline will also keep babysitting the PR until it's merged. Because during the PR phase, we can still have merge conflicts that come
in or CI pipeline failures that are very annoying as well. With no mistakes during the babysitting, we don't have to waste our own time at all. Another way to trigger no mistakes is as a skill. I can just type no mistakes in the agent and it will do the same pipeline as well. This
skill. I can just type no mistakes in the agent and it will do the same pipeline as well. This
may seem very slow, but in practice, I never stare at this screen. I would go spin up other tasks. I
come back only when no mistakes says all checks passed. And that's when I go to the PR and apply my judgment. Here's the PR from the change we just did. We can see here it summarized the original
my judgment. Here's the PR from the change we just did. We can see here it summarized the original intent, what changed, how it's tested, and what happened during the no mistakes pipeline. We
can click to see the evidence from its testing to know whether it's really done what we asked for.
Depending on what the change is, the evidence could be a screenshot like this, a video demo, a log file, or something else. It's designed to give you the most direct way to see the change working as you intended. We can also see that the pipeline discovers some problems and fix them before raising the PR. This is a good time to audit whether these changes are actually what
we need. If anything doesn't look right, we can go back to the agent and ask for more changes
we need. If anything doesn't look right, we can go back to the agent and ask for more changes before merging this PR. This risk assessment here is also very useful. I basically look at this to decide how much time I should spend on reviewing this change in more detail. For low-risk changes, I don't really look at the diff at all because I have validated time and time again for low-risk
changes. Any problem I could catch is very likely already caught by the pipeline. Only more risky
changes. Any problem I could catch is very likely already caught by the pipeline. Only more risky changes are worth my time. This is how I scale up the volume of code changes I do every day through a large crew of agents without losing control on quality. One thing we are starting to see now is that the place where I spend time is towards the beginning and the end of the task.
At the beginning, I would spend time in lavish to plan the requirements more clearly. At the end, I would come in and hold a bar on quality. All these parts in the middle is done by AI which frees me up to spin up other tasks. This is a core aspect of how I work. And you can see the
more time I can free up in the middle, the more work I can go do in parallel. So an interesting question now is how do we get the agent to work for longer and longer in the middle? That depends
on us giving them more and more complex tasks that take longer to complete. But more complex tasks are often not as easy for our agents to complete autonomously. An extreme version of this is when I go to bed, I sleep for seven to eight hours every night. How do I keep the agents busy
for eight hours? This is where I say good night. Have fun. It's another free and open source tool I built specifically for longunning tasks. It's becoming quite popular. It's that simple to use.
Just give it an objective and it will keep going until it meets some stop condition you defined.
Let me show you a real example that I often do. This is again in the high bit repo. I'll run good night. Have fun and give a prompt. Pretend you are a seven-year-old kid and use the high bit app end
night. Have fun and give a prompt. Pretend you are a seven-year-old kid and use the high bit app end to end. Don't mind the profile creation step which is designed for parents. In the rest of the app,
to end. Don't mind the profile creation step which is designed for parents. In the rest of the app, try to do different things and find the first usability problem that would confuse you as a kid or stop you from knowing how to proceed. If you find a problem, stop and fix it. Then rinse and
repeat. Here he goes. Good night. have fun is now running in a loop to execute on what I just asked
repeat. Here he goes. Good night. have fun is now running in a loop to execute on what I just asked for. I can monitor token usage here or how many iterations have been done. The iterations will
for. I can monitor token usage here or how many iterations have been done. The iterations will be showing up as the moons in this row. And I can see how many commits have been made as well. Or I
can just go to bed knowing the agents won't stop until there is no more problem to be found. When
I wake up, I can review a list of commits made on this new branch and decide which ones I want.
I typically use good night have fun for improving on some verifiable objectives or objectives where I trust the agent to have the reasonable judgment over like the one we just did.
Verifiable objectives are more like um reducing page load time, improving end to end test coverage or like Android caps auto research um keep experimenting different hypothesis to improve on a metric. These are all well suited for a longunning loop to tackle. The recently introduced /go command in codeex and cloud code can also do something similar. But good night have fun still
gives me a better experience because I can set a token cap or iteration cap or stop condition more precisely. Whereas in cloud code and codeex if I set a goal before I go to bed I might wake
more precisely. Whereas in cloud code and codeex if I set a goal before I go to bed I might wake up realizing my weekly quota is all gone. Good night, have fun solved a very important problem, which is to keep the agents running for a long time. So, when the agents are running, I'm freed up to do more things. This is when we level up and start working with multiple crew
mates in parallel. So, let's spin up another tab in T-Max and get more work started. Now,
here's the problem. In this directory, I already have good night have fun running. So if I spin up another agent working in the same directory, they will step on each other's toes and cause conflicts. The default solution here is git work tree. For those of you who aren't familiar with
conflicts. The default solution here is git work tree. For those of you who aren't familiar with it, uh a git work tree is basically creating a clone of your repo directory. I can create one by typing gork tree add and give a path here. Now we have to think about a name.
This is when you waste five minutes and eventually give up and just say high bit two. Now we have a work tree and we can navigate to it. Uh so let's go find it. It's in high bit 2. This is a separate directory on the file system. So we can have an agent doing anything here and it won't conflict with goodn night have fun which is running in the original repo directory. The problem with
work trees is that we now have something to maintain in our head. I need to remember, oh, I have hybrid 2 here. Next time I come into this hybrid 2 directory, I would wonder what was I doing in this work tree last time. Is there still an agent running or is it free? All of that has to exist in my head and there's no way I'm going to remember all that. So, this work tree
basically becomes a debt. To get rid of it, I need to run this remove command. Remove.
This is just a lot of overhead. My solution to that is another tool I built called treehouse.
It's very simple. I just come into this repo and I run treehouse. It would drop me into a fresh work tree where I can start doing whatever I want. I can keep spinning up more and more of these work trees by running treehouse again. And if I want, I can see a list of all the work trees by
typing treehouse status. So I can see which ones are being used versus not. When I'm done, I can just close this tab and treehouse knows that I'm done. So it will free up that work tree for future use. Next time I ask for a work tree, it will try to reuse one of the idle work trees instead
use. Next time I ask for a work tree, it will try to reuse one of the idle work trees instead of creating a brand new one. So let's start some real work. I have a bunch of user feedback from my son's last round of play testing. So let me uh use this first work tree we created and launch
claude and I'll say I remember it's hard for the kids to realize they can press and hold the voice input button to talk by default they just click it and then they will see a popover. Maybe in the popover we add a label that tells them they can also press and hold and I'll enter. Then I'll spin
up a new tab, treehouse claw. And this time I'll say the image attachment dropdown menu should have an action that takes a screenshot of the current app and use that as the attachment. All right,
one more tab. Treehouse claw. Our agent status bar right above the chat input is not always showing bot activity. Look into what happened there and make sure when any bots are in progress,
bot activity. Look into what happened there and make sure when any bots are in progress, it always displays something that reflects the latest activity. Boom. We now have three sessions running in parallel. Now I can keep going because none of these sessions will need my attention anytime soon. Especially if I tell them to run no mistakes after implementation. I know whether they
anytime soon. Especially if I tell them to run no mistakes after implementation. I know whether they need me by looking at the top status bar and I can switch between the tabs using keyboard shortcuts like this. That's very important for managing a lot of parallel sessions efficiently. That said,
like this. That's very important for managing a lot of parallel sessions efficiently. That said,
after doing this for a while, you will discover that juggling between all these sessions, it's quite exhausting. The constant context switch and having to remind yourself what each session was even doing just doesn't feel like an ideal endgame experience. So, I kept pushing the boundary on this. And I discovered that I needed a first mate. Someone I can talk to as a captain
that will carry out all my directions and manage all the crew mates for me so I can focus on the big picture stuff like where should we go next? Not playing whack-a-ole with this increasingly high number of crew mates. This is how I level up and truly become a captain. My first mate is another free and open- source project and it's very new. The way to use it is by just cloning it.
And then I can run an agent in this repository. Now I just talk to it and ask it to work on any projects I like. Let's say I'd like to work on Lavish AXI, GitHub AXI and Chrome DevTools AXI. They are all GitHub projects I own. First M is starting up and the first time we run it,
AXI. They are all GitHub projects I own. First M is starting up and the first time we run it, it will do some setup and ask for some preferences, but it's all through just talking to it, which is pretty easy. And you might wonder why uh is the transcription so good because it's recognizing these project names. Let me show you. Open super whisperer actually
uh supports this customization through uh system prompt. So what we can do here is uh in this model menu uh in the transcription menu uh there is an initial prompt and we can put in some common vocabulary that we use into this system prompt and this prompt is what makes the transcription really good. First made here is asking how strict I want to be with
uh the code changes in this repos and uh I want to select full gate to PR. Uh this is basically going to be using no mistakes as the pipeline to validate each change. Um, and first task, uh, yeah, I'll describe it. All right. Now, a real thing I want to do is for all three projects,
I'd like to add an update command on the COI that will update their version to the latest on npm.
And let's see what first mate does. It realizes that this is not one task, but three parallel tasks. And it's now spinning up these tabs in T-Max just like we would behind the scenes.
tasks. And it's now spinning up these tabs in T-Max just like we would behind the scenes.
It would also call treehouse to create work trees and then run an agent in that work tree to get the work done and then it will run no mistakes to validate the change and get the PRs ready for us to review. Now you can see it's first mate that is doing the juggling. I don't need to worry about any of this now. I can just keep giving it more work. Hey first mate, let's also look
at the most recent three open issues in lavish axi repo and let's discuss which ones are actionable.
Boom. First mate now is pulling the open issues from the repo while waiting for the three background agents working in parallel. All right. First mate said number 87 is cleanest.
Uh it's very actionable. And uh let me just see what is this. Uh don't toggle in annotation mode.
Uh okay, that's a clear bug. All right, first mate. Let's uh address number 87.
Look now, First Mate is juggling a lot of tasks for me that I otherwise would have to manage by myself. Watching it context switch is actually an oddly satisfying experience because I know
myself. Watching it context switch is actually an oddly satisfying experience because I know that's what I would have to do otherwise. First mate is basically all my tools coming together as one cohesive workflow and I have been really happy with it. It's been a pretty significant improvement to my overall experience working with agents. I highly recommend trying it out if you
are still directly talking to every single agent session one by one. It will be a pretty massive upgrade. Something you'll start to notice after having a first mate is that because first mate
upgrade. Something you'll start to notice after having a first mate is that because first mate took care of so many things for you, you start to run out of ideas for what to ask it to do.
This is a good thing because it indicates the bottleneck is shifting. But it also means you as the captain needs to keep up. This requires a mindset shift of focusing more of your energy on understanding what matters by talking to your users, understanding the competitive landscape,
and crafting a good treasure map that can lead your crew to a good direction. Once you started doing that, congratulations. You have successfully transitioned from a sailor into a great captain.
All right, we have gone from not having a ship to being a captain that has a first mate and a big crew that sail together. This is a pretty good time to wrap up this video. All my tools can be found on my GitHub and will be linked in the description below. They are all free and open source. I built them because I just want to see more people learning how to do aic engineering
source. I built them because I just want to see more people learning how to do aic engineering effectively and doing it in an enjoyable way. And that's what I hope you can get out of this video.
I will continue to share more of my workflow and things I find useful on my channel. So,
don't forget to subscribe if you don't want to miss anything.
Loading video analysis...