No vibes allowed (AKA live coding with Claude and Code Layer): 🦄 #27
By Boundary
Summary
Topics Covered
- AI Coding Agents Solve Hard Problems
- Write Docs First for AI Coding
- Research-Then-Plan Compacts Context
- Read AI Code or Get Screwed
- Small Phases Enable Incremental Wins
Full Transcript
...is that people can't actually join the stream.
>> Um, so if you have the... >> I'm going to set up the link. >> Or just... yeah, just shoot me the public link so I can drop it in chat. Um, hey
everybody, we're live. We're going to do some very fun live coding today. Uh, and this is an episode I've been wanting to do for a very, very long time. So, I'm very stoked to be here and uh I can't wait to write some code and get yelled at by Vaibhav.
>> All right, event is updated.
>> Um I sent them to you as well. We've got
some people in the chat and we should get a few more coming too. What kind of hard features are we building today? We
will get to this in just a second.
But this is, um, like Dexter said, I think a lot of people underestimate how good vibe coding can be. Uh, and a lot of people view it as like it's great for throwaway code, it's great for stuff that we don't really leverage for a lot of situations, but I've been convinced otherwise. I've been convinced that you can use coding agents for really, really hard problems in a maintainable way. So we're going to try that today live on stream, for a problem that has been aching us for a while in BAML. And obviously I know the BAML codebase really well, but Dexter has no idea what the BAML codebase is.
>> Well, we did ship one feature a couple months ago, but you were driving almost all of that. So
>> yeah. Uh, but even then, like, the fact is that Dexter really doesn't know the file system. He doesn't really know how we organize code in there. He doesn't...
>> I don't know how to run the tests.
>> You don't know how to run the test.
There's a lot. Do you know how to run the tests?
>> I mean, Claude knows how to run the tests.
>> Yeah. But the point here is like Dexter really has almost no information about what the real system does over here. So
I think a big part of today is going to be to see if we can go from nothing to something from scratch. Um, but for everyone that's tuning in, this is a thing that we do every Tuesday. We call it AI that works, and the whole point is to show shipping workflows that actually leverage AI in some interesting way, whether it's to build pipelines, talk about the method of building the pipelines, or the pipelines themselves, but show real mechanisms for actually leveraging AI. My name is Vaibhav. I work on BAML.
>> My name is Dex. I work on uh code... Oh, sorry. Do you want to finish the one-liner?
>> No, that's it.
>> BAML is a great programming language for working with uh LLMs and AI agents. Uh, and it's getting cooler and cooler and weirder and weirder as the days go by. I'm very excited to see what y'all are solving soon. Uh, I'm Dex. I work on HumanLayer.
>> And this is the part where you tell people how, why...
>> I mean, I think actually people are going to get to see it. So rather than talking about it, I think it's going to be a really interesting way to just mechanize how you can work with the coding agent in an interesting way. If you find this stuff interesting, this is the URL that you can usually sign up for to see our Tuesday events. And if you sign up there, you can subscribe and we'll send you event invites out every single time.
But with that, let's get started.
>> Um, cool. Uh, I'm going to pull up... I'm going to share my screen.
>> Why don't we start with just showing the problem up front so then people can get a spec for what we expect to get done today.
>> Yeah. So right before this, literally right before this call, I messaged Dexter a couple of tickets that we want to get solved. This is a specific ticket in BAML. One of the things that you can do is configure any LLM you want, with retry policies and all sorts of mechanisms. Now, one thing that a lot of people have been asking for is timeouts. Uh, so we want to expose that capability so that users can configure timeouts for themselves.
As a part of this work, we started thinking about it. So there's a GitHub issue that we started with. Then we
spent some other time thinking about the actual syntax and stuff along the way.
But once we realized what the syntax and everything was for timeouts, I kind of just shared all the information with Dex. And Dex, if you want to screen share and just show people what I showed you. Let's start with the GitHub issue and then from there we can work through to the docs that we have.
And for context for everyone watching, like, how long have you been working on timeouts? Well, we thought of this GitHub issue... When was it proposed? On March 18th.
That was the initial proposition for it where the idea is that you can say that there's a connection timeout, there's a response timeout, there's a total timeout and then with fallbacks where like you're actually falling back to a
different model if the first one fails for whatever reason. Then you want to have a total timeout for the entire fallback or retry policy kind of attached to itself.
And then similarly, if you scroll down further, uh, we realized some problems. Like, an actual problem is you want some sort of nesting, so it's not really verbose to write out. But as we did this, I actually assigned this to one of the engineers on our team to go take a stab at. And the engineer did take a stab at this.
>> Is your engineer code arena bot?
>> No, the engineer is not that.
That's like a coding agent that we tried to go solve this problem. The coding
agent failed.
>> But do you want to show the markdown stuff that you had?
>> Yeah. Yeah. Yeah. So um just for a little more context here and just kind of make sure I understand this properly.
Um, I'm going to go back to the AI that works repo and just pull this up in VS Code. I know I said I don't really use
Code. I know I said I don't really use VS Code that much anymore, but um, in this today we're going to look at it.
So, because we have lots of good BAML code examples in here. So, if you look in the BAML source, um, you will have an autogenerated clients file. And so, this is BAML code that talks about different ways that your prompts can talk to LLMs. And so, we have this thing called extract date. And you can set a client here, or you can say... what is it? Custom sonnet,
>> whatever you want to name it. Yeah.
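For anyone not familiar with BAML, a function picks which client it talks to with a single client line. Here is a rough, hypothetical sketch of what an extract-date style function looks like; the function name, signature, and prompt are illustrative, not the exact file on screen:

```baml
// A BAML function that prompts an LLM; the client line below
// points at whichever client definition you want it to use.
function ExtractDate(text: string) -> string {
  client CustomSonnet
  prompt #"
    Extract the date mentioned in this text: {{ text }}
  "#
}
```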
>> Yeah. So that is mapped to something in here called cust... I guess this got cleaned up a little bit. Oh yeah, there's custom sonnet. So you can have very basic options, like just a model and an API key, and then you can create a higher-level abstraction over it that lets you do fallbacks and round robins and retries and things like this.
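As a rough sketch of what those two layers look like in BAML (the client names and model strings here are illustrative, not the exact contents of the repo on stream):

```baml
// A direct client: provider, model, API key.
client<llm> CustomSonnet {
  provider anthropic
  options {
    model "claude-3-5-sonnet-latest"
    api_key env.ANTHROPIC_API_KEY
  }
}

// A composite client: a higher-level abstraction that falls back
// to another client when the first one fails.
client<llm> ResilientClient {
  provider fallback
  options {
    strategy [CustomSonnet, GPT4oMini]
  }
}
```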
Um, but what's missing in here is timeouts. So, if I want to say, "Hey, we want to try Gemini 2.5, but if it takes too long, we're going to fall back to GPT-4o Mini, because it's better to give the user something than nothing." Um, it sounds like you wanted to be able to add that. And then Vaibhav did a thing that I think is really, really interesting, which reminds me of this technique that Amazon talks about a lot called working backwards, which is basically: write the documentation or write the blog post first, before you go write any code, especially for dev tools. So this is like a preview of a doc of how we actually want this to work in reality.
>> I did not do this. Greg did this. I want
to call that out. But our team generally does abide by this philosophy. I think
for AI agent coding, it's actually really good.
>> And then you had sent me one other thing with like the clarified format, right?
>> Slack.
>> Okay.
>> Yeah.
>> Um, I'm going to go through this really fast just so people...
>> Yeah, you may want to share your whole screen.
>> Um I'm just going to grab that client because there are other sensitive things in that Slack thread.
Uh let's see. Okay, cool. We'll go back to sharing the full screen. Um,
so one thing that we have...
>> Let's look at the docs really fast just so people get caught up on exactly what the objective is that we're going to get to. So if you go read these docs... can you zoom in a little bit, Dexter?
>> Yeah.
>> If you go read these docs, this is what we're going to want. We're going to want to add a couple new keys in here: connection timeout MS, which is the time to establish a connection; time to first token, because sometimes the stream might not come back yet. Uh, keep on going. Idle timeout, where the stream has just stalled out at some point and you want to restart in that scenario. Request timeout, which is just the total request as a whole, end to end.
Uh, and then for composite clients like fallbacks and retry policies and other things, you want to be able to put a total timeout. Um, the only thing we didn't like about this is that now we're adding arbitrary key value pairs inside of here that can be really noisy and hard to detect, and we might have conflicts. So I sent Dex one more update to this, and this documentation file should be updated accordingly.
>> Can you show the HTTP thing that I sent?
>> Yeah, let me just, um...
>> Yeah, which is: instead of doing it some other way, all the timeout options should be under an HTTP key, so we don't occupy too much of the key namespace inside of options.
>> Yeah. You don't want thousands of parameters at the top level here.
>> Exactly.
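As a sketch of where the spec landed, the proposal nests every timeout knob under a single http key on the client. The key names below just mirror the ones discussed here; they are illustrative and may differ from whatever syntax ultimately ships:

```baml
client<llm> CustomSonnet {
  provider anthropic
  options {
    model "claude-3-5-sonnet-latest"
    api_key env.ANTHROPIC_API_KEY
    // All timeout options live under one key instead of
    // crowding the top-level options namespace.
    http {
      connection_timeout_ms 5000     // time to establish the connection
      time_to_first_token_ms 15000   // only meaningful when streaming
      idle_timeout_ms 10000          // stream stalled, no new chunks arriving
      request_timeout_ms 60000       // end-to-end cap on the whole request
    }
  }
}
```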
>> Yep. Cool. So that's the only constraint. Um...
>> I think this is pretty well specced out. I feel pretty confident that this is something that we want. Take it away, Dex. What should we do?
>> Cool. So the first thing I want to do is get this doc as markdown, because documentation actually serves as a great specification.
We're saying exactly how the user experience should look.
>> And so here's our spec. Yeah. Go ahead.
>> I was going to say, uh, I'll let you do your thing really fast. So the first thing I'm going to do is just have Claude update the spec to match the new format with the nested key.
>> And while we do this, one of the things that we're going to talk about is this concept of research, plan, implement.
>> Yes.
>> And while we go implement this... we'll get to everyone kind of understanding what this means in a second. But the idea is that one of the first things that we're going to do, before we even go do this, is make sure our spec is actually correct, that it's properly specified.
And once we've done that, we're going to try and do a little bit of work to go and look through the codebase to be like, did we miss something? Cuz as clarified as the spec is, we really want to make sure it's actually in line with what the codebase would expect to some degree, and figure out all the parts of the codebase that are relevant: not the parts that need to be changed, just the ones that are relevant.
>> Yeah. And this is all about context engineering. And we'll give you the high level of this, but if you're watching this and you haven't seen it, you should go back and watch the episode we did on using context engineering with coding agents. Um, but the basic idea is: the less context you use, the better results you get. And so we're building our workflow around what I call frequent intentional compaction. Phase one is: take the spec and create a research document that documents how the codebase works today and everything about it that is relevant to solving that problem. And then we use that to build a plan. And all that means is, instead of having to read all the code and then build a plan, the model already has a baked understanding of how the codebase works. So when it gets to planning, it's actually writing the plan in the sweet spot part of the context window, because we always...
>> You want to let it rip while we talk, because it takes time. And one of the things I want to call out here is we're literally live coding all this on the fly. We have no idea if this is going to work. It may work, it may not work. While you folks have questions about this, ask questions along the way. Uh, and we'll try and make sure that it's totally up to date with everyone and their understanding of it. HTTP key. Um,
what is this issue number?
>> I have no idea.
>> 1630.
>> Yeah.
>> And I'm just going to commit these so we can see the diff. And we're just going to let Sonnet do this one cuz it's pretty quick. But it's going through; you can see it's going through the spec and actually making the changes.
>> Nice.
>> And I'm going to go back to dark mode cuz >> Thank you.
>> Yeah, it's just putting all this stuff.
Does this look right to you, Vaibhav?
>> Um, this does look right. Uh, BKT is asking if this is a demo of CodeLayer. CodeLayer, I personally think, is a great tool to do this, but really it's a demo of the process. You could do this directly in Claude Code if you wanted as well. Um, but similar to how I like to use BAML for building a lot of agent stuff because it makes certain things easier, I personally found CodeLayer to be a nice wrapper around Claude Code, because it's, for me, slightly prettier and therefore I navigate the UI a little bit faster.
>> Wow. Okay. This is actually just like a million parallel tool calls all coming through Sonnet. So, let me put it on auto approve, and I think we might be done now.
>> Yeah, it looks pretty good. Um,
let's just have a quick look at this.
Just skim it. This all looks right.
>> By the way, this is one thing I've actually learned deeply while working with Dexter on how to do vibe coding. Cuz to be completely honest, I didn't do a lot of vibe coding. And the reason I don't do vibe coding is because I'm a pretty damn good engineer and it slows me down in the beginning. But the reason that I was actually getting worse vibe-coded results is because I wasn't actually, uh, what was it called? I wasn't actually reading every line of the code. So like
the thing Dexter did here where he actually went through and opened the diff to go look at it in a real format.
That is important. If you don't read the code, you are going to be screwed. So
read.
>> You have to read this stuff. This is not magic. You have to read what it does. The idea is not that if you do research, it's the magic prompt that just makes everything better. This is all about giving you more leverage; we do all these slides in other episodes, but it's all about giving you more leverage, right? You still have to do the work, but you're doing it on a higher-leverage thing.
>> Yeah. So, let's go back. Uh let's see. I
think can you go back to reading and just make sure it all looks correct.
>> Yep.
>> And for context, by the way, here I have more context on the [ __ ] codebase. So, I
will read this stuff and be like, is this good or is this bad? Can you go up?
I want to read that complicated one, the one for fallback clients. So fallback clients are interesting. I don't want them to have connect timeout, idle timeout, or any of this stuff.
>> Okay. I only want them to have uh total timeout.
>> Interesting. Um
>> we should probably tell the agent that in the dock in the >> Yep.
>> Yeah.
>> Or the fallback one. What was it?
>> Fallback, round-robin.
Uh yeah, fallback, round-robin, retries. Um,
>> just the timeouts.
>> Just total timeout. Total underscore
timeout
>> because I think the docs basically had it as this one, right? Cuz this is actually...
>> The doc was wrong. Oh,
>> okay.
>> Yeah. Yep. Let it rip.
>> The low-level... We'll skip the need for the low-level timeouts to be passed through. Direct clients are the FT, etc.
>> Yeah. Cool.
>> Perfect. Um, only caveat, Dex, is I think you should be using your voice stuff like you normally do.
>> Yeah, I will be using voice plenty today. When it's names of variables, I find sometimes the voice one doesn't have a lot of context. And if it's a codebase I work a lot in, I'll go into Super Whisper and update the vocabulary, basically. So it knows how to do Claude Code. It knows the names of certain people that I talk about a lot, um, or even occasionally. But like I had to put your name in there because it keeps putting you in as, uh, Vibo.
[Laughter] >> That's funny.
>> Yeah.
>> Um...
>> Yeah. And I think one of the other things that I have learned is, if you guys are really vibe coding and you get the hang of this, once you've done this a few more times where you actually can go read something, I highly, highly recommend starting to do tasks in parallel. We won't do that today, because doing two of them in parallel while having to talk to the stream is too hard. It is really hard.
>> And I'll tell you why I think that works: it's because of the points where you check in with the agent. It's not just that you have five Claude sessions going and writing code all day and you have to figure out which one is which and all this. It's that because everything you review is all kind of shaped the same, and you know what to expect in between your prompt sessions and letting it go off, it becomes easier to mentally model what's happening, and you don't have to do as much context switching even when you're switching contexts. Does that sound right, Vaibhav? Does that match your experience?
>> Uh, I think so. Honestly, it's just that it seems to work and I have no real context about why. Um, but it mostly works and it's like it feels good. I
guess.
>> Vibes are a big part of this. The best engineers I know, like, they don't use evals. They just know what works better cuz they spend 70 hours a week talking to Claude.
>> Okay, so what have we done?
>> Okay, so we have our low-level stuff here. Yeah, so here's the note that got added. Um: total timeout, if present, is an upper bound regardless of the fallback chain; if exhausted, no further clients are attempted. Low-level timeouts should be defined on individual clients, not on the fallback client itself.
>> Yeah.
>> Um, >> that looks great.
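Translated into client definitions, the rule it just wrote down would look roughly like this; the names and values are illustrative, not part of the repo:

```baml
// Composite client: only the overall cap is allowed here.
client<llm> FastOrFallback {
  provider fallback
  options {
    strategy [Gemini25Client, GPT4oMiniClient]
    http {
      total_timeout_ms 30000  // upper bound across the whole fallback chain
    }
  }
}

// Connection, idle, time-to-first-token, and request timeouts
// belong on the individual clients inside the strategy, not here.
```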
>> Cool. Okay. Uh, I'm going to kick off the research. Um, we'll start a new context window for this one, since we've already used about 30%. That's the other idea: the goal here is you always want to have context to spare. And I'm actually going to grab the new versions of the prompts here. So, uh, you don't have to include these in the PR, Vaibhav, but um, I'm going to drop them in here. I'm
>> Yeah, go for it. Is it
>> here?
>> Watching someone type feels so archaic now.
It's like, oh, >> you need to make uh someone actually contribute a CLI that does this for you.
>> Really?
>> Yeah.
Oh,
>> I don't have a folder called...
>> agents. You spelled agents wrong.
>> Yeah, I know. Something uh in my shell does...
>> I don't have a folder called agents inside of BAML.
>> Yeah. Yeah. Okay. So, we'll make that.
Something in my shell drops the G key when I paste.
>> Really?
>> Yeah. Oh, you're going to love this. If you're not using the sub-agents, this is going to change your world.
>> Okay.
>> All right. I think we're >> I'm down. Just merge it. If this works, you'll get the whole thing merged in.
>> Amazing. There we go. We're installing some agents into your repo right now. Um, okay. Research.
What is the number? 1630.
There we go. Now we got the good one.
Um, let's read the spec in @spec. I'm actually going to rename this, and: research all parts of the codebase that are relevant to implementing this feature. Um, it doesn't exist today, but I just want to know everywhere timeouts are handled. Um, make sure you don't get into the details of how to make an implementation plan. Just tell
me all parts of the codebase, how the testing works, how the integration tests work, any codegen that's used in the repo to make this feature work, and uh, explain it all for me so that uh, the
next agent can pick it up and get to work. Um, Josh asked a really good question: why are we making a new context window? Is that to optimize the cache?
>> It's actually not about that. It's more about, like, fundamentally... I mean, we talked about this last week when we talked about the Anthropic models and where they degraded, because they used the million-token context model instead of the shorter context model.
These systems generally work better when you give them the least amount of context you need; it actually makes them slightly better. And part of why we often recommend controlling the full context window is that a really, really important part of this whole workflow is actually getting the system to work well. So once we got the spec right, all the work that we did ahead of time in the context window to make the spec correct can be deleted, and we can just reseed the whole context window with just the right spec. So the model doesn't have to be confused about the old spec, what we changed, and then the new spec. It just sees purely: here's the spec that I'm implementing, and it doesn't need to remember the history of what led us there.
>> Yeah. Because basically, the less of your context you use, the better. And so as soon as, like, all of the work of writing and building the spec up is done... so that we were actually upstream of this over here. Um, I'm just going to drop this in to the white.
>> Uh if you guys want the commands I'll post them for you guys.
>> Yeah.
>> So we actually were over here writing the spec. So the spec is part of our initial user message, which is like: hey, go implement that thing.
>> Yeah. If any of you want the commands, uh, they're posted in the chat, and you're welcome to go check them out in Claude Code yourself.
>> Yep. They're there. We've linked them and documented them before.
>> Yeah.
>> I actually... and if you all saw, I actually canceled the old one because it was on Sonnet, and for the research, uh, I really tend to think that if you're not using Opus, you're not going to get good results. Sonnet is fast and it writes damn good code, but when you want to reason over a large, complex codebase, you should almost always be using Opus. It's worth the money.
>> Uh, I would say it even differently: it's more expensive for you to have to stop and then start again if you get the wrong result. So it's better to pay the more expensive tax to get it right the first time around, because a large part of vibe coding is actually feeling good about the process, and if you just feel bad about the process, you're not going to see it through. So I highly, highly recommend just using the better model, at least in the beginning, until you build confidence about what is working or not. Also, Dexter, we're seeing your forehead, just so you know.
>> Can you see my face now? Sorry. It's uh, I have a new AV setup today and the laptop blocks just the very bottom of my screen. So, when I need to see something at the bottom, I move it.
>> Yamada asked a really good question: is it possible the spec and code disconnect? What if the code changes? So, the whole point of this workflow, Yamada, is actually not to make the spec and code directly connected. What we're really doing is... imagine when you're working with a colleague and you're pair programming. It doesn't need to stay connected forever. You just need to be correct about whatever you're implementing right now, so your colleague can go write something. And that is really what we're doing here.
And this is also why I'm kind of meh on what I might call codebase documentation, using agents to update the documentation for other developers to use, because I think one of the problems with AI-generated code is that if everyone on your team is shipping a thousand lines of code every couple of days because they're all doing these techniques, the research basically gives you on-demand, up-to-date codebase documentation in, you know, 10 minutes or so.
>> So let's see... and this will be really clear as soon as we start reading the research. Funny enough, you guys will understand a little bit of the BAML codebase as the research is done, because then you'll be like, oh cool, I see where this is relevant, and I'll show you some interesting things that I suspect the research pipeline will pick up.
>> If it's done.
>> Yep. So these are all running. So you can see it's invoking these special sub-agents that are defined in... they're here, public in the HumanLayer repo, which is where I got...
>> I posted the link.
>> Yeah. So I'm just going to pull one of them up. Um, so we have the research prompt itself. Um, which is, yeah, this one.
So with the research, again, its only job is to document and explain. So it's like: analyze the thing, launch these parallel sub-agents, and then we tell it the document format we want. So we want some front matter at the top, we want the user's question and topic, and then we want detailed findings on how all of the relevant components work, but really concise. What makes this so good is that the next model that's picking up in the next fresh context window doesn't have to search and find and learn how all this works. It can just go straight to reading files to figure out what changes need to be made.
Exactly.
>> And then the agents... sorry, let's look at the agents real quick. Um, so the codebase locator, its job is to document the codebase, right? And so all its job is, is to just say: here's this file, here's what it does; here's this file, here's what it does. So it's just finding things. This is like your super grep. And then the analyzer's job gets instructions like: find the entry point, whether it's an API endpoint or a CLI or something, trace the data flow from wherever X happens all the way down to the database or all the way down to the external service or whatever it is, and understand how things change along the way. And so we use a mix of these different agents, and the research prompt uses them de facto; it just says, hey, here's the one to use. And you can also steer Claude to use one of these agents by hand. Um, okay, cool. We're in the... I'm going to put this on skip permissions also.
>> We should almost be done. Um, Josh asked a really good question: Manus recommended keeping the wrong stuff in, so how do you decide whether to keep it versus start a new thread? Well, Josh, I think this goes back to two elements. This can vary both based on how you build your agent and based on what you do with your coding agent when you're using one; using an agent and building an agent kind of have the same philosophy.
And what I would say is, it really depends. So for example, let's say what I'm trying to do is debug. If what I'm trying to do is debug, in that scenario it's actually really useful to keep the history of it in. But once I have debugged and I'm moving on to the next flow... so say, for example, the user asked me to generate some SQL statements and it produced a bad SQL statement, and then I errored it out until I got the right SQL working. Well, in that scenario it's actually better for me to delete all the old SQL statements and only put the right one in there, or perhaps put a summary of everything that went wrong, so that from the main agent's perspective, all I see is: the user asked to do this, I generated this SQL statement, and I produced the result. On the other hand,
during the process of actually getting the error, I probably want to put, uh...
>> You mean result instead of error at the very bottom.
>> Oh, thank you.
>> Yeah. On the other end, when I'm not debugging, when I'm actually generating the wrong SQL, I probably do want all the errors in there so it can actually debug itself and get to the right response.
>> So this is like your error compaction, right? What some people suggest is: if you run three tool calls and then you finally get a good one, the next time you send a prompt to the model, just send the user message and the SQL and the success, because this is a lot of noise for the model that isn't productive to the conversation. And Manus says leave this stuff in, because then the model, the next time it goes to write a SQL query, remembers, oh, that table doesn't exist. So it can learn from what was in the context window versus basically getting the errors again.
>> Yeah, but what's really interesting here, though (can you go back to that thing?), is that it's not as straightforward as what Manus says. What Manus says is: leave it in, always. I'm going to propose a really quick algorithm that I think all of us can understand, hopefully, which...
>> I sent you a link to the board, by the way, if you want to write it.
>> That's okay. I'll just reference it on your screen. Imagine what I did was I did this, and the next time that I got a tool call for a new SQL query, I actually injected the errors in, only at that time. So if I get a tool call for a new generate-SQL or create-query-plan, then I show all the errors that exist in the past five queries. But by default, I don't. So for anything that is not a SQL generation command, I hide all the SQL errors. For anything that is, I show the SQL errors. It doesn't...
>> If it's calling the SQL generation tool, isn't it already too late, by the time it's...
>> You control the tool. So the SQL generation tool is going to go ahead and generate SQL.
>> I guess my point is, if you send this context window in and the model says, I want to run XYZ query...
>> I guess I was describing the model in some other way, where I would just want it to indicate that it wants to generate SQL, and then I would just toss in the extra error commands in there.
>> I see. Okay.
>> Right. That's one thing I can do. I could also update my base prompt, or my RAG system or whatever I use to describe the SQL tables, based on the types of errors I see, in like a post-processing world.
>> So if you separated out the declaration I want to run a SQL query from the actual writing of the query itself, then you can inject the errors in between here.
>> Exactly. And the or or you can inject it into the original SQL as well into the original spot from the previous query.
>> All right. Well, this is about context engineering, and it is incredibly off topic. So, I'm going to jump over here and read the research doc. Um, let's have a look.
>> Can you... You know what you should do? You should open this in Obsidian, and it'll be so much cleaner.
>> I don't have Obsidian.
>> Ah, unfortunate.
>> But I can do this.
>> Uh, do you want to open the G-grip version? But you can open this if you want.
>> Okay, cool. So, here's our research question. I think this is fine. So, let's read the summary. And Vaibhav... as I said, I don't know about BAML. And when you're doing this stuff, you should always have at least one person that's an expert in the codebase, cuz you need to be able to read this and make sure it's right and nothing's missing. So, Vaibhav, I'm going to ask you to read this and tell me what's missing or what's wrong.
>> Um, this looks mostly correct. Yeah,
there's a new error. Testing. Okay, go
down. Um, that is the right file. We do
currently have hard-coded timeouts that need to be plumbed through.
>> Okay, cool.
>> Yeah, that's what I was expecting it to find. If it didn't find that, I would have known it messed up, cuz I know we have hardcoded timeouts. I don't know where, but I know we have them.
>> Special AWS client handling.
>> Because AWS does some weird stuff with
it >> and then WASM disables timeout. That is
correct.
>> Okay.
>> Uh, can you go up? Sorry, I can't read that fast.
>> Publishing API publishes. Yeah,
publishing timeouts. Publishing and
tracing timeouts don't matter for this.
Ignore them. They should not be configurable.
>> Okay.
>> So, can you just tell the model that?
>> Yep. Yep. So, we're going to start: publishing and tracing timeouts do not matter for this, so you can just ignore them. Don't include that information in the document.
>> Or, more likely, tell it explicitly they don't matter. That's better. We do want to know that those timeouts exist, but...
>> Instead, add a note that the publishing and tracing timeouts don't matter, but omit the kind of detailed documentation about what they are. Um, yep.
So we're going to launch that one, and then we're going to keep reading. So this is how you can keep the model working in the background as you're continuing to review. This doc will update at some point as Claude makes the changes, but, um, let's keep going.
>> Publishing, parsing, location, property handler. Yep, this is all correct. Uh, time and options, blah blah blah. I believe that's visit client. Yep, that's all correct. Correct.
And just for everyone else, what am I reading for when I read this? I'm actually not reading to look for perfect correctness. I'm looking for approximate fuzzy matching, because if I was looking for perfect correctness, I would actually jump to that line in the codebase and then go read it.
>> And you can do this if you're really skeptical. One piece of advice I give also is that you should probably not try to make the research 100% correct, because there's diminishing returns.
>> Exactly. So what I'm looking for is: ah, this file and line is roughly correct, and that's good enough for me.
>> So yeah line 45. This is Yep.
>> Okay. Request builder, provider implementation. Yep. That's where I call request. It does the right thing. This is all correct. Go on.
>> Cool.
>> Composite client. No timeout at strategy level. Yep. That's the current structure. That is correct.
>> Yes, that is correct. No deadline or timeout. So we will need to go add that in. And it's found almost all of them. It has missed orchestrator.rs.
>> So you probably want to tell it that.
>> You missed orchestrator.rs. Can you add notes about that one as well?
>> And this is just where me knowing the codebase means I know that if I don't say this, it will get it wrong.
>> Um...
>> In the... which section is this?
>> I don't think you have to tell it that. It'll figure it out. Oh, sure. You can say that, too.
>> I just haven't read the whole thing yet, so I don't know exactly. Um, yeah. Okay, cool. Cuz there might be... Yeah. Okay, cool. Uh, error handling infrastructure.
>> Yeah, that's correct. Um,
that is correct.
>> Okay, made our change.
>> Mhm. Go on.
>> How the tests work.
>> I don't actually know if that's how the test works, but I'm going to ignore that for now.
>> Can you add inline code examples of uh the test for each client SDK generation?
That is a bad bet. But sure, we can put >> You don't think so?
>> Yeah, put it. Put it. It's fine.
>> We'll see how it works. We can always roll it back.
>> Um, cool. Codegen pipeline. Is this
relevant?
>> No, but it's it probably is actually because we do have to generate code. I
mean, not really. We don't really have to generate code. So, maybe not.
>> Okay.
>> Yeah. What's nice about this one is like the timeouts is already implemented.
It's just not customizable. So, you know that the flow is actually correct. like
the flow already works.
>> Well, kind of, but it's close. It's good
enough.
>> Yeah, it's a lot easier than the cancellation one we did.
>> Yes.
>> Um, okay. And then this is code references. This is just going to repeat a lot of stuff about what was already
>> Yeah. So, I just want to go read this though, just so it knows um roughly what it's doing.
>> So, here's your here's your test simulation.
>> Nice. Okay. Can I read the test?
>> Yeah. Yeah. Yeah. Let's go back up.
>> Show me the Python one first before I read anything else.
>> Yeah, >> a board controller with timeout MS. >> So, this is just the test for the cancellation stuff.
>> Yes. So, it did I mean you're researching so it's not writing new code probably.
>> Yeah.
>> Um...
>> But this is probably actually not relevant. I was more looking for, like, what unit tests are we going to have to add to test these new fields.
>> Yeah, you probably should have phrased it that way.
>> Yeah. Yeah, that's actually wrong. Can
you remove those examples? I'm more
looking for what unit tests would we have to add to test the new fields that we want to add here? Um, so can you give
me just one example of uh how and where there is a test um to test the actual like BAML syntax? Um, and just one like tight example and where it lives. Um,
this is going to be helpful for the planning because we're going to want the planning agent to Oops. Let's grab this.
Sometimes Super Whisper doesn't paste it in. Um, cuz we're going to ask the planning agent to build a test for us. So, um,
>> Exactly. So, it'll go write
all in the property handler. There's
nothing else that has to be modified.
>> Okay.
>> Um, yes. Uh, Matt said something great, which is: you should be using voice to prompt for coding tasks. If you're not using voice, you're just slowing yourself down. And I'll give people an intuition about why this is true, because it took me a little bit to come to the dark side. Which is: when you're typing, you almost want to think before you type, and we're almost all instinctively trained to do that. If you make a typo, you're going to press backspace and go rewrite it. When you're speaking, you're going to speak a lot more freely, and you're just going to inject more information in there, which means the model will have better context. And the number of tokens you inject, even if you're speaking very, very verbosely, is trivial compared to the amount of context that you would otherwise never get in there. So just, like, speak.
>> Yep. Amazing. Um, cool. So yeah, it's updated our unit test example with just this. So this is the integration test for client timeout config.
>> Ah, it got the spec wrong. Why doesn't the first one have HTTP?
>> Yeah.
Uh yeah interesting.
>> Yeah I think it thinks it's optional.
>> Why doesn't the first one have the HTTP nesting? That's incorrect. You need to
nesting? That's incorrect. You need to reread the spec and then update the unit test. We're also getting a little high
test. We're also getting a little high on I try to keep the context under 40%.
That's when you get 40 to 50 range. If
it's easy and we're almost there, I will usually keep it. Um, but you always want to be pretty aggressive about the context. But I think I'm ready to start
context. But I think I'm ready to start uh doing the plan for this one >> once almost. I trust me that we should just make sure it's actually good >> because otherwise it's going to be a waste of time.
>> Yep.
>> Um, we should add some caveats in here that I think are not obvious. Um, and I think the spec is actually underspecified. Idle timeout is only relevant if called during a stream. Time to first token is only relevant if called during a stream.
>> Worth noting um idle timeout is only relevant if called during a stream and time to first token timeout is also only relevant if called during a stream. Can
you update the specification with that note, not the research?
>> And specifically, what you want to add in is: orchestrator/call.rs does not care about it, and orchestrator.rs does care about it. And this is kind of that nuance of why... this is just me knowing the codebase. Like, time to first token is a feature that is only relevant if you're calling it with stream, like b.stream.FunctionName instead of b.FunctionName.
>> And this is why we say this is real engineering. Like, the idea that the best engineers have the entire codebase downloaded into their brain, or whatever you want to call it, that is still super valuable when you work a lot with coding agents, the same way it's valuable if you're navigating in an IDE and writing code by hand.
>> Yeah. Cool. Let's go read this now from the top down, so then we can make sure that we actually understood it. And you probably need to refresh, right?
>> Yep. And we're refreshed.
>> Okay. Perfect. Okay.
>> So here's our mod.rs, special client handling, note on the tracing timeouts. Perfect.
>> Um...
>> That's just where the data is. It's in property parsing. Yep. Parser database, provider-specific parsing, Anthropic, Vertex, composite clients. No timeout. No timeout.
>> Can you scroll up? I want to read the composite clients uh carefully.
>> Yep.
>> Yep. Okay. Go down. Okay, that's good. I
read that.
>> This is the interesting part. Yeah.
>> Would need total timeout implementation only. Correct. Um, yes, I think that is correct. Line 221.
Um, it may not be in stream.rs. That's probably the easiest place to put it, actually. Can we just open that file and look at it?
>> Yeah. Yeah. Yeah.
>> And this is kind of where it starts to be like really relevant because like if I were trying to do this myself, I would have to think really really hard.
>> Yeah. And when we say, hey, we don't edit files in Cursor, you still probably will end up reading quite a bit of code. God, why? Who did this? Why is... Oh, sorry. We need stream.rs
>> down. Yeah. And what line number?
>> There we go.
>> 21.
>> Um, go down. Okay, I'm good. This looks
correct.
>> You're happy with this is where you would want that stuff implemented, right?
>> This is where I would want this implemented. So, I agree.
>> Yeah. So, this really serves not only as like documentation, but also like very subtle steering to where you want your changes implemented.
>> Yeah. And can you go up really fast?
>> This is where total idle timeout would be, and where, um, time to first token would be. And it's really interesting that this doc captures all that, because now I'm like, "Okay, cool. I feel confident that whatever thing is going to work on this will have some context that is really, really relevant."
>> Yep. And just as a fun exercise, I'm going to go kick off the plan now.
>> Yeah. Yeah. One other... 1630.
>> Yep. Read the spec in spec.md
and the research. Um, actually, I'm going to show you a cool trick. I'm just
going to run create plan because if you run it with no arguments, it will um just ask you what do you want to plan?
>> But isn't that extra context?
>> Extra context, but it's also the right trajectory. So the planning prompt actually is like a three-phase back and forth with the user. And sometimes, if you just run create plan and tell it what to do, there's a chance that it skips those interactive steps. And so by doing create plan and then this back and forth... hold on, this is a really important concept. By doing create plan, and then having it answer me, and then I say the next thing, you're setting the trajectory: you're few-shot prompting the model that it's often checking back with the user and that this is a back and forth, not a "just go call tools until the thing is solved." I'll show you what this means.
>> I believe you. But for this, personally, I would just run it, cuz I'm just lazy. This would feel like a micro-optimization for myself, but I might be wrong.
>> Uh, if you work with these prompts a bit, you will agree with me.
>> Um where is this thing? You got too many files in your repo. You're breaking my uh fuzzy finder.
>> Bro, I have a big code base.
>> Yeah, once I kick this off, we're also going to run cloc. We're going to do this: let's work back and forth to outline the phases. Start with your open questions for me, and then give me a phase outline before writing the plan.
We're just going to give it as much extra steering as possible to um >> for this problem. Given how detailed the research was, I would have actually just let it rip.
>> Uh you can, but I'm also like want to kind of show more generally, right?
We're not just solving the problem.
We're... but like, that's fair. So the plan prompt is actually going to go do some of its own research. Say what?
>> You want to auto approve everything?
>> Oh yeah. I'm also going to kick off cloc, which is count-lines-of-code, just for audience context on how much code is in this codebase.
>> That's a I don't know man.
>> You don't want to see it. Let's see.
TypeScript 200,000, Rust 200,000, Go 130,000.
This is more than last time, dude. This
is crazy.
>> Yeah.
Yeah. I know a lot of this is generated though.
>> Can you ignore integ tests when you run this? Just ignore the integ tests folder, or go into... Yeah, if there's a way to ignore baml_client, basically. Uh, baml_client. Yeah, just enter; it'll find that. It's like baml_client anywhere in the path.
Oh, >> we'll do this later.
>> Oh, you need to use... Do you not use Warp?
>> No, I don't use Warp. I use Claude.
>> Interesting.
Ignoring also, uh, star. Uh, oh yeah, maybe that'll work. That'll probably do something interesting. I just use Warp, and I found it to be pretty good for this stuff.
Works pretty well if you want to just do it on the CLI. Um, I just do everything in here now. So this thing is researching again; even though the research is pretty succinct, the planning prompt is designed to do a little bit of research up front as well. This is also so that if your problem is really simple, you can go straight to planning and you don't have to do the full research, especially if you have a smaller codebase. You can just do the plan and get better results.
>> What do we get in cloc while it's running?
>> Let's see.
>> Let's see what we got. There we go. That
looks more appropriate.
>> Okay. Yeah, now it's all Rust. And do you have D code in there? What is that?
>> I don't know what the D code is. It's reading something as D. That's probably just, like, config files we have.
>> I mean, cloc is an old janky tool. So, um,
>> yeah this looks about right. Okay cool.
>> Um okay cool. So open questions. Timeout
interaction hierarchy when multiple timeout mechanisms are active user configured abort signal retry policy.
What's the priority order?
>> Um what wait >> should an abort signal immediately cancel regardless of timeout settings?
>> Yes. Abort should immediately cancel.
>> Should retry attempts each get a fresh timeout duration?
>> Yes. Attempts should each get a fresh timeout duration. Cool.
Um, it's like the spec suggests. So it's going to always try to come up with some open questions. Um, some of them are just softballs, like: do you want me to do what it said in the spec? Um: should it trigger between any SSE events, including keep-alive pings, or only content chunks?
>> Um only chunks.
>> Only chunks.
Keep-alives do not reset the timer. Okay.
Question three.
HTTP block structure. Yes. Enforce the structure; not allowed at the top level. Composite clients should only accept total timeout.
>> Dude, you got to switch to Wispr Flow. A lot more accurate.
>> I just downloaded the biggest model on Super Whisper to test with it. But
um default timeout values. I'm also
going to have to go get a charger in a sec.
>> Should we maintain the current hard-coded defaults? 10-second connect, 30-second read, as fallbacks when no configuration is provided.
>> Yes, keep the default.
>> Yes.
>> Five. Error information granularity.
Yeah. How do you want to display errors to the user?
>> That was underspecified. Um,
tracked elapse time.
Uh, just a BAML timeout error, with no data for now. Oh yeah, it should be a child of BAML client error; a subclass.
>> WASM environment?
>> Silent degrade if unsupported.
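To make those decisions concrete, here is one way the agreed semantics could read once plumbed into a client definition. This is only a sketch of the behavior being decided, using the proposed (not final) syntax and illustrative names:

```baml
retry_policy ThreeTries {
  max_retries 3
}

client<llm> CustomSonnet {
  provider anthropic
  retry_policy ThreeTries
  options {
    model "claude-3-5-sonnet-latest"
    api_key env.ANTHROPIC_API_KEY
    http {
      // Each retry attempt gets a fresh copy of these timeouts.
      request_timeout_ms 60000
      // Idle timeout resets only on real content chunks,
      // not on SSE keep-alive pings.
      idle_timeout_ms 10000
    }
  }
}

// A user-supplied abort signal cancels immediately,
// regardless of any timeout settings.
```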
>> Okay. And then let's look at the phases. So phase one... when we're designing phases, what I really like to do is focus on what is the incrementally testable thing. I don't know if that's actually going to be relevant, since you have such good unit and integration test coverage, but if you're building, like, a full-stack application, and I'm not going to actually be able to look at it and verify it until the end of phase three, I'll just have it combine all three of those into one phase. Um, there's some art and science to this, but what do you think about this?
>> Excuse me. Um, let me take a look and think about this more.
>> I think this is uh yeah, that's probably a good phase one.
>> So, we're skipping the streaming stuff. We're just doing the core client and basic timeouts.
>> It's really just like parsing clients is what I would describe that as.
>> Yeah.
>> As like parsing stuff.
>> Yep. Then we add in the BAML timeout error. Yeah, that makes sense.
>> Create those errors in the SDKs.
>> Yeah.
>> Yep.
>> Then do this. I would probably move phase 4 right with phase one personally.
>> Okay.
For phases, let's do phase 4 right after phase...
>> No, right as a part of phase one.
>> Phase one.
>> Yeah. Uh, because it's the same parsing. You can even tell it that: since it's the same parsing logic, since it's the same parsing logic file.
Um, okay. Phase five, testing and documentation.
Don't do tests at the end. Add tests as...
>> No, no, don't do that. Don't do that. Don't do that.
>> Really?
>> Leave that for now. The codebase is [ __ ], so that's not how I would tell it.
>> I'll tell you.
>> What do you want to do? You want to just leave this as like we'll write the test at the end?
>> Just backspace all that so far and just run all the commands you have so far. Give me, like, one second to think about how to tell it that.
>> Okay, I'm gonna hit this. I'm gonna go grab a charger. Um
and then...
>> Yeah... I'm not gonna hit...
>> Let it create the plan. Hit enter, let it create the plan, and then we'll come back to that testing thing in a second.
>> Yeah. And then we'll read the created plan and we can iterate from there.
>> Yeah, exactly. Cuz I want to read the actual plan. Um, and then I think I'm good. Uh, but you should fire it off to let it run the plan.
>> So for context for everyone else, what have we done so far? In about an hour, we've gone through where we had this spec that we wrote that like roughly outlines new syntax that we have. We
took that spec, we asked an LLM to go modify it with some more updates. We then told the LLM to go research the BAML codebase, to go and understand exactly what the codebase is doing and what parts are relevant to make that change.
And by the end of this hour, we now have even an implementation plan of like here's how we would go implement this by step by step. And hopefully in the next 10 to 15 minutes, we'll have the full implementation plan ready to go. And
once we have the implementation plan, the rest of this is actually really, really, really fast because once a plan is approved, letting a model rip and run is so fast.
But the key part is having really, really good documentation there. For
example, if you go back and look at this earlier, some of you may have seen that one of the things we did was add the note about streaming to the actual spec. And because the research did not pick that out on its own...
>> That was important.
>> Super high leverage. And this is again the whole point: a bad line of code is a bad line of code. A bad part of a plan is a hundred bad lines of code. And a misunderstanding of which parts of the system are relevant can tank your entire project.
>> Exactly. So because we knew that that information is relevant, we added that in, and that was actually why the model realized that, hey, idle time is only relevant during streaming. And that's what I realized too while doing that: idle time is only relevant during streaming, and time to first token is only relevant during streaming. We don't want to error out in the normal call pattern when you're not streaming for time to first token. Um, so while this is almost done, it's going to go and implement all this stuff and it's going to write the plan, and then we'll go read the detailed plan.
Dexter, while we do this, I want to...
>> Go download Obsidian.
>> I want to make you download this.
>> Yeah, I actually would love to see your Obsidian workflow.
>> Just You can't homebrew it. Just
download it.
>> Yes, I can.
>> Oh, maybe you can,
>> bro. You can homebrew anything, man.
>> Okay. Yeah, cuz you've been talking a lot about how you use this. And actually, one thing I think is interesting, a thing I find when I work with people on these kinds of projects, and what works when we're sitting together, is that instead of me listening to you and then translating it into feedback, I can just hand you the mic and you can talk at the model. And that's a thing where, when there's two people working, it's really helpful to be able to switch back and forth between who's prompting quickly. And we don't have that today cuz we're live on a stream and we're not doing a screen share or whatever it is. But being able to... This is running. Can you hold Super Whisper and I'll talk, and we'll just see if it pulls it up and listens.
>> I don't think it can pull my computer audio.
>> Wait, are you you have uh Wait, do you have uh headphones on?
>> If you take off your headphones, it will listen to me and it'll plug it in.
>> All right, let's try it.
>> Let's see if this works virtually.
>> All right, let's go.
>> Okay, so we can do that. Uh, I'm going to go back to headphones just for the audio quality, but that's good to know. Um, the point I was making about Obsidian and stuff is basically: this creates one of the workflows that is helpful for collaborating with your team, right? Because what makes this workflow really shine is the... um, open folder.
>> Just open a folder as a vault. Yeah.
Plans, and put it in reader view.
Can you zoom in a bit more?
>> Yep.
Why do I personally like this workflow of using bare markdown files? Well, one is I can switch between reader mode and writer mode. Two, you saw earlier, one of the things we were doing is we were actually asking the model to go make every edit; some edits are just easier to make manually and not have to go back and forth with the model. And, three, I find reading this to just be prettier.
>> It's just for me personally, I find this view to be much better.
>> Okay, >> I like it.
>> So, desired ends. This is our concise spec. Let's read the first few lines. And I apologize to everyone on the stream that's watching me and Dexter read a bunch of stuff. Um, sadly, this is just what we have to do.
>> This is worth it, too. Like, this is the part you want to be doing and spending time on. I talk to a lot of people who get to this and they're like, "Okay, cool. So we can use the model to write the plans and we can use the model to write the research. What if we had the model write the specs, too, and write a really good spec?" And it's like: the plan gives you 10x leverage. The research gives you 100x leverage. Sometimes you just have to learn to be happy with being 100x faster and not try to get to a thousand, because then things become hard in a weird way. And there are sometimes things that you want to do manually, whether it's reading or manually pruning these specs, that are still worth doing, even though you've been told that if you're doing something, AI should be able to do it. There's a top-out where you hit diminishing returns. Does that make sense? Does that track?
>> That's right.
>> Yep. Um, so let's go on. This looks correct. You're right. The... um, as the model will say, you're absolutely correct.
>> Yeah, we don't say we don't say that other thing here on stream.
>> Okay, that looks correct. Let's go down.
Fall back and only support this.
>> We're not doing go down.
Um, >> be an HTTP block. Not adding detailed metadata to the error objects yet.
>> Yes. Not imposing timeouts in any WASM environments.
Not changing default timeout behavior.
One second. Not implementing complex timeout inheritance between composite and underlying clients.
>> Great.
>> Yes. We're not inheriting timeouts. Every client is independent.
>> Yep.
>> Implementation approach. H. Okay. Cool.
>> Um, cool. Okay. So, phase one, we are going to update the helpers to add the HTTP block.
>> Can you go up? I want to read this a little bit better. And for context for everyone else: what are we doing here? Well, we're going to read the code. Notice normally you don't read all the AI-generated code. I'm going to read this code because this is relevant. So, I will spend time reading this.
>> And the idea is that this is higher leverage than actually reading the code the model wrote, because it's more of an outline. This is actually not going to include every single line of code, but it's going to be the directionally important parts, and it's going to be line by line.
>> We don't have good errors here. So, we should tell the model that it didn't add good errors for this phase of the plan.
>> Go down. I don't think it did. Let's just read the plan a little bit better.
>> Yeah, let's let's finish reading phase one and then we'll we'll give it our feedback.
>> Ignore for now. Okay. Okay, that's fine.
It has some bail, which is bad. We really want errors. I think we do want an error actually, because that was the other issue we fixed: someone misspelled this and they didn't get an error.
>> Yeah, exactly. So tell it that we want good errors here.
>> We want good errors for unrecognized fields in the HTTP block because the user needs feedback to know that they have typed something wrong or unsupported.
>> Enter. While that runs, we can keep going.
>> Yep.
>> Yep.
>> This is so fun. I love this [ __ ]
>> Let's go on.
Okay, that's perfect. Um,
>> so this is the config.
>> Okay, cool. This part I have no idea what it does. So to be honest, I'm like, whatever, it's going to figure it out.
>> So we're going to use the users config and then we're going to map it from milliseconds and then we're going to unwrap it or use the default.
>> We should probably have a way to define infinite in some way. That's underspecified in this doc.
>> Um, what is infinite? Minus one.
>> Yeah.
>> Whoops. Let's go. Uh, no. Let's use zero for infinite.
>> If a user puts zero as the timeout, that should mean infinite timeout and override the default.
>> Oh, why zero as a timeout?
>> I upgraded to the super ultra model from Super Whisper and it's not very good.
>> And I don't know if zero is the right thing, but like we can use zero for now.
Um, we could also make the user write in, but that's okay. Let's just do this for now.
>> Okay, cool. I'm going to just stash that because it's still working and I don't want to interrupt it.
Um, okay, cool. So now we have, um...
>> I believe the errors here are fine.
>> Okay.
>> Um, unrecognized fields, not as an empty bail.
>> Wait, go up. What did it say about providers?
>> Go up. It said something about providers.
>> Is composite fallback or round robin? Is there any other provider?
>> Round robin.
Okay.
In the provider check block, it's "round-robin", with a dash.
>> There you go.
>> Um, and we're almost getting to 40%. So, depending on where we're at after this round of feedback, I will probably start a new context window and just be like, "Hey, we're working on this plan."
>> Cool. We'll figure out the error checking here. I think the error checking can be much better. So, we'll deal with that later.
>> Yep.
>> Yeah. This is going to change, but um Okay.
>> Cool. It's cool that the model inferred that zero is an error.
>> It's infinite. Oh,
>> it did.
>> Yeah. Zero. Oh, this just changed after I put the feedback in.
>> Oh,
>> do you want to have a 10-minute max?
>> You can just delete that. Just delete that. And this is what I mean. Like, this is where we don't have to think about this. Maybe. Yeah.
Yeah. I don't want to put a max on there. Screw it. Who knows how good models get. Maybe we do have 10-minute-long HTTP requests that we run.
Do you want to leave this in? This is the validator. Like, I think this is the only one.
>> Um, no. What does the other one do?
>> It's literally... this is just validating the timeout. So these mean the same thing.
Okay, cool. Um, all right. Let's get to the end of phase one and then we can actually... Well, let's keep going. Does this look right?
>> Yeah, this looks about right. This is just passing it through to the HTTP client.
>> Makes sense. One thing I don't like about this. Can you go up?
>> Yeah. Yep.
>> I don't like that this stuff is done over here. I feel like this stuff should actually be done in the constructor. So the defaults should actually be configured in the parsing; for all these timeouts we have the defaults right there.
>> So the defaults should be configured as a part of the previous section, not as part of this section.
>> So that way we don't have to reduplicate the defaults in five places.
>> Okay. So basically the defaults should happen in this section, not the other one.
>> Uh, yeah.
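To make the idea concrete, here's a minimal sketch of what "resolve the defaults once, at parse/construction time" could look like, with an explicit zero meaning "no timeout". All names here are assumptions for illustration, not BAML's actual types.

```rust
use std::time::Duration;

/// Hypothetical resolved config: every timeout is filled in once, at
/// parse/construction time, so downstream call sites never re-apply defaults.
#[derive(Debug, Clone)]
struct ResolvedHttpConfig {
    connect_timeout: Option<Duration>, // None = no timeout at all
    request_timeout: Option<Duration>,
}

impl ResolvedHttpConfig {
    /// `None` = field omitted (fall back to the default);
    /// `Some(0)` = explicit zero, meaning "infinite" / no timeout.
    fn from_user(connect_ms: Option<u64>, request_ms: Option<u64>) -> Self {
        let resolve = |ms: Option<u64>, default: Duration| match ms {
            None => Some(default),
            Some(0) => None,
            Some(ms) => Some(Duration::from_millis(ms)),
        };
        Self {
            connect_timeout: resolve(connect_ms, Duration::from_secs(10)),
            request_timeout: resolve(request_ms, Duration::from_secs(60)),
        }
    }
}

fn main() {
    // Omitted field -> default; explicit 0 -> disabled; anything else -> that value.
    let cfg = ResolvedHttpConfig::from_user(Some(0), Some(5_000));
    assert_eq!(cfg.connect_timeout, None);
    assert_eq!(cfg.request_timeout, Some(Duration::from_millis(5_000)));
    println!("{cfg:?}");
}
```

The payoff of resolving once is exactly what's discussed above: the call sites never need to know what the default was.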
>> Yeah. Okay. So now we're at 63%. So I'm
going to make a new one.
>> Oh.
>> No, we're going to make a new one.
Sorry bud.
>> Okay.
>> We're not creating a new plan. We're updating an existing plan. I'll give you the path below. Um, so you don't need to kick off any research to start. Just use the guidelines to update the plan. I'm going to give you the path and some feedback below.
You need better fuzzy match.
>> I know something's going on.
>> You can just right-click on that one, the one above. Just right-click on the file in Obsidian and it'll give it to you.
>> Okay. Oh, yeah. Yeah. Copy. Copy path.
>> Yeah.
>> Copy.
>> There we go.
>> You need to put thoughts in front of it.
>> Yeah.
>> I would actually let it read the plan and then give it feedback after.
>> Yeah.
>> Cuz then that's going to become like...
>> In phase one part three, we configured the defaults. I'd like to instead set up the defaults in part two, where we create the client. Is that right?
>> Yeah. Or where we parse the client in part one, I think. Where we parse the client.
Yeah. There you go.
>> Cool. Let's keep on going. Can you make it wider again?
>> Yep.
>> Yeah.
>> This stuff looks really good. It's
actually really simple because it seems like so straightforward in terms of how it would implement it. That's fantastic.
>> Um, okay, so zero: no timeout. None: use the default. We're going to move the default up, but
>> yeah.
>> Okay.
>> Yeah. So then we'll always have it. And if it says none, and zero should really be none, I think this will do the right thing. Zero and none should behave the same.
>> Are you sure? Isn't none "use the default timeout," but zero is explicitly set?
>> No, because it'll be set now in the other code. Uh, yeah, there you go. See, it changed it automatically to reflect that behavior.
>> Perfect.
>> This is a lot easier to think about.
Cool. Do you want to skim the change it made or you trust it?
>> Yeah, I do want to skim the change it made.
Um, ensure... uh, I don't like this happening here. This should happen as a part of HTTP config. Like, as a part of ensure HTTP config.
>> This should happen as a part of ensure HTTP config. Oops.
>> Uh, so it actually...
>> What happened? Yeah.
>> So it'll do the defaults there.
>> Okay, cool.
>> And once it's done... I hate that "you're absolutely right." It's so frustrating to see.
>> Yeah.
>> Um Okay.
>> It's frustrating to see this early in a context window.
>> It's fine.
>> Um Okay.
>> And then composite client total timeout.
>> Make it wider again.
>> Yep.
>> Okay. Let's keep reading.
>> Let's... Well, it just made a change. Let's go check the change.
>> What? Why didn't it make the change in the right place?
Go up.
Okay. So, it did do this here.
If it's not a composite, it adds some default in there. That's good. That is
correct.
>> Um, it also rolled. I don't know. I
edited this in Obsidian. I don't know if it actually saved.
>> It does save. You just have to hit command S, but that's fine.
>> Just going to remove this stuff.
>> Yeah, you're done.
>> Since I don't trust it.
>> Um, you can go confirm. Just less than zero.
>> This is the research. That's why. Okay,
good. Yeah, cool. Command S works. Yeah.
>> Okay. This looks good.
This looks good.
>> Wait, can you >> Why is it doing this twice?
>> That's fine.
>> Connect timeout. Request timeout. Okay.
The two different timeouts.
>> Composite time. Total timeout. That's
fine. Does this work? Can I see this one more time?
>> Yep.
>> Yeah. Orchestrate.
>> I want to look at how orchestrate works.
Total timeout. Okay. Yep.
Yeah, that's correct.
>> Abort signal is already there. Existing retry logic with fresh timeout per attempt.
>> Yep, that's correct. Okay,
>> that's good.
Does this need to be a special error or is this right?
>> That probably should be a special error, but I think that's okay because we haven't implemented the error yet.
>> Yeah. Okay. And then for streaming, we're going to do basically the same thing. So, this is the compaction part, right? We're not writing every single line of code; it's like, okay, we did this and this is going to be about the same.
Yeah, I think the only difference is we should use tokio::select! here, but that's a separate problem. I'll deal with that later.
>> Are you sure you don't want to change that?
>> Yeah, you should tell it that. You should tell it to code it later. Like, "we should use tokio::select!" It'll know what I want.
>> Probably should have told it phase one, but okay.
>> Figured out.
>> There are examples for user feedback.
>> Yeah. Can you make it read only view again?
>> Yeah. Connect timeout. Yeah, that's...
>> Unrecognized fields. Unsupported timeout fields not allowed for composite. Composite clients only support connect. Must be non-negative. Zero is valid, means no timeout. Yeah, these are all our test cases, basically.
>> Yeah. Um, and I think there's something in the doc that's somehow telling it that... yeah, there you go, you got that right.
>> The select stuff, this is right.
>> Yeah, it looks more correct than before.
>> Here we go. Okay, cool. Um, cool.
>> Okay. Um, that phase looks correct.
>> Uh, that is not right; we don't have Makefiles.
>> Yeah, that's actually... I have a global CLAUDE.md that steers it to make, and I'm going to have to get rid of that.
>> Um, those are the wrong testing commands at the end of phase one. Um, what should I tell it to do?
>> Yeah. Yeah.
>> I'm also like I'm a little bit worried this isn't actually telling it to add unit tests.
>> Um, there are no unit tests here, but that's okay. Uh, well, I'll talk about testing in a second with BAML.
>> Okay, cool.
>> Um, I would probably say it's kind of meaty for phase one. It should really just get all the configs right. That's what I would call all of phase one. All of phase one is just getting the configs right. And...
>> So what do you want to take out?
>> I would just split out phase one to the parsing and actually adding the timeouts as phase two.
>> Cool. Yeah, this is like a thing in phase design, right? It's how you would build this yourself: how much code would you write before you pause to run something?
>> Yeah. And the reason by the way I say this is because I know that we can add tests at at the parsing layer first before we add tests for the rest of it.
>> And I might honestly just say like everything cargo test passes.
>> Uh I'll tell I'll give you specific tests to run actually after this.
>> Um or the model can do it.
>> Yeah, I know. But I'll tell you what to run and then you can just tell it that it's not actually cargo test. We'll have to... if you edit this here, it's going to edit it again. You're going to get screwed. So you can't multi-hop.
>> Yep. Okay,
>> cool. Let's keep reading though. Um
because even if it renames the phases, that's fine.
>> Yep. Okay. Um error handling enhancement.
>> Yep. This is correct.
>> Python SDK error. TypeScript SDK error.
Go SDK error.
>> Cool. Looks good.
>> Well, the Go SDK doesn't have errors yet.
>> Oh, really? Pretty sure that doesn't exist. I'm like 85% sure that that does not exist.
>> Just want to see what is it doing here.
>> While we're doing this, and while Dex is running this question: feel free to keep firing questions off while this is executing, because we have a lot of downtime while this runs.
>> Yeah. And this is why I actually think pair programming is really good for this kind of work, because there is a little bit of downtime. And if you're doing it solo, I would just go check Twitter or something. But when you're sitting with another engineer building stuff, you can actually take the downtime to engage on the problem, think about what's next, talk about some of the things that we haven't figured out yet.
Do we have more questions?
>> Keep on going. I'll let you know when the questions come in. I'm watching the chat.
>> Cool. Thanks.
>> Okay, so phase one is parsing and validation.
>> This is ensure HTTP config. And then phase two should be... And so
>> now what I want... cargo test is not enough. Now what I'll tell you is I'll actually show you how to add tests. The way that you add tests is you go into our repo. Open our repo really fast just so you know what I'm pointing to in Cursor.
>> Yep.
>> Open that and then go to the engine/baml-lib tests. Yeah. And then do command-B so I can see the sidebar, and then scroll up... it may not... it's in the validation files.
>> Oh, here we go.
>> So this is where... new thing in client, open client, and we go, boom, "hey, add new tests to this file." These are all incorrect, but there's a mechanism to run them. Copy path.
Yep. And just tell it to add tests for phase one. You can add tests to this file. Yeah. And then the way that you run them is cargo test from within that... no, no, not the whole thing, just engine/baml-lib/baml.
>> So yeah, one thing I will say is this is the kind of thing that you're not going to want to tell it manually for every plan. And so this is the kind of thing where you could go update your create-plan command to give guidance around how the tests work for different parts of the codebase. We're not going to do that today, but that's room for improvement. Specifically, what you want to tell the system is: this is how you run parsing tests, or a validation test.
Read a few other files in there to get a sense of how it works.
>> And also tell it that the error comments automatically get updated when you run UPDATE_EXPECT=1 cargo test ... Yeah, there you go. So that's just a mechanism that we have. Yeah, cool. That's it. Tell it that; now it should be able to go write good tests. That's why I didn't want to do testing stuff earlier: I wanted to have some context here. I was like, there's just no way that we'll get this right in that part of the...
>> Well, you want to kind of read the plan, and you know what it already knows, and that gives it enough context to actually go put in the right unit tests.
>> Exactly.
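For readers who haven't seen this pattern: a rough sketch of the kind of validation test being described, assuming an invented parser function and an approximate BAML-like snippet. This is not BAML's actual test harness; it only illustrates the "expected error text plus UPDATE_EXPECT-style regeneration" workflow mentioned above.

```rust
// Illustrative only -- not BAML's real harness. The pattern: a validation
// test feeds a client snippet through the parser and checks the exact error
// text; an env var (UPDATE_EXPECT=1 with the documented cargo test command)
// regenerates the expected text when the wording changes, instead of failing.
fn validate(src: &str) -> Result<(), String> {
    // Stand-in for the real parser/validator.
    if src.contains("conect_timeout_ms") {
        Err("Unrecognized field `conect_timeout_ms`. Did you mean `connect_timeout_ms`?".to_string())
    } else {
        Ok(())
    }
}

#[test]
fn typo_in_http_block_is_rejected() {
    // Hypothetical BAML-like snippet with the new http block and a typo.
    let src = r#"
        client<llm> MyClient {
          provider openai
          options {
            http {
              conect_timeout_ms 5000
            }
          }
        }
    "#;
    let err = validate(src).unwrap_err();
    // A snapshot-style harness would rewrite this expectation under
    // UPDATE_EXPECT=1 rather than having you hand-edit it.
    assert!(err.contains("Did you mean `connect_timeout_ms`"));
}
```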
>> Cool.
>> And "are we using Obsidian as a knowledge base or just as temp for creating your prompts or context?" I personally never save my prompts, my research, and my plans in my codebase permanently. They're a separate artifact that lives independent of it. And Obsidian is just a really good markdown viewer and reader. So that's what I read with.
>> Yeah. And also, I think what we've found a lot is that in the past we kept all historical research and plans and gave the model access to review them for historical context. And what I'm finding more and more is you almost never want that. What you want is for the model to have access to all of the research and plans and specs for a specific project. And when I say project, I mean this issue that we're solving. Then when you move on to another project, all that stuff becomes not visible by default, unless you ask the model to go search for specific things. We do have a couple of files that I call golden research, which are things the model should probably always be able to find, because it's like, here's how the tests work. This kind of information could get put into a research file that is always available to the model, but you'd want it to be hand-manicured and really good, because that's a high-leverage thing.
>> Can I read the plan?
>> Yeah, let's see what it did.
>> Okay, cool.
>> Yep.
>> Add validation tests. Okay, go on. So, go down. So, it just shows you error examples, but then it actually starts adding validation tests. That's great. That's great.
And then it will tell you how to go do this. Now it tells you the invalid tests, what's happening. Cool. That's fine.
Like all these errors and stuff are totally fine.
>> Cool.
>> Adds a bunch of tests, and then it should make them pass, and some of them won't pass. So then it'll just run the diff itself.
>> Messages.
>> Uh, let's... I would actually run phase one. I would just let it run phase one.
>> Cool. And then we'll go back and fix the rest of the plan once we know what's up.
>> Exactly. So while while we're waiting and reading the rest of it, I would let it run on phase one.
>> So will you would you actually run every single one of these commands?
>> Yes, that is how I would run this.
>> Um, >> yes, that is correct.
>> Make sure this passes. So it's it's it does have the same thing.
>> I would get rid of "defaults are applied for non-composite clients." Yes. Um,
>> yeah, I would get rid of that. Yeah, we're just testing the parser stuff.
>> Yep, there we go. That's all fine. You can leave that there.
>> It'll probably only run it once.
>> Yeah, it'll figure it out.
>> And then we're just going to not do manual verification here, right? Huh.
>> It's fine. Let it be there. I would let it run. And I would just tell it to only run phase one. And you're going to start...
>> I will. Yeah. Yeah. But I also put that in there just to over-steer it a little bit.
>> Create plan.
>> Um, cool.
>> Yeah. Implement. Yep. And just let it rip. And now let's go back to reading the plan while this is running.
>> And this is something that I have often done. Because we've already read phase one and we kind of split it into two separate plans, I'm okay letting this happen. I am not at all worried about it messing up at this phase. So I will let it go do that, and it'll make some progress, and at some point the code will be done. Probably in a couple of minutes, to be honest. Cool. So, since this one was getting fairly high on context, I'm just going to close this one out and use the same prompt. I literally copied the same prompt, which is: create plan, read the file, and then wait for my feedback.
>> Cool. Let's keep on reading.
>> Cool. Timeout implementation. So, this
is the stuff we moved out of phase one.
>> Yep. So, it's now actually going to create the client. So, this is this is much better than before.
>> Yep.
>> Cool. And then this is our applying the actual timeout to the sync stuff >> in the each client. And then the composite client total.
>> Um, we have our tokio pin. We have our tokio select. So I think this all looks right. I don't...
>> The error is out there. I agree with this. I think we should move the error definitions up before this happens. Like, the exposed error will need to be implemented, and it's not done yet. So...
>> In phase two, let's move the exposed error up and implement it before the...
>> I would just move the error definitions before. I would delete that, and let's tell it to move the phase of defining errors before we actually do the request definition.
>> So in phase two, let's just define all the errors first before we go in and do the request definition stuff.
>> Then let's split out the phases. It might interpret that as "do it in phase two."
>> Do you want to do it instead of phase two? I want to do it... Yeah, I want to do it before phase two. Like, I want phase two to move down one more step.
>> Make that into phase three and then add a phase two that is just "define all the errors first."
>> Cool.
>> Exactly.
>> And again, it's all about breaking down the problem into smaller and smaller components that are all going to individually compile because like this is a huge compilation task. So my whole goal is compile as much as possible
ahead of time. And one of the benefits, for example, of using Rust is if it compiles, it probably works.
>> Rust.
>> This somehow got our stupid value, "greater than 10 minutes."
>> Cuz there's a line somewhere in there that mentions 10 minutes. I bet you there's a thing in there, in one of the files, that says something like "10-minute max."
>> It just rewrote this again. I tried to edit it.
>> Yeah, but go look up in the other one. Go look up in the research.
>> Didn't read the research though.
>> It did.
Not in the implementation... This is the implementer. This is the thing actually going to do the code.
>> What file calls did it make?
>> It read the plan. My point is, we edited that plan to remove it, and it got put back in. So...
>> This is why I don't manually edit sometimes, because the model will just keep rewriting it with the new stuff.
>> I think I have a separate prompt that I use that tells it that sometimes I make edits that it doesn't know about.
>> I see.
>> Um, it's fine if it has that. We can just remember that. Like, honestly, a 10-minute max is fine.
>> No, we're going to fix this.
>> Oh, it actually completed the implement phase. Well, this... Are you? No, the implement plan doesn't need it.
>> I want to update the plan cuz every every model from here on out is going to see that in the plan, and it's going to be like, oh, that's wrong. I'm going to go fix that, too.
>> Yeah, but then we should... it'll be in implement plan. We should do this in create plan. Fire it off. Okay, let's keep on reading it. Did...
>> Hold on. Hold on. No, because I want to make sure that the actual implementer also has that. "Updated the plan for validation to simplify the logic. Read the file again and update the validation." I'm just going to stage that. I'm not going to send it.
>> All right. Cool.
Um, cool. Let me go read this really fast. Let me read the plan while it's working.
>> Yep.
>> Um, I want to read the next phase because this part is done. Phase one should be done. Cool. Error type definition.
>> Cool. That was correct. Correct. Go up.
No, you're going too fast. My brain
doesn't read that fast.
>> And I do want to read everything else.
>> Yep.
>> Yes. Damn. Timeout error. It comes from a client. So, we actually have all the client metadata for free. Yes.
>> Yep. Correct.
>> You said this doesn't exist yet.
>> I'm pretty sure. I mean, just go check that file. Yeah... I don't see it.
>> Cool. That's what I thought. Um, so Go doesn't have errors. Just like delete.
>> Should we just skip this?
>> Just delete it. Delete it and just say it doesn't have errors. And you can tell it, if you want.
>> The Go SDK doesn't have support for errors. So remove step four from that phase entirely. We'll just skip that for now.
>> And same with Ruby.
>> Same with Ruby. Same with Ruby.
>> Technically, we do have errors. I just don't want to deal with it right now. Uh, and you should say focus only on Python.
Oh, whatever. It's fine.
That's good enough. Let's go back and read.
>> Okay, >> this thing is still implementing.
It's doing its thing. And a large part of actually using AI models is really about moving fast and parallelizing. So that's why we're implementing phase one without having gone through all of it.
>> It's just not worth the time.
>> So now we just have Python and TypeScript. And here's the mapping. And that's the end of the phase. Is this right? Like, string contains "timeout"?
>> No... I don't know what that is. Can you go to that file? Just look up what that file reads like. I have no idea. And again, this is just another benefit of having the actual file references: we can just read this stuff. Just go back and read. Um, no, it's somewhere down below.
>> The line in execute request, or error handling. So here we look at the status.
No, this won't be in here. This is incorrect. This will only ever be in the... I don't know where the error will come from. I mean, maybe... where's the execute request match?
>> So, here's a good example of, like, um...
>> I have no idea what Rust does here to actually see where it happens.
>> So this is what I'm going to do. I'm going to show you a cool trick. That shouldn't happen here... Use codebase-pattern-finder to see how errors from HTTP clients are handled.
>> I'm pretty sure it does happen there, by the way. Oh, anyway, that's fine.
>> Okay. Uh, close enough. We'll figure it out.
>> It's somewhere in there. Um, is the other one done yet?
>> The implement.
>> Why is it running make? It's running make.
>> It's my global CLAUDE.md. Read the damn plan, you nerd.
>> Cool. And then I think what what should end up happening here is once phase one is done, then we should be able to get phase two to compile. Once phase two compiles, then we can actually go implement it.
>> Yeah.
Um I'm actually going to stop this cuz the context's getting quite high.
>> It should be fine. Trust me, I've run it to like 80% for these guys, especially during the last part of running tests.
>> But we also have to update all this logic. All the unit tests are wrong because we updated the plan to remove that timeout max. So we have to fix that and then we have to get the tests passing.
>> It will matter because it's going to fix and get all the tests passing and then it's going to have to make a bunch of changes and fix the tests again. Trust
me.
>> Okay,
>> we're working on the plan. Um, we're only doing phase one. Um, I did one small update to the plan. So, check what's been done, then update the validation logic, and then just run the automated verification steps from the plan. Don't use make. Don't read the Makefile. Just run the cargo test commands as documented.
>> H interesting. Send it.
>> I got to give it the plan path.
>> Did you stop the other one? Oh,
>> So we're parallelizing here, but we're parallelizing between phase one and also iterating on phase two.
>> Cool. So, it's going to run stuff, I guess. Auto-approve everything. Okay, cool. Let me keep reading. I know why this error handling is happening incorrectly.
The reason this is happening is that this should actually be handled in phase three later, not in phase one. So we don't actually have to use the exposed error yet.
>> We're already in phase three here.
>> No, in the implementation that's phase two. Yeah, like the error... go down.
>> Yeah,
>> go down a little bit more. Go down a little bit more. Oh, go down. Oh, it's during the timeout implementation. Okay,
>> I think it's decided that there's no extra stuff to do. It's already It's already done.
>> All right, cool. Let's go down.
>> You want to look at timeout implementation?
>> Yes. Cool. That's correct. Don't set a timeout. Okay, that's correct. Um, don't do any of the WASM stuff. Tell it to ignore WASM in phase three.
>> Well, now this is trying to do a bunch of stuff in phase. Which phase is this?
In phase three.
>> Uh, let's just see what it did.
And then I think... so we're, what? We're at 11:42. So it's been about an hour and 45 minutes
>> since we've been doing this.
>> Um, okay. So this is the actual error detection happening here, where we're updating the enum. Basically, we're adding timeout to the error code. Even this seems correct, because we're going to emit this from inside the actual client handling.
>> Okay, I'm down for that.
>> Okay, so yeah, if we Yeah. Okay,
>> cool. That makes sense.
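As a rough sketch of the shape of the change being read here (names are assumptions, not BAML's actual types): a Timeout variant is added to the error-code enum, and the error is constructed inside the client handling path so it can carry which client timed out and which limit was exceeded.

```rust
use std::time::Duration;

// Hedged sketch, assumed names only.
#[derive(Debug)]
enum ErrorCode {
    Timeout,
    Http(u16),
}

#[derive(Debug)]
struct ClientError {
    client_name: String,
    code: ErrorCode,
    message: String,
}

// Built from inside the client handling, so the client metadata comes for free.
fn timeout_error(client_name: &str, kind: &str, limit: Duration) -> ClientError {
    ClientError {
        client_name: client_name.to_string(),
        code: ErrorCode::Timeout,
        message: format!("{kind} timeout of {}ms exceeded", limit.as_millis()),
    }
}

fn main() {
    let err = timeout_error("MyClient", "request", Duration::from_secs(30));
    println!("{err:?}");
}
```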
>> Use the message. Otherwise, we parse the HTTP error.
>> We go up.
>> Yep.
>> I want to read a couple more things.
>> Yeah, >> the timeout error will need to have uh the client name as well.
>> Which phase is this? I missed the outline. Can I get an outline?
>> You can just get an outline. It's on the right side. Uh, it's one of those... it is that one. Click on that, the right side. That icon you had, the sidebar. Nope. And then at the top. Yeah, that one.
>> Yeah. Okay, cool. Oh, that's so much better. Okay, cool. So, in phase three... sorry, it was the
>> section four.
>> Phase 3.4.
>> Uh, section five.
>> Oh, that one.
>> This one.
>> The create errors. This part needs the client name as well.
>> It needs the client information as well. Specifically, you can tell it: since the timeout error extends from the client HTTP error.
>> Cool. Let's go on. Um yeah, I'm just going to check on this one.
Okay, that looks good. It's running the tests. It's making some fixes based on some parsing errors.
So this is the other thing that's really important: the better your tests, the more you can give the model an objective measure of whether it's correct or not. A lot of people talk about LLM-as-judge or code review agents, and LLMs are bad at judging things if you ask me, but they are good at reading errors and fixing them. Um, and this is also why you don't necessarily always want to obsess over getting the plan perfect. You just need it to be good enough to get the LLM into this state where it's structuring the right code. It's putting the right things in the right places. It's using the right libraries and the right implementations. And then if there are syntax errors, or the wrong method is used here, or something isn't cloned properly, then the model can fix all that.
>> Yes.
>> Yes.
>> Cool.
>> Let's keep reading.
>> Um, we'll let that one keep running. We'll go back to our plan iteration session. Um, looks like it did that correctly.
Um, let's find that clone.
>> Yeah.
>> Um, yeah. Client name... client.clone(). That looks right.
>> Yes, that looks correct.
>> Cool. And then the composite client total timeout.
>> Um, this part is the part that scares me. Honestly, part of me is like... because I actually don't know how to implement this. So, let me just look at this. Um, go down.
Okay, I guess: sleep until deadline. That's great. Never completes if no deadline. I mean, this will probably work. I think this will probably work.
>> This is all existing code pretty much, right?
>> Yes.
It says "or however client name is accessed." Should we have it go make that correct, or are you fine having it fix that during the test cycle?
>> Um I think that that's actually correct.
>> Okay, cool.
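For anyone unfamiliar with the "sleep until deadline / never completes if no deadline" pattern being read in the plan, here's a minimal sketch under assumed names. This is not the real orchestrator code; it just shows how an optional total deadline can be raced against the composite call.

```rust
use std::time::Duration;
use tokio::time::sleep;

// If a total timeout is configured, sleep for it; otherwise return a future
// that never resolves, so the select! below is decided purely by the call.
async fn deadline(total_timeout: Option<Duration>) {
    match total_timeout {
        Some(t) => sleep(t).await,
        None => std::future::pending::<()>().await,
    }
}

async fn run_with_total_timeout<F, T>(call: F, total_timeout: Option<Duration>) -> Result<T, String>
where
    F: std::future::Future<Output = T>,
{
    tokio::select! {
        result = call => Ok(result),
        _ = deadline(total_timeout) => Err("composite total timeout exceeded".to_string()),
    }
}

#[tokio::main]
async fn main() {
    let fast = async { 42u32 };
    let out = run_with_total_timeout(fast, Some(Duration::from_secs(5))).await;
    println!("{out:?}"); // Ok(42)
}
```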
>> I'm trying to remember in my code.
>> Yeah, if you're like, "I'm not sure," then we can have it go do a little bit more research on the side and figure that out. Um, but yeah, for pinpoint research you can just steer the sub-agents. You don't have to go create a new research document.
Part of me is also thinking that we should actually ignore composite clients for now.
>> For now, just to simplify the plan.
>> Just to simplify the plan. Like, for now, let's just make sure that the actual errors work for primitive clients.
>> Okay. So let's move the composite client timeout and retry implementation into its own phase after phase three.
>> Yes.
>> Yeah.
>> Yes. I feel way better about that. Um, just because I don't know how that stuff is going to work. I want to be able to get errors fully working for the primitive timeouts first.
>> And when you start building... the other thing, too, is you're going to learn things as you do the implementation plan. And so the smaller you can make your phases, the more you can be like, cool, we're going to implement phase one and get that working, and then we're going to go from there. And, um, how do I say it? You may have learnings in the earlier phases that cause you to change the rest of your plan.
>> Can we go see the other one? See if it's done.
>> Yeah, let's take a look.
>> It is.
>> Says it's done.
>> Okay. So, here's how I know what to do now. What I know now is I want to go look at the validation files that it created. I'm actually not even going to look at the diff. Let's just look at the validation file.
>> This is the thing: yeah, if you can read the tests and you know they work, then you don't actually have to read every line of code. This is the really powerful bit. It's in, uh, BAML... it's in baml-runtime... it's in baml-lib, uh, client. Yeah. So, it made some of these. I don't know which ones it made, but it made some.
>> It should have. If it didn't make these, we know we're here. Here we go.
>> Here's the new ones.
>> Okay. Can you disable the syntax highlighting here?
>> I don't know how to do that.
>> Okay. Well, I don't know either. That won't do it either. Um... Oh, just select a different file extension, like text. The LSP is always going to pick it up. Okay. Well... never mind.
>> There we go. The LSP still serves it. It's good.
>> Oh, no.
>> It's fine. It's fine. I'll just ignore it regardless. Um, yeah, there's one pain point that we have to deal with in our repo, and we've got configurations set up.
>> But let's just go read this. So, we have a client. We do this, and then we do this. Cool. It says what? What's the error? "Unsupported property http." That looks like a bad error.
>> Yeah, that looks wrong. It didn't actually pull it in.
Yes. "Unsupported property http." Go on. Let's read the next one. What does that say?
>> "Unrecognized fields in configuration block: connect timeout." This is for the typo.
>> Yes. Um, can you actually tell it to make a recommendation for what field they meant? I don't like this error.
>> Um, it should recommend... Yeah, you should just paste the error.
>> Yeah. Yeah.
>> And just say, "I prefer an error that recommends..."
>> Here it is.
>> Wait, why did that not work?
>> It's here. It's just I was wrapped.
>> Wait, what does it say? First token... connect timeout MS. Wait, why does it say connect timeout? That is valid.
>> There's a typo here.
>> What's the typo?
>> The typo is there's only one N in connect.
>> Oh, okay.
Can you actually tell it, though? Give a recommendation on best match. Yes. And we already have some functions to do that. Um, can you do me a favor? Um, it's fine.
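For the curious, the "did you mean ...?" behavior being asked for is typically a small edit-distance match over the known field names. The stream notes the repo already has helpers for this; the sketch below is only to show the shape of the improved error, with an assumed threshold.

```rust
// Tiny Levenshtein distance, purely for illustration.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Suggest the closest known field if it's within a (hypothetical) distance of 3.
fn unknown_field_error(found: &str, known: &[&str]) -> String {
    match known.iter().min_by_key(|k| levenshtein(found, k)) {
        Some(k) if levenshtein(found, k) <= 3 => {
            format!("Unrecognized field `{found}`. Did you mean `{k}`?")
        }
        _ => format!("Unrecognized field `{found}`."),
    }
}

fn main() {
    let known = ["connect_timeout_ms", "request_timeout_ms", "first_token_timeout_ms"];
    println!("{}", unknown_field_error("conect_timeout_ms", &known));
    // -> Unrecognized field `conect_timeout_ms`. Did you mean `connect_timeout_ms`?
}
```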
>> I'm, uh, giving you the gift of taste.
>> Okay. Well, go back.
>> Okay. You want to read some more?
>> Yep.
>> That looks good. And again, these examples execute, right? When you run cargo test, it actually like makes sure these errors match.
>> This is good. This should be a map. This
is fantastic.
Error for Oh, what does it say?
>> Total timeout is not supported in, like, low-level clients. This is only for composite clients.
>> Yes.
Cool. "Unsupported property http." We should...
>> So some of these didn't work.
>> Yes. So we should figure out why. Oh, it's because validation for those clients is different. So we should tell it that.
>> Okay. So the error in this file is incorrect. Um, the validation for the orchestrator clients is different, and so we need to make sure that HTTP is supported in those as well.
>> It's for fallback. It's just the composite clients. And this is just... if you have a good testing framework, this is much faster to test, because I know for sure that the code is running and parsing everything accordingly.
>> Yep. Okay. We're getting high on context as well. So, um, I am...
>> This takes a while to run the tests. I'm going to have to restart it. It's still doing the previous thing, so I don't want to interrupt it.
>> Cool.
So at this point, what we have is: we have the parsing working. We have the data plumbed through the right classes, probably, and I say probably because I haven't actually looked at the code, but I do know the parsing is working. And we've plumbed through the most important part, which is the user syntax into the data model.
>> Yep. Can kind of skim through here.
Cool. Building programming languages is fun.
>> Yeah. It's interesting at the very least.
>> Is it normal for this to have no output?
>> Yeah.
>> Because we did a grab. Yeah. Okay.
>> Yeah. Yeah.
>> Cool. Our new... yeah, the tests are failing because our new error messages are different. So, it's going to run it with UPDATE_EXPECT. Let's see what the new error messages look like.
Yeah. Look at that. We got our fuzzy matching working.
>> Yeah, it...
>> This looks better, right?
>> It's slightly better. There are still better things I would do, but I'm okay with this for now.
>> Okay. I mean, that's the kind of thing... we've seen that it's easy to polish that part. Cool. I would just let the last thing rip, even though I...
>> Yeah. Yeah. No, I sent it.
>> Okay, cool. Just going to go work on that. Um, here's our plan iteration.
>> So, new structure, parse and validate, define error types, implement basic timeouts, implement composite timeouts, implement streaming timeouts, testing and documentation, runtime configuration.
>> Yes, good.
>> I actually will probably do the streaming phase, five, before phase four. That's actually the only difference I'd make. And the idea is I want to get the primitive clients fully working end to end
>> before I touch anything about the composite client. This also makes testing a lot easier.
>> So that's another guideline. Um, changing the spec. Um, we'll have it as guidelines to add to create_plan.md. This is the thing you should always be doing when you're working with this: when you find a thing like, "oh, I always want the model to do things X way," you want to move it into your infrastructure, which is your commands and your agents. Um, so number one was guidance on cargo test, generating new error messages, tests, etc. And then the other one was: always implement basic HTTP, then streaming, then composite clients, in that order.
>> Yep.
>> But I personally probably wouldn't put this in CLAUDE.md, because it's so specific to this thing.
Yeah, there's somewhere you might want to write this down or keep it handy. You may also have prompt snippets that you store, and that's just a thing for whenever you're changing the HTTP client logic.
Okay. So, yeah. Okay. Basic timeouts, streaming timeouts, composite client, testing and docs, runtime configuration. I don't know what phase 7 is, but we'll get there. All right. You want to read phase four, then?
>> Yes, please.
>> Cool. Time to first token and idle timeout. Here we go.
>> Okay, this looks pretty good. Okay, I see. Okay. Yep, that looks good.
>> Literally counting clocks. Clocks are
hard. Okay. So, if this resolves and we haven't gotten a token, then we go.
Otherwise, it's the other one.
>> Yep.
>> Cool.
>> Otherwise, we just get the next token.
We just keep alive.
>> Yeah. Okay. So, then we set this, then we have to keep alive.
>> Cool.
>> Yep.
>> Cool. And yeah, look, I like how it's staying brief here. Cool. And then in each client we have to pass in the HTTP config to the stream handler.
Okay, sounds good.
>> This looks good to me.
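To make the logic just read concrete: one deadline applies until the first chunk arrives (time to first token), then a fresh idle deadline resets on every chunk (the "keep alive"). This is a hedged sketch with assumed names, not BAML's actual stream handler.

```rust
use std::time::Duration;
use futures::StreamExt;
use tokio::time::sleep;

async fn drain_stream<S>(
    mut stream: S,
    first_token_timeout: Duration,
    idle_timeout: Duration,
) -> Result<Vec<String>, String>
where
    S: futures::Stream<Item = String> + Unpin,
{
    let mut chunks = Vec::new();
    let mut current_timeout = first_token_timeout; // stricter until first token
    loop {
        tokio::select! {
            maybe_chunk = stream.next() => match maybe_chunk {
                Some(chunk) => {
                    chunks.push(chunk);
                    current_timeout = idle_timeout; // keep-alive: reset per chunk
                }
                None => return Ok(chunks), // stream finished normally
            },
            _ = sleep(current_timeout) => {
                let which = if chunks.is_empty() { "time_to_first_token" } else { "idle" };
                return Err(format!("{which} timeout of {}ms exceeded", current_timeout.as_millis()));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let s = futures::stream::iter(vec!["hel".to_string(), "lo".to_string()]);
    let out = drain_stream(s, Duration::from_secs(10), Duration::from_secs(30)).await;
    println!("{out:?}"); // Ok(["hel", "lo"])
}
```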
>> Okay, I think we're still on phase one over here, but I'm just going to do infinite bypass.
>> Yeah. And then there is a technical design thing that I need to think about, which is: what do we want fallbacks to do on certain types of errors? And I think we want the fallbacks to treat time-to-first-token errors, and timeout errors generally, as things that fallbacks consume and don't forward up. You just get told that a timeout happened and then you just move forward. So I need to think about how composites will play into that. But I think this will just work as is right now.
>> Um, sorry. So it's like: how do composite clients handle timeouts from children?
>> Exactly. And it's not just composites, it's retry policies as well. Cuz one thing you could say is we want to error out completely on this. But I think what we're going to do with timeout errors is not what we do with abort. On abort we want to exit fully.
>> Yeah.
>> Timeouts, we don't.
>> Yeah. I think a timeout... Well, sometimes a timeout is like, I want to guarantee that I actually get a response to the user in some amount of time. But I think in general, yeah, timeouts are retriable if you're dealing with shoddy upstream infrastructure.
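A very rough sketch of the policy being debated (all names assumed, and the open design question is left open): an abort bubbles out of the orchestrator immediately, while a timeout is swallowed by fallback/retry and the orchestrator simply moves on to the next candidate.

```rust
#[derive(Debug)]
enum CallError {
    Aborted,
    Timeout { client: String },
    Http { status: u16 },
}

fn should_try_next_client(err: &CallError) -> bool {
    match err {
        CallError::Aborted => false,                  // user cancelled: exit fully
        CallError::Timeout { .. } => true,            // consume and fall through
        CallError::Http { status } => *status >= 500, // e.g. only retry on 5xx
    }
}

fn main() {
    let e = CallError::Timeout { client: "MyClient".to_string() };
    println!("try next? {}", should_try_next_client(&e)); // true
}
```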
>> Yeah, exactly. Um, and then this should almost be done.
>> So, let's see. It's doing the expectations now.
>> Yeah, it's doing the other one where it's actually running like the HTTP one >> generated two warnings.
>> The problem with Rust, honestly... the worst part about Rust is... yes, when it compiles, it does work.
>> Uh, it is kind of slow to compile. Not
allowed for composite unsupported property. Yeah, this one's still wrong.
property. Yeah, this one's still wrong.
Unsupported property. Yeah, it's still not figuring this out. So, we've given it feedback and it hasn't figured it out. And so, my take here is we actually
out. And so, my take here is we actually need to go update the plan more um or add like a phase 1B that is like fix this thing before we move on.
>> Yeah.
>> You see what I mean? Like the composite clients are still not able to fix this.
Yes.
>> This is correct. But this one, unsupported property, is still the error. It looks like something actually died and it didn't make the updates.
>> Okay. Well, while this is running, we'll see how this runs.
>> Yep. Most tests are passing. All validation tests are now passing. Oh, this is valid. Great. "Type field name not allowed for composite." Okay, this is better. This is the right error. It's not failing on HTTP now. So, it did fix it.
>> Exactly.
>> Okay, cool.
>> Um,
>> amazing. I'm going to commit this.
>> Do it. And that's good. That's a good thing to commit. So now we're done. We're done with phase one.
>> Yep. And I will leave all the files and specs and stuff in there. You can go clean them out of the PR if you want to.
>> Yeah, I will do that. Can you show phase one so people can see what the checkboxes did?
>> Yeah.
>> Yep. And so the system will actually automatically check off this stuff for you, so you actually know what's done.
>> I actually just... Go ahead. I was going to say, the implement plan prompt has a little bit of extra steering in it to basically make it able to resume. So if it has existing check marks, trust that that's done, pick up from the first unchecked item, and verify previous work only if something seems off.
>> Yeah, exactly. So if we go back and take a look at this, we can remove the manual verification. We're done. Or check it off yourself, and you probably want to delete "stop here for human input."
>> Yeah. Okay, let's go compile the rest of this. Can we look at the compilation steps in error type definitions?
>> The... Oh, the verification steps. It's going to try to run this with make again.
>> Yes. So, you should tell it how to do this.
>> Sorry. I'm just going to do this.
>> Okay. Yeah. I didn't know that you use make so aggressively in your code.
>> I like make because it's language independent and we use a lot of different languages.
>> We do too, but it's interesting. I wish there was a way to just not load this, because now your Makefile is not going to have all the stuff that you need.
>> That's fine. I'm going to keep this one in. Yeah, this shouldn't be in my global thing anyway. That's better. Cool. There's your secret alpha for the day: snippets of my CLAUDE.md.
>> There you go. Um, cool.
>> All right. So, uh, we use this commit command that kind of splits things out into logical commits.
It's fine. I probably could have just told it to make one commit, but, um, are you ready to jump to phase two?
>> I am. So, we'll just do phase two.
>> Let's do it. Um, we haven't read the later phases, but we'll kick that off and then we'll go back and read the later phases of the plan. Cool. So, um, BAML 1630, phase 2, only doing phase two. I guess that's fine.
>> That's probably good. Let's rip, baby.
You got this.
>> Um, you might also want to tell it something else. Um, yep. Which is the way that we run tests. So, let me send that to you in Slack. That way you have it for Python.
>> Um, yeah. Actually, before we start this, we should update the plan to have that. You want to Slack it to me?
>> Yes. Um, I will give this to you. I'm giving it to you. One second.
>> Elder, for your question about Sonnet 4.5 versus Opus: Sonnet 4.5 writes good code, but when you want to reason over a long, complex codebase, you really want to be using Opus most of the time.
>> All right, I sent it to you. This is how we run... let me tell you how we run them, and just tell it to not run the TypeScript tests. It'll just take too long.
>> Are we also doing cargo test?
>> Yes.
>> Yes. Uh, we don't really... Yeah, probably just to compile-check; the previous tests should still pass.
>> Yep. Okay, cool. Yeah. So, we'll update that in the plan and then have a look. Cargo build, cargo test, do this, do this, do this.
>> Do you want to do the ruff check?
>> Yeah.
>> Okay.
>> That's how you check that everything is compile time working.
>> Do you want to do the build here or skip all TypeScript?
>> Just skip all TypeScript. Uh, yeah, exactly.
>> I assume if we weren't on a stream, you would run the TypeScript tests.
No, I probably would not. If Python compiles and TypeScript doesn't compile... that's really rare. It's probably good enough. I'd run it at the very end because, again, my most important resource is my time. So I'm optimizing for my time here. I have a pretty good proxy that the tests will probably pass. And I know I'm going to run them.
>> If the Python's good, then you know the TypeScript's good.
>> Yeah. And I know I'm going to run the tests eventually. So it's not like I'm not going to do it.
>> Cool.
Amazing. No, I run Claude Code directly on my machine. And the reason is, honestly, even if I'm using Opus, I am still restricted to the directory I'm operating in. So it's okay. If the model does end up doing something really malicious, I think you could just blacklist certain commands.
So, for example, I know Dexter did something interesting. Type Python into your shell, Dexter.
>> Uh oh, I had to get rid of that.
>> Okay. Uh, but yeah... Dexter had a command where if you typed Python directly into your shell, it would shim it. Um, it would shim Python to echo "use uv instead," to make sure that the model could never do anything wrong.
Um, and that helped a lot
>> because there's Python in the Claude default system prompt.
>> Yes.
>> But now if I run Python, >> get the reload source.
>> It's in there. It's because the first one isn't found. It's not a real thing.
You have double slash that goes to root.
>> Ah, all right. Well, I'm not going to go messing around with my files. But yes, that's the basic idea: you just replace python. The first thing in your path should be something else.
>> Yeah, that way if the model does it, it just will not trigger the wrong Python. So I think it's the same for all Claude Code stuff.
>> Yep.
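For anyone who wants to try that shim, here is a minimal sketch, assuming a uv-based workflow and a directory like ~/bin that sits first on your PATH; the filename and message are illustrative, not the exact script from the stream.

```python
#!/usr/bin/env python3
# Illustrative shim: save as ~/bin/python, chmod +x it, and make sure ~/bin
# comes first on PATH. Anything (agent or human) that runs `python ...`
# gets nudged toward uv instead of whatever interpreter happens to resolve.
import sys

sys.stderr.write(
    "Do not call `python` directly in this repo.\n"
    "Use `uv run python ...` (or `uv run pytest ...`) instead.\n"
)
sys.exit(1)  # non-zero exit so the agent treats the call as a failure and retries with uv
```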
Okay. So this one is ripping. That said, I have had a friend who actually had Claude Code or, I think Cursor actually, not Claude Code,
>> run rm -rf
>> uh, in a very scary directory: the home directory, tilde slash.
>> Yeah, you shouldn't do that.
>> And it did that because it actually made a directory called ~ by accident in the local folder, and then it started running that as a shell command to delete that directory, and that is, um, not good.
>> Do not want that. Yes. Internal
monkeypatch.py.
>> Yeah, we've done some interesting stuff.
>> Does look right.
>> Yeah, I mean that stuff looks right.
>> Yeah. Okay. Um, cool. We are at hour two and we're getting through it. Um,
we should think about what's next.
>> I think what's going to end up happening is we'll get the timeout implementation.
Um, and I think it should work for the connect and request timeout.
>> Mhm. Um
like as long as the timeout can Can you go back? Let's just make sure the request timeout is actually
>> this phase three stuff
>> is only doing the basic stuff. Yeah.
>> Yeah.
So here's our HTTP client. We've read
that a bunch. Here's our request implementation and then we do that for all the providers and then we detect timeout errors in the execution.
>> Yeah. Cool. That's correct. We should get rid of the WASM stuff there. Get rid of the WASM stuff.
>> Yeah, we don't need it. It'll just work.
>> Let's remove the WASM stuff from phase three. Just leave it out.
>> Yep.
>> This is crazy.
>> It's going to distract it.
>> Um, >> can you go up really fast? I want to see a couple more things. I want to read this error code. Detecting timeouts. Go
down. I want to read this.
>> Um, request timeout. Error code timeout. That's good.
>> Okay.
>> Yep. Okay. Otherwise, we just do one from the status.
>> Oh, [ __ ] What was the code there before? Can you Command-Z this? I might have messed up. Maybe we did one.
>> Uh, we can go see what the diff was.
>> Oh, no. We do want that WASM stuff. I was wrong. Oh, this is undo. That was the original code in there. I thought it was added for this spec, but that's actually the original code in there.
>> Yeah, sometimes it's not 100% clear what it's doing. It'll reproduce large blocks of the code. It's not always clear. I think this is the only thing that's added. Is this it? The let-else.
>> Yes, exactly.
>> Um, cool. And then is fine.
>> Yeah, that's fine.
>> Um, okay. Let's scroll down.
This looks good. Timeout error code, I guess. 408. I don't know, what's a timeout error code? Is that 408? HTTP request timeout, I guess. So,
>> I don't think it's an HTTP code. Is it?
>> I don't know. I'm going to Google that.
>> There you go.
>> Hey. Yeah. Request timeout.
>> Okay. I've never seen this code before.
It's one of those
>> Well, HTTP codes are very fascinating. I don't know what all they have.
>> Yeah.
>> Let's see the
>> one
>> handle timeout in the orchestrator.
All right. So, we get the response. If we get a timeout response, then we expose it to the client as a timeout.
>> Timeout error.
>> Perfect.
Otherwise, we do that. Otherwise, we do that, which is perfect.
>> Cool.
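To make the shape of what's being reviewed concrete, here is a rough Python sketch of the same pattern using httpx; the BamlTimeoutError class, the option names, and the 408 mapping are illustrative assumptions, not the actual Rust implementation in the BAML repo.

```python
import httpx


class BamlTimeoutError(Exception):
    """Illustrative stand-in for the timeout error the orchestrator would expose."""

    def __init__(self, message: str, status_code: int = 408):
        super().__init__(message)
        self.status_code = status_code  # 408 Request Timeout is the closest HTTP status


async def call_provider(url: str, payload: dict,
                        connect_timeout_s: float, request_timeout_s: float) -> dict:
    # Separate connect vs. overall request timeouts, mirroring the two client options.
    timeout = httpx.Timeout(request_timeout_s, connect=connect_timeout_s)
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            resp = await client.post(url, json=payload)
    except httpx.ConnectTimeout as e:
        raise BamlTimeoutError(f"connect timeout after {connect_timeout_s}s") from e
    except httpx.TimeoutException as e:
        raise BamlTimeoutError(f"request timeout after {request_timeout_s}s") from e
    resp.raise_for_status()  # non-timeout HTTP errors propagate as usual
    return resp.json()
```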
>> Yeah. Does this look right?
>> Um, I would add some more tests. I want a phase before we do streaming timeouts, like phase 3B,
>> almost
>> which should add an end-to-end pytest inside of the Python integ tests. Well, I would do phase 3B.
I would make it a separate phase, an integ test inside of Yeah, the Python tests.
>> Yeah.
uh, to validate the timeout error.
>> Uh, yeah, use a pattern-finder agent to see how this works,
>> and then, yeah, and then also give it the uv command that I gave you earlier to actually run it. How uh, yeah, there you go.
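As a rough illustration of what that phase-3B pytest could look like (the client factory, option names, and BamlTimeoutError import below are hypothetical placeholders, not BAML's actual generated API; async tests assume pytest-asyncio):

```python
import pytest

# Hypothetical helper standing in for a BAML-generated client configured with
# deliberately tiny timeouts; a real test would use the generated client from
# the Python integ tests instead.
from my_project.client import make_client, BamlTimeoutError  # hypothetical imports


@pytest.mark.asyncio
async def test_request_timeout_fires():
    # 50 ms is far too short for a 500-word essay, so the request timeout should trip.
    client = make_client(request_timeout_ms=50)
    with pytest.raises(BamlTimeoutError):
        await client.write_essay("Write a 500-word essay about HTTP status codes.")


@pytest.mark.asyncio
async def test_connect_timeout_fires():
    # A non-routable address plus a tiny connect timeout should trip the connect path.
    client = make_client(base_url="http://10.255.255.1", connect_timeout_ms=50)
    with pytest.raises(BamlTimeoutError):
        await client.write_essay("hello")
```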
>> Amazing. Um, great, I can see HTTP config as a default. I'm wondering
okay, yeah, so we have done some tests now and it's finding a bunch of issues that we're going to go sort out. Wow, 300 lines of errors.
>> Uh, this is the problem with Rust.
They'll figure it out.
>> Yeah. Okay, we're already very high in the context window, so I might actually stop it. There's a command I use called slash continue.
>> I would let it go on for a little bit longer.
>> Nope, I'm not, because they changed the compaction window. This is going to compact in like the next two seconds.
So, we're going to do a manual compaction.
um, create a handoff prompt about where we are, and it should start with slash implement plan and include a ref to the plan file. Unfortunately, we might compact as part of this, which would be frustrating, but let's see if the context goes back down. Yeah, see, it already compacted itself. [ __ ] Oh, I hate it. We're still going to do a handoff and start over.
>> Okay, >> so this is the prompt it's going to use.
Current status is mostly complete. Needs
final testing and verification. We did
all the stuff. Yes. Okay. Yes. So, we have a CLI for Code Layer that I use that basically lets you have a human in the loop for your compaction, because it's going to run the CLI to launch the next session.
>> So, Oh, it's just going to launch that for you.
>> Exactly. Um, so you let the model create the prompt and then it's going to use the CLI to do it. Um, this is kind of a thing I've been playing with recently that I've been enjoying. It's kind of
somewhere in between clear and compact.
Um, but it's a nice form factor. So,
yeah, this launched another one. And now
we should have there's a state thing here, but it created this new one.
>> That's cool.
>> Um, yeah. So, here's the prompt that the other model generated. We're using
implement plan. Here's the plan. And
then, yeah, we go from there.
>> Nice.
>> Um, cool. Okay. So, phase two is rocking. This one we're going to axe. Um, and this one is looking for Python test patterns. So, I think it should have found them. Oh, Opus. Come on.
>> Are you out of opus?
>> No, they're API erroring. This is why we can't have nice things.
Let's see.
>> This is actually why I really like, uh, Codex. It actually does not go down as much.
>> Let's see.
Yeah. Okay, it's back. It's just a little spotty. I am hoping that the context from this agent that took 3 minutes searching for patterns is going to be available in our context window.
Yeah. Okay, cool. It says based on the patterns I found. So, I think it did find things.
That's good. Just going to tee up another one in case we need to make more changes. Oops. Yeah. Okay, that's fine. Nope, it's not. I didn't mean to send it. Front ends are hard. This one's still cooking.
This one's ready. Okay. So, major updates. So, phase 3B, integration testing for timeout errors. How's this look?
>> Okay, that looks correct.
That looks correct. 50 ms timeout. That looks correct.
>> A 500-word essay. Cool.
>> Um, and that should fail.
Uh, go down.
>> That's right. Provider OpenAI. Should this be
>> That's fine. Okay.
>> Um, let's read this.
>> Yep. That should read timeout. Perfect.
Um, >> nice.
>> You're going too fast. I was sorry >> that one. Yes. Fails. Yep. Accounting
for overhead of any kind. Why not? That
makes sense.
>> Abort after delay. Abort should still, um,
>> should not be a timeout error. This is abort takes precedence.
>> Yes.
Uh, because when does a timeout error happen?
Oh yeah. Yeah.
>> Looks like at 100 milliseconds, abort takes precedence. That's amazing. Yeah, that's really good.
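A hedged sketch of the precedence rule being checked here, assuming an asyncio-style client where a caller-initiated abort should surface as a cancellation rather than being masked by the timeout machinery (all names are illustrative, not BAML's actual runtime):

```python
import asyncio


class TimeoutErrorIllustrative(Exception):
    """Illustrative timeout error, standing in for the real client timeout error."""


async def run_with_timeout_and_abort(coro, timeout_s: float, abort_event: asyncio.Event):
    """Race coro against a timeout and an explicit abort signal.

    Precedence rule from the review: if the caller aborted, report the abort
    (CancelledError), even when the timeout would also have fired.
    """
    task = asyncio.create_task(coro)
    abort_waiter = asyncio.create_task(abort_event.wait())
    done, _ = await asyncio.wait(
        {task, abort_waiter}, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    if abort_waiter in done:        # abort wins, even if the timeout also elapsed
        task.cancel()
        raise asyncio.CancelledError("aborted by caller")
    abort_waiter.cancel()
    if task in done:
        return task.result()
    task.cancel()                   # nothing finished in time: report a timeout
    raise TimeoutErrorIllustrative(f"timed out after {timeout_s}s")
```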
>> Gone. Synchronous. Yep.
We also time out in synchronous clients.
That's perfect.
Streaming. Streaming should also time out. That is perfect. Um
>> and then we haven't implemented this.
>> Yep. That's correct. That's correct.
Include. This is the power of the codebase pattern finder.
>> Yes.
>> Is like if you're like I need to do this thing and I know it wasn't in the research, you can just steer it, go find the pattern and then use that.
>> Yes, that is correct.
>> Okay. Um, and then updating the BAML config for tests. So this is the timeout test clients. This is what gets used by the Python tests.
>> Yes, it'll figure that part out. That part looks about right. Uh, yeah, there you go. It figured it out.
>> Cool. Sick.
All right. Um,
>> and I think this is kind of the point, like if I didn't know that we needed these phases, we could be endlessly spinning for a while, but I'm like, "Oh, I want the Rust code to compile, then I want the Python test to compile."
>> It's the same way you would engineer if you were writing the code yourself.
>> Exactly.
>> Um, I think phase two is done. What do
you want to look at to verify it manually?
>> Um, just look at the diff in VS Code.
Please don't show me the git diff here.
I want to see what files changed. Show
me the tree version.
>> This? No. How do I show you the tree version? View as tree. Okay, cool. So, we got this. We got this.
>> Okay. Yeah, it's just doing some like Oh, you didn't get the commit from last time. Interesting. Okay, that's fine.
>> Looks like it didn't do all the providers, but maybe OpenAI is like the base for everything.
>> Um, I think this is just mixed in from last time. It just fixed some compile bugs that your git commit didn't get.
>> Yeah. Okay, this looks about right.
>> Patch. Yeah, here we go. TypeScript errors.
>> Silenced, so you guys don't hear my text notifications all the time.
Um, okay. Cool. Cool. Uh, all right.
Let's start phase three.
>> Cool. And I think the really nice thing that a lot of people underestimate in this workflow is just how independent each step is. And for context, this feature, as you guys can tell, is very nuanced. There's a lot of stages to it and it would have taken forever to implement this
>> and by forever I mean it probably would have taken an engineer on my team
>> A couple days.
>> Yeah. One day, maybe two days of time, and that would be like a good implementation. I would personally be satisfied if it took me two days to go from nothing to it working end to end. If we can get this thing working with me and Dex spending three hours here, this is clearly a win.
And this is like while we're live streaming and also like not even trying to do other things and be super optimized here.
>> I think if it was just you and me and there was no stream, we weren't taking questions and trying to talk through the thought process, we could use the downtime instead of explaining what we're doing. We would use the downtime to go work on a separate feature in parallel. Phase
three is about implementing the actual timeout functionality.
Um, and then timing-wise, I have this on till 1. Um, I'm planning to commit and push what we get. We can either do a part two next week, or you can pick this up and run with it yourself. I don't know. I don't know what you want to do.
>> I think if we get this working end to end in Python, I'm actually very happy. If this works, this means that
>> like the other reason I broke this out
>> Yeah.
>> Is that I can actually ship this and make it usable by users very early on without putting the whole composite features and everything else in there.
>> You would ship these new fields even if they didn't work for streaming and composite.
>> Well, the
>> I guess it's two separate features.
>> Exactly. So composite comes as a separate feature. Exactly.
>> But I would actually ship it for individual clients because that's still worth it.
>> But you need language support for every language, and you'd need all three of those fields to work in every path so that you could document it.
>> So I would do phase 3B, phase 4 with the streaming support, and then I'd merge
>> and then you ship it.
>> Yes, that'd be enough. I don't need everything all the way end to end working. I don't need composite features to work. What is this? Runtime configuration and advanced features. Is this nonsense?
>> No, this is correct.
>> Ah, okay. This is like the dynamic client registry stuff.
>> Yeah, exactly.
>> Oh, it's very
>> I would do this stuff. It is very fast. It just plugs into our codebase.
>> Um, most of what we have actually works really simply with the whole system. It's designed in a way where it should be easy for AI to add these features, or humans to add these features, which is why most sections you see are very short. We spend a lot of time on architecting our codebase. I would say if you're vibe coding from scratch, this approach will probably not work as well, because your codebase has no concrete architecture. Like, for example, you know the place where it picked out the wrong error
>> for the property types, where it was recommending messages.
>> Yeah.
>> That's actually incorrect, the way it's done. We actually have a different way of doing it that will give you recommendations. There's a standard format that we have that says: here are all the options, here's the thing the user typed in, pass in the error. I don't want it to use a new method. I want it to use that original method. I just don't want to derail the whole conversation to fix that.
>> Well, we got there eventually anyways.
>> No, it didn't, because it didn't use the original method. There's a method that we have that you have to use that actually does the formatting for you for that kind of stuff.
>> Okay. But the actual error messages we got are correct. So, you're saying it just rewrote that logic itself somewhere?
>> It rewrote the logic with a shittier version of that logic.
>> Interesting.
>> I have good logic that is actually like way more battle tested for a lot more edge cases.
>> Okay.
So, that's kind of how I would think about it, because it does a whole bunch of stuff around preference and ordering and stuff there.
>> Yeah.
>> Um, and I don't want that to be the case.
>> Okay. But that's that's the kind of thing that is like, okay, once we got the whole thing plumbed in to end, that's a really easy refactor.
>> Exactly. It's not even a refactor. It's actually like, I will use AI to solve that problem, too. But what I don't want to do here is lose the train of thought over details that I know don't really matter.
>> So, we're at 60 again. I'm starting to get a little context-anxious again.
>> Yeah, the problem is our codebase, like I said, is massive.
>> Well, also this plan file is getting quite long, and so it's reading the entire plan file when a lot of this is stuff that's already done and working. It may even be worth compacting the plan file.
>> I have found that that has led to worse results than letting the context window auto-compact, personally.
>> Okay. Um, and what I would really do here is I'd probably do all the request timeouts with every single client with, um, with sub-agents. That's what I probably would have done.
>> I'll have it actually go do the implementation with sub-agents and do the testing with sub-agents.
>> Yeah, that's probably what I would have suggested; it's better if you really care about compaction. Honestly, I probably wouldn't even do compaction here. I would just let it rip.
>> I'm going to compact cuz we got pretty See, it already auto-compacted.
>> All right, we got 30 more minutes. I think my goal is that by the end of these 30 minutes it should fully run. I think it's good to give it the other phases, because then if it has that, it knows what it can do. Personally, let's keep running this thing.
>> Yeah, we'll just let this run. Um, actually, the reason you can't use sub-agents here is very sad: cargo takes a lock on the system. It can't lock. It's actually really annoying. And also when you're updating a codebase individually, it's very annoying unless you have separate packages that you're testing in, because it's all within the same package. You can't really do it in any more parallel, sadly. And incremental compilation is not a thing that Rust really has in a great way, sadly. Yeah, but I don't know if they'll be able to do it, sadly.
Honestly, I actually spend very little time thinking about what model to use. I just let it rip, and then if I run into problems, I will deal with it.
Well, Sonnet 4.5 is supposed to be good, so I think it's worth it. I actually think people underestimate how important speed is. I think most models can write code pretty well. So I probably would use Sonnet 4.5 or even the smaller ones. I actually use the, uh, shittier Sonnet models all the time. They're good enough. Your philosophy is you use Opus always; personally, I have a pretty good proxy of when to use it and when not to, and I get really good results without using Opus. I think it can do way harder problems than that too, personally. Yeah, I don't know, that's just my personal take though. I found it to be fine even for really hard problems, like super long ones. It is all vibes too, no?
I think most prompting, in my honest opinion, whether it be at the application layer or using coding agents, is all vibes. And the best thing you can do is build a really, really good vibe checker in your own brain of whether or not it's working. And if you can do that, that will likely be better than any system you build, because there's always this joke of: I spent four hours automating a 10-minute task. I think it's the same with prompting for most tasks. You're going to spend weeks setting up an eval system for what was a one-day prompting problem. And the infrastructure is not worth it most of the time.
>> Well, no screen share. Yeah, because there's literally just nothing to do. You just have to wait for the code.
>> Yeah. So, I'll tell you my take on this. I've tried showing my team the prompts, and usually what I find with any workflow that I give to my whole team is that engineering is such a diverse medium that you usually don't have one thing that fits all. For example, we all use VS Code, but we all use VS Code in totally different ways. We all have terminals, but we all use terminals in totally different ways. The reason VS Code and terminals work, I think, is because there's one medium that they have, which is reading and editing files, which is a really common pattern, but the actual mechanism that you use to read and edit files varies dramatically between every single engineer I've ever met. Or, for example, you use Vim keybindings, I use regular keybindings. Engineering is such a dynamic discipline that there's no one homogeneous way to do things for almost every task. Uh, we have workflows, system workflows like GitHub pull requests, and everything that we do clearly has more systematic ways of doing things, usually, but even then every team has different ways of doing this. It's not really prescriptive; it's usually just like, hey, thou shalt code review, thou shalt do certain things, but the how is very loosely defined. And I think for prompting and everything else, as hard as I tried to get my engineers to all do the same exact thing, it actually turns out that they're better when they're doing the thing that is kind of an offshoot off the golden path that works for them and makes them happy and enjoying the work.
>> And then same with prompts: I think there are prompts that certain engineers like because it works with their personal style of workflow better. So that was, I think, one of the key learnings that I've had, which is that if I press one workflow on people too much, then they won't really have fun anymore. And part of software is not just being a machine that spits out code. It's actually enjoying the work.
>> And that's almost like Yeah, it's like coming to the conclusion yourself rather than being told what to do. You made two plans.
>> Okay, let's keep going. Let's see where the implementation's at. Let's see what it gets.
>> Yeah. Now it literally should just do a quick compilation step and then we should write some pytests. And honestly, if all the pytests pass, I'd be surprised. I'd probably run it manually myself because I don't really trust it. Once it tells me that it passed, then I would go see if it actually passed, just as a quick sanity check. You know, I don't want to be a schmuck that's just like, "Ah, the AI told me it did." While I do kind of believe it, for really serious stuff, even if my engineers say the code works, or on any team I've ever worked on, if someone that I'm code reviewing said the code works, for really serious stuff, I will go validate it myself.
>> Exactly. Because in the end, I'm still on the hook for the code working, regardless of whether I wrote the code or not. And I didn't do it before for the previous test because I trusted that system a lot more than this one. This is like an end-to-end test. I'm like, ah, I don't want to think about it. Let's really make sure that this works. But sadly, we just have to wait. I really wish there was a better thing to do than just wait. Um, no, I don't know, I need to think about this more. That's also why there's a deeper problem here that I need to think about how to solve. Um, don't update the plan. I
would Oh yeah. Okay. Oh, so you're not going to auto approve. Interesting. But
yeah, I mean, you told it the wrong thing, so it makes sense. It'll fix
them.
It's going to run. No, this is going to run. I'm like, it's so close. As soon as a pytest command runs, trust me, I think it's going to work. Uh, that part is fast. It'll just compile. Once this compiles, the Python tests should just work. It's going to be better than you think. Um, so I'll stick on a little bit longer, just until we get the Python tests running. If we don't get the Python tests running within the first three to four minutes of it executing, then we can call it. But I think it will run. I'm very optimistic actually. I think this workflow has worked for me really, really well.
Um, and I think what's cool for everyone here is, like, look, I know BAML. Dexter doesn't know the BAML codebase. And really all we're doing here is I'm just reading the plan. He's writing all the code. I haven't touched a single prompt. And Dexter's basically paraphrased some of the stuff that I've done, but in reality I haven't done most of it. It's
>> Yeah.
>> Yeah, it's fine. So why do you do the continue all the time?
>> Is that actually true? Are you sure it deletes all the prompts and not just summarizes the last few messages in some interesting way? Okay. Yeah, I would just double check. I suspect that they probably do something a little bit more clever than just summarizing everything here.
>> I think that's the one other thing that I've learned about myself. Um, wait, why does it say something about Ruby FFI? I mean, I just don't care about compiling Ruby.
>> The auto-compact generates a summary of about 3,500 words. You can ask the new agent to write out word for word what it was fed and provide it as a file. Uh, yeah, then you can see the auto-compact.
>> Yeah, I think it's because it's just trying to do compilation stuff.
Yeah. Okay, now it's reading the plan. That's good enough, to be honest. That is the core part. If that part passes, then the rest of it will Yeah. Yeah. Exactly. Yes. I'm very excited for this.
If this works, it's gonna be super exciting. And again, I think what's really fascinating about this whole thing is that this is so, so much faster than the old workflow. All this is going to do is literally write the Python integ tests and verify that the integ tests pass. Um, and if this works, that means the whole pipeline worked, and we now have timeouts, at least for part of the codebase, like for primitive clients. We now have support for connection and, uh, idle timeouts, which are huge like connection and request timeouts. Yeah, the built-in timeouts are whatever, but this part I think is the most fascinating: how well does this work? And if this works, it's golden in terms of outputs; this will work fantastically. I don't know if it ran the maturin command. Did it? Can you scroll up and see?
It did. Oh, it did not go up. Uh, tell it you need to run the maturin command. The maturin command. Just say maturin. M-a-t-u. Yeah, that one. Just tell it. It'll figure it out. And then you do need an OpenAI API key. Uh,
you do? I think so. So, you probably want to stop screen sharing.
Yeah, >> because that's going to fail for a variety of different reasons.
>> I have not.
>> Yeah. Nice. So this will
>> Yeah, we're actually working on fixing this. So this is, um, much faster, because pyo3 and maturin are just too slow. Yeah, exactly. So it'll fix itself. I think we're going to be done soon. I'm very optimistic. We'll see if it pans out. Yep. Exactly. Yeah, the
diff. That's correct now. Yeah, it'll figure it out. This part I feel really confident about. It probably had to read the file and be like, "Oh yeah, exactly. It got the better version of it." Yep.
Now the Python compiles, so we know that the compilation works end to end. Yep, I think all things should be in Rust. All dev tools should be built in Rust. I have a strong opinion on that. Well, application code, write it in whatever you want, but dev tools.
There we go.
That works. Run it yourself. Yeah. I want to go run the test, and I want to go Oh, I mean, I guess we can just look. I want to run the test with, like, a -t thing.
Uh, do- Yeah.
Oh, it is dash v, dash dash VVV. Oh, it responded too fast.
Oh, I think so. What's the assertion error that you got? Oh, it didn't. The model just responded poorly. Uh, the assert is bad. We should get rid of that assert in that test, because the model is responding with a string; just checking it's a string of more than 10 characters and that we didn't fire an exception is enough.
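In other words, the flaky assert on the exact response content can be loosened to something like this (the client fixture and method name are hypothetical stand-ins, and the async test assumes pytest-asyncio):

```python
import pytest


@pytest.mark.asyncio
async def test_generous_timeout_does_not_raise(client):  # `client` is a hypothetical fixture
    # Don't assert on what the model said; only that we got a non-trivial string
    # back and that no timeout (or other) exception was raised along the way.
    result = await client.write_essay("Write a short essay.")  # hypothetical call
    assert isinstance(result, str)
    assert len(result) > 10
```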
>> That's it. Like, this works. There we have it. End-to-end Python asserts definitely work. We're getting timeouts. If we go look at the actual Can you open up the thing for me really fast? The codebase, so we can show the commits, uh, the Python code. I want to show the Python test file. Yeah. Test timeouts. Like, we're definitely firing these tests and we're definitely capturing these exceptions. Now, it's possible that there's still a bug here. So I would have to go and read this code in really good detail, but just statistically, based on everything that we have compiled and looking at the diff of everything so far, where the only difference is really the client that we're passing in, I find that hard to believe.
>> So, um, maybe after I validate that it's actually good myself. Cool. So Dex, if you can take all the thoughts, take all the research, take all of this stuff and put it up as a PR to the repo, uh, we'll take it from here.
Um, yeah, so this was, like I said, a longer-form episode compared to what we normally do. Uh, and what we will be doing in the future is we'll go back to writing more code. But today we wrote a lot of code using AI, and hopefully the lessons that we shared here are going to be helpful to all of you. With that, we're going to peace out and let Dexter get the stuff up and running. Thank you guys for making the time and tuning in. Uh, Dexter, until next time.
See you.