LongCut logo

What AI Engineering Looks Like at Meta, Coinbase, ServiceTitan and ThoughtWorks

By AI Native Dev

Summary

Topics Covered

  • Full Video

Full Transcript

I think one of the things about metaculture is that it's very engineering empowered. If you can get

engineering empowered. If you can get engineers to support something from the ground up, you can go a long way. And

there's a popular phrase code wins argument. And I think in this case, it's

argument. And I think in this case, it's a case of like proof wins argument.

>> One of the questions that I often ask when I talk to people about spec driven development is at what level are we talking about? So we took was defining a

talking about? So we took was defining a process using ripper 5, how we're going to work with the LLM starting with the spec and then we paired developers with that process. There's a Stanford

that process. There's a Stanford research on over 100,000 employees. What

they found out is AI is helping generate 30 to 40% more code than before.

However, 15 to 25% of that code ends up being junk that gets reworked, right?

So, they estimated the actual productivity gain is like 15 to 20%. I

would think it's higher if you're using AI in the right way. All large tech companies have this issue. Your software

that's been in existence for more than a few years using legacy that no longer resemble the way that you would rebuild those things today. So we're talking about shifting to architecture with a

semantic layer and a kind of a query engine on top to serve the same kinds of metrics queries but on a off-production architecture.

[Music] Before we jump into this episode, I wanted to let you know that this podcast is for developers building with AI at the core. So whether that's exploring

the core. So whether that's exploring the latest tools, the workflows, or the best practices, this podcast's for you.

A really quick ask, 90% of people who are listening to this haven't yet subscribed. So if this content has

subscribed. So if this content has helped you build smarter, hit that subscribe button and maybe a like. All

right, back to the episode. Hello and

welcome to another episode of the AI Native Dev. And we're here live from New

Native Dev. And we're here live from New York uh at Cucon AI. And on my journey here, I'm going to be talking with a number of different folks uh throughout the conference uh based on whether I

think it's a a really nailed presentation for for for our listeners.

And today's um I absolutely found one.

And so this is uh Sepair Kosravi who gave a session choosing your AI co-pilot maximizing developer productivity. And

that is such a cool topic and you actually mentioned a huge ton of different productivity hacks and things like that in and around uh using various agents and LLMs. Um, first of all, Sapare, talk to us a little bit about

who you are. Um, you work for Coinbase.

What do you do there? And and tell us a little bit about it.

>> Yeah. Yeah, I'm happy to. So, I'm

Sappire, like the fruit pair makes it easier for people to remember. I'm a

machine learning platform engineer at Coinbase. I've actually only been in the

Coinbase. I've actually only been in the industry about two years. I used to sell teddy bears, but of a different career.

Uh, and I also teach part-time at UC Berkeley where I teach people these AI tools and how to use them to maximize productivity or launching startups. And

I also run an academy for kids to teach them the same thing for free which is called AI Scouts is kind of my passion project that I >> That's super cool. Super cool. And so

how what age groups are we talking here?

>> We actually start anywhere from like 11 and up to 70. Some grandpas want to learn as well and we we like put them all together but we teach them the basics of AI, how to like understand software and how to use some of these tools that we're going to talk about

today.

>> Amazing. Amazing. So we talked about or rather you talked about a bunch of different tools, a bunch of different agents and you went through a number of different tips and tricks to get people uh to to be more more proficient, more

effective with those agents and those tools. Tell us a little bit about what

tools. Tell us a little bit about what what is your >> Yeah.

>> What is your environment that you find you are and I know this is very personal in terms of everyone is different, but for you personally, what's your what's your ideal environment that makes you the most productive?

>> Yeah, I think my ideal environment is cursor IDE setup with cloud code on my terminal. And that's where I feel I'm

terminal. And that's where I feel I'm the most productive. I'll kind of use cursor for most of the things actually like 80 to 90% of the things I'm doing on cursor. But for deep tasks, I will

on cursor. But for deep tasks, I will start off my task with asking cloud code.

>> Oh wow. Okay. Yeah. Yeah. And how do you find like was that the first time you moved out of uh ID into terminal?

>> Yeah. Yeah. I

>> how did you bridge that not being able to see everything in a kind of like typical IDE?

>> Yeah. Yeah. I think initially why it started is because our company was actually tracking developers using AI and how much they're using it and actually in a good way. It's not I know some companies are like they don't want

to use as much AI because it costs a lot right our company was pushing a lot of AI adoption >> and then cloud code uses a lot more tokens than cursor does. So it seemed like I wasn't using a lot of AI even though I was pretty sure I'm using AI

more than most people. So I'm like okay let me try out cloud code because of the token usage. But it turns out like it is

token usage. But it turns out like it is really useful right for those deeper tasks. I've had so many projects where

tasks. I've had so many projects where cursor couldn't get it done but cloud code could get it done.

>> Very interesting.

>> Uh and yeah >> and why don't we go through a number of tips then? So we'll talk about cursor

tips then? So we'll talk about cursor first then we'll jump into claw code. So

cursor um you went through I think 14 different tips. They're on they're on

different tips. They're on they're on screen. We'll we'll flick through them

screen. We'll we'll flick through them and we'll we'll pull them up as you as you as you find one that you like. Um

tell us about tell us about a few of the tips that you find. Hey, do you know what? This is really a game changer.

what? This is really a game changer.

Maybe for cursor specifically or just generally.

>> Yeah. Yeah. I would assume most of your audience hopefully is pretty AI forward, but if not for the people who don't like AI, tab AI is where I would start. Okay.

>> Uh cursor has this tab function.

>> For the people who don't like it, just like turn on cursor and see what it suggests cuz a lot of times it'll like write 10 to 20 lines of code for you from you just hitting tab and not having to lift their finger. Yeah.

>> So that's where I would start for people who are very against AI.

>> And this is the very agent. This is very AI assisted kind of like tab completion levels of of AI users.

>> Correct. Cursor has their own model built for this AI tab complete which is why I recommend it over a lot of the other IDEs.

>> Cool. Okay. So tab what's next?

>> Tab. What's next? Let's see.

>> True is like so I think most people know web cursor agent that's where you use it, right? But the multi- aent mode I

it, right? But the multi- aent mode I think is really underrated. And what I like it for specifically is there's like a new model coming out every other day at this point, right? And like how do you keep up with which one's the best and which one's not? you kind of want to

experiment at the same time.

Experimenting takes up a lot of time. So

what I like to do is every time some new AI model comes out, use this multi- aent mode where you can set it up so you type in one prompt and have two or more AIs answer that same prompt. So recently

with Chach 5.2 coming out, I want to see if I wanted to switch my daily driver from Opus 4.5 to this new one. So I

would have it almost shadow whatever Opus 4.5 is doing, see how I like the results and then after a few tries kind of decide like do I want to continue using this or not? So multi-agent is a great one for people to use.

>> That's awesome. That's awesome. And so

and so with this one, how would it once you parallelize, how do you then choose which one?

>> Uh so there there's like a three benchmark for these models where you can see which one tends to be the best. I

don't think that's the best metric. Like

you can tell like Google Gemini for example recently ranks really high on that. But people most of the time are

that. But people most of the time are saying cloud is better.

>> Yeah.

>> So it comes up to your personal experience most of the time. I kind of just take a look at it, see if it did a good job of completing it. How do I like the code it generated? How do I like the response that it gave? So, for example, I really like Claude because the

responses it gives a lot of times are more educational than the other AIs and you can really understand what it did better.

>> Yeah.

>> So, yeah, I just leave it up to personal experience. You look at it, whatever

experience. You look at it, whatever works for best for you, that's the one you end up using.

>> Yeah. Another great feature from Cursor specifically is they built their own model Kyle Composer. The other IDEs don't have this model. So, it's native to Cursor >> and it's specifically good at generating

code quickly. The biggest problem we

code quickly. The biggest problem we have a lot of times with these AI coding is like you type something in and then then you're waiting two to three minutes and sometimes the task isn't even that complex. For that I'm using composer. So

complex. For that I'm using composer. So

for example, this page was generated with composer in 24 seconds where something like relatively similar for cloud uh was 2 minutes and 30 seconds and the results are pretty similar.

>> And that's significant actually cuz like I remember when we like were originally like in the very early days of of of um uh using AI in a more assisted way particularly in when using the tab

completion for example we required a really really fast turnaround right like when you hit tab you want an answer like within a split second. Now with more agent based the because it goes deeper

you will be you know more inclined to say actually I'm I'm happy to wait for it 30 seconds a minute or something for a better answer. But now actually when you look at this the difference between 30 seconds and 2 minutes 2 and a half

minutes is significant and actually interrupt your flow interrupt your thought process when it gets that significant in in in kind of like a difference. So that actually matters

difference. So that actually matters quite a bit.

>> Yeah. Yeah. 100%. It's so big of a problem that it's like a funny example, but YC actually invested in this company which is a brain rot IDE. It's an ID that pops up Tik Tok videos and games

for people to do while they're waiting.

>> So why she's putting money into it like there's a problem there.

>> It's like that classic meme like my code's compiling. Right.

code's compiling. Right.

>> Yeah. It's like why are you playing games? Oh my my agent's doing something.

games? Oh my my agent's doing something.

>> Exactly.

>> Okay. So um so yeah, multimodels. What's

what's uh what else from from the cursor point of view? Sorry.

>> Yeah. Um, next big tip is just setting up your cursor rules. A lot of people are just using the Asian but not putting down like the groundwork you need to do to really excel with these tools. One of

those is going to be rules. The other

one is going to be MCPS.

>> So for rules, there's actually four different types of rules. You can choose rules that always apply on every prompt or you can have it be specific where cursor is figuring out if it should

apply it based on the context, based on specific files. Or you can just have it

specific files. Or you can just have it set up so it cursor will never by its own look at the rule but only apply it when you manually tell it hey I want you to set up this rule >> opt-in kind of like information opt-in

documentation very actually a little bit similar to how Tesla does that with a with a Tesla NT and things like that awesome cursor rules really really important why is context so importantly >> yeah I think you kind of got to treat

your AI like a basic junior engineer if you don't give him the full requirements of the task he's not going to be able to figure it out right >> so we need to make all the details that the AI needs to know, we're providing it. Yeah.

it. Yeah.

>> And I think >> one really good way to do that as well is with MCPS and setting up some sort of documentation MCP because there's often time a lot of gaps in our code where the AI will read through your code but still

not understand what's going on. But when

we give it access to our documentation, it can read that and fill in those gaps which really really boosts uh the productivity of what you can produce.

>> Yeah. Amazing. And actually in the in the session I asked one of the questions which was about how you know how to know when to give you know enough context without giving so much context that it actually degrades the performance. Um

and I guess that's what a lot of the things like when we talk about cursor rules and the always apply and the apply manually it's for exactly that reason.

And I suspect when we look at if if you wanted if you had a huge amount of context that you wanted to provide you were probably more likely to say actually there's too much context here.

Let's add this either manually or add it more intelligently. apply it

more intelligently. apply it intelligently so that way you're not actually bloating context for no reason.

>> Yeah.

>> Yeah. Super interesting and actually a really really crucial part. Um okay, one more cursor rule and we'll jump into uh Claude.

>> Yeah. Yeah, let's do it. You can have a bunch of different rules. I think one that's particularly interesting. Claude

shared this themselves is relating to this context we just talked about. Often

times if you get close to the end of your context window, so you've used up like 90%. Then you ask the AI for

like 90%. Then you ask the AI for something, it will give you a short answer because it's just trying to get something out before it runs out of context. But if you type in a prompt

context. But if you type in a prompt like this or something similar, you can tell it, hey, your your context is going to end, but like don't worry about that.

You can compact it actually give me the best answer. And that's one useful tip.

best answer. And that's one useful tip.

>> Yeah. One of the things I love and actually maybe this segus us into Claude. One of the things I love

Claude. One of the things I love particularly about the slash uh the slash command, the compact um is the fact that you can add some text after that, which is super cool because if you do a slash sometimes it'll autocompact or you just put slash compact, which

will which will make it compact. I love

the slash compact where you put text after it and that text is hey this is what I'm going to do next and what that'll do is uh it'll also it'll based on that it'll release a certain amount of context that it doesn't need because

it thinks actually I don't need that for my next task. Super super cool way of actually saying actually I want you to release just the right context and leave me with what I need to do my next task.

So that's a a claw tip.

>> That's a good tip. That's a good tip. I

like that tip and that lends us into clawed code. Now clawed code I would

clawed code. Now clawed code I would probably say is probably one of the most functional agents in terms of the extra value ad that it provides you with. Um

let's jump into the clawed code uh section. Um what what would you say uh

section. Um what what would you say uh is your kind of like number one tip for for claw code?

>> Yeah, I guess >> cloud code very similar to the way I use it that I do on cursor except I think the use cases are different.

>> Cursor is good because it gives you the visual ID. You can see what you're

visual ID. You can see what you're doing. It's faster than cloud code is

doing. It's faster than cloud code is uses less tokens and you can switch between different AI models.

>> But if you're trying to go deep on a task that's difficult, cloud code does a lot more thinking, takes up a lot more tokens, but usually gives you a better engineered answer.

>> Yeah.

>> And like at work, I had a project like this that I put into cursor and into cloud and cloud code saved me a ton of time. It like searched the web, found

time. It like searched the web, found open source repos, analyzed them, gave me a proper solution where cursor just like picked something, implemented it and it was done, right? But cloud code really saved me like multiple hours

>> and I think based on your language that you use in the prompt it will do the correct amount of thinking based on that. So if I if you type think deeply,

that. So if I if you type think deeply, it actually adds the amount. And if you another top tip, if you type in ultraink, I don't know if you have typed that. Let's get let's go to the terminal

that. Let's get let's go to the terminal quick. Uh if you type in if I was to go

quick. Uh if you type in if I was to go let's run claude. If you type in ultraink uh ultra think, >> it does this crazy thing. This is one of

like the lowest the deepest levels of deep thinking it will do. And so if you if you type something after that, it'll then go in and do really deep thought on that on that thing. That's a good one.

>> So super cool super cool little tip there as well. So thinking yeah and I totally agree and I think that's why sometimes core takes that deeper like longer time because it does do that deeper thinking. So thinking completely

deeper thinking. So thinking completely agree with that. Next next tip what's up >> 100%. Other one is sub agents. This is

>> 100%. Other one is sub agents. This is

where cloud code kind of beats out cursor as well where you can set up different agents and give them all their specific MCP tools that they can use.

And these sub agents also have their own context windows. So for more complicated

context windows. So for more complicated tasks, this might be a lot better. For

example, like you might set up a sub agent that's a pager duty investigation sub agent, right?

>> Every time a page comes in >> and you call cloud code, it'll use this one. It'll specifically look at Slack,

one. It'll specifically look at Slack, find the alert, and then go into data log, research it, and come back with a solution for you. But for specific workflows that need specific sets of tools, these sub aents can be super helpful.

>> And all that extra not all that extra work it did adding stuff into the context unnecessarily just to get a small amount of information, it passes that small amount of information back to the main agent and then all that other

context just gets lost. Right. Exactly.

It's so useful. We use that at Tesla a lot to do our research on our on our kind of like our context and then we once we learn this is the piece of context we need to pass back to main main agent. It's it keeps it nice.

main agent. It's it keeps it nice.

>> Yeah. Yeah. 100%. Context management is like kind of everything.

>> Absolutely. Now, sub agents, that's not a that's not a thing which is which is duplicated across many different agents, right? It's it's quite particular to

right? It's it's quite particular to Claude in terms of I would say the mainstream agents. Do you feel like you

mainstream agents. Do you feel like you know agents are going to learn from each other and start building these additional things? Do you feel like

additional things? Do you feel like we'll hit a a kind of like feature parity across these agents or or how do you see that?

>> I I think it really depends to how companies handle this context management problem. Uh but right now even with

problem. Uh but right now even with cloud code there's a main agent and sub agents and none of the sub aents actually talk to each other. The

software agents all talk to the main agent.

>> Yeah.

>> So even cloud code itself isn't like that advance of what we could be doing in the future and I'm sure these companies will start to dabble in that.

>> Really interesting. There's also another tool I think it's called claude flow um which whereby they do kick off a whole ton of agents. Um I think it's yeah mostly command line and they have this thing called hive mind which is super cool where the agents have almost like

this shared memory where they can start writing to that memory and learning from each other. So yeah, audience if you're

each other. So yeah, audience if you're interested in that, I think it's Claude Flow which is uh which is super cool. Um

nice >> sub agents super cool. Really love that about Claude.

>> Another tip while you're introduced to other tools, Claudish is another one.

Really love like the terminal of cloud code but you want to use different AIs with it. There's different open source

with it. There's different open source libraries that do this but Cloudish is one of them. Looks exactly like cloud code but you can call like Gemini HD whatever AI model you want from it. And

that's what and and people would do that because they love the claude code UR or DX UX but they presumably have a different model that either the company have agreed on or something like that.

Do you think you'll get like a much of a benefit in terms of the the the underlying LLM being maybe certainly better in certain cases or not or do would you say you know if you don't care claw code is probably better to use

clawed with? I

clawed with? I >> I I think if you don't care cloud code is probably the best to use at this point. Maybe that changes. Um yeah, I

point. Maybe that changes. Um yeah, I wouldn't really like a man cottage. It's

just exactly for that specific use case that you're >> if your hands are tied. It's a

>> another thing like when you're using these AI tools, it's important to evaluate like are these actually working or is this slowing my team down. A lot

of times actually initially teams will adopt AI and see a little bit of slowdown before they ramp up. And that's

kind of the learning curve that it takes to start using these tools.

>> But overall, you want to be tracking these metrics.

>> There is no perfect metric is what I found, right? But what you want to do is

found, right? But what you want to do is just track a bunch of different things like PR's merged, how long from a PR created to merge, etc. And then like what I find most times is you in the

company build a qualitative story that you want to tell your management. But

when you have these metrics, you can pick and choose which ones you need to use at the right time to tell that story and back it up with evidence. So that's

where I think it's like really important to gather as many different metrics as you can and use them when you think it's right.

>> I couldn't agree more. But I think I think uh in the past AI adoption has more been like people were just wanting to adopt AI tooling >> and they they have that almost that push

from up top like you have to be using AI let's let's all do it and I think there are certain people who are now looking at this thinking from a productivity point of view how much value is this pro providing me I was talking with Tracy Bannon yesterday actually and Tracy

who's giving the closing keynote actually at the at QCon um and hopefully we'll chat to her as well but Tracy was talking about value versus things like velocity Because velocity is just a metric, right? It's a metric and it can

metric, right? It's a metric and it can be it can be gamed. There's a whole different ways, but value is super important. It's like what is this what

important. It's like what is this what is what is this agent actually giving to us from a business value point of view.

I think it's super crucial. Completely

agree with you. I think I think the impact is >> 100%.

>> Kind of alluding to what you're talking to. They have some complicated research

to. They have some complicated research methodology that they did over on a on over 100,000 employees. But what they found out is AI is helping generate 30 to 40% more code than before. However,

15 to 25% of that code ends up being junk that gets reworked. Right. So they

estimated the actual productivity gain is like 15 to 20%. I would think it's higher if you're using AI in the right way. But good to know like there's a

way. But good to know like there's a difference between value and just straight output.

>> Absolutely. Yeah. And that's that's the biggest trip like hazard to to just to just you know output versus outcome and measuring that measuring that wrong thing. 15 to 20% though still still

thing. 15 to 20% though still still impressive and we expect that to probably grow as well over time when LLMs and >> 100%.

>> Um amazing session today. Thank you so much. Pleasure pleasure to be in it and

much. Pleasure pleasure to be in it and meeting you at KeyCon today.

>> Thank you for having me.

>> Absolutely. Enjoy the rest of the conference and uh and yeah speak to you soon. Hello and this time I'm with Ian

soon. Hello and this time I'm with Ian Thomas who is a software engineer at Meta. So Ian, welcome to our podcast.

Meta. So Ian, welcome to our podcast.

>> Thank you for having me.

>> And and your as soon as I saw your title uh on the on the schedule, I thought I have to have Ian on my uh on the podcast because your title today was uh AI native engineering and of course this is

the AI native dev podcast and I'm like it's a perfect match.

>> Match made in heaven. So tell us a little bit about the session.

>> So this is a kind of a case study of what we've been doing uh in my team and my wider.org or in the reality labs part of Facebook and Meta uh where we've had a sort of groundup adoption program for

people who are looking to experiment and and develop with AI uh as part of their workflows um building on all the various tools that we've been getting access to over the last few months. Um and

generally seeing how we can accelerate ourselves in terms of productivity and making more um I guess wins product wins and outcomes for our our teams. And and I guess in terms of like every

organization is different, but generally how would you say how would you say adoption is of these types of tools in a in a large organization like Meta? Do

you have like people who are diving in, people who are stepping back and then the majority maybe leaning into it slightly or where does that sit?

>> So I I'd say well it's changed a lot over the last 6 months. So um I can only really speak to my or the patterns that I've seen there. We had a few people that were really keen and and sort of

experimenting and they were they were finding value of these tools outside of work >> and they were really keen to see how they could apply it to internal. Meta

has some interesting engineering things because it's very in-house. It's it

builds its own tools and platforms and things. So it's there

things. So it's there >> some quite bespoke ways of working there that maybe you can't just go and pull some of these things off the shelf and and apply them to our codebase.

>> Yeah. Um but yeah, so we had these few people that were really passionate about it and then we had a bunch of really senior engineers who were perhaps a little bit more skeptical and then gradually over time we've seen adoption grow and there's been more of a push

from um the company overall and saying well this is going to be something that we have to take seriously.

>> Yeah.

>> Um you've got access to all these tools.

Let's go for it. So the adoption now is pretty good. I think last time I checked

pretty good. I think last time I checked we were over 80% weekly active users.

>> Oh wow.

>> Uh and the way that we measured that sort of changed subtly. So now it's um considering four days out of seven usage of any AI assisted tool.

>> And can you say what tools you're using or >> uh well you've got a whole spectrum of internal things. So we've got Metamate

internal things. So we've got Metamate and Devmate which are the kind of two main ones. One's more of a chat

main ones. One's more of a chat interface and the other one's more for like coding and agentic workflows.

>> Um there's a whole bunch of stuff like Gemini code uh sorry Gemini codeex uh core code um etc. We've got access to third party tools as well.

>> Devs are devs are kind of like trying to find what works for them and teams are trying to work out what exactly. Yeah.

And a lot of the tools like um Metamate and Devmate have access to the models used by things like core code as well.

So you get to compare like things that have got a bit more internal knowledge than external. So it's it's quite good.

than external. So it's it's quite good.

>> The versions that we use generally are kind of tailored for >> um meta use cases. So they're kind of a bit more restricted than the generally available stuff, but they're still pretty pretty powerful. One part of your

your session that really resonated with me was when you started talking about community and growing that community.

Why is why is building that community so important when we talk about sharing that knowledge and kind of that growth and adoption?

>> Mhm. So I think one of the things about metaculture is that it's very engineering empowered >> and so what that means is if you can get engineers to support something from the ground up

>> you can go a long way. And uh there's a popular phrase code wins argument. And I

think in in this case it's case of like proof wins argument. So if you've got people that are going to be forming around this idea and these concepts, they can bring them up to a level where people are saying, "Hey, I'm getting

value from this and this is how I'm using it and this is what's working." It

attracts other people who are going, "Okay, well, if that if they're doing it, then you know, maybe this will work in my case, too." And I think that community aspect helps to get a bit of authenticity and and people feeling like they're part of a community. And it

helps to kind of push everything along with a bit more momentum. Whereas

sometimes if you get like a top down mandate >> engineers we can be quite skeptical people at times you know it's a bit cynical and uh oh well if I have to do it so but rather it's it's bottoms up

it's a bit more kind of okay we can make this work and I think it's going to be a great thing. So what was being shared

great thing. So what was being shared then like problems as when people were having problems or people had questions uh wins maybe were people saying hey this works really well for me maybe

context or or rules things like that how did the community work >> all of the above right so we have one of the things that we rely on a lot is uh workplace which is sort of like the professional layer on top of Facebook

>> uh it was an external product until fairly recently so we use this heavily >> and that's kind of the basis of where this community is and we share things in there you post, people can comment. It's

very, it's like a social experience but for work.

>> Um, and so what we were seeing was people were coming in, they were asking questions, they were saying, "Well, I'm trying this thing or have you seen this tools available?" Um, generally kind of

tools available?" Um, generally kind of keeping everyone informed about what's going on. And over time, it's just

going on. And over time, it's just gradually grown and grown. Um, we never really forced anyone to join. It's not

got like auto enrollment or anything.

And, uh, I think last week we hit 400 members, >> which is pretty good for something that's kind of a grassroots initiative.

>> Yeah. self-sustaining like one of some of the best communities like that.

>> And I guess one of the things that are really nice about communities is that everyone you know you get so many different people at different levels of their experience and adoption and and this kind of like leans into one thing that you mentioned as well which is like

these maturity models. I guess there's maturity in terms of how an individual is using it as well as a team. Talk us

through the maturity models and why they're important. So the the thing

they're important. So the the thing about the maturity model I intended it to be used for a team but there is a a dimension on there which is about individual productivity as well. So you

can reflect on your own kind of performance and ways that you're getting value from it. Um the benefit of it being a team based thing is that it opens up the conversation within your team. Yes. And so you can generate uh

team. Yes. And so you can generate uh ideas and have action plans that are specific to you because every team is going to have slightly different context or different levels of ability and different interests. So that's great.

different interests. So that's great.

Um, we tried to model it in a way that was going to be fairly agnostic of the tooling and and be durable because again the value I think is that you can have these models and you can repeat this the assessments time after time and see how

you're progressing.

>> Um, and we do subtly tweak it every now and again but um, generally it's kept fairly consistent and then yeah like say the teams run these assessment workshops and they can have the discussion and that's that's where the real value lies.

>> Yeah. Yeah. Let's talk about value and wins. what were the what were the big

wins. what were the what were the big wins that you saw and and how did you share that across the community?

>> So initially the wins were people saying I'm using DevMate which is part of our tool set in VS Code that we work with dayto-day and I'm finding ways to use this to like understand the code base

better or or what have you. And there

were some early examples of people just sort of going for a big problem and and putting a prompt in and they were getting a bit of lucky and then there was equally some people that were suffering that that wasn't working at all. Yeah. Yeah.

all. Yeah. Yeah.

>> Um but then we found there was kind of repeatable patterns emerging around things like um test improvements or how to make code quality improvements or reducing complexity of code.

>> Yeah. And and that was when we started to experiment with things like unsupervised agents and you could say okay with this category of problem um say like we've got test coverage gaps we

want to go and find all the files that are related to this part of the codebase related to this on call say >> uh find the ones that have got the biggest coverage gaps and then using this runbook that we've put together go

and go and cover them go and produce diffs that help us to >> bridge the gap >> and that was the sort of thing that we lots of hours of manual work and >> as this sort of evolved we found

actually we can go and use the tools to go and query the data to find make do the analysis and then generate the tasks for itself to go and then fix the tests and add the coverage uh and I think the end result was something like 93 and a

half% coverage was achieved which when we were way less than 60% to start off with so >> um many diffs landed >> and it sounds like like you know huge productivity gains >> yeah so that one I think we ended up

doing in about 3 hours to to achieve this so which is incredible and this is again it's one engineer was just saying I've got a hunch I think this is going to work I'm going to go and play with it they found a pattern that worked and

then it became repeatable they shared it to the group and I think I said in the talk um one of the things that came about from that was another engineer was like hey this kind of works I've got all these tests that are running fine but

they're a bit slower than I need them to be to be eligible to run on diffs I wonder if I can do the same sort of thing can I go and put this tool to use and go and find all the tests test and see if I can

>> reduce the runtime down. And I think they they achieved something ridiculous like 1,900 tests were improved.

>> Wow.

>> And so that's it's a huge amount of work that you >> It's hard to estimate how much time that is. And it's probably you're probably

is. And it's probably you're probably right. It just wouldn't get done. So

right. It just wouldn't get done. So

it's a beneficial increase in quality versus just time saver.

>> Exactly. Yeah. And then in terms of the value of that one, I think the re I was chatting with that individual engineer as well and I said, "Look, what would be amazing to know is not just how many tests you fixed or improved. can you

tell me how many issues that's actually caught because they're now eligible to run on these diffs and he went away and he found the data like yeah look in the time since I launched this >> at least 200 changes have been stopped because

>> these tests now run and previously they didn't. So

didn't. So >> yeah, >> that's the kind it's harder to get those kind of value metrics for everything, but that's the sort of thing that I think will it makes the difference and it helps to persuade people that this is

this is going to be a different >> and we're just starting.

>> Amazing. Well, thank you so so much for for the session and joining uh joining our discussion here. Um I very very much encourage folks go to the UCOM website uh check out the talk in full but for now Ian thank you very very much.

Appreciate it.

>> No problem. Thanks for having me.

>> Hi there. Now I am with David Stein who is a principal AI engineer at service titan and uh and David gave a a really great session uh yesterday in fact which

was called moving mountains migrating legacy code in weeks instead of years uh very nice title really really engages the thought processes of of what is possible with migrations in and around

AI. Um David, first of all, tell us a

AI. Um David, first of all, tell us a little bit about what you do. Tell us a bit about Service Titan.

>> Right. So Service Titan, we are the operating system of the trades. And so

what that means is this is for trades and uh residential and commercial building services industries like plumbing electrician HVAC roofing

garage door and so on. Bunch of these contractor businesses for uh for regular service work and for construction. And

Service Titan builds a technology platform that helps these businesses run their business. So an end-to-end

their business. So an end-to-end platform supports everything from uh you know customer relationship management, taking payments to answering the phones

and we have a whole suite of capabilities in our in our platform around those around everything you need to run a business like that. Uh so

that's what service titan is and I work as an AI engineer principal AI engineer at service titan and uh you might wonder like at first what is what are the AI

use cases in the trades businesses and there's actually like a lot of really interesting AI applications that are really important for companies that want

to run a sophisticated operation uh today and just to name a few of them if Let's keep going there. So there's um

>> one thing that we are doing is around um what's called job value prediction.

We can a lot of trades businesses when they have uh you know many dozens of technicians and lots and lots of customers and you know many jobs that

they're going to serve in a particular day. they have a a big task to do at

day. they have a a big task to do at their at their main office around matching which technicians are going to serve which customers on a particular day. And we have a bunch of intelligence

day. And we have a bunch of intelligence in our product that allows for this scheduling and dispatch to be matched efficiently to basically help our

customers run their business better, more efficiently. And uh that's one

more efficiently. And uh that's one example, but there's a bunch of other areas too using like for example, we have uh a a voice agent product uh that

you know voice voice AI answering service so that our customers can kind of that answers the phones for our customers so that it can help their customers schedule service appointments

24/7. Uh those are a few things that

24/7. Uh those are a few things that that I work on. Um yeah.

>> Yeah. And so tell us and so what one of the big things uh about your session was about how you took some legacy application and you wanted to migrate it. So first of all why don't we talk a

it. So first of all why don't we talk a little bit about the the legacy application uh you know what what were the you know pieces of that application that you needed to migrate there was a bit of data I think there was some code

anything else >> yeah so in the talk it actually focused on a a different area than the things that I just mentioned and it related to

a migration that we've been working on as part of like this is you know like all large tech companies have this issue if you've been around for more than a few you've, you know, your your software

that's been in existence for more than a few years using uh, you know, legacy, uh, you know, uh, using, you know, components that have been around for for some time, not you know, that that no

longer resemble the way that you would rebuild those things today using more state-of-the-art foundations. And so

state-of-the-art foundations. And so anyone who's been in software engineering, you know, hip for long enough, has had to work on a migration at some point where you are, you have to

kind of unpack code that was written a long time ago. Uh maybe the people who wrote it aren't on the team anymore, you know, where not all of the context is easy to find, but you have to do the work of picking those things up and

moving them into a new implementation on an improved platform. And those kinds of projects, you know, that all companies have these projects, service included, they're notorious for being uh, you

know, they take a long time. There's a

lot of kind of toil in understanding the old legacy code and moving it over. So

the example I talked about in the talk was around our reporting applications.

Service Titan has a uh a reporting product that basically allows our customers to see all sorts of business metrics and KPIs relating to their operations and their financials and and

their business. And we have a bunch of

their business. And we have a bunch of very complex infrastructure that supports that. And something that we,

supports that. And something that we, you know, do periodically is move the the underlying machinery to use better and more state-of-the-art uh infrastructure. And so that's what this

infrastructure. And so that's what this is about. So the specifics are we've

is about. So the specifics are we've been working on uh you know like working with uh DBT metric flow. It's like a metric store technology based on and layered on top of Snowflake because

there's some of the details of of the stack underneath. Um but there's this

stack underneath. Um but there's this big task of how do you take uh this code that was written a long time ago touching uh you know legacy uh well I

don't know how many of these dirty details I should go into here but like I'll let you actually ask the question.

What do you want to know? interesting

bit. So why don't we talk about the AI piece first of all? Uh so so we know there's a reporting part of the application. There was some code that I

application. There was some code that I think you said you change language on for that. There's an amount of data as

for that. There's an amount of data as well of course. Uh did the data migrate or is that was that pretty much staying where it was?

>> Um so yeah so we we're talking about shifting from an architecture that is based on a you know the legacy stack.

Yeah. which is written in C with a OM you know programming model that is uh operating against uh you know SQL DBs in

production and shifting that to a kind of uh data lake like architecture with a semantic layer and a kind of a query engine on top that allows us to serve the same kinds of metrics queries but on

a off-production architecture and there's lots of reasons for that but the >> uh Zavashier Yeah. Yeah, that's fine.

And so in terms of so you used agents to to help you with this, >> right?

>> Um from the what what would happen if you just said to an agent, you know, here's my environment. I need you to migrate this, create a plan and do it yourself, >> right? What what what would be the

>> right? What what what would be the problem? So for a product like this when

problem? So for a product like this when you have hundreds of metrics and each of them are underpinned with a bunch of code written in C with all the issues I mentioned before about not all the

context necessarily being exactly where you need it. There's a lot of complexity in that. You can't just open up even the

in that. You can't just open up even the state-of-the-art coding tools like cursor and say hey please like migrate all of our metrics into you know into this new abstraction on this new

framework and by the way convert from C into writing SQL with a YAML for metric flow on top you it doesn't uh it doesn't work to do that is what we found

>> um in order to get traction there you have to really break down >> the you know break down the mountain of that problem into small pieces it sounds kind Obvious if you say in this way you

break it down into standardized into you break it down into basically a bunch of similar tasks that can be uh verified in a standardized way.

>> Yes.

>> And then you assign a long you you basically construct a long task list with fa with with phases for individual sections of the task like move these first five metrics is you know the first

phase and then the next and so on.

>> And who's doing this? Is this is this humans doing this? Is this humans with agents as an assistant? Right. So humans

are you know choosing the the task list right like are enumerating okay these are all of the metrics that we're going to enumerate that we're going to migrate in phase one and these are the metrics that we're going to migrate in phase two

so humans are making those choices as well as like what the target architecture is that we're going to be putting these things into and humans are also you know with some help from like AI tools are constructing the you know

all of the context that's going to go to the coding agents to actually enable them to do the migration work for those pieces and you went into a bunch more detail in the talk. I very much recommend it's

the talk. I very much recommend it's recorded talk I believe. So it's very very very much recommend folks go to the UCON site and and check that out. Um but

in terms of the agents like now we have those tasks the agents now need to take those byite-sized tasks. they need to perform that migration and then essentially you can run the validation to say you know did this work did this not work what needs to be changed after

that >> in terms of I'm really interested in what needed to be done to the agent to best prepare it for that for that actual migration >> yeah it's kind of hard to say it all

without the visuals of that talk so people can watch the talk if they want to see more of these details but we talked about standardizing the tools

that need to be used to do each of these tasks and so that means standardized tools for context acquisition.

>> So the agent just like an engineer is going to need certain context in order to be able to migrate one of these pieces of metrics code.

>> What type what type of context here are we talking about?

>> Context in the term like in terms of the um what the underlying data looks like in the database. So the the agent will be looking at the reference code and it

will have references to some tables and we you need to know like what some example data actually looks like that lives in those tables and what the schemas of those tables are. Um

and that's a and and in terms of what's available in the destination platform like confirming like okay we have this table in snowflake we know that the data

is there uh understanding even uh you know anything that a human engineer would need to see in term in order to be able to correctly write the code that's going to calculate those metrics in the

new platform we want the agent to be equipped to see that too. So we would have a standardized tool for context acquisition uh or a set of tools and we would like we would what that would mean

is like uh not tools in the sense of actually MCP we didn't get that complicated with it. We just made sure that the CLI tools which is uh CLI tools

that an engineer would be able to use to get these things that those are like set up and ready for the agent to use to be able to get access to what it needed in order to be able to do the task. Then

the other really important piece is around um giving the agent a kind of environment in which to run the code and a kind of simulation engine or I call

like a physics engine in the talk which is not really about physics but it's about a script or a program that the agent can run to try out the code that it has written

>> to actually run the metrics code in a context that resembles a legacy application so that it can produce an output that can be directly compared to the output that would be produced by the the legacy code.

>> And I think what you said in the session was there does you're not using production data at this point. It's

almost dummy data a little bit to say if you mess up I don't I want you to mess up in a safe environment. But it it gives a good enough way of you being able to say and almost >> yes >> yeah I see what you're doing. You're

within an environment where you can almost like test yourself and make sure you're able to migrate and we can understand >> yes >> if that's successful.

>> Yes. And so from a validation point of view then or a validator point of view, trust is a massive piece of this, right?

In terms of when you when you push big pieces of code and things like that, um what levels of validation did you have there? You obviously mentioned, you

there? You obviously mentioned, you know, you had this like environment that you created. What else did you need to

you created. What else did you need to do before you kind of like said, "Right, we're going to put this live."

>> Right. So

the validation that enables the tasks to be able to be done automatically by the agent.

>> We had we have machinery there that can compare the output produced by the new system >> to the output produced by the old system. Mhm.

system. Mhm.

>> And so um there's a separate set of validations that we also run before actually doing production cut over because it's

obviously extremely important that the data be right when we're showing it to our customers. In terms of the details

our customers. In terms of the details of those, >> I'm not sure what level of specifics to go into there, but hopefully >> I think what's what's important is you were you were effectively replaying

things that you saw in your previous production environment and you were ensuring that the output that you got from from your previous production environment was you know the same and equivalent of what you were getting with

the with the updated code, the updated flows.

>> Yeah, that's right. So replay is a good key word there. So we of course have like the logs of what queries are given to the old system. We're able to play those queries against this engine that runs using the new platform as well.

>> Um yeah >> in in terms of the LLM itself um and the agent itself how smart does that actually need to be? Is it does it like you know did you hit any limitations in

terms of ohh we need to change model or anything like that or was it pretty basic in terms of what it actually needed to do given the context the validator intelligence? So I I had uh a

validator intelligence? So I I had uh a couple slides about this. At one level, the agent doesn't need to be that smart.

Doesn't really need to be as smart as a human engineer. Part of the point that

human engineer. Part of the point that we tried I try to make in the talk is that as long as you have that kind of self-healing loop where you're able to

empower the agent to check its work um and then try to make corrections if it didn't you know if didn't pass validation and the uh the metric

computed by the new system doesn't match the metric uh produced by the old system. Uh it's kind of interesting.

system. Uh it's kind of interesting.

This project really kicked off for us when uh Claude for Opus was released in I think it was May of this year. We had

done some experiments before that at seeing how well the different coding LLMs could understand some of our legacy metrics code and it turned out that like

they could understand it to an extent but the um the first time that we encountered you know the ability VLM really be able to understand that stuff

well enough to do a pretty good job of re of you know rewriting the restructuring the code for the new platform. form was with uh cloud 4 opus

platform. form was with uh cloud 4 opus which is one of the bigger you know models that's out there. So so it needed in our case to be pretty smart but it doesn't need to be perfect which is an important point cuz I think that

sometimes people kind of get caught up in like well >> you know bots will hallucinate sometimes and you are they really as good as a human engineer but like part of what I'm trying to say is that that that's not the point as long as you can have a self-healing loop.

>> So so it's interesting that you needed that kind of like opus 4. Did did they did they get stuck at any point? Were

there any common kind of like ways in which the agent or the LM failed maybe through lack of context or or or you know lack of uh uh intelligence or anything like that?

>> Yeah. So we have a slide about this as well. So the um in in the in the talk we

well. So the um in in the in the talk we talked about a few pieces that we put up to kind of govern this process where you're going to have the agent go

through and do these tasks one after the other. And the it's not the case that we

other. And the it's not the case that we came up with this idea and we tried it one time and it was able to just do all of the metrics and migrate all of them in just one take. That is not how it

worked. Instead uh the first you know 10

worked. Instead uh the first you know 10 15 20 of them >> we kind of did them many times uh in order to get to a point where the you

know the standard context that we present in the in the system prompt and like the migration goals.txt txt that I mentioned in the talk as well as the way we even break down the tasks under what

granularity and also the behavior of the validator and simulator tool itself.

>> We had to tune those to get them working well enough that the agent was able to reliably get through. So we would encounter situations where the agent thinks it successfully migrated, that it successfully converted or rewrote the code for a particular metric so that it would work in the new platform. For, let's say, a set of them it would say, okay, finished number one, great; try number two, great, did number two; and we would just let it go until it had done several of them. Then upon inspection (this is obviously before shipping any of that code) we would find that there were some problems with what the agent did, and so we would have to go back, rewrite or add certain things to the context, and improve the behavior of the validator and simulator so that from that point on the bot would be able to do a better job of producing the code.

And another thing you can find is cases where the agent gets stuck because it knows it's not able to make progress on a particular task, and there could be many reasons why that would happen. An example I like to talk about: if the agent just doesn't have sufficient test data for that particular case, it will struggle to confirm that the code it tried to rewrite is actually going to work. That's a similar problem to what a human engineer would have if asked to write a new version of some code that fits a new paradigm without sufficient data or context.
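A tiny sketch of the kind of pre-flight check that guards against that failure mode: don't hand a metric to the agent until there is enough test data to validate a rewrite against. The threshold and the `load_fixture_rows` helper are hypothetical.

```python
MIN_FIXTURE_ROWS = 100  # illustrative threshold, not a real figure from the talk


def load_fixture_rows(metric_name: str) -> list:
    # Hypothetical stand-in for loading sampled rows used to validate the metric.
    return []


def ready_for_agent(metric_name: str) -> bool:
    """Return True only if the metric has enough fixture data for the
    validator (or a human) to confirm a rewrite actually works."""
    rows = load_fixture_rows(metric_name)
    if len(rows) < MIN_FIXTURE_ROWS:
        print(f"{metric_name}: only {len(rows)} fixture rows; backfill test "
              "data before assigning this one to the agent")
        return False
    return True
```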

>> Yeah, absolutely. And your session title was migrating legacy code in weeks instead of years. First of all, how long was that process? And secondly, how long do you think it would have been had you not used AI as part of it?
>> So, the number of metrics I mentioned in the talk is 247 that were moved into the new platform.

It really does go very quickly once you get that flywheel running, using the approach we're talking about. Once you get that assembly line working, where you have sufficient context in there and the right tools for context acquisition and the right tools for validation, you can get pretty good agents to run your whole migration really fast. So in terms of how long it took, I would say the first 20 or 30, like I mentioned before, as well as getting those tools and fixtures right, probably took one to two months, I'm trying to remember my exact estimates there. Once those things were in place, going from that 20 or 30 point all the way to almost the end of that list of metrics was really fast, just a few weeks. And what's so compelling about this is that, as we talk about in the talk, any time you would need an engineer to read the code in one language, look at the underlying data, and make sure the logic can be captured and put into a new abstraction, that would be work you have to assign to an engineer, and those tasks would stretch out over a long period of time. The way we estimated it, as I said in the talk, it would take quarters to do that kind of work where you're going to have

engineers migrate all that code into a new platform.
>> And distract them potentially from future product work or anything else.
>> Right. So that's time the engineer could spend working on something else. But there's also a sense of agility that I want people to take away from this. At a lot of companies, and also where I've worked in the past, there are often some big migrations that are deemed high enough priority to actually kick off and do despite the huge cost these things used to take. But there's also a bunch of other kinds of tech debt cleanup, what you might almost call a desire to migrate something onto a better foundation, where engineering teams struggle to prioritize those things because of how long they take. And so what I'm trying to say here is that if you can break down that problem in the right way, like we talked about in the talk, you might be able to do those migrations you've been wanting to do much faster than you might have been able to in the past. So it hopefully helps people think about their strategy a little differently given what we have now.

>> Amazing. David, thank you so much. And by the way, ServiceTitan, they're hiring right now at the time of recording.
>> That's right. At the time of recording, we're hiring. So yeah, folks can find me on LinkedIn. You can reach out if you're interested. There's a lot of great stuff that we're doing here.
>> Amazing. David, thank you so much. I appreciate the amazing session you gave yesterday. Thank you for giving us some time, and enjoy the rest of QCon AI.

>> Thank you so much.

>> Appreciate it.

>> Hi there. Joining me this time is Wes Rice, and Wes is a technical principal at ThoughtWorks. Wes, welcome to the podcast.
>> Thank you. It's great to be here. I'm excited to be on it.
>> Tell us a little bit about what you do at ThoughtWorks first of all.

>> Great. So I'm a technical principal; we call them technical partners as well. What I basically do is lead an account, all the technical aspects of an account. The particular account I'm leading is a large state organization in the United States. I've got about 10 developers, about 18 people total on the project with design, project management, those types of folks. So we're basically building a knowledge graph for a state agency that uses a deep research agent to populate it, and then we're building applications on top of that knowledge graph to answer questions for the state agency. I've been on this particular project about three months, and yeah, that's what we're doing.

>> Awesome. And I joined your session yesterday. Really, really interesting session. The title, for those obviously not at QCon, was AI-first software delivery: balancing innovation with proven practices, and you talked a ton about the various levels of using specifications as part of your software development. Super interesting, going from, I guess, using specifications to assist your development all the way through to a fully spec-centric approach. And an amazing blog from Bea, who we had on the podcast before as well.

>> Absolutely. Absolutely.

>> And you introduced a really interesting framework called Ripper, or RIPER...
>> RIPER-5.
>> Why don't you introduce that to the audience and then we can delve deeper into it.
>> First, to be fair, I didn't come up with it. We discovered it on a blog post, a Cursor forum thread, in I think March of this year or so, when we first ran across it. What it stands for is research, innovate, plan, execute and then review. So it's that plan-execute model, but it goes a little bit deeper.

And the reason I think it's so important is that if you've been interacting with an LLM, you've been in a chat console trying to do something, it has to have the context of what you're doing, and you aren't always in the same mindset. You may be trying to research a particular thing but it jumps into coding, or you may be planning and it jumps into coding, or you're coding and it's not really doing any planning. So what the RIPER-5 model does is provide a set of instructions that you can pass as a command. We're using Cursor, so we pass it as a kind of property for our IDE so it has this context. We give it a command; we say we're in research mode. And what that does is tell the LLM: ask me questions, analyze the codebase you're in, for example. But what it can't do is as important as what it can do: don't do coding, don't do planning. Right now I just want you to understand what the code is and what my spec is actually trying to do, so that I can provide more details to refine the spec. So this RIPER-5 is kind of an execution model for how you work with the LLM, and we do that in pairing with the developer.
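To picture how those mode commands constrain the model, here is an illustrative Python sketch of the five RIPER modes as a simple lookup of what the assistant may and may not do in each one. It paraphrases the idea rather than reproducing the actual Cursor command files.

```python
RIPER_MODES = {
    "research": {
        "may": ["read the codebase", "read the spec", "ask the developer questions"],
        "may_not": ["plan", "write code"],
    },
    "innovate": {
        "may": ["propose alternative designs", "refine the spec"],
        "may_not": ["write code"],
    },
    "plan": {
        "may": ["break the spec into atomic tasks", "order tasks and dependencies"],
        "may_not": ["write code"],
    },
    "execute": {
        "may": ["implement the agreed plan"],
        "may_not": ["re-plan", "change the spec silently"],
    },
    "review": {
        "may": ["compare the result to the plan", "report drift as a checklist"],
        "may_not": ["add new features"],
    },
}


def mode_instructions(mode: str) -> str:
    """Render one mode's constraints as prompt text to prepend to the chat."""
    rules = RIPER_MODES[mode]
    return (f"You are in {mode.upper()} mode.\n"
            f"You may: {', '.join(rules['may'])}.\n"
            f"You may NOT: {', '.join(rules['may_not'])}.")


print(mode_instructions("research"))
```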

>> It's really interesting, because often when you give an LLM a task, it will think it knows enough and jump to the next thing. It's almost like: if I have enough information to give a response, I'll try to do that, rather than thinking, well, I have enough information to give a response, but not enough to give the response that you as a user actually want back. So in terms of the context you give the LLM at that stage, I presume it's similar to rules: essentially a step-by-step process that says, look, this is what you need to do, this is what you shouldn't do, and if you find yourself doing this, stop, and at least complete that before you say I'm done and pass over to the next stage. Is that how it works?

>> Yeah. So I kind of jumped straight into RIPER-5. What we start with is spec-driven development, as you well know. So what we define is a well-defined spec. It has things like acceptance criteria, and we're moving to actually put behavior-driven development type tests in there, so that we're getting the tests up front before the code is written. A lot of times when you have it generate tests, as you well know, it will fit the tests to the code rather than create the tests first and have the code validate against them.
>> Mhm.
>> So by getting BDD-style tests up front and in the specification, it creates a test-driven development type approach that the spec follows. So the first step is getting a well-defined spec.
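To make "tests up front in the spec" concrete, here is what a Given/When/Then style test might look like when it is written into the spec before any code exists, sketched in plain pytest. The feature and the `search_documents` function are illustrative and intentionally don't exist yet, which is the point; the skip marker just keeps the sketch runnable on its own.

```python
import pytest


@pytest.mark.skip(reason="written before the implementation exists, per the spec")
def test_search_returns_only_matching_documents():
    # Given a knowledge graph containing two documents
    documents = [{"id": 1, "title": "water permits"}, {"id": 2, "title": "road repair"}]
    # When the user searches for "permit"
    results = search_documents(documents, query="permit")  # not implemented yet
    # Then only the matching document is returned
    assert [d["id"] for d in results] == [1]
```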

One of the questions I often ask when I talk to people about spec-driven development, though, is at what level are we talking? Are we talking epics? Are we talking stories? Where are we when we talk about specs? What we're focused on right now, and remember, go back: this is a team we just stood up with a new account, brought these people together, working with kind of a traditional company, a state organization. So we hadn't been there long; we didn't have a lot of domain knowledge yet about all the processes or how we were working. So what we did was define a process using RIPER-5 for how we're going to work with the LLM, starting with the spec, and then we paired developers with that process.

So with that well-defined spec, with that definition of done, with the acceptance criteria, with the information that's there, we then go into that first R stage. And again, this is just a command that we share with all of our development team, a submodule we pull into our repo, that means do research. So, do research: give it the spec, give it the codebase you're working on, make sure you understand what this is going to do, and ask me questions. And then we refine that spec: as it asks questions, we put the answers back into the spec, so that way we keep refining it. Then we go to the next stage.
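A minimal sketch of that ask-and-refine loop, with `ask_llm_for_questions` as a hypothetical stand-in for the research-mode call and `spec.md` as an assumed filename: the model's open questions are answered by the developer and written straight back into the spec.

```python
from pathlib import Path


def ask_llm_for_questions(spec_text: str) -> list:
    # Hypothetical stand-in for the research-mode prompt that returns the
    # LLM's open questions about the spec and the codebase.
    return []


def refine_spec(spec_path: str = "spec.md") -> None:
    """Run research mode: collect questions, let the developer answer them,
    and append each question/answer pair to the spec itself."""
    spec = Path(spec_path)
    text = spec.read_text() if spec.exists() else ""
    for question in ask_llm_for_questions(text):
        answer = input(f"LLM asks: {question}\nYour answer: ")
        text += f"\n\n> Q: {question}\n> A: {answer}"
    spec.write_text(text)
```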

>> Yeah, it's a really interesting point actually, because when you say ask me questions, I like that. When we talk about research, there are different ways the LLM can get answers. It can determine them itself based on its training, there are a ton of tools available to LLMs and agents these days, maybe it does a web search or something like that. How autonomous is this research? Is it very interactive with the user, almost prescribed by the user, or can it just go off and come back?

>> Well, so that's the beauty of it, right? You do have the tools, so you can use a tool for web search. In my particular case, we don't have an MCP client that has broader context, but if you did have something like that, some of the products out there that might provide context back to it, you could certainly use those. In the case that we're doing, it's primarily ask me. You can do a web search, but generally it's ask me and the developer provides the answer back. There's nothing that precludes adding additional tools into it so that it can use them, but in the way we're doing it right now, we're using the developer up front in the process.

>> As we move along, as we gain more domain knowledge, as we gain more of the context of how this business operates, we can start to put evals in place where we can do things more autonomously.

Right now we're taking a very supervised approach, in that a developer, a pair, is actually engaged with this research, and it keeps that developer in the loop as well.
>> So who determines when research is done? Is that a developer decision?
>> Developer decision. We're firmly keeping the developer in the process.

Yeah. So once that research is done, the next stage of RIPER is innovate. And that's the part where, you can do things many different ways in software, right? Like: implement this, what are the ways you would implement it? And it might give you one, two, three, four different options, and you're like, you know what, I like option two. Or actually, I like two, but let's make it more evented, make it event-driven. You pass that back to it, that then goes back into the spec and refines it, and that's the innovation.
>> And so the innovate stage is still updating the spec, but updating it more in terms of the how that the developer wants?
>> Yes, it's refining the spec, it's adding more detail to it. A lot of the tools do some of this stuff as you go through, but we're taking a very simplified approach. This is just straight markdown: we're interacting with the LLM and the developer and enriching the markdown.

>> And is it typically, I mean, of course it's just markdown, so both an LLM and a user can go in and play with it. Do you find people leaning on the LLM only to make changes, or do you find people handcrafting as they go as well?
>> So this is a common question that goes back and forth. What I push for is to make sure you go back to the spec at this stage; you're firmly in the spec. Now, I do take a spec-first approach: I believe once the code is developed, the code is the source of truth.
>> Yeah.
>> At this stage, though, we're refining the actual spec.

>> Yeah. Yeah. So we've done the innovation that refines the spec. P?
>> Now, P is planning.

>> So in planning, think back to before AI. If I had a well-defined story, and we're on a scrum team together, say we're on this team with three or four other developers, we don't know who's going to pick up this story. So we've got this story, we go into planning, and we look at it: how are we going to develop this? And you're like, oh, it'll be a React component that we need to build; there's maybe a service; it needs these methods; we need these kinds of things passed into it; maybe this model is there; we might need to create this repository; we need a database migration to do the updates. We task it out. We literally break down what's going to be done and write the tasks. So if you and I aren't here and somebody else picks it up, another pair, they have the same tasks and the same mental model of what's required to do it. We use the same thing with planning.

>> So what we do in planning is, once we're at this stage, there's a command that we'll run called do plan, and we give it the spec. What do plan will do is first break it down into individual tasks, that same tasking stage. Then as a pair we review those tasks and go, okay, do these things match what we want, or do we want to rearrange them or do things differently? One common example I think I used in the talk yesterday was when I tested this out on a Python project: it didn't set up a virtual environment first, and everything I do I want set up in a virtual environment, so I'm sandboxed off and nothing's manipulating my machine. So the first thing I did was go back to it and say, "Hey, add a virtual environment to this plan." Going back to the spec, I'll keep trying to update that; sometimes I have to go to the plan if I can't do the spec. But what I do is task these things out. So now I have these tasks: set up the project, do this component I mentioned, do this service, this repository, the migrations. I lay all those things out. Once we've done that, we can go to the next step where we actually create the plan for those tasks. In the part where we actually do the tasks, we try to keep them as atomic as possible, so they're like a discrete commit if at all possible, and they're not leaking across different tasks.

>> There are a couple of reasons for that. One is that they have that atomicity, but also we might split tasks up: you might do one, I might do another, we might parallelize some of the work. Some things might be dependent; we'll build that into a little task plan that we work through, so that we can implement each of these things individually. So that's the planning stage.
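An illustrative way to represent what the do-plan stage produces: atomic tasks, each intended to be a discrete commit, with explicit dependencies so a pair can parallelize the independent ones. The task names are hypothetical, echoing the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    depends_on: list = field(default_factory=list)


PLAN = [
    Task("set up virtual environment"),
    Task("create data model", depends_on=["set up virtual environment"]),
    Task("create repository and migrations", depends_on=["create data model"]),
    Task("build service layer", depends_on=["create repository and migrations"]),
    Task("build React component", depends_on=["build service layer"]),
]


def ready_tasks(plan, done):
    """Tasks whose dependencies are complete: candidates for either half of
    the pair to pick up, or to run in parallel."""
    return [t for t in plan if t.name not in done
            and all(d in done for d in t.depends_on)]


print([t.name for t in ready_tasks(PLAN, done=set())])
```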

>> Awesome. And with planning, it seems like, particularly with things like Python, there are a number of common best practices, like, yeah, I want it contained in my own environment. How much of this is reusable across different projects? Because there's probably an amount, whether it's styling, ways of working, maybe test-driven development, or a stack the teams are most familiar with or want to deploy with, where you say, you know what, for this customer I want to reuse this for every single project, so I'll make it some kind of context that every project can reuse.
>> For us in particular, we're using Cursor, so I'm going to answer it through a Cursor lens, though how I answer this really isn't dependent on the specific tool that you use.

What we really start with, even before you jump into RIPER-5, is what are the foundational rules we're going to establish. So the first rules we start with are around documentation: anything that's created, use Mermaid to diagram things, and we'll create some things like that. There are some general rules we have on the programming abstractions we want. I know some people will actually put rules in there about DDD, for example, if you want to leverage domain-driven design, or maybe a clean architecture that you'll add into the rule set. So you start there. Then, as you go through RIPER-5 and you go into planning and you discover those crosscutting things that you want, we create rules. The current project I'm working on, I'm using Spanner.

>> So there are some rules on how Spanner operates.
>> Yeah.
>> But I'm still somewhat new to Spanner. So as I'm going through that and I'm working with Spanner, doing DML or DDL, doing updates, and Spanner has a certain way of dealing with it, I'm like, "Oh, I ran into an issue here." So I add that into my rule file and share it out via the submodule. Other developers pull it in, so when they're running another specification that touches on Spanner, they're picking up that DML rule that I added. So we use rules for those particular things.

>> Gotcha. And how much of that is checked into the repositories?
>> All of it. What we do is just create a submodule.
>> Yeah.
>> So we have a Cursor submodule; it holds our rules, and we put rules and commands into that. And then for all of our repos, we've probably got five repos and this one's not a monorepo, what we do is just have them clone the submodule, which pulls down the latest rules, and then we apply those rules as we go through them.

>> E is execution.
>> Execution. So I should say, as we talked about before, as important as what you can do is what you cannot do in each of the stages. When you're in research, innovate and plan, one of the rules is that you cannot execute; you cannot create code. And that's important to make sure the LLM and the developer are in sync before they get to that stage of execution and creating the code. So now we're in execution, and now you don't plan anymore. You follow the plan. Those are the rules; that's what has to happen.
>> What if you realize during execution that actually we need to go back to planning, or further back? Is it a process where you can bounce around, or is it best done where you agree completion before you move on, and it gets a bit messy to bounce around?
>> It's a fine line, so the answer is it depends. The general rule is to go back to the specification and re-plan. That's the general rule, but, you know, sometimes the LLM can take a minute, right? So maybe you just want to make a small change to the plan. It's frowned upon, but that's okay.

If you do it, you do it. Once you get to the execution stage, it really begins to depend. I'm a firm believer that there are different models you can take. When we talk about spec-first versus spec-driven development: are we doing spec-driven development, where the spec stays the thing? A lot of folks, a lot of tools out there that drive specifications, maybe take that approach. I haven't found that to be really successful for me, because it forces you to move up too high in the abstraction.
>> Yeah.
>> Because if you're too low, then you have specifications that start to collide. So I haven't found that perfect balance for me.

I tend to use spec-first, where the spec is used to define what we want to do in plain English. We review that, we generate from it, and when we review what's there, there may be small changes that have to happen. So one of the techniques I picked up from a colleague of mine when I worked for Equal Experts, Marco Vermeulen, he's the creator of SDKMAN,

what he would do is, when the code was generated, he would review the code with a pair, with others, and then drop a to-do in the code that says, hey, I don't want nested for loops here, I want to use streams to do this. So drop a to-do in there, put that definition in there, and then you have the LLM roll through, pick up all the to-dos, and update the specification from that. Interesting technique. I've tried that a few times; it is a useful way of doing it. I've done that.
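A small sketch of that to-do technique, translated into Python (the original example was reviewed Java code): reviewers leave structured TODO comments during code review, and a collection pass gathers them so the LLM can fold each one back into the specification. The marker format, paths, and helper names here are illustrative.

```python
import re
from pathlib import Path

TODO_PATTERN = re.compile(r"#\s*TODO\(spec\):\s*(.+)")


def collect_spec_todos(src_dir: str = "src") -> list:
    """Gather every TODO(spec) comment left during code review so they can
    be fed back to the LLM to update the specification."""
    todos = []
    for path in Path(src_dir).rglob("*.py"):
        for line in path.read_text().splitlines():
            match = TODO_PATTERN.search(line)
            if match:
                todos.append(f"{path}: {match.group(1)}")
    return todos


# Example of the kind of comment a reviewer might leave in generated code:
# TODO(spec): replace the nested loops here with a streaming approach
```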

>> Other times I will just make the change, because in my approach with spec-first, the approach we try to follow, the code is our source of truth, not the spec. The spec is used to help us get there, help us align, help us do some of the thinking and then automate some of the code generation, but the code is actually the source of truth.
>> So as that code changes, do you go back to update the spec, or does the spec become kind of an old document?
>> It depends. Oh, you mean as we go forward? Yeah. So right now we get rid of the spec; we drop the spec. I haven't completely excluded it from the codebase, but I exclude it from what we ingest, because you don't want a mixed source of truth between the spec and the code. The code is the source of truth, so we exclude the spec from what we analyze. But I think it's going to be completely removed very soon.

>> So the execution then just leaves you with pure code. Does it also generate any tests from there at all, or is that kind of more...?
>> This is what we were talking about before with spec-driven development, or with BDD. In many cases one of our tasks would be to generate tests. And what you've seen, I'm sure, is when you generate code and then you have a task afterward that says create 95% code coverage across this, what the LLM will do is take the code and create tests from it, which makes incredibly brittle tests. So what I wound up doing as we went through this is spending so much time trying to fix these very brittle tests that I'd just get rid of the tests and regenerate them. That feels disingenuous; that doesn't feel right.

>> So what we're doing right now is pushing tests into the specification. During the planning process, part of the spec defines the BDD, so when you're generating code, you're generating the tests that go with it. It's together; it's not done as an afterthought.
>> Which, I think at Tessl as well, in our very early learnings we originally had specs for capabilities, features, functionality essentially, and tests separate, and we realized pretty quickly that a capability is just a different way of describing a test, and a test actually describes a capability, so we merged it all together. And yeah, I think that's absolutely right. I've seen similar examples of tests being written for code, tests written for the absolute implementation of the code as-is, and it makes it hard. It puts LLMs into that loop where, when it wants to make changes and things start breaking, it chases its tail a little bit in terms of what it should change, the test or the code and so forth.

>> What's interesting is, the more you have these conversations, like you and I hadn't specifically talked about that before, but we're finding we're all converging on similar behaviors with specs, what goes into the specs, and how they're created. I find it really interesting that people come from similar backgrounds and arrive at a similar approach to this.

>> Yeah, I think it's good. I think it's good. The R, the final R?
>> The final R. So, this is actually something that I think we miss too much, and that's the QA, right? This is the QC.

>> We've all established this pattern of plan and execute; you're seeing them brought into the IDEs like Cursor now. All that's there. RIPER expands that a little bit: what it does now is verify that the thing that was created matched the specification. So in that review stage, it looks at the plan, looks at the code that was created, and shows you drift. Did it do what you asked it to do? And it pushes this back to the developer, and the developer can say, yes, actually that drift is okay, it does what I want, because I'm reviewing and accepting it; or, if not, we have to remediate it. What it does is verify that the plan that was given is what was actually created, and then put it in the hands of the developer to decide what to do.

>> And is that like LLM-as-a-judge style? It identifies the drift based on what it's seeing, versus actually running some test to validate?
>> Yeah, I mean, I hadn't called it LLM as a judge, but yes, it is a stage that says, given this plan and the code that was executed, does it match? And it has things like create a checklist, so it'll give us our little green check marks that say it was all there, and red identifies the things that were mismatched.
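As a rough sketch of what that review output might look like, with `was_implemented` as a hypothetical stand-in for the LLM's judgement of each plan item against the code that was produced:

```python
def was_implemented(task: str) -> bool:
    # Hypothetical: in practice this is the LLM comparing the plan item
    # against the code that was actually written.
    return True


def review(plan: list) -> list:
    """Return (task, done?) pairs; the developer decides whether any drift
    is acceptable or needs remediation."""
    return [(task, was_implemented(task)) for task in plan]


def render_checklist(results) -> None:
    for task, ok in results:
        print(("✅ " if ok else "❌ ") + task)


render_checklist(review(["set up virtual environment", "build service layer"]))
```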

An interesting trick, an interesting feature of this, though, is that sometimes you want that drift; it's actually something that's beneficial. I say in my talk that the reason LLMs are so powerful is because they're non-deterministic. If it were a pure function, well, we've had those for a while: if we give it input, it gives us an exact output. It's called a template. We don't need non-deterministic behavior from an LLM to achieve that. So sometimes it will give you something where you're like, "Oh, I like that. That's okay." In that case, what we'll do is use review, get that drift, identify it, and tell it to go back and update the specification. That way we bring it back in and remediate it that way.

>> For people who want to read more about this, where's the best place to go, the ThoughtWorks blog or...?
>> To be honest, we haven't blogged about it. This talk will be published in a few weeks; it'll be available. There'll be some things I'll put out on LinkedIn, and yeah, I'll get some posts out there, but I don't think there's anything I can point to other than the Cursor forum where I actually found it originally, which was March of this year.
>> But follow Wes on... where's the best place?
>> Bluesky, and I'm on LinkedIn primarily. I don't really use X or Twitter as much these days, but I'm out there.

>> Amazing. Wes, it's been an absolute pleasure.

>> Thank you, Simon. I really appreciate you being at the talk and I appreciate the opportunity to chat with you.

>> Thank you.

[Music]
