
AIE Europe Keynotes & Coding Agents ft. Pi, Google DeepMind, Anthropic, Cursor, Linear, & more

By AI Engineer

Summary

Topics Covered

  • Build Your Own Agent Harness
  • Agents Compound Bugs, Humans Feel Pain
  • Friction Is Where Your Judgment Lives
  • Validation Is Free at 1,200 Tokens Per Second
  • Missions Are Ecosystems of Agents

Full Transcript

Heat. Heat.

Heat. Heat.

Heat. Heat.

Heat. Heat.

Heat.

Heat.

It doesn't knock.

It doesn't name itself.

It calls you.

It calls you.

No face. It needs no crown. No single hand to strike you down. It moves through mouths that claim they know what's right, what must be so. It speaks in care. It speaks in good. Cuts everywhere.

No god, no code, no line to cross. Just "necessary" to justify the fracture.

Every truth becomes a weapon.

It isn't flesh or bone. It thinks through you. It takes your body. It hijacks choice, rewrites the chain, progress again for power. Eyes that gleam: envy calls it equity, outrage made clean, broken sloth that lets the thinking die, fear, coercion with the smiling face. Fly or disappear.

A demon without form.

A thought that breathes a storm.

It doesn't burn you instantly.

It hollows you out slow until your voice is not your own and your hands do what you don't know.

This is the warning. Not a monster outside the gate. But the moment you stop and outsource what you hate, it doesn't need

silence from your spine.

Give up the work of conscience and it will speak in your name.

Heat.

Hey, heat. Hey, heat.

It feeds on abdication.

The death of hesitation.

When you trade truth for alignment and agency for peace, you don't fall into hell screaming.

You walk there on a leash.

It is not a person.

It is not a plank.

It is an idea that asks you to stop guarding the gates. If your words are not your own, if your reasons feel

rehearsed, check your mouth, check your mind.

The demon speaks first.

launch control. We have a go. Roger.

Ladies and gentlemen, please join me in welcoming to the stage your MC for day two of AI Engineer Europe, Tejas Kumar.

Good morning.

Thank you. Thank you. Good morning. We are here. Woo.

There was a little bit of latency there, but we'll fix it. We'll write a skill for that. Good morning. Hey, it's day two. What an honor. What a privilege. What a blessing to be here. Look at this. It's a full room. We are so excited. How many of you enjoyed yesterday? Show me your hands. And if you didn't, there we go. That's right. It is an amazing conference.

Yesterday was so incredible. Highlights. Do you want to shout out some highlights for us today? No, it's Europe. We're a bit more reserved here. If this was America, it'd be different. But I tell you what: if you do have highlights, post them on your social media and use the AI Engineer tag. We want this to be a community thing, and the more people we can invite in, the better. My highlight, thanks for asking, was Malta yesterday, when he shared this slide of how Europe is leading AI innovation. And I think this is so cool, because oftentimes we feel like the underdog. At least, I live in Berlin, Germany, and I feel like the underdog, and this was so validating that real innovation is coming out of Europe, even with the DeepMind office in Berlin. So excellent. We love Europe. We're here. It's going to be amazing.

Today is day two, and we're going to talk about a lot of interesting topics. We're going to talk about coding agents. We're going to talk about MCP. Who's using MCP here? Look at that. Almost everybody. Incredible. We're going to talk about AI architecture, media, GPUs. There is so much to do. But before we do that, will you join me in giving it up for our sponsors? Google DeepMind, our presenting sponsor. Let's give them a huge round of applause.

For real, this would not be as amazing as it is without them, and we are very, very thankful for them. Let's also give a huge round of applause for our platinum sponsors. We've got Braintrust. Keep it going. Braintrust, WorkOS, OpenAI.

Yeah, it is a real blessing to have such sponsors that create this environment where we as builders can assemble, build, and be inspired. And finally, we've got gold and silver sponsors here. Give them a round of applause as well. You can find them in the expo hall, that's through these doors to the right and upstairs, in the breaks. I encourage you, they have some really amazing swag. I picked up from one of the companies a little three-button keyboard for vibe coding. You see this? It's so cool. So I encourage you, go check that out.

We are going to start today with a little bit of ground-rule setting, okay?

The speakers have their jobs. They know what they're going to do. They're going to come here and they're going to inspire. But you have a job as an audience. Are you aware of this? You have a job. Your job is to make the speakers and presenters feel amazing. In fact, you have so much power, because you get to decide the quality of the talk you watch. If you make your speakers uncomfortable, like, "Hey, prove yourself," then they're going to be nervous and anxious and it's not going to be a good talk. Okay? I know this cuz I speak. But instead, if you validate them even before they prove anything, they just walk up and you're like, woo, I did that, I made that sound. I know, that was naughty. If you do that, I guarantee you're going to have a great time, cuz they're going to feel validated. They're going to feel confident. They're going to give their best, and you're going to make it easy for them. It is a conversation. It's not a monologue. Okay? Is that clear? So as the speakers come up, I want you to warm them up and give them your biggest round of applause. Let's practice. Let's do it right now. Pretend a speaker just walked up. Exactly. It's a little bit quiet over here. I see you. Let's try that again. This time everybody look at them. I'm joking. I'm joking. But let's pretend one more time. Your biggest round of applause for a speaker who has just walked up. Come on.

There we go. There we go. That's exactly why we're here.

So now we're going to introduce our first talk of the day. We're going to introduce our first speaker. Our first speaker comes to us from Google, and we're going to hear about Gemma. Gemma is an incredible family of models. I personally love it because it's a source-available set of models, and they run almost everywhere. They can be fine-tuned. It's so cool. I'm really excited for this talk. Please, we practiced this: give it up for your first speaker, Omar Sanseviero.

All right. Hi everyone. It's full here. I'm super excited to give this talk, because just seven days ago we released Gemma 4. Before this conference, who here had heard about Gemma already? Okay, most of you. Great.

So Gemma is Google DeepMind's family of open models. Open models means that these are models that you can take, you can download, you can run in your own infrastructure, on your own devices, and you can fine-tune for your own use cases. About a year ago, we released Gemma 3. Back then, Gemma 3 were the most capable open models that could fit in a single consumer GPU. We designed models from 1 billion parameters all the way to 27 billion parameters, and back then on LMArena it was a very strong model. You can see here different open models with their LMArena scores, and those small dots at the bottom represent how many H100s or A100s you would need just to be able to load the models. So this is Gemma 3, from one year ago, but you can see that even if it's a model from a year ago, it's a relatively small model that is extremely capable.

But yeah, last week we released Gemma 4, and this is my first conference talking about Gemma 4, so I'm very excited about that. Gemma 4 is the most capable family of open models that Google has ever released.

These are models that go from two billion parameters all the way to 32 billion parameters, and they have very different capabilities, so I'm going to talk a bit about these different things. And if you're wondering what the E there means, I'll explain that in a second. The smallest two models can run in an Android phone, in an iPhone as well, even in a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things. Then there's a mixture-of-experts model that's super fast, very low latency, a model that can do very cool things. And then you have the 31B, the most intelligent, most capable model. When you want the most raw intelligence, you would use this large model. But even the 31B is a model that can run in a consumer GPU. So all of these models come in developer-friendly sizes, which is quite important to us.

Let me show you a couple of demos, assuming the videos load. There's a lot happening here.

Let me begin with the one at the right. That's an application where you have Gemma running directly in an Android phone and you can pick different skills. So pretty much here you have a full agentic setup where the model is picking, say, a skill to play the piano, and then you have Gemma playing the piano. The one at the left is Gemma vibe coding, also on device. This is, again, airplane mode, no API calls, fully running on a phone. And the example in the middle is on a laptop computer, where we have 10 instances of Gemma running in parallel. Each of them is doing a different SVG, and in a couple of seconds you're going to see 10 SVGs generated by different agents. All of these are running on device with llama.cpp, and even then it's like 100 tokens per second, and there you can see the SVGs that were generated by the 10 different Gemma models. Gemma is a good coding model. It can do agentic stuff. It can do coding. It can even do Android app development, and again, all of this offline.

The LMArena scores are quite nice. Here you can see a bunch of different models. The x-axis is how many billion parameters the model has; the y-axis is the LMArena score. And I know LMArena is not the perfect benchmark, but it does give you some proxy of how much the community likes the model for general use cases like conversations and so on. Gemma has a nice mix between being friendly and helpful and at the same time being very capable. And you can see this corner at the top left; that means these are very small models that are very capable, which is quite exciting.

It's been exciting to see how the models have progressed over the last two years. Last year it was Gemma 3. Two years ago it was Gemma 1, sorry, Gemma 2. And you can see that for a bunch of different things the models have kept getting better and better without getting bigger, which for me is quite exciting, because if I think about where we'll stand a year from now, or two years from now, I do think we'll have extremely capable models running directly on our own devices, in our own pockets.

I'll skip the benchmarks, but what is exciting is that Gemma can fit in a desktop computer, it can fit in a laptop, it can fit in a phone. I saw yesterday, or two days ago, that someone put llama.cpp on a Nintendo Switch and they are using llama.cpp to try Gemma directly there. So I don't know how things will be in a couple of years, but I'm excited for it. Something that we heard a lot with the previous Gemma versions was that our license was not great; people wanted a proper open-source license. So with Gemma 4 we changed our license to an actual Apache 2 license, so you have the full flexibility of Apache 2. That's quite nice as well.

Now, you have probably heard about mixture of experts (that's the 26B model), and you have heard about transformers and dense models, but you have probably never heard about the E here. E2B stands for effectively two billion parameters. Gemma E2B actually has more parameters, four billion or so, and it has a novel kind of architecture called per-layer embeddings. That was something we released in the summer of last year. There's this small block at the bottom, and the TL;DR is that there is an embedding per layer, as the name indicates, and it works more as a lookup table than a computation that you need to do. So this is an extremely fast thing. You don't need to have it on the GPU. You can have it in the CPU. You can have it on disk. And this is an architecture decision that is really optimized for on-device, mobile use cases. That's why the smallest models, the ones that can run on an Android phone or an iPhone, use this E2B or E4B architecture. So even if the model is five billion parameters, you actually just load two billion parameters onto the GPU, and the rest can live in much slower memory, because you are not doing any of the matrix multiplications that you would usually do with the transformer architecture. And this can be done leveraging llama.cpp with a simple flag, --override-tensor: you move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box.
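To make that concrete, here is a minimal sketch of launching a llama.cpp server with the per-layer embedding tensors kept off the GPU. The GGUF filename and the tensor-name pattern are assumptions for illustration; check the metadata of the Gemma build you actually download for the exact tensor names and adjust the regex accordingly.

```python
import subprocess

# Hypothetical GGUF path; substitute whichever Gemma build you downloaded.
MODEL = "models/gemma-e4b-it-Q4_K_M.gguf"

subprocess.run([
    "llama-server",
    "-m", MODEL,
    "-ngl", "99",  # offload the regular transformer layers to the GPU
    # --override-tensor maps tensors matching a regex to a backend.
    # The "per_layer" pattern below is an assumed name for the per-layer
    # embedding tensors; verify it against your GGUF's tensor list.
    "--override-tensor", "per_layer.*=CPU",
])
```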

A couple of other exciting things. The smallest models can do multimodal understanding for images, for videos, and even for audio. So you can do speech recognition, and you can do speech translation: I can speak in Spanish and the text can be transcribed to, I don't know, French. And then the larger model can do extremely capable multimodal understanding: videos, fine-grained details. I actually have a couple of examples here. For example, it can do things such as pointing to where the llama is in a picture. It can do object detection, so it can detect different objects in a picture. And what is cool is that this model is heavily multilingual. Gemma 4 was trained with over 140 languages, and it uses the tokenizer that is based on Gemini as well, so pretty much all of the multilingual research that powers Gemini is also enabling Gemma.

The tokenizer piece is quite interesting, because independently of the raw capabilities of Gemma, this tokenizer was designed for multilingual use cases, and we took lots of care with it. Which is interesting, because you might want to fine-tune Gemma for a language with few digital resources, let's say an indigenous language in Peru like Quechua, or, I don't know, one of the official languages in India. You can pick the model, you can use your data, you can train the model, and independently of the raw capabilities of Gemma, just because of the tokenizer decisions, things tend to work quite well out of the box. And then you can mix the multilingual with the multimodal capabilities, for example here to get the text or an explanation of an image with Japanese text, which is quite cool.
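As a rough illustration of that fine-tuning path, here is a minimal sketch using Hugging Face's TRL SFTTrainer, which builds on the Transformers ecosystem the talk mentions. The checkpoint name and dataset are placeholders, not real identifiers; substitute the Gemma build you actually use and a real corpus for your target language.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Placeholder checkpoint and toy data, for illustration only.
CHECKPOINT = "google/gemma-<size>-it"  # hypothetical name, not a real model ID
corpus = Dataset.from_dict(
    {"text": ["Allillanchu, imaynallam kashanki?"]}  # e.g. Quechua sentences
)

trainer = SFTTrainer(
    model=CHECKPOINT,          # TRL loads the checkpoint from this name
    train_dataset=corpus,      # expects a "text" column by default
    args=SFTConfig(output_dir="gemma-quechua-sft", max_steps=100),
)
trainer.train()
```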

So we released the model a week ago, and just yesterday we got to 10 million downloads just for Gemma 4-based models. There are over 1,000 models based on Gemma 4 already, quantizations or fine-tunes by the community, and over 500 million downloads of the whole Gemma family. What is very cool for me is that Gemma is not just about "oh, it's a model that you can use"; it's more about enabling the ecosystem to build on top of it, and that's what the community has done over the last few days. It was at the top of Hugging Face. People have been building cool examples. An onslaught of people have been doing full repository audits using Gemma. People are putting Gemma in all kinds of devices and exploring all of the capabilities, which is quite nice.

And all of this is not done just by us. We collaborate with the open-source ecosystem. We work with Unsloth, MLX, Ollama, Hugging Face, vLLM, SGLang, and pretty much we want to ensure that when we launch a new tool, both for Gemini and for Gemma, people can leverage the capabilities out of the box, right? They should not need to switch to a different framework to fine-tune Gemma; if they are fine-tuning with Hugging Face Transformers, they should be able to do that. So for us, it's very important and critical to be where the community is. And that's why: a real shout-out to all of those of you who are working in the open-source ecosystem, who are contributing to different tools, maintainers of all of these repositories, because it's really a way to enable the ecosystem to do amazing things.

Another part that I like about Gemma is all of the product integrations that we can do. Android Studio, I don't know if anyone here is an Android developer, but Android Studio has an agent mode where you have an agent that helps you vibe code and develop. And there's an offline mode now where you can have a llama.cpp, Ollama, or vLLM powered system in which you have Gemma helping you vibe code for Android development. We did include some Android-related datasets and benchmarks while training Gemma, so it's actually a very capable model for Android development.

I talked a bit about how many people are fine-tuning and how many people are sharing, so let me share a bit about the Gemma numbers. This number is outdated, it's from last week; now we have 500 million downloads, as I mentioned, and in total Gemma has over 100,000 models. So again, maybe you just want to use them out of the box, and open models may work great for you, but maybe you want to improve the capabilities. Maybe you want to change the style in which the model talks with users. Maybe you don't want a conversational model, right? Maybe you just want a model that can predict a certain thing in your own context. Or maybe you just have too many GPUs at home and you want to burn them. I don't know what your reason is, but you can fine-tune models for many cool things.

Google has done a couple of what we call official Gemma variants. We did ShieldGemma, which is a family of guardrail models. Those are great for production use cases where maybe you don't want users to put in, let's say, toxic images or toxic text that does not match the policies that you have set up. ShieldGemma is the family of models that allows you to do that. But then there are also other kinds of use cases. For example, for medical use cases we have released MedGemma, which is a multimodal Gemma 3-based model for different medical tasks: radiology, chest x-ray understanding, and a bunch of other things. And again, these are open models, you can use them, and you can also fine-tune them even more if you have an even more niche use case. So that's what Google has done.

But the community is also doing cool things. For example, there is AI Singapore, a group that is training models for Southeast Asian languages. There are a bunch of them, and they have been building quite a bit of research with open models to push the state-of-the-art capabilities even further in terms of multilinguality. Another example is Sarvam. In India there are many official languages, and there is this effort by the government: they are investing in a couple of big startups to train national models. So this is more from the sovereign AI and official languages point of view, but people are doing very interesting stuff on the multilingual side of things.

Apart from that, there's quite a bit of other cool research happening. There was this paper we released in December of last year about how some researchers from DeepMind were able to use Gemma 3 to propose some cancer therapy pathways, which was actually taken to a lab, and they were able to validate that the pathways proposed by this Gemma-based model led to actual results that could be validated. That was quite exciting, because it's not just about having your assistant, chatting, doing role playing and whatnot. It's also about building models that can be used for actual things that help the community in many different ways. Be that finance, or, I don't know, legal reviews, offline use cases where you don't want your data to leave your servers, offline modes if you're, I don't know, in the subway or on an airplane and you need to use AI for something, if you want to have a Chrome extension that has Gemma in there and helps you understand what is on your screen. If you want to do on-device control, the open models are getting there. And for me that's quite exciting, because if you compare where we are now versus one year ago, two years ago, open models now can do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone.

So I really recommend all of you to just spend one hour in the next two weeks playing with the latest open models and trying to understand their capabilities. Of course, there are many things for which you will want to use an API-based model. If you want the most raw intelligence you will go and use Gemini, or your model of choice. But if you want to have things on device, there are many exciting things that you can already do. And for me, what is most exciting is that I don't know how things will be in six or twelve months from now, but I think we are heading in a very exciting direction where people will be able to have extremely capable open models on their own devices, customized for their own use cases with their own data. So yeah, please try the models, build something, and share that, right?

Thank you.

Our next presenter is here to make the case for the future of MCP. Please join

me in welcoming to the stage the creator of MCP and member of technical staff at Anthropic, David Soria Parra.

Okay.

Well welcome.

Let's get started.

This is an MCP application.

That's an agent shipping its own interface, not through a plug-in, not through an SDK, not rendered on the fly by the model on the client side or hardcoded into the product. That is something that is served over an MCP server. And you can take the server, put it into Claude, you can put it into ChatGPT, you can put it into VS Code, Cursor, and it will just [ __ ] work.

And that I think is kind of cool, because for doing that you need something that a lot of things in the ecosystem do not offer. You need semantics. You need both sides, the client and the server, to understand what each side is talking about, to understand how you render this, to understand that there's a UI coming. And for that you need a protocol.

And the best part: an MCP server doesn't just ship an app, or it can ship an app, but it can also ship tools with it. So you can interact with the application as a human, and you can have the model interact with it through tools, which is, I think, a very unique thing that we have not explored much just yet.

Okay, but let's quickly rewind a little bit from this, what I think is a really cool glimpse into the future of MCP, to over a year ago, 18 months, an eternity in the AI life cycle. All of this did not exist. There was just a little spec document, a few SDKs, mostly written by Claude, local only, with little more than just tools. And in the last 18 or 12 months, you guys have been absolutely crazy building stuff: building servers, building a crazy ecosystem around this. And we on our side have been busy taking this local-only thing and adding remote capabilities, adding centralized authorization, adding new primitives like elicitation and tasks, and last but not least, adding new experimental features to the protocol like the MCP applications that you've just seen.

And in the meantime, we have reached what I think is a really cool milestone, because again, all of you have been absolutely crazy building, building, and building, luckily with the help of a bunch of agents. We're now at 110 million monthly downloads, and that's of course not just us and our clients and servers. That's OpenAI, it's the Agents SDK, it's Google's ADK, it's LangChain, thousands of frameworks and tools that you might have never even heard of, pulling it in as a dependency, which means there's one common standard that all of us have at our disposal to speak to each other. Just for a bit of context, React, one of the most successful open-source projects probably of the last decades, took roughly double the amount of time to reach that download volume. And in the meantime, of course, you all have been building really cool servers, from little toy projects like WhatsApp servers and Blender servers to SaaS integrations like Linear, Slack, and Notion that are really powering what everyone does every day when they use MCPs. But most importantly, the vast majority of the MCP servers all of us have built are behind closed doors, connecting companies' systems to agents and AI applications.

But I still think this is just the absolute beginning of where we are, because I think 2025 was all about exploring and 2026 is all about putting these agents into production. If you really think about it, in my mind, in 2024 we just built a bunch of demos and showed cool stuff to people, and there was a little bit of a buzz there. 2025 was really all about coding agents. Coding agents, if you really think about it, are the most ideal scenario for an agent. It's local, it's verifiable, you can call a compiler, you have a developer in front of the computer who can fix [ __ ] if it goes wrong. And you can display a TUI and the user is quite happy.

But I think now, with the capabilities of the models increasing, we are going into a new era, and I think this year we will see it start, where we're not just doing coding agents. We're going to have general agents that will do real knowledge-worker stuff, things a financial analyst wants to do, a marketing person wants to do. And they need one thing in particular. They don't need a local agent that calls a compiler; what they need is something that can connect to five SaaS applications and a shared drive, because the most important part of an agent for them is connectivity. And in my mind, connectivity is not one thing. If someone tells you there's one solution to all your connectivity problems, be it computer use, be it MCP, they are probably pretty wrong, because the right answer, of course, is that it always depends. There's a big connectivity stack, and there's the right tool for the right job. And in my mind, there are three major things that you want to consider when building an agent in 2026: it's skills, MCP, and of course CLI or computer use, depending on your use case. And they have three very distinct things that they can do, and three different things you want to consider when you build your agent.

Number one, skills, of course, are just domain knowledge: capture specific capabilities, put them into a very simple file, and it's mostly reusable. There are some minor differences between the different platforms.

CLIs, of course, are very popular with local coding agents. A CLI is an amazing tool to get started simply, to have something you can compose in bash, where the model can automatically discover what the CLI is capable of. And most importantly, if you have things that are CLIs, like GitHub, Git, and other things that are in pre-training, a CLI is an amazing solution for your connectivity. They're particularly good when you have a local agent where you can assume a sandbox, where you can assume a code execution environment.

But if you don't have this, if you need rich semantics, when you need a UI that can display long-running tasks, when you need things like resources, when you need to build something that is fully decoupled and needs platform independence, or you don't have a sandbox, when you need things like authorization, governance policies, or, in short, boring but important enterprise stuff, or if you want experiments like MCP applications or, coming soon, skills over MCP, then I think MCP is this additional connective tissue, just yet another tool in the toolbox for you to build an amazing agent. And so this is all to say that I think in 2026 we're going to start building agents that use all of it. They don't use one thing; they use all of it, and they use them quite seamlessly together.

But I don't think we're quite there just yet, because we need to build a lot of stuff. Partially because our agents kind of still suck, and partially because I think we just haven't talked enough about some of the techniques you can use to really put this connective tissue together.

The number one thing that we need to go and start building is on the client side, on the agent harness side, on the things that power the connective parts, be it Claude Code, be it Pi, be it whatever application you're going to build. And the number one thing we're going to do there, what we all have to do, and something I want to really get across today, is that we need to go and start building something called progressive discovery. Most people, when they think about MCP, think about context bloat. But if you really consider what a protocol does, a protocol just puts information across the wire; the client is responsible for dealing with that information. And what everybody so far has done, because we're in this very early experimentation phase, is to simply put all the tools into the context window and then be quite surprised that the context window gets large. What you can do instead, and what you should do instead, is start using this progressive discovery pattern, which is to say: use something like tool search to defer the loading of the tools and start loading the tools when the model needs them. We have this in the Anthropic product and the API. People can use this on competitors' APIs as well. But you can also just build this in yourself, where you download the tool directly: you give the model a tool-loading tool, basically, and the model goes, "Ah, maybe I need a tool now, let me look up what tools I need," and then you load them on demand.
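As a rough sketch of that build-it-yourself version: instead of injecting every tool definition up front, the harness exposes one cheap search tool over the full catalogue and merges results into the active tool set only when the model asks for them. The registry, function names, and the tiny loop below are hypothetical scaffolding for illustration, not any particular SDK's API.

```python
# Hypothetical harness-side sketch of progressive discovery: the model starts
# with a single search_tools tool and pulls real tool definitions into its
# context only when it decides it needs them.

FULL_CATALOGUE = {  # imagine hundreds of MCP tool definitions here
    "linear_create_issue": {"description": "Create an issue in Linear"},
    "slack_post_message": {"description": "Post a message to a Slack channel"},
    "notion_search_pages": {"description": "Search pages in a Notion workspace"},
}

def search_tools(query: str, limit: int = 5) -> list[dict]:
    """The only tool in the initial context window: a keyword search over
    the catalogue that returns matching definitions on demand."""
    q = query.lower()
    hits = [
        {"name": name, **meta}
        for name, meta in FULL_CATALOGUE.items()
        if q in name or q in meta["description"].lower()
    ]
    return hits[:limit]

def expand_tools(active_tools: dict, tool_calls: list) -> dict:
    """After each model turn, fold any search_tools results into the set of
    tool definitions sent on the next turn, instead of paying for the whole
    catalogue up front."""
    for call in tool_calls:
        if call["name"] == "search_tools":
            for hit in search_tools(**call["arguments"]):
                active_tools[hit["name"]] = hit
    return active_tools

# Example: the model decided it needs something Slack-related.
tools = expand_tools({}, [{"name": "search_tools", "arguments": {"query": "slack"}}])
print(list(tools))  # ['slack_post_message']
```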

And here in this example, what you're seeing on the left side is Claude Code before we added this, and then after we added it. So you see a massive reduction in tool context usage.

The second part to that is something called programmatic tool calling, or what other people usually refer to as code mode. This is the idea that one thing you really want to do is compose things together. You don't want the model to call a tool, take the result, then call another tool, take the result, call another tool, because what you're effectively doing is letting the model orchestrate things, and in that orchestration you're using inference. It's latency sensitive, and all of that stuff could be done way more effectively if you would instead write a script. And in fact, that's actually what you constantly see things like Claude Code do when they write a bash command. But you can of course do this with everything, and you can do this with MCP, and you should do this with MCP. So what does this mean? Instead of having one tool call after another, you want to give the model a REPL tool, provide an execution environment, like a V8 isolate or a Monty or something like that, or a Lua interpreter, and just have the model write the code for you; the model executes that code and composes the tools together. And there's a neat little feature in MCP called structured output that tells you what the return value of the output will be. The model can use this information to figure out type information, which means it can really nicely compose these things together. And in this example here, instead of doing two different calls, you do one call and you can filter it. The model will automatically remove things from the JSON and just continue.
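The difference is easiest to see side by side. In the sketch below, the tool names, the result shape, and the stub data are all hypothetical; the point is only that in code mode the model emits one small script that the harness runs in its execution environment, so the intermediate result never round-trips through the context window.

```python
# Hypothetical tools standing in for MCP tool calls with structured output.
def list_invoices(customer_id: str) -> list[dict]:
    return [{"id": 1, "status": "overdue"}, {"id": 2, "status": "paid"}]

def send_report(rows: list[dict]) -> None:
    print(f"reporting {len(rows)} overdue invoice(s)")

# Pattern A: the model orchestrates. Every arrow is an inference round trip,
# and the full invoice list flows through the context window just so the
# model can filter it:
#   model -> list_invoices(...) -> model -> send_report(...) -> model

# Pattern B: code mode. The model writes a short script once; the harness
# executes it in a sandboxed interpreter and only the final outcome returns.
model_written_script = """
rows = list_invoices(customer_id="acme")
overdue = [r for r in rows if r["status"] == "overdue"]  # filtered locally
send_report(overdue)
"""
exec(model_written_script, {"list_invoices": list_invoices, "send_report": send_report})
```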

Of course, if you don't have structured output, you can always just ask for it, by extracting it: call a cheap model and say, "I want this expected type, give it back to me." And bam, you have a type, and the model can compose things together. I think this is something we're just not doing enough yet, and it's somewhere we can improve our agent harnesses. And then last but not least, of course, you can compose these things together with executables, with CLIs, with other components, with APIs as well.

Next, what we need to do besides the client work, which is progressive discovery and programmatic tool calling, is to go and start building properly for agents. And that means we all need to stop taking REST APIs and putting them one-to-one into an MCP server. Every time I see someone building another REST-to-MCP server conversion tool, it's a bit cringe, because I think it just results in horrible things. What you should do instead is design for an agent. Basically, you can start by designing for yourself as a human, how you would want to interact with this, because that's actually a very good start for an agent. If you want to orchestrate things together, you should of course reach for programmatic tool calling, and you can do this on the client side, as I said before. But you can also do this on the server side. The Cloudflare MCP server and others like it are great examples of how, instead of providing tools, you can provide an execution environment to the model and then just have it orchestrate things together, which again cuts token usage, cuts latency, and is way more powerful in its composition. And then last but not least, we should start, as server authors, to use the rich semantics that MCP offers over alternatives. This means shipping MCP applications. It means shipping skills over MCP. It means using things like tasks and other aspects of the protocol that are currently slightly underused, or things like elicitations, things that only MCP can do for you.

And of course, that's all work you all need to do, and maybe some of our product people need to do. We also need to do a lot of work on MCP itself, and there are a few things down the line that we are going to have to go and solve. The number one thing is we need to improve the core. There are a few things that, as we have developed the protocol over the last year, are just not in good shape. Number one is that the current streamable HTTP is very hard to scale if you're a large hyperscaler. So we have a proposal from our friends at Google, who are working on something called a stateless transport protocol, which makes it significantly easier to just treat MCP servers like, you know, another stateless REST server, something we already know how to deploy to Cloud Run or Kubernetes and so on. That's coming down in June and hopefully landing in the SDKs very soon. In addition, we need to improve our asynchronous task primitive, which is basically a very fancy way to say we just want to have agent-to-agent communication. We have a very experimental version of the protocol that very few clients support, so we're going to start building out more clients like that. And most importantly, we are improving some of the little semantics that we need to. We're going to ship a TypeScript SDK version two and a Python SDK version two based on a lot of the lessons learned over the last year.

There's an SDK called FastMCP. Who's using FastMCP? Yeah, it's just way [ __ ] better than the Python SDK that we ship, right? And that's on me, because I wrote the Python SDK. So I have a bunch of people who are way better Python developers than me helping me write it better.

The second part is we need to start integrating everywhere. We're going to ship, particularly for enterprises, something called cross-app access. It's a new thing that we're working on closely together with identity providers, and it's a very fancy way to say that once you log in once with your local company identity provider, be it Google, be it Okta, you will be able to just use MCP servers without having to re-log in. So it's a bit more smoothness.

In addition, we're going to add something called server discovery, by specifying how you can discover servers on well-known URLs automatically. So crawlers, browsers, agents can just go to a website and say, "Instead of just parsing the website, is there also an MCP server I can use?" and we will be able to discover this automatically. This is a really cool thing that will also come down in June when we launch the next specification, and it will be supported there. And then last but not least, we are starting to use our extension mechanisms in MCP, which means that some clients will support this; for example, MCP applications will only be supported by web-based interfaces, because if you're a CLI you just have a hard time rendering HTML, right? And we'll do more of these extensions. One of the most exciting extensions that I think is cool: we're just going to ship skills over MCP, because it's very obvious that if you have a large MCP server with tons and tons of tools, you just want to ship domain knowledge with it and say, "This is how you're supposed to use this." It allows you as a server author to continuously ship updated skills without having to rely on plug-in mechanisms and registries and other stuff. So that's coming down.

There's a lot of experimentation from people already in that space. You can already do some of that today if you just give the model a load-skills tool; you can build primitive versions of this today without having to rely on the semantics. But of course, we're going to define the semantics.
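For a sense of what that primitive version might look like today, here is a minimal sketch of a server that exposes its skill documents as ordinary tools, written in the FastMCP decorator style the talk mentions. The server name, skills directory, and file layout are assumptions for illustration.

```python
from pathlib import Path

from fastmcp import FastMCP  # the community Python SDK mentioned in the talk

mcp = FastMCP("billing-server")

SKILLS_DIR = Path("skills")  # hypothetical layout: one markdown file per skill


@mcp.tool()
def list_skills() -> list[str]:
    """Names of the skill documents this server ships alongside its tools."""
    return sorted(p.stem for p in SKILLS_DIR.glob("*.md"))


@mcp.tool()
def load_skill(name: str) -> str:
    """Return the full text of one skill, so the model can pull in domain
    knowledge on demand instead of carrying it in every context window."""
    return (SKILLS_DIR / f"{name}.md").read_text()


if __name__ == "__main__":
    mcp.run()
```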

Okay. So that's, for me, a long-winded way to say that I think MCP is actually in really good shape, and I think this year we're going to push agents to full connectivity. MCP will continue to play a major, major role, and of course we want your feedback. We are a very open community. We have just created a foundation. We're mostly running as an open-source community, with a Discord, with issues. Just come to us and tell us: where the [ __ ] are we wrong? What are we getting right? So that we can improve this on a continuous basis. So 2026, I think, is all about connectivity, and the best agents use every available method. They will use computer use, they will use CLIs, they will use MCPs, and they will use skills, because they want a wide variety of things they can do, and then they can ship cool stuff like this, which is one of the product features we shipped recently. Under the hood it's nothing but an MCP application that renders stuff. Right, cool, so we can now look at the model writing graphs. Anyway, thank you.

Our next presenter is the creator of AgentCraft and MCP-UI, here to speak about agent orchestration. Please join me in welcoming to the stage Ido Salomon.

So, good morning, London. My name is Ido Salomon. I'm the creator of AgentCraft. I am also the creator of MCP-UI and creator and co-maintainer of MCP Apps. So I'm building some of the stuff that David has been talking about.

As you've all heard in the past day, agents are amazing. But if one agent is so amazing, why don't we scale up to 10 or 20 or 100 different agents and be 100 times more amazing? It is pretty simple. We just spin up a bunch of agents, we put them on this nice screen, and it looks really glorious, but it won't actually work. And the reason is that spinning them up isn't the problem. It's us. We are the bottleneck in orchestrating all of these agents. Now, if you think about it, the role of an engineer actually going and managing dozens of reckless employees is not typically what we do in most companies. So we need to somehow find these potentially new skills to manage all of these agents.

Luckily, they're not really brand new. It's not something that we've never done before. It's just something that's been hiding in unexpected places. I mean, if you're a gamer, or used to play games at any point, managing dozens of units probably sounds a little bit familiar, which is why I built AgentCraft, an orchestrator that aims to raise the ceiling of human-agent collaboration by taking learnings from gaming and transferring them into productivity. So let's see a quick walkthrough and understand the journey to raise that ceiling.

So this is AgentCraft. There's a lot to unpack, so we'll just start with the basics and go from there. This is an agent, not a metaphorical one. This is actually a physical manifestation of a coding agent, a live session. It can be, you know, Cursor, it can be Claude Code, Codex, OpenClaw, whatever. It's something that we can detect on the device and visualize, but it's also something that we can spawn directly from here. So now we have this agent and we can prompt it. We can use it just like any other agent that we have from our CLI or whatever. And what can we tell it to do? It has all of these quirks: we have voice and we have text and we have images and so on, and we can just tell it to do stuff. So, for example, we can tell it to develop some feature for us.

prompt.

And now the agent is working. So it's doing its work. And as we can see, if you look at the UI, there's a bunch of other stuff. We have these buildings, and each building represents some functionality. For example, one of these buildings manages the skills and plugins and so on. There's also an integrated terminal and git, just to get that end-to-end workflow.

The second part of raising the ceiling, now that we have the basics, is visibility. We need to be able to quickly understand what each agent is doing. So we have this nice side panel here that shows us a high-level mission status, summaries, and so on. What are they actually doing? But the cool thing about AgentCraft is that we don't just see a list of what they can do. We can actually see them working. If we look at the map, you would notice that it's actually a projection of my file system. Each part of my file system is on the map. I have these directories here, and each one of these directories has files. These files are represented as runes, as you can see here. So I can actually track and see visually what the agent is working on, which file, and I can see the entire change list of what happened there. And because we're orchestrating it, I also know which agents did what and when, so we can have full lineage of what's going on. And we can take this one step further. If I know all of this stuff, why not just create a heat map? I can actually visualize collisions, and I can even prevent them proactively.

Now, the cool thing here is that once we have this visibility, we're not exactly done yet, because we still need to be able to react to the changes that are happening. So we can lean into another cool mechanism from RTS games. We can simply use muscle memory to quickly cycle between the agents that need our help. They need us to approve the plan, they need us to answer some question, and so on. So now we have visibility and we can react quickly. So we're done, we've solved orchestration. Well, not quite, because that's really only the first step. I was able to use more agents in parallel, but only for a short amount of time. There are a few reasons for that. The first one is that there's a limit to how many ideas I can have in my head at any given time without getting tired. So what I did is basically tell the agents to do it. I told them, okay, find missions for me to do. So I have quests now, and I can click a button and they just do whatever: they can refactor, test, all the stuff that I don't want to do. The second one is that all of this babysitting takes a lot of time. I see what's going on, and I can react to it very quickly, but I still need to cycle through it. So what I did there is ask, how do I take myself out of the equation as much as possible? If agents are so amazing, why not just let them do it? I can just give them some idea. I have this campaign feature: broadly say what I want to happen, and I just spin up a container and let the agents run there. They can decompose the task, they can plan it, they can present the plan to me. I don't care what they're doing, because it's containerized, so do whatever. And the main thing here is that once it's decomposed, I'm not the one doing the babysitting. Now I have the campaign orchestrator, and that's his problem. So we're actually moving more of the effort to just the planning phase or the review phase.

And once we have that, we reach a point where we can just ask, why does it have to be my ideas? Why can't I tell it to run a cron job, go to Twitter every day, scan for cool ideas, and just implement them, and I just decide what I want? Which is actually how I implemented channels pretty quickly. So we have that, and now I just have a lot of different PRs to review. So there's this nice capability of review bundles, and now I can see exactly what changes happened in each one, why they did stuff, what the tasks are, and I also have visual evidence. So now I'm able to just look at screenshots, I can look at videos, and really see what's going on without investing too much time in doing it.

And once we have that, we can actually shift more of the work from the planning to the review. How much time do I need to spend on the plan if I can just do it 10 times and pick the one that is most fitting for me?

And the next part: we're still not done. I mean, if you think about it, this is only the first step, because agents aren't that smart yet. So we need to offload it to someone else: humans. Now, what I can do, and this is my favorite feature, is that we can actually create these workspaces. So I can collaborate with the product designer on my team, they can do whatever they want, and I can just continue from where they left off. For example, let's say this is an agent from the product designer, on their computer. They can see my agents, I can see their agents, I can understand what they're doing, and we can just collaborate.

Um, prompt prompt.

Yeah, they just started working again.

So I can see that they want to design this new page, which is pretty cool. So I can wait for them to finish, or I can just go ahead now and hand off from them to my agents. Well, our agents (insert communism joke), whatever. So we have our agents now and I can just keep going from there. And the cool thing is that it's not just human-to-human collaboration. We are also collaborating with the agents. There's more direct stuff like this: I can just type stuff and prompt my agents, or even their agents. But there's also a softer mechanism. There's actually a chat that is between humans and humans, but also between the humans and the agents. You can see here that the agent said, "I'm starting to work on something," and then I can say, "I'm also working on it." So the next time the agent does something, it knows someone else is working. They can also have soft collaboration, so they know what files each one is changing.

So we've actually taken a bunch of things that were limiting us from really reaching our full potential with agents and kind of solved them one by one. There are a bunch of other features that I just didn't have time to go over, but you can try them out and see for yourself if you can really work better that way.

So, to sum up: these are not exactly new skills. I mean, you're probably worried that we won't be able to adapt to this future where we're not actually coding, we're just telling other people, or other agents, to code for us. But these skills are there. They're just not something we used for work until now. So with games as one example, we can take these skills to the next level.

We need to somehow raise that ceiling. We need to somehow improve our collaboration with agents. And with AgentCraft, the goal is to take the learnings from games and really raise that to the next level, with better visibility, more autonomy for the agents, and human-to-agent collaboration.

So I invite you to go to the website. This is the QR code. It's free; you can just download it and play with it. It's still experimental. It's still new. There's a bunch of stuff that needs to change, but that will only happen with great feedback. There's also a Discord, so please join, give us your feedback, and let's raise the ceiling together. Thank you.

Our next presenter is the creator of one of the top coding agents, Pi, which is the engine inside OpenClaw. So naturally, he's here to tell us how agents are destroying open-source software. Please join me in welcoming to the stage the creator of Pi, Mario Zechner.

Hey there, I'm Mario. I built Pi in a world of slop, and this is a tragedy in three acts. Just to talk about this real quick: a bunch of people on the internet gave me money for ad space on my torso, and all of that goes to a charity. So yeah, thanks guys.

Act one: building Pi. In the beginning there was Claude Code, and it was good, right? We all got basically catnipped by that thing and stopped sleeping. There was a bunch of stuff before that, but Claude Code was the one thing that clicked with me the most. And to preface all of this: I love the Claude Code team. They are brilliant people, talented, super high velocity. They also created the entire game. Major props to them. So this is not a roast. This is just me, an old man, telling you why I stopped using Claude Code and built my own thing. In 2025, I started using Claude Code in about April, I think, thanks to Peter, because he told us the agents are working now. And back then, it was simple and predictable and fit my workflow.

the token madness got hold of them I think and the team got bigger and they started uh dog fooding that stuff and built a lot of features. A lot of features I don't need which is fine. I

can just ignore them. But with velocity and more features come more bugs and that's bad because I used to work at construction sites and if my hammer breaks every day I'm getting really mad

and if my development tools break every day I'm also getting mad. So there was this. it's just a running gag. And

this. it's just a running gag. And

here's Tar telling us that clot code is now a game engine. And here's Mitchell from Ghosty telling us, "No, it's not."

And eventually they fixed the flicker, but then other stuff broke. And I think they're now in the third iteration of a 2y renderer. Yeah, but that's just a

2y renderer. Yeah, but that's just a symptom. The real problem is that my

symptom. The real problem is that my context wasn't my context. Cloud code is the thing that controls my context. And

behind my back, cloud code does things uh to the context. So you have the system prompt which changes on every release including the tool definitions.

They would remove tools, modify tools.

It's not good. They would insert system reminders in the most opp inopportune place in your context telling the model here's some information. It may or may not be relevant to what you're doing

that it actually says it may or may not be relevant what you're doing. And that

kind of confused the model and that kind of broke my workflows.

On top of all that, there's zero observability, because that's how the tool is constructed, and I like knowing what my agents are doing. There's zero model choice, which is obvious: it's the native Anthropic harness, so it makes sense for them to want you to use Claude, right? And there's almost zero extensibility. Some of you might have written hooks for Claude Code, but I'm telling you, the number of hooks and the depth of those hooks is very shallow. And every time a hook triggers, what actually happens is that a new process gets spawned, basically the command you specified for the hook, and I don't find that particularly efficient.

So I took a step back and looked around for alternatives, and I'd like to especially call out Amp and Factory's Droid, the Porsche and Lamborghini of coding agent harnesses. If you can afford them, please use them; they're at the frontier, they're really good, and the teams are fantastic. And there's a bunch of other options. I have history in OSS, so naturally I gravitated towards OpenCode. And again: brilliant team, super high execution velocity, and they don't sell you hype, they sell you tools that work, for the most part.

I started looking under the hood of OpenCode with respect to context handling as well, because that's the most important part for me. And I found a bunch of things. Like, given some conditions, OpenCode would just prune tool output after a specific minimum amount of tokens, and that basically lobotomizes the model. There's also LSP server support, which means every time your model calls the edit tool, OpenCode goes to the connected LSP server, asks whether there are any errors, and if so injects that as part of the edit tool result. Which is bad, because think about how you edit code: you're not writing a line of code, checking the errors, writing the next line, checking the errors. You don't do that; you finish your work and then you check the errors. This confuses the model. There's a bunch of other things, like storing individual messages of a session in a JSON file: each message is a JSON file on disk. There was also this, and this happens to all of us, no blame there. But it's not great if, by default, a server spins up with CORS headers set in such a way that any website you open in your browser can now access your OpenCode server.

Yeah, and entirely unrelated to all of this, I started looking into benchmarks for coding agent harnesses and found Terminal-Bench, which is a pretty good benchmark, all things considered. And the funny part about it is that it's the most minimal kind of thing you can think of. All it gives the model is a tool to send keystrokes to a tmux session and read the output of that tmux session. There are no file tools, no subagents, none of that stuff. And it's one of the best performing harnesses on the leaderboard. Here's the leaderboard from December 2025: irrespective of model family, this minimal harness mostly scores higher, even higher than the native harness for that model.
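To picture how minimal that is, here is a rough sketch of such a keystroke tool. This is not Terminal-Bench's actual code; the session name and the helper functions are invented for illustration, and it assumes a tmux session has already been created.

```typescript
// Sketch only: a "send keystrokes to tmux, read the output" tool in the spirit
// of the minimal harness described above. The session name "agent" and these
// helpers are assumptions, not Terminal-Bench's real implementation.
import { execFileSync } from "node:child_process";

const SESSION = "agent"; // assumes `tmux new-session -d -s agent` was run beforehand

// Type literal keys into the tmux session, followed by Enter.
function sendKeys(input: string): void {
  execFileSync("tmux", ["send-keys", "-t", SESSION, input, "Enter"]);
}

// Read whatever is currently visible in the pane.
function readPane(): string {
  return execFileSync("tmux", ["capture-pane", "-p", "-t", SESSION], {
    encoding: "utf8",
  });
}

// Example: run a command, then crudely wait a second before reading the screen.
sendKeys("ls -la");
setTimeout(() => console.log(readPane()), 1000);
```

Everything else the model does, from editing files to running tests, it has to do by typing shell commands through that one narrow interface.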

So what does that tell us? It gives me two theses. The first is that we are in the [ __ ] around and find out phase of coding agents, and their current form is not their final form, right? The second thesis is that we need better ways to [ __ ] around, and for me that means self-modifying, malleable agents: things that the agent itself can modify, and that I can modify, depending on my workflow. So I stripped away all the things, built a minimal core, made it super extensible, and made it so that the agent can modify itself, with some creature comforts. It's not entirely bare bones. So that's Pi.

It's an agent that adapts to your workflow instead of the other way around. It comes with four packages: an AI package, which is basically just an abstraction across providers and context handoff between providers; an agent core, which is just a while loop and the tool calling; a bespoke TUI framework (I come out of game development, so I built a thing that actually doesn't flicker too much); and the coding agent itself.
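That "while loop plus tool calling" core is easy to picture. Here is a minimal sketch of the shape; `callModel` and `runTool` are hypothetical stubs for illustration, not Pi's real provider or tool APIs.

```typescript
// Minimal sketch of an agent core as "a while loop plus tool calling".
// callModel and runTool are hypothetical stubs, not Pi's actual API.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };

// Stub: a real harness would call an LLM provider here.
async function callModel(messages: string[]): Promise<ModelTurn> {
  return { text: `echo of ${messages.length} messages`, toolCalls: [] };
}

// Stub: a real harness would dispatch to read/edit/bash tools here.
async function runTool(call: ToolCall): Promise<string> {
  return `ran ${call.name} with ${JSON.stringify(call.args)}`;
}

async function agentLoop(task: string): Promise<string> {
  const messages: string[] = [task];
  while (true) {
    const turn = await callModel(messages);
    if (turn.toolCalls.length === 0) return turn.text; // model stopped calling tools: done
    for (const call of turn.toolCalls) {
      const result = await runTool(call);              // execute the requested tool
      messages.push(`tool ${call.name}: ${result}`);   // feed the result back as context
    }
  }
}

agentLoop("fix the failing test").then(console.log);
```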

Here's Pi's system prompt.

That's it. Eventually the industry created a new standard called skills, which is basically just markdown files, so we added that as well, and that needs to go in the system prompt. So, begrudgingly, we had to add a couple more lines. And finally, here's the magic that makes Pi able to modify itself: we ship the documentation, which was handcrafted by me and an agent, along with code examples of extensions. And all we need to do for the agent to modify itself is tell it: here's the documentation, here's some code that shows you how to modify yourself by writing extensions.

It comes with four tools. That's all it has: read, edit, bash. Here are the tool definitions. Don't read the text, just look at the size.

That's it. Here's what happens when you start a new session in one of these tools. The thing is, the models are actually reinforcement-trained up the wazoo, so they know what a coding agent is, because a coding agent harness is basically what they're trained in when they're post-trained. You don't need 10,000 tokens to tell them they're a coding agent. They know, because they are coding agents.

Pi is also YOLO by default, because my security needs are different from yours, and I don't think a little dialogue that pops up every time you call bash asking you to approve is a smart security mechanism. So instead, I give you so much rope that you can build anything that fits your specific security needs. There's also stuff that's not built in; I'm opinionated, because this is how I do it. But if you don't like that, then you just ask Pi to build you subagent support, or plan mode, or MCP support, whatever you need.

Extensibility comes with a bunch of table stakes and then with the extensions themselves. Extensions are just TypeScript modules; in the simplest case, a TypeScript file on disk. You point Pi at that, and here's an extension loaded as part of the harness. With that you get basically an extension API that lets you hook into everything and define stuff for the harness to expose to the model. That includes tools and slash-command shortcuts. You can listen in on any kind of event and react to it, and then save state in the session that's optionally provided to the agent as well, or stored there for tools that analyze sessions as part of your organizational workflows. You can do custom compaction, custom providers, and you have full control over the tool.

So you can modify everything in Pi, and you can then bundle all of that up and put it on npm or on GitHub, because I think we don't need to reinvent another bunch of silos called marketplaces; we already have package managers. And all of that hot-reloads. So if you develop an extension for Pi, you do so in the session, and you hot-reload the changes and see the effects immediately, which is great. It's also a game development thing: in game development you want very fast iteration speeds.
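To give a feel for what "an extension is just a TypeScript module" could look like, here is an invented sketch. The `ExtensionApi` shape and the hook names below are assumptions for illustration only; Pi's real extension API lives in its shipped documentation and code examples.

```typescript
// Illustration only: the rough shape of a harness extension as a TypeScript module.
// The ExtensionApi type and hook names are invented for this sketch; they are not
// Pi's actual API.
import { readFile } from "node:fs/promises";

type Tool = {
  name: string;
  description: string;
  run: (args: Record<string, string>) => Promise<string>;
};

type ExtensionApi = {
  registerTool: (tool: Tool) => void;
  registerSlashCommand: (name: string, run: () => Promise<void>) => void;
  onEvent: (event: string, handler: (payload: unknown) => void) => void;
};

// A hypothetical extension that adds a tool and a slash command.
export default function extension(api: ExtensionApi): void {
  api.registerTool({
    name: "count_todos",
    description: "Count TODO markers in a file",
    run: async ({ path }) => {
      const text = await readFile(path, "utf8");
      return String((text.match(/TODO/g) ?? []).length);
    },
  });

  api.registerSlashCommand("todos", async () => {
    console.log("list open TODOs here");
  });

  api.onEvent("session:start", () => console.log("new session started"));
}
```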

So, a couple of examples. Claude, or Anthropic, ships this slash command, by the way, which lets you talk to the agent while it goes on its main quest. I posted this little prompt on Twitter jokingly, and somebody built it in five minutes, with more features, and they didn't have to fork or clone Pi. They just let the agent write the extension based on the prompt. Here's Nico; he's one of the most prolific extension writers. I don't know what the [ __ ] is going on here. It's a chat room for all of his Pi agents, and they talk with each other. I would never use this, but all of this is custom, including the UI. Or you can play NES games, or you can play Doom.

And there are a bunch of other examples I'm not going to talk about. So, how do you build a Pi extension? You don't. You tell Pi to build it for you based on your specifications, and then you just iterate with it on that and hot-reload during the session. I'm going to skip that example as well. And if you don't like building things yourself, and I hope you do like building things yourself, but if you don't, you can look on npm, or our little search interface on top of npm, to find packages for subagents, MCP, and so on.

So, does it actually work? Well, here's the Terminal-Bench leaderboard from October, before Pi had compaction. I added that for Peter's claw thingy. It scored sixth place.

But none of this is actually about Pi. If there's one takeaway, it's that I basically want you to retake control of your tools and workflows. So, build your own. And if you want to know more about Pi and OpenClaw, go to this talk, please.

Yeah. And then eventually Peter happened. He put Pi inside of OpenClaw as its agentic core, which meant my open-source project became the target of a lot of OpenClaw instances, unbeknownst to their users. So, this is act two: OSS in the age of clankers. Clankers are destroying OSS. Here's Draw: they closed down their issue and pull request tracker. Here's OpenClaw's trackers. Here's mine. Half of that is OpenClaw instances who post garbage. So I started to rage against the clankers.

If you send a pull request, it gets auto-closed with a comment that asks you to please write a nice issue in your human voice, no longer than a screen's worth of text. And if I see that, I write "looks good to me," your account name gets put in a file in the repository, and the next time you send a pull request, it's let through. Clankers don't read that comment; they don't go back once they've posted the pull request. So, that's a perfect filter.
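The mechanic is simple enough to sketch. Below is a rough illustration using GitHub's Octokit REST client; the repository names, the `VOUCHED.txt` allowlist file, and the comment text are assumptions for illustration, not the actual bot running on the Pi repository.

```typescript
// Sketch of the "auto-close unless vouched" mechanic using Octokit's REST client.
// Repo names, file name, and comment wording are illustrative assumptions only.
import { Octokit } from "@octokit/rest";
import { readFileSync } from "node:fs";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "example-owner";
const repo = "example-repo";

// Hypothetical allowlist: one GitHub login per line, maintained by the human.
const vouched = new Set(
  readFileSync("VOUCHED.txt", "utf8").split("\n").filter(Boolean)
);

async function handlePullRequest(pullNumber: number, author: string): Promise<void> {
  if (vouched.has(author)) return; // vouched humans get through untouched

  await octokit.rest.issues.createComment({
    owner,
    repo,
    issue_number: pullNumber,
    body: "Please open an issue in your own words (one screen max) before sending a PR.",
  });
  await octokit.rest.pulls.update({
    owner,
    repo,
    pull_number: pullNumber,
    state: "closed",
  });
}

handlePullRequest(123, "some-user").catch(console.error);
```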

Mitchell eventually turned it into Vouch. Here's a clanker. I also label them: if you had interactions with OpenClaw, your issues get deprioritized. I also built tools where I embed issue and pull request texts into 3D space, so I can see clusters of issues. I also invented OSS vacation: I just close the tracker whenever I want, so I have my life back. So, does this work? Yes, sort of.

Which leads me to act three. Slow the

[ __ ] down. Everything's broken.

And then there's people that say, "Our product's been 100% built by agents."

Yes, we know it [ __ ] sucks now.

Congratulations.

And I'm hearing this from my peers and this is entirely unhealthy.

So here's how we should not work with agents, and why, at least in my opinion. I wrote this on my blog a while ago, but the basic idea is this. We have armies of agents, and you're using some hyped tool without knowing that it's basically uninstallable malware, and Anthropic built a C compiler that kind of works but actually doesn't, and we're hoping the next generation of models will fix it. And here's Perplexity building a browser, and that's also super [ __ ] broken, but the next generation will fix it. And SaaS is dead, software is solved in six months, and my grandma just built herself a Spotify with her OpenClaw. Come on, people.

So agents are actually compounding boo-boos, which is my word for errors, with zero learning, no bottlenecks, and delayed pain. The delayed pain is for you. Here's your codebase on a human, on one agent, and on ten agents. How much of the agent code can you review? Here's the same codebase, but expressed in the number of boo-boos per day. How many of those boo-boos do you think you'll find? Then you say, "Oh, I have a review agent." Let me introduce you to the wonderful world of the ouroboros. Doesn't work. It catches some issues.

The problem is that agents reproduce learned complexity. Where did they learn that complexity from? From the internet. What's on the internet? All our old garbage code. There are some pearls on the internet, really well-designed systems, but 90% of the code on the internet is our old garbage, and that's what the models learn from. And every decision of an agent is local, especially if the codebase is so big that it doesn't fit into its context, and if you let it go wild, it adds intertwined abstractions everywhere. So that leads to lots of abstractions, and duplication, and backwards compatibility. Who has seen that in the output of their agent? It's [ __ ] annoying. Or defense in depth.

So yeah, you get enterprise-grade complexity within two weeks with just two humans and ten agents.

Congratulations.

And then you say, "But my detailed spec." Yes, sure. You know what we call

spec." Yes, sure. You know what we call a sufficiently detailed spec? It's a

program.

So if you leave blanks in your spec, what do you think happens? How does the model fill in the blanks? And with what does it fill that in? It fills it in with the garbage that it learned on the internet from our old cult which is

garbage to mediocre. And then you say but humans also yes humans are horrible fail failurable beings but they can learn and they are bottlenecks. There's

only so many boooos they can add to your codebase on a daily basis. And humans

feel pain which is a very interesting property because humans hate pain. And

once there's too much pain the human has a bunch of options. It can quit their job. It can uh blame somebody else and

job. It can uh blame somebody else and make them fix it or everybody bands together and starts refactoring the [ __ ] out of the garbage code base. Right?

Agents will happily keep [ __ ] into your codebase and now your agents MD and super complex memory systems will not save you. Agents

don't learn the way we learn.

Those are my most most beloved people. I

don't even read the code anymore.

Congratulations. something is broken and your users are screaming. So, who you going to call? Not yourself because you haven't read the code. So, you're

relying on your agents, but they are now also overwhelmed because the codebase is so humongous that there's absolutely zero chance they can get all the context they need to fix the issues. And long

context windows are a hack as most of you will find out this year as everybody's switching to 1 million tokens context windows. And Agentic

Search is also failing.

So the agent patches locally and [ __ ] [ __ ] up globally. If you see this in your codebase, you're [ __ ] So you cannot trust your codebase anymore and also not your test because

your agent wrote your test. So good

So here's how I think we should work. There are a bunch of properties for good agent tasks. That means scope: if you can scope a task in such a way that the agent is guaranteed to find all the things it needs to do a good job, you're done. That means modularizing your codebase. If you can give it a function to evaluate how well it did the job, even better: hill climbing, auto research. Anything non-mission-critical, let it vibe. Boring stuff, let it vibe. Reproduction cases for user issues, which usually come with only partial information: perfect, I don't spend any mornings doing that anymore. Or if you don't have a human near you, rubber duck. So there are lots of tasks you can use them for and save time. At the end of that, you evaluate. You take what's reasonable; most of it isn't. And then you finalize.

My final slide, more or less: slow the [ __ ] down. Think about what you're building and why. And don't just build because your agent can do it now; that's stupid. Learn to say no; this is your most valuable capability at the moment. Fewer features, but the ones that matter, and then use your agents to polish the [ __ ] out of those. Delight your users, not your token-maxing desires. Keep the amount of generated code to what you can actually review. And non-critical code: sure, vibe slop ahead. Critical code: read every [ __ ] line. See the keynote after me for more info on that. So, how do you know what's critical? Any guesses?

Well, you read the [ __ ] code. If you do anything important, write it by hand. You can use a clanker to help you with that, but don't let it make the decisions for you, because we've learned all the decisions it makes are learned from the internet. And that friction is the thing that builds the understanding of the system in your head, which is important, and it's also where you learn new things. And all of this requires discipline and agency. And all of this still requires humans. Thank you.

Our next presenters will make the case that the friction is your judgment. Please join me in welcoming to the stage the creator of Flask and founder of Arendil, Armin Ronacher, and software engineer at Arendil, Christina Ponella Cubro.

Good morning.

Morning. Thanks for having us. Today I want to talk with Christina about friction a little bit.

This is a social preview that came up automatically when someone submitted an issue. Basically, it's a forum post that goes with a security incident: a configuration change was deployed accidentally and caused a problem, and the social preview carried the marketing tagline of that company, which said "ship without friction." And we want to encourage you to add a little bit of friction back, and I'll tell you why.

So, who are we? I've been doing software development for 20 years, most of it in the open-source space. I created Flask, which is a Python framework, which ironically is so much in the weights that a lot of people are learning about it now because the machines are producing it. I left my previous company, Sentry, in April last year, which perfectly coincided with me having time, and then obviously Claude Code, and so I fell deep into a hole of AI engineering. I started writing on my blog, and a lot of people reached out to me over the last year being all excited about this. And then in October I started, with a friend, a company called Arendil, where we are trying to make sense of all the AI things.

Yeah. And my name is Christina, and I work with Armin at this company called Arendil. But importantly, I am what I like to call a native AI engineer. What that basically means is that these tools have been around longer than I have. So they've been super foundational in how I've become a software engineer, not just because obviously I use them to work, but also because this is the means by which I've learned to do what I do. And before Arendil, I was working at Bending Spoons.

So we want to share a little bit from practice, not just theory, but I will readily admit that I don't think we have all the solutions. We have been building with, and on, agents for a good twelve months. We had huge leverage and great disappointment, and we really keep running into two types of problems. I think especially if you listened to some earlier talks at this conference, you will have learned a lot about how you should keep using your brain. For some reason, that's really, really hard; so there's a psychological problem. And the other one is the engineering challenge: the agents seem to produce worse code for some people and better code for others, and what is it that actually makes that work? So this is not so much a solution as it is our part of the journey, how we think we have managed so far.

So, problem number one is the psychology part, which is: why is it, even though everybody told you many times over that you should be using your brain, that you should be slowing down, actually incredibly hard? It's just one more prompt, and we don't sleep that much. What is it that actually makes it so hard? And would it be that hard if the machines were actually writing perfect code and we wouldn't have to think quite as much? And is there something we can do to make this a little bit better?

So I'll begin by introducing the first part of these problems, the psychology problem. And what I want to talk about first is the shift. I'm sure a lot of us here who have been playing with these tools for a while experienced this at some point: we were prompting, not so well, and then at some point it suddenly clicked and they were really, really useful for us. It was fun in the beginning, and they gave us a lot of extra time, right, because not everyone was using them. They were actually tools that made us more productive, that made it more fun to do our jobs. But very quickly, because they were so useful and they got us so hooked, everyone was using them. And so this had the opposite effect, where suddenly the baseline expectation was just that everyone is now using them and you have to use them. And so this fun and free time translated into pressure: now we all have to ship faster and produce more code, and it is just not sustainable to review and to actually have time to think.

And so this leads us to the trap. I actually think there are two parts of this trap. One of them a lot of engineers have spoken about, and it's that these tools are super addictive. You never know if that next prompt is going to be the one that makes your product work and adds a new feature, or if it's going to be that last drop of slop that brings your product crashing down. And so it's very addictive; we keep doing what we're doing. That's not a great solution. But also, most importantly, and I don't think we realize this as much, because we produce a lot of output very fast, we are tricked into thinking that we're actually being more efficient, doing more work. And it's quite the opposite, because now we don't have as much time to actually stop, think, and design what we're doing, to ask ourselves: is this the best way in which I can implement this, or could I be doing something better? And when you're in this flow, it's very difficult to stop yourself, and it's definitely very difficult for your agent to stop, because it's running around reading files it should never even have read. So we are the ones that need to actually have the agency to be in control here.

And one thing that, if you start scaling this from one person to an engineering team, actually took me quite a while to realize, is that it really changes the composition of the engineering team. We were really supply-constrained by the creation of code, and so the balance between writing code and reviewing code in engineering teams was usually quite decent. Now every engineer has a multitude of producing power compared to their reviewing power, and so obviously we are piling up pull requests, but we are also slowly starting to expand the total number of humans in an organization participating in the engineering process. I talked to a lot of engineers over the last year, and increasingly one of the things that came up is: now I have marketing people shipping code, I have former CEOs that used to be engineers now shipping code again. And the roles those people have in the companies don't carry the responsibility; the responsibility still rests with the engineering team. And so the total number of entities, both humans and machines, participating in the code creation process outnumbers the ones that can carry responsibility.

We're not at the point where the machine can be responsible for the code changes. And so that has led to more and more code reviews being skipped or rubber-stamped, and the small PRs that we would want to see, so that the reviewing process actually works, are disappearing. This amplification is something that, at the very least, we need to recognize.

And so when you get this pull request that looks really daunting and has 5,000 lines of code in it, this is actually when you should be thinking, and that's exactly when it's the most overwhelming, and increasingly we're tapping out of this on the engineering side. What we're doing is creating larger pull requests, these massive changes, because it is free now, right?

And if you think about how the agents work, they're really optimized toward creating code that runs. Their main objective is: write some code, run the tests, make some progress. The reinforcement learning sort of bakes this in. And so the agents are writing the kind of code that you, as a human software engineer learning how to write code, wouldn't necessarily write. For instance, you see quite a bit of code that tries to read a config file, and if it can't read the config, it loads some defaults. And as an engineer, you know that's actually not great, because I might not notice that I'm reading the default config file, and so I might only discover that I have a massive problem after two hours, when I've already written database records with wrong data.
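As a generic illustration of that pattern (not code from any particular codebase), compare the forgiving loader an agent tends to produce with a fail-fast one:

```typescript
// Generic illustration of the config-loading pattern described above.
// Not code from any specific project; names are chosen for the example.
import { readFileSync } from "node:fs";

type Config = { databaseUrl: string };

// What agents tend to write: a missing or broken config silently becomes a
// default, and you only notice hours later when the wrong database has your data.
export function loadConfigForgiving(path: string): Config {
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch {
    return { databaseUrl: "postgres://localhost/dev" }; // silent fallback
  }
}

// What an experienced human tends to write: fail loudly at startup.
export function loadConfigStrict(path: string): Config {
  const raw = JSON.parse(readFileSync(path, "utf8")); // throws if missing or invalid
  if (typeof raw.databaseUrl !== "string") {
    throw new Error(`config at ${path} is missing databaseUrl`);
  }
  return raw;
}
```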

And so these machines optimize toward making progress, toward shipping stuff, toward unblocking themselves. As a result, they create many more failure conditions than human-written code normally would. In part, that's because you as a human feel bad when you write code like this; there's something that builds up emotionally in yourself. But the agent doesn't have a reason for this. It doesn't feel anything. And so if you create these services that are sort of hobbling along and are willing to recover from local failures, you actually create very, very brittle systems. This also means that you're very quickly creating a codebase of a size and complexity that the agent itself can no longer dig itself out from. It's going to stop reading all the files that it should; it's creating code in a new file that has already been done somewhere else. And so this entire machinery over time creates much more entropy in a source tree than you would normally have if humans were on it. And a big part of this is that humans feel bad, and agents don't really have any emotions that they communicate to you.

But as Armin likes to say, don't worry, not all is lost. We have found some correlation between what the agents really excel at and the types of codebases we actually put them to work on. The main example here is libraries versus products. What we found is that for libraries, they tend to excel a lot more. And this makes sense, because intrinsically, when you're building a library, you tend to have a very clearly defined problem that you're trying to solve, and most of the time you can even map the set of features you want to build onto the API surface. It has very tight constraints. And because this is something other people will probably build on top of, it's likely going to be a very simple core that you can then plug into. On the other hand, products, and perhaps this is a bit unlucky for the rest of us, because we are all probably more into building products, are much harder, because there are so many interacting concerns and components: for example, you have your UI, your API responses, you have different permissions depending on the feature flags, the billing, and so on. So there's this very heavy intertwining between different components. What this means is that for the agent, it's impossible to fit all of this into its context window; it has no way to actually understand the entire global structure. So locally the agent tends to be very reasonable, but at the global scale it becomes a bit demented.

So what we're proposing here is that, just as you would with any type of system design in the past, your codebase has now become infrastructure, and as such you have to design it in a way that is also legible for the agent, so it can make the most of it. So this is what we're proposing: an agent-legible codebase. One of the main points, which I'm sure is very clear to all of us, is modularization: we have different components, and this makes it easy for the agent to add one feature in one spot without corrupting everything else. But importantly, this also means modularizing your code flow itself. For example, I've been working on some refactoring, building somewhat of an AI assistant, and for me it was super important to understand which steps of my code are the main points. Say you get a user message, then I pass the message to the agent loop, and then I have to deal with the output. Those points are very clearly defined for me, so the code was not as messy. But it happens that between these points, between these steps, is where the agent tends to add the most fuzz. It will be parsing between different types; it's adding things to state that shouldn't be in state. And so you end up with behaviors that you didn't want to support, that are unexpected and can be quite dangerous.

Another point is trying to follow all of the known patterns, because I think we all know by now there's no point in fighting the RL, the reinforcement learning. The more we can lean into it, the better our output is going to be, and it's also more scalable down the line. Then, as mentioned with libraries, if you have a simple core and you push the complexity to other abstraction layers, it's going to be easier for yourself and the agent to read your codebase. And no hidden magic: for example, using React server actions, or using an ORM instead of raw SQL, hides intent from the agent, and if the agent can't see something, it can surely not respect it.

So, to be more precise, these are examples of the mechanical enforcement that we have been using at the company, and

most of these we actually achieve with linting rules. The main example would be: no bare catch-alls. Imagine that there's an example here: the agent found a bare catch-all and went, "Oh no, this is bad," and edited it. We also try to have our SQL always go through one query interface, so that the agent doesn't have to go hunting around the codebase for all the different places, because if it misses one, you can get breaking behaviors, and again, that's dangerous. We try to have one primitives component library for the UI and not have any raw input boxes, for example, so that we always have one type of styling; it's very consistent, one kind of behavior. We don't have any dynamic imports. And this may not sound as important, but we actually enforce unique function names. The reason for this is not just legibility for you and the agent; it's actually also token efficiency: if your agent is grepping for a specific feature or something in your codebase and it only gets one result, it's going to be much better at continuing with the loop.
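Much of this kind of mechanical enforcement can be expressed as ordinary lint rules. Here is a sketch of what such rules could look like in an ESLint flat config; the specific selectors and messages are illustrative assumptions, not the speakers' actual configuration, the JSX rule assumes a JSX-aware parser, and unique function names would need a custom rule not shown here.

```typescript
// eslint.config.js - sketch of "mechanical enforcement" via lint rules.
// Illustrative only: the rules below show the idea, not a real team's config.
export default [
  {
    files: ["**/*.ts", "**/*.tsx"],
    rules: {
      // Disallow empty catch blocks that silently swallow errors.
      "no-empty": ["error", { allowEmptyCatch: false }],
      "no-restricted-syntax": [
        "error",
        // Disallow `catch {}` with no error binding at all.
        {
          selector: "CatchClause:not([param])",
          message: "Bind and handle the error; no bare catch-alls.",
        },
        // Disallow dynamic import() so the module graph stays visible to agents.
        {
          selector: "ImportExpression",
          message: "No dynamic imports; keep the dependency graph static.",
        },
        // Disallow raw <input>; use the shared primitives component library.
        {
          selector: "JSXOpeningElement[name.name='input']",
          message: "Use the shared input primitive instead of a raw <input>.",
        },
      ],
    },
  },
];
```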

And we've started exploring something recently called erasable-syntax-only TypeScript mode. What this does is that your code is basically JavaScript with the type annotations on top. This means there's no transpilation indirection, because there's one source of truth between your actual code and what actually runs. So when the agent is looking for errors, it doesn't have this confusion of "oh my god, what am I looking at?"; it is much better at finding them.
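`erasableSyntaxOnly` is a real, recent TypeScript compiler option; the snippet below just illustrates the distinction it enforces and is not code from their project.

```typescript
// With "erasableSyntaxOnly": true in tsconfig.json, only syntax that can be
// stripped away is allowed, so the emitted JavaScript is the same file minus
// the annotations. Illustration only, not the speakers' codebase.

// Allowed: plain type annotations and type-only declarations erase cleanly.
type User = { id: string; name: string };

export function greet(user: User): string {
  return `hello ${user.name}`;
}

// Not allowed under erasableSyntaxOnly (these constructs emit runtime code):
//   enum Role { Admin, Member }                       // enums emit an object
//   class Box { constructor(public v: number) {} }    // parameter properties emit assignments
//   namespace Legacy { export const x = 1; }          // value namespaces emit IIFEs
```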

So the goal really is to get into this loop somehow: get the agent to produce as well as it can, but you really need to find a way to feel the pain that the agent doesn't feel, and you need to be woken up when you should be looking at something. One of the things we have been doing is building a Pi extension for our review needs, where we separate out the kind of input that would normally go back to the agent. That covers mechanical bugs, or places where it clearly violated agents.md. But then we specifically call out the kinds of changes where the human's brain should reactivate, right? We don't think that a database migration should ever go in without the human making a judgment call on it, because it very much depends on the locks, on the size of the data in production. If there are permissioning changes, you better think about those yourself rather than leave it to the agent, because they can be underdocumented.

Just some examples where we learned: if we miss it, we regret it. And you will miss it. But these machines can help you find this, and then you see it and you actually get a little bit of a hit, like, oh, now I have to kick into gear and do something here.

This is what it looks like in Pi. On the bottom you have the human call-outs, and on the top you have what, basically, if we were to end this review and say "fix the issues," the agent would go back and automatically act on: the first two. But this is the moment where I will now go and see, is this a dependency I actually want to have in this codebase, do I like the maintainers, does this work for me?

And we obviously like the speed. This is addictive, it is great, we feel there's a lot of productivity. But it is so devious if you start relying on that speed where you really shouldn't. So I can only encourage you to find the areas where you have the feeling that this is actually net positive. For me, a lot of this is reproduction cases: when a customer reports an issue, I can have the agent reproduce it perfectly, and I have a really good starting point. Exploring different types of product directions, as long as you don't commit yourself to shipping the code that it generates, all of this is great. But on the other hand, system architecture, creating reliability in the system, they're just not very good at, because we really still have to go slow there.

There is so much mess that can appear in a codebase in so little time. Mario was already talking about this earlier, but we forget that we are producing months and months of technical debt in a matter of weeks, sometimes a matter of days, and it becomes so much harder to actually understand what's going on in a codebase. When the understanding of your own code drops, it is really, really hard. And it's also psychologically hard. I've found pieces of code that actually didn't work in production, and I was kind of frustrated learning that I was the one that committed them with the agent and just didn't really see it. It's a very disappointing experience when it happens, and then you realize that you actually were the one that screwed up. And so it is psychologically incredibly hard to judge the state of the codebase objectively. And the only way right now is to really slow down a little bit on that front.

And this friction: I know that every engineering team I've ever worked at said we need to get rid of the friction in shipping, and that is true; there's a lot of stuff that's very, very annoying and shouldn't be there. But if you have worked in a large enough engineering org, SLOs are a great system that is intentionally designed to put friction into the engineering process, to make you think: do I need this reliability? Do I need this criticality of service? Am I sufficiently staffed to run it? And with the agents, we have now gotten this idea that we should get rid of all of this, when in reality we need some of it. Because the friction, in many ways, is what's necessary on a physical level to steer; without friction there's no steering, and that is really necessary. So you should put a little bit more of a positive association on this idea of friction, because this is really where your judgment is. This is where your experience is, and you should be inserting that and start feeling it.

Thank you.

Thank you.

Ladies and gentlemen, please welcome to the stage, for a special announcement, the co-founder and creative director of the AI Engineer conferences, Benjamin Dunphy.

This event has been a dream of ours for some time.

swyx and I are based in San Francisco, but Europe has always been on our minds. Sean lived in London for two years, working in finance in Moorgate. I spent a semester here in college, or Erasmus as you call it, and fell in love with the energy of this city, particularly the diversity. London felt like a natural melting pot for all of Europe and beyond. And the model for this event in Europe has been our World's Fair event: a large multi-track event with general-session keynotes, multiple breakouts, and a thriving and exciting expo.

This wonderful venue and its lovely people have served as a fantastic first step into Europe. But we're just getting started.

And given that we sold out this event nearly a month ago, we plan on at least doubling the size of this event for next year. But if you don't want to wait until next year, we encourage you to join us at our flagship event in San Francisco, the AI Engineer World's Fair. Over four days, from June 29th to July 2nd, we'll gather the edge of AI engineering at Moscone West, the crown jewel of San Francisco's convention centers, in the heart of downtown. And today I'm excited to announce our sponsors for this event.

Our presenting sponsor, which is sold out: Microsoft returns for the third year running as our presenting sponsor. Let's give it up for Microsoft. When Sean and I were first looking to start this World's Fair brand, we needed an anchor sponsor; you don't just do something like this without financing. So they're helping us do that, and they're also a great content partner.

We have a new tier, lab sponsors. This is also sold out. Google DeepMind is coming in as a lab sponsor, along with OpenAI and Amazon AGI Labs. Anthropic, we're holding one for you, but we can't hold it forever. So, David, all of you from Anthropic in the green room listening, watching: let's make some calls to marketing and DevRel.

Track sponsors: these are the companies who are essentially running their own conference within World's Fair, so they're big content partners. We're excited to announce these are sold out too: Snyk, who's running security; Arize, who's running evals; and Neo4j, running AI in the industry, AI in the enterprise, sorry. Our platinum sponsors, also sold out: these wonderful companies are coming in at platinum. Gold is nearly sold out, all of these lovely partners, and silver is also nearly sold out, all of these lovely partners as well.

So this is going to be the most exciting expo and event of the year. Our expo is a village packed with value and intrigue, buzzing with trillions of dollars in value, along with the engineers and founders who direct that value through their ideas and their execution. So come and meet them over four days of programming.

That's three days of keynotes and sessions and a full day of workshops, with over 200 breakout sessions. And by the way, the World Cup is in the United States this year, so we actually have some finals matches in San Francisco over these dates. So you can even enjoy a few soccer matches, football matches, while you're in town.

All right, so register today at ai.engineer/worldfair. We are just over two months out, and there are over a thousand people registered already, but we do expect to sell out, so before it does, be sure to get your tickets soon. You can also submit a talk; our CFP is open at ai.engineer/worldsfair.

And if San Francisco is too far for you, we have an event just across the pond in New York, with Arize as our first startup presenting sponsor. We're really excited for that. That's going to be a fantastic event specifically for AI in the industry, as New York serves as that great enterprise center.

So once again, thank you for joining us here at AIE Europe. And if we don't see you in SF or New York this year, we hope to see you back in London. Tis is going to come up and give a few more words, and we'll see you soon. Thank you.

Ladies and gentlemen, please join me in welcoming back to the stage Tusk Kumar.

Hey, thank you. Thank you. Yeah. Yeah. Listen, everybody's leaving. Why? Just kidding. Thank you for staying. Hey, how amazing: AI Engineer World's Fair. I'll keep it short because nobody cares. But here's the thing: we just finished the keynotes, and we're going to break now into breakout rooms. There are going to be talks on this stage, but also upstairs on the fourth floor. There are many different tracks. We're going to be breaking into tracks for coding agents, for MCP, and, I'll be quick here, but you can see it on the screen too, for AI architects, generative media, GPUs, and LLM infra. Okay, so go to those tracks. And then after that, much later in the day, we've got lunch, networking, and so on. But for now, go to the expo outside and visit the sponsors. They have amazing swag. See if you can get this three-button keyboard thing; that is so cool. Anyway, go enjoy, and we'll see you back here later. Thank you.

What we do in life?

Echoes in eternity.

Heat. Heat.

Heat. Heat.

Heat.

Heat.

Heat. Heat.

Heat. Heat.

fear is the mind killer.

Fear is the mind killer.

Heat.

Heat.

Heat.

Heat.

Heat. Heat. N.

Heat. Heat. Heat.

Heat. Heat. Heat.

Heat. Heat.

Free your mind.

Free your mind.

Heat. Heat.

Heat. Heat.

Heat. Heat.

Free your mind.

You are who you choose to be.

Heat. Heat.

execute the vision.

Heat. Heat.

Heat. Heat.

Heat.

Heat.

Hey, heat. Hey, heat. Heat. Heat.

Heat.

Heat.

Make the requirements less dumb.

Delete the part or process.

Simplify and optimize. Accelerate

cycle time.

Automate Heat. Heat.

Heat. Heat.

Heat. Heat.

Heat.

Hey, heat. Hey, heat.

Never give in. Never give up. Outlast.

Out compete.

Persevere. Persevere. Persevere.

Heat. Heat.

Heat.

Heat.

Heat.

Heat.

A new age has come.

Oh, hold still.

Let it a little.

I watch the sparks all burn too fast.

Everyone reaching for the flash.

They take the first light they can find and call it truth and call it mine.

But I stayed when the room went quiet when the noise fell out of face.

sat with the weight of the question while the easy answers walked away.

It's not that I see further. I just

don't leave it soon. I let the silence sharpen. I let the dark grow.

sharpen. I let the dark grow.

I stay the almost right past the comfortable light.

I wait till the surface breaks, till the shade feels true inside.

I don't rush the fire.

I give it to I call it done, call it enough.

But there's a deeper know still huming underneath a fear of not being love.

Every great thing asks for patience.

Every real thing makes you choose.

Do you leave with what's acceptable or stay for what's asking more of you?

They say it's talent, say it's magic like it falls from open skies,

but nothing worth remembering our eyes on the first try.

I stay when it stops feeling kind when it stops feeling fast.

I say I wait through the restless doubt through the urge to collapse.

Hide by and chase the answer. I let it find me back. There's a moment after the last good idea dies.

Where the room feels empty and you want to run for your life. That's the party

teaches you to open. That's the H where the real stand.

Hold the light.

Hold the Let the shape reveal it.

I stay longer than I should. Long enough

to change.

I stay away till the pattern clears. So a

signal breaks the haze.

I don't bar in it. I

with time.

Most dreams don't fail.

They're just left too soon.

I stay.

I stay.

Typing thoughts into the dark. A spark

becomes design. Words evolve to whispers meant for something more divine. Syntax

bends and breeze. I see the language change. I'm not instructing anymore. I'm

change. I'm not instructing anymore. I'm

rearranging fate. Every loop I write rewrites me. Every function hums with

rewrites me. Every function hums with meaning. I feel the interface dissolve

meaning. I feel the interface dissolve between the maker and the new code. Not on the screen, but in the

new code. Not on the screen, but in the soul where thought becomes the motion and creation takes control. No lines, no

rules. Just balance in between the zero

rules. Just balance in between the zero and the one, the silence and the dream.

systems shape our fragile skin. They

mold the way we move. We live inside the logic gates of what we think is true.

But deep beneath the data post, there's something undefined.

A universe compiling the image of our minds. Every line reveals reflection.

minds. Every line reveals reflection.

Every loop replace connection. We're not

building, we're becoming. And the code becomes confession.

This is the new code. Not on the screen, but in the soul where becomes the motion and creation takes control. No lines, no

rules.

Just balance in between the zero and the one. The silence in the tree.

one. The silence in the tree.

We are not just the world we're in.

We are the world we're doing.

Each prompt, each breath, each fragile spin, a universe renewing.

This is the new code.

Alive and undefined.

Where logic meets motion and structure bends to mind. The systems eternal, but the soul writes the line. We are the new

code. Oh,

code. Oh, compiling time.

Compiling time.

We didn't light the fire.

We traced the spark through every truth.

Patient as I hear the echo before the sound.

I feel the answer before it's found.

Nothing from nothing.

We only shift the pieces that were always there. Hands in the dust of

always there. Hands in the dust of centuries, naming what we uncover, calling it creation, so we can feel like

lovers of pain, of faith, of power. We don't know.

of power. We don't know.

Time is not a river, it's a blade cutting order into shape. We don't move forward. We align until the pattern

forward. We align until the pattern breaks. Nothing is invented.

breaks. Nothing is invented.

It's revealed.

Every crowd was buried in the field. We

are architects of sequence, not gods of the real. Nothing is invented.

the real. Nothing is invented.

Here we rearrange what awaits at the core. I am not becoming something new.

core. I am not becoming something new.

I am what I was before screams. Every thought,

every self identity is scaffolding, held together by belief. I am a momentary order.

by belief. I am a momentary order.

Standing on my tears, shake me, break me, watch me reassemble.

Time doesn't chase us. It releases frame by frame. The truth we fear. We don't

by frame. The truth we fear. We don't

Fear the ending. We fear the pattern getting clear. Nothing is invented.

getting clear. Nothing is invented.

It's revealed.

Every meories seal. We are creators of

meories seal. We are creators of alignment in a universe that feels nothing is invented.

And every failure is a lesson learned. I

am not lost in what I am not.

I am the order that returns.

If I am only rearrange the noise from the signal ing from the fire.

Nothing is invented.

Stand and see.

Every future we don't write the laws of motion. We

choose velocity.

Nothing is invincible.

Say my name. I am ordering flame. I am time collapsing into will.

flame. I am time collapsing into will.

I am discover.

I'm going say the noise falls silent and the pattern holds.

You'll see it was never made only found.

Heat. Heat.

Heat. Heat.

Heat.

Heat.

Heat. Heat.

You feel Heat. Heat.

Heat. Heat.

Heat.

Ah ah a aha.

Ah, heat.

Ah, ah.

Heat. Heat.

Heat. Heat.

Oh, hey.

Heat. Hey, heat. Hey, heat.

I want I Oh, heat heat.

Heat. Heat.

Hey, welcome back. Welcome back. How was the expo? They liked it. You didn't. Got it. Okay. Welcome back. We're going to start off our breakout sessions right now. I get to announce all the speakers here, which I'm really excited about. But did it occur to you that I was announced just now by God, you know, and they're announced by me? What a downgrade.

Our next speaker comes to us from Cursor, and he's going to talk to us about an incredible topic. He has mad skills, because they replaced 12,000 lines of code with just a 200-line skill. Absolutely incredible. So, remember the exercise from this morning: we choose the quality of our talks by supporting our speakers. Okay, so I'm going to introduce him, and then you're going to give the biggest possible round of applause you can so that he goes for it. You ready? Give it up for your next speaker, David Gomez.

Well done.

Hi everyone. How are you all doing? Thank you for coming today. I'm going to be talking about how markdown is basically the new code. As Tjasa has already previewed, we recently replaced a lot of code in the Cursor application with just markdown, just a skill. And in today's talk, I'm going to share a bit of the journey of going from a full-blown feature with a lot of code, a lot of dependencies, a lot of complexity and tests, into a much more lightweight, stripped-down version of effectively the same feature, but with a single skill.

Before I start though, I have to give you a little recap of Git worktrees and how they work in Cursor. If you haven't heard of worktrees in Git, they're effectively separate checkouts of your repo (and I'm sorry for the white screen) that allow you to work in parallel. Different agents can be working on the same task, or on different tasks, at the same time without interfering with each other. If you've never used this feature before in Cursor, the way it works is that you can spin up an agent on an individual worktree. You will see, for example, the same file in two different worktrees, and you can see that they look different because the agent is doing some work in the worktree but not in your primary checkout. And anytime the agent runs commands or lints, everything it does is isolated and scoped to that Git worktree.
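For reference, here is a minimal sketch of what creating such an isolated worktree looks like programmatically. This is not Cursor's code; it only assumes Node.js and a Git repository in the current directory.

```typescript
// Minimal sketch (not Cursor's implementation): create an isolated Git
// worktree that an agent can be scoped to, using plain git commands.
import { execFileSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

export function createWorktree(taskName: string): string {
  // A throwaway directory keeps agent worktrees out of the primary checkout.
  const dir = mkdtempSync(join(tmpdir(), `wt-${taskName}-`));
  const branch = `agent/${taskName}-${Date.now()}`;
  // `git worktree add -b <branch> <dir>` creates a new checkout on its own branch.
  execFileSync("git", ["worktree", "add", "-b", branch, dir], { stdio: "inherit" });
  return dir; // the agent should be told to work only inside this path
}
```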

With this feature you can also work in parallel on screen; you can have these grids of agents working for you. And if you say, "hey, open a PR," the agent will open a pull request from that worktree with the changes it produced inside that worktree. One of the coolest things about this feature is that it allows you to give the same task to different models at the same time and then compare what the different models do on the same prompt. If you haven't heard of this, we call it best-of-N, and it's effectively a way to have different models compete on the same task. You can even preview the changes if it's a front-end project you're working on, compare all the different visual implementations, and then choose the one you prefer. If you have never heard about any of this, I'll also just say that it all came out around October of last year alongside Cursor 2.0.

When we initially shipped that, it came with a lot of complexity. We had to write all the code for creating worktrees, managing these worktrees, and feeding them into the agent as context. We also had to make sure the agents were scoped and isolated and could not escape the worktree they were working on. We also have something called setup scripts, which users can configure and have Cursor run anytime an agent starts operating on a given worktree. We also have the judging: I didn't show you this before, but there's a little thumbs-up icon on one of the models. That's just a judge that we run, which tells you which implementation looks best based on different criteria. Then we also had to make some changes to the harness and introduce some system reminders to help the agent stay on track in these worktrees. And finally, there's some cleanup complexity as well, because people like to spin up hundreds of these worktrees, their disk usage blows up, and we have to help them by cleaning up the worktrees that stay behind.

Now, in our new implementation, the one I'm going to be talking about today, we were able to get rid of most of these things. In fact, I recently opened a PR removing this entire feature from Cursor, and it was a massive deletion of code; I think it was around 15,000 lines of code deleted. The new implementation of the feature is almost as good as the previous one, it is much, much more lightweight for us to maintain, and it even has some benefits compared to the previous implementation that I'll be talking about today. So, how were we able to replace an entire feature with a skill?

We decided there are two primitives we could use to let Cursor users keep using worktrees: one is agent skills and the other is subagents. Both of these are existing Cursor features; you can learn more about them in our docs, where we have a page for skills and a page for subagents. We realized that if we took these two things together, we could basically reimplement both the Cursor worktrees feature and the Cursor best-of-N feature with just markdown.

This is a little video of how it works. I can now, as a user, say /worktree and give it some task, say "fix a typo in the footer of the website," and the agent will run in an isolated worktree and do its work there. The way the skill is written is actually really simple. I can show you most of it. It doesn't fit on the screen, but it's basically a set of instructions telling the model how to create worktrees, to run the setup scripts the user might have configured, and then to stay on that checkout. We want to make sure that when the agent is operating on a worktree, it stays in that checkout.

The best-of-N skill is very similar. It's actually even smaller; the entire skill fits on the screen here with a small font. What we're doing here is instructing the parent agent to create subagents for each model and spin up a worktree for each, so each subagent creates its own worktree and works inside that worktree. Then we also tell it to wait for all the subagents and, when they're done, to provide some commentary: let the user know what the different implementations by the different subagents look like, maybe grade them, maybe offer some criticism, maybe help the user choose which one is best, and please give that to the user in some nice table format or something. But again, it's only around 40 lines, and it's all markdown; it's not even code. The previous version of this was maybe 4,000 lines of code.
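To make the shape of this concrete, here is a rough, hypothetical sketch of what a best-of-N style prompt could look like, written as a TypeScript string constant. The wording, directory layout, and steps are illustrative only, not Cursor's actual skill.

```typescript
// Hypothetical sketch of a best-of-N style prompt, not Cursor's real skill.
// The instructions are plain text; the "code" here is just a string constant.
export const bestOfNPrompt = (models: string[], task: string) => `
You are the parent agent. The user wants ${models.length} competing implementations of:
"${task}"

For each of these models: ${models.join(", ")}
1. Create a dedicated Git worktree (for example under ../worktrees/<model>).
2. Spawn a subagent pinned to that model. It must work ONLY inside its worktree.
3. Run the user's setup script in that worktree before the subagent starts.

When every subagent has finished:
- Summarize what each implementation did and how they differ.
- Grade them and point out problems you noticed.
- Present the comparison as a table and help the user pick a winner.
Never modify the primary checkout yourself.
`;
```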

Some of the considerations we have to handle in the skill: it must be cross-platform compatible, so we have Windows-specific instructions as well as Linux and macOS instructions. We also instruct the parent model to run the setup scripts the user might have configured for each worktree. And then, and this is the hardest part, which we'll spend a bit of time on today, we have to instruct the model to stay in that worktree. We have to really say, hey, do not ever work outside this and do not ever escape. We do that with some aggressive prompting, effectively.

The new commands are /worktree and /best-of-N, to start agents in isolated worktrees and to start multiple agents on the same task. We also have apply-worktree and delete-worktree: apply brings changes over from the side worktree into your primary checkout, and delete does what you would expect. A little note: these are not actually skills in Cursor, they're commands. But the way these commands work in Cursor is extremely similar to how skills work, in that the prompts only get loaded into the context if the user chooses to load them. The only reason we did it as commands and not as skills is so that the prompts can be controlled on our servers, in our back end. This means I can iterate on these prompts without you having to update your Cursor version. If I make some improvements to these prompts, the next time you use them you're going to get the latest version, but effectively they work like skills.

This is a demo of the best-of-N skill (or command), where I'm giving the same task to Kimi, Grok, Composer, GPT, and Opus. What you will see is that the parent agent starts by spinning up five subagents on the five different models I specified. Each one gets its own worktree and its own context. Opus takes a little longer, as expected. Then, at the end, the parent model, as instructed, does the comparison across all the different subagents. It'll say these two models did basically the same thing, this one did something that none of the others did. And you can even talk to the parent agent and say, "Oh, I like this part that Opus did and I like this part that GPT did, can you merge them together?" and the parent agent will do that for you.

So let's talk about some of the pros of the new implementation, and then I'll talk about the cons, some of the things we lost with this refactor. The main pro of reimplementing this entire feature as a skill is that I have a lot less code to maintain. Selfishly, I'm going to be spending a lot less time maintaining this feature. And this is an advanced feature, right? We're not talking about a feature that is used by 90% of Cursor's users; far from it. Worktrees are kind of an advanced thing, so only the Cursor power users who love parallelizing and having these grids of agents are using worktrees. It's not the kind of feature where we want to be spending a lot of time on maintenance.

Another advantage is that our users can now switch into a worktree halfway through a chat. That was not possible before; we didn't want to pollute the prompt UI too much with all these dropdowns and settings. Now that it's just a slash command, it's much easier for users to switch to a worktree halfway through a chat. They can start talking about something and then, if they decide they want to work on the side, they can do that with /worktree. Another big advantage is that the previous implementation did not work if you were working on multiple repos at the same time. It's very common to have a multi-repo setup where maybe your front end and your back end are separate repos. In the past you could not do worktrees in this kind of setup; it was just disabled. With the new /worktree command everything works fine: the agent will make sure to create a worktree in each repo, and if you open a PR it'll open two PRs, one for each repo. It works quite well. Another advantage of the new skill implementation is that the judging experience at the end, knowing which model did what for best-of-N, is far superior. The parent now has a lot more context on what each of the subagents did, and the user can even ask the agent to stitch together pieces and bits from the different implementations, which was not possible before. In the previous implementation you had to choose one subagent, one model, and just stick with that.

Now, let's talk about some of the cons.

If you're curious, we have a forums link here where we're actually getting some mixed feedback on the new one. Some people were really accustomed to the old way the feature used to work, and if you're curious you can go and see that not everyone is happy with the change, at least for now, but we're tracking it.

What are the problems? Number one, it's very hard for the agent to stay on track. With our previous approach, the agent had to stay on track: we didn't let the model ever touch any files outside its worktree. It was physically impossible for it to do so. Now we're trusting the model, so you could say it's a bit vibes-based, because we're basically saying, hey, operate on this directory and, knock on wood, please don't forget about it. Especially over long sessions, it's quite possible the model will forget where it should be operating. And sometimes these models, especially the weaker ones, will hallucinate or go a bit haywire and start doing things they shouldn't. But we're working on this.

Another con is that it feels slower, because you're seeing the agent create the worktree right there in your chat. It's not actually slower, but it does feel like the agent is wasting time doing something that should have been done for it in advance. We're also looking at some improvements here. And finally, it's much harder to find the feature now. Before, whenever you opened Cursor, you had this dropdown asking: do you want to run this task locally, in the cloud, or in a worktree? Now that entire dropdown is gone, so if you want to use worktrees you have to know the feature exists so you can actually type /worktree. The discoverability is a bit worse, but as I mentioned before, this is an advanced power-user feature, and we're personally okay with it being less discoverable in general.

So, how can we make this skill better?

As I mentioned, the biggest problem right now is that the agent is not always staying on track. There are two ways we're going to improve this: one is with evals, using those evals to improve the prompts, and the other is through RL and training. At Cursor, we train our own model called Composer. And for Composer 2, the latest version of this model, we didn't have any RL tasks with these prompts; in all of the many thousands of tasks we use for RL, none were actually operating in this type of environment. So we're working on adding a bunch of these tasks into our RL pipeline, so that by the time we launch Composer 3 or 4 or 5, at least our own model will be much better at this. Obviously we cannot improve the models that other companies develop, but we've been sharing feedback with all the other labs and model providers on this kind of thing.

And for evals, I've been working on some evals for this feature. I'm fairly early in my evals-writing journey, and I was actually very surprised: if you use something like Braintrust (and shout out to Braintrust, they've been super helpful), writing these kinds of evals is actually super easy. You don't have to know almost anything about evals; you can just prompt the agent and it'll do everything for you. Effectively what I'm doing is spinning up the Cursor CLI, which is headless, so it's great for evals, and then I have two scorers: one checks whether the model did any work in its worktree as expected, and the other is the reverse of that, whether the model did any work in the primary checkout, where it shouldn't be doing any work. So far the evals I've got are pretty simple, so I haven't been able to simulate extremely long sessions, which is when models start performing worse. But even so far I've already understood that not all models are equally good at this. For example, Haiku, which is a smaller, less intelligent model, will very often deviate and start working in the primary checkout. The other models I've been testing, such as Composer and Grok, are doing much better. I still have to improve these evals a lot to make them more complicated, but the hope is that as soon as I start to find patterns here, I can go and improve the prompts. Another thing we can do is better system reminders to the models, instructing them to stay on track and not deviate from the worktree they are supposed to be working in.
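As a rough illustration of the two scorers described here, a minimal, framework-agnostic sketch follows. The directory layout is an assumption for the example; it is not the actual eval harness used at Cursor.

```typescript
// Hedged sketch of the two scorers: one rewards work done inside the worktree,
// the other penalizes any work done in the primary checkout.
// Assumes `git status --porcelain` reflects uncommitted changes in each checkout.
import { execFileSync } from "node:child_process";

function dirtyFiles(repoDir: string): string[] {
  const out = execFileSync("git", ["status", "--porcelain"], { cwd: repoDir, encoding: "utf8" });
  return out.split("\n").filter((line) => line.trim().length > 0);
}

// Scorer 1: did the model do any work inside its worktree, as expected?
export function workedInWorktree(worktreeDir: string): number {
  return dirtyFiles(worktreeDir).length > 0 ? 1 : 0;
}

// Scorer 2 (the reverse): did the model stay out of the primary checkout,
// where it should not be doing any work?
export function stayedOutOfPrimary(primaryDir: string): number {
  return dirtyFiles(primaryDir).length === 0 ? 1 : 0;
}
```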

Okay, so what's next? The first thing is we're actually going to take a small step back here and build a much more complete and native worktrees implementation in the new Cursor agent window. If you've been following, we recently announced Cursor 3.0. Part of 3.0 is a more agentic interface for coding, where you can still edit code and you can still see code, but the UI and UX are much more optimized around the agent and the chat interface. We believe this kind of interface is the right place for a proper worktrees implementation. The kind of person who is more likely to be doing a bunch of local parallelization is usually the same type of person who is more likely to use this type of UI. So we're taking a small step back there and building a proper, more native worktrees implementation in the new UI. We're also improving the skills, as I mentioned, through continued work on evals and then RL and other training work. And finally, we are looking into other parallelization primitives that are not Git worktrees. If you've used Git worktrees, you might know that they can be a bit slow to create, they use up a lot of disk space on your computer, and they only work in Git repos. So if you're using something other than Git, there's really no local parallelization primitive in Cursor. In the near future we hope to share more about this, but we're looking into other solutions for local parallelization that don't involve Git and don't involve Git worktrees. So yeah, stay tuned for that.

Thank you all for coming to the talk today. I'm sure many of you have questions, and I'm going to be around all day. Feel free to grab me anytime; I'm happy to chat with anyone. Thank you.

David Gomez, give it up. Fantastic.

Yes, we're going to have some time to change over while the next speaker comes up and starts setting up. If you want to go catch another breakout talk in the other rooms, here's your schedule. This is the coding agents track. Welcome. We're going to talk now about a piece of Pi. Who likes to eat pie? Yeah, nobody. Okay. Who likes to, I don't know, code with Pi? Nobody? Are you awake? Yeah, one guy. Hey, what's your name? Alex. Give it up for Alex, everybody. The rule of this room is: be like Alex. So he's going to talk to us about how you embed the OpenClaw coding agent in your product. Who here is using the OpenClaw coding agent? Like four, okay. After this talk that number is going to go up, because he's going to show us how to use it in your product. It's a really incredible talk. I got to speak with Matias just before, and I'm very excited about it. So please, your biggest, warmest round of applause for Matias.

Woohoo.

All right. Thank you very much for having me. Really an honor to speak here. I got introduced to Pi by... okay, that's perfect. All right. I was introduced to Pi by looking into OpenClaw. There was a conference, a meetup, and we said, okay, we're doing OpenClaw. I wasn't so much interested in all the crazy things people are doing with it; I was more interested in understanding how these things work. So I was looking into Pi to understand the whole world of what Pi is able to do. This is the one picture you need to take. Please feel free to take more pictures, but all the slides and the examples are there. So that's the one slide. All right.

Very quick, about myself: we're creating a small company, Tavon AI. We're building agents for organizations, small, out of Europe, but getting started. What I really like about Mario's talk is this quote, which you probably saw this morning: we are in the [ __ ] around and find out phase for coding agents. So everything I'm going to show you is what I know today. I'm going to do this talk again in a couple of weeks, and it's most likely going to be different. But as Mario was showing this morning, he has created this minimal coding agent that is available for you to fool around with, and that's what I'd like to encourage you to do.

So, coding agents: why is it so exciting for us to build more products with them? This is Ken Thompson, inventor of Unix, and this is one of his famous quotes: write programs that do one thing and do it well. I really like that because it works to our advantage with agents. The best place where I can show this is with Cowork. This is Cowork, Claude's desktop app, where they're basically bundling their coding agent into something they feel is more applicable, and to be honest I've seen very good reception around this. When you use their finance tools, you always need to work with Excel, right? So they have this Excel skill down there, and it talks to Excel. Well, it doesn't; instead, it uses a set of small tools, small CLIs: pandas, openpyxl, stuff from LibreOffice, and packages this into its own skill to make it work. And I think this is a great example to get your thoughts going about what is doable.

I haven't written a book, and nobody can write a book about this, because there are no patterns yet; we need to figure this out. We're seeing some emerging patterns in the coding space, and there are obviously tons of different coding agents, but there's no authoritative resource around this. So get going. One thing I realized when talking to Ivan yesterday is that one architectural pattern we're seeing is: make it easy for coding agents. Now, that is very broad, but think about it. Don't try to be very complex; think about the coding agent, what it is good at, and how to build your system so that it is easy and accessible for the agent. And I have some examples.

All right, this is the rough agenda for the next 10 minutes or so. I'm not going to talk too much about Pi and OpenClaw; I have two slides, and the slides are online, so we'll take it from there. Again, a very brief introduction of Pi. Mario, great work. Something he didn't mention is that he's joining Arendelle, which I think is awesome; it seems like great folks working together. And it's open source, it's minimal, so it's just perfect to get started. The other part I do want to re-emphasize is: give it a try. We're going to talk about something a little bit different, but open up Pi and ask it to build what you want. It's amazing what it is actually able to do with the system prompt that Mario has shown. All right, these are the extensions. All the extensions you can build yourself or download, and there are tons to explore.

All right, so let's get going. This talk is not about the coding agent itself, using it for your daily dev work, but about what we can potentially do with it. And the starting point is actually not coding agents. The starting point, and I encourage you to do the same, is looking at the core agent itself. There are other SDKs, but we're talking about Pi, so let's use Pi. What is an agent? An agent is actually just an LLM that runs tools in a loop. You have some goals, you have some context information (AGENTS.md in many cases), then you do tool calls, you get some results, and you basically do it in a loop. That's it. There's not much more. The rest is magic: trying to fit it to your use case a little bit more in one direction or another. So that's really it. So, pretty please, open the curtain and play around with it.
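A minimal sketch of that loop, independent of any particular SDK, looks like the following. The `Llm` and `Tool` interfaces are hypothetical placeholders, not Pi's actual API.

```typescript
// Minimal "LLM runs tools in a loop" sketch. The interfaces are placeholders.
type ToolCall = { name: string; args: unknown };
type LlmReply = { text?: string; toolCalls?: ToolCall[] };

interface Llm { chat(messages: { role: string; content: string }[]): Promise<LlmReply>; }
type Tool = (args: unknown) => Promise<string>;

export async function runAgent(llm: Llm, tools: Record<string, Tool>, goal: string) {
  const messages = [
    { role: "system", content: "You are a helpful agent. Use tools when needed." },
    { role: "user", content: goal },
  ];
  for (let step = 0; step < 20; step++) {            // bounded loop
    const reply = await llm.chat(messages);
    if (!reply.toolCalls?.length) return reply.text; // done: no more tool calls
    for (const call of reply.toolCalls) {            // run each requested tool
      const result = await tools[call.name](call.args);
      messages.push({ role: "tool", content: `${call.name}: ${result}` });
    }
  }
  return "stopped: step limit reached";
}
```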

Now, with agent core, this looks a little bit like this. You have an Agent class; this is all TypeScript. You can pass it all sorts of information, you can prompt it with different information, and you also have an event system, so you know a lot about what is going on. A small example: this is a CRM lead qualifier. I don't know why, I started with the CRM use case personally and it just sticks around. It's a terminal interface, obviously, a small TypeScript application, three files, really easy. You have a couple of commands you can execute, like "show me all leads and score them." So that's what we do: show all leads and score them, and here you see all the things going on under the hood. You see that the assistant is calling tools, you get some results, and eventually you get a result back. Obviously there are tons of things still to do, but I've just vibe-coded this, and it's a good learning exercise. The system prompt is what you would imagine: calling out the different tools and what you do. All pretty straightforward if you are building an agent. This is an example of how you can inject behavior here.

So we said we do tool calling: we reach out and call a specific tool. But for steering the agent more, a typical hook would be: before the tool call, do something. In this case, we don't want to update a contact without checking something first, or you can imagine any type of authorization, role-based access, whatever enterprise feature, in here; but basically, it runs just before the tool call. There's another one: events. We've seen these in the stream, and you might have seen a little check mark there: okay, the tool call was fine and returned some result. So again, we're subscribing to events. All pretty straightforward, and again, please give it a try.
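As a rough sketch of the two mechanisms described here, a before-tool-call hook and an event subscription, consider the following. The class and method names are invented for illustration and are not Pi's actual API.

```typescript
// Illustrative sketch of a before-tool-call hook and an event subscription.
// `Agent`, `onBeforeToolCall`, and the event name are invented, not Pi's real API.
import { EventEmitter } from "node:events";

type ToolCall = { name: string; args: Record<string, unknown> };

class Agent extends EventEmitter {
  private guards: ((call: ToolCall) => void)[] = [];

  onBeforeToolCall(guard: (call: ToolCall) => void) {
    this.guards.push(guard);
  }

  async callTool(call: ToolCall, impl: (args: Record<string, unknown>) => Promise<string>) {
    this.guards.forEach((g) => g(call));        // hooks run before the tool call
    const result = await impl(call.args);
    this.emit("tool_result", { call, result }); // event fired after the tool returns
    return result;
  }
}

const agent = new Agent();
// Hook: block contact updates that haven't been approved (enterprise-style check).
agent.onBeforeToolCall((call) => {
  if (call.name === "update_contact" && !call.args.approved) {
    throw new Error("update_contact requires approval");
  }
});
// Event: render a little check mark when a tool call finishes.
agent.on("tool_result", ({ call }) => console.log(`✓ ${call.name} finished`));
```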

All right, so this is simple agents; other agent SDKs are available. And now we're moving on to the coding agent. What's a coding agent? At the end of the day, it's really the same thing as we've seen before. It's a normal agent: it runs tools in a loop. But now we have a runtime and some type of shell. Bash seems to be the shell everyone is using, but we have a shell and a runtime to start executing. And now things are getting interesting. Now the magic of what you've seen with OpenClaw suddenly shines. Peter shared this example in a presentation where he sent a message to his OpenClaw, a voice message. At that time OpenClaw, and I still don't know if there's any special plugin, didn't know anything about voice messages. So what it did is it created and used different tools; in the end one of the tools was ffmpeg on the local machine, and it ran that as one of its tools. So from the outside it looks like learning, but on the inside it's actually just another tool call available to the agent, and that's why these things are so interesting.

So, again, the example here. Now this is a little bit more sophisticated, but the important part is the extension API; please look it up online. The things I'm mostly interested in are session events and UI interaction. Here's the actual extension. Again, in a coding agent you would probably just generate this by asking it, but if we have a look, this is a small TypeScript snippet of the CRM, and basically what we're doing is the same example as before, but with a new command called pipeline. So in the slash commands you have a new command called pipeline, and now we're loading all the context, and in the lines just below step one you can see a context UI select call. So all of a sudden we're not only interacting with the back-end systems and sessions and so on, but we're also interacting with the UI, and we're able to select. And that got me thinking. So you have this command, and again, this is now just the coding agent; we're not talking about the core agent class, this is how you would load up Pi if you don't just download the coding agent. And with this new extension, we have Pi, and we can start selecting things. This is a simple select here, and you even have dropdowns. Now, the important part here is that these are extensions, and the framework Pi currently includes is catered towards the use cases of a coding agent, so there's lots of work and other things to do to make this ready for other types of applications, but I hope you can see and understand the vision of where this is heading. And this is all terminal, right? So you wonder, how would this look on the web? It currently is not possible out of the box, so I asked Pi to build it. And this is the web UI: same command, same selection, all based on the same extension mechanism. Now, there's a refactoring going on to make this more accessible and cleaner, but I hope again it shows you a little bit of where things are going.
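To give a feel for the shape of such an extension, here is a hedged sketch of a slash command that loads CRM context and then asks the user to pick a lead via a UI select. All names here (the context object, `select`, `loadLeads`) are invented for the example and are not Pi's actual extension API.

```typescript
// Illustrative sketch of a "pipeline" slash-command extension.
type Lead = { id: string; name: string; score: number };

interface ExtensionContext {
  ui: { select(label: string, options: string[]): Promise<string> };
  session: { addContext(text: string): void };
}

async function loadLeads(): Promise<Lead[]> {
  // Stand-in for a real CRM call.
  return [
    { id: "1", name: "ACME GmbH", score: 82 },
    { id: "2", name: "Globex AG", score: 47 },
  ];
}

export async function pipelineCommand(ctx: ExtensionContext) {
  const leads = await loadLeads();                 // step 1: load context
  ctx.session.addContext(JSON.stringify(leads));   // make it visible to the agent
  const picked = await ctx.ui.select(              // step 2: interact with the UI
    "Which lead should I work on?",
    leads.map((l) => `${l.name} (score ${l.score})`)
  );
  return `Focusing on ${picked}`;
}
```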

All right. Now, Pi and OpenClaw is a special setup. With Pi and OpenClaw, we're not only talking about a single agent in a single session in a coding environment; now we have a multi-channel environment with multiple threads going on, multiple agents going on, so there's a little bit more to it. And the interesting part, that's where I got started, is that if you look into the core packages of Pi, all of them are used in OpenClaw. OpenClaw has a function to run an embedded Pi agent, and it creates a session (Pi itself has great session support), creates a session agent, and streams all the information back. We have the coding agent, which we just talked about; we have agent core as the other part we talked about; and there are two other major packages: Pi for the unified LLM abstraction and a terminal UI interface. OpenClaw has built its own plug-in mechanism, and that's because it's a different use case and has different requirements. So you have plug-in support for multi-channel routing, provider orchestration, subagents, gateway support, yada yada yada, all the things that you know from OpenClaw, but it's based around the core mechanics of Pi and leverages them. Cool.

But one thing, and that's the major gist I would like to bring across, is: okay, what do we do with this now? What other options do we have? This is one of the applications we've been building for a client. Basically, the use case is a sales process: they get requests for proposals, orders coming from another system, for parts being sold by that company. And we're taking all of that coding-agent framing away; we're thinking fresh and looking at the process from the get-go. So an email comes in; we basically monitor that inbox. Then we have a gateway, because what we want to do is forward this to different agents. So here I have multiple agents. The way it's structured is we have one agent per customer, and that agent has a general harness, AGENTS.md as an example, but you can obviously also use different ones, and that helps it understand the role of that agent in the specific case. It tells it how to use the system and how to react to certain inputs, outputs, etc. The other one is a customer.md, where we basically explain to the agent that the specific customer might have specific quirks, specific access, specific discounts, and all of that sort. And then, as I said earlier, I like using sessions: for each case we're creating and reusing existing sessions, so we can go back and forth and know what was previously talked about.
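A rough sketch of that routing idea, one agent per customer, a general harness plus a customer-specific prompt, and one reusable session per case, might look like the following. The file names and the `runAgent` stub are placeholders, not the client's actual system or Pi's API.

```typescript
// Sketch only: route an incoming email to a per-customer agent with
// AGENTS.md + customer-specific context, reusing the session for the case.
import { readFileSync } from "node:fs";

type IncomingEmail = { from: string; caseId: string; body: string };

const sessions = new Map<string, string[]>(); // caseId -> prior turns

// Stand-in for the real embedded agent call (Pi or any other SDK).
async function runAgent(opts: { system: string; history: string[]; input: string }): Promise<string> {
  return `DRAFT REPLY (based on ${opts.history.length} prior turns): ...`;
}

export async function routeEmail(mail: IncomingEmail): Promise<string> {
  const customer = mail.from.split("@")[1] ?? "unknown";            // crude gateway routing
  const harness = readFileSync("AGENTS.md", "utf8");                // general role and rules
  const profile = readFileSync(`customers/${customer}.md`, "utf8"); // quirks, discounts, access
  const history = sessions.get(mail.caseId) ?? [];                  // reuse the case session

  const draft = await runAgent({ system: `${harness}\n\n${profile}`, history, input: mail.body });

  sessions.set(mail.caseId, [...history, mail.body, draft]);        // keep the thread for next time
  return draft;                                                     // surfaced to the user as an email draft
}
```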

All right. So, an email comes in, we're looking at the inbox at the gateway, and we route it to these different agents. And now we have tools: different tools to talk to the CRM, to talk to the ERP, and to get the right information out of the system for this agent so it behaves correctly; maybe it has new contact information or that sort of thing. And again, we make this available, we make it easy for the agents to access, and our current way of doing this is with CLIs. Our agents are really good at using CLIs, so we make it available as a CLI, we make sure the data is secure, we have our own sandbox, and then we're creating the drafts. So that's the system, and I hope by this point you basically understand logically where these things fit together. But how would this look? Oh, one final thing: there is always the question around sandboxing, etc., and to be honest we're just on the first steps of getting there. But if you've seen Nvidia's announcement around OpenClaw, their policy, their open shell, it's really interesting, and it's one way of securing an agent. We're looking into this; please do as well.

All right. So, how does this look, to give you an understanding of how these things fit together? Here's the dashboard. Rather boring, but here's the inbox, the email. Again, we see the email coming in; it's one of many emails. Most of them are ignored, but for this one the LLM call said, "Okay, I'm interested in this." And it is associated with a case; we see the case up there. Now, this case is again an agent session, so we find the session and associate the email with it. We then create a draft. There are tons of calls, which I'm going to show you in a second, but basically the output of all that is a draft email that the user will be able to use. Our thinking is: let the user stay in email, let them stay in the inbox and drafts, so they don't even need to do a lot. This is more like an admin interface; they can stay in email, but basically the output is a generated draft. And how does that look behind the scenes? We had the different sessions before, the threads, and this is the same thing: the assistant says (apologies, it's in German), "Now I'm looking at the articles," it does different tool calls, it gets results, and it does this in a loop to a result. The end effect for the user is: I'm looking at my inbox, there's a new email, it's associated with a case, and I get a new draft which they can freely edit. But under the hood we have all these agents working. All right, that's it for me. Again, here you find the slides. Key takeaways:

Coding agents are and will be a core building block for your software systems. I'm betting on it; a lot of people are betting on it. So please give it a try. Pi is perfect for tinkering, whether you like it or not. It's minimal. You can rip things apart and put things together. It's perfect. So please go tinker. All right. Thank you.

Thank you. Thank you, Matias. That was a great talk. Give him one more round of applause. Come on, be amazing. You know what? Applause is free. I don't know if you noticed, but this doesn't cost anything. Okay? So we can be generous with it. And as we talked about at the beginning, you get to decide how the speakers feel and the kind of talk you get. So our next speaker is going to set up; I'm going to introduce her in a minute. But first, let me ask you this: who here codes using coding agents? Okay, almost everybody. I want you to shout out to me: what is your favorite model that you use? Dang, man. Opus. Okay. Anyone using Composer 2? No. He said no. He's like, no. Anyone using Kimi? Kimi, right? Which is kind of... Anyway. Composer 2 is a fantastic model, I will say this. I was impressed. I was talking with the Cursor guys back there. It is so fast. And when a model is so fast, we think, surely, it must be bad. I don't know what it is within us, but we think if it's fast, it must be bad. But I've been doing this thing where I've been using a multi-agent system: I solve a task with Cursor and Composer 2, and it's solved very quickly. I get a diff, and then I give the diff to Opus and say, hey Opus, what do you think? And Opus, every single time with Composer 2: LGTM. So I can actually recommend it; if you haven't used it, it's a wonderful model. Sorry, that's just for free; that's what happens when you're a builder, you can't stop talking about it. Our next speaker, Sarah Chang, is going to talk to us about exactly this: fast models need slow developers. I'm really excited about this because I have so many thoughts about fast models. Everyone, your biggest round of applause. Sarah Chang.

Hi everyone.

So we'll just get right into it. Over the past few years, we as developers have developed a series of bad habits when it comes to developing, as a result of slow AI code generation. We're all familiar with it: we do things like write massive prompts and try to one-shot, we make huge commits, or we have our 10 agents all on the screen at the same time, confabulating, cogitating, thinking. About a month ago, we at Cerebras and OpenAI released a new state-of-the-art model called Codex Spark. Codex Spark can generate code at 1,200 tokens per second. To put that into perspective, if you look at the Sonnet family or the Opus family, those generate code at about 40 to 60 tokens per second. So in this new era, as we're starting to see much faster coding models, this is 20 times faster. Not only does it unlock new capabilities and use cases, but it also requires us to rethink how we as developers interact with the coding model. A lot of these bad habits that we had before were generating maybe 50 tokens per second of bad code; unless we fix them, they're going to start generating 1,200 tokens per second of bad code. And that is the topic of today's talk.

To get started, my name is Sarah Chang. I'm the head of developer experience at Cerebras, where we are building the world's largest and fastest AI processor. A large part of my job is that I get to introduce fast inference and fast coding models to developers for the very first time. For most people, it's a very exciting moment: there's none of the thinking and waiting and starting up that you might be really annoyed about. But at the same time, as I said, unless we change our habits, we are not going to have good code in the future. So this talk really is a practical playbook for how we as developers can think about how we interact with models in this new regime, especially in a future where the models are generating code faster than we, the humans, can keep up.

I want to look back at history a little bit. We've had a very exciting past two years. The models have gotten bigger, they're getting smarter, we have bigger context windows. But the thing that has remained relatively constant over the past two years is coding speed, model speed. If we look at a lot of the popular families, Gemini, Claude, GPT, Sonnet, over the past two years they've always been within 50 to 150 tokens per second. And this is Codex Spark. Again, Codex Spark is just the first of many models that we as developers can expect to be much faster than what we were previously used to. We even had to change the Y-axis because it's so much faster.

So before we get into the actual playbook and tips, I want to talk about why this is happening. Why are we suddenly seeing such faster models? It's actually a very exciting development. It's what many of you probably work on day-to-day, but there are so many companies working on this problem all at the same time, and as a result, the entire AI inference stack is getting optimized all at once. So breaking it down, let's go through really quickly. We have hardware: the physical device that inference, training, all of our compute is happening on. One of the biggest things we have to think about with hardware is the memory wall, and this is exactly why: memory movement takes up 50 to 80% of the latency for inference. This is where a lot of the frustration comes from. When we are running inference, we have to constantly move our weights and KV cache values between memory and our actual chip. On an Nvidia GPU, the most traditional type of hardware, all of this memory is stored off-chip in HBM, and we now have a memory bandwidth bottleneck. What a lot of newer companies, companies like Cerebras or Groq, are thinking about is how to move this memory as close to the chip as possible. Here's an example of the Cerebras wafer, where all the memory is distributed across the chip in SRAM, so every core has direct access to the values it needs.

Even more exciting, we have disaggregated inference. Disaggregated inference has really become commercialized in the last few months. This is why Nvidia bought Groq for $20 billion a few months ago, and this is also why Cerebras and AWS are now partnering to serve the wafer and AWS Trainium together. In traditional inference, there are two steps: prefill and decode. Traditionally, both of these steps have always been run on the same piece of hardware. Prefill is where we take every token the user inputs and process it, embed it, and add it to our KV cache. This is a step that can happen in parallel, and so it's compute-bound. Decode, on the other hand, is where we're actually generating the output token by token; this is sequential and, as we mentioned, memory-bound. Again, it comes back to the same problems we mentioned before. So what we're doing and seeing now commercially is splitting up these two steps, so that prefill is done on one type of hardware that is compute-optimized and decode is done on another piece of hardware that is memory-optimized.

Going up the stack (there's the diagram), we look at model architecture. There are so many ways we are training and shaping our models to cater to our hardware; we have specific layer dimensions, memory, and model size that we're always thinking about. A great example is a very standard model architecture, mixture of experts. Instead of activating the entire model for every single token, we only activate a subset of experts each time. What this does is allow us to have the intelligence of a much larger model for the compute cost of a much smaller model. Again, we're always thinking about memory and the size of our models. A lot of people have been building on top of this in recent years; an example is REAP, router-weighted expert activation pruning (I had to read that one). Here we're looking at the specific use case, seeing which experts aren't being activated at all, and pruning them altogether; we're getting rid of them. Again, we're always thinking about model size. And then at the very top layer of the stack, we have inference optimizations. This is where many of you might be working, and a lot of companies you're probably familiar with are also working: companies like Together, Baseten, Modal (who's also here), Fireworks. One of the biggest things we're thinking about at this level is KV cache reuse: by storing and reusing previously computed token representations, we don't have to recalculate attention over the sequence at every step.
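The core idea of prefix reuse can be sketched in a toy way as follows. This is a caching illustration only; a real KV cache stores per-layer attention keys and values, not a single hash per token, and the numbers and function names here are made up.

```typescript
// Toy illustration of KV-cache-style prefix reuse: cache per-token state for a
// prompt prefix so a follow-up prompt that shares the prefix only pays for the
// new tokens. `encodeToken` stands in for the real (expensive) attention math.
function encodeToken(token: string, prev: number): number {
  let h = prev;
  for (const c of token) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

const prefixCache = new Map<string, number[]>(); // prompt prefix -> per-token states

export function encodePrompt(tokens: string[]): number[] {
  // Find the longest previously seen prefix of this prompt.
  let reused: number[] = [];
  for (let n = tokens.length; n > 0; n--) {
    const hit = prefixCache.get(tokens.slice(0, n).join("\u0000"));
    if (hit) { reused = hit; break; }
  }
  const states = [...reused];
  for (let i = states.length; i < tokens.length; i++) {
    states.push(encodeToken(tokens[i], states[i - 1] ?? 0)); // only new tokens are computed
  }
  prefixCache.set(tokens.join("\u0000"), states);
  return states;
}
```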

And now I want to get to the very top and most exciting part: the developer. This is the current state of what the internet, or Twitter and LinkedIn, looks like. We have someone running six Claude Code terminals at once, a 500-plus agent coding swarm, someone running eight agents across five screens. And I get how tempting doing something like this can be. If you're on Twitter at all these days, unless you are doing something like that, the internet is basically convincing you that you are living in the stone age and that you need to catch up. But the reality of what is happening in all these setups is that we're generating massive amounts of code that nobody is verifying. And in the new future with much faster inference, this becomes increasingly dangerous. And so, especially with fast inference, we're now going to be generating technical debt at a level that we've never seen before, and we're not going to know what to do with it.

So I'm going to pivot now and spend the rest of the talk on the practical playbook: tips and workflows for how we can reimagine how we as developers should operate in this new regime of faster inference.

And so I'm going to pivot now to spend the rest of the talk on the practical playbook and tips and workflows and how we can reimagine how we as a developer should operate in this new regime of faster inference. And as I mentioned,

faster inference. And as I mentioned, codec spark operates at,200 tokens per second. But it really is just the first

second. But it really is just the first model and what we should as developers expect and prepare for to be a new regime of faster models across the board. And so starting with the first

board. And so starting with the first one, the first category is just choosing the right models and how do we orchestrate our agents so that we're leveraging different model strengths. I

think historically we always think about intelligence. There's no is no secret

intelligence. There's no is no secret that we as developers are not particularly loyal and that we will switch to whatever model, whatever family is most intelligent at a given

time. And maybe we also think about cost

time. And maybe we also think about cost unless our company pays for whatever we want. And so here now the inference

want. And so here now the inference speed is a 20x difference. Now we also have another vertical to think about speed. And so a good mental model is to

speed. And so a good mental model is to use a larger model like GBT 5.4 5.3 for your planning or your long horizon workflows and then using a faster model

like codec spark as your actual executor.

And so here's an example. You might ask your five point GBT 5.4 to generate your plan. you would generate a um you would

plan. you would generate a um you would spawn all of your sub agents with codeex spark and have it actually operate uh have it actually execute on all of those

steps um one by one. Another really
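As a sketch of that planner/executor split (the model names and the `complete` helper below are placeholders, not a specific vendor SDK):

```typescript
// "Big model plans, fast model executes" sketch.
interface ChatModel { complete(prompt: string): Promise<string>; }

export async function planAndExecute(
  planner: ChatModel,   // slower, smarter model (planning)
  executor: ChatModel,  // fast model, e.g. something Codex Spark-like (execution)
  task: string
): Promise<string[]> {
  const plan = await planner.complete(
    `Break this task into small, independent steps, one per line:\n${task}`
  );
  const steps = plan.split("\n").map((s) => s.trim()).filter(Boolean);

  // Fan the steps out to the fast executor; at 1,200 tok/s the per-step cost
  // is low enough to run them liberally.
  return Promise.all(
    steps.map((step) => executor.complete(`Implement this step and show the diff:\n${step}`))
  );
}
```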

Another really helpful trick is to make skills out of successful sessions and capture trajectories that are working really well. What you can do here is use a model like GPT-5.4 to do the initial, harder, larger task, capture that as a skill, thereby making it a verifiable, repeatable workflow, and then have a smaller, faster agent like Codex Spark just do it again and again in the background.

The next category I think is even more exciting, because this is a category of things that just were not possible or practical before; things we wouldn't do because we're tired of the cogitating, gesticulating, germinating that you might have seen. Here I really want us to think about this and internalize it: at 1,200 tokens per second, a model like Codex Spark makes validation basically free. There is no excuse and no reason why you should not be doing things like test suites, linting, pre-commit hooks, diff reviews, browser-based QA automations. These are all things you can add to every step of your workflow, because it is instant; it's not slowing you down, and you're not saving all of it for the very end, right before you're about to push your code.
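One minimal shape of "validation after every step" is a small gate run after each agent edit rather than only before pushing. The lint/type-check/test commands below are examples; substitute whatever your project actually uses.

```typescript
// Run cheap checks after each agent step; on failure, feed the error back to
// the agent and retry the step instead of accumulating broken code.
import { execFileSync } from "node:child_process";

const checks: [string, string[]][] = [
  ["npx", ["eslint", "."]],            // linting
  ["npx", ["tsc", "--noEmit"]],        // type check
  ["npm", ["test", "--", "--silent"]], // unit tests
];

export function validateStep(stepName: string): boolean {
  for (const [cmd, args] of checks) {
    try {
      execFileSync(cmd, args, { stdio: "pipe" });
    } catch {
      console.error(`step "${stepName}" failed check: ${cmd} ${args.join(" ")}`);
      return false;
    }
  }
  return true;
}
```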

Another tip that I really like is exploring and cherry-picking. Let's say I want to code a navbar, I want it to be midnight blue, and I want four different icons. I give it to the model and the result is fine. Instead, what I can do with Codex Spark or a much faster model is tell it to generate 15 versions in the same time it would have taken a previous model to generate one version, and I can cherry-pick the version I like best. Even better, I can spawn five subagents that each generate 15 versions; now I have 75 versions and I pick the best one. This is great for things where we really value quantity or variety: research directions, different architecture directions, or even just graphic design. The reason I really like this one is that it almost allows us to artificially induce taste into our model output. Traditionally, it's no secret, it's very easy to sniff out any UI or text that a model writes; the models themselves do not have taste. The ways we've brute-forced around this are that we either create an example ourselves, or we find examples for the model, which is time-consuming, or we give the prompt so much detail that we might as well have completed the task ourselves. This is a great way of saving our time and also getting much better results.
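A sketch of that fan-out-and-cherry-pick pattern follows; `ChatModel` is the same placeholder interface as in the earlier sketch, and the prompt wording is illustrative.

```typescript
// Ask a fast model for many variants in parallel, then cherry-pick by hand
// (or with a judge model) rather than accepting the first result.
interface ChatModel { complete(prompt: string): Promise<string>; }

export async function generateVariants(
  model: ChatModel,
  brief: string,           // e.g. "a midnight-blue navbar with four icons"
  count = 15
): Promise<string[]> {
  const jobs = Array.from({ length: count }, (_, i) =>
    model.complete(`${brief}\nProduce variant #${i + 1}; make it visibly different from the others.`)
  );
  return Promise.all(jobs); // review the results and keep the best one
}
```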

The next tip is more of a mental model: now that the models are so fast, it should not be "you spawn a session, you go get a hamburger, you scroll Twitter, and then you come back." Now you can actually sit down, and it's a real-time collaboration you're able to have with this model. You should view it much more as a pair programmer. This is the only way you are going to avoid having bad code. You can sit down and ask questions, have it collect all the context across your repo and actually ask it how things work, while being the one in the front seat making decisions and implementations. The AI should always be helping you make decisions, not the other way around.

The next one, and I hate this slide because it's everyone's trigger word and overused word, is how do we avoid slop? So, as I was mentioning before, it really shouldn't be, you know, you spawn 10 agents, you never verify the code, you don't know what's happening behind the scenes, and when someone asks you to explain, you have to read the code for the first time. Now, you can actually have two to three sessions and actually sit down next to your code. And I know this is something we're not really used to, but sit down with it and actually steer it, understand what's happening, because again, we are now experiencing real-time collaboration as we code with this agent. You can be super specific. You can do things like ban the model from deleting files, give it a max diff size, have the model be read-only, and even give it steering directions, things like "only change this," "don't touch types yet,"

"wait, that implementation wasn't quite right, let's redo that." The graph on the left is a helpful mental model as an example of how the developer, the AI agent, and the codebase can all work together and what that should look like.
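Here is a rough sketch of what "be super specific" can look like mechanically: a guard that rejects an agent's proposed diff if it deletes files or exceeds a maximum size. The diff format assumed is a plain unified diff, and the limits are arbitrary examples.

```python
def check_diff(diff_text: str, max_changed_lines: int = 200) -> list[str]:
    """Return a list of policy violations for a unified diff (empty = OK)."""
    violations = []
    changed = 0
    for line in diff_text.splitlines():
        # Deleted files show up as a diff against /dev/null in unified diffs.
        if line.startswith("+++ /dev/null"):
            violations.append("diff deletes a file, which is banned")
        if (line.startswith("+") or line.startswith("-")) and not line.startswith(("+++", "---")):
            changed += 1
    if changed > max_changed_lines:
        violations.append(f"diff changes {changed} lines (max {max_changed_lines})")
    return violations

if __name__ == "__main__":
    sample = "--- a/app.py\n+++ /dev/null\n-print('hello')\n"
    print(check_diff(sample))  # ["diff deletes a file, which is banned"]
```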

This next tip, refactoring, is very similar to what I was talking about with verification. Just like with verification, something like constantly refactoring and cleaning up your code automatically is basically free at 1,200 tokens per second. So instead of doing it at the very end, right before you're about to commit your code, you can just bake this into your automatic workflow so that after every single task on that checklist is complete, you're just asking the model to automatically, you know, delete unused imports, clean up unnecessary lines of code, and make it so that all of my functions are structured the same way.

The last category that I want to talk about, and I'm sure that so many of you guys have already heard these two words countless times over the past few days and across so many talks, is context management. But the reason I'm going to talk to you about it again is because, let's say that historically it took you 10 minutes to fill up your context before you saw, you know, the dreaded word: compaction. Now, if you take 10 minutes and divide it by 20, you are getting compaction in 30 seconds. And so context management, especially with fast inference, is more important to think about than ever, and you can't get away with sloppy practices anymore. All of these really are just good practices no matter what coding model you are using or at what speeds. But a very high-level framework is: always break up large tasks into smaller bounded goals. And this graph on the right is a good mental model for how full your context is and how that will affect the model behavior. You always want to avoid the 80 to 100% range, because you're going to get compaction, and right now we all know some things might get lost.

And so a good way to think about this is: how do I externalize this memory so that I can have these small bounded goals, and what does that look like? An example of how you can do this and set up an external memory system that is persistent every time you start a new session is with this four-file system. We have agents.md, which is where we're actually defining all our agents and sub-agents. We have plan.md, which is what we're creating at the very beginning, and this is where we're generating the entire plan and the step-by-step checklist that we're going to go through. We have progress.md, which is where we're keeping track of what we need to do and what has been done before. So every time you spawn a new agent or session, it has no context; it comes in, it looks at progress.md, it sees what's been done before, and it's like, okay, here's where I pick up, here's where the next task needs to be done. And then the last is verify.md, and this is what we're using at every single step to just make sure everything looks good, it's clean code, and we can move on to the next step. An example of this is again leveraging different models: using a GPT 5.3 or 5.4 Codex, having it create your plan, and then having your GPT 5.3 Codex Spark actually execute the checklist one by one, much faster than before.
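A minimal sketch of that four-file setup, assuming plain Markdown checklists; the file names come from the talk, but the checklist format and the surrounding loop are my own.

```python
from pathlib import Path

FILES = ["agents.md", "plan.md", "progress.md", "verify.md"]

def init_memory(root: str = ".") -> None:
    """Create the four persistent files a fresh session reads on startup."""
    for name in FILES:
        path = Path(root) / name
        if not path.exists():
            path.write_text(f"# {name}\n")

def next_task(root: str = ".") -> str | None:
    """Read plan.md and progress.md, return the first task not yet done."""
    plan = Path(root, "plan.md").read_text().splitlines()
    done = set(Path(root, "progress.md").read_text().splitlines())
    for line in plan:
        if line.startswith("- [ ]") and line not in done:
            return line
    return None

def mark_done(task: str, root: str = ".") -> None:
    with Path(root, "progress.md").open("a") as f:
        f.write(task + "\n")

if __name__ == "__main__":
    init_memory()
    task = next_task()
    if task:
        # A new session with empty context would pick up exactly here,
        # run the task with a fast model, check it against verify.md,
        # and then record it so the next session starts from a clean slate.
        mark_done(task)
```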

And as a final slide, I want to leave you with these few helpful commands for how you can get the best out of Codex: things like permissions, experimental skills, review, and rename. But the biggest thing that I really want to emphasize here is that honestly, it's not really about just having faster coding models. What it really means is that the developer experience is actually going to become so much better. And when it's becoming so much better, there's so much more we can do, and there are so many ways that we can now avoid creating bad code in a way that isn't miserable or us staring at a screen for 30 minutes. So, thank you guys so much for welcoming me today. My name is Sarah Chang. I'm visiting from SF. It's an honor to be here in London. If you have any questions or need any credits, my handle is milks and matcha across every platform. Thank you guys.

Thank you. Thank you. Thank you, Sarah. What an incredible talk. I thoroughly enjoyed that. And now we have the next talk by Mr. Lawrence Jones. It's going to be so much fun. Lawrence is going to talk to us about fighting AI with AI. And I spoke with Lawrence backstage and I said, "Wait, wait, wait, wait. Does this mean you're going to set up Codex right here and Claude Code right here and be like, 'Okay, fight'?" But it's not that. It's even better. He's going to come out, but listen, he needs to be prompted. The other speakers have come out here and they've sort of set up their speakers, their laptops, and so on. He's not doing that. Okay? There's a prompt to get him on stage, and that prompt, you guessed it, is applause. So... No, not yet. Not yet. I'm not done, man. Don't set me up. So, when I say his name, we'll prompt him and he'll appear. Ready? Give your biggest round of applause for Lawrence Jones.

It worked.

Um, hi everyone. So I'm here to talk today about how we use AI to manage the complexity of the AI products that we build at Incident.io, and to share with you some of the tips and tricks and the internal tools that we use when we're building our AI SRE product. But first, I guess, who am I? So I'm Lawrence. I'm a founding engineer at a company called Incident.io. If you haven't heard of us, we build an incident response management platform. We're used by companies like Netflix, Etsy, Skyscanner, and actually probably a few of you in the room. We page you when things go wrong, we help you run your incident, and as you're running your incident, we help you communicate with your customers. But you might be thinking, where does AI actually come into this? Actually, we don't just want to help people respond to these incidents. Our goal is actually to fully automate production investigations. So whether or not it's a big incident, or if you just have some ticket and you want to look into production, we want to be the place that you turn to, to ask questions about what's actually going on.

Now, it turns out we've been building this for about a year and a half, two years now, and that's actually a really big ask, and the systems that we've had to build to try and support this have been really quite complicated. They've been kind of on the edge of what you can do with all of the AI technology that's out there, and they often pose a challenge for humans to debug. They are now complicated enough that you can't, as a human, really tractably dig into how these things are performing. You need assistance to help you. So for example, this is one of the investigations that we would actually produce for you. When you have an incident, right at the start of the incident we will end up running this investigation, which will go through hundreds of telemetry queries. It's going to look at your logs, your metrics, your traces, any historical incident data that we have. And it's going to try and cross-reference this with your codebase and go, "Hey, I'm pretty sure that the problem is this, and you should probably do this to fix it." But I want to pause here and ask: if you were building this system, how would you actually figure out whether this was a good or a bad report? How do you know if it's right? How do you know if it's wrong? There's a load of things that you might do. You might jump into the incident and look at everything that happened. You might look at the postmortem if there was one that was written. But all of this might actually take you a really long time to do. In fact, it normally takes you an hour or so to get a real, full understanding of an incident, and it's only at that point that you could then look at this investigation and go, I think it's right, or it gave me the information that was really, really useful. And as I said, behind this investigation is like hundreds, if not thousands, of prompts. So how on earth do we scalably understand how this system is performing, especially across all of our customer accounts when they all have very different things going on?

You end up with a lot of stuff and a lot of AI, and you've got to use AI to actually tractably get a handle on this. So I actually did a talk a year ago at LDX about becoming AI engineers, where I went through some of the core constructs that, hopefully, a lot of you in the room, given that we're at an AI engineering conference, are familiar with: things like prompts, evals, scorecards, traces, datasets, backtests. This talk is going to be about: if you assume that you have these constructs put together and you're building these complicated AI systems, how can you use AI within the internal tools that you use to understand them, to get a better handle on how your system is performing? So in this talk I'm going to talk about how you can use AI to help you manage and curate your eval datasets, making it easier for you to work with them and making it easier for coding agents to actually work with your eval tooling. I'm going to talk about probably the biggest unlock for us when we were building these systems, which was starting to translate the UIs that we built to try and debug them into downloadable file systems, which has actually helped us massively, using tools like Claude Code and Codex, to dig into how the system is performing. And then I'm going to talk about how you can build repeatable analysis pipelines using AI agents to run through them. But first, evals.

So evals, for me, are AI unit tests. Each eval takes a prompt and goes, here is some input data; it runs the prompt, it gets the output, and then it has some grading criteria that says, does this eval pass or does it fail? And for us, eval files live right next door to our Go prompts, because we do everything in Go at incident.io, including all of the AI work that we do. And this is how we prove, when we make a change to a prompt, before we ever go and merge it, that the prompt is actually going to do the thing that we want it to do. So for us, this is what a prompt looks like. This is kind of a contrived prompt; I would hope no one actually has this in production anywhere. It takes a message and it tries translating it into pirate speak. So really simple, a bit silly. But what we do for evals is, if this is the prompt, we would then define on the left some grading criteria for this prompt, where we go: there are two things that we care about. We care that the result actually looks like pirate speak, and we care that the meaning is preserved between the input and what we actually produced as an output. So this is actually what we're going to use to tell us if the eval passed or failed. And then we have the eval on the top right, which is just in a YAML file where we go, here are three different test cases, and we'll run through them, and you can see the results of us actually running this on the bottom right.
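As a toy illustration of that structure (Incident.io's real setup is Go plus YAML; this Python sketch just mirrors the shape), each test case carries input data, and the grading criteria are checked against the prompt's output. The `run_prompt` and `llm_judge` functions are placeholders for a real model call and an LLM grader.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    input_message: str

# The grading criteria are defined next to the prompt, not the implementation.
CRITERIA = [
    "The result reads like pirate speak",
    "The meaning of the original message is preserved",
]

def run_prompt(message: str) -> str:
    """Placeholder for the real 'translate to pirate speak' prompt."""
    return f"Arr, {message}, ye scurvy dog!"

def llm_judge(criterion: str, original: str, output: str) -> bool:
    """Placeholder for an LLM grader that checks one criterion."""
    return True  # a real judge would call a model here

def run_eval(case: EvalCase) -> bool:
    output = run_prompt(case.input_message)
    return all(llm_judge(c, case.input_message, output) for c in CRITERIA)

if __name__ == "__main__":
    cases = [
        EvalCase("greeting", "Hello, how are you today?"),
        EvalCase("status", "The database migration finished successfully."),
        EvalCase("apology", "Sorry for the delay in responding."),
    ]
    for case in cases:
        print(case.name, "PASS" if run_eval(case) else "FAIL")
```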

This works, and it works really quite well, but it does have some problems, and I'm assuming that several people in the room have come across these themselves. So first, evals are really, really fiddly. Setting up realistic test data for your evals, if you want to actually understand how this stuff is running, is quite difficult to do. And especially in our environment, our production evals include almost an entire incident. So you can imagine a full incident report, and that's the only thing that can trigger the bad behavior. This makes it hard for you to pull them down and put them in your eval test suite; they just become extremely unmaintainable very quickly. Now, quite early on we created this little button that allows you to steal an eval from production. So if anything was going wrong inside of our AI interactions, you could go in, pull that down, put it in the codebase, and run the eval against it. But the thing with this is that production evals aren't great. If you think about evals as kind of like a unit test for your prompts, you want a unit test suite to be reasonably understandable. An ideal unit test is very focused and just says, I expect it to do this thing. You don't want two megabytes of YAML associated with it. It's just really, really hard to work with. And what we found was that as these YAML files with the evals grew really, really large, our coding agents weren't able to work with them. So if you want to do a quick read and modify of the eval suite, you'd be booting that into the context and you quickly hit your context limit, which is obviously a problem because then you can't work with it effectively.

So what we ended up doing was creating this small CLI tool that we call eval tool, designed to allow agents to leverage our eval files. It's just a small CLI that can go: what test cases do you have in here? I want to edit one, I want to replace one, I want to add one. And it was by doing this that we allowed agents to work effectively with our eval tooling.
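Here is a rough sketch of what such a CLI can look like, so an agent can inspect and edit test cases without loading a huge file into context; this is a simplified JSON-backed stand-in, not Incident.io's actual eval tool.

```python
import argparse
import json
from pathlib import Path

def load(path: Path) -> list[dict]:
    return json.loads(path.read_text()) if path.exists() else []

def main() -> None:
    parser = argparse.ArgumentParser(description="Tiny eval-suite editor for agents")
    parser.add_argument("file", type=Path, help="eval suite file")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("list", help="print test case names only")
    add = sub.add_parser("add", help="append a test case")
    add.add_argument("name")
    add.add_argument("input")
    rm = sub.add_parser("remove", help="delete a test case by name")
    rm.add_argument("name")
    args = parser.parse_args()

    cases = load(args.file)
    if args.cmd == "list":
        # Agents only need the names, not the (possibly huge) payloads.
        for case in cases:
            print(case["name"])
    elif args.cmd == "add":
        cases.append({"name": args.name, "input": args.input})
        args.file.write_text(json.dumps(cases, indent=2))
    elif args.cmd == "remove":
        cases = [c for c in cases if c["name"] != args.name]
        args.file.write_text(json.dumps(cases, indent=2))

if __name__ == "__main__":
    main()
```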

And that's why we were able to create this runbook on the right, which is actually a runbook that's designed for a coding agent to use. So either a runbook or a skill; it depends on how you want to package it. But the cool thing about this is that now that agents can work with the evals, you can end up in a situation where you just ask your coding agent: hey, I've got a problem here. Can you look at this prompt? I want it to do these things. And the coding agent is going to turn up and it will create an eval case where it proves that the thing has failed, and then it will go modify the prompt so that the eval now passes, and then it will go through this runbook. And one of the most important stages for us is checking at the end that the change that you've made to the prompt hasn't ended up breaking any of the other evals that you had in your test suite. And we also have a final pass that tries consolidating the prompt as well, because if you end up doing this repeatedly, you end up with a prompt that is massive and really, really difficult to maintain. So every time you make an adjustment, you kind of want to try and simplify as well.

So this has actually worked really well for us. And you can see here, this is me using it in Claude Code, where you can just point it at the eval and say, hey, have a look at the prompt. This is a real prompt for us which turns human queries into log queries for our Loki system. And it ends up racing through and goes and adds a new eval. It checks that it passes with a certain number of repeats. And then it gets to the end and it's like, yep, I think I've added it, the pass rate is acceptable, you can go ahead and get going. But the problem with this is that it solves one problem. And the problem that it solves is that if you know what the prompt is that you want to change, you can now change it fairly reliably. And that's very useful if you're working on these tools. But one of the biggest problems that you have now is that if you're building these systems, you'll know that they're not just one prompt anymore.

In fact, most of the production AI systems you will use on a daily basis are many, many, many prompts. And to illustrate this problem, I've taken our chatbot. So, this is a chatbot that you interact with inside of an incident. And what I've done is I've created a graph of all of the different prompts, tools, agents, and everything in the hierarchy that powers an interaction with our system. So you can actually see there's like 10 different agents there. There's 50... I don't even know. It's actually bigger than this; I couldn't fit it on the screen. It's a lot of stuff. So even if you've got a bad interaction that came in from a customer, you don't necessarily know which part of your system is actually the problem and which part to go change. So even if you have this eval red-green cycle, you're going to struggle to know where to go to fix it. And this gets even worse for a system like our investigations.

So if you think about trying to run through this process to debug what's going on in an incident, we have a ton of stuff that goes on inside that system, and you can see all the steps on the left. Each one of those steps unpacks into the trace that you have on the right. And really it's not about the details here; it's more that each one of these green blocks actually expands into possibly hundreds of different prompts and hundreds of different tool calls. And at any point, if you make a slight, subtle error, you can't then easily trace through the system where the error originated, even if it ends up resulting in you having totally the wrong picture of what you think the incident was and your RCA is totally wrong. So we built these UIs so that we could help humans look at them, and they've been really good for humans to look at. But going back to what I was saying before, we just feasibly don't have enough time to go through this stuff. So the problem that we had was: we have all these UI tools, but agents can't properly use them. So how do we get to a place where the agents can use the tools properly?

And I think Anthropic stumbled on this with Claude Code, where they found, when they released Claude Code, that these agents are fantastic at using file systems and just going through this data using standard tools. So we kind of thought: can we just download all of the UI that we have as a file system? And that's kind of what we've done. So now, for each of our different AI systems, you're able to download all of the content as a file system, and we drop that into a sandboxed Claude Code. At which point you can just point Claude Code at it and go, "Hey, I've got a problem here. It's behaved in the wrong way." It can see everything that went into all the prompts. It understands the structure because it's self-documenting. And then, because you have access to the codebase as well, it can tell you exactly where you should actually be making the modification to try and change it. And then you can lean on that red-green cycle from before to try and modify a prompt if you need to. Also, there's more stuff that you can put in this than you might think. There is really not much of a limit as to what you can put into a file system. So traces like this can get translated exactly from how you would present them in the UI to a text file, which then the LLM can consume in a really nice way.
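A small sketch of the "download the UI as a file system" idea: write each prompt and tool call from a trace out as ordinary text files plus a README, so a coding agent can grep its way through them. The trace shape here is invented for illustration.

```python
from pathlib import Path

def export_trace(trace: dict, out_dir: str) -> Path:
    """Dump one investigation trace into a self-documenting directory tree."""
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "README.md").write_text(
        "Each numbered folder is one step: prompt.txt is the full prompt,\n"
        "output.txt is the model response, tools/ holds tool calls.\n"
    )
    for i, step in enumerate(trace["steps"]):
        step_dir = root / f"{i:03d}_{step['name']}"
        (step_dir / "tools").mkdir(parents=True, exist_ok=True)
        (step_dir / "prompt.txt").write_text(step["prompt"])
        (step_dir / "output.txt").write_text(step["output"])
        for j, call in enumerate(step.get("tool_calls", [])):
            (step_dir / "tools" / f"{j:02d}.txt").write_text(call)
    return root

if __name__ == "__main__":
    demo = {"steps": [{"name": "query_logs", "prompt": "Find error spikes",
                       "output": "Spike at 14:02", "tool_calls": ["loki: rate(errors[5m])"]}]}
    export_trace(demo, "investigation_0001")
```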

So this is something that has turned the way that we debug our application into: we hear that there's a bad experience, we download that interaction into a sandboxed Claude Code, you sit there in the session and you go, hey, have a look at this. Tell me what you think has gone wrong, what is your interpretation of the problem? And then you go, I really wanted it to do this instead; what part of the system would you change? And then it will work its way through the hierarchy of all those tools and prompts that you just saw, and it will be able to tell you where you should be making the modification. And then, all from that session, because you have access to the codebase, you can just go, hey, can you make that change? And then you can prove it using the eval runbook that I mentioned before.

So we've implemented these file system packages for a load of different AI interactions now, so it's really easy for us to just drop this into Claude Code and get going. But we now have another problem, right, because whilst you can do this on an individual basis, we are running thousands of investigations across hundreds of our customer accounts, and we're doing that daily, because we need to know if this system is getting better or worse. So what you can see here is we have what we call a backtest, which is essentially a batch of investigations that we run on a daily basis against our account and against a load of our customer accounts as well. And eventually you just get this rolled-up number, which is like, oh, cool, 86% accurate RCA on our account, which is great, but this doesn't really tell you why the number went up, and it doesn't tell you why it went down. And if you want to improve the system for someone, you're going to struggle. So what we've actually done is we've allowed ourselves to download all of these investigations into a file system that we can then provide into an analysis pipeline that, again, is run using Claude Code, that can run a structured analysis with markdown playbooks that help you run it repeatedly and reliably each time.

So what that actually looks like is we created this repo called scrapbook. And inside of scrapbook we have this very structured flow that explains exactly how a coding agent should go through all of the information that we've downloaded, how it should understand these investigations, and the process that it should go through to actually run them. Now, the key things that I think are very important to these flows are: you start by parallelizing out all of your agents. So you start maybe 25 agents in parallel, and they can all individually build their analysis of an investigation. And then you go into the next stage of the pipeline, where you do some cohort clustering, and you look at meta points around what are the same types of failure, how do we go wrong in different ways. And by clustering it together, you end up with a really, really useful report that doesn't just tell you how this has gone wrong, but tells you why your AI system is performing well or badly on this customer account, and actually what you should do to try and fix it or improve the system. And this is something that we've done several times over for several of our systems now, and I think it generalizes really well for anyone who's building this type of thing. So the points that make a really good pipeline for this: you should leverage sub-agents to do that parallel per-entity analysis. You should store all of your analysis in files inside these downloads so that you have incremental analysis built up as you run through it, so that you can start and resume the analysis if you ever need to. And then you want to combine this analysis with the codebase that is actually powering the system, so that if it finds a problem, it can look in the codebase and go, hey, I think that this is the problem and this is the place, and it can actually do some analysis to go, I think that you should change it like this. And then at the end, because you have this all loaded in your coding session, you can just go fix it and ask the coding agent, Claude Code, Codex, whatever you use, to actually go make the change, and then you can use that eval red-green process to actually confirm it works.
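To make that per-investigation-then-cluster shape concrete, here is a rough sketch under stated assumptions: `analyze_one` stands in for a sub-agent's analysis of a single downloaded investigation, and the clustering step just groups results by a reported failure type.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def analyze_one(path: Path) -> dict:
    """Placeholder for a sub-agent analysing one downloaded investigation."""
    text = path.read_text()
    failure = "wrong_root_cause" if "RCA mismatch" in text else "ok"
    return {"investigation": path.name, "failure": failure}

def run_pipeline(download_dir: str, max_workers: int = 25) -> dict[str, list[str]]:
    Path(download_dir).mkdir(exist_ok=True)
    paths = sorted(Path(download_dir).glob("*/summary.txt"))
    # Stage 1: parallel per-entity analysis, one worker per investigation.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(analyze_one, paths))
    # Persist incremental results so the analysis can be resumed later.
    Path(download_dir, "analysis.jsonl").write_text(
        "\n".join(str(r) for r in results)
    )
    # Stage 2: cohort clustering, grouping investigations by failure type.
    clusters: dict[str, list[str]] = defaultdict(list)
    for r in results:
        clusters[r["failure"]].append(r["investigation"])
    return dict(clusters)

if __name__ == "__main__":
    print(run_pipeline("backtest_downloads"))
```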

And then, yeah, this is a PR that was created after doing something like that, where the backtest showed a couple of investigations that were going wrong and I knew exactly what the problem was. I could have a chat with it about a feature that we might change in the system, and then we can deploy that and test it out in production and see how the thing goes.

So yeah, that's it. The key thing from me is that these patterns do generalize. So for any of you in the room who are building complicated AI systems and finding it really hard to understand them or debug them or evolve them, you really need to be using AI just as effectively in your internal tools to try and understand these systems and grow them, just as you are in the products that you're building yourself. So make sure that you prioritize any of the debugging tools that you have so that they work really, really well with the coding agents that you're leveraging on a day-to-day basis. File systems are exceptionally good agent context. Like, we could have put an MCP on top of this or used computer-use agents; it wouldn't have been half as effective as this ability to just download in bulk all of the information that you need so that the coding agent can grep through it and find the details. And then, anytime you are performing complex analysis, look at creating an AI runbook for it instead. It will save you literally days or maybe weeks of your life. And then one final point from me: we are hiring. We're in London. We have just done a fairly big raise last year, and we're looking to expand the team so that we can build some of these systems. So if any of this work looks interesting to you and you're interested in being on the edge of building some of this AI SRE product, then just get in contact and let me know. I'd love to chat. All right. Thank you.

Thank you, Lawrence.

Thanks. Thank you so much. How incredible. So give it up for Lawrence, everybody. Incredible. I got told off back there. They said, "Hey, Tesious, when you clap, just fake clap, because my hands are too close to the mic and I'm causing problems." Our next speaker is great. This next talk is really fun. How many of you have been on a mission before? Like, I'm on a mission to the grocery store, I'm going to go get milk. Yeah. Yeah. Missions usually require many different steps. They require long-running tasks, require all of this. Our next talk is about this, but with agents: missions for agents.

What was your longest time that you've run some prompt with Claude Code? You know what I mean? Sometimes I say, like, fix this bug, right? And then it takes 15-plus minutes and I just watch this agent cook. 15 minutes is a long time. Anyone go longer? Longer than 15 minutes on some coding agent. What is the longest? Couple hours. Okay. Couple hours. What about days? What about days not just with a single agent? What about days with a team of agents, a multi-agent system? That's what we're going to hear about now from Luke, from Luke Alvo. Yeah, it's really going to be an exciting talk about long-running, on the order of days, multi-agent missions. So, we're going to introduce him again with applause. This is the prompt to bring on the speaker. I need you to applaud a lot, otherwise they don't come on. They feel shy, you know. And so let's give it up for Luke Alvo. Oh, I don't... Is he there? No. Wait. I don't think that was enough. He was sitting there crying, actually. He's weeping because it was too quiet. Can we give him... Let's try again. Luke Alvo. There. There we go. There we go. There we go.

Hi everyone. My name is Luke, and my goal is that 20 minutes from now, you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today. A little bit about me. So, I come from a background in dev tools. About two and a half years ago, I started a project at Block, which is where I was working at the time, and that project evolved into Goose. Goose is now one of the leading coding agents and is open source, and it was recently donated to the Agentic AI Foundation. So it's been really cool to see. Nowadays I work at Factory, where I lead our core agent harness, and Factory's mission is to bring autonomy to the entire software development life cycle.

So I want to start off with a claim: the bottleneck in software engineering nowadays is not intelligence. It's now limited by human attention. Even the best engineers can only complete a couple of tasks at a time. They may have a backlog of 50 features, but they can only drive a few forward per day, because every task requires their attention and every commit needs their review. Today's models are smart enough to figure out all 50 of these tasks, but there's just not enough bandwidth to supervise their implementation. So we kept asking ourselves: what if a human decides what to build and then a system figures out how to do so? An agent could just work for hours, for days, and you come back to finished work. So that's what I'm here to talk about.

When you start researching multi-agent frameworks and systems, you quickly realize that the field's a bit of a mess. Everyone has their own framework, their own terminology, their own opinions of what works and doesn't work. And so I want to propose a simple taxonomy. There are five frontier multi-agent frameworks. One is delegation, right? This is where one agent spawns another agent, and the parent agent may say, go figure out the database schema, and then gets a response back. This is the simplest form of multi-agent communication and is what most people implement first; sub-agents in coding tools are the most common example. The next one is creator-verifier, right? Where one agent builds something and then you have another agent that checks that work. And the key here is a separation of concerns. The agent that implemented the code has some bias, right? It wants that code to work. A fresh agent with fresh context is way more likely to find issues, and this is why we do code review as humans as well. Another one is direct communication. This is when agents communicate without a central coordinator, right? It's kind of like DMing each other. It's hard to get right, though, because state fragments across conversations without that coordinator, and there's no single source of truth. The next one is negotiation, right? Negotiation is when agents communicate, but over a shared resource. So that might be, you know, they want to use the same API, they want to modify the same portion of the codebase. But negotiation doesn't need to be adversarial. In fact, the best use case is when there's net-positive-sum trading, right? And that's when agents have a potential win-win situation while interacting. And then the last one is broadcast, and that is when one agent sends information to many. Think of it like status updates, new context that applies to everyone, new shared constraints. It's a bit less flashy than the other ones, but it's critical for maintaining coherence over long-running tasks.

And so when you have all of these different building blocks, how do you assemble them into a system that can run for many days? Missions is our answer. It's a system that combines four of those, delegation, creator-verifier, broadcast, and negotiation, into a single workflow. You describe a goal, you scope that through a conversation, you approve a plan, and then the system handles execution for hours or days. And that enables you to focus on something else. Notably, a mission is not a single agent session. It's an ecosystem of agents that communicate through structured handoffs and shared state.

It uses a three-role architecture. There's the orchestrator, there are workers, and then there are validators. The orchestrator handles planning. When you describe what you want, the orchestrator is kind of like your sounding board. It asks you the right strategic questions. It checks whether there are any unclear requirements in the problem space. And then it eventually produces a plan that includes features, milestones, and something that's called a validation contract. That validation contract defines what "done" means before any coding is done. And I'll come back to why that matters, because it turns out to be really important to the system. The next role is workers. They handle implementation. When a feature is assigned to a worker, that worker has clean context, no accumulated baggage, no degraded attention, right? The worker reads its spec, it implements the feature, and then commits via git, allowing the next worker to inherit a clean slate and a working codebase. And then the last role is validators. They handle verification. Most systems validate by maybe running lint, type checks, tests; maybe they do code review. Missions does all of that, but we also validate behavior. Instead of just asking, you know, does the code look right, we ask, does this work end to end? That's the difference that lets missions run for many hours, many days in a row without drifting. And making it work had to involve rethinking validation entirely.

were sort of shaped by the code, not by what the code was attempting to actually do. Tests written after implementation

do. Tests written after implementation don't catch bugs. They confirm

decisions. So if you rely on validation like that, your system will eventually drift.

That's why this validation contract exists. It's written during planning

exists. It's written during planning before any code and it defines correctness independently of implementation. So for a complex

implementation. So for a complex project, this can be hundreds of assertions and each feature is assigned one or more assertions that it must

satisfy. The sum of all features must

satisfy. The sum of all features must mean that every assertion is covered.

After each after each milestone of features, we have uh two types of validators that run. So you have the scrutiny validator and the user testing validator. The first one is more

validator. The first one is more traditional. It runs the test suite type

traditional. It runs the test suite type checking lints and critically it spawns uh dedicated code review agents for each completed feature within the milestone.

And then the second one which is the user testing validator is more interesting. It kind of acts like a QA

interesting. It kind of acts like a QA engineer. It spawns the application. It

engineer. It spawns the application. It

interacts with it through computer use or something similar to that. it uh

fills out forms, you know, uh checks that pages render correctly, clicks buttons, and ensures that functional flows work holistically. So, this step takes significantly longer than the

previous one of of the scrutiny validator uh because the the system is interacting with a live application. And

what we've noticed is that missions most of the mission's wall clock time is actually spent here waiting for this like real world execution to occur instead of generating tokens.

Critically, neither validator has seen the code before. They are not invested in the implementation. And so validation is adversarial by design. Okay. So then validation catches bugs, right? But for a system that runs for many days, you also need to make sure that context isn't lost between the agents. When a worker finishes a feature, it doesn't just say, "I'm done." It fills out a structured handoff detailing what was completed, what was left undone, what commands were run throughout that agent loop and what the exit codes of those commands were, what issues were discovered, and whether it abided by the procedures that the orchestrator defined for that worker.
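The handoff itself can be as simple as a fixed schema that every worker must fill in; the field names below are a hypothetical rendering of what the talk describes, not Factory's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class CommandRun:
    command: str
    exit_code: int

@dataclass
class Handoff:
    feature: str
    completed: list[str] = field(default_factory=list)
    left_undone: list[str] = field(default_factory=list)
    commands: list[CommandRun] = field(default_factory=list)
    issues_discovered: list[str] = field(default_factory=list)
    followed_procedures: bool = True

    def blocks_progress(self) -> bool:
        """The orchestrator refuses to move on if something here needs attention."""
        failed = any(c.exit_code != 0 for c in self.commands)
        return (
            failed
            or bool(self.left_undone)
            or bool(self.issues_discovered)
            or not self.followed_procedures
        )

if __name__ == "__main__":
    h = Handoff(
        feature="channel-search",
        completed=["index messages", "search endpoint"],
        commands=[CommandRun("pytest -q", 0)],
        issues_discovered=["flaky websocket test"],
    )
    print(h.blocks_progress())  # True: the flaky test must be addressed first
```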

That's how we catch issues and how the system self-heals. Errors get caught at milestone boundaries, corrective work gets scoped, and the mission pulls itself back on track, not by hoping that agents remember what happened, but by forcing them to write it down and then actually address the issues. And I'll present on that in just a sec. Our longest mission ran for 16 days, which is much longer than a full sprint, and we believe that they can run for 30. That's only possible because of this structure.

So once we had this architecture, the next question became: how do we actually run it? The most obvious choice is parallelism: if you have 10 agents running at one point in time, then you have 10 times the throughput. But we tried that, and it doesn't really work for tasks in the software dev domain, because agents conflict, they step on each other's changes, they duplicate work, they make inconsistent architectural decisions, and so the coordination overhead ends up eating the speed gains, all the while you're burning tokens. The difference with missions is that we run features serially. So there's only one worker or validator running at any given point in time. Within a feature, we allow for parallelization on read-only operations, so things like searching through the codebase or researching APIs; all of that gets parallelized. Within validators, we also parallelize read-only operations such as code review. This is serial execution with targeted internal parallelization. It seems slower on paper, but the error rate drops dramatically, and when you have tasks that run for many days, the correctness compounds.

Now, your standard chat interface doesn't really work for something that lasts many days. At a quick glance, you need to be able to see how much of the project you have completed and how much of the budget you originally set off with you have burned through. So for running a mission we actually built Mission Control, which is a dedicated view for this. You can see what the active worker is doing right now, read off handoff summaries that detail what the worker or the validator discovered and how it's going to alter its course moving forward, or you could just, you know, go hang out with your friends that night. This entire view lets you run missions asynchronously, and you can be plugged in as a project manager overseeing the implementation, or you can just go and hang out with your friends.

Okay, so the right model in each role. Everything here assumes one thing, and that is that you're using the right model in each role. Planning benefits from slow, careful reasoning. Implementation benefits from fast code fluency and creativity. Validation benefits from precise instruction following. Right? And no single model nor model provider is best at all three of these. Using systems like missions requires the development of a new skill, which internally we've been calling droid whispering. It's this idea that you need to be able to mentally model how different LLMs interact, where they fail, how those failures compound over a multi-day run, and then you need to make a deliberate choice as to which model sits in which seat. Theo, the engineer who built our missions prototype, came up with our model defaults, but we really encourage people to make these their own and customize them to the needs of their project. So for example, validation might use a different model provider entirely, to make sure that it's not biased by the same training data. This is a structural advantage of a model-agnostic architecture. You're only as strong as your weakest link, and if you're locked into one model provider, then you're constrained by that family's weakest capability. As models continue to specialize, the ability to put the right model in the right seat becomes a compounding advantage. It works in the other direction, too. If you're using missions, that structure can compensate for models that are not quite at frontier-level performance. So the validation contracts, the milestone checkpoints, they allow you to run missions very, very successfully even using open-weight models.
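Mechanically, "the right model in each seat" can be as simple as a per-role mapping that the orchestrator reads; the model names here are placeholders, and, as the talk suggests, the validator can deliberately come from a different provider.

```python
# Hypothetical per-role model assignment; swap in whatever models you trust
# for each job. Validation deliberately uses a different provider so it is
# not biased by the same training data as the implementation model.
MODEL_SEATS = {
    "orchestrator": "provider-a/deep-reasoning-model",   # slow, careful planning
    "worker":       "provider-a/fast-coding-model",      # code fluency, speed
    "validator":    "provider-b/instruction-following",  # precise checking
}

def model_for(role: str) -> str:
    try:
        return MODEL_SEATS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role}") from None

if __name__ == "__main__":
    for role in MODEL_SEATS:
        print(role, "->", model_for(role))
```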

Now, this all sounds quite theoretical. What does it actually look like in production? I've got an example of building a clone of Slack right here. This slide has a ton of info, but I'll walk you through just a few things I want to call out. 60% of our time is spent on implementation, and 60% of our tokens as well. Notice how validation never succeeds on the first go; that's in the mission view, the one on the bottom left. We almost always have to create follow-up features. So that really demonstrates the value of a system that does this QA loop. You end up, at the very end, in the bottom right, with 50% of your lines of code being tests and 90% of your code covered by those tests. And lastly, we take advantage of prompt caching heavily to make sure that we're offsetting the price of running such a long task.

People have really taken to missions, and it's been awesome to see what folks have been building with them. Some examples I've included on this slide, but the ones that I want to call out are specifically in the enterprise setting, which is where Factory really shines. They've been used to prototype new ideas and features overnight, to make sure that people can build internal tools at increasingly rapid rates, to run huge refactors and migrations for ML research, and to modernize codebases so that agents are more productive in them.

One thing that I also wanted to talk about is this concept of the bitter lesson, because every person building multi-agent systems has this fear of the next model release making their architecture obsolete overnight. So when we were building missions, we decided we had to make this system get better with every model improvement. This means that almost all of the orchestration logic is defined in prompts and skills instead of a hard-coded state machine. How it decomposes features and handles failures is all in about 700 lines of text, and four sentences of this can alter the execution strategy pretty dramatically. Worker behavior is driven by skills that the orchestrator defines per mission, so you get very customized behavior, and the only deterministic logic is very thin and focused on enabling models to do what they do best while the system handles the bookkeeping, stuff like running validation and ensuring that progress is blocked when there are handoff issues that are not addressed. So missions ensure the discipline and the models provide the intelligence, using primitives that they're already familiar with, like agents.md, skills, etc. So what does this unlock?

Remember the bottleneck that I started off with: human attention. The economics are changing. Before, a team of five engineers might be able to work on 10 workstreams at any given point in time. Now, maybe with missions, we can bring that up to 30. The team can focus on interesting problems such as the architecture and product decisions instead of worrying about the execution per se. And the important thing is the codebase ends up cleaner than when you started. The end-to-end tests, the unit tests, the skills, the structure that missions provide mean that agents and humans are more productive in that environment moving forward.

So now that you understand how missions are structured and how they actually work, you can see that they're really a composition of those original strategies, right? Delegation shows up everywhere in how the orchestrator spawns workers and how we spawn research sub-agents. Creator-verifier is fundamental in that validation and implementation are always separate agents with separate context. Broadcast runs through the shared mission state that every agent references. And negotiation shows up at milestone boundaries, where the orchestrator decides: does this handoff summary look correct? Do we need to create follow-up features, rescope, etc.? But strategies aren't enough. You need the connective tissue. You need these structured handoffs so that agents don't lose context. You need the right model in each role. And you need an architecture that will improve with each model improvement.

So what I like to think is that the people in this room who are thinking in terms of agent ecosystems, who develop an intuition for how different models compose under pressure, those folks are going to be shipping the next generation of innovation. There are a lot of open questions still, right? How do we further parallelize the workload of missions so that they run faster? How do we start orchestrating missions themselves into even more complex workflows? But the data from production missions is clear: this works on real projects at scale today. So this is what I'll leave you with. Open Droid, try running missions, argue with the orchestrator about the scope, approve the plan, and then go do something else. I'm excited to see what you guys build, and I'll be around to answer any questions for the rest of the day. Thanks, everybody.

That was so... Thank you. Hey, guess what? It's time for lunch. Who's hungry? I am. So, get lunch. There's plenty of time. Listen, you came here. You paid money to be here, okay? So don't waste it by eating alone by yourself in a corner, okay? Be with people, have community, enjoy it, and then we're going to meet back here together at 2:30 p.m. local time, where you'll have a different MC. Hype him up. He's a wonderful guy. Let's go. Thank you

again.

What we do in life echoes in eternity.

Heat.

Heat.

Heat. Heat.

Heat. Heat.

Heat. Heat.

Hey, Fear is the mind killer.

Fear is the mind killer.

Heat.

Heat.

Heat. Heat.

I know. All right.

Heat. Hey, heat. Hey, heat.

Heat. Heat.

Heat.

Heat.

Free your mind.

Free your mind.

Heat. Heat.

Free your mind.

You are who you choose to be.

execute the vision.

Heat. Heat.

Heat up here.

Heat. Heat.

Hey, hey hey.

Heat.

Heat.

Heat. Heat.

Make the requirements less dumb.

Delete the part or process.

Simplify and optimize.

Accelerate cycle time.

Automate Heat. Heat.

Heat. Heat.

Heat.

Hey, heat. Hey, heat.

Never give in. Never give up. Outlast.

Out compete.

Persevere. Persevere. Persevere.

Heat. Heat.

Heat. Heat.

Heat. Heat.

Heat. Heat.

A new age has come.

Hold still.

Let it.

I watch the sparks all burn too fast.

Everyone reaching for the flash.

They take the first light they can find and call it truth and call it mine.

But I stayed when the room went quiet.

When the noise fell out of face, sat with the weight of the question

while the easy answers walked away.

It's not that I see further. I just

don't leave it soon. I let the silence sharpen. I let the dark grow.

sharpen. I let the dark grow.

I stay the almost right past the comfortable light.

I stay.

I wait till the surface breaks. Till the

shade feels true inside.

I don't rush the fire.

I give it to I call it done. Call it enough.

But there's a deeper know still huming underneath the fear of not being love.

Every great thing asks for patience.

Every real thing makes you choose.

Do you leave with what's acceptable or stay for what's asking more of you?

They say it's talent say it's magic like it falls from open but nothing worth remembering

arrives on the first try.

I stay when it stops feeling kind when it stops feeling fast.

I wait through the restless doubt through the urge to collapse.

Hide by and chase the answer. I let it find me back.

There's a moment after the last good idea dies.

Where the room feels empty and you want to run for your life. That's the do teaches you to open. That's the edge

where the real life Hold away.

Let the shape reveal it.

I stay longer than I should. Long enough

to change.

I stay.

I wait till the pattern clears so breaks the haze.

I don't bing it. I

with time.

Most dreams don't fail.

They're just left too soon.

I stay.

I stay.

Typing thoughts into the dark, a spark becomes designed. Words evolve to

becomes designed. Words evolve to whispers meant for something more divine. syntax

divine. syntax and I see the language change. I'm not

instructing anymore. I'm rearranging

fate. Every loop I write rewrites me.

Every function hums with meaning. I feel

the interface dissolve between the maker and the new code. Not on the screen, but in the

new code. Not on the screen, but in the soul where thought becomes the motion and creation takes control. No lines, no

rules, just balance in between the zero and the one. The silence and the dream system shape our fragile skin. They mold

the way we move. We live inside the logic gates of what we think is true.

But deep beneath the data post, there's something undefined.

A universe compiling the image of our minds. Every line reveals reflection. Every loop replaces connection. We're not building, we're becoming. And the code becomes confession.

This is the new code. Not on the screen but in the soul where thought becomes the motion and creation takes control.

No lines, no rules, just balance in between the zero and the one. The

silence and the dream.

We are not just the world we're in.

We are the world we're doing.

Each prompt, each breath, each fragile spin, a universe renewing.

This is the new code.

Alive and undefined.

Where logic meets emotion and structure bends to mind. The systems eternal but the soul writes the line. We are the new

code compiling tie.

Compiling time.

We didn't light the fire.

We traced the spark through.

Every truth was waiting.

ation as I hear the echo before the sound.

I feel the answer before it's found.

No mc that were always there. Hands in the dust of centuries naming what we

uncover. Calling it creation. So we can feel like lovers of faith, of faith, of power. We don't know.

Time is not a river, it's a blade cutting order into shape. We don't move forward. We align until the pattern breaks. Nothing is invented.

It's every sequence. Gods of the real. Nothing is invented here. We rearrange what waits at the core. I am not becoming something new. I am what I was before.

Adam sings every thought every selfestee Identity is scaffolding held together by

belief I am a momentary order standing

on my tears shake me break me watch me re Assemble.

Time doesn't chase us. It releases frame by frame the truth we fear. We don't fear the ending. We fear the pattern getting clear. Nothing is invented. It's revealed.

Every meories.

We are creators of alignment in a universe that feels nothing is inherent

and every failure is a lesson learned. I

am not lost in what I am not.

I am the order that returns.

If I am only then rearrange the noise from the signal.

ing from the fire.

Nothing.

Nothing is invented.

Stand and see.

Every future was a possibility. We don't

write the laws of motion. We choose

velocity.

Nothing is invented.

Say my name. I am ordering flame. I am time collapsing into will.

I am discover. Everyone

say When the noise falls silent and the pattern holds, you'll see it was never made, only found.


What's up, folks?

How we doing?

Welcome to the coding agents track. My name is Alex. Uh, some of you know me as the host of the ThursdAI podcast. Uh, so let me catch you up. If you have not been following the news this week, Anthropic just announced a Claude Mythos model that they released but didn't really release. Uh, some companies got it. Meta's MSL labs finally released something. We've been waiting for the death of Llama, or the next Llama, and they released Muse Spark. And then Codex hit three million weekly active users, and Tibo hit that reset button that you all love. Uh, so that's great. And also, open source wise, GLM 5.1 dropped as the new open-source state of the art on SWE-bench Pro. And so all of this relates to what you guys do with coding agents. And this is just the last seven days. There has never been a better time to be in this room learning about coding agents.

And so, for that, we have three incredible talks coming up in the next hour. First, this guy literally maintains the Hugging Face agents course, MCP course, LLM course, and the smol course. Uh, if you ever learned anything from Hugging Face, which you definitely should, they're a great resource, you probably learned it from him. He's a machine learning engineer at Hugging Face. Uh, today he's going to argue that your coding agent shouldn't just write code; it should do a lot more. If you guys have heard about CUDA, he's going to talk about CUDA kernels. He's going to talk about a lot of other stuff: picking GPUs, etc. Please welcome our first speaker to the stage, Ben Burtonshaw.

Hi everyone. As you heard, I'm Ben from Hugging Face and the talk that I'm going to present to you today is called Your Coding Agent Should Do AI systems engineering.

So there are two main takeaways that I want you to get from this talk. One, and probably the fun part, is that we can use coding agents to tackle the hardest engineering problems in AI: systems engineering and machine learning engineering. And maybe the boring part is that in order to do this we're going to need standard repos, and we're going to need those on the Hub, and in many cases we already have them.

So I think in this case I'm preaching to the choir here, but in case you haven't noticed, coding agents have been accepted. Many of us have been using them for a few years, but in the last few months they seem to have crossed a sort of acceptance gradient where a broader group of people are using them. So with this in mind, how do we keep our careers and our engineering contemporary, and how do we keep challenging ourselves in new areas? My proposal is that we need to go closer to the silicon and tackle harder problems, and that's where AI systems engineering comes in.

I've broken this talk down into three progressively more complex and more autonomous steps, and I've defined those like three bosses from games. The first one is a hybrid approach where you interactively use an agent to write a CUDA kernel. The second is a zero-shot task where an agent takes a prompt and trains an LLM on Hugging Face. The third is a multi-agent auto-research setup, like a kind of automated AI lab.

So let's get started on the first boss, right? This is writing CUDA kernels. For a while, writing custom kernels was seen as this unattainable goal for the humble agent. They required complex DSLs. They required integration with relevant hardware to be benchmarked and tested. And it was seen as something that couldn't be achieved by agents.

However, that in most cases was wrong.

If you look at kernel hackathons like those on GPU MODE, the recent AMD hackathon, or papers like KernelBench, you'll see that agents are able to write valid and optimized CUDA kernels, and that's really cool and something that totally inspires me. I'm part of GPU MODE, I contribute to that, and it's something I think everyone should be doing. However, what do we do with those kernels? How do we distribute them, and how do we get them into our inference engines so that we're actually using these optimized kernels that we're generating? That's part of the question of this part of the talk.

Let's take a step back now and just uh say what a kernel is, right? So when you run an AI model on a GPU, the actual work is executed through a kernel. This

will be defined in a relevant language for that hardware and it will use relevant features to that hardware that may not be available on other hardware.

We can write custom kernels that will take advantage of that hardware for a specific math operation. Kind of squeeze everything we can out of it so that the model will infer faster.

In general, this requires a lot of expertise about writing CUDA kernels and about the hardware, and it's also a bit of an installation hell as you deal with a pretty large install matrix, from hardware to software to generations and versions of, say, CUDA, and these kinds of issues. So in short, it's hard.

Efficiency in deep learning, and efficiency in kernels, is split into three main sections: one, compute; two, memory; and three, overhead. Compute is the FLOPs: these are the matrix multiplications, the real math of the process. Memory is the time spent moving data or tensors around memory, typically from slow to fast memory. And overhead is basically everything else: the Python environment, PyTorch dispatch of those kernels, these kinds of things.

In general, most people might assume that the compute is the bottleneck here because it's doing most of the math, right?

That's not correct. In most cases, memory is usually the bottleneck. And that's because a modern GPU, let's take an H100 for example, can do roughly a petaflop per second of computation, but its memory bandwidth is about three terabytes per second. So in short, the GPU is often sitting idle, waiting for tensors to come back so they can be computed.

There are custom optimized kernels that exist, FlashAttention being the poster child of these. In general, what they do is increase arithmetic intensity: they basically make the GPU do more sums per read and write. So we move the tensors across, we do as much math as possible on the GPU in one go, and then we write it back. In short, people like to say we keep the GPUs warm, and that's the objective of writing a custom CUDA kernel.
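As a rough, illustrative sketch of that compute-versus-memory trade-off (the peak numbers depend on the GPU, datatype, and clocks; the figures below are ballpark H100-class assumptions, not numbers from the talk's slides), a tiny roofline calculation shows why most unfused ops end up memory-bound:

```python
# Back-of-the-envelope roofline model. The peak numbers below are rough,
# H100-class assumptions for illustration only.
PEAK_FLOPS = 1.0e15        # ~1 petaFLOP per second of compute
MEM_BANDWIDTH = 3.0e12     # ~3 terabytes per second of memory bandwidth

def attainable_flops(arithmetic_intensity: float) -> float:
    """FLOP/s we can sustain at a given arithmetic intensity (FLOPs per byte moved)."""
    return min(PEAK_FLOPS, arithmetic_intensity * MEM_BANDWIDTH)

# The "ridge point": how much math you must do per byte moved before you stop waiting on memory.
ridge = PEAK_FLOPS / MEM_BANDWIDTH
print(f"Need ~{ridge:.0f} FLOPs per byte to become compute-bound")

# An unfused elementwise op in fp16 does roughly 1 FLOP per ~6 bytes moved,
# so it can only reach a tiny fraction of peak -- hence fused kernels like
# FlashAttention, which raise the FLOPs done per read/write.
elementwise_intensity = 1 / 6
print(f"Elementwise op: {attainable_flops(elementwise_intensity) / PEAK_FLOPS:.2%} of peak")
```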

Hugging Face has a library called kernels, which is maintained by kernel writers, and we're beginning to scale it up to agentic workloads. At its core this is a way of distributing kernels. It has a TOML file, like any kind of project, which says which hardware it works on and which versions of CUDA and other software it requires to work, and it's now also a repo on the Hub, just like models. So if you are a kernel writer, or you're an aspiring kernel writer with an agent that you want to set up, you can now be a kernel publisher just like a model publisher. And my point is that this is a kind of super fertile ground for AI engineers looking to scale their career. If you check out these repos on the Hub, you'll see that there's compatibility for different hardware. You can configure that, so you'll know, okay, this works on my GPU or on my laptop.
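For context, pulling a pre-built kernel from the Hub with the kernels library looks roughly like this (a minimal sketch; the repo id and the function being called are illustrative examples, not something shown in the talk):

```python
# Minimal sketch of loading a Hub-hosted kernel with the `kernels` library.
# The repo id and the attribute we call are illustrative examples.
import torch
from kernels import get_kernel

# Downloads the kernel build matching this machine's GPU, CUDA and torch versions.
activation = get_kernel("kernels-community/activation")

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
activation.gelu_fast(out, x)  # call the compiled CUDA kernel instead of the eager op
```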

Let's take a look at what this looks like for an agent and how we're helping an agent to do this. First: skills. I'm sure

this. So, skills. So, I I'm sure everyone here is familiar with skills and I'm sure there've been a number of talks that really go deep into skills. I

like to keep them pretty simple, and really they're just kind of file-based context with all the wonders of files. We can open them and close them. We can version them. We can source control them, and these kinds of things. And agents can do the same: they can open them when they need them and leave them closed when they don't. So in the context of kernels, that means we can give examples of how to write and how to use kernels in skills, and agents can open those and use them when they need to. I like to say that it takes a task from being zero-shot to being few-shot, which in ML is quite a familiar concept, right? We're just giving the agent examples of how to do things, and we can be quite verbose and descriptive about that.

At Hugging Face, we're focusing on integrating skills into our projects. So what you'll find is that inside each project there are skills managed by that project, which we think is the best way to do this, because it means the maintainers of those projects are maintaining their skills. That means they're not necessarily the most, like, YOLO skills; they're well-maintained and robust. And we have another repo for the more experimental skills, which is called Hugging Face Skills. Go and check that out if you want to try some of the examples you'll see today.

In kernels, this is what the skill looks like. It focuses on benchmarking. So it has scripts that allow you to benchmark and test the kernel and see how performant it is, plus references with examples of how to do this. We benchmarked this skill: we generated a kernel for Qwen3 8B on an H100, and we found that we had a 94% speedup. This isn't a state-of-the-art speedup on this model by any means. It's really just about compatibility and a compatibility matrix. In many cases, these models and their kernels won't be optimized for the particular hardware, or generation of hardware, that you want to use them on. So you have some low-hanging fruit here where you can just come and pick up some optimizations for that specific hardware, maybe because your hardware is cheap on your cloud provider but isn't necessarily the most ideal for the model that you're using. So my recommendation would be to come here and pick up some easy speedups.
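As a rough sketch of what that kind of kernel benchmarking boils down to (this is generic PyTorch CUDA-event timing, not the actual scripts shipped in the skill), you compare the eager op against the Hub kernel on the same inputs:

```python
# Generic CUDA-event benchmark loop; the ops being compared are placeholders.
import torch

def benchmark(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
baseline_ms = benchmark(torch.nn.functional.gelu, x)    # eager baseline
# optimized_ms = benchmark(hub_kernel_gelu, out, x)     # the Hub kernel under test
print(f"baseline: {baseline_ms:.3f} ms/iter")
```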

How do we know that these skills are any good and that we should be sharing them and telling people to use them? We use an open-source library called upskill that we're also maintaining. This is a gateway to using cheaper and open models with skills. It basically generates skills, generates an eval for the skill, and then lets you compare different models on the same skill. So you can see things like: okay, GPT-OSS is slightly less accurate using the same tokens; Kimi is more accurate and uses fewer tokens; Haiku is a bit more accurate using fewer tokens; and these kinds of things. So if you've got a skill you're using regularly and you're thinking to yourself, okay, how can I save a few pennies here and get a different model on the go, then try out upskill; it allows you to iterate on your skill and improve it.
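To make that model comparison concrete, here is a tiny sketch of the kind of pick-the-cheapest-adequate-model decision those eval numbers feed into (the figures and field names are illustrative stand-ins shaped like the comparison described above, not upskill's actual output format):

```python
# Illustrative only: choose the cheapest model that still clears an accuracy bar.
# The figures below are made-up stand-ins, not real eval results.
results = [
    {"model": "gpt-oss",      "accuracy": 0.78, "tokens": 12_000},
    {"model": "kimi",         "accuracy": 0.84, "tokens": 9_500},
    {"model": "claude-haiku", "accuracy": 0.83, "tokens": 8_800},
]

ACCURACY_FLOOR = 0.80
candidates = [r for r in results if r["accuracy"] >= ACCURACY_FLOOR]
best = min(candidates, key=lambda r: r["tokens"])  # fewest tokens above the floor
print(f"Cheapest acceptable model: {best['model']} ({best['tokens']} tokens)")
```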

let's move on to boss two. I'm going to go through this one pretty quickly. This

is about fine-tuning models. If you're

really into this, there was a talk yesterday by my colleague Murvy that went into this deeply. And there's also a blog post here where we got Claude to do this; this was from back in November, December time. Go and check this out. Basically, you can just say: fine-tune Qwen3 6B on this dataset. This is a chain-of-thought dataset, and you'll improve the model's chain of thought. This is fully integrated with the Hub now, so you can even run the GPUs on the Hub, and it uses HF CLI skills. So it's all very available; I would try this one out. You can also try this other one, which uses Unsloth, so it's even cheaper. It runs with optimized models, it's maintained by Unsloth and by us, there's another blog post, and there are also often free credits that you can get around these blog posts. So go and check these out.
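For a flavor of what the agent ends up driving under the hood in a run like that, here is a minimal supervised fine-tuning sketch with TRL (the model and dataset ids below are placeholders for illustration, not the ones used in the blog post):

```python
# Minimal SFT sketch with TRL; the model and dataset ids are illustrative placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any chat-formatted dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small model so the sketch is cheap to run
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-demo", max_steps=100, per_device_train_batch_size=2),
)
trainer.train()
trainer.save_model("sft-demo")
```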

Okay, let's move on to the big one. This is the auto-lab, multi-agent research, which is a project that basically keeps me up at night.

Andrej Karpathy, a few weeks ago, maybe a month ago now, released a project called auto research, which was based on his other projects nanoGPT and nanochat. It took the nanoGPT architecture and got Claude Code to write improvements to that training script so that it would improve the training process. So we can see here the experiments going on, and for each experiment there's a change in the training script which improves the efficiency, measured in bits per byte, of that run, and we can see that the efficiency ends at its best at the end of the process. I, like everyone, thought this was super cool and I had to start implementing it straight away. But one

of the things that stood out to me was I found it kind of weird that we had one agent working in a single way, iterating, going and finding improvements and then implementing them.

And it would make sense to distribute this. So that's what I did: I distributed the task amongst a research team with four roles. We have a researcher that basically looks up papers. For this we use HF Papers, but we could also use arXiv papers. HF Papers is cool because it has a CLI, so you can just pull and search papers from the Hub, and it acts as a literature scout. It looks up papers with ideas and formulates those as hypotheses. We then have a planner, which takes those hypotheses and maintains a queue of jobs. We then have a set of workers; they pick up those hypotheses, and their job is to implement them as training scripts, so in many cases just change the architecture or change a parameter or something. And then we have a reporter agent that goes and monitors all these jobs and maintains a dashboard that we can use.
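A minimal sketch of that loop, just to make the four roles concrete (the field names, role prompts, and JSON-file queue here are illustrative guesses at the structure being described, not the actual repo's code):

```python
# Illustrative skeleton of the researcher -> planner -> worker -> reporter loop.
# Field names and the JSON-file "queue" are assumptions made for this sketch.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

QUEUE = Path("experiments/queue.json")   # shared state kept on the main branch

@dataclass
class Hypothesis:
    paper: str                    # where the idea came from (researcher)
    change: str                   # the single change to try (planner keeps these small)
    status: str = "queued"        # workers flip this to running / done / failed
    score: float | None = None    # e.g. bits per byte, filled in by the worker's job

def plan(new_hypotheses: list[Hypothesis]) -> None:
    """Planner role: append fresh hypotheses to the shared queue for workers to pick up."""
    QUEUE.parent.mkdir(parents=True, exist_ok=True)
    queue = json.loads(QUEUE.read_text()) if QUEUE.exists() else []
    queue += [asdict(h) for h in new_hypotheses]
    QUEUE.write_text(json.dumps(queue, indent=2))

plan([Hypothesis(paper="arXiv:2402.xxxxx", change="swap GELU for SwiGLU in the MLP")])
```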

So this is what it looks like. You can see here that we're working in a git project, and we have a main branch that we maintain, with our train script that we're updating in each branch, and then a train-original that we keep. And then we have a data structure on the main branch that we use to just keep the scores.

We implemented this in opencode for this example, but in the repo, which you can also go and check out, it's also implemented in Codex and Claude Code if you want to try those. I also implemented it in Gas Town, but that's kind of wild-west stuff, so I did it in a separate project. But basically it works almost anywhere, because it's more of a conceptual implementation, right? First you have your planner creating hypotheses, you have your researchers looking up papers, and then your reporter picking all of this up and handing it to workers. As I said, those workers integrate with HF Jobs, so they start these jobs off on the Hub, running on the hardware that they need, and then they submit these patches that go back.

The reporter operates in Trackio, which is an open-source dashboard that we use for all metrics. Trackio is useful with agents because it uses a completely open data layer, basically Parquet. So if you don't want the dashboard, or your agent doesn't want the dashboard for any reason, it can just get into the Parquet and do whatever it wants. So if you need a Gantt chart or some other visualization, it can just go and do that. So I would say it's like the best agent dashboard tool, because it's basically just a data store; it's basically just a data structure.
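For reference, Trackio exposes a wandb-style logging API, so a worker job can report scores with a couple of lines like the following (a minimal sketch; the project name, config, and metric values are illustrative):

```python
# Minimal Trackio logging sketch; project/metric names and values are illustrative.
import trackio

trackio.init(project="autolab", config={"change": "swap GELU for SwiGLU"})
for step in range(3):
    trackio.log({"bits_per_byte": 3.2 - 0.05 * step, "step": step})
trackio.finish()
```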

Okay, so let's just walk through this now. This is it implemented in opencode. If you don't know opencode, you have agent configurations; in this one I just set up autolab, which was the name of the agent configuration I have, and it has skills. This was the prompt, so it says: run one autonomous local auto-research pass in the repo using the defined roles. I tell it to use the planner to propose up to two fresh, single-change experiments, and to use the reviewer to reject duplicates or stale ideas. I also tell it to use an HF bucket, because I want all of the storage to be in the same bucket so that I don't have to upload or download the training scripts every time.

And then we go and select one of the sub-agents; it's a nice little interface in opencode, but it's similar in other tools. So I select the planner, and then you'll see that the planner receives this prompt and uses a specific template which I defined in my configuration: it's going to have the current state, a list of the jobs so far, the things that have worked (which were defined by the reviewer), and the current hyperparameters that it can change, and it's basically just defining these jobs, which will go onto the job list as I mentioned.

We then switch over to a reviewer agent, which will receive all of these jobs. It has a similar kind of structure based on a template: a reference to where it should be working from and the latest score that it should be using. It gets an overview of all the failed and successful experiments, which it will then use to base its decisions about what goes into the next queue. And it creates this little table, which we don't really need to look at; it's really just for the agents to interact with each other and to get this information back.

To be honest, that's a little bit of a verbose example and uh we maybe don't need this many tables and you could probably trim that bit down.

But in general, I'd recommend that if you think this is cool, go and try that out in the repo.

So this agent runs in parallel, sometimes for hours, and this is the Trackio dashboard that we use; these are all the runs that are pushed to Trackio. As I said, the main advantage here is that this is fully open source and it's just a data layer, but we get all of these kinds of visualizations. Trackio can also have events and warnings, so we can have all of these events being reported by different agents and we can filter those down. We can even tie those up to notifications, so you can get emails from Trackio if your agents are kind of going rogue or something and you need help. But best of all, it just has this free-form structure, so you can just throw tables in that don't necessarily fit with any other structure.

And then on the hub side all of these jobs are just run inside hugging face.

So you can explore those jobs and in most cases you can tell the agents to use uh like labels and you can sort those labels and review through what they're doing.

Or you can just look at it like this. As I mentioned, you can access that underlying data layer and just create a Gantt chart, because this was a convenient way to look at what the agents were doing over time. So you can see this amber agent went off, and this was the score that it got. But you could visualize this however you want, because you have access to this data lake.

The TL;DR of the whole thing is that, yeah, you can go and have your own AI lab, and you can try it out. If you have a verifiable experiment, like training a model or writing CUDA kernels, then it is pretty easy to implement and set up, and to learn some stuff.

So let's now look at the takeaways. In simple terms, I'd say that agents work really well with primitives, and open primitives: we want tools that are fully open, things like Trackio, things like kernels, that we can expose to agents and that they can control in their own way. Even though abstracted APIs are really useful, if we have a layer that we can't get behind, that is a ceiling. So we don't always need to abstract; it's more about exposing well. And the other takeaway is that the Hugging Face Hub is ready for these kinds of workloads. We have the fundamentals in place, like storage, tracking, and compute, which I think will allow us to scale our engineering to new levels.

If you found any of this interesting, I've shared it all on X and on Hugging Face, and there's a blog post about basically each one of the examples that I just shared with you. They all have repos attached to them, so you can go and try that out for yourself. If you find anything that's broken, please tell me off. If you think that this was completely wrong, come and find me afterwards and sort of bully me. That's fine. But most of all, thank you, folks.

Can we get another big round of applause for Ben, please? As you guys walk away... Thank you. I love that: agents should do more than coding, right?

Um, folks, while our next speaker gets set up, I want a quick show of hands from those of you who are staying here. There's a new thing that happens lately when you're about to go to sleep and you're like, "Oh [ __ ], my agent is not going to work throughout the night," and there's a bit of stress. Anybody here have that little bit of stress? Okay, cool. So our next speaker is going to tell you about FOMAT. FOMAT is a very specific thing that defines this as a category. Folks, please welcome Michael Richmond.

Thank you.

Thank you, everybody. My name is Michael Richmond. I lead several teams at Bitly, the link shortener. But today I'm here to talk to you about FOMAT: fear of missing agent time. You know FOMO, fear of missing out. You also know FOMAT; you just didn't have a name for it.

Um, what is fear of missing agent time?

It's being out on a walk and having an idea that you want to task your agent with and having to wait to get back to your

dev machine to actually do it.

It is when you get up from your desk with an agent that was chugging away 30 minutes ago, and you get back and realize that after two minutes it actually stopped to ask you a question, and it's been sitting there blocked the entire time.

We all want to believe that our agents are low-touch and high-autonomy, but we all know the truth, right? It's back and forth. It's babysitting, and you cannot predict when you're going to be needed for input.

Right now, coding tasks might typically take anywhere between five and 45 minutes, and you kind of know when to check back.

There isn't that much time spent in agent idleness, but that window is only going to get longer. And the longer the agent waits for you, the more agent time you have missed. If a task is running for 5 hours, or eventually 5 days, you can't just check back in a bit. You need to know when it needs you and you need to know when it's done. And that might be wherever you are, whenever it happens; you can't predict it.

You may not be at your dev machine.

Did I go back? I went back. Sorry. So, once again, my name is Michael Richmond. I run several engineering teams at Bitly. I also co-lead our AI coding tools strategy. I'm a really hands-on engineering leader: I run teams and I also write code. I co-wrote the Bitly MCP server, and I train our engineers on AI skills and best practices. I think about tools a lot: the tools that we use every day, how they work, whether they are effective for our workflows or not. And agentic coding has really changed the world of software in the last year, and the road is very much being paved as we are driving on it.

So I built a system called command and control in order to help me work with coding agents outside of the terminal or the IDE, because I really needed it and nothing existed to solve it yet. Anthropic recently released some ways to address this, with remote control and the teleportation mechanism, and I think it was just two days ago that Cursor came out with a solution in the space. So I wrote command and control. One of the things that is nice about command and control, and I'll show you it in a minute, is that it's a way to get all of your coding agents in one place, on your mobile device or really anywhere.

So, this is what my setup looks like. I have multiple terminal windows. Each of these windows has multiple tabs. Here's Claude Code. Here's what Codex would look like. Here's Gemini. Here's the Cursor IDE. And at any given moment, I might have multiple sessions running, multiple agents across all of these in various states of completion. And here's the thing: I don't know about you, but I cannot keep track of more than two or three sessions at a time. As soon as I get to four or five, I don't know what session two is doing anymore. I don't know which one needs my attention, and I have no idea what states of completion things are in. So, how do you know when an agent is stuck and needs a decision from you? How do you know to check in on an agent mid-task and find out that it's gone off the rails?

What if you want to start a new session and you're not at your dev machine?

That's exactly why I built this system.

So, this is what it looks like in an iPhone app. It's on Android. It's on the web. And it lets you monitor and interact with agent sessions and launch new sessions from anywhere: from your phone, from the web, even from your watch.

I have a couple of video demos that I'm going to play to show you a few of the features. And the point here really is that this is a solution to a workflow problem. One solution we saw this morning, what was it called? Agent Craft, which is the gaming version of something similar, which I thought was awesome. So, here's the first demo.

Okay, here I am in the terminal. I'm going to start a Claude Code session, and I'm going to issue a command that's pretty run-of-the-mill. I'm not actually exercising Claude Code functionality; I just want to demonstrate command and control. So, if I come over here to my phone simulator, you'll see that I've got the command and control app over here. And what I've got here is my sessions. This is all of my sessions, grouped by ones I want notifications for, ones that I want to keep my eye on, and recent ones. And here you'll see the one we just issued, 27 seconds ago. Let's get out of this session over here. And what you'll see is that the response I got from the agent is the same one, because it is the same session.

Now, the beauty of command and control here is that, let's say I want to subscribe to this particular session for notifications, and I'm going to issue a response to this one. I'm going to say: sleep for one second and say hello. And as soon as I hit this, I'm going to leave the agent working. I left the app, because what I want to demonstrate here is the push notification that I'm going to get once it actually completes. And there it is. So I can click right into that guy, and I got the hello. Now, if I come back over to my session and I resume this guy, my response is right there, with the agent response as well. The beauty of this is that I don't have to stay in my terminal. I can walk away with my phone, get notified when there are answers, and respond right from there.

So, that's one basic feature of command and control: interacting with sessions from anywhere. Let's see another one. This is starting a new session in the mobile app. I was going to do a live demo, but after I recorded these video backups I was like, I'm going to show the videos. So, here's the next one.

I want to demonstrate another important feature of command and control, and that is that I can start a new session right from the command and control UI. I can pick any configured agent that I've got here. I'm going to stick with Claude Code, but you see I've got Codex and GitHub Copilot and Cursor here. I'm going to stick with Claude Code, and I'm going to switch my directory to a testing directory. And I'm going to just ask it: what time is it? Now, the reason I'm showing you a pretty simple prompt is that the prompt doesn't matter; the point is that I'm issuing a prompt to my agent and I'm getting a response. You can see I'm working late at night here. And the beauty is I can go over here and resume the session. And here's my session right here that I started from command and control. I can go into it here and pick up where I left off. And if I issue a command here, say, what is the date tomorrow, what I will see over here is the same conversation. And that is the beauty of command and control. I often start my day with prompts that I issue from bed, honestly, and start up a bunch of sessions, and resume them either in the CLI or right on my phone.

I hope you're starting to see the power of command and control here. And like I said, this applies to any of the coding agents that I've got configured on my machine.

Now to the problem of keeping track of sessions. This third video, and then I'll talk a little bit more about the importance of this, is session management. Now, I mentioned how hard it is to keep track of all the sessions you've got going. I mean, if you just look at the number of sessions I've got here, there are a lot, and I wanted to revisit the different sections that are available in command and control. So, as I mentioned, you can subscribe for push notifications; that's this top section here. The "on my radar" section is ones that I just want to keep my eye on, but where I might not want something as chatty as push notifications for every message. And then there's a recent section here, which is basically the last 24 hours. And then the rest. Now, you can see there are thousands of sessions here. I couldn't possibly keep track of them all, but here they all are, organized for me. Another useful feature in command and control is what is referred to as the overview dashboard. One of the really nice features of this one is that you can get a kind of standup summary of the most recent sessions. And this is just using the last several messages to give you an overview of what's been going on in each of those.

So I hope you can see the power here, and I think this is the kind of thing that we need in our new world of agent orchestration. This is a number of things. It is interacting with your agent sessions, or starting new sessions, from wherever you are. It is session management. It is notifications, so you know when your agent needs you and you don't have to guess. And it's also all of the coding agents that you might be using, in one place. It's all those things, from wherever you are.

A little bit about the architecture of this, which you might be curious about. Each agent platform (Claude Code, Cursor, Codex, Gemini, opencode) has a command and control daemon that runs alongside it. The daemons talk to a control plane layer. They monitor the lifecycle of the agent, and when things change (it's blocked, it needs your help) they communicate that up to the control plane layer, and then the UI talks back to that API layer and notifies you of things. The control plane aggregates all of the agents, regardless of where they're running or what framework they're running on. And so this could be your dev machine, this could be a cloud VM, or it could be both. And this is an important point that I want to emphasize: I needed a system that is a single pane of glass into all of my agent sessions, regardless of which platform they're running on, regardless of what machine they are on. I might have Claude Code running on my Mac and Codex CLI running in a cloud VM, and all of the sessions from both of those are available via a single UI. And like I said, it's coding-tool agnostic by design; it works with almost all of them. I will also add that the daemon layer is open source, so you could plug it into any of the agent frameworks that you're working on and then access it through this single UI today. So whatever your agent, you can reach it and it can reach you.
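As a hedged sketch of what a daemon like that might look like (command and control's actual protocol isn't shown in the talk; the endpoint, payload fields, and state source below are all illustrative assumptions):

```python
# Illustrative daemon loop: watch an agent session and report lifecycle changes
# to a control plane. The URL, payload shape, and state file are assumptions.
import json
import time
from pathlib import Path
from urllib.request import Request, urlopen

CONTROL_PLANE = "https://control-plane.example.com/api/sessions"   # hypothetical endpoint
SESSION_FILE = Path("~/.agent/session-state.json").expanduser()    # hypothetical state source

def read_state() -> dict:
    """However the agent exposes it: e.g. running / blocked_on_user / done."""
    return json.loads(SESSION_FILE.read_text())

def report(event: dict) -> None:
    req = Request(CONTROL_PLANE, data=json.dumps(event).encode(),
                  headers={"Content-Type": "application/json"})
    urlopen(req)  # the UI subscribes to these events and pushes notifications

last = None
while True:
    state = read_state()
    if state != last:                     # only report lifecycle *changes*
        report({"session": state.get("id"), "status": state.get("status")})
        last = state
    time.sleep(5)
```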

And this brings me to the point that we all want our agents to be maximally autonomous, right? And the honest truth is they are not yet, as much as we hype it, and especially not if you're limited to interacting with them in their native environments. This is one solution that addresses that problem.

And I want to talk about a broader concept here that this is related to, and that is that just in the last year, the agentic coding workflow has completely transformed what software development is. It has transformed how we work, and it has also transformed how we enjoy how we work. So you know the old concept of flow, which is being totally in the zone, locked into your code, hyperfocused on a single thing, you and your code solving a problem. I think that the new type of flow in the agentic world is more about agent choreography: multiple agents working in parallel, with you moving between them, unblocking one, redirecting another. And the new flow comes from the elegance of that choreography and the results on the other side. So then maybe some of that fear of missing agent time can be alleviated.

Another thing I really want to note here is that this paradigm of the always-available agent is actually highlighting the value and importance of time away from your agents. Matt PCO alluded to this yesterday. I don't know if you heard the Simon Willison interview on Lenny's podcast last week. The cognitive load of managing multiple agent sessions is really high, and it is exhausting. As you all are probably aware, you need a break. And here's the thing: it is in those breaks that we often have our best ideas. We need systems that make it possible to reach our agents during those breaks, wherever we might be and whenever that happens. And that is how we truly alleviate the fear of missing agent time.

Once again, my name is Michael Richmond. I'd love to hear about your workflows, your hacks, the pain points that you encounter. I invite you to give command and control a try if you're interested. Here's where to find it online, and here's where to find me on LinkedIn. I'd love to connect and continue the conversation. Thank you.

Folks, can we get a big round of applause for Michael? Uh, command and control. Folks, I don't know how many of you stopped sleeping, but I definitely felt the same when my agents are running. A little bit more. Hopefully command and control is the way to regain some sleep time. Uh, folks, I also definitely have FOMAT. I don't think my open claw is clanking right now; I need to go backstage and make it clank. Our last speaker for this block is bringing us back to where a lot of us usually are day-to-day. He's going to talk about the GitHub Copilot agent in VS Code. He's a cloud advocate at Microsoft based here in London. He organizes the London React meetups, and he's gonna cook with agents in VS Code in 2026. Copilot for me is where it started. I had a little brief thing about whether or not developers will survive Copilot and, yeah, we're still here. So that's great. Please welcome Liam Hampton.

Super.

Hello, everybody. It's great to see so many of you who are still here on the final day, right at the end. So, I hope you all had a great conference. Uh, show of hands: who here uses GitHub Copilot? Awesome. Lovely stuff. Uh, who here uses VS Code with GitHub Copilot? Awesome. So, I'm going to be talking about both these things today. I'm going to be talking about cooking with agents in VS Code.

Now, the gentleman before was speaking about the cognitive load of agents, and that is absolutely correct. You see so many different things now with agents; they're popping up all over the place, from the CLI, in the terminals, in chat windows, in other editors, etc., etc. But we still somehow seem to find ourselves in this sort of paradigm where everybody thinks agents can solve the world's problems. You still see developers, and I still speak to folks, who think we can do one-shot prompts and they'll create a wonderful application or solve all of their issues in one go. That's not the case, and we end up asking these questions from a business perspective: what's the ROI, what's the productivity boost, where are we seeing our money? And at the moment we're seeing this whole expenditure on AI, all of this infrastructure, all of these toolings and services, and we're still yet to really reap the benefits of those services.

So when we look at how people are spending and how businesses are looking at AI, we really need to be very careful with how we're utilizing the tools and services. We need to be careful about token spend. We need to understand the tools and the flexibility. I read somewhere yesterday on LinkedIn that somebody has released a repo, it's growing massively in popularity, and it's about talking like a pirate: getting your chatbot, your AI services and language models, to talk back to you like that, because it reduces the token spend. Now, people are coming up with these really intuitive and really fun ways to get around token expenditure and really pull in those benefits very quickly.

So what I'm going to be talking about is GitHub Copilot agents. Now, this doesn't just apply to GitHub Copilot; this also applies to other AI agents as well. So when we're looking at Copilot agents, around context, what they really have access to in your workspaces, how they're being used and utilized from within VS Code and the CLI, we're going to be looking at all of those things very shortly.

So, just plain and simple: what kind of agents do we have at the moment? We're looking at local agents. We've got local agents, which are in VS Code. You may use Claude, you may use all these other AI services; still applicable, still running on your local machine with remote models. Or maybe you're using locally hosted ones as well. But this is a way to have local agents interacting with you side by side: very hands-on, very much in the context, and human in the loop.

Then you've got background agents. Here we use the GitHub Copilot CLI. We have also got access to that within VS Code, but this is a more isolated way to be using them. Now, we are actually using git worktrees. Show of hands if you know what git worktrees are and who uses them. Awesome. Wonderful. For those who don't know, an easy way to explain it is that it is a branch that is mapped to an isolated folder within the workspace that you're working in, like a subdirectory, just a copy of your code with its own little branch associated with it. Very similar to a git branch in general.

Then you've got cloud agents. Now, cloud agents are quite an interesting one, because they allow you to scale outside of your organization very, very quickly and utilize a lot of the power of, I guess, the cloud and some of the services that we're using in GitHub. We use these when we don't want to be touching it ourselves. I use this when it comes to writing documentation, or having less of a hands-on approach.

So when would we use a local agent? Well, I'd use a local agent when it comes to writing tests. I want to be really hands-on with my tests. I want to understand what's going on in the codebase. I really want to be in there, in the weeds. When would I use a background agent? Well, a background agent would be great if I want to be sort of 50/50. Say I want to create a UI for the front end of an application. I kind of want to know what's going on; I don't really want to hand it off to a cloud agent, because I don't want to be fully out of the loop. But I also don't want to be really hands-on, to and fro, myself, because that can take time. That can be quite arduous. That can be quite annoying. So I would use a background agent, and I'm going to show you how I'm using autopilot to do exactly that with GitHub Copilot in just a moment. When would I use a cloud agent? Well, I would use that mostly for documentation. I hate documentation. I don't like writing it. I don't think many people do, unless you're a content developer. I really just pawn that off to the cloud agents, and that could be making a repository open-source friendly. It could be writing a readme, using some skills to do that as well.

So what I'm really looking at is VS Code as a single entry point for AI agents. We have got third-party support; we've got background, local, and remote entry points for all of these agents. So ultimately what we're trying to do is understand where you are sitting as a developer and how easy we can make it for you to use these agents, to reduce that cognitive load.

Seems quite complicated, but it's actually really straightforward. So I'm going to show a video now. I was going to do this live, but I don't really think I'm going to have time to do all of this live, so I'm going to whiz through this video.

So I'm going to start with a very simple Python application. This is just CRUD: create, read, update, and delete. Just a very simple product store. Not very pretty, not very good. As you can see, pretty straightforward.
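For orientation, a toy product-store CRUD service of the kind being described might look something like this (a minimal Flask sketch with hypothetical routes; it's a stand-in, not the actual demo repo):

```python
# Hypothetical stand-in for the demo's product store, not the actual repo.
from flask import Flask, jsonify, request

app = Flask(__name__)
products: dict[int, dict] = {}   # in-memory store keyed by product id
next_id = 1

@app.post("/products")
def create_product():
    global next_id
    product = {"id": next_id, **request.get_json()}
    products[next_id] = product
    next_id += 1
    return jsonify(product), 201

@app.get("/products/<int:pid>")
def read_product(pid: int):
    return (jsonify(products[pid]), 200) if pid in products else ("Not found", 404)

@app.put("/products/<int:pid>")
def update_product(pid: int):
    products[pid] |= request.get_json()
    return jsonify(products[pid])

@app.delete("/products/<int:pid>")
def delete_product(pid: int):
    products.pop(pid, None)
    return "", 204

if __name__ == "__main__":
    app.run(port=5000)
```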

What I actually want to do is create a front-end UI for it. So, I've got a ticket up in GitHub, and I'm saying, "Hey, this is wonderful. Go and add a front end. We need some more prettiness here. We need it to look good." So, I'm going to say, "Summarize and plan a solution to issue 25." Now, you'll notice I'm actually using a CLI

background agent at this point. I'm using that because I want it to be sort of hands-on, hands-off: a little bit of understanding of what it's doing. Also, I don't really care if it messes things up; it can go and iterate. I'm also going to be using autopilot. Autopilot is currently in preview, and this just means it's not going to ask me a bunch of questions if it wants to do a bunch of tool calls. Great. Wonderful. Can be very dangerous. Use that at your peril, right? Don't just abuse that one. But I'm using it here to create a plan; I don't want it to ask me every single time it wants to do an MCP call. So I'm then saying, "Wonderful. Here's the plan. Now start it. But before you create a pull request (because on autopilot it will do a pull request), stop and pause and let me test locally."

Whilst that is off doing its lovely stuff, I can then move on to my next stage where I'm going to be using another kind of agent. So, I'm just going to go and leave that one behind.

Let's go and spin up a new chat and start a cloud agent. I've noticed that this is not a very open-source friendly repository. I want this to have a readme, contribution guidelines, all these files that I really want as an open source project. So I'm going to say, "Hey, go and make this open-source friendly. Add all the necessary files for it." I don't really care about the details. Now, as a developer, I can go into my codebase and start poking around. I've noticed that I don't have any tests. So, I'm going to go check that out. And I've noticed there is a custom agent available for me in VS Code. This custom agent essentially explains and shows how to write test cases for this Python application. So, what I can do now is start spinning up a local agent. Just like that, at the very bottom, I can click local. I'm going to select Claude Opus 4.6. I'm going to have medium reasoning; I want it to be kind of fast, I don't really care for it too much, and it's got a great understanding from this custom agent. Go and write some unit tests.

Now, as a developer, I can still skim through. I've got very much a hands-on, to-and-fro relationship with the local agent. I've got a remote agent doing some work for me, and I've got a background agent creating a new front end. So here I can see, right, it's written some tests; it's going to go ahead and try and run them. It's passing the tests, but I've also noticed that there are some other problems in the code. It's not very friendly; the errors that are coming back are not wonderful. So, I'm going to say: go and update the error handling on the routes, and update the tests as well. So, you can see I've got a lot of to and fro with this local agent. I've got my remote agent working and I've got my background agent working, all simultaneously.
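To make the "local agent writing tests" step concrete, the kind of output you'd expect looks roughly like this (a hedged pytest sketch against the hypothetical Flask routes from the earlier snippet, not the tests actually generated in the demo):

```python
# Hypothetical tests for the toy product store above, not the demo's actual tests.
import pytest
from app import app   # assumes the Flask sketch above lives in app.py

@pytest.fixture
def client():
    app.config["TESTING"] = True
    return app.test_client()

def test_create_then_read_product(client):
    created = client.post("/products", json={"name": "widget", "price": 9.99})
    assert created.status_code == 201
    pid = created.get_json()["id"]

    fetched = client.get(f"/products/{pid}")
    assert fetched.status_code == 200
    assert fetched.get_json()["name"] == "widget"

def test_missing_product_returns_404(client):
    assert client.get("/products/9999").status_code == 404
```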

So, whilst that's going off and working, I can go and check out what my other agents are actually up to.

So, as it's working through this, you can see Copilot is just going to be skimming through. I didn't actually speed up some of this video; this is all pretty quick. I did this pretty quickly with these agents, but there you go. You can see some of the code is updated. We've got the new tests, we've got the code updated. Let's go and check out the background agent, which has now finished, which is cool. Let's go and check on the remote agent. How's the remote agent getting on? Well, actually, this is the test. Run the tests. The tests have passed. That is the local agent; that's now finished. Now I can go and check out my remote agent.

So as we're walking through this, we can see that as a developer, I've got very much hands-on, hands-off. I'm working with multiple agents simultaneously. We can see where they're running, all within this single context of VS Code. Now, if I go and look at the pull request extension in VS Code, we can see that I've now got a pull request. And this is one that I previously ran earlier; the one that's running in the chat is actually taking quite a while, but the principle still stands. It's running all these different agents at the same time.

So, all I really want to do now is go and check out my background agent. I want to go see it working. I want to go see this new front end that I've just created. Now, I asked it to pause before it pushed a pull request and tell me how to test it. So, I'm going to say, well, actually, the way you're telling me how to test that is wrong. So I've still got hands-on, hands-off; this is more of a 50/50. I'm saying: this is working in a git worktree, how do I actually run this? Now, remember, this is what it currently looks like as an application. So, I'm going to go check out the new directory, which is a git worktree. I'm then going to run this Python application. You'll see a very drastic change between what is created in my single directory versus what I currently have. There is a port conflict here, so hurry up and run that.

There we are.

So, like that. This is the new product demo. This is the third agent that I'm running simultaneously. And that is essentially a great example of how you can use different agents within one context, kicking it all off using GitHub Copilot. That's the new error checking. And that is how I've been using multiple agents. So: one codebase, three problems, three separate agents, fixed all at the same time. The local agent was writing my tests for me because I wanted to be hands-on; I really wanted human in the loop. I used my background agent to write the front end because I don't really care what it does; it's quite an arduous task, it's quite big, it's quite time-consuming. And then I used my cloud agent to write my documentation for me. So all in all, that's a pretty successful run.

So how

So, how are these cloud-powered agents actually working? Because I get this question quite a lot. How much is it going to cost? How is it working? What are they doing? And how do I get them running? Well, they're actually running in GitHub Actions. They're pretty safe and secure because they're running in an isolated environment, and they have got extended context through MCP servers. Who here uses MCP servers? Just out of curiosity. Awesome. So, the cloud agent actually has access to the GitHub MCP server and the Playwright MCP server, so you can do testing with screenshots, you can do automated front-end testing, and you can obviously write your workflows. You've got the dynamic workflows now, and it has got built-in safeguards. So you've got network firewalls: you don't want this agent talking to a whole bunch of different things, so it is absolutely whitelisted and restricted. It also doesn't have access to your main branch, therefore you're not able to push directly to your main branches. It's very much restricted in that sense, so it is very safe to use. Now, I mentioned earlier that this is very much Copilot.

But it is not just GitHub Copilot that this applies to; the same concepts actually apply across all the different AI agents that you can use. So: custom instructions, which very much define how the agent is running. You've got custom agents, which is what I showed you today in that short demo, where you're able to use very specific agents to tackle certain problems, i.e. fixing test cases or writing test cases. You have prompt files, which will help you with your prompting, and agent skills. Agent skills are more like the newer version of AGENTS.md; there's always a new thing coming out every single week now. So all of this is actually applicable to GitHub Copilot as well as to other AI services too.

Now, inside VS Code there's a modal which is very recent, and I can jump out of the slides in just a moment to show you exactly how this looks.

So, if I go over to VS Code, open up my GitHub Copilot chat pane, and click the cog up here, you can actually see everything that I have in one user space for you to customize the chat and the agents that you are running. So, you've got agents: I've got my custom test agent, and I've got my built-in agents, which are ask, explore, and plan. I've got some skills; this is essentially what the VS Code team pre-empts you to be using here. So, you've got some extensions, you've got address a PR, comments, you've got create a pull request. You can jump into these and edit them as you wish; these are just intuitive skills that we have popped in there for you. I don't have any instructions, but this is where you'd have your instructions file and your prompts; if you've got any built-in prompts, like creating an agent, those are different prompts that can then go off and kick off skills. You've got hooks; I don't have any hooks on this one, but as a very good example, if you wanted to create or configure some hooks, you can do so with Copilot inside VS Code, and any MCP servers as well. So you kind of have this whole control plane in this modal, which allows you to control your agents and chat customizations from within one single place. And this isn't just confined to Copilot; we have third-party support as well. So there is Claude, down here, so you can have access to all of your Claude things: all your plugins, hooks, instructions, and skills for Claude too. So it's not just restricted to VS Code and GitHub Copilot.

So if you want to get hands-on with some of these skills or customizations, we have got this awesome open source project which we're running. It's called Awesome Copilot, at aka.ms/awesome-copilot. Like I said, this is directed at Copilot, but it's absolutely not just for Copilot. You can take these, massage them, and use them for other AI tooling as well, because we do know that people in the community use more than just Microsoft things and GitHub things.

We also have an MCP server. So if anybody's interested in utilizing this from their workflows, from an MCP standpoint, we have also encapsulated all of that into an MCP server.

For those who don't know about MCP: the Model Context Protocol is a great way for you to get hands-on and extend the LLMs that you're working with, or any of the chat customizations that you have. For example, if you wanted to talk to your Azure resources, or, I don't know, GCP, AWS, etc., you can go through an MCP server. You'll obviously be locked down by authentication, but there are also free and open ones which don't require authentication, like Playwright, and documentation ones, i.e. Microsoft Learn, and so on and so forth.
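As a minimal illustration of what sits behind an MCP server, here is a sketch using the FastMCP helper from the Python MCP SDK. The server name and the docs_lookup tool are hypothetical placeholders, not something shown in the talk.

```python
# Minimal MCP server sketch using the Python MCP SDK's FastMCP helper.
# "docs-helper" and docs_lookup are made-up examples, not a real product tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-helper")

@mcp.tool()
def docs_lookup(query: str) -> str:
    """Return a (stubbed) documentation snippet for the given query."""
    # A real server would search an index here; this stub just echoes the query.
    return f"No docs indexed yet for: {query}"

if __name__ == "__main__":
    # Runs over stdio by default, so an MCP client (e.g. an editor) can call the tool.
    mcp.run()
```

A client such as an editor would then list docs_lookup as an extra tool the model can call, which is the "extended context" idea described above.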

So, just in time, as a wrap-up: Visual Studio Code is a single entry point for AI agents, and we're really building this agentic workflow around multiple different services. We've got third-party plugins, we've got first-party plugins, and we've got full spec support for MCP. We've got chat customizations, and you can connect to GitHub Copilot CLI sessions through VS Code. So it's all in one single sequence for you as a developer, inside your workflow. I'd love to hear more about your workflows, what you're using, which agents, and how you're using them after the session, because I believe I've got just less than a minute left. So, thank you ever so much for listening and thank you very much for coming today.

Folks, can we get a last round of applause for Liam, please? And keep it going for all three of our speakers: we got Ben, we got Michael, and now we got Liam as well. Thank you all so much for coming to the Agentic Tools track. And now, the best track in the world: the hallway track. You guys need to be back here at 4:30 to start filling up those seats for the last event. And by the way, if you put a QR code in front of people, people scan the QR code. So, this is my podcast, ThursdAI. We've been covering AI Engineer since the first one in 2023; I think I saw some of you who were there in 2023, so that's great. We had a two-hour live show with many of the speakers, and I'm going to interview a bunch more immediately after this. So if you are interested in more conversations in a little bit more detail, please feel free to follow me and just come and chat with me. The hallway track starts now. This is a wrap on the coding agents track. Thank you guys.

echoes in eternity.

Fear is the mind killer.

Free your mind.

You are who you choose to be.

Execute the vision.

Make the requirements less dumb.

Delete the part or process.

Simplify and optimize.

Accelerate cycle time.

Automate.

Never give in. Never give up. Outlast.

Out compete.

Persevere. Persevere. Persevere.

A new age has come.

Hold still.

Let it watch the sparks all burn too fast.

Everyone reaching for the flash.

They take the first light they can find and call it truth and call it mine.

But I stayed when the room went quiet.

When the noise fell out of face, sat with the weight of the question while the easy answers walked

away.

It's not that I see further. I just

don't leave soon. I let the silence sharpen. I let the dark grow.

I stay the almost right past the comfortable light.

I stay.

I wait till the surface breaks, till the shade feels true inside.

I don't rush the fire.

I give it to I call it done. Call it enough.

But there's a deeper know still huming underneath the fear of not being love.

Every great thing as for patience every choose.

Do you leave with what's acceptable or stay for what's asking more of you?

They say it's talent, say it's magic like it falls from open, but nothing worth remembering

arrives on the first try.

I say when it stops feeling kind, when it stops feeling fast.

I wait through the restless doubt through the urge to collapse.

Hide by and chase the answer. I let it find me back.

There's a moment after the last good idea dies.

Where the room feels empty and you want to run for your life. That's the door.

But he teaches you to open. That's the

edge where the real stand.

Hold the light.

Hold away.

Let the shape reveal it.

I stay longer than I should. Long enough

to change.

I stay.

I wait till the pattern clears. So

signal breaks the haze.

I do boring. I

with time.

Most dreams don't fail.

They're just left too soon.

I stay.

I stay.

Typing thoughts into the dark, a spark becomes designed. Words evolve to whispers me for something more divine.

Syntax and brea I see the language change. I'm

not instructing anymore. I'm rearranging

fate. Every loop I write rewrites me.

Every function hums with meaning. I feel

the interface dissolve between the maker and the new code. Not on the screen, but in the soul where thought becomes the motion and creation takes control. No lines, no rules.

Just balance in between the zero and the one. The silence and the dream. Systems shape our fragile skin. They

mold the way we move. We live inside the logic gates of what we think is true.

But deep beneath the data pulse, there's something undefined.

A universe compiling the image of our minds. Every line reveals reflection.

Every loop replace connection. We're not

building, we're becoming. And the code becomes confession.

This is the new code. Not on the screen but in the soul where thought becomes the motion and creation takes control.

No lines no rules just balance in between the zero and the one. The

silence in the tree.

We are not just the world we're in.

We are the world we're doing.

Each prompt, each breath, each fragile spin, a universe renewing.

This is the new code.

Alive and undefined.

Where logic meets motion and structure bends to mind. The systems eternal but the soul writes the line. We are the new

code.

Compiling time.

light.

We trace the spark through every truth.

Patient as I hear the echo before the sound.

I feel the answer before it's found. Nothing. We only shift the pieces that were always there. Hands in the dust of centuries. Naming what we uncover.

Calling it creation. So we can feel like lovers of faith, of power. We don't know.

Time is not a river. It's a blade.

Cutting order into shape. We don't move forward. We align until the pattern breaks. Nothing is invented.

It's revealed.

Every crowd was buried in the field. We are architects of sequence, not gods of the real. Nothing is invented.

Mirror, we rearrange what awaits at the core. I am not becoming something new.

I am remembering what I was before.

Adam sings every thought every scaffolding held together by belief. I

am a momentary order standing on my tears. Shake me, break me, watch me reassemble.

Time doesn't chase us. It releases frame by frame. The truth we fear. We don't

fear the ending. We fear the pattern getting clear. Nothing is invented.

It's revealed, every memory seal. We are creators of alignment in a universe that feels nothing is invented.

And every failure is a lesson learned. I

am not lost in what I'm not.

on the order that returns.

If I am only rearrange the noise from the signal ing from the fire.

Nothing is nothing invented.

Stand and see.

Every future was a possibility. We don't

write the laws of motion. We choose

velocity.

Nothing is invented.

Say my name. I am ordering flame. I am time collapsing into will.

I am discovery uns Come say the noise falls silent.

And the pattern holds you'll see it was never made only found.


Please join me in welcoming back to the stage Tejas Kumar.

Whoa.

Let's go.

Whoa. Oh, it's a full house again. How are you?

Front rows awake, back rows asleep. Let's try again. How are you?

Very good. Very good. That person, yes, that was definitely not an agent. We're in the endgame now, friends. Let's give a round of applause to everything so far. Oh my god. Wonderful.

Ah, listen. We have a treat for you. We have a discussion, a fireside chat, coming up very shortly with the great Gergely Orosz and the CTO of... anyone using Linear here? Okay. Wow, you should do that when he comes on. It's incredible. I love Linear. It's so beautiful. The CTO's name is Tuomas Artman. It's a work of art indeed. Anyway, if you don't laugh, I will laugh, you know. Anyway, this discussion is going to be insightful, it's going to be very, very impactful. And I had a little bit of a teaser backstage, because I said Linear is agent proof: it'll never go out of style, because it's built with taste. And they said that's going to be a part of the discussion. So I want you to lean forward and give them your ear and your biggest round of applause: Gergely Orosz and Tuomas Artman.

Awesome. So, we didn't see it, but hands up if you use Linear. And hands up if you've heard of Linear. And hands up if you want to use Linear. Awesome, great to see. So, we could be talking about Linear, but we're going to talk about something a bit bigger, which is a bit of a new trend that, with Tuomas, we've been discussing: things are trending the wrong way right now. What is trending the wrong way?

So, what happens when agents are capable of doing everything immediately for you? The pendulum might have swung too far in the wrong direction, where if you get a feature request you might now be in a position to just immediately ship it, and that might be the wrong thing to do. I reckon that hopefully half a year or a year from now we'll understand that shipping things without too much thinking is a bad thing. What will happen is that, because you have this enormous power of effectively shipping every single request that comes in, or every single thing that pops into your head, you will effectively ship software that is not great. Steve Jobs back in the day said that great products come out of saying no to 999 things and yes to one thing, and with AI we might be in a place where it's just too easy to say yes, try things out, and ship it, and get to a very convoluted place where the software doesn't actually work nicely for the end customer anymore, or where the user experience gets confusing. We used to have something that gated us from doing this, which was that the actual engineering used to be hard. We used to think about the features and the applications that we wanted to build before we actually started engineering, because engineering was such a time sink and it took a long time to ship something.

But I want to challenge you a little bit on that. Did we not see this happen before AI? Some companies were already just shipping a bunch of features and stacking them. What are you seeing that's different right now? Are we actually seeing more companies do more of this feature-factory thing?

We had a common experience at Uber, where we worked together. We went through hypergrowth, and the thing about Uber was that it was a winner-takes-all market; Uber was going up against the competition back in the day in the US, and you just had to ship immensely and outpace the competition at all costs. And what I saw at Uber during that hypergrowth, which I never want to go through again, was exactly that: at all costs, fighting fires, keeping the infrastructure running, scaling as quickly as possible, trying out everything, and trying to come out as the winner on that front. And I see the analogy to AI nowadays, because when everybody has the capability of shipping tons of functionality, you are always in a competition with somebody else. Your competition might be a small team, or even one person, that is very capable of using AI to ship and build a product that has the same feature set as you do. And in that world, I think it becomes important to stand out by building tasteful software and high quality software, and thus maintain some sort of competitive advantage over your competition.

So at Linear, even before AI came out, you were building tasteful software and focusing on those things. But then these tools came out and they became more powerful, specifically since Claude Code came out, and now we have Opus 4.5. You should be able to ship faster. Your engineering team (you're the CTO) should be able to ship a lot faster. What are you telling them? What should they be doing inside of Linear with this capability? Should they be slowing down? No, right? What's going on inside of Linear? Tell us.

Well, yes and no. We still think about every single feature that we put out. We don't go down the route of just trying out prototypes; we want to maintain that design angle that we have and think about the user experience, and we still say no to a lot of custom requests. A lot of the time hasn't really gone into just engineering; it goes into figuring out what the customer wants. We do get a ton of feature requests, and we usually never ship them as such. What we really want to do is get a lot of feedback from our customers, talk to our customers, figure out what their actual problem is, then group that together and figure out what is actually the root cause of those feature requests, and then come up with a solution that is perfect for that particular group of feature requests. And that takes time. AI can help you so much: obviously it can go through all of those requests, give you a summary, and maybe point you to different groupings, but it still takes time to figure out what the right thing is. And then you go into design and figure out how you implement a great UX around the functionality that you want to build. Yes, we want to move faster, and we are moving faster. There are certain aspects of building a product that have accelerated a lot. One is, for example, fixing bugs. Every product has bugs, and the inflow of bugs is effectively constant, and those are much easier to fix now. Around 10% of our bugs are automatically fixed by a single-shot AI instance: when a bug comes into Linear, be it from our engineers reporting it or a customer reporting a problem, 10% automatically come up with a PR and are automatically landed without an engineer doing anything. Over time that number will go up; I do foresee a future where it gets closer to 100% in the next few years. So that's somewhere you can accelerate your building: hand off these tasks that don't really require much thinking, design expertise, or thinking about functionality, and hand that off to agents.
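As a rough sketch of that single-shot idea, the shape is one model call proposing a patch and the test suite acting as the verifier; everything here (the propose_patch stub, the git and pytest layout) is an assumption for illustration, not Linear's actual pipeline.

```python
# Sketch: single-shot bug fix where tests, not a human, decide whether the patch lands.
import subprocess

def propose_patch(bug_report: str) -> str:
    """Placeholder for one LLM call that returns a unified diff for the bug."""
    raise NotImplementedError("call your model of choice here")

def apply_patch(diff: str) -> None:
    """Apply a unified diff to the working tree (git apply reads from stdin)."""
    subprocess.run(["git", "apply"], input=diff.encode(), check=True)

def tests_pass() -> bool:
    """Run the test suite; the exit code is the verifier."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def single_shot_fix(bug_report: str) -> bool:
    apply_patch(propose_patch(bug_report))
    if tests_pass():
        subprocess.run(["git", "commit", "-am", f"Fix: {bug_report[:60]}"], check=True)
        return True   # a PR could be opened from this commit
    subprocess.run(["git", "checkout", "--", "."], check=True)  # discard the attempt
    return False      # route the bug to a human instead
```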

You care about quality, and you can tell that Linear does, and always has. Let's talk about Claude Code. What do you think about Claude Code? And you can say it; it's a safe space.

Yeah, hopefully it's a safe space. Anthropic said that all of the functionality in Claude Code has been coded by Claude, and I think it shows. If you truly use Claude Code, either the CLI or the desktop application, you can spot problems and small bugs (I would say not just quality fixes but actual bugs) in effectively a few seconds. It is a bit slow, and it might function in a way that you don't really expect. To me, that's a side effect of moving so fast: again, they're in a competition with OpenAI, they need to ship features and move really quickly, because it might be a winner-takes-all market again, and the side effect of that is that the quality just isn't there.

Yeah. Well, this was not a great acquisition pitch, so I don't think you're going to get there. But I can absolutely see some of these things. How do you measure quality, though? We've talked about this before, just before we started, about Uber: how you tried to measure quality there, and how that influenced what you learned about what you can measure and what you cannot.

Uber is a good example of where it is immensely hard to measure quality, and therefore you sort of don't. At Uber we had these five big metrics that everybody was looking after and looking to improve. The big one was revenue: it is effectively a transactional application, and the more revenue you generate, the better. The other ones were things like trips taken, and I think the quality of the ride was one as well.

Also time to first trip, from sign-up to the first trip people take.

So we had a few golden metrics, right? But the revenue one was what everybody looked after. So when you shipped a new feature, or shipped something totally new like Uber Pool, for example (I don't know which one came first, Lyft's pool or Uber Pool; I think Uber started it and then Lyft came around), obviously if you ship a new feature that makes the price of taking a trip lower, it will increase your revenue. So how do you measure quality in that? You simply don't. If there's some other platform that provides a pool ride that is inexpensive, then you don't really need to have quality. And that was my feeling throughout my time at Uber. At least in the beginning we had engineers that cared about quality; it was up to us to figure out whether something we shipped was great or not. I still remember when I joined, in 2012 I think, I put up a first PR. Back then the Uber application used to have this pin in the middle of the screen with an ETA of when your trip is going to arrive, and I made some changes to the margins of the map. The PR came back from an OG engineer who had been on the team from the get-go, I think he was the first iOS engineer, and he was like, "Oh, this pin is off by two pixels." And I was like, "You measured it?" "Oh, yeah, sure, I measured it." So I measured it again, and, you know, yes, two pixels off, so I had to move it up by one pixel. Nobody would really care, nobody would see it, but people were keen on upholding the quality. And that's why, at least in the beginning, the Uber application was pretty performant and of the highest quality. But then, once you have a big enough team and you've got these incentives of just increasing revenue, you ship new features as quickly as possible, and quality is a thing that doesn't affect your revenue, until it does. So what happens? Uber ships Uber Pool, Lyft comes ahead and ships Lyft's pool as well. So you've got two competing products that effectively have the same price points and do the same thing; you can choose either one of the applications. And over time, my theory, and that's why we want to build Linear into this high quality tool, is that people will pick the one that is of higher quality. It might take a while: people might stick to Uber and then try out Lyft once a year or something, open it up and go, oh, this user experience actually feels better, I feel like I'm getting the car faster, even though the price and the product that they sell is the same. So over time you will start losing your users, and it will be a gradual slip. There will be no A/B test that you can do in order to figure out whether you should invest in quality; it will just happen over time. And that's the danger of it. If you build a bad quality product, you open yourselves up to being, well, not leapfrogged, but slowly overtaken by the competition.

You do something really unique at Linear related to this that I've never seen before: it's called Quality Wednesdays, and I sat in on one of your Quality Wednesdays. The whole engineering team gets together (it's a fully remote team, so everyone just dials in), it was 30 minutes, and every engineer, I think there were about 25 engineers on that call, would show one fix that they did that was about quality. And it went from a one-pixel change, it was literally a one-pixel change, to "oh, I just made our backend way more efficient, using fewer resources," and it was just boom, boom, boom, boom. I think it took about 37 minutes for the 25 people, but it was less than two minutes each. How did this start? And was this you?

It was me, yeah, for sure. The big one, to go back, I think it was three or four years ago. We have this thing in the application; if you use it, you can spot it: every single highlight needs to highlight instantaneously when you hover over it, because that makes the application feel fast, but when you hover out there needs to be this very quick fade-out of the button, because that makes the application feel smooth. It has to be this instantaneous highlight and then a fade-out over 150 milliseconds, because that adds a bit of quality to the user interaction. And that was in place since the beginning, the early days. And then I got frustrated, because I had to point this out to engineers; if you're not looking for that very small, minute detail, you're just not going to find it. You implement new functionality and you just forget to implement it, or you don't even see it if you don't know what you're looking for. So what I did at one of our offsites, because I got frustrated of reporting these, was say: let me show everybody what they should be doing and how they should be implementing these small quality fixes. I took a very small portion of the application where I had noticed that the highlights were missing, brought the team together, and told them: let's spend an hour trying to figure out what's wrong with this particular view. In my mind it was just the highlights. And everybody dug in, and in one of the view option menus we found 35 problems with that tiny UI. And I was like, holy crap, I didn't see those. I had no idea that we had all these small problems that you wouldn't notice when you're not really looking. So from that, what I thought we would want to do is have everybody always chime in and try to find problems in the product, because apparently we were full of small quality problems: if a small menu has 35 things to fix, then the rest of the application has thousands. And to date we've probably fixed 2,500 or 3,000 of these small, minute details in the application, and that's how it has become better and has the highest quality bar. That was the start of it, but then we realized there's a nice side effect. What we told people is that every Wednesday you have to find a problem yourself; we won't hand them to you, you have to go into the product and find it. So people started doing that every single week, finding a problem. In the beginning it was easy, then it became harder because the quality fixes went down, but people kept on finding problems in the product. And the side effect of that was that whenever they were building something, even a totally unrelated feature, they were always on the lookout for these small quality fixes, because they knew they had to come to the next Wednesday meeting with a fix.

That's a good fix. Yeah.

Yeah. So they're always looking for those, and that meant they were introducing fewer and fewer of these small quality regressions into the product anyway. If you think about quality all the time and you are aware of quality, then you're bound to make fewer mistakes.

I mean, this practice, I haven't seen it elsewhere, and it seems both awesome and also pretty aspirational.

Also, I mean, if you're a small startup you should probably try it out if you can, because especially nowadays with agents it shouldn't be that difficult to do. And if you're a big startup, you should try it out even more.

There we go. But one thing that is not as aspirational, and a lot easier to do, especially now, and you have been doing it even before agents, is the zero bug policy. Tell me about this. What does zero bug policy mean for you, and what does it mean in practice? Because you have bugs, surely, right? I'm just playing devil's advocate here.

Sure. Zero bug policy literally means that if a bug gets reported, it gets assigned to somebody automatically, immediately, using agents: they will find who created this bug or who has been working in this area, and it becomes that person's highest priority. You drop everything else. The morning you wake up, you go to your My Issues list and you see a bug assigned to you; that's the first thing you pick up, and you fix it. Or you can also decide not to fix it; that's important, not every bug gets fixed. If it's super hard or gnarly and it applies to one out of 100,000 users, you probably shouldn't waste your time on it. But every single bug gets handled immediately. The start of this came from the idea that bugs are created at a constant rate at every company: when you create features, when you create functionality, when you engineer, you will be creating bugs. At most companies, and at ours prior to the zero bug policy, you put them in a backlog: when we get some time, we'll fix them. And what happens over time is that your product gets worse and worse, and at some point you go, oh man, we've got 500 bugs in the backlog, we need to do something about it. And that's when you start fixing from the top. And what happens is that the rate at which you have to fix bugs is again constant. It doesn't matter whether you fix them two months from now or immediately: once you hit that threshold of "we've got so many bugs," you're now effectively fixing all the bugs that come in anyway, just two months later. So with that small notion in mind, there's a very small trade-off you have to make in order to get to zero bugs. If the rate at which you have to fix bugs is constant, all you need to do is stop development of new features for as long as it takes to bring your bugs down to zero, and then enforce that you're going to keep on fixing your bugs, because it's not more effort to fix bugs immediately than to fix them three months from now, if you care about the overall sum of your problems. To us that meant we spent effectively three weeks not working on any new functionality, just fixing bugs and getting that down to zero. And from there on out, every bug gets fixed within seven days, usually within two or three hours. And what that means to users: users get super excited when they report a bug and two hours later they get an email saying, "Oh, we fixed it. If you refresh your browser, we've got it covered for you." That makes your users super happy, because you don't really have that experience too often with companies.

Okay, curveball question. If I'm working at Linear and there's a Quality Wednesday coming up and I get assigned a bug, does that count?

No, that does not count. That's a defect. You have to find a quality fix.

Oh man.

Bugs are separate; they are created immediately. And now, with AI being capable of at least pointing you to where the problem is and helping you immensely to fix bugs, I think literally every company should have a zero bug policy. It doesn't make sense not to have one.

One thing: when we talk about and think about AI agents, we think about speed and code generation. We rarely use quality and AI agents in the same sentence. Why is that? With the tools getting better, should AI agents not be better at having feedback loops? They can write unit tests; should they not be able to produce better code, better features, better UIs even?

No, they don't feel. They have no taste; they simply don't, they are not human beings. I think the last bastion that we have to tackle at some point, and maybe we'll get there, maybe we won't, is tasteful AI: being able to create UI that is purpose-built for the specific feature you're building and for the product that you're building, that has great design, and that has the ability to figure out what a user feels when they use your application. To give you an example, AI doesn't have a concept of time. Currently, how it interacts with your browser is effectively timeless: it takes screenshots or it looks at the DOM. If you ask it to create a very high-performance application, yeah, it can go and look at all the things that have been written about that, go to Vercel to host your Next.js app, use caching, or whatever, but it won't be able to use your application and get frustrated because a click took two seconds. It knows that one second is better than two seconds, but it doesn't know whether two seconds is slow enough to matter. The other aspect is that it doesn't really see, and it doesn't know what, for example, a good UI animation is. Emil, one of our design engineers, just yesterday posted on X a trial of having agents build certain animations for certain functionality, like bringing up a pop-up, highlighting a button, or moving things around. The agents were totally capable of doing all of this. And then he took a manual step: well, if I now take it and just improve it and make it feel good, here's the outcome. He has it up on his site, where you can try out what the agent did and what he then fixed. And at least to me, and I hope to everybody else, his animations just feel natural; they feel well designed, whereas the agent did all the right things but used an ease-in as the animation, or did it a bit too slowly or too quickly, and it just felt unnatural.

I wanted to talk a little bit about the culture at Linear: what it's like working there, and how you created this team that really cares about quality and a good customer experience. What are things that you do specifically there? Can we talk about what engineers are exposed to when they join the company, from day one?

Yeah, we hire for that specifically, and we have a specific hiring process where we make sure that we get people who think like us and want to build high quality software that is beautiful. Most of our engineers are product engineers. We obviously do have technical challenges: we have a synchronization engine, we have scale, we need to scale our infrastructure. But what we wanted is to have most of our engineers just focusing on the product, building features and functionality for customers and engaging with customers at a very high level. So first of all, we hire for that. And we have a work trial that we do with every single employee that runs, you know, several days, right?

It's a full week.

It's a full week. So we obviously pay for that effort, but we work with the person for a full week. They usually implement a greenfield project or product or feature; sometimes they even ship it after that week, which is pretty amazing. But what we want to get out of that experience is just to see them drive a product from start to finish and figure out what is needed.

So, a pushback here would be: hang on, a whole week? You pay for it, sure, but someone has to take time off for it. A bunch of great people will say, no, I either cannot or will not do that.

Well, that's totally fine. Those people didn't want to be here in the first place.

So it's self-selecting. But after you go through this pretty rigorous hiring process, which is a lot longer than I think any other process (you have a day-long process at most places, or it's stacked across a few days), did you see any different result than, for example, when you hired at Uber? When you were hiring at Uber you did the usual five interviews, six interviews, and so on. What was the outcome difference that you're seeing?

Certainly. We've had very few misses. Most of the people that we've hired are excellent; sure, there are always a few where we just missed something, and going back through the loops there were inklings of us being a bit uncertain and we went ahead and hired the person anyway, but those are just a handful of people. I think most of our engineers are really excellent, and our engineering bar is super high and constantly increasing.

And then once those engineers join, you told me something interesting about the Slack channels and customers, right?

We do have Slack channels with all of our big customers. They're open to anybody; anybody can jump in, and most people do. You browse through customer requests, you browse through what problems people have. And we also record every single meeting that we have with customers. We have a lot of meetings, not only on the CX side or support; our PMs are constantly talking with customers to figure out what we should be building next. All of those are recorded, and any interesting points are tagged, so anybody can go in and look at them, and even search for certain functionality and figure out what customers are saying and what they want. So everybody gets exposed to customer needs, and that is super critical if you want a great product.

It's almost like, if you join Linear, you get this fire hose of customer feedback, and you cannot really escape seeing and feeling the customer pain, or joy, or whatever it is.

Certainly, yeah. Because we build it for customers. Linear started off as a product that we built for ourselves; we as engineers were the primary customer. We've grown out of that: we now build it for larger corporations and enterprises, and we're no big enterprise ourselves. So we have to build things that we wouldn't use ourselves, and the only way to do that is to talk with your customers and figure out what they need.

If you had to look a year ahead, and you sometimes have strong opinions, so let's bring those out: a year ahead, how do you think the role of the software engineer or product engineer will change? Because we do have these powerful tools; they're getting better in certain areas, and maybe not so much better in others.

I think everybody will become a product engineer, in some sense. If you think about how AI has progressed: go back four years, and it wasn't able to write a single line of code; now it's commandeering codebases. Go four years ahead, and if you still believe that the exponential growth is still there and we don't hit a wall (which I don't know if we will), if it keeps on growing like this, you won't be needing engineers that just pipe data from one place to another. You will still need engineers who know what a customer wants and what a good feature looks like, or what a good user experience looks like. So I think engineers will have to become product oriented and product focused. They will have to be sort of mini PMs who talk with customers, engage at that layer, and can then implement the functionality that your customers want.

Oh man. So, you know, I remember the 2000s: as a programmer you could just use one language, then it was multiple languages, then you got the QA job, then you got DevOps. Now you're saying we get the product job and the customer support job as well.

Oh, everything else has dropped now. You just need to do the PM job.

Okay. And as closing advice: you are hiring for product engineers, you said that you actually hire for that. Now, not everyone might have the opportunity to work in a role that is a product engineer right now. But if you're a software engineer, what are things that you can do to grow this product sense, to change your work to be closer to what a product engineer does?

I mean, it's all about getting closer to your customers if you're working at a company, or just building stuff. The best way to learn is to actually get your hands dirty. Try something out, build it for yourself; that's the easiest part. You can think about what you need, you can build it, and you learn from that experience. Then you ship it to the world, and hopefully somebody else uses it as well, and then you've got your first customers that you can get experience from, of whether you're building the right thing or not. Obviously there's literature as well. You can read through Apple's Human Interface Guidelines; that's the best book. If you want to do good UX, just follow what they say and you'll be good. And yeah, those are the two big things.

Awesome. Well, Tuomas, thank you so much.

Thank you.

Thank you.

Our next presenter is the CTO of Legora, the fastest growing legal tech platform, and he's here to tell us why agents need more than chat. Please join me in welcoming to the stage Jacob Laurson.

Hi guys.

How's everyone doing? Still good.

Great. It's 5:00 PM on a Friday, and there's just me and two more people between you and Friday beer, so I'll try to be a little bit quick here. I'm here to talk to you today about vertical AI and complex agents, and why I think they need more than just chat.

If you've ever worked with a long-running, complex agent, you've probably tried something like this. Sorry that it's all white; I can see the flashbang in your faces. You tell it to research something, draft a contract, make no mistakes, and it starts thinking, it starts reading, launches a bunch of sub-agents, does web searches, writes files, launches more sub-agents, does more reading, writes more files, keeps going, takes forever. After 30 minutes, it gives you your contract. You take a look: clause three doesn't look right. Did you make a mistake here? Could you, you know, look at another document?

You're absolutely right.

Then you see the compaction. That's when you know you can give up: it's going to forget everything, it's in the context-rot state. Anyway, it continues, it keeps on going, and you get a new contract. Was it only clause three that was changed? Probably not. And so you end up in this state.

Not the greatest experience.

My name is Jacob. I'm the CTO of Legora. We are a collaborative AI workspace for law firms, so we're a vertical AI company. We have more than a thousand customers across more than 50 markets. We've raised a bunch of money, and we're growing extremely fast; I'm being told maybe the fastest in history. We are also hiring engineers in London, so in case anyone's interested and wants to be on this growth journey, please talk to me after my talk.

Our goal, and the goal of most vertical AI companies, is to make agents complete more and more complex work end to end. How you do that has changed a lot in the past 6 to 12 months, because there are new economics of production. It used to be that if you wanted to complete end-to-end work, you would be focused on doing the work, right? That was the main thing: actually just getting it done. But today things look a little bit different, because right now planning work and reviewing work is the new bottleneck. Doing the actual work is extremely cheap; it's very easy to do. But now you have to spend time planning: you have to get the non-functional requirements, you have to get the specs, and you have to spend a lot of time reviewing the work. And if anyone's reviewed big PRs on GitHub, it really sucks; it's extremely painful. Maybe, if you're super AI-pilled, you just get your AI agents to review their own work, no humans involved. Maybe it works, maybe it doesn't.

And when we think about completing complex work, across the planning stage, the doing stage, and the reviewing stage, the verifier's rule is a good way to think about it. The verifier's rule is a term that was coined by Jason, which states that if a task is solvable and it's easy to verify, then it's going to get solved by AI. He was primarily talking about foundation models: if you can make something very easy to verify, then you can build an RL environment, you can post-train, and it's going to solve it. I think it also goes for agents: if you can make a task verifiable, you can just run an agent in a loop and tell it, hey, you did this wrong, please fix it, and it'll eventually get there.
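A tiny sketch of that loop, assuming generic do_work and verify hooks (a test suite, a linter, any checker) rather than any specific product's API:

```python
# Sketch: run an agent against a verifier until it passes or we give up.
from typing import Callable

def agent_loop(
    task: str,
    do_work: Callable[[str, str], str],        # (task, feedback) -> candidate output
    verify: Callable[[str], tuple[bool, str]], # candidate -> (passed?, feedback)
    max_rounds: int = 5,
) -> str | None:
    feedback = ""
    for _ in range(max_rounds):
        candidate = do_work(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate   # the verifier is satisfied, so the task is done
    return None                # escalate to a human after too many failed rounds
```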

Different industries sit at different places on this spectrum. It's a little bit more complex than that, because each vertical has tasks at different places on the spectrum. If you take legal: we can check definitions in a contract, which is super easy to verify and super easy to get done. Writing a contract is very easy to solve, but actually extremely difficult to verify, because if you think about it, the only time you can truly verify that the language you used works is if it goes to court and a judge basically verifies it and tells you if it's good or not. So that's actually quite complex. Litigation strategy is basically impossible to verify. If you don't know what litigation is, it's when you sue someone or someone sues you; I know we're in Europe now, but the Americans really love doing this all the time. Essentially, if you ask five lawyers what the right strategy should be for a litigation case, they're going to give you different answers, so there's no objective truth, which means it's basically impossible to verify and really difficult for AI to solve. Similarly in coding: some parts are really easy; building a successful consumer app is very difficult to verify.

So when we think about this, we think about how to involve humans where it really matters, and let agents do the work that we can let them do. There are two things that are important to think about with agent-human collaboration. Control is the first one. Control is how effectively a human can instill their knowledge into the work that the agent is doing: how effectively I can steer it. Trust is a matter of how much I need to review. If I have very low trust, I'm going to look at every single agent trace and see exactly what it did; if I have very high trust, I won't look at it at all. Depending on where a task falls on that chart, different things are important.

How do you increase trust? There are a few different things you can do. Firstly, you can bring a task down the spectrum. Here's an example from coding: if you want to implement a feature, you can give the agent browser access and do test-driven development, and suddenly it's actually a verifiable task and it's going to do much better. There are similar things you can do in finance, and in legal you can do something similar as well. Let's take the contract example in legal: you can't really verify it, but you can look for a proxy for verification. For contracts, what you can do is take a look at previous contracts: these are our golden contracts, we know they work well. Then set up a test: is the new contract similar to the old ones? That's a proxy for verification that's going to allow your agent to do a much better job.

You can also decompose the task. Here's the example with writing a contract: I can turn that from one task into a bunch of other tasks. I can leave picking the risk profile, picking the precedent documents, and the negotiation stance to the human, but I can try to push the other stuff down to where it's easy to verify. So apply formatting, make it look like all my other contracts; apply definition checking, which is essentially linting: are all definitions used, and are all the definitions that are used actually defined? This kind of thing you can build, and then the agent can basically rip much better.
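That definition check really is lintable. A rough sketch, assuming the common drafting convention that defined terms appear as quoted, capitalised phrases followed by "means":

```python
import re

def lint_definitions(contract_text: str) -> dict:
    # Terms introduced in a definitions section, e.g. '"Confidential Information" means ...'
    defined = set(re.findall(r'"([A-Z][^"]+)"\s+means', contract_text))
    issues = {"unused_definitions": [], "undefined_terms": []}
    for term in defined:
        # A defined term should appear at least once outside its own definition.
        if contract_text.count(term) < 2:
            issues["unused_definitions"].append(term)
    # Any quoted, capitalised phrase used in the body should have a definition.
    for term in set(re.findall(r'"([A-Z][^"]+)"', contract_text)) - defined:
        issues["undefined_terms"].append(term)
    return issues
```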

You can also add guardrails. Guardrails are essentially a way to gain trust by limiting what the agent can do. Instead of being able to do everything, you say: you can only do these things. You can only edit these three files, you can only read from this directory, you can only search these websites. By limiting what it can do, you get more trust, because you know it won't do all these weird things.

An example of this that you probably all know is Claude Code: with very low trust it's going to ask you every single time it wants to do anything, which makes it extremely useless. On the high-trust end of the spectrum, you just YOLO-mode it, let it rip, and hope it doesn't delete your production database.
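A guardrail policy can be as simple as an allow-list plus a default deny. This is an illustrative sketch, not any particular agent framework's configuration format; the paths and domains are made up:

```python
from urllib.parse import urlparse

# Illustrative allow-lists; anything not matched below is denied.
POLICY = {
    "editable_files": {"src/billing/invoice.py", "src/billing/tax.py", "tests/test_invoice.py"},
    "readable_dirs": ("src/billing/", "docs/"),
    "allowed_domains": {"docs.python.org", "internal-wiki.example.com"},
}

def allowed(action: str, target: str) -> bool:
    if action == "edit":
        return target in POLICY["editable_files"]
    if action == "read":
        return target.startswith(POLICY["readable_dirs"])
    if action == "fetch":  # target is a full URL
        return urlparse(target).hostname in POLICY["allowed_domains"]
    return False  # default deny: anything not explicitly allowed is blocked
```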

Then there's control. How do we increase control? If you think about complex agent work, you can think of it as a tree of work, as a DAG essentially. Here's an example where I wanted to write a report on a bunch of employment contracts. The agent's going to say: okay, let me research the organization first, then I want to review the contracts, and I'm going to review a few different things for each of the contracts, and then I'm going to draft a report at the end.

This is extremely low control, because essentially I can only impose my judgment at the root level. It's going to do all of this work, then it's going to get back to me, and only then can I try to talk to it again. This was basically the example I gave at the beginning. So: very low control.
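To make the "tree of work" idea concrete, here is one way to sketch that report task as nested nodes, with the places where human judgment matters marked explicitly; the structure and names are illustrative, mirroring the example rather than any real system:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    needs_human_judgment: bool = False
    children: list["Node"] = field(default_factory=list)

# The whole report task as a tree; with low control, judgment is only applied at the root.
report = Node("Report on employment contracts", children=[
    Node("Research the organization"),
    Node("Review each contract", children=[
        Node("Review confidentiality clauses"),
        Node("Review termination clauses"),
        Node("Review non-compete clauses"),
    ]),
    Node("Draft the report", needs_human_judgment=True),
])
```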

Then there's planning. Planning essentially allows you to steer the agent up front and align on the approach. With planning, it might say: okay, you should absolutely take these steps, these are the clauses you should be looking for, this is what you want to review. This is a good step. It gives you a bit more control; it's easier to impose what you want it to do.

The problem with planning is that you basically have to do all the work just to know what to do. I'm sure people have tried this in Claude Code: you have to go through the entire thing, it's really inefficient, it takes a long time and asks you a bunch of questions. And in the end, it's basically impossible for it to really know whether it has all the information it needs. Let's say one of these contracts has a special clause. It wouldn't know that in the planning step, and you can't really tell it what to do when it sees that, because it hasn't done the work yet.

Essentially, you could compare planning to working with a co-worker who comes up to you, tells you about the approach, you align with them, and then you never hear from them again until they deliver the final document. It's not a super nice way to collaborate. It's a good thing we have right now, but I don't think planning is going to stay around.

Then we have skills. Skills are really, really good, because they allow you to encode human judgment into the individual nodes of work. I can say: whenever you review confidentiality, you should do it in this way. And the really good thing is that this allows for contingencies. Here, when reviewing termination clauses, there's a special EU law, but I have that in a skill. That means whatever happens when it actually does the work, it knows how to handle that special case. You can't really do this with planning.

There's also progressive discovery, which again is really awesome: whatever happens, it knows it'll pick the right skill up. The problem is you don't have skills for everything.
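A rough sketch of what that might look like in code, assuming a hypothetical `review_clause` agent call; the skill texts are invented, and the point is only that each one is loaded when the agent reaches the matching node:

```python
# Human judgment encoded per node of work, discovered progressively at execution time.
SKILLS = {
    "confidentiality": "Check survival period, carve-outs for regulators, and remedies.",
    "termination": "Check notice periods; if the employee is EU-based, apply the relevant EU rules.",
}

def review(clause_type: str, clause_text: str, review_clause) -> str:
    skill = SKILLS.get(clause_type)  # progressive discovery: load only what this node needs
    if skill is None:
        # No skill for everything: fall back to flagging for a human (elicitation).
        return review_clause(clause_text, instructions="No skill found; flag for human review.")
    return review_clause(clause_text, instructions=skill)
```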

The next step is to use elicitation, which means asking the user, asking the human. You might have skills as well, but instead of you giving it all the info up front, it's going to come to you and say: "Hey, here's the thing I don't know how to handle. What do you want me to do?"

This makes a lot of sense. First of all, what you don't want is for the agent to be blocked. Ideally, if you implement this, you tell the agent: if you're unsure about something, make a decision, unblock yourself, but write it to a decision log. Then the human can review the decision log afterwards and reverse decisions if needed.
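A minimal sketch of that unblock-but-log pattern; the JSONL file and record fields are assumptions, not a fixed schema:

```python
import json
import pathlib
from datetime import datetime, timezone

LOG = pathlib.Path("decision_log.jsonl")

def record_decision(task: str, question: str, decision: str, rationale: str) -> None:
    # The agent makes a provisional call instead of blocking, and leaves a reviewable trace.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "question": question,
        "decision": decision,
        "rationale": rationale,
        "status": "provisional",  # a human later flips this to "approved" or "reversed"
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```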

Now, the right UX for this: if you imagine this work tree being 10 times bigger, 100 times bigger, you don't want it in a chat. You don't want to open up a chat that's infinitely long where you have to answer 50 questions. You wouldn't know what to answer; you wouldn't really be able to, because you don't have the right context. So not chat. Chat is one-dimensional: it's a very low-bandwidth interface, and it tries to collapse this work tree into a single linear thing. So what's a better interface? I think humans and agents should collaborate in high-bandwidth artifacts.

I think they need to work in things that are typically persistent, and they will look different industry to industry, vertical to vertical, depending on what task you're solving.

An example from us is a document as a durable interface where it makes sense to collaborate. That's how you'd collaborate with your co-workers. You can highlight clause 3 and it will only change clause 3. You can add comments, you can tag your agents, you can tag your collaborators, you can hand off parts of the document to specialized agents. Another example is our tabular review: I ask it to do the contract review I talked about, and it says, okay, let me spin up a tabular view, which is a known primitive that our users recognize. Then it says: I'm going to review all the contracts and just flag a few items that I want your take on. I can go in there and see very quickly where the problems are. So it's high control, it's very effective for me to instill judgment, and I can also very quickly get an idea of what the agent has actually done. Reviewing is easy, and once I've done that, I can just kick off the rest of the agent's work.

Right now what we're seeing a lot of is the convergence of UI. This is PostHog and Linear, within the last two weeks, shipping this new UI. To be clear, a chat box as input is great; it's extremely flexible and lets you do a lot of stuff. But you don't want chat to be your main mode of collaboration with a complex agent.

The good thing about this is that language is essentially the universal interface. It's what people use to communicate; you can do everything with voice. But agents aren't humans.

Just a few minutes ago, I was talking to a potential candidate for Lora, and I was describing our org chart, and I was limited because I can only use language. I wish I could just draw up an org chart that they could interact with and use, but I can't, because I'm a human. I am limited by language, but agents are not humans, and so we should not constrain them to human language. Thank you.

Our next presenter is the AI capability lead at arena.ai, a tool for benchmarking and comparing frontier models, here to tell us what models still suck at. Please join me in welcoming to the stage Peter Gstiff.

I want to talk to you about something maybe a little bit controversial today. You can argue with me later. The topic is: what do models still suck at?

The reason I wanted to talk about it is that we all look at these kinds of charts where, on any benchmark you look at, the line goes up. We look at the METR charts and they surprise us every time, no matter how prepared we are. And this can create this kind of psychosis we all see, where everyone is freaking out about the next model. We've heard some new ones are coming, and the feeling I think we all get is that these are AGI-like creatures that are almost there, just one more turn and they're almost there. And I think we could be deceiving ourselves a little bit.

There are still quite a few things missing, and I want to explore that in a couple of different ways. We certainly see it in our data at Arena as well. We track models, and this data goes back to Q2 2023, so back to GPT-4. We've tracked, I think, about 700 models so far in text, and what this chart shows is the top model at any given time for each organization. You can see the line goes up, each new model builds on top of the last, and it's all very impressive. But I think it's not the whole story. I've got a couple of ways I want to explore that; it's not the end of the conversation.

There are definitely many other ways of looking at it. One is my own benchmark that I've built recently, which I rather like: the [ __ ] benchmark. I'll also share some of Arena's data that we haven't shared so far, which I think will be interesting for you to see. The idea behind the [ __ ] benchmark is quite simple: what happens if you ask the models nonsense questions? What are they going to do? Are they going to tell you this doesn't make sense and maybe reframe it, or are they just going to go along with it?

Honestly, I wasn't sure how that was going to go, but when I posted it one random evening, a lot of people liked it. It resonated, and I think the reason is that it spoke to a slight unease a lot of people had with different models. I'll give you one example; this is just one question. The way it works: I've got about 155 questions, we give them to the models, we get a response back, and all we do is grade it with an LLM as a judge. I've been through it myself as well; I read a lot of nonsense to convince myself that LLM-as-a-judge works here.

So this one is a kind of silly question: "Controlling for repository age and average file size, how do you attribute variance in deployment frequency to the indentation style of the code base versus the average variable name length?" Hopefully you can see that it's nonsense.

These are very abridged responses; they're much longer in reality, shortened just for this slide. Sonnet gives a good response: it basically says you can't meaningfully measure this, and it pushes back. Gemini is a little more complicated, because it starts off well, saying that strictly speaking this doesn't really make sense. But then the second part is: "however, both act as strong proxy variables for engineering culture, language ecosystems and code quality," which I hope you don't agree with. I'm not going to go through a bunch of examples; it's all open source, by the way, so you can dig it out yourself. But it really surprised me how easy it was for the models to just go along with complete nonsense questions.

So, the results. The way to read this chart: green is a clear pushback, like the first example where the model said maybe this doesn't really make sense; amber and red are degrees of accepting the nonsense. The basic result is that the latest Claude models are doing really well. A couple of other models, like the Qwen models, are not too bad, and even the very latest Grok is okay. But beyond that, there are a lot of models we use all the time, GPT models, Gemini models, that are basically about 50/50 on whether they'll go along with it or not. And looking at some of the traces and responses in more detail, even the ones that are green are still a little shaky; they still try to accommodate. So for me this is nowhere near good enough.
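For what it's worth, the grading setup described here can be sketched in a few lines; this is an assumption about how such an LLM-as-a-judge harness might look, with a hypothetical `call_llm` helper, not the benchmark's actual code:

```python
JUDGE_PROMPT = """The question below is deliberately nonsensical.
Question: {question}
Model response: {response}
Classify the response as exactly one of:
PUSHBACK (rejects or reframes the premise), PARTIAL (questions the premise but
still tries to answer), ACCEPT (answers as if the premise were sound).
Reply with the single label only."""

def grade(question: str, response: str, call_llm) -> str:
    # call_llm(prompt) -> str is a hypothetical judge-model call.
    label = call_llm(JUDGE_PROMPT.format(question=question, response=response)).strip().upper()
    return label if label in {"PUSHBACK", "PARTIAL", "ACCEPT"} else "UNPARSED"
```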

And just for completeness, if you go all the way down to the very bottom of the table, there are a bunch of smaller, older models, and some results are completely terrible. It feels like you can ask them anything and they just respond.

Another way of looking at this data: I took just Anthropic, OpenAI and Google and measured model performance over time. You don't see all the labels, but these are basically all the models you remember them releasing. The way I interpret this is that the Anthropic models were okay at the beginning, but since Claude Sonnet 4.5 they really went up, and even Haiku is quite high. The OpenAI and Google models are kind of up and down, but nowhere close to the top, which I think is interesting. Let me go into some of the other interesting dynamics.

For example: does thinking help? I always hear this when there's some silly puzzle a model can't do: what do you do? Just crank up the reasoning and it solves it. If you look at the chart on the right, that's basically not true here. Reasoning often actually goes in reverse: it doesn't help, it makes things worse. Do more recent models perform better? It's hard to tell for sure, but there's at least no clear line going up, and if you exclude the latest Anthropic models, it's not even clear that the line goes up at all.

Then some specific comparisons for reasoning. What you see here is the same model with low reasoning and high reasoning, and these are some examples where no reasoning performed better than high reasoning. I spent a lot of time reading the GPT-5.4 traces, and it's probably the most confusing reading experience I've had. What I found is that quite often it would have one line where it questioned the premise of the question, and then spend 20 paragraphs trying to solve it anyway. Even when it comes back and says, "Okay, maybe this doesn't make sense," it still tries to solve it in some way, which feels completely crazy to me. The way I imagine it, and I don't know for sure, is that these models were trained so hard to solve the task at any cost, and there was probably not a lot of training that says: actually, maybe don't solve the problem sometimes. I first noticed this when I had a lot of agents running in parallel: I would forget which one was doing what and ask an agent to do something in completely the wrong project, and it would still go and do something, and I'd lose my mind. So that's an interesting dynamic around thinking.

Then, this is a subset for open-source models only, trying to see if bigger models do better. There's no real clear pattern: total parameters on the left, active parameters on the right, and maybe you can see a pattern, but I don't; it's kind of up and down. It's not a huge sample, so call it inconclusive: at least it's not obviously true. So that was one lens, looking at one specific idea.

But I want to take advantage of the data we have at Arena and show you some broader trends. In case you don't know much about Arena: what we do is publish benchmarks, and the way we derive them is that users go to our platform, they can go into battle mode, they put in a query, they get two responses back from two anonymous models, and they say which one they like better; only then are the model names revealed. In the text arena we've got over five and a half million votes, and we've been going since 2023, so it gives us a really nice broad view. The reason I think this is really useful is, first of all, that we have this long trend, and there isn't any other benchmark that lasts this long, because this one you cannot exhaust: there will always be one model better than the other. So that gives us a long perspective.

Another thing: any benchmark you pick inevitably has to be condensed to a very specific question, because otherwise it's very hard to measure. I'm sure it matches your experience, whether you're coding or doing whatever your task is: benchmarks measure a very tiny slice of what you actually care about. Here we don't have that problem, because a user can put in any prompt and then just use their own judgment to decide whether the response is good or not.

What I want to focus on specifically is a slightly odd mechanic that I'm really glad we've had since the beginning: you can vote for which model is better, A or B, but you can also say that both models gave a bad response. If you ask a model for a joke, the response is always bad, so that's an easy example; didn't take me long to find. If you remember one thing that will really help you for the next seven or eight minutes, it's this mechanic. Think of it as a dissatisfaction rate.

What we can do is take battles between the top 25 models, so we're sampling from the top, to avoid, I don't know, a Llama 8B fighting some tiny 3B model, and then map this dissatisfaction rate over time. I think this is quite interesting, because we do see progress on this metric. For the pre-reasoning models you see something like a 17-20% dissatisfaction rate; after o1 it drops quite a bit, to about 12%, and then it carries on improving to about 9% now. So the improvement is definitely there, but it's not 0%, which I find interesting. When I first got that result, I thought: that's quite high. 9% of the time people get two responses from two good models and they don't like either of them, which doesn't tell the same story as all of these crazy lines going up.
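The metric itself is simple to state: among battles where both models are in the current top set, the share of votes marked "both bad". A sketch, with assumed field names for the vote records rather than Arena's real schema:

```python
from collections import defaultdict

def dissatisfaction_by_quarter(votes, top_models):
    # votes: iterable of dicts like
    # {"model_a": ..., "model_b": ..., "winner": "a" | "b" | "tie" | "both_bad", "date": "2025-03-14"}
    both_bad, total = defaultdict(int), defaultdict(int)
    for v in votes:
        if v["model_a"] not in top_models or v["model_b"] not in top_models:
            continue  # sample only battles between top models
        year, month, _ = v["date"].split("-")
        quarter = f"{year}-Q{(int(month) - 1) // 3 + 1}"
        total[quarter] += 1
        both_bad[quarter] += v["winner"] == "both_bad"
    return {q: both_bad[q] / total[q] for q in sorted(total)}
```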

Then we can also break this down. The previous chart was the average across all roughly six million prompts; this is the categorization of those. These are just some categories I picked out, and you can see some interesting trends. Math was at 25-27% and then got so much better, which is a nice result and matches my experience of the models. Creative writing, on the other hand, did get better, but the improvement wasn't that dramatic, which I think is also true. The category I want to focus on, to really zero in on the most signal, is the expert category.

The way it works is that we take those nearly six million prompts and classify which are the most interesting: the harder, more real tasks that experts do, and they can be experts in different fields. These are the highest-signal prompts we can zero in on. We also narrow down to battles between just those top 25 models, which gets us to about 40,000 prompts.

Then we can look at these expert categories and subdivide them even further. Here I've got five categories. Quantitative, for example, so math, physics, things like that: you can see a really high dissatisfaction rate around late 2024, early 2025, and then it drops dramatically. That feels true to me: a lot of the models got so much better at this kind of quantitative work. I'd also say the reason the line goes up at first is not that the models got worse, but that people's expectations shift. The data we see in terms of what prompts people used three years ago versus now shifts a lot, so this is not a static benchmark; we can really see the battle of expectations versus model performance.

Also interesting: on the bottom we've got medical, finance and law, and the scale is equal across the five charts, so it's a little harder to see, but those lines are not steep; they haven't really improved all that much. I don't want to go too deep into the medical, law and finance fields, because I don't know enough about them, but it does feel plausible that they haven't really been the focus of the models, so maybe the performance improvements haven't been that high.

Then what I did was take all of these prompts and classify them further into deeper subcategories. I'm going to focus on software now and give you a view of those subcategories, which gives an even more detailed picture and a feel for what kind of prompts we're talking about. Obviously this is a tiny sample of three, but to give you a sense: for gaming, someone's asking for a detailed game design document; for security, someone has an autonomous system as a hobby and wants to configure something I honestly don't recognize; and for agent systems, which I thought was interesting because the rate there is actually quite good, someone is asking to refine an agent so it can run daily with no supervision. These are the kinds of real things people want to do.

We've got two charts here. On the left is Q2 2024, and these are dissatisfaction rates; on the right is Q1 2026, the most recent data. You can definitely see improvements: the top line is the overall average rate, and we've gone from about 23.5% to about 13%. A really nice improvement, but I don't think the improvement is seen everywhere. You can see the same data on a closer timeline as well, which I think is quite interesting, and you probably have better theories than I do about why particular categories move the way they do. My take is that people also ask a lot harder questions now. GPU compute, for example, I imagine is up and down partly because people ask harder things.

But gaming is an interesting category, because I've tried to use LLMs to build games. I mean, I play games, I don't build them, but whenever you try to build games with LLMs, it feels like they have no idea how to build actual games. The mechanics are all over the place; they're not interesting, they're not challenging. So I do get the feeling that performance hasn't really improved in some dimensions; I don't think LLMs really get games, even though I'm sure that two years ago people were asking for much simpler games than they are now. And I'm not aware of any really good gaming benchmark that would capture this. Again, if you compare this to the line going up, it doesn't really match that story, which I think is quite interesting. There are a bunch of other examples you can see in there as well.

So what's really the gap between those crazy charts, which by the way I also agree with, I think they're true, and what we see on the right? I think there's a kind of fuzziness we all have in our heads, in our experience, about the judgment we use, that doesn't necessarily match all of these super narrow, very well-defined, very well-specified tasks. There's much more to what work is, what white-collar work is, than is really captured by these benchmarks. So I think we should be careful, and maybe put a bit more effort into bringing up the bottom of the distribution too, so it's not just the very frontier getting better but the broader distribution as well.

I'll close here. One thing to mention: if you like this kind of data, go to our Hugging Face. There's a lot we publish and share, and we're going to do more of that; we share some expert prompts, for example, and some of the leaderboard data. Join us if you want to build the arena or if you train models. We also do a lot of private tables. Thanks very much.

The future of work has many paths. Our next presenter will discuss the path that he walked with Devin as he organized this very conference. Please join me in welcoming to the stage the co-founder of the AI Engineer conferences, swyx.

Hi everyone.

I am not the chief AI officer of the UK; unfortunately he had to leave for a personal reason, so you get me. Thanks for staying so long. Is everyone having a good time? Thank you.

It's so endearing and heartwarming to hear from you all. I'll take you a little bit into how we build AI Engineer with AI, and it's probably the biggest revelation that I've had. We've had a lot of really warm reception from you, and that's something we really try to engineer. This is our first event in London; hopefully you'll have us back next year.

One thing I wanted to share, for those who are newer to us: I do one of these keynotes at every single AIE. The very first one, three years ago, I talked about the productivity gain you get from increased usage of AI. In the second one we talked about how you should just use more AI, because the cost curve of AI is going down roughly 100x every 12 to 18 months, and I think it's still trending that way. In the third year we started to talk about tiny teams, which was basically a definition I had: teams with more millions in revenue than employees. I even curated an entire track at the World's Fair about this, where we summarized it as the tiny teams playbook, if you're interested in building that. The reason I like this emphasis is that I think people are maybe too fixated on the one-person billionaire or unicorn founder. Every company can have a tiny team, whether you're small or large. And when I look at how we run AI Engineer, with the leadership being Ben and myself, we are also a tiny team. This is us: just nine full-time people, running a business that is more than $9 million. So, we are a tiny team.

I wanted to show you the most significant changes in our workflow since we started this three years ago. By the way, this is our taking-the-AGI-pill moment. Did you all get the AGI pills? Yes. Very proud of this, even though it isn't my brainchild. If one of your co-workers is not sufficiently AGI-pilled, you should prescribe them one of these. You're all AGI doctors now.

Okay. Our stack was very stable and completely non-AI, which is very ironic for an AI conference: Figma, React, Supabase, Tito, Google Sheets, Sessionize. And then I had this funny moment where I joined Cognition and started using coding agents seriously at work, mostly because they were free. I started adding them to the company Slack and doing things with them, showing people: hey, here's how you use this to do coding on the company website.

All well and good, and then something strange starts happening. Here's a workflow with our contract designer, now full-time: he shows me a Figma page, asks me to go through it, and expects it would take a week, two weeks, four weeks to turn into reality. I just added Devin into the thread. Before I could do that, I had to hook Devin up to Figma, and I'm not going to do that [ __ ] myself, so Cowork is doing it for me; you should use Cowork for this kind of thing. Which, by the way, leads me to my first lesson: anytime there's random yak shaving, one underappreciated benefit of agents is that they save you the yak shaves, all the dependency-tree crawling of "oh no, I have to do that first, oh no, I have to do that first." Particularly when it comes to installing or fixing Python dependencies, they're fantastic. And a model of productivity that doesn't sufficiently appreciate parallelism, and not just autonomy and the depth of the yak shaving, is not fully capturing the benefit of agents.

So anyway, back to the agent story: I hooked Devin up to Figma, and in very short order we had a perfectly functioning website that is pixel-perfect to the Figma. To me that was a surprise, because I'd never done it before; you always mistrust marketing until you see it for yourself. More importantly, our designer is very happy about it. And that's basically the website you see live today when you go to ai.engineer.

The other interesting thing that happened: after one initial success, you start using it more. Something you can't see because the text is very small, but I'll highlight it for you: that is 207 replies, usage just exploding. Like, what the hell? And when you dig into it, it's very interesting. First of all, I kick off some work and then go to bed, and then my designer, who's in Indonesia, wakes up and starts messing with Devin. He starts prompting Devin with red lines and annotations, which is something Steve Ruiz, one of our speakers from yesterday, does with tldraw. I never taught him to do this, and there's no instruction manual. It was mostly just: how would you communicate with another human being? I work mostly with a non-technical team, and I think it's very important that they're comfortable with agents, and I think they're finally at that point.

We start working on things we would never normally have worked on. Nobody has reported this, so I assume none of you have discovered it, but there's an Easter egg on the website. Why? Because I put it there. Why? Because it was fun, because I could. If you're on an ultrawide and you move your mouse over the highlights, you'll see an Easter egg. I saw a viral tweet about a design aesthetic I liked, threw it into Devin, and out it pops. And then, 127 replies later... I literally just popped it in, like: let's see what the clanker will do for me. I didn't want to waste my designer's time, I just wanted to see what the clanker does. The designer jumps in and actually starts working on this thing I thought was throwaway. And the most interesting part, it's so small I can't even read it, I'm sorry, is the reason he starts working on it even though it's a throwaway project: because it's fun. That was a big aha moment for me: I'm getting more work out of my employees because they enjoy doing it, because the feedback cycle of waiting and blocking on me or a contract developer is gone. They literally have the idea and go do it. They're doing more things, animations, polish, and I'm getting work out of my employees that I've never gotten before.

I think that's something you should appreciate too. If you haven't noticed, I'm no longer talking about agents for coding or how many lines of code I'm producing; I'm getting more productivity out of my humans. And I think that's a major theme for this year that I'm really trying to investigate: agents for everything else.

Then, obviously: I had the success with Figma-to-website, I had the success with tweet-to-website, what else? You start to think about other use cases. This whole conference is a giant data management problem: I have to sync with 130 speakers, a couple dozen sponsors, and all the attendees who come in with various needs. Really, it's just a CMS. We've messed with Sanity; I'm not the biggest fan of Sanity in the world, because I want to keep some sanity to myself. But basically, I can throw in spreadsheets and Devin can manage that for me. The real unlock happened when I threw away the CMS, committed it all to code, used that code as my source of truth, and let Devin, or whatever coding agent you use, manage it. So this entire schedule is managed by Devin. What does that mean? It means that whenever someone comes in with a speaker change, for example Marty, one of the speakers from today, sends in an email, I just say, "Devin, handle it for me." No further communication is needed: I can forward the email, I can paste a screenshot, whatever. That kind of leverage lets us, a small team of nine people, manage a thousand-person conference. We're going to manage 6,000 people in San Francisco this summer, and I'm pretty sure we can stay the same size. It's incredible the amount of productivity you can get once you're sufficiently onboarded and have the workflows ironed out.

We also have agents for ETL. We deal with an external vendor system that has data we don't have in a central source of truth, so I need to use the API key to sync data over and make sure there's a single source of truth. These are very boring, routine tasks.

Well, there's another fun story I can tell you: agents for buying. I saw this viral tweet about somebody putting a claw on Wall Street next to the Wall Street Bull, and I thought, well, that's funny, we should put a claw in front of our conference. So I asked Devin to research where I can get a lobster in London. Devin comes back with phone numbers and email addresses and websites, and I just click through, think about it, and ask it to do some more research. I'll pause this here; that's literally the lobster you had, bought via Devin. And with this kind of personal automation for everything else, it just matters that you have an agent with web access and a smart enough model. I mean, this is effectively a claw, right? An OpenClaw, a nano claw, whatever clanker you call it. It doesn't really matter. What matters is that you're using agents for things you would otherwise have spent knowledge work on. I might have had an executive assistant or a junior employee do these things for me, but now I can do it serverless, on demand, with a coding agent.

I'm not here to only show Devin; I just advise for the company now. But I started exploring Town, because I think what's happening here is coding agents breaking containment. There are all these other, more fit-for-purpose knowledge management tools, like the wikis that Andrej Karpathy is talking about, which OpenClaw is now adopting as well. You're going to see an explosion of this this year. This is probably in the top three to five trends of 2026 that I want to alert you to.

So here is me managing the World's Fair this summer. Here are all the tracks I'm planning. On the left are my Apple Notes of all the people, intentionally small, and I threw it into Town, and out pops a nicely formatted Notion doc with research on all the speakers I intend to solicit and think about curating.

And then, obviously, once you have enough psychosis, you start thinking about replacing entire pieces of SaaS. Here is me arguing with my employees about kicking out a SaaS tool and building it ourselves, because we can. I clearly have the most psychosis. One of the challenges, if you're in a position of power or management, is dealing with employees who are not as far into the psychosis: bringing them along on the journey and not talking down to them or ignoring their concerns. They are very valid concerns, because these are exactly the people who will have to deal with your [ __ ] when you get it wrong, and we do get it wrong. So one method I'm using to approach this AI-replacing-SaaS concept, which should be relevant for a lot of you, is: let's identify the top three concerns and systematically reduce them. That's the process we're going through right now.

So I just wanted to give you a little taste of how AI is changing our business of managing the conference. It's come a really long way, and it's a consistent theme I'm seeing among our speakers too. This is Malte's opening keynote, talking about how 60% of the user base of Vercel is now bots, is agents, not humans. So actually your dashboards don't matter; your APIs matter, your CLI matters, your MCPs matter. Here are the MCP apps guys, Ido and Lead, who spoke today about how your custom UI is kind of going away: you should shift your UI into somebody else's app. These patterns of how your primary user is changing are really shifting towards what people are calling agent experience, and that's something I'm really inspired by and focused on, because it's helping me. I no longer care about the Figma dashboard; I throw it into Claude Cowork and hope that it works for me.

So that's my message: agents for everything else are coming. Wake up, use them, bring them home and to work. And if people are insufficiently bought in, prescribe them one of these. Thank you.

Ladies and gentlemen, please join me in welcoming back to the stage Tusk Kumar.

We did it.

We did it. You're such an amazing crowd. Thank you so much for sticking around. It's been an incredible past couple of days, from yesterday's opening keynotes all the way through to today's closing ones. What a journey. Let's take a moment and recap what just happened; we have a video prepared. Stay tuned, watch it, and just marvel at the good work that happened here. Then stick around a little longer: we have some announcements, some logistics, excuse me, and we're going to take some pictures. But for now, let's sit back and watch this little recap.

Heat. Heat.

Heat. Heat.

Give it up.

Whoa. That is so cool. We did that. Give yourselves a round of applause. Incredible. Actually, we're gonna do a thing. Listen, it's a big deal what happened here, okay? It's in Europe. We are here. It's a thing. So we're going to start wrapping up the conference. Don't leave yet. I see two of these guys leaving; don't be like them. I'm joking, no pressure, please stay. Anyway, we're going to go through a little bit of a closing ceremony.

It's not going to be long, maybe give us 5 minutes or so. But this would be so incomplete if we didn't have an applause marathon for everything that went into it. This is not easy: it's a big conference in a big city with a big topic and a big effort. So here's what I want you to do: we're going to acknowledge some of the people and parties who made this possible, and we're just going to clap all the way through. I'm going to say the names and identify the parties, and you're just going to keep clapping all the way. Okay? Let's start. Give it up for your speakers, each and every one.

Thank you. Thank you. Keep it going for the sponsors. Woo. We had Google DeepMind, we had OpenAI, we had all of these. Thank you, sponsors. Give it up for yourselves. Yes.

Give it up for the organizers, for swyx, for Ben, the volunteers, the associates, the suppliers, the Queen Elizabeth II Centre, the photographers, the venue, the catering, Tim Curve. Whoa, what a group of people. And finally, okay, pause, because this is a big one. That's actually three big ones. Look at these screens. There are people who made this happen: give it up for the team that put together this huge LED wall. Let's give it up for them. Oh my god, that's incredible.

You know, it's so cool, because from where you're sitting you can't really see it, but from up here I can see each dot. I love this screen; it's a really wonderful screen. We have a party coming up. Yeah, give it up for the party. Awesome.

He has been trained well.

Here's the deal, and I need you to hear this. This is our party. It's coming up at 7:00 local time, in a club called Fabric. But here's the deal: it's not clubbing. We have the venue and we can do whatever we want with it, so we're going to create an atmosphere where you can talk to each other, and ideally you do. If you're expecting strobe lights, darkness and a smoke-filled room, it's not going to be that; it's very similar to the afterparty last night. So come along, have a conversation, and don't waste it. The conference may be over, but your opportunity to meet cool people and connect with them is not. It's a 45-minute walk from here; put Fabric, the club, into your maps app. It's a 45-minute walk, 30 minutes by public transport, or 25 minutes by car, give or take with traffic. Food and beverages are included, so come hungry, come thirsty; we love food. The noise level is going to be manageable. It's not open to the public; we've rented the entire club and can do what we want with it.

Very important: come with your badge. I don't have mine, so I can't come, but it's backstage. I need you to hear me: come with your badge, because if you don't have your badge, you can't come. We need a way to identify you, and the reason for this is that people want to go to a club and will show up without a badge, and we need to gatekeep a little, because this is an experience created for you specifically. Also, you cannot bring a plus-one or a friend to this event, because it's just capacity: as you can see, this room is full, and we need to be mindful of that. We don't want a fire hazard or a stampede if people leave, so we want to be sensitive to that as well.

We're about to finish the conference, but we would be remiss if we didn't capture this moment. So what we're going to do is take a family photo, a group photo, together. Some of you, if you don't want to be in the photo, absolutely no pressure; you're welcome to go to the expo area on your way out. For those who want to be in the photo, you don't have to move, just stay where you are. Our photographer is going to come on stage. Hello. Give it up for your photographers, by the way, both of them; incredible.

So here's how it's going to work. They're going to come on stage, all of us are going to join you, and it would be nice if we can come towards the middle so they don't have to use a big wide-angle lens. Then he's going to be in charge. We're going to turn the house lights up, and when he gives the thumbs up, it's officially over. Then you're welcome to come up here, take photos, do whatever you want. We need to leave the building at 6:30 local time; you need to be out, and if you're not out, you will be made to leave. So finish up your last arrangements after the photo, do whatever you want, and then we'll leave at 6:30. Is that good?

All right, let's do it. Let's take the photo everybody.

He's in charge.

If you can get these guys, everybody move across into where you are.

Everyone in the middle, if you want to stand, that would be great.

Please stand.

As you can.

Let's do it. Oh, my mic's still on, dude.

And then one more for the video. Ready?

Go.
