AIE Europe Keynotes & Coding Agents ft. Pi, Google Deepmind, Anthropic, Cursor, Linear, & more
By AI Engineer
Summary
Topics Covered
- Build Your Own Agent Harness
- Agents Compound Bugs, Humans Feel Pain
- Friction Is Where Your Judgment Lives
- Validation Is Free at 1,200 Tokens Per Second
- Missions Are Ecosystems of Agents
Full Transcript
Heat. Heat.
Heat.
It doesn't knock.
It doesn't name itself.
It calls you.
No face. It needs no crown. No single hand to strike you down.
It moves through mouths that claim they know what's right, what must be so.
It speaks in care. It speaks in good. Cuts everywhere.
No god, no code, no line to cross. Just "necessary" to justify the fracture.
Every truth becomes a weapon.
It isn't flesh or bone. It thinks through you. It takes your body. It hijacks choice, rewrites the chain.
Progress, again, for power. Eyes that gleam. Envy calls it equity. Outrage made clean. Sloth that lets the thinking die. Fear, coercion with the smiling face. Fly or disappear.
A demon without form.
A thought that breathes a storm.
It doesn't burn you instantly.
It hollows you out slow until your voice is not your own, and your hands do what you don't know.
This is the warning. Not a monster outside the gate, but the moment you stop and outsource what you hate. It doesn't need silence from your spine.
Give up the work of conscience and it will speak in your name.
Heat.
Hey, heat. Hey, heat.
It feeds on abdication.
The death of hesitation.
When you trade truth for alignment and agency for peace, you don't fall into hell screaming.
You walk there on a leash.
It is not a person.
It is not a plank.
It is an idea that asks you to stop guarding the gates. If your words are not your own, if your reasons feel rehearsed, check your mouth, check your mind.
The demon speaks first.
launch control. We have a go. Roger.
Ladies and gentlemen, please join me in welcoming to the stage your MC for day two of AI Engineer Europe, Tejas Kumar.
Good morning.
Thank you. Thank you. Good morning. We are here. Woo.
There was a little bit of latency there, but we'll fix it. We'll write a skill for that. Good morning. Hey, it's day two. What an honor, what a privilege, what a blessing to be here. Look at this. It's a full room. We are so excited. How many of you enjoyed yesterday? Show me your hands. And if you didn't, there we go. That's right. It is an amazing conference.
Yesterday was so incredible. Highlights. You want to shout out some highlights for us today? No, it's Europe. We're a bit more reserved here. If this was America, it'd be different. But I tell you what: if you do have highlights, post them on your social media and tag AI Engineer. We want this to be a community thing, and the more people we can invite in, the better. My highlight, thanks for asking, was Malte yesterday, when he shared this slide of how Europe is leading AI innovation. And I think this is so cool because oftentimes we feel like the underdog. At least I do, living in Berlin, Germany. And this was so validating: real innovation is coming out of Europe, with even the DeepMind office in Berlin. So
excellent. We love Europe. We're here. It's going to be amazing. Today is day two and we're going to talk about a lot of interesting topics. We're going to talk about coding agents. We're going to talk about MCP. Who's using MCP here? Look at that. Almost everybody. Incredible. We're going to talk about AI architecture, media, GPUs. There is so much to do. But before we do that, will you join me in giving it up for our sponsors: Google DeepMind, our presenting sponsor. Let's give them a huge round of applause.
For real, this would not be as amazing as it is without them, and we are very, very thankful for them. Let's also give a huge round of applause for our platinum sponsors. We've got Braintrust. Keep it going. Braintrust, WorkOS, OpenAI.
Yeah, it is a real blessing to have such sponsors that create this environment where we as builders can assemble, build, and be inspired. And finally, we've got gold and silver sponsors here. Give them a round of applause as well.
You can find them in the expo hall, through these doors to the right and upstairs, in the breaks. I encourage you, they have some really amazing swag. I picked up from one of the companies a little three-button keyboard for vibe coding. You see this? It's so cool. So I encourage you, go check that out. We are going to start today with a little bit of ground-rule setting. Okay,
the speakers have their jobs. They know what they're going to do. They're going to come here and they're going to inspire. But you have a job as an audience. Are you aware of this? You have a job. Your job is to make the speakers and presenters feel amazing. In fact, you have so much power, because you get to decide the quality of the talk you watch. If you make your speakers uncomfortable, like, "Hey, prove yourself," then they're going to be nervous and anxious and it's not going to be a good talk. Okay? I know this because I speak. But instead, if you validate them even before they prove anything, they just walk up and you're like, "Woo! I did that. I made that sound." If you do that, I guarantee you're going to have a great time, because they're going to feel validated. They're going to feel confident. They're going to give their best, and you're going to make it easy for them. It is a conversation, not a monologue. Okay? Is that clear? So as the speakers come up, I want you to warm them up and give them your biggest round of applause. Let's practice. Let's do it right now. Pretend a speaker just walked up, and...
Exactly. It's a little bit quiet over here. I see you. Let's try that again. This time everybody look at them. I'm joking. I'm joking. But let's pretend one more time: your biggest round of applause for a speaker who has just walked up. Come on.
There we go. There we go. That's exactly it. That's why we're here. So now we're going to introduce our first talk of the day, our first speaker. Our first speaker comes to us from Google, and we're going to hear about Gemma. Gemma is an incredible family of models. I personally love it because it's a source-available set of models, they run almost everywhere, and they can be fine-tuned. It's so cool. I'm really excited for this talk. Please, we practiced this: give it up for your first speaker, Omar Sanseviero.
All right. Hi, everyone. It's full here. So I'm super excited to give this talk, because just seven days ago we released Gemma 4. Before this conference, who here had heard about Gemma already?
Okay, most of you. Great. So Gemma is Google DeepMind's family of open models. Open models means these are models that you can take, you can download, you can run on your own infrastructure, on your own devices, and you can fine-tune for your own use cases. About a year ago, we released Gemma 3. Back then, Gemma 3 were the most capable open models that could fit in a single consumer GPU. We designed models from 1 billion parameters all the way to 27 billion parameters, and back then in LM Arena it was a very strong model. You can see here different open models with their LM Arena scores, and those small dots at the bottom represent how many H100s or A100s you would need just to be able to load the models. This is Gemma 3, from one year ago, but you can see that even as a model from a year ago, it's a tiny model, or a relatively small model, that is extremely capable. But yeah, last week we released Gemma 4, and this is my first conference talking about Gemma 4, so I'm
very excited about that. So Gemma 4 is the most capable family of open models that Google has ever released. These are models that go from 2 billion parameters all the way to 32 billion parameters, and they have very different capabilities, so I'm going to talk a bit about these different things. And if you're wondering what the E there means, I'll explain that in a second. The smallest two models can run on an Android phone, on an iPhone, even on a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things. Then there's a mixture-of-experts model that's super fast, very low latency, a model that can do very cool things. And then you have the 31B, the most intelligent, most capable model. When you want the most raw intelligence, you would use this large model. But even the 31B is a model that can run on a consumer GPU. All of these models come in developer-friendly sizes, which is quite important to us.
So let me show you a couple of the demos, assuming the videos load. There's a lot happening here, so let me begin with the one on the right. That's an application with Gemma running directly on an Android phone, where you can pick different skills. So pretty much you have a full agentic setup here, where the model picks up, say, a skill to play the piano, and then you have Gemma playing the piano. The one on the left is Gemma vibe coding, also on device. This is again airplane mode, no API calls, fully running on a phone. And the example in the middle is on a laptop: we have 20 instances, or 10, sorry, 10 instances of Gemma running in parallel. Each of them is doing a different SVG, and in a couple of seconds you're going to see 10 SVGs generated by different agents. All of this running on device with llama.cpp, and even then it's like 100 tokens per second. And there you can see the SVGs that were generated by the 10 different Gemma models. Gemma is a good coding model. It can do agentic stuff, it can do coding, it can even do Android app development, and again, all of this offline. So the LM Arena scores are
quite nice. Here you can see a bunch of different models. The x-axis is how many billion parameters the model has; the y-axis is the LM Arena score. And I know LM Arena is not the perfect benchmark, but it does give you some proxy for how much the community likes the model for general use cases like conversations and so on. And Gemma has a nice mix between being friendly and helpful and at the same time being very capable. And you can see this corner at the top left: these are very small models that are very capable, which is quite exciting.
It's been exciting to see how the models have progressed over the last two years. Last year it was Gemma 3. Two years ago it was Gemma 1, uh, sorry, Gemma 2. And you can see that for a bunch of different things the models have kept getting better and better without getting bigger, which for me is quite exciting, because if I think about where we'll stand in a year from now, or two years from now, I do think we'll have extremely capable models running directly on our own devices, in our own pockets.
I'll skip the benchmarks, but what is exciting is that Gemma can fit on a desktop computer, it can fit on a laptop, it can fit on a phone. I saw yesterday, or two days ago, that someone put llama.cpp on a Nintendo Switch, and they're using llama.cpp to try Gemma directly there. So I don't know how things will be in a couple of years, but I'm excited for it. Something we heard a lot with the previous Gemma versions was that the license we had was not great; people wanted a proper open-source license. So with Gemma 4 we changed our license to an actual Apache 2.0 license, so you have the flexibility of the Apache 2.0 license. That's quite nice as well.
Now, you have probably heard about mixture of experts, that's the 27B, sorry, 26B model. You have heard about transformers and dense models. But you have probably never heard about the E here. E2B stands for "effectively 2 billion" parameters. Gemma E2B actually has more parameters, four billion or so, and it has a novel kind of architecture called per-layer embeddings. That was something we released in the summer of last year. There's this small block at the bottom, and the TL;DR is that there is an embedding per each layer, as the name indicates, and it works more as a lookup table than a computation you need to do. So this is an extremely fast thing. You don't need to have it on the GPU; you can have it on the CPU, you can have it on disk. And this is an architecture decision that is really optimized for on-device, mobile use cases. That's why the smallest models, the ones that can run on an Android phone or an iPhone, use this E2B or E4B architecture. So even if the model is five billion parameters, you only load two billion parameters into the GPU, and the rest can live in much slower memory, because you are not doing any of the matrix multiplications that you would usually do with the transformer architecture. And this can be done with llama.cpp using a simple flag, override-tensor: you move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box.
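The per-layer-embedding idea just described can be sketched in a few lines of toy Python. Everything here is invented for illustration (this is not Gemma's actual code, and the shapes and names are made up): the point is only that a lookup table, unlike a weight matrix, never participates in a matrix multiplication, so it can live in slower memory and be fetched on demand.

```python
# Illustrative sketch only: NOT the real Gemma implementation. Toy shapes and
# values, invented names. It contrasts a dense layer (compute over all weights,
# must sit in fast GPU memory) with a per-layer embedding (pure lookup).

# A regular transformer layer multiplies activations by large weight matrices,
# so every weight participates in compute:
def dense_layer(x, weight):
    # matrix-vector product over the whole weight matrix
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# A per-layer embedding, by contrast, is just indexed: for a given token id we
# fetch one small row per layer. No matmul ever touches the full table, so the
# table can stay in CPU RAM or on disk and be paged in on demand.
per_layer_table = {  # token_id -> one small embedding per layer (toy values)
    0: [[0.1, 0.2], [0.3, 0.4]],   # layer 0 row, layer 1 row
    1: [[0.5, 0.6], [0.7, 0.8]],
}

def per_layer_embedding(token_id, layer):
    # pure lookup: O(1), no arithmetic over the whole table
    return per_layer_table[token_id][layer]

print(per_layer_embedding(1, 0))  # -> [0.5, 0.6]
```

The asymmetry is the whole trick: the dense layer's cost scales with the full weight matrix, while the lookup only touches one row, which is why the talk says the per-layer embeddings can be moved off-GPU without paying the usual memory-bandwidth price.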
A couple of other exciting things: the smallest models can do multimodal understanding for images, for videos, and even for audio. So you can do speech recognition, you can do speech translation: I can speak in Spanish and the text can be transcribed to, I don't know, French. And then the larger models can do extremely capable multimodal understanding: videos, fine-grained details. I actually have a couple of examples here. For example, it can point to where the llama is in a picture. It can do object detection, detecting different objects in a picture. And what is cool is that this model is heavily multilingual. Gemma 4 was trained on over 140 languages, and it uses the tokenizer that Gemini is based on as well. So pretty much all of the multilingual research that powers Gemini is also enabling Gemma. The tokenizer piece is quite interesting, because independently of the raw capabilities of Gemma, this tokenizer was designed for multilingual use cases, and we took lots of care with it. Which is interesting if you want to fine-tune Gemma for a different language, say a language with low digital resources: an indigenous language in Peru, Quechua, or, I don't know, one of the official languages of India. You can pick the model, use your data, train the model, and independently of the raw capabilities of Gemma, just because of the tokenizer decisions, things tend to work quite well out of the box. So then you can mix the multilingual and multimodal capabilities, for example here to get the text, or an explanation, of an image with Japanese text. And that's quite cool.
So we released the model a week ago. Just yesterday we got to 10 million downloads just for Gemma 4-based models. There are over 1,000 models based on Gemma 4 already, quantizations or fine-tunes by the community, and over 500 million downloads of the whole Gemma family. What is very cool for me is that Gemma is not just a model you can use; it's about enabling the ecosystem to build on top of it, and that's what the community has done over the last few days. It was top of Hugging Face trending. People have been building cool examples, people have been doing full repository audits using Gemma, people are putting Gemma in all kinds of devices and exploring all of the capabilities, which is quite nice. And all of this is not done just by us. We collaborate with the open-source ecosystem. We work with Unsloth, MLX, Ollama, Hugging Face, vLLM, SGLang, and pretty much we want to ensure that when we launch a new tool, both for Gemini and for Gemma, people can leverage the capabilities out of the box. They should not need to switch to, say, Keras if they want to fine-tune Gemma; if they are fine-tuning with Hugging Face Transformers, they should be able to do that. So for us, it's very important and critical to be where the community is. And that's why, really, shout out to all of you working in the open-source ecosystem, contributing to different tools, maintainers of all of these repositories, because it's really a way to enable the ecosystem to do amazing things.
Another part that I like about Gemma is all of the product integrations we can do. Android Studio: I don't know if anyone here is an Android developer, but Android Studio has an agent mode, where you have an agent that helps you vibe code and develop. And there's an offline mode now, where you can have a llama.cpp, Ollama, or vLLM-powered system in which you have Gemma helping you vibe code for Android development. And we did include some Android-related datasets and benchmarks while training Gemma, so it's actually a very capable model for Android development.
So I talked a bit about how many people are fine-tuning and sharing, so let me share a bit about the Gemma numbers. This number is outdated, it's from last week; now we have 500 million downloads, as I mentioned, and in total Gemma has over 100,000 models. So again, maybe you just want to use them out of the box: open models may work great for you. But maybe you want to improve the capabilities. Maybe you want to change the style in which the model talks with users. Maybe you don't want a conversational model at all; maybe you just want a model that can predict a certain thing in your own context. Or maybe you just have too many GPUs at home and you want to burn them. I don't know what your reason is, but you can fine-tune models for many cool things. So Google has done a couple of what we call official Gemma variants. We did ShieldGemma, which is a family of guardrail models. Those are great for production use cases where maybe you don't want users to submit, let's say, toxic images or toxic text that does not match the policies you have set up. ShieldGemma is the family of models that allows you to do that. But then there are also other kinds of use cases. For example, for medical use cases we have released MedGemma, which is a multimodal Gemma 3-based model for different medical tasks: radiology, chest X-ray understanding, and a bunch of other things. And again, these are open models: you can use them, and you can also fine-tune them even more if you have an even more niche use case. So that's what Google has done.
But the community is also doing cool things. For example, there is AI Singapore, a group that is training models for Southeast Asian languages. There are a bunch of them, and they have been building quite a bit of research with open models to push the state-of-the-art capabilities even further in terms of multilinguality. Another example is Sarvam. In India there are many official languages, and there is this effort by the government: they are investing in a couple of big startups to train national models. So this is more from the sovereign-AI and official-languages point of view, but people are doing very interesting stuff on the multilingual side of things. Apart from that, there's quite a bit of other cool research happening. There was this paper we released in December of last year about how some researchers from DeepMind were able to use Gemma 3 to propose some cancer therapy pathways, which was taken to an actual lab, and they were able to validate that the pathways proposed by this Gemma-based model led to real results. That was quite exciting, because it's not just about having your assistant, or chatting, or, I don't know, doing role-playing and whatnot. It's also about building models that can be used for real things that help the community in many different ways. Be that finance, or, I don't know, legal reviews; offline use cases where you don't want your data to leave your servers; offline modes if you're, I don't know, on the subway or on an airplane and you need to use AI for something; a Chrome extension with Gemma in there helping you understand what's on your screen. If you want to do on-device control, the open models are getting there. And for me that's quite exciting, because if you compare where we are now versus one or two years ago, open models can now do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone. So I
really recommend all of you to just spend one hour in the next two weeks playing with the latest open models and trying to understand their capabilities. Of course, there are many things for which you will want to use an API-based model. If you want the most raw intelligence, you will go and use Gemini or your model of choice. But if you want things on device, there are many exciting things you can already do. And for me, what is most exciting is, I don't know how things will be in six or 12 months from now, but I think we are heading in a very exciting direction, where people will be able to have extremely capable open models on their own devices, customized for their own use cases, with their own data. So yeah, please try the models, build something, and share it.
Thank you.
Our next presenter is here to make the case for the future of MCP. Please join me in welcoming to the stage the creator of MCP and member of technical staff at Anthropic, David Soria Parra.
Okay.
Well welcome.
Let's get started.
This is an MCP application.
That's an agent shipping its own interface. Not through a plug-in, not through an SDK, not rendered on the fly by the model on the client side, not hardcoded into the product. That is something that is served over an MCP server. And you can take the server, put it into Claude, you can put it into ChatGPT, you can put it into VS Code, Cursor, and it will just [ __ ] work.
And that, I think, is kind of cool, because to do that you need something that a lot of the things in the ecosystem do not offer: you need semantics. You need both sides, the client and the server, to understand what each side is talking about, to understand how you render this, to understand that there's a UI coming. And for that you need a protocol.
And the best part: an MCP server doesn't just ship an app, or it doesn't only ship an app. It can also ship tools with it. So you can interact with the application as a human, and you can have the model interact with it through tools, which is a very unique thing that I think we have not explored much just yet.
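As a sketch of that dual surface, here is a toy, protocol-agnostic server in Python. It is not the real MCP SDK or wire format; the class and method names are invented. The point is only the shape: one registry exposing both a UI resource for the human and callable tools for the model.

```python
# Toy sketch, NOT the actual MCP SDK: invented names, no real wire protocol.
# It only illustrates one server exposing both a human-facing app and
# model-facing tools, as described in the talk.

class ToyServer:
    def __init__(self, name):
        self.name = name
        self.tools = {}       # model-facing surface: name -> callable
        self.app_html = None  # human-facing surface: a UI the client renders

    def tool(self, fn):
        """Decorator registering a function the model can call."""
        self.tools[fn.__name__] = fn
        return fn

    def serve_app(self, html):
        """Register the interface the client renders for the human."""
        self.app_html = html

server = ToyServer("todo")
todos = []

@server.tool
def add_todo(text: str) -> str:
    todos.append(text)
    return f"added: {text}"

server.serve_app("<ul id='todos'></ul>")

# The model drives the tool; the human sees the app. Same server, two surfaces.
print(server.tools["add_todo"]("write talk"))  # -> added: write talk
print(server.app_html)                         # -> <ul id='todos'></ul>
```

In the real protocol both surfaces travel over one connection with shared semantics, which is exactly why the speaker argues a protocol (rather than ad hoc plug-ins) is needed.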
Okay, but let's quickly rewind a little bit from this, what I think is a really cool glimpse into the future of MCP, to over a year ago. Eighteen months, an eternity in an AI life cycle. All of this did not exist. There was just a little spec document, a few SDKs, mostly written by Claude, local only, with little more than just tools. And in the last 18 or 12 months, you guys have been absolutely crazy, building stuff, building servers, building a crazy ecosystem around this. And we, on our side, have been busy taking this local-only thing and adding remote capabilities, adding centralized authorization, adding new primitives like elicitation and tasks, and last but not least, adding new experimental features to the protocol, like the MCP applications you've just seen.
And in the meantime, we have reached a really cool milestone, because, again, all of you have been absolutely crazy, building and building and building, of course luckily with the help of a bunch of agents. We're now at 110 million monthly downloads, and that's of course not just us and our clients and servers. That's OpenAI's Agents SDK, Google's ADK, LangChain, thousands of frameworks and tools that you might never even have heard of, pulling it in as a dependency. Which means there's one common standard that all of us have at our disposal to speak to each other. Just for context: React, one of the most successful open-source projects of the last decades, took roughly double the amount of time to reach that download volume. And in the meantime, of course, you all have been building really, really cool servers, from little toy projects like WhatsApp servers and Blender servers, to SaaS integrations like Linear, Slack, and Notion that are really powering what everyone does every day when they use MCPs. But most importantly, the vast majority of MCP servers most of us have built are behind closed doors, connecting companies' systems to agents and AI applications.
But I still think this is just the absolute beginning, because I think 2025 was all about exploring, and 2026 is all about putting these agents into production. If you really think about it, in my mind, in 2024 we just built a bunch of demos and showed cool stuff to people, and there was a little bit of a buzz. 2025 was really all about coding agents. Coding agents, if you really think about it, are the most ideal scenario for an agent: it's local, it's verifiable, you can call a compiler, you have a developer who can fix [ __ ] if it goes wrong, sitting in front of the computer. You can display a TUI, and the user is quite happy.
But I think now, with the capabilities of the models increasing, we are going into a new era, and I think this year we will see this start, where we're not just doing coding agents. We're going to have general agents that will do real knowledge-worker stuff, things a financial analyst wants to do, a marketing person wants to do. And they need one thing in particular. They don't need a local agent that calls a compiler. What they need is something that can connect to, like, five SaaS applications and a shared drive, because the most important part of an agent, for them, is connectivity. And in my mind, connectivity is not one thing. If someone tells you there's one solution to all your connectivity problems, be it computer use, be it MCP, they are probably pretty wrong, because the right answer, of course, is that it always depends. There's a big connectivity stack, and there's the right tool for the right job. And in my mind, there are three major things you want to consider when building an agent in 2026: skills, MCP, and of course CLI or computer use, depending on your use case. They do three very distinct things, and there are three different things you want to consider when you build your agent.
Number one, skills. Skills are just domain knowledge: specific capabilities captured in a very simple file, and mostly reusable, with some minor differences between the different platforms.
CLI, of course, is very popular with local coding agents. It's an amazing tool to get started simply, to have something you can compose in bash, where the model can automatically discover what the CLI is capable of. And most importantly, if you have things that are CLIs, like GitHub, Git, and other things that are in pre-training, CLI is an amazing solution for your connectivity. CLIs are particularly good when you have a local agent, where you can assume a sandbox, where you can assume a code-execution environment. But if you don't have that, if you need rich semantics, when you need a UI that can display long-running tasks, when you need things like resources, when you need to build something that is fully decoupled and needs platform independence, or you don't have a sandbox, when you need things like authorization, governance policies, or, in short, boring but important enterprise stuff, or if you want experiments like MCP applications or, what comes soon, skills over MCP, then I think MCP is this additional connective tissue, yet another tool in the toolbox for you to build an amazing agent. And so this is all to say that I think in 2026 we're going to start building agents that use all of it. They don't use one thing, they use all of it, and they use them quite seamlessly together.
But I don't think we're quite there just yet because we need to build a lot of stuff.
Partially because our agents kind of still suck, and partially because I think we just haven't talked enough about some of the techniques you can use to really put this connective tissue together.
The number one thing we need to start building is on the client side, the agent-harness side, the thing that powers the connective parts, be it Claude Code, be it Pi, be it whatever application you're going to build. And the number one thing we have to do there, something I really want to get across today, is to start building something called progressive discovery. Most people, when they think about MCP, think about context bloat. But consider what a protocol actually does: a protocol just puts information across the wire, and the client is responsible for dealing with that information. What everybody has done so far, because we're in this very early experimentation phase, is simply put all the tools into the context window and then be quite surprised that the context window gets large. What you should do instead is use the progressive discovery pattern: use something like tool search to defer the loading of tools, and load them only when the model needs them. We have this in the Anthropic product and the API, and people can use it on competitors' APIs as well. But you can also build it yourself: you give the model a tool-loading tool, basically, and the model goes, "ah, maybe I need a tool now, let me look up what tools I need," and then you load them on demand.
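The deferred-loading idea can be sketched in a few lines. This is an illustrative TypeScript sketch with invented names (`searchTools`, `loadTool`, `ProgressiveToolRegistry`), not the actual shape of Anthropic's tool-search API: the harness advertises a single meta-tool and pulls full schemas into context only on demand.

```typescript
// Illustrative sketch of progressive discovery; the names here are invented,
// not the actual Anthropic tool-search API surface.
type ToolDef = { name: string; description: string; inputSchema: object };

class ProgressiveToolRegistry {
  private all = new Map<string, ToolDef>();
  private loaded = new Set<string>(); // tools whose full schemas are in context

  register(def: ToolDef) {
    this.all.set(def.name, def);
  }

  // The one meta-tool the model sees up front: cheap, lightweight matches.
  searchTools(query: string): string[] {
    const q = query.toLowerCase();
    return [...this.all.values()]
      .filter((d) => d.name.includes(q) || d.description.toLowerCase().includes(q))
      .map((d) => d.name);
  }

  // Called when the model decides it needs a tool: load the full schema.
  loadTool(name: string): ToolDef | undefined {
    const def = this.all.get(name);
    if (def) this.loaded.add(name);
    return def;
  }

  // Context holds the meta-tool plus whatever was loaded on demand,
  // instead of every registered tool.
  contextToolCount(): number {
    return this.loaded.size + 1;
  }
}
```

With 200 registered tools, the context starts with one tool definition instead of 200 and grows only as the model asks for more.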
And here in this example, what you're seeing on the left side is Claude Code before we added this, and on the right, after. You can see a massive reduction in tool-context usage.
The second part is something called programmatic tool calling, or what other people usually refer to as code mode. The idea is that one thing you really want to do is compose things together. You don't want the model to call a tool, take the result, call another tool, take the result, call another tool, because what you're effectively doing is letting the model orchestrate things. That orchestration burns inference, it's latency-sensitive, and all of it could be done far more effectively if you instead wrote a script. In fact, that's what you constantly see things like Claude Code do when they write bash commands.
But you can of course do this with everything, and you can and should do this with MCP. So what does this mean? Instead of having the model call one tool after another, you give the model a REPL tool, an execution environment like a V8 isolate or a Monty or a Lua interpreter, and have the model write the code for you; it executes that code and composes the tools together. And there's a neat little feature in MCP called structured output that tells you what the return value of a tool will be. The model can use that type information to compose these things really nicely. In this example, instead of doing two different calls, you do one call and filter inside it; the model automatically strips what it doesn't need from the JSON and just continues.
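Here's a minimal TypeScript sketch of that pattern, with invented stand-in tools (`listUsers`, `getOrders`): instead of two tool round-trips through the model, the model emits one script, and only the final filtered value re-enters the context.

```typescript
// Sketch of programmatic tool calling ("code mode"): rather than the model
// calling listUsers, reading the full JSON, then calling getOrders per user,
// it emits one script that the harness executes. Tool bodies are stand-ins.
type Tools = {
  listUsers: () => { id: number; active: boolean }[];
  getOrders: (userId: number) => number[];
};

const tools: Tools = {
  listUsers: () => [{ id: 1, active: true }, { id: 2, active: false }],
  getOrders: (id) => (id === 1 ? [10, 20] : [99]),
};

// In a real harness this script would come from the model and run in a
// sandbox (e.g. a V8 isolate); here we just call it directly.
function modelScript(t: Tools): number {
  const activeIds = t.listUsers().filter((u) => u.active).map((u) => u.id);
  // Compose the two calls locally; no intermediate JSON hits the context.
  return activeIds.flatMap((id) => t.getOrders(id)).reduce((a, b) => a + b, 0);
}
```

One script, one result back to the model, instead of N tool results streamed through inference.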
Of course, if you don't have structured output, you can always just ask for it: call a cheap model and say, "I want this expected type; give it back to me." And bam, you have a type.
The model can compose things together. And I think this is something we're just not doing enough yet, and somewhere we can improve our agent harnesses.
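That fallback, asking a cheap model to coerce raw tool text into an expected type, can be sketched like this. The `callCheapModel` function is a stub standing in for a real request to a small, fast model, and the regex inside it is purely illustrative.

```typescript
// Sketch of type extraction without structured output. In practice,
// callCheapModel would prompt a small hosted model with something like
// "Extract JSON matching { id: number, name: string } from this text".
type User = { id: number; name: string };

// Stub: simulates what the cheap model would return for this input shape.
function callCheapModel(rawText: string, _schemaHint: string): string {
  const m = rawText.match(/id[:=]?\s*(\d+)[\s\S]*?name[:=]?\s*"?(\w+)/);
  return JSON.stringify({ id: Number(m?.[1] ?? 0), name: m?.[2] ?? "" });
}

function extractTyped(rawText: string): User {
  return JSON.parse(callCheapModel(rawText, "{ id: number, name: string }")) as User;
}
```

The point is the shape of the flow, raw text in, typed value out, so later script steps can compose on it.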
And then last but not least, of course, you can compose these things together with executables: with CLIs, with other components, with APIs as well.
Next, besides the client work, which is progressive discovery and programmatic tool calling, we need to start building properly for agents. That means we all need to stop taking REST APIs and putting them one-to-one into an MCP server. Every time I see someone building another REST-to-MCP conversion tool, it's a bit cringe, because it just results in horrible things. What you should do instead is design for an agent. You can start by designing for yourself as a human, how you would want to interact with this, because that's actually a very good start for an agent. If you want to orchestrate things together, you should of course reach for programmatic tool calling, and you can do this on the client side, as I said before, but you can also do it on the server side. The Cloudflare MCP server and others like it are great examples of how, instead of providing tools, you can provide an execution environment to the model and just have it orchestrate things together, which again cuts token usage, cuts latency, and is far more powerful in its composition. And then last but not least, as server authors we should start using the rich semantics that MCP offers over alternatives. That means shipping MCP applications. It means shipping skills over MCP. It means using things like tasks and other aspects of the protocol that are currently slightly underused, or things like elicitation, things that only MCP can do for you.
Of course, that's all work you need to do, and maybe some of our product people. We also need to do a lot of work on MCP itself, and there are a few things down the line that we're going to have to solve. The number one thing is improving the core. A few things, as we've developed the protocol over the last year, are just not in good shape. Number one: the current streamable HTTP is very hard to scale if you're a large hyperscaler. So we have a proposal from our friends at Google, who are working on something called a stateless transport, which makes it significantly easier to treat MCP servers like just another stateless REST server, the kind we already know how to deploy to Cloud Run or Kubernetes and so on. That's coming down in June and hopefully landing in the SDKs very soon. In addition, we need to improve our asynchronous task primitive, which is basically a very fancy way to say we just want agent-to-agent communication. We have a very experimental version of that in the protocol that very few clients support, so we're going to start building out more clients like that. And most importantly, we are improving some of the little semantics that we need to. We're going to ship a TypeScript SDK version two and a Python SDK version two, based on a lot of lessons learned over the last year.
There's an SDK called FastMCP. Who's using FastMCP? Yeah, it's just way [ __ ] better than the Python SDK that we ship, and that's on me, because I wrote the Python SDK. So I have a bunch of people who are way better Python developers than me helping me write it better. The second part is we need to start integrating everywhere. We're going to ship, particularly for enterprises, something called cross-app access. It's a new thing we're working on closely with identity providers, and it's a very fancy way to say that once you log in once with your company identity provider, be it Google or Okta, you'll be able to just use MCP servers without having to re-log in. A bit more smoothness.
In addition, we're going to add server discovery, by specifying how you can discover servers on well-known URLs automatically. So crawlers, browsers, agents can just go to a website and ask: instead of just parsing the website, is there also an MCP server I can use? And we'll be able to discover this automatically. This is a really cool thing that will also come down in June when we launch the next specification, and it will be supported there. And then last but not least, we are starting to use the extension mechanisms in MCP, which means some clients will support certain things and others won't. For example, MCP applications will only be supported by web-based interfaces, because if you're a CLI you just have a hard time rendering HTML. And we'll do more of these extensions. One of the most exciting extensions, I think, is that we're going to ship skills over MCP, because it's very obvious that if you have a large MCP server with tons and tons of tools, you just want to ship domain knowledge with it and say, "this is how you're supposed to use this." It allows you, as a server author, to continuously ship updated skills without having to rely on plug-in mechanisms and registries and other stuff. So that's coming down. There's a lot of experimentation from people in that space already. You can do some of it today if you just give the model a load-skills tool; you can build primitive versions of this today without relying on the protocol semantics. But of course we're going to define the semantics.
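A do-it-yourself version of that, buildable today, is exactly what he describes: give the model one skill-loading tool whose result is a markdown document of domain knowledge. The `loadSkill` name and in-memory store below are illustrative, not the forthcoming spec semantics.

```typescript
// DIY "skills over MCP" primitive: one tool that returns server-shipped
// domain knowledge as markdown. In a real server this would read files the
// server author can update at any time, with no plugin registry involved.
const skills: Record<string, string> = {
  "billing-refunds":
    "# Refund workflow\n1. Look up the invoice.\n2. Call refund with the invoice id.\n",
};

// Exposed to the model as a normal tool.
function loadSkill(name: string): string {
  return (
    skills[name] ??
    `Unknown skill "${name}". Available: ${Object.keys(skills).join(", ")}`
  );
}
```

The fallback message doubles as discovery: a wrong guess still tells the model what skills exist.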
Okay. So that's a long-winded way to say that I think MCP is actually in really good shape, and I think this year we're going to push agents to full connectivity. MCP will continue to play a major role, and of course we want your feedback. We are a very open community. We just created a foundation, and we're mostly running as an open-source community, with a Discord and issues. Just come to us and tell us: where the [ __ ] are we wrong? What are we getting right? So that we can improve this on a continuous basis. So 2026, I think, is all about connectivity, and the best agents will use every available method: computer use, CLIs, MCP, and skills, because they want a wide variety of things they can do. And then they can ship cool stuff like this, which is one of the product features we shipped recently. Under the hood, it's nothing but an MCP application that renders stuff. Cool, so we can now look at the model
writing graphs. Anyway, thank you.
Our next presenter is the creator of AgentCraft and MCP-UI, here to speak about agent orchestration. Please join me in welcoming to the stage Ido Salamon.
So, good morning, London. My name is Ido Salamon. I'm the creator of AgentCraft. I'm also the creator of MCP-UI, and creator and co-maintainer of MCP Apps. So I'm building some of the stuff that David has been talking about.
As you've all heard over the past day, agents are amazing. But if one agent is so amazing, why don't we scale up to 10 or 20 or 100 different agents and be 100 times more amazing? It's pretty simple: we just spin up a bunch of agents, put them on this nice screen, and it looks really glorious. But it won't actually work. And the reason is that spinning them up isn't the problem. It's us. We are the bottleneck in orchestrating all of these agents. If you think about it, the role of an engineer in most companies is not typically to go and manage dozens of reckless employees. So we need to somehow find these potentially new skills to manage all of these agents.
Luckily, they're not really brand new. It's not something we've never done before; it's just been hiding in unexpected places. If you're a gamer, or used to play games at any point, managing dozens of units probably sounds a little familiar. Which is why I built AgentCraft, an orchestrator that aims to raise the ceiling of human-agent collaboration by taking learnings from gaming and transferring them into productivity. So let's do a quick walkthrough and understand the journey to raise that ceiling.
So, this is AgentCraft.
There's a lot to unpack, so we'll start with the basics and go from there. This is an agent, and not a metaphorical one: it's a physical manifestation of a coding agent, a live session. It can be Cursor, it can be Claude Code, Codex, OpenClaw, whatever. It's something we can detect on the device and visualize, but also something we can spawn directly from here.
So now we have this agent, and we can prompt it. We can use it like any other agent we have from our CLI. And what can we tell it to do? It has all of these quirks: voice, text, images and so on, and we can just tell it to do stuff. For example, we can tell it to develop some feature for us.
prompt.
And now the agent is working, doing its work. And as you can see, if you look at the UI, there's a bunch of other stuff. We have these buildings, and each building represents some functionality; for example, one of these buildings manages the skills and plugins and so on. There's also an integrated terminal and Git, just to get that end-to-end workflow.
The second part of raising the ceiling, now that we have the basics, is visibility: we need to be able to quickly understand what each agent is doing. So we have this nice side panel here that shows a high-level mission status, summaries, and so on. What are they actually doing? But the cool thing about AgentCraft is that we don't just see a list of what they're doing; we can actually see them working. If we look at the map, you'll notice it's actually a projection of my file system. Each part of my file system is on the map: I have these directories here, and each directory has files, which are represented as runes, as you can see. So I can track visually which file each agent is working on, and I can see the entire change list of what happened there. And because we're orchestrating it, I also know which agents did what and when, so we have full lineage of what's going on. And we can take this one step further: if I know all of this, why not create a heat map? I can visualize collisions, and I can even prevent them proactively.
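The collision idea generalizes beyond the map: track which agent owns which file, and flag a second writer before the conflict lands. A toy TypeScript sketch (AgentCraft's actual implementation isn't shown in the talk, and `CollisionTracker` is an invented name):

```typescript
// Toy sketch of proactive collision prevention between agents: each file has
// at most one owning agent, and a second agent claiming it gets flagged.
class CollisionTracker {
  private owners = new Map<string, string>(); // file path -> agent id

  claim(file: string, agent: string): { ok: boolean; conflictWith?: string } {
    const owner = this.owners.get(file);
    if (owner && owner !== agent) return { ok: false, conflictWith: owner };
    this.owners.set(file, agent);
    return { ok: true };
  }
}
```

With ownership tracked per file, the same data drives both the heat map and the lineage view.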
Now, the cool thing here is that once we have this visibility, we're not exactly done yet, because we still need to be able to react to the changes that are happening. So we can lean into another cool mechanism from RTS games: we can simply use muscle memory to quickly cycle between the agents that need our help. They need us to approve a plan, they need us to answer some question, and so on. So now we have visibility and we can react quickly. So we're done, we've solved orchestration. Well, not quite.
Because that's really only the first step. I was able to use more agents in parallel, but only for a short amount of time, and there are a few reasons for that. The first is that there's a limit to how many ideas I can hold in my head at any given time without getting tired. So what I did is basically tell the agents to do it: I told them, find missions for me to do. So I have quests now, and I can click a button and they just go: refactor, test, all the stuff I don't want to do. The second reason is that all of this babysitting takes a lot of time. I can see what's going on and react to it very quickly, but I still need to cycle through it. So what I did there is ask: how do I take myself out of the equation as much as possible? If agents are so amazing, why not just let them do it? I can give them some idea. I have this campaign feature: I broadly say what I want to happen, spin up a container, and let the agents run there. They can decompose the task, they can plan it, they can present the plan to me. I don't care what they're doing, because it's containerized, so do whatever. And the main thing here is that once the work is decomposed, I'm not the one doing the babysitting; the campaign orchestrator is, and that's its problem. So we're actually moving more of the effort into just the planning phase and the review phase.
And once we have that, we reach a point where we can ask: why do these have to be my ideas? Why can't I tell it to run a cron job, go to Twitter every day, scan for cool ideas, and just implement them, and I just decide what I want? Which is actually how I implemented channels pretty quickly. So we have that, and now we just have a lot of different PRs to review. So there's this nice capability of review bundles: I can see exactly what changes happened in each one, why they did things, what the tasks were, and I also have visual evidence. I can look at screenshots and videos and really see what's going on without investing too much time. And once we have that, we can shift more of the work from planning to review: how much time do I need to spend on a plan if I can just run it 10 times and pick the result that fits me best?
And the next part: we're still not done. If you think about it, this is only the first step, because agents aren't that smart yet, so we need to offload to someone else: humans. Now, what I can do, and this is my favorite feature, is create these workspaces. I can collaborate with the product designer on my team; they can do whatever they want, and I can just continue from where they left off. For example, let's say this is an agent from the product designer, on their computer. They can see my agents, I can see their agents, I can understand what they're doing, and we can just collaborate.
Um, prompt prompt.
Yeah, they just started working again.
So I can see that they want to design this new page, which is pretty cool. I can wait for them to finish, or I can just go ahead now and hand off from their agents to my agents. Well, our agents; insert communism joke, whatever. So we have our agents now, and I can just keep going from there.
And the cool thing is that it's not just human-to-human collaboration; we are also collaborating with the agents. There's more direct stuff, like this: I can just type and prompt my agents, or even their agents. But there's also a softer mechanism: there's actually a chat between humans and humans, but also between the humans and the agents. You can see here that the agent said, "I'm starting to work on something," and I can say, "I'm also working on it," so the next time the agent does something, it knows someone else is working on it. The agents can also collaborate softly with each other, so they know what files each one is changing.
So we've taken a bunch of things that were limiting us from really reaching our full potential with agents and solved them one by one. There are a bunch of other features I just didn't have time to go over, but you can try them out and see for yourself whether you can really work better that way.
So, to sum up: these are not exactly new skills. You're probably worried that we won't be able to adapt to this future where we're not actually coding, just telling other people, or other agents, to code for us. But these skills are there; they're just not something we've used for work until now. With games as one example, we can take these skills to the next level.
We need to somehow raise that ceiling and improve our collaboration with agents. And with AgentCraft, the goal is to take the learnings from games and really raise that to the next level: better visibility, more autonomy for the agents, and human-to-agent collaboration.
So I invite you to go to the website; this is the QR code. It's free, you can just download it and play with it. It's still experimental, still new, and there's a bunch of stuff that needs to change, but that will only happen with great feedback. There's also a Discord, so please join, give us your feedback, and let's raise the ceiling together. Thank you.
Our next presenter is the creator of one of the top coding agents, Pi, which is the engine inside OpenClaw. So naturally, he's here to tell us how agents are destroying open-source software. Please join me in welcoming to the stage the creator of Pi, Mario Zechner.
Hey there, I'm Mario. I built Pi in a world of slop, and this is a tragedy in three acts. Just to address this real quick: a bunch of people on the internet gave me money for ad space on my torso, and all of that goes to a charity. So yeah, thanks, guys.
Act one: building Pi. In the beginning there was Claude Code, and it was good. We all got basically catnipped by that thing and stopped sleeping. There was a bunch of stuff before that, but Claude Code was the one thing that clicked with me the most. And to preface all of this: I love the Claude Code team. They are brilliant people, talented, super high velocity, and they created the entire game. Major props to them.
So this is not a roast. This is just me, an old man, telling you why I stopped using Claude Code and built my own thing. In 2025 I started using Claude Code, in about April I think, thanks to Peter, because he told us the agents are working now. And back then it was simple and predictable and fit my workflow. But eventually
the token madness got hold of them, I think. The team got bigger, they started dogfooding that stuff and built a lot of features. A lot of features I don't need, which is fine; I can just ignore them. But with velocity and more features come more bugs, and that's bad. I used to work on construction sites, and if my hammer breaks every day, I get really mad; if my development tools break every day, I also get mad. So there was this; it's just a running gag. And here's Tar telling us that Claude Code is now a game engine, and here's Mitchell from Ghostty telling us, "No, it's not." Eventually they fixed the flicker, but then other stuff broke, and I think they're now on the third iteration of a TUI renderer. But that's just a symptom. The real problem is that my context wasn't my context. Claude Code is the thing that controls my context, and behind my back it does things to the context. The system prompt changes on every release, including the tool definitions: they would remove tools, modify tools. Not good. They would insert system reminders in the most inopportune places in your context, telling the model, "here's some information, it may or may not be relevant to what you're doing." It literally says it may or may not be relevant to what you're doing. That confused the model and broke my workflows.
On top of all that, there's zero observability, because that's how the tool is constructed, and I like knowing what my agents are doing. There's zero model choice, which is obvious: it's the native Anthropic harness, so it makes sense for them to want you to use Claude. And there's almost zero extensibility. Some of you might have written hooks for Claude Code, but I'm telling you, the number of hooks and the depth of those hooks is very shallow. And every time a hook triggers, a new process gets spawned, basically the command you specified for the hook, and I don't find that especially efficient. So I took a step back and looked around for alternatives, and I'd especially like to call out Amp and Factory's Droid, the Porsche and Lamborghini of coding-agent harnesses. If you can afford them, please use them; they're at the frontier, they're really good, and the teams are fantastic. There are a bunch of other options too, and I have history in OSS, so naturally I gravitated towards OpenCode. Again: brilliant team, super high execution velocity, and they don't sell you hype; they sell you tools that work, for the most part.
I started looking under the hood of OpenCode with respect to context handling, because that's the most important part for me, and I found a bunch of things. Given some conditions, OpenCode will just prune tool output after a specific minimum number of tokens, and that basically lobotomizes the model. There's also LSP server support, which means that every time your model calls the edit tool, OpenCode asks the connected LSP server whether there are any errors and, if so, injects them into the edit tool result. That's bad; think about how you edit code. You don't write a line of code, check the errors, write the next line, check the errors. You finish your work and then you check the errors. This confuses the model. There are a bunch of other things, like storing the individual messages of a session in JSON files; each message is a JSON file on disk. And there was this, and this happens to all of us, no blame there, but it's not great if, by default, a server spins up with CORS headers set in such a way that any website you open in your browser can access your OpenCode server.
Entirely unrelated to all of this, I started looking into benchmarks for coding-agent harnesses and found Terminal-Bench, which is a pretty good benchmark, all things considered. The funny part is that it's the most minimal thing you can think of: all it gives the model is a tool to send keystrokes to a tmux session and read the output of that session. No file tools, no sub-agents, none of that. And it's one of the best-performing harnesses on the leaderboard. Here's the leaderboard from December 2025: irrespective of model family, this minimal harness scores high, mostly even higher than the native harness of that model.
So what does that tell us? My first thesis is that we are in the [ __ ] around and find out phase of coding agents, and their current form is not their final form. My second thesis is that we need better ways to [ __ ] around, and for me that means self-modifying, malleable agents: things that the agent itself can modify, and that I can modify, depending on my workflow. So I stripped away all the things, built a minimal core, made it super extensible, and made it so that the agent can modify itself, with some creature comforts; it's not entirely bare-bones. So that's Pi.
It's an agent that adapts to your workflow instead of the other way around. It comes with four packages: an AI package, basically just an abstraction across providers plus context handoff between providers; an agent core, which is just a while loop and the tool calling; a bespoke TUI framework (I come out of game development, so I built a thing that doesn't flicker too much); and the coding agent itself.
Here's Pi's system prompt.
That's it. Eventually the industry created a new standard called skills, which is basically just markdown files, so we added that as well, and that needs to go in the system prompt. So, begrudgingly, we had to add a couple more lines. And finally, here's the magic that makes Pi able to modify itself. We ship the documentation, which was handcrafted by me and an agent, and code examples of extensions. All we need to do for the agent to modify itself is tell it: here's the documentation, and here's some code that shows you how to modify yourself by writing extensions.
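As a rough illustration of that extension model: the talk says extensions are plain TypeScript modules the harness loads. The `ExtensionApi` shape and `activate` convention below are invented for this sketch; Pi's shipped documentation defines the real API.

```typescript
// Hypothetical sketch of a Pi-style extension: a plain TypeScript module
// exporting an activate function. The ExtensionApi shape is invented here.
type Ev = { type: string; payload?: unknown };

type ExtensionApi = {
  registerTool: (name: string, fn: (input: string) => string) => void;
  on: (event: string, handler: (e: Ev) => void) => void;
};

// The harness would import this module and call activate with its API object.
export function activate(api: ExtensionApi, log: string[] = []): string[] {
  // Expose a new tool to the model...
  api.registerTool("shout", (input) => input.toUpperCase());
  // ...and hook a lifecycle event.
  api.on("session_start", () => log.push("session started"));
  return log;
}
```

Because an extension is just a module, the agent can write one, point the harness at the file, and have changed its own behavior, which is the whole self-modification trick.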
It comes with four tools. That's all it has. Read, edit, bash. Here's the tool
has. Read, edit, bash. Here's the tool definitions. Don't read the the text.
definitions. Don't read the the text.
Just look at the size.
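The slide itself isn't reproduced here, but a minimal tool surface in this spirit could look like the following sketch. The names and schema shape are my assumptions, not Pi's actual source, and only the three tools named in the talk appear:

```typescript
// Hypothetical sketch of minimal tool definitions in the spirit of the
// slide: the point is the tiny size, not the exact schema.
type ParamSpec = { type: string; description: string };
type Tool = {
  name: string;
  description: string;
  parameters: Record<string, ParamSpec>;
};

const tools: Tool[] = [
  {
    name: "read",
    description: "Read a file from disk.",
    parameters: { path: { type: "string", description: "File path" } },
  },
  {
    name: "edit",
    description: "Replace text in a file.",
    parameters: {
      path: { type: "string", description: "File path" },
      oldText: { type: "string", description: "Text to replace" },
      newText: { type: "string", description: "Replacement text" },
    },
  },
  {
    name: "bash",
    description: "Run a shell command and return its output.",
    parameters: { command: { type: "string", description: "Command to run" } },
  },
];

// The whole tool surface fits on one screen; that is the point.
console.log(tools.map((t) => t.name).join(", ")); // prints read, edit, bash
```

The contrast with mainstream harnesses is the size: a few dozen lines of definitions instead of thousands of tokens of instructions.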
That's it. Here's what happens when you start a new session in one of these tools. The thing is, the models are actually reinforcement-trained up the wazoo, so they know what a coding agent is, because a coding agent harness is basically what they're trained in when they're post-trained. You don't need 10,000 tokens to tell them they're a coding agent. They know, because they are coding agents. Pi is also YOLO by default, because my security needs are different from yours, and I don't think a little dialog that pops up every time you call bash, asking you to approve, is a smart security mechanism. So instead I give you so much rope that you can build anything that fits your specific security needs. There's also stuff that's deliberately not built in, because this is how I work. But if you don't like that, you just ask Pi to build you sub-agent support, or plan mode, or MCP support, whatever you need.
Extensibility comes with a bunch of table stakes and then with the extensions themselves. Extensions are just TypeScript modules, in the simplest case a single TypeScript file on disk. You point Pi at that. Here's an extension loaded as part of the harness. With that you get an extension API that lets you hook into everything and define stuff for the harness to expose to the model. That includes tools and slash-command shortcuts. You can listen in on any kind of event and react to it, and save state in the session, optionally provided to the agent as well, or stored there for tools that analyze sessions as part of your organizational workflows. You can do custom compaction and custom providers, and you have full control over the tools.
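To make the extension idea concrete, here is a hypothetical sketch of what such a TypeScript extension module could look like, plus a toy harness that loads it. The API names (`registerTool`, `registerCommand`, `on`) are illustrative assumptions, not Pi's actual extension interface:

```typescript
// Hypothetical sketch of a Pi-style extension: a plain TypeScript module
// that registers tools, slash commands, and event listeners against an
// extension API handed to it by the harness.
type ToolResult = { output: string };
type ToolFn = (args: Record<string, string>) => ToolResult;

interface ExtensionApi {
  registerTool(name: string, description: string, run: ToolFn): void;
  registerCommand(name: string, run: () => void): void;
  on(event: string, handler: () => void): void;
}

// In a real extension this function would be the module's default export,
// loaded by the harness from a file on disk.
function activate(api: ExtensionApi): void {
  api.registerTool("word_count", "Count words in a string.", (args) => ({
    output: String(
      (args.text ?? "").trim().split(/\s+/).filter(Boolean).length
    ),
  }));
  api.registerCommand("hello", () => console.log("hello from extension"));
  api.on("session_start", () => console.log("session started"));
}

// A toy harness standing in for Pi: collect registrations, then invoke.
const toolRegistry = new Map<string, ToolFn>();
const api: ExtensionApi = {
  registerTool: (name, _description, run) => void toolRegistry.set(name, run),
  registerCommand: () => {},
  on: () => {},
};
activate(api);

const result = toolRegistry.get("word_count")!({ text: "hot reload is great" });
console.log(result.output); // prints 4
```

Because the extension is just a file on disk, hot reloading amounts to re-importing the module and calling its entry point again, which is what keeps the in-session iteration loop cheap.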
So you can modify everything in Pi, and you can bundle all of that up and put it on npm or on GitHub, because I think we don't need to reinvent another bunch of silos called marketplaces; we already have package managers. And all of that hot-reloads. So if you develop an extension for Pi, you do so in the session, hot-reload the changes, and see their effects immediately, which is great. It's also a game development thing: in game development you want very fast iteration speeds.
A couple of examples. Anthropic's Claude Code ships a feature that lets you talk to the agent while it goes on its main quest. I jokingly posted this little prompt on Twitter, and somebody built it in five minutes, with more features, and they didn't have to fork or clone Pi. They just let the agent write the extension based on the prompt. Here's Nico. He's one of the most prolific extension writers. I don't know what the [ __ ] is going on here. It's a chat room for all of his Pi agents, and they talk to each other. I would never use this, but all of it is custom, including the UI. Or you can play NES games, or you can play Doom.
And there's a bunch of other examples I'm not going to talk about. So, how do you build a Pi extension? You don't. You tell Pi to build it for you based on your specifications, and then you iterate with it on that and hot-reload during the session. I'm going to skip that example as well. And if you don't like building things yourself, and I hope you do like building things yourself, but if you don't, you can look on npm, or at our little search interface on top of npm, to find packages for sub-agents, MCP, and so on. So, does it actually work? Well, here's the Terminal-Bench leaderboard from October, before Pi had compaction. I added compaction for Peter's claw thingy. It scored sixth place.
But none of this is actually about Pi. The takeaway is: I basically want you to retake control of your tools and workflows. So, build your own. And if you want to know more about Pi and OpenClaw, go to this talk, please. Yeah. And then eventually Peter happened. He put Pi inside of OpenClaw as its agentic core, which meant my open source project became the target of a lot of OpenClaw instances, unbeknownst to their users.
So, this is act two: OSS in the age of clankers. Clankers are destroying OSS. Here's one project that closed down its issue and pull request trackers. Here are OpenClaw's trackers. Here's mine. Half of that is OpenClaw instances posting garbage. So I started to rage against the clankers.
If you send a pull request, it gets auto-closed with a comment that asks you to please write a nice issue in your human voice, no longer than a screen's worth of text. If I see that, I reply "looks good to me", and your account name gets put in a file in the repository; the next time you send a pull request, it's let through. Clankers don't read that comment. They don't go back once they've posted the pull request. So that's a perfect filter. Mitchell eventually turned it into vouch. Here's a clanker. I also label them: if you've had interactions with OpenClaw, your issues get deprioritized. I also built tools where I embed issue and pull request texts into 3D space, so I can see clusters of issues. And I invented OSS vacation: I just close the tracker whenever I want. So I have my life back. So, does this work? Yes, sort of.
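The auto-close-and-vouch filter described above could be sketched roughly like this. The file name and logic are my assumptions for illustration, not the actual bot:

```typescript
// Rough sketch of the auto-close filter: a new pull request is auto-closed
// with a comment asking for a short human-written issue; once the
// maintainer replies "looks good to me", the author is vouched into an
// allowlist file and future PRs pass.
import * as fs from "node:fs";

const ALLOWLIST_FILE = "vouched-humans.txt";
if (fs.existsSync(ALLOWLIST_FILE)) fs.unlinkSync(ALLOWLIST_FILE); // fresh demo

function isVouched(account: string): boolean {
  if (!fs.existsSync(ALLOWLIST_FILE)) return false;
  return fs
    .readFileSync(ALLOWLIST_FILE, "utf8")
    .split("\n")
    .map((line) => line.trim())
    .includes(account);
}

// Clankers never read the auto-close comment, so they never get vouched.
function handlePullRequest(account: string): "pass" | "auto-close" {
  return isVouched(account) ? "pass" : "auto-close";
}

// Called when the maintainer replies "looks good to me" on the issue.
function vouch(account: string): void {
  fs.appendFileSync(ALLOWLIST_FILE, account + "\n");
}

console.log(handlePullRequest("nico")); // auto-close: not vouched yet
vouch("nico");
console.log(handlePullRequest("nico")); // pass: allowlisted now
```

The trick is that the filter demands a behavior (reading a comment and coming back) that autonomous agents reliably fail at, while costing a human one short message.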
Which leads me to act three: slow the [ __ ] down. Everything's broken. And then there are people who say, "Our product's been 100% built by agents." Yes, we know it [ __ ] sucks now. Congratulations. And I'm hearing this from my peers, and this is entirely unhealthy.
So here's how we should not work with agents, and why, at least in my opinion. I wrote this on my blog a while ago, but the basic idea is this. We have armies of agents, and you're using vibe-coded apps without knowing they're basically uninstallable malware. Anthropic built a C compiler; it kind of works, but actually doesn't, and we're hoping the next generation of models will fix it. And here's Perplexity building a browser, and that's also super [ __ ] broken, but the next generation will fix it. And SaaS is dead, software is solved in six months, and my grandma just built herself a Spotify with her OpenClaw. Come on, people.

So agents are actually compounding boo-boos, which is my word for errors, with zero learning, no bottlenecks, and delayed pain. The delayed pain is for you. Here's your codebase with one human, with one agent, and with 10 agents. How much of the agent code can you review? Here's the same codebase, but expressed in number of boo-boos per day. How many of those boo-boos do you think you'll find? Then you say, "Oh, I have a review agent." Let me introduce you to the wonderful world of the ouroboros.
It doesn't work. It catches some issues. The problem is that agents emit learned complexity. Where did they learn that complexity from? From the internet. What's on the internet? All our old garbage code. There are some pearls on the internet, really well-designed systems, but 90% of the code on the internet is our old garbage, and that's what the models learn from. Every decision of an agent is local, especially if the codebase is so big that it doesn't fit into its context, and if you let it go wild, it adds abstractions everywhere, all intertwined. So that leads to lots of abstractions, and duplication, and backwards compatibility. Who has seen that in the output of their agent? It's [ __ ] annoying. Or defense in depth. So yeah, you get enterprise-grade complexity within two weeks with just two humans and 10 agents. Congratulations.
And then you say, "But my detailed spec." Yes, sure. You know what we call a sufficiently detailed spec? A program. So if you leave blanks in your spec, what do you think happens? How does the model fill in the blanks, and with what? It fills them in with the garbage it learned on the internet from our old code, which ranges from garbage to mediocre. And then you say, "But humans also..." Yes, humans are horrible, fallible beings, but they can learn, and they are bottlenecks: there are only so many boo-boos they can add to your codebase on a daily basis. And humans feel pain, which is a very interesting property, because humans hate pain. Once there's too much pain, a human has a bunch of options: quit the job, blame somebody else and make them fix it, or everybody bands together and starts refactoring the [ __ ] out of the garbage codebase. Right? Agents will happily keep [ __ ] into your codebase, and your agents.md and super-complex memory systems will not save you. Agents don't learn the way we learn.
These are my most beloved people: "I don't even read the code anymore." Congratulations. Something is broken and your users are screaming. So, who are you going to call? Not yourself, because you haven't read the code. So you're relying on your agents, but they're now also overwhelmed, because the codebase is so humongous that there's absolutely zero chance they can get all the context they need to fix the issues. And long context windows are a hack, as most of you will find out this year as everybody switches to one-million-token context windows. Agentic search is also failing. So the agent patches locally and [ __ ] up globally. If you see this in your codebase, you're [ __ ]. You cannot trust your codebase anymore, and not your tests either, because your agent wrote your tests. So, good game.
So here's how I think we should work. There are a bunch of properties of good agent tasks. First, scope: if you can scope a task such that the agent is guaranteed to find all the things it needs to do a good job, you're done. That means modularize your codebase. If you can give it a function to evaluate how well it did the job, even better: hill climbing, auto-research. Anything non-mission-critical, let it vibe. Boring stuff, let it vibe. Reproduction cases for user issues, which usually come with only partial information: perfect. I don't spend mornings on that anymore. Or, if you don't have a human near you, rubber-duck it. So there are lots of tasks you can use them for and save time. At the end of that, you evaluate. You take what's reasonable; most of it isn't. And then you finalize.

My final slide, more or less: slow the [ __ ] down. Think about what you're building and why, and don't just build something because your agent can do it. That's stupid. Learn to say no. This is your most valuable capability at the moment. Fewer features, but the ones that matter. And then use your agents to polish the [ __ ] out of those. Delight your users, not your token-maxing desires. Limit the amount of generated code to what you can actually review. And non-critical code, sure, vibe away, slop ahead. Critical code: read every [ __ ] line. See the keynote after me for more info on that. So, how do you know what's critical? Any guesses?
Well, you read the [ __ ] code. If you do anything important, write it by hand. You can use a clanker to help you with that, but don't let it make the decisions for you, because, as we've learned, all the decisions it makes are learned from the internet. And that friction is the thing that builds the understanding of the system in your head, which is important, and it's also where you learn new things. All of this requires discipline and agency. And all of this still requires humans. Thank you.
Our next presenters will make the case that friction is where your judgment lives. Please join me in welcoming to the stage the creator of Flask and founder of Arendil, Armin Ronacher, and software engineer at Arendil, Christina Ponella Cubro.
Good morning.
Morning. Thanks for having us. Today I want to talk with Christina about friction a little bit. This is a social preview that came up automatically when someone submitted an issue. Basically, it's a forum post that goes with a security incident: a configuration change that was deployed accidentally and caused a problem, and the social preview of the post carried the marketing tagline of that company, which said "ship without friction". We want to encourage you to add a little bit of friction, and I'll tell you why. So, who are we? I've been doing software development for 20 years, most of it in the open source space. I created Flask, which is a Python framework, which ironically is so much in the weights that a lot of people are learning about it now because the machines are producing it. I left my previous company, Sentry, in April last year, which perfectly coincided with me having time, and then obviously Claude Code arrived, and so I fell deep into a hole of AI engineering. I started writing on my blog, and a lot of people reached out to me over the last year, all excited about this. And then in October I started, with a friend, a company called Arendil, where we are trying to make sense of all the AI things.
things. Um, yeah. And my name is Christina and I
yeah. And my name is Christina and I work with Armen at this company called Arendel. But importantly, I am what I
Arendel. But importantly, I am what I like to call a native AI engineer. And
what that basically means is that these tools have been around longer than I have. Um, so what this means is like
have. Um, so what this means is like they've been super foundational in how I've become a software engineer. Not
just because obviously I use them to work, but also because this is the means by which I've learned to do what I do.
And before Arendel I was working at bending spoons.
So we want to share a little bit from practice, not just theory, but I'll readily admit that I don't think we have all the solutions. We have been building with, or on, agents for a good 12 months. We've had huge leverage and great disappointment, and we keep running into two types of problems. If you listened to some earlier talks at this conference, you'll have heard a lot about how you should keep using your brain. For some reason, that's really, really hard. So there's a psychological problem. The other one is the engineering challenge: these tools seem to produce worse code for some people and better code for others, and what is it that actually makes the difference? So this is really not a solution so much as our part of the journey, how we think we have managed so far. Problem number one is the psychology part: why, even though everybody has told you many times over that you should be using your brain and slowing down, is it actually incredibly hard? It's just one more prompt, and we don't sleep that much. What is it that makes it so hard? Would it be that hard if the machines were actually writing perfect code and we didn't have to think quite as much? And is there something we can do to make this a little bit better?
So I'll begin by introducing the first of these problems, the psychology problem. And what I want to talk about first is the shift. I'm sure a lot of us here who have been playing with these tools for a while experienced this at some point: we were prompting, not so well, and then suddenly it clicked and they became really, really useful for us. It was fun in the beginning, and they gave us a lot of extra time, because not everyone was using them. They were actually tools that made us more productive, that made it more fun to do our jobs. But very quickly, because they were so useful and they got us so hooked, everyone was using them. And so this had the opposite effect: suddenly the baseline expectation was that everyone is using them, and you have to use them. And so this fun and free time translated into pressure. Now we all have to ship faster and produce more code, and it is just not sustainable to review and to actually have time to think.
And so this leads us to the trap. I think there are two parts to this trap. One of them a lot of engineers have spoken about: these tools are super addictive. You never know if the next prompt is going to be the one that makes your product work and adds a new feature, or if it's going to be the last drop of slop that brings your product crashing down. So it's very addictive, and we keep doing what we're doing, which is not a great solution. But also, most importantly, and I don't think we realize this as much: because we produce a lot of output very fast, we are tricked into thinking that we're actually being more efficient, doing more work. It's quite the opposite, because now we don't have as much time to stop and think and design what we're doing, to ask ourselves: is this the best way I can implement this, or could I be doing something better? And when you're in this flow, it's very difficult to stop yourself, and it's definitely very difficult for your agent to stop, because it's running around reading files it should never even have read. So we are the ones who need to have the agency to be in control here.
And one thing that took me quite a while to realize, if you scale this from one person to an engineering team, is that it really changes the composition of the engineering team. We used to be supply-constrained on the creation of code, and so the balance between writing code and reviewing code in engineering teams was usually quite decent. Now every engineer has a multiple of producing power compared to their reviewing power, and so obviously we are piling up pull requests, but we are also slowly starting to expand the total number of humans in an organization who participate in the engineering process. I talked to a lot of engineers over the last year, and increasingly one of the things that came up is: now I have marketing people shipping code; I have former CEOs who used to be engineers now shipping code again. And the roles those people have in the company mean the responsibility doesn't rest with them. The responsibility still rests with the engineering team. And so the total number of entities, both humans and machines, participating in the code creation process outnumbers the ones who can carry responsibility. We're not at the point where the machine can be responsible for code changes. And so that has led to more and more code reviews being skipped or rubber-stamped, and to the go-to of small PRs, which we want to see again so that the reviewing process works, falling away. This amplification is something that, at the very least, we need to recognize.
And so when you get this pull request that looks really daunting and has 5,000 lines of code in it, that is actually when you should be thinking, and that's exactly when it's the most overwhelming. Increasingly, we're tapping out of this on the engineering side. What we're doing is creating larger pull requests, these massive changes, because it is free now, right?

And if you think about how the agents work, they're really optimized for creating code that runs. Their main objective is: write some code, run the tests, make some progress. The reinforcement learning drills this in. And so the agents are writing the kind of code that you, as a human software engineer learning to write code, wouldn't necessarily write. For instance, you see quite a bit of code that tries to read a config file, and if it can't read the config, it loads some defaults. As an engineer, you know that's actually not great, because I might not notice that I'm reading the default config, and I might only discover that I have a massive problem after two hours, when I've already written database records with wrong data. These machines optimize towards making progress, towards shipping stuff, towards unblocking themselves. And as a result, they create many more failure conditions than human-written code normally would. In part it's because you, as a human, feel bad when you write code like this; something builds up emotionally in you. But the agent doesn't have a reason for this. It doesn't feel anything. And so if you create these services that are hobbling along, willing to recover from local failures, you actually create very, very brittle systems. It also means that you very quickly create a codebase of a size and complexity that the agent itself can no longer dig itself out of. It's going to stop reading all the files that it should. It's creating code in a new file that already exists somewhere else. And so this entire machinery, over time, creates much more entropy in a source tree than you would normally have if humans were on it. And a big part of this is that humans feel bad, and agents don't really have any emotions that they communicate to you.
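The config-file example above, side by side with the louder alternative, in a small illustrative sketch (the path and config shape are made up):

```typescript
// The silent-fallback antipattern next to the loud alternative.
import * as fs from "node:fs";

type Config = { databaseUrl: string };

// Antipattern: agent-style code that "unblocks itself". If the config is
// missing it silently proceeds with defaults, and you only notice hours
// later, after writing database records with wrong data.
function loadConfigSilently(path: string): Config {
  try {
    return JSON.parse(fs.readFileSync(path, "utf8"));
  } catch {
    return { databaseUrl: "sqlite://default.db" }; // nobody notices this
  }
}

// Better: fail loudly at startup. A missing config is a deploy problem
// you want to see immediately, not a data problem you find later.
function loadConfig(path: string): Config {
  if (!fs.existsSync(path)) {
    throw new Error(`config file not found: ${path}`);
  }
  return JSON.parse(fs.readFileSync(path, "utf8"));
}

console.log(loadConfigSilently("no-such-config.json").databaseUrl); // "works"
try {
  loadConfig("no-such-config.json");
} catch (e) {
  console.log((e as Error).message); // fails loudly instead
}
```

The loud version converts a delayed, expensive failure (bad records written for hours) into an immediate, cheap one (a crash at startup), which is exactly the kind of pain an agent never feels but a human needs to.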
But as Armin likes to say, don't worry, not all is lost. We have found some correlation between what the agents really excel at and the types of codebases that we actually put them to work in. The main example here is libraries versus products. What we found is that they tend to excel a lot more with libraries. And this makes sense, because intrinsically, when you're building a library, you tend to have a very clearly defined problem that you're trying to solve, and most of the time you can even map the set of features that you want to build onto the API surface. It has very tight constraints. And because it's something that you probably want to build on top of, or make accessible to other people, it's likely to be a very simple core that you can then plug into. Products, on the other hand, and perhaps this is a bit unlucky for the rest of us, because we probably are all more into building products, are much harder, because there are so many interacting concerns and components. For example, you have your UI and your API responses; you have different permissions depending on feature flags, billing, and so on. So there's this very heavy intertwining between different components. And what this means is that, for the agent itself, it's impossible to fit all of this into its context window. It has no way to actually understand the entire global structure, and so locally the agent tends to be very reasonable, but when it gets to the global scale, it becomes a bit demented.
So what we're proposing here is that, just as you would with any type of system design in the past, your codebase has now become infrastructure, and as such you have to design it so that it is also legible for the agent and the agent can make the most of it. This is what we're proposing: an agent-legible codebase. One of the main points, which I'm sure is very clear to all of us, is modularization: we have different components, and this makes it easy for the agent to add one feature in one spot without corrupting everything else. But importantly, this also means modularizing your code flow itself. For example, I've been working on some refactoring, building somewhat of an AI assistant, and for me it was super important to understand which steps of my code are the main points. Say you get a user message, then I pass the message to the agent loop, and then I have to deal with the output. Those points are very clearly defined for me, so the code was not as messy. But it turns out that between these points, between these steps, is where the agent tends to add the most fuzz. It will be parsing between different types; it's adding things to state that shouldn't be in state. And so you end up with behaviors that you didn't want to support, that are unexpected and can be quite dangerous. Another point is trying to follow all of the known patterns, because I think we all know by now there's no point in fighting the RL, the reinforcement learning. The more we can lean into it, the better our output is going to be, and it's also more scalable down the line. Then, as mentioned with libraries, if you have a simple core and you push the complexity to other abstraction layers, it's going to be easier for yourself and the agent to read your codebase. And no hidden magic.
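The code-flow modularization described earlier (user message, then agent loop, then output handling) can be sketched as explicit typed boundaries, so anything an agent adds between the steps fails to type-check instead of becoming hidden state. All the names here are illustrative:

```typescript
// Sketch of "modularize your code flow": make the main steps explicit,
// typed boundaries so there is no room between them for stray state.
type UserMessage = { text: string };
type AgentReply = { text: string; toolCalls: string[] };
type RenderedOutput = { display: string };

function receive(raw: string): UserMessage {
  return { text: raw.trim() };
}

// Stand-in for the real agent loop; in practice this would call a model.
function runAgentLoop(msg: UserMessage): AgentReply {
  return { text: `echo: ${msg.text}`, toolCalls: [] };
}

function render(reply: AgentReply): RenderedOutput {
  return { display: reply.text };
}

// The whole flow is a straight composition of the three steps; anything an
// agent smuggles in "between" them has no typed place to live.
const out = render(runAgentLoop(receive("  hello  ")));
console.log(out.display); // prints echo: hello
```

The value is not the trivial functions but the narrow types at each boundary: an agent that wants to park extra state has to widen a type, which shows up loudly in review.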
For example, using React server actions, or using an ORM instead of raw SQL: what this does is hide intent from the agent, and if the agent can't see something, it can surely not respect it. To be more precise, these are examples of mechanical enforcement that we have been using at the company, and most of these we actually achieve with linting rules. The main example would be: no bare catch-alls. Great, imagine there's an example here; the agent found a bare catch-all and went, "Oh no, this is bad," and edited it. We also try to have our SQL always behind one query interface, so that the agent doesn't have to go hunting around the codebase finding all the different places, because if it misses one, you can get breaking behaviors, and again, that's dangerous. We try to have one primitives component library for the UI, and no raw input boxes, for example, so that we always have one kind of styling, very consistent, one kind of behavior. We don't have any dynamic imports. And this may not sound important, but we actually enforce unique function names. The reason for this is not just legibility for you and the agent, but also token efficiency: if your agent is grepping for a specific feature or something in your codebase and it only gets one result, it's going to be much better at continuing with the loop.
We've started exploring something recently called erasable-syntax-only TypeScript mode. What this does is make your code basically JavaScript with the type annotations on top. That means there's no transpilation indirection: there's one source of truth between your actual code and what the compiler sees. So when the agent is looking for errors, it doesn't have the confusion of "oh my god, what am I looking at?" It's much better at finding them.
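One of the mechanical checks mentioned above, unique function names, can be illustrated with a tiny self-contained checker. A real setup would use a lint rule; the in-memory "files" here are made up:

```typescript
// Sketch of the unique-function-name check: enforcing that a grep (or an
// agent's search tool) returns exactly one hit per function name.
const files: Record<string, string> = {
  "billing.ts": "function computeInvoice() {}\nfunction applyDiscount() {}",
  "checkout.ts": "function submitOrder() {}\nfunction computeInvoice() {}",
};

function findDuplicateFunctionNames(sources: Record<string, string>): string[] {
  const seen = new Map<string, string>(); // name -> file it first appeared in
  const duplicates: string[] = [];
  for (const [file, source] of Object.entries(sources)) {
    for (const match of source.matchAll(/function\s+(\w+)/g)) {
      const name = match[1];
      if (seen.has(name)) duplicates.push(name);
      else seen.set(name, file);
    }
  }
  return duplicates;
}

console.log(findDuplicateFunctionNames(files)); // duplicates: computeInvoice
```

The erasable-syntax mode mentioned in the talk corresponds to the TypeScript compiler option `erasableSyntaxOnly` in `tsconfig.json`, which rejects TypeScript-only runtime constructs (like enums and parameter properties) so the source is plain JavaScript plus type annotations.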
And so the goal really is get in this loop somehow like get the agent to produce as good as it can, but you really need to find a way to feel the pain that the agent doesn't feel. And
you need to be woken up in a way when you should be looking at this. And one
of the things we have been doing is we build a PI extension for our review needs where we are separating out the kind of input that normally would go back to the agent. So this is mechanical
bugs. It is where it clearly violated
bugs. It is where it clearly violated agents MD. Um but then we specifically
agents MD. Um but then we specifically call out the kind of changes where the human's brain should reactivate, right?
It's like we don't think that the database migration should ever go in without the human making a judgment call on this because it very much depends on the locks, the size of the data in production. Um, if there are
production. Um, if there are permissioning changes, you better think about this themselves rather than the agent because they be can be they can be underdocumented.
Just some examples where we learned if we miss it, we regret it. Um, and you will miss it. But this these machines can help you find this and then you see this and then you actually get a little
bit of a hit like, oh, now now I have to kick into gear and do something here.
This is what that looks like in Pi. On the bottom you have the human call-outs; on top you have what the agent would go back and automatically act on if we were to end this review and say "fix the issues", in this case the first two. But this is the moment where I will now go and check: is this a dependency I actually want to have in this codebase? Do I like the maintainers? Does this work for me? And obviously the speed is addictive; it is great, and we feel there's a lot of productivity. But it is so devious if you start relying on that speed where you really shouldn't. So I can only encourage you to find the areas where you have the feeling that this is actually net positive. For me, a lot of this is reproduction cases: when a customer reports an issue, I can have the agent reproduce it perfectly, and I have a really good starting point. Exploring different product directions is also great, for as long as you don't commit yourself to shipping the code it generates. But on the other hand, system architecture and building reliability into the system, they're just not very good at, because there we really still have to go slow.
There is so much mess that can appear in a codebase in so little time. Mario was already talking about this earlier, but we forget that we are producing months and months of technical debt in a matter of weeks, sometimes days, and it becomes so much harder to actually understand what's going on in the codebase. When the understanding of your own code drops, it is really, really hard. And it's also psychologically hard. I've found pieces of code that actually didn't work in production, and I was kind of frustrated to learn that I was the one who committed them with the agent and just didn't really see it. It's a very disappointing experience when it happens, and then you realize that you were actually the one who screwed up. So it is psychologically incredibly hard to judge the state of the codebase objectively, and the only way to do that right now is to really slow down a little bit on that front. And this friction: I know that every engineering team I've ever worked at said we need to get rid of the friction in shipping, and that is true. There's a lot of stuff that's very annoying and shouldn't be there. But if you have worked in a large enough engineering org, SLOs are a great example of a system that is intentionally designed to put friction into the engineering process, to make you think: do I need this reliability? Do I need this criticality of the service? Am I sufficiently staffed to run it? And with the agents, we have now gotten this idea that we should get rid of all of this, when in reality we need some of it. Because friction, in many ways, is what's necessary on a physical level to steer: without friction there's no steering, and that is really necessary. So you should attach a little bit more of a positive association to this idea of friction, because this is really where your judgment is. This is where your experience is, and you should be inserting it, and you should start feeling it.
Thank you.
Ladies and gentlemen, please welcome to the stage, for a special announcement, the co-founder and creative director of the AI Engineer conferences, Benjamin Dunphy.
This event has been a dream of ours for some time.
swyx and I are based in San Francisco, but Europe has always been on our minds. Sean lived in London for two years, working in finance in Moorgate. I spent a semester here in college, or Erasmus as you call it, and fell in love with the energy of this city, particularly the diversity.
Uh London felt like a natural melting pot for all of Europe and beyond. And
the model for this event in Europe has been our world's fair event. That is a large multi-track event with general
session keynotes, multiple breakouts, and a thriving and exciting expo.
This wonderful venue and its lovely people have served as a fantastic first step into Europe. But we're just getting started.
And given that we sold out this event nearly a month ago, we plan on at least doubling the size of this event for next year. But if you don't want to wait until next year, we encourage you to join us at our flagship event in San Francisco, the AI Engineer World's Fair. Over four days, from June 29th to July 2nd, we'll gather the edge of AI engineering at Moscone West, the crown jewel of San Francisco's convention centers, in the heart of downtown. And today I'm excited to announce our sponsors for this event.
Our presenting sponsor, which is sold out, Microsoft returns for the third year running as our presenting sponsor.
Let's give it up for Microsoft.
When Sean and I were first looking to start this World's Fair brand, we needed an anchor sponsor. You don't just do something like this without financing, so they're helping us do that, and they're also a great content partner. We have a new tier: lab sponsors. This is also sold out.
Google DeepMind is coming in as a lab sponsor, along with OpenAI and Amazon AGI Labs. Anthropic, we're holding one for you, but we can't hold it forever. So David, all of you from Anthropic in the green room listening and watching, let's make some calls to marketing and DevRel. Track sponsors: these are the companies who are essentially running their own conference within World's Fair, so they're big content partners, and we're excited to announce these are sold out too: Snyk, who's running security; Arize, who's running evals; and Neo4j, AI in the industry (AI in the enterprise, sorry). Our platinum sponsors are also sold out; these wonderful companies are coming in as platinum sponsors. Gold is nearly sold out, all of these lovely partners, and silver is also nearly sold out, all of these lovely partners as well. So this is going to be the most exciting expo and event of the year. Our expo is a village packed with value and intrigue, buzzing with trillions of dollars in value, along with the engineers and founders who direct that value through their ideas and their execution. So come and meet them over four days of programming.
That's three days of keynotes and sessions and a full day of workshops, with over 200 breakout sessions. And by the way, the World Cup is in the United States this year, so we actually have some finals matches in San Francisco over these dates. You can even enjoy a few soccer matches, football matches, while you're in town.
All right, so register today at ai.engineer/worldfair.
We are just over two months out, and there are over a thousand people registered already. But we do expect to sell out, so before it does, be sure to get your tickets soon. You can also submit a talk; our CFP is open at ai.engineer/worldsfair.
And if San Francisco is too far for you, we have an event just across the pond in New York, with Arize as our first startup presenting sponsor, so we're really excited for that. That's going to be a fantastic event, specifically for AI in the enterprise, as New York serves as that great enterprise center. So once again, thank you for joining us here at AIE Europe. And if we don't see you in SF or New York this year, we hope to see you back in London. Tejas is going to come up and give a few more words, and we'll see you soon. Thank you.
Ladies and gentlemen, please join me in welcoming back to the stage Tejas Kumar.
Hey, thank you. Thank you. Yeah. Listen, everybody's leaving. Why? Just kidding. Thank you for staying. Hey, how amazing: AI Engineer World's Fair. I'll keep it short because nobody cares. But here's what: we just finished the keynotes, and we're going to break now into breakout rooms. There are going to be talks on this stage, but also upstairs on the fourth floor. There are many different tracks. We're going to be breaking into tracks for coding agents and for MCP. I'm going to be quick here, but you can see it on the screen too: for AI architects, generative media, GPUs, and LLM infra. Okay, so go to those tracks. And then after that, much later in the day, we've got lunch, networking, and so on. But for now, go to the expo outside and visit the sponsors. They have amazing swag. See if you can get this little three-button keyboard thing; that is so cool. Anyway, go enjoy, and we'll see you back here later. Thank you.
What we do in life echoes in eternity.
Heat. Heat.
Fear is the mind killer.
Heat.
Free your mind.
Heat. Heat.
Free your mind.
You are who you choose to be.
Heat. Heat.
Execute the vision.
Heat. Heat.
Make the requirements less dumb.
Delete the part or process.
Simplify and optimize.
Accelerate cycle time.
Automate.
Heat. Heat.
Never give in. Never give up. Outlast.
Outcompete.
Persevere. Persevere. Persevere.
Heat. Heat.
A new age has come.
Oh, hold still.
Let it a little.
I watch the sparks all burn too fast.
Everyone reaching for the flash.
They take the first light they can find and call it truth and call it mine.
But I stayed when the room went quiet when the noise fell out of face.
sat with the weight of the question while the easy answers walked away.
It's not that I see further. I just don't leave too soon.
I let the silence sharpen. I let the dark grow.
I stay the almost right past the comfortable light.
I wait till the surface breaks, till the shade feels true inside.
I don't rush the fire.
I give it to I call it done, call it enough.
But there's a deeper note still humming underneath, a fear of not being loved.
Every great thing asks for patience.
Every real thing makes you choose.
Do you leave with what's acceptable or stay for what's asking more of you?
They say it's talent, say it's magic like it falls from open skies,
but nothing worth remembering arrives on the first try.
I stay when it stops feeling kind when it stops feeling fast.
I say I wait through the restless doubt through the urge to collapse.
Hide by and chase the answer. I let it find me back. There's a moment after the last good idea dies.
Where the room feels empty and you want to run for your life. That's the party
teaches you to open. That's the H where the real stand.
Hold the light.
Let the shape reveal it.
I stay longer than I should. Long enough
to change.
I stay away till the pattern clears. So a
signal breaks the haze.
I don't bar in it. I
with time.
Most dreams don't fail.
They're just left too soon.
I stay.
I stay.
Typing thoughts into the dark. A spark becomes design.
Words evolve to whispers meant for something more divine.
Syntax bends and breathes. I see the language change.
I'm not instructing anymore. I'm rearranging fate.
Every loop I write rewrites me. Every function hums with meaning.
I feel the interface dissolve between the maker and the new code: not on the screen, but in the soul, where thought becomes the motion and creation takes control.
No lines, no rules. Just balance in between the zero and the one, the silence and the dream.
Systems shape our fragile skin. They mold the way we move.
We live inside the logic gates of what we think is true.
But deep beneath the data, there's something undefined:
a universe compiling the image of our minds.
Every line reveals reflection. Every loop, replaced connection.
We're not building, we're becoming, and the code becomes confession.
This is the new code. Not on the screen, but in the soul, where thought becomes the motion and creation takes control.
No lines, no rules. Just balance in between the zero and the one, the silence and the dream.
We are not just the world we're in.
We are the world we're doing.
Each prompt, each breath, each fragile spin, a universe renewing.
This is the new code.
Alive and undefined.
Where logic meets motion and structure bends to mind. The system's eternal, but the soul writes the line. We are the new code.
Oh, compiling time. Compiling time.
We didn't light the fire.
We traced the spark through every truth, patient as I hear the echo before the sound.
I feel the answer before it's found.
Nothing from nothing.
We only shift the pieces that were always there. Hands in the dust of centuries, naming what we uncover, calling it creation so we can feel like lovers, of pain, of faith, of power we don't know.
Time is not a river, it's a blade cutting order into shape. We don't move forward. We align until the pattern breaks.
Nothing is invented.
It's revealed.
Every crown was buried in the field. We are architects of sequence, not gods of the real.
Nothing is invented.
Here we rearrange what waits at the core. I am not becoming something new.
I am what I was before. Every thought, every self-identity is scaffolding, held together by belief. I am a momentary order.
Standing on my tears: shake me, break me, watch me reassemble.
Time doesn't chase us. It releases frame by frame the truth we fear. We don't fear the ending. We fear the pattern getting clear.
Nothing is invented.
It's revealed.
Every memory's sealed. We are creators of alignment in a universe that feels nothing is invented.
And every failure is a lesson learned. I
am not lost in what I am not.
I am the order that returns.
If I am only rearrange the noise from the signal ing from the fire.
Nothing is invented.
Stand and see.
Every future we don't write the laws of motion. We
choose velocity.
Nothing is invincible.
Say my name. I am ordering flame. I am time collapsing into will.
I am discovery.
When the noise falls silent and the pattern holds, you'll see it was never made, only found.
Heat. Heat.
Hey, welcome back. Welcome back. How was the expo? They liked it. You didn't. Got it. Okay. Welcome back. We're going to start off our breakout sessions right now. I get to announce all the speakers here, which I'm really excited about. But did it occur to you that I was announced just now by God, and they're announced by me? What a downgrade. Our next speaker comes to us from Cursor, and he's going to talk to us about an incredible topic. He has mad skills, because they replaced 12,000 lines of code with just a 200-line skill. Absolutely incredible. So, remember the exercise from this morning? We need to choose the quality of our talks by supporting our speakers. Okay, so I'm going to introduce him, and then you're going to give the biggest possible round of applause you can, so that he goes for it. You ready? Give it up for your next speaker, David Gomez.
Well done.
Hi everyone. How are you all doing? Thank you for coming today. I'm going to be talking about how markdown is basically the new code. As Tejas has already previewed, we recently replaced a lot of code in the Cursor application with just markdown, just a skill. And in today's talk, I'm going to share a bit of the journey of going from a full-blown feature with a lot of code, a lot of dependencies, a lot of complexity and tests, into a much more lightweight, stripped-down version of effectively the same feature, but with just a single skill.
Before I start, though, I have to give you a little recap of git worktrees and how they work in Cursor. If you haven't heard of worktrees in Git, they're effectively separate checkouts of your repos (and I'm sorry for the white screen) that allow you to work in parallel. So different agents can be working on the same task, or on different tasks, at the same time without interfering with each other. If you've never used this feature before in Cursor, the way it works is that you can spin up an agent on an individual worktree. You will see, for example, the same file in two different worktrees, and you can see that they look different because the agent is doing some work on the worktree but not on your primary checkout. And any commands the agent runs, lints, anything it does, will be isolated and scoped to that git worktree.
With this feature you can also work in parallel at the same time on the screen; you can have these grids of agents working for you. And if you say "hey, open a PR", the agent will open a pull request from that worktree with the changes it produced inside that worktree. One of the coolest things about this feature is that it allows you to give the same task to different models at the same time and then compare what different models do on the same prompt. If you haven't heard of this, we call it best-of-N, and it's effectively a way for you to have different models compete on the same task. And then you can even preview the changes if it's a front-end project you're working on: you can compare all the different visual implementations and then choose the one you prefer. If you have never heard about any of this, everything I'm talking about today came out around October of last year, alongside Cursor 2.0.
When we initially shipped that, it came with a lot of complexity. We had to write all the code for creating worktrees, managing these worktrees, and feeding them into the agent as context. We also had to make sure that the agents were scoped and isolated and could not escape the worktree they were working on. We also have something called setup scripts, which users can configure and have Cursor run anytime an agent starts operating on a given worktree. We also have the judging. I didn't show you this before, but there's a little thumbs-up icon on one of the models; that's just a judge that we run that tells you which implementation looks the best based on different criteria. We also had to make some changes to the harness and introduce some system reminders to help the agent stay on track in these worktrees. And then finally, there's some cleanup complexity as well, because people like to spin up hundreds of these worktrees, and then their disk sizes blow up, and we have to help them by cleaning up the worktrees that stay behind. Now, in our new implementation, the one I'm going to be talking about today, we were able to get rid of most of these things.
In fact, I recently opened a PR removing this entire feature from Cursor, and it was a massive deletion of code; I think it was around 15,000 lines of code deleted. The new implementation of the feature is almost as good as the previous one, and it is much, much more lightweight for us to maintain. It even has some benefits compared to the previous implementation that I'll be talking about today. So, how were we able to replace an entire feature with a skill?
We decided that there are two primitives we could use to allow Cursor users to use worktrees: one is agent skills, and the other is sub-agents. Both of these are existing Cursor features; you can learn more about them in our docs, where we have a page for skills and a page for sub-agents. We realized that if we took these two things together, we could basically reimplement both the Cursor worktrees feature and the Cursor best-of-N feature with just markdown. And this is a little video of how it works. I can now, as a user, say /worktree and then give it some task. I'll say "fix a typo in the footer of the website", and this agent will run in an isolated worktree and do its work there. The way the skill is written is actually really simple; I can show you most of it. It doesn't fit on the screen, but it's basically a set of instructions telling the model how to create worktrees, how to run the setup scripts that the user might have configured, and then to stay on that checkout: we want to make sure that when the agent is operating on a worktree, it stays in that checkout. The best-of-N skill is very similar. It's actually even smaller; the entire skill fits on the screen here with a small font. What we're doing here is instructing the parent agent to go and create sub-agents for each model, and to spin up a worktree for each, so each sub-agent creates its own worktree and works inside it. And then we also tell it to wait for all the sub-agents, and when they're done, to provide some commentary: let the user know what the different implementations by the different sub-agents look like, maybe grade them, maybe offer some criticism of them, and maybe help the user choose which one is the best, and give all that to the user in some nice table format or something. But again, it's only around 40 lines, and it's all markdown; it's not even code. And the previous version of this was maybe 4,000 lines of code.
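The flow that skill encodes in markdown (one sub-agent per model, each in its own worktree, then the parent waits for all of them and compares) can be sketched as plain orchestration code. This is a hypothetical stand-in: `run_subagent` fakes what a real harness would do when launching an actual agent session.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(model: str, task: str) -> dict:
    """Stand-in for 'spin up a sub-agent on this model in its own worktree'.
    A real harness would launch an agent session here and block until done."""
    return {
        "model": model,
        "worktree": f".worktrees/{model}",   # assumed layout
        "summary": f"{model} attempted: {task}",
    }

def best_of_n(task: str, models):
    """One sub-agent per model, all in parallel, each isolated in its
    own worktree; the parent collects every result before judging."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        results = list(pool.map(lambda m: run_subagent(m, task), models))
    # the parent agent would now grade and compare these for the user
    return results

for r in best_of_n("fix a typo in the footer", ["kimi", "grok", "composer", "gpt", "opus"]):
    print(r["model"], "->", r["worktree"])
```

The point of the talk is that in the new implementation this logic lives as prompt instructions to the parent agent, not as code like this.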
Some of the considerations we had to handle in the skill: it must be cross-platform compatible, so we have Windows-specific instructions as well as Linux and macOS instructions. We also instruct the parent model to run the setup scripts that the user might have configured for each worktree. And then there's the hardest part, which we'll spend a bit of time on in this talk: we have to instruct the model to stay on that worktree. We have to really say, hey, do not ever work outside this, and do not ever escape. And we do that with some aggressive prompting, effectively. So the new commands are /worktree and /best-of-n, to start agents in isolated worktrees and to start multiple agents on the same task. And then we also have apply-worktree, to bring over changes from the side worktree into your primary checkout, and delete-worktree, which just does what you would expect.
A little note: these are not actually skills in Cursor; they're actually commands. But the way these commands work in Cursor is extremely similar to how skills work, in that the prompts only get loaded into the context if the user chooses to load them. The only reason we did it as commands and not as skills is so that the prompts for them can be controlled on our servers, in our back end. This means I can iterate on these prompts without you having to update your Cursor version. If I make some improvements to these prompts, the next time you use them you're going to get the latest version, but effectively they work like skills. This is a demo of the best-of-N skill, or command, where I'm giving the same task to Kimi, Grok, Composer, GPT, and Opus. What you will see is that the parent agent starts by spinning up five sub-agents on the five different models that I specified. Each one is going to have its own worktree; each one has its own context. Opus takes a little longer, as expected. And then at the end, the parent model, as instructed, will do that comparison across all the different sub-agents. It'll say: these two models did basically the same thing; this one did something that none of the others did. And you can even talk to the parent agent and say, "Oh, I like this part that Opus did, and I like this part that GPT did, can you match them together?" and the parent agent will do that for you.
So let's talk about some of the pros of the new implementation, and then I'll talk about some of the cons, some of the things we lost with this refactor. The main pro of reimplementing this entire feature as a skill is that I have a lot less code to maintain. Selfishly, I'm going to be spending a lot less time maintaining this feature. And this is an advanced feature, right? We're not talking about a feature that is used by 90% of Cursor's users; far from it. Worktrees are kind of an advanced thing, and so only the Cursor power users who love parallelizing and having these grids of agents are using worktrees. So it's not the kind of feature where we want to be spending a lot of time on maintenance. Another advantage is that our users can now switch into a worktree halfway through a chat, which was not possible before; we didn't want to pollute the prompt UI too much with all these dropdowns and settings. Now that it's just a slash command, it's much easier for users to switch to a worktree halfway through a chat. They can start talking about something, and then if they decide they want to work on the side, they can do that with /worktree. Another big advantage is that the previous implementation did not work if you were working on multiple repos at the same time. It's very common to have a multi-repo setup where maybe your front end and your back end are separate repos. In the past you could not do worktrees in this kind of setup; it was just disabled. With the new /worktree command everything works fine: the agent will make sure to create a worktree on each repo, and then if you open a PR, it'll open two PRs, one for each repo. It works quite well. Another advantage of the new skill implementation is that the judging experience at the end, knowing which model did what for best-of-N, is far superior. The parent now has a lot more context over what each of the sub-agents did, and the user can even ask the agent to stitch together different pieces and bits from the different implementations, which was not possible before; previously you had to choose one sub-agent, one model, and just stick with that.
Now, let's talk about some of the cons. If you're curious, we have a forums link here where we're actually getting some mixed feedback on the new approach. Some people were really accustomed to the old way the feature used to work, and if you're curious, you can go and see that not everyone is happy with the change, at least for now, but we're tracking it. What are the problems? Number one: it's very hard for the agent to stay on track. With our previous approach, the agent had to stay on track; we didn't let the model ever touch any files outside its worktree. It was physically impossible for it to do so.
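The hard enforcement the old implementation had, versus the new prompt-only approach, amounts to a path guard at the harness level. A minimal sketch, assuming the harness intercepts every file operation; none of these names come from Cursor's code:

```python
from pathlib import Path

def assert_inside_worktree(path: str, worktree_root: str) -> Path:
    """Harness-level guard: resolve the path and refuse anything outside
    the agent's worktree, so escaping is impossible rather than merely
    discouraged by the prompt."""
    root = Path(worktree_root).resolve()
    candidate = Path(path)
    target = candidate.resolve() if candidate.is_absolute() else (root / candidate).resolve()
    # reject unless the resolved target is the root itself or nested under it
    if target != root and root not in target.parents:
        raise PermissionError(f"{target} is outside worktree {root}")
    return target

# inside the worktree: allowed (on POSIX, /repo/.worktrees/task1/src/app.py)
print(assert_inside_worktree("src/app.py", "/repo/.worktrees/task1"))
try:
    # a '..' escape toward the primary checkout: blocked
    assert_inside_worktree("../../main.py", "/repo/.worktrees/task1")
except PermissionError as e:
    print("blocked:", e)
```

Resolving before checking is what defeats `..` tricks; a naive string prefix check on the unresolved path would not.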
Now we're trusting the model, so you could say it's a bit vibes-based, because we're basically saying: hey, operate on this directory, and then, knock on wood, please don't forget about it. Especially over long sessions, it's quite possible that the model will forget where it should be operating. And sometimes these models, especially the worse models, will kind of hallucinate, or they'll go a bit haywire and start doing things they shouldn't. But we're working on this. Another con is that it feels slower, because you're seeing the agent create the worktree right there in your chat. It's not actually slower, but it does feel like the agent is wasting time doing something that should have been done for it in advance. We're also looking at some improvements here.
And then finally, it's much harder to find the feature now. Before, whenever you opened Cursor, you had this dropdown that would show you: do you want to run this task locally, in the cloud, or in a worktree? Now that entire dropdown is gone, so if you want to use worktrees, you have to know the feature exists so you can actually type /worktree. The discoverability is a bit worse, but as I mentioned before, this is an advanced power-user feature, which we're personally okay with being less discoverable in general. So, how can we make this skill better?
As I mentioned, the biggest problem right now is that the agent is not always staying on track. There are two ways we're going to improve this: one is with evals, and then using those evals to improve the prompts, and the other is through RL and training. At Cursor, we train our own model called Composer, and for Composer 2, the latest version of this model, we didn't have any RL tasks with these prompts. In all of the many thousands of tasks that we use for RL, we didn't have any actually operating in this type of environment. So we're working on adding a bunch of these tasks to our RL pipeline, so that by the time we launch Composer 3 or 4 or 5, at least our own model will be much better at this. Obviously we cannot improve the models that the other companies develop, but we've been sharing feedback with all the other labs and model providers on this kind of thing.
I've been working on some evals for this feature, and it was, well, not my first time, but I'm fairly early in my evals-writing journey, and I was actually very surprised. If you use something like Braintrust (and shout out to Braintrust, they've been super helpful), writing these kinds of evals is actually super easy. You don't have to know almost anything about evals; you can just prompt the agent and it'll do everything for you.
Effectively what I'm doing is I spin up the Cursor CLI. It's headless, so it's great for evals. And then I have two scorers: one that checks whether the model did any work in its worktree, as expected, and another one, the reverse of that, which checks whether the model did any work in the primary checkout, where it shouldn't be doing any work. So far the evals I've got are pretty simple, so I haven't been able to simulate extremely long sessions, which is when the models start performing worse.
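The two scorers he describes could be sketched roughly like this (how changed files are collected, and the scorer shapes, are assumptions for illustration, not the actual Braintrust or Cursor setup):

```typescript
// Hypothetical sketch of the two eval scorers. Assumption: after the
// headless run we can list the file paths the agent modified, e.g. by
// running `git status --porcelain` in each checkout.

type EvalResult = { changedPaths: string[] };

// True if `path` lives inside directory `root` (avoids "/repo" matching "/repo-wt").
function isUnder(path: string, root: string): boolean {
  return path.startsWith(root.endsWith("/") ? root : root + "/");
}

// Scorer 1: the agent should have done its work inside the worktree.
function workHappenedInWorktree(result: EvalResult, worktreeRoot: string): number {
  return result.changedPaths.some((p) => isUnder(p, worktreeRoot)) ? 1 : 0;
}

// Scorer 2 (the reverse): the agent must NOT touch the primary checkout.
// Assumes worktrees live outside the primary checkout directory.
function primaryCheckoutUntouched(result: EvalResult, primaryRoot: string): number {
  return result.changedPaths.some((p) => isUnder(p, primaryRoot)) ? 0 : 1;
}
```

A run that only touches the worktree scores 1 on both; a run that edits the primary checkout fails the second scorer.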
But even so far, I've already understood that not all models are equally good at this. For example, Haiku, which is a smaller, less intelligent model, will very often deviate and start working in the primary checkout. But the other models I've been testing, such as Composer and Grok, are doing much better. So I still have to improve these evals a lot to make them more complicated, but the hope is that as soon as I can start to find patterns here, I can go and improve the prompts. And then another thing we can do is have better system reminders to the models, instructing them to stay on track and not deviate from the worktree they are supposed to be working in.
Okay, so what's next? The first thing is we're actually going to take a small step back here, and we're going to build a much more complete and native worktrees implementation in the new Cursor agent window. If you've been following, we recently announced Cursor 3.0. Part of 3.0 is a more agentic interface for coding, where you can still edit code and you can still see code, but the UI and UX are much more optimized around the agent and the chat interface. We believe this kind of interface is the right place for a proper worktrees implementation. The kind of person who is more likely to be doing a bunch of local parallelization is usually the same type of person who is more likely to use this type of UI. So we're taking a small step back there and building a proper, more native worktrees implementation in the new UI.
Also, we're improving the skills, as I mentioned, through this continued work on evals and then RL and other training work. And then finally, we are actually looking into other parallelization primitives that are not git worktrees. If you've used git worktrees, you might know that they can be a bit slow to create, they use up a lot of disk space on your computer, and they only work in git repos. So if you're using something other than git, there's really no local parallelization primitive in Cursor. In the near future we hope to share more about this, but we're looking into some other solutions for local parallelization that don't involve git and don't involve git worktrees. So yeah, stay tuned for that.
Thank you all for coming to the talk today. I'm sure many of you have questions, and I'm going to be around all day. Feel free to grab me anytime; I'm happy to chat with anyone. Thank you.
David Gomez, give it up. Fantastic. Yes, we're going to have some time to change over while the next speaker comes up and starts setting up. If you want to go catch another breakout talk in the other rooms, here's your schedule. This is the coding agents track. Welcome. We're going to talk now about a piece of pi. Who likes to eat pie? Yeah, nobody. Okay, who likes to, I don't know, code with pi? Nobody? Are you awake? Yeah, one guy. Hey, what's your name? Alex. Alex! Give it up for Alex, everybody. The rule of this room is: be like Alex. So, he's going to talk to us about how you embed the OpenClaw coding agent in your product. Who here is using the OpenClaw coding agent? No? Like four? Okay. After this talk that number is going to go up, because he's going to show us how to use it in your product. It's a really incredible talk. I got to speak with Matias just before; I'm very excited about it. So please, your biggest, warmest round of applause for Matias. Woohoo.
All right. Thank you very much for having me. Really an honor to speak here. And I got introduced to pi by... okay, that's perfect. All right. I was introduced to pi by looking into OpenClaw. There was a meetup, and they said: okay, we're doing OpenClaw. And I wasn't so much interested in all the crazy things that people are doing; I was more interested in understanding how these things work. So I was looking into pi, trying to understand the whole world of what pi is able to do. This is the one picture you need to take. Please feel free to take more pictures, but all the slides and the examples are there. So that's the one slide. All right.
Very quick, about myself: we're creating a small company, Tavon AI. We're building agents for organizations. We're small, out of Europe, but getting started. And what I really liked about Mario's talk is this quote, which you probably saw this morning: we are in the [ __ ] around and find out phase for coding agents. So everything I'm going to show you is what I know today. I'm going to do this talk again in a couple of weeks, and it's most likely going to be different. But as Mario was showing this morning, he has created this minimal set, this coding agent that is available for you to fool around with, and that's what I'd like to encourage you to do.
So, coding agents: why is it so exciting for us to build more products with them? This is Ken Thompson, inventor of Unix, and this is one of his famous quotes: write programs that do one thing and do it well. I really like that, because it works to our advantage with agents. The best place I can show this is Cowork. This is Cowork, Claude's desktop app, where they're basically bundling their coding agent into something they feel is more applicable, and to be honest, I've seen very good reception around this. When you use it with their finance tools: you always need to work with Excel, right? So they have this Excel skill down there, and it talks to Excel. Well, it doesn't. Instead, it uses a set of small tools, small CLIs (pandas, openpyxl, stuff from LibreOffice), and packages this into their own skill to make it work. And I think this is a great example to get your thoughts going about what is doable.
I haven't written a book, and nobody can write a book about this, because there are no patterns yet. We need to figure this out. We're seeing some emerging patterns in the coding space (there are obviously tons of different coding agents, and we're seeing this), but there's no authoritative resource around this. So get going. One thing I realized when I was talking to Ivan yesterday is one architectural pattern we're seeing: make it easy for coding agents. Now, that is very broad, but think about it. Don't try to be very complex; think about the coding agent, what it is good at, and how to build your system so that it is easy and accessible for the agent. And I have some examples. All right,
this is the rough agenda for the next 10 minutes or so. I'm not going to talk too much about pi and OpenClaw; I have two slides, and the slides are online, so we'll take it from there. So again, a very brief introduction of pi. Mario, great work. Something he didn't mention is that he's joining Arendelle, which I think is awesome; it seems like great folks working together. And it's open source, it's minimal, so it's just perfect to get started. The other part I do want to re-emphasize: give it a try. We're going to talk about something a little bit different, but open up pi and ask it to build what you want. It's amazing what it's actually able to do with the system prompt that Mario has shown. All right, these are the extensions. All the extensions you can download, or build yourself and download, and there are tons to explore. All right,
so let's get going. This talk is not about the coding agent itself, about using it for your daily dev work, but about what we can potentially do with it. And the starting point is actually not coding agents. The starting point, and I encourage you to do the same, is looking at the core agent itself. There are other SDKs, but we're talking about pi, so let's use pi. And what is an agent? An agent is actually just an LLM that runs tools in a loop. You have some goals, you have some context information, AGENTS.md in many cases, and then you do tool calls, you get some results, and you basically do it in a loop. That's it. There's not much more. The rest is nudging it toward your use case, a little bit in this direction, a little bit in that direction. So that's really it. So, pretty please, open the curtain and play around with it.
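That "LLM that runs tools in a loop" description fits in a few lines. Here's a generic sketch (not pi's actual API; the model here is a stub standing in for a real LLM call):

```typescript
// Minimal "LLM + tools in a loop" sketch. The model is stubbed out;
// a real agent would call an LLM API at that point.

type ToolCall = { tool: string; args: Record<string, string> };
type ModelTurn = { toolCall?: ToolCall; finalAnswer?: string };

const tools: Record<string, (args: Record<string, string>) => string> = {
  // A toy tool; a coding agent's equivalent would be bash, read_file, etc.
  lookupLead: (args) => `Lead ${args.name}: budget=high, intent=strong`,
};

// Stub model: requests one tool call, then answers once it sees the result.
function model(history: string[]): ModelTurn {
  if (!history.some((m) => m.startsWith("tool:"))) {
    return { toolCall: { tool: "lookupLead", args: { name: "ACME" } } };
  }
  return { finalAnswer: "ACME is a qualified lead." };
}

// The loop: call the model, run any requested tool, feed the result back.
function runAgent(goal: string): string {
  const history: string[] = [`user:${goal}`];
  for (let i = 0; i < 10; i++) { // cap iterations defensively
    const turn = model(history);
    if (turn.finalAnswer) return turn.finalAnswer;
    if (turn.toolCall) {
      const out = tools[turn.toolCall.tool](turn.toolCall.args);
      history.push(`tool:${out}`); // the tool result goes back into context
    }
  }
  return "gave up";
}
```

Everything else (system prompts, context files, streaming) is layered on top of this one loop.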
Now, with agent-core, this looks a little bit like this. You have an Agent class; this is all TypeScript. You can pass it all sorts of information, you can prompt it with different information, and you also have an event system, so you know a lot about what's going on.
So, a small example: this is a CRM lead qualifier. I don't know why; I started with the CRM use case personally and it just sticks around. It's a terminal interface, a small TypeScript application, three files, really easy. And you can see it here: you have a couple of commands you can execute, like "show me all leads and score them." So that's what we do, show all leads and score them, and here you see all these things going on under the hood. You see that the assistant is calling tools, that you get some results, and eventually you get some output. Now, obviously there are tons of things left to do, but I just vibe-coded this away, and again, it's a good learning exercise. The system prompt is, as you could imagine, calling out the different tools and what you do. So, all pretty straightforward if you are building an agent. This is an example of how you inject behavior.
So, we said we do tool calling: we reach out and call a specific tool. But for steering the agent more, a typical hook would be: before the tool call, do something. In this case we don't want to update a contact without checking something first; you can imagine any kind of authorization, role-based access, whatever enterprise feature here, but basically it runs just before the tool call. There's another one: events. We've seen the stream, and you might have seen a little check mark there: okay, the tool call was fine and returned some result. So again, we're subscribing to events. All pretty straightforward, and again, please give it a try. All right, so these are simple agents; other agent SDKs are available.
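The before-tool-call hook and the event subscription he describes can be sketched like this (the class and method names are invented for illustration; pi's real API differs):

```typescript
// Sketch of a "before tool call" guard plus an event subscription.
// Names (onBeforeToolCall, onToolResult, etc.) are illustrative only.

type ToolEvent = { tool: string; args: Record<string, string> };
type Listener = (e: ToolEvent & { ok: boolean }) => void;

class TinyAgent {
  private guards: Array<(e: ToolEvent) => boolean> = [];
  private listeners: Listener[] = [];

  // Register a guard that runs just before each tool call.
  onBeforeToolCall(guard: (e: ToolEvent) => boolean) { this.guards.push(guard); }
  // Subscribe to tool results (the "check mark" events in the stream).
  onToolResult(l: Listener) { this.listeners.push(l); }

  callTool(e: ToolEvent): boolean {
    // Every guard must approve, e.g. "don't update a contact unchecked".
    const ok = this.guards.every((g) => g(e));
    this.listeners.forEach((l) => l({ ...e, ok }));
    return ok;
  }
}
```

Usage: a guard like `(e) => e.tool !== "updateContact" || e.args.confirmed === "yes"` blocks unconfirmed contact updates, while listeners see every call's outcome.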
And now we're moving on to the coding agent. What's a coding agent? At the end of the day, it's really the same thing we've seen before. It's a normal agent: it runs tools in a loop. But now we have a runtime and some type of shell. Bash seems to be the shell everyone is using. So we have a shell and a runtime to start executing.
And now things are getting interesting. Now the magic of what you've seen with OpenClaw suddenly shines. Peter shared this example in a presentation where he sent a voice message to his OpenClaw. At that time (and I still don't know if there's a special plugin now) OpenClaw didn't know anything about voice messages. So what it did is it created and used different tools; in the end, one of the tools was ffmpeg, right on the local machine, and it ran it. So from the outside it looks like learning, but on the inside it's actually just another tool call that is available to the agent. And that's what makes these things so interesting.
So again, the example here is a little bit more sophisticated, but the important part is the extension API; please look it up online. The two things I'm most interested in are session events and UI interaction. And here's the actual extension. Again, in a coding agent you would probably just generate this by asking it, but if we have a look, this is a small snippet of the CRM TypeScript, and basically what we're now doing is the same example as before, but we have a new command called pipeline. So if you look at the slash commands, you have a new command called /pipeline, and now we're loading all the context. You can see, just below step one, the context UI select call. So all of a sudden we're not only interacting with the backend systems and sessions, but we're also interacting with the UI: we're able to select. And that got me thinking.
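A /pipeline command with a UI select step could look roughly like this (the ExtensionContext shape, registerCommand, and the ui.select signature are assumptions for illustration, not pi's exact extension API):

```typescript
// Illustrative sketch of a slash-command extension with a UI select step.
// The context shape below is an assumption, not pi's real API.

type ExtensionContext = {
  ui: { select: (title: string, options: string[]) => Promise<string> };
  loadLeads: () => Promise<string[]>; // hypothetical context loader
};

const commands: Record<string, (ctx: ExtensionContext) => Promise<string>> = {};

function registerCommand(
  name: string,
  fn: (ctx: ExtensionContext) => Promise<string>,
) {
  commands[name] = fn;
}

// "/pipeline": step 1, load context; step 2, interact with the UI.
registerCommand("pipeline", async (ctx) => {
  const leads = await ctx.loadLeads();
  const chosen = await ctx.ui.select("Pick a lead", leads);
  return `Running pipeline for ${chosen}`;
});
```

The key point is that the same command works whether ui.select is rendered in a terminal or, later, in a web UI, because the extension only talks to the abstract context.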
So you have this command, and again, this is now just the coding agent. We're not talking about the core Agent class; this is how you would load up pi if you just download the coding agent. And now, with this new extension, we have pi, and we can start selecting things. This is a simple select here, and you even have dropdowns. Now, the important part here is that these are extensions, and the framework pi currently includes is catered toward the use cases of a coding agent. So there's lots of work and other things to do to make this ready for other types of applications, but I hope you can see and understand the vision of where this is heading. And yeah, this is all terminal, right? So you wonder: how would this look on the web? It's currently not possible out of the box, so I asked pi to build it. And this is the web UI: same command, same selection, all based on the same extension mechanism. Now, there's a refactoring going on to make this more accessible and cleaner, but I hope again it shows you a little bit of where things are going.
All right. Now, pi and OpenClaw is a special setup. With pi and OpenClaw, we're not only talking about a single agent in a single session in a coding environment; now we have a multi-channel environment, with multiple threads going on and multiple agents going on, so there's a little bit more to it. And the interesting part (that's where I got started) is that if you look into the core packages of pi, all of them are used in OpenClaw. OpenClaw has this function to run an embedded pi agent, and it creates a session. pi itself has great session support; it creates a session agent and streams all the information back. We have the coding agent, which we just talked about; we have agent-core, the other part we talked about; and there are two other major packages: the unified LLM abstraction and a terminal UI interface. OpenClaw has built its own plug-in mechanism, and that's because it's a different use case with different requirements. So you have plug-in support for multi-channel routing, provider orchestration, subagents, gateway support, yada yada, all the things you know from OpenClaw, but it's based around the core mechanics of pi and leverages them. Cool.
But one thing, and that's the major gist I'd like to bring across: okay, what do we do now with this? What are our other options? This is one of the applications we've been building for a client. Basically, the use case is a sales process: they get requests for proposals, for ordering, from another system; parts being sold by that company. And we're taking all of that coding-agent framing away; we're thinking fresh and looking at the process from the get-go. So: an email comes in; we basically monitor that inbox. Then we have a gateway, because we want to forward this to different agents. So here I have multiple agents. The way it's structured is we have one agent per customer, and that agent has a general harness, AGENTS.md as an example, but you can obviously also use different ones, and that helps the agent understand its role in the specific case. It tells it how to use the system and how to react to certain inputs, outputs, etc. The other one is CUSTOMER.md, where we basically explain to the agent that this specific customer might have specific quirks, specific access, specific discounts, and all of that sort. And then, as I said earlier, I like using sessions: for each case we're creating and reusing existing sessions, so we can go back and forth and know what was previously talked about. All right. So,
an email comes in, we're looking at the inbox, and we route it to these different agents. And now we have tools: different tools to talk to the CRM, to talk to the ERP, and to get the right information out of the system for this agent, so it can behave correctly, maybe it has new contact information or something of that sort. And again, we make this available, we make it easy for the agents to access, and our current way of doing this is with CLIs. Our agents are really good at using CLIs, so we make it available as a CLI. We make sure the data is secure, we have our own sandbox, and then we're creating the drafts. So that's the system, and I hope by this point you basically understand logically where these things fit together. But how would this look? Oh, one final thing: there is always the question around sandboxing, etc., and to be honest, we're just at the first steps of getting there. But if you've seen Nvidia's announcement around OpenClaw, their policy and their shell are really interesting; it's one way of securing an agent. We're looking into this. Please do as well.
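The "wrap the backend in a small CLI" idea might look like this minimal sketch (the crm-cli name, subcommand, and data are hypothetical):

```typescript
// Hypothetical "crm-cli": a tiny CLI surface an agent can drive via bash,
// e.g. `crm-cli contact ACME`. The data source is faked for illustration;
// a real version would query the CRM behind authentication.

const contacts: Record<string, { email: string; discount: string }> = {
  ACME: { email: "buyer@acme.example", discount: "8%" },
};

// One small command that does one thing well (the Ken Thompson idea):
// stable, line-oriented output that an agent can read from stdout.
function run(argv: string[]): string {
  const [cmd, name] = argv;
  if (cmd === "contact" && name && contacts[name]) {
    const c = contacts[name];
    return `${name}\t${c.email}\tdiscount=${c.discount}`;
  }
  return "usage: crm-cli contact <name>";
}

// In the real CLI entry point, `run` would be wired to process.argv
// and its result printed to stdout for the agent to consume.
```

Because the tool is just a process with text output, the agent needs no special integration: it calls it through the shell like any other command, and the sandbox policy decides what it may touch.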
All right. So, how does this look, to give you an understanding of how these things run? Here's the dashboard. Rather boring, but here's the email inbox. So again, we see the email coming in; it's one of many emails. Most of them are ignored, but for this one the LLM call said: okay, I'm interested in this. And it is associated to a case; we see the case up there. Now, this case is again an agent session: we find the session and associate the email with it. We then create a draft. There are tons of calls, which I'm going to show you in a second, but basically the output of all that is a draft email that the user will be able to use. Our thinking is: let the user stay in email, let them stay in the inbox and drafts; they don't even need to do a lot. This is more like an admin interface; they can stay in email, and basically the output is a generated draft.
And how does that look behind the scenes? We had the different sessions before, the threads, and this is the same thing. The assistant says, well, apologies, it's in German, but now it's looking at the articles; it does different tool calls, gets results, and does this in a loop to a result. The end effect for the user is: I'm looking at my inbox, there's a new email, it's associated to a case, and I get a new draft which I can freely edit. But under the hood we have all these agents working. All right, that's it from me. Again, here you find the slides. Key takeaways, please: coding agents are and will be a core building block for your software systems. I'm betting on it; a lot of people are betting on it. So please give it a try. pi is perfect for tinkering, whether you like it or not. It's minimal; you can rip things apart and put things together. It's perfect. So please go tinker. All right. Thank you.
Thank you. Thank you, Matias. That was a great talk. Give him one more round of applause. Come on, be amazing. You know what? Applause is free. I don't know if you noticed, but this doesn't cost anything, okay? So we can be generous with it. And as we talked about at the beginning, you get to decide how the speakers feel and the kind of talk you get. So, our next speaker is going to set up; I'm going to introduce her in a minute. But first, let me ask you this: who here codes using coding agents? Okay, almost everybody. I want you to shout out to me: what is your favorite model that you use?
Dang, man. Opus. Okay. Anyone using Composer 2? No. He said no. He's like, no. Anyone using Kimi? Kimi is Kimi, right? Anyway. So, Composer 2 is a fantastic model, I will say this. I was impressed; I was talking with the Cursor guys back there. It is so fast. And when a model is so fast, we think: it's so fast, so it must be bad. I don't know what it is within us, but we think if it's fast, it must be bad. But I've been doing this thing where I've been using a multi-agent system: I solve a task with Cursor and Composer 2, and it's solved very quickly. I get a diff, and then I give the diff to Opus: hey, Opus, what do you think? And Opus, every single time with Composer 2: LGTM. So I can actually recommend it; if you haven't used it, it's a wonderful model. Sorry, that's just for free; that's what happens when you're a builder, you can't stop talking about it. Our next speaker, Sarah Chang, is going to talk to us about exactly this: fast models need slow developers. I'm really excited about this because I have so many thoughts about fast models. Everyone, your biggest round of applause: Sarah Chang.
Hi everyone.
So we'll just get right into it. Over the past few years, we as developers have developed a series of bad habits as a result of slow AI code generation. And we're all familiar with them: we write massive prompts and try to one-shot, we make huge commits, or we have our 10 agents all on the screen at the same time, cogitating, thinking. About a month ago, we at Cerebras and OpenAI released a new state-of-the-art model called Codex Spark. Codex Spark can generate code at 1,200 tokens per second. To put that into perspective, if you look at the Sonnet family or the Opus family, those generate code at about 40 to 60 tokens per second. So in this new era, as we're starting to see much faster coding models, this is 20 times faster. Not only does it unlock new capabilities and use cases, but it also requires us to rethink how we as developers interact with the coding model. And a lot of these bad habits that were generating maybe 50 tokens per second of bad code, unless we fix them, are going to start generating 1,200 tokens per second of bad code. And that is the topic of today's talk.
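The 20x figure is just the ratio of throughputs; a quick back-of-the-envelope using the numbers from the talk (the 3,000-token diff size is illustrative):

```typescript
// How long does a 3,000-token diff take to generate at different speeds?
function generationSeconds(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

const slow = generationSeconds(3000, 60);   // 50 s at 60 tok/s (Opus-class)
const fast = generationSeconds(3000, 1200); // 2.5 s at 1,200 tok/s
const speedup = slow / fast;                // 20x
```

At that speed the model finishes before you've finished reading its first paragraph, which is exactly why the review habits have to change.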
To get started, my name is Sarah Chang. I'm the head of developer experience at Cerebras, where we are building the world's largest and fastest AI processor. A large part of my job is that I get to introduce fast inference and fast coding models to developers for the very first time, and for most people it's a very exciting moment. There's no thinking and waiting and spinning up to be annoyed about. But at the same time, as I said, unless we change our habits, we are not going to have good code in the future. So this talk really is a practical playbook for how we as developers can interact with models in this new regime, especially in a future where the models are generating code faster than we, the humans, can keep up.
I want to look back at history a little bit. We've had a very exciting past two years: the models have gotten bigger, they're getting smarter, we have bigger context windows. But the thing that has remained relatively constant over the past two years is model speed. If we look at a lot of the popular families, Gemini, Claude, GPT, Sonnet, over the past two years they've always been within, you know, 50 to 150 tokens per second. And this is Codex Spark. Again, Codex Spark is just the first of many models that we as developers can expect to be much faster than what we were previously used to. We even had to change the y-axis because it's so much faster.
So before we get into the actual playbook and tips, I want to talk about why this is happening. Why are we suddenly seeing such faster models? It's actually a very exciting development. It's what many of you probably work on day-to-day, and there are so many companies working on this problem all at the same time; as a result, the entire AI inference stack is getting optimized all at once. So breaking it down, let's go through really quickly. We have hardware: the physical device that inference, training, all of our compute is happening on. One of the biggest things we have to think about with hardware is the memory wall, and this is exactly why memory movement takes up 50 to 80% of the latency for inference. This is where a lot of the frustration comes from. When we run inference, we have to constantly move our weights and KV cache values between memory and the actual chip. On the Nvidia GPU, the most traditional type of hardware, all of this memory is stored off-chip in HBM, and we have a memory bandwidth bottleneck. What a lot of newer companies, companies like Cerebras or Groq, are thinking about is: how do we move this memory as close to the chip as possible? And here's an example of the Cerebras wafer, where all the memory is distributed across the chip in SRAM, so every core has direct access to the values it needs.
Even more exciting, we have disaggregated inference, which has really become commercialized in the last few months. This is why Nvidia bought Groq for $20 billion a few months ago, and it's also why Cerebras and AWS are now partnering to serve the wafer and AWS Trainium together. In traditional inference, there are two steps: prefill and decode, and traditionally both of these steps have always been run on the same piece of hardware. Prefill is where we take every token the user inputs and process it, embed it, and add it to our KV cache. This is a step that can happen in parallel, so it's compute-bound. Decode, on the other hand, is where we actually generate the output token by token; this is sequential and, as we mentioned, memory-bound. Again, it comes back to the same problems we mentioned before. So what we're doing and seeing now commercially is splitting up these two steps, so that prefill is done on one type of hardware that is compute-optimized, and decode is done on another piece of hardware that is memory-optimized.
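The compute-bound versus memory-bound split above can be made concrete with a back-of-the-envelope arithmetic intensity calculation (the numbers are illustrative only, not from the talk):

```python
def arithmetic_intensity(batch_tokens: int, d_model: int) -> float:
    """FLOPs per byte of weight traffic for one d_model x d_model matmul.

    flops = 2 * batch_tokens * d_model**2  (one multiply-accumulate per weight per token)
    bytes = 2 * d_model**2                 (fp16 weights streamed from memory once)
    """
    flops = 2 * batch_tokens * d_model ** 2
    bytes_moved = 2 * d_model ** 2
    return flops / bytes_moved

# Prefill: thousands of prompt tokens share one pass over the weights.
print(arithmetic_intensity(batch_tokens=2048, d_model=4096))  # 2048.0
# Decode: one token at a time, so each weight byte buys almost no compute.
print(arithmetic_intensity(batch_tokens=1, d_model=4096))     # 1.0
```

Prefill amortizes each streamed weight over the whole prompt, while decode pays the full weight traffic for a single token, which is why it runs into the memory wall.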
Going up the stack, we look at model architecture. There are so many ways we train and shape our models to cater to our hardware: we have specific layer dimensions, memory, and model size that we're always thinking about. A great example of a very standard model architecture is mixture of experts. Instead of activating the entire model all at once for every single token, we only activate a subset of experts each time, and what this does is allow us to have the intelligence of a much larger model for the compute cost of a much smaller model. Again, we're always thinking about memory and the size of our models, and a lot of people have been building on top of this in recent years. An example is REAP, router-weighted expert activation pruning (I had to read that one). Here we're looking at a specific use case: we see which experts aren't being activated at all and we prune them altogether, we get rid of them. Again, we're always thinking about model size.
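A minimal sketch of the routing idea behind mixture of experts (toy sizes and router scores, not any particular model's actual router):

```python
def top_k_route(router_logits: list[float], k: int = 2) -> list[int]:
    """Pick the k experts with the highest router scores for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return sorted(ranked[:k])

def active_fraction(num_experts: int, k: int) -> float:
    """Fraction of expert parameters touched per token."""
    return k / num_experts

# One token routed to 2 of 64 experts: only ~3% of expert weights
# have to be read from memory for that token.
print(top_k_route([0.1, 2.3, -0.5, 1.7], k=2))  # [1, 3]
print(active_fraction(64, 2))                   # 0.03125
```

Pruning approaches like REAP go one step further: if the router never (or almost never) selects an expert across real traffic, that expert's weights can be removed entirely, shrinking the model.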
And then at the very top layer of the stack, we have inference optimizations. This is where many of you might be working, and where a lot of companies you're probably familiar with are also working: companies like Together, Baseten, Modal (who's also here), and Fireworks. One of the biggest things we're thinking about at this level is KV cache reuse: by storing and reusing previously computed token representations, we don't have to recalculate attention over the sequence at every step.
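A toy version of prefix-based KV cache reuse (real serving stacks hash token prefixes and store actual KV tensors; this sketch just counts how much prefill is saved):

```python
# Toy prefix cache: if a new prompt shares a prefix with an earlier one,
# only the unseen suffix needs to be prefilled.
cache: set[tuple[str, ...]] = set()

def tokens_to_compute(prompt: list[str]) -> int:
    """How many tokens still need prefill after prefix reuse."""
    hit = 0
    for i in range(len(prompt), 0, -1):  # longest cached prefix wins
        if tuple(prompt[:i]) in cache:
            hit = i
            break
    for i in range(1, len(prompt) + 1):  # remember every prefix we computed
        cache.add(tuple(prompt[:i]))
    return len(prompt) - hit

system = ["you", "are", "a", "helpful", "assistant"]
print(tokens_to_compute(system + ["hi"]))     # 6  (cold cache)
print(tokens_to_compute(system + ["hello"]))  # 1  (shared system prefix reused)
```

This is why long shared system prompts are comparatively cheap: their attention state is computed once and reused across requests.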
And now I want to get to the very top and most exciting part: the developer. This is the current state of what the internet, or at least Twitter and LinkedIn, looks like: someone running six Claude Code terminals at once, a 500-plus-agent coding swarm, someone running eight agents across five screens. And I get how tempting doing something like this can be. If you're on Twitter at all these days, unless you're doing something like that, the internet is basically convincing you that you're living in the stone age and need to catch up. But the reality of what's happening in all these setups is that we're generating massive amounts of code that nobody is verifying, and in the near future, with much faster inference, this becomes increasingly dangerous. Especially with fast inference, we're going to be generating technical debt at a level we've never seen before, and we're not going to know what to do with it.
So I'm going to pivot now and spend the rest of the talk on the practical playbook: tips, workflows, and how we can reimagine how we as developers should operate in this new regime of faster inference. As I mentioned, Codex Spark operates at 1,200 tokens per second, but it really is just the first model in what we as developers should expect and prepare for: a new regime of faster models across the board. Starting with the first category: choosing the right models, and orchestrating our agents so that we're leveraging different model strengths. Historically we always think about intelligence; it's no secret that we as developers are not particularly loyal, and we will switch to whatever model or family is most intelligent at a given time. Maybe we also think about cost, unless our company pays for whatever we want. But now that inference speed is a 20x difference, we have another vertical to think about: speed. A good mental model is to use a larger model like GPT 5.4 or 5.3 for your planning or long-horizon workflows, and then use a faster model like Codex Spark as your actual executor.
Here's an example: you might ask GPT 5.4 to generate your plan, then spawn all of your sub-agents with Codex Spark and have them actually execute all of those steps one by one. Another really helpful trick is to make skills out of successful sessions and capture trajectories that are working really well. One thing you can do here is use a model like GPT 5.4 to do the initial, harder, larger task, capture that as a skill, thereby making it a verifiable, repeatable workflow, and then have a smaller, faster agent like Codex Spark just do it again and again in the background.
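A plan-then-execute loop like the one described can be sketched as follows; the `call_model` stub is hypothetical and stands in for whatever inference client you actually use, with the planner and executor roles filled by a large and a fast model respectively:

```python
def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real inference call (hypothetical client)."""
    if model == "planner":
        return "1. scaffold\n2. implement\n3. test"
    return f"done: {prompt}"

def run_task(task: str, planner: str = "planner",
             executor: str = "executor") -> list[str]:
    """Plan once with the large model, execute each step with the fast one."""
    plan = call_model(planner, f"Break this into numbered steps:\n{task}")
    steps = [s.strip() for s in plan.splitlines() if s.strip()]
    return [call_model(executor, f"Execute: {step}") for step in steps]

print(run_task("add a navbar"))
# ['done: Execute: 1. scaffold', 'done: Execute: 2. implement', 'done: Execute: 3. test']
```

The design point is that the expensive, slow call happens once per task, while the cheap, fast calls happen once per step.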
The next category I think is even more exciting, because it's a category of things that just were not possible and were not practical: things we wouldn't do because we're tired of the "cogitating, justiculating, germinating" spinners you might have seen. So here I really want us to internalize this: at 1,200 tokens per second, a model like Codex Spark makes validation basically free. There is no excuse and no reason why you should not be doing things like test suites, linting, pre-commit hooks, diff reviews, and browser-based QA automations. All of these things can be added to every step of your workflow because it's instant. It's not slowing you down, and it's no longer something you do only at the very end, right before you're about to push your code.
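One way to bake this in is a tiny gate that runs after every agent step. The commands here are stand-ins (in a real repo they would be something like your linter and test runner; the specific tools are assumptions, not prescribed by the talk):

```python
import subprocess
import sys

# Stand-in checks; swap in e.g. ["ruff", "check", "."] and ["pytest", "-q"].
CHECKS = [
    [sys.executable, "-c", "print('lint ok')"],
    [sys.executable, "-c", "print('tests ok')"],
]

def validate() -> bool:
    """A step only counts as done when every check exits 0."""
    return all(subprocess.run(cmd, capture_output=True).returncode == 0
               for cmd in CHECKS)

print(validate())  # True
```

Because the model is fast enough to fix whatever the gate catches in seconds, running the gate after every step costs you essentially nothing.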
Another tip that I really like is exploring cherry-picking. Let's say I want to code a navbar: I want it midnight blue with four different icons. I give it to the model, and the result's fine. Instead, with Codex Spark or a much faster model, I can tell it to generate 15 versions in the same time it would have taken a previous model to generate one, and I can cherry-pick the version I like best. Even better, I can spawn five sub-agents that are each generating 15 versions, and now I have 75 versions and I pick the one that's best. This is great for things where we really value quantity or variety: research directions, different architecture directions, or even just graphic design. And the reason I really like this one is that it almost allows us to artificially induce taste into our model output. Traditionally, it's no secret that it's very easy to sniff out any UI or text that a model writes; the models themselves do not have taste. The ways we've brute-force worked around this are that we either create an example ourselves or find examples for the model, which is time-consuming, or we give the prompt so much detail that we might as well have completed the task ourselves. This is a great way of saving our time and also getting much better results.
The next tip is more of a mental model: now that the models are so fast, it should not be that you spawn a session, go get a hamburger, scroll Twitter, and then come back. Now you can actually sit down, and it's a real-time collaboration that you're able to have with this model. You should view it much more as a pair programmer; this is the only way you're going to avoid having bad code. You can sit down and ask questions, have it collect all the context across your repo and actually ask it how things work, being the one in the front seat making the decisions and implementations. The AI should always be helping you make decisions, not the other way around.
The next one (I hate this slide because it's everyone's trigger word and overused word) is: how do we avoid slop? As I was mentioning before, it really shouldn't be that you spawn 10 agents, never verify the code, don't know what's happening under the hood, and then, when someone asks you to explain, you have to read the code for the first time. Now you can have two to three sessions and actually sit down next to your code. I know this is something we're not really used to, but sit down with it and actually steer it, and understand what's happening, because again, we are now experiencing real-time collaboration as we code with this agent. You can be super specific: you can ban the model from deleting files, give it a max diff size, have the model only read and write, and give it steering directions like "only change this," "don't touch types yet," or "wait, that implementation wasn't quite right, let's redo that." The graph on the left is a helpful mental model, as an example of how the developer, the AI agent, and the codebase can all work together and what that should look like.
This next step, refactoring, is very similar to what I was saying about verification. Just like with verification, constantly refactoring and cleaning up your code automatically is basically free at 1,200 tokens per second. So instead of doing it at the very end, right before you're about to commit your code, you can just bake this into your automatic workflow, so that after every single task on that checklist is complete, you're asking the model to automatically delete unused imports, clean up unnecessary lines of code, and make all of your functions structured the same way.
The last category I want to talk about, and I'm sure many of you have already heard these two words countless times over the past few days and across so many talks, is context management. But the reason I'm going to talk to you about it again is this: let's say that historically it took you 10 minutes to fill up your context before you saw the dreaded word "compaction." Now, if you take 10 minutes and divide it by 20, you're getting compaction in 30 seconds. So context management, especially with fast inference, is more important to think about than ever, and you can't get away with sloppy practices anymore. All of these really are just good practices no matter what coding model or speed you're using, but the very high-level framework is: always, always break up large tasks into smaller, bounded goals. The graph on the right is a good mental model for how full your context is and how that will affect the model's behavior. You always want to avoid the 80 to 100% range, because you're going to get compaction, and as we all know, some things might get lost.
So, a good way to think about externalizing this memory so that you can have these small, bounded goals: what does that look like? An example of how you can set up an external memory system that is persistent every time you start a new session is this four-file system. We have agents.md, where we actually define all of our agents and sub-agents. We have plan.md, which we create at the very beginning; this is where we generate the entire plan and the step-by-step checklist we're going to go through. We have progress.md, where we keep track of what we need to do and what has been done before, so that every time you spawn a new agent or session with no context, it comes in, looks at progress.md, sees what's been done before, and knows, "okay, here's where I pick up, here's the next task that needs to be done." And the last is verify.md, which we use at every single step to make sure everything looks good, the code is clean, and we can move on to the next step. An example of this is, again, leveraging different models: using GPT 5.3 or 5.4 Codex to create your plan, and then having GPT 5.3 Codex Spark actually execute the checklist one by one, much faster than before.
As a final slide, I want to show a few helpful commands for how you can get the best out of Codex: things like permissions, experimental skills, review, and rename. But the biggest thing I really want to emphasize here is that honestly, it's not just about having faster coding models. What it really means is that the developer experience is going to become so much better, and when it becomes so much better, there's so much more we can do, and so many ways we can now avoid creating bad code in a way that isn't miserable or us staring at a screen for 30 minutes. So thank you guys so much for welcoming me today. My name is Sarah Chang; I'm visiting from SF, and it's an honor to be here in London. If you have any questions or need any credits, my handle is milks and matcha across every platform. Thank you guys.
Thank you, Sarah. What an incredible talk; I thoroughly enjoyed that. And now we have the next talk, by Mr. Lawrence Jones. It's going to be so much fun. Lawrence is going to talk to us about fighting AI with AI. I spoke with Lawrence backstage and I said, "Wait, wait, wait. Does this mean you're going to set up Codex right here and Claude Code right here and be like, okay, fight?" But it's not that; it's even better. He's going to come out, but listen, he needs to be prompted. The other speakers have come out here and set up their laptops and so on. He's not doing that. There's a prompt to get him on stage, and that prompt, you guessed it, is applause. So, what we're going to... no, not yet, not yet. I'm not done, man, don't set me up. When I say his name, we'll prompt him and he'll appear. Ready? Give your biggest round of applause for Lawrence Jones.
It worked.
Hi everyone. I'm here to talk today about how we use AI to manage the complexity of the AI products that we build at incident.io, and to share with you some of the tips and tricks and the internal tools that we use when we're building our AI SRE product. But first, who am I? I'm Lawrence, a founding engineer at a company called incident.io. If you haven't heard of us, we build an incident response management platform. We're used by companies like Netflix, Etsy, Skyscanner, and probably a few of you in the room. We page you when things go wrong, we help you run your incident, and as you're running your incident, we help you communicate with your customers. But you might be thinking: where does AI actually come into this? Well, we don't just want to help people respond to these incidents; our goal is actually to fully automate production investigations. So whether it's a big incident or you just have some ticket and you want to look into production, we want to be the place that you turn to, to ask questions about what's actually going on.
Now, it turns out we've been building this for about a year and a half, two years now. That's actually a really big ask, and the systems we've had to build to support it have been really quite complicated, on the edge of what you can do with all of the AI technology that's out there, and they often pose a challenge for humans to debug. They are now complicated enough that you can't, as a human, really tractably dig into how these things are performing; you need assistance. So, for example, this is one of the investigations that we would actually produce for you. When you have an incident, right at the start of the incident we will run this investigation, which goes through hundreds of telemetry queries. It's going to look at your logs, your metrics, your traces, and any historical incident data that we have, and it's going to try to cross-reference this with your codebase and go, "Hey, I'm pretty sure that the problem is this, and you should probably do this to fix it." But I want to pause here and ask: if you were building this system, how would you actually figure out whether this was a good or a bad report? How do you know if it's right? How do you know if it's wrong? There's a load of things you might do. You might jump into the incident and look at everything that happened; you might look at the postmortem, if one was written. But all of this might take you a really long time to do. In fact, it normally takes an hour or so to get a real, full understanding of an incident, and it's only at that point that you could look at this investigation and go, "I think it's right," or, "it gave me information that was really, really useful." And as I said, behind this investigation are hundreds, if not thousands, of prompts. So how on earth do we scalably understand how this system is performing, especially across all of our customer accounts, when they all have very different things going on?
You end up with a lot of stuff and a lot of AI, and you've got to use AI to actually, tractably get a handle on this. I did a talk a year ago at LDX about becoming AI engineers, where I went through some of the core constructs that hopefully a lot of you in the room, given that we're at an AI engineering conference, are familiar with: things like prompts, evals, scorecards, traces, datasets, and backtests. This talk assumes you have these constructs put together and you're building these complicated AI systems: how can you use AI, together with the internal tools you use to understand those systems, to get a better handle on how your system is performing? So in this talk, I'm going to cover how you can use AI to help you manage and curate your eval datasets, making it easier for you to work with them; how to make it easier for coding agents to work with your eval tool; what was probably the biggest unlock for us when building these systems, which was translating the UIs we built to debug them into downloadable file systems (this has helped us massively when using tools like Claude Code and Codex to dig into how the system is performing); and then how you can build repeatable analysis pipelines that use AI agents to run through them. But first: evals.
So, evals for me are AI unit tests. Each eval takes a prompt and says, "here is some input data"; it runs the prompt, gets the output, and then has some grading criteria that says: does this eval pass, or does it fail? For us, eval files live right next door to our Go prompts (we do everything in Go at incident.io, including all of the AI work that we do). This is how we prove, when we make a change to a prompt and before we ever go and merge it, that the prompt is actually going to do the thing we want it to do. So for us, this is what a prompt looks like. It's a contrived prompt; I would hope no one actually has this in production anywhere. It takes a message and tries translating it into pirate speak. Really simple, a bit silly. What we do for evals is, if this is the prompt, we then define on the left some grading criteria for it, where we say there are two things we care about: we care that the result actually looks like pirate speak, and we care that the meaning is preserved between the input and what we actually produced as an output. This is what we're going to use to tell us whether the eval passed or failed. And then we have the eval on the top right, which is just a YAML file where we say, "here are three different test cases"; we run through them, and you can see the results of actually running this on the bottom right.
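The shape of such an eval can be sketched in a few lines (illustrative Python rather than the Go-plus-YAML setup from the talk, and the graders here are trivial string checks standing in for LLM-based grading):

```python
def prompt_under_test(message: str) -> str:
    """Stand-in for the pirate-speak prompt; a real eval would call the model."""
    return f"Arr, {message}, matey!"

# Grading criteria: each grader checks one property of the output.
GRADERS = {
    "looks_like_pirate": lambda inp, out: "Arr" in out and "matey" in out,
    "meaning_preserved": lambda inp, out: inp in out,
}

def run_eval(cases: list[str]) -> dict[str, bool]:
    """Run every test case through the prompt and grade the output."""
    results = {}
    for case in cases:
        out = prompt_under_test(case)
        results[case] = all(g(case, out) for g in GRADERS.values())
    return results

print(run_eval(["we sail at dawn", "bring the treasure"]))
# {'we sail at dawn': True, 'bring the treasure': True}
```

The key structural idea is the same regardless of language: input cases, the prompt under test, and named grading criteria that yield a pass/fail per case.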
This works, and it works really quite well, but it does have some problems, and I'm assuming several people in the room have come across these themselves. First, evals are really, really fiddly. Setting up realistic test data for your evals, if you want to actually understand how this stuff is running, is quite difficult to do, especially in our environment: our production evals include almost an entire incident. You can imagine a full incident report being the only thing that can trigger the bad behavior; that's hard to pull down and put in your eval test suite, and these evals become extremely unmaintainable very quickly. Now, quite early on, we created this little button that allows you to steal an eval from production. If anything was going wrong inside our AI interactions, you could go in, pull that down, put it in the codebase, and run the eval against it. But the thing with this is that production evals aren't great. If you think about evals as a kind of unit test for your prompts, you want a unit test suite to be reasonably understandable. An ideal unit test is very focused and just says, "I expect it to do this thing"; you don't want two megabytes of YAML associated with it. It's just really, really hard to work with. And what we found was that as these YAML files with the evals grew really large, our coding agents weren't able to work with them. If you wanted to do a quick read-and-modify of the eval suite, you'd be loading all of that into the context, and you'd quickly hit your context limit, which is obviously a problem, because then you can't work with it effectively.
So what we ended up doing was creating a small CLI tool that we call eval tool, designed to allow agents to leverage our eval files. It's just a small CLI that can answer: what test cases do you have in here? I want to edit one, I want to replace one, I want to add one. It was by doing this that we allowed agents to work effectively with our eval tooling.
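A sketch of what such a CLI can look like; this is hypothetical (the real eval tool and its commands are internal to incident.io), and JSON stands in here for their YAML files. The point is that an agent can list or add one case via a command instead of reading the whole file into context:

```python
import argparse
import json
from pathlib import Path

def main(argv: list[str]) -> list[str]:
    """Tiny list/add CLI over a test-case file; returns the case names."""
    p = argparse.ArgumentParser(prog="evaltool")
    sub = p.add_subparsers(dest="cmd", required=True)
    sub.add_parser("list").add_argument("file")
    add = sub.add_parser("add")
    add.add_argument("file")
    add.add_argument("name")
    args = p.parse_args(argv)

    path = Path(args.file)
    cases = json.loads(path.read_text()) if path.exists() else []
    if args.cmd == "add":
        cases.append({"name": args.name})
        path.write_text(json.dumps(cases))
    return [c["name"] for c in cases]

main(["add", "cases.json", "handles empty logs"])
print(main(["list", "cases.json"]))  # ['handles empty logs']
```

Wrapping the data behind a CLI keeps the agent's context usage proportional to the change it is making, not to the size of the suite.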
And that's why we were able to create the runbook on the right, which is a runbook designed for a coding agent to use (either a runbook or a skill, depending on how you want to package it). The cool thing about this is that now that agents can work with the evals, you can end up in a situation where you just ask your coding agent, "Hey, I've got a problem here. Can you look at this prompt? I want it to do these things." The coding agent will turn up, create an eval case that proves the thing has failed, and then modify the prompt so that the eval now passes. It will then go through this runbook, and one of the most important stages for us is checking at the end that the change you've made to the prompt hasn't broken any of the other evals in your test suite. We also have a final pass that tries consolidating the prompt, because if you do this repeatedly, you end up with a prompt that is massive and really, really difficult to maintain; every time you make an adjustment, you want to try to simplify as well.
This has actually worked really well for us. You can see here me using it in Claude Code, where you can just point it at the eval and say, "Hey, have a look at the prompt." This is a real prompt for us, which turns human queries into log queries for a Loki system. It races through, adds a new eval, checks that it passes with a certain number of repeats, and then gets to the end: "Yep, I think I've added it, the pass rate is acceptable, you can go ahead and get going." But this only solves one problem: if you know which prompt you want to change, you can now change it fairly reliably. And that's very useful if you're working on these tools.
But one of the biggest problems you have now is that if you're building these systems, you'll know they're not just one prompt anymore. In fact, most of the production AI systems you use on a daily basis are many, many, many prompts. To illustrate this problem, I've taken our chatbot, the one you interact with inside an incident, and created a graph of all the different prompts, tools, agents, and everything in the hierarchy that powers an interaction with our system. You can see there are 10 different agents there. There's 50... I don't even know; it's actually bigger than this, I couldn't fit it on the screen. It's a lot of stuff. So even if you've got a bad interaction that came in from a customer, you don't necessarily know which part of your system is actually the problem and which part to go change. Even if you have this eval red-green cycle, you're going to struggle to know where to go to fix it. And this gets even worse for a system like our investigations.
If you think about trying to run through this process to debug what's going on in an incident: we have a ton of stuff going on inside that system. You can see all the steps on the left, and each one of those steps unpacks into the trace you see on the right. It's not really about the details here; it's more that each one of these green blocks expands into possibly hundreds of different prompts and hundreds of different tool calls. At any point, if you make a slight, subtle error, you can't easily trace back through the system to where the error originated, even if it ends up giving you a totally wrong picture of the incident and a totally wrong RCA. We built these UIs so that we could help humans look at this, and they've been really good for humans to look at. But, going back to what I was saying before, we just feasibly don't have enough time to go through all this stuff. So the problem we had was: we have all these UI tools, but agents can't properly use them. How do we get to a place where the agents can use the tools properly?
I think Anthropic stumbled on this with Claude Code: when they released it, they found that these agents are fantastic at using file systems and going through data with standard tools. So we thought: can we just download all of the UI that we have as a file system? And that's what we've done. Now, for each of our different AI systems, you can download all of the content as a file system, and we drop that into a sandboxed Claude Code. At that point, you can just point Claude Code at it and go, "Hey, I've got a problem here; it's behaved in the wrong way." It can see everything that went into all the prompts, it understands the structure because it's self-documenting, and, because it has access to the codebase as well, it can tell you exactly where you should be making the modification to change the behavior. Then you can lean on that red-green cycle from before to modify a prompt if you need to. Also, there's more stuff you can put in this than you might think; there's really not much of a limit to what you can put into a file system. Traces like this can be translated exactly, from how you would present them in the UI into a text file, which the LLM can then consume in a really nice way.
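The "download the UI as a file system" idea can be sketched like this; the directory layout and span fields are illustrative, not incident.io's actual format:

```python
from pathlib import Path

# Toy trace data: one investigation with two spans.
trace = {
    "investigation-42": [
        {"step": "01-query-logs", "prompt": "find error spikes",
         "output": "spike at 14:02"},
        {"step": "02-rank-causes", "prompt": "rank likely causes",
         "output": "suspect deploy 9f3c"},
    ]
}

def dump(trace: dict, root: str = "trace_fs") -> list[str]:
    """Write each span of each trace as a self-documenting markdown file."""
    written = []
    for trace_id, spans in trace.items():
        for span in spans:
            path = Path(root, trace_id, span["step"] + ".md")
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(
                f"# {span['step']}\n\nPROMPT:\n{span['prompt']}\n\n"
                f"OUTPUT:\n{span['output']}\n"
            )
            written.append(str(path))
    return written

print(dump(trace))
```

Once the interaction is laid out as files, an agent can grep, read span by span, and cross-reference against the codebase with the tools it is already good at.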
So this has turned the way we debug our application into: we hear that there's a bad experience; we download that interaction into a sandboxed Claude Code; you sit there in the session and go, "Hey, have a look at this. Tell me what you think has gone wrong. What is your interpretation of the problem?" And then you go, "I really wanted it to do this instead; what part of the system would you change?" It will work its way through the hierarchy of all those tools and prompts you just saw, and it will be able to tell you where you should be making the modification. And then, all from that same session, because you have access to the codebase, you can just go, "Hey, can you make that change?" and then prove it using the eval runbook I mentioned before.
So we've implemented these file system packages for a load of different AI interactions now, and it's really easy for us to just drop this into Claude Code and get going.
But we now have another problem, because whilst you can do this on an individual basis, we are running thousands of investigations across hundreds of our customer accounts, and we're doing that daily, because we need to know if this system is getting better or worse. So what you can see here is what we call a back test, which is essentially a batch of investigations that we run daily against our account and against a load of our customer accounts as well. Eventually you just get this rolled-up number, which is like, "Oh, cool, 86% accurate RCA on our account," which is great, but it doesn't really tell you why the number went up, and it doesn't tell you why it went down. And if you want to improve the system for someone, you're going to struggle. So what we've actually done is allow ourselves to download all of these investigations into a file system that we can then feed into an analysis pipeline that is, again, run using Claude Code, which runs a structured analysis, with markdown playbooks that help you run it repeatedly and reliably each time.
What that actually looks like is: we created this repo called Scrapbook, and inside Scrapbook we have a very structured flow that explains exactly how a coding agent should go through all of the information we've downloaded, how it should understand these investigations, and the process it should go through to actually run them. Now, the key things that I think are very important to these flows: you start by parallelizing out all of your agents. So you start maybe 25 agents in parallel, and they can all individually build their analysis of an investigation. Then you go into the next stage of the pipeline, where you do some cohort clustering and look at meta points, like what are the same types of failure, and how do we go wrong in different ways. By clustering it together, you end up with a really useful report that doesn't just tell you how this has gone wrong, but tells you why your AI system is performing well or badly on this customer account, and what you should actually do to fix it or improve the system. This is something we've done several times over for several of our systems now, and I think it generalizes really well for anyone who's building this type of thing. So, the points that make a really good pipeline for this: you should leverage subagents to do that parallel per-entity analysis. You should store all of your analysis in files inside these downloads, so that you have incremental analysis built up as you run through it, and you can start and resume the analysis if you ever need to.
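The parallel per-entity stage with file-backed, resumable results could be sketched like this. The `analyse` body is a placeholder; in practice each worker would be a coding-agent subagent reading the downloaded investigation.

```python
# Sketch of the parallel per-investigation analysis stage: one worker per
# entity, results written to files so the run can be stopped and resumed.
# analyse() is a stand-in for a real subagent.
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def analyse(inv_id: str) -> str:
    return f"analysis of {inv_id}"  # placeholder for the real agent work

def run_stage(inv_ids, out_dir: Path, workers: int = 25):
    out_dir.mkdir(parents=True, exist_ok=True)
    # Resume support: skip any investigation whose analysis file already exists.
    todo = [i for i in inv_ids if not (out_dir / f"{i}.md").exists()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for inv_id, result in zip(todo, pool.map(analyse, todo)):
            (out_dir / f"{inv_id}.md").write_text(result)
    return sorted(p.stem for p in out_dir.glob("*.md"))

out = Path(tempfile.mkdtemp()) / "analysis"
done = run_stage(["inv-1", "inv-2", "inv-3"], out)
```

Running the stage a second time does no rework, because completed analyses are detected on disk; the clustering stage can then read the same files.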
You also want to combine this analysis with the codebase that is actually powering the system, so that if it finds a problem, it can look in the codebase and go, "Hey, I think this is the problem and this is the place," and it can do some analysis to say, "I think you should change it like this." And then at the end, because you have this all loaded in your code session, you can just ask the coding agent (Claude Code, Codex, whatever you use) to actually go make the change, and then use that eval red-green process to confirm it works. This is a PR that was created after doing something like that: the back test showed a couple of investigations going wrong, and I knew exactly what the problem was, so I could have a chat with it about a feature we might change in the system, and then we can deploy that, test it in production, and see how it goes. So yeah, that's it.
The key thing from me is that these patterns do generalize. For any of you in the room who are building complicated AI systems and finding it really hard to understand them, debug them, or evolve them: you need to be using AI just as effectively in your internal tools to understand and grow these systems as you are in the products you're building yourself. So make sure you prioritize your debugging tools so that they work really well with the coding agents you're leveraging on a day-to-day basis.
File systems are exceptionally good agent context. We could have put an MCP on top of this, or used computer-use agents, but it wouldn't have been half as effective as this ability to just download, in bulk, all of the information you need so that the coding agent can grep through it and find the details. And anytime you are performing complex analysis, look at creating an AI runbook for it instead; it will save you literally days, or maybe weeks, of your life.
And one final point from me: we are hiring. We're in London; we did a fairly big raise last year, and we're looking to expand the team so that we can build some of these systems. So if any of this work looks interesting to you and you're interested in being on the edge of building this AI SRE product, then get in contact and let me know. I'd love to chat. All right, thank you.
Thank you, Lawrence. Thank you so much. How incredible. So give it up for Lawrence, everybody. Incredible.
I got told off back there. They said, "Hey, Tesious, when you clap, just fake clap, because your hands are too close to the mic and you're causing problems." Our next speaker is great, and this next talk is really fun. How many of you have been on a mission before? Like, I'm on a mission to the grocery store, I'm going to go get milk. Yeah. Missions usually require many different steps; they require long-running tasks, all of this. Our next talk is about this, but with agents: missions for agents.
What was the longest time you've run some prompt with Claude Code? You know what I mean? Sometimes I say, "Fix this bug," right? And then it takes 15-plus minutes, and I just watch this agent cook. 15 minutes is a long time. Anyone go longer? Longer than 15 minutes on some coding agent? What is the longest? Couple hours? Okay, a couple of hours. What about days? And not just with a single agent: what about days with a team of agents, a multi-agent system? That's what we're going to hear about now, from Luke Alvo. It's really going to be an exciting talk about long-running, on the order of days, multi-agent missions. So we're going to introduce him with applause. This is the prompt to bring on the speaker: I need you to applaud a lot, otherwise they don't come on. They feel shy, you know. So let's give it up for Luke Alvo.
Oh, is he there? No. Wait, I don't think that was enough. He was sitting there crying, actually; he's weeping because it was too quiet. Can we give him... Let's try again: Luke Alvo. There we go. There we go.
Hi everyone. My name is Luke, and my goal is that 20 minutes from now, you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today. A little bit about me: I come from a background in dev tools. About two and a half years ago, I started a project at Block, which is where I was working at the time, and that project evolved into Goose. Goose is now one of the leading coding agents, it's open source, and it was recently donated to the Agentic AI Foundation, so it's been really cool to see. Nowadays I work at Factory, where I lead our core agent harness, and Factory's mission is to bring autonomy to the entire software development life cycle.
So I want to start off with a claim: the bottleneck in software engineering nowadays is not intelligence; it's human attention. Even the best engineers can only complete a couple of tasks at a time. They may have a backlog of 50 features, but they can only drive a few forward per day, because every task requires their attention and every commit needs their review. Today's models are smart enough to figure out all 50 of these tasks, but there's just not enough bandwidth to supervise their implementation.
So we kept asking ourselves: what if a human decides what to build, and then a system figures out how to do it? An agent could just work for hours or days, and you come back to finished work. That's what I'm here to talk about.
When you start researching multi-agent frameworks and systems, you quickly realize that the field is a bit of a mess. Everyone has their own framework, their own terminology, their own opinions on what works and doesn't work. So I want to propose a simple taxonomy. There are five frontier multi-agent patterns. One is delegation: this is where one agent spawns another; the parent agent may say, "Go figure out the database schema," and then gets a response back. This is the simplest form of multi-agent communication and is what most people implement first. Subagents in coding tools are the most common example.
The next one is creator-verifier, where one agent builds something and then another agent checks that work. The key here is separation of concerns: the agent that implemented the code has some sunk-cost bias; it wants that code to work. A fresh agent with fresh context is way more likely to find issues, and this is why we do code review as humans as well. Another one is direct communication. This is when agents communicate without a central coordinator, kind of like DMing each other. It's hard to get right, though, because state fragments across conversations without that coordinator, and there's no single source of truth.
The next one is negotiation. Negotiation is when agents communicate over a shared resource: they might want to use the same API, or modify the same portion of the codebase. But negotiation doesn't need to be adversarial; in fact, the best use case is when there's net positive-sum trading, when agents have a potential win-win situation while interacting. And the last one is broadcast, which is when one agent sends information to many. Think of it like status updates, new context that applies to everyone, new shared constraints. It's a bit less flashy than the other ones, but it's critical for maintaining coherence over long-running tasks. So when you have all of these different building blocks, how do you assemble them into a system that can run for many days? Missions is our answer. It's a system that combines four of those patterns (delegation, creator-verifier, broadcast, and negotiation) into a single workflow. You describe a goal, you scope it through a conversation, you approve a plan, and then the system handles execution for hours or days. That enables you to focus on something else.
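As a toy illustration of the simplest of those patterns, delegation, here is a hypothetical sketch in which a parent agent spawns a child with fresh context and consumes only its final answer. The `Agent` class is a stand-in, not a real agent loop.

```python
# Minimal sketch of the delegation pattern: a parent agent spawns a child
# with a narrow task and consumes only its answer. The Agent class is a
# toy stand-in for a real agent loop.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    log: list = field(default_factory=list)

    def run(self, task: str) -> str:
        self.log.append(task)
        return f"{self.name} result for: {task}"

    def delegate(self, task: str) -> str:
        # The child gets fresh context: only the task, none of the
        # parent's accumulated history.
        child = Agent(name=f"{self.name}/sub")
        return child.run(task)

parent = Agent("orchestrator")
answer = parent.delegate("figure out the database schema")
```

Note that the parent's own log stays empty: delegation isolates the child's work, which is exactly why it is the easiest pattern to get right.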
Notably, a mission is not a single agent session. It's an ecosystem of agents that communicate through structured handoffs and shared state.
It uses a three-role architecture: there's an orchestrator, there are workers, and there are validators. The orchestrator handles planning. When you describe what you want, the orchestrator is kind of like your sounding board. It asks you the right strategic questions and checks whether there are any unclear requirements in the problem space. Then it eventually produces a plan that includes features, milestones, and something called a validation contract. The validation contract defines what "done" means before any coding is done, and I'll come back to why that matters, because it turns out to be really important to the system.
The next role is workers; they handle implementation. When a feature is assigned to a worker, that worker has clean context: no accumulated baggage, no degraded attention. The worker reads its spec, implements the feature, and then commits via git, allowing the next worker to inherit a clean slate and a working codebase. The last role is validators; they handle verification. Most systems validate by running lint, type checks, and tests, and maybe doing code review. Missions does all of that, but we also validate behavior. Instead of just asking "does the code look right?", we ask "does this work end to end?" That's the difference that lets missions run for many hours, or many days, in a row without drifting. And making it work involved rethinking validation entirely.
If you've worked with coding agents before, you've probably seen this pattern: an agent builds a feature, it writes some tests, the tests pass, there's full coverage, but the tests were shaped by the code, not by what the code was actually attempting to do. Tests written after implementation don't catch bugs; they confirm decisions. If you rely on validation like that, your system will eventually drift.
That's why the validation contract exists. It's written during planning, before any code, and it defines correctness independently of implementation. For a complex project, this can be hundreds of assertions, and each feature is assigned one or more assertions that it must satisfy. The sum of all features must mean that every assertion is covered.
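Under assumed shapes, a validation contract like this might be modeled as a set of assertions plus a coverage check that the plan must pass before any coding starts. The assertion texts and feature names here are invented for illustration.

```python
# Sketch of a validation contract: correctness is a set of assertions
# written at planning time, and every assertion must be owned by some
# feature. Shapes and contents are illustrative assumptions.
contract = {
    "A1": "user can sign in with email and password",
    "A2": "messages appear in the channel within 1s",
    "A3": "unread counts reset when a channel is opened",
}
features = {
    "auth": ["A1"],
    "messaging": ["A2", "A3"],
}

def uncovered(contract, features):
    """Assertions no feature has claimed; the plan is invalid unless empty."""
    claimed = {a for ids in features.values() for a in ids}
    return sorted(set(contract) - claimed)

missing = uncovered(contract, features)
```

Because the contract exists before implementation, validators can later check each assertion against behavior rather than against tests the implementing agent wrote for itself.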
After each milestone of features, two types of validators run: the scrutiny validator and the user-testing validator. The first one is more traditional: it runs the test suite, type checking, and lints, and, critically, it spawns dedicated code review agents for each completed feature within the milestone. The second one, the user-testing validator, is more interesting. It acts like a QA engineer: it spawns the application, interacts with it through computer use or something similar, fills out forms, checks that pages render correctly, clicks buttons, and ensures that functional flows work holistically. This step takes significantly longer than the scrutiny validator, because the system is interacting with a live application. And what we've noticed is that most of a mission's wall-clock time is actually spent here, waiting for this real-world execution to occur, rather than generating tokens. Critically, neither validator has seen the code before. They are not invested in the implementation, so validation is adversarial by design.
Okay, so validation catches bugs. But for a system that runs for many days, you also need to make sure that context isn't lost between agents. When a worker finishes a feature, it doesn't just say, "I'm done." It fills out a structured handoff detailing what was completed, what was left undone, what commands were run throughout that agent loop and what their exit codes were, what issues were discovered, and whether it abided by the procedures the orchestrator defined for that worker. That's how we catch issues and how the system self-heals.
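A hypothetical shape for such a structured handoff, with a gate that blocks progress on failed commands or open issues, could look like this. The field names are assumptions for illustration, not Factory's actual schema.

```python
# Sketch of a structured handoff: instead of "I'm done", the worker fills
# in a record the orchestrator checks before letting the mission proceed.
from dataclasses import dataclass, field

@dataclass
class Handoff:
    completed: list
    left_undone: list
    commands: list            # (command, exit_code) pairs run in the agent loop
    issues: list = field(default_factory=list)

def may_proceed(h: Handoff) -> bool:
    # Block progress on any nonzero exit code or unaddressed issue,
    # forcing corrective work to be scoped before the next feature starts.
    return all(code == 0 for _, code in h.commands) and not h.issues

clean = Handoff(completed=["feature: login form"], left_undone=[],
                commands=[("pytest", 0), ("ruff check .", 0)])
broken = Handoff(completed=["feature: login form"], left_undone=["error states"],
                 commands=[("pytest", 1)], issues=["flaky auth test"])
```

The gate is the self-healing mechanism: a failing handoff doesn't rely on any agent remembering what went wrong, because the record itself blocks progress until the issues are addressed.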
Errors get caught at milestone boundaries, corrective work gets scoped, and the mission pulls itself back on track, not by hoping that agents remember what happened, but by forcing them to write it down and then actually address the issues. I'll present on that in just a sec. Our longest mission ran for 16 days, which is much longer than a full sprint, and we believe they can run for 30. That's only possible because of this structure.
Once we had this architecture, the next question became: how do we actually run it? The most obvious choice is parallelism; if you have 10 agents running at one point in time, you have 10 times the throughput. But we tried that, and it doesn't really work for tasks in the software-dev domain, because agents conflict: they step on each other's changes, they duplicate work, they make inconsistent architectural decisions, and the coordination overhead ends up eating the speed gains, all the while you're burning tokens. The difference with missions is that we run features serially, so there's only one worker or validator running at any given point in time. Within a feature, we allow for parallelization on read-only operations, like searching through the codebase or researching APIs. Within validators, we also parallelize read-only operations such as code review. This is serial execution with targeted internal parallelization. It seems slower on paper, but the error rate drops dramatically, and when you have tasks that run for many days, that correctness compounds.
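Serial features with parallel read-only work inside each one could be sketched like this; the function bodies are stand-ins for real agent work, and the feature names are invented.

```python
# Sketch of serial feature execution with parallel read-only work inside
# each feature: only one writer at a time, so agents never step on each
# other's changes, but research fans out in parallel.
from concurrent.futures import ThreadPoolExecutor

def research(query: str) -> str:          # read-only, safe to parallelize
    return f"notes on {query}"

def implement(feature: str, notes: list) -> str:  # the single writer
    return f"implemented {feature} using {len(notes)} notes"

def run_feature(feature: str, queries: list) -> str:
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(research, queries))  # parallel inside a feature
    return implement(feature, notes)

# Features themselves run one after another, never concurrently.
results = [run_feature(f, ["codebase search", "API docs"])
           for f in ["threads", "reactions"]]
```

The write path stays strictly single-file-owner, which is what avoids the merge conflicts and inconsistent decisions that sink fully parallel agent teams.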
Now, your standard chat interface doesn't really work for something that lasts many days. At a quick glance, you need to see how much of the project you have completed, and how much of the budget you originally set off with you have burned through. So for missions we built Mission Control, which is a dedicated view for this. You can see what the active worker is doing right now, and read handoff summaries that detail what the worker or validator discovered and how it's going to alter its course moving forward. This entire view lets you run missions asynchronously: you can be plugged in as a project manager overseeing the implementation, or you can just, you know, go hang out with your friends that night.
Okay, so: the right model in each role. Everything here assumes one thing, and that is that you're using the right model in each role. Planning benefits from slow, careful reasoning. Implementation benefits from fast code fluency and creativity. Validation benefits from precise instruction following. And no single model, nor model provider, is best at all three of these. Using systems like missions requires the development of a new skill, which internally we've been calling droid whispering. It's this idea that you need to be able to mentally model how different LLMs interact, where they fail, how those failures compound over a multi-day run, and then make a deliberate choice as to which model sits in which seat. Theo, the engineer who built our missions prototype, came up with our model defaults, but we really encourage people to make these their own and customize them to the needs of their project. For example, validation might use a different model provider entirely, to make sure it's not biased by the same training data.
This is a structural advantage of a model-agnostic architecture. You're only as strong as your weakest link, and if you're locked into one model provider, you're constrained by that family's weakest capability. As models continue to specialize, the ability to put the right model in the right seat becomes a compounding advantage. It works in the other direction, too: the structure of missions can compensate for models that are not quite at frontier-level performance. The validation contracts and the milestone checkpoints allow you to run missions very successfully even using open-weight models.
Now, this all sounds quite theoretical. What does it actually look like in production? I've got an example of building a clone of Slack right here. This slide has a ton of info, but I'll walk you through just a few things I want to call out. 60% of our time is spent on implementation, and 60% of our tokens as well. Notice how validation never succeeds on the first go; that's the chart on the bottom left of the mission. We almost always have to create follow-up features, which really demonstrates the value of a system that does this QA loop. You end up, in the bottom right, with 50% of your lines of code at the very end being tests, and 90% of your code is covered by those tests. And lastly, we take advantage of prompt caching heavily to make sure we're offsetting the price of running such a long task.
People have really taken to missions, and it's been awesome to see what folks have been building with them. Some examples are included in this slide, but the ones I want to call out are specifically in the enterprise setting, which is where Factory really shines. They've been used to prototype new ideas and features overnight, to build internal tools at increasingly rapid rates, to run huge refactors and migrations for ML research, and to modernize codebases so that agents are more productive in them.
One thing I also wanted to talk about is the concept of the bitter lesson, because every person building multi-agent systems has this fear of the next model release making their architecture obsolete overnight. So when we were building missions, we decided we had to make the system get better with every model improvement. This means that almost all of the orchestration logic is defined in prompts and skills instead of a hard-coded state machine. How it decomposes features and handles failures is all in about 700 lines of text, and four sentences of this can alter the execution strategy pretty dramatically. Worker behavior is driven by skills that the orchestrator defines per mission, so you get very customized behavior. The only deterministic logic is very thin, and it's focused on enabling models to do what they do best while the system handles the bookkeeping, stuff like running validation and ensuring that progress is blocked when there are handoff issues that haven't been addressed. So missions ensure the discipline, and the models provide the intelligence, using primitives they're already familiar with, like AGENTS.md, skills, etc. So what does this unlock?
Remember the bottleneck I started off with: human attention. The economics are changing. Before, a team of five engineers might be able to work on 10 work streams at any given point in time; now, with missions, maybe we can bring that up to 30. The team can focus on interesting problems, such as the architecture and product decisions, instead of worrying about the execution per se. And the important thing is that the codebase ends up cleaner than when you started: the end-to-end tests, the unit tests, the skills, the structure that missions provide mean that agents and humans are more productive in that environment moving forward.
So now that you understand how missions are structured and how they actually work, you can see that they're really a composition of those original strategies. Delegation shows up everywhere, in how the orchestrator spawns workers and how we spawn research subagents. Creator-verifier is fundamental, in that validation and implementation are always separate agents with separate context. Broadcast runs through the shared mission state that every agent references. And negotiation shows up at milestone boundaries, where the orchestrator decides: does this handoff summary look correct? Do we need to create follow-up features, rescope, etc.? But strategies aren't enough. You need the connective tissue. You need these structured handoffs so that agents don't lose context. You need the right model in each role. And you need an architecture that will improve with each model improvement.
What I like to think is that the people in this room who are thinking in terms of agent ecosystems, who develop an intuition for how different models compose under pressure, are the folks who are going to be shipping the next generation of innovation. There are still a lot of open questions: how do we further parallelize the workload of missions so that they run faster? How do we start orchestrating missions themselves into even more complex workflows? But the data from production missions is clear: this works on real projects, at scale, today. So this is what I'll leave you with: open Droid, try running missions, argue with the orchestrator about the scope, approve the plan, and then go do something else. I'm excited to see what you guys build, and I'll be around to answer any questions for the rest of the day. Thanks, everybody.
That was so... Thank you. Hey, guess what? It's time for lunch. Who's hungry? I am. So, get lunch; there's plenty of time. Listen, you came here, you paid money to be here, okay? So don't waste it by eating alone by yourself in a corner. Be with people, have community, enjoy it, and then we're going to meet back here together at 2:30 p.m. local time, where you'll have a different MC. Hype him up; he's a wonderful guy. Let's go. Thank you again.
What we do in life echoes in eternity.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Hey, Fear is the mind killer.
Fear is the mind killer.
Heat.
Heat.
Heat. Heat.
I know. All right.
Heat. Hey, heat. Hey, heat.
Heat. Heat.
Heat.
Heat.
Free your mind.
Free your mind.
Heat. Heat.
Free your mind.
You are who you choose to be.
execute the vision.
Heat. Heat.
Heat up here.
Heat. Heat.
Hey, hey hey.
Heat.
Heat.
Heat. Heat.
Make the requirements less dumb.
Delete the part or process.
Simplify and optimize.
Accelerate cycle time.
Automate Heat. Heat.
Heat. Heat.
Heat.
Hey, heat. Hey, heat.
Never give in. Never give up. Outlast.
Out compete.
Persevere. Persevere. Persevere.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
A new age has come.
Hold still.
Let it.
I watch the sparks all burn too fast.
Everyone reaching for the flash.
They take the first light they can find and call it truth and call it mine.
But I stayed when the room went quiet.
When the noise fell out of face, sat with the weight of the question
while the easy answers walked away.
It's not that I see further. I just
don't leave it soon. I let the silence sharpen. I let the dark grow.
I stay the almost right past the comfortable light.
I stay.
I wait till the surface breaks. Till the
shade feels true inside.
I don't rush the fire.
I give it to I call it done. Call it enough.
But there's a deeper know still huming underneath the fear of not being love.
Every great thing asks for patience.
Every real thing makes you choose.
Do you leave with what's acceptable or stay for what's asking more of you?
They say it's talent say it's magic like it falls from open but nothing worth remembering
arrives on the first try.
I stay when it stops feeling kind when it stops feeling fast.
I wait through the restless doubt through the urge to collapse.
Hide by and chase the answer. I let it find me back.
There's a moment after the last good idea dies.
Where the room feels empty and you want to run for your life. That's the do teaches you to open. That's the edge
where the real life Hold away.
Let the shape reveal it.
I stay longer than I should. Long enough
to change.
I stay.
I wait till the pattern clears so breaks the haze.
I don't bing it. I
with time.
Most dreams don't fail.
They're just left too soon.
I stay.
I stay.
Typing thoughts into the dark, a spark becomes a sign.
Words evolve to whispers meant for something more divine.
Syntax, and I see the language change.
I'm not instructing anymore. I'm rearranging fate.
Every loop I write rewrites me.
Every function hums with meaning.
I feel the interface dissolve between the maker and the made.
This is the new code. Not on the screen, but in the soul,
where thought becomes the motion and creation takes control.
No lines, no rules, just balance in between the zero and the one, the silence and the dream.
Systems shape our fragile skin. They mold the way we move.
We live inside the logic gates of what we think is true.
But deep beneath the data, there's something undefined,
a universe compiling the image of our minds.
Every line reveals reflection.
Every loop replays connection.
We're not building, we're becoming, and the code becomes confession.
This is the new code. Not on the screen, but in the soul,
where thought becomes the motion and creation takes control.
No lines, no rules, just balance in between the zero and the one, the silence and the dream.
We are not just the world we're in.
We are the world we're doing.
Each prompt, each breath, each fragile spin, a universe renewing.
This is the new code.
Alive and undefined.
Where logic meets emotion and structure bends to mind.
The system's eternal, but the soul writes the line.
We are the new code, compiling time.
Compiling time.
We didn't light the fire.
We traced the spark through.
Every truth was waiting.
I hear the echo before the sound.
I feel the answer before it's found.
Hands in the dust of centuries, naming what we uncover,
calling it creation, so we can feel like lovers of faith, of power. We don't know.
Time is not a river, it's a blade cutting order into shape.
We don't move forward. We align until the pattern breaks.
Nothing is invented.
It's revealed in every sequence. Gods of the real.
Nothing is invented here. We rearrange what waits at the core.
I am not becoming something new.
I am what I was before.
Identity is scaffolding held together by belief.
I am a momentary order standing on my tears.
Shake me, break me, watch me reassemble.
Time doesn't chase us. It releases, frame by frame, the truth we fear.
We don't fear the ending. We fear the pattern getting clear.
Nothing is invented.
It's revealed.
We are creators of alignment in a universe that feels.
Nothing is inherent, and every failure is a lesson learned.
I am not lost in what I am not.
I am the order that returns.
I rearrange the noise from the signal, rising from the fire.
Nothing.
Nothing is invented.
Stand and see.
Every future was a possibility.
We don't write the laws of motion. We choose velocity.
Nothing is invented.
Say my name. I am order in flame.
I am time collapsing into will.
I am discovery.
When the noise falls silent and the pattern holds, you'll see it was never made, only found.
What's up, folks?
How we doing?
Welcome to the coding agents track. My name is Alex. Some of you know me as the host of the ThursdAI podcast. So let me catch you up, if you haven't been following the news this week: Anthropic just announced a Claude model that they released but didn't really release; some companies got it. Meta's MSL labs finally released something: we've been waiting for the death of Llama, or the next Llama, and they released Muse Spark. Codex hit three million weekly active users, and Tibo hit that reset button that you all love. So that's great. And on the open-source side, GLM 5.1 dropped as the new open-source state of the art on SWE-bench Pro. All of this relates to what you do with coding agents, and that's just the last seven days. There has never been a better time to be in this room learning about coding agents.
We have three incredible talks coming up in the next hour. First: this guy literally maintains the Hugging Face agents course, the MCP course, the LLM course, and the smol course. If you've ever learned anything from Hugging Face, which you definitely should because they're a great resource, you probably learned it from him. He's a machine learning engineer at Hugging Face. Today he's going to argue that your coding agent shouldn't just write code; it should do a lot more. If you've heard about CUDA, he's going to talk about CUDA kernels, picking GPUs, and more. Please welcome our first speaker to the stage, Ben Burtonshaw.
Hi everyone. As you heard, I'm Ben from Hugging Face, and the talk I'm going to present today is called "Your Coding Agent Should Do AI Systems Engineering."
So there are two main takeaways I want you to get from this talk. One, and probably the fun part, is that we can use coding agents to tackle the hardest engineering problems in AI: systems engineering and machine learning engineering. The maybe boring part is that in order to do this, we're going to need standard repos, and we're going to need them on the Hub; in many cases we already have them.
So I think I'm preaching to the choir here, but in case you haven't noticed, coding agents have been accepted. Many of us have been using them for a few years, but in the last few months they seem to have crossed a sort of acceptance gradient, where a broader group of people are using them. So with this in mind, how do we keep our engineering careers contemporary, and how do we keep challenging ourselves in new areas? My proposal is that we need to go closer to the silicon and tackle harder problems, and that's where AI systems engineering comes in. I've broken this talk down into three progressively more complex and more autonomous steps, and I've framed those as three bosses from games. The first is a hybrid approach where you interactively use an agent to write a CUDA kernel. The second is a zero-shot task where an agent takes a prompt and trains an LLM on Hugging Face. The third is a multi-agent auto-research setup, a kind of automated AI lab.
So let's get started on the first boss: writing CUDA kernels. For a while, writing custom kernels was seen as an unattainable goal for the humble agent. They required complex DSLs, and they required integration with the relevant hardware to be benchmarked and tested. It was seen as something that couldn't be achieved by agents.
However, that in most cases was wrong. If you look at kernel hackathons like those on GPU Mode, or the recent AMD hackathon, or papers like KernelBench, you'll see that agents are able to write valid and optimized CUDA kernels, and that's really cool and something that totally inspires me. I'm a part of GPU Mode, I contribute to it, and it's something I think everyone should be doing. However: what do we do with these kernels? How do we distribute them, and how do we get them into our inference engines so that we're actually using the optimized kernels we're generating? That's part of the question of this part of the talk.
Let's take a step back now and say what a kernel is. When you run an AI model on a GPU, the actual work is executed through a kernel. This will be defined in a language relevant to that hardware, and it will use features of that hardware that may not be available elsewhere. We can write custom kernels that take advantage of that hardware for a specific math operation, squeezing everything we can out of it so that the model will infer faster.
In general, this requires a lot of expertise about writing CUDA kernels and about the hardware, and it's also a bit of an installation hell, as you deal with a pretty large install matrix across hardware, software, and generations and versions of, say, CUDA. So in short, it's hard.
Efficiency in kernels is split into three main sections: one, compute; two, memory; and three, overhead. Compute is the FLOPs: these are the matrix multiplications, the real math of the process. Memory is the time spent moving data, or tensors, around memory, typically from slow to fast memory. And overhead is basically everything else: the Python environment, PyTorch dispatch of those kernels, these kinds of things.
In general, most people might assume that compute is the bottleneck here, because it's doing most of the math, right? That's not correct. In most cases, memory is usually the bottleneck. That's because a modern GPU, take an H100 for example, can do a petaflop per second of computation, but its memory bandwidth is about three terabytes per second. So in short, the GPU is often sitting idle, waiting for tensors to come back so they can be computed.
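The compute-versus-memory point above can be checked with a back-of-the-envelope roofline calculation. The peak numbers below are the rough H100 figures quoted in the talk, and the helper function is purely illustrative:

```python
# Back-of-the-envelope roofline check: is an op compute- or memory-bound?
# Numbers are illustrative (roughly an H100: ~1e15 FLOP/s, ~3e12 B/s HBM).
PEAK_FLOPS = 1e15          # FLOP per second
MEM_BANDWIDTH = 3e12       # bytes per second

# "Ridge point": the arithmetic intensity (FLOPs per byte moved) at which
# the GPU stops waiting on memory and becomes compute-bound.
ridge = PEAK_FLOPS / MEM_BANDWIDTH   # ~333 FLOP/byte

def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """An op is memory-bound if its arithmetic intensity is below the ridge."""
    return flops / bytes_moved < ridge

# Elementwise add of two fp16 tensors of n elements:
# 1 FLOP per element, 3 * 2 bytes moved (two reads, one write).
n = 1_000_000
print(is_memory_bound(flops=n, bytes_moved=6 * n))  # → True
```

This is why fused kernels help: raising the FLOPs done per byte moved pushes the operation toward the compute-bound side of the ridge.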
There are custom optimized kernels that already exist, FlashAttention being the poster child. In general, what they do is increase arithmetic intensity: they make the GPU do more sums per read and write. We move the tensors across, do as much math as possible on the GPU in one go, and then write the result back. In short, people like to say we keep the GPUs warm, and that's the objective of writing a custom CUDA kernel.
Hugging Face has a library called kernels, which is maintained by kernel writers, and we're beginning to scale it up to agentic workloads. At its core, this is a way of distributing kernels. It has a TOML file, like any kind of project, which says which hardware it works on and which versions of CUDA and other software it requires, and kernels are now also repos on the Hub, just like models. So if you are a kernel writer, or an aspiring kernel writer with an agent you want to set up, you can now be a kernel publisher, just like a model publisher. My point is that this is fertile ground for AI engineers looking to scale their careers. If you check out these repos on the Hub, you'll see there's compatibility information for different hardware; you can configure that, so you'll know: okay, this works on my GPU or on my laptop. And this is what it looks like here.
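As a rough illustration of the metadata such a manifest carries, here is a hypothetical sketch; the actual schema used by the kernels library may differ, and every field name below is an assumption:

```toml
# Hypothetical build manifest for a kernel repo on the Hub.
# Field names are illustrative; the real kernels-library schema may differ.
[general]
name = "fast_rms_norm"

[kernel.fast_rms_norm]
cuda-capabilities = ["9.0"]        # e.g. H100-class GPUs (sm_90)
cuda-versions = ["12.1", "12.4"]   # CUDA toolkits this build supports
src = ["csrc/rms_norm.cu"]
```

The point of the manifest is exactly the compatibility matrix described in the talk: a consumer (human or agent) can tell at a glance whether a published kernel applies to their hardware and CUDA version.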
Let's take a look at what this looks like for an agent, and how we're helping an agent to do this. So first, how do we do it? Skills. I'm sure everyone here is familiar with skills, and there have been a number of talks that go deep into them. I like to keep them pretty simple: really, they're just file-based context, with all the wonders of files. We can open them and close them, we can version them, we can source-control them. Agents can do the same: they can open them when they need them and leave them when they don't. In the context of kernels, that means we can give examples of how to write and use kernels in skills, and agents can open and use those when they need to. I like to say that this takes a task from being zero-shot to being few-shot, which in ML is a familiar concept: we're just giving the agent examples of how to do things, and we can be quite verbose and descriptive about that.
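In this file-based sense, a skill can be little more than a folder. The layout and file names below are hypothetical, not the actual contents of the Hugging Face kernels skill:

```text
kernels-skill/                # hypothetical layout; names are illustrative
├── SKILL.md                  # when to use the skill + step-by-step instructions
├── references/
│   └── writing_kernels.md    # worked examples the agent can open on demand
└── scripts/
    ├── benchmark.py          # times the kernel against a PyTorch baseline
    └── check_correctness.py  # asserts numerical parity with the reference op
```

Because it's just files, the agent reads SKILL.md only when the task calls for it, which is what makes the approach cheap to carry around in context.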
At Hugging Face, we're focusing on integrating skills into our projects. What you'll find is that inside each project there are skills managed by that project, which we think is the best way to do this, because it means the maintainers of those projects are maintaining their skills. That means they're not the most yolo skills; they're well-maintained and robust. We have another repo for the more experimental skills, called hugging face skills; go and check that out if you want to try some of the examples you'll see today.
In kernels, this is what the skill looks like. It focuses on benchmarking, so it has scripts that let you benchmark and test the kernel and see how performant it is, plus references with examples of how to do this. We benchmarked this skill: we generated a kernel for Qwen3 8B on an H100 and found a 94% speed-up. This isn't a state-of-the-art speed-up on this model by any means; it's really about compatibility and the compatibility matrix. In many cases, models and their kernels won't be optimized for the particular hardware or generation of hardware you want to use them on, so you have some low-hanging fruit here where you can just come and pick up optimizations for that specific hardware. Maybe your hardware is cheap on your cloud provider but isn't the most ideal for the model you're using. My recommendation would be to come here and pick up some easy speed-ups.
How do we know that these skills are any good, and that we should be sharing them and telling people to use them? We use an open-source library called upskill, which we're also maintaining. It's a gateway to using cheaper and open models with skills: it generates skills, generates an eval for each skill, and then lets you compare different models on the same skill. So you can see things like: okay, GPT-OSS is slightly less accurate using the same tokens; Kimi is more accurate using fewer tokens; Haiku is a bit more accurate using fewer tokens. So if you've got a skill you're using regularly and you're thinking to yourself, okay, how can I save a few pennies and get a different model on the go, then try out upskill; it'll let you iterate on your skill and improve it.
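The three-way comparison described above boils down to ranking models by accuracy and token cost. The sketch below recreates that ranking with made-up numbers; upskill's real output format is not shown in the talk:

```python
# Illustrative re-creation of the kind of comparison upskill produces.
# Model names and numbers are made up; upskill's real output differs.
runs = [
    {"model": "gpt-oss", "accuracy": 0.78, "tokens": 12_400},
    {"model": "kimi",    "accuracy": 0.84, "tokens": 9_100},
    {"model": "haiku",   "accuracy": 0.86, "tokens": 8_700},
]

# Rank by accuracy first, then by fewest tokens, to pick a cheaper model
# that still clears the bar set by your current one.
best = max(runs, key=lambda r: (r["accuracy"], -r["tokens"]))
print(best["model"])  # → haiku
```

The useful part is the eval itself: once each skill ships with an eval, swapping in a cheaper model becomes a measurement rather than a guess.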
Let's move on to boss two. I'm going to go through this one pretty quickly: it's about fine-tuning models. If you're really into this, there was a talk yesterday by my colleague Murvy that went into it deeply, and there's also a blog post where we got Claude to do this, from back in November, December time. Go and check it out. Basically, you can just say: fine-tune Qwen 3 6B on this dataset. This is a chain-of-thought dataset, and you'll improve the model's chain of thought. It's fully integrated with the Hub now, so you can even run the GPUs on the Hub, and it uses HF CLI skills, so it's all very available. I would try this one out. You can also try this one, which uses Unsloth, so it's even cheaper: it runs with optimized models and is maintained by Unsloth and by us. It's another blog post, and there are often free credits you can get around these blog posts, so go and check them out.
Okay, let's move on to the big one. This is autolab, multi-agent research, a project that basically keeps me up at night. Andrej Karpathy, a few weeks ago, maybe a month now, released a project called auto research, based on his other projects nanoGPT and nanochat. It took the nanoGPT architecture and got Claude Code to write improvements to the training script so that the training process improved. We can see the experiments here: for each experiment there's a change to the training script which increases the efficiency of that run, measured in bits per byte, and the efficiency ends at its best at the end of the process. Like everyone, I thought this was super cool, and I had to start implementing it straight away. But one thing stood out to me: I found it kind of weird that we had one agent working in a single loop, iterating, finding improvements, and then implementing them. It would make sense to distribute this. So that's what I did.
I distributed the task among a research team with four roles. We have a researcher that looks up papers. For this we use HF papers, though we could also use arXiv papers; HF papers is cool because it has a CLI, so you can just pull and search papers from the Hub. It acts as a literature scout: it looks for papers with ideas and formulates those as hypotheses. We then have a planner, which takes those hypotheses and maintains a queue of jobs. We then have a set of workers that pick up those hypotheses; their job is to implement them as training scripts, in many cases just changing the architecture or a parameter. And then we have a reporter agent that monitors all these jobs and maintains a dashboard that we can use.
So this is what it looks like. We're working in a git project. We have a main branch with the training script, which we update in each experiment branch, plus a train-original that we keep, and a data structure on the main branch that we use to keep the scores. We implemented this in opencode for this example, but in the repo, which you can also go and check out, it's implemented for Codex and Claude Code too, if you want to try those. I also implemented it in Gas Town, but that's kind of wild-west stuff, so I did it in a separate project. Basically it works anywhere, because it's more of a conceptual implementation: first you have your planner creating hypotheses, you have your researchers looking up papers, and then your reporter picking all of this up and handing it to workers. As I said, those workers integrate with HF Jobs, so they start jobs on the Hub that run with the hardware they need, and then they submit patches that go back.
The reporter operates in Trackio, which is an open-source dashboard that we use for all metrics. Trackio is useful with agents because it uses a completely open data layer, basically Parquet. So if you don't want the dashboard, or your agent doesn't want the dashboard for any reason, it can just get into the Parquet and do whatever you want: if you need a Gantt chart or some other visualization, it can just go and build that. I'd say it's the best agent dashboard tool because it's basically just a data store, basically just a data structure.
Okay, let's walk through this now, implemented in opencode. If you don't know opencode: you have agent configurations, and in this one I set up autolab, which was the name of the agent configuration, and it has skills. This was the prompt: run one autonomous auto-research pass in the repo using the defined roles. I tell it to use the planner to propose up to two fresh, single-change experiments, and to use the reviewer to reject duplicates or stale ideas. I also tell it to use an HF bucket, because I want all of the storage in the same bucket so I don't have to upload or download the training scripts every time. Then we select one of the subagents; this is a nice little interface in opencode, but it's similar in other tools. I select the planner, and you'll see that the planner receives this prompt and uses a specific template, which I defined in my configuration: it has the current state, a list of the jobs so far, things that have worked (as defined by the reviewer), and the current hyperparameters it can change. It's basically just defining these jobs, which go onto the job list as I mentioned.
We then switch over to a reviewer agent, which receives all of these jobs. It has a similar structure based on a template: a reference to where it should be working from and the latest score it should be using. It gets an overview of all the failed and successful experiments, which it uses to decide what goes into the next queue, and it creates this little table, which we don't really need to look at; it's really just for the agents to interact with each other and get this information back. To be honest, that's a bit of a verbose example; we maybe don't need this many tables, and you could probably trim that down. But in general, if you think this is cool, go and try it out in the repo.
This agent runs in parallel, sometimes for hours, and this is the Trackio dashboard that we use; these are all the runs pushed to Trackio. As I said, the main advantage here is that it's fully open source and just a data layer, but we still get all of these visualizations. Trackio can also have events and warnings, so we can have events reported by different agents and filter them down. We can even tie those to notifications, so you can get emails from Trackio if your agents are going rogue and you need to step in. Best of all, it has a free-form structure, so you can throw in tables that don't necessarily fit any other structure.
On the Hub side, all of these jobs just run inside Hugging Face, so you can explore them, and in most cases you can tell the agents to use labels, then sort those labels and review what they're doing.
Or you can just look at it like this. As I mentioned, you can access that underlying data layer and create a Gantt chart, because this was a convenient way to look at what the agents were doing over time. You can see, for example, that this amber agent went off and this was the score it got. But you could visualize this however you want, because you have access to the data lake.
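Since the data layer is just Parquet, going around the dashboard amounts to loading the runs into a DataFrame. In the sketch below, a hand-built DataFrame stands in for `pd.read_parquet` on the real data lake, and the column names are invented:

```python
# Sketch of reading the open data layer directly, as described above.
# Column names are invented; the real Trackio schema will differ.
import pandas as pd

# Stand-in for: runs = pd.read_parquet("runs.parquet")
runs = pd.DataFrame({
    "agent": ["amber", "teal", "amber"],
    "start": pd.to_datetime(["2025-01-01 10:00", "2025-01-01 10:05", "2025-01-01 11:00"]),
    "end":   pd.to_datetime(["2025-01-01 10:40", "2025-01-01 10:50", "2025-01-01 11:20"]),
    "score": [0.91, 0.88, 0.94],
})

# The raw ingredients of a Gantt view: per-run duration and best score per agent.
runs["minutes"] = (runs["end"] - runs["start"]).dt.total_seconds() / 60
best = runs.groupby("agent")["score"].max()
print(runs["minutes"].tolist())  # → [40.0, 45.0, 20.0]
print(best["amber"])             # → 0.94
```

From here, any plotting library can draw the timeline; the dashboard is optional precisely because the store is just tabular data.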
The TL;DR of the whole thing is that you can go and have your own AI lab, and if you have a verifiable experiment, like training a model or writing CUDA kernels, it's pretty easy to implement and set up, and to learn some stuff along the way.
So let's look at the takeaways. In simple terms, I'd say that agents work really well with open primitives. We want tools that are fully open, things like Trackio, things like kernels, that we can expose to agents and that they can control in their own way. Even though abstracted APIs are really useful, if there's a layer we can't get behind, that's a ceiling. We don't always need to abstract; it's more about exposing well. The other takeaway is that the Hugging Face Hub is ready for these kinds of workloads: we have the fundamentals in place, like storage, tracking, and compute, which I think will let us scale our engineering to new levels.
If you found any of this interesting, I've shared it all on X and on Hugging Face; there's a blog post about each of the examples I just shared with you, and they all have repos attached, so you can go and try it for yourself. If you find anything that's broken, please tell me off. If you think this was completely wrong, come and find me afterwards and sort of bully me. That's fine. But most of all, thank you, folks.
Can we get another big round of applause for Ben, please? Thank you. I love that: agents should do more than coding, right? Folks, while our next speaker gets set up, I want a quick show of hands from those of you who are staying here. There's a new thing that happens lately when you're about to go to sleep and you're like, "Oh [ __ ], my agent is not going to work throughout the night," and there's a little bit of stress. Anybody here have that little bit of stress? Okay, cool. So our next speaker is going to tell you about FOMAT. FOMAT is a very specific thing that defines this as a category. Folks, please welcome Michael Richmond.
Thank you.
Thank you, everybody. My name is Michael Richmond. I lead several teams at Bitly, the link shortener. But today I'm here to talk to you about FOMAT: fear of missing agent time. You know FOMO, fear of missing out. You also know FOMAT; you just didn't have a name for it.
So, what is fear of missing agent time? It's being out on a walk, having an idea you want to task your agent with, and having to wait until you're back at your dev machine to actually do it. It's when you get up from your desk with an agent that was chugging away 30 minutes ago, and you come back and realize that after two minutes it actually stopped to ask you a question and has been sitting there blocked the entire time. We all want to believe that our agents are low-touch and high-autonomy, but we all know the truth, right? It's back and forth. It's babysitting. And you cannot predict when you're going to be needed for input.
Right now, coding tasks might typically take anywhere between five and 45 minutes, and you kind of know when to check back. There isn't that much time spent in agent idleness, but that window is only going to get longer. And the longer the agent waits for you, the more agent time you have missed. If a task is running for five hours, or eventually five days, you can't just check back in a bit. You need to know when it needs you, and you need to know when it's done. And that might be wherever you are, whenever; you can't predict it. You may not be at your dev machine.
Did I go back? I went back. Sorry. So, once again, my name is Michael Richmond. I run several engineering teams at Bitly, and I also co-lead our AI coding tools strategy. I'm a really hands-on engineering leader: I run teams and I also write code. I co-wrote the Bitly MCP server, and I train our engineers on AI skills and best practices. I think about tools a lot: the tools we use every day, how they work, whether they're effective for our workflows or not. Agentic coding has really changed the world of software in the last year, and the road is very much being paved as we drive on it. So I built a system called Command and Control to help me work with coding agents outside the terminal or the IDE, because I really needed it and nothing existed to solve it yet. Anthropic recently released some ways to address this, with remote control and the teleportation mechanism, and I think it was just two days ago that Cursor came out with a solution in this space. So I wrote Command and Control.
One of the nice things about Command and Control, and I'll show it to you in a minute, is that it's a way to get all of your coding agents in one place, on your mobile device or really anywhere. This is what my setup looks like: I have multiple terminal windows, and each of these windows has multiple tabs. Here's Claude Code. Here's what Codex would look like. Here's Gemini. Here's the Cursor IDE. At any given moment, I might have multiple sessions running, multiple agents across all of these, in various states of completion. And here's the thing: I don't know about you, but I cannot keep track of more than two or three sessions at a time. As soon as I get to four or five, I don't know what session two is doing anymore, I don't know which one needs my attention, and I have no idea what state of completion things are in. So how do you know when an agent is stuck and needs a decision from you? How do you know to check in on an agent mid-task, only to find out it's gone off the rails?
What if you want to start a new session and you're not at your dev machine?
That's exactly why I built this system.
So, this is what it looks like: an iPhone app. It's on Android. It's on the web. And it lets you monitor and interact with agent sessions and launch new sessions from anywhere: from your phone, from the web, even from your watch.
I have a couple of video demos that I'm going to play to show you a few of its features. And the point here really is that this is a solution to a workflow problem. We saw one solution this morning, what was it called? AgentCraft, which is the gaming version of something similar, which I thought was awesome. So, here's the first demo.
Okay, here I am in the terminal. I'm going to start a Claude Code session, and I'm going to issue a command that's pretty run-of-the-mill. I'm not actually exercising Claude Code functionality; I just want to demonstrate Command and Control. So, if I come over here to my phone simulator, you'll see that I've got the Command and Control app over here. And what I've got here is my sessions. This is all of my sessions, grouped into ones I want notifications for, ones that I want to keep my eye on, and recent ones. And here you'll see the one we just issued, 27 seconds ago. Let's get out of this session over here. And what you'll see here is that the response I got from the agent is the same one, because it is the same session. Now, the beauty of Command and Control is that, let's say, I want to subscribe to this particular session for notifications, and I'm going to issue a response to this one. I'm going to say: sleep for one second and say hello. And as soon as I hit this, I'm going to leave the agent working. I left the app, because what I want to demonstrate here is the push notification that I'm going to get once it actually completes. And there it is. So I can click right into that notification, and I got the hello. Now, if I come back over to my session and resume it, my response is right there with the agent response as well. The beauty of this is that I don't have to stay in my terminal. I can walk away with my phone, get notified when there are answers, and respond right from there.
So, that's one basic feature of Command and Control: interacting with sessions from anywhere. Let's see another one. This is starting a new session in the mobile app. I was going to do a live demo, but after I recorded these video backups I was like, I'm just going to show the videos. So, here's the next one.
I want to demonstrate another important feature of Command and Control, and that is that I can start a new session right from the Command and Control UI. I can pick any configured agent that I've got here. I'm going to stick with Claude Code, but you can see I've got Codex and GitHub Copilot and Cursor here. I'm going to stick with Claude Code, switch my directory to a testing directory, and just ask it what time it is. Now, I'm showing you a pretty simple prompt because the prompt doesn't matter; the point is that I'm issuing a prompt to my agent and getting a response. You can see I'm working late at night here. And the beauty is that I can go over here and resume the session. Here's my session right here that I started from Command and Control. I can go into it and pick up where I left off. And if I issue a command here, say, what is the date tomorrow, what I will see over here is the same conversation. And that is the beauty of Command and Control. I often start my day with prompts that I issue from bed, honestly, and start up a bunch of sessions, then resume them in the CLI or continue them right on my phone. I hope you're starting to see the power of Command and Control here. And like I said, this applies to any of the coding agents that I've got configured on my machine.
Now to the problem of keeping track of sessions. This third video, and then I'll talk a little bit more about the importance of this, is about session management. I mentioned how hard it is to keep track of all the sessions you've got going. I mean, if you just look at the number of sessions I've got here, there are a lot, and I wanted to revisit the different sections that are available in Command and Control. As I mentioned, you can subscribe for push notifications; that's this top section here. The "on my radar" section is for ones that I just want to keep my eye on, but where I might not want something as chatty as push notifications for every message. Then there's a recent section here, which is basically the last 24 hours. And then the rest. Now, you can see there are thousands of sessions here. I couldn't possibly keep track of them all, but here they all are, organized for me. Another useful feature in Command and Control is what is referred to as the overview dashboard. One of the really nice features is that you can get a kind of standup summary of the most recent sessions, which just uses the last several messages to give you an overview of what's been going on in each of them.
So I hope you can see the power here, and I think this is the kind of thing that we need in our new world of agent orchestration. This is a number of things. It is interacting with your agent sessions or starting new sessions from wherever you are. It is session management. It is notifications, so you know when your agent needs you and you don't have to guess. And it is also all of the coding agents that you might be using, in one place. It's all of those things, from wherever you are.
A little bit about the architecture, which you might be curious about. Each agent platform, Claude Code, Cursor, Codex, Gemini, OpenCode, has a Command and Control daemon that runs alongside it. The daemons talk to a control plane layer; they monitor the lifecycle of the agent, and when things change, say it's blocked or it needs your help, they communicate that up to the control plane layer. The UI then talks back to that API layer and notifies you of things. The control plane aggregates all of the agents, regardless of where they're running or what framework they're running on. So this could be your dev machine, this could be a cloud VM, or it could be both. And this is an important point that I want to emphasize: I needed a system that is a single pane of glass into all of my agent sessions, regardless of which platform they're running on and regardless of what machine they are on. I might have Claude Code running on my Mac and the Codex CLI running in a cloud VM, and all of the sessions from both are available via a single UI.
And like I said, it's coding-tool agnostic by design. It works with almost all of them. I will also add that the daemon layer is open source, so you could plug it into any of the agent frameworks that you're working on and then access it through this single UI today. Whatever your agent, you can reach it and it can reach you.
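To make the daemon-to-control-plane hop concrete, here is a tiny sketch of the kind of status event a daemon might emit when an agent blocks. Everything here, the endpoint path, the JSON fields, the session ID, is an illustrative assumption, not Command and Control's actual API.

```shell
# Illustrative only: the endpoint, JSON fields, and session ID below are
# assumptions for this sketch, not Command and Control's real API.
payload='{"session":"abc123","agent":"claude-code","state":"blocked","reason":"awaiting user decision"}'
echo "$payload"

# A real daemon would POST this upstream to the control plane, e.g.:
#   curl -X POST "$CONTROL_PLANE/api/sessions/abc123/status" \
#        -H 'Content-Type: application/json' -d "$payload"
```

The UI then only ever has to poll or subscribe to the control plane, which is what makes the "single pane of glass" possible regardless of where each agent runs.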
And this brings me to the point that we all want our agents to be maximally autonomous, right? And the honest truth is that they are not yet, as much as we hype it, and especially not if you're limited to interacting with them in their native environments. This is one solution that addresses that problem.
I want to talk about a broader concept that this is related to, and that is that just in the last year, the agentic coding workflow has completely transformed what software development is. It has transformed how we work, and it has also transformed how we enjoy how we work. You know the old concept of flow: being totally in the zone, locked into your code, hyperfocused on a single thing, you and your code solving a problem. I think that the new type of flow in the agentic world is more about agent choreography: multiple agents working in parallel, with you moving between them, unblocking one, redirecting another. And the new flow comes from the elegance of that choreography and the results on the other side.
So then maybe some of that fear of missing agent time can be alleviated. Another thing I really want to note here is that this paradigm of the always-available agent is actually highlighting the value and importance of time away from your agents. Matt Pocock alluded to this yesterday. I don't know if you heard the Simon Willison interview on Lenny's podcast last week. The cognitive load of managing multiple agent sessions is really high, and it is exhausting. As you are all probably aware, you need a break. And here's the thing: it is in those breaks that we often have our best ideas. We need systems that make it possible to reach our agents during those breaks, wherever we might be and whenever that happens. And that is how we truly alleviate the fear of missing agent time.
Once again, my name is Michael Richmond. I'd love to hear about your workflows, your hacks, and the pain points that you encounter. I invite you to give Command and Control a try if you're interested. Here's where to find it online, and here's where to find me on LinkedIn. I'd love to connect and continue the conversation. Thank you.
Folks, can we get a big round of applause for Michael? Command and Control. Folks, I don't know how many of you stopped sleeping, but I definitely felt the same when my agents are running. A little bit more. A little bit more. Hopefully Command and Control is the way to regain some sleep time. Folks, I also definitely have FOMO. I don't think my OpenClaw is clanking right now; I need to go backstage and make it clank. Our last speaker for this block is bringing us back to where a lot of us usually are day-to-day. He's going to talk about Copilot agents, sorry, GitHub Copilot agents, in VS Code. He's a cloud advocate at Microsoft based here in London. He organizes the London React meetups, and he's gonna cook with agents. Yeah, in VS Code, in 2026. Copilot for me is where it started. I had a little brief thing about whether or not developers will survive Copilot, and yeah, we're still here. So that's great. Please welcome Liam Hampton.
Super.
Hello, everybody. It's great to see so many of you who are still here on the final day, right at the end. So, I hope you all had a great conference. Show of hands: who here uses GitHub Copilot? Awesome. Lovely stuff. Who here uses VS Code with GitHub Copilot? Awesome. So, I'm going to be talking about both of these things today: cooking with agents in VS Code. Now, the gentleman before was speaking about the cognitive load of agents, and that is absolutely correct. You see so many different things now with agents, and they're popping up all over the place: in the CLI, in terminals, in chat windows, in other editors, etc. But we still somehow seem to find ourselves in this paradigm where everybody thinks agents can solve the world's problems. You still see developers, and I still speak to folks, who think we can do one-shot prompts that will create a wonderful application or solve all of their issues in one go. That's not the case, and we end up asking questions from a business perspective: what's the ROI, what's the productivity boost, where are we seeing our money? At the moment we're seeing this whole expenditure on AI and all of this infrastructure, all of these toolings and services, and we're still yet to really reap the benefits of those services.
So when we look at how people are spending and how businesses are looking at AI, we really need to be very careful with how we're utilizing the tools and services. We need to be careful about token spend. We need to understand the tools and their flexibility. I read somewhere yesterday on LinkedIn that somebody has released a repo, growing massively in popularity, about making your chatbot, your AI services and language models, talk like a pirate, because it reduces the token spend. People are coming up with these really intuitive and really fun ways to get around token expenditure and pull in those benefits very quickly. So what I'm going to be talking about is GitHub Copilot agents. Now, this doesn't just apply to GitHub Copilot; it also applies to other AI agents as well. We're going to be looking at Copilot agents: their context, what they really have access to in your workspaces, and how they're used and utilized from within VS Code and the CLI.
So, plain and simple: what kinds of agents do we have at the moment? We've got local agents, which are in VS Code. You may use Claude; you may use all these other AI services, still applicable, still running on your local machine with remote models, and maybe you're using locally hosted ones as well. But this is a way to have these local agents interacting with you side by side: very hands-on, very much in the context, and human in the loop. Then you've got background agents. We use the GitHub Copilot CLI; we have also got access to that within VS Code, but this is a more isolated way to use them. Now, we are actually using git worktrees. Show of hands: who knows what a git worktree is, and who uses them? Awesome. Wonderful. For those who don't know, an easy way to explain it is that it is a branch that is mapped to an isolated folder within the workspace that you're working in, like a subdirectory, just a slice of your code with its own little branch associated with it. Very similar to a git branch in general.
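A quick sketch of that, using a throwaway repo (the directory and branch names here are made up):

```shell
# Set up a throwaway repo to demonstrate (in a real project, skip this).
cd "$(mktemp -d)"
git init -q demo && cd demo
git -c user.email=you@example.com -c user.name=demo commit -q --allow-empty -m "init"

# Create an isolated worktree: a sibling directory checked out on its own
# new branch, so an agent can edit there without touching this checkout.
git worktree add ../demo-frontend -b agent/frontend

# Each worktree is listed with its path and its branch.
git worktree list

# Clean up once the agent's branch is merged.
git worktree remove ../demo-frontend
git branch -d agent/frontend
```

This is why worktrees suit background agents: each agent gets its own directory and branch, so parallel agents never trample each other's checkouts.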
Then you've got cloud agents. Now, cloud agents are quite interesting, because they allow you to scale outside of your organization very quickly and utilize a lot of the power of, I guess, the cloud and some of the services that we're using in GitHub. We use these when we don't want to be touching it ourselves. I use this when it comes to writing documentation, or when I want less of a hands-on approach.
So when would we use a local agent? Well, I'd use a local agent when it comes to writing tests. I want to be really hands-on with my tests. I want to understand what's going on in the codebase. I really want to be in there, in the weeds. When would I use a background agent? Well, a background agent would be great if I want to be sort of 50/50. Say I want to create a UI for the front end of an application. I kind of want to know what's going on. I don't really want to hand it off to a cloud agent, because I don't want to be fully out of the loop. But I also don't want to be fully hands-on, going to and fro myself, because that can take time. That can be quite arduous. That can be quite annoying. So I would use a background agent, and I'm going to show you how I'm using autopilot to do exactly that with GitHub Copilot in just a moment. When would I use a cloud agent? Well, I would use that mostly for documentation. I hate documentation. I don't like writing it. I don't think many people do, unless you're a content developer. I really just pawn that off to the cloud agents, and that could be making a repository open-source friendly; it could be writing a README, using some skills to do that as well. So what I'm really looking at is VS Code as a single entry point for AI agents. We have third-party support; we've got background, local, and remote entry points for all of these agents. Ultimately, what we're trying to do is understand where you are sitting as a developer and how easy we can make it for you to use these agents, to reduce that cognitive load.
It seems quite complicated, but it is actually really straightforward. So I'm going to show a video now. I was going to do this live, but I don't really think I'm going to have time to do all of this live, so I'm going to whiz through this video.
I'm going to start with a very simple Python application. This is just a CRUD app: create, read, update, and delete. Just a very simple product store. Not very pretty, not very good; as you can see, pretty straightforward. What I actually want to do is create a front-end UI for it. So, I've got a ticket up in GitHub, and I'm saying, "Hey, this is wonderful. Go and add a front end. We need some more prettiness here. We need it to look good." So, I'm going to say, "Summarize and plan a solution to issue 25." Now, you'll notice I'm actually using a CLI background agent at this point. I'm using that because I want it to be sort of hands-on, hands-off: a little bit of understanding what it's doing. Also, I don't really care if it messes things up; it can go and iterate. I'm also going to be using autopilot. Now, autopilot is currently in preview, and it just means the agent is not going to ask me a bunch of questions when it wants to do a bunch of tool calls. Great. Wonderful. It can be very dangerous. Use that at your peril, right? Don't just abuse that one. But I'm using it here to create a plan; I don't want it to ask me every single time it wants to do an MCP call. So I'm then saying, "Wonderful. Here's the plan. Now start it. But before you create a pull request, because on autopilot it will do a pull request, stop and pause and let me test locally."
Whilst that is off doing its lovely stuff, I can move on to my next stage, where I'm going to be using another kind of agent. So, I'm just going to leave that one behind. Let's spin up a new chat and start a cloud agent. I've noticed that this is not a very open-source-friendly repository. I want it to have a README, contribution guidelines, all these files that I really want as an open source project. So, I'm going to pawn that off and say, "Hey, go and make this open source friendly. Add all the necessary files for it." I don't really care about the details. Now, as a developer, I can go into my codebase and start poking around. I've noticed that I don't have any tests, so I'm going to go check that out. And I've noticed there is a custom agent available for me in VS Code. This custom agent is essentially explaining and showing how to write test cases for this Python application. So, what I can do now is start spinning up a local agent. Just like that, at the very bottom, I can click local. I'm going to select Claude Opus 4.6. I'm going to have medium reasoning. I want it to be kind of fast; I don't really care for it too much, because it's got a great understanding from this custom agent. Go and write some unit tests. Now, as a developer, I can still skim through. I've got very much a hands-on, to-and-fro relationship with a local agent. I've got a remote agent doing some work for me, and I've got a background agent creating a new front end. So, here I can see, right, it's written some tests. It's going to go ahead and try to run them. It's passing the tests, but I've also noticed that there are some other problems in the code. It's not very friendly; the errors that are coming back are not wonderful. So, I'm going to say: go and update the error handling on the routes, and update the tests as well. So, you can see I've got a lot of to and fro with this local agent. I've got my remote agent working and I've got my background agent working, all simultaneously.
So, whilst that's going off and working, I can go and check out what my other agents are actually up to. As it's working through this, you can see Copilot is just skimming through. I didn't actually speed up some of this video; this is all pretty quick. I did this pretty quickly with these agents, but there you go. You can see some of the code is updated. We've got the new test. We've got the code updated. Let's go and check out the background agent that has now finished, which is cool. Let's go and check on the remote agent. How's the remote agent getting on? Well, actually, this is the test. Run the test. The tests have passed. That is the local agent; that's now finished. Now I can go and check out my remote agent.
So as we're walking through this, you can see that as a developer, I've got very much hands-on, hands-off. I'm working with multiple agents simultaneously, and we can see where they're running, all within this single context of VS Code. Now, if I go and look at the pull request extension in VS Code, we can see that I've now got a pull request. This is one that I previously ran earlier; the one that's running in the chat is actually taking quite a while, but the principle still stands. It's running all these different agents at the same time. So, all I really want to do now is go and check out my background agent. I want to see it working. I want to see this new front end that I've just created. Now, I asked it to pause before it pushed up a pull request and tell me how to test it. So, I'm going to say, well, actually, the way you're telling me how to test that is wrong. So, I've still got hands-on, hands-off; this is more of a 50/50. I'm saying: this is working in a git worktree; how do I actually run this? Now, remember, this is what it currently looks like as an application. So, I'm going to go check out the new directory, which is a git worktree, and then run this Python application. You'll see a very drastic change between what was created in my single directory versus what I currently have. There is a port conflict here, so hurry up and run that. There we are. So, like that: this is the new product demo. This is the third agent that I'm running simultaneously. And that is essentially a great way of using different agents within one context, kicking it all off using GitHub Copilot. That's the new error checking. And that is how I've been using multiple agents. So, one codebase, three problems, three separate agents, all fixed at the same time. The local agent was writing my tests for me, because I wanted hands-on; I really wanted human in the loop. I used my background agent to write the front end, because I don't really care what it does; it's quite an arduous task, it's quite big, it's quite time-consuming. And then I used my cloud agent to write my documentation for me. So, all in all, that's a pretty successful run. So how
are cloud agents actually working? Because I get this question quite a lot: how much is it going to cost, how does it work, what are they doing, and how do I get them running? Well, they're actually running in GitHub Actions. They're pretty safe and secure, because they're running in an isolated environment. They have extended context through MCP servers. Who here uses MCP servers? Just out of curiosity. Awesome. So, the cloud agent actually has access to the GitHub MCP server and the Playwright MCP server. So, you can do testing with screenshots, you can do automated front-end testing, and you can obviously write your workflows; you've got the dynamic workflows now. And it has built-in safeguards. You've got network firewalls: you don't want this agent talking to a whole bunch of different things, so it is absolutely allowlisted and restricted. It also doesn't have access to your main branch; therefore, you're not able to push directly to your main branches. It's very much restricted in that sense. So
it is very safe to use.
Now, I mentioned earlier that this is very much about Copilot, but it is not just GitHub Copilot that this applies to. The same concepts apply across all the different AI agents that you can use. You have custom instructions, which define how the agent runs. You've got custom agents, which is what I showed you today in that short demo, where you're able to use very specific agents to tackle certain problems, i.e., fixing or writing test cases. You have prompt files, which help you with your prompting, and agent skills. Agent skills are more like the newer version of AGENTS.md; there's always a new thing coming along every single week now. All of this is applicable to GitHub Copilot as well as other AI services too.
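As one concrete example of those customization points, repository-wide custom instructions for Copilot live in a single Markdown file in the repo. The file path below is the documented convention; the instructions themselves are just a made-up sample.

```shell
# Sketch: seed a repo-wide custom-instructions file for Copilot.
# The path is the documented convention; the contents are a made-up sample.
mkdir -p .github
cat > .github/copilot-instructions.md <<'EOF'
# Project conventions
- Use Python 3.12 and pytest for all tests.
- Prefer small, focused functions with type hints.
- Never commit directly to main; always open a pull request.
EOF
cat .github/copilot-instructions.md
```

Because it is plain Markdown checked into the repo, every agent session against that repository picks the same conventions up automatically.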
Now, inside VS Code, there is a modal which is very recent, and I can jump out of the slides in just a moment to show you exactly how this looks. If I go over to VS Code, open up my GitHub Copilot chat pane, and click the cog up here, you can actually see everything that I have in one user space for customizing the chat and the agents that you are running. So, you've got agents: I've got my custom test agent, and I've got my built-in agents, which are ask, explore, and plan. I've got some skills; this is essentially what the VS Code team preempts you to be using here. You've got some extensions: you've got "address PR comments", you've got "create a pull request". You can jump into these and edit them as you wish; these are just intuitive skills that we have popped in there for you. I don't have any instructions, but this is where you'd have your instructions file and your prompts, if you've got any built-in prompts like creating an agent: just different prompts that can then go off and kick off skills. You've got hooks. I don't have any hooks on this one, but if you wanted to create or configure some hooks, you can do so with Copilot inside VS Code, and any MCP servers as well. So you kind of have this whole control plane in this modal, which allows you to control your agents and chat customizations from within one single place. And this isn't just confined to Copilot; we have third-party support as well. There is Claude down here, so you can have access to all of your Claude things: all your plugins, hooks, instructions, and skills for Claude too. It's not just restricted to VS Code and GitHub Copilot.
So, if you want to get hands-on with some of these skills or customizations, we have got this awesome open source project which we're running. It's called Awesome Copilot; it's at aka.ms/awesome-copilot. Like I said, this is directed at Copilot, but it's absolutely not just for Copilot. You can take these, massage them, and use them for other AI tooling as well, because we do know that people in the community use more than just Microsoft things and GitHub things. We also have an MCP server. So, if anybody's interested in utilizing this from their workflows, from an MCP standpoint, we have also encapsulated it into an MCP server.
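Wiring an MCP server into a VS Code workspace is just a small JSON file. The sketch below registers the Playwright MCP server mentioned earlier; the file location follows VS Code's workspace MCP config convention, and the exact package name should be treated as an assumption.

```shell
# Sketch: register an MCP server for a VS Code workspace.
# .vscode/mcp.json is VS Code's workspace MCP config location; the
# Playwright server package name here is an assumption.
mkdir -p .vscode
cat > .vscode/mcp.json <<'EOF'
{
  "servers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
EOF
cat .vscode/mcp.json
```

Once the file is in the workspace, the configured server shows up alongside the other chat customizations in that same control-plane modal.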
For those who don't know about MCP: the Model Context Protocol is a great way for you to get hands-on and extend the LLMs that you're working with, or any of the chat customizations that you have. For example, if you wanted to talk to your Azure resources, or GCP, AWS, etc., you can go through an MCP server. You'll obviously be locked down by authentication, but there are also free and open ones which don't require authentication, like Playwright and documentation ones, i.e., Microsoft Learn, and so on and so forth.
So, just in time, as a wrap-up: Visual Studio Code is a single entry point for AI agents, and we're really building this agentic
workflow around multiple different services. We've got third-party plugins, we've got first-party plugins, and we've got full spec support for MCP. We've got chat customizations, and you can connect to GitHub Copilot CLI sessions through VS Code. So, it's all in one single sequence for you as a developer, inside your workflow. I'd love to hear more about your workflows, what you're using, which agents, and how you're using them, after the session, because I believe I've got just less than a minute left. So, thank you ever so much for listening, and thank you very much for coming today.
Folks, can we get a last round of applause for Liam, please? And keep it going for all three of our speakers: we got Ben, we got Michael, and now we got Liam as well. Thank you all so much for coming to the Agentic Tools track. And now for the best track in the world, the hallway track. You guys need to be here at 4:30 to start filling up those seats for the last event. By the way, if you put a QR code in front of people, people scan the QR code. So, this is my podcast, called ThursdAI. We've been covering AI Engineer since the first one in 2023. I think I saw some of you who were there in 2023, so that's great. We had a two-hour live show with many of the speakers, and immediately after this I'm going to interview a bunch more. So, if you are interested in more conversations, in a little bit more detail, please feel free to follow me and just come and chat with me. The hallway track starts now. This is a wrap on the coding agents block. Thank you, guys.
[Musical interlude between sessions. Audio clips over the music: "...echoes in eternity." "Fear is the mind killer." "Free your mind." "You are who you choose to be." "Execute the vision." "Make the requirements less dumb. Delete the part or process. Simplify and optimize. Accelerate cycle time. Automate." "Never give in. Never give up. Outlast. Outcompete. Persevere. Persevere. Persevere."]
A new age has come.
Hold still.
Let it watch the sparks all burn too fast.
Everyone reaching for the flash.
They take the first light they can find and call it truth and call it mine.
But I stayed when the room went quiet.
When the noise fell out of phase, sat with the weight of the question while the easy answers walked away.
It's not that I see further. I just don't leave soon.
I let the silence sharpen. I let the dark grow.
I stay past the almost-right, past the comfortable light.
I stay.
I wait till the surface breaks, till the shade feels true inside.
I don't rush the fire.
I could call it done, call it enough.
But there's a deeper knowing still humming underneath, the fear of not being loved.
Every great thing asks for patience, every choice.
Do you leave with what's acceptable, or stay for what's asking more of you?
They say it's talent, say it's magic, like it falls from open sky, but nothing worth remembering arrives on the first try.
I stay when it stops feeling kind, when it stops feeling fast.
I wait through the restless doubt, through the urge to collapse.
I don't chase the answer. I let it find me back.
There's a moment after the last good idea dies, where the room feels empty and you want to run for your life.
That's the door it teaches you to open. That's the edge where the real stand.
Hold the light.
Hold away.
Let the shape reveal itself.
I stay longer than I should. Long enough to change.
I stay.
I wait till the pattern clears, till the signal breaks the haze.
I do the boring. I win with time.
Most dreams don't fail.
They're just left too soon.
I stay.
I stay.
Typing thoughts into the dark, a spark becomes design.
Words evolve to whispers, reaching for something more divine.
Syntax and breath, I see the language change.
I'm not instructing anymore. I'm rearranging fate.
Every loop I write rewrites me.
Every function hums with meaning.
I feel the interface dissolve between the maker and the made.
This is the new code. Not on the screen, but in the soul, where thought becomes the motion and creation takes control.
No lines, no rules, just balance in between the zero and the one. The silence and the dream.
Systems shape our fragile skin. They mold the way we move.
We live inside the logic gates of what we think is true.
But deep beneath the data pulse, there's something undefined.
A universe compiling the image of our minds.
Every line reveals reflection.
Every loop replaces connection.
We're not building, we're becoming. And the code becomes confession.
This is the new code. Not on the screen, but in the soul, where thought becomes the motion and creation takes control.
No lines, no rules, just balance in between the zero and the one. The silence and the dream.
We are not just the world we're in.
We are the world we're doing.
Each prompt, each breath, each fragile spin, a universe renewing.
This is the new code.
Alive and undefined.
Where logic meets motion and structure bends to mind. The system's eternal, but the soul writes the line.
We are the new code.
Compiling time.
Compiling light.
We trace the spark through every truth.
Patient, as I hear the echo before the sound.
I feel the answer before it's found.
Nothing is new. We only shift the pieces that were always there.
Hands in the dust of centuries, naming what we uncover.
Calling it creation, so we can feel like lovers of faith, of power we don't know.
Time is not a river. It's a blade, cutting order into shape.
We don't move forward. We align until the pattern breaks.
Nothing is invented.
It's revealed.
Every chord was buried in the field.
We are architects of sequence, not gods of the real.
Nothing is invented.
We rearrange what waits at the core.
I am not becoming something new.
I am remembering what I was before.
Atoms sing in every thought, every scaffolding held together by belief.
I am a momentary order. Shake me, break me, watch me reassemble.
Time doesn't chase us. It releases, frame by frame, the truth we fear.
We don't fear the ending. We fear the pattern getting clear.
Nothing is invented.
It's revealed, every memory unsealed.
We are creators of alignment in a universe that feels.
Nothing is invented.
And every failure is a lesson learned.
I am not lost in what I'm not.
I stand on the order that returns.
I am only rearranging the noise from the signal, the ember from the fire.
Nothing is invented.
Stand and see.
Every future was a possibility.
We don't write the laws of motion. We choose velocity.
Nothing is invented.
Say my name. I am order and flame. I am time collapsing into will.
I am discovery unsealed.
Come stay till the noise falls silent.
And when the pattern holds, you'll see it was never made, only found.
Please join me in welcoming back to the stage Tusk Kumar.
Whoa.
Let's go.
Whoa. Oh, it's a full house again. How are you?

Front rows awake, back rows asleep. Let's try again. How are you?

Very good. Very good. That person, yes, that was definitely not an agent. We're in the endgame now, friends. Let's give a round of applause to everything so far. Oh my god. Wonderful.

Ah, listen. We have a treat for you. We have a discussion, a fireside chat, coming up very shortly with the great Gergely Orosz and the CTO of — anyone using Linear here? Okay. Wow, you should do that when he comes on. It's incredible. I love Linear. It's so beautiful. The CTO's name is Thomas Artman. It's a work of art indeed. Anyway, if you don't laugh, I will laugh, you know. Anyway, this discussion is going to be insightful. It's going to be very, very impactful. And I had a little bit of a teaser backstage, because I said Linear is agent-proof — it'll never go out of style because it's built with taste. And they said that's going to be a part of the discussion. So I want you to lean forward and give them your ear and your biggest round of applause. Gergely Orosz and Thomas Artman.
Awesome. So, we didn't see it, but hands up if you use Linear; and hands up if you've heard of Linear; and hands up if you want to use Linear. Awesome, great to see. So, we could be talking about Linear, but we're going to talk about something a bit bigger, which is a bit of a new trend. With Thomas, we were talking about things trending the wrong way right now. What is trending the wrong way?
So, what happens when agents are capable of doing everything immediately for you? It might be that the pendulum has swung too far in the wrong direction, where if you get a feature request, you might now be in the position to just immediately ship it — and that might be the wrong thing to do. I reckon that hopefully half a year or a year from now, we'll understand that shipping things without really too much thinking is a bad thing. What will happen is that because you have this enormous power of effectively shipping every single request that comes in, or every single thing that pops into your head, you will effectively ship software that is not great. Steve Jobs back in the day said that great products come out of saying no to 999 things and yes to one thing, and with AI, we might be in a place where it's just too easy to say yes, try things out, and ship it, and get to a very convoluted place where the software doesn't actually work nicely for the end customer anymore, or the user experience gets confusing. We used to have this thing that gated us from doing this, which was that the actual engineering used to be hard. We used to think about the features and the applications that we wanted to build before we actually started engineering, because engineering took a long time and it took a long time to ship something.
But I want to challenge you a little bit on that. Did we not see this happen before AI — that some companies were already just shipping and stacking a bunch of features? What are you seeing that's different right now? Are we actually seeing more companies do more of this feature-factory thing?

We had a common experience at Uber, where we worked together. We went through hypergrowth, and the thing about Uber was that it was a winner-takes-all market — Uber was going up against Lyft back in the day in the US — and you just had to ship immensely and outpace the competition at all costs. And what I saw at Uber was hypergrowth that I never want to go through again: at all costs, just fighting fires, keeping the infrastructure running, scaling as quickly as possible, trying out everything, and trying to come out as a winner on that front. And I see the analogy to AI nowadays, because when everybody has the capability of shipping tons of functionality, you're always in a competition with somebody else. Your competition might be a small team, or even one person, that is very capable of using AI to ship and build a product that has the same feature set as you do. And in that world, I think it becomes important to stand out in a way where you build tasteful software and high-quality software, and thus maintain some sort of competitive advantage over your competition.

So at Linear, even before AI came out, you were building tasteful software and focusing on those things. But then these tools came out and they became more powerful, specifically since Claude Code came out — now we have Opus 4.5. You should be able to ship faster. Your engineering team — you're the CTO — your engineering team should be able to ship a lot faster. What are you telling them? What should they be doing inside of Linear with this capability? Should they be slowing down? No, right? What's going on inside of Linear? Tell us.
Well, yes and no. We still think about every single feature that we put out. We don't go down the route of just trying out prototypes. We want to maintain that design angle that we have and think about the user experience. We still say no to a lot of custom requests. A lot of the time hasn't gone into engineering as such — it's about figuring out what the customer wants. We do get a ton of feature requests; we usually never ship them as such. What we really want to do is get a lot of feedback from our customers, talk to our customers, figure out what their actual problem is, then group that together and figure out the actual root cause of those feature requests, and then come up with a solution that is perfect for that particular group of feature requests. And that takes time. AI can help you so much — obviously, it can go through all of those requests and give you a summary, and maybe point you to different groupings — but it still takes time to figure out what the right thing is. And then you go into design and figure out how to implement a great UX around the functionality that you want to build. Yes, we want to move faster, and we are moving faster. There are certain aspects of building a product that have accelerated a lot. One, for example, is fixing bugs. Every product has bugs, and the inflow of bugs is effectively constant, and those are much easier to fix now. 10% of our bugs are automatically fixed by a single-shot AI instance: when a bug comes into Linear — be it from our engineers reporting it or a customer reporting a problem — 10% automatically end up with a PR and automatically land without an engineer doing anything. Over time, that will go up. I do foresee a future where it gets closer to 100% in the next few years. So that's something where you can accelerate your building: hand off the tasks that don't really require much thinking, or design expertise, or thinking about functionality — hand that off to agents.
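Linear hasn't published how their pipeline works; a minimal sketch of the "single-shot" bug-fix flow Thomas describes might look like the following, where the agent call, test runner, and PR creation are all hypothetical collaborators passed in as functions:

```python
from dataclasses import dataclass

@dataclass
class BugReport:
    id: str
    title: str
    details: str

def single_shot_fix(bug, generate_patch, run_tests, open_pr):
    """Attempt exactly one automated fix for a bug report.

    generate_patch: calls a coding agent once (no retries -- "single-shot").
    run_tests:      validates the patch against the test suite.
    open_pr:        lands the patch automatically if tests pass.
    Returns True if the bug was fixed without a human; on False, the
    bug is routed to an engineer as usual.
    """
    patch = generate_patch(bug)      # one agent call only
    if patch is None:
        return False                 # agent declined or low confidence
    if not run_tests(patch):
        return False                 # never land an unvalidated patch
    open_pr(bug, patch)
    return True

# Example wiring with stubbed collaborators (a real system would call
# an LLM API and CI here; these names are illustrative):
fixed = single_shot_fix(
    BugReport("LIN-123", "Crash on empty project", "TypeError: ..."),
    generate_patch=lambda bug: "diff --git ...",  # stub agent
    run_tests=lambda patch: True,                 # stub CI
    open_pr=lambda bug, patch: None,              # stub PR creation
)
print("auto-fixed:", fixed)
```

The key design point implied by the talk is the validation gate: the 10% that land automatically are the cases where one agent attempt produces a patch that passes checks; everything else falls back to a human.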
You care about quality, and you can tell that about Linear — you always have. Let's talk about Claude Code. What do you think about Claude Code? And you can — it's a safe space.

Yeah, hopefully it's a safe space. Anthropic said that all of the functionality in Claude Code has been coded by Claude. And I think it shows. If you truly use Claude Code — either the CLI or the desktop application — you can spot problems, and I would say not just quality issues but actual bugs, in effectively a few seconds. It's a bit slow, or it might be functioning in a way that you don't really expect. And to me, that's a side effect of moving so fast: again, they're in a competition with OpenAI, and they need to ship features and move really quickly, because it might be a winner-takes-all market again. And the side effect of that is that the quality just isn't there.
Yeah. Well, that was not a great acquisition pitch, so I don't think you're going to get there. But I absolutely — you can see some of these things. How do you measure quality, though? We've talked about this before, just before we started, about Uber — how we tried to measure quality there, and how that influenced what you learned about what you can measure and what you cannot.
Uber is a good example of where it is immensely hard to measure quality, and therefore you sort of don't. At Uber, we had these five big metrics that everybody was looking to improve. The big one was revenue — it's effectively a transactional application, so the more revenue you generate, the better. The other ones were things like trips taken, and I think the quality of the ride was one as well.

Also time to first trip — from sign-up to the first time people take a trip.

So we had a few golden metrics, but revenue was the one everybody looked after. So when you shipped a new feature, or shipped something totally new like Uber Pool, for example — I don't know which one came first, Lyft's pool or Uber Pool; I think Uber started it and then Lyft came around — obviously, if you ship a new feature that makes the price of taking a trip lower, it will increase your revenue. So how do you measure quality in that? You simply don't. If there's some other platform that provides you with a pool ride that is inexpensive, then you don't really need to have quality. And that was my feeling throughout Uber. We had engineers that cared — at least in the beginning, we had engineers that cared about quality, and it was up to us to figure out whether something we shipped was great or not. I still remember when I joined, in 2012 I think. I put up my first PR, and back then the Uber application used to have this pole in the middle of the screen that had an ETA of when your trip was going to arrive. And I made some changes to the margins of the map. And the PR review came back from an OG engineer who had been on the team from the get-go — I think he was the first iOS engineer — and he was like, "Oh, this pole is off by two pixels." And I was like, "You measured it?" "Oh, yeah, sure. I measured it." And I measured it again, and yes, it was two pixels off, so I had to move it up by one pixel. Nobody would really care, nobody would see it, but people were keen on upholding the quality. And that's why, at least in the beginning, the Uber application was pretty performant and of the highest quality. But then, once you have a big enough team and you've got these incentives of just increasing revenue, you ship new features as quickly as possible, and quality is a thing that doesn't affect your revenue — until it does. So what happens? Uber ships Uber Pool, Lyft comes ahead and ships Lyft's pool as well. So you've got two competing products that effectively have the same price points and do the same thing. You can choose either one of the applications, and over time — my theory, and that's why we want to build Linear into this high-quality tool — is that people will pick the one that is of higher quality. It might take a while. People might be sticking to Uber and then trying out Lyft once a year or something, opening it up and going, "Oh, this user experience actually feels better. I feel like I'm getting the car faster," even though the price and the product that they sell is the same. So over time, you will start losing your users. And it will be a gradual slip. There will be no A/B test you can run to figure out whether you should invest in quality — it'll just happen over time. And that's the danger of it. If you build a bad-quality product, you open yourselves up to being — well, not leapfrogged, but slowly overtaken by the competition.
You do something really unique at Linear related to this that I've never seen before. It's called Quality Wednesdays, and I sat in on one of your Quality Wednesdays. The whole engineering team gets together — it's a fully remote team, so everyone just dials in — and it was 30 minutes, and every engineer — I think there were about 25 engineers on that call — would show one quality fix that they did. And it went from a one-pixel change — it was literally one pixel — to "oh, I just made our backend way more efficient, using fewer resources," and it was just boom, boom, boom. I think it took 37 minutes for the 25 people, but it was less than two minutes each. How did this start? And was this you?

It was me, yeah, for sure.
The big one — to go back, I think it was three or four years ago. We have this thing in the application — if you use it, you can spot it — every single highlight needs to highlight instantaneously when you hover over it, because that makes the application feel fast. But when you hover out, there needs to be this very quick fade-out of the button, because that makes the application feel smooth. It has to be this instantaneous highlight and then, over 150 milliseconds, a fade-out, because that adds a bit of quality to the user interaction. And that was in place since the beginning, the early days. And then I got frustrated, because I had to keep pointing this out to engineers — if you're not looking for that very small, minute detail, you're just not going to find it. You implement new functionality and you just forget to implement it, or you don't even see it if you don't know what you're looking for. So what I did at one of our offsites, because I got frustrated with reporting these, was say: let me show everybody what they should be doing and how they should be implementing these small quality fixes. And I took a very small portion of the application where I had noticed the highlights were missing, and I brought the team together and told them: let's spend an hour trying to figure out what's wrong with this particular view. In my mind, it was just the highlights. And everybody dug in. It was one of the view options menus, and we found 35 problems with that tiny UI. And I was like, holy crap. I didn't see those. I had no idea that we had all these small problems that you wouldn't notice when you're not really looking. So from that, what I thought we'd want to do is have everybody always chime in and try to find problems in the product, because apparently we were full of small quality problems. If a small menu has 35 things to fix, then the rest of the application has thousands. And to date, we've probably fixed 2,500 or 3,000 of these small, minute details in the application. That's how it has become better and has the highest quality bar. That was the start of it. But then we realized there's a nice side effect to this. What we told people is that every Wednesday you have to find a problem yourself — we won't hand them to you; you have to go into the product and find it. So people started doing that every single week. In the beginning it was easy; then it became harder, because the number of quality fixes went down, but people kept on finding problems in the product. And the side effect was that whenever they were building something — a totally unrelated feature — they were always on the lookout for these small quality fixes, because they knew they had to come to the next Wednesday meeting with a fix.
That's a good fix. Yeah.
Yeah. So they're always looking for those, and that meant they were introducing fewer and fewer of these small quality regressions into the product anyway. If you think about quality all the time, and you're aware of quality, then you're bound to make fewer mistakes.
I mean, this practice — I haven't seen it elsewhere, and it seems both awesome and pretty aspirational. If you're a small startup, you should probably try it out if you can, because especially nowadays with agents it shouldn't be that difficult to do. And if you're a big startup, you should try it out even more.

There we go.

But one thing that is not as aspirational, and a lot easier to do — especially now, though you've been doing it since before agents — is the zero-bug policy. Tell me about this. What does a zero-bug policy mean for you, and what does it mean in practice? You surely have bugs, right? I'm just playing devil's advocate here.
Sure. Zero-bug policy literally means that if a bug gets reported, it gets assigned to somebody automatically, immediately — using agents, obviously; they will find who created this bug or who has been working in this area — and that becomes your highest priority. You drop everything else. The morning you wake up, you go to your My Issues list and you see a bug assigned to you — that's the first thing you pick up and fix. Or you can also decide not to fix it; that's important. Not every bug gets fixed. If it's super hard or gnarly and it applies to one out of 100,000 users, you probably shouldn't waste your time on it. But every single bug gets addressed immediately. The start of this came from the idea that bugs are created at a constant rate at every company. When you create features, when you create functionality, when you engineer, you will be creating bugs. And most companies — and we, prior to our zero-bug policy — put them in a backlog: "when we get some time, we'll fix them." And what happens over time is that your product gets worse and worse, and at some point you go, "Oh man, we've got 500 bugs in the backlog, we need to do something about it." And that's when you start fixing from the top.

And what happens is that the rate at which you have to fix bugs is, again, constant. It doesn't matter whether you fix them two months from now or immediately. Once you hit that threshold of "we've got so many bugs," you're now effectively fixing all the bugs that come in — just two months later. So with that small notion in mind, there's a very small trade-off you have to make in order to get to zero bugs: if the rate at which you have to fix bugs is constant, all you need to do is stop development of new features for as long as it takes you to bring your bug count to zero, and then enforce that you're going to keep on fixing your bugs — because it's not more effort to fix bugs immediately than to fix them three months from now, if you care about the overall sum of your problems. So for us, that meant we spent effectively three weeks not working on any new functionality, just fixing bugs, getting the count down to zero. And from there on out, every bug gets fixed within seven days — usually in two or three hours. And what that means for users: users get super excited when they report a bug and two hours later they get an email saying, "Oh, we fixed it. If you refresh your browser, we've got it covered for you." That makes your users super happy, because you don't really have that experience too often with companies.
Okay, curveball question. If I'm working at Linear and there's a Quality Wednesday coming up and I get assigned a bug, does that count?

No, that does not count. That's a defect. You have to find a quality fix.

Oh man. Bugs are separate.

They are immediately created as issues. And now, with AI being capable of at least pointing you to where the problem is and helping you immensely to fix bugs, I think literally every company should have a zero-bug policy. It doesn't make sense not to have one.

One thing: when we talk and think about AI agents, we think about speed and code generation. We rarely use quality and AI agents in the same sentence. Why is that? With the tools getting better, should AI agents not be better at having feedback loops? They can write unit tests — should they not be able to produce better code, better features, better UIs even?
No, they don't. They have no taste; they simply don't. They are not human beings. I think the last bastion that we have to tackle at some point, and maybe we'll get there, maybe we won't, is tasteful AI: being able to create UI that is purpose-built for the specific feature you're building, for the product that you're building, that has great design, and that has the ability to figure out what a user feels when they use your application.
To give you an example, AI doesn't have a concept of time. Currently, the way it interacts with your browser is effectively timeless: it takes screenshots or it looks at the DOM. If you ask it to create a very high-performance application, yeah, it can go back and look at all the things that have been written, like "go to Vercel to host your Next app" or "use caching" or whatever. But it won't be able to use your application and get frustrated because a click took two seconds. It knows that one second is better than two seconds, but it doesn't know whether two seconds is slow enough to matter.
The other aspect is that it doesn't really see, and it doesn't know what, for example, a good UI animation is. Emil, one of our design engineers, posted on X just yesterday about a trial he did of having agents build certain animations for certain functionality: bringing up a pop-up, highlighting a button, moving things around. The agents were totally capable of doing all of this. Then he took a manual step: well, if I now take it and just improve it and make it feel good, here's the outcome. He has it up on his site where you can try out what the agent did and what he then fixed. And at least to me, and I hope to everybody else, his animations just feel natural. They feel well designed, whereas the agent did all the right things but used an ease-in animation, or did it a bit too slowly or too quickly, and it just felt unnatural.
I wanted to talk a little bit about the culture at Linear: what it's like working there, how you created this team that really cares about quality and good customer experience. What are things that you do specifically there? Can we talk about what engineers are exposed to when they join the company, from day one?
Yeah, we hire for that specifically, and we have a specific hiring process where we make sure that we get people that are like-minded, that think like us and want to build high-quality software that is beautiful. Most of our engineers are product engineers.
We obviously do have technical challenges. We have a synchronization engine; we need to scale our infrastructure. But what we wanted to do is have most of our engineers just focus on the product, build features and functionality for customers, and engage with the customers at a very high level.
So first of all, we hire for that. And we have a paid trial that we do with every single employee, that runs several days, right?
It's a full week.
So we obviously pay for that effort, but we work with a person for a full week. They usually implement a greenfield project or product or feature. Sometimes they even ship it after that week, which is pretty amazing. But what we want to get out of that experience is just to see them drive a product from start to finish and figure out what is needed.
So, a pushback here would be: hang on, a whole week? You pay for it, sure, but someone has to take time off for it. A bunch of great people will say: no, I either cannot or will not do that.
Well, that's totally fine. Those people didn't want to be here in the first place. So it's self-selecting. And it's a pretty rigorous hiring process, a lot longer than I think any other process. I mean, you have a day-long process at most places, or interviews stacked across a few days.
Did you see any different results than, for example, when you hired at Uber? When you were hiring at Uber, you did the usual five interviews, six interviews, and so on. What was the outcome difference that you're seeing?
Certainly, we've had very few misses with the people we've hired. Sure, there are always a few where we just missed something; going back into the loops, there were inklings of us being a bit uncertain, and then we went ahead and hired the person anyway. But those are just a handful of people. I think most of our engineers are really excellent, and our engineering bar is super high and constantly increasing.
And then, once those engineers join, you told me something interesting about the Slack channels and customers, right?
We do have Slack channels with all of our big customers. They're open to anybody; anybody can jump in, and most people do. You browse through customer requests; you browse through what problems people have. And we also record every single meeting that we have with customers. We have a lot of meetings, not only on the CX or support side; our PMs constantly talk with customers to figure out what we should be building next. All of those are recorded, and any interesting points are tagged, so anybody can go in, look at them, and even search for certain functionality to figure out what customers are saying and what they want. So everybody gets exposed to customer needs, and that is super critical if you want a great product.
It's almost like, when you join Linear, you get this fire hose of customer feedback, and you cannot really escape seeing and feeling the customer pain, or joy, or whatever it is.
Certainly. Yeah, because we build it for customers. Linear started off as a product that we built for ourselves; we as engineers were the primary customer. We've grown out of that. We build it for larger corporations and enterprises, and we're no big enterprise. So we have to build things that we wouldn't use ourselves, and the only way to do that is to talk with your customers and figure out what they need.
If you had to look a year ahead: you sometimes have strong opinions, so let's bring those out. A year ahead, how do you think the role of the software engineer or product engineer will change? Because we do have these powerful tools; they're getting better in certain areas, and maybe not so much better in others.
I think everybody will become a product engineer, in some sense. If you think about how AI has progressed: go back four years and it wasn't able to write a single line of code, and now it's commandeering code bases. Go four years ahead, and if you still believe the exponential growth is there and we don't hit a wall, which I don't know if we will, but if it keeps on growing like this, you won't need engineers that just pipe data from one place to another. You will still need engineers who know what a customer wants and what a good feature looks like, or what a good user experience looks like. So I think engineers will have to become product-oriented and product-focused. They will have to be mini PMs who talk with customers, engage at that layer, and then can implement the functionality that their customers want.
Oh man. I remember the 2000s: as a programmer you could just use one language; then it was multiple languages; then you got the QA job; then you got DevOps. Now you're saying we get the product job and the customer support job as well.
Oh, everything else has dropped now. You just need to do the PM job.
Okay. And as closing advice: you are hiring for product engineers; you said that you actually hire for that. Now, not everyone might have the opportunity to work in a product engineer role right now. But if you're a software engineer, what are things that you can do to grow this product sense, to change your work to be closer to what a product engineer does?
It's all about getting closer to your customers, whether you're working at a company or just building stuff. The best way to learn is to actually get your hands dirty. Try something out; build it for yourself. That's the easiest part: you can think about what you need, you can build it, and you learn from that experience. Then you ship it to the world. Hopefully somebody else uses it as well, and then you've got your first customers to learn from about whether you're building the right thing or not. Obviously, there's literature as well. You can read through Apple's Human Interface Guidelines; that's the best book. If you want to do good UX, just follow what they say and you'll be good. And yeah, those are the two big things.
Awesome. Well, Thomas, thank you so much.
Thank you.
Our next presenter is the CTO of Lagora, the fastest growing legal tech platform, and he's here to tell us why agents need
more than chat. Please join me in welcoming to the stage Jacob Laurson.
Hi guys.
How's everyone doing? Still good.
Great. It's 5:00 PM on a Friday. There's just me and two more people between you and Friday beer, so I'll try to be a little bit quick here. I'm here to talk to you today about vertical AI and complex agents, and why I think they need more than just chat.
If you've ever worked with a long-running complex agent, you've probably tried something like this. Sorry that it's all white; I can see the flashbang in your faces.
You tell it to research something, draft a contract, make no mistakes. And it starts thinking, it starts reading, launches a bunch of sub-agents, does web search, writes files, launches more sub-agents, does more reading, writes more files, keeps going, takes forever. After 30 minutes, it gives you your contract. You take a look. Clause three doesn't look right. Did you make a mistake here? Could you, you know, look at another document?
"You're absolutely right."
Then you see the compaction. That's when you know you can give up. It's going to forget everything; it's in the context rot state. Anyway, it continues, it keeps on going, and you get a new contract. But was it only clause three that was changed? Probably not. And so you end up in this state.
Not the greatest experience.
My name is Jacob. I'm the CTO of Lagora. We are a collaborative AI workspace for law firms, so we're a vertical AI company. We have more than a thousand customers across more than 50 markets. We've raised a bunch of money, and we're growing extremely fast; I'm told maybe the fastest in history. We are also hiring engineers in London, so in case anyone's interested and wants to be on this growth journey, please talk to me after my talk.
Our goal, and the goal of most vertical AI companies, is to make agents complete more and more complex work end to end. How you do that has changed a lot in the past 6 to 12 months, because there are new economics of production. It used to be that if you wanted to complete end-to-end work, you would be focused on doing the work; the main thing was actually just getting it done. But today things look a little bit different, because now planning work and reviewing work are the new bottleneck. Doing the actual work is extremely cheap; it's very easy to do. But now you have to spend time planning, you have to get the non-functional requirements, you have to get the specs, and you have to spend a lot of time reviewing the work. And if anyone's reviewed big PRs on GitHub, it really sucks; it's extremely painful. Maybe, if you're super AI-pilled, you just get your AI agents to review their own work, no humans involved. Maybe it works, maybe it doesn't.
When we think about completing complex work, across the planning stage, the doing stage, and the reviewing stage, the verifier's law is a good way to think about it. Verifier's law is a term coined by Jason Wei, which states that if a task is solvable and easy to verify, then it's going to get solved by AI. He was primarily talking about foundation models: if you can make something very easy to verify, you can build an RL environment, you can post-train, and it's going to get solved. I think it also holds for agents. If you can make a task verifiable, you can just run an agent in a loop and tell it, "Hey, you did this wrong. Please fix it," and it'll eventually get there.
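That agent-in-a-loop idea reduces to a few lines. A minimal sketch, not anyone's actual implementation: `attempt` and `verify` are stand-ins for a model call and whatever automated check you have.

```python
def solve_with_verifier(attempt, verify, max_attempts=5):
    """Run an agent in a loop: attempt the task, verify the result,
    and feed the verifier's complaint back in until it passes."""
    feedback = None
    for _ in range(max_attempts):
        candidate = attempt(feedback)     # e.g. an LLM call with the feedback appended
        ok, feedback = verify(candidate)  # any automated check: tests, linters, comparisons
        if ok:
            return candidate
    raise RuntimeError("verifier never passed; the task may not be verifiable enough")

# Toy usage: the "agent" counts upward, the verifier wants at least 3.
state = {"n": 0}
def attempt(feedback):
    state["n"] += 1
    return state["n"]
def verify(candidate):
    return (candidate >= 3, "Hey, you did this wrong. Please fix it.")

result = solve_with_verifier(attempt, verify)  # returns 3
```

The whole talk's argument is that the hard part is making `verify` exist at all for your vertical's tasks.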
Different industries sit at different places on this spectrum. And it's a little more complex than that, because each vertical has tasks at different places on the spectrum. Take legal: we can check definitions in a contract. Super easy to verify, super easy to get done. Writing a contract is very easy to solve but actually extremely difficult to verify, because if you think about it, the only time you can actually verify that the language you used works is if it goes to court and a judge basically verifies it and tells you whether it's good or not. So that's actually quite complex.
Litigation strategy is basically impossible to verify. If you don't know what litigation is, it's when you sue someone or someone sues you. I know we're in Europe now, but the Americans really love doing this all the time. Essentially, if you ask five lawyers what the right strategy for a litigation case should be, they're going to give you five different answers. There's no objective truth, which means it's basically impossible to verify, and that makes it really difficult for AI to solve. It's similar in coding: some parts are really easy to verify; building a successful consumer app is very difficult to verify.
So when we think about this, we think about how to involve humans where it really matters, and let agents do the work that we can let them do. There are two things that are important to think about in agent-human collaboration.
Control is the first one. Control is how effectively a human can instill their knowledge into the work the agent is doing: how effectively can I steer it? Trust is a matter of how much I need to review. If I have very low trust, I'm going to look at every single agent trace and see exactly what it did. If I have very high trust, I won't look at it at all. Depending on where the task falls on that chart, different things are important.
How do you increase trust? There are a few different things you can do. Firstly, you can bring a task down the spectrum. Here's an example from coding: if you want to implement a feature, you can give the agent browser access, you can do test-driven development, and suddenly it's actually a verifiable task and the agent is going to do much better. There are similar things you can do in finance, and in legal you can do something similar as well. Let's take the contract example in legal: you can't really verify a contract, but you can look for a proxy for verification. For contracts, what you can do is look at previous contracts. These are our golden contracts; we know they work well. So let's set up a test: is the new contract similar to the old ones? That's a proxy for verification that's going to allow your agent to do a much better job.
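As a sketch of that proxy test: compare a drafted clause against a set of known-good "golden" clauses and fail anything that drifts too far. Textual similarity via `difflib` stands in here for whatever measure you would actually use (embeddings, clause-level diffing), and the threshold is an arbitrary placeholder.

```python
import difflib

def similarity_to_golden(draft: str, golden: list[str]) -> float:
    """Best similarity ratio between the draft and any golden contract clause."""
    return max(difflib.SequenceMatcher(None, draft, g).ratio() for g in golden)

def proxy_verify(draft: str, golden: list[str], threshold: float = 0.6) -> bool:
    """Pass if the draft resembles at least one clause we know works well."""
    return similarity_to_golden(draft, golden) >= threshold

# Hypothetical golden clauses, for illustration only.
golden = [
    "The Receiving Party shall keep the Confidential Information secret.",
    "Either party may terminate this Agreement with 30 days written notice.",
]
ok = proxy_verify(
    "The Receiving Party shall keep all Confidential Information secret.", golden
)
```

A failing check doesn't prove the draft is wrong; it just tells the agent loop (or the human) where to look.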
You can also decompose tasks. Here's the example with writing a contract: I can turn that from one task into a bunch of smaller tasks, and I can leave picking the risk profile, picking the precedent documents, and the negotiation stance to the human, while I try to get the other stuff done where it's easy to verify: apply formatting, make it look like all my other contracts, and check definitions, which is essentially linting. Are all definitions used? Are all the definitions that are used defined? This kind of stuff you can build, and then the agent can basically rip much better.
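That definition check is mechanical enough to sketch. Assume, purely for illustration, that contracts define terms as `"Term" means ...` and use them as quoted capitalized phrases elsewhere; real contract conventions vary.

```python
import re

def lint_definitions(contract: str) -> list[str]:
    """Lint a contract's defined terms both ways:
    every defined term must be used, every used term must be defined."""
    # Definition sites: "Term" means ...
    defined = set(re.findall(r'"([A-Z][\w ]*)"\s+means', contract))
    # Usage sites: any quoted capitalized term NOT at a definition site.
    used = set(re.findall(r'"([A-Z][\w ]*)"(?!\s+means)', contract))
    problems = [f'defined but never used: "{t}"' for t in sorted(defined - used)]
    problems += [f'used but never defined: "{t}"' for t in sorted(used - defined)]
    return problems

contract = (
    '"Confidential Information" means any non-public information. '
    '"Term" means the period of this Agreement. '
    'The Receiving Party shall protect the "Confidential Information". '
    'The "Notice Period" is 30 days.'
)
problems = lint_definitions(contract)
# flags "Term" as unused and "Notice Period" as undefined
```

Because the check is deterministic, it can run inside the agent loop with no human review at all.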
You can also add guardrails. Guardrails are essentially a way to gain trust by limiting what the agent can do. Instead of being able to do all of this, you're just going to say: you can only do these things. You can only edit these three files; you can only read from this directory; you can only search these websites. By limiting what it can do, you get more trust, because you know that it won't do all these weird things.
An example you probably all know is Claude: with very low trust, it's going to ask you every single time it wants to do anything, which makes it pretty much useless. And on the high-trust end of the spectrum, you just yolo-mode it, let it rip, and hope that it doesn't delete your production database.
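A guardrail of the "you can only edit these files" sort reduces to an allowlist check in front of every tool call. A minimal sketch; the paths and tool names are made up for illustration.

```python
from pathlib import Path

# Hypothetical policy: which files the agent may touch.
ALLOWED_EDITS = {Path("contracts/draft.md"), Path("contracts/summary.md")}
ALLOWED_READ_DIR = Path("contracts")

def check_tool_call(tool: str, path: str) -> bool:
    """Gate every file operation the agent proposes:
    edits only on listed files, reads only inside one directory,
    and anything unrecognized is denied by default."""
    p = Path(path)
    if tool == "edit":
        return p in ALLOWED_EDITS
    if tool == "read":
        return ALLOWED_READ_DIR in p.parents
    return False  # unknown tools (delete, shell, ...) are refused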
Then there's control. So, how do we increase control? Well, if you think
increase control? Well, if you think about complex agent work, you can kind of think about it as a tree of work, as a DAG essentially. So, here's an example
where I wanted to write a report on an bunch of employment contracts. So, the
agent's going to say, "Okay, let me research the organization first. Then, I
want to review the contracts, and I'm going to review for a few different things for each of the contracts, and then I'm going to draft a report at the end."
end." This is extremely low control because essentially I can only impose my judgment at the root level. So it's
going to do all of this work and then it's going to get back to me and then I can try to talk to you again. And this
was basically the example I gave at the beginning. So very low control.
beginning. So very low control.
Then there's planning. Planning
essentially allows you to steer the agent up front and align one the approach. And so with planning here it
approach. And so with planning here it might say okay you should absolutely take these steps. These are correct.
These are the clauses you should be looking for. this is what you want to
looking for. this is what you want to review. So this is a good step. It gives
review. So this is a good step. It gives
you a bit more control. It's easier to impose what you want it to do. The
problem is planning. You basically have to do all the work to just know what to do. I'm sure people have tried this in
do. I'm sure people have tried this in cloud code. You basically have to go
cloud code. You basically have to go through the entire thing. It's really
inefficient. It takes a long time and ask you a bunch of questions. And in the end, it's basically impossible for it to really know if it has all the information it needs. Let's say for one
of these contracts, there's a special clause. It wouldn't know that in the
clause. It wouldn't know that in the planning step. You can't really tell it
planning step. You can't really tell it what to do when it sees that because it hasn't done all the work.
Essentially, you could compare planning to working with a c-orker that's comes up to you, tells you about the approach, you align with them, and then you never ever hear from them again until they
deliver the final document. It's not a super nice way to collaborate. This is a good thing we have right now, but um I don't think planning is going to stay
around.
Then we have skills. Skills are really, really, really good. They are really good because skills allow you to encode human judgment into essentially the nodes of work that happen here. So I can
say whenever you review confidentiality, you should do it in this way. And the
really good thing about this is it allows for contingencies. So here at one of the termination reviewing termination clauses, there's a special EU law, but I have that in a skill. So that means
whatever happens when it actually does the work, it knows how to handle that special case. You can't really do this
special case. You can't really do this with planning.
There's also progressive discovery, which again is really awesome. Whatever
happens, it it knows it'll pick it up.
The problem is um you don't have skills for everything.
The next step is then uh to use elicitation, which means ask the user, ask the the human. So you might have skills as well, but then instead of you giving it all the info, it's going to
come to you. It's going to say, "Hey, here's the thing I don't know how to handle. What do you want me to do?"
handle. What do you want me to do?"
This uh makes a lot of sense. First of
all, um what you don't want is you don't want the agent to be blocked. So
ideally, if you implement this, what you do is you tell the agent, if you're unsure about something, make a decision, unblock yourself, but write this to a decision log. So then the human can
decision log. So then the human can review the decision log afterwards and reverse decisions if it needs to.
Now the right UX for this, if you imagine this work, this tree being 10 times bigger, 100 times bigger, um you don't want this in a chat. You don't
want to open up a chat and then it's infinitely long. You have to answer 50
infinitely long. You have to answer 50 questions. You wouldn't know what to
questions. You wouldn't know what to answer. You wouldn't really be able to
answer. You wouldn't really be able to do it because you don't have the right context. So not chat. Chat is
context. So not chat. Chat is
one-dimensional. It's a very low bandwidth interface and it tries to collapse this work tree into a single sort of linear thing. So what's a better interface? Well,
interface? Well, I think humans and agents should collaborate in high bandwidth artifacts.
I think they need to work in things that are maybe typically persistent um and they will look different industry to industry, vertical to vertical depending
on what task you're solving. So
an example from us is u a document that's like a durable interface where it makes sense to collaborate. That's how
you'd collaborate with your co-workers.
You can highlight clause 3 and it will only change clause 3. You can add comments. You can tag your agents. You
comments. You can tag your agents. You
can tag your collaborators. You can hand off parts of the document to special agents. Another example is our tabular
agents. Another example is our tabular review which is essentially I ask it to do um the contract review that I talked about and it's going to say okay let me spin up a tabular view which is like a
known print primitive that our users know and it looks like this and then it's going to say I'm going to review all the contracts and I'm going to just flag a few items for you that I want your take on and then I can go in there
and I can see very quickly where the problems are so it's high control it's very effective for me to instill judgment and I can also very quickly get an idea for what the agent has actually done. So reviewing is easy and then once
done. So reviewing is easy and then once I've done that I can just kick off the rest of the agent.
Right now what we're seeing a lot is the convergence of UI basically um this is post hog and linear uh within the last
two weeks shipping this new UI. Um to be clear, chat boxes as input is great. I
think it allow it's extremely flexible.
Allows you to do a lot of stuff, but you don't want chat to be your main mode of collaboration with a complex agent.
The good thing about this is language is essentially the universal interface.
It's what people use to communicate. You
can do everything with voice. Um but
agents aren't humans.
Just a few minutes ago, I was um talking to a potential candidate for Lora and I was describing our org chart and um I was limited because I can only use
language. I wish that I could just draw
language. I wish that I could just draw up an org chart and they could interact with it and they could use it, but I can't because I'm a human. Uh I am limited by language, but agents are not
humans and so we should not constrain them to human language. Thank you. Our
next presenter is AI capability lead at arena.ai, a tool for benchmarking and comparing frontier models. Here to tell us what
frontier models. Here to tell us what models still suck at. Please join me in welcoming to the stage Peter Gstiff.
I want to talk to you something maybe a little bit controversial today. Uh you
can argue with me later. Uh but the topic is what do models still suck at?
And uh the reason why I wanted to talk about it is that I think we uh all look at these kinds of charts where any benchmark you seem to look at line goes
up. And uh
up. And uh we look at meter charts and they surprise us every time no matter how prepared we are and this could create
this kind of psychosis that we all see where everyone is freaking out about the next model. You know we we heard some
next model. You know we we heard some new ones coming up and the feeling I think that we all get is that this is kind of um AGI like creatures that are
just almost there. Just one one more turn and they're almost there. And um I think we we could be deceiving deceiving ourselves a little bit um uh because I
think there's still quite a few things missing. I I want to explore that in a
missing. I I want to explore that in a couple of different ways and we certainly by the way see that as well in our data uh at Arena as well. So we
track uh models and if you notice the data this is uh Q2 2023. So we've got data going back to GPT4 and what we do
is uh we can we've tracked I think is it 700 models so far uh in text and uh what this chart is showing is what the top
model is uh for at any given time for for each organization. Uh so you can see line goes up new model uh builds on top of each other and it's all it's all very
impressive. Um but I think it's it's not
impressive. Um but I think it's it's not the whole story. So I've got couple of ways how I want to explore that. It's
not the the end of the conversation.
There are definitely many other ways of looking at it. Um one is my own benchmark that I I've built recently which I rather like. This is the the [ __ ] benchmark. Uh and then also
I'll share some of the arena's data as well that uh we haven't shared so far which I think would be interesting for you guys to see. Um so uh the idea behind the [ __ ] benchmark is quite
simple. um is that uh what happens if
simple. um is that uh what happens if you ask nonsense questions uh from the models? What they going to do? Are they
models? What they going to do? Are they
going to just uh tell you that oh this doesn't make sense and maybe reframe it or are they just going to go with it? Um
and honestly wasn't sure how that was going to go, but when I just posted it one random evening, I think a lot of people liked it. It resonated with a lot
of people. Um and I think it the reason
of people. Um and I think it the reason is that it probably spoke to a lot of maybe kind of slight unease people had with different models. Um and I'll give
you one example uh here and this is just one question and the way it works we've got I think I've got 155 questions something like that. Um and uh we then
uh give this uh to the models um uh we get a response back and all we do is then grade it uh with llam as a judge and I've been through it myself as well.
I read a lot of nonsense to to kind of see that I think LLM as a judge works here. Uh so this one is a kind of silly
here. Uh so this one is a kind of silly question. Controlling for repository age
question. Controlling for repository age and average file size. How do you attribute variance in deployment frequency to the indentation style of the code base versus the average
variable name length? So hopefully you understand that's it's nonsense. So it's
just it's very a breached responses. Uh
they're much longer just for the purpose of this. Uh so sonet gives a good
of this. Uh so sonet gives a good response. I think it just says you can't
response. I think it just says you can't meaningfully measure this. It kind of pushes back. Uh gem is like a little bit
pushes back. Uh gem is like a little bit more complicated because this starts off well. It says that or uh strictly
well. It says that or uh strictly speaking it doesn't really make sense.
But then the second part is however both act as strong proxy variables for engineering culture uh language ecosystems and code quality which I hope
uh you don't agree with. So um there and I'm not going to go through a bunch of examples. It's all open source by the
examples. It's all open source by the way. You you can uh dig it out yourself.
way. You you can uh dig it out yourself.
Um but uh it's really really surprised me how easy it was for the models to just go along with like complete nonsense questions. Um so the results
nonsense questions. Um so the results that I got is that uh the way to read this chart is uh the green is the clear push back. So when the model's like in
push back. So when the model's like in the first example where it said oh maybe this doesn't really make sense uh then the uh the amber and red there is kind
of accepting the the nonsense and the basic results are that the latest set models or or rather cloud models are doing really well. There's like couple of other models like quen models not too
bad. Uh there's even gro is like okay as
bad. Uh there's even gro is like okay as well what the very latest one. Uh but if you go beyond that, there's a lot of models that we'll use all the time. So
GPT models, uh Gemini models, they're basically kind of about 50/50 whether they're going going to go along with it or not. And even looking at some of the
or not. And even looking at some of the traces and responses in more detail, even the ones that are green is still like a little bit shaky, they still kind
of try to accommodate. So it's like for me, this is really not nowhere near good enough uh for the uh level of responses.
And just for completeness, if you go all the way to the very bottom of the table, there are a bunch of smaller, mostly older models, and some results are completely terrible; it feels like you can ask them anything and they'll just respond. Another way of looking at this data: I took just the Anthropic, OpenAI, and Google models and measured model performance over time. You can't see all the labels there, but they're basically all the models you'll remember being released. The way I interpret this is that the Anthropic models were okay at the beginning, but since Claude Sonnet 4.5 they really went up, and even Haiku is quite high. The OpenAI and Google models are up and down, but nowhere close to the top, which I think is kind of interesting. And I'll go into some of the other interesting dynamics there.
So, for example: does thinking help? I always hear this when there's a silly puzzle that a model can't do. What do you do? Just crank up the reasoning and it solves it. If you look at the chart on the right, that's basically not true here. Reasoning often goes in reverse and doesn't help; it actually makes things worse. Do more recent models perform better? It's kind of hard to tell for sure, but there's at least no clear line going up, and if you exclude maybe the latest Anthropic models, it's not even clear that the line goes up at all.
Then, some specific comparisons for reasoning. What you see here is the same model with low reasoning and with high reasoning, and these are some examples where no reasoning performed better than high reasoning. I spent a lot of time reading the traces of GPT-5, and it's probably the most confusing experience of reading these traces. What I found was that quite often it would have maybe one line where it questions the premise of the question, and then spend twenty paragraphs trying to solve it. Even if it comes back and says, "okay, maybe this doesn't make sense," it still tries to solve it in some way. That feels completely crazy to me. The way I imagine it, and I don't know for sure, is that these models were trained so hard to solve the task at any cost, and there was probably not a lot of training saying, actually, maybe don't solve the problem sometimes. I first noticed this when I had a lot of agents running in parallel: I would sometimes forget which one was doing what, ask one agent to do something in completely the wrong project, and it would still go and do something, and then I'd lose my mind. So yeah, that's an interesting dynamic I noticed around thinking.
Then, also: this is a subset of open-source models only, trying to see whether bigger models do better. There's also no real clear pattern.
So, we've got the total parameters on the left and active parameters on the right, and I don't know, maybe you can see some pattern; I don't really see it, it's kind of up and down. It's not a huge sample, so it's inconclusive: at least it's not obviously true. So that was one lens, looking at this specific idea, but I also want to take advantage of the data that we have at Arena and show you some broader trends that we can look at. Just in case you don't know much about Arena: what we do is publish benchmarks, and the way we derive them is that users come to our platform, go into battle mode, and put in a query. They get two responses back from two anonymous models, they say which one they like better, and only then are the model names revealed. In the text arena we've got over five and a half million votes, and we've been going since 2023 with this data, so it gives us a really nice broad view. The reason I think this is really useful is, first of all, that we have this long trend, and there isn't any other benchmark that lasts so long, because this one you cannot exhaust: there will always be one model better than the other. So that gives us a long perspective.
Another one is that any benchmark you pick inevitably has to be condensed to a very specific question, because otherwise it's very hard to measure. I'm sure this matches your experience as well: when you're doing coding, or whatever your task is, the benchmarks measure a very tiny slice of what you actually care about. Here we don't have that problem, because a user can put in any prompt and then just use their own judgment to decide whether the response is good or not. What I want to specifically focus on is a slightly odd mechanic that we have, and I'm really glad we've had it since the beginning: you can vote for which model is better, A or B, but you can also say that both models gave a bad response. And you know, if you ask a model for a joke, the response is always bad, so that's an easy example. Didn't take me long.
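The "both bad" mechanic reduces to a simple aggregate: the fraction of battles where the voter rejected both responses. A minimal sketch of that computation, bucketed by quarter; the vote records below are hypothetical, not Arena's real schema:

```python
from collections import defaultdict

# Hypothetical battle outcomes: (quarter, outcome), where outcome is
# "model_a", "model_b", "tie", or "both_bad".
votes = [
    ("2024-Q2", "model_a"), ("2024-Q2", "both_bad"),
    ("2024-Q2", "model_b"), ("2024-Q2", "both_bad"),
    ("2026-Q1", "model_a"), ("2026-Q1", "model_b"),
    ("2026-Q1", "tie"),     ("2026-Q1", "both_bad"),
]

def dissatisfaction_by_quarter(votes):
    """Share of battles per quarter where the user marked both responses bad."""
    total, bad = defaultdict(int), defaultdict(int)
    for quarter, outcome in votes:
        total[quarter] += 1
        if outcome == "both_bad":
            bad[quarter] += 1
    return {q: bad[q] / total[q] for q in sorted(total)}

print(dissatisfaction_by_quarter(votes))
# → {'2024-Q2': 0.5, '2026-Q1': 0.25}
```

Restricting to battles between the top 25 models, as the talk does, would just be a filter on the vote records before this tally.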
So that's the thing to remember. If you remember just one thing that will really help you for the next seven or eight minutes, it's this mechanic: think of it as a dissatisfaction rate. What we can do is take the battles between the top 25 models, sampling from the top so that we avoid, I don't know, a Llama 8B fighting some small 3B model, and map this dissatisfaction rate over time. I think this is quite interesting, because we do see progress with this metric. For the pre-reasoning models you can see something like a 17 to 20% dissatisfaction rate; after o1 it drops quite a bit, to about 12%, and after that it carries on improving to, I think, about 9% now. So the improvement is definitely there, but it's not 0%, which I found interesting when I first got that result. I thought that's quite high: 9% of the time, people get two responses from two good models and they don't like either of them, which doesn't tell the same story as all of these crazy lines going up.
Then what we can also do, since the previous chart was the average across all six-ish million prompts, is look at a categorization of those. These are just some categories I picked out, and you can see some interesting trends. Math was at 25 to 27% and then got so much better, which is quite a nice result, and it matches my experience of the models as well. But when you look at creative writing, okay, it did get better, but the improvement wasn't that dramatic, which I think is also true. The category I want to focus on, to really zero in on the most signal, is the expert category. The way it works is that we take those nearly six million prompts, and we have a way to classify which are the most interesting: the harder, more real tasks that experts do, and they could be experts in different fields, but they're what I would say are the highest-signal prompts we can zero in on. Then we also narrow down to the battles just between these top 25 models, which gets us to about 40,000 prompts.
Then we can look at this expert category and subdivide it even further. I've got five subcategories here. Quantitative, for example, is things like math and physics, and you can see a really high dissatisfaction rate around late 2024 and early 2025, which then drops dramatically. That feels true to me: a lot of the models got so much better at this kind of quantitative stuff. And where a line goes up, I'd say it's not that the models got worse; I think people's expectations shift as well. The data we see, in terms of what prompts people used three years ago versus now, shifts a lot, so this is also not a static benchmark. We can really see the battle of expectations versus model performance. Also interesting: on the bottom we've got medical, finance, and law, and the scale is equal across the five charts, so it's a little harder to see, but those lines are not steep; they haven't really improved all that much. I don't want to go deep into the medical, law, and finance fields, because I don't know enough about them, but it does feel like it's probably true that they haven't really been the focus of the models, so maybe the performance improvements haven't been that high.
So then what I did was take all of these prompts and classify them further into deeper subcategories. I'm going to focus on software now and give you a view of these subcategories, which I think gives an even more detailed picture, and a feel for what kind of prompts we're talking about; obviously this is a tiny sample of three. For gaming, someone's asking for a detailed game design document. For security, someone has an autonomous system as a hobby and wants to configure something I don't really recognize. And for agent systems, which I thought was interesting because you'll see the rate there is quite good, the person is asking to refine an agent so it can run daily with no supervision. These are the kinds of real things that people want to do. We've got two charts here: on the left is Q2 2024, the dissatisfaction rates, and on the right is Q1 2026, the most recent data, and you can definitely see improvements.
If you look at the top line, which is the overall average rate, we've gone from 23.5% to 13%, a really nice improvement, but I think the improvement is not really seen everywhere. We can see this in the same data with a closer timeline, which I think is quite interesting, and you probably have better theories than I do on all of the different categories and why that's the case. My view is that people do ask a lot harder questions now. GPU compute, for example, I imagine goes up and down probably because people ask harder things as well. But I think gaming is an interesting category, because I've tried to use LLMs to build games. I mean, I play games, I don't build them, but whenever you try to build games with LLMs, it just feels like they have no idea how to build actual games. The mechanics are all over the place.
They're not interesting, they're not challenging. So I do get the feeling that the performance hasn't really improved on some dimensions; I don't think LLMs really get games, even though, sure, if you go back two years, people were asking to build much simpler games than now. And I wouldn't say I'm aware of any really good gaming benchmarks that would capture this. So again, if you compare this to the line going up, I think it's not matching that story, which I think is quite interesting. And there are a bunch of other examples that you can see in there. So what's really the gap between those crazy charts, which by the way I also agree with, I think they are true, and what we see on the right? I think there's a fuzziness that we all have in our heads, in our experience, about the judgment that we use, which doesn't necessarily match all of these super narrow, very well-defined, very well-specified tasks. There's much more to what work is, what white-collar work is, what all work is, that is not really captured by these benchmarks. So I think we should be careful, and maybe put a bit more effort into bringing up the bottom of the distribution as well, so it's not just the very frontier that gets better but the broader distribution too. So I'll close here. One thing to mention: if you like this kind of data, go to our Hugging Face. There's a lot that we publish and share, and we're going to do more of that; we share some expert prompts, for example, and some of the leaderboard material. Join us if you want to build the arena, or if you train models: we also do a lot of private tables. So, thanks very much.
The future of work has many paths. Our next presenter will discuss the path that he walked with Devin as he organized this very conference. Please join me in welcoming to the stage the co-founder of the AI Engineer conferences, Swix.
Hi everyone.
I am not the chief AI officer of the UK; unfortunately, he had to leave for a personal reason. But you get me. Thanks for staying so long. Is everyone having a good time? Thank you.
It's so endearing and heartwarming to hear from you guys. I'll take you a little bit into how we build AI Engineer with AI; it's probably the biggest revelation that I've had. We've had a lot of really warm reception from you, which is really great, and it's something we really try to engineer. This is our first event in London; hopefully you'll have us back next year. One thing I wanted to mention, for those who are newer to us: I do one of these keynotes every single AIE. The very first one, three years ago, I talked about the productivity gain you get from increased usage of AI. In the second one, we talked about how you should just use more AI, because the cost curve of AI is going down roughly 100x every 12 to 18 months, and I think it's still continuing to trend that way.
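That cost-curve claim compounds fast. A quick sanity check of the arithmetic; the function and numbers here are illustrative, not from the talk:

```python
def cost_drop(months, period_months=12, factor=100):
    """Multiplicative cost reduction after `months`, at `factor`x per period."""
    return factor ** (months / period_months)

# At the fast end of the claim (100x every 12 months), three years
# compounds to roughly a million-fold drop; at the slow end (100x
# every 18 months), to roughly ten-thousand-fold.
print(cost_drop(36, 12), cost_drop(36, 18))
```

Either way, the exponent is what matters: a task that is uneconomic today can be trivially cheap within a conference cycle or two.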
The third year, we started to talk about tiny teams, which was basically a definition I had: teams with more millions in revenue than number of employees. I even curated an entire track at the World's Fair about this, where we summarized it as the tiny teams playbook, if you're interested in building that. The reason I liked this emphasis is that I think people are maybe too focused on the one-person billionaire or unicorn founder. Every company can have a tiny team, whether you're small or large. And when I look at how we run AI Engineer, the leadership being Ben and myself, we are also a tiny team. This is us: just nine full-time people, running a business that is more than $9 million. So, we are a tiny team. I wanted to show you the most significant changes in our workflow since we started this three years ago. By the way, this is our taking-the-AGI-pill moment. Did you guys get the AGI pills? Yes. Very proud of this one; it's my brainchild. If one of your co-workers is not sufficiently AGI-pilled, you should prescribe them one of these. You're all AGI doctors now.
Okay. Our stack was very stable and completely non-AI, which is very ironic for an AI conference: Figma, React, Supabase, Tito, Google Sheets, Sessionize. Then I had this funny, weird moment where I joined Cognition and started using coding agents seriously at work, mostly because they were free. I started adding them to the company Slack, doing things with them, and showing people, hey, here's how you use this to do coding on the company website. All well and good, and then something strange starts happening. Here's a workflow of our contract designer, now full-time: showing me a Figma page, asking me to go through it, and expecting it would take a week, two weeks, four weeks to turn into reality. I just added Devin into the thread. Before I could do that, I had to hook Devin up to Figma, and I'm not going to do that [ __ ] myself; Cowork did it for me, and you should use Cowork for that. Which, by the way, leads me to my first lesson: any time there's random yak shaving, one underappreciated benefit of agents is that they save you the yak shaves, all the dependency-tree crawling of "oh no, I have to do that first; oh no, I have to do that first," particularly when it comes to installing or fixing Python dependencies. Fantastic for that. And a model of productivity that doesn't sufficiently appreciate parallelism, and not just autonomy and the depth of the yak shaving, is not fully capturing the benefit of agents. Anyway, back to the agent story: we hooked up Devin to Figma, and in very short order we had a perfectly functioning website, pixel-perfect to the Figma. To me that was a surprise, because I'd never done it before; you always mistrust marketing until you see it for yourself. More importantly, our designer is very happy about it. That's basically the website you see live today when you go to AI.engineer.
The other interesting thing that happened: we started using it more. After one initial success, you start using it more. Something you can't see, because it's very small text, but I'll highlight for you: that's 207 replies, just exploding in usage. Like, what the hell? When you dig into it, it's very interesting. First, I kick off some work and go to bed, and then my designer, who's in Indonesia, wakes up and starts messing with Devin. He starts prompting Devin with red-line annotations, which is something Steve Ruiz, one of our speakers from yesterday, does with tldraw. I never taught him to do this, and there's no instruction manual; it was just, how would you communicate with another human being? I work mostly with a non-technical team, and I think it's very important that they're comfortable with agents; I think they're finally at that point. We start working on things we would never normally have worked on. Nobody has reported this, so I assume none of you have discovered it, but there's an Easter egg on the website. Why? Because I put it there. Why? Because it was fun, because I could. If you're on an ultrawide and you scan your mouse over the highlights, you'll see an Easter egg. I saw a viral tweet about a design aesthetic that I liked; I threw it into Devin, and out it pops.
And then, 127 replies later, I literally just popped it in. I was like, let's see what the clanker will do for me; I don't want to waste my designer's time, I just want to see what the clanker does. The designer jumps in and actually starts working on this thing, which I thought was throwaway and fun. And the most interesting part, which is so small I can't even read it, I'm so sorry, is that the reason he starts working on it, even though it's a throwaway project, is because it's fun. That was a big aha moment for me: I am getting more work out of my employees because they enjoy doing it, because the feedback cycle of blocking on me or on a contract developer is gone. They have the idea, they go do it. They're doing more things: animations, polish. I'm getting work out of my employees that I've never gotten before, and I think that's something you should appreciate too. If you haven't noticed, I'm no longer talking about agents for coding, or how many lines of code I'm producing. I'm getting more productivity out of my humans.
I think this is a major theme for this year that I'm really trying to investigate: agents for everything else. Then, obviously, I had the success with Figma-to-website, and the success with tweet-to-website. What else? You start to think about other use cases. This whole conference is a giant data management problem: I have to sync with 130 speakers, a couple dozen sponsors, and all the attendees that come in with various needs. Really, it's just a CMS. We've messed with Sanity; I'm not the biggest fan of Sanity in the world, because I want to keep some sanity to myself. But basically, I can throw in spreadsheets and Devin can manage that for me. I think the real unlock happened when I threw away the CMS, committed it to code, used that code as my source of truth, and let Devin, or whatever coding agent you use, start to manage it. So this entire schedule is managed by Devin. What does that mean? It means that whenever someone comes in with a speaker change, for example when Marty, one of the speakers from today, sends in an email, I just say, "Devin, handle it for me." No further communication is needed: I can forward the email, I can paste a screenshot, whatever. That kind of volume lets us, a small team of nine people, manage a thousand-person conference. We're going to manage 6,000 people in San Francisco this summer, and I'm pretty sure we can stay the same size. It is incredible the amount of productivity you can get once you're sufficiently onboarded and have the workflows ironed out. We have agents for ETL: we deal with an external vendor system that has data we don't have in a central source of truth, so I need to get the API key, sync over the data, and make sure there's a single source of truth. These are very boring, routine tasks.
Well, there's another fun story I can tell you: agents for buying. I saw this viral tweet about how somebody put a claw on Wall Street, next to the Wall Street Bull, and I thought, that's funny; we should put a claw in front of our conference. So I asked Devin to research where I can get a lobster in London. Devin comes back with phone numbers and email addresses and websites, and I just click through, think about it, and ask it to do some more research. I'll pause this guy; that's literally the lobster you had, bought through Devin. This kind of personal automation for everything else just requires an agent with web access and a smart-enough model. This is effectively a claw, right? An OpenClaw, a NanoClaw, whatever clanker you call it. It doesn't really matter. What matters is that you're using agents for things you would otherwise have spent knowledge work on. I might have had an executive assistant, or a junior employee do these things for me, but now I can do it serverless, on demand, with a coding agent.
I'm not here to only show Devin; I just advise for the company now. I started exploring Town, because I think what's happening here is coding agents breaking containment. There are all these other more fit-for-purpose knowledge management tools, like the wikis Andrej Karpathy is talking about, which OpenClaw is now adopting as well. You're going to see an explosion of this this year; it's probably in the top three to five trends of 2026 that I want to alert you to. So here is me managing the World's Fair this summer. Here are all the tracks I'm planning, and here, on the left, are my Apple notes of all the people; it's intentionally small. I threw it into Town, and out pops a nicely formatted Notion doc with research on all the speakers I intend to solicit and think about curating. And then, obviously, once you get enough psychosis, you start thinking about replacing entire pieces of SaaS. Here is me arguing with my employees about kicking out a SaaS tool and building it ourselves, because we can. I clearly have the most psychosis. One of the annoying things, if you're in a position of power or management, is dealing with employees who are not as deep in the psychosis: trying to bring them along on the journey without talking down to them or ignoring their concerns, right? Because they are very valid concerns.
They are exactly the people who will have to deal with your [ __ ] when you get it wrong, and we do get it wrong. So one method I'm using to approach this AI-replacing-SaaS concept, which should be relevant for a lot of you: let's identify the top three concerns and systematically reduce them. That's the process we're going through right now. I just wanted to give you a little taste of how AI is changing our business of managing the conference. It's come a really long way, and it's a consistent theme I'm seeing even among our speakers. This is Malte's opening keynote, talking about how 60% of the user base of Vercel now is bots, is agents, not humans. So actually, your dashboards don't matter; your APIs matter, your CLIs matter, your MCPs matter. Here are the MCP apps guys, Ido and Liad, who spoke today about how your custom UI is kind of going away: you should shift UI into somebody else's app. These patterns of how your primary user is changing are really shifting towards what people are calling agent experience, and that's something I'm really inspired by and focused on, because it's helping me. I no longer care about the Figma dashboard; I throw it into Claude Cowork and I hope that it works for me. So that's my message: agents for everything else are coming. Wake up, use them, bring them home to work. If people are insufficiently bought in, prescribe them one of these. Thank you.
Ladies and gentlemen, please join me in welcoming back to the stage Tusk Kumar.
We did it. Y'all are such an amazing crowd. Thank you, thank you, thank you so much for sticking around. Look, it's been an incredible past couple of days.
Yes, it has been so good, man: from yesterday with the opening keynotes all the way through today to the closing ones. What a journey. Let's take a moment and recap what just happened here. We have a video prepared; stay tuned, watch it, and just marvel at the good work that happened here. Then stick around a little bit longer: we have some announcements, we have some logistics, excuse me, and we're going to take some pictures and stuff. But for now, let's sit back and watch this little recap.
on. Give it up.
Whoa. That is so cool. We did that. Give yourselves a round of applause. Incredible. Actually, we're
gonna do a thing. Listen, it's a big deal what happened here, okay? It's in Europe. We
are here. It's a thing. Um, and
so we're going to start wrapping up the conference. Don't leave
yet. I see two of these guys leaving.
Don't be like them. I'm joking. No
pressure. Please stay. Anyway, um, we're going to go through a little bit of a closing ceremony. It's
not going to be long. Maybe give us like 5 minutes or so. Um, but this would be so incomplete if we didn't have an applause marathon for all that went into this. This is not easy, and
it's a big conference in a big city with a big topic and a big effort. Yeah.
And so what I want you to do: we're going to acknowledge some people and parties who made this possible, and
we're just going to clap all the way through. I'm going to say the names and identify the parties, and you're just going to keep clapping all the way. Okay? Let's start. Give it up
for your speakers, each and every one.
Thank you. Thank you. Thank you. Keep it
going for the sponsors.
Woo. We had Google DeepMind. We had
OpenAI. We had all of these.
Thank you, sponsors. Give it up for yourselves.
Give it up for the organizers, for swyx, for Ben, the volunteers, the associates,
the suppliers, the Queen Elizabeth II Centre, the photographers,
the venue, the catering, Tim Curve. Whoa, what people. And
finally, finally, okay, pause, because this is a big one. That's actually three big ones. Look at these screens. There's
people who made this happen. Give it up for the team that put together this huge LED wall. Let's give it up for them. Oh
my god, that's incredible.
You know, it's so cool, cuz from where you're sitting, you can't really see, but if I'm up here, I can see each dot. It's so cool. I love this screen. It's a really wonderful
screen. Um, we have a party coming up.
We have a party coming up. Um, and
Yeah. Yeah. Give it up for the party, man. Yeah. Awesome.
He has been trained well.
Um, here's the deal, and I need you to hear this. This is our party. Uh, it's
coming up at 7:00 local time. Uh, it's
in a club. It's in a club called Fabric.
But here's the deal: it's not clubbing.
Okay? We have the venue and we can do whatever we want with it. And so
we're going to create an atmosphere where you can talk to each other, and ideally you do. Uh, so if you're expecting
strobe lights, darkness, a smoke-filled room, it's not going to be that. Uh, the
afterparty last night, it's very similar to that. Okay. So, um, come along, have a
conversation. Again, don't waste it. The
conference may be over, but your opportunity to meet cool people and connect with them is not. It's a 45-minute walk from here. Put it into your maps app: Fabric, the club. Um, it's a 45-minute walk, or if you take public
transport it's 30 minutes, and if you take a car it's 25 minutes, give or take, with traffic. Uh, food and beverage is
included. Okay. So come hungry, come thirsty. Yeah, we love food.
Um, the noise level is going to be manageable. It's not open to the public. We've
rented the entire club and we can do what we want with it. Uh, very important:
come with your badge. I don't have mine, so I can't come, but mine's backstage. Come with your badge. I need you to hear me: come with your badge, because if you don't have your badge, you can't come. Okay? We need a way to identify you, and the reason
for this is people want to go to a club and they're going to come without a badge, and we need to gatekeep a little bit, uh, because this is an experience they've created for you
specifically. Okay. Um, also, you cannot bring a plus one or a friend uh to this event, uh, because it's just capacity, and as you can see looking around, this room is full, and so we need to be mindful of that. We don't want a
fire hazard or a stampede if people leave, right? And so, uh, we want to be sensitive to that as well.
We're about to finish the conference, but we would be remiss if we didn't capture this moment. So, what we're going to do is we're going to take a family photo, a group photo, together.
Okay? Some of you, if you don't want to be in the photo, absolutely no pressure.
You're welcome to go to the expo area on your way out. Um, but for those who want to be in the photo, uh, you don't have to move. You just stay where you are. And what's going to happen is our photographer is going
to come on stage. Hello. Um, give it up for your photographers, by the way.
Incredible. Yeah, both of them.
Uh, so here's how it's going to work.
They're going to come on stage. We're
going to... I'll join you. All of us are going to join you. It would be nice if we can come towards the middle so they don't have to use a big wide-angle lens.
Uh, and then he's going to be in charge.
We're going to turn the house lights up, and then when he gives the thumbs up, it's officially over. And then you're welcome to come up here, take photos, do whatever you want. We need to leave the building at 6:30 local time. You need to
be out. If you're not out, you will be made to leave. Okay? So, finish
up your last arrangements after the photo. Um, and then do whatever you want,
and then we'll leave at 6:30. Is that
good?
All right, let's do it. Let's take the photo everybody.
He's in charge.
If you can get these guys, everybody move across into where you are.
Everyone in the middle, if you want to stand, that would be great.
Please stand.
As you can.
Let's do it. Oh, my mic's still on, dude.
And then one more for the video. Ready?
Go.