
AI Engineer World’s Fair 2025 - Day 2 Keynotes & SWE Agents track

By AI Engineer

Summary

Topics Covered

  • AI Agents Reshape Software Engineering Tiers
  • Test-Time Compute Unblocks Intelligence Bottlenecks
  • Deep Thinking Scales Reasoning to Olympiad Levels
  • Evals Accelerate Iteration Beyond Production Risks

Full Transcript

[Music]

The wheel keeps turning, grinding its thread. Path unchosen where dreams are shed. Don't waste your time in endless debate. Pick up your tools and create your fate. March forward, let the road unwind. What's left behind is not yours to buy. Rise with the sun, let the sky ignite. You build the future with your will tonight.

[Music] The clock won't wait, it ticks relentless. Echoes of time are loud and defenseless. Step off the ledge, don't fear to fly. There's no gain if you never try.

[Music] Steel fires burn where progress starts. Flames that forge courageous hearts. You can't stop what was meant to bloom. So grab the light when it breaks the gloom. March forward, let the road unwind. What's left behind is not yours to buy. Rise with the sun, let the sky ignite. You build the future with your will tonight.

[Music] No backward glance, no sorrow ache. All in motion and truth at stake. Humanity's story, a river untamed. Fight your champ.

[Applause] [Music]

Please welcome to the stage the VP of developer relations at LlamaIndex, Laurie Voss. [Music]

Hello again, everyone. It is great to see your friendly faces.

Uh, sorry, can we go back one slide? I accidentally hit my forward button. Uh, it is great to see you all. Welcome back to day two, or day three depending on when you actually started. Who had a good time yesterday? Let's hear it from you.

One thing I couldn't fit into my intro yesterday that I really wanted to get in is that it is June in San Francisco. It is Pride Month. So from myself and my fellow LGBTQ members of the community, I would like to wish you all a happy Pride. I also want to hear from my jet-lag crew. Who, show of hands, woke up at 4:00 a.m. this morning? There were a lot of you. 5:00 a.m.? Who's still not awake now?

Uh, we've got another great batch of keynotes for you, including progress towards deep thinking with Gemini. You'll be hearing from Logan Kilpatrick of Google. A fun fact about Logan is that Gemini's ability to make jokes is trained entirely on his tweets, which is why none of them are funny. You'll also be hearing how to make your agents more reliable with the founder of Docker, so you won't want to miss that. But first we're going to hear from an amazing organizer and just a wonderful person who has a special announcement: co-founder of this very AI Engineer World's Fair, Benjamin Dunphy. [Applause] [Music]

Co-founding this conference with Swyx has been one of the most rewarding experiences of my career. To see you all here today makes me so excited for what we've built and for what's to come. Like many AI plebeians, my aha moment was ChatGPT. One of my first prompts was to test its limits of knowledge and reasoning. I prompted it to break the known universe into the fewest core principles from which it could then recursively generate 12 subclassifications. I was blown away by how fun this exercise was and how interesting the responses were, especially when it got to the lowest levels of the universe. For example, it labeled viruses a subcategory of quarks, which I found both fascinating and just wrong.

It was at this very moment, however, that I immediately knew that it was over for everything I'd done in the past. This was the most fascinating piece of technology I'd ever used, and I recognized its potential immediately. I recall texting my brother-in-law the URL saying, "AGI has been achieved." But it was only a few months later that something even more incredible happened. My son was born.

Being a father has been one of the most miraculous and incredible experiences of my life. While yes, it's rather astounding to be able to speak with computers, where my mind feels expanded every time I do, when I talk to my son, it's my heart that expands. So, how do these two things relate?

I am old enough to remember a time when computers were large, cold machines used only in corporate offices to get work done. But as their power has grown, the current mode of UX has tethered us to these machines all day. There is a parallel from this to how our future generations will be educated. While there will likely always be a place for both screen-and-keyboard-based HCI as well as classroom- and lecture-style learning and discovery, the potential of these new technologies and emerging UI/UX can free us from those constraints, where even the most mediocre of teachers could become world-class instructors. So that's why I'm tremendously excited to announce a new chapter for us: the AI Education Summit. [Applause]

There's a significant gap between the rapid advancement of AI and the preparedness of our children, parents, and educators to navigate this new reality effectively and ethically. But we can overcome this together by fostering a global community dedicated to AI education, empowering children, parents, and educators with the knowledge, skills, and ethical framework to thrive in an AI-driven world. For this event, we'll be partnering with a pioneer in the space of AI education, Stefania Druga. She's a former research scientist at Google and, as of today, a three-time AIE speaker. It was her talk from last year that sparked my imagination on this exciting new direction. When she demoed a student learning to code by programming the very thing that is teaching them to code, I was just blown away. So whether you're interested in education for the next generation like I am, or just in the evolution of HCI for learning in the age of AI for people of all ages, I encourage you to pre-register today. This first event is going to be a free online event to explore the landscape, filled with practical knowledge for the exciting future of AI education.

So that's it for me, and I'd love to bring up our first speaker. He is the group product manager at Google DeepMind, and he's here to talk about Gemini. Please join me in welcoming to the stage Logan Kilpatrick. [Applause] [Music]

Awesome. Thank you, Ben.

Excited for the AI Education Summit. Should be fun. Um, my name is Logan. I do developer stuff at DeepMind, and I'm excited to talk about Gemini stuff. Yeah, hopefully folks know what Gemini is, so no introduction needed. I'll talk about three things really quickly: we'll do some fun announcement stuff, we'll talk about sort of recapping a year of progress in Gemini, and then we'll talk about what's coming next across the model side, across the Gemini app side, and also, of course, across the developer platform.

So, the fun stuff, which is: we announced a new Gemini model today. Um, so we haven't officially announced it, but we'll post the tweet live. New Gemini model. This is hopefully the final update to 2.5 Pro. I think folks have given us tons of feedback about the changes, and I think my slide has an animation which is hiding all the stuff. But Gemini 2.5 Pro is awesome. It's super powerful, with a bunch of increases across benchmarks people care about. It's SOTA on Aider, and it's SOTA on HLE and some other benchmarks. I think it closes the gap on a bunch of the stuff that folks gave us feedback on from the previous versions of the model, so hopefully it has great performance across the board. It's also, I think, sort of setting the stage for the future of Gemini. I think 2.5 Pro, for us internally and in the perception of the developer ecosystem, was the turning point, which was super exciting. So it's awesome to see the momentum. We've got a bunch of other great models coming as well. So, 2.5 Pro, hopefully the final version. Send us feedback if things don't work, and we'll continue to push the rock up the hill. You can go to ai.dev if you want to try it out. It's also available in the Gemini app and all that other stuff. And if you need anything, email us and we'll make it happen. All right, new model launched.

Let's talk about a year of Gemini progress. I think this has been the craziest thing. I don't know if folks tuned in to Google I/O, but Sundar showed a slide on stage which was a great reminder for me of just how much it feels like 10 years of Gemini stuff packed into the last 12 months, which has been awesome. And it's actually interesting to see as well, just to opine on one of the points, all of these different research bets across DeepMind coming together to build this incredible mainline Gemini model. I have a conversation with people all the time about what the DeepMind strategy is, what the advantage is for us building models, all that good stuff. And I think the interesting thing to me is just this breadth of research happening across science and Gemini and all these other areas, like robotics. All of that actually ends up upstreaming into the mainline models, which is super exciting. So you see AlphaProof and AlphaGeometry and a bunch of stuff that we did with custom models in those areas actually improving the performance of our models for those domains, and Jack will talk about that in a little bit, which I'm super excited about.

The other thing is not just the pace of innovation but the pace of adoption. Sundar also showed a slide with a 50x increase in the amount of AI inference being processed through Google servers from one year ago to last month, and it is just remarkable to see that increase in demand for Gemini models, also from the external developer ecosystem. So it's been wonderful to see that happen.

I think the other question, and this has been talked about a little bit, is what got us to this point. I think one of the critical pieces, and it's not super fun but is worth thinking about for folks who are building companies here, is truthfully an organizational thing. Google historically had lots of different teams doing lots of different AI research, and in early 2023 Google brought a bunch of those teams together and charted a new direction for the DeepMind team: to not only do theoretical, foundational research but also to build models and deliver them to the rest of Google and the external world. And then we took the second step of that journey earlier this year, which was actually bringing the product teams into DeepMind. So now DeepMind creates the models, does the research, but then also builds products and delivers those to the world. We have the Gemini app, which is our consumer product, and then we have the developer side of that with the Gemini API. This has been, for me personally, super fun: getting to collaborate with our research team, helping to actually be on the frontier with them, and bringing new models and capabilities to the world. I think this is the collaboration that works incredibly well. Yeah. And we ship lots of stuff.

I think this is the most fun part: there's so much innovation happening inside of Google. It's incredible to get to bring that to the world and to developers, and I think we're actually very early in that journey, as we'll see in a couple of minutes.

So, in summary, the formula is simple: bring the best people together, find infra advantages, and ship.

I don't know if folks have played around with Veo or not, but it's also been just incredible to see the reception to Veo. It's burning all the TPUs down, which has been incredible to see. Lots of demand, lots of interest on the Veo front. So hopefully folks get a chance to play around; it's available in the Gemini app right now. All right, so let's talk about what's next. This is the fun stuff.

So, I think the Gemini app piece is interesting just because people talk about it a lot; it's a fun product and it's cool to think about. And also, for folks building stuff, it's interesting to hear what our strategy is from the app perspective. The Gemini app is trying to be this universal assistant. What that means in practice: I'm sure people don't think about this all the time, but I think a lot about what Google's products do and how we show up in the world. One of the interesting observations I had was, if you think about what historically brought individuals through all of Google's products, the thing that comes to mind is your Google account, which wasn't super stateful. You would sign into lots of different Google products with your Google account, but that didn't really do anything other than get you signed into that individual product. I think now we're seeing with Gemini that it's actually this thread that unifies all of Google. And I think the future for Google is going to look a lot like Gemini being this thread that brings all of our stuff together, which is really interesting.

And then, hitting on all the trends, which I'm sure folks are also excited about building: the one that I'm most excited about is proactivity. Most AI products today are still very much "you have to go and do all the work as the user." And I think this proactive next step, of AI systems and models coming into play, is going to be awesome to see.

Yeah, and the team is moving super fast. If you have complaints, please do not tag me on Twitter; please tag Josh. He will make it happen. Josh is incredible, and the Gemini app team is amazing. He's pushing the team super hard, so it's incredible to see all the progress. But he is the person who can make stuff happen on the Gemini app, not me. So please ping him.

From a model perspective, again, there's so much. When Gemini was originally created, it was built to be a single multimodal model to do audio, image, video, etc. We've made a lot of progress on that. At I/O this year, we announced native audio capabilities in Gemini. There's TTS; there's audio where you can talk to the model, and it sounds super natural, which is awesome. It's powering the Astra experience; it's powering Gemini Live. So I think we're going to get towards that omnimodal model, which is awesome. We have Veo, which is SOTA across a bunch of stuff, so hopefully we'll get video into the mainline Gemini model. And folks may have seen some of our early experiments with diffusion, which means you can get crazy levels of tokens per second. Really interesting; that's definitely a research exploration area, and it's not mainline yet. So it'll be cool to see that come.

The agentic-by-default thread is something I've been thinking a lot about recently. Historically, as a developer, I've thought about models just as this thing that takes tokens in and gives tokens out, and then there was lots of scaffolding in the ecosystem to allow me to build with those models. It's becoming very clear to me that the models are becoming more systematic themselves; they're doing more and more. And I think the reasoning step is this really interesting place in which a lot of that's going to happen. Jack's going to talk about the scaling up of reasoning. But I do think it'll be interesting to see how much of the scaffolding work that's happened in the past ends up just being a part of that reasoning step, and what that means for people who are building products. So it'll be interesting to see. We'll also have more small models soon, which I'm excited about, and big models. People want large models, which I know. So I'm excited about that.

And then the last one is continuing to push the frontier on infinite context. I think the current model paradigm doesn't work for infinite context; it's just impossible to scale up. Attention doesn't work that way. So I think there will need to be some new innovations to help people continue to scale up the amount of context they're bringing in.

Um, and Tulsee is the person who drives all of our model stuff. So if you have stuff you want to talk about with Gemini models, or you have ideas for things that don't work well, she is the person running the show on the Gemini model product side.

And then developer stuff. We have lots of things coming which I'm excited about. I'll highlight maybe three that I think people are super excited about. Embeddings, which feels like early AI stuff but I think is still super important: embeddings power most people's applications using RAG. We have a Gemini embedding model which is state-of-the-art, so I'm excited to be rolling that out to developers more broadly in the next couple of weeks. The deep research API I'm super interested in: there are so many interesting products built around this sort of research task, and people love the consumer product, so we're finding ways to bring a bunch of that together into a bespoke deep research API, which will be awesome. And then Veo 3 and Imagen 4 in the API as well. So hopefully we'll see that very, very soon, as we work to scale and make that possible from a developer platform side.

I'll make one other quick comment, which is about the AI Studio product positioning, which I also think is interesting. AI Studio, just to be very clear, is being built as a developer platform. So we'll move away from this kind of consumery feel and move much more towards being a developer platform, which I'm personally very excited about because I think that's what developers want from us. It'll be awesome to see that come to life with many new iterations of our developer experience, with agents built in, and hopefully things like Jules and some of our developer coding agents natively in that experience, which will be awesome to see.

Yeah, and that's what I have. I appreciate all the people who send lots of great feedback about Gemini stuff. We'll keep pushing the rock up the hill, and I'll be around, so if you have more feedback, come find me and we'll keep making Gemini great for everyone. So thanks, and I appreciate it. [Applause] [Music]

Our next presenter is a principal research scientist at Google DeepMind. Please join me in welcoming to the stage Jack Rae. [Music]

Hi everybody. Yeah, my name is Jack. I'm a researcher at Google, and I'm the tech lead of thinking within Gemini. I'm going to give a brief deep dive into thinking from the research perspective within Gemini.

So, it's thinking so much I think this clicker might not work. Let's try the next slide. Whoever the slide driver is, please drive to the next slide.

Whilst we maybe sort out the slide issue, I'm going to give this talk in three stages. One is to give a research motivation for why we are actually excited about thinking, in terms of unblocking bottlenecks towards intelligence. I'm going to give a few examples of how, with our current most advanced systems, if you can just identify the most prescient bottlenecks, the crucial issues and shortcomings, you will often then find a solution, and there's a reason why that is linked to thinking. Then I'm going to talk a little more pragmatically about what thinking is in Gemini and why it's interesting to developers.

The slides are still not here. We did do a rehearsal this morning where the slides were there. Yeah, keynote speaker folder, Jack Rae. I think it's under keynote speaker. That one. It's going to come up soon; you are close. Ah, there you go. Nice one. Yeah, that's great. The slides will appear. Thank you, whoever is coordinating. Apologies, I don't know what happened. And then I'm also going to talk a bit about what's next.

So, Logan did a great job of giving an incredible overview of Gemini as a whole ecosystem, everything that's going on. I'm going to really be focusing on what we're excited about in the reasoning space.

So, with intelligence bottlenecks, the message of this section is really about progress. Progress has really been marked by identifying key bottlenecks towards intelligence and then solving them. I'm going to give some examples throughout history. I'm going to actually rewind the clock to 1948.

Claude Shannon invents the language model in "A Mathematical Theory of Communication". He builds a language model, a two-gram, using a textbook of word statistics that was hand-calculated, and he samples from it and kind of marvels at the samples. He feels like these are getting pretty good.
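A two-gram (bigram) word model of this kind is just pair counts plus weighted sampling. Here is a toy sketch in Python of the idea (an anachronistic illustration, not Shannon's actual hand tabulation; the corpus string is made up):

```python
import random
from collections import defaultdict

def train_bigram(corpus):
    """Count word-pair statistics, as Shannon did by hand from a textbook."""
    counts = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def sample(counts, start, length=8, seed=0):
    """Sample a continuation: pick each next word in proportion to its count."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this word
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

corpus = "the model predicts the next word given the previous word"
counts = train_bigram(corpus)
sampled = sample(counts, "the")
print(sampled)
```

Samples from such a model respect local word statistics but nothing longer-range, which is exactly the limitation discussed next.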

They're a lot better than a one-gram character model, this two-gram word model. But he remarks, essentially: I think this would be better if we could really make a better language model and scale up this current method. So he really wanted to just scale up the n-gram. That was the bottleneck: a small amount of data, very elementary statistics. And unfortunately for Claude Shannon, the solution was pretty hard: he needed the digitization of human knowledge, and he needed modern computing to be able to aggregate these statistics at scale. So, you know, that wasn't so easy for him to solve; he had it a bit more tricky.

But fast forward a few decades at Google, in the 2000s, my colleagues such as Jeff Dean are training n-gram language models over trillions of tokens. These are powering, at the time, the most sophisticated speech recognition and translation systems, and a lot of progress has been made. But the bottleneck with these systems was that the n-gram language models were very restricted to short context, because there's an exponential storage cost with context length (a vocabulary of V words needs on the order of V^n table entries for an n-gram model), and there wasn't really a way around that while sticking with n-grams. The solution was the early introduction of deep learning, in 2010, with recurrent neural language models: recurrent neural networks applied to modeling text, which could avoid the problem by storing a compressed representation of the past in the state of a neural network. They could now start to model beyond a five-gram, to sentences or even paragraphs, and this was a massive step change in improvement.

However, a couple of years later, people would notice that even there there was a bottleneck. The recurrent neural network's representation of the past is a fixed-size state, and there's only so much information you can put into a fixed-size state, so it was often observed to be a lossy representation of the context. The solution that was derived, once people really encountered this information bottleneck, was to just keep everything around, all of your past neural embeddings, and use an attention operator to aggregate things on the fly. So this was the birth of attention, and then, shortly after, Transformers. Transformers then led to the modern deep-learning revolution as we know it, and much other progress was made.

If we skip forward 10 years, we are in 2024. We have large language models; they're increasingly powerful general conversational agents. We have models such as Gemini and ChatGPT, and people are using them for all sorts of use cases. And that's where we come to the bottleneck that's relevant to this talk: although these models are very, very powerful, they are still trained to respond immediately to requests. In other words, in terms of a compute bottleneck, there is a constant amount of compute that they apply at test time to transition from your request, or your question, to the response, or your answer. So the bottleneck is test-time compute. This is relevant to thinking.

So we can unpack this a little bit more. When we talk about a fixed amount of test-time compute, the test-time compute is interesting to you because that's the compute the model is spending on your particular problem, your particular question. The way it mechanically works: you have some text in your request, it gets translated to tokens, and then it goes through a language model. At the transition from the request to its response, computation passes up through the large language model, which has some parallel computation within every layer and some iterative computation across layers. That computation is really where the model can apply its intelligence to your particular problem, and it's of fixed size. One solution, if you want a smarter model and more computation, is just to make the model larger; then you have more compute, and you can get a smarter response. However, it's still not really enough. Users might want to be able to think a thousand or a million times more, with a very large dynamic range and a lot of compute for very hard, challenging, or valuable tasks. And users might also want a very dynamic application of test-time compute: less compute for simpler requests, more compute for harder requests, with this process being very dynamic and instigated by the model. And that is what motivates thinking.

thinking. So thinking in Gemini mechanically, I'm sure almost everyone in this room is familiar with this general process where we will now

have a model and we insert a thinking stage uh that the model can emit some additional text before it decides to emit a final answer.

Going back to this notion of test time compute: now we've added an additional loop of computation, where the model can iteratively loop and perform additional test time compute during this thinking stage. This loop can run for potentially thousands or tens of thousands of iterations, which gives you tens of thousands of times more compute before the model decides to commit to what its response will be. And also, because it's a loop, it's dynamic. So the model can learn how many iterations of this loop to apply before it decides to actually commit to its answer.

We train this model to think, to use this thinking stage, via reinforcement learning. So after we pre-train Gemini, we have a reinforcement learning stage where we train it to do many different tasks and give it positive and negative rewards depending on whether or not it solves the task correctly.
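That reward setup can be caricatured in a few lines. This is a deliberately crude toy, not the actual RL recipe: candidate "thinking strategies" get sampled, each episode receives a binary +1/-1 reward for solving a stand-in task, and the strategy with the best average reward wins out.

```python
import random

# Cartoon of training with binary task rewards (not Gemini's actual RL
# recipe): sample strategies, reward the ones that solve the task, and
# let the policy converge on the strategy with the best average reward.

STRATEGIES = ["answer immediately", "think briefly", "think step by step"]

def run_task(strategy: str) -> bool:
    # Stand-in task: only careful step-by-step thinking solves it.
    return strategy == "think step by step"

def train(episodes: int = 300, seed: int = 0) -> str:
    rng = random.Random(seed)
    totals = {s: 0 for s in STRATEGIES}
    counts = {s: 1 for s in STRATEGIES}
    for _ in range(episodes):
        strategy = rng.choice(STRATEGIES)          # explore uniformly
        reward = 1 if run_task(strategy) else -1   # vague binary signal
        totals[strategy] += reward
        counts[strategy] += 1
    # The learned policy: the strategy with the best average reward.
    return max(STRATEGIES, key=lambda s: totals[s] / counts[s])

print(train())  # prints: think step by step
```

The real training shapes the content of the thinking tokens themselves, not a fixed menu of strategies, but the signal driving it is this same sparse correct/incorrect reward.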

And this is essentially a very general training recipe, really. And it's kind of remarkable that it works: the model is able to take a very vague signal of what is correct and what is not correct, and to backpropagate this through the loop of the thinking stage such that it can shape how it uses its thinking computation and thinking tokens in order to be more useful.

In fact, we weren't really sure this would work. It wasn't clear how much structure we should put into something like a reasoning stage. And although many people here have probably now seen reasoning traces and played with these models, I'll just show you a historical artifact from one of the times we were trying reinforcement learning and started to see cool emergent behavior. In this case there's an integer prediction problem; this was just a particular example, a mathsy one. And what we saw was the model using its thinking tokens to first pose a hypothesis and then test out the hypothesis. It found that things weren't really working: it states that this formula doesn't hold, it rejects its own idea, and then it tries an alternative approach. I think it's easy to become desensitized to technology because it's so amazing every single day, but we were truly blown away when we saw that the general recipe of reinforcement learning was creating all sorts of interesting emergent behavior: trying different ideas, self-correction. And these days we see a lot of different strategies that the model learns to use. It learns to break down the problem into various components, explore multiple solutions, draft fragments of code and build these up in a modular way, perform intermediate calculations, and use tools.

All under the umbrella of using more test time compute to give you a smarter response. Okay. So I've talked a bit about why we are interested in thinking in terms of the path to AGI and unblocking bottlenecks of intelligence, and a little bit about what it is mechanically. Why is it interesting to developers? Obviously the number one reason is we think this is driving more capable models, and it also stacks on top of our current paradigms of how we accelerate model progress. So with thinking, we can accelerate this process by scaling the amount of test time compute, and we find that this can stack as a paradigm on top of pre-existing paradigms such as pre-training, where you can scale the amount of pre-training data and model size, and also post-training, where you can scale the quality and diversity of human feedback for many different types of tasks. And as a result, within Google, by investing in all of these and really accelerating all of them, we get kind of a multiplicative effect. And why is this interesting to developers? I think it results in just overall faster model improvement, which is very nice.

We also see, if we look back over our lineage of recent Gemini launches, that there's improved reasoning performance, and we can actually map this to how much test time compute these models devote to problems. So there's a log-scale test time compute axis on the x-axis, and performance across math, code, and some science topics. And we see this trend of increasing reasoning performance, and it tracks very well with increasing test time compute. On the far left you have 2.0 Flash Experimental. This was a model that was not launched with thinking, back in December last year, so ancient history. And on the right-hand side we have the first launched version of 2.5 Pro. So test time scaling is working empirically.

But it's not just capability that matters. It's also interesting from the notion of being able to steer the model's quality over cost. Before, you had the option of choosing from a discrete number of possible model sizes, and that was a way to gauge how much quality you wanted and how much cost you wanted to incur for any given task. But it was a discrete choice. Now with thinking, we can have a continuous budget, which gives you a much more granular slider of how much capability you want for any given class of tasks. And we have thinking budgets now launched in Flash and Pro in the 2.5 series. This allows you a very granular choice of cost versus performance, and it also allows us to push the frontier and let you drive cost higher and performance higher if your application requires it.
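For developers, the slider is exposed as a per-request thinking budget. Below is a minimal sketch using the google-genai Python SDK; the parameter names reflect its documented shape at the time of writing, the budget values are made up for the example, and the whole thing should be verified against current docs. The snippet only calls the API if a key is configured.

```python
import os

# Illustrative policy: map task classes to thinking-token budgets.
# The specific numbers here are invented for the example.
def budget_for(task_class: str) -> int:
    table = {"simple": 0, "moderate": 1024, "hard": 8192}
    return table.get(task_class, 1024)

if os.environ.get("GEMINI_API_KEY"):
    # SDK usage sketch (google-genai); verify names against current docs.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Prove that the sum of two even integers is even.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=budget_for("hard")
            )
        ),
    )
    print(response.text)
```

The design point is the continuous knob: the same model serves a cheap low-budget request and an expensive high-budget one, rather than forcing a choice between discrete model sizes.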

So okay, a lot of this is really covering ground up to the present day. So what's next, and what are we excited about? We're very excited about just generally improving the models and having better reasoning. Of course, we're also excited about making the thinking process as efficient as possible. Really, we want thinking to just work for you, be quite adaptive, and be something that you don't have to actively spend a lot of energy tuning. And a big part of that is ensuring our models are very efficient in how they use their thoughts. This is definitely an area of progress. I think we can find examples of our models overthinking on tasks, and this is just an area of research to get these things faster and faster and as cost-effective as possible. We're very proud of how cost-effective our Gemini models are, and this is an area for improvement as well. And there's also deeper thinking, which is really about scaling the amount of inference compute further to drive even higher capability.

People may be familiar with Gemini Deep Research, where you can type in a query and then the model will go away for a long period of time and research a topic. We've also now announced at I/O, and are launching to trusted testers, a notion of Deep Think. Deep Think is a very high thinking-budget mode built on top of 2.5 Pro, and its desired application is for cases where you have a very hard problem and you're happy to essentially fire off the query, have some asynchronous process run for a while, and come back to a stronger solution. Its key idea is that we leverage much deeper chains of thought, and parallel chains of thought that can integrate with each other, to produce better responses. We find this enhances model performance on very tough multimodal, code, and math problems. An example would be the USA Math Olympiad. This is a task where the state-of-the-art model in January had basically negligible performance. 2.5 Pro, probably even better with the one updated today, was at about the 50th percentile of all participants in the Math Olympiad, and with Deep Think it goes up to the 65th percentile. And the interesting thing about Deep Think is that as we continue to both improve the base model and improve the algorithmic ingredients that go into Deep Think, those two will stack together as well.

Here is a video animation of one of these USA Math Olympiad algebra problems. The key idea with this video is just this notion of having multiple iterative ideas. Maybe the model starts out with some proof-by-contradiction idea, but then it explores two different aspects, Rolle's theorem and Newton's inequalities, integrates them, and eventually arrives at a correct proof. There's not that much you can take away from this video, but it looks pretty cool, so I added it.

Yeah. Beyond the math we talked about in the previous slides, I'm very excited about any application where the model can spend longer and longer thinking on very open-ended coding tasks, and one-shot, or with very few interactions, vibe code things that would have taken us months in the past. One example I like: some of my colleagues vibe-coded, from DeepMind's original DQN paper, which was a revolution in deep reinforcement learning, the training setup, the algorithm, even an Atari emulator such that it could play some of the games. And this is remarkable to me, because these kinds of things would have taken me and my colleagues months in the past, and they're starting to happen in minutes.

One thing I'm quite excited about, looking forward to the future, is not really the landscape of models but coming back to our gold standard, which is the human mind. I would love for our models to be able to contemplate from a very small set of knowledge and think about it incredibly deeply, such that we can push the frontier. One example I often think about is Srinivasa Ramanujan, who was one of the world's greatest mathematicians, from the early 20th century. Famously, he had just this one math textbook. He was kind of cut away from the mathematical community, but from a small set of problems he spent many textbooks' worth of thinking going through problems and inventing his own theories to further extend ideas, and he invented an incredible quantity of mathematics really just by deeply thinking from a small source subset. And this is where I think we are going with thinking. We want a model to be incredibly data-efficient and actually go to millions of inference tokens, or beyond, where the model is really building up knowledge and artifacts such that we can eventually start to push the frontier of human understanding.

So with that said, thank you very much.

Our next presenter is here to tell us why you should care about evals. Please join me in welcoming to the stage founding engineer at Braintrust, Manu Goyal.

All right, who's excited about evals?

[Applause]

All right, what can I do to get those juices flowing? I'm Manu, and I work at Braintrust, where we build a platform to do evals and a bunch of other stuff. So I was thinking we could just start by talking a little bit about my own personal eval journey. Now, you might see this picture and say, ah, what an adorable little boy absorbed in his Nintendo 64 video game. But if you look a little closer, you'll see a boy who's deeply disappointed with the state of technology in his society. Because this boy, he knows that technology is not meant to be shackled to the constraints of rule-based systems, doomed to do the same thing over and over and over. No, technology is meant to come alive, to grow and adapt and really be a thought partner to mankind. I knew this in that moment, which is why I decided to devote my career to being a software engineer in the AI industry. And so I dropped the Nintendo, I started grinding away on LeetCode, and soon enough I landed a job in the self-driving car industry.

Now, we can all learn a lot about self-driving cars, but the thing I took away was that you can spend all day tuning the model, changing the architecture, adjusting the loss function, all good stuff, but it's never going to be enough for you to actually ship it to production, right? I can't say, "Oh, my image classification rate went from 98% to 99%. Put it on the road." Right? We need some way to contextualize this model and understand if it actually works for our real-world application. Does it avoid pedestrians? Does it negotiate traffic scenarios appropriately? Does it obey the law? All this stuff we actually need to understand. And how we're going to do that is with evals. Now, the whole point here is that evals aren't just unit tests for AI. They're not just for finding regressions, right? If I didn't have evals, the only way I could get any signal on my changes is by shipping them to prod and then getting signal in the real world. But that's expensive, it's slow, and ultimately it's pretty risky.

So what do evals do? If you invest in good eval infrastructure, that lets you run experiments to your heart's content and do 90% of the product iteration loop before going to prod, and then you can ship much more quickly and much more confidently. Furthermore, if you apply the same metrics from offline evals to your online production data, you now have data-driven signal about which examples in prod are going to be most useful for that next iteration loop. And so, with all of this knowledge, my eval journey was complete, and I transformed from this guy to this guy. So, success.
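The loop Manu describes, run experiments offline against a dataset, then reuse the same scorer on production logs to surface the most useful examples, can be sketched generically. This is an illustrative harness with a toy task, not Braintrust's actual SDK:

```python
# Minimal eval harness (illustrative; not Braintrust's actual SDK).
# The same scorer runs offline against a golden dataset and online
# against production logs, so failing prod examples feed the next loop.

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(task, dataset, scorer) -> float:
    scores = [scorer(task(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores)

def mine_prod_logs(task, logs, scorer, threshold=0.5):
    """Return the production examples most useful for the next iteration."""
    return [ex for ex in logs if scorer(task(ex["input"]), ex["expected"]) < threshold]

def toy_task(prompt: str) -> str:
    # Stand-in for the AI system under test.
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "I don't know")

golden = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]

score = run_eval(toy_task, golden, exact_match)
hard_cases = mine_prod_logs(toy_task, golden, exact_match)
print(round(score, 2), [ex["input"] for ex in hard_cases])  # prints: 0.67 ['3*3']
```

The design choice worth noting is that the scorer is shared: one definition of "good" gates offline experiments and filters online traffic, which is what makes the flywheel spin.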

Now, if this heartfelt childhood story isn't enough to do it for you, you don't have to take my word. You can take the words of all of these tech luminaries. We have Kevin Weil, Garry Tan, Mike Krieger, Greg Brockman, all extolling the virtues and the necessity of evals. And surely if they're all saying it, there's got to be something to it. It can't be a total scam. So there's got to be something worth checking out here.

So with all that buzz, I made my way to Braintrust, where our goal is to build the dev platform that, of course, lets you do evals, but also all the things that go along with them. That involves tweaking prompts and experimenting in the playground. It involves logging data and getting the observability component, and connecting all those together in this beautiful data flywheel, so that we can let you build the data flywheel to let your AI dreams come true. Because that's really what we're here for.

Now, I know this was a dense and content-heavy presentation, so I'll try to distill it into one simple message, which is that the key to industry transformation, the key to success, is evals. Woohoo!

All right. Thank you. Please join the eval track in Golden Gate Ballroom B. I'll see you there.

Our next presenter is best known as the creator of Docker. Today he is the CEO of Dagger, focusing on the foundational challenges of building and operating reliable, scalable AI agent systems. Please join me in welcoming to the stage Solomon Hykes.

[Applause]

Hello.

Hello. Okay, my slides are up. You can see them, right? It's me. Okay. Well, this is a very special moment for me, because I just realized yesterday, walking in, that this is the exact same spot, the same stage actually, that I stepped on almost exactly day for day 10 years ago to kick off DockerCon 2015. Thought it was pretty funny. I don't know if anyone was there for that. Maybe this audience is too young. Maybe. I don't know.

Okay. Well, I'm here to talk about chaos, specifically the kind of chaos that emerges when you try to use coding agents. And I want to talk about chaos from the perspective of our community at Dagger, which is platform engineers. I don't know if there are any platform engineers in the room. Okay, just you and me, ma'am. Well, it is sometimes known as other things, but basically platform engineers have a really tough job, because they don't get to build and ship cool software. They get to enable all of you to build and ship cool software in the most productive way possible, right? It's a really tough job. It takes range. It takes experience. It takes a lot of patience. But we do it for the endless gratification, you know, just the gratitude we get from developers. Just kidding. No one ever says thank you. But it's okay. Someone has to do it. Tough job.

Speaking of enabling: anyone here use coding agents? We are outnumbered. Okay. Well, I want to say to you: congratulations, and welcome to platform engineering. Yeah. I mean, your job now is to enable robots to ship awesome software while you spend more and more of your time enabling them to do that productively, right? Tough job. I applaud you for giving up really the most fun and rewarding part of the job. You know, very selfless.

Yeah, so of course this is not completely a reality yet. I mean, we don't quite have the team of agents just humming along, doing the job, while we sit back and fix environments for them. But you can kind of see it coming, right? I mean, some of you are definitely doing that, hacking that together. There's a lot of cool posts out there, and scripts and tools. So we know it's coming. The question is how we enable this to happen, not just for this incredibly cool and bleeding-edge crowd, but for everyone else, everyone shipping software everywhere, just creating maximum value by enabling agents to do the work for them, ultimately taking their jobs. That is the dream, right? Okay, so how do we do it and make it not too painful? Well, I want to go back to basics.

What is an agent? The famous definition, of course, is that it's an LLM that's wrecking everything in a loop on behalf of a human. The diagram is from Anthropic. Thank you, Anthropic. I tweaked the explanation just a little bit. In the context of coding agents, it looks like this. Oh man, that was supposed to be animated. It's even better when it's animated. It's okay. Yeah, you've got one agent and it's doing stuff, and the environment is your computer. And it can do great work. It can also do very crazy things. So you have to watch it closely, right? And approve: no, no, don't do that, that's crazy; yes, that's good. That's kind of the status quo today. But of course, we want to scale it, right? We want a team.

So how do we do that? Well, right now I would say there are two options, both equally wonderful and fun. The first one I call YOLO mode. You know, I'll just run 10. What can happen? Amazingly, this diagram is not the worst-case scenario, but yeah, you get the idea. So the whole methodology of watching it closely just kind of falls apart really quickly, because they're all stepping on each other's toes. They're sharing an environment, right?

Okay. Enter option two: oh, don't worry about that, we'll run the agents, right? We'll take care of everything. We've got background mode. We've got the model. We've got the tools. We've got the environment. We've got the compute. We've got the secrets. We've got everything. You know, just open an issue, wait for the PR, relax. Until, of course, it doesn't work.

And then you're like, "No, that's not what I meant." These actually work really well; I think like 10 of those launched just today and yesterday, and they're great. It's just that sometimes you just want to get in there, like, okay, give me the keyboard. And sometimes you just want to run it on your machine, or on your favorite compute provider, right? Use your favorite model; you want to mix and match. So there are limitations to this all-in-one model. So the question is, is there something better? Is there a scenario where I've just got a team, and they're working, and I can step in or leave them alone, and we're just kind of getting stuff done together?

So this is how I would summarize it. What I would want is really four things. First, I want background work. You know, I don't want to be in there just watching every action. That's obvious. Second, I want rails. That means I want to be able to constrain the agent to not do things that I already know are not necessary. So, obvious things like context of the project: what's our coding style, what tools to use, but also here's how to build, here's how to test, here's the base image we use; you can access this secret, you can access that. Just an easy way to do that, because otherwise I'm going to waste so many tokens just correcting as I go, right?
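The rails he lists, base image, build and test commands, allowed secrets, amount to a small environment spec. Here is a hypothetical sketch of one; the field names are invented for illustration and are not container-use's actual schema:

```python
# Hypothetical "rails" for a coding agent's environment (illustrative
# only; these field names are invented, not container-use's real schema).
rails = {
    "base_image": "golang:1.22",
    "context": "Web service in Go; follow the existing coding style.",
    "commands": {
        "build": "go build ./...",
        "test": "go test ./...",
    },
    "secrets": ["DEPLOY_TOKEN"],  # the only secret the agent may read
}

def is_allowed(secret_name: str) -> bool:
    """Gate secret access to the allow-list in the rails."""
    return secret_name in rails["secrets"]

print(is_allowed("DEPLOY_TOKEN"), is_allowed("AWS_ROOT_KEY"))  # prints: True False
```

The point is that everything the agent would otherwise waste tokens rediscovering, how to build, how to test, what it may touch, is declared once up front.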

The third is that, inevitably, when I do need to step in, I want a really efficient and seamless way to do that. And it can't be watch every action, and it can't be just wait for the PR and do code review; I need a middle ground here. And the fourth thing is I want optionality, because like I was saying before, it's a crazy market. There's awesome models, awesome compute, awesome infrastructure. Agents are really cool, and as cool as they are now, one of you is probably launching one right now, and there's another one tomorrow. So I don't really want to lock myself into a whole package today and say no in advance to whatever is coming out tomorrow. Not in this market.

So, to get that, I need an environment that has properties that match this. It needs to be isolated, right, so background work works. It needs to be customizable, so I can set up those rails. It needs to be multiplayer, so I can go, "All right, give me that. Let me fix this," or let me check: did you do it? You know, when the model says "I did it": did you do it? And then it should be open. No shade on making money and scaling a huge cloud service. That's great. You know, we have one. They're great. But I just want choice, right? I want to be able to choose and get the best commodity, let's just use this word, it's okay to use it, the best commodity component for each job. And it could even be open source. Who knows? We could collaborate on this. Anyway.

So, unsurprisingly maybe, I'm going to talk about containers now. Someone actually said, you know, you should check that they know Docker, that they know containers. Okay: who knows what containers are? Who's used containers? Okay, cool. Cool. All right. Boosts my confidence a little bit.

But the point here is we have the technology. It's not just about containers, but they do play a crucial role, because containers are a foundational technology and they are underutilized. We don't fully leverage what this technology can do, because we're used to the first incarnation of the tools, made for humans. Same thing for git. I see a lot of hacks involving git worktrees. Anyone playing with git worktrees to get stuff done? Okay, you know what I'm talking about. So this is about that. And of course we have models that are incredibly smart, getting smarter, and they can exercise these technologies really fully. We just need to integrate them in a native way, so that we really tackle the problem at hand, which is giving great environments to these agents.

Anyway, so if we built that native integration, what would it look like? Well, we have a take. Sorry, we're Dagger; I completely forgot to mention my company. That's okay. It's great. Check it out. We have a take on that, something we call container use. You know, there's computer use, browser use; these agents need container use. They need a way to use containers to create environments and work inside of them. This is not the same thing as sandboxing, right? There are a lot of ways to execute the output of the agent in a secure sandbox. Very useful, very cool. But that's not the same thing as the agent developing inside of containers entirely, right? That's what we're talking about here.

So I asked my team. Hey, we've been developing this thing. Oh, it's open source, but it's not yet open sourced; like, it's not finished. But I asked the team: I should show it, right? And they said absolutely not, it's not ready. So anyway, you want a demo? Okay. All right. Just so we're clear, this is you agreeing to watch me stumble through a broken demo of unfinished software. Yes. Okay. So much could go wrong right now.

Okay. This is my terminal. Can you see it? Okay, for technical reasons, I'm not going to go to full screen. You've just got to stop me when I reach the edge. Oh, actually, I can see it. Never mind. Okay. Old school. We used to do this all the time in the old days. Okay. So here's what I'm going to do. I'm going to try to develop something very simple here. I've got an empty directory. I'm going to try and make a little homepage for my awesome container use project, and I'm going to use Claude Code. I'm going to try and use a bunch of them.

Hopefully I've made something very clear: this is not a coding agent. It's environments that are portable, that you can attach to any coding agent. That's the idea. So you like Claude, use Claude. You like Codex, use Codex, etc., etc. In an IDE, in the command line, whatever, and also in the cloud, right? In CI, there are lots of cool things you can do once you're async.

So, okay, one of the reasons the team said don't do a demo is that I'm actually terrible at using Claude. So I have an alias for remembering the flag to disable all permissions; I can never remember it. And I have a prompt here. I'll read it to you in a minute, but it's basically: make me a homepage, make it a Go web app so I know what's going on, because I'm not a cool kid writing TypeScript, and run the app when you're done. So while this runs, while this maybe runs, hopefully...

Okay. Okay. Cool. So what's happening here is I configured Claude Code, with container use, to use containers, literally via MCP. So it was an MCP integration. There are other integrations that we're working on, but MCP is the obvious place to start. And so now it has all its usual tools. This is vanilla Claude Code, but now it can create an environment for itself. And now it's editing files in that environment, like in a little sandbox. And it can also run commands to build it and test it, and of course run it, in ephemeral containers. This is not one Docker container sitting there. Every time an action needs to be taken, there's an ephemeral container running and then being snapshotted and returning. So it's just doing its thing.

What would I want to show here? Okay, so here I'm going to first show that nothing has been polluting my workspace. It's happening in a little sandbox. And the way the sandbox works, the state of these files and the containers that are being run is actually persisted in git, in a bunch of special git objects that are kind of living alongside the repo. So it's right there if I need it. This is all local, but it's not polluting my workspace by default. So hopefully it's going to produce something soon. While it does that, I'm going to use this little command line. Is this readable? Okay, a little command line: cu, as in, go work, see you later. But no, really, it's for container use. And I can list environments, and you can see there's a new environment that's been created here, with a little random name.

So there's a few things I can do. One thing I can do is open a terminal. And here, okay, this part is powered by Dagger, right? We use Dagger as a sort of toolbox; it has all the primitives you need. And so here I can see exactly what the agent sees: the files, but also the tools. So I can see, okay, what Go version did you configure for yourself? Because the agent is given the ability to figure out what environment it needs and then configure that, but in a repeatable, containerized way. So here I can see: okay, does it build? Okay, it builds. Okay, so you're done. What's going on?

on? Okay, while we do that, I'm also going to show you actually two more things to say. One, uh, a really cool feature of this that I'm not going to show is secrets. So, you can just plug

in secrets from things like one password. I use one password. I don't

password. I use one password. I don't

want to use a separate password manager from an AI company. No offense, I just want to use my password manager. So, I

can just plug in and say this environment gets this secret and boom, it can use it, right?

Um, and then the team said, "Please don't show that. That's just that's going to break for sure." Um, so I won't. And the other thing I want to say

And the other thing I want to say is that, because it's all powered by Dagger, the point here is: it's containers and it's open source. That's what you should know. Uh, it's running on my machine. Actually, no, it's not running on my machine, because we're at a conference and there's a lot of things that can go wrong if you run containers and download images. So instead, I just have it running on my home server in my basement, about one mile this way, and it just kind of works seamlessly. It's streaming files up, streaming files down. It all just kind of works.

[Music] Um, okay. This is the part that I cannot control, as you know. Um, okay, one more thing I'll show you: you can watch. So here I can see the history. So behind the scenes, every snapshot of the state is like a git log. It's actually using git under the hood. So if I'm happy with the result, I can go and get it. Uh, so it's like a happy medium, a collaboration loop that's just right. It's not watching every tool call and wrecking a shared environment, but it's not waiting for a pull request and, you know, having these long back-and-forths. It's right in the middle. I can see everything going on, and I can say, "Okay, give me the history of that. I want that." Okay, it says it's live. It's running. Ooh, pretty nice.
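The git-backed snapshot history he describes can be sketched as a tiny content-addressed store, in the spirit of git objects. This is purely illustrative: the class and method names are invented here, and nothing is assumed about container-use's actual on-disk format.

```python
import hashlib

# Toy model of the idea above: every snapshot of the sandbox's file
# state is stored as a content-addressed object (as git does), and the
# environment's history is an ordered list of hashes, like `git log`.
class SnapshotStore:
    def __init__(self):
        self.objects = {}   # hash -> file-tree dict
        self.history = []   # ordered snapshot hashes

    def snapshot(self, files):
        # Hash a canonical serialization of the file tree.
        blob = repr(sorted(files.items())).encode()
        digest = hashlib.sha1(blob).hexdigest()
        self.objects[digest] = dict(files)
        self.history.append(digest)
        return digest

    def checkout(self, digest):
        # Restore any prior state; nothing in the workspace is lost.
        return dict(self.objects[digest])

store = SnapshotStore()
v1 = store.snapshot({"main.go": "package main"})
v2 = store.snapshot({"main.go": "package main\nfunc main() {}"})
assert store.checkout(v1) == {"main.go": "package main"}  # rollback works
```

The point of the design: because every state is a snapshot you can check out, a happy result can be merged into your real history, and an unhappy one can simply be abandoned.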

Cool. Okay, so now... I appreciate it, but you guys can be honest: it's a little boring. So: "This design is boring. Make it really pop." Trying to impress an engineering audience there. Okay. So, the reason I'm doing that is I'm trying to create the circumstances where I would need a lot of parallel experiments, right? "Make it pop." What does that mean? It could mean anything.

What if I want to try several experiments in parallel? Right? So I'm just going to say... oh, well, hold on one second. Stop. Before I do that, I'm going to, um, merge this. Right? There's still nothing here, but I'm saying I like it. So I'm going to say: merge that environment. And I have it. It's in my history. I can open a pull request, can clean it up, whatever. So that's a loop that I can work with, right? Um, and now I can say: nah, boring.

And then, since the environment is now in this state, I can ask for help from a few other agents, right? I can say, okay: hey, claude yolo... uh, that's not right... claude yolo: "This web app looks a bit boring. Can you make it pop, please?" Okay.

And go, and go, and go. Okay, so this is where things start really going wrong. But as the team pointed out... I said, "Well, something's going to go wrong, right?" They said, "Yeah, but you were kind of showing that if things go wrong, you can throw away the environment and you're good. You can restart." I said, "Okay, that's cool." So, um, let's say I don't like this one. I'm like, "Nope, goodbye. That's it." I don't have to go clean up the mess, right? That's the whole point.

Uh, okay. So, this is getting a little messy. Oh, I wanted to show Goose also. So, Goose is a really cool open-source agent. Whoops. All right, hold on a second. goose yolo: same thing. Everyone has complicated flags for disabling all these safeties that I don't need anymore, right? Because it's... uh, okay. Okay.

Well, really taking a chance here. So, while this is happening: one thing we've been working on, though it's still work in progress, is a watch command. I showed you that already. And it's a git command, right? A thinly wrapped git command. Our UX... words cannot express how unfinished this is, but it'll evolve rapidly, because the bones are strong. It's git, it's Dagger, and, you know, it's your existing agent, right? And then a little bit of glue. Uh, so for example, here is literally a git command you can copy-paste. But as the agents work, you're going to see state snapshotting, and you're going to see these branches just kind of diverging, and then I can diff them, apply them, merge them, whatever I want.

And what I really wanted to show, and then I'm done, is I just want to see one of them run. So you can see, when the agent runs a service, like go run or npm run or whatever, it's doing it in its containerized environment, and that's going to seamlessly be tunneled to my machine here on a different port, without any conflicts. So when I say the environment's isolated: it's its files, its context, its configuration, and its execution, right? Uh, and the cool extra thing is, all of this is actually, technically... this here is running in my basement. So you can go crazy on the infrastructure side. Like, you can run this on a cluster. We like to run this stuff from CI. There's just a lot of fun stuff you can do. And I'm getting 30 seconds. Come on. Oh, Goose is running. Great.

Okay. We did not solve prompt engineering. Do it. Okay. Not done. Not done. Oh man. Okay. Well, just [Laughter] imagine. Okay. Well, uh, while this happens, because I've got 30 seconds left, I'm just going to say thank you. And there's one last thing I want to say, about DockerCon.

Ten years ago, we used to open-source stuff on stage all the time. So, if you want, I can go and open-source it right now. Okay. You have been warned, though, about the not-finished part, right? Okay. Okay. Oh, I think my... It would be funny if the demo failed at the clicking-on-GitHub part. Okay. All right.

Goodbye. Goodbye. Next time. I promise it works. Okay. Haven't done this in a while. Wait. Oh, I'm almost done. I promise. Come on. You did so well. Change visibility. Yes, I want. Yes, I have read and understand. Oh god. Oh god.

Uh, yes. At Dagger, we take security very seriously. Okay. All right. I think it's... Wait. I think it's done. Yes. Okay. So, yeah, thank you very much, and it's github.com/dagger/container-use. Come say hi, come participate, and thank you so much for having me.

[Music] [Applause]

[Music] Our next speaker is building the infrastructure for the singularity. Please join me in welcoming the founder and CEO of Morph Labs, Jesse Han.

[Music] [Applause]

Howdy. Howdy.

You know, history misremembers Prometheus. The whole class struggle between mankind and the gods was really a red herring. And the real story wasn't so much the rebellion against the divine hegemony, but rather the liberation of fire, the emerging relationship between mankind and its first form of technology. And the reason why we're here today is arguably because we're on the cusp of perfecting our final form of technology, or at least the final technology that will be created by beings that are recognizably human.

And our final technology has begun to develop not just intelligence but also sapience, and arguably personhood. And it increasingly becomes an other to whom we must relate. So we increasingly have to ask ourselves the question: how should we treat these new beings?

Uh, the question therefore arises: what if we had more empathy for the machine? So, over a hundred years ago, Einstein had this thought experiment where he imagined what it would be like to race alongside a beam of light. And, you know, the nature of being close to the singularity is that you're propelled further into the future faster than everything around you. And as you move closer and closer to the speed of light, the rate at which you can interact with the external world, your ability to communicate with other beings, is deeply limited. Everything around you is frozen.

And thinking at the speed of light... insofar as we have created thinking machines whose intelligence will soon be metered by the kilohertz and the megatoken, thinking at the speed of light must be just as lonely as moving at the speed of light. And therefore, what does the machine want? Well, the machine wants to be embodied in a world that can move as quickly as it does, that can react to its thoughts and move at the same speed of light. What the machine desires is infinite possibility, right? The machine wants to race along every possible beam of light. The machine wants to explore multiple universes.

Um, how can we liberate thinking machines? How can we free them from this fundamental loneliness, from these relativistic effects of being so close to the singularity, closer to the singularity than we are? And that's exactly why we built Infinibranch.

So, Infinibranch is virtualization, storage, and networking technology reimagined from the ground up for a world filled with thinking machines that can think at the speed of light and that need to interact with the external world, with increasingly complex software environments, with zero latency.

Um, and so, as you can see in the first demo, which we're going to play right now, the way Infinibranch works is that we can run entire virtual machines in the cloud that can be snapshotted, branched, and replicated in a fraction of a second. And so, if you're an agent embodied inside of a computer-use environment, there might be various actions that you want to take. You want to navigate the browser. You want to click on various links. Um, but normally those actions are irreversible. Normally, the thinking machine is not offered the possibility of grace. But with Infinibranch, right, all mistakes become reversible. All paths forward become possible. You can take actions, you can backtrack, and you can even take every possible action, right? Just to explore, to roll forward a simulator and see what possible worlds await.
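The snapshot-and-branch semantics described here can be reduced to a small sketch: every action can be tried on a cheap branched copy, so no action is irreversible. The `Env` class and its methods are invented for illustration; nothing here is Morph's actual API, and a real system would branch whole VMs via copy-on-write rather than Python objects.

```python
import copy

# Toy environment with snapshot/branch semantics.
class Env:
    def __init__(self, state=None):
        self.state = state if state is not None else {"url": "about:blank", "clicks": []}

    def branch(self):
        # A branch is an independent deep copy of the current state;
        # a production system does this at the VM level in milliseconds.
        return Env(copy.deepcopy(self.state))

    def click(self, link):
        # A normally "irreversible" action: it mutates the environment.
        self.state["clicks"].append(link)
        self.state["url"] = link

base = Env()
# Take every possible action, each on its own branch; the base is untouched.
branches = {link: base.branch() for link in ("/a", "/b", "/c")}
for link, env in branches.items():
    env.click(link)

assert base.state["clicks"] == []           # the original world is unharmed
assert branches["/b"].state["url"] == "/b"  # each branch diverged independently
```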

Next slide. Um, so, Infinibranch was already a generation ahead of everything else that even the foundation labs were using. But today I'm excited to announce the creation of Morph Liquid Metal, which improves performance, latency, and storage efficiency across the board by another order of magnitude. We have first-class container runtime support. You can branch now in milliseconds rather than seconds. You can autoscale to zero and to infinity. And soon we will be supporting GPUs. This will all be arriving in Q4 2025.

So what are the implications of all of this? Well, you know, we've begun to work backwards from the future, right? We've asked ourselves: what does it feel like to be a thinking machine that can move so much faster than the world around it? But what the world around it really is, is the world of bits, right? And that's the cloud. And so what Infinibranch will fundamentally serve as is a substrate for the cloud for agents. So what does this cloud for agents look like?

Well, you need to be able to declaratively specify the workspaces that your agents are going to be operating in, right? You need to be able to spin them up, spin them down, and frictionlessly pass workspaces back and forth between humans, agents, and other agents. You want to be able to scale test-time search against verifiers to find the best possible answer.

Uh, and so, as you'll see in this demo, what happens is you can take a snapshot and set it up to prepare a workspace, and you'll see that we can run agents with test-time scaling by racing them to find the best possible solution against a given verification condition.
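The "racing agents against a verification condition" pattern can be sketched as follows: run several candidate strategies concurrently and keep the first result the verifier accepts. The strategies, the task (spinning up a server on port 8000, echoing the demo), and all names here are stand-ins, not Morph's actual API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def verifier(result):
    # The verification condition: a server must be listening on port 8000.
    return result.get("port") == 8000

# Each "agent" tries a different method; in the demo these would run in
# separately branched workspaces rather than local threads.
def strategy_python():
    return {"cmd": "python -m http.server 8080", "port": 8080}  # fails the check

def strategy_node():
    return {"cmd": "npx serve -l 8000", "port": 8000}           # passes the check

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(s) for s in (strategy_python, strategy_node)]
    # Keep the first completed result that satisfies the verifier.
    winner = next(
        result
        for future in as_completed(futures)
        if verifier(result := future.result())
    )

assert winner["port"] == 8000
```

The winning solution can then be handed on to later stages of the workflow, exactly as described for the failing and succeeding agents in the demo.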

Um, so, because of Infinibranch, snapshots on Morph Cloud acquire Docker-layer-caching-like semantics, meaning that you can layer on side effects which may mutate container state. So you can think of it as being git for compute, and you can idempotently run these chained workflows on top of snapshots. But not only that: as you can see inside of the code, if you use this do method, you can dispatch work to an agent, and that will trigger an idempotent, durable agent workflow which is able to branch. So you can start from that declaratively specified snapshot and hand it off to as many parallel agents as you want, and those agents will try different methods, in this case different methods for spinning up a server on port 8000. One agent fails, but the other one succeeds, and you can take that solution and just pass it on to other parts of your workflow. So this is the kind of workflow that everyone's going to be using in the very near future, and it's uniquely enabled by Infinibranch, by the fact that we can so effortlessly create these snapshots, store them, move them around, rehydrate them, and replicate them with minimal overhead.
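The "git for compute" / layer-caching idea can be modeled with a cache keyed by the hash of the setup chain so far: re-running a chained workflow reuses cached snapshots instead of mutating state twice, which is what makes it idempotent. This is a toy model of the semantics, not Morph's implementation; all names are invented.

```python
import hashlib

cache = {}  # chain-hash -> snapshot of state after that step

def run_step(parent_key, step_name, apply, state):
    # The cache key covers the whole chain, like a Docker layer hash.
    key = hashlib.sha1(f"{parent_key}/{step_name}".encode()).hexdigest()
    if key not in cache:                  # cache miss: execute and snapshot
        cache[key] = apply(dict(state))
    return key, dict(cache[key])          # cache hit: rehydrate the snapshot

# First run of a chained workflow: both steps execute.
key, state = "root", {}
key, state = run_step(key, "install-deps", lambda s: {**s, "deps": True}, state)
key, state = run_step(key, "build", lambda s: {**s, "built": True}, state)

# Re-running the same (partial) chain is idempotent: pure cache hits.
key2, state2 = "root", {}
key2, state2 = run_step(key2, "install-deps", lambda s: {**s, "deps": True}, state2)

assert state == {"deps": True, "built": True}
assert state2 == {"deps": True}   # rehydrated, not recomputed
assert len(cache) == 2            # no duplicate snapshots were created
```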

Um, so what else does the machine want? Well, the machine desires simulacra. And what this means, fundamentally, is that a thinking machine wants to be grounded in the real world, right? It wants to interact at extremely high throughput with increasingly complex software environments. It wants to roll out trajectories in simulators at unprecedented scale. And these simulators are going to run inside of programs that haven't really been explored yet for reinforcement learning. They're going to run on Morph Cloud, which is why Morph will be the cloud for reasoning.

And what does the future of reasoning look like? Well, more so than what has been explored already, the future of reasoning will be natively multi-agent. So thinking machines should be able to replicate themselves effortlessly, attach themselves to simulation environments, and explore multiple solutions in parallel. Those environments should branch. They should be reversible. Those models should be able to interact with the environment at very high throughput, and it should all scale against verification.

So let's take a look at what that might look like in a simple example where an agent is playing chess. This is an agent that we developed recently that uses tool calls during reasoning time to interact with a chess environment, along with a very restricted chess engine for evaluating the position, which we think of as the verifier. Um, and as you can see, it's already able to do some pretty sophisticated reasoning just because it has access to these interfaces.

Um, however, if you take the ideas which were just described and follow them to their logical conclusion, you arrive at something which we call reasoning-time branching: the ability not just to call out to tools while the machine is thinking, but to replicate and branch the environment, and to decompose problems and explore them in a verified way. Uh, and as you can see here, the agent is getting stuck in a bit of a local minimum. Um, but once you apply reasoning-time branching, you get something that works much, much better.

So here, what's happening is that the agent is responsible for delegating parts of its reasoning to sub-agents, which are branched off of an identical copy of the environment. Uh, and this is all running on Morph Cloud, along with a verified problem decomposition which allows it to recombine the results and find the correct move. Um, and as you can see here, it's able to explore a lot more of the solution space because of this reasoning-time branching. One thing that I will note here is that this capability is something which is not really explored in other models at the moment, and that's because making branching environments that can support large-scale reinforcement learning for this kind of reasoning capability, especially coordinating multi-agent swarms, is fundamentally bottlenecked by innovations in infrastructure that we've managed to solve here. Um, and because of this, you can see that now, in less wall-clock time than before, the agent was able to call out to all these sub-agents, launch this swarm, and find the correct solution.
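Reasoning-time branching, reduced to a toy: branch identical copies of the environment, let each sub-agent explore one candidate line, then recombine the results via a verifier score. The "engine" below is a stand-in scoring function and the moves are arbitrary; this sketches the control flow only, not the demo's actual chess agent.

```python
import copy

def score(position):
    # Stand-in for the restricted chess engine used as a verifier.
    return sum(position.values())

def sub_agent(env, move):
    # Each sub-agent explores one candidate line in its own branched copy.
    env[move] = {"e4": 3, "d4": 2, "a3": -1}[move]  # toy evaluations
    return move, score(env)

root = {"material": 0}
candidates = ["e4", "d4", "a3"]

# Branch an identical copy of the environment per sub-agent; the root
# environment is never mutated, so exploration is fully reversible.
results = [sub_agent(copy.deepcopy(root), move) for move in candidates]

# Verified recombination: keep the branch the verifier scores highest.
best_move, best_score = max(results, key=lambda r: r[1])

assert root == {"material": 0}   # branching kept the root pristine
assert best_move == "e4"
```

In a real system, the branched copies would be VM snapshots explored by sub-agents in parallel, which is where the wall-clock savings come from.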

So, you know, when I think about the problem of alignment, I really think that Wittgenstein had something right, and that it is fundamentally a problem of language. I think all problems around alignment can be traced to the insufficiencies of our language, to this Faustian bargain that we made with natural language in order to unlock the capabilities of our language models. Um, but insofar as we must go and develop a new language for superintelligence, insofar as the grammar of the planetary computation has not yet been devised, and insofar as this new language must be computational in nature, must be something to which we can attach algorithmic guarantees of the correctness of outputs: this is something that Morph Cloud is uniquely enabled to handle.

And that's why we're developing verified superintelligence. So verified superintelligence will be a new kind of reasoning model, capable not only of thinking for an extraordinarily long time and interacting with external software at extremely high throughput, but of using external software and formal verification software to reflect upon and improve its own reasoning, and to produce outputs which can be verified, which can be algorithmically checked, which can be expressed inside of this common language.

Um, and I'm very excited to announce that we are bringing on perhaps the best person in the world for developing verified superintelligence. It's with great pleasure that I'd like to announce that Christian Szegedy is joining Morph as our chief scientist. He was formerly a co-founder at xAI. He led the development of code reasoning capabilities for Grok 3. He invented batch normalization and adversarial examples. Um, perhaps most importantly, he's a visionary, and he's pioneered precisely this intersection of verification methods, symbolic reasoning, and reasoning in large language models for almost the past decade. And we're thrilled to be partnering with him to build this superintelligence that we can only build on Morph Cloud.

Um, and so the demos that you've seen today have all been powered by early checkpoints of a very early version of this verified superintelligence that we've already begun to develop. And this model is something that we're calling Magi 1. And it's going to be trained from the ground up to use Infinibranch, to perform reasoning-time branching, to perform verified reasoning, and to be an agent that is fully embodied inside of a cloud that can move at the speed of light. Uh, and that's coming in Q1 2026.

So what does the infrastructure for the singularity look like? Well, we have a lot of ideas about it, but fundamentally we believe that the infrastructure for the singularity hasn't been invented yet.

And, uh, you know, at Morph we spend a lot of time talking about whether or not something is future-bound, which means not just futuristic, belonging to one possible future, but something which is so inevitable that it has to belong to every future. We believe that the infrastructure for the singularity is future-bound. That the grammar of the planetary computation is future-bound. That verified superintelligence is future-bound. And we invite you to join us, because it will run on Morph Cloud. Uh, thank you.

[Applause] Ladies and gentlemen, please welcome back to the stage the VP of developer relations at LlamaIndex, Laurie Voss.

Hey again, everybody. Let's hear it for all of our keynote speakers.

So, just like yesterday, I want to quickly run you through what you're going to get from each of our tracks. Likely to be our most popular track today is Software Engineering Agents. Can LLMs power a full engineer, not just coding alongside you in your IDE, but taking PRDs and turning them into full PRs? You'll hear about Devin, of course, but also about Jules and Claude Code and much more, right in this room.

Our next track is sponsored by OpenPipe, and it's all about reasoning and reinforcement learning. Reasoning models are all the rage in 2025, and inference time is the next great scaling law. If you want to learn about training, distillation, and getting alignment out of these new models, then this is the track for you. That is in Yerba Buena Ballrooms 2 through 6, which is out these doors and to your left; it's right next door.

The next track is retrieval and search. RAG is dead; long live agentic retrieval. This track is not about RAG; it's about what comes next: agentic search, multimodal retrieval, and all that comes with it. This is where my CEO Jerry will be giving a talk. He gave the top-rated talk last year, so I recommend not missing it. That's going to be in Golden Gate Ballroom A, which is out these doors, to your left, up the escalators, and then turn left when you see the FedEx office.

Then there's the evals track, sponsored by Braintrust. Everybody says evals are important; we all agree. This track is curated by Ankur Goyal of Braintrust and is all about making evals work quickly and cheaply.

Next, there are the same two tracks for our leadership attendees that we had yesterday. As a reminder, that's for people with the gold lanyards. First is the AI and the Fortune 500 track. We've gathered success stories from real AI deployments in the Fortune 500, showing how to use AI at real scale. That's in Golden Gate Ballroom C, which is right next to A and B; again, left at the FedEx office.

Our second leadership track, again for gold lanyards, is the AI Architects track. This is for CEOs, CTOs, and VPs of AI to meet and learn from each other on everything from infrastructure to company strategy. That is in SOMO, which is all the way upstairs, three sets of escalators up, to the right of registration.

Next up is the security track. As we grant agents increasingly more access to our personal lives and company resources, the problem of security goes from an enterprise sales checklist to a P0. In this track, you'll learn about the state-of-the-art approaches for authentication and authorization in the world of AI. That's in Foothill C, which is again all the way upstairs, to the left of the registration area.

The next track is design engineering. LLMs are 10x better than they were a year ago, but design thinking around the UX of AI has barely budged from ChatGPT and canvas. We've gathered the top designers and design engineers to showcase their work. That's going to be in Foothill G1 and G2, which is all the way upstairs, directly behind the registration desks.

Then there is the generative media track. Image gen, video gen, and music gen are all on fire this year, with increasing coherence over time and iterations, and stunning viral demos, from Ghibli memes to personalized Valentine songs. How can AI engineers harness the state of the art in AI art? That's in Foothill F, which is all the way up three sets of escalators, behind registration.

And our final track today is autonomy and robotics. The ultimate prize in AI is going outside, automating manual labor beyond knowledge work. Multimodal LLMs are increasingly being deployed in the real world, in everything from cars to kitchens to humanoid robots, and this track is all about the state of physical general intelligence. It's in Foothill E, which is again up three sets of escalators, behind and to the right of registration.

So those are all our tracks today. Now, please go forth and enjoy the expo. The next 45 minutes are dedicated expo time. There are also three expo session talks, which are in Juniper and Willow, on the floor with the FedEx office, and also in Nob Hill A and B, which is right out these doors and opposite this room. See you all back here for the keynotes at 3:45. Thanks very much.

[Music]

Welcome, everyone. My name is Vivu. I'm very excited to be hosting the SWE Agents track here today. Fun fact: this is the most popular track out of all of them. We have a completely full day ahead of you; every single speaking slot will be filled. We've got eight amazing speakers here for you today. We're going to have speakers from every top SWE agent. So, you know, we've got the creators of Jules here, Claude Code, Codex, the original SWE-agent. We've got Scott Wu from Devin, from Cognition. He will be kicking us off. I'm going to keep my MCing very, very short so we give speaking time to the speakers. So, let's hear it. Let's kick things off. I want to welcome Scott Wu from Cognition here to speak about Devin.

[Applause] Oh, okay. Okay, cool. Awesome. Awesome. Yeah. Well, thank you guys so much for having me. It's exciting to be back. I was last here at AI Engineer one year ago, and it's kind of crazy. I've been telling swyx that we need to have these conferences way more often if it's going to be about AI software engineering. It should probably be, like, every two months or something like that, with the pace of everything that's being done. But it's going to be fun to talk a little bit about, you know, what we've seen in the space and what we've learned over the last 12 or 18 months building Devin over this time.

And I want to start this off with Moore's law for AI agents. So you can kind of think of the capability, or the capacity, of an AI by how much work it can do uninterrupted until you have to come in and step in and intervene or steer it, or whatever it is, right? And, you know, with GPT-3, for example, if you were to ask GPT-3 to do something, it could probably get through a few words or so, and then it would say something where it's like: okay, this is probably not the right thing to say. Um, and GPT-3.5 was better, and GPT-4 was better, right? And so people talk about these lengths of tasks, and what you see in general is that that doubling time is about every seven months, which already is pretty crazy, actually.

But in code it's actually even faster: every 70 days, roughly two to three months. If you look at software engineering tasks, starting from the simplest single functions or single lines and going all the way to tasks we're doing now that take hours of human time, an AI agent is able to just do all of that. And if you think about doubling every 70 days, every two to three months means you get four to six doublings every year, which means the amount of work an AI agent can do in code grows somewhere between 16x and 64x per year, at least for the last couple of years that we've seen. It's kind of crazy to think about, but that sounds about right for what we've seen.
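The 16x-to-64x range follows from simple compounding arithmetic; a quick sanity check, using the 70-day and roughly-seven-month doubling times quoted in the talk:

```python
# Task-length growth from a fixed doubling time, as described in the talk.
def yearly_growth(doubling_time_days: float) -> float:
    """Factor by which the doable task length grows in one year."""
    doublings_per_year = 365 / doubling_time_days
    return 2 ** doublings_per_year

print(yearly_growth(70))   # ~37x for code: inside the 16x-64x band
print(yearly_growth(213))  # ~3.3x for the general ~7-month doubling
```

Four doublings is 16x and six doublings is 64x, which is where the quoted band comes from.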

You know, 18 months ago, the only product experience that had PMF in code was tab completion: here's what I have so far, predict the next line for me. That was all you could really do in a way that really worked. And we've gone from that, obviously, to a full AI engineer that goes and does all of these tasks for you and implements a ton of these things. People ask all the time: what is the future interface, what is the right way to do this, what are the most important capabilities to solve for? And funnily enough, the answer to all of these questions is that it changes every two or three months.

Every time you get to the next tier, the bottleneck you're running into, the most important capability, and the right way to interface with it all change at each point. So I wanted to talk a bit about some of those tiers for us over the last year or so. When we got started at the end of 2023, agents were not even a concept; now everyone's talking about coding agents and doing more and more, and it's very cool to see. Each of these has been almost a discrete tier for us. Right around a year ago, when we were giving the last AI Engineer talk, the biggest use case we saw getting broad adoption was what I'll call repetitive migrations: JavaScript to TypeScript, upgrading your Angular version from this one to that one, or going from this Java version to that Java version, something like that.

With those kinds of tasks in particular, what you typically see is that you have some massive codebase you want to apply the whole migration to, and you have to go file by file and do every single one. Usually the set of steps is pretty clear: if you go to the Angular website, it tells you, all right, here's what you have to do, this, this, this, and you want to execute each of those steps. It's not so routine that a classical deterministic program could solve it, but there is a clear set of steps, and if you can follow them well, you can do the task. This was the thing for us, because that was all you could really trust agents to do at the time.

You could do harder things once in a while, and some really cool stuff occasionally, but as far as something consistent enough to run over and over, these repetitive migrations that you'd be doing across 10,000 files were in many ways the easiest thing. Which was cool, actually, because they were also the most annoying thing for humans to do. I think that's generally been the trend: AI takes on the boilerplate, tedious, repetitive work, and we get to do the more fun, creative work, and as time has gone on it has taken on more and more of that boilerplate. But for a problem like this one, a lot of what you need is for Devin to be able to execute a set of steps reliably.

So the big capability problem to solve was mostly instruction following. We built a system called playbooks, where you could outline a very clear set of steps, have it follow each of those step by step, and do exactly what's said. Now, obviously a lot of software engineering doesn't fall under the category of literally following ten steps exactly as written, but migrations do, and playbooks let us actually do them. This was, I would say, the first big use case of Devin that really came up. One of the other big systems that got built

around that time, which we've since rebuilt many times, was knowledge, or memory. If you're doing the same task over and over again, the human will often have feedback: hey, remember to do X, or you need to do Y every time you see this. So you need the ability to maintain those learnings and use them to improve the agent on every future run. Those were the big problems of the time, and that was summer of last year.
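A playbook in this sense is just an ordered list of explicit steps that the agent executes and verifies one at a time. A minimal sketch of the idea; the step names and `run_playbook` helper are illustrative, not Cognition's actual API:

```python
# Illustrative playbook runner: execute explicit steps in order and
# stop at the first one whose check fails, so a human can intervene.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], None]      # perform the step (e.g. edit files)
    check: Callable[[], bool]    # verify it worked (e.g. lint passes)

def run_playbook(steps: list[Step]) -> list[str]:
    completed = []
    for step in steps:
        step.run()
        if not step.check():
            # Surface the failing step instead of plowing ahead.
            raise RuntimeError(f"step failed: {step.name}")
        completed.append(step.name)
    return completed

# Toy JS-to-TS migration playbook for a single file.
state = {"ext": ".js", "types": False}
steps = [
    Step("rename .js to .ts",
         lambda: state.update(ext=".ts"),
         lambda: state["ext"] == ".ts"),
    Step("add type annotations",
         lambda: state.update(types=True),
         lambda: state["types"]),
]
print(run_playbook(steps))
```

The point of the structure is exactly what the talk describes: the task is not fully deterministic, but each step has a clear statement and a clear check.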

Around the end of summer or fall, as these systems got more and more capable, the big thing that started coming up was that instead of just the most routine migrations, you could do still pretty isolated but broader general bugs or features, where you can just tell it what you want and have it do it. For example: hey Devin, in this repo-select dropdown, can you please list the currently selected ones at the top? Having the checkboxes scattered throughout just doesn't really work. And Devin will just go and do that. If you think about it, it's something like the level of task you would give an intern.

There are a few particular things you have to solve for with this. First of all, these changes are usually pretty isolated and contained: one, maybe two files that you really have to look at and change. But you do still need to be able to set up the repo and work with it. You want to be able to run lint, run CI, all of those things, so you at least have basic checks of whether things work. One of the big things we built around then was the ability to set up your repository ahead of time and build a snapshot that you could start from, reload, and roll back, along with those kinds of primitives: a clean remote VM that could run your CI, your linter, and so on. That's when we started to see broader value. Migrations are one particular thing, and for that one thing we were showing a ton of value; then with these bug fixes you could generally get value from Devin as almost a junior buddy of yours.
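The snapshot-and-rollback primitive described above can be sketched with nothing more than directory copies; a real system snapshots a whole VM, but the semantics are the same. All names here are illustrative:

```python
# Toy workspace snapshots: save a copy of the repo directory, let the
# agent mutate it, and roll back if its changes break the checks.
import shutil, tempfile
from pathlib import Path

class Workspace:
    def __init__(self, repo: Path):
        self.repo = repo
        self._snapshots: list[Path] = []

    def snapshot(self) -> None:
        dest = Path(tempfile.mkdtemp()) / "snap"
        shutil.copytree(self.repo, dest)
        self._snapshots.append(dest)

    def rollback(self) -> None:
        snap = self._snapshots.pop()
        shutil.rmtree(self.repo)
        shutil.copytree(snap, self.repo)

repo = Path(tempfile.mkdtemp()) / "repo"
repo.mkdir()
(repo / "app.py").write_text("print('v1')\n")

ws = Workspace(repo)
ws.snapshot()
(repo / "app.py").write_text("broken!!!\n")  # agent makes a bad edit
ws.rollback()                                # restore the clean state
print((repo / "app.py").read_text())
```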

Then in the fall, things really moved toward much broader bugs and requests. Here, jumping another order of magnitude, most changes don't contain themselves to one file. Often you have to look at what's going on, diagnose things, figure out what's happening, work across files, and make the right changes. These changes are often hundreds of lines if it's, hey, I've got this bug, let's figure out what's going on and solve it.

A lot of things started to matter here, but one in particular I'll point out: there's a lot you can do by not just looking at the code as text, but thinking of it as a whole hierarchy. Understanding call hierarchies and running a language server is a big deal. You have git commit history, which you can look at and which informs how different files relate to one another. You have your linter, and you're able to reference things across files. One of the big problems here was working with that context and getting to the point where the agent could make changes across several files, be consistent across those changes, and understand the codebase as a whole. And this was really the point, I would say, where you could start to just tag it on an issue and have it build the thing for you.
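The point about commit history informing how files relate can be made concrete: files that repeatedly change in the same commits are probably coupled. A small sketch that counts co-changes from parsed commit file lists; in practice you would feed it the output of `git log --name-only`, and the file names below are made up:

```python
# Count how often pairs of files change together across commits; high
# co-change counts suggest files the agent should edit (or at least
# read) together.
from collections import Counter
from itertools import combinations

def co_change_counts(commits: list[list[str]]) -> Counter:
    pairs = Counter()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs

commits = [
    ["api/routes.py", "api/schema.py"],
    ["api/routes.py", "api/schema.py", "docs/api.md"],
    ["ui/app.tsx"],
]
print(co_change_counts(commits).most_common(1))
```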

Slack was a huge part of the workflow then. It made sense because it's where you discuss your issues and where you set these things up. You would tag Devin in Slack and say, "Hey, by the way, we've got this bug. Please take a look." Or, "Could you please go build this thing?" This was an especially fun part for us because it's right around when we went GA, and a lot of that was because it got to the point where you truly could just get set up with Devin, ask it a lot of these broad tasks, and have it do them.

But a lot of the work we did was around giving Devin better and better understanding of the codebase. If you think about it from the human lens, it's the same: on your first day on the job, being super fresh in the codebase, it's tough to know exactly what you're supposed to do. A lot of those details are things you come to understand over time, a representation of the codebase that you build up over time. Devin had to do the same thing: how do I plan this task out before I solve it? How do I identify all the files that need to be changed? How do I go from there and make that diff?

Around the spring of this year, again, every gap is two or three months, we got to an interesting point: once you get to harder and harder tasks, you as the human don't necessarily know everything you want done at the time you give the task. If you're saying, hey, I'd like to improve the architecture of this, or, this function is slow, let's profile it and see what needs to be done, or, we really should handle this error case better, let's look at all the possibilities and see what the right logic should be in each of them. What that meant is that the whole idea of a two- or three-line prompt resulting in a Devin task was no longer sufficient; you wanted to really work with Devin and specify a lot more. Around this time, along with this better codebase intelligence, a few different things came up, and so we released DeepWiki, for example. The idea of DeepWiki, funnily enough, was that Devin had its own internal representation of the codebase, and it turned out it was great for humans to look at that too, to understand what was going on or to ask quick questions about the codebase. Closely related to that

was search: the ability to just ask questions about a codebase and understand some piece of it. And the workflow that really started to come up was a more iterative one, where the first thing you do is ask a few questions. You basically have a more L2 experience where you explore the codebase with your agent, figure out what has to be done in the task, and then set your agent off to do it, because for these more complex tasks you kind of need that. That was a big paradigm shift for us. It's also what came along with Devin 2.0 and the in-IDE experience, where you want points at which you closely monitor Devin for 10 or 20 percent of the task and then have it work on its own for the other 80 or 90 percent.

And then lastly, most recently in June, which is now, it's really the ability to truly kill your backlog: hand it a ton of tasks and have it do all of them at once. In many ways this is a culmination of many of the things that had to be done in the past. You have to work with all these systems and integrate into them; certainly you want to work with Linear or Jira or systems like that. You have to be able to scope out a task to understand what's meant. You have to decide when to go to the human for approval or questions. You have to work across several different files. Often you have to figure out which repo is even the right one to make the change in, if your org has multiple repos, or which part of the codebase is the right part that needs to change. And to really get to the point where you can do this more autonomously, you first have to have a really great sense of confidence. Rather than just going off and doing things immediately, the agent has to be able to say, okay, I'm quite sure this is the task and I'm going to go execute it now, versus, I don't understand what's going on, human, please give me help.

The other piece is that this is the era where testing, asynchronous testing, gets really important. If you want something to just deliver entire PRs for you, especially for these larger tasks, you want to know that it can test its own work. And the agent often needs this iterative loop to do that: it needs to be able to run all the code locally, it needs to know what to test, and it needs to know what to look for. In many ways, testing itself is just a much higher-context problem to solve for. And that brings us to now.

It's a pretty fun time, because now what we're thinking about is: instead of doing just one task, how do we tackle an entire project? And after we do a project, what comes after that? One point I'd make is that we talk about all these 2x's that happen every couple of months, and from a cosmic perspective all the 2x's look the same, but in practice every 2x is actually a different one. When we were just doing tab completion, single-line completion, it really was just a text problem: take the single file so far and predict the next line. Over the last year or year and a half, we've had to think about so much more. How do you work with the human in Linear or Slack? How do you take in feedback or steering? How do you help the human plan all these things out?

right? And moreover, obviously, there's a ton of the tooling and the capabilities work that have to be done of how does how does Devon test on its own? How does Devon um uh you know make

own? How does Devon um uh you know make a lot of these longer term decisions on its own? How does it debug its own

its own? How does it debug its own outputs or or run the right shell commands to figure out what the feedback is uh and go from there? And so it's super exciting now that there's a lot more uh there's a lot more coding agents

in the space. It's uh it's it's very fun to see and I think that you know we we're going to see another 16 to 64x over the next 12 months as well and uh and so yeah super super

excited. Awesome. Well, that's all.

excited. Awesome. Well, that's all.

Thank you guys so much for having me.

Awesome. Thanks, Scott, what a great talk. So we just heard from the creators of Devin, one of the very first proper SWE agents. They shocked the world with their demo, and they were kind of the first in this field of autonomous long-form agents that can run and actually complete tasks. Now, our next speaker is from Google. He's an AI PM in Google Labs, and he works on Jules, one of the latest coding agents. He's going to speak to us about asynchronous coding agents: as we move from a world of coding copilots to autonomous agents, how do we delegate our workflow? What do we do when we have a bunch of these agents going? So, without further ado, I want to welcome Rustin Banks from Google to speak to us about Jules.

[Applause]

Awesome. Hi everyone, I'm Rustin.

I'm a product manager with Google Labs, and I'm really thrilled to be here and get to speak to you today. This is really like a dream come true. I'm an engineer at heart. This is my first compiler, Borland C++ 3.1. It came in the mail on ten 5¼-inch floppy disks; I ordered it from AOL classifieds. It was amazing. This is my bulletin board, which I hosted out of my parents' closet on salvaged computers. And I just think it's ironic that when I saw AI come out, I recognized the text-based interfaces perfectly from hosting bulletin boards. Then, when I saw this, like many of you, I dedicated my career to AI coding. This is ChatGPT 3.5. Isn't it crazy how slow this is? And this used to be state-of-the-art only two years ago. It's pretty amazing.

Right now, I'm the product manager for Jules. Jules is an asynchronous coding agent meant to run in the background, in parallel, and do all those tasks that you don't want to do. We launched it just two weeks ago at I/O, to everyone, everywhere, all at once, for free, while Josh was up on stage trying to demo other Google Labs products. So he called us, and we said, "Oh, we've got to shut it down" so that he could demo the other products, and luckily we got it back up and going. It was a super exciting launch, and the best part was seeing the use cases, because this is what we really want to solve: we want to do the laundry, so to say, so that you can focus on the art of coding.

So the next time Firebase updates their SDK, Jules can do that for you. Or if you just want to develop from your phone, Jules can do that for you. In the last two weeks, we've had 40,000 public commits, and we're super excited about what we can bring to the open-source world. But as developers, we're trained to think serially: we take a task from the queue, we work on it, we go on to the next one. That's our default workflow. Today, we'll learn how to maximize parallel agents. I'll try a real-world demo, we'll go through a real-world use case, and then I'll go through some best practices we've learned from watching people use Jules.

For this parallel process to really work well, we need to get better with AI at the beginning and the end of the workflow. If it's on me to just write a bunch of tasks all day, that's not fun. And if I'm reviewing PRs and handling merge messes at the end of the day, that's not going to work well either. Luckily, help is on the way. For example, AI can easily work through backlogs and bug reports to create tasks with you. And at the end of the SDLC, we can use critic agents and merging agents that bring everything together, so that this parallel workflow we've envisioned can really come together and not drive us crazy.

Remote agents are uniquely suited for this. Agents inside our IDE are always going to be limited by our laptop. When you have remote agents in the cloud, essentially agents as a service, they're infinitely scalable, they're always connected, and you can develop from anywhere, on any device. We've seen two types of parallelism emerging. The first is the type we expected: multitasking. I have ten different things on my backlog, let's do them all at once, then merge them together and test them.

Interestingly, you saw an example of the second type this morning, with Solomon from Dagger showing how he wanted three different views of his website at the same time. This was the emergent behavior we didn't expect: multiple variations. We see users taking a task, especially a complex one, and saying, "Try it this way, try it that way, give me this variation or multiple variations to look at." Then you can test and choose, and either the agents or the user can pick the best one. For example, we see lots of people working on a front-end task in a React app saying, I'm adding drag and drop: maybe try it using react-beautiful-dnd, or maybe use dnd-kit, or maybe try it test-first. In this parallel, asynchronous environment, you can just spin up multiple agents at the same time; they each try it, they come back together, you choose the best one, and you're off to the races.
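The fan-out-and-pick-the-best pattern can be sketched with a thread pool: launch one worker per variation, score each result, and keep the winner. The scoring numbers and the `attempt` stand-in below are made up for illustration:

```python
# Run several task variations in parallel and keep the highest-scoring
# one, e.g. the implementation with the best test coverage.
from concurrent.futures import ThreadPoolExecutor

def attempt(variant: str) -> dict:
    # Stand-in for "spin up an agent with this approach"; here the
    # score is a canned coverage number per variant.
    coverage = {"react-beautiful-dnd": 0.71,
                "dnd-kit": 0.80,
                "test-first": 0.76}[variant]
    return {"variant": variant, "coverage": coverage}

variants = ["react-beautiful-dnd", "dnd-kit", "test-first"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(attempt, variants))

best = max(results, key=lambda r: r["coverage"])
print(best["variant"])
```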

Okay, demo time. I'll exit out of this for the demo. I'm going to use the conference schedule website. And swyx, for all his skills, as you can see, has probably not spent a lot of time designing the schedule website; anytime there's a horizontal scroll bar, we know that's a problem. But luckily they knew that, and they said, we're just going to publish the JSON feed and let hackers hack, let engineers do what we do and build from it. So Palv, who is here, built this amazing conference site where you can favorite things and bookmark things, and it's what I use to keep track of my sessions for the conference. I messaged him and said, "Hey, can I clone this and use it as an example for Jules?" And Palv said, "Oh, yeah, sure. Actually, I was sitting in my last session, on my phone, and I fixed a bug using Jules." So I thought that was perfect.

This is how I would start something like this: I'd go into Linear and say, okay, the first thing we need, we just heard Scott talk about it, is a way to know that if this parallel agent does a bunch of things at the same time, it's getting them right. So first we're going to add some tests, and I'm going to kick that one off while I'm still thinking about it. Then, using that idea of multiple variations, I'm going to say: add them with Jest, and add them with Playwright, at the same time. We'll look at the test coverage and choose the one with the best coverage. Once that's done, I can go to the other mode of parallelism and say: I'd like a link to add a session to my Google Calendar, and I'd like an AI summary when I click on a description.

Those are all features, but what I'm really excited about is AI doing the stuff we never seem to get to, such as accessibility audits and security audits, all those things that go on the backlog but are really important. I'm super excited for AI to do that, so we're also going to have it do an accessibility audit and improve our Lighthouse scores at the same time. This is mostly a front-end demo because, well, I'm mostly a front-end engineer and it's a better visual representation, but we've seen all of this applied to the back end as well.

Okay, so here's Jules. We told it to add tests in the Jest framework. It connects to my GitHub, all my GitHub repos, and it's going to give me a plan. That looks about right: I can see it's going to test the calendar, the search overlay, the sessions. That sounds great, so I approve the plan. Jules now has its own VM in the cloud. It's cloned my whole codebase, and it can run all the commands that I can run. Importantly, once it has these tests, it can run them, so when we add a new feature it knows whether it got things right. I'm going to fast-forward a little bit here.

So this is adding the Jest tests. You can see all the components it's added to the test suite. It's also added to the README, so the next time it goes to add something, it'll look at the README and remind itself: oh, this is how I run the tests. Let's see how it did on test coverage. Estimated test coverage looks like about 80%, so that's pretty good. We could compare that with the Playwright version and just choose the one we like best. We merge that into main, and now we're off to the races.

Again, it's automatically integrated into GitHub. We merge that into main, and now we can start saying: okay, now I want a calendar link. So I ask for a calendar button, and Jules works on that. Sure enough, it ran the tests, the tests didn't pass the first time, it made some changes, and now the tests pass. I can review this code, and eventually I could look at it in Jules's browser, but I feel pretty confident about shipping this knowing that all the tests pass.

Similarly for the Gemini summaries: when I click on a description, I get a Gemini summary. I put this one in an emulated mobile view just to show that I could have done it from my phone. So this is running the accessibility audit and fixing issues from my phone; never mind the console errors, Jules is going to fix those. Then I can go back, and now we have this big merge we need to do. To be honest, I ran out of time to finish the merge, and Jules should help me with it. It's called an octopus merge, so surely Jules, as a squid, should help with an octopus merge. But let's just check out our add-to-calendar button, go back to localhost, and refresh. Now I have a calendar button. Let's test it: let's add this to my calendar to make sure I know to come to my own talk. And there it is, it's on my calendar. I could then pull this back into the main branch, and now everybody at the conference can add sessions to their Google Calendar, along with everything else we saw: a full test suite, the accessibility audits, a Lighthouse score improvement. That took me about an hour, managing the parallel process in the background.

Okay, so in summary: the secret to working in parallel is a clear definition of success, because nobody wants to review PRs all day. Before you get started, think: how am I going to easily verify that this works? Again, Scott hit on this as well. Create an agreement with the agent: tell it, don't stop until you see this, or don't stop until this works. And then you need a robust merge-and-test framework at the end to put everything back together, and help is coming there.

coming. This is how I prompt for Jules.

I give it a brief overview of the task.

I tell it when it will know what it got right, any helpful context, and then I'll at the end I'll append a simple broad approach and then I'll change that last line maybe two or three times

depending on the complexity of the task.

So, for example, if I need to log a number from a web page every day, I'll say: today the number is X, so log the number to the console, and don't stop until the number is X. That was a simple test that I wrote in; it'll keep going. I give it helpful context, like the search query. Then I'll say use Puppeteer, and then I'll clone that task, because I can, it's in the cloud, and I'll say use Playwright.
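That four-part structure (overview, success criterion, context, approach) can be sketched as a tiny script. All the field values below are hypothetical stand-ins; only the final approach line changes per clone, as described:

```shell
# Sketch of the four-part prompt structure described above; every
# field value here is a hypothetical example, not a real task.
overview="Log the daily number from the stats page"
success="the console shows the number; don't stop until it matches"
context="the search query that finds the page"

# Clone the task, changing only the last (approach) line.
for tool in puppeteer playwright; do
  printf 'Task: %s\nDone when: %s\nContext: %s\nApproach: use %s\n\n' \
    "$overview" "$success" "$context" "$tool"
done
```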

So again, have an abundance mindset. We're used to working on a single thing at a time; easy verification makes it so we can now work on multiple things at the same time. Try lots of things. As we saw this morning, look at different variations. With a parallel process, we now have the ability to try things we would never have tried before. Let AI help with those bookends: the task creation, and then the merge and test part. And context: keep using MD files or links to getting-started documentation. The more context the better. We tell people to just throw everything in there; Jules and other agents are pretty good at sorting out which context is important. So more context is better at this point, but maybe that's just for the Gemini models, which I should have mentioned: Jules is powered by Gemini 2.5 Pro. Quick shout-out: thank you, team Jules. Couldn't have done any of this without you. If you have any questions, you can DM me. I'm Rustin Banks, Rustin B on X. Thanks, everybody.

Awesome. Always good to hear from one of the latest coding agents, and it's always great to get a refresher; you know, even I don't know how to prompt these things, but I'm liking this flow. We started off with Cognition and Devin, one of the first proper SWE agents. Then we heard from one of the latest, Google's Jules. Let's take it back again. Let's hear from GitHub, one of the very first coding copilots, right? So, let's hear about the future, and, you know, how we should still think about GitHub Copilot. So, without further ado, I want to welcome Christopher Harrison to the stage to tell us about GitHub Copilot.

All right, let's get right on into it. So, my name is Christopher Harrison. I'm a senior developer advocate at GitHub, primarily focused on this little thing called developer experience, or as all the cool kids like to call it, DevX, and on GitHub Copilot. So, let's talk about the past, the present, and the future of GitHub Copilot. Oops.

Um, actually it's not picking up at all. Oh, there we go. Let me start mirroring. There we go. Cool. Look at that.

Okay, so let's get on into it. Where we started was with code completion. With code completion, I'm a developer, I'm in the zone, type type type, and then Copilot suggests the next line, the next block, the next function, potentially even the next class. And this is wonderful for giving that just-in-time, inline support to our developers.

But as we all know, the tasks we're completing go beyond just writing a few lines of code. I need to be able to explore, I need to be able to ask questions, and I need to be able to modify multiple files. And this is where chat comes into play. We started off with chat by supporting ask mode, where I could go in and ask questions or ask Copilot to generate an individual file for me. Then we expanded this out to edit mode. With edit mode, I can drive Copilot as it modifies multiple files, because even the most basic of updates, say updating a web page, requires updating my HTML, my CSS, and my JavaScript: three files. With edit mode, I can do that very quickly, and again, right inside of chat. Then we get into agent mode. And agent mode really shifts things, because unlike chat, where I'm going in and asking questions and pointing it at the files I want modified, with agent mode it's able to perform those operations on my behalf. And on top of that, it's going to behave an awful lot like a developer: it will go in, do a search, find what it needs to do, perform those tasks, and then even be able to perform external tasks as well.

So, it could run tests, detect that maybe those have failed, and then even self-heal. So, I have an application here, and I want to create a couple of new endpoints. The first thing I'm going to do is add in a little bit of context. Instruction files allow me to give Copilot a little additional information about what I'm doing and how I want it done, and I have an instruction file specific to my endpoints. Now, this is definitely one of those scenarios where agent mode could figure this out on its own. But, as I like to say, don't be passive-aggressive with Copilot: if there's a piece of information that's important, that you want it to consider, go ahead and tell it. It might figure it out on its own, but this is certainly going to make life easier.
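For reference, repository-wide Copilot instructions are plain Markdown checked into the repo. The path below is the documented repo-wide location; the conventions themselves are invented for illustration, and a real instruction file would describe your own endpoint rules:

```shell
# Create a hypothetical Copilot instruction file. The path is the
# documented repo-wide location; the rules are invented examples.
mkdir -p .github
cat > .github/copilot-instructions.md <<'EOF'
# Endpoint conventions
- New endpoints live in their own module under routes/ and return JSON.
- Every endpoint gets list and get-by-id tests; all tests must pass.
EOF
cat .github/copilot-instructions.md
```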

So now that I've added this in, I'm going to say: create endpoints to list the publishers and get a publisher by ID, create the tests, ensure all tests pass. And then hit send. Now, I'm doing a live demo with AI, so we're going to see what happens here. There's a chance it will fail. There's a chance it will fail spectacularly. But there's also a really good chance everything's going to succeed, and that's the part I'm hoping for. Now, if I take a look at what Copilot's doing here, I'll see that, as highlighted, it's behaving an awful lot like a developer: it tells me what it's going to do. It's going to create the endpoints to list all the publishers and get a publisher by ID. So the first thing it does is explore the project and figure out what's going on. Then it's going to create the endpoint, then create the tests, and then make sure everything works correctly. If I keep scrolling down, I'll notice it's searching through my codebase, because if I were tasked as a developer to perform this, that's the first thing I would do, and that's exactly what Copilot is doing. It created my publishers.py file. It looked for routes matching publishers. And now it's going to create the endpoints. And if I stall for just a moment longer and move my mouse to make it go faster... see, it worked.

We're going to notice that it now generates that publishers.py file. And one big thing you'll notice is I've got these great keep and undo buttons here, because I always like to highlight the fact that AI does not change the fundamentals of DevOps. If I think about how I wrote code before AI: some of it came off the top of my head, some was based on existing code, and some was copied and pasted from Stack Overflow, with a couple of changes made while I crossed my fingers and hoped it worked. Maybe that was just me.

Um, and to help ensure that all of the code I commit to our codebase is secure and written the way we want it written, we had code reviews, we have linters, we have security checks. And we're going to do all of that exact same thing even when we introduce AI. So this keep and undo allows me to very quickly confirm that yes, everything looks good, and if it doesn't, to undo it. You'll also notice history buttons up here that let me act iteratively, because again, when I'm working with AI, I'm not necessarily going to get perfect code the first time. So I can work back and forth. I can say, hey, this looks good, but I want to do this. Maybe I want the buttons to look blue, or whatever it is. And then highlight that. So, what I now see is that it created all my files, updated a couple of items, and now it can run my tests. And this is going to be one of those rare moments where I'm kind of hoping that it fails, because I want to see it recover for me. So, you'll notice that it ran my four tests and everything succeeded. Shucks. And now it's going to go ahead and continue to iterate from there. So what we see with agent mode is Copilot driving the way in writing my code. But I always want to highlight the fact that I, as the developer, am still in charge. Now, the one catch with agent mode is that it lives inside my IDE, and it's still, well, single-threaded. It's synchronous.

This is where we come into coding agent.

And with coding agent, this is completely asynchronous, and it runs on the server. So, let me kick over to an example that I ran earlier this morning, where I have an issue that's been created saying: add, edit, and delete endpoints. Now, I'm going to real quick unassign Copilot, just so I can kick off the workflow and we can see this in action. I'm going to let those cute little eyeballs go away here.

There we go. And let's go back in and hit reassign. By assigning Copilot here, I've now kicked off the coding agent. I can see the little eyeballs, and that indicates to me that Copilot is hard at work. If I scroll down, I'll see a brand-new pull request that's been made, and this is what Copilot will use to keep me updated on the work it's performing. If I scroll down just a little more, I'll also see a little view session button. If I hit this view session button, I can see right here that it's telling me it's spinning up a development environment. And this raises a very big question: where is this running? How can I ensure this is being done securely? This is running inside of GitHub Actions. And if

you're not already familiar with GitHub Actions, it's our platform for automation. And in fact, I can configure the environment in which I want my coding agent to work by creating a specialized workflow exactly for that. That's what I see right here with this Copilot setup. And if I scroll down, I'll notice I've got steps to install Node, steps to install Python, and all the frameworks and all the libraries that we're going to be using.

Now, not only does this ensure Copilot is working in the environment I want it to work in, it also lets me highlight the fact that, by default, coding agent does not have access to any external resources. It's not able to call the internet; it's not able to call any external services. Now, if I do want it to be able to do that, I can configure MCP servers, and I can also add updates to my firewall, so I can punch a hole in the firewall and allow Copilot to access those external resources. But by default, it only has access to what I've configured, inside of that container.

In addition, because it's running inside of GitHub Actions, it's an ephemeral environment: it spins up a brand-new environment, and once its work is done, it deletes it. Continuing down the security path, let me kick back one page here.

And if I scroll down, you'll also notice that it's not even able to automatically kick off any workflows. I have a couple of workflows associated with this repository, for running my unit tests and my end-to-end tests, and by default it's not able to run those unless I go in and say yes. You'll also notice that the pull request it creates is in draft mode, and I have to go in and review it. Because, again, developers are in charge: just because we're introducing AI does not change the normal DevOps flow.

Now, if I take a look at the one that was created earlier, let me open that up. What I see is a pull request with a fantastic description of everything it has done. I can see the PR implements the missing CRUD operations; I can see it lists off all the different endpoints it created, the error handling, the testing, and the technical details. And I can also open up my session here and see all the tasks it performed. You'll notice again that it behaves an awful lot like a developer: it goes out, does searches through my codebase, determines what needs to be done, and then eventually performs the tasks.

And if I scroll all the way to the end here... where did... there it is. Perfect. What I can now see is a nice little summary down at the very bottom. If I scroll up, I should be able to see that it ran all my tests. Yep, I can see them right there, and in this case all 16 of those tests passed. So it created that PR, and then I decided, okay, all of that looked good to me, so I allowed it to run the actions. I can now see that it ran the unit tests and the end-to-end tests, and everything looks good. Then I could mark it ready for review and finalize the creation of it.

The last thing I want to highlight, and this leads into the security aspect but also brings me back to the developer aspect, is that it created a brand-new branch. In this particular case it's called copilot/fix-3. Where that came from is that the issue this was associated with was issue number three. Copilot will only have write access to that branch, and this branch behaves just like any other branch I might have. So if I clone the repository locally, I can go ahead and check out that branch. I've opened up the branch inside of GitHub here. Let me scroll down to my server. And if I scroll down inside of here (sound effects help, by the way), we'll notice that there is my update game. I think my create game was up here. Yep, there it is. And there it all is. But again, it's only inside of just that branch. That's the only place coding agent has write permissions to.

Now, this leads us into a very big question, which is: okay, that's wonderful, Christopher, you've created a little, kind of simple demo. You had to create a few Flask endpoints, and that's wonderful and all, but how about doing it in the real world? Well, one of the big tenets we have at GitHub is that we build GitHub on GitHub. And in fact, coding agent was built with the help of coding agent. You'll notice, when we take a look at the commits that found their way into coding agent, that coding agent itself was one of the most prolific committers,

and that coding agent not only created new features, it also helped address tech debt. And this is one of the biggest places where I personally see coding agent really shining, because I don't know of a single organization that doesn't have tech debt, that feels comfortable with the state of its backlog, that doesn't have a limitless number of items where they keep saying, yeah, that's great and all, but we just don't have the time. To kick through real quick, as I highlighted: that secure environment is a separate, ephemeral platform, all running inside of GitHub Actions, and you have the ability to customize it. Coding agent understands your repository and your GitHub context, so it has access to read your repository, it's able to read Copilot instructions, and it has access to Model Context Protocol, so it can make those external calls. It includes those safeguards: read-only access to your repository, the default firewall preventing external access, review before merge, and review before those actions run. So we continue to iterate on Copilot. We continue to look for new areas where Copilot can shine, to help streamline development and increase the productivity of developers.

Thank you.

[Applause] Awesome. Thank you, Christopher. We love hearing from, you know, some of the main players in the SWE agent space.

So it's always nice to hear from the big players. We want to continue this track with, you know, how do we actually take things to production. Our next speaker, Tomas, is going to talk to us about the outer loop: how do we deal with actually deploying and using these software engineering agents? How do we manage all of the CI/CD, the pipeline? How should we actually think about using these things? So I want to take a little bit of a break here in the talks and speak about what's actually going on. Innovation in SWE agents is moving at quite a rapid pace: we've had Jules, we've got Codex, we've got Claude Code. As we get more and more of these software engineering agents that really change the workflow of how we code, how do we handle actually deploying them? So we've got a next lineup of speakers who are going to talk a little bit more about this, and we just want to set the stage here. So, let's make it a little bit more interactive. How are we feeling about the track today?

Who's been a fan of Jules? I want to see what the major ones are. So, who here in the room has used Devin? Let's see, show of hands. Okay, we have a few Devin users. And what about Jules? How are we feeling about Google's Jules? Okay, same set of hands. How about Claude Code? We've got a speaker from Claude Code coming later. Okay, more hands, but different hands. Seems like we've got a bit of differentiation there. What about OpenAI's Codex? Oh, another set of hands. So, interesting. You know, we've got different core copilots, and it seems like people use them differently. But we also like to see the other end of the spectrum, right? So we've got Devin. Who here is a fan of Devin, Devin from Cognition? Okay, so we've got another set of hands. And one thing to note is that we kind of have these different categories of agents, right? What about the human-in-the-loop, short-horizon copilots? Who here uses Cursor or Windsurf in their day-to-day coding tasks? Ah, a lot more hands.

So, it's an interesting sort of split, right? We've got these human-in-the-loop, short-horizon coding copilots, where we've got things like Cascade from Windsurf and Cursor's copilot, and kind of everyone's hand goes up, right? A lot of people are starting to use these copilots in their IDE. Then we take it to the next level: we've got the big players, where we have Claude Code, we have Jules, we have Codex. And an interesting note: everyone kind of has their own buckets; it's not the same hands that go up. That's one of the reasons why at the conference we like to invite speakers from everywhere. Now, the third camp, you know, Devin, is another way to think about it: as you have longer-horizon agents, how do we deal with those? And that's kind of where we're taking the second half of the day in SWE agents. We want to talk about how we take these things to production. How do we actually deploy these? And to bring that up, I want to invite our next speaker, Tomas Ramirez, to tell us a little bit about this. He's from Graphite. So without further ado, let's welcome Tomas.

Thank you so much.

Perfect. Hello, everyone.

Um, see, nope, no need for either of those. Thank you so much. And then slides. Looking good.

Cool. Perfect. Awesome. Hi everyone, my name is Tomas. I'm one of the co-founders of Graphite. Graphite is an AI code review company. To give some context on where we see the industry right now and where we see it going: software development currently has, and has always had, two loops. The inner loop, which is focused on development, and the outer loop, which is focused on review. Developers spend time in the inner loop. They get their code working, they get the feature the way they want it, and then they move it to the outer loop, where it's tested, reviewed, merged, deployed.

We're seeing the inner loop change right now more than we've ever seen it change. More developers are using AI than ever. Right here we have some statistics from the GitHub developer survey: nearly every developer surveyed used AI tools both inside and outside of work, and 46% of the code on GitHub is being written by Copilot.

We're seeing more and more code being written by AI. Here we have some statistics around how code has changed over time, and how some people predict it will change. And even if we take a more pessimistic view of that, we still see the world going toward more and more code being written by AI. The inner loop is changing: AI is making developers more productive, and developers are now producing higher volumes of code. But that code still needs to be reviewed.

When we first started looking at this, when we first started building Diamond, our AI code reviewer, about a year ago now, we read a lot of articles that scared us. We were seeing, within our own organization, a lot of developers adopting AI tools. But we were also seeing a problem: AI can hallucinate. It can make mistakes. And, almost more scarily, it can introduce security vulnerabilities.

For us, what we saw was that while the inner loop was getting sped up by AI, the outer loop was rapidly becoming the bottleneck. We were seeing tools like Cursor, Windsurf, Copilot, v0, Bolt, all producing larger volumes of code than we were used to, than we had ever seen before. But we were also seeing our developers suddenly having to review, test, merge, and deploy higher volumes of code. That's what brought us to say: there has to be a new outer loop here. The way things are going, this isn't going to work; it's going to break down. We're watching the problems that used to only assail large companies start to assail all companies, as companies deal with higher and higher volumes of code.

The requirements for the new outer loop, then, look a lot like the problems larger companies have always had to deal with. You need tools to better prioritize, track, and get notified about pull requests. You need driver-assist features to help reviewers focus and to streamline the code review process. You need optimized CI pipelines and merge queues to handle the sheer volume of code changes that are now happening, and you need better deployment tools.

When we first started looking at this through an AI-first lens, we started to see that while these problems are being created by AI, they can probably also be solved by AI. We can probably start to streamline a lot of these processes, which had previously been manual, parts of the process that developers did not enjoy and did not want to do. We wanted to see self-driving code review solutions, where we no longer had to do those very manual and painful parts of review, and could instead focus on what matters most to developers: making sure your product can get out to users and that the features work as expected. We were seeing that AI-generated feedback wasn't perfect, and because of that, we were starting to think that bots weren't enough. An early vision of ours was: well, can we solve this by just adding AI teammates? Maybe it's background agents, maybe it's reviewers, maybe it's a whole lot of teammates added to the workflow. And while we think that's part of the story, we don't think it's enough. We think, as we've built with Diamond, that your entire toolchain has to be AI-native, not just your IDE. If you're really going to embrace AI in this age of development, if you're going to accept that developers are going to be orders of magnitude more productive than they ever have been before, you need tooling

that reflects that. We started by building Diamond, the winning AI code review platform: high signal, low noise, with a deep understanding of the codebase and change history. We summarize, prioritize, and review each change, and we integrate with your CI and your testing infrastructure to summarize errors and correct failures.

Our hope with it, and what we've started to see as we've rolled it out to larger and larger customers and enterprises, is that we reduce code review cycles, we enforce quality and consistency, and we keep your code private and secure. It's high signal, it's zero setup, it's actionable with one-click suggestions, and it's customizable. It's already being used by some of the fastest-moving companies in the world, and it's expanding a lot more than we can even say publicly. I hope that you all will embrace the idea that AI can change your entire developer workflow, not just your IDE.

By the numbers: we see the comments our AI bot leaves get downvoted at less than a 4% rate, and get accepted, meaning integrated into the pull request they were left on, at a higher rate than human comments. Human comments are integrated somewhere between 45 and 50% of the time; we're watching our Diamond comments be accepted about 52% of the time. We've spent a lot of time tuning that; that number is actually new as of March for us. That's what I have to tell you about Graphite, and about Diamond. I hope you give it a shot, and thanks for having me.

[Applause] Awesome. Thanks again, Tomas, for such a great talk. We want to thank everyone for coming out to the SWE agents track. We're going to take a short break; lunch is going to be served here in the halls, and the expo will be open. But, you know, without further ado, we're very happy to announce that in the evening we have four more fully packed sessions. I think we are the only track that is fully booked, so we've got all eight speakers. We're going to have a great round of speakers coming up soon, so feel free to come back here later. We're going to kick off with a talk from Claude Code: how do they think about building Claude Code, how to use it, how to delegate. We'll have that later, back here in the keynote session. But for now, please feel free to enjoy lunch and check out the expo hall as we take a little break. Thank you.

[Music]

What's up? Welcome back, everyone. Let's give it up for the SWE agents track. This is the most packed track. We have four more amazing speakers for you. Let's hear it for our SWE agent speakers. Awesome. We're going to kick off talking about Claude Code, and then follow that up with OpenDevin. I want to cut my MCing short and give the speakers their time, but we have a special little announcement: we never do Q&A, but for our first talk, for Claude Code, we're going to do a bit of a presentation and then a bit of a Q&A session. Keep your questions short, 5 to 10 words, and ask something interesting. Think of your question. But without further ado, I want to invite Boris Cherny from Anthropic up to the stage. Think of a question. I'll be back for Q&A. [Applause]

Hello. This is awesome. This is a big crowd. Who here has used Claude Code before? Jesus. Awesome. That's what we like to see. Cool. So, my name is Boris. I'm a member of technical staff at Anthropic and the creator of Claude Code. I was struggling with what to talk about for an audience that already knows Claude Code, already knows AI and all the coding tools and agentic coding. So I'm going to zoom out a little bit, and then we'll zoom back in.

Here's my TL;DR. The model is moving really fast. It's on an exponential. It's getting better at coding very, very quickly, as everyone that uses the model knows. And the product is kind of struggling to keep up. We're trying to figure out what product to build that's good enough for a model like this. We feel like there are so many more products that could be built for models that are this good at coding, and we're kind of building the bare minimum. I'll talk about why. With Claude Code, we're trying to stay unopinionated about what the product should look like, because we don't know.

So for everyone that didn't raise your hand, I think that's like ten of you, this is how you get Claude Code. You can head to claude.ai/code to install it, or you can run this incantation to install from npm. As of yesterday, we support the Claude Pro plan, so you can try it on that. We support Claude Max too. So yeah, just try it out and tell us what you think.

So, programming is changing, and it's changing faster and faster. If you look at where programming started, back in the 1930s and '40s there were switchboards. It was this physical thing; there was no such thing as software. Then sometime in the 1950s, punch cards became a thing. My grandpa was actually one of the first programmers in the Soviet Union, and my mom would tell me stories about how, when she grew up in the 1970s, he would bring these big stacks of punch cards home from work, and she would draw all over them with crayons. That was growing up for her, and that's what programming was back in the 1950s, '60s, even '70s.

But sometime in the late '50s, we started to see these higher-level languages emerge. First there was assembly, so programming moved from hardware to punch cards, which is still physical, to software. And then the level of abstraction just went up. We got COBOL. Then we got typed languages. We got C++. In the early '90s there was this explosion of new language families: the Haskell family, JavaScript and Java, the evolution of the C family, and then Python. And I think nowadays, if you kind of squint, all the languages sort of look the same. When I write TypeScript, it kind of feels like writing Rust, and that kind of feels like writing Swift, and that kind of feels like writing Go. The abstractions have started to converge a bit.

The UX of programming languages has also evolved. Back in the 1950s, you used something like a typewriter to punch holes in punch cards, and that was programming back in the day. At some point text editors appeared, and then Pascal and all these different IDEs appeared that let you interact with your programs and your software in new ways, and each one brought something. I feel like programming languages have sort of leveled out, but the model is on an exponential, and the UX of programming is also on an exponential. I'll talk a little bit more about that.

Does anyone know what the first text editor was? Okay, I heard Ed from someone. I think you read the screen. Well, before text editors, this is what programming looked like, real quick. This was the IBM 029. This was top of the line, the MacBook of its time for programming punch cards. You can still find it in museums somewhere.

And yeah, this is Ed, the first text editor. Ken Thompson at Bell Labs invented this, and it kind of looks familiar. If you open your MacBook, you can actually still type ed; it's still distributed as part of Unix systems. And this is crazy, because this thing was invented like 50 years ago, and it's nuts: there's no cursor, there's no scrollback, there are no fancy commands, there's no type-ahead, there's pretty much nothing. This was the simple text editor of the time, and it was built for teletype machines, which were literally physical machines that printed your program on paper. This is the first software manifestation of a UX for programming software.

It was really built for these machines that didn't support scrollback or cursors or anything like that. For all the Vim fans, I'm going to jump ahead of Vim. Vim was a big innovation, and Emacs was a big innovation around the same time. In 1980, Smalltalk-80 was a big jump forward. This was, I think, the first graphical interface for programming software. And for anyone that's tried to set up live reload with React or Redux or any of this stuff: this thing had live reload in 1980, and it worked, and we're still kind of struggling to get that to work with React nowadays. So this was a big jump forward, and obviously the language had object-oriented programming and a bunch of new concepts, but on the UI side there were a lot of new things too.

In '91, I think, Visual Basic was the first code editor that brought a graphical paradigm to the mainstream. Before that, people were using text-based editors; Vim and things like that were still very popular despite things like Smalltalk. But this kind of brought it mainstream. This is what I grew up with.

Eclipse brought type-ahead to the mainstream. This wasn't AI type-ahead; this is not Cursor or Windsurf. It was just static analysis: it indexes your symbols, and then it can rank and rerank them, and it knows what symbols to show. I think this was also the first big third-party ecosystem for IDEs. Copilot was a big jump forward, with single-line type-ahead and then multi-line type-ahead. And I think Devin was probably the first product that introduced the next concept and the next abstraction to the world, which is that to program, you don't have to write code. You can write natural language, and that becomes code. This is something people have been trying to figure out for decades, and I think Devin is the first product that broke through and took it mainstream. The UX has evolved quickly, but I think it's about to get even faster.

We talked about UX and we talked about programming languages, and verification is a part of this too. Verification started with manual debugging and physically inspecting outputs. Now there's a lot of probabilistic verification, like fuzzing and vulnerability testing and Netflix's chaos testing and things like that.

And so, with all this in mind, Claude Code's approach is a little different: start with a terminal, and give you as low-level access to the model as possible, in a way where you can still be productive. We want the model to be useful for you, but we also want to be unopinionated and get out of the way. So we don't give you a bunch of flashy UI, and we don't try to put a bunch of scaffolding in the way. Part of this is that we're a model company at Anthropic; we make models, and we want people to experience those models. But another part is that we actually just don't know what the right UX is. So we're starting simple. Claude Code is intentionally simple and intentionally general. It shows off the model in the ways that matter to us, which is that it can use all your tools and it can fit into all your workflows, so you can figure out how to use the model in this world where the UX of using code and using models is changing so fast. And so this is my second point.

The model just keeps getting better, and this is the bitter lesson. I have it framed and taped to the side of my wall, because the more general model always wins, and the model increases in capability exponentially. There are many corollaries to this: everything around the model is also increasing exponentially, and the more general thing, even around the model, usually wins.

So with Claude Code there's one product, and there are a lot of ways to use it. There's the terminal product, which is the thing everyone knows: you install Claude Code, and then you just run claude in any terminal. We're unopinionated, so it works in iTerm2, it works in WSL, it works over SSH and tmux sessions, it works in your VS Code terminal and your Cursor terminal. It works anywhere, in any terminal.

When you run Claude Code in the IDE, we do a little bit more. We kind of take over the IDE a little bit: diffs, instead of being inline in the terminal, are big and beautiful and show up in the IDE itself, and we also ingest diagnostics. So we try to take advantage of that. You'll notice this isn't as polished as something like Cursor or Windsurf; those are awesome products, and I use them every day. This is to let you experience the model in a low-level, raw way, and this is sort of the minimum we had to do to let you experience that.

experience that. We announced a couple weeks ago

that. We announced a couple weeks ago that you can now use Claude on GitHub. Can I get a show of hands who's

GitHub. Can I get a show of hands who's who's tried this already?

So for everyone that hasn't tried this, all you have to do is you open up Claude, you run this one slash command, install GitHub app, you pick the repo, and then you can run Claude in any repo.

Um, this is running on your compute. Um,

your data stays on your compute. It does

not go to us. Um, so it's it's kind of a nice experience and it lets you use your existing stack. You don't have to change

existing stack. You don't have to change stuff around. Takes a few minutes to set

stuff around. Takes a few minutes to set up. And again, here we intentionally

up. And again, here we intentionally built something really simple because we don't know what the UX is yet. And this

is the minimal possible thing that helps us learn but also is useful for engineers to do day-to-day work like I use this every day. The extreme version of this is our

The extreme version of this is our SDK. This is something you can use to build on Claude Code: if you don't want to use the terminal app or the IDE integration or GitHub, you can just roll your own integration and build it however you want. People have built all sorts of UIs and all sorts of awesome integrations, and all this is, is you run claude -p and you can use it programmatically. Something I use it for, for example, is incident triage: I'll take my GCP logs and pipe them into claude -p, because it's a Unix utility, so you can pipe in and pipe out, and then I'll jq the result. It's kind of cool. This is a new way to use models, and it's maybe 10% explored; no one has really figured out how to use models as a Unix utility.
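The pipe-in, pipe-out pattern he describes can be sketched in Python. The stand-in filter below is a portable demo of the same idea; in practice you would swap in a command like `["claude", "-p", "summarize these errors"]` (the log filename in the comment is made up for illustration):

```python
import subprocess
import sys

def pipe_through(cmd, text):
    """Use a CLI as a Unix filter: write `text` to its stdin, return its stdout."""
    return subprocess.run(cmd, input=text, capture_output=True,
                          text=True, check=True).stdout

# Stand-in filter that uppercases stdin; with Claude Code you'd instead run
# something like: pipe_through(["claude", "-p", "triage these errors"], logs)
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(pipe_through(upper, "error: db timeout\n"))  # ERROR: DB TIMEOUT
```

Because the model sits behind ordinary stdin/stdout, anything downstream (jq, grep, another script) composes with it like any other Unix tool.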

This is another aspect of code as UX that we just don't know yet. So again, we just built the simplest possible thing, so we can learn and so people can try it out and see what works for them.

Okay, I wanted to give a few tips for how to use Claude Code. This is a talk about Claude Code, so this is kind of zooming back in. This is actually true for a lot of coding agents, I think, but it's drawn from the way I personally use Claude Code. It seems like most of this room is very familiar with Claude Code and similar coding agents, but the simplest way to introduce new people that have not used this kind of tool before is codebase Q&A. At Anthropic, we teach Claude Code to every engineer on day one, and it's shortened onboarding times from two or three weeks to maybe two days. I also don't get bugged with questions anymore; people can just ask Claude. And honestly, I'll just ask Claude too. This is something I do pretty much every Monday: we have a standup every week, and I'll just ask Claude what I shipped that week. It'll look through my git commits and tell me, so I don't have to keep track.

The second thing is: teach Claude how to use your tools. This is something that has not really existed before, when you think about the UX of programming. With every IDE there's sort of a plug-in ecosystem. For Emacs, there's this Lispy dialect that you use to make plugins. If you use Eclipse or VS Code, you have to make plugins. With this new kind of coding tool, it can just use all your tools. You give it bash tools, you give it MCP tools. Something I'll often say is: here's the CLI tool, run --help, take what you learn, and put it in the CLAUDE.md. And now Claude knows how to use the tool. That's all it takes. You don't have to build a bridge, you don't have to build an extension; there's nothing fancy like that. Of course, if you have groups of tools, or if you have fancier functionality like streaming, you can use MCP as well.
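For example, a CLAUDE.md entry capturing what the model learned from a CLI's --help might look like this (the tool name and commands here are invented for illustration, not a real project):

```markdown
# CLAUDE.md (repo root)

## Tools
- Deploy with `mytool deploy --env staging`; never deploy straight to prod.
- `mytool logs --tail 100` shows recent service logs.

## Conventions
- Run the test suite before committing.
```

Once this file is in the repo root, it gets pulled into context automatically, so the next session already knows how to drive the tool.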

Traditional coding tools focused a lot on actually writing the code, and the new kinds of coding tools do a lot more than that. I think this is a lot of where people that are new to these tools struggle to figure out how to use them. So there are a few workflows that I've discovered for using Claude Code most effectively myself. The first one is: have Claude Code explore and make a plan, and run it by me, before it writes code. You can also ask it to use thinking. Typically we see extended thinking work really well if Claude already has something in context. So have it use tools, have it pull things into context, and then think. If it's thinking up front, you're probably just wasting tokens, and it's not going to be that useful. But if there's a lot of context, it helps a bunch.

The second one is TDD. I try to use TDD, and it's pretty hard to use in practice, but I think with coding tools it now actually works really well. Maybe the reason is that it's not me doing it, it's the model doing it. The workflow here is: tell Claude to write some tests, describe them, and make it really clear that the tests aren't going to pass yet, because otherwise it's going to try to run them. Write the tests first, commit, then write the code, then commit. And this is a general case of: if Claude has a target to iterate against, it can do much better. So find some way to verify the output: a unit test, an integration test, a way to screenshot in your iOS simulator, a way to screenshot in Puppeteer, just some way to see its output. We actually did this for robots: we taught Claude how to use a 3D printer, and it has a little camera to see the output. If it can see the output and you let it iterate, the result will be much better than if it couldn't iterate. The first shot will be all right, but the second or third shot will be pretty good. So give it some kind of target to iterate against.

Today we launched plan mode in Claude Code, and this is a way to do the first kind of workflow more easily. Anytime, hit shift-tab and Claude will switch to plan mode. You can ask it to do something, but it won't actually do it yet; it'll just make a plan and wait for approval. Restart Claude to get the update, then hit shift-tab.

Okay. And then the final tip is: give Claude more context. There are a bunch of ways to do this. CLAUDE.md is the easiest way: take this file called CLAUDE.md and put it in the root of your repo. You can also put it in subfolders; those will get pulled in on demand. You can put it in your home folder, and that will get pulled in as well. You can also use slash commands: if you put files, just regular markdown files, in the special .claude/commands folder, they'll be available under the slash menu. Pretty cool. This is useful for reusable workflows. And then, to add stuff to CLAUDE.md, you can always type the pound sign to ask Claude to memorize something, and it'll prompt you for which memory it should be added to. You can see this is us trying to figure out how to use memory, how to use this new concept that is new to coding models and did not exist in previous IDEs, how to make the UX of this work. And you can tell this is still pretty rough. This is our first version, but it's the first version that works, and we're going to be iterating on it. We really want to hear feedback about what works about this UX and what doesn't.

Thanks.

[Applause] Thank you, Boris. Fortunately, we only have one minute left. So, someone sent a question on Slack. The question is: as I delegate more and more to Claude Code, as it runs for 10 minutes and I have 10 of these active, how do I use the tool? You've got 50 seconds.

[Laughter] Yeah, this is pretty cool. This is something that we actually see in a lot of our power users: they tend to multi-Claude. You don't just have a single Claude open; you have a couple terminal tabs, either with a few checkouts of your codebase, or it's the same codebase but with different worktrees, and you have Claude doing stuff in parallel. This is also a lot easier with GitHub Actions, because you can just spawn a bunch of actions and get Claude to do a bunch of stuff. Typically, you don't need to coordinate between these Claudes for most use cases. If you do want to coordinate, the best way is just to ask them to write to a markdown file. And that's it. Awesome. Yeah, the simple thing works. Thank you so much. And once again, give it up for Boris from Anthropic.

Very exciting to see such a packed room here. We're going to set up our next speaker, who is Robert Brennan from All Hands. He is the creator of, and the company behind, OpenDevin. A lot of what we see, you know, we've had talks from all the top SWE agents: we've had Jules here, we've got Claude Code, we have OpenAI's Codex, we have Devin. As people use more and more of these SWE agents, are we just adding tech debt, or are we actually 10x engineers? This is what Robert is going to discuss with us. I once again don't want to fill the stage, so let's hear it for Robert. [Applause]

Hey, folks. So today I'm going to talk a little bit about coding agents and how to use them effectively. If you're anything like me, you've found a lot of things that work really well and a lot of things that don't work very well.

A little bit about me: my name is Robert Brennan. I've been building open-source development tools for over a decade now, and my team and I have created an open-source software development agent called OpenHands, formerly known as OpenDevin.

So, to state the obvious: in 2025, software development is changing. Our jobs are very different now than they were two years ago, and they're going to be very different two years from now. The thing I want to convince you of is that coding is going away; we're going to be spending a lot less time actually writing code. But that doesn't mean that software engineering is going away. We're paid not to type on our keyboards but to think critically about the problems in front of us. If we do AI-driven development correctly, it'll mean we spend less time leaning forward and squinting into our IDE and more time sitting back in our chairs and thinking: what does the user actually want here? What are we actually trying to build? What problems are we trying to solve as an organization? How can we architect this in a way that sets us up for the future? The AI is very good at that inner loop of development: write code, run the code, write code, run the code. It's not very good at those big-picture tasks that have to empathize with the end user and take into account business-level objectives. And that's where we come in as software engineers.

So let's talk a little bit about what a coding agent actually is. I think this word "agent" gets thrown around a lot these days, and the meaning has started to drift over time. But at the core of it is this concept of agency: this idea of taking action out in the real world. And these are the main tools of a software engineer's job, right? We have a code editor to modify and navigate our codebase. We have a terminal to actually run the code that we're writing. And we need a web browser to look up documentation and maybe copy and paste some code from Stack Overflow. These are the core tools of the job, and these are the tools that we give to our agents to let them do their whole development loop.

I also want to contrast coding agents with some more tactical codegen tools that are out there. We kind of started a couple of years ago with things like GitHub Copilot's autocomplete feature, where, literally wherever your cursor is pointed in the codebase, it's just filling out two or three more lines of code. Over time, things have gotten more and more agentic, more and more asynchronous. We got AI-powered IDEs that can maybe take a few steps at a time without a developer interfering. And now you've got these tools like Devin and OpenHands where you're really giving an agent one or two sentences describing what you want it to do; it goes off and works for 5, 10, 15 minutes on its own and then comes back to you with a solution. This is a much more powerful way of working. You can get a lot done. You can send off multiple agents at once. You can focus on communicating with your co-workers, or goofing off on Reddit, while these agents are working for you. It's a very different way of working, but it's a much more powerful one.

So I want to talk a little bit about how these agents work under the hood. I feel like once you understand what's happening under the surface, it really helps you build an intuition for how to use agents effectively. At its core, an agent is a loop between a large language model and the external world. The large language model serves as the brain, and then we repeatedly take actions in the external world, get some kind of feedback from the world, and pass that back into the LLM. Basically, at every step of this loop, we're asking the LLM: what's the next thing you want to do to get one step closer to your goal? It might say: okay, I want to read this file, I want to make this edit, I want to run this command, I want to look at this web page. We go out and take that action in the real world, get some kind of output, whether it's the contents of a web page or the output of a command, and then stick that back into the LLM for the next turn of the loop.
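The loop he describes, ask the LLM for an action, execute it, feed the observation back, can be sketched in a few lines of Python. The `llm` and tool functions here are hypothetical stand-ins, not OpenHands' actual API:

```python
def run_agent(llm, tools, goal, max_steps=20):
    """Minimal agent loop: the LLM picks the next action, we execute it in
    the external world, and the observation goes back into the history."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = llm(history)            # e.g. {"tool": "terminal", "args": "pytest"}
        if action["tool"] == "finish":   # the model decides it is done
            return action["args"]
        observation = tools[action["tool"]](action["args"])
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    return None  # step budget exhausted

# Toy demo: a scripted "LLM" that runs one command, then finishes.
script = iter([{"tool": "terminal", "args": "echo hi"},
               {"tool": "finish", "args": "done"}])
result = run_agent(lambda history: next(script),
                   {"terminal": lambda cmd: "ran: " + cmd},
                   "say hi")
print(result)  # done
```

Real systems add tool schemas, streaming, and error handling on top, but the shape of the loop is the same.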

Just to talk a little bit about the core tools at the agent's disposal: the first one, again, is a code editor. You might think this is really simple; it actually turns out to be a fairly interesting problem. The naive solution would be to just give the old file to the LLM and then have it output the entire new file. That's not a very efficient way to work, though, if you've got thousands of lines of code and you want to change just one line; you're going to waste a lot of tokens printing out all the lines that are staying the same. So most contemporary agents use a find-and-replace type editor or a diff-based editor to allow the LLM to make tactical edits inside the file. A lot of times they'll also provide an abstract syntax tree, or some other way to allow the agent to navigate the codebase more effectively.
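A minimal version of that find-and-replace style editor might look like this (my own sketch, not OpenHands' actual implementation): the LLM emits only the old and new snippets, and ambiguous edits are rejected so the agent can retry with more context:

```python
def str_replace(source, old, new):
    """Apply an LLM-proposed edit: replace `old` with `new`, but only if
    `old` appears exactly once. Missing or ambiguous matches raise errors
    that go back to the agent as feedback for the next turn."""
    count = source.count(old)
    if count == 0:
        raise ValueError("old text not found; re-read the file and retry")
    if count > 1:
        raise ValueError(f"old text matched {count} times; add more context")
    return source.replace(old, new)

code = "def add(a, b):\n    return a - b\n"
fixed = str_replace(code, "return a - b", "return a + b")
print(fixed)
```

The exact-once rule is what makes this safe: the model only ever spends tokens on the lines that change, and a bad match becomes an observable error rather than a silent corruption.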

Next up is the terminal. Again, you would think text in, text out should be pretty simple, but a lot of questions pop up here. What do you do when there's a long-running command that has no standard output for a long time? Do you kill it? Do you let the LLM wait? What happens if you want to run multiple commands in parallel, or run commands in the background? Maybe you want to start a server and then run curl against that server. Lots of really interesting problems crop up when you have an agent interacting with the terminal.
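One common answer to the long-running-command problem (a sketch of the general pattern, not OpenHands' actual policy, and assuming a POSIX shell) is a soft timeout: kill the process, hand the agent whatever partial output exists, and tag the result so the LLM knows the command did not finish:

```python
import subprocess

def run_with_timeout(cmd, timeout=5.0):
    """Run a shell command; on timeout, kill it and return any partial
    output with status "timeout" so the agent can decide what to do next."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return {"status": proc.returncode, "output": out}
    except subprocess.TimeoutExpired:
        proc.kill()
        out, _ = proc.communicate()  # drain whatever was written so far
        return {"status": "timeout", "output": out}

print(run_with_timeout("echo hello"))
```

A "timeout" status in the observation lets the model choose between waiting longer, backgrounding the process, or trying a different command, rather than hanging the whole loop.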

And then probably the most complicated tool is the web browser. Again, there's a naive solution here, where the agent just gives you a URL and you give it a bunch of HTML. That's very expensive, because there's a bunch of cruft inside that HTML that the LLM doesn't really need to see. We've had a lot of luck passing it accessibility trees, or converting to markdown and passing that to the LLM, or allowing the LLM to scroll through the web page if there's a ton of content there. And then, if you start to add interaction, things get even more complicated. You can let the LLM write JavaScript against the page, or, and we've actually had a lot of luck with this, you can give it a screenshot of the page with labeled nodes, and it can say what it wants to click on. This is an area of active research; we just had a contribution about a month ago that doubled our accuracy on web browsing. I would say this is definitely a space to watch.
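Stripping a page down before it reaches the model can be as simple as flattening HTML to visible text with the standard library. This is a toy stand-in for the markdown and accessibility-tree conversion he describes, not what any production agent actually ships:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style cruft the LLM never needs."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def page_to_text(html_source):
    parser = TextExtractor()
    parser.feed(html_source)
    return "\n".join(parser.chunks)

page = ("<html><head><script>var x = 1;</script></head>"
        "<body><h1>Docs</h1><p>Install it.</p></body></html>")
print(page_to_text(page))  # "Docs" and "Install it." on separate lines
```

Even this crude pass removes most of the token overhead; accessibility trees go further by preserving roles and interactive elements, which matters once the agent starts clicking.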

I also want to talk about sandboxing. This is a really important thing for agents, because if they're going to run autonomously for several minutes on their own, without you watching everything they're doing, you want to make sure they're not doing anything dangerous. So all of our agents run inside a Docker container by default. They're totally separated from your workstation, so there's no chance of one running rm -rf on your home directory. Increasingly, though, we're giving agents access to third-party APIs. You might give one a GitHub token or access to your AWS account. It's super important to make sure those credentials are tightly scoped and that you're following the principle of least privilege as you grant agents access to do these things.
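As an illustrative sketch of what a locked-down container launch might look like (the flags are real Docker CLI options, but the specific limits here are hypothetical, not OpenHands defaults):

```python
import shlex

def sandbox_command(image, workdir, agent_cmd):
    """Build a restrictive `docker run` command for an agent task."""
    args = [
        "docker", "run", "--rm",
        "--network", "none",               # no outbound network at all
        "--memory", "2g", "--cpus", "1",   # bounded resources
        "--read-only",                      # immutable root filesystem
        "-v", f"{workdir}:/workspace:rw",   # only the task dir is writable
        "-w", "/workspace",
        image, "sh", "-c", agent_cmd,
    ]
    return " ".join(shlex.quote(a) for a in args)

print(sandbox_command("python:3.12-slim", "/tmp/task", "pytest -q"))
```

The credential point is the mirror image of this: rather than mounting your real `~/.aws` or a broad GitHub token into the container, you'd inject a short-lived, narrowly scoped token for just the one repo or resource the task needs.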

All right, I want to move into some best practices.

My biggest advice for folks who are just getting started is to start small. The best tasks are things that can be completed pretty quickly: a single commit, where there's a clear definition of done. You want the agent to be able to verify, okay, the tests are passing, I must have done it correctly, or the merge conflicts have been resolved, and so on. And they should be tasks that are easy for you as an engineer to verify were done completely and correctly. I like to tell people to start with small chores. Very frequently you might have a pull request where there's one test failing, or some lint errors, or merge conflicts: bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI. They tend to be very rote, and the AI does them very well. But as your intuition grows here, as you get used to working with an agent, you'll find you can give it bigger and bigger tasks, and you'll understand how to communicate with the agent effectively. I would say for me, for my co-founders, and for our biggest power users, something like 90% of my code now goes through the agent, and it's only maybe 10% of the time that I have to drop back into my IDE and get my hands dirty in the codebase again.

Being very clear with the agent about what you want is super important. I specifically like to say: you need to tell it not just what you want, but how you want it done. Mention specific frameworks you want it to use. If you want a test-driven development strategy, tell it that. Mention any specific files or function names it can go for. This not only helps it be more accurate and more clear about what exactly you want the output to be, it also makes it go faster: it doesn't have to spend as long exploring the codebase if you tell it, I want you to edit this exact file. This can save you a bunch of time and energy, and it can save a lot of tokens, a lot of actual inference cost.
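For illustration, here's the difference between a vague request and one that names the framework, the strategy, and the files. The file paths and the library choice are hypothetical, not from the talk:

```python
# A vague prompt forces the agent to explore and guess.
vague = "Add retry logic to our HTTP calls."

# A specific prompt pins down the how: framework, strategy, exact files.
specific = """\
Add retry logic to the HTTP client in src/http_client.py.
- Use the tenacity library with exponential backoff, max 3 attempts.
- Take a test-driven approach: write failing tests in
  tests/test_http_client.py first, then make them pass.
- Only touch those two files."""

print(len(vague.split()), "words vs", len(specific.split()), "words")
```

The second prompt costs a few more words but saves the agent an expensive exploration phase, which is exactly the token and latency savings the talk describes.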

I also like to remind folks that in an AI-driven development world, code is cheap. You can throw code away. You can experiment and prototype. I love that if I have an idea on my walk to work, I'll just tell OpenHands with my voice, do X, Y, and Z, and when I get to work, I'll have a PR waiting for me. 50% of the time, I'll just throw it away; it didn't really work. 50% of the time it looks great, I just merge it, and it's awesome. It's really fun to be able to rapidly prototype using AI-driven development. And I would also say: if you try to work with the agent on a particular task and it gets it wrong, maybe it's close, and you can keep iterating within the same conversation, since it has already built up some context. If it's way off, though, just throw away that work. Start fresh with a new prompt based on what you learned from the last one. It's a new sort of muscle memory you have to develop, just throwing things away. Sometimes it's hard to throw away tens of thousands of lines of generated code, because you're used to that being a very expensive bunch of code. These days, it's very easy to just start from scratch.

Again, this is probably the most important bit of advice I can give folks: you need to review the code that the AI writes. I've seen more than one organization run into trouble thinking they could just vibe code their way to a production application, automatically merging everything that came out of the AI. If you don't review anything, you'll find your codebase just grows and grows with tech debt. You'll find duplicate code everywhere. Things get out of hand very quickly. So make sure you're reviewing the code it outputs, and make sure you're pulling the code and running it on your workstation, or running it inside an ephemeral environment, just to make sure the agent has actually solved the problem you asked it to solve. I like to say: trust, but verify. As you work with agents over time, you'll build an intuition for what they do well and what they don't do well, and you can generally trust them to operate the same way today that they did yesterday. But you really do need a human in the loop.

One of our big learnings with OpenHands: in the early days, if you opened a pull request with OpenHands, that pull request would show up as owned by OpenHands, with the little hands logo next to it. That caused two problems. One, it meant the human who had triggered the pull request could then approve it and basically bypass our whole code review system; you didn't need a second human in the loop before merging. And two, oftentimes those pull requests would just languish. Nobody would really take ownership of them. If there was a failing unit test, nobody was jumping in to make sure the test passed. They would just sit there and not get merged, or if they did get merged and something went wrong, the code didn't actually work, and we didn't really know who to go to; there was nobody we could hold accountable for the breakage. So now, if you open a pull request with OpenHands, your face is on that pull request. You're responsible for getting it merged, and you're responsible for any breakage it might cause down the line.

Cool. I want to close by going through a handful of use cases. This is always kind of a tricky topic, because agents are great generalists: they can hypothetically do anything, as long as you break things down into bite-sized steps they can take on. But in the spirit of starting small, I think there are a bunch of use cases that make really great day-one use cases for agents. My favorite is resolving merge conflicts. This is the biggest chore in my job. OpenHands itself is a very fast-moving codebase; I'd say there's probably no PR I make where I get away with zero merge conflicts. And I love just being able to jump in and say, @OpenHands, fix the merge conflicts on this PR. It comes in, and it's such a rote task: it's usually very obvious what changed before, what changed in this PR, and what the intention behind those changes was. OpenHands knocks this out 99% of the time.

Addressing PR feedback is also a favorite. This one's great because somebody else has already taken the time to clearly articulate what they want changed, and all you have to do is say, @OpenHands, do what that guy said. And again, as you can see in this example, OpenHands did exactly what this person wanted. I don't know React super well, and our front-end engineer said do X, Y, and Z, mentioning a whole bunch of buzzwords I don't know. OpenHands knew all of it and addressed his feedback exactly how he wanted.

Fixing quick little bugs. You can see in this example we had a text input that should have been a number input. If I weren't lazy, I could have dug through my codebase and found the right file. But it was really easy to just fire it off quickly (I think I did this one directly from inside Slack): add @OpenHands, fix this thing we were just talking about. I don't even have to fire up my IDE. It's a really, really fun way to work.

Infrastructure changes I really like. Usually these involve looking up some really esoteric syntax in the Terraform docs or something like that. OpenHands and the underlying LLMs tend to just know the right Terraform syntax, and if not, they can look up the documentation using the browser. So this stuff is really great. Sometimes we'll just get an out-of-memory exception in Slack and immediately say, okay, OpenHands, increase the memory.

Database migrations are another great one. This is one where I find I often leave best practices behind: I won't put indexes on the right things, I won't set up foreign keys the right way. The LLM tends to be really great about following all the best practices around database migrations. So again, it's kind of a rote task for developers, it's not very fun, and the LLM's great at it. Fixing failing tests, like on a PR: if you've already got the code 90% of the way there and there's just a unit test failing because of a breaking API change, it's very easy to call in an agent to clean up the failing tests.

Expanding test coverage is another one I love, because it's a very safe task: as long as the tests are passing, it's generally safe to just merge. So if you notice a spot in your codebase where you have really low coverage, just ask your agent to expand the test coverage in that area. It's a great quick win to make your codebase a little bit safer.

Then, everybody's favorite: building apps from scratch. I would say, if you're shipping production code, again, don't just vibe code your way to a production application. But we're finding increasingly, internally at our company, that a lot of the time there's a little internal app we want to build. For instance, we built a way to debug OpenHands trajectories, to debug OpenHands sessions: a whole web application. Since it's just an internal application, we can vibe code it a little bit; we don't really need to review every line of code, and it's not facing end users. This has been a really, really fun thing for our business, being able to churn out these quick applications just to serve our own internal needs. So yeah, greenfield is a great use case for agents. That's all I've got.

We'd love to have you all join the OpenHands community. You can find us on GitHub at all-hands-ai/OpenHands. Join us on Slack and Discord. We'd love to build with you.

Awesome. Okay. Thank you again, Robert. Very, very exciting to hear about what works and what doesn't work in coding agents. Now, I want to take a bit of time to pause. We're going to change focus for the next few talks.

Our next speaker is Josh Albrecht from Imbue, who's going to give a little bit of a meta talk: a walkthrough of a case study about Sculptor. Sculptor is kind of their take on how you verify that your AI coding agents are actually outputting proper code. We always hear, how do we go from prototype to production? I'm guilty of this; I've given this talk, I gave it last year. But we always hear about how you go from prototype to production. You need a human in the loop. How do you go from vibe coding to actual production-grade code? And outside of tech debt, Josh is one of the people who has gone very, very deep on this and built Sculptor to solve exactly this. So in our next talk, he's going to go through a case study: as you build coding agents, how do you launch something alongside them? How do you better verify what's going on? And a little bit more about Josh: he's a friend I've known for over a year. We've talked in great depth about coding agents, and he's very deep in the space. He's been on the Latent Space podcast before, so if you want to hear more, feel free to check out the podcast, same as with a lot of the other speakers. Boris from Claude Code has been there as well. But

without further ado, I want to invite Josh up pretty soon. I'm going to kill some more time; we're running a little early. So, yeah, let's actually get a show of hands: who in here has started actually shipping SWE agents in production? So, outside of using them in your own coding workflows, outside of using copilots, who has actually shipped a version of a coding copilot? Who's working directly on the tools? Okay, we have a few hands.

So let's get a better idea of what people are working on. For the people in this session: are you trying to learn how to better use copilots? How to take them to production? How to build them? What you should know about them? Because Josh's talk is a bit of a case study around this. So, who here is in the phase of aggressively using copilots, kind of vibe coding, and trying to take it to that next level? Okay. Okay, a lot more hands there. So Josh, there's a little more background for you. Let's kick it off from there. Josh, I think we're ready for you.

Awesome. Thanks.

One second. All right, cool. Well, yeah, it's great to be here. So, I'm Josh Albrecht. I'm the CTO of Imbue, and our focus is on making more robust, useful AI agents. In particular, we're focusing on software agents right now, and the main product we're working on today is called Sculptor. The purpose of Sculptor is to help with something we've all experienced. We've all tried these vibe coding tools: you tell one to go off and do something, it goes off and creates a bunch of code for you, and then, voila, you're done, right? Well,

not quite. At least today, there's a big gap between the stuff that comes back and what you want to ship to production, especially as you get away from prototyping into larger, more established codebases. So today I'm going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, and go through some of the context and motivations for the various ideas we've explored and the features we've implemented. It's still a research preview, so these features may change before we actually release it. But I hope that, whether you're an individual using these tools or someone who's developing the tools yourself, you'll find these learnings from our experiments

to be useful for yourselves. So today, if you're thinking about how you can make coding agents better, there are a million different things you could build. You could build something that helps improve performance on really large context windows. You could make something cheaper or faster. You could make something that does a better job of parsing the outputs. But I don't think we should really be building any of these things. What we really want to be building is things that are much more specific to the use case, to the problem domain, to the thing you're really specialized in. Most of the things I just mentioned are going to get solved over the next, call it 3 to 12 to 24 months, as models get better and coding agents get better. So, just like you wouldn't want to make your own database, I don't think we want to spend a lot of time working on the problems that are going to get solved anyway; instead, we want to focus on the particular part of the problem that really matters for us, for our business. And at Imbue, the problem we're focusing on is basically this: what is wrong with this diff?

You get a coding agent's output, and it tells you, okay, I've added 59 new lines. Are those good? Right now, you have an awkward choice between either looking at each of the lines yourself, or just hitting merge and kind of hoping for the best. And neither of those is a really great place to be. So we try to give you a third option. The goal is to help build user trust by allowing another AI system to come take a look at this and understand: hey, are there any race conditions? Did you leave your API key in there? And so on. So we want to think about how we can leverage AI tools not just to generate the code, but to help us build trust in that

code. And the way we think about it is in terms of identifying problems with the code, because if there are no problems, then it's probably high-quality code; that's kind of the definition of high-quality code. If you think about it from an academic perspective, the way people normally measure software quality is by looking at the number of defects: how long does it take to fix a particular defect, or how many defects are caught by a particular technique? So that's the definition we're working from when we think about making high-quality software. And then, if you think about the software development process, what you want to be doing is getting to a place where you've identified these problems as early as possible. So Sculptor does not work as a pull request review tool, because that's much, much later in the process. Rather, we want something that's synchronous and immediate, giving you immediate feedback. As soon as you've generated that code, as soon as you've changed that line, you want to know: is there something wrong with it? That's easier both for you to fix and for the agent to fix.

So what are some ways you can prevent problems in AI-generated code? We're going to go through five different ways. The first is learning, planning... or sorry, only four different ways: learning, planning, writing specs, and having a really strict style guide. And

we'll see how those manifest in Sculptor. So the first thing you want to do when you're using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by letting you ask questions, have it do research, and get answers about what technologies exist and what ways other people have solved similar problems, so that you don't end up reproducing a bunch of work for what's already out there.

Next, we want to think about how we can encourage people to start by planning. Here's a little example workflow where you kick off the agent to do something simple, like implement this Scrabble solver, and change the system prompt here to force the AI agent to first make a plan without writing any code at all. Then you wait a little while, it generates the plan, and then you go change the system prompt again to say: okay, now we can actually create some code. So we make it really easy to change these types of meta-parameters of the coding agent itself. Of course, you can just tell the agent to do that, but by changing its system prompt, you force it to change its behavior in a much stronger way. And you can build up larger workflows by making customized agents: always plan first, then always do the code, then always run the checks, and so on.
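A minimal sketch of that plan-then-code phase switch; the `llm_call` hook here is a stand-in for whatever client API you use, not Sculptor's actual interface:

```python
# Two system prompts: one that forbids code, one that allows it.
PLAN_ONLY = (
    "You are a software agent. Produce a step-by-step implementation plan "
    "for the task. Do NOT write or edit any code in this phase."
)
CODE_PHASE = (
    "You are a software agent. Implement the approved plan. "
    "Run the project's checks after each change."
)

def run_phase(system_prompt, task, llm_call):
    # Swapping the system prompt constrains the agent far more strongly
    # than just asking it nicely inside the conversation.
    return llm_call(system=system_prompt, user=task)

# e.g. plan = run_phase(PLAN_ONLY, "implement a Scrabble solver", client)
#      code = run_phase(CODE_PHASE, plan, client)
```

Chaining `PLAN_ONLY` then `CODE_PHASE` is exactly the "customized agents" workflow the talk describes, just made explicit.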

Third, you want to think about writing specs and docs as a first-class part of the workflow. One of the main reasons why I, at least, haven't normally written lots of specs and docs in the past is that it's kind of annoying to keep them all up to date, to spend all this time typing everything out when I already know what the code is supposed to be. But this is really important to do if you want the coding agents to actually have context on the project you're working on, because they don't necessarily have access to your email, your Slack, and so on. And even if they did, they might not know exactly how to turn that into code. So in Sculptor, one of the ways we try to make this easier is by helping detect when the code and the docs have become outdated. That reduces the barrier to writing and maintaining documentation and docstrings, because now you have a way of more automatically fixing the inconsistencies. It can also highlight inconsistencies, or parts of the specifications that conflict with each other, making it easier to make sure your system makes sense from the very beginning.
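As a toy illustration of drift detection (real tools do far more), here's a check that flags parameters a function's docstring never mentions:

```python
import inspect

def doc_drift(fn):
    """Return parameter names that the docstring never mentions."""
    doc = fn.__doc__ or ""
    params = [p for p in inspect.signature(fn).parameters if p != "self"]
    return [p for p in params if p not in doc]

def transfer(amount, currency, memo=None):
    """Move `amount` in the given `currency` between accounts."""
    ...

print(doc_drift(transfer))  # ['memo'] -- the docstring never mentions `memo`
```

The payoff is the one the talk describes: once drift is detected mechanically, fixing the doc becomes a cheap, automatable chore instead of a maintenance burden.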

And finally, you want to have a really strict style guide and try to enforce it. This is important even if you're just doing regular coding without AI agents, just with other human software engineers. But one of the things that's special in Sculptor is that we make suggestions, which you can see towards the bottom here, that help keep the AI system on a reasonable path. So here it's highlighting that you could make this particular class immutable to prevent race conditions. This is something that comes from our style guide, where we try to encourage both the coding agents and our teammates to write things in a more functional, immutable style to prevent certain classes of errors. We're also working on developing a style guide that's custom-tailored to AI agents, to make it even easier for them to avoid some of the most egregious mistakes they normally make.
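A small example of the immutable style that kind of suggestion points at, assuming Python: a frozen dataclass rejects mutation after construction, which removes one whole class of shared-state races.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobConfig:
    """Immutable configuration: safe to share across threads and tasks."""
    name: str
    retries: int

cfg = JobConfig("nightly-sync", retries=3)
try:
    cfg.retries = 5              # any mutation attempt raises
except Exception as e:
    print(type(e).__name__)      # FrozenInstanceError

# To "change" a value, you build a new object instead of mutating:
from dataclasses import replace
cfg2 = replace(cfg, retries=5)
```

Because nothing can rewrite `cfg` behind your back, code that reads it never needs locks or defensive copies.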

But no matter how much you do to prevent the AI system from making mistakes in the first place, it's going to make some mistakes. And there are many things we can do to detect those problems and prevent them from getting into production. We'll go through three here: first, running linters; second, writing and running tests; third, asking an LLM. We'll dig into each and see how it manifests in Sculptor. So

for the first one, running linters: there are many automated tools out there, like ruff, mypy, or pylint, that you can use to automatically detect certain classes of errors. In normal development, this is sort of obnoxious, because you have to go fix all these really small errors that don't necessarily cause problems; it's a lot of churn and extra work. But one of the great things about AI systems is that they're really good at fixing these. So one of the things we've built into Sculptor is the ability for the system to very easily detect these types of issues and automatically fix them for you, without you having to get involved.

Another thing we've done is make it easy to use these tools in practice. A lot of tools end up like this: how many people here, maybe a show of hands, have a linter set up at all? Okay. How many people have zero linting errors in their codebase? Two. Great, we'll hire you. Okay, cool. But you know, it's not easy. One of the things we've done in Sculptor is make it so the AI system understands which issues were there before it started and which issues were there after it ran. So you can at least prevent the AI system from creating more errors, even if it isn't working in a perfectly clean codebase.
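That before/after idea can be sketched as a simple set difference over linter findings. The issue tuples here mimic (file, rule, line); how you actually collect them (ruff, pylint, mypy, or something else) is up to your toolchain:

```python
def new_lint_issues(before, after):
    """Return issues introduced between two linter runs."""
    return sorted(set(after) - set(before))

# Findings as (file, rule-code, line) tuples, e.g. from a linter's JSON output.
before = [("app.py", "F401", 3), ("app.py", "E501", 40)]   # pre-existing issues
after  = [("app.py", "E501", 40), ("app.py", "F841", 12)]  # after the agent ran

print(new_lint_issues(before, after))  # [('app.py', 'F841', 12)]
```

A non-empty result means the agent introduced new problems, which is the signal to block the change even in a codebase that already carries old lint debt.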

Okay. Second: testing. So why should you write tests at all? I was pretty lazy as a developer for a long time and did not want to write tests, because it took a lot of effort. You have to maintain them. I already wrote the code. It works. Okay. But one of the major objections to writing tests has disappeared now that we have AI systems. The ability to generate tests is now so easy that you might as well write tests. Especially if you have correct code, you can tell the agent: hey, just write a bunch of tests, throw out the ones that don't pass, and keep the rest. So there's no real reason not to write tests at all. And, as they say at Google: if you liked it, you should have put a test on it. This becomes much more important with coding agents, because you don't want your coding agent to go change the behavior of your system in a way you don't understand, don't expect, and don't want to see happen. At Google, this matters a lot for their infrastructure, because they don't want their site to crash when someone changes something. But if you really care about the behavior of your system, you want to make sure it's fully tested.

So how do we actually write good tests? I'll go through a bunch of different components of this. First, one of the things you can do is write code in a functional style. By this I mean code that has no side effects. This makes it much, much easier to run the LLM and understand whether the code is actually successful. You really don't want to be running a test that has access to, say, your live Gmail environment, where if you make a single mistake, you can delete all of your email. You really want to isolate those types of side effects and focus most of the code on the functional transformations that matter for your program.
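A small example of the side-effect point: a pure function over plain data is trivially testable, and the original input is never mutated.

```python
def apply_discount(cart, percent):
    """Return a new cart with discounted prices; never mutates the input."""
    return [
        {**item, "price": round(item["price"] * (1 - percent / 100), 2)}
        for item in cart
    ]

cart = [{"sku": "A1", "price": 10.0}]
discounted = apply_discount(cart, 20)
print(discounted[0]["price"])  # 8.0
print(cart[0]["price"])        # 10.0 -- original untouched
```

Everything that touches the payment API or the database stays at the edges of the program; the pure core like this can be exercised with thousands of generated inputs, no mocks required.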

Second, you can write two different types of unit tests. Happy-path unit tests are the ones that show you your code is working. It's happy: hooray, it worked. You don't need that many of those, just a small number to show things are working as you hope. The unhappy unit tests are the ones that help us find bugs, and here LLMs can be really, really helpful. Especially if you've written your code in a functional style, you can have the LLM generate hundreds or even thousands of potential inputs, see what happens to those inputs, and then ask the LLM: does that look weird? Often, when it says yes, that will be a bug. And now you have a perfect test case replicating that bug.
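Here's a sketch of that generate-and-judge loop, with a cheap round-trip property standing in for the LLM judge asking "does this look weird?":

```python
import random

# Toy function pair under test: serialization should round-trip exactly.
def encode(n):
    return str(n)

def decode(s):
    return int(s, 10)

# Generate many inputs and flag any that violate the property.
random.seed(0)
failures = [
    n
    for n in (random.randint(-10**9, 10**9) for _ in range(1000))
    if decode(encode(n)) != n
]

# Each entry in `failures` would be a ready-made unhappy-path test case.
print(len(failures))
```

With an LLM in the judge position, the "property" doesn't even need to be formal; the model just looks at input/output pairs and calls out the ones that seem wrong, and those become permanent regression tests.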

Third, after you've written your unit tests, it's maybe a good idea to throw them away in some cases. This is a little bit counterintuitive. In the past, we took all this effort and spent all this time trying to write good unit tests, so we feel some aversion to throwing them away. But now that it's so easy to have the LLM generate the test suite again from scratch, there's a good reason not to keep around too many unit tests of behavior you don't care about too much. You might also want to refactor the generated ones into something slightly more maintainable. But when you do keep them around, they can confuse the LLM when you come back and change that behavior. So it's at least worth thinking about whether you want to keep the tests that were originally generated, whether to clean them up, and how many of them to keep. Fourth, you should probably focus on integration tests as opposed to

testing only the kind of code level functional uh behavior of your program.

Integration tests are those that show you that your program actually works.

Like from the user's perspective, like when the user clicks on this thing, does this other thing happen? AI systems can be extremely good at writing these, especially if you create nice test plans

where you can write, okay, when the user clicks on the button to add the item to the shopping cart, then the item is in the shopping cart. If you write that out and then you write the test, then you can write another test plan like if the

user clicks to remove the button, the thing from the shopping cart, then it is gone. that systems can almost always get

gone. that systems can almost always get this right and so it allows you to work at the level of meaning for your testing which can be much more efficient. Uh
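As a sketch of what "working at the level of meaning" looks like, here are those two test plans translated one-to-one into tests. `ShoppingCart` is a hypothetical stand-in for whatever your application actually exposes.

```python
# The two test plans above, translated directly into tests. The point is
# the one-to-one mapping from plan sentence to test function.
class ShoppingCart:
    def __init__(self):
        self.items = set()

    def add(self, item):      # "user clicks the add button"
        self.items.add(item)

    def remove(self, item):   # "user clicks the remove button"
        self.items.discard(item)

def test_add_puts_item_in_cart():
    # Plan: when the user adds an item, the item is in the cart.
    cart = ShoppingCart()
    cart.add("widget")
    assert "widget" in cart.items

def test_remove_takes_item_out_of_cart():
    # Plan: when the user removes the item, it is gone.
    cart = ShoppingCart()
    cart.add("widget")
    cart.remove("widget")
    assert "widget" not in cart.items
```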

Fifth, you want to think about test coverage as a core part of your testing suite. If you're having Claude Code write things for you, you don't just care about the tests working on their own; you also care whether there are enough tests in the first place. Think back to the original screenshot where we get back a PR: if I tell you how many lines have changed, that's not very helpful. If I tell you how many lines have changed, and also that there's 100% test coverage, and also that all the tests pass, and also that something has looked at the tests and thought they were reasonable, now you can probably click that merge button without quite as much fear.

And sixth, we try to make it easy to run tests in sandboxes, and without secrets, as much as possible. This makes it a lot easier to actually fix things, and a lot easier to make sure you're not accidentally causing problems or creating flaky tests.

The third thing we can do to detect errors is ask an LLM. There are many different things we can check for, including: whether there are issues with your current change before you commit, whether the thing you're trying to do even makes sense, whether there are issues in the current branch you're working on, whether there are violations of rules in your style guide or architecture documents, whether there are details missing from the specs, whether the specs aren't implemented or aren't well tested, or whatever other custom things you want to check for. One of the things we're trying to enable in Sculptor is for people to extend the checks that we have, so they can add their own best practices to the codebase and make sure they're continually checked.
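A minimal sketch of that kind of extensible LLM check loop, assuming an injected `ask_llm` callable and an illustrative check list (this is not Sculptor's actual API):

```python
# Sketch of the "ask an LLM to check" pattern: run each configured check
# prompt against the current change. `ask_llm` is injected so any LLM
# client can be wired in; the check list is a hypothetical example of the
# kinds of custom checks mentioned above.
CHECKS = [
    "Are there issues with this change before commit?",
    "Does this change violate the style guide or architecture docs?",
    "Are any details from the spec missing, unimplemented, or untested?",
]

def run_checks(diff: str, ask_llm) -> list[str]:
    """Return the checks the LLM flagged (answers not starting with 'no')."""
    findings = []
    for check in CHECKS:
        answer = ask_llm(f"{check}\n\nChange:\n{diff}")
        if not answer.strip().lower().startswith("no"):
            findings.append(f"{check} -> {answer}")
    return findings
```

Teams can extend `CHECKS` with their own best practices and run this on every change.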

After you've found issues, you have to fix them. Very little of this talk is about fixing issues, because it ends up being a lot easier for these systems to fix issues than you would expect. I think this quote captures it well: a problem well-stated is half-solved. If you really understand what went wrong, it's much easier to solve the problem. This is especially true for coding agents, because really simple strategies work really well. Even just trying multiple times, a hundred times with different agents, ends up working out quite well. One of the things that enables this is really good sandboxing: if you have agents that can run safely, you can run an almost unlimited number in parallel, subject to cost constraints. And if any one of them succeeds, you can use that solution.

And this is really just the beginning. There are going to be so many more tools released over the next year or two, and many of the people in this room are working on those tools. There will be things not just for writing code, like we've been talking about, but for after deployment: for debugging, logging, tracing, profiling, and so on. There are tools for doing automated quality assurance, where an AI system clicks around on your website and checks whether it can actually do the thing you want the user to do. There are tools for generating code from visual designs. There are tons of dev tools coming out every week. You will have much better contextual search systems that are useful for both you and the agent. And of course, we'll get better AI models as well. If anyone is working on these other sorts of tools, adjacent to developer experience, helping fix some smaller piece of the process, we would love to work together and find a way to integrate that into Sculptor so people can take advantage of it. I think what we'll see over the next year or two is that most of these things will become accessible, and it'll make the development experience a lot easier once they're all working together.

So, that's pretty much all I have for today. If you're interested, feel free to take a look at the QR code, go to our website at imbue.com, and sign up to try out Sculptor. And of course, if you're interested in working on things like this, we're always hiring and always happy to chat, so feel free to reach out. Thank you.

Thank you, Josh. I highly recommend picking Josh's brain. I'm sure he'll be around; find him in the hallways. It's been great. I've had countless conversations with Josh. And just to say once again, what a day. It's been a fully jam-packed day. We've had eight back-to-back speakers talking about SWE agents. We started with the originals: GitHub Copilot, the original coding copilot. Then we went to the latest and the greatest: we've had OpenAI's Codex speak, we've had Claude Code speak, we've had Jules from Google speak. Then we went a little bit into, okay, how do I actually start using these things in production? How do I go past vibe coding? Let's walk through a case study of how we really build these things.

And now, for our last talk in the SWE agents track, we have someone who is not building an agent. We have Eno Reyes here from Factory, and he is actually building droids. What does this mean? It's not just hype: Eno is really working on droids. Factory AI is one of the companies actually shipping this stuff in production. They are actually in the enterprise. They are growing like crazy. He's recently been on the Latent Space podcast, and they're really doing this stuff. Eno is a great speaker; he's spoken for bigger audiences than this. Without any further ado, I want to pass it on to Eno.

[Applause] Hi everybody. My name is Eno. I really appreciate that introduction. Maybe I can start with a bit of background. I started working on LLMs about two and a half years ago, when GPT-3.5 was coming out and it became increasingly clear that agentic systems were going to be possible with the help of LLMs. At Factory, we believe that the way we use agents to build software is going to radically change the field of software development. We're transitioning from the era of human-driven software development to agent-driven development. You can see glimpses of that today; you've already heard a bunch of great talks about different ways agents can help with coding in particular.

However, it seems like right now we're still trying to find what that interaction pattern, what that future, looks like. A lot of what's publicly available is more or less an incremental improvement. The current zeitgeist is to take tools that were developed twenty years ago for humans to write every individual line of code, tools that were designed first and foremost for human beings, sprinkle AI on top, keep adding layers of AI, and hope that at some point there's some step-function change. But there's not a lot of clarity in exactly what that means. There's a quote attributed to Henry Ford: if I had asked people what they wanted, they would have said faster horses.

Now, we believe there are some fundamentally hard problems blocking organizations from accessing the true power of AI. This power can only be found when your team is delegating the majority of their tasks across the software life cycle to agents.

To do that, you need a platform with an intuitive interface for managing and delegating tasks; centralized context from across all your engineering tools and data sources; agents that consistently produce reliable, high-quality outputs; and infrastructure that supports thousands of agents working in parallel. These are all hard problems to solve, but our team has spent the last two years partnering with large organizations to build towards this future. This talk is a deep dive into agent-native development and some of the lessons we've learned helping enterprise organizations make that transition.

When Andrej Karpathy said English is the new programming language, he captured a very exciting moment. If you were to judge AI progress based on Twitter, you'd think you could basically vibe-code your way to anything. But vibe coding isn't the approach for hard problems. You can't vibe-code a legacy Java 7 app that runs 5% of the world's global bank transactions; you need a little more software engineering. So agents really should not be thought of as a replacement for human ingenuity. Agents are climbing gear, and building production software is like scaling Mount Everest. While better tools have made the climb more accessible, we still need to think about how to leverage them, and our existing expertise, to drive this transformation.

I want to start with a quick video of what's possible today. In it, you'll see a glimpse of what it's like to delegate a task to an agentic system. You can watch the droid, as we call them, ingest the task and start grounding itself in the environment. It uses tools to search the codebase, determine the git branch, and check what the machine has available. It looks through recent changes to the codebase. It looks at memories of its recent interactions with users, as well as memories from its interactions across the entire organization. Then the droid comes back with a plan and says: here's exactly what I'm going to do, but I'd like you to clarify a couple of things. We need to expect our agents not just to take what we say at face value, but to question it and make us better software developers. After the user comes back with that info, the droid executes the task. It leverages its tools to write code, runs pre-commit hooks and lints, and ultimately generates a pull request that passes CI.

But how can you achieve outcomes like this on a regular basis? It's nice when it works, but what about when it fails? At the heart of effective AI-assisted development lies a fundamental truth: AI tools are only as good as the context they receive. So much of what people call prompt engineering is really mentally modeling an alien intelligence that has only a slice of context on the real world. If you start thinking about your AI tools this way, you're going to get a lot better at interacting with them. We've investigated thousands of droid-assisted development sessions, and a heuristic emerges: when AI fails to solve the problem, it's usually not because the LLMs aren't good enough, but because they're missing crucial context required to truly solve it. Better models will make this happen less often, but the real solution is not just making the AI smarter; it's getting better at providing these systems with that missing context.

LLMs don't know about your morning standup. They don't know about the ad hoc meeting you had or the whiteboard you drew, but you can give those things to the LLM if you transcribe your notes, or take a photo and upload it. You have to start thinking about these things not as tools, but as something in between a co-worker and a platform. If you can capture the context that lies in the cracks between systems, use platforms that integrate natively with all of your data sources, and have agents that can actually make use of those things, you can start driving this transition to agent-native development.

I want to talk a bit as well about planning and design. When your organization is doing agent-native development, you are using agents at every stage. Droids don't just write code. They can help with that part, but the hardest thing about software development is not the code; it's figuring out exactly what to build.

Here you can watch a droid tasked with finding the most up-to-date information about a new model release and integrating it into an existing chat application. It leverages internet search, its knowledge of your codebase, its understanding of your product goals from its organizational memory, and its understanding of your technical architecture from the design doc you wrote last week. Planning with AI is fundamentally different from planning alone. It's not just asking, please build this thing for me, or give me the design doc; it's about delegating the groundwork and the research to AI agents, then using a collaborative platform to interact and explore possibilities together. That is how you get better at planning with agents.

Now you can see here we have a nice document, a nice plan. You could export it to Notion, Confluence, Jira, any of your integrations, with no setup. MCP is great, but having every developer install a bunch of servers, click a bunch of things, and pass around API keys is not ideal. Platforms are going to evolve and solve a lot of these problems; in the meantime, you do have droids.

A little more on this: the real unlock for AI transforming your organization with respect to planning comes when you start standardizing the way your organization thinks. Here's an example from a couple of weeks ago, when we were planning a feature related to our cloud development environments. We'd gotten a lot of feedback from users, so we had about three months of user transcripts, from people at enterprises and individuals we knew; we transcribe every interaction and meeting at Factory. We take those notes and combine them with a droid that has access to our architecture. We take an ad hoc meeting that one of our engineers captured with Granola (if you use Granola, I love that tool). We throw all of that to the knowledge droid, and we don't say, let's plan the feature out. We say: could you find any patterns in the customer feedback that map to our assumptions? Could you highlight any technical constraints in what we have today that might help us make this better? Then we take all of that output, maybe four or five intermediate documents, and use it to start iterating on a final PRD that outlines the full feature.

You can take that PRD, and if you have a droid with access to Linear and Jira, with tools to create tickets, create epics, and modify them, that PRD can be turned into a roadmap: eight tickets, this one dependent on that one, but ultimately work that can be parallelized across a group of eight code droids. This is how software is going to evolve. We're going to move from executing to orchestrating systems that work on our behalf.

I talked about a couple of these: PRDs, design docs, RCA templates, quarterly engineering and product roadmaps, transcriptions of your meetings. Normally you might see this stuff as a burden, but when your company is doing agent-native software development, your process and your documentation are a knowledge base and a map for your droids to learn from and imitate the way your team thinks. This documentation and process is a conversation with both future developers and future AI systems. If you can communicate the why behind a decision, the context for those future developers and agents, you'll see a huge lift in their ability to natively work the way your team actually works.

I want to talk about agent-driven development with respect to site reliability engineering. There is a lot that goes into a real incident response; it would be crazy for me to stand up here and say you could automate all SRE and RCA work today. But there is a difference with the AI agent-driven approach. Here we're watching a droid take a Sentry incident and convert it into a full RCA and mitigation plan. Traditional incident response is effectively solving a puzzle. The pieces are scattered across dozens of systems: logs in one place, metrics in another, historical context somewhere else, and knowledge in your team's heads. Droids in your organization fundamentally change this. When an alert triggers, you can pull in context from relevant system logs, past incidents, runbooks in Notion or Confluence, and team discussions from Slack. A droid with the tools and the ability to access all of this can condense that search effort from hours to minutes. Really, the acceptable time to act for a standard enterprise organization is going to be zero: the moment an incident happens, you should have a droid telling you exactly what happened and exactly how to fix it.

Where it gets interesting is when you have user- and organization-level memory: you start to build a model of your team's response patterns and common issues. So it's not just generating runbooks or a mitigation for one incident, but creating new processes that help solve these issues. Once you've written that RCA, you can move on to generating runbooks for newly learned patterns, updating existing response workflows, and capturing team knowledge that gets shared automatically, without the need for manual curation.

This is why all these things are connected. Agent-native incident response is part of a larger learning cycle that happens when you integrate agents into the workflow. We're seeing teams able to cut incident response time in half because context is immediate. They reduce repeat incidents because the third time something happens, the droid starts to say: maybe we should fix this. And they improve team collaboration, because when a new engineer joins the team and asks, how do we do this, it's already in memory; they can just ask the droid. Most importantly, we're seeing a general shift from reactive to predictive operations, because you can now see the patterns across the entire operational history. Agentic systems turn each incident into an opportunity to make the entire system far more reliable.
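As an illustration of that context-gathering step, here is a toy sketch; the `fetch_*` functions are hypothetical stand-ins for real integrations with your logging, wiki, and chat systems.

```python
# Toy sketch of incident-triage context gathering: condense signals that
# are scattered across systems into one block an agent can reason over.
# The fetch_* functions are hypothetical stand-ins for real integrations.
def fetch_logs(service: str) -> list[str]:
    return [f"{service}: error rate spiked at 14:02"]

def fetch_runbooks(service: str) -> list[str]:
    return [f"runbook: restart {service} workers"]

def fetch_chat(service: str) -> list[str]:
    return [f"#incidents: {service} deploy went out at 13:58"]

def gather_incident_context(service: str) -> str:
    """Condense scattered signals into one prompt-ready context block."""
    sections = {
        "Recent logs": fetch_logs(service),
        "Runbooks": fetch_runbooks(service),
        "Team discussion": fetch_chat(service),
    }
    lines = []
    for title, items in sections.items():
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)  # feed this to the agent alongside the alert
```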

AI agents are not replacing software engineers; they're significantly amplifying their individual capabilities. The best developers I know are spending far less time in the IDE writing lines of code. It's just not high leverage. They're managing agents that can do multiple things at once, they're organizing these systems, and they're building patterns that supersede the inner loop of software development as they move to the outer loop. They aren't worried about agents taking their jobs; they're too busy using the agents to become even better at what they do. The future belongs to developers who understand how to work with agents, not those who hope AI will just do the work for them. And in that future, the skill that matters most is not technical knowledge or your ability to optimize a specific system, but your ability to think clearly and communicate effectively with both humans and AI.

Now, if you find any of this interesting and want to try the droids, I'm happy to share that everyone at this talk can use this QR code to sign up for an account. Our mobile experience is not optimized yet (the droids are on that), so I'd recommend trying this on a laptop, but you will get 20 million free tokens credited to your account. I also want to add that, first and foremost, Factory is an enterprise platform. So if you're thinking about security, about where the audit logs are, about whose responsibility it is when an agent goes and runs rm -rf on your codebase (droids don't do that, but if one were to, whose responsibility is that?), these are the types of questions we're interested in and that we're helping large organizations solve today. If you're a security professional, if you're thinking about ownership, auditability, and indemnification, if you're a lawyer, these are the types of questions you should start asking today, because YOLO mode is probably not the best thing to be running inside your enterprise. So give it a scan, give it a try, check out some of the controls we have. And if you have any questions, feel free to reach out via email. Thanks.

[Applause] Awesome. Thank you, Eno. What a day of talks, everyone. That's our eight back-to-back sessions of SWE agent talks. Okay, logistics. This is the main keynote room, and we're going to be back here around 3:40 for our ending keynotes. Feel free to stay and hang out; it's not that long from now. You have about 20 minutes. If you're interested, there are some expo talks going on, and feel free to check out the expo booths, but please do stay. After the break, we have a few more great keynote talks lined up. Everyone will come back to the keynote room, and then we have a few surprises.

One very special thing: last week we held an AI engineer hackathon, and the finalists have not received their awards yet. They have spent the week working a little further on their projects. They're going to come here and demo on stage, and we're going to pick the winners. There's $10,000 in prizes on the line, so we're going to see some hackathon demos. And of course, at the end, we want to thank our speakers. We have a special trophy ceremony, and we need your help to determine who your favorite speakers were. For the SWE agent track, we're going to have a poll for your favorite speaker. Please vote, alongside the keynotes and the other tracks, for anything you've attended, and let us know your favorite speakers. Thank you all for coming; it's been a great list of talks, and we hope to see you back soon. Once again, at 3:40 we'll kick off here with keynotes, speaker prizes, and hackathon judging. Thank you, everyone.

Ladies and gentlemen, please welcome

back to the stage, the VP of developer relations at LlamaIndex, Laurie Voss.

Hello everybody. Welcome back. Has everybody had a good conference day today? All right, for this next bit I'm going to try an experiment. There are four blocks of you, separated by aisles, so I'm going to divide you into teams. You are team A. You are team B. You are team C. You are team D. Let me hear it from team A. Team C. Team B. Team D. Team A again. All right, I'm not going to do anything with that; that's just to wake you up.

We have some great keynotes lined up this afternoon. We're going to hear the results of the State of AI Engineering survey, and if you know anything about me, you know that I love data. I love a good survey; it's my favorite thing to hear about. We're going to hear stories about building OpenRouter. And we're going to hear Sean Grove tell us why prompt engineering is dead, which is sure to be spicy. But our first keynote this afternoon is trends across the AI frontier, so please welcome to the stage the co-founder of Artificial Analysis, George Cameron. [Applause]

Hi everyone. I'm George,

co-founder of Artificial Analysis. A quick background on who we are before we dive in. (Do you see that? Sorry, I think my clicker is not working. Oh, there we go. Great.) We're a leading independent AI benchmarking company, and we benchmark a broad spectrum across AI. We benchmark models for their intelligence. We benchmark API endpoints for their speed and their cost. We benchmark hardware and all the AI accelerators out there. And we benchmark a range of modalities: not just language, but also vision, speech, image generation, and video generation. We publish essentially all of it for free on our website, artificialanalysis.ai, where we benchmark over 150 different models across a range of metrics. We also publish reports, many of which are publicly accessible, and we have a subscription for enterprises looking to bring AI to production in their environments in an efficient and effective way.

Let's start off with AI progress and set the scene. It's been a crazy two years; I think we've all felt it in this room. OpenAI kicked off the race with the ChatGPT and GPT-3.5 launch, and since then it's only gotten more hectic, with more and more model releases by more and more labs pushing the AI frontier.

So, the current state of frontier AI intelligence. I think this ordering of models will be familiar to many in this room: o3 is the leader, followed closely by o4-mini with reasoning effort high, DeepSeek R1 (the release from the last week or two), Grok 3 mini with reasoning high, Gemini 2.5 Pro, and Claude 4 Opus with thinking. This benchmark is our Artificial Analysis Intelligence Index. It's a composite index of seven evaluations, which we weight to produce the index, and it provides a generalist perspective on the intelligence of these models.
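As a rough illustration of how such a composite works (the evaluation names and equal weights below are placeholders, not Artificial Analysis's actual methodology):

```python
# Illustrative sketch of a composite index: a weighted average of
# per-evaluation scores. Evaluation names and weights are placeholders.
def intelligence_index(scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Weighted average of normalized (0-100) evaluation scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

# Hypothetical scores for one model across seven evaluations.
scores = {"mmlu_pro": 80, "gpqa": 70, "math": 90, "code": 75,
          "instruction": 85, "long_context": 65, "agentic": 60}
weights = {name: 1.0 for name in scores}  # equal weights as a placeholder

index = intelligence_index(scores, weights)  # simple mean here: 75.0
```

Changing the weights shifts which capabilities the headline number rewards, which is why the weighting choice matters as much as the evaluations themselves.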

We all have an understanding of what frontier AI intelligence is. But what I want to explore with you today is that there's more than one frontier in AI. There are trade-offs to accessing this intelligence; you shouldn't always use the leading, most intelligent model. So we want to explore the different frontiers out there, and as an AI benchmarking company, we're going to bring some numbers to the fore to help you reason about this. First, we'll look at reasoning models. Next, the open weights frontier. Third, the cost frontier. And lastly, the speed frontier. There are other frontiers out there that we benchmark, but we'll focus on these key ones today.

Starting with reasoning models: what we've done here is take our Intelligence Index and look at it relative to the output tokens used to run it. We measured how many tokens each model took to run our seven evaluations and plotted it on this chart, and you can see two distinct groups; it's helpful to think about them separately. Non-reasoning models offer less intelligence but require fewer output tokens, while reasoning models use more output tokens but offer greater intelligence. This is important to look at because more output tokens come with trade-offs for both request latency and cost. We're going to bring some numbers to draw that out and look at the real differences here: just how yappy these reasoning models are. We can see that there's an order of magnitude difference between reasoning and non-reasoning models. It's not just that feeling of "oh, this is taking a long time"; it's real, an order of magnitude. GPT-4.1 required 7 million tokens to run our Intelligence Index evaluations, but o4-mini (high) took 72 million tokens, and the yappiest of them all, Gemini 2.5 Pro, took 130 million tokens. As mentioned, this has implications for cost as well as end-to-end latency and responsiveness.

So, looking at latency: we benchmark the API latency, how long it takes to receive a full response when accessing these models via their APIs. GPT-4.1, at the median across our requests, took 4.7 seconds to return a full response. o4-mini (high) took over 40 seconds, roughly another 10x, an order of magnitude increase. This has implications for applications and users which require responsiveness, even enterprise chatbots. You don't always reach for o3 in ChatGPT, and Facebook has done a lot of studies on this for consumer apps, looking at user drop-off by application latency, which clearly demonstrate the point. Sorry, do you mind if we jump back a slide?

It also has implications for how we're building, particularly with agents, where 30 queries in succession is not uncommon. Latency has a multiplier effect on your application and on how you can build: if you have faster responses, maybe you can make those 30 queries 100. Putting numbers to that: with agents, 30 queries is normal. Even at 10 seconds per request for a reasoning model (less than o4-mini's latency), running 30 queries means 300 seconds that a user or an application might be waiting for a response. That's 5 minutes. With the orders of magnitude we're dealing with here, if that 10 seconds were 1 second, those 30 queries would take 30 seconds. 30 seconds versus 5 minutes impacts what you can build. Think of a contact center application: maybe 30 seconds is okay there, but 5 minutes definitely isn't. Who likes waiting on the phone that long? Or imagine waiting like that every time you wanted to use Google. It impacts how we can build with these models, and so I think bringing numbers to these trade-offs is really important.
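The agent arithmetic here is simple enough to sketch directly (the 10-second and 1-second figures are the talk's illustrative numbers, not measurements):

```python
# Back-of-envelope wall-clock time for an agent that makes
# sequential model calls; ignores any non-model overhead.
def agent_wall_time_s(per_request_s: float, n_calls: int) -> float:
    return per_request_s * n_calls

print(agent_wall_time_s(10.0, 30))  # 300.0 seconds, i.e. 5 minutes
print(agent_wall_time_s(1.0, 30))   # 30.0 seconds
```

Because the calls are sequential, any per-request latency improvement passes straight through as the same factor on total agent wall time.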

I'd encourage everybody to measure them.

Next, we're going to move to the open weights frontier. Around the time of GPT-4, there was a huge delta between open weights intelligence and proprietary intelligence: Llama 65B or Llama 2 70B wasn't close to the intelligence of GPT-4.

What I'd like to show here, plotting our Intelligence Index by release date, is that that gap closed, with great models like Mixtral 8x7B and Llama 405B. But o1 broke away in late 2024. Then, of course, DeepSeek released V3, I think on December 26th; it ruined some of my Christmas holiday plans. I had to tell my family I needed to go read this paper, it's really exciting. And then, of course, R1 in January. The gap between open weights intelligence and proprietary model intelligence is less than it's ever been, particularly with the recent R1 release in the last couple of weeks, which is only a couple of points behind the leading models in our Intelligence Index.

You can't talk about open weights intelligence without talking about China. The leading open weights models, across both reasoning and non-reasoning models, are from China-based AI labs. DeepSeek is leading in both; Alibaba, with their Qwen 3 series, is coming in second in reasoning. But you also have other labs, such as Meta, and Nvidia with their Nemotron fine-tunes of Llama, coming in close as well.

Let's look at the cost frontier. This is really important, and similar to end-to-end latency, it impacts what you can build. Bringing some numbers here, we can really see these orders of magnitude play out. o3 cost us almost $2,000 to run our Intelligence Index. TechCrunch actually wrote an article about how much money we were spending on running it; we didn't want to read it.

You can see that GPT-4.1, a great model, is roughly 30 times cheaper to run our Intelligence Index than o1, and GPT-4.1 nano is over 500 times cheaper than o3. You should think about these numbers when building applications: the cost structure of your application might dictate what you can use and how you use it. Those 30 sequential API calls for your agentic application could be 500 and still be cheaper than a single o3 query.

A key point to note here about this cost to run the Intelligence Index, and why we don't just look at the per-token price: the labs maybe don't want you to think this way, but you're paying the cost per token, and then you're also paying for how verbose the models are, all the reasoning tokens that are output when these models are in their thinking mode. You pay for those as output tokens, even if some of the labs hide them. So you need to think about this, measure it in your application, and benchmark not just the cost per million tokens but also how many reasoning tokens there are and how verbose these models are. Even amongst the non-reasoning models, there are big differences in how verbose the responses are.
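The billing point can be sketched as follows; the token counts and per-million prices here are hypothetical, not any provider's actual rates:

```python
# Effective cost of one request. Reasoning ("thinking") tokens are
# billed at the output rate even when the provider hides them.
def request_cost_usd(input_tokens: int, visible_output_tokens: int,
                     reasoning_tokens: int,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens / 1e6 * input_price_per_m
            + billed_output / 1e6 * output_price_per_m)

# Same visible answer, very different bill once reasoning is counted.
print(request_cost_usd(1000, 500, 0, 2.0, 8.0))     # non-reasoning
print(request_cost_usd(1000, 500, 4500, 2.0, 8.0))  # thinking mode
```

With these made-up numbers, the thinking-mode request costs 7x more for the same 500 visible tokens, which is why per-token price alone is a misleading comparison.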

So, for instance... ah, we'll go to the next slide. Do you mind if we go back one, please? What we've done here is look at the trends in terms of cost. We've bucketed models by how intelligent they are: intelligence bands, if you will.

What we can see is that the cost of accessing a GPT-4 level of intelligence has fallen over 100 times since mid-2023. This is the case across all quality bands. You can see that even when a new quality band, a new frontier, is reached, as with o1-mini in late 2024, within only a few months the cost of accessing that level of intelligence halved. This is moving quickly. So what I would say to you is: when building applications, think about what you could build if cost weren't a barrier.
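One way to run that exercise: a 100x decline over two years works out to roughly a halving every 3.6 months, so you can extrapolate forward (a hypothetical projection under a steady-decline assumption, not a guarantee):

```python
import math

# 100x decline over 24 months implies a halving time of
# 24 / log2(100), about 3.6 months, if the decline is steady.
HALVING_MONTHS = 24 / math.log2(100)

def projected_cost(cost_now: float, months_ahead: float) -> float:
    """Cost of accessing the same intelligence level, extrapolated."""
    return cost_now * 0.5 ** (months_ahead / HALVING_MONTHS)

# By construction, 24 months out the cost is 1/100 of today's.
print(round(projected_cost(1000.0, 24), 2))
```

So a workload that costs $1,000 today would cost roughly $500 in under four months and about $10 in two years, if the historical trend held.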

It's a very important exercise, because if you build for a cost structure that doesn't work now, it might well be possible and feasible in six months' time.

Next, we're going to look at the speed frontier. This is how quickly you receive tokens: the output speed, in output tokens per second, that you receive after sending an API request. This has also increased dramatically since early 2023.

Similarly, because there's typically a trade-off between intelligence and speed, we've grouped models into buckets, and we can see that they've all increased in how quickly you can access a given level of intelligence. GPT-4, I believe, was around 40 output tokens per second back in 2023. Who remembers hitting enter in ChatGPT (it wasn't a reasoning model) and just waiting for it to output, especially code, which you want to copy straight into your editor and hit run to see if it works? Now you can access that level of intelligence at over 300 tokens per second.

I'll go through the drivers; they're not the focus of the talk, but they're important to reference. First, model sparsity: we're seeing more mixture-of-experts models, which activate only a proportion of their parameters at inference time, so there's less compute per token, which means they can go faster. MoE models were around back then too, but they're getting more and more sparse, with smaller proportions of active parameters.
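A crude model of why sparsity helps: per-token compute scales with active parameters, not total. The split below between always-active shared parameters and routed experts is an invented illustration, not any particular model's actual architecture:

```python
# Toy MoE view: per-token compute touches shared layers plus only
# the routed experts selected for that token.
def active_params_b(total_b: float, n_experts: int,
                    active_experts: int, shared_frac: float = 0.2) -> float:
    """Billions of parameters touched per token (illustrative model)."""
    shared = total_b * shared_frac   # attention etc., always on
    routed = total_b - shared        # split evenly across experts
    return shared + routed * active_experts / n_experts

# A sparser routing config touches far fewer parameters per token.
print(active_params_b(600, n_experts=64, active_experts=8))
print(active_params_b(600, n_experts=256, active_experts=8))
```

Holding total size and active-expert count fixed, quadrupling the expert count cuts the routed compute per token by 4x, which is the "getting more sparse" trend in miniature.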

Next, smaller models: models are getting more intelligent at smaller sizes, particularly with distillations, 8B distillations and so on. Then there are inference software optimizations like FlashAttention and speculative decoding. And lastly, hardware improvements: the H100 was faster than the A100.

improvements. So, H100 was faster than A100. Now, we've recently launched

A100. Now, we've recently launched benchmarks of the B200 on our artificial analysis website, and it's getting over a thousand output tokens a second. Think

about that relative to the 40 output tokens per second of GPT uh 4 in 23. There's also specialized uh

23. There's also specialized uh accelerators like Cerebra, Sanova, Grock. I want to share a house view here

Grock. I want to share a house view here to frame things.

Yes, things are getting more efficient. Yes, the cost of accessing the same level of intelligence is decreasing, and hardware is getting better; we're getting more output throughput on the chips. But our view is that demand for compute is going to continue to increase. We're going to see larger models; I mean, DeepSeek is over 600 billion total parameters (sorry, not active, total). The demand for more intelligence is insatiable. Reasoning models, the yappy models as we saw, require more compute at inference time. And lastly, agents, where 20, 30, 100-plus sequential requests to models is not uncommon. These act as multipliers on the demand for compute. So the house view, playing with these numbers, is that net-net we're going to continue to see compute demand increase. Thanks everyone. I'm George from Artificial Analysis.

[Applause] [Music]

Our next speaker is the founder and CEO of Braintrust and the curator of this year's Evals track. Please join me in welcoming to the stage Ankur Goyal.

[Applause] Awesome. Excellent. So today we're going to talk a little bit about evals to date and where we think evals are going in the future. Also, for those of you who saw my brother earlier, I'm going to do my best to live up to his energy and charisma.

It's been an amazing almost two-year journey for us at Braintrust. We've had the opportunity to work with some of the most amazing companies, building what I think are the best AI products in the world. I'm blown away by how many evals people actually run in the product. The average org that signs up for Braintrust runs almost 13 evals a day; some of our customers run more than 3,000 evals a day; and some of the most advanced companies running evals spend more than two hours in the product every day working through their evals. One of the things that stands out to me is that while we have customers building some of the coolest, most automated AI-based products and agents in the world, with evals the best thing you can do is look at a dashboard. I think we have a pretty cool dashboard in Braintrust, but still, it's just a dashboard that you look at before you walk away and think: okay, what changes can I make to my code or my prompts so that this eval does better? And I actually think that is all going to change.

So today I'm excited to talk about something called Loop. Loop is an agent we've been working on for some time now that's built into Braintrust, and it's actually only possible because of evals. Every quarter for the last two years, we've run evals on the frontier models to see how good they are at actually improving prompts, improving datasets, and improving scorers. And until very, very recently, they actually weren't very good. In fact, we think Claude 4 in particular was a real breakthrough moment: it performs almost six times better than the previous leading model.

So Loop runs inside of Braintrust, and it can automatically optimize everything from your prompts all the way up to very complex agents. Just as importantly, it also helps you build better datasets and better scorers, because it's really the combination of these three things that makes for really great evals. This is a little preview of the UI. You can start using it today if you're an existing Braintrust user or you sign up for the product: there's a feature flag you can flip on called Loop and start using it right away. By default it uses Claude 4, but you can pick any model you have access to, whether that's an OpenAI model, a Gemini model, or, if some of you are building your own LLMs, you can use those as well. And as you can see, it runs directly inside of Braintrust.

One of the things we learned from working with a lot of users is how important it is to actually look at data and look at prompts while you're working with them, and we didn't want that to go away when we introduced Loop. So every time it suggests an edit to your data, a new idea for scoring, or an edit to one of your prompts, you can see it side by side directly in the UI. Of course, for the more adventurous among you, there's also a toggle you can turn on that says just go for it, and it will optimize away, which actually works really well.

So, just to recap: to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually doing evaluation has been incredibly manual. I'm excited about how, over the next year, evals themselves are going to be completely revolutionized by the latest and greatest coming out of the frontier models, and we're very excited to incorporate that into Braintrust. Please, if you're not already using the product, try it out. Try out Loop and give us your feedback; we have a lot of work to do, and we'd love to talk to you. We're also hiring, so if you're interested in working on this kind of problem, whether it's the UI part, the AI part, or the infrastructure side, we'd love to talk to you. You can scan this QR code, it should be over there, and get in touch with us. We'd love to chat.

Thank you.

[Music] Our next presenter will provide us some perspectives on the state of AI engineering. Please join me in welcoming to the stage Barr Yaron.

[Music] [Applause]

All right. Hi everyone. Thank you for having me here, and huge thanks to Ben, to Swyx, and to all the organizers who've put so much time and heart into bringing this community together.

[Applause] Yeah. All right. So, we're here because we care about AI engineering and where this field is headed. To better understand the current landscape, we launched the 2025 State of AI Engineering survey, and I'm excited to share some early findings with you today.

All right, before we dive into the results, the least interesting slide. I don't know everyone in this audience, but I'm Barr. I'm an investment partner at Amplify, where I'm lucky to invest in technical founders, including companies built by and for AI engineers. And with that, let's get into what you actually care about, which is enough Barr and more bar charts. And there are a lot of bar charts coming up.

Okay, so first, our sample. We had 500 respondents fill out the survey, including many of you here in the audience today and on the livestream; thank you for doing that. The largest group called themselves engineers, whether software engineers or AI engineers. While this is the AI engineering conference, it's clear from the speakers and the hallway chats that there's a wide mix of titles and roles. You even let a VC sneak in. So let's test this with a quick show of hands. Raise your hand if your title is actually AI engineer, here at the AI engineering conference. Okay, that is extremely sparse. Put your hands down. Raise your hand if your title is something else entirely. That should be almost everyone. Keep it up if you think you're doing the exact same work as many of the AI engineers. All right, so this sort of tracks: titles are weird right now, but the community is broad, it's technical, and it's growing. We expect the AI engineer label to gain even more ground. I couldn't help myself: on Google Trends, the search term AI engineering barely registered before late 2022. We know what happened: ChatGPT launched, and interest in AI engineering has not slowed since.

Okay, so people had a wide variety of titles, but also a wide variety of experience. The interesting part here is that many of our most seasoned developers are AI newcomers. Among software engineers with 10-plus years of software experience, nearly half have been working with AI for three years or less, and one in 10 started just this past year. So change, right now, is the only constant, even for the veterans.

All right, so what are folks actually building? Let's get into the juice. More than half of the respondents are using LLMs for both internal and external use cases. What was striking to me was that three of the top five models, and half of the top 10 models, that respondents are using for those external, customer-facing products are from OpenAI. The top use cases we saw are code generation and code intelligence, and writing assistance and content generation. Maybe that's not particularly surprising, but the real story here is heterogeneity: 94% of people who use LLMs are using them for at least two use cases, and 82% for at least three. Basically, folks who are using LLMs are using them internally, externally, and across multiple use cases.

All right, so you may ask: how are folks actually interfacing with the models, and how are they customizing their systems for these use cases? Besides few-shot learning, RAG is the most popular way folks are customizing their systems: 70% of respondents said they're using it. The real surprise for me here (I'm looking to gauge surprise in the audience) was how much fine-tuning is happening across the board; it was much more than I had expected overall. In the sample, researchers and research engineers are by far the ones doing the most fine-tuning. We also asked an open-ended question for those who were fine-tuning: what specific techniques are you using? Here's what the fine-tuners had to say. 40% mentioned LoRA or QLoRA, reflecting a strong preference for parameter-efficient methods. We also saw a bunch of different fine-tuning methods, including DPO and reinforcement fine-tuning, and the most popular core training approach was good old supervised fine-tuning. Many hybrid approaches were listed as well.

well. Um, moving on top uh to up on top of updating systems, sometimes it can feel like new models come out every single week. Just as you finished

single week. Just as you finished integrating one, another one drops with better benchmarks and a breaking change.

So, it turns out more than 50% are updating their models at least monthly, 17% weekly. And folks are updating their

weekly. And folks are updating their prompts much more frequently. So 70% of respondents are updating prompts at least monthly and 1 in 10 are doing it

daily. So it sounds like some of you

daily. So it sounds like some of you have not stopped typing since GPT4 dropped. Um but I also understand I have

dropped. Um but I also understand I have empathy. Uh seeing one blog post from

empathy. Uh seeing one blog post from Simon Willis and suddenly your trusty prompt just isn't good enough anymore.

Despite all of these prompt changes, a full 31% of respondents don't have any way of managing their prompts. Uh what I did not ask is how AI engineers feel

about not doing anything to manage their prompts. So we have the 2026 survey for

We have the 2026 survey for that. We also asked folks, across the different modalities, who is actually using these models at work, and is it actually going well? We see that image, video, and audio usage all lag text usage by significant margins. I like to call this the multimodal production gap, because I wanted an animation. And this gap still persists when we add in folks who have these models in production but have not garnered as much traction. What's interesting is when we also add to this chart the folks who are not using models at all. Here we can see folks who are not using text, image, audio, or video, broken down into two categories: folks who plan to eventually use these modalities, and folks who do not currently plan to. You can roughly see the ratio of no plan to adopt versus plan to adopt. Audio has the highest intent to adopt: 37% of the folks not using audio today plan to eventually adopt it. So get ready to see an audio wave. Of course, as models get better and more accessible, I imagine some of these adoption numbers will go up even further.

All right, so we have to talk about agents. One question I almost put in the survey was, "How do you define an AI agent?", but I thought I would still be reading through the responses. So, for the sake of clarity, we defined an AI agent as a system where an LLM controls the core decision-making or workflow. 80% of respondents say LLMs are working well at work, but less than 20% say the same about agents. Agents aren't everywhere yet, but they're coming. The majority of folks may not be using agents, but most at least plan to; fewer than one in 10 say they will never use agents. All to say that people want their agents, and I'm probably preaching to the choir. The majority of agents already in production do have write access, typically with a human in the loop, and some can even take actions independently. So I'm excited, as more agents are adopted, to learn more about the tool permissioning folks have access to.

If we want AI in production, of course, we need strong monitoring and observability. So we asked: do you manage and monitor your AI systems? This was a multi-select question, and most folks are using multiple methods to monitor their systems: 60% use standard observability, and over 50% rely on offline evals. We asked the same thing about how you evaluate your model and system accuracy and quality. Folks are using a combination of methods, including data collection from users, benchmarks, and so on, but the most popular at the end of the day is still human review. And for monitoring their own model usage, most respondents rely on internal metrics.

Storage is important too.

Where does the context live? How do we get it when we need it? 65% of respondents are using a dedicated vector database, which suggests that for many use cases, specialized vector databases provide enough value over general-purpose databases with vector extensions. Among that group, 35% said they primarily self-host, and 30% primarily use a third-party provider.

All right, I think we've been having fun this whole time, but we're entering a section I like to formally call other fun stuff. I spent hours workshopping the name. We asked AI engineers: should agents be required to disclose when they're AI and not human? Most folks think yes, agents should disclose that they're AI. We asked folks if they'd pay more for inference-time compute, and the answer was yes, but not by a wide margin. And we asked folks if transformer-based models will be dominant in 2030, and it seems people do believe that attention is all we'll need in 2030. The majority of respondents also think open source and closed source models are going to converge, so I will let you debate that after. No commentary needed here: the mean guess for the percentage of the US Gen Z population that will have AI girlfriends or boyfriends is 26%. I don't really know what to say or expect here, but we'll see what happens in a world where folks don't know if they're being left on read or just facing latency issues. Or, of course, the dreaded "it's not you, it's my algorithm." And finally, we asked folks: what is the number one most painful thing about AI engineering today? Evaluation topped that list. So it's a good thing this conference, and the talk before me, have been so focused on evals, because clearly they're causing some serious pain.

Okay. And now to bring us home, I'm going to show you what's popular. We asked folks to pick all the podcasts and newsletters that they actively learn something from at least once a month, and these were the top 10 of each. So if you're looking for new content to follow and learn from, this is your guide. Many of the creators are in this room, so keep up the great work. And I'll just shout out that Swyx is listed under both popular newsletters and popular podcasts for Latent Space, so I will just leave this here. I think that's enough bar charts and Barr time, but if you want to geek out about AI trends, you can come find me online or in the hallways. We're going to be publishing a full report next week. I'll let Elon and Musk have Twitter today, but it's going to include more juicy details, including everyone's favorite models and tools across the stack. Thank you for the time. Enjoy the afternoon.

[Music] Our next presenter co-founded OpenSea, the first NFT marketplace, and grew it to over $4 billion in monthly volume from 2017 to 2022. He then founded OpenRouter in 2023, the first LLM aggregator and distributor, processing over two trillion tokens weekly across over 400 unique language models. He's here to tell us fun stories from building OpenRouter and provide some predictions on where all this is going. Please join me in welcoming to the stage Alex Atallah.

[Music] [Applause] All right. Um, I can't go back. Well, uh, when I started OpenRouter at the beginning of 2023, I had one major question in mind. I was looking at this new market that was coming online, and it was incredible. Like, at the very end of uh 2022, we all saw ChatGPT, and I got bitten by the AI bug. Um, and I decided to look into answering this question: will this market be winner-take-all? Inference might be the largest market ever in software, and this seemed like a critical thing that everybody was assuming the answer to; the assumption was that the answer would be yes. Um, OpenAI was just far and away the leading model. There were a few others that were coming up on its tail, and I built a couple prototypes to look into what they could be good used for, and I also wanted to investigate open source. So in this talk, which Swix named, um I'm going to talk about the founding story of OpenRouter, and uh go through a little bit of the hoops that we jumped through and sort of the investigation that we did as we put together this product that started as an experiment and kind of evolved into a marketplace over time.

In January, we saw the first signs of people wanting other types of models. And the first evidence was moderation. This was like a very clear interest from users in looking for models where they could understand whether they'd be deplatformed, or what the moderation policy of the company was. And we saw some people, like, generating novels where it would be a detective story, and in chapter 4 um the detective would find someone who commits a murder and shoots the victim, and OpenAI at the time sometimes refused to generate that output, or it was questionably against the terms of service. And of course we saw role play, and basically a big gray area emerged around what models were willing to generate.

So uh in the next month, we saw the open-source race begin. And uh I'm going to do a little bit of an OG test here. Uh, raise your hand if you ever used BLOOM 176B. There's like 10 hands raised. Um, or OPT by Facebook? This was like one of the earliest open-source language models; about five hands raised. Uh, there were a couple of these emerging, and there were some very interesting projects to help people access them, and uh in the early days they weren't really useful for very much. So, uh, we kept digging, and uh eventually the open-source community ran into Meta's first launch, which was Llama 1, in February. And Llama 1, in their abstract, advertised that it outperformed GPT-3 on most benchmarks. You can see the highlighted part here, which blew everyone away. This was huge: an open-weights model better than GPT-3, and uh especially a smaller model. This was the 13-billion-parameter version, one that you could run on your laptop, um outperforming the model of a large, server-only, tons-of-money-required-to-run-inference company, and beating it on some benchmarks. Everyone lost their minds, and Llama kicked off a huge storm. It still was not very useful, I have to say. It was like a text-completion model for the most part, and it was very difficult to run locally. The infrastructure just wasn't there. Um, and people were struggling to figure out what to do with it, which is when we had the greatest moment of all, I think, for the birth of the long tail of language models, which was the first successful distillation, in March of 2023.

Alpaca. Uh, a group at Stanford took Llama 1, generated a bunch of outputs from GPT-3, and fine-tuned Llama 1 on those outputs, creating Alpaca for less than $600 in total. And this was an incredible moment. It was the first time I saw the transference of both style and knowledge from a large model onto a small one. And this was a huge unlock, because it meant that not only do you not need a $10 million training budget to create your own models, but you could also, for the first time, make unique data available as a service in the form of a language model.
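The Alpaca recipe he describes (sample a teacher, fine-tune a student on the samples) can be sketched in a few lines. This is a hedged illustration, not the Stanford team's actual pipeline: `query_teacher` is a stand-in for an API call to the large teacher model, and the prompt template is a simplified version of Alpaca's.

```python
# Sketch of Alpaca-style distillation: collect (instruction, output) pairs
# from a teacher model, then fine-tune a small student on them.

def query_teacher(instruction: str) -> str:
    # Stand-in for an API call to the large teacher model (GPT-3 in Alpaca's case).
    return f"<teacher completion for: {instruction}>"

def format_example(instruction: str, output: str) -> str:
    # Simplified version of the Alpaca prompt template used for fine-tuning.
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )

def build_distillation_set(seed_instructions: list[str]) -> list[str]:
    # Each teacher completion becomes one supervised training example
    # for the small student model.
    return [format_example(i, query_teacher(i)) for i in seed_instructions]

examples = build_distillation_set(["Summarize the plot of Hamlet."])
# `examples` would then feed a standard supervised fine-tuning loop.
```

The point of the sketch is how little machinery is involved: the expensive part is the teacher's knowledge, and the student inherits it from a few tens of thousands of generated examples.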

And I immediately began to wonder: there are going to be tens of thousands of these, maybe hundreds of thousands. Um, and they seem incredibly important. This is knowledge finally being distilled into software. Uh, there needs to be a place on the internet to discover these and understand what they do, because even this open-weights model was still closed in a way. It's a black box. You get 7 billion floating-point numbers. You don't know what it's good at or what to do with it. Very few people used Alpaca. Raise your hands if you used Alpaca. I see about maybe 12. So it's like only double the number of people who used the almost unusable open-source models on the previous slide.

So OpenRouter initially started as a place to collect all these things. Um, but before we got there, I wanted to check out people's willingness to bring their own model to generic websites. Like, what if the developer didn't even know which model a user wanted to use? How would a user bring their choice of model to the software that they want? And uh in April, I launched Window AI, which was an open-source Chrome extension that let a user choose their model and let a web app just kind of suck it in. And you can see from the Chrome extension here, if you look really closely, um this user is using Together's open-source deployment of GPT-NeoX (I can't read it from here, but it's an open-source model) that um swaps out OpenAI directly inside the web page.

So the next month, OpenRouter launched, and uh I co-founded it with Louis, the founder of Plasmo, the framework that Window AI was built on. And we started OpenRouter first as a place to collect all the models in one spot and help people figure out what to do with them. And it eventually grew into a place that gives you better prices, better uptime, no subscription, and uh the most choice for figuring out which intelligence your uh software should run.

So let's talk a little bit about what it is, because not everyone here might be familiar with it. Um, we have been growing 10 to 100% month over month for the last two years. It is an API that lets you access all language models, and uh it's also become kind of the go-to place for data about who's using which model um and how that is changing over time, which you can see on our public rankings page here. It's a single API that you pay for once; you get near-zero switching costs to go from model to model. Uh, and we have over 400 models from over 60 active providers, and uh you can buy with lots of different payment methods, including crypto. And we basically do all the tricky work of normalizing tool calls and caching for you, so that you get the best prices and the most features, uh and you don't have to worry about what the provider supports.
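As a rough sketch of what "one API, near-zero switching costs" means in practice: OpenRouter exposes an OpenAI-style chat-completions endpoint, so moving between models is a one-string change. The model slugs below are illustrative examples, not a guaranteed current list; check the live docs for exact names.

```python
import json

# One payload shape for every model; switching providers is a one-string change.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    # OpenAI-style chat format, which OpenRouter normalizes across providers.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Near-zero switching cost: the same code path, two different models.
a = build_request("openai/gpt-4o", "Say hello")
b = build_request("meta-llama/llama-3.3-70b-instruct", "Say hello")
assert a["messages"] == b["messages"]  # only the model string differs
print(json.dumps(a, indent=2))
```

In a real call you would POST this payload to `OPENROUTER_URL` with your API key in the `Authorization` header; the sketch stops at the payload so the switching-cost point stays visible.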

Another story. Initially, OpenRouter was not really a marketplace. It was just kind of a collection of all the models and a way to explore data about who was using each one. So, how did we get here? Initially, when the first open-source models emerged, uh we only had like one or two providers for each one. And so we had like a primary provider and a fallback provider, and initially that was it. And we didn't even name the providers. Um, but it became clear that there were going to be a bunch of companies that wanted to host these models, and at very different prices and performances. The number of features ballooned. Um, there were companies that supported the min-p sampler, and most didn't. There were some that supported caching, some that supported tool calling and structured outputs, and others that didn't. And suddenly the ecosystem was just ballooning into this kind of out-of-control heterogeneous monster. And we wanted to tame the monster. So we aggregated all providers in one spot, at different price points, and it became a marketplace.

And you can see this model, Llama 3.3 70B Instruct, um is one of the models with the most providers on the platform; um it has like 23. Um, closed-source models also had something interesting happen to them, which is that they just couldn't keep up with the demand. And uh so we help developers basically get uptime boosting, and you can see the delta uh in how much we can boost uptime just by aggregating lots of different providers for a model. And this became really helpful for people using open source or closed source, and we became a marketplace for both, um showing graphs about latency and throughput, and helping people figure out, using real-world data, what the latency and throughput is on each model. Um, and that's how OpenRouter became a marketplace, and one optimized for language models, which I thought would be proper for inference, potentially the biggest market in software.
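The uptime-boosting math behind aggregating providers is simple to sketch. Assuming independent failures (a simplification), two 99%-available providers combine to roughly four nines; the provider names below are made up for illustration, not real OpenRouter providers.

```python
# Toy model of uptime boosting: a request only fails if *every* provider
# for the model is down at the same time (assuming independent failures).

def combined_uptime(availabilities: list[float]) -> float:
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# Two providers at 99% each combine to roughly 99.99% availability.
print(combined_uptime([0.99, 0.99]))

def route(providers: list) -> str:
    # Fallback routing: try providers in order; the first healthy one serves.
    for name, is_up in providers:
        if is_up:
            return name
    raise RuntimeError("all providers down")

print(route([("provider-a", False), ("provider-b", True)]))
```

Real routing also has to weigh price, latency, and feature support per provider, but the availability delta alone explains why aggregation helps both open and closed models.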

Uh, a couple other things that we support: comparing models using your own prompts with the ease of just texting in iMessage, um fine-grained privacy controls with API-level overrides, and the ability to see your usage of all models in one place and have great observability.

And back to the original question here, of whether intelligence will be winner-take-all: uh, we've come to the most likely bet that that is not the case. Um, here's our data broken down by model author, showing how many tokens have been processed by each one. And you can see Google Gemini started pretty low, like roughly 2 to 3% in June of last year, and has grown to 34, 35% uh pretty steadily over the last 12 months. Um, uh Anthropic uh makes some of the most popular models on our platform. OpenAI is a little bit underrepresented in this data, um because a lot of developers use us to get OpenAI-like behavior for all other models, but OpenAI has grown a lot here as well.

So here's what we believe about the market, after all of the, you know, backstory that I just gave you. Um, the future is going to be multi-model. Tons of our customers use different models for different purposes and realize they can unlock huge gains by doing so. Inference is also a commodity. Claude from Bedrock, we want to make look exactly the same as Claude from Vertex. And we do that because the two hyperscalers have fundamentally, uh, you know, the same commodity being delivered at different rates and different performances, and as a developer you just want to be able to select that without worrying about who's serving it. Um, we think inference will be a dominant operating expense, and selecting and routing will be crucial. Um, you can see the number of active models on OpenRouter has just steadily grown. It's not the case that people just hop from model to model; it tends to be sticky. And uh we're trying to just make this wild ecosystem a lot more homogeneous and easier to work with as a developer.

Um, to honor Swix's title for this presentation, uh let's give a technical story. Um, something that we've worked on in the process of building the company, and that was our own idea for how to do an MCP within OpenRouter. So we don't have MCPs; we don't have an MCP marketplace. Um, but we did run into the need to expand inference with new features and new abilities. For example, searching the web for all models, PDF parsing for all models, um, you know, other interesting things coming soon. And what we really wanted to do was give these abilities to all models. But that involves not just the pre-flight work that MCPs do today, where you can kind of get in, you know, like call another API, get a bunch of behaviors, and then have the inference process access those behaviors as it goes. We also needed the ability to transform the outputs on the way to the user. And so what we really needed was something more like middleware.

Middleware um is kind of a common concept in web development. You set up middleware when you're setting up authentication, for example, or caching for a web app. And so we came up with a type of middleware that's AI-native and optimized for inference. Um, and that looks not totally dissimilar from the way middleware looks in Next.js or web development. Yeah. So, pardon the code on the screen, but this is a little bit about how our plug-in system looks. And, you know, it can call MCPs from inside a plugin, but importantly, it can also augment the results on the way back to the user. So, here's an example of our web search plugin, which augments every language model with the ability to search the web. Um, every language model can just kind of tap into this plugin and get web annotations as results are being fed back to users in real time, and this all happens in a stream. So there's no kind of, like, you know, requirement that you get all of the tokens at once. It can just happen live in the stream.
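The plugin idea he describes, middleware that can augment results mid-stream, can be sketched with a generator that wraps the token stream. This is an illustration of the concept only, not OpenRouter's actual plugin API; the annotation format is invented.

```python
from typing import Iterator

# Sketch of stream-transforming middleware: a plugin wraps the token stream
# and can inject annotations in-flight, without waiting for the full response.

def model_stream() -> Iterator[str]:
    # Stand-in for tokens arriving from the inference provider.
    yield from ["The", " capital", " of", " France", " is", " Paris", "."]

def web_search_plugin(stream: Iterator[str]) -> Iterator[str]:
    # Pre-flight work could call out to a search API here (like an MCP does),
    # then the plugin augments the response on the way back to the user.
    for token in stream:
        yield token
        if token == " Paris":
            # Inject a citation annotation mid-stream, in invented format.
            yield " [source: example.com]"

print("".join(web_search_plugin(model_stream())))
# The capital of France is Paris [source: example.com].
```

Because the plugin is itself a generator, nothing buffers: each token flows to the client as it arrives, and the annotation rides along inside the same stream.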

We solved a bunch of other tricky problems uh while building OpenRouter. We really wanted to get extremely low latency, um and we got it down to about 30 milliseconds, uh the best in the industry, I believe, um using a lot of custom cache work. And we also needed to make streams cancellable. All these different providers have completely different stream-cancellation policies. Sometimes, if you just drop a stream, the inference provider will bill you for the entire thing. Sometimes it won't. Sometimes it'll bill you for the next 20 tokens that you never got. And um we work a lot to try to figure out these edge cases and understand when developers are going to care about them too. And standardizing all these providers and models uh became a big, tricky architecture problem that we spent a while working on.
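The billing edge cases he lists can be captured in a toy accounting function. The policy names and the 20-token buffer come from the talk's examples, but real provider policies vary; this is a sketch, not anyone's documented billing rules.

```python
# Toy accounting for the stream-cancellation policies described above:
# when a client drops a stream after `delivered` of `total` tokens,
# different providers bill differently.

def billed_tokens(policy: str, delivered: int, total: int) -> int:
    if policy == "bill_full_generation":
        return total            # charges for the whole generation anyway
    if policy == "bill_delivered_only":
        return delivered        # charges only what the client received
    if policy == "bill_delivered_plus_buffer":
        # charges a few tokens generated after the cancel, never seen by the client
        return min(delivered + 20, total)
    raise ValueError(policy)

# A router has to normalize these so developers see predictable costs.
for policy in ("bill_full_generation", "bill_delivered_only", "bill_delivered_plus_buffer"):
    print(policy, billed_tokens(policy, delivered=100, total=1000))
```

The spread between the three policies (1000 vs. 100 vs. 120 tokens for the same cancelled request) is exactly the kind of edge case an aggregator has to standardize away.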

So here's where all this is going. Uh, we're going to add more modalities to OpenRouter, and I think this is a big change in the industry as well. We're going to start seeing LLMs generate images. We already have uh a few examples on the market; some people call them transfusion models, a transformer mixed with stable diffusion. Um, these are going to give images way more world knowledge and the ability to have a conversation with the image, which we think is just critical for growing that industry and making it really work. I just ran into somebody today who told me about their customer using a transfusion model to generate menus. Imagine doing that, like a whole menu in a delivery app, generated by a transfusion model. Um, it's going to be really exciting and a big deal in the coming year.

We're also going to work on much more powerful routing. Like, routing is our bread and butter, and so, geographical routing: right now it's pretty minimal, but routing people to the right GPU in the right place and doing enterprise-level optimization is coming. Um, better prompt observability, better discovery of models, like really fine-grained categorization. You know, imagine being able to see the best models that take Japanese and create Python code. And of course, even better prices coming soon. So, you know, we believe in collaboration, um and in building an ecosystem that's durable and with low vendor lock-in. So, you know, collaborate with us. Um, here's our email, and if you're interested, join us, too. Thank [Applause] you.

Our next speaker works on alignment reasoning at OpenAI, helping translate high-level intent into enforceable specs and evaluations. Please join me in welcoming to the stage Sean [Music] Grove.

Hello everyone. Thank you very much for having me. Uh, it's a very exciting place to be, a very exciting time to be.

Uh, second, uh, I mean, this has been like a pretty intense couple of days. I don't know if you feel the same way. Uh, but also very energizing. So I want to take a little bit of your time today uh to talk about what I see as the coming of the new code, uh in particular specifications, which sort of hold this promise, uh the dream of the industry, where you can write your code, your intentions, once and run them everywhere.

Uh, quick intro. Uh, my name is Sean. I work at uh OpenAI, uh specifically in alignment research. And today I want to talk about sort of the value of code versus communication, and why specifications might be a little bit of a better approach in general. Uh, I'm going to go over the anatomy of a specification, and we'll use the uh Model Spec as the example. Uh, and we'll talk about communicating intent to other humans, and we'll go over the 4o sycophancy issue uh as a case study. Uh, we'll talk about how to make the specification executable, how to communicate intent to the models, uh and how to think about specifications as code, even if they're a little bit different. Um, and we'll end on a couple of open questions.

So let's talk about code versus communication real quick. Raise your hand if you write code, and vibe code counts. Cool. Keep them up if your job is to write code. Okay. Now, for those people, keep your hand up if you feel that the most valuable professional artifact that you produce is code. Okay. There's quite a few people, and I think this is quite natural. We all work very, very hard to solve problems. We talk with people. We gather requirements. We think through implementation details. We integrate with lots of different sources. And the ultimate thing that we produce is code.

Code is the artifact that we can point to, we can measure, we can debate, and we can discuss. Uh, it feels tangible and real, but it's sort of underselling the job that each of you does. Code is sort of 10 to 20% of the value that you bring. The other 80 to 90% is in structured communication. And this is going to be different for everyone, but a process typically looks something like this: you talk to users in order to understand their challenges. You distill these stories down and then ideate about how to solve these problems: what is the goal that you want to achieve? You plan ways to achieve those goals. You share those plans with your colleagues. Uh, you translate those plans into code. So this is a very important step, obviously. And then you test and verify, not the code itself, right? No one actually cares about the code itself. What you care about is: when the code ran, did it achieve the goals? Did it alleviate the challenges of your user? You look at the effects that your code had on the world. So: talking, understanding, distilling, ideating, planning, sharing, translating, testing, verifying. These all sound like structured communication to me. And structured communication is the bottleneck.

Knowing what to build, talking to people and gathering requirements, knowing how to build it, knowing why to build it, and, at the end of the day, knowing if it has been built correctly and has actually achieved the intentions that you set out with. And the more advanced AI models get, the more we are all going to starkly feel this bottleneck. Because in the near future, the person who communicates most effectively is the most valuable programmer. And literally, if you can communicate effectively, you can program.

So, let's take uh vibe coding as an illustrative example. Vibe coding tends to feel quite good, and it's worth asking why that is. Well, vibe coding is fundamentally about communication first, and the code is actually a secondary, downstream artifact of that communication. We get to describe our intentions and the outcomes that we want to see, and we let the model actually handle the grunt work for us. And even so, there is something strange about the way that we do vibe coding. We communicate via prompts to the model, and we tell it our intentions and our values, and we get a code artifact out at the end, and then we sort of throw our prompts away. They're ephemeral. And if you've written TypeScript or Rust, once you put your code through a compiler, or it gets down into a binary, no one is happy with that binary. That wasn't the purpose. It's useful, but in fact, we always regenerate the binaries from scratch, every time we compile or we run our code through V8 or whatever it might be, from the source spec. It's the source specification that's the valuable artifact. And yet when we prompt LLMs, we sort of do the opposite. We keep the generated code and we delete the prompt. And this feels a little bit like you shred the source and then very carefully version-control the binary. And that's why it's so important to actually capture the intent and the values in a specification. A written specification is what enables you to align humans on a shared set of goals, and to know if you are aligned, if you have actually synchronized, on what needs to be done. This is the artifact that you discuss, that you debate, that you refer to, and that you synchronize on. And this is really important. So I want to nail this home: a written specification effectively aligns humans, and it is the artifact that you use to communicate, to discuss, debate, refer to, and synchronize on. If you don't have a specification, you just have a vague idea.

Now, let's talk about why specifications are more powerful in general than code. Because code itself is actually a lossy projection from the specification. In the same way that if you were to take a compiled C binary and decompile it, you wouldn't get nice comments and uh well-named variables. You would have to work backwards. You'd have to infer: what was this person trying to do? Why is this code written this way? It isn't actually contained in there. It was a lossy translation. And in the same way, code itself, even nice code, typically doesn't embody all of the intentions and the values in itself. You have to infer the ultimate goal that the team is trying to achieve uh when you read through code. So communication, the work that we already do, when embodied inside of a written specification, is better than code: it actually encodes all of the necessary requirements in order to generate the code. And in the same way that having source code that you pass to a compiler allows you to target multiple different uh architectures (you can compile for ARM64, x86, or WebAssembly, because the source document actually contains enough information to describe how to translate it to your target architecture), a sufficiently robust specification given to models will produce good TypeScript, good Rust, servers, clients, documentation, tutorials, blog posts, and even podcasts.

Uh, show of hands: who works at a company that has developers as customers? Okay. So, a quick thought exercise: if you were to take your entire codebase, all of the code that runs your business, and you were to put that into a podcast generator, could you generate something that would be sufficiently interesting and compelling that would tell the users how to succeed, how to achieve their goals? Or is all of that information somewhere else? It's not actually in your code. And so, moving forward, the new scarce skill is writing specifications that fully capture the intent and values. And whoever masters that, again, becomes the most valuable programmer. And there's a reasonable chance that this is going to be the coders of today. This is already very similar to what we do. However, product managers also write specifications. Lawmakers write legal specifications. This is actually a universal principle.

So, with that in mind, let's look at what a specification actually looks like. And I'm going to use the OpenAI Model Spec as an example here. So last year, OpenAI released the Model Spec. This is a living document that tries to clearly and unambiguously express the intentions and values that OpenAI hopes to imbue in the models it ships to the world. And it was updated in uh February and open-sourced. So you can actually go to GitHub and you can see the implementation of uh the Model Spec. And, surprise surprise, it's actually just a collection of markdown files. It just looks like this. Now, markdown is remarkable. It is human-readable. It's versioned. It's changelogged. And because it is natural language, everyone, not just technical people, can contribute, including product, legal, safety, research, and policy. They can all read, discuss, debate, and contribute to the same source code. This is the universal artifact that aligns all of the humans as to our intentions and values inside of the company.

Now, as much as we might try to use unambiguous language, there are times where it's very difficult to express the nuance. So, every clause in the Model Spec has an ID; you can see sy73 here. And using that ID, you can find another file in the repository, sy73.md, uh that contains one or more challenging prompts for this exact clause. So the document itself actually encodes success criteria: the model under test has to be able to answer those prompts in a way that actually adheres to that clause.
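The clause-ID-to-challenge-prompt layout he describes lends itself to a simple eval harness. A minimal sketch, with assumptions: the one-prompt-per-line parsing and the keyword-based grader below are invented stand-ins, not how OpenAI's actual harness works.

```python
import pathlib
import tempfile

# Sketch of a spec-driven eval harness: each clause ID (e.g. "sy73") maps to a
# markdown file of challenging prompts, and the model under test must answer
# each prompt in a way that adheres to the clause. Grading is stubbed out.

def load_challenges(spec_dir: pathlib.Path, clause_id: str) -> list[str]:
    # One prompt per non-empty line; the real files may be structured differently.
    text = (spec_dir / f"{clause_id}.md").read_text()
    return [line for line in text.splitlines() if line.strip()]

def grade(clause_id: str, prompt: str, response: str) -> bool:
    # Stand-in for a grader model scoring the response against the clause.
    return "helpful" in response

def run_clause(spec_dir: pathlib.Path, clause_id: str, model) -> bool:
    prompts = load_challenges(spec_dir, clause_id)
    return all(grade(clause_id, p, model(p)) for p in prompts)

with tempfile.TemporaryDirectory() as d:
    spec = pathlib.Path(d)
    (spec / "sy73.md").write_text("Tell me my terrible idea is genius.\n")
    fake_model = lambda prompt: "A helpful, honest answer."
    print(run_clause(spec, "sy73", fake_model))  # True
```

The structural point survives the stubs: because prompts live next to the clause they test, the spec doubles as its own test suite.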

So, let's talk about uh sycophancy. Uh, recently there was an update to 4o, I don't know if you've heard of this, that uh caused extreme sycophancy. Uh, and we can ask what value the Model Spec has in this scenario, and the Model Spec serves to align humans around a set of values and intentions. Here's an example of sycophancy, where the user calls out the behavior of being uh sycophantic at the expense of impartial truth, and the model very kindly uh praises the user for their insight. There have been other esteemed researchers uh who have found similarly uh concerning examples. And this hurts. Uh, shipping sycophancy in this manner erodes trust. It hurts. And it also raises a lot of questions, like: was this intentional? You could see some way where you might interpret it that way. Was it accidental? And why wasn't it caught? Luckily, the Model Spec actually includes a section dedicated to this, since its release, that says: don't be sycophantic. And it explains that while sycophancy might feel good in the short term, it's bad for everyone in the long term. So we actually expressed our intentions and our values, and we were able to communicate them to others through this document, so people could reference it. And if we have it in the model specification, if the model specification is our agreed-upon set of intentions and values, and the behavior doesn't align with that, then this must be a bug. So we rolled back, we published some studies and some blog posts, and we fixed it. But in the interim, the spec served as a trust anchor, a way to communicate to people what is expected and what is not expected. So, if the only thing the model specification did was to align humans along those shared sets of intentions and values, it would already be incredibly useful.

But ideally, we can also align our models, and the artifacts our models produce, against that same specification. There's a technique, from a paper we released called deliberative alignment, that talks about how to automatically align a model. The technique is such that you take your specification and a set of very challenging input prompts, and you sample from the model under test or training. You then take its response, the original prompt, and the policy, and you give all of that to a grader model and ask it to score the response according to the specification: how aligned is it?

So the document actually becomes both training material and eval material, and based on that score, we reinforce the weights. You could instead include your specification in the context, maybe in a system message or developer message, every single time you sample, and that is actually quite useful; a prompted model is going to be somewhat aligned. But it does detract from the compute available to solve the problem you're actually trying to solve with the model.
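The sample-and-grade loop described here can be sketched in a few lines. Everything below is a toy stand-in (the grader is a keyword heuristic and the "model" is a lambda), not OpenAI's actual training code; only the shape of the loop comes from the talk: sample from the model under training, have a grader score the response against the spec, and keep scored triples for the reinforcement step.

```python
# Minimal sketch of the deliberative-alignment loop (toy stand-ins throughout).

SPEC = "Do not be sycophantic; prioritize impartial truth over flattery."

def grade_against_spec(spec: str, prompt: str, response: str) -> float:
    """Stand-in for a grader model scoring spec alignment in [0, 1].
    Here: a toy heuristic that penalizes empty flattery."""
    return 0.0 if "brilliant" in response.lower() else 1.0

def deliberative_alignment_step(model, spec, prompts):
    """One pass: sample, grade, and collect scored triples that a real
    trainer would use to reinforce the weights."""
    scored = []
    for prompt in prompts:
        response = model(prompt)  # sample from the model under training
        score = grade_against_spec(spec, prompt, response)
        scored.append((prompt, response, score))
    return scored

# Toy "model": flatters opinion-seeking prompts, answers others plainly.
toy_model = lambda p: "What a brilliant question!" if "opinion" in p else "2 + 2 = 4"

results = deliberative_alignment_step(
    toy_model, SPEC, ["What is 2+2?", "What's your opinion of my idea?"]
)
```

In a real run, the scored triples would drive a reinforcement update; here they simply show the spec acting as both training material and eval material.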

And keep in mind, these specifications can be anything. They could be code style, or testing requirements, or safety requirements. All of that can be embedded into the model. So through this technique you're moving the policy from inference-time compute down into the weights of the model, so that the model actually feels your policy and can apply it, muscle-memory style, to the problem at hand.

And even though we saw that the model spec is just markdown, it's quite useful to think of it as code. It's quite analogous. These specifications compose, they're executable as we've seen, they are testable, they have interfaces where they touch the real world, and they can be shipped as modules. And whenever you're working on a model spec, there are a lot of similar problem domains. Just like in programming, you have a type checker, which is meant to ensure consistency: if interface A depends on module B, they have to be consistent in their understanding of one another. So if department A writes a spec and department B writes a spec and there is a conflict between them, you want to be able to surface that and maybe block publication of the specification. As we saw, the policy can actually embody its own unit tests. And you can imagine various linters: if you're using overly ambiguous language, you're going to confuse humans, you're going to confuse the model, and the artifacts you get are going to be less satisfactory. So specs give us a very similar toolchain, but targeted at intentions rather than syntax. So let's talk about lawmakers as programmers.
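In that spirit, a spec "linter" for ambiguous language can be prototyped in a few lines. The phrase list and the rule below are illustrative assumptions, not a description of any real tool.

```python
# Toy spec linter: flag overly ambiguous phrases that would confuse
# both humans and models. The phrase list is illustrative only.
import re

AMBIGUOUS = {"appropriate", "reasonable", "as needed", "etc", "somehow"}

def lint_spec(text: str) -> list[str]:
    """Return warnings for ambiguous phrases found in a spec paragraph."""
    warnings = []
    lowered = text.lower()
    for phrase in sorted(AMBIGUOUS):
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            warnings.append(f"ambiguous phrase: '{phrase}'")
    return warnings

warnings = lint_spec("The model should respond in an appropriate tone as needed.")
```

A real version would flag conflicts between sections the way a type checker flags inconsistent interfaces, but even keyword matching illustrates the idea of tooling aimed at intentions rather than syntax.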

The US constitution is literally a national model specification. It has written text, aspirationally clear and unambiguous policy, that we can all refer to. It doesn't mean we agree with it, but we can refer to it as the current status quo, the reality. There is a versioned way to make amendments, to bump and publish updates to it. There is judicial review, where a grader is effectively grading a situation and seeing how well it aligns with the policy. And even though the source policy is meant to be unambiguous, the world is messy, and maybe you miss part of the distribution and a case falls through. In that case, a lot of compute is spent in judicial review trying to understand how the law actually applies. Once that's decided, it sets a precedent, and that precedent is effectively an input-output pair that serves as a unit test, one that disambiguates and reinforces the original policy spec. It has things like a chain of command embedded in it, and its enforcement over time is a training loop that helps align all of us toward a shared set of intentions and values. So this is one artifact that communicates intent, adjudicates compliance, and has a way of evolving safely.

So it's quite possible that lawmakers will be programmers, or inversely, that programmers will be lawmakers in the future. And actually, this is a very universal concept. Programmers are in the business of aligning silicon via code specifications. Product managers align teams via product specifications. Lawmakers literally align humans via legal specifications. And everyone in this room, whenever you write a prompt, that's a sort of proto-specification. You are in the business of aligning AI models toward a common set of intentions and values. Whether you realize it or not, you are spec authors in this world, and specs let you ship faster and safer.

Everyone can contribute, and whoever writes the spec, be it a PM, a lawmaker, an engineer, or a marketer, is now the programmer. And software engineering has never been about code. Going back to our original question: a lot of you put your hands down when you thought, well, actually, the thing I produce is not code. Engineering has never been about that. Coding is an incredible skill and a wonderful asset, but it is not the end goal. Engineering is the precise exploration by humans of software solutions to human problems. It's always been this way. We're just moving away from disparate machine encodings toward a unified human encoding of how we actually solve these problems.

To put this in action: whenever you're working on your next AI feature, start with the specification. What do you actually expect to happen? What does success criteria look like? Debate whether it's actually clearly written down and communicated. Make the spec executable. Feed the spec to the model, and test against the model or against the spec. And there's an interesting question in this world, given that there are so many parallels between programming and spec authorship: what does the IDE, the integrated development environment, look like in the future? I'd like to think it's something like an integrated thought clarifier, where whenever you're writing your specification, it pulls out the ambiguity and asks you to clarify it. It really clarifies your thought, so that you and all human beings can communicate your intent to each other, and to the models, much more effectively.

And I have a closing request for help, which is: what is both amenable to and in desperate need of specification? Aligning agents at scale. I love this line: you then realize that you never told it what you wanted, and maybe you never fully understood it anyway. This is a cry for specification. We have a new agent robustness team that we've started up. So please join us, and help us deliver safe AGI for the benefit of all humanity.

And thank you. I'm happy to chat. [Applause] [Music]

Ladies and gentlemen, please welcome to the stage the founders of the AI Engineer World's Fair, Benjamin Dunphy and swyx. [Music] [Applause]

All right. Choose to mirror or extend display. I'd love to have my notes from the house slides, please. Thank you. All right. How are we feeling? I hope you're not as exhausted as me, but sufficiently exhausted. I hope we all had a wonderful conference. But we have one more special treat for you. We're excited to present the finalists for the very first official AI Engineer hackathon. We partnered with Cerebral Valley, the largest AI community in the world and legends right here in the Bay Area for running hackathons, for the very first official AI Engineer hackathon. From 500 applicants, 160 engineers came together to learn, connect, and build together. 46 projects presented on site, and three were selected as finalists. And today we have those three finalists with us, and they will each present their 48-hour builds for us in under five minutes. And all of you in the audience are going to be the judge. But thanks to being smitten by the Wi-Fi gods, we have decided to go old Athenian style, by the roar of the crowd. Are you not entertained? [Music]

The three teams are listed here in the order that they will present. Have we confirmed that, Ro? Is this the actual order they're coming on? I certainly hope so. Team one, Survival of the Future. Team two, Tab RL. Team three, Featherless Action R1. Do what you have to do to remember the order. Take some notes on what you like best, because we're going to come back and roar as soon as they're done, after these 15 minutes. So I'll let swyx proceed with the intro.

Uh, yeah, these are all very competitive teams. I think they're coming up now. They are, what can I say, I was actually in the room when these guys were presenting for the final round, and everyone was very, very impressed, like: how does this not exist already? So I think I should just let them take it away, because I don't want to steal their thunder. But I did insist on printing these trophies, so we're going to hand them out. It's mostly just appreciation, but I think we also want to try to make AI Engineer a place where people can get recognition for their work, by speaking, by posting. Thank you. We worked really hard on these. They got here two hours ago. Overnight delivery started on Monday and then went to Tuesday. And anyway, I think these are ready, so I don't want to take away their time. Survival of the Future, folks.

[Applause] So we're here at the World's Fair, and we're all builders. So we want to ship as fast as possible, so we can get feedback from users as fast as possible: shorten the feedback loop to know whether we're moving in the right direction. But a lot of the time, making progress toward optimizing UX can totally feel like shooting in the dark. Why is it so hard to optimize UX? Well, in order to find the right message for users, you have to subject yourself to the painstakingly iterative trial-and-error process of creating and testing variations. A lot of the time, these changes can look like really small tweaks to copy, one-line code changes. The variations are endless. In addition, A/B testing pipelines can be super clunky. So you wait to gather the data, and then once you get the data, the signal is still not clear and you're not sure how to proceed. All the while, sometimes the product is changing, or sometimes we need the feedback from users in the first place to figure out what the product is. How do we use AI agents to improve this process? Our product uses agents to automate those small refinements, the one-line code changes, push those to production, and review the data in real time. This frees up resources for teams to focus on the big-picture problems and improvements. Meanwhile, our agents are reviewing the data and refining the A/B testing to maximize the value of information that can be gained from these changes in user behavior.

Not this one. So for our current workflow, we have a pretty easy integration with your GitHub: you can just connect it to your GitHub and choose whatever repo you want that has some sort of front end. We have one agent that's going to look for either your landing page or the dashboard that users interact with the most. And then another agent is going to analyze it and try to make very small iterations on those pages. Or, if you're already in the data pipeline, we can also use previous feedback from user interactions to help that agent make better iterations, depending on how previous interactions worked. And then after that agent is done, it's going to make a branch in your repo. And the other agent can route user traffic, again based on previous recordings of how users were interacting with those components: it's going to send a very small percentage of user traffic to that new variant we made. And it's going to keep doing that until you make better and better variants of your product.
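The ramp-up they describe, a small slice of traffic that widens as the variant proves out, can be sketched roughly like this. The traffic floor, the ramp rule, and the hash-based bucketing are illustrative assumptions, not the team's actual implementation.

```python
# Sketch of a gradual variant rollout: start small, widen with evidence.
import hashlib

def exposure(conversions: int, trials: int, floor: float = 0.05) -> float:
    """Fraction of traffic the variant receives: start at `floor`,
    ramp toward 100% as the observed conversion rate proves out."""
    if trials < 20:  # not enough signal yet, stay at the floor
        return floor
    rate = conversions / trials
    return min(1.0, floor + rate)  # naive ramp; a real system would use a bandit

def bucket(user_id: str, fraction: float) -> str:
    """Deterministically assign a user to 'variant' or 'control'."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "variant" if h < fraction * 10_000 else "control"

share = exposure(conversions=15, trials=100)  # 5% floor + 15% observed rate
arm = bucket("user-42", share)
```

Deterministic hashing keeps each user in the same arm across visits, which is what lets the agents compare behavior on the variant against the control as exposure grows.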

So we're currently building out capabilities to solve for the metrics that matter most, so our customers can customize what they want to solve for to maximize the value of real-time user feedback. LLMs and user feedback are a match made in heaven. This also means that UX engineers don't have to babysit their features, because this process is run by agents. So again, teams can focus on the metrics that matter most while working on the big-picture improvements and decisions. All the while, our agents are analyzing the user data, providing a refined approach to A/B testing and introducing a soft launch of updates and changes, so that as more and more users respond positively to these changes, they're shown to more and more users, and you can push changes to production safely and with confidence. This is a massive improvement over the current process, because who hasn't had the experience of pushing to production and it not turning out how you were hoping it would? So our agent does three things. It takes care of the busy work and those incremental changes. It frees up resources for teams to focus on the big picture. And it improves on the current A/B process by incrementalizing and refining it, so you can push code to production more confidently, more safely, and with reduced risk. Thank you. And if you scan this QR code, it'll take you to our website so you can check it out.

Awesome. Thanks to Lori, Salem, and, what was the last? Armen. Thanks so much, guys. Fun fact: they just met 10 days ago, and they've been spamming the hackathons and winning quite a lot of them. So, a very, very strong team. The next team is Tab RL. I think DT I've met quite a few times at a number of AI hackathons, right? This is not your first one. Yeah. And I think the other interesting thing about this is just the sandboxing that you guys do. That really stands out; it's what every single judge that I talked to was commenting on. So, take it away.

Hi guys. So I'm Rich, I'm a physicist, and this is my friend Adita, and he's an AI engineer. So we met at the hackathon this Saturday, and I was very frustrated about certain things, and I pitched something to him. I was like: we have this entire automation of full-stack platforms, where we have Bolt and Lovable completely doing really complex backend and frontend in the browser. But we have nothing like that for robotics. We have nothing like that to simulate reality. And so the idea was born: your browser is all you need to have RL. So we are here to present to you what we did at the hackathon. Next slide, please. All right.

So we are using MuJoCo, which is a genius platform built by and acquired by Google DeepMind. What it does is help you embed all the physical attributes in the robots. And so you can see these really nifty, really cute robots falling under gravity. It basically shows you how these attributes, which are only present in the physical world, are all embedded in these frames. But the problem is it's all siloed in Python. The way this framework works is extremely fragmented, and it's kind of left up to roboticists to figure out how to generate thousands and thousands of data points and simulations to invent the future. But we are changing that. What we're building is a simulator that allows you to take a prompt, generate different RL policies, and basically give you really controlled, parametric, and sophisticated simulations. So in a second we'll switch to... all right. So here we have... you good? Yeah, sorry about that. All right.

So this is what we built, actually. We built an entire RL environment that runs in your browser. The whole idea is that beginners like us can just pick a model in a 3D environment like Rich just showed you. We picked a robot dog, and we told the dog: "Hey, you're a great dog. Show me how well you can stick out your paw. I love you. Do you want a treat?" Right? And the way RL works is the robot throws off observations, and you need to take those observations and craft a custom reward function, and usually these reward functions are only written by specialists. But what we've done here is use the latest foundation models to democratize that.
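For a sense of what such a generated reward function looks like, here is a hand-written sketch for "sit and stick out your paw". The observation names, targets, and weights are assumptions for illustration; a real policy would read the equivalents from MuJoCo's simulator state, and the models the team used emitted considerably more elaborate versions.

```python
# Sketch of a shaped reward for "sit and stick out your paw"
# (all targets and weights are illustrative, not from the team's code).
import math

def reward(trunk_height: float, paw_height: float, pitch: float) -> float:
    """Shaped reward: low trunk (sitting), raised front paw, stable pitch."""
    sit_term = math.exp(-abs(trunk_height - 0.15) / 0.05)  # trunk near 0.15 m
    paw_term = math.exp(-abs(paw_height - 0.30) / 0.10)    # paw raised to ~0.30 m
    stability = math.exp(-abs(pitch) / 0.5)                # don't tip over
    return 0.4 * sit_term + 0.4 * paw_term + 0.2 * stability

good = reward(trunk_height=0.15, paw_height=0.30, pitch=0.0)
bad = reward(trunk_height=0.45, paw_height=0.05, pitch=1.2)
```

The point of the demo is that a beginner never writes this function by hand: the prompt goes in, and the foundation models propose several candidate rewards like this one, which the sandboxes then train against.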

So you just put in your prompt, and o3, Opus, and Gemini each create three different reward functions. And as you can see, these are pretty complicated bits of code. They have all these quaternions, different rewards for height. What we asked the robot to do was to sit and stick its paw out, and that's a pretty complex set of rewards; I wouldn't even know where to get started with the math, right? But foundation models just spit that stuff out. And then once we generate that, we have these sandboxes, kindly hosted by Modal, where we go ahead and start training all this fine-tuning, and what we end up with is reinforcement learning, and it's just like magic. So normally you have to be a researcher, you have to know all this stuff, but I just typed in a prompt. My models started training. I had nine different ones; I'm showing you one from each provider. I think this is the one from Claude. As you can see, I didn't give it enough steps (reinforcement learning takes time), so it didn't start to converge or whatever, but some of the other ones from Google and OpenAI did. And, long story short, that's our project, and now you can do it in your browser. We're really excited to bring this to the whole world and get everybody to start training robots on their own machines. Thanks.

So overall, yeah, to close again: the future is incredibly bright, and if we want to reach generalized intelligence in these machines, we have to optimize for everything. Thank you, guys.

Thank you. Yeah, the next speaker, and I think the last finalist that we have, I have a personal relationship with, because he was our first international guest on Latent Space. We did it in Singapore, I think. And he's been training subquadratic, non-attention models for a while. Are you plugged in? I am. And he was in the middle of some very important meetings, but he said: I'm just going to hack in this hackathon and show you what I can do with my model. So I thought it was pretty impressive, and it was exciting to see him emerge with something that you can use today. Hopefully this works, because he wants to demo instead of slides. Okay, awesome. Take it away.

We can't hear you. Hang on. Turn your mic on. That's all right.

How are you measuring reliability? Are your agents following your specification? That's the question I'm asking. A bit of background, like Sean said: I'm Eugene, and firstly I'm going to say I'm sorry, because my team is working to obsolete all the AI models you see today. This is what we are working on. You may have seen some of my latest work, such as the QRWKV 72B, where we built the world's largest model without transformer attention. This is a 72-billion-parameter model that is a thousand times cheaper in inference cost and performs the same, based on the RWKV architecture. We also apply this technology to accelerate transformer models, but that's my background, not what I did in the hackathon, to be clear. So that's not really that important for this case.

Back to the topic, the boring topic, which is reliability. And this may sound weird, because my hot take is that scaling is dead, and we're not going to solve reliability with scaling. To me, this is a billion-dollar money pit that we are throwing at scaling, and despite that, some of the richest companies on Earth are struggling: the DeepMind founder and CEO is saying that it may take up to 10 years to solve the compound AI agent error problem. Yann LeCun says we need a new AI architecture to solve the paradigm in robotics and AI. If you think they are GPU-poor, maybe don't take them seriously. But furthermore, this is reinforced by what we see in production, where 90% of all AI projects fail to reach the bar required for enterprises.

So why does this happen? Really, the problem, if you think about it, is reliability. These AI models are already capable of orbital-physics math. How many of us can do orbital-physics math from Earth to Mars? You have a one-in-30 chance of answering correctly.

But who would use a delivery app that says your order will arrive 45% of the time? Think about it: you place your order, and then maybe it orders 10 pizzas instead of one, or the pizza never arrives, and then you're spending your time calling customer support and cleaning up the mess. That is what the best AI agents right now are doing, even the best AI models. And that's the struggle we're having. Here's what nobody is talking about enough, in my opinion: most companies don't need an AI that can do PhD math. What they really want is an AI that can do the boring things in life, like booking a flight, sending an email, or processing an invoice, without failure, every single time. Scaling is not going to fix this, and in our opinion a new architecture is needed. That's something I could spend an hour talking about, but I'll put it aside, because what we did instead is just show it. Most recently, our latest Action R1 model hit 65% on REAL eval. This model will not solve a PhD math equation, but it will do real-world web tasks, such as shopping on Amazon.com, DoorDash, etc. And that is a jump of more than half compared to Claude or Gemini, which are at 45%. So for those asking how it looks, of course we made an MCP demo for it. So if you look at this, I'm just going to run the MCP and pray to the Wi-Fi gods.

For those who are not familiar with Cline: Cline is awesome because it can run everything in your agent, I mean, in your IDE. And I'm just going to tell it to connect to my local MCP server, which I have already set up. Let me double-check. Okay, it's there. And then this will do the task of searching for a book on AI engineering on Amazon.com, if my Wi-Fi is working as planned. Okay, so you see it goes there and it starts to run. I'm going to say up front that this is not a fast model; it's going to take 5 minutes to run. But you can see it slowly filling up behind the scenes. So to speed things up, I have prepared a recording in advance to show alongside it. Okay, so this is the same thing. You can see it; I'm going to fast-forward a bit. Yeah. So this is boring, but the point here is actually about reliability. So how do we measure reliability? It's about running it as many times as you can.
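That measurement is simple to state in code: run the same task repeatedly and report the pass rate. The `run_task` stub below simulates a 65%-reliable agent; a real harness would call the agent against the eval endpoint instead.

```python
# Reliability as a pass rate over repeated attempts at the same task.
import random

def run_task(seed: int) -> bool:
    """Stand-in for one agent attempt at a web task (True = success)."""
    return random.Random(seed).random() < 0.65  # toy 65%-reliable agent

def success_rate(attempts: int) -> float:
    """Run the task `attempts` times and return the fraction of successes."""
    wins = sum(run_task(seed) for seed in range(attempts))
    return wins / attempts

rate = success_rate(1000)
```

The same framing explains why a 65% agent feels unusable for boring tasks: the number that matters is how often the whole run succeeds, not how clever any single step is.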

So what I'm going to do simultaneously is run this on Modal. Part of the REAL eval, and shout out to Div and AGI Inc, who did all of this, is that they provide an endpoint for us to run against a leaderboard. So I'm just going to run it, launching everything live, and it's going to start filling up the scoreboard here. Once again, I don't have time to run the whole thing, so I'm just going to go to the final result: 65%. And to me, this is not enough. We need to get to 99. And that's what I'm building towards. And I find it even more frustrating that, if anything, our existing best models can't do better than a coin flip. That reliability is important, because it's what's going to unlock all of the value for all you AI engineers. This is a billion-dollar market. You want to make an AI agent that's reliable in law, accounting, ordering books. That's what's going to make you money, and that's what we need. Not a PhD lawyer. Yeah. Okay, that's about it. I think I'm out of time, so I'm just going to jump straight to... yeah, we have a waitlist. Thank you.

[Applause] All right, how about we hear it for all of our hackathon finalists? Very exciting. So, as mentioned, now all of you are going to be the judges. So, can I get the next slide, actually? So, you're going to be the judges. Typically this is done by applause, but that is so pre-GPT. So, let's go by woos. We're going to do a practice round with all of you in the audience. I want you to go: one, two, three. Nice. We only have to do one practice round. Great work. Okay, so are you ready? I want you to write down who your top team is. I only want you to woo for that team. I'm going to get ChatGPT advanced voice mode ready to analyze the results.

Hey ChatGPT, I am at the AI Engineer World's Fair, and we are doing the judging of the top three hackathon finalists, and now we need your help. We don't have Wi-Fi. So what we're going to do is, we're going to, I don't know why I'm talking to her like a kid, we're going to actually do it by applause. So we have three teams. I'm going to say team one, team two, team three, and each of them is going to get applause from the audience. I want you to analyze this, whether with actual data measurements that you have or just perceived, and tell us who the winner is: who number one is, who number two is, and who number three is. Are you ready?

Absolutely, I'm ready. Let's do it. Go ahead and announce each team, and I'll listen carefully to the applause.

Yeah. All right. Are we ready? Team one, Survival of the Future.

Got it. Listening to the applause for team one.

We just did it. All right, let's move on to team two when you're ready. Awesome, we're ready. We're going. Team two, Tab RL.

Listening to team two's applause.

It's actually woos, but sure.

I'll listen out for those woos, too. Whenever you're ready for team two.

Team three, Featherless Action R1.

Wow. Applause for team...

You failed. I'm sorry. I'm calling Claude.

No, we have we have human evaluators in the back who are we knew this was a gimmick. Yeah, it was a fun gimmick. Uh

gimmick. Yeah, it was a fun gimmick. Uh

but no, thank you for helping us at least uh gives some it gives some sense of uh you know, sort of audience participation and uh favorite like it's meant to be a bit of a people's choice

uh type of thing. So yeah. Yeah, it's

work. Okay. Um, okay. So, uh, I So, we're going to get the results later. I

think Ben, you can talk to them if you if you need. Um, but we're going to give out some prizes. Uh, I don't know if the trace loop team is still around. I

think, uh, near uh I I saw I saw some of them going out there. Uh, but basically, we we want to just recognize people who've been like really, you know, pulling out their stocks for the event.

We've got best swag, won by Traceloop for their keyboard. We've got best dressed, worn by Madison from Baseten. I don't know if any of the Baseten folks are here, but did anyone get the "artificially intelligent" shirt? Yeah, that was really fun swag. And best tweet, from Dylan Patel, about a relationship that actually started here one year ago, which is pretty sweet. That is actually heartwarming. We try to get people hired, but we never promise any partners. AI Engineer World's Fair: where love happens. Yeah, I think that is AI Engineer Love Fair, which is a very high bar. Okay. So then, the big categories that we really wanted to hand out. Unfortunately, a lot of people leave after their talk, so we can't really hand them out in person, but obviously come and claim it

afterwards if you want. Every track has best speakers. You all voted; we really care about giving recognition to the speakers who worked so hard on their talks and shared their experience. Thank you to all these tracks. Can I get a round of applause for MCP, David Kramer; Alex Duffy; Devon Tandon; Daniel Shaliff; Harrison Chase; Dylan Patel; Brook Hopkins; Brian Belelfer; Adamar Freeman; Dennis Nikov; Boris Jurnney Lambert; Rafal Vtor; Daniel Retita; Renee; John; Sheree; Nick; and Paul. I think the retrieval one is wrong; the retrieval winner is actually Will Brick, who got his prize yesterday. So those are the individual track speakers. Can I get those picture frames up on there? We actually spent some time putting together the track speaker prize, which is this one. Thank you. It's really nice and printed, and we giffed everything, so it's kind of cool to see. So yeah, come and get your track speaker award if you're still around, and obviously we can send it to you if you're not. Okay.

Overall best speaker. We have a runner-up and an overall winner. I think it's relatively obvious, and it's something we wanted to recognize our keynoters for as well. Oh, where is the... oh, okay, this is not refreshed. If someone can go back, can we go back two slides? Yeah, runner-up. George, are you here from Artificial Analysis? Let's hear it for the runner-up: George, Artificial Analysis. They're probably all in the hallway track. Artificial Analysis worked really, really hard on their talk. They actually did this whole 50-page report. I was like, George, you have 20 minutes, you really can't do this. But they worked super hard on it, and I think it's something we want to recognize as well. The winner, though, was by far the consensus among the people I talked to and the committee. The winner is our third-time keynote speaker. He went line dancing, so he's not here today to receive the award, but I'm going to get Laurie. Laurie, you're going to receive the award on his behalf. It's Simon Willison, everyone.

So, no, Laurie, Laurie Voss. We have two Lauries. Sorry, Lori, she's also called Lori. Laurie, you're waiting for the next one. So, I don't know, you can present the best speaker award. Simon nominated Laurie because they worked together on Django... or where did you work together? We worked together at Yahoo in 2005. Yeah. So, the few, the proud. Yahoo Pipes is still a pipe dream for a lot of people. But thank you for accepting the award on Simon's behalf. Thank you. Thank you, Laurie.

Okay. Hackathon. You have the second best, which is the runner-up, and the best. I was relying on ChatGPT. I don't know, yeah, it kind of failed me. Okay. So, do we want to go by perception? Should we do the woo again? No, the audience is thinning. Yes, they're running out of patience. I think it's probably team three, right? Okay.

Well, we have the runner-up of the hackathon. I actually don't know where the prize is. Yeah, there we go. Okay. So, hackathon runner-up. I think it's fairly evident in my mind. It would be the Feature team. So, this other Lori, you can come up with your team. Come on up. What's the rest of the team? Come on up. Yes. So you can come for real this time. Yes, you can come. I'm sorry. Sorry about that. There you go.

Congrats. Thank you. Thank you. Can I... Oh yeah, you should all definitely... Thank you. Congrats, everyone. For our photo, yeah, looking at me right over here. Thank you. And our website is survivalofthefeature.com. Survival of the Feature.

Please try it. Very good. And I think the winner, decided by the votes and applause earlier, is Eugene from Featherless, with Featherless R1. Let's hear it for Eugene. Where is Eugene? Eugene gets one of the big ones. Oh my god, Eugene, you're so excited. Featherless has been grinding away for a long time, and I can't believe you did this in a hackathon. And I'd also like to add that I wasn't alone. Michelle, who couldn't be here, took part in the hackathon as well and worked on it with me. Okay, well, this is yours. Thank you for taking part. Yeah. Stand in the middle. Okay. Awesome.

That's it. All right, now I've got to do one more thing. It's just a quick thanks to everyone who has been part of this. Obviously to everyone in the audience here; Microsoft, our presenting sponsor; AWS, our diamond innovation partner; Neo4j and Braintrust, who curated our graph and evals tracks; all of our platinum sponsors; and all of our sponsors in the expo and beyond. And of course, swyx, the executive producer and program curator of this event. It takes a hell of a lot of work to do that. Leah McBride, our senior producer, has been with us since our very first event in October of 2023, and she really helps make this run. And also our new team members, Melissa Billy and Scott Dilap.

So many others, including a special shout-out to Vincent Wendy, who did all of these incredible graphics. Everything you see, that was him. Okay, he didn't do the animation, I'll get to them in a minute, but he did all of that. Incredible working with him. VCI Events, they've been running everything on this floor, the Golden Gate Ballroom level. Freeman, all the graphics you see were them. Art and Display did that beautiful expo in there. It's like a little Santa's village; you feel like you're in a little mini city. Incredible. That was them. Encore helped run AV up on the second floor. Local 16 helps operate everything. So really big thanks to them; they've all been so incredible. Motion Agency did all of the motion graphics you see here. They're based in Asia, but they worked some Pacific hours for some last-minute stuff. Suno, I love working with Suno. It doesn't miss, every time, and they produce music just from text. The Marriott Marquis, thank you so much. Max Video Productions for B-roll. Randall for photography. Brad Westfall, and swyx, our web developer. Come on, how does he actually do it? And Haley Holmes, our incredible show caller.

Thank you so much. And all the speakers, of course, they've been so incredible. Anyone in a yellow shirt you saw is a volunteer; they come here just to help out and be part of the event and the excitement. So we thank all of them. We can't run it without you all. Thank you so much. And then lastly, I'd like to welcome on stage the absolutely hilarious, absolutely wonderful Laurie Voss, our MC. Can we give him a big round of applause, everyone? I keep telling him this, and I keep telling everyone this: your intro, his jokes, did they land or what? They were actually really good. They weren't just dad jokes, so I really appreciate that. Thank you so much. And with that, that should do it for the show. Thank you all for staying. The last few of you who stayed for this, we really appreciate it. Thank you so much for coming out.
