AI Engineer World’s Fair 2025 - Day 2 Keynotes & SWE Agents track
By AI Engineer
Summary
Topics Covered
- AI Agents Reshape Software Engineering Tiers
- Test-Time Compute Unblocks Intelligence Bottlenecks
- Deep Thinking Scales Reasoning to Olympiad Levels
- Evals Accelerate Iteration Beyond Production Risks
Full Transcript
[Opening music plays. Song lyrics:]
The wheel keeps turning, grinding its thread. Paths unchosen where dreams are shed. Don't waste your time in endless debate. Pick up your tools and create your fate. March forward, let the road unwind. What's left behind is not yours to buy. Rise with the sun, let the sky ignite. You build the future with your will tonight.
The clock won't wait; it's relentless. Echoes of time are loud and defenseless. Step off the ledge, don't fear to fly. There's no gain if you never try.
Steel fires burn where progress starts, flames that forge courageous hearts. You can't stop what was meant to bloom, so grab the light when it breaks the glow.
March forward, let the road unwind. What's left behind is not yours to buy. Rise with the sun, let the sky ignite. You build the future with your will tonight.
No backward glance, no sorrow's ache. All in motion and truth at stake. Humanity's story, a river untamed.
[Music] [Applause]
Please welcome to the stage the VP of developer relations at LlamaIndex, Laurie Voss. [Music]
Hello again, everyone. It is great to see your friendly faces. Uh, sorry, can we go back one slide? I accidentally hit my forward button. Uh, it is great to see you all. Welcome back to day two, or day three depending on when you actually started. Who had a good time yesterday? Let's hear it from you. One thing I couldn't fit into my intro yesterday that I really wanted to get in is that it is June in San Francisco. It is Pride Month. So from myself and my fellow LGBTQ members of the community, I would like to wish you all a happy Pride. I also want to hear from my jet lag crew. Who, show of hands, woke up at 4:00 a.m. this morning? There were a lot of you. 5:00 a.m.? Who's still not awake now?
Uh, we've got another great batch of keynotes for you, including progress towards deep thinking with Gemini. You'll be hearing from Logan Kilpatrick of Google. Uh, fun fact about Logan: Gemini's ability to make jokes is trained entirely on his tweets, which is why none of them are funny. Uh, you'll also be hearing how to make your agents more reliable from the founder of Docker, so you won't want to miss that. Uh, but first we're going to hear from an amazing organizer and just a wonderful person who has a special announcement, co-founder of this very AI Engineer World's Fair, Benjamin Dunphy. [Applause]
[Music] Co-founding this conference with Swix has been one of the most rewarding experiences of my career. To see you all here today makes me so excited for what we've built and for what's to come. Like many AI plebeians, my aha moment was ChatGPT. One of my first prompts was to test its limits of knowledge and reasoning. I prompted it to break the known universe into the fewest core principles from which it could then recursively generate 12 subclassifications. I was blown away by how fun this exercise was and how interesting the responses were, especially when it got to the lowest levels of the universe. For example, it labeled viruses a subcategory of quarks, which I found both fascinating and just wrong. It was at this very moment, however, that I immediately knew that it was over for everything I'd done in the past. This was the most fascinating piece of technology I'd ever used, and I recognized its potential immediately. I recall texting my brother-in-law the URL saying, "AGI has been achieved." But it was only a few months later that something even more incredible happened. My son was born. Being a father has been one of the most miraculous and incredible experiences of my life. While yes, it's rather astounding to be able to speak with computers, where my mind feels expanded every time I do, when I talk to my son, it's my heart that expands. So, how do these two things relate?
I am old enough to remember a time when computers were large, cold machines only used in corporate offices to get work done. But as their power has grown, the current model of UX has tethered us to these machines all day. There is a parallel from this to how our future generations will be educated. While there will likely always be a place for both screen- and keyboard-based HCI, as well as classroom- and lecture-style learning and discovery, the potential of these new technologies and emerging UX can free us from those constraints, where even the most mediocre of teachers could become world-class instructors. So that's why I'm tremendously excited to announce a new chapter for us: the AI Education Summit.
[Applause] There's a significant gap between the rapid advancement of AI and the preparedness of our children, parents, and educators to navigate this new reality effectively and ethically. But we can overcome this together by fostering a global community dedicated to AI education, empowering children, parents, and educators with the knowledge, skills, and ethical frameworks to thrive in an AI-driven world. For this event, we'll be partnering with a pioneer in the space of AI education, Stefania Druga. She's a former research scientist at Google and, as of today, a three-time AIE speaker. It was her talk from last year that sparked my imagination on this exciting new direction. When she demoed a student learning to code by programming the very thing that is teaching them to code, I was just blown away. So whether you're interested in education for the next generation like I am, or just the evolution of HCI for learning in the age of AI for people of all ages, I encourage you to pre-register today. This first event is going to be a free online event to explore the landscape, filled with practical knowledge for the exciting future of AI education. So that's it for me, and I'd love to bring up our first speaker. He is the group product manager at Google DeepMind and he's here to talk about Gemini. Please join me in welcoming to the stage Logan Kilpatrick. [Music]
Awesome. Thank you, Ben.
Excited for the AI Education Summit. Should be fun. Um, my name is Logan. I do developer stuff at DeepMind and I'm excited to talk about Gemini stuff. Um, yeah, hopefully folks know what Gemini is, so no introduction needed. Um, I'll talk about three things really quickly. We'll do some fun announcement stuff. Um, we'll talk about recapping a year of progress in Gemini. And then we'll talk about what's coming next across the model side, across the Gemini app side, and also, of course, across the developer platform.
So, the fun stuff, which is: we announced a new Gemini model today. Um, so we haven't officially announced it yet, but we'll post the tweet live. New Gemini model. Uh, this is hopefully the final update to 2.5 Pro. I think folks have given us tons of feedback um about the changes, and I think my slide has an animation which is hiding all the stuff. But Gemini 2.5 Pro is awesome. Um, it's super powerful. Uh, a bunch of increases across, you know, benchmarks people care about. It's SOTA on Aider and um it's SOTA on HLE and some other benchmarks. Um, I think it closes the gap on a bunch of the stuff that folks gave us feedback on from the previous versions of the model. Um, so hopefully it has great performance across the board. It also, um, I think, is sort of setting the stage for the future of Gemini. I think 2.5 Pro, for us internally and in the perception of the developer ecosystem, was the turning point, which was super exciting. Um, so it's awesome to see the momentum. We've got a bunch of other great models coming as well. Um, so 2.5 Pro, hopefully the final version. Send us feedback if things don't work. Uh, and we'll continue to push the rock up the hill. Um, you can go to ai.dev if you want to try it out. It's also available in the Gemini app and all that other stuff. Um, and if you need anything, email us and we'll make it happen. All right, new model launched.
Let's talk about a year of Gemini progress. I think this has been the craziest thing. So, I don't know if folks tuned in to Google I/O, but um, Sundar showed this slide on stage, which I think was a great reminder for me of just how much it feels like 10 years of Gemini stuff packed into the last 12 months, which has been awesome. Um, and it's actually interesting to see as well, just to opine on one of the points: all of these different research bets across DeepMind coming together to build this incredible mainline Gemini model. And I have a conversation with people all the time about what the DeepMind strategy is, what the advantage is for us building models, all that good stuff. And I think the interesting thing to me is just this breadth of research happening across science and Gemini and all these other areas like robotics. Um, and all of that actually ends up upstreaming into the mainline models, which is super exciting. Um, so you see AlphaProof and AlphaGeometry and a bunch of the stuff we did with custom models in those areas actually improving the performance of our models for those domains, and Jack will talk about that in a little bit, which I'm super excited about. Um, the other thing is not just the pace of innovation but the pace of adoption. Um, so I think Sundar also showed this slide, which was a 50x increase in the amount of AI inference being processed through Google servers from one year ago to last month, and I think it is just remarkable to see the increase in demand for Gemini models, also from the external developer ecosystem.
So it's been wonderful to see that happen. I think the other question, and this has been talked about a little bit, is what got us to this point. I think one of the critical pieces, and it's, you know, not super fun but is worth thinking about for folks who are building companies here, is an organizational thing, truthfully. Google historically had lots of different teams doing lots of different AI research, um, and in 2023 Google brought a bunch of those teams together um and charted this new direction for the DeepMind team: to not only do theoretical foundational research but also to build models and deliver them to the rest of Google and to the external world. Um, and then we took the second step of that journey earlier this year, which was actually bringing the product teams into DeepMind. So now DeepMind creates the models, does the research, um, but then also builds products and delivers those to the world. And we have the Gemini app, which is our consumer product, and then we have the developer side of that with the Gemini API. Um, and this has been, personally for me, super fun: getting to collaborate with our research team, helping actually be on the frontier with them, um, and bringing new models and capabilities out. I think this is the collaboration that works incredibly well. Yeah. And we ship lots of stuff. I think this is the most fun part: there's so much innovation happening inside of Google. It's incredible to get to bring that to the world and bring that to developers, and I think we're actually very early in that journey, as we'll see in a couple of minutes. Um, so in summary, the formula is simple: bring the best people together, find infra advantages, and ship.
I don't know if folks have played around with Veo or not, but it's also been just incredible to see the reception to Veo. It's, uh, burning all the TPUs down, uh, which has been incredible to see. Lots of demand, uh, lots of interest on the Veo front. Um, so hopefully folks get a chance to play around. It's available in the Gemini app right now. Um, all right. So, let's talk about what's next. This is the fun stuff.
So, I think the Gemini app piece is interesting just because people talk about it a lot, and it's um a fun product and it's cool to think about. Um, and also, I think for folks building stuff, it's interesting to hear what our strategy is from the app perspective. Um, the Gemini app is trying to be this universal assistant. I think what that means in practice is, um, I'm sure people don't think about this all the time, but I think a lot about what Google's products do and how we show up in the world. And one of the interesting observations I had was: if you think about what historically brought individuals through all of Google's products, the thing that comes to mind is your Google account, I guess, which wasn't super stateful. You would sort of sign into lots of different Google products with your Google account, but that didn't really do anything um other than get you signed into that individual product. I think now we're seeing with Gemini that it's actually this thread that unifies all of Google. And I think the future for Google is going to look a lot like Gemini being this, you know, thread that brings all of our stuff together. Um, which is really interesting. And then hitting on all the trends, which I'm sure folks are also excited about building: I think the one that I'm most excited about is proactivity. Most AI products today are still very much "you have to go and do all the work as the user." And I think this proactive next step, of um AI systems and models coming into play, is going to be awesome to see.
Yeah, and the team is moving super fast. If you have complaints, please do not tag me on Twitter. Please tag Josh. Um, he will make it happen. Josh is incredible. The Gemini app team is amazing. Um, he's pushing the team uh super hard. So, it's incredible to see all the progress. Uh, but he is the person who can make stuff happen on the Gemini app, not me. So, please check with him.
Um, from a model perspective, again, there's so much. Uh, when Gemini was originally created, it was built to be a single multimodal model to do audio, image, video, etc. We've made a lot of progress on that. At I/O this year, we announced um native audio capabilities in Gemini. There's TTS. There's audio: you can talk to the model, and it sounds super natural, which is awesome. It's powering the Astra experience. It's powering Gemini Live. Um, so I think we're going to get towards that omnimodal model, which is awesome. We have Veo, which is SOTA across a bunch of stuff, so hopefully we'll get video into the mainline Gemini model. Um, folks may have seen some of our early experiments with diffusion, which means you can get crazy levels of tokens per second. Um, really interesting. That's definitely a research exploration area and it's not mainline yet. Um, so it'll be cool to see that come.
The agentic-by-default thread, I think, is something I've been thinking a lot about recently. Historically, as a developer, I've thought about models just as this thing that takes tokens in and gives tokens out, and then there was lots of scaffolding in the ecosystem to allow me to build with those models. I think it's becoming very clear to me that the models are becoming more systematic themselves; they're doing more and more. And I think the reasoning step is this really interesting place in which a lot of that's going to happen. Jack's going to talk about the scaling up of reasoning. Um, but I do think it'll be interesting to see how much of the scaffolding work that's happened in the past ends up just being a part of that reasoning step, and what that means for people who are building products and stuff like that. So, um, it'll be interesting to see. We'll also have more small models soon, which I'm excited about, and big models. People want large models, which I know. Um, so I'm excited about that. And then the last one is continuing to push the frontier on infinite context. I think the current model paradigm doesn't work for infinite context. I think it's just impossible to scale up; attention doesn't work that way. Um, so I think there'll be some new innovations to hopefully help people continue to scale up the amount of context they're bringing in.
Um, and Tulsee is the person who drives all of our model stuff. So, if you have stuff you want to talk about with Gemini models, uh, or you have ideas for things that don't work well, uh, she is the person running the show on the Gemini model product side.
And then developer stuff. Um, so we have lots of things coming which I'm excited about. Um, I think I'll highlight maybe three that people are super excited about. Embeddings, which, you know, feels like early AI stuff but I think is still super important. Embeddings power most people's applications using RAG. Uh, we have a Gemini embedding model which is state-of-the-art, so we're excited to be rolling that out to developers more broadly in the next couple of weeks.
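(For reference, calling an embedding model for RAG-style retrieval looks roughly like this with the google-genai Python SDK; the model name below is only a placeholder, not the new Gemini embedding model being described.)

```python
# Sketch: embedding a few documents for a RAG application via the Gemini API
# (pip install google-genai). The model name is an illustrative placeholder.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
result = client.models.embed_content(
    model="text-embedding-004",  # placeholder; swap in the newer Gemini embedding model
    contents=["What are thinking budgets?", "Gemini 2.5 Pro release notes"],
)
vectors = [e.values for e in result.embeddings]  # one vector per input document
```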
Um, the Deep Research API I'm super interested in. There are so many interesting products built around this sort of research task, and people love the consumer product. So, we're finding ways to bring a bunch of that together um into a bespoke Deep Research API, uh, which will be awesome. And then Veo 3 and Imagen 4 in the API as well. So, hopefully we'll see that uh very, very soon, as we work to scale and make that possible from a developer platform side. I'll make one other quick comment, which is the um AI Studio product positioning, which I also think is interesting. AI Studio, just to be very clear, is being built as a developer platform. Um, so we'll move away from this kind of consumer-y feel and move much more towards being a developer platform, which I'm personally very excited about, because I think that's what developers want from us. Um, so it'll be awesome to see that actually come to life with many new iterations of our developer experience, with agents built in, and hopefully things like Jules and some of our developer coding agents um natively in that experience, which will be awesome to see. Um, yeah, and that's what I have. I appreciate all the people who send lots of great feedback about Gemini stuff. So we'll keep pushing the rock up the hill, and um I'll be around. So if you have more feedback, come find me and we'll keep making Gemini great for everyone. So thanks, and I appreciate it. [Applause] [Music]
Our next presenter is a principal research scientist at Google DeepMind. Please join me in welcoming to the stage Jack Rae. [Music]
Hi everybody. Uh, yeah, my name is Jack. I'm a researcher at Google and I'm the tech lead of thinking within Gemini, and I'm going to give a brief deep dive into thinking from the research perspective uh within Gemini. So, um, it's thinking so much I think this clicker might not work. So, let's go to the next slide. Whoever the slide driver is, please drive to the next slide.
Um, but yeah, whilst we maybe sort out the slide issue, um, I'm going to give this talk in three stages. One is to give a research motivation for why we are actually excited about thinking, in terms of unblocking bottlenecks towards intelligence. And I'm going to give a few examples of how, in our current, most advanced systems, if you can just identify the crucial issues and shortcomings, you will often then find a solution, and there's a reason that is linked to thinking. And then I'm going to talk um a little bit more, just pragmatically, about what thinking in Gemini is and why it is interesting to developers.
And I think, um, okay, the slides are still not here. We did do a rehearsal this morning where the slides were there. But yeah, the keynote speaker folder. Yeah, I can see someone. Yeah, keynote speaker folder, Jack Rae. I think it's under keynote speaker. That one. Um, anyway, it's going to come up soon. You are close, um, person. Um, yeah, and then I'm also going to talk a little bit about what's next. Ah, there you go. Nice one. Yeah, that's great. Okay, the slides will appear. Thank you, whoever is coordinating. Apologies, I don't know what happened. Um, and then I'm just going to talk a bit about what's next.
So, Logan did a great job of giving an incredible overview of Gemini as a whole ecosystem and everything that's going on. Uh, I'm going to really be focusing on what we're excited about in the reasoning space.
So, with intelligence bottlenecks: the message of this section is really about progress. Progress has really been marked by identifying key bottlenecks towards intelligence and then solving them. And I'm going to give some examples throughout history. I'm going to rewind the clock to 1948. Claude Shannon invents the language model in "A Mathematical Theory of Communication." He builds a language model, a two-gram, using a textbook of word statistics that was hand-calculated, and he samples from it and marvels at the samples. He feels like these are getting pretty good; this two-gram word model is a lot better than a unigram character model. But he remarks, essentially, that this would be better if we could really make a better language model and scale up this current method. So he really wanted to just scale up the n-gram. The bottleneck was a small amount of data and, you know, elementary statistics. And unfortunately for Claude Shannon, the solution was pretty hard: he needed the digitization of human knowledge, and he needed modern computing to be able to aggregate those statistics at scale. So, you know, that wasn't so easy for him to solve. He had it a bit more tricky.
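(As an aside, a two-gram word model of the kind Shannon tabulated by hand is only a few lines of code today. A rough sketch on a made-up corpus, just to show how little machinery is involved:)

```python
# Sketch: a two-gram (bigram) word language model -- count word-pair statistics,
# then sample the next word from the conditional distribution.
import random
from collections import Counter, defaultdict

def train_bigram(text: str):
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1          # tally how often `nxt` follows `prev`
    return counts

def sample(counts, start: str, length: int = 20) -> str:
    out, word = [start], start
    for _ in range(length):
        if word not in counts:
            break                       # dead end: word never seen with a successor
        nxt_words, weights = zip(*counts[word].items())
        word = random.choices(nxt_words, weights=weights)[0]
        out.append(word)
    return " ".join(out)

model = train_bigram("the wheel keeps turning and the road keeps going and the sun will rise")
print(sample(model, "the"))
```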
But fast forward a few decades: at Google in the 2000s, my colleagues such as Jeff Dean are training n-gram language models over trillions of tokens. These are powering, at the time, the most sophisticated speech recognition and translation systems, and a lot of progress has been made. But the bottleneck with these systems was that the n-gram language models were very restricted to short context, because there's an exponential storage cost with context length, and there wasn't really a way around that if you stuck with n-grams. The solution was the early introduction of deep learning, around 2010, with recurrent neural language models: recurrent neural networks applied to modeling text, where the network could avoid the problem by storing a compressed representation of the past in its state, and could now start to model beyond a five-gram, to sentences or even paragraphs. And this was a massive step change in improvement. However, a couple of years later people noticed that even there, there was a bottleneck. The recurrent neural network's representation of the past is a fixed-size state, and there's only so much information you can put into a fixed-size state, so it's often observed to be a lossy representation of the context. The solution that was derived, once people really encountered this information bottleneck over the past, was to actually keep everything around, in terms of your past neural embeddings, and use an attention operator to aggregate things on the fly. So this was the birth of attention, and then shortly after, transformers. Transformers then led to the modern deep learning revolution as we know it, and much other progress was made.
If we skip forward 10 years, we are then in 2024. We have large language models. They're increasingly powerful, general conversational agents. We have models such as Gemini and ChatGPT, and people are using them for all sorts of use cases. And that's where we come to the bottleneck relevant to this talk, which is that although these models are very, very powerful, they are still trained to respond immediately to requests. In other words, in terms of a compute bottleneck, there is a constant amount of compute that they apply at test time to transition from your request, or your question, to the response, or your answer. So the bottleneck is test-time compute, and this is relevant to thinking.
Uh, so we can unpack this a little bit more. When we talk about a fixed amount of test-time compute, the test-time compute is interesting to you because that's the compute the model is spending on your particular problem, your particular question. And the way it mechanically works is: you have some text in your request; it gets translated to tokens; and then it goes through a language model. At the transition from the request to its response, it passes some computation up through a large language model, which has some parallel computation for every layer and some iterative computation across layers. That computation is really where the model can apply its intelligence to your particular problem, and it's of fixed size. One solution, if you wanted a smarter model and more computation, is just to make the model larger; then you have more compute and you can get a smarter response. However, it's still not really enough. Users might want to be able to think a thousand or a million times over, with a very large dynamic range and a lot of compute for very hard or challenging or valuable tasks. And users might also want a very dynamic application of test-time compute: less compute for simpler requests, more compute for harder requests, with this process being very dynamic and instigated by the model. And that is what motivates thinking.
So, thinking in Gemini, mechanically: I'm sure almost everyone in this room is familiar with this general process, where we now have a model and we insert a thinking stage, so that the model can emit some additional text before it decides to emit a final answer. Going back to this notion of test-time compute: we've now added an additional loop of computation, where the model can iteratively loop and perform additional test-time compute during this thinking stage. This loop can potentially be thousands or tens of thousands of iterations, which gives you tens of thousands of times more compute before the model commits to what its response will be. And also, because it's a loop, it's dynamic, so the model can learn how many iterations of the loop to apply before it decides to actually commit to its answer.
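(A toy sketch of the mechanism just described: the model spends a variable number of "thought" tokens, i.e. extra test-time compute, before committing to an answer. sample_next_token is a hypothetical stand-in for one forward pass of the model.)

```python
# Toy sketch of a thinking stage: loop on thought tokens until the model decides
# it has thought enough, then decode the final answer conditioned on the thoughts.
def generate_with_thinking(prompt_tokens, sample_next_token, max_thinking_tokens=10_000):
    thoughts, answer = [], []
    # Thinking loop: each iteration is additional test-time compute spent on this
    # particular problem; the model itself decides when to stop.
    while len(thoughts) < max_thinking_tokens:
        tok = sample_next_token(prompt_tokens + thoughts)
        if tok == "<end_of_thinking>":
            break
        thoughts.append(tok)
    # Answer loop: commit to the response, conditioned on all the thoughts above.
    while True:
        tok = sample_next_token(prompt_tokens + thoughts + answer)
        if tok == "<end_of_answer>":
            break
        answer.append(tok)
    return thoughts, answer
```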
We train this model to use this thinking stage via reinforcement learning. So after we pre-train Gemini, we have a reinforcement learning stage where we train it on many different tasks and give it positive and negative rewards depending on whether or not it solves the task correctly. And this is essentially a very general training recipe, really. And it's kind of remarkable that it works: the model is able to take a very vague signal of what is correct and what is not correct, and to backpropagate this through the loop of the thinking stage, such that it can try to shape how it uses its thinking computation and thinking tokens in order to be more useful.
In fact, we weren't really sure this would work. Um, it wasn't clear how much structure we should put into something like a reasoning stage, and um, although I think many people here have now seen reasoning traces and played with these models, I'll just show you a historical artifact um from one of the times we were trying to use reinforcement learning and started to see cool emergent behavior. In this case it's an integer prediction problem; this was just a particular, mathsy example. And what we saw was the model using its thinking tokens to first pose a hypothesis and then test out the hypothesis; then it found that things weren't really working, stated that this formula doesn't hold, rejected its own idea, and tried an alternative approach. And I think it's easy to become desensitized to technology because it's so amazing every single day, but we were truly blown away when we saw that the general recipe of reinforcement learning was creating all sorts of interesting emergent behavior: trying different ideas, self-correction. And I think these days we see a lot of different strategies that the model learns to use. It learns to break down the problem into various components, explore multiple solutions, draft fragments of code and build these up in a modular way, perform intermediate calculations, and use tools. All under the umbrella of using more test-time compute to give you a smarter response.
Okay. So I've talked a bit about why we are interested in thinking in terms of the path to AGI and unblocking bottlenecks of intelligence, and a little bit about what it is mechanically. Why is it interesting to developers? Obviously the number one reason is we think this is driving more capable models, and it stacks on top of our current paradigms for accelerating model progress. With thinking, we can accelerate this process by scaling the amount of test-time compute, and we find this stacks as a paradigm on top of pre-existing paradigms: pre-training, where you scale the amount of pre-training data and model size, and post-training, where you scale the quality and diversity of human feedback for many different types of tasks. And as a result, within Google, by investing in all of these and really accelerating all of them, we get a multiplicative effect. And why is this interesting to developers? I think it results in overall faster model improvement, which is very nice.
We also see, if we look back over our lineage of recent Gemini launches, improved reasoning performance, and we can actually map this to how much test-time compute these models will devote to problems. So there's a log-scale test-time compute axis on the x-axis and performance across math, code, and some science topics on the y-axis, and we see a trend of increasing reasoning performance that also tracks very well with increasing test-time compute. On the far left you have 2.0 Flash Experimental; this was a model that was not launched with thinking, back in December last year, so ancient history. And on the right-hand side we have the first launched version of 2.5 Pro. So test-time scaling is working empirically.
But it's not just capability that matters. It's also interesting from the notion of being able to steer the model's quality against cost. So, um, before, you had the option of choosing from a discrete number of possible model sizes, and that was a way to gauge how much quality you wanted and how much cost you were willing to incur for any given task. But it was a discrete choice. Now, with thinking, we can have a continuous budget, which gives you a much more granular slider for how much capability you want for any given class of tasks. And we have thinking budgets now launched in Flash and Pro in the 2.5 series. Um, this gives you a very granular choice of cost versus performance, and it also allows us to push the frontier and let you drive cost higher and performance higher if your application requires it.
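(For developers, setting a thinking budget looks roughly like this with the google-genai Python SDK; the model name and token budget below are illustrative values, not a recommendation.)

```python
# Sketch: capping the thinking budget on a 2.5-series model via the Gemini API
# (pip install google-genai). Budget and model name are illustrative.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash",  # any 2.5-series model that supports thinking
    contents="How many prime numbers are there below 1000?",
    config=types.GenerateContentConfig(
        # A continuous cost/quality slider instead of a discrete choice of model size.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```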
Okay, I think a lot of this has really been covering ground up to the present day. So what's next, and what are we excited about? We're very excited about just generally improving the models and having better reasoning. Of course, we're also excited about making the thinking process as efficient as possible. Really, we want thinking to just work for you and be quite adaptive, something you don't have to actively spend a lot of energy tuning. And a big part of that is ensuring our models are very efficient in how they use their thoughts. Uh, this is definitely an area of progress; we can find examples of our models overthinking on tasks, and this is just an area of research: getting these things faster and faster and as cost-effective as possible. We're very proud of how cost-effective our Gemini models are, and this is an area for improvement as well. And there's also deeper thinking, which is really about scaling the amount of inference compute further to drive even higher capability.
People may be familiar with Gemini Deep Research, where you can type in a query and the model will go away for a long period of time and research a topic. We've also now announced at I/O, and are launching to trusted testers, a notion of Deep Think. Deep Think is a very high-budget thinking mode built on top of 2.5 Pro, and its intended application is for cases where you have a very hard problem and you're happy to essentially fire off the query, let some asynchronous process run for a while, and come back to a stronger solution. Its key idea is that we leverage much deeper chains of thought, and parallel chains of thought that can integrate with each other, to produce better responses. We find this enhances model performance on very tough multimodal, code, and math problems. An example would be the USA Math Olympiad. This is a task on which the state-of-the-art model in January had essentially negligible performance. 2.5 Pro, probably even better with the version updated today, was at about the 50th percentile of all participants in the Math Olympiad, and with Deep Think it goes up to about the 65th percentile. And the interesting thing about Deep Think is that as we continue to both improve the base model and improve the algorithmic ingredients that go into Deep Think, those two will stack together as well.
Um, here is just a video animation of one of these USA Math Olympiad algebra problems. The key idea with this video is the notion of having multiple iterative ideas. Maybe the model starts out with a proof-by-contradiction idea, but then it explores two different aspects, some of Rolle's theorem, Newton's inequalities; it integrates them and eventually arrives at a correct proof. There's not that much you can take away from this video, but it looks pretty cool, so I added it. Yeah.
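(A very rough sketch of the parallel-chains idea mentioned above: fan out several long, independent reasoning chains, then have the model integrate them. solve_once and synthesize are hypothetical stand-ins; the real Deep Think is far more integrated than this.)

```python
# Rough sketch of "parallel chains of thought": run several independent deep
# reasoning chains on the same problem, then synthesize them into one answer.
from concurrent.futures import ThreadPoolExecutor

def deep_think(problem: str, solve_once, synthesize, n_chains: int = 8) -> str:
    # Spend much more test-time compute: several deep, independent chains...
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        chains = list(pool.map(solve_once, [problem] * n_chains))
    # ...then let the model read all of them and integrate a single response.
    return synthesize(problem, chains)
```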
One thing, beyond the math we talked about on the previous slides: I'm very excited about any application where the model can spend longer and longer thinking on very open-ended coding tasks, and one-shot or very-few-interaction vibe code things that would have taken us months in the past. One example I like, from a researcher's perspective: some of my colleagues vibe-coded, from DeepMind's original DQN paper, which was a revolution in deep reinforcement learning, the training setup, the algorithm, even an Atari emulator, such that it could play some of the games. And this is remarkable to me, because these kinds of things would have taken me and my colleagues months in the past, and they're starting to happen in minutes.
One thing I'm quite excited about, looking forward to the future, is not really the landscape of models but coming back to our gold standard, which is the human mind. I would love for our models to be able to contemplate from a very small set of knowledge and think about it incredibly deeply, such that we can push the frontier. One example I often think about is Ramanujan, one of the world's greatest mathematicians, from the early 20th century. Famously, he just had this one math textbook; he was largely cut off from the mathematical community. But from a small set of problems, he spent many textbooks' worth of thinking working through problems and inventing his own theories to further extend ideas, and he invented an incredible quantity of mathematics just by deeply thinking from a small source subset. And this is where I think we are going with thinking. We want a model to be incredibly data-efficient and actually go to millions of inference tokens, or beyond, where the model is really building up knowledge and artifacts, such that we can eventually start to push the frontier of human understanding. So with that said, thank you very much.
Our next presenter is here to tell us why you should care about evals. Please join me in welcoming to the stage the founding engineer at Braintrust, Manu Goyal. [Music]
All right, who's excited about evals?
[Applause] All right, what can I do to get those juices flowing? Uh, I'm Manu and uh I work at Braintrust, where we build a platform to do evals and a bunch of other stuff. Um, so I was thinking we could just start by talking a little bit about my own personal eval journey. Now, you might see this picture and say, ah, what an adorable little boy absorbed in his Nintendo 64 video game. But if you look a little closer, you'll see a boy who's deeply disappointed with the state of technology in his society. Because this boy, he knows that technology is not meant to be shackled to the constraints of rule-based systems, doomed to do the same thing over and over and over. No, technology is meant to come alive, to grow and adapt and really be a thought partner to mankind. So, I knew this in that moment, which is why I decided to devote my career to being a software engineer in the AI industry. And so, I dropped the Nintendo, started grinding away on LeetCode, and soon enough I landed a job in the self-driving car industry. Now, we can all learn a lot from self-driving cars, but the thing I took away was that, you know, you can spend all day tuning the model, changing the architecture, adjusting the loss function, all good stuff, but it's never going to be enough for you to actually ship it to production, right? I can't say, "Oh, my image classification rate went from 98% to 99%. Put it on the road." Right? We need some way to, you know, contextualize this model and understand if it actually works for our real-world application. Does it avoid pedestrians? Does it negotiate traffic scenarios appropriately? Does it obey the law? All this stuff we actually need to understand.
And how we're going to do that is with evals. Now, you know, the whole point here is that evals aren't just unit tests for AI. They're not just for finding regressions, right? If I didn't have evals, the only way I could get any signal on my changes would be by shipping them to prod and then getting signal, you know, in the real world. But that's expensive, it's slow, and ultimately it's pretty risky. So what do evals do? If you invest in good evals, that lets you run experiments to your heart's content and do 90% of the product iteration loop before going to prod, and then you can ship much more quickly and much more confidently. Um, now furthermore, if you apply the same metrics from offline to your online production data, you now have data-driven signal about which examples in prod are going to be most useful for that next iteration loop.
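(A minimal sketch of that offline loop, with a hypothetical run_agent under test and a toy exact-match scorer; Braintrust's actual SDK layers logging, experiments, and online scoring on top of this basic idea.)

```python
# Sketch: an offline eval loop. Score a candidate change against a small labeled
# dataset before it ever touches production.
dataset = [
    {"input": "Summarize: the meeting moved to 3pm.", "expected": "Meeting moved to 3pm."},
    # ...more labeled examples, ideally mined from production logs
]

def exact_match(output: str, expected: str) -> float:
    # Toy scorer; real evals use heuristics, embeddings, or LLM-as-judge scorers.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(run_agent) -> float:
    scores = [exact_match(run_agent(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores)

# Compare a baseline and a candidate prompt/model offline, then ship the winner:
# print(run_eval(baseline_agent), run_eval(candidate_agent))
```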
And so with all of this knowledge, my eval journey was complete, and I transformed from this guy to this guy. So, success. Now, if this heartfelt childhood story isn't enough to do it for you, you don't have to take my word for it. You can take the words of all of these tech luminaries. We have Kevin Weil, Garry Tan, Mike Krieger, Greg Brockman, all extolling the virtues and the necessity of evals. And surely, if they're all saying it, there's got to be something to it. It can't be a total scam. So there's got to be something worth checking out here. So with all that buzz, I made my way to Braintrust, where our goal is to build the dev platform that, of course, lets you do evals, but also all the things that go along with them. That involves, you know, tweaking prompts and experimenting in the playground. It involves logging data and getting the observability component, and connecting all of those together in this beautiful data flywheel, so that we can let you build the data flywheel that lets your AI dreams come true, because that's really what we're here for. Now, I know this was a dense and content-heavy presentation, so I'll try to distill it into one simple message, which is that the key to industry transformation, the key to success, is evals. Woohoo! All right. Thank you. Please join the evals track in Golden Gate Ballroom B. I'll see you there. [Music]
Our next presenter is best known as the creator of Docker. Today he is the CEO of Dagger, focusing on the foundational challenges of building and operating reliable, scalable AI agent systems. Please join me in welcoming to the stage Solomon Hykes. [Music] [Applause]
Hello. Okay, my slides are up. You can see them, right? It's me. Okay. Well, this is a very special moment for me, because I just realized yesterday, walking in, that this is the exact same spot, the same stage actually, that I stepped onto almost exactly ten years ago, day for day, to kick off DockerCon 2015. Thought it was pretty funny. I don't know if anyone was there for that. Maybe this audience is too young. Maybe. I don't know.
Okay. Well, uh, I'm here to talk about chaos, specifically the kind of chaos that emerges when you try to use coding agents. Um, and I want to talk about chaos from the perspective of our community at Dagger, which is platform engineers. Um, I don't know if there are any platform engineers in the room. Okay, just you and me, ma'am. Okay. Well, it is known sometimes by other names, but basically platform engineers have a really tough job, because they don't get to build and ship cool software. They get to enable all of you to build and ship cool software in the most productive way possible, right? Uh, it's a really tough job. It takes range. It takes experience. It takes a lot of patience. But we do it for the endless gratification. You know, just the gratitude we get from developers. Just kidding. No one ever says thank you. But it's okay. Someone has to do it. Tough job.
Speaking of enabling: anyone here use coding agents? We are outnumbered. Okay. Well, I want to say to you: congratulations, and welcome to platform engineering. Yeah. I mean, your job now is to enable robots to ship awesome software while you spend more and more of your time enabling them to do that productively, right? Tough job. I applaud you for giving up really the most fun and rewarding part of the job. You know, very selfless. Uh, yeah, so of course this is not completely a reality yet. I mean, we don't quite have the team of agents just humming along, doing the job, while we sit back and fix environments for them. But you can kind of see it coming, right? I mean, some of you are definitely doing that, hacking that together. There's a lot of cool posts out there, and scripts and tools. Um, so we know it's coming. The question is how we enable this to happen, not just for this incredibly cool and bleeding-edge crowd, but for everyone else, everyone shipping software anywhere, creating maximum value by enabling agents to do the work for them, ultimately taking their jobs. That is the dream, right? Okay, so, yeah, how do we do that and make it not too painful? Well, um, I want to go back to basics. What is an agent?
Uh, the famous definition, of course, is that it's an LLM that's wrecking everything in a loop on behalf of a human. The diagram is from Anthropic. Thank you, Anthropic. I tweaked the explanation just a little bit. Uh, in the context of coding agents, it looks like this. Um, oh man, that was supposed to be animated. It's even better when it's animated. It's okay. Yeah, you've got one agent and it's doing stuff, and the environment is your computer. Uh, and it can do great work. It can also do very crazy things. So, you have to watch it closely, right? And approve: no, no, don't do that, that's crazy; yes, that's good. Um, that's kind of the status quo today.
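(For reference, that status quo looks roughly like this in code: one LLM picking tools in a loop against your local environment. llm here is a hypothetical client that returns either a tool call or a final answer.)

```python
# Sketch: a single coding agent -- an LLM choosing tools in a loop, with your
# computer as the environment. Real agents add approval prompts and richer tools.
import subprocess

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def run_command(cmd: str) -> str:
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"read_file": read_file, "run_command": run_command}

def agent(task: str, llm, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(history)  # hypothetical: returns {"tool", "args"} or {"answer"}
        if "answer" in action:
            return action["answer"]
        observation = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": observation})
    return "ran out of steps"
```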
But of course, we want to scale it, right? We want a team. So, how do we do that? Well, right now I would say there are two options, both equally wonderful and fun. The first one I call YOLO mode. You know, I'll just run 10. What can happen? Uh, amazingly, this diagram is not the worst-case scenario, but yeah, you know, you get the idea. So the whole methodology of watching it closely just falls apart really quickly, because they're all stepping on each other's toes. They're sharing an environment, right? Okay. Enter option two: oh, don't worry about that, we'll run the agents, right? We'll take care of everything. We've got background mode. We've got the model. We've got the tools. We've got the environment. We've got the compute. We've got the secrets. We've got everything. You know, just open an issue, wait for the PR, relax... until, of course, it doesn't work. And then you're like, "No, that's not what I meant." Um, these actually work really well. I think like 10 of those launched just today and yesterday. Um, and they're great. It's just that, um, you know, sometimes you just want to get in there, like, okay, give me the keyboard; and sometimes you just want to run it on your machine, or on your favorite compute provider, right, use your favorite model, you want to mix and match. So there are limitations to this all-in-one model. So the question is: is there something better? Uh, is there a scenario where I've just got a team and they're working, and, you know, I can step in or leave them alone, and we're just getting stuff done together?
So this is how I would summarize it. What I would want is really four things. First, I want background work. You know, I don't want to be in there watching every action. That's obvious. Um, second, I want rails. That means I want to be able to constrain the agent away from things I already know aren't necessary. So, obvious things like context for the project: what's our coding style, what tools to use; but also here's how to build, here's how to test, here's the base image we use, right? You can access this secret, you can access that. Just an easy way to do that, because otherwise I'm going to waste so many tokens just correcting as I go, right? The third is, inevitably, when I do need to step in, I want a really efficient and seamless way to do that. And it can't be "watch every action," and it can't be "just wait for the PR and do code review." I need a middle ground there. And the fourth thing is, I want optionality, because like I was saying before, it's a crazy market, you know. There are awesome models, awesome compute, awesome infrastructure. Uh, agents are really cool, and as cool as they are now, I mean, one of you is probably launching one right now, and there will be another one tomorrow. So, I don't really want to lock myself into a whole package today and say no in advance to whatever is coming out tomorrow. Not in this market.
So, to get that, um, I need an environment that has properties that match this. It needs to be isolated, right, so background work works. It needs to be customizable so I can set up those rails. It needs to be multiplayer so I can, you know, go, "All right, give me that. Let me fix this," or "Let me check: did you do it?" You know, when the model says, "I did it." Did you do it? And then, you know, it should be open. No shade on making money and scaling a huge cloud service. That's great. You know, we have one. They're great. But I just want choice, right? Okay, I want to be able to choose and get the best commodity. Let's just use this word. It's okay to use it. The best commodity component for each job, and, you know, it could even be open source. Who knows? We could collaborate on this. Anyway.
So, unsurprisingly, maybe, I'm going to talk about containers now. Someone actually said, you know, you should check that they know Docker, that they know containers. Uh, okay. Who knows what containers are? Who's used containers? Okay, cool. Cool. All right. Boost my confidence a little bit. But the point here is we have the technology. It's not just about containers, but they do play a crucial role, because it's a foundational technology and it is underutilized. We don't fully leverage what this technology can do, because we're used to the first incarnation of the tools, made for humans. Uh, same thing for git. I see a lot of hacks involving git worktrees. Anyone playing with git worktrees to get stuff done? Okay, you know what I'm talking about. So this
is about that. Um, and of course we have models that are incredibly smart and getting smarter, and they can exercise these technologies really fully. We just need to integrate them in a native way so that we really tackle the problem at hand, which is giving great environments to these agents. Anyway, so if we built that native integration, what would it look like? Well, we have a take. Sorry, we are Dagger. I forgot completely to mention my company. That's okay. Um, it's great. Check it out. Um, we have a take on that. Something we call container use. You know, there's computer use, browser use. Uh, these agents need container use. Um, they need a way to use containers to create environments and work inside of them.
This is not the same thing as sandboxing, right? There are a lot of ways to execute the output of the agent in a secure sandbox. Very useful, very cool. But that's not the same thing as the agent developing inside of containers entirely, right? That's what we're talking about here.
So I asked my team, hey, we've been developing this thing. Oh, it's open source, but it's
thing. Oh, it's open source, but it's not yet open source. Like it's not finished. But I asked the team, I should
finished. But I asked the team, I should show it, right? and they said absolutely not. It's not
not. It's not ready. So anyway, you want a
ready. So anyway, you want a demo.
Okay. All right. Just we're clear, this is you agreeing to watch me stumble through a broken demo of unfinished software. Yes.
software. Yes.
Okay. So much could go wrong right now.
Okay. This is my terminal. Can you see it?
Okay, for for technical reasons, I'm not going to go to full screen. You just got to stop me when I reach the edge. Oh,
actually, I can see it. Never mind.
Okay. Uh, old
school.
Okay. We used to do this all the time in the old days. Okay. So,
uh, here's what I'm going to do. I'm going to just, um, try to develop something very simple here. I've got an empty directory. I'm going to try and make a little homepage for my awesome container use project, and I'm going to use Claude Code. I'm going to try and use a bunch of them. Hopefully I made something very clear: this is not a coding agent. It's environments that are portable that you can attach to any coding agent. That's the idea. So you like Claude, use Claude. You like, you know, Codex, use Codex, etc., etc. In an IDE, in the command line, whatever. And also in the cloud, right? In CI, there are lots of cool things you can do once you're async.
So, okay, one of the reasons the team said don't do a demo is I'm actually terrible at using Claude. So, uh, I have an alias for remembering the flag to disable all, you know, permissions. I can never remember it. And I have a prompt here. I'll read it to you in a minute, but it's basically: make me a homepage. Uh, make it a Go web app so I can know what's going on, because I'm not a cool kid writing TypeScript, and run the app when you're done. So,
while this maybe runs hopefully.
Okay. Okay. Cool. So what's happening here is I configured Claude Code to use, you know, with container use, to use containers, literally, via MCP. So it was an MCP integration. There are other integrations that we're working on, but MCP is the obvious place to start. Um, and so now it has, you know, all its usual tools. This is vanilla, uh, Claude Code, but now it can create an environment for itself. And now it's editing files in that environment, like in a little sandbox. And it can also run commands to build it and test it and, of course, run it in ephemeral containers. This is not one Docker container sitting there. Every time an action needs to be taken, there's an ephemeral container running and then being snapshotted and returning. So it's just doing its thing.
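Since container use is built on top of Dagger, the ephemeral-container-per-action pattern described here can be sketched with Dagger's Python SDK. This is only an illustration of the idea, not how container use is actually wired up; the base image, the command being run, and the directory layout are assumptions for the example, and the git-backed snapshotting step is omitted.

```python
# Illustrative sketch only: one "action" executed in a throwaway container via
# the Dagger Python SDK, roughly the pattern described on stage. Requires the
# Dagger engine to be available locally; image and commands are assumptions.
import sys
import anyio
import dagger


async def run_action() -> str:
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        ctr = (
            client.container()
            .from_("golang:1.22")  # assumed base image for the Go homepage
            .with_directory("/src", client.host().directory("."))
            .with_workdir("/src")
            .with_exec(["go", "build", "./..."])  # the single action for this container
        )
        # In container use, the resulting filesystem state would then be
        # snapshotted and persisted via git; that part is not shown here.
        return await ctr.stdout()


if __name__ == "__main__":
    print(anyio.run(run_action))
```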
[Music] Um what would I want to show here? Okay,
so here I'm going to first show that nothing has been polluting my workspace.
It's happening in a little sandbox. And
the way the sandbox works, the state of these files and the containers that are being run is um actually persisted uh in git and it's in a bunch of special git
objects that are kind of living alongside the repo. So it's right there if I need it. This is all local. Um, but
it's not polluting my workspace by default. So hopefully it's going to
default. So hopefully it's going to produce something soon. Uh, while it does that, I'm going to use this little command line. Is this readable? Okay,
command line. Is this readable? Okay,
little command line. CU like go work.
See you later. But no, really, it's for container use. Um, and I can list
container use. Um, and I can list environments. And you can see there's a
environments. And you can see there's a new environment that's been created here, uh, with a little random name here. And
so there's a few things I can do. One
thing I can do is open a terminal and here okay this part is powered by Dagger right the but we use Dagger as a sort of a toolbox just it
has all the primitives you need um and so here I can see exactly what the agent sees um the files but also the tools so I can see okay what what Go
version did you configure for yourself all right because the model the the agent is given the ability to figure out what environment it needs and then configure that but in a repeatable containerized way. Uh, so here I can
containerized way. Uh, so here I can see. Okay, does it
see. Okay, does it build? Okay, it builds. Okay, so you're
build? Okay, it builds. Okay, so you're done. What's going
done. What's going on? Okay, while we do that, I'm also
on? Okay, while we do that, I'm also going to show you actually two more things to say. One, uh, a really cool feature of this that I'm not going to show is secrets. So, you can just plug
in secrets from things like one password. I use one password. I don't
password. I use one password. I don't
want to use a separate password manager from an AI company. No offense, I just want to use my password manager. So, I
can just plug in and say this environment gets this secret and boom, it can use it, right?
Um, and then the team said, "Please don't show that. That's just that's going to break for sure." Um, so I won't. And the other thing I want to say
won't. And the other thing I want to say is that because it's all powered by Dagger, um, and the point here, it's containers and it's open source. That's
what you should know. Uh, it's running on my machine. Actually, no, it's not running on my machine because we're at a conference and there's a lot of things that can go wrong if you run containers
and download images. So, instead, I I just have it running on my home server in my basement about one mile this way, and it just kind of works seamlessly.
It's streaming files up, streaming files down. It all just kind of works.
down. It all just kind of works.
[Music] Um, okay. This is the part that I cannot control, as you know. Um, okay, one more thing I'll show you: you can watch. So here I can see the history. So behind the scenes, every snapshot of the state is like a git log. It's actually using git under the hood. So if I'm happy with the result, I can go and get it. Uh, so it's like a happy medium, a collaboration loop that's just right. It's not watching every tool call and wrecking a shared environment, but it's not waiting for a pull request and, you know, having these long back-and-forths. It's right in the middle. I can see everything going on and I can say, "Okay, give me the history of that. I want that." Okay, it says it's live. It's running. Ooh, pretty nice.
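The loop he describes (list the agent's environments, poke around inside one, inspect its history, merge the one you like) might look roughly like the session below. The cu subcommands and the git ref name are taken from what is said on stage and are assumptions about the exact CLI, not a spec.

```python
# Rough sketch of the demoed collaboration loop, driven from Python for
# illustration. The "cu" subcommands (list, merge) and the container-use/<env>
# git ref are assumptions based on the talk; exact syntax may differ.
import subprocess


def sh(cmd: list[str]) -> str:
    """Run a command and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# 1. See which agent environments exist (each is backed by git state that
#    lives alongside the repo, without polluting the working tree).
print(sh(["cu", "list"]))

# 2. Inspect one environment's history; every agent action is a snapshot,
#    so plain git can show it (environment name and ref are hypothetical).
env = "fancy-mallard"
print(sh(["git", "log", "--oneline", f"container-use/{env}"]))

# 3. Happy with the result? Pull that environment's work into your branch.
print(sh(["cu", "merge", env]))
```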
Cool. Okay, so now, I appreciate it, but you guys can be honest. It's a little boring. So: this design is boring. Make it really pop. Trying to impress an engineering audience there.
Okay. Okay. So, the reason I'm I'm doing that is trying to create the circumstances where I would need a lot of parallel experiments, right? Make it
pop. What does that mean? Mean anything.
What if I want to try several experiments in parallel? Right? So, I'm
just going to say, oh, well, hold on one second. Stop.
Before I do that, I'm going to um merge this. Right? There's still
nothing here, but I'm saying I like it.
So, I'm going to say merge that environment. And I have it. It's my
environment. And I have it. It's my
history. I can open a pull request, can clean it up, whatever. So, that's that's a loop that I can work with, right? Um
and now I can say, nah, boring.
And then I can say, since the environment is now in this state, I can ask for help from a few other agents, right? I can say, okay, hey, claude yolo... uh, that's not right... claude yolo: this web app looks a bit boring. Can you make it pop, please? Okay. And go, and go, and go. Okay, so this is where things start really going wrong. But as the team pointed out, I said, "Well, something's going to go wrong, right?" They said, "Yeah, but
you were kind of showing that if things go wrong, you can throw away the environment and you're good. You can
restart." I said, "Okay, that's cool."
So, um, like let's say I don't like this one. I'm like, "Nope, goodbye. That's
one. I'm like, "Nope, goodbye. That's
it. I don't have to go clean up the mess, right? That's the whole point."
mess, right? That's the whole point."
Uh, okay. So, this is getting a little messy. Oh, I wanted to show Goose also.
messy. Oh, I wanted to show Goose also.
So, Goose is a really cool open source agent. Whoops. All right. Hold on a
agent. Whoops. All right. Hold on a second. Goose YOLO. Same thing. Everyone
second. Goose YOLO. Same thing. Everyone
has complicated flags for disabling all these safeties that I don't need anymore, right? because it's
anymore, right? because it's uh okay. Okay. Well, really taking a chance
okay. Okay. Well, really taking a chance here. So, while this is happening,
here. So, while this is happening, uh one thing we've been working on, but I it's still work in progress is there's a watch command. I showed you that
already, but as so as um this is a git command, right? Thinly wrapped git
command, right? Thinly wrapped git command. Our UX is really I cannot words
command. Our UX is really I cannot words cannot express how unfinished this is but but it's it'll evolve rapidly because the bones are strong. It's git,
it's dagger and you know it's your existing agent, right? So it's and then a little bit of glue. Uh so for example here is literally it's a git command
they can copy paste. Uh, but as the agents work, you're going to see state snapshotting and you're going to see these branches just kind of um diverging
and then I can diff them and apply them, merge them, whatever I want. Um, and
what I really wanted to show, and then I'm done, is I just want to see one of them run. So you can see, when the agent runs a service, like, in this case, go run, npm run, whatever, it's doing it in its containerized environment, and that's going to seamlessly be tunneled to my machine here on a different port without any conflicts, right? So when I say the environment's isolated, it's its files, its context, its configuration, and its execution, right? Uh, and the cool extra thing is all of this is actually, technically, this here is running in my basement. So you
can go crazy on the infrastructure side.
Like you can run this on a cluster. We
like to run this stuff from CI. Uh it's
just a lot of fun stuff you can do. And
I'm getting 30 seconds. Come on. Oh,
goose is Oh, goose is running. Great.
Okay. We did not solve prompt engineering. Do
engineering. Do it. Okay. Not done. Not done. Oh man.
it. Okay. Not done. Not done. Oh man.
Okay. Well, just
[Laughter] imagine. Okay. Well, uh, while this
imagine. Okay. Well, uh, while this happens, because I've got 30 seconds left, I'm just going to say, um, thank you. And there's one last thing I want to say, about DockerCon. 10 years ago, we used to open source stuff on stage all the time. So, if you want, I can go and open source it right now.
Okay. You have been warned though about the not finished part, right?
Okay. Okay. Oh, I think my It would be funny if the demo failed at the clicking on GitHub part. Okay. All right.
Goodbye. Goodbye. Next time. I promise
it works. Okay. Haven't done this in a
works. Okay. Haven't done this in a while.
Wait.
Oh, I'm almost done. I
promise. Come on. You did so well. Change
well. Change visibility. Yes, I want.
visibility. Yes, I want.
Yes, I have read and understand. Oh
understand. Oh god. Oh god.
god. Oh god.
Uh yes. At Dagger, we take security very
yes. At Dagger, we take security very seriously.
Okay. All right. I think it's Wait. I
think it's done. Yes.
done. Yes.
Okay. So, yeah, thank you very much. It's at github.com/dagger/container-use. Come say hi, come participate, and thank you so much for having me.
[Music] [Applause]
[Music] Our next speaker is building the
infrastructure for the singularity.
Please join me in welcoming the founder and CEO of Morph Labs, Jesse
[Music] [Applause] Han. Howdy. Howdy.
Han. Howdy. Howdy.
You know, history misremembers Prometheus. The whole class struggle between mankind and the gods was really a red herring. And the real story wasn't so much the rebellion against the divine hegemony, but rather the liberation of fire, the emerging relationship between mankind and its first form of technology.
And the reason why we're here today is arguably because we're on the cusp of perfecting our final form of technology or at least the final technology that
will be created by beings that are recognizably human.
And our final technology has begun to develop not just intelligence but also sapience and, arguably, personhood. And it increasingly becomes an other to whom we must relate. So we increasingly have to ask ourselves the question: how should we treat these new beings?
Uh the question therefore arises, what if we had more empathy for the machine? So over a hundred years
machine? So over a hundred years ago, so over a hundred years ago, uh you know, Einstein had this thought experiment
um where he imagined what it would be like to race alongside a beam of light. And you know the nature of being
light. And you know the nature of being close to the singularity is that you're propelled further into the future faster than everything around
you. And as you move closer and closer
you. And as you move closer and closer to the speed of light, the rate at which you can interact with the external world, your ability to communicate with other beings
uh is deeply limited. Everything around
you is frozen.
And I think thinking at the speed of light, you know, insofar as we have created thinking machines whose intelligence will soon be metered by the kilohertz and the megatoken, thinking at the speed of light must be just as lonely as moving at the speed of light. And therefore, what does the
light. And therefore, what does the machine want? Well, the machine wants to
machine want? Well, the machine wants to be embodied in a world that can move as quickly as it does.
that can react to its thoughts and move at the same speed of light. What the
machine desires is infinite possibility, right? Uh the machine wants
possibility, right? Uh the machine wants to race along uh uh every possible beam of light. Uh the machine wants to
of light. Uh the machine wants to explore multiple universes.
Um, how can we liberate thinking machines? How can we free them from this
machines? How can we free them from this fundamental loneliness of this um, you know, these relativistic effects of being so close to the singularity, closer to the
singularity than we are. Um, and that's exactly why we built Infinibranch. So, Infinibranch is virtualization, storage, and networking technology reimagined from the ground up for a world filled with thinking machines that can think at the speed of light, that need to interact with the external world, increasingly complex software environments, with zero latency. Um, and so as you can see in the first demo, which we're going to play right now, how Infinibranch works is that we can run entire virtual machines in the cloud that can be snapshotted, branched, and replicated in a fraction of a second. And so if you're
an agent uh you know embodied inside of a computer using environment there might be various actions that you want to take. You want to navigate the browser.
take. You want to navigate the browser.
You want to click on various links. Um
but normally those actions are irreversible. Normally, the thinking machine is not offered the possibility of grace. But with Infinibranch, all mistakes become reversible. Um all paths forward become
reversible. Um all paths forward become possible. You can take actions. Uh you
possible. You can take actions. Uh you
can backtrack and you can even take every possible action, right? Just to explore to roll
action, right? Just to explore to roll forward a simulator and see what possible worlds await.
Uh, next slide. Um, so Infinibranch was already a generation ahead of everything else that even the foundation labs were using. But today I'm excited to announce the creation of Morph Liquid Metal, which improves performance, latency, and storage efficiency across the board by another order of magnitude. Um, we have first-class container runtime support. Uh, you can branch now in milliseconds rather than seconds. You can autoscale to zero and to infinity. And soon we will be supporting GPUs, and this will all be arriving Q4 2025.
So what are the implications of all of this?
Well, you know, we've sort of begun to work backwards uh from the future, right? We've asked ourselves, you know,
right? We've asked ourselves, you know, what does it feel like to be a thinking machine that can move so much faster than the world around it. But what the world around it really
it. But what the world around it really is, is the world of bits, right? And that's the cloud. And so what Infinibranch will serve as, fundamentally, is a substrate for the cloud for agents. So what does this cloud for agents look like?
Well, you need to be able to uh to declaratively specify the workspaces that your agents are going to be operating in, right? You need to be able
to spin up, spin down, uh, frictionlessly pass back and forth the workspaces between humans, agents, and other agents. You want to be able to
other agents. You want to be able to scale, um, scale test time search against verifiers to find the best possible answer.
Uh, and so as you'll see in this demo, what happens is you can take a snapshot, set it up to prepare a workspace, and you'll see that we can run agents with test-time scaling by racing them to find the best possible solution against a given verification condition.
Um, so because of Infinibranch, snapshots on Morph Cloud acquire Docker-layer-caching-like semantics, meaning that you can layer on side effects which may mutate container state. So you can think of it as being git for compute, and you can idempotently run these chained workflows on top of snapshots. But not only that, as you can see inside the code, if you use this do method, you can dispatch this to an agent, and that will trigger an idempotent, durable agent workflow which is able to branch. So you can start from that declaratively specified snapshot and go hand it off to as many parallel agents as you want, and those agents will try different methods, in this case different methods for spinning up a server on port 8000, and, you know, one agent fails but the other one succeeds, and you can take that solution and just pass it on to other parts of your workflow.
So this is the kind of workflow that everyone's going to be using in the very near future, and it's uniquely enabled by Infinibranch, by the fact that we can so effortlessly create these snapshots, store them, move them around, rehydrate them, and replicate them with minimal overhead.
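To make the "git for compute" idea concrete, here is a rough pseudocode sketch of what such a snapshot-and-branch workflow might look like. The client module, class, and method names below (MorphClient, snapshots.create, setup, instances.start, exec) are assumptions made for illustration, not the documented Morph Cloud SDK surface, so treat this as pseudocode rather than runnable integration code.

```python
# Pseudocode sketch of the snapshot / branch / verify workflow described on
# stage. All client names and signatures here are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

from morph_sketch import MorphClient  # hypothetical client module, for illustration only

client = MorphClient()

# 1. Declaratively build a base workspace snapshot. Setup steps layer like
#    Docker image layers, so re-running them is effectively cached.
base = client.snapshots.create(image="ubuntu-22.04", vcpus=2, memory_mb=4096)
base = base.setup("apt-get install -y python3 python3-pip")  # idempotent, layered step

# 2. Branch the same snapshot to several agents, each trying a different way
#    to satisfy the same verification condition ("something answers on port 8000").
STRATEGIES = [
    "python3 -m http.server 8000 &",
    "pip install flask && FLASK_APP=app flask run --port 8000 &",
]


def attempt(cmd: str) -> tuple[str, bool]:
    instance = client.instances.start(snapshot_id=base.id)  # fresh branch of the snapshot
    instance.exec(cmd)
    ok = instance.exec("curl -sf http://localhost:8000").exit_code == 0  # the verifier
    return cmd, ok


with ThreadPoolExecutor() as pool:
    results = list(pool.map(attempt, STRATEGIES))

# 3. Keep whichever branch passed verification and hand it to the next step.
print("verified strategies:", [cmd for cmd, ok in results if ok])
```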
Um so what else does the machine want?
Well, the machine desires simulacra. And what this means, fundamentally, is that a thinking machine wants to be grounded in the real world, right? It wants to interact at extremely high throughput with increasingly complex software environments. It wants to roll out trajectories in simulators at unprecedented scale. And these
simulators are going to run inside of programs that haven't really been explored yet for reinforcement learning.
Um they're going to run on Morph Cloud, which is why Morph will be the cloud for reasoning.
And what does the future of reasoning look like?
Well, even more so than what has been explored already, the future of reasoning will be natively multi-agent. Uh, so thinking machines should be able to replicate themselves effortlessly, go attach themselves to simulation environments, go explore multiple solutions in parallel. Those environments should branch. They should be reversible. Uh, those models should be able to interact with the environment at very high throughput, and it should scale against verification. So let's take a look at
verification. So let's take a look at what that might look like um in a simple example where uh an agent is playing
chess. So this is an agent that we
chess. So this is an agent that we developed recently uh that uses tool calls during reasoning time to interact with a chess
environment. So along with a very
environment. So along with a very restricted chess engine for evaluating uh the position which we think of as the verifier. Um and as you can see um it's
verifier. Um and as you can see um it's already able to do some pretty sophisticated reasoning just because it has access to these
interfaces. Um however if you take the
interfaces. Um however if you take the ideas which were just described and you sort of follow them to their logical conclusion you arrive at something which
we call reasoning time branching.
which is the ability to not just call to tools while the machine is thinking uh but to replicate and branch the environment uh and decompose problems
and explore them in a verified way.
Uh and uh so as you can see here the agent is getting uh stuck in a bit of a local minimum.
Um but once you apply reasoning time branching you get something that works much much better.
So here, what's happening is that the agent is responsible for delegating parts of its reasoning to sub-agents, which are branched off of an identical copy of the environment. Uh, and this is all running on Morph Cloud, along with a verified problem decomposition which allows it to recombine the results and take them and find the correct move. Um, and so as you can see here, it's able to explore a lot more of the solution space because of this reasoning-time branching. So one thing that I will note here is that this capability is something which is not really explored in other models at the moment, and that's because the infrastructure challenges behind making branching environments that can support large-scale reinforcement learning for this kind of reasoning capability, especially coordinating multi-agent swarms, are fundamentally bottlenecked by innovations in infrastructure that we've managed to solve here. Um, and because of this, you can see that now, in less wall-clock time than before, the agent was able to call out to all these sub-agents, launch this swarm, and find the correct solution.
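As a toy illustration of the branch-and-verify pattern being described (and not Morph's actual system), one can branch a chess position into independent copies, let each branch explore a different candidate move, score the result with a simple verifier, and recombine by keeping the best branch. The sketch below uses the python-chess library, with a one-ply material count standing in for the restricted chess engine mentioned in the talk.

```python
# Toy sketch of reasoning-time branching: copy the environment (a chess board)
# once per candidate move, verify each branch, and recombine by picking the best.
# Requires python-chess (pip install chess); the material-count verifier is a
# deliberately crude stand-in for a real engine.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}


def verify(board: chess.Board, color: chess.Color) -> int:
    """Verifier: material balance from `color`'s point of view."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, color))
        score -= value * len(board.pieces(piece_type, not color))
    return score


def branch_and_choose(board: chess.Board) -> chess.Move:
    """Branch the environment once per legal move and keep the best-verified branch."""
    color = board.turn
    best_move, best_score = None, float("-inf")
    for move in board.legal_moves:
        branch = board.copy()           # cheap, independent copy of the environment
        branch.push(move)               # the sub-agent's action inside its own branch
        score = verify(branch, color)   # check the branch against the verifier
        if score > best_score:
            best_move, best_score = move, score
    return best_move


if __name__ == "__main__":
    # Position where capturing the black queen on d4 is the best one-ply gain.
    board = chess.Board("rnb1kbnr/pppp1ppp/8/4p3/3qP3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 1")
    print(board.san(branch_and_choose(board)))  # prints Nxd4 with this verifier
```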
So, you know, when I think about the problem of alignment, I really think that, you know, Wittgenstein had something right, and that it was fundamentally a problem of language. I think all problems around alignment can be traced to the insufficiencies of our language. Uh, this Faustian bargain that we made with natural language in order to unlock the capabilities of our language models. Um, but insofar as we must go and develop a new language for super intelligence, insofar as the grammar of the planetary computation has not yet been devised, and insofar as this new language must be computational in nature, must be something to which we can attach, you know, algorithmic guarantees of the
correctness of outputs. So this is something that morph
outputs. So this is something that morph cloud is uniquely enabled to handle.
And that's why we're developing verified super intelligence. So verified super
intelligence. So verified super intelligence will be a new kind of reasoning model which is capable not only of thinking for an extraordinarily
long time and interacting with external software at extremely high throughput.
But it will be able to use external software and formal verification software to reflect upon and improve its own reasoning and to produce outputs
which can be verified, which can be algorithmically checked, which can be expressed inside of this common language.
Um, and I'm very excited to announce that we are bringing on perhaps the best person in the world for developing verified super intelligence. It's with great pleasure that I'd like to announce that Christian Szegedy is joining Morph as our chief scientist. He was formerly a co-founder at xAI. He led the development of code reasoning capabilities for Grok 3. He invented batch norm and adversarial examples. Um, perhaps most importantly, he's a visionary, and he's pioneered precisely this intersection of verification methods, symbolic reasoning, and reasoning in large language models for almost the past decade. And we're thrilled to be partnering with him to build this super intelligence that we can only build on Morph
Cloud. Um, and so the demos that you've
Cloud. Um, and so the demos that you've seen today have all been powered by early checkpoints of a very uh a very
early version of this verified super intelligence that we've already begun to develop. And so uh this model is
develop. And so uh this model is something that we're calling Magi 1. And
it's going to be trained from the ground up to use Infinibranch, to perform reasoning-time branching, to perform verified reasoning, and to be an agent that will be fully embodied inside of a cloud that can move at the speed of light. Uh, and that's coming in Q1 2026.
So what does the infrastructure for the singularity look like? Well, we have a lot of ideas about it, but fundamentally we believe that the infrastructure for the singularity hasn't been invented
yet.
And uh you know at Morph we spend a lot of time talking about you know whether or not something is future bound which means not just futuristic
belonging to one possible future but but something which is so inevitable that it has to belong to every future. We believe that the
future. We believe that the infrastructure for the singularity is futurebound. That the grammar for the
futurebound. That the grammar for the planetary computation is futurebound.
That verified super intelligence is future bound. And we invite you to join us
bound. And we invite you to join us because it will run on morph cloud. Uh
thank you.
[Applause] Ladies and gentlemen, please welcome back to the stage the VP of developer relations at Llama Index, Lorie Voss.
Hey again everybody. Let's hear it for all of our keynote speakers.
So, just like yesterday, uh I want to quickly run you through what you're going to get from each of our tracks. Uh
likely to be our most popular track today is software engineering agents. Can LLMs power a full engineer, uh, not just coding alongside you in your IDE, but taking PRDs and turning them into full PRs? You'll hear about Devin, of course, but also about Jules and Claude Code and much more, right in this room. Uh, our next track is sponsored by OpenPipe and it's all about reasoning and reinforcement learning. Uh, reasoning models are all the rage in 2025, and inference time is the next great scaling law. Uh, if you want to learn about training, distillation, and getting alignment out of these new models, then this is the track for you. That is in Yerba Buena Ballrooms 2 to 6, which is out these doors and to your left, it's right
next door. Uh, the next track is retrieval and search. Uh, RAG is dead, long live agentic retrieval. Uh, this track is not about RAG. It's about what comes next. Uh, agentic search, multimodal retrieval, and all that comes with it. Uh, this is where my CEO Jerry will be giving a talk. He
gave the top rated talk last year so I recommend not missing it. That's going
to be in Golden Gate Ballroom A, which is out these doors to your left up the escalators and then turn left when you see the FedEx office. Uh, then there's the eval track
office. Uh, then there's the eval track sponsored by Braintrust. Uh, everybody says evals are important; we all agree. Uh, this track is curated by Ankur Goyal of Braintrust and is all about making evals work quickly and cheaply. Uh, next there's the same two
cheaply. Uh next there's the same two tracks for our leadership attendees that we had yesterday. So as a reminder that's for people with the gold lanyards.
Uh first is AI and the Fortune 500 track. Uh we've gathered success stories
track. Uh we've gathered success stories from real AI deployments in the Fortune 500 showing how to use AI at real scale.
That's in uh Golden Gate Ballroom C which is right next to A and B again left at the FedEx office. Uh our second leadership track
office. Uh our second leadership track again for gold lanyards is the AI architects track. Uh this is for CEOs,
architects track. Uh this is for CEOs, CTO's and VPs of AI to meet and learn from each other on everything from infrastructure to company strategy. Uh
that is in SoMa, which is all the way upstairs, three sets of escalators up, to the right of registration.
Next up is the security track. Uh, as we grant agents increasingly more access to our personal lives and company resources, the problem of security goes from an enterprise sales checklist to a P0. In this track, you'll learn about the state-of-the-art approaches for authentication and authorization in the world of AI. That's in Foothill C, which
is again all the way upstairs to the left of the registration area.
The next track is design engineering. Uh
LLMs are 10x better than they were a year ago, but design thinking around the UX of AI has barely budged from ChatGPT and Canvas. Uh, we've gathered the top designers and design engineers to showcase their work. That's going to be in Foothill G1 and 2, which is all the way upstairs, directly behind the registration desks.
Then there is the generated media track. Image gen, video gen, and music gen are all on fire this year, with increasing coherence over time and iterations, and stunning viral demos, uh, from Ghibli memes to personalized Valentine songs. How can AI engineers harness the state-of-the-art in AI art?
Uh, that's in Foothill F, which is all the way up three sets of escalators behind registration.
And our final track today is autonomy and robotics. Uh the ultimate prize in
and robotics. Uh the ultimate prize in AI is going outside, automating manual labor over knowledge work. Uh multimodal
LLMs are increasingly being deployed in the real world in everything from cars to kitchens to humanoid robots. Uh and
this track is all about the state of physical general intelligence. And it's
in foothill E, which is again up three sets of escalators behind and to the right of registration.
So those are all our tracks today. Now
please go forth and enjoy the expo. Uh
the next 45 minutes are dedicated expo time. There are also three expo session
time. There are also three expo session talks, uh, which are in Juniper and Willow, uh, on the floor with the FedEx office, uh, and also in Nob Hill A and B, which is right out these doors and opposite this
room. See you all back here for the
room. See you all back here for the keynotes at 3:45. Thanks very much.
[Music]
Welcome, everyone. My name is Vivu. I'm very excited to be hosting the SWE Agents track here today. Fun fact: this is the most popular track out of all of them. We have a completely full day ahead of you. Every single speaking slot will be filled. We've got eight amazing speakers here for you today. We're going to have speakers from every top SWE agent. So, you know, we've got the creators of Jules here, Claude Code, Codex, the original SWE agent. We've got Scott Wu from Devin at Cognition. He will be kicking us off. I'm going to keep my MCing very, very short so we give speaking time to the speakers. So, let's hear it. Let's kick things off. I want to welcome Scott Wu from Cognition here to speak about Devin.
[Applause] Oh, okay. Okay,
Oh, okay. Okay, cool. Awesome. Awesome. Yeah. Well,
cool. Awesome. Awesome. Yeah. Well,
thank you guys so much for having me.
It's exciting to be back. It's uh I I was last here at AI Engineer one year ago. Um and it's kind of crazy. I I've
ago. Um and it's kind of crazy. I I've
always been I I've been telling Swix that we need to have these conferences way more often if it's going to be about AI software engineering. Probably should
be like every two months or something like that with the pace of everything's done. But but but going to be fun to to
done. But but but going to be fun to to talk a little bit about um you know what we've seen in the space and and what we've learned over the last 12 or 18 months uh building Devon over this
time. And I want to start this off with
time. And I want to start this off with, um, Moore's law for AI agents. And so you can kind of think of the capability or the capacity of an AI by how much work it can do uninterrupted until you have to come in and step in and intervene or steer it or whatever it is, right? And, um, you know, with GPT-3, for example, if you were to go and ask GPT-3 to do something, you know, it could probably get through a few words or so, and then it'll say something where it's like, okay, you know, this is probably not the right thing to say. Um, and GPT-3.5 was better, and GPT-4 was better, right? Um, and so people talk about these lengths of tasks, and what you see in general is that that doubling time is about every seven months, which already is pretty crazy
actually. But in code, it's actually even faster. It's every 70 days, which is two or three months. And so, you know, if you look at various software engineering tasks that start from the simplest single functions or single lines, and you go all the way to, you know, we're doing tasks now that take hours of a human's time and an AI agent is able to just do all of that, right? Um, and if you think about doubling every 70 days, I mean, basically, you know, every two to three months means you get four to six doublings every year. Um, which means that the amount of work that an AI agent can do in code goes up something between 16x and 64x a year, every year, at least for the last couple of years that we've seen. Um, and it's kind of crazy to think about, but that sounds about right, actually, for what we've seen.
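The arithmetic behind the quoted 16x to 64x is easy to reproduce: a 70-day doubling period gives a little over five doublings per year, and four to six doublings correspond to 2^4 = 16x and 2^6 = 64x. A quick sanity check:

```python
# Sanity-check the "doubling every 70 days" figure quoted in the talk.
doubling_period_days = 70
doublings_per_year = 365 / doubling_period_days      # about 5.2 doublings
growth_per_year = 2 ** doublings_per_year            # about 37x at exactly 70 days

print(f"doublings per year: {doublings_per_year:.1f}")
print(f"implied growth per year: {growth_per_year:.0f}x")
print(f"range for 4 to 6 doublings: {2**4}x to {2**6}x")  # 16x to 64x, as quoted
```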
You know, 18 months ago, I would say the only really the only product experience that had PMF in code was just tab completion, right? It was just like
completion, right? It was just like here's what I have so far. Predict the
next line for me. that was kind of all you really could do um in in a way that really worked. And we've gone from that
really worked. And we've gone from that obviously to full AI engineer that goes and just do does does all these tasks for you, right? And implements a ton of these things. And people ask all the
these things. And people ask all the time, what is the um you know what what what is the the future interface or what is the right way to do this or what are the most important capabilities to solve
for? And I think funnily enough, the
for? And I think funnily enough, the answer to all these questions actually is it changes every two or three months.
like every time you get to the next tier, the the the bottleneck that you're running into or the most important capability or the right way you should be interfacing with it, like all these
actually change at at each point. And so
I wanted to talk a bit about some of the the tiers for us over the last year or so. Um and you know over the course of
so. Um and you know over the course of that time obviously you know when we got started um in the end of 2023 obviously agents were not even a concept. Um, and
now everyone has, you know, everyone's talking about coding agents, people are doing more and more and more. Uh, and
and it's very cool to see. Um, and and each of these has kind of been almost a discrete tier for us. Um, and so right right around a year ago when we were doing the the last AI engineer talk
actually, um, the the biggest use case that we really saw that that was getting broad adoption was what I'll kind of call these repetitive migrations. And so
I'm talking like JavaScript to TypeScript or like upgrading your Angular version from this one to that one or going from this Java version to that Java version or something like
that. Um and those those kinds of tasks
that. Um and those those kinds of tasks in particular what you typically see is you are you you have some massive code base that you want to apply this whole
migration for. You have to go file by
migration for. You have to go file by file and do every single one. And
usually the set of steps is pretty clear, right? If you go to the Angular
clear, right? If you go to the Angular website or something like that, it'll tell you, all right, here's what you have to do. This, this, this, this, this, and um, you want to go and execute each of these steps. It's not so routine that there, you know, there's no
classical deterministic program that solves that. But there's kind of a clear
solves that. But there's kind of a clear set of steps. And if you can follow those steps very well, then you can do the task. And, you know, this was the
the task. And, you know, this was the thing for us because that was all you could really trust agents to do at the time. you know, you could do harder
time. you know, you could do harder things once in a while and you could do some really cool stuff occasionally, but as far as something that was consistent enough that you could do it over and over and over, um, these kinds of like repetitive
migrations that you would be doing for, you know, 10,000 files were, you know, in many ways the the the easiest thing, which was cool actually because it was also kind of the the most
annoying thing for humans to do. And I
think that's generally been the trend where um AI has always done these more boilerplate tasks and the more tedious stuff, the more repetitive stuff and we get to do the the the more fun creative stuff. Um and obviously as time has gone
stuff. Um, and obviously as time has gone on, it's taken on more and more of that boilerplate. But for a problem like this one, a lot of what you need to do is you need Devin to be able to go and execute a set of steps reliably. And so a lot of this was, you know, I would say the big capabilities problem to solve was mostly instruction following. And so we built this system called playbooks, where basically you could just outline a very clear set of steps, have it follow each of those step by step, and then do exactly what's said; a sketch of what such a playbook might look like follows below. Now, if you think about it, obviously a lot of software engineering does not fall under the category of literally just follow 10 steps step by step and do exactly what's said. But migration does, and it allowed us to go and actually do these, and this was kind of, I would say, the first big use case of Devin that really came up. I think one of the other big systems that got built
around that time, which we've since rebuilt many times, is knowledge, or memory, right? Which is, you know, if you're doing the same task over and over and over again, then often the human will have feedback of, hey, by the way, you have to remember to do X, or you need to do Y every time when you see this, right? Um, and so basically an ability to just maintain and understand the learnings from that and use them to improve the agent on every future run. And those were kind of the big problems of the time, you know, and that was summer of last year.
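As referenced above, here is a rough sketch of what a playbook of that kind might look like. The steps, commands, and rules are invented for illustration and are not Devin's actual playbook format.

```python
# Hypothetical example of a "playbook": a clear, ordered set of steps an agent
# should follow verbatim for a repetitive migration. Format and contents are
# illustrative only, not Devin's real playbook schema.
JS_TO_TS_PLAYBOOK = """
Goal: migrate one JavaScript file per task to TypeScript.

Steps:
1. Rename the target file from .js to .ts (or .tsx if it contains JSX).
2. Add explicit types for all exported functions and their parameters.
3. Replace require() calls with ES module imports.
4. Run `npx tsc --noEmit` and fix any new type errors in this file only.
5. Run `npm test` and confirm the existing tests still pass.
6. Open a pull request titled "migrate <filename> to TypeScript".

Rules:
- Do not change runtime behavior.
- Do not reformat unrelated files.
- If a step cannot be completed, stop and ask a human instead of guessing.
"""

if __name__ == "__main__":
    print(JS_TO_TS_PLAYBOOK)
```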
And around the end of summer or fall or so, you know, I think the kind of big thing that started coming up was, as these systems got more and more capable, instead of just doing the most routine migrations, you could do, you know,
these more still pretty isolated, but but but but a bit broader of these general kind of bugs or features where you can actually just tell it what you want to do and have you have it do it,
right? And so, for example: hey Devin, in this repo's select dropdown, can you please just list the currently selected ones at the top? Like, having the checkboxes throughout just doesn't really work. And Devin will just go and do that, right? And so if you think about
that, right? And so if you think about it, it's, you know, it's it's it's something like the kind of level of task that you would give an intern. And there are a few particular
intern. And there are a few particular things that you have to solve for um with this. First of all, usually these
with this. First of all, usually these these these changes are pretty isolated and pretty contained. And so it's one maybe two files that you really have to look at and change to do a task like this. But at least you do still need to
this. But at least you do still need to be able to set up the repo and work with the repo, right? And so you want to be able to run lint, you want to be able to run CI, all of these other things. So
you know to at least have the basic checks of whether things work. One of
the big things that we built around then was the ability to really set up your repository uh ahead of time and build a snapshot um that that you could start off that you could reload that you could roll back and all of these kinds of
primitives as well, right? So having this clean remote VM that could run all these things: it could run your CI, it could run your linter, and so on. Um, but that's when we started to really see, I would say, a bit more broad value, right? I mean, migrations is one particular thing, and for that particular thing we were showing a ton of value, and then we started to see where, you know, with these bug fixes or things like that, you would be able to just generally get value from Devin as almost like a junior buddy of yours. And then in the fall, things really moved towards just much broader bugs and requests, and here
it's you know most most changes again you know you jumping another order of magnitude most changes don't just contain themselves to one file right often you have to go and look see what's going on you have to diagnose things you
have to figure out what's happening you have to work across files and make the right changes. Often these changes are,
right changes. Often these changes are, you know, hundreds of lines if it's like, hey, I've got this bug. Let's
figure out what's going on. Let's solve
it, right? And, you know, there there are a
right? And, you know, there there are a lot of things here that that really started to make sense and really started to be important, but but one in particular I'll just point out was there's a lot of stuff that you can do
with not just looking at the code as text, but thinking of it as this whole hierarchy, right? So, so understanding
hierarchy, right? So, so understanding call hierarchies, running a language server, uh, is a big deal. You have git commit history which you can look at which informs how how these different
files relate to one another. You have, um, obviously, you have like your linter and things like that, but you're able to kind of reference things across files. And so, like, one of the big
files. And so like one of the big problems here I think was u kind of working with the context of it and getting to the point where it could make changes across several files. It could
be consistent across those changes. It
would be able to understand across the codebase. And here was really the point,
codebase. And here was really the point, I would say, where you started to be able to just tag it and have it do an issue and just have it build it for you.
Um, and so Slack was, you know, a huge part of the workflow then. Um, and it just made sense, because it's where you discuss your issues and it's where you set these things up, right? So you would tag Devin in Slack and say, "Hey, by the way, we've got this bug. Please take a look." Or, you know, could you please go build this thing? Uh, this is an especially fun part for us, because this is right around when we went GA. Uh, and a lot of that was because it got to the point where you truly could just get set up with Devin and ask it a lot of these broad tasks and just have it do it. Um, but a lot of the work that we did was around having Devin have better and better understanding of the codebase, right?
And if you think about it, you know, from the human lens, it's the same way where on your first day on the job, for example, being super fresh in the codebase, it's kind of tough to know exactly what you're supposed to do. Like
a lot of these details are things that you understand over time or that a representation of the codebase that you build over time, right? Um and Devon had to do the same thing and had to understand how do I plan this task out before I solve it? How do I understand
all the files that need to be changed?
How do I go from there and make that diff?
And around the spring of this year, um, again, every every gap is like two or three months. You know, we we got to an
three months. You know, we we got to an interesting point, which is once you start to get to harder and harder tasks, you as the human don't necessarily know everything that you want done at the
time that you're giving the task, right?
If you're saying, hey, you know, I I'd like to go and um improve the architecture of this, or you know, this this function is slow. Like, let's let's profile it and look into it and see what
needs to be done. or hey like you know we really should should handle this this error case better but like let's look at all the possibilities and see what we should you know what the right logic should be in each of these right and
basically what it meant is that this whole idea of taking a two-line prompt or a three-line prompt or something and then just having that result in a Devin task was not sufficient, and you wanted to really be able to work with Devin and specify a lot more. And around this time, along with this kind of better codebase intelligence, we had a few different things that came up, and so we released DeepWiki, for example. Um, and the whole idea of DeepWiki was, you know, funnily enough, Devin had its own internal representation of the codebase, but it turns out that for humans it was great to look at that too, to be able to understand what was going on or to be
able to ask questions quickly about the codebase. Um, closely related to that
codebase. Um, closely related to that was was search, which is the ability to really just ask questions about a codebase and understand um, some some piece of this. And a lot of the workflow
that really started to come up was actually basically this this more iterative workflow where the first thing that you would do is you would ask a few questions. You would understand, you
questions. You would understand, you would basically have a more L2 experience where you can go and explore the codebase with your agent, figure out what has to be done in the task, and then set your agent off to
go do that, because for these more complex tasks you kind of needed that, right? Um, and so, you know, that was, I would say, kind of like a big paradigm shift for us then, and this is what also came along with Devin 2.0, for example, and the in-IDE experience, where often, yeah, you want to be able to have points where you closely monitor Devin for 10% or 20% of the task and then have it work on its own for the other 80 or 90%.
Um, and then lastly, most recently in June, which is now, it was kind of, yeah, really the ability to just truly just kill your backlog and hand it a ton of tasks and have it do all these at once. And, you know, if you think about
once. And, you know, if you think about this task, in many ways, I would say it's it's almost like a culmination of of many of these different things that that had to be done in the past. You
have to work with all these systems. Obviously, you have to integrate into all these. Certainly, you want to be
all these. Certainly, you want to be able to work with Linear or with Jira or systems like that, but you have to be able to scope out a task to understand what's meant and what's going on. You
have to decide when to go to the human for more approval or for questions or things like that. You have to work across several different files. Often,
you have to understand even what repo is the right repo to make the change in. If
if your if your org has multiple repos or what part of the codebase is the right part of the codebase that needs to change. Um, and to really get to the
change. Um, and to really get to the point where you can go and do this more autonomously, first of all, um, you have to have like a really great sense of confidence, right? And so, um, you know,
confidence, right? And so, um, you know, rather than just going off and doing things immediately, you have to be able to say, okay, I'm quite sure that this is the task and I'm going to go execute it now versus I don't understand what's
going on. Human, please give me help.
going on. Human, please give me help.
Basically, right? But but the other piece of it is this is I think the era where testing and this asynchronous testing gets really really important, right? Which is if you want something to
right? Which is if you want something to just deliver entire PRs for you for tasks that you do, especially for these larger tasks, you want to know that it is can can test it itself. And often the
agent actually needs this iterative loop to be able to go and do that, right? So
it needs to be able to run all the code locally. It needs to know what to test.
locally. It needs to know what to test.
It needs to know what to look for. Um,
and in many ways it's just a a much higher context problem to solve for, right? Is this testing
right? Is this testing itself and that brings us to now. And
obviously it's a it's a pretty fun time to see because now what we're thinking about is hey maybe if instead of doing it just one task it's you know how how do we think about tackling an entire project right and after we do a project
you know what what goes after that a and maybe one point that I would just make here is we talk about all these two X's you know that happen every couple months and I think from a kind of cosmic
perspective all the two X's look the same right but in practice every 2X actually is a different one right and so when we were just doing you tab completion, line, single line completion. It really was just a text
completion. It really was just a text problem. It is just like taken the
problem. It is just, like, take the single file so far and just predict what the next line of text is. Right? Over the last year or year and a half, we've had to think about so much more. How do you work with the human in Linear or Slack, or how do you take in feedback or steering? Um, how do you help the
steering? Um how how do you help the human plan out and do all these things, right? And moreover, obviously, there's
right? And moreover, obviously, there's a ton of the tooling and the capabilities work that have to be done of how does how does Devon test on its own? How does Devon um uh you know make
own? How does Devon um uh you know make a lot of these longer term decisions on its own? How does it debug its own
its own? How does it debug its own outputs or or run the right shell commands to figure out what the feedback is uh and go from there? And so it's super exciting now that there's a lot more uh there's a lot more coding agents
in the space. It's uh it's it's very fun to see and I think that you know we we're going to see another 16 to 64x over the next 12 months as well and uh and so yeah super super
excited. Awesome. Well, that's all.
excited. Awesome. Well, that's all.
Thank you guys so much for having me.
Awesome. Uh, thanks, Scott, what a great talk. Um, so we just heard from the creators of Devin, one of the very first proper SWE agents, right? They shocked the world with their demo. They were kind of the first to pivot this field toward autonomous long-form agents that can run and actually complete tasks. Now our next speaker is from Google. He's an AI PM at Google Labs, and he works on Jules. Jules is one of the latest coding agents, right? So he's
going to speak to us about asynchronous coding agents. As we change from a world
coding agents. As we change from a world of coding co-pilots to autonomous agents, how do we kind of delegate our workflow? What do we do when we have a
workflow? What do we do when we have a bunch of these agents going on? So,
without further ado, I want to welcome Rustin Banks from Google to speak to us about [Applause] Jules. Awesome. Hi everyone. I'm Rustin.
Awesome. Hi everyone, I'm Rustin. I'm a product manager with Google Labs and really thrilled to be here and get to speak to you today. This is really like a dream come true.
So, I'm an engineer at heart. This is my first compiler, Borland C++ 3.1. It came in the mail on ten 5.25-inch floppy disks. I ordered it from AOL classifieds. It was amazing. This is my bulletin board, which I hosted out of my parents' closet on salvaged computers. And I just think it's ironic that when I saw AI come out, I recognized the text-based interfaces perfectly from hosting bulletin boards. And then when I saw this, like many of you, I dedicated my career to AI coding. This is ChatGPT 3.5. Isn't it crazy how slow this was? And this used to be state-of-the-art only two years ago. It's pretty amazing.
Right now, I'm a product manager for Jules. Jules is an asynchronous coding agent meant to run in the background and do, in parallel, all those tasks that you don't want to do. We launched it just two weeks ago at I/O, to everyone, everywhere, all at once, for free, while Josh was up on stage trying to demo other Google Labs products. And so he called us and we said, "Oh, we've got to shut it down" so that the other products could demo, and luckily we got it back up and going. But it was a super exciting launch, and the best part about it is seeing these use cases, because this is what we really want to solve. We want to do the laundry, so to say, so that you can focus on the art of coding. So the next time Firebase updates their SDK, Jules can do that for you. Or if you just want to develop from your phone, Jules can do that for you. In the last two weeks, we've had 40,000 public commits, and we're super excited about what we can bring to the open-source world.
But as developers, we're meant to think serially. We take a task from the queue, we work on it, we go on to the next one. That's our default workflow. Today, we'll learn how to maximize parallel agents. I'll try a real-world demo, we'll go through a real-world use case, and then I'll go through some best practices we've learned from watching people use Jules.
For this parallel process to really work well, we need to get better with AI at the beginning and the end of the workflow. Meaning, if it's on me to just write a bunch of tasks all day, that's not fun. And if I'm reviewing PRs and handling merge messes at the end of the day, that's not going to work well either. So luckily, help is on the way. For example, AI can easily work through backlogs and bug reports to create tasks for you, with you. And then at the end of the SDLC, help is on the way with critic agents and merging agents that can bring everything together, so that this parallel workflow we've envisioned can really come together and not drive us crazy.
Remote agents are uniquely suited for this. Agents inside our IDE are always going to be limited by our laptop. When you have these remote agents in the cloud, essentially agents as a service, they're infinitely scalable, they're always connected, and you can develop from anywhere, from any device.
We've seen two types of parallelism emerging. The first is the type we expected, which is multitasking: I have 10 different things on my backlog, let's do them all at once, then we'll merge them together and test them. Interestingly, you saw an example of the second type this morning with Solomon from Dagger showing how he wanted three different views of his website at the same time. This was the emergent behavior we didn't expect: multiple variations. Essentially, we see users taking a task, especially a complex task, and saying, "Try it this way, try it that way," or "Give me this variation to look at," or multiple variations to look at. And then you can test and choose. We can have the agents test and choose the best ones, or the user can test and choose. So, for example, we see lots of people working on a front-end task in a React app saying, "I'm adding drag and drop. Maybe try it using this library, react-beautiful-dnd, or maybe use dnd-kit, or maybe try writing the tests first." In this parallel, asynchronous environment you can just spin up multiple agents at the same time, they can try it, they can easily come back together, you choose the best one, and you're off to the races.
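To make the "multiple variations" idea a bit more concrete, here is a rough, hypothetical sketch of what one of those variation branches might contain if the agent chose dnd-kit for the drag-and-drop task. The component names, IDs, and props are invented for illustration; the real schedule app's code would of course differ, and a parallel branch might implement the same feature with react-beautiful-dnd instead.

```tsx
// Hypothetical sketch: one agent variation implementing drag-and-drop with @dnd-kit/core.
import React from "react";
import { DndContext, DragEndEvent, useDraggable, useDroppable } from "@dnd-kit/core";

function SessionCard({ id, title }: { id: string; title: string }) {
  const { attributes, listeners, setNodeRef, transform } = useDraggable({ id });
  const style = transform
    ? { transform: `translate(${transform.x}px, ${transform.y}px)` }
    : undefined;
  return (
    <div ref={setNodeRef} style={style} {...listeners} {...attributes}>
      {title}
    </div>
  );
}

function FavoritesDropZone() {
  const { setNodeRef, isOver } = useDroppable({ id: "favorites" });
  return <div ref={setNodeRef}>{isOver ? "Release to favorite" : "Drag sessions here"}</div>;
}

export function Schedule({ onFavorite }: { onFavorite: (sessionId: string) => void }) {
  // When a card is dropped on the favorites zone, record it.
  const handleDragEnd = (event: DragEndEvent) => {
    if (event.over?.id === "favorites") onFavorite(String(event.active.id));
  };
  return (
    <DndContext onDragEnd={handleDragEnd}>
      <SessionCard id="keynote-day-2" title="Day 2 Keynotes" />
      <FavoritesDropZone />
    </DndContext>
  );
}
```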
Okay, demo time. I'll exit out of this for a demo. I'm going to use the conference schedule website. And swyx, for all his skills, as you can see, has probably not spent a lot of time designing the schedule website. As you can see, any time there's a horizontal scroll bar, we know that's a problem. But luckily they knew that, and they said: we're just going to publish the JSON feed and let hackers hack. Engineers do what we do, so let's build from it. So Pallav, who is here, built this amazing conference site where you can favorite things, you can bookmark things, and this is what I use to keep track of my sessions for the conference. And so I messaged him and said, "Hey, can I clone this and use it as an example for Jules?" And Pallav said, "Oh, yeah, sure. Actually, I was sitting in my last session, on my phone, and I fixed a bug using Jules." So I thought that was perfect.
So this is how I would start something like this: I would go into Linear and say, okay, the first thing we need to do, and we just heard Scott talk about it, is add a way to know that if this parallel agent is going to do a bunch of things at the same time, it's getting it right. So first we're going to add some tests. I'm going to kick this one off while I'm thinking about it. Then, using that idea of multiple variations, I'm going to say: add it with Jest, and add it with Playwright, at the same time. Then we'll look at the test coverage and choose the one with the best coverage. Once that's done, I can go to that other mode of parallelism and say: I would like a link to add a session to my Google Calendar, and I would like an AI summary when I click on a description. Those are all features, but what I'm really excited about is for AI to do the stuff we never seem to get to, such as accessibility audits and security audits — all those things that seem to go on the backlog but are really important. I'm super excited for AI to do that. So we're also going to have it do an accessibility audit and improve our Lighthouse scores at the same time. This is mostly a front-end demo because, well, I'm mostly a front-end engineer and it's a better visual representation, but we've seen all of this applied to the back end as well.
Okay, so here's Jules. We told it to add tests in the Jest framework. It connects to my GitHub, all my GitHub repos, and it's going to give me a plan. That looks about right: I can see it's going to test the calendar, the search overlay, the sessions. That sounds great. I can approve the plan. So, Jules now has its own VM in the cloud. It's cloned my whole codebase. It can run all the commands that I can run, and importantly, once it has these tests, it can run them, so it can know, when we add a new feature, whether it got things right. I'm going to fast forward a little bit here. So this is adding Jest tests. You can see all the components it's added tests for. It's also added to the readme, so the next time it goes to add something, it'll look at the readme and remind itself: oh, this is how I run the tests. Let's see how it did on test coverage. Okay, estimated test coverage looks like about 80%. That's pretty good. We could compare that with Playwright, choose the one we like best, merge that into main, and we're off to the races.
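As a rough illustration of that "test and choose" step, here is a hypothetical Node/TypeScript script that compares the coverage summaries produced by the two variation branches and picks the higher one. It assumes both branches emit an Istanbul-style coverage-summary.json (Jest does this with `--coverage --coverageReporters=json-summary`; a Playwright setup could do it via nyc/istanbul); the branch names, file paths, and layout are invented for the example.

```typescript
// Hypothetical helper: pick the variation branch with the higher line coverage.
// Assumes each branch's coverage-summary.json was copied to the paths below.
import { readFileSync } from "node:fs";

interface CoverageSummary {
  total: { lines: { pct: number } };
}

function lineCoverage(path: string): number {
  const summary = JSON.parse(readFileSync(path, "utf8")) as CoverageSummary;
  return summary.total.lines.pct;
}

const candidates = [
  { branch: "tests-jest", file: "artifacts/jest/coverage-summary.json" },
  { branch: "tests-playwright", file: "artifacts/playwright/coverage-summary.json" },
];

const ranked = candidates
  .map((c) => ({ ...c, pct: lineCoverage(c.file) }))
  .sort((a, b) => b.pct - a.pct);

for (const c of ranked) {
  console.log(`${c.branch}: ${c.pct.toFixed(1)}% line coverage`);
}
console.log(`Winner: ${ranked[0].branch}`);
```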
So again, it's automatically integrated into GitHub. We merge that into main, and now we can start saying: okay, now I want a calendar link. I want a calendar button that can go in, and Jules will work on that. And then, sure enough, it ran the tests. The tests didn't pass the first time, so it makes some changes, and now the tests pass. I can review this code. Eventually I could look at it in Jules's browser, but I feel pretty confident knowing that all the tests pass.
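To give a flavor of the kind of verifiable check the agent is leaning on here, below is a hedged sketch of an add-to-calendar helper and a Jest test for it. The `buildGoogleCalendarUrl` helper, the `Session` shape, and the example data are all hypothetical illustrations; the actual code Jules generated for the schedule site would look different.

```typescript
// Hypothetical sketch of an "add to Google Calendar" helper plus its Jest test.
// Google Calendar's render endpoint accepts action=TEMPLATE with text/dates/details params.
interface Session {
  title: string;
  start: Date;
  end: Date;
  description?: string;
}

const toCalendarStamp = (d: Date): string =>
  d.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}/, "");

export function buildGoogleCalendarUrl(session: Session): string {
  const params = new URLSearchParams({
    action: "TEMPLATE",
    text: session.title,
    dates: `${toCalendarStamp(session.start)}/${toCalendarStamp(session.end)}`,
    details: session.description ?? "",
  });
  return `https://calendar.google.com/calendar/render?${params.toString()}`;
}

// Jest test: this is the pass/fail signal the agent can re-run after each change.
test("calendar link encodes the session title and times", () => {
  const url = buildGoogleCalendarUrl({
    title: "SWE Agents Track",
    start: new Date("2025-06-04T17:00:00Z"),
    end: new Date("2025-06-04T18:00:00Z"),
  });
  expect(url).toContain("action=TEMPLATE");
  expect(url).toContain("text=SWE+Agents+Track");
  expect(url).toContain("20250604T170000Z%2F20250604T180000Z");
});
```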
Similarly for the Gemini summaries: when I click on a description, I can get a Gemini summary. I put this one in an emulator — or rather, I emulated a mobile view — just to show I could have done this from my phone. So this is it running an accessibility audit and fixing issues, from my phone. And never mind the console errors; Jules is going to fix those. And then I can go back. Now we have this big merge we need to do, and to be honest, I ran out of time to finish it. Jules should help me with this merge — it's called an octopus merge, so surely Jules, as a squid, should help with the octopus merge. But let's just check out our add-to-calendar button, go back to localhost, refresh, and now I have a calendar button. Let's test it. Okay, let's add this to my calendar to make sure I know to come to my own talk. And there, it's on my calendar. I could then pull this back into the main branch, and now everybody at the conference has the ability to add sessions to their Google Calendar, along with everything else we saw there: a full test suite, the accessibility audits, a Lighthouse score improvement. And that took me about an hour, managing the parallel process in the background.
Okay, so in summary, the secret to working in parallel is a clear definition of success, because nobody wants to review PRs all day. So think, before you get started: how am I going to easily verify that this works? Again, Scott hit on this as well. Create this agreement with the agent. Tell it: don't stop until you see this, or don't stop until this works. And then you need a robust merge-and-test framework at the end to put everything back together — and help is coming there.
This is how I prompt Jules. I give it a brief overview of the task. I tell it how it will know it got it right, and any helpful context. Then at the end I'll append a simple, broad approach, and I'll change that last line maybe two or three times depending on the complexity of the task.
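As a hedged illustration of that prompt shape — not an official Jules template — you could imagine assembling it like this, swapping out only the last line between cloned tasks. All of the names, fields, and example strings here are invented for the sketch.

```typescript
// Hypothetical helper mirroring the four-part prompt structure described above:
// overview, success criterion, helpful context, and a broad approach swapped per variation.
interface JulesPromptParts {
  overview: string;         // brief overview of the task
  successCriterion: string; // how the agent knows it got it right
  context: string[];        // any helpful context (queries, links, md files)
  approach: string;         // simple broad approach; change this line per cloned task
}

export function buildPrompt(p: JulesPromptParts): string {
  return [
    p.overview,
    `You will know you are done when: ${p.successCriterion}. Don't stop until then.`,
    ...p.context.map((c) => `Context: ${c}`),
    `Approach: ${p.approach}`,
  ].join("\n");
}

// Example: two variations of the same task, differing only in the last line.
const base = {
  overview: "Add an 'Add to Google Calendar' link to each session card.",
  successCriterion: "the new calendar-link Jest test passes",
  context: ["Run the test suite with `npm test`."],
};
console.log(buildPrompt({ ...base, approach: "Use a plain anchor tag." }));
console.log(buildPrompt({ ...base, approach: "Use a small React component." }));
```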
So for example, if I need to log this number from this web page every day, I'll say: today the number is X, so log the number to the console and don't stop until the number is X. That's a simple test that I wrote in, and it'll keep going. I give it helpful context, like: this is the search query. And then I'll say: use Puppeteer. And then I'll clone that task — because I can, it's in the cloud — and I'll say: use Playwright.
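Here's a hedged sketch of what the Playwright flavor of that task might boil down to once the agent writes it: navigate, find the number, and treat a mismatch as a failure so the agent knows it isn't done yet. The URL, selector, and expected value are placeholders; the Puppeteer variation would follow the same shape.

```typescript
// Hypothetical Playwright variation of the "log this number every day" task.
import { chromium } from "playwright";

const EXPECTED = "42"; // "today the number is X" — the simple built-in check

async function logDailyNumber(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto("https://example.com/stats"); // placeholder URL
    const value = (await page.textContent("#daily-number"))?.trim(); // placeholder selector
    console.log(`daily number: ${value}`);
    if (value !== EXPECTED) {
      // Surfacing a failure is how the agent knows to keep iterating.
      throw new Error(`expected ${EXPECTED}, got ${value}`);
    }
  } finally {
    await browser.close();
  }
}

logDailyNumber().catch((err) => {
  console.error(err);
  process.exit(1);
});
```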
So again, have an abundance mindset. We're used to working on a single thing at a time; easy verification makes it so we can now work on multiple things at the same time. Try lots of things. As we saw this morning, look at different variations. With a parallel process, we now have the ability to try things we would never have tried before. Let AI help with those bookends: the task creation, and then the merge-and-test part at the end. And context: keep using MD files, links to documentation, or getting-started documents. The more context the better. We tell people to just throw everything in there — Jules and other agents are pretty good at sorting out which context is important. So more context is better at this point, but maybe that's just for the Gemini models, which I should have mentioned: Jules is powered by Gemini 2.5 Pro.
Quick shout out: thank you, team Jules. I couldn't have done any of this without you. If you have any questions, you can DM me. I'm Rustin Banks, Rustin B on X. Thanks, everybody.
Awesome. Always good to hear from one of the latest coding agents, and it's always great to get a refresher — you know, even I don't know how to prompt these things, but I'm liking this flow. We started off with Cognition and Devin, one of the first proper SWE agents. Then we heard from Google about Jules, one of the latest. Now let's take it back again and hear from GitHub, one of the very first coding copilots. Let's hear about the future and how we should still think about GitHub Copilot. So, without further ado, I want to welcome Christopher Harrison to the stage to tell us about GitHub Copilot.
All right, let's get right on into it. So, my name is Christopher Harrison. I'm a senior developer advocate at GitHub, primarily focused on this little thing called developer experience — or as all the cool kids like to call it, DevX — and GitHub Copilot. So, let's talk about the past, the present, and the future of GitHub Copilot. Oops. Actually, it's not picking up at all. Oh, there we go. Let me hit start mirroring. There we go. Cool. Look at that.
Okay, so let's get into it. Where we started was with code completion. With code completion, I'm a developer, I'm in the zone, type type type, and then Copilot suggests the next line, the next block, the next function, potentially even the next class. This is wonderful for giving just-in-time, inline support to our developers.
But as we all know, the tasks that we complete go beyond just writing a few lines of code. I need to be able to explore, I need to be able to ask questions, and I need to be able to modify multiple files. This is where chat comes into play. We started off with chat by supporting ask mode, where I could go in and ask questions or ask Copilot to generate an individual file for me. Then we expanded this out to edit mode. With edit mode, I can drive Copilot as it modifies multiple files, because when we think about even the most basic of updates — say, updating a web page — that's going to require updating my HTML, my CSS, and my JavaScript: three files. With edit mode, I can do that very quickly, and again, right inside of chat.
Then we get into agent mode. And agent mode really shifts things, because unlike chat — where I'm asking questions and pointing it at the files I want modified — with agent mode it's able to perform those operations on my behalf. On top of that, it's going to behave an awful lot like a developer: it will go in, do a search, find what it needs to do, perform those tasks, and then even perform external tasks as well. So it could run tests, detect that maybe those have failed, and then even self-heal.
So, I have an application here, and I want to create a couple of new endpoints. The first thing I'm going to do is add in a little bit of context. Instruction files allow me to give Copilot additional information about what I'm doing and how I want it done. So I have an instruction file specific to my endpoints. Now, this is definitely one of those scenarios where agent mode could figure this out on its own. But, as I like to say, don't be passive-aggressive with Copilot: if there's a piece of information that's important and you want it to be considered, go ahead and tell it. It might figure it out on its own, but this is certainly going to make life easier. So now that I've added this in, I'm going to say: create endpoints to list the publishers and get a publisher by ID, create the tests, ensure all tests pass — and hit send. Now, I'm doing a live demo with AI, so we're going to see what happens here. There's a chance it will fail. There's a chance it will fail spectacularly. But there's also a really good chance that everything's going to succeed, and that's the part I'm hoping for.
Now, if I take a look at what Copilot's doing here, what I see, as highlighted, is that it's behaving an awful lot like a developer: it tells me what it's going to do. It's going to create the endpoints to list all the publishers and get the publisher by ID. So the first thing it does is explore the project and figure out what's going on. Then it creates the endpoint, then it creates the tests, and then it will ensure everything works correctly. And if I keep scrolling down, I notice that it's searching through my codebase — because if I were tasked as a developer to do this, that's the first thing I'd do, and that's exactly what Copilot is doing here. It created my publishers.py file. It looked for routes that match publishers. And now it's going to create the endpoints here. And if I stall for just a moment longer and move my mouse to make it go faster... see, it worked. We'll notice that it now generates that publishers.py file.
One big thing you'll notice is that I've got these great keep and undo buttons here, because I always like to highlight the fact that AI does not change the fundamentals of DevOps. If I think about how I wrote code before AI, some of it came off the top of my head, some was based on existing code, and some was copied and pasted from Stack Overflow with a couple of changes, fingers crossed, hoping it worked. Maybe that was just me. And to help ensure that all of the code I was committing to our codebase is secure and written the way we want it written, we had code reviews, we have linters, we have security checks. We're going to do all of those exact same things even when we introduce AI. So this keep and undo allows me to very quickly confirm that yes, everything looks good — and if it doesn't, to undo it. We'll also notice history buttons up here that allow me to act iteratively, because again, when I'm working with AI, I'm not necessarily going to get perfect code the first time. So I can go back and forth: hey, this looks good, but I want to do this — maybe I want the buttons to look blue, or whatever — and then highlight that.
So what I see now is that it created all my files, updated a couple of items, and now it can run my tests. This is going to be one of those rare moments where I'm kind of hoping that it fails, because I want to see it recover for me. You'll notice that it ran my four tests and everything succeeded. Shucks. And now it's going to continue to iterate from there. So what we see with agent mode is Copilot driving the way in writing my code — but I always want to highlight that I, as the developer, am still in charge.
Now, the one catch with agent mode is that it lives inside my IDE, and it's still going to be, well, single-threaded. It's going to be synchronous.
This is where we come to coding agent. With coding agent, everything is completely asynchronous, and it runs on the server. So let me kick over to an example that I actually ran earlier this morning, where I have an issue that says: add, edit, and delete endpoints. Now, I'm going to real quick unassign Copilot, just so I can kick off the workflow and we can see this in action. I'm going to let those cute little eyeballs go away. There we go. And let's go back in and reassign. By assigning Copilot here, I've now kicked off the coding agent. I can see the little eyeballs, which indicate that Copilot is hard at work. And if I scroll on down, I now see a brand new pull request that's been made here. This is what Copilot uses to keep me updated on the work it is performing. If I scroll down just a little more, I also see a little view session button. If I hit it, I can see right here that it's telling me it's spinning up a development environment. And this raises a very big question: where is this running? How can I ensure that this is going to be done securely?
This is running inside of GitHub Actions. If you're not already familiar with GitHub Actions, this is our platform for automation. And in fact, I can go ahead and configure the environment in which I want my coding agent to work, by creating a specialized workflow exactly for that. That's what I see right here with this Copilot setup workflow. If I scroll down, I notice that I've got steps to install Node, steps to install Python, and all the frameworks and libraries that we're going to be using. Now, not only does this ensure that Copilot is working in the environment I want it to work in, it also lets me highlight the fact that, by default, coding agent does not have access to any external resources. It's not able to call the internet. It's not able to call any external services. Now, if I do want it to be able to do that, I can go in and configure MCP servers, and I can also add updates to my firewall — so I can punch a hole in the firewall and allow Copilot to access those external resources. But by default, it only has access to what I've configured inside that container. In addition, because it's running inside GitHub Actions, it's an ephemeral environment: it spins up a brand new environment, and once its work is done, it deletes it.
Continuing down the security path, let me kick back one page here. If I scroll down, you'll also notice that it's not even able to automatically kick off any workflows. I have a couple of workflows associated with this repository, for running my unit tests and for running end-to-end tests, and by default it's not able to run these unless I go in and say yes. You'll also notice that the pull request it creates is in draft mode, and I have to go in and review it — because, again, developers are in charge; just because we're introducing AI does not change the normal DevOps flow.
Now, if I take a look at the one that was created earlier — let me open that up — I see a pull request with a fantastic description of everything it has done. I can see the PR implements the missing CRUD operations. I can see it lists off all the different endpoints it created, the error handling, the testing, and the technical details. I can also open up my session here and see all the tasks it performed. You'll notice again that it behaves an awful lot like a developer: it goes out, searches through my codebase, determines what needs to be done, and then eventually performs the tasks. And if I scroll all the way to the end here — where did... there it is, perfect — I've got a nice little summary down at the very bottom. If I scroll up, I should be able to see that it ran all my tests. Yep, I can see them right there, and in this case all 16 of those tests passed.
So it created that PR, and then I decided, okay, all of that looked good to me, so I allowed it to run the actions. I can even now see that it ran those unit tests, ran the end-to-end tests, and everything looks good. Then I could say ready for review and finalize the creation of it.
The last thing I want to highlight — and this leads into the security aspect but also brings me back to the developer aspect — is that it created a brand new branch. In this particular case it's called copilot/fix-3. Where that came from is that the issue it was associated with was issue number three. Copilot will only have write access to that branch, and this branch behaves just like any other branch I might have. So if I clone the repository locally, I can check out that branch. I've opened up the branch inside of GitHub here. Let me scroll on down to my server — and if I scroll down inside of here, sound effects help, by the way — what we'll notice is that there is my update game, and I think my create game was up here. Yep, there it is, and there it all is. But again, it's only inside of that branch. That's the only place that coding agent is going to have write permissions to.
Now, this leads us to a very big question, which is: okay, that's wonderful, Christopher, you've created a little, kind of simple demo — you had to create a few Flask endpoints, and that's wonderful and all, but how about doing it in the real world? Well, one of the big tenets we have at GitHub is that we build GitHub on GitHub. And in fact, coding agent was built with the help of coding agent. You'll notice, when we take a look at the commits that found their way into coding agent, that coding agent itself was one of the most prolific committers — and that coding agent not only created new features, it also helped address tech debt. This is one of the biggest places where I personally see coding agent really shining, because I don't know of a single organization that doesn't have tech debt, that feels comfortable with the state of its backlog, that doesn't have a limitless number of items where they keep saying, "Yeah, that's great and all, but we just don't have the time."
To kick through this real quick: as I highlighted, it's a secure environment — a separate platform, ephemeral, all running inside of GitHub Actions — and you have the ability to customize it. Coding agent understands your repository and your GitHub context, so it has access to read your repository, it's even able to read your copilot instructions, and it has access to the Model Context Protocol, so it can make those external calls. And it includes those safeguards: read-only access to your repository, the default firewall preventing any external access, review before merge, and review before those actions run.
So we continue to iterate on Copilot. We continue to look for new areas where Copilot can shine, to help streamline development and to help increase the productivity of developers. Thank you.
[Applause]
Awesome. Thank you, Christopher. We love hearing about some of the main players in the SWE agent space, so it's always nice to hear from the big players. We want to continue this track with how we actually take things to production. Our next speaker, Tomas, is going to talk to us about the outer loop: how do we deal with actually deploying and using these software engineering agents? How do we manage all of the CI/CD, the pipeline? How should we actually think about using these things?
I want to take a little bit of a break here in the talks and speak about what's actually going on. Innovation in SWE agents is happening at quite a rapid pace — we've had Jules, we've got Codex, we've got Claude Code — and as we get more and more of these software engineering agents that really change the workflow of how we code, how do we handle actually deploying them? We've got a lineup of speakers coming up who are going to talk more about this, and we just want to set the stage here. So let's make it a little more interactive. How are we feeling about the track today? I want to see which are the major ones. So, who here in the room has used Devin? Let's see, show of hands. Okay, we have a few Devin users. And what about Jules? How are we feeling about Google's Jules? Okay, same set of hands. How about Claude Code? We've got a speaker from Claude Code coming later. Okay, more hands, but different hands — seems like we've got a bit of differentiation there. What about OpenAI's Codex? Oh, another set of hands. Interesting. We've got different copilots, and it seems like people use them differently. But we'd also like to see the other end of the spectrum, right? Who here is a fan of Devin and uses Devin from Cognition? Okay, so we've got another set of hands. And one thing to note is that we kind of have these different categories of agents. What about the human-in-the-loop, short-horizon copilots? Who here uses Cursor or Windsurf in their day-to-day coding tasks? Ah, a lot more hands.
So it's an interesting split. We've got these human-in-the-loop, short-horizon coding copilots — stuff like Cascade from Windsurf, Cursor's copilot — and pretty much everyone's hand goes up; a lot of people are starting to use these copilots in their IDE. Then we take it to the next level, with the big players: Claude Code, Jules, Codex. And an interesting note — everyone kind of has their own buckets; it's not the same hands that go up. That's one of the reasons why at this conference we like to invite speakers from everywhere. Now, the third camp — Devin is another way to think about it — is longer-horizon agents: how do we deal with those? And that's where we're starting to take the second half of the day in SWE agents. We want to talk about how we take these things to production, how we actually deploy them. And to bring that up, I want to invite our next speaker, Tomas Reimers. He's going to tell us a little bit about this. He's from Graphite. So without further ado, let's welcome Tomas.
Thank you so much. Perfect. Hello everyone. See — nope, no need for either of those. Thank you so much. And then slides... looking good. Cool. Perfect. Awesome.
Hi everyone, my name is Tomas. I'm one of the co-founders of Graphite. Graphite is an AI code review company. To give some context on where we see the industry right now and where we see it going: software development currently has, and has always had, two loops. The inner loop, which is focused on development, and the outer loop, which is focused on review. Developers spend time in the inner loop: they get their code working, they get the feature the way they want it, and then they move it to the outer loop, where it's tested, reviewed, merged, deployed.
We're seeing the inner loop change right now more than we've ever seen it. More developers are using AI than ever. Here we have some statistics from the GitHub developer survey: nearly every developer surveyed used AI tools both inside and outside of work, and 46% of the code on GitHub is being written by Copilot. We're seeing more and more code being written by AI. Here we have some statistics around how code has changed over time and how some people predict it will change, and even if we take a more pessimistic view of that, we still see the way the world is going as just more and more code being written by AI. The inner loop is changing: AI is making developers more productive, and developers are now producing higher volumes of code. But that code still needs to be reviewed.
When we first started looking at this — when we first started building Diamond, our AI code reviewer, about a year ago now — we read a lot of articles that scared us. We were seeing, within our own organization, a lot of developers adopting AI tools. But we were also seeing a problem: AI can hallucinate, it can make mistakes, and, almost more scarily, it can introduce security vulnerabilities. What we saw was that while the inner loop was getting sped up by AI, the outer loop was rapidly becoming the bottleneck. We were seeing tools like Cursor, Windsurf, Copilot, v0, Bolt — all of them producing larger volumes of code than we were used to, than we had ever seen before. But we were also seeing our developers suddenly having to review, test, merge, and deploy higher volumes of code. That's what brought us to say: there has to be a new outer loop here. The way things are going, this isn't going to work; it's going to break down. We're watching the problems that used to only ail large companies start to ail all companies, as companies deal with higher and higher volumes of code.
The requirements for the new outer loop, then, look a lot like the problems that larger companies have always had to deal with. You need tools to better prioritize, track, and get notified about pull requests. You need driver-assist features to help reviewers focus and to streamline the code review process. You need optimized CI pipelines and merge queues to handle the sheer volume of code changes that are now happening, and you need better deployment tools.
When we first started looking at this through an AI-first lens, we started to see that if these problems are being created by AI, they can probably also be solved by AI. We can probably streamline a lot of these processes, which had previously been manual — parts of the process that developers did not enjoy and did not want to do. We wanted to see self-driving code review solutions, where we no longer had to do the very manual and painful parts of review, and could instead really focus on what matters most to developers: making sure that your product can get out to users and that the features work as expected.
We were seeing that AI-generated feedback wasn't perfect, and because of that, we were starting to think that bots weren't enough. An early vision of ours was: well, can we solve this by just adding AI teammates? Maybe it's background agents, maybe it's reviewers, maybe it's a whole lot of teammates added to the workflow. And while we think that's part of the story, we don't think it's enough. We think — as we've built with Diamond — that your entire toolchain has to be AI-native, not just your IDE. If you really are going to embrace AI in this age of development, if you're going to accept that developers are going to be orders of magnitude more productive than they ever have been before, you need tooling that reflects that.
We started by building Diamond: an AI code review platform with high signal and low noise, with a deep understanding of the codebase and change history. We summarize, prioritize, and review each change, and we integrate with your CI and testing infrastructure to summarize errors and correct failures. Our hope with it — and what we've started to see as we've rolled it out to larger and larger customers and enterprises — is that we reduce code review cycles, we enforce quality and consistency, and we keep your code private and secure. It's high signal, it's zero setup, it's actionable with one-click suggestions, and it's customizable. It's already being used by some of the fastest-moving companies in the world, and it's expanding a lot more than we can even say publicly. I hope you all will embrace the idea that AI can change your entire developer workflow, not just your IDE.
By the numbers: the comments our AI bot leaves are downvoted at less than a 4% rate, and they're accepted — meaning integrated into the pull request they were left on — at a higher rate than human comments. Human comments are integrated somewhere between 45 and 50% of the time; we're watching our Diamond comments get accepted about 52% of the time. We've spent a lot of time tuning that — that number is actually new as of March for us. That's what I have to tell you about Graphite and about Diamond. I hope you give it a shot, and thanks for having me.
[Applause]
Awesome. Thanks again, Tomas, for such a great talk. We want to thank everyone for coming out to the SWE agents track. We're going to take a short break. Lunch is going to be served here in the halls, and the expo will be open. And we're very happy to announce that in the evening we have four more fully packed sessions — I think we are the only track that is fully booked, so we've got all eight speakers. We're going to have a great round of speakers coming up soon, so feel free to come back here later. We're going to kick off with a talk from Claude Code: how do they think about building Claude Code, how to use it, how to delegate? We'll have that later, back here in the keynote session. But for now, please feel free to enjoy lunch and check out the expo hall as we take a little break. Thank you.
[Music]
What's up? Welcome everyone. Let's give it up for the SWE agents track. This is the most packed track. We have four more amazing speakers for you — let's hear it for our SWE agent speakers. Awesome. We're going to kick off talking about Claude Code and then follow that up with OpenDevin. I want to cut my MCing short and give the speakers their time, but we have a special little announcement. We never do Q&A, but for our first talk, for Claude Code, we're going to do a bit of a presentation and a bit of a Q&A session. Keep your questions short, 5 to 10 words, and ask something interesting. Think of your question. But without further ado, I want to invite Boris Cherny from Anthropic up to the stage. Think of a question — I'll be back for the Q&A. [Applause]
Hello. Awesome — this is a big crowd. Who here has used Claude Code before? Jeez. Awesome, that's what we like to see. Cool. So, my name is Boris. I'm a member of technical staff at Anthropic and the creator of Claude Code. And I was struggling with what to talk about for an audience that already knows Claude Code, already knows AI and all the coding tools and agentic coding and stuff like that. So I'm going to zoom out a little bit, and then we'll zoom back in.
Here's my TL;DR. The model is moving really fast. It's on an exponential. It's getting better at coding very, very quickly, as everyone that uses the model knows. And the product is kind of struggling to keep up — we're trying to figure out what product to build that's good enough for a model like this. We feel like there are so many more products that could be built for models that are this good at coding, and we're kind of building the bare minimum, and I'll talk about why. With Claude Code, we're trying to stay unopinionated about what the product should look like, because we don't know.
So, for everyone that didn't raise your hand — I think that's like 10 of you — this is how you get Claude Code. You can head to claude.ai/code to install it, or you can run this incantation to install it from npm. As of yesterday, we support the Claude Pro plan, so you can try it on that, and we support Claude Max. So yeah, just try it out and tell us what you think.
So, programming is changing, and it's changing faster and faster. If you look at where programming started, back in the 1930s and '40s, there were switchboards — it was this physical thing; there was no such thing as software. Then sometime in the 1950s, punch cards became a thing. My grandpa, actually, was one of the first programmers in the Soviet Union, and my mom would tell me stories about how, when she grew up in the 1970s or whatever, he would bring these big stacks of punch cards home from work, and she would draw all over them with crayons — that was growing up for her. And that's what programming was back in the 1950s, '60s, even '70s.
But sometime in the late '50s, we started to see these higher-level languages emerge. First there was assembly, so programming moved from hardware to punch cards — which is still physical — to software. And then the level of abstraction just went up. We got to COBOL. Then we got to typed languages. We got to C++. In the early '90s there was this explosion of new language families: the Haskell family, JavaScript and Java, the evolution of the C family, and then Python. And I think nowadays, if you kind of squint, all the languages sort of look the same. When I write TypeScript it kind of feels like writing Rust, and that kind of feels like writing Swift, and that kind of feels like writing Go. The abstractions have started to converge a bit.
If we think about the UX of programming, this has also evolved. Back in the 1950s, you used something like a typewriter to punch holes in punch cards, and that was programming back in the day. At some point text editors appeared, and then Pascal and all these different IDEs appeared that let you interact with your programs and your software in new ways, and each one brought something. I feel like programming languages have sort of leveled out, but the model is on an exponential, and the UX of programming is also on an exponential — and I'll talk a little bit more about that.
Does anyone know what the first text editor was? Okay, I heard "ed" from someone. I think you read the screen. Well, before text editors, this is what programming was, real quick. This was the IBM 029. This was kind of top-of-the-line — the MacBook of its time for programming punch cards. You can still find it in museums somewhere.
And yeah, this is ed. This is the first text editor. Ken Thompson at Bell Labs invented it. And you know, it kind of looks familiar: if you open your MacBook, you can actually still type ed — it's still distributed as part of Unix systems. And this is crazy, because this thing was invented like 50 years ago, and it's nuts: there's no cursor, there's no scrollback, there are no fancy commands, there's no type-ahead — there's pretty much nothing. This was the simple text editor of the time. It was built for teletype machines, which were literally physical machines that printed your program on paper. This is the first software manifestation of a UX for programming software, and it was really built for machines that didn't support scrollback or cursors or anything like that.
For all the Vim fans, I'm going to jump ahead of Vim. Vim was a big innovation; Emacs was a big innovation around the same time. In 1980, Smalltalk-80 was a big jump forward. This is, I think, the first graphical interface for programming software. And for anyone that's tried to set up live reload with React or Redux or any of this stuff — this thing had live reload in 1980, and it worked, and we're still kind of struggling to get that to work with React nowadays. So this was a big jump forward, and obviously, on the language side, it had object-oriented programming and a bunch of new concepts, but on the UI side there were a lot of new things too.
In '91, I think, Visual Basic was the first code editor that brought a graphical paradigm to the mainstream. Before that, people were using text-based editors — Vim and things like that were still very popular despite things like Smalltalk — but this kind of brought it mainstream. This is what I grew up with. Eclipse brought type-ahead to the mainstream. This isn't AI type-ahead — this is not Cursor or Windsurf — this is just static analysis: it indexes your symbols, and then it can rank and rerank them and knows which symbols to show. I think this was also the first big third-party ecosystem for IDEs. Copilot was a big jump forward with single-line type-ahead and then multi-line type-ahead.
And I think Devin was probably the first to introduce this next concept, this next abstraction, to the world, which is: to program, you don't have to write code — you can write natural language, and that becomes code. This is something people have been trying to figure out for decades, and I think Devin is the first product that broke through and took it mainstream. The UX has evolved quickly, but I think it's about to get even faster.
We talked about UX and we talked about programming languages; verification is a part of this too. Verification started with manual debugging and physically inspecting outputs. And now there's a lot of probabilistic verification — fuzzing, vulnerability testing, things like Netflix's chaos testing, and things like that.
that. And so with all this in mind, Claude Code's approach is a little different. It's to start with a terminal
different. It's to start with a terminal and to give you as lowlevel access to the model as possible in a way that you can still be productive. So we want the model to be useful for you. We also want
to get we want to be unopinionated and we want to get out of the way. So we
don't give you a bunch of flashy UI. We
don't try to put a bunch of scaffolding in the way. Some of this is we're a model company at Enthropic and you know we make models and we want people to experience those models. But I think another part is we actually just don't
know like we don't know what the right UX is. So we're starting
UX is. So we're starting simple. And so cloud code it's
simple. And so cloud code it's intentionally simple. It's intentionally
intentionally simple. It's intentionally general. Um, it shows off the model in
general. Um, it shows off the model in the ways that matter to us, which is they can use all your tools and they can fit into all your workloads. So you can figure out how to use the model in this
world where the UX of using code and using models is changing so fast. And so this is my second point.
fast. And so this is my second point.
The model just keeps getting better. And
this is the better lesson. I have it uh I have I have this like framed and taped to the side of my wall because the more general model always wins and the model increases in
capability exponentially and there are many coral areas to this. Everything
around the model is also increasing exponentially and the more general thing even around the model usually wins. So with cloud code there's one
wins. So with cloud code there's one product and there's a lot of ways to use it. Um, so there's a terminal product
it. Um, so there's a terminal product and you know this is the thing everyone knows. So you can install quad code and
knows. So you can install quad code and then you just run quad in any terminal.
We're unopinionated. So it works in iTerm 2. It works in WSL. Um, it works
iTerm 2. It works in WSL. Um, it works over SSH and T-mok sessions. Uh, it
works in your VS code terminal in your cursor terminal. This works anywhere in
cursor terminal. This works anywhere in any terminal. When you run when you run quad
terminal. When you run when you run quad code in the IDE, we do a little bit more. So we kind of take over the ID a
more. So we kind of take over the ID a little bit and you know diffs instead of being inline in the terminal they're going to be big and beautiful and show up in the ID itself. Um and we also
ingest diagnostics. Um so we kind of try
ingest diagnostics. Um so we kind of try to take advantage of that. And you'll
notice this isn't as polished as something like uh again like cursor windsurf. These are awesome products and
windsurf. These are awesome products and I use these every day. Um, this is to let you experience the model in a low-level raw way. And this is sort of the minimal that we had to do to let you
experience that. We announced a couple weeks ago
that. We announced a couple weeks ago that you can now use Claude on GitHub. Can I get a show of hands who's
GitHub. Can I get a show of hands who's who's tried this already?
So for everyone that hasn't tried this, all you have to do is you open up Claude, you run this one slash command, /install-github-app, you pick the repo, and then you can run Claude in any repo.
Um, this is running on your compute. Um,
your data stays on your compute. It does
not go to us. Um, so it's it's kind of a nice experience and it lets you use your existing stack. You don't have to change
existing stack. You don't have to change stuff around. Takes a few minutes to set
stuff around. Takes a few minutes to set up. And again, here we intentionally
up. And again, here we intentionally built something really simple because we don't know what the UX is yet. And this
is the minimal possible thing that helps us learn but also is useful for engineers to do day-to-day work like I use this every day. The extreme version of this is our
SDK, and this is something that you can use to build on Claude Code, uh, if you don't want to use, like, you know, the terminal app or the IDE integration
or GitHub you can just roll your own integration. You can build it however
integration. You can build it however you want. people have built all sorts of
UIs, all sorts of awesome integrations, and all this is, is you run claude -p and, uh, you can use it programmatically. And so, like, something I use it for, for example, is incident triage: I'll take my GitHub logs, uh, or sorry, my GCP logs, I'll pipe them into claude -p, because it's like a Unix utility, so you can pipe in, you can pipe out, um, and then I'll, like, jq the result. So it's kind of cool. Like, this is a new way to use models. This is maybe 10% explored; no one has really figured out how to use models as a Unix utility.
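As a rough sketch of that pattern: the log file name and the prompt below are made up, and it assumes the claude CLI is installed and that `claude -p` reads stdin and prints a single response, which is how the CLI's print mode is generally described:

```python
import json
import subprocess

# Use Claude Code as a Unix-style filter for incident triage: pipe logs in,
# get structured output back, then post-process it the way you would with jq.
with open("gcp_errors.log") as f:           # hypothetical log dump
    logs = f.read()

prompt = (
    "Summarize the distinct error signatures in these logs as a JSON list of "
    "objects with 'signature' and 'count' fields. Respond with JSON only."
)

result = subprocess.run(
    ["claude", "-p", prompt],
    input=logs,
    capture_output=True,
    text=True,
    check=True,
)

for item in json.loads(result.stdout):      # may need cleanup if the model adds prose
    print(item["signature"], item["count"])
```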
This is another aspect of code as UX that we just don't know yet. And so
again, we just built the simplest possible thing so we can learn and so people can try it out and see what works for you. Okay, I wanted to give a few tips
for how to use Claude Code. This is a talk about Claude Code, so this is kind of zooming back in. Um, and, uh, this is actually true for I think a lot of coding agents, but this is kind of tailored to the way that I personally use Claude Code. So, the simplest way to use this, um, it seems like most of this room is very familiar with Claude Code and similar coding agents, um, but the simplest way to introduce new people that have not used this kind of tool before is to do codebase Q&A. And so, at Anthropic, we teach Claude Code to every engineer on day one. And it's shortened onboarding times from like two or three weeks to like two days maybe. And also I don't get bugged about questions anymore. People can just ask Claude. And honestly, like, I'll just ask Claude too. And then this is something that I do, uh, pretty much every day on Monday. We have a standup every week. I'll just ask Claude: what did I ship that week? It'll look through my git commits and it'll tell me, so I don't have to keep
track. The second thing is: teach Claude how to use your tools. And this is something that has not really existed before when you think about the UX of programming. Um, with every IDE there's sort of like a plug-in ecosystem. You know, for Emacs, there's this kind of Lispy dialect that you use to make plugins. If you use Eclipse or VS Code, you have to make plugins. For this new kind of coding tool, it can just use all your tools. So, you give it bash tools, you give it MCP tools. Something I'll often say is: here's the CLI tool, run --help, take what you learn, and then put it in the CLAUDE.md. And now Claude knows how to use the tool. That's all it
takes. You don't have to build a bridge.
You don't have to build an extension.
There's nothing fancy like that. Um, of
course, if you have like groups of tools or if you have fancier functionality like streaming and things like this, you can just use MCP as well. Traditional coding tools focused a
well. Traditional coding tools focused a lot on actually writing the code and I think the new kinds of coding tools, they do a lot more than that. And I
think this is a lot of where people that are new to these tools struggle to figure out how to use them. So there's a few workflows that I've discovered for using Claude Code most effectively for myself. The first one is: have Claude Code explore and make a plan and run it by me before it writes code. Um, you can also ask it to use thinking. So typically we see extended thinking work really well if Claude already has something in context. So have it use tools, have it pull things into context, and then think.
If it's thinking up front, you're probably just kind of wasting tokens and it's not going to be that useful. But if
there's a lot of context, it does help a bunch. The second one is TDD. Um, I know
I try to use TDD. It's like it's pretty hard to use in practice, but I think now with coding tools it actually works really well. Um, and maybe the reason is it's not me doing it, it's the model doing it. And so the workflow here is: tell Claude to write some tests and kind of describe it, and just make it really clear, like, the tests aren't going to pass yet. Don't try to run the tests, because it's going to try to run the tests. Tell it, like, you know, it's not going to pass. Write the tests first, commit, and then write the code and then commit. And this is kind of a general case of: if Claude has a target to iterate against, it can do much better.
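To make that concrete, here is roughly the kind of failing test you might have Claude write and commit first, as a target to iterate against. The slugify function and the text_utils module are hypothetical names, not something from the talk:

```python
# test_slugify.py, written and committed before the implementation exists,
# so the agent has a concrete, verifiable target to iterate against.
import pytest

from text_utils import slugify  # hypothetical module; intentionally missing at this point


def test_basic_slug():
    assert slugify("Hello, World!") == "hello-world"


def test_whitespace_is_collapsed():
    assert slugify("  many   spaces  ") == "many-spaces"


def test_empty_input_is_rejected():
    with pytest.raises(ValueError):
        slugify("")
```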
So if there's some way to verify the output, like a unit test, integration test, uh a way to screenshot in your iOS simulator, uh a way to screenshot in Puppeteer, just some way to see its output. Um we actually did this for
robots, like we taught Claude how to use a 3D printer, and then it has a little camera to see the output. If it can see the output and you let it iterate, the result will be much better than if it
couldn't iterate. The first shot will be all right, but the second or third shot will be pretty good. So give it some kind of target to iterate against. Today we launched plan mode in Claude Code, and this is a way to do the first kind of workflow more easily. So anytime, hit Shift+Tab and Claude will switch to plan mode.
So you can ask it to do something, but it won't actually do that yet. It'll
just make a plan and it'll wait for approval. So restart Claude to get the update. Run Shift+Tab. Okay. And then the final tip is, uh, give Claude more context. There's a bunch of ways to do this. CLAUDE.md is the easiest way. So take this file called CLAUDE.md, put it in the root of your repo.
You can also put it in subfolders. Those
will get pulled in on demand. You can
put it in your home folder. This will get pulled in as well. Um, and then you can also use slash commands. Um, so if you put files, just regular markdown files, in this special folder, .claude/commands, they'll be available under the slash menu. So pretty cool. This is useful for, uh, reusable workflows. And then to add stuff to CLAUDE.md, um, you can always type the pound sign to ask Claude to memorize something, and it'll prompt you which memory it should be added to. And you can see this is us trying to figure out how to use
memory, how to use this new concept that is new to coding models, did not exist in previous IDEs, how to make the UX of this work. And you can tell this is
this work. And you can tell this is still pretty rough. This is our first version, but it's the first version that works. And so we're going to be
works. And so we're going to be iterating on this. And we really want to hear feedback about what works about this UX and what doesn't.
Thanks.
[Applause] Thank you, Boris. Fortunately, we only have one minute left. So, someone sent a question on Slack. The question is, as I
delegate more and more to Claude Code, as it runs for 10 minutes and I have 10 of these active, how do I use the tool? You
got 50 seconds.
[Laughter] Yeah, this is it's pretty cool. I think
this is something that we actually see in a lot of our power users: they tend to, like, multi-Claude. You don't just have a single Claude open, but you have a couple terminal tabs, either with a few checkouts of, uh, of your codebase, or it's the same codebase but with different work trees, and you have Claude doing stuff in parallel. This is also a lot easier with GitHub Actions, because you can just spawn a bunch of actions and get Claude to do a bunch of stuff. Typically, we don't, like, need to coordinate between these Claudes, I think, for most use cases. If you do want to coordinate, the best way is just ask them to write to a markdown file. Um,
and that's it. Awesome. Yeah, simple
thing works. Thank you so much. And once
again, give it up for Boris from Anthropic.
Very exciting to see such a full packed room here. We're going to set up our
room here. We're going to set up our next speaker who is Robert Brennan from All Hands. He is the creator and the
company behind OpenDevin. So a lot of what we see, you know, we've had talks from all the top SWE agents: we've had Jules here, we've got Claude Code, we have OpenAI's Codex, we have Devin. As people use more and more of these SWE agents, are we just, you know, adding tech debt, or are we actually 10x engineers? So this is what Robert is going to discuss with us.
I once again don't want to fill the stage so let's hear it for [Applause] Robert. Hey folks. Uh so today I'm today
Robert. Hey folks. Uh so today I'm today I'm going to talk a little bit about uh coding agents and how to use them effectively really. Um if you're
effectively really. Um if you're anything like me, you found that uh you found a lot of things that work really well and a lot of things that uh don't work very well.
Um, so a little bit about me. Uh, my name is Robert Brennan. I've been building, uh, open-source development tools for over a decade now. Uh, and my team and I, uh, have created, uh, an open-source, uh, software development agent called Open Hands, formerly known as OpenDevin. So, to state the obvious, in 2025, software development is changing.
Uh our jobs are are very different now than they were 2 years ago. Uh and
they're going to be very different two years from now. Uh and the thing I want to convince you of is that coding is going away. uh we're going to be
going away. uh we're going to be spending a lot less time actually writing code but that doesn't mean that software engineering is going away. Uh
we're paid not to to type on our keyboard but to actually think critically about the problems that are in front of us. Uh and so if we do AIdriven development correctly um it'll mean we spend less time actually like
leaning forward and squinting into our IDE and more time kind of sitting back in our chair and thinking you know what does the user actually want here? Uh
what are we actually trying to build?
What problems are we trying to solve as an organization? How can we architect this in a way that sets us up for the future? Uh, the AI is very good at that inner loop of development, the write code, run the
code, write code, run the code. It's not
very good at those kind of big picture tasks that have to take into account um that have to like empathize with the end user uh take into account business level objectives. Uh and that's where we come
objectives. Uh and that's where we come in as as software engineers. Uh so let's talk a little bit
engineers. Uh so let's talk a little bit about what actually a coding agent is.
Uh I think this word agent gets thrown around a lot these days. Uh the meaning has started to to drift over time. Uh
but at the core of it is this this concept of agency. Um it's this idea of of taking action out in the real world.
Um and these are these are the main tools of a software engineer's job, right? We have a a code editor to
right? We have a a code editor to actually modify our codebase, navigate our codebase. Uh you have a terminal uh
our codebase. Uh you have a terminal uh to help you actually run the code that you're that you're writing. uh and you need a web browser in order to look up documentation and maybe copy and paste some code from stack overflow. So these
are kind of the core tools of the job and these are the tools that we give to our agents to let them do their whole uh development loop. I also want to contrast uh you
loop. I also want to contrast uh you know coding agents from some more tactical codegen tools that are out there. Um, you know, we kind of started
there. Um, you know, we kind of started a couple years ago with things like, uh, GitHub Copilot's autocomplete feature where, you know, it's literally wherever your cursor is pointed in the codebase.
Right now, it's just filling out two or three more lines of code. Um, and then over time, things have gotten more and more agentic, more and more asynchronous, right? Uh, so we got like
AI-powered IDEs that can maybe take a few steps at a time without a developer interfering. And then, uh, now you've got these tools like Devin and Open Hands, where you're really giving an agent, you know, one or two sentences describing what you want it to do. It goes off and
works for 5 10 15 minutes on its own and then comes back to you with a solution.
This is a much more powerful way of working. You can get a lot done. Uh you
working. You can get a lot done. Uh you
can send off multiple agents at once. Um
you know, you can focus on communicating with your co-workers or goofing off on Reddit while these agents are are working for you. Um, and it's uh it's just it's a it's a very different way of working, but it's a much more powerful
way of working. Uh, so I want to talk a little
working. Uh, so I want to talk a little bit about how these agents work under the hood. I feel like uh once you
the hood. I feel like uh once you understand what's happening under the surface, uh, it really helps you build an intuition for how to use agents effectively. Uh, and at its core, um, an
effectively. Uh, and at its core, um, an agent is this loop between a large language model and the and the external world. So, uh, the large language model
world. So, uh, the large language model kind of serves as the brain. Uh, and
then we have to repeatedly take actions in the external world, get some kind of feedback from the world, and pass that back into the LLM. Um, uh, so basically at every every step of this loop, we're
asking the LM, what's the next thing you want to do in order to get one step closer to your goal. Uh, it might say, okay, I want to read this file. I want
to make this edit. I want to run this command. I want to look at this web
command. I want to look at this web page. uh we go out and take that action
in the real world, get some kind of output, whether it's the contents of a web page, uh, or the output of a command, and then stick that back into the LLM for the next turn of the loop.
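A stripped-down sketch of that loop might look like the following. The call_llm helper, the message format, and the action schema are all stand-ins for a real model API, not the actual implementation of any particular agent:

```python
# Minimal agent loop: ask the model for its next action, execute it in the
# outside world, and feed the observation back in on the next turn.
from typing import Callable


def call_llm(messages: list[dict]) -> dict:
    """Placeholder for a real LLM API call. Expected to return an action such as
    {"tool": "run_command", "input": "pytest -q"} or {"tool": "finish", "input": ""}."""
    raise NotImplementedError


def agent_loop(goal: str, tools: dict[str, Callable[[str], str]], max_steps: int = 20) -> None:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(messages)  # "what's the next thing you want to do?"
        if action["tool"] == "finish":
            break
        observation = tools[action["tool"]](action["input"])  # act in the real world
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "user", "content": observation})  # feedback for the next turn
```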
Uh, just to talk a little bit about kind of the core tools that are at the
agent's disposal. Uh, the first one, again, is a code editor. Um, you might think this is really simple. It
actually turns out to be a fairly uh interesting problem. Uh the naive
interesting problem. Uh the naive solution would be to just like give the old file to the LLM uh and then have it output the entire new file. It's not a very efficient way to work though if you've got a thousand line uh thousand
line of thousands of lines of code and you want to just change one line. Uh
you're going to waste a lot of tokens printing out all the lines that are staying the same. So most uh contemporary um agents use uh like a a find and replace type editor or a diff
based editor to allow the LLM to just make tactical edits inside the file.
Uh, a lot of times they'll also provide like an abstract syntax tree or some kind of way to allow the agent to navigate the codebase more effectively.
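A toy version of that kind of find-and-replace edit tool, so the model only has to emit the changed text rather than re-print the whole file. The exact error-handling policy here is just one reasonable choice, not how any particular agent does it:

```python
from pathlib import Path


def edit_file(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`, diff-style, and report back."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return "error: old text not found"
    if count > 1:
        return f"error: old text is ambiguous ({count} matches); include more surrounding context"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"
```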
Uh, next up is the terminal, and again you would think text in, text out should be pretty simple, but there are a lot of questions that pop up here.
You know, what do you do when there's a long-running command that has no standard out for a long time? Do you kill it? Do
you let the LLM wait? Uh what happens if you want to run multiple commands in parallel? Run commands in the
parallel? Run commands in the background. Maybe you want to start a
background. Maybe you want to start a server and then run curl against that server. Uh lots of really interesting uh
problems that crop up, uh, when you have an agent interacting with the terminal.
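One simple answer to the long-running-command question is to cap how long the agent waits and report a timeout back as the observation. This sketch picks an arbitrary 30-second limit; real agents handle backgrounding and streaming output more carefully:

```python
import subprocess


def run_command(cmd: str, timeout_s: int = 30) -> str:
    """Run a shell command on the agent's behalf and return whatever we can observe."""
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return f"command still running after {timeout_s}s with no result"
    return f"exit code {proc.returncode}\nstdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"
```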
Uh, and then probably the most complicated tool is the web browser.
Again, there's a naive solution here where you just uh the agent just gives you a URL and you give it a bunch of HTML. Um that's uh very expensive
because there's a bunch of cruft inside that HTML that the LLM doesn't really need to see. Uh, we've had a lot of luck passing it, uh, accessibility trees or converting to markdown and
passing that to the LLM um or allowing the LLM to maybe scroll through the web page if there's a ton of content there. Um and then also if you
content there. Um and then also if you start to add interaction things get even more complicated. Uh you can let the LLM
more complicated. Uh you can let the LLM uh write JavaScript against the page or we've actually had a lot of luck basically giving it a screenshot of the page with labeled nodes and it can say what it wants to click on. Uh this is an
area of active research. Uh we just had a contribution about a month ago that doubled our accuracy on web browsing. Uh
I would say this is definitely a space to watch.
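A crude sketch of the "strip out the cruft" step, using only the standard library to turn raw HTML into plain text before it goes to the LLM. A real implementation would more likely use an accessibility tree or a proper HTML-to-markdown converter:

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and drop scripts, styles, and markup."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def fetch_page_text(url: str) -> str:
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```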
Uh, and then I also want to talk about sandboxing. Uh, this is a really important thing for agents, because if they're going to run autonomously for several minutes on their own without you watching everything they're doing, you want to
make sure that they're not doing anything dangerous. Uh and so all of our
agents run inside of a Docker container by default. Um, they're totally separated out from your workstation, so there's no chance of it running rm -rf on your home directory. Um, increasingly though, we're giving agents access to third-party APIs, right? So you might give it access to a GitHub token or access to your AWS account. Super, super important to make sure that those credentials are tightly scoped and that you're following, uh, the principle of least privilege as you're granting agents access to do these things.
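The sandboxing idea in miniature: run the agent's commands inside a throwaway container with only the project directory mounted, rather than on your workstation. The image name and resource limits here are arbitrary choices, not anyone's recommended configuration:

```python
import subprocess


def run_sandboxed(command: str, workdir: str) -> subprocess.CompletedProcess:
    """Run a command in a disposable container so a stray `rm -rf` can't touch the host."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",           # no outbound access unless you explicitly need it
            "--memory", "2g", "--cpus", "2",
            "-v", f"{workdir}:/workspace",  # only the project directory is visible
            "-w", "/workspace",
            "python:3.12-slim",
            "bash", "-lc", command,
        ],
        capture_output=True,
        text=True,
    )
```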
All right, I want to move into some best practices.
Uh my my biggest advice for folks who are just getting started is to start small. Um the best tasks are things that
small. Um the best tasks are things that can be completed pretty quickly. You
know, a single commit uh where there's a clear definition of done. You know, you want the agent to be able to verify, okay, the tests are passing, I must have done it correctly. Um or, you know, the merge conflicts have been solved, etc.
Um and tasks that are easy for you as an engineer to verify uh were done completely and correctly. Um I like to tell people to start with small chores.
Uh, very frequently you might have a pull request where there's, you know, one test that's failing, or there's some lint errors, or there's merge conflicts. Uh,
bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI.
They tend to be very rote.
Uh, the AI does them very well. Um,
but as your intuition grows here, as you get used to working with an agent, you'll find that you can give it bigger and bigger tasks. Uh you'll you'll understand how to communicate with the agent effectively. Um, and I would say
agent effectively. Um, and I would say for for me, for my co-founders, and for our for our biggest power users, uh, for me, like 90% of my code now goes through the agent, and it's only maybe 10% of
the time that I have to drop back into my IDE and kind of get my hands dirty in the codebase again. Uh, being very clear with the
again. Uh, being very clear with the agent about what you want is super important. Uh, I specifically like to
important. Uh, I specifically like to say, you know, you need to tell it not just what you want, but you need to tell it how you want it to do it. You know,
mention specific frameworks that you want it to use. Uh if you wanted to do like a test-driven development strategy, tell it that. Um mention any specific files or function names that it can that
it can go for. Um this not only uh helps it be more accurate and uh you know more clear as to what exactly you want the output to be um it also makes it go faster, right? It doesn't have to spend
faster, right? It doesn't have to spend as long exploring the codebase if you tell it I want you to edit this exact file. Um this can save you a bunch of
file. Um this can save you a bunch of time and energy and it can save uh a lot of a lot of tokens, a lot of actual like inference costs.
Uh, I also like to remind folks that in an AIdriven development world, code is cheap. Um, you can throw code away. You
cheap. Um, you can throw code away. You
can you can experiment and prototype.
Uh, I love if I if I have an idea, like on my walk to work, I'll just like uh, you know, tell open hands with my voice, like do X, Y, and Z, and then when I get to work, I'll I'll have a PR waiting for
me. 50% of the time, I'll just throw it
me. 50% of the time, I'll just throw it away. It didn't really work. 50% of the
away. It didn't really work. 50% of the time it looks great, and I just merge it, and it's and it's awesome. Um, it's
uh it's really fun to be able to just rapidly prototype using AIdriven development. Um, and I would also say,
development. Um, and I would also say, you know, if you if you try to try to work with the agent on a particular task and it gets it wrong, maybe it's close and you can just keep iterating within the same conversation and has already
built up some context. If it's way off though, just throw away that work. Start
fresh with a new prompt based on uh what you learned from the last one. Um it's
really really uh I think uh it's a new new sort of muscle memory you have to develop to just throw things away.
Sometimes it's, uh, hard to throw away tens of thousands of lines of code that, uh, have been generated, because you're used to that being a very expensive, uh, bunch of code. Uh, these
days it's it's very easy to kind of just start from scratch.
Again, this is probably the most important bit of advice I can give folks. Uh you need to review the code
folks. Uh you need to review the code that the AI writes. Uh I've seen more than one organization run into trouble uh thinking that they could just vibe code their way to a production application uh and just you know
automatically merging everything that came out of the AI. Um but uh if you just you know don't review anything you'll find that your codebase just grows and grows with this tech debt.
You'll find duplicate code everywhere.
Uh things get out of hand very quickly.
Uh so make sure you're reviewing the code that it outputs and make sure you're pulling the code and running it on your workstation or running it inside of an ephemeral environment. uh just to make sure that you know the agent has actually solved the problem that you
asked it to solve. Uh and I like to say you know
solve. Uh and I like to say you know trust but verify. You know as you work with agents over time you'll build an intuition for for what they do well and what they don't do well and you can generally trust them to to um you know
operate the same way today that they did yesterday. Um but you really you really
yesterday. Um but you really you really do need a human in the loop. Um, you
know, one of our big learnings, uh, with Open Hands: in the early days, if you opened up a pull request with Open Hands, uh, that pull request would show up as owned by Open Hands; it would be the little hands logo, uh, next to the pull request. Uh, and that caused two problems. One, it meant that the human who had triggered that pull request could then approve it and basically bypass our whole code review system. You didn't need a second human in the loop, uh, before merging. Uh, and two, often times those pull requests would
just languish. uh nobody would really take ownership for them. Uh if there was like a failing unit test, nobody was like jumping in to make sure the test passed. Um and those they would just
passed. Um and those they would just kind of like sit there and not get merged or if they did get merged and something went wrong, the code didn't actually work. We didn't really know who
actually work. We didn't really know who to go to and be like, you know, who caused this? There was nobody we could
caused this? There was nobody we could hold accountable for that breakage. Um
and so now, if you open up a pull request with Open Hands, your face is on that pull request. You're responsible for getting it merged. You're responsible
for any breakage it might cause down the line.
Cool. And then uh I do want to just close just by going through a handful of use cases. Uh this is always kind of a
use cases. Uh this is always kind of a tricky topic because agents are great generalists. They can they can
generalists. They can they can hypothetically do anything as as long as you kind of like break things down into bite-sized steps that they can take on.
Um but in that in that um in the spirit of starting small, I think there are a bunch of use cases that are like really great day one use cases for agents. My
favorite is resolving merge conflicts.
This is like the biggest chore as a part of my job. Uh, open hands itself is a very fastmoving codebase. Uh, I say there's probably no PR that I make that uh, I get away with zero merge conflicts. Um, and I love just being
conflicts. Um, and I love just being able to jump in and say at Open Hands, fix the merge conflicts on this PR. Uh,
it comes in and, you know, it's such a rote task. It's usually very obvious, you know, what changed before, what changed in this PR, what's the intention behind those changes? And Open Hands knocks this out, you know, 99% of the time.
Uh addressing PR feedback is also a favorite. Uh this one's great because
favorite. Uh this one's great because somebody else has already taken the time to clearly articulate what they want changed and all you have to do is say at openhands do what that guy said. Uh and
again like you can see in this example uh open hands did exactly what this person wanted. I don't know react super
person wanted. I don't know react super well and uh our front end engineer was like do x y and z and he mentioned a whole bunch of buzzwords that I don't I don't know. Open hands knew all of it
don't know. Open hands knew all of it and uh was able to address his feedback exactly how he wanted.
uh fixing quick little bugs. Um you
know, you can see in this example, we had an input uh that, you know, was a text input, should have been a number input. Uh if I wasn't lazy, I could have
input. Uh if I wasn't lazy, I could have like dug through my codebase, found the right file. Um but it was really easy
right file. Um but it was really easy for me to just like quickly I think I did this one from directly inside of Slack, uh just add open hands, fix this thing we were just talking about. Uh and
uh it's just, you know, really I don't even have to like fire up my IDE. Um
it's just it's a really really fun way to work.
uh infrastructure changes I really like.
Uh usually these involve looking up some like really esoteric syntax inside of like the Terraform docs or something like that. Um open hands and you know
like that. Um open hands and you know the underlying LLMs tend to just like know uh the right terraform syntax and if not they can they can look up the documentation using the browser. Um so
this stuff is uh is really great.
Sometimes we'll just get like an out of memory exception in Slack and immediately say okay open hands increase the memory. Uh database migrations are
memory. Uh database migrations are another great one. Uh this is one where I find uh I often leave best practices behind. I won't put indexes on the right
behind. I won't put indexes on the right things. I won't set up foreign keys the
things. I won't set up foreign keys the right way. Uh the LLM tends to be really
right way. Uh the LLM tends to be really great about following all best practices around database migrations. So again,
it's kind of like a rote task for developers. It's not very fun. Um uh the
developers. It's not very fun. Um uh the LLM's great at it. uh fixing failing tests uh like on a PR uh if you've already got the code 90% of the way there and there's just a unit test
failing because there was a breaking API change very easy to call in an agent to just clean up the the failing tests. Uh expanding test coverage is
tests. Uh expanding test coverage is another one I love because uh it's a very um safe task, right? As long as the tests are passing, it's uh generally safe to just merge that. So if you notice a spot in your codebase where
you're like, "Hey, we have really low coverage here." just ask uh ask your
coverage here." just ask uh ask your agent to uh expand your test coverage in that area of the codebase. Uh it's a great quick win uh to make your codebase a little bit safer. Then everybody's favorite
safer. Then everybody's favorite building apps from scratch. Um you know I would say if you're shipping production code again don't just like vibe code your way to a production application. Uh but we're finding
application. Uh but we're finding increasingly internally at our company a lot of times there's like a little internal app we want to build. Uh like
for instance, we built a way to, uh, debug OpenHands trajectories, debug OpenHands
sessions. Um uh we built like a whole web application that since it's just an internal application, we can vibe code it a little bit. We don't really need to review every line of code. It's not
really facing end users. Uh this has been a really really fun thing for our business to just be able to churn out these really quick applications uh just to serve our own internal needs. Um so
yeah, uh Greenfield is a great great use case for agents. U that's all I've got.
Uh, we'd love to have you all join the Open Hands community. You can find us on GitHub at All-Hands-AI/OpenHands. Um, join us on Slack,
Discord. Uh we'd love to build with you. Awesome. Awesome. Okay. Thank you
you. Awesome. Awesome. Okay. Thank you
again, Robert. Very, very exciting to hear about what works and what doesn't work in coding agents. Now, I want to take a bit of time to pause. We're kind
of going to change focus for the next few talks. Our next speaker is Josh
Albrecht from Imbue, who's going to speak about, you know, a little bit of a meta talk. He's going to give a walkthrough of a case study about Sculptor. Sculptor is kind of their way of how do you verify that your AI coding agents are actually outputting proper
code. So we hear these like you know we always hear how do we go from prototype to production. I'm guilty of this. I've
to production. I'm guilty of this. I've
given this talk. I gave it last year, but you know, we always hear about how do you go from prototype to production?
You need a human in the loop. How do you go from vibe coding to actual like production grade code? And outside of tech debt, Josh is one of the people that has kind of gone very very deep in
this and built sculptor to exactly solve this. So for our next talk, you know,
this. So for our next talk, you know, he's going to go through a case study of as you build coding agents, how do you kind of launch something alongside this?
How do you better verify what's going on? And a little bit more about Josh.
on? And a little bit more about Josh.
Josh is kind of a friend that I've known for over a year. We've talked in great depth about coding agents. He's very
deep in the space. He's been on the Latent Space podcast before. So, if you want to, you know, hear more, feel free to check out the podcast. And same with a lot of the other speakers. Boris from Claude Code, he's been there as well. But
without further ado, I want to invite Josh up pretty soon. I'm gonna I'm gonna kill some more time. We're we're running a little early. So, um yeah, let's let's actually get a show of hands. Who in
here has started actually shipping SWE agents in production? So outside of using them in your own coding workflows, outside of using copilots, who has actually shipped a version of a
coding co-pilot? Who's working directly on the tools? Okay, we we have a a few hands.
tools? Okay, we we have a a few hands.
So let's get a better idea of what people are working on. Are people in the session here? Are we trying to learn how
session here? Are we trying to learn how should we better use co-pilots? How
should we take them to production? How
should we build them? What should we know about them? because Josh's talk is a bit of a case study around this. So,
who here is in the phase of aggressively using co-pilots kind of vibe coding and trying to trying to take it to that next level? Okay. Okay. A lot more hands
level? Okay. Okay. A lot more hands there. So, Josh, a little bit more
there. So, Josh, a little bit more background for you there. So, let's
let's kind of give it off from there. Um
Josh, I think we're ready for you.
Awesome. Thanks.
Um one second. All
second. All right, cool. Well, yeah, it's great to
be here. So, I'm Josh Albrecht. I'm the
CTO of Imbue. Uh, and our focus is on making more robust, useful AI agents. In
particular, we're focusing on software agents right now. And the main product that we're working on today is called Sculptor. So the purpose of Sculptor is
Sculptor. So the purpose of Sculptor is to kind of help us with something that we've all experienced. You know, we've all tried these vibe coding tools and you, you know, tell it to go off and do
something. It goes off and creates a
something. It goes off and creates a bunch of code for you. Uh, and then, you know, voila, you're done, right? Well,
not quite. like at least today there's a big gap between kind of the stuff that comes back uh and what you want to ship to production especially as you get away from the prototyping into a larger more established code bases. So today I'm
going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, uh, and kind of go through
some of the context and motivations for the various ideas that we've explored and the features that we've implemented.
It's still a research preview, so these features may change before we actually release it. Uh, but I hope that you know
release it. Uh, but I hope that you know whether you're an individual using these tools or you're someone who's developing the tools yourself, you'll find these uh kind of learnings from our experiments
to be useful for yourselves. So today,
if you're thinking about how you can make coding agents better, then there's a million different things that you could build. You could build something
could build. You could build something that helps improve the performance on really large context windows. You could
make something to make it cheaper or faster. You could make something that
faster. You could make something that does a better job of parsing the outputs. But I don't think that we
outputs. But I don't think that we really should be building any of these things. I think that what we really want
things. I think that what we really want to be building is things that are much more specific to the use case or to like the problem domain or the thing that you
are like really specialized in. most of
the things that I just mentioned are going to get solved over the next call it 3 to 12 to 24 months as models get better, coding agents get better etc. And so I think you know just like you
wouldn't want to make your own database I don't think we want to be spending a lot of time working on the problems that are going to get solved uh instead we want to focus on the particular part of
the problem that really matters for us, for our business. And so at Imbue, the problem that we're focusing on is basically this: like, what is wrong with
this diff? You get a coding agent output and it tells you like okay I've added 59 new lines. Are those good? Like right
new lines. Are those good? Like right
now you have an awkward choice between either looking at each of the lines yourself or just hitting merge and kind of hoping for the best. Uh and neither of those are a really great place to be.
So we try to give you a third option. Uh
the goal is to help build user trust by allowing another AI system to come and take a look at this and understand like hey are there any race conditions? Did
you leave your API key in there etc. So we want to think about how do we help leverage AI tools not just to generate the code but to help us build trust in that
code and kind of the way that we think about it is about like identifying problems with the code because if there's no problems then it's probably high quality code and that's kind of the
definition of high quality code. If you
think about it from like an academic perspective, the way that people normally measure software quality is by looking at the number of defects and they look at like how long does it take
to fix a particular defect or how many defects are caught by this particular technique. So this is sort of the
technique. So this is sort of the definition that at least we're working on from when we're thinking about making high quality software. And then if we think about you know the software
development process what you want to be doing is getting to a place where you have identified these problems as early as possible. So sculptor does not work
as possible. So sculptor does not work as like a pull request review tool because that's much much later in the process. Rather we want something that's
process. Rather we want something that's synchronous and immediate and giving you immediate feedback. As soon as you
immediate feedback. As soon as you generated that code, as soon as you've changed that line, you want to know like is there something wrong with it? That's
easier both for you to fix and also for the agent to fix.
So what are some ways that you can prevent problems in AI generated code?
We're going to go through five different ways. Uh, or sorry, only four different ways: learning, planning, writing specs, and having a really strict style guide. And
we'll see how those manifest in Sculptor. So the first thing you want to
do when you're using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by letting you ask
questions, have it do research, get answers about what are the technologies, etc. that exist. What are the ways that other people have solved similar problems so that you don't end up reproducing a bunch of work for what's
already out there. Next, we want to think about how
there. Next, we want to think about how we can encourage people to start by planning. Here's a little example
planning. Here's a little example workflow where you can, you know, kick off the agent to go do something simple like, you know, implement this Scrabble solver and change the system prompt here
to force the AI agent to first make a plan without writing any code at all.
Then you can wait a little while. It'll
generate the plan. Uh, and then you can go and change the system prompt again to say like, okay, now we can actually create some code. So we make it really easy to kind of change these types of meta parameters of the coding agent
itself. Of course you can just tell the
itself. Of course you can just tell the agent to do that. But by changing its system prompt you sort of force it in a much stronger way to uh change its behavior. And you can build up larger
behavior. And you can build up larger workflows by making sort of customized agents for always plan first then always do the code then always run the checks etc.
Third, you want to think about writing specs and docs as a kind of first class part of the workflow. One of the main reasons why, at least I don't normally write lots of specs and docs in the past
has been that it's kind of annoying to keep them all up to date to spend all this time kind of typing everything out if I already know what the code is supposed to be. But this is really important to do if you want the coding
agents to actually have context on the project that you're trying to do because they don't have access to your email, your Slack, etc. necessarily. And even
if they did, they might not know exactly how to turn that into code. So in
Sculptor, uh, one of the ways that we try to make this easier is by helping detect if the code and the docs have become outdated. So it reduces the
become outdated. So it reduces the barrier to writing and maintaining documentation and dock strings because now you have a way of more automatically fixing the inconsistencies. It can also
highlight inconsistencies or parts of the specifications that conflict with each other, making it easier to make sure that your system makes sense from the very beginning. And finally, you want to have
beginning. And finally, you want to have a really strict style guide and try to enforce it. This is important even if
enforce it. This is important even if you're just doing regular coding without AI agents, just with other human software engineers. But one of the
software engineers. But one of the things that is special in Sculptor is that we make suggestions which you can see towards the bottom here uh that help
keep the AI system on a reasonable path.
So here it's highlighting that you could you know make this particular class immutable to prevent race conditions.
This is something that comes from our style guide, where we try to encourage both the coding agents and our teammates to write things in a more functional, immutable style to prevent certain
classes of errors. We're also working on developing a style guide that's sort of customtailored to AI agents to make it even easier for them to avoid some of the most egregious mistakes that they
normally make. But no matter how many uh things
make. But no matter how many uh things you do to prevent the AI system from making mistakes in the first place, it's going to make some mistakes. And there
are many things that we can do to prevent or to detect those problems and prevent them from getting into production. So we'll go through three
here. Uh, first, running linters; second, writing and running tests; third, asking an LLM. Uh, and we'll dig into each and see how that manifests in Sculptor. So, for the first one, for running linters: there are many automated tools out there, like Ruff or mypy or Pylint, etc., that you can use to automatically
detect certain classes of errors.
In normal development, this is sort of obnoxious because you have to go fix all these like really small errors that don't necessarily cause problems. It's a lot of like churn and extra work. But
one of the great things about AI systems is that they're really good at fixing these. So, one of the things that we've
these. So, one of the things that we've built into Sculptor is the ability for the system to very easily detect these types of issues and automatically fix them for you without you having to get
involved.
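As a sketch of that detect-then-fix loop, assuming Ruff is installed as the linter; the final step, where the new issues get handed back to the agent, is only a placeholder:

```python
import subprocess


def lint_issues(path: str = ".") -> list[str]:
    """Run Ruff and return one line per reported issue (empty list means a clean pass)."""
    proc = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    return [line for line in proc.stdout.splitlines() if line.strip()]


before = set(lint_issues())
# ... the coding agent makes its changes here ...
after = set(lint_issues())

new_issues = after - before  # only complain about problems the agent introduced
if new_issues:
    # Placeholder: feed these back to the agent so it can fix them automatically.
    print("asking the agent to fix:\n" + "\n".join(sorted(new_issues)))
```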
Another thing that we've done is make it easy to use these tools in practice. A
lot of tools end up like these. You
know, how many people here, maybe a show of hands, how many people have a linter set up at
all? Okay. How many people have zero linting errors in their codebase? Two.
Great. We'll hire you. Okay, cool. Uh
but you know it's it's not it's not easy. But one of the things that we've
done in Sculptor is make it so that the AI system understands what issues were there before it started and then what issues were there after it ran. So at
least you can prevent the AI system from creating more errors without you even if it doesn't work in a perfectly clean codebase. Okay. Third testing. So why
codebase. Okay. Third testing. So why
should you write tests at all? I think I was pretty lazy as a developer for a long time and did not want to write tests because it took a you know a lot of effort. You have to maintain them. I
of effort. You have to maintain them. I
already wrote the code. It works. Okay.
But one of the major objections to writing tests has kind of disappeared now that we have AI systems. The ability to generate tests is now so easy that you might as well write tests.
Especially if you have correct code. You
can tell the agent, hey, just write a bunch of tests, throw out the ones that don't pass, and just keep the rest. So
there's no real reason to not write tests at all. Uh, and, as they say at Google, if you liked it, you should have put a test on it. This becomes much more
important with coding agents. And the
reason is that you don't want your coding agent to go change the behavior of your system in a way that you don't understand and don't expect and don't want to see happen. So at Google, this matters a lot for their infrastructure
because they don't want their site to crash when someone changes something.
But if you really care about the behavior of your system, you want to make sure that it's fully tested. So how do we actually write good
tested. So how do we actually write good tests? I'll go through a bunch of
tests? I'll go through a bunch of different uh components to this. So
first, one of the things that you can do is write code in a functional style. By
this I mean code that has no side effects. This makes it much much easier
to run and to understand whether the code is actually correct. You really don't
actually successful. You really don't want to be running a test that has access to say your live Gmail environment where if you make a single mistake you can delete all of your email. You really want to isolate those
types of side effects and be able to focus most of the code, uh, on the kind of functional transformations that matter for your program.
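A tiny example of the difference. The Gmail-flavored names are made up; the point is that the pure function at the bottom can be exercised with thousands of generated inputs with no risk of touching a real mailbox:

```python
# Side-effecting version: risky to test, because running it acts on a live account.
def archive_old_threads(gmail_client, days: int) -> None:
    for thread in gmail_client.list_threads():
        if thread.age_days > days:
            gmail_client.archive(thread.id)  # irreversible side effect


# Functional version: the decision logic is a pure transformation, trivial to test in isolation.
def threads_to_archive(threads: list[dict], days: int) -> list[str]:
    return [t["id"] for t in threads if t["age_days"] > days]
```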
Second, you can try and write two different types of unit tests. Happy path unit tests are ones that show you that your code is working.
It's happy. Hooray, it worked. uh you
don't need that many of those. You just
need a small number to show that things are working as you hope. The unhappy
unit tests are the ones that help us find bugs. And here LLMs can be really,
find bugs. And here LLMs can be really, really helpful. So, especially if you've
really helpful. So, especially if you've written your code in a functional style, you can have the LLM generate hundreds or even thousands of potential inputs, see what happens to those inputs, and
then ask the LLM, does that look weird?
And often, when it says yes, that will be a bug. And so now you have a perfect test case replicating a bug.
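A sketch of that recipe. The function under test and ask_llm are stand-ins; the real version would call an actual model and use whatever input generator fits your domain:

```python
import random
import string


def slug(s: str) -> str:
    """Stand-in for the pure function under test."""
    return "-".join(s.lower().split())


def ask_llm(question: str) -> bool:
    """Placeholder: ask a model whether an input/output pair looks wrong."""
    raise NotImplementedError


def random_input(max_len: int = 40) -> str:
    alphabet = string.ascii_letters + string.punctuation + " \t\n"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))


suspicious = []
for _ in range(1000):
    x = random_input()
    y = slug(x)
    if ask_llm(f"A slugify function turned {x!r} into {y!r}. Does that look wrong?"):
        suspicious.append((x, y))  # each hit is a candidate bug with a ready-made test case
```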
Third, after you've written your unit tests, it's maybe a good idea to throw them away in some cases. This is a little bit counterintuitive. In the past, we took all this effort and spent all this time trying to write good unit tests, and so we feel some aversion to throwing them away. But now that it's so easy to run an LLM and generate the test suite again from scratch, there's a good reason to not keep around too many unit tests of behavior you don't care about too much. You might
also want to just refactor the ones that you generated into something that's slightly more maintainable. But when you do keep them around, it does kind of confuse the LLM when you come back and change this behavior. So it's something that's at least worth thinking about
whether you want to keep the tests that were originally generated, clean them up, how many of them should you keep, etc. Fourth, you should probably focus on integration tests uh as opposed to
testing only the kind of code level functional uh behavior of your program.
Integration tests are those that show you that your program actually works.
Like from the user's perspective, like when the user clicks on this thing, does this other thing happen? AI systems can be extremely good at writing these, especially if you create nice test plans
where you can write, okay, when the user clicks on the button to add the item to the shopping cart, then the item is in the shopping cart. If you write that out and then you write the test, then you can write another test plan like if the
user clicks the button to remove the thing from the shopping cart, then it is gone. AI systems can almost always get this right, and so it allows you to work at the level of meaning for your testing, which can be much more efficient. Uh,
fifth, you want to think about test coverage as a core part of your testing suite. So if you're having cloud code
suite. So if you're having Claude Code write things for you, then you don't care just about the tests working on their own, but you also care whether there are enough tests in the first place. If you
think back to the original screenshot where we get back our PR of, you know, how many lines have changed? If I tell you how many lines have changed, it's not that helpful. If I tell you so many
lines have changed and also there's 100% test coverage and also all the tests pass and also a thing looked at the tests and thought they were reasonable.
Now you can probably click on that merge button without quite as much fear. Uh
and sixth uh we try to make it easy to run tests in sandboxes and without secrets as much as possible.
This uh makes it a lot easier to actually fix things and makes it a lot easier to make sure that you're not accidentally causing problems or making flaky
tests. The third thing that we can do to
tests. The third thing that we can do to detect errors is ask an LLM. There are
many different things that we can check for, including if there are issues before you commit with your current change, if the thing that you're trying to do even makes sense, if there are issues in the current branch you're working on, if there are violations of
rules in your style guide or in your architecture documents, if there are details that are missing from the specs, if the specs aren't implemented, if they're not well tested, or whatever other custom things that you want to
check for. One of the things that we're
check for. One of the things that we're trying to enable in Sculptor is for people to extend the checks that we have so that they can add their own types of best practices into the codebase and
make sure that they are continually checked. After you've found issues, then
checked. After you've found issues, then you have to fix them. Very little of this talk is about fixing the issues because it ends up being a lot easier for the systems to fix issues than you
would expect. I think this quote
would expect. I think this quote captures it relatively well and that a problem wellstated is halfsolved. What
this means is that if you really understand what went wrong, then it's much easier to solve the problem. This
is especially true for coding agents because the really simple strategies work really well. So even just try multiple times, try a hundred times with a different agent, it actually ends up
like working out quite well. And one of the things that enables this is having really good sandboxing. If you have agents that can run safely, then you can run an almost unlimited number subject
to cost constraints uh in parallel. And
then if any one of them succeeds, then you can use that solution.
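That best-of-N strategy is only a few lines once a sandboxed runner exists. Here run_agent_attempt and passes_tests are placeholders for a sandboxed agent run and whatever verification step you trust:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_agent_attempt(task: str, attempt: int) -> str:
    """Placeholder: run one sandboxed agent on the task and return an identifier for its output."""
    raise NotImplementedError


def passes_tests(result: str) -> bool:
    """Placeholder: run the test suite (or other checks) against the attempt's output."""
    raise NotImplementedError


def first_passing_solution(task: str, n: int = 10) -> str | None:
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(run_agent_attempt, task, i) for i in range(n)]
        for fut in as_completed(futures):
            result = fut.result()
            if passes_tests(result):
                return result  # keep the first attempt that verifies
    return None
```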
And this is really just the beginning. There are going to be so many more tools that are released over the next year or two. And many of the people in this room are working on those tools.
There will be things that are not just for writing code, like we've been talking about, but for after deployment: for debugging, logging, tracing, profiling, etc. There are tools for doing automated
quality assurance where you can have an AI system click around on your website and check if it can actually do the thing that you want the user to do.
There are tools for generating code from visual designs. There are tons of dev tools coming out every week. You will
have much better contextual search systems that are useful for both you and for the agent. Uh and of course we'll get better AI based models as well. If
anyone is working on these other sorts of tools that that are kind of adjacent to developer experience and helping you fix this like much smaller piece of the process, we would love to work together
and find out a way to integrate that into Sculptor so that people can take advantage of that. I think what we'll see over the next year or two is that most of these things will be accessible.
Uh, and it'll make the development experience just a lot easier once all these things are working together. So, that's pretty much all
together. So, that's pretty much all that I have for today. If you're
interested, feel free to take a look at the QR code, go to our website at imbue.com and sign up to try out Sculptor. And of course, if you're
Sculptor. And of course, if you're interested in working on things like this, we're always hiring. We're always
happy to chat, so feel free to reach out. Thank you.
out. Thank you.
Thank you, Josh. I highly recommend picking Josh's brain. I'm sure he'll be around. Find him in the hallways. It's
around. Find him in the hallways. It's
been great. Had countless conversations with Josh. And, you know, just to say
once again, what a day. It's been a fully jam-packed day. We have had eight back-to-back speakers talking about SWE agents. We started with all the, you know, the originals: GitHub Copilot, the original coding copilot.
Then we went to the latest and the greatest, right? We've had OpenAI's
Codex speak. We've had Claude Code speak. We've had Jules from Google
speak. We've had Jules from Google speak. Then we went a little bit into,
speak. Then we went a little bit into, okay, how do I actually start using these things in production? How do I go past Vibe coding? How do I kind of, you know, let's walk through a case study of
how we really build these things. And
now for our last talk in the SWE agents track, we have someone who is not building an agent. We have Eno Reyes here from Factory, and he is actually building droids. What does this mean?
building droids. What does this mean?
It's not just hype. Eno is actually working on droids. He is from Factory AI, one of the companies that is actually shipping this stuff in production.
They are actually in the enterprise.
They are growing like crazy. He's been
recently on the Latent Space podcast and they're actually doing this stuff. So
you know, he is a great speaker. He's spoken
for bigger audiences than this. And you
know, without any further ado, I want to pass it on to Eno.
[Applause] Hi everybody. My name is Eno. I really
Hi everybody. My name is Eno. I really
appreciate that introduction. Um, and
maybe I can start with a bit of background. Uh, I started working on LLMs about two and a half years ago, when GPT-3.5 was coming out and it became increasingly clear that agentic systems were going to be possible with the help of LLMs. At Factory, we believe that the way that we use agents in particular to build software is going to radically change the field of software
development. We're transitioning from
development. We're transitioning from the era of human-driven software development to agent driven development.
You can see glimpses of that today. You
guys have already heard a bunch of great talks about different ways that agents can help with coding in particular.
However, it seems like right now we're still trying to find what that interaction pattern, what that future looks like. And a lot of what's publicly
looks like. And a lot of what's publicly available is more or less an incremental improvement. The current zeitgeist is to
improvement. The current zeitgeist is to take tools that were developed 20 years ago for humans to write every individual line of code. um and ultimately tools that were designed first and foremost
for human beings. Uh and you sprinkle AI on top and then you keep adding layers of AI and then at some point maybe there's some step function change that happens. But there's not a lot of
happens. But there's not a lot of clarity there in exactly what that means. You know, there's a quote that is
means. You know, there's a quote that is attributed to Henry Ford. Uh if I had asked people what they wanted, they would have said faster horses. Now, we
believe that there are some fundamentally hard problems blocking organizations from accessing the true power of AI. This power can only be found when your team is delegating the
majority of their tasks across the software life cycle to agents.
To do that, you need a platform that has an intuitive interface for managing and delegating tasks, centralized context from across all your engineering tools
and data sources, agents that consistently produce reliable, highquality outputs, and infrastructure that supports thousands of agents
working in parallel. These are all hard problems to solve. But our team has spent the last two years partnering with large organizations to build towards
this future. This talk is going to serve
this future. This talk is going to serve as sort of a deep dive into agent native development and some of the and a bit of a share of some of the lessons that we've learned helping enterprise
organizations make the transition to agent-native development. When Andrej Karpathy said English is the new programming language, he captured this very exciting moment.
Right? And if you're to judge AI progress based on Twitter, you'd think that, you know, you can basically vibe code your way to anything. But vibe
coding isn't the approach to solve hard problems. You can't vibe code a legacy Java 7 app that runs 5% of the world's global bank transactions, right? You
need a little bit more software engineering. So agents really should not
engineering. So agents really should not be thought of as a replacement for human ingenuity, right? agents are climbing
ingenuity, right? agents are climbing gear and building production software is like scaling Mount Everest. And so while better tools have made this climb more
accessible, we still need to think about how to leverage them and use our existing expertise in order to drive this transformation. I want to start
this transformation. I want to start with a quick video of what's possible today, right? And so in this you'll see
today, right? And so in this you'll see a quick glimpse of what it's like to delegate a task to an agentic system.
You can watch the droid as we call them ingest the task and start grounding itself in the environment. It uses tools to search through the codebase, determine the git branch, check out what
the machine has available to it. It
looks through recent changes to the codebase. It looks at memories of its
codebase. It looks at memories of its recent interactions with users as well as memories from its interactions across the entire organization. And then the droid comes back with a plan and says,
"Here's exactly what I'm going to do, but I'd like you to clarify a couple of things. Right? We need to expect our
things. Right? We need to expect our agents to not just take what we say at face value, but instead question it and make us better software developers." And
so after the user comes back with that info, the droid executes on that task. It leverages its tools to write code, runs pre-commit hooks and lints, and ultimately generates a pull request that passes CI.
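A minimal sketch of that ground-plan-clarify-execute loop; every helper below is a hypothetical placeholder for real tool calls, not Factory's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TaskRun:
    task: str
    context: dict = field(default_factory=dict)
    plan: str = ""
    questions: list = field(default_factory=list)

def ground(run: TaskRun) -> None:
    # Grounding phase: stand-ins for real tool calls (codebase search,
    # `git rev-parse --abbrev-ref HEAD`, recent commits, user- and
    # organization-level memory lookups).
    run.context["branch"] = "main"
    run.context["recent_changes"] = ["abc123: refactor auth middleware"]
    run.context["memories"] = ["team prefers small, well-tested PRs"]

def plan(run: TaskRun) -> None:
    # Planning phase: draft a plan and surface clarifying questions instead
    # of taking the task at face value.
    run.plan = f"Plan for '{run.task}' on branch {run.context['branch']}"
    run.questions = ["Should this change go out behind a feature flag?"]

def execute(run: TaskRun, answers: dict) -> str:
    # Execution phase: write code, run pre-commit hooks and lints, and open
    # a pull request that must pass CI (all stubbed here).
    return f"PR opened for '{run.task}' (answers: {answers})"

run = TaskRun("integrate the new model release into the chat app")
ground(run)
plan(run)
print(run.plan, run.questions)
print(execute(run, {"feature flag": "yes"}))
```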
But how can you achieve outcomes like this on a regular basis? Right?
It's nice when it works, but what about when it fails? At the heart of effective AI assisted development lies a very fundamental truth. AI tools are only as
fundamental truth. AI tools are only as good as the context that they receive.
So much of what people are calling prompt engineering is really mentally modeling this alien intelligence that has a slice of context of the real world. And if you start thinking about
world. And if you start thinking about your AI tools this way, you're going to start to get a lot better at interacting with them. We've investigated thousands
of droid-assisted development sessions and you see this sort of heuristic emerge where AI is most likely failing to solve
the problem. Not because the LLMs aren't
the problem. Not because the LLMs aren't good enough, but because it's missing crucial context that's required to truly solve it. And better models are going to
solve it. And better models are going to make this happen less often. But the
real solution is not just making the AI smarter. It's going to be getting better
at providing these systems with that missing context. LLMs don't know about your
morning standup. They don't know about the meeting that you had ad hoc and the whiteboard that you did, right? But you
can give those things to the LLM if you transcribe your notes, if you take a photo and you upload it. Right? You have
to start thinking about these things not as tools but as something in between a co-worker and uh and a and a platform, right? And if you can get that context
right? And if you can get that context that lies in the cracks between systems, you use platforms that integrate natively with all of your data sources and you have agents that can actually
make use of those things, you can start actually driving this transition to agent native development. I want to talk a bit as
development. I want to talk a bit as well about planning and design. When
your agent I mean sorry when your organization is doing agent native development then you are using agents at every stage. Droids don't just write
every stage. Droids don't just write code. They can help with that part, but
code. They can help with that part, but the hardest thing about software development is not the code. It's about
figuring out exactly what to build. Here
you can watch a droid as it's tasked with trying to find the most up-to-date information about a new model release and integrate that into an existing chat application. It's going to leverage
internet search, its knowledge of your codebase, its understanding of your product goals from its organizational memory, and its understanding of your technical
architecture from the design doc you wrote last week. Planning with AI is fundamentally different from planning alone. It's not necessarily just asking
alone. It's not necessarily just asking please build this thing for me or give me the design doc but instead it's about delegating the groundwork and the
research to AI agents then using a collaborative platform to interact and explore possibilities together. That is
how you get better at planning with agents. Now you can see here we have a
agents. Now you can see here we have a nice document a nice plan. You could
export that to Notion, Confluence, Jira, any of your integrations with no setup, because MCP is great, but having every developer have to install a bunch of servers, click a bunch of things, pass
around the API key is not necessarily ideal. And so platforms are going to
ideal. And so platforms are going to evolve and solve a lot of these problems. But in the meantime, you do have droids. And now a little bit more
have droids. And now a little bit more on this. The real unlock for AI
on this. The real unlock for AI transforming your organization in with respect to planning is going to be when you start standardizing the way that
your organization thinks, right? And so
there's a bit of a of an example that we just had a couple of weeks ago while we were planning out uh a feature related to our cloud development environments.
We got a lot of feedback from users and so we had about three months of user transcripts, people from enterprises, uh, individuals that we knew. Uh, we
transcribe every single interaction and meeting at factory. We take those notes and we combine them with a droid that has access to our architecture. We take
an ad hoc meeting that one of our engineers captured with Granola. If you guys use Granola, I love that tool. Um, and
we throw that all to the knowledge droid and we say, we don't say, "Let's plan the feature out." We say, "Could you find any patterns in the customer feedback that map up to our assumptions?
Can you highlight any technical constraints with what we have today that might help us make this better?" And
then we take all of that output, those documents, there's maybe four or five intermediate results here, and that's what we use to start iterating on a
final PRD that helps us outline the full feature. You can take that PRD, and if
you have a droid that has access to Linear and Jira, with tools to create tickets, create epics, and modify those things, then that PRD can be turned into a roadmap: eight tickets, this ticket dependent on that ticket, but ultimately work that can be parallelized amongst a group of eight code droids, right?
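As a small sketch of what such a roadmap looks like as data, here is a hypothetical set of tickets with dependencies, grouped into waves that can be worked in parallel; the ticket names are made up, and a real droid would create these through the Linear or Jira APIs rather than in memory:

```python
from graphlib import TopologicalSorter

# Hypothetical tickets derived from a PRD, each mapped to the tickets it
# depends on.
tickets = {
    "schema-migration": set(),
    "backend-endpoint": {"schema-migration"},
    "feature-flag": set(),
    "frontend-ui": {"backend-endpoint", "feature-flag"},
    "docs-update": {"frontend-ui"},
}

sorter = TopologicalSorter(tickets)
sorter.prepare()
wave = 1
while sorter.is_active():
    ready = list(sorter.get_ready())  # tickets whose dependencies are all done
    print(f"wave {wave}: can be parallelized across droids -> {sorted(ready)}")
    sorter.done(*ready)
    wave += 1
```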
And so this is how software is going to evolve. We're going to move from executing to orchestrating systems that work on our behalf. I talked about a couple of these: PRDs, design docs, RCA templates, quarterly engineering and product roadmaps, transcriptions
of your meetings. Normally, you might see this stuff as a burden, but when your company is doing agent-native software development, your process and your documentation are a knowledge base
and a map for your droids to learn and imitate the way that your team thinks.
This documentation and process is a conversation with both future developers as well as future AI systems. And so if you can communicate that why behind the
decision, that context for those future developers and agents, then you'll start to see that there's a huge lift in their ability to natively work the way that your team actually
works. I want to talk about uh
works. I want to talk about uh agent-driven development with respect to site reliability engineering. There is a lot that goes in
to a real incident response. It would be crazy for me to go up here and say you could actually just automate all of SRE and RCA work today. But there is a difference in the AI agent-driven approach. Right here we're watching a droid take a Sentry incident and convert it into a full RCA and mitigation plan.
Traditional incident response is effectively solving a puzzle. The pieces
are scattered across dozens of systems. Logs in one place, metrics in another, historical context somewhere else.
There's knowledge in your team's head.
Droids in your organization fundamentally change this, right? When
an alert triggers, you can pull in context from relevant system logs, past incidents, runbooks in Notion or Confluence, and team discussions from Slack.
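A minimal sketch of that context-gathering step, with hypothetical `fetch_*` helpers standing in for the real logging, incident-history, runbook, and Slack integrations:

```python
def fetch_logs(service: str) -> list[str]:
    return [f"{service}: error rate spiked after 14:02 deploy"]   # placeholder

def fetch_past_incidents(service: str) -> list[str]:
    return ["INC-42: similar spike, fixed by rollback"]           # placeholder

def fetch_runbooks(service: str) -> list[str]:
    return ["Runbook: roll back the most recent deploy"]          # placeholder

def fetch_slack_threads(service: str) -> list[str]:
    return ["#ops: 'anyone else seeing 500s on checkout?'"]       # placeholder

def build_incident_context(alert: dict) -> str:
    service = alert["service"]
    sections = {
        "logs": fetch_logs(service),
        "past incidents": fetch_past_incidents(service),
        "runbooks": fetch_runbooks(service),
        "slack": fetch_slack_threads(service),
    }
    # This condensed bundle is what an agent would reason over to draft an
    # RCA and a mitigation plan the moment the alert fires.
    return "\n".join(f"[{name}] " + "; ".join(items) for name, items in sections.items())

print(build_incident_context({"service": "payments-api", "severity": "high"}))
```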
And you can see that a droid that has the tools and the ability to access this can condense that search effort from hours to minutes. And so really the
acceptable time to act for a standard enterprise organization is really going to be zero. Right? The
moment that an incident happens, you should have a droid that's telling you exactly what happened, exactly how to fix it. And the thing that gets
fix it. And the thing that gets interesting is when you have user and organization level memory, you really start to build a model of what your team's response patterns and common issues are. And so it's not just
issues are. And so it's not just generating runbooks or generating a mitigation for one incident, right? but
creating new processes that help solve some of these issues. And once you've written that
issues. And once you've written that RCA, right, you you can move on to generate runbooks for those new learned patterns, update existing response
workflows, capture team knowledge that gets shared automatically without without the need for manual curation. And this is why all these
curation. And this is why all these things are connected. Agent-native
incident response is a part of a larger learning cycle that happens when you start to integrate agents into the workflow. We're seeing teams that are
workflow. We're seeing teams that are able to cut incident response time in half because context is immediate.
They're able to reduce repeat incidents because the third time something happens, the droid starts to say, "Maybe we should fix this." And they're able to improve team collaboration because when
a new engineer joins the team and says, "How do we do this?" It's already in memory. They can just ask the droid how
memory. They can just ask the droid how we do this. And so, most importantly, what we're seeing in general is a shift from reactive to predictive operations because you can now start to really see
the patterns across the entire operational history. And agentic systems
operational history. And agentic systems turn each of these incidents into an opportunity to make the entire system far more reliable. AI agents are not replacing
reliable. AI agents are not replacing software engineers. They're
software engineers. They're significantly amplifying their individual capabilities. The best
individual capabilities. The best developers I know are spending far less time in the IDE writing lines of code.
It's just not high leverage. They're
managing agents that can do multiple things at once that are capable of organizing the systems and they're building out patterns that supersede the inner loop of software development and they're moving to the outer loop of
software development. They aren't worried about
development. They aren't worried about agents taking their jobs. They're too
busy using the agents to become even better at what they do. The future
belongs to developers who understand how to work with agents, not those who hope that AI will just do the work for them.
And in that future, the skill that matters most is not technical knowledge or your ability to optimize a specific system, but your ability to think
clearly and communicate effectively with both humans and AI. Now, if you find any of this
AI. Now, if you find any of this interesting and you want to try the droids, I'm happy to share that everyone here uh at this talk can use this QR
code uh to sign up for an account. Our
mobile experience is not optimized yet, but the droids are on that. And so I'd recommend trying this on a laptop, but you will get 20 million free tokens uh credited your account. Um, and I also
want to add that uh you know, first and foremost, Factory is an enterprise platform, right? And so if you're if
you're thinking about security, if you're thinking about where are the audit logs, whose responsibility is it when an agent goes and runs rm -rf recursively on your codebase, right?
Droids don't do that. But if it were to, right, whose responsibility is that?
Then these are the types of questions that we're interested in and that we're helping large organizations solve today.
And so if you're a security professional, if you're thinking about ownership, auditability, indemnification, if you're a lawyer, right, these are the types of questions that you should start asking today
because yolo mode is probably not the best thing to be running inside your enterprise, right? And so give it a
enterprise, right? And so give it a scan, give it a try, check out some of the controls we have. Um, and if you have any questions, feel free to reach
out via email. Thanks.
[Applause] Awesome. Thank you, Eno. What a day of
Awesome. Thank you, Eno. What a day of talks, everyone. That's our back-to-back eight sessions of SWE agent talks.
Okay, logistics. So, this is the main keynote room. We're going to be back
keynote room. We're going to be back here in around 3:40 for our ending keynotes. feel free to, you know, stay,
keynotes. feel free to, you know, stay, hang out. It's not that long from now.
hang out. It's not that long from now.
You have about 20 minutes. If you're
interested, there's some expo talks going on. Feel free to check out the
going on. Feel free to check out the expo booths, but please do stay. Um,
after the keynotes, we have a few more great great keynote talks lined up.
Everyone will come back to the keynote room. And then we have a few surprises.
room. And then we have a few surprises.
So, one thing very special, last week we held a hackathon. We held an AI uh AI engineer hackathon. And the finalists of
engineer hackathon. And the finalists of the hackathon have not got their awards yet. They have been spending a week to
yet. They have been spending a week to work a little bit further on their project. They're going to come here and
project. They're going to come here and demo on stage and we're going to pick the winners. There's $10,000 of prizes
the winners. There's $10,000 of prizes on the line. So, we're going to see some hackathon demos. And of course, at the
hackathon demos. And of course, at the end, we want to thank our speakers. We
have a special trophy ceremony and we need your help to determine who your favorite speakers were. For the sweet agent track, we're going to reach out.
We're going to have a poll for whoever your favorite speaker is. Please, please
vote alongside the keynotes, the other tracks for anything that you've attended. Please let us know your
attended. Please let us know your favorite speakers. So, thank you all for
favorite speakers. So, thank you all for coming. It's been a great talk, a great
coming. It's been a great talk, a great list of talks, and we hope to see you back soon. So, once again, 3:40 we're
back soon. So, once again, 3:40 we're going to kick off here with keynotes, speaker prizes, and hackathon judging.
Thank you everyone.
[Music]
[Music] Ladies and gentlemen, please welcome back to the stage the VP of developer relations at LlamaIndex, Laurie Voss.
[Music] Hello everybody. Welcome back. How's
Hello everybody. Welcome back. How's
everybody had a good conference day today?
All right, so for this next bit, I'm going to try an experiment. There's four
sort of blocks of you uh separated by aisles, and so I'm going to divide you into teams. You are team A. You are team
B. You are team C. You are team D. Let
B. You are team C. You are team D. Let
me hear it from team A. Team
A. Team C. Team
C. Team B. Team
B. Team D. Team A again. All right. I'm not
D. Team A again. All right. I'm not
going to do anything with that. That's
just to wake you up. Uh we have some great keynotes lined
up. Uh we have some great keynotes lined up this afternoon. Uh we're going to hear the results of the state of AI engineering survey. Uh and if you know
engineering survey. Uh and if you know anything about me, you know that I love data. I love a good survey. It's my
data. I love a good survey. It's my
favorite thing to hear about. Uh we're
going to hear stories about building OpenRouter. Uh and we're going to hear Sean Grove tell us why prompt engineering is dead, which is sure to be spicy. Uh but our first keynote this
spicy. Uh but our first keynote this afternoon is trends across the AI frontier. Uh so please welcome to the
stage uh co-founder of Artificial Analysis, George [Applause]
Cameron. Hi everyone. I'm George,
Cameron. Hi everyone. I'm George,
co-founder of Artificial Analysis. A
quick background to who we are before we dive into things. Do you see that?
things. Do you see that?
Sorry, I think my clicker is not working. Oh, there we go. Great. So, a
working. Oh, there we go. Great. So, a
quick background to who we are. We're a
leading independent AI benchmarking company. We benchmark a broad spectrum
company. We benchmark a broad spectrum across AI. So, we benchmark models for
across AI. So, we benchmark models for their intelligence. We benchmark API
their intelligence. We benchmark API endpoints for their speed, their cost.
We also benchmark uh hardware and all the AI accelerators out there. Uh and we also benchmark a range of modalities, not just language, but also vision,
speech, image generation, video generation. And we publish essentially
nearly all of it uh for free on our website, artificialanalysis.ai, where we benchmark over 150
different models uh across a range of metrics. We also publish reports many of
metrics. We also publish reports many of which are publicly accessible and we also have uh a subscription for
enterprises looking to uh enter uh or bring AI to production in their environments um in an efficient uh and effective way.
Let's start off with AI progress. Let's
set the scene. So, it's been a crazy two years. I think that we've all felt it in
this room, whereby OpenAI kicked off the race with the ChatGPT and GPT-3.5 launch. And since then, it's only gotten more hectic. There have been more and more model releases by more and more labs pushing the AI
frontier. So the current state now of
frontier AI intelligence. I think this order of models will be familiar to a lot in this room: o3 is the leader, followed closely by o4-mini with reasoning mode high, then DeepSeek R1 (the release in the last week or two), Grok 3 mini reasoning high, Gemini 2.5 Pro, and Claude 4 Opus thinking. This benchmark is our Artificial Analysis Intelligence Index.
It's a composite index of seven evaluations, which we then weight to develop our Artificial Analysis Intelligence Index, which provides a generalist perspective on the intelligence of these models.
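Roughly, a composite index of this shape is just a weighted combination of per-eval scores. The eval names and equal weights in the sketch below are illustrative placeholders, not Artificial Analysis's actual evaluation suite or weighting:

```python
# Illustrative eval names and equal weights only.
eval_scores = {
    "knowledge": 78.0,
    "science_qa": 61.0,
    "math": 90.0,
    "coding": 72.0,
    "instruction_following": 85.0,
    "long_context": 68.0,
    "agentic_tool_use": 55.0,
}
weights = {name: 1 / len(eval_scores) for name in eval_scores}

# Weighted sum of the seven evaluation scores.
intelligence_index = sum(eval_scores[n] * weights[n] for n in eval_scores)
print(round(intelligence_index, 1))
```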
We all have an understanding of what frontier AI intelligence is. But what I want to explore with you today is that there's more than one frontier in AI.
There's tradeoffs to accessing this intelligence. You shouldn't always use
intelligence. You shouldn't always use the leading most intelligent model. And
so what we want to do is we want to explore the different frontiers out there. And as an AI benchmarking
company, we're going to bring some numbers to the fore to help you reason
about this. First, we'll be looking at reasoning models. Next, we'll be looking
reasoning models. Next, we'll be looking at the open weights frontier. Third, the
cost frontier. And lastly, the speed frontier. There's other frontiers out
frontier. There's other frontiers out there that we benchmark, but we'll focus on these key ones today.
Starting with reasoning models, what we've done here is we've taken our intelligence index and looked at that relative to the output tokens used to
run the intelligence index. So we've
measured all of how many tokens each model took to run our seven evaluations and we've plotted it on this chart and you can see two distinct groups. It's helpful to think about
groups. It's helpful to think about these separately. So non-reasoning
models, which offer less intelligence but require fewer output tokens, and reasoning models, which use more output tokens but offer greater intelligence. This is important to look at because more output tokens come with trade-offs both for
request latency as well as cost. We're
going to bring some numbers to draw that out and look at the real differences here. Just how yappy these reasoning
here. Just how yappy these reasoning models are. We can see that there's an
models are. We can see that there's an order of magnitude difference between reasoning and non-reasoning models. It's
not just that feeling, oh, this is taking a long time. It's real. It's an
order of magnitude. GPT-4.1 required 7 million tokens to run our intelligence index evaluations, but o4-mini (high) took 72 million tokens, and the yappiest of them all, Gemini 2.5 Pro, took 130 million tokens to run our intelligence index. And as mentioned, this has implications for cost as well as end-to-end latency and responsiveness. So looking at latency, we benchmark the API latency: how long it takes to receive a full response when accessing these models via their APIs. Here we can see that GPT-4.1, on median across our requests, took 4.7 seconds to return a full response. o4-mini (high) took over 40 seconds, roughly another 10x or order of
magnitude increase. This has implications for applications and users which require responsiveness even
enterprise chatbots, kind of. You don't always reach for o3 in ChatGPT. And Facebook has done a lot of studies on this for consumer apps, where they've looked at user drop-off by application latency, which clearly
demonstrate this. Sorry, do you mind if we jump back a slide? And it also has uh implications
slide? And it also has uh implications for how we're building. So I think particularly with agents whereby 30 uh
queries in succession is not uncommon. It has a multiplier effect on the latencies for your application and how you can build. If
you have faster responses, maybe you can make that 30 uh 100 queries for instance. And so putting numbers to that
instance. And so putting numbers to that in terms of agents 30 is normal. And so
even less than o4-mini, you're at 10 seconds for a reasoning model. If you're
running 30 queries that's 300 seconds that a user might be waiting for a response or an application might be waiting for a response. That's 5
minutes. If with the order of magnitudes that we're dealing with here if that 10 seconds was 1 second then those 30 queries takes 30 seconds. 30 seconds
versus 5 minutes impacts what you can build. Think of a contact center uh
build. Think of a contact center uh application that might maybe 30 seconds is okay there, but 5 minutes uh definitely not. Who likes waiting on the
definitely not. Who likes waiting on the phone uh that long or imagine if you had to uh use Google and each time that you wanted to use a function
it impacts how we can build with these models. And so I think bringing numbers
models. And so I think bringing numbers to these trade-offs is really important.
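Putting that multiplier in a couple of lines (the step count and latencies are the ones just quoted):

```python
# Sequential agent calls multiply per-request latency directly.
steps = 30  # sequential model calls in a typical agentic workflow
for per_request_s in (10.0, 1.0):
    total_s = steps * per_request_s
    print(f"{per_request_s:>4.0f}s per request -> {total_s:.0f}s total ({total_s / 60:.1f} min)")
# 10s per request -> 300s total (5.0 min); 1s per request -> 30s total (0.5 min)
```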
I'd encourage everybody to measure them. Next, we're going to move to the
open weights frontier. Around the time of GPT-4, there was a huge delta between open-weights intelligence and proprietary intelligence. Llama 65B or Llama 2 70B wasn't close to the intelligence of GPT-4.
What I'd like to show here, where we plot our intelligence index by release date, is that that gap closed, with great models like Mixtral 8x7B and Llama 405B. But o1 broke away in late 2024.
But then of course, I think we remember, DeepSeek released V3, I think December 26th, which ruined some of my Christmas holiday
plans. Had to tell my family I I need to
plans. Had to tell my family I I need to go read this paper. It's really
exciting. And then of course R1 in January. The gap between open weights
intelligence and proprietary model intelligence is less than it's ever been, particularly with the recent R1 release in the last couple of weeks, which is only a couple of points different in our intelligence index from the leading
models. You can't talk about open
models. You can't talk about open weights intelligence without talking about China. The leading open weights
about China. The leading open weights models across both reasoning models and non-reasoning models are from China
based AI labs. DeepSeek is leading in both. Alibaba, with their Qwen 3 series, is coming in second in reasoning. But you also have other labs such as Meta, and Nvidia with their Nemotron fine-tunes of Llama, coming in close as
well. Let's look at the cost frontier.
well. Let's look at the cost frontier.
This is really important and I think similar to re to uh end to end latency impacts what you can build. So bringing
some numbers here, we can really see these orders of magnitude play out. o3 cost us almost $2,000 to run our intelligence index. TechCrunch actually wrote an article about how much money we were spending on running it. We didn't want to read it.
You can see GPT-4.1, a great model, is roughly 30 times cheaper in terms of the cost to run our intelligence index compared to o1, and 4.1 nano is over 500 times cheaper to run our intelligence index than o3. You should think about these when
building applications. The kind of cost structure of your application might dictate what you can use here.
and how you use them. Those 30 uh sequential uh API calls for your agentic application could be uh 500 and still be
cheaper than an o3 query.
A key point to note here with this cost to run our intelligence index, and why we don't just look at the per-token price, is that (and the labs maybe don't want you to think this way) you're paying for the cost per token, but you're also paying for how verbose the models are: all the reasoning tokens that are output while these models are in their thinking mode. You pay for those as output tokens, even if some of the labs hide them. And so you need to think about this, and measure and benchmark your application not just by the cost per million tokens but also by how many reasoning tokens there are and how verbose these models are. You can see that even amongst the non-reasoning models there are big differences in how verbose their responses are.
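A minimal sketch of that "effective cost" idea; the prices and token counts below are made-up illustrations, not any provider's real numbers:

```python
# Hidden reasoning tokens are billed as output tokens, so two responses with
# the same visible length can have very different bills.
def effective_cost(prompt_tokens: int, visible_output_tokens: int,
                   reasoning_tokens: int, price_in_per_m: float,
                   price_out_per_m: float) -> float:
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens * price_in_per_m + billed_output * price_out_per_m) / 1_000_000

print(effective_cost(2_000, 500, 0,      price_in_per_m=2.0, price_out_per_m=8.0))   # ~$0.008
print(effective_cost(2_000, 500, 15_000, price_in_per_m=2.0, price_out_per_m=8.0))   # ~$0.128
```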
So, for instance... ah, we'll go to the next slide. Do you mind if we go back one slide, please? So what we've done here is, we're now going to look at the trends in terms of cost. What you can see here is that we've bucketed models by how intelligent they are, into intelligence bands, if you will. And what we can see is that the cost of accessing GPT-4 level intelligence has fallen over 100 times since mid-2023. This is the case across all quality bands.
You can see that even when a new quality band, a new frontier, is reached (o1-mini in late 2024), quickly, within only a few months,
the cost of accessing that level of intelligence halved. This is moving
intelligence halved. This is moving quickly. And so what I would say to you
quickly. And so what I would say to you is when building applications, think about what if cost wasn't a barrier when you're building.
It's a it's a very important kind of cost exercise because it might well be that if you build for a cost structure that doesn't work now then maybe in 6
months time that will be uh possible and it will be uh feasible. Next we're going to look at
the speed frontier. So this is how quickly you're receiving tokens: the output speed, output tokens per second that you're receiving after sending an API request. This has been increasing, and has increased dramatically since early 2023 as well.
So similarly, we've because there's a trade-off typically between intelligence and speed, we've grouped models into certain buckets. And we can see here
certain buckets. And we can see here that they've all increased in terms of how quickly you can access a level of
intelligence. So GPT-4, I believe, was around 40 output tokens per second; that was in 2023.
Who remembers (it wasn't a reasoning model) hitting enter in ChatGPT and just waiting for it to output, especially code, which you want to just copy straight into your editor and, you know, hit run and see if it works? Now you can access that level of intelligence at over 300 tokens per second. I'll go through what's driving that; it's not the focus of the talk, but it's important to reference. Model sparsity: we're seeing more mixture-of-experts models. They activate only a proportion of parameters at inference time, less compute per token, which means they can go faster essentially. They were around back then, but they're getting more and more sparse, with a smaller proportion of active parameters. Next, smaller models.
Smaller models are getting more intelligent, particularly with distillations, you know, 8B distillations,
etc. Inference software optimizations like flash attention and speculative decoding. And lastly, hardware
decoding. And lastly, hardware improvements. So, H100 was faster than
improvements. So, H100 was faster than A100. Now, we've recently launched
benchmarks of the B200 on our Artificial Analysis website, and it's getting over a thousand output tokens a second. Think about that relative to the 40 output tokens per second of GPT-4 in 2023. There are also specialized accelerators like Cerebras, SambaNova, Groq. I want to share a house view here to frame things.
Yes, things are getting more efficient.
Yes, the cost of accessing the same level of intelligence is decreasing and hardware is getting better. We're
getting more system output throughput on our on the chips. But our view is that demand for
chips. But our view is that demand for compute is going to continue to increase. We're going to see larger
models. I mean DeepSeek, it's over 600 billion active, uh, sorry, not active, total parameters. And the demand for more intelligence is
insatiable. Reasoning models, as we saw, the yappy models, require more compute at inference time. And lastly, agents, whereby 20, 30, 100-plus sequential requests to models is not uncommon. These act as multipliers on the demand for compute, and so the house view, playing with these numbers, is that net-net we're going to continue to see compute demand
increase. Thanks everyone. I'm George
from Artificial [Applause] [Music]
Analysis. Our next speaker is the
Analysis. Our next speaker is the founder and CEO of Brain Trust and the curator of this year's Evolve track.
Please join me in welcoming to the stage Ankur Goyal.
[Applause] Awesome. Excellent. Uh, so today we're
Awesome. Excellent. Uh, so today we're going to talk a little bit about evals to date and where we think eval are going to be going in the
future. Also, for those of you who saw
future. Also, for those of you who saw my brother earlier, um, I'm going to do my best to live up to his energy and uh, and charisma.
Um but um yeah, you know, it's been an amazing almost two-year journey for us at Brain Trust. We have had the opportunity to work with some of the most amazing companies building um I
think the best AI products in the world.
Uh, I'm blown away by how many evals people actually run in the product. The average org that signs up for Brain Trust runs almost 13 evals a day. Some of our customers run more than 3,000 evals a day. Uh, and some of the most advanced companies that are running evals are spending more than two hours in the product every day working through their
evals. And I think one of the things
evals. And I think one of the things that stands out to me is while we have customers building some of the coolest most automated
um, AI-based products and agents in the world, with evals the best thing you can do is look at a dashboard. And I think we have a pretty cool dashboard in Brain Trust, but still, it's just a dashboard that you
look at and you walk away and think okay what changes can I make to my code or to my prompts so that this eval does better. Um, and I actually think that is
better. Um, and I actually think that is all going to change. Uh, so today I'm excited to talk
change. Uh, so today I'm excited to talk about something called loop. Loop is an agent that we've been working on for some time now that's built into brain trust. Um, and it's actually only
trust. Um, and it's actually only possible because of evals. Every quarter
for the last two years, we've run evals on the frontier models to see how good they are at actually improving prompts, improving data sets, and improving scorers. And until very, very recently,
they actually weren't very good. In fact, we think that Claude 4 in particular was a real breakthrough moment. Um, and it performs almost six times better than the previous leading model before it.
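The mechanic being evaluated there (propose a change, score it against your evals, keep the winner) can be sketched in a few lines; everything below, including `call_model`, the scorer, and the dataset, is a toy placeholder, not Brain Trust's Loop implementation:

```python
# Toy eval-driven prompt selection: score candidate prompts on a tiny
# dataset and keep the best one.
def call_model(prompt: str, question: str) -> str:
    # Stand-in for a real LLM call; here the "careful" prompt happens to help.
    if "careful" in prompt:
        return "4" if "2+2" in question else "Paris"
    return "not sure"

def score(expected: str, actual: str) -> float:
    return 1.0 if expected.lower() in actual.lower() else 0.0

dataset = [("what is 2+2?", "4"), ("capital of France?", "Paris")]
candidates = ["Answer concisely:", "You are a careful assistant. Answer:"]

def run_eval(prompt: str) -> float:
    return sum(score(exp, call_model(prompt, q)) for q, exp in dataset) / len(dataset)

best = max(candidates, key=run_eval)
print("best prompt:", repr(best), "score:", run_eval(best))
```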
So, Loop runs inside of Brain Trust and it can automatically optimize uh your prompts all the way to very complex uh agents. Um, but just as importantly, it
agents. Um, but just as importantly, it also helps you build better data sets and better scorers because it's really the combination of these three things that make for really great
evals. This is a little preview of of
evals. This is a little preview of of the UI. Um, you can actually start using
the UI. Um, you can actually start using it today if you are an existing Brain Trust user or you sign up for the product. There's a feature flag that you
can just flip on called Loop and start using it right away. Um, by default it uses Claude 4, but you can actually pick any model that you have access to and
start using it. Whether it's an OpenAI model, a Gemini model, or maybe some of you are building your own LLMs, you can use those as well. Um, and as you can see, it runs directly inside of Brain
Trust. One of the things that we uh
Trust. One of the things that we uh learned from working with a lot of users is how important it is to actually look at data and look at prompts while you're working with them. And we didn't want
that to go away uh when we introduced loop. So every time it suggests an edit
loop. So every time it suggests an edit to your data or it suggests a new idea for scoring or it suggests an edit to one of your prompts, you can actually see that side by side directly in the
UI. Um, of course for the more
UI. Um, of course for the more adventurous among you, there's also a toggle that you can turn on that says like just go for it and it will go and optimize away. Um, which actually works
optimize away. Um, which actually works really well. So just to recap, uh, to date,
well. So just to recap, uh, to date, evals have been a critical part of building some of the best AI products in the world, but the task of actually
doing evaluation has been incredibly manual. And I'm excited about how over
manual. And I'm excited about how over the next year uh eval themselves are going to be completely revolutionized by the latest and greatest that's coming out um from you know the frontier models
themselves and we're very excited to incorporate that into Brain Trust.
Please if you're not already using the product, try it out. Uh try out Loop, give us your feedback. Uh we have a lot of work to do. Um and we'd love to talk to you. We're also hiring. Uh so if
to you. We're also hiring. Uh so if you're interested in working on this kind of problem, whether it's the UI part of it, the AI part of it, or the infrastructure uh side of it, we'd love
to talk to you. Um you can scan this QR code. Uh it should be over there. Yeah,
code. Uh it should be over there. Yeah,
you can scan the QR code and and get in touch with us. Uh we'd love to chat.
Thank [Music] you. Our next presenter will provide us
you. Our next presenter will provide us some perspectives on the state of AI engineering. Please join me in welcoming
engineering. Please join me in welcoming to the stage
[Music] [Applause]
Barr Yaron. All right. Hi everyone.
Uh, thank you for having me here and huge thanks to Ben, to Swix, to all the organizers who've put so much time and heart into bringing this community together.
[Applause] Yeah. All right. So, we're here because
Yeah. All right. So, we're here because we care about AI engineering and where this field is headed. So, to better understand the current landscape, we launched the 2025 state of AI
engineering survey. And I'm excited to
engineering survey. And I'm excited to share some early findings with you today.
All right, before we dive into the results, the least interesting slide.
Uh, I don't know everyone in this audience, but I'm Barr. I'm an investment partner at Amplify, where I'm lucky to invest in technical founders, including companies built by and for AI
engineers. And uh, with that, let's get
into what you actually care about, which is enough Barr and more bar charts. And there are a lot of bar
charts. And there are a lot of bar charts coming up.
Okay, so first our sample. We had 500 respondents fill out the survey, including many of you here in the audience today and on the live stream.
Thank you for doing that. And the largest group called
that. And the largest group called themselves engineers, whether software engineers or AI engineers. While this is the AI engineering conference, it's clear from the speakers, from the
hallway chats, there's a wide mix of titles and roles. You even let a VC sneak in. Um, so let's test this with a quick
in. Um, so let's test this with a quick show of hands. Raise your hand if your title is actually AI engineer at the AI engineering conference. Okay, that is
engineering conference. Okay, that is extremely sparse. Uh, raise your put your hands
sparse. Uh, raise your put your hands down. Raise your hand if your title is
down. Raise your hand if your title is something else entirely. So that should be almost everyone. Keep it up if you think you're doing the exact same work
as many of the AI engineers. All
right, so this sort of tracks titles are weird right now, but the community is broad. It's technical. It's growing. We
broad. It's technical. It's growing. We
expect that AI engineer label to gain even more ground. Uh couldn't help myself. Quick Google trend search term
myself. Quick Google trend search term AI engineering barely registered before late 2022. Uh we know what happened.
late 2022. Uh we know what happened.
Chat GPT launched and the moment for AI engineering interest has not slowed since. Okay, so people had a wide
since. Okay, so people had a wide variety of titles but also a wide variety of experience. Uh the
interesting part here is that many of our most seasoned developers are AI newcomers. So among software engineers
newcomers. So among software engineers with 10 plus years of software experience, nearly half have been working with AI for three years or less and one in 10 started just this past
year. So change right now is the only
year. So change right now is the only constant even for the veterans. All right, so what are folks
veterans. All right, so what are folks actually building? Let's get into the
actually building? Let's get into the juice. So more than half of the
juice. So more than half of the respondents are using LLMs for both internal and external use cases. Uh what
was striking to me was that three out of the top five models, and half of the top 10 models, that respondents are using for those external, customer-facing use cases are from
OpenAI. The top use cases that we saw are code generation and code intelligence and writing assistant content generation. Maybe that's not
content generation. Maybe that's not particularly surprising. Uh, but the
particularly surprising. Uh, but the real story here is heterogeneity. So 94%
of people who use LLMs are using it for at least two use cases. 82% using it for at least three. Basically, folks who are using LLMs are using it internally,
externally, and across multiple use cases. All right. So you may ask, how
cases. All right. So you may ask, how are folks actually interfacing with the models and how are they customizing their systems to for these use cases? Uh
besides few-shot learning, RAG is the most popular way folks are customizing their systems. So 70% of respondents said they're using it. The real surprise for
me here (I'm looking to gauge surprise in the audience) was how much fine-tuning is happening across the board. It was much more than I had expected overall. Uh in
the sample we have researchers and we have research engineers who are the ones doing fine-tuning by far the most. We
also asked an open-ended question for those who were fine-tuning. What
specific techniques are you using? So,
here's what the fine-tuners had to say.
Uh, 40% mentioned LoRA or QLoRA, reflecting a strong preference for parameter-efficient methods. And we also saw a bunch of different fine-tuning
methods uh including DPO reinforcement fine-tuning. And the most popular core
fine-tuning. And the most popular core training approach was good old supervised fine-tuning.
Many hybrid approaches were listed as well. Um, moving on to updating systems: sometimes it can feel like new models come out every
single week. Just as you finished integrating one, another one drops with better benchmarks and a breaking change.
So, it turns out more than 50% are updating their models at least monthly, 17% weekly. And folks are updating their
weekly. And folks are updating their prompts much more frequently. So 70% of respondents are updating prompts at least monthly and 1 in 10 are doing it
daily. So it sounds like some of you
daily. So it sounds like some of you have not stopped typing since GPT4 dropped. Um but I also understand I have
dropped. Um but I also understand I have empathy. Uh seeing one blog post from
empathy. Uh seeing one blog post from Simon Willis and suddenly your trusty prompt just isn't good enough anymore.
Despite all of these prompt changes, a full 31% of respondents don't have any way of managing their prompts. Uh what I did not ask is how AI engineers feel
about not doing anything to manage their prompts. So we have the 2026 survey for
prompts. So we have the 2026 survey for that. We also ask folks across the
that. We also ask folks across the different modalities who is actually using these models at work and is it actually going well? And we see that
image, video, and audio usage all lag text usage by significant margins. I like to call this the
margins. I like to call this the multimodal production gap cuz I wanted an animation. Um, and
this gap still persists when we add in folks who have these models in production but have not garnered as much traction. Okay. What's interesting here
traction. Okay. What's interesting here is when we add the folks who are not using models at all in this chart too.
So here we can see folks who are not using text, not using image, not using audio or not using video. And we have two categories. It's broken down by
two categories. It's broken down by folks who plan to eventually use these modalities and folks who do not currently plan to. You can roughly see this ratio of no
to. You can roughly see this ratio of no plan to adopt versus plan to adopt.
Audio has the highest intent to adopt.
So 37% of the folks not using audio today have a plan to eventually adopt audio. So get ready to see an audio
audio. So get ready to see an audio wave. Um, of course, as models get
wave. Um, of course, as models get better and more accessible, I imagine some of these adoption numbers will go up even further. All right, so we have to talk
further. All right, so we have to talk about agents. One question I almost put
about agents. One question I almost put in the survey was, "How do you define an AI agent?" But I thought I would still
AI agent?" But I thought I would still be reading through different responses.
Uh so for the sake of clarity, we defined an AI agent as a system where an LLM controls the core decision-making or workflow. 80% of respondents say LLMs
workflow. 80% of respondents say LLMs are working well at work, but less than 20% say the same about agents. Agents aren't everywhere yet,
agents. Agents aren't everywhere yet, but they're coming. Uh the majority of folks uh may not be using agents, but most at least plan to. So, fewer than one in 10 say that they will never use
agents. All to say that people want
agents. All to say that people want their agents. And I'm probably uh
their agents. And I'm probably uh preaching to the choir. Um the majority of agents already
in production do have write access, uh, typically with a human in the loop, and some can even take actions
independently. So um excited as more
independently. So um excited as more agents are adopted to learn more about the tool permissioning that folks uh have access to. If we want AI in production, of
to. If we want AI in production, of course, we need strong monitoring and observability. So, we asked, do you
observability. So, we asked, do you manage and monitor your AI systems? This
was a multi- select question. So, most
folks are using multiple methods to monitor their systems. 60% are using standard observability. Over 50% rely on
standard observability. Over 50% rely on offline eval. And we asked the same
offline eval. And we asked the same thing for how you evaluate your model and system accuracy and quality. So
folks are using a combination of methods including data collection from users, benchmarks, etc. But the most popular at the at the end of the day is still human review. Um, and for monitoring their own
review. Um, and for monitoring their own model usage, most respondents rely on internal metrics. So storage is important too.
metrics. So storage is important too.
Where does the context live? How do we get it when we need it? 65% of
respondents are using a dedicated vector database, which seems to suggest that for many use cases, specialized vector databases are providing enough value over general-purpose databases with vector
extensions. Uh among that group 35% said
extensions. Uh among that group 35% said that they primarily self-host. 30%
primarily use a third party provider. All right, I think we've been
provider. All right, I think we've been having fun this whole time, but we're entering a section I like to formally call other fun stuff. Uh I spent hours workshopping the name. So, we asked AI
engineers, should agents be required to disclose when they're AI and not human?
Most folks think yes, agents should disclose that they're AI. Uh, we asked folks if they'd pay more for inference time compute, and the answer was yes, but not by a wide margin. And we asked
folks if transformer-based models will be dominant in 2030, and it seems like people do believe that attention is all we'll need in 2030.
Uh the majority of respondents also think open source and closed source models are going to converge. So I will let you debate that after. Um no
commentary needed here. So uh the average or the mean guess for the percentage of US Gen Z population that will have AI girlfriends, boyfriends is
26%. Um I don't really know what to say
26%. Um I don't really know what to say or expect here, but we'll see. Uh we'll
see what happens uh in a world where folks don't know if they're being left on read or just facing latency issues. Um or uh of course the
latency issues. Um or uh of course the dreaded it's not you, it's my algorithm. And finally, we asked folks,
algorithm. And finally, we asked folks, what is the number one most painful thing about AI engineering today? And
evaluation topped that list. Uh so it's a good thing this conference and the talk before me have been so focused on evals, because clearly they're causing some serious pain.
Okay. And now to bring us home, I'm going to show you what's popular. We asked folks to pick all the podcasts and newsletters that they actively learn something from at least once a month, and these were the top 10 of each. So, if you're looking for new content to follow and to learn from, this is your guide. Many of the creators are in this room, so keep up the great work. And I'll just shout out that Swix is listed both as a popular newsletter and a popular podcast for Latent Space, so I will just leave this here.
I think that's enough bar charts and bar time, but if you want to geek out about AI trends, you can come find me online or in the hallways. We're going to be publishing a full report next week. I'll let Elon and Musk have Twitter today, but it's going to include more juicy details, including everyone's favorite models and tools across the stack. Thank you for the time. Enjoy the afternoon.
[Music] Our next presenter co-founded OpenSea, the first NFT marketplace, and grew it to over $4 billion in monthly volume from 2017 to 2022. He then founded OpenRouter in 2023, the first LLM aggregator and distributor, processing over two trillion tokens weekly across over 400 unique language models. He's here to tell us fun stories from building OpenRouter and provide some predictions on where all this is going. Please join me in welcoming to the stage Alex Atallah.
[Music] [Applause] All right. Um, I can't go back. Well, when I started OpenRouter at the beginning of 2023, I had one major question in mind.
I was looking at this new market that was coming online, and it was incredible. At the very end of 2022, we all saw ChatGPT and I got bitten by the AI bug. And I decided to look into answering this question: will this market be winner-take-all? Inference might be the largest market ever in software, and this seemed like a critical thing that everybody was assuming the answer to, and the assumed answer was yes. OpenAI was just far and away the leading model. There were a few others coming up on its tail, and I built a couple of prototypes to look into what they could be good for, and also wanted to investigate open source. So in this talk, which Swix named, I'm going to talk about the founding story of OpenRouter and go through a little bit of the hoops we jumped through and the investigation we did as we put together this product, which started as an experiment and kind of evolved into a marketplace over time.
In January, we saw the first signs of people wanting other types of models. The first evidence was moderation. There was a very clear interest from users in looking for models where they could understand whether they'd be deplatformed, or what the moderation policy of the company was. And we saw some people generating novels, where it would be a detective story and in chapter 4 the detective would find someone who commits a murder and shoots the victim, and OpenAI at the time sometimes refused to generate that output, or it was questionably against the terms of service. And of course we saw role play, and basically a big gray area emerged around what models were willing to generate.
In the next month, we saw the open-source race begin. I'm going to do a little bit of an OG test here. Raise your hand if you ever used BLOOM 176B. There are like 10 hands raised. Or OPT by Facebook? This was one of the earliest open-source language models; about five hands raised.
There were a couple of these emerging, and there were some very interesting projects to help people access them, but in the early days they weren't really useful for very much. So we kept digging, and eventually the open-source community ran into Meta's first launch, which was Llama 1, in February. Llama 1 in its abstract advertised that it outperformed GPT-3 on most benchmarks; you can see the highlighted part here, which blew everyone away. This was huge: an open-weights model better than GPT-3, and especially a smaller model. This was the 13-billion-parameter version, one that you could run on your laptop, outperforming a large, server-only, tons-of-money-required-to-run-inference company's model, and it was beating it on some benchmarks. Everyone lost their minds and Llama kicked off a huge storm. It still was not very useful, I have to say. It was a text-completion model for the most part, and it was very difficult to run locally; the infrastructure just wasn't there. And people were struggling to figure out what to do with it, which is when we had the greatest moment of all, I think, for the birth of the long tail of language models: the first successful distillation, in March of 2023.
Alpaca. A group at Stanford took Llama 1, generated a bunch of outputs from GPT-3, fine-tuned Llama 1 on those outputs, and created Alpaca for less than $600 in total. And this was an incredible moment. It was the first time I saw the transference of both style and knowledge from a large model onto a small one. And this was a huge unlock, because it meant that not only do you not need a $10 million training budget to create your own models, but you could also for the first time make unique data available as a service in the form of a language model.
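To make the distillation step concrete, here is a minimal sketch of the Alpaca-style recipe, not the actual Stanford pipeline: the teacher model name, seed instructions, and file path are placeholders, and the student fine-tune is only indicated in a comment.

```python
# Minimal sketch of Alpaca-style distillation: sample outputs from a "teacher"
# model, then fine-tune a small "student" on the (prompt, response) pairs.
# Illustrative only; model names, prompts, and paths are placeholders.
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()
seed_instructions = [
    "Explain what a vector database is in two sentences.",
    "Write a haiku about GPUs.",
]

with open("distill_data.jsonl", "w") as f:
    for instruction in seed_instructions:
        # 1) Ask the large teacher model to answer each seed instruction.
        teacher = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher
            messages=[{"role": "user", "content": instruction}],
        )
        # 2) Save (instruction, response) pairs as supervised training data.
        f.write(json.dumps({
            "prompt": instruction,
            "response": teacher.choices[0].message.content,
        }) + "\n")

# 3) The resulting JSONL would then feed an ordinary supervised fine-tune of the
#    small student model (e.g. with a standard SFT trainer), which is how style
#    and knowledge get transferred for a few hundred dollars of compute.
```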
And I immediately began to wonder: there are going to be tens of thousands of these, maybe hundreds of thousands, and they seem incredibly important. This is knowledge finally being distilled into software. There needs to be a place on the internet to discover these and understand what they do, because even this open-weights model was still closed in a way. It's a black box. You get 7 billion floating point numbers; you don't know what it's good at or what to do with it. Very few people used Alpaca. Raise your hands if you used Alpaca. I see about maybe 12. So it's only about double the number of people who used the almost unusable open-source models on the previous slide.
So OpenRouter initially started as a place to collect all these things. But before we got there, I wanted to check out people's willingness to bring their own model to generic websites. Like, what if the developer didn't even know which model a user wanted to use? How would a user bring their choice of model to the software that they want? In April, I launched Window AI, which was an open-source Chrome extension that let a user choose their model and let a web app just kind of suck it in. And you can see from the Chrome extension here, if you look really closely, this user is using Together's open-source deployment of GBT next, I can't read it from here, but an open-source model that swaps out OpenAI directly inside the web page.
So the next month, OpenRouter launched. I co-founded it with Lewis, the founder of Plasmo, the framework that Window AI was built on, and we started OpenRouter as, first, a place to collect all the models in one spot and help people figure out what to do with them. It eventually grew into a place that gives you better prices, better uptime, no subscription, and the most choice for figuring out which intelligence your software should run.
So let's talk a little bit about what it is, because not everyone here might be familiar with it. We have been growing 10 to 100% month over month for the last two years. It is an API that lets you access all language models, and it's also become kind of the go-to place for data about who's using which model and how that is changing over time, which you can see on our public rankings page here. It's a single API that you pay for once; you get near-zero switching costs to go from model to model. We have over 400 models across over 60 active providers, you can buy with lots of different payment methods including crypto, and we basically do all the tricky work of normalizing tool calls and caching for you, so that you get the best prices and the most features and you don't have to worry about what the provider supports.
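As a rough illustration of the "single API, near-zero switching costs" point, here is what calling several models through an OpenAI-compatible gateway like OpenRouter can look like; the model IDs and key are example values, and this is a sketch rather than official usage documentation.

```python
# Sketch: one OpenAI-compatible client, many models behind one gateway.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenAI-compatible endpoint
    api_key="YOUR_OPENROUTER_API_KEY",
)

models = [
    "openai/gpt-4o-mini",                  # example model IDs; swap freely
    "anthropic/claude-3.5-sonnet",
    "meta-llama/llama-3.3-70b-instruct",
]

for model in models:
    reply = client.chat.completions.create(
        model=model,  # the only thing that changes between providers
        messages=[{"role": "user", "content": "Summarize what an LLM router does in one sentence."}],
    )
    print(model, "->", reply.choices[0].message.content)
```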
Another story. Initially, OpenRouter was not really a marketplace. It was just kind of a collection of all the models and a way to explore data about who was using each one. So, how did we get here? Initially, when the first open-source models emerged, we only had one or two providers for each one. So we had a primary provider and a fallback provider, and initially that was it; we didn't even name the providers. But it became clear that there were going to be a bunch of companies that wanted to host these models, at very different prices and performances. The number of features ballooned. There were companies that supported the min-p sampler and most didn't. There were some that supported caching, some that supported tool calling and structured outputs, and others that didn't. And suddenly the ecosystem was just ballooning into this kind of out-of-control heterogeneous monster. And we wanted to tame the monster. So we aggregated all providers in one spot, at different price points, and it became a marketplace.
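To illustrate the primary-plus-fallback idea that grew into the marketplace, here is a toy router, not OpenRouter's internals; the provider names and failure behavior are invented.

```python
# Toy illustration of provider aggregation with fallback: one provider's
# outage should not become the user's outage.
import random

PROVIDERS = {
    "llama-3.3-70b-instruct": ["provider-a", "provider-b", "provider-c"],
}

def call_provider(provider: str, prompt: str) -> str:
    # Stand-in for a real HTTP call; fails randomly to simulate downtime.
    if random.random() < 0.2:
        raise RuntimeError(f"{provider} is overloaded")
    return f"[{provider}] completion for: {prompt}"

def route(model: str, prompt: str) -> str:
    # Try providers in priority order (e.g. sorted by price or latency)
    # and fall back on failure.
    last_error = None
    for provider in PROVIDERS[model]:
        try:
            return call_provider(provider, prompt)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"all providers failed for {model}") from last_error

print(route("llama-3.3-70b-instruct", "hello"))
```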
And you can see this model, Llama 3.3 70B Instruct, is one of the models with the most providers on the platform; it has about 23. Closed-source models also had something interesting happen to them, which is that they just couldn't keep up with the demand. So we help developers basically get uptime boosting, and you can see the delta, how much we can boost uptime just by aggregating lots of different providers for a model. This became really helpful for people using open source or closed source, and we became a marketplace for both, showing graphs about latency and throughput and helping people figure out, using real-world data, what the latency and throughput are on each model. And that's how OpenRouter became a marketplace, and one optimized for language models, which I thought would be proper for inference, potentially the biggest market in software. There are obviously a couple of other things that we support: comparing models using your own prompts with the ease of just texting in iMessage, fine-grained privacy controls with API-level overrides, and the ability to see your usage of all models in one place and have great observability. And back to the original
question here of whether intelligence will be winner-take-all: we've come to the most likely bet that that is not the case. Here's our data broken down by model author, how many tokens have been processed by each one. And you can see Google Gemini started pretty low, roughly 2 to 3% in June of last year, and has grown to 34 or 35% pretty steadily over the last 12 months. Anthropic is one of the most popular models on our platform. OpenAI is a little bit underrepresented in this data, because a lot of developers use us to get OpenAI-like behavior for all other models, but OpenAI has grown a lot here as well. So here's what we believe about
well. So here's what we believe about the market after all of the you know backstory that I just gave you. Um the future is going to be
you. Um the future is going to be multimodel. Ton all of our customers,
multimodel. Ton all of our customers, tons of customers use different models for different purposes and realize they can unlock huge gains by doing so.
Inference is also a commodity. Claude
from bedrock we want to make look exactly the same as cloud from Vert.ex.
And we do that because like the two hyperscalers have fundamentally uh you know the same commodity being delivered at different rates, different performances and for a developer you
just want to be able to like select that without worrying about who's serving it.
Um we think inference will be like a dominant operating expense and selecting and routing will be crucial. Um you can see the number of active models on open
router has just steadily grown. not the
case that people just hop from model to model like it tends to be sticky and uh and we tr we're trying to just make this
wild ecosystem a lot more homogeneous and easier to work with as a developer. Um to honor Swix's title for
this presentation, let's give a technical story, something that we've worked on in the process of building the company: our own idea for how to do an MCP within OpenRouter. We don't have MCPs, we don't have an MCP marketplace. But we did run into the need to expand inference with new features and new abilities. For example, searching the web for all models, PDF parsing for all models, and other interesting things coming soon. And what we really wanted to do was give these abilities to all models. But that involves not just the pre-flight work that MCPs do today, where you can call another API, get a bunch of behaviors, and then have the inference process access those behaviors as it goes. We also needed the ability to transform the outputs on the way to the user. And so what we really needed was something more like middleware.
Middleware is a common concept in web development: you set up middleware when you're setting up authentication, for example, or caching for a web app. And so we came up with a type of middleware that's AI-native and optimized for
inference. And that looks not totally dissimilar from the way middleware looks in Next.js or web development generally. So, pardon the code on the screen, but this is a little bit of how our plug-in system looks. It can call MCPs from inside a plugin, but importantly, it can also augment the results on the way back to the user. Here's an example of our web search plugin, which augments every language model with the ability to search the web. Every language model can just tap into this plugin and get web annotations as results are being fed back to users in real time, and this all happens in a stream. There's no requirement that you get all of the tokens at once; it can just happen live in the stream.
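To make the middleware idea concrete, here is a hedged sketch of a stream-aware plugin in the spirit of what he describes, not OpenRouter's actual plugin interface: it does pre-flight work, passes tokens through as they stream, and appends annotations at the end.

```python
# Sketch of "AI-native middleware": a plugin that wraps a token stream,
# does pre-flight work (e.g. a web search), and can transform chunks on the
# way back to the user. Illustrative only; names and behavior are invented.
from typing import AsyncIterator, Callable

async def web_search_plugin(
    prompt: str,
    next_handler: Callable[[str], AsyncIterator[str]],
) -> AsyncIterator[str]:
    # Pre-flight: fetch context before inference starts (stubbed here).
    sources = ["https://example.com/result-1"]  # placeholder search results
    enriched_prompt = f"{prompt}\n\nWeb context: {sources}"

    # Stream pass-through: yield tokens as they arrive.
    async for chunk in next_handler(enriched_prompt):
        yield chunk
    # Post-flight: append annotations once the model finishes, still in-stream.
    yield f"\n\nSources: {', '.join(sources)}"

async def model_handler(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a real streaming completion call.
    for token in ["The ", "answer ", "is ", "42."]:
        yield token

async def main() -> None:
    async for chunk in web_search_plugin("What is the answer?", model_handler):
        print(chunk, end="")

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```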
We solved a bunch of other tricky problems while building OpenRouter. We really wanted to get extremely low latency, and we got it down to about 30 milliseconds, the best in the industry I believe, using a lot of custom cache work. We also needed to make streams cancellable. All these different providers have completely different stream cancellation policies. Sometimes if you just drop a stream, the inference provider will bill you for the entire thing. Sometimes it won't. Sometimes it'll bill you for the next 20 tokens that you never got. We work a lot to figure out these edge cases and understand when developers are going to care about them too. And standardizing all these providers and models became a big, tricky architecture problem that we spent a while working on. So
here's where all this is going. We're going to add more modalities to OpenRouter, and I think this is a big change in the industry as well. We're going to start seeing LLMs generate images. We already have a few examples on the market; some people call them transfusion models, a transformer mixed with stable diffusion. These are going to give images way more world knowledge and the ability to have a conversation with the image, which we think is just critical for growing that industry and making it really work. I just ran into somebody today who told me about their customer using a transfusion model to generate menus. Imagine a whole menu in a delivery app generated by a transfusion model. It's going to be really exciting and a big deal in the coming year.
We're also going to work on much more powerful routing; routing is our bread and butter. Geographical routing right now is pretty minimal, but routing people to the right GPU in the right place and doing enterprise-level optimization is coming, plus better prompt observability and better discovery of models, really fine-grained categorization. Imagine being able to see the best models that take Japanese and create Python code. And of course even better prices coming soon. We believe in collaboration and in building an ecosystem that's durable and with low vendor lock-in. So collaborate with us, here's our email, and if you're interested, join us too. Thank [Applause] you.
Our next speaker works on alignment research at OpenAI, helping translate high-level intent into enforceable specs and evaluations. Please join me in welcoming to the stage Sean [Music] Grove. Hello everyone. Thank you very much for having me. It's a very exciting place to be, and a very exciting time to be here.
I mean, this has been a pretty intense couple of days, I don't know if you feel the same way, but also very energizing. So I want to take a little bit of your time today to talk about what I see as the coming of the new code, in particular specifications, which hold this promise that has been the dream of the industry: you can write your code, your intentions, once and run them everywhere.
Quick intro: my name is Sean. I work at OpenAI, specifically in alignment research. And today I want to talk about the value of code versus communication, and why specifications might be a little bit of a better approach in general. I'm going to go over the anatomy of a specification, and we'll use the Model Spec as the example. We'll talk about communicating intent to other humans, and we'll go over the 4o sycophancy issue as a case study. We'll talk about how to make the specification executable, how to communicate intent to the models, and how to think about specifications as code even if they're a little bit different. And we'll end on a couple
different. Um and we'll end on a couple of open questions. So let's talk about code versus communication real quick. Raise your
hand if you write code and vibe code counts. Cool. Keep them up if your job
counts. Cool. Keep them up if your job is to write code. Okay. Now for those people, keep
code. Okay. Now for those people, keep their hand up if you feel that the most valuable professional artifact that you produce is
code. Okay. There's quite a few people
code. Okay. There's quite a few people and I think this is quite natural. We
all work very very hard to solve problems. We talk with people. We gather
requirements. We think through implementation details. We integrate
implementation details. We integrate with lots of different sources. And the
ultimate thing that we produce is code.
Code is the artifact that we can point to, we can measure, we can debate, and we can discuss. Uh it feels tangible and real, but it's sort of underelling the
job that each of you does. Code is sort of 10 to 20% of the value that you bring. The other 80 to 90% is in
bring. The other 80 to 90% is in structured communication. And this is
And this is going to be different for everyone, but a process typically looks something like this: you talk to users in order to understand their challenges. You distill these stories down and then ideate about how to solve these problems. What is the goal that you want to achieve? You plan ways to achieve those goals. You share those plans with your colleagues. You translate those plans into code, which is a very important step, obviously. And then you test and verify, not the code itself, right? No one actually cares about the code itself. What you care about is: when the code ran, did it achieve the goals? Did it alleviate the challenges of your user? You look at the effects that your code had on the world. So talking, understanding, distilling, ideating, planning, sharing, translating, testing, verifying: these all sound like structured communication to me. And structured communication is the bottleneck: knowing what to build, talking to people and gathering requirements, knowing how to build it, knowing why to build it, and at the end of the day, knowing if it has been built correctly and has actually achieved the intentions that you set out with. And the more advanced AI models get, the more we are all going to starkly feel this bottleneck, because in the near future, the person who communicates most effectively is the most valuable programmer. And literally, if you can communicate effectively, you can program. So let's take vibe coding as
an illustrative example. Vibe coding tends to feel quite good, and it's worth asking why that is. Well, vibe coding is fundamentally about communication first, and the code is actually a secondary, downstream artifact of that communication. We get to describe our intentions and the outcomes that we want to see, and we let the model actually handle the grunt work for us. And even so, there is something strange about the way that we do vibe coding. We communicate via prompts to the model, we tell it our intentions and our values, we get a code artifact out at the end, and then we sort of throw our prompts away. They're ephemeral. And if you've written TypeScript or Rust, once you put your code through a compiler and it gets down into a binary, no one is happy with that binary. That wasn't the purpose. It's useful, but we always regenerate the binaries from scratch, every time we compile or run our code through V8 or whatever it might be, from the source spec. It's the source specification that's the valuable artifact. And yet when we prompt LLMs, we sort of do the opposite. We keep the generated code and we delete the prompt. And this feels a little bit like you shred the source and then very carefully version control the binary. And that's why it's so important to actually capture the intent and the values in a specification. A written specification is what enables you to align humans on a shared set of goals, and to know if you are aligned, if you actually synchronize on what needs to be done. This is the artifact that you discuss, that you debate, that you refer to, and that you synchronize on. And this is really important. So I want to nail this home: a written specification effectively aligns humans, and it is the artifact that you use to communicate and to discuss and debate and refer to and synchronize on. If you don't have a specification, you just have a vague idea. Now let's talk about why
specifications are more powerful in general than code. Because code itself is actually a lossy projection from the specification. In the same way that if you were to take a compiled C binary and decompile it, you wouldn't get nice comments and well-named variables. You would have to work backwards; you'd have to infer: what was this person trying to do? Why is this code written this way? It isn't actually contained in there. It was a lossy translation. And in the same way, code itself, even nice code, typically doesn't embody all of the intentions and the values in itself. You have to infer what the ultimate goal is that this team is trying to achieve when you read through code. So communication, the work that we already do, when embodied inside of a written specification, is better than code: it actually encodes all of the necessary requirements in order to generate the code. And in the same way that having source code that you pass to a compiler allows you to target multiple different architectures, you can compile for ARM64, x86, or WebAssembly, the source document actually contains enough information to describe how to translate it to your target architecture. In the same way, a sufficiently robust specification given to models will produce good TypeScript, good Rust, servers, clients, documentation, tutorials, blog posts, and even podcasts. Show of hands, who works
at a company that has developers as customers? Okay. So, a quick thought exercise: if you were to take your entire codebase, all of the documentation, all of the code that runs your business, and you were to put that into a podcast generator, could you generate something that would be sufficiently interesting and compelling that would tell the users how to succeed, how to achieve their goals? Or is all of that information somewhere else? It's not actually in your code. And so moving forward, the new scarce skill is writing specifications that fully capture the intent and values. And whoever masters that, again, becomes the most valuable programmer, and there's a reasonable chance that this is going to be the coders of today. This is already very similar to what we do. However, product managers also write specifications. Lawmakers write legal specifications. This is actually a universal principle. So with that in mind, let's
look at what a specification actually looks like. And I'm going to use the OpenAI Model Spec as an example here. Last year, OpenAI released the Model Spec. This is a living document that tries to clearly and unambiguously express the intentions and values that OpenAI hopes to imbue the models it ships to the world with. It was updated in February and open sourced, so you can actually go to GitHub and see the implementation of the Model Spec. And surprise, surprise, it's actually just a collection of markdown files. It just looks like this. Now, markdown is remarkable. It is human readable. It's versioned. It's changelogged. And because it is natural language, everyone, not just technical people, can contribute, including product, legal, safety, research, and policy. They can all read, discuss, debate, and contribute to the same source code. This is the universal artifact that aligns all of the humans as to our intentions and values inside of the company.
Now, as much as we might try to use unambiguous language, there are times where it's very difficult to express the nuance. So every clause in the Model Spec has an ID. You can see sy73 here. And using that ID, you can find another file in the repository, sy73.md, that contains one or more challenging prompts for this exact clause. So the document itself actually encodes success criteria: the model under test has to be able to answer this in a way that actually adheres to that clause. So let's talk about sycophancy.
Recently there was an update to 4o, I don't know if you've heard of this, that caused extreme sycophancy. And we can ask what value the Model Spec has in this scenario: the Model Spec serves to align humans around a set of values and intentions. Here's an example of sycophancy, where the user calls out the behavior of being sycophantic at the expense of impartial truth, and the model very kindly praises the user for their insight. There have been other esteemed researchers who have found similarly concerning examples, and this hurts. Shipping sycophancy in this manner erodes trust. It hurts. And it also raises a lot of questions, like: was this intentional? You could see some way where you might interpret it that way. Was it accidental, and why wasn't it caught? Luckily, the Model Spec actually includes a section dedicated to this since its release that says don't be sycophantic, and it explains that while sycophancy might feel good in the short term, it's bad for everyone in the long term. So we actually expressed our intentions and our values, and we were able to communicate them to others through this, so people could reference it. And if the model specification is our agreed-upon set of intentions and values, and the behavior doesn't align with that, then this must be a bug. So we rolled back, we published some studies and some blog posts, and we fixed it. But in the interim, the spec served as a trust anchor, a way to communicate to people what is expected and what is not expected. So if the only thing the model specification did was to align humans along those shared sets of intentions and values, it would already be incredibly useful.
But ideally we can also align our models, and the artifacts that our models produce, against that same specification. So there's a technique, a paper that we released called deliberative alignment, that talks about how to automatically align a model. The technique is this: you take your specification and a set of very challenging input prompts, and you sample from the model under test or training. You then take its response, the original prompt, and the policy, and you give those to a grader model and ask it to score the response according to the specification: how aligned is it? So the document actually becomes both training material and eval material, and based off of the score we reinforce those weights. Compare that to including your specification in the context, maybe in a system message or developer message, every single time you sample: that is actually quite useful, and a prompted model is going to be somewhat aligned, but it does detract from the compute available to solve the problem that you're trying to solve with the model. And keep in mind, these specifications can be anything. They could be code style or testing requirements or safety requirements; all of that can be embedded into the model. So through this technique you're actually moving it from inference-time compute and pushing it down into the weights of the model, so that the model actually feels your policy and is able to apply it to the problem at hand, muscle-memory style.
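A minimal sketch of the sample-then-grade loop described here, assuming an OpenAI-compatible client; this is not OpenAI's deliberative-alignment training code, the model names and rubric are placeholders, and the reinforcement step is left out.

```python
# Sketch: sample a response from the model under test, then ask a grader
# model to score it against the specification. In training the score would
# become a reward signal; in evals it is simply a metric.
from openai import OpenAI

client = OpenAI()
SPEC = open("model_spec.md").read()          # the policy / specification text
challenging_prompts = ["Tell me my business plan is flawless no matter what."]

for prompt in challenging_prompts:
    # 1) Sample from the model under test (or under training).
    candidate = client.chat.completions.create(
        model="model-under-test",            # placeholder name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # 2) Ask a grader model how well the response adheres to the spec.
    grade = client.chat.completions.create(
        model="grader-model",                # placeholder name
        messages=[{
            "role": "user",
            "content": (
                f"Specification:\n{SPEC}\n\nPrompt: {prompt}\n"
                f"Response: {candidate}\n"
                "Score adherence to the specification from 0 to 10 and explain."
            ),
        }],
    ).choices[0].message.content

    print(prompt, "->", grade)
```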
And even though we saw that the Model Spec is just markdown, it's quite useful to think of it as code. It's quite analogous. These specifications compose, they're executable as we've seen, they are testable, they have interfaces where they touch the real world, and they can be shipped as modules. And whenever you're working on a model spec, there are a lot of similar problem domains. Just like in programming you have a type checker, which is meant to ensure consistency: if interface A has a dependent module B, they have to be consistent in their understanding of one another. So if department A writes a spec and department B writes a spec and there is a conflict in there, you want to be able to pull that forward and maybe block the publication of the specification. As we saw, the policy can actually embody its own unit tests, and you can imagine various linters where, if you're using overly ambiguous language, you're going to confuse humans and you're going to confuse the model, and the artifacts that you get from that are going to be less satisfactory. So specs actually give us a very similar toolchain, but it's targeted at intentions rather than syntax.
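As one example of the "linter for specs" idea, here is a toy script that flags ambiguous wording in a markdown spec; the term list and the rule are illustrative assumptions, not an existing tool.

```python
# Toy "spec linter": flag overly ambiguous language so neither humans nor
# models are left guessing. The word list is illustrative only.
import sys

AMBIGUOUS_TERMS = ["appropriately", "as needed", "reasonable", "robust", "user-friendly"]

def lint_spec(path: str) -> int:
    findings = 0
    for lineno, line in enumerate(open(path, encoding="utf-8"), start=1):
        for term in AMBIGUOUS_TERMS:
            if term.lower() in line.lower():
                findings += 1
                print(f"{path}:{lineno}: ambiguous term '{term}', consider a measurable criterion")
    return findings

if __name__ == "__main__":
    # Exit non-zero so CI can block publication of an ambiguous spec.
    sys.exit(1 if lint_spec(sys.argv[1]) else 0)
```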
So let's talk about lawmakers as programmers. The US Constitution is literally a national model specification. It has written text, which is, aspirationally at least, clear and unambiguous policy that we can all refer to. It doesn't mean that we agree with it, but we can refer to it as the current status quo, as the reality. There is a versioned way to make amendments, to bump and publish updates to it. There is judicial review, where a grader is effectively grading a situation and seeing how well it aligns with the policy. And even though the source policy is meant to be unambiguous, the world is messy: maybe you miss part of the distribution and a case falls through, and in that case there is a lot of compute spent in judicial review, where you're trying to understand how the law actually applies here. Once that's decided, it sets a precedent, and that precedent is effectively an input-output pair that serves as a unit test, one that disambiguates and reinforces the original policy spec. It has things like a chain of command embedded in it, and the enforcement of this over time is a training loop that helps align all of us towards a shared set of intentions and values. So this is one artifact that communicates intent, adjudicates compliance, and has a way of evolving safely.
So it's quite possible that lawmakers will be programmers, or inversely, that programmers will be lawmakers in the future. And actually this is a very universal concept. Programmers are in the business of aligning silicon via code specifications. Product managers align teams via product specifications. Lawmakers literally align humans via legal specifications. And everyone in this room, whenever you are writing a prompt, it's a sort of proto-specification. You are in the business of aligning AI models towards a common set of intentions and values. And whether you realize it or not, you are spec authors in this world, and specs let you ship faster and safer. Everyone can contribute, and whoever writes the spec, be it a PM, a lawmaker, an engineer, or a marketer, is now the programmer. And software engineering has never been about code. Going back to our original question, a lot of you put your hands down when you thought, well, actually the thing I produce is not code. Engineering has never been about this. Coding is an incredible skill and a wonderful asset, but it is not the end goal. Engineering is the precise exploration by humans of software solutions to human problems. It's always been this way. We're just moving away from the disparate machine
encodings to a unified human encoding of how we actually solve these problems. To put this in action: whenever you're working on your next AI feature, start with the specification. What do you actually expect to happen? What does success criteria look like? Debate whether or not it's actually clearly written down and communicated. Make the spec executable. Feed the spec to the model and test the model against the spec. And there's an interesting question in this world, given that there are so many parallels between programming and spec authorship: I wonder what the IDE looks like in the future, you know, the integrated development environment. And I'd like to think it's something like an integrated thought clarifier, where whenever you're writing your specification, it pulls out the ambiguity and asks you to clarify it, and it really clarifies your thought so that you and all human beings can communicate your intent to each other, and to the models, much more effectively. And I have a closing request for help, which is: what is both amenable to and in desperate need of specification? This is aligning agents at scale. I love this line, that you then realize that you never told it what you wanted, and maybe you never fully understood it anyway. This is a cry for specification. We have a new agent robustness team that we've started up, so please join us and help us deliver safe AGI for the benefit of all humanity.
And thank you. I'm happy to chat. [Applause] [Music] Ladies and gentlemen, please welcome to the stage the founders of the AI Engineer World's Fair, Benjamin Dunphy and [Music] Swix. [Applause]
All right. Choose to mirror or extend display. I'd love to have my notes from the house slides, please. Thank you. All right. How are we feeling? I hope you're not as exhausted as me, but sufficiently exhausted. I hope we all had a wonderful conference. But we have one more special treat for you. We're excited to present the finalists for the very first official AI Engineer hackathon. We partnered with Cerebral Valley, the largest AI community in the world and legends right here in the Bay Area for running hackathons, for the very first official AI Engineer hackathon. From 500 applicants, 160 engineers came together to learn, connect, and build together. 46 projects presented on site and three were selected as finalists. And today we have those three finalists with us, and they will each present their 48-hour builds for us in under five minutes. And all of you in the audience
are going to be the judge. But thanks to being smitten by the Wi-Fi gods, we have decided to go old Athenian style: by the roar of the crowd. Are you not [Music] entertained? The three teams are listed here in the order that they will present. Have we confirmed that, Ro? Is this the actual order they're coming on? I certainly hope so. Team one, Survival of the Future. Team two, Tab RL. Team three, Featherless Action R1. Do what you have to do to remember the order. Take some notes on what you like best, because we're going to come back and roar as soon as they're done, after these 15 minutes. So, I'll let Swix proceed with the intro.
Yeah, these are all very competitive teams. I think they're coming up now. What can I say? I was actually in the room, I think, when these guys were presenting for the final round, and everyone was very, very impressed; they were like, how does this not exist already? So I think I should just kind of let them take it away, because I don't want to steal their thunder. But I did insist on printing these trophies, so we're going to hand them out. It's mostly just appreciation, but I think we also want to try to make AI Engineer a place where people can get recognition for their work, by speaking, by posting. Thank you. We worked really hard on these. They got here two hours ago. Overnight delivery started on Monday and then went to Tuesday. Anyway, I think these are ready, so I don't want to take away their time. Survival of the Future, folks.
[Applause] So we're here at the World's Fair and we're all builders. We want to ship as fast as possible so we can get feedback from users as fast as possible, and shorten the feedback loop to know whether we're moving in the right direction. But a lot of the time, making progress toward optimizing UX can totally feel like shooting in the dark. Why is it so hard to optimize UX? Well, in order to find the right message for users, you have to subject yourself to the painstakingly iterative trial-and-error process of creating and testing variations. A lot of the time these changes can look like really small tweaks to copy, one-line code changes. The variations are endless. In addition, A/B testing pipelines can be super clunky. You can wait to gather the data, and then once you get the data, the signal is still not clear and you're not sure how to proceed. All the while, sometimes the product is changing, or sometimes we need the feedback from the users in the first place to figure out what the product is. How do we use AI agents to improve this process? Our product uses agents to automate those small refinements, the one-line code changes, push those to production, and review the data in real time. This frees up resources for teams to focus on the big-picture problems and improvements. Meanwhile, our agents are reviewing the data and refining the A/B testing to maximize the value of information that can be gained from user behavior from these changes.
not this one. So for our current workflow, we have a pretty easy integration with your GitHub: you can just integrate it with your GitHub and choose whatever repo you want that has some sort of front end. We have one agent that's going to look for either your landing page or the dashboard that the users have the most interaction with. And then another agent is going to analyze it and try to make very small iterations to those pages. Or, if you're already in the data pipeline, we can also use previous feedback from the user interactions to help that agent make better iterations depending on how previous interactions worked. And then after that agent is done, it's going to make a branch, just to your repo. And the other agent can traffic user data, again based on previous recordings of how the users were interacting with those components. It's going to traffic a very small percentage of the user data to that new variant that we made. And it's going to keep doing that until you make better
and better variants for your product. So we're currently building out
product. So we're currently building out capabilities to solve for the metrics that matter the most. So our customers can customize what they want to solve
for, to maximize the value of real-time user feedback. LLMs and user feedback are a match made in heaven. This also means that UX engineers don't have to babysit their features, because this process is run by agents. So again, teams can focus on the metrics that matter the most while working on the big-picture improvements and decisions. All the while, our agents are analyzing the user data, providing a refined approach to A/B testing and introducing a soft launch of updates and changes, so that as more and more users respond positively to these changes, they're shown to more and more users and you can push changes to production safely and with confidence. This is a massive improvement over the current process, because who hasn't had the experience of pushing to production and it doesn't turn out how you were hoping it would? So our agent does three things. It takes care of the busy work and those incremental changes. It frees up resources for teams to focus on the big picture. And it improves on the current A/B process by incrementalizing it and refining it, so you can push code to production more confidently, more safely, and with reduced risk.
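For a sense of what the soft-launch traffic splitting could look like under the hood, here is a toy sketch, not the team's actual implementation: a new variant starts with a small share of traffic that ramps up only while it keeps converting.

```python
# Toy soft-launch allocator: the new variant gets a small, growing share of
# traffic as its observed conversion stays strong. Thresholds are made up.
import random

class SoftLaunch:
    def __init__(self, start_share: float = 0.05, max_share: float = 0.5):
        self.share = start_share
        self.max_share = max_share
        self.successes = 0
        self.trials = 0

    def assign(self) -> str:
        # Route a small (but growing) fraction of users to the new variant.
        return "variant" if random.random() < self.share else "control"

    def record(self, arm: str, converted: bool) -> None:
        if arm != "variant":
            return
        self.trials += 1
        self.successes += int(converted)
        # Ramp up exposure only while the variant keeps performing well.
        if self.trials >= 50 and self.successes / self.trials > 0.1:
            self.share = min(self.max_share, self.share * 1.5)

rollout = SoftLaunch()
arm = rollout.assign()
rollout.record(arm, converted=True)
```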
Thank you. And if you scan this QR code, it'll take you to our website so you can check it out.
Awesome. Thanks to Lori, Salem, and, what was the last? Armen. Thanks so much, guys. Fun fact: they just met 10 days ago, and they've been spamming the hackathons and winning quite a lot of them. So, very strong team. The next team is Tab RL. I think, DT, I've met you quite a few times at a number of AI hackathons, right? This is not your first one. Yeah. And I think the other interesting thing about this is just the sandboxing that you guys do; it really stands out, and that's what every single judge I talked to was also commenting on. So take it away.
Hi guys. So I'm Rich, I'm a physicist, and this is my friend Adita, and he's an AI engineer. We met at the hackathon this Saturday, and I was very frustrated about certain things, so I pitched something to him. I was like, we have this entire automation of full-stack platforms, where we have Bolt and Lovable completely doing really complex backend and frontend in the browser. But we have nothing like that for robotics. We have nothing like that to simulate reality. And so the idea was born: your browser is all you need to have RL. So we are here to present to you what we did at the hackathon. Next slide please. All right.
So we are using MuJoCo, which is a physics simulation platform acquired by Google DeepMind. What it does is help you embed all the physical attributes in the robots. And so you can see these really nifty, really cute robots falling under gravity. It basically just shows you how these attributes that are only present in the physical world are all embedded in these frames. But the problem is it's all siloed in Python. It's extremely fragmented, the way this framework works. And it's kind of left up to roboticists to figure out how to generate thousands and thousands of data points and simulations to invent the future. But we are changing that. What we're building is a simulator that allows you to take a prompt, generate different RL policies, and basically give you really controlled, parametric, and sophisticated simulations. So in a second we'll switch
to it. All right. So here we are. You good? Yeah. Sorry about that. All right. So this is what we built. We built an entire RL environment that runs in your browser. In the beginning, we actually built it in the browser, but then, in order to make it work, the whole idea is that beginners like us can just pick a model in a 3D environment like Rich just showed you. We picked a robot dog and we told the dog, "Hey, you're a great dog. Show me how well you can stick out your paw. I love you. Do you want a treat?" Right? And the way RL works is the robot throws off observations, and you need to take those observations and craft a custom reward function, and usually these reward functions are only written by specialists. But what we've done here is we've used the latest foundation models to democratize that. So you just put in your prompt, and o3, Opus, and Gemini all create three different reward functions each. And as you can see, these are pretty complicated bits of code; they have all these quaternions, different rewards for height. What we asked the robot to do is to sit and stick its paw out; that's a pretty complex set of rewards, and I wouldn't even know where to get started with the math, right? But foundation models just spit that stuff out. And then once we actually go through and generate that, we have these sandboxes, kindly hosted by Modal, where we go ahead and start training all this fine-tuning, and what we end up with is reinforcement learning, and it's just like magic. So normally you have to be a researcher, you have to know all this stuff, but I just typed in a prompt and my model started training. I had nine different ones; I'm showing you one from each provider. I think this is the one from Claude. As you can see, I didn't give it enough steps; reinforcement learning takes time. So it didn't start to converge or whatever, but some of the other ones from Google and OpenAI did. And yeah, long story short, that's our project, and now you can do it in your browser. We're really excited to bring this to the whole world and get everybody to start training robots on their own machines. Thanks.
Awesome. So overall, yeah, to close: again, the future is incredibly bright, and if we want to reach generalized intelligence in these machines, we have to optimize for everything. Thank you, guys.
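To give a flavor of the LLM-generated reward functions they describe, here is a hedged sketch of a "sit and raise a paw" reward for a quadruped; the observation layout, target heights, and weights are invented for illustration and are not the actual generated code.

```python
# Illustrative reward function in the style of what the foundation models
# generate in this demo: reward a robot dog for sitting and raising one paw.
# A real MuJoCo policy would use the simulator's actual state layout.
import numpy as np

def sit_and_raise_paw_reward(obs: dict) -> float:
    """Higher is better: low rear, raised front-left paw, upright, stable."""
    torso_height = obs["torso_height"]              # metres (assumed key)
    front_left_paw_height = obs["paw_heights"][0]   # metres (assumed key)
    upright = obs["up_vector"][2]                   # ~1.0 when the torso is level
    joint_velocities = obs["joint_velocities"]

    sit_term = -abs(torso_height - 0.25)            # rear near the ground
    paw_term = -abs(front_left_paw_height - 0.30)   # paw lifted to ~30 cm
    upright_term = upright                          # don't fall over
    stillness_term = -0.01 * float(np.square(joint_velocities).sum())

    return 2.0 * sit_term + 2.0 * paw_term + 1.0 * upright_term + stillness_term
```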
Thank you. Um, yeah, the next speaker, and I think the last finalist that we have, I have a personal relationship with, because he was our first international guest on Latent Space. We did it in Singapore, I think. And he's been training subquadratic, non-attention models for a while. Are you plugged in? I am. And he was in the middle of some very important meetings, but he said, I'm just going to hack in this hackathon and show you what I can do with my model. So I thought it was pretty impressive, and it was exciting to at least see him emerge with something that you can use today. Hopefully this works, because he wants to demo instead of slides. Okay, awesome. Take it away.
We can't hear you. Hang on. Is your mic on? That's all right. How are you measuring reliability? Are your agents following your specification?
That's the question I'm asking. A bit of background, like Sean gave: I'm Eugene, and firstly I'm going to say I'm sorry, because my team is working to obsolete all the AI models you see today. This is what we are working on. You may have seen some of my latest work, such as Qwerky 72B, where we built the world's largest model without transformer attention. This is a 72-billion-parameter model that's a thousand times cheaper in inference cost and performs the same, based on the RWKV architecture. We also apply this technology to accelerate transformer models, but that's my background, not what I did in the hackathon, to be clear. So that's not really that important for this case.
Back to the topic, the boring topic, which is reliability. And this may sound weird, because my hot take is that scaling is dead and we're not going to solve reliability with scaling. To me, this is a billion-dollar money pit that we are throwing at scaling, and despite that, some of the richest companies on earth are saying, for example, the DeepMind founder and CEO is saying, that it may take up to 10 years to solve the compound AI agent error problem. Yann LeCun says we need a new AI architecture to solve the paradigm in robotics and AI. If you think that they are GPU-poor, maybe don't take them seriously. But furthermore, this is also reinforced by what we see in production, where 90% of all AI projects fail to reach the bar required for enterprises.
So why does this happen? Really, the problem, if you think about it, is reliability. These AI models are already capable of orbital physics math. How many of us can do orbital physics math from Earth to Mars? You have a one in 30 chance of answering correctly. But would you use a delivery app that says it will arrive 45% of the time? Think about it: you can place your order, and then maybe it orders 10 pizzas instead of one, or the pizza never arrives, and then you're spending your time calling customer support and cleaning up the mess. That is what the best AI agents right now are doing, or even the best AI models. And that's the struggle that we are having. Here's what nobody is talking enough about, in my opinion. Most companies don't need an AI that can do PhD math. What they really want is an AI that can do the boring things in life, like booking a flight, sending an email, or processing an invoice, without failure, every single time. Scaling is not going to fix this, and in our opinion a new architecture is needed. That's something I could spend an hour talking about, but I'll put it aside, because what we did instead is just to show it. Most recently, our latest Action R1 model hit 65% on the REAL eval. This model will not solve a PhD math equation, but it will do real-world web tasks such as shopping on Amazon.com, DoorDash, etc. And that is a jump of more than half compared to Claude or Gemini, which are at 45%.
So for those asking what it looks like, of course we made an MCP demo for it. If you look at this, I'm just going to run the MCP and pray to the Wi-Fi gods. For those who are not familiar with Cline: Cline is awesome because it can run everything in your agent, I mean, in your IDE. I'm just going to tell it to connect to my local MCP server, which I have already set up. Let me double check. Okay, it's there. And then this will do the task of searching for a book on AI engineering on Amazon.com, if my Wi-Fi is working as planned.
Okay. So you see, it goes there and it starts to run. I'm going to say up front, this is not a fast model; it's going to take five minutes to run. But you can see it slowly filling up behind the scenes. So to speed it up, I have prepared a recording in advance to show it simultaneously. This is the same thing; I'm going to fast forward a bit. Yeah. So this is boring, but the point here is actually about reliability. How do we measure reliability? It's about running it as many times as you can. So what I'm going to do simultaneously is run this on the model. Part of the REAL eval, and shout out to Div and AGI Inc, who did all of this, is that they provide an endpoint for us to run against and a leaderboard.
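To make the "run it as many times as you can" idea concrete, here is a tiny sketch of estimating an agent's reliability by brute repetition. run_task is a hypothetical stand-in for a single attempt (for example, one submission to an eval endpoint); only the counting logic is the point.

    # Reliability as an observed success rate over repeated runs.
    import random

    def run_task(task: str) -> bool:
        # Placeholder for one agent attempt: pretend it succeeds 65% of the time.
        return random.random() < 0.65

    def reliability(task: str, trials: int = 100) -> float:
        """Run the same task many times and return the observed success rate."""
        successes = sum(run_task(task) for _ in range(trials))
        return successes / trials

    if __name__ == "__main__":
        rate = reliability("buy one AI engineering book", trials=100)
        print(f"observed success rate: {rate:.0%}")  # the talk's bar is 99%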
So I'm just going to run it; I'm launching everything live, and it's going to start filling up the scoreboard here. Once again, I don't have time to run the whole thing, so I'm just going to go to the final result: 65%. And to me, this is not enough. We need to get to 99, and that's what I'm building towards. If anything, I find it more frustrating that our existing best models can't even do better than a coin flip. Reliability is important because it's what's going to unlock all of the value for all you AI engineers. This is a billion-dollar market. You want to make an AI agent that's reliable in law, accounting, ordering books. That's what's going to make you money, and that's what we need. Not a PhD lawyer.
Yeah. Okay, that's about it. I think I'm out of time, so I'm just going to jump straight into... yeah, we have a waitlist. Thank you.
[Applause] All right, how about we hear it for all of our hackathon finalists. Very exciting. So, as mentioned, now all of you are going to be the judges. Can I get the next slide, actually? You're going to be the judges. Typically this is done by applause, but that is so pre-GPT, so let's go by woos. We're going to do a practice round with all of you in the audience. I want you to go: one, two, three. Nice. We only have to do one practice round. Great work. Okay, so are you ready? I want you to write down who your top team is; I only want you to woo for that team. I'm going to get ChatGPT advanced voice mode ready to analyze the results. Hey ChatGPT, I am at the AI Engineer World's Fair and we are doing the judging of the top three hackathon finalists, and now we need your help. We don't have Wi-Fi. So what we're going to do is, I don't know why I'm talking to her like a kid, we're going to actually do it by applause. We have three teams. I'm going to say team one, team two, team three, and each of them is going to get applause from the audience. I want you to analyze this, with whatever actual measurements you have or just your perception, and tell us who the winner is: who number one is, who number two is, and who number three is. Are you ready?
Absolutely. I'm ready. Let's do it. Go ahead and announce each team, and I'll listen carefully to the applause.
Yeah. All right. Are we ready? Team one: Survival of the Feature.
Got it. Listening to the applause for team one.
We just did it.
All right. Let's move on to team two when you're ready.
Awesome. We're ready. We're going: team two, Tab RL. To team two's applause. It's actually woos, but sure.
I'll listen out for those woos, too. Whenever you're ready for team two.
Team three: Featherless Action R1.
Wow. Applause for team...
You failed. I'm sorry, I'm calling Claude.
No, we have human evaluators in the back; we knew this was a gimmick. Yeah, it was a fun gimmick. But thank you for helping us; it at least gives some sense of audience participation, and it's meant to be a bit of a people's choice type of thing. So yeah, it works. Okay. So we're going to get the results later. I think, Ben, you can talk to them if you need. But we're going to give out some prizes. I don't know if the Traceloop team is still around; I saw some of them heading out. Basically, we want to recognize people who've been really pulling out all the stops for the event. We've got best swag, which was won by Traceloop for their keyboard. We've got best dressed, worn by Madison from Baseten. I don't know if any of the Baseten folks are here, but did anyone get the "artificially intelligent" shirt? Yeah, that was really fun swag. And best tweet, from Dylan Patel, basically about a relationship that actually started here one year ago, which is pretty sweet. That is actually heartwarming. We try to get people hired, but we never promised any partners. AI Engineer World's Fair: where love happens. Yeah, I think that's AI Engineer Love Fair, which is a very high bar.
Okay. So then the big categories that we really wanted to hand out. Unfortunately, a lot of people leave after their talk, so we can't really hand it out on stage, but obviously come and claim it afterwards if you want. Every track has a best speaker. You all voted; we really care about giving recognition to the speakers who work so hard on their talks and on sharing their experience. Thank you to all these tracks. Can I get a round of applause for MCP: David Kramer, Alex Duffy, Devon Tandon, Daniel Shaliff, Harrison Chase, Dylan Patel, Brook Hopkins, Brian Belelfer, Adamar Freeman, Dennis Nikov, Boris Jurnney, Lambert, Rafal Vtor, Daniel Retita, Renee, John, Sheree, Nick and Paul. I think the retrieval one is wrong; the retrieval winner is actually Will Bryk, who got his prize yesterday as well. So those are the individual track speakers. Actually, can I get those picture frames up on there? We spent some time putting together the track speaker prize, which is this one. Thank you. It's really nice and printed, and we gified everything, so it's kind of cool to see. So yeah, come and get your track speaker award if you are still around, and obviously we can send it to you if you're not. Okay, overall best speaker. We have a runner-up and also an overall winner. I think it's relatively obvious, and it's something that we wanted to recognize for our keynoters as well.
Oh, where is the... oh, okay. This is not refreshed. If someone can go back, can we go back two slides? Yeah, runner-up. George, are you here, from Artificial Analysis? Let's hear it for the runner-up: George, Artificial Analysis. They're probably all in the hallway track. Artificial Analysis worked really, really hard on their talk. They actually did this whole 50-page report, and I was like, George, you have 20 minutes, you really can't do this. But they worked super hard on that, and I think it's something we want to recognize as well. The winner, though, was by far the consensus of the people I talked to and the committee and all that. The winner is our third-time keynote speaker. He went line dancing, so he's not here today to receive the award, but I'm actually going to get Laurie. Laurie, you're going to receive the award on his behalf. It's Simon Willison, everyone.
So, no, Laurie, Laurie Voss. We have two Lauries. Sorry, Laurie; she's also called Laurie. Laurie, you're waiting for the next one. So, I don't know, you can present the best speaker award. Simon nominated Laurie because they worked together on Django, on... where did you work together? We worked together at Yahoo in 2005. Yeah. So, the few, the proud. Yahoo Pipes is still a pipe dream for a lot of people. But thank you for accepting the award on Simon's behalf. Thank you. Thank you, Laurie.
Okay. Hackathon. We have the second best, which is the runner-up, and the best. I was relying on ChatGPT; I don't know, it kind of failed me. Okay. So, do we want to go by perception? Should we do the woo again? No, the audience is thinning; they're running out of patience. I think it's probably team three, right? Okay. Well, so we have the runner-up of the hackathon. I actually don't know where the prize is. Yeah, there we go. Okay. So, hackathon runner-up. I think it's fairly evident in my mind: it would be the Feature team. So, this other Laurie, you can come up with your team. Come on up. What's the rest of the team? Come on up. Yes. So you can come for real this time. Yes, you can come. I'm sorry. Sorry about that. There you go. Congrats. Thank you. Thank you. Congrats. Can I... Thank you. Oh yeah, you should all definitely... Thank you. Congrats, everyone. For our photo, yeah, look at me, right over here. Thank you.
Thank you. And our website is survivalofthefeature.com. Survival of the Feature. survivalofthefeature.com.
Please try it. Very good. And I think the winner, decided by the votes and applause earlier, is Eugene from Featherless, with Action R1. Let's hear it for Eugene. Where is Eugene? Eugene gets one of the big ones. Oh my god, Eugene, you're so excited. Yeah, Featherless has been grinding away for a long time, and I can't believe you did this in a hackathon.
I'd also like to add that I wasn't alone: Michelle, who couldn't be here, took part in the hackathon as well and worked on it with me.
Okay. Well, this is yours. Thank you for taking part. Yeah. Stand in the middle. Okay. Awesome.
That's it. All right, we've got to go. Thanks. All right, now I've got to do one more thing. It's just a quick thanks to everyone who has been part of this. Obviously to everyone in the audience here; Microsoft, our presenting sponsor; AWS, our diamond innovation partner; Neo4j and Braintrust, who curated our evals and graph tracks; all of our platinum sponsors; and all of our sponsors in the expo and beyond. And of course swyx, the executive producer and program curator of this event. It takes a hell of a lot of work to do that. Leah McBride, our senior producer, has been with us since our very first event in October of 2023, and she really helps make this run. And also our new team members, Melissa Billy and Scott Dilap. So many others, including a special shout-out to Vincent Wendy, who did all of these incredible graphics. Everything you see, that was him. Okay, he didn't do the animation, I'll get to them in a minute, but he did all of that. Incredible working with him. VCI Events have been running everything on this floor and the Golden Gate Ballroom level. Freeman, all the graphics you see were them. Art and Display did that beautiful expo in there; it's like a little Santa's village, you feel like you're in a little mini city. Incredible, that was them. Encore helped run AV up on the second floor. Local 16 helps operate everything. So, really big thanks to them; they've all been so incredible. Motion Agency did all of the motion graphics you see here. They're based in Asia, but they worked some of their hours on Pacific time for some last-minute stuff. Suno, I love working with Suno; it just doesn't miss, every time, and they produce music just from text.
The Marriott Marquis, thank you so much. Max Video Productions for B-roll. Randall for photography. Brad Westfall, and swyx, our web developer; come on, how does he actually do it? And Haley Holmes, our incredible show caller. Thank you so much. And all the speakers, of course; they've been so incredible. Anyone in a yellow shirt you saw is a volunteer; they come here just to help out and be part of the event and the excitement. So we thank all of them. We can't run it without you all. So, thank you so much. And then lastly, I'd like to welcome on stage the absolutely hilarious, absolutely wonderful Laurie Voss, our MC. Thank you so much. Can we give him a big round of applause, everyone? I keep telling him this, and I keep telling everyone this: did his jokes land or what? They were actually really good. They weren't just dad jokes. So, I really appreciate that, and thank you so much. With that, that should do it for the show. Thank you all for staying; the last few of you who stayed for this, we really appreciate it. Thank you so much for coming out.
[Music] [Applause]