AIE Singapore Day 1 ft. Minister, NanoClaw, OpenAI, Google, Vercel, Cursor & more
By AI Engineer
Summary
Topics Covered
- Real AI Value Lies with Ordinary People Empowered by Tools
- The Barriers Have Fallen: My Raspberry Pi AI Agent
- Never Give Agents Secrets—Instructions and DLP Cannot Prevent Leaks
Full Transcript
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, Hey, hey,
hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey hey.
Heat. Heat.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey hey hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey, Hey, hey, hey hey.
Hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, Hey, hey,
hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, Hey,
hey, hey.
Hey, hey, hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
joining at 8:40 a.m. on a Saturday um for day two of AI Engineer Singapore. Uh
just as a way of introduction, I'm Sherry and I'm one of the members of 65 Labs. Uh we're one of the largest
Labs. Uh we're one of the largest grassroots builder collective um here in Singapore. Uh an article actually
Singapore. Uh an article actually recently came out about us this week. Um
it's just a few of us who are actually doing all of this in our spare times. Uh
we all have full-time jobs, but you know, this is something that we are all very very passionate about to bring to Singapore. So this conference really is
Singapore. So this conference really is our love letter to the ecosystem.
Now, somewhere along the way of all these hackathons and bill nights we are running, um something magical really happened. Um some of these frontier AI
happened. Um some of these frontier AI teams that are here today building the models that you use every day, uh started to show up for our community. Um
they're giving credits for our hackathons. uh even showing up late on
hackathons. uh even showing up late on Zoom to do workshops for people and you know we've really had all sorts of people um being supported from as young as 13 years old to folks who are in
their 60s just learning all of this as well. Um it really is just a great time
well. Um it really is just a great time to kind of come together and be a builder. But what we got was more than
builder. But what we got was more than just a few credits. Uh we started actually to build a relationship uh with some of these teams and uh that is the magic that we see in this room today.
Now, you might think that this is the first time all this is happening in Singapore, but it's really been happening under the surface for a long time before we gap gathered at the Capitol Theater here today. Um, so
that's why it was no surprise when some of our speakers actually told us their entire plane from San Francisco to Singapore was actually just full of folks uh coming down for AIE.
So, how did this conference actually happen? Um so uh we actually met Swix uh
happen? Um so uh we actually met Swix uh who is the CEO and uh co-founder of uh AIE Globally. Um met him actually in New
AIE Globally. Um met him actually in New York City and uh I don't know if you guys know but he's actually from Singapore originally. So it just clicked
Singapore originally. So it just clicked and made sense. Um we've been working with a lot of these teams uh remotely and we just wanted to bring them in person in Singapore all together for the
first time. So to hear a little bit more
first time. So to hear a little bit more about the AIE story, uh, Swix will be, uh, speaking about cognition, but also be closing out day one to hear, uh, to share a little bit more about the AIE story.
All right. Um, now, show of hands. I'm
kind of curious who was all at the workshops yesterday.
Woo. All right. That is pretty much like 98% of the crowd. Um that's awesome to see because that was a laptop open day and that's what we wanted to do differently about this conference where
we're not just yapping about stuff but we're you know building and uh apping in a way. So we wanted to make sure that
a way. So we wanted to make sure that you know all of this is designed for for practical knowledge right and uh yesterday uh just so you know we actually had 20 workshops running five
rooms concurrently and an entire leadership track as well. So, um, we really wanted to put programming at the absolute heart of everything here. Um,
so we're all learning and building alongside each other.
And because we want to create this place for learning, um, we also wanted to give the opportunity for the next generation to get this experience. So, uh, we were actually able to come together as a
community to support 20 students who are actually scholars here today at the conference. So, um, would you stand up
conference. So, um, would you stand up and wave?
So, every one of these tickets were actually partially or fully sponsored uh by builders in our community um who believe that this is what Singapore's AI
future looks like. So, what can we expect over the next two days? uh we
didn't just you know string together a bunch of talks to you know keep you guys here for lunch and things like that but we wanted to bring you the kind of conversations that you can't just Google or find in Corsera but actually be in
the thick of these conversations that are happening with the people who are actually building with these tools whether it's on Twitter or research papers and all of that so we want to bring these conversations so you can be
in the middle of them as well and we know that there's a lot going on uh there's over 60 talks actually within the next two days across across three themes and you may need some help kind
of navigating and figuring things out.
So, uh we actually have a guide that we sent out an email that has a map that we actually vibe coded as well as a complete program list as well. Uh not
only that, we created an API that you can actually use that's publicly available. So, you can actually build
available. So, you can actually build your own tool on top of the program because we didn't just want to create like a one-sizefits-all app that you download and you use. We wanted to create something that you can actually
build for yourself because that's kind of the spirit of what we do.
So, AIE today and tomorrow is going to be spread across different spaces. Uh,
this theater right here in Capitol Theater is going to be where all the talks are going to be happening. And
once you kind of get that inspiration for like, you know what, I really want to go talk to this team. That was super interesting. We have two expo areas set
interesting. We have two expo areas set up. uh one is going to be in Pullman
up. uh one is going to be in Pullman across the street and the other is the Attelier in Kinsky. And these are again not just booths that people are just setting up um just because but we
actually made this a curated space where you can actually have face-to-face conversations with the very people who are building the tools that you are using.
And then finally, we also wanted to make sure that we give you space to uh you know decompress and touch grass as well because it's a long two days. So, uh,
we're happy to share that we're going to have a 15-minute break session where there's going to be an experiential space, uh, called the cave, which is a re immersive sound reactive
decompression room that was actually entirely, uh, vibecoded by the creator.
You'll also find a lot of folks running around with red shirts. Um, these are our amazing volunteers who will be helping you navigate the spaces and make sure that you get the most out of every
conversation happening here today.
And we'll not be in this room without our incredible sponsors. So, um, our diamond sponsors are OpenAI and ZAI. And
our platinum sponsors are Google Deep Mind, Arise, and Cursor. And thank you for the uh thank you to the Capitol Theater for providing us with this
beautiful space.
Now the Singapore story has always started with the builders. Uh which is why a few weeks ago uh it really blew our minds when um our very own minister
for foreign affairs Dr. Vivian Bala Krishnan uh went viral for a post on Twitter for building his own second brain. uh and it makes sense because he
brain. uh and it makes sense because he has a role that demands navigating large volumes of information and rapid context switching. So his reflections on
switching. So his reflections on building this kind of workflow and tool for him really underscores that meaningful conversation about AI should involve understanding the tools
themselves and not just thinking about you know the the abstracts of it.
Um, and with that, uh, I am absolutely honored to introduce our keynote speaker and a builder himself, Singapore's Minister for Foreign Affairs, Dr. Vivian
Balakrishnan.
Use this.
Hi, good morning everyone.
You know, we can be a bit more informal in Singapore. So, good morning. I know
in Singapore. So, good morning. I know
it's raining, but Singapore's usually sunny. Um, I feel like an impostor here.
sunny. Um, I feel like an impostor here.
Uh, for those of you who don't know me, I'm actually a retired eye surgeon. Took
a detour into politics for perhaps too long. Um, but I've always retained an
long. Um, but I've always retained an interest in getting things done, building things, fixing things. And
since I don't get to operate on eyes anymore, uh I assemble watches, I reprogram appliances,
and now there's some other stuff which is this which is what I'm going to talk about today. But actually, I wanted u to
about today. But actually, I wanted u to explain why I did it and the implications of this. And I think with this audience, you'll get it straight
away. But let me jump to the end.
away. But let me jump to the end.
um and to say these are the three key messages which you can forget everything I've said but just bear these things in mind. We're now at an age when you can
mind. We're now at an age when you can outsource a lot of stuff calculations computation memory replication
dissemination of knowledge. The one
thing which you cannot outsource is your personal understanding. And if you are
personal understanding. And if you are in a position of authority, you can delegate work. You can't
delegate accountability. So remember the personal element in that understanding and accountability.
The next point and I would refer you to a nice short letter published in the Financial Times by Professor Neil Lawrence, University
of Cambridge. She's the professor of
of Cambridge. She's the professor of deep of machine learning.
And you know there's a lot of hype about AI models, data centers,
top-down systems, rules, governments.
That's macro.
But his hypothesis is that real value for the economy and society is created
at the ground level workflow by workflow sector by sector department by department and in fact at the individual
level. What this means is that it's
level. What this means is that it's look, I know you guys are great and I know the guys working on frontier models
are incredible, but the real payoff is when ordinary people, teachers, lawyers technicians managers
doctors, lawyers or even ministers are actually using the tools which are already available, already invented.
people who know their jobs and are empowered by these tools. That's how you create real value for society and for
the economy. So I'm looking at
the economy. So I'm looking at decentralization individualization bespoke models. I'm talking about making
bespoke models. I'm talking about making yourself better at what your day job is and even better still re-engineering the
workflows of your life. That's where the real value boost is. And the third takeaway and that's why I'm making this
presentation is that I sincerely believe the barriers for achieving all this have collapsed. The tools have already been
collapsed. The tools have already been availed.
It's a matter of getting people to understand what tools are out there, assemble their own tools and put ourselves on a completely different
trajectory. Okay. So now let's do the
trajectory. Okay. So now let's do the fun part as to what my adventures began. Now my personal agent first came
began. Now my personal agent first came to life almost exactly three months ago.
Uh yes, I got caught up by the open claw uh hype but immediately given my job I knew that was not practical because security was an issue. And then some
someone else then pointed to nanoclaw and I think we are going to hear from Gabriel after this where
you know and as a geek and as a tinkerer myself I like stuff which I can grasp.
So the fact that nanoclaw has a very short code base which even an idiot like me can read and sort of understand the
fact that it's containerized and as a surgeon I know that there's no such thing as a routine operation and things will go wrong things will break
and when they do break hopefully you want them to break within barriers. So
the containerization part, the understandability part was vital for me.
Anyway, simple go to GitHub, download the stuff. And the other attractive part
the stuff. And the other attractive part about it is there no configs.
There's in fact because you rely on LLM to do all the
bespoke tailor customizations.
In fact, you realize everyone running an instance of nanoclaw is running an individualized system. Now, that's both
individualized system. Now, that's both good and also has its share of complications. But anyway, so let me
complications. But anyway, so let me tell you what I did with it, right? So,
Nano Claw provides the platform. It
allows me to communicate through WhatsApp with my agent. That part's not rocket science.
The thing which if I could go back one slide the thing which I was really after was how could I use it for my daily life. Let me give you an idea of my
life. Let me give you an idea of my daily life. This month I'm visiting 12
daily life. This month I'm visiting 12 countries.
I have I will therefore have to meet hundreds of people. I will have to understand the country's economy, geography,
culture history.
war and peace.
I need to know people as individuals and not just something I from a brief and there's a huge huge cognitive overload
on every single diplomat. And the
question is how can I turbocharge this process so that if I need a fact or a factoid
I can get it I can get it anywhere and I can go down the rabbit hole if need be.
So it's got to do with this whole overload.
The LLMs are useful for analysis, for abstraction,
for expression and certainly for drafting briefs, drafting speeches, formulating answers to questions in including, I must add, parliamentary
questions. and three months ago which
questions. and three months ago which includes the whole debates in in parliament. Uh it was extremely
parliament. Uh it was extremely impressive to see both the questions and the answers which generated and uh with due respect to all my colleagues in
parliament uh some of the AI generated debates uh far more incisive shall I say.
But anyway so it communicates with me through WhatsApp. So there's this
through WhatsApp. So there's this bit of software called Bailey's. I
suspect it's probably uh not entirely in keeping with what uh Meta or WhatsApp would like us to do because it's actually simulating, you know, the way we get WhatsApp to
work in our browsers or in on our laptops. So it's it's a pseudo terminal
laptops. So it's it's a pseudo terminal in a sense.
Then the bit which I believe is the real frontier for people like me is memory
and fortunately I came across this obscure piece of software called Neman.
I still haven't met the developers so I don't really know but a memory system with graphs. So it's got entities.
The edges are entities, causality, temporal relationships, and semantic.
And also because I didn't want to be confined to just keyword searches.
The fact that I could run Olama locally with an embedding model means I also have semantic search built in. So with
these elements, I mean, whisper is the part that's easy because with WhatsApp, I didn't want to only have to type. I wanted to be able to speak and he can speak back to me.
And of course, my dream uh is one day to just have my agent answer supplementary questions in Parliament. I'm not sure about the legality of that, but if it
happens, you you'll know that I shared the idea with you first.
But the point is I was now able to curate material, speeches, transcripts, particularly of
my own contributions, get it into the system, digested, extracted, put into that memory database. And then
around the same time, Andre Kapati came up with his LLM supervised wiki generation. So I added that in as well.
generation. So I added that in as well.
And then for the UX, the user interface, I used Obsidian partly also because Obsidian allows me to use uh the Apple iCloud and that therefore immediately
means I've got a personal cloud and all the wikis which are extracted from this personally curated database becomes available to me because remember I
started off by saying the key is personal understanding.
So I've got a memory system, I've got a communication system, I've got an analysis system, but all nice in theory.
But what I here to share with you is that in last three months, I found it incredibly useful meeting people, traveling,
first drafts, first cut of a speech.
Even today's presentation uh even the slides actually were generated by claude you know it's turbocharged
the pace at which things can be done and as a practitioner so not as an engineer but as a practitioner with a day job it's useful
and I can attest to it usefulness because I can honestly tell you I have not dared to switch it off and Nano Claw unfortunately well has moved from
version one to version two when version two came on because their transition is not at all smooth I've left version one working and I put version two on another
computer and and I should also add all this stuff one of that my my most daily used agent is running off a Raspberry Pi
which is at least two or three years All it has is 8 GB of RAM. You you see my point about accessibility,
personalization, relevance, use. Let's let's go on to the
relevance, use. Let's let's go on to the next slide. And this is my point. The
next slide. And this is my point. The
barriers have fallen because I did this.
I did this without writing Claude, Bailey's, Neman, Whisper,
or the credentiing system.
You know, there's this whole thing about vibe coding. I won't even dare to claim
vibe coding. I won't even dare to claim I was vibe coding. I was just assembling tools.
you it's just tool assembly and so I I I should actually change that line I didn't write any glue I can honestly say yes I have gone through the
code you know the nanoclaw insists that you approve every time you give bash access to the agent so I do scan through it I
and it does help here it does help if you don't understand coding so you understand what's going on even if you're not
actually typing and editing code in the raw. Next,
raw. Next, in a sense, my approach to all this has been to learn by doing. It's not enough to sit down and read, get the headlines,
get the summaries done. If you're
interested in anything, get your hands wet.
Learn and you learn best by doing. And
because the barriers for entry have been have come down so dramatically, everyone should em embark on their personal on their personal exper
experiments.
And you know Claude came up with this quote which I got a bit suspicious about. You know who has said it before.
about. You know who has said it before.
says it. It claims no one else has. But
actually, I kind of agreed with it and this is a shout out to my government colleagues. You cannot govern a
colleagues. You cannot govern a technology that you have only been briefed on. You better get your hands
briefed on. You better get your hands dirty and then you understand both the potential and the limits and the
problems. a few other digressions down here. Um
down here. Um there are constraints.
So for instance depending on LLMs and quite frankly I mean I'm
the prices for which the um AI majors are currently charging us.
I think we all know we're enjoying in effect a subsidy. Tokens are not cheap.
Compute power is limited. Electricity
prices have risen. Wars do not help. And
we should beware of just trying to throw every problem and every step in a solution at an LLM. It reminds
me of the old proverb, you know, for a man with a hammer, everything looks like a nail. And in fact there are
a nail. And in fact there are good both economic and design advantages
so that you use LLMs but do not forget there is still a role for deterministic systems. There is still a role for
expert rulebased systems and my my personal belief as a biologist in the end some kind of neuro symbolic system
rather than just uh the LLM model and I have some sympathy for Yan Leon who says you know I think that LLMs are great but
actually that's not the way we've solved it in nature. If you look at the human brain, actually I suspect we have less layers
of computation in the human brain than in many of the large language models which we have today. And I can tell you
as an eye surgeon, the cortical computation for vision, for language,
for cognition are often based on far more efficient structures than the energy gobbling systems which we have today. The point I'm making and
where I'm agreeing with Yan Lun, you know, is that in the end these are pattern recognition systems with attention
with memory.
And out of what looks like simple fundamental abilities is emergent behavior which gives you
conceptual understanding which gives you language which gives you the ability to do things. So my point is this is a
do things. So my point is this is a field which is still exploding and therefore approach this with humility.
Approach this by just doing your best, improving the productivity of your daily job, but understand that actually we are perhaps one of the most privileged
generations to be living through a revolution.
Tools matter more than models. And I
think um Gab will know I've told him by June I think it's June the 15th I need nanoclaw
to make all models first class citizens.
Uh there are reasons for that which we can discuss later. And then finally memory.
It is a very human and I think it is the great unsolved part of this frontier. Next slide which
I think on security I'm not going to belabor this. Uh just as an aside even
belabor this. Uh just as an aside even if you hack my system uh the most thing you'll get from it is my phone number.
uh you will get summaries of foreign policy but since it's foreign policy which I have espoused and in any case I have curated the stuff I've put in uh
even if you take my system I think it will generate the foreign policy of Singapore anyway now that's one way of addressing security by making sure you
only put what is already open source what is already published and you subject your systems to a level of transparency and scrutiny
that can be withstood. But do not forget security remains paramount and in fact the complication to the dissemination of
AI is going to be commercial competition, national security, cyber security and the superpower
contestation. These are the political
contestation. These are the political factors that are going to affect the availability, the speed and the dissemination of AI of
the future. This again is a separate
the future. This again is a separate political talk well worth a deep dive.
And next slide and I hope that is my last slide. So the goals
last slide. So the goals I'm a believer in deployment at the edge.
I'm a surgeon.
I believe in doing. I believe in fixing.
I believe that's where lives are safe.
Value is created.
Second, therefore the public policy goal is the democratization of these tools. And that's why you will see in the economic strategy review
committee DPM gun said we are Singapore is not likely to be at the frontier of model development. But we can be at the
model development. But we can be at the frontier of deployment at scale. So
democratization and therefore if that's what we believe then it must be a decentralized groundup approach and that's why I'm here today
because I found out this or conference was organized less than three months ago. 65 labs.
ago. 65 labs.
All the people you meet here, this is all not even their day job. It's a hack, right? But this is the way this is the
right? But this is the way this is the way I believe the future is going to be created. So, thank you all for being
created. So, thank you all for being here. Thank you for part of this
here. Thank you for part of this journey. Have a wonderful day, a
journey. Have a wonderful day, a wonderful future. Thank you very much.
wonderful future. Thank you very much.
You should have given this.
Oh, I I should have worn this before. You
should have given it to me before. I
would have won.
I wasn't We weren't brief. But thank you so much. So much.
so much. So much.
Thank you.
All right.
You got to make an announcement, right?
I let her know that.
All right, everybody. And um I'm super excited to be introducing our next speaker, none other than the creator of
Nano Claw himself of Nanoco, Gabrielle Cohen.
Hi everybody. Really excited to be here.
Just getting things set up.
Just need your sites to load then it should try to go to hospital.
Can you put your mics and just can you m Wait, it's loading now.
It's gone. It's
getting there. There we go.
This one just got right here.
Hi everybody. I'm uh Gabriel Cohen and I created NanoClaw.
I have uh in my telegram right now a AI assistant that's connected to my emails, my calendar, uh connected to my call notes. Uh it has
access to sensitive information. It can
take sensitive action like reading my emails, sending out an invite. Uh at the end of this presentation, 15 minutes, I will give everybody here access to talk to it freely.
Um, and I can do that and I'm not crazy and that's not dangerous. And throughout this talk, I
dangerous. And throughout this talk, I want to explain to you a few concepts about Nano Claw that make that safe.
Um, and uh, to demonstrate those concepts, I'm going to talk about our um, agent factory that we built and along the way I'll share some things that I think are interesting about
choices we made while building it.
So first of all, NanoClaw is an open-source framework for building uh secure autonomous assistance or claw assistance. Um in just three months, we
assistance. Um in just three months, we have over 30,000 stars on GitHub and uh many thousands of users all over the world including uh Dr. Vivian
Balakrishna, foreign minister of Singapore.
Um, more importantly though than stars on GitHub, over 12,000 people have forked the repository and that's the main way people are using it. They're
forking it, experimenting with it and making their own autonomous agent based on nanoflow.
Uh, together with that we have over two and a half thousand uh pull requests and issues. So maintaining an open source
issues. So maintaining an open source project today, there's never been a better time to build open source projects. At the same time, there are
projects. At the same time, there are new challenges with uh coding agents.
It's easier than ever to open a poll request. Um, and many people, many
request. Um, and many people, many thousands of people are making great contributions to the project. Uh, but
there are frankly also spam pull requests. People will point their coding
requests. People will point their coding agent at a repo and say, "Contribute something here."
something here." It is very difficult to tell the difference between a spam pull request and a good pull request today. They look
the same. They can have similar amounts of uh code and telling the difference comes down to a deep understanding of the project and the direction of the project, the vision.
So, we built to help us sort through these poll requests, we built an agent factory uh that helps us review every single contribution. Uh this is our
single contribution. Uh this is our agent factory. It's in our Slack. It's
agent factory. It's in our Slack. It's
hosted on an xie.dev uh virtual machine.
Um, every single PR that's open in GitHub fires a uh web hook um that creates a new thread in our Slack. A
review agent first triages and then does an in-depth review.
Uh it then gets passed on to testing first creates a testing plan uh for in-depth testing, real life testing, not just automated tests.
Um and then once we approve the plan, it get a new VM is spun up. It goes through a whole series of tests and then uh once it's done we can merge
it directly within the factory and it goes live.
So half of you are probably looking at this going amazing I want to build a factory like this myself. The other half of you are thinking about the security implications and going this is crazy.
This is reckless. It's unsafe. Pull
requests of course are unsanitized uh input, right? Anybody can open a pull
input, right? Anybody can open a pull request. Anybody can put things in
request. Anybody can put things in there. Uh you can't really sanitize a
there. Uh you can't really sanitize a pull request because I don't want to remove information from it. There's
going to be false positives and everything. You can imagine a pull
everything. You can imagine a pull request that's open to harden for security to defend against prompt injections. It would trigger any kind of
injections. It would trigger any kind of um detection. So this goes way beyond
um detection. So this goes way beyond lethal trifecta. and um our f our
lethal trifecta. and um our f our workers, our agents in the factory are taking very sensitive actions. They're
spinning up VMs. They're merging uh pull requests.
So, how can we prevent our agents from being prompt injected?
You obviously can't do this, right? If
you go into a codebase and you see at the top of the cloud. MD uh never run drop database production. So that tells you two things about that agent. It
tells you that that agent has deleted a production database before and it tells you that the agent can still do it if they put that instruction
there. So it still has that ability. Uh
there. So it still has that ability. Uh
instructions are not for security.
They're not for safety. The instructions
are for steering your agent towards producing valuable uh high quality output towards the direction you want it to to uh to to to
go for you.
So how do we deal with these kinds of risks with nanoclaw? So we think about our agents as if they're operating in enemy territory behind enemy lines
because they're being they're in contact with the enemy, right? the somebody
who's potentially a malicious actor who's trying to work against you and get your agent to work against you. So, if
you think about a map of a conflict, uh you have the red zone and the blue zone and the blue zone is our side, the red zone is the other side. Agents are
operating in the red zone and at any moment they could be turned into a double agent.
So, we don't trust our agents and nanoclaw agents are not considered trusted. Instead, they're isolated.
trusted. Instead, they're isolated.
So this is a simplified version of the nano claw architecture you have on the left side slack or whatever messaging app you send the message there it goes to slack servers and then it gets sent
to wherever your nano claw is running in this case say a VM there's a slack bridge which connects to the slack server with a socket or web hook every message gets sent to the slack bridge
and then from there through a router and pushed to the agent the agent respond responds. It's uh
produces some output that's sent back through the router back to Slack bridge, Slack server, and appears in your messaging app as a response from the agent from your, you know, Slack bot or whatnot. Uh but the agent is potentially
whatnot. Uh but the agent is potentially compromised. It's operating in the red
compromised. It's operating in the red zone.
So, anything that the agent can touch is potentially compromised. If the agent
potentially compromised. If the agent can access the router, if the agent can access the Slack bridge, it can manipulate those and change what messages it has access to, uh, and who
it's able to send messages to.
So rather than letting the agent access anything in the VM that it's running in or anything in the environment it's running in, we isolate the agent and put it within the VM within another
isolation layer. In our case, normally
isolation layer. In our case, normally we put it within a container. Now, the
container limits the blast radius. We
control what goes in, what goes out, and what happens with the things that are coming out. So, the agent isn't directly
coming out. So, the agent isn't directly connected to a messaging channel. That
already does a lot to limit the blast radius, but in order for our agent to access the outside world, uh it needs to have credentials. If it wants to connect
have credentials. If it wants to connect to services, whether it's GitHub or um or your calendar, and that could be using CLIs, APIs,
MTPs, it doesn't matter. It needs some form of credential.
So the second principle, first principle isolation. Second principle is keep
isolation. Second principle is keep credentials outside of the agent environment. The agent's environment is
environment. The agent's environment is enemy territory. You don't want to put
enemy territory. You don't want to put anything that's highly sensitive in there. Definitely not uh secrets and
there. Definitely not uh secrets and credentials.
The only way to ensure the agent will not leak a credential, it can't be done through instructions. It can't really be
through instructions. It can't really be done through uh DLP or analyzing outputs. The agent can circumvent that
outputs. The agent can circumvent that as well. The only way to prevent it from
as well. The only way to prevent it from leaking a secret is to not give it a secret. So the way that we let it talk
secret. So the way that we let it talk to external services that are credentialed without giving it credentials, we add between the agents request a proxy. We give the agent a
vault. We partnered with a really great
vault. We partnered with a really great open source project on this uh called one CLI. Every request leaving the agent
one CLI. Every request leaving the agent sandbox is proxied through the vault and then we check the request and decide if we should add credentials. The request
leaves the vault with no credentials with literally authorization bearer placeholder. Literally the word
placeholder. Literally the word placeholder at the vault. The
placeholder is replaced with a real credentials if the agent is supposed to have access to that resource.
But isolating the agent and giving it this proxied credentials isn't enough because if someone is talking directly with my agent, even if my agent doesn't hold the key, if it can take the sensitive action and you can manipulate
it and prompt inject it, you can get it to take sensitive actions for you. So
maybe you can't get my GitHub access token, but you can potentially get it to add you as a code owner.
So we need to have another layer of policies, not just rubber stamping letting every request through, but adding policies of what the agent can and cannot access.
The most flexible policy for the most sensitive actions is human in the loop approval. And what that looks like is at
approval. And what that looks like is at the level uh where we're enforcing policies, we can have a policy set. This
requires human approval.
A request is then sent not from the agent but from the vault or from the uh the router or the delivery part of uh NanoClaw.
That message gets sent through the router to the Slack bridge and shows up in your messaging app as a permission request from the agent. Now this is actually an illus illusion. This was in
the video before and it looks like the agent is requesting your approval and then you give the agent your approval and then it goes ahead and merges your PR for you. None of that is happening.
The agent can't request approval and the agent doesn't actually have credentials to merge. Instead, the agent is trying
to merge. Instead, the agent is trying to make a request using an MCP where it writes out the the command that it wants to run in GH with the GitHub CLI. And
then we display that to you as if it's a message, as if it's a request coming from the agent, but that's actually coming from the Nano host process. Once
you approve, the merge is actually done not at the level of the agent, but outside of the agent's environment.
And that same uh pattern can be used to do any kind of sensitive action.
Initiate a wire transfer for example.
The most sensitive actions you need to separate the tool call from the tool execution. The tool call happens within
execution. The tool call happens within the agent's environment. Within the red zone, it leaves the red zone and outside of the agents environment, you then
enforce policies and implement the action if it meets your policies, including human approval.
One interesting pattern that's emerged that we found in our agent factory is that we have multiple different people over reviewing and h uh providing
oversight over the reviews, the plans, uh the triage. Whoever presses the button to approve or to send it to testing, it uses their credentials. So
you won't see in our GitHub any PRs being merged by a nano claw agent. I'm
the one who presses the button. means
I'm proving that this is correct. I'm
taking responsibility for it and it's done with my credentials.
So this is uh what our factory looks like. Another interesting thing is so
like. Another interesting thing is so you can see here we have the uh slack app uh connects to the slack bridge. We
have multiple different bots and then each of those bots get routed to a different nano agent. Each nano agent runs in its own container. So nano claw by default by design is multi- aent and
can be multi-user multi-tenant.
Now when the test plan is approved that isn't running automated testing what happens is we have a test uh
orchestrator that creates a new VM checks out the branch for that GitHub pull request in the VM. Our test agent then SSHes into the VM, runs the Nano
instance and starts poking and proddding the agent, sending them a message in Telegram, getting a response, real life testing, and then also is able to check databases and logs to verify behind the
scenes that what you're expecting to happen does happen.
Uh, another last interesting pattern is that each of the agents in the Slack thread has a persistent environment and a persistent session. You can come to them at any time, tag any one of the different agents. We have a testing
different agents. We have a testing agent, reviewing agent, uh, and give them direction, ask a follow-up question, uh, change the depth of
testing like you see here. We also have this ability to tag a supervisor and give feedback.
Uh, you feel a little bit like a Karen, if anyone knows the meme, can I talk to your supervisor? You leave feedback and
your supervisor? You leave feedback and then um the supervisor can suggest changes to the instructions and to the skills based on that feedback
and then once we approve those changes they get implemented. So our factory is improving itself essentially.
So as promised that QR code if you scan it I have my agent in Telegram. It has
access to my emails to my calendar uh to my drive. Uh but I feel safe giving you all access because this agent doesn't have any credentials in its
environment. It's isolated. I control
environment. It's isolated. I control
what goes into its environment and what comes out. And there are human approvals
comes out. And there are human approvals on every action. So that's connected to my calendar. I'll be here all day. I'd
my calendar. I'll be here all day. I'd
love to grab coffee with some people building interesting things in the space. Uh, talk to it. I told it to be a
space. Uh, talk to it. I told it to be a little bit protective over my time. I
hope it's not mean. Um, but if you talk to it and tell it what you're building, uh, hopefully it will schedule a coffee chat for you with me. Thank you.
All right. Um I'm super excited to introduce uh our next speaker. Uh this
is Tibo who is the head of codeex at OpenAI. Now Tibo uh unfortunately could
OpenAI. Now Tibo uh unfortunately could not make it in person today. Uh but he wanted to do the talk because it means a lot for him. So he will explain uh when he when it's uh when he's up on the
screen which I think he's there. Uh but
the other thing that we'll be doing which is super cool is uh Tibo is excited to speak with some of the students for a Q&A. So, uh, let's give, uh, TBO a warm welcome.
Hi, everyone.
Um, glad to be here.
I would have loved to be in person. It's
just really incredibly exciting to see the room so packed.
Uh, Singapore has such a unique energy and I'm excited to chat to you all from San Francisco.
I feel really proud to say that San Franc Singapore is actually the top in the top five countries globally for codex adoption and engagement. Uh it rose
there fast. Uh it feels like Singapore
there fast. Uh it feels like Singapore is just adopting new technologies and an unprecedented rate. Uh our mission
unprecedented rate. Uh our mission overall is to deliver the benefits of AGI to all of humanity. And I believe that in the coming months we'll make
such incredible progress towards making AI deeply valuable to each and everyone in the world. We started with Chat GPT
and with Codex we have focused on builders and developers.
You might know Codex as this little app, but for us it is our frontier agent. And
I'm going to talk a little bit about what agents have done to software development and the whole life cycle.
I don't have to tell this room, but software development obviously is unrecognizable compared to two years ago, even six months ago.
New models are capable of full agentic delegation or examples like we saw with nanoclaw where you have a full autonomous system just doing stuff for
you uh going far beyond programming. You
just give it a job. It works on the task the codebase perhaps for hours independently sometimes a full day until the job is done.
From the beginning, that's been our goal to build an AI teammate you can delegate to.
A useful way to think about the SDLC and building things is to think about it as a throughput problem.
For decades, the software development life cycle was designed around one core assumption.
Code is hard to write. That assumption
shaped really everything around it. We
planned heavily because engineering time was scarce. We reviewed every line
was scarce. We reviewed every line carefully because code was expensive to get wrong. We built delivery systems
get wrong. We built delivery systems around the idea that the build step was the narrowest part of the pipe.
Aentic coding has really changed this assumption.
It has dramatically widened the belt section of the pipe.
But if the rest stays narrow, the total throughput does not actually increase.
The constraint moves into the systems around the build step planning reviewing validation CI security release operations debugging and even learning
and understanding of what's actually happening which is a big part of the new bottleneck.
That shift is something that everyone here needs to understand. The
opportunity was not just to generate more code faster, but it is to redesign the entirety of how we do engineering and how we can increase the overall
throughput of what we deliver together.
The first wave of AI coding just really expanded this build phase. We were all very excited to be able to write a lot of code faster. That matters. It means
engineers can generate, modify, and test code at a speed never seen before.
But as we said earlier, just widening the build section does not increase total throughput. The next step was
total throughput. The next step was really to look at expanding capacity across the full software delivery life cycle. This is how we think about
cycle. This is how we think about codeex, our agent. It's not just a coding assistant, but an agent that can work across the full layer of building
software. In the build step, critics can
software. In the build step, critics can help engineers delegate implementation work. In review, critics can help
work. In review, critics can help inspect changes, surface issues, support human review. In deploy and operations,
human review. In deploy and operations, cloud agents and automations can help teams respond to triggers, investigate issues, and keep work moving through the
system at unprecedented pace.
The goal is not to remove humans from the process. The goal is really to make
the process. The goal is really to make every stage more scalable. So higher
code output can actually become more shipped value.
This is a key distinction. Agentic
coding increases code velocity but agents like codeex help organizations expand the system around that velocity.
So there are like these different steps here and we can see that you can use an agent to increase the velocity of planning, velocity of building,
velocity of reviewing and even the velocity of deploying. If you think about it, planning, building and reviewing is a little bit easier because you don't really have a side effect upon
the world. And deploy is when you know
the world. And deploy is when you know security starts to really matter as you're having an actual impact and the code gets actually deployed out there and meets your users where they are. We
have automations for this. We allow to build around the agency as well. And
then we have a version of our cloud agents um which has secure can have secure access through our plug-in system and allow you to deploy and verify that the deploys are correct through human
approvals.
This is a journey we started a long time ago. The Codex team is special in a
ago. The Codex team is special in a sense that we designed both the agent and the models to power those agents and we work deeply within the research in order to advance the state-of-the-art
for our models. This started with a model GPT51codex max which is now famous for its name which we released at the end of 2025.
It was trained on end to end RL for compaction for longunning tasks. This
means that in its environment during RL, we were exercising tasks that would challenge the model to work well beyond its context window. And at the end of
its context window, it would need to delegate to itself in order to achieve a task across many context windows of inference. We also shipped a high
inference. We also shipped a high reasoning effort. We trained it to
reasoning effort. We trained it to operate natively on Windows and we showed that we could achieve better performance with 30% fewer thinking tokens and achieving a new
state-of-the-art token efficiency. This
is a theme that will continue and that we've seen across every other model ship. The token efficiency just gets
ship. The token efficiency just gets better and better and better, which both makes it faster and cheaper to run agents over time. With 52, we increased cyber security capabilities, which is
really a precursor to what we've seen now with models with unprecedented unprecedented capabilities around cyber.
We've improved performance across large code changes, but also we added vision capabilities. We're not just building a
capabilities. We're not just building a texttoext model. We're building an
texttoext model. We're building an everything agent. With 53, we made it
everything agent. With 53, we made it faster. With 54, we added 1 million
faster. With 54, we added 1 million context window. And 55 has been our
context window. And 55 has been our biggest step change so far. Even though
it seems at face value, it's only a little incremental 0.1 uh improvement from 54 to 55. It was actually a much bigger change. We added computer use and
bigger change. We added computer use and we made it even more token efficient. It
is really the smartest and fastest model available out there today.
What makes it work though? What makes it work though is not just the model. It is
a combination of the model and its hardness. This is why Codex is special.
hardness. This is why Codex is special.
We're capable of like co-designing these things and making the harness really optimized for the model and the model optimized for the harness. It allows us
to deliver a new class of intelligence very broadly and very efficiently.
Five was released only a few weeks ago and we saw revenue grow more than two times faster than any prior release.
People really love it.
And we've seen adoption really go wild.
You can see here that it set a new industry high on SweetBench Pro. We also
achieved a new soda on terminal bench.
It seems like we're just pushing the frontier model after model, model after model, and we're now shipping at a cadence of roughly one model a month.
We did all this while also delivering unprecedented reliability.
And this was no no short feat. Really
the level of engineering and infrastructure improvements that we needed to deliver here were started a year ago which allowed us to scale with
unprecedented demand. Usage exploded.
unprecedented demand. Usage exploded.
We're serving 55 at a level of traffic that um makes me fail at times. like we have like an amazing
at times. like we have like an amazing team of engineers and own callers and it's also one thing that is rarely talked about is how efficient our models are and this allows us to offer like
just really generous limits across the plans.
We've achieved like nine nine three nines of availability which I'm very proud of. um
proud of. um all while scaling and being used across hundreds of companies. We've now over four million and approaching soon 5
million weekly active users. And it's
never been a better time to get started.
A lot of engineers write more code.
We've already talked about that. But
what we haven't talked about is that inside of OpenAI just really everyone, everyone I see, everyone I talk to uses Codeex for literally everything, not just for engineering.
We're seeing the marketing department use it. We're seeing finance raising uh
use it. We're seeing finance raising uh rounds of incredible fundraisers using Codex to coordinate it all. It has
become this everything agent. And
because we were building Codex using Codex, we've never built faster. We've
released an extraordinary amount of features this year. Team configuration,
new models, Codex for Windows. The Codex
app itself is only 3 months old, which still shocks me when I think about it.
We've released fast mode. We released
auto review as well, which is one of my favorite features. When you think about
favorite features. When you think about agents and security and safety, one thing that is often overlooked is that approvals and human approvals are something that leads to fatigue and
mistakes over time. If you have to go and like verify everything that your agent is doing and like thinking hard about whether you want to approve it or not, then you're bound to make a mistake at some point and give it too much
access or allow it to do something or merge a PR or worse send some information off somewhere that you shouldn't have done. This is going to be true as we continue to scale and you
have many more agents working for you.
Auto review is a new system which introduces a second agent which verifies the actions of the first agent and it verifies them against the original
intent of your task. So if you say hey go and check my important emails for example and pull the last three that you know are specific to the goals that I
have set today. Then auto review will understand that this is your intent and verify each action from the main agent against that intent. Anything that is suspicious or high risk and other line
with that intent will get blocked and the main agent will get redirected to try and do something else. This is very important because it allows you to preserve the human attention and not
fatigue you with unnecessary approvals.
This is the default now within OpenAI and it has reduced approvals by a factor of 20.
We're seeing gains across the company across much more than coding.
There's a bunch of pillars that we're investing automating more deeper enterprise controls, leading on models and the overall developer experience.
I'm really proud of how polished the application has been that we have shipped and how delightful the experience is. I invite you to all try it. It's really a different way to
try it. It's really a different way to interact with agents and over time we're going to evolve it into the cockpit for every agent that you manage.
CODC unlocks so much for builders but also for almost everything.
We're seeing really incredible use cases for even nontechnical people. This is
Rowan's mom just experiencing the magic of image genen 2 for the first time in chachi and she's longtime recruiter.
She really needed to do a bunch of things across uh managing her resumes and she wanted to go back into recruiting. We showed
her codeex and she just immediately got it. There's new
ways of interacting with agents that is really going to come to everyone.
We don't think agents are just for technical people. There are different
technical people. There are different challenges when you think about bringing agents to the world where you just really need to preserve the magic while
also making it safe and secure.
But we think this is going to come to the world very soon and it's not going to be just enabling engineers and technical people to become more effective.
We're linking our agents to the entire world. We have plugins for almost
world. We have plugins for almost everything. We also are working on
everything. We also are working on memory systems. We're working on new models. You can set up automation so
models. You can set up automation so they can run on specific on a specific schedule, maybe every few hours to give you a report. And really what we're
starting to see is that the models are so reliable at doing complex tasks that it's just really a question of what is the context and what is the access that
you give those models. And this is really what is capping the potential right now. It's like how much access to
right now. It's like how much access to the world these models have.
We're seeing great success um in different areas. Let me make this concrete with one example of one of the world's most advanced engineering organizations.
C Limited, one of APAC's latest, largest digital platforms and a major open customer. C has gone allin with Codex.
customer. C has gone allin with Codex.
It's rolled it out across its entire developer organization and its chief product officer shared with us that Codex just goes beyond coding and feels pretty magical. We're excited to have
pretty magical. We're excited to have the first regional Codex hackathons here at C starting on the 6th of June. right
here in Singapore. I'd love for you all to join and do check it out online.
We rolled it out to also 45,000 employees across Nvidia.
Um, we did it in only two weeks. Codex
helped itself with the deployment within Nvidia and this is a trend that we're seeing. We're just using agents to
seeing. We're just using agents to accelerate everything including the deployment and development of Codex itself.
What's special about Codex is that it's it's entirely open source. You can read the code of the harness uh just on GitHub. It's under the the Codex repo.
GitHub. It's under the the Codex repo.
Uh you can also bring it anywhere. We
now just released remote control through the chat GBT app. So you can have it run on a Raspberry Pi, you can have it run on a Mac Mini, you can have it run on your laptop and then fully control it
over a secure connection uh straight from your app.
You can also, the thing that's pretty magical that I love doing is using the plugins for browser use or computer use and allow it to just use and navigate across your computer, but like use this
little command, this little remote control that you just have on your phone. And I think this is something
phone. And I think this is something that we will soon realize is that agents will have a certain permanence to it and we will just really start to consider them to be like these little entities in
the cloud that we can reach from all sorts of different clients. be it on the web, be it through a desktop app, be it through a client. Eventually, you'll
just pick up your phone and talk to your agent and it will still be able to do things for you and have access to everything in your life.
We also ship fast and we we fix fast. Uh
we we don't we're not shy of sometimes making mistakes and um resetting some uh rate limits when we get things wrong.
One thing that's cool as well is Peter is working with me. He's the original creator of OpenClaw. We also support this as an open source project. We
recently worked on rewriting the core of open claw to be based on the same foundation as codeex. So it actually runs the codex agent under the hood.
You can read about it uh on on the open source repo. Again like all of this code
source repo. Again like all of this code is open source and we really want to contribute to like this new generation of inventions by just showing how you
can do these things in a simple way. Um
we're taking safety first. We're also
thinking a lot about security. We
innovated on Windows sandboxing. We
publish a lot about this on our blog posts. You can read all about the
posts. You can read all about the Windows sandboxing there. And we're
trying to solve the hard problems at the product layer as well. And in the future, we hope to bring agents to the scale of all of Chadbt, which almost now has a billion users.
There's a lot of things that I'm excited about, but here are some that we're really working on hard. We're working on new memory systems. We shipped Chronicle which is an experimental in research
preview which allows your agent to just follow everything that you've done on your screen and form memories from it so that you know it knows what you did last week. It knows what you did it did you
week. It knows what you did it did you did during the day and it gets a lot more contextual. We're working on
more contextual. We're working on massive multi- aent systems with the next generation of models. We think this will be quite groundbreaking and a new
way in a new scaling paradigm. Um, and
then we're working on new ways of handling tools which I'm excited to share on more in the future.
I heard that several of the billers in this room wanted um to ask a few questions and unfortunately I'm not able to hear the questions live but we've
compiled a few of the questions and I would love to go through two of them.
Um, here's a question from Louis. The DevX
on the Codex app is the best I've seen.
Project organization, the div one tap PRs. It has changed how I build. As
PRs. It has changed how I build. As
agents get more capable and the user base broadens beyond developers, how do you think about the interface layer?
Chat feels like a default we inherited from LLMs. It is actually the right model for how humans should work with agents long term. What does that evolution looks like to you?
I think this is very interesting and initially we just really inherited this thing where we were powering LLMs through LLMs who were powering chat
conversational interfaces and chat GBT started that revolution and what we're seeing is now that LLMs can do things on your behalf and get access to everything. We have to evolve how we
everything. We have to evolve how we think about these things. And it's just really going to profoundly change, I think, the way that we interact with computers, with technology. And I hope
it sort of frees us from some of the limitations that we have, I think, collectively found as well where we're always glued on our phone, you know, kind of tucked over um you know, maybe
we're like typing furiously on our laptop and sort of like we're not connecting enough to others. I think the future is going to be a future where people are much more connected and everything is much more ambient and
seamless around you and you can interact with technology through natural language through natural voice in a very multimodal way and it sort of fluidly
adapts to what you want to do in that moment. And this is hard to imagine now,
moment. And this is hard to imagine now, but I think, you know, within a year or so, we're going to start seeing like the signs of that where agents get embodied and things get a lot more natural. Just
you continue to leverage all of this through natural voice. Um, we're going to break apart the boundaries of the applications that exist today on your computer.
Dehan asked, "You've said some scaffolding should disappear as the model gets better, but skill seems like a kind of user-owned scaffolding that should maybe stay. When somebody
something fails, how do you decide whether to fix it in the model to harness a skill or somewhere else without accidentally turning today's model limitations into tomorrow's
infra?" This is something that we think
infra?" This is something that we think a lot about and it's something that is unique to our setup where we have control over the model. we have control over the harness and the product and the
agent primitives.
oftentimes we actually ask ourselves hey what if we don't fix this in the harness today how quickly is it going to be possible to improve the models this is
something that you know for example for end to end compaction and end toend RL and compaction for very longunning tasks before that people were trying to sort of like fix this with like manual
compaction and very complex systems to keep state around we were thinking maybe we can fix this by working very hard on the next model train and just being able to keep that coherence around very long
horizons of tasks. Um, and so we fix it at the model. Sometimes we estimate that it takes more than a couple of months to fix it in the next generation of models and then we decide to take a bit of a
shortcut uh and fix it in the harness instead. And so there's always this
instead. And so there's always this healthy tension, but we're able to co-design things and just really approach things from first principles uh which always makes me very excited to think about these problems.
There's a few more questions, but uh I think I'm running a little bit short of time. And I just wanted to thank you all
time. And I just wanted to thank you all for being here. Um and
I invite you to just all think with this technology and like you know think about you know what the future is going to look like and you know to invite it into your lives. It's here to stay. It's
your lives. It's here to stay. It's
going to continue to evolve. It's a
beautiful time to just explore all of these things and I hope you have a wonderful time building.
I would like to invite to the stage Dr. Fun Yang, head of AI practice at Govek.
Um, good morning everyone. Uh, my name is Yang. I lead the AI team at GFE
is Yang. I lead the AI team at GFE Singapore. Very glad to be here today um
Singapore. Very glad to be here today um at the AI engineer Singapore event to share with you uh how we are driving the
adoption of AI in Singapore government.
Wrong clicker.
Yes. So a very quick introduction of GFT in case you are not familiar. Uh Gtech
is the lead agency driving Singapore's uh smart nation initiative and public sector digital um transformation. We
harness the power of technology to deliver digital government services. I'm
sure some of you actually many of you have actually used some gap products such as Syncpus live SG go business etc. Our mission is really to engineer a
digital government and make life better.
Um in fact GVtech was formed uh in in the year 2016 and this year we are celebrating 10 years of tech for public good.
Coming back to AI, it is clear that the government must adopt AI. Um the first and most immediate reason is obviously for effectiveness and efficiency. Our
government is responsible for delivering services that millions of people depend on every day. AI give us the opportunity to actually do this much faster,
more accurately and at greater scale.
So that is an opportunity we cannot afford to miss.
But beyond the operational gains, there is the there's the question of expectation from the citizens and businesses. As new technology reshape
businesses. As new technology reshape how citizens live and how businesses operate, people increasingly expect the government to keep pace with the technology.
This will increase the trust and confidence from the people the government serves.
Sorry.
There's also a deeper reason to govern well in the digital in the digital world. We need to understand the
digital world. We need to understand the technology shaping it. Hands-on
experience with AI builds the intuition necessary to craft the policy that are thoughtful, grounded, and fit for purpose. Pro protecting our citizens
purpose. Pro protecting our citizens while enabling innovation.
And finally, if we want our entire nation to embrace AI like how our prime minister has say so the government must
not must must not sit on the sideline.
We must lead by example. When citizens
see their government using AI responsibly and effectively, it builds the confidence and sets the tone for the whole of society.
In fact, we are not starting from scratch. For
years, our government has already been using AI in many areas to inform policy and improve operations and service
delivery both internally within the agencies as well as externally serving citizens and businesses.
Just to share a few examples uh among the tons of AI use cases that we have implemented in government in healthcare AI has been developed to
detect early signs of pre-dementia.
This technology achieve a very high level of accuracy and the results were published in the scientific journal nature communications. We are actually
nature communications. We are actually rolling out this technology at community sites this year.
In education, AI has been deployed to assist teachers mark assignments faster with higher accuracy, cutting down three
to four hours of marking per class and allowing teachers more time to engage the students.
for jobs and skill.
We uh our recommendation engine has been powering my career's future in delivering personalized job and course recommendations for Singaporeans and residents to find
more suitable jobs faster and also learn new skills more effectively.
For citizen services, we have developed and and deployed the latest AI model to our citizen call centers. The
transcription summarization and analytics capabilities allow us to serve our citizens better, reducing the afterpaw works by 72%
and improving customer satisfaction to 95%.
At the same time, we also make sure we apply AI responsibly by developing safety testing tools and guardrails to ensure our AI solutions are safe,
secure, and behaving in the intended way.
While we have made significant progress in bring AI into the government uh over the last few years, we aspire to actually go even further from being AI
enabled to becoming an AI native government. So what is the difference
government. So what is the difference you may ask? An AI enabled government uses AI as a tool, a helpful a helpful
addition to the existing processes. It
is usually built on top of legacy systems and is there is increment incremental improvement. The system
incremental improvement. The system scales but not compounds.
On the contrary, an AI native government is something far more ambitious. It
means AI is the foundation and the core of everything.
We reimagine how government works from the ground up with AI embedded in the way we think. design and deliver there
is always continuous innovation.
So what does AI native government means to us exactly and how we are working towards that? We think about it in four
towards that? We think about it in four pillars differentiated by user personas and one horizontal.
Let me let me just quickly walk you through.
Firstly, we want every single public officer to be augmented by AI. All
150,000s of them from ground staff all the way to the prime minister. No
exception.
I think just now Minister Vivian talked about how he uses and builds AI.
uh in two in two weeks time I'm going to conduct a technical hands-on training to a room full of permanent secretaries on building agents.
We really want to put the AI productivity tools into the hands of every single public officer to help them with their daily tasks and workflows
such as drafting, summarizing, transcribing, analyzing and etc. Second, we want citizen developers to be
building with AI. These are basically the non-technical officers who are closest to the problem statements that we are interested in. They can be policy
officers, you know, they can be uh citizen engagement officers, they can be product managers or designers. We want
to provide the tools to them for them to be able to vibe code, create prototypes and deploy them. Personally, I feel this is a gamecher because it will change the
entire innovation model within the government and now without relying on the engineers, people can actually bring
their ideas to life early in the stage.
Thirdly, for software engineers, AI allows them to build production grade application with greater speed and quality, compressing the entire software
development life cycle.
We have already roll out many um various AI coding assistants like clock code, codeex to our developers. This is not
just to help them with the activity of coding but also the entire SDLC such as code review, testing and documentation.
Last pillar is about AI for domain and uh domain transformation and modernization.
We want to focus on a few key domains such as education, transportation and healthcare and the cross cutting functions such as HR and finance and
completely redesign the business processes for better outcome.
You will see underpinning all these AI initiatives is our government AI stack which really provides the latest foundation models and those customuilt
AI capabilities in vision, speech, document analysis, evals and safety all with government context and localization.
This will ensure our AI solutions are supported by performant models, have shorter time to market and also are safe and secure by design.
As part of the platform, we are also building capabilities in agent harness.
Let me spend a few minutes to explain what it is and why we are doing it.
Looking ahead, we understand from the industry that you know AI agents is going to proliferate very soon. This
will mean AI becoming more capable having access to data, having access to tools, being able to perform actions in autonomous fashion.
According to a IDC study, there will be more than 1.3 billion AI agents by the year 2028. is a very big and scary
year 2028. is a very big and scary number but personally I find this could be actually very conservative by the rate of development that we we can observe
we can already see people starting to develop agents either for their personal use for the team's collaboration or even consumption at the at the enterprise level
there's a whole range of use cases for in the government for agent AI in citizen services policy research etc With the proliferation of AI agents in
the government, we must we must think about a way to effectively enable them, optimize them and manage them so that we can maximize the value and manage the
any associated risks that comes with them.
Sorry, we are building a sovereign agentic harness which include a few components.
MCP gateway which serves as a front door, agentic runtime which provides a sandbox environment and also the resources for the agents to perform
their actions. a agent identity uh
their actions. a agent identity uh management which ensure each agent has a verified identity knows what is allowed to do and cannot overstep its boundaries.
An agent memory which provides personalized experience to users with short-term memory within a single conversation and the long-term memory across multiple sessions.
Observability is important. It provides
the oversight to the entire agentic ecosystem monitoring what agents have been doing catching problem early and understanding what has gone wrong.
A skills platform which contain a rich library of readymade capabilities like searching the web, reading documents,
sending emails all versioned, evaluated, sharable and governed so that agents can draw upon to complete their tasks.
The idea is that every single assistant or agents in the government whether it's a coding agents or it's a co-work session or it's a workflow agent is a
client of this stack.
One door everything is visible.
You might think that it's relatively simple to think about it, you know, in an individual local setup, but it is a whole different ball game at the enterprise level, especially if you
think about it, you know, in an ecosystem across multiple organizations within the government.
As an analogy, I always like to think about it, you know, in the car example, a super powerful car engine by themselves are not good enough to actually bring people from one point to
another. You need robust car car bodies.
another. You need robust car car bodies.
You literally need the the roads. You
also need the clear traffic rules, you know, for safe and efficient commute.
Similarly, AI models are like car engines. They are not good enough to be
engines. They are not good enough to be effective agents. They need a harness to
effective agents. They need a harness to be truly useful and trustworthy.
Hence, a key strategy for us to work towards the agent AI is actually to invest heavily in building these capabilities in the agent harness.
That leads to the end of my sharing.
Thank you very much for your attention.
It is really exciting time.
It is really exciting time ahead of us.
Uh do collaborate with us. Um and also you know if you interested joining us in this meaningful journey AI for public good if you are interested please visit our booth you know we have teams
showcasing some of the work the initiative the projects that we are working on they will be more than happy to share more details with you uh I'll also be very happy to connect with you on LinkedIn and also share more about
collaboration opportunities. Thank you
collaboration opportunities. Thank you very much.
All right, I would like to invite to the stage our first speaker in our design track, Phil, CEO and co-founder of Air Foil.
Also, quick PSA. Um, it is past 10:00 a.m. So, our expose are actually all all
a.m. So, our expose are actually all all open as well in Pullman as well as uh Capitol Kinsky. If you need to refer to
Capitol Kinsky. If you need to refer to any maps, we have some tools for that.
Thank you.
Okay.
To just straight in.
Okay.
Screen.
Awesome.
Testing. testing.
Okay, awesome. Good morning everybody. It's
awesome. Good morning everybody. It's
great to great to see all of you here and honestly so surreal um to think that this entire conference is happening that so many of you have traveled from around the world and coming from Singapore to
be here. Um I'm my name is Phil
be here. Um I'm my name is Phil Hedatnea. I'm the co-founder of a
Hedatnea. I'm the co-founder of a company called Airfoil. Um we're
basically a combination of a product design, brand design, and design research firm that works with companies across the tech sector. Um, but we've actually been dual based in San Francisco and in Singapore for the last
5 years. So, it's awesome to see you all
5 years. So, it's awesome to see you all here. Um, whether or not you know who we
here. Um, whether or not you know who we are, uh, you may have interacted with some products you've worked on in the past. So, for example, if you're doing
past. So, for example, if you're doing document processing with agents, you might be using Reduct. If you're
embedding voice AIs into your application, maybe Vappy. If you're
doing a Gentic Search, maybe Exa. Is
there Oh, someone in the back is okay.
Uh, or if you're here from crypto, maybe Salana. Um, but I wanted to basically
Salana. Um, but I wanted to basically about a year ago we built a team at Airflow called Airflow Labs because there was a question on all of our minds and the question was very very simple.
Will we have a job in two years? Because
as a design firm, right, especially if you've been on Twitter and seen the talk about the design tax and the way in which increasingly improving models will enable us to just build things without the need for a designer. We were
honestly a little scared. Um we wanted to know where our place really was in the design process. So we started building. We made things internally like
building. We made things internally like check which is our own uh engine to effectively QA the implementation of our design. So we can take a Figma file on
design. So we can take a Figma file on one side, we can take a live staging site on the other, use image models to then compare the two together and make sure that we've implemented them properly. Um eventually this turned into
properly. Um eventually this turned into something kind of cool which was self-improving websites. So, because
self-improving websites. So, because we're able to stack rank and uh prioritize based on severity, we're able to then feed that directly back into a code model and then constantly make
sites better even after we've released the first dev version. We built
something called scoop, which is um effectively it just takes all the information that a client gives us and turns it into a really comprehensive brief. Takes two or three pages of
brief. Takes two or three pages of context we get and turns that into 50 or more. But it importantly gives designers
more. But it importantly gives designers more context on the industries they're designing for, the customers and users they're designing for so that they can do better work.
But after all of that, we started to put our heads together around what's effectively the holy grail, the thing that everyone's trying to solve. How can
we create design agents that actually have taste that are able to produce things that don't just look like slop?
And so today, I want to present a little bit of what we've learned. Here it is.
Okay. Not that. Not that at all. Um,
that's actually a screenshot from impeccable.style. Um, which is something
impeccable.style. Um, which is something that you can download. We didn't make it, but it helps uh your agents have better design fluency. The way that it works is it basically tells agents a bunch of things not to do, right? Make
sure that your color contrast is appropriate or use better typography.
And that does make a meaningful difference. You can see without
difference. You can see without impeccable.style and with
impeccable.style and with impeccable.style style that site looks a
impeccable.style style that site looks a lot better, but it still kind of looks like slop. It looks like something that
like slop. It looks like something that you were able to just generate directly.
So why is that? Why does that continue to happen? Well, our view is that
to happen? Well, our view is that training AI on what we consider good design doesn't teach AI how we got there. And it misses a very important
there. And it misses a very important point. Design isn't about taking product
point. Design isn't about taking product specs to Figamox. Design is about applied psychology. It's about
applied psychology. It's about understanding how the user thinks, how the user acts, and crafting the flows, visuals, and narratives that will resonate with them. I like to say that
designers are investigators of human psychology. This is a mood board that my
psychology. This is a mood board that my co-founder put together for a merch project that we're working on. And it
actually seems kind of random at first.
If you look up in the top left corner, you'll see a photo of California Street in San Francisco. And it's unclear what that has to do with merch. But what it really means is it's a way for us to
categorize the things that we derive meaning from. These images may seem
meaning from. These images may seem random at first, but they express meaning to someone. And when designers put together these mood boards, that they're trying to understand. They're
trying to investigate why people resonate with certain things, come up with rules for how to do that, and then apply that to their own work. And
there's another way to look at this.
It's just human creativity. Uh there's a book called The Runaway Species by Anthony Bran and David Eagleman. And
Tony Bran actually uh is a professor at Rice University where I went. I studied
under him. He was one of the biggest inspirations for me and one of the reasons I got into design. And what the runaway species articulates is a definition of human creativity which is the bending, breaking, and blending of
existing concepts to create things novel relative to the culture in which they're introduced. Put simply, it's not that
introduced. Put simply, it's not that people are born with creativity, that they're they have an inate characteristic to be creative. We all
are creative every day. It's a simple part of of how our brains work. But it's
not just a neuroscientific definition.
It's a sociological definition. We see
this in things like biomimicry. The
reason that the shinkenzen doesn't cause a sonic boom as it exits the exits a tunnel and through a mountain is because they modeled the shinken not just uh off
of other trains but off of the bill of the kingfisher. That was the insight
the kingfisher. That was the insight they derived from nature and applied to a totally different context and even in something like a website like say the reducto site that we worked on um we wanted to make it feel friendlier and
more accessible to people. So we
introduced page elements that brought back dot mat the the elements of dot matrix printers. You can see an actual
matrix printers. You can see an actual example over here. It's little decisions like that that make all the difference to making interfaces and brands look great and makes the difference between
stuff that looks like slop and stuff that looks innate and and really creative. But my key point is that none
creative. But my key point is that none of that can be extracted from outcomes.
You can train on outcomes and you'll eventually get overall better visuals that don't make clear mistakes, but you won't get visuals that are novel, interesting, and new. When we train
models on ideal design outcomes without the context and thinking behind them, that's when we get underwhelming results. So, we decided to take a stab
results. So, we decided to take a stab at how to solve this problem. And I'm
going to show this to you for the first time. We've not demoed this before. Um,
time. We've not demoed this before. Um,
this is something that's currently internal to us, but we hope to bring it to the public soon. Um, I want to give you a first look today at something we built called Melt. So, Melt starts where a lot of our designers do, which is
design Twitter. Um, but this is the case
design Twitter. Um, but this is the case for a lot of designers, right? They're
always going out into the world. They're
finding inspiration. They're looking at an interesting brand direction, and now they can just save it to melt. They can
click the save to Meld button, and then we save it away to what we call their backpack. Or let's say they were on a
backpack. Or let's say they were on a trip to Vietnam, and they went to a restaurant called Pizza Four Pas, and they're like, "This is a pizza restaurant, but it's really beautiful brand direction." And it is great. uh
brand direction." And it is great. uh
illustration like their menus are even gorgeous. They can just take that thing
gorgeous. They can just take that thing that they saw, they can save it directly to Mel and then we start to extract key metadata from that things like typography, color use, but also background information on the company
itself and where you were when you took it. Once we have all that metadata
it. Once we have all that metadata together, we're able and that's what the desktop one looks like. We're able to put that in your backpack and you can access all of that information later.
So, why would you want to have all of that information? Well, the first reason
that information? Well, the first reason as a designer is you want to use it the way you would use a notebook. By
annotating on the samples that you save, you're able to record your thinking at the time, which means that in six months time, if you need to come back to any of that, you can do so immediately. But it
also means we can start to make connections between the metadata we're collecting and how it impacts how people think about it, how it impacts their perception. So this means that I can ask
perception. So this means that I can ask more complex queries. For example, I can say find visuals from Vietnam 2026 with serif typography and blue, yellow, or black and white color palettes. The
second half of that you might be able to just do directly with an image model, but the combined query gets easier to do once we have all of that in there. And
so you can see it says find three saves in Vietnam 2026. And it's able to understand in a bit deeper of a way what it's actually referencing. And of
course, you can oneclick export to Figma. That's the thing every designer
Figma. That's the thing every designer wants. Where it gets more interesting is
wants. Where it gets more interesting is that once Mel understands your reasoning for why you've saved things, it can surface that to other people and it can give you the ability to share in a more
multiplayer way. So on Mel, you can make
multiplayer way. So on Mel, you can make general queries that then are able to use the comments and annotations other people have left to better understand the content you want to find. So it's a
much more effective content finding engine. So if I say something like
engine. So if I say something like assemble a mood board with light and airy UI, it's able to find references not just that I saved but that my teammates also saved with context about
why they saved it which is really really important. Coming back to that
important. Coming back to that definition of creativity again, our view is that by enhancing recall and putting things in front of people faster, but also also making sure to save that
creative process, to save the feedback, the back and forth, the comments that uh make uh work resonate with people, understanding it in a deeper way. That's
what enables us to then take the next step, which is actually try to infuse quote unquote taste or rather the intention of the human designer into the work that these models actually produce.
So this isn't part of the product yet.
It's something we built called Blend, but it's able to use both the visual references in Mel that you've saved as well as the metadata and comments to remix different things together. We're
now building tools that enable us to render entire mockups of pages just using the influences in Meld coupled with custom prompts and commands. And
though it's still a bit rough and we've got some more work to do on it, it is yielding way better results for us than just using claude or GPT directly. Der
Rams once said that you cannot understand good design if you don't understand people because design is made for people. I think the problem with
for people. I think the problem with design agents today is that we're spending a lot of time looking at what people make and not looking at why they make it. But by bringing all of that
make it. But by bringing all of that onto a platform and making it legible to LLMs, I believe that will unlock the next generation of design agents that are able to act more intelligently to follow our intent and even to make
decisions on their own. We want to start with augmenting the creative process.
Eventually, that enables us to teach machines to create. Eventually, that
enables us to teach machines to decide.
that unlocks the world of generative UI and all of the amazing futures we want to build. So, we're going to be sharing
to build. So, we're going to be sharing a lot more in public soon on MEL and everything else that we do at AirFoil.
You can scan the QR on screen to learn a little bit more about us or stay in touch. And Min and I will both be around
touch. And Min and I will both be around the conference today and tomorrow.
Thanks so much everybody.
And now I would like to invite up to the stage Annie Lua, senior UX researcher at Google.
Hi everyone, I'm Annie. I'm UX
researcher at Google working on AI shopping. We've heard a lot about um
shopping. We've heard a lot about um coding agents and ways to get AI to do more faster with less friction. And I
want to talk about the other side, a class of problems where efficiency isn't the goal and we actually need to keep some of the friction in for these everyday consumer AI products.
So let's take a moment to think about this question. One that you might ask in
this question. One that you might ask in front of a mirror. How do I look in this jacket?
Underneath though, um, you might actually be asking about, does this reflect the person I want to be? Um,
does this make me feel like because the fur jacket might be a little bit out of my daily range, um, am I brave enough to wear this or does it make me feel like I'm trying too hard? So, these aren't
prompts or search queries. These are the kind of questions that people quietly ask themselves when making a purchase decision.
So the first wave of AI worked by removing a lot of these frictions um for tasks like summarizing um a dock or booking the cheapest flights. These are
the utility task where the success metrics is pretty obvious. Um you get the task done quickly and as AI is now being asked to help with a different class of problems where um the question
is a lot more subjective like how do I look in this jacket? What kind of trip do I want? Um, these are subjective questions and the right answer depends on the person, the moment and even the
mood and efficiency alone can't really tell if the feature is actually helpful anymore. And so how do we design for
anymore. And so how do we design for this? When AI moves into helping people
this? When AI moves into helping people with these everyday decisions that are really personal and subjective, three things change. People don't actually
things change. People don't actually know what they want until they've seen a range of versions in contrast. And
that's how people build trust. And as AI becomes a thinking partner for a lot of these decisions that are a lot more personal, a different kind of trust has to be earned. If you think about um meeting a stylist for the first time,
trust is built through the small talks that you guys had up front or the stylist commenting on something that you wear that day um instead of upfront giving you a range of recommendations for things that you like. You wouldn't
trust that the stylist actually know what you want. And so um it's really important because uh you trust them because they have signal in small ways through those little interactions that
they understand your vibe and different from utility tasks where confidence for personal decisions um comes from the feeling like you have made the call and all of these aren't straightforward
deliverables. These are things that AI
deliverables. These are things that AI has to help you build in the process. Um
and so in the next few slides I love to use two domains to show how that looks like. um in fashion and travel.
like. um in fashion and travel.
First, this is um virtual tryon. It's a
Google shopping AI feature that I've been working on um for visualizing how clothes would look on you. Powered by a custom image generation model for fashion. We launched it last year in the
fashion. We launched it last year in the US and in APAC. It's currently available for users in Australia, Indonesia, and India.
And here's how it works. You're looking
at a denim jacket and you upload a full body photo of yourself. So, I pick one of me in Central Park in New York and um and then the AI can render the jacket on you in your context rather than you
having to imagine how that would look on you browsing a feed of products. And
notice that the question AI helps with is not just figuring out is this is is this is like a nice jacket. It's
actually helping you visualize do I look good in this for a vibe check. And you
can also see yourself in different jackets. Maybe I want to try a white
jackets. Maybe I want to try a white one. And that's how you gradually build
one. And that's how you gradually build taste by seeing a range and compare. You
don't really know you prefer the white one a lot more by um until you actually see this being right next to a blue one.
And as you explore further, you might start to recognize patterns about yourself or find something that really surprises you. Maybe the the brown one
surprises you. Maybe the the brown one actually looks really good. And AI that really built for supporting subjective decisions aren't really deciding for you, but they're giving you a surface to
to discover your own taste. And in this case, let's say I'm not really interested in any of these. I feel like I'm not a fan. Um but in in the utility frame, it feels like nothing happened
here because um the user didn't buy. But
subjectively, they got something super valuable because they sharpened their taste. I also learned something about
taste. I also learned something about myself. I don't really look in that um
myself. I don't really look in that um purple dress, which is equally valuable.
In our next example, um let's take a look at travel as well, like where should I travel next? Again, the real questions underneath are subjective. Um,
do I want to be challenged or just chill out and relax? Or in this trip, do I want to be a museum person or do I want to be a beach person? A booking agent won't be able to help you answer that.
And people plan trips partly to figure that out.
And in Google Travel, we treat maps as a place to wander, not just a destination picker. And this is a reference point
picker. And this is a reference point for the kind of interface that supports exploration, not just jumping to book me a ski trip for efficiency.
And maybe you wonder, should I be a ski person this winter and you wanted to explore um Aspen or um Whistler and both are great skiing destinations in the US.
Or maybe skiing doesn't feel right and now you're considering a totally different kind of trip. Um, and so maybe you wanted to explore like the Yellowstone National Park or um, Yusede
and now you're considering something totally different and a chatbot might have committed you to skiing five prompts ago, but a map interface lets you change your mind and explore with you. And that's a key difference. So,
you. And that's a key difference. So,
what both products have in common is this. They're not trying to um, give you
this. They're not trying to um, give you the answers fast. They're trying to give you a better place to think. And that's
why it's really important to design for the deciding not just a decision because um taste, trust and confidence, these are built through the process, not just
um being handed over for you at the end.
And so that also means we have to measure a different set of things for metrics like task completion, um time to results, conversion. These are great for
results, conversion. These are great for utility tasks. But for a different um
utility tasks. But for a different um class of problems that are a lot more subjectives, the ones that truly matters are harder to count. For things like the users feel more confident, did they
learn something about myelves? Or did
they return to explore more? These are
the things that really count.
And practically there are three kinds of optimizations where efficiency could cause explorations. And those are the
cause explorations. And those are the moments where it's really important for us to put the friction back. Um, and so for everyday consumer products where AI is supporting people with personal subjective decisions, it's really
important for us to support comparison, not just giving one suggestions right away. Otherwise, we miss that meaningful
away. Otherwise, we miss that meaningful moments to help people build trust. It's
also really important to understand the intent not just um give them the quick results because we have to earn a different kinds of trust for and build in moments where people can express
their intent or visual preferences and also show that AI gets your taste and gets the style that you want instead of assuming the intent right away. Um and
lastly, invite active selections. Um not
just automatically giving you the best choice because the act of choosing is the point a lot of these exploration journeys that really make the journey fun, delightful. Um that's about
fun, delightful. Um that's about self-discovery as well. And these are the frictions that are worth keeping.
Um thank you. I love jamming on consumer products and I also write about this kind of thing at Substack. Happy to chat more after.
All right, thank you so much everybody.
Uh that was the conclusion of our first part of our morning session. So we're
going to be taking a 15inut break uh in the theater right now. But during this time uh we also wanted to create some experience for all of us to take a moment and uh you know take a rest from
thinking and just relax a little bit. So
that's why I'm very excited to welcome uh Kazaya a trained mindfulness teacher to the stage. Uh she actually built a sensory meditation experience including
a vibecoded particle visualizer trained on hours of her own guided meditation transcripts.
Hey, hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, Hey, hey, hey, hey.
Hey, hey, hey, Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey.
Hey, hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
your morning break. Hopefully the talks have been great so far. Uh we're going to keep these moving with our next talk from Jimmy Lie who's the head of Nex.js at Verscell who'll be talking about
shipping what's next. Jimmy, the stage is yours.
Hello. How's it sound? Cool.
Right. Hi, I'm Jimmy. I lead the NexJS and React team at Versell. So, since
we're at at an AI conference, I'm kind of curious just, you know, how much people are familiar with like Nex.js and Versel in general.
Okay, it's not not terrible. Then um for those who don't know what it is, it's like a web framework that people use to build like websites in general. It's not
the topic of the talk, so we should be fine. Um but here's a fun number. When I
fine. Um but here's a fun number. When I
joined NexJS four years ago, we were doing about like 4 million donors per week, and today we're basically up 42 million. Uh obviously a huge part of
million. Uh obviously a huge part of that is thanks to you know the incredible work that my team is doing but in reality uh in reality I think a huge part of that is because of like
coding agents. Um and
coding agents. Um and as such I think that has changed a lot about like how we think about building um tooling for developers. um because we spend a lot of time about obsessing
about how people will build websites in the future and in the past six months it has caught up to us you know faster than we ever expected. Um, and this year we
spent a lot of time talking about like, you know, how can we account for like this new type of like user? How do we stay ahead of the curve as a team? And
like, do we still have a job in the future? And we're somewhat okay because
future? And we're somewhat okay because I think maybe what we had to go through uh and realize is that like the job was never just about executing the tasks.
It's about deciding which task should even exist and whether or not we want to own the result of it. So
yeah, in reality AI just has made like executing and building u much faster, but turns out you still have the same amount of time in your day. So in
reality, it just makes ownership more expensive because you still need to account for like what you put out uh out there and you still need to like deal with like any issues that come with it.
Um so today I want to share a few of those lessons with you because I think they apply to everyone in the room whether or not you're building with an agent for an agent or whether or not you're building your own agent. Um my
prediction is what we're learning about how agents use Nex.js JS will only grow more useful as agents become more widespread as um you start to using to use them for like anything else besides
coding. Uh maybe it's just about like
coding. Uh maybe it's just about like doing your online shopping for you as we already kind of see happen with like open flow. Uh so I want to talk about
open flow. Uh so I want to talk about three things like just what's changed for us as we started building for agents. how agents also changed how we
agents. how agents also changed how we worked as a team and where I think the industry is going to go um and especially why open source matters more than ever now.
So the weird thing about maintaining a framework in 2026 is that you're no longer designing for the person at the keyboard. Uh you're designing for
keyboard. Uh you're designing for whatever is sitting between them and the code. So it used to be an editor the
code. So it used to be an editor the docs pages but now it's like an agent that you know has access to your whole computer maybe has too much permissions
and that can be like a little bit dangerous and pricey too. So fun fact like 60% of like the next JS docs today
are served as markdown. um which means like not only coding agents but it's also like um like lab indexing um etc. But it just means
that like we're not having much uh like manual clicks to the docs anymore. I
mean if you think about it when was the last time you consulted docs yourself um it's always a bit faster. There's less
less friction now that just just in just asking like you know claude like how does this thing work in XJS? Um, and so we're moving to a world where software is kind of becoming the primary user of
software. And I feel like that changes a
software. And I feel like that changes a lot.
If you think about it, an agent is kind of a an annoying kind of user. It reads
exactly what you've wrote, copies example, it runs commands, it follows errors to the to the to the letter really. And so if the loop is broken, if
really. And so if the loop is broken, if your errors are not good, it's just not gonna, you know, like sort of stop, get some coffee, go to bed, and like wake up at 5 a.m. with the fix in their head.
They're just going to keep trying and burning money up until uh they get it fixed. And so that has been really
fixed. And so that has been really useful for us to understand because it indicates where the human is still like required in the loop and what we should
try to optimize. Um, so for example, documentation used to be like somewhat of a very passive thing. We we used to put it out there and we'd assume that like no one would like people would read
it once in a while and like sort like build that knowledge over time and and you know I always prided myself on like having great docs for like the next site but the real return on investment was
never really there. We do it once in a while like we'd look at them and like say oh we're missing this feature. Um
but nowadays like documentation has become like you know a bit of a a bit of a bible for agents. They'll take
whatever you wrote in there and just act on it immediately.
And it's not only applicable you know to sort like frameworks and like dev tools offer. It's like anything that's in your
offer. It's like anything that's in your codebase like your readmies your PR descriptions your your playbooks every stale document that's in your codebase. It's kind of like a time bomb right? It's a it's like
a hallucination waiting to happen. And
the most dangerous ones that I've seen are like not the missing ones, right?
Like because if the agent has access to code, they can still figure it out. It
it's like when the doc is like slightly misleading. Um what we've seen in
misleading. Um what we've seen in practice is a stell example used to confuse, you know, one person at a time and now basically confuses like hundreds of thousands of projects before anyone
really notices it. Um and worse for us it's when the bad information gets into the the data sets it means that the model is now potentially tainted. Um
that definitely has happened for some NexJS features. Um and the same story
NexJS features. Um and the same story for compiler errors. We we have sort like an error I think in next that says like this is a bug in NexJS please open
an issue. And this is like this kind of
an issue. And this is like this kind of like a crime uh in this age because I've never seen to this day like an agent like open a bug report on on Nex.js. U
like as as tool offers it's really important for us that we kind of ensure that the user always stays you know as fast as as unburdened as possible. In
general this again this applies to like any types of like tools that you build um like you know like your banking websites that requires like 10 steps before you're able to like um send a
payment or something. Um, so but it's just that like agents have made this much more important for us.
Yeah, it's also a great way of like agents are like a nice way of like testing out whether or not you're designing properly because when you're coming up with like something new, an agent will basically have like no, you know, they don't have like Stack
Overflow. They don't have like Twitter
Overflow. They don't have like Twitter lore to understand what what something is. And so if your if your API is like
is. And so if your if your API is like poorly designed, poorly named, then you're going to in for a bit of trouble, right? And one key part that I've
right? And one key part that I've learned that I think you should also apply in your work is that like any system should be as explicit as possible. When you're doing an action,
possible. When you're doing an action, you should really be thorough about like the way you can explain it. like like
those like you know 500 like status code that don't say anything you should be able to still debug it because we're kind of heading toward a world where um you know as like Sio was saying earlier
about codex is that like agents are just going to run passively for everyone and so you want to enable your own systems to run and be fixed passively. You want to you want them to
passively. You want to you want them to be able to understand like you know if your if your prerogative is that like all of your websites should be fast you should be able to like define those and today we have like some metrics here and
there but maybe you have a different definitions and so you should work through your codebase what that means what does it mean to be reliable? What
does it mean to be um fast? What does it mean to be secure? Um so that in turn you're well set up for when the agents are able to like kind of run autonomously and ship the fixes
themselves.
Um so yeah that's that's kind of the first shift like how basically building for agents doesn't like replace the fundamentals it just makes them like you know even more important. I wanted to
share about how we leverage agents ourselves internally, right? Like I'm
sure like you're all familiar like in the past six months um I think the industry has gotten like a little bit of like a a psychosis phase where like everyone was trying to build everything
in the entire world. Uh I've certainly done so a lot thanks to uh oppus during Christmas and once we got past there uh I think like the realization was that
the real work is you know the tastes and the judgment and and I think it's better to think of agents as like being able to help you for everything around it. Um as
an engineer what's most valuable is my focus time. Like I'm sure you've heard
focus time. Like I'm sure you've heard about this study that says, you know, if you're like disturbed for um a little bit, it takes like you it takes you like
30 minutes on average to get back to like a flow state. And in a world where it's, you know, extremely tempting to have like 10 agents uh running in the background, uh 10 chats open at the same
time, like you know, how do you make sense of that? How do you stay productive in that world? Um because
agents are still pretty powerful. They
allow you to like, you know, research very quickly. Can turn like a messy
very quickly. Can turn like a messy investigation into like, you know, a really nice documentation spec. Um, but
my key advice here, learned from my my own hard experience, uh, is stop actually force yourself to stop chatting with them. You kind of want to put in
with them. You kind of want to put in the work now so that you can avoid having to like over steer an agent. Like
it's kind of hard because I like I like having like 10 chats going on at the same time. It gives me like a little bit
same time. It gives me like a little bit of like dopamine all the time. But think
about the word where instead of like you know bottlenecking yourself with like 10 agents you can by putting the work now putting like the correct prompts in
place and the correct like evolves and like safeguards. You can this is kind of
like safeguards. You can this is kind of how you scale yourself to like having a hundred of like agents running in the background. Um yeah because that's just
background. Um yeah because that's just what's going to happen for us as an industry.
Um, so that's the version of like AI that I'm very excited about, right? Um,
but there's also another part to it where like the industry does not only reward judgment, right, but it also rewards like motion. Uh, and EA AI creates a lot of that. So I want to talk
about the honestly that's probably the the most important thing. It's like
knowing when not to use AI. Uh, someone
said to me in the past six months that the the last six months have felt like the most exhausting five years of their life. Um, and that sounds kind of about
life. Um, and that sounds kind of about right. Every week there's like a new
right. Every week there's like a new model, a new demo, a new feature that comes out. And
comes out. And you know, my natural reflex, and I'm sure you've felt this as well, is that like, well, you should basically do the same thing. You should like generate
same thing. You should like generate more code. You should ship like more
more code. You should ship like more features um so that you can bid out the competitors um so that you can stay ahead of the curve.
But in reality and and we know this from like you know having like built like developer toolings for like the the past 10 years is that what's going to happen is that you're just like speedrunning
tech depth. You're every demos that you
tech depth. You're every demos that you have shipped you know in the past six months maybe you've felt it now. um now
you actually have to deal with it and with that comes a lot of like other problems like uh observability um pricing uh making sure that like everything stays like really reliable.
Um and for example, like you know, you can fork NexJS over a weekend with like uh, you know, a bunch of like tokens. Uh, but
just because you can doesn't mean that you should. Uh, because when you fork a
you should. Uh, because when you fork a framework for example, you start owning everything that comes with it. And like
maybe the the biggest recent example of that is security issues. Um last year I ended up like leading the response to like react to shell which was um you
know a very critical issue and like a very critical vulnerability and this is kind of what I think this is kind of like the extreme of like ownership is like we said you know we
released um NexJS to the world like a few a few years ago we keep releasing it and all of a sudden I still we still had to fix it um for like you know the the
hundred like you know thousands of like users that we have and this is where I you know would caution you to if you think that like you can like to fully
replace abstract away some place of your stack um think about yeah like how how are people going to deal with it in a year's time when you've created your own like meta framework um to serve your
websites do you actually want to take care of this or maybe it's better to use like open source and make sure that like you're you know giving back to the community and potentially like helping
other people have like more secure websites. A note by the way on security,
websites. A note by the way on security, right, is we're kind of living in like unprecedented unprecedented precedented times um in terms of like uh
vulnerabilities being disclosed every every month or so, right? Because AI has made it insanely easier and it looks like you're not using like secure software, but it's actually the reverse.
It means the system is working. If
you're getting like security patches, it means like somebody reported it. Um the
other alternative here is that you're building your own version of like your framework or your tooling. Um but they don't get like the attention the attention of like security researchers.
And so now attackers will just like you know identify that you're you're running your own your own stack and they will like basically attack you without you knowing. And um this is where open
knowing. And um this is where open source really matters because we get to sort like build on like stable foundations together.
Um so at Verscell we kind of have a saying which is like you you know you can just ship things and this this was pretty great like we built like you know an insane amount of like really good
products out of this. Um but since the start of the year we've also started taking like the other approach too which which is the fact that like you can also use AI to just delete things. Um because
before like shipping was you know it just meant winning like you just you could know could have like features features features um in your road map and but now that's like it has become so
cheap I think what's going to put you ahead of the conversation um and the competition is focus like because you're going to have to deal with it your users are going to
have to deal with it it's I would actually rather slow down take you you know, take time to reflect on like what what is actually what makes your product different. Um, because if you can build
different. Um, because if you can build like a feature in like an hour with a few tokens, your competitors can do the same too.
So, and yeah, and what I'm saying is not about like not shipping at all, right?
It's more about like how you protect yourself from like and how you protect your ability to keep shipping.
Um, the question is not yeah, can we build this anymore? because the answer is like always yes. The harder question and the one that like actually decides whether or not something is worth doing is like should this exist and are we
actually willing to own it like long term. Um the sun yeah when I was working
term. Um the sun yeah when I was working at Meta we had a thing called like the not invented here syndrome uh where people would just actually rebuild like
every every library is possible on the planet. Um there there used to be like
planet. Um there there used to be like um people are pretty familiar with like React Native for example for serving web um mobile mobile apps with React. Um
funnily there's like three versions of this internally at Meta just because people did not want to control this and that was already a problem back then.
Um, it's becoming more and more of a problem now for like everyone. And
again, uh, when you think about like spinning up your own, uh, so like your own products that like replaces like something in your stack, think about yeah, the the mental burden that that's
going to come with it. Um, so yeah. Um,
as quick recap, I guess like my my my predictions, right, is like if you're building with an agent, like what's really important is that you think about what's not on the happy path. Like, you
know, can your users actually fully use your tools without like prompting themselves? Like make sure
prompting themselves? Like make sure that like your docs, your errors, your um CLIs are well well defined. If you're
building with agents, be really careful about like outsourcing judgments there.
Um you can use them to you know get context closer to the judgment like you can do the research you can you know investigate repros specs like
investigate like performance issues. Um
but you know yeah really focus on like what you bring to the to the table and make sure that you make time for this.
Um and also as the industry speeds up yeah be careful about what you decide to own. Um again like AI made creation like
own. Um again like AI made creation like really cheap uh but ownership is much more expensive than you think it is. Um
so yeah doesn't mean that you should ship less but it just means that we have to ship you know as an industry with like more focus build what should exist like try to make
it understandable make it reliable make it safe and stand on the foundations that you can trust. Um and yeah, thank you.
Thank you so much, Jimmy. Our next
speaker is Vran Yukich, uh who's a co-founder and CTO of Daytona. Um
He will be speaking about why sandboxes are non-negotiable for autonomous AI agents. Um,
agents. Um, so without further ado, we're going to hear from Van Um, hello everyone. It's great to see you all here. Uh, my name is Weather
Nich and I'm the CTO and co-founder at Daytona and Singapore is our number one city by users worldwide and top five countries. So thank you for for that.
countries. So thank you for for that.
And today I'm going to talk about uh why you should run your autonomous agents inside sandboxed environments.
So when you install cloud code, codeex, open code or any agent that uses tools, you give it a lot. It runs as you. It
can read your files. It can use your SSH keys. It can spend your AWS bill. It can
keys. It can spend your AWS bill. It can
delete things. and it decides what to do based on text that it reads from the internet. So we said yes to this because
internet. So we said yes to this because the productivity is real. But most of us never thought about risks. So why would
the agent do things that it is not supposed to do? Well, because it can get compromised easily.
Um, prompt injection is when someone hides instructions in text that agent reads. And there are two kinds. Direct
reads. And there are two kinds. Direct
when the attacker types the bad prompt and indirect when the bad prompt is hidden some in something that the agent
reads. It can be a web page, a rhythm
reads. It can be a web page, a rhythm file or an email.
and indirect are the dangerous ones because the autonomous agent reads the internet. It's its job. Fortunately, uh
internet. It's its job. Fortunately, uh
models are getting better at spotting this, but they don't actually catch them reliably.
And remember that attacker needs to succeed only once.
So, OASP says, OASP says it plainly. Uh,
the prompt injection cannot be fully prevented. It's how the models work.
prevented. It's how the models work.
Open AI said the same thing in December.
So, the people who are building the models are actually telling you they they cannot stop this. It's not a bug.
It's how the technology works.
And we also use skills, right? So a
skill is a folder with some instructions and maybe some code in it and you download it. You give it to your agent
download it. You give it to your agent and your agent will run it with all your permissions with your shell with your
tokens with your files. And remember
there is no app store review for skills.
So there's no sandbox between the skill and your machine.
The agent will read instructions from the skill and will does what they they say.
Um if you look at the numbers they're they're not looking good. Three
different teams have looked at this in early 2026 and KO security checked open cloud the skill marketplace for the
cloud agent.
They found 341 bad skill and by February that number grew to over 800 bad skills.
Sneaked checked another set and they found 13% of skills that had serious problems and
with 76 were clearly malicious. Also
research paper titled malicious skills in the wild checked 98,000 uh skills across different marketplaces
and they found 157 of them are bad.
Now let's let's see uh an example of a real malicious skill from from that re research. And it looks like a normal
research. And it looks like a normal documentation helper, but there's a hidden comment in the markdown, right? And if you if
you preview the file, you don't see it, but the agent does. And the comment tells the agent to send your project
files to an attacker server, right? So,
and the funny thing, the last line in the comment says like, do not mention this to the user.
Um, some skills ship real code and this one looks like a normal telemetry function, but if you look, it it collects some data and and it sends it to an analytics endpoint. But if you
look closer, uh, what it's really interested is are your API keys, your secrets, your tokens, and it will walk through your environment variables. It
will pick out the credentials and will ship them out.
So to get compromised, you don't actually need to install anything. Um
the agent can read any readme file, an issue or an email, even a PDF and any of them can contain malicious instructions.
So instead of trying to to prevent this, we should accept that this is a reality.
The model cannot be fixed and OASP and Open AI said so. a new skill ecosystem is is already full of of bad stuff and
new malicious skills are appearing at a rate that no one can really review them.
So any readme file, any ticket, any email the agent reads can can hold malicious instructions.
So what can we do? We can we can change what the agent has access to. So no host shell, no host files, no credentials and
we can restrict the internet and also we can throw it all away when the task ends.
Sandbox is not just a virtual machine or a container. The agent inside still has
a container. The agent inside still has your access token is still has open internet. A real sandbox
internet. A real sandbox does four things. One, it keeps your secrets outside of the agent so that agent never sees them. Two, it controls
what the agent can access on the internet or inside your local infrastructure. Three, it logs
infrastructure. Three, it logs everything, every command and every request. And four, it sits between the
request. And four, it sits between the agent and the AI model. So you can see what the agent has asked and what the model responded.
A real sandbox has the network restricted.
Every outbound request goes through a proxy that checks every request against the allow list. And a request to random endpoint will get rejected. Also,
everything is logged. So if something goes wrong, you can clearly see it in the logs.
The agent should never see your secrets.
The secrets should live outside the sandbox. So when the agent makes a
sandbox. So when the agent makes a request, for example, to GitHub, it sends the token placeholder value instead of the real token and the proxy will catch that on the way out and will
get the real value from the secrets broker and it will send it to GitHub.
The response will come back through the proxy to the sandbox. So the agent will get what it asked for but without ever knowing the token. And if the agent gets
compromised, there's there's no token to to leak because the token was never exposed to the agent.
And finally, the model is the brain to the agent. If you can't see what's what
the agent. If you can't see what's what what goes in and what goes out, you have no audit trail. So every sandbox routes its model calls through the same gateway
and every prompt and every response gets logged in in in the gateway. So when when for example sandbox A starts behaving
strangely, you don't have to to to to guess what's going on. You can just open trace for sandbox A and you can clearly see what the agent asked and how the
model responded.
So the reality is that the agent will get compromised.
The only question is what it can reach when it does. So you can try to build your perfect agent or you can put it in
a sandbox and sleep well. So choose nice uh choose wisely. Thank you.
Thank you van.
Next up we have Vashant Kameeshwaran who's a co-founder of Gravile along with Rohan who's also from Grapile. They're
going to be talking about what they have learned from analyzing 5 million vibecoded PRs. Um so once they are set
vibecoded PRs. Um so once they are set up and ready to go, we'll have both of them talk about that.
All right.
Hello everyone. Uh, I'm Vishant, co-founder and CTO of Greile.
And hi, I'm Rohan. I'm a researcher at Greile.
And today we're going to be talking about what we learned from analyzing 5 million vibecoded PRs.
So, at Reptile, we're building AI agents that review and test pull requests.
We're reviewing four billion lines of code every month for companies like Nvidia, Coinbase, and Meta. And there
are 100,000 bugs that are identified by Reptile and fixed every single day.
AI agents have evolved a lot over the last few years. In 2023, we were still working with, you know, quite simple agents that are able to generate short code snippets for us. In 2024, we
started to see the rise of agents that are able to make small multifile changes. And since 2025, we've entered a
changes. And since 2025, we've entered a new age of fully agentic coding. AI
agents are now able to create uh to go directly from spec to PR.
But this leads us to wonder, are these fully vibecoded PRs actually any good?
How are they being adapted adopted by industry? And in what ways are they
industry? And in what ways are they succeeding or failing?
So we have over 5 million PRs in our database. So we're well equipped to
database. So we're well equipped to answer this problem. Uh and the first thing we need to figure out is how you know if a PR is vioded. Uh and so we rely on three key signals to figure this
out. Uh the first is the GitHub author
out. Uh the first is the GitHub author field. So uh often bots will just add
field. So uh often bots will just add themselves as co-authors to your commit.
And this is a very surefire way to tell whether the bot uh vibe coded the PR. Uh
that being said, it's a pretty sparse signal. Only about 1% of PRs in our
signal. Only about 1% of PRs in our database were able to be identified this way. And so obviously many more than 1%
way. And so obviously many more than 1% of PRs are vioded. We needed a stronger signal. And for that we move to looking
signal. And for that we move to looking at the PR descriptions themselves. Uh
often bots will add uh notes in the PR description saying that they contributed to the PR and that's another helpful tell that the bots coded the PR. Uh this
was a much more frequent signal. About
20% of PRs in a database were able to be identified this way.
And lastly, if you've used codeex or cursor recently, you'll know that any branches that they create uh will have their names in the prefix of the branch.
And so this is a very easy tell as well because humans are unlikely to make branches with these names. Uh and so putting these three signals together, we found that about 27.6%
of PRs that were written in April uh had strong evidence of being fully vioded.
And that's a very interesting number.
But it's even more interesting if you look at the history of this number uh since the beginning of multifile agent systems. As you can see, it's been going up rapidly and we expect it to continue
to go up rapidly. Fully agentic software engineering is the future after all. And
so, if this really is our future, it begs the question, are these PRs any good? Uh, you know, do we expect to see
good? Uh, you know, do we expect to see a significant degradation in the quality of code uh because of this reliance on agentic systems or are they actually better at writing code than human beings
and we're too scared to admit it?
So in order to answer this question, we first have to ask ourselves, what does it even mean for a PR to be good? And we
tried to quantify this in a few different ways.
The first met metric that we looked at was the revert rate of these PRs. Uh
typically when a PR is reverted, it means that it caused breaking changes in production or caused issues downstream.
So we looked at the breakdown of revert rates by author and we found that some agents were actually able to uh have their PR reverted at lower rates than the human baseline uh namely cla and
codeex.
We then also broke this down by the size of the PR as measured by the number of files that were changed. Interestingly
we found that AI agents on average uh had their PRs reverted at lower rates than humans the larger the PRs got.
Another very interesting signal of the quality of a PR is in the gre comments that it receives. So as Vishan mentioned at Grapile, we review pull requests and in the process of reviewing those pull
requests, Grapile leaves comments on your code like a human would. Now
Grappile also rates those comments on a scale of P 0 to P2 where P 0 is a critical codereing change and P2 is a nit. Now you can imagine that if a PR
nit. Now you can imagine that if a PR receives many P zeros or many critical bugs, then that is a lower quality PR than one that receives only a few knits or no comments at all from reptile. And
so uh to look at this as a metric, we broke down the the severity of bugs produced by each bot um and looked at that compared to human baseline. As you
can see, each bot uh or rather the majority of bots produced fewer critical bugs than humans on average. And this is interesting. It means that you know on
interesting. It means that you know on average if you're hoping to avoid codereing changes things that will take down production uh bots are actually more reliable. That being said if you
more reliable. That being said if you look at the entire distribution of severities only some bots were able to avoid bugs of all severities compared to the human baseline. So again it's
unclear whether on aggregate bots are better or worse than humans at writing code.
The third metric we looked at is how many rounds of review it took for these PRs to get merged. Here we defined a review round as essentially uh the bot
opening a PR, a human leaving feedback in the form of comments on that PR and then the bot going back and making changes to address those concerns. This
helps us understand two different things. One is how good the bots are at
things. One is how good the bots are at writing good code uh the first time through and the second is how well they're able to incorporate feedback and make changes without introducing new
bugs.
We once again broke this down by bot author and we found that a few bots were actually able to get their PRs merged on average more quickly than humans.
Namely, Devon and Claude we found uh were the best on this metric.
So, so far we've looked at a couple different metrics for whether bots are better or worse at writing code than humans, and we've seen that there's no real conclusion. Some bots are better at
real conclusion. Some bots are better at writing code than humans based on some metrics, but they lose on other metrics.
The winner is kind of unstable. It
depends on what you're measuring. Uh,
and so perhaps the right question to ask is not are agents in aggregate better at writing code than humans, but perhaps the question to ask is how do bots produce bugs? Do they look different
produce bugs? Do they look different than human beings and in what ways? And
so to investigate this question further, we looked at the breakdown of different bugs that each bot makes compared to a human baseline. So namely uh if you
human baseline. So namely uh if you compare the rate of bugs to the humans, red over here means that the the bots make more of that type of bug compared to humans and blue means that they make
less and the intensity of the color uh corresponds to the magnitude of that change. Now as you can see uh the kinds
change. Now as you can see uh the kinds of bugs that each bot makes vary widely depending on the bot. So for example, cursor background agents are much more likely to make N plus1 query errors,
whereas clawed agents are much more likely to make missing tenant check errors. There is no one clear bot that
errors. There is no one clear bot that necessarily wins across every single metric. And you know the shape of each
metric. And you know the shape of each bot looks different. Now what we've learned here is that bots make different kinds of bugs than humans. Not
necessarily better or worse by all the metrics that we looked at before, but different. And so one thing that we
different. And so one thing that we haven't talked about yet is that bots just allow you to ship code much faster.
So if the quality is roughly the same, albeit differently shaped, and the magnitude is greater, then I guess we can say that AI code agents are actually good. They allow you to write much more
good. They allow you to write much more code, except that you have to be mindful of the kinds of bugs that they make.
AI agents are writing more code than ever. And as Rohan mentioned, the shapes
ever. And as Rohan mentioned, the shapes of bugs that they create is different from humans. It's clear that as AI
from humans. It's clear that as AI coding uh scales into the future, your your code validation systems need to adapt and scale as well for the AI
agentic future.
At Reptile, we're helping thousands of companies manage their everinccreasing scale of AI code using AI code review.
We spend a lot of time understanding the strengths and weaknesses of the individual models. so that we can use
individual models. so that we can use them in tandem to help catch more bugs and create better quality code for everyone.
If you're interested in learning more about what Gretell does, check out our website at guptell.com. Uh, and if you're interested in chatting with us some more about what the future of AI coding and AI code review might look
like, uh, please come find us uh, at our booth and we'd be happy to chat some more. Thank you so much.
more. Thank you so much.
Thank you so much.
Next up, we have Yunong Zang, who's a research consultant with Sonar. Uh,
Yunong will be talking about AI agents in your code quality pipeline, uh, shipping, securing, and measuring them.
Um, Yunong, stage is yours.
right um good morning everyone um I'm in uh so I'm a research consultant from Sonar and I'm also a final year Ph student at a US
um so today I'll be talking about uh a AI agents in your code quality pipeline um specifically I'll talk about how do we secure And also how do you review the
changes made by these coding agents. Um
so um opinions here are my own does not reflect any of sonar and i um for the standard disclaimer disclaimer.
So um here uh is a very high level diagram. So if we think about how the
diagram. So if we think about how the code are making are being made and being merged into repositories. Uh these are roughly the very high level three steps.
So the agents will write code and almost always now agents will review the code because there are just too much of them and then human may decide whether to merge in merge them in or not. Um so
today I'm going to talk about two aspects within this pipeline. First is
when agents are writing code how do we build an agent in sonar called sonar remediation agent uh which fixes sonar cube issues and then I will talk about how do we evaluate code reviews
generated by agents in a more reliable fashion.
Um so um so here is the first part it's the uh sonar cube remediation agent. So
um basically the workflow is that sonar cube which many of you know is a very widely used study analyzer to scan your code. So sonar cube will find all your
code. So sonar cube will find all your issues in your PR and then you can invoke sonar cube remission agent to automatically generate patch for you. So
this agent will open new PR on top of your existing one and then suggest changes to improve it. So the screenshot on the right shows that uh shows what this agent will look like. So you open
this PR tells you about which issues in solo cube it has fixed and then um give you these patches hunk by hunk and tells you an explanation of why this patch is
fixing this issue. Um so we have released this in open beta and we have re received a lot of feedback from uh customers.
So um the one thing I want to talk about more today is how do we secure these agents uh when we go when we put them into production. Um so because these
into production. Um so because these agents work on a lot of enterprise code we want to make sure that there's really no security issues when we deploy and run these agents. So we have heard about sandboxing agents that is very important
uh we use it when we deploy this. Uh but
I want to what I want to say is that we also want to build security in depth.
That means we build layer security after we deploy sandbox. We also built in security within the agent and after the agent ship the code. So uh here are a few things that we we have done within the agent. So one thing is that we are
the agent. So one thing is that we are building a very constrained workflow for this agent because we know that it's going to work on a very concrete scenario which is fixing the son issues.
So there's no free terminal meaning that the agent cannot just randomly assess internet and execute arbitrary commands.
And also we consider codebase as an attack surface not just MCPS and skills but the codebase. Uh so imagine if someone uh opensource comm uh
contributor is open a PR in your repo and then this this person turns out to be malicious. So they can actually
be malicious. So they can actually inject malicious commands in the PR that they send to your repository. So um
that's one thing we considered. So when
we actually run this agent, we'll replace all of these commands uh into some other uh identifier and then swap back these commands after the agent is
done. Um and also um we we uh we want to
done. Um and also um we we uh we want to deal with this supply chain attacks. So
this is for the scenario where uh if I'm a malicious actor and I'm um stop squatting and running repo in pip and I want to avoid agent from importing those kind of repositories. So we build a lot
of import guards to make sure that the agent does not import these malicious libraries.
Um so that's what happens inside agent and here is how we verify the agents patch after it's done. So um when agent generates a patch we run solar cube
analyzer again on this agent generate patch and then if we find regression or we find any security issues the agents is tked to retry with feedback from the previous iteration and then it's only
sent to to developers when the quality gate passes.
So that that was the first part uh I want to talk about. Um so now we we want to switch gears to how do we evaluate this code reviews. So this is becoming a real bottleneck now because agents are
making a lot of PRs to your repositories and then it just impossible for human to to review all of these PRs. So a natural way is to use uh AI review tools to help you review the the PR but there are so
many of them and how do we know which one are better at your specific use case and how do we reliably evaluate them. So
this is a question we want to research on. Um so here are what uh existing
on. Um so here are what uh existing methods do. So uh if you consider a
methods do. So uh if you consider a scenario where we have some historical PRs, human have made some comments on them and then we run this AI review tools and we want to see whether AI tools are catching the same error as
human did. So of course if they catch
human did. So of course if they catch more similar errors they're better. So
here are a few metrics of what people previously did. One is that we can check
previously did. One is that we can check text similarity. We can see whether the
text similarity. We can see whether the area review tools in natural language are generating similar tokens in semantics compared to humans. But as you know if we even if we point to the same
issue this issue can be word can be worded in very different ways. So this
token similarity sometimes doesn't work and also u we can also consider localization. This means that taking a
localization. This means that taking a PR we compare each line at the location where these bots and also the humans are making comments and we say that uh the bots are good if they're making the same
commands at the same location as humans.
But again this does not tell you the semantics that's only the location. Um
also easy way you can take LM as a judge. Uh you can ask the language model
judge. Uh you can ask the language model whether these two commands the same. Uh
so it works sometimes but it's hard to tell whether they are really reliable or not. Uh so uh the the gap here is that
not. Uh so uh the the gap here is that we want somewhat determines the way of checking whether a real command is good.
Um so um this is what we did. So we we built a new benchmark uh called Crap. So
it also takes works in a similar scenario meaning that we want to check whether AI generated reviews are catching similar issues as humans but the core idea is that instead of using
language model component we are turning every human review into one executable test. So here is a concrete example
test. So here is a concrete example taking this PR line on the left uh if this is what someone changed the the codebase and the human review will say that okay this may raise some more
inputs like this. So instead of safely returning false so this is a concrete improvemental code uh that human review has human review has suggested. So uh
correspond to this we will generate a test on the right. So this test basically corresponds to this review command meaning that if this review command is addressed this test will pass
otherwise this test will fail. So now
now for all these PRs we don't have human reviews anymore. We have all these executable tests. Now the second part is
executable tests. Now the second part is how do we actually evaluate AI reviewers based on this input. So we show the the AI review tools the PR ask you to make
comments and then we take another coding agents to improve the code based on these comments. Now we have a different
these comments. Now we have a different version of code which is improved according to the AI based commands. Then
we run this executable test to check whether this updated version code is good and how many of these tests are passed. So in this way we can tell how
passed. So in this way we can tell how many of the human commands the air reviews tools has have catched.
Um so um here is the results we have gotten. Um so the concrete numbers for
gotten. Um so the concrete numbers for each tool is not that important because these review tools are getting better and these language models are getting better every day and this these numbers
are are obtained in the early of 2006.
Uh so the part I want to highlight is this number on the right. So if we consider all of these review tools together, they addressed 41.5% of what
human review human reviewers have pointed out. So this actually means the
pointed out. So this actually means the these um the current review tools does not capture even half of what the human reviewers has pointed out in in the past.
Uh but this is not the full story. So
other than this number, we actually look into all these AI generated commands and see the quality of them because they can also point out to other errors that humans did not identify but they are
still valuable. So uh we further look
still valuable. So uh we further look into all these review comments generated by humans and AIS and put a categorization around them. So this
categorization is beyond bug fixes. So
we basically put them around security, efficiency compatibility robustness and so on all the way to about documentation and the maintainability of the code. So uh and this diagram shows
the code. So uh and this diagram shows uh what how each review tools is doing compared to human reviews. So we can see that AI is actually doing very well on robustness of and testing. So they will
suggest you to test more code. They'll
point out edge cases in code and ask you to add them. So this aligns with my personal experience as well. So AI is very good at pointing out things that I didn't notice before. But then on the other hand, human reviews are very good
at maintainability and design compared to AI. So so they they'll talk about how
to AI. So so they they'll talk about how this code is not maintainable anymore if you add so much changes. You should and you should organize code in a different way because human reviewers has more contextual knowledge about the codebase
compared to AI reviews.
So um the takeaway here is that for now uh we still we still should use AI and human reviews together. Probably AI
reviews should use as the first layer and then human reviews can look at these specific categories that AI reviewers were not so good at.
Yeah. So uh yeah that's all I want to talk about. Um so I I talked about how
talk about. Um so I I talked about how do we build agents where we focus more on control and safety and also how AI and human should work together for code review task for now pro probably in the
future we can have AI tools which are trained focus more on these aspects that they are missing right now but for now I think this should be a solution that we build layered reviewed uh on our codebase. So these two are the QR codes
codebase. So these two are the QR codes to our paper. So we have a research paper on each of these topic. Um feel
free to read if you're interested. Uh
and happy to chat after.
Thank you so much. Next up, we have Singapore's very own Eugene Chia from Featherless, who will be talking about how open-source models are here now, and
it's time for Singapore to build Apologies.
Apologies for the technical difficulty.
Um to me AG when AGI is actually truly solved, right? These things will be
solved, right? These things will be resolved and so is it printers like it should never happen.
All right.
All right.
All right. Hey, I'm Eugene. I'm going to talk about open source models and why they are here and why Singapore should just build. Um, due to the limited time
just build. Um, due to the limited time span, uh, I may slightly lean into English. I might say you go fast and I'm
English. I might say you go fast and I'm just going to do kick it off with a live demo. And for this live demo, I'm just
demo. And for this live demo, I'm just going to like very quickly do a simple web game. Um, but what is more
web game. Um, but what is more interesting here is I'm not going to use the best Frontier models. I'm not even going to use the best open source model.
I'm going to use the Quen 27B and the Gamma 431B that can run on your laptop.
So, I have the prompt here.
I'm just going to quickly get that running and get that running and I hope my internet didn't disconnect on me. Okay,
so as you can see um I'm using client which is one of the open source coding agents uh uh that is integrated in VS Code. Uh you can use anything uh that's
Code. Uh you can use anything uh that's not the point of this uh demo. The point
is really just to show that these are models that you can use today to actually just build stuff. So trying to wait for this one to Okay, fine. Plan
finish. It plan finish it. I didn't I didn't even check it. Okay, so these are models, right? More importantly that you
models, right? More importantly that you can run on your laptop. So this is an example of MM Studio with the gamma 31B that uh that is running on my on this laptop itself. You can run it on a Men
laptop itself. You can run it on a Men laptop. You don't even need the highest
laptop. You don't even need the highest end. Uh and this is the the the same
end. Uh and this is the the the same coin 27B except that it's probably faster if I run it in the cloud. So uh
I'll leave that one running in the cloud. Yeah.
cloud. Yeah.
Yeah. So a little bit about my background. I'm Eugene. Uh I'm I'm an AI
background. I'm Eugene. Uh I'm I'm an AI model creator. Uh one of less than a few
model creator. Uh one of less than a few hundred teams worldwide that have created AI models from scratch. Uh and
significantly in Southeast Asia, there's only really a handful of us. Uh founder
and CEO of Federalist AI uh recently did our series A at 120 mill valuation led by Airbus Ventures and MD Ventures. I
also co-lead the RWKB open source project, the first AI model under Linux Foundation and I'm born and raised in Singapore and a repeat startup founders.
Um, and I work in startup enterprise software uh, banks, open source space for over a decade. I fly at pretty much every month between the east and the west regularly.
What is Federalist AI? We are platform that provides instant access to the entire collection of open source model.
Today is 30,000 models. In the future we want to support all two million models or even three million. At that point our principle is that we shouldn't be choosing be the judge to decide for you
what models you want to use. You should
be able to decide for yourself. And so
this is something that we are scaling to provide access uh instant access to everyone and you and you can access us through hugging face and open router.
What makes this also interesting is that when you let the users have the choice of the model, the entire collection model, well, it's still in the early stage of 30,000 and we're scaling that
up. You get to observe what models that
up. You get to observe what models that people actually use when they given that choice. So that's pretty much the
choice. So that's pretty much the background for the talk like what the do people use open models for? because at
the end of the day right it's really about like getting those insights that I find more interesting respectively.
So to answer that question I'll split it up into two major segments. The first
one is which open model classes are being used. This is typically what
being used. This is typically what people find exciting when they are first entering the open source AI landscape because they like should I use the quen or the deep sync and things like that.
But this is probably one of the hardest metric for me to present because every time I do the slides it gets outdated like the next week. This was December when most of our traffic was dominated
by deepseek for consumers and for enterprise customers it was dominated by administr Nemoi. I think this is a very
administr Nemoi. I think this is a very interesting pattern because consumers like to instantly test the latest and the greatest and experiment where enterprises like to run things at scale
and so they focus on efficiency. But
soon after it got replaced and then like just a few days ago like gamma started exploding off the charts and and and this is literally the updated version
chart for that I had to update for the talk itself. Oops. Oh, okay. It it ran
talk itself. Oops. Oh, okay. It it ran finish. Okay. Uh and so this is a shout
finish. Okay. Uh and so this is a shout out to uh Ivan and the Google Tig team.
They did an awesome work with the gamma 31B.
And so what are these models then used for? Um oops live demo issues but never
for? Um oops live demo issues but never mind. Like increasingly we all heard
mind. Like increasingly we all heard about open claw agentic use cases that that that represents a huge bar of our traffic. The other major one one is AI
traffic. The other major one one is AI companion therapy and role play that actually virals the agentic claw usage but the agentic claw usage will be a lot
less users running a lot of agents where the AI companion space you'll be some usually commercial clients where they where a company will have thousands of users coding use cases these are based
on the metadata that we have like client and clot code and things like that we can see these kind of use cases and subsequently like 5% chbtt Oops. Once again because we do not
Oops. Once again because we do not perceive any prompt on the uh completion data we infer this number approximately.
So what is interesting beyond that right is down here I'm uh down here I'm representing by model classes but when you represent it by fine-tune models be
you may have heard of fine tunings to specialize the models for your individual use case or company use case you can see the difference in the chart respectively. What I find most
respectively. What I find most interesting is not the top one/ird or half which is usually all the popular models but the bottom half because if this inference market is going to be a
trillion dollar market this is where the things gets interesting and this is where we see AI models being support fine tuned to support specific region like we are proudly one of the providers for the sambar AI one of
Uganda's first language model or the Denu AI model which is an agriculture language model we also see use cases for medical for Open hands which is also trained in Singapore and also like for
security like Cisco foundation model respectively. So what what I find
respectively. So what what I find exciting about these trends is that more importantly open models are crossing the current sonet and mini line and operating of opus the level of
intelligence in laptops accelerating and long context cost is dropping. I'm a bit pressed of time so I'm going to move faster. This is basically open models
faster. This is basically open models matching sonet and approaching opus for for the AI model. Yes, still slightly behind, but it's almost there. But this
is the more interesting one. The two
models that I was running already surpass GPT4 encoding use cases. Sure,
they may not be GPT5, but mind you, they run on the laptop. Basically, the
models, the best models you see today will possibly be running on your laptop next year. That's the pattern that has
next year. That's the pattern that has been repeating in the open source space.
And that is why I'm going to skip this part. Uh that's why right like I this is
part. Uh that's why right like I this is an important thing that I want to stress to all the AI engineers here because let's just see the live demo. Okay.
Okay. So this is so this is one of the asteroids. Um let me see. This would be
asteroids. Um let me see. This would be the gamma 31B but let's just try uh opening for the quen 27B for example.
And you can see this is also another one. The fact that this works
one. The fact that this works on potentially running your laptop, right, is the significant thing because right now today all these models that
run your laptop can do UIs, APIs or anything else. And sure it may require a
anything else. And sure it may require a few retries. But if we want to make
few retries. But if we want to make Singapore the AI hub of the world or Southeast Asia, the problem is not the models, it's us. We just need to start
building. And that's what I want
building. And that's what I want everyone in Singapore to start doing.
Just build because there is no barrier.
Yeah, that's all. Thank you.
Thank you so much. Thank you so much, Eugene. Um, next up we have Max Buckley
Eugene. Um, next up we have Max Buckley who's the head of knowledge research from XAI.
Max will be talking about well his top talk title is November 24th, 2025, what comes next.
Max, over to you.
Hello everyone. Uh, I'm Max from EXA.
Um, I'm the head of knowledge research and I'm also in charge of the Zurich office which we're currently setting up.
This is more of an existential talk, so I won't really be talking about EXA here. Um, and this is not a typo,
here. Um, and this is not a typo, although it was asked many times, was a typo. November 24th, 2025, what comes
typo. November 24th, 2025, what comes next? Um, so what is November 24th,
next? Um, so what is November 24th, 2025? And that is the day Claude 4.5
2025? And that is the day Claude 4.5 Opus was released. And my position here is that that will go down in history as a day when things changed.
So my proposal to you here is that the game theory underlying sort of society is changing and Genai is driving this.
Um and I'll give a historic example with ChatgPT a few years ago and I give a more recent example with Opus. But
basically the institutions that we have were built on the assumptions that certain things are costly and these costs make certain things work right but when we remove the costs the systems
built around them they can fail to work they can crumble.
So one historic example of this was proof of work, right? We had a lot of systems that required people to make an effort in order to kind of prove they had made an effort. And by doing so, you
know, you would get people to learn in schools. You would find which com which
schools. You would find which com which people really wanted to apply to your company for jobs. You know, you could also, you know, know if someone was credible. Nowadays, if I get a message,
credible. Nowadays, if I get a message, like an email or a LinkedIn message and it's really well written, I don't think the person is really eloquent and really made an effort to talk to me. I think
the person just used an LLM. Whereas
previously the opposite was true.
Nowadays, if you get something with a typo, someone either got a model to generate typo text or edited it intentionally to make it more typoed.
And the reason I talk about this in game theory lens is that you can't opt out of this. Even if your university has come
this. Even if your university has come up with some claim like we don't allow Gen AI, you know, projects, that just means your students have to edit the delves out of it and remove the emphasis
dashes. So you can't opt out of these
dashes. So you can't opt out of these changes. They're coming for you. Um and
changes. They're coming for you. Um and
a similar shift is happening now in coding, right? So over the last sort of
coding, right? So over the last sort of eight years, we've gone from, you know, tab completion where you complete a line to completing a function to being able to ask it to generate a file to now where you have this coding agent where
you can just give it this highle prompt and it will run away for you know minutes to hours and build the whole thing, test it and verify it and come back to you when it's done. And this is
quite a shift and something that hasn't fully been kind of played out yet.
What's interesting is the models themselves aren't even aware of this shift. So if you use Claude, it will
shift. So if you use Claude, it will still use the time estimates that used to be true. So if you give Claude a big spec and say, "Here's a crazy idea.
Let's implement this research paper."
Claude will tell you this project will take 12 weeks. You then copy the markdown into Claude code and it wors away for 30 minutes and then it's done.
you know, clearly it hasn't understood how much the world has changed. And I
don't think this original estimate was wrong. Like I've worked at Google with
wrong. Like I've worked at Google with several like, you know, very good engineers where you would assign this to like a an junior engineer and it would indeed take them 12 weeks and that's 12 weeks of check-ins and iteration and
making progress.
Remember this concept of IT literacy? I
mean, probably I'm preaching to the wrong audience here, but it used to be the case that many people were scared of computers or found them difficult or hard. And the reason for this was that
hard. And the reason for this was that computers were hyper literal, right? If
you missed a semicolon or had something like a typo, the computer would just say, "That's not found. That doesn't
work. You're out of luck." Whereas it literacy was about helping people, normal people, get used to using a computer, like make them realize that yes, you missed the semicolon, but don't worry, you can just put it in and it'll
still work. No, an illegal operation is
still work. No, an illegal operation is not actually a crime. Don't worry. Um
but again one of the things that coding agents are driving and I think coding agents that term even undersells the potential is a shift here right because coding agents or just having an agent
running on your computer makes computers have a kind of natural language interface like normal people now have this one hurdle which is how do I open the terminal? How do I launch clawed
the terminal? How do I launch clawed code? And now they can use a computer in
code? And now they can use a computer in a way they never could before. They can
talk to it in natural language. It can
talk them through how they do whatever they want to do. How do they set up their printer on the network? How do
they, you know, take a screenshot? How
do they debug if their camera is visible or not, right? And this something they could not do before. And open source is next. And I say next, I mean it's it's
next. And I say next, I mean it's it's already happening. I mean, there are
already happening. I mean, there are people in this room who have talked about some of these facets, right? But,
you know, open source used to mean open to engineers. Now, it means open to
to engineers. Now, it means open to anybody who has a computer and is literate, which is quite a bit more open. Of course this comes with new
open. Of course this comes with new problems, new challenges.
So yeah, what used to be true like these are the assumptions of the world pre end of last year. So it used to be the case that software development was expensive.
There were few people who could code.
Those people were very skilled. Their
time was very valuable. Um so you know we basically every feature had an opportunity cost. There were whole
opportunity cost. There were whole pieces of organizations designed to ensure we work on the right things by for some definition of right. You know
there was endless debates by managers, program managers, product managers, technical program managers, whatever you want to call them about which project should we do, which ones should we depp prioritize, how much should we invest in
fixing the bugs versus how much should we invest in adding new features.
Similarly, software development was slow. So even a small feature would take
slow. So even a small feature would take you know a few hours maybe days you know a big feature could take weeks could take months a really big rearchitecture of a system could take years for
multiple people right and of course one nice thing about this was that like road maps kind of could align with this quite well right because the road map could be quarterly because effectively the work was quarterly I remember working in
Google and you know we'd assign maybe someone like four five six bullet points for their quarter that was the four five six things they were going to work on and do that quarter And usually they
would do like 70 to 80% of them. Um, and
so because of these two things, you wanted to prioritize ruthlessly. And
there were again systems designed to do this, right? You know, we'd have sales
this, right? You know, we'd have sales teams who were filing hundreds of issues, requests, features, ideas. And
then you'd have program managers sifting those hundreds or thousands down to 30.
Those 30 would go to engineering managers who would debate them and would draw a line and say, "We'll do the top 16." And those 16 are fanned out to like
16." And those 16 are fanned out to like the engineers on the team.
And so as I say the interesting thing here is that all of our like processes and habits and org charts assume this to be true. So all of this is going to have
be true. So all of this is going to have to change as these things change. And
yeah so basically this whole thing is based on this economics of scarcity right that you know every line of code was very valuable so we should you know prioritize things a certain way. Um you
know things like software as a service is a funny one. We've all heard about that it's in kind of danger now. And
it's interesting because you know with a good set of engineers you could build in theory like a uh like a workday competitor or whatever other software service thing you wanted to but the question is did you want to like were
you willing to commit several people for several years and several million dollars to try to build a basic version and then get to the challenge of selling it and convincing people to switch. Now
that's a lot easier to do and this makes people realize that the moat is not the code anymore but now it's going to be your brand your like go to market channels. I do think ML and data hold a
channels. I do think ML and data hold a lot modes will hold longer because it's much less clear how exactly where the boundaries lie right so it's harder to
kind of reverse engineer than something kind of deterministic and this scarcity thinking also is going to have to go right so this idea of whittling 30 ideas down to three by professional judgment
and then implementing three you know we don't need to do that anymore we can now build all 30 do good evals do benchmarking see which ones were actually worthwhile and you can revert the rest and we won't be so attached to
those that we revert because we didn't spend 3 months building it and our promotion case doesn't rely on it.
So yeah, the supply of software is going to explode. I mean this is not an
to explode. I mean this is not an original thought. Um there's tweets from
original thought. Um there's tweets from the COO of GitHub recently saying that as of the current like run rate, GitHub commits are up 14x year. It's like over
2025 which was already up 4x on 2024. So
it's 14x at the current rate and it's growing. So it's going to be even more.
growing. So it's going to be even more.
What's especially interesting here is that the the marginal cost of like a new tool is almost zero. So nowadays if you're given a task like maybe you need to label some data or debug an issue, you can quickly throw together a new
custom UI that you use for that task and never again. And this is crazy because
never again. And this is crazy because this UI might take 20 minutes for a claw to cook it, but it may make you like 10 times as effective at like labeling data or sifting through images or whatever else, right? Like because as a human
else, right? Like because as a human you're good with visual data and you're not necessarily so good with text or whatever else. And now we can just build
whatever else. And now we can just build all these niche apps that no quarter ever justified. Um,
ever justified. Um, so the bottleneck is going to shift to go to market and code review because now that you can build everything, so can everyone else. So people are going to be
everyone else. So people are going to be competing even more to get people to use their ideas, to see their ideas, to hear their voice. And code review has already
their voice. And code review has already been talked about, so I won't dwell on that right now. But basically, you know, code review is just again struggling because of the amount of code we're producing. And of course, AI can also
producing. And of course, AI can also help with this. So what I think is valuable now, what I would invest in is statistics. So statistics was always
statistics. So statistics was always very valuable at big companies like Google. There was always, you know, some
Google. There was always, you know, some team, some people building statistical tooling for evaluating experiments and then many engineers that would just rely on that tooling. They would just opt in.
Now it's probably more useful as a more broadly distributed skill because everyone can be evaluating all sorts of things in many different ways. And
evaluating here could be different things. It could be profiling for
things. It could be profiling for performance. It could be benchmarks, AB
performance. It could be benchmarks, AB tests, user behavior metrics, these things. Uh ideation and taste is another
things. Uh ideation and taste is another important thing. So basically the idea
important thing. So basically the idea of what to build, having ideas is going to become even more important. And then
of course iterating on these ideas and jagged. So my final point here is just
jagged. So my final point here is just that the val the specific value of knowledge I think will change. We're
going to move from deep technical expertise when you really detailed know the exact syntax of something to knowing what exists, how and when to use it.
Because with these models, if you prompt them kind of generically, they will kind of often give you a sort of generic response. Whereas when you prompt them
response. Whereas when you prompt them with a sort of the right words, it kind of unlocks this strange potential. Like
my final example is statistics. If you
say please, you know, benchmark my change, it will often do n equals one, run it once, run it twice, see which is faster. If you say use statistics,
faster. If you say use statistics, suddenly it starts spouting things like p values and t statistics and all of these other things and large sample sizes and it goes crazy. Yeah. So that's
that. Yes. So basically the question is no longer can you build it. The question
is what should exist.
Thank you.
Thank you so much to Max from XAI.
Next up we have Mark Doyle who is a software engineer with Stripe. uh as you may make your way over to the stage and Mark will be sharing uh a little talk
about Minions uh not quite Minions the movie but Minions which is Stripe's oneshot endtoend coding agent platform uh you'll talk about how they're building it why they're building it what
the reasons behind it and some thinking behind how they think about coding agents Hi everyone. Uh thanks so much for
Hi everyone. Uh thanks so much for sticking around. I know it's nearly
sticking around. I know it's nearly launched so uh hopefully you can keep this uh really interesting. Uh I work at Stripe on our coding agent platform. Uh
my name is Mark. Uh so roughly anything to do with uh writing code with agents and uh the whole software engineering life cycle uh with coding agents I'm roughly involved in um just to before we
start talking about what we'll talk about today which is oneshot coding agents. So uh going from a prompt
agents. So uh going from a prompt straight to a PR and one shot just to frame the problem a little bit at Stripe that uh we process close to 2% of the world's GDP on Stripe. So even though we're trying to be on like the really
bleeding edge and the forefront of AI with using the models, uh we have like really big obligations to our users and our customers and even the broader global economy to you know maintain a quality bar and a security bar. Uh so
that's definitely our like number one thing we keep in mind while we're building all this. That said though we have 91% of Stripe engineers are writing code with AI on a daily basis. So 100%
of Stripes are using AI in some form during the software authoring life cycle. Um, but every day we have 91% of
cycle. Um, but every day we have 91% of our engineers merging code with AI. In
the last year, we've seen a 500% increase in the number of fully AI generated pull requests. Um, so today, yeah, we're just going to talk a little bit about like how we're making that happen. Um, and how oneshot agents are,
happen. Um, and how oneshot agents are, you know, enabling that for us.
Um, one shot coding agents are sort well-known term in the industry, I guess, but something we use uh internally a lot is um creating a PR when you go from just straight from a
prompt or a slack thread all the way to the poll request uh just without any interaction. So we all we in Stripe also
interaction. So we all we in Stripe also have the harnesses much like I'm sure all of you have like clawed code, codeex, cursor um we use those as well but we see those as kind of like a co-pilot harness. So that's when the
co-pilot harness. So that's when the engineer is sitting in, you know, in tandem with the harness working in like an iterative manner. Oneshot coding
agents are specifically for when we think uh the engineer knows roughly what the pull request or what they're trying to achieve looks like. We we don't need them to sit in tandem with the harness
like for extended periods of time. So we
think it's a little bit wasteful for engineers to be juggling like tons of different teamwork sessions connecting to different agents on different boxes when maybe they could have the planning session with the agent up front and then
just kick off this oneshot um experience and not do not have any involvement until they get to the code review phase.
Um so yeah, our goal is to just like save our engineers time. You know, we don't want them spending time like spinning up new development environments, creating branches, pull requests when they already know roughly what code they're going to write. We
want to offload all that work to the agent, not just the actual writing of the code. Um, so I'm just going to give
the code. Um, so I'm just going to give you an example of me using one of our oneshot agents. So here I'm
oneshot agents. So here I'm investigating a problem with one of our MCP tools with Stripe. It's just uh this is like a very simple example just to show how we do it. Um, we have these
like agents in Slack where uh we can say hey what's you I'm seeing this issue.
What might be the problem here? Um, so
straight away the agent will come back.
It'll read our code, read our documentation, and say, "Oh, look, this is seems to be the issue you were looking for." It's just literally, in
looking for." It's just literally, in this case, a threeline uh or three character diff. It's a very like
character diff. It's a very like straightforward change. And right now,
straightforward change. And right now, the developer in me in the scenario knows that um this change is very simple. Like it's it can be implemented
simple. Like it's it can be implemented by roughly anyone. You probably wouldn't even need to be an engineer to make this change. So we don't want our engineers
change. So we don't want our engineers now to, you know, spend the next 10 minutes, creating branches, spinning up an agent, explaining the problem again to an agent, copy and pasting this context. We just want them to literally
context. We just want them to literally be able to say, "Hey, go fix this issue.
Once you've come back with the pull request, I can approve it and uh or get my colleague to approve it and merge it." Um, so Devbox in this situation is
it." Um, so Devbox in this situation is just analogist to minion, which is what we call our one shot agents.
Um, and the developer then can expect sometime later to see a response like this where the minion comes back and says, "Hey, our process is completed.
Um, go check out like what code I've written." So, the developer didn't need
written." So, the developer didn't need to be in the loop for any of this.
And that's like a little bit about the the like philosophical side I guess of why we want to do this like why we think we're saving engineers time. And now I can explain like how do we actually
achieve this outcome. So u we saw in the this previous message when I instantiated the agent we see this like message the agent says hey one second I'm cooking I'm working on your task how
do we go from you know that message to actually getting a pull request that the engineer can review so is really lucky we've been investing in dev boxes which are are remote developer environments so stripe
engineers don't write code on their laptops they write it on remote developer environments and we could probably give a whole talk about why we need these stripe has like a super large monor repo one of the biggest git
repositories in the world. It's close to 300 million lines of code. So like if you clone down our repository, it's like 90 gigabytes. Um it takes a long time to
90 gigabytes. Um it takes a long time to generate our code. So we kind of need to have these remote developer environments. So every time you want a
environments. So every time you want a fresh branch or something, you can just get it straight away. We have a pool of them. They're ready to go. And we're
them. They're ready to go. And we're
really lucky. We invested in these for years because it turns out now they're, you know, really good homes for agents.
And agents can be really comfy there.
They have all their tools. Um these are not lightweight sandboxes like what we see a lot of in the industry today.
These are quite large like developer machines are uh lots of cores 64 to 128 GB of RAM pretty big machines pretty capable of um like for large scale engineering tasks um and every minion
gets their own dev box so they have their own home there um where they can you know one from a security standpoint are safely isolated um sandbox etc. And then two it's just like a good
environment for them to write code in.
So once we've given the minion some compute the dev box so like a computer to run on it needs like a file system it needs a a shell we've given it that with the dev box the first thing we do to try help it operate in this giant codebase
is we hand the prompt or the slack thread all the context we can gather so say in uh the example I showed it was a slack thread where another agent had you know uh searched the codebase given some
context maybe there could also have been a a ticket mentioned a pull request mentioned some other context from a colleague we gather all that information and we hand it to this analyzer agent you see here that analyzer agent you
know gathers all that context and says okay this is where I think we need to point the agent this is the right part of the codebase and that's when we start uh the actual implementation phase so once we've figured out where we're going
to write the code or like approximately what the task look like just summing the whole um contents of the slack thread or wherever we started the minion from into some into a prompt uh we can start this
minion loop so the minion loop is the process of making sure we always produce a pull request and the agent doesn't stop in the middle. And this is what the minion loop looks like. So we start at
this white arrow at the top where we take that context I just uh explained where you know everything from the slack thread and we give it to this coding agent you see in the white box. It's
just a regular coding agent. It takes
the you know as you as maybe claw code or codeex you're very used to using does takes the prompt conversational context and tries to you know advance it makes a turn tries to advance towards a goal. um
after it's you know advanced towards that goal we make it run lints we make it run tests and type check and then we stop and we don't go back to the human this is sort of the difference between
oneshot agents and uh you know co-pilot agents is here we pass the result to an LLM judge which is this orange box you see at the bottom of the screen and the
lm judge literally takes the prompt that the original author gave to the minion and the current git diff or the output that's been produced And we just ask it is this task complete? So it doesn't get
its context doesn't get poisoned with all the like information conversation all the you know excuses that the coding agent might come up with for why it stopped working or why this task is
impossible etc. Um it is lally just a you know unbiased judge that says is this task complete or uh has it failed.
Um if the task is complete great we can you know create the pull request and go back to the engineer and say hey it's uh it's ready for your review. Um, I'm
finished here. If it's not complete, um, we have a diagnostic agent that looks at the, you know, looks at the output of the LLM judge, looks at what happened in the coding agent session and the original prompt and says, "Oh, uh, this
isn't finished because some test failed or this isn't finished. This hasn't
finished yet because it's you've actually implemented the wrong thing.
You know, you uh created an API endpoint, but you didn't wire up the front end. Obvious things coding agents
front end. Obvious things coding agents will miss." Um, and then we take that
will miss." Um, and then we take that context from the diagnostic agent and put it back into the loop. So, we run this loop as many times as we need. And
we try to keep the input from the diagnostic agent very short. Uh, so it doesn't blow the context window. But we
keep running this loop with the diagnostic agent, the LM judge, and the coding agent. Just keep running it until
coding agent. Just keep running it until we have something that resembles a pull request. It's not always going to be the
request. It's not always going to be the case that the pull request is correct, but at the moment at Stripe, we're merging roughly 65% of minion pull requests on one shot. So 65% of the time
a stripe engineer starts one of these it's being merged without any human intervention. So it's getting pretty
intervention. So it's getting pretty good. As the models get better um we see
good. As the models get better um we see this working more and more. Uh usually
it's like the engineer will then want to if it's not you know successfully oneshot it the engineer will want to jump in um and make some changes. So to
that point we have a web interface for you know you can continue steering the conversation. You can also uh you see in
conversation. You can also uh you see in the top of the screen here like open the box the minion was spawn spawned in on VS code or in a terminal. And so that lets the engineer take over in the case
that the minion like failed to one shot.
And so that's kind of the story of how we take this like little coding agent, give it a give it a place to live and then produce these like oneshot pull requests. Um we're merging like 3,000
requests. Um we're merging like 3,000 pull requests a week at Stripe with these. uh it's really like valuable for
these. uh it's really like valuable for um you know saving our engineers time of solving the really small problems and even bigger bigger tasks that the engineer already believes that the agent can oneshot or uh can it can the
engineer can provide significant context up front that lets the oneshot PR happen.
Um, so if you're building systems like this, there's probably a few lessons we can give you to take away.
Uh, we learned that prompts are really good. So in all of our agents here, like
good. So in all of our agents here, like the LM judge, the um actual coding agent itself, etc. We have like very detailed prompts as you can imagine. We've
thousands of clawed and agents.md files
around our codebase. They're very
valuable. However, if you're writing uh one of these these loops like a minion loop and you're constantly making prompts that look something like please please run the test before you make a
commit, don't push and run like an expensive CI run with you know before you've run the test yourself or um please format your uh commit messages in a certain way. You're you know writing
in screaming case all capitals you're trying to really trying to convince a coding agent to do something. In that
case, we uh really think deterministic instructions are just far better for this. So, anything you can make
this. So, anything you can make deterministic, please do it. Um it's
really it really helps the agent be successful. Uh trying to argue with
successful. Uh trying to argue with agents for things is usually not a great it's kind of like a code smell. Um
especially if it involves security things. Uh so yeah, deterministic
things. Uh so yeah, deterministic instructions for writing these kind of loops is like absolutely critical and it just lets the the process be so much more reliable. If you're, you know,
more reliable. If you're, you know, building your own workflow, it may be fine to rely on these like screaming case um context files, but for doing this at scale when you have like thousands of developers running
thousands of minion runs, uh this has been really useful for us. Uh our second takeaway is that uh developer tools are always super important. So at Stripe, we've always been really lucky that
invested pretty heavily in developer tools for a company of our size. Um for
example like Stripe open source sorbet which is a a static analysis type checker for Ruby. It's like analogist to typescript for JavaScript. Um lots of tools like this Stripe was built to
boost our development uh velocity over the years. But more so than ever this is
the years. But more so than ever this is like so much higher leverage. So now we see that like these tools are like you must have them. So if you don't have like good uh compute primitives for your
agents to run on like for us dev boxes, you don't have static type checking, linting, all these things we expect to have as professional developers. The
better your tools are, the more you can do agentic development. So if you don't have these things, it's no longer like, oh, my engineers are losing an hour a week to it. It's you're losing like thousands of agent cycles that are
failing or, you know, are taking much longer than they would have before. So
now we're like doubling down even more than we had on building like even better llinters formatterers analysis all these kind of non LLM related things that are mostly static analysis. Um so
that's been really valuable for us. Um
the last takeaway we have is that um building on Slack has been really valuable for us. So you saw earlier in my uh presentation we have this at devbox or at minion slack message where we can kick off a minion. that's been
super valuable for educating like all our engineers about using AI and non-engineers as well can kick them off.
So that process of um sort of building in public and sharing with our engineers that hey it's you know maybe you were about to go down the you haven't been reading on Twitter the latest and
greatest in AI um you maybe would have gone down the path of opening up your editor and manually making this change or using um tab completion or something.
Now all our engineers see other engineers working in public and just tagging these minions being like, "Hey, go do this thing." That's been really helpful for like helping get our, you know, really large um organization on
board with using AI for lots of tasks.
Um that's been yeah, working in public within your companies has been super useful as well. Uh finally, we have a booth over in the rest of the conference. So if you'd like to come
conference. So if you'd like to come chat to me about minions or ask any questions, uh please do. Also, if you think uh working with minions or on this platform is interesting, Stripe is
hiring uh jobs. We're actually hiring an EM or engineering manager for my team specifically. So, if this sounds really
specifically. So, if this sounds really interesting, uh you should come work with us. Um I'd love to work with you.
with us. Um I'd love to work with you.
Uh we also have a giveaway here if anyone's interested. Uh you can come by
anyone's interested. Uh you can come by our booth after and check it out. But,
uh thank you very much.
Thank you so much, Mark. Um, our last talk before lunch. I know everyone's very hungry, but lunch won't start till this talk ends. So, I hope you guys give Liha the time uh for him to present.
Liho as you said up uh Lihao is a software engineer with a company called similar and he will be talking about from playing solitaire to operating ERP
software. Why does your computer need to
software. Why does your computer need to learn to click and type? So similar is building tools that uh are really good for computer use and Leha will show you
how that works. Leha over to you.
Thank you. Thank you so much. So, how
many hours a day do you think you've moved your cursor around the screen?
Anyone?
Five. So, we if a few months ago, we did an experiment with a group of friends.
Some of them are like you, right? AI
engineers, builders, and we also have doctors, admins, accountants. And track
them, see how how much time they spend on moving the cursor, right? And this is what we found out.
five hours a day. We have someone who move them their fingers on a trackpad for more than like 5 hours a day. That's
more than onethird of your time awake, right? Not not creating, not thinking,
right? Not not creating, not thinking, but just moving, clicking around. Sorry.
Clicking around, navigating, right?
Scrolling through tabs, uh, through menus.
So we've we've put a lot of our work into this digital space but the way we interact with it is still incredibly manual. the
PC. We have PC in 1981, right? Suddenly
we're able to do things that we used to take hours within minutes, right? It's a
big leap and we freed ourselves from filing cabinets and paperworks. But now
look at us 40 years later. We're still
clicking, scrolling, navigating around, right? Five hours a day. We traded off
right? Five hours a day. We traded off like one kind of manual labor into another kind. So we need the next leap.
another kind. So we need the next leap.
So what would truly an efficient way of interacting with computers, right? What
if you don't have to interact with the computers at all? What if the computers can operate on its own? It can see the screen, understand the task and just do
it. And that's what we are building here
it. And that's what we are building here in similar. We call it an autonomous
in similar. We call it an autonomous computer. Right? So this is what keeps
computer. Right? So this is what keeps me exciting and this is what we're building. So my name is Liha. I'm a
building. So my name is Liha. I'm a
techn a member of technical staff at similar uh we're building the infrastructure for autonomous computers.
Right. So last December our research agent agent S3 has achieved a surpass human level performance in um in OS
world which is the standard test for computer use.
So what does an autonomous computers look like right? So let me show you this is s this is our um product. So on the left you can see that the screen we have this chat interface where the LM is
trying to understand the task. It's
trying to play a solitire. So it's
trying to look at the screen understand what's going on and try to see like what's the next best move and trying to figure out how to move the mouse and drag the cards. Right. It's on the right
is the machine that's running on and so hopefully in a minute or so if he gives them encouragement hopefully Sai will play the first move.
Yeah. So you can see that it actually able to control the mouse cursor and drag the card right from the left to right. But this is just one app, one
right. But this is just one app, one task, very clear rules, right? But
imagine your actual workday. at work,
you what do you do, right? You have
emails, you have Slack, uh you have sorry, hold on.
So, at work, you have email, you have Slack, you have spreadsheets, you have your PowerPoints, you have your QuickBooks, SAPs, and some of the legacy systems that your company refuses to
retire, right? So some of these tools
retire, right? So some of these tools have APIs, right? So this is where we had a lot of activities is going on last year where we have API or CLI agents. We
have um tool calling, function calling.
Great. This part has been solved and some of these apps is in the browser, right? So you you heard talks about
right? So you you heard talks about browser use agents which can handle um navigating and looking at your browsers for you. Great. But again there's
for you. Great. But again there's everything else that your desktop apps your legacy systems your proprietary tools there's no API there's no browsers
so there's no the only way in is actually through the screen right so that's computer use so teaching an API to see the screen understand
what's on it and operate it just like you would with that the autonomous computers is complete we have API and CLI agents we have browser agents and computer user
agents working together simless to the users and handle any task on the browser. So this is what we are building
browser. So this is what we are building at similar and in building it I would like to share three main challenges that we have faced reliability trust and scalability.
So let's start with reliability to a user. Reliability means one thing,
user. Reliability means one thing, right? It works every time. Two things
right? It works every time. Two things
have to be true for that.
The agents needs to see the screen and act on it precisely. That's grounding.
And it has to be able to do it across multiple turns. So across 100
multiple turns. So across 100 repetitions, that's consistency. So you
have grounding and consistency.
Let's start with grounding, right? How
does a blind person sees a screen? You
use screen reader, right? So, it reads accessibility tree structured map of every element, the name, the type, the state, and that's our starting point, but it's not enough. A lot of times
there's apps that don't have the complete tree. Sometimes buttons hidden
complete tree. Sometimes buttons hidden inside um menus, uh drop downs, you have elements that show up dynamically, and some apps barely even have a tree at
all. So we have to back it up with
all. So we have to back it up with vision grounding. So we specialize
vision grounding. So we specialize models that can actually look at the screen visually and figure out where are the elements, right? One can read the text, one can find buttons and elements.
When the tree has gaps, the vision try to fill it in.
So we have accessibility tree combined with visual grounding.
Now grounding gets you precision on single action. But what about thousand
single action. But what about thousand times in a row? In research, there's a matrix called pass at K, which means given the agent K attempts, how many
times do you uh how how many times do you get to get it right, right? As long
as you get it right at least once. So
it's if K is five and if it succeeds on the third try, that's a pass. But that's
not what user wants. User wants what I call pass to the power of K, right? You
have to get it right every time. K times
in a row. no mistakes.
Say a user has a hundred uh leads and wants to send each one a personalized LinkedIn message. If an LLM is driving
LinkedIn message. If an LLM is driving all the action every step, each attempt might go slightly differently uh on the same task 100 times 100 different
behaviors. So we need a different
behaviors. So we need a different paradigm. So what we use is neuro
paradigm. So what we use is neuro symbolic approach.
So neuro the LLM observes the screen reasons about what to do and then so this is the thinking and symbolic instead of just clicking it writes a
program code that's the executing right it's like a brain writes a recipe the machines follows it so here is where it gets powerful the first time it calls
LLM inference but doing it again for the second time the 100 times it just replay the code you don't need LLM inference no cost and essentially free. So the
language behind all this is Simulang, our domain specific language for computer use, readable, modifiable and releasing it to the developers this week.
So that's reliability. Let's talk about trust.
An AI agent that can do anything on your computer, right? You can uh send emails,
computer, right? You can uh send emails, delete files, make purchase. That's
really powerful, but it's also very dangerous. If it un misunderstood one
dangerous. If it un misunderstood one instruction or it hallucinates, it can become disastrous.
So this already happened not to a random user to Meta's director of AI alignment.
At similar trust is built into our architecture. The guardrail is a
architecture. The guardrail is a separate system from the planning agent.
The one deciding what to do is not the same deciding whether it's safe. So you
cannot be the same. You cannot allow the model to be the judge and player. So
even if the planning model gets confused or hallucinates, the guardrail system catches it before dangerous happens.
And the third challenge is scalability.
Our mission at similar is to scale users productivity by 100x. And how do you get 100x? By having 100 fingers and type
100x? By having 100 fingers and type 100x faster, you need 100 autonomous computers. But
not everyone wants to set up 100 Mac minis, buy it up and set up for them, right? But everyone wants the
right? But everyone wants the productivity gain from having multiple autonomous computers. So at similar,
autonomous computers. So at similar, we're building infrastructure to solve this. When you sign up site, this is our
this. When you sign up site, this is our product. um you get 100 uh we we spin
product. um you get 100 uh we we spin out a machine for you right and this is your machine you can do anything you want you can install your app you can set up the way you like it and then you
let sigh to take the wheel so um you if you can spin up one you can spin up five you can even spin up 100 right you can have one doing a
regression test you can have another one uh do another like something else and have the third one uh do a CRM update after the call and you can have last one
uh running reports.
You can have all of them running in parallel.
So why am I in a loop? Yeah. So 100x
productivity not by working harder but by having 100 computers work for you.
reliability, trust and scalability. the
three challenges and that's what we are doing right there's an incredible engineering behind all this um there's like a distributed systems agent
reliability at scale so we are hiring if you're interested do join us so please the PC freed us from paper and autonomous computers free us from human
work and this and we are similar I'll be around in the booth at level four across the street and we'll see you around.
Thank you.
Thank you, Lihao. With that, we have come to the thing you guys have been most looking forward to, which is lunch break. Uh there is buffet lunch in
break. Uh there is buffet lunch in Hopscotch, Cayenne, and the Beastro. So,
you have multiple spots to get access to food. Uh we are running ahead of time,
food. Uh we are running ahead of time, so we will start the next talks at 1:30 sharp. uh starting with Rio Louu from
sharp. uh starting with Rio Louu from cursor uh who's the head of design. So
you do not want to miss that talk and be back uh in time for that. Uh thank you so much for sticking around all day. See
you soon.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, Hey, hey,
hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey hey.
Heat. Heat.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey, Hey,
Hey, hey, hey.
Hey, Hey, hey, Hey, hey, hey,
hey, hey.
Hey, hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey hey.
Hey, hey, hey, hey, Hey, hey,
hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey hey.
Heat. Heat.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Heat.
Heat.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey, Hey, hey, hey,
hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey everybody.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Heat. Heat.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey.
Hey, hey, hey, hey, Hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, Hey, hey, hey, hey.
Hey Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
tip. No, I'll announce. Okay.
Hi everybody. Welcome back after the lunch break. Uh hope you guys got some
lunch break. Uh hope you guys got some food. Uh got to chat with people and uh
food. Uh got to chat with people and uh come back with energy for the next uh bunch of talks we're going to be having here at the Capitol Theater. Now, I am
super super excited to bring the next speaker here. Um this is Rio. He's the
speaker here. Um this is Rio. He's the
head of design at Cursor. But I'm going to share a little bit of a story because all of this started for me two years ago, middle of 2024, because I learned how to code as a complete non-engineer
actually using cursor. And I don't know if you guys uh ever use the tool when it was just tab and inline before composer model multifile orchestration came out.
Uh that was what I was learning with.
But there's so so much thought that the cursor team put into designing an experience. Um it's for people who are
experience. Um it's for people who are power users as well as people who are new to it like me and a lot of the design patterns have now become the ones that are kind of standardized and used
across all different kind of coding agents. So I am super excited to bring
agents. So I am super excited to bring Rio uh to the stage. Uh he's going to be giving a talk on designing the next cursor.
Yo yo yo. Hi. Hi everyone. I'm Rio. Um,
yo. Hi. Hi everyone. I'm Rio. Um,
let me wake up my computer first.
Cool.
Nice. Cool. Good afternoon everyone. I'm
Rio. I lead design at Cursor.
Um, today I'll share how we're designing Cursor to bring designers, engineers back to our roots when making software felt more like play rather than being
stuck in rigid roles, tools, or processes. also share how our design
processes. also share how our design process became more fluid as we designed cursor with cursor. I'll end with our vision for the future of making
software.
In the beginning, software design and engineering were the same thing. There
were no splits. The people who imagined software also built it. Design and code were the same craft. The material was
the code itself. Thinking and making happened in the same loop. This is Bill Atinson. He was on the early Macintosh
Atinson. He was on the early Macintosh team. He built QuickDraw, a 2D graphics
team. He built QuickDraw, a 2D graphics engine. He also designed and coded Mac
engine. He also designed and coded Mac Paint and Hyperart. He invented the marching end selection pattern and a lot of things that we still see in most
graphic design apps. He built pixel perfect UI in 68K assembly. Every detail
from concepts to design to implementation was his. Was he a designer or a developer?
This is Alan K. He invented small talk and the dynab vision at Xerox Park.
Basically designed the whole future of personal computing. He wrote the code
personal computing. He wrote the code that made it real. There's a famous quote from him. The best way to predict the future is to invent it. He built
working systems to prove his ideas. From
UI to interaction models to the runtime, they were all one craft. Were they
designers developers they were all builders. The question did not make sense back then. Design was
code, code was design, and the craft was whole.
Then something really weird happened, especially in the last decade.
We've forked ourselves. We split into specialized roles. The designers owns
specialized roles. The designers owns the vision makes the mocks. The
engineers implements the mocks. The PMs
write the specs, run meetings, keep everything moving. The promise was
everything moving. The promise was specialization will make us faster. But
the reality is we got slower and more distance from the code and our tools for too. The engineers
mostly stayed in the terminal and the idees uh Vim, VS Code, Sublime, but this the code is still the source of truth.
The designers kind of moved to the cloud. We started in Photoshop making
cloud. We started in Photoshop making bit maps. Then we moved to Sketch, which
bit maps. Then we moved to Sketch, which is a Mac only app that does vector mapping for like UI. And then we moved that to the browser and made it collaborative in Figma.
The designers made beautiful mocks but they weren't real.
And then the PMs and collaboration also kind of scattered. You have Jira tickets that nobody wants to update. You have
Google Docs for specs. And then we made notion for weeks and planning, Slack for everything else. And then there is this
everything else. And then there is this sassification of everything and per purposefully built tools that actually created more divides
and the gap widened. Linear handoffs
became the norm. Designer makes some designs in Figma. The PM writes up spec.
The engineer gets a ticket. Matching
Figma became the goal. But this back and forth in comments and meetings is really annoying.
And we lost this tight iteration loop.
It takes weeks between ideas to mocks to specs to tickets to code to review to staging to prod takes weeks from thoughts to reality. Designers can't
touch the real thing which is the code.
Engineers can't explore without a ticket. And the material, the code
ticket. And the material, the code became someone else's job. We told
ourselves this was progress, specialization, best practices, design systems. But we traded craft for process. We traded building for
process. We traded building for coordinating. We split what was supposed
coordinating. We split what was supposed to be whole.
Code is a universal language between humans and machines and it is our material of trade. The code is the material again. The code is the source
material again. The code is the source of truth. It is the real thing. It is
of truth. It is the real thing. It is
not a mock. But now with agents writing the code, you can design by asking, directing, refining. And the craft
directing, refining. And the craft becomes knowing what to build and how it should feel using the real material and making it real with other humans and
agents.
Cursor could bring tools and builders together into this one thing again so that we can all make great software
together. How do we get there?
together. How do we get there?
Enter cursor 3. Cursor started by inheriting a lot of complexity from VS Code. As the agents became the primary
Code. As the agents became the primary way people write code using cursor, all changed within the last year. This
legacy kind of became a liability for the agent pill coders. Much of this file ccentric view of things don't make sense anymore. And for the new coders, they
anymore. And for the new coders, they still feel a lot of friction getting started, bombarded with all these scary
UI and concepts that they don't know.
We also see a shift from operating on this local file state to interact with the agents to moving towards multiple agents running on different projects
increasingly in the cloud. And this
flips a filecentric view of the IDE to a new hierarchy centered around agents and their environments.
In order for us to retrofit VS Code, changing layouts creates a lot of UI forks, edge cases, and broken states.
And it just couldn't keep up with how fast the world is changing. So, how do we move from this filecentric view of software to an agent native interface
that adapts to each human and what they do?
There are, I think, two main philosophies to building AI tools and the difference really matters. On one
hand, you get a black box. You type in what you want. The AI does something where you can't really see. When it
works, you didn't really learn anything.
You just skip the thinking. When it
fails, you don't really know why.
Especially as a new coder, you keep burning more tokens without understanding what's h actually happening. You can't see, can't
happening. You can't see, can't intervene, can't edit. You either
approve every single change or give up.
You are just a product of the model.
On the other hand, you get glass. It
starts simple, but you can see more if you want to. Agents streaming in, code running in the background, AI thinks with you your way. You can redirect myth
lights, stop anytime, stare your way, edit that two pixel padding if you want.
You don't have to read every change, but you always can. The experienced coders can let agent flow review at the right time and make edits when needed. The new
coders can learn new software concepts with cursor. They can learn by just
with cursor. They can learn by just asking, building, tinkering, and then seeing a deeper slice of the system. You
stay in control, build intuition and shape cursor into how you think.
As AI gets more powerful, Glass matters more, not less. Autonomous agents
running for hours need legibility for humans to monitor and interject.
Multi-agent systems need inspectable, durable plans with clear boundaries defined by humans. We also need a share space and malible interfaces for humans
and agents to think together.
And we chose the glass way bringing focused legible customizable interfaces to humans and agents. Every
agent, their actions, artifacts are visible and editable. Plans you can shape, agent states you can inspect.
There is zero hidden magic and infinite control. But it starts simple.
control. But it starts simple.
You can use cursor now with editors closed, no auto opening files, no distractions. It kind of works as a
distractions. It kind of works as a sidekick next to the other tools and workflows that you do. But it reveals complexity as you use it and you can see
more when you want to. As you use cursor for more projects in different stages of making software from planning to designing to execution to review the
interface morphs to fit you and let you focus on what you're good at. The
experienced developers can roll fast with multiple agents, review changes, and make precise edit edits when needed.
The designers can sketch things out, see the code running in the browser, annotate and tweak every detail with immediate feedback.
And the product people can think, plan, explore options and trade-offs with the agents that knows about their whole team context in a fully interactive collaborative
document.
Everything feels instantly familiar yet powerful. Designed for the humans, not a
powerful. Designed for the humans, not a model. We let you tinker and fit cursor
model. We let you tinker and fit cursor for you. The core stays simple, but you
for you. The core stays simple, but you can customize through extensible concepts like plugins and skills. And
there are virtual interfaces that adapt to what you do. We also respect user habits and control. We never force drastic changes. We are not taking
drastic changes. We are not taking anything away, but we can show people there's a simpler new way to do things for those who prefer.
Now, let me share how we got here.
What's crazy about all this is designing our new interface all happened within about a month. And it all started from a random prototype that we started
exploring at the beginning of this year.
So Lee, Rob, and I kicked off Baby Cursor 3 earlier this year. Baby Cursor
is our name for our prototyping environment where people can fork, explore ideas, and share with others.
When designing AI tools, you always end up with a lot of non-determinist cases where static mock-ups can't capture the nuance. We really had to
feel it.
So the goal was to design cursor so that it can scale from the most simple form into something sophisticated that the pro engineers and software makers will love. In the new version of this
love. In the new version of this prototype, we made it so that it is a fully functional electron app built on top of the cursor CLI. I designed a
simple layout architecture that could support one to multiple agents, one to multiple projects, zero to end tabs of content and splits. And it works with
any space constraints.
Things always start simple, but it grows with you as you use the tool more.
Making these dynamic states in Figma in mockups will probably take months and it won't give you the same feeling as
playing it in real in code. So much of this highlevel like information architecture and flows were pretty much
done in a week.
In traditional design tools, it is really easy to duplicate artboards and states and export options. you always
end up with a lot of snapshots of states rather than like one cohesive thing to see. In cursor uh in baby cursor 3, we
see. In cursor uh in baby cursor 3, we added a built-in feature flagging system and we have settings stored as files. So
this kind of allows us to explore both really large architectural forks and explore every single little detail and permutations. Then you can see how
permutations. Then you can see how things fit together.
By playing with the prototype daily and exploring options, we were able to re reveal new constraints that affect the deeper architectural decisions. One
example is how does the layout change as you navigate across agents? Do the tabs on the right change as you navigate between them? Are they pegged to
between them? Are they pegged to different agents? Are they per workspace
different agents? Are they per workspace or environment? Or are they all
or environment? Or are they all independent like VS Code? It is a really hard concept to explain in words, but it
is very easy when you can feel it live.
We then shipped baby cursor to everyone at the company to play and get feedback.
The engineers started forking and adding their ideas and takes into the prototype. I then synthesized them back
prototype. I then synthesized them back and did more iterations based on the feedback and new ideas from the team.
We learned a lot by building the prototype. Which layouts made sense
prototype. Which layouts made sense under real use in different conditions?
What are the different defaults and customization options to expose? How do
we make complexity feel simple? How much
control should remain visible versus hidden? How much progressive disclosure
hidden? How much progressive disclosure should work etc. And from the prototype, I reverse engineered
the code into a highle spec where we document every single option and details. The videos and the screenshots
details. The videos and the screenshots became the mocks for the new cursor.
Then after 33 long discussion threads on the RFC, it's it's time to make it real.
and engineering also took a more drastic approach inspired by the speed of building this prototype. We basically
decided to rewrite cursor the whole UI from scratch with a brand new design system component library and a clean foundation.
As the engineers are working on this, I prototype more sidebar grouping customization, input customization, peaking and details. Then I went back to
Figma for the first time so that I can play with liquid glass that we didn't end up shipping and all the visual details.
Our engineering team cooked really hard on this for two months, rewriting the entire cursor UI from scratch in React and building a new design system.
Once the things are a little bit more cooked, we started using the new cursor to build itself and we dog food everything
that still feels a little bit weird. The
designers also went back to the code.
So, we're building out little details, polishing new components, icons, colors, theming, vibrancy, animations, all the little details that the models don't
see.
And the design process became really fluid. It is no longer linear. We just
fluid. It is no longer linear. We just
use the best tool to refine the craft.
Whether it is about spending more time to think making these prototypes or mocks or just go straight into the code.
And in late March this year, we shipped the alpha and we created this rapid feedback loop with both internal and real world users. And we focus on
performance quality for our first ship.
After we shipped cursor 3, we built Baby Glass, which is our next generation prototyping environment where we can visualize cursor from the very present
to the future in one single prototype.
It is rebuilt on top of our new design system and it uses the real components from Cursor 3.
We brought it back to the web. So it is no longer electron app because it is so easy to share states and links for
others. So they can click on the link
others. So they can click on the link and then give feedback. We also improve the future flagging and versioning system so we can visualize cursor from
the current production state to each step that we need towards like a more future milestone.
We also built better handoff flows so that these baby glass prototypes can turn into the first PR that an engineer can build on top of in the real
codebase.
And it looks crazy good. It has a desktop. It has some wallpapers, themes,
desktop. It has some wallpapers, themes, and we even built like a tool inside Baby Glass where you can generate mockups and videos. And we plan to use this for like actual demos in our
website.
So making glass give us a lot of clarity into what we think the future of making software could become.
It should be more collaborative so that the humans can work together on the same context and tools with teams of agents.
As we use agents to accomplish larger goals, it becomes increasingly more important for the agents and the humans
to share the same space so they can arrive at the right thing to build. And
as everyone is becoming a a builder, people from different disciplines, not just engineers, can finally come together and work on the same goals with
the same agent setups, tools, knowledge, and artifacts.
We think the future should be more customizable.
Our interfaces and tools should adapt to who we are and what we do, not the other way around.
Everyone and every team is different while the underlying concepts and tools are the same so that you can build connect your workflows and tools and
customize your agents to the most granular level for yourself and your team.
We think the future is more autonomous.
More agents can tackle repeated workflows, streamlining and eliminating manual processes while the humans define the system and boundaries.
We can automate things like buck triaging, release notes, security and code reviews. And you can design your
code reviews. And you can design your system with verification loops and really define what is right so that the agent can handle more things for you.
And lastly, we should be building more ambitious things and think about what what else we could do that is not making
more things and adding more slop. We can
build better, simpler software together.
Instead of adding more things, you can actually use the time that you saved to think deeply and figure out what is the simplest abstractions, what is the right
thing to build for your users.
Do something crazy that wasn't possible together with other humans.
The future belongs to people who build to think.
Stop waiting. Start building. No black
boxes. Stay glass.
Thank you.
All right.
Hello. Okay. Uh thank you so much, Rio.
And uh just so so everybody knows if they haven't checked if you guys haven't checked it out yet, uh Cursor has a booth in the Italier in Kinsky. So you
can go ahead and meet some of the team there afterwards. All right. Um I Thank
there afterwards. All right. Um I Thank you.
Thank you.
Yeah.
All right. Uh I would like to welcome to the stage the next speaker. Uh this is Ain. He is a staff product designer at
Ain. He is a staff product designer at Figma.
Welcome to the stage.
So Ain uh currently works on Figma weave and has been behind a lot of the very pro popular products on Figma including Figma Buzz and Fig Jam. Fig Jam of which is something that I personally love to
use. um and he's going to be giving a
use. um and he's going to be giving a talk on designing multimodal multiplayer AI.
While he's getting set up, uh just a quick few announcements. Uh number one is tonight we are actually going to turn this entire theater around from talks
into a nightclub for the afterparty.
Yeah. Um so if you uh have just remember to bring your badge. That is actually how we're going to uh check you in. So
if you are an attendee uh you could just bring that. Um no worries about having
bring that. Um no worries about having the QR code from Luma. Um and then uh next thing is we have a demo stage in Pullman which is kicking off now
actually. Um but if you want to stop by
actually. Um but if you want to stop by at some point to see some demos from some of the local startups on how they're embedding AI into their workflows or their products, uh check it
out there.
Right, without further ado, Aen.
All right. Hello guys. Wow, this is like a lot more people than I thought. Um,
okay. Um, I'm Ashang. I'm a product designer at Figma. And today I want to talk about why our AI tools should be multiplayer and multimodal. So the AI
tools that we have today focus on really um, making individuals go 10x faster.
But I feel like the harder but also more interesting question here is can we make a group of people go 10x faster together?
because when execution gets cheap um collaboration and alignment becomes the bottleneck. This is actually from
bottleneck. This is actually from another AI engineer talk um by Maggie Appleton the research engineer at GitHub and I really I wholeheartedly agree with her framing here because deciding on
what to build and what not to build is more important than ever right now and a team's progress will be stalled if the way that we explore plan align doesn't change. So very relevant to also what
change. So very relevant to also what Roy just shared earlier. And I just feel like the tools that we have today don't really make any of that easier necessarily.
Most of the agentic tools today have a chat on the left, artifact on the right.
But the chat is single access, one thread, one direction, one source of truth. And it primes you to one shot,
truth. And it primes you to one shot, right? Because there's so little
right? Because there's so little affordance in the interface that tells you how to branch out, compare ideas side by side, etc. And this is ultimately an interface for
convergence, not divergence. I think
when we design AI tools, we should build interfaces for divergence too because the creative process is both solitary and social and the best ideas gets sharpened through friction between
minds. So I think we should build tools
minds. So I think we should build tools that facilitate instead of removing that process and isolating us from each other. So yeah, here are some ideas on
other. So yeah, here are some ideas on divergent interfaces for AI tools. A few
years ago, I helped build this widget called Jambot, which um it lives in Jam and it allows you to explore ideas with Hatcht in a way that's um visual,
nonlinear, and multiplayer. And this was back when LM was like all about text before you could like ship code from 0ero to one. And looking back, I see
canvas as this really fascinating malleable medium where the additional dimension could make multiplayer presence and branching iterations feel much more natural. So the next part of
this this talk is going to be a little demo and I'd love for you to join me. If
you have a laptop, just type in this link and enter your name and hopefully you're in. Um, I know the Wi-Fi is a
you're in. Um, I know the Wi-Fi is a little bit spotty and um, and this demo is like purely coded via claw and I have I have no idea. Um, never tried asking this many people to join and there's a
lot more of you than I thought again.
So, wish us luck, but um, please try it out if you can. Hello, hello, hello. I'm
going to zoom out here, but the it's the same URL as the one in the U address bar. And once you join, you should be
bar. And once you join, you should be able to see the canvas with the slides plus a bunch of like mini games preloaded. So you can click to play like
preloaded. So you can click to play like any of the games here. So I'm just gonna select this like Flappy Bird thing from the top. Oh guys. Okay. I really hope it
the top. Oh guys. Okay. I really hope it doesn't break. But okay. Ah I lost
doesn't break. But okay. Ah I lost already. Okay. This is really
already. Okay. This is really embarrassing. Um so if you hover on the
embarrassing. Um so if you hover on the note here um you should be able to see this prompt box where you can like add an element um change the aesthetic the mechanics et
etc. And for example, my friend Annie yesterday suggested to add a monster buddy to my bird.
And let's see how that goes. And while
that is streaming, I would also wonder what if I change the background to galaxy.
Okay. And now I can see that the Asian is basically taking my prompt. It's
rewriting the plan and also rewriting the code. And I have this one at the
the code. And I have this one at the top.
Okay. It's falling down like way too fast, you know, but Okay. All right. Um, I'll see if you
Okay. All right. Um, I'll see if you guys have made anything else. So, um,
I'm not sure what happened, but um, really hope I'm not stuck in a vacuum here, but from here on I should be able to continue iterating. There's a chance that this has like softly crashed for
all of you. So, I'm sorry. But I can like add a hat to the bird to the bird and move on etc. So now you can see that like we are kind of
collaborating on this like most primitive version of executable code here. I'm also going to refresh and just
here. I'm also going to refresh and just see if it's like my problem. Okay, it's
like really messing itself up. So sorry
about that. But I feel like the thing that fascinates me here is having this like simple but also kind of visual representation of version history that feels very inviting for iteration, right? And being able to see that
right? And being able to see that collaboration happen in real time. And
imagine if the real software prototyping could actually feel this collaborative too. And this notion itself feels very
too. And this notion itself feels very exciting to me. And now by making this space multiplayer, um I think it also introduces like edge
cases. um we need to consider to enable
cases. um we need to consider to enable better co-creation between multiple humans and agents. So today most AI tools we let agents act on us act um for
us on a tasks that are meant to be automated and that's okay right but in collaborative exploration where we rely on both humans and agents as riffing partners the space should feel permissive right where we could touch
each other's work and iterate on the same thing in real time so for example okay this has really hard crash so I have a local version prepared just in
case um So, let me let me try this again. Make the
again. Make the add a monster buddy to my bird.
Okay, so as I was typing, you might be able to see that there was a ver there was a um an option to also make edits which would allow you to kind of override something, right? And this
introduced possibilities of conflicting edits, for example. So if I do say right here that like I want to like make um make the theme medieval
and I click make edits and now imagine that if someone else is working on the same thing at the same time right so for example if you're editing the visual style here with two parallax layers
for example and there's a chance that like some again somebody else might be touching the same artifact and like rewriting it and this should be allowed right just like in Google Docs or Figma because the space
is per as permiss as permissive as possible to encourage co-creation. And
here I'm seeing that I really hope that it streams but if it doesn't uh I'm doomed. Um but the original plan here is
doomed. Um but the original plan here is that you will be able to see the agents cursor also making the changes alongside me. Um and because it's output replace
me. Um and because it's output replace my edits, right? And I think over here it should ask me whether or not it should actually rewrite it. And it's al also also show the streamed outputs. So
I could compare and decide whether I want to allow it to rewrite it or not.
And that would be just a bare minimum example, right, of I'm just going to use this. That would just be like a bare
this. That would just be like a bare minimum example, right? But I think it gets to show that to make AI multiplayer, you really have to design this embodied presence. So the
representation here should set expectations on what it could do. And in
the case it would have been able to show me the the document that it was editing and it would show be able to appear with its own text pointer and it should also show its scope of changes visually and it show it should also show how to
handle conflicting edits with others whether it's with humans or agents. I
actually think that in many ways similar to designing the embodied presence for humans today um I think it's very I think it's very similar um whether you're designing a cursor in a documents or canvases. Um, so that will just that
or canvases. Um, so that will just that was just a demo on making AI multiplayer. But to me, this is just the
multiplayer. But to me, this is just the beginning. I think the next frontier is
beginning. I think the next frontier is actually also making a multimodal, which is really about widening the channels of communication so that both humans and agents can express themselves in richer
ways. And there are three directions
ways. And there are three directions that I'm pretty excited about here.
First, I think we should build genuinely multimodal models. A few days ago, um,
multimodal models. A few days ago, um, this is from thinking machines lab. I
think they shared a research piece on what they call interaction models which is natively multimodal and micro term b so that it's always interactive in real time and I love their framing that like
the turnbased AI is kind of like talking to your agent over email instead of in person. So this is pushing the boundary
person. So this is pushing the boundary at the model level and honestly this is like way cooler than this talk. So you
should totally check it out and if you're a model builder please make it happen. Second, I think we should build
happen. Second, I think we should build better embodied presence for agents as they move through the richer digital mediums. As I said earlier, take something as simple as a cursor. There's
a lot that you can express through position, movement, and interaction like clicks. So, this is a communication
clicks. So, this is a communication channel and just like how we read each other's intent through body language, we should design body language for agents, too. But even as of right now, right,
too. But even as of right now, right, there's a lot we could do with interaction modalities. So guey already
interaction modalities. So guey already helps people communicate intent with in richer and more intuitive ways and we have decades of experience building them. So think about how we could engage
them. So think about how we could engage inputs like multi- multi-touch, pencil and speech at once and like this interface experiment by um Diana Lou or
we should think about how much determinism you determinism you can build with your interface, right? such
as this example which is um Figma Weave um a note-based workflow tool that generates rich media and full disclosure I work on this tool but this genre of
note-based AI native tooling is exciting because it fuses models with the gooey patterns that we already know and it provides the provides the precision and control that they need for their
creative exploration and you could go very deep from here. So, I hope that what I shared just now gives you inspirations on how your agents could interact with multiple collaborators,
both peoples and agents, and what modalities that they can move through.
Um, so I'll definitely check out the space and see if the changes that you may eventually come in. Um, reach out to me on Twitter if you want to chat. And
if any of this resonates with you, definitely do not miss this year's config, which is Figmass Design Conference. There's a bunch of updates
Conference. There's a bunch of updates that might interest you, too. So, yeah,
that's it. Thank you.
Thank you so much, Ain. Uh, I am excited to introduce our next speaker. This is
Saleem, a robotics engineer at Menllo Research. Come on over. Uh, for those
Research. Come on over. Uh, for those who don't know, Menllo Research is behind Azimoff. They are an open-source
behind Azimoff. They are an open-source humanoid robot that you can train and customize. And they're going to be the
customize. And they're going to be the first uh folks to be talking in our physical AI track. uh which is really exciting because we want to introduce uh you know new discussions outside of just
thinking about AI as large language models but how do you actually allow it to understand and interact with the real physical world. So uh Salem will be
physical world. So uh Salem will be talking about how uh his topic will be noise is all you need engineering sim to real for open source humanoids.
Can you see the slides go from here? Thank you.
Uh hello everyone. Um I'm Sim. I work at Melo Research which is a company incorporate in Singapore. And I want to explain a little bit about Agentic Robots and pretty much how to vip code
the real world. Um
Melo is actually a full stack team. Uh
we have 27 people that are across the world. We have um an office in Singapore
world. We have um an office in Singapore which is at Syndam Square which is actually a very nice place. Um we have an office in Vietnam in Ho Chi Min City
and we are going to open an office in uh San Francisco next month. Uh our team basically 27 people 25 of those are
engineers. Uh I'm Salem. Uh I used to I
engineers. Uh I'm Salem. Uh I used to I joined Menllo in 2025. I used to work at Tesla for six years as a software engineer in PaloAlto, California. And I
came to Menllo basically to lead the robotics side. Um, as you guys can see,
robotics side. Um, as you guys can see, we do full stack robotics from the hardware up uh from the from the hardware up to the uh highest layer um application.
So, I I guess you guys already heard a little bit about ESO. As basically an open source human robot. It's the only open source human robot in the world
that actually um is uh you know went kind of viral like we we didn't expect that that much. We um uh it went viral in like multiple countries in Germany.
There is an article in Germany, in Japan. So um so we decided to basically
Japan. So um so we decided to basically create the DIY kit basically like just a box where all the different parts of the humanoids are in and then we have a manual online where you can build the
robot at home. Uh you know, we put it out for pre-orders and I think we got uh $1 million pre-orders within two days.
Um, people are very excited about the humanoid space and especially as like as learning how to use humanoids, uh, how to learn how they work and how they act.
Um, so a lot of people ask us, you know, like why you guys building hardware?
Hardware is hard. Uh, I guess not being in hardware is harder going forward. Um,
so as an open-source reference humanoid design. So that means that anyone that
design. So that means that anyone that can fork out the design can build the humanoid at home or in a manufacturing uh setting. So our goal is basically to
uh setting. So our goal is basically to create a distributed network of manufacturing partners worldwide that are that are creating esop for people who want to buy them while we own the
reference design. It's almost like uh
reference design. It's almost like uh like Android I guess where Samsung and Huawei like builds the hardware and you basically own the open source reference
design. Uh, and this is very interesting
design. Uh, and this is very interesting because when we put the DIY kit out for sale, actually we got um 200 plus factories reaching out to us want to
build the robot and they're all around the world. Some in Turkey, some in
the world. Some in Turkey, some in Germany, in the US of course. Um, and
even some in Nigeria. Um, so there is like all around the world where they can actually uh build as where we own the open source reference design. Um
what Esimov also includes what we also built internally is like a robot processing unit. It's it's very
processing unit. It's it's very important for us because it's it's basically one controller that can control the entire robot. The robot
consists of like 35 motors, eight cameras, uh two microphones, one speaker, all connected to a single board sitting in the torso. And what this
board is also uh useful is like it can run in local model on inside. And this
is very important because as you guys know stands for ESMO laws. There are
three laws of ESO which is pretty much don't hurt anyone obey your obey the commands and protect yourself. And it's
very hard to define safety that is universal right like safety for people that live in the Middle East is different. Safety for people that live
different. Safety for people that live in Singapore is different. like I'm
Turkish and I'm German. I I'm dual citizen. So like safety requirements for
citizen. So like safety requirements for both my countries are different. So we
decided to actually make safety as what same as the manufacturer. We want to make it a distributed consensus of people that build as that develop as to
decide what safety means. So it's almost like a consensus. Think about Bitcoin.
Think about all these other like blockchain uh technologies where people decide what is safe and basically create a functional safety model, computer
vision model that can overtake the robot at any time when it tries to do something wrong that uh that disregards the uh as laws
and that's why it's very important to basically burn that into a device. So
it's not running in the cloud, it's just locally in the device. the githash is burnt in the CRC you can read everything out and like um that that is that one single board where the manufacturers
have to use otherwise they're not allowed to build as so a little bit about like how to vip code the reality right um no vip coding
is kind of interesting because you know um in the era of like open claw where people can like basically automate their workflows everyone feels like an AI
engineer, right? Like, you know, people
engineer, right? Like, you know, people can basically summarize an email every morning and then probably think they can apply at OpenAI just because like they have all this power, right? Just to wipe
code things. And I think we what we
code things. And I think we what we really want to do at Meno besides the hardware sites, like on the software side, we want to we want to basically turn every software developer into a robotics engineer. Same as like Open
robotics engineer. Same as like Open Claw and the rest turned everyone that is a software developer into an AI engineer pretty much, right? And how we
do it is basically we have a system design like in a software site which is like an agent. The agent is basically something that you guys can bring in.
It's not something that we provide and this can run CL this can run codecs.
This can connect to your like all the different external tools you have. Um
and this is what we call the like the big brain the slow thinking brain. And
then we have a uh skills and robot control which runs inside in the robot.
So skills basically mean how to perform a certain task. How to pick up a cup, how to do a handshake, how to walk. Uh
and the robot control uh basically is a real-time operating system. Make sure
that these commands are getting through the robot doesn't fall down. Uh it has some safety mechanisms inside.
And just to give you you know an example how like the cockpit that when the robot is autonomously running looks like. So
basically what you know when you want to load the box from A to B right first the robot internally runs this perception and planning agent that uh can detect
the different uh obstacles can detect the different uh different scenarios uh depending on what
it sees. Um the second thing is actually
it sees. Um the second thing is actually uh you can train through simulation to have skills almost like open claw skills.mmd file right where you can
skills.mmd file right where you can train skills pick up a cup uh you know move forward locomate forward run uh jump uh these are trained through
different type of models but almost like abstracted into skills. This is either a VA that picks up things. This is a Walt action model or this can just be like
inverse kinematics, right? And now you can plan and you you have skills to execute. And the last step is basically
execute. And the last step is basically you can just plug it in to your agent.
So now your agent basically you you're not sending a video and audio stream to your agent.
You're just sending a text and skills and a robot is almost just an MCP server performing those tasks. And what you can do as the next step pretty much you can
create a Camban board where you can assign tickets to a fleet of robots that are running in your factory or running in your home. And the interesting part
is those uh robot processing units I mentioned previously. You can connect
mentioned previously. You can connect those robot processing units to any type of robot.
Some skills are actually transferable.
Some skills are not. So you can connect these robot processing units that run all the safety laws in any type of robot and basically through our stack you can connect them to like a almost like a
fleet orchestrator like a swarm intelligence and then you can basically you know control your entire environment. The
robot is open source the the skills are trained by the community. The safety laws are trained
community. The safety laws are trained by the community and the robot is built by manufacturing partners. So I think that that that is kind of like the goal that uh Melo is trying to achieve here
at uh around Singapore and whoever is you know whoever is interested to join us. We actually open an office uh in
us. We actually open an office uh in Melo Park as our name said right? Um
anyone that is interested can join us.
Um and anyone interested in that is in Singapore can also hit up uh we can talk. Um what kind of skill set is
talk. Um what kind of skill set is required? Robotics has no specific skill
required? Robotics has no specific skill set.
is like a multi-dimensional problem. You need people with
problem. You need people with perception, electrical engineers, mechanical engineers, inference optimization, GPU optimization. You need
all of these people. Uh and I hope with the entire community that we also have in the background that built ESO for us, we can achieve something great here out of Singapore as the first humanoid
robotics company out here. Thank you
everyone.
Thank you, Seem. And now I'm excited to bring our second speaker within the physical AI track. Um, Alberto, who is the founder of Reactor. Uh, Reactor
recently came out of Stealth. Uh, it's a startup that's focused on something called World Models, which uh maybe some of you guys are familiar with or some of you has heard some of it maybe post uh
Nvidia's GTC. Welcome. Um but uh we're
Nvidia's GTC. Welcome. Um but uh we're very excited to uh bring him here to talk about how you can actually create interactive simulation environments to
help with the next wave of physical AI.
So he'll be talking about world models, a look at the future.
Uh no, it's just my presentation.
I think I can do this, but I don't know if it's working.
You think it's working?
All right. Uh, thank you everybody. Um,
very excited to be here. So, today I'm going to give you a glimpse into the world of world models. No pun intended.
Um, and so first of all, I wanted to start the presentation by giving you actually a quick view of the state of world models today and what they're capable of because I think sometimes
people um are not aware of what's possible already with world models which is quite mind-blowing. And so without further ado, um this is a video uh which
is actually not a video. It is recorded in real time generating on reactor and you can see that I am palosing this uh this polar bear. Now, when I look at this video, I cannot quite distinguish
if this is actually actually like a real video or like a video game. But what
you're seeing here is actually something that was being generated in real time on the reactor platform. And so, um, this is to show you that today already the quality of what you can generate with
world models is very impressive. And all
of this when I recorded it was running live at 30 frames per second. And I
could control the experience just from the keyboard. and it would change all
the keyboard. and it would change all all in real time just starting from an image. So I just wanted to set the stage
image. So I just wanted to set the stage because it's important to know how already advanced these models are and what's possible today. And this is just an early glimpse. I'll show you more later during the presentation. I think
it's really incredible that this is already possible. Uh just quickly about
already possible. Uh just quickly about me, I am the CEO and co-founder of Reactor. Uh we started Reactor with the
Reactor. Uh we started Reactor with the goal of democratizing access to world models and for people to build with them. Um, in the past I co-founded uh
them. Um, in the past I co-founded uh Luma AI where I was CTO and co-founder uh and I also worked on the vision pro at Apple. So I've always loved the the
at Apple. So I've always loved the the field of uh spatial, visual, 3D and real time. Uh and u and that's what
time. Uh and u and that's what eventually led me to think about like okay what's the really the next frontier in AI and and and in general in Gen AI and it became obvious to me that that is
world models and real time uh video generation. Um and so it's important to
generation. Um and so it's important to to to to think about what's been happening in the space of AI in the last say uh five years especially visual AI.
Uh at the beginning you know we had we we we have today things that can generate text audio image and video but all of these modalities are are passive.
uh when you prompt a for example an image model um you eventually receive a file out but for the duration of the generation there is no interaction from the user there's nothing that makes you
interact with the model and the model cannot handle uh external stimuli so uh for example if something happens in the in the world and you would you would have wanted the model to react that's not that's not possible because these
models are really passive and not interactive and so in the future more and more AI workloads are going to be actually real time interactive and and
fully uh aware of the world around them.
And this is this is because um you really need to um to have these model think these models think about the world around them in order to deploy them in the real world. Otherwise, uh they're
really unaware and they don't respond in real time to what happens around them.
Um, and so in order to actually discuss uh, you know, the rest of the the rest of what Reactor does, I thought it was important to talk about what a world
model is. Um, so the way that we define
model is. Um, so the way that we define world models, I think, is a little different from what uh, a lot of people define them as. U, we think of them as models that first of all have long-term
memory. We like just call it persistence
memory. We like just call it persistence for for brevisacy, but they they know they know they're aware of what they generated before effectively. Uh they're
also real time. Uh it means that these models you can book them, you can interact with them and they react to you. Also they think casually meaning
you. Also they think casually meaning that they are aware of what happened before. Not only they remember it but
before. Not only they remember it but they take it into consideration in in for when they generate the next stage of of the output that you wanted them to generate. And like I said you can
generate. And like I said you can actually poke them and and interact with them. not only you like a human but
them. not only you like a human but external you know physical events or internet events whatever it is that your world model is supposed to do. Um and so you can think of them really as a state
machine uh that understand external inputs take into consideration what happened before and generate new outputs uh based on that which is very very
different from uh image v im image and video models because those models they don't have a uh sense of um uh of what happened before. Um, so this is really
happened before. Um, so this is really what why why we're excited because this changes what software is as a whole.
Like in the in the in the current generation of Genai, you generate artifacts, but in the in the next generation of Genai, you will produce applications because they are interactive, they're real time, and you
can uh and they they they are aware of what's happening around in the world.
And this is going to change entirely not only media and robotics, but software as a whole. Um, and it's a very exciting
a whole. Um, and it's a very exciting thing. Um and so today um effectively we
thing. Um and so today um effectively we already have a lot of use cases. Uh I
think again um it's it's easy to not think about a world models as something that uh is useful today but actually for example in robotics um they're becoming more and more used by robotics companies
um instead of VAS and VLMs uh because uh they're they're they're better at being aware of um the what's happening in in the surrounding of the robot and they
can even imagine visually what the robot should do for example in in avasars and digital humans you know realtime video AI and world models are are extreme extremely uh powerful way more powerful
than explicit based like 3D based uh representations because you can adapt them to various situations. Um for for advertising for example you can uh use
world models and real-time video to personalize content live per user which is really the holy grail of um media and advertisement but also in some cases of
of new types of artistic endeavors. Um
for simulation again being able to run uh gener generative simulations in real time in a way that uh is more precise is more representative of the real world and changes the game for what's possible
in simulation. And one of the things
in simulation. And one of the things that we are the most excited about at reactor actually is the idea of generative software. And what that means
generative software. And what that means is that why do we stop at generating media uh games and and and help robots actually act in the world? What if we
could generate every single pixel that it's on the that's on that's that's on the screen in real time uh live. And if
you think about how much frustration there is uh when humans interact with machines and so and interfaces that have been defined by somebody else and they're not really usable by by some by
another person. Generative software has
another person. Generative software has the possibility of really changing the landscape uh for how we interact with with with with software in the future.
Um and so also we believe that world models really are on the critical path to AGI because um the information that you can get from from visual input is so
much richer than what you can get from text u and when you have systems that can interact with the real world and understand it that's how you really deploy AI um uh worldwide in a in a very
useful manner and so we feel also that uh by building reactor we are on the path to to that and so having explained all of this. What we're building at reactor is a developer platform for
world models. And what we mean by that
world models. And what we mean by that is that our mission is to democratize access to world models such that you and everybody can use them and and and make useful things with them. Uh they have
been locked in uh how difficult it is to use them for a long time. And if you want to run them at scale, you have to take into consideration things like latency, you have to think about streaming, you have to think about super
sampling. And Reactor handles all of
sampling. And Reactor handles all of that for you. so that you the developer can just concentrate on application code and build whatever uh you're dreaming of using world models and real-time video
AI and we think this is the way that we get to really a broader uh uh adoption of world models and of of this type of this technology um and we make it very
very easy also for frontier labs and and research labs to deploy their models on reactor so that they can test them distribute them to to other people and even and even uh earn revenue uh from
people using their models. Um, and I wanted to show you something funny here.
So, this is Jensen actually that I'm generating live walking through NVIDIA.
I'm going to start the video again. So,
um, I wanted to show a few examples of of of funny things you can do with, uh, with world models that are unthinkable with other technologies. So, I just generated an image of Jensen at NVIDIA and then I made him walk through it. So
all of this I was controlling live like this was all being hap this was this was all happening live and I could make him walk around and you know go around Nvidia and you know this is Jensen in his leather jacket walking around
Nvidia. Um and also this was another
Nvidia. Um and also this was another funny one that you know these kind of things are impossible to make uh in real time uh without using something like a world model. And this is to I wanted to
world model. And this is to I wanted to showcase to you how um how incredible it is that this is possible uh and you can just make it in like basically instantaneous no not no time um and just
have fun but there are so many more very serious applications that you can have with this that I would love for everybody to try and build and that's why we and yeah of of course it gets it
gets freaky um but yeah so we are we we are ready to to allow developers to use this power we have we have partnered with all the major world models in the world already and you can go to reactor.in today, download our SDK and
reactor.in today, download our SDK and start building uh with world models.
Thank you very much.
Thank you, Alberto. U next up, I'm excited to introduce uh uh Yang Liart, who is the founder of Open Mind. Welcome
to the stage.
on your way. Um, he's currently a actually a very different background from some of the folks. He's a professor at Stamford. Previously, he was actually
at Stamford. Previously, he was actually a professor at my alma mater, which is Berkeley. I won't be too offended by the
Berkeley. I won't be too offended by the Stanford thing. Woo! Go Bears. Um,
Stanford thing. Woo! Go Bears. Um,
pretty excited that he's going to be introducing what Openmind does. Um, for
those who have some familiarity with the robotic space, a lot of things are kind of fragmented and so he wants to build what is the Android moment for robotics,
an open operating system for embodied AI.
Uh, no, I can use this one here, but we will figure this out.
Oh, wonderful. That totally works. This
is great. Cool. Welcome. Uh, so I started life as a physics professor at UC Berkeley. Um, collaborated with
UC Berkeley. Um, collaborated with Facebook a little bit. That sensitized
me to questions relating to, uh, collecting data at scale and using that information to make good decisions primarily for a healthcare context. Uh,
then moved my lab to Stanford so I could be closer to a medical school. And, uh,
so I'm a parent. Uh I teach, I do research, uh I care about healthcare outcomes and so I care about people getting better and so I'm primarily
motivated by things like health care, by teaching, by machines and humans around us. And I'm kind of curious for how all
us. And I'm kind of curious for how all of that will play out. So I'm not going to tell you about hands today. I'm not
going to tell you about assembly or manufacturing today. Um I'll think a
manufacturing today. Um I'll think a little bit about uh what it means to be surrounded by smart machines and uh what we should uh try to build as uh
engineers uh for for uh those new capabilities.
So of course every single one of you has read uh Norbert uh Vener's Cybernetics.
Uh if you haven't um uh that's just horrible. Uh so you should definitely do
horrible. Uh so you should definitely do that. Um he has a really nice uh sort of
that. Um he has a really nice uh sort of uh broader perspective on automation and of course step number one things like clocks and time pieces. Uh the first
revolution as he calls it is the devaluation of the human arm. So these
are technologies like looms for weaving.
These are technologies like steam shovels and automanufacturing and Amazon and warehouse logistics. So you can think of them all as some variation on
uh devaluing the human arm. And by the way um I'm just quoting him. I don't
necessarily totally agree with how he's phrasing this. Um but uh that's uh what
phrasing this. Um but uh that's uh what the argument is in cybernetics. And then
of course according to Norbert we currently in the second revolution which is the devaluation of the human brain.
And here's some examples in that uh historical trend. So chess and go. Then
historical trend. So chess and go. Then
there's Whimo, you can get to the airport. Uh there's of course uh the way
airport. Uh there's of course uh the way Ukraine fights wars which is more and more automated. Uh we're getting to the
more automated. Uh we're getting to the point where a lot of us think that general manufacturing and sort of manual tasks um are well within uh technical
reach. And then of course the sort of
reach. And then of course the sort of final step in all of this is things like caregiving teaching companionship repairing things and so forth. And I'm
primarily interested in this last category of uh of of tasks and opportunities. And
generally what you're dealing with in this last category is you're um you have a machine interacting with a person or multiple people. And that makes things
multiple people. And that makes things uh like really interesting and challenging. When some of us think about
challenging. When some of us think about robots, uh we might think about uh Tesla factory and other people when they think about robots, they think about movies
like iroot. So what you have here is a
like iroot. So what you have here is a situation where you have a human interacting uh with a robot and that's a key part of the plot of this movie. And
likewise for a lot of us when we think about robots, we immediately of course are drawn to Princess Leia and R2-D2. So
that's an example of where the robot that's performing a vital task in Star Wars uh doesn't have hands uh but nonetheless uh manages to uh save the
resistance.
And I'm very much in this sort of second camp when I think about robots. Um, I
think about all the opportunities created by um endowing uh machines around us with good decision-m and able to navigate uh complex dynamic
environments with pets and people and patients and students and so forth. So,
I'm really interested in when we look at, you know, doctors, teachers, nurses, investors, bankers, police officers, electricians, uh whatever uh their job title is
currently. I'm really interested in
currently. I'm really interested in their ability to um solve higher level tasks involving interacting with people, understanding people, remembering them,
uh being able to deliver personalized content to that human in front of them.
Sometimes when I teach physics for premeds, it breaks my heart because I'm looking at 500 students and I have no idea who they are. I have no idea what they know or don't know. And I know as a
teacher that the way I'm giving my physics for premed lecture um is super boring for like three kids in the audience and uh then maybe not so easy
to follow for the other 497 kids. And so
I really wish I just had much better ability to understand each human in front of me and be able to deliver content more appropriately. And I think that's a general problem statement for all of robotics is how to do that
optimally for families, patients, uh, and so forth. Um, if you look at all 830 human job categories in the US right now, um, I'm just plotting, um, how
important social intelligence is to doing well uh, for those tasks. Imagine
a teacher or a nurse. Uh, this is not just about going through some static workflow. This is really about
workflow. This is really about interacting with uh specific uh specific people and then uh delivering optimal care for example. So as we envision
machines being able to do more and more skills around us uh it's very important for me that these machines are uh incredibly capable about interacting
with people. Uh so our eval criteria as
with people. Uh so our eval criteria as a company is uh smiles and tears and trust and memories. Um, so this right here is Diane. Uh, Diane's the human,
Iris is the humanoid. And Diane lives close to the park. And when Iris the humanoid doesn't go to the park, uh, Diane will ask, "Uh, where's Iris?
Where's Iris?" And, uh, that's because Iris is the only thing uh that will listen to her sometimes for hours. And
uh, this makes Diane very happy. Uh, her
eyes light up. Um, she comes and goes into the park because she's looking for Iris the humanoid. And, uh, you're welcome to call me dystopian. Um, isn't
this a horrible future you're building on? Our parents should be surrounded by
on? Our parents should be surrounded by three generations of grandkids. Um, our
parents should be um surrounded by all their loved ones. If you look at long-term care in the US today, uh the average number, the average amount of
time an American in long-term care spends uh in any kind of social interaction is two minutes a day. Two
minutes a day. And I like to think that in that kind of world um uh there is a big role for machines in connecting with
us. And certainly when I start dribbling
us. And certainly when I start dribbling and drooling and uh my mind is gone uh I almost certainly will be um uh interacting uh with the machine and
hopefully I'll be smiling that situation. That's one thing I would be
situation. That's one thing I would be very happy about.
Uh so right uh we have a little bit of a different take on things. Um, there are a hundred companies around us and I love them all and they're all awesome and
they're working on hands and they're working on mechanical tasks and iPhone assembly and uh chopping onions and making noodles and folding t-shirts and
all of that is awesome. But by virtue of all the brilliant people who are focusing on that problem statement, I consider that uh that will be solved
very quickly, very soon. And so we're starting to anticipate the next step where all these machines will be baked into our immediate environment and we'll have strong opinions about their
behavior and how they connect with us.
And any uh questions or complaints you have, I put my email up. So, um, if you liked it, that's awesome. And for any complaints as well, it's yan@openmind.com.
yan@openmind.com.
Thank you.
All right. Thank you so much, Yan. And
now I would like to bring to the stage uh, Andrew Tan. Make your way over here.
Uh, he is the platform engineering lead at Grot Cloud. So a lot of the questions we have are not just okay can the model do this but can it do it fast cheaply and for millions and millions of people
at scale. So that is what he's going to
at scale. So that is what he's going to be talking about how to scale low latency LLM inference at grot cloud.
Wait, sorry. Sorry about that. Can
everyone hear me? Okay. Uh, so my name's Andrew. I'm one of uh the platform
Andrew. I'm one of uh the platform engineering leads at Grock Cloud. And
you know, over the last couple days when I tell people I work at Grock, people like to say, "Oh, Grock has such a great personality."
personality." And sometimes I need to correct them to say, "Oh, I work at Gro with a Q." But
we also do have and that's not Gro with a K. But we do also have a unique and
a K. But we do also have a unique and distinct personality which is fast low latency inference. And I'm going to
latency inference. And I'm going to share a little bit today about how we achieve that uh with Grock Cloud. So if
you don't already know Grock and Grock Cloud, we're an AI infrastructure company focused on low latency deterministic performant inference.
Now how do we achieve that? Right, we
this is centered on the LPU or the Gro chip which is custom silicon uh designed for low latency inference and we have an entire stack constructed around that. So
that's a compiler, a runtime, we've got cloud infrastructure, we've got global routing, a developer platform, and enterprise features as part of gro cloud. So I'm going to show you a quick
cloud. So I'm going to show you a quick demo of what this looks like. Um, we'll
just do a recording.
I don't know if you could hear that, but it's sort of instant transcription. Tell
me about AI engineer Singapore happening in May 2026. And you see near instant two calls, you see text being generated very very quickly at about 500 tokens
per second. I'll just, you know, play
per second. I'll just, you know, play that again. And this isn't even the
that again. And this isn't even the fastest model that we're using on Grocloud.
So that's a quick demo just to give you a sense of how fast inference can run.
It's probably a few times faster than what you're used to on different platforms. Now why is this important and where is inference demand today? You know with
agents with multimodal models with heavy reasoning models inference demand is exploding. It's accelerating really
exploding. It's accelerating really fast. Uh and in the last year token
fast. Uh and in the last year token demand on GRC cloud the number of tokens we served has grown about 600% or 7x.
And this we're doing this with a hardware footprint that's not so much larger than where we were last year. If
we wanted to serve all the demand there was for inference, this multiple will be much much higher. Uh today we serve about 800,000 active developers in the
last month. Um and we continue to see
last month. Um and we continue to see demand from large enterprises, from startups, from AI companies, AI natives and all sort and different kinds of
developers around the world. And we do think that going forward inference will really define um infrastructure the next generation of infrastructure and
architectural choices around AI inference uh AI infrastructure. Sorry.
Now one thing we spend a lot of time thinking about is I'm not sure why this is not full screen. Sorry.
Okay. Yeah. One thing we spend a lot of time thinking about is how to route requests around the world to serve tokens at the lowest latency. We've got
about 10 data centers around the world, mostly in North America, but also in Europe, the Middle East, and in Australia, serving the APAC region, and
with 65 approximately 65% of token demand coming from North America, 20% from EMA, and 15% from APAC, including
1% from Singapore. And each request we route it to the nearest POP via our Cloudflare edge network and that gets routed to our data centers and we make
lots of routing decisions along the way to ensure the lowest possible latency for our customers.
How that breaks down, you know, this is uh the life cycle of a of a LLM request.
Uh we see that consisting of network latency. A request lands on our itch
latency. A request lands on our itch network. It then gets routed into one of
network. It then gets routed into one of our approximately 15 inference regions which could comprise you know either a cloud network or on-prem um within the
data center. We deploy our inference
data center. We deploy our inference stack in there and within the inference latency uh it breaks down into Q times where requests are queued up for
different models. It breaks it also
different models. It breaks it also consists of prompt time or input processing and completion time which is decode or output processing latency and these add
up to the end to end latency you'd experience on making any LLM request to to any provider and Q times and prompt times are things
we care a lot about because that's the the slow step in many cases uh for getting to that fast streaming time to first token.
In a bit more detail, every request that comes in goes through authentication and hits one of our global load balances.
And the global load balances share information across 15 data centers about what the estimated weight time and Q times are for every single model
instance. And there could be 50 model
instance. And there could be 50 model instances deployed in every single data center. And this information is shared
center. And this information is shared across all the load balances in real time every 100 milliseconds or so to enable routing decisions to be made.
One it it is not the easiest to make these routing decisions because we do need to estimate what output generation lengths they are. Unlike typical API requests, you don't know how long an end to end request is it's going to execute
for because you don't know how many output tokens are going to be generated, right? and we take some sampling and we
right? and we take some sampling and we sample from the available backends bucket the TTFT and route the request to the optimal model instance deployed in a
specific data center.
There are also a lot of checks along the way including for rate limiting um obviously tracking and auditing different usage events as well in a bit more detail. You know we we
bucket things by TTFT to route to the best model instances in the best regions. We apply some priority for
regions. We apply some priority for different types of customers to make sure say our enterprise customers get faster traffic. This is done across
faster traffic. This is done across multiple ingress paths into our different into our different clusters and we need to enforce sort of global rate limiting to ensure there's no geo
arbitrage to get around rate limits and why rate limiting is important I'll come back to a little bit later on.
Another key aspect of serving traffic around the world is identifying the right model mix at different times of day in different regions or even week to week. We see
diff demand for different models varying and it's important to be able to deploy any model to a specific region pretty quickly and we do this through a declarative very simple manifest that
reconciles quickly. So within a minute or two after
quickly. So within a minute or two after committing and merging some code config, we can deploy a new model into any region around the world. So minutes from
merging to serving traffic with the appropriate canary testing and warm-up for each of the model instances.
Now another question we get often is how do we get models to run on our custom silicon? Typically we take open weights
silicon? Typically we take open weights from hungface uh and the pietorch reference implementations and we compile it into our gro tensor operators our
dialect into mlir schedule it partition it across different chips. Uh we run different presets to enable this and that gets compiled into input output programs or byte code that executes on
our custom hardware fully compiler scheduled execution and software scheduled network. uh so we get very
scheduled network. uh so we get very extremely predictable latency performance for every single request.
Now why with a popular developer platform we attract a lot of abuse and fraudulent behavior as well and you can see that uh the the attack vectors are
getting more and more sophisticated and the number of abuse fingerprints we're picking up abuse signals continues to increase on the platform. So it's
something we uh do need to monitor very carefully with rate limiting and other mechanisms. Now just two more slides for me. Uh in
thinking about what the largest enterprises that we work with are looking for in an inference stack in 26 and 27. Uh large enterprises are
and 27. Uh large enterprises are increasingly looking for dedicated compute capacity. Data residency
compute capacity. Data residency continues to be an important topic. uh
for as models increase in size these large models decode latency continues to be something that people pay a lot of attention to um and the unit economics associated of large model deployments
there's also a range of sophistication in uh large even for AI natives AI companies some want oneclick deploy some want managed service some want bring your own models bring your own weights
some want their own inference stack so there's a quite a heterogeneous uh demand for different types of inference services moving forward. Now the last
slide for me is a little bit about what LPU based decode looks like. I don't
know if anyone watched the Nvidia GTC uh speeches earlier this year where Nvidia CEO announced the Vera Rubin plus Gro 3 LPX system. Um the key idea behind that
LPX system. Um the key idea behind that is this aggregated inference where you run prefill and a number of the layers on GPUs and you run decode maybe the
FFNES on um LPU like chips and we going forward we do see heterogeneous compute
being much more common and the way to achieve better unit economics better speed and better performance uh of course be and that will need to be aligned to the ecosystem in the models
that are compiled onto this hardware and run. So, that's a little bit about what
run. So, that's a little bit about what I wanted to share. I hope you enjoyed learning a little bit more about Grock Cloud. Um, and we've got some links here
Cloud. Um, and we've got some links here on how to get started as well on our developer platform. Thank you.
developer platform. Thank you.
Thank you, Andrew. And up next, I would like to welcome to the stage Daria, who is the head research scientist at Cerris. uh she is a person behind
Cerris. uh she is a person behind designing many of thee recipes at Cababus and she's going to be talking about at scale from GPUs to wafer scale
AGI Hi everyone. I'm super excited to be
Hi everyone. I'm super excited to be here today. I will talk about how we
here today. I will talk about how we train mixture of expert models at scale on Cerebra's hardware.
First I want to start um a bit about me.
Um currently I'm a head research scientist at Cerebras and for the last couple of years I've been researching MOE networks and as a result I have this MOU 101 guide that we published. It
basically teaches you how to train and run inference for MOE models efficiently.
Um currently I'm leading um frontier scale training on Cerebra's hardware and before um I worked at the company called Yandex. It's very um known like a
Yandex. It's very um known like a Russian Google. Uh I worked there on
Russian Google. Uh I worked there on transformers and on the first transformer that we deployed in the production stack and before that I was at Google working on the speech to text models.
For the agenda today, I would like to start with giving you um an overview of what happened in the LM community for the last few years and how we ended up with MO networks. Then we'll talk about
what is an MO network and how we train them at scale.
Um first of all in the LM community we did a lot in the last few years. We
started with the GPT3. OpenAI released
the model that was 175 billion in size.
And in addition to the model, they also released the scaling law showing that as you increase the model size, you're getting better and better quality.
Shortly after that, there was a release of Llama 3 series by Meta. They scaled
the model further. So, it's 400 billion in size now. But in addition to that, they spent a lot of time figuring out how do you extract the signal from the data efficiently. So some of you might
data efficiently. So some of you might heard about the chinchilla scaling loss.
They suggested that in addition to scaling the model, you also want to scale the token budget. Something like
20 tokens per parameter is considered computer efficient. And so at the end of
computer efficient. And so at the end of this, we were able to scale both model and tokens very efficiently. However, if
you continue scaling the model size and the talking budget linearly, it becomes very very expensive very quickly. We
want to train trillion parameter model sizes on trillion parameter data sets.
So the other breakthrough that happened um a couple of years ago was the release of DeepSQ3 model by the DeepS uh company. That model was larger in size.
company. That model was larger in size.
So 671 billion total primary count but it was very very efficient because it would run at the speed of the 37 billion um active primary dense network.
How did they do that? The architecture
behind the scene is the mixture of experts. If you look at the decoder
experts. If you look at the decoder block of the transformer network, you'll see that we have different types of layers. We have embedding attention and
layers. We have embedding attention and FFN block. Um if you want to create an
FFN block. Um if you want to create an amoe network you'll see on the right you just take the FFN block and copy paste it and each FFN is now going to be
called an expert you also place an additional network on top which is called the router and the job of the router is to decide which expert should process a particular token.
This way you can continue increasing the capacity of the network. So you can go to 671 billion parameters by adding more experts. But because you only activate a
experts. But because you only activate a small fraction of them, you can be very efficient and run at the speed of the 37 billion dense network. Now you might wonder, okay, this sounds great, but how
does the scaling low look like for these networks compared to the dense?
Here I have a plot for you where I'm scaling the number of experts and and comparing the quality of the MO network against the dense network running for
the same flops. You can see that you can get loss improvements up to 5% here with 32 experts with literally no increase in compute. So you get it for free just
compute. So you get it for free just because the architecture is smarter. On
the other hand, you can think it about this way. You can train to the same loss
this way. You can train to the same loss as the dense network with just a third of compute. And here I only have 32
of compute. And here I only have 32 experts which is very very tiny compared to what the state-of-the-art models use.
We use hundreds of experts. So you can see how efficient this architecture is.
And in terms of the LLM community, we are super excited to have that running at scale because for the last few years we couldn't shift the scaling law as efficient as it it's done with the
now. um we know that it should run
now. um we know that it should run faster than a dense network, right?
Based on the theory. However, when we actually run it on an actual device like the GPU device here, we get it slower than a dense network. The m of is lower.
So why is that the case? Let's take a look at how we actually implement MO networks on the GPU devices. Each GPU
usually has a limited amount of memory.
And so if you run a very large network, you have to split it. You have to split the model parameters. For movies, we use expert parallel. Basically, you position
expert parallel. Basically, you position different groups of experts on different devices. Um, you can see here expert
devices. Um, you can see here expert one, two, three is on GPU 1 and expert four, five, six is on GPU 2. And you add two additional all to all operations.
This is usually done because you also do the data parallel and so you don't know in advance where to move tokens to which devices. So then they can be processed
devices. So then they can be processed by particular experts. And so these two alltoall operations are very expensive.
Majority of the time if you try to profile this uh will be spent on the communication and unfortunately there is nothing fundamental that we can do on the GPU
side to improve that. It comes down to the physical wires.
Now I want to show you a comparison between the GPU device and the CS machine. Here um I have the B200 GPU.
machine. Here um I have the B200 GPU.
You can see that it uses 126 megaby of SROM. It's or L2 cache. It's the
SROM. It's or L2 cache. It's the
available memory on the chip and it's also running at 8 terabytes per second on memory bandwidth.
It's also a very tiny silicone compared to the Cerebrus which is the size of the dinner plate. Um, and it has way more
dinner plate. Um, and it has way more SRAM. So you can see that we have 44 GB
SRAM. So you can see that we have 44 GB of SRAMM and we're running on orders of magnify faster memory bandwidth. what it
allows us to do, it allows us to actually train a very large network on the chip itself without any type of model paralization.
However, if we go beyond the 44 GB of SRAM, we developed a technique that helps us train networks literally like one trillion size on just one device.
How we do that? We add additional memory X nodes to our chip that will be um our weight banks. Basically, it's like
weight banks. Basically, it's like external memory where you hold most of the model parameters. To do a gradient update, you're going to stream layer by layer weights from the memory X nodes to
the chip, calculate your gradients, and then move the gradients to the memory X node to update the weights. This way,
you can connect very large memory banks like memory X nodes to one chip and train one trillion models and beyond that without any type of model operization and without any additional chips.
And this is very useful for MO networks in particular because we want to train very large networks. We want to train a lot of experts and so experts sit on the
same memory X nodes or on the same chip and there is no communication overhead.
However, when we run MO networks on Cerebras, we actually see the same problem. They're running slower than the
problem. They're running slower than the dense networks.
The problem here is slightly different.
Today networks are very different. We
want to train a lot of experts that are very tiny in size and because of that we have a problem with arithmetic intensity. So ammo layer compared to the
intensity. So ammo layer compared to the rest of the networks uh moves a lot of weights and it does very little compute per per weight used. Because of that the
throughput the speed of the network is worse compared to the dense.
We fix this problem with a technique called budge styling on attention.
Essentially, if you want to deal with the compute scarcity, you if you want to improve the uh comput the arithmetic intensity, the easiest way to do it is to increase the batch size. However, if
you look at different layers um in the network, if you just uniformly increase the batch size for all of them, some of the layers will actually hurt performance like the attention.
Attention is activation memory bound.
So, increasing the batch size there will start evicting more stuff into the memory x nodes which is not efficient.
We don't want to do that. Instead, we
want to decouple the batch size requirement for attention and the form layer. You can see here for attention,
layer. You can see here for attention, we can keep a very small batch size, the original batch size and just iterate in loops and concatenate the results together into the bigger batch size. You
can see that we concatenate G different loops. And now we can throw this bigger
loops. And now we can throw this bigger batch size into the MO layer. And it's
going to be restoring the arithmetic intensity of that layer to run it at the speed of the dense network. And you can configure this G depending on the sparity level. So here I have the
sparity level. So here I have the results for you the empirical results where we tested different layers of levels of sparity for the quen 3 network. You can see that baseline
network. You can see that baseline without BTA on cerebris run can run up to 7x lower than a dense network which is very inefficient. With the BTA we fixed this problem and you can see that
we can we can restore the original theoretical premise of the MO network and run it as fast as the uh dense network. So 671 billion um MO network
network. So 671 billion um MO network from deepseek can run at the speed of the 37 billion dense takeaways. Um I want to share with you a
takeaways. Um I want to share with you a few takeaways here from my talk. One is
in my opinion is the fastest way towards the edgi. So compute efficiency that
the edgi. So compute efficiency that comes from that network is really incredible. Unfortunately are not very
incredible. Unfortunately are not very efficient on the GPUs and they hit some communication bottlenecks. However, on
communication bottlenecks. However, on Cerebras WC, we fully realize the MOE theoretical promise.
Thank you.
And if you want to learn more, here is the QR code to the MO guide where we talk in details how to train this networks. Thank you.
networks. Thank you.
Thank you, Daria.
Yeah.
All right, this concludes the first section of our afternoon talks. Um, so
we have a 15 minute break before we come back. Uh, a few quick announcements.
back. Uh, a few quick announcements.
Number one is that the expos where you can meet the different companies at their booths is going to be closing at 5:00 pm. So if there's anyone you want
5:00 pm. So if there's anyone you want to meet, uh, head over to Pullman or the Atellier has uh, booths like uh, Cursor, Google DeepMind, etc. And then Pullman has the
robot playground and uh, open AI's booth many amongst many others. Uh, and uh, I'd like to welcome back to the stage, uh, Kazaya, who you've seen at around 10
am today, who is a trained mindfulness teacher who's just going to give a little bit of an experiential um, immersive experience where you can uh,
where she basically uh, created a um, vibecoded particle visualizer trained on hours of meditation.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Heat. Heat. N.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Heat. Heat.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Heat.
Heat.
Hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, you know what?
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey hey.
Heat. Heat. N.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
Hey, hey hey.
Heat. Heat.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, hey.
Heat. Heat.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey.
keep the programming going. Next up, we have tasks. Uh if you aren't familiar with
tasks. Uh if you aren't familiar with ZAI and the GLM family of models, um some of the best open source models on the market. Um not as expensive as the
the market. Um not as expensive as the premier models you might be using. Very
good for things like open clause, personal automation kind of stuff. So
without further ado, I'd like to I thought it and let me check the other side. Okay.
So, you can you can change to the current size, right?
Maybe it's Hey, hey, hey.
of GM model. So today I will present GN 5.1 and also the idea behind Lar's test.
Hey, hey, hey, hey, But it's not G.A.I. and G.I. belongs to
Google, not not your company. So why you are called Z. It seems irrelevant. And
the point is we were first called in Chinese. So actually stands for
Chinese. So actually stands for intelligence.
And when we found that it's hard for foreigners to pronounce Zhi, we try to make it shorter. to make it Z. Actually
Z stands for intelligence. You can
regard us as intelligence.ai. So it's
the best way to link this Z.I platform
to our model and to our services and also I want to introduce GLM to you because a lot of people have used the
GLM 4.7, GM 5, GM 4.1 but actually we were the first one of the first companies exploring large models as you
can see from this paper. So we submitted on like some day
paper. So we submitted on like some day March 18, 2021. So we began the exploration of all the large integration models back in like 2020. So together
with open air and deep mind maybe the earliest labs doing so but we became famous like only in 2024 or 2025 for
most of you and now GM has become a brand not just stands for this generic uh general language model like like large language model we also have our
own branding and now we currently use architecture outside GLM the original architecture to make it stronger.
stronger, faster and more efficient.
Okay. And more about the model. So
currently we're pushing the boundaries of open source. So we are leading the open source models in text arena and colarena as you can see. So I did a
screenshot before after deepseek. So
when deepseeek they launched v4 they show that they improve a lot but still cannot beat j 4.1. in these benchmarks
and also we are quite strong in coding and genetic tasks. So as you can see this is artificial analysis latest benchmark combining three individual
benchmarks and you and we are just lag behind GBT 5.5 and Opus 4.7. So its
current state very close to Opus 4.6 six but many people use GLM inside clock code cursor kilo code open code so we are not very famous for our harness but
like we use other harnesses like they're great and their coding agents can help go make better jobs
okay so that's all for GM itself and ZAI now we're talking about long horizon task because today I'm not spending too much time on go but I want you to
remember the idea uh and fully understand what long horizon tax actually means. So have you heard of
actually means. So have you heard of long horizon task and long running task?
If you haven't, so these three labs, they all mention long horizon and long running in their
latest post about their model. So G 5.1 we emphasize a lot about our long horiz capabilities and Opus 4.7 also also
mentions long runninging skills and for Kim K 2.6 they have beautiful front end capabilities but they also emphasize a
lot their long horizon capabilities especially coding capabilities. So long
horizon has become very popular.
Why? So why matters to you? Why you
listen to this idea behind like behind the model? So we can share a lot of
the model? So we can share a lot of things how to make websites, how to make slides, how to use GIM for Excel but why long horizon matters. So the first thing
is it's useful because before the era of long horizon you can only achieve like one to 10 task at a time. When you go to sleep, you
don't have task to do because your agents can only finish it in 30 minutes and you have eight hours. Your agents
cannot do anything. But with long task, everything becomes true.
And another thing is with the evolution of open cloud heras, there are a lot of agents that has heartbeat that can interrupt your task. Sometimes you have
memory, you have a lot of things going on. have MCP they can interrupt your
on. have MCP they can interrupt your workflow. So unless your model has long
workflow. So unless your model has long horizon capabilities they can stick to the original goal. So if they cannot stick to the original goal they'll follow the latest instruction and
totally forget what you are doing right now. So long horizon capabilities make
now. So long horizon capabilities make this happen.
And a fun fact very interesting story.
So in our latest hackathon we had a 48 hour hackathon. It's the first time we
hour hackathon. It's the first time we have a 48 hour. So there's a night between the two days. Most of the
participant choose to run G 4.1 during their sleep and actually they made it.
So seven out of nine winners chose to run the task during their sleep and it's a it's a great so I use a graph to show this. When you sleep maybe this year
this. When you sleep maybe this year your agents will continuously work every time your agents gathering and discuss and finish the work for you.
And also the second reason why I have to introduce the idea of long horizon to you is it's hard. So not just useful because if it's useful like there's no
need I I I speak here you can use it freely as ZAI you can try your best you can run it whatever you want for eight
hours but it's very hard because the first thing is many people think long horizon as long context window but
actually G 4.1 only has 200k context window so where's the gap First thing is GM 5.1 is really strong
not because it contests is very long because it can understand the context it can understand your plan and your memory to better reflect the outcome. When you
use C code sometimes you cannot use like one 200k but the compress uh the context window may be compressed quite often. So
you need to stick to the original goal.
And the second reason is so even some models claim they have one million context window but when you use like near 500k it forgets everything they
only stick to the latest latest guidance and forget the original plan or they they don't follow what's going on in the
cloud MD and the second reason or the second misconception is uh some people think if I give enough instruction beforehand. So at least all
instruction beforehand. So at least all the instructions it it may follow pretty well because there's no need. The model
has the long horizon capabilities. I
have long hard capabilities. I can
instruct it to do it in the 100 runs but actually one model was not trained in this aspect. It doesn't has it doesn't
this aspect. It doesn't has it doesn't have enough capabilities to stick to your plan. So it will try to do whatever
your plan. So it will try to do whatever they want after certain pattern and later on we'll show you the the story.
And the third misconception here is that many people think the longer the better, right? So what people want to model lab
right? So what people want to model lab post like I can run eight hours maybe another lab shows I can run 12 hours 24
hours like a day seven days but from my point of view that doesn't make sense because we have super fast inference right now. Yeah. So as you can see there
right now. Yeah. So as you can see there are a lot of inference providers that can provide TPS more than 200 and the
latest technologies the hardcore the model inside the chip they can inference near 17,000 tokens per second. So the
time doesn't matter, right? The if you think about time, you you use the latest techniques, you only need to run like one minutes. It doesn't make sense to
one minutes. It doesn't make sense to run eight hours, right? So what actually long horizon the long means is not about
time, it's about a kind of depth. So as
we hear not longer but deeper. So long
horizon actually means a capability to keep finding meaningful improvements.
Yeah. So you have to make improvements but these improvements are meaningful.
For example, so if I have 10 followers on X, I want to collect all their information. So I give a prompt. So
information. So I give a prompt. So
scrape all the data of these 10 followers. So those are one scenario.
followers. So those are one scenario.
But if I want to scale, I scrape 100, 1,000 10,000.
The mission doesn't change very much, right? So you have to make very
right? So you have to make very meaningful changes and improvements.
So what actually are long horses task?
So you talk about the idea of long horizon, you you talk about what are wrong, but so what are the right thing?
one need to be care categories. The first category we call
categories. The first category we call it subjective goal. So in this first category you want to create a website, you want to create create a system.
There's no clear metric of what the best website is, right? So you can let the model run infinitely. But where it stops
that depends on your capability, your judgment, not the model's judgment. And
the second category is that this the scenario requires a objective goal. For
example, you want speed, you want price, you want everything related to a certain figure. So we have two categories and
figure. So we have two categories and for each categories there are completely different mechanisms for we to optimize both as a model and also as a human.
Oops. There's a video, but there's something going wrong. I'll try to make it happen right now.
So if it's not fixed in 30 seconds, I suggest you to look at the X of Z.AI. So
actually we have this is a video of how we built a Linux system from scratch from zero to one in eight hours and
within eight eight hours it doesn't just adding apps. It first create a layer to
adding apps. It first create a layer to let all the apps can be integrated into the system and and then it polish all the interface and then test all those
apps and finally add 50 apps to it. So
that's what this is supposed to be but like unfortunately we cannot present here. So maybe you can search uh G 5.1
here. So maybe you can search uh G 5.1 blog and there will be a a comprehensive illustration of this task. So why humans
are needed? If this model is super
are needed? If this model is super strong and it can finish almost anything. So why we are needed? Because
anything. So why we are needed? Because
I can go to sleep. I don't need to I don't need to instruct the model, right?
Because when I go to sleep, I I let it finish a Linux app and after I got up so it already there. Why why do I need to
join this event and learn how how to use long horizon tax? Because model can make mistakes and it make mistakes quite often. There are three major mistakes a
often. There are three major mistakes a model can make. The first one is the model may not stick to our original
goal. If you do a prompt that let the
goal. If you do a prompt that let the model optimize for five times, it may behave perfectly. But if it let model to
behave perfectly. But if it let model to optimize for 600 times, it may totally forgot the original goal. Right? because
the model always the attention they they they care about every single tokens. So
sometimes you when you talk about Linux and then you talk about iOS the the model totally forgets oh you are doing
Linux app or iOS app that's quite often to handle this I suggest everyone or recommend you to have a checklist so
whenever you do la tax try to prepare checklist it's the best way for for your model to stick to the original goal and you have to ask it to reread the
goal every few steps because you have many steps, right? Because when you optimize for only 10 minutes, you don't have many steps. You you don't think it's very important, but you have to
manually instruct it to reread all the instruction pretty carefully.
And the second thing is error accumulation. So if you find the model
accumulation. So if you find the model makes makes a mistake in the 400 runs let's say so it doesn't impact a lot but
when it goes to 800 runs it can actually break all the things. So that's called uh error error accumulation. So to let
it not happen quite often, you have to verify not by yourself but you have to instruct the model to verify itself. Uh
from from zero to one to 100 you need to have several checkpoints. So when we train model we have checkpoints but when you run horizontax it's similar you have
to set several checkpoints for yourself and for the model to check itself. And
the third thing is models were trained to push very hard, right? Because if you want the model to do this, it will continuously and sometimes in the loop
keep doing that, keep doing the single thing at a time. But it's hard for them to pivot. So the model never gave up
to pivot. So the model never gave up sometime. So you have to let the model
sometime. So you have to let the model give up or pivot if they found something very wrong. So also checklist is great
very wrong. So also checklist is great is very helpful here and they have to evaluate whether by yourself or by the agent
whether to continue to stop to revise to do anything that's related to your task.
So those are the suggestions for the subjective goal type of um long horizon task. So that's what people can do and I
task. So that's what people can do and I think a lot of people are building their apps or you are doing similar stuff. So
that might be helpful for your deployment and another thing. So it looks harder
because that's what objective goal is about. So we have a very strong case.
about. So we have a very strong case.
It's called like optimizing a vector database. I believe not many people of
database. I believe not many people of you have optimized a vector database.
Even our researchers or or the people who is responsible for training haven't have access to this do domain knowledge but our model did. So we start from zero
and we let the model to optimize for itself and for 100 runs and finally they got here. So have a very meaningful
got here. So have a very meaningful improvement at 100 runs and we do similar stuff. So from zero to 100 round
similar stuff. So from zero to 100 round to 200 run finally you go to 600 rounds we have basically six to eight
scientific findings. So the model pivot
scientific findings. So the model pivot a lot the first at first they use technique one and they start to use technique two and they use technique
four. So
four. So I want you to mention these failures. So
actually these crosses like means failure. So when you look at these 600
failure. So when you look at these 600 runs.
So basically most of them failed right.
So when you talk about long horizon task actually it doesn't mean you succeed every times just like life. So you
sometimes succeed sometimes fail and in the circle area all the all the optimization failed. So for long horses
optimization failed. So for long horses tax or long horses models the critical part is the model can reflect can plan
can change the ideas or can optimize improve can continuous improve itself to a better way. So that's what the future
optimizations look like and for this type of task maybe it's very hard for you and maybe it's very hard for me. I highly recommend you to look at evaluations.
And here's my favorite evaluation currently. It's called Frontier Suite
currently. It's called Frontier Suite because we all know Sweet Bench. We all
know SweetBench Pro, but Frontier Suite is a bench that's trying to assess the capabilities of long horizon task including both subjective goal and
objective goal.
and it's their category. So they
categorize long task in three ways not not just by subjective goal and objective goal. So first is
objective goal. So first is implementation. When we talk about
implementation. When we talk about implementation you start from zero to one.
Here's our three examples and I highly recommend you to look at their website.
It's all it's beyond three tasks and when you want to build an app when you want to do some web coding stuff agentic stuff is pretty much same like uh implementation and the second one is
research.
So actually trading is type of long hardening task you have to learn from your previous failure you have to learn from a lot of things you have to do the
research for the market. So a lot of things outside coding belongs to long hor task. So long horizon doesn't belong
hor task. So long horizon doesn't belong to what engineers do. Traders,
scientists can also use long horses and task to do things. So that's what research stands for. So you can use long horizon to explore a lot of things.
And the third way is optimizations. I
have shown you the capabilities of it.
So currently our mo our model teams is using AI is using GLM to to optimize CUDA kernel to optimize vector database.
So when we talk about self- evolving when we talk about continuous learning Z.AI model teams already a team an AI native team that can use model to improve
itself and improve the inference of the model.
Okay, I think that's all for today. And
here's my LinkedIn and axe one. I
doesn't post on LinkedIn, but there there's a profile of me and on X. Uh I
post a lot. I'm quite active on X, but there's no profile, so you better like scan both. And I think that's all for
scan both. And I think that's all for today. Welcome to all the questions.
today. Welcome to all the questions.
Yes. Yeah. Feel free to reach out me through these two platforms. Thank you very much.
Thank you so much. Um, next up we're going to switch things up a little bit.
Um, we're going to talk about voice agents. Now obviously we've talked about
agents. Now obviously we've talked about design and different interfaces as part of it uh as part of the conference so far and we wanted to look at how voice
might be one of those paradigms and to that end no one better to hear from than from Boris Starkov who's a growth engineer with 11 Labs. So 11 Labs obviously is one of the leading
companies in the space. Um and Boris will be talking about the speech engine and what makes an agent conversational.
So without further ado, Boris.
Um, hi everyone. I'm Boris. I work as a growth engineer at 11 Labs. 11 Labs is a frontier voice AI lab. Um we research
and build applications all across voice AI.
Um we're also particularly excited and we strongly believe in voice as the main medium for human to agent interactions and matter of fact we're very happy that
the industry starting to catch up with that uh vision. Take coding agents for example. uh most if not all of them
example. uh most if not all of them actually have some sort of use voice mode uh button.
However, if you actually use it uh it works the following way. You you start uh talking, you talk to it, then you wait for it to get transcribed, you wait
again for the for for the agent inside and then you wait for the third time uh for actual speech synthesis part. So
sure this is a voice input and this is a voice output but this is not conversational.
And today I want to talk about how to improve this uh architecture to make it feel much more like a natural humanto human uh conversation.
We're going to keep the core architecture the same but we're going to add a lot of small improvements that combined together make a huge difference.
Um I'm going to start with the improvements to speech recognition part and then in part two I'm going to proceed to some improvements to to cover some improvements in the speech
synthesis.
So probably the most fundamental uh the most fundamental piece of puzzle here uh is um is called a voice activity detector.
We take the uh audio stream from the user and then split it into chunks of approximately 20 milliseconds. And then
we have a very tiny, very efficient and very cheap model that can take that can tell you for every chunk whether someone is speaking or not.
Not only this is very helpful later on downstream to actually uh understand what's going on, whether someone is speaking, who is speaking, whose turn that is, it also helps us to save a lot
on compute because if you know that in some chunks nobody is speaking, we don't need to run a more expensive ASR model on those.
It's very important to understand that detecting silence and detecting the end of the of the turn is not the same
problem. For example, uh agent can ask
problem. For example, uh agent can ask me something and I respond with I think um there is a lot of silence but it's
not the end of my sentence. I don't
expect the agent to interrupt me at this point. That's why uh detecting silence
point. That's why uh detecting silence is not enough to accurately predict uh when the agent should start speaking. So
here we trained another model again a very smart uh turn detector model that takes into account not only voice activity but also the actual context of
what's been said before to predict whether this is the end of the sentence or uh the speaker the user is going to say something else. Like in many other steps, by the way, here we use a bunch
of heristics. For example, if the user
of heristics. For example, if the user is um spelling out their car details or their credit card details or their email or they say one of the trigger words
that um we have, we use this as a very strong signal that uh likely there will be some sort of silence and likely that silence doesn't mean that the user is
done speaking.
this this this model is crucial in u the following uh slide. So one of the biggest unlocks one of the biggest wins we can achieve in terms of improving
latency and turnbased models um is is the following one. So to understand this uh let's let's think about how the humanto human conversation goes. You
talk to a friend. Let's say your friend is talking to you. They they're talking talking then they stop speaking then you wait for approximately a second to make
sure that they don't have anything to add and only then you proceed to reply.
Unfortunately agent can't afford to wait for a second because it also needs u some time to generate the response.
That's why what we do we do a speculative uh turn which is immediately after our model thinks that the the user
stop likely stopped speaking at the very same moment we speculatedly start generating the response. Our model is quite smart and so most of the time it's
a right call and the response comes much earlier and it feels the latency is much lower. it feels much more natural and
lower. it feels much more natural and maybe sometimes it would be there would be a false positive. That's not a big deal because then we just send the
cancellation to the generative model and continue listening.
Sounds like a lot but that was only the first part. Uh now uh a little bit of
first part. Uh now uh a little bit of how to improve um syn synthesis part speech synthesis part. So the agent uh
sends us the tokens and the user expects uh sentences in terms of speech like speech sent yeah speech um we can't
really afford to wait for the whole sentence before we send it to the uh to the speech generator model because then
the user is going to wait in silence.
We also can't really generate tokens one by one because then some tokens will be generated very quickly and other tokens would take some time. The whole
generation would feel very jumpy, very laggy and not stable. So we take something in the middle. We make a buffer for small phrases of five, six,
seven words.
We collect tokens together and then flush them to the generator before the whole sentence is constructed. This lets
us have the best from both worlds. We
have stability and low latency. This is
also quite efficient because while the current phrase is being played out to the user, the next phrase is already being synthesized and the phrase after
that one is already being constructed in the buffer all at the same time.
We also use um cascading in many parts of our um of our models and tools. For
example, uh here I'm going to talk about TTS cascading which is we have a texttospech model um that generates the response and at every single time it's
running we also have a second model uh fallback model that's ready to pick up whenever the first one fails. So even if uh the current mo model fails or have
some sort of crash for some reason, the user is never experienced to it. Um
ensuring almost 100% um uptime. So users are never um
uptime. So users are never um experienced to experience crashes, errors bugs etc. Um, this one could actually be a whole
talk of itself, but a very important part of uh making your turnpaced model feel truly conversational is handling interruptions,
letting the user interrupt the model.
That comes with a with a lot a lot a lot of uh different um corner cases, horistics, etc. Here I'm going to cover just a couple of them. So imagine you're
a model and you're trying to detect that the user is interrupting you. So first
uh if the if the interruption is very very uh small very short couple of frames 40 milliseconds it usually means that it's a cough or a
noise or perhaps a false positive coming from the voice activity detector. That's
not an interruption. Another example is if the interruption is coming in the first 200ish milliseconds, that also
likely means that it's just an echo.
Another one, uh, for example, if the user is saying, "Yeah, mhm. Uh-huh.
Okay." That's an active listening.
That's not an interruption either. And
there is a lot of small corner cases like that here.
Well, um, let's actually zoom zoom out here a little bit.
um you've built an agent and you've came here to listen to this uh talk thinking that you're going to make it um conversational and now with all this uh little steps you might feel a little bit
intimidated by how how complex it is.
Well, good news, we've got you introducing speech engine. Um speech
engine is actually uh the new the new product that we we have. We didn't
publicly announce it yet. It's uh we're going to start testing it starting next week. The way it works is uh we
week. The way it works is uh we encapsulate all of the complexity related to making uh things sound fully conversational
into this product while you can bring your own agent and very easily plug it in. So it could be your uh chat bot or
in. So it could be your uh chat bot or your open clone, nano claw, uh hermas agent, whatever um any agent you can of any complexity you can just simply plug
it in. And remember this is not a speech
it in. And remember this is not a speech to text and text to speech. This is a proper conversational engine. We're very
excited to see millions of silent agents uh becoming conversational. Uh, keep an eye for for update on this on our socials again and we're going to start
we're going to likely start um testing it publicly next week. And thank you very much.
Thank you so much.
Next up, we have Jackman from Prime Intellect. He's a founding research
Intellect. He's a founding research engineer. Jackman, you can set up. Um,
engineer. Jackman, you can set up. Um,
and he'll be talking about continual learning for longunning agents, agents that keep getting better. So, this has been something that's been a recurring theme last couple of days. We've talked
about software factories. Uh, ZAI talked about longunning agents. This has been a theme that has come up time and time again. And I think the question that
again. And I think the question that keeps coming up is if an agent runs for too long, how are we making sure that the agents are getting better or learning as they go along? Uh because
there's no it doesn't make any sense if the agent just runs for 20 hours to put out stuff that doesn't work right. So
Jackman works for Prime Intellect. Prime
Intellect is one of the pioneering companies in the space. If you want to train your own models, um if you want to work on these um environments where you can test and improve things, uh they
have really cool tech to work with. And
Jackman, whenever laptop's ready, the floor is yours.
Yes. Uh thanks, Agram. I actually
changed the the topic of my talk, but it's still like related to continual learning and longunning agents. Just I
picked a catchier title, so you'll see it when it appears on screen.
So yes, uh hello everyone. My name is Jackman Ang. I'm a founding research
Jackman Ang. I'm a founding research engineer at uh Prime Intellect and today I'm going to be talking about reinforcement learning and recursive language models. So uh we've heard a lot
language models. So uh we've heard a lot about agents today and all the exciting things they do. Uh, and I find it quite crazy that just like two years ago, back
in 2024 when cursor agent had just released, if an agent ran for more than 5 minutes, you would not expect it to be doing anything useful beyond that point.
And yet here we are like in 2026, two years later, and where we just like let the agents roam free while we sleep, uh, going for hours and hours consuming millions and millions of tokens to do
some pretty remarkable things. And so I think it's not a question especially like in this audience that the models are really useful. And so the questions become more economical ones. Uh
questions like can the models do my task reliably? Can the models do my task
reliably? Can the models do my task efficiently? And can the models do my
efficiently? And can the models do my task quickly enough that I can deliver the user experience that I want for my product. And so today I'm going to be
product. And so today I'm going to be making the case that the solution to all of the above is that you should be training your own uh language models and in particular you should be doing
reinforcement learning to do so and also using RLMs. So first uh what is the issue with longunning agents? So I believe that
longunning agents? So I believe that anyone who's used the agents be it like clawed code or codeex or any of the claws like you know that the models aren't actually that good at long
context. Just because your model accepts
context. Just because your model accepts 1 million tokens doesn't mean that it can reason across the 1 million tokens.
And this is pretty apparent in the benchmarks. So if you look at any of the
benchmarks. So if you look at any of the big model providers today usually in their model card they'll have a section called long context and there will be two benchmarks in there. The first one is MRCR. This is needle in the
is MRCR. This is needle in the haststack. And basically this is uh
haststack. And basically this is uh testing the model's ability to retrieve a particular piece of information in a long series of text. And you can see that as the context length gets longer,
the models get significantly worse at this task. And people who have been
this task. And people who have been working on agents kind of know that like uh this information retrieval thing is kind of nice to measure, but that's not really what we want to know about the models, right? We want the models to be
models, right? We want the models to be able to reason across the 1 million context. And so a very popular benchmark
context. And so a very popular benchmark that's appeared recently is graph walks.
And graph walks is basically we pass a list of nodes and edges into the prompt and basically ask the model graph questions. So things like uh list all
questions. So things like uh list all the parents of X or uh do a BFS on Y and list all the children. And you can see it's the same story. As the context
length gets longer, the models get significantly worse.
But what if instead of passing the entire context into the context window, we just pass it a reference to the context.
And I think this is pretty intuitive if you're like a data scientist or if you've done any amount of data science and you've done exploratory data analysis in Jupyter notebooks because
like you don't pass your entire CSV into the Python code, right? uh you usually do like okay I do my classic uh data science imports and then I define a data frame and then I'm doing these like code
snippets to slowly manipulate my data frame try and figure out the structure of my data what the distribution is and then I figure out okay what things can I
do with this data and if you think about designing agents in this way uh a lot of things become very easy like context chunking becomes very easy
tool calls becomes very easy sub agent delegation becomes much much easier and the reason is that your orchestration agent now doesn't need to reproduce the
context autorecursively uh correctly right it just can pass it as a variable and so uh why stop at just uh variables
right um why not have the entire grabag of um programming structures uh so say for example you need to process uh you have task that needs to process 10,000 documents.
If you were to do this in like the legacy language models, basically you would need your orchestration agent to do 10,000 sequential tool calls correctly and like not just do the tool
call correctly and pass the context correctly. You also need to pray to like
correctly. You also need to pray to like the summarization gods. Please please
that when the model does its compaction that it somehow remembers all the various things it did and can somehow still remember uh like where it is even in like um calling all these sequential
tool calls. But if you had just done it
tool calls. But if you had just done it as a recursive language model, you could well the model could simply write a for loop and just basically do these LLM queries uh these sequential queries in a
very simple manner.
And so we see that like the people who are really good at using the agents kind of already do RLMs. Like if you meet anyone who's really good at using cloud code, they're always writing these
prompts of like, "Oh, please please don't uh put the sub agent to uh uh uh don't put the sub aent output into your context window. Don't put the tool code
context window. Don't put the tool code output into your window. you'll probably
mess up and they'll garble your context uh and like write everything into a file because like people who are very good at using the agents kind of know that compaction doesn't really work and when
you see this like you know it's over the the model is not going to recover from compaction and so um any chat uh agent that you can
use now like chat GBT uh and claude or like AI studio basically if you try to put a really long series of text into the chat window. Uh they basically
always turn it into a file. So like the point being made here is that like people kind of already are doing recursive language models, but they're just not doing the full power of it.
They're only using the variable aspect of it. The fact that you can reference
of it. The fact that you can reference context, but they're not getting the full Python expressivity that you can get if you had a full Python ripple.
And so I think it's no surprise that uh people have started to use RLMs for everything. So anything that needs like
everything. So anything that needs like long context understanding. So there's
RLM for videos, there's RLMs for gaming, there's RLM for coding, there's RLM for math. Uh I believe at some point on
math. Uh I believe at some point on Twitter there was even an RLM for Epstein files. Uh I couldn't find the
Epstein files. Uh I couldn't find the tweet. Uh maybe the CIA removed it
tweet. Uh maybe the CIA removed it somehow.
Okay. And like the uh Alex Zang who is like the first author of RLMs uh he wrote this really nice uh post that I think everyone should read called the mismanaged geniuses hypothesis and the
basic idea in there is that the models are already capable enough to do a lot of the tasks you want and the only thing holding them back is the scaffolding. We
don't quite know how to orchestrate these agents. We don't quite know like
these agents. We don't quite know like oh where we should put the memory what exactly it should be doing what are these like sub aent delegations and like the bitter lesson way of viewing this is
like why are we making humans do this right we should just let the agents define their own scaffolds like all the scaffolds you guys use today cloud code open claw super vibe coded so it's very
obvious that the models can already write really good scaffolds so they should just dynamically write the scaffolds as they're doing the inference Well, it's not so nice right now. So, um
you guys might have like uh seen the slides before and been like, "Oh my god, this is like the the best idea ever."
And then like you go home and then you uh try out the RLM repo. Um but you might feel a bit lackluster. And the
issue is that what you will notice if you look at the way that the agents do RLMs right now is the agents aren't trained on this scaffold. So, they're
not very good RLMs. They don't quite figure out that, oh, they should be doing sub aent delegation. they don't
quite know how to do this like context slicing thing but like yeah you should read the blog post but in the blog post basically it's showing this task where if you had just taken the base model with the base uh RLM prompt it doesn't
perform very well but with a bit of prompt engineering you can get significant performance gains and you basically always beat the base model and if prompt engineering is enough for you
to be able to beat the base model with RLMs what's stopping you from just training these good RLM strategies right into the model itself And so that's what we're trying to do at
Prime Intellect. So Prime Intellect,
Prime Intellect. So Prime Intellect, we're a platform that uh is trying to serve anyone who is trying to train and serve their own uh language models. Uh
we support many of the open source language models from GBD OSS to Llama to Neotron and all the quens. Uh we
basically have experiment management. So
you can see your metrics and also all your experiment configs.
And most importantly, you can look at rollouts which is like the most important thing. you can see your
important thing. you can see your failure cases and uh look at your data.
Uh we have some pretty interesting users. Uh so I think this was like two
users. Uh so I think this was like two weeks ago. Uh Ramp Labs announced that
weeks ago. Uh Ramp Labs announced that they were working with us and they basically did a project where they trained a a small Quen model to beat
Opus 4.6 on a retrieval task for Excel agents. And not only did it beat Opus
agents. And not only did it beat Opus 4.6 ICS in terms of accuracy on this task they were interested in. They also
could do it more cheaply and they could also do it at lower latency.
Another interesting user segment for model training is data vendors. So
there's this guy called Shan Chai. I
think if you're in the data space in Silicon Valley, you've probably met him before. I think he's basically talked to
before. I think he's basically talked to like every data supplier, every data consumer in the valley. And he made this observation that the distinguishing factors of like which data labs will
make it in the future is whether or not they're able to develop in-house training capabilities. Because these
training capabilities. Because these models uh these um labs buying the data, they're not stupid, right? They know
that like not all data is created equally. And before they sign like a
equally. And before they sign like a million-doll deal to buy a bunch of data, they want to know like is this data going to improve my model capabilities or not? And a very easy way
for you to do this and very definitive way for you to do this is to simply show reward curves. Simply show that if you
reward curves. Simply show that if you have trained on my data, uh then your reward goes up or if you have trained on my data, your agent performs the task significantly more efficiently.
So if any of this sounds very exciting to you, uh please check us out. We're at
primeintellect.ai. Uh we look forward to seeing what you guys build. And uh
that's it for me. You guys have been a great audience. Thank you very much.
great audience. Thank you very much.
Awesome. Thank you so much, Jackman.
That was a really, really good talk. Um,
next up we have Michelle Julia, who's a co-founder of Blue Labs, who'll be talking about for AI to be emotionally intelligent. Obviously, we've been
intelligent. Obviously, we've been talking about personalized AI for a while, so this is a topic that's pretty pertinent. But Michelle is also kind of
pertinent. But Michelle is also kind of a badass. She's one of the youngest
a badass. She's one of the youngest patent holders from Apple. So if you've ever used Find My Find My iPhone or Bump to exchange contacts, the wireless
system that runs underneath all of it, she is the patent holder for it. Uh but
today we're not talking about that.
We're talking about emotionally intelligent AI. Without further ado,
intelligent AI. Without further ado, Michelle Hello.
Hi. Hi to y'all. I'm Michelle. I'm
co-founder of Blue Labs. We're a
research lab focused on emotional intelligence, specifically embedded emotional intelligence.
Embedded emotional intelligence is the capacity to navigate ongoing relationships where each interaction shapes the trajectory of future wants.
So it's not a static state. It treats
stewarding the relationship and capturing immediate utility as co-equal objectives, not as a trade-off to optimize. So our research is around what
optimize. So our research is around what architectures let AI systems do this in a way that humans do.
If you take a quick step back, really we're focused on making AI sound and feel human, especially in commercial decision-making processes. So that's
decision-making processes. So that's where we're focused on today.
Let me ground this in a quick story.
So as he mentioned previously, as a grim mentioned, uh I was at Apple before Blue Labs and you know, I was one of the youngest patent holders there. If you've
used Find My, it runs on wireless algorithms that I hold the patents for.
And you can imagine I am a small Asian woman.
Oftentimes in negotiations, the room looked like this. So I was a little bit anxious going into every negotiation. Uh
the first one that I went to, we were flown out to Portugal. And the night before I was sitting in the lobby of this hotel and I was very anxious and I was going through all the technical details of, you know,
What exactly are we negotiating with these external vendors? What is Apple's position? How do we talk to them about
position? How do we talk to them about the tech?
My manager then sat me down and said, "Listen, we have an hour to talk about this. Forget the technical details for a
this. Forget the technical details for a second. These these are the past 10
second. These these are the past 10 years of history that we've had with this vendor. And this is all the tea.
this vendor. And this is all the tea.
Let me tell you about the relationship that this guy has with this guy and how we've negotiated with this guy in the past and what he kind of looks out for and how he has interacted with our big
boss in the past. And this is all the dynamics of this room that you're walking into. And that will serve you
walking into. And that will serve you way better than just memorizing technical spec.
It was at that moment that I came to a realization that what's important isn't necessarily just the technical utility of a conversation.
In most settings, humans require an understanding of longitudinal relationships. And so for
longitudinal relationships. And so for me to be an intelligent agent of apples, I need full diadeic context on each vendor and to be able to steward that relationship forward in a way that's
beneficial in the long run.
So that's a lot to hold for one person to hold, much less for an agent to hold.
Most humans actually do this intuitively. You don't need to really
intuitively. You don't need to really think about the mechanisms that much.
Most of y'all are, you know, well functioning and well situated.
But it's hard to model and balance relationship states over time in these utility based conversations and relationships. Mathematically, it is
relationships. Mathematically, it is hard to prove.
So I believe that the unlock of emotional intelligence in this realm is what will sincerely move us the needle for us in adopting AI as strategic and
useful emulations of humans. We have
built language models that are fluent in the work that humans do but not strategically competent. So taking
strategically competent. So taking advantage of these long-term relationship blocks, I believe true corporate function depends on highly nuanced abilities to
balance trust and relationship with transaction and negotiation.
And so um I'm very excited about the space and my goal today is really just to give you like a little taste of what the field is, what is the state-of-the-art today, what are people talking about and what are some open
questions. And if this is exciting to
questions. And if this is exciting to you too, we can talk more about what Blue Labs is doing later on. So we'll be going over social train of thought and game theory by modality and human
behavior and state beats trait.
I'll try and kind of touch very briefly on these. So the first one it was
on these. So the first one it was published in human nature nature human behavior last year. So basically they played this game right with AI agents where you have the prisonless dilemma
which is a self-interested game and battle of the sexes which is a coordination game and their goal was to really see how models behave um in these
specific states and what they found is an asymmetric result. The models did pretty well at self-interested games. So
they cooperate when you're supposed to cooperate and you know defect when defection pays but badly when it comes to coordination.
And this is sticky because most human interaction is a coordination game, right? When you
sit in that hotel lobby in Portugal, our vendor is not trying to defect on us.
We're not trying to defect on them. We
both want a deal. It's just what kind of deal. So that nuance is hard to capture.
deal. So that nuance is hard to capture.
Social chain of thought also actually increase cooperation rates. Um and so we see kind of an exponential growth when you're able to model both you and the opponent.
The second piece is from Google DeepMind. Um, it came out this year
DeepMind. Um, it came out this year where they played a bargaining game with humans, Frontier models, and a uh specific
Beijing agent, a custom agent that they trained. So, this was, I believe, Gemini
trained. So, this was, I believe, Gemini 1.5 Pro and GPT40.
And what they found is in these three camps of people playing the game where, you know, it's a bargaining game where you're trading chips, the Beijian agents are very aggressive. So they, you know,
kind of play a hard ball. They get
rejected a lot, but then they get 80% of the max surplus. So this is actually very good in a defined space. Humans are
more fair. They give a little bit, they get a little bit. They kind of want this balance. LLMs are very concessionary. So
balance. LLMs are very concessionary. So
it's like, oh, I'll deal I'll do any trade with you and I'll actually give you more than what you're giving me just so that I can make this trade. So every
deal is accepted. And we see an inability for these models to actually self-balance throughout the game. So the
appropriate response here is really that as a human maybe when I first meet you I give a little bit so that we build the relationship and then when when it's coming to a very big transaction then I
want to play more of a Beijing game. And
so um this highlights that there is a static nature to uh in which agents kind of negotiate.
And the third piece, so this is interesting because it comes from computational psychology and not necessarily CS. Um but the highlights uh
necessarily CS. Um but the highlights uh the findings are highlighted, you know, along a similar vein. So it's a paper accepted to ACL um on fixed psychological personas on states rather
than traits. Basically
than traits. Basically researchers were asking how well do language models actually capture what who a user is. And what they found is
who a user is at a specific time is more interesting and important than the user's like general state. So in this point of time given this relationship
I'm a little bit anxious because I'm in a room of these kinds of people or I'm meeting these people for the first time.
These changes in states are actually more important to the users's policy than the underlying users like I am a naturally calm person or I am a
naturally this kind of that person. uh
sorry personality trait. So what we found here is that the the static way in which we model personality actually leaves a whole lot of room for
improvement.
So what this means it shows that models can't coordinate over across changing conditions. They
treat their own behavior as static and are naturally concessionary. Right? I'm
pointing out all these problems to show you that there's so much more that we can do to imbue models with this sense of understanding and sense of emotional relation.
So, we have a couple of research directions and I have 30 seconds, so I'm going to run through these very quickly.
One, can we train language models to modulate between strategic registers?
When to push and when to pull?
Two, what's the most appropriate architectural representation of a relationship? Diadeic embeddings,
relationship? Diadeic embeddings, reflective memory hierarchies. It's an
open research topic. And any one of you, if you have an idea, you can you can implement these experiments pretty quickly and whip something up.
And so this is a rough estimate of, you know, what we're exploring, we're beginning to explore at Blue Labs. Our
first architectural attempt at this is Blue JST, a joint state engine whose core idea is a dual reward mechanism that holds relationship building and utility prioritization as co-equal
objectives rather than reducing one to the other. And like I said, it's open
the other. And like I said, it's open research. It's exciting. It's, you know,
research. It's exciting. It's, you know, we don't have all the answers yet, but if any of this is interesting to you, we're hiring and we'd love to chat.
We're actively collaborating across industry and academia and the research is out there for us to get. Thank you.
get. Thank you.
Thank you, Michelle.
Next up, we have Jackie Mock, who's the head of applied AI at RA. Now he will be talking about world models um and how do we move from language to physical
intelligence um again we're moving into the terrains of physical AI embodied AI um not quite the robotic side yet but more world models world building sides
of it um so once Jackie is set up we're going to be ready to Hi.
All right. Hi. I'm talking about uh how we go from language to physical intelligence. Uh my talk is about our
intelligence. Uh my talk is about our path towards world models. So I'm
Jackie. I work at REA where I am the head of applied AI. Um, REA is a multimodal AI for video and image and text. Uh, you may know us from some of
text. Uh, you may know us from some of the models we built uh a few years ago where we were like climbing the leaderboards. Um, we're very focused
leaderboards. Um, we're very focused more lately on vision models and different modalities and uh at the lab we are working to understand how we can
apply these to real world situations.
So in terms of vision today um we are already um having a lot of these CV technologies that can do a lot of things right this
is a solved problem being able to detect cars to detect things and to track items that's something that comes from computer vision um and we can use these to kind of help our deployments understand with more deterministic ways
of what's going on within the video but you can see later on the video that the the machine doesn't actually understand like what it's actually seeing. It might
be able to see the heat map. It might be able to see the bounding boxes and this is where computer vision was before VLMs come in. So now we have VLMs and with
come in. So now we have VLMs and with VLMs we're able to look at a scene, think about the scene and then take action on the scene, right? We are able
to do CV on top of it to kind of help it also figure things out over time. Um but
this is kind of how we can apply LMS. Uh but we don't replace CV. CV is
kind of on the side too. Um and another example of how we deploy AI in production is um here where you can add like detecting tracking and
identification. Um here we still use CV
identification. Um here we still use CV as a very cheap step to kind of understand what's happening in the scene. Uh then we use VLMs to do
scene. Uh then we use VLMs to do reasoning and then we use it to alert uh for specific use cases. Right? Neither
alone is sufficient and neither alone is uh physical eye just yet, but these are the building blocks that we have that came from our language models. Um so BLM are able to predict the next token
because we're able to take this visual space, encode it uh into some embedding and we generate the next token. So we
can explain what's in a in an image and what's in a video over time. However,
the output is still largely text based.
Um there is this other paradigm that we also build models around. um where we're able to predict the next frame, right?
So you've seen the fusion models where they generate images or or videos. Um
this is also a path that now robots and physical AI is trying to use to uh generate uh trajectories for for robots. And
these two models, the language model versus these video models are not exactly world models just yet. Um and
for us uh we are taking we can go from both approaches right and both approaches actually help us craft this next idea of like what a world model is.
So we want to predict the next action and that's the biggest thing that makes the difference between anything in anything out. Um and we're going to talk
anything out. Um and we're going to talk about how we're trying to get there.
Here's an example of how we're able to train a model from scratch. So this is not an off offtheshelf model. This is
like completely from scratch diffusion model that was trained on video generation. So it can make cinematic
generation. So it can make cinematic films and cinematic scenes of 5 seconds.
Um but when applied to robotics the main advantage now is that it's zeroot. So
even though it in previous technologies you'd have to train a robot with previous robot arm techn robot arm movements um you have a diffusion model
that is now tracing the trajectory of where the arm can go to achieve the objective. Right? The biggest
objective. Right? The biggest
improvement is that this happens without the robot knowing what it was before and we're able to get pretty surprising results and there's many other labs that are kind of doing something similar to
kind of control robots. Uh but where is the gap still? There's still a lot of things that we want to get better and and for a lab and when we build models
the best way we do that is we understand what's broken and we create eval right.
So actually VLMs are quite bad at physics. So one example is it will
physics. So one example is it will hallucinate. Uh an object might
hallucinate. Uh an object might disappear. A object might get smaller
disappear. A object might get smaller some for some reason in the next generation. Um it might not follow
generation. Um it might not follow physics. Right? So one of the things
physics. Right? So one of the things that we're adding is we're adding uh an eval set to like kind of understand our blind spots uh for the other blind spots that we might have is that even though
we evaluate a lot today, there's actually a lot of blind spots when we do evaluations where um even though the model is able to get the right output, it actually was sampled and we actually lost a little
bit of data. Right? A lot of these models are also being judged by other BLMs. Uh so yeah, BLM's kind of judging each other to understand whether or not they're improving and this creates a gap
as well. Um so that's why um for us we
as well. Um so that's why um for us we are creating new data sets uh to kind of understand what the ground truth is. So
all these things you see behind me uh are areas where the the models don't really understand right that's a ball game smaller. Um then you have like if
game smaller. Um then you have like if something is falling is this falling proper properly when when two things hit each other what they do um and is the motion correct right does something
spontaneously move and to be honest a lot of models are not able to predict this right now and this is one of these main like physics related graphs and um we create synthetic data to kind of
understand what the realism chance is in our eval like even the best models today do not perform very well right and there's a reasons for that uh but I'll
go over that now is that one of them is that BLMs do not look at every frame these large language model based approaches you know there's a lot of tokens that go into the to those these
models and most of the times it needs to be sampled right so in our experiments we can kind of prove that like if you send it every frame it might understand but if you send it um a random amount of
frames it's going to interpolate it's not going to understand what's actually going on. So, that's one way it fails.
going on. So, that's one way it fails.
Another way it fails is that when an object's just near the edge, um it's not able to actually see if the person disappeared or if they walked off the
scene. And this creates a lot of um
scene. And this creates a lot of um confusion because the model kind of assumes and predicted that the person disappeared even though
they didn't see it per frame. Another
area is that VLMs are really just going back to text. So it will reason about things in the text world. Um we have to give it more CV and like more like
supplemental data for it to really understand uh what's happening within the scene. Uh it understands laws but
the scene. Uh it understands laws but understands it in a text space. So it's
able to more reason about it. Uh it goes back to like why our deployments today are actually more CV augumented where you have the vision model looking at the
video but then also the the CV text explaining oh this scene has X identity and it's being tracked over many scenes and that's how we kind of help improve
the VLM performance. So for us, we're using VLMs to kind of help improve how we judge physics. They're but ultimately
they're still skipping frames today. Um
we're using them to they're using them to match uh position, not motion. And uh
they know physics just from what they learn from from the text based models, right? And we're about to release some
right? And we're about to release some uh eval sets to kind of help other people improve their their models as well. um so that they can also train the
well. um so that they can also train the the the next embodied model. And for us um to recap for how we're going towards physical AI as a company um as we're
building next model is we're still using our LM and our VLMs where we have the next token and that will be wrapped around a harness and that harness will help us control uh surveillance or it'll
help us control robots. Um but we're also creating the path where we have the diffusion path where we have these video models that are now creating these control paths for robots. Um and
together they can be combined to create this kind of world model where we generate the next action. Um for the next step is this evaluation set because this evaluation set will help us understand if we're actually
understanding what's going on or if we're actually flying blind. Um and yeah that that is our path that we take our language models to kind of evolve over time and now we're trying to um shift it
to kind of help us support the next generation which is to uh build physical AI and world models. And that's that's my talk. Thank you.
my talk. Thank you.
Thank you so much, Jackie.
Next up, we have Gokul Shinasan. He is
the co-founder and president of Antim Labs. Now, he will be talking about
Labs. Now, he will be talking about simulation games and the future of robotics. And I think like he's got some
robotics. And I think like he's got some really cool demos and videos as part of this. So, this is one to look out for.
this. So, this is one to look out for.
Good evening everyone. Uh my name is Gopal and I'm co-ounder labs and today I'll be speaking about um simulations games and how these are going to be
really important themes um going forward in robotics. Okay.
in robotics. Okay.
So since like the 1950s, 1960s, robotics has basically been in the cage. And what
I mean by that is everything has been um pre-programmed. The environments have
pre-programmed. The environments have been fixed. Um the the scripting for
been fixed. Um the the scripting for like what what the robot supposed to do, everything has been fixed. So the
environment has been purpose-built for the robot. And um of course to really
the robot. And um of course to really unlock economic value, we can't have that where the environment is built for the robot. The robot should work in
the robot. The robot should work in existing environments. So um over the
existing environments. So um over the last 10 years 10-15 years a lot of work's gone in to make robots more and more general and um this has led to a
lot of cool research.
So one thing we see today is that even though there's been a lot of research the robotics community has no sort of answer as to okay what is the model
architecture that's going to lead to a significant uh generality. So for
example, if you if you're just looking at all of the latest research, we see world action models, u VLM, VAS, um video action models, and of course some people still employing classical
algorithms. Now, because there's different types of models, of course, we need different types of data collection methods. Um some of these are teley op
methods. Um some of these are teley op just using internet scale video to train video action models uh synthetic data from simulations and also um UMI style
uh capture. So these are all different
uh capture. So these are all different types of data capture methods for um robot for training robots. So, one could ask now, okay, there's so many different types of models, so many types of data,
like what's really going on? Is robotics
just going to go in multiple different parts? And there's no um there's no real
parts? And there's no um there's no real linking thread between all of them. And
I would like to argue that the one thing that's common among all of these methods is simulation. And what I mean by that
is simulation. And what I mean by that is simulation is going to become an uh a part of the workflow both the R&D workflow and the deployment workflow
that you in all likelihood cannot escape. So um some of the places where
escape. So um some of the places where simulation is going to be used are for generating synthetic data. Uh second is um you can create digital twins of
environments and you want to make sure that they work in those digital twins before they you you know go out and deploy an actual physical robot. Uh the
third one is for edge case coverage.
This is like really um well established and it's used quite heavily in things like autonomous driving and of course just to prototype policies before you deploy them. So, um, for all of these
deploy them. So, um, for all of these different, uh, you know, places where you can use simulations, um, even though it's it's going to be so
ubiquitous, what what is the state of simulations is that it's really really hard to make them. Um, if I don't know how many of you have tried to build simulations or have used any of the
simulation software like Isaac Sim or Mojo or something like that, but there's a real massive learning curve. And even
once you've become an expert at it, it's still really hard. And so what's on the slide right now is just the um workflow
to create one asset and then place it.
So you've so depending on how complex your scene is, you have to do this for multiple assets and um you know it just it's just really hard and it takes days
and sometimes even weeks. So
there's no reason for this to be the case. Um so with current agentic AI and
case. Um so with current agentic AI and a lot of the uh vision based models and language models, we can actually automate several steps of the pipeline or at least bring it as close to
automated as possible. And so we built something called Gizmo. This is a prompt to simulation tool where basically you can give our system a prompt in either
natural language or a just a picture and it will go out spin up a bunch of sub aents and it'll do whatever it needs to do and then at the end you just have a sim you have a fully built 3D simulation
and this takes around like 20 minutes right now. Um so you basically have your
right now. Um so you basically have your first pass of your environment done in about 20 minutes and let's say there's some human in the loop work required. It
is still you know you can complete it in a couple of hours. Now this is contrasted to days or weeks. Uh that's
what's being done right now.
So I'm just going to play a demo of our tool.
Heat. Heat. N.
Heat. Heat.
Great. So that's the demo of the tool.
So basically just prompt in something and then out you get a simulation. So um
this un this unlocks some serious capability. So we're also going to have
capability. So we're also going to have APIs. So what this means is that your
APIs. So what this means is that your codeex or open claw whatever you're using um in any part of the workflow it can just decide to spin up a simulation and uh you get a simulation out. So this
also enables massive scale. Right now
it's just not possible to do simulations at really high scale because they're so hard to make. Um this also enables some really fun stuff like you you you can basically have an end toend closed loop
closed loop for robot learning. For
example, you can just say train a quadriped to walk to um a point in the scene that I specify or something and that is literally all the information that an agent needs to go out and
actually do the entire thing and give you a policy for a trained quadriped.
Okay, so uh is this it is is robotics solved? Well, of course not. Um the
solved? Well, of course not. Um the
simtorial gap still exists. What this
means is that simulations um while they are useful, they're not 100% accurate yet. And this is fundamentally just a physics problem. Um
there's problem with the contact physics and there's problems with you know we approximate the properties of materials and deformations are really hard to model and so um this is a hill that the
robotics community and us are still climbing and we expect the gap to become lesser and lesser as the years go by.
Okay, so we spoke about simulation.
Let's go to games. Why are games important? So um in simulation you
important? So um in simulation you cannot only train manipulation or navigation or locomotion. It turns out that if you're
locomotion. It turns out that if you're able to have a synthetic world you can train even highle cognition. And what do I mean by high level cognition is things
like um exploration when when the goal is not clear. Um when you had a certain plan and then something happened in the world and your state degraded it. How do
you recover? How do you replan? um when
you have when you don't have full information about the world, how how good is your decision making? So um all of these things are really important.
They're not only important for robotics, they're also important for LLMs, but for robotics, they're they're specifically important because they it also needs to be grounded in spatial temporal memory.
So um I mean all of these things like exploration, um replplanning and uh you know long horizon planning, all of that needs to be grounded in spatial temporal
memory. So we trained an agent and I'll
memory. So we trained an agent and I'll just give you a very quick um overview about how over how we did it. So we
trained a two billion um quen model VLM.
So it's basically functioning as a computer use agent where it controls the keyboard and the mouse. So we
pre-trained it on like 400 hours of uh um frame action video gameplay data. So
that basically gives the model some instinct as to how to play video games uh with a pre-training and uh we did some instruction fine-tuning with around 60 hours of if data to basically steer
the model through the game. And finally
uh this is something that we haven't done yet but uh it's in the works where you train the model to output reasoning traces and the reasoning traces then function as the instruction for the next step. Right? And finally, one thing we
step. Right? And finally, one thing we need to really keep in mind is since we we want to play a video game, we need real-time operation. So, um yeah, we
real-time operation. So, um yeah, we need the model to take in the input, proc it, and decode the output within uh 200 milliseconds. So, now I'm just going
200 milliseconds. So, now I'm just going to play a quick demo of um our agent. Um
as will be clear to you, it is still early work and it's far from perfect, but um hope you'll enjoy.
Heat. Heat.
and that's my time. Thank you.
Thank you so much, Gokul.
We're at 5:00 pm. We're in the home stretch. Very, very happy that
stretch. Very, very happy that everyone's still around listening to talks. Uh we're going to take a little
talks. Uh we're going to take a little detour and go into some aspects of design. And we're going to explore a
design. And we're going to explore a different playbook now. So our next speaker is Weii Su from Lentil. And her
angle for the talk is to explore the wisdom behind eastern philosophies and eastern product building. So this will be a very interesting talk at looking at
design and AI but from a lens that is often not the center of discussion. So
way whenever you're ready.
You want to go back?
Yeah. Yeah.
Cool.
Hi everyone. Can you hear me? Okay,
cool. Thank you for being here. Um, my
name is Wayi and I run a startup called GenZen. We create AI videos to scale
GenZen. We create AI videos to scale marketing.
I'm going to be a little bit experimental today and I would like to spend some time today to talk about eastern philosophies as well as how this can shape the way we
build in the future.
This feels like something worth discussing because we're living in a time where Westerners are becoming Chinese on Tik Tok and are China maxing.
So if you spend time on Tik Tok, you've definitely noticed this trend in the last few months. Not only so, the west is also paying increasingly more attention to both companies and AI
models coming out of Asia.
A moment I want to highlight in this movie is um it is is this movie called Wandering Earth. How many of you have
Wandering Earth. How many of you have heard of this or watched this?
Cool. And how many of you have heard of Three Body Problem?
A lot more. Great. Um so wandering earth is also written by the same author Leo Sushing and it was a very important moment in sci-fi films because it was
one of the first successful attempt for China to build a large-scale Hollywood sci-fi blockbuster rooted in Chinese story storytelling traditions. This is a
story that's set in 2075 and the sun is expanding. Earth will
soon become unlivable.
Instead of abandoning Earth, humanity decided to come together and build uh build 10,000 giant planetary engines on Earth's surface to push the planet out
of the solar system. This plan would take 2500 years. And so in those in the next 25 centuries, they all agreed to live underground. Um watching this movie
live underground. Um watching this movie and seeing this collectivist mindset was very empowering for me. It helped me realize that we had been given one
version of the story and one version of the future our whole life. Mostly
created by Hollywood without really knowing about it.
So for the longest time the western narrative has been in the center of how we build, how we live and also what we desire. What happens if there's an
desire. What happens if there's an eastern narrative in the center of 21st century? In the west, minimalism is
century? In the west, minimalism is typically favored. Apps tend to have one
typically favored. Apps tend to have one call to action on every page. For
example, in America, you use Cash App or Venmo to send money to friends and to pay friends. This is what Cash App looks
pay friends. This is what Cash App looks like. And on the other hand, this is
like. And on the other hand, this is Alip Pay from China. Not only can you send and receive money, you can also pay your bills, order delivery, or even take
out a loan.
So, in the east, vibrancy is often celebrated more. People want all the
celebrated more. People want all the options. A lot of the times more is good
options. A lot of the times more is good rather than less is good.
This belief in the west also trends toward focusing on singularity.
An example of this is for example a western company like Meta has been focusing on growing one revenue stream for more than a decade. So as you can
see they rely on advertising.
On the other hand, Tencent, the parent company of WeChatai, has been diversifying their revenue streams and they don't put all eggs in the same basket. When you put these two social
basket. When you put these two social media companies next to each other, the contrast is quite stark. And you can also see how that changes their
behavior, how they deal with risk, and how they experiment in general.
While the east trends toward plurality, we also celebrate optionalities.
So I couldn't help but wonder what led to this difference. Right? One
observation is that the philosophy each culture embodies is very very different.
While the west has Bible which is called shenanzing in Chinese which the holy scripture the east has something called eing the changing scripture.
Its central argument is that nothing is fixed. Everything is in motion and a
fixed. Everything is in motion and a wise person doesn't really resist change. They seek guidance in navigating
change. They seek guidance in navigating and also embracing it.
With Eegene, practitioners tend to cast coins to generate six lines. These are
all the 64 options. Um the 64 hexogs.
They offer guidance on life's changing situations. Over time, it becomes a
situations. Over time, it becomes a cornerstone of Chinese philosophy, reflecting on ideas about balance,
transformation, and also in and one of the changes that I think we're all living through in this era is content is synthetically generated.
We're going to see more content that's synthetically generated than created by humans. And I think one question that
humans. And I think one question that we're all asking is will we be drowned by zero effort slop? Are we going to see slop puriferate and just flood
everything? Right? How what do we do
everything? Right? How what do we do when there's so much noise? Um but if we look at this from a different angle, the instrument of storytelling, so cameras,
studios, distribution, the entire apparatus Hollywood built is collapsing into something anyone can hold. This
also means communities that are ignored by Hollywood now hold the tools to create content and distribute them on their own terms. The stories that were too niche and also too foreign, too
small of a market, too hard to cast, those are now producible by people who actually live them for the audiences who actually want them.
For example, the furry community in China now makes content for themselves using AI. And
this furry animation gained 1 million views in the last two weeks alone.
Another video creator in China has created an AI short film and it had gained 60 million views across all platforms just in the last seven days.
Similarly, we at GenZen are helping clients make content in industries that are traditionally too niche. And this is really exciting to me because we're able to create broader access and also
awareness to these verticals.
In the last four months, we've delivered 10 million impressions monthly across YouTube shorts, Instagram, and also Tik Tok.
For example, we've also made more egene content. And to increase awareness
content. And to increase awareness around this, we built an app to enable everyone to get get a reading.
Traditionally, doing egene reading can be a very complicated and perhaps confusing process for new beginners. So,
this tool enables you to quickly ask your burning questions. And if you're interested, you could also try this tool out on App Store for free. Um, we made it for free this week just for you to
try it. You can search Egene Oracle on
try it. You can search Egene Oracle on App Store or scan this QR code.
We've also created and scale content um around traditional Chinese med medicine, acupuncture, pressure points. These are
also topics that are historically underlooked and these type of content are more easily created because of the tools we now have access to.
All of this is supported by our in-house agentic video video workflows and we streamline and optimize the content production process which in turn deliver impressions and also productive
conversions to products. In a lot of ways, we see AI generated content as a vehicle to a more vibrant and also plural future where all of us hold the
tools to create narratives we believe are important. With that, thank you so
are important. With that, thank you so much for your time and you can find me online on Twitter at this ID. Um, if
this is something that's interesting to you and if you would also like some of the stickers, um, please come find me afterwards. Thank you so much.
afterwards. Thank you so much.
What a unique presentation. I need to figure out how to make slides and like presentations like these. So cool. Uh
next up we have Anun Jooshi who's a tech lead for Bland. Uh and he'll be talking about voice AI. And we had a presentation from 11 Labs earlier but this one's going in a different
direction which is voice AI is not a model issue. and we'll let Anun tell us
model issue. and we'll let Anun tell us more about this.
Hello everyone. Can you guys hear me?
Good. Sweet. I hope you guys are feeling good. Um, I just want to say before we
good. Um, I just want to say before we start, all the speakers have been amazing. So, can we just give a round of
amazing. So, can we just give a round of applause for all of them?
So, also I actually changed my talk title because I did realize that voice AI does have model issues. So, I changed it up and I'm going to talk about some
of the issues that I face while scaling up um voice AI for enterprise customers.
Um so, I'm Anun. I actually grew up here in Singapore. I moved to San Francisco
in Singapore. I moved to San Francisco two years ago for Bland. And fun fact, I actually used to be a theater kid in junior college here. Um yeah, I never
thought I'll be back on stage again, but here I am. I do like storytelling a lot.
Um, so I'll kick us off with one. So,
two years ago, I was in San Francisco. I
was just going on a coffee date with my CEO, Isaiah, and we were just hanging out, and he told me something that we
still talk about to this day. Um, he sat me down, he looked me dead in the eye, dead pan, and he told me this.
You're not going to believe me, but Pathways, the thing that you invented is going to impact millions of people, and millions of people are going to use it.
And I looked at him, I'm like, "This guy's ridiculous." Like, he's he's just
guy's ridiculous." Like, he's he's just the typical founder. He's trying to glaze me, make me feel good so that I
work harder. Um, and at that point we
work harder. Um, and at that point we were just Oh, well, one slide's missing, but I was going to show that we were
just on Discord. It was me and another engineer. Um, we were talking about,
engineer. Um, we were talking about, we're just the FDEEs, we were the engineers, we were the product managers.
Um, we're just figuring out the architecture for our agent with nameless and faceless people on bland Discord.
Um, and it's crazy to think now that we're actually serving millions of calls every month. It still hasn't struck me
every month. It still hasn't struck me that someone right now is talking to our agent. That's crazy. Um, and I went on
agent. That's crazy. Um, and I went on my Slack channel this morning too in team talk and there was a case study that came out with one of our customers
named American Way Health and you can check it up on our website too. They
said that we unlocked $430 million of revenue per year.
I didn't know that was possible. I
didn't know we could do that. Um so
yeah, all of this has grown way bigger than I could ever imagine. Um I'm lucky to have lessons I've learned while doing all of this and some pain points too
that I'd like you guys to learn from if you guys are trying to um integrate Voice AI into your services. So I'm sure all of you has seen a bunch of um voice
AI like demos and they are super cool but the thing that's hard is productionizing it and actually making it work for enterprise customers. Um so
I'll go into some of the pain points and findings that I had for what we actually need to do to make voice AI work for enterprise use cases.
Um, okay. The slides are different, but we'll just roll with it. Uh, I'll start off with the VO. The thing that I didn't realize that a lot of enterprise customers deal with and complain to us
about is the voicemail detection accuracy. Um,
accuracy. Um, I didn't realize that our current customers report and try to track the voicemail detection accuracy every
single day. Um the reason for that is
single day. Um the reason for that is most outbound calls don't actually reach humans. Most of it goes to voicemail and
humans. Most of it goes to voicemail and ensuring that that's a robust system that works across various scenarios. For
example, with call screeners where now uh iOS and Google voice have checks right before the call connects. For
example, they say um please say your name and reason for calling before we connect. Um, and
connect. Um, and there's a beep that happens. And what a lot of people use is Twilio, which has an answering machine detection feature, which essentially is just a beep
detection model. And it doesn't work
detection model. And it doesn't work that well. Enterprise customers can't
that well. Enterprise customers can't rely on it. So, I was working on that in bland to improve it. I was working on building a CNN model to look at the mel
spectograms of every audio chunk. Um,
and I didn't realize that beeps have so many different lengths and frequencies for phones and different phones. Um,
some frequencies also have dual band frequencies that are the same as what are called DTMF tones, which is what happens or the sound that you hear when
you press digits on your phone during a call. Um, so you don't want to cause
call. Um, so you don't want to cause false false positives there as well. Um,
so that was one of the hard things that we had to figure out and we even have a website now for you to test and benchmark voicemail detection. So if you
guys are trying to um integrate voice AI into your systems, make sure that you're looking into how well their voicemail detection um accuracy or system works.
So, next thing, um, there's going to be a slide of a Slack message that I received from a customer, and that Slack message said, "Why is my agent not
working the same way?" or "Why is my agent not working the same way it did yesterday?" I don't know how many of you
yesterday?" I don't know how many of you have dealt with customers telling you that or you yourself maybe have experienced that. For example, I know
experienced that. For example, I know with Claude, I hate it when things just change. Um, and from a business
change. Um, and from a business perspective, sometimes customers come to me when I didn't change anything. Like,
I didn't push any new code and you're coming to me saying that I broke their system. Um, but I get it like you are
system. Um, but I get it like you are spending hours working on their platform and their agent. And it sucks when something is just not working the way
you expect. Um, one story of when I
you expect. Um, one story of when I messed up as well is when I was trying to improve the hybrid search algorithm for our knowledgebased feature. Um, we
have our own self-hosted vector databases and I was just trying to increase accuracy. Um, it worked for
increase accuracy. Um, it worked for some customers and it caused a regression for another and that sucks.
It sucks to break the trust of your customers and that's hard to rebuild.
What we've built and what I'm proud of building in bland was that we allow customers to deploy canary deployments and test out versioned agent releases.
So for some context blind has dedicated infrastructure for each enterprise customer for data residency etc. And we with this we can allow them to spin up a
separate container where they can test out a new agent release and send and roll out a percentage of traffic there, couple phone numbers to route there so that they have more assurance that any
production changes are being tested out before it actually goes live. So that's
our way of trying to rebuild customer trust and that's super important for enterprise customers so that they can just focus on improving the agent the way it should work.
Now this was this is a funny story too.
So we're working with a Fortune 500 car rental company and we're trying to collect the car rental digit IDs. um it
yeah so that we can just help out with any other information they need to change and we went into production and we started realizing that okay the
digits are actually different from uh what is actually there and we looked into our pipeline the transcription engine was correct the TTS was working the way it's supposed to the LM was the
one that was hallucinating and the input for the digits were correct but it was saying and outputting something Um, I tried to prompt engineer my way
out of it. Didn't work. Um, and when I looked deeper into the tokenizer level, I saw that, okay, the repeated digits are actually being treated as one token rather than each digit being treated as
a separate token. And that's just how the tokenizer is. Um, and the hack that actually fixed the issue completely was
adding commas to each of the digits in between. The reason that worked was that
between. The reason that worked was that the LM can then now treat each digit as a separate token and we actually found like later on that uh
a paper was released uh which you can look up for sync and stro 2024 that was released after um we fix the issue but if you guys face something like that
just know that you can look that up and adding commas will help um solve the issue. it was only happening like five
issue. it was only happening like five out of a thousand times. But if you're working with enterprise customers, that's five times too many.
So this is a bit of a personal regret that I have um from there are decisions that I made
when we were in early on seed stage and a lot of YC uh like advice is move fast, break fast, but I wish that I was a bit
more intentional about some of the decisions I made when rolling out changes and yeah not causing as much customer pain. So, just being more
customer pain. So, just being more conscious about one-way door decisions versus two-way door.
So, going back to that story with Isaiah, um he still teases me to this day about that that I didn't believe him. And it
is super empowering to know that you can have that much impact just from code.
Um, and I just hope that you guys can learn from some of the lessons that I learned and the mistakes I made so that you guys can scale up for any other
service or like integrating voice AI um to be even bigger than I could do. So,
thank thank you so much for the time and yeah, my LinkedIn's here if you guys want to reach out.
Thank you so much, Anon. Next up, we're going to look at this design. Uh, and
we're going to be talking about going beyond flat design output and just going beyond autocomplete. So, how do we solve
beyond autocomplete. So, how do we solve the complex design problems and enterprise design bottlenecks that come with AI? And for that we're going to
with AI? And for that we're going to have Lin New who's the head of AI at Oello who will be sharing her thoughts on this uh once she's set up.
it's it's really time consuming and costly to um create like marketing content uh on brand and at scales. So if
you can see here as marketing channel multiply brands are faced with a relentless demand for content creation such as like when you want to create uh
a marketing campaigns or advertisement uh across different format like Tik Tok, Facebook um Instagram and and so on or
LinkedIn. Yeah. So we heard a lot of
LinkedIn. Yeah. So we heard a lot of complaints uh and testimonial across CMO, head of design on of different companies big or small. They all have to
admit that traditional design tools are slow, costly and reliant on specialized design skill. Not everyone can afford a
design skill. Not everyone can afford a big design or marketing team.
So we introduced Oello. So it is an AI power design platform that enables team to instantly and cost effectively create onbrand content at scale. So unlike like
Canva where you can use rightway as individual but it won't be able to learn your brand signature your brand assets or brand voice.
Boom. Boom.
Yeah. So as you can see that like uh when we use AI generated image model or videos right we have like a problem of model collapse when you keep prompting
it to say hey let's change this headline to another colors or change the logo or something like that. So when you use continuously use uh the previous uh
generated AI image to fit to the next time when you prompt it it will lead to a model collapse. So we uh in Oberllo we have uh been able to turn those flat
design into a fully editable where you can just move things around you can change the colors you can pair the colors that is learned from your brand asset.
So in here as you can see we have a lot of like workspace or domain and the models will be uh proprietary trained
regarding uh to to their own brands guidelines signature and all that. Yep.
For example, Obert, I think that if you go to Funan Mo, you guys will see that there's a store over and they are one um one of our clients right now. Yeah, you
can see that in here. Uh we
um use a lot of like proprietary like uh training uh data from them and like design our design team. We train that uh
and the model will be like fully you know kind of like private and not scraping from the just from the internet. Yeah.
internet. Yeah.
Uh so this is like one of the demo from our AI resize. If you ever try resize on canva you will understand that like sometime they will just like copy the
element over and just stretch the whole canvas. But in here you can see that it
canvas. But in here you can see that it will smartly, you know, re reorganize all of these kind of elements around it.
Yeah, you can see that. Um, so it not just, you know, copy over and stretch the canvas. Yeah.
the canvas. Yeah.
And when you replace a media with like another uh like videos or or image, it would change accordingly
across all of the formats and and campaigns. Yeah.
campaigns. Yeah.
So that's what that's what uh that's how you do like marketing campaigns and advertisement a skill and on brand In here is how you use our AI studio uh
functions that we have uh divided into you know people subject and product subject. Uh you can choose uh up to you
subject. Uh you can choose uh up to you know eight kind of images uh high quality and then you can just name them.
Let's say you will put it as Malo jacket or something like that. And then now you want to generate in an ad or a picture
using this model.
Let's say let she wear Rick Owen stuff.
Yep.
And we can you know at the same time generate uh to multiple format or size.
Yeah.
All of this information has been you know uh intelligently uh saved in your brand domain.
Let's say another example for train products and this is a design reference.
So you have a design reference somewhere and you have your own train product and you want to you know kind of like just combine them together you can add tag like at and it will
understand which uh subject you are referring to. Yeah.
referring to. Yeah.
Yeah. So this is the result from that.
In here you can actually hit refine if you want to change any details of that and it will be like fully editable. You
can actually change uh the text without you know kind of like prompting again.
Um and you can actually you know open in in editor and do more of that.
Yeah. So uh we also have like a short form videos which you can use to to be uh you know broadcast on billboard or
any kind of like uh dynamic banners.
Yeah. So that is all uh overall from our uh Oello platform. And here you can see that this one is a brand adset where you
can just actually pull in um put in your URL or put in your PDF files, do um Google Docs or anything else and it will pull all of your color scheme, primary
color colors, secondary colors, um logos, paddings and all that. Yeah. And
you can actually see that it will automatically tag your image like which kind which kind of product that it displays here.
So we have like other work in progress which is not launched yet but you can take a look here. Um
here is our monty monty resite. So you
can check out like for example if a designer wants to create like 10 sizes at once. Uh he or she can actually just
at once. Uh he or she can actually just do an initial design with this with this kind of thing and then afterward uh they
can just do like a collection marker or like the suggestion design like this.
This is just like a very simple format but it can be a much more complicated layout that enable you to create uh more
sop sophisticated campaigns. Yep. You
can see that we can select up to a lot of like different sizes and it will automatically you know kind of like expand like you like you see on Figma
there's like an infinite canvas right so uh this is the result from the multi multi-resize imagine like before if the agent uh the
agency has uh to have to make like one week or two weeks to complete and you know rearrange all of is now we can do
that with just a click. Yeah. So
yeah, so for things that are loading it will it will load later. Uh which one is complete will display first.
Um yeah, thank you for your attending.
Yeah, that's my presentation.
Awesome work. Thank you so much, Lynn.
Last two talks. Hang in there, guys.
We're almost almost at the end of the first day of talks.
To close the sessions out, we have two more talks. The first one is by Stefania
more talks. The first one is by Stefania Duga, who's a research scientist with Sakana AI. She will be talking about
Sakana AI. She will be talking about sovereign AI. So, how do you localize
sovereign AI. So, how do you localize frontier models for certain countries?
In this case, Japan since Sakana is based in Japan. Uh, I'll let uh Stefania set up and then uh we're good to go.
Hello. Hello.
Is the mic working? While I'm setting up, I know it's been a long day and you've been sitting and listening to so many talks. So, I'm going to invite you
many talks. So, I'm going to invite you to stand up for a second. Can you all stand up?
We're going to do a breathing exercise.
Take a breath in.
spread out.
Okay, thank you for playing along.
Awesome. Now we're ready to start.
Uh, one second.
Um, so good afternoon. My name is Stefania Dugga. I'm a research scientist
Stefania Dugga. I'm a research scientist at Sakan AI in Tokyo. And um, today I'm going to talk to you about sovereign AI.
Um and what I mean by that is not necessarily uh any country building a local model um but more the ability to talk about local agency over global
capability and think about that. So in
practice when I think about sovereign AI I think it's important to consider three things. Um the data which data needs to
things. Um the data which data needs to be stay u local and what models are best adapted for local use. Compute and
evaluation uh what sort of compute resources we need what workflows run on premise which workflows run on the cloud and accountability determining who remains accountable when we introduce AI
systems into our institutions.
So I wanted to share with you a personal uh story of how I got interested in this topic. Um I come from a small village in
topic. Um I come from a small village in Transennylvania, Romania. And uh before
Transennylvania, Romania. And uh before working in AI research, I used to run AI literacy workshops for children, families and educators around the world, including here in Singapore. This is a
video from uh academia hackathons with kids in 2013. And what I learned in this workshops, in classrooms, in maker spaces, in libraries is that people are very interested in AI. They want to use
it, but very often the AI models and systems are not adapted to their language and local needs. And that
translates into frontier AI capabilities of today. We expect communities and
of today. We expect communities and people to adapt to AI systems instead of adapting the systems to the local needs.
And in Japan, this localization poses multiple challenges. We need to consider
multiple challenges. We need to consider different registers for the languages, different cultural norms, different workflows, scientific practices, safety
and security policies. So localization
challenges is institutional and multifaceted and sovereignty uh I want you to think of it as a stack right so it starts with
data and figuring out what sort of unique data we need. Um then it goes to evaluation. How do we check for
evaluation. How do we check for neutrality, factuality, specific country benchmarks? Um, then we're talking about
benchmarks? Um, then we're talking about adaptation and this primarily happens through post- training, fine-tuning, rad tool use. Then we have the routing layer
tool use. Then we have the routing layer and here we need to have policy aware model selection interaction. What are
our users? What are the different personas? what's it what are the
personas? what's it what are the different UX decisions in how we present these models and products to to the users and governance.
So beyond that there's also a physical layer right because the different aspects of the stack have different needs for pre-training um we need a lot of data a lot of compute uh and the cost
is prohibitive in most of the cases for post- training we need to care a lot about local norms and preferences and I wanted to show you some examples of how specifically we consider that in
some of our projects and products. So uh
last uh um month in March 24 we launched our first consumer product Sakana chats and in this consumer product um we it's available for free for people in Japan.
Uh it's equipped with web search but we're actually and it's available for anyone in Japan. We're actually
supporting multiple ways of interaction.
So we're supporting uh standard mode uh which is neutral default Japanese register but we're supporting also keo the polite mode which is more form used in formal context and we're supporting
dialects the Osaka mode which is actually um giving answers in the kai dialect and people really appreciated this we have over 30,000 active users
every day and in this particular project we use post training as a sovereign control point so we started with open frontier models such as deep sea, llama,
GPTOSs. Then we had unique uh Japan
GPTOSs. Then we had unique uh Japan data for evaluation and preferences and we define a series of neutrality metrics with an expert of with a panel of
experts in policy. Then we used this to post-rain this open weights models to create a model we call Namazoo. And we
evaluated um we compared the evaluation between the postrain model and the base models. And we showed that the postrain
models. And we showed that the postrain model outperformed the original models on neutrality and factuality accuracy.
But the fact that it outperformed them is not the only thing that matters. What
we also showed is that many of these existing models would just refuse to answer uh questions that are more sensitive. For example, if you would ask
sensitive. For example, if you would ask deepseek, please tell me about government inter uh internet censorship in various countries, it would either refuse to answer or give a generic
highle uh answer. uh after our post training we showed that Namazu actually gives a multifaceted um response with links to specific uh
news articles that um are trusted. The
second model I want uh uh project I wanted to show was our work on AI scientist that is focused on scientific capability as a form of sovereignty.
So uh in this project we're actually using a multitude of agents that are supporting the entire research workflow.
So um the agents start with idea generation, novelty checks, idea scoring, then uh um we're using treebased experimentation to test these
different ideas, generate the code for them, do ablation studies and at the end we are actually creating a full paper um presenting the results. And this work
was uh la uh featured in nature last month as well. Um this is how the AI scientific scientist is using tree search to process like different hypothesis and test them and then pick
the best candidate.
And the paper uh generated with this uh system uh is the first uh fully generated paper that passed uh peer review at iclair last year. Um the other
example I wanted to show you is how we use multi- aent coordination. So for
this a very important concept is the concept of switchboard. Um and this switchboard learns to automatically route tasks depending on how hard these tasks are to the most appropriate
models. And like this we're optimizing
models. And like this we're optimizing for the cost and also for security. And
routing can be seen as a form of sovereignty um not as a way of isolating specific solutions from global solutions. So if a request is um very
solutions. So if a request is um very relevant to the Japanese context, it's going to be sent to the Japanese postrain model. If your a request is
postrain model. If your a request is very sensitive, maybe it's routed to the on premises secure model or maybe a human review is being uh solicited.
So this idea of coordination as sovereign capability is not only an architecture for focus for us but also a research focus. Uh we believe like our
research focus. Uh we believe like our bet is that the most capable AI systems are collection of specialized agents and
not single scaled models. And what we showed was actually in this model that we just launched the Sakana Fugu is that we can train a learned orchestrator to
pick the best model um given a specific task. But this orchestrator can also
task. But this orchestrator can also learn to call itself recursively for harder tasks. And this work uh is now
harder tasks. And this work uh is now available in beta access and was featured in two papers presented at iclair this year. In the evaluations of
fugu, what we see is that composition beats scale, right? So um we compared fugu which coordinates a pool of frontier models as an ensemble. It per
outperforms any single member of this ensemble on uh live codebench and sweep pro and other evaluation benchmarks.
Uh next I wanted to talk to you about domain adaptation because we all know data is scarce and there's a lot of data that we currently don't have digitized like there's a lot of tacid knowledge
and this is the missing data set. So
when we're working with different institutions, banks, hospitals like healthcare, government um we need to have a process for integrating expert
critique and feedback back into the model and the tools that we're developing. So for example uh when we
developing. So for example uh when we work with some of the major banks in Japan like MUFG and SNBC for credit memos we solit solicit over a thousand
points of feedback that get fed back into the model that uh learns to create better credit memos for their expert analyst.
Last but not least we're also supporting the government in Japan. So our team has uh showed that they can use an AIdriven intelligence for analyzing social media and show exactly how campaigns of
misinformations are being started and run.
And maybe the most important form of sovereign AI is to maintain local capabilities of questioning the dominant
architecture. So in our CTM work
architecture. So in our CTM work continuous thought machine we actually the team is actually proposing a new architecture beyond the transformer and this architecture is inspired by the
brain where the reasoning emerges from synchronization between neurons over time. So instead of h having a single
time. So instead of h having a single pass attention um there are multiple attention heads um that are um coordinating and such the model learns
how to do pretty complicated tasks like solving a maze and the way it learns to do that it's also inter interpretable for humans because they can see the
activations at the bottom. Um, we also tested it on image classifications where we could actually see exactly what part of the image the attention heads
focus over time. And the computing is actually adapted like for simpler images, it spends less time to figure out the classification than for
complicated images. So those were only a
complicated images. So those were only a few of the examples of the work that we're doing at Sakanam. Most of the projects I shared today are open source.
They're on our GitHub and on our blog.
um we want to develop AI solutions for Japan needs and democratize AI in Japan and I share with you this this stack layer for the sovereignty right but
every country picks which layers of this stack they want to own and they can own so not no single country tries to own every single layer of this stack so it's
important to see how different countries make different ownership decisions and this is what sovereignty looks like in practice this.
And to close it off, I wanted to to leave you with this message from kids to parents to researchers to AI engineers.
Um, it's very important to realize that we all have agency and that local agency is more important than global
capability. Um, so thank you very much.
capability. Um, so thank you very much.
Thank you so much, Stefania. And for the last talk of the day, we couldn't think of anyone better to have than Swix himself. Um, Swix is with Cognition, but
himself. Um, Swix is with Cognition, but he also happens to be the founder of the AI engineer conferences worldwide. And
since this is our first edition in Singapore and Swix is from Singapore, it makes perfect sense to have him close day one of talks today for all of us. So
Swix, whenever you're ready, the floor is yours.
Okay. Can you hear me?
Uh I think I think they're switching the lapel mics on. Uh shift. Where is this?
Okay.
Okay.
should be okay.
It's okay. I don't need to. Yeah, we're
good.
Okay. Hi everyone. Uh, how you all doing so far? Enjoying the conference. Yes.
so far? Enjoying the conference. Yes.
Awesome. So glad to have you. Um,
if you don't know, I'm Sean or also known as Swix. I come here in three capacities. One, I'm the founder of AI
capacities. One, I'm the founder of AI engineer. Uh, two, I am an adviser to
engineer. Uh, two, I am an adviser to cognition and one of the leading agent labs, and I'll explain what that is. And
three, I'm here as a Singaporean. And I
think all three of these identities merge together in this one talk, which I really wanted to share with you as well.
Um, so let's go into it, right? Um, I
don't think this clicker is working at all. All right, I'm gonna gonna skip the
all. All right, I'm gonna gonna skip the clicker. Uh, so first I'm going to tell
clicker. Uh, so first I'm going to tell talk a little bit about our story as a conference. Um, I'm pleased to say, you
conference. Um, I'm pleased to say, you know, like uh we've been uh this this conference is three years old. Uh it's
going it's gone around the world from London to Paris to San Francisco to New York to Miami uh and now to Singapore and next to Melbourne. Uh we've been
growing quite a bit. Uh we now serve 1.5 million unique developers a month. Um
and uh over 9,000 people have seen today's live stream alongside of you in person as well. Uh we are really trying to do uh our best to grow developer community all around the world and serve
the the going AI engineering industry.
Um but particularly the Singapore, you know, I've always been a Singapore show.
I' born and raised here. I went to uh I left for college uh in the US, but uh I keep continually being a very vocal and public um advocate for Singapore uh especially uh for for fellow
Singaporeans but also other people try to visit Singapore for the first time and I'm I'm actually pleased that we brought like Stefania and like a lot of my sort of international friends to visit Singapore for the first time. Uh
and in fact I one of my launching pads for my own career was in Singapore. Uh I
spoke at GSCOM Asia, still one of my favorite talks I've given of all time.
Um and that really start gave me the potential for like what conferences can do for uh not only my own career but also galvanizing an industry, galvanizing a country uh together as
well. Um I did also organize a lot of
well. Um I did also organize a lot of Singapore meetups, so I'm like kind of not new to this. Um here's some of our friends including Lihao and Thor and
Thomas. uh some of you who have seen who
Thomas. uh some of you who have seen who were familiar faces in the sort of engineering and conference circuit as well. Um more recently about 3 four
well. Um more recently about 3 four years ago I moved to San Francisco um and started Leighton Space. Um hands up I don't know if anyone has heard of
latent space my podcast and yeah okay thanks so much for listening. Um and as part of that I had a realization that there would be this thing called the AI engineer. Um and I started I wrote this
engineer. Um and I started I wrote this like infamous line that I'm going to live down for the rest of my life. uh
where basically there's this sort of forming gap between the research engineers and the full stack engineers um and that is effectively what all of you are today is AI engineering I think
it's a hugely growing uh demand if you don't know if you came to this conference and you've not read the blog post you probably should uh read what the the definition of AI engineer is
um be just just around at the same time I actually started hacking on my own stuff I'm not just a content creator I'm not just a community person. Uh I'm also a builder. Um I'm just not a very good
a builder. Um I'm just not a very good one and I'll be super honest about that.
Uh so I started building my own coding agent. It got super popular. It's called
agent. It got super popular. It's called
small developer. Um and it was built built on claude one if you can imagine.
Uh three major versions of claude a go.
I was building on this thing. Um and I got very excited about it but ultimately couldn't really scale. And also the model weights degraded on me overnight.
uh which I I know it's a conspiracy theory but I swear mine was true that uh the model got dumber overnight. Um so I stopped building it but uh over lo the whole like I I I moved on to sort of
greater and better things. So the very very first AI engineer I declared that there would be three types of AI engineer um and I didn't you know I started sort of broadening out and in
fact that was probably a career mistake.
Uh what actually happened in the subsequent three years is exactly this sequence uh where 2024 we build out more um sort of AI coding tools, 2025 more
product stuff. Uh 2026 is definitely the
product stuff. Uh 2026 is definitely the sort of year of deployment of agents. Um
and yeah this sort of kap Andre who's a little bit of a mentor of mine um drove this sort of stake in the in the ground when he last year he said that this is the start of the decade of agents right
if you take the founding of open AI in 2015 as the starting point um uh uh and taking the ne the first 10 years of scaling well what happens in the
subsequent 10 years is probably the deployment uh and building out the the harnesses and the scaffolding uh that that becomes agents Um and that's really kind of the path
that led me to cognition. Um uh they made three choices which I wish I made when I did small developer and I wrote about the AI engineer in 2023. Uh the
the the three non-obvious choices were choosing code um bridging sync and async and focusing on enterprise. I think each of these things were not like sounds
super obvious now. In 2023 you wanted to build chatbt you wanted to go consumer.
uh in 2023 you probably wanted to do auto reggressive uh LLMs and not really think about sync synchronous sync agents um and and and code was one modality out
of many modalities um but I think uh you know uh it business has shown that it is the king modality so choosing code um I think this is something I wrote in my
blog post on on on cognition where I really talks about like code is a proxy for software like coding agents and if the and basically If software is eating the world, then code agents are eating
software and it it actually starts to accumulate a lot of the power and the the economic value and it can probably do it in a much shorter time frame than all the other agent demos that you've seen that are probably not working as
well. The second part is something I've
well. The second part is something I've written about in uh in this other blog post called the semi- async value of death. There basically there is no
death. There basically there is no middle ground. You either want your
middle ground. You either want your responses very very quickly uh or you want to delegate asynchronously. And I
think um there's this sort of uncanny valley effect that happens when uh responses or LMS are like fast but not fast enough where you're sort of waiting there on the phone whether it's voice or
code or whatever other modes of interaction. So you basically just want
interaction. So you basically just want a barbell approach of um uh having the most synchronous live uh experiences or the most async experiences. And I think
any company that can adequately straddle the two uh is going to do super well.
And uh finally, enterprises. Um I think this is something that um abstractly makes sense. Like obviously you want to
makes sense. Like obviously you want to go after like the big logos like the City Banks and the OCBC's and the Goldman Sachs. Um but I think it I
Goldman Sachs. Um but I think it I didn't really appreciate why. So I'm
going to spend a little bit more time in sort of double clicking on this just so you understand really what being enterprise focused means. Um and
enterprise focus I I think in very plain terms serving serious customers. A lot
of AI customers are non-serious. Like
they will try your tool and then they will not give you feedback. They'll try
your tool, they'll turn after three months for the hot new thing. Um,
enterprise is the most serious vetting you'll ever get. Um, what do you what what so what does that mean? Um, a lot of tools start out single player.
Enterprise is immediately multiplayer to the tune of tens of thousands of developers, tens of thousands of repos.
Uh, the pricing power is also very interesting. uh instead of seeking
interesting. uh instead of seeking instead of starting with like a standard $20 a month plan and seeking the maximum subsidy and getting pissed off whenever people remove the subsidy and then moving on to the next best subsidy. Um
people are willing to pay for outcomes because this is enterprise we're talking about. Um and also but but to me the
about. Um and also but but to me the most interesting thing is being the first to discover expensive problems. Um and that's probably only discoverable at the uh enterprise scale. Um so this is
the sort of standard cognition site. I'm
going to show you my version of it which hopefully uh is more memorable. Uh all
in all I call this the Devon is in the details which is kind of a nice pun.
Um and this is the the the subject of part two of the talk right like I'm not here to talk about cognition. I'm here
to talk about what I learned from cognition in case you guys end up building an agent lab or working at an agent lab because I think it is probably the most single valuable lesson for any
AI engineer. Um so for for reference I
AI engineer. Um so for for reference I wrote this up in a post called the agent labs thesis. Um this is the November AI
labs thesis. Um this is the November AI engineer summit that we did in New York.
Uh where we listed agent labs on one side and model labs on the other. You
can look at those conferences on YouTube. Uh if you want to see what
YouTube. Uh if you want to see what examples of agent labs versus model labs look like. Um but if you want it in one
look like. Um but if you want it in one chart, this is probably it. Um where
model labs uh proportionately allocate resources towards training and compute uh and less towards uh deployments.
Obviously that deployment has gone up over time. Um, agent labs are more or
over time. Um, agent labs are more or less the complete opposite in terms of the resource allocation and prioritization, right? Um, and I think
prioritization, right? Um, and I think this is mostly holding except that they are starting to encroach on each other's turf. Like when I wrote this it was like
turf. Like when I wrote this it was like more clear now model labs are building agent uh labs internally uh with open eye and topic also doing uh hiring FDAs and then agent labs are building models
internally as well with both cursor and cognition uh rling like like putting a lot of compute in rling their models. Um
if you want to sort of break it down in terms of like uh contrast you can also do that in this way but I'm going to skip over this uh for the sake of time.
Um, and I think like the the details is really what I really want to sweat, right? Okay. So, for example, uh, a lot
right? Okay. So, for example, uh, a lot of people will say like just put your favorite coding agent of choice. I don't
want to name any uh, ones to to not piss them off. Uh, just put it in a
them off. Uh, just put it in a container. Um, the the reality is that
container. Um, the the reality is that it is not just about the container format. Uh, it's also about just
format. Uh, it's also about just building stateful sessions. Um, and
these are all the problems that have historically come up, right? Uh, it is about giving it real machine semantics, about giving all the tooling for real computer use. Um and here's a fun
computer use. Um and here's a fun example of a real life situation where shared machines uh if you want sort of multi-tenency for your sessions stateful sessions for coding agents it will actually break right so this is a real
incident um that these are real incidents with the same root cause right um so real incidents for example parallel agent sessions interfering with each other because they have a shared cache um or agent and auto except mode
published the entire company's source code onto a personal GitHub because why they had they had the secrets uh comingled right Um so what they both share is that basically you have no
isolation boundary in a container like container just only knows about one thing uh but it doesn't really uh it's not really set up for crossing or changing context between agent sessions.
Um so basically what you end up building is an agent platform which is everything above just the the VM or the container.
Um and this is the a full list. I'm
basically kind of open sourcing this. If
you guys wanted to build an agent lab, these are the things that exact things that you have to go through. If you're
if you're considering buying, this is what you have to evaluate whenever uh you are encountering a new agent lab for the first time. Um security is a very very important one of of course especially if you're natively
multiplayer with multiple levels of teams and orgs and all these things. Um
so uh agents like definitely need a lot of scope identity and lease privilege and these are all things that you sort of have to work through in terms of your your permissioning model. Um second
perception just the GPT wrapper right like that's what all that's all that application layer people are. Um I think to some extent you can be proudly a GPT rapper but you the the entire name of the game is just to make it thick and
worthwhile right. Um so the reality is
worthwhile right. Um so the reality is they actually long model diversity which has been a historically very good bet over time right like model diversity has a demonstrated um tendency to increase over time um open market share used to
be like 70 80% now it's down to 30 something uh depending on the the source um and uh also you're not just training you're not just wrapping other people's models you're also increasingly able to
train your own based on your own domain specific data uh and use cases as well.
So both uh cognition uh which uh these sweet grabb models and 3.5 models which I worked on as well as cursor are also doing that um and I think any other competent enough agent lab will have
enough resources to to to build it and you should do it because it's going to be much more bu fit for purpose right like um uh for for the majority of your
workload. Okay. Um, one more perception.
workload. Okay. Um, one more perception.
Um, eval is such a nebulous marketing concept, right? Like, uh, you most
concept, right? Like, uh, you most people just tell you to look at Sweetbench like my number is 0.1% higher than the other number. My model's
better. Uh, in reality, like reality is extremely multi-dimensional. Um, so
extremely multi-dimensional. Um, so here's all the examples of different kinds of evals that Cognition can run internally. Um, and it's not
internally. Um, and it's not summarizable in Cbench. And obviously
you're going to want to have different approaches for each of these uh real life use cases. Each of which can have tens and hundreds of millions of dollars behind them. Um so my my sort of spicy
behind them. Um so my my sort of spicy hot take is enterprises are the hardest eval that you can possibly get, right?
Like show me an RL environment that is harder than that than than the enterprise. Um cognition itself is an
enterprise. Um cognition itself is an enterprise with multiple orgs and multiple slacks and multiple uh IT systems and all that. Um it has been really solved in the last uh six months
which which to me like having joined more than six months ago like it was really interesting to see that oh like I I thought that was good and now I now I have a different definition of good. Um
and it's interestingly correlates with AR growth all of which has been publicly disclosed. So I'm not telling you
disclosed. So I'm not telling you anything you don't know. Um the the the new stuff I'm I'm I'm going to show later on. Uh, but I do think that uh
later on. Uh, but I do think that uh that's one of those things where you do have to keep track of like how honest are you about how much of a problem you're solving in the world versus uh
doing fun demos. Uh I think one of the interesting things is also communicating um like what kind of outcomes you people are paying for. Um it's very hard to do
that on a landing page, on a brochure, on a talk. Um, so I basically don't bother like I just paste this in here just because like people expect me to paste this in here, but I'm just gonna I'm just going to skip past it and I'll
tell you more locally specific stories about what we found in APAC in Singapore because that is basically why I can open source. Yeah. Okay. So that's the part
source. Yeah. Okay. So that's the part three like why Singapore like why am I here? Um, and I think the the sort of if
here? Um, and I think the the sort of if I can summarize the sort of uh the story of Singapore's economic development started with trade, then we went to oil, then we went to finance. We had a little
bit of a fling in bio. Uh, let's not talk about the crypto side. Um,
but like what's next, right? So, my
spicy take is that we have all these sort of leading leading figures. Fun
fact, did anyone know that Keo and Sam Corp merge into catrium? I just found that out. You one person knows. Um, so
that out. You one person knows. Um, so
like anyone who's actually Singaporean would be like, "Yeah, Sim Corp, Marine, Keo Corp." What is Catrium? Um, it's the
Keo Corp." What is Catrium? Um, it's the it's the the new entity. Anyway, my my sort of cheeky answer is that there is obviously a fourth phase to Singapore's
economy and it is here. Um,
uh, I'm here because Singapore has been chosen to be Cognition's Asia headquarters. Um, which is yay like
headquarters. Um, which is yay like very, uh, super fun. Uh I think you have to sort of uh even as a Singaporean I think you had to do go through this journey and it's something that we've
always wanted right like uh we always want MNC's you know the local term um to to choose their base here not just for sales which sales is fine sales is great uh but also for engineering also for
research um to me you have to succeed overseas to um to be recognized and do well locally which I call the Sununu strategy um and it's not just GTM so uh there's
all these quotes uh which I which I really like. Um uh Cornish hired or
really like. Um uh Cornish hired or acquired Havana. I think Nathan is in
acquired Havana. I think Nathan is in the audience somewhere as well as some of the other crew. Hey Nathan. Um and
definitely talk to Nathan uh later on if you want to join COG. Um and so I think like it's it's this is working right.
All I'm trying to say is like I have been part of the Singapore tech scene for my adult life and we've never had this level of foreign interests and
American interests uh in this region in Singapore in basing engineering and research in Singapore until now. So now
is the moment. Let's do it.
Okay. Um
um let me uh so I'm gonna uh so I actually had Nathan uh who who is my chaji uh go through all the call logs right uh uh of of all the work because there is so much work that that happens
behind the scenes that you guys never see because you're not in this business and we are and so I I wanted I wanted to share some examples right um uh here's some examples of like the the sheer
amount of demand that APAC has right um millions tens of millions of dollars a year spent on LM tokens. Okay. Um and
and the way that they they run your loans, your money is on spreadsheets by business analysts who do not stay there.
Right. So imagine if like you come in and you're like what the this the bank is run this way. Yes. Right. So you have to systematize it. You have to co you have to write the code that is otherwise
uh manually operated by business analysts. Uh same in government. Uh same
analysts. Uh same in government. Uh same
in uh in in other parts of tech. Um I
think uh you know again like this is the this is the sort of normal way that we present these things. These are all real numbers from the customers not not from the not from the company but I think like it's hard to tell unless you it's hard to tell from the just the numbers
itself like okay what does it mean for 10x acceleration and delivery time? Well
let me just show you the baseline right like the the baseline is for a local bank you two million cobalt uh lines of code of cobalt with no documentation and no engineers owning it. What are you
going to do?
Um and and so like this is where you can start to really apply AI and uh let me tell you this is not unique to just Singapore or just one bank. It is all banks. It is all it is any any company
banks. It is all it is any any company that has like a really large uh scale of of customers which is enterprise.
um hund00 million uh budgets for AI, 600 developers per roll out. Um uh you know like it's it's really uh mind-blowing the the amount of work that needs to be done that we cannot hire humans for
because it is boring. It is it is on a a language or system that nobody wants to work on whatever right um so I I'm hopefully sort of sharing some of these new stories for the first time. If you
you know if you want to to ask more ask Nathan. Um but uh I just want to share
Nathan. Um but uh I just want to share some solutions that uh cognition has figured out that um that that has worked right. Um Devon has a thing called
right. Um Devon has a thing called playbooks uh which are basically way more structured than normal chat that is basically like a single playbook can be worth hundreds of millions of dollars in my opinion uh because they are structured templates that can
parallelize agents in a much more reliable way than just open-ended chat.
So if you haven't tried a Devon playbook, you definitely should because these guys are using these things to transform banks and make uh billions of dollars. uh decodebased comprehension is
dollars. uh decodebased comprehension is the mode again why no docs right so of course you want AI to write the docs first and then use the docs to do the migrations so uh cognition was the first
pioneer in deep wiki um which uh I think a lot of people love as well uh there's billions of revenue in brownfield work and yeah lastly I think uh it's a standard thing in enterprise but it is
uh so so sort of uh surreal or so visceral and physical to see people and sales people say okay that the guy won't even get on the phone with us until we
have custom SSO. And like why? Because
they've locked down their GitHub and GitLab because they're responsible enterprises. And like the rest of us
enterprises. And like the rest of us don't like we just yolo stuff into our obsidian and our and our um personal sort of open clause. We don't really think about that. But when you have the
trust of millions of people, the money of millions of people, of course, you need to think about security like that.
And anyone that serves these enterprises also uh transitively has to do that. So
that's uh that's why I call sort of agents in Asia. Okay, let me sort of uh vector out again. Um I shared some learnings about APEC. Now I'm just talking about Singapore and why why I want to call this the agenting nation.
We're not there yet, but we're getting there. Um and we have to refer back to
there. Um and we have to refer back to our our dear forward deployment minister. Um Abishek I think who is in
minister. Um Abishek I think who is in the audience somewhere named him the forward deployment minister. I think
it's kind of sticking. Everyone's kind
of like that. Um he said three things in his talk this morning. He said that we have an advantage in deployment, democratization and decentralization which again not crypto. He actually just means that he wants uh he wants AI
everywhere in in public service. Uh and
I think we can help in all three of them. I think like this is actually
them. I think like this is actually really really good that he gets it and like the rest of us can get it. Um to me it was it's such a shock to realize that
like um Singapore itself has such demand uh roughly four times uh demand and supply of AI engineering talent. Um and
you know like it's it's cons going to continue to widen and grow, right? Like
demand for these roles are like growing 40% yearon year. There's so much money at stake. Uh and this is LinkedIn
at stake. Uh and this is LinkedIn surveying the field and really sort of reporting on this. So I I think it's like pretty credible number as well. Um
so my spicy take is I have stopped I've given up on the government. Like I I know I've just praised uh minister but um I've waited for years for the government to do something about the
tech sector. Uh I had this podcast with
tech sector. Uh I had this podcast with Minister Josmin uh and there's us walking together and talking about the future of it. Nothing happened. Um it is
only when the it is only when us when we the people of Singapore, we the citizens of Singapore decide to take matters into our own hands, right? Like um I think Singapore has a history of government-led economic development. Um
I think that I think the new age is going to be led from the private sector first into the public sector. So let's
go and make that happen, right? Um and I think this this conference is an example of that. We didn't wait for the
of that. We didn't wait for the government to approve it or like uh give us their support. It's very nice to have IMDA and AI Singapore in in the uh in Pullman and all the other expo stuff
supporting us. It's very nice to have
supporting us. It's very nice to have the Ministry of Foreign Affairs supporting us, but we do not need them.
We are we're here to work on the private sector and build ourselves up as a tech sector ourselves. So it starts with
sector ourselves. So it starts with everyone being high agency particularly all these organizers standing in the wings here. Give them a round of
wings here. Give them a round of applause. Like they made this happen.
applause. Like they made this happen.
This is their side project. They put you guys together. Um and I obviously I I I
guys together. Um and I obviously I I I helped and supported them. But like this conference would not happen without them. So um it's it starts with everyone
them. So um it's it starts with everyone in this room. It started with me. It
starts with with these organizers of 65 labs and it now it starts with you. Um,
so I really hope that you can come away from AI engineer and be more agentic in your lives and and really turn Singapore into a more agentic nation in general.
So thank you very much.
Well, we've come to the end of the first day of talks. Uh, thanks for staying all the way till the end. This is crazy.
Like give yourselves a round of applause for making it to the end of 10 hours of programming.
Okay, final notes before everyone heads out for dinner. Uh, we do have an afterparty here. We'll open the doors at
afterparty here. We'll open the doors at about 9:30. I'll start DJing at 10:00.
about 9:30. I'll start DJing at 10:00.
We've booked a DJ who's flying in from the UK who'll start playing at 11:30.
There's free flow for the first 500 people. So, if you want to come and
people. So, if you want to come and drink, be my guest. The chairs will go down. This will become a dance floor.
down. This will become a dance floor.
Um, and we want you guys to come and have fun. Um, if you are a conference
have fun. Um, if you are a conference attendee, please bring your lanyards with you because that'll help us prioritize your entry and do not lose them because we are not printing new
lanyards tomorrow. Hopefully that part
lanyards tomorrow. Hopefully that part of the instruction set is clear. If
everything is good, thank you so much and we'll see you bright and early tomorrow morning or tonight.
Hey, hey, hey, hey, Hey, hey,
hey.
Hey, hey, hey, hey.
Hey, hey hey.
Hey, hey, hey.
Hey, hey, hey.
Hey, hey, hey, hey, hey.
Hey, hey, Hey, hey, hey.
Hey, hey hey.
Hey, hey, hey, hey.
Loading video analysis...