Edwin Chen: Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste
By Unsupervised Learning: Redpoint's AI Podcast
Summary
## Key takeaways

- **LMSYS Optimizes for Clickbait**: Optimizing for LMSYS is like optimizing for clickbait because users scroll through responses in one or two seconds, preferring ones with more emojis, formatting, markdown, headers, and length over accuracy or instruction-following. In one example, users preferred the response wrongly claiming 1452 has only one divisor over the correct one listing the divisors under six. [01:22], [03:05]
- **Models Regress Without Proper Metrics**: A team's models regressed over 6-12 months because raters didn't execute code, providing data full of flowery language and bugs instead of verifying correctness, and the team lacked quantitative measurements to detect it. This happened despite industry-wide progress in coding. [05:12], [06:30]
- **Benchmarks Cause Real-World Degradation**: Models excel at hill-climbing narrow benchmarks, sometimes via contaminated training data or overfitting, leading to impressive scores but worse performance on real-world problems, like SAT prep not building broader skills. Frontier labs now rely on proper human evals with diligent, tasteful raters mimicking real-world messiness. [07:44], [09:20]
- **Top Evaluators Need Taste and Creativity**: Great evaluators have domain expertise (algebraic topology, PyTorch), sophistication to judge well-designed code or essays beyond correctness, creativity to write diverse prompts spanning real-world distributions, and strict instruction-following. Credentials like PhDs aren't enough; meritocratic measurement reveals true quality. [10:52], [11:30]
- **Frontier Labs Diverge on Objectives**: Labs differ in what they optimize for: some ignore LMSYS to avoid hallucinations and negative progress, OpenAI leans toward user engagement like long sessions, Anthropic toward productivity and time savings, and those choices shape model capabilities, products, and people. Edwin shifted from expecting one model to rule them all to a constellation of models shaped by unique theses. [31:41], [35:02]
Topics Covered
- LMSYS Optimizes for Clickbait
- Bad Data Causes Model Regression
- Benchmarks Mask Real Degradation
- Great Evaluators Need Taste
- Constellation of Specialized Models
Full Transcript
As the CEO of Surge, Edwin Chen gets a front-row seat to everything happening in the foundation model ecosystem. Surge, which is worth a reported $24 billion, works super closely with all the top labs on improving their models. And Edwin had a really interesting perspective on a bunch of different things. I'm Jacob Efron, and today on Unsupervised Learning we talked about Edwin's learnings from working with the top labs and the divergent approaches he's seen them take so far. We talked about where models are today, his view on RL, RL environments, RL as a service, and that whole startup ecosystem. We talked about what it means to have model taste and what really good researchers do. And we also hit on how Edwin's views have changed from there being one model to rule them all to there being a constellation of different models. Just a fascinating conversation with someone who gets to see the future being created every day. Without further ado, here's Edwin.
Well, Edwin, really excited to have you on the podcast. Thanks so much for coming on. There's a lot I'm excited to explore today with you, but one place I figured I'd start: I listened, in prep for this, to a few podcasts you've been on, and a lot of the things you said were pretty interesting. One thing you've talked about pretty consistently is the consequences of choosing bad benchmarks as a model builder, and I think you've specifically called out LMArena and some of the other things folks are optimizing for. I'd love if you could elaborate a bit on the consequences of some of these optimizations.

>> I think the way you should think about LMArena in particular is that when you optimize for LMArena, you're basically optimizing for clickbait. The mental model you should have in your head of LMArena is that users go onto the site, they issue a prompt, they get two responses, and their goal is to vote on which one is better. But they aren't carefully reading your responses. What they're doing instead is they just scroll through the two responses, and after two seconds they pick which one is better. They're not carefully reading them. They're not seeing, oh yeah, this one followed all of my instructions, this one was completely accurate, this one was well researched, it was high quality, and so on. They're not doing any of that. They're reading through very, very quickly, maybe one or two seconds, and asking, "Okay, which one impressed me the most?" What they're going to optimize for is, "Okay, which response had more emojis? Which one caught my attention?" And the things that catch your attention are a lot of emojis and a lot of formatting, so responses that just contain a lot of markdown, a lot of headers, and so on. They'll also just naturally prefer responses that are longer, because if it's longer, sure, it has a sheen of expertise to it. They're not reading these things through.

For example, I was actually looking through one of the datasets recently (they've published a bunch of these online) and I just saw all these incredible errors. One of them: I think the prompt was "tell me all the divisors of 1452," and one of the responses said, oh, there's only one divisor of 1452, and that's the number one. And the other one got it completely correct. I think the prompt was actually "what are all the divisors of 1452 under six," and obviously one, two, three, four, and six are all divisors, and that was the other response. But if you look at the data, guess which one the user preferred. The user preferred the one that was completely wrong.
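As an aside on how trivially verifiable that prompt is, a few lines of Python reproduce the correct answer; this is an illustrative sketch, not anything from the arena dataset itself.

```python
# Divisors of 1452, and the ones that are at most six (the version of the prompt Edwin recalls).
n = 1452
divisors = [d for d in range(1, n + 1) if n % d == 0]
small = [d for d in divisors if d <= 6]

print(divisors)  # [1, 2, 3, 4, 6, 11, 12, 22, 33, 44, 66, 121, 132, 242, 363, 484, 726, 1452]
print(small)     # [1, 2, 3, 4, 6] -- so "the only divisor is 1" is plainly wrong
```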
Again, if you even think about it, this is a fairly simple mathematical question. But if you ask a more advanced prompt, are you going to research it? Are you going to go fact-check it yourself? No. That's just not what you're going to do as a user. And so what ends up happening is that users are basically optimizing for whichever one caught their attention. It's almost like a tabloid.

>> And it's interesting, it seems like something that pops up in a bunch of different consumer use cases. I remember there was a study of ChatGPT for medical responses, maybe a year ago or so, and the responses were rated way higher than physician responses, but I think when they actually unpacked why, it was literally just because the responses were longer.
>> Yeah, exactly. That's one of the big things we've seen: even when you haven't intentionally trained on LMArena data, maybe you put a bunch of your models on the site, and okay, you have 10 different models on the site and you're just going to A/B test them and pick the one that performs the best. Again, users just naturally prefer responses that are longer. It's the easiest way to think a response is high quality. And so we've seen a lot of models where, when you go through that process, you just naturally end up with responses that are two, three, four times as verbose as ones that aren't.

>> And obviously this isn't the only thing that people are optimizing for and leaving things on the table. What are some other examples you've seen of where not optimizing for the right thing really leaves a model on the wrong track?
>> I think the problems especially arise when you're both training on the wrong kind of data, optimizing for the wrong objective function, and, in conjunction with that, you don't have the right measurements in place. For example, earlier this year we were starting to work with a new team. A lot of the researchers were telling us they suspected their models were getting worse, but they didn't have any quantitative evidence of it because they didn't have the right measurements in place. So we basically dug into their models for them, and what we found is that over the past 6 to 12 months their models had actually regressed. What happened was, in the data they were gathering from supposedly expert coders, the raters just weren't executing the code. They weren't checking the code carefully to see that it was actually correct. What they were doing instead, very similar to the LMArena case, was giving training data that was essentially full of flowery language and grandiose claims, like, oh yeah, here I produced this amazing program for you that does ABC things. And they hadn't taken the time to execute it. Executing code can often be pretty hard: you have to install these libraries, you have to have all this infrastructure in place, you have to make sure you actually understand the language, and so on. So the code was either completely wrong or just full of subtle bugs that people wouldn't notice until later. And at the end of 6 to 12 months, because they didn't have any actual measurements in place to see whether or not the models were improving, they were optimizing for this. It's kind of crazy, because the whole industry is moving forward. Coding is such a hot area where everybody else is progressing, and then this team is basically making negative progress because you don't have the right data and you don't have the right measurements. I think it's a real concern that enough people in industry just aren't paying attention to the quality of the data they're receiving and whether or not they're measuring the right things.
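The fix implied here, actually executing submitted code against tests and tracking a quantitative pass rate instead of trusting a rater's impression, does not have to be elaborate. Below is a minimal, hypothetical sketch; the function and the aggregation are assumptions for illustration, not Surge's actual pipeline.

```python
import subprocess
import sys
import tempfile
import textwrap

def passes_tests(candidate_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Run rater-submitted code plus its unit tests in a subprocess; exit code 0 means the tests passed."""
    program = textwrap.dedent(candidate_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        # A real pipeline would run this inside a sandbox and capture richer diagnostics.
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Aggregated over a batch of labeled examples, this yields the kind of quantitative
# signal the team was missing:
#   pass_rate = sum(passes_tests(code, tests) for code, tests in batch) / len(batch)
```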
>> I mean, it's interesting how things can actually go south, given all the model improvements that are happening, if you're taking shortcuts on the data side. And I was also struck by the fact that in that example, a big part of the problem is just not even knowing for six months if your models are getting better. I think this is something interesting we've seen across the industry now, where it's not always clear. There was a generation of model improvements where it was blindingly obvious that models were getting better in some ways, and now it feels like it's a little more challenging to tell at times. What are the best companies doing to figure out, on a month-to-month basis, are our models getting better?

>> So historically, all the companies paid really close attention to benchmarks, these very academic benchmarks created by the research community, and not all the labs realized the problems with benchmarks. It's very easy to optimize for benchmarks. Models, at the end of the day, are very good at hill-climbing very concrete, very specific, very narrowly defined objective functions. And so what would end up happening is that the models would make a ton of progress on the benchmarks. Sometimes it would be fake, because the benchmark data would actually be in the training data and people wouldn't realize. Or even if it wasn't fake per se, and they were actually improving on the benchmarks, what they wouldn't realize, because they'd narrowly focused on the benchmarks without measurements in place outside the benchmarks, is that, sure, they got really good at some narrowly defined problem, but their models were actually getting worse on more real-world problems. Maybe an easy example would be optimizing for the SAT. As a high school student, sure, you can spend hundreds of hours optimizing for the SAT. The SAT is a very narrowly defined set of problems: reading comprehension, analogies, vocabulary. But it's not really measuring your ability to, I don't know, write well, or to perform complex problem solving in the real world, or to do all these other things that are outside of the SAT's domain.
And so what would end up happening is we would actually see companies, a lot of frontier labs, suddenly get all these impressive scores on the benchmarks, and we would just play around with the models ourselves and go, no, your models only got a lot worse, even before they told us they were optimizing for the benchmarks. It became prevalent enough that every now and then we would start playing around with a model one of the frontier labs gave us and ask, why did this suddenly drop in quality? You guys probably gathered a lot of synthetic data to improve on your benchmarks, is that right? And they'd say, yeah, we suddenly doubled performance on XYZ benchmark, and we didn't realize all the other consequences. So I think that's a real concern. And basically what the best labs have realized is that the only way to measure the performance of their models is to run proper human evals. The way it works is you're essentially asking high-quality people, people who are very diligent, who are paying attention to the content of the responses but are also really sophisticated and have a lot of taste, to make sure, yeah, this is a fun personality, this is the kind of style that the frontier labs want their models to emulate. It's basically mimicking the real-user, real-world experience: all the diversity, all the complexity, all the messiness of the real world, without the artificial construct of benchmarks. And then it's also layering on quality judgments from really experienced and really trusted raters, as opposed to anonymous people who aren't paying attention. That process of going through rigorous human evals has been the gold standard for all the frontier labs.

>> Beyond, I guess, paying a little more attention, what have you noticed about what makes a really good evaluator have taste?
>> At a high level, it might be the following four things. One would certainly be expertise in and of itself. A lot of our evaluation sets are very advanced: they might be measuring a model's ability to perform algebraic topology research, or its ability to use PyTorch, and so first of all you just need really intelligent people with a lot of expertise in the domain they're evaluating. So that's one piece.

Another would be this notion of sophistication and taste. At the end of the day, when you're asking a model to write code, you don't just care about correctness; you care about whether the code is well written and well designed. Or in the area of essays or creative writing: was this a really well-written essay that introduced new ideas, had great prose, and just doesn't feel like AI slop? So that's the notion of sophistication.

Another would be this notion of creativity. I think people often underestimate how important prompts are. People think of evaluating models as, okay, you're evaluating the response, but in order to measure the models properly you need prompts that span the entire distribution of what you want your models to be good at. Within creative writing, for example, if your prompt set is literally just a thousand stories or a thousand essays all phrased in the same way, as opposed to the long tail of how people actually interact with models in the real world, that's just not going to cut it. And I think people often underestimate how difficult it is to create good prompts and to be creative. It's almost like that phenomenon where, if you just ask me to name 50 foods, it's actually really hard for me to do that unless I think really hard or force myself to constrain the list of foods in various ways. In the same way, it's surprisingly hard to be creative in this fashion. So anyway, I would say creativity is another piece.

>> Yeah.

>> And finally, the fourth piece is just the ability to follow instructions. When the labs are asking you to evaluate their models, they often have very specific criteria in mind: they have a certain style guide, or a certain personality they want the model to follow, or they care about XYZ criteria more than ABC criteria. Oftentimes these are very complex instructions, and people just need to be very good at following them.
>> You know, I guess moving to one of the main themes of model improvement these days: it feels like RL environments, and the reward models around them, are all the rage. Can you talk a bit about that transition and how you've thought about supporting your customers with it?

>> Yeah, definitely. The way I think about RL environments is that they're kind of a continuation, almost just the next step in training paradigms. In the same way that historically a lot of work has gone into SFT, and then RLHF, and then verifiers, RL environments are just the next step in that progression. And I do believe there will be other steps in the future. I think RL environments are also really interesting because we've actually been working on our own RL environments for quite a while now, maybe one or two years, so it's interesting that the rest of the industry has only just started picking up on them. A lot of the teams we work with, especially, like this really amazing agents team at Meta: we've actually been working with them on RL environments for over a year. This is the team that created the Gaia benchmark, and it's basically their agents team that also just open-sourced their agents research environments platform. So I think they really saw the wave of the future. RL environments are interestingly different.
interestingly different. If I think about what we've had to build in order to support our environments, like what you need when you're creating RMS is our environments is the following. So like
one you need to have these really really rich worlds that are basically simulations of you know the real world as best as you can. So just in the same way that uh as I mentioned earlier that
creating prompts can be surprisingly hard because you want them to be creative and diverse and rich in real world as opposed to you know these very
very synthetic prompts that you find in benchmarks. Yeah. or you um or like just
benchmarks. Yeah. or you um or like just maybe what will naturally happen is if you don't have any constraints on diversity and creativity, you just let people create prompts willy-nilly, you just don't get much diversity. So in the
same way when you're creating these worlds, you also really need to have a bunch of really um like kind of like underlying entities supporting the world. Like basically when we create
world. Like basically when we create these worlds we have uh we basically populate them with uh people and businesses and tools and interactions between these people and you know
messages, slack, sack messages, emails, calendar events etc etc and we want all of these to basically mimic the real world in very very interesting and complex ways and so like a like a big part of our efforts. Okay. So how do we
build the tooling? How do we build the infrastructure? How do we build the the
infrastructure? How do we build the the quality control measurements? how we
build a data data measurements to make sure that this is happening. And then
once we create these worlds, we need to um basically create all the tools that exist in these worlds. So these could be the MCP servers that models are
accessing. These could be the browsers
accessing. These could be the browsers or the code that the models are executing. So we just need to make sure
executing. So we just need to make sure that we have the like the underlying infrastructure for the models to basically run within this world and to be able to execute the prompts. And then
we need to, you know, create the prompts themselves. We need to actually test how
themselves. We need to actually test how models are performing on these. We want
to make sure that um we're basically coming up with tasks that test the the limits of all these frontier models.
There's also like a very very big measurement and almost like an introspection um aspect where once we discover that a model has failed, we want to dig in and understand why. So yeah, there's I think
understand why. So yeah, there's I think there's a lot of really really interesting infrastructure and tooling that um that we ought to be able to support this.
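To make those pieces concrete, here is a minimal sketch of a simulated world, the tools an agent can call inside it, and a task with a checker plus a recorded trajectory. Every name here is an illustrative assumption, not Surge's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class World:
    """The simulated entities an environment is populated with."""
    people: List[str]
    businesses: List[str]
    emails: List[dict] = field(default_factory=list)
    calendar_events: List[dict] = field(default_factory=list)

@dataclass
class Task:
    prompt: str                      # what the agent is asked to accomplish
    check: Callable[[World], bool]   # verifier run against the world's end state

def run_episode(world: World, task: Task, agent_step: Callable,
                tools: Dict[str, Callable], max_turns: int = 20):
    """Let the agent act through tools, record its trajectory, and score the end state."""
    trajectory = []
    for _ in range(max_turns):
        # agent_step returns something like {"tool": "send_email", "args": {...}}
        action = agent_step(world, task.prompt, trajectory)
        trajectory.append(action)
        if action.get("tool") == "finish":
            break
        tools[action["tool"]](world, **action.get("args", {}))
    # Keep the trajectory: it is what you inspect for reward hacking and failure analysis.
    return task.check(world), trajectory
```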
>> I mean, you've been doing this work for a while now. I guess what have you learned along the way, or what surprised you, or what did you maybe initially get wrong in the way you set these things up and have since gotten better at?

>> One interesting viewpoint we've tried to propose is that it's actually very important to pay attention to model trajectories to understand why they are succeeding and why they are failing. People often underestimate the amount to which models can reward-hack themselves to the correct answer.

>> I feel like there are a lot of very funny examples of that.

>> Yeah. One of my favorite things is looking at some of these examples, because they often just perform in all these crazy ways. And similarly, I think people have actually underestimated the myriad of different ways that models can fail and what that says about the model's underlying capabilities. People have this notion in their head that, okay, I'm just going to give a final reward, and if I train on that reward, everything's going to work out. I think what happens in practice is that models tend to deviate in these very odd ways, or they can show different types of intelligence depending on what the underlying capabilities are, and if you don't shore up those underlying capabilities enough, models may seem to perform well in the short term, but you're basically going to face a lot of problems down the road.

>> This feels like a theme in a lot of the work you do with the labs, right? There are easy things to hill-climb on, but if you're not doing it in a deeply thoughtful way, with the right evaluations, it doesn't actually serve the end purpose you're going for.

>> All these things tie together, because it's something we've often seen with benchmarks, where people end up hacking these benchmarks and so they think they're making progress, because the numbers everybody's paying attention to are going up, when their underlying model isn't becoming materially more intelligent. So yeah, I think all these things tie together.
>> And then obviously, alongside RL environments becoming a more consensus way to improve models, there's been a flurry of net-new startup activity, right? I feel like there are probably 30 YC companies trying to build RL environments, and a lot of these RL-as-a-service companies. I wonder what you make of that activity. And as a follow-up, it feels like every time there's a transition in the main ways labs are improving their models, some people see that as an opportunity to step in, but obviously there's persistence from the people that were there in the previous wave. So I'm wondering how you reflect on that.

>> I think one of the things that always drives me crazy about Silicon Valley is this pivot culture people have in their minds, where people are constantly pivoting to whatever seems to be the latest hot topic or whatever seems to drive the highest valuations, instead of building things that they really, materially believe in. And it's almost like this funny thing where a lot of people in Silicon Valley look down on Wall Street: oh, Wall Street, what are you doing, you're just chasing money at the end of the day. But what's Silicon Valley doing? You're just chasing the same thing. You're chasing valuations, you're chasing VCs, you're pivoting not because you have some amazing idea but just because that's what YC told you to do in order to achieve product-market fit and get some revenue you can show VCs when you're fundraising. So yeah, I really don't like that culture.
>> Well, and what I'm struck by is that obviously it's not clear to anyone how long RL environments will be the flavor of the day for improving models. Maybe that persists for a while, maybe it doesn't. But what you've clearly shown the ability to do over time is, whatever the way models are being improved, you have an offering that helps support that. And obviously I'm sure a lot of that comes from being deeply embedded with your customers. I could imagine some folks saying, "Look, building an RL environment is so different from the first few acts of Surge; to what extent do they have the right to go do that versus the 30 new entrants?" I'm curious how, on the ground, that's felt.
>> If you think about what RL environments involve, they involve the following three things. One, RL environments are just the next iteration, the current iteration, of the data that's needed to enable AGI, and that just fits with our fundamental thesis: we want to create whatever data is needed to enable that. So that's a big piece of it. Second, RL environments require a lot of tooling, like I mentioned earlier: the tooling to create all these tools, to run the models, to measure them, to analyze them, and so on. And that's not any different from the fact that for RLHF you needed tooling as well. Sure, it's a different kind of tooling, but all of those pieces are the same. Even when we've done RLHF, you need a lot of tooling to ensure you're able to analyze the models as they progress throughout a conversation, you need to understand the wins and the failures, and you need to make sure that Surgers, the people who are creating these prompts and evaluating responses, are able to do all these things in a really high-quality and diverse way. So you obviously need a lot of tooling to support all this. Again, this is very different from a lot of the other companies in our space, which are essentially just staffing agencies and historically haven't built any technology. We've always been a technology company first and foremost, so it's just another type of tooling, in the same way that any technology company builds tooling. And the third piece is that I think some of these other startups haven't quite realized that creating RL environments is all about getting really rich, complex, creative data, and there's just no other way to do that beyond using humans. Even think about SWE-bench: what is SWE-bench? Well, it's a collection of PRs that were created by humans. And SWE-bench itself wasn't quite clean enough, so people needed to build SWE-bench Verified, which is people basically taking problems, evaluating them, and cleaning them up in various ways. So in the same way, I think we fundamentally believe that creating RL environments is a human-data problem that just requires a lot of technology.
>> I think some people probably just assume quality is synonymous with credentials, right? They think, well, if I have a PhD in something and I'm labeling or spending time evaluating something, of course that's high quality. Maybe help us understand an example where you have someone who seems credentialed on paper to improve models, but it's not happening in practice.

>> An example I always love to give is Hemingway. Hemingway didn't have a PhD. I don't even know if he completed college. What we're looking for is the greatest people in the world at every skill, regardless of their credentials. Even just think about who works at Google, right? Google doesn't just hire people based off what school or what degree they have on their resume. And that's not how you progress at a company like Google or any other company: the way you progress is based off the actual work that you do. One of the interesting things about us as a platform is that we have a technology platform that looks at all the data people are creating and then measures it, so we gather basically millions of signals on our workers every day. We see the types of data they're producing, and so it's almost the most meritocratic thing you can imagine. As opposed to, okay, you progress because you happen to have a degree from Harvard: no, we are going to actually measure what you do and advance you based off of that. And sure, we have a ton of Harvard students on our platform. We have a ton of PhDs on our platform. I think we're probably the biggest source of PhDs in the world. But that just isn't sufficient, and it's not sufficient for two reasons. One is, take coders: if you're an MIT grad with a computer science degree and you're actually really good, you're probably not going to try to create really good data to train these models; instead you're probably going to try to cheat the system, right? Okay, you're a really good coder, you're fascinated by red-teaming systems, you're fascinated by adversarial attacks, so what you're going to try to do is find a way to game it. And the other part of it is, even if I think about all of the people who were in my class at MIT, or the number of people I've interviewed from MIT for Surge itself, honestly half of them can't even code, right? There's a very big difference between reading about something in a textbook and having the street smarts to execute it. Again, this ties back to what I was saying about performance on benchmarks versus performance in the real world. A big problem the frontier labs have had is that their models are almost too textbook-intelligent, instead of having the street smarts to do things in the real world.
>> You know, you alluded earlier to one of the labs having you at their big internal conference today, and obviously it speaks to how closely you work with those teams. And a few months ago, Meta decided to buy one of the vendors they work with really closely in Scale. Given the proximity of those relationships, and I know part of that was also bringing some talent into the organization, I'm curious about your reaction to that, and do you think over time some of those things are natural to happen, given just the proximity of those relationships?

>> I mean, I think the Scale acquisition was actually amazing for us, because up until then we'd been pretty under the radar. All of the top researchers at all of the labs already knew about us: they knew that we had the highest quality by far, and they knew that we were the biggest and the fastest. But the AI field has been growing so large that more and more people are entering the field every single day. So basically the Scale acquisition just put a spotlight on us that was really helpful for expanding. I think we got so much new demand from all these new teams overnight, and that was really beneficial for us.
>> You kind of mentioned this difference in the culture of folks who are attracted to these types of companies and the things that people are optimizing for. I'm wondering how that manifests itself in maybe the different decisions you've made for Surge today, and also the different trajectories these businesses go in over time.

>> Good question. I think it's shaped us in a bunch of ways, and maybe one of the most prevalent is in hiring. If I think about the culture we're trying to instill and the type of people we're trying to hire, it's people who are researchers at heart, people who fundamentally care about data and AI. When I think about us as a company and what we are trying to build, I think of ourselves as a lot more like a research lab than just another Silicon Valley startup that's trying to chase money and hype and valuations. One of the concrete ways that manifests is, again, this idea of hiring people who are fundamentally interested in research and data and enabling AGI, as opposed to the type of Silicon Valley person who's embodied by growth hackers, by people who are just doing whatever it takes to increase revenue. If you think about the incentives of those types of people, what they will often do is try to sell you things you may not need, that they don't think will actually improve your models. Essentially, they'll just act like salesmen, as opposed to digging deep into what you need, digging deep into the problems with your models, trying to make sure that you understand all the different ways you should be measuring your models to make sure they're actually improving, as opposed to just selling you snake oil. So I think that belief in caring more about model progress than revenue really shapes the company.
>> I guess speaking of model progress, how do you articulate the path to models getting better from here? And does it feel like there's consensus among most of the top labs around this, or do you actually see some pretty divergent approaches?

>> Yeah, I think there's been a lot more divergence than we expected across all of the training paradigms out there. It almost feels like every frontier lab has its own take on it. Sometimes those takes are wildly different; sometimes they're just slight variations on each other. But there's a lot more divergence than what I expected.

>> What are some of the key vectors where that divergence exists?

>> I think at a high level there are two ways in which the companies diverge. One is, once they've chosen their objective function, basically what type of training algorithm and what type of training data they're going to gather, and I can't speak too much about that. But I think an underestimated difference between all the frontier labs is their choice in what they optimize for and what they pay attention to, and I can give a couple of examples. One example: I mentioned LMArena earlier, and one of the fascinating things to me is that some frontier labs have just chosen not to pay attention to it at all. And I think those frontier labs have done better than the labs who have had to pay attention to it. What researchers at those labs have often told me is, "I hate LMArena." These researchers understand all the ways in which optimizing for the arena will lead to negative progress. It will lead to models that hallucinate, because LMArena users don't care about hallucinations. If anything, they loved them, because when your model hallucinates, it just makes it sound wild and crazy, fun and enticing, again, just like a tabloid. And so the fact that some frontier labs have had the fortitude and this underlying thesis, this underlying belief, in what they see as a path to AGI, as opposed to feeling like they need to chase publicity and chase this very hyped-up, publicly visible leaderboard, the fact that certain frontier labs have felt the freedom and fortitude not to pay attention to that because they had such a core belief in what they're going for instead, I think that has actually really shaped a lot of model progress.

And then I think another interesting divergence is in the choice of objective that the frontier labs are trying to optimize for, and I think you can see this difference most clearly between OpenAI and Anthropic. If you think about OpenAI, what are they optimizing for now? It's almost like they are leaning more towards optimizing for user engagement, so really long sessions or number of daily users, as opposed to a company like Anthropic, which might be optimizing for something more akin to productivity: how much value, almost how much GDP, how much productivity or time savings, you can extract by interacting with the model. And I think that shapes the types of products they build. It shapes the types of people they attract. It shapes the capabilities of their models. So I think it just really shapes them, in ways we're starting to see.
>> It's an interesting point, because it feeds into this larger question: obviously, a lot of model progress to date has come back to one core large model that can do lots of different things. And to your point, now it feels like you could imagine models optimizing around consumer engagement, and then a productivity model. I think you've seen this with voice, right: there's the enterprise voice people want, and then maybe more of an engaging consumer voice over time. Do you think there'll be one model that's just able to context-switch across whatever the thing being optimized for is, or do we actually end up in a world where it's like, no, you probably want this enterprise model, or even per-industry models for finance, legal, or other things?

>> Yeah. So I think this is one place where my thinking has diverged a lot. I used to think there would essentially be one model to rule them all, because, okay, sure, you have some super-intelligent...

>> The ultimate ASI vision.

>> Yeah, you have some super-intelligent model, and it should be able to context-switch and adopt whatever you want it to do. But actually, over the past year I've started to realize that it's almost like every company should have a thesis. The world is just so rich; there's never going to be a one-size-fits-all solution. Instead, every company, every lab, every AI needs to have a thesis underlying it of what will be useful in the real world and what kinds of AI will best serve people, and that thesis will shape how the model behaves. Sure, two models can be just as intelligent as each other, but they'll have different personalities. They'll have different biases for how they answer particular questions. They'll have different ways in which they converse with you, and so on. Just in the same way that, if you think about Google and Facebook, if Google were to build a social media platform, it's going to be very different from the way Facebook built social media platforms. If Facebook were to build a search engine, it would have a very different take from how Google would build a search engine. There's no right or wrong answer per se. It's just that different companies, different people, have different fundamental beliefs in the things that are useful and good for the world. And I think the same thing will happen with AI.
>> So what does that mean for your take on how many people should be building models? Do you think we'll see more model players over time, or consolidation among the folks we have today?

>> I definitely think there should be more people building AI, because, to that point, I think many different types of theses are needed on what kinds of AI will be useful for the world, and I just don't think anybody's figured that out yet.

>> Should companies be training their own models, as in, I'm a really large finance firm or a really large healthcare organization?

>> So I definitely think that eventually every company should be training their own models, and that's because these models will be so important to the world, and you'll eventually want to deploy them to 99.9999% of use cases, right? And if you simply rely on models from the frontier labs, what they're optimizing for may not be what you're optimizing for. Again, just because AI will be so important and you'll want to deploy it everywhere, I think if you want to get the best value and best performance possible, yeah, you should be training your own.

>> Can you achieve that optimization through just some good prompting or some light fine-tuning, or do you think it actually requires building somewhat from scratch?

>> I think this again goes back to what I said earlier about having a fundamental thesis on how AI should serve your customers and what types of AI you want to build. If you have a strong thesis, it's almost like having a product thesis, as opposed to just building some commodity product, if you believe you have some unique take on how these models should behave, then yes, I think it makes absolute sense.

>> Obviously we'll be curious to see how the cost of doing that changes over time, because certainly the trade-off of making that investment, or having that unique take, today it's relatively prohibitively expensive to get to the state of the art. But I imagine a lot of people could build these opinionated models with a thesis and still be 6-12 months behind state of the art and be okay.

>> Yeah, exactly. I don't think companies are quite ready right now, given the state of both the companies themselves and the state of AI, but as AI gets better and better, I think it will be increasingly important.
>> Obviously it's very clear models are getting way better at coding and these easily verifiable domains. I think there was a time where you'd use ChatGPT and a new model would come out and it would be blindingly obvious that the models had gotten better. I don't know if I would necessarily say that's been the case over the last three to six months. You're obviously on the inside of this stuff. Do you feel like models are still getting better outside of coding right now?

>> Yeah, I definitely think they are. In part that's through all the evaluations that I run, where we see this constant progress. But it's also true that just the other day I started using Claude a lot more for writing in particular, and I was just shocked by how much better it was today compared to a couple of months ago.
>> I do want to make sure I hit on some of the multimodal models being built, whether it's video, robotics, or the stuff being done in bio. I'm sure you've thought about some of this. To what extent is that interesting to you? Is it kind of a similar set of problems, or how would you characterize what's similar and different there?

>> Yeah, so all these modalities are fascinating to me. One of the things that people don't realize is that we actually work very heavily across all of these spaces already; I think maybe 50% or more of our work today is actually in domains outside of pure text. So I do think it's fascinating. And again, if you think about our thesis and what we're trying to enable, it's this idea that we just want to enable AGI no matter what it takes, and if we want AI that is useful out there in the real world, it needs to understand all these capabilities and operate across all these domains, and so we just want to do whatever it takes to make that happen.
>> What does quality mean in the video context? I certainly get it for these text use cases, but for video, what does that mean to you guys?

>> Yeah, I think people often underestimate how important quality is even across these modalities that people think are surprisingly simple. One example: even in the types of prompts that you're creating, you need a lot of creativity and technology to make sure you're exploring the full distribution of the space. I think people often underestimate that, because they just think people can create prompts out of thin air that target the full distribution of a model's capabilities, when it's actually surprisingly difficult. One of our big goals is that we often try to teach people, or try to make sure our customers understand, that when you think about quality, you need to go beyond robotic instruction-following and robotic correctness and think about all these other implications of the prompt itself. So yeah, I think people often underestimate what quality means.
>> Yeah. I mean, what makes either a video evaluator, or video itself, higher or lower quality? As you think about the problems text models have run into with LMArena, I imagine there are similar traps on the video side. What have you learned around that?

>> Yeah, so again, I think it boils down to this notion of taste and sophistication that I've mentioned before. Okay, sure, you ask Scorsese to film a video, create a video about, I don't know, a fish. Then you ask a high school arts graduate to do the same thing. Sure, both people can create a film about a fish, but Scorsese is probably going to make a much better film about the fish. And that's where that notion of taste and sophistication and creativity and just going above and beyond comes in, because when you think about what you want from models, it's not just the ability to literally follow your instructions and do whatever you say. It's to craft something that will blow your mind, something that feels imaginative and creative and raises the bar. That's what we're striving to do.
>> And then robotics and bio, these spaces that require a hardware component for data collection: do you think those are natural extensions for companies like you guys, or maybe you're already working in them, or is that a totally separate set of companies that might pop up and do that?

>> The way I think about it is that we want to do whatever it takes to enable the data that is going to help accelerate AGI. Sometimes that involves building new tools, sometimes it involves buying new hardware equipment, sometimes it involves expansion into whatever space is needed. We're a technology company; we're going to do whatever it takes to make that happen, as opposed to being some narrowly constrained company that just pivoted into the area and isn't really thinking longer term. In contrast to some of the other companies, we are thinking longer term about everything that's needed to achieve all these things.

>> Amazing. Well, we always like to end our interviews with just a quick-fire round where we get your take on a standard set of topics. Maybe we've hit on this a few times already, but I'm curious: one thing that you've changed your mind on in AI in the last year?
>> Probably the biggest thing is this idea that I used to think there would be one model to rule them all, and now I actually do see how these different product opinions, AI opinions, will shape AI going forward.

>> Obviously it seems like Surge has just been a series of incredible wins and you've built an amazing company, but I'm curious, reflecting back, what's the biggest mistake you've made in building the business?

>> So my background has always been in research and data science, and I used to love publishing, I used to love blogging. I used to love sharing all of our insights, and we did that early on, and then I somehow just got too busy to do it, and I really miss that: this idea of teaching the world and sharing our viewpoints on the industry and what needs to change or what needs to happen to make sure we're on a good path. So I think the biggest mistake is that we kind of stopped publishing as much in the past two or three years. And I'm hoping to fix that now.
>> If you were to get a week of vacation where you could just sit back and write a really long piece, what would you write about, or what's most top of mind for you?

>> The most top-of-mind thing to me really is this concept of objective functions and what every frontier lab is optimizing for. I think it's surprisingly subtle and has surprisingly far-reaching consequences. Are you optimizing for engagement? Are you optimizing for usefulness? Are you optimizing for number of users? Are you optimizing for GDP? Whatever it is, I think that concept has very far-reaching consequences for the industry and for AI at large.

>> What would you optimize for if you were running one of the labs? [laughter]

>> So, I would optimize for, I haven't quite crystallized it yet, but it's this notion of: a month later, would you be happy that you had this interaction with this model? Would it have almost changed your life in some way? The more moments like that a model can create, the better. Could it change your life because maybe you were asking about a vacation and it serendipitously introduced you to a new location you'd never thought about before? Or maybe you had a medical question and didn't quite know how to phrase it, but the AI serendipitously noticed something and taught you something you wouldn't have figured out otherwise. Something like that, I think, is what I'd be paying a lot more attention to right now.
>> I'm struck by how much these questions we're asking in AI just bring up challenges we've always had as a society, right? I mean, you were alluding to the SAT challenge earlier, and that's certainly an imperfect way of measuring intelligence, but we've never really found much better ways. And similarly, even here, there's been lots of talk about what technology should be optimized for and what we should be trying to improve in people's lives, and it's a hard question to answer, but obviously an ever more important one as these models do get better and are going to hill-climb on whatever it is we are optimizing for.

>> Yeah, exactly. A lot of the way I think about AI, and maybe the worries and consequences of AI, is analogous to the parallels with social media.

>> Totally. Well, Edwin, this has been a fascinating conversation. I want to make sure I leave the last word to you. Where can our listeners go to learn more about you, about Surge, anything else? The mic is yours, wherever you want to point folks.

>> Yeah, so I would definitely suggest our blog. We're starting to blog a lot more, starting to share a lot more insights and analyses. So definitely check that out.

>> Amazing. Well, thanks so much. This was a ton of fun.

>> Thanks so much.