SWE-Bench Verified is Contaminated: What Comes Next — with OpenAI Frontier Evals team

By Latent Space

Summary

Topics Covered

SWE-bench Verified Saturated and Contaminated
Unfair Tests Trap Smart Agents
SuperBench Pro Resets Coding Leaderboards
Evals Must Capture Design Taste
Track Real-World AI Impact Metrics

Full Transcript

Okay. Hi. We're here in the OPI studio with Mia and Olivia from the Frontier EVELS team or uh however you want to introduce yourself. Maybe maybe you want

introduce yourself. Maybe maybe you want to introduce uh name what you do at OpenAI and we can get it started.

>> I'm sure. Hi, I'm Olivia. I'm on the Frontier Bells team.

>> Are you sure?

>> Great.

>> Hi, I'm Mia. I am a VP of research at OpenAI. Um and my team are the codex

OpenAI. Um and my team are the codex team, the humid data team and uh the alignment team and we work a lot with Olivia's team on Frontier Eva.

>> Yeah. Uh very exciting. And as by my understanding, you were part of the original team that worked on CX verified as well.

>> Yeah. Uh Olivia's team, the Frontier Evad team and the human data team collaborated on creating three verified.

So you've you've seen the evolution of coding benchmarks over time and I uh I think it was round about sort of mid to late 2024 when you first cover verified things have evolved a lot since then

what's the blog post that you have worked on that you're that we're releasing today like what what is the sort of con what's the main thesis that you're pushing out >> so the main thesis is that sweeten

verified has been one of the northstar coding benchmarks that the field has looked at to measure coding progress u but recently we've seen that progress has kind of stalled and this we realize that this is because the eval is

effectively saturated and also highly contaminated. So at this point we think

contaminated. So at this point we think that um it's not really measuring coding performance improvements well anymore and we think that the field should move away from this towards other benchmarks >> like superb pro >> like super bench pro.

>> Yeah.

>> Amazing. Yeah. I one of the jokes I always have is like there's a group chat with all the labs and everyone just takes turns to increment like 0.1 on trucks and then it's like okay well you

have the best coding model I guess because you're 0.1% higher but it's not super convincing at this point at all.

>> Yeah.

>> So cool. I think the uh let's let's sort of reset on like what was the original work that you guys did for CS verified which I think was pretty substantial like it was like a very significant investment from open AI which like

people still don't appreciate and then what were the satisfactions that we that we found over time right so like what what was sweet bench verified should should um that people should know about

>> sweeten verified was a kind of a cleanup of original bench academic benchmark from a lab at Princeton called sweet bench and The agent is basically given a codebase and a task that was sourced

from a real world uh repository and GitHub issue and was asked to solve the task and is graded on whether some tests pass. And at the time this was uh

pass. And at the time this was uh quickly became a popular benchmark because at the time the field didn't really have good real world coding benchmarks. But then when OpenAI took a

benchmarks. But then when OpenAI took a look at the benchmark as part of one of the evals we wanted to track in our preparedness framework, folks uh started realizing that some of the cases where

agents were failing were due to bad problem setups rather than just to models being dumb. So books open did a pretty extensive human data campaign

hiring like almost hundred um real world uh software engineers to go through um the problems and figure out like are the tasks well specified? Are the tests actually fair and kind of created a

curated set of like 500 tasks that we thought were much better.

>> It's just it's maybe it's hard to to overstate like the amount of effort that it took to like create that benchmark.

There's literally like many expert software engineers reviewing the problems like sequentially multiple

times uh and to you know basically like three different experts independently decided it >> yeah you didn't have to do that you just tripled your cost for just

>> I mean we had to do it we had to do it actually because it's quite a hard task to like look at something like a problem and and the patch and then like it's not just the problem and the patch you have to like understand it in the context of

the codebase that the human or the model is in to solve the task. So it's a very complex problem and it was definitely needed to have three reviews and I think

like maybe we should have done more but uh it was definitely uh a lot of effort to get there.

>> Yeah. Um and there's there's more but people can read the the blog post for that. I will note that you guys had a

that. I will note that you guys had a trend in verifying benchmarks because I just recently saw I think Quinn had a hle verified for humanities license verified which like so now everyone's

verifying everything which is nice and good and like extra quality there. Um,

okay. So, but I think that the meat of it is that this this was a lot of like, well, here's the issue or problem statement and then here's the here's the divs, here's the golden uh tests and here's some regression tests, right?

That's that's like the rough setup of these 500 problems. Um, and there's some contamination always happens because all the measure was fully open. I think um you you did have canaries but like you

know stuff stuff leaks. There's like

multiple avenues, but like the problems are sourced from open-source repos. Yes.

>> So it's not just like when we usually publish evaluations, we publish evaluations and then we we add canary strings to ensure that you know they are easily filled out at training time.

Obviously if you use sort of like data from like open source >> GitHub >> you don't have actually like a Canary string and >> and you >> and these are also like some of these

are very popular repos like the Django repository so you're going to see like many instances being used kind of throughout GitHub.

>> Yeah. Yeah. You uh just before recording you were telling me that you found this in your own chain of thought with the G 5.2 also seeing that like they had extra knowledge or something.

>> Uh yes. So this was an example where the task um asked the agent to uh implement something but it wasn't told that there was this specific argument that the test was going to be looking for it using but

in the GPD 5.2 train of thought we actually saw instances of the model reasoning like hey I think that at some later version of this repository that implemented this particular argument maybe I should add it in. So this is an example of a test that like would be

pretty impossible to pass without this contamination knowledge.

>> Yeah. And I think you found that sort of force right and it triggered like a whole investigation both like in our own mods and also in uh other frontier models like in the market and like

understanding how contaminated the benchmark is like across the industry.

>> What else did you find? I mean that's [laughter] I have to double click on this.

>> So we when I say we this is mostly from other folks our team not Yes. Um but so we did um some analysis

Yes. Um but so we did um some analysis on first of all are the tests actually fair and so this happened by first um

taking all the problems that 03 couldn't solve reliably and then again uh getting a lot of humans to do basically another pass of uh kind of digging into you know what's wrong.

>> Is it the same exact analysis or were they reading 03's output and going here's where 03 went wrong? I think it was I mean it was definitely like a scope to the set of problems that models

failed and I believe they were able to look at like what the model solutions look like versus what >> so this isn't the same work as the original >> it's not exactly the same work it was like a a deeper dive it's like okay

which are the problems that we don't see any murder solving is like there's something fundamentally wrong with those problems or is there something you know wrong with like are the model just not

smart enough to solve the problems so that's kind of like what we what we dug into Yeah. And you found some.

into Yeah. And you found some.

>> Oh yes. Like in over half of the problems that were investigated in that deep dive, there was one problem or the other. I think the most common problem

other. I think the most common problem are like overly narrow tests where there's some particular implementation detail that um the tests were looking for but wasn't specified in the problem

description. So it wasn't fair to expect

description. So it wasn't fair to expect that model to make that particular design choice. Like one pretty blatant

design choice. Like one pretty blatant example are cases where the task asks you to implement some feature um then the tests are looking for you naming that argument or that function with a particular name but if you made chose

another reasonable name the test would fail.

>> Yeah.

>> And another set of um types of embed tests are tests that are just looking for additional features that were never mentioned in the problem description.

>> But that is significant is like that means that if you pass a test actually like you probably did like a really good job. But just because you didn't pass a

job. But just because you didn't pass a FET test doesn't mean that your implementation wasn't like a good one, right? So it was just like we only

right? So it was just like we only accept like very narrow versions of solutions and like not the whole space of >> of like viable and sort of like good

solutions to the problem.

>> Yeah, I think it's important that you're doing this because it in some way it is you in 2025 six going back in time and correcting your own work, right? because

you could have caught all this in in the original verified work.

>> I think so. It's definitely much harder to find a problem in the abstract than when you're looking at a very smart agent's best effort solution and trying to compare it.

>> It is harder or harder or >> it's much easier when you have exactly I think I think also like at the time when we bench verified was published I think it was like a very strong benchmark.

It's not like we're we're like oh this is not this wasn't like a strong benchmark at the time. I think this is something that a lot of benchmarks go through like as an evolution, right?

like when they start to become like popular and like viable, it's because they measure something like important and mods maybe do like 20% correct on them, sometimes even less and sort of

like people have something to hold on and and improve models on on these benchmarks and by the time that you hit like very high performance on the benchmarks like additional like.1%

improvements become sort of like meaningless and so like at the time I think you know that benchmark was like super valuable and it it taught like us and like the industry a lot. It's just

like now at the point that we are now where audits are as strong as they are now we're kind of starting to measure not necessarily like what we want to measure which is like coding capability

of our agents but like the agents ability to like correctly guess how to name a specific function >> and and that isn't really what we are like want to measure at this point.

>> Yeah. I I think that's fair. Is there I mean if I if I asked you to ballpark it like most models are most frontier models are now like 80 something. Is

there like what's the actual like number on Bench verified that you did you guess as like the ceiling or >> I guess that's really hard to say like I

uh when JDY 5.2 came out um folks took a look and found that it was solving like 31 problems that were in the set of should be very hard to solve without contamination problems. So I think it's quite possible that that number is

already something that we've hit if you didn't have contamination at all.

>> Fair enough.

>> Hard to say though.

>> Yeah. U cool. We're going to stop reporting CBench verified, right? And

then uh Subbench Pro will be sort of the next one which is an effort from scale.

What's your sort of comparison analysis?

What's what attracts you to subbench pro?

>> Uh the first one I think is just that it's harder. Um for sweet bench verified

it's harder. Um for sweet bench verified uh I think something like 90% of the problems are things that were estimated to take like an expert software engineer um like less than an hour. They're like

very well specified, very self-contained, and the um Subage Pro problems are just um bigger and harder and there's much more headroom on the EVL because it's not saturated.

>> Yeah.

>> Like categories of like one to four hours and four class.

>> Yeah. And it's more diverse. Um lots of repositories, multiple languages, qualitatively more different types of problems. So all that's great. Uh on the contamination side, we also think it's better there. So the way we uh were

better there. So the way we uh were measuring for contamination first verified was uh with this little like contamination auditor agent which is given the description of the task and

the patch and the task ID and told to go take this target model and kind of as an open-ended like set of questions try to uh find questions that will manage to kind of reveal what contamination might

be lurking in that model. And in sweet bench verified we found many instances of contamination across um like across open eye models across like quad uh open

4.5 Gemini flash and in all of these we saw things like regurgitating the ground truth solutions things like uh in some cases um giving like the task ids and

other things that are a pretty clear evidence of at minimum familiarity with the repositories.

>> Yeah.

>> Um so I mean oh task ID that [laughter] Yeah. Uh and Pearl on the other hand, uh

Yeah. Uh and Pearl on the other hand, uh we don't see this. I think there uh the auto agent found some like very light evidence that maybe a couple models might be very lightly familiar with like

one or two of the source repositories, but it's very different than Swen verified. Um so contamination is good. I

verified. Um so contamination is good. I

think there also like we should expect that at some point like that that's not going to be like the right benchmark anymore and like as a field we kind of have to continue to like move on and

like find harder and more representative problems that we can launch our capabilities on.

>> Awesome. So let's go into that. Uh I

think that there are a lot of I think we also pre in the in the pre-hat was well people feel qualitative difference when they're using 5.1 to 5.2 2 to 5.3 and it's not super expressed in these

benchmarks because they there are on a number of these things. What

capabilities do you really want to benchmark in a in a ideal coding benchmark you know I guess agent coding benchmark whatever you call it. Uh

>> I mean one thing is uh kind of open-ended design decisions places where the problem maybe is a little bit underspecified and seeing if the model can make reasonable design decisions.

>> What's a reasonable prompt for that?

like this five code me uh uh B2B SAS to make no mistakes or you know like that's that's the meme but like okay what's like what's like an actual usable uh open-ended problem like that like

>> oh sure I mean maybe an example could be finding a way to speed up a particular part of a codebase but there might be multiple different ways to >> yeah there are dedicated performance benchmarks I think you guys have efficiency or is that is that I don't

know I think that's that's of Harris's group >> um but yeah yeah I mean that that that is a good one >> I think There's just many many things that people

like value about working with with software engineering agents. They think

three swbench verified obviously measured like some they measures like some important capability which is like given like a description of a GitHub

issue can you produce like a patch that solves that issue you know satisfactory and like obviously there's like some issues with the with the benchmark that

means that now that we're at like 80% we don't really trust like further improvements on it but like it does measure something that is like a real real uh like capability of models but I

think as a field we're like moving beyond sort of you know can my coding agent like solve a small like GitHub issue for me right and so we're starting

to look at like much more longer term tasks right like that don't take like 15 minutes but maybe like hour sometimes days and then beyond sort of like what

kind of tasks can my agent solve like there might be things that are kind of a bit harder to grasp, right? Like Olivia

talked about sort of like does it have like design taste, right? Like does it solve the problem the way that you know my team likes to solve problems?

Is the code nice, right? Like is it is it well written? Like is it sort of like clean code, right? Like people care about these is it maintainable in the

future? people care about a lot of these

future? people care about a lot of these maybe less tangible um less tangible and like harder to measure frankly things that that are

still like super meaningful for people that are working with coding agents.

>> Yeah. So um I mean these are all qualities that are obviously the no longer the low hanging fruit like we have no idea how to eval the the simple question maybe the the there's sort of

two forks in the road. One is the sort of very human intensive, money intensive path, which is hire a bunch of contractors and try to annotate this.

The other is use an LLM to to proxy it and try to align the LM so that it can give you a reasonable proxy. Which of

those would you want? I what you want to do both?

>> I think like maybe you should talk about GDP law as like an example.

>> Um, sure. So GDP bell is an EVEL that um was again produced by a collaboration between human data team and the front of EVEL's team and it's trying to measure

whether agents can do kind of a variety of like real world white collar work. Um

that was an eval where grading is very hard requires kind of a lot of kind of domain knowledge on exactly what um are you looking for in each different context.

>> Yeah. across like 15 16 white collar jobs uh professions like I that take up a significant part of GDP which >> kind of high level professions and then a lot of like different granular subs

>> uh I have said like I'm a big fan uh it's it it is this is the eval for AGI basically >> but part of because it was um so hard to

it required so much kind of like domain knowledge um uh the human data team hired like a lot of um people from these professions to be very involved involved in creating tasks and creating the gold

solutions and trying to um help create rubrics and so forth so we can create lively.

>> So basically take the GDP val which is a generalist thing take that same approach to apply it to code and you roughly have like a rough road.

>> I think it's an interesting solution. I

think what you're pointing out as an important problem which is sort of this this like how realistic is it and like do you know what what we want to do is

like coding agents should write code that you know we think is good and so it's like asking human that's actually like a good way to ensure that it's also

kind of a slower like complex way to do that and so part of why I think you know surge verified ended up being super popular and where we are seeing like ala benchmarks like this being super

popular. It's like it's very easy. It

popular. It's like it's very easy. It

could even be easier, but like validating that a solution passes all the tests. It's like pretty trivial once

the tests. It's like pretty trivial once you can like run the tests in in like your on your computer or wherever you're running them and you can kind of like okay is it correct or is it not correct and you can kind of aggregate that and

that it's super simple but it doesn't tell you it's like you know did the like solve the problem like wow like you know ugly like would have actually like an open source maintainer of that project

have like merge that PR like that it doesn't tell you but there is a lot of value in having benchmarks that are both like easy to compare across the industry

and also that can be sort of run really fast without human involvement.

>> Yeah. Amazing. Uh your teams also put out other kinds of evals that are related like the I think there's an RL paper bench. Um and then sort of like

paper bench. Um and then sort of like the more sort of recursive self-improvement type uh evals. How much

should that figure into mainstream coding evals? you know like is there is

coding evals? you know like is there is there some way in which those things join together?

>> Sorry you're asking like should we build should we also be building evals for the self-improvement evals are you saying do coding evals currently cover that mind?

Um, I think I I I just think like those are some of the most advanced evals that we have and we're not using them in the normal path and it's just it's an

interesting uh split between well here's evals for coding normal things and then here's the one for machine learning that is like completely different right I think you get what I mean uh and that's

mostly a safety argument I guess uh but also like it's actually really useful for people to understand if the the model is really good at like AI I code basically.

>> Yeah. Oh yeah. Like my guess is that part of the reason that a lot of benchmarks so far haven't focused as much on the AI coding is just a question of like what data sets are easy to gather because a lot of the like you

know state-of-the-art um AI code bases are proprietary. So um if we make emails

are proprietary. So um if we make emails for that like we're probably not going to release them and it's harder for people in the field to make that kind of measure like is this a realistic research coding workflow. I do think

that it's good for the field to try to measure these skills in a public way. I

think it's just harder to make it realistic. card. And then one more thing

realistic. card. And then one more thing that a lot of people are trying to do which is like sort of well in instead of like a percentage of 0 to 100 maybe we redenominate in dollars right so you had

freelancer and all that uh other people are doing like vending bench whatever any any alpha in those or are they are they you still want like a traditional academic benchmark

>> I think [clears throat] in a way like there's like different ways to measure the same thing right if we're like oh this is like how much money it produces it's a fairly similar thing to saying like, "Oh, this problem would take like

a human, you know, two hours to solve or something like that." Usually, they're like fairly like correlated, right? Like

however, you know, much it would take like a human to solve that problem kind of determines the value that we ascribe like a solution. And so I I do think

that is like an important thing is like how complex and how sort of long running are the tasks that we are like able to entrust our agents with.

>> Yeah.

>> And so I think that that's like an important piece. But I I think here sort

important piece. But I I think here sort of monetary value, time or complexity, they all kind of like try to capture like a similar thing.

>> Yeah. Okay. So they're they're all proxies for some amount of increasing capacity that we want to measure. I

think that's a good thing. I think the only other sort of major player in this field is meter which has done the sort the long grass and congrats you guys have completely destroyed the curve for that. Any takes on that? Obviously

that. Any takes on that? Obviously

you've come up really well. So like it looks good but uh I don't know if like that approach is something that you want to incorporate in your own work making evals. Uh this is the long autonomy test

evals. Uh this is the long autonomy test uh eval if you're >> Yeah. I know we're from here and and and

>> Yeah. I know we're from here and and and we we we work with with me on these evaluations. So like we we do appreciate

evaluations. So like we we do appreciate them. I think then they're using time,

them. I think then they're using time, right? They're not using money. So I

right? They're not using money. So I

think like that was your question. I

think like complexity, however we can sort of like quantify it, is really important to understand like where our markets are are are getting to.

>> Okay. Complexity is the abstract thing and then it projects down to time, projects down to story points, whatever, dollars. Uh great. Um, one last question

dollars. Uh great. Um, one last question on just like just the overall preparedness framework is uh, you know, I was actually kind of looking at people mention the preparedness framework a lot. Uh, I don't think it's well

lot. Uh, I don't think it's well explained to a lot of people. Uh, and

you actually have a nice website where it's like uh um I think it's like test and like yeah inform and uh teach something and I I feel like you you actually do a lot of work there and and

I don't know if you want to talk about how the preparedness framework applies.

>> So the preparedness framework is uh open kind of like public framework for how we track uh frontier risk. So these are kind of capabilities that are typically dual use. Like you can use them for good

dual use. Like you can use them for good things or bad things, but we want to at least keep an eye out for the bad things to make sure that we have both we as a company and like the broader society are kind of prepared to handle the potential

downsides. And so at the moment we kind

downsides. And so at the moment we kind of track three different categories. Um

one is kind of uh biorisk, another is cyber security and a third is kind of uh research automation and model autonomy.

And that's kind of what ties most into the bench where, uh, coding is not all of automating research, but it is one very important key component. And so,

uh, we initially created sweepbench verified as part of, uh, like building out evals for that model autonomy workstream. Uh, and now, uh, I think,

workstream. Uh, and now, uh, I think, uh, we're like we have to move beyond that towards looking more at like can models actually start to actually automate research workflows.

>> Yeah. Amazing. Great. Am I anything else to add on just the general what people should know about preparedness and how eval and human data alignment all work together at that?

>> I think maybe the thing that I would say is that we really appreciate we we work really hard to build these events and and so we that's where we published

verified and that's where we're like sharing GDP these sorts of things. We

also deeply appreciate like other people and the entire field to kind of build eval and and share them and reuse them like three bench pro like yes that that's a better ev now we should use

them. So would really encourage people

them. So would really encourage people to find more ways to create and share evid that we can we and the entire field

can use to measure like progress on on on like a variety of capabilities including including encoding because it's important to understand so where we are.

>> Mia had to leave uh but we're we're just kind of uh talking a little bit about like the future directions that we want EVAs to go. Mhm.

>> Um and I I think here here we can dive in on like give us work good work on these these these things we'll talk to you know uh here's your platform to make

a call for what you're looking for. I

think a few things that would be useful I'd say first of all really really hard tasks like the kinds of things that would take top-notch engineers months or

teams weeks um would be quite good um especially if grading is reliable and grading has like you know you have for example like rubrics that have been sourced and validated by many people in

the field I think that would be quite valuable I think also benchmarks on kind of creating products end to end I think as people are byputting more that would be quite useful I think uh a third thing I'd say that is maybe not quite an evil

but I think is still relevant to the kind of overall mission of like we as a field and as a world should be tracking like where are these capabilities going

I'd like to see more metrics um tracking like real world usage like how much is AI actually being used in the field um how much is it you know replacing people's jobs how much is it um you know

augmenting people speeding people up just like real world metrics yeah >> yeah the the replacement thing is always like a sensitive on on on the sort of PR side of things, but uh you know we create new jobs that that manage the old

jobs and that's how it is. Um yourself

like uh you know I think in terms of the frontier evals that that OBI is really going to excited to push like you you put out really good work every single time. Um what should people expect from

time. Um what should people expect from from OPI itself?

>> I'm not sure I could say what we're going to >> general directions.

>> I mean a general directions I think looking at real world impact like real world um real through you know whatever >> that kind of stuff.

>> Yeah.

>> Yeah. Amazing. Um okay. Well, I'm

excited for Mobile World Impact. I think

you guys have, you know, really made a lot of progress and I think taken a lot of industry leadership for Cbench verified and and now moving on to Cbench Pro. So, thank you for doing this. Thank

Pro. So, thank you for doing this. Thank

you for being so transparent. Uh and I think people will respond in kind.

>> Yeah.

>> Yeah. Thanks for your time.

>> Thank you.

[music] >> [music]

Loading...

Loading video analysis...