[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
By Latent Space
Summary
Topics Covered
- Diversify Benchmarks Beyond Django
- Unit Tests Fail Long-Horizon Coding
- Include Impossible Tasks to Detect Cheating
- Long Autonomy Risks Industry Irrelevance
- Balance Autonomy with Human-AI Collaboration
Full Transcript
>> We're here at NeurIPS with John Yang of SWE-bench and many other things. Welcome.
>> Thanks so much for having me. Yeah,
really happy to be here.
>> Last year I talked to Ofir, and I think Carlos as well, one of your co-authors.
>> How's SWE-bench doing? Just generally, the project is like one and a half years old.
>> Yeah, I think one and a half years old in terms of when it was actually useful. We put it out in October 2023 and people didn't really touch it too much, and then of course Cognition came on the scene and Devin was an amazing release, and I think after that it kind of kicked off the arms race.
>> Did they tell you beforehand, or did they just show up?
>> You know, I got an email about two weeks ago, I think from Walden. He was like, "Hey, you know, we have a good number on it." I was like, "Wow, congrats. Thanks for using it." And then the release was mind-blowing. I was like, "Wow, these guys did an excellent job."
>> Amazing. And then SWE-bench Verified was maybe last year.
>> That's right. Yeah.
>> Catch us up on this year. You have other languages; there's a whole bunch of varieties of SWE-bench now.
>> Yeah.
>> So, what should people know?
>> Yeah, for sure. I think there are a couple of extensions that have happened. One is more SWE-benches: SWE-bench Pro, SWE-bench Live.
>> Oh, SWE-bench Pro. Was that with you guys? Because it looks independent, it's like different authors.
>> It's completely independent. Yeah.
>> So they just called it SWE-bench Pro without your blessing.
>> Yeah.
>> I think we're okay with it. When it came out, we were like, "Oh, cool. Interesting." It would have been, you know, fun to be part of it. But, I mean, congrats to them. It's a great benchmark.
>> But yeah, multimodal.
>> Yeah, we did multimodal and multilingual. And I think those have... multilingual seems to be...
>> Is it like JavaScript? What else?
>> Yeah. Multilingual is like nine languages across like 40 repos. But yeah, you've got JavaScript, Rust, Java, C, you know, Ruby. Yeah.
>> And then core SWE-bench itself: a lot of people talk about the Django focus.
>> Yes.
>> Is there, I don't know, how do we move past Django?
>> Yeah, for sure. I mean, it's cool to see a lot of the newer benchmarks really try to diversify the repos. In the two follow-ups we did, with multimodal and multilingual, we made it a point to do that. So I think...
>> But you could also just put out SWE-bench 2025 and do a new distribution.
>> That is true. Yeah. So it's been cool to see the follow-ups. I think, quietly, and it's an open question for me, I'm excited to see how people curate the next sets. It's kind of interesting to see in the literature, or in their blog posts, how they justify why they're creating their separate split. The easier ones are like, oh, more languages, more repos. And now I think people are like, well, ours is more difficult because of this curation technique. And I'm excited to see how long that lasts and where we're going to guide the evaluations towards.
>> Yeah. And more recently you're working on Code Clash.
>> Yes, that's right.
>> You've already done other podcasts about it, and I'll refer people to your chat with Andy, but just give people a one-two sentence summary.
>> Yeah, happy to do it, especially on your podcast. It's an honor. So basically, the idea is I don't like unit tests as a form of verification. And I also think there's an issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model submit, it's done, and that's the end of the story, end of the episode. With Code Clash, what we're thinking is: let's try to really evaluate long-horizon development, and development on a codebase that is consequential and conditioned upon what a model did before to that codebase. The general idea is you have two or more language models and they play a programming tournament. What that means is each model maintains its own codebase, and in each round of the tournament they first get to edit and improve their codebase however they see fit, very self-determined, and then, in the competition phase, those codebases are pitted against each other. So the codebases are run, and there's generally an arena (we have a lot of diverse arenas), and the arena determines whether codebase A is better than codebase B, and then you repeat that across multiple rounds.
>> As determined by an LLM judge?
>> Yeah, an LLM judge is definitely one of the mechanisms. We started with some pretty simple programming games. One of the cooler ones is Halite, which...
>> Oh yeah, I played that for Jane Street.
>> Yes, that's right. That's awesome. Halite one, two, three. Michael Truell of Cursor wrote this game.
>> Two Sigma? Jane Street?
>> Yes. Oh, Two Sigma. Two Sigma.
>> I worked at Two Sigma, so I'm like...
>> Oh, there you go. Yeah.
>> This was too long ago.
>> There you go. Yeah, 2016 at this point, but we're bringing it back.
>> Halite is fun. I would say, if you've never done a programmatic competition: you have to control fleets of ships and attack things and defend things and collect resources.
>> Yeah. It's like playing StarCraft, but you code it, right?
>> Yeah. Exactly. Exactly. Yeah. Yeah.
>> A lot of games.
>> Yeah.
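For readers who want the round structure spelled out, here is a minimal sketch of the tournament loop John describes: an edit phase where each model improves its own codebase, then a competition phase where the codebases are pitted against each other in an arena, repeated over rounds. The names here (`Competitor`, the arena object, `improve_codebase`, `run_match`) are hypothetical stand-ins, not Code Clash's actual API.

```python
# Hypothetical sketch of a Code Clash-style tournament loop (not the real API).
from dataclasses import dataclass, field

@dataclass
class Competitor:
    model_name: str                                # the language model driving this competitor
    codebase: dict = field(default_factory=dict)   # path -> file contents
    wins: int = 0

def improve_codebase(competitor: Competitor) -> None:
    """Edit phase: the model edits its own codebase however it sees fit.
    In practice this would be an agent loop (read files, run tests, apply patches)."""
    ...

def run_match(arena, a: Competitor, b: Competitor) -> Competitor:
    """Competition phase: run both codebases in the arena (e.g. a programming game
    like Halite). An LLM judge is one possible scoring mechanism; game outcomes
    are another. Returns the winner."""
    return a if arena.play(a.codebase, b.codebase) == "A" else b

def tournament(arena, a: Competitor, b: Competitor, rounds: int = 10) -> Competitor:
    for _ in range(rounds):
        # 1) Each model improves its own codebase, conditioned on everything
        #    it has already done to that codebase in earlier rounds.
        improve_codebase(a)
        improve_codebase(b)
        # 2) The two codebases are pitted against each other.
        run_match(arena, a, b).wins += 1
    return a if a.wins >= b.wins else b
```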
>> Are there non-games, or do you focus on games?
>> I think that's an excellent point. For the initial release, for scientific purposes, we used existing programming games. The current ongoing effort is to build economically valuable arenas. That's, you know, the popular phrase these days.
>> Yeah, that's a big one this year.
>> Yeah, GDP. Awesome. I mean, I think the big selling point of Terminal-Bench and SWE-bench and these evals is that they're really close to real-world utility, and I think that's achievable for Code Clash too. That's what we're working on.
>> Okay. Yeah.
>> So you're part of a group.
>> Yes.
>> The other students have also been putting out a lot of other stuff. What would you highlight?
>> Yeah, I mean, Ofir is such a prolific mentor when it comes to benchmarking. So there's one on efficiency I really like, in the line of performance optimization.
>> What's that one?
>> Yeah, for sure. The efficiency one was written by a PhD student called Jeffrey Ma, who happened to be my high school classmate, and the idea there is you take a codebase and you just want to make modifications that will literally make the code run faster. So things like parallelization, SIMD operations, stuff like that.
>> So no behavior change, just faster.
>> Exactly. Keep the unit tests passing, but I want better runtime.
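As a toy illustration of that task shape (not an actual instance from the benchmark): the existing unit test has to stay green while the runtime drops, for example by replacing a nested Python loop with a vectorized NumPy implementation.

```python
# Toy example of an "efficiency" task: same behavior, better runtime.
import numpy as np

def pairwise_dists_slow(points: np.ndarray) -> np.ndarray:
    """Original O(n^2) Python-loop implementation."""
    n = len(points)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(((points[i] - points[j]) ** 2).sum())
    return out

def pairwise_dists_fast(points: np.ndarray) -> np.ndarray:
    """Optimized version: same output, vectorized with NumPy broadcasting."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def test_behavior_unchanged():
    pts = np.random.default_rng(0).normal(size=(50, 3))
    # The pre-existing unit test must keep passing after the optimization.
    assert np.allclose(pairwise_dists_slow(pts), pairwise_dists_fast(pts))
```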
>> Yeah. And then there's AlgoTune, which is kind of in line with that. And then there's also pushing along the scientific coding domain. Yeah, exactly, SciCode is awesome. They did like a quick...
>> And for people, the way I explain SciCode is that it's HumanEval but better.
>> Yeah, exactly. I think there's a lot of good stuff these days where, yeah, that's the way to go.
>> Which is, like, SWE-bench is expensive to run. Any agentic benchmark is expensive to run. Actually, you do need some completion benchmarks that just...
>> Complete. Exactly. You can do well on those first and then sort of graduate to the multi-turn, expensive stuff. Yeah.
>> Okay. Other than that, just broadly, other work in the field in 2025 in terms of coding evals: obviously we shout out METR. They use SWE-bench and they have a very interesting, I guess, human-hours-worked number.
>> Yeah, they have the x-axis being sort of the runtime, or, yeah, the y-axis being the completion, you know, like we can do longer-running tasks. I think the projections are quite interesting and I definitely appreciate them using SWE-bench Verified to sort of proxy a lot of these things. But yeah, they're great. Okay.
>> Any other work that caught your eye?
>> Yeah, I mean, I think within the... okay, Terminal-Bench. Yeah, Critical Point was kind of cool.
>> Critical Point?
>> Yeah, it's a very new benchmark that Ofir did, and I think it's kind of related to physics. There's this one called SecBench, kind of related to cybersecurity, which I think is affiliated with... it's just cool to see people really dive into different coding domains. And then, stepping a little bit outside of coding, I personally think it's quite interesting to think about the user-simulator stuff, so like tau-bench and tau2-bench, and Vending-Bench, and...
>> I've got mixed feelings.
>> Yeah. No, I'm interested.
>> Well, I mean, it's like you're sampling one path. I don't know how realistic it is, to be honest. It's just the LLMs, but it is cool.
>> No, for sure. Yeah, I agree. I think it's a good initial effort. To me, it's super cool to see companies, like, I'm sure Mercor and others, focusing on building environments for code and beyond code, and so I think it might be interesting to have work-gym-style stuff. This is stuff my adviser Diyi Yang at Stanford thinks about a lot. So, yeah.
>> Yeah.
>> I just realized we're talking about Terminal-Bench, yes, in front of a lot of folks.
>> Yeah. Yeah.
>> You know, really, really good work overall. Yeah, let's talk about tau-bench, since you mentioned tau-bench.
>> Yes. Yes.
>> There's some discussion, or some people are saying, that tau-bench is impossible to get a high score on because some of the tasks are underspecified or just impossible. I don't know if you're up to speed on that.
>> It's a little bit spicy, yeah. I think I saw... so, you know, I worked with Shunyu and Karthik back at Princeton very closely. I think Karthik just posted a tweet kind of rebutting some of these claims. Yeah, I mean, I get the concern. But I think it also brings up interesting research problems to solve: okay, why is it impossible? Is it the ambiguity? Is it the user simulator that has issues? And I think generally we all agree that we'll improve on these things over time.
>> So I actually really like benchmarks that intentionally... I think we should intentionally include impossible tasks.
>> As a flag. Yeah. Like, hey, you're cheating.
>> Yes. Yeah, it's kind of sad that Karthik is actually defending it, because the master move would be like, "Oh yeah, you caught us." Like, everyone reporting above 75 on tau-bench retail, you'd be cheating.
>> Yeah. Oh, interesting. That would be cool. I mean, yeah, you'd have to ask the tau-bench authors, but that's fun. Yeah, I think there was ImpossibleBench, a recent benchmark, maybe from Anthropic? I don't know. But they basically took SWE-bench Verified and changed the issues to make them impossible, and they checked how often the models would say, "I actually just can't do this, I don't know what's going on."
>> Oh, like for refusals.
>> Yes. Yes. Yes. So,
>> Oh, how did they do?
>> I thought that was interesting. I think the models are all kind of attempting it and saying, like, "Oh, I did it," you know. So, maybe not great.
>> That's cool. But no, that's an important one.
>> Yeah.
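A hedged sketch of the flagging idea floated above: seed the eval with a small subset of tasks known to be unsatisfiable, and treat any claimed success on them as a sign that the verifier was gamed, the harness is broken, or the submission is cheating. The task IDs and field names below are made up for illustration.

```python
# Sketch: flag suspicious submissions using a known-impossible subset.
# `results` maps task_id -> whether the submission was scored as resolved.

IMPOSSIBLE_TASK_IDS = {"retail_042", "retail_077", "airline_013"}  # hypothetical IDs

def audit_submission(results: dict[str, bool]) -> dict:
    impossible_passes = [t for t in IMPOSSIBLE_TASK_IDS if results.get(t, False)]
    real_tasks = [t for t in results if t not in IMPOSSIBLE_TASK_IDS]
    score = sum(results[t] for t in real_tasks) / max(len(real_tasks), 1)
    return {
        "score_on_real_tasks": score,
        "impossible_tasks_passed": impossible_passes,
        # Any "pass" on an impossible task means the verifier was gamed,
        # the harness is broken, or the submission is cheating.
        "suspicious": len(impossible_passes) > 0,
    }
```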
>> How do coding evals evolve next year?
>> Wow, that's a great question. I mean, honestly, I think people will make more SWE-benches. I think Terminal-Bench has really got something going. With SWE-bench you're confined, in some sense, to the domain of issues and PRs that already exist, which has its benefits of being close to reality and natural, but with Terminal-Bench there's a lot of creativity that you can infuse into it. So I would personally be really excited: the 2.0 release was really excellent, and I'd be super excited to see 3.0, 4.0, because of the environments.
>> Yeah, I mean the environments, bringing more people into the fold. I think, correct me if I'm wrong, Mike, but early on you had PhD students, very smart CS people, adding tasks, and what does that look like when you fold in more coding environments for non-coding tasks, non-coding environments in general, and ask people to make stuff there? So that's pretty cool.
And then, of course, for myself, I think this long-running SWE agent kind of thing just feels very compelling. The vision of, hey, I tell it a goal, I don't have to be super specific about my task, and I have a decent verifier that proxies what I want. Something literally like "the codebase that makes the most money in this setting," you know, that's my verifier, and I walk away for five hours. The thing is just running. I'm hanging out with you, talking to my friends. I come back and it gives me literally a SOTA codebase on that task. I think that would be super cool.
>> Okay, I'll push back. We're part-time.
>> Yes.
>> And we are emphasizing a lot of interactivity...
>> Because the point is that you're going to underspecify.
>> Right. Right.
>> And actually what people want is back and forth, back and forth, on a really fast time frame, which is terrible for a benchmark author.
>> Right. Because how do you do that? Yeah.
>> But, but realistic.
>> Yeah. So I think this is where I'm a little bit anxious, or cautious, about this push for long autonomy, right? I mean, let's say this time next year, we'll have... five hours is pessimistic, it'll be 24...
>> Long. Yeah. Right. Days.
>> But I don't know if that actually materially changes the industry.
>> So we'll push it, like, as evals. You know, we have the people who make evals here.
>> Yeah.
>> Yeah, we push the industry in ways that we want to push it, but I don't know if that's a productive way, because that's more of a stunt. Yeah, it's a proof of concept, an existence proof that it can be done.
>> Yeah.
>> But will you use it for real life?
>> Yeah. I mean, honestly, to me, I think there's potentially room for growth, so I would actually agree with your take here. With my lab at Stanford, with Diyi, her emphasis is on human-AI collaboration, and so I definitely don't believe in this idea of just getting rid of the human. But maybe it's about finding the balance. Just because the developer ecosystem is so diverse, and there are so many participants in it who want different things out of it, it's about enabling different levels of abstraction. And it depends on the task. There are settings where you want to be more involved and more hands-on, and so you want to use Windsurf for that. But then maybe there's this general data-processing thing, it's just a lot of JSON parsing you don't really care about, and that's the one I kind of want to walk away from and just let it figure it out. So yeah, I would agree with you generally.
>> Yeah. Amazing. Any calls to action? What do you want help on? How can people find more of your work?
>> Definitely, for the call to action: I'm super jealous of all the great data that Cognition and, you know, Cursor get. That user-interaction data is really fascinating. From an academic standpoint, it feels like there are two difficult approaches to resolving that. Either you build a really compelling product, like LMArena, that people use consistently, which is really tricky in and of itself, or you build really good user simulators that try to mimic those settings. But that is also non-trivial. I don't think it's as simple as, "Hey, ChatGPT, act like a human," right?
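For flavor, a minimal sketch of the user-simulator setup being described: an LLM plays a user with a hidden goal and chats with the agent under test, tau-bench style. `chat` is a placeholder for whatever LLM client you would actually use, and, as noted above, prompting alone is probably not enough to make the simulated user realistic.

```python
# Minimal sketch of an LLM user simulator driving an agent under test.
# `chat(system, messages)` is a placeholder for your actual LLM client call.

def chat(system: str, messages: list[dict]) -> str:
    raise NotImplementedError("plug in a real LLM client here")

USER_SIM_SYSTEM = (
    "You are simulating a software developer. Your hidden goal: {goal}. "
    "Reveal requirements only when asked, be vague at first, and say DONE "
    "when the agent has satisfied the goal."
)

def simulate_session(agent, goal: str, max_turns: int = 20) -> list[dict]:
    transcript: list[dict] = []
    user_msg = "Hey, I need some help with my project."
    for _ in range(max_turns):
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = agent(transcript)                     # agent under test replies
        transcript.append({"role": "assistant", "content": agent_msg})
        # The simulated user responds, conditioned on its hidden goal.
        user_msg = chat(USER_SIM_SYSTEM.format(goal=goal), transcript)
        if "DONE" in user_msg:
            break
    return transcript  # score afterwards: did the agent actually satisfy `goal`?
```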
>> Yeah. So it would be really cool to get inspiration for what exactly that data looks like, or, between the two, what the best way is to scale up evaluating human-AI interaction. And then, for visibility, for my own work: we're pushing more arenas. For Code Clash, the current framing is really long-running SWE agents, but what I'm excited about is that you could have multi-agents, like two agents working together on the codebase, and what happens when you have a human and an agent working on the codebase versus just AIs? What happens there? When the models improve, and hopefully they hill-climb and become better at digesting logs and iterating on analysis, how does human-AI interaction change with model capability? And so I'm hoping, I'm trying to inspire and convince people, that it's a very cool test bed where you can do a lot of different combinations of human and AI on different arenas, playing one arena at a time or multiple arenas at a time.
>> Yeah, I think we'd be very interested to work with you on the interaction stuff.
>> Oh, that would be awesome.
>> And then one more thing I'll add: Cognition is going to be pushing a lot of codebase understanding, which is kind of codebase retrieval plus.
>> Yes.
>> And mostly it is helping humans understand their own codebases better, to enable humans, or to sort of mind-meld the human with the machine, to do the hardest possible tasks that the LLM could not do alone and humans couldn't do alone. And then the other thing is basically automated context engineering for an LLM. So that is sort of a research sub-agent that we're working on.
>> That's so awesome. Yeah.
>> So I don't know what the benchmark would be, because how do you benchmark understanding?
>> That is true.
>> Apart from, I think, yeah, it's mostly like you freeze a repo, have some manually curated answers, and then pose trivia questions, and that's very easy to saturate. So I don't know how else to do it.
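A minimal sketch of the "freeze a repo, curate answers, pose trivia questions" style of codebase-understanding eval described here, which also shows why it saturates: it reduces to matching against a fixed answer key. The file format and function names are illustrative, not any existing benchmark's.

```python
# Sketch: trivia-style codebase-understanding eval over a frozen repo snapshot.
import json

def load_questions(path: str) -> list[dict]:
    # Each item: {"question": ..., "answer": ..., "repo_commit": ...}
    with open(path) as f:
        return json.load(f)

def evaluate(model_answer_fn, questions: list[dict], judge_fn) -> float:
    """model_answer_fn(question, repo_commit) -> str; judge_fn decides correctness
    (string match or an LLM judge). Easy to saturate once the answer key leaks or
    the question style becomes predictable."""
    correct = 0
    for q in questions:
        pred = model_answer_fn(q["question"], q["repo_commit"])
        correct += bool(judge_fn(pred, q["answer"]))
    return correct / max(len(questions), 1)
```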
>> Yeah, I think Silas tweeted a while ago sort of the wiki, the code wikis. That's incredible. I mean, I use...
>> Google actually just came out with their own version.
>> Oh yeah, with the Anti-Gravity people?
>> No, no, this is a separate team.
>> Gotcha. Gotcha.
>> But cool. That's the state of code evals.
>> Yep.