The Open Source AI Model Beating GPT-5 on Agentic Performance

By The AI Daily Brief: Artificial Intelligence News

Summary

## Key takeaways

- **Kimi K2 Thinking Outperforms GPT-5**: China's new open-source model, Kimi K2 Thinking, has reportedly surpassed GPT-5 and Claude 4.5 on key benchmarks like Humanity's Last Exam and agentic search. [03:25]
- **Cost-Effective Local AI Deployment**: Kimi K2 Thinking can run on consumer hardware like Mac M3 Ultras for significantly less cost than traditional large-scale training, making advanced AI more accessible. [03:56], [08:48]
- **Agentic Workflow Breakthrough**: The model demonstrates advanced agentic capabilities, performing up to 300 sequential tool calls without human intervention, positioning it ahead of many Western frontier models. [04:19]
- **Open Source 'Year of Open Weights' Predicted**: The rapid progress and accessibility of models like Kimi K2 are fueling predictions that 2026 will be the 'Year of Open Weights,' with increased competition from US labs. [12:13]
- **Silicon Valley Adopts Chinese Models**: Despite geopolitical tensions, Silicon Valley startups are reportedly switching major workflows to Chinese models like Kimi K2 and Qwen 3 due to their lower cost and strong performance. [09:21]

Topics Covered

  • China's AI models challenge US dominance
  • Kimi K2: A turning point for open-source AI?
  • Cost-effective AI models are democratizing access
  • Chinese AI models excel in agentic workflows
  • The open-source lag is shrinking rapidly

Full Transcript

Welcome back to the AI daily brief.

Today we are once again talking about

another Chinese open-source model that

is really changing people's sense of

what is possible in the field of AI

today. Now to put this model release in

some proper context, we have to go back

to January. It is now coming up towards

the end of the year. And of course, this

is the time when I start to plan out my

end-of-year coverage, which is a big time

for reflecting on the year that has

passed and what's to come. And any

end-of-year big story recap is inevitably

going to kick off with the big story

from January, which was of course the

release of DeepSeek. When Chinese lab

DeepSeek dropped their reasoning

model, it caused an absolute tizzy in

the AI industry that even sent stocks

reeling. Now, there were three big

reasons that DeepSeek was such a big

deal. The first was that it totally

changed people's perception of how far

behind the US China really was. Up until

that point, people were working on the

assumption that when it came to model

development, China was meaningfully

behind the US. And DeepSeek seemed to

suggest that wasn't true. The second big

reason for concern, and the one behind

the big stock wobble, was that at the

time it had appeared that they had

achieved those results at significantly

lower costs than big US training runs.

This made everyone question the

incredible amount of resources being

spent on the data center buildout. The

third reason DeepSeek was such a big

deal was more on the consumer side when

they released their R1 reasoning model.

The chatbot app that housed it actually

dethroned ChatGPT to become the number

one downloaded free app on Apple's app

store for iPhone. Now what was

interesting about this was that DeepSeek

was not the first company to release a

reasoning model. At that point OpenAI's

o1 had been available for a number of

months. The difference was that DeepSeek

made it available for free, meaning that

for most people it was their first

experience with a reasoning model, which

of course, if you've ever experienced

the jump from a non-reasoning to a

reasoning model, is just a fundamentally

different LLM experience. So, this is

what kicked off the year and set the

tone for a number of different

conversations that we'd be having

throughout the year. Now, more recently,

the whole China element of this story

has heated back up in a big way. Nvidia

CEO Jensen Huang recently said in very

stark terms that he believed that China

would win the AI race because of their

disposition towards it. And even though,

by the way, all these outlets are

reporting that he backtracked, for my

money, the backtrack was kind of more

just a reaffirmation of what he was

saying while trying to present a

slightly more positive spin like the US

still had a chance. Along with the rise

in AI skepticism among market investors,

there has also been a surge in the idea

that China isn't building as many data

centers and that perhaps the US is

overbuilding. And investor Gordon

Johnson went viral with a tweet that

said, "Question for the AI bulls. The US

currently has around 5,426 data centers

and is investing billions to build more.

China has around 449 data centers and is

not adding. If AI is real, why isn't

China building thousands of data centers

every month, which they could clearly

do?" SemiAnalysis's Dylan Patel

responded, "Where did you get the idea

that they aren't adding? Not as much as

the US, but China has thousands of data

centers and are building many more. Your

data source sucks." Now the substance

here is less important than the

narrative and the fact that once again

China's actions become the big foil for

the US's and this is the setup into

which the new Kimi K2 Thinking model

was released. The new model was released

by Moonshot last Thursday with claims of

outperformance on major benchmarks. The

model purportedly leads both GPT-5 and

Claude Sonnet 4.5 on Humanity's Last

Exam, which is a general knowledge test;

on BrowseComp, which is a test of

agentic search; and Seal-0, which is a

test of the ability to collect real

world data. The model lags slightly on

major coding benchmarks like SWE-bench

Verified, but not by much. Deedy Das of

Menlo Ventures wrote, "Today is a turning

point in AI: a Chinese open-source model

is number one. Kimi K2 Thinking scored

51% on Humanity's Last Exam, higher than

GPT-5 and every other model. 60 cents per

million input tokens and $2.50 per million

output tokens. The best at writing, and it

does 15 tokens per second on two Mac M3

Ultras. Seminal moment in AI." In other

words, the point that DD is making here

is that in addition to performing well,

it's doing so cheaply and in a way

that's efficient enough that people

could run it on their own hardware. Now,

in addition to scorching the benchmarks,

Moonshot claimed the model is capable of

200 to 300 sequential tool calls without

human interference. If that's true, it

would make it incredibly capable for

agentic workflows. Frankly, head and

shoulders above many of the Western

Frontier models. Indeed, according to

independent testing from Artificial

Analysis, Kimi is now ranked ahead of

GPT-5, Claude Sonnet 4.5, and Grok 4 on

agentic tool use, and there's a fairly

significant gap. Some, like Dan Mac,

suggested that this might be enough to

delay the release of the next generation

of models as the Frontier Labs go back

to the drawing board. Referencing that

same recent quote that we were just

talking about from Jensen Huang, the one

where he said that Chinese AI is

nanoseconds behind America. Dan wrote,

"Jensen is right. Look at Kimi K2

Thinking. Watch for delayed releases of

Gemini 3, Opus 4.5, and GPT-5.1. Delays

signal they are not clearly better or

cheaper than Kimi K2 Thinking." "That is

evidence that the USA is indeed falling

behind in the race," said Machina. "Kimi

K2 beating Gemini 3 would be... well,

'humiliating' doesn't even cover it.

Think about what Google has: decades of

data, the best talent money can buy,

infrastructure that runs the internet.

And they're sweating a smaller team's

model. That's not supposed to happen in

tech. The big guy always wins. Maybe not

this time, though."

people excited is that the model is open

source. So, people were running their

own tests over the weekend. Pietro

Schirano, the CEO at Magic Path AI, wrote,

"Kimi K2 Thinking is incredible. So, I

built an agent to test it out: Kimi

Writer. It can generate a full novel

from one prompt, running up to 300 tool

requests per session. Here it is,

creating an entire book, a collection of

15 short sci-fi stories." Alexi gave the

model the task of balancing nine eggs, a

book, a laptop, an empty plastic bottle,

and a nail to try out its reasoning. The

model came up with a counterintuitive

solution of arranging the eggs to

support the book as the starting point,

then adding the book, laptop, bottle,

and the nail in turn. Alexi remarked,

"Kimi K2 Thinking is the only modern

reasoning model in recent memory that

provided a human solution to this on the

first try." Now, another big shift here

is that Chinese models are now right

there with the US models on coding. AI

coding has been the breakout killer use

case for this year and frankly that's

probably been something of a comfort for

the western companies as this is one

area where they've continued to maintain

something of a lead. At the beginning of

the year Claude 3.5 Sonnet was the premier

model with no close competitor. Since

then, later versions of Claude, GPT-5, Gemini

2.5 Pro, and Grok 4 have all vied for the top

of the leaderboards and API credits from

developers. Increasingly though, Chinese

models are catching up, if not to the

absolute state-of-the-art, at least

presenting a very compelling

cost-to-value trade-off. Kimi K2 Thinking is

clearly better at coding than Claude 3.5

Sonnet, the model that everyone was

using just a few months ago, and it's

being served at a fraction of the cost.
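To make that cost point concrete, here's a back-of-the-envelope sketch using the per-token rates quoted earlier in the episode ($0.60 per million input tokens, $2.50 per million output tokens); the request sizes are hypothetical examples, not anything from the show:

```python
# Back-of-envelope API cost at the rates quoted for Kimi K2 Thinking.
INPUT_PER_MILLION = 0.60   # dollars per million input tokens (quoted)
OUTPUT_PER_MILLION = 2.50  # dollars per million output tokens (quoted)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API request at the quoted rates."""
    return (input_tokens * INPUT_PER_MILLION
            + output_tokens * OUTPUT_PER_MILLION) / 1_000_000

# A hypothetical coding request: 2,000-token prompt, 1,000-token reply.
print(request_cost(2_000, 1_000))  # → 0.0037 (about a third of a cent)
```

At rates like these, even a million such requests land in the low thousands of dollars, which is the economics argument being made for "cheaper and good enough" models.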

In a recent article, The Information

suggested that that competition is a

huge problem for Anthropic in

particular, given how much of their

revenue is derived from API use for

coding. They also point out that looking

abroad is an imperative for the Chinese

startups, writing, "It is critical they

find customers outside China who pay to

access the AI models through APIs, no

matter how low the prices are. That's

because it's difficult for AI companies

in China to generate revenue from

domestic customers where price

competition is fierce and business

customers are reluctant to pay for

subscriptions." The article continues,

"As the overall AI coding market grows

rapidly, the Chinese companies are

betting that there will be sufficient

demand for cheaper and good enough

options. And in fact, this is one way

that the release of Kimi K2 could end up

being different to the Deepseek moment.

If the release of DeepSeek R1 was all

about giving consumers their first

glimpse of reasoning models that were

hidden behind the paywall at OpenAI,

Kimi K2 Thinking could end up being

more about providing a near

state-of-the-art model that could

perform in the enterprise at a fraction

of the cost. Another interesting shift

is that models like Kimi K2 Thinking

are opening the door to self-hosted LLMs

in a way that wasn't really feasible

last year. Up until recently, there has

been a stark trade-off when a developer

chose to run models locally. Previously,

you could use open source models to

underpin products that didn't need

state-of-the-art AI, or you could tinker

around with them. But for serious

advanced production use cases, there

needed to be a very significant reason

to want the privacy or security of a

local model to make up for the reduced

performance. Kimi K2 Thinking is one of a

crop of Chinese models that have reduced

that gap. One of the reasons for that is

an innovation in quantization. You can

think of quantization as kind of like

compression for AI models. While the

process reduces performance, it also

lowers the memory requirements

substantially to allow models to fit on

consumer hardware. Kimi K2 Thinking,

for example, can be quantized down to

run on a pair of Mac M3 Ultras, which is

certainly not a cheap consumer setup,

but it is a realistic rig for a

professional programmer or a company.
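For intuition, here is a minimal sketch of the quantization idea using generic symmetric int8 quantization with NumPy; this illustrates the compression-versus-accuracy trade-off in general and is not the specific scheme Moonshot used:

```python
import numpy as np

# Symmetric int8 quantization: store one signed byte per weight plus a
# single float scale, instead of four bytes per float32 weight.
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0             # largest weight maps to 127
    q = np.round(w / scale).astype(np.int8)     # 1 byte per weight
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale         # approximate reconstruction

w = np.random.randn(1024, 1024).astype(np.float32)  # a toy weight matrix
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)                 # 4x smaller in memory
err = np.abs(dequantize(q, scale) - w).max()
print(err <= scale / 2 + 1e-6)              # rounding error of at most half a step
```

Lower-bit variants of the same trick (such as int4, an 8x reduction) are what make squeezing a frontier-scale model onto a pair of high-memory Macs plausible at all.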

Some are starting to wonder if local

LLMs will be a growing trend. I'm not

really sure that I'm convinced at this

point, but it is possible that we will

see certain types of industrial use

cases where the balance of value that

you get from running locally does shift

things and that will be an important

trend to keep an eye on. And while we

haven't seen a lot of US enterprises all

of a sudden adopting Chinese models,

there are growing reports that the

startup ecosystem has already made the

switch. Bloomberg Opinion columnist

Catherine Thorbecke wrote, "In recent

weeks, a subtle shift has become

increasingly apparent. Speculation has

been stirring for months that low-cost

open-source Chinese AI models could

lure global users away from US

offerings, but now it appears they are

also quietly winning over Silicon

Valley." She referenced Chamath

Palihapitiya commenting that one of his

portfolio companies has already moved

major workflows to Kimmy K2, which he

said is, quote, "Frankly just a ton

cheaper than OpenAI and Anthropic." That

same week, Airbnb CEO Brian Chesky said

that they hadn't integrated with OpenAI

because the connections aren't quite

ready. Instead, Airbnb's new service

agent is, quote, relying a lot on

Alibaba's Qwen 3 model, which Chesky said

is very good and also fast and cheap.

Mira Murati's Thinking Machines Lab is

also building on Qwen 3. Cursor's new

in-house coding agent, Composer 1, is

rumored to be built on top of a Chinese

model. And Hugging Face downloads for

Qwen have recently overtaken downloads

of Meta's Llama models, suggesting a

shift in user patterns for open source

AI. Referencing that same Jensen Huang

quote, Thorbecke wrote, "It's premature

for Huang to declare a winner. The US

still has clear advantages when it comes

to access to cutting edge chips and

computing power, but Beijing's low-cost

and open source push is undoubtedly

attracting developers, the backbone of

AI innovation. If Washington truly wants

to come out on top in the long run, it

should start by asking why Silicon

Valley is already switching sides." So,

what's the net of all of this? Cash App

Patel writes, "Kimi K2 Thinking is more

important than o3, not because the model

is better, but because of what it

signals about the future of AI

development." For him, there are a few

different elements of this. First, that

the open source lag is now measured in

months, not years. That basically we've

seen the closed model advantage window

collapse from more than 18 months to 3

to 4 months. That China is treating AI

like they treated electric vehicle

manufacturing. In other words, not

trying to match the West, but trying to

lap it on price and accessibility and

competing on economics. And then this

observation, the real race isn't to AGI,

it's to democratization. He writes, who

cares if you build AGI if only a

thousand companies can afford it. Kimmy

K2 provides frontier performance at

commodity prices. That's the game. Dean

Zacharansky thinks that the agentic

capabilities update is the real deal

here. He writes, "In July 2025, models

could not effectively call tools, three

to five tool calls max. Then Kimi K2

released and every subsequent model has

been post-trained for tool calling. Now

we have agents that can run for an hour

and 30 minutes. This is the quietest and

most significant advancement in recent

memory." Bindu Reddy writes, "In spite of

all the closed source drama, the biggest

story of 2025 has been open source

agentic models. Three new models

dominate the cheap mass market agent

space: GLM, Kimi K2, and Qwen Coder are

all amazing with trillions of tokens

being used every day." That leads to a

prediction from Bindu. 2026 will be the

year of open weights. We will see at

least two US labs enter the arena. Kimi

and GLM will push to close the gap in

agentic coding. DeepSeek will finally

release R2. We will have

state-of-the-art image and video

generation models. The LLM developer

community will explode. Now look,

obviously one of the subtexts for a lot

of this show is around the geopolitics

of this, but when it comes to consumer

choice, it's hard to see all of these

advancements as anything but incredibly

valuable. New frontiers of performance

and cost are being pushed, bringing the

efficiency and affordability of

everything down. And that's going to

mean all of us being able to do even

more with these models than what was

previously possible. Pretty interesting

stuff. Obviously, a lot to keep track

of. For now, that's going to do it for

today's AI Daily Brief. Appreciate you

listening or watching as always and

until next time, peace.
