The Open Source AI Model Beating GPT-5 on Agentic Performance
By The AI Daily Brief: Artificial Intelligence News
Summary
## Key takeaways
- **Kimi K2 Thinking Outperforms GPT-5**: China's new open-source model, Kimi K2 Thinking, has reportedly surpassed GPT-5 and Claude 4.5 on key benchmarks like Humanity's Last Exam and agentic search. [03:25]
- **Cost-Effective Local AI Deployment**: Kimi K2 Thinking can run on consumer hardware like Mac M3 Ultras at significantly lower cost than large-scale cloud deployment, making advanced AI more accessible. [03:56], [08:48]
- **Agentic Workflow Breakthrough**: The model demonstrates advanced agentic capabilities, performing up to 300 sequential tool calls without human intervention, positioning it ahead of many Western frontier models. [04:19]
- **Open Source 'Year of Open Weights' Predicted**: The rapid progress and accessibility of models like Kimi K2 are fueling predictions that 2026 will be the 'Year of Open Weights,' with increased competition from US labs. [12:13]
- **Silicon Valley Adopts Chinese Models**: Despite geopolitical tensions, Silicon Valley startups are reportedly switching major workflows to Chinese models like Kimi K2 and Qwen 3 due to their lower cost and strong performance. [09:21]
Topics Covered
- China's AI models challenge US dominance
- Kimi K2: A turning point for open-source AI?
- Cost-effective AI models are democratizing access
- Chinese AI models excel in agentic workflows
- The open-source lag is shrinking rapidly
Full Transcript
Welcome back to The AI Daily Brief.
Today we are once again talking about
another Chinese open-source model that
is really changing people's sense of
what is possible in the field of AI
today. Now, to put this model release in proper context, we have to go back to January. It is now coming up towards the end of the year, which is of course the time when I start to plan out my end-of-year coverage, a big time for reflecting on the year that has passed and what's to come. And any end-of-year big story recap is inevitably going to kick off with the big story from January, which was of course the release of DeepSeek. When the Chinese lab DeepSeek dropped their reasoning model, it caused an absolute tizzy in
the AI industry that even sent stocks
reeling. Now, there were three big
reasons that DeepSeek was such a big deal. The first was that it totally changed people's perception of how far behind the US China really was. Up until that point, people were working on the assumption that when it came to model development, China was meaningfully behind the US, and DeepSeek seemed to
suggest that wasn't true. The second big
reason for concern, and the one behind
the big stock wobble, was that at the
time it had appeared that they had
achieved those results at significantly
lower costs than big US training runs.
This made everyone question the
incredible amount of resources being
spent on the data center buildout. The
third reason DeepSeek was such a big deal was more on the consumer side. When they released their R1 reasoning model, the chatbot app that housed it actually dethroned ChatGPT to become the number one downloaded free app on Apple's App Store for iPhone. Now, what was interesting about this was that DeepSeek was not the first company to release a reasoning model. At that point, OpenAI's o1 had been available for a number of months. The difference was that DeepSeek
made it available for free, meaning that
for most people it was their first
experience with a reasoning model, which
of course, if you've ever experienced
the jump from a non-reasoning to a
reasoning model, is just a fundamentally
different LLM experience. So, this is
what kicked off the year and set the
tone for a number of different
conversations that we'd be having
throughout the year. Now, more recently,
the whole China element of this story
has heated back up in a big way. Nvidia CEO Jensen Huang recently said in very stark terms that he believed China would win the AI race because of its disposition towards it. And by the way, even though all these outlets are reporting that he backtracked, for my money the backtrack was more just a reaffirmation of what he was saying, while trying to present a slightly more positive spin that the US still had a chance. Along with the rise
in AI skepticism among market investors,
there has also been a surge in the idea
that China isn't building as many data
centers and that perhaps the US is
overbuilding. And investor Gordon
Johnson went viral with a tweet that
said, "Question for the AI bulls. The US
currently has around 5,426 data centers
and is investing billions to build more.
China has around 449 data centers and is
not adding. If AI is real, why isn't
China building thousands of data centers
every month, which they could clearly
do?" Semi-analysis Dylan Patel
responded, "Where did you get the idea
that they aren't adding? Not as much as
the US, but China has thousands of data
centers and are building many more. Your
data source sucks." Now the substance
here is less important than the
narrative and the fact that once again
China's actions become the big foil for
the US's and this is the setup into
which the new Kimmy K2 thinking model
was released. The new model was released
by moonshot last Thursday with claims of
outperformance on major benchmarks. The
model purportedly leads both GPT5 and
Claude sonnet 4.5 on humanity's last
exam which is a general knowledge test
on browse comp which is a test of
agentic search and seal zero which is a
test of the ability to collect real
world data. The model lags slightly on
major coding benchmarks like sweet bench
verified but not by much. DD dos of
menlo ventures wrote today is a turning
point in AI a Chinese open source model
is number one. Kimmy K2 thinking scored
51% on humanity's last exam, higher than
GPT5 in every other model. 60 cents per
million tokens and $2.5 per million
tokens output. The best at writing and
does 15 tokens per second on two Mac M3
Ultra. Seinal moment in AI. In other
words, the point that DD is making here
is that in addition to performing well,
it's doing so cheaply and in a way
that's efficient enough that people
could run it on their own hardware. Now,
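To make that pricing math concrete, here's a quick back-of-the-envelope calculation at the quoted rates; the workload sizes below are made-up numbers, purely for illustration.

```python
# Back-of-the-envelope API cost estimate at the quoted Kimi K2 Thinking
# rates: $0.60 per million input tokens, $2.50 per million output tokens.
# The workload below (a hypothetical agent session) is illustrative only.

INPUT_PRICE_PER_M = 0.60   # USD per 1M input tokens (quoted rate)
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens (quoted rate)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one run at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical agentic session: 2M tokens read, 500K tokens generated.
cost = run_cost(input_tokens=2_000_000, output_tokens=500_000)
print(f"${cost:.2f}")  # -> $2.45
```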
Now, in addition to scorching the benchmarks, Moonshot claimed the model is capable of 200 to 300 sequential tool calls without human intervention. If that's true, it would make it incredibly capable for agentic workflows, frankly head and shoulders above many of the Western frontier models. Indeed, according to independent testing from Artificial Analysis, Kimi is now ranked ahead of GPT-5, Claude 4.5 Sonnet, and Grok 4 on agentic tool use, and there's a fairly significant gap.
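To give a sense of what hundreds of sequential tool calls mean mechanically, here is a minimal sketch of an agent loop, assuming an OpenAI-compatible chat completions endpoint; the base URL, model name, and the single `web_search` tool are illustrative placeholders, not Moonshot's documented setup.

```python
# Minimal agent loop: keep calling the model, execute any tool calls it
# requests, feed the results back, and stop when it answers in plain text.
# Endpoint URL, model name, and the web_search tool are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool for this sketch
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Stub: wire this to a real search backend in practice.
    return f"[stub results for {query!r}]"

messages = [{"role": "user", "content": "Research X and summarize."}]
for _ in range(300):  # cap the run at 300 sequential tool calls
    resp = client.chat.completions.create(
        model="kimi-k2-thinking", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:        # plain text answer -> the agent is done
        print(msg.content)
        break
    for call in msg.tool_calls:   # execute each requested tool call
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),
        })
```

The point of the pattern is simply that the model, not a human, decides when to call the next tool; "300 sequential tool calls" means the loop above iterates 300 times without anyone intervening.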
Some, like Dan Mack, suggested that this might be enough to delay the release of the next generation of models as the frontier labs go back to the drawing board. Referencing that same recent quote from Jensen Huang that we were just talking about, the one where he said that Chinese AI is nanoseconds behind America, Dan wrote, "Jensen is right. Look at Kimi K2 Thinking. Watch for delayed releases of Gemini 3, Opus 4.5, and GPT-5.1. Delays signal they are not clearly better or cheaper than Kimi K2 Thinking. That is evidence that the USA is indeed falling behind in the race." Said Machina, "Kimi K2 beating Gemini 3 would be, well... 'humiliating' doesn't even cover it. Think about what Google has: decades of data, the best talent money can buy, infrastructure that runs the internet. And they're sweating a smaller team's model. That's not supposed to happen in tech. The big guy always wins. Maybe not this time, though."
Now, part of what has people excited is that the model is open source, so people were running their own tests over the weekend. Pietro Schirano, the CEO at MagicPath AI, wrote, "Kimi K2 Thinking is incredible. So I built an agent to test it out: Kimi Writer. It can generate a full novel from one prompt, running up to 300 tool requests per session. Here it is creating an entire book, a collection of 15 short sci-fi stories." Alexi gave the model the task of balancing nine eggs, a book, a laptop, an empty plastic bottle, and a nail to try out its reasoning. The model came up with the counterintuitive solution of arranging the eggs to support the book as the starting point, then adding the book, laptop, bottle, and nail in turn. Alexi remarked, "Kimi K2 Thinking is the only modern reasoning model in recent memory that provided a human solution to this on the first try."
Now, another big shift here is that Chinese models are now right there with the US models on coding. AI coding has been the breakout killer use case this year, and frankly that's probably been something of a comfort for the Western companies, as this is one area where they've continued to maintain something of a lead. At the beginning of the year, Claude 3.5 Sonnet was the premier model with no close competitor. Since then, later versions of Claude, GPT-5, Gemini 2.5 Pro, and Grok 4 have all vied for the top of the leaderboards and for API credits from developers. Increasingly, though, Chinese models are catching up, if not to the absolute state of the art, then at least presenting a very compelling cost-to-value trade-off. Kimi K2 Thinking is clearly better at coding than Claude 3.5 Sonnet, the model that everyone was using just a few months ago, and it's being served at a fraction of the cost.
In a recent article, The Information suggested that this competition is a
huge problem for Anthropic in
particular, given how much of their
revenue is derived from API use for
coding. They also point out that looking
abroad is an imperative for the Chinese
startups, writing, "It is critical they
find customers outside China who pay to
access the AI models through APIs, no
matter how low the prices are. That's
because it's difficult for AI companies
in China to generate revenue from
domestic customers where price
competition is fierce and business
customers are reluctant to pay for
subscriptions." The article continues,
"As the overall AI coding market grows
rapidly, the Chinese companies are
betting that there will be sufficient
demand for cheaper and good enough
options." And in fact, this is one way that the release of Kimi K2 could end up being different from the DeepSeek moment. If the release of DeepSeek R1 was all about giving consumers their first glimpse of reasoning models that were hidden behind the paywall at OpenAI, Kimi K2 Thinking could end up being more about providing a near state-of-the-art model that can perform in the enterprise at a fraction
of the cost. Another interesting shift
is that models like Kimi K2 Thinking
are opening the door to self-hosted LLMs
in a way that wasn't really feasible
last year. Up until recently, there has
been a stark trade-off when a developer
chose to run models locally. Previously,
you could use open source models to
underpin products that didn't need
state-of-the-art AI, or you could tinker
around with them. But for serious
advanced production use cases, there
needed to be a very significant reason
to want the privacy or security of a
local model to make up for the reduced
performance. Kimi K2 Thinking is one of a crop of Chinese models that have reduced that gap. One of the reasons for that is an innovation in quantization. You can think of quantization as kind of like compression for AI models: while the process reduces performance, it also lowers memory requirements substantially, allowing models to fit on consumer hardware. Kimi K2 Thinking, for example, can be quantized down to run on a pair of Mac M3 Ultras, which is certainly not a cheap consumer setup, but it is a realistic rig for a professional programmer or a company.
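As a toy illustration of the idea (not Moonshot's actual scheme; production setups often use 4-bit, group-wise variants), here's what simple 8-bit symmetric quantization of a weight matrix looks like. The memory saving comes from storing one byte per weight instead of four.

```python
# Toy 8-bit symmetric quantization of a weight matrix with NumPy.
# Real schemes are more sophisticated, but the memory math is the same idea.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)

# One scale per row: map the largest |weight| to the int8 limit (127).
scale = np.abs(weights).max(axis=1, keepdims=True) / 127.0
q = np.round(weights / scale).astype(np.int8)      # 1 byte per weight
dequant = q.astype(np.float32) * scale             # approximate original

print(f"fp32: {weights.nbytes / 2**20:.0f} MiB")   # 64 MiB
print(f"int8: {q.nbytes / 2**20:.0f} MiB")         # 16 MiB (4x smaller)
print(f"max abs error: {np.abs(weights - dequant).max():.4f}")
```

Quantizing to 4 bits halves the footprint again, which is the kind of reduction that lets a roughly trillion-parameter-class model squeeze onto a couple of high-memory desktop machines.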
Some are starting to wonder if local
LLMs will be a growing trend. I'm not
really sure that I'm convinced at this
point, but it is possible that we will
see certain types of industrial use
cases where the balance of value that
you get from running locally does shift
things and that will be an important
trend to keep an eye on. And while we
haven't seen a lot of US enterprises all
of a sudden adopting Chinese models,
there are growing reports that the
startup ecosystem has already made the
switch. Bloomberg Opinion columnist Catherine Thorbecke wrote, "In recent weeks, a subtle shift has become increasingly apparent. Speculation has been stirring for months that low-cost, open-source Chinese AI models could lure global users away from US offerings, but now it appears they are also quietly winning over Silicon Valley." She referenced Chamath Palihapitiya commenting that one of his portfolio companies has already moved major workflows to Kimi K2, which he said is, quote, "frankly just a ton cheaper than OpenAI and Anthropic." That same week, Airbnb CEO Brian Chesky said that they hadn't integrated with OpenAI because the connections aren't quite ready. Instead, Airbnb's new service agent is, quote, "relying a lot on" Alibaba's Qwen 3 model, which Chesky said is very good and also fast and cheap. Mira Murati's Thinking Machines Lab is also building on Qwen 3. Cursor's new in-house coding agent, Composer 1, is rumored to be built on top of a Chinese model. And Hugging Face downloads for Qwen have recently overtaken downloads of Meta's Llama models, suggesting a shift in user patterns for open-source AI. Referencing that same Jensen Huang quote, Thorbecke wrote, "It's premature for Huang to declare a winner. The US still has clear advantages when it comes to access to cutting-edge chips and computing power, but Beijing's low-cost and open-source push is undoubtedly attracting developers, the backbone of AI innovation. If Washington truly wants to come out on top in the long run, it should start by asking why Silicon Valley is already switching sides." So,
what's the net of all of this? Cash App Patel writes, "Kimi K2 Thinking is more important than o3, not because the model
is better, but because of what it
signals about the future of AI
development." For him, there are a few
different elements of this. First, that
the open source lag is now measured in
months, not years. That basically we've
seen the closed model advantage window
collapse from more than 18 months to 3
to 4 months. That China is treating AI
like they treated electric vehicle
manufacturing. In other words, not
trying to match the West, but trying to
lap it on price and accessibility and
competing on economics. And then this
observation, the real race isn't to AGI,
it's to democratization. He writes, "Who cares if you build AGI if only a thousand companies can afford it? Kimi K2 provides frontier performance at commodity prices. That's the game." Dean
Zacharansky thinks that the agentic
capabilities update is the real deal
here. He writes, "In July 2025, models
could not effectively call tools, three
to five tool calls max. Then Kimmy K2
released and every subsequent model has
been post-trained for tool calling. Now
we have agents that can run for an hour
and 30 minutes. This is the quietest and
most significant advancement in recent
memory. Bindu ready writes, "In spite of
all the closed source drama, the biggest
story of 2025 has been open source
agentic models. Three new models
dominate the cheap mass market agent
space. GLM, Kimk 2, and Quen Coder are
all amazing with trillions of tokens
being used every day. That leads to a
prediction from Bindu. 2026 will be the
year of open weights. We will see at
least two US labs enter the arena. Kimi and GLM will push to close the gap in agentic coding. DeepSeek will finally release R2. We will have state-of-the-art image and video generation models. The LLM developer community will explode. Now look,
obviously one of the subtexts for a lot
of this show is around the geopolitics
of this, but when it comes to consumer
choice, it's hard to see all of these
advancements as anything but incredibly
valuable. New frontiers of performance and cost are being pushed, improving the efficiency and affordability of everything. And that's going to
mean all of us being able to do even
more with these models than what was
previously possible. Pretty interesting
stuff. Obviously, a lot to keep track
of. For now, that's going to do it for
today's AI Daily Brief. Appreciate you
listening or watching as always and
until next time, peace.