Kimi K2 Thinking is CRAZY... (HUGE UPDATE)
By Matthew Berman
Summary
Topics Covered
- Chinese Open Model Beats GPT-5
- Executes 200-300 Tool Calls Seamlessly
- PhD Math Solved via 23 Tool Calls
- China Accelerates Open Frontier Lead
- Builds Complex Data Dashboards Autonomously
Full Transcript
We may have another DeepSeek moment on our hands. Moonshot AI, a Chinese frontier AI company, just released a completely open-source, open-weights, frontier-level model that is better than GPT-5 and better than Claude 4.5 on some of the hardest benchmarks. Let me break it all down for you. And this video is brought to you by Vultr; more on them later. This is Kimi K2 Thinking, a thinking model that is capable of thinking for a long time and using tools in its thought process, and people are already saying this is an incredible model. We've done some initial tests, and I'm going to show you those towards the end of the video. All right, so let's get into the details.
Built as a thinking agent, it reasons step by step while using tools, and it achieves state-of-the-art performance on Humanity's Last Exam, BrowseComp, and other benchmarks. It can execute 200 to 300 sequential tool calls without human interference and can reason coherently across hundreds of steps to solve the most complex problems. This is Moonshot's attempt to scale up test-time reasoning for the Kimi series of models, and they have created an incredible model.
All right, let's look at some of the benchmarks. Here is Humanity's Last Exam, one of the hardest benchmarks on the planet. And look at this: Kimi K2 Thinking comes in with a score of 44.9, as compared to GPT-5 at 41.7. It beats GPT-5 at Humanity's Last Exam. A fully open-source, fully open-weights model is now better than frontier models out of the US closed-source labs. And Claude Sonnet 4.5 Thinking scores a 32.
It is also incredibly good at agentic browsing and search. In fact, they even built an agentic mode into Kimi, which they haven't released yet; it should be out pretty soon. But look at this: agentic search on BrowseComp scores a 60.2, versus 54.9 for GPT-5 and a measly 24.1 for Claude Sonnet 4.5 Thinking. Then we have SWE-bench Verified. Although Kimi comes in last behind GPT-5 and Claude Sonnet 4.5 Thinking, 71 is still a very good score, versus 74 for GPT-5 and 77 for Claude. Then we have LiveCodeBench v6, a competitive programming benchmark: 83.1 for Kimi K2, 87 for GPT-5, and 64 for Sonnet 4.5 Thinking.
All right, so let me show you this example. Here it is solving a PhD-level mathematics problem, and it does so with 23 tool calls in its chain of thought. So check this out. I'm not going to be able to explain what's going on here, but I just want to show you it actually going through the process of solving this. So here's the question. Now the reasoning is completed, and we can go in and actually see all of the crazy reasoning happening at every step.
Scrolling down, we can see it also did a web search, for the hyperbolic normal distribution PDF, so it's actually looking for references and sources to help inform how to solve the problem. It goes back and forth again, and basically it's just super impressive: searching for more information, reasoning, then searching for more information again. I am blown away by its capability. And at the very end, it got the right answer. Again, I'm not going to be able to explain how or why, but here we are. It is also incredibly good at coding, of course, and we have multiple examples provided by the Kimi team.
So check this out: a component-heavy website, essentially a Word clone. We can see here that I can delete text. We have different fonts, different font sizes, italics, bold, underline, and strikethrough, and all of these work. If I click save, it even saves the document to my local computer.
Next, look at this example: a math explainer visualization of gradient descent. And again, this was all created from a single prompt. We can see a full visualization of gradient descent. This is going to be great for me, since I use a lot of B-roll and a lot of explainer visuals, so I'm definitely going to be checking this out.
Here's a model test that I've used in the past. I don't know if they've seen one of my videos or not, but this is basically the exact test that I've run on previous models. Here we go: a simulation of viruses attacking cells in a bloodstream. We have different sliders for the number of viruses, as you can see here. We have the replication rate, the types of viruses (aggressive, stealth, and fast-replicating), the number of white blood cells, white blood cell speed, detection range, and so on.
Very cool. And if you want to run incredible open-source, open-weights models like Kimi K2 Thinking, check out the sponsor of today's video, Vultr. Vultr is the world's largest independent cloud provider, and they've been a fantastic partner to us, so I'm really excited to tell you about them again today. If you need to provision GPUs, whether you're just tinkering on your own AI project or scaling up to production, Vultr is the place to go. They offer the latest AMD and NVIDIA GPUs spanning 32 locations across six continents, so you're going to get the lowest latency. They also offer industry-leading price-to-performance with serious accessibility and reliability. With Vultr's global, fully composable cloud infrastructure, you can move your applications closer to your users, and it frees you from vendor lock-in, which, as you know, I've talked about quite a bit on this channel. They also have Vultr Kubernetes Engine, which allows you to scale beyond just a single container. So if you're tired of waiting in line for other GPU providers, check out Vultr today. They're offering my viewers $300 in credits for your first 30 days when you visit getvulture.com/bman. And remember to use code bur300. Thanks again to Vultr. Back to the video.
We also have this vinyl simulation: drop the needle to play. So we have kind of a circular set of words; I drop the needle and it starts playing. Now, it doesn't have any sound, which is kind of weird. I would have thought it had sound, but it doesn't. Still, it looks very cool. And last, it lets you create live music using Strudel, which I had not heard of, but it looks very cool. Strudel, I believe, is a coding language that allows you to create music. So, very cool examples.
And stick around, because I'm going to show you one more example later in the video of a test that my team ran on Kimi K2, and I was blown away by it. Kimi K2 Thinking is also really good at search: integrating the information from search into its thinking process and then running subsequent searches based on everything it has learned so far. On BrowseComp, a challenging benchmark designed to evaluate a model's ability to continuously browse, search, and reason over hard-to-find, real-world web information, K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. K2 Thinking can execute 200 to 300 sequential tool calls driven by long-horizon planning and adaptive learning.
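That plan-act-observe loop, where the model decides on a tool, reads the result, and keeps reasoning, can be sketched in a few lines of Python. This is a toy illustration only, not Moonshot's actual API: the `fake_model` stub, the tool registry, and the message format are all hypothetical stand-ins.

```python
# Toy sketch of an agentic tool-calling loop (hypothetical stand-ins,
# not Moonshot's real API). A real client would send `messages` to a
# model endpoint; here `fake_model` scripts two tool calls, then answers.

def fake_model(messages):
    # Stand-in for a chat-completion call: pick the next action based on
    # how many tool results have come back so far.
    n_results = sum(1 for m in messages if m["role"] == "tool")
    if n_results == 0:
        return {"tool": "search", "args": {"query": "hyperbolic distribution"}}
    if n_results == 1:
        return {"tool": "calculator", "args": {"expr": "2 ** 10"}}
    return {"answer": "done after " + str(n_results) + " tool calls"}

TOOLS = {
    "search": lambda query: "top result for: " + query,
    "calculator": lambda expr: str(eval(expr)),  # toy only: never eval untrusted input
}

def run_agent(user_prompt, max_steps=300):
    # K2 Thinking reportedly sustains 200-300 of these steps autonomously.
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        action = fake_model(messages)
        if "answer" in action:
            return action["answer"], messages
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    return None, messages

answer, trace = run_agent("Solve the problem.")
print(answer)  # done after 2 tool calls
```

The real model folds each tool result back into its chain of thought before choosing the next call, but the loop structure is the same: act, observe, reason, repeat, until it decides to answer.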
Now, look at this complex logical problem that Kimi K2 was given. The information below is about an individual who is an alumnus of a university founded after 1860 but before 1890, was a university athlete, later played briefly for a professional American football team, starred in a science fiction film about an alien invasion released after 2010 but before 2020, and so on. So it's kind of like a teaser problem. And look at its thinking: it reasons, does a search, reasons some more, does another search, and back and forth it goes until it finally comes up with Jimmy Garary Jr. They also say Kimi K2 is really good at creative writing, but I'm usually pretty doubtful that AI is any good at that. So, if you want to check out the full benchmarks, they're right here, and I will also drop a link to them in the description below.
All right. Now, let me give you a few reactions from AI leaders around the web on Kimi K2 Thinking's release. First, Emad Mostaque, friend of the live show and founder of Stability AI, says: "Congratulations to Kimi Moonshot for achieving state-of-the-art on many benchmarks and open-sourcing the model. The gap between closed and open continues to narrow, even as the cost of increasingly economically valuable tokens collapses. K2 has a unique vibe too; try it out." Now, he goes on to talk about the cost of the new model's training. Listen to this: the base Kimi K2 model used 2.8 million H800 hours with 14.8 trillion tokens, about $5.6 million, which, by the way, is crazy to think about; these frontier models are getting so cheap to train. Then he says the details of the post-training for the reasoning version are not given, but it is likely at most 20% more, so excluding data prep it would be less than $3 million for state-of-the-art if they had Blackwell chip access. Very interesting to think about. The cost of training frontier models is dropping so quickly.
All right. And of course, everyone's thinking: how does this compare to DeepSeek? Can you believe that at the very beginning of 2025 we had the DeepSeek moment with R1, and now at the end of 2025 we have the Kimi K2 moment, both of them incredible open-source, open-weights models? So on the left we have DeepSeek R1 at 671 billion parameters, while Kimi K2 Thinking is a trillion parameters. On vocabulary size, where bigger is better, it's 129,000 for DeepSeek versus 160,000 for Kimi K2 Thinking. Both of them are mixture-of-experts models, with DeepSeek having 256 experts and Kimi K2 Thinking having 384 experts, so more experts over there. And interestingly, even though this is a bigger model, there are 37 billion parameters active during inference for DeepSeek and only 32 billion active for the new Kimi K2 Thinking model, so even fewer, which is more efficient. And they have equal context lengths at 128,000. Thank you to Sebastian Raschka for putting this together.
And by the way, if you want to learn to use Kimi K2 Thinking and other models, you should download "The Subtle Art of Not Getting Replaced" ebook. This ebook contains 100 different real-world use cases that you can use AI for today. So check it out, download it, and let me know what you think.
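Coming back to the mixture-of-experts comparison for a moment: the reason a trillion-parameter model can activate only about 32 billion parameters per token is top-k expert routing, where a small gate picks a handful of experts to run for each token. Here is a toy sketch with illustrative sizes (not either model's real architecture):

```python
import numpy as np

# Toy mixture-of-experts routing: a gate scores all experts per token,
# but only the top-k experts actually run, so the "active" parameters
# per token are a fraction of the total (as in R1's 37B-of-671B and
# K2 Thinking's 32B-of-1T splits).

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 4

gate = rng.normal(size=(d, n_experts))        # router weights
experts = rng.normal(size=(n_experts, d, d))  # one weight matrix per expert

def moe_forward(x):
    scores = x @ gate                          # one gate score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights = weights / weights.sum()          # softmax over the chosen experts
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
    return out, chosen

x = rng.normal(size=d)
out, chosen = moe_forward(x)
print(len(chosen), out.shape)  # 2 (4,) : only 2 of the 8 experts ran
```

Scaling the same idea up, a 384-expert model that routes each token to only a few experts touches far fewer weights per forward pass than its total parameter count suggests.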
Nathan Lambert of interconnects.ai, an expert at all things training, had this to say about Kimi K2 Thinking. He says early reports are that it has a distinctive style in writing, which is very welcome. And interestingly, he says it has a 256K context length; I thought it was only 128K, but it seems like it might actually be 256K. One trillion total parameters, 32 billion active; we just talked about that. And I think the crux of his writing, and what I've been thinking about a lot, is right here in this section: China's rise. At the start of the year, most people loosely following AI probably knew of zero Chinese AI labs, and towards wrapping up 2025, I'd say DeepSeek, Qwen, and Kimi are all becoming household names. They all have seasons of their best releases and different strengths. It took many Chinese companies only six months to catch up to the open frontier in ballpark performance. Now, the question is whether they can offer something in a niche of the frontier that has real demand from users. So it really seems like not only has open-source, open-weights caught up to frontier models, but those models are mostly coming out of China. We had an open-weights model from OpenAI, but it's not a frontier model on the level of GPT-5, and we haven't had another version of Llama from Meta. So, yeah, China is really pushing the open-source, open-weights frontier.
frontier. All right. Now, let me show you what my team created with Kimmy K2 thinking. You can get to this on
thinking. You can get to this on kimmy.com. Analyze the relationship
kimmy.com. Analyze the relationship between population density and healthc care facility accessibility in Ghana.
Download the latest World Pop population raster and any open data set of health facility coordinates. Compute the
facility coordinates. Compute the average population density within a 10 km radius around each facility. Rank the
top 10 districts by lowest per capita facility coverage and generate a map and bar chart comparing results. Okay, so
Okay, so check this out. It is writing a to-do list, and this is OK Computer, something specific to Kimi: it's kind of like a scratchpad and an environment the model can run in. This is just beyond impressive. Boom: it created a full to-do list and started marking off the to-dos, as you can see. It did some searches and browsing, and you can actually see the browsing happen. Here it went to WorldPop, clicked around, and all of this just happened. We only gave it one piece of feedback: this part of it is a mess, go debug and fix it. And watch this. Let me show you the end result. Here we go.
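As a quick aside before the result: the core geospatial step in that prompt, averaging raster population density within 10 km of each facility, can be sketched with plain NumPy. This toy version assumes a density grid where one cell equals 1 km; a real pipeline (like the one the model built) would read a WorldPop GeoTIFF and handle map projections.

```python
import numpy as np

# Toy version of "average population density within a 10 km radius of
# each facility": treat the raster as a grid with 1 km cells and take
# the mean over cells whose centers fall inside the radius.

def mean_density_within_radius(raster, row, col, radius_km=10):
    rows, cols = np.indices(raster.shape)
    dist = np.sqrt((rows - row) ** 2 + (cols - col) ** 2)  # km, at 1 km per cell
    return raster[dist <= radius_km].mean()

raster = np.full((50, 50), 100.0)    # uniform 100 people per sq km
raster[20:30, 20:30] = 500.0         # a dense 10x10 km urban block
facilities = [(25, 25), (5, 5)]      # (row, col) cell of each facility

for r, c in facilities:
    print(round(mean_density_within_radius(raster, r, c), 1))
```

The facility sitting inside the urban block reports a much higher surrounding density than the rural one, which is exactly the signal the ranking step in the prompt is built on.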
Remember, this was created entirely by Kimi K2 Thinking and one piece of feedback. So: healthcare accessibility. Boom, an executive summary. The page looks nice. Here we have an interactive map with an overlay that tells you more information about the healthcare facilities. And remember, it downloaded all of this information itself; it just found it. District-level disparities: here we have more interactive charts. Look at this; it's just unreal. All these different types of charts and graphs. So, so impressive. At the bottom, we have even more. Forgive the colors; I'm just using dark mode for everything, and it would look a lot better if I turned that off. We have limitations and next steps, data sources, and methodology. You can download the CSVs of the facility analysis, district coverage, and underserved areas. And again, all of this was created in just a few minutes with really one prompt. So impressive. And once again, thank you to Vultr for sponsoring this video. Click through the link in the description below, let them know I sent you, and try out incredible open-source models. Thanks again to Vultr.
So that's it. My team is fully testing the Kimi K2 Thinking model, and we'll most likely put together a full testing video, so stick around for that. If you enjoyed this video, please consider giving it a like and subscribing.