Kimi K2 Thinking is CRAZY... (HUGE UPDATE)
By Matthew Berman
Summary
Topics Covered
- Chinese Open Model Beats GPT-5
- Executes 200-300 Tool Calls Seamlessly
- PhD Math Solved via 23 Tool Calls
- China Accelerates Open Frontier Lead
- Builds Complex Data Dashboards Autonomously
Full Transcript
We may have another DeepSeek moment on our hands. Moonshot AI, a Chinese frontier AI company, just released a completely open-source, open-weights, frontier-level model that is better than GPT-5 and better than Claude 4.5 on some of the hardest benchmarks. Let me break it all down for you. And this video is brought to you by Vultr; more on them later. This is Kimi K2 Thinking, a thinking model that is capable of thinking for a long time and using tools in its thought process, and people are already saying this is an incredible model. We've done some initial tests, and I'm going to show you those towards the end of the video. All right, so let's get into the details.
Built as a thinking agent, it reasons step by step while using tools, and it achieves state-of-the-art performance on Humanity's Last Exam, BrowseComp, and other benchmarks. It can execute 200 to 300 sequential tool calls without human interference and can reason coherently across hundreds of steps to solve the most complex problems. This is Moonshot's attempt to scale up test-time reasoning for the Kimi series of models, and they have created an incredible model.
All right, let's look at some of the benchmarks. Here is Humanity's Last Exam, one of the hardest benchmarks on the planet. And look at this: Kimi K2 Thinking comes in with a score of 44.9, as compared to GPT-5 at 41.7. It beats GPT-5 at Humanity's Last Exam. A fully open-source, fully open-weights model is now better than frontier models out of the US closed-source labs. And Claude Sonnet 4.5 Thinking scores a 32.
It is also incredibly good at agentic browsing and search. In fact, they even built an agentic mode into Kimi, which they haven't released yet; it should be out pretty soon. But look at this: agentic search on BrowseComp scores a 60.2, versus 54.9 for GPT-5 and a measly 24.1 for Claude Sonnet 4.5 Thinking. Then we have SWE-bench Verified. Although Kimi comes in last behind GPT-5 and Claude Sonnet 4.5 Thinking, 71 is still a very good score, versus 74 for GPT-5 and 77 for Claude. Then we have LiveCodeBench v6, a competitive programming benchmark: 83.1 for Kimi K2, 87 for GPT-5, and 64 for Sonnet 4.5 Thinking.
All right, so let me show you this example. Here it is solving a PhD-level mathematics problem, and it does so with 23 tool calls in its chain of thought. So check this out. I'm not going to be able to explain what's going on here, but I just want to show you it actually going through the process of solving this. So here's the question. Now the reasoning is completed, and we can go in and actually see all of the crazy reasoning happening at every step.
Scrolling down, we can see it also did a web search, for the hyperbolic normal distribution PDF, so it's actually looking for references and sources to help inform how to solve the problem. It goes back and forth again, and basically it's just super impressive: searching for more information, reasoning, then searching for more information again. I am blown away by its capability. And at the very end, it got the right answer. Again, I'm not going to be able to explain how or why, but here we are. It is also incredibly good at coding, of course, and we have multiple examples provided by the Kimi team.
So check this out: a component-heavy website, essentially a Word clone. We can see here that I can delete text. We have different fonts, different font sizes, italics, bold, underline, and strikethrough, and all of these work. If I click save, it even saves the document to my local computer.
Next, look at this example: a math explainer visualization of gradient descent. And again, this was all created from a single prompt. We can see a full visualization of gradient descent. This is going to be great for me, since I use a lot of B-roll and a lot of explainer visuals, so I'm definitely going to be checking this out.
Here's a model test that I've used in the past. I don't know if they've seen one of my videos or not, but this is basically the exact test that I've run on previous models. Here we go: a simulation of viruses attacking cells in a bloodstream. We have different sliders for the number of viruses, as you can see here. We have the replication rate, the types of viruses (aggressive, stealth, and fast-replicating), the number of white blood cells, white blood cell speed, detection range, and so on.
Very cool. And if you want to run incredible open-source, open-weights models like Kimi K2 Thinking, check out the sponsor of today's video, Vultr. Vultr is the world's largest independent cloud provider, and they've been a fantastic partner to us, so I'm really excited to tell you about them again today. If you need to provision GPUs, whether you're just tinkering on your own AI project or scaling up to production, Vultr is the place to go. They offer the latest AMD and NVIDIA GPUs spanning 32 locations across six continents, so you're going to get the lowest latency. They also offer industry-leading price-to-performance with serious accessibility and reliability. With Vultr's global, fully composable cloud infrastructure, you can move your applications closer to your users, and it frees you from vendor lock-in, which, as you know, I've talked about quite a bit on this channel. They also have Vultr Kubernetes Engine, which allows you to scale beyond just a single container. So if you're tired of waiting in line for other GPU providers, check out Vultr today. They're offering my viewers $300 in credits for your first 30 days when you visit getvulture.com/bman. And remember to use code bur300. Thanks again to Vultr. Back to the video.
We also have this vinyl simulation: drop the needle to play. So we have kind of a circular set of words; I drop the needle and it starts playing. Now, it doesn't have any sound, which is kind of weird. I would have thought it had sound, but it doesn't. Still, it looks very cool. And last, it lets you create live music using Strudel, which I had not heard of, but it looks very cool. Strudel, I believe, is a coding language that allows you to create music. So, very cool examples.
And stick around, because I'm going to show you one more example later in the video of a test that my team ran on Kimi K2, and I was blown away by it. Kimi K2 Thinking is also really good at search: integrating the information from search into its thinking process and then running subsequent searches based on everything it has learned so far. On BrowseComp, a challenging benchmark designed to evaluate a model's ability to continuously browse, search, and reason over hard-to-find, real-world web information, K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. K2 Thinking can execute 200 to 300 sequential tool calls driven by long-horizon planning and adaptive learning.
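That plan-act-observe loop, where the model decides on a tool, reads the result, and keeps reasoning, can be sketched in a few lines of Python. This is a toy illustration only, not Moonshot's actual API: the `fake_model` stub, the tool registry, and the message format are all hypothetical stand-ins.

```python
# Toy sketch of an agentic tool-calling loop (hypothetical stand-ins,
# not Moonshot's real API). A real client would send `messages` to a
# model endpoint; here `fake_model` scripts two tool calls, then answers.

def fake_model(messages):
    # Stand-in for a chat-completion call: pick the next action based on
    # how many tool results have come back so far.
    n_results = sum(1 for m in messages if m["role"] == "tool")
    if n_results == 0:
        return {"tool": "search", "args": {"query": "hyperbolic distribution"}}
    if n_results == 1:
        return {"tool": "calculator", "args": {"expr": "2 ** 10"}}
    return {"answer": "done after " + str(n_results) + " tool calls"}

TOOLS = {
    "search": lambda query: "top result for: " + query,
    "calculator": lambda expr: str(eval(expr)),  # toy only: never eval untrusted input
}

def run_agent(user_prompt, max_steps=300):
    # K2 Thinking reportedly sustains 200-300 of these steps autonomously.
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        action = fake_model(messages)
        if "answer" in action:
            return action["answer"], messages
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
    return None, messages

answer, trace = run_agent("Solve the problem.")
print(answer)  # done after 2 tool calls
```

The real model folds each tool result back into its chain of thought before choosing the next call, but the loop structure is the same: act, observe, reason, repeat, until it decides to answer.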
Now, look at this complex logical problem that Kimi K2 was given. The information below is about an individual who is an alumnus of a university founded after 1860 but before 1890, was a university athlete, later played briefly for a professional American football team, starred in a science fiction film about an alien invasion released after 2010 but before 2020, and so on. So it's kind of like a teaser problem. And look at its thinking: it reasons, does a search, reasons some more, does another search, and back and forth it goes until it finally comes up with Jimmy Garary Jr. They also say Kimi K2 is really good at creative writing, but I'm usually pretty doubtful that AI is any good at that. So, if you want to check out the full benchmarks, they're right here, and I will also drop a link to them in the description below.
All right. Now, let me give you a few reactions from AI leaders around the web on Kimi K2 Thinking's release. First, Emad Mostaque, friend of the live show and founder of Stability AI, says: "Congratulations to Kimi Moonshot for achieving state-of-the-art on many benchmarks and open-sourcing the model. The gap between closed and open continues to narrow, even as the cost of increasingly economically valuable tokens collapses. K2 has a unique vibe too; try it out." Now, he goes on to talk about the cost of the new model's training. Listen to this: the base Kimi K2 model used 2.8 million H800 hours with 14.8 trillion tokens, about $5.6 million, which, by the way, is crazy to think about; these frontier models are getting so cheap to train. Then he says the details of the post-training for the reasoning version are not given, but it is likely at most 20% more, so excluding data prep it would be less than $3 million for state-of-the-art if they had Blackwell chip access. Very interesting to think about. The cost of training frontier models is dropping so quickly.
All right. And of course, everyone's thinking: how does this compare to DeepSeek? Can you believe that at the very beginning of 2025 we had the DeepSeek moment with R1, and now at the end of 2025 we have the Kimi K2 moment, both of them incredible open-source, open-weights models? So on the left we have DeepSeek R1 at 671 billion parameters, while Kimi K2 Thinking is a trillion parameters. On vocabulary size, where bigger is better, it's 129,000 for DeepSeek versus 160,000 for Kimi K2 Thinking. Both of them are mixture-of-experts models, with DeepSeek having 256 experts and Kimi K2 Thinking having 384 experts, so more experts over there. And interestingly, even though this is a bigger model, there are 37 billion parameters active during inference for DeepSeek and only 32 billion active for the new Kimi K2 Thinking model, so even fewer, which is more efficient. And they have equal context lengths at 128,000. Thank you to Sebastian Raschka for putting this together.
And by the way, if you want to learn to use Kimi K2 Thinking and other models, you should download "The Subtle Art of Not Getting Replaced" ebook. This ebook contains 100 different real-world use cases that you can use AI for today. So check it out, download it, and let me know what you think.
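Coming back to the mixture-of-experts comparison for a moment: the reason a trillion-parameter model can activate only about 32 billion parameters per token is top-k expert routing, where a small gate picks a handful of experts to run for each token. Here is a toy sketch with illustrative sizes (not either model's real architecture):

```python
import numpy as np

# Toy mixture-of-experts routing: a gate scores all experts per token,
# but only the top-k experts actually run, so the "active" parameters
# per token are a fraction of the total (as in R1's 37B-of-671B and
# K2 Thinking's 32B-of-1T splits).

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 4

gate = rng.normal(size=(d, n_experts))        # router weights
experts = rng.normal(size=(n_experts, d, d))  # one weight matrix per expert

def moe_forward(x):
    scores = x @ gate                          # one gate score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights = weights / weights.sum()          # softmax over the chosen experts
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
    return out, chosen

x = rng.normal(size=d)
out, chosen = moe_forward(x)
print(len(chosen), out.shape)  # 2 (4,) : only 2 of the 8 experts ran
```

Scaling the same idea up, a 384-expert model that routes each token to only a few experts touches far fewer weights per forward pass than its total parameter count suggests.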
Nathan Lambert of interconnects.ai, an expert at all things training, had this to say about Kimi K2 Thinking. He says early reports are that it has a distinctive style in writing, which is very welcome. And interestingly, he says it has a 256K context length; I thought it was only 128K, but it seems like it might actually be 256K. One trillion total parameters, 32 billion active; we just talked about that. And I think the crux of his writing, and what I've been thinking about a lot, is right here in this section: China's rise. At the start of the year, most people loosely following AI probably knew of zero Chinese AI labs, and towards wrapping up 2025, I'd say DeepSeek, Qwen, and Kimi are all becoming household names. They all have seasons of their best releases and different strengths. It took many Chinese companies only six months to catch up to the open frontier in ballpark performance. Now, the question is whether they can offer something in a niche of the frontier that has real demand from users. So it really seems like not only has open-source, open-weights caught up to frontier models, but those models are mostly coming out of China. We had an open-weights model from OpenAI, but it's not a frontier model on the level of GPT-5, and we haven't had another version of Llama from Meta. So, yeah, China is really pushing the open-source, open-weights frontier.
frontier. All right. Now, let me show you what my team created with Kimmy K2 thinking. You can get to this on
thinking. You can get to this on kimmy.com. Analyze the relationship
kimmy.com. Analyze the relationship between population density and healthc care facility accessibility in Ghana.
Download the latest World Pop population raster and any open data set of health facility coordinates. Compute the
facility coordinates. Compute the average population density within a 10 km radius around each facility. Rank the
top 10 districts by lowest per capita facility coverage and generate a map and bar chart comparing results. Okay, so
Okay, so check this out. It is writing a to-do list, and this is OK Computer, something specific to Kimi: it's kind of like a scratchpad and an environment the model can run in. This is just beyond impressive. Boom: it created a full to-do list and started marking off the to-dos, as you can see. It did some searches and browsing, and you can actually see the browsing happen. Here it went to WorldPop, clicked around, and all of this just happened. We only gave it one piece of feedback: this part of it is a mess, go debug and fix it. And watch this. Let me show you the end result. Here we go.
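As a quick aside before the result: the core geospatial step in that prompt, averaging raster population density within 10 km of each facility, can be sketched with plain NumPy. This toy version assumes a density grid where one cell equals 1 km; a real pipeline (like the one the model built) would read a WorldPop GeoTIFF and handle map projections.

```python
import numpy as np

# Toy version of "average population density within a 10 km radius of
# each facility": treat the raster as a grid with 1 km cells and take
# the mean over cells whose centers fall inside the radius.

def mean_density_within_radius(raster, row, col, radius_km=10):
    rows, cols = np.indices(raster.shape)
    dist = np.sqrt((rows - row) ** 2 + (cols - col) ** 2)  # km, at 1 km per cell
    return raster[dist <= radius_km].mean()

raster = np.full((50, 50), 100.0)    # uniform 100 people per sq km
raster[20:30, 20:30] = 500.0         # a dense 10x10 km urban block
facilities = [(25, 25), (5, 5)]      # (row, col) cell of each facility

for r, c in facilities:
    print(round(mean_density_within_radius(raster, r, c), 1))
```

The facility sitting inside the urban block reports a much higher surrounding density than the rural one, which is exactly the signal the ranking step in the prompt is built on.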
Remember, this was created entirely by Kimi K2 Thinking and one piece of feedback. So: healthcare accessibility. Boom, an executive summary. The page looks nice. Here we have an interactive map with an overlay that tells you more information about the healthcare facilities. And remember, it downloaded all of this information itself; it just found it. District-level disparities: here we have more interactive charts. Look at this; it's just unreal. All these different types of charts and graphs. So, so impressive. At the bottom, we have even more. Forgive the colors; I'm just using dark mode for everything, and it would look a lot better if I turned that off. We have limitations and next steps, data sources, and methodology. You can download the CSVs of the facility analysis, district coverage, and underserved areas. And again, all of this was created in just a few minutes with really one prompt. So impressive. And once again, thank you to Vultr for sponsoring this video. Click through the link in the description below, let them know I sent you, and try out incredible open-source models. Thanks again to Vultr.
So that's it. My team is fully testing the Kimi K2 Thinking model, and we'll most likely put together a full testing video, so stick around for that. If you enjoyed this video, please consider giving it a like and subscribing.