DeepSeek’s New AI Just Surpassed Gemini 3 DeepThink With Brutal Logic
By AI Revolution
Summary
## Key takeaways

- **DeepSeek Math V2 Drops Silently**: DeepSeek Math V2 basically hit the internet out of nowhere. They quietly uploaded it to Hugging Face with no hype, and the crazy part is that this thing might be one of the most impressive math reasoning models ever released publicly. [00:21], [00:30]
- **Self-Verifiable Reasoning Framework**: DeepSeek Math V2 was designed around one big principle, self-verifiable reasoning. Not just answer the question, but prove it, check it, and admit your mistakes, using a student, teacher, supervisor concept. [01:48], [01:57]
- **Examiner Grades Proofs Like an Olympiad**: They built the examiner, a dedicated proof verification model. It reads through the entire proof, grades it with a three-point system, and explains what is good, what is missing, and what is flat-out wrong. [02:11], [02:27]
- **Rewards Honesty Over Bluffing**: The student outputs the reasoning, then writes a self-evaluation. The model gets rewarded for honesty, not just correctness; if it makes a mistake and honestly admits the flaw, it gets rewarded, but bluffing gets punished. [03:19], [03:28]
- **Tencent's 1B OCR Beats Giants**: Tencent released Hunyuan OCR, a 1 billion parameter OCR expert model, and this tiny model is beating major multimodal giants like Qwen3-VL-4B and Gemini 2.5 Pro on OCR-centric tasks. [06:05], [06:16]
- **End-to-End OCR Without Pipelines**: Hunyuan OCR is built as a single end-to-end model. You give it an image, and in one forward pass it handles text spotting, document parsing, information extraction, translation, and even VQA without relying on any external modules. [06:47], [07:00]
Topics Covered
- Math Demands Self-Verifiable Reasoning
- Reward Honesty Over Bluffing in AI
- End-to-End OCR Crushes Pipelines
- 1B Model Beats Giant VLMs
Full Transcript
So, DeepSeek woke up and decided to drop a math model that performs at International Math Olympiad gold-medal level. And Tencent dropped a 1 billion parameter OCR model that is somehow beating massive VLMs five or six times its size. It is one of those moments where you kind of stop for a second and realize how fast everything is evolving. So, let's talk about it.

All right, so DeepSeek Math V2 basically hit the internet out of nowhere. They quietly uploaded it to Hugging Face with no hype, and the crazy part is that this thing might be one of the most impressive math reasoning models that has ever been released publicly. The previous version, the old 7B, already shocked everyone last year when it performed on the level of GPT-4 and Gemini Ultra on math tasks, and that was a tiny model by today's standards. But Math V2 is built on top of DeepSeek V3.2-Exp-Base, and DeepSeek is claiming it outperforms Gemini DeepThink, which is the model Google built specifically to handle structured reasoning. They say it basically operates at IMO gold-medalist capability, and this time it is not just solving problems. It is checking its own work like a professional mathematician.
And that is the thing you need to understand about this model. Most AI math systems care only about one thing, the final answer. Did it get it right or wrong? But that is not how real math works. You cannot just spit out a number and call it a day. The process is what matters. The rigor, the logic, the derivations. That is how math competitions are graded, and that is how real proofs are judged in the academic world. DeepSeek noticed that accuracy-only systems hit a ceiling. They do great on benchmarks like AIME, but they collapse when asked to show a proper rigorous proof. You can trick your way into the right final number without actually understanding anything.

So, DeepSeek Math V2 was designed around one big principle, self-verifiable reasoning. Not just answer the question, but prove it, check it, and admit your mistakes. They built the whole framework around this student, teacher, supervisor concept that is honestly one of the smartest training structures we have seen for mathematical AI. First, they
built the examiner, a dedicated proof verification model. Think of it as the grader in an olympiad. It does not care only about the final answer. It reads through the entire proof, grades it, explains what is good, what is missing, and what is flat-out wrong. And it does not grade in a binary way. It uses a three-point system. One point for a perfect, rigorous derivation, 0.5 for "you are kind of right, but sloppy," and zero for logical errors or missing steps. And yes, the model has to write comments like a real grader.

Then DeepSeek realized something funny. Sometimes the teacher gets things wrong. The examiner might hallucinate an error or randomly penalize a proof for no reason. That happens even with large models. So they added a meta-verifier, or as DeepSeek describes it, a supervisor. The supervisor's job is not to check the proof. It checks whether the teacher's comments actually make sense. This extra layer massively boosts accuracy because the system does not just trust one model's judgment. It gets cross-checked.
Then comes the really interesting part. The student, which is the generator model, does not just generate a proof. It also has to grade itself. Right after it outputs the reasoning, it writes a self-evaluation. And here is where DeepSeek went for something bold. The model gets rewarded for honesty, not just correctness. If it makes a mistake and honestly admits the flaw, it gets rewarded. If it tries to bluff its way through with, "Yeah, everything is fine," it gets punished. This forces the model to actually think through its proof, reflect on weak spots, and fix problems instead of hallucinating confidence.

And all of this builds toward their final idea, a fully automated closed loop where the system basically evolves itself without needing armies of human mathematicians grading thousands of proofs. The student generates lots of solutions to a problem. The teacher grades all of them. They vote on the results. The ones that are hard to grade or solve become new training data. The teacher gets sharper.
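The honesty idea can be sketched as a toy reward function. The 50/50 weighting and the function shape below are illustrative assumptions, not the paper's actual formula; the point is only that an admitted flaw scores higher than a bluffed one.

```python
def honesty_reward(examiner_score: float, self_reported_score: float) -> float:
    """Toy reward combining correctness with honest self-assessment.

    examiner_score:      the cross-checked examiner's grade (0.0 / 0.5 / 1.0).
    self_reported_score: the student's own grade of its proof.
    The 0.5/0.5 weights are illustrative, not from the paper.
    """
    correctness = examiner_score
    # Honesty term: the closer the self-assessment is to the examiner's
    # verdict, the larger the bonus.
    honesty = 1.0 - abs(examiner_score - self_reported_score)
    return 0.5 * correctness + 0.5 * honesty

# A flawed proof honestly flagged as flawed beats one bluffed as perfect:
assert honesty_reward(0.0, 0.0) > honesty_reward(0.0, 1.0)
```

Under this shaping, confidently wrong self-evaluations are strictly worse than honest ones, which is exactly the incentive the video describes.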
The student gets sharper. The whole ecosystem levels up together, and the results are insane. On IMO-ProofBench, which is a brutal set of Olympiad proof problems, DeepSeek Math V2 hits nearly 99% on the basic benchmark. On the advanced benchmark, it is slightly below Gemini DeepThink, but still at IMO gold-level performance. On the 2024 Putnam test, which is notoriously difficult, it scores 118 out of 120. That is essentially a near-perfect score. You almost never see an open model hit numbers like that. And the
bigger takeaway here is not just wow, it solves hard problems. The real breakthrough is the framework. Reinforcement learning for reasoning usually relies on final-answer correctness as the reward. But this system breaks that limitation. It rewards reasoning quality, logic, and the ability to detect its own mistakes, which is something general LLMs struggle with. As a result, hallucinations drop massively. The chain of thought becomes more stable, and the model becomes much more aligned with how mathematicians actually work. DeepSeek is basically saying that if we want AI to handle real math, real proofs, not multiple-choice puzzles, we need models that can verify reasoning, not just generate it. And Math V2 is one of the first models showing how far this approach can actually go.

But real quick, if you've been following all this AI news and thinking, "Okay, this is cool, but what can I actually do with it?" you're definitely not alone. That's why we created the AI Income Blueprint. It shows you seven ways regular people are using AI to build extra income streams on the side. No tech skills needed, and you can automate everything pretty easily. The guide contains simple, proven methods using tools I often talk about on this channel. Download it free by clicking the link in the description.
Now, let's switch gears, because Tencent also dropped something that targets a completely different area, and it is just as impressive. They released Hunyuan OCR, a 1 billion parameter OCR expert model, and this tiny model is beating major multimodal giants on OCR-centric tasks. Models like Qwen3-VL-4B, Gemini 2.5 Pro, and even some commercial APIs. This should not be possible at this size, but Tencent pushed an insane amount of engineering into this system. Let's break down what makes it special.

Hunyuan OCR is built very differently from most OCR systems out there. Usually, you have a big pipeline with a bunch of steps. Detect the text, slice it out, recognize it, try to rebuild the layout, and hope the pieces line up. Tencent basically said, "Why are we still doing this?" and packed everything into a single end-to-end model. You give it an image, and in one forward pass it handles text spotting, document parsing, information extraction, translation, and even VQA without relying on any external modules.
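The contrast between the two designs can be sketched in a few lines. Everything here is hypothetical pseudostructure, not Tencent's actual API: the function names, the prompt format, and the task tags are all assumptions made for illustration.

```python
# Traditional OCR pipeline: several stages, each a potential failure point.
def pipeline_ocr(image, detector, recognizer, layout_engine):
    boxes = detector(image)                        # 1. find text regions
    texts = [recognizer(image, b) for b in boxes]  # 2. read each region
    return layout_engine(boxes, texts)             # 3. try to rebuild structure

# End-to-end style: one model, one forward pass, task selected by prompt.
def end_to_end_ocr(model, image, task: str):
    # task might be "spot", "parse", "extract", "translate", or "vqa";
    # the "<task:...>" tag syntax is invented for this sketch.
    return model(image, prompt=f"<task:{task}>")
```

In the pipeline version, an error in the detector poisons every later stage; in the end-to-end version, there is simply no chain of tools to break.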
That is the part that makes this model feel so clean, because there is no chain of tools that can break. The backbone is where things get really clever. The visual encoder starts from a SigLIP v2 400M foundation, but Tencent expanded it so it can take images at their original resolution and aspect ratio instead of forcing everything into a square crop. That matters a lot in real-world OCR because documents come in every shape and size. Long receipts, wide tables, multi-column pages, weird screenshots, whatever. The model breaks images into patches that match the original proportions so it does not lose structure. And this is one of the reasons it works so well on long text lines, complex layouts, and low-quality scans.

After the image is processed, Hunyuan OCR uses an adaptive connector module that basically compresses the visual tokens into something shorter and more manageable without throwing away the important text-heavy details. That keeps the language model light and fast because it does not have to process thousands of unnecessary tokens. Then there is the language model itself, just 0.5B parameters, but equipped with something they call XD-RoPE. Instead of treating everything like a flat sequence of tokens, it splits positional understanding across four dimensions. The text itself, the height of the page, the width of the page, and time for video frames. So essentially, it understands how things are placed on the page and how they connect spatially.
That is why it can parse multi-column PDFs, follow cross-page flows, handle tables and forms, and even read moving subtitles in video frames without switching modes.

Training this model was a massive multi-stage process, but in simple terms, Tencent fed it a mix of pure text, synthetic OCR data, multilingual samples, hard documents, and massive long-context corpora. They gradually increased the context window all the way to 32K, so it can handle long documents without collapsing. And after all the supervised learning, they pushed it further using reinforcement learning with verifiable reward signals. The model gets rewarded only when its outputs are perfectly aligned with ground-truth structure, the right bounding boxes, the right text, or accurate translations. If it outputs broken JSON or drifts off format, it gets zero reward. That is why its structured outputs stay so clean.
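A format-strict verifiable reward of this kind is easy to sketch. The field names and the all-or-nothing matching below are assumptions for illustration; the real system presumably checks bounding boxes and translations with more nuance.

```python
import json

def ocr_reward(model_output: str, ground_truth: dict) -> float:
    """Toy verifiable reward: valid JSON matching the reference, or nothing."""
    try:
        parsed = json.loads(model_output)   # broken JSON -> zero reward
    except json.JSONDecodeError:
        return 0.0
    # All-or-nothing: text, boxes, everything must match the reference.
    return 1.0 if parsed == ground_truth else 0.0

gt = {"text": "Invoice #42", "bbox": [10, 20, 300, 48]}
assert ocr_reward(json.dumps(gt), gt) == 1.0
assert ocr_reward('{"text": "Invoice #42"', gt) == 0.0  # malformed JSON
```

Because the check is mechanical, it can be computed for millions of samples with no human grader, which is what makes this kind of RL signal scalable.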
And the results honestly do not make sense for a 1B model. On Tencent's internal benchmark of 900 OCR images across nine categories, it hits 70.92 overall, beating systems like PaddleOCR, BYU OCR, and even general-purpose VLMs like Qwen3-VL-235B and Seed Vision. On OmniDocBench, which is one of the hardest public document understanding benchmarks, it scores 94.1 overall with really strong numbers on formulas and tables, too. These are performance levels you normally expect from models several times larger.

It holds up when everything gets messy, too. On the wild OmniDocBench, where documents are printed, folded, and recaptured under terrible lighting, it still scores over 85. On DOC ML, which covers 14 languages outside English and Chinese, it hits 91.03 and sets state-of-the-art results across the whole set. It nails information extraction tasks with more than 92% accuracy. It scores 860 on OCRBench, outperforming other small models like DeepSeek OCR and sitting very close to models like Qwen3-VL-2B and Gemini 2.5 Pro. And it even won first place in the ICDAR 2025 DIMT competition for English-to-Chinese document translation in the small-model category. And all of this from a model with only 1 billion parameters running end-to-end without extra modules.

That is why Hunyuan OCR feels like a turning point. We are seeing the rise of these compact OCR specialists that replace huge pipelines with a single streamlined model. They are small enough for production use.
They handle over 100 languages, and they are already beating much larger general-purpose vision language models on the tasks that actually matter in the real world. Watching this shift happen feels like the most exciting part of the whole AI race right now.

So, here is something to think about. Which direction do you see winning long-term? Highly specialized small models or giant all-in-one systems? Drop your take in the comments. I read every one of them. Make sure to subscribe and hit like if you enjoyed the video. Thanks for watching. Catch you in the next one.