
Mercury 2: The First Reasoning Diffusion Language Model (1,000+ tokens/sec)

By Developers Digest

Summary

Topics Covered

  • Diffusion Enables Built-in Error Correction
  • Mercury 2 Matches GPT-4o Mini on Benchmarks
  • Solve Speed at Model Level, Not Hardware
  • Diffusion Powers Agentic Latency Wins

Full Transcript

Inception Labs has just released Mercury 2, a reasoning model that does over a thousand tokens per second. The crazy part: it's built on diffusion, not autoregressive generation. Let me show you what that means. If you've been following my channel, you might remember that I covered the original Mercury model when it first came out; that video broke down how diffusion models could work for text generation.
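For intuition, here's a toy sketch of the difference between committing to one token at a time and diffusion-style parallel refinement. This is purely illustrative and not Inception's actual algorithm: `toy_denoise`, the masking scheme, and the use of a known `target` as a stand-in for a trained denoiser's predictions are all my own simplifications.

```python
import random

MASK = "_"

def toy_denoise(target, steps=4, seed=0):
    """Toy illustration of diffusion-style generation: start from an
    all-masked sequence and fill in several positions per pass, in
    parallel, instead of committing to one token at a time.
    `target` stands in for what a trained denoiser would predict."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # each pass resolves roughly half the remaining positions at once
        k = max(1, len(masked) // 2)
        for i in rng.sample(masked, k):
            seq[i] = target[i]
    # final pass: resolve anything still masked
    seq = [t if t != MASK else target[i] for i, t in enumerate(seq)]
    return seq

print(toy_denoise(["the", "cat", "sat"]))  # -> ['the', 'cat', 'sat']
```

A real diffusion LM would also be able to *re*-mask and revise positions it had already filled, which is the error-correction property discussed later in the video; this sketch only shows the parallel-fill aspect.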

Now, think back a couple of years to when fast inference first came on the scene with specialized hardware from companies like Groq. Everyone got excited about the raw inference speed, and rightfully so, but the models that could run at that kind of speed were generally pretty limited: they couldn't do tool calling very well, they struggled with complex reasoning, and they scored lower across most

benchmarks. It was speed at a real cost. Mercury 2 is completely different. This is the first reasoning diffusion LLM: it's not autoregressive, it's built on diffusion, the same fundamental approach that already won in image and video generation. The people who built those diffusion methods are the ones who founded Inception Labs, and now they're

applying those same techniques to large language models. So first off, what makes this different from just fast inference? The speed comes from the model itself, not better hardware optimization. To break this down a little: diffusion generates multiple tokens per forward pass instead of one, and that's not just an incremental improvement; it's a fundamentally different approach to how

generation works. And that's why diffusion and reasoning actually work well together: diffusion revisits and refines its output during generation, so it has built-in error correction. Autoregressive models commit to each token and move on; if they make a mistake early on, it can cascade into the subsequent steps of what the

LLM generates, whereas diffusion can catch and fix mistakes as it goes, across the whole output. And the numbers back it up: Mercury 2 completes reasoning much faster than other models out there. Comparing throughput, it does over a thousand tokens per second, Haiku does 89, and GPT-5 Mini does about 71. But speed without

quality doesn't matter. Mercury achieves speed without compromising on quality: it ties GPT-5 Mini on AIME 2025 at 91.1 and scores competitively on GPQA and LiveCodeBench across the board. Now for a quick demonstration. On the left-hand side I have Haiku 4.5 selected, and on the right-hand side I have Mercury 2. The thing to know with Mercury 2 is that you're going

to be able to select different levels of reasoning; if you're using this from the API, you can select instant, low, medium, or high. Right off the bat you'll notice that this is much, much faster, but where these capabilities increasingly come into play is with the model's tool use. I want to do a quick demonstration of a little agentic application that I built. Okay, to demonstrate this

further, I'm going to say: open up a browser, go to Hacker News, find the top stories related to AI, and once you've found them, summarize what each of them is, then find the comments and what everyone is saying about each particular story. I'm going to go ahead and send in this task. The really great thing with a model like this is that, by having the inference speed

be as fast as it is, all of the different tool calls within an application occur much, much quicker. Additionally, any context that we either have to generate for those tools, or that we extract from what we're asking for, comes back much quicker because of that faster inference, so we're able to move through the task much faster. Now, the other thing to know

with the model is that it has 128,000 tokens of context, so you're going to be able to ingest an awful lot of context for the different tasks you're asking for. If you're interested in any of the code I'm showing here, I'll put a link to it in the description of the video shortly after it goes live. Now to dive into some of the details. First

thing right off the bat: if you want to try this out, they have an OpenAI-compatible API. You can swap in the base URL for Inception, along with the model string and API key, and you'll be able to try it within any application where you're using an OpenAI model. The demonstration I just showed you was built with the AI SDK from

Vercel, so you'll be able to incorporate it easily into any of the agentic frameworks you're using. In terms of use cases, as you saw me demonstrate, you can use this with tool calling, structured outputs, and RAG. It has 128,000 tokens of context, like I mentioned, and it's to be priced at 25 cents per million tokens of

input and 75 cents per million tokens of output. To put this price into perspective, that makes it one of the most cost-competitive models out there, especially given its speed. If we consider the intelligence, speed, and price dynamics together, this is going to be a very compelling option for a whole host of applications. As I demonstrated, this model is really going to shine in

latency-sensitive applications: anything that involves an agent loop, where every tool call adds to wait time. Think things like voice interfaces, where p95 latency really determines whether the experience feels natural at all. Also coding workflows and iteration cycles: you're going to be able to prompt, review, and tweak in rapid succession. And even with

chat-based, consumer-facing applications, similar to the one I showed you, you're going to benefit from a much faster model; the experience just feels that much more compelling when speed backs up the actual capabilities of your application. Now, to touch on how diffusion LLMs work in comparison to what we're used to: every LLM that you

use today is autoregressive: it generates one token at a time, sequentially. Token one is locked before token two begins; if the reasoning drifts early, too bad, it can only move forward. Think about your experience with ChatGPT or Claude: you know the output builds sequentially on what just came before. Diffusion models are

completely different. Instead of generating left to right, they start with noise and iteratively refine the output in parallel. You can see this in image and video generation models: before the final output, you'll see a rough representation of the image that gets finer and finer as it goes through more cycles. So you can think of it like this: autoregressive is like a typewriter, where each

keystroke is permanent, whereas diffusion is like an editor looking at the entire document; it starts with a rough draft and sharpens the whole thing with each pass. Okay, now if I take a look at Artificial Analysis, the entire industry is racing to solve the inference problem: OpenAI, NVIDIA, Fireworks, Groq, you name it. Basically

everyone out there: billions and billions of dollars are being spent to make models faster. NVIDIA recently acquired Groq, for instance, for 20 billion dollars, in large part for their fast inference speed. But everyone is working within the autoregressive paradigm: better hardware, better kernels, quantization, distillation. Real gains, but they're all

incremental; you're squeezing more out of the same fundamental approach. This is where diffusion models and Inception took a fundamentally different path: they solved the speed bottleneck at the model level, not at the infrastructure level. And with reasoning and agentic workflows becoming the norm, really table stakes, in 2026, sequential generation compounds latency. Think about it:

every step in an agent loop adds more wait time, whereas with Mercury 2 you don't have to choose between reasoning and speed anymore; you can effectively have both within your application. So just to sum up: Mercury 2 is the first reasoning diffusion large language model, it's five times faster than speed-optimized autoregressive models with competitive quality, and it's a completely different approach to how AI generates text. So whether this

becomes the future of how LLMs work, I definitely don't know, but the results are real and here today, and the people behind it literally invented the techniques we see in technologies like Sora, Stable Diffusion, Flux, and all of the other diffusion models out there. The same techniques behind all of the beautiful images and videos we see,

they're applying to language models today. So if you're interested in trying this out, I encourage you to check out the API platform and the playground; the link will be in the description of the video. Go try it, see it for yourself. And if you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.
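Since Mercury 2 exposes an OpenAI-compatible API, trying it is mostly a matter of pointing an existing client at a different base URL. Here's a minimal sketch, with the caveat that the base URL, model string, and reasoning-level field below are my assumptions rather than documented values; check Inception's API documentation for the real ones.

```python
import json

# Sketch of a chat-completions request to Mercury 2's
# OpenAI-compatible API. NOTE: BASE_URL, the model string, and the
# "reasoning_effort" field are placeholders, not confirmed values.
BASE_URL = "https://api.inceptionlabs.ai/v1"  # assumed endpoint

payload = {
    "model": "mercury-2",  # placeholder model string
    "messages": [{"role": "user", "content": "Summarize diffusion LLMs."}],
    "reasoning_effort": "high",  # instant / low / medium / high, per the video
}

def estimate_cost(input_tokens, output_tokens):
    """Cost at the quoted $0.25 / $0.75 per million input/output tokens."""
    return input_tokens / 1e6 * 0.25 + output_tokens / 1e6 * 0.75

print(json.dumps(payload, indent=2))
print(estimate_cost(1_000_000, 1_000_000))  # -> 1.0
```

The payload can be sent with any OpenAI-compatible client (the video's demo used Vercel's AI SDK) by swapping in the base URL and your API key; the cost helper just applies the quoted per-million-token pricing.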
