
DeepSeek Just CRUSHED Big Tech Again: MHC - Better Way To Do AI

By AI Revolution

Summary

Topics Covered

  • Residual Connections Created Hidden Bottleneck
  • MHC Stabilizes Multiple Streams Mathematically
  • MHC Delivers Reasoning Gains at Scale
  • Optimizations Make 4x Capacity Feasible
  • DeepSeek Signals Compute Workarounds

Full Transcript

Every major AI model today is built on an idea that's more than 10 years old. It works, it scales, and everyone just assumed that's as far as it goes. DeepSeek just dropped a paper that basically says, no, there's another way forward. And if it holds up, it changes how powerful models get built from here on out.

Now, that old design wasn't wrong. It was necessary. Without it, modern AI wouldn't exist at all. But it came with a trade-off. It kept models stable, at the cost of limiting how much information they could move around internally.

To understand what DeepSeek actually did, you need to understand a basic problem that shows up when AI models get bigger. Large language models are made up of layers. A prompt goes into the first layer, that layer does a bit of work, passes the result to the next layer, and so on, until the final layer produces an answer. During training, if the answer is wrong, a signal called a gradient flows backward through all those layers, telling each one how it should adjust. Years ago, researchers realized that forcing gradients to pass through

every single layer can cause problems. Signals can fade away or blow up. To fix

that, they invented something called residual connections. When residual connections were introduced, they didn't only improve models, they rescued deep learning from a very real wall. Before that point, training deep networks was fragile. You could stack layers, but after a certain depth, learning slowed down, gradients vanished, and performance actually got worse. Residual connections changed that overnight.

They gave models a stable shortcut. Information could flow forward and backward without getting distorted, and suddenly training very deep networks became reliable. That success locked the idea in place.
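
To make that shortcut concrete, here is a minimal residual block sketched in PyTorch-style Python. The layer sizes and names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One sublayer wrapped in a residual (skip) connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The "x +" is the residual shortcut: the layer only learns a correction,
        # and gradients can flow straight through the addition on the way back,
        # which is what keeps very deep stacks trainable.
        return x + self.ff(self.norm(x))
```

Stacking dozens of blocks like this is how modern transformers stay trainable; note that every block adds its output onto the same single stream.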

Once something works that well, people stop questioning it. Over time, residual connections stopped being treated as a design choice and started being treated as infrastructure. They were just assumed to be correct. Model builders focused elsewhere: better attention mechanisms, more data, bigger parameter counts, expert routing, scaling laws. The internal flow of information between layers stayed mostly untouched, and that made sense. Residual connections were stable, predictable, and easy to reason about. They did exactly what they were supposed to do.

The trade-off was subtle. Stability came at the cost of flexibility. Information could pass through cleanly, but it was forced through a very narrow path. Everything had to fit through that single residual stream.

For years, that limitation wasn't obvious. Bigger models and more data kept delivering gains. But

as models pushed into harder reasoning tasks, that narrow internal pathway quietly became a bottleneck.

Not because it was broken, but because it was doing exactly what it was designed to do. Now, here's where things get interesting. Over the last couple of years, researchers

started asking a new question. What if, instead of just passing one stream of information through these shortcuts, you passed several streams at once? More internal communication? More capacity? More flexibility? That idea led to something called hyperconnections. Hyperconnections widened the internal data flow. Instead of one residual stream, you have multiple parallel streams interacting with each other. On paper, this looks like a clean upgrade. The model gets more internal workspace, more ways to combine information, and more room to handle multi-step reasoning.
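
To picture what "several streams at once" could look like, here is a rough sketch of an unconstrained hyperconnection-style block in illustrative PyTorch. Real hyperconnection designs differ in detail; the point is the freely learned mixing matrix that blends the parallel streams.

```python
import torch
import torch.nn as nn

class UnconstrainedHyperBlock(nn.Module):
    """Keeps several parallel residual streams and lets them mix freely."""

    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        # Learned stream-mixing matrix with no constraint on its values.
        self.mix = nn.Parameter(torch.eye(n_streams) + 0.01 * torch.randn(n_streams, n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, dim)
        # Blend the streams, then give each one a residual update.
        mixed = torch.einsum("ij,jbsd->ibsd", self.mix, streams)
        # Nothing here stops repeated mixing from amplifying the signal
        # layer after layer, which is exactly the instability described next.
        return mixed + self.ff(mixed)
```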

Early on, training behaves normally. Loss goes down. Metrics improve. Nothing looks obviously wrong. The problem shows up later. As training continues and depth increases, those unconstrained streams start interacting in unstable ways. Signals get amplified layer after layer. Gradients grow larger than expected. Everything still looks fine until suddenly it isn't. Loss curves spike. Gradient norms explode. Training collapses abruptly. Sometimes this happens after 10,000 steps. Sometimes later. The key issue is that it's not gradual. One checkpoint is fine. The next is unusable. That kind

of failure is unacceptable at scale. Large training runs are expensive, slow, and hard to debug. Architectures that collapse late in training are risky, even if they look promising in

small experiments or short runs. This is why hyperconnections never became standard in large production models. The idea itself wasn't wrong. The issue was the lack of control. Once streams are allowed to mix freely, instability becomes inevitable. That's the exact gap DeepSeek focused on.

DeepSeek's new method is called manifold-constrained hyperconnections, or MHC. The name sounds heavy, but the idea behind it is actually pretty straightforward. Instead of letting those internal streams mix however they want, DeepSeek constrained the mixing itself. The key idea is simple.

Streams should be able to exchange information, but the total signal strength must stay constant.

They enforce this by forcing the matrices that mix residual streams to follow strict rules.

Every row sums to one. Every column sums to one. In practical terms, that means information can be redistributed and blended, but never amplified or dampened overall. This preserves

the same identity behavior that made residual connections stable in the first place. Information flows

through the network cleanly. The difference is that now it can also move sideways between streams in a controlled way.
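
A tiny numerical sketch in plain NumPy, with made-up numbers, shows why the row-and-column rule matters: a doubly stochastic mixing matrix can shuffle signal between streams, but it can never change the total.

```python
import numpy as np

# A doubly stochastic mixing matrix: every row and every column sums to one.
M = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])
assert np.allclose(M.sum(axis=0), 1.0) and np.allclose(M.sum(axis=1), 1.0)

streams = np.array([4.0, 1.0, 3.0])  # stand-in magnitudes for three residual streams
mixed = M @ streams

print(mixed)        # blended across streams: [3.3, 2.2, 2.5]
print(mixed.sum())  # total "signal mass" is still 8.0, neither amplified nor dampened
```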

DeepSeek enforces this constraint using the Sinkhorn-Knopp algorithm, which projects the mixing matrices onto a specific geometric space called the Birkhoff polytope. That space has a crucial property. When these matrices are multiplied across layers, which is exactly what happens during deep training, the result remains stable. Signal magnitude stays bounded instead of drifting over time. This is why MHC works where earlier approaches failed. The constraint isn't tuned or approximate. It's structural. Stability is guaranteed by the math itself, not by careful hyperparameter choices. Once that stability is locked in, widening the residual stream becomes practical instead of dangerous. This is the key insight. DeepSeek figured out how to keep the stability of old-school residual connections while still getting the extra capacity of multiple streams. That's why analysts are calling this a striking breakthrough.
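
For intuition on the projection step, here is a bare-bones Sinkhorn-Knopp loop in NumPy. Alternately normalizing rows and columns pushes any positive matrix toward a doubly stochastic one, i.e., onto the Birkhoff polytope. This is a sketch of the classic algorithm, not DeepSeek's actual implementation.

```python
import numpy as np

def sinkhorn_knopp(A: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Push a positive matrix toward the set of doubly stochastic matrices."""
    M = A.copy()
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make every row sum to one
        M = M / M.sum(axis=0, keepdims=True)  # make every column sum to one
    return M

raw = np.exp(np.random.randn(4, 4))      # unconstrained positive scores (e.g., learned)
mix = sinkhorn_knopp(raw)
print(mix.sum(axis=0), mix.sum(axis=1))  # both close to [1, 1, 1, 1]

# Products of doubly stochastic matrices are themselves doubly stochastic,
# so chaining these mixing steps across many layers keeps the signal bounded.
chain = np.linalg.multi_dot([sinkhorn_knopp(np.exp(np.random.randn(4, 4))) for _ in range(32)])
print(chain.sum(axis=0), chain.sum(axis=1))  # still close to one
```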

And this isn't just theory. They actually tested this architecture on real models. They trained language models with 3 billion, 9 billion, and 27 billion parameters using MHC. Then they trained equivalent models using standard hyperconnections. Across eight different benchmarks, the MHC models consistently performed better. The gains were especially noticeable on reasoning-heavy tasks. On GSM-8K, a math reasoning benchmark, the 27 billion parameter model jumped from 46.7 to 53.8. On BBH, a logical reasoning benchmark, it went from 43.8 to 51. On MMLU, which measures general knowledge and understanding, the score improved from 59 to 63.4. These are not tiny changes. At this scale, jumps like that matter.

One reason this works so well is that widening the residual stream effectively gives the model more internal workspace. It's not just stacking more layers or

throwing more parameters at the problem. It's changing how information flows inside the model. That's

a different axis of scaling, and it complements the usual methods like adding more compute or more data.

Of course, widening streams usually comes with a cost. More streams mean more data moving through memory, more pressure on GPUs, and slower training. This is where DeepSeek's engineering work matters just as much as the math. They didn't just propose a clean theoretical idea and stop there. They rebuilt large parts of the training stack to

make this practical. They wrote custom GPU kernels using TileLang to fuse operations together. Instead

of moving data in and out of memory repeatedly, the GPU does more work on each chunk before sending it back. That alone saves a lot of time. They also

use selective recomputation. Rather than storing every intermediate activation for backpropagation, they recompute certain values on the fly during the backward pass. That reduces VRAM usage significantly.
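
The general technique here is activation checkpointing, which PyTorch exposes directly. The sketch below illustrates the idea only; DeepSeek's version is built into their custom kernels.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024, requires_grad=True)

# Instead of storing every intermediate activation inside `block`,
# checkpointing keeps only the input and recomputes the block's
# activations on the fly during the backward pass, cutting VRAM use.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```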

On top of that, they carefully overlapped communication and computation using a scheduling method called DualPipe, hiding data transfer behind normal compute work.
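
DualPipe itself is DeepSeek's own pipeline schedule, but the underlying pattern, launching a communication op asynchronously and doing useful compute before waiting on it, can be sketched with standard PyTorch distributed calls. This is an illustration of the idea, not their scheduler.

```python
import torch
import torch.distributed as dist

def step_with_overlap(grad_bucket: torch.Tensor, layer_input: torch.Tensor, layer):
    # Assumes a process group has already been set up via dist.init_process_group().
    # Kick off the gradient all-reduce without blocking...
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # ...and do real compute while the transfer is in flight.
    out = layer(layer_input)

    # Only wait once the work that could hide the transfer is done.
    handle.wait()
    return out
```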

The result of all this optimization is pretty wild. DeepSeek expanded the effective width of the model's internal data flow by four times, yet the total training time increased by only about 6.7%. Hardware overhead was measured at roughly 6.27%. That's a small price to pay for a 400% increase in internal capacity. This matters because memory access, not raw compute, is one of the biggest bottlenecks in modern AI training. People call this the memory wall. DeepSeek managed to push past it without throwing absurd amounts of hardware at the problem.

Now, zoom out a bit, because DeepSeek already has a reputation for doing things differently. Back in January 2025, they unveiled their R1 reasoning model. That launch rattled the tech industry and even spooked parts of the U.S. stock market. R1 showed that DeepSeek could match top-tier models like OpenAI's o1 reasoning system at a fraction of the cost.

Analysts described it as a Sputnik moment. This new paper reads like a continuation of that story. Wei Sun, a principal analyst at Counterpoint Research, described it as a statement of DeepSeek's internal capabilities. By redesigning the training stack end-to-end and combining unconventional ideas with rapid experimentation, DeepSeek is signaling that compute constraints are not stopping them. They're finding ways around them.

There's also a strategic angle here. DeepSeek published this work openly. They didn't keep it locked behind closed doors. According to Lian Jye Su, chief analyst at Omdia, this openness reflects a growing confidence in the Chinese AI ecosystem. Sharing foundational ideas while still delivering unique value through models is being treated as a competitive advantage, not a weakness. That openness also means competitors are paying attention. Analysts

expect other labs to start experimenting with similar constrained architectures. Once an idea like this is out, it rarely stays isolated for long.

The timing of the paper has also raised eyebrows. DeepSeek is widely believed to be working on its next flagship model, R2. That model was expected in mid-2025, but it got delayed. Reports suggest the founder, Liang Wenfeng, wasn't satisfied with its performance. Shortages of advanced chips also played a role, and those constraints have increasingly shaped how Chinese labs approach training. Interestingly, the paper itself never mentions R2. But DeepSeek has a pattern. Before launching R1, they published foundational research that later showed up in the model. Some analysts think MHC will definitely be part of whatever comes next. Others are more cautious. Wei Sun suggested there might not be a standalone R2 at all, and that these ideas could form the backbone of a future V4 model instead, especially since earlier R1 improvements were already folded into DeepSeek's V3 system.

There's also the question of impact outside China. Business Insider's Alistair Barr pointed out that DeepSeek's recent updates didn't generate much buzz in Western markets. Distribution still

matters, and labs like OpenAI and Google have a massive advantage there. Even the best technical breakthrough struggles if it doesn't reach users. Still, from a technical perspective, MHC is hard to ignore. It addresses two problems at once. It restores stability in wide residual architectures, and it does so in a way that's efficient enough to use at scale.

The paper shows detailed stability analysis, including measurements of gradient norms and signal amplification across dozens of layers. In standard hyperconnections, those values can spike into the thousands. With MHC, they stay close to one, even across deep networks. That's the difference between a model that trains reliably and one that suddenly collapses after 12,000 steps. DeepSeek showed both behaviors side by side, and the contrast is dramatic. And

if widening how information flows inside a model delivers bigger gains than just stacking more layers, what else do we think is solved in AI that actually isn't? Drop your

thoughts in the comments. I'm curious how far you think this kind of architectural shift can really go. If this breakdown was useful, hit like, subscribe for more deep dives like this, and thanks for watching. Catch you in the next one.
