How Residual Connections Are Getting an Upgrade [mHC]
By Jia-Bin Huang
Summary
Topics Covered
- Deeper models can underperform due to training difficulty
- Residual connections solve vanishing gradients via identity shortcuts
- Hyper-connections generalize residuals with learnable feature routing
- Composite feature mixing matrices destabilize deep networks
- Sinkhorn-Knopp algorithm projects matrices to prevent gradient explosion
Full Transcript
The representational power of a deep neural network comes from stacking non-linear layers.
Intuitively, deeper models should learn more complex and expressive representations and yield better results.
But, deeper neural networks are more difficult to train.
When we stack more layers, deeper models often have higher training errors, leading to poor performance on test data as well.
This is counterintuitive.
For example, we could construct a deeper model that performs at least as well as the shallow one, by simply duplicating the layers from the shallow network and using additional layers to perform the identity mapping.
So we know the solution exists.
It's just that the optimization is difficult.
To address this issue, researchers introduced the Residual Learning Framework.
Rather than hoping that the layers will directly learn the desired mapping, we explicitly design them to fit a residual mapping instead.
That is, the difference between the desired output and the input. Here,
each layer is a non-linear function with its own parameters W_l.
But why do the residual connections make the training easier?
Let's break it down.
The output of the first layer x2 is the input x1 plus the residual mapping F of x1.
Similarly.
Here's the output of the second layer x3.
We can replace x2 with x1 and the residual mapping F of x1.
Repeating this process recursively, we can represent the feature at any deeper unit L as the feature at any shallower unit, plus a sum of residual functions.
This identity mapping ensures that features can be directly propagated from one unit to any other unit.
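The recursion above can be checked numerically. Here is a minimal numpy sketch, with tanh layers standing in for arbitrary residual functions F (the weights and dimensions are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_layers = 8, 5

# Hypothetical residual blocks: F_l(x) = tanh(W_l @ x) with small random weights.
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(num_layers)]

x1 = rng.standard_normal(d)
x, residuals = x1, []
for W in Ws:
    F = np.tanh(W @ x)      # residual mapping F(x_l, W_l)
    residuals.append(F)
    x = x + F               # identity shortcut: x_{l+1} = x_l + F(x_l)
xL = x

# x_L equals the shallow feature x_1 plus the sum of all residual mappings,
# so features have a direct additive path between any two units.
assert np.allclose(xL, x1 + np.sum(residuals, axis=0))
```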
This design has a nice backward propagation property.
The gradient at the shallow unit consists of two components.
First, the component that is directly propagated from the deeper unit L, allowing the gradient information to flow directly.
Second, the component that propagates through the intermediate weight layers.
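Written out, the two components look like this (a sketch of the standard identity-mapping analysis, with $\mathcal{E}$ denoting the training loss):

```latex
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)

\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right)
```

The constant 1 inside the parentheses is the directly propagated component, and the second term is the component flowing through the intermediate weight layers.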
This alleviates the vanishing gradient problem and enables us to train deeper models.
The residual connection design can be found everywhere, including models that understand images and language, play Go, and predict proteins' 3D structures.
It can be applied to virtually any type of layer, such as convolutions, attention, and multilayer perceptrons.
But, this design is somewhat restricted as there's only one residual path.
A recent promising extension is to enable the network to learn the strengths of its residual connections.
Here is how it works.
Instead of using one single residual path, we expand the input features by n times.
This approach can be viewed as having n parallel residual streams. But, it is too expensive to run the layers n times.
Therefore, we first aggregate the input features x1, x2, x3, x4 using a weighted combination.
The layer then processes these aggregated features to produce an output.
Since there's only one output residual mapping from the layer, we expand it four times using learnable scalings.
We then add these rescaled residuals to their respective residual paths, resulting in the final output of the layer.
However, in this setup, the amount of information exchange between the residual streams is quite limited.
To address this, we introduce a learnable linear transformation between the residual streams. Here the transformation is parametrized by a 4x4 weight matrix.
We can interpret the transformation as a feature router.
By specifying the weights, we form flexible connection patterns that route features from one residual stream to another.
The weights can take any values, allowing the model to flexibly form linear combinations of features across multiple streams. Similarly, the aggregation and expansion weights allow the model to adaptively control how information is combined and distributed across the multiple residual streams. This is the intuition behind Hyper-Connections.
Now let's write this down more formally.
We stack the features from multiple streams along the row dimension.
Each feature is a d-dimensional vector, so the matrix x_l has dimensions n times d.
Here, the expansion rate n is four.
This is the input to the l-th layer.
Feature aggregation involves computing a weighted combination of features from the residual streams. We can concisely represent this as a matrix multiplication.
After computing the output of the layer, we expand and distribute the features to multiple residual streams.
We can write this as multiplying an (n by 1) weight vector with the (1 by d) input features.
Feature mixing across multiple streams is achieved using an (n by n) matrix.
Finally, the output is just the addition of h and z.
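Putting these pieces together, here is a minimal numpy sketch of one hyper-connections layer. The specific weights a, b, M and the tanh stand-in for the layer are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                            # expansion rate and feature dimension

X = rng.standard_normal((n, d))        # n stacked residual streams, (n x d)

# Illustrative static weights (in the paper these can depend on the input).
a = rng.random((1, n)); a /= a.sum()   # (1 x n) aggregation weights
b = rng.random((n, 1))                 # (n x 1) expansion weights
M = np.eye(n)                          # (n x n) feature-mixing matrix

def layer(h):                          # stand-in for an attention/MLP block
    return np.tanh(h)

h = a @ X                              # aggregate streams  -> (1 x d)
z = b @ layer(h)                       # expand layer output -> (n x d)
X_next = M @ X + z                     # mix streams and add the residuals
assert X_next.shape == (n, d)
```

With M set to the identity and n = 1, this reduces to a standard residual connection.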
This is called hyper-connections, proposed in a paper by ByteDance earlier in 2025.
But we don't want to learn these linear mappings directly as fixed parameters, because then they would not adapt to the input of the layer.
Therefore, hyper-connections introduce a parametrization where the values of the linear mappings are dynamically conditioned on the input.
Here there are two types of trainable parameters.
The global ones that do not depend on the input, referred to as static mappings,
and the input-dependent parameters, referred to as dynamic mappings.
Hyper-connections show promising results over standard residual connections.
Hyper-connections show up to 1.8x faster convergence and higher accuracy across multiple benchmarks compared to the baseline.
This is great.
However, when DeepSeek attempted to adapt this technique for their model training, they observed significant training instability.
Why is this happening?
Here is the equation for how a single layer of hyper-connections computes its output.
Recursively extending to multiple layers, we can represent the features at a deeper layer as two terms. The first term corresponds to the features at a shallow layer, successively transformed by the feature mixing matrices across the intermediate layers.
The second term consists of the sum of the outputs from all previous residual functions.
Let's compare these with a standard residual connection for clarity.
In residual connections we have an identity mapping.
This identity mapping is essential for facilitating smooth information propagation.
But in hyper connections we have a composite mapping of all the feature mixing matrices between the shallow and deep layers.
Let's illustrate this problem with a simple one dimensional example.
Suppose the initial value is one, and the feature at layer L is obtained by successively multiplying by a scalar h.
Let's see how the value evolves over the layers.
When the value of h is one, we have an identity mapping: the output at layer L is still one.
However, if we increase the value of h to 1.1, the output value exhibits a dramatic 100-fold increase.
Conversely, when h is less than one, the output value rapidly decays as it passes through the layers.
The instability becomes even more severe when h takes on negative values, causing the output to oscillate dramatically.
But how do we stabilize the training?
To do so, we need to ensure that these linear mappings are well-behaved.
For example, the feature mixing matrix is unconstrained, so its entries can take arbitrary values.
The key idea is to make this matrix a doubly stochastic matrix; the set of such matrices is known as the Birkhoff polytope.
The requirement is that all elements must be positive, and each row and column must sum to one.
The resulting composite residual mapping across multiple layers then also remains doubly stochastic, because the product of doubly stochastic matrices is doubly stochastic.
This effectively stabilizes the training, preventing exploding or vanishing gradients. But how do we make this matrix doubly stochastic?
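We can sanity-check the closure property numerically. By Birkhoff's theorem, a convex combination of permutation matrices is doubly stochastic, which gives an easy way to construct exact examples (the construction here is illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def random_doubly_stochastic():
    # Convex combination of permutation matrices (Birkhoff's theorem).
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(5)]
    w = rng.random(5)
    w /= w.sum()
    return sum(wi * P for wi, P in zip(w, perms))

A, B, C = (random_doubly_stochastic() for _ in range(3))
composite = A @ B @ C  # composite mixing matrix across three layers
assert np.allclose(composite.sum(axis=0), 1.0)  # columns still sum to 1
assert np.allclose(composite.sum(axis=1), 1.0)  # rows still sum to 1
assert (composite >= 0).all()                   # entries stay non-negative
```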
The first step is to make all the elements positive.
To achieve this, we apply an exponential function to each individual element.
The exponentiation ensures that each output is strictly positive and increases monotonically with its input.
But now when we sum up all the rows and all the columns, the values are not one.
We use a simple iterative algorithm that alternately rescales all rows and all columns of the matrix to sum to one.
Here is how it works.
First, we rescale the sum of each column to one.
But after scaling, the sum of each row is still not one.
Therefore, we rescale the sum of each row to one.
But after rescaling all rows to sum to one, that changes the sum of the columns.
Fortunately, we can apply the alternating rescaling iteratively.
With only a few iterations, we make the feature mixing matrix very close to a doubly stochastic matrix.
It's very simple.
To make this sound more academic, we refer to this process as projecting the feature mixing matrix onto the manifold of doubly stochastic matrices.
Basically make them well-behaved.
This algorithm is known as the Sinkhorn-Knopp algorithm.
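The procedure just described fits in a few lines; here is a minimal numpy sketch (matrix size and iteration count are illustrative assumptions):

```python
import numpy as np

def sinkhorn_knopp(logits, num_iters=50):
    """Project a raw matrix toward the set of doubly stochastic matrices."""
    M = np.exp(logits)                        # step 1: make every entry positive
    for _ in range(num_iters):
        M = M / M.sum(axis=0, keepdims=True)  # rescale each column to sum to 1
        M = M / M.sum(axis=1, keepdims=True)  # rescale each row to sum to 1
    return M

rng = np.random.default_rng(0)
M = sinkhorn_knopp(rng.standard_normal((4, 4)))
assert np.allclose(M.sum(axis=1), 1.0)             # rows sum to 1 exactly
assert np.allclose(M.sum(axis=0), 1.0, atol=1e-6)  # columns very close to 1
```

Because the loop ends with a row rescale, the row sums are exactly one and the column sums converge toward one as iterations increase.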
For the other two matrices, the DeepSeek paper also made some slight adjustments in terms of their parametrization.
Here are the parameterizations from the hyper-connections paper discussed before.
Here is the one from DeepSeek.
The key difference is that they changed the activation function from tanh to sigmoid.
The primary reasons are twofold.
First, this avoids negative coefficients.
Second, the new parameterization ensures that the aggregation and expansion weights are bounded: the values cannot be larger than one (or two, for the rescaled expansion weights).
Compared to the original parameterizations, the new design makes the linear mappings more well-behaved.
But why is there a scalar two here?
Remember that this is the weight we used to rescale the layer output before adding back to each residual stream.
At initialization, the parameter alpha and the bias vector b are set to small values.
This means that initially the input to the sigmoid is very close to zero, and therefore the output is very close to 0.5.
Multiplying it by two ensures that at the beginning of the training, the hyper-connections behave exactly like the standard residual connections.
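This initialization argument is easy to verify numerically (the near-zero pre-activations below are assumed for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# With alpha and b initialized near zero, the sigmoid inputs are near zero.
pre_activations = np.array([1e-3, -2e-3, 5e-4, 0.0])

aggregation = sigmoid(pre_activations)      # bounded in (0, 1), ~0.5 at init
expansion = 2.0 * sigmoid(pre_activations)  # bounded in (0, 2), ~1.0 at init

# Each stream initially rescales the layer output by roughly 1, so the
# hyper-connections start out behaving like plain residual connections.
assert np.allclose(expansion, 1.0, atol=0.01)
```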
Now we understand how hyper connections work and how to stabilize the training.
But the expanded residual streams have a significant GPU memory footprint and slow down training due to excessive memory I/O.
The DeepSeek paper proposes three efficient infrastructure designs to mitigate this problem.
First, to optimize memory and computation, they reorder normalization, fuse operations with shared memory access, and write specialized kernels.
This greatly reduces redundancy and resource bottlenecks.
Second, they free intermediate activations after the forward pass and recompute them during the backward pass as needed, reducing memory usage through efficient block sizing and pipeline synchronization.
Third, they schedule pipeline and kernel executions to maximize hardware usage by overlapping computation and communication.
Overall, these designs enhance training efficiency at scale.
When using an expansion rate of four, the training overhead only increases by 6.7%.
Standard residual connections have remained unchanged for a long time, so it's exciting to see promising advancements like hyper connections.
I look forward to seeing more results and potentially wider adoption in the future.
Thanks for watching and I will see you next time.