# DeepSeek's mHC: When Residual Connections Explode

January 11, 2026

Every transformer you've ever used has the same residual connection design from 2016. GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: $x + F(x)$. One stream of information flowing through the network, with each layer adding to it.

DeepSeek asked: what if it were wider?

## The Setup

Standard residual connections are the backbone of every modern transformer. The idea is simple:

$$x_{l+1} = x_l + F(x_l)$$

The input flows through unchanged, plus the layer's output. One stream of information. What goes in comes out, plus a learned update. This is why transformers can be hundreds of layers deep: the gradient has a clean path backward. Simple. Stable. Unchanged since 2016.

Hyper-Connections take a different approach. Instead of one stream, expand to $n$ parallel streams with learnable mixing matrices:

$$x_{l+1} = H^{res}_l x_l + H^{post,T}_l F(H^{pre}_l x_l, W_l)$$

*[Diagram: a standard residual block, one stream passing through F, beside a Hyper-Connection block, n streams mixed by H_res, H_pre, and H_post around F.]*

Three matrices control how information flows:

- **H_res**: how streams mix in the residual path (the red crossings in the diagram)
- **H_pre**: how streams combine before entering the layer
- **H_post**: how the layer's output distributes back to the streams

More expressive. More parameters with negligible computational overhead. Better performance, in theory. The problem? Those mixing matrices are unconstrained. They can amplify signals, not just route them. (A minimal code sketch of this step appears at the end of the post.)

## The Explosion

Under aggressive learning rates, Hyper-Connection (HC) signal amplification in my reproduction hit 7x before the run eventually collapsed. Amax, the maximum of a matrix's absolute row and column sums, measures how much that matrix can amplify a signal, and per-layer amplification compounds across depth:

| Layer | HC per-layer gain | HC cumulative | mHC per-layer gain | mHC cumulative |
|-------|-------------------|---------------|--------------------|----------------|
| 1     | 1.1x              | 1.10          | 1.0x               | 1.00           |
| 2     | 1.2x              | 1.32          | 1.0x               | 1.00           |
| 3     | 1.15x             | 1.52          | 1.0x               | 1.00           |
| 4     | 1.1x              | 1.67          | 1.0x               | 1.00           |
| 5     | 1.2x              | 2.00          | 1.0x               | 1.00           |
| …     | …                 | …             | …                  | …              |
| 60    |                   | 304x          |                    | 1.0x           |

After 60 layers, unconstrained HC has amplified the signal 304x; constrained mHC still sits at 1.0x.
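To make the update concrete, here is a minimal PyTorch sketch of one Hyper-Connection step as the formula above describes it. The module name, the `(n, batch, d)` layout, and the identity-style initialization are my assumptions for illustration, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Sketch of one HC step: x_{l+1} = H_res x + H_post^T F(H_pre x).

    x carries n parallel residual streams, shape (n, batch, d). The
    mixing matrices act only on the stream axis; F is the usual
    attention/MLP block.
    """

    def __init__(self, n_streams: int, layer: nn.Module):
        super().__init__()
        self.layer = layer
        # Unconstrained learnable mixing matrices -- the HC setup the
        # post describes; nothing stops them from amplifying signals.
        self.H_res = nn.Parameter(torch.eye(n_streams))                  # streams -> streams
        self.H_pre = nn.Parameter(torch.ones(1, n_streams) / n_streams)  # streams -> layer input
        self.H_post = nn.Parameter(torch.ones(1, n_streams))             # layer output -> streams

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path: mix the n streams with H_res.
        residual = torch.einsum("ij,jbd->ibd", self.H_res, x)
        # Pre-mix: collapse the n streams into the single input F expects.
        layer_in = torch.einsum("ij,jbd->ibd", self.H_pre, x).squeeze(0)
        # Post-mix: distribute F's output back across the n streams.
        update = torch.einsum("ji,jbd->ibd", self.H_post,
                              self.layer(layer_in).unsqueeze(0))
        return residual + update
```

With `n_streams = 1` and all three matrices fixed at 1, this reduces exactly to the standard residual $x + F(x)$.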
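And a short sketch of the Amax metric as the post defines it, plus the compounding arithmetic behind the table. The function name and the flat 1.1x-per-layer toy schedule are mine; the 304x endpoint matches the numbers above.

```python
import torch

def amax(H: torch.Tensor) -> float:
    """Amax: the larger of the matrix's maximum absolute row sum
    (its infinity-norm) and maximum absolute column sum (its 1-norm).
    Since ||H||_2 <= sqrt(||H||_1 * ||H||_inf), Amax <= 1 guarantees
    the matrix cannot stretch any signal."""
    row_max = H.abs().sum(dim=1).max().item()
    col_max = H.abs().sum(dim=0).max().item()
    return max(row_max, col_max)

print(amax(torch.eye(4)))        # 1.0 -- pure routing, no amplification
print(amax(1.1 * torch.eye(4)))  # 1.1 -- a mild 10% amplifier

# Per-layer gains multiply across depth, so "mild" is an illusion:
print(f"{1.1 ** 60:.0f}x")  # 304x after 60 layers, as in the table
print(f"{1.0 ** 60:.1f}x")  # 1.0x -- the constrained (mHC) column
```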