Musk Retweets Kimi Paper Sparking Major Silicon Valley Discussion: What's the Next Battleground for Attention?

On March 16, 2026, the Kimi team uploaded a paper titled “Attention Residuals” to arXiv, and the discussion quickly took on a life of its own. Elon Musk reposted it, Karpathy commented, “We haven’t really taken the title ‘Attention is All You Need’ seriously,” and OpenAI’s Jerry Tworek responded with just three words: deep learning 2.0. The last time a structural paper from a Chinese team sparked this level of discussion in Silicon Valley was arguably DeepSeek-V3.

But while the buzz is lively, most discussions remain at the level of “Kimi created something new, the experts are excited.” What’s overlooked is that on the same day, ByteDance’s Seed team and Huazhong University of Science and Technology jointly published another paper called Mixture-of-Depths Attention (MoDA), which addresses the exact same problem but with a completely different approach. Within the same week, Nanjing University’s Dilxat Muhtar, MPI’s Shiwei Liu, and others released a third paper titled “When Does Sparsity Mitigate the Curse of Depth in LLMs,” providing the most precise pathological report from a theoretical perspective.

Three papers appearing in close succession target the same issue. This is no coincidence. An overlooked structural problem that has persisted for nearly a decade has finally reached a critical point where it must be addressed.

The issue isn’t about the sequence dimension of attention. Over the past few years, attention has evolved through many generations—from multi-head attention to grouped query attention, to DeepSeek’s MLA, and various sparse variants—each optimizing how tokens see each other. This arms race has been exciting, but it obscures a fact—the way information is passed between layers has remained the same since the 2017 Transformer paper: residual connections, h = h + f(h), a simple addition operation without any learnable parameters.

All outputs from previous layers are summed equally. No choices, no forgetting, no learning. Each layer’s contribution is treated equally, whether it learns key features or noise.
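In code, that unchanged inter-layer mechanism is a one-liner. A toy NumPy sketch (all names here are illustrative, not from any of the papers):

```python
import numpy as np

def residual_block(h, f):
    """The residual update used since ResNet and the 2017 Transformer:
    the sublayer output is added to the hidden state with no learned
    weighting, no selection, and no forgetting."""
    return h + f(h)  # h = h + f(h): equal, unconditional addition

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1        # stand-in for a sublayer's weights
sublayer = lambda h: np.tanh(h @ W)      # stand-in for attention or FFN

h = rng.normal(size=(4, 8))              # 4 tokens, hidden size 8
h_next = residual_block(h, sublayer)
```

Every layer's contribution enters the sum with the same implicit weight of 1, which is exactly the property the three papers take issue with.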

Residual connections are the most successful “temporary solution” in the history of deep learning.

The most successful temporary solution

Residual connections were proposed by Kaiming He and colleagues in the 2015 ResNet paper. The idea is extremely simple: once a network gets deeper than about twenty layers, vanishing gradients make it hard to train, and the deep parameters nearly stop updating. So each layer gets a “highway” that lets the input skip directly to the output. Even if a layer learns nothing, information and gradients can still flow through this shortcut. The effect was immediate: ResNet pushed networks from about twenty layers to more than a hundred. Two years later, the Transformer adopted residual connections unchanged. No one has really modified the design since.

Not that no one tried. Variants like ReZero, FixUp, and Highway Networks introduced learnable residual weights. But none became mainstream in large models because residual connections are just too useful: simple, stable, and with minimal computational overhead. At the scale of models at that time, their side effects hadn’t yet been exposed.
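The ReZero family of variants can be sketched in a few lines. This is a simplified illustration of the idea, not the published implementation:

```python
import numpy as np

class ReZeroBlock:
    """ReZero-style residual (sketch): h + alpha * f(h), where alpha is a
    learnable scalar initialized to zero, so every layer starts out as the
    exact identity and only 'turns on' if training finds it useful."""
    def __init__(self, f):
        self.f = f
        self.alpha = 0.0   # learnable in a real model; a plain float here

    def __call__(self, h):
        return h + self.alpha * self.f(h)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
block = ReZeroBlock(lambda h: h @ W)
h = rng.normal(size=(4, 8))
out_init = block(h)   # equals h exactly, since alpha starts at 0
```

One scalar per layer is a tiny change, but it already gives the network a way to weight contributions unequally, which plain residuals cannot do.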

44% of layers are idle

What are the side effects? In early 2025, Shiwei Liu’s team from Westlake University, Emory, and MPI published “The Curse of Depth.” In March this year, Muhtar et al. from Nanjing University provided a quantitative diagnosis in “When Does Sparsity Mitigate the Curse of Depth in LLMs”: under current mainstream architectures, the transformations computed by deep layers increasingly approximate the identity mapping. The input is almost the same as the output, meaning the layer is effectively absent.

The numbers are stark. Researchers used a “usefulness score” to measure whether each layer performs meaningful transformations. In a 12-layer model, all layers are active. In a 16-layer model, three are dead. In a 24-layer model, nine are dead. In a 32-layer model, 14 are dead—44% of layers learn almost nothing. The parameter count increased from 900 million to 2.3 billion, a 156% increase in budget, but the effective number of layers only increased from 12 to 18.
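A simple proxy for such a score, purely for intuition (the papers’ exact metric may differ), is the relative size of the change a layer makes to its input:

```python
import numpy as np

def identity_distance(h_in, h_out, eps=1e-8):
    """A plausible 'usefulness' proxy (illustrative; not the papers' exact
    metric): the relative norm of the change a layer makes to its input.
    Near zero means the layer is approximately the identity map, i.e.
    effectively dead."""
    return np.linalg.norm(h_out - h_in) / (np.linalg.norm(h_in) + eps)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
dead = identity_distance(h, h + 1e-4 * rng.normal(size=(4, 8)))  # near-identity layer
alive = identity_distance(h, h + rng.normal(size=(4, 8)))        # substantial change
```

A layer whose score stays near zero throughout training is contributing parameters and FLOPs but almost no transformation.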

Quantitative diagnosis of the curse of depth—effective layers grow with model size but with diminishing efficiency

This is directly related to how residual connections work. Each layer’s output is added via residual to a “main trunk.” As layers increase, the signals on this main trunk accumulate (think of it as background noise rising). But the new signals generated at each layer are limited in magnitude. In deep layers, the new signals are drowned out by the background noise, making the input and output nearly identical—these layers are effectively dead.
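A toy simulation makes the dilution argument concrete (this is my illustration, not an experiment from either paper): if each layer adds an independent update of fixed norm, the trunk norm grows roughly like the square root of the depth, so each new contribution shrinks relative to it.

```python
import numpy as np

# Each "layer" adds an independent unit-norm update to the trunk.
# Because random high-dimensional vectors are nearly orthogonal, the
# trunk norm grows ~sqrt(L), and the relative size of every new
# contribution shrinks with depth -- the dilution effect.
rng = np.random.default_rng(0)
d = 256
trunk = np.zeros(d)
relative = []
for _ in range(64):
    update = rng.normal(size=d)
    update /= np.linalg.norm(update)              # each layer contributes norm 1
    trunk = trunk + update
    relative.append(1.0 / np.linalg.norm(trunk))  # new signal vs. trunk so far
# relative[0] is 1.0; by layer 64 the same-size update is a small
# fraction of the accumulated trunk.
```

The same-magnitude update that dominated layer 1 is background noise by layer 64, which is why deep layers drift toward the identity.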

Residual connections solve the problem of “passing gradients,” but create the problem of “making deep layers meaningful.”

In the era of large models, this cost is real money. A single layer involves billions of floating-point operations, so a 128-layer model with 44% dead layers wastes nearly sixty layers’ worth of compute. The community has spent years optimizing inference efficiency (quantization, distillation, pruning, sparse attention, KV cache compression), all of it squeezing more out of computation that is at least doing useful work.

The biggest efficiency black hole isn’t the quadratic complexity of attention, but a simple addition operation that has remained unchanged since 2015.

Adding depth dimension to attention

ByteDance’s Seed team took a different route. They didn’t modify residual connections but added a second dimension to the attention mechanism itself.

Standard Transformer attention operates only along the sequence dimension—each token in the current layer attends to other tokens’ key-value pairs within the same layer. MoDA’s modification is intuitive: it also includes the key-value pairs from previous layers as candidates. When a token performs attention at layer L, it can see not only other tokens in the same layer but also directly revisit the key-value pairs from layers 1 through L-1. The sequence and depth dimensions are jointly normalized under the same softmax.
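A single-head sketch of the joint normalization idea, ignoring MoDA’s kernel optimizations and routing entirely (function and variable names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def depth_attention(q, kv_per_layer):
    """Single-head sketch of depth-extended attention: the candidate set
    is the union of key/value pairs from the current and all earlier
    layers, and sequence and depth share ONE softmax (the joint
    normalization idea). The real MoDA operator is far more optimized."""
    K = np.concatenate([k for k, _ in kv_per_layer])   # (L*T, d)
    V = np.concatenate([v for _, v in kv_per_layer])
    scores = q @ K.T / np.sqrt(q.shape[-1])            # (T, L*T)
    weights = softmax(scores)                          # normalized over seq x depth
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d, L = 4, 8, 3
kv = [(rng.normal(size=(T, d)), rng.normal(size=(T, d))) for _ in range(L)]
out, w = depth_attention(rng.normal(size=(T, d)), kv)
```

Because one softmax spans both dimensions, a token can trade attention between a neighbor in the same layer and its own representation five layers ago.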

The idea is straightforward, but the challenge is how to implement it efficiently without slowing down the process.

MoDA’s dual-dimension attention—joint normalization over sequence and depth dimensions

Feeding every historical layer’s key-value pairs into attention would explode the cost: in a 32-layer model, layer 32 would attend over key-value pairs from all 31 previous layers, effectively multiplying the attended sequence length by 32. MoDA’s core engineering answer is a “grouped reordering” strategy: select only a subset of historical layers’ key-value pairs and rearrange them into contiguous memory, so the GPU can run efficient matrix multiplications.

Specifically, MoDA introduces a “depth stream” mechanism. Instead of attending to all historical layers, it uses a learnable routing to select the most relevant layers—similar to Mixture-of-Experts, where only certain experts are activated dynamically. Here, the “experts” are different depth layers of history.
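The routing step can be sketched as a top-k selection over layer scores. This is an illustrative analogy to MoE gating; MoDA’s actual router is learned end-to-end and the names below are hypothetical:

```python
import numpy as np

def route_depth(summary, layer_embeddings, k=2):
    """MoE-style depth routing sketch: score each historical layer
    against a summary of the current hidden state and keep only the
    top-k as key/value candidates, instead of attending to all of them."""
    scores = layer_embeddings @ summary          # one score per past layer
    chosen = np.argsort(scores)[-k:][::-1]       # indices of the k best layers
    return chosen, scores[chosen]

rng = np.random.default_rng(0)
L, d = 8, 16
layer_emb = rng.normal(size=(L, d))   # one embedding per historical layer
summary = rng.normal(size=d)          # e.g. a pooled current-layer state
chosen, s = route_depth(summary, layer_emb, k=2)
```

As with MoE, the selection keeps the candidate set (and hence the extra compute) constant even as total depth grows.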

At a sequence length of 64K, MoDA’s operator reaches 97.3% of FlashAttention-2’s efficiency. Adding the entire depth attention mechanism only slows it down by less than 3%.

Grouped reordering—moving scattered historical layer key-value pairs into contiguous memory

On a 1.5B-parameter model (trained with the OLMo2 recipe), MoDA improves average performance across 10 downstream tasks by 2.11%, with only 3.7% additional compute. That may look modest, but it comes from the architecture alone, not from more data or longer training. And MoDA’s effect grows with model size: larger models suffer more from depth degradation, so the correction becomes more pronounced.

Performance comparison of MoDA across 10 downstream tasks

Interestingly, MoDA interacts with Post-Norm in a notable way. Most large models use Pre-Norm (layer normalization before the sublayer) even though Post-Norm (normalization after the sublayer) tends to perform better, because Post-Norm is less stable during training. MoDA’s deep key-value mechanism provides an extra gradient pathway that alleviates Post-Norm’s instability.

The combination of MoDA + Post-Norm opens the possibility that the previous compromise—using Pre-Norm for training stability—may no longer be necessary.

Validation loss differences between Pre-Norm and Post-Norm after adding deep key-value

Revisiting old paths instead of forging new ones

MoDA doesn’t modify residual connections; it adds an alternative route outside of residuals. On the same day, Kimi’s team published Attention Residuals (AttnRes), which takes a more direct approach by directly modifying the residual connection itself.

Standard residuals simply sum all previous layer outputs equally into the main path—no choices, no forgetting. AttnRes replaces this fixed equal addition with an attention operation: each layer uses its own state as a query, with all previous layers’ outputs as candidates, and uses attention to determine which features are useful and their weights.
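The core replacement can be sketched per token, with the projection matrices a real implementation would use omitted for brevity (names are mine, not from the AttnRes paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(h, history):
    """Sketch of attention-weighted residuals: instead of summing all
    previous layer outputs equally, each token uses its current state as
    a query over the stack of past outputs and mixes them with learned
    weights (query/key/value projections omitted in this toy version)."""
    H = np.stack(history)                                    # (L, T, d)
    scores = np.einsum('td,ltd->tl', h, H) / np.sqrt(h.shape[-1])
    w = softmax(scores)                                      # per-token weights over depth
    return h + np.einsum('tl,ltd->td', w, H), w

rng = np.random.default_rng(0)
T, d = 4, 8
history = [rng.normal(size=(T, d)) for _ in range(5)]        # 5 earlier layers
out, w = attn_residual(rng.normal(size=(T, d)), history)
```

Compare this with the fixed `h + f(h)` update: the sum over history is now weighted per token and per layer, so the network can amplify a useful shallow feature or suppress a noisy one.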

Residual connections are transformed from a fixed formula into a learnable, dynamic routing.

AttnRes’s core idea—using attention to replace equal-weight residual addition

The cost is that each layer must perform an additional deep attention computation, which is not cheap. Kimi’s team used a block strategy (Block AttnRes) to control costs: dividing layers into blocks, performing full deep attention within each block, and only aggregating at the block level between blocks.
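The cost structure of the block strategy can be seen from which candidates a layer attends to. This is my reading of the scheme, with hypothetical names:

```python
def block_candidates(layer, block_size):
    """Block AttnRes cost sketch: a layer attends over all earlier layers
    inside its own block, plus one aggregate per completed earlier block,
    rather than over every individual past layer."""
    block = layer // block_size
    within = list(range(block * block_size, layer))  # same-block predecessors
    aggregates = list(range(block))                  # one summary per past block
    return within, aggregates

# Layer 10 with block size 4 sits in block 2: it sees layers 8 and 9
# directly, plus two block-level aggregates, instead of all ten predecessors.
w, a = block_candidates(10, 4)
```

The candidate count thus grows with the number of blocks rather than the number of layers, keeping the extra deep attention affordable.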

AttnRes has been integrated into Kimi Linear (48 billion total parameters, 3 billion activated), pre-trained on 1.4 trillion tokens, with consistent results across model scales. The paper has already been widely covered, so the details are omitted here; it is mentioned mainly to contrast its approach with MoDA’s.

Training curves and ablation experiments for AttnRes

Both approaches diagnose the same root cause: deep layers repeatedly dilute shallow information through residual updates. But they cut at different points. MoDA leaves residual connections untouched but adds a depth dimension to attention, allowing deep layers to bypass residual flow and directly access shallow features. AttnRes directly modifies residuals, replacing the fixed addition with attention weighting—one “adding a new route,” the other “renovating the existing one.”

Both papers appeared on the same day, taking different routes to the same problem. This is no coincidence: the diagnosis of attention’s depth problem has become community consensus, and what now differs is the treatment.

Effectiveness of AttnRes across different model scales

Unremoved scaffolding

Returning to the initial question: why has the deep-layer idling problem only been seriously addressed in 2026?

Because residual connections are too useful. They solved the most urgent problem at the time (vanishing gradients) with manageable costs (deep degradation was not obvious in small models). Alternative solutions like ReZero and Highway Networks had not been validated at large scale. No one was motivated to change them. They weren’t intentionally designed choices but temporary scaffolding that was forgotten over time—people built the structure, then forgot to remove the scaffolding, and over the years, everyone thought it was a load-bearing wall.

Residual signal dilution—deeper layers receive weaker signals

But what truly made this problem hard to detect wasn’t residuals themselves, but the fact that attention mechanisms have long operated in only one dimension. Over the past eight years, all attention evolutions—multi-head, grouping, sparsity, linear attention—focused on the sequence dimension. How tokens see each other has been optimized countless times. But how layers see each other? No one has asked. The depth dimension has been a blind spot for attention.

MoDA and AttnRes open this blind spot from different angles. MoDA adds a second dimension to attention, enabling it to operate simultaneously across sequence and depth. AttnRes turns inter-layer information transfer into an attention operation itself. Different routes, same conclusion: attention shouldn’t only look horizontally; it should also look vertically.

The implications extend beyond these two papers. Transformers still contain many fixed mechanisms that operate in only a single dimension. Every layer must execute in sequence and cannot be skipped. Attention heads are computed independently and simply concatenated, with no dynamic coordination between them. Every token, regardless of difficulty, follows the same computational path. These designs were originally engineering compromises made so that training would converge at all.

The evolution of deep learning over the past decade, seen from a high level, boils down to one thing: handing more and more structural decisions from human designers back to the model itself. Hand-designed convolution kernels gave way to learnable attention. Fixed positional encodings gave way to relative and rotary encodings. Fixed expert assignment gave way to learnable routing. Now the flow of information along the depth dimension is also starting to be decided by attention itself.

Karpathy said we haven’t truly taken the literal meaning of “Attention is All You Need” seriously. He might be right. But not in the sense of “attention alone is enough,” rather “attention has not been used enough.” It has evolved many generations along the sequence dimension, but in the depth dimension, it’s just beginning.

Depth is the next battleground for attention.

Source: Tencent Technology

