# MSA: Memory Sparse Attention

A scalable, end-to-end trainable latent-memory framework for 100M-token contexts

Paper • [Code](Coming Soon) • [Models](Coming Soon)

Long-term memory is essential for general intelligence, yet the full-attention bottleneck constrains most LLMs' effective context length to 128K–1M tokens. Existing approaches (hybrid linear attention, fixed-size state memory such as RNNs, and external storage such as RAG or agent pipelines) either suffer rapid precision decay and growing latency at extreme scales, lack end-to-end differentiability or dynamic memory maintenance, or require complex pipelines.

We present Memory Sparse Attention (MSA), an end-to-end trainable, scalable sparse latent-state memory framework. Its core ideas are:

- **Scalable sparse attention + document-wise RoPE** (parallel/global), achieving near-linear complexity in both training and inference (sketched below);
- **KV cache compression with a Memory Parallel inference engine**, delivering 100M-token throughput on 2×A800 GPUs (see the tiered-cache sketch below);
- **Memory Interleave** for multi-round, multi-hop reasoning across scattered memory segments (sketched below).

On long-context QA and NIAH (Needle-in-a-Haystack) benchmarks, MSA surpasses same-backbone RAG, best-of-breed RAG stacks, and leading long-context models. Across an unprecedented 16K→100M token range, MSA shows <9% degradation, suggesting a practical path to decoupling memory capacity from reasoning.

**Figure 1: MSA scalability under extreme-long contexts.** Scaling from 16K→100M tokens: MSA fuses top-k selection with sparse attention to remain end-to-end differentiable while allowing document decoupling at inference. On MS MARCO, MSA sustains <9% degradation and exhibits strong extrapolation. Some baseline curves end early due to their context limits.

- **Memory-Sparse Attention (MSA)**: an end-to-end trainable, scalable sparse attention layer with document-wise RoPE, realizing O(L) complexity and <9% degradation from 16K→100M tokens.
- **KV cache compression + Memory Parallel**: tiered storage (GPU-resident routing keys, CPU content K/V), di...
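To make the first idea concrete, here is a minimal sketch of top-k block selection fused with sparse attention and document-wise RoPE. It assumes the memory is chunked into fixed-size blocks, each block is scored by a mean-pooled routing key, only the top-k blocks are attended, and RoPE positions restart at zero inside every block so documents can be encoded and decoupled independently. The block size, top-k value, and pooling rule are illustrative assumptions, not the released implementation.

```python
import torch

def rope(x, positions, base=10000.0):
    # Standard rotary embedding; positions are supplied per token so each
    # document/block can restart at 0 (the document-wise RoPE assumption).
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)
    angles = positions[..., None] * freqs               # [..., seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def memory_sparse_attention(q, k, v, block_size=64, topk_blocks=8):
    # q: [heads, 1, d] current query; k, v: [heads, L, d] cached memory.
    h, L, d = k.shape
    n_blocks = L // block_size
    kb = k[:, : n_blocks * block_size].reshape(h, n_blocks, block_size, d)
    vb = v[:, : n_blocks * block_size].reshape(h, n_blocks, block_size, d)

    # Document-wise positions: every block restarts at position 0.
    block_pos = torch.arange(block_size, device=k.device).float()
    kb = rope(kb, block_pos)
    q = rope(q, torch.zeros(1, device=q.device))

    # Route with mean-pooled block keys, then attend only inside top-k blocks.
    routing_keys = kb.mean(dim=2)                           # [h, n_blocks, d]
    scores = torch.einsum("hqd,hnd->hqn", q, routing_keys)  # [h, 1, n_blocks]
    top = scores.topk(min(topk_blocks, n_blocks), dim=-1).indices[:, 0]  # [h, k]

    idx = top[:, :, None, None].expand(-1, -1, block_size, d)
    k_sel = torch.gather(kb, 1, idx).reshape(h, -1, d)
    v_sel = torch.gather(vb, 1, idx).reshape(h, -1, d)
    attn = torch.softmax(q @ k_sel.transpose(-1, -2) / d**0.5, dim=-1)
    return attn @ v_sel
```

In this simplified version, gradients flow end to end through the keys and values of the blocks that are actually attended, while the routing scores of unselected blocks receive no gradient; the fused top-k/sparse-attention formulation in MSA presumably handles this more carefully, which is not shown here.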
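The tiered-storage idea behind KV cache compression + Memory Parallel can be sketched in the same spirit: compressed per-block routing keys stay resident on the GPU, the full key/value content is offloaded to CPU memory, and only the selected blocks are moved back for attention. The class and method names below are invented for illustration and do not reflect the released engine; sharding the blocks across devices (the "Memory Parallel" part) is omitted.

```python
import torch

class TieredKVCache:
    """Illustrative two-tier KV cache: small routing keys on the GPU,
    full K/V content offloaded to CPU memory."""

    def __init__(self, block_size=64, device="cuda"):
        self.block_size = block_size
        self.device = device
        self.routing_keys = []   # GPU tier: one pooled key per block, [h, d]
        self.blocks = []         # CPU tier: full (K, V) tensors per block

    def append_block(self, k_block, v_block):
        # k_block, v_block: [h, block_size, d] produced during prefill.
        self.routing_keys.append(k_block.mean(dim=1).to(self.device))
        self.blocks.append((k_block.to("cpu"), v_block.to("cpu")))

    def gather(self, q, topk=8):
        # Score blocks on the GPU using only the compressed routing keys,
        # then fetch the selected full blocks from CPU.
        keys = torch.stack(self.routing_keys)           # [n_blocks, h, d]
        # Sum over heads for a single routing score per block (simplification).
        scores = torch.einsum("hd,nhd->n", q, keys)
        picked = scores.topk(min(topk, len(self.blocks))).indices.tolist()
        k_sel = torch.cat([self.blocks[i][0] for i in picked], dim=1)
        v_sel = torch.cat([self.blocks[i][1] for i in picked], dim=1)
        return k_sel.to(self.device), v_sel.to(self.device)
```

Under this layout the GPU holds only one d-dimensional vector per block, so routing over very long memories stays cheap while the bulk of the KV content sits in host memory; how MSA compresses the content tier itself and overlaps the CPU-to-GPU transfers is not specified above and is not modeled here.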
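How Memory Interleave stitches scattered segments together is not spelled out above; the following is only a guess at the general shape of such a step: segments retrieved over several reasoning rounds are merged, deduplicated, and re-ordered by their original memory offsets before being placed ahead of the new query, so multi-hop evidence appears as one coherent context. The `Segment` type and the ordering rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    doc_id: int      # which document the segment came from
    offset: int      # token offset inside that document
    tokens: list     # token ids of the segment

def interleave_memory(rounds: list[list[Segment]], query_tokens: list) -> list:
    """Merge segments retrieved across reasoning rounds, drop duplicates,
    and order them by (doc_id, offset) before appending the current query.
    This ordering rule is an assumption, not MSA's specification."""
    seen, merged = set(), []
    for round_segments in rounds:
        for seg in round_segments:
            key = (seg.doc_id, seg.offset)
            if key not in seen:
                seen.add(key)
                merged.append(seg)
    merged.sort(key=lambda s: (s.doc_id, s.offset))
    context = [tok for seg in merged for tok in seg.tokens]
    return context + query_tokens
```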