This one leans more technical than our usual Sci-Fi Saturday fare. Stick with it; we get to the sci-fi by the end.

What KV Cache Actually Is

Someone types a forty-three-character question into ChatGPT, some throwaway query about dinner recipes or the capital of Mongolia. Before the first word of the response appears, those characters have been split into tokens, and each token multiplied through billions of parameters to produce three vectors: a query, a key, and a value. The key-value pairs land in GPU memory, where they sit physically as bytes on a chip. That stored state is the model's awareness of the conversation, not a metaphor but a memory address.

The key-value cache exists for a practical reason. Without it, generating each new token would require reprocessing every previous token in the conversation from scratch; a 2,000-token exchange would mean re-reading the entire history 2,000 times. The KV cache eliminates that redundancy. Once a token's key-value pair is computed and stored, it stays, and each new token only needs to attend to what's already cached. Total computation drops from quadratic in the conversation length to linear.

Sebastian Raschka's LLM Architecture Gallery visualizes this mechanism across dozens of model families, and the numbers attached to each architecture make the weight tangible. GPT-2's KV cache costs 300 KiB per token in Raschka's comparison. That means a 4,000-token conversation occupies roughly 1.2 GB of GPU memory just for the cache, separate from the model weights themselves. Micron's engineering blog describes the KV cache as the point where "buzzword meets bottom line," and they're right.
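If you want to see the mechanism rather than take it on faith, here is a minimal numpy sketch of single-head attention during decoding. Everything in it is illustrative: toy dimensions, random stand-in embeddings, and weight names like `Wq`, `Wk`, `Wv` are my own, not any real model's internals. The point is just that the cached path appends one key-value row per token instead of re-projecting the whole history, and both paths produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # toy head dimension (real models: 64-128 per head, many heads)
T = 5          # tokens "generated" so far
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((T, d))   # stand-in token embeddings

def attend(q, K, V):
    # Scaled dot-product attention of one query over all cached keys/values.
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V

def step_no_cache(history):
    # Without a cache: re-project K and V for the ENTIRE history every step.
    K = history @ Wk
    V = history @ Wv
    q = history[-1] @ Wq
    return attend(q, K, V)

# With a KV cache: project only the newest token; its K/V rows append and stay.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for t in range(T):
    x = tokens[t]
    K_cache = np.vstack([K_cache, x @ Wk])   # one new row per token
    V_cache = np.vstack([V_cache, x @ Wv])
    outputs.append(attend(x @ Wq, K_cache, V_cache))

# Same answer either way; only the amount of recomputation differs.
assert np.allclose(outputs[-1], step_no_cache(tokens))
```

The cached loop does a constant amount of projection work per token, which is where the quadratic-to-linear claim comes from: the uncached version re-reads all `t` previous tokens at step `t`.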
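The footprint arithmetic is worth doing by hand once. The 300 KiB-per-token figure is Raschka's; the general formula at the bottom is a common approximation for how that per-token cost arises, not something taken from his gallery, and exact layouts vary by model and precision.

```python
# Back-of-envelope KV-cache sizing using the per-token figure quoted above.
KIB = 1024
per_token_bytes = 300 * KIB            # GPT-2 figure from Raschka's comparison
context_tokens = 4_000

cache_bytes = per_token_bytes * context_tokens
print(f"{cache_bytes / 1e9:.2f} GB")   # 1.23 GB -- the "roughly 1.2 GB" above

# Where a per-token cost comes from, approximately:
# 2 tensors (K and V) x layers x heads x head_dim x bytes per element.
def kv_bytes_per_token(layers, heads, head_dim, bytes_per_elem=2):
    return 2 * layers * heads * head_dim * bytes_per_elem
```

Plug in any architecture's layer count, head count, and head dimension and the formula makes clear why long contexts on large models are a memory problem before they are a compute problem.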
Every conversation has a physical cost measured in bytes, in watts, in cooling costs, in dollars per hour of GPU rental. Your conversation weighs something and takes up space, and when the session ends, that space gets reclaimed and everything stored there vanishes.

How Memory Evolved

The way models handle their KV cache has changed four times in six years, and each change says somethin...