Qwen-3-Coder-Next is an 80-billion-parameter model that is 159.4GB in size. That's roughly how much RAM you would need to run it, and that's before thinking about long context windows. And this is not considered a big model: rumor has it that frontier models have over 1 trillion parameters, which would require at least 2TB of RAM. The last time I saw that much RAM in one machine was never.

But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy? That's the magic of quantization.

In this post, you are going to learn:

- How a model's parameters make it so big
- How floating point precision works and how models sacrifice it
- How to compress floats using quantization
- How to measure model quality loss after quantization

If you already know what parameters are and how floats are stored, feel free to skip straight to quantization.

## What makes large language models so large?

Parameters, also called "weights," make up the majority of what an LLM is when it's in memory or on disk. In my prompt caching post I wrote that LLMs are an "enormous graph of billions of carefully arranged operations." What do those graphs look like?

Let's start with the simplest example: 1 input, 1 parameter, 1 output. It doesn't look like much, but this is the fundamental building block of modern AI. It takes the input of 2.0, multiplies it by the parameter 0.5, and gets the output 1.0.

LLMs, though, are much bigger: in practice they have billions of these parameters. One of the ways they get so big is that they have "layers." Here's how that looks.

The two nodes in the middle are a layer. They both show 1.0 because both connections have a parameter of 0.5, so each is the result of 2.0 * 0.5. Every connection between two nodes gets a parameter, so we have 4 in total above. When 2 connections end at the same node, their values are added together. So to get the output of 1.5, we add together 1.0 * 1.0 and 0.5 * 1.0.
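The arithmetic here is small enough to check by hand, but a short Python sketch makes it concrete: first the back-of-the-envelope model-size estimate (assuming 2 bytes per parameter, as in bf16, which lines up with the ~160GB figure), then the four-parameter network's forward pass. The specific weight values are the ones from the diagrams above.

```python
# Rough size estimate: 80 billion parameters at 16-bit precision
# (2 bytes per parameter, e.g. bf16) is about 160GB.
params = 80e9
bytes_per_param = 2
size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.1f} GB")  # 160.0 GB

# The tiny network from the diagrams:
# input 2.0 -> two hidden nodes -> one output, 4 parameters total.
x = 2.0
w_hidden = [0.5, 0.5]               # input -> hidden parameters
hidden = [x * w for w in w_hidden]  # each node computes 2.0 * 0.5 = 1.0
w_out = [1.0, 0.5]                  # hidden -> output parameters
# Connections ending at the same node are summed:
output = sum(h * w for h, w in zip(hidden, w_out))
print(output)  # 1.0 * 1.0 + 1.0 * 0.5 = 1.5
```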