Flash-MoE: Running a 397B Parameter Model on a Mac with 48GB RAM

https://news.ycombinator.com/rss Hits: 28
Summary

Flash-MoE: Running a 397B Parameter Model on a Laptop

Read the paper — full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397-billion-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second, with production-quality output including tool calling. The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

| Configuration | tok/s | Quality | Notes |
| --- | --- | --- | --- |
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |
| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. Breaks JSON/tool calling. |
| 2-bit peak single token | 7.05 | Good* | Warm cache burst. Not suitable for tool use. |

*2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Machine: MacBook Pro, Apple M3 Max
Chip: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE
Memory: 48 GB unified (~400 GB/s bandwidth)
SSD: 1TB Apple Fabric, 17.5 GB/s sequential read (measured)
macOS: 26.2 (Darwin 25.2.0)

The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.

SSD Expert Streaming — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel pread() with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed ("Trust the OS" principle). Inspired by Apple's "LLM in a Flash" paper.

FMA-Optimized Dequant Kernel — The inner loop of the 4-bit dequantizing matrix-vector multiply rearranges the math from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x).
Pre-...

First seen: 2026-03-22 12:50

Last seen: 2026-03-23 16:08