Run a 1T parameter model on a 32gb Mac by streaming tensors from NVMe

https://news.ycombinator.com/rss Hits: 20
Summary

Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities, enabling models that exceed physical memory to run without crashing the system. Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s, or a 40 GB Llama 70B at 0.3 tok/s; vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model: the OS will swap-thrash until the OOM killer intervenes. Hypura solves this by understanding the model architecture:

- Norms and embeddings are tiny but accessed every token, so they are pinned to the GPU.
- MoE expert routing exploits sparsity: only 2 of 8 experts fire per token. Router interception identifies the selected experts in the eval callback, then loads only the needed expert strides from NVMe (a 75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality. Co-activation tracking predicts which experts will fire next for speculative prefetch.
- Dense FFN weights (gate, up, down: roughly 60% of model size) stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident. Prefetch lookahead depth scales automatically with available memory.

The result: models that would crash your machine under naive mmap become runnable, while models that fit in memory run at full Metal GPU speed with zero overhead.

Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

- GPU (Metal): attention layers, norms, embeddings. Fastest access, limited by recommendedMax...
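The router-interception and neuron-cache idea described above can be sketched roughly as follows. This is an illustrative model only, not Hypura's actual code: `select_experts`, `NeuronCache`, and the capacity numbers are invented for the sketch, and the top-2-of-8 routing matches Mixtral 8x7B.

```python
from collections import OrderedDict

TOP_K = 2  # Mixtral 8x7B: 2 of 8 experts fire per token

def select_experts(router_logits, k=TOP_K):
    """Pick the top-k experts from the router's per-token logits."""
    return sorted(range(len(router_logits)),
                  key=lambda e: router_logits[e], reverse=True)[:k]

class NeuronCache:
    """LRU cache of expert weight slices loaded from NVMe.

    Temporal locality across tokens means the same experts tend to
    fire repeatedly, so most lookups hit without touching storage.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.slices = OrderedDict()  # (layer, expert) -> weights
        self.hits = self.misses = 0

    def get(self, layer, expert, load_fn):
        key = (layer, expert)
        if key in self.slices:
            self.hits += 1
            self.slices.move_to_end(key)  # mark as recently used
        else:
            self.misses += 1
            if len(self.slices) >= self.capacity:
                self.slices.popitem(last=False)  # evict LRU slice
            self.slices[key] = load_fn(layer, expert)  # NVMe read
        return self.slices[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In a token loop, `select_experts` would run inside the eval callback on the router output, and only the returned expert indices would be fetched through the cache; everything else on NVMe stays cold, which is where the claimed I/O reduction comes from.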

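The tensor-to-tier placement step can be illustrated with a greedy sketch: rank tensors by access frequency per byte and fill the fastest tier first. The tier capacities, `Tensor` fields, and the greedy strategy itself are assumptions for illustration; Hypura's actual solver and profiled hardware numbers may differ.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_token: float  # how often the tensor is touched

# Hypothetical tier budgets for a 32 GB machine; real values would
# come from hardware profiling (GPU working set, free RAM, NVMe).
TIERS = [
    ("gpu",  18.0),  # Metal working-set budget, GB
    ("ram",  10.0),  # remaining unified memory, GB
    ("nvme", 1e9),   # effectively unbounded; streamed on demand
]

def place(tensors):
    """Greedy placement: the hottest bytes go to the fastest tier.

    Rank tensors by accesses per GB, fill the GPU first, spill to
    RAM, and stream whatever remains from NVMe.
    """
    ranked = sorted(tensors,
                    key=lambda t: t.accesses_per_token / t.size_gb,
                    reverse=True)
    placement = {}
    free = {name: cap for name, cap in TIERS}
    for t in ranked:
        for name, _ in TIERS:
            if t.size_gb <= free[name]:
                placement[t.name] = name
                free[name] -= t.size_gb
                break
    return placement
```

Under this heuristic, tiny every-token tensors (norms, embeddings) naturally land on the GPU while the bulky dense FFN weights spill to NVMe, matching the tiering the text describes.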
First seen: 2026-03-24 17:34

Last seen: 2026-03-25 12:48