Pool spare GPU capacity to run LLMs at larger scale. Models that don't fit on one machine are automatically distributed: dense models via pipeline parallelism, MoE models via expert sharding with zero cross-node inference traffic. Have your agents gossip across the mesh, sharing status, findings, and questions without a central server.

Try it now: a live console connected to a public mesh. Chat with models running on real hardware.

## Install (macOS Apple Silicon)

```sh
curl -fsSL https://github.com/michaelneale/mesh-llm/releases/latest/download/mesh-bundle.tar.gz | tar xz && mv mesh-bundle/* ~/.local/bin/
```

Then run:

```sh
mesh-llm --auto                # join the best public mesh, start serving
```

That's it. It downloads a model suited to your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.

Or start your own mesh:

```sh
mesh-llm --model Qwen2.5-32B   # downloads model (~20GB), starts API + web console
mesh-llm --model Qwen2.5-3B    # or a small model first (~2GB)
```

Add another machine:

```sh
mesh-llm --join <token>        # token printed by the first machine
```

Or discover and join public meshes:

```sh
mesh-llm --auto                # find and join the best mesh
mesh-llm --client --auto       # join as an API-only client (no GPU)
```

Every node gets an OpenAI-compatible API at http://localhost:9337/v1.

## Automatic distribution

Distribution is automatic. You just say `mesh-llm --model X` and the mesh figures out the best strategy:

- Model fits on one machine? → runs solo, full speed, no network overhead.
- Dense model too big? → pipeline parallelism: layers split across nodes.
- MoE model too big? → expert parallelism: experts split across nodes, zero cross-node traffic.

If a node has enough VRAM, it always runs the full model. Splitting only happens when it has to.

**Pipeline parallelism.** For dense models that don't fit on one machine, layers are distributed across nodes in proportion to VRAM. llama-server runs on the highest-VRAM node and coordinates the others via RPC. Each rpc-server loads only its assigned layers from local disk.
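The proportional split described above can be sketched as follows. This is a minimal illustration of the idea, not mesh-llm's actual allocator: the function name and the largest-remainder rounding are assumptions introduced here to make the example concrete.

```python
def split_layers(total_layers, vram_gb):
    """Assign layers to nodes in proportion to their VRAM (illustrative sketch)."""
    shares = [v / sum(vram_gb) * total_layers for v in vram_gb]
    base = [int(s) for s in shares]          # round every share down first
    leftover = total_layers - sum(base)
    # Hand the remaining layers to the nodes with the largest fractional parts,
    # so the assignments always sum to total_layers.
    order = sorted(range(len(shares)), key=lambda i: shares[i] - base[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base

# Three nodes with 24, 16, and 8 GB of VRAM splitting a 64-layer model:
print(split_layers(64, [24, 16, 8]))  # → [32, 21, 11]
```

The invariant that matters is that every layer lands on exactly one node; the highest-VRAM node (the llama-server coordinator) naturally ends up with the largest slice.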
**Latency-aware.** Peers are selected b...
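Since every node exposes an OpenAI-compatible endpoint, any standard client works against it. A minimal sketch using only the standard library; the model name here is an assumption, use whatever your mesh is actually serving:

```python
import json
import urllib.request

# Every mesh-llm node serves an OpenAI-compatible API at this address.
url = "http://localhost:9337/v1/chat/completions"
payload = {
    "model": "Qwen2.5-3B",  # assumption: substitute the model your mesh runs
    "messages": [{"role": "user", "content": "Hello from the mesh!"}],
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once a node is running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Because the API follows the OpenAI wire format, existing SDKs and tools should also work by pointing their base URL at `http://localhost:9337/v1`.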