Ollama is now powered by MLX on Apple Silicon in preview

https://news.ycombinator.com/rss Hits: 15

Summary

Today, we’re previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple’s machine learning framework. This unlocks new performance to accelerate your most demanding work on macOS.

Fastest performance on Apple silicon, powered by MLX

Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture. This results in a large speedup of Ollama on all Apple silicon devices. On Apple’s M5, M5 Pro and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time to first token (TTFT) and generation speed (tokens per second).

Prefill performance (tokens/s): 1810 with Ollama 0.19, 1154 with Ollama 0.18
Decode performance (tokens/s): 112 with Ollama 0.19, 58 with Ollama 0.18

Testing was conducted on March 29, 2026, using Alibaba’s Qwen3.5-35B-A3B model quantized to `NVFP4` on Ollama 0.19, and Ollama’s previous implementation quantized to `Q4_K_M` on Ollama 0.18. Ollama 0.19 will see even higher performance (1851 tokens/s prefill and 134 tokens/s decode) when running with `int4`.

NVFP4 support: higher quality responses and production parity

Ollama now leverages NVIDIA’s NVFP4 format to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads. As more inference providers scale inference using the NVFP4 format, Ollama users can get the same results they would get in a production environment. It also lets Ollama run models optimized by NVIDIA’s model optimizer. Other precisions will be made available based on the design and usage intent from Ollama’s research and hardware partners.

Improved caching for more responsiveness

Ollama’s cache has been upgraded to make coding and agentic tasks more efficient.
Lower memory utilization: Ollama will now reuse its cache across conversations, meaning less memory utilization and more cache hits when branching when using a shared system prompt w...
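
To make the idea of reusing a cache across conversations concrete, here is a toy sketch in Python. It is not Ollama's implementation: the `PrefixCache` class, the integer token ids, and the node counter standing in for memory use are all hypothetical, chosen only to show why two conversations that branch from a shared system prompt can share cached entries instead of storing them twice.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # One node per cached token; children are keyed by the next token id.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix cache: sequences with a common prefix share trie nodes."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 0  # hypothetical stand-in for memory held by the cache

    def insert(self, tokens):
        """Store a token sequence; return how many leading tokens were cache hits."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on, every remaining token needs a new node.
                child = TrieNode()
                node.children[tok] = child
                self.node_count += 1
            else:
                reused += 1
            node = child
        return reused


# Hypothetical token ids: a 4-token system prompt shared by two conversations.
system_prompt = [101, 102, 103, 104]
conversation_a = system_prompt + [7, 8]
conversation_b = system_prompt + [9, 10, 11]

cache = PrefixCache()
first_reused = cache.insert(conversation_a)   # nothing cached yet
second_reused = cache.insert(conversation_b)  # shared system prompt is a hit

# Entries needed if each conversation kept its own separate cache.
unshared_entries = len(conversation_a) + len(conversation_b)

print(first_reused, second_reused, cache.node_count, unshared_entries)
```

In this sketch the second conversation reuses the four system-prompt entries, so the shared cache holds 9 entries where two separate caches would hold 13. A real inference cache stores per-token attention state rather than bare ids, but the sharing pattern is the same.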

First seen: 2026-03-31 05:17

Last seen: 2026-03-31 19:29
