Summary of "Apple’s New M5 Max Changes the Local AI Story" — a first look at the M5 Max for local AI and developer workflows
A first-look review of the M5 Max MacBook Pro shows meaningful improvements for local AI workloads and developer responsiveness, driven by a new GPU architecture with neural accelerators, faster NVMe, and modest increases in sustained memory throughput. Results vary by model, framework and quantization format; the unified memory cap (128 GB) remains a limiting factor for the largest on-device models.
What’s new / hardware highlights
- M5 Max replaces the M4 Max. Tested unit used a 40 GPU‑core configuration.
- New GPU architecture with neural accelerators in every GPU core.
- Apple’s claimed peaks:
- Up to ~4× peak GPU AI compute vs previous generation.
- Up to 614 GB/s unified memory bandwidth (peak, GPU+CPU).
- Practical limits:
- Max unified memory remains 128 GB (same as M4 Max). M3 Ultra supports up to 512 GB.
- New SSD (PCIe Gen 5) — sequential performance in reviewer tests:
- Reads ≈ 13.6 GB/s+
- Writes ≈ 16 GB/s+
- Random small-file IO also improved modestly.
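The reviewer's exact disk benchmark isn't named in the summary; as a rough stdlib-only sketch of how sequential throughput figures like these can be estimated (buffered I/O, so OS page caching inflates the read number unless the file greatly exceeds RAM or caching is bypassed, e.g. F_NOCACHE on macOS):

```python
import os
import tempfile
import time

def sequential_throughput(total_mb: int = 64, chunk_mb: int = 8) -> tuple[float, float]:
    """Rough sequential (write_MBps, read_MBps) estimate.

    Illustrative sketch, not the reviewer's tool: buffered I/O plus
    fsync, so treat the read figure as an upper bound.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = total_mb // chunk_mb
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        start = time.perf_counter()
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())          # force data to disk before stopping the clock
        write_mbps = total_mb / (time.perf_counter() - start)
    try:
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_mb * 1024 * 1024):
                pass
        read_mbps = total_mb / (time.perf_counter() - start)
    finally:
        os.remove(path)
    return write_mbps, read_mbps
```

Real SSD benchmarks use much larger files and direct I/O to defeat caching; the sketch only shows the shape of the measurement.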
Benchmarks & developer responsiveness
- Browser / JS responsiveness
- Speedometer 3.1 (single‑core): M5 Max scored 60.5 — the highest the reviewer had seen, small gains over M4 and M3.
- Multi‑core CPU work
- Python “Mandelbrot” heavy parallel test (18 vs 16 vs 32 cores):
- M4 Max ≈ 14.6–15 s
- M5 Max ≈ 11.6–11.8 s
- M3 Ultra ≈ 8.5 s (still faster on heavily parallel CPU tasks)
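The reviewer's script isn't included in the summary; a minimal sketch of a Mandelbrot-style test that fans rows out across all CPU cores with `multiprocessing` (resolution and iteration counts are illustrative):

```python
from multiprocessing import Pool
import os

def mandelbrot_row(args):
    """Escape-iteration counts for one row of the Mandelbrot set."""
    y, width, height, max_iter = args
    row = []
    ci = -1.5 + 3.0 * y / height
    for x in range(width):
        cr = -2.0 + 3.0 * x / width
        zr = zi = 0.0
        n = 0
        while n < max_iter and zr * zr + zi * zi <= 4.0:
            zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
            n += 1
        row.append(n)
    return row

def render(width=200, height=200, max_iter=100):
    # Each row is an independent unit of work, so the job scales
    # with core count -- which is why the M3 Ultra's 32 cores still win.
    with Pool(os.cpu_count()) as pool:
        return pool.map(mandelbrot_row,
                        [(y, width, height, max_iter) for y in range(height)])
```

Call `render()` from a `if __name__ == "__main__":` guard when using the spawn start method (the macOS default).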
Local-LLM (local AI) — key concepts
Two important stages in model inference:
- Prompt processing / prefill (PP)
- Heavy on compute/GPU; benefits from GPU compute and neural accelerators.
- Token generation (TG)
- More sensitive to sustained memory bandwidth (throughput).
- Storage speed affects model load time, caching behavior, and working with very large models.
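These two stages map directly onto the two headline numbers quoted in the tests below: time-to-first-token reflects prefill, tokens/sec reflects decode. A small sketch of how both are derived from timestamps (the field names are illustrative, not any framework's API):

```python
from dataclasses import dataclass

@dataclass
class GenerationTimings:
    request_start: float      # wall-clock seconds when the request was sent
    first_token_time: float   # when the first generated token arrived
    end_time: float           # when generation finished
    generated_tokens: int

def ttft(t: GenerationTimings) -> float:
    """Time-to-first-token: dominated by prompt processing (prefill),
    so it benefits from GPU compute / neural accelerators."""
    return t.first_token_time - t.request_start

def generation_tps(t: GenerationTimings) -> float:
    """Tokens/sec over the decode phase only: more sensitive to
    sustained memory bandwidth than to peak compute."""
    return t.generated_tokens / (t.end_time - t.first_token_time)
```

For example, a run with a 1.58 s first token and 880 tokens over the next 10 s reports TTFT = 1.58 s and 88 tokens/sec, matching the shape of the LM Studio numbers below.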
Memory bandwidth / sustained throughput
- Stream (Triad) sustained CPU-side memory bandwidth (measured):
- M4 Max ≈ 319 GB/s
- M3 Ultra ≈ 337 GB/s
- M5 Max ≈ 351 GB/s
- Interpretation:
- M5 Max leads these sustained CPU-side measurements (~10% over M4 Max, ~4% over M3 Ultra).
- These sustained numbers are lower than Apple’s quoted peak GPU+CPU bandwidth.
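For reference, the Triad kernel behind these numbers is `a[i] = b[i] + s * c[i]`. A pure-Python illustration of what the test measures (the real STREAM benchmark is compiled C and reports orders of magnitude higher figures than interpreter-bound Python can reach):

```python
from array import array
import time

def triad_mbps(n: int = 1_000_000, scalar: float = 3.0) -> float:
    """Run the STREAM Triad kernel once and report MB/s.

    Illustrative only: shows the memory-access pattern, not a
    meaningful measurement of the machine's bandwidth.
    """
    a = array("d", bytes(8 * n))       # n zeros, preallocated
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - start
    # Triad moves 3 doubles per element: load b[i], load c[i], store a[i].
    return (3 * 8 * n) / (elapsed * 1e6)
```

The same 24-bytes-per-element accounting is how STREAM converts elapsed time into the MB/s figures quoted above.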
Local LLM tests and example results
LM Studio (Apple MLX and GGUF models)
- Mixture‑of‑experts model (Qwen 3.5 MoE, 35B; 50k context):
- Time‑to‑first‑token: M4 ≈ M5 ≈ 1.58 s.
- Tokens/sec (throughput): M4 ≈ 79 tps, M5 ≈ 88 tps.
- M3 Ultra: lower tokens/sec in that test but observed to have faster time‑to‑first‑token in some runs.
- GPU utilization and power:
- M3 Ultra often hit higher GPU usage (up to 100%).
- Laptops (M4/M5) showed ~75–79% GPU usage.
- Power draw differences: laptop spikes ≈ 130–154 W; M3 Ultra spikes up to ≈ 240 W.
Llama.cpp / Llama Bench (dense & quantized models)
- Gemma 3 4B Q4_K_M example (prompt processing vs token generation):
- Prompt processing (PP) dramatic uplift:
- M4 Max ≈ 1,855 tokens/s
- M5 Max ≈ 4,468 tokens/s — roughly a 2.4× improvement over M4 Max, directionally consistent with Apple’s peak AI-compute claims though short of the quoted 4× peak
- M3 Ultra ≈ 2,959 tokens/s
- Token generation (TG) improvements:
- More modest and highly model/format dependent; memory bandwidth helps but results vary.
Key takeaways / implications
- M5 Max strengths
- Very large improvement in prompt processing (the compute-bound prefill path) on certain models — roughly 2.4× vs M4 Max in the quoted llama.cpp example.
- Better sustained memory throughput than M4, slightly ahead of M3 Ultra in measured CPU-side stream tests.
- Much faster NVMe (Gen 5) speeds meaningfully reduce model load and cache times.
- Limitations and caveats
- Unified memory cap of 128 GB restricts the largest on‑device models (M3 Ultra’s 512 GB remains an advantage for huge models).
- Real-world gains depend on:
- Model architecture (Mixture‑of‑Experts vs dense)
- Framework (MLX, GGUF, Llama.cpp)
- Quantization and model format
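A back-of-envelope calculation shows why the 128 GB unified-memory cap matters: weight footprint ≈ parameters × bits-per-weight / 8. The overhead factor and bits/weight below are illustrative assumptions, and KV-cache growth with context is ignored:

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float,
                        overhead: float = 1.1) -> float:
    """Approximate on-device weight size in GiB: params x bits / 8,
    plus ~10% (assumed) for embeddings and metadata. Excludes KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 * overhead / 2**30

# A 35B model at ~4.5 bits/weight (Q4_K_M-class) fits easily in 128 GB:
print(round(weight_footprint_gb(35, 4.5), 1))   # prints 20.2
# A 405B model at the same quantization does not -- hence the M3 Ultra's
# 512 GB advantage for the largest on-device models:
print(round(weight_footprint_gb(405, 4.5), 1))  # prints 233.4
```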
- Overall
- Laptop hardware (M5 Max) now approaches desktop-class performance on many local-AI tasks and can beat the M3 Ultra on selected prompt-processing workloads. The reviewer closed by noting excitement about what an M5 Ultra variant could achieve.
Tools, benchmarks and models mentioned
- Benchmarks: Speedometer 3.1, Stream (Triad) memory test, Python Mandelbrot-style multi-core test, Llama Bench / Llama.cpp, LM Studio.
- Frameworks & formats: Apple MLX, GGUF, Llama.cpp.
- Example models cited: Qwen 3.5 MoE ~35B, gpt‑oss (“GPTOSS” in the transcript) 12B, Gemma 3 4B Q4_K_M (quantized), various GGUF/Q4 quantized models.
Sponsor
- TryHackMe (security training platform) was mentioned as sponsor/ad.
Next coverage promised
- Deeper testing and comparisons for M5 Pro, M5 Air, M5 Neo and M5 Ultra.
Main speakers / sources referenced
- Video host / reviewer (presenting benchmarks and analysis)
- Apple (official M5 Max claims)
- Apple Neural Engine researcher referenced via Twitter (name garbled as “Animal” in the transcript)
- Tools/projects cited: Llama.cpp, LM Studio, Stream Triad, Speedometer
- TryHackMe (sponsor)