Summary of "Apple’s New M5 Max Changes the Local AI Story" — a first look at the M5 Max for local AI and developer workflows
A first-look review of the M5 Max MacBook Pro shows meaningful improvements for local AI workloads and developer responsiveness, driven by a new GPU architecture with neural accelerators, faster NVMe, and modest increases in sustained memory throughput. Results vary by model, framework and quantization format; the unified memory cap (128 GB) remains a limiting factor for the largest on-device models.
What’s new / hardware highlights
- M5 Max replaces the M4 Max. Tested unit used a 40 GPU‑core configuration.
- New GPU architecture with neural accelerators in every GPU core.
- Apple’s claimed peaks:
- Up to ~4× peak GPU AI compute vs previous generation.
- Up to 614 GB/s unified memory bandwidth (peak, GPU+CPU).
- Practical limits:
- Max unified memory remains 128 GB (same as M4 Max). M3 Ultra supports up to 512 GB.
- New SSD (PCIe Gen 5) — sequential performance in reviewer tests:
- Reads ≈ 13.6 GB/s+
- Writes ≈ 16 GB/s+
- Random small-file IO also improved modestly.
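The reviewer's exact disk benchmark isn't named in the summary; as a rough stdlib-only sketch of how sequential throughput figures like these can be estimated (buffered I/O, so OS page caching inflates the read number unless the file greatly exceeds RAM or caching is bypassed, e.g. F_NOCACHE on macOS):

```python
import os
import tempfile
import time

def sequential_throughput(total_mb: int = 64, chunk_mb: int = 8) -> tuple[float, float]:
    """Rough sequential (write_MBps, read_MBps) estimate.

    Illustrative sketch, not the reviewer's tool: buffered I/O plus
    fsync, so treat the read figure as an upper bound.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = total_mb // chunk_mb
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        start = time.perf_counter()
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())          # force data to disk before stopping the clock
        write_mbps = total_mb / (time.perf_counter() - start)
    try:
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_mb * 1024 * 1024):
                pass
        read_mbps = total_mb / (time.perf_counter() - start)
    finally:
        os.remove(path)
    return write_mbps, read_mbps
```

Real SSD benchmarks use much larger files and direct I/O to defeat caching; the sketch only shows the shape of the measurement.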
Benchmarks & developer responsiveness
- Browser / JS responsiveness
- Speedometer 3.1 (single‑core): M5 Max scored 60.5 — the highest the reviewer had seen, small gains over M4 and M3.
- Multi‑core CPU work
- Python “Mandelbrot” heavy parallel test (18 vs 16 vs 32 cores):
- M4 Max ≈ 14.6–15 s
- M5 Max ≈ 11.6–11.8 s
- M3 Ultra ≈ 8.5 s (still faster on heavily parallel CPU tasks)
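The reviewer's script isn't included in the summary; a minimal sketch of a Mandelbrot-style test that fans rows out across all CPU cores with `multiprocessing` (resolution and iteration counts are illustrative):

```python
from multiprocessing import Pool
import os

def mandelbrot_row(args):
    """Escape-iteration counts for one row of the Mandelbrot set."""
    y, width, height, max_iter = args
    row = []
    ci = -1.5 + 3.0 * y / height
    for x in range(width):
        cr = -2.0 + 3.0 * x / width
        zr = zi = 0.0
        n = 0
        while n < max_iter and zr * zr + zi * zi <= 4.0:
            zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
            n += 1
        row.append(n)
    return row

def render(width=200, height=200, max_iter=100):
    # Each row is an independent unit of work, so the job scales
    # with core count -- which is why the M3 Ultra's 32 cores still win.
    with Pool(os.cpu_count()) as pool:
        return pool.map(mandelbrot_row,
                        [(y, width, height, max_iter) for y in range(height)])
```

Call `render()` from a `if __name__ == "__main__":` guard when using the spawn start method (the macOS default).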
Local-LLM (local AI) — key concepts
Two important stages in model inference:
- Prompt processing / prefill (PP)
- Heavy on compute/GPU; benefits from GPU compute and neural accelerators.
- Token generation (TG)
- More sensitive to sustained memory bandwidth (throughput).
- Storage speed affects model load time, caching behavior, and working with very large models.
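These two stages map directly onto the two headline numbers quoted in the tests below: time-to-first-token reflects prefill, tokens/sec reflects decode. A small sketch of how both are derived from timestamps (the field names are illustrative, not any framework's API):

```python
from dataclasses import dataclass

@dataclass
class GenerationTimings:
    request_start: float      # wall-clock seconds when the request was sent
    first_token_time: float   # when the first generated token arrived
    end_time: float           # when generation finished
    generated_tokens: int

def ttft(t: GenerationTimings) -> float:
    """Time-to-first-token: dominated by prompt processing (prefill),
    so it benefits from GPU compute / neural accelerators."""
    return t.first_token_time - t.request_start

def generation_tps(t: GenerationTimings) -> float:
    """Tokens/sec over the decode phase only: more sensitive to
    sustained memory bandwidth than to peak compute."""
    return t.generated_tokens / (t.end_time - t.first_token_time)
```

For example, a run with a 1.58 s first token and 880 tokens over the next 10 s reports TTFT = 1.58 s and 88 tokens/sec, matching the shape of the LM Studio numbers below.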
Memory bandwidth / sustained throughput
- Stream (Triad) sustained CPU-side memory bandwidth (measured):
- M4 Max ≈ 319 GB/s
- M3 Ultra ≈ 337 GB/s
- M5 Max ≈ 351 GB/s
- Interpretation:
- M5 Max leads these sustained CPU-side measurements (~10% over M4 Max, ~4% over M3 Ultra).
- These sustained numbers are lower than Apple’s quoted peak GPU+CPU bandwidth.
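For reference, the Triad kernel behind these numbers is `a[i] = b[i] + s * c[i]`. A pure-Python illustration of what the test measures (the real STREAM benchmark is compiled C and reports orders of magnitude higher figures than interpreter-bound Python can reach):

```python
from array import array
import time

def triad_mbps(n: int = 1_000_000, scalar: float = 3.0) -> float:
    """Run the STREAM Triad kernel once and report MB/s.

    Illustrative only: shows the memory-access pattern, not a
    meaningful measurement of the machine's bandwidth.
    """
    a = array("d", bytes(8 * n))       # n zeros, preallocated
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - start
    # Triad moves 3 doubles per element: load b[i], load c[i], store a[i].
    return (3 * 8 * n) / (elapsed * 1e6)
```

The same 24-bytes-per-element accounting is how STREAM converts elapsed time into the MB/s figures quoted above.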
Local LLM tests and example results
LM Studio (Apple MLX and GGUF models)
- Mixture‑of‑experts model (Qwen 3.5 MoE, 35B; 50k context):
- Time‑to‑first‑token: M4 ≈ M5 ≈ 1.58 s.
- Tokens/sec (throughput): M4 ≈ 79 tps, M5 ≈ 88 tps.
- M3 Ultra: lower tokens/sec in that test but observed to have faster time‑to‑first‑token in some runs.
- GPU utilization and power:
- M3 Ultra often hit higher GPU usage (up to 100%).
- Laptops (M4/M5) showed ~75–79% GPU usage.
- Power draw differences: laptop spikes ≈ 130–154 W; M3 Ultra spikes up to ≈ 240 W.
Llama.cpp / Llama Bench (dense & quantized models)
- Gemma 3 4B Q4_K_M example (prompt processing vs token generation):
- Prompt processing (PP) dramatic uplift:
- M4 Max ≈ 1,855 tokens/s
- M5 Max ≈ 4,468 tokens/s — roughly a 2.4× improvement over M4 Max, directionally consistent with Apple’s peak AI-compute claims though short of the quoted 4× peak
- M3 Ultra ≈ 2,959 tokens/s
- Token generation (TG) improvements:
- More modest and highly model/format dependent; memory bandwidth helps but results vary.
Key takeaways / implications
- M5 Max strengths
- Very large improvement in prompt processing (the compute-bound prefill path) on certain models — roughly 2.4× vs M4 Max in the quoted llama.cpp example.
- Better sustained memory throughput than M4, slightly ahead of M3 Ultra in measured CPU-side stream tests.
- Much faster NVMe (Gen 5) speeds meaningfully reduce model load and cache times.
- Limitations and caveats
- Unified memory cap of 128 GB restricts the largest on‑device models (M3 Ultra’s 512 GB remains an advantage for huge models).
- Real-world gains depend on:
- Model architecture (Mixture‑of‑Experts vs dense)
- Framework (MLX, GGUF, Llama.cpp)
- Quantization and model format
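A back-of-envelope calculation shows why the 128 GB unified-memory cap matters: weight footprint ≈ parameters × bits-per-weight / 8. The overhead factor and bits/weight below are illustrative assumptions, and KV-cache growth with context is ignored:

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float,
                        overhead: float = 1.1) -> float:
    """Approximate on-device weight size in GiB: params x bits / 8,
    plus ~10% (assumed) for embeddings and metadata. Excludes KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 * overhead / 2**30

# A 35B model at ~4.5 bits/weight (Q4_K_M-class) fits easily in 128 GB:
print(round(weight_footprint_gb(35, 4.5), 1))   # prints 20.2
# A 405B model at the same quantization does not -- hence the M3 Ultra's
# 512 GB advantage for the largest on-device models:
print(round(weight_footprint_gb(405, 4.5), 1))  # prints 233.4
```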
- Overall
- Laptop hardware (M5 Max) now approaches desktop-class performance on many local-AI tasks and can beat the M3 Ultra on selected prompt-processing workloads. The reviewer closed by noting excitement about what an M5 Ultra variant could achieve.
Tools, benchmarks and models mentioned
- Benchmarks: Speedometer 3.1, Stream (Triad) memory test, Python Mandelbrot-style multi-core test, Llama Bench / Llama.cpp, LM Studio.
- Frameworks & formats: Apple MLX, GGUF, Llama.cpp.
- Example models cited: Qwen 3.5 MoE ~35B, gpt‑oss (“GPTOSS” in the transcript) 12B, Gemma 3 4B Q4_K_M (quantized), various GGUF/Q4 quantized models.
Sponsor
- TryHackMe (security training platform) was mentioned as sponsor/ad.
Next coverage promised
- Deeper testing and comparisons for M5 Pro, M5 Air, M5 Neo and M5 Ultra.
Main speakers / sources referenced
- Video host / reviewer (presenting benchmarks and analysis)
- Apple (official M5 Max claims)
- Apple Neural Engine researcher referenced via Twitter (name garbled as “Animal” in the transcript)
- Tools/projects cited: Llama.cpp, LM Studio, Stream Triad, Speedometer
- TryHackMe (sponsor)