Summary of "I built a private AI mini-cluster with Framework Desktop"
Main Technological Concepts & Product Features
Framework Desktop Mini-Rack & Mini-ITX Motherboard
- Framework released a mini-rack setup housing four Framework Desktop nodes, each based on a Mini-ITX motherboard with an AMD Ryzen AI Max+ 395 APU.
- The motherboard is a compact single-board design with the APU and memory soldered down, optimized for AI workloads.
- Key hardware features include:
- AMD Ryzen AI Max+ 395 APU with soldered unified memory, optimized for AI and gaming.
- Two M.2 slots, PCIe Gen 4 x4 slot (limited to 25W power), USB 3 headers.
- No built-in wireless antenna mounting despite a Wi-Fi slot.
- Phase change thermal interface pad for better cooling.
- Standard Flex ATX power supply (with a snorkel for airflow in desktop cases).
- ARGB and audio headers, power and CMOS reset buttons.
- Power consumption:
- Approximately 10W idle.
- Around 150W at full load per node.
- Noise level:
- Low noise (~46 dBA at full load) with Noctua fans.
- Fans stop spinning when idle.
- Networking:
- Built-in 5 Gbps Ethernet.
- USB4 networking speeds capped at ~10 Gbps in testing (20 Gbps expected).
Mini-Rack Form Factor
- A 2U half-width rack designed to hold four Mini-ITX motherboards.
- Includes power supply mounts and power buttons.
- Designed for airflow with cutouts and venting.
- Compatible with Framework power supplies; compatibility with other brands is uncertain.
- Compact enough for home or small office use; can fit inside a larger rack.
Software, AI Clustering & Performance Analysis
AI Clustering with the Framework Desktop Mini-Rack
- AI clustering is in very early stages, especially with AMD APUs.
- Framework collaborated with open-source projects like:
- llama.cpp (supports RPC mode).
- Exo for distributed AI workloads.
- Challenges running large language models (LLMs) on clusters include:
- Network latency and bandwidth limitations.
- Software immaturity and bugs in clustering tools.
- Memory and IO bottlenecks when splitting models across nodes.
- Performance observations:
- Mid-size models (e.g., a 70B-parameter Llama) run faster on a single node than distributed across four nodes.
- Very large models (e.g., the 405B-parameter Llama) barely run, with extremely slow token generation (~0.7 tokens/sec).
- Current state:
- AI clustering offers no speed advantage for hobbyists.
- Vertical scaling (more RAM/VRAM on one machine) is preferable for performance.
- AMD’s AI software stack is immature, with driver and library issues and incomplete NPU support.
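The gap between these generation rates is easier to feel as wall-clock time. A quick sketch using the rates reported in the video (the 500-token response length is an arbitrary illustrative assumption):

```python
# Rough generation-time comparison using token rates reported in the video.
# The 500-token response length is an illustrative assumption.

def generation_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate `tokens` at a steady rate of `tokens_per_sec`."""
    return tokens / tokens_per_sec

response_tokens = 500
single_node_rate = 45.0   # tokens/sec, 70B-class model on one node (CPU)
cluster_405b_rate = 0.7   # tokens/sec, 405B model split across four nodes

print(f"Single node: {generation_time_s(response_tokens, single_node_rate):.0f} s")
print(f"4-node 405B: {generation_time_s(response_tokens, cluster_405b_rate) / 60:.0f} min")
```

At ~0.7 tokens/sec, a modest 500-token answer takes roughly 12 minutes, which is why the video concludes clustering offers no practical benefit for hobbyists today.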
Benchmarking Highlights
- Single node performance:
- Comparable to Apple M4 in single-core.
- Between Apple M4 and M4 Max in multi-core.
- Linux kernel compilation completes in under a minute, faster than some ARM desktops.
- High-performance Linpack (HPL) benchmarks:
- ~308 gigaflops per node.
- The four-node cluster reached over 1 teraflop of FP64 performance, comparable to a TOP500 supercomputer from around 2005.
- Cluster CPU performance was nearly twice that of a $6,000 M4 Max Mac Studio, though the cluster costs about $8,000 in total.
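The per-node and cluster HPL figures above compose roughly as expected. A minimal sanity check, assuming perfect linear scaling (which real HPL runs over Ethernet do not achieve):

```python
# Aggregate FP64 throughput implied by the per-node HPL result from the video.
# Real HPL runs scale sub-linearly over a network, so this is an upper bound.

per_node_gflops = 308   # measured per-node HPL result
nodes = 4

peak_estimate_tflops = per_node_gflops * nodes / 1000
print(f"Naive 4-node peak: {peak_estimate_tflops:.3f} TFLOPS FP64")
```

The naive estimate is ~1.23 TFLOPS; the clustered run reported in the video landed just over 1 TFLOPS, i.e., roughly 80% parallel efficiency over the 5 Gbps network.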
- AI model token generation rates:
- ~45 tokens/sec on CPU (Ollama).
- ~88 tokens/sec on iGPU using llama.cpp with Vulkan (better than CPU but less efficient than Apple chips).
- Network limitations:
- 5 Gbps switch used.
- Expected higher throughput over USB4 networking was not fully realized (~10 Gbps observed).
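To see why link speed matters even before per-token latency enters the picture, consider how long it takes just to move model weights between nodes. A rough estimate, where the ~40 GB model size is an illustrative assumption (on the order of a 4-bit-quantized 70B-parameter model):

```python
# Time to move model data across the cluster network at the two link speeds
# discussed in the video. The 40 GB size is an illustrative assumption
# (roughly a 4-bit-quantized 70B-parameter model).

def transfer_time_s(size_bytes: float, link_gbps: float) -> float:
    """Seconds to move `size_bytes` over a link of `link_gbps` gigabits/sec."""
    return size_bytes * 8 / (link_gbps * 1e9)

model_bytes = 40e9
print(f"5 Gbps:  {transfer_time_s(model_bytes, 5):.0f} s")
print(f"10 Gbps: {transfer_time_s(model_bytes, 10):.0f} s")
```

Bulk transfer is a one-time cost (~64 s at 5 Gbps); the harder problem for distributed inference is the round-trip latency paid on every token, which no amount of bandwidth fixes.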
Software Tools & Ecosystem
- Primary clustering tool: llama.cpp with RPC mode.
- Exo was promising but development has stalled, raising concerns about open-source AI tooling sustainability.
- Distributed Llama is easier to use but limited in model compatibility.
- Presenter developed automation via Ansible for cluster setup and benchmarking (available on GitHub).
Guides, Reviews, and Tutorials Provided
Hardware Assembly
- Quick, roughly 20-minute assembly of four nodes into the mini-rack using 3D-printed trays.
- Overview of power supply mounting, airflow considerations, and IO access.
Software Setup
- Remote headless installation via JetKVM and SSH.
- Use of MPI for distributed benchmarking (HPL).
- Running and benchmarking AI models using llama.cpp in RPC mode.
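The llama.cpp RPC workflow described above follows a simple pattern: each worker runs an RPC server, and the head node lists the workers at launch. A sketch of what this looks like (hostnames are illustrative, and flags may differ between llama.cpp versions, so check `--help` on your build):

```shell
# On each of the three worker nodes, start llama.cpp's RPC backend server
# (listens for tensor operations from the head node):
rpc-server -H 0.0.0.0 -p 50052

# On the head node, run inference with the workers listed via --rpc;
# model layers are split across the local machine and the remote servers:
llama-cli -m ./model.gguf \
  --rpc node1:50052,node2:50052,node3:50052 \
  -p "Explain HPL in one sentence."
```

This layer-splitting over RPC is what makes each token pay a network round trip, which is the latency bottleneck the video measures.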
Performance Comparisons
- Cluster CPU performance compared to Apple M4 Max Studio and previous Pi cluster.
- AI token generation speed comparisons between CPU, iGPU, and cluster modes.
- Cost-performance analysis of this cluster versus other AI hardware (Ampere server, Mac Studio, Mac Mini clusters).
Cautions & Recommendations
- AI clustering is not yet practical for most users.
- Vertical scaling (more powerful single machines) is more effective than horizontal scaling (clusters) for AI workloads.
- AMD’s AI software stack needs improvement; expect bugs and incomplete features.
- Be wary of open-source AI projects that lose maintenance or change licensing (e.g., Exo).
Main Speakers / Sources
- Jeff Geerling: Video creator and presenter; conducted hardware testing, benchmarking, and software setup.
- Nirav Patel: Founder and CEO of Framework; provided background on the Framework Desktop and mini-rack design philosophy.
- Community / Projects Mentioned:
- Exo: Open-source AI clustering tool, now stalled.
- llama.cpp: Open-source LLM inference tool with RPC mode.
- Distributed Llama: Another clustering tool.
- DeskPi: Partner for mini-rack hardware design.
- b4rtaz: Community contributor helping with Distributed Llama.
Overall, the video is a detailed exploration and practical review of building a small AI compute cluster using Framework's Mini-ITX desktops in a mini-rack, focusing on hardware features, benchmarking, AI clustering challenges, and software tooling limitations.