Summary of "Three months wrong about why my 4-node AMD cluster was slow"
Tech/product overview
- Hardware focus: AMD Ryzen AI Max+ 395 (“Strix Halo”) in the Minisforum MS-S1 Max mini PC.
- Key capability: Each node has 128GB unified memory (shared CPU+GPU; GPU can access ~115GB per box). The host can run large models locally, and 4 nodes are arranged as an “AI supercomputer” on a desk.
- Clustering goal: Beat Minisforum’s marketing claims by getting a 4-node Strix Halo setup to run faster multi-node inference.
Networking + clustering setup
- Network hardware approach:
- Added QSFP NICs (one per node).
- Used a MikroTik CRS812 switch (400G QSFP-DD).
- Initial expectation (later corrected): Higher network bandwidth would make the cluster faster.
- Major findings:
- Mesh LLM attempt (OpenAI-compatible API):
- Worked for single-node testing (e.g., Qwen-3-8B around 48 tok/s using Vulkan).
- Multi-node inference crashed for 235B MoE.
- Some models couldn’t see the GPU (e.g., Llama 3.1-405B).
- llama.cpp RPC scaling issue:
- Could fit bigger models with RPC/pipeline parallelism.
- But it did not improve tokens/sec as nodes increased.
- More nodes mainly added extra sequential steps per token.
Tensor parallelism (the real requirement for speed)
- Core concept: Real speedup needs tensor parallelism, not pipeline parallelism.
- Practical translation: In the Python LLM ecosystem, tensor parallelism ≈ vLLM.
- Performance behavior:
- RPC/pipeline parallelism: token throughput stays flat as nodes increase.
- Tensor parallelism: throughput increases with more nodes (until overhead/bottlenecks dominate).
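A toy model, not a measurement, can make the flat-versus-scaling behavior above concrete. It assumes a single decoding stream whose per-token work on one node takes `single_node_ms`. Pipeline parallelism runs the nodes' layers one after another, so that work is not divided; tensor parallelism splits each layer across nodes and pays a synchronization cost. The function names, `hop_ms`, `sync_ms`, and all numbers are illustrative assumptions, not figures from the video.

```python
def pipeline_tok_per_s(nodes: int, single_node_ms: float, hop_ms: float = 0.0) -> float:
    """Pipeline/RPC: nodes run their layers in sequence, so per-token work is not divided."""
    per_token_ms = single_node_ms + (nodes - 1) * hop_ms
    return 1000.0 / per_token_ms


def tensor_tok_per_s(nodes: int, single_node_ms: float, sync_ms: float = 0.0) -> float:
    """Tensor parallel: each node computes a slice of every layer, then all nodes sync."""
    overhead_ms = sync_ms if nodes > 1 else 0.0
    per_token_ms = single_node_ms / nodes + overhead_ms
    return 1000.0 / per_token_ms


if __name__ == "__main__":
    # Hypothetical inputs: 50 ms of work per token, 1 ms per pipeline hop,
    # 10 ms of total synchronization per token under tensor parallelism.
    for n in (1, 2, 4):
        pp = pipeline_tok_per_s(n, single_node_ms=50.0, hop_ms=1.0)
        tp = tensor_tok_per_s(n, single_node_ms=50.0, sync_ms=10.0)
        print(f"{n} node(s): pipeline {pp:.1f} tok/s, tensor {tp:.1f} tok/s")
```

In this sketch the pipeline number can only stay flat or fall as nodes are added, while the tensor-parallel number rises until `sync_ms` dominates the per-token time.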
Bottleneck diagnosis: latency > bandwidth
- Network upgrade experiment: Switched to high-throughput NICs (Intel E810 100Gb QSFP).
- Even with better iperf throughput, token speed did not improve.
- Conclusion: Bottleneck was round-trip latency, not bandwidth, because tensor parallelism needs many synchronization round trips per generated token.
- Fix: Use RDMA to reduce round-trip latency:
- TCP: hundreds of microseconds
- RDMA: ~single-digit microseconds (claimed ~200× per round trip)
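The round-trip argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 64-layer model with two synchronizations per layer per token (a common pattern in tensor-parallel transformers, but an assumption here), and uses 300 µs and 3 µs as stand-ins for "hundreds of microseconds" and "single-digit microseconds". None of these inputs are measurements from the video.

```python
def sync_ms_per_token(layers: int, syncs_per_layer: int, round_trip_us: float) -> float:
    """Milliseconds per generated token spent waiting on synchronization round trips."""
    return layers * syncs_per_layer * round_trip_us / 1000.0


def tok_per_s_ceiling(sync_ms: float, compute_ms: float = 0.0) -> float:
    """Upper bound on tokens/sec given per-token sync and compute time."""
    return 1000.0 / (sync_ms + compute_ms)


if __name__ == "__main__":
    LAYERS = 64            # assumed model depth
    SYNCS_PER_LAYER = 2    # assumed synchronizations per layer
    tcp_ms = sync_ms_per_token(LAYERS, SYNCS_PER_LAYER, round_trip_us=300.0)
    rdma_ms = sync_ms_per_token(LAYERS, SYNCS_PER_LAYER, round_trip_us=3.0)
    print(f"TCP : {tcp_ms:.3f} ms of waiting per token")
    print(f"RDMA: {rdma_ms:.3f} ms of waiting per token")
    print(f"TCP ceiling with zero compute: {tok_per_s_ceiling(tcp_ms):.1f} tok/s")
```

Link bandwidth does not appear anywhere in this estimate, which is consistent with the observation that a NIC with better iperf throughput but similar round-trip time left token speed unchanged.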
RDMA success/failure details (hardware + software constraints)
- Hardware mismatch pitfall: Mixing RDMA stacks/NIC types (e.g., Intel iWARP vs Mellanox RoCE) either did not interoperate or performed worse.
- Some mixed setups could even underperform single-node.
- Setup requirements:
- Matching RDMA-capable cards.
- Linux tweaks, including:
- memlock set to unlimited in the security limits configuration (e.g., /etc/security/limits.conf)
- Rebooting after changes (device locking behavior)
- Measured results (tensor parallelism):
- 2 nodes over TCP: ~22.9 tok/s
- 2 nodes over RDMA: ~25–25.6 tok/s
- 4 nodes (RDMA target):
- ~29 tok/s over 4 nodes using TCP
- RDMA “reported as working better” at times (e.g., ~38.3 tok/s on a small dense model)
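The memlock requirement in the setup list above matters because RDMA needs to pin memory, and a low locked-memory limit can block that. A minimal sketch for checking the limit the current process actually received, using Python's standard `resource` module on Linux, is shown here. The helper names are illustrative, and the check says nothing about whether the NICs or RDMA stack are otherwise configured correctly.

```python
import resource


def memlock_limits() -> tuple[int, int]:
    """Return the (soft, hard) locked-memory limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit: int) -> str:
    """Render a limit as 'unlimited' or a byte count."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"memlock soft limit: {describe(soft)}")
    print(f"memlock hard limit: {describe(hard)}")
    if soft != resource.RLIM_INFINITY:
        print("memlock is not unlimited; RDMA memory registration may fail")
```

Limits are applied at login, so a value changed in the limits configuration is typically only visible in a new session, which fits the note above about rebooting after changes.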
Model scaling observations (important analysis)
- Why speedup isn’t uniform: It depends on active parameters per token, especially for MoE.
- Small dense models scale closer to linear.
- MoE scales less because only a subset of experts activates per token.
- Example (Qwen dense vs MoE):
- Dense: stronger scaling
- MoE: weaker scaling due to fewer active parameters per token
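One way to see why active parameters matter is a rough bandwidth-bound estimate: decoding a token requires reading the active weights once, so a model with fewer active parameters has less per-token work to split, and a fixed synchronization cost takes a larger share. The sketch below assumes 256 GB/s of memory bandwidth per node, 16-bit weights, a hypothetical 7B dense model, a hypothetical MoE with 3B active parameters, and 5 ms of synchronization per token. All of these are illustrative assumptions, not values from the video.

```python
def compute_ms_per_token(active_params_b: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Time to stream the active weights once (billions of params * bytes = GB)."""
    gigabytes_read = active_params_b * bytes_per_param
    return gigabytes_read / bandwidth_gb_s * 1000.0


def tp_speedup(compute_ms: float, nodes: int, sync_ms: float) -> float:
    """Speedup over one node when compute is split and a fixed sync cost is added."""
    return compute_ms / (compute_ms / nodes + sync_ms)


if __name__ == "__main__":
    BANDWIDTH_GB_S = 256.0  # assumed per-node memory bandwidth
    SYNC_MS = 5.0           # assumed per-token synchronization cost
    dense_ms = compute_ms_per_token(7.0, 2.0, BANDWIDTH_GB_S)  # hypothetical 7B dense
    moe_ms = compute_ms_per_token(3.0, 2.0, BANDWIDTH_GB_S)    # hypothetical 3B active
    print(f"dense: {tp_speedup(dense_ms, 4, SYNC_MS):.2f}x on 4 nodes")
    print(f"MoE  : {tp_speedup(moe_ms, 4, SYNC_MS):.2f}x on 4 nodes")
```

Under these assumptions the dense model gets the larger speedup, matching the direction of the observation above even though the absolute numbers are made up.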
Quantization + model compatibility issues (what broke and what worked)
- Goal: Run increasingly large models under tensor parallelism across 4 nodes.
- DeepSeek R1 671B target:
- 4-bit AWQ for 671B didn’t work as expected:
- loading would “hang silently,” leaving the GPUs idle
- MoE + AWQ reported as broken on this specific iGPU/software stack
- DeepSeek-V2-Lite:
- AWQ on MoE broken (PR pending)
- Llama 405B AWQ:
- Worked in tensor-parallel form only under certain conditions
- Earlier attempts “hung” depending on software/kernel
Workaround that worked for big MoE
- Use W4A16 (4-bit weight-only, 16-bit activations) style quantization.
- The creator reports rock-solid runs for models that previously failed (first a Qwen 30B MoE, then later DeepSeek R1).
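A quick weight-size estimate shows why 4-bit weights are what makes a 671B model plausible on this cluster. The sketch below uses the ~115GB of GPU-accessible memory per node quoted earlier and counts weights only; it ignores KV cache, activations, and quantization metadata, so real headroom is smaller. The function names are illustrative.

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8.0


def fits(params_b: float, bits_per_weight: float, nodes: int,
         usable_gb_per_node: float = 115.0) -> bool:
    """True if the weights alone fit in the combined GPU-accessible memory."""
    return weight_gb(params_b, bits_per_weight) <= nodes * usable_gb_per_node


if __name__ == "__main__":
    for bits in (16, 4):
        size = weight_gb(671.0, bits)
        print(f"671B at {bits}-bit: {size:.1f} GB, fits on 4 nodes: {fits(671.0, bits, 4)}")
```

At 4 bits the weights come to about 335.5 GB against roughly 460 GB across four nodes, while 16-bit weights would need about 1342 GB.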
Software fix: kernel/ROCm image mismatch resolved hangs
- Root cause suspected/identified: Newer ROCm required a newer Linux kernel than what Fedora initially shipped.
- Key steps:
- Initially on Fedora 43 (kernel 6.17), then upgraded to kernel 6.19
- Used Donato Capitella’s updated Strix Halo vLLM toolboxes/images
- Toolboxes were Fedora-aligned, more compatible with this environment than Ubuntu-focused ones.
- After kernel update + reboot, previously hanging models began to run again.
- Reported improvement: Previously failing models such as Gemma 4 26B, and eventually DeepSeek R1, ran successfully.
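Since the hangs were tied to the kernel version, a small pre-flight check before launching a container can save a debugging session. This is a minimal sketch that compares the running kernel against 6.19, the version named above; the helper names and the example release string in the comment are hypothetical, and passing the check does not guarantee that ROCm will work.

```python
import platform
import re


def kernel_major_minor(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.17.1-300.fc43.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: tuple[int, int] = (6, 19)) -> bool:
    """True if the release's (major, minor) is at or above the minimum."""
    return kernel_major_minor(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # same value as `uname -r` on Linux
    print(f"{release} meets the 6.19 minimum: {kernel_at_least(release)}")
```

Comparing tuples rather than strings avoids the classic mistake of treating "6.9" as newer than "6.19".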
Final benchmark vs vendor demo
- Energy/power: ~180–187W per node (heavy heat).
- Creator’s reported throughput for DeepSeek R1 (4-bit quant, 4 nodes, vLLM tensor parallelism):
- 6.23 tok/s
- Minisforum marketing demo (same chip, 4-node DeepSeek R1 quantized to 4-bit):
- 5.94 tok/s
- Net claim: Creator achieved ~5% faster throughput than the vendor’s own video on a fully open-source stack.
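The ~5% figure can be reproduced from the two throughput numbers above. The helper below is a trivial sketch (the function name is illustrative) that also applies the same arithmetic to the 2-node TCP and RDMA measurements reported earlier.

```python
def percent_faster(ours: float, baseline: float) -> float:
    """Relative throughput gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


if __name__ == "__main__":
    # DeepSeek R1 on 4 nodes: creator's run vs the vendor demo.
    print(f"vs vendor demo: {percent_faster(6.23, 5.94):.1f}% faster")
    # 2-node tensor parallelism: RDMA (low and high end) vs TCP.
    print(f"RDMA vs TCP, low : {percent_faster(25.0, 22.9):.1f}% faster")
    print(f"RDMA vs TCP, high: {percent_faster(25.6, 22.9):.1f}% faster")
```

The vendor comparison works out to a little under 5%, and the 2-node RDMA gain over TCP lands at roughly 9 to 12%.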
Practical “results list” (what ran)
- 13 models tested across 3 cluster sizes (1, 2, and 4 nodes, as implied by context).
- Some configurations didn’t fit or failed; successful runs included:
- Fastest: Qwen 7B dense and Qwen 30B MoE at ~38 tok/s
- Big DeepSeek R1 (671B): ~6.23 tok/s
- Large-model failures/hangs were largely resolved via:
- the kernel + updated toolbox path, and
- avoiding unsupported quantization combinations on this stack.
Guides/tutorial takeaways (implied)
- Don’t rely on bandwidth upgrades alone—latency dominates for multi-node tensor parallelism.
- Use tensor parallelism (vLLM) for speedups; pipeline parallelism (llama.cpp RPC) lets bigger models fit but does not raise tokens/sec.
- For RDMA speedups:
- match NIC hardware/protocols
- ensure RDMA permissions (memlock) and correct RDMA stack
- For Strix Halo + open source:
- prefer Donato’s toolbox approach (container images with ROCm/Vulkan for this iGPU)
- use a compatible kernel version (author specifically recommends kernel 6.19)
Main speakers/sources
- Primary speaker: The YouTube creator (unnamed in subtitles), running the experiments and benchmarks.
- Referenced key source: Donato Capitella (Strix Halo cluster work; GitHub + “toolboxes” for ROCm/Vulkan/vLLM on Strix Halo).