Summary of "Three months wrong about why my 4-node AMD cluster was slow"

Tech/product overview

Hardware focus: AMD Ryzen AI Max+ 395 (“Strix Halo”) in the Minisforum MS-S1 Max mini PC.
Key capability: Each node has 128GB of unified memory shared between CPU and GPU, of which the GPU can address ~115GB. A single node can run large models locally, and four nodes are arranged as an “AI supercomputer” on a desk.
Clustering goal: Beat the throughput shown in Minisforum’s marketing demo with a 4-node Strix Halo cluster running multi-node inference.
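
A rough way to see why four nodes matter is to compare quantized weight size against the GPU-addressable memory per node. The sketch below is a back-of-the-envelope estimate only: it assumes the ~115GB figure quoted above, counts weights alone, and ignores KV cache, activations, quantization scales, and runtime overhead. The function names are illustrative.

```python
import math

# From the summary: roughly 115GB of each node's 128GB is usable by the GPU.
GPU_VISIBLE_GB_PER_NODE = 115


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def min_nodes_for_weights(params_billion: float, bits_per_weight: float) -> int:
    """Smallest node count whose combined GPU-visible memory holds the weights."""
    footprint = weight_footprint_gb(params_billion, bits_per_weight)
    return math.ceil(footprint / GPU_VISIBLE_GB_PER_NODE)


for name, params_b in [("Llama 3.1 405B", 405), ("DeepSeek R1 671B", 671)]:
    size = weight_footprint_gb(params_b, 4)
    nodes = min_nodes_for_weights(params_b, 4)
    print(f"{name}: ~{size:.1f} GB at 4-bit, weights alone need >= {nodes} nodes")
```

The result is a lower bound on node count; real deployments need headroom beyond the weights.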

Networking + clustering setup

Network hardware approach:
- Added QSFP NICs (one per node).
- Used a MikroTik CRS812 switch (400G QSFP-DD).
Initial expectation (later corrected): Higher network bandwidth would make the cluster faster.
Major findings:
- Mesh LLM attempt (OpenAI-compatible API):
  - Worked for single-node testing (e.g., Qwen3-8B at around 48 tok/s using Vulkan).
  - Multi-node inference crashed on a 235B MoE model.
  - Some models couldn’t see the GPU (e.g., Llama 3.1 405B).
- llama.cpp RPC scaling issue:
  - RPC/pipeline parallelism made it possible to fit bigger models.
  - It did not improve tokens/sec as nodes were added.
  - Each extra node mainly added another sequential step per token.

Tensor parallelism (the real requirement for speed)

Core concept: Real speedup needs tensor parallelism, not pipeline parallelism.
Practical translation: In the Python LLM ecosystem, tensor parallelism in practice means vLLM.
Performance behavior:
- RPC/pipeline parallelism: token throughput stays flat as nodes increase.
- Tensor parallelism: throughput increases with more nodes (until overhead/bottlenecks dominate).
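
The difference between the two behaviors can be illustrated with a toy timing model for a single request. This is a simplified sketch, not a measurement: it assumes work splits evenly across nodes, and every number in it (single-node time, hop cost, synchronization count, round-trip time) is made up for illustration.

```python
def pipeline_token_time(single_node_s: float, nodes: int, hop_s: float) -> float:
    """Pipeline/RPC split: stages run one after another for a single request."""
    per_stage = single_node_s / nodes
    # The stages are sequential, so their times add up, plus one hop between stages.
    return per_stage * nodes + hop_s * (nodes - 1)


def tensor_token_time(single_node_s: float, nodes: int,
                      sync_count: int, round_trip_s: float) -> float:
    """Tensor split: every node computes its slice at once, then they synchronize."""
    if nodes == 1:
        return single_node_s
    return single_node_s / nodes + sync_count * round_trip_s


# Illustrative values only (assumptions, not figures from the video).
SINGLE_NODE_S = 0.040   # seconds per token on one node
HOP_S = 0.0004          # cost of handing activations to the next stage
SYNC_COUNT = 100        # synchronizations per token under tensor parallelism
ROUND_TRIP_S = 0.00001  # cost of one synchronization

for n in (1, 2, 4):
    pipe = 1 / pipeline_token_time(SINGLE_NODE_S, n, HOP_S)
    tens = 1 / tensor_token_time(SINGLE_NODE_S, n, SYNC_COUNT, ROUND_TRIP_S)
    print(f"{n} node(s): pipeline {pipe:.1f} tok/s, tensor {tens:.1f} tok/s")
```

In this model the pipeline figure never rises above the single-node rate, while the tensor figure rises until the synchronization term dominates.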

Bottleneck diagnosis: latency > bandwidth

Network upgrade experiment: Switched to high-throughput NICs (Intel E810 100Gb QSFP).
- Even with better iperf throughput, token speed did not improve.
Conclusion: Bottleneck was round-trip latency, not bandwidth, because tensor parallelism needs many synchronization round trips per generated token.
Fix: Use RDMA to reduce round-trip latency:
- TCP: hundreds of microseconds
- RDMA: ~single-digit microseconds (claimed ~200× per round trip)
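
The latency-versus-bandwidth argument can be checked with simple arithmetic. The sketch below assumes small synchronization messages; the round-trip count, payload size, link speeds, and latency values are hypothetical picks within the ranges quoted above (hundreds of microseconds for TCP, single-digit microseconds for RDMA), not numbers reported in the video.

```python
def round_trip_cost_s(payload_bytes: int, link_gbit_s: float, latency_s: float) -> float:
    """Time for one synchronization: fixed latency plus time on the wire."""
    wire_s = payload_bytes * 8 / (link_gbit_s * 1e9)
    return latency_s + wire_s


def per_token_sync_s(round_trips: int, payload_bytes: int,
                     link_gbit_s: float, latency_s: float) -> float:
    """Total synchronization time added to each generated token."""
    return round_trips * round_trip_cost_s(payload_bytes, link_gbit_s, latency_s)


# Hypothetical workload: 120 round trips per token, 16 KiB per message.
ROUND_TRIPS = 120
PAYLOAD_BYTES = 16 * 1024

cases = {
    "TCP, 10 Gbit/s": (10.0, 400e-6),
    "TCP, 100 Gbit/s": (100.0, 400e-6),
    "RDMA, 100 Gbit/s": (100.0, 2e-6),
}
for label, (gbit, latency) in cases.items():
    cost_ms = per_token_sync_s(ROUND_TRIPS, PAYLOAD_BYTES, gbit, latency) * 1000
    print(f"{label}: {cost_ms:.2f} ms of sync per token")
```

Under these assumptions a tenfold bandwidth increase trims only a few percent from the per-token overhead, while cutting latency removes almost all of it.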

RDMA success/failure details (hardware + software constraints)

Hardware mismatch pitfall: Mixing RDMA stacks/NIC types (e.g., Intel iWARP vs Mellanox RoCE) either failed to interoperate or performed worse.
- Some mixed setups could even underperform single-node.
Setup requirements:
- Matching RDMA-capable cards.
- Linux tweaks, including:
  - setting memlock to unlimited in /etc/security/limits.conf
  - Rebooting after changes (device locking behavior)
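
RDMA registers (pins) memory, and pinned memory counts against the process's locked-memory limit, which is why the memlock setting matters. A minimal sketch for checking the limit that a process actually sees, using Python's standard resource module on Linux; the helper names are illustrative, and this only reads the limit without changing it:

```python
import resource


def memlock_limit() -> tuple[int, int]:
    """Return the (soft, hard) locked-memory limit of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def is_unlimited(limit: int) -> bool:
    """True when the limit is the platform's 'no limit' sentinel."""
    return limit == resource.RLIM_INFINITY


soft, hard = memlock_limit()
for label, value in (("soft", soft), ("hard", hard)):
    shown = "unlimited" if is_unlimited(value) else f"{value} bytes"
    print(f"memlock {label} limit: {shown}")
```

Running this inside the same shell or container that launches the inference process shows whether the limits change has taken effect there.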
Measured results (tensor parallelism):
- 2 nodes over TCP: ~22.9 tok/s
- 2 nodes over RDMA: ~25–25.6 tok/s
- 4 nodes:
  - over TCP: ~29 tok/s
  - over RDMA: reported as working better at times (e.g., ~38.3 tok/s on a small dense model)
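
Converting the 2-node throughput figures into time per token makes the RDMA gain easier to relate to the latency argument. A small sketch using only the 2-node numbers listed above (22.9 tok/s over TCP, 25 to 25.6 tok/s over RDMA):

```python
def ms_per_token(tokens_per_s: float) -> float:
    """Convert throughput into average milliseconds spent per token."""
    return 1000.0 / tokens_per_s


tcp_ms = ms_per_token(22.9)
for rdma_tok_s in (25.0, 25.6):
    rdma_ms = ms_per_token(rdma_tok_s)
    print(f"RDMA at {rdma_tok_s} tok/s saves {tcp_ms - rdma_ms:.2f} ms per token vs TCP")
```

That works out to roughly 3.7 to 4.6 ms saved on each token, which is the share of per-token time attributable to the transport in this configuration.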

Model scaling observations (important analysis)

Why speedup isn’t uniform: It depends on active parameters per token, especially for MoE.
- Small dense models scale closer to linear.
- MoE scales less because only a subset of experts activates per token.
Example (Qwen dense vs MoE):
- Dense: stronger scaling
- MoE: weaker scaling due to fewer active parameters per token
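
One way to reason about this is a simple model in which decode time is the time to read the active weights, divided across nodes, plus a fixed synchronization cost. This is a sketch under loud assumptions: decoding is treated as memory-bandwidth-bound, and the bandwidth, bytes per parameter, and sync cost below are illustrative placeholders rather than measured values.

```python
def decode_time_s(active_params_b: float, nodes: int,
                  bytes_per_param: float = 0.5,      # assumed 4-bit weights
                  assumed_mem_gb_s: float = 200.0,   # placeholder memory bandwidth
                  sync_s: float = 0.002) -> float:   # placeholder per-token sync cost
    """Per-token time: read the active weights once, plus sync when clustered."""
    weights_gb = active_params_b * bytes_per_param
    compute_s = weights_gb / (assumed_mem_gb_s * nodes)
    return compute_s + (sync_s if nodes > 1 else 0.0)


def speedup(active_params_b: float, nodes: int) -> float:
    """Throughput on `nodes` nodes relative to a single node."""
    return decode_time_s(active_params_b, 1) / decode_time_s(active_params_b, nodes)


for label, active_b in [("dense, 7B active", 7.0), ("MoE, 3B active", 3.0)]:
    print(f"{label}: {speedup(active_b, 4):.2f}x on 4 nodes")
```

With fewer active parameters there is less per-token work to divide, so the fixed synchronization cost takes a larger share and the speedup shrinks.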

Quantization + model compatibility issues (what broke and what worked)

Goal: Run increasingly large models under tensor parallelism across 4 nodes.
DeepSeek R1 671B target:
- 4-bit AWQ for 671B didn’t work as expected:
  - loading would “hang silently,” leaving the GPUs idle
- MoE + AWQ reported as broken on this specific iGPU/software stack
DeepSeek-V2-Lite:
- AWQ on MoE broken (PR pending)
Llama 405B AWQ:
- Worked in tensor-parallel form only under certain conditions
- Earlier attempts “hung” depending on software/kernel

Workaround that worked for big MoE

Use W4A16 (4-bit weight-only, 16-bit activations) style quantization.
The creator reports rock-solid runs for models that previously failed (including a Qwen 30B MoE-like case, then later DeepSeek R1).
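
To make the W4A16 idea concrete, here is a minimal pure-Python sketch of weight-only quantization: weights are stored as 4-bit integers with one scale per group, and activations are left unquantized. It is an illustration only. Real W4A16 formats differ in packing, group size, and zero-point handling, Python floats stand in for 16-bit values, and the function names are hypothetical.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization of one group: integers in [-8, 7] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate weights for use in higher-precision arithmetic."""
    return [q * scale for q in quantized]


def dot_w4a16(quantized: list[int], scale: float, activations: list[float]) -> float:
    """Dot product with 4-bit stored weights and unquantized activations."""
    weights = dequantize_group(quantized, scale)
    return sum(w * a for w, a in zip(weights, activations))


q, s = quantize_group([0.7, -0.35, 0.0, 0.1])
print("stored integers:", q, "scale:", s)
print("dot product:", dot_w4a16(q, s, [1.0, 2.0, 3.0, 4.0]))
```

Only the weights lose precision here, which is the sense in which the scheme is "weight-only".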

Software fix: kernel/ROCm image mismatch resolved hangs

Root cause suspected/identified: Newer ROCm required a newer Linux kernel than what Fedora initially shipped.
Key steps:
- Initially on Fedora 43 (kernel 6.17), then upgraded to kernel 6.19
- Used Donato Capitella’s updated Strix Halo vLLM toolboxes/images
  - Toolboxes were Fedora-aligned, more compatible with this environment than Ubuntu-focused ones.
- After kernel update + reboot, previously hanging models began to run again.
Reported improvement: Previously failing models, such as Gemma 4 26B and eventually DeepSeek R1, started running.
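
Because the fix hinged on the kernel version, a quick pre-flight check on each node can rule out a stale kernel before debugging anything else. A minimal sketch, assuming the 6.19 minimum mentioned above; the sample release strings in the comments and the helper names are hypothetical:

```python
import platform
import re


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.19.2-200.fc43.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, minimum: tuple[int, int] = (6, 19)) -> bool:
    """True when the running kernel is at least the given (major, minor)."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    running = platform.release()
    status = "ok" if meets_minimum(running) else "older than 6.19"
    print(f"kernel {running}: {status}")
```

The check reflects the running kernel, so it only passes after the reboot that follows a kernel update.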

Final benchmark vs vendor demo

Power draw: ~180–187W per node, with substantial heat output.
Creator’s reported throughput for DeepSeek R1 (4-bit quant, 4 nodes, vLLM tensor parallelism):
- 6.23 tok/s
Minisforum marketing demo (same chip, 4-node DeepSeek R1 quantized to 4-bit):
- 5.94 tok/s
Net claim: Creator achieved ~5% faster throughput than the vendor’s own video on a fully open-source stack.
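
The ~5% figure follows directly from the two throughput numbers above. A short check, using only those reported values:

```python
def relative_gain_percent(measured: float, baseline: float) -> float:
    """Percentage by which `measured` exceeds `baseline`."""
    return (measured - baseline) / baseline * 100.0


gain = relative_gain_percent(6.23, 5.94)
print(f"{gain:.1f}% faster than the vendor demo")
```

The exact value is about 4.9%, which rounds to the ~5% claimed.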

Practical “results list” (what ran)

13 models tested across 3 cluster sizes (1, 2, and 4 nodes, as implied by context).
Some configurations didn’t fit or failed; successful runs included:
- Fastest: Qwen 7B dense and Qwen 30B MoE at ~38 tok/s
- Big DeepSeek R1 (671B): ~6.23 tok/s
Large-model failures/hangs were largely resolved via:
- the kernel + updated toolbox path, and
- avoiding unsupported quantization combinations on this stack.

Guides/tutorial takeaways (implied)

Don’t rely on bandwidth upgrades alone: latency dominates for multi-node tensor parallelism.
Use tensor parallelism (vLLM) for speedups; pipeline parallelism (llama.cpp RPC) lets bigger models fit but does not raise tokens/sec.
For RDMA speedups:
- match NIC hardware/protocols
- ensure RDMA permissions (memlock) and correct RDMA stack
For Strix Halo + open source:
- prefer Donato’s toolbox approach (container images with ROCm/Vulkan for this iGPU)
- use a compatible kernel version (author specifically recommends kernel 6.19)

Main speakers/sources

Primary speaker: The YouTube creator (unnamed in subtitles), running the experiments and benchmarks.
Referenced key source: Donato Capitella (Strix Halo cluster work; GitHub + “toolboxes” for ROCm/Vulkan/vLLM on Strix Halo).
