Summary of "I Plugged a DGX Spark and Mac Together... and Didn’t Expect This"
Technological concepts & experiment goal
- The video tests heterogeneous LLM inference by splitting the two main phases of generation across different hardware:
  - Prefill (processing the whole prompt): compute-heavy
    - GPUs (e.g., Nvidia Blackwell in the DGX Spark / MSI EdgeXpert) are strong here.
  - Decode (generating tokens one by one): memory-bandwidth-heavy
    - Apple Silicon is strong here.
- The key idea is disaggregated prefill + decode (a real production concept; the video cites companies such as DeepSeek and ByteDance), which lets each phase run on the hardware that suits it and be optimized independently.
- Caution from the author: being able to combine hardware doesn’t mean it’s worth it, because network transfer can dominate latency.
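The compute-bound vs bandwidth-bound split above can be illustrated with a rough back-of-the-envelope estimate. This is a minimal sketch, not the author's method: it assumes roughly 2 FLOPs per parameter per prompt token for prefill and one full read of the weights per generated token for decode, and the two device profiles are hypothetical placeholder numbers, not measurements from the video.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: roughly 2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / flops_per_s


def decode_tokens_per_s(params: float, bytes_per_param: float, mem_bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling: every weight is read once per generated token."""
    return mem_bytes_per_s / (params * bytes_per_param)


# Hypothetical device profiles (placeholders, not measured values).
COMPUTE_HEAVY = {"flops_per_s": 100e12, "mem_bytes_per_s": 250e9}
BANDWIDTH_HEAVY = {"flops_per_s": 25e12, "mem_bytes_per_s": 800e9}

PARAMS = 8e9          # an 8B-parameter dense model
PROMPT_TOKENS = 4096  # prompt length used for the prefill estimate
BYTES_PER_PARAM = 0.5  # 4-bit weights

for name, dev in (("compute-heavy", COMPUTE_HEAVY), ("bandwidth-heavy", BANDWIDTH_HEAVY)):
    pre = prefill_seconds(PARAMS, PROMPT_TOKENS, dev["flops_per_s"])
    dec = decode_tokens_per_s(PARAMS, BYTES_PER_PARAM, dev["mem_bytes_per_s"])
    print(f"{name}: prefill ~{pre:.2f}s, decode ceiling ~{dec:.0f} tok/s")
```

Under these assumptions the compute-heavy device finishes prefill sooner while the bandwidth-heavy device has the higher decode ceiling, which is the motivation for splitting the phases. Real numbers also depend on kernels, quantization, and the cost of moving the KV cache between machines.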
Hardware / setup described
- Two main machines used initially:
  - MSI EdgeXpert, a DGX Spark-class system (“GB10”)
    - Nvidia Blackwell GPU
    - 128GB unified memory
  - Mac Mini M4 Pro
    - Apple Silicon
    - 64GB unified memory
- Later upgrade for better decode:
  - Mac Studio M3 Ultra with 512GB unified memory
- Networking components:
  - Peer discovery uses mDNS via Exo’s libp2p stack, which ran into issues on macOS.
  - High-speed connectivity comes from QSFP NICs in a Thunderbolt enclosure and a MikroTik CRS812 switch.
  - Performance is compared across link speeds (e.g., the ~50GbE link vs the earlier ~2.5GbE USB Ethernet adapter).
Software / framework & implementation details
- Uses Exo (repo referenced) with experimental community support for prefill/decode disaggregation on Blackwell.
- Builds and compiles:
  - Rust networking bindings
  - vLLM from source on ARM Linux for the Spark side
  - MLX from source on the Mac side, plus Node.js setup
- Implementation relies on:
  - Running Exo instances on both devices
  - Routing prefill computations to the GPU machine and decode to the Apple machine (or swapped, depending on the stage of the experiment)
Major challenges / fixes (from guide/debug perspective)
- Peer discovery fails on macOS
  - Exo uses mDNS, and the Mac and Spark couldn’t “see” each other.
- Fix: avoid mDNS by explicitly dialing the peer
  - Use an environment variable on the Spark side
  - Result: the connection comes up instantly
- Additional bugs:
  - Prefill routing was broken initially (the Mac runner couldn’t reach the GB10)
  - Required a reboot and workarounds until stable
Review-style findings: performance results (measurements)
Example model: Qwen 3.5 (thinking model)
- Hardware roles:
  - Spark (GB10) runs unquantized BF16 weights for faster prefill compute
  - Mac Mini runs a 4-bit quantization optimized for Apple/MLX (intended for faster decode)
- Observations:
  - KV-cache transfer dominates end-to-end time over slower links.
  - Even with fast compute, the network took most of the time:
    - At ~25,000 tokens:
      - Spark computed the KV cache in < 1s
      - the transfer took ~25s
  - Outcome:
    - ~96% of total time was network, with the GPU mostly idle.
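The ~96% figure follows directly from the two timings above, and the size of the KV cache explains why the link matters so much. The sketch below is illustrative only: the per-token cache formula is the standard one for grouped-query attention, but the model shape used here (32 layers, 8 KV heads, head dimension 128, BF16) is that of Llama 3.1 8B and is an assumption, since the summary does not give the shape of the Qwen model that was measured. Link speeds are treated as ideal line rates.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size: keys and values (the factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: float, link_gbit_per_s: float) -> float:
    """Ideal transfer time at line rate (no protocol overhead)."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


def network_share(compute_s: float, transfer_s: float) -> float:
    """Fraction of prefill-plus-transfer time spent on the network."""
    return transfer_s / (compute_s + transfer_s)


# Timings from the summary: under 1 s of compute, about 25 s of transfer.
print(f"network share: {network_share(1.0, 25.0):.0%}")

# Assumed model shape (Llama 3.1 8B, BF16 cache) at 25,000 tokens.
size = kv_cache_bytes(25_000, layers=32, kv_heads=8, head_dim=128)
for gbit in (2.5, 50.0):
    print(f"{size / 1e9:.2f} GB over {gbit} Gbit/s: {transfer_seconds(size, gbit):.2f}s")
```

Because compute was under 1 s, 96% is a lower bound on the network share for that run. The measured ~25s is longer than an ideal-line-rate estimate for this assumed shape would suggest, so stack overhead or a larger cache presumably contributed; the summary does not say which.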
Token-generation timing nuance: “thinking models”
- For thinking models (which emit hidden reasoning tokens first), the time to the first visible token can be dominated by reasoning tokens generated at decode speed.
- When reasoning dominates, the disaggregated and single-device results converge, masking the benefit of faster prefill.
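A simple additive model shows why the results converge. This is a sketch under stated assumptions: the time to the first visible token is taken as prefill time plus KV-cache transfer time plus the reasoning tokens divided by decode speed, and all the numbers below are hypothetical, chosen only to show the effect.

```python
def time_to_visible_answer(prefill_s: float, transfer_s: float,
                           reasoning_tokens: int, decode_tok_per_s: float) -> float:
    """Prefill + KV transfer + hidden reasoning tokens generated at decode speed."""
    return prefill_s + transfer_s + reasoning_tokens / decode_tok_per_s


# Hypothetical setups: slow local prefill vs fast remote prefill plus a transfer.
LOCAL = {"prefill_s": 5.0, "transfer_s": 0.0}
DISAGGREGATED = {"prefill_s": 1.0, "transfer_s": 0.5}
DECODE_TOK_PER_S = 50.0  # same decode device in both setups

for reasoning_tokens in (0, 2000):
    local = time_to_visible_answer(reasoning_tokens=reasoning_tokens,
                                   decode_tok_per_s=DECODE_TOK_PER_S, **LOCAL)
    disagg = time_to_visible_answer(reasoning_tokens=reasoning_tokens,
                                    decode_tok_per_s=DECODE_TOK_PER_S, **DISAGGREGATED)
    print(f"{reasoning_tokens} reasoning tokens: local {local:.1f}s, "
          f"disaggregated {disagg:.1f}s, speedup {local / disagg:.2f}x")
```

With no reasoning tokens the prefill difference is the whole story; with thousands of reasoning tokens the shared decode time swamps it and the two setups look nearly identical.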
Networking upgrade & controlled comparisons
Improvements made
- Added a Thunderbolt 5 enclosure with a QSFP NIC compatible with macOS.
- Used a Mellanox ConnectX-4 (50GbE) because macOS rejected the initial Intel 100Gb card (“driver not installed”).
- Reported:
  - Switch/NIC link negotiation required tuning
  - KV-cache transfer improved by about 30%
Non-thinking model for clearer comparison
- Used Llama 3.1 8B (dense, widely compatible).
- Benchmark tool: llama-benchy
  - Measures prompt processing vs token generation via HTTP API calls (so the numbers include stack overhead).
- Key metrics (time to first token at PP = 4096, i.e., a 4096-token prompt):
  - Disaggregated time-to-first-token closely matched Spark alone:
    - about 2.4s (disaggregated) vs 2.3s (Spark)
  - Disaggregated decode was slower than Mac alone because injecting the remotely computed KV cache adds overhead.
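Measuring through an HTTP API means timing a stream of response chunks from the client side. The sketch below shows the general idea and is not the benchmark tool's actual implementation: `time_stream` is a hypothetical helper, it assumes each streamed chunk carries one token, and it takes any iterable so it can wrap whatever streaming client is in use.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter
                ) -> Tuple[float, Optional[float]]:
    """Return (time to first token, decode rate in tokens/s) for a token stream.

    Time to first token covers prompt processing plus any stack overhead.
    Decode rate is measured between the first and last chunk, so it excludes
    prompt processing. It is None when fewer than two chunks arrive.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    rate = (count - 1) / (last - first) if count > 1 and last > first else None
    return ttft, rate
```

In use, `chunks` would be the iterator returned by a streaming request to the inference server. Separating the two numbers this way is what lets a comparison attribute a slowdown to prefill (time to first token) or to decode (tokens per second).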
Conclusion for the Mac Mini stage
- Disaggregation can recover Spark-like time-to-first-token when networking is not the bottleneck.
- But end-to-end gains depend heavily on network speed and model behavior (especially thinking tokens).
Main “scaling up” tests: swapping to Mac Studio M3 Ultra
- The author replaces the Mac Mini with a Mac Studio M3 Ultra to increase decode bandwidth.
- Networking had to be reworked because the reachability problems returned:
  - Recreated the network service so the Mac Studio could reach the Spark after a reboot
  - Took “a couple hours” this time vs days earlier
- Results for Llama 3.1 8B (8B class):
  - Decode was much faster on the Mac Studio:
    - ~106 tokens/s decode vs the Mac Mini’s ~52
  - Disaggregated decode improved, but remained limited by KV-cache injection overhead
- Argument: disaggregation becomes more attractive as model size increases
  - Faster Apple decode helps
  - Spark’s prefill advantage becomes more pronounced as models get larger
Larger models: what worked and what didn’t
Constraint: Spark memory + vLLM kernel support for quantization
- For a 70B Llama 3-family model, full precision did not fit in Spark’s 128GB.
- Quantization attempts failed due to missing vLLM kernels for Spark’s architecture:
  - FP8 failed (kernel compilation issues)
  - AWQ and W4A16/GPTQ Marlin failed because the Marlin repack kernel was missing
- Workaround: use smaller models that fit on the Spark in BF16, paired with a 4-bit build on the Mac Studio:
  - Qwen 2.5 ~32B (BF16 on Spark + 4-bit on Mac Studio)
  - Gemma 2 ~27B (BF16 on Spark + 4-bit on Mac Studio)
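The memory constraint above is simple arithmetic on weight size. A minimal sketch, assuming nominal parameter counts and counting weights only (KV cache, activations, and memory reserved by the OS are ignored, so real headroom is smaller):

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes) for a given precision."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory."""
    return weight_gb(params_billion, bits_per_param) <= memory_gb


SPARK_MEMORY_GB = 128

for name, params_billion in (("70B", 70), ("32B", 32), ("27B", 27)):
    size = weight_gb(params_billion, 16)  # BF16 = 16 bits per parameter
    verdict = "fits" if fits(params_billion, 16, SPARK_MEMORY_GB) else "does not fit"
    print(f"{name} in BF16: {size:.0f} GB, {verdict} in {SPARK_MEMORY_GB} GB")
```

A 70B model at 4 bits would need only about 35 GB by the same arithmetic, which is why the failed quantization kernels, not memory alone, were the blocking issue for the larger model.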
Disaggregated vs single-device patterns across model sizes
- Consistent pattern across Llama 8B, Gemma 27B, Qwen 32B:
  - Disaggregated time-to-first-token tracks the Spark:
    - Spark-like first-token latency recovered
  - Spark’s prefill advantage grows with model size:
    - At 8B: Spark only slightly ahead
    - At 27–32B: Spark is ~2x to 2.5x faster at prefill
  - The Mac Studio’s decode advantage shrinks at larger sizes:
    - At 8B: Mac Studio decode dramatically faster than Spark
    - At 27–32B: the decode gap shrinks, attributed to factors such as Spark decode improving in relative terms and attention/kernel fusion/compilation effects
Final assessment / practical advice
- As a proof of concept, heterogeneous disaggregated inference works:
  - Two machines can produce tokens faster than either alone by matching each phase to its best hardware
  - Still requires:
    - workable networking
    - correct kernels/quantization support
    - significant engineering/debug time
- As a practical purchase recommendation:
  - A DGX Spark plus a Mac Studio is expensive
  - If buying new hardware, the author suggests a more cost-effective path:
    - an RTX Pro 6000 workstation-based setup
- Overall takeaway:
  - Whether disaggregation helps depends mainly on network throughput and decode bottlenecks.
Main speakers / sources
- Main speaker/source: The video author (narrator; builds, debugs, benchmarks)
- Software/source referenced:
  - Exo project/repo (and community PRs; libp2p/mDNS behavior)
  - llama-benchy benchmarking tool
- Comparative references (industry):
  - DeepSeek
  - ByteDance (disaggregated prefill/decode used in production)
Category
Technology