Summary of "I Plugged a DGX Spark and Mac Together... and Didn’t Expect This"

Technological concepts & experiment goal

The video tests heterogeneous LLM inference by splitting the two main phases of generation across different hardware:
- Prefill (processing the whole prompt): compute-heavy
  - GPUs (e.g., Nvidia Blackwell in the DGX Spark / MSI EdgeXpert) are strong here.
- Decode (generating tokens one-by-one): memory-bandwidth-heavy
  - Apple Silicon is strong here.
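The key idea is disaggregated prefill and decode: run each phase on the hardware that suits it best and hand the KV cache from the prefill machine to the decode machine. The video presents this as a real production technique, citing companies such as DeepSeek and ByteDance.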
Caution from the author: just because you can combine hardware doesn’t mean it’s worth it, because network transfer can dominate latency.
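
The compute-bound vs. bandwidth-bound split above can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch using two common rules of thumb (about 2 FLOPs per parameter per prompt token for prefill, and one full read of the weights per generated token for decode). The hardware figures in the usage section are illustrative placeholders, not numbers from the video.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: roughly 2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / flops_per_s


def decode_tokens_per_second(weight_bytes: float, mem_bytes_per_s: float) -> float:
    """Bandwidth-bound estimate: each generated token re-reads all weights once."""
    return mem_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Placeholder hardware figures (assumptions, not measurements from the video).
    sustained_flops = 100e12   # 100 TFLOP/s of usable compute
    memory_bandwidth = 400e9   # 400 GB/s of memory bandwidth

    # An 8B-parameter model with 4-bit weights (about 0.5 bytes per parameter).
    params = 8e9
    weight_bytes = params * 0.5

    print(f"prefill, 4096 tokens: {prefill_seconds(params, 4096, sustained_flops):.2f} s")
    print(f"decode upper bound:   {decode_tokens_per_second(weight_bytes, memory_bandwidth):.0f} tok/s")
```

Both functions give upper bounds only. Real throughput also depends on kernel quality, batch size, and attention cost, which is why measured numbers sit below these estimates.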

Hardware / setup described

Two main machines used initially:
- MSI EdgeXpert / DGX Spark-like system (“GB10”)
  - Nvidia Blackwell GPU
  - 128GB unified memory
- Mac Mini M4 Pro
  - Apple Silicon
  - 64GB unified memory
Later upgrade for better decode:
- Mac Studio M3 Ultra with 512GB unified memory
Networking components:
- Peer discovery uses mDNS via Exo’s libp2p stack, which ran into issues on macOS.
- High-speed links use QSFP NICs in a Thunderbolt enclosure and a MikroTik CRS812 switch.
- Performance is compared across link speeds (e.g., the ~50GbE link vs. the earlier ~2.5GbE USB Ethernet adapter).

Software / framework & implementation details

Uses Exo (repo referenced) and experimental community support for Blackwell prefill/decode disaggregation.
Builds and compiles:
- Rust networking bindings
- vLLM from source on ARM Linux for the Spark side
- MLX from source on the Mac side, plus Node.js setup
Implementation relies on:
- Running Exo instances on both devices
- Routing prefill computations to the GPU machine and decode to the Apple machine (or swapped depending on the stage)

Major challenges / fixes (from guide/debug perspective)

Peer discovery fails on macOS
- Exo uses mDNS, and the Mac and Spark couldn’t “see” each other.
Fix: avoid mDNS by explicitly dialing the peer
- Use an environment variable on the Spark side
- Result: the connection comes up instantly
Additional bugs:
- Prefill routing broken initially (Mac runner couldn’t reach GB10)
- Required reboot + workarounds until stable
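
When peer discovery or prefill routing fails like this, a quick first check is whether one machine can open a TCP connection to the other at all. The sketch below is a generic reachability probe using Python's standard library. It is not part of Exo, and the host name and port in the usage note are hypothetical, since the summary does not name the environment variable or the port Exo listens on.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_reach("spark.local", 52415)` run from the Mac (both values are placeholders) separates "the peer is unreachable" from "the peer is reachable but discovery is broken", which is the case the explicit-dial workaround addresses.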

Review-style findings: performance results (measurements)

Example model: Qwen 3.5 (thinking model)

Hardware roles:
- Spark (GB10) runs the model unquantized in BF16 for faster prefill compute
- Mac Mini runs 4-bit quantization optimized for Apple/MLX (intended for faster decode)
Observations:
- KV cache transfer dominates end-to-end time over slower links.
- Even with fast compute, network took most of the time:
  - At ~25,000 tokens:
    - Spark KV-cache computed in < 1s
    - transfer took ~25s
- Outcome:
  - ~96% of total time was network, with the GPU mostly idle.
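
The arithmetic behind these observations can be sketched as follows. The KV-cache size formula is the standard one (keys plus values, per layer, per KV head, per token). The model shape in the usage section is a hypothetical example and not the Qwen configuration from the video. Only the last calculation uses the reported figures, taking 1 s as the upper bound for compute.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys and values (factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_seconds(num_bytes: float, link_bits_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Time to move num_bytes over a link; efficiency < 1 models protocol overhead."""
    return num_bytes * 8 / (link_bits_per_s * efficiency)


def network_share(compute_s: float, transfer_s: float) -> float:
    """Fraction of (compute + transfer) time spent on the network."""
    return transfer_s / (compute_s + transfer_s)


if __name__ == "__main__":
    # Hypothetical model shape (assumption): 32 layers, 8 KV heads, head_dim 128, BF16.
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=25_000)
    print(f"KV cache: {size / 1e9:.2f} GB")
    print(f"2.5 Gb/s link: {transfer_seconds(size, 2.5e9):.1f} s at ideal line rate")

    # Reported figures: under 1 s of compute, about 25 s of transfer.
    print(f"network share: {network_share(1.0, 25.0):.0%}")
```

With 1 s of compute and 25 s of transfer, the network share is 25/26, or about 96%, which matches the figure reported above.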

Token-generation timing nuance: “thinking models”

For thinking models, which emit hidden reasoning tokens before the visible answer, the time to the first visible token can be dominated by reasoning tokens generated at decode speed.
When reasoning dominates, the disaggregated and single-device results converge, which masks the prefill benefit.
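
A small model of this effect, as a sketch: the time to the first visible token is the prefill time plus the time spent decoding hidden reasoning tokens. All numbers in the usage section are made up for illustration and are not taken from the video.

```python
def seconds_to_first_visible_token(prefill_s: float, reasoning_tokens: int,
                                   decode_tok_per_s: float) -> float:
    """Prefill time plus the time to decode the hidden reasoning tokens."""
    return prefill_s + reasoning_tokens / decode_tok_per_s


if __name__ == "__main__":
    # Illustrative assumptions: a 3x prefill gap, same decode speed on both setups.
    fast_prefill_s, slow_prefill_s = 2.0, 6.0
    for reasoning_tokens in (0, 2000):
        fast = seconds_to_first_visible_token(fast_prefill_s, reasoning_tokens, 50.0)
        slow = seconds_to_first_visible_token(slow_prefill_s, reasoning_tokens, 50.0)
        print(f"{reasoning_tokens} reasoning tokens: {fast:.0f} s vs {slow:.0f} s "
              f"(ratio {slow / fast:.2f}x)")
```

With no reasoning tokens the 3x prefill gap is fully visible. With 2000 reasoning tokens at 50 tokens/s, both setups spend 40 s reasoning, so the gap shrinks to 42 s vs 46 s.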

Networking upgrade & controlled comparisons

Improvements made

Added a Thunderbolt 5 enclosure with a QSFP NIC compatible with macOS.
Used a Mellanox ConnectX-4 (50GbE) because macOS rejected the initial Intel 100Gb card (“driver not installed”).
Reported:
- Switching/NIC negotiation required tuning
- KV-cache transfer improved by about 30%

Non-thinking model for clearer comparison

Used Llama 3.1 8B (dense, widely compatible).
Benchmark tool: llama-benchy
- Measures prompt processing vs token generation via HTTP API calls (including stack overhead).
Key metrics (prompt-processing length PP = 4096 tokens):
- Disaggregated time-to-first-token matched Spark-alone closely:
  - about 2.4s (disaggregated) vs 2.3s (Spark)
- Disaggregated decode was slower than Mac alone due to KV-cache injection overhead (remote KV injection slows decode).
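
The two metrics compared here, time to first token and decode rate, can be measured from any streamed response. The sketch below shows the idea on a generic iterator of tokens. It is not the benchmark tool's implementation, and `fake_stream` is a hypothetical stand-in for a streaming HTTP response.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, decode_tokens_per_second, token_count) for a stream."""
    start = time.perf_counter()
    first = last = None
    count = 0
    for _ in chunks:
        last = time.perf_counter()
        if first is None:
            first = last  # arrival of the first token ends the prefill phase
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    window = last - first
    # Decode rate counts the tokens that arrived after the first one.
    rate = (count - 1) / window if count > 1 and window > 0 else 0.0
    return first - start, rate, count


def fake_stream(prefill_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a server: waits, then yields n tokens."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, rate, n = measure_stream(fake_stream(0.2, 0.02, 20))
    print(f"TTFT {ttft:.2f} s, decode {rate:.0f} tok/s over {n} tokens")
```

Because the clock starts before the request is consumed, the measured time to first token includes any stack overhead between client and model, which is the same caveat noted for the HTTP-based benchmark above.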

Conclusion for the Mac Mini stage

Disaggregation can recover Spark-like time-to-first-token when networking is not the bottleneck.
But end-to-end gains depend heavily on network speed and model behavior (especially thinking tokens).

Main “scaling up” tests: swapping to Mac Studio M3 Ultra

The author replaces Mac Mini with Mac Studio M3 Ultra to increase decode bandwidth.
Networking had to be reworked because of reachability issues again:
- Recreated the networking service so Mac Studio could reach Spark after reboot
- Took “a couple hours” this time vs days earlier
Results for Llama 3.1 8B:
- Decode much faster on Mac Studio:
  - ~106 tokens/s decode vs Mac Mini’s ~52
- Disaggregated decode improved, but remained limited by KV-cache injection overhead
Argument: disaggregation becomes more attractive as model size increases
- Faster Apple decode helps
- Spark’s prefill advantage becomes more pronounced as models get larger

Larger models: what worked and what didn’t

Constraint: Spark memory + vLLM kernel support for quantization

For the 70B Llama 3 model, full-precision (BF16) weights did not fit in the Spark’s 128GB.
Quantization attempts failed due to missing vLLM kernels for Spark’s architecture:
- FP8 failed (kernel compilation issues)
- AWQ and W4A16/GPTQ Marlin failed because the Marlin repack kernel was missing
Workaround: use models that fit and/or supported quantizations:
- Qwen 2.5 ~32B (BF16 on Spark + 4bit on Mac Studio)
- Gemma 2 ~27B (BF16 on Spark + 4bit on Mac Studio)
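
The memory constraint is simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below checks the model sizes mentioned above against 128GB. It counts weights only (decimal gigabytes) and ignores the KV cache, activations, and quantization metadata, so real headroom is smaller than these numbers suggest.

```python
def weight_gigabytes(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: float, memory_gb: float) -> bool:
    """True if the weights alone fit in memory_gb (no KV cache or activations)."""
    return weight_gigabytes(params_billion, bits_per_param) <= memory_gb


if __name__ == "__main__":
    for name, size_b in (("70B", 70), ("32B", 32), ("27B", 27)):
        gb = weight_gigabytes(size_b, 16)
        verdict = "fits" if fits(size_b, 16, 128) else "does not fit"
        print(f"{name} in BF16: {gb:.0f} GB, {verdict} in 128 GB")
```

A 70B model in BF16 needs about 140 GB for weights alone, which exceeds 128GB, while 32B (64 GB) and 27B (54 GB) fit with room left for the KV cache.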

Disaggregated vs single-device patterns across model sizes

Consistent pattern across Llama 8B, Gemma 27B, Qwen 32B:
- Disaggregated time-to-first-token tracks the Spark:
  - Spark-like first-token latency recovered
- Spark’s prefill advantage grows with model size:
  - At 8B: Spark only slightly ahead
  - At 27–32B: Spark is ~2x to 2.5x faster at prefill
- Decode becomes relatively less dominant at larger sizes:
  - At 8B: Mac Studio decode dramatically faster than Spark
  - At 27–32B: the decode gap shrinks, which the video attributes to the Spark’s decode holding up relatively better and to attention, kernel-fusion, and compilation effects

Final assessment / practical advice

As a proof of concept, heterogeneous disaggregated inference works:
- Two machines can produce tokens faster than either alone by matching each phase to its best hardware
- Still requires:
  - workable networking
  - correct kernels/quantization support
  - significant engineering/debug time
As a practical purchase recommendation:
- DGX Spark + Mac Studio are expensive
- If buying new hardware, the author suggests a more cost-effective path:
  - an RTX Pro 6000 workstation-based setup
Overall takeaway:
- Whether disaggregation helps depends mainly on network throughput and decode bottlenecks.

Main speakers / sources

Main speaker/source: The video author (narrator; builds, debugs, benchmarks)
Software/source referenced:
- Exo project/repo (and community PRs; libp2p/mDNS behavior)
- llama-benchy benchmarking tool
Comparative references (industry):
- DeepSeek
- ByteDance (disaggregated prefill/decode used in production)
