Summary of "Running vLLM on Strix Halo (AMD Ryzen AI MAX) + ROCm Performance Updates"
Hands-on guide and benchmarks for running vLLM on the Strix Halo (RDNA 3.5 integrated GPU / unified memory), with attention-backend comparisons, ROCm/The Rock updates, and practical configuration tips.
What the video covers
- How to run vLLM on the Strix Halo and how to configure the system for best unified-memory usage.
- A comparison of attention backends in vLLM (Triton vs AMD ROCm attention kernels), including a toggleable toolbox to test backends per model.
- Benchmarks (tokens/sec) across models and quantizations showing mixed results: ROCm attention can be much faster for some models (e.g., Gemma 3 12B) but slower or failing for others (e.g., Qwen 3 Coder/Next).
- Model compatibility and quantization limitations on Strix Halo: RDNA 3.5 lacks some instructions (no native FP8), so many FP8 or some AWQ/GPTQ quantizations won’t load natively; dequantization to FP16/BF16 is possible but removes memory savings.
- Practical guidance: run a Fedora host, tweak kernel parameters to expose nearly all of the 128 GB unified memory (reserving ~4 GB for the OS), and use a Docker image built with vLLM plus nightly ROCm/The Rock builds (AOTriton / ROCm attention kernels).
- Stability caveats and the need to experiment model-by-model to find working quantizations and backends.
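The kernel-parameter step mentioned above can be sketched as follows. The exact values are assumptions (sized to leave roughly 4 GiB for the OS out of 128 GB); `amdgpu.gttsize` and `ttm.pages_limit` are the parameters commonly used for this on Strix Halo, and `grubby` is the Fedora tool for editing kernel arguments:

```shell
# Sketch: expose most of the 128 GB unified memory to the GPU.
# Values are illustrative: ~124 GiB of GTT, leaving ~4 GiB for the OS.
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages
# (124 GiB = 126976 MiB = 32505856 pages).
sudo grubby --update-kernel=ALL \
  --args="amdgpu.gttsize=126976 ttm.pages_limit=32505856"
sudo reboot
```

Verify after reboot with `sudo dmesg | grep -i gtt` that the GTT size the driver reports matches what you set.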
Key technical points and product features
vLLM
- Targeted at server-style workloads (multi-request handling, larger batches, tensor parallelism).
- Best performance on supported data-center GPUs; support on consumer AMD GPUs (Strix Halo) is still immature and model support is patchy.
Attention backends
- Triton: the safest baseline, reliable across the broadest range of models.
- ROCm attention (AMD native kernels): can outperform Triton for certain architectures but has uneven support on RDNA 3.5; results are use-case dependent.
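Switching backends per model is done through vLLM's `VLLM_ATTENTION_BACKEND` environment variable. The variable itself is real, but the value strings below are assumptions for current ROCm builds; check the vLLM docs for the exact names your build accepts:

```shell
# Baseline: Triton attention (the safer default on Strix Halo).
# Backend value names are assumptions -- consult your vLLM build's docs.
VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve <model>

# Experiment: AMD's native ROCm attention kernels. Faster for some
# models (e.g. Gemma 3 12B), but may be slower or fail to load for
# others (e.g. Qwen 3 Coder/Next).
VLLM_ATTENTION_BACKEND=ROCM_ATTN vllm serve <model>
```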
Quantization and model formats
- GGUF / llama.cpp: very versatile for local inference; community GGUF quantizations make many models easy to run, and often performant, on consumer hardware.
- FP8 models: work natively on newer RDNA 4 GPUs (e.g., the Radeon AI PRO R9700) but not on Strix Halo; dequantizing to FP16/BF16 is possible but negates FP8's memory savings.
- AWQ / GPTQ variants: compatibility varies and some may fail depending on available kernels.
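In practice this means a per-model trial loop. A hedged sketch: `vllm serve` and its `--dtype` flag are real, but the model IDs below are placeholders, and whether a given quantization loads on RDNA 3.5 has to be tested case by case:

```shell
# Try the FP8 checkpoint first -- expected to fail on RDNA 3.5,
# which has no native FP8 instructions. (Model ID is hypothetical.)
vllm serve some-org/some-model-FP8

# Fall back to an AWQ/GPTQ variant if a compatible kernel exists,
# or load an unquantized checkpoint as bfloat16 -- works, but you
# give up the quantization's memory savings.
vllm serve some-org/some-model --dtype bfloat16
```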
ROCm / The Rock
- ROCm 7 and The Rock PyTorch builds (with AOTriton included) are available.
- AOTriton compiles key kernels ahead of time (AOT), improving stability and performance (notably for diffusion/image workloads).
- ROCm attention uses AMD’s native kernels (not Triton-generated), giving speedups for some models but introducing incompatibilities for others.
llama.cpp
- Developers added AMD-specific custom kernels, improving performance and diminishing the usefulness of the earlier rocWMMA tooling.
- For many models on Strix Halo, llama.cpp currently yields better practical performance than vLLM due to available quantizations and optimizations.
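For comparison, the llama.cpp path is typically a single command once a community GGUF quantization exists. The flags below are standard `llama-server` options; the model path is a placeholder:

```shell
# ROCm/HIP (or Vulkan) build of llama.cpp; offload all layers to the
# iGPU -- with unified memory, "all layers" is usually what you want.
llama-server -m ./models/gemma-3-12b-it-Q4_K_M.gguf \
  --n-gpu-layers 99 --ctx-size 8192 --port 8080
```

This exposes an OpenAI-compatible endpoint on port 8080, so the same benchmark scripts can be pointed at either server.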
Image / video generation
- Rebuilding toolboxes on The Rock/nightly PyTorch with AOTriton can give ~2x speedups for many diffusion models (e.g., Qwen Image in ComfyUI).
- ComfyUI stability improved on recent Linux kernels.
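The rebuild step amounts to swapping in a nightly ROCm PyTorch wheel inside the toolbox. The index URL and ROCm version below are assumptions that change frequently; copy the current combination from pytorch.org:

```shell
# Inside the toolbox/container: install a nightly ROCm PyTorch build.
# The rocm6.4 index URL is illustrative -- pick the current
# nightly/ROCm combination from the PyTorch "get started" page.
pip install --pre torch torchvision \
  --index-url https://download.pytorch.org/whl/nightly/rocm6.4
```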
Community and tooling
- Toolbox + benchmark scripts and a Docker image are available on the author’s GitHub.
- Active Strix Halo Discord community; AMD and contributors are iterating on kernels.
- AMD is collecting GPU kernel traces from ComfyUI users to improve ROCm stability and performance.
Practical takeaways and recommendations
- You can run vLLM on Strix Halo, but expect compatibility limits and the need to test models individually.
- Start with Triton as the baseline, and try ROCm attention for models where it may yield large gains.
- If you depend on FP8 memory savings, Strix Halo is likely unsuitable unless you accept dequantization (and the lost memory benefit).
- For many consumer use cases, llama.cpp with GGML quantizations may be simpler and faster today.
- Use the provided Docker image, benchmark scripts, and the GitHub repo to reproduce tests.
- Consider joining the Strix Halo Discord and contributing kernel traces if you run ComfyUI.
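The tokens/sec figures above can be reproduced against any of the servers discussed, since both vLLM and llama.cpp expose OpenAI-compatible endpoints. A minimal stdlib-only helper, assuming the standard `/v1/completions` request/response shape (the base URL and model name are placeholders):

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput in generated tokens per second; 0.0 if no time elapsed."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench_once(base_url: str, model: str, prompt: str,
               max_tokens: int = 128) -> float:
    """Send one completion request to an OpenAI-compatible server
    (e.g. `vllm serve` or `llama-server`) and return tokens/sec.

    Note: this times the whole request, so it folds prompt processing
    into the figure; it is a rough end-to-end number, not pure decode.
    """
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)
```

Running `bench_once("http://localhost:8000", "<model>", "Hello")` once per model/backend combination is enough to reproduce the kind of comparison table shown in the video.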
Reviews, guides, and resources mentioned
- The video: hands-on guide to running vLLM on Strix Halo, throughput benchmarks, and attention-backend comparisons.
- GitHub repo: toolbox, scripts, Docker image, and benchmark code (link referenced in video).
- Prior video: a deeper vLLM explanation and a dual R9700 comparison (recommended for more detail).
- Instructions for kernel parameter settings to expose unified memory while reserving ~4 GB for the OS.
- AMD GitHub issue / instructions for enabling GPU kernel trace dumps (for ROCm optimization, ComfyUI users).
Main speakers and sources
- Video presenter / channel author — demonstrates setup, benchmarks, and provides the toolbox.
- AMD / ROCm (including The Rock builds and ROCm 7).
- llama.cpp developers and community contributors.
- Community sources: Strix Halo Discord and ComfyUI users.
Category
Technology