Summary of "Your local LLM is 10x slower than it should be"

Summary of Technological Concepts / Product Features / Analysis

Local LLM Serving Comparison (Ollama vs. Llama.cpp / Llama Server)

The video benchmarks token generation speed (“tokens per second”) across different local serving setups:

Ollama (example: 34B model)
- ~100 tokens/sec
Llama Server (llama-server, the HTTP server that ships with Llama.cpp)
- ~124 tokens/sec in a single-chat scenario (one request at a time)

Why Latency Isn’t the Only Issue: Concurrency for “Code Assistants” and Agents

The speaker argues that single-request chat benchmarks don’t represent real code assistant / agent systems, which may require many simultaneous chats.

In remote querying tests, sending requests from a laptop to a Mac Studio running Llama Server yields similar per-request throughput (around 120 tokens/sec).

Concurrency Limits and the Key Bottleneck

Even with concurrency = 128, the speaker shows throughput reaching only a baseline of roughly 231 tokens/sec. That figure still appears to be limited by what a single Llama.cpp server can handle on its own, which makes high-concurrency scenarios challenging compared with setups that add further components around it.
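To make the difference between per-request speed and total throughput concrete, here is a minimal Python sketch. The RequestResult type and both helper functions are illustrative names, not part of Llama.cpp or the video's tooling, and the sample numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    tokens: int    # completion tokens generated for this request
    start: float   # time the request was sent, in seconds
    end: float     # time the last token arrived, in seconds


def per_request_tps(result: RequestResult) -> float:
    """Tokens per second as seen by one client."""
    return result.tokens / (result.end - result.start)


def aggregate_tps(results: list[RequestResult]) -> float:
    """Total tokens across all requests divided by the wall-clock window."""
    total_tokens = sum(r.tokens for r in results)
    wall_clock = max(r.end for r in results) - min(r.start for r in results)
    return total_tokens / wall_clock


# Invented sample: four requests running at the same time for 10 seconds each.
results = [RequestResult(tokens=200, start=0.0, end=10.0) for _ in range(4)]

print(per_request_tps(results[0]))  # speed one client experiences
print(aggregate_tps(results))       # throughput the server delivers overall
```

In this invented sample each client sees 20 tokens/sec while the server delivers 80 tokens/sec in total, which is why a single-chat benchmark can understate what a machine delivers under concurrent load.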

“Llama Throughput Lab” Tool / Repo

The main project discussed is Llama Throughput Lab, an open-source benchmarking framework documented on GitHub.

It includes:

- Automated tests and configuration sweeps
- A Python launcher that can run across the operating systems and hardware where Llama.cpp runs (macOS, Windows, Linux; potentially NVIDIA GPUs)
- Sweeps to explore the best configuration for your machine

Benchmark Tests: Single Request, Concurrent Requests, and “Full Sweep”

The framework includes three major benchmark modes:

Single request test
- Choose model and server target
- Measures tokens/sec
Concurrent requests test
- Run many simultaneous requests (example shows 128 concurrency)
- Computes metrics such as:
  - total/average tokens per second
  - overall throughput
Full Sweep test
- Sweeps parameter combinations by starting multiple Llama Server instances
- Uses a large parameter space (308 combinations) involving:
  - instances (number of Llama Server processes)
  - parallel (Llama Server's --parallel setting: how many requests each instance processes at once)
  - concurrency (number of simultaneous client requests)
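A sweep like this amounts to walking the Cartesian product of the three parameters. The sketch below shows that idea only; the value lists are hypothetical and much smaller than the 308-combination space mentioned above, and build_sweep is an illustrative helper rather than the framework's actual API.

```python
from itertools import product


def build_sweep(instances, parallel, concurrency):
    """Return every (instances, parallel, concurrency) combination as a dict."""
    return [
        {"instances": i, "parallel": p, "concurrency": c}
        for i, p, c in product(instances, parallel, concurrency)
    ]


# Hypothetical value lists, chosen only to keep the example small.
grid = build_sweep(
    instances=[1, 2, 4],
    parallel=[1, 8],
    concurrency=[16, 128],
)

print(len(grid))  # 3 * 2 * 2 combinations
for config in grid[:3]:
    print(config)
```

Each combination would then be benchmarked in turn, and the configuration with the highest measured tokens/sec kept as the best fit for the machine.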

GPU vs. CPU and Why It Matters

Key points about hardware:

- Benchmarks assume GPU execution; CPU-only inference is much slower.
- VRAM or unified memory is not always the primary limiter; GPU compute can be the bottleneck.
- Example hardware: Mac Studio with 64GB unified memory (the speaker also mentions machines with up to 512GB RAM).
- Multiple large models can run side by side as separate instances (example: four 70B models across four Llama Server instances).

Results Example from the Sweep

On the Mac Studio, a “Full Sweep” produced a high benchmark:

~1,226 tokens/sec with:
- 16 instances
- parallel = 64
- concurrency = 1024
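The arithmetic behind these numbers can be checked with a few lines of Python. This is a rough sketch that assumes the single-chat figure (~124 tokens/sec) and the sweep result were measured with the same model on the same machine, which the summary does not state explicitly, and that requests are spread evenly across instances.

```python
# Figures quoted in the summary.
single_chat_tps = 124.0   # one request at a time against one Llama Server
sweep_tps = 1226.0        # best aggregate result from the Full Sweep
instances = 16
parallel = 64
concurrency = 1024

# Aggregate speedup relative to the single-chat measurement.
speedup = sweep_tps / single_chat_tps

# With an even split, how many simultaneous requests each instance receives.
requests_per_instance = concurrency // instances

# Average share of the aggregate throughput per in-flight request.
avg_tps_per_request = sweep_tps / concurrency

print(f"speedup: {speedup:.1f}x")
print(f"requests per instance: {requests_per_instance}")
print(f"average tokens/sec per request: {avg_tps_per_request:.2f}")
```

Under those assumptions the aggregate gain is a little under 10x, and an even split hands each instance 64 simultaneous requests, which matches the parallel = 64 setting. The flip side is that each individual request averages only about 1.2 tokens/sec, so the result describes total throughput, not the speed of any one chat.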

The speaker cautions that these are benchmark figures and that real performance depends on the exact workload, so you should test your own scenario.

How the System Is Run: Round-Robin Load Distribution with nginx

The benchmark launcher can start multiple Llama Server instances. A key infrastructure component is nginx, described as a small proxy placed “in front of your llama servers.”

nginx role:

- Load-balances requests across the instances using round-robin
- Helps avoid sending all traffic to a single server instance
- Presented as a major practical approach for high concurrency

Illustrated flow:

- Start many local Llama Server instances (example: 16)
- Configure the nginx endpoint (localhost vs. remote IP/port)
- Use Open WebUI as the client, pointed at the nginx endpoint
- Run multiple queries “at the same time”

The video also mentions generation parameters such as the maximum token count, which determines how much inference work is done per request.
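The round-robin policy described in this section can be sketched in a few lines of Python. This illustrates only the distribution rule, not how the proxy in the video is implemented or configured; the RoundRobin class, the localhost addresses, and the port numbers are all hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self):
        return next(self._rotation)


# Hypothetical addresses for four local server instances.
backends = [f"http://127.0.0.1:{8080 + i}" for i in range(4)]
balancer = RoundRobin(backends)

# Six incoming requests: the fifth wraps around to the first backend again.
assigned = [balancer.next_backend() for _ in range(6)]
for request_id, backend in enumerate(assigned, start=1):
    print(request_id, backend)
```

Because every instance receives the same share of requests in turn, no single server process has to absorb all of the concurrent load.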

Related Learning / Sponsor Content (Non-Core to LLM Serving)

The segment includes sponsored content:

TryHackMe “Cyber Security 101 (sec 1)”
- hands-on applied fundamentals
- instant results, certificate, and a shareable badge
- topics include OS/networking/web security basics, blue-team/red-team fundamentals, and beginner malware analysis
Promotion:
- 40% off for non-premium users who use the speaker’s code
- 3 months free premium

Main Speakers / Sources

Primary Speaker / Author

The YouTube creator (presenter of the benchmarks and maintainer of the Llama Throughput Lab workflow; also references other creators)

Named External Sources

Georgi Gerganov
- author/creator of Llama.cpp (referenced for documentation of parameters like parallel)
Donato Capitella
- inspiration for the distributed launcher approach referenced by the speaker
TryHackMe
- sponsored learning platform mentioned in the subtitles
