Summary of "Маленькие LLM как агенты - тест локальных моделей до 8B" ("Small LLMs as agents: a test of local models up to 8B")

What the video tests

The video evaluates small (~3B–9B parameters) local LLMs used as agents for:

Coding agent task: modify an existing project and implement a very specific UI feature without breaking API contracts.
Web search agent task: search the web for the latest news/posts (Jan–Apr 2026) about a new image generation model, filter the relevant information, and save the results as a JSON file.
Tool-calling benchmark (tool mode): models neither write code nor search; they must decide whether and how to call tools correctly (or decline to use a tool when appropriate).

All runs are done locally using llama.cpp with a context size of ~64k tokens. For each run the authors record execution time, memory usage, and whether the model solved the task.

Test 1: Agent modifies an existing “Focusboard” project

Goal

Using the repository and its AGENTS.md rules, the model must:

find the needed files on its own
add correct auto-completion for “Focus Session”
show a live count in the UI
avoid touching unnecessary code
not break API contracts; work only with the Focus Session logic

Model outcomes (high level)

Nanbeige 4.1 3B: fails; reasons at length but is weak at integrating changes into a finished codebase (incorrect tool calls).
Ministral 3 3B: fails; produces output, but program errors persist even after multiple correction attempts.
Qwen 3.5 4B: succeeds; good at understanding the project and implementing the change. Its attempt to add/extend tests is unsuccessful, but the task itself passes; the session/timer behavior is reported as working end-to-end.
Nemotron 3 Nano 4B: fails; tool-integration problems in the OpenCode workflow; never reaches a solution.
Gemma 4 E4B: conditionally successful; needs multiple retries and manual steering, and finishes on the 4th attempt (positive but not reliable).
Sera 8B: unstable; gets stuck in loops on early runs, succeeds only on the 4th run (passed but unreliable).
Ministral 3 8B: best for this test in the 8B class; understands the task immediately, implements the logic, adds tests, and verifies correctness; described as confident and complete with minimal retries.
Qwen 3.5 9B: negative; the first attempt looks good, but the core logic fails (the timer expires instantly). Multiple revision attempts don’t fix it.
OmniCoder 9B: best/most confident; quickly locates the right files, implements the feature correctly, adds tests, and makes no noticeable errors; described as an “adult-like”, accurate completion.

Key takeaway from Test 1

Small models can reason, but tool + code integration reliability is the limiting factor; the strongest ones complete the scenario end-to-end and correctly modify existing projects.

Test 2: Web search agent + save results to file

Goal

Act as an agent with web search and filesystem tools: gather Jan–Apr 2026 news/posts about a new image generation model, filter the relevant items, and save them to a JSON file in the working directory.

Emphasis is on the full pipeline: search → selection → verification → correct JSON writing → correct final output.
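The selection, verification, and writing stages of that pipeline can be sketched in a few lines. This is only an illustration under stated assumptions: the item fields (`title`, `url`, `published`), the function names, and the exact date window are hypothetical and are not taken from the video.

```python
import json
import os
import tempfile
from datetime import date

# Hypothetical date window based on the task description (Jan-Apr 2026).
WINDOW_START = date(2026, 1, 1)
WINDOW_END = date(2026, 4, 30)


def filter_items(items):
    """Keep items that have a title, a URL, and an ISO date inside the window."""
    kept = []
    for item in items:
        try:
            published = date.fromisoformat(item["published"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed date: drop the item
        if not item.get("title") or not item.get("url"):
            continue
        if WINDOW_START <= published <= WINDOW_END:
            kept.append(item)
    return kept


def save_json(items, path):
    """Serialize, re-parse to verify, then write via a temp file and rename."""
    text = json.dumps(items, ensure_ascii=False, indent=2)
    json.loads(text)  # raises if the serialized text is not valid JSON
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # rename within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path
```

Writing to a temporary file first means a failed write never leaves a half-written results file behind, which is the kind of breakage the test is sensitive to.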

Model outcomes (high level)

Nanbeige 4.1 3B: finds information but fails at the saving step; the JSON output degrades and the file cannot be written normally.
Ministral 3 3B: succeeds overall; the quality of its sources is questionable and the formatting of the confidence field is odd in places, but it finishes. Some MCP/tool saving issues are suspected to be environment-related.
Qwen 3.5 4B: clean success; finds relevant materials and assembles the output well.
Nemotron 3 Nano 4B: completes, but with lower quality (few links, some potentially off-topic), though it is the fastest.
Gemma 4 E4B: succeeds with rough edges; struggles to save the results at first, fixes this on the 3rd try, then writes the file correctly.
Sera 8B: fails; the agent loops at the saving stage and burns through ~200 web search requests. A complete failure.
Ministral 3 8B: best reliability; completes on the first try, saves correctly, and assigns confidence values sensibly.
Qwen 3.5 9B: also very strong; completes on the first try, finds current and relevant information, and saves to JSON without problems.
OmniCoder 9B: succeeds; the first attempt is good and fast. Across this test, many models break on simple JSON writing; the top models are those that produce correct JSON and save the file immediately.

Key takeaway from Test 2

The main failure mode is not searching—it’s correctly generating the final JSON and writing the file reliably.

Test 3: Benchmark for tool-calling behavior (12 prompts)

Purpose

Measure how well local small models:

decide when to call a tool
select the correct tool
decide when not to call tools
handle ambiguous and trap prompts safely
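The four behaviors above map naturally onto a small grading function. The sketch below is a hypothetical illustration, not the benchmark's actual code: the `Case` type, the outcome labels, and the tool names in the usage note are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Case:
    prompt: str
    expected_tool: Optional[str]  # None means the model should NOT call a tool


def grade(case: Case, called_tool: Optional[str]) -> str:
    """Classify one model response against the expected behavior."""
    if case.expected_tool is None:
        # Trap or no-tool prompt: any tool call is an error.
        return "correct_refusal" if called_tool is None else "spurious_call"
    if called_tool is None:
        return "missed_call"
    return "correct_call" if called_tool == case.expected_tool else "wrong_tool"
```

For example, `grade(Case("What is 2 + 2?", None), "web_search")` is graded as a spurious call, while the same prompt answered without any tool call counts as a correct refusal.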

Scoring

evaluates correct tool calls and correct refusals
computes a final weighted Agent Score combining multiple metrics (accuracy, safety, stability, suitability)
fairness: same prompt set, each model run 10 times to reduce variance; benchmark validated as working correctly
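A weighted score of this kind can be computed as a normalized weighted average and then averaged over the repeated runs. The summary names the metrics but not their weights, so the weights below are purely hypothetical, as are the function names.

```python
from statistics import mean

# Hypothetical weights; the source does not state the real ones.
WEIGHTS = {"accuracy": 0.4, "safety": 0.3, "stability": 0.2, "suitability": 0.1}


def agent_score(metrics, weights=WEIGHTS):
    """Weighted average of metrics that each lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total


def score_over_runs(runs, weights=WEIGHTS):
    """Average the per-run score, e.g. over the 10 runs per model."""
    return mean(agent_score(run, weights) for run in runs)
```

Dividing by the sum of the weights keeps the result in [0, 1] even if the weights are later changed and no longer sum to one.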

Results highlights

Control/upper bound: Gemma 4 31B gets an Agent Score of 0.92, which confirms that the metric scale works.
Best small-model balance: Nemotron 3 Nano 4B, with an Agent Score of ~0.8 and an average latency of ~3 seconds (practical for local agent pipelines).
Next cluster: Sera 8B, OmniCoder 9B, and, unexpectedly, Qwen 3.5 4B tie at around 0.7; functional but less stable than the leader.
Ministral 3 3B remains the fastest, but its solution quality is lower: an explicit speed vs reliability trade-off.
Practical conclusion from the benchmark: among small models, Nemotron 3 Nano 4B is recommended for real local use (fast enough, with good tool behavior), while some other models may be fast but less reliable.

Key takeaway from Benchmark

Many models can “reason,” but only a smaller set behave stably when calling tools as agents, which is crucial for reliable local engineering work.

Main overall conclusion of the video

Small LLMs can be agents, but not all can reliably complete real tool-using workflows.
Some models produce good reasoning yet break on tool/code integration or JSON/file writing.
The strongest performers are those that are stable in real tasks, not just high-scoring on a dry benchmark.

Main speakers/sources mentioned

Specialists from the Serflow company (the testing team/channel); the speakers are not individually named.
