Summary of "Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI"

Speaker / source

Maxime Labonne (Liquid AI, Head of Post-Training)

Key technological concepts: “small vs big models” for edge/on-device use

Liquid AI’s focus: edge models

Liquid AI focuses on edge models optimized for on-device deployment, covering text, vision, and audio. Available sizes span roughly 350M to 24B. Recently mentioned releases include:

VLM 450M (released “yesterday”)
Updated 350M text model (released “the week before”)

Models are available on Hugging Face for experimentation.

Three main differences (and engineering implications)

Memory bound → low knowledge capacity
- Edge hardware limits total model size.
- Smaller models have less raw “knowledge,” making them less general-purpose.
Task narrowness
- Limited capacity leads small models to specialize.
- They’re not intended to behave like general chatbots.
- Instead, they excel at focused tasks such as summarization, data extraction, and tool use.
Latency sensitivity
- On-device deployment requires high throughput and low-latency inference.

Core lesson

Small models aren’t just scaled-down big models. They require different architectural and training choices because the constraints differ.

Architecture findings for small edge models

Gemma 3 (270M) and Qwen 3.5 (0.8B): hybrid attention; embedding inefficiency

Both models use hybrid architectures:

Gemma 3 270M: sliding window attention + GQA
Qwen 3.5 0.8B: gated DeltaNet + gated attention

The embedding layer takes a large share of the parameters in these small models

Gemma 3 270M: embedding ≈ 63% of parameters
Qwen 3.5 0.8B: embedding ≈ 29% of parameters

Interpretation: the “effective” parameters used for reasoning/knowledge are fewer than the total parameter count suggests.

Claimed reason: these models are distilled from teacher models with huge vocabularies, and keeping the teacher's vocabulary inflates the embedding size.
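The embedding share can be estimated from the vocabulary size and hidden width. The sketch below is a back-of-the-envelope calculation; the vocabulary size, hidden size, and total parameter count are illustrative assumptions for a 270M-class model with a roughly 256k-token vocabulary, not figures quoted in the talk.

```python
def embedding_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of all parameters that sit in a (tied) token-embedding matrix."""
    return vocab_size * hidden_size / total_params


# Assumed, illustrative configuration (not taken from the talk).
share = embedding_share(vocab_size=262_144, hidden_size=640, total_params=268_000_000)
print(f"embedding share: {share:.0%}")
```

With a vocabulary this large, most of the parameter budget goes to the lookup table rather than to the layers that process the sequence.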

LFM 2 architecture: faster operators + edge profiling

LFM 2 uses a hybrid architecture with:

short convolutions + GQA

Embedding is smaller relative to the rest

This is framed as improving the “effective parameter” budget.

Design method: on-device profiling

They emphasize selecting operators via profiling on the target hardware, rather than only theoretical design.

Highlighted block: the gated short convolution block
It is described as much faster than the alternatives it was profiled against (sliding window attention, gated DeltaNet, gated linear attention, and GQA variants).
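As a rough illustration of what a gated short convolution block computes, here is a minimal NumPy sketch. It assumes a double-gated layout (an input gate before a causal depthwise convolution, an output gate after it); the function names, weight shapes, and kernel size are hypothetical and this is not Liquid AI's implementation.

```python
import numpy as np


def causal_depthwise_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """x: (seq_len, d), kernel: (k, d). Output at step t uses inputs t-k+1..t only."""
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.zeros_like(x)
    for j in range(k):
        # padded[j + t] corresponds to x[t - (k - 1) + j]
        out += padded[j:j + x.shape[0]] * kernel[j]
    return out


def gated_short_conv_block(x, w_in, kernel, w_out):
    """x: (seq_len, d), w_in: (d, 3d), kernel: (k, d), w_out: (d, d)."""
    b, c, h = np.split(x @ w_in, 3, axis=-1)  # input gate, output gate, values
    y = causal_depthwise_conv(b * h, kernel)  # gate, then mix a few neighbouring tokens
    return (c * y) @ w_out                    # gate again, then project out
```

Each step touches only the last few tokens, so the cost per generated token does not grow with the sequence length.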

Empirical profiling results

CPU examples: AMD Ryzen AI Max+ 395, Samsung Galaxy S25 Ultra
GPU: high throughput at high concurrency

Reported outcome

Faster inference and lower memory use for LFM 2-style architecture due to short convolutions.
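One way to see where the memory saving comes from: an attention layer caches keys and values for every past token, while a causal short convolution only needs its last few inputs. The sketch below compares the two per layer; the head counts, dimensions, and kernel size are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(seq_len: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values: two tensors of shape (seq_len, n_kv_heads, head_dim)."""
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_value


def conv_state_bytes(kernel_size: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """A causal conv keeps only the previous (kernel_size - 1) inputs per channel."""
    return (kernel_size - 1) * hidden_size * bytes_per_value


# Hypothetical layer sizes, 16-bit values.
for seq_len in (1_024, 8_192, 32_768):
    print(seq_len, kv_cache_bytes(seq_len, n_kv_heads=8, head_dim=64),
          conv_state_bytes(kernel_size=3, hidden_size=1024))
```

The attention cache grows linearly with context length, while the convolution state stays constant.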

Training recipe and scaling behavior (LFM 2.5 / 350M)

Training stages (LFM 2.5)

Stages include:

Pre-training + mid-training: 28T tokens
Supervised fine-tuning
Preference alignment (using an on-policy length-normalized DPO variant)
Reinforcement learning
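The summary does not spell out the exact DPO variant. The sketch below shows one common way to length-normalize the DPO objective, dividing each response's policy-versus-reference log-ratio by its token count so that long responses are not favored or penalized for length alone. The function name and the `beta` default are assumptions.

```python
import math


def length_normalized_dpo_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               len_chosen, len_rejected, beta=0.1):
    """Inputs are summed token log-probabilities and token counts per response."""
    chosen = (logp_chosen - ref_logp_chosen) / len_chosen
    rejected = (logp_rejected - ref_logp_rejected) / len_rejected
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)), guarded against overflow for very negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

"On-policy" here means the chosen and rejected responses are sampled from the model being trained rather than taken from a fixed dataset.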

Scaling law perspective: “more tokens still helps even at small scale”

They reference the Chinchilla compute-optimal rule of thumb, which suggests roughly 20 training tokens per parameter. However:

They report performance still grows when increasing pre-training tokens beyond “compute optimal” expectations.
They reference a Roberts et al. paper about test-time scaling laws (published “last week”).
They compare LFM 2.5 350M with those laws.

Conclusion: the recipe is not compute-optimal in the Chinchilla sense, but more tokens keep improving small models, and a small model is far cheaper to train than a very large one.
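To put the numbers in perspective, the arithmetic below compares the 28T-token budget with what a Chinchilla-style budget would be for a 350M-parameter model. The figure of about 20 tokens per parameter is the commonly cited rule of thumb and is an assumption here, not a number from the talk.

```python
params = 350e6             # 350M-parameter model
tokens = 28e12             # 28T pre-training + mid-training tokens
chinchilla_ratio = 20      # assumed rule of thumb: ~20 tokens per parameter

tokens_per_param = tokens / params
chinchilla_tokens = chinchilla_ratio * params
overtraining_factor = tokens / chinchilla_tokens

print(f"tokens per parameter: {tokens_per_param:,.0f}")
print(f"Chinchilla-style budget: {chinchilla_tokens:.1e} tokens")
print(f"overtraining factor: {overtraining_factor:,.0f}x")
```

Under these assumptions the model sees thousands of times more data than the compute-optimal rule would prescribe.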

Post-training comparisons / benchmarks (targeted capabilities)

Their 350M model is described as significantly better than previous LFM 2 350M across categories:

Knowledge: GPQA (Diamond)
Instruction following: IF-Bench
Data extraction: Case Report Bench
Tool use: BFCL and τ²-Bench

Stated goal: optimize for data extraction + tool use. They frame being “not best at everything” as acceptable because users often don’t need “average” capabilities.

How training differs (practically) for small vs big models

While the overall stage structure is similar, the approach changes:

Supervised fine-tuning (SFT): best for narrowly focused capabilities
- For edge use cases: focus on specific functions/calls.
- Suggestion: start from a Hugging Face model and fine-tune for your task.
Preference alignment (DPO-style):
- Produces general improvements, not only benchmark chasing.
- Improves “overall” model quality and language quality.
Reinforcement learning (RL):
- Framed as very efficient even at small scale.
- Again emphasizes narrow focus: train with many environments/tasks to encourage generalization.
- Cold start sensitivity for small models: RL can fail if SFT mixtures lack similar examples.
  - If an RL task “doesn’t train,” add missing cold-start SFT data (or rebalance complexity).
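A simple way to spot a cold-start problem before launching RL is to sample rollouts from the SFT checkpoint and look for tasks where no rollout ever earns a reward, since such tasks give the optimizer nothing to reinforce. The helper below is a hypothetical sketch of that check; the data layout and threshold are assumptions.

```python
def tasks_needing_cold_start(rewards_by_task, min_success_rate=0.0):
    """rewards_by_task maps a task name to a list of 0/1 rewards from sampled rollouts.

    Returns the tasks whose success rate is at or below min_success_rate.
    """
    flagged = []
    for task, rewards in rewards_by_task.items():
        rate = sum(rewards) / len(rewards) if rewards else 0.0
        if rate <= min_success_rate:
            flagged.append(task)
    return flagged


# Example: the SFT model never solves "sql_tool_call", so it needs SFT examples first.
print(tasks_needing_cold_start({"math": [0, 1, 0, 1], "sql_tool_call": [0, 0, 0, 0]}))
```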

Tutorial/issue analysis: Doom looping in small reasoning models

What doom looping is

A failure mode where the model repeats sequences indefinitely (never terminates).

It is described as worse when:

the task is too complex for the small model
the model is used in reasoning mode
small capacity is combined with high task complexity
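Measuring a doom loop ratio requires some operational definition of a loop. The sketch below uses one simple heuristic: a rollout counts as a doom loop if it ends with the same block of tokens repeated several times in a row. The function names and thresholds are assumptions, not the detector used in the talk.

```python
def is_doom_loop(tokens, max_period=50, min_repeats=4):
    """True if the sequence ends with one block of tokens repeated min_repeats times."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


def doom_loop_ratio(rollouts, **kwargs):
    """Fraction of rollouts (each a token list) flagged as doom loops."""
    if not rollouts:
        return 0.0
    return sum(is_doom_loop(r, **kwargs) for r in rollouts) / len(rollouts)
```

In practice the check would be applied to rollouts that hit the maximum generation length without emitting an end-of-sequence token.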

Solutions used to reduce doom loops (with concrete pipeline details)

Solution 1: Preference alignment data generation with doom-loop-aware selection (“DPO against doom loops”)

During preference alignment, their on-policy data generation pipeline:

Start from ~1M prompts
Generate five rollouts using the policy model with temperature sampling to encourage diversity
- Expect at least one rollout won’t doom loop
Generate one additional rollout with temperature = 0
- Expected to doom loop
Use an LLM jury to score rollouts:
- chosen = best rollout
- rejected = worst rollout

Goal: train the policy to avoid doom-loop responses.
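The data generation steps above can be sketched as a small function. `generate` and `judge` are hypothetical callables standing in for the policy model and the LLM jury, and the sampling temperature of 1.0 is an assumption.

```python
from typing import Callable, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> response
    judge: Callable[[str, str], float],     # hypothetical: (prompt, response) -> score
    n_sampled: int = 5,
    temperature: float = 1.0,
) -> Tuple[str, str]:
    """Return (chosen, rejected) for one prompt."""
    # Diverse rollouts: at least one is expected not to doom loop.
    rollouts = [generate(prompt, temperature) for _ in range(n_sampled)]
    # Greedy rollout: the one most likely to doom loop.
    rollouts.append(generate(prompt, 0.0))
    ranked = sorted(rollouts, key=lambda response: judge(prompt, response))
    return ranked[-1], ranked[0]
```

Because the rejected sample comes from the policy itself, the preference signal targets the failure modes the model actually exhibits.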

Solution 2: RL with verifiable rewards + repetition penalties

Use reinforcement learning with verifiable rewards, e.g.:

For math: reward is given only if a final answer is extractable/verifiable
If no correct final answer emerges: no positive reward, discouraging repetitive failure

Additional methods:

add an n-gram repetition penalty
use temperature sampling to diversify rollouts and reduce doom-loop repetition
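A minimal sketch of how a verifiable reward and an n-gram repetition penalty could be combined for a math task. It assumes the final answer is written as `\boxed{...}` and compared by exact string match; the answer format, n-gram size, and penalty weight are all illustrative assumptions.

```python
import re
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Share of n-grams that repeat an n-gram already seen in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def math_reward(response, reference_answer, penalty_weight=0.5):
    """1.0 only if an extractable final answer matches; minus a repetition penalty."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    correct = match is not None and match.group(1).strip() == reference_answer.strip()
    reward = 1.0 if correct else 0.0
    return reward - penalty_weight * repeated_ngram_fraction(response.split())
```

A rollout that loops until the length limit never produces an extractable answer, so it earns no positive reward and is penalized for repetition as well.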

Quantitative example (small reasoning model under hard tasks)

For LFM 2.5 1.2B thinking:

Post-pretraining doom loop ratio: ~15–16%
After SFT: “barely moves” (SFT alone not enough)
After DPO (their first solution): significant reduction
After RL (their second solution): doom looping becomes almost nonexistent

Comparison claim: a similar setup with Qwen 3.5 0.8B in reasoning mode yields a doom loop ratio above 50%. This is presented as an example of a small model built like a scaled-down big model, in contrast with Liquid’s edge-first approach.

Notes on distillation from bigger models

The discussion suggests that a larger teacher's resistance to doom loops may not transfer directly through distillation, so the distilled small model may need additional training steps to eliminate doom loops again.

Next-stage idea: agentic reinforcement learning + tool/web search to address memory limits

They emphasize the key characteristic:

memory bound ⇒ low knowledge ⇒ more hallucination

Proposed mitigation

Provide small models web search/tools so they can retrieve knowledge rather than rely on limited internal capacity.

Why this could work

They argue tiny models can still be very good at agentic tasks if they have reliable reasoning to use tools correctly.

Handling context limitations

Small models struggle with long context
Proposed workarounds include:
- recursive language model environments
- Python-based shortcuts to reduce context burden
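One simplified reading of the recursive idea is to let the model answer over pieces of a long input and then combine the partial answers, so that no single call sees more than a bounded amount of text. In the sketch below, `llm` is a hypothetical callable that maps a prompt to a response, and the character budget is an assumption. It is an illustration of the principle, not the environment described in the talk.

```python
def answer_over_long_context(question, text, llm, max_chars=2000):
    """Recursively split `text` until each piece fits, then merge partial answers."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return llm(f"Context:\n{text}\n\nQuestion: {question}")
    mid = len(text) // 2
    left = answer_over_long_context(question, text[:mid], llm, max_chars)
    right = answer_over_long_context(question, text[mid:], llm, max_chars)
    # Merge step: the model only sees the two partial answers, not the raw text.
    return llm(f"Partial answers:\n{left}\n{right}\n\nQuestion: {question}")
```

The orchestration lives in ordinary Python, which keeps the context burden on the small model low.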

Takeaway

Combine edge models with agentic tools for strong performance. Large-model agentic workloads are important, but may not always be the best fit.

Main speakers / sources (explicit at end)

Maxime Labonne (Liquid AI)
(Interviewer/questions present, but no name provided in subtitles; only “in your specific workflow…” style Q&A)
