Summary of "Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI"
Speaker / source
- Maxime Labonne (Liquid AI, Head of Post-training)
Key technological concepts: “small vs big models” for edge/on-device use
Liquid AI’s focus: edge models
Liquid AI focuses on edge models optimized for on-device deployment, covering text, vision, and audio. Available sizes span roughly 350M to 24B. Recently mentioned releases include:
- VLM 450M (released “yesterday”)
- Updated 350M text model (released “the week before”)
Models are available on Hugging Face for experimentation.
Three main differences (and engineering implications)
- Memory bound → low knowledge capacity
  - Edge hardware limits total model size.
  - Smaller models have less raw “knowledge,” making them less general-purpose.
- Task narrowness
  - Limited capacity leads small models to specialize.
  - They’re not intended to behave like general chatbots.
  - Instead, they excel at focused tasks such as summarization, data extraction, and tool use.
- Latency sensitivity
  - On-device deployment requires high throughput and low-latency inference.
Core lesson
Small models aren’t just scaled-down big models. They require different architectural and training choices because the constraints differ.
Architecture findings for small edge models
Gemma 3 (270M) and Qwen 3.5 (0.8B): hybrid architectures; embedding inefficiency
Both models use hybrid architectures:
- Gemma 3 270M: sliding window attention + GQA
- Qwen 3.5 0.8B: gated DeltaNet + gated attention
The embedding layer takes a large share of the parameters in these small models:
- Gemma 3 270M: embedding ≈ 63% of parameters
- Qwen 3.5 0.8B: embedding ≈ 29% of parameters
Interpretation: the “effective” parameters used for reasoning/knowledge are smaller than the total parameter count suggests.
Claimed reason: distillation from teacher models with huge vocabularies, which inflates embedding size.
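To make the “effective parameter” point concrete, the sketch below computes the embedding share from a vocabulary size and a hidden width. The numbers passed in (a 262,144-entry vocabulary, a hidden width of 640, about 100M non-embedding parameters) are illustrative assumptions chosen to land near the ~63% figure quoted above. They do not come from the talk, and tied input/output embeddings are assumed.

```python
def embedding_share(vocab_size: int, hidden_size: int, non_embedding_params: int) -> float:
    """Fraction of total parameters spent on the token-embedding table.

    Assumes tied input/output embeddings, so the table is counted once.
    """
    embedding_params = vocab_size * hidden_size
    return embedding_params / (embedding_params + non_embedding_params)


# Illustrative (assumed) configuration, not taken from the talk.
share = embedding_share(
    vocab_size=262_144,
    hidden_size=640,
    non_embedding_params=100_000_000,
)
print(f"embedding share: {share:.1%}")
```

Because the embedding table grows with vocabulary size times hidden width, a vocabulary inherited from a much larger teacher can dominate a small model's parameter count.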
LFM 2 architecture: faster operators + edge profiling
LFM 2 uses a hybrid architecture with:
- short convolutions + GQA
The embedding layer is smaller relative to the rest of the model, which is framed as improving the “effective parameter” budget.
Design method: on-device profiling
They emphasize selecting operators by profiling on the target hardware rather than relying on theoretical design alone.
- A highlighted block: the gated short convolution block
- It is described as much faster than the alternatives compared (sliding window attention, gated DeltaNet, gated linear attention, and GQA variants).
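As a rough illustration of why such a block is cheap, here is a minimal single-channel sketch of a causal short convolution with multiplicative gates. It is an assumption about the general shape of the operator, not Liquid AI's implementation: in a real block the gates would be input-dependent projections and the convolution would run per channel, and the names `in_gate`, `out_gate`, and `kernel` are hypothetical.

```python
from typing import List


def gated_short_conv(x: List[float], in_gate: List[float],
                     out_gate: List[float], kernel: List[float]) -> List[float]:
    """Single-channel, causal short convolution with multiplicative gates.

    y[t] = out_gate[t] * sum_k kernel[k] * (in_gate[t-k] * x[t-k])
    Positions before the start of the sequence are treated as zero.
    """
    gated = [g * v for g, v in zip(in_gate, x)]
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * gated[t - k]
        out.append(out_gate[t] * acc)
    return out


# With both gates fully open, this reduces to a plain causal convolution.
x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
result = gated_short_conv(x, ones, ones, kernel=[0.5, 0.5])
print(result)
```

Each output position touches only the last `len(kernel)` inputs, so the cost grows linearly with sequence length and the state kept during decoding has a fixed size, unlike an attention KV cache that grows with every generated token.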
Empirical profiling results
- CPU examples: AMD Ryzen AI Max+ 395, Samsung Galaxy S25 Ultra
- GPU: high throughput at high concurrency
Reported outcome
Faster inference and lower memory use for the LFM 2-style architecture, attributed to its short convolutions.
Training recipe and scaling behavior (LFM 2.5 / 350M)
Training stages (LFM 2.5)
Stages include:
- Pre-training + mid-training: 28T tokens
- Supervised fine-tuning
- Preference alignment (using an on-policy length-normalized DPO variant)
- Reinforcement learning
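The talk does not spell out the exact form of the length-normalized DPO variant used in the preference-alignment stage. One common formulation, shown here as an assumption, divides each sequence's policy-versus-reference log-probability ratio by its token count before applying the usual DPO logistic loss. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def length_normalized_dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                               ref_chosen_logp: float, ref_rejected_logp: float,
                               chosen_len: int, rejected_len: int,
                               beta: float = 0.1) -> float:
    """DPO loss where each sequence log-ratio is divided by its token count."""
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Summed log-probabilities for one (chosen, rejected) pair; values are made up.
loss = length_normalized_dpo_loss(-10.0, -30.0, -12.0, -20.0, chosen_len=10, rejected_len=20)
print(round(loss, 4))
```

Normalizing by length keeps a long rejected response (for example, one that loops until the token limit) from dominating the margin purely because it contains more tokens.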
Scaling law perspective: “more tokens still helps even at small scale”
They reference the Chinchilla compute-optimal intuition, which ties the training-token budget to model size for a fixed compute budget (often summarized as roughly 20 tokens per parameter). However:
- They report performance still grows when increasing pre-training tokens beyond “compute optimal” expectations.
- They reference a Roberts et al. paper about test-time scaling laws (published “last week”).
- They compare LFM 2.5 350M with those laws.
Conclusion: not fully token-optimal, but scaling works even for small models, which are also cheaper to train than very large models.
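A quick back-of-the-envelope calculation shows how far past the compute-optimal point this recipe goes. It assumes the 28T-token budget applies to the 350M model, as the section heading suggests, and uses the common rule of thumb of about 20 tokens per parameter as a stand-in for the Chinchilla optimum. That constant is an approximation, not a figure from the talk.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb approximation (assumption)


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


params = 350e6        # 350M parameters
train_tokens = 28e12  # 28T tokens (pre-training + mid-training)

ratio = tokens_per_param(train_tokens, params)
compute_optimal_tokens = CHINCHILLA_TOKENS_PER_PARAM * params
overtraining_factor = train_tokens / compute_optimal_tokens

print(f"tokens per parameter: {ratio:,.0f}")
print(f"compute-optimal budget under the rule of thumb: {compute_optimal_tokens:.1e} tokens")
print(f"overtraining factor: {overtraining_factor:,.0f}x")
```

Under these assumptions the model sees tens of thousands of tokens per parameter, which is the regime where the talk reports that quality still improves.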
Post-training comparisons / benchmarks (targeted capabilities)
Their 350M model is described as significantly better than the previous LFM 2 350M across categories:
- Knowledge: GPQA (Diamond)
- Instruction following: IFBench
- Data extraction: CaseReportBench
- Tool use: BFCL and τ²-Bench
Stated goal: optimize for data extraction + tool use. They frame being “not best at everything” as acceptable because users often don’t need “average” capabilities.
How training differs (practically) for small vs big models
While the overall stage structure is similar, the approach changes:
- Supervised fine-tuning (SFT): best for narrowly focused capabilities
  - For edge use cases: focus on specific functions/calls.
  - Suggestion: start from a Hugging Face model and fine-tune for your task.
- Preference alignment (DPO-style):
  - Produces general improvements, not only benchmark chasing.
  - Improves “overall” model quality and language quality.
- Reinforcement learning (RL):
  - Framed as very efficient even at small scale.
  - Keep each task narrow, but train across many environments/tasks to encourage generalization.
  - Cold start sensitivity for small models: RL can fail if SFT mixtures lack similar examples.
  - If an RL task “doesn’t train,” add missing cold-start SFT data (or rebalance complexity).
Tutorial/issue analysis: Doom looping in small reasoning models
What doom looping is
A failure mode where the model repeats sequences indefinitely (never terminates).
It’s highlighted as worse when:
- the task is too complex for the small model
- the model runs in reasoning mode
- both apply: a small-capacity reasoning model facing a high-complexity task
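The talk does not say how doom loops are detected. A simple heuristic, given here as an assumption, is to flag a generation whose tail consists of the same block of tokens repeated several times in a row. The function name and the thresholds `max_period` and `min_repeats` are hypothetical.

```python
from typing import Sequence


def is_doom_loop(tokens: Sequence, max_period: int = 50, min_repeats: int = 4) -> bool:
    """Return True if the sequence ends with one block repeated `min_repeats` times in a row."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


looping = ("the answer is 4 . " + "wait , let me check . " * 4).split()
healthy = "the answer is 4 .".split()
print(is_doom_loop(looping), is_doom_loop(healthy))
```

Running such a check over a batch of rollouts gives a doom loop ratio (flagged rollouts divided by total rollouts) that can be tracked across training stages.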
Solutions used to reduce doom loops (with concrete pipeline details)
Solution 1: Preference alignment data generation with doom-loop-aware selection (“DPO against doom loops”)
During preference alignment, their on-policy data generation pipeline:
- Start from roughly 1M prompts
- For each prompt, generate five rollouts from the policy model with temperature sampling to encourage diversity
  - At least one of these is expected not to doom loop
- Generate one additional rollout with temperature = 0
  - This one is expected to doom loop
- Use an LLM jury to score the rollouts:
  - chosen = best rollout
  - rejected = worst rollout
Goal: train the policy to avoid doom-loop responses.
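The pair-construction step can be sketched as follows. Here `generate` and `judge` are hypothetical callables standing in for the policy model and the LLM jury, and the sampling temperature of 1.0 is an assumption, since the talk only says that temperature sampling is used. The stand-in functions exist only to make the sketch runnable.

```python
from typing import Callable, Dict


def build_preference_pair(prompt: str,
                          generate: Callable[[str, float], str],
                          judge: Callable[[str, str], float],
                          n_sampled: int = 5,
                          temperature: float = 1.0) -> Dict[str, str]:
    """Sample rollouts, add one greedy rollout, and keep the best and worst by jury score."""
    rollouts = [generate(prompt, temperature) for _ in range(n_sampled)]
    rollouts.append(generate(prompt, 0.0))  # greedy rollout, expected to loop more often
    scored = sorted(rollouts, key=lambda r: judge(prompt, r))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}


def fake_generate(prompt: str, temperature: float) -> str:
    # Stand-in policy: greedy decoding loops, sampling gives a short answer.
    return "wait wait wait wait" if temperature == 0.0 else "The answer is 4."


def fake_judge(prompt: str, response: str) -> float:
    # Stand-in jury: give the lowest score to a response made of one repeated word.
    return 0.0 if len(set(response.split())) == 1 else 1.0


pair = build_preference_pair("What is 2 + 2?", fake_generate, fake_judge)
print(pair)
```

In practice one would probably drop prompts where the best and worst rollouts receive the same score, since such a pair carries no preference signal.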
Solution 2: RL with verifiable rewards + repetition penalties
Use reinforcement learning with verifiable rewards, e.g.:
- For math: reward is given only if a final answer can be extracted and verified
- A rollout that loops never produces a correct final answer, so it earns no positive reward, which discourages repetitive failure
Additional methods:
- add an n-gram repetition penalty
- use temperature sampling to diversify rollouts and reduce doom-loop repetition
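A minimal sketch of such a reward, combining a verifiable final-answer check with an n-gram repetition penalty, might look like this. The `\boxed{...}` answer format, the whitespace tokenization, the 4-gram window, and the `penalty_weight` value are all assumptions made for illustration. The talk does not give these details.

```python
import re
from typing import List


def ngram_repetition_rate(tokens: List[str], n: int = 4) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def math_reward(response: str, gold_answer: str, penalty_weight: float = 0.5) -> float:
    """1.0 for a verifiably correct final answer, minus a penalty for repeated n-grams."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    base = 1.0 if correct else 0.0
    return base - penalty_weight * ngram_repetition_rate(response.split())


good = math_reward("So the result is \\boxed{4}", "4")
looped = math_reward("wait " * 20, "4")
print(good, round(looped, 3))
```

A looping rollout never reaches an extractable answer, so it gets no base reward, and the penalty term pushes it below a rollout that is merely wrong.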
Quantitative example (small reasoning model under hard tasks)
For LFM 2.5 1.2B thinking:
- Post-pretraining doom loop ratio: ~15–16%
- After SFT: “barely moves” (SFT alone not enough)
- After DPO (their first solution): significant reduction
- After RL (their second solution): doom looping becomes almost nonexistent
Comparison claim: a similar setup with Qwen 3.5 0.8B in reasoning mode yields a doom loop ratio above 50%. This is presented as an example of a small model built like a scaled-down big model, in contrast with Liquid’s edge-first approach.
Notes on distillation from bigger models
The discussion suggests that distilling from larger models does not directly transfer their resistance to doom loops, so additional training steps or batches are still required to eliminate doom loops in the small model.
Next-stage idea: agentic reinforcement learning + tool/web search to address memory limits
They emphasize the key characteristic:
memory bound ⇒ low knowledge ⇒ more hallucination
Proposed mitigation
- Provide small models web search/tools so they can retrieve knowledge rather than rely on limited internal capacity.
Why this could work
They argue tiny models can still be very good at agentic tasks if they have reliable reasoning to use tools correctly.
Handling context limitations
- Small models struggle with long context
- Proposed workarounds include:
  - recursive language model environments
  - Python-based shortcuts to reduce context burden
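The talk does not detail how these environments work. One way to read the idea, offered purely as an assumption, is that plain Python code splits a long context into pieces the model can handle, the model is called on each piece, and the model is called again on the collected partial results. The `model` callable, the character-based chunking, and the `max_chars` limit are hypothetical.

```python
from typing import Callable


def recursive_answer(question: str, context: str,
                     model: Callable[[str, str], str],
                     max_chars: int = 2000) -> str:
    """Answer over a long context by recursing on per-chunk model outputs."""
    if len(context) <= max_chars:
        return model(question, context)
    chunks = [context[i:i + max_chars] for i in range(0, len(context), max_chars)]
    notes = "\n".join(model(question, chunk) for chunk in chunks)
    if len(notes) >= len(context):
        # The model did not compress anything; truncate instead of recursing forever.
        notes = notes[:max_chars]
    return recursive_answer(question, notes, model, max_chars)


def fake_model(question: str, context: str) -> str:
    # Stand-in model: returns the relevant sentence if it sees it, else nothing.
    needle = "The secret code is 1234."
    return needle if needle in context else ""


long_context = "filler " * 1000 + "The secret code is 1234. " + "filler " * 1000
answer = recursive_answer("What is the secret code?", long_context, fake_model)
print(answer)
```

Fixed-size chunking can split a relevant sentence across two chunks; overlapping chunks or sentence-aware splitting would be natural refinements.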
Takeaway
Combine edge models with agentic tools for strong performance. Large-model agentic workloads are important, but may not always be the best fit.
Main speakers / sources (explicit at end)
- Maxime Labonne (Liquid AI)
- (Interviewer/questions present, but no name provided in subtitles; only “in your specific workflow…” style Q&A)
Category
Technology