Summary of "Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI"

Speaker / source


Key technological concepts: “small vs big models” for edge/on-device use

Liquid AI’s focus: edge models

Liquid AI focuses on edge models optimized for on-device deployment, covering text, vision, and audio. Available sizes span roughly 350M to 24B. Recently mentioned releases include:

Models are available on Hugging Face for experimentation.

Three main differences (and engineering implications)

  1. Memory bound → low knowledge capacity

    • Edge hardware limits total model size.
    • Smaller models have less raw “knowledge,” making them less general-purpose.
  2. Task narrowness

    • Limited capacity leads small models to specialize.
    • They’re not intended to behave like general chatbots.
    • Instead, they excel at focused tasks such as summarization, data extraction, and tool use.
  3. Latency sensitivity

    • On-device deployment requires high throughput and low-latency inference.

Core lesson

Small models aren’t just scaled-down big models. They require different architectural and training choices because the constraints differ.


Architecture findings for small edge models

Gemma 3 (270M) and Gemma 2.5 (0.8B): hybrid attention + GQA; embedding inefficiency

Both variants use hybrid architectures:

Embedding layer share is large in the small variants

Interpretation: the “effective” parameters used for reasoning/knowledge are smaller than the total parameter count suggests.

Claimed reason: distillation from teacher models with huge vocabularies, which inflates embedding size.


LFM 2 architecture: faster operators + edge profiling

LFM 2 uses a hybrid architecture with:

Embedding is smaller relative to the rest

This is framed as improving the “effective parameter” budget.

Design method: on-device profiling

They emphasize selecting operators via profiling on the target hardware, rather than only theoretical design.

Empirical profiling results

Reported outcome

Faster inference and lower memory use for LFM 2-style architecture due to short convolutions.


Training recipe and scaling behavior (LFM 2.5 / 350M)

Training stages (LFM 2.5)

Stages include:


Scaling law perspective: “more tokens still helps even at small scale”

They reference Chinchilla-optimal compute intuition, which suggests training at different data/size ratios. However:

Conclusion: not fully token-optimal, but scaling works even for small models—and is cheaper to train than very large models.


Post-training comparisons / benchmarks (targeted capabilities)

Their 350M model is described as significantly better than previous LFM 2 350M across categories:

Stated goal: optimize for data extraction + tool use. They frame being “not best at everything” as acceptable because users often don’t need “average” capabilities.


How training differs (practically) for small vs big models

While the overall stage structure is similar, the approach changes:


Tutorial/issue analysis: Doom looping in small reasoning models

What doom looping is

A failure mode where the model repeats sequences indefinitely (never terminates).

It’s highlighted as worse when:

Solutions used to reduce doom loops (with concrete pipeline details)

Solution 1: Preference alignment data generation with doom-loop-aware selection (“DPO against doom loops”)

During preference alignment, their on-policy data generation pipeline:

Goal: train the policy to avoid doom-loop responses.


Solution 2: RL with verifiable rewards + repetition penalties

Use reinforcement learning with verifiable rewards, e.g.:

Additional methods:


Quantitative example (small reasoning model under hard tasks)

For LFM 2.5 1.2B thinking:

Comparison claim: Attempting a similar setup with Gemma 3.5 0.8B in reasoning mode yields doom loops >50%, described as an example where small models behave like scaled-down big models—contrasting with Liquid’s edge-first approach.


Notes on distillation from bigger models

The discussion suggests that distilling doom-loop behavior from larger models may not directly transfer, implying additional steps or batches are required to re-eliminate doom loops.


Next-stage idea: agentic reinforcement learning + tool/web search to address memory limits

They emphasize the key characteristic:

memory bound ⇒ low knowledge ⇒ more hallucination

Proposed mitigation

Why this could work

They argue tiny models can still be very good at agentic tasks if they have reliable reasoning to use tools correctly.

Handling context limitations

Takeaway

Combine edge models with agentic tools for strong performance. Large-model agentic workloads are important, but may not always be the best fit.


Main speakers / sources (explicit at end)

Category ?

Technology


Share this summary


Is the summary off?

If you think the summary is inaccurate, you can reprocess it with the latest model.

Video