Summary of "But how do AI images and videos actually work? | Guest video by Welch Labs"
High-level idea
Modern text-to-image and text-to-video models synthesize pixels by repeatedly transforming pure noise into structured images/videos using diffusion-based processes. Mathematically this is like running Brownian motion (diffusion) backwards in very high-dimensional spaces.
Core components and concepts
- CLIP (OpenAI, 2021)
- Two encoders: a text encoder and an image encoder, each outputting a 512‑dimensional embedding.
- Trained contrastively: matching image–caption pairs are pulled together in embedding space; non-matching pairs are pushed apart (cosine similarity).
- The learned embedding space encodes semantic directions (e.g., a “hat” vector ≈ difference between “with hat” and “without hat”).
- CLIP maps image/text → embeddings but cannot generate images from embeddings directly.
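The contrastive objective described above can be sketched in a few lines of numpy. This is a minimal illustration, not OpenAI's implementation: real CLIP uses learned encoders and a learnable temperature, and the `clip_contrastive_loss` name and the fixed `temperature=0.07` default are assumptions here.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss on a batch of embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matching
    image-caption pair. Matching pairs sit on the diagonal of the similarity
    matrix and are pulled together; off-diagonal pairs are pushed apart.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    labels = np.arange(len(logits))           # correct matches lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    # symmetric: classify the right caption for each image, and vice versa
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With matching rows the diagonal dominates and the loss is small; shuffling one side misaligns the pairs and the loss rises, which is exactly the "pull together / push apart" behavior.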
- Diffusion models (DDPM — Berkeley, 2020)
- Forward process: gradually add Gaussian noise to training images until the image is destroyed (a random walk/Brownian motion).
- Reverse process: train a neural network to undo noise and recover images.
- Key points from DDPM:
- Training target is the total added noise (ε) rather than only the immediate previous step — this reduces variance and makes learning more efficient.
- During sampling, DDPM adds noise at each reverse step (stochastic sampling). Adding noise during generation surprisingly improves sharpness and diversity; removing it collapses outputs toward the mean (blurriness).
- Intuition: the model learns a time-dependent score function / vector field that points back toward higher-density (realistic) data.
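The forward process and the ε training target can be made concrete with a short sketch. The linear beta schedule below is an assumption for illustration (the DDPM paper uses betas from 1e-4 to 0.02 over 1000 steps); the function names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # per-step noise amounts (assumed schedule)
alpha_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept at step t

def forward_diffuse(x0, t, rng):
    """Jump directly to noise level t of the forward process:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    Returns the noisy sample and the *total* added noise eps, which is the
    denoiser's regression target (not the previous step x_{t-1})."""
    eps = rng.normal(size=x0.shape)    # total Gaussian noise added so far
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

Because any noise level is reachable in one jump, training can sample a random t per example instead of simulating the whole random walk, which is part of why the ε parameterization is efficient.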
- Time-conditioning and the vector-field view
- Models are conditioned on a time variable t (amount of noise), so they learn coarse behavior at high noise and fine structure as t → 0.
- Diffusion learning can be seen as learning a time-varying vector field (flow) in data space that directs noisy points back to the data manifold.
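One common way to feed t into the network is a sinusoidal embedding, as in transformers. The source only says models are conditioned on t, so the exact encoding below is an assumption, shown because it is the standard choice:

```python
import numpy as np

def timestep_embedding(t, dim=16):
    """Sinusoidal embedding of the noise level t (transformer-style).

    The denoiser receives this vector alongside the noisy input, so it can
    learn coarse behavior at high noise and fine structure as t -> 0."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric frequency ladder
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])    # values in [-1, 1]
```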
- Why noise during sampling matters
- Without the stochastic term, reverse dynamics push samples to the dataset mean → resulting images are blurry.
- The random term is required to sample the full reverse Gaussian distribution: the model predicts the mean, but adding Gaussian noise draws a full sample.
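A single DDPM reverse step makes the mean-vs-sample distinction explicit. A minimal sketch, assuming the same linear beta schedule as above; the `add_noise` flag is added here purely to illustrate what dropping the stochastic term does:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed DDPM-style schedule
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_reverse_step(xt, eps_pred, t, rng, add_noise=True):
    """One stochastic reverse step. The network's noise prediction eps_pred
    fixes the *mean* of the reverse Gaussian; adding z draws a full sample
    from it. With add_noise=False, samples collapse toward the mean, which
    is the source of blurriness."""
    alpha_t = 1.0 - betas[t]
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0 or not add_noise:
        return mean                    # the final step conventionally adds no noise
    z = rng.normal(size=xt.shape)      # the stochastic term
    return mean + np.sqrt(betas[t]) * z
```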
- DDIM and deterministic sampling (Stanford / Google)
- The stochastic reverse SDE (DDPM) can be mapped to a deterministic ODE (DDIM) with the same endpoint distribution.
- DDIM sampling is deterministic and can produce high-quality images with far fewer network iterations (faster sampling) by changing step scaling — no retraining required.
- Flow-matching generalizes DDIM; some video models (e.g., WAN) use these generalized flows.
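The deterministic DDIM update (the eta = 0 case) is short enough to write out. A sketch under the same assumed schedule as above: predict x0 from the current noise estimate, then re-project to a lower noise level, which may skip many steps at once:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed schedule, as in DDPM
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_pred, t, t_prev):
    """Deterministic DDIM update (eta = 0): no random term, so the same
    starting noise always yields the same image, and t_prev can jump far
    below t, which is why far fewer network evaluations are needed."""
    # invert the forward formula to estimate the clean image
    x0_pred = (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    # re-noise the estimate to the lower level t_prev
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_pred
```

If `eps_pred` happens to be the exact noise that was added, this step lands exactly on the forward-process sample at level `t_prev`, which is the sense in which the deterministic path is consistent with the stochastic one.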
- Conditioning on text and steering generation
- Straight conditioning: feed CLIP (or other text) embeddings into the diffusion model (e.g., via cross-attention or concatenation) so the denoiser uses text context during training/inference.
- unCLIP / DALL·E 2 (OpenAI): train a diffusion decoder to invert CLIP image embeddings, so generated images are consistent with the embedding of the prompt — enabling stronger prompt adherence.
- Conditioning alone is often insufficient for strong prompt adherence; additional techniques are commonly used.
- Classifier-free guidance (CFG)
- Technique: train the model sometimes without conditioning (unconditional) and sometimes with conditioning. During sampling, compute conditioned output minus unconditioned output and amplify that difference by a guidance factor α to push samples toward the condition.
- CFG effectively amplifies the semantic direction corresponding to the prompt, improving adherence and detail. Guidance scale α controls strength (higher α → stronger adherence, but can introduce artifacts).
- WAN extends this by using “negative prompts” (explicitly encode undesired features, subtract and amplify) to steer outputs away from unwanted attributes. WAN 2.1 is an open-source video model demonstrating these techniques.
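The CFG combination step is a one-liner; the function name below is hypothetical, but the formula is the standard one described above:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: amplify the direction the prompt adds.

    scale = 1 reduces to plain conditioning; larger values trade diversity
    for prompt adherence (and can introduce artifacts). For a negative
    prompt, eps_uncond is replaced by the noise prediction for the undesired
    text, so amplification steers samples *away* from those attributes."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Each sampling step runs the denoiser twice (with and without the text condition) and feeds the guided prediction into the usual DDPM/DDIM update.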
Practical open-source models and tools mentioned
- WAN 2.1: open-source video model used in the video demos.
- Stable Diffusion (Heidelberg team): open-source image diffusion model used in examples; benefits from classifier-free guidance and DDIM sampling to reduce compute.
- DALL·E 2 (OpenAI, unCLIP): a closed commercial approach that inverts CLIP to achieve strong prompt adherence.
Performance and compute trade-offs
- Early DDPMs required many denoiser steps; DDIM/flow methods dramatically reduce the required steps (faster sampling).
- Guidance (CFG) and inversion of powerful text–image encoders greatly improve fidelity to prompts, but tuning guidance and handling negative prompts are practical levers that trade off diversity for adherence.
- Deterministic ODE samplers (DDIM/flows) speed inference without retraining, but theoretical guarantees concern matching distributions, not individual samples.
Practical takeaways & intuition
- Diffusion sampling is a controlled reverse random walk guided by a learned vector field; including or removing the stochastic term has clear theoretical and empirical consequences (diversity vs. mean-collapse).
- Time-conditioning is essential: coarse structure is learned for large t; fine details are learned near t → 0.
- Combining CLIP-style embeddings with diffusion models (conditioning + classifier-free guidance or unCLIP-style inversion) enables text-driven image/video generation from language prompts.
- Deterministic ODE samplers (DDIM/flows) allow much faster inference without retraining, but they match distributions in aggregate rather than guaranteeing specific sample trajectories.
Guides, tutorials, and resources referenced
- The video mentions deeper theory tutorials on diffusion SDE/ODE connections and the DDPM math (useful for formal proofs and derivations).
- Suggested further viewing: Welch Labs' content — detailed ML and math explainer videos (including a well-regarded series on complex topics).
Main speakers and primary sources cited
- Guest presenter: Stephen Welch (Welch Labs) — author of the guest video and detailed explanations/demos.
- Key papers and teams referenced: OpenAI (CLIP, unCLIP / DALL·E 2), Berkeley (DDPM), Stanford & Google (DDIM / related work), Google Brain (Fokker–Planck / ODE mapping), Heidelberg (Stable Diffusion team), WAN team (WAN 2.1).