- KL divergence and Fisher divergence
- Forward KL vs Reverse KL: mode-covering vs mode-seeking behavior
- Denoising Diffusion Probabilistic Models formulation
- Training objective and ELBO derivation
- Conditional trick
- Sampling procedure
- Variance-preserving (VP) schedules
- Connection to VAEs
- Denoising Score Matching formulation
- Training objective and optimal solution
- Conditional trick
- Tweedie’s formula and the ideal denoiser
- Sampling procedure
- Motivation for multiple noise levels
- Variance-exploding (VE) schedules
- SDE and ODE formulations
- Wiener process (Brownian motion)
- Fokker–Planck and continuity equations. Why do we need them?
- Continuous-time schedule derivation
- Flow Matching (FM) formulation
- Training objective and optimal solution
- Conditional trick
- Sampling procedure
- Linear schedule
- Advantages over continuous diffusion formulations
- Euler, Heun
- DDIM, DPM-Solver
- Single-step vs multi-step (Adams–Bashforth)
- Connections across diffusion formulations
- Linear and cosine schedules
- EDM and SD3 schedules
- Shift selection strategies
- Classifier guidance
- Classifier-free guidance
- Interval guidance
- AutoGuidance
- ε-, x₀-, and v-prediction
- Conversion between parameterizations
- Corresponding loss functions
(Table 1. JiT but for the 1 --> 0 process)
- Training timestep distributions (Uniform, Logit-normal)
- Resolution-dependent shift selection
- Timestep conditioning
- Class and text conditioning
- Adapter-based conditioning
- Latent diffusion models
- Trade-offs: learnability vs reconstruction quality
- Representation Autoencoders (RAE)
Expectations:
- Training and sampling procedures
- Pros and cons for each variant
- Training from scratch vs distillation
- High-level connections between flow-map approaches
- Key difference between distribution matching and flow-map approaches
- General formulation
- Knowledge distillation
- Consistency models:
- Discrete-time and continuous-time Consistency Models
- Multi-boundary (multi-step) consistency
- Consistency Trajectory Models (CTM)
- Shortcut models and Mean Flow
Expectations: understanding training and sampling procedures, and pros and cons for each variant.
- Image tokenizers:
- VQ-VAE (without PixelCNN)
- VQ-GAN
- ViT-VQ-GAN
- Prediction paradigms:
- Diffusion as implicit spectral autoregression
Expectations: understanding high-level model designs and ideas, and their pros and cons.
- Architectural differences from image models
- Frame-autoregressive diffusion
- Teacher forcing vs Diffusion forcing vs Self forcing
- Multimodal Large Language Models (MLLMs), i.e. pure autoregressive (AR) models for text and images
- Unified AR for text + diffusion for images (BAGEL, Transfusion)
- VLM encoder + diffusion decoder (Qwen-Image)
- 2D diffusion for 3D training (DMD-like for 3D):
- Multi-view diffusion architectures (SEVA)
- Course Materials
- Tracing the Principles Behind Modern Diffusion Models
- FLUX.2 blogpost about VAE latent spaces and timestep shifts