Skip to content

Latest commit

 

History

History
911 lines (606 loc) · 32.9 KB

File metadata and controls

911 lines (606 loc) · 32.9 KB

Generative Adversarial Networks (GANs)

1. Motivation and Historical Context

1.1 The Problem Before GANs

Generative modeling has been one of the fundamental challenges in machine learning: how can we create systems that can generate realistic synthetic data samples that resemble the distribution of real data? Before GANs (2014), the field relied on:

  • Maximum Likelihood Estimation (MLE) based approaches: Variational Autoencoders (VAEs), Boltzmann Machines
  • Explicit density models: Required tractable probability density functions $p(x)$, which is computationally expensive and restrictive
  • Autoregressive models: Slow generation as predictions required sequential sampling
  • Issues with existing methods:
    • VAEs produce blurry reconstructions due to their reconstruction loss objective
    • RBMs are difficult to scale and slow to sample from
    • Explicit density models struggle with high-dimensional data
    • No direct way to generate samples without explicitly modeling probability distributions

1.2 Historical Context: Why GANs Were Revolutionary

In June 2014, Ian Goodfellow et al. proposed Generative Adversarial Networks (GANs), introducing a paradigm shift that:

  • Avoided explicit density modeling: No need to define $p(x)$ explicitly
  • Enabled implicit density models: Learn the data distribution through a game-theoretic framework
  • Generated sharper, more realistic samples: Compared to VAEs, especially for images
  • Introduced an adversarial training scheme: Two networks in competition drive each other toward better solutions

This framework became the foundation for numerous applications: image generation, style transfer, super-resolution, and more.


2. Core Intuition: The Game-Theoretic Framework

2.1 The Prisoner's Dilemma Analogy

Think of GANs as a sophisticated adversarial game:

  • Generator (Counterfeiter): Tries to create fake money (data) convincingly
  • Discriminator (Police): Tries to catch fake money by distinguishing it from real currency
  • Goal: Both improve iteratively. The counterfeiter learns what works; the police becomes better at detection
  • Equilibrium: At convergence, the counterfeiter makes "perfect" fake money (indistinguishable from real)

2.2 The Architecture

Random Noise (z)
      ↓
  ┌─────────────┐
  │  Generator  │ → Fake Data (G(z))
  │      G      │
  └─────────────┘
        ↑
    Feedback
        ↓
  ┌─────────────┐
  │ Discriminator│ → Real? Fake?
  │      D      │   (0 or 1)
  └─────────────┘
      ↑
  Real Data (x)

Key Components:

  1. Generator $G$: Maps noise $z \sim p_z(z)$ to data space. Learning to capture the data distribution
  2. Discriminator $D$: A binary classifier distinguishing real data from generated data
  3. Training objective: Simultaneous gradient descent on conflicting objectives

2.3 Latent Space and Noise Vector

The latent vector is the "seed" for generation:

Typical Properties:

  • Shape: Usually 50–512 dimensional vectors
  • Distribution: Sampled from:
    • Gaussian: $z \sim \mathcal{N}(0, I)$ (most common)
    • Uniform: $z \sim U(-1, 1)$
    • Other distributions possible

Interpretation:

  • Encodes high-level generative factors
  • Different $z$ values produce different outputs
  • All information for generation must flow through this bottleneck
  • Higher dimensionality → more expressive model capacity

Key insight: The generator essentially learns a mapping from a simple distribution (e.g., Gaussian) to the complex data distribution


3. Mathematical Formulation

3.1 The Min-Max Game

The fundamental objective function of GANs is formulated as a minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Breaking down the objective:

Component Meaning Who Optimizes
$\mathbb{E}{x \sim p{data}(x)}[\log D(x)]$ Log probability that D correctly identifies real data Discriminator maximizes
$\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ Log probability that D correctly rejects fake data Discriminator maximizes
$\log(1 - D(G(z)))$ Difficulty for generator; fool the discriminator Generator minimizes

3.2 Interpretation of the Objective

For the Discriminator (maximization):

  • Wants $D(x) \approx 1$ for real data (correct classification)
  • Wants $D(G(z)) \approx 0$ for fake data (correct rejection)
  • Objective: $\max_D[\log D(x) + \log(1 - D(G(z)))]$

For the Generator (minimization):

  • Wants $D(G(z)) \approx 1$ (fool the discriminator)
  • Since discriminator is trying to make it small, generator minimizes: $$\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

3.3 Alternative Generator Objective (Non-saturating)

The original objective has a problem: when $D$ is strong (good discriminator), $\log(1-D(G(z)))$ saturates, providing weak gradients to $G$.

Solution: Use the non-saturating objective instead: $$\min_G \mathbb{E}_{z \sim p_z(z)}[-\log D(G(z))]$$

This flips the gradient direction, providing stronger gradients early in training when $D(G(z)) \approx 0$.


4. The Training Algorithm

4.1 Alternating Gradient Descent

Initialize: Generator G and Discriminator D with random weights
For each training iteration:

  DISCRIMINATOR UPDATE:
    1. Sample m real samples {x₁, x₂, ..., xₘ} from p_data
    2. Sample m noise samples {z₁, z₂, ..., zₘ} from p_z
    3. Compute gradients: ∇_D [1/m Σlog D(xᵢ) + 1/m Σlog(1-D(G(zᵢ)))]
    4. Update D using gradient ascent

  GENERATOR UPDATE:
    1. Sample m noise samples {z₁, z₂, ..., zₘ} from p_z
    2. Compute gradients: ∇_G [-1/m Σlog D(G(zᵢ))] (or non-saturating variant)
    3. Update G using gradient descent

4.2 Convergence Claim

Under the original theoretical framework, if:

  • Both G and D have sufficient capacity
  • Training reaches a global optimum at each step

Then at Nash equilibrium:

  • $G$ recovers the data distribution: $p_G = p_{data}$
  • $D$ becomes unable to distinguish: $D(x) = 0.5$ everywhere

In practice: This theoretical guarantee rarely holds due to:

  • Limited network capacity
  • Non-convex optimization
  • Mode collapse (discussed below)

4.3 GANs as Games, Not Optimization Problems

Critical distinction from standard machine learning:

Standard Optimization:

Minimize: Loss(x, w)
Find: Global minimum
Stop: When loss converges

GAN Training (Game Theory):

Find: Nash Equilibrium
Updates: Simultaneous for two competing players
No single global objective to minimize
Parameters may oscillate even when quality improves

Key implication:

  • GAN loss curves may oscillate wildly
  • Loss oscillation ≠ Poor training
  • Loss decrease ≠ Improved sample quality
  • Must use external metrics (FID, IS) to assess progress

5. Why GANs Work: Information-Theoretic Perspective

5.1 Connection to Divergence Measures

The GAN objective can be reframed using divergence measures. At optimal discriminator $D^{*}$ for fixed $G$:

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

Substituting back:

$$V(D^{*}, G) = \mathbb{E}_{x \sim p_{data}}[\log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}] + \mathbb{E}_{x \sim p_G}[\log \frac{p_G(x)}{p_{data}(x) + p_G(x)}]$$

This can be rewritten as:

$$V(D^{*}, G) = -2\log 2 + 2 \cdot JSD(p_{data} \parallel p_G)$$

where $JSD$ is the Jensen-Shannon Divergence—a symmetric divergence measure between distributions.

Key insight: Minimizing the GAN objective is equivalent to minimizing the JS divergence between the data distribution and generator distribution.

5.2 Why This Matters

  • Advantage over MLE: JS divergence provides non-zero gradients even when distributions have no overlap (unlike KL divergence)
  • Practical implication: GANs can learn from disjoint supports, whereas MLE-based methods struggle
  • Limitation: JS divergence is symmetric but can be unintuitive for comparing distributions with different supports

5.3 No Explicit Likelihood (Important Distinction)

GANs learn an implicit density model:

Cannot compute: $p(x) = ?$ for any sample $x$

Why this matters:

  • No direct likelihood-based evaluation
  • Cannot use log-likelihood as a stopping criterion
  • Cannot compute perplexity or other traditional metrics
  • Prevents use of maximum likelihood estimation for training

Advantage over explicit models:

  • Not constrained to tractable probability distributions
  • Can model complex, high-dimensional distributions
  • Freedom in architecture design

Disadvantage:

  • Harder to evaluate (requires proxy metrics like FID)
  • Cannot estimate sample importance or likelihood

6. Key Challenges and Failure Modes

6.1 Mode Collapse

Problem: The generator learns to produce only a limited subset of the data distribution, ignoring modes (clusters).

True Distribution          Generator Output
┌─────────┬─────────┐      ┌─────────┐
│  Mode A │  Mode B │      │ Mode A  │
│ (many)  │ (many)  │      │ (many)  │
└─────────┴─────────┘      └─────────┘
                                   Mode B completely missing!

Why it happens:

  • Generator finds one mode and exploits it fully
  • Discriminator learns to reject samples from that mode
  • Generator switches to another mode rather than sampling all modes
  • Lack of diversity in loss signal

Solutions:

  • Minibatch discrimination: Discriminator compares minibatches, not individual samples
  • Unrolled GANs: Discriminator sees multiple generator update steps ahead
  • Spectral normalization: Stabilizes training
  • Wasserstein GANs: Use Wasserstein distance instead of JS divergence

6.2 Vanishing Gradients

Problem: When discriminator becomes too good, generator receives near-zero gradients, making learning stall.

Mathematical explanation: If $D(G(z)) \approx 0$ everywhere:

  • $\log(1 - D(G(z))) \approx 0$ and nearly constant
  • $\nabla_G \log(1 - D(G(z))) \approx 0$
  • No useful signal to improve generator

Visual intuition:

D Score
  1.0 │      Generator samples here
      │      ↓
      │    ┌──────┐
      │    │ FLAT │  ← Gradient nearly zero!
      │    │ (D=0)│
  0.0 │────┴──────┴────
      └──────────────────
      Real            Fake

Solutions:

  • Non-saturating loss: $-\log D(G(z))$ instead of $\log(1-D(G(z)))$
  • Feature matching: Match discriminator's hidden layer statistics
  • Wasserstein GAN: Use continuous loss without log

6.3 Training Instability

Problem: Generator and discriminator losses oscillate wildly; training doesn't converge smoothly.

Causes:

  • Simultaneous alternating updates create feedback loops
  • Imbalanced learning rates between networks
  • Networks with mismatched capacity

Indicators:

Loss curve: ════════════════════════════════
            (chaotic, oscillating wildly)

vs. Stable: ═════╲╲╲════════════════════════
            (monotonic decrease with plateaus)

Mitigation strategies:

  • Keep discriminator stronger than generator (don't train D to convergence)
  • Use gradient penalties
  • Batch normalization in both networks
  • Careful hyperparameter tuning (learning rates, batch sizes)

6.4 Convergence Proof Issues

Reality check: The theoretical convergence guarantees of vanilla GANs are:

  • Proven only under unrealistic assumptions (infinite capacity networks, continuous training)
  • Not guaranteed in practice with finite-capacity neural networks
  • No practical stopping criterion or convergence verification method

7. Variants and Improvements

7.1 DCGAN (Deep Convolutional GAN)

Historical significance: First successful CNN-based GAN (Radford et al., 2015); breakthrough enabling large-scale image generation

Architectural Guidelines (critical for stable training):

Component Rule Reason
Generator Pooling Replace with strided transpose convolutions Better gradient flow
Generator Norm Batch Norm on all layers except final Reduces internal covariate shift
Generator Activation ReLU hidden layers, tanh output Non-linearity + bounded output
Discriminator Pooling Replace with strided convolutions Gradient stability
Discriminator Norm Batch Norm all layers except first Prevents instability
Discriminator Activation LeakyReLU (α ≈ 0.2) Avoids dead neurons

Typical Architecture:

GENERATOR (input: z ~ N(0,I), shape: 100-dimensional):
  z → Reshape to (512, 1, 1)
  → ConvTranspose2d(512→256, k=4) + BN + ReLU
  → ConvTranspose2d(256→128, k=4) + BN + ReLU
  → ConvTranspose2d(128→64, k=4) + BN + ReLU
  → ConvTranspose2d(64→3, k=4) + Tanh
  Output: RGB image (3, 64, 64)

DISCRIMINATOR (input: image):
  Image (3, 64, 64)
  → Conv2d(3→64, k=4, stride=2) + LeakyReLU
  → Conv2d(64→128, k=4, stride=2) + BN + LeakyReLU
  → Conv2d(128→256, k=4, stride=2) + BN + LeakyReLU
  → Conv2d(256→512, k=4, stride=2) + BN + LeakyReLU
  → Flatten → Dense(1) + Sigmoid
  Output: Probability real (0-1)

Why these rules work:

  • Strided convolutions prevent checkerboard artifacts from naïve transpose convolutions
  • Batch normalization accelerates convergence and provides regularization
  • LeakyReLU prevents vanishing gradients in discriminator
  • tanh output on generator ensures output range [-1, 1] (matches normalized image data)

7.2 Wasserstein GAN (WGAN)

Key innovation: Replace JS divergence with Wasserstein distance (earth-mover distance)

$$W(p_{data}, p_G) = \inf_{\gamma \in \Pi} \mathbb{E}_{(x,y) \sim \gamma}[|x-y|]$$

Benefits:

  • Provides meaningful gradients even with non-overlapping supports
  • More stable training
  • Real-valued loss correlates with sample quality

Formula: $$\min_G \max_{D: |D|L \leq 1} \mathbb{E}{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$$

where the discriminator is Lipschitz-constrained.

7.3 Conditional GAN (cGAN)

Motivation: Control what the generator produces by conditioning on labels

$$\min_G \max_D \mathbb{E}_{x \sim p_{data}, y}[\log D(x|y)] + \mathbb{E}_{z, y}[\log(1-D(G(z|y)))]$$

Architecture:

Label y ────┐
            ├→ Concatenate → Generator → Fake image
Noise z ────┤
            └→ Discriminator → Real or Fake?
                  (also receives label y)

Applications: Image generation with labels, text-to-image, class-conditional generation

7.4 Progressive GAN

Idea: Train from low to high resolution, gradually adding layers

Benefits:

  • Stabilizes training
  • Enables high-resolution image generation
  • Focuses on coarse details first, then fine details

Layer progression:

Resolution:  4×4  →  8×8  →  16×16  →  32×32  →  64×64  →  256×256  →  1024×1024
Training:    ═══════════════════════════════════════════════════════════════════

7.5 StyleGAN

Innovation: Decomposes generation into coarse (style) and fine (content) control

Architecture:

Latent Code z (random)
           ↓
    Mapping Network (8 layers) → w (style vector)
           ↓
  Adaptive Instance Normalization (AdaIN)
           ↓
  Synthesis Network with Progressive Growth
           ↓
    Generated Image

Advantages:

  • Disentangled representation (change style vs. content independently)
  • Better control over generation
  • State-of-the-art image quality

8. Training Stability Techniques

8.1 Spectral Normalization

Idea: Normalize weight matrices to have spectral norm (largest singular value) of 1

$$W_{SN} = \frac{W}{\sigma(W)}$$

where $\sigma(W) = \max_h |Wh|/|h|$ (largest singular value)

Effect: Limits Lipschitz constant of discriminator, stabilizing gradients

8.2 Gradient Penalty (WGAN-GP)

Rather than weight clipping, add a regularization term:

$$L = L_{original} + \lambda \mathbb{E}_{x}[(|\nabla_x D(x)|_2 - 1)^2]$$

Enforces gradient norm = 1, ensuring Lipschitz constraint

8.3 Batch Normalization and Layer Normalization

Batch Norm: Normalize activations across batch dimension

  • Reduces internal covariate shift
  • Stabilizes discriminator learning
  • Caveat: Can cause training oscillations if overused in generator

Better alternatives:

  • Layer Normalization: Normalize per-sample
  • Instance Normalization: Useful for style transfer GANs
  • Group Normalization: Hybrid approach

8.4 One-Sided Label Smoothing

Instead of:

  • Real labels: 1
  • Fake labels: 0

Use:

  • Real labels: 0.9
  • Fake labels: 0.0

Prevents discriminator from becoming overconfident, improving gradient flow


9. Loss Functions and Objectives

9.1 Standard GAN Loss

$$L_D = -\mathbb{E}_x[\log D(x)] - \mathbb{E}_z[\log(1-D(G(z)))]$$ $$L_G = -\mathbb{E}_z[\log D(G(z))]$$

9.2 Non-Saturating GAN Loss

$$L_D = -\mathbb{E}_x[\log D(x)] - \mathbb{E}_z[\log(1-D(G(z)))]$$ $$L_G = \mathbb{E}_z[\log(1-D(G(z)))]$$ (Generator objective flipped)

9.3 Wasserstein Loss

$$L_D = \mathbb{E}_z[D(G(z))] - \mathbb{E}_x[D(x)]$$ $$L_G = -\mathbb{E}_z[D(G(z))]$$

9.4 Hinge Loss (Discriminator perspective)

$$L_D = \mathbb{E}_x[\max(0, 1-D(x))] + \mathbb{E}_z[\max(0, 1+D(G(z)))]$$

Advantage: Robust margins, more stable training

9.5 Least Squares GAN (LSGAN)

$$L_D = \frac{1}{2}\mathbb{E}_x[(D(x)-1)^2] + \frac{1}{2}\mathbb{E}_z[D(G(z))^2]$$ $$L_G = \frac{1}{2}\mathbb{E}_z[(D(G(z))-1)^2]$$

Property: Penalizes far off predictions more, leading to faster convergence


10. Evaluation Metrics

10.1 Inception Score (IS)

Formula: $$IS = \exp(\mathbb{E}_x[KL(p(y|x) \parallel p(y))])$$

where $y$ is class label from pre-trained Inception network

Interpretation:

  • Measures class confidence: $p(y|x)$ should be high for specific classes
  • Measures diversity: marginal $p(y)$ should be uniform
  • Higher IS is better
  • Limitation: Doesn't directly measure similarity to real data

10.2 Fréchet Inception Distance (FID)

Formula: $$FID = |\mu_r - \mu_g|_2^2 + Tr(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$$

where $\mu, \Sigma$ are mean and covariance of activations

Advantages:

  • Measures both quality (first term: distance between means) and diversity (second term: covariance difference)
  • More robust to classifier overfitting than IS
  • Better correlation with human judgment
  • Gold standard metric in practice

10.3 Kernel Inception Distance (KID)

Computationally cheaper alternative to FID, based on maximum mean discrepancy

10.4 Precision and Recall for Generated Samples

Precision: Fraction of generated samples that are "realistic" Recall: Fraction of real data modes covered by generator

Trade-off: High precision (mode collapse) vs. high recall (blurry samples)

10.5 Why GAN Loss Does Not Correlate with Sample Quality

Important insight: GAN loss values oscillate and do not monotonically decrease, even when sample quality improves.

Why this happens:

  1. Discriminator loss and generator loss are not aligned with sample quality
  2. As generator improves, discriminator must adapt → loss increases
  3. As discriminator improves, generator loss increases → generator updates
  4. Simultaneous updates create feedback oscillations
  5. Nash equilibrium involves loss oscillation, not convergence

Example:

Iteration  | Gen Loss | Dis Loss | Sample Quality | Notes
───────────┼──────────┼──────────┼────────────────┼──────────────────
   100     |  0.50    |  0.60    | Blurry         |
   200     |  0.48    |  0.62    | Still blurry   | Loss decreased but quality unchanged
   300     |  0.55    |  0.58    | Better         | Loss increased but quality improved!
   400     |  0.52    |  0.63    | Even better    | Oscillating loss, improving quality
   500     |  0.51    |  0.59    | Great          | Loss oscillates, quality monotonically improves

Lesson: Do NOT use loss curves to assess GAN training progress. Always use external metrics (FID, IS, visual inspection).


11. Comparison: GANs vs VAEs vs Autoencoders

Aspect GAN VAE Autoencoder
Likelihood Implicit (cannot compute) Explicit (tractable) None
Sharp Images Excellent (high quality) Blurry (averaging) Moderate (lossy)
Stable Training ❌ (Notoriously unstable) ✅ (ELBO-based, stable) ✅ (Simple loss)
Generative Capability Generates new samples Generates new samples Mainly reconstruction
Mode Coverage Poor (mode collapse) Good (approximates full distribution) Poor (memorization)
Interpretability Black box Latent space interpretable Latent features interpretable
Training Objective Minimax game (no guarantee) Maximize ELBO (guaranteed convergence) MSE/Reconstruction loss
Evaluation Metrics FID, IS (proxy metrics) Negative ELBO (exact but loose) Reconstruction error
Training Speed Slow (adversarial iterations) Moderate Fast (single pass)
Memory Usage High (two networks) Moderate (one network) Moderate (one network)
Best For Photorealistic image generation Data exploration, interpolation Denoising, compression
Worst At High-dimensional data diversity Visual quality Any generative task

Key takeaway:

  • GAN: Sharp samples, hard to train
  • VAE: Stable, blurry samples, good latent space
  • AE: Simple baseline, mainly for reconstruction

12. Practical Applications

12.1 Image Generation

Text-to-Image: AttnGAN, StackGAN

Text: "A red bird with black wings"
            ↓
   [Text Encoder + GAN]
            ↓
    [Generated Bird Image]

Image-to-Image Translation: Pix2Pix, CycleGAN

Input: Sketch    →    GAN    →    Output: Photorealistic Image

12.2 Super-Resolution

SRGAN: Super-resolution using generative adversarial networks

  • Generator learns to upscale low-res images with fine details
  • Discriminator ensures output looks natural
  • Perceptual loss + adversarial loss

12.3 Face Generation and Manipulation

StyleGAN: Generate photo-realistic faces StarGAN: Facial attribute editing

Input Face + Attribute Label → GAN → Modified Face
(e.g., add beard, change hair color, age face)

12.4 Domain Adaptation

Unsupervised Pixel-Level Domain Adaptation:

  • PixelDA: Adversarially adapt synthetic game images to real
  • Use adversarial loss to match source and target domain feature distributions

12.5 Data Augmentation

Generate synthetic training data to:

  • Handle imbalanced datasets
  • Increase training set size
  • Improve model robustness

13. Relation to Curse of Dimensionality

13.1 The Problem

The curse of dimensionality states that as data dimensionality increases:

  • Volume grows exponentially
  • Data points become sparse
  • Probability mass concentrates in thin shells
  • Distances between points become nearly equal

Naive assumption: "GANs overcome the curse of dimensionality"

Reality: GANs do not eliminate the curse; they exploit data structure.

13.2 Why GANs Seem to Work Despite the Curse

GANs succeed because real-world data lies on low-dimensional manifolds:

High-Dimensional Space (e.g., 256×256 = 65K dims)
│
├─ Curse of Dimensionality (empty space)
│
└─ LOW-DIMENSIONAL MANIFOLD (actual data)
   └─ Natural images, faces, text, etc.
      ~100-1000 effective dimensions

Generator learns to map: $$\text{Simple distribution} \rightarrow \text{Low-dimensional data manifold}$$

Not the entire high-dimensional space.

13.3 When GANs Fail Due to Dimensionality

Conditions for failure:

  • Data without clear structure (e.g., random images)
  • Truly high-dimensional uniform distributions
  • Insufficient training data relative to dimensionality

Example: Generate random noise images

  • Generator finds it easier to output gray noise
  • Discriminator cannot distinguish meaningful structure
  • Mode collapse to "average noise"

Conclusion: GANs work because real data is structured, not because high-dimensional modeling is easy.


14. Common Misconceptions and Edge Cases

14.1 Misconception: "GANs minimize reconstruction error"

Wrong: GANs don't minimize pixel-level reconstruction loss Correct: GANs minimize distribution divergence (JS divergence, Wasserstein distance, etc.)

Consequence:

  • Generated samples are not "closest to training data"
  • Generated samples can be novel and diverse
  • Not constrained by reconstruction error bottleneck

14.2 Misconception: "GAN loss correlates with sample quality"

Wrong: Loss values that decrease = improving samples Correct: Loss oscillates; convergence is game-theoretic equilibrium

Evidence:

  • Loss can increase while visual quality improves
  • Stable, low loss can indicate mode collapse
  • No monotonic relationship exists

14.3 Misconception: "GANs always outperform VAEs"

Wrong: GANs are universally better Correct: Task-dependent trade-offs

Task Better Reason
Photo-realistic images GAN Sharp, high-quality
Data exploration VAE Interpretable latent space
Stable training VAE No adversarial instability
High-res generation GAN Can achieve 1024×1024+
Inference speed VAE Direct forward pass

14.4 Misconception: "Generator learns the true distribution"

Reality:

  • Generator learns an implicit approximation of the distribution
  • May only capture dominant modes
  • Doesn't explicitly model $p(x)$—can't compute likelihoods
  • Different from explicit density models

Edge case: For simple 1D distributions, generator often learns a unimodal approximation even when real data is multimodal

14.5 Misconception: "GAN must be min-max at each step"

Reality:

  • Practitioners often don't reach true Nash equilibrium
  • May update discriminator $k$ times per generator update (practical hyperparameter)
  • Early stopping used instead of waiting for convergence
  • Training is more art than science

14.6 Edge Case: Imbalanced Datasets

Problem: Discriminator can achieve high accuracy by memorizing rare samples

Solution:

  • Data augmentation
  • Weighted sampling
  • Focal loss variants

14.7 Edge Case: High-Dimensional Data

Problem: As dimensionality increases, probability mass concentrates in thin shells (curse of dimensionality)

Implication: Measuring convergence becomes harder; discriminator may be unable to distinguish distributions

Mitigation: Progressive training, better initialization, higher capacity networks


15. Mathematical Deeper Dives

15.1 Why GAN as Game Matters: Nash Equilibrium

The GAN minimax formulation is a sequential game where convergence is understood through game theory, not optimization.

Nash Equilibrium: A state where neither player benefits from unilaterally changing strategy

$$\text{For Nash Eq: } \exists (G^{_}, D^{_}) \text{ such that:}$$ $$G^{_} = \arg\min_G V(D^{_}, G)$$ $$D^{_} = \arg\max_D V(D^{_}, G)$$

Key difference from optimization:

  • Equilibrium ≠ Global minimum
  • Both players can have positive loss at equilibrium
  • Training may never reach equilibrium in practice

15.2 Optimal Discriminator Proof

For fixed generator $G$, the optimal discriminator is:

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

Proof: The discriminator objective for fixed $G$ is: $$V(D) = \int_x p_{data}(x) \log D(x) + p_G(x) \log(1-D(x)) , dx$$

For each $x$, maximize: $$f(D) = p_{data} \log D + p_G \log(1-D)$$

Taking derivative: $$\frac{df}{dD} = \frac{p_{data}}{D} - \frac{p_G}{1-D} = 0$$

Solving: $D = \frac{p_{data}}{p_{data} + p_G}$

15.3 Global Optimality

At optimal $D^{*}$ and optimal $G$: $$\min_G \max_D V(D, G) = -\log 4 + 2 \cdot JSD(p_{data} \parallel p_G)$$

Minimum value $-\log 4$ achieved when $p_G = p_{data}$ (divergence = 0)


16. Summary: Key Takeaways

Aspect Insight
Core Idea Two networks in adversarial competition: one generates, one distinguishes
Mathematical Goal Minimize Jensen-Shannon divergence (implicitly)
Main Advantage No explicit density modeling required; sharp sample generation
Main Challenge Training instability, mode collapse, vanishing gradients
Practical Success Requires careful tuning; many architectural innovations needed
Evaluation Use FID score (industry standard), not just Inception Score
Current State Foundation for numerous applications; variations (WGAN, StyleGAN) address original issues

17. References and Further Reading

  • Goodfellow et al. (2014): Original GAN paper establishing framework
  • Wasserstein GAN (Arjovsky et al., 2017): Improved training stability
  • Progressive GAN (Karras et al., 2018): High-resolution image synthesis
  • StyleGAN (Karras et al., 2019): State-of-the-art generation with disentanglement
  • Spectral Normalization (Miyato et al., 2018): Stabilization technique
  • FID Score (Heusel et al., 2017): Primary evaluation metric

18. Appendix: Quick Reference—Common GAN Hyperparameters

Hyperparameter Typical Range Notes
Learning Rate (G) 0.0001 - 0.0002 Usually lower than discriminator
Learning Rate (D) 0.0002 - 0.0004 Higher LR for faster discrimination
Batch Size 32 - 256 Larger batches stabilize training
Latent Dimension 64 - 512 Higher → more model capacity
D Updates per G Update 1 - 5 Keep D slightly stronger
Gradient Penalty λ 10 - 100 For WGAN-GP
Label Smoothing 0.1 - 0.9 One-sided (real: 0.9, fake: 0.0)