Generative Adversarial Networks (GANs)

1. Motivation and Historical Context

1.1 The Problem Before GANs

Generative modeling has been one of the fundamental challenges in machine learning: how can we create systems that can generate realistic synthetic data samples that resemble the distribution of real data? Before GANs (2014), the field relied on:

Maximum Likelihood Estimation (MLE) based approaches: Variational Autoencoders (VAEs), Boltzmann Machines
Explicit density models: Required tractable probability density functions $p(x)$, which is computationally expensive and restrictive
Autoregressive models: Slow generation as predictions required sequential sampling
Issues with existing methods:
- VAEs produce blurry reconstructions due to their reconstruction loss objective
- RBMs are difficult to scale and slow to sample from
- Explicit density models struggle with high-dimensional data
- No direct way to generate samples without explicitly modeling probability distributions

1.2 Historical Context: Why GANs Were Revolutionary

In June 2014, Ian Goodfellow et al. proposed Generative Adversarial Networks (GANs), introducing a paradigm shift that:

Avoided explicit density modeling: No need to define $p(x)$ explicitly
Enabled implicit density models: Learn the data distribution through a game-theoretic framework
Generated sharper, more realistic samples: Compared to VAEs, especially for images
Introduced an adversarial training scheme: Two networks in competition drive each other toward better solutions

This framework became the foundation for numerous applications: image generation, style transfer, super-resolution, and more.

2. Core Intuition: The Game-Theoretic Framework

2.1 The Prisoner's Dilemma Analogy

Think of GANs as a sophisticated adversarial game:

Generator (Counterfeiter): Tries to create fake money (data) convincingly
Discriminator (Police): Tries to catch fake money by distinguishing it from real currency
Goal: Both improve iteratively. The counterfeiter learns what works; the police becomes better at detection
Equilibrium: At convergence, the counterfeiter makes "perfect" fake money (indistinguishable from real)

2.2 The Architecture

Random Noise (z)
      ↓
  ┌─────────────┐
  │  Generator  │ → Fake Data (G(z))
  │      G      │
  └─────────────┘
        ↑
    Feedback
        ↓
  ┌─────────────┐
  │ Discriminator│ → Real? Fake?
  │      D      │   (0 or 1)
  └─────────────┘
      ↑
  Real Data (x)

Key Components:

Generator $G$: Maps noise $z \sim p_z(z)$ to data space. Learning to capture the data distribution
Discriminator $D$: A binary classifier distinguishing real data from generated data
Training objective: Simultaneous gradient descent on conflicting objectives

2.3 Latent Space and Noise Vector

The latent vector is the "seed" for generation:

Typical Properties:

Shape: Usually 50–512 dimensional vectors
Distribution: Sampled from:
- Gaussian: $z \sim \mathcal{N}(0, I)$ (most common)
- Uniform: $z \sim U(-1, 1)$
- Other distributions possible

Interpretation:

Encodes high-level generative factors
Different $z$ values produce different outputs
All information for generation must flow through this bottleneck
Higher dimensionality → more expressive model capacity

Key insight: The generator essentially learns a mapping from a simple distribution (e.g., Gaussian) to the complex data distribution

3. Mathematical Formulation

3.1 The Min-Max Game

The fundamental objective function of GANs is formulated as a minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Breaking down the objective:

Component	Meaning	Who Optimizes
$\mathbb{E}{x \sim p{data}(x)}[\log D(x)]$	Log probability that D correctly identifies real data	Discriminator maximizes
$\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$	Log probability that D correctly rejects fake data	Discriminator maximizes
$\log(1 - D(G(z)))$	Difficulty for generator; fool the discriminator	Generator minimizes

3.2 Interpretation of the Objective

For the Discriminator (maximization):

Wants $D(x) \approx 1$ for real data (correct classification)
Wants $D(G(z)) \approx 0$ for fake data (correct rejection)
Objective: $\max_D[\log D(x) + \log(1 - D(G(z)))]$

For the Generator (minimization):

Wants $D(G(z)) \approx 1$ (fool the discriminator)
Since discriminator is trying to make it small, generator minimizes: $$\min_G \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

3.3 Alternative Generator Objective (Non-saturating)

The original objective has a problem: when $D$ is strong (good discriminator), $\log(1-D(G(z)))$ saturates, providing weak gradients to $G$.

Solution: Use the non-saturating objective instead: $$\min_G \mathbb{E}_{z \sim p_z(z)}[-\log D(G(z))]$$

This flips the gradient direction, providing stronger gradients early in training when $D(G(z)) \approx 0$.

4. The Training Algorithm

4.1 Alternating Gradient Descent

Initialize: Generator G and Discriminator D with random weights
For each training iteration:

  DISCRIMINATOR UPDATE:
    1. Sample m real samples {x₁, x₂, ..., xₘ} from p_data
    2. Sample m noise samples {z₁, z₂, ..., zₘ} from p_z
    3. Compute gradients: ∇_D [1/m Σlog D(xᵢ) + 1/m Σlog(1-D(G(zᵢ)))]
    4. Update D using gradient ascent

  GENERATOR UPDATE:
    1. Sample m noise samples {z₁, z₂, ..., zₘ} from p_z
    2. Compute gradients: ∇_G [-1/m Σlog D(G(zᵢ))] (or non-saturating variant)
    3. Update G using gradient descent

4.2 Convergence Claim

Under the original theoretical framework, if:

Both G and D have sufficient capacity
Training reaches a global optimum at each step

Then at Nash equilibrium:

$G$ recovers the data distribution: $p_G = p_{data}$
$D$ becomes unable to distinguish: $D(x) = 0.5$ everywhere

In practice: This theoretical guarantee rarely holds due to:

Limited network capacity
Non-convex optimization
Mode collapse (discussed below)

4.3 GANs as Games, Not Optimization Problems

Critical distinction from standard machine learning:

Standard Optimization:

Minimize: Loss(x, w)
Find: Global minimum
Stop: When loss converges

GAN Training (Game Theory):

Find: Nash Equilibrium
Updates: Simultaneous for two competing players
No single global objective to minimize
Parameters may oscillate even when quality improves

Key implication:

GAN loss curves may oscillate wildly
Loss oscillation ≠ Poor training
Loss decrease ≠ Improved sample quality
Must use external metrics (FID, IS) to assess progress

5. Why GANs Work: Information-Theoretic Perspective

5.1 Connection to Divergence Measures

The GAN objective can be reframed using divergence measures. At optimal discriminator $D^{*}$ for fixed $G$:

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

Substituting back:

$$V(D^{*}, G) = \mathbb{E}_{x \sim p_{data}}[\log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}] + \mathbb{E}_{x \sim p_G}[\log \frac{p_G(x)}{p_{data}(x) + p_G(x)}]$$

This can be rewritten as:

$$V(D^{*}, G) = -2\log 2 + 2 \cdot JSD(p_{data} \parallel p_G)$$

where $JSD$ is the Jensen-Shannon Divergence—a symmetric divergence measure between distributions.

Key insight: Minimizing the GAN objective is equivalent to minimizing the JS divergence between the data distribution and generator distribution.

5.2 Why This Matters

Advantage over MLE: JS divergence provides non-zero gradients even when distributions have no overlap (unlike KL divergence)
Practical implication: GANs can learn from disjoint supports, whereas MLE-based methods struggle
Limitation: JS divergence is symmetric but can be unintuitive for comparing distributions with different supports

5.3 No Explicit Likelihood (Important Distinction)

GANs learn an implicit density model:

Cannot compute: $p(x) = ?$ for any sample $x$

Why this matters:

No direct likelihood-based evaluation
Cannot use log-likelihood as a stopping criterion
Cannot compute perplexity or other traditional metrics
Prevents use of maximum likelihood estimation for training

Advantage over explicit models:

Not constrained to tractable probability distributions
Can model complex, high-dimensional distributions
Freedom in architecture design

Disadvantage:

Harder to evaluate (requires proxy metrics like FID)
Cannot estimate sample importance or likelihood

6. Key Challenges and Failure Modes

6.1 Mode Collapse

Problem: The generator learns to produce only a limited subset of the data distribution, ignoring modes (clusters).

True Distribution          Generator Output
┌─────────┬─────────┐      ┌─────────┐
│  Mode A │  Mode B │      │ Mode A  │
│ (many)  │ (many)  │      │ (many)  │
└─────────┴─────────┘      └─────────┘
                                   Mode B completely missing!

Why it happens:

Generator finds one mode and exploits it fully
Discriminator learns to reject samples from that mode
Generator switches to another mode rather than sampling all modes
Lack of diversity in loss signal

Solutions:

Minibatch discrimination: Discriminator compares minibatches, not individual samples
Unrolled GANs: Discriminator sees multiple generator update steps ahead
Spectral normalization: Stabilizes training
Wasserstein GANs: Use Wasserstein distance instead of JS divergence

6.2 Vanishing Gradients

Problem: When discriminator becomes too good, generator receives near-zero gradients, making learning stall.

Mathematical explanation: If $D(G(z)) \approx 0$ everywhere:

$\log(1 - D(G(z))) \approx 0$ and nearly constant
$\nabla_G \log(1 - D(G(z))) \approx 0$
No useful signal to improve generator

Visual intuition:

D Score
  1.0 │      Generator samples here
      │      ↓
      │    ┌──────┐
      │    │ FLAT │  ← Gradient nearly zero!
      │    │ (D=0)│
  0.0 │────┴──────┴────
      └──────────────────
      Real            Fake

Solutions:

Non-saturating loss: $-\log D(G(z))$ instead of $\log(1-D(G(z)))$
Feature matching: Match discriminator's hidden layer statistics
Wasserstein GAN: Use continuous loss without log

6.3 Training Instability

Problem: Generator and discriminator losses oscillate wildly; training doesn't converge smoothly.

Causes:

Simultaneous alternating updates create feedback loops
Imbalanced learning rates between networks
Networks with mismatched capacity

Indicators:

Loss curve: ════════════════════════════════
            (chaotic, oscillating wildly)

vs. Stable: ═════╲╲╲════════════════════════
            (monotonic decrease with plateaus)

Mitigation strategies:

Keep discriminator stronger than generator (don't train D to convergence)
Use gradient penalties
Batch normalization in both networks
Careful hyperparameter tuning (learning rates, batch sizes)

6.4 Convergence Proof Issues

Reality check: The theoretical convergence guarantees of vanilla GANs are:

Proven only under unrealistic assumptions (infinite capacity networks, continuous training)
Not guaranteed in practice with finite-capacity neural networks
No practical stopping criterion or convergence verification method

7. Variants and Improvements

7.1 DCGAN (Deep Convolutional GAN)

Historical significance: First successful CNN-based GAN (Radford et al., 2015); breakthrough enabling large-scale image generation

Architectural Guidelines (critical for stable training):

Component	Rule	Reason
Generator Pooling	Replace with strided transpose convolutions	Better gradient flow
Generator Norm	Batch Norm on all layers except final	Reduces internal covariate shift
Generator Activation	ReLU hidden layers, tanh output	Non-linearity + bounded output
Discriminator Pooling	Replace with strided convolutions	Gradient stability
Discriminator Norm	Batch Norm all layers except first	Prevents instability
Discriminator Activation	LeakyReLU (α ≈ 0.2)	Avoids dead neurons

Typical Architecture:

GENERATOR (input: z ~ N(0,I), shape: 100-dimensional):
  z → Reshape to (512, 1, 1)
  → ConvTranspose2d(512→256, k=4) + BN + ReLU
  → ConvTranspose2d(256→128, k=4) + BN + ReLU
  → ConvTranspose2d(128→64, k=4) + BN + ReLU
  → ConvTranspose2d(64→3, k=4) + Tanh
  Output: RGB image (3, 64, 64)

DISCRIMINATOR (input: image):
  Image (3, 64, 64)
  → Conv2d(3→64, k=4, stride=2) + LeakyReLU
  → Conv2d(64→128, k=4, stride=2) + BN + LeakyReLU
  → Conv2d(128→256, k=4, stride=2) + BN + LeakyReLU
  → Conv2d(256→512, k=4, stride=2) + BN + LeakyReLU
  → Flatten → Dense(1) + Sigmoid
  Output: Probability real (0-1)

Why these rules work:

Strided convolutions prevent checkerboard artifacts from naïve transpose convolutions
Batch normalization accelerates convergence and provides regularization
LeakyReLU prevents vanishing gradients in discriminator
tanh output on generator ensures output range [-1, 1] (matches normalized image data)

7.2 Wasserstein GAN (WGAN)

Key innovation: Replace JS divergence with Wasserstein distance (earth-mover distance)

$$W(p_{data}, p_G) = \inf_{\gamma \in \Pi} \mathbb{E}_{(x,y) \sim \gamma}[|x-y|]$$

Benefits:

Provides meaningful gradients even with non-overlapping supports
More stable training
Real-valued loss correlates with sample quality

Formula: $$\min_G \max_{D: |D|L \leq 1} \mathbb{E}{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$$

where the discriminator is Lipschitz-constrained.

7.3 Conditional GAN (cGAN)

Motivation: Control what the generator produces by conditioning on labels

$$\min_G \max_D \mathbb{E}_{x \sim p_{data}, y}[\log D(x|y)] + \mathbb{E}_{z, y}[\log(1-D(G(z|y)))]$$

Architecture:

Label y ────┐
            ├→ Concatenate → Generator → Fake image
Noise z ────┤
            └→ Discriminator → Real or Fake?
                  (also receives label y)

Applications: Image generation with labels, text-to-image, class-conditional generation

7.4 Progressive GAN

Idea: Train from low to high resolution, gradually adding layers

Benefits:

Stabilizes training
Enables high-resolution image generation
Focuses on coarse details first, then fine details

Layer progression:

Resolution:  4×4  →  8×8  →  16×16  →  32×32  →  64×64  →  256×256  →  1024×1024
Training:    ═══════════════════════════════════════════════════════════════════

7.5 StyleGAN

Innovation: Decomposes generation into coarse (style) and fine (content) control

Architecture:

Latent Code z (random)
           ↓
    Mapping Network (8 layers) → w (style vector)
           ↓
  Adaptive Instance Normalization (AdaIN)
           ↓
  Synthesis Network with Progressive Growth
           ↓
    Generated Image

Advantages:

Disentangled representation (change style vs. content independently)
Better control over generation
State-of-the-art image quality

8. Training Stability Techniques

8.1 Spectral Normalization

Idea: Normalize weight matrices to have spectral norm (largest singular value) of 1

$$W_{SN} = \frac{W}{\sigma(W)}$$

where $\sigma(W) = \max_h |Wh|/|h|$ (largest singular value)

Effect: Limits Lipschitz constant of discriminator, stabilizing gradients

8.2 Gradient Penalty (WGAN-GP)

Rather than weight clipping, add a regularization term:

$$L = L_{original} + \lambda \mathbb{E}_{x}[(|\nabla_x D(x)|_2 - 1)^2]$$

Enforces gradient norm = 1, ensuring Lipschitz constraint

8.3 Batch Normalization and Layer Normalization

Batch Norm: Normalize activations across batch dimension

Reduces internal covariate shift
Stabilizes discriminator learning
Caveat: Can cause training oscillations if overused in generator

Better alternatives:

Layer Normalization: Normalize per-sample
Instance Normalization: Useful for style transfer GANs
Group Normalization: Hybrid approach

8.4 One-Sided Label Smoothing

Instead of:

Real labels: 1
Fake labels: 0

Use:

Real labels: 0.9
Fake labels: 0.0

Prevents discriminator from becoming overconfident, improving gradient flow

9. Loss Functions and Objectives

9.1 Standard GAN Loss

$$L_D = -\mathbb{E}_x[\log D(x)] - \mathbb{E}_z[\log(1-D(G(z)))]$$ $$L_G = -\mathbb{E}_z[\log D(G(z))]$$

9.2 Non-Saturating GAN Loss

$$L_D = -\mathbb{E}_x[\log D(x)] - \mathbb{E}_z[\log(1-D(G(z)))]$$ $$L_G = \mathbb{E}_z[\log(1-D(G(z)))]$$ (Generator objective flipped)

9.3 Wasserstein Loss

$$L_D = \mathbb{E}_z[D(G(z))] - \mathbb{E}_x[D(x)]$$ $$L_G = -\mathbb{E}_z[D(G(z))]$$

9.4 Hinge Loss (Discriminator perspective)

$$L_D = \mathbb{E}_x[\max(0, 1-D(x))] + \mathbb{E}_z[\max(0, 1+D(G(z)))]$$

Advantage: Robust margins, more stable training

9.5 Least Squares GAN (LSGAN)

$$L_D = \frac{1}{2}\mathbb{E}_x[(D(x)-1)^2] + \frac{1}{2}\mathbb{E}_z[D(G(z))^2]$$ $$L_G = \frac{1}{2}\mathbb{E}_z[(D(G(z))-1)^2]$$

Property: Penalizes far off predictions more, leading to faster convergence

10. Evaluation Metrics

10.1 Inception Score (IS)

Formula: $$IS = \exp(\mathbb{E}_x[KL(p(y|x) \parallel p(y))])$$

where $y$ is class label from pre-trained Inception network

Interpretation:

Measures class confidence: $p(y|x)$ should be high for specific classes
Measures diversity: marginal $p(y)$ should be uniform
Higher IS is better
Limitation: Doesn't directly measure similarity to real data

10.2 Fréchet Inception Distance (FID)

Formula: $$FID = |\mu_r - \mu_g|_2^2 + Tr(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$$

where $\mu, \Sigma$ are mean and covariance of activations

Advantages:

Measures both quality (first term: distance between means) and diversity (second term: covariance difference)
More robust to classifier overfitting than IS
Better correlation with human judgment
Gold standard metric in practice

10.3 Kernel Inception Distance (KID)

Computationally cheaper alternative to FID, based on maximum mean discrepancy

10.4 Precision and Recall for Generated Samples

Precision: Fraction of generated samples that are "realistic" Recall: Fraction of real data modes covered by generator

Trade-off: High precision (mode collapse) vs. high recall (blurry samples)

10.5 Why GAN Loss Does Not Correlate with Sample Quality

Important insight: GAN loss values oscillate and do not monotonically decrease, even when sample quality improves.

Why this happens:

Discriminator loss and generator loss are not aligned with sample quality
As generator improves, discriminator must adapt → loss increases
As discriminator improves, generator loss increases → generator updates
Simultaneous updates create feedback oscillations
Nash equilibrium involves loss oscillation, not convergence

Example:

Iteration  | Gen Loss | Dis Loss | Sample Quality | Notes
───────────┼──────────┼──────────┼────────────────┼──────────────────
   100     |  0.50    |  0.60    | Blurry         |
   200     |  0.48    |  0.62    | Still blurry   | Loss decreased but quality unchanged
   300     |  0.55    |  0.58    | Better         | Loss increased but quality improved!
   400     |  0.52    |  0.63    | Even better    | Oscillating loss, improving quality
   500     |  0.51    |  0.59    | Great          | Loss oscillates, quality monotonically improves

Lesson: Do NOT use loss curves to assess GAN training progress. Always use external metrics (FID, IS, visual inspection).

11. Comparison: GANs vs VAEs vs Autoencoders

Aspect	GAN	VAE	Autoencoder
Likelihood	Implicit (cannot compute)	Explicit (tractable)	None
Sharp Images	Excellent (high quality)	Blurry (averaging)	Moderate (lossy)
Stable Training	❌ (Notoriously unstable)	✅ (ELBO-based, stable)	✅ (Simple loss)
Generative Capability	Generates new samples	Generates new samples	Mainly reconstruction
Mode Coverage	Poor (mode collapse)	Good (approximates full distribution)	Poor (memorization)
Interpretability	Black box	Latent space interpretable	Latent features interpretable
Training Objective	Minimax game (no guarantee)	Maximize ELBO (guaranteed convergence)	MSE/Reconstruction loss
Evaluation Metrics	FID, IS (proxy metrics)	Negative ELBO (exact but loose)	Reconstruction error
Training Speed	Slow (adversarial iterations)	Moderate	Fast (single pass)
Memory Usage	High (two networks)	Moderate (one network)	Moderate (one network)
Best For	Photorealistic image generation	Data exploration, interpolation	Denoising, compression
Worst At	High-dimensional data diversity	Visual quality	Any generative task

Key takeaway:

GAN: Sharp samples, hard to train
VAE: Stable, blurry samples, good latent space
AE: Simple baseline, mainly for reconstruction

12. Practical Applications

12.1 Image Generation

Text-to-Image: AttnGAN, StackGAN

Text: "A red bird with black wings"
            ↓
   [Text Encoder + GAN]
            ↓
    [Generated Bird Image]

Image-to-Image Translation: Pix2Pix, CycleGAN

Input: Sketch    →    GAN    →    Output: Photorealistic Image

12.2 Super-Resolution

SRGAN: Super-resolution using generative adversarial networks

Generator learns to upscale low-res images with fine details
Discriminator ensures output looks natural
Perceptual loss + adversarial loss

12.3 Face Generation and Manipulation

StyleGAN: Generate photo-realistic faces StarGAN: Facial attribute editing

Input Face + Attribute Label → GAN → Modified Face
(e.g., add beard, change hair color, age face)

12.4 Domain Adaptation

Unsupervised Pixel-Level Domain Adaptation:

PixelDA: Adversarially adapt synthetic game images to real
Use adversarial loss to match source and target domain feature distributions

12.5 Data Augmentation

Generate synthetic training data to:

Handle imbalanced datasets
Increase training set size
Improve model robustness

13. Relation to Curse of Dimensionality

13.1 The Problem

The curse of dimensionality states that as data dimensionality increases:

Volume grows exponentially
Data points become sparse
Probability mass concentrates in thin shells
Distances between points become nearly equal

Naive assumption: "GANs overcome the curse of dimensionality"

Reality: GANs do not eliminate the curse; they exploit data structure.

13.2 Why GANs Seem to Work Despite the Curse

GANs succeed because real-world data lies on low-dimensional manifolds:

High-Dimensional Space (e.g., 256×256 = 65K dims)
│
├─ Curse of Dimensionality (empty space)
│
└─ LOW-DIMENSIONAL MANIFOLD (actual data)
   └─ Natural images, faces, text, etc.
      ~100-1000 effective dimensions

Generator learns to map: $$\text{Simple distribution} \rightarrow \text{Low-dimensional data manifold}$$

Not the entire high-dimensional space.

13.3 When GANs Fail Due to Dimensionality

Conditions for failure:

Data without clear structure (e.g., random images)
Truly high-dimensional uniform distributions
Insufficient training data relative to dimensionality

Example: Generate random noise images

Generator finds it easier to output gray noise
Discriminator cannot distinguish meaningful structure
Mode collapse to "average noise"

Conclusion: GANs work because real data is structured, not because high-dimensional modeling is easy.

14. Common Misconceptions and Edge Cases

14.1 Misconception: "GANs minimize reconstruction error"

Wrong: GANs don't minimize pixel-level reconstruction loss Correct: GANs minimize distribution divergence (JS divergence, Wasserstein distance, etc.)

Consequence:

Generated samples are not "closest to training data"
Generated samples can be novel and diverse
Not constrained by reconstruction error bottleneck

14.2 Misconception: "GAN loss correlates with sample quality"

Wrong: Loss values that decrease = improving samples Correct: Loss oscillates; convergence is game-theoretic equilibrium

Evidence:

Loss can increase while visual quality improves
Stable, low loss can indicate mode collapse
No monotonic relationship exists

14.3 Misconception: "GANs always outperform VAEs"

Wrong: GANs are universally better Correct: Task-dependent trade-offs

Task	Better	Reason
Photo-realistic images	GAN	Sharp, high-quality
Data exploration	VAE	Interpretable latent space
Stable training	VAE	No adversarial instability
High-res generation	GAN	Can achieve 1024×1024+
Inference speed	VAE	Direct forward pass

14.4 Misconception: "Generator learns the true distribution"

Reality:

Generator learns an implicit approximation of the distribution
May only capture dominant modes
Doesn't explicitly model $p(x)$—can't compute likelihoods
Different from explicit density models

Edge case: For simple 1D distributions, generator often learns a unimodal approximation even when real data is multimodal

14.5 Misconception: "GAN must be min-max at each step"

Reality:

Practitioners often don't reach true Nash equilibrium
May update discriminator $k$ times per generator update (practical hyperparameter)
Early stopping used instead of waiting for convergence
Training is more art than science

14.6 Edge Case: Imbalanced Datasets

Problem: Discriminator can achieve high accuracy by memorizing rare samples

Solution:

Data augmentation
Weighted sampling
Focal loss variants

14.7 Edge Case: High-Dimensional Data

Problem: As dimensionality increases, probability mass concentrates in thin shells (curse of dimensionality)

Implication: Measuring convergence becomes harder; discriminator may be unable to distinguish distributions

Mitigation: Progressive training, better initialization, higher capacity networks

15. Mathematical Deeper Dives

15.1 Why GAN as Game Matters: Nash Equilibrium

The GAN minimax formulation is a sequential game where convergence is understood through game theory, not optimization.

Nash Equilibrium: A state where neither player benefits from unilaterally changing strategy

$$\text{For Nash Eq: } \exists (G^{_}, D^{_}) \text{ such that:}$$ $$G^{_} = \arg\min_G V(D^{_}, G)$$ $$D^{_} = \arg\max_D V(D^{_}, G)$$

Key difference from optimization:

Equilibrium ≠ Global minimum
Both players can have positive loss at equilibrium
Training may never reach equilibrium in practice

15.2 Optimal Discriminator Proof

For fixed generator $G$, the optimal discriminator is:

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

Proof: The discriminator objective for fixed $G$ is: $$V(D) = \int_x p_{data}(x) \log D(x) + p_G(x) \log(1-D(x)) , dx$$

For each $x$, maximize: $$f(D) = p_{data} \log D + p_G \log(1-D)$$

Taking derivative: $$\frac{df}{dD} = \frac{p_{data}}{D} - \frac{p_G}{1-D} = 0$$

Solving: $D = \frac{p_{data}}{p_{data} + p_G}$

15.3 Global Optimality

At optimal $D^{*}$ and optimal $G$: $$\min_G \max_D V(D, G) = -\log 4 + 2 \cdot JSD(p_{data} \parallel p_G)$$

Minimum value $-\log 4$ achieved when $p_G = p_{data}$ (divergence = 0)

16. Summary: Key Takeaways

Aspect	Insight
Core Idea	Two networks in adversarial competition: one generates, one distinguishes
Mathematical Goal	Minimize Jensen-Shannon divergence (implicitly)
Main Advantage	No explicit density modeling required; sharp sample generation
Main Challenge	Training instability, mode collapse, vanishing gradients
Practical Success	Requires careful tuning; many architectural innovations needed
Evaluation	Use FID score (industry standard), not just Inception Score
Current State	Foundation for numerous applications; variations (WGAN, StyleGAN) address original issues

17. References and Further Reading

Goodfellow et al. (2014): Original GAN paper establishing framework
Wasserstein GAN (Arjovsky et al., 2017): Improved training stability
Progressive GAN (Karras et al., 2018): High-resolution image synthesis
StyleGAN (Karras et al., 2019): State-of-the-art generation with disentanglement
Spectral Normalization (Miyato et al., 2018): Stabilization technique
FID Score (Heusel et al., 2017): Primary evaluation metric

18. Appendix: Quick Reference—Common GAN Hyperparameters

Hyperparameter	Typical Range	Notes
Learning Rate (G)	0.0001 - 0.0002	Usually lower than discriminator
Learning Rate (D)	0.0002 - 0.0004	Higher LR for faster discrimination
Batch Size	32 - 256	Larger batches stabilize training
Latent Dimension	64 - 512	Higher → more model capacity
D Updates per G Update	1 - 5	Keep D slightly stronger
Gradient Penalty λ	10 - 100	For WGAN-GP
Label Smoothing	0.1 - 0.9	One-sided (real: 0.9, fake: 0.0)

FilesExpand file tree

W14_2_GANs.md

Latest commit

History

W14_2_GANs.md

File metadata and controls

Generative Adversarial Networks (GANs)

1. Motivation and Historical Context

1.1 The Problem Before GANs

1.2 Historical Context: Why GANs Were Revolutionary

2. Core Intuition: The Game-Theoretic Framework

2.1 The Prisoner's Dilemma Analogy

2.2 The Architecture

2.3 Latent Space and Noise Vector

3. Mathematical Formulation

3.1 The Min-Max Game

3.2 Interpretation of the Objective

3.3 Alternative Generator Objective (Non-saturating)

4. The Training Algorithm

4.1 Alternating Gradient Descent

4.2 Convergence Claim

4.3 GANs as Games, Not Optimization Problems

5. Why GANs Work: Information-Theoretic Perspective

5.1 Connection to Divergence Measures

5.2 Why This Matters

5.3 No Explicit Likelihood (Important Distinction)

6. Key Challenges and Failure Modes

6.1 Mode Collapse

6.2 Vanishing Gradients

6.3 Training Instability

6.4 Convergence Proof Issues

7. Variants and Improvements

7.1 DCGAN (Deep Convolutional GAN)

7.2 Wasserstein GAN (WGAN)

7.3 Conditional GAN (cGAN)

7.4 Progressive GAN

7.5 StyleGAN

8. Training Stability Techniques

8.1 Spectral Normalization

8.2 Gradient Penalty (WGAN-GP)

8.3 Batch Normalization and Layer Normalization

8.4 One-Sided Label Smoothing

9. Loss Functions and Objectives

9.1 Standard GAN Loss

9.2 Non-Saturating GAN Loss

9.3 Wasserstein Loss

9.4 Hinge Loss (Discriminator perspective)

9.5 Least Squares GAN (LSGAN)

10. Evaluation Metrics

10.1 Inception Score (IS)

10.2 Fréchet Inception Distance (FID)

10.3 Kernel Inception Distance (KID)

10.4 Precision and Recall for Generated Samples

10.5 Why GAN Loss Does Not Correlate with Sample Quality

11. Comparison: GANs vs VAEs vs Autoencoders

12. Practical Applications

12.1 Image Generation

12.2 Super-Resolution

12.3 Face Generation and Manipulation

12.4 Domain Adaptation

12.5 Data Augmentation

13. Relation to Curse of Dimensionality

13.1 The Problem

13.2 Why GANs Seem to Work Despite the Curse

13.3 When GANs Fail Due to Dimensionality

14. Common Misconceptions and Edge Cases

14.1 Misconception: "GANs minimize reconstruction error"

14.2 Misconception: "GAN loss correlates with sample quality"

14.3 Misconception: "GANs always outperform VAEs"

14.4 Misconception: "Generator learns the true distribution"

14.5 Misconception: "GAN must be min-max at each step"

14.6 Edge Case: Imbalanced Datasets

14.7 Edge Case: High-Dimensional Data

15. Mathematical Deeper Dives

15.1 Why GAN as Game Matters: Nash Equilibrium

15.2 Optimal Discriminator Proof

15.3 Global Optimality

16. Summary: Key Takeaways

17. References and Further Reading

18. Appendix: Quick Reference—Common GAN Hyperparameters