
Generative Adversarial Network (GAN)

Two neural networks in competition — creating synthetic reality.


Definition

A Generative Adversarial Network (GAN) is a generative model consisting of two neural networks trained in opposition: a Generator that creates synthetic data samples, and a Discriminator that distinguishes real from generated data. Through this adversarial game, the Generator learns to produce increasingly realistic outputs. GANs produced the first photorealistic AI-generated faces and drove the early generative AI revolution.

The adversarial training game

A GAN pits two networks against each other. The Generator G maps random noise z to fake data. The Discriminator D tries to tell real from fake. They play a minimax game:

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 − D(G(z)))]

This is the original GAN objective (Goodfellow et al., 2014): D maximizes its ability to detect fakes; G minimizes D's success. At the Nash equilibrium, G produces samples indistinguishable from real data and D outputs 0.5 everywhere.
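In code, the two sides of this objective reduce to binary cross-entropy terms. A minimal PyTorch sketch, assuming `d_real` and `d_fake` are discriminator outputs in (0, 1) for real and generated batches:

```python
import torch
import torch.nn.functional as F

def d_loss(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimizes BCE with labels 1 (real) and 0 (fake).
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def g_loss_minimax(d_fake):
    # G minimizes E[log(1 - D(G(z)))] -- the literal minimax objective.
    return torch.log(1.0 - d_fake).mean()
```

With a well-calibrated D (`d_real` near 1, `d_fake` near 0), `d_loss` is small and the generator loss is strongly negative only once G starts fooling D.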

Nash equilibrium in practice

The theoretical optimum is never actually reached — GAN training is notoriously unstable because D and G need to improve together at the right rate. If D becomes too strong too fast, gradients to G vanish (it can't learn). If G improves too fast, D can't keep up. Careful architecture design and loss function choice mitigate this.
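A standard fix from the original paper for the vanishing-gradient side of this problem is the non-saturating generator loss: G maximizes log D(G(z)) rather than minimizing log(1 − D(G(z))). A sketch, assuming `d_fake` holds D's probabilities for generated samples:

```python
import torch

def g_loss_nonsaturating(d_fake):
    # Maximize log D(G(z)) -> minimize -log D(G(z)).
    # Near d_fake ~ 0 (a confident discriminator), the gradient of
    # -log(d) scales like 1/d (large), while log(1 - d) has gradient
    # magnitude ~1/(1 - d) ~ 1 (small) -- so G keeps learning.
    return -torch.log(d_fake + 1e-8).mean()
```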

GAN loss functions and training stability

The original GAN loss suffers from vanishing gradients when the discriminator becomes too confident. Wasserstein GAN (WGAN) replaced it with the Earth Mover (Wasserstein-1) distance:

min_G max_{D : 1-Lipschitz} E_{x ~ p_data}[D(x)] − E_{z ~ p_z}[D(G(z))]

The WGAN critic D (its output is not constrained to [0, 1]) estimates the Wasserstein-1 distance between the real and generated distributions. The critic must be 1-Lipschitz, enforced via weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
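The gradient penalty can be sketched as follows. This is an illustrative fragment rather than a full training loop; `critic` is any callable scoring network, and λ = 10 follows the WGAN-GP paper's default:

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP: push the critic's gradient norm toward 1 on random
    interpolates between real and fake samples (soft 1-Lipschitz)."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True)[0]
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

During training this term is added to the critic loss; `create_graph=True` is what lets the penalty itself be backpropagated through.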

| Variant | Key idea | Solves | Widely used |
|---|---|---|---|
| Vanilla GAN | Binary cross-entropy | Baseline | ❌ Unstable |
| WGAN | Earth Mover distance + weight clipping | Vanishing gradients | ⚠️ Clipping harms quality |
| WGAN-GP | Gradient penalty instead of clipping | Stable, meaningful loss metric | ✅ Standard baseline |
| StyleGAN 2/3 | R1 regularization + path length regularization | High-quality face synthesis | ✅ SOTA for faces |
| BigGAN | Large batch + class conditioning | High-res diverse image generation | ✅ ImageNet generation |

Mode collapse and training instability

Mode collapse — the most common GAN failure — occurs when the Generator learns to produce only a narrow subset of the real distribution (e.g., a single facial expression), because that narrow subset is enough to fool the Discriminator:

Detecting mode collapse: monitor generator output diversity

import torch
import torch.nn.functional as F

def check_mode_collapse(generator, latent_dim=128, n_samples=1000, threshold=0.85):
    """
    If generated samples have very high pairwise similarity → mode collapse.
    Real diverse data should have low average cosine similarity.
    """
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)
        fake = generator(z)          # (n_samples, C, H, W)
        # Flatten and normalize so dot products are cosine similarities
        flat = F.normalize(fake.reshape(n_samples, -1), dim=1)
        # Sample 200 random pairs for efficiency; distinct indices per pair,
        # so self-pairs (similarity 1.0) don't inflate the estimate
        idx = torch.stack([torch.randperm(n_samples)[:2] for _ in range(200)])
        sims = (flat[idx[:, 0]] * flat[idx[:, 1]]).sum(dim=1)
        avg_sim = sims.mean().item()

    print(f"Average cosine similarity: {avg_sim:.3f}")
    if avg_sim > threshold:
        print("⚠️  Possible mode collapse detected!")
    else:
        print("✅ Generator output looks diverse")
    return avg_sim

Why mode collapse happens

The Generator finds a "local minimum" — a small set of convincing fakes that the Discriminator can't yet reject. Once D adapts, G might jump to another mode rather than spreading across all modes. Minibatch discrimination (showing D multiple samples at once so it can detect lack of diversity) and spectral normalization are the most reliable mitigations.
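One concrete form of minibatch discrimination is the minibatch-stddev trick used in ProGAN/StyleGAN: append the batch-wide standard deviation as an extra feature channel, so the Discriminator can see when every sample in a batch looks alike. A simplified sketch (the papers compute group-wise statistics; the single scalar channel here is a simplification):

```python
import torch

def minibatch_stddev(x, eps=1e-8):
    """x: (N, C, H, W) discriminator features. Appends one channel
    holding the mean std-dev across the batch -- near zero when the
    generator has collapsed to a single mode."""
    std = x.std(dim=0, unbiased=False)          # (C, H, W) per-feature spread
    mean_std = (std + eps).mean()               # scalar diversity statistic
    feat = mean_std.expand(x.size(0), 1, x.size(2), x.size(3))
    return torch.cat([x, feat], dim=1)          # (N, C+1, H, W)
```

The Discriminator applies this near its final layers; a collapsed generator then gives itself away through a near-zero diversity channel.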

GAN applications

| Application | Architecture | Example |
|---|---|---|
| Photorealistic face synthesis | StyleGAN 3 | thispersondoesnotexist.com — 1024×1024 faces |
| Image-to-image translation | pix2pix (paired), CycleGAN (unpaired) | Sketch→photo, day→night, horse→zebra |
| Super-resolution | ESRGAN, Real-ESRGAN | 4× upscale with realistic textures |
| Medical image synthesis | DCGAN, StyleGAN | Generate rare pathology training data |
| Video prediction | VideoGAN, DVD-GAN | Short video sequence generation |
| Drug molecule generation | MolGAN, Graph GAN | Generate novel molecular structures with target properties |
| Data augmentation | Conditional GAN | Synthetic training data for underrepresented classes |

GANs vs Diffusion Models

| Dimension | GANs | Diffusion Models |
|---|---|---|
| Training stability | ❌ Notoriously unstable, mode collapse | ✅ Stable — standard supervised denoising loss |
| Sample diversity | ❌ Mode collapse risk | ✅ Excellent diversity |
| Sampling speed | ✅ Single forward pass (~milliseconds) | ❌ 20–1000 denoising steps |
| Text conditioning | ⚠️ Difficult — requires careful architecture | ✅ Natural via cross-attention (DALL-E 3, SD3) |
| Image quality (2025) | ✅ StyleGAN3 still top for faces | ✅ Diffusion dominates general image gen |
| Video generation | ⚠️ Limited progress | ✅ Sora, Kling, Gen-3 — all diffusion-based |
| Best use today | Real-time generation, face synthesis, low-latency | Text-to-image, editing, video, highest quality |

GANs are not dead

The adversarial training paradigm lives on in: (1) Adversarial examples — testing model robustness. (2) Adversarial training for robustness — training classifiers on adversarial examples. (3) Discriminator components in hybrid models. (4) Real-time edge applications where single-step inference is required. GAN-based face generators still produce more photorealistic identity-preserving results than diffusion for certain use cases.

Practice questions

  1. What is mode collapse in GAN training and how does it manifest? (Answer: Mode collapse: the generator learns to produce a small subset of the possible outputs that fool the discriminator — ignoring other modes of the real distribution. Example: a face GAN only generates blonde females even though training data has diverse faces. The generator found a local optimum: one type of face consistently fools the discriminator. The discriminator then over-fits to this mode, but the generator doesn't need to diversify. Mitigation: minibatch discrimination (encourage diverse outputs per batch), Wasserstein loss, spectral normalisation.)
  2. What is the Wasserstein distance (used in WGAN) and why is it more stable than JS divergence for GAN training? (Answer: Earth Mover's distance / Wasserstein-1: the minimum cost of transforming one distribution into another (minimum transport plan). Advantages over JS divergence: (1) Provides meaningful gradients even when distributions do not overlap — when generator is far from real data, JS divergence = constant log(2) but Wasserstein is proportional to distance. (2) Correlates better with sample quality — lower Wasserstein distance = better generated samples. WGAN with gradient penalty (WGAN-GP) is more stable to train than original GAN.)
  3. What is the discriminator's role during inference with a trained GAN? (Answer: The discriminator is discarded during inference. Only the generator is used: sample z ~ N(0,I), compute G(z) to generate a new sample. The discriminator served only as a training signal — an adversary that forced the generator to improve. At convergence (if achieved), the generator outputs samples indistinguishable from real data. The discriminator has no role in production image generation systems like StyleGAN or BigGAN.)
  4. How does StyleGAN control specific features (hair colour, age, facial expression) in generated faces? (Answer: StyleGAN uses Adaptive Instance Normalisation (AdaIN): a mapping network converts the latent z to a style vector w. At each resolution level, w modulates (via affine transform) the feature map mean and variance — directly controlling style at that level. Different levels control different aspects: coarse levels (4×4–8×8): pose, shape, face structure. Middle levels (16×16–32×32): facial features, hair style. Fine levels (64×64–1024×1024): colour, texture, fine details. Mixing styles from two latent codes produces faces with combined characteristics.)
  5. What is a conditional GAN (cGAN) and how does it enable class-conditioned generation? (Answer: cGAN: add class label conditioning to both generator and discriminator. Generator: G(z, c) where c is the class label (one-hot or embedding) concatenated to z or injected via FiLM conditioning. Discriminator: D(x, c) evaluates whether real/fake AND whether x matches class c. Training: generator must fool discriminator for the correct class — cannot generate a cat image and claim it's a dog. Enables controllable generation: 'generate class 42' or 'generate a cat'. BigGAN, class-conditional ImageNet generation uses large-scale cGAN.)
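The conditioning described in answer 5 can be sketched as a toy MLP generator. Layer sizes and the embedding scheme here are illustrative assumptions, not BigGAN's architecture:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal cGAN generator sketch: the class label is embedded
    and concatenated to the noise vector before the usual layers."""
    def __init__(self, latent_dim=64, n_classes=10, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized images
        )

    def forward(self, z, labels):
        c = self.embed(labels)                 # (N, n_classes)
        return self.net(torch.cat([z, c], 1)) # (N, out_dim)
```

A conditional discriminator receives the same label embedding alongside the image, so G must produce class-consistent samples to win.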
