A Convolutional Neural Network (CNN) is a deep learning architecture designed for grid-structured data like images. CNNs use convolutional layers that apply learnable filters across the input, exploiting spatial locality and translation invariance — enabling them to efficiently detect features (edges, textures, shapes, objects) regardless of where they appear in the image.
The convolution operation
A convolutional layer applies a small learned filter (kernel), typically 3×3 or 5×5 pixels, by sliding it across the entire input image. At each position it computes a dot product between the filter weights and the local patch:

(I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

2D convolution: the filter K slides over image I. A 3×3 filter has only 9 learnable parameters yet is applied at every spatial position, far fewer than a fully-connected layer would need.
A layer with 64 filters produces 64 feature maps — each detecting a different learned pattern (edges, curves, textures). The critical insight is parameter sharing: the same 9 weights are reused at every position, giving CNNs their extraordinary parameter efficiency.
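A minimal PyTorch sketch of this parameter efficiency (the 64-filter layer and the 32×32 RGB input are illustrative choices, not from any specific model):

```python
import torch
import torch.nn as nn

# A conv layer with 64 filters of size 3x3 over a 3-channel (RGB) input.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Each filter has 3*3*3 = 27 weights plus 1 bias; 64 filters total.
conv_params = sum(p.numel() for p in conv.parameters())
print(conv_params)  # 64 * (3*3*3 + 1) = 1792

# A fully-connected layer mapping a 32x32 RGB image to 64 outputs:
# no weight sharing, so every pixel gets its own weight.
fc = nn.Linear(3 * 32 * 32, 64)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)  # 64 * 3072 + 64 = 196672

# The conv layer also produces 64 feature maps at every spatial position.
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32])
```

Roughly a 100× parameter gap, and the conv layer's cost is independent of image size.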
Why convolution works for images
Images have two structural properties CNNs exploit: (1) Locality — nearby pixels are more related than distant ones, so a 3×3 filter captures local structure efficiently. (2) Translation invariance — a cat is a cat whether in the top-left or bottom-right of the image. Shared filter weights encode the same detector everywhere.
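A toy sketch of the shared-weights idea: the hand-set 3×3 "spot detector" below is a made-up filter, but it shows the same 9 weights firing wherever the pattern appears:

```python
import torch
import torch.nn as nn

# Hand-set a tiny "spot detector" filter; the same 9 shared weights
# are applied at every position, so it fires wherever the spot is.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv.weight.data = torch.tensor([[0., 1., 0.],
                                 [1., 2., 1.],
                                 [0., 1., 0.]]).view(1, 1, 3, 3)

def strongest_response(img):
    out = conv(img)[0, 0]
    return divmod(out.argmax().item(), out.shape[1])  # (row, col)

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 2, 3] = 1.0
print(strongest_response(img))   # (2, 3)

img2 = torch.zeros(1, 1, 16, 16)
img2[0, 0, 12, 9] = 1.0
print(strongest_response(img2))  # (12, 9) -- same weights, new location
```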
Pooling and receptive fields
Pooling layers reduce spatial dimensions, making representations smaller and approximately translation-invariant:
| Operation | Formula | Effect | Use case |
|---|---|---|---|
| Max pooling 2×2 | max(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) | Keeps strongest activation, discards exact position | Standard in CNNs — preserves sharp features |
| Average pooling | mean of region | Smoother, dilutes strong signals | Global average pooling before classifier head |
| Strided convolution | Conv with stride=2 | Learns to downsample (preferred in modern CNNs) | ResNet, EfficientNet — replaces explicit pooling |
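The max-pooling and strided-convolution rows above can be checked with a small PyTorch sketch (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # batch of one 64-channel feature map

# Max pooling 2x2: fixed operation, halves spatial size, no parameters.
pool = nn.MaxPool2d(kernel_size=2)
print(pool(x).shape)  # torch.Size([1, 64, 16, 16])
print(sum(p.numel() for p in pool.parameters()))  # 0 -- nothing to learn

# Strided convolution: learned downsampling, also halves spatial size.
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
print(sum(p.numel() for p in down.parameters()))  # 64*64*9 + 64 = 36928
```

Same output shape either way; the strided conv spends parameters to learn how to combine neighbouring activations rather than just keeping the max.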
The receptive field is the region of the original input that influences a single neuron. A first-layer 3×3 conv sees 3×3 pixels; stacking conv and pool layers progressively expands the receptive field, and each pooling or stride-2 stage doubles how quickly it grows, so a neuron several blocks deep can cover a region on the order of 100×100 pixels. Early layers therefore detect local edges; later layers detect global objects.
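Under the usual assumptions (stride-1 3×3 convs, 2×2 pools), the growth can be tracked with a small helper (a sketch, not a library function):

```python
# Receptive-field growth: each layer adds (kernel_size - 1) pixels,
# scaled by the product of all strides below it ("jump"), so stacked
# stride-1 3x3 convs give RF = 2n + 1, while pooling compounds growth.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))        # 3   one 3x3 conv
print(receptive_field([(3, 1)] * 2))    # 5   two stacked 3x3 convs
print(receptive_field([(3, 1)] * 5))    # 11  five layers: 2*5 + 1
# Four conv+pool blocks: strides compound, RF grows much faster.
print(receptive_field([(3, 1), (2, 2)] * 4))  # 46
```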
Landmark CNN architectures
| Architecture | Year | Key innovation | Depth | ImageNet top-5 err |
|---|---|---|---|---|
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 layers | 15.3% |
| VGGNet | 2014 | Depth with uniform 3×3 filters | 16–19 layers | 7.3% |
| GoogLeNet / Inception | 2014 | Inception modules — parallel multi-scale filters | 22 layers | 6.7% |
| ResNet | 2015 | Residual (skip) connections — solved vanishing gradients | 50–152 layers | 3.57% |
| EfficientNet | 2019 | Compound scaling of width, depth, resolution | B0–B7 | 2.9% |
| ConvNeXt | 2022 | Modernized ResNet with Transformer design choices | ~200M params | Competitive with ViT |
ResNet skip connection insight
Rather than learning H(x), a residual block learns F(x) = H(x) − x, then outputs x + F(x). If the optimal transformation is close to identity, F just needs to be near zero — much easier to optimize. This simple change enabled training 152-layer networks that previously couldn't converge.
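A minimal sketch of such a block in PyTorch, simplified by omitting batch norm and keeping channel counts equal so the skip needs no projection:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x) = H(x) - x
        return self.relu(x + f)                   # H(x) = x + F(x)

block = ResidualBlock(64)
x = torch.randn(1, 64, 16, 16)
print(block(x).shape)  # torch.Size([1, 64, 16, 16])

# With all weights at zero, F(x) = 0 and the block reduces to
# relu(x): the near-identity default that is easy to optimize.
for p in block.parameters():
    nn.init.zeros_(p)
assert torch.equal(block(x), torch.relu(x))
```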
Transfer learning with CNNs
The most practical way to use CNNs — and why you almost never need to train from scratch:
Fine-tuning ResNet-50 on a custom image classification task
```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-50 pretrained on ImageNet (1.2M images, 1000 classes)
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: freeze all layers, train only the final head
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for your number of classes.
# The new Linear layer's parameters have requires_grad=True by
# default, so only model.fc is trained.
num_classes = 5  # e.g., flower species
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Strategy 2: also unfreeze the last residual block (layer4)
# for deeper fine-tuning
for param in model.layer4.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-4},  # low LR for pretrained weights
    {"params": model.fc.parameters(), "lr": 1e-3},      # higher LR for the new head
])
```

Transfer works across surprising domains
ImageNet-pretrained CNNs transfer well to medical imaging, satellite imagery, and industrial inspection — domains with completely different content. Early layers learn universal edge/texture detectors that are useful everywhere. Only fine-tune the last 1–2 blocks unless your domain is very different from natural images.
CNNs vs Vision Transformers (ViTs)
Vision Transformers (ViT, Dosovitskiy et al., 2020) divide the image into 16×16 patches, treat each as a token, then process with self-attention. The comparison:
| Dimension | CNN (ResNet/EfficientNet) | Vision Transformer (ViT/CLIP) |
|---|---|---|
| Inductive bias | Strong: locality + translation invariance | Weak: must learn spatial structure from data |
| Data hunger | Works well with 10K–100K images | Needs 1M+ images (or large-scale pretraining) |
| Compute | O(HW) — linear in image pixels | O((HW/p²)²) — quadratic in patch count |
| Scale ceiling | Saturates around 1B params | Keeps improving with more data + compute |
| Best use (2025) | Edge/mobile, small datasets, real-time | Foundation models (CLIP, SAM, DINOv2), large-scale tasks |
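The patch-count arithmetic behind the compute row can be checked directly (standard ViT-B/16 numbers assumed):

```python
# Token count and attention cost for a ViT on a 224x224 image
# with 16x16 patches (the standard ViT-B/16 configuration).
H = W = 224
p = 16
tokens = (H // p) * (W // p)
print(tokens)       # 196 patches become 196 tokens
print(tokens ** 2)  # 38416 pairwise attention scores per head

# Doubling resolution quadruples the tokens -> 16x the attention cost:
print(((448 // p) ** 2) ** 2)  # 614656
# A conv layer's cost just scales with pixel count: 4x, not 16x.
```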
2025 practical guidance
For new projects: use a pretrained ViT (DINOv2, CLIP) if you have GPU budget and large data. Use EfficientNet or ConvNeXt if you need lower latency, mobile deployment, or have limited data. Hybrid models (ConvFormer, CvT) combine both — useful middle ground.
Practice questions
- What is the receptive field of a CNN layer and why does it matter? (Answer: The receptive field is the region of the input image that contributes to one output neuron's activation. A 3×3 conv layer: each output pixel sees 3×3 input pixels. Two stacked 3×3 layers: each output pixel sees 5×5 input pixels. N layers of 3×3 convolutions: receptive field = (2N+1)×(2N+1). Deep CNNs with small filters achieve large effective receptive fields (global context) while using fewer parameters than single large filters. For image classification, the final feature maps must have receptive fields large enough to span the full input.)
- What is the difference between same padding and valid padding in a convolution? (Answer: Valid padding: no padding — output is smaller than input. Input 32×32, kernel 3×3: output is 30×30 (shrinks by kernel_size-1=2). Same padding: pad input so output has the same spatial dimensions as input. Input 32×32, kernel 3×3: pad by 1 on each side, output is 32×32. TensorFlow default: same padding. PyTorch default: valid (padding=0). Use same padding when you want to preserve spatial dimensions through many layers; use valid when spatial reduction is intentional.)
- What is the difference between regular convolution and depthwise separable convolution (used in MobileNet)? (Answer: Regular convolution: each filter operates across ALL input channels simultaneously — one filter per output channel, each with in_channels × k × k parameters. Total: out_channels × in_channels × k². Depthwise separable: (1) Depthwise: one filter per input channel (operates on each channel independently). (2) Pointwise: 1×1 convolution combines channels. Total: in_channels × k² + in_channels × out_channels. ~8–9× fewer parameters for 3×3 conv. MobileNet achieves competitive accuracy at roughly 10× fewer parameters using this factorisation.)
- What is feature map visualisation and what does it reveal about CNN learning? (Answer: Visualising activations of filters at different layers shows the hierarchy of learned representations: Layer 1: simple edges and colours (oriented Gabor-like filters). Layer 2: textures and simple shapes (combinations of edges). Layer 3–4: object parts (wheels, eyes, windows). Final layers: complete objects and scenes. This hierarchical feature learning (Zeiler & Fergus 2013) confirmed that CNNs learn semantically meaningful features automatically — without hand-crafting, as required by pre-deep-learning vision systems.)
- What is the difference between stride and pooling for spatial downsampling in CNNs? (Answer: Pooling (max/average): take the max or average over a spatial window, reduce spatial size by pool_factor. Fixed operation — no learned parameters. Max pooling: keeps the strongest activation (feature present or absent). Strided convolution: move the conv filter by stride>1, producing smaller output. Learned downsampling — the network learns how to combine spatially adjacent information. Modern CNNs (ResNet, EfficientNet) prefer strided convolutions for downsampling: they're learnable and often outperform fixed pooling. Pooling still used in attention mechanisms and some architectures.)
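The valid-vs-same padding answer above can be verified in PyTorch (layer sizes here are arbitrary; the string form `padding="same"` is available in recent PyTorch versions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# "Valid" (PyTorch default, padding=0): output shrinks by kernel_size - 1.
valid = nn.Conv2d(3, 8, kernel_size=3)  # padding defaults to 0
print(valid(x).shape)  # torch.Size([1, 8, 30, 30])

# "Same": pad by (kernel_size - 1) // 2 so spatial size is preserved.
same = nn.Conv2d(3, 8, kernel_size=3, padding=1)
print(same(x).shape)   # torch.Size([1, 8, 32, 32])

# Recent PyTorch also accepts the string form directly:
same_str = nn.Conv2d(3, 8, kernel_size=3, padding="same")
print(same_str(x).shape)  # torch.Size([1, 8, 32, 32])
```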
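The depthwise-separable parameter counts above can also be checked in PyTorch, using `groups=in_channels` for the depthwise step (the 64→128 channel sizes are chosen for illustration):

```python
import torch.nn as nn

cin, cout, k = 64, 128, 3

# Regular 3x3 convolution: every filter spans all input channels.
regular = nn.Conv2d(cin, cout, k, padding=1, bias=False)
print(sum(p.numel() for p in regular.parameters()))  # 128*64*9 = 73728

# Depthwise separable = depthwise (groups=cin) + pointwise 1x1.
depthwise = nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False)
pointwise = nn.Conv2d(cin, cout, kernel_size=1, bias=False)
sep = (sum(p.numel() for p in depthwise.parameters())
       + sum(p.numel() for p in pointwise.parameters()))
print(sep)           # 64*9 + 64*128 = 8768
print(73728 / sep)   # ~8.4x fewer parameters, matching the answer above
```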