A Convolutional Neural Network (CNN) is a deep learning architecture designed for grid-structured data like images. CNNs use convolutional layers that apply learnable filters across the input, exploiting spatial locality and translation invariance — enabling them to efficiently detect features (edges, textures, shapes, objects) regardless of where they appear in the image.
The convolution operation
A convolutional layer applies a small learned filter (kernel), typically 3×3 or 5×5 pixels, by sliding it across the entire input image. At each position it computes a dot product between the filter weights and the local patch:

(I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

2D convolution: the filter K slides over image I. A 3×3 filter has only 9 learnable parameters yet is applied at every spatial position, far fewer than a fully-connected layer would need.
A layer with 64 filters produces 64 feature maps — each detecting a different learned pattern (edges, curves, textures). The critical insight is parameter sharing: the same 9 weights are reused at every position, giving CNNs their extraordinary parameter efficiency.
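A minimal PyTorch sketch of this parameter efficiency (the 64-filter layer and the 32×32 RGB input are illustrative choices, not from any specific model):

```python
import torch
import torch.nn as nn

# A conv layer with 64 filters of size 3x3 over a 3-channel (RGB) input.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Each filter has 3*3*3 = 27 weights plus 1 bias; 64 filters total.
conv_params = sum(p.numel() for p in conv.parameters())
print(conv_params)  # 64 * (3*3*3 + 1) = 1792

# A fully-connected layer mapping a 32x32 RGB image to 64 outputs:
# no weight sharing, so every pixel gets its own weight.
fc = nn.Linear(3 * 32 * 32, 64)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)  # 64 * 3072 + 64 = 196672

# The conv layer also produces 64 feature maps at every spatial position.
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32])
```

Roughly a 100× parameter gap, and the conv layer's cost is independent of image size.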
Why convolution works for images
Images have two structural properties CNNs exploit: (1) Locality — nearby pixels are more related than distant ones, so a 3×3 filter captures local structure efficiently. (2) Translation invariance — a cat is a cat whether in the top-left or bottom-right of the image. Shared filter weights encode the same detector everywhere.
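A toy sketch of the shared-weights idea: the hand-set 3×3 "spot detector" below is a made-up filter, but it shows the same 9 weights firing wherever the pattern appears:

```python
import torch
import torch.nn as nn

# Hand-set a tiny "spot detector" filter; the same 9 shared weights
# are applied at every position, so it fires wherever the spot is.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
conv.weight.data = torch.tensor([[0., 1., 0.],
                                 [1., 2., 1.],
                                 [0., 1., 0.]]).view(1, 1, 3, 3)

def strongest_response(img):
    out = conv(img)[0, 0]
    return divmod(out.argmax().item(), out.shape[1])  # (row, col)

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 2, 3] = 1.0
print(strongest_response(img))   # (2, 3)

img2 = torch.zeros(1, 1, 16, 16)
img2[0, 0, 12, 9] = 1.0
print(strongest_response(img2))  # (12, 9) -- same weights, new location
```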
Pooling and receptive fields
Pooling layers reduce spatial dimensions, making representations smaller and approximately translation-invariant:
| Operation | Formula | Effect | Use case |
|---|---|---|---|
| Max pooling 2×2 | max(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) | Keeps strongest activation, discards exact position | Standard in CNNs — preserves sharp features |
| Average pooling | mean of region | Smoother, dilutes strong signals | Global average pooling before classifier head |
| Strided convolution | Conv with stride=2 | Learns to downsample (preferred in modern CNNs) | ResNet, EfficientNet — replaces explicit pooling |
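The max-pooling and strided-convolution rows above can be checked with a small PyTorch sketch (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # batch of one 64-channel feature map

# Max pooling 2x2: fixed operation, halves spatial size, no parameters.
pool = nn.MaxPool2d(kernel_size=2)
print(pool(x).shape)  # torch.Size([1, 64, 16, 16])
print(sum(p.numel() for p in pool.parameters()))  # 0 -- nothing to learn

# Strided convolution: learned downsampling, also halves spatial size.
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
print(sum(p.numel() for p in down.parameters()))  # 64*64*9 + 64 = 36928
```

Same output shape either way; the strided conv spends parameters to learn how to combine neighbouring activations rather than just keeping the max.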
The receptive field is the region of the original input that influences a single neuron. A first-layer 3×3 conv sees 3×3 pixels; stacking conv and pool layers progressively expands the receptive field, and each pooling or stride-2 stage doubles how quickly it grows, so a neuron several blocks deep can cover a region on the order of 100×100 pixels. Early layers therefore detect local edges; later layers detect global objects.
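Under the usual assumptions (stride-1 3×3 convs, 2×2 pools), the growth can be tracked with a small helper (a sketch, not a library function):

```python
# Receptive-field growth: each layer adds (kernel_size - 1) pixels,
# scaled by the product of all strides below it ("jump"), so stacked
# stride-1 3x3 convs give RF = 2n + 1, while pooling compounds growth.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)]))        # 3   one 3x3 conv
print(receptive_field([(3, 1)] * 2))    # 5   two stacked 3x3 convs
print(receptive_field([(3, 1)] * 5))    # 11  five layers: 2*5 + 1
# Four conv+pool blocks: strides compound, RF grows much faster.
print(receptive_field([(3, 1), (2, 2)] * 4))  # 46
```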
Landmark CNN architectures
| Architecture | Year | Key innovation | Depth | ImageNet top-5 err |
|---|---|---|---|---|
| AlexNet | 2012 | ReLU, dropout, GPU training | 8 layers | 15.3% |
| VGGNet | 2014 | Depth with uniform 3×3 filters | 16–19 layers | 7.3% |
| GoogLeNet / Inception | 2014 | Inception modules — parallel multi-scale filters | 22 layers | 6.7% |
| ResNet | 2015 | Residual (skip) connections — solved vanishing gradients | 50–152 layers | 3.57% |
| EfficientNet | 2019 | Compound scaling of width, depth, resolution | B0–B7 | 2.9% |
| ConvNeXt | 2022 | Modernized ResNet with Transformer design choices | ~200M params | Competitive with ViT |
ResNet skip connection insight
Rather than learning H(x), a residual block learns F(x) = H(x) − x, then outputs x + F(x). If the optimal transformation is close to identity, F just needs to be near zero — much easier to optimize. This simple change enabled training 152-layer networks that previously couldn't converge.
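A minimal sketch of such a block in PyTorch, simplified by omitting batch norm and keeping channel counts equal so the skip needs no projection:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x) = H(x) - x
        return self.relu(x + f)                   # H(x) = x + F(x)

block = ResidualBlock(64)
x = torch.randn(1, 64, 16, 16)
print(block(x).shape)  # torch.Size([1, 64, 16, 16])

# With all weights at zero, F(x) = 0 and the block reduces to
# relu(x): the near-identity default that is easy to optimize.
for p in block.parameters():
    nn.init.zeros_(p)
assert torch.equal(block(x), torch.relu(x))
```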
Transfer learning with CNNs
The most practical way to use CNNs — and why you almost never need to train from scratch:
Fine-tuning ResNet-50 on a custom image classification task
```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-50 pretrained on ImageNet (1.2M images, 1000 classes)
model = models.resnet50(weights="IMAGENET1K_V2")

# Strategy 1: freeze all layers, train only the final head
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for your number of classes.
# The new Linear layer's parameters have requires_grad=True by
# default, so only model.fc is trained.
num_classes = 5  # e.g., flower species
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Strategy 2: also unfreeze the last residual block (layer4)
# for deeper fine-tuning
for param in model.layer4.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-4},  # low LR for pretrained weights
    {"params": model.fc.parameters(), "lr": 1e-3},      # higher LR for the new head
])
```

Transfer works across surprising domains
ImageNet-pretrained CNNs transfer well to medical imaging, satellite imagery, and industrial inspection — domains with completely different content. Early layers learn universal edge/texture detectors that are useful everywhere. Only fine-tune the last 1–2 blocks unless your domain is very different from natural images.
CNNs vs Vision Transformers (ViTs)
Vision Transformers (ViT, Dosovitskiy et al., 2020) divide the image into 16×16 patches, treat each as a token, then process with self-attention. The comparison:
| Dimension | CNN (ResNet/EfficientNet) | Vision Transformer (ViT/CLIP) |
|---|---|---|
| Inductive bias | Strong: locality + translation invariance | Weak: must learn spatial structure from data |
| Data hunger | Works well with 10K–100K images | Needs 1M+ images (or large-scale pretraining) |
| Compute | O(HW) — linear in image pixels | O((HW/p²)²) — quadratic in patch count |
| Scale ceiling | Saturates around 1B params | Keeps improving with more data + compute |
| Best use (2025) | Edge/mobile, small datasets, real-time | Foundation models (CLIP, SAM, DINOv2), large-scale tasks |
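The patch-count arithmetic behind the compute row can be checked directly (standard ViT-B/16 numbers assumed):

```python
# Token count and attention cost for a ViT on a 224x224 image
# with 16x16 patches (the standard ViT-B/16 configuration).
H = W = 224
p = 16
tokens = (H // p) * (W // p)
print(tokens)       # 196 patches become 196 tokens
print(tokens ** 2)  # 38416 pairwise attention scores per head

# Doubling resolution quadruples the tokens -> 16x the attention cost:
print(((448 // p) ** 2) ** 2)  # 614656
# A conv layer's cost just scales with pixel count: 4x, not 16x.
```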
2025 practical guidance
For new projects: use a pretrained ViT (DINOv2, CLIP) if you have GPU budget and large data. Use EfficientNet or ConvNeXt if you need lower latency, mobile deployment, or have limited data. Hybrid models (ConvFormer, CvT) combine both — useful middle ground.
Practice questions
- What is the receptive field of a CNN layer and why does it matter? (Answer: The receptive field is the region of the input image that contributes to one output neuron's activation. A 3×3 conv layer: each output pixel sees 3×3 input pixels. Two stacked 3×3 layers: each output pixel sees 5×5 input pixels. N layers of 3×3 convolutions: receptive field = (2N+1)×(2N+1). Deep CNNs with small filters achieve large effective receptive fields (global context) while using fewer parameters than single large filters. For image classification, the final feature maps must have receptive fields large enough to span the full input.)
- What is the difference between same padding and valid padding in a convolution? (Answer: Valid padding: no padding — output is smaller than input. Input 32×32, kernel 3×3: output is 30×30 (shrinks by kernel_size-1=2). Same padding: pad input so output has the same spatial dimensions as input. Input 32×32, kernel 3×3: pad by 1 on each side, output is 32×32. TensorFlow default: same padding. PyTorch default: valid (padding=0). Use same padding when you want to preserve spatial dimensions through many layers; use valid when spatial reduction is intentional.)
- What is the difference between regular convolution and depthwise separable convolution (used in MobileNet)? (Answer: Regular convolution: each filter operates across ALL input channels simultaneously — one filter per output channel, each with in_channels × k × k parameters. Total: out_channels × in_channels × k². Depthwise separable: (1) Depthwise: one filter per input channel (operates on each channel independently). (2) Pointwise: 1×1 convolution combines channels. Total: in_channels × k² + in_channels × out_channels. ~8–9× fewer parameters for 3×3 conv. MobileNet achieves competitive accuracy at roughly 10× fewer parameters using this factorisation.)
- What is feature map visualisation and what does it reveal about CNN learning? (Answer: Visualising activations of filters at different layers shows the hierarchy of learned representations: Layer 1: simple edges and colours (oriented Gabor-like filters). Layer 2: textures and simple shapes (combinations of edges). Layer 3–4: object parts (wheels, eyes, windows). Final layers: complete objects and scenes. This hierarchical feature learning (Zeiler & Fergus 2013) confirmed that CNNs learn semantically meaningful features automatically — without hand-crafting, as required by pre-deep-learning vision systems.)
- What is the difference between stride and pooling for spatial downsampling in CNNs? (Answer: Pooling (max/average): take the max or average over a spatial window, reduce spatial size by pool_factor. Fixed operation — no learned parameters. Max pooling: keeps the strongest activation (feature present or absent). Strided convolution: move the conv filter by stride>1, producing smaller output. Learned downsampling — the network learns how to combine spatially adjacent information. Modern CNNs (ResNet, EfficientNet) prefer strided convolutions for downsampling: they're learnable and often outperform fixed pooling. Pooling still used in attention mechanisms and some architectures.)
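The valid-vs-same padding answer above can be verified in PyTorch (layer sizes here are arbitrary; the string form `padding="same"` is available in recent PyTorch versions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# "Valid" (PyTorch default, padding=0): output shrinks by kernel_size - 1.
valid = nn.Conv2d(3, 8, kernel_size=3)  # padding defaults to 0
print(valid(x).shape)  # torch.Size([1, 8, 30, 30])

# "Same": pad by (kernel_size - 1) // 2 so spatial size is preserved.
same = nn.Conv2d(3, 8, kernel_size=3, padding=1)
print(same(x).shape)   # torch.Size([1, 8, 32, 32])

# Recent PyTorch also accepts the string form directly:
same_str = nn.Conv2d(3, 8, kernel_size=3, padding="same")
print(same_str(x).shape)  # torch.Size([1, 8, 32, 32])
```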
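The depthwise-separable parameter counts above can also be checked in PyTorch, using `groups=in_channels` for the depthwise step (the 64→128 channel sizes are chosen for illustration):

```python
import torch.nn as nn

cin, cout, k = 64, 128, 3

# Regular 3x3 convolution: every filter spans all input channels.
regular = nn.Conv2d(cin, cout, k, padding=1, bias=False)
print(sum(p.numel() for p in regular.parameters()))  # 128*64*9 = 73728

# Depthwise separable = depthwise (groups=cin) + pointwise 1x1.
depthwise = nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False)
pointwise = nn.Conv2d(cin, cout, kernel_size=1, bias=False)
sep = (sum(p.numel() for p in depthwise.parameters())
       + sum(p.numel() for p in pointwise.parameters()))
print(sep)           # 64*9 + 64*128 = 8768
print(73728 / sep)   # ~8.4x fewer parameters, matching the answer above
```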