
Neural Network

Computing inspired by the brain — but very different from it.


Definition

A neural network is a computational model loosely inspired by biological neural networks. It consists of layers of interconnected nodes (neurons) that each apply a linear transformation followed by a nonlinear activation function. Neural networks can approximate any continuous function (universal approximation theorem) and learn complex patterns from data through gradient-based optimization.

Anatomy of a neural network

A neural network is organized into layers. Each neuron computes a weighted sum of its inputs plus a bias, then applies an activation function:

Layer l output: a⁽ˡ⁾ = f(W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾), where W = weight matrix, b = bias, f = activation function (ReLU, GELU, etc.)

  • Input layer: receives raw features (pixel values, token embeddings)
  • Hidden layers: learn progressively abstract representations. "Deep" = many hidden layers.
  • Output layer: produces the final prediction (class probabilities via softmax, or a continuous value)

Building a neural network from scratch with NumPy

import numpy as np

def relu(x):    return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # numerically stable
    return e_x / e_x.sum(axis=-1, keepdims=True)

class DenseLayer:
    def __init__(self, n_in: int, n_out: int, activation=relu):
        # He initialization for ReLU layers
        self.W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
        self.b = np.zeros((n_out, 1))
        self.activation = activation

    def forward(self, x: np.ndarray) -> np.ndarray:
        self.input = x
        self.z = self.W @ x + self.b       # linear: (n_out, batch)
        self.a = self.activation(self.z)   # nonlinear
        return self.a

# A 3-layer network: 784 → 256 → 128 → 10 (MNIST-style)
l1 = DenseLayer(784, 256, relu)
l2 = DenseLayer(256, 128, relu)
l3 = DenseLayer(128, 10,  lambda x: x)  # raw logits, softmax applied separately

# Forward pass
batch_size = 32
x = np.random.randn(784, batch_size)  # 32 images, 784 pixels each

h1 = l1.forward(x)   # (256, 32)
h2 = l2.forward(h1)  # (128, 32)
logits = l3.forward(h2)  # (10, 32)
probs = softmax(logits.T).T  # (10, 32) — class probabilities

print(f"Input shape:  {x.shape}")
print(f"Hidden 1:     {h1.shape}")
print(f"Hidden 2:     {h2.shape}")
print(f"Output probs: {probs.shape}")
print(f"Probs sum:    {probs.sum(axis=0)[:3].round(4)}")  # [1, 1, 1] ✓

# Total parameters:
params = 784*256 + 256 + 256*128 + 128 + 128*10 + 10
print(f"Parameters: {params:,}")  # 235,146

Forward pass: the complete math

The forward pass computes predictions layer by layer. For a network with L layers, input a⁽⁰⁾ = x, and weight matrices W⁽ˡ⁾ and bias vectors b⁽ˡ⁾ at layer l:

Pre-activation (linear step): z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾, the weighted sum of the previous layer's outputs plus a bias.

Post-activation (nonlinear step): a⁽ˡ⁾ = f(z⁽ˡ⁾), with f applied element-wise. f is typically ReLU for hidden layers and softmax for the output layer.

Final prediction: ŷ = a⁽ᴸ⁾, the output of the last layer (softmax for classification, linear for regression).

For a concrete 3-layer example (784 → 256 → 128 → 10) on a single input vector x ∈ ℝ⁷⁸⁴:

Hidden layer 1: a⁽¹⁾ = ReLU(W⁽¹⁾x + b⁽¹⁾), W⁽¹⁾ ∈ ℝ^{256×784}, output shape (256,)

Hidden layer 2: a⁽²⁾ = ReLU(W⁽²⁾a⁽¹⁾ + b⁽²⁾), W⁽²⁾ ∈ ℝ^{128×256}, output shape (128,)

Output layer: ŷ = softmax(W⁽³⁾a⁽²⁾ + b⁽³⁾), W⁽³⁾ ∈ ℝ^{10×128}, a probability vector over 10 classes

Full forward pass with PyTorch — training loop included

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim=784, hidden=(256, 128), out_dim=10):
        super().__init__()
        dims = [in_dim, *hidden, out_dim]
        self.layers = nn.ModuleList([
            nn.Linear(dims[i], dims[i+1]) for i in range(len(dims)-1)
        ])

    def forward(self, x):
        # x: (batch, 784)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))   # z = Wx + b, then ReLU
        return self.layers[-1](x)  # logits (no softmax — CrossEntropyLoss applies it)

model = MLP()

# ── Training step ──────────────────────────────────────
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()   # = log-softmax + NLL loss

x = torch.randn(32, 784)           # batch of 32 images
y = torch.randint(0, 10, (32,))    # ground-truth class labels

optimizer.zero_grad()
logits = model(x)                  # forward pass  → (32, 10)
loss = criterion(logits, y)        # compute loss
loss.backward()                    # backward pass (autograd)
optimizer.step()                   # update weights

print(f"Loss: {loss.item():.4f}")
print(f"Logits shape: {logits.shape}")    # (32, 10)
probs = F.softmax(logits, dim=-1)
print(f"Probs sum (should be 1): {probs[0].sum():.4f}")

Vectorized batching

In practice all inputs are processed as a batch matrix X ∈ ℝ^{B×n} — the same weight matrix multiplies all B examples simultaneously. This is why GPUs (optimized for large matrix multiplications) are so critical: a single GPU instruction handles the entire batch.
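
The claim is easy to check. A quick sketch (shapes chosen arbitrarily) compares one batched matrix multiply against a Python loop over examples:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 32, 784, 256
X = rng.standard_normal((B, n_in))        # batch of B examples as rows
W = rng.standard_normal((n_out, n_in))    # one shared weight matrix
b = np.zeros(n_out)

# One matrix multiply transforms the whole batch at once...
batched = X @ W.T + b                     # (B, n_out)

# ...and matches applying the layer to each example separately.
looped = np.stack([W @ x + b for x in X])

print(np.allclose(batched, looped))       # True
```

On a GPU the batched form maps onto a single large matrix-multiply kernel, which is exactly the workload the hardware is optimized for.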

Universal Approximation Theorem

One of the most important theoretical results in deep learning:

Universal Approximation Theorem (Cybenko, 1989)

A feedforward neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. Cybenko proved this for sigmoidal activations; Hornik (1991) and Leshno et al. (1993) extended it to essentially any non-polynomial activation function.

However, 'sufficient width' might be exponentially large. Deep networks are far more parameter-efficient: depth enables hierarchical composition. Each layer builds on representations from the previous, dramatically reducing total parameters needed. For example, a Boolean function over n variables may need O(2ⁿ) neurons in a shallow network but only O(n) neurons in a deep one.
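
The theorem can be seen empirically. The sketch below (width, learning rate, and step count are arbitrary choices) fits a single-hidden-layer network to sin(x) on [-π, π]:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# One hidden layer of width 64 stands in for the theorem's "sufficient width"
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

x = torch.linspace(-math.pi, math.pi, 256).unsqueeze(1)  # (256, 1) inputs
y = torch.sin(x)                                         # target function

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
initial_loss = F.mse_loss(net(x), y).item()

for _ in range(2000):
    opt.zero_grad()
    loss = F.mse_loss(net(x), y)   # how far the net is from sin(x)
    loss.backward()
    opt.step()

print(f"MSE: {initial_loss:.4f} -> {loss.item():.6f}")
```

With these settings the loss typically falls by several orders of magnitude; widening the hidden layer or training longer pushes the approximation error lower still.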

Weight initialization strategies

Initializing all weights to zero makes every neuron in a layer compute the same output and receive the same gradient, so they can never differentiate (the symmetry problem). Random initialization breaks the symmetry, and proper scaling keeps activation variance consistent across layers:

Xavier/Glorot initialization — Var(W) = 2/(fan_in + fan_out); designed for tanh and sigmoid activations

He initialization — Var(W) = 2/fan_in; designed for ReLU activations (the larger variance compensates for ReLU zeroing half the signal)
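
The variance argument can be verified directly. A sketch (depth, width, and the naive scale are arbitrary choices) pushes random inputs through a stack of ReLU layers under two initialization scales:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 512
x = rng.standard_normal((width, 1000))    # 1000 random input vectors

def activation_scale(init_scale):
    """Push x through `depth` ReLU layers; return the final activation std."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * init_scale
        h = np.maximum(0, W @ h)          # linear step, then ReLU
    return h.std()

print(f"naive scale 0.01:  {activation_scale(0.01):.3e}")                # collapses toward 0
print(f"He sqrt(2/fan_in): {activation_scale(np.sqrt(2 / width)):.3f}")  # stays order 1
```

Under the naive scale the signal shrinks by a constant factor per layer and has effectively vanished by layer 20; under He scaling the activation magnitude is preserved, so gradients stay usable at any depth.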

Weight initialization in PyTorch

import torch.nn as nn

# PyTorch applies good initialization by default,
# but you can customize it:

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Apply He initialization explicitly to all Linear layers:
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(init_weights)

Practice questions

  1. What is the universal approximation theorem and what does it guarantee? (Answer: Universal approximation theorem (Cybenko 1989, Hornik 1991): a feedforward network with a single hidden layer of sufficient width and non-linear activation can approximate any continuous function on a compact set to arbitrary precision. What it guarantees: such a network EXISTS with the right weights. What it does NOT guarantee: how to find those weights (training convergence), how wide the network must be (could be exponentially wide), or that gradient descent will find the solution. The theorem justifies using neural networks but does not make training easy.)
  2. What is the difference between a shallow (wide) network and a deep (narrow) network with the same parameter count? (Answer: Shallow wide networks: single hidden layer can represent complex functions but may need exponentially more neurons than a deep network for the same function. Deep narrow networks: learn hierarchical representations — each layer extracts increasingly abstract features. Empirically: deep networks generalise better with fewer parameters because they exploit compositionality (complex functions = compositions of simpler ones). Deep networks also benefit from techniques not applicable to single-layer networks (batch norm, residual connections).)
  3. What is weight initialisation and why does it critically affect training? (Answer: Random weight initialisation breaks symmetry (all-zero weights: all neurons learn the same thing). Proper initialisation keeps signal variance stable through forward and backward passes. Xavier/Glorot: var(W) = 2/(fan_in + fan_out) — designed for tanh/sigmoid activations. He initialisation: var(W) = 2/fan_in — designed for ReLU (larger variance needed because ReLU kills half its inputs). PyTorch default: He uniform for linear layers. Bad initialisation causes vanishing (too small) or exploding (too large) gradients from the first forward pass, making training impossible.)
  4. What is the difference between a fully connected layer, a convolutional layer, and a recurrent layer? (Answer: Fully connected (dense): every input neuron connects to every output neuron. Parameter count: fan_in × fan_out. No spatial structure — treats all input positions equally. Convolutional: shared weight filter slides over spatial input. Parameter count: kernel_size² × in_channels × out_channels (independent of input size). Exploits spatial locality and translation invariance. Recurrent: processes sequences by maintaining hidden state — the same weight matrix applied at each step. Parameter count: (input_dim + hidden_dim) × hidden_dim. Exploits temporal structure.)
  5. What is batch normalisation and what problems does it solve in deep networks? (Answer: Batch norm normalises each layer's activations: x̂ = (x - μ_batch) / √(σ²_batch + ε), then applies learnable scale/shift: y = γx̂ + β. Problems it solves: (1) Internal covariate shift: distributions of layer inputs change as weights update — norm stabilises them. (2) Gradient flow: normalised activations keep gradients in healthy ranges. (3) Enables higher learning rates. (4) Mild regularisation effect (noise from batch statistics). Critical for training deep networks (>10 layers) — without it, deep ResNets are nearly untrainable from scratch.)
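
The batch-norm formula in question 5 is short enough to sketch directly (forward pass only; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable rescale."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # y = γx̂ + β

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128)) * 5 + 3  # badly scaled activations
y = batch_norm(x, gamma=np.ones(128), beta=np.zeros(128))

print(y.mean(axis=0)[:3].round(6))  # ~0 per feature
print(y.std(axis=0)[:3].round(4))   # ~1 per feature
```

With γ = 1 and β = 0 the output is exactly the normalized activations; during training γ and β are learned alongside the network's weights.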

