A Multi-Layer Perceptron (MLP) is the foundational feed-forward neural network: an input layer, one or more hidden layers of neurons with non-linear activation functions, and an output layer. Information flows strictly in one direction — forward — with no cycles. Training uses backpropagation and gradient descent to adjust weights. MLPs are universal approximators: given enough neurons, they can approximate any continuous function. GATE DS&AI tests MLP forward pass computation, activation functions, and training dynamics.
Real-life analogy: The brain's decision chain
Imagine a doctor diagnosing a patient. First, basic facts are assessed: fever? (yes/no). Then a specialist interprets symptoms in combination: fever AND cough AND loss of smell → high COVID probability. Then a senior doctor makes the final call: admit or not? Each level of decision-making uses the previous level's output. An MLP works identically: each layer extracts higher-level abstractions from the previous layer's representations.
Architecture and forward pass
Forward pass: a^(l) = σ(W^(l) a^(l−1) + b^(l)), where σ is the non-linear activation function applied element-wise, W^(l) is the weight matrix, and b^(l) is the bias vector of layer l. a^(0) = x (the input), and the final layer's output is the prediction.
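A quick numeric sketch of one forward pass (the weights and inputs below are made up for illustration, matching the kind of hand computation GATE asks for):

```python
import numpy as np

# Hypothetical 2-input, 2-hidden, 1-output forward pass with made-up weights
x = np.array([1.0, 2.0])                  # a^(0) = input
W1 = np.array([[0.5, -1.0],
               [0.25, 0.75]])             # shape (2, 2): inputs x hidden units
b1 = np.array([0.0, 0.5])
z1 = x @ W1 + b1                          # [1*0.5 + 2*0.25, 1*(-1) + 2*0.75 + 0.5] = [1.0, 1.0]
a1 = np.maximum(0, z1)                    # ReLU: [1.0, 1.0]
W2 = np.array([[2.0], [-1.0]])
b2 = np.array([0.5])
z2 = a1 @ W2 + b2                         # 1*2 + 1*(-1) + 0.5 = 1.5
a2 = 1 / (1 + np.exp(-z2))                # sigmoid output, approx 0.8176
print(z1, a1, z2, a2)
```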
MLP from scratch — forward pass and backpropagation
```python
import numpy as np

class MLP:
    """2-layer MLP with a ReLU hidden layer and a sigmoid output."""

    def __init__(self, n_input, n_hidden, n_output, lr=0.01):
        # He initialisation (suited to ReLU): keeps activation variance
        # stable across layers, preventing vanishing/exploding gradients
        self.W1 = np.random.randn(n_input, n_hidden) * np.sqrt(2 / n_input)
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.random.randn(n_hidden, n_output) * np.sqrt(2 / n_hidden)
        self.b2 = np.zeros(n_output)
        self.lr = lr

    def relu(self, z):
        return np.maximum(0, z)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def relu_grad(self, z):
        return (z > 0).astype(float)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1         # (n, hidden)
        self.a1 = self.relu(self.z1)            # ReLU activation
        self.z2 = self.a1 @ self.W2 + self.b2   # (n, output)
        self.a2 = self.sigmoid(self.z2)         # sigmoid output
        return self.a2

    def backward(self, X, y):
        n = X.shape[0]
        # Output layer gradient (cross-entropy + sigmoid simplifies to a2 - y)
        dz2 = self.a2 - y.reshape(-1, 1)        # (n, output)
        dW2 = self.a1.T @ dz2 / n
        db2 = dz2.mean(axis=0)
        # Hidden layer gradient (chain rule through ReLU)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_grad(self.z1)     # element-wise ReLU gradient
        dW1 = X.T @ dz1 / n
        db1 = dz1.mean(axis=0)
        # Gradient descent update
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=1000):
        yc = y.reshape(-1, 1)   # column vector so the loss broadcasts correctly
        for epoch in range(epochs):
            y_hat = self.forward(X)
            self.backward(X, y)
            if epoch % 200 == 0:
                loss = -np.mean(yc * np.log(y_hat + 1e-8)
                                + (1 - yc) * np.log(1 - y_hat + 1e-8))
                print(f"Epoch {epoch}: Loss={loss:.4f}")

# XOR problem — not linearly separable, needs a hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR
mlp = MLP(n_input=2, n_hidden=4, n_output=1, lr=0.5)
mlp.train(X, y, epochs=5000)
preds = (mlp.forward(X) > 0.5).astype(int).flatten()
print(f"XOR predictions: {preds}")  # should be [0 1 1 0]
```
Activation functions — the key non-linearity
| Activation | Formula | Output range | Problem | Use |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^−z) | (0, 1) | Vanishing gradients, not zero-centred | Binary output layer |
| Tanh | (e^z−e^−z)/(e^z+e^−z) | (−1, 1) | Vanishing gradients (less than sigmoid) | RNNs, hidden layers (older) |
| ReLU | max(0, z) | [0, ∞) | Dying ReLU (neurons stuck at 0) | Default for hidden layers |
| Leaky ReLU | max(0.01z, z) | (−∞, ∞) | Fixed slope — may not be optimal | When dying ReLU is a problem |
| Softmax | e^zₖ / Σe^zⱼ | (0,1), sums to 1 | Expensive for large vocabulary | Multi-class output layer |
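The table's activations take only a few lines of NumPy. One addition not in the table: the softmax below subtracts the max logit before exponentiating, a standard numerical-stability trick:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # output in (0, 1)

def relu(z):
    return np.maximum(0, z)              # output in [0, inf)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()                   # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))       # each value strictly between 0 and 1
print(relu(z))          # [0. 0. 3.]
print(leaky_relu(z))    # [-0.02  0.    3.  ]
print(softmax(z).sum()) # 1.0 (a probability distribution)
```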
Vanishing gradient problem
The sigmoid derivative is at most 0.25 (at z = 0); tanh's derivative peaks at 1 but decays quickly away from zero. During backpropagation through many layers, these factors multiply: even at best, 0.25^10 ≈ 9.5×10⁻⁷. Deep networks with sigmoid activations barely update their early layers — they learn very slowly or not at all. ReLU mitigates this: its gradient is exactly 1 for positive inputs, allowing gradients to flow through deep networks unchanged.
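The multiplicative shrinkage can be checked directly (a toy calculation, not a real network):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# Sigmoid derivative is maximised at z = 0, where it equals 0.25:
print(sigmoid_grad(0.0))   # 0.25
# Best case through 10 sigmoid layers: the gradients nearly vanish
print(0.25 ** 10)          # about 9.5e-07
# ReLU derivative for positive inputs is exactly 1, so no shrinkage:
print(1.0 ** 10)           # 1.0
```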
Universal approximation theorem
The Universal Approximation Theorem (Hornik, 1989) states: an MLP with a single hidden layer containing a finite number of neurons with non-linear activation functions can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. This is the mathematical guarantee that neural networks are powerful enough in principle — but says nothing about how easy it is to train them or how many neurons are needed.
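A constructive illustration (not the theorem's proof, and the knot-based construction here is our own sketch): any piecewise-linear interpolant can be written as a sum of ReLU units, so a single hidden ReLU layer with enough neurons tracks sin(x) to small error:

```python
import numpy as np

# Approximate sin on [0, 2*pi] with a 1-hidden-layer ReLU "network":
# y_hat(x) = y0 + sum_i c_i * ReLU(x - x_i), one hidden neuron per knot.
N = 50
knots = np.linspace(0, 2 * np.pi, N)
targets = np.sin(knots)
h = knots[1] - knots[0]
slopes = np.diff(targets) / h          # per-segment slopes, length N-1
c = np.diff(slopes, prepend=0.0)       # slope changes = hidden-to-output weights

def mlp_approx(x):
    # Hidden layer: ReLU(x - x_i); output layer: weighted sum + bias y0
    hidden = np.maximum(0, x[:, None] - knots[None, :-1])
    return targets[0] + hidden @ c

xs = np.linspace(0, 2 * np.pi, 1000)
err = np.max(np.abs(mlp_approx(xs) - np.sin(xs)))
print(f"max error with {N-1} ReLU neurons: {err:.4f}")  # well under 0.01
```

Doubling the number of knots roughly quarters the error, matching the theorem's promise that more neurons buy more accuracy while saying nothing about trainability.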
Practice questions (GATE-style)
- Why can't a single-layer (no hidden layer) MLP solve XOR? (Answer: XOR is not linearly separable — no single hyperplane can separate the (0,0),(1,1) class from (0,1),(1,0) class. A hidden layer creates non-linear decision boundaries.)
- What is the purpose of Xavier initialisation? (Answer: Ensures that the variance of activations remains approximately constant across layers during initialisation, preventing vanishing or exploding gradients from the start of training.)
- An MLP has 3 input features, 1 hidden layer with 5 neurons, and 2 output neurons. How many trainable parameters? (Answer: (3×5 + 5) + (5×2 + 2) = 15+5+10+2 = 32 parameters (weights + biases for each layer).)
- Which activation function should you use for a multi-class classification output layer? (Answer: Softmax — it converts raw scores (logits) to a probability distribution over K classes that sums to 1.)
- Why does ReLU cause "dying neurons"? (Answer: If a large weight update pushes a neuron's pre-activation negative for every training input, its output and its gradient are both 0, so its weights never update again — the neuron is "dead". Solution: Leaky ReLU uses a small slope such as 0.01 for negative inputs, keeping the gradient non-zero.)
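The parameter-count question above can be verified mechanically; the helper below is a hypothetical utility, not part of the notes:

```python
def mlp_param_count(layer_sizes):
    """Total trainable parameters for a fully-connected MLP.

    Each consecutive layer pair contributes (fan_in * fan_out) weights
    plus fan_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([3, 5, 2]))  # (3*5 + 5) + (5*2 + 2) = 32
print(mlp_param_count([2, 4, 1]))  # the XOR network: (2*4 + 4) + (4*1 + 1) = 17
```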
On LumiChats
Every LLM — GPT, Claude, Gemini — is a deep feed-forward network at its core, with each Transformer block containing MLP sub-layers. The attention mechanism is added on top, but the fundamental computation (linear transformation + non-linear activation) is identical to an MLP layer.