A Multi-Layer Perceptron (MLP) is the foundational feed-forward neural network: an input layer, one or more hidden layers of neurons with non-linear activation functions, and an output layer. Information flows strictly in one direction — forward — with no cycles. Training uses backpropagation and gradient descent to adjust weights. MLPs are universal approximators: given enough neurons, they can approximate any continuous function. GATE DS&AI tests MLP forward pass computation, activation functions, and training dynamics.
Real-life analogy: The brain's decision chain
Imagine a doctor diagnosing a patient. First, basic facts are assessed: fever? (yes/no). Then a specialist interprets symptoms in combination: fever AND cough AND loss of smell → high COVID probability. Then a senior doctor makes the final call: admit or not? Each level of decision-making uses the previous level's output. An MLP works identically: each layer extracts higher-level abstractions from the previous layer's representations.
Architecture and forward pass
Forward pass: a^(l) = σ(W^(l) a^(l−1) + b^(l)), where σ is the non-linear activation function applied element-wise, W^(l) is the weight matrix, and b^(l) is the bias vector of layer l. a^(0) = x (the input), and the final layer's output is the prediction.
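A quick numeric sketch of one forward pass (the weights and inputs below are made up for illustration, matching the kind of hand computation GATE asks for):

```python
import numpy as np

# Hypothetical 2-input, 2-hidden, 1-output forward pass with made-up weights
x = np.array([1.0, 2.0])                  # a^(0) = input
W1 = np.array([[0.5, -1.0],
               [0.25, 0.75]])             # shape (2, 2): inputs x hidden units
b1 = np.array([0.0, 0.5])
z1 = x @ W1 + b1                          # [1*0.5 + 2*0.25, 1*(-1) + 2*0.75 + 0.5] = [1.0, 1.0]
a1 = np.maximum(0, z1)                    # ReLU: [1.0, 1.0]
W2 = np.array([[2.0], [-1.0]])
b2 = np.array([0.5])
z2 = a1 @ W2 + b2                         # 1*2 + 1*(-1) + 0.5 = 1.5
a2 = 1 / (1 + np.exp(-z2))                # sigmoid output, approx 0.8176
print(z1, a1, z2, a2)
```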
MLP from scratch — forward pass and backpropagation
```python
import numpy as np

class MLP:
    """2-layer MLP with a ReLU hidden layer and a sigmoid output."""

    def __init__(self, n_input, n_hidden, n_output, lr=0.01):
        # He initialisation (suited to ReLU): keeps activation variance
        # stable across layers, preventing vanishing/exploding gradients
        self.W1 = np.random.randn(n_input, n_hidden) * np.sqrt(2 / n_input)
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.random.randn(n_hidden, n_output) * np.sqrt(2 / n_hidden)
        self.b2 = np.zeros(n_output)
        self.lr = lr

    def relu(self, z):
        return np.maximum(0, z)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def relu_grad(self, z):
        return (z > 0).astype(float)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1         # (n, hidden)
        self.a1 = self.relu(self.z1)            # ReLU activation
        self.z2 = self.a1 @ self.W2 + self.b2   # (n, output)
        self.a2 = self.sigmoid(self.z2)         # sigmoid output
        return self.a2

    def backward(self, X, y):
        n = X.shape[0]
        # Output layer gradient (cross-entropy + sigmoid simplifies to a2 - y)
        dz2 = self.a2 - y.reshape(-1, 1)        # (n, output)
        dW2 = self.a1.T @ dz2 / n
        db2 = dz2.mean(axis=0)
        # Hidden layer gradient (chain rule through ReLU)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_grad(self.z1)     # element-wise ReLU gradient
        dW1 = X.T @ dz1 / n
        db1 = dz1.mean(axis=0)
        # Gradient descent update
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=1000):
        yc = y.reshape(-1, 1)   # column vector so the loss broadcasts correctly
        for epoch in range(epochs):
            y_hat = self.forward(X)
            self.backward(X, y)
            if epoch % 200 == 0:
                loss = -np.mean(yc * np.log(y_hat + 1e-8)
                                + (1 - yc) * np.log(1 - y_hat + 1e-8))
                print(f"Epoch {epoch}: Loss={loss:.4f}")

# XOR problem — not linearly separable, needs a hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR
mlp = MLP(n_input=2, n_hidden=4, n_output=1, lr=0.5)
mlp.train(X, y, epochs=5000)
preds = (mlp.forward(X) > 0.5).astype(int).flatten()
print(f"XOR predictions: {preds}")  # should be [0 1 1 0]
```
Activation functions — the key non-linearity
| Activation | Formula | Output range | Problem | Use |
|---|---|---|---|---|
| Sigmoid | 1/(1+e^−z) | (0, 1) | Vanishing gradients, not zero-centred | Binary output layer |
| Tanh | (e^z−e^−z)/(e^z+e^−z) | (−1, 1) | Vanishing gradients (less than sigmoid) | RNNs, hidden layers (older) |
| ReLU | max(0, z) | [0, ∞) | Dying ReLU (neurons stuck at 0) | Default for hidden layers |
| Leaky ReLU | max(0.01z, z) | (−∞, ∞) | Fixed slope — may not be optimal | When dying ReLU is a problem |
| Softmax | e^zₖ / Σe^zⱼ | (0,1), sums to 1 | Expensive for large vocabulary | Multi-class output layer |
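The table's activations take only a few lines of NumPy. One addition not in the table: the softmax below subtracts the max logit before exponentiating, a standard numerical-stability trick:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # output in (0, 1)

def relu(z):
    return np.maximum(0, z)              # output in [0, inf)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()                   # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))       # each value strictly between 0 and 1
print(relu(z))          # [0. 0. 3.]
print(leaky_relu(z))    # [-0.02  0.    3.  ]
print(softmax(z).sum()) # 1.0 (a probability distribution)
```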
Vanishing gradient problem
The sigmoid derivative is at most 0.25 (at z = 0); tanh's derivative peaks at 1 but decays quickly away from zero. During backpropagation through many layers, these factors multiply: even at best, 0.25^10 ≈ 9.5×10⁻⁷. Deep networks with sigmoid activations barely update their early layers — they learn very slowly or not at all. ReLU mitigates this: its gradient is exactly 1 for positive inputs, allowing gradients to flow through deep networks unchanged.
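The multiplicative shrinkage can be checked directly (a toy calculation, not a real network):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# Sigmoid derivative is maximised at z = 0, where it equals 0.25:
print(sigmoid_grad(0.0))   # 0.25
# Best case through 10 sigmoid layers: the gradients nearly vanish
print(0.25 ** 10)          # about 9.5e-07
# ReLU derivative for positive inputs is exactly 1, so no shrinkage:
print(1.0 ** 10)           # 1.0
```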
Universal approximation theorem
The Universal Approximation Theorem (Hornik, 1989) states: an MLP with a single hidden layer containing a finite number of neurons with non-linear activation functions can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. This is the mathematical guarantee that neural networks are powerful enough in principle — but says nothing about how easy it is to train them or how many neurons are needed.
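A constructive illustration (not the theorem's proof, and the knot-based construction here is our own sketch): any piecewise-linear interpolant can be written as a sum of ReLU units, so a single hidden ReLU layer with enough neurons tracks sin(x) to small error:

```python
import numpy as np

# Approximate sin on [0, 2*pi] with a 1-hidden-layer ReLU "network":
# y_hat(x) = y0 + sum_i c_i * ReLU(x - x_i), one hidden neuron per knot.
N = 50
knots = np.linspace(0, 2 * np.pi, N)
targets = np.sin(knots)
h = knots[1] - knots[0]
slopes = np.diff(targets) / h          # per-segment slopes, length N-1
c = np.diff(slopes, prepend=0.0)       # slope changes = hidden-to-output weights

def mlp_approx(x):
    # Hidden layer: ReLU(x - x_i); output layer: weighted sum + bias y0
    hidden = np.maximum(0, x[:, None] - knots[None, :-1])
    return targets[0] + hidden @ c

xs = np.linspace(0, 2 * np.pi, 1000)
err = np.max(np.abs(mlp_approx(xs) - np.sin(xs)))
print(f"max error with {N-1} ReLU neurons: {err:.4f}")  # well under 0.01
```

Doubling the number of knots roughly quarters the error, matching the theorem's promise that more neurons buy more accuracy while saying nothing about trainability.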
Practice questions (GATE-style)
- Why can't a single-layer (no hidden layer) MLP solve XOR? (Answer: XOR is not linearly separable — no single hyperplane can separate the (0,0),(1,1) class from (0,1),(1,0) class. A hidden layer creates non-linear decision boundaries.)
- What is the purpose of Xavier initialisation? (Answer: Ensures that the variance of activations remains approximately constant across layers during initialisation, preventing vanishing or exploding gradients from the start of training.)
- An MLP has 3 input features, 1 hidden layer with 5 neurons, and 2 output neurons. How many trainable parameters? (Answer: (3×5 + 5) + (5×2 + 2) = 15+5+10+2 = 32 parameters (weights + biases for each layer).)
- Which activation function should you use for a multi-class classification output layer? (Answer: Softmax — it converts raw scores (logits) to a probability distribution over K classes that sums to 1.)
- Why does ReLU cause "dying neurons"? (Answer: If a large weight update pushes a neuron's pre-activation negative for every training input, its output and its gradient are both 0, so its weights never update again — the neuron is "dead". Solution: Leaky ReLU uses a small slope such as 0.01 for negative inputs, keeping the gradient non-zero.)
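The parameter-count question above can be verified mechanically; the helper below is a hypothetical utility, not part of the notes:

```python
def mlp_param_count(layer_sizes):
    """Total trainable parameters for a fully-connected MLP.

    Each consecutive layer pair contributes (fan_in * fan_out) weights
    plus fan_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([3, 5, 2]))  # (3*5 + 5) + (5*2 + 2) = 32
print(mlp_param_count([2, 4, 1]))  # the XOR network: (2*4 + 4) + (4*1 + 1) = 17
```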
On LumiChats
Every LLM — GPT, Claude, Gemini — is a deep feed-forward network at its core, with each Transformer block containing MLP sub-layers. The attention mechanism is added on top, but the fundamental computation (linear transformation + non-linear activation) is identical to an MLP layer.