GRU (Gated Recurrent Unit) is a simplified version of LSTM with only two gates (reset and update) instead of three. It achieves comparable performance with fewer parameters and faster training. Bidirectional RNNs process sequences both forwards and backwards simultaneously, capturing context from both directions — essential for NLP tasks like named entity recognition and question answering. Seq2Seq models combine an encoder RNN and a decoder RNN, usually augmented with an attention mechanism — the architecture behind neural machine translation (as in Google Translate), voice assistants, and text summarisation.
GRU vs LSTM — simpler gates, same power
GRU equations:
- Update gate: z_t = σ(W_z[h_{t-1}, x_t]) controls how much of the hidden state is updated.
- Reset gate: r_t = σ(W_r[h_{t-1}, x_t]) controls how much of the previous state to forget when computing the candidate.
- Candidate hidden state: h̃_t = tanh(W[r_t⊙h_{t-1}, x_t]).
- Final hidden state: h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t, a blend of old state and new candidate.
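As a sanity check, the four equations above can be implemented directly with a few matrix multiplications. This is a minimal sketch: the sizes (`input_size=4`, `hidden_size=3`), the random weights, and the `gru_step` helper are illustrative, and biases are omitted for brevity.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3

# Illustrative random weights; each W_* maps the concatenation [h_{t-1}, x_t]
# (length hidden_size + input_size) to a hidden_size vector.
W_z = torch.randn(hidden_size, hidden_size + input_size)
W_r = torch.randn(hidden_size, hidden_size + input_size)
W_h = torch.randn(hidden_size, hidden_size + input_size)

def gru_step(x_t, h_prev):
    concat = torch.cat([h_prev, x_t])                         # [h_{t-1}, x_t]
    z = torch.sigmoid(W_z @ concat)                           # update gate
    r = torch.sigmoid(W_r @ concat)                           # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # blend old and new

h = torch.zeros(hidden_size)
for t in range(5):  # run five timesteps on random inputs
    h = gru_step(torch.randn(input_size), h)
print(h.shape)  # torch.Size([3])
```

Because h_t is a convex combination of the previous state and a tanh output, every component of the hidden state stays in (−1, 1).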
GRU, LSTM, Bidirectional RNN comparison
```python
import torch
import torch.nn as nn

# GRU: 2 gates (update + reset) — lighter than LSTM (3 gates)
gru = nn.GRU(
    input_size=128,    # Feature dimension of each input token
    hidden_size=256,   # Hidden state size
    num_layers=2,      # Stack 2 GRU layers
    batch_first=True,  # (batch, seq, features) — standard convention
    dropout=0.3,       # Dropout between layers (not after the last layer)
    bidirectional=False,
)

# LSTM: 3 gates (input + forget + output) + cell state
lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,
    batch_first=True,
    dropout=0.3,
    bidirectional=False,
)

# Bidirectional GRU: processes the sequence in BOTH directions
bigru = nn.GRU(
    input_size=128, hidden_size=128,
    num_layers=2, batch_first=True,
    dropout=0.3, bidirectional=True,  # Output size = 2 × 128 = 256
)

# Input: batch=32, sequence_length=50, features=128
batch = torch.randn(32, 50, 128)

# GRU forward pass
gru_out, gru_hidden = gru(batch)
# gru_out:    (32, 50, 256) — output at each timestep
# gru_hidden: (2, 32, 256)  — final hidden state for each layer

# LSTM forward pass — returns (output, (hidden, cell))
lstm_out, (lstm_hidden, lstm_cell) = lstm(batch)

# Bidirectional GRU
bigru_out, bigru_hidden = bigru(batch)
# bigru_out:    (32, 50, 256) — 128 forward + 128 backward, concatenated
# bigru_hidden: (4, 32, 128)  — 2 layers × 2 directions

print(f"GRU output:   {gru_out.shape}")    # [32, 50, 256]
print(f"BiGRU output: {bigru_out.shape}")  # [32, 50, 256]
print(f"LSTM output:  {lstm_out.shape}")   # [32, 50, 256]

# Parameter count comparison
gru_params = sum(p.numel() for p in gru.parameters())
lstm_params = sum(p.numel() for p in lstm.parameters())
print(f"GRU parameters:  {gru_params:,}")
print(f"LSTM parameters: {lstm_params:,}")
# LSTM has ~33% more parameters than GRU (4 weight blocks per layer vs 3)
```
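The ~33% figure follows directly from the gate count: per layer, PyTorch stores one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors, each stacked over the gate blocks (3 for GRU, 4 for LSTM), so every component scales by the same factor. A quick check with the sizes used above:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

# Each layer's weights and biases are stacked over gate blocks, so the
# LSTM/GRU ratio is exactly 4/3 regardless of sizes or layer count.
ratio = n_params(lstm) / n_params(gru)
print(ratio)  # 1.3333333333333333
```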
```python
# Sentiment classification with a bidirectional GRU
class SentimentBiGRU(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_size, batch_first=True,
                            bidirectional=True, dropout=0.3, num_layers=2)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )

    def forward(self, tokens):
        x = self.embedding(tokens)   # (batch, seq, embed)
        out, hidden = self.bigru(x)  # out: (batch, seq, 2×hidden)
        # Use the final hidden states from both directions
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, 2×hidden)
        return self.classifier(final)
```
Seq2Seq — the architecture behind neural translation
Seq2Seq has two parts:
- Encoder — reads the input sequence and compresses it into a context vector (the final hidden state).
- Decoder — generates the output sequence token by token, using the context vector as its initial state.
With attention, the decoder attends to ALL encoder hidden states at each decoding step, not just the final one — solving the bottleneck problem for long sequences.
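The encoder-decoder wiring can be sketched in a few lines. This is a minimal illustration of the plain (no-attention) variant: the class name `TinySeq2Seq` and all sizes are made up for the example, and real systems add token embeddings, teacher forcing, and beam-search decoding.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Illustrative encoder-decoder: the encoder's final hidden state
    becomes the context vector that initialises the decoder."""
    def __init__(self, src_dim=16, tgt_dim=16, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_dim)

    def forward(self, src, tgt):
        _, context = self.encoder(src)           # context: (1, batch, hidden)
        dec_out, _ = self.decoder(tgt, context)  # context seeds the decoder
        return self.out(dec_out)                 # one prediction per target step

model = TinySeq2Seq()
src = torch.randn(8, 12, 16)  # batch=8, source length=12
tgt = torch.randn(8, 9, 16)   # target length=9 (teacher-forced inputs)
preds = model(src, tgt)
print(preds.shape)  # torch.Size([8, 9, 16])
```

Note that the whole source sequence must pass through the single `context` vector — exactly the bottleneck that attention removes.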
| Architecture | Key feature | Primary use case |
|---|---|---|
| Unidirectional RNN/GRU | Processes left→right only | Language generation, auto-regressive models |
| Bidirectional RNN/GRU | Processes both directions simultaneously | Classification, NER, QA — needs full context |
| Seq2Seq (Encoder-Decoder) | Two RNNs: encoder reads, decoder generates | Machine translation, summarisation, chatbots |
| Seq2Seq + Attention | Decoder attends to encoder states | Better MT, question answering |
| Transformer | Self-attention replaces recurrence entirely | BERT, GPT, T5 — all modern NLP |
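The "decoder attends to encoder states" row can be made concrete with simple dot-product attention. All names and sizes below are illustrative; this computes one decoding step's context vector from a batch of encoder states.

```python
import torch
import torch.nn.functional as F

B, T, H = 4, 10, 32                # batch, source length, hidden size
enc_states = torch.randn(B, T, H)  # ALL encoder hidden states
dec_state = torch.randn(B, H)      # current decoder hidden state

# Score each encoder state against the decoder state (dot product),
# normalise with softmax, then take the weighted sum of encoder states.
scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # (B, T)
weights = F.softmax(scores, dim=1)  # attention weights sum to 1 over T
context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (B, H)

print(context.shape)  # torch.Size([4, 32])
```

Because the context is rebuilt at every decoding step from all T encoder states, long sources are no longer squeezed through one fixed-size vector.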
Practice questions
- GRU has 2 gates; LSTM has 3. What gate does GRU eliminate and how does it compensate? (Answer: GRU eliminates the separate output gate and merges the input/forget gates into a single update gate. The update gate (z_t) controls both how much of the old hidden state to keep AND how much of the new candidate to add — balancing memory and new input with one gate instead of two.)
- Why does a bidirectional RNN outperform a unidirectional one for NER? (Answer: NER requires full sentence context. "Apple announced..." — is Apple a company or fruit? Bidirectional RNN sees "Apple" with context from both left (nothing) and right ("announced a product launch") — the right context clarifies it is a company. Unidirectional only sees left context at each position.)
- The Seq2Seq bottleneck problem — what is it and what solved it? (Answer: Original Seq2Seq: the entire source sentence is compressed into ONE fixed-size vector (the encoder's final hidden state). For long sentences, early tokens are forgotten. Attention mechanism solved this by letting the decoder directly attend to all encoder hidden states — no compression bottleneck.)
- In a bidirectional GRU with hidden_size=128 per direction, what is the output dimension at each timestep? (Answer: 128 (forward) + 128 (backward) = 256. The forward and backward outputs are concatenated, giving a 256-dimensional representation at each timestep that captures context from both directions.)
- Why are RNNs being replaced by Transformers for NLP tasks? (Answer: RNNs process tokens sequentially (token t depends on token t-1), so training cannot be parallelised across the sequence. Transformers use self-attention: every token attends to every other token in parallel, which makes training dramatically faster. Self-attention also handles long-range dependencies better, since any two tokens are linked by a single attention step rather than a long recurrent chain (at the cost of O(n²) attention memory versus O(n) for an RNN), and attention weights are directly interpretable.)
On LumiChats
Bidirectional GRUs are used in embedding models that power LumiChats document search. When you upload a document, it is encoded by a bidirectional model that reads each sentence both forwards and backwards to create rich embeddings capturing full sentence context.