GRU (Gated Recurrent Unit) is a simplified version of LSTM with only two gates (reset and update) instead of three. It achieves comparable performance with fewer parameters and faster training. Bidirectional RNNs process sequences both forwards and backwards simultaneously, capturing context from both directions — essential for NLP tasks like named entity recognition and question answering. Seq2Seq models combine an encoder RNN and a decoder RNN, usually augmented with an attention mechanism — the architecture behind neural machine translation (as in Google Translate), voice assistants, and text summarisation.
GRU vs LSTM — simpler gates, same power
GRU equations:
- Update gate: z_t = σ(W_z[h_{t-1}, x_t]) controls how much of the hidden state is updated.
- Reset gate: r_t = σ(W_r[h_{t-1}, x_t]) controls how much of the previous state to forget when computing the candidate.
- Candidate hidden state: h̃_t = tanh(W[r_t⊙h_{t-1}, x_t]).
- Final hidden state: h_t = (1−z_t)⊙h_{t-1} + z_t⊙h̃_t, a blend of old state and new candidate.
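As a sanity check, the four equations above can be implemented directly with a few matrix multiplications. This is a minimal sketch: the sizes (`input_size=4`, `hidden_size=3`), the random weights, and the `gru_step` helper are illustrative, and biases are omitted for brevity.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3

# Illustrative random weights; each W_* maps the concatenation [h_{t-1}, x_t]
# (length hidden_size + input_size) to a hidden_size vector.
W_z = torch.randn(hidden_size, hidden_size + input_size)
W_r = torch.randn(hidden_size, hidden_size + input_size)
W_h = torch.randn(hidden_size, hidden_size + input_size)

def gru_step(x_t, h_prev):
    concat = torch.cat([h_prev, x_t])                         # [h_{t-1}, x_t]
    z = torch.sigmoid(W_z @ concat)                           # update gate
    r = torch.sigmoid(W_r @ concat)                           # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # blend old and new

h = torch.zeros(hidden_size)
for t in range(5):  # run five timesteps on random inputs
    h = gru_step(torch.randn(input_size), h)
print(h.shape)  # torch.Size([3])
```

Because h_t is a convex combination of the previous state and a tanh output, every component of the hidden state stays in (−1, 1).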
GRU, LSTM, Bidirectional RNN comparison
```python
import torch
import torch.nn as nn

# GRU: 2 gates (update + reset) — lighter than LSTM (3 gates)
gru = nn.GRU(
    input_size=128,    # Feature dimension of each input token
    hidden_size=256,   # Hidden state size
    num_layers=2,      # Stack 2 GRU layers
    batch_first=True,  # (batch, seq, features) — standard convention
    dropout=0.3,       # Dropout between layers (not after the last layer)
    bidirectional=False,
)

# LSTM: 3 gates (input + forget + output) + cell state
lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,
    batch_first=True,
    dropout=0.3,
    bidirectional=False,
)

# Bidirectional GRU: processes the sequence in BOTH directions
bigru = nn.GRU(
    input_size=128, hidden_size=128,
    num_layers=2, batch_first=True,
    dropout=0.3, bidirectional=True,  # Output size = 2 × 128 = 256
)

# Input: batch=32, sequence_length=50, features=128
batch = torch.randn(32, 50, 128)

# GRU forward pass
gru_out, gru_hidden = gru(batch)
# gru_out:    (32, 50, 256) — output at each timestep
# gru_hidden: (2, 32, 256)  — final hidden state for each layer

# LSTM forward pass — returns (output, (hidden, cell))
lstm_out, (lstm_hidden, lstm_cell) = lstm(batch)

# Bidirectional GRU
bigru_out, bigru_hidden = bigru(batch)
# bigru_out:    (32, 50, 256) — 128 forward + 128 backward, concatenated
# bigru_hidden: (4, 32, 128)  — 2 layers × 2 directions

print(f"GRU output:   {gru_out.shape}")    # [32, 50, 256]
print(f"BiGRU output: {bigru_out.shape}")  # [32, 50, 256]
print(f"LSTM output:  {lstm_out.shape}")   # [32, 50, 256]

# Parameter count comparison
gru_params = sum(p.numel() for p in gru.parameters())
lstm_params = sum(p.numel() for p in lstm.parameters())
print(f"GRU parameters:  {gru_params:,}")
print(f"LSTM parameters: {lstm_params:,}")
# LSTM has ~33% more parameters than GRU (4 weight blocks per layer vs 3)
```
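The ~33% figure follows directly from the gate count: per layer, PyTorch stores one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors, each stacked over the gate blocks (3 for GRU, 4 for LSTM), so every component scales by the same factor. A quick check with the sizes used above:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

# Each layer's weights and biases are stacked over gate blocks, so the
# LSTM/GRU ratio is exactly 4/3 regardless of sizes or layer count.
ratio = n_params(lstm) / n_params(gru)
print(ratio)  # 1.3333333333333333
```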
```python
# Sentiment classification with a bidirectional GRU
class SentimentBiGRU(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_size, batch_first=True,
                            bidirectional=True, dropout=0.3, num_layers=2)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, n_classes),
        )

    def forward(self, tokens):
        x = self.embedding(tokens)   # (batch, seq, embed)
        out, hidden = self.bigru(x)  # out: (batch, seq, 2×hidden)
        # Use the final hidden states from both directions
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, 2×hidden)
        return self.classifier(final)
```
Seq2Seq — the architecture behind neural translation
Seq2Seq has two parts:
- Encoder — reads the input sequence and compresses it into a context vector (the final hidden state).
- Decoder — generates the output sequence token by token, using the context vector as its initial state.
With attention, the decoder attends to ALL encoder hidden states at each decoding step, not just the final one — solving the bottleneck problem for long sequences.
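The encoder-decoder wiring can be sketched in a few lines. This is a minimal illustration of the plain (no-attention) variant: the class name `TinySeq2Seq` and all sizes are made up for the example, and real systems add token embeddings, teacher forcing, and beam-search decoding.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Illustrative encoder-decoder: the encoder's final hidden state
    becomes the context vector that initialises the decoder."""
    def __init__(self, src_dim=16, tgt_dim=16, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_dim)

    def forward(self, src, tgt):
        _, context = self.encoder(src)           # context: (1, batch, hidden)
        dec_out, _ = self.decoder(tgt, context)  # context seeds the decoder
        return self.out(dec_out)                 # one prediction per target step

model = TinySeq2Seq()
src = torch.randn(8, 12, 16)  # batch=8, source length=12
tgt = torch.randn(8, 9, 16)   # target length=9 (teacher-forced inputs)
preds = model(src, tgt)
print(preds.shape)  # torch.Size([8, 9, 16])
```

Note that the whole source sequence must pass through the single `context` vector — exactly the bottleneck that attention removes.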
| Architecture | Key feature | Primary use case |
|---|---|---|
| Unidirectional RNN/GRU | Processes left→right only | Language generation, auto-regressive models |
| Bidirectional RNN/GRU | Processes both directions simultaneously | Classification, NER, QA — needs full context |
| Seq2Seq (Encoder-Decoder) | Two RNNs: encoder reads, decoder generates | Machine translation, summarisation, chatbots |
| Seq2Seq + Attention | Decoder attends to encoder states | Better MT, question answering |
| Transformer | Self-attention replaces recurrence entirely | BERT, GPT, T5 — all modern NLP |
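The "decoder attends to encoder states" row can be made concrete with simple dot-product attention. All names and sizes below are illustrative; this computes one decoding step's context vector from a batch of encoder states.

```python
import torch
import torch.nn.functional as F

B, T, H = 4, 10, 32                # batch, source length, hidden size
enc_states = torch.randn(B, T, H)  # ALL encoder hidden states
dec_state = torch.randn(B, H)      # current decoder hidden state

# Score each encoder state against the decoder state (dot product),
# normalise with softmax, then take the weighted sum of encoder states.
scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # (B, T)
weights = F.softmax(scores, dim=1)  # attention weights sum to 1 over T
context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (B, H)

print(context.shape)  # torch.Size([4, 32])
```

Because the context is rebuilt at every decoding step from all T encoder states, long sources are no longer squeezed through one fixed-size vector.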
Practice questions
- GRU has 2 gates; LSTM has 3. What gate does GRU eliminate and how does it compensate? (Answer: GRU eliminates the separate output gate and merges the input/forget gates into a single update gate. The update gate (z_t) controls both how much of the old hidden state to keep AND how much of the new candidate to add — balancing memory and new input with one gate instead of two.)
- Why does a bidirectional RNN outperform a unidirectional one for NER? (Answer: NER requires full sentence context. "Apple announced..." — is Apple a company or fruit? Bidirectional RNN sees "Apple" with context from both left (nothing) and right ("announced a product launch") — the right context clarifies it is a company. Unidirectional only sees left context at each position.)
- The Seq2Seq bottleneck problem — what is it and what solved it? (Answer: Original Seq2Seq: the entire source sentence is compressed into ONE fixed-size vector (the encoder's final hidden state). For long sentences, early tokens are forgotten. Attention mechanism solved this by letting the decoder directly attend to all encoder hidden states — no compression bottleneck.)
- In a bidirectional GRU with hidden_size=128 per direction, what is the output dimension at each timestep? (Answer: 128 (forward) + 128 (backward) = 256. The forward and backward outputs are concatenated, giving a 256-dimensional representation at each timestep that captures context from both directions.)
- Why are RNNs being replaced by Transformers for NLP tasks? (Answer: RNNs process tokens sequentially (token t depends on token t-1), so training cannot be parallelised across the sequence. Transformers use self-attention: every token attends to every other token in parallel, which makes training dramatically faster. Self-attention also handles long-range dependencies better, since any two tokens are linked by a single attention step rather than a long recurrent chain (at the cost of O(n²) attention memory versus O(n) for an RNN), and attention weights are directly interpretable.)
On LumiChats
Bidirectional GRUs are used in embedding models that power LumiChats document search. When you upload a document, it is encoded by a bidirectional model that reads each sentence both forwards and backwards to create rich embeddings capturing full sentence context.