
T5, Seq2Seq Transformers & Encoder-Decoder NLP

Unifying every NLP task as text-to-text — translation, summarisation, QA, and more.


Definition

T5 (Text-To-Text Transfer Transformer, Google, 2019) frames every NLP task as a text-to-text problem: the input and output are always strings. Translation: input='translate English to French: Hello', output='Bonjour'. Summarisation: input='summarize: [article]', output='[summary]'. This unified framework enables training one model on all tasks simultaneously. The encoder-decoder architecture processes the input with full bidirectional attention (encoder) and generates the output auto-regressively (decoder). BART, MarianMT, mBART, and Whisper also use this architecture.

The text-to-text unification

| NLP task | T5 input format | T5 output |
|---|---|---|
| Translation | "translate English to French: How are you?" | "Comment allez-vous?" |
| Summarisation | "summarize: [long article text]" | "[1-3 sentence summary]" |
| Question answering | "question: What is the capital? context: France is a country whose capital is Paris." | "Paris" |
| Sentiment | "sentiment: This movie was absolutely terrible." | "negative" |
| Grammar check | "cola sentence: A cat in the box sat." | "unacceptable" |
| Semantic similarity (STS-B) | "stsb sentence1: ... sentence2: ..." | "3.8" (similarity score) |

T5 for multi-task NLP in a single model

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small', legacy=False)
model     = T5ForConditionalGeneration.from_pretrained('t5-small')

def t5_inference(input_text: str, max_length: int = 100) -> str:
    inputs  = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_length,
                              num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ── Multiple tasks, one model ──
tasks = [
    "translate English to French: The weather is beautiful today.",
    "summarize: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889, it was originally criticized by some of France's leading artists and intellectuals for its design.",
    "question: What year was the Eiffel Tower completed? context: The Eiffel Tower was constructed from 1887 to 1889 by engineer Gustave Eiffel.",
    "grammar: Me want to goes to the store.",
]
for task in tasks:
    result = t5_inference(task)
    print(f"Input:  {task[:70]}...")
    print(f"Output: {result}")
    print()

# ── BART: another popular encoder-decoder (better for summarisation) ──
from transformers import BartForConditionalGeneration, BartTokenizer
bart_tok = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_mdl = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

article = """The transformer architecture was introduced in 2017 by Vaswani et al.
It replaced recurrent networks with self-attention, enabling parallel training.
BERT and GPT are both based on transformers but use different attention patterns."""

inputs = bart_tok(article, return_tensors='pt', max_length=1024, truncation=True)
summary_ids = bart_mdl.generate(**inputs, max_new_tokens=60, num_beams=4)
print("BART summary:", bart_tok.decode(summary_ids[0], skip_special_tokens=True))

Encoder-decoder architecture deep dive

Encoder: Processes the full input sequence with bidirectional self-attention — every token can attend to every other input token. Produces a rich contextual representation of the input.

Decoder: Generates output tokens auto-regressively. Each block has two types of attention: (1) masked self-attention over previously generated tokens, and (2) cross-attention over all encoder states — the decoder 'reads' the encoder output to inform generation.
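The cross-attention step can be sketched in a few lines of PyTorch. This is a toy single-head version with no learned projection matrices (a real T5 block projects Q, K, V and uses multiple heads); the point is the shapes — queries come from the decoder, keys and values from the encoder.

```python
import torch

def cross_attention(decoder_states: torch.Tensor,
                    encoder_states: torch.Tensor) -> torch.Tensor:
    """Toy single-head cross-attention: queries from the decoder,
    keys and values from the encoder hidden states."""
    d = decoder_states.shape[-1]
    Q = decoder_states                       # (tgt_len, d)
    K = V = encoder_states                   # (src_len, d)
    scores  = Q @ K.T / d ** 0.5             # (tgt_len, src_len)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V                       # (tgt_len, d)

enc = torch.randn(10, 64)  # 10 encoded input tokens
dec = torch.randn(3, 64)   # 3 decoder positions generated so far
out = cross_attention(dec, enc)
print(out.shape)  # torch.Size([3, 64])
```

Note that the output has the decoder's sequence length but is a weighted mixture of encoder states — exactly the 'reading the input while generating' behaviour described above.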

| Architecture | Attention | Key strength | Models |
|---|---|---|---|
| Encoder-only | Bidirectional self-attention | Deep understanding of input | BERT, RoBERTa, ELECTRA |
| Decoder-only | Causal self-attention | Flexible generation, scales well | GPT-4, Claude, Llama, Gemini |
| Encoder-decoder | Bidirectional encoder + cross-attention decoder | Seq2seq: input understanding + generation | T5, BART, mT5, Whisper, MarianMT |
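The self-attention patterns in the table differ only in their masks. A minimal sketch of the two masks in PyTorch (boolean True = "may attend"):

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # Decoder-style self-attention: position i may attend to 0..i only.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def bidirectional_mask(n: int) -> torch.Tensor:
    # Encoder-style self-attention: every position sees every position.
    return torch.ones(n, n, dtype=torch.bool)

print(causal_mask(3).int())   # lower-triangular 3x3 matrix of 1s
```

Cross-attention in the encoder-decoder applies no mask over the encoder states at all: every decoder position may read the entire encoded input.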

Practice questions

  1. T5 frames classification as text-to-text. How does it output a class label? (Answer: The model generates the label as a text string. For sentiment: output is literally the word "positive" or "negative". For cola acceptability: output is "acceptable" or "unacceptable". The text output is then mapped to a class. This allows a single model to handle all tasks without task-specific output heads.)
  2. What is cross-attention in the encoder-decoder and what does it connect? (Answer: Cross-attention in the decoder attends to the ENCODER hidden states (keys and values come from encoder, queries come from decoder current state). At each decoder step, the decoder reads the most relevant parts of the encoded input. This is how the model accesses the input meaning while generating output.)
  3. BART vs T5 — what is the main architectural difference? (Answer: BART uses a standard transformer encoder-decoder with a denoising pre-training objective (corrupting input text and learning to reconstruct it). T5 uses span masking (a form of masked LM) as pre-training and frames everything as text-to-text. BART is better for generation tasks (summarisation); T5 is better for understanding tasks.)
  4. When would you choose T5 over GPT for a business NLP task? (Answer: T5 when: (1) You have a specific seq2seq task (translation, summarisation, data-to-text). (2) Input understanding is as important as generation. (3) You want to fine-tune on a specific task efficiently. GPT when: (1) You need flexible generation. (2) Zero/few-shot prompting is sufficient. (3) Chat interface is needed.)
  5. Whisper uses an encoder-decoder transformer for speech recognition. What does each part do? (Answer: Encoder: processes the mel spectrogram (audio features) with full bidirectional attention — understands the entire audio context. Decoder: auto-regressively generates transcript tokens while attending to encoder audio representations via cross-attention. Same architecture as T5/BART applied to audio-to-text.)
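The answer to question 1 can be made concrete: because T5 just emits text, classification needs a small post-processing step that maps the generated string back to a class id. A minimal sketch — the label map and fallback value here are illustrative, not part of T5 itself:

```python
# Illustrative label map for a binary sentiment task; T5 only emits text.
LABEL_TO_ID = {"negative": 0, "positive": 1}

def text_to_class(generated: str, default: int = -1) -> int:
    """Map T5's generated label string to a class id; unexpected text
    (a malformed generation) falls back to `default`."""
    return LABEL_TO_ID.get(generated.strip().lower(), default)

print(text_to_class("positive"))   # 1
print(text_to_class(" Negative"))  # 0
print(text_to_class("neutral"))    # -1
```

In practice a fine-tuned T5 almost always emits one of the trained label strings verbatim, so this mapping is trivial — but the fallback guards against the rare off-distribution generation.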

On LumiChats

T5-style text-to-text framing inspired the instruction-following format used in modern LLMs. When you prompt LumiChats, the model processes your instruction as the 'encoder input' conceptually and generates the response as 'decoder output' — the same text-to-text paradigm at massive scale.
