
T5, Seq2Seq Transformers & Encoder-Decoder NLP

Unifying every NLP task as text-to-text — translation, summarisation, QA, and more.


Definition

T5 (Text-To-Text Transfer Transformer, Google, 2019) frames every NLP task as a text-to-text problem: the input and output are always strings. Translation: input='translate English to French: Hello', output='Bonjour'. Summarisation: input='summarize: [article]', output='[summary]'. This unified framework enables training one model on all tasks simultaneously. The encoder-decoder architecture processes the input with full bidirectional attention (encoder) and generates the output auto-regressively (decoder). BART, MarianMT, mBART, and Whisper also use this architecture.

The text-to-text unification

| NLP task | T5 input format | T5 output |
|---|---|---|
| Translation | "translate English to French: How are you?" | "Comment allez-vous?" |
| Summarisation | "summarize: [long article text]" | "[1-3 sentence summary]" |
| Question answering | "question: What is the capital? context: France is a country whose capital is Paris." | "Paris" |
| Sentiment | "sentiment: This movie was absolutely terrible." | "negative" |
| Grammar check | "cola sentence: A cat in the box sat." | "unacceptable" |
| Semantic similarity (STS-B) | "stsb sentence1: ... sentence2: ..." | "3.8" (similarity score) |

T5 for multi-task NLP in a single model

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small', legacy=False)
model     = T5ForConditionalGeneration.from_pretrained('t5-small')

def t5_inference(input_text: str, max_length: int = 100) -> str:
    inputs  = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_length,
                              num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ── Multiple tasks, one model ──
tasks = [
    "translate English to French: The weather is beautiful today.",
    "summarize: The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889, it was originally criticized by some of France's leading artists and intellectuals for its design.",
    "question: What year was the Eiffel Tower completed? context: The Eiffel Tower was constructed from 1887 to 1889 by engineer Gustave Eiffel.",
    "grammar: Me want to goes to the store.",
]
for task in tasks:
    result = t5_inference(task)
    print(f"Input:  {task[:70]}...")
    print(f"Output: {result}")
    print()

# ── BART: another popular encoder-decoder (better for summarisation) ──
from transformers import BartForConditionalGeneration, BartTokenizer
bart_tok = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_mdl = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

article = """The transformer architecture was introduced in 2017 by Vaswani et al.
It replaced recurrent networks with self-attention, enabling parallel training.
BERT and GPT are both based on transformers but use different attention patterns."""

inputs = bart_tok(article, return_tensors='pt', max_length=1024, truncation=True)
summary_ids = bart_mdl.generate(**inputs, max_new_tokens=60, num_beams=4)
print("BART summary:", bart_tok.decode(summary_ids[0], skip_special_tokens=True))

Encoder-decoder architecture deep dive

Encoder: Processes the full input sequence with bidirectional self-attention — every token can attend to every other input token. Produces a rich contextual representation of the input.

Decoder: Generates output tokens auto-regressively. Each block has two types of attention: (1) masked self-attention over previously generated tokens, and (2) cross-attention over all encoder states — the decoder 'reads' the encoder output to inform generation.
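The cross-attention step can be sketched in a few lines of PyTorch. This is a toy single-head version with no learned projection matrices (a real T5 block projects Q, K, V and uses multiple heads); the point is the shapes — queries come from the decoder, keys and values from the encoder.

```python
import torch

def cross_attention(decoder_states: torch.Tensor,
                    encoder_states: torch.Tensor) -> torch.Tensor:
    """Toy single-head cross-attention: queries from the decoder,
    keys and values from the encoder hidden states."""
    d = decoder_states.shape[-1]
    Q = decoder_states                       # (tgt_len, d)
    K = V = encoder_states                   # (src_len, d)
    scores  = Q @ K.T / d ** 0.5             # (tgt_len, src_len)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V                       # (tgt_len, d)

enc = torch.randn(10, 64)  # 10 encoded input tokens
dec = torch.randn(3, 64)   # 3 decoder positions generated so far
out = cross_attention(dec, enc)
print(out.shape)  # torch.Size([3, 64])
```

Note that the output has the decoder's sequence length but is a weighted mixture of encoder states — exactly the 'reading the input while generating' behaviour described above.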

| Architecture | Attention | Key strength | Models |
|---|---|---|---|
| Encoder-only | Bidirectional self-attention | Deep understanding of input | BERT, RoBERTa, ELECTRA |
| Decoder-only | Causal self-attention | Flexible generation, scales well | GPT-4, Claude, Llama, Gemini |
| Encoder-decoder | Bidirectional encoder + cross-attention decoder | Seq2seq: input understanding + generation | T5, BART, mT5, Whisper, MarianMT |
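The self-attention patterns in the table differ only in their masks. A minimal sketch of the two masks in PyTorch (boolean True = "may attend"):

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # Decoder-style self-attention: position i may attend to 0..i only.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def bidirectional_mask(n: int) -> torch.Tensor:
    # Encoder-style self-attention: every position sees every position.
    return torch.ones(n, n, dtype=torch.bool)

print(causal_mask(3).int())   # lower-triangular 3x3 matrix of 1s
```

Cross-attention in the encoder-decoder applies no mask over the encoder states at all: every decoder position may read the entire encoded input.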

Practice questions

  1. T5 frames classification as text-to-text. How does it output a class label? (Answer: The model generates the label as a text string. For sentiment: output is literally the word "positive" or "negative". For cola acceptability: output is "acceptable" or "unacceptable". The text output is then mapped to a class. This allows a single model to handle all tasks without task-specific output heads.)
  2. What is cross-attention in the encoder-decoder and what does it connect? (Answer: Cross-attention in the decoder attends to the ENCODER hidden states (keys and values come from encoder, queries come from decoder current state). At each decoder step, the decoder reads the most relevant parts of the encoded input. This is how the model accesses the input meaning while generating output.)
  3. BART vs T5 — what is the main architectural difference? (Answer: BART uses a standard transformer encoder-decoder with a denoising pre-training objective (corrupting input text and learning to reconstruct it). T5 uses span masking (a form of masked LM) as pre-training and frames everything as text-to-text. BART is better for generation tasks (summarisation); T5 is better for understanding tasks.)
  4. When would you choose T5 over GPT for a business NLP task? (Answer: T5 when: (1) You have a specific seq2seq task (translation, summarisation, data-to-text). (2) Input understanding is as important as generation. (3) You want to fine-tune on a specific task efficiently. GPT when: (1) You need flexible generation. (2) Zero/few-shot prompting is sufficient. (3) Chat interface is needed.)
  5. Whisper uses an encoder-decoder transformer for speech recognition. What does each part do? (Answer: Encoder: processes the mel spectrogram (audio features) with full bidirectional attention — understands the entire audio context. Decoder: auto-regressively generates transcript tokens while attending to encoder audio representations via cross-attention. Same architecture as T5/BART applied to audio-to-text.)
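The answer to question 1 can be made concrete: because T5 just emits text, classification needs a small post-processing step that maps the generated string back to a class id. A minimal sketch — the label map and fallback value here are illustrative, not part of T5 itself:

```python
# Illustrative label map for a binary sentiment task; T5 only emits text.
LABEL_TO_ID = {"negative": 0, "positive": 1}

def text_to_class(generated: str, default: int = -1) -> int:
    """Map T5's generated label string to a class id; unexpected text
    (a malformed generation) falls back to `default`."""
    return LABEL_TO_ID.get(generated.strip().lower(), default)

print(text_to_class("positive"))   # 1
print(text_to_class(" Negative"))  # 0
print(text_to_class("neutral"))    # -1
```

In practice a fine-tuned T5 almost always emits one of the trained label strings verbatim, so this mapping is trivial — but the fallback guards against the rare off-distribution generation.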

On LumiChats

T5-style text-to-text framing inspired the instruction-following format used in modern LLMs. When you prompt LumiChats, the model processes your instruction as the 'encoder input' conceptually and generates the response as 'decoder output' — the same text-to-text paradigm at massive scale.
