Machine Translation (MT) is the automatic conversion of text from one language to another while preserving meaning. The field progressed from rule-based systems (1950s) to statistical phrase-based models (1990s) to neural sequence-to-sequence (Seq2Seq) models (2014) to transformer-based models (2017+). Google Translate, DeepL, and modern LLMs all use transformer architectures. The Seq2Seq encoder-decoder framework and the attention mechanism are foundational concepts that also underpin summarisation, question answering, and dialogue systems.
## Real-life analogy: The professional interpreter
A human interpreter at a conference listens to the full sentence (encoding), holds it in memory, then speaks the translation (decoding). They do not translate word-by-word — they wait for full context before producing output. The Seq2Seq encoder-decoder mirrors this exactly: the encoder reads the full source sentence into a fixed-size context vector (like working memory), and the decoder generates the target sentence token by token from this context.
## Evolution of machine translation
| Era | Approach | Pros | Cons | Example system |
|---|---|---|---|---|
| 1950s-1980s | Rule-based MT (handcrafted grammar + dictionaries) | Predictable, controllable | Brittle, cannot scale to all exceptions | SYSTRAN |
| 1990s-2010s | Statistical MT (phrase alignment from parallel corpora) | Learns from data, handles idioms | Short-range context only, large memory | Google Translate v1, Moses |
| 2014-2017 | Neural Seq2Seq (LSTM encoder-decoder) | End-to-end learning, long-range context | Fixed-size bottleneck, slow training | Google Neural MT 2016 |
| 2017-present | Transformer (self-attention) | Parallelisable, SOTA quality | Huge compute, expensive to train | Google Translate, DeepL, GPT-4 |
## Seq2Seq encoder-decoder architecture
The Seq2Seq framework has two components:

- **Encoder**: reads the source sentence token by token and produces a context vector (the final hidden state).
- **Decoder**: generates the target sentence auto-regressively, initialised with the encoder context vector.
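To make the two roles concrete, here is a minimal NumPy sketch of an (untrained) RNN encoder-decoder. The vocabulary size, hidden size, random weights, and the BOS/EOS token ids are toy assumptions for illustration, not part of any real MT system:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, EMB = 10, 8, 6  # toy sizes (assumed)

# --- Encoder: a plain RNN that folds the source into one context vector ---
E_src = rng.normal(scale=0.1, size=(VOCAB, EMB))   # source embeddings
W_xh  = rng.normal(scale=0.1, size=(EMB, HIDDEN))
W_hh  = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))

def encode(src_ids):
    h = np.zeros(HIDDEN)
    for t in src_ids:                    # read token by token
        h = np.tanh(E_src[t] @ W_xh + h @ W_hh)
    return h                             # final hidden state = context vector

# --- Decoder: generates target tokens auto-regressively from the context ---
E_tgt = rng.normal(scale=0.1, size=(VOCAB, EMB))
V_xh  = rng.normal(scale=0.1, size=(EMB, HIDDEN))
V_hh  = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))
BOS, EOS = 0, 1                          # assumed special-token ids

def decode(context, max_len=5):
    h, token, out = context, BOS, []
    for _ in range(max_len):
        h = np.tanh(E_tgt[token] @ V_xh + h @ V_hh)
        token = int(np.argmax(h @ W_out))  # greedy pick of next token
        if token == EOS:
            break
        out.append(token)
    return out

print(decode(encode([3, 4, 5])))  # untrained, so the ids are arbitrary
```

Note how the decoder never sees the source tokens directly; everything it knows about the source must pass through the single `context` vector, which is exactly the bottleneck discussed below.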
### Seq2Seq translation with Hugging Face MarianMT
```python
from transformers import MarianMTModel, MarianTokenizer

# MarianMT: Helsinki-NLP transformer translation models (Marian NMT framework)
model_name = "Helsinki-NLP/opus-mt-en-fr"  # English to French
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text: str) -> str:
    inputs = tokenizer([text], return_tensors="pt",
                       padding=True, truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        num_beams=4,  # Beam search: consider top-4 candidates
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

sentences = [
    "The cat sat on the mat.",
    "Machine learning is transforming natural language processing.",
    "I would like a coffee please.",
]
for s in sentences:
    print(f"EN: {s}")
    print(f"FR: {translate(s)}")
    print()
```

## The bottleneck problem and attention
The original Seq2Seq model compressed the entire source sentence into a single fixed-size vector — the encoder hidden state. For long sentences (50+ words), this bottleneck loses information. Bahdanau attention (2015) solved this: the decoder attends directly to all encoder hidden states at each generation step, computing a weighted sum to focus on relevant source words. This attention mechanism is the direct ancestor of the Transformer.
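The "weighted sum over encoder states" idea fits in a few lines of NumPy. For brevity this sketch scores alignments with a simple dot product rather than Bahdanau's learned additive MLP, and the encoder states are hand-made toy vectors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder hidden states: one 4-dim vector per source token (assumed values)
enc_states = np.array([[1.0, 0.0, 0.0, 0.0],   # "the"
                       [0.0, 1.0, 0.0, 0.0],   # "cat"
                       [0.0, 0.0, 1.0, 0.0]])  # "sat"

# Current decoder state, constructed to resemble the "cat" encoding
dec_state = np.array([0.1, 0.9, 0.0, 0.0])

scores  = enc_states @ dec_state   # one alignment score per source position
weights = softmax(scores)          # attention distribution over source tokens
context = weights @ enc_states     # weighted sum of encoder states

print(weights.round(3))            # highest weight falls on "cat"
```

Because the context vector is recomputed at every decoding step, each target word can draw on whichever source positions are relevant to it, instead of everything being squeezed through one fixed vector.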
## BLEU score: evaluating translation quality
BLEU (Bilingual Evaluation Understudy) scores a hypothesis translation against one or more references: BLEU = BP · exp(Σ_{n=1..N} w_n log p_n), where p_n is the modified precision of n-gram matches between hypothesis and reference, w_n = 1/N is a uniform weight (typically N = 4), and BP is a brevity penalty that penalises translations shorter than the reference. Scores range from 0 (no match) to 1 (perfect match). BLEU remains the industry-standard MT metric despite known limitations.
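The formula can be traced by hand with a short standard-library sketch; the count clipping and brevity penalty below follow the standard BLEU definition for a single reference:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, N=4):
    # Modified n-gram precisions p_n: hypothesis counts clipped by reference counts
    precisions = []
    for n in range(1, N + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())
        precisions.append(overlap / max(sum(hyp.values()), 1))
    # Brevity penalty BP
    r, c = len(reference), len(hypothesis)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean with uniform weights w_n = 1/N
    if min(precisions) == 0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)

ref = "the cat is on the mat".split()
hyp = "the cat is on mat".split()
print(round(bleu(ref, hyp), 3))  # 0.579
```

Here p_1..p_4 are 5/5, 3/4, 2/3, and 1/2, and BP = exp(1 − 6/5) ≈ 0.819, which multiplies out to about 0.58.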
### Computing BLEU score with NLTK
```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
hypothesis = ["the", "cat", "is", "on", "mat"]  # missing "the"

smooth = SmoothingFunction().method1
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=smooth)
print(f"BLEU score: {score:.3f}")  # ~0.58

# For corpus-level BLEU (more reliable):
references = [[["the", "cat", "is", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "mat"]]
corpus_score = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {corpus_score:.3f}")
```

## Practice questions
- What is the bottleneck problem in vanilla Seq2Seq? (Answer: The entire source sentence is compressed into one fixed-size vector. Long sentences lose information, causing quality degradation. Attention mechanisms solve this.)
- What does beam search do in decoder generation? (Answer: Instead of greedily picking the single best token at each step, it tracks the top-k most likely sequences (beams) in parallel, leading to better overall translation quality.)
- BLEU score of 1.0 means what? Is it achievable in practice? (Answer: Perfect match with reference translation. Rarely achieved — even human translators disagree on wording, so BLEU >0.6 is considered excellent.)
- Why is machine translation harder for agglutinative languages (Finnish, Turkish)? (Answer: One root word can take hundreds of grammatical suffixes. "talossanikin" (Finnish, meaning "even in my house") is a single word built from the root "talo" plus three suffixes. Sub-word tokenisation (BPE/SentencePiece) is critical.)
- What is the key architectural difference between LSTM Seq2Seq and Transformer for translation? (Answer: LSTM processes tokens sequentially (cannot parallelise). Transformer uses self-attention over all positions simultaneously — enabling parallelism and capturing long-range dependencies.)
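The beam-search answer above can be made concrete with a toy example. The table of next-token log-probabilities is invented for illustration: greedy decoding commits to "a" first (0.6 > 0.4) and ends with total probability 0.6 × 0.3 = 0.18, while a beam of width 2 keeps "b" alive and finds "b x" with probability 0.4 × 0.9 = 0.36:

```python
import math

# Toy next-token log-probabilities, keyed by the sequence so far (assumed values)
LOGPROBS = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.3), "y": math.log(0.3)},
    ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
}

def beam_search(k=2, length=2):
    beams = [((), 0.0)]                  # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in LOGPROBS.get(seq, {}).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the k highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]

print(beam_search())  # ('b', 'x')
```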
## On LumiChats
LumiChats supports 40+ multilingual AI models, many of which use transformer-based translation. You can ask LumiChats to translate text, compare translations across models, or explain the nuances between different translations of the same sentence.
Try it free