Text summarisation automatically produces a shorter version of a document while retaining its most important information. There are two paradigms: **extractive** summarisation selects and stitches together key sentences directly from the source, while **abstractive** summarisation generates new sentences that may not appear verbatim in the source, much as a human summariser would. Modern approaches use transformer models (BART, T5, Pegasus) for abstractive summarisation and achieve near-human quality on news and scientific articles.
Real-life analogy: Two types of students
Imagine two students summarising a textbook chapter. The first student highlights key sentences and copies them verbatim into their notes — this is extractive summarisation. The second student reads the whole chapter, understands it, and writes the key ideas in their own words — this is abstractive summarisation. The second approach is more human-like but requires genuine understanding, not just pattern-matching.
Extractive summarisation
Extractive methods score each sentence by importance and select the top-k sentences. Classic algorithms include TF-IDF scoring (sentences containing high-TF-IDF words are considered important), TextRank (graph-based, similar to PageRank: sentences are nodes, edges are weighted by inter-sentence similarity), and LSA (Latent Semantic Analysis using SVD).
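The score-then-select idea behind these methods can be sketched in plain Python. Below is a deliberately simplified frequency-based scorer, a rough stand-in for TF-IDF scoring; the stop-word list and example sentences are made up for illustration:

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "with", "was"}

def tokenise(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def extractive_summary(text, k=2):
    """Score each sentence by mean content-word frequency; keep the
    top-k sentences, returned in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(tokenise(text))  # document-level word frequencies

    def score(sentence):
        tokens = tokenise(sentence)
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

doc = ("NLP combines linguistics with machine learning. "
       "Machine learning models process text. "
       "The weather was nice yesterday.")
print(extractive_summary(doc))  # the off-topic weather sentence is dropped
```

TextRank replaces the frequency score with a PageRank-style centrality computed over a sentence-similarity graph, but the final select-top-k step is the same.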
Extractive summarisation with TextRank (sumy library)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
text = """
Natural Language Processing (NLP) is a branch of artificial intelligence
that deals with the interaction between computers and humans through language.
The ultimate objective of NLP is to read, decipher, understand, and make sense
of the human language in a manner that is valuable. NLP combines computational
linguistics with statistical, machine learning, and deep learning models.
These technologies enable computers to process human language in the form of
text or voice data. Applications include machine translation, sentiment analysis,
named entity recognition, speech recognition, and question answering systems.
"""
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=2)
for sentence in summary:
    print(str(sentence))
# Alternative: BERT-based extractive (BertSum)
# pip install bert-extractive-summarizer
from summarizer import Summarizer
model = Summarizer()
summary = model(text, min_length=50, max_length=150)
print(summary)

Abstractive summarisation with transformers
BART (Facebook, 2019): Denoising autoencoder pre-trained to reconstruct corrupted text, fine-tuned on CNN/DailyMail for summarisation. T5 (Google, 2019): Text-To-Text Transfer Transformer — frames all NLP tasks as text-to-text. Pegasus (Google, 2020): Pre-trained specifically for summarisation by masking entire sentences (Gap Sentences Generation).
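Pegasus's Gap Sentences Generation objective can be illustrated as a data-construction step: whole sentences are removed from the input and become the generation target. A minimal sketch in plain Python (the `<mask_1>` token name is illustrative, not Pegasus's actual special token):

```python
import re

def make_gsg_example(paragraph, gap_idx):
    """Build one Gap Sentences Generation training pair: the sentence at
    gap_idx is masked out of the input and becomes the target to generate."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    target = sentences[gap_idx]
    masked = list(sentences)
    masked[gap_idx] = "<mask_1>"  # illustrative sentence-level mask token
    return " ".join(masked), target

para = ("NLP enables computers to process language. "
        "Summarisation condenses documents. "
        "ROUGE measures overlap with references.")
src, tgt = make_gsg_example(para, gap_idx=1)
print(src)  # middle sentence replaced by the mask token
print(tgt)  # "Summarisation condenses documents."
```

In the real pre-training setup, several high-importance sentences are selected (e.g. by ROUGE against the rest of the document) and masked together, which makes generating them closely resemble summarising the remaining text.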
Abstractive summarisation with BART
from transformers import pipeline
summarizer = pipeline("summarization",
model="facebook/bart-large-cnn",
device=-1) # CPU; use device=0 for GPU
article = """
The artificial intelligence startup Anthropic, founded by former OpenAI
employees, announced a major funding round that values the company at
over 15 billion dollars. The company is known for its Claude AI assistant,
which competes directly with OpenAI's ChatGPT and Google's Gemini.
Anthropic focuses heavily on AI safety research, publishing papers on
constitutional AI and interpretability. The new funding will be used to
train more powerful models and expand research into safe and beneficial AI.
Investors include Google, Amazon, and several venture capital firms.
"""
summary = summarizer(article,
max_length=80,
min_length=30,
do_sample=False, # greedy / beam decoding
num_beams=4,
)[0]["summary_text"]
print(summary)
# "Anthropic, founded by former OpenAI employees, has raised funding valuing
# the company at over 15 billion dollars. The AI safety startup is known for
# its Claude assistant, which competes with ChatGPT and Gemini."

ROUGE score — evaluating summaries
ROUGE-N: recall of N-gram overlaps between hypothesis summary and reference. ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence). Higher = better overlap with human reference summaries.
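The ROUGE-N recall described above is simple enough to compute by hand. A minimal sketch with whitespace tokenisation (real implementations such as the `rouge_score` package add stemming and proper tokenisation):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(hypothesis, reference, n=2):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(hyp[g], ref[g]) for g in ref)  # clipped counts
    return overlap / sum(ref.values())

ref = "the cat sat on the mat"
hyp = "the cat sat on a mat"
print(round(rouge_n_recall(hyp, ref, n=1), 3))  # 5 of 6 reference unigrams match
print(round(rouge_n_recall(hyp, ref, n=2), 3))  # 3 of 5 reference bigrams match
```

Note the clipping via `min`: a hypothesis cannot earn credit for repeating an n-gram more times than it appears in the reference.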
| Method | Copies source text? | Quality | Speed | Best for |
|---|---|---|---|---|
| TF-IDF scoring | Yes (sentences) | Baseline | Very fast | Quick extraction, news headlines |
| TextRank | Yes (sentences) | Good | Fast | Unsupervised, no training data |
| BERT extractive | Yes (sentences) | Better | Medium | When training data scarce |
| BART/T5 abstractive | No (generates new) | SOTA | Slow (GPU needed) | Production, research, journalism |
Practice questions
- What is the key difference between extractive and abstractive summarisation? (Answer: Extractive selects existing sentences verbatim. Abstractive generates new sentences that paraphrase the content — closer to human summarisation.)
- Why might extractive summarisation produce incoherent summaries? (Answer: Selected sentences are stitched together without considering discourse coherence — they may use pronouns whose referents are in non-selected sentences, causing confusion.)
- ROUGE-2 measures overlap of: (Answer: Bigrams (2-word sequences) between the generated summary and the reference summary. Higher ROUGE-2 indicates the model captures more two-word phrases from the reference.)
- Pegasus was pre-trained with Gap Sentences Generation. What is this? (Answer: Entire sentences are masked during pre-training, and the model must generate them from context — this directly trains the model for summarisation since generating masked sentences is similar to summarising the remaining context.)
- When would you prefer extractive over abstractive summarisation? (Answer: When faithfulness is critical (legal, medical) — extractive cannot hallucinate since it only uses source sentences. Abstractive models may generate plausible-sounding but incorrect facts.)
On LumiChats
LumiChats can summarise PDFs, web pages, and long documents using abstractive summarisation models. The Study Mode feature specifically uses summarisation to create concise notes from textbook chapters — just paste the text and ask for a summary at any detail level.
Try it free