Query optimisation is the process of making SQL queries execute faster. The database query optimiser automatically generates an execution plan — the sequence of operations (index scans, hash joins, sort nodes) it will use to compute your result. Understanding execution plans via EXPLAIN/EXPLAIN ANALYZE lets you identify bottlenecks: missing indexes, sequential scans on large tables, expensive sorts, and inefficient join strategies. Index creation, query rewriting, and statistics maintenance are the primary optimisation tools.
Real-life analogy: The library search strategy
Finding a book by title in a library with 1 million books: (1) No catalogue: walk every shelf in order — O(n) sequential scan. (2) With card catalogue (index): look up title → shelf number → go directly — O(log n). (3) If you search frequently by author AND title: create a composite index on (author, title). Query optimisation is deciding which catalogue to use and in what order to combine searches.
EXPLAIN and EXPLAIN ANALYZE — reading execution plans
Reading EXPLAIN output and identifying bottlenecks
-- EXPLAIN: shows the query plan (estimated costs, no actual execution)
EXPLAIN SELECT * FROM Employee WHERE DeptID = 5;
-- Output example:
-- Seq Scan on employee (cost=0.00..234.00 rows=15 width=128)
-- Filter: (deptid = 5)
-- "Seq Scan" = PROBLEM! Reading all 10,000 rows to find 15 matching ones.
-- cost=start_cost..total_cost, rows=estimated_rows, width=avg_bytes_per_row
-- EXPLAIN ANALYZE: actually runs the query + shows real vs estimated stats
EXPLAIN ANALYZE SELECT * FROM Employee WHERE DeptID = 5;
-- Output:
-- Seq Scan on employee (cost=0.00..234.00 rows=15 width=128)
-- (actual time=0.08..12.43 rows=15 loops=1)
-- Filter: (deptid = 5)
-- Rows Removed by Filter: 9985
-- Planning Time: 0.12 ms
-- Execution Time: 12.55 ms ← 12ms just to find 15 rows!
-- Create index and re-explain:
CREATE INDEX idx_employee_dept ON Employee(DeptID);
EXPLAIN ANALYZE SELECT * FROM Employee WHERE DeptID = 5;
-- Index Scan using idx_employee_dept on employee
-- (cost=0.28..3.74 rows=15 width=128)
-- (actual time=0.05..0.12 rows=15 loops=1)
-- Execution Time: 0.25 ms ← 50x faster!
-- Key execution plan nodes to know:
-- Seq Scan: Full table scan — problem for large tables with selective WHERE
-- Index Scan: B+ tree traversal — good, follows pointers to heap
-- Index Only Scan: Covering index — fastest, no heap access needed
-- Bitmap Heap Scan: Batch index lookup — used for moderate selectivity
-- Hash Join: Build hash table of smaller table, probe with larger
-- Nested Loop: For each outer row, scan inner — good when inner is small + indexed
-- Merge Join: Sort both sides, merge — good for pre-sorted or large datasets
-- Hash Aggregate: GROUP BY using hash table — faster than Sort Aggregate
-- Sort: Explicit sort — expensive for large datasets without indexIndex strategies for performance
Index types and when to use each
-- B-TREE INDEX (default): for =, <, >, BETWEEN, ORDER BY, LIKE 'prefix%'
CREATE INDEX idx_emp_salary ON Employee(Salary);
CREATE INDEX idx_emp_name ON Employee(LastName, FirstName); -- Composite
-- Composite index column order matters:
-- (LastName, FirstName) → fast for: WHERE LastName = 'Kumar'
-- WHERE LastName = 'Kumar' AND FirstName = 'Ravi'
-- NOT fast for: WHERE FirstName = 'Ravi' (can't use, leading column missing)
-- Rule: most selective column FIRST, most frequently used column FIRST
-- COVERING INDEX: include all columns needed by the query (no heap lookup)
-- Query: SELECT Name, Salary FROM Employee WHERE DeptID = 5
CREATE INDEX idx_covering ON Employee(DeptID) INCLUDE (Name, Salary);
-- "Index Only Scan" — entire result from index, never touches main table
-- PARTIAL INDEX: index only a subset of rows
-- Query: SELECT * FROM Employee WHERE Status = 'Active' AND Salary > 50000
CREATE INDEX idx_active_emp ON Employee(Salary) WHERE Status = 'Active';
-- Smaller index, faster queries on the common case
-- HASH INDEX (PostgreSQL): for equality only — slightly faster than B-tree for =
CREATE INDEX idx_hash_id ON Employee USING HASH (EmpID);
-- Cannot use for: range queries, ORDER BY, LIKE
-- GIN/GiST INDEX (PostgreSQL): for full-text search, arrays, JSON
CREATE INDEX idx_name_fts ON Employee USING GIN (to_tsvector('english', Name));
SELECT * FROM Employee WHERE to_tsvector('english', Name) @@ to_tsquery('kumar');
-- Common query optimization patterns:
-- 1. Avoid functions on indexed columns in WHERE (disables index):
-- BAD: WHERE UPPER(Name) = 'RAVI' (function on column)
-- GOOD: WHERE Name = 'Ravi' (direct comparison)
-- OR: Create a functional index: CREATE INDEX ON Employee (UPPER(Name))
-- 2. Avoid leading % in LIKE (disables index):
-- BAD: WHERE Name LIKE '%kumar' (index useless)
-- GOOD: WHERE Name LIKE 'kumar%' (can use B-tree prefix)
-- 3. Use SELECT only needed columns (avoid SELECT *):
-- SELECT * FROM Employee → reads full 128-byte row from disk
-- SELECT EmpID, Name → may use covering index, reads only 20 bytes
-- 4. Use LIMIT for TOP-N queries:
-- SELECT * FROM Employee ORDER BY Salary DESC LIMIT 10
-- With index on Salary (DESC): reads only 10 rows, no full sort neededQuery rewriting for performance
Before and after optimisation examples
-- SLOW: correlated subquery runs once per row
SELECT Name FROM Employee e
WHERE Salary > (SELECT AVG(Salary) FROM Employee WHERE DeptID = e.DeptID);
-- FAST: compute averages once with CTE, then join
WITH DeptAvg AS (
SELECT DeptID, AVG(Salary) AS AvgSalary FROM Employee GROUP BY DeptID
)
SELECT e.Name FROM Employee e JOIN DeptAvg da ON e.DeptID = da.DeptID
WHERE e.Salary > da.AvgSalary;
-- SLOW: OR on different indexed columns (may not use index)
SELECT * FROM Employee WHERE FirstName = 'Ravi' OR LastName = 'Ravi';
-- FAST: UNION ALL (each part uses its own index)
SELECT * FROM Employee WHERE FirstName = 'Ravi'
UNION ALL
SELECT * FROM Employee WHERE LastName = 'Ravi' AND FirstName != 'Ravi';
-- SLOW: Multiple separate queries in application code
-- SELECT * FROM Employee WHERE DeptID = 1;
-- SELECT * FROM Employee WHERE DeptID = 2;
-- ...for each department
-- FAST: One query with IN
SELECT * FROM Employee WHERE DeptID IN (1, 2, 3, 4, 5);
-- Check index usage and table statistics
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats WHERE tablename = 'employee';
-- Update statistics (so optimizer has accurate row estimates)
ANALYZE Employee; -- Update statistics for one table
ANALYZE; -- Update all tables
VACUUM ANALYZE Employee; -- Reclaim dead rows + update statisticsThe golden rule of SQL optimisation
Always EXPLAIN ANALYZE before and after any optimisation. Do not guess — measure. The most common wins: (1) Add index on WHERE and JOIN columns. (2) Replace SELECT * with specific columns. (3) Replace correlated subqueries with CTEs or JOINs. (4) Add LIMIT to queries that only need a few rows. (5) ANALYZE tables after bulk loads so the optimiser has accurate statistics.
Practice questions
- EXPLAIN output shows "Seq Scan" on a 10-million-row table in a WHERE clause. What should you do? (Answer: Create an index on the filter column — CREATE INDEX idx_name ON table(column). Run ANALYZE to update statistics. Then verify with EXPLAIN ANALYZE that the index is being used.)
- A composite index (LastName, FirstName) exists. Will WHERE FirstName = 'Ravi' use it? (Answer: No — composite indexes can only be used if the leading column(s) are included in the WHERE clause. A query filtering only on FirstName (not LastName) cannot use this index. Create a separate index on FirstName if needed.)
- Why does WHERE YEAR(OrderDate) = 2024 prevent index usage? (Answer: Applying a function (YEAR()) to the indexed column prevents B-tree index use — the function result is not indexed. Solutions: (1) Range query: WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01'. (2) Functional index: CREATE INDEX ON Orders (YEAR(OrderDate)).)
- What is a covering index? (Answer: An index that contains all columns needed by a query — so the database can answer the query entirely from the index without accessing the main table rows. Created with INCLUDE clause in PostgreSQL. Results in "Index Only Scan" — the fastest possible access path.)
- When would a Hash Join be preferred over a Nested Loop Join? (Answer: Hash Join: builds hash table of smaller input, probes with larger — O(n+m), best for large unsorted datasets. Nested Loop: for each outer row, scan inner — O(n*m) but with an index on inner it is O(n*log m), best when outer result is small and inner has an index.)
On LumiChats
LumiChats can analyse your EXPLAIN ANALYZE output, identify which operation is the bottleneck, suggest the right index type, and rewrite slow queries for 10x-100x performance improvements. Paste your query and execution plan output.
Try it free