Chapter 5: Subqueries & CTEs
Complex analytical queries often require intermediate result sets. While early versions of SQL relied heavily on nested subqueries, modern engineering standardizes on Common Table Expressions (CTEs) for readability and recursive capabilities.
I. Subquery Mechanics
A subquery is a SELECT statement nested within another statement (SELECT, INSERT, UPDATE, DELETE).
1. Non-Correlated Subqueries
These execute exactly once before the outer query. The database engine calculates the inner result set and passes it to the outer WHERE clause (usually via an IN or NOT IN operator).
-- Find employees in the 'Engineering' or 'Research' departments
SELECT first_name, last_name
FROM employees
WHERE department_id IN (
SELECT department_id
FROM departments
WHERE department_name IN ('Engineering', 'Research')
);
2. Correlated Subqueries
A correlated subquery references columns from the outer query. It acts like a nested loop: the inner query executes once for every row in the outer query. This is a severe performance bottleneck ( complexity).
-- Find employees who earn more than the average salary of their specific department
SELECT e1.first_name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e1.department_id
);
II. Common Table Expressions (CTEs)
The WITH clause defines a temporary, named result set that exists only for the duration of the query.
1. Materialization vs. Inlining
Historically, CTEs were an "optimization fence"—the database materialized the CTE into an unindexed temporary table. Modern engines (PostgreSQL 12+) "inline" CTEs, treating them like a view and pushing predicates down. If you need to force materialization for performance tuning, you use the MATERIALIZED keyword.
WITH HighEarners AS (
SELECT employee_id, department_id, salary
FROM employees
WHERE salary > 100000
),
DeptStats AS (
SELECT department_id, COUNT(*) as earner_count
FROM HighEarners
GROUP BY department_id
)
SELECT d.department_name, ds.earner_count
FROM departments d
JOIN DeptStats ds ON d.department_id = ds.department_id;
2. Recursive CTEs (Graph Traversal)
SQL is inherently declarative and not suited for loops. However, the RECURSIVE modifier allows a CTE to reference itself, enabling the traversal of hierarchical data (like an organizational chart or bill of materials).
III. Production Anti-Patterns
- Unbounded Recursive CTEs: Running a hierarchical query without a depth limit or cycle detection (e.g.,
WHERE depth < 10). A circular reference in the data (Employee A manages B, B manages A) will cause an infinite loop and crash the server. - Correlated Subquery in
SELECTList: Returning a subquery value for every row in a massive result set.SELECT name, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) FROM users u. This executes a distinct query for every user. Always rewrite usingJOINandGROUP BY. - Excessive Inlining: Defining a heavy CTE with aggregation and joining it three times in the main query. If the engine inlines it, it calculates the heavy aggregation three separate times. Use
MATERIALIZEDto calculate it once.
IV. Performance Bottlenecks
NOT INwithNULLValues: If you useWHERE id NOT IN (SELECT id FROM ...)and the subquery returns even a singleNULLvalue, the entire predicate evaluates toUNKNOWNand the outer query returns zero rows. Always useNOT EXISTS.- TempDB Spillage: When a CTE is materialized, the database stores it in tempdb (SQL Server) or work files (PostgreSQL). If the CTE is millions of rows, it will saturate the temporary disk I/O and block other queries.