Subqueries & CTEs

Chapter 5: Subqueries & CTEs

Complex analytical queries often require intermediate result sets. While early versions of SQL relied heavily on nested subqueries, modern engineering standardizes on Common Table Expressions (CTEs) for readability and recursive capabilities.

I. Subquery Mechanics

A subquery is a SELECT statement nested within another statement (SELECT, INSERT, UPDATE, DELETE).

1. Non-Correlated Subqueries

These execute exactly once before the outer query. The database engine calculates the inner result set and passes it to the outer WHERE clause (usually via an IN or NOT IN operator).

-- Find employees in the 'Engineering' or 'Research' departments
SELECT first_name, last_name
FROM employees
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE department_name IN ('Engineering', 'Research')
);

2. Correlated Subqueries

A correlated subquery references columns from the outer query. It acts like a nested loop: the inner query executes once for every row in the outer query. This is a severe performance bottleneck (O(N×M)O(N \times M) complexity).

-- Find employees who earn more than the average salary of their specific department
SELECT e1.first_name, e1.salary
FROM employees e1
WHERE e1.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id
);

II. Common Table Expressions (CTEs)

The WITH clause defines a temporary, named result set that exists only for the duration of the query.

1. Materialization vs. Inlining

Historically, CTEs were an "optimization fence"—the database materialized the CTE into an unindexed temporary table. Modern engines (PostgreSQL 12+) "inline" CTEs, treating them like a view and pushing predicates down. If you need to force materialization for performance tuning, you use the MATERIALIZED keyword.

WITH HighEarners AS (
    SELECT employee_id, department_id, salary
    FROM employees
    WHERE salary > 100000
),
DeptStats AS (
    SELECT department_id, COUNT(*) as earner_count
    FROM HighEarners
    GROUP BY department_id
)
SELECT d.department_name, ds.earner_count
FROM departments d
JOIN DeptStats ds ON d.department_id = ds.department_id;

2. Recursive CTEs (Graph Traversal)

SQL is inherently declarative and not suited for loops. However, the RECURSIVE modifier allows a CTE to reference itself, enabling the traversal of hierarchical data (like an organizational chart or bill of materials).

Base CaseSELECT CEO (Manager=NULL)UNION ALLRecursive StepJOIN Emp ON Mgr=CTE.idResult SetAppend to Final Table


III. Production Anti-Patterns

  • Unbounded Recursive CTEs: Running a hierarchical query without a depth limit or cycle detection (e.g., WHERE depth < 10). A circular reference in the data (Employee A manages B, B manages A) will cause an infinite loop and crash the server.
  • Correlated Subquery in SELECT List: Returning a subquery value for every row in a massive result set. SELECT name, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) FROM users u. This executes a distinct query for every user. Always rewrite using JOIN and GROUP BY.
  • Excessive Inlining: Defining a heavy CTE with aggregation and joining it three times in the main query. If the engine inlines it, it calculates the heavy aggregation three separate times. Use MATERIALIZED to calculate it once.

IV. Performance Bottlenecks

  • NOT IN with NULL Values: If you use WHERE id NOT IN (SELECT id FROM ...) and the subquery returns even a single NULL value, the entire predicate evaluates to UNKNOWN and the outer query returns zero rows. Always use NOT EXISTS.
  • TempDB Spillage: When a CTE is materialized, the database stores it in tempdb (SQL Server) or work files (PostgreSQL). If the CTE is millions of rows, it will saturate the temporary disk I/O and block other queries.