Joins & Set Operations

Chapter 4: Joins & Set Operations

Relational databases achieve their power by decoupling data into normalized tables and recombining them at query time. This combination is performed using Joins (horizontal combination) and Set Operations (vertical combination). Understanding the physical execution of these operations is critical for scaling queries beyond trivial datasets.

I. Join Mechanics & Types

A JOIN operations uses a predicate (typically equality ON a.id = b.a_id) to link rows from two sets.

1. INNER JOIN

Returns only the rows that have matching values in both tables. This is the default and most restrictive join.

-- Retrieve employees and their department names
SELECT e.first_name, e.last_name, d.department_name
FROM employees e
INNER JOIN departments d 
    ON e.department_id = d.department_id;

2. OUTER JOINS (LEFT, RIGHT, FULL)

Outer joins preserve rows from one or both tables even if no match is found, filling the missing correlated columns with NULL.

  • LEFT JOIN: Preserves all rows from the left table.
  • RIGHT JOIN: Preserves all rows from the right table.
  • FULL OUTER JOIN: Preserves all rows from both tables.
-- Retrieve ALL employees, including those who are not assigned to a department
SELECT e.first_name, d.department_name
FROM employees e
LEFT JOIN departments d 
    ON e.department_id = d.department_id;

INNER JOINLEFT JOIN


II. Physical Join Algorithms

The logical SQL syntax tells the database what to join. The Cost-Based Optimizer (CBO) decides how to join them using one of three physical algorithms:

  1. Nested Loop Join: The engine iterates through the outer table. For every row, it scans the inner table for a match. O(N * M) complexity. Highly performant only if the inner table is indexed on the join key (O(N * log M)).
  2. Hash Join: The engine builds an in-memory Hash Table for the smaller table. It then scans the larger table, probing the hash table for O(1) lookups. Excellent for large, unsorted equijoins.
  3. Sort-Merge Join: The engine sorts both tables by the join key, then steps through them simultaneously like a zipper. O(N log N + M log M). Optimal when tables are massive and already indexed/sorted on the join keys.

III. Set Operations (UNION, INTERSECT, EXCEPT)

Set operations combine the results of two independent queries vertically. Both queries must return the same number of columns with compatible data types.

  • UNION ALL: Appends the results of query B to query A. Exceptionally fast.
  • UNION: Appends the results and removes duplicates. This implicitly forces a Sort or Hash Unique operation, making it significantly slower.
  • INTERSECT: Returns only rows that exist in both query A and query B.
  • EXCEPT (or MINUS): Returns rows that exist in query A but NOT in query B.
-- Find active customers who are also newsletter subscribers
SELECT email FROM active_customers
INTERSECT
SELECT email FROM newsletter_subscribers;

IV. Production Anti-Patterns

  • Blind LEFT JOIN Cascades: Stacking a dozen LEFT JOIN clauses out of fear of dropping data. This explodes the size of the intermediate working sets and prevents the optimizer from reordering joins efficiently.
  • Cartesian Products (Accidental CROSS JOIN): Omitting the ON clause or using comma-separated tables without a WHERE condition. Joining a 10k row table with a 10k row table results in 100,000,000 rows in memory, instantly crashing the query.
  • Using UNION instead of UNION ALL: Using standard UNION when you know the two sets are already mutually exclusive. You force the database engine to perform an expensive uniqueness check for no reason.

V. Performance Bottlenecks

  • Hash Spill (TempDB Saturation): When a Hash Join is chosen but the smaller table exceeds the available work_mem, the database spills the hash buckets to disk. This causes I/O thrashing and cripples performance.
  • Data Skew in Joins: Joining on a key where 90% of the rows have the same value (e.g., status = 'ACTIVE'). This causes massive CPU contention during the probe phase of a Hash Join.