Chapter 6: Aggregation & Window Functions
SQL moves beyond mere row retrieval by providing advanced analytical capabilities. The engine can group data for summary statistics (GROUP BY) or calculate running totals and rankings over a sliding frame of rows (Window Functions). These operations typically require the database to buffer large amounts of data in memory to perform hashing or sorting.
I. Standard Aggregation (GROUP BY & HAVING)
Aggregation condenses multiple rows into a single scalar value. When a query includes a GROUP BY clause, the SELECT list can only contain columns that are either explicitly grouped or wrapped in an aggregate function (e.g., SUM(), MAX()).
1. HAVING vs WHERE
The WHERE clause filters rows before aggregation occurs. The HAVING clause filters the aggregated results after they are calculated.
-- Find departments with more than 50 active employees
SELECT department_id, COUNT(employee_id) AS active_count
FROM employees
WHERE status = 'ACTIVE' -- Filter 1: Row-level evaluation (SARGable)
GROUP BY department_id
HAVING COUNT(employee_id) > 50; -- Filter 2: Group-level evaluation (Post-Aggregation)
II. Analytic Window Functions (OVER)
While GROUP BY collapses rows, Window Functions allow you to perform aggregations and rankings without losing the original row details. The calculation is applied to a "window" of rows related to the current row.
1. Partitioning and Ordering
The OVER clause defines the window.
PARTITION BY: Divides the result set into logical partitions. The function resets for each partition.ORDER BY: Sorts the rows within each partition. Crucially, ifORDER BYis present without a frame clause, the default frame is from the start of the partition to the current row, creating a Running Total.
-- Rank employees by salary within each department
SELECT
employee_id,
department_id,
salary,
RANK() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) as dept_salary_rank
FROM employees;
2. Common Analytic Functions
| Function | Behavior | Technical Description |
|---|---|---|
ROW_NUMBER() | 1, 2, 3, 4 | Assigns a sequential integer to each row within the partition. |
RANK() | 1, 1, 3, 4 | Assigns ranks. Ties get the same rank, leaving gaps in the sequence. |
DENSE_RANK() | 1, 1, 2, 3 | Assigns ranks without gaps. |
LEAD(col, n) | Returns value n rows ahead | Accesses data from subsequent rows in the window without self-joins. |
LAG(col, n) | Returns value n rows behind | Accesses data from preceding rows. Highly effective for calculating month-over-month growth. |
III. Production Anti-Patterns
- Filtering Window Functions in
WHERE: You cannot use a window function in aWHEREclause becauseWHEREevaluates before theOVERclause is processed. Always wrap the window function in a CTE, then filter the CTE. GROUP BYwithout Indexes: Grouping on unindexed columns forces the database to perform a massive Hash Aggregate, building a hash table of the entire column in memory, which leads towork_memexhaustion.- Using
COUNT(*)overCOUNT(1)orCOUNT(col):COUNT(*)counts rows,COUNT(col)ignoresNULLvalues in that column. Mixing these up leads to inaccurate analytic reporting.
IV. Performance Bottlenecks
- Window Function Sorting (Sort Spill):
OVER (PARTITION BY x ORDER BY y)forces the database to perform an in-memory sort for every single partition. If the dataset is 100GB, the sort overflows to disk temp files. Always ensure(x, y)is covered by a composite B+ Tree index to allow the engine to read the data pre-sorted. - Distinct Aggregations:
COUNT(DISTINCT user_id)is a known bottleneck. The database must sort or hash every user ID to remove duplicates before counting. On massive datasets (e.g., billions of rows), consider using probabilistic data structures like HyperLogLog (if supported by the engine) for a 100x speedup with a 1% error margin.