Chapter 6: Aggregation & Window Functions

SQL moves beyond mere row retrieval by providing advanced analytical capabilities. The engine can group data for summary statistics (GROUP BY) or calculate running totals and rankings over a sliding frame of rows (Window Functions). These operations typically require the database to buffer large amounts of data in memory to perform hashing or sorting.

I. Standard Aggregation (GROUP BY & HAVING)

Aggregation condenses multiple rows into a single scalar value. When a query includes a GROUP BY clause, the SELECT list can only contain columns that are either explicitly grouped or wrapped in an aggregate function (e.g., SUM(), MAX()).

1. HAVING vs WHERE

The WHERE clause filters rows before aggregation occurs. The HAVING clause filters the aggregated results after they are calculated.

-- Find departments with more than 50 active employees
SELECT department_id, COUNT(employee_id) AS active_count
FROM employees
WHERE status = 'ACTIVE'       -- Filter 1: Row-level evaluation (SARGable)
GROUP BY department_id
HAVING COUNT(employee_id) > 50; -- Filter 2: Group-level evaluation (Post-Aggregation)

Pipeline: Raw Data (10M rows) → WHERE (filter to 1M rows) → GROUP BY (hash aggregate) → HAVING (filter groups)
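The two filtering stages can be observed directly. The following sketch runs the chapter's query shape against an in-memory SQLite database; the table contents and the lowered HAVING threshold are illustrative assumptions, not part of the original example.

```python
import sqlite3

# Tiny stand-in for the chapter's employees table (values are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, status TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, 10, 'ACTIVE'), (2, 10, 'ACTIVE'), (3, 10, 'INACTIVE'),
    (4, 20, 'ACTIVE'),
])

# WHERE removes rows before grouping; HAVING removes groups after aggregation.
result = conn.execute("""
    SELECT department_id, COUNT(employee_id) AS active_count
    FROM employees
    WHERE status = 'ACTIVE'          -- row-level filter: drops employee 3
    GROUP BY department_id
    HAVING COUNT(employee_id) > 1    -- group-level filter: drops department 20
""").fetchall()
print(result)  # [(10, 2)]
```

Department 20 survives the WHERE stage (it has one active employee) but is eliminated by HAVING, which only sees the already-aggregated group counts.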


II. Analytic Window Functions (OVER)

While GROUP BY collapses rows, Window Functions allow you to perform aggregations and rankings without losing the original row details. The calculation is applied to a "window" of rows related to the current row.

1. Partitioning and Ordering

The OVER clause defines the window.

  • PARTITION BY: Divides the result set into logical partitions. The function resets for each partition.
  • ORDER BY: Sorts the rows within each partition. Crucially, if ORDER BY is present without a frame clause, the default frame is from the start of the partition to the current row, creating a Running Total.
-- Rank employees by salary within each department
SELECT 
    employee_id, 
    department_id, 
    salary,
    RANK() OVER (
        PARTITION BY department_id 
        ORDER BY salary DESC
    ) as dept_salary_rank
FROM employees;
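Both behaviors described above, the per-partition reset and the default running-total frame, can be demonstrated in one query. This sketch uses an in-memory SQLite database (window functions require SQLite 3.25+); the `sales` table and its values are illustrative assumptions.

```python
import sqlite3

# With ORDER BY and no explicit frame, SUM() accumulates from the start of
# the partition to the current row, and resets at each new partition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ('east', 1, 100), ('east', 2, 50), ('east', 3, 25),
    ('west', 1, 10),  ('west', 2, 20),
])
running = conn.execute("""
    SELECT region, month,
           SUM(amount) OVER (
               PARTITION BY region
               ORDER BY month   -- default frame: partition start .. current row
           ) AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()
print(running)
# [('east', 1, 100), ('east', 2, 150), ('east', 3, 175),
#  ('west', 1, 10), ('west', 2, 30)]
```

Note that the total restarts at 10 for the `west` partition, which is exactly the reset behavior PARTITION BY guarantees.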

2. Common Analytic Functions

| Function | Behavior | Technical Description |
| --- | --- | --- |
| ROW_NUMBER() | 1, 2, 3, 4 | Assigns a sequential integer to each row within the partition. |
| RANK() | 1, 1, 3, 4 | Assigns ranks. Ties get the same rank, leaving gaps in the sequence. |
| DENSE_RANK() | 1, 1, 2, 3 | Assigns ranks without gaps. |
| LEAD(col, n) | Returns value n rows ahead | Accesses data from subsequent rows in the window without self-joins. |
| LAG(col, n) | Returns value n rows behind | Accesses data from preceding rows. Highly effective for calculating month-over-month growth. |
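The ranking behaviors in the table can be reproduced side by side. This sketch assumes a small `scores` table with a tie at the top score, which is what produces the 1, 1, 3, 4 versus 1, 1, 2, 3 difference.

```python
import sqlite3

# Illustrative data: 'a' and 'b' tie at 95, exercising the gap behavior.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [('a', 95), ('b', 95), ('c', 90), ('d', 80)])
rows = conn.execute("""
    SELECT name,
           ROW_NUMBER()  OVER (ORDER BY score DESC) AS rn,
           RANK()        OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK()  OVER (ORDER BY score DESC) AS drnk,
           LAG(score, 1) OVER (ORDER BY score DESC) AS prev_score
    FROM scores
    ORDER BY score DESC, name
""").fetchall()
for row in rows:
    print(row)
# rnk runs 1, 1, 3, 4 (gap after the tie); drnk runs 1, 1, 2, 3 (no gap).
# rn is a unique 1..4, but its order among the tied rows is arbitrary,
# and prev_score is NULL on the first row of the window.
```

Because ROW_NUMBER() must break the tie somehow, its assignment between the two 95s is nondeterministic unless the ORDER BY inside OVER is made unique.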

III. Production Anti-Patterns

  • Filtering Window Functions in WHERE: You cannot use a window function in a WHERE clause because WHERE evaluates before window functions are computed. Compute the window function in a CTE or derived table, then filter in the outer query.
  • GROUP BY without Indexes: Grouping on unindexed columns forces the database to perform a massive Hash Aggregate, building a hash table over the entire grouping column in memory, which can exhaust the hash memory budget (work_mem in PostgreSQL) and spill to disk.
  • Conflating COUNT(*) with COUNT(col): COUNT(*) counts every row (COUNT(1) is equivalent), while COUNT(col) counts only rows where col is non-NULL. Mixing these up leads to inaccurate analytic reporting.
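The first and third anti-patterns can be shown concretely. The sketch below uses the CTE workaround for a top-earner-per-department query, then demonstrates the COUNT(*) versus COUNT(col) distinction; the table, salaries, and the nullable `manager_id` column are illustrative assumptions.

```python
import sqlite3

# Illustrative employees table (values are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, 10, 120), (2, 10, 100), (3, 20, 90), (4, 20, 95),
])

# WHERE cannot reference RANK() directly, so rank inside a CTE
# and filter in the outer query.
top_paid = conn.execute("""
    WITH ranked AS (
        SELECT employee_id, department_id, salary,
               RANK() OVER (PARTITION BY department_id
                            ORDER BY salary DESC) AS dept_salary_rank
        FROM employees
    )
    SELECT employee_id, department_id, salary
    FROM ranked
    WHERE dept_salary_rank = 1   -- legal here; illegal inside the inner query's WHERE
    ORDER BY department_id
""").fetchall()
print(top_paid)  # [(1, 10, 120), (4, 20, 95)]

# COUNT(*) vs COUNT(col): add a column that is NULL for every row.
conn.execute("ALTER TABLE employees ADD COLUMN manager_id INTEGER")
counts = conn.execute("SELECT COUNT(*), COUNT(manager_id) FROM employees").fetchone()
print(counts)  # (4, 0) -- COUNT(col) skips NULLs
```

Attempting `WHERE RANK() OVER (...) = 1` in the inner query would raise a syntax/misuse error in SQLite and PostgreSQL alike, which is why the CTE indirection is the standard pattern.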

IV. Performance Bottlenecks

  • Window Function Sorting (Sort Spill): OVER (PARTITION BY x ORDER BY y) forces the database to sort the input by (x, y) before evaluating the function. If the working set exceeds the available sort memory (e.g., a 100 GB dataset), the sort spills to disk temp files. Ensure (x, y) is covered by a composite B+Tree index so the engine can read the data pre-sorted.
  • Distinct Aggregations: COUNT(DISTINCT user_id) is a known bottleneck. The database must sort or hash every user ID to remove duplicates before counting. On massive datasets (e.g., billions of rows), consider probabilistic data structures like HyperLogLog (if the engine supports them) for order-of-magnitude speedups at the cost of a small (typically 1–2%) error margin.