Chapter 7: Large-Scale Databases & Big Data

When a dataset grows beyond the storage or I/O capacity of a single server, database engineers must transition from Vertical Scaling (scaling up) to Horizontal Scaling (scaling out). This involves managing distributed data locality, minimizing network shuffles, and choosing specialized storage formats for analytical (OLAP) vs. transactional (OLTP) workloads.

I. Scaling Strategies: Sharding & Shared-Nothing

The primary architecture for large-scale databases is Shared-Nothing, where each node in the cluster is independent and has its own CPU, memory, and disk. Data is distributed across these nodes using Sharding.
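A minimal sketch of how sharding routes a record to a node, assuming a hypothetical four-node cluster and a user_id shard key (the key name and cluster size are illustrative):

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key: str) -> int:
    """Map a shard key to a node via a stable hash.

    A stable hash (md5 here, rather than Python's built-in hash(),
    which is salted per-process) keeps routing deterministic across
    restarts, so the same key always lands on the same node.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Any query filtering on user_id can be routed to exactly one node:
print(shard_for("user_42"))
```

Real systems typically layer consistent hashing or range partitioning on top of this idea so that adding a node does not remap every key.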

[Figure: Scale Up (Shared-Memory) vs. Scale Out (Shared-Nothing Architecture)]

1. Data Locality & The Shuffle Problem

A "Good" shard key ensures that most queries can be satisfied by a single node (a Targeted Query). If a query requires data from all nodes (e.g., a global SUM or JOIN), the system must perform a Shuffle, redistributing gigabytes of data between nodes over the network before a coordinator can merge the results. Shuffling is the primary bottleneck in distributed systems, often gated by top-of-rack switch bandwidth.
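The contrast between a Targeted Query and a scatter-gather aggregation can be sketched with an in-memory toy (three hypothetical shards keyed by user_id; the routing rule is deliberately trivial):

```python
# Three hypothetical shards, keyed by user_id (the shard key).
shards = [
    {"user_0": [10.0, 5.0]},
    {"user_1": [7.5]},
    {"user_2": [1.0, 2.0, 3.0]},
]

def route(user_id: str) -> int:
    # Toy routing: the numeric suffix picks the shard.
    return int(user_id.split("_")[1]) % len(shards)

def targeted_sum(user_id: str) -> float:
    """Shard key in the predicate: exactly one node is touched."""
    node = shards[route(user_id)]
    return sum(node.get(user_id, []))

def global_sum() -> float:
    """No shard key: every node computes a partial sum and the
    coordinator merges them. The network transfer of partials
    (or, for JOINs, of raw rows) is the shuffle."""
    partials = [sum(v for vals in node.values() for v in vals)
                for node in shards]
    return sum(partials)
```

For an aggregation, each node ships back only a small partial result; a distributed JOIN without co-located keys is far worse, because raw rows, not summaries, must cross the network.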

II. Columnar Storage & Vectorized Execution (OLAP)

Transactional databases store data in rows (best for single-record lookups). Analytical databases (e.g., ClickHouse, Snowflake) store data in Columns.

  • I/O Efficiency: For a query like SELECT AVG(price), a columnar engine only reads the price column from disk, ignoring hundreds of other attributes.
  • Vectorization: Modern OLAP engines process data in batches (vectors) of 1024 rows using SIMD instructions. A single CPU cycle can perform a filter or addition on multiple values simultaneously, achieving 10-100x speedups over traditional iterative processing.
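A sketch of why column layout plus batch processing pays off, using NumPy as a stand-in for a SIMD-backed execution engine (its array operations run as tight compiled loops rather than one Python-level iteration per row); the row schema here is illustrative:

```python
import numpy as np

# Row store: a list of records. Computing AVG(price) still forces
# the engine to walk past every other field of every record.
rows = [{"id": i, "price": float(i % 100), "sku": f"sku-{i}"}
        for i in range(10_000)]

# Column store: each attribute is one contiguous array; a price
# query reads only this array.
price = np.array([r["price"] for r in rows])

# Tuple-at-a-time (iterative) average:
avg_iter = sum(r["price"] for r in rows) / len(rows)

# Vectorized average: one call processes the entire column in
# batches, the same pattern OLAP engines apply per 1024-row vector.
avg_vec = float(price.mean())
```

Timing the two approaches on a few million rows reproduces the order-of-magnitude gap described above, even before disk I/O savings are counted.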

III. Production Anti-Patterns

  • Skewed Sharding: Using a key like country where 90% of data is in one value (e.g., "US"), creating a "Hot Node" while others remain idle.
  • Small File Problem in Data Lakes: Generating millions of 1KB files in S3. The metadata overhead of listing these files often takes longer than reading the data itself. Use Compaction to merge them into 128MB+ files.
  • Over-reliance on Global Joins: Designing a distributed schema that requires joining large tables across shards for standard app flows.
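The Compaction fix for the small-file problem can be sketched as follows, assuming local files for simplicity (against S3 the same loop would iterate over object keys and multipart-upload the merged output):

```python
from pathlib import Path

TARGET_SIZE = 128 * 1024 * 1024  # 128 MB target per compacted file

def _flush(out_dir: Path, chunks: list, idx: int) -> Path:
    out = out_dir / f"part-{idx:05d}.bin"
    out.write_bytes(b"".join(chunks))
    return out

def compact(src_dir: str, dst_dir: str) -> list:
    """Concatenate many small files into ~TARGET_SIZE outputs,
    shrinking the metadata listing from millions of entries
    to a handful."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs, buf, buf_size = [], [], 0
    for f in sorted(Path(src_dir).iterdir()):
        data = f.read_bytes()
        buf.append(data)
        buf_size += len(data)
        if buf_size >= TARGET_SIZE:
            outputs.append(_flush(out_dir, buf, len(outputs)))
            buf, buf_size = [], 0
    if buf:  # flush the final partial batch
        outputs.append(_flush(out_dir, buf, len(outputs)))
    return outputs
```

Production table formats (e.g., Delta Lake, Apache Iceberg) ship this as a built-in maintenance operation rather than leaving it to ad-hoc scripts.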

IV. Performance Bottlenecks

  • Metadata Chokepoint: In systems like HDFS or early Hive, the central metadata service (the HDFS NameNode, or the Hive Metastore) can become a bottleneck as the number of files grows into the millions.
  • Tail Latency (P99) in Clusters: In a 100-node cluster, a single "Slow Node" (due to a failing disk or garbage-collection pauses) can slow down the entire global query, as the coordinator must wait for the slowest response.
  • Shuffle Spill: When a distributed JOIN or SORT exceeds the available executor RAM, it spills to local disk, slowing the operation by several orders of magnitude.
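The tail-latency effect above can be quantified: if each of N nodes independently responds slowly 1% of the time, the probability that a fan-out query waits on at least one slow node is 1 - 0.99^N. A minimal sketch:

```python
def p_any_slow(n_nodes: int, p_slow: float = 0.01) -> float:
    """Probability that a scatter-gather query over n nodes must
    wait on at least one slow responder (independence assumed)."""
    return 1 - (1 - p_slow) ** n_nodes

# One node is rarely slow; a 100-node fan-out almost always is.
print(f"{p_any_slow(1):.2%}")    # 1.00%
print(f"{p_any_slow(100):.2%}")  # 63.40%
```

This is why large clusters rely on mitigations such as hedged (duplicate) requests and speculative re-execution of straggler tasks.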