Chapter 3: Relational Database Specification (RDBMS)
Relational Databases (RDBMS) are based on the Relational Model introduced by E.F. Codd in 1970. This model is rooted in First-Order Predicate Logic and Set Theory, where a relation is mathematically defined as a set of tuples that share the same attributes. Unlike the navigational models that preceded it (hierarchical and network), the relational model emphasizes Logical Data Independence, meaning the way data is queried is decoupled from how it is physically stored. This abstraction is governed by the Closed World Assumption, which posits that the database contains all the truths of the domain; if a fact is not in the database, it is considered false.
I. Relational Integrity & Physical Implementation
To maintain high data quality and ensure these mathematical properties are preserved during physical execution, RDBMS engines enforce several levels of integrity constraints. These constraints prevent the introduction of anomalies (Insert, Update, Delete) and ensure that relationships between entities remain logically sound.
1. Entity Integrity (Primary Keys)
Every table must have a Primary Key (PK). It must be unique and non-null. The engine creates a unique B+ Tree index on the PK column(s). Unlike a standard B-Tree, a B+ Tree stores all data in the leaf nodes, while internal nodes only store keys for navigation. This structure is highly optimized for range queries because leaf nodes are linked together in a doubly-linked list, allowing for a single "index seek" followed by a sequential scan of the required pages.
2. Referential Integrity (Foreign Keys)
Ensures the relationship between tables remains consistent. Every INSERT or UPDATE on the child table requires a "lookup" in the parent index, adding O(log N) latency to the write path. While this ensures integrity, it can become a bottleneck in massive ingestion pipelines, sometimes leading architects to enforce integrity in the application layer for specific high-scale use cases.
3. Join Algorithms: The Executor's Toolkit
When joining tables, the executor chooses a physical implementation based on the estimated cost:
- Nested Loop Join:
O(M * log N)with an index on the inner relation. Best for small-to-large joins. - Hash Join:
O(M + N). Creates an in-memory hash table for the smaller relation. Best for large, equijoin operations where no supporting index exists. - Sort-Merge Join:
O(M log M + N log N). Sorts both sides and merges. Best for range joins or when both relations are already sorted by the join key.
II. Database Normalization: The 3NF Baseline
Normalization is the process of organizing data to reduce redundancy and improve data integrity by applying a series of mathematical "Normal Forms."
| Normal Form | Rule | Engineering Goal |
|---|---|---|
| 1NF | Atomic values; No repeating groups. | Ensure a flat, predictable structure. |
| 2NF | In 1NF + All non-key attributes are fully dependent on the PK. | Remove partial dependencies. |
| 3NF | In 2NF + No transitive dependencies (non-key columns depend only on the PK). | Eliminate redundancy and update anomalies. |
III. Production Anti-Patterns
- Blind Normalization: Over-normalizing (e.g., 5NF/6NF) in a read-heavy system, leading to excessive JOINs and CPU exhaustion.
- Polymorphic Associations: Using a single
parent_idandparent_typestring column, which bypasses RDBMS foreign key enforcement and prevents index-based join optimization. - Wide Tables: Tables with 100+ columns causing Row Chaining (row exceeds page size) or Row Migration, significantly increasing I/O cost per read.
IV. Performance Bottlenecks
- Random I/O during Joins: Nested loop joins without proper indexing causing the disk head to seek constantly, destroying throughput on HDD-based systems.
- TempDB/Workfile Spill: Hash joins or sorts that exceed the allocated
work_mem(PostgreSQL) orsort_buffer_size(MySQL), spilling intermediate results to disk. - Lack of Index-Only Scans: Fetching columns not in the index, forcing a "Bookmark Lookup" or "Table Access by Index RID," effectively doubling the I/O cost.