Chapter 2: Data Manipulation (DML)
Data Manipulation Language (DML) encompasses the operations used to mutate the state of the database: INSERT, UPDATE, and DELETE. In a highly concurrent system, DML operations must strictly adhere to the ACID properties, utilizing the Write-Ahead Log (WAL) to ensure durability and Locks/MVCC to ensure isolation.
I. Inserting Data (INSERT)
The INSERT statement adds new rows to a table. When a row is inserted, the engine allocates space in a data page, updates the primary clustered index, and simultaneously updates any secondary B+ tree indexes.
1. Single and Multi-Row Inserts
Batching inserts reduces the overhead of network round-trips and transaction commit synchronization (fsync).
-- Single row insert
INSERT INTO employees (employee_id, first_name, last_name, email, department_id)
VALUES (101, 'Alice', 'Smith', 'alice@corp.local', 10);
-- Multi-row bulk insert (Highly recommended for throughput)
INSERT INTO employees (employee_id, first_name, last_name, email, department_id)
VALUES
(102, 'Bob', 'Jones', 'bob@corp.local', 10),
(103, 'Carol', 'Davis', 'carol@corp.local', 20);
2. INSERT INTO ... SELECT
This pattern moves data directly within the database kernel without bringing it back to the application server, making it optimal for ELT (Extract, Load, Transform) pipelines.
-- Archiving old records to a history table
INSERT INTO employees_archive (emp_id, name, archived_at)
SELECT employee_id, first_name, CURRENT_TIMESTAMP
FROM employees
WHERE status = 'TERMINATED';
II. Modifying State (UPDATE & DELETE)
UPDATE and DELETE operations are complex because they require a "Search" phase to locate the targeted rows before the "Mutation" phase can occur.
1. The UPDATE Statement
In MVCC (Multi-Version Concurrency Control) engines like PostgreSQL, an UPDATE is not an in-place mutation. It is physically executed as a DELETE of the old tuple (marking it as invisible) followed by an INSERT of the new tuple.
-- Targeting a specific row via the Primary Key (O(log N) cost)
UPDATE employees
SET salary = salary * 1.05,
status = 'ACTIVE'
WHERE employee_id = 102;
2. The DELETE Statement
Deletes remove the visibility of rows. The actual disk space is reclaimed later by background garbage collection processes (e.g., VACUUM).
-- Deleting a subset of rows
DELETE FROM employees
WHERE status = 'TERMINATED' AND department_id = 50;
III. Production Anti-Patterns
- The Unfiltered Mutation: Executing an
UPDATEorDELETEwithout aWHEREclause. This will target the entire table, locking all rows and filling the transaction log until the disk is full. - Read-Modify-Write in App Code: Selecting a value into the application (e.g.,
balance = 100), modifying it in Python/Node (balance -= 10), and updating it (SET balance = 90). In concurrent environments, this causes Lost Updates. Always perform relative updates directly in SQL:UPDATE accounts SET balance = balance - 10 WHERE id = 1. - DML in Loops: Emitting 10,000 individual
INSERTstatements from an application loop. Each statement requires a network round trip and an individual disk commit. Use bulk inserts orCOPYcommands.
IV. Performance Bottlenecks
- Write Amplification from Indexes: Every
INSERTorDELETErequires an update to the primary clustered index and every non-clustered index on the table. A table with 15 indexes will incur a massive write penalty, completely bottlenecking high-throughput ingestion pipelines. - Lock Escalation: When a session acquires thousands of row-level locks (e.g., during a massive
UPDATE), the database engine may "escalate" these into a single Table-Level Lock to save memory. This instantly blocks all other writes to the table. - Hot-Spot Page Contention: Multiple concurrent sessions attempting to
INSERTrows into a table clustered by a monotonically increasing sequence (like an auto-increment ID). All threads compete for a lock on the "right-most" leaf page of the B+ tree.