Chapter 5: NoSQL & Unstructured Data

NoSQL (Not Only SQL) databases are engineered to solve the limitations of traditional RDBMS in the areas of horizontal scaling, flexible schema management, and high-velocity ingestion. By relaxing the strict ACID requirements of relational systems (favoring BASE: Basically Available, Soft state, Eventual consistency), NoSQL systems can achieve massive scale across geographically distributed clusters.

I. NoSQL Data Models & Storage Engines

Unlike the rigid tabular structure of RDBMS, NoSQL provides specialized models optimized for different data shapes and access patterns.

  • Document — O(1) fetch by ID — BSON / JSON
  • Key-Value — O(1) hash map — in-memory (Redis)
  • Columnar — sparse O(1) scan — SSTable / Parquet
  • Graph — adjacency lists

1. Storage Engines: LSM-Tree Mechanics

Many NoSQL systems (e.g., Cassandra, RocksDB) use Log-Structured Merge-Trees (LSM-Trees) to optimize for high-velocity writes. In an LSM-Tree, writes are first appended to a MemTable (in-memory) and a commit log. When the MemTable fills, it is flushed to disk as an immutable SSTable (Sorted String Table). To maintain read performance, background Compaction jobs merge these SSTables and remove duplicates or deleted records. To avoid expensive disk reads for non-existent keys, these systems employ Bloom Filters—probabilistic structures that can report that a key definitely does not exist, or only that it might.
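The write path above can be sketched in a few lines of Python. This is a toy model for illustration only (the class name `MiniLSM` is hypothetical); real engines add commit logs, block indexes, Bloom filters, and tiered compaction strategies:

```python
import bisect

class MiniLSM:
    """Toy LSM-Tree: mutable MemTable -> immutable SSTables, merged by compaction."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}         # in-memory writes (a real engine also appends to a commit log)
        self.sstables = []         # immutable sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the MemTable to an immutable, sorted SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the MemTable first, then SSTables newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all SSTables into one; newer values win, duplicates are dropped.
        merged = {}
        for table in reversed(self.sstables):   # oldest first, so newer overwrites
            merged.update(table)
        self.sstables = [sorted(merged.items())]
```

Note that `get` must scan SSTables newest-to-oldest: the more un-compacted tables accumulate, the more lookups each read performs, which is exactly the Read Amplification problem discussed under anti-patterns below.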

II. Distributed Coordination: Consistent Hashing & Quorum

To scale without a centralized bottleneck, NoSQL clusters use Consistent Hashing. This technique maps both data keys and physical nodes onto a logical ring. By using Virtual Nodes (VNodes), the database ensures that data is evenly distributed even if nodes have varying hardware capacities.
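A minimal sketch of such a ring, assuming MD5 as the ring hash and a fixed VNode count per node (the `HashRing` class and its parameters are illustrative, not any particular database's API):

```python
import bisect
import hashlib

def _ring_hash(key: str) -> int:
    """Map a string to a point on the ring via MD5 (illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring; each physical node owns several virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Place vnodes for every physical node on the ring, sorted by hash.
        self._ring = sorted(
            (_ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect(self._points, _ring_hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The key property: when a node joins or leaves, only the keys falling on the ring segments adjacent to its vnodes move, rather than rehashing the entire keyspace. Giving a beefier node more vnodes proportionally increases its share of the ring.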

Consistency in distributed systems is managed through Quorum Models (N, W, R):

  • N: The number of replicas.
  • W: The number of nodes that must acknowledge a write.
  • R: The number of nodes that must respond to a read.

To achieve Strong Consistency, we must satisfy the condition R + W > N. If this condition is relaxed, the system provides Eventual Consistency, allowing for higher availability at the risk of stale reads.
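The overlap condition is trivial to check mechanically; a small helper (the function name is illustrative) makes the trade-off concrete:

```python
def quorum_is_strong(n: int, w: int, r: int) -> bool:
    """True when R + W > N, i.e. every read set must overlap every write set
    in at least one replica, guaranteeing the read sees the latest write."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return r + w > n

# Typical settings for N = 3 replicas:
quorum_is_strong(3, 2, 2)   # QUORUM writes + QUORUM reads -> True (strong consistency)
quorum_is_strong(3, 1, 1)   # W=1, R=1 -> False (eventual consistency, stale reads possible)
```

Intuitively, with N = 3, W = 2, R = 2, any two-node read set must share at least one node with any two-node write set, so at least one replica in the read already holds the latest write.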

III. Production Anti-Patterns

  • Hot Partitions: Selecting a partition key with low cardinality (e.g., zip_code for a local app), causing one node to handle all traffic while others remain idle.
  • Ignoring Compaction Debt: Allowing SSTables to accumulate faster than they can be merged, leading to high Read Amplification and disk space exhaustion.
  • Large Documents/Rows: Storing multi-megabyte documents in systems optimized for small key-value pairs, causing GC pauses and network saturation.
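The hot-partition effect is easy to demonstrate numerically. The sketch below (a hypothetical workload and helper function, not real traffic data) compares a low-cardinality partition key against a composite key:

```python
from collections import Counter

def partition_skew(keys, num_nodes=8):
    """Fraction of traffic landing on the busiest of num_nodes partitions."""
    keys = list(keys)
    load = Counter(hash(k) % num_nodes for k in keys)
    return max(load.values()) / len(keys)

# Hypothetical local-app workload: ~90% of events share one zip code.
events = ["94103"] * 90 + [str(90000 + i) for i in range(10)]

skew_zip = partition_skew(events)
# Composite key (zip + event id) spreads the same traffic across the ring.
skew_composite = partition_skew(f"{z}:{i}" for i, z in enumerate(events))
```

With `zip_code` alone, one partition absorbs at least 90% of the traffic; prefixing or salting the key with a higher-cardinality component restores an even spread.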

IV. Performance Bottlenecks

  • Gossip Protocol Overhead: In massive clusters, the metadata sync between nodes can consume significant internal network bandwidth.
  • Bloom Filter False Positives: As the dataset grows, Bloom filters may require more memory to maintain a low false-positive rate; otherwise, read latency will spike due to unnecessary disk seeks.
  • JVM GC Pauses: Since many NoSQL engines are Java-based, "Stop-the-World" garbage collection cycles can cause catastrophic tail latency spikes in real-time applications.
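The memory/false-positive trade-off for Bloom filters follows from the standard sizing formula: for n keys and a target false-positive rate p, the optimal filter uses m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A small calculator (the function name is illustrative) shows how the cost scales:

```python
import math

def bloom_sizing(n_items: int, fp_rate: float):
    """Optimal Bloom filter size in bits and hash count for a target FP rate."""
    m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# 100M keys at a 1% false-positive rate costs roughly 120 MB of filter bits;
# halving the FP rate to 0.1% pushes that to roughly 180 MB.
bits, hashes = bloom_sizing(100_000_000, 0.01)
```

The bits-per-key figure (~9.6 at 1% FP) is independent of dataset size, so filter memory grows linearly with key count; when operators cap filter memory instead, the FP rate rises and reads pay for it in wasted disk seeks.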