Chapter 8: Data Modeling & Schema Design

Data modeling in MongoDB is driven by application access patterns. The fundamental engineering principle is: "Data that is accessed together should be stored together." This requires a shift from relational normalization to a "Workload-Optimized" model that prioritizes read efficiency and data locality.

I. Technical Tradeoffs: Embedding vs. Referencing

1. Embedding (One-to-Few)

Embedding related data in a single BSON document lets the storage engine fetch the entire object with a single sequential disk read. This is ideal for one-to-few relationships where the related data is bounded and either static or updated together with its parent.

  • Advantages: Atomic updates within the document, zero-JOIN overhead, and reduced network round-trips.
  • Risks: Violating the 16MB BSON limit and causing Write Amplification as the document grows.
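A minimal sketch of the embedded design, with hypothetical field names: a customer document carries its few, bounded shipping addresses inline, and an update to one embedded address is expressed as a single atomic operation on the parent document. With a real driver this would be passed to `update_one`; here only the operation documents are shown.

```python
# One-to-few embedding: addresses are read and updated together with
# the customer, so they live inside the customer document.
customer = {
    "_id": "cust-1001",
    "name": "Ada Lovelace",
    "addresses": [
        {"label": "home", "city": "London"},
        {"label": "work", "city": "Cambridge"},
    ],
}

# Atomic update of one embedded element via the positional operator.
# MongoDB guarantees atomicity at the level of a single document.
update_filter = {"_id": "cust-1001", "addresses.label": "home"}
update_spec = {"$set": {"addresses.$.city": "Oxford"}}
```

Because both addresses travel with the customer, reading the profile page costs one document fetch and no joins.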

2. Referencing (One-to-Many)

Referencing uses unique identifiers (like _id) to link documents across collections. This is preferred for many-to-many relationships or when the related data is large and frequently accessed independently.

  • Advantages: Smaller document sizes, reduced redundancy, and easier cache management for "Hot" entities.
  • Risks: Requires $lookup (Left Outer Join) or application-side joins, increasing latency.
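A referencing sketch under assumed collection names (`customers`, `orders`): the order stores only the customer's `_id`, and a `$lookup` stage performs the left outer join at query time.

```python
# The order references the customer instead of embedding a copy of it.
order = {
    "_id": "ord-42",
    "customer_id": "cust-1001",   # reference, not an embedded copy
    "total": 99.50,
}

# Aggregation pipeline joining orders back to customers.
# $lookup is a left outer join: orders with no match get an empty array.
pipeline = [
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    {"$unwind": "$customer"},
]
```

The tradeoff is visible in the pipeline itself: every read that needs customer details pays for the join, which is why referencing suits data accessed independently more often than together.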

II. The Subset & Bucket Patterns

To handle large-scale data while staying under the 16MB limit, expert architects use the Subset Pattern. Instead of storing all related items (like thousands of comments), the primary document stores only the N most recent items; older items are "Offloaded" to a separate collection, keeping the primary document small enough to stay resident in the WiredTiger cache. The related Bucket Pattern attacks the same problem from the opposite direction: rather than one tiny document per event, events are grouped into fixed-size "bucket" documents, bounding growth while preserving data locality for range scans.
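The Subset Pattern maps directly onto MongoDB's `$push`/`$each`/`$slice` operator combination, which caps an embedded array at the N most recent entries on every write. Field and collection names below are hypothetical, and the helper emulates the server-side slice in plain Python to show the effect:

```python
SUBSET_SIZE = 10

# The update document a driver would send: push the new comment and
# keep only the last SUBSET_SIZE elements of the array.
update_spec = {
    "$push": {
        "comments": {
            "$each": [{"user": "u1", "text": "hi"}],
            "$slice": -SUBSET_SIZE,   # negative slice keeps the tail
        }
    }
}

def push_with_slice(comments, new_comment, n=SUBSET_SIZE):
    """Emulates {$push: {comments: {$each: [new], $slice: -n}}}."""
    return (comments + [new_comment])[-n:]
```

Comments that fall off the tail would be inserted into a separate historical collection, keyed by the parent document's `_id`.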

[Figure: Subset Pattern — the primary document keeps a subset (the last 10 items); older items live in a historical collection, partitioned by parent ID and optimized for range scans.]

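The Bucket Pattern named in this section's heading can be sketched with a deterministic bucket key (names and bucket size are assumptions): instead of one document per sensor reading, readings are grouped into fixed-capacity buckets, so document count and document size are both bounded.

```python
BUCKET_SIZE = 200   # max readings per bucket document

def bucket_id(sensor_id, seq):
    """Readings 0-199 go to bucket 0, 200-399 to bucket 1, and so on."""
    return f"{sensor_id}:{seq // BUCKET_SIZE}"

# Shape of one bucket document; `count` lets writes stop at capacity
# and range scans touch few, contiguous documents.
bucket_doc = {
    "_id": bucket_id("sensor-7", 350),
    "sensor_id": "sensor-7",
    "count": 0,
    "readings": [],   # at most BUCKET_SIZE embedded entries
}
```

Because consecutive readings share a bucket, a time-range query reads a handful of documents instead of thousands, which is exactly the locality the figure above illustrates.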

III. Production Anti-Patterns

  • The Unbounded Array: Allowing arrays (e.g., audit_logs) to grow without bound. As the document approaches 16MB, every update must handle progressively more data and performance degrades sharply; once the limit is reached, writes to that document fail with an error.
  • The "Relational-Shadow" Pattern: Over-normalizing into many small collections and joining them on every query. This is an anti-pattern because MongoDB's engine is not optimized for massive multi-way joins like a relational database.
  • Ignoring Write Amplification: Frequently updating a single field in a document that contains 10MB of embedded data. Even with a surgical $set, WiredTiger may still have to rewrite the entire data page.
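A rough guard against the Unbounded Array anti-pattern is to estimate document size before appending. The sketch below is an assumption, not a driver feature: it uses `json.dumps` length as a crude proxy for BSON size (the two differ, so the threshold is deliberately conservative).

```python
import json

MAX_BSON_BYTES = 16 * 1024 * 1024
SAFETY_FACTOR = 0.5   # stay well under the hard limit

def can_append(doc, entry):
    """Return True if appending `entry` keeps `doc` comfortably
    under the 16MB BSON ceiling, using JSON length as a proxy."""
    projected = len(json.dumps(doc)) + len(json.dumps(entry))
    return projected < MAX_BSON_BYTES * SAFETY_FACTOR
```

When the guard trips, the application should fall back to the Subset or Bucket patterns rather than letting the array keep growing.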

IV. Performance Bottlenecks

  • BSON Depth Overhead: Documents with more than 20-30 levels of nesting increase parsing overhead for both the query engine and the driver (MongoDB enforces a hard limit of 100 nesting levels).
  • Schema Validation CPU Cost: Complex validation rules (especially pattern constraints such as JSON Schema pattern or the $regex query operator) add measurable latency to every insert and update operation. Note that $where is not permitted in validators at all.
  • Index Key Growth: Every byte added to an indexed field's value increases index size and the RAM required to keep the index resident. Because BSON also stores field names in every document, short field names (or aliased fields) reduce document and working-set size in production.
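The field-name overhead can be made concrete with a quick size comparison; `json.dumps` again stands in for BSON encoding, and the field names are invented for illustration:

```python
import json

# Same data, verbose vs. terse field names. BSON repeats field names
# in every stored document, so the delta compounds per document.
verbose = {"customer_email_address": "a@b.c", "transaction_timestamp": 1700000000}
terse = {"em": "a@b.c", "ts": 1700000000}

saving = len(json.dumps(verbose)) - len(json.dumps(terse))
```

Across millions of documents, a few dozen bytes of name savings per document translates directly into cache headroom.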