Rows, columns, and consequences

Speak at Rootconf’s Special Edition on Databases

Saravana S

Overcoming Management and Data Protection Challenges of Complex Distributed Databases - An Integrated Approach

Submitted Apr 23, 2026

Introduction:

Case Study: Managing and Protecting Distributed Databases - MongoDB Sharded Clusters

The Challenges of Distributed Database Management

Modern, large-scale applications often rely on distributed database architectures like MongoDB Sharded Clusters to handle massive data volumes and high write throughput. A sharded cluster horizontally partitions data across multiple independent replica sets (shards), alongside a config server that tracks cluster metadata.
While this architecture scales horizontally, it introduces serious challenges for data protection:

  1. Cross-Shard Inconsistency: Taking uncoordinated backups of individual shards captures each partition at a different logical time. If a distributed transaction spans multiple shards during the backup window, the resulting snapshot will be internally inconsistent, leading to broken data relationships upon restore.

  2. Operational Overhead: Managing backups, oplog extractions, and restorations across a dozen independent replica sets requires complex, manual scripting and significant downtime to quiesce the database.

  3. Storage Inefficiency: Traditional backup tools often struggle with the sheer size of sharded clusters, resulting in slow backup times and bloated storage footprints.
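
To make the cross-shard inconsistency in point 1 concrete, here is a minimal toy sketch in Python. It does not use MongoDB; the Shard class, the integer logical clock, and the account balances are all hypothetical and exist only for illustration. A single transfer moves 100 units from one shard to another, and the two shards are then "snapshotted" at different logical times.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy shard (hypothetical): one balance plus its write history."""
    name: str
    balance: int
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Record the starting balance at logical time 0.
        self.history.append((0, self.balance))

    def apply(self, logical_time, delta):
        """Apply a write at the given logical time."""
        self.balance += delta
        self.history.append((logical_time, self.balance))

    def balance_at(self, logical_time):
        """Balance as a snapshot taken at logical_time would see it."""
        visible = [bal for t, bal in self.history if t <= logical_time]
        return visible[-1]


def snapshot_total(shards, snapshot_times):
    """Sum the balances captured by per-shard snapshots."""
    return sum(s.balance_at(snapshot_times[s.name]) for s in shards)


shard_a = Shard("shard-a", 500)
shard_b = Shard("shard-b", 500)

# One cross-shard transaction at logical time 5: move 100 from A to B.
shard_a.apply(5, -100)
shard_b.apply(5, +100)

# Uncoordinated: A is captured after the transfer, B before it.
uncoordinated_total = snapshot_total(
    [shard_a, shard_b], {"shard-a": 6, "shard-b": 4}
)
# Coordinated: both shards are captured at the same logical time.
coordinated_total = snapshot_total(
    [shard_a, shard_b], {"shard-a": 6, "shard-b": 6}
)

print(uncoordinated_total)  # 900: the debit is captured, the credit is not
print(coordinated_total)    # 1000: the transfer is captured on both sides
```

The uncoordinated restore loses 100 units because it contains only half of the transaction, which is the kind of broken data relationship described above.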

Problem:

The Shortcomings of Isolated Approaches

Consider what happens when each platform manages a sharded cluster’s data protection in isolation:

NDB Alone (Infrastructure-Only): NDB (Nutanix Database Service, a DBaaS solution) excels at instantaneous, zero-byte infrastructure snapshots. However, without application-level awareness of MongoDB’s distributed topology, independent storage snapshots of multiple shards are taken a few milliseconds apart, which leads to “torn” cross-shard transactions. The result is a backup that is fast but inconsistent across shards, and therefore unreliable for recovery.

MongoDB Ops Manager Alone (Software-Only): MongoDB Ops Manager (an enterprise-grade MongoDB database management platform) understands MongoDB’s internal logic and uses backup cursors to ensure consistency across the distributed cluster. However, relying solely on Ops Manager means performing logical, streaming backups over the network. For enterprise datasets in the 50–100 TB range, logical backups can take hours or days, consume substantial compute resources, and degrade production performance. Restores are similarly slow, making strict RTOs (Recovery Time Objectives) very hard to achieve.
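
As a rough back-of-the-envelope check on the "hours or days" figure, the sketch below estimates how long it takes just to move a dataset over a network link. The throughput values are assumptions chosen for illustration, not measurements of Ops Manager; the calculation uses decimal units (1 TB = 10^12 bytes) and ignores compression, incremental backups, and protocol overhead.

```python
def transfer_hours(dataset_tb, throughput_gbit_per_s):
    """Hours needed to stream dataset_tb terabytes at a sustained rate.

    Both inputs are illustrative assumptions, in decimal units.
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (throughput_gbit_per_s * 1e9)
    return seconds / 3600


# 50 TB over a fully used 10 Gbit/s link: roughly 11 hours.
hours_50tb_10g = transfer_hours(50, 10)

# 100 TB at a sustained 1 Gbit/s: roughly 222 hours, or about 9 days.
hours_100tb_1g = transfer_hours(100, 1)

print(f"{hours_50tb_10g:.1f} h, {hours_100tb_1g:.1f} h "
      f"({hours_100tb_1g / 24:.1f} days)")
```

Even under the optimistic assumption of a dedicated 10 Gbit/s link, a full streaming copy of 50 TB occupies most of a working day, and a restore has to move the same volume back.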

Solution:

An Integrated Approach to MongoDB Cluster Management Using NDB and MongoDB Ops Manager

The integrated solution addresses the challenges of distributed data protection:

  1. Cluster-Wide Application Consistency:
    Instead of snapshotting shards blindly, NDB delegates the consistency coordination to Ops Manager. Ops Manager utilizes MongoDB backup cursors to bring every shard and config server to a consistent, stable state simultaneously. Only when the entire cluster is in a “READY” state does NDB capture the underlying infrastructure snapshots.

  2. Continuous Data Protection and PITR:
    NDB orchestrates coordinated oplog catchups alongside the snapshots. This allows administrators to perform precise Point-in-Time Recoveries (PITR) across the entire distributed cluster, keeping potential data loss (RPO) to a minimum without requiring application downtime.

  3. Unified Operational Experience:
    Despite the underlying complexity of shards and config servers, NDB treats the entire MongoDB Sharded Cluster as a single logical database entity. DBAs define SLA policies, backup schedules, and retention windows in NDB once, and the platform handles the orchestration.

  4. Instantaneous, Storage-Efficient Backups:
    Because NDB leverages native infrastructure snapshots, capturing the data takes seconds, regardless of whether the cluster holds 500 GB or 50 TB of data. This eliminates the performance penalties associated with traditional streaming backups.
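
The coordination described in points 1 and 2 can be sketched as follows. This is a simplified illustration in Python, not the NDB or Ops Manager API: the State enum, take_cluster_snapshot, plan_pitr, and the integer timestamps are all hypothetical names and values. The first function snapshots members only once every shard and the config server report READY; the second picks the newest snapshot at or before a requested recovery time and the oplog window to replay on top of it.

```python
from bisect import bisect_right
from enum import Enum


class State(Enum):
    PREPARING = "PREPARING"
    READY = "READY"


def take_cluster_snapshot(member_states, snapshot_fn):
    """Snapshot every member, but only if all of them are READY.

    member_states: dict of member name -> State (shards and config server).
    snapshot_fn: hypothetical callable that snapshots one member and
                 returns a snapshot id.
    """
    not_ready = sorted(
        name for name, state in member_states.items()
        if state is not State.READY
    )
    if not_ready:
        raise RuntimeError("cluster not READY, waiting on: " + ", ".join(not_ready))
    return {name: snapshot_fn(name) for name in member_states}


def plan_pitr(snapshot_times, target_time):
    """Choose a base snapshot and oplog replay window for a recovery.

    snapshot_times must be sorted in ascending order.
    """
    idx = bisect_right(snapshot_times, target_time)
    if idx == 0:
        raise ValueError("no snapshot at or before the requested time")
    base = snapshot_times[idx - 1]
    return {"base_snapshot": base, "replay_oplog": (base, target_time)}


states = {
    "config": State.READY,
    "shard-0": State.READY,
    "shard-1": State.PREPARING,
}

# One member is still preparing, so no snapshot is taken anywhere.
try:
    take_cluster_snapshot(states, lambda name: f"snap-{name}")
except RuntimeError as exc:
    blocked_reason = str(exc)

# Once every member is READY, all members are captured together.
states["shard-1"] = State.READY
snapshots = take_cluster_snapshot(states, lambda name: f"snap-{name}")

# Recover to time 250 using snapshots taken at times 100, 200, and 300.
plan = plan_pitr([100, 200, 300], target_time=250)
```

The all-or-nothing gate is the key design choice in this sketch: a snapshot of any single member is only useful if every other member was captured in the same consistent state.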

Key benefits or takeaways:

  1. Leveraging the best of both worlds: NDB and MongoDB Ops Manager together, where either platform used alone delivers a subpar overall experience
  2. Continuous data protection without impacting live database transactions
  3. Cross-shard data consistency, storage efficiency, and faster backups
  4. Restoring distributed databases when needed with minimal RTO - in minutes, instead of hours or days

Who can benefit from this session?

  1. Enterprise customers who use MongoDB Sharded Clusters
  2. Database and storage technical experts who are looking for ways to optimize the technologies they use to provide an optimal experience to end users

About the speaker:

Name: Saravana Selvaraj
Work For: Nutanix, Bangalore
Designation: Staff Engineer - NDB - DaaS
Interest: Storage, Databases, Big Data, Solution Integration and AI Enthusiast
IT Experience: 20+ years
LinkedIn: https://www.linkedin.com/in/saravanas/

