Handling Data Fragmentation in MongoDB: Part 1

Mydbops
Jan 23, 2025
5 Mins to Read

In the dynamic world of database management, maintaining performance and efficiency is paramount. One of the key challenges that MongoDB users face is data fragmentation. As data is continually modified or deleted, it can become scattered across the storage system, leading to inefficiencies that impact both speed and resource utilisation.

Data fragmentation occurs when the physical storage of data does not align with the logical organisation of that data. This misalignment can result in slower query performance, increased disk space usage, and longer maintenance times. Understanding data fragmentation in MongoDB is essential for database administrators and developers alike, as it can have a significant impact on application performance and user experience.

In this blog, we’ll explore what data fragmentation is, its causes, and the implications for your MongoDB environment. We’ll also provide practical strategies for preventing and mitigating fragmentation, ensuring your database remains agile and efficient.

Commands to Identify Data Fragmentation

Data fragmentation can significantly hinder the overall efficiency of your database. To ensure smooth operation, it’s crucial to implement strategies for compacting and cleaning up data regularly.

Figure: MongoDB data fragmentation, showing Data Block A and Data Block B interleaved with free space.

Key MongoDB Commands to detect fragmentation

Use the following command to identify data fragmentation for a collection on the server.

db.getSiblingDB(dbName).getCollection(coll).stats().wiredTiger['block-manager']['file bytes available for reuse']
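As a rough sketch of interpreting that value (the helper function, the mock numbers, and the 25% figure are illustrative; the field names follow the collStats output under WiredTiger), fragmentation can be expressed as the share of the data file that is reusable free space:

```javascript
// Estimate fragmentation from a collStats-style document.
// statsDoc is assumed to carry the WiredTiger "block-manager" fields
// shown in the command above.
function fragmentationPercent(statsDoc) {
  const bm = statsDoc.wiredTiger["block-manager"];
  const reusable = bm["file bytes available for reuse"];
  const fileSize = bm["file size in bytes"];
  if (fileSize === 0) return 0;
  return (reusable / fileSize) * 100;
}

// Mock example: a 512 MB data file with 128 MB reusable space.
const mockStats = {
  wiredTiger: {
    "block-manager": {
      "file bytes available for reuse": 128 * 1024 * 1024,
      "file size in bytes": 512 * 1024 * 1024
    }
  }
};
console.log(fragmentationPercent(mockStats)); // 25
```

In practice you would pass the result of db.getSiblingDB(dbName).getCollection(coll).stats() to such a helper instead of mock data.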

Understanding the free storage size and fragmentation levels

Starting from MongoDB version 5.0, a new field, freeStorageSize, has been introduced in the db.stats() output, which helps identify fragmented data.

db.stats(1024*1024*1024).freeStorageSize // To find the fragmented data
db.stats(1024*1024*1024).totalFreeStorageSize // This command will provide both the index and data fragmentation size
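A minimal sketch of putting freeStorageSize to work (the helper, the database names, and the 10 GB threshold are all illustrative; each mock object mimics the output of db.stats(1024*1024*1024), i.e. values scaled to GB):

```javascript
// Flag databases whose free (reusable) storage exceeds a threshold.
function fragmentedDatabases(statsList, thresholdGB) {
  return statsList
    .filter(s => s.freeStorageSize > thresholdGB)
    .map(s => s.db);
}

// Mock db.stats() results for two hypothetical databases.
const mockDbStats = [
  { db: "orders", freeStorageSize: 22, totalFreeStorageSize: 25 },
  { db: "users",  freeStorageSize: 3,  totalFreeStorageSize: 4 }
];
console.log(fragmentedDatabases(mockDbStats, 10)); // [ 'orders' ]
```

Databases flagged this way are candidates for the compaction or initial-sync procedures described below.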

Handling Fragmentation

Data fragmentation occurs when the actual data stored in the database is spread unevenly across storage. As documents are updated or deleted, or large amounts of data are archived, gaps may form, leading to inefficient use of space. Over time, this fragmentation can slow down read and write operations, as the database engine must work harder to access fragmented data.

Fragmentation can occur in both data and indexes. There are two main approaches to managing this fragmentation:

  • Compaction
  • Initial synchronization

Compaction

The compact command rewrites and defragments all data and indexes within a specified collection. By reorganizing the storage, it helps to optimise data access and improve overall performance. In databases using the WiredTiger storage engine, this process not only enhances efficiency but also releases unneeded disk space back to the operating system. This is particularly beneficial for managing storage resources, allowing better utilisation of disk space as data changes over time.

Figure: collection storage layout before and after compaction.

Points to note when using compaction

  • For clusters that enforce authentication, you must log in as a user with the compact privilege on the target collection. The dbAdmin and hostManager roles grant the necessary privileges to run the compact command on non-system collections. For system collections, you need to create a custom role that grants the compact privilege.
  • The primary node does not replicate the compact command to the secondaries.
  • To observe how the storage space for the collection changes, execute the collStats command both before and after compaction.
  • The effectiveness of compaction is workload-dependent; in some cases, no disk space may be recovered.
  • Starting in MongoDB 2.6, mongod rebuilds all indexes in parallel following the compact operation.
  • Starting in MongoDB 5.0.12, a secondary node can continue replicating, and reads are permitted on that secondary, while compact is running.
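The custom-role requirement above can be sketched as a role document. The role name and resource scoping here are illustrative assumptions; check the exact resource rules for system collections against your MongoDB version's authorization documentation:

```javascript
// Hypothetical role document granting the "compact" action.
const compactRole = {
  role: "compactAnyCollection", // illustrative name
  privileges: [
    // { db: "", collection: "" } covers all non-system collections;
    // system collections must be listed explicitly in their own entries.
    { resource: { db: "", collection: "" }, actions: ["compact"] }
  ],
  roles: []
};

// In mongosh, this document would be passed to:
//   db.getSiblingDB("admin").createRole(compactRole)
console.log(compactRole.privileges[0].actions[0]); // compact
```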

Syntax of Compaction in MongoDB Versions Prior to 8.0

db.runCommand(
   {
     compact: <string>,
     force: <boolean>, // Optional
     comment: <any>, // Optional
   }
)

Syntax of Compaction in MongoDB Version 8.0

db.runCommand(
   {
     compact: <string>,
     dryRun: <boolean>, // Optional
     force: <boolean>, // Optional
     freeSpaceTargetMB: <int>, // Optional
     comment: <any>, // Optional
   }
)

compact: A string value; the name of the collection that requires compaction.

dryRun: A boolean value, available from MongoDB 8.0. When set to true, compact reports how much space could be reclaimed without actually reclaiming it.

force: A boolean value. If set to true, compact is allowed to run on the primary of a replica set. Starting in v4.4, compact does not block MongoDB CRUD operations on the database it is compacting.

freeSpaceTargetMB: Specifies the minimum amount of storage space, in megabytes, that must be recoverable for compaction to proceed. Default: 20.
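As a sketch of how these parameters combine (the helper function, the collection name, and the values are illustrative assumptions, not part of MongoDB's API), the command document passed to db.runCommand() can be assembled as:

```javascript
// Build a compact command document from options.
function buildCompactCommand(collection, opts = {}) {
  const cmd = { compact: collection };
  if (opts.dryRun !== undefined) cmd.dryRun = opts.dryRun; // MongoDB 8.0+
  if (opts.force !== undefined) cmd.force = opts.force;
  if (opts.freeSpaceTargetMB !== undefined) cmd.freeSpaceTargetMB = opts.freeSpaceTargetMB;
  return cmd;
}

// Example: check reclaimable space first with a dry run, then compact.
const dryRunCmd = buildCompactCommand("orders", { dryRun: true, freeSpaceTargetMB: 100 });
console.log(JSON.stringify(dryRunCmd));
// In mongosh: db.runCommand(dryRunCmd)
```

Running the dryRun variant first lets you skip collections where the estimated reclaimable space is below your target.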

Blocking Behaviour Of Compaction

Before 4.4: compact blocks all read and write activity on the database it is compacting.

4.4 and later: compact blocks only the following operations on the collection being compacted; all other operations are permitted:
  • db.collection.drop()
  • db.collection.createIndex()
  • db.collection.createIndexes()
  • db.collection.dropIndex()
  • db.collection.dropIndexes()
  • collMod

Starting in MongoDB 6.0.2 (and 5.0.12): a secondary node can continue replicating while compact is running, and reads are permitted on that secondary.

Termination of Ongoing Compaction

To check for ongoing compaction, you can use the db.currentOp() command. Based on the output of db.currentOp(), you can use the operation ID with the db.killOp() method to terminate the ongoing compaction on the server.
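A sketch of filtering for compact operations (mock data; the shape follows the inprog array returned by db.currentOp(), and the opids are illustrative):

```javascript
// Find opids of running compact operations in a db.currentOp()-style
// result document.
function compactOpIds(currentOpResult) {
  return currentOpResult.inprog
    .filter(op => op.command && op.command.compact !== undefined)
    .map(op => op.opid);
}

// Mock currentOp output: one find, one compact in progress.
const mockCurrentOp = {
  inprog: [
    { opid: 101, command: { find: "users" } },
    { opid: 202, command: { compact: "orders" } }
  ]
};
console.log(compactOpIds(mockCurrentOp)); // [ 202 ]
// Each returned opid can then be passed to db.killOp(opid).
```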

Steps to perform compaction

  1. Initiate Compaction on Secondary Nodes: Start by running the compaction process on the secondary nodes first. This helps to minimise the impact on your primary node's performance.
  2. Promote a Secondary to Primary: Once the compaction on the secondaries is complete, promote one of the secondary nodes to become the new primary. This can be done using your cluster management tools or commands.
  3. Run Compaction on the Former Primary: With the old primary now functioning as a secondary, proceed to run the compaction process on this node. This ensures that it is optimized without affecting the primary operations.
  4. Revert to Original Primary: After the compaction on the former primary is complete, switch back to using it as the primary node. This allows you to restore the original configuration with all nodes optimized.

Following this process helps ensure minimal disruption and maintains the performance of your MongoDB cluster.
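The rolling procedure above can be sketched as a small helper that derives the order in which to compact nodes from an rs.status()-style member list (mock data; the hostnames are illustrative):

```javascript
// Order replica-set members for a rolling compaction:
// secondaries first, the current primary last (after step-down).
// The member shape mimics rs.status().members.
function compactionOrder(members) {
  const secondaries = members.filter(m => m.stateStr === "SECONDARY").map(m => m.name);
  const primary = members.filter(m => m.stateStr === "PRIMARY").map(m => m.name);
  return secondaries.concat(primary);
}

const mockMembers = [
  { name: "node1:27017", stateStr: "PRIMARY" },
  { name: "node2:27017", stateStr: "SECONDARY" },
  { name: "node3:27017", stateStr: "SECONDARY" }
];
console.log(compactionOrder(mockMembers));
// [ 'node2:27017', 'node3:27017', 'node1:27017' ]
```

In a live cluster you would feed this the real rs.status().members array and run compact on each host in turn, stepping the primary down before its turn.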

Note: Starting from MongoDB version 8.0, a new feature called autoCompact has been introduced. This feature operates in the background to identify free spaces and reclaim that space, optimizing storage efficiency.

Initial sync

You can also perform an initial sync on the server to eliminate fragmented data. During this process, the secondary node takes an initial snapshot of the primary's data, copying all databases, collections, and indexes and storing the data in ordered blocks. This helps remove fragmentation.

Figure: MongoDB initial sync process: the secondary is stopped, its data is cleared, and an initial sync is taken from the primary.

Initial sync Process on existing collection

The initial sync process requires stopping the MongoDB service on the node. Therefore, ensure that either the primary or another secondary can handle the traffic in the cluster. We recommend performing this activity during non-production hours.

Steps for Initial Sync

  1. Stop the MongoDB Service: Begin by stopping the MongoDB service on one of the secondary nodes.
  2. Remove the Secondary from the Replica Set: Use the appropriate commands to remove the secondary node from the replica set.
  3. Restart the MongoDB Service: Start the MongoDB service on the secondary node again.
  4. Re-add the Node to the Replica Set: Add the node back to the replica set. Once the server transitions to the secondary state, the initial sync will commence.
  5. Validate Replication Lag: After the initial sync is complete, check the replication lag to ensure the secondary is fully synced with the primary node.

You can apply this method to all secondary and hidden nodes. If you need to perform an initial sync on the primary, elect a new primary and follow the same process. If the initial sync takes too long, consider using a disk snapshot from a secondary where the initial sync has already been completed. You can attach this disk to the servers in the replica set after stopping the MongoDB service and then restart the service after the disk snapshot has been added.
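For the final validation step, replication lag can be estimated by comparing optime timestamps across an rs.status()-style member list. This is a sketch with mock data; the field names follow rs.status().members, and the hostnames and timestamps are illustrative:

```javascript
// Compute per-secondary replication lag, in seconds, relative to the
// primary's optime. Member shape mimics rs.status().members.
function replicationLagSeconds(members) {
  const primary = members.find(m => m.stateStr === "PRIMARY");
  return members
    .filter(m => m.stateStr === "SECONDARY")
    .map(m => ({
      name: m.name,
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000
    }));
}

// Mock status: the secondary is 10 seconds behind the primary.
const mockStatusMembers = [
  { name: "node1:27017", stateStr: "PRIMARY",   optimeDate: new Date("2025-01-23T10:00:10Z") },
  { name: "node2:27017", stateStr: "SECONDARY", optimeDate: new Date("2025-01-23T10:00:00Z") }
];
console.log(replicationLagSeconds(mockStatusMembers));
// [ { name: 'node2:27017', lagSeconds: 10 } ]
```

A node freshly re-added after initial sync should see this lag fall to near zero before you consider the procedure complete.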

Based on our experience, we recommend using initial sync rather than the compaction method to address data fragmentation. This is because compaction must be executed on each individual collection, and there’s no guarantee that it will effectively reclaim all reclaimable disk space. By opting for initial sync, you ensure a more thorough and efficient approach to managing data fragmentation, while also minimising the potential for performance degradation associated with fragmented storage.

Stay tuned for Part 2: Preventing Fragmentation in MongoDB: Compacting and Cleaning Up Your Database, where we delve deep into index fragmentation.

Managing data fragmentation is crucial for maintaining peak database performance. Whether it's compaction, initial sync, or advanced fragmentation management strategies, Mydbops has the expertise to ensure your MongoDB remains efficient and reliable. Looking for expert guidance? Our tailored MongoDB Managed Services, Consulting, and Remote DBA Solutions are here to help. Reach out to us today and experience proactive, expert-led MongoDB support.


About the Author

Mydbops
