Unexpected IO Spikes in MongoDB: Diagnosing and Resolving TTL Index Issues During Standalone to Replica Conversions
In a recent project to convert a standalone MongoDB deployment into a replica set, a process we have completed many times for various clients, we ran into an unexpected challenge. Shortly after the replica set was enabled, the node's CPU usage and disk I/O activity rose sharply.
This prompted a thorough investigation into the underlying cause. Drawing on my experience in MongoDB administration and troubleshooting, I'll walk you through the steps I took to identify and address the issue, and reveal the surprising factor behind it.
Unforeseen Hiccups: A Long-Past Data Migration Mishap
During an earlier data migration or node replacement operation on this environment:
- The process was initiated by enabling the replica set and efficiently synchronizing data across the cluster.
- Subsequently, the old node was removed, and the newly added member was inadvertently reverted to a standalone instance by the activity team.
- However, in the final stage, a crucial step was overlooked: the removal of the local database from the standalone instance.
As a result of this oversight, unforeseen complications arose.
Anatomy of the local Database
- Each mongod instance maintains its own local database, which serves as a repository for data utilized in the replication process and other instance-specific data.
- Notably, the local database remains invisible to replication, meaning collections within it are not replicated.
The notable collections in the local database include local.oplog.rs (the oplog) and local.system.replset (the replica set configuration), along with other instance-specific collections such as local.replset.minvalid and local.startup_log.
Note: The local.oplog.rs and local.system.replset collections, along with other system collections in the local database, cannot be dropped individually; attempting to do so returns an error.
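As a quick illustration, here is a minimal shell sketch for inspecting the local database of the standalone instance; the drop attempts are included only to show that the server rejects them (the exact error messages vary by MongoDB version).

```javascript
// Run against the standalone instance in the mongo shell / mongosh
var localDB = db.getSiblingDB("local");

// List the collections kept in the local database
printjson(localDB.getCollectionNames());

// Inspect the leftover replica set configuration document
printjson(localDB.system.replset.findOne());

// Dropping these collections individually is rejected by the server
try { localDB.getCollection("oplog.rs").drop(); } catch (e) { print(e.message); }
try { localDB.getCollection("system.replset").drop(); } catch (e) { print(e.message); }
```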
Standalone to Replica Set Transition
Recap of Steps
- In our current scenario, the node described above was being converted into a replica set.
- During pre-validation, we noted that a local database already existed on the standalone instance.
- Knowing that keeping the old oplog in place while transitioning to a replica set would cause it to be reapplied, we decided to drop the local database and then start mongod as a replica set member (a sketch of these steps follows below).
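Below is a minimal sketch of that transition, assuming an illustrative replica set name of rs0 and a single-member initiation; your configuration file and member list will differ.

```javascript
// 1. On the standalone instance, remove the leftover local database
db.getSiblingDB("local").dropDatabase();

// 2. Restart mongod with a replica set name, e.g. in mongod.conf:
//      replication:
//        replSetName: rs0     // illustrative name
//    then initiate the replica set from the shell:
rs.initiate();

// 3. Confirm the member has reached PRIMARY state
rs.status().members.forEach(function (m) { print(m.name, m.stateStr); });
```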
Issue Arises
- Shortly after the restart, database load rose significantly and disk write I/O became heavily saturated.
- As an immediate remediation, the mongod process was stopped and the node was reverted to a standalone configuration, which brought resource usage back to normal.
- On inspecting the newly created oplog, a large volume of delete operation entries was observed. However, neither the mongod logs nor the opcounters showed corresponding delete activity in the monitoring data (the sketch below shows how both were checked).
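For reference, here is a minimal sketch of how the oplog contents and the operation counters can be compared; the aggregation simply groups oplog entries by operation type.

```javascript
// Summarise the entries in the newly created oplog by operation type
// ("d" = delete, "i" = insert, "u" = update, "n" = no-op)
db.getSiblingDB("local").getCollection("oplog.rs").aggregate([
  { $group: { _id: "$op", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
]);

// Compare against the server-level operation counters
printjson(db.serverStatus().opcounters);
```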
Implications of Disabled TTL and Warning Messages
- If a MongoDB deployment is running as a standalone but the system.replset collection is still present in the local database, the TTL monitor is disabled. A corresponding message appears among the startup warnings displayed in the mongo shell.
FYI: In this state, even if you validate the ttlMonitorEnabled server parameter, it is reported as enabled, but the actual TTL behavior is restricted (see the check below).
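You can confirm this with two admin commands: the server parameter reads as enabled, while the startup warnings indicate the restriction (output abbreviated).

```javascript
// The TTL monitor parameter still reports as enabled
db.adminCommand({ getParameter: 1, ttlMonitorEnabled: 1 });
// e.g. { "ttlMonitorEnabled" : true, "ok" : 1 }

// The startup warnings reveal that the node was started without --replSet
// even though replica set configuration exists, so TTL deletion will not run
db.adminCommand({ getLog: "startupWarnings" });
```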
Mitigating Obstacles
- To overcome this hurdle during the transition phase, the best approach is to purge, in advance, the data that already matches the TTL condition in the affected collections.
- First, identify the collections that have TTL indexes, then delete the expired data according to your chosen purging strategy (a sketch of both steps follows after this list).
- Prioritize the purging order based on the data size of each collection.
- Set a batch size and formulate the purge query based on the documents to be removed.
- Once all the old data is removed, cross-verify, then drop the local database after disabling authentication on the standalone deployment.
- Subsequently, observe the database health and resource utilization before performing the replica set transition.
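As a rough illustration, here is a minimal sketch of both steps, assuming an example collection named events with a TTL index on createdAt and a 5,000-document batch size; the collection name, field, retention window, and batch size are placeholders to adapt to your own data.

```javascript
// Step 1: locate every TTL index in the deployment
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  var database = db.getSiblingDB(d.name);
  database.getCollectionInfos({ type: "collection" }).forEach(function (info) {
    database.getCollection(info.name).getIndexes().forEach(function (idx) {
      if (idx.expireAfterSeconds !== undefined) {
        print(d.name + "." + info.name, JSON.stringify(idx.key),
              "expireAfterSeconds:", idx.expireAfterSeconds);
      }
    });
  });
});

// Step 2: purge already-expired documents in controlled batches
var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);  // illustrative TTL window
var batchSize = 5000;
var removed;
do {
  var ids = db.events.find({ createdAt: { $lt: cutoff } }, { _id: 1 })
                     .limit(batchSize)
                     .toArray()
                     .map(function (doc) { return doc._id; });
  removed = ids.length;
  if (removed > 0) {
    db.events.deleteMany({ _id: { $in: ids } });
  }
  sleep(100);  // brief pause to keep the load on the standalone manageable
} while (removed > 0);
```

Deleting by _id in fixed-size batches keeps each delete bounded, avoids one long-running operation, and lets you pause or resume the purge as load allows.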
Through meticulous analysis and collaboration with my team, we were able to identify the root cause of the problem and implement preventative measures to safeguard against similar incidents in the future. This experience underscored the importance of proactive planning and vigilant monitoring in ensuring the stability and integrity of MongoDB deployments.
Migrate Your MongoDB Without Resource Woes! Contact Mydbops today to leverage our MongoDB Managed Services and Consulting expertise. We'll help you avoid resource saturation and ensure a successful transition to a replica set for your remote databases.