Unexpected IO Spikes in MongoDB: Diagnosing and Resolving TTL Index Issues During Standalone to Replica Conversions
In a recent project to convert a standalone MongoDB deployment into a replica set, a process we have completed many times for various clients, we ran into an unexpected challenge. Shortly after the replica set was enabled, the node's CPU usage and disk I/O activity rose sharply.
This prompted a thorough investigation into the underlying cause. Drawing on my experience in MongoDB administration and troubleshooting, I'll walk you through the steps I took to identify and address the issue, and reveal the surprising factor behind it.
Unforeseen Hiccups: A Long-Past Data Migration Mishap
During an earlier data migration or node replacement operation on this environment:
- The process was initiated by enabling the replica set and efficiently synchronizing data across the cluster.
- Subsequently, the old node was removed, and the newly added member was inadvertently reverted to a standalone instance by the activity team.
- However, in the final stage, a crucial step was overlooked: the removal of the local database from the standalone instance.
As a result of this oversight, unforeseen complications arose.
Anatomy of the local Database
- Each mongod instance maintains its own local database, which serves as a repository for data utilized in the replication process and other instance-specific data.
- Notably, the local database remains invisible to replication, meaning collections within it are not replicated.
The notable collections in the local database include local.oplog.rs (the oplog) and local.system.replset (the replica set configuration), along with other instance-specific collections such as local.replset.minvalid and local.startup_log.
Note: The local.oplog.rs and local.system.replset collections, along with other system collections in the local database, cannot be dropped individually; attempting to do so returns an error.
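As a quick illustration, here is a minimal shell sketch for inspecting the local database of the standalone instance; the drop attempts are included only to show that the server rejects them (the exact error messages vary by MongoDB version).

```javascript
// Run against the standalone instance in the mongo shell / mongosh
var localDB = db.getSiblingDB("local");

// List the collections kept in the local database
printjson(localDB.getCollectionNames());

// Inspect the leftover replica set configuration document
printjson(localDB.system.replset.findOne());

// Dropping these collections individually is rejected by the server
try { localDB.getCollection("oplog.rs").drop(); } catch (e) { print(e.message); }
try { localDB.getCollection("system.replset").drop(); } catch (e) { print(e.message); }
```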
Standalone to Replica Set Transition
Recap of Steps
- In our current scenario, the node described above was being converted into a replica set.
- During pre-validation, we noted that a local database already existed on the standalone instance.
- Knowing that keeping the old oplog in place while transitioning to a replica set would cause it to be reapplied, we decided to drop the local database and then start mongod as a replica set member (a sketch of these steps follows below).
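Below is a minimal sketch of that transition, assuming an illustrative replica set name of rs0 and a single-member initiation; your configuration file and member list will differ.

```javascript
// 1. On the standalone instance, remove the leftover local database
db.getSiblingDB("local").dropDatabase();

// 2. Restart mongod with a replica set name, e.g. in mongod.conf:
//      replication:
//        replSetName: rs0     // illustrative name
//    then initiate the replica set from the shell:
rs.initiate();

// 3. Confirm the member has reached PRIMARY state
rs.status().members.forEach(function (m) { print(m.name, m.stateStr); });
```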
Issue Arises
- Shortly after the restart, database load rose significantly and disk write I/O became heavily saturated.
- As an immediate remediation, the mongod process was stopped and the node was reverted to a standalone configuration, which brought resource usage back to normal.
- On inspecting the newly created oplog, a large volume of delete operation entries was observed. However, neither the mongod logs nor the opcounters showed corresponding delete activity in the monitoring data (the sketch below shows how both were checked).
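For reference, here is a minimal sketch of how the oplog contents and the operation counters can be compared; the aggregation simply groups oplog entries by operation type.

```javascript
// Summarise the entries in the newly created oplog by operation type
// ("d" = delete, "i" = insert, "u" = update, "n" = no-op)
db.getSiblingDB("local").getCollection("oplog.rs").aggregate([
  { $group: { _id: "$op", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
]);

// Compare against the server-level operation counters
printjson(db.serverStatus().opcounters);
```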
Implications of Disabled TTL and Warning Messages
- If a MongoDB deployment is running as a standalone but the system.replset collection is still present in the local database, the TTL monitor is disabled. A corresponding message appears among the startup warnings displayed in the mongo shell.
FYI: In this state, even if you validate the ttlMonitorEnabled server parameter, it is reported as enabled, but the actual TTL behavior is restricted (see the check below).
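You can confirm this with two admin commands: the server parameter reads as enabled, while the startup warnings indicate the restriction (output abbreviated).

```javascript
// The TTL monitor parameter still reports as enabled
db.adminCommand({ getParameter: 1, ttlMonitorEnabled: 1 });
// e.g. { "ttlMonitorEnabled" : true, "ok" : 1 }

// The startup warnings reveal that the node was started without --replSet
// even though replica set configuration exists, so TTL deletion will not run
db.adminCommand({ getLog: "startupWarnings" });
```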
Mitigating Obstacles
- To overcome this hurdle during the transition phase, the best approach is to purge, in advance, the data that already matches the TTL condition in the affected collections.
- First, identify the collections that have TTL indexes, then delete the expired data according to your chosen purging strategy (a sketch of both steps follows after this list).
- Prioritize the purging order based on the data size of each collection.
- Set a batch size and formulate the purge query based on the documents to be removed.
- Once all the old data is removed, cross-verify, then drop the local database after disabling authentication on the standalone deployment.
- Subsequently, observe the database health and resource utilization before performing the replica set transition.
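As a rough illustration, here is a minimal sketch of both steps, assuming an example collection named events with a TTL index on createdAt and a 5,000-document batch size; the collection name, field, retention window, and batch size are placeholders to adapt to your own data.

```javascript
// Step 1: locate every TTL index in the deployment
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  var database = db.getSiblingDB(d.name);
  database.getCollectionInfos({ type: "collection" }).forEach(function (info) {
    database.getCollection(info.name).getIndexes().forEach(function (idx) {
      if (idx.expireAfterSeconds !== undefined) {
        print(d.name + "." + info.name, JSON.stringify(idx.key),
              "expireAfterSeconds:", idx.expireAfterSeconds);
      }
    });
  });
});

// Step 2: purge already-expired documents in controlled batches
var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);  // illustrative TTL window
var batchSize = 5000;
var removed;
do {
  var ids = db.events.find({ createdAt: { $lt: cutoff } }, { _id: 1 })
                     .limit(batchSize)
                     .toArray()
                     .map(function (doc) { return doc._id; });
  removed = ids.length;
  if (removed > 0) {
    db.events.deleteMany({ _id: { $in: ids } });
  }
  sleep(100);  // brief pause to keep the load on the standalone manageable
} while (removed > 0);
```

Deleting by _id in fixed-size batches keeps each delete bounded, avoids one long-running operation, and lets you pause or resume the purge as load allows.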
Through meticulous analysis and collaboration with my team, we were able to identify the root cause of the problem and implement preventative measures to safeguard against similar incidents in the future. This experience underscored the importance of proactive planning and vigilant monitoring in ensuring the stability and integrity of MongoDB deployments.
Migrate Your MongoDB Without Resource Woes! Contact Mydbops today to leverage our MongoDB Managed Services and Consulting expertise. We'll help you avoid resource saturation and ensure a successful transition to a replica set for your remote databases.