In the world of databases, running the latest MongoDB version can bring better performance and new functionality. But sometimes, new versions also bring unexpected problems.
Imagine a MongoDB setup that is running well on version 4.0.28. Everything seems fine until you decide to upgrade to version 4.2.24. That's when things get tricky: chunk migration failures start happening, even though the balancer is working fine. Our Mydbops MongoDB team recently encountered exactly this situation while upgrading a shard from 4.0 (v4.0.28) to 4.2 (v4.2.24).
This blog is about the journey to fix this issue. We'll dig into the error messages, find out why it happened, and learn how to deal with it. Join us as we go behind the scenes of a MongoDB shard upgrade, where a seemingly small index inconsistency led to big trouble.
Issue
Our story begins with a seemingly healthy MongoDB shard running version 4.0.28. Everything appeared to be in order until we embarked on the upgrade to version 4.2.24. During this upgrade, a rather unexpected issue reared its head: chunk migration failures, even though the balancer was functioning correctly.
Error Details
Error in router nodes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
515 : Success
49 : Failed with error 'aborted', from shardA to shardB
1 : Failed with error 'aborted', from shardB to shardA
databases:
{ "_id" : "Information", "primary" : "shardB", "partitioned" : true, "version" : { "uuid" : UUID("e4999523-772d-4d21-af01-2c4f38efbdc2"), "lastMod" : 1 } }
Information.movies
shard key: { "released" : "hashed" }
unique: false
balancing: true
chunks:
shardA 5
shardB 3
Error from Mongo log
2023-08-11T09:59:02.301+0530 I SHARDING [Balancer] Balancer move Information.movies: [{ released: 0 }, { released: 40158834000849533 }), from shardA, to shardB failed :: caused by :: OperationFailed: Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually
Cause for Issue
Going deeper into the core of the issue, we found that both the source shard (shardA) and the destination shard (shardB) had an equal number of indexes, and these indexes had identical keys. However, the crucial difference between them was the names assigned to these indexes. Surprisingly, this seemingly minor distinction in index names turned out to be the primary reason behind the failure in migrating the chunks between the two shards.
If the mismatch is in the index names, it is readily visible by comparing the index listings on each shard, as shown in the example below. However, if the disparity is in other index properties, you will need to use the MongoDB Shard Index Inconsistency Script to identify and address those inconsistencies.
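For example, a name-only mismatch is easy to spot by listing the indexes of the affected collection on each shard's primary. The output below is purely illustrative (the index names are assumptions for this example); the keys are identical and only the name differs:
use Information
db.movies.getIndexes()
// shardA: { "v" : 2, "key" : { "released" : "hashed" }, "name" : "released_hashed", "ns" : "Information.movies" }
// shardB: { "v" : 2, "key" : { "released" : "hashed" }, "name" : "idx_released_hashed", "ns" : "Information.movies" }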
On MongoDB 4.0, this kind of index name mismatch does not cause any issue for chunk migration. But after upgrading the shard's primary member to 4.2 (with the FCV not yet set to 4.2, still at 4.0), the same chunk migration fails repeatedly between those shards.
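For reference, the current feature compatibility version can be checked at any time with the standard getParameter command:
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
// At this stage of the upgrade it still reports: { "featureCompatibilityVersion" : { "version" : "4.0" }, "ok" : 1 }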
The following messages were written to the log repeatedly.
2023-08-30T15:39:11.150+0000 I SHARDING [conn45] Starting chunk migration ns: Information.movies, [{ released: MinKey }, { released: -8016517071337804512 }), fromShard: shardA, toShard: shardB with expected collection version epoch 64ef335fc5e83f3209ded598
2023-08-30T15:39:11.152+0000 I SHARDING [conn45] about to log metadata event into changelog: { _id: "ip-172-31-44-90-2023-08-30T15:39:11.152+0000-64ef629fdcd04d9990a86ecf", server: "ip-172-31-44-90", clientAddr: "172.31.42.216:54736", time: new Date(1693409951152), what: "moveChunk.start", ns: "Information.movies", details: { min: { released: MinKey }, max: { released: -8016517071337804512 }, from: "shardA", to: "shardB" } }
2023-08-30T15:39:11.162+0000 I SHARDING [conn45] moveChunk data transfer progress: { waited: true, active: true, sessionId: "shardA_shardB_64ef629fdcd04d9990a86ed0", ns: "Information.movies", from: "shardA/172.31.44.90:27018", fromShardId: "shardA", min: { released: MinKey }, max: { released: -8016517071337804512 }, shardKeyPattern: { released: "hashed" }, supportsCriticalSectionDuringCatchUp: true, state: "fail", errmsg: "migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be schedul...", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0, $gleStats: { lastOpTime: { ts: Timestamp(1693409168, 1), t: 2 }, electionId: ObjectId('7fffffff0000000000000002') }, lastCommittedOpTime: Timestamp(1693409948, 1), $configServerState: { opTime: { ts: Timestamp(1693409951, 5), t: 2 } }, $clusterTime: { clusterTime: Timestamp(1693409951, 5), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1693409948, 1) } mem used: 0 documents remaining to clone: 1487
2023-08-30T15:39:11.162+0000 I SHARDING [conn45] about to log metadata event into changelog: { _id: "ip-172-31-44-90-2023-08-30T15:39:11.162+0000-64ef629fdcd04d9990a86ed4", server: "ip-172-31-44-90", clientAddr: "172.31.42.216:54736", time: new Date(1693409951162), what: "moveChunk.error", ns: "Information.movies", details: { min: { released: MinKey }, max: { released: -8016517071337804512 }, from: "shardA", to: "shardB" } }
2023-08-30T15:39:11.164+0000 W SHARDING [conn45] Chunk move failed :: caused by :: OperationFailed: Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually
Justification
Confirmation that the issue arose in 4.2
To confirm that the problem specifically arose after upgrading to MongoDB version 4.2, we conducted a manual chunk migration as a test. Interestingly, the command completed successfully when executed in MongoDB version 4.0. However, when we ran the same command in MongoDB version 4.2, it failed and produced the exact error that we had encountered during the actual migration process.
mongos> db.runCommand( {
... moveChunk : 'Information.movies' ,
... bounds :[ { "released" : NumberLong(0) }, { "released" : NumberLong("40158834000849533") } ],
... to : 'shardB' }
... )
{
"ok" : 0,
"errmsg" : "Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually",
"code" : 96,
"codeName" : "OperationFailed",
"operationTime" : Timestamp(1693405407, 9),
"$clusterTime" : {
"clusterTime" : Timestamp(1693405407, 9),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
}
Validating Inconsistent Indexes in a Sharded Cluster
MongoDB exposes a useful serverStatus metric, numShardedCollectionsWithInconsistentIndexes, which reports how many sharded collections have inconsistent indexes across the shards.
Starting from version 4.2.6, this metric can be read from the primary member of the config server replica set.
To enable or disable index consistency checks for sharded collections on the primary config server, use these commands:
At MongoDB startup
mongod --setParameter enableShardedIndexConsistencyCheck=true
At MongoDB runtime
db.adminCommand( { setParameter: 1, enableShardedIndexConsistencyCheck: true } )
After confirming that enableShardedIndexConsistencyCheck is enabled, run the following command on the config server primary.
db.serverStatus().shardedIndexConsistency
The above command will return the number of sharded collections having inconsistent indexes.
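The output is a single document; for example (the count shown here is illustrative):
{ "numShardedCollectionsWithInconsistentIndexes" : NumberLong(1) }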
It's important to note that the default interval for the index inconsistency check is 10 minutes. You can change it with the shardedIndexConsistencyCheckIntervalMS parameter, but this parameter can only be set at startup; it cannot be changed at runtime.
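For instance, to run the check every 5 minutes instead of the default 10, the parameter (value in milliseconds) has to be supplied when starting the config server, as a rough illustration:
mongod --setParameter shardedIndexConsistencyCheckIntervalMS=300000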
To resolve the issue, the index must be recreated with the same properties, and the same name, as on the other shards, so that the index definition is identical everywhere.
Depending on the production use case, a strategy must be developed to recreate these indexes effectively.
Our Experience
In our client's production environment, there are six shards, each with five data-bearing members, and each shard holds 1 TB of data. Notably, only two of these shards had the correct index name; the other four had a different index name, which is what triggered the issue described above.
Temporary fix
To address the issue, we followed a specific procedure. First, we removed the current primary member from the cluster and operated it as a standalone instance. Then, we deleted the existing index on this standalone member and re-created the index, making sure to include the name option in the index creation command.
Once the index creation completed successfully, we reintegrated this member back into the cluster and reinstated it as the primary. This approach allowed us to resolve the index inconsistency efficiently.
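A minimal sketch of the drop-and-recreate step, executed on the member while it was running as a standalone (the collection and index names follow the earlier example and are assumptions; use the exact name reported by the shards that already have the correct index):
use Information
// Drop the index whose name differs from the other shards (name here is illustrative)
db.movies.dropIndex("idx_released_hashed")
// Recreate it with the same key and, crucially, the exact name used on the other shards
db.movies.createIndex( { released: "hashed" }, { name: "released_hashed" } )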
Permanent fix
The secondary members of the shards still did not have the corrected index, since index changes made while a member runs as a standalone are not replicated. To address this, we planned a careful, rolling strategy to recreate the indexes on these secondary members as well, as sketched below.
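One way to do this without downtime is the standard rolling index build procedure, sketched here; the port, dbpath, and index names are assumptions for illustration only:
# Repeat for each secondary member, one at a time
# 1. Restart the member as a standalone on a different port (omit --replSet)
mongod --port 27219 --dbpath /data/shard-member --setParameter disableLogicalSessionCacheRefresh=true
# 2. Drop and recreate the index with the correct name on the standalone
mongo --port 27219 --eval 'db.getSiblingDB("Information").movies.dropIndex("idx_released_hashed"); db.getSiblingDB("Information").movies.createIndex({ released: "hashed" }, { name: "released_hashed" })'
# 3. Restart the member with its original replica set configuration and wait for it to catch up
#    before moving on to the next secondary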
We appreciate your participation in this troubleshooting journey, and we hope that the insights we've shared will prove valuable in your MongoDB-related endeavors.
Note: The information provided in this blog is based on real-world scenarios encountered by the Mydbops MongoDB team. MongoDB's behaviour may evolve with newer versions, so it is advisable to consult the official documentation and conduct comprehensive testing when undertaking upgrades.
Explore Mydbops Blogs for a wealth of MongoDB-related content and resources.