In the world of databases, running the latest MongoDB version can bring better performance and new functionality. But sometimes, new versions also bring unexpected problems.
Imagine a MongoDB setup that is running well on version 4.0.28. Everything seems fine until you decide to upgrade to version 4.2.24. That's when things get tricky: chunk migration failures start happening, even though the balancer is working fine. Our Mydbops MongoDB team recently encountered exactly this situation while upgrading a shard from 4.0 (v4.0.28) to 4.2 (v4.2.24).
This blog is about the journey to fix this issue. We'll dig into the error messages, find out why it happened, and learn how to deal with it. Join us as we go behind the scenes of a MongoDB shard upgrade, where a seemingly small index inconsistency led to big trouble.
Issue
Our story begins with a seemingly healthy MongoDB shard running version 4.0.28. Everything appeared to be in order until we embarked on the upgrade to version 4.2.24. During this upgrade, a rather unexpected issue reared its head: chunk migration failures, even though the balancer was functioning correctly.
Error Details
Error in router nodes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
515 : Success
49 : Failed with error 'aborted', from shardA to shardB
1 : Failed with error 'aborted', from shardB to shardA
databases:
{ "_id" : "Information", "primary" : "shardB", "partitioned" : true, "version" : { "uuid" : UUID("e4999523-772d-4d21-af01-2c4f38efbdc2"), "lastMod" : 1 } }
Information.movies
shard key: { "released" : "hashed" }
unique: false
balancing: true
chunks:
shardA 5
shardB 3
Error from Mongo log
2023-08-11T09:59:02.301+0530 I SHARDING [Balancer] Balancer move Information.movies: [{ released: 0 }, { released: 40158834000849533 }), from shardA, to shardB failed :: caused by :: OperationFailed: Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually
Cause for Issue
Going deeper into the core of the issue, we found that both the source shard (shardA) and the destination shard (shardB) had an equal number of indexes, and these indexes had identical keys. However, the crucial difference between them was the names assigned to these indexes. Surprisingly, this seemingly minor distinction in index names turned out to be the primary reason behind the failure in migrating the chunks between the two shards.
If the mismatch is in the index names, it is readily visible by comparing the index listings on each shard, as shown in the example below. However, if the disparity is in other index properties, you will need to use the MongoDB Shard Index Inconsistency Script to identify and address those inconsistencies.
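For example, a name-only mismatch is easy to spot by listing the indexes of the affected collection on each shard's primary. The output below is purely illustrative (the index names are assumptions for this example); the keys are identical and only the name differs:
use Information
db.movies.getIndexes()
// shardA: { "v" : 2, "key" : { "released" : "hashed" }, "name" : "released_hashed", "ns" : "Information.movies" }
// shardB: { "v" : 2, "key" : { "released" : "hashed" }, "name" : "idx_released_hashed", "ns" : "Information.movies" }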
On MongoDB 4.0, this kind of index name mismatch does not cause any issue for chunk migration. But after upgrading the shard's primary member to 4.2 (with the FCV not yet set to 4.2, still at 4.0), the same chunk migration fails repeatedly between those shards.
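For reference, the current feature compatibility version can be checked at any time with the standard getParameter command:
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
// At this stage of the upgrade it still reports: { "featureCompatibilityVersion" : { "version" : "4.0" }, "ok" : 1 }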
The following messages were written to the log repeatedly.
2023-08-30T15:39:11.150+0000 I SHARDING [conn45] Starting chunk migration ns: Information.movies, [{ released: MinKey }, { released: -8016517071337804512 }), fromShard: shardA, toShard: shardB with expected collection version epoch 64ef335fc5e83f3209ded598
2023-08-30T15:39:11.152+0000 I SHARDING [conn45] about to log metadata event into changelog: { _id: "ip-172-31-44-90-2023-08-30T15:39:11.152+0000-64ef629fdcd04d9990a86ecf", server: "ip-172-31-44-90", clientAddr: "172.31.42.216:54736", time: new Date(1693409951152), what: "moveChunk.start", ns: "Information.movies", details: { min: { released: MinKey }, max: { released: -8016517071337804512 }, from: "shardA", to: "shardB" } }
2023-08-30T15:39:11.162+0000 I SHARDING [conn45] moveChunk data transfer progress: { waited: true, active: true, sessionId: "shardA_shardB_64ef629fdcd04d9990a86ed0", ns: "Information.movies", from: "shardA/172.31.44.90:27018", fromShardId: "shardA", min: { released: MinKey }, max: { released: -8016517071337804512 }, shardKeyPattern: { released: "hashed" }, supportsCriticalSectionDuringCatchUp: true, state: "fail", errmsg: "migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be schedul...", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0, $gleStats: { lastOpTime: { ts: Timestamp(1693409168, 1), t: 2 }, electionId: ObjectId('7fffffff0000000000000002') }, lastCommittedOpTime: Timestamp(1693409948, 1), $configServerState: { opTime: { ts: Timestamp(1693409951, 5), t: 2 } }, $clusterTime: { clusterTime: Timestamp(1693409951, 5), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1693409948, 1) } mem used: 0 documents remaining to clone: 1487
2023-08-30T15:39:11.162+0000 I SHARDING [conn45] about to log metadata event into changelog: { _id: "ip-172-31-44-90-2023-08-30T15:39:11.162+0000-64ef629fdcd04d9990a86ed4", server: "ip-172-31-44-90", clientAddr: "172.31.42.216:54736", time: new Date(1693409951162), what: "moveChunk.error", ns: "Information.movies", details: { min: { released: MinKey }, max: { released: -8016517071337804512 }, from: "shardA", to: "shardB" } }
2023-08-30T15:39:11.164+0000 W SHARDING [conn45] Chunk move failed :: caused by :: OperationFailed: Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually
Justification
Confirmation that the issue arose in 4.2
To confirm that the problem specifically arose after upgrading to MongoDB version 4.2, we conducted a manual chunk migration as a test. Interestingly, the command completed successfully when executed in MongoDB version 4.0. However, when we ran the same command in MongoDB version 4.2, it failed and produced the exact error that we had encountered during the actual migration process.
mongos> db.runCommand( {
... moveChunk : 'Information.movies' ,
... bounds :[ { "released" : NumberLong(0) }, { "released" : NumberLong("40158834000849533") } ],
... to : 'shardB' }
... )
{
"ok" : 0,
"errmsg" : "Data transfer error: migrate failed: CannotCreateCollection: aborting, shard is missing 1 indexes and collection is not empty. Non-trivial index creation should be scheduled manually",
"code" : 96,
"codeName" : "OperationFailed",
"operationTime" : Timestamp(1693405407, 9),
"$clusterTime" : {
"clusterTime" : Timestamp(1693405407, 9),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
}
Validating Inconsistent Indexes in a Sharded Cluster
MongoDB exposes a useful serverStatus metric, numShardedCollectionsWithInconsistentIndexes, which reports how many sharded collections have inconsistent indexes across the shards.
Starting from version 4.2.6, this metric can be read from the primary member of the config server replica set.
To enable or disable index consistency checks for sharded collections on the primary config server, use these commands:
At MongoDB startup
mongod --setParameter enableShardedIndexConsistencyCheck=true
At MongoDB runtime
db.adminCommand( { setParameter: 1, enableShardedIndexConsistencyCheck: true } )
After confirming that enableShardedIndexConsistencyCheck is enabled, run the following command on the config server primary.
db.serverStatus().shardedIndexConsistency
The above command will return the number of sharded collections having inconsistent indexes.
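The output is a single document; for example (the count shown here is illustrative):
{ "numShardedCollectionsWithInconsistentIndexes" : NumberLong(1) }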
It's important to note that the default interval for the index inconsistency check is 10 minutes. You can change it with the shardedIndexConsistencyCheckIntervalMS parameter, but this parameter can only be set at startup; it cannot be changed at runtime.
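For instance, to run the check every 5 minutes instead of the default 10, the parameter (value in milliseconds) has to be supplied when starting the config server, as a rough illustration:
mongod --setParameter shardedIndexConsistencyCheckIntervalMS=300000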
To resolve the issue, the index must be recreated with the same properties, and the same name, as on the other shards, so that the index definition is identical everywhere.
Depending on the production use case, a strategy must be developed to recreate these indexes effectively.
Our Experience
In our client's production environment, there are six shards, each with five data-bearing members, and each shard holds 1 TB of data. Notably, only two of these shards had the correct index name; the other four had a different index name, which is what triggered the issue described above.
Temporary fix
To address the issue, we followed a specific procedure. First, we removed the current primary member from the cluster and operated it as a standalone instance. Then, we deleted the existing index on this standalone member and re-created the index, making sure to include the name option in the index creation command.
Once the index creation completed successfully, we reintegrated this member back into the cluster and reinstated it as the primary. This approach allowed us to resolve the index inconsistency efficiently.
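A minimal sketch of the drop-and-recreate step, executed on the member while it was running as a standalone (the collection and index names follow the earlier example and are assumptions; use the exact name reported by the shards that already have the correct index):
use Information
// Drop the index whose name differs from the other shards (name here is illustrative)
db.movies.dropIndex("idx_released_hashed")
// Recreate it with the same key and, crucially, the exact name used on the other shards
db.movies.createIndex( { released: "hashed" }, { name: "released_hashed" } )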
Permanent fix
The secondary members of the shards still did not have the corrected index, since index changes made while a member runs as a standalone are not replicated. To address this, we planned a careful, rolling strategy to recreate the indexes on these secondary members as well, as sketched below.
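One way to do this without downtime is the standard rolling index build procedure, sketched here; the port, dbpath, and index names are assumptions for illustration only:
# Repeat for each secondary member, one at a time
# 1. Restart the member as a standalone on a different port (omit --replSet)
mongod --port 27219 --dbpath /data/shard-member --setParameter disableLogicalSessionCacheRefresh=true
# 2. Drop and recreate the index with the correct name on the standalone
mongo --port 27219 --eval 'db.getSiblingDB("Information").movies.dropIndex("idx_released_hashed"); db.getSiblingDB("Information").movies.createIndex({ released: "hashed" }, { name: "released_hashed" })'
# 3. Restart the member with its original replica set configuration and wait for it to catch up
#    before moving on to the next secondary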
We appreciate your participation in this troubleshooting journey, and we hope that the insights we've shared will prove valuable in your MongoDB-related endeavors.
Note: The information provided in this blog is based on real-world scenarios encountered by the Mydbops MongoDB team. MongoDB's behaviour may evolve with newer versions, so it is advisable to consult the official documentation and conduct comprehensive testing when undertaking upgrades.
Explore Mydbops Blogs for a wealth of MongoDB-related content and resources.