Mastering MySQL Group Replication Recovery: Strategies for High Availability and Data Integrity

Mydbops
Mar 5, 2024

MySQL Group Replication offers high availability and data redundancy, but as with any technology, it's essential to be prepared for unexpected events. In the dynamic world of database management, ensuring high availability and data integrity is paramount. In this blog, we'll explore the standard practices and procedures for recovering from complete failures in a Group Replication cluster.

Recovery Methods

In MySQL Group Replication, recovery from failures involves two main methods: partial recovery and full recovery. Partial recovery reintegrates a single node after it has been withdrawn for maintenance or dropped by a network issue, while full recovery restores the entire cluster after a simultaneous outage, minimizing data loss and preserving integrity.

Partial recovery

When a single node is withdrawn from the cluster, either intentionally for maintenance or due to a network issue, and is subsequently reintegrated, this constitutes partial recovery of the group. This approach allows essential maintenance activities to take place without compromising the overall integrity of the replication group, and the node's seamless reintegration keeps the recovery process controlled and efficient, maintaining the group's resilience.

Full recovery

In the scenario where all nodes within the cluster experience a simultaneous outage, achieving a full recovery involves not only bringing the servers back online but also meticulously restoring the entire cluster with up-to-date data. The emphasis lies on minimizing or mitigating data loss during this recovery process, ensuring a comprehensive restoration strategy that prioritizes the preservation of data integrity and continuity within the cluster.

Recovery Workflow

  1. Joining nodes initiate connections with the seed nodes listed in group_replication_group_seeds, facilitating the formation of the replication group.
  2. Seed members share an updated list of available group members with the joiner node, ensuring a comprehensive view for efficient data transfer.
  3. The joining node, equipped with the member list, selects a donor node for the transfer of data.
  4. Connection attempts are made from the joining node to the chosen donor node to establish a replication link.
  5. In case of connection issues with the chosen donor node, the joining node dynamically tries other available members, optimizing the data transfer process.
  6. The joining node undergoes a synchronization process, ensuring consistent and reliable integration with the replication group.
  7. If the number of missing transactions is below the group_replication_clone_threshold value (default: 9223372036854775807), the missing data is synchronized from the donor node using the binary log (incremental recovery).
  8. If the number of missing transactions surpasses the configured threshold (and the clone plugin is available on both donor and joiner), the joining node performs a remote clone of the complete dataset from the donor node.

Note: Ensure that the BACKUP_ADMIN privilege is granted to the replication user to facilitate the remote clone operation.
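
For reference, the related settings can be adjusted per node; a minimal sketch (the host names, ports, and threshold value below are illustrative, not from the original setup):

mysql> SET PERSIST group_replication_group_seeds = 'node1:33061,node2:33061';
mysql> SET PERSIST group_replication_clone_threshold = 1000;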

[Figure: MySQL Group Replication Recovery Workflow]

Recovery State Definitions

  • OFFLINE:

The node is currently not an active participant in the MySQL group replication. This state is transient, manifesting when the node is either in the process of joining the group or undergoing re-entry after a temporary disconnection.

  • RECOVERING:

The node has successfully joined the group and is going through distributed recovery. During this stage, data transfer is underway, carried out either through a remote clone operation or from a donor's binary log.

  • ERROR:

The node has encountered a problem that has caused it to enter an error state, rendering it unable to actively participate in the group.

  • ONLINE:

The node is actively and successfully participating in the replication group. This state indicates that the node is synchronized with the group and is capable of both sending and receiving transactions, contributing to the overall consistency and functionality of the replication setup.

  • UNREACHABLE:

The node is currently unreachable or inaccessible within the replication group. This state typically occurs when network issues or communication problems are preventing the node from establishing or maintaining connections with other members of the group.
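
These states can be checked at any time from the performance_schema.replication_group_members table; for example (MEMBER_ROLE and MEMBER_VERSION are available in MySQL 8.0):

mysql> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION FROM performance_schema.replication_group_members;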

Partial Recovery Scenario

Partial recovery is helpful in scenarios where a node has been down for a short duration due to a network partition, or where a node has been intentionally taken offline for maintenance.

In a two-node Group Replication setup, only one of the nodes is currently active.

Node 1:

 
mysql> select MEMBER_HOST, MEMBER_STATE from performance_schema.replication_group_members;
+---------------+--------------+
| MEMBER_HOST   | MEMBER_STATE |
+---------------+--------------+
| Node 1        | ONLINE       |
+---------------+--------------+
1 row in set (0.00 sec)
	

The failed node will try to reconnect to the cluster members up to the number of attempts set in group_replication_recovery_retry_count. Once the retry limit has been reached, the node is marked as OFFLINE, and Group Replication has to be started manually on the server in the OFFLINE state.

Node 2:

 
mysql> select MEMBER_HOST, MEMBER_STATE from performance_schema.replication_group_members;
+-------------+--------------+
| MEMBER_HOST | MEMBER_STATE |
+-------------+--------------+
| Node 2      | OFFLINE      |
+-------------+--------------+
1 row in set (0.00 sec)
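
To bring the node back into the group, start Group Replication manually on the OFFLINE server:

mysql> START GROUP_REPLICATION;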
	

Once Group Replication has been started, the node initiates a synchronization process to retrieve missing data from the other online nodes. Throughout this period, Node 2 will temporarily be in the RECOVERING state, ensuring a seamless integration of the updated data and maintaining the overall consistency of the replication group.

 
mysql> select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;
+---------------+--------------+-------------+
| MEMBER_HOST   | MEMBER_STATE | MEMBER_ROLE |
+---------------+--------------+-------------+
| Node 1        | ONLINE       | PRIMARY     |
| Node 2        | RECOVERING   | SECONDARY   |
+---------------+--------------+-------------+
2 rows in set (0.01 sec)
	

Data recovery process

Incremental recovery

Whenever a member joins or leaves the group, triggering a view change, each online server logs a view change event for future execution. By queuing this event after applying any pending transactions, we ensure accurate tracking of when this transition occurs.

Meanwhile, the joining member identifies a suitable donor from the list of online servers provided by the membership service via the view abstraction. The member then joins the new view, and the online members collectively record a view change event in the binary log.

The server uses the incremental data sync technique if the joiner node's missing transactions have not exceeded the group_replication_clone_threshold limit. It pulls the data from the donor node's binary log and applies it on the joiner node, much like asynchronous replication.

The joining member establishes a connection with the donor, initiating the state transfer process using the binary log from the donor node. This communication with the donor persists until the joining server's applier thread processes the view change log event associated with the view change triggered when the joining server became part of the group.
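
This transfer runs over the dedicated group_replication_recovery channel, whose credentials are configured ahead of time. A minimal sketch (the password is a placeholder; on MySQL versions before 8.0.23 the equivalent statement is CHANGE MASTER TO):

mysql> CHANGE REPLICATION SOURCE TO SOURCE_USER='rpl_user', SOURCE_PASSWORD='********' FOR CHANNEL 'group_replication_recovery';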

 
2023-11-14T05:29:39.243084Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-11-14T05:29:39.243222Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2023-11-14T05:29:39.243082Z 35 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 172.31.60.169:3306.'
2023-11-14T05:29:39.243531Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to node 1:3306, node 2:3306 on view 16999388627713930:4.'
	

We can actively monitor the recovery progress within the group by executing SHOW SLAVE STATUS FOR CHANNEL 'group_replication_recovery' and reviewing the syncing details.

 
mysql> show slave status for channel 'group_replication_recovery'\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: Node 1
                  Master_User: rpl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000020
          Read_Master_Log_Pos: 835021208
               Relay_Log_File: ip-172-31-53-9-relay-bin-group_replication_recovery.000006
                Relay_Log_Pos: 826284157
        Relay_Master_Log_File: binlog.000020
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 826283947
              Relay_Log_Space: 835021748
              Until_Condition: SQL_VIEW_ID
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 514
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: a0bf2766-71d4-11ee-816c-06e6380e502d
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 1
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 24bbde8d-6b6b-11ee-9c5f-06e65eab1927:21928-25680
            Executed_Gtid_Set: 24bbde8d-6b6b-11ee-9c5f-06e65eab1927:1-25665
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: group_replication_recovery
           Master_TLS_Version: TLSv1.2,TLSv1.3
       Master_public_key_path:
        Get_master_public_key: 0
            Network_Namespace:
1 row in set, 1 warning (0.00 sec)
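
On MySQL 8.0.22 and later, the same details are available through the non-deprecated form of the command:

mysql> SHOW REPLICA STATUS FOR CHANNEL 'group_replication_recovery'\G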
	

After completing data synchronization, the server transitions to the ONLINE state and participates fully in the group's replication, signifying that its data is now in complete harmony with the rest of the cluster. With synchronization achieved, the server is ready to efficiently handle and serve incoming traffic, ensuring optimal performance and reliability for end users.

 
mysql> select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;
+---------------+--------------+-------------+
| MEMBER_HOST   | MEMBER_STATE | MEMBER_ROLE |
+---------------+--------------+-------------+
| Node 1        | ONLINE       | PRIMARY     |
| Node 2        | ONLINE       | SECONDARY   |
+---------------+--------------+-------------+
2 rows in set (0.00 sec)
	

Cloning

When the number of missing transactions on the joiner node exceeds the group_replication_clone_threshold limit, the server performs the data synchronization using the cloning method.

The pertinent details about the cloning operation can be retrieved from the error log of the recovering server. Upon successful completion of cloning, the server restarts automatically, coming back up in an updated state aligned with the synchronized data.
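
Clone-based recovery relies on the clone plugin being available on both the donor and the joiner, and its progress can be followed from the joiner. A minimal sketch (the plugin library name shown is the Linux build; the clone tables ship with the clone plugin):

mysql> INSTALL PLUGIN clone SONAME 'mysql_clone.so';
mysql> SELECT STAGE, STATE, ESTIMATE, DATA FROM performance_schema.clone_progress;
mysql> SELECT STATE, ERROR_NO, ERROR_MESSAGE FROM performance_schema.clone_status;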

 
2023-11-14T05:40:43.373142Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-11-14T05:40:43.373248Z 0 [Warning] [MY-013469] [Repl] Plugin group_replication reported: 'This member will start distributed recovery using clone. It is due to the number of missing transactions being higher than the configured threshold of 1000.'
2023-11-14T05:40:44.373187Z 35 [System] [MY-014010] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been started.'
2023-11-14T05:40:44.406520Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Cloning from a remote group donor.'
2023-11-14T05:40:44.406806Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to Node 1:3306, Node 2:3306 on view 16999388627713930:6.'
2023-11-14T05:40:44.408679Z 70 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2023-11-14T05:40:44.740702Z 71 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Started
2023-11-14T05:40:45.482446Z 71 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Finished
	

After the successful completion of the cloning process, any remaining transactions that were committed during the cloning phase are recovered using the incremental recovery mechanism.

 
2023-11-14T05:43:18.584925Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-11-14T05:43:18.585076Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2023-11-14T05:43:18.584926Z 12 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as a secondary member with primary member address Node 1:3306.'
2023-11-14T05:43:18.585419Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to Node 1:3306, Node 2:3306 on view 16999388627713930:8.'
	

Once all the data is synced, the server will become ONLINE.

 
mysql> select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;
+---------------+--------------+-------------+
| MEMBER_HOST   | MEMBER_STATE | MEMBER_ROLE |
+---------------+--------------+-------------+
| Node 1        | ONLINE       | PRIMARY     |
| Node 2        | ONLINE       | SECONDARY   |
+---------------+--------------+-------------+
2 rows in set (0.00 sec)
	

Full Recovery Methods

[Figure: Full Recovery]

Method 1: Recovery from backup

In cases of server crashes caused by hardware issues, or when the server fails to start on the same machine, the recommended approach is data recovery from backups. We adhere to a standard practice of daily backups, which can take the form of either logical or physical backups depending on the use case and data size.

Here, we've outlined the steps to prepare the primary server for Group Replication using the scheduled automated backups.

Physical backups

Sample backup command:

 
xtrabackup --backup --no-lock --binlog-info --parallel=4 --target-dir=
	

Note: For a comprehensive overview of the available options, please consult the xtrabackup documentation page, where you can find in-depth information.

Verify that the backup completed successfully by checking the backup log.

Backup logs:

 
2023-10-18T09:51:44.047882-00:00 0 [Note] [MY-011825] [Xtrabackup] Transaction log of lsn (291263818) to (291263818) was copied.
2023-10-18T09:51:44.257076-00:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
	

Sample prepare command:

 
xtrabackup --prepare --use-memory=2G --target-dir=
	

Ensure the prepare phase completes without any errors or warnings. Once the prepared backup is ready, start the server using that data directory.

Prepare logs:

 
2023-10-18T09:52:20.420043-00:00 0 [Note] [MY-013084] [InnoDB] Log background threads are being closed...
2023-10-18T09:52:20.445621-00:00 0 [Note] [MY-012980] [InnoDB] Shutdown completed; log sequence number 291264022
2023-10-18T09:52:20.449891-00:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!
	

Logical backups

Sample backup command:

 
mydumper --verbose=3 --outputdir= --logfile=/.log
	

Verify that the backup completed successfully by reviewing the log.

Backup logs:

 
2023-10-18 10:40:13 [INFO] - Releasing DDL lock
2023-10-18 10:40:13 [INFO] - Queue count: 0 0 0 0 0
2023-10-18 10:40:13 [INFO] - Main connection closed
2023-10-18 10:40:13 [INFO] - Finished dump at: 2023-10-18 10:40:13
	

Sample restore command:

 
myloader --directory=  --verbose=3 --logfile=/.log
	

After the restoration process, carefully scrutinize the loader log for any ERROR, WARNING, or CRITICAL messages to ensure the integrity of the operation.

Restore logs:

 
2023-10-18 10:47:43 [INFO] - Errors found:
- Tablespace:   0
- Schema:       0
- Data:         0
- View:         0
- Sequence:     0
- Index:        0
- Trigger:      0
- Constraint:   0
- Post:         0
Retries:        0
	

After the primary server within the group replication has been successfully restored from the backup, adding the remaining nodes to the group becomes a straightforward process, as sketched below.
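
A minimal sketch of that step, assuming the restored server will be bootstrapped as the new primary and the group settings and replication credentials are already in place:

-- On the restored primary: bootstrap a new group
mysql> SET GLOBAL group_replication_bootstrap_group = ON;
mysql> START GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_bootstrap_group = OFF;

-- On each remaining node, one at a time
mysql> START GROUP_REPLICATION;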

Method 2: Local Server Recovery

In scenarios where all nodes within the Group Replication are offline, the procedure involves restarting individual servers within the replication group.

Validation

It's imperative to verify that the group_replication_bootstrap_group and group_replication_start_on_boot parameters are both set to OFF in the cnf file on every server within the replication group. Once the variables are added to the cnf file, proceed with starting the servers individually. This precaution is vital for a smooth and uninterrupted restart, ensuring that no data loss occurs and that no node accidentally bootstraps a separate group. The current values can be confirmed with the query sketched below.
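
A quick way to confirm these values on each node (a sketch; the query reads the performance_schema.global_variables table available in MySQL 8.0):

mysql> SELECT VARIABLE_NAME, VARIABLE_VALUE FROM performance_schema.global_variables WHERE VARIABLE_NAME IN ('group_replication_bootstrap_group', 'group_replication_start_on_boot');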

Node1:

 
+-----------------------------------+----------------+
| VARIABLE_NAME                     | VARIABLE_VALUE |
+-----------------------------------+----------------+
| group_replication_bootstrap_group | OFF            |
| group_replication_start_on_boot   | OFF            |
+-----------------------------------+----------------+
	

Node2:

 
+-----------------------------------+----------------+
| VARIABLE_NAME                     | VARIABLE_VALUE |
+-----------------------------------+----------------+
| group_replication_bootstrap_group | OFF            |
| group_replication_start_on_boot   | OFF            |
+-----------------------------------+----------------+
	

Upon the successful restart of the servers, it is important to verify that all nodes within the Group Replication are in the OFFLINE state.

Node1:

 
mysql> select MEMBER_HOST, MEMBER_STATE from performance_schema.replication_group_members;
+---------------+--------------+
| MEMBER_HOST   | MEMBER_STATE |
+---------------+--------------+
| Node1         | OFFLINE      |
+---------------+--------------+
1 row in set (0.00 sec)
	

Node 2:

 
mysql> select MEMBER_HOST, MEMBER_STATE from performance_schema.replication_group_members;
+---------------+--------------+
| MEMBER_HOST   | MEMBER_STATE |
+---------------+--------------+
| Node2         | OFFLINE      |
+---------------+--------------+
1 row in set (0.00 sec)
	

Identification

Identifying the node with the most recent transactions is a crucial step in the recovery process. This information can be obtained from the performance_schema.replication_connection_status table, whose received_transaction_set column holds the GTID set of transactions received on each server. By comparing these sets across nodes, you can identify the server with the latest data.

Node 1:

 
mysql> SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
+--------------------------------------------+
| received_transaction_set                   |
+--------------------------------------------+
| b8192f4b-a054-32d6-ac0b-88b32af3ba1d:1-415 |
+--------------------------------------------+
1 row in set (0.00 sec)
	

Node 2:

 
mysql> SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
+--------------------------------------------+
| received_transaction_set                   |
+--------------------------------------------+
| b8192f4b-a054-32d6-ac0b-88b32af3ba1d:1-314 |
+--------------------------------------------+
1 row in set (0.00 sec)
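
In addition to the received set, it can be worth comparing each member's executed GTID set, since received-but-not-yet-applied transactions also matter when choosing the most up-to-date node; a quick check on each server:

mysql> SELECT @@GLOBAL.gtid_executed;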
	

Recovery

After identifying Node 1 as the node holding the most recent transactions, the next step is to bootstrap the group from it, designating it as the primary server. This step is pivotal in preserving data integrity and averting any potential data loss.

Node 1:

 
mysql> set global group_replication_bootstrap_group=1;
Query OK, 0 rows affected (0.00 sec)

mysql> start group_replication;
Query OK, 0 rows affected, 1 warning (1.17 sec)

mysql>  select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;
+---------------+--------------+-------------+
| MEMBER_HOST   | MEMBER_STATE | MEMBER_ROLE |
+---------------+--------------+-------------+
| Node1         | ONLINE       | PRIMARY     |
+---------------+--------------+-------------+
1 row in set (0.00 sec)
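
Once the group is up, it is advisable to switch the bootstrap flag back off so that a later restart of this node does not accidentally create a second, independent group:

mysql> SET GLOBAL group_replication_bootstrap_group = OFF;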
	

After ensuring the secure startup of Node 1 as the primary server, the remaining nodes are added back to the cluster one at a time, in a controlled manner.

Node 2:

 
mysql> start group_replication;
Query OK, 0 rows affected, 1 warning (2.27 sec)

mysql> select MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE from performance_schema.replication_group_members;
+---------------+--------------+-------------+
| MEMBER_HOST   | MEMBER_STATE | MEMBER_ROLE |
+---------------+--------------+-------------+
| Node2         | ONLINE       | SECONDARY   |
| Node1         | ONLINE       | PRIMARY     |
+---------------+--------------+-------------+
2 rows in set (0.01 sec)
	

Flow chart

[Figure: MySQL Group Replication Recovery flow chart]

Through this meticulous step-by-step approach, we systematically validate and pinpoint the node with the most up-to-date data, ensuring its promotion without data loss. This process facilitates the seamless re-establishment of the cluster, safeguarding data integrity and overall cluster reliability.

MySQL Group Replication offers robust solutions for data redundancy and availability, but it's essential to be prepared for unexpected events. By following the standard practices and procedures outlined in this blog, you can effectively recover from complete failures in a Group Replication, ensuring the resilience and continuity of your database operations.

Optimize your MySQL Group Replication setup with Mydbops' InnoDB Cluster Consulting Services. Our experts can help you streamline your database management and improve performance. Contact us today to discuss your specific needs.
