AWS recently announced the release of MariaDB 10.2 for RDS. One of customer needs this release for GIS indexing .The recent encounter on a Replication issue with RDS MariaDB 10.2 is briefed here .
Problem Statement:
After the upgrade of replicas to MariaDB 10.2 on RDS, Once replication is stopped manually through “call mysql.rds_stop_replication;” or replication failed due to some error. Replication cannot be started back using “call mysql.rds_start_replication;” and there is no straightforward way or documented process to start the replication back. Also, most of the replication related RDS commands like skip errors etc will not work. This is due to the implementation of replication handling in RDS.
Summary:
On Jan 5, 2018, RDS announced support for MariaDB 10.2. Release Notes From Amazon
To test the release we tried upgrading one of our read replicas used for development and testing on RDS from 10.1.23 to 10.2.11.
Reproducible Test Case:
### I have a running slave upgraded to MariaDB 10.2
mysql> select @@version;
+-----------------+
| @@version |
+-----------------+
| 10.2.11-MariaDB |
+-----------------+
mysql> pager grep -i 'Running:\|Seconds'
mysql> show slave status\G
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
### Stopping replication manually
mysql> call mysql.rds_stop_replication;
+----------------------------------------------------------------------------+
| Message |
+----------------------------------------------------------------------------+
| Slave has encountered an error. Run SHOW SLAVE STATUS\G; to see the error. |
+----------------------------------------------------------------------------+
1 row in set (1.39 sec)
Query OK, 0 rows affected (1.54 sec)
### RDS threw error info here because, even after RDS performing “STOP SLAVE”, it sees threads owned by “system user”, Which we detailed in the later section.
### Let’s check the slave is actually stopped.
mysql> pager grep -i 'Running:\|Seconds'
mysql> show slave status\G
Slave_IO_Running: No
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
### Slave is stopped, Let’s try to startup.
mysql> call mysql.rds_start_replication;
Query OK, 0 rows affected (0.48 sec)
### Call returns OK, Let’s check the status.
mysql> pager grep -i 'Running:\|Seconds'
mysql> show slave status\G
Slave_IO_Running: No
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
### Call returned OK, Nothing happened.
If “call mysql.rds_start_replication” is failing or not working, there is no other documented way to start back the replication in RDS. Let’s take a deep look at RDS implementation.
RDS Implementation:
As most of us know we will not be granted super privileges on RDS, To manage replication we have to use the procedures used by RDS. This is how the procedure is implemented.
Procedure: mysql.rds_start_replication
CREATE DEFINER=`rdsadmin`@`localhost` PROCEDURE `rds_start_replication`()
BEGIN
DECLARE v_mysql_version VARCHAR(20);
DECLARE v_threads_running INT;
DECLARE v_called_by_user VARCHAR(50);
DECLARE v_sleep int;
DECLARE sql_logging BOOLEAN;
select @@sql_log_bin into sql_logging;
Select user() into v_called_by_user;
Select version() into v_mysql_version;
SELECT COUNT(1) into v_threads_running FROM information_schema.processlist WHERE user = 'system user';
if v_threads_running = 0
then
set @@sql_log_bin=off;
update mysql.rds_replication_status set called_by_user=v_called_by_user,action='start slave', mysql_version=v_mysql_version where action is not null;
commit;
select sleep(1) into v_sleep;
START SLAVE;
SELECT COUNT(1) into v_threads_running FROM information_schema.processlist WHERE user = 'system user';
if v_threads_running = 2
then
insert into mysql.rds_history (called_by_user,action,mysql_version) values (v_called_by_user,'start slave', v_mysql_version);
commit;
Select 'Slave running normally.' as Message;
else
Select 'Slave has encountered an error. Run SHOW SLAVE STATUS\\G; to see the error.' as Message;
end if;
else
if v_threads_running = 2
then
Select 'Slave may already running. Call rds_stop_replication to stop replication;' as Message;
end if;
end if;
set @@sql_log_bin=sql_logging;
END
Procedure In Words:
– Get count of threads getting executed by the user ‘system user’
– Only if it’s 0 starts the replication.
This is how replication traditionally worked, we will have two threads with the system user, one is IO thread and another is SQL thread.
So what has changed in MariaDB 10.2:
mysql> show processlist;
+----+-------------+-----------------------+------+---------+------+--------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------------------+------+---------+------+--------------------------+------------------+----------+
| 1 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 |
| 2 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 3 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 5 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 |
MariaDB 10.2 made background processes (InnoDB purge threads / InnoDB shutdown handler) as threads executed by the system user.
By default, innodb_purge_threads = 4, 1 shutdown handler thread. So we always have 5 threads running as system user.
Issue:
As most of the replication handling procedures on RDS “rds_stop_replication, rds_start_replication, rds_skip_repl_error, etc” operates based on the count of the threads ran by “system user”, this update broke the complete implementation of RDS replication handling.
I honestly believe RDS should implement more robust validations and fix this bug soon.
Temporary Hack:
Though it’s not documented I just tried doing some kind of hack after reading the procedure and succeeded starting the stopped replication back.
- Disable read-only on the replica.
- update mysql.rds_replication_status set action=’start slave’;
- Wait for the next minute to cross, Replication will be started.
mysql> update mysql.rds_replication_status set action='start slave';
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from mysql.rds_replication_status;
+----+---------------------+-------------------------+-------------+-----------------+-------------+-------------+
| id | action_timestamp | called_by_user | action | mysql_version | master_host | master_port |
+----+---------------------+-------------------------+-------------+-----------------+-------------+-------------+
| 1 | 2018-01-18 08:29:55 | mydbops@122.166.223.194 | start slave | 10.2.11-MariaDB | NULL | NULL |
+----+---------------------+-------------------------+-------------+-----------------+-------------+-------------+
mysql> select now();
+---------------------+
| now() |
+---------------------+
| 2018-01-18 08:30:04 |
+---------------------+
mysql> pager grep "Running:\|Seconds"
mysql> show slave status\G
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
It all fixed and replication is back online .