Recently we encountered a strange issue with replication and a temp directory (tmpdir) change while working for one of our major clients.
All the servers in this setup were running Percona Server for MySQL 5.6.38, hosted on Debian 8 (Jessie).
The MySQL architecture was a single master with 5 direct slaves under it.
In this blog, we will take you through the issue we faced and how we ultimately fixed it.
Client Request:
Our client had a simple request: add a column and an index to a 16GB production table. Since the table had foreign keys, we decided to go with online DDL instead of pt-online-schema-change to avoid complications.
When we started the alter, it failed due to insufficient space in “tmpdir”. By default MySQL uses “/tmp” for temporary table creation, sorting and other temporary operations. Since we had only 5.7G left on “/” on the master, the alter failed as below.
mysql> alter table payment.transactions add column wallet_aggregate_id bigint(20) default null,add index idx_wallet_aggregate_id(wallet_aggregate_id),algorithm=inplace,lock=none;
ERROR 1878 (HY000): Temporary file write failure.
root@payment-mydbops:/home/waseem.mydbops# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 9.8G 3.6G 5.7G 39% /
mysql> select @@tmpdir;
+----------+
| @@tmpdir |
+----------+
| /tmp |
+----------+
1 row in set (0.00 sec)
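As a rough pre-check before an inplace index build like this, the on-disk size of the table can be compared against the free space under tmpdir, since the temporary sort files are written there. A minimal sketch, assuming the payment.transactions table from the alter above:
mysql> SELECT table_name,
    ->        ROUND((data_length + index_length)/1024/1024/1024, 2) AS size_gb
    -> FROM information_schema.tables
    -> WHERE table_schema = 'payment' AND table_name = 'transactions';
If the reported size is close to or larger than the space available under tmpdir, the alter is likely to fail in the same way.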
Change of MySQL tmpdir:
We explained the situation to the stakeholders. With the client's approval, we changed tmpdir from “/tmp” to the “/var/tmp/mysql” partition with a MySQL restart, after which the alter completed successfully on the master.
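Since tmpdir is not a dynamic variable, the change has to go into the config file and be picked up on restart. A minimal sketch of the change, assuming the config file path shown later in this post (/etc/mysql/conf.d/mysql.cnf) and that the new directory already exists and is writable by the mysql user:
[mysqld]
# temporary files for sorts, online DDL and temp tables go here
tmpdir = /var/tmp/mysql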
root@payment-mydbops:/home/waseem.mydbops# df -h /var/tmp/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgmysql-mysqltmp 25G 33M 25G 1% /var/tmp/mysql
mysql> select @@tmpdir;
+----------------+
| @@tmpdir |
+----------------+
| /var/tmp/mysql |
+----------------+
1 row in set (0.00 sec)
mysql> alter table transactions add column aggregate_id bigint(20) default null,add index idx_aggregate_id(aggregate_id),algorithm=inplace,lock=none;
Query OK, 0 rows affected (48 min 1.57 sec)
Records: 0 Duplicates: 0 Warnings: 0
Rise of the Replication Issue:
Replication went down on all 5 slaves connected to the master because of the alter, since all the slaves were still using “/tmp” as their tmpdir, which had only 9.8G of space. Below is a snapshot of the replication error.
mysql> show slave status\G
*************************** 1. row ***************************
Master_Log_File: 118120-mysql-bin.000237
Read_Master_Log_Pos: 171370534
Relay_Log_File: mysqld-relay-bin.000704
Relay_Log_Pos: 99663098
Relay_Master_Log_File: 118120-mysql-bin.000237
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1878
Last_Error: Error 'Temporary file write failure.' on query. Default database: ''. Query: 'alter table transactions add column aggregate_id bigint(20) default null,add index idx_aggregate_id(aggregate_id),algorithm=inplace,lock=none'
Exec_Master_Log_Pos: 99662928
Relay_Log_Space: 171371048
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
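Since the ALTER is replicated as a statement, every slave has to perform the same inplace rebuild locally and therefore needs the same amount of temp space. A quick way to confirm the root cause on each replica (a simple check; hostnames are placeholders):
mysql> select @@tmpdir;
root@slave1:~# df -h /tmp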
Since we had faced the same issue on the master, it seemed straightforward to change tmpdir to “/var/tmp/mysql” and restart MySQL on the slaves to fix the replication issue as well.
But to our surprise, the slave came back completely out of the replication cluster; the slave status was as below.
mysql> show slave status\G
Empty set (0.01 sec)
Below are the errors found in the error log:
2019-02-20 18:04:29 547 [ERROR] Error in checking /var/lib/mysql/master.info repository info type of FILE.
2019-02-20 18:04:29 547 [ERROR] Error creating master info: Error checking repositories.
2019-02-20 18:04:29 547 [ERROR] Failed to create or recover replication info repository.
2019-02-20 18:04:29 547 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2019-02-20 18:04:29 547 [Note] Event Scheduler: Loaded 0 events
Since our master info was stored in a TABLE repository, as shown below, we initially ignored the error on the first line.
mysql> show global variables like '%repository%';
+---------------------------+---------------------+
| Variable_name | Value |
+---------------------------+---------------------+
| master_info_repository | TABLE |
| relay_log_info_repository | TABLE |
+---------------------------+---------------------+
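With master_info_repository set to TABLE, the connection metadata lives in the mysql.slave_master_info system table (and the applier position in mysql.slave_relay_log_info) rather than in the master.info file, which is why the FILE-related error looked harmless at first. For reference, the stored coordinates can be inspected directly:
mysql> SELECT Host, User_name, Master_log_name, Master_log_pos
    -> FROM mysql.slave_master_info\G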
Since we had the binlog coordinates handy, we decided to run the CHANGE MASTER command again to resume replication. It failed again, this time with the below error complaining about server-id.
change master to master_host='10.200.xx.xxx',master_user='replication',master_password='dessert',master_log_file='118120-mysql-bin.000237',master_log_pos=99662928;
ERROR 1794 (HY000): Slave is not configured or failed to initialize properly. You must at least set --server-id to enable either a master or a slave. Additional error messages can be found in the MySQL error log.
This error made us check the config file for server-id, which was untouched.
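As a quick sanity check, the value in effect on the running instance can be confirmed as well:
mysql> SELECT @@server_id;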
We went back to the error log again, as below:
2019-02-20 18:04:29 547 [ERROR] Error in checking /var/lib/mysql/master.info repository info type of FILE.
2019-02-20 18:04:29 547 [ERROR] Error creating master info: Error checking repositories.
2019-02-20 18:04:29 547 [ERROR] Failed to create or recover replication info repository.
Next we checked the file “/var/lib/mysql/master.info”, even though our repository info was set to TABLE, and found the file to be empty. We decided to remove the file and restart MySQL, but the issue was still not fixed.
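For reference, the steps on the slave were along these lines (a sketch; paths as on a standard Debian package install, and moving the file aside instead of deleting it outright is the safer equivalent):
root@payments-mydbops:~# mv /var/lib/mysql/master.info /var/lib/mysql/master.info.bak
root@payments-mydbops:~# service mysql restart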
Finally, we decided to change master_info_repository and relay_log_info_repository from “TABLE” to “FILE”, removed the empty master.info file from the datadir and restarted MySQL. Below are the changes made on the slave.
mysql> show global variables like '%repository%';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| master_info_repository | FILE |
| relay_log_info_repository | FILE |
+---------------------------+-------+
root@payments-mydbops:/home/waseem.mydbops# less /etc/mysql/conf.d/mysql.cnf | grep -i repository
master_info_repository = FILE
relay_log_info_repository = FILE
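For reference, on a slave that is otherwise healthy these repositories can also be switched dynamically while the replication threads are stopped, without editing the config file (a sketch; in our case the slave could not even initialize its repositories, so the config-file change plus restart was required):
mysql> STOP SLAVE;
mysql> SET GLOBAL master_info_repository = 'FILE';
mysql> SET GLOBAL relay_log_info_repository = 'FILE';
mysql> START SLAVE;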
Even though we had our CHANGE MASTER TO query ready for re-configuring, surprisingly there was no need for it: after the above change, the slave resumed without any manual intervention, as below.
mysql> show slave status\G
*************************** 1. row ***************************
Master_Log_File: 118120-mysql-bin.000237
Read_Master_Log_Pos: 214316053
Relay_Log_File: mysqld-relay-bin.000702
Relay_Log_Pos: 290
Relay_Master_Log_File: 118120-mysql-bin.000237
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 42840
1 row in set (0.00 sec)
Conclusion:
The above issue was quite strange. We are still trying to reproduce it on other versions of MySQL such as 5.6, 5.7 and 8.0.
We resolved the issue, but we are still trying to understand more about it and will keep everyone posted.
For now, we hope this blog will be helpful for someone solving a similar replication issue.
Thanks for your time and continued support!!
Image courtesy: Photo by Christian Regg on Unsplash