Recently we got paged by CPU spikes on one of our clients slave MySQL ( MariaDB ) server under MyDBOPS Remote DBA Team Support. Our MySQL Engineers started analysing the root cause of the issues and resolved it. This blog provides an overview of the troubleshooting and fix proposed.
System Summary :
Database server : 10.1.10-MariaDB-log MariaDB Server
Hardware : 12 Core/ 48 GB RAM / 1 TB Hard disk ( HDD )
OS version : CentOS 7.2
Database Size : 550GB
Engine : InnoDB
InnoDB version : 5.6.26-76
It is a dedicated DB server and no other processes were consuming the CPU. There were no open transactions and long running queries too. This is the sample output of show process list ( truncated ).
Processlist:
+-------+-------------+---------------------+----------+---------+---------+-----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+-------+-------------+---------------------+----------+---------+---------+-----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| 2 | system user | | NULL | Connect | 1409851 | Waiting for master to send event | NULL | 0.000 |
| 3 | system user | | NULL | Connect | 0 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL | 0.000 |
| 39106 | query | india:34652 | database | Query | 1 | statistics | SELECT `QueryProfile`.`user_id` FROM `database`.`query_profiles` AS `QueryProfile` WHERE `country` | 0.000 |
| 39341 | query | hawthorne:46692 | database | Query | 0 | statistics | SELECT `QueryProfile`.`user_id` FROM `database`.`query_profiles` AS `QueryProfile` WHERE `country` | 0.000 |
| 39569 | query | hawthorne:40418 | database | Query | 0 | statistics | SELECT `QueryProfile`.`user_id` FROM `database`.`query_profiles` AS `QueryProfile` WHERE `country` | 0.000 |
| 39767 | query | india:45976 | database | Query | 1 | statistics | SELECT `QueryProfile`.`user_id` FROM `database`.`query_profiles` AS `QueryProfile` WHERE `country` | 0.000 |
| 40057 | query | india:39382 | database | Query | 0 | statistics | SELECT `QueryProfile`.`user_id` FROM `database`.`query_profiles` AS `QueryProfile` WHERE `country` | 0.000 |
+-------+-------------+---------------------+----------+---------+---------+-----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
From the process list it is noted that the queries were running on high concurrency and their execution time was less than a sec. The QPS was high at that point in time than the normal too. Let us have look inside InnoDB what is happening internally .
Show Engine InnoDB status\G
MySQL thread id 39341, OS thread handle 0x7fae2be43700, query id 586419831
hawthorne 192.168.197.188 query statistics
SELECT `QueryProfile`.`user_id` FROM `query_profiles` AS `QueryProfile` WHERE
`country` = 'US' AND `gender` = 'F' AND `postal_code` IN
('00027', '00041', '00043', '00067', '00071', '00084', '00100', '00152', '00198',
'00830', '01001', '01013', '01020', '01022', '01030', '01033', '01040', '01041',
'01056', '01071', '01075', '01085', '01089', '01090', '01103', '01104', '01105',
'01106', '01107', '01108', '01109', '01115', '01116', '01118', '01119', '01129',
'01144', '01151', '01199', '01236', '01240','01302', '01342', '01376', '01427',
'01501', '01503', '01505', '01507', '01510')
---TRANSACTION 14538072151, not started estimating records in index range
It is noted that the Transaction were holding at state “estimating records in index range“.Let us have a full view of the query and its execution plan.
The Query has some good where clause for filtering and a wider range on column ` postal_code`. Let us have a view at the execution plan for to know what optimiser does internally ?
Execution Plan of Query:
+------+-------------+--------------+------+---------------------------------------------------------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------+------+---------------------------------------------------------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | QueryProfile | ref | user_id,gender,birthdate,country,state,postal_code,hhi,ignore | user_id | 4 | const | 1 | Using where |
+------+-------------+--------------+------+---------------------------------------------------------------+---------+---------+-------+------+-------------+
The execution plans looks so normal the query uses optimal index on column `user_id` and scans only one row ( almost a primary key look up ) according to explain plan. Explain plan claims that it is scanning only one row but according to InnoDB status queries were waiting for “estimating the records in index range”.Let us have a look at the table structure and the index.
The Column` id` is a primary key of the table with auto_increment. Column `user_id` uses Int data type with not null and have a B+Tree index over it.
Now it seems more like a bug / mal function in query optimiser . Optimiser make the query for a index range scans though the records were unique ( like primary key ).
MyDBOPS team had a quick chat with the client to know more about this table. The data in `user_id` is supposed to be unique according the client. We made a scan for uniqueness of the records in `user_id`.
select user_id,count(*) from query_profiles group by user_id having count(*) > 1;
+---------+----------+
| user_id | count(*) |
+---------+----------+
| 0 | 24 |
+---------+----------+
1 rows in set (0.60 sec)
The value “0” persists at multiple points ( 24 times ) inside the table. Client claims that it might be due to initial test made at the time of launching and removed those records from the master.
We have created a new index over column `user_id` and made it as a unique key. And dropped the existing index.
alter table query_profiles add unique uniq_uid(user_id);
alter table query_profiles drop index user_id;
After the drop in index and with new unique index the CPU cycles spends over “estimating records over index range” was completely avoided and we have reclaimed the CPU cycles spent.
CPU Graph:
From the above graph we can note a huge improvement in IDLE CPU after the index conversion.
Making an unique or a primary key is not feasible option in all cases it always depends on the type of data stored and inputs from application team. But in this case Unique index forces the optimiser to look only one row and overcome the abnormal behaviour in optimiser.