Histogram in MySQL 8.0

Mydbops

Jun 18, 2020

Mins to Read

All

MySQL 8.0 introduces many new features. We will have a look at the exciting histogram feature in MySQL 8.0

Histogram:

What is Histogram?

In General, a histogram is an accurate representation of the distribution of numerical data. In MySQL, It will be useful to collect data distribution for a specific column.

What problem it solves?

In general DB Optimizer gives us the best execution plan, But the stats make the execution plan better and better .The data distribution of values in columns can make good impact in optimiser in case of column with less distinct values.

We will see an example of how it helps optimizer in some cases.

I have used a production case. MySQL version is 8.0.15 installed in ubuntu 18.04 (32GB RAM,8 core) with optimal configuration. Let us try to optimise a query using histogram.

 
root@localhost:(none)>show create table app.app_user_state\G
*************************** 1. row ***************************
Table: app_user_state
Create Table: CREATE TABLE `app_user_state` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`uid` int(11) DEFAULT NULL,
`state` int(11) DEFAULT NULL,
`subState` int(11) DEFAULT NULL,
`changedOn` datetime DEFAULT NULL,
`description` varchar(128) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=67975632 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

SELECT * FROM app.app_user_state WHERE state = ? AND changedOn > ? AND changedOn < ?;

root@localhost:(none)>select min(changedOn),max(changedOn) from app.app_user_state;
+---------------------+---------------------+
| min(changedOn)      | max(changedOn)      |
+---------------------+---------------------+
| 2018-04-26 08:28:45 | 2019-05-23 10:00:03 |
+---------------------+---------------------+
1 row in set (15.75 sec)

Query fetching data from July 2018 to August 2018 with a state value of 52(contains 60133 records). Note No index on column changedOn.

 
root@localhost:(none)>explain SELECT * FROM app.app_user_state WHERE state = 53 AND changedOn > '2018-07-17 00:00:00' and ChangedOn < '2018-08-07 00:00:00'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: app_user_state
         rows: 3984266
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Query is trying to fetch data from 26, April 2018 to 27, April 2018 ( contains no records and but still it tooks around 6 seconds)

 
root@localhost:(none)>explain SELECT * FROM app.app_user_state WHERE state = 53 AND changedOn > '2018-04-26 08:28:45' and ChangedOn < '2018-04-27 08:28:45'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: app_user_state
rows: 3984266
filtered: 11.11
Extra: Using where
1 row in set, 1 warning (0.00 sec)

Without any distribution stats, the optimizer considers data is equally distributed across all days in a year. We can provide intelligence to optimizer by using the histogram.

To Create histogram (syntax ):

 
Analyze table table_name update histogram on column_name with N buckets;

Histogram contains buckets (N) for part of the range of the values stored in the column.If N clause is ignored, the number of buckets by default is 100. (range from 0 to 1024).

 
mysql> analyze table app.app_user_state update histogram on changedOn\G
*************************** 1. row ***************************
Table: app.app_user_state
Op: histogram
Msg_type: status
Msg_text: Histogram statistics created for column 'changedOn'.
1 row in set (21.76 sec)

Histogram is created for the column changedOn.

 
mysql> SELECT TABLE_NAME,COLUMN_NAME,HISTOGRAM->>'$."data-type"' AS 'data-type', JSON_LENGTH(HISTOGRAM->>'$."buckets"') AS 'bucket-count'        FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+---------------+-------------+-----------+--------------+
| TABLE_NAME    | COLUMN_NAME | data-type | bucket-count |
+---------------+-------------+-----------+--------------+
| app_user_state | changedOn   | datetime  |          100 |
+---------------+-------------+-----------+--------------+
1 row in set (0.00 sec)

Execution plan post histogram

 
root@localhost:(none)>explain SELECT * FROM app.app_user_state WHERE state = 53 AND changedOn > '2018-04-26 08:28:45' and ChangedOn < '2018-04-26 08:28:45'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: app_user_state
rows: 3984266
filtered: 0.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)

In this can, we can observe the difference in the filtered columns of explain plan. Filtering can even occur in the absence of an index.

Histogram statistics are widely used for non-indexed columns. We can also add an index to histogram columns that will help the optimizer to make row estimates, But index should be updated whenever the table modification is higher. This is not a problem in the histogram. We can create/update histogram when there is a demand. So, There is no overhead in the histogram when the table is modified.

We can disable histogram by two methods

1) By dropping the histogram for the particular column of a table.

 
Analyze table table_name drop histogram on column_name;

2) By disabling condition_fanout_filter of optimizer switch.

 
Set optimizer_switch=’condition_fanout_filter=off’;

We can get more information on histogram stats from Information_schema.column_statistics table.

 
root@localhost:(none)>select * from information_schema.column_statistics\G 
*************************** 1. row *************************** 
SCHEMA_NAME: app 
TABLE_NAME: app_user_state 
COLUMN_NAME: changedOn
HISTOGRAM: 
{"buckets": [["2018-04-26 08:28:45.000000", "2018-07-19 11:39:51.000000", 0.031245524205098827, 778563], ... ... "2019-05-29 00:50:31.000000", 0.9998813914351189, 266432]], 
"data-type": "datetime", 
"null-values": 0.00011860856488112289, 
"collation-id": 8, 
"last-updated": "2019-05-29 00:52:57.168591", 
"sampling-rate": 0.012078229085799905, 
"histogram-type": "equi-height", 
"number-of-buckets-specified": 32}

Histogram Objects have keys like buckets, data-type, collation-id, last-updated, sampling-rate, histogram-type, number-of-buckets-specified.

Buckets: Bucket structure depends on the histogram type

Histogram Types:

1) Singleton histogram

2) Equi height histogram

Singleton: One bucket represents one single value in the column. This type of histogram will be created when the number of distinct values in the column is less than or equal to the number of buckets specified in the analyze table syntax.

Equi-height: One bucket represents a range of values. This type of histogram will be created when distinct values in the column are greater than the number of buckets specified in the analyze table syntax.

For Example State column in the app.app_user_state table is having 6 distinct values. Now creating a histogram with 6 buckets as equal as distinct values in analyze table syntax.

 
root@localhost:(none)>analyze table app.app_user_state update histogram on state with 6 buckets\G 
*************************** 1. row *************************** 
Table: app.app_user_state 
Op: histogram 
Msg_type: status 
Msg_text: Histogram statistics created for column 'state'. 

root@localhost:(none)>select histogram from information_schema.column_statistics\G 

HISTOGRAM: 
{"buckets": [[1, 0.19453761379971252], [32, 0.31223197172975564], [33, 0.7182019645424054], [34, 0.7733506528509823], [53, 0.8396232630570196], [86, 1.0]], 
"data-type": "int", 
"null-values": 0.0, 
"collation-id": 8, 
"last-updated": "2019-05-29 01:18:47.982666", 
"sampling-rate": 0.03623352139303918,
 "histogram-type": "singleton", 
"number-of-buckets-specified": 6}.

We can see that the type of histogram created is Singleton. In singleton histogram, each buckets contain two values [1, 0.19453761379971252]

value 1: bucket value (distinct values in column)

value 2: cumulative frequency for each value

Let try updating histogram with 2 buckets

 
root@localhost:(none)>analyze table app.app_user_state update histogram on state with 2 buckets\G 
*************************** 1. row *************************** 
Table: app.app_user_state 
Op: histogram Msg_type: status
 Msg_text: Histogram statistics created for column 'state'. 

root@localhost:(none)>select histogram from information_schema.column_statistics\G 
HISTOGRAM: 
{"buckets": [[1, 32, 0.31338171297403283, 2], [33, 86, 1.0, 4]], 
"data-type": "int", 
"null-values": 0.0, 
"collation-id": 8, 
"last-updated": "2019-05-29 01:31:21.307545", 
"sampling-rate": 0.036234657847286224, "histogram-type": "equi-height", 
"number-of-buckets-specified":2}

In an equi-height histogram, bucket contains 4 values. [1, 32, 0.31338171297403283, 2]

Values 1 & 2: The lower and upper ranges for the bucket(Range of values)

Value3: The cumulative frequency for the values

Value4: The number of distinct values in the range.

data-type: The type of data this histogram contains.

number-of-buckets-specified: Number of buckets specified in the analyze table statement. (100 is the default)

last-updated: when the histogram is generated.

null-values: The value lies between 0.0 and 1.0. The fraction of column values that contain null values. 0 means that the column contains no null values.

collation-id: The collation-id for the histogram data.

Sampling-rate: The value ranges between 0.0 and 1.0. This indicates that the fraction of data that was sampled while creating the histogram. If the value is 1 means that all the data was read while creating the histogram(No sampling occurs)

The memory allocation for sampling is controlled by the variable histogram_generation_max_mem_size. It can be defined in both the global and session level while creating the histogram. While we creating a histogram for the particular column, the server will read all the data into the memory and then process it. If we are doing histogram for larger tables, there is a risk of reading all data into memory. we can avoid this problem by using this variable. The server will calculate how many rows can be fit into the memory defined by the histogram memory variable histogram_generation_max_mem_size. If all the values in the column are not fitted defined by the variable, the server allows only values fitted into memory and starts to take sampling.

How to chose a column for histogram ?