MySQL 8.0 introduces many new features. We will have a look at the exciting histogram feature in MySQL 8.0
Histogram:
What is Histogram?
In General, a histogram is an accurate representation of the distribution of numerical data. In MySQL, It will be useful to collect data distribution for a specific column.
What problem it solves?
In general DB Optimizer gives us the best execution plan, But the stats make the execution plan better and better .The data distribution of values in columns can make good impact in optimiser in case of column with less distinct values.
We will see an example of how it helps optimizer in some cases.
I have used a production case. MySQL version is 8.0.15 installed in ubuntu 18.04 (32GB RAM,8 core) with optimal configuration. Let us try to optimise a query using histogram.
Query fetching data from July 2018 to August 2018 with a state value of 52(contains 60133 records). Note No index on column changedOn.
Query is trying to fetch data from 26, April 2018 to 27, April 2018 ( contains no records and but still it tooks around 6 seconds)
Without any distribution stats, the optimizer considers data is equally distributed across all days in a year. We can provide intelligence to optimizer by using the histogram.
To Create histogram (syntax ):
Histogram contains buckets (N) for part of the range of the values stored in the column.If N clause is ignored, the number of buckets by default is 100. (range from 0 to 1024).
Histogram is created for the column changedOn.
Execution plan post histogram
In this can, we can observe the difference in the filtered columns of explain plan. Filtering can even occur in the absence of an index.
Histogram statistics are widely used for non-indexed columns. We can also add an index to histogram columns that will help the optimizer to make row estimates, But index should be updated whenever the table modification is higher. This is not a problem in the histogram. We can create/update histogram when there is a demand. So, There is no overhead in the histogram when the table is modified.
We can disable histogram by two methods
1) By dropping the histogram for the particular column of a table.
2) By disabling condition_fanout_filter of optimizer switch.
We can get more information on histogram stats from Information_schema.column_statistics table.
Histogram Objects have keys like buckets, data-type, collation-id, last-updated, sampling-rate, histogram-type, number-of-buckets-specified.
Buckets: Bucket structure depends on the histogram type
Histogram Types:
1) Singleton histogram
2) Equi height histogram
Singleton: One bucket represents one single value in the column. This type of histogram will be created when the number of distinct values in the column is less than or equal to the number of buckets specified in the analyze table syntax.
Equi-height: One bucket represents a range of values. This type of histogram will be created when distinct values in the column are greater than the number of buckets specified in the analyze table syntax.
For Example State column in the app.app_user_state table is having 6 distinct values. Now creating a histogram with 6 buckets as equal as distinct values in analyze table syntax.
We can see that the type of histogram created is Singleton. In singleton histogram, each buckets contain two values [1, 0.19453761379971252]
value 1: bucket value (distinct values in column)
value 2: cumulative frequency for each value
Let try updating histogram with 2 buckets
In an equi-height histogram, bucket contains 4 values. [1, 32, 0.31338171297403283, 2]
Values 1 & 2: The lower and upper ranges for the bucket(Range of values)
Value3: The cumulative frequency for the values
Value4: The number of distinct values in the range.
data-type: The type of data this histogram contains.
number-of-buckets-specified: Number of buckets specified in the analyze table statement. (100 is the default)
last-updated: when the histogram is generated.
null-values: The value lies between 0.0 and 1.0. The fraction of column values that contain null values. 0 means that the column contains no null values.
collation-id: The collation-id for the histogram data.
Sampling-rate: The value ranges between 0.0 and 1.0. This indicates that the fraction of data that was sampled while creating the histogram. If the value is 1 means that all the data was read while creating the histogram(No sampling occurs)
The memory allocation for sampling is controlled by the variable histogram_generation_max_mem_size. It can be defined in both the global and session level while creating the histogram. While we creating a histogram for the particular column, the server will read all the data into the memory and then process it. If we are doing histogram for larger tables, there is a risk of reading all data into memory. we can avoid this problem by using this variable. The server will calculate how many rows can be fit into the memory defined by the histogram memory variable histogram_generation_max_mem_size. If all the values in the column are not fitted defined by the variable, the server allows only values fitted into memory and starts to take sampling.
How to chose a column for histogram ?
- values which do not change much over time
- Columns with low cardinality values
- Columns with uneven distribution
Columns to be avoided in histograms?
- Column with more distinct values and inserts are high (manual intervention need to update histogram frequently)
Histogram is an wonderful feature in MySQL 8.0 which can help DBA’s in optimising their SQL’s for better response time.
More Reads on Histogram:
https://dev.mysql.com/worklog/task/?id=8943
Image by Plush Design Studio on Unsplash