PostgreSQL 13 has been generally available since September 24, 2020. Last year, PostgreSQL 12 was released with many features such as generated columns and partitioning improvements, and it quickly came into wide use in production environments. This year, PostgreSQL 13 brings more exciting features along with performance and security improvements.
Here are a few major features from PostgreSQL 13:
- Deduplication of B-tree indexes
- Incremental Sorting
- Logical Replication for Partitioned Tables and so on.
In this blog post, we will explore the most popular feature of this release: deduplication of B-Tree indexes.
Environment:
I have two CentOS 7 servers, one with PostgreSQL 12 installed and the other with PostgreSQL 13 (beta release), to make comparisons.
Everything is ready, so let's start with an introduction to the B-Tree index.
B Tree Index:
As the name suggests, the PostgreSQL B-Tree index is based on the B-Tree data structure (specifically the Lehman & Yao algorithm, a variant of the B+ tree). The main difference between a B-Tree and a B+ tree is that in a B+ tree all values are stored in the leaf pages and the leaf pages are linked together, while internal pages hold only keys used for navigation.
BTree Internal Data Structure:
The root, parent (internal), and leaf nodes are all pages with the same structure. Each page contains:
- Block Number
- High Key
- Items
Block Number:
Every page is identified by its block number within the index file.
High Key:
The high key is the upper bound for a page: every value stored on the page is less than or equal to it. For example, if a page's high key is 27, all values we can find on that page are less than or equal to 27.
Items:
Each item is a key/value pair (a value and a pointer).
- Value: in a parent (internal) page, it is the key taken from the first row of the child page it points to; in a leaf page, it is the indexed row's value.
- Pointer: in a parent page, it points to a child page; in a leaf page, it is a tuple identifier (TID) pointing to the row in the table.
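To see these fields in practice, here is a minimal sketch using the pageinspect extension on a hypothetical table and index (demo and demo_id_idx are made-up names). Block 0 is the meta page, so the first leaf page of a small index is typically block 1.

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE demo (id int);
INSERT INTO demo SELECT generate_series(1, 10000);
CREATE INDEX demo_id_idx ON demo (id);

-- Meta page: shows the block number of the root page and the tree level
SELECT root, level FROM bt_metap('demo_id_idx');

-- Per-page statistics for block 1 (a leaf page): live items, free space, ...
SELECT blkno, type, live_items, free_size FROM bt_page_stats('demo_id_idx', 1);

-- Items on block 1: each row is one index tuple (pointer + value)
SELECT itemoffset, ctid, data FROM bt_page_items('demo_id_idx', 1) LIMIT 5;
```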
PostgreSQL offers many index types. The B-Tree index is good at equality and range operations such as =, <, >, <= and >=, and it is the default index type used when an index is created.
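As a quick illustration (the orders table and column names are placeholders), creating an index without a USING clause gives a B-Tree, and range predicates can make use of it:

```sql
CREATE TABLE orders (id serial, amount numeric);

-- No USING clause needed: btree is the default access method
CREATE INDEX orders_amount_idx ON orders (amount);

-- B-Tree indexes handle equality and range predicates
EXPLAIN SELECT * FROM orders WHERE amount BETWEEN 100 AND 500;
```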
Now let us step into deduplication of B-Tree indexes.
What is Deduplication ?
As per official documentation,
A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.
In simple terms, duplicate values are merged together to form a single posting list for each key value, so each key value appears only once with no duplicates. This merging is achieved by storing a sorted array of tuple pointers (TIDs) that point to the rows in the table.
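A hedged sketch of how this looks on disk: assuming the pageinspect extension and a hypothetical demo_dup table with many repeated values, bt_page_items() on PostgreSQL 13 reports the merged heap TIDs for posting list tuples through its htid and tids columns.

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE demo_dup (val int);
INSERT INTO demo_dup SELECT i % 10 FROM generate_series(1, 10000) AS i;  -- many duplicates
CREATE INDEX demo_dup_idx ON demo_dup (val);

-- For posting list tuples, htid is the first heap TID and tids holds
-- the full sorted array of TIDs that were merged for that key value
SELECT itemoffset, data, htid, tids
FROM bt_page_items('demo_dup_idx', 1)
LIMIT 5;
```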
Why do we need it ?
- To reduce the storage size of indexes (only for indexes with duplicates, not for columns with unique values)
- Query latency is reduced because fewer blocks need to be scanned.
- Overall query throughput may increase
- The overhead of routine index vacuuming can be reduced
Test comparisons:
We loaded test data into PostgreSQL 12 and 13 and compared the following two cases (a sketch of the test is shown after the list):
- Non Unique Index
- Unique Index
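Here is a rough sketch of the kind of test we ran; the dedup_test table, column names, and row counts are illustrative rather than our exact dataset. Run it on both servers and compare the reported index sizes.

```sql
CREATE TABLE dedup_test (id int, status text);
INSERT INTO dedup_test
SELECT i, CASE WHEN i % 2 = 0 THEN 'active' ELSE 'inactive' END
FROM generate_series(1, 1000000) AS i;

CREATE INDEX dedup_test_status_idx ON dedup_test (status);   -- non-unique, many duplicates
CREATE UNIQUE INDEX dedup_test_id_idx ON dedup_test (id);    -- unique, no duplicates

-- Compare the on-disk size of both indexes on PostgreSQL 12 and 13
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'dedup_test';
```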
From the above results, it is observed that there is a clear difference in the size of non-unique indexes, as they can take advantage of deduplication in PostgreSQL 13. There is no advantage from this feature for unique indexes.
New Parameter:
Deduplication is enabled by default. If we want to disable it while creating a particular index, it can be done using the new storage parameter deduplicate_items.
The syntax is:
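For example (table and index names are placeholders):

```sql
-- deduplicate_items defaults to on; turn it off at index creation time
CREATE INDEX dedup_test_status_idx ON dedup_test (status)
WITH (deduplicate_items = off);
```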
Deduplication can also be turned off for an already indexed column using the ALTER INDEX command, but this only prevents future insertions from triggering deduplication; existing posting list tuples are left as they are.
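For example, with a placeholder index name:

```sql
-- Disable deduplication on an existing index; already merged posting
-- list tuples remain, only future insertions are affected
ALTER INDEX dedup_test_status_idx SET (deduplicate_items = off);
```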
Upgrading Versions:
Upgrading to PostgreSQL 13 using pg_upgrade requires a REINDEX so that existing indexes are rebuilt with deduplication. Upgrading with pg_dump or pg_dumpall does not require reindexing, because the indexes are recreated during the restore and deduplication is enabled automatically.
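A minimal sketch of rebuilding indexes after pg_upgrade; the index and database names here are placeholders.

```sql
-- Rebuild a single index without blocking writes
REINDEX INDEX CONCURRENTLY dedup_test_status_idx;

-- Or rebuild all indexes in a database from the shell:
-- reindexdb --concurrently -d mydb
```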
Restrictions:
Deduplication does not support numeric, float, or container types (such as composite types, arrays, and range types), so it is not possible to use this feature on indexes over these data types.
Conclusion:
In our testing, this feature gave a great reduction in index size. It is a very nice addition in PostgreSQL 13 that will be really useful for reducing the footprint of indexes and also helps improve performance considerably.