PostgreSQL: New features in Version 13 - Index Deduplication

With the second beta release of PostgreSQL 13, we now have first sight of the full range of new features the PostgreSQL and Open Source world is getting excited about! Naturally, Sirius engineers are getting deep into them, and we'd like to share these exciting new features with you!

Today’s post will examine one of the new features of PostgreSQL 13 – B-tree index de-duplication.

Index de-duplication allows for the grouping of duplicate tuples into a single key with an assocated array identifying the individual tuples.

The benefits are:

Decreased storage
Potential query throughput gain
Reduced index maintenance overhead

We will test with a simple table of countries and isocodes

\d idx_test

              Table "geoname.idx_test"

 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 isocode | integer |           |          |
 name    | text    |           |          |

Indexes:
    "name2_ind" btree (name)

The table has 10M rows with the following distribution histogram (script provided by the Faraday Blog):

 bucket |   range   |  freq   |       bar
--------+-----------+---------+-----------------
      1 | [0,44]    |  479700 | ***
      2 | [48,86]   |  280620 | **
      3 | [90,132]  |  425070 | ***
      4 | [136,178] |  887670 | ******
      5 | [180,222] |  219510 | **
      6 | [226,268] |  301590 | **
      7 | [270,312] |  230670 | **
      8 | [316,356] |  665910 | *****
      9 | [360,400] |  870660 | ******
     10 | [404,446] |  287460 | **
     11 | [450,484] |  535410 | ****
     12 | [492,535] |  346320 | **
     13 | [540,580] |  693630 | *****
     14 | [583,624] |  486450 | ***
     15 | [626,670] |  387990 | ***
     16 | [674,710] |  250470 | **
     17 | [716,756] |  285750 | **
     18 | [760,804] |  436770 | ***
     19 | [807,840] | 2102760 | ***************
     20 | [850,887] |  151020 | *
     21 | [894,894] |   25470 |
(21 rows)

We created the same table in prior versions and a copy in v12 with the storage parameter ‘deduplicate_items’ set to off ( default is on)

The index storage difference is quite significant:

DB Version	13.2	13.2 No Deduplication	12.3	9.5.22
Index Size	69MB	266MB	266MB	265MB

If we run the following in all instances

explain (analyze, verbose ,buffers on )
select isocode from idx_test where name = 'United States';

DB Version	13.2	13.2 No Deduplication	12.3	9.5.22
Buffers Shared Hit	1695	7589	7589	7588

Buffers: shared hit are blocks that are read from cache.

Vacuum performance - update 2M rows / vacuum

DB VERSION	Update Time	Index Pages Vacuumed	Vacuum Time
9.5.22	21 Sec	7583	1.97 Sec
12.3	19 Sec	7630	2.32 Sec
13.2 No Deduplication	21 Sec	7630	1.9 Sec
13.2	18 Sec	1701	1.5 Sec

Re-index performance - postgresql 9.5 is included for completeness as concurrent index rebuilds is not available in that version.

DB VERSION	Pre Re-index Size	Re-index Time	Post Re-index Size
9.5.22	376 MB	39 Sec	300 MB
12.3	354 MB	24 Sec	300 MB
13.2 No Deduplication	354 MB	16 Sec	300 MB
13.2	83 MB	12 Sec	69 MB

This is a very useful implementation, and is enabled by default. To disable this (per index) you would use the following command:

create index test_nodedup on idx_test2(name) with (deduplicate_items = off);

Author

Thomas Flatley