With the second beta release of PostgreSQL 13, we now have first sight of the full range of new features the PostgreSQL and Open Source world is getting excited about! Naturally, Sirius engineers are getting deep into them, and we'd like to share these exciting new features with you!
Today’s post will examine one of the new features of PostgreSQL 13 – B-tree index de-duplication.
Index de-duplication allows for the grouping of duplicate tuples into a single key with an assocated array identifying the individual tuples.
The benefits are:
- Decreased storage
- Potential query throughput gain
- Reduced index maintenance overhead
We will test with a simple table of countries and isocodes
\d idx_test Table "geoname.idx_test" Column | Type | Collation | Nullable | Default ---------+---------+-----------+----------+--------- isocode | integer | | | name | text | | | Indexes: "name2_ind" btree (name)
The table has 10M rows with the following distribution histogram (script provided by the Faraday Blog):
bucket | range | freq | bar --------+-----------+---------+----------------- 1 | [0,44] | 479700 | *** 2 | [48,86] | 280620 | ** 3 | [90,132] | 425070 | *** 4 | [136,178] | 887670 | ****** 5 | [180,222] | 219510 | ** 6 | [226,268] | 301590 | ** 7 | [270,312] | 230670 | ** 8 | [316,356] | 665910 | ***** 9 | [360,400] | 870660 | ****** 10 | [404,446] | 287460 | ** 11 | [450,484] | 535410 | **** 12 | [492,535] | 346320 | ** 13 | [540,580] | 693630 | ***** 14 | [583,624] | 486450 | *** 15 | [626,670] | 387990 | *** 16 | [674,710] | 250470 | ** 17 | [716,756] | 285750 | ** 18 | [760,804] | 436770 | *** 19 | [807,840] | 2102760 | *************** 20 | [850,887] | 151020 | * 21 | [894,894] | 25470 | (21 rows)
We created the same table in prior versions and a copy in v12 with the storage parameter ‘deduplicate_items’ set to off ( default is on)
The index storage difference is quite significant:
DB Version | 13.2 |
13.2 No Deduplication |
12.3 | 9.5.22 |
---|---|---|---|---|
Index Size | 69MB | 266MB | 266MB | 265MB |
If we run the following in all instances
explain (analyze, verbose ,buffers on ) select isocode from idx_test where name = 'United States';
DB Version | 13.2 |
13.2 No Deduplication |
12.3 | 9.5.22 |
---|---|---|---|---|
Buffers Shared Hit | 1695 | 7589 | 7589 | 7588 |
Buffers: shared hit are blocks that are read from cache.
Vacuum performance - update 2M rows / vacuum
DB VERSION | Update Time | Index Pages Vacuumed | Vacuum Time |
---|---|---|---|
9.5.22 | 21 Sec | 7583 | 1.97 Sec |
12.3 | 19 Sec | 7630 | 2.32 Sec |
13.2 No Deduplication | 21 Sec | 7630 | 1.9 Sec |
13.2 | 18 Sec | 1701 | 1.5 Sec |
Re-index performance - postgresql 9.5 is included for completeness as concurrent index rebuilds is not available in that version.
DB VERSION | Pre Re-index Size | Re-index Time | Post Re-index Size |
9.5.22 | 376 MB | 39 Sec | 300 MB |
12.3 | 354 MB | 24 Sec | 300 MB |
13.2 No Deduplication | 354 MB | 16 Sec | 300 MB |
13.2 | 83 MB | 12 Sec | 69 MB |
This is a very useful implementation, and is enabled by default. To disable this (per index) you would use the following command:
create index test_nodedup on idx_test2(name) with (deduplicate_items = off);