In Part 1 we looked at how PostgreSQL handles data concurrency, MVCC, and learned that understanding MVCC is critical for understanding PostgreSQL. Part 2 continued with MVCC and focused on concurrent access.
Continuing our introduction to PostgreSQL, in today's third installment we will look at the most essential maintenance operation, Vacuum. As we saw earlier, the implementation of MVCC (Multi-Version Concurrency Control) in PostgreSQL differs from other RDBMs systems such as Oracle. In PostgreSQL, dml operations do not remove older versions of rows. Multiple versions of rows may need to be maintained for any active transactions that began prior to the dml. Vacuum is the utility that removes these rows once they are eligible. In addition, Vacuum also provides other useful/necessary benefits:
- As Mentioned, to recover disk space/improve performance by removing older row versions.
- To improve performance by maintaining accurate statistics.
- To improve vacuum performance by updating the visibility map.
- To prevent against data loss due to transaction wraparound.
We will examine each of these in order. For this we will use a PostgreSQL v13.2 instance (latest beta release) with autovacuum set to off (never turn this off!). Autovacuum is a set of automated processes which periodically run vacuum/analyze commands based on adjustable parameters. These parameters can be set at the cluster or table level for better control.
- Disk space recovery/reuse:
We have a 10M row table (vac_test) with no indexes.
Total size (in MB)
Main - table size
Fsm - free space map of table
Vm - visibility map of table - pages which have no dead rows
init - replacement for unlogged tables
Main | fsm | vm | init | table size | indexes | Total 1116 | 0 | 0 | 0 | 1116 | 0 | 1116
We can see this represented at the file level.
Each table segment has a default max size of 1G (this can be changed at initdb using --with-segsize option). Each additional segment will be named filenode.n , filenode.n+1, etc.
To find the file path location:
Select pg_relation_filepath('vac_test'); pg_relation_filepath -------------------- base/26290/27763
So, 2 segments initially
ls -lrt --block-size M $PGDATA/base/26290/27763 -rw--------- 1 pg13r2 postgres 1024M Jul 9 12:51 /data/pg13r2/data/base/26290/27763 -rw--------- 1 pg13r2 postgres 93M Jul 9 12:51 /data/pg13r2/data/base/26290/27763.1
Now, lets update 5M rows
# Update vac_test # set code = 'Pittsburgh' # where id::text like '%4%'; UPDATE 5217031
We didn't add rows, but it we check the size, the table has grown approx 50%.
main | fsm | vm | init | table size | indexes | Total 1698 | 0 | 0 | 0 | 1698 | 0 | 1698
View from the file system
-rw--------- 1 pg13r2 postgres 1M Jul 9 13:31 /data/pg13r2/data/base/26290/27763_fsm -rw--------- 1 pg13r2 postgres 675M Jul 9 13:31 /data/pg13r2/data/base/26290/27763.1 -rw--------- 1 pg13r2 postgres 1024M Jul 9 13:32 /data/pg13r2/data/base/25290/27763
As you can see, this is a problem, which is resolved by vacuum.
Vacuum has several options:
\h vacuum Command: VACUUM Description: garbage-collect and optionally analyze a database
VACUUM [ ( option [,...])][ table_and_columns [,...]] VACUUM [ FULL ][ FREEZE ][ VERBOSE ][ ANALYZE ][ table_and_columns [,...]]
where option can be one of:
FULL [ boolean ] FREEZE [ boolean ] VERBOSE [ boolean ] ANALYZE [ boolean ] DISABLE_PAGE_SKIPPING [ boolean ] SKIP_LOCKED [ boolean ] INDEX_CLEANUP [ boolean ] TRUNCATE [ boolean ] PARALLEL integer - new feature added in v13 to allow for parallel vacuum index maintenance.
and table_and_columns is:
table_name [( column_name [,...])]
We will examine these parameters later but for now we are just going to run a simple analyze.
# vacuum verbose vac_test; INFO: vacuuming "public.vac_test" INFO: "vac_test": removed 5217031 row versions in 142858 pages INFO: "vac_test": found 5217031 removable, 10000000 nonremovable row versions in 217387 out of 217387 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 5738829 There were 0 unused item identifiers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user 2.53s, system 0.71s, elapsed 4.48s INFO: vacuuming "pg_toast.pg_toast_27755" INFO: index "pg_toast_27755_index" now contains 0 row versions in 1 pages DETAIL: 0 index row versions where removed. 0 index pages have been deleted, 0 are currently resuable. CPU: user: 0.00s, system 0.00s, elapsed 0.00s. INFO: "pg_toast_27755" found 0 removable, 0 nonremovable row versions in 0 out of 0 pages DETAIL: 0 dead row versions cannot be removed yet; oldest xmin 5738829 There were 0 unused item identifiers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.00s, system 0.00s, elapsed 0.00s. VACUUM
Lets check the space again
main | fsm | vm | init | table size | indexes | total 1698 | 0 | 0 | 0 | 1698 | 0 | 1698
The allocated space is the same
we see the creation of the visibility map at the file level
--rw--------- 1 pg13r2 postgres 1M Jul 9 13:31 /data/pg13r2/data/base/26290/27763_fsm --rw--------- 1 pg13r2 postgres 675M Jul 9 13:33 /data/pg13r2/data/base/26290/27763.1 --rw--------- 1 pg13r2 postgres 1M Jul 9 13:39 /data/pg13r2/data/base/26290/27763_vm --rw--------- 1 pg13r2 postgres 1024M Jul 9 13:40 /data/pg13r2/data/base/26290/27763
The dead rows are removed and the space is made available to the table but is not released to the file system - in order to achieve that you would need to run a vacuum full
# vacuum full verbose vac_test; INFO: vacuuming "public.vac_test" INFO: "vac_test": found 0 removeable rows, 10000000 nonremovable row versions in 217387 pages DETAIL: 0 dead row versions cannot be removed yet. CPU: user: 30.84s, system: 2.16s, elapsed: 38.55s VACUUM
We check the size and see we are back to our original:
main | fsm | vm | init | table size | indexes | total 1116 | 0 | 0 | 0 | 1116 | 0 | 1116 --rw--------- 1 pg13r2 postgres 1024M Jul 9 13:43 /data/pg13r2/data/base/26290/27769 --rw--------- 1 pg13r2 postgres 93M Jul 9 13:43 /data/pg13r2/data/base/26290/27769.1
If possible, it is best to avoid a vacuum full. As you can see, the file node has changed - a vacuum full is a complete rewrite of the table.
- This requires enough disk space to accommodate both tables until the rewrite is complete.
- More importantly, vacuum full requires an exclusive lock on the table so so other operations may occur concurrently. A standard vacuum allows for this.
- The autovacuum process does not run vacuum full.
In the next post, we will continue to review the main function of vacuum and examine the parameters that control this behavior. We will follow that with a series of posts on the other operations of vacuum mentioned above.