Sirius is working to provide more transparent processes for planning clients’ database usage, and more proactive monitoring and reporting on the state of databases. Recommendations in this series should be seen as part of a long-term roadmap.
This report covers PostgreSQL but will be extended to include Oracle.
PostgreSQL works in a similar way to making notes on a piece of paper. You note something down. Then you change your mind: you cross the entry out, draw a line to wherever there is space further down the page, and write a new version there. After a while, the sheet of paper will consist mostly of crossed-out entries and it will become difficult to assemble a current set of notes from it. Bear this example in mind in the following explanations.
What are vacuums?
Vacuums are the equivalent of taking the deletions off your sheet of notes and leaving spaces. Assembling the full set of notes afterwards is still quite difficult because the notes will be scattered all over the surface of the paper in an order that will appear less and less clear as time goes by.
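In SQL terms, the plain form of this operation looks as follows (the table name notes is a hypothetical example). It reclaims the space held by the crossed-out entries for reuse, but does not shrink the file on disk:

```sql
-- Reclaim space from dead row versions in one table.
-- The space is marked reusable; the file is not shrunk.
VACUUM notes;

-- Without a table name, VACUUM processes every table
-- in the current database.
VACUUM;
```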
Vacuum full is a version of vacuum that picks up the notes and moves them to freed-up spaces at the beginning of the sheet. However, this means moving data that is not dead, and so it needs an exclusive lock, which may disrupt normal system operations at peak traffic times. To return to our example, a vacuum full means that your sheet of notes cannot be used while it is being re-ordered. Also, since entries are simply moved up the sheet rather than sorted, any chaotic sequences are left in place.
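A sketch of the full variant, again against the hypothetical notes table; because it rewrites live rows it holds an exclusive lock for the duration:

```sql
-- Rewrite the whole table compactly, returning freed space to
-- the operating system. Takes an ACCESS EXCLUSIVE lock, so all
-- reads and writes on the table block until it finishes.
VACUUM FULL notes;
```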
What is clustering?
You are likely to respond to the growing disorder on your sheet by copying just the current notes according to some logical order onto a new sheet. There is such an action in PostgreSQL, which is called cluster. A new, ordered version of the sheet of notes is made and the old sheet is discarded.
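The equivalent command, assuming the hypothetical table notes has an index notes_created_idx to order by, is:

```sql
-- Rewrite the table in the physical order of the named index;
-- the old, bloated copy is discarded when the rewrite completes.
CLUSTER notes USING notes_created_idx;

-- Once an ordering index has been recorded, the short form
-- re-clusters the table on the same index.
CLUSTER notes;
```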
In making your new sheet, the smart thing would be to leave some space under each entry so that, when you need to make a new version, you can try to fit it under the previous one. The FILLFACTOR storage parameter in PostgreSQL offers a similar facility. For example, a fillfactor of 50 states that 50% of each page must be left empty, which is enough space for every entry on the page to be duplicated in place. The disadvantage of clustering is that an ACCESS EXCLUSIVE lock is acquired on the table while it runs. This blocks all other operations on the table (both reads and writes), so the data is effectively unavailable for as long as the process takes. You need to choose your moment and be quick about it!
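Setting the storage parameter itself is quick, sketched here on the hypothetical notes table. Note that it only governs how pages are filled from then on; the table must be rewritten (with CLUSTER or VACUUM FULL, for example) before existing pages respect it:

```sql
-- Leave 50% of each page free, so updated row versions can be
-- written close to the originals. Affects future writes only.
ALTER TABLE notes SET (fillfactor = 50);
```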
Running a CLUSTER command makes sense if you are short of disk space but, if you can duplicate the amount of space used by the table, there is a quicker way: create a new table as an ordered copy of the live rows (with CREATE TABLE ... AS SELECT ... ORDER BY ...). After this you simply drop the old table, rename the new one to the old name, and rebuild all indexes and constraints. The reasons why this process can be quicker are set out in the attached official documents for CLUSTER. What is important is that you must have the resources (a lot of RAM, for example), and there are some downsides described in the documents that you need to be aware of (for example, you should not use this approach if OIDs are important to your architecture).
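A minimal sketch of this copy-and-swap approach, assuming a hypothetical notes table ordered by a created column; the index and constraint definitions shown are illustrative and would need to match your real schema:

```sql
BEGIN;

-- Write an ordered copy of the live rows; the dead row
-- versions are simply left behind in the old table.
CREATE TABLE notes_new AS
    SELECT * FROM notes ORDER BY created;

-- Swap the copy in and discard the bloated original.
DROP TABLE notes;
ALTER TABLE notes_new RENAME TO notes;

-- Recreate indexes and constraints here, for example:
-- CREATE INDEX notes_created_idx ON notes (created);

COMMIT;
```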
Why do we vacuum at all, if clustering is so much better?
The answer is that clustering is not so much better. More often than not, the added value is minimal but the added cost is not. Clustering makes sense if, and only if, there is a single dominant access criterion for your table. Suppose you have a huge table ordered by a very inefficient index (an index based on text values, for example): it is expensive to jump across remotely placed chunks of data, just as it would be to collect paper sheets sparsely spread across a number of floors in your office building. At the very least, you would want your sheets to be close to each other and in sequence.
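One way to judge whether such a dominant order already exists is the correlation statistic PostgreSQL keeps per column in the standard pg_stats view: values near 1 or -1 mean the physical row order already follows that column, while values near 0 mean rows are scattered. A sketch against a hypothetical table called notes:

```sql
-- Physical-vs-logical ordering per column. A correlation close
-- to 0 means range scans on that column must jump between
-- distant pages on disk.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'notes'
ORDER BY abs(correlation) DESC NULLS LAST;
```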
Generally, however, a table is accessed according to very different criteria depending on the query's context. You may want to read data by name, or by date, or maybe according to that date's weather. If you want all these alternative access strategies to work, without assigning prominence to any of them, then clustering is clearly not the solution. What you need is a small number of sheets, which is what vacuum will do, together with very good information about what is where, which is what indexes will provide. Clustering is always going to be the exception rather than the rule.
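Keeping several alternative access paths then simply means keeping several indexes, none of which dictates the physical row order. A sketch with hypothetical column names:

```sql
-- Independent access paths over the same table; each index
-- records where the matching rows are, without reordering them.
CREATE INDEX notes_name_idx ON notes (name);
CREATE INDEX notes_created_idx ON notes (created);
```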
What if I don't vacuum/cluster?
Returning to the example of notes on a sheet of paper, the more disordered you allow your notes to become, the longer it will take you to extract information from them. You will end up with hundreds of sheets, many of which contain just a single usable entry, and all in a random order. You can keep a separate list of the content on each sheet, which is what PostgreSQL calls an index, but you will still have to skim through an immense number of bloated sheets of paper.
Basically, the more frequently your entries change and/or new entries are added the sooner your notes will need re-ordering to remain usable. If you keep drawing a line on each entry to delete it, even a relatively small table of maybe 50 entries will eventually expand into tens of single sheets of paper. You need some metrics to determine what needs to be vacuumed. Alternatively, you can let autovacuum deal with the situation, in which case PostgreSQL will decide on its own when and what should be done, and how.
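The bookkeeping needed for such metrics is already collected by PostgreSQL's statistics views. A sketch against the standard pg_stat_user_tables view, showing which tables have accumulated the most dead entries and when they were last cleaned up:

```sql
-- Tables with the most dead row versions, and when they were
-- last vacuumed, either manually or by autovacuum.
SELECT relname, n_dead_tup, n_live_tup,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```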
How do I minimize I/O impact?
All kinds of vacuum cause a substantial increase in I/O traffic. Some are worse than others and even add locks, but the I/O impact cannot be avoided. The minimum activity required to make decisions about data, even if you just want to get rid of it, is to read it, and everything beyond this adds impact, possibly in different areas. There is no one-size-fits-all approach, since everything depends on where your system's vulnerabilities lie. The only known way to mitigate the I/O impact is to use cost-based delays. These pause the vacuum process at given intervals in order to dilute the impact, but at the cost of it taking correspondingly longer to complete.
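These delays are controlled by a small set of settings; a sketch of turning them on for a single manual run against the hypothetical notes table (the values shown are illustrative, not recommendations):

```sql
-- Pause the vacuum for 20 milliseconds each time roughly 200
-- 'cost units' of page reads and writes have accumulated.
-- vacuum_cost_delay defaults to 0 (no delays) for manual runs.
SET vacuum_cost_delay = 20;
SET vacuum_cost_limit = 200;
VACUUM notes;
```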
What sorts of vacuum are there?
Vacuum can be pointed at a single table or, if not, it will deal with all tables in your database. It can be run with added specifications, such as FREEZE, ANALYZE and FULL, and you can combine them to create more complex behaviours.
The variants that only read live data (plain VACUUM, FREEZE and ANALYZE) add no impact beyond the common I/O bottleneck. VACUUM FULL, by contrast, needs to modify live data and thus adds locks to the already existing problem of I/O consumption. As a general policy, cost-based delays should be used on the read-only variants. You should try to run the locking variants only in low traffic, with no delays and with as many resources as possible, to ensure they end as quickly as possible. PostgreSQL does this by default: it does not apply the delays while the resources are locked, so it is safe to leave the delays on and rely on PostgreSQL to decide when to use them.
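Combining the specifications uses the following syntax, sketched against a hypothetical notes table; the parenthesised form requires PostgreSQL 9.0 or later:

```sql
-- Reclaim dead rows and refresh planner statistics in one pass,
-- reporting progress for each table processed.
VACUUM (ANALYZE, VERBOSE) notes;

-- Older, unparenthesised spelling of the same command.
VACUUM VERBOSE ANALYZE notes;
```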
In the next article to be published on Friday 12th October 2012, we review FREEZE.