Or when the vacuum is not enough.
Anyone who works with PostgreSQL, whether a DBA or a developer, at some point encountered vacuum. All DBAs have configured vacuum and most developers have experienced decreased database performance and slow query performance due to the vacuum.
Though vacuum may seems as a black box to developers, most DBAs know how important vacuum and its configurations are. To those who are unfamiliar with vacuum, it is basically a garbage collector which cleans the database. Someone might ask – what does it clean my database from? Here, I would like to provide a brief overview, whilst you can find a detailed answer in the official PostgreSQL documentation.
Whilst handling data PostgreSQL provides individual snapshots of it to each client. Eventually, with the update and delete operations some data in the snapshots become obsolete, but still exist in the database and occupy space in tables’ and indexes’ datafiles. The result is datafile fragmentation and overall performance degradation. Here is where vacuum comes into play, its main goal is to clean this unneeded data (the dead rows) from tables and indexes, and to free up space for new data.
Of course, the world isn’t perfect and vacuum has its own particularities. For example, after vacuum has done its work, the space doesn’t actually becomes free, it’s only available for reuse within vacuumed table or index. Thus, it may seem that the size of the table or index has been reduced. In fact, after vacuuming the size is exactly the same as before the vacuum. This is something that many beginner Postgres users are confused by.
Another example are specific workloads where vacuum is powerless or is unable to do its work efficiently. Let’s take idle transactions, they defer cleanup of dead rows and lead to tables or indexes bloat. Therefore, idle transactions, and even regular long transactions harm database and must be avoided.
Another interesting example is the intensively updated tables with high number of indexes, in this case, when the vacuum processes such tables and their indexes, a high amount of updates trigger a new vacuum on that table, and we might see endless vacuum on that table. Even the aggressively configured vacuum wouldn’t be helpful in this case and the result – tables bloat.
You might also remember other similar cases with the same outcome – none of the vacuum configurations help and it seems impossible to return the occupied space. Of course, Postgres has VACUUM FULL feature, but it blocks access to the table, not only for writes but also for read SELECT queries. This is often unacceptable especially in production environment.
There is a way out, however, PostgreSQL is an open source and it has a great community. Community offers many tools for PostgreSQL and there are few tools which address VACUUM FULL limitations and do its work without blocking access to the processed table.
The first tool is pg_repack. It works like VACUUM FULL on steroids, it creates а new table and moves data from the old one to the new. To avoid the exclusive table locking, an additional log table is created to log changes made in the original table, and added a trigger that logs INSERT/UPDATE/DELETEs to the log table. When data from original tables has been imported into the new one and indexes are rebuilt, the changes from the log table are applied and finally, pg_repack swaps tables including indexes and drops the old, original table. This is quite simple and reliable process, but it has one caveat – extra space equals to an original table with its indexes plus some space for log table has to be reserved.
This is the main function of pg_repack, however, it has a few additional interesting features that allow you to:
- process only indexes without parent table.
- move tables and indexes between tablespaces – it’s helpful to balance IO or move archived partitions between fast SSD and slow HDD storages permanently for databases clients.
- perform a non-blocking CLUSTER ordered by the specified columns.
- speed up indexes build process with concurrent workers.
- cancel queries if they block pg_repack.
So, pg_repack looks nice and as you can see pg_repack is speed oriented, so it should be remembered that its inaccurate usage might lead to extreme disk IO and drop client queries’ performance.
Another community tool is pgcompacttable (also known as compactor) is another tool which helps to compress tables and indexes and achieve their optimal size.
Pgcompacttable works in different manner than pg_repack. It uses one interesting feature of PostgreSQL when it does INSERT or UPDATE operations – all new versions of rows go to available space at the beginning of the table. This is really important, because if starts updating all rows in the table from the end and moves backwards to the table beginning. Eventually all rows will fill all the available space and freed up space at the bottom of the table can be truncated by regular vacuum. Using special system column – ctid it’s possible to update rows and moving them from the end of table to the beginning. Thus pgcompacttable with batches of updates and vacuums forces PostgreSQL to move rows within tables and table finally becomes compacted.
Pgcompacttable has additional interesting feature that is worth mentioning, which somewhat intersects with pg_repack features:
- ability to process only indexes without parent table.
- ability to adjust performance impact caused by pgcompacttable
What to choose? Here is a short checklist that might be handy for choosing your tool:
1. Does the database has performance reserves? Is it important to keep clients queries performance?
Yes – pgcompacttable, No – pg_repack.
2. Do I need to move tables or indexes between tablespaces?
Yes – pg_repack, No – pgcompacttable.
3. Do I have a free reserved space for maintenance?
Yes – pg_repack, No – pgcompacttable.
In any case, it is always a good idea to keep both tools on your server.