Revitalising VACUUM FULL for 8.3 - Mailing list pgsql-hackers

From Simon Riggs
Subject Revitalising VACUUM FULL for 8.3
Date
Msg-id 1172750543.3760.1232.camel@silverbirch.site
Whole thread Raw
Responses Re: Revitalising VACUUM FULL for 8.3
Re: Revitalising VACUUM FULL for 8.3
List pgsql-hackers
Use case for VACUUM FULL is very low these days. VACUUM does the most
important part of what VACUUM FULL offers, yet does it concurrently
rather than with a full table lock. VACUUM FULL also
- has very long execution time
- generates lots of WAL traffic
- uses lots of memory while it runs
- isn't as good at compacting a relation as CLUSTER
- sometimes requires multiple runs to properly compact data

CLUSTER has
- much better compaction than VACUUM FULL when run concurrently with
other transactions (yes, really!)
- need not generate WAL, in many cases
- offers no weird failure cases

I propose to use the CLUSTER technique in place of the current VACUUM
FULL code. The command VACUUM FULL would still exist, but would execute
the command in a different manner, very similar to the way CLUSTER
works, just without the sort-the-table feature.

The benefits of this approach would be:
- VACUUM FULL would compact relations much better than it does now
- would never need multiple executions to achieve good compaction
- operate much faster, with a single pass over the main relation
- it would compact as effectively as CLUSTER, yet execute faster
- need not generate WAL while it executes
- no requirement for large maintenance_work_mem

The potential downsides of this approach are the following:
- "it uses more space"
- "it breaks MVCC"
There are good answers to both these points, so please read on.

The manual says VACUUM FULL can be speeded up by dropping and
re-creating indexes, but it is still lengthy. It is even faster to drop
the indexes, do a CREATE TABLE AS SELECT * FROM table, drop the old
table and then rebuild the indexes. 

So the main use case for current VACUUM FULL is when the space to be
freed inside the table is low enough to make defraging the table quicker
than than the above workaround, yet still high enough that we were
worried enough to do a VACUUM FULL. You must also be running it
concurrently with other transactions, but clearly ones that don't
include the current table because they will be locked out by the VACUUM
FULL. That's a tough requirement because this table is by-definition one
that is/has been heavily updated/deleted. And also VACUUM hasn't been
much use at freeing space at the end of the table, which it would only
fail to do with concurrently held locks. Thats a very narrow use case
and I doubt whether it exists at all any longer.

New VACUUM FULL would perform a SeqScan of the main relation using
SnapshotNow, inserting the results into a new relfilenode. If it crashes
part way through the first phase, we drop the file being built and table
is untouched. No WAL need be written while we are doing this, except
when archive_command is set. This produces a new relation which is
tightly compacted, free of older tuples and the physical file is no
larger than required. Vacuum delay points would also be supported. No FK
checks would be required, nor would other constraints need to
be-rechecked during the insertion. Index entries would not be made
during the insertions into the second heap. Just as with CLUSTER, the
second phase would consist of rebuilding all indexes, ensuring that they
too are as compact as possible.

Space usage of VACUUM FULL could be as high as twice the target table,
but only in the case where there wasn't anything to VACUUM or truncate.
In the typical use case we would be looking to remove large numbers of
dead tuples, as well as truncate the relation, so the actual space
overhead would be more typically only about +50% of the pre-VACUUM FULL
size of the target table. If execution hits an out-of-space error then
the command can quickly recover. If space really isn't available, then
indexes can be dropped manually and the process re-executed. Arranging
for more temp space is now easier with the new temp space management
code.

Current CLUSTER does not respect MVCC. It's possible for an old
serializable transaction to miss consistent data as a result. That is a
serious problem for pg_dump and I propose to fix that in this proposal,
for both CLUSTER and the new VACUUM FULL.

I would like to introduce the concept of utility transactions. This is
any transaction that touches only one table in a transaction and is not
returning or modifying data. All utility transactions wait until they
are older than all non-utility transactions before they commit. A
utility transaction would currently be any VACUUM, VACUUM FULL and
CREATE INDEX CONCURRENTLY. That is safe because each of those commands
executes in its own transaction and doesn't touch more than one table at
a time. Once each knows there is no chance of being interfered with, it
can continue its work and commit. This technique is already in use for
CREATE INDEX CONCURRENTLY, so just needs to be extended to all other
utilities - but in a way that allows them to recognise each other. This
extends upon the thought that VACUUMs already recognise other VACUUMs
and avoid using them as part of their Snapshot. 

I would also like to detect cases where CLUSTER, CREATE INDEX and ALTER
TABLE are running in their own implicit top-level transaction, so that
these too can be recognised by the server as utility transactions.

The utility transaction concept would make new VACUUM FULL MVCC-safe and
would also make most executions of CLUSTER MVCC-safe also (the implicit
top-level transaction cases).

The coding for much of this looks very straightforward. I have to
rewrite VACUUM FULL to cope with HOT anyway, so this seems like both a
faster, more robust and easier development path to take. However, none
of the above arguments rely on the acceptance of the HOT patch to
increase their relevance for this proposal. 

Comments?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




pgsql-hackers by date:

Previous
From: "Florian G. Pflug"
Date:
Subject: Re: Is there a way to run heap_insert() AFTER ExecInsertIndexTuples() ?
Next
From: Zoltan Boszormenyi
Date:
Subject: Re: Is there a way to run heap_insert() AFTER ExecInsertIndexTuples() ?