Here is a WIP patch based on the discussions here:
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00863.php
The attached WIP patch improves LAZY VACUUM by limiting or
avoiding the second heap scan. This not only saves considerable time
in VACUUM, but also reduces the double-writes of vacuumed blocks. If
the second heap scan is considerably limited, that should also save
CPU usage and reduce WAL log writing.
With HOT, the first heap scan prunes and defrags every page in the
heap. That truncates all the dead tuples to their DEAD line pointers
and releases all the free space in the page. The second scan only
removes these DEAD line pointers and records the free space in the
FSM; the free space in fact does not change after the first pass. Yet
to do so, the second pass again calls PageRepairFragmentation() on each
page, dirties the page and calls log_heap_clean() on it once more. This
clearly looks like too much work for a small gain.
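To make that concrete, here is a rough illustrative sketch of the
per-page work in the second pass. This is not the actual
lazy_vacuum_page() code (which works off the dead-tuple TIDs collected
in the first pass), but the net effect per page is the same:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: roughly what the second pass does to each page today.
 */
static void
second_pass_page_sketch(Buffer buf)
{
    Page            page = BufferGetPage(buf);
    OffsetNumber    offnum,
                    maxoff = PageGetMaxOffsetNumber(page);

    START_CRIT_SECTION();

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId  itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            ItemIdSetUnused(itemid);    /* drop the DEAD line pointer */
    }

    /* defragment and dirty the page all over again */
    PageRepairFragmentation(page);
    MarkBufferDirty(buf);

    /* ... and emit a second log_heap_clean() WAL record here ... */

    END_CRIT_SECTION();
}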
As this patch stands, the first phase of vacuum prunes the heap pages
as usual. But, if certain conditions are satisfied, it also marks the
DEAD line pointers as DEAD_RECLAIMED to signal that the index pointers
to these line pointers are being removed. When another backend later
prunes such a page, it reclaims the DEAD_RECLAIMED line pointers by
marking them UNUSED. We need some additional logic to do this safely:
- An additional boolean pg_class attribute (relvac_inprogress) is used
to track the status of vacuum on a relation. If the attribute is true,
either vacuum is in progress on the relation or the last vacuum did
not complete successfully.
When VACUUM starts, it sets relvac_inprogress to true. The transaction
is committed and a new transaction is started so that all other
backends can see the change. We also note down the transactions which
may already have the table open. VACUUM then starts the first heap
scan. It prunes each page as usual, but it can start marking the DEAD line
pointers as DEAD_RECLAIMED only after it knows that all other backends
can see that VACUUM is in progress on the target relation. Otherwise
there is a danger that backends might reclaim DEAD line pointers
before their index pointers are removed and that would lead to index
corruption. We do that with periodic conditional waits on the noted
transaction ids. Once all of those old transactions are gone, VACUUM
sets the second-scan limit to the current block number and starts
marking subsequent DEAD line pointers as DEAD_RECLAIMED.
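To make the sequencing explicit, here is a condensed sketch of the
first pass under this scheme. The helper names (set_relvac_inprogress,
collect_open_xids, old_xids_gone, prune_and_mark) are made up for this
mail and are not the names used in the patch; the point is only to show
where the second-scan limit gets fixed:

#include "postgres.h"
#include "access/xact.h"
#include "storage/block.h"
#include "utils/rel.h"

/* hypothetical helpers, standing in for the real patch code */
extern void set_relvac_inprogress(Oid relid, bool flag);
extern TransactionId *collect_open_xids(Relation rel, int *count);
extern bool old_xids_gone(TransactionId *xids, int count);
extern void prune_and_mark(Relation rel, BlockNumber blkno, bool mark_reclaimed);

static void
vacuum_first_pass_sketch(Relation onerel, BlockNumber nblocks)
{
    TransactionId  *noted_xids;
    int             n_noted;
    bool            can_mark = false;
    BlockNumber     second_scan_limit = nblocks;    /* worst case: full second pass */
    BlockNumber     blkno;

    /* mark the relation as being vacuumed and make that visible to everyone */
    set_relvac_inprogress(RelationGetRelid(onerel), true);
    CommitTransactionCommand();
    StartTransactionCommand();

    /* remember the transactions which may already have the table open */
    noted_xids = collect_open_xids(onerel, &n_noted);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /*
         * Once every noted transaction is gone, all backends must see
         * relvac_inprogress = true, so it is safe to start marking DEAD
         * line pointers as DEAD_RECLAIMED from this block onwards.
         */
        if (!can_mark && old_xids_gone(noted_xids, n_noted))
        {
            can_mark = true;
            second_scan_limit = blkno;  /* only earlier blocks need pass two */
        }

        prune_and_mark(onerel, blkno, can_mark);
    }

    /* the second pass now only visits blocks [0, second_scan_limit) */
}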
In most cases the old transactions go away quickly, so for large tables
the second scan will be very limited. In the worst case, we might incur
the overhead of the conditional waits without any success.
TODO:
- We can potentially update the FSM at the end of the first pass. This
is not a significant issue if the second scan is very limited, but if
we do this, we need to handle the truncate case properly.
- As the patch stands, we check for old transactions at every block
iteration. This might not be acceptable when there are long-running
transactions. We probably need some exponential gap between the checks
here (a rough sketch follows this list).
- As the patch stands, heap_page_prune handles reclaiming the
DEAD_RECLAIMED line pointers, since it already has the ability to
WAL-log similar changes. We don't do any extra work to trigger pruning
though (other than setting page_prune_xid). Maybe we should also
trigger pruning when a page has line pointer bloat.
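For the second item, the first-pass loop from the sketch above could be
changed along these lines (again just an illustration, reusing the same
hypothetical helpers, with arbitrary numbers):

    /* check at blocks 0, 1, 3, 7, 15, ... instead of at every block */
    BlockNumber next_check = 0;
    BlockNumber check_gap = 1;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (!can_mark && blkno >= next_check)
        {
            if (old_xids_gone(noted_xids, n_noted))
            {
                can_mark = true;
                second_scan_limit = blkno;
            }
            else
            {
                next_check = blkno + check_gap;
                check_gap *= 2;         /* exponential backoff between checks */
            }
        }

        prune_and_mark(onerel, blkno, can_mark);
    }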
Please let me know your comments/suggestions and any other improvements.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com