From: Noah Misch
Subject: lazy_vacuum_heap()'s removal of HEAPTUPLE_DEAD tuples
Msg-id: 20130108024957.GA4751@tornado.leadboat.com
List: pgsql-hackers

Per this comment in lazy_scan_heap(), almost all tuple removal these days
happens in heap_page_prune():

                case HEAPTUPLE_DEAD:
                    /*
                     * Ordinarily, DEAD tuples would have been removed by
                     * heap_page_prune(), but it's possible that the tuple
                     * state changed since heap_page_prune() looked.  In
                     * particular an INSERT_IN_PROGRESS tuple could have
                     * changed to DEAD if the inserter aborted.  So this
                     * cannot be considered an error condition.

vacuumlazy.c remains responsible for noticing the LP_DEAD line pointers left
by heap_page_prune(), removing corresponding index entries, and marking those
line pointers LP_UNUSED.
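
In rough outline, that second pass boils down to the following per-page work
(a simplified sketch of lazy_vacuum_page()'s core with illustrative names, not
the actual code; the index entries pointing at these TIDs are gone by the time
this runs):

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"
    #include "storage/itemptr.h"

    /* Turn the LP_DEAD stubs left by heap_page_prune() into LP_UNUSED. */
    static void
    reclaim_dead_items_sketch(Page page, ItemPointer dead_tuples, int ndead)
    {
        int         i;

        for (i = 0; i < ndead; i++)
        {
            OffsetNumber offnum = ItemPointerGetOffsetNumber(&dead_tuples[i]);
            ItemId      itemid = PageGetItemId(page, offnum);

            ItemIdSetUnused(itemid);    /* line pointer becomes reusable */
        }
        /* Today's code then also calls PageRepairFragmentation() and
         * WAL-logs the change; more on that below. */
    }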

Nonetheless, lazy_vacuum_heap() retains the ability to remove actual
HEAPTUPLE_DEAD tuples and reclaim their LP_NORMAL line pointers.  This support
gets exercised only in the scenario described in the above comment.  For hot
standby, this capability requires its own WAL record, XLOG_HEAP2_CLEANUP_INFO,
to generate the necessary conflicts[1].  There is a bug in lazy_scan_heap()'s
bookkeeping for the xid to place in that WAL record.  Each call to
heap_page_prune() simply overwrites vacrelstats->latestRemovedXid, but
lazy_scan_heap() expects that value to only ever increase.  I have attached a
minimal fix to be backpatched.  It has lazy_scan_heap() ignore
heap_page_prune()'s actions for the purpose of this conflict xid, because
heap_page_prune() emitted an XLOG_HEAP2_CLEAN record covering them.
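
The gist of that fix, as a sketch rather than the actual hunk (the local
variable name is only illustrative):

    /* In lazy_scan_heap(), pass heap_page_prune() a throwaway xid instead
     * of &vacrelstats->latestRemovedXid: */
    TransactionId   prune_xid_ignored = InvalidTransactionId;

    heap_page_prune(onerel, buf, OldestXmin, false, &prune_xid_ignored);

    /*
     * heap_page_prune() covered its removals with the conflict xid in its
     * own XLOG_HEAP2_CLEAN record, so discarding this value loses nothing.
     * vacrelstats->latestRemovedXid then advances only via
     * HeapTupleHeaderAdvanceLatestRemovedXid() for the HEAPTUPLE_DEAD
     * tuples lazy_scan_heap() sees itself -- exactly the removals
     * XLOG_HEAP2_CLEANUP_INFO must cover.
     */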

At that point in the investigation, I realized that the cost of being able to
remove entire tuples in lazy_vacuum_heap() greatly exceeds the benefit.
Again, the benefit is being able to remove tuples whose inserting transaction
aborted between the HeapTupleSatisfiesVacuum() call in heap_page_prune() and
the one in lazy_scan_heap().  To make that possible, lazy_vacuum_heap() grabs
a cleanup lock, calls PageRepairFragmentation(), and emits a WAL record for
every page containing LP_DEAD line pointers or HEAPTUPLE_DEAD tuples.  If we
take it out of the business of removing tuples, lazy_vacuum_heap() can skip
WAL and update lp_flags under a mere shared lock.  The second attached patch,
for HEAD, implements that.  Besides optimizing things somewhat, it simplifies
the code and removes rarely-tested branches.  (This patch supersedes the
backpatch-oriented patch rather than stacking with it.)
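
To make the contrast concrete, here is a rough before/after sketch of the
per-page work in lazy_vacuum_heap() ("buf" and "page" are the target buffer
and its page; buffer-dirtying and visibility-map details omitted; this is not
the patch text):

    /* Today: may also remove whole tuples, so heavy machinery is needed. */
    LockBufferForCleanup(buf);          /* buffer cleanup lock */
    /* ... mark LP_DEAD items unused, remove any HEAPTUPLE_DEAD tuples ... */
    PageRepairFragmentation(page);      /* compact the page */
    /* ... log_heap_clean() emits an XLOG_HEAP2_CLEAN record ... */
    UnlockReleaseBuffer(buf);

    /* With the patch: only LP_DEAD -> LP_UNUSED transitions remain. */
    LockBuffer(buf, BUFFER_LOCK_SHARE); /* shared content lock suffices */
    /* ... flip lp_flags as in the earlier sketch; no defragmentation ... */
    /* ... and no WAL record at all ... */
    UnlockReleaseBuffer(buf);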

The bookkeeping behind the "page containing dead tuples is marked as
all-visible in relation" warning is also faulty; it only fires when
lazy_scan_heap() itself saw the HEAPTUPLE_DEAD tuple; again, heap_page_prune() will
be the one to see it in almost every case.  I changed the warning to fire
whenever the page cannot be marked all-visible for a reason other than the
presence of too-recent live tuples.
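
In code terms, the revised bookkeeping amounts to roughly this (reusing the
existing all_visible/has_dead_tuples flags in lazy_scan_heap(); a simplified
sketch of the intent, not the patch itself):

    /* Per-tuple classification in lazy_scan_heap()'s scan loop, simplified: */
    switch (HeapTupleSatisfiesVacuum(tuple.t_data, OldestXmin, buf))
    {
        case HEAPTUPLE_LIVE:
            /* A live tuple blocks all-visibility only when its xmin is too
             * recent; that alone should not trigger the warning. */
            if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(tuple.t_data),
                                       OldestXmin))
                all_visible = false;
            break;

        default:
            /* DEAD, RECENTLY_DEAD or *_IN_PROGRESS: not all-visible, and
             * the cause is something other than a too-recent live tuple. */
            all_visible = false;
            has_dead_tuples = true;
            break;
    }

    /* ... later, if the page turns out to be marked all-visible already: */
    if (PageIsAllVisible(page) && has_dead_tuples)
        elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
             relname, blkno);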

I considered renaming lazy_vacuum_heap() to lazy_heap_clear_dead_items(),
reflecting its narrower role.  Ultimately, I left function names unchanged.

This patch conflicts textually with Pavan's "Setting visibility map in
VACUUM's second phase" patch, but I don't see any conceptual incompatibility.

I can't give a simple statement of the performance improvement here.  The
XLOG_HEAP2_CLEAN record is fairly compact, so the primary benefit of avoiding
it is the possibility of avoiding a full-page image.  For example, if a
checkpoint lands just before the VACUUM and again during the index-cleaning
phase (assume just one such phase in this example), this patch reduces
heap-related WAL volume by almost 50%.  Improvements anywhere from 2% to 97%
are possible given other timings of checkpoints relative to the VACUUM.  In
general, expect this to help VACUUMs spanning several checkpoint cycles more
than it helps shorter VACUUMs.  I have attached a script I used as a reference
workload for testing different checkpoint timings.  There should also be some
improvement from keeping off WALInsertLock, not requiring WAL flushes to evict
from the ring buffer during the lazy_vacuum_heap() phase, and not taking a
second buffer cleanup lock.  I did not attempt to quantify those.

Thanks,
nm

[1] Normally, heap_page_prune() removes the tuple first (leaving an LP_DEAD
line pointer), and vacuumlazy.c removes index entries afterward.  When the
removal happens in this order, the XLOG_HEAP2_CLEAN record takes care of
conflicts.  However, in the rarely-used code path, we remove the index entries
before removing the tuple.  XLOG_HEAP2_CLEANUP_INFO conflicts with standby
snapshots that might need the vanishing index entries.
