Thread: lazy_vacuum_heap()'s removal of HEAPTUPLE_DEAD tuples
Per this comment in lazy_scan_heap(), almost all tuple removal these days
happens in heap_page_prune():

			case HEAPTUPLE_DEAD:

				/*
				 * Ordinarily, DEAD tuples would have been removed by
				 * heap_page_prune(), but it's possible that the tuple
				 * state changed since heap_page_prune() looked.  In
				 * particular an INSERT_IN_PROGRESS tuple could have
				 * changed to DEAD if the inserter aborted.  So this
				 * cannot be considered an error condition.

vacuumlazy.c remains responsible for noticing the LP_DEAD line pointers left
by heap_page_prune(), removing corresponding index entries, and marking those
line pointers LP_UNUSED.  Nonetheless, lazy_vacuum_heap() retains the ability
to remove actual HEAPTUPLE_DEAD tuples and reclaim their LP_NORMAL line
pointers.  This support gets exercised only in the scenario described in the
above comment.  For hot standby, this capability requires its own WAL record,
XLOG_HEAP2_CLEANUP_INFO, to generate the necessary conflicts[1].

There is a bug in lazy_scan_heap()'s bookkeeping for the xid to place in that
WAL record.  Each call to heap_page_prune() simply overwrites
vacrelstats->latestRemovedXid, but lazy_scan_heap() expects it to only ever
increase the value.  I have attached a minimal fix to be backpatched.  It has
lazy_scan_heap() ignore heap_page_prune()'s actions for the purpose of this
conflict xid, because heap_page_prune() emitted an XLOG_HEAP2_CLEAN record
covering them.

At that point in the investigation, I realized that the cost of being able to
remove entire tuples in lazy_vacuum_heap() greatly exceeds the benefit.
Again, the benefit is being able to remove tuples whose inserting transaction
aborted between the HeapTupleSatisfiesVacuum() call in heap_page_prune() and
the one in lazy_scan_heap().  To make that possible, lazy_vacuum_heap() grabs
a cleanup lock, calls PageRepairFragmentation(), and emits a WAL record for
every page containing LP_DEAD line pointers or HEAPTUPLE_DEAD tuples.
If we take it out of the business of removing tuples, lazy_vacuum_heap() can
skip WAL and update lp_flags under a mere shared lock.  The second attached
patch, for HEAD, implements that.  Besides optimizing things somewhat, it
simplifies the code and removes rarely-tested branches.  (This patch
supersedes the backpatch-oriented patch rather than stacking with it.)

The bookkeeping behind the "page containing dead tuples is marked as
all-visible in relation" warning is also faulty; it only fires when
lazy_scan_heap() saw the HEAPTUPLE_DEAD tuple; again, heap_page_prune() will
be the one to see it in almost every case.  I changed the warning to fire
whenever the page cannot be marked all-visible for a reason other than the
presence of too-recent live tuples.

I considered renaming lazy_vacuum_heap() to lazy_heap_clear_dead_items(),
reflecting its narrower role.  Ultimately, I left function names unchanged.
This patch conflicts textually with Pavan's "Setting visibility map in
VACUUM's second phase" patch, but I don't see any conceptual incompatibility.

I can't give a simple statement of the performance improvement here.  The
XLOG_HEAP2_CLEAN record is fairly compact, so the primary benefit of avoiding
it is the possibility of avoiding a full-page image.  For example, if a
checkpoint lands just before the VACUUM and again during the index-cleaning
phase (assume just one such phase in this example), this patch reduces
heap-related WAL volume by almost 50%.  Improvements as extreme as 2% and 97%
are possible given other timings of checkpoints relative to the VACUUM.  In
general, expect this to help VACUUMs spanning several checkpoint cycles more
than it helps shorter VACUUMs.  I have attached a script I used as a
reference workload for testing different checkpoint timings.

There should also be some improvement from keeping off WALInsertLock, not
requiring WAL flushes to evict from the ring buffer during the
lazy_vacuum_heap() phase, and not taking a second buffer cleanup lock.  I did
not attempt to quantify those.

Thanks,
nm

[1] Normally, heap_page_prune() removes the tuple first (leaving an LP_DEAD
line pointer), and vacuumlazy.c removes index entries afterward.  When the
removal happens in this order, the XLOG_HEAP2_CLEAN record takes care of
conflicts.  However, in the rarely-used code path, we remove the index
entries before removing the tuple.  XLOG_HEAP2_CLEANUP_INFO conflicts with
standby snapshots that might need the vanishing index entries.
Attachment
On Tue, Jan 8, 2013 at 8:19 AM, Noah Misch <noah@leadboat.com> wrote:
> At that point in the investigation, I realized that the cost of being able
> to remove entire tuples in lazy_vacuum_heap() greatly exceeds the benefit.
> Again, the benefit is being able to remove tuples whose inserting
> transaction aborted between the HeapTupleSatisfiesVacuum() call in
> heap_page_prune() and the one in lazy_scan_heap().  To make that possible,
> lazy_vacuum_heap() grabs a cleanup lock, calls PageRepairFragmentation(),
> and emits a WAL record for every page containing LP_DEAD line pointers or
> HEAPTUPLE_DEAD tuples.  If we take it out of the business of removing
> tuples, lazy_vacuum_heap() can skip WAL and update lp_flags under a mere
> shared lock.  The second attached patch, for HEAD, implements that.
> Besides optimizing things somewhat, it simplifies the code and removes
> rarely-tested branches.  (This patch supersedes the backpatch-oriented
> patch rather than stacking with it.)
+1. I'd also advocated removing the line pointer array scan in
lazy_scan_heap() because heap_page_prune() does most of the work.  The reason
why we still have that scan in lazy_scan_heap() is to check for new dead
tuples (which should be rare), check for all-visibility of the page, and
freeze tuples if possible.  I think we can move all of that to
heap_page_prune().
But while you are at it, I wonder if you have time and energy to look at the
single pass vacuum work that I, Robert and others tried earlier but failed to
get to the final stage [1][2].  If we do something like that, we might not
need the second scan of the heap at all, which, as you rightly pointed out,
does not do much today anyway, other than marking LP_DEAD line pointers to
LP_UNUSED.  We can push that work to the next vacuum or even a foreground
task.
BTW, with respect to your idea of not WAL logging the operation of setting
LP_DEAD to LP_UNUSED in lazy_vacuum_page(), I wonder if this would create
problems with crash recovery.  During normal vacuum operation, you may have
converted LP_DEAD to LP_UNUSED.  Then you actually used one of those line
pointers for a subsequent PageAddItem() on the page.  If you crash after that
but before the page gets written to disk, then during crash recovery, AFAICS
PageAddItem() will complain with a "will not overwrite a used ItemId" warning
and a subsequent PANIC of the recovery.
Thanks,
Pavan
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Hi Pavan,

Thanks for reviewing.

On Tue, Jan 08, 2013 at 02:41:54PM +0530, Pavan Deolasee wrote:
> On Tue, Jan 8, 2013 at 8:19 AM, Noah Misch <noah@leadboat.com> wrote:
> > At that point in the investigation, I realized that the cost of being
> > able to remove entire tuples in lazy_vacuum_heap() greatly exceeds the
> > benefit.  Again, the benefit is being able to remove tuples whose
> > inserting transaction aborted between the HeapTupleSatisfiesVacuum()
> > call in heap_page_prune() and the one in lazy_scan_heap().  To make that
> > possible, lazy_vacuum_heap() grabs a cleanup lock, calls
> > PageRepairFragmentation(), and emits a WAL record for every page
> > containing LP_DEAD line pointers or HEAPTUPLE_DEAD tuples.  If we take
> > it out of the business of removing tuples, lazy_vacuum_heap() can skip
> > WAL and update lp_flags under a mere shared lock.  The second attached
> > patch, for HEAD, implements that.  Besides optimizing things somewhat,
> > it simplifies the code and removes rarely-tested branches.  (This patch
> > supersedes the backpatch-oriented patch rather than stacking with it.)
>
> +1. I'd also advocated removing the line pointer array scan in
> lazy_scan_heap() because heap_page_prune() does most of the work.  The
> reason why we still have that scan in lazy_scan_heap() is to check for new
> dead tuples (which should be rare), check for all-visibility of the page,
> and freeze tuples if possible.  I think we can move all of that to
> heap_page_prune().

Yes; that was an interesting thread.

> But while you are at it, I wonder if you have time and energy to look at
> the single pass vacuum work that I, Robert and others tried earlier but
> failed to get to the final stage [1][2].  If we do something like that, we
> might not need the second scan of the heap at all, which, as you rightly
> pointed out, does not do much today anyway, other than marking LP_DEAD
> line pointers to LP_UNUSED.  We can push that work to the next vacuum or
> even a foreground task.

I don't wish to morph this patch into a removal of lazy_vacuum_heap(), but
that will remain promising for the future.

> BTW, with respect to your idea of not WAL logging the operation of setting
> LP_DEAD to LP_UNUSED in lazy_vacuum_page(), I wonder if this would create
> problems with crash recovery.  During normal vacuum operation, you may
> have converted LP_DEAD to LP_UNUSED.  Then you actually used one of those
> line pointers for a subsequent PageAddItem() on the page.  If you crash
> after that but before the page gets written to disk, then during crash
> recovery, AFAICS PageAddItem() will complain with a "will not overwrite a
> used ItemId" warning and a subsequent PANIC of the recovery.

Good catch; I can reproduce that problem.  I've now taught PageAddItem() to
tolerate replacing an LP_DEAD item until recovery reaches consistency.  Also,
since lazy_vacuum_page() no longer calls PageRepairFragmentation(), it should
call PageSetHasFreeLinePointers() itself.

The attached update fixes both problems.  (I have also attached the unchanged
backpatch-oriented fix to keep things together.)

nm
Attachment
On 8 January 2013 02:49, Noah Misch <noah@leadboat.com> wrote:
> There is a bug in lazy_scan_heap()'s bookkeeping for the xid to place in
> that WAL record.  Each call to heap_page_prune() simply overwrites
> vacrelstats->latestRemovedXid, but lazy_scan_heap() expects it to only
> ever increase the value.  I have attached a minimal fix to be backpatched.
> It has lazy_scan_heap() ignore heap_page_prune()'s actions for the purpose
> of this conflict xid, because heap_page_prune() emitted an
> XLOG_HEAP2_CLEAN record covering them.

Interesting.  Yes, a bug, and one of mine also.

ISTM the right fix is to correctly initialize on pruneheap.c line 176:

    prstate.latestRemovedXid = *latestRemovedXid;

Better to make it work than to just leave stuff hanging.

I very much like your patch to remove all that cruft altogether; good
riddance.  I think you're missing removing a few calls to
HeapTupleHeaderAdvanceLatestRemovedXid(), perhaps even that routine as well.

Not sure about the skipping WAL records and share locking part, that's too
radical for me.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 10, 2013 at 02:45:36AM +0000, Simon Riggs wrote:
> On 8 January 2013 02:49, Noah Misch <noah@leadboat.com> wrote:
> > There is a bug in lazy_scan_heap()'s bookkeeping for the xid to place in
> > that WAL record.  Each call to heap_page_prune() simply overwrites
> > vacrelstats->latestRemovedXid, but lazy_scan_heap() expects it to only
> > ever increase the value.  I have attached a minimal fix to be
> > backpatched.  It has lazy_scan_heap() ignore heap_page_prune()'s actions
> > for the purpose of this conflict xid, because heap_page_prune() emitted
> > an XLOG_HEAP2_CLEAN record covering them.
>
> Interesting.  Yes, a bug, and one of mine also.
>
> ISTM the right fix is to correctly initialize on pruneheap.c line 176:
>     prstate.latestRemovedXid = *latestRemovedXid;
> better to make it work than to just leave stuff hanging.

That works, too.

> I very much like your patch to remove all that cruft altogether; good
> riddance.  I think you're missing removing a few calls to
> HeapTupleHeaderAdvanceLatestRemovedXid(), perhaps even that routine as
> well.

Thanks.  Did you have a particular HeapTupleHeaderAdvanceLatestRemovedXid()
call site in mind?  The three calls remaining in the tree look right to me.

> Not sure about the skipping WAL records and share locking part, that's
> too radical for me.

Certainly a fair point of discussion.  In particular, using a plain
exclusive lock wouldn't be much worse.
On Wednesday, January 9, 2013, Noah Misch wrote:
On Thu, Jan 10, 2013 at 02:45:36AM +0000, Simon Riggs wrote:
> On 8 January 2013 02:49, Noah Misch <noah@leadboat.com> wrote:
> > There is a bug in lazy_scan_heap()'s
> > bookkeeping for the xid to place in that WAL record. Each call to
> > heap_page_prune() simply overwrites vacrelstats->latestRemovedXid, but
> > lazy_scan_heap() expects it to only ever increase the value. I have
> > attached a minimal fix to be backpatched. It has lazy_scan_heap() ignore
> > heap_page_prune()'s actions for the purpose of this conflict xid, because
> > heap_page_prune() emitted an XLOG_HEAP2_CLEAN record covering them.
>
> Interesting. Yes, a bug, and one of mine also.
>
> ISTM the right fix is to correctly initialize on pruneheap.c line 176
> prstate.latestRemovedXid = *latestRemovedXid;
> better to make it work than to just leave stuff hanging.
That works, too.
As bug fixes don't usually go through the commit-fest process, will someone be committing one of these two ideas for the back-branches? And to HEAD, in case the more invasive patch doesn't make it in?
I have a preliminary nit-pick on the big patch. It generates a compiler warning:
vacuumlazy.c: In function ‘lazy_scan_heap’:
vacuumlazy.c:445:9: warning: variable ‘prev_dead_count’ set but not used [-Wunused-but-set-variable]
Thanks,
Jeff
Noah,

* Noah Misch (noah@leadboat.com) wrote:
> The attached update fixes both problems.  (I have also attached the
> unchanged backpatch-oriented fix to keep things together.)

I've just started looking at/playing with this patch and was wondering if
you'd missed Jeff's comments on it..?  I note that prev_dead_count is still
in this patch- any particular reason..?

Also, perhaps we should consider Simon's one-liner fix for backpatching this
instead of the original patch you posted?

To be honest, I'm also a bit concerned about applying the patch you provided
for HEAD this late in the 9.3 cycle, especially if the plan is to make
further changes in this area to simplify things moving forward.  Perhaps we
could do all of those changes early in the 9.4 cycle?

Thanks,

Stephen
On Sat, Jan 19, 2013 at 02:58:32PM -0800, Jeff Janes wrote:
> I have a preliminary nit-pick on the big patch.  It generates a compiler
> warning:
>
> vacuumlazy.c: In function 'lazy_scan_heap':
> vacuumlazy.c:445:9: warning: variable 'prev_dead_count' set but not used
> [-Wunused-but-set-variable]

Thanks.  That variable can just go away.  Since a full review may turn up
more things, I'll hold off on sending a version fixing that alone.
On Tue, Jan 22, 2013 at 09:45:37PM -0500, Stephen Frost wrote:
> * Noah Misch (noah@leadboat.com) wrote:
> > The attached update fixes both problems.  (I have also attached the
> > unchanged backpatch-oriented fix to keep things together.)
>
> I've just started looking at/playing with this patch and was wondering
> if you'd missed Jeff's comments on it..?  I note that prev_dead_count is
> still in this patch- any particular reason..?

Thanks for the reminder; I've now replied to Jeff.

> Also, perhaps we should consider Simon's one-liner fix for backpatching
> this instead of the original patch you posted?

I have no nontrivial preference between the two approaches.

> To be honest, I'm also a bit concerned about applying the patch you
> provided for HEAD this late in the 9.3 cycle, especially if the plan is
> to make further changes in this area to simplify things moving forward.
> Perhaps we could do all of those changes early in the 9.4 cycle?

Concerning "further changes," I suppose you refer to Pavan's two designs
noted upthread?  I don't recommend going out of our way to consider all
these changes together; there are disadvantages, too, of making several
VACUUM performance changes in short succession.  I'm also not aware of
concrete plans to implement those designs.

You're the second commentator to be skittish about the patch's correctness,
so I won't argue against a conservatism-motivated bounce of the patch.

Thanks,
nm
On Wed, Jan 23, 2013 at 10:05 AM, Noah Misch <noah@leadboat.com> wrote:
> You're the second commentator to be skittish about the patch's
> correctness, so I won't argue against a conservatism-motivated bounce of
> the patch.

Can you please rebase the patch against the latest head?  I see Alvaro's
and Simon's recent changes have bit-rotted the patch.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 23 January 2013 04:35, Noah Misch <noah@leadboat.com> wrote:
>> Also, perhaps we should consider Simon's one-liner fix for backpatching
>> this instead of the original patch you posted?
>
> I have no nontrivial preference between the two approaches.

Sorry, I didn't see this.  I guess you saw I applied my one liner and
backpatched it.

I'm expecting to apply Noah's larger patch to HEAD with the suggested edits.
I'll do that last thing in the CF.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 28, 2013 at 02:12:33PM +0000, Simon Riggs wrote:
> On 23 January 2013 04:35, Noah Misch <noah@leadboat.com> wrote:
> >> Also, perhaps we should consider Simon's one-liner fix for
> >> backpatching this instead of the original patch you posted?
> >
> > I have no nontrivial preference between the two approaches.
>
> Sorry, I didn't see this.  I guess you saw I applied my one liner and
> backpatched it.

Yes; thanks.

> I'm expecting to apply Noah's larger patch to HEAD with the suggested
> edits.  I'll do that last thing in the CF.

What "suggested edits" do you have in mind?
On Mon, Jan 28, 2013 at 07:24:04PM +0530, Pavan Deolasee wrote:
> On Wed, Jan 23, 2013 at 10:05 AM, Noah Misch <noah@leadboat.com> wrote:
>
> > You're the second commentator to be skittish about the patch's
> > correctness, so I won't argue against a conservatism-motivated bounce
> > of the patch.
>
> Can you please rebase the patch against the latest head?  I see Alvaro's
> and Simon's recent changes have bit-rotted the patch.

Attached.
Attachment
On Wed, Jan 30, 2013 at 7:34 AM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jan 28, 2013 at 07:24:04PM +0530, Pavan Deolasee wrote:
>> Can you please rebase the patch against the latest head?  I see
>> Alvaro's and Simon's recent changes have bit-rotted the patch.
>
> Attached.

Thanks.  Here are a few comments.

1. I saw you took care of the bug that I reported in the first email by
allowing overwriting an LP_DEAD itemid during recovery.  While this should
be OK given that we had never seen reports of recovery trying to
accidentally overwrite an LP_DEAD itemid, are you sure after this change we
can't get into this situation post reachedConsistency?  If we ever do, I
think the recovery would just fail.

2. In lazy_scan_heap(), after detecting a DEAD tuple you now try to confirm
that the tuple must not require a freeze.  Is that really safe?  I think
this assumes that the HOT prune would have already truncated *every* DEAD
tuple to LP_DEAD, except an INSERT_IN_PROGRESS one which may have turned
into DEAD between two calls to HeapTupleSatisfiesVacuum().  While this
might be true, we never tried hard to prove that before because it wasn't
necessary.  I remember Heikki raising this concern when I proposed setting
the VM bit after HOT prune.  So a proof of that would probably help.

3. Converting LP_DEAD to LP_UNUSED in lazy_vacuum_page() while holding just
a SHARE lock seems to be OK, but is a bit adventurous.  I would rather just
get an EX lock and do it.  Also, it's probably more appropriate to just
mark the buffer dirty instead of SetBufferCommitInfoNeedsSave().  It may
cause line pointer bloat and vacuum may not come to process this page
again.  This will also be kind of unnecessary after the patch to set the VM
bit in the second phase gets in.

4. Are changes in the variable names and logic around them in
lazy_scan_heap() really required?  Just makes me a bit uncomfortable.  See
if we can minimize those changes or do it in a separate patch, if possible.

I haven't run tests with the patch yet.  Will see if I can try a few.  I
share others' views on making these changes late in the cycle, but if we
can reduce the footprint of the patch, we should be OK.

I see the following (and similar) messages while applying the patch, but
maybe they are harmless.

(Stripping trailing CRs from patch.)

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Wed, Jan 30, 2013 at 11:37:41PM +0530, Pavan Deolasee wrote:
> Thanks.  Here are a few comments.
>
> 1. I saw you took care of the bug that I reported in the first email by
> allowing overwriting an LP_DEAD itemid during recovery.  While this
> should be OK given that we had never seen reports of recovery trying to
> accidentally overwrite an LP_DEAD itemid, are you sure after this change
> we can't get into this situation post reachedConsistency?  If we ever do,
> I think the recovery would just fail.

Having pondered this again, I conclude that checking reachedConsistency is
wrong.  As a simple example, during ongoing streaming replication, the
startup process will regularly be asked to overwrite LP_DEAD items.  Those
items were LP_UNUSED on the master, but only a full-page image would
transfer that fact to the standby before something overwrites the item.

That shows another flaw.  VACUUM marking a page all-visible is WAL-logged,
so the standby will receive that change.  But the standby may still have
LP_DEAD pointers on the affected page, where the master has LP_UNUSED.  I'm
not aware of any severe consequences; no scan type would be fooled.  Still,
this violates a current assumption of the system.

> 2. In lazy_scan_heap(), after detecting a DEAD tuple you now try to
> confirm that the tuple must not require a freeze.  Is that really safe?
> I think this assumes that the HOT prune would have already truncated
> *every* DEAD tuple to LP_DEAD, except an INSERT_IN_PROGRESS one which may
> have turned into DEAD between two calls to HeapTupleSatisfiesVacuum().
> While this might be true, we never tried hard to prove that before
> because it wasn't necessary.  I remember Heikki raising this concern when
> I proposed setting the VM bit after HOT prune.  So a proof of that would
> probably help.

Yes, the patch assumes all that.  Since lazy_scan_heap() already refuses to
remove a heap-only or HOT-updated HEAPTUPLE_DEAD tuple, heap_page_prune()
had better not miss truncating those.  Otherwise, we would pass such a
tuple to heap_freeze_tuple(), which assumes the tuple is not
HEAPTUPLE_DEAD.  The case where heap_page_prune() could be leaving a
HEAPTUPLE_DEAD tuple today without such problems is a simple tuple not
participating in HOT.  I think it's clear from inspection that
heap_page_prune() will visit all such tuples, and heap_prune_chain() will
truncate them when visited.

Since a tuple can move from HEAPTUPLE_INSERT_IN_PROGRESS to HEAPTUPLE_DEAD
at any instant here, heap_freeze_tuple() must not misbehave on such tuples.
Indeed, it does not; since the xmin was running at some point after we
chose a freeze horizon, the xmin will be above that horizon.  Therefore,
the try_freeze dance I added ought to be unnecessary.

> 3. Converting LP_DEAD to LP_UNUSED in lazy_vacuum_page() while holding
> just a SHARE lock seems to be OK, but is a bit adventurous.  I would
> rather just get an EX lock and do it.

Fair enough.

> Also, it's probably more appropriate to just mark the buffer dirty
> instead of SetBufferCommitInfoNeedsSave().

There exist precedents for both styles.

> It may cause line pointer bloat and vacuum may not come to process this
> page again.

How would that happen?

> This will also be kind of unnecessary after the patch to set the VM bit
> in the second phase gets in.

To the extent that pages tend to become all-visible after removing dead
material, yes.  When a long-running transaction is holding back the
all-visible horizon, it will still matter.

> 4. Are changes in the variable names and logic around them in
> lazy_scan_heap() really required?  Just makes me a bit uncomfortable.
> See if we can minimize those changes or do it in a separate patch, if
> possible.

I assume you refer to the replacement of "all_visible" and
"has_dead_tuples" with "has_nonlive" and "has_too_recent".  As ever, it
would be hard to honestly say that those changes were *required*.  They
serve the side goal of fixing the bookkeeping behind one of the warnings,
which, independent of this patch, isn't firing as often as it should.  I
could split that out as an independent patch.

> I haven't run tests with the patch yet.  Will see if I can try a few.  I
> share others' views on making these changes late in the cycle, but if we
> can reduce the footprint of the patch, we should be OK.

Given those reactions and my shortness of time to solidly fix everything
noted above, let's table this patch.  I'm marking it Returned with
Feedback.

> I see the following (and similar) messages while applying the patch, but
> maybe they are harmless.
>
> (Stripping trailing CRs from patch.)

I don't see anything like that.

Thanks,
nm