Thread: Emit fewer vacuum records by reaping removable tuples during pruning
Hi,

When there are no indexes on the relation, we can set would-be dead items LP_UNUSED and remove them during pruning. This saves us a vacuum WAL record, reducing WAL volume (and time spent writing and syncing WAL). See this example:

    drop table if exists foo;
    create table foo(a int) with (autovacuum_enabled=false);
    insert into foo select i from generate_series(1,10000000)i;
    update foo set a = 10;
    \timing on
    vacuum foo;

On my machine, the attached patch set provides a 10% speedup for vacuum for this example -- and a 40% decrease in WAL bytes emitted. Admittedly, this case is probably unusual in the real world; on-access pruning would preclude it. Throw in a SELECT * FROM foo before the vacuum and the patch has no performance benefit. However, it has no downside as far as I can tell. And, IMHO, it is a code clarity improvement: this change means that lazy_vacuum_heap_page() is only called when we are actually doing a second pass and reaping dead items. I found it quite confusing that lazy_vacuum_heap_page() was called by lazy_scan_heap() to set dead items unused in a block that we had just pruned.

I think it also makes it clear that we should update the VM in lazy_scan_prune(). All callers of lazy_scan_prune() will now consider updating the VM after returning, and most of the state communicated back to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or not to update the VM. I didn't do that in this patch set because I would need to pass all_visible_according_to_vm to lazy_scan_prune(), and that change didn't seem worth the improvement in code clarity in lazy_scan_heap(). I am planning to add a VM update into the freeze record, at which point I will move the VM update code into lazy_scan_prune(). This will then allow us to consolidate the freespace map update code for the prune and noprune cases and make lazy_scan_heap() short and sweet.

Note that (on principle) this patch set is on top of the bug fix I proposed in [1].

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com
Attachment
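For readers who have not opened the attached patches, the core of the change looks roughly like the sketch below, based on the hunks quoted later in this thread: where pruning a HOT chain would normally leave an LP_DEAD stub behind, the no-indexes case frees the line pointer immediately. The prstate->pronto_reap flag and the record helpers are from the posted patch; the surrounding code is elided and the exact placement is approximate.

    /*
     * Sketch of the behavioral change in heap_prune_chain() (pruneheap.c),
     * per the v2 patch quoted downthread.  Not the literal patch text.
     */
    if (prstate->pronto_reap)
    {
        /*
         * No indexes on the relation: no index entries can reference this
         * dead tuple's TID, so skip the intermediate LP_DEAD state and
         * free the line pointer immediately.
         */
        heap_prune_record_unused(prstate, rootoffnum);
    }
    else
    {
        /* Normal case: leave an LP_DEAD stub for the second vacuum pass. */
        heap_prune_record_dead(prstate, rootoffnum);
    }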
On Mon, Nov 13, 2023 at 2:29 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> I think it also makes it clear that we should update the VM in
> lazy_scan_prune(). All callers of lazy_scan_prune() will now consider
> updating the VM after returning. And most of the state communicated back
> to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or
> not to update the VM.

That makes sense.

> I didn't do that in this patch set because I would
> need to pass all_visible_according_to_vm to lazy_scan_prune() and that
> change didn't seem worth the improvement in code clarity in
> lazy_scan_heap().

Have you thought about finding a way to get rid of all_visible_according_to_vm? (Not necessarily in the scope of the ongoing work, just in general.)

all_visible_according_to_vm is inherently prone to races -- it tells us what the VM used to say about the page, back when we looked. It is very likely that the page isn't visible in this sense, anyway, because VACUUM is after all choosing to scan the page in the first place when we end up in lazy_scan_prune. (Granted, this is much less true than it should be due to the influence of SKIP_PAGES_THRESHOLD, which implements a weird and inefficient form of prefetching/readahead.)

Why should we necessarily care what all_visible_according_to_vm says or would say at this point? We're looking at the heap page itself, which is more authoritative than the VM (theoretically they're equally authoritative, but not really, not when push comes to shove). The best answer that I can think of is that all_visible_according_to_vm gives us a way of noticing and then logging any inconsistencies between VM and heap that we might end up "repairing" in passing (once we've rechecked the VM). But maybe that could happen elsewhere.

Perhaps that cross-check could be pushed into visibilitymap.c, when we go to set a !PageIsAllVisible(page) page all-visible in the VM. If we're setting it and find that it's already set from earlier on, then remember and complain/LOG it. No all_visible_according_to_vm required, plus this seems like it might be more thorough.

Just a thought.

--
Peter Geoghegan
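To make the suggested cross-check concrete, it might look something like the following inside visibilitymap_set(); this is only a sketch of the idea, with variable names (map, mapByte, mapOffset, heapBuf) guessed from the existing code in that file, and whether to LOG, WARNING, or merely count the inconsistency is left open.

    /*
     * Sketch of the suggested cross-check in visibilitymap_set(): the heap
     * page is not marked PD_ALL_VISIBLE, yet the VM already claims the page
     * is all-visible, so report the disagreement before "repairing" it.
     */
    if (BufferIsValid(heapBuf) &&
        !PageIsAllVisible(BufferGetPage(heapBuf)) &&
        ((map[mapByte] >> mapOffset) & VISIBILITYMAP_ALL_VISIBLE) != 0)
        elog(WARNING,
             "all-visible bit already set for page %u of \"%s\", but heap page is not marked all-visible",
             heapBlk, RelationGetRelationName(rel));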
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Mon, Nov 13, 2023 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Nov 13, 2023 at 2:29 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I think it also makes it clear that we should update the VM in > > lazy_scan_prune(). All callers of lazy_scan_prune() will now consider > > updating the VM after returning. And most of the state communicated back > > to lazy_scan_heap() from lazy_scan_prune() is to inform it whether or > > not to update the VM. > > That makes sense. > > > I didn't do that in this patch set because I would > > need to pass all_visible_according_to_vm to lazy_scan_prune() and that > > change didn't seem worth the improvement in code clarity in > > lazy_scan_heap(). > > Have you thought about finding a way to get rid of > all_visible_according_to_vm? (Not necessarily in the scope of the > ongoing work, just in general.) > > all_visible_according_to_vm is inherently prone to races -- it tells > us what the VM used to say about the page, back when we looked. It is > very likely that the page isn't visible in this sense, anyway, because > VACUUM is after all choosing to scan the page in the first place when > we end up in lazy_scan_prune. (Granted, this is much less true than it > should be due to the influence of SKIP_PAGES_THRESHOLD, which > implements a weird and inefficient form of prefetching/readahead.) > > Why should we necessarily care what all_visible_according_to_vm says > or would say at this point? We're looking at the heap page itself, > which is more authoritative than the VM (theoretically they're equally > authoritative, but not really, not when push comes to shove). The best > answer that I can think of is that all_visible_according_to_vm gives > us a way of noticing and then logging any inconsistencies between VM > and heap that we might end up "repairing" in passing (once we've > rechecked the VM). But maybe that could happen elsewhere. > > Perhaps that cross-check could be pushed into visibilitymap.c, when we > go to set a !PageIsAllVisible(page) page all-visible in the VM. If > we're setting it and find that it's already set from earlier on, then > remember and complaing/LOG it. No all_visible_according_to_vm > required, plus this seems like it might be more thorough. Setting aside the data corruption cases (the repairing you mentioned), I think the primary case that all_visible_according_to_vm seeks to protect us from is if the page was already marked all visible in the VM and pruning did not change that. So, it avoids a call to visibilitymap_set() when the page was known to already be set all visible (as long as we didn't newly set all frozen). As you say, this does not seem like the common case. Pages we vacuum will most often not be all visible. I actually wonder if it is worse to rely on that old value of all visible and then call visibilitymap_set(), which takes an exclusive lock on the VM page, than it would be to just call visibilitymap_get_status() anew. This is what we do with all frozen. The problem is that it is kind of hard to tell because the whole thing is a bit of a tangled knot. I've ripped this code apart and put it back together again six ways from Sunday trying to get rid of all_visible_according_to_vm and next_unskippable_allvis/skipping_current_range's bootleg prefetching without incurring any additional calls to visibilitymap_get_status(). As best I can tell, our best case scenario is Thomas' streaming read API goes in, we add vacuum as a user, and we can likely remove the skip range logic. 
Then, the question remains about how useful it is to save the visibility map statuses we got when deciding whether or not to skip a block. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
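For reference, the alternative being weighed here -- rechecking the map rather than trusting the value read before pruning -- would amount to something like the following at the call site in lazy_scan_heap(). This is a sketch only, with the all-frozen handling omitted; the variable and member names are taken from the existing code as I recall it.

    /*
     * Sketch: instead of consulting all_visible_according_to_vm (read
     * before pruning), re-read the page's VM bits once pruning has
     * concluded the page is all-visible, and only then decide whether
     * the VM needs updating.
     */
    if (prunestate.all_visible)
    {
        uint8   old_vmbits = visibilitymap_get_status(vacrel->rel, blkno,
                                                      &vmbuffer);

        /* only take the exclusive VM page lock if the bit isn't set yet */
        if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
            visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, prunestate.visibility_cutoff_xid,
                              VISIBILITYMAP_ALL_VISIBLE);
    }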
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Mon, Nov 13, 2023 at 5:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> When there are no indexes on the relation, we can set would-be dead
> items LP_UNUSED and remove them during pruning. This saves us a vacuum
> WAL record, reducing WAL volume (and time spent writing and syncing
> WAL).
...
> Note that (on principle) this patch set is on top of the bug fix I
> proposed in [1].
>
> [1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com

Rebased on top of the fix in b2e237afddc56a and registered for the January commitfest: https://commitfest.postgresql.org/46/4665/

- Melanie
Attachment
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Nov 17, 2023 at 6:12 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Nov 13, 2023 at 5:28 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > When there are no indexes on the relation, we can set would-be dead > > items LP_UNUSED and remove them during pruning. This saves us a vacuum > > WAL record, reducing WAL volume (and time spent writing and syncing > > WAL). > ... > > Note that (on principle) this patch set is on top of the bug fix I > > proposed in [1]. > > > > [1] https://www.postgresql.org/message-id/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com > > Rebased on top of fix in b2e237afddc56a and registered for the january fest > https://commitfest.postgresql.org/46/4665/ I got an off-list question about whether or not this codepath is exercised in existing regression tests. It is -- vacuum.sql tests include those which vacuum a table with no indexes and tuples that can be deleted. I also looked through [1] to see if there were any user-facing docs which happened to mention the exact implementation details of how and when tuples are deleted by vacuum. I didn't see anything like that, so I don't think there are user-facing docs which need updating. - Melanie [1] https://www.postgresql.org/docs/devel/routine-vacuuming.html
On Mon, Nov 13, 2023 at 07:06:15PM -0500, Melanie Plageman wrote:
> As best I can tell, our best case scenario is Thomas' streaming read API
> goes in, we add vacuum as a user, and we can likely remove the skip
> range logic.

This does not prevent the work you've been doing in 0001 and 0002 posted upthread, right? Some progress is always better than no progress, and I can see the appeal behind doing 0001 to keep the updates of the block numbers closer to where we determine whether relation truncation is safe or not, rather than using an intermediate state in LVPagePruneState.

0002 is much, much, much trickier.
--
Michael
Attachment
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Sat, Dec 23, 2023 at 9:14 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Nov 13, 2023 at 07:06:15PM -0500, Melanie Plageman wrote: > > As best I can tell, our best case scenario is Thomas' streaming read API > > goes in, we add vacuum as a user, and we can likely remove the skip > > range logic. > > This does not prevent the work you've been doing in 0001 and 0002 > posted upthread, right? Some progress is always better than no > progress Correct. Peter and I were mainly discussing next refactoring steps as we move toward combining the prune, freeze, and VM records. This thread's patches stand alone. > I can see the appeal behind doing 0001 actually to keep > the updates of the block numbers closer to where we determine if > relation truncation is safe of not rather than use an intermediate > state in LVPagePruneState. Exactly. > 0002 is much, much, much trickier.. Do you have specific concerns about its correctness? I understand it is an area where we have to be sure we are correct. But, to be fair, that is true of all the pruning and vacuuming code. - Melanie
On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Do you have specific concerns about its correctness? I understand it > is an area where we have to be sure we are correct. But, to be fair, > that is true of all the pruning and vacuuming code. I'm kind of concerned that 0002 might be a performance regression. It pushes more branches down into the heap-pruning code, which I think can sometimes be quite hot, for the sake of a case that rarely occurs in practice. I take your point about it improving things when there are no indexes, but what about when there are? And even if there are no adverse performance consequences, is it really worth complicating the logic at such a low level? Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an informal word that seems to have no advantage over something like "immediate" or "now," and I don't think "reap" has a precise, universally-understood meaning. You could call this "mark_unused_now" or "immediately_mark_unused" or something and it would be far more self-documenting, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2024-01-04 12:31:36 -0500, Robert Haas wrote:
> On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Do you have specific concerns about its correctness? I understand it
> > is an area where we have to be sure we are correct. But, to be fair,
> > that is true of all the pruning and vacuuming code.
>
> I'm kind of concerned that 0002 might be a performance regression. It
> pushes more branches down into the heap-pruning code, which I think
> can sometimes be quite hot, for the sake of a case that rarely occurs
> in practice.

I was wondering the same when talking about this with Melanie. But I guess there are some scenarios that aren't unrealistic, consider e.g. bulk data loading with COPY followed by an UPDATE to massage the data afterwards, before creating the indexes.

Where I could see this becoming more interesting / powerful is if we were able to do 'pronto reaping' not just from vacuum but also during on-access pruning. For ETL workloads VACUUM might often not run between modifications of the same page, but on-access pruning will. Being able to reclaim dead items at that point seems like it could be a pretty sizable improvement?

> I take your point about it improving things when there are no indexes, but
> what about when there are?

I suspect that branch prediction should take care of the additional branches. heap_prune_chain() indeed can be very hot, but I think we're primarily bottlenecked on data cache misses. For a single VACUUM, whether we'd do pronto reaping would be constant -> very well predictable. We could add an unlikely() to make sure the branch-predictor-is-empty case optimizes for the much more common case of having indexes. Falsely assuming we'd not pronto reap wouldn't be particularly bad, as the wins for that case are so much bigger.

If we were to use pronto reaping for on-access pruning, it's perhaps a bit less predictable, as pruning for pages of a relation with indexes could be interleaved with pruning for relations without. But even there I suspect it'd not be the primary bottleneck: we call heap_prune_chain() in a loop for every tuple on a page, so the branch predictor should quickly learn whether we're using pronto reaping, whereas we're not becoming less cache-miss heavy when looking at subsequent tuples.

> And even if there are no adverse performance consequences, is it really
> worth complicating the logic at such a low level?

Yes, I think this is the main question here. It's not clear, though, that the state after the patch is meaningfully more complicated: it removes nontrivial code from lazy_scan_heap() and pays for that with a bit more complexity in heap_prune_chain().

> Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
> informal word that seems to have no advantage over something like
> "immediate" or "now,"

I didn't like that either :)

> and I don't think "reap" has a precise, universally-understood meaning.

Less concerned about that.

> You could call this "mark_unused_now" or "immediately_mark_unused" or
> something and it would be far more self-documenting, IMHO.

How about 'no_indexes' or such?

Greetings,

Andres Freund
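Concretely, the hint being suggested is just a one-word change along these lines (illustrative only; whether it measurably helps is exactly what would need benchmarking):

    /*
     * Sketch of the suggested branch hint in heap_prune_chain(): the
     * no-indexes case is the rare one, so lay out the code for the much
     * more common indexed case.
     */
    if (unlikely(prstate->pronto_reap))
        heap_prune_record_unused(prstate, rootoffnum);
    else
        heap_prune_record_dead(prstate, rootoffnum);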
Hi, On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index 14de8158d49..b578c32eeb6 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record) > nunused = (end - nowunused); > Assert(nunused >= 0); > > - /* Update all line pointers per the record, and repair fragmentation */ > - heap_page_prune_execute(buffer, > + /* > + * Update all line pointers per the record, and repair fragmentation. > + * We always pass pronto_reap as true, because we don't know whether > + * or not this option was used when pruning. This reduces the > + * validation done on replay in an assert build. > + */ Hm, that seems not great. Both because we loose validation and because it seems to invite problems down the line, due to pronto_reap falsely being set to true in heap_page_prune_execute(). > @@ -581,7 +589,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * function.) > */ > if (ItemIdIsDead(lp)) > + { > + /* > + * If the relation has no indexes, we can set dead line pointers > + * LP_UNUSED now. We don't increment ndeleted here since the LP > + * was already marked dead. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, offnum); > + > break; > + } I wasn't immediately sure whether this is reachable - but it is, e.g. after on-access pruning (which currently doesn't yet use pronto reaping), after pg_upgrade or dropping an index. > Assert(ItemIdIsNormal(lp)); > htup = (HeapTupleHeader) PageGetItem(dp, lp); > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * redirect the root to the correct chain member. > */ > if (i >= nchain) > - heap_prune_record_dead(prstate, rootoffnum); > + { > + /* > + * If the relation has no indexes, we can remove dead tuples > + * during pruning instead of marking their line pointers dead. Set > + * this tuple's line pointer LP_UNUSED. > + */ > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > + } > else > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > } > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > * item. This can happen if the loop in heap_page_prune caused us to > * visit the dead successor of a redirect item before visiting the > * redirect item. We can clean up by setting the redirect item to > - * DEAD state. > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > */ > - heap_prune_record_dead(prstate, rootoffnum); > + if (prstate->pronto_reap) > + heap_prune_record_unused(prstate, rootoffnum); > + else > + heap_prune_record_dead(prstate, rootoffnum); > } > > return ndeleted; There's three new calls to heap_prune_record_unused() and the logic got more nested. Is there a way to get to a nicer end result? > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman <melanieplageman@gmail.com> > Date: Mon, 13 Nov 2023 16:47:08 -0500 > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > not there are tuples on the page making rel truncation unsafe. > LVRelState->nonempty_pages is updated to reflect this. 
Previously, both > functions set an output parameter or output parameter member, hastup, to > indicate that nonempty_pages should be updated to reflect the latest > non-removable page. There doesn't seem to be any reason to wait until > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > counters in the LVRelState are updated in lazy_scan_[no]prune(). > This allows us to get rid of the output parameter hastup. > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > continue; > } > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > + /* > + * Collect LP_DEAD items in dead_items array, count tuples, > + * determine if rel truncation is safe > + */ > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > &recordfreespace)) > { > Size freespace = 0; > > /* > * Processed page successfully (without cleanup lock) -- just > - * need to perform rel truncation and FSM steps, much like the > - * lazy_scan_prune case. Don't bother trying to match its > - * visibility map setting steps, though. > + * need to update the FSM, much like the lazy_scan_prune case. > + * Don't bother trying to match its visibility map setting > + * steps, though. > */ > - if (hastup) > - vacrel->nonempty_pages = blkno + 1; > if (recordfreespace) > freespace = PageGetHeapFreeSpace(page); > UnlockReleaseBuffer(buf); The comment continues to say that we "determine if rel truncation is safe" - but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This doesn't seem like a clear improvement to me. Particularly because it's only set if lazy_scan_noprune() actually does something. I don't like the existing code in lazy_scan_heap(). But this kinda seems like tinkering around the edges, without getting to the heart of the issue. I think we should 1) Move everything after ReadBufferExtended() and the end of the loop into its own function 2) All the code in the loop body after the call to lazy_scan_prune() is specific to the lazy_scan_prune() path, it doesn't make sense that it's at the same level as the the calls to lazy_scan_noprune(), lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in lazy_scan_prune() or a new wrapper function. 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls in this many places. I think this is largely a consequence of the previous points. Once those are addressed, we can have one common place. But I digress. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
Thanks for the review! On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > Assert(ItemIdIsNormal(lp)); > > htup = (HeapTupleHeader) PageGetItem(dp, lp); > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > * redirect the root to the correct chain member. > > */ > > if (i >= nchain) > > - heap_prune_record_dead(prstate, rootoffnum); > > + { > > + /* > > + * If the relation has no indexes, we can remove dead tuples > > + * during pruning instead of marking their line pointers dead. Set > > + * this tuple's line pointer LP_UNUSED. > > + */ > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > + } > > else > > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > > } > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > * item. This can happen if the loop in heap_page_prune caused us to > > * visit the dead successor of a redirect item before visiting the > > * redirect item. We can clean up by setting the redirect item to > > - * DEAD state. > > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > > */ > > - heap_prune_record_dead(prstate, rootoffnum); > > + if (prstate->pronto_reap) > > + heap_prune_record_unused(prstate, rootoffnum); > > + else > > + heap_prune_record_dead(prstate, rootoffnum); > > } > > > > return ndeleted; > > There's three new calls to heap_prune_record_unused() and the logic got more > nested. Is there a way to get to a nicer end result? So, I could do another loop through the line pointers in heap_page_prune() (after the loop calling heap_prune_chain()) and, if pronto_reap is true, set dead line pointers LP_UNUSED. Then, when constructing the WAL record, I would just not add the prstate.nowdead that I saved from heap_prune_chain() to the prune WAL record. This would eliminate the extra if statements from heap_prune_chain(). It will be more performant than sticking with the original (master) call to lazy_vacuum_heap_page(). However, I'm not convinced that the extra loop to set line pointers LP_DEAD -> LP_UNUSED is less confusing than keeping the if pronto_reap test in heap_prune_chain(). heap_prune_chain() is where line pointers' new values are decided. It seems weird to pick one new value for a line pointer in heap_prune_chain() and then pick another, different new value in a loop after heap_prune_chain(). I don't see any way to eliminate the if pronto_reap tests without a separate loop setting LP_DEAD->LP_UNUSED, though. > > From 608658f2cbc0acde55aac815c0fdb523ec24c452 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman <melanieplageman@gmail.com> > > Date: Mon, 13 Nov 2023 16:47:08 -0500 > > Subject: [PATCH v2 1/2] Indicate rel truncation unsafe in lazy_scan[no]prune > > > > Both lazy_scan_prune() and lazy_scan_noprune() must determine whether or > > not there are tuples on the page making rel truncation unsafe. > > LVRelState->nonempty_pages is updated to reflect this. Previously, both > > functions set an output parameter or output parameter member, hastup, to > > indicate that nonempty_pages should be updated to reflect the latest > > non-removable page. There doesn't seem to be any reason to wait until > > lazy_scan_[no]prune() returns to update nonempty_pages. Plenty of other > > counters in the LVRelState are updated in lazy_scan_[no]prune(). 
> > This allows us to get rid of the output parameter hastup. > > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > continue; > > } > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > + /* > > + * Collect LP_DEAD items in dead_items array, count tuples, > > + * determine if rel truncation is safe > > + */ > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > &recordfreespace)) > > { > > Size freespace = 0; > > > > /* > > * Processed page successfully (without cleanup lock) -- just > > - * need to perform rel truncation and FSM steps, much like the > > - * lazy_scan_prune case. Don't bother trying to match its > > - * visibility map setting steps, though. > > + * need to update the FSM, much like the lazy_scan_prune case. > > + * Don't bother trying to match its visibility map setting > > + * steps, though. > > */ > > - if (hastup) > > - vacrel->nonempty_pages = blkno + 1; > > if (recordfreespace) > > freespace = PageGetHeapFreeSpace(page); > > UnlockReleaseBuffer(buf); > > The comment continues to say that we "determine if rel truncation is safe" - > but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This > doesn't seem like a clear improvement to me. Particularly because it's only > set if lazy_scan_noprune() actually does something. I don't get what the last sentence means ("Particularly because..."). The new location of the hastup test in lazy_scan_noprune() is above an unconditional return true, so it is also only set if lazy_scan_noprune() actually does something. I think the lazy_scan[]prune() functions shouldn't try to export the hastup information to lazy_scan_heap(). It's confusing. We should be moving all of the page-specific processing into the individual functions instead of in the body of lazy_scan_heap(). > I don't like the existing code in lazy_scan_heap(). But this kinda seems like > tinkering around the edges, without getting to the heart of the issue. I think > we should > > 1) Move everything after ReadBufferExtended() and the end of the loop into its > own function > > 2) All the code in the loop body after the call to lazy_scan_prune() is > specific to the lazy_scan_prune() path, it doesn't make sense that it's at > the same level as the the calls to lazy_scan_noprune(), > lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in > lazy_scan_prune() or a new wrapper function. > > 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different > places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls > in this many places. I think this is largely a consequence of the previous > points. Once those are addressed, we can have one common place. I have other patches that do versions of all of the above, but they didn't seem to really fit with this patch set. I am taking a step to move code out of lazy_scan_heap() that doesn't belong there. That fact that other code should also be moved from there seems more like a "yes and" than a "no but". That being said, do you think I should introduce patches doing further refactoring of lazy_scan_heap() (like what you suggest above) into this thread? - Melanie
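For concreteness, the alternative described above -- a second pass inside heap_page_prune() after the heap_prune_chain() loop -- would look roughly like the following. This is shape only, not proposed code; it glosses over PruneState's marked[] bookkeeping and over keeping nowdead out of the prune WAL record.

    /*
     * Hypothetical second step in heap_page_prune(): when the relation has
     * no indexes, treat everything recorded as newly dead as newly unused
     * instead, and emit nothing as LP_DEAD in the WAL record.
     */
    if (prstate.pronto_reap)
    {
        for (int i = 0; i < prstate.ndead; i++)
        {
            Assert(prstate.nunused < MaxHeapTuplesPerPage);
            prstate.nowunused[prstate.nunused++] = prstate.nowdead[i];
        }
        prstate.ndead = 0;      /* nothing to record as LP_DEAD */
    }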
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Thu, Jan 4, 2024 at 12:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:27 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Do you have specific concerns about its correctness? I understand it > > is an area where we have to be sure we are correct. But, to be fair, > > that is true of all the pruning and vacuuming code. > > I'm kind of concerned that 0002 might be a performance regression. It > pushes more branches down into the heap-pruning code, which I think > can sometimes be quite hot, for the sake of a case that rarely occurs > in practice. I take your point about it improving things when there > are no indexes, but what about when there are? And even if there are > no adverse performance consequences, is it really worth complicating > the logic at such a low level? Regarding the additional code complexity, I think the extra call to lazy_vacuum_heap_page() in lazy_scan_heap() actually represents a fair amount of code complexity. It is a special case of page-level processing that should be handled by heap_page_prune() and not lazy_scan_heap(). lazy_scan_heap() is responsible for three main things -- loop through the blocks in a relation and process each page (pruning, freezing, etc), invoke index vacuuming, invoke functions to loop through dead_items and vacuum pages. The logic to do the per-page processing is spread across three places, though. When a single page is being processed, page pruning happens in heap_page_prune(). Freezing, dead items recording, and visibility checks happen in lazy_scan_prune(). Visibility map updates and freespace map updates happen back in lazy_scan_heap(). Except, if the table has no indexes, in which case, lazy_scan_heap() also invokes lazy_vacuum_heap_page() to set dead line pointers unused and do another separate visibility check and VM update. I maintain that all page-level processing should be done in the page-level processing functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be directly responsible for special case page-level processing. > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > informal word that seems to have no advantage over something like > "immediate" or "now," and I don't think "reap" has a precise, > universally-understood meaning. You could call this "mark_unused_now" > or "immediately_mark_unused" or something and it would be far more > self-documenting, IMHO. Yes, I see how pronto is unnecessarily informal. If there are no cases other than when the table has no indexes that we would consider immediately marking LPs unused, then perhaps it is better to call it "no_indexes" (per andres' suggestion)? - Melanie
On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > When a single page is being processed, page pruning happens in > heap_page_prune(). Freezing, dead items recording, and visibility > checks happen in lazy_scan_prune(). Visibility map updates and > freespace map updates happen back in lazy_scan_heap(). Except, if the > table has no indexes, in which case, lazy_scan_heap() also invokes > lazy_vacuum_heap_page() to set dead line pointers unused and do > another separate visibility check and VM update. I maintain that all > page-level processing should be done in the page-level processing > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be > directly responsible for special case page-level processing. But you can just as easily turn this argument on its head, can't you? In general, except for HOT tuples, line pointers are marked dead by pruning and unused by vacuum. Here you want to turn it on its head and make pruning do what would normally be vacuum's responsibility. I mean, that's not to say that your argument is "wrong" ... but what I just said really is how I think about it, too. > > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an > > informal word that seems to have no advantage over something like > > "immediate" or "now," and I don't think "reap" has a precise, > > universally-understood meaning. You could call this "mark_unused_now" > > or "immediately_mark_unused" or something and it would be far more > > self-documenting, IMHO. > > Yes, I see how pronto is unnecessarily informal. If there are no cases > other than when the table has no indexes that we would consider > immediately marking LPs unused, then perhaps it is better to call it > "no_indexes" (per andres' suggestion)? wfm. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 8:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > When a single page is being processed, page pruning happens in > > heap_page_prune(). Freezing, dead items recording, and visibility > > checks happen in lazy_scan_prune(). Visibility map updates and > > freespace map updates happen back in lazy_scan_heap(). Except, if the > > table has no indexes, in which case, lazy_scan_heap() also invokes > > lazy_vacuum_heap_page() to set dead line pointers unused and do > > another separate visibility check and VM update. I maintain that all > > page-level processing should be done in the page-level processing > > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be > > directly responsible for special case page-level processing. > > But you can just as easily turn this argument on its head, can't you? > In general, except for HOT tuples, line pointers are marked dead by > pruning and unused by vacuum. Here you want to turn it on its head and > make pruning do what would normally be vacuum's responsibility. I actually think we are going to want to stop referring to these steps as pruning and vacuuming. It is confusing because vacuuming refers to the whole process of doing garbage collection on the table and also to the specific step of setting dead line pointers unused. If we called these steps say, pruning and reaping, that may be more clear. Vacuuming consists of three phases -- the first pass, index vacuuming, and the second pass. I don't think we should dictate what happens in each pass. That is, we shouldn't expect only pruning to happen in the first pass and only reaping to happen in the second pass. For example, I think Andres has previously proposed doing another round of pruning after index vacuuming. The second pass/third phase is distinguished primarily by being after index vacuuming. If we think about it this way, that frees us up to set dead line pointers unused in the first pass when the table has no indexes. For clarity, I could add a block comment explaining that doing this is an optimization and not a logical requirement. One way to make this even more clear would be to set the dead line pointers unused in a separate loop after heap_prune_chain() as I proposed upthread. - Melanie
On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I actually think we are going to want to stop referring to these steps > as pruning and vacuuming. It is confusing because vacuuming refers to > the whole process of doing garbage collection on the table and also to > the specific step of setting dead line pointers unused. If we called > these steps say, pruning and reaping, that may be more clear. I agree that there's some terminological confusion here, but if we leave all of the current naming unchanged and start using new naming conventions in new code, I don't think that's going to clear things up. Getting rid of the weird naming with pruning, "scanning" (that is part of vacuum), and "vacuuming" (that is another part of vacuum) is gonna take some work. > Vacuuming consists of three phases -- the first pass, index vacuuming, > and the second pass. I don't think we should dictate what happens in > each pass. That is, we shouldn't expect only pruning to happen in the > first pass and only reaping to happen in the second pass. For example, > I think Andres has previously proposed doing another round of pruning > after index vacuuming. The second pass/third phase is distinguished > primarily by being after index vacuuming. > > If we think about it this way, that frees us up to set dead line > pointers unused in the first pass when the table has no indexes. For > clarity, I could add a block comment explaining that doing this is an > optimization and not a logical requirement. One way to make this even > more clear would be to set the dead line pointers unused in a separate > loop after heap_prune_chain() as I proposed upthread. I don't really disagree with any of this, but I think the question is whether the code is cleaner with mark-LP-as-unused pushed down into pruning or whether it's better the way it is. Andres seems confident that the change won't suck for performance, which is good, but I'm not quite convinced that it's the right direction to go with the code, and he doesn't seem to be either. Perhaps this all turns on this point: MP> I am planning to add a VM update into the freeze record, at which point MP> I will move the VM update code into lazy_scan_prune(). This will then MP> allow us to consolidate the freespace map update code for the prune and MP> noprune cases and make lazy_scan_heap() short and sweet. Can we see what that looks like on top of this change? -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote: > On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > > Assert(ItemIdIsNormal(lp)); > > > htup = (HeapTupleHeader) PageGetItem(dp, lp); > > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > > * redirect the root to the correct chain member. > > > */ > > > if (i >= nchain) > > > - heap_prune_record_dead(prstate, rootoffnum); > > > + { > > > + /* > > > + * If the relation has no indexes, we can remove dead tuples > > > + * during pruning instead of marking their line pointers dead. Set > > > + * this tuple's line pointer LP_UNUSED. > > > + */ > > > + if (prstate->pronto_reap) > > > + heap_prune_record_unused(prstate, rootoffnum); > > > + else > > > + heap_prune_record_dead(prstate, rootoffnum); > > > + } > > > else > > > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > > > } > > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, > > > * item. This can happen if the loop in heap_page_prune caused us to > > > * visit the dead successor of a redirect item before visiting the > > > * redirect item. We can clean up by setting the redirect item to > > > - * DEAD state. > > > + * DEAD state. If pronto_reap is true, we can set it LP_UNUSED now. > > > */ > > > - heap_prune_record_dead(prstate, rootoffnum); > > > + if (prstate->pronto_reap) > > > + heap_prune_record_unused(prstate, rootoffnum); > > > + else > > > + heap_prune_record_dead(prstate, rootoffnum); > > > } > > > > > > return ndeleted; > > > > There's three new calls to heap_prune_record_unused() and the logic got more > > nested. Is there a way to get to a nicer end result? > > So, I could do another loop through the line pointers in > heap_page_prune() (after the loop calling heap_prune_chain()) and, if > pronto_reap is true, set dead line pointers LP_UNUSED. Then, when > constructing the WAL record, I would just not add the prstate.nowdead > that I saved from heap_prune_chain() to the prune WAL record. Hm, that seems a bit sad as well. I am wondering if we could move the pronto_reap handling into heap_prune_record_dead() or a wrapper of it. I am more concerned about the human reader than the CPU here... > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > > continue; > > > } > > > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > > + /* > > > + * Collect LP_DEAD items in dead_items array, count tuples, > > > + * determine if rel truncation is safe > > > + */ > > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > > &recordfreespace)) > > > { > > > Size freespace = 0; > > > > > > /* > > > * Processed page successfully (without cleanup lock) -- just > > > - * need to perform rel truncation and FSM steps, much like the > > > - * lazy_scan_prune case. Don't bother trying to match its > > > - * visibility map setting steps, though. > > > + * need to update the FSM, much like the lazy_scan_prune case. > > > + * Don't bother trying to match its visibility map setting > > > + * steps, though. > > > */ > > > - if (hastup) > > > - vacrel->nonempty_pages = blkno + 1; > > > if (recordfreespace) > > > freespace = PageGetHeapFreeSpace(page); > > > UnlockReleaseBuffer(buf); > > > > The comment continues to say that we "determine if rel truncation is safe" - > > but I don't see that? 
Oh, I see, it's done inside lazy_scan_noprune(). This > > doesn't seem like a clear improvement to me. Particularly because it's only > > set if lazy_scan_noprune() actually does something. > > I don't get what the last sentence means ("Particularly because..."). Took me a second to understand myself again too, oops. What I think I meant is that it seems error-prone that it's only set in some paths inside lazy_scan_noprune(). Previously it was at least a bit clearer in lazy_scan_heap() that it would be set for the different possible paths. > > I don't like the existing code in lazy_scan_heap(). But this kinda seems like > > tinkering around the edges, without getting to the heart of the issue. I think > > we should > > > > 1) Move everything after ReadBufferExtended() and the end of the loop into its > > own function > > > > 2) All the code in the loop body after the call to lazy_scan_prune() is > > specific to the lazy_scan_prune() path, it doesn't make sense that it's at > > the same level as the the calls to lazy_scan_noprune(), > > lazy_scan_new_or_empty() or lazy_scan_prune(). Either it should be in > > lazy_scan_prune() or a new wrapper function. > > > > 3) It's imo wrong that we have UnlockReleaseBuffer() (there are 6 different > > places unlocking if I didn't miscount!) and RecordPageWithFreeSpace() calls > > in this many places. I think this is largely a consequence of the previous > > points. Once those are addressed, we can have one common place. > > I have other patches that do versions of all of the above, but they > didn't seem to really fit with this patch set. I am taking a step to > move code out of lazy_scan_heap() that doesn't belong there. That fact > that other code should also be moved from there seems more like a "yes > and" than a "no but". That being said, do you think I should introduce > patches doing further refactoring of lazy_scan_heap() (like what you > suggest above) into this thread? It probably should not be part of this patchset. I probably shouldn't have written the above here, but after concluding that I didn't think your small refactoring patch was quite right, I couldn't stop myself from thinking about what would be right. Greetings, Andres Freund
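The wrapper being suggested could be as small as the following; the name heap_prune_record_dead_or_unused and its placement are illustrative, not taken from any posted patch.

    /*
     * Sketch of the suggested wrapper: callers in heap_prune_chain() that
     * currently branch on prstate->pronto_reap would call this instead,
     * keeping the branch (and the explanation) in one place.
     */
    static void
    heap_prune_record_dead_or_unused(PruneState *prstate, OffsetNumber offnum)
    {
        /*
         * If the relation has no indexes, no index entries can point at
         * this item, so it can be marked LP_UNUSED right away instead of
         * leaving an LP_DEAD stub for a second vacuum pass.
         */
        if (prstate->pronto_reap)
            heap_prune_record_unused(prstate, offnum);
        else
            heap_prune_record_dead(prstate, offnum);
    }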
Hi,

On 2024-01-05 08:59:41 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > When a single page is being processed, page pruning happens in
> > heap_page_prune(). Freezing, dead items recording, and visibility
> > checks happen in lazy_scan_prune(). Visibility map updates and
> > freespace map updates happen back in lazy_scan_heap(). Except, if the
> > table has no indexes, in which case, lazy_scan_heap() also invokes
> > lazy_vacuum_heap_page() to set dead line pointers unused and do
> > another separate visibility check and VM update. I maintain that all
> > page-level processing should be done in the page-level processing
> > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
> > directly responsible for special case page-level processing.
>
> But you can just as easily turn this argument on its head, can't you?
> In general, except for HOT tuples, line pointers are marked dead by
> pruning and unused by vacuum. Here you want to turn it on its head and
> make pruning do what would normally be vacuum's responsibility.

OTOH, the pruning logic, including its WAL record, already supports marking items unused; all we need to do is to tell it to do so in a few more cases. If we didn't already need to have support for this, I'd have a much harder time arguing for doing this.

One important part of the larger project is to combine the WAL records for pruning, freezing and setting the all-visible/all-frozen bit into one WAL record. We can't set all-frozen before we have removed the dead items. So either we need to combine pruning and setting items unused for no-index tables, or we end up considerably less efficient in the no-indexes case.

An aside:

As I think we chatted about before, I eventually would like the option to remove index entries for a tuple during on-access pruning, for OLTP workloads. I.e. before removing the tuple, construct the corresponding index tuple, use it to look up index entries pointing to the tuple. If all the index entries were found (they might not be, if they already were marked dead during a lookup, or if an expression wasn't actually immutable), we can prune without the full index scan. Obviously this would only be suitable for some workloads, but it could be quite beneficial when you have huge indexes. The reason I mention this is that then we'd have another source of marking items unused during pruning.

Greetings,

Andres Freund
On Fri, Jan 5, 2024 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > OTOH, the pruning logic, including its WAL record, already supports marking > items unused, all we need to do is to tell it to do so in a few more cases. If > we didn't already need to have support for this, I'd a much harder time > arguing for doing this. > > One important part of the larger project is to combine the WAL records for > pruning, freezing and setting the all-visible/all-frozen bit into one WAL > record. We can't set all-frozen before we have removed the dead items. So > either we need to combine pruning and setting items unused for no-index tables > or we end up considerably less efficient in the no-indexes case. Those are fair arguments. > An aside: > > As I think we chatted about before, I eventually would like the option to > remove index entries for a tuple during on-access pruning, for OLTP > workloads. I.e. before removing the tuple, construct the corresponding index > tuple, use it to look up index entries pointing to the tuple. If all the index > entries were found (they might not be, if they already were marked dead during > a lookup, or if an expression wasn't actually immutable), we can prune without > the full index scan. Obviously this would only be suitable for some > workloads, but it could be quite beneficial when you have huge indexes. The > reason I mention this is that then we'd have another source of marking items > unused during pruning. I will be astonished if you can make this work well enough to avoid huge regressions in plausible cases. There are plenty of cases where we do a very thorough job opportunistically removing index tuples. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 1:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > MP> I am planning to add a VM update into the freeze record, at which point > MP> I will move the VM update code into lazy_scan_prune(). This will then > MP> allow us to consolidate the freespace map update code for the prune and > MP> noprune cases and make lazy_scan_heap() short and sweet. > > Can we see what that looks like on top of this change? Yes, attached is a patch set which does this. My previous patchset already reduced the number of places we unlock the buffer and update the freespace map in lazy_scan_heap(). This patchset combines the lazy_scan_prune() and lazy_scan_noprune() FSM update cases. I also have a version which moves the freespace map updates into lazy_scan_prune() and lazy_scan_noprune() -- eliminating all of these from lazy_scan_heap(). This is arguably more clear. But Andres mentioned he wanted fewer places unlocking the buffer and updating the FSM. - Melanie
Attachment
- v3-0004-Inline-LVPagePruneState-members-in-lazy_scan_prun.patch
- v3-0001-Indicate-rel-truncation-unsafe-in-lazy_scan-no-pr.patch
- v3-0003-Update-VM-in-lazy_scan_prune.patch
- v3-0002-Set-would-be-dead-items-LP_UNUSED-while-pruning.patch
- v3-0005-Remove-unneeded-struct-LVPagePruneState.patch
- v3-0006-Conslidate-FSM-update-code-in-lazy_scan_heap.patch
Hi, On 2024-01-05 15:23:12 -0500, Robert Haas wrote: > On Fri, Jan 5, 2024 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > > An aside: > > > > As I think we chatted about before, I eventually would like the option to > > remove index entries for a tuple during on-access pruning, for OLTP > > workloads. I.e. before removing the tuple, construct the corresponding index > > tuple, use it to look up index entries pointing to the tuple. If all the index > > entries were found (they might not be, if they already were marked dead during > > a lookup, or if an expression wasn't actually immutable), we can prune without > > the full index scan. Obviously this would only be suitable for some > > workloads, but it could be quite beneficial when you have huge indexes. The > > reason I mention this is that then we'd have another source of marking items > > unused during pruning. > > I will be astonished if you can make this work well enough to avoid > huge regressions in plausible cases. There are plenty of cases where > we do a very thorough job opportunistically removing index tuples. These days the AM is often involved with that, via table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to happen before physically removing the already-marked-killed index entries. We can't rely on being able to actually prune the heap page at that point, there might be other backends pinning it, but often we will be able to. If we were to prune below heap_index_delete_tuples(), we wouldn't need to recheck that index again during "individual tuple pruning", if the to-be-marked-unused heap tuple is one of the tuples passed to heap_index_delete_tuples(). Which presumably will be very commonly the case. At least for nbtree, we are much more aggressive about marking index entries as killed, than about actually removing the index entries. "individual tuple pruning" would have to look for killed-but-still-present index entries, not just for "live" entries. Greetings, Andres Freund
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
On Fri, Jan 5, 2024 at 2:47 PM Andres Freund <andres@anarazel.de> wrote: > On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote: > > On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel) > > > > continue; > > > > } > > > > > > > > - /* Collect LP_DEAD items in dead_items array, count tuples */ > > > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup, > > > > + /* > > > > + * Collect LP_DEAD items in dead_items array, count tuples, > > > > + * determine if rel truncation is safe > > > > + */ > > > > + if (lazy_scan_noprune(vacrel, buf, blkno, page, > > > > &recordfreespace)) > > > > { > > > > Size freespace = 0; > > > > > > > > /* > > > > * Processed page successfully (without cleanup lock) -- just > > > > - * need to perform rel truncation and FSM steps, much like the > > > > - * lazy_scan_prune case. Don't bother trying to match its > > > > - * visibility map setting steps, though. > > > > + * need to update the FSM, much like the lazy_scan_prune case. > > > > + * Don't bother trying to match its visibility map setting > > > > + * steps, though. > > > > */ > > > > - if (hastup) > > > > - vacrel->nonempty_pages = blkno + 1; > > > > if (recordfreespace) > > > > freespace = PageGetHeapFreeSpace(page); > > > > UnlockReleaseBuffer(buf); > > > > > > The comment continues to say that we "determine if rel truncation is safe" - > > > but I don't see that? Oh, I see, it's done inside lazy_scan_noprune(). This > > > doesn't seem like a clear improvement to me. Particularly because it's only > > > set if lazy_scan_noprune() actually does something. > > > > I don't get what the last sentence means ("Particularly because..."). > > Took me a second to understand myself again too, oops. What I think I meant is > that it seems error-prone that it's only set in some paths inside > lazy_scan_noprune(). Previously it was at least a bit clearer in > lazy_scan_heap() that it would be set for the different possible paths. I see what you are saying. But if lazy_scan_noprune() doesn't do anything, then it calls lazy_scan_prune(), which does set hastup and update vacrel->nonempty_pages if needed. Using hastup in lazy_scan_[no]prune() also means that they are directly updating LVRelState after determining how to update it. lazy_scan_heap() isn't doing responsible anymore. I don't see a reason to be passing information back to lazy_scan_heap() to update LVRelState when lazy_scan_[no]prune() has access to the LVRelState. Importantly, once I combine the prune and freeze records, hastup is set in heap_page_prune() instead of lazy_scan_prune() (that whole loop in lazy_scan_prune() is eliminated()). And I don't like having to pass hastup back through lazy_scan_prune() and then to lazy_scan_heap() so that lazy_scan_heap() can use it (and use it to update a data structure available in lazy_scan_prune()). - Melanie
Re: Emit fewer vacuum records by reaping removable tuples during pruning -- From: Melanie Plageman
Patch 0001 in the attached set addresses the following review feedback: - pronto_reap renamed to no_indexes - reduce the number of callers of heap_prune_record_unused() by calling it from heap_prune_record_dead() when appropriate - add unlikely hint to no_indexes test I've also dropped the patch which moves the test of hastup into lazy_scan_[no]prune(). In the future, I plan to remove the loop from lazy_scan_prune() which sets hastup and set it instead in heap_page_prune(). Changes to hastup's usage could be done with that change instead of in this set. The one review point I did not address in the code is that which I've responded to inline below. Though not required for the immediate reaping feature, I've included patches 0002-0004 which are purely refactoring. These patches simplify lazy_scan_heap() by moving the visibility map code into lazy_scan_prune() and consolidating the updates to the FSM and vacrel->nonempty_pages. I've proposed them in this thread because there is an interdependence between eliminating the lazy_vacuum_heap_page() call, moving the VM code, and combining the three FSM updates (lazy_scan_prune() [index and no index] and lazy_scan_noprune()). This illustrates the code clarity benefits of the change to mark line pointers LP_UNUSED during pruning if the table has no indexes. I can propose them in another thread if 0001 is merged. On Thu, Jan 4, 2024 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote: > > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > > index 14de8158d49..b578c32eeb6 100644 > > --- a/src/backend/access/heap/heapam.c > > +++ b/src/backend/access/heap/heapam.c > > @@ -8803,8 +8803,13 @@ heap_xlog_prune(XLogReaderState *record) > > nunused = (end - nowunused); > > Assert(nunused >= 0); > > > > - /* Update all line pointers per the record, and repair fragmentation */ > > - heap_page_prune_execute(buffer, > > + /* > > + * Update all line pointers per the record, and repair fragmentation. > > + * We always pass pronto_reap as true, because we don't know whether > > + * or not this option was used when pruning. This reduces the > > + * validation done on replay in an assert build. > > + */ > > Hm, that seems not great. Both because we loose validation and because it > seems to invite problems down the line, due to pronto_reap falsely being set > to true in heap_page_prune_execute(). I see what you are saying. With the name change to no_indexes, it would be especially misleading to future programmers who might decide to use that parameter for something else. Are you thinking it would be okay to check the catalog to see if the table has indexes in heap_xlog_prune() or are you suggesting that I add some kind of flag to the prune WAL record? - Melanie
Attachment
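To illustrate what patch 0001 above is doing at a conceptual level, here is a standalone toy model (invented enum and function names, not the pruneheap.c code): when the relation has no indexes, a dead tuple's line pointer can skip the LP_DEAD state and go straight to LP_UNUSED during pruning, so no second vacuum pass is needed for it.

#include <stdio.h>
#include <stdbool.h>

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_REDIRECT, ITEM_DEAD } ItemStateSketch;

/*
 * Toy version of the choice made while pruning a dead tuple.  With indexes,
 * the item must become LP_DEAD and wait for index vacuuming before it can be
 * reused; with no indexes, nothing can reference it, so it can be marked
 * unused immediately ("immediate reaping").
 */
static ItemStateSketch
prune_dead_item(bool rel_has_indexes)
{
    return rel_has_indexes ? ITEM_DEAD : ITEM_UNUSED;
}

int
main(void)
{
    printf("no indexes  -> %s\n", prune_dead_item(false) == ITEM_UNUSED ? "LP_UNUSED" : "LP_DEAD");
    printf("has indexes -> %s\n", prune_dead_item(true) == ITEM_UNUSED ? "LP_UNUSED" : "LP_DEAD");
    return 0;
}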
On Fri, Jan 5, 2024 at 12:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > As I think we chatted about before, I eventually would like the option to > > remove index entries for a tuple during on-access pruning, for OLTP > > workloads. I.e. before removing the tuple, construct the corresponding index > > tuple, use it to look up index entries pointing to the tuple. If all the index > > entries were found (they might not be, if they already were marked dead during > > a lookup, or if an expression wasn't actually immutable), we can prune without > > the full index scan. Obviously this would only be suitable for some > > workloads, but it could be quite beneficial when you have huge indexes. The > > reason I mention this is that then we'd have another source of marking items > > unused during pruning. > > I will be astonished if you can make this work well enough to avoid > huge regressions in plausible cases. There are plenty of cases where > we do a very thorough job opportunistically removing index tuples. Right. In particular, bottom-up index deletion works well because it adds a kind of natural backpressure to one important special case (the case of non-HOT updates that don't "logically change" any indexed column). It isn't really all that "opportunistic" in my understanding of the term -- the overall effect is to *systematically* control bloat in a way that is actually quite reliable. Like you, I have my doubts that it would be valuable to be more proactive about deleting dead index tuples that are just random dead tuples. There may be a great many dead index tuples diffusely spread across an index -- these can be quite harmless, and not worth proactively cleaning up (even at a fairly low cost). What we mostly need to worry about is *concentrated* build-up of dead index tuples in particular leaf pages. A natural question to ask is: what cases remain, where we could stand to add more backpressure? What other "special case" do we not yet address? I think that retail index tuple deletion could work well as part of a limited form of "transaction rollback" that cleans up after a just-aborted transaction, within the backend that executed the transaction itself. I suspect that this case now has outsized importance, precisely because it's the one remaining case where the system accumulates index bloat without any sort of natural backpressure. Making the transaction/backend that creates bloat directly responsible for proactively cleaning it up tends to have a stabilizing effect over time. The system is made to live within its means. We could even fully reverse heap page line pointer bloat under this "transaction rollback" scheme -- I bet that aborted xacts are a disproportionate source of line pointer bloat. Barring a hard crash, or a very large transaction, we could "undo" the physical changes to relations before permitting the backend to retry the transaction from scratch. This would just work as an optimization. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 12:57 PM Andres Freund <andres@anarazel.de> wrote: > > I will be astonished if you can make this work well enough to avoid > > huge regressions in plausible cases. There are plenty of cases where > > we do a very thorough job opportunistically removing index tuples. > > These days the AM is often involved with that, via > table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to > happen before physically removing the already-marked-killed index entries. We > can't rely on being able to actually prune the heap page at that point, there > might be other backends pinning it, but often we will be able to. If we were > to prune below heap_index_delete_tuples(), we wouldn't need to recheck that > index again during "individual tuple pruning", if the to-be-marked-unused heap > tuple is one of the tuples passed to heap_index_delete_tuples(). Which > presumably will be very commonly the case. I don't understand. Making heap_index_delete_tuples() prune heap pages in passing such that we can ultimately mark dead heap tuples LP_UNUSED necessitates high level coordination -- it has to happen at a level much higher than heap_index_delete_tuples(). In other words, making it all work safely requires the same high level context that makes it safe for VACUUM to set a stub LP_DEAD line pointer to LP_UNUSED (index tuples must never be allowed to point to TIDs/heap line pointers that can be concurrently recycled). Obviously my idea of "a limited form of transaction rollback" has the required high-level context available, which is the crucial factor that allows it to safely reverse all bloat -- even line pointer bloat (which is traditionally something that only VACUUM can do safely). I have a hard time imagining a scheme that can do that outside of VACUUM without directly targeting some special case, such as the case that I'm calling "transaction rollback". In other words, I have a hard time imagining how this would ever be practical as part of any truly opportunistic cleanup process. AFAICT the dependency between indexes and the heap is just too delicate for such a scheme to ever really be practical. > At least for nbtree, we are much more aggressive about marking index entries > as killed, than about actually removing the index entries. "individual tuple > pruning" would have to look for killed-but-still-present index entries, not > just for "live" entries. These days having index tuples directly marked LP_DEAD is surprisingly unimportant to heap_index_delete_tuples(). The batching optimization implemented by _bt_simpledel_pass() tends to be very effective in practice. We only need to have the right *general* idea about which heap pages to visit -- which heap pages will yield some number of deletable index tuples. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > But you can just as easily turn this argument on its head, can't you? > > In general, except for HOT tuples, line pointers are marked dead by > > pruning and unused by vacuum. Here you want to turn it on its head and > > make pruning do what would normally be vacuum's responsibility. > > I actually think we are going to want to stop referring to these steps > as pruning and vacuuming. It is confusing because vacuuming refers to > the whole process of doing garbage collection on the table and also to > the specific step of setting dead line pointers unused. If we called > these steps say, pruning and reaping, that may be more clear. What about index VACUUM records? Should they be renamed to REAP records, too? > Vacuuming consists of three phases -- the first pass, index vacuuming, > and the second pass. I don't think we should dictate what happens in > each pass. That is, we shouldn't expect only pruning to happen in the > first pass and only reaping to happen in the second pass. Why not? It's not self-evident that it matters much either way. I don't see it being worth the complexity (which is not to say that I'm opposed to what you're trying to do here). Note that we only need a cleanup lock for the first heap pass right now (not the second heap pass). So if you're going to prune in the second heap pass, you're going to have to add a mechanism that makes it safe (by acquiring a cleanup lock once we decide that we actually want to prune, say). Or maybe you'd just give up on the fact that we don't need cleanup locks for the second heap pass these days instead (which seems like a step backwards). > For example, > I think Andres has previously proposed doing another round of pruning > after index vacuuming. The second pass/third phase is distinguished > primarily by being after index vacuuming. I think of the second pass/third phase as being very similar to index vacuuming. Both processes/phases don't require cleanup locks (actually nbtree VACUUM does require cleanup locks, but the reasons why are pretty esoteric, and not shared by every index AM). And, both processes/phases don't need to generate their own recovery conflicts. Neither type of WAL record requires a snapshotConflictHorizon field of its own, since we can safely assume that some PRUNE record must have taken care of all that earlier on. -- Peter Geoghegan
On Fri, Jan 5, 2024 at 3:57 PM Andres Freund <andres@anarazel.de> wrote: > > I will be astonished if you can make this work well enough to avoid > > huge regressions in plausible cases. There are plenty of cases where > > we do a very thorough job opportunistically removing index tuples. > > These days the AM is often involved with that, via > table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to > happen before physically removing the already-marked-killed index entries. We > can't rely on being able to actually prune the heap page at that point, there > might be other backends pinning it, but often we will be able to. If we were > to prune below heap_index_delete_tuples(), we wouldn't need to recheck that > index again during "individual tuple pruning", if the to-be-marked-unused heap > tuple is one of the tuples passed to heap_index_delete_tuples(). Which > presumably will be very commonly the case. > > At least for nbtree, we are much more aggressive about marking index entries > as killed, than about actually removing the index entries. "individual tuple > pruning" would have to look for killed-but-still-present index entries, not > just for "live" entries. I don't want to derail this thread, but I don't really see what you have in mind here. The first paragraph sounds like you're imagining that while pruning the index entries we might jump over to the heap and clean things up there, too, but that seems like it wouldn't work if the table has more than one index. I thought you were talking about starting with a heap tuple and bouncing around to every index to see if we can find index pointers to kill in every one of them. That *could* work out, but you only need one index to have been opportunistically cleaned up in order for it to fail to work out. There might well be some workloads where that's often the case, but the regressions in the workloads where it isn't the case seem like they would be rather substantial, because doing an extra lookup in every index for each heap tuple visited sounds pricey. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 5, 2024 at 3:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, attached is a patch set which does this. My previous patchset > already reduced the number of places we unlock the buffer and update > the freespace map in lazy_scan_heap(). This patchset combines the > lazy_scan_prune() and lazy_scan_noprune() FSM update cases. I also > have a version which moves the freespace map updates into > lazy_scan_prune() and lazy_scan_noprune() -- eliminating all of these > from lazy_scan_heap(). This is arguably more clear. But Andres > mentioned he wanted fewer places unlocking the buffer and updating the > FSM. Hmm, interesting. I haven't had time to study this fully today, but I think 0001 looks fine and could just be committed. Hooray for killing useless variables with dumb names. This part of 0002 makes me very, very uncomfortable: + /* + * Update all line pointers per the record, and repair fragmentation. + * We always pass no_indexes as true, because we don't know whether or + * not this option was used when pruning. This reduces the validation + * done on replay in an assert build. + */ + heap_page_prune_execute(buffer, true, redirected, nredirected, nowdead, ndead, nowunused, nunused); The problem that I have with this is that we're essentially saying that it's ok to lie to heap_page_prune_execute because we know how it's going to use the information, and therefore we know that the lie is harmless. But that's not how things are supposed to work. We should either find a way to tell the truth, or change the name of the parameter so that it's not a lie, or change the function so that it doesn't need this parameter in the first place, or something. I can occasionally stomach this sort of lying as a last resort when there's no realistic way of adapting the code being called, but that's surely not the case here -- this is a newborn parameter, and it shouldn't be a lie on day 1. Just imagine if some future developer thought that the no_indexes parameter meant that the relation actually didn't have indexes (the nerve of them!). I took a VERY fast look through the rest of the patch set and I think that the overall direction looks like it's probably reasonable, but that's a very preliminary conclusion which I reserve the right to revise after studying further. @Andres: Are you planning to review/commit any of this? Are you hoping that I'm going to do that? Somebody else want to jump in here? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Jan 08, 2024 at 03:50:47PM -0500, Robert Haas wrote: > Hmm, interesting. I haven't had time to study this fully today, but I > think 0001 looks fine and could just be committed. Hooray for killing > useless variables with dumb names. I looked at 0001 a couple of weeks ago and thought that it was fine because there's only one caller of lazy_scan_prune() and one caller of lazy_scan_noprune() so all the code paths were covered. + /* rel truncation is unsafe */ + if (hastup) + vacrel->nonempty_pages = blkno + 1; Except for this comment that I found misleading because this is not about the fact that truncation is unsafe, it's about correctly tracking the last block where we have tuples to ensure a correct truncation. Perhaps this could just reuse "Remember the location of the last page with nonremovable tuples"? If people object to that, feel free. -- Michael
Attachment
On Tue, Jan 9, 2024 at 12:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Jan 08, 2024 at 03:50:47PM -0500, Robert Haas wrote: > > Hmm, interesting. I haven't had time to study this fully today, but I > > think 0001 looks fine and could just be committed. Hooray for killing > > useless variables with dumb names. > > I've been looking at 0001 a couple of weeks ago and thought that it > was fine because there's only one caller of lazy_scan_prune() and one > caller of lazy_scan_noprune() so all the code paths were covered. > > + /* rel truncation is unsafe */ > + if (hastup) > + vacrel->nonempty_pages = blkno + 1; Andres had actually said that he didn't like pushing the update of nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set eliminates this. I can see an argument for doing both the update of vacrel->nonempty_pages and the FSM updates in lazy_scan_[no]prune() because it eliminates some of the back-and-forth between the block-specific functions and lazy_scan_heap(). lazy_scan_new_or_empty() has special logic for deciding how to update the FSM -- so that remains in lazy_scan_new_or_empty() either way. On the other hand, the comment above lazy_scan_new_or_empty() says we can get rid of this special handling if we make relation extension crash safe. Then it would make more sense to have a consolidated FSM update in lazy_scan_heap(). However it does still mean that we repeat the "UnlockReleaseBuffer()" and FSM update code in even more places. Ultimately I can see arguments for and against. Is it better to avoid having the same few lines of code in two places or avoid unneeded communication between page-level functions and lazy_scan_heap()? > Except for this comment that I found misleading because this is not > about the fact that truncation is unsafe, it's about correctly > tracking the the last block where we have tuples to ensure a correct > truncation. Perhaps this could just reuse "Remember the location of > the last page with nonremovable tuples"? If people object to that, > feel free. I agree the comment could be better. But, simply saying that it tracks the last page with non-removable tuples makes it less clear how important this is. It makes it sound like it could be simply for stats purposes. I'll update the comment to something that includes that sentiment but is more exact than "rel truncation is unsafe". - Melanie
On Mon, Jan 8, 2024 at 3:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 3:34 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > This part of 0002 makes me very, very uncomfortable: > > + /* > + * Update all line pointers per the record, and repair > fragmentation. > + * We always pass no_indexes as true, because we don't > know whether or > + * not this option was used when pruning. This reduces > the validation > + * done on replay in an assert build. > + */ > + heap_page_prune_execute(buffer, true, > > redirected, nredirected, > nowdead, ndead, > > nowunused, nunused); > > The problem that I have with this is that we're essentially saying > that it's ok to lie to heap_page_prune_execute because we know how > it's going to use the information, and therefore we know that the lie > is harmless. But that's not how things are supposed to work. We should > either find a way to tell the truth, or change the name of the > parameter so that it's not a lie, or change the function so that it > doesn't need this parameter in the first place, or something. I can > occasionally stomach this sort of lying as a last resort when there's > no realistic way of adapting the code being called, but that's surely > not the case here -- this is a newborn parameter, and it shouldn't be > a lie on day 1. Just imagine if some future developer thought that the > no_indexes parameter meant that the relation actually didn't have > indexes (the nerve of them!). I agree that this is an issue. The easiest solution would be to change the name of the parameter to heap_page_prune_execute()'s from "no_indexes" to something like "validate_unused", since it is only used in assert builds for validation. However, though I wish a name change was the right way to solve this problem, my gut feeling is that it is not. It seems like we should rely only on the WAL record itself in recovery. Right now the parameter is used exclusively for validation, so it isn't so bad. But what if someone uses this parameter in the future in heap_xlog_prune() to decide how to modify the page? It seems like the right solution would be to add a flag into the prune record indicating what to pass to heap_page_prune_execute(). In the future, I'd like to add flags for updating the VM to each of the prune and vacuum records (eliminating the separate VM update record). Thus, a new flags member of the prune record could have future use. However, this would add a uint8 to the record. I can go and look for some padding if you think this is the right direction? - Melanie
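To make the "flag in the prune record" idea concrete, here is a purely hypothetical layout sketch -- the field names and widths are invented for illustration and this is not the actual xl_heap_prune struct:

#include <stdint.h>

/* Hypothetical flag bits for a prune record. */
#define SKETCH_PRUNE_MARKED_UNUSED_NOW   (1 << 0)  /* dead items were set unused during pruning */
#define SKETCH_PRUNE_SET_VM_ALL_VISIBLE  (1 << 1)  /* room for a future VM-update bit */

typedef struct PruneRecordSketch
{
    uint32_t snapshot_conflict_horizon;   /* stand-in for the conflict-horizon xid */
    uint16_t nredirected;
    uint16_t ndead;
    uint8_t  flags;                       /* the extra byte under discussion */
} PruneRecordSketch;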
On Tue, Jan 9, 2024 at 10:56 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Andres had actually said that he didn't like pushing the update of > nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set > eliminates this. Mmph - but I was so looking forward to killing hastup! > On the other hand, the comment above lazy_scan_new_or_empty() says we > can get rid of this special handling if we make relation extension > crash safe. Then it would make more sense to have a consolidated FSM > update in lazy_scan_heap(). However it does still mean that we repeat > the "UnlockReleaseBuffer()" and FSM update code in even more places. I wouldn't hold my breath waiting for relation extension to become crash-safe. Even if you were a whale, you'd be about four orders of magnitude short of holding it long enough. > Ultimately I can see arguments for and against. Is it better to avoid > having the same few lines of code in two places or avoid unneeded > communication between page-level functions and lazy_scan_heap()? To me, the conceptual complexity of an extra structure member is a bigger cost than duplicating TWO lines of code. If we were talking about 20 lines of code, I'd say rename it to something less dumb. > > Except for this comment that I found misleading because this is not > > about the fact that truncation is unsafe, it's about correctly > > tracking the the last block where we have tuples to ensure a correct > > truncation. Perhaps this could just reuse "Remember the location of > > the last page with nonremovable tuples"? If people object to that, > > feel free. > > I agree the comment could be better. But, simply saying that it tracks > the last page with non-removable tuples makes it less clear how > important this is. It makes it sound like it could be simply for stats > purposes. I'll update the comment to something that includes that > sentiment but is more exact than "rel truncation is unsafe". I agree that "rel truncation is unsafe" is less clear than desirable, but I'm not sure that I agree that tracking the last page with non-removable tuples makes it sound unimportant. However, the comments also need to make everybody happy, not just me. Maybe something like "can't truncate away this page" or similar. A long-form comment that really spells it out is fine too, but I don't know if we really need it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 9, 2024 at 11:35 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > The easiest solution would be to change the name of the parameter to > heap_page_prune_execute()'s from "no_indexes" to something like > "validate_unused", since it is only used in assert builds for > validation. Right. > However, though I wish a name change was the right way to solve this > problem, my gut feeling is that it is not. It seems like we should > rely only on the WAL record itself in recovery. Right now the > parameter is used exclusively for validation, so it isn't so bad. But > what if someone uses this parameter in the future in heap_xlog_prune() > to decide how to modify the page? Exactly. > It seems like the right solution would be to add a flag into the prune > record indicating what to pass to heap_page_prune_execute(). In the > future, I'd like to add flags for updating the VM to each of the prune > and vacuum records (eliminating the separate VM update record). Thus, > a new flags member of the prune record could have future use. However, > this would add a uint8 to the record. I can go and look for some > padding if you think this is the right direction? I thought about this approach and it might be OK but I don't love it, because it means making the WAL record bigger on production systems for the sake of an assertion that only fires for developers. Sure, it's possible that there might be another use in the future, but there might also not be another use in the future. How about changing if (no_indexes) to if (ndead == 0) and adding a comment like this: /* If there are any tuples being marked LP_DEAD, then the relation must have indexes, so every item being marked unused must be a heap-only tuple. But if there are no tuples being marked LP_DEAD, then it's possible that the relation has no indexes, in which case all we know is that the line pointer shouldn't already be LP_UNUSED. */ BTW: + * LP_REDIRECT, or LP_DEAD items to LP_UNUSED during pruning. We + * can't check much here except that, if the item is LP_NORMAL, it + * should have storage before it is set LP_UNUSED. Is it really helpful to check this here, or just confusing/grasping at straws? I mean, the requirement that LP_NORMAL items have storage is a general one, IIUC, not something that's specific to this situation. It feels like the equivalent of checking that your children don't set fire to the couch on Tuesdays. -- Robert Haas EDB: http://www.enterprisedb.com
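A standalone sketch of the inference behind that suggested assertion (simplified stand-in types; this is not heap_page_prune_execute() itself): when the same record also marks items LP_DEAD, the relation must have indexes, so items being set unused must be heap-only tuples; when it doesn't, the weaker check is all that can be justified.

#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool is_used;        /* i.e. not already LP_UNUSED */
    bool is_heap_only;   /* HOT tuple, never pointed to directly by index entries */
} ItemSketch;

static void
check_item_being_marked_unused(const ItemSketch *item, int ndead_in_record)
{
    if (ndead_in_record > 0)
    {
        /* Some items are being marked LP_DEAD, so there must be indexes;
         * anything set unused here therefore has to be a heap-only tuple. */
        assert(item->is_used && item->is_heap_only);
    }
    else
    {
        /* The relation may have no indexes, so a dead root item can go
         * straight to unused; all we know is that it isn't unused already. */
        assert(item->is_used);
    }
}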
On Tue, Jan 9, 2024 at 1:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 10:56 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Andres had actually said that he didn't like pushing the update of > > nonempty_pages into lazy_scan_[no]prune(). So, my v4 patch set > > eliminates this. > > Mmph - but I was so looking forward to killing hastup! > > > Ultimately I can see arguments for and against. Is it better to avoid > > having the same few lines of code in two places or avoid unneeded > > communication between page-level functions and lazy_scan_heap()? > > To me, the conceptual complexity of an extra structure member is a > bigger cost than duplicating TWO lines of code. If we were talking > about 20 lines of code, I'd say rename it to something less dumb. Yes, I agree. I thought about it more, and I prefer updating the FSM and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I had ordered the patch set with that first (before the patch to do immediate reaping), but there is no reason for it to be so. Using hastup can be done in a subsequent commit on top of the immediate reaping patch. I will post a new version of the immediate reaping patch which addresses your feedback. Then, separately, I will post a revised version of the lazy_scan_heap() refactoring patches. - Melanie
On Tue, Jan 9, 2024 at 2:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, I agree. I thought about it more, and I prefer updating the FSM > and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I > had ordered the patch set with that first (before the patch to do > immediate reaping), but there is no reason for it to be so. Using > hastup can be done in a subsequent commit on top of the immediate > reaping patch. I will post a new version of the immediate reaping > patch which addresses your feedback. Then, separately, I will post a > revised version of the lazy_scan_heap() refactoring patches. I kind of liked it first, because I thought we could just do it and get it out of the way, but if Andres doesn't agree with the idea, it probably does make sense to push it later, as you say here. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On January 9, 2024 11:33:29 AM PST, Robert Haas <robertmhaas@gmail.com> wrote: >On Tue, Jan 9, 2024 at 2:23 PM Melanie Plageman ><melanieplageman@gmail.com> wrote: >> Yes, I agree. I thought about it more, and I prefer updating the FSM >> and setting nonempty_pages into lazy_scan_[no]prune(). Originally, I >> had ordered the patch set with that first (before the patch to do >> immediate reaping), but there is no reason for it to be so. Using >> hastup can be done in a subsequent commit on top of the immediate >> reaping patch. I will post a new version of the immediate reaping >> patch which addresses your feedback. Then, separately, I will post a >> revised version of the lazy_scan_heap() refactoring patches. > >I kind of liked it first, because I thought we could just do it and >get it out of the way, but if Andres doesn't agree with the idea, it >probably does make sense to push it later, as you say here. I don't have that strong feelings about it. If both of you think it looks good, go ahead... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
I had already written the patch for immediate reaping addressing the below feedback before I saw the emails that said everyone is happy with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let me know if you have a strong preference for reordering. Otherwise, I will write the three subsequent patches on top of this one. On Tue, Jan 9, 2024 at 2:00 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 11:35 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > The easiest solution would be to change the name of the parameter to > > heap_page_prune_execute()'s from "no_indexes" to something like > > "validate_unused", since it is only used in assert builds for > > validation. > > Right. > > > However, though I wish a name change was the right way to solve this > > problem, my gut feeling is that it is not. It seems like we should > > rely only on the WAL record itself in recovery. Right now the > > parameter is used exclusively for validation, so it isn't so bad. But > > what if someone uses this parameter in the future in heap_xlog_prune() > > to decide how to modify the page? > > Exactly. > > > It seems like the right solution would be to add a flag into the prune > > record indicating what to pass to heap_page_prune_execute(). In the > > future, I'd like to add flags for updating the VM to each of the prune > > and vacuum records (eliminating the separate VM update record). Thus, > > a new flags member of the prune record could have future use. However, > > this would add a uint8 to the record. I can go and look for some > > padding if you think this is the right direction? > > I thought about this approach and it might be OK but I don't love it, > because it means making the WAL record bigger on production systems > for the sake of assertion that only fires for developers. Sure, it's > possible that there might be another use in the future, but there > might also not be another use in the future. > > How about changing if (no_indexes) to if (ndead == 0) and adding a > comment like this: /* If there are any tuples being marked LP_DEAD, > then the relation must have indexes, so every item being marked unused > must be a heap-only tuple. But if there are no tuples being marked > LP_DEAD, then it's possible that the relation has no indexes, in which > case all we know is that the line pointer shouldn't already be > LP_UNUSED. */ Ah, I like this a lot. Attached patch does this. I've added a modified version of the comment you suggested. My only question is if we are losing something without this sentence (from the old comment): - * ... They don't need to be left in place as LP_DEAD items until VACUUM gets - * around to doing index vacuuming. I don't feel like it adds a lot, but it is absent from the new comment, so thought I would check. > BTW: > > + * LP_REDIRECT, or LP_DEAD items to LP_UNUSED > during pruning. We > + * can't check much here except that, if the > item is LP_NORMAL, it > + * should have storage before it is set LP_UNUSED. > > Is it really helpful to check this here, or just confusing/grasping at > straws? I mean, the requirement that LP_NORMAL items have storage is a > general one, IIUC, not something that's specific to this situation. It > feels like the equivalent of checking that your children don't set > fire to the couch on Tuesdays. Hmm. Yes. I suppose I was trying to find something to validate. Is it worth checking that the line pointer is not already LP_UNUSED? Or is that a bit ridiculous? - Melanie
Attachment
On Tue, Jan 9, 2024 at 3:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I had already written the patch for immediate reaping addressing the > below feedback before I saw the emails that said everyone is happy > with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let > me know if you have a strong preference for reordering. Otherwise, I > will write the three subsequent patches on top of this one. I don't know if it rises to the level of a strong preference. It's just a preference. > Ah, I like this a lot. Attached patch does this. I've added a modified > version of the comment you suggested. My only question is if we are > losing something without this sentence (from the old comment): > > - * ... They don't need to be left in place as LP_DEAD items > until VACUUM gets > - * around to doing index vacuuming. > > I don't feel like it adds a lot, but it is absent from the new > comment, so thought I would check. I agree that we can leave that out. It wouldn't be bad to include it if someone had a nice way of doing that, but it doesn't seem critical, and if forcing it in there makes the comment less clear overall, it's a net loss IMHO. > Hmm. Yes. I suppose I was trying to find something to validate. Is it > worth checking that the line pointer is not already LP_UNUSED? Or is > that a bit ridiculous? I think that's worthwhile (hence my proposed wording). -- Robert Haas EDB: http://www.enterprisedb.com
On 1/8/24 2:10 PM, Robert Haas wrote: > On Fri, Jan 5, 2024 at 3:57 PM Andres Freund <andres@anarazel.de> wrote: >>> I will be astonished if you can make this work well enough to avoid >>> huge regressions in plausible cases. There are plenty of cases where >>> we do a very thorough job opportunistically removing index tuples. >> >> These days the AM is often involved with that, via >> table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to >> happen before physically removing the already-marked-killed index entries. We >> can't rely on being able to actually prune the heap page at that point, there >> might be other backends pinning it, but often we will be able to. If we were >> to prune below heap_index_delete_tuples(), we wouldn't need to recheck that >> index again during "individual tuple pruning", if the to-be-marked-unused heap >> tuple is one of the tuples passed to heap_index_delete_tuples(). Which >> presumably will be very commonly the case. >> >> At least for nbtree, we are much more aggressive about marking index entries >> as killed, than about actually removing the index entries. "individual tuple >> pruning" would have to look for killed-but-still-present index entries, not >> just for "live" entries. > > I don't want to derail this thread, but I don't really see what you > have in mind here. The first paragraph sounds like you're imagining > that while pruning the index entries we might jump over to the heap > and clean things up there, too, but that seems like it wouldn't work > if the table has more than one index. I thought you were talking about > starting with a heap tuple and bouncing around to every index to see > if we can find index pointers to kill in every one of them. That > *could* work out, but you only need one index to have been > opportunistically cleaned up in order for it to fail to work out. > There might well be some workloads where that's often the case, but > the regressions in the workloads where it isn't the case seem like > they would be rather substantial, because doing an extra lookup in > every index for each heap tuple visited sounds pricey. The idea of probing indexes for tuples that are now dead has come up in the past, and the concern has always been whether it's actually safe to do so. An obvious example is an index on a function where the function has since changed, so you can't reliably determine if a particular tuple is present in the index. That's bad enough during an index scan, but potentially worse while doing heap cleanup. Given that operators are functions, this risk exists to some degree in even simple indexes. Depending on the gains this might still be worth doing, at least for some cases. It's hard to conceive of this breaking for indexes on integers, for example. But we'd still need to be cautious. -- Jim Nasby, Data Architect, Austin TX
On Tue, Jan 9, 2024 at 3:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 3:13 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I had already written the patch for immediate reaping addressing the > > below feedback before I saw the emails that said everyone is happy > > with using hastup in lazy_scan_[no]prune() in a preliminary patch. Let > > me know if you have a strong preference for reordering. Otherwise, I > > will write the three subsequent patches on top of this one. > > I don't know if it rises to the level of a strong preference. It's > just a preference. Attached v6 has the immediate reaping patch first followed by the code to use hastup in lazy_scan_[no]prune(). 0003 and 0004 move the VM update code into lazy_scan_prune() and eliminate LVPagePruneState entirely. 0005 moves the FSM update into lazy_scan_[no]prune(), substantially simplifying lazy_scan_heap(). > I agree that we can leave that out. It wouldn't be bad to include it > if someone had a nice way of doing that, but it doesn't seem critical, > and if forcing it in there makes the comment less clear overall, it's > a net loss IMHO. > > > Hmm. Yes. I suppose I was trying to find something to validate. Is it > > worth checking that the line pointer is not already LP_UNUSED? Or is > > that a bit ridiculous? > > I think that's worthwhile (hence my proposed wording). Done in attached v6. - Melanie
Attachment
- v6-0003-Move-VM-update-code-to-lazy_scan_prune.patch
- v6-0001-Set-would-be-dead-items-LP_UNUSED-while-pruning.patch
- v6-0002-Indicate-rel-truncation-unsafe-in-lazy_scan-no-pr.patch
- v6-0005-Move-freespace-map-update-into-lazy_scan_-no-prun.patch
- v6-0004-Inline-LVPagePruneState-members-in-lazy_scan_prun.patch
On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Done in attached v6. Don't kill me, but: + /* + * For now, pass no_indexes == false regardless of whether or not + * the relation has indexes. In the future we may enable immediate + * reaping for on access pruning. + */ + heap_page_prune(relation, buffer, vistest, false, + &presult, NULL); My previous comments about lying seem equally applicable here. - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) + if (!ItemIdIsUsed(itemid)) continue; There is a bit of overhead here for the !no_indexes case. I assume it doesn't matter. static void heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum) { + /* + * If the relation has no indexes, we can remove dead tuples during + * pruning instead of marking their line pointers dead. Set this tuple's + * line pointer LP_UNUSED. We hint that tables with indexes are more + * likely. + */ + if (unlikely(prstate->no_indexes)) + { + heap_prune_record_unused(prstate, offnum); + return; + } I think this should be pushed to the callers. Else you make the existing function name into another lie. + bool recordfreespace; Not sure if it's necessary to move this to an outer scope like this? The whole handling of this looks kind of confusing. If we're going to do it this way, then I think lazy_scan_prune() definitely needs to document how it handles this function (i.e. might set true to false, won't set false to true, also meaning). But are we sure we want to let a local variable with a weird name "leak out" like this? + Assert(vacrel->do_index_vacuuming); Is this related? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 10, 2024 at 3:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 5:42 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Done in attached v6. > > Don't kill me, but: > > + /* > + * For now, pass no_indexes == false > regardless of whether or not > + * the relation has indexes. In the future we > may enable immediate > + * reaping for on access pruning. > + */ > + heap_page_prune(relation, buffer, vistest, false, > + &presult, NULL); > > My previous comments about lying seem equally applicable here. Yes, the options I can think of are: 1) rename the parameter to "immed_reap" or similar and make very clear in heap_page_prune_opt() that you are not to pass true. 2) make immediate reaping work for on-access pruning. I would need a low cost way to find out if there are any indexes on the table. Do you think this is possible? Should we just rename the parameter for now and think about that later? > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > + if (!ItemIdIsUsed(itemid)) > continue; > > There is a bit of overhead here for the !no_indexes case. I assume it > doesn't matter. Right. Should be okay. Alternatively, and I'm not saying this is a good idea, but we could throw this into the loop in heap_page_prune() which calls heap_prune_chain(): + if (ItemIdIsDead(itemid) && prstate.no_indexes) + { + heap_prune_record_unused(&prstate, offnum); + continue; + } I think that is correct? But, from a consistency perspective, we don't call the heap_prune_record* functions directly from heap_page_prune() in any other cases. And it seems like it would be nice to have all of those in one place? > static void > heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum) > { > + /* > + * If the relation has no indexes, we can remove dead tuples during > + * pruning instead of marking their line pointers dead. Set this tuple's > + * line pointer LP_UNUSED. We hint that tables with indexes are more > + * likely. > + */ > + if (unlikely(prstate->no_indexes)) > + { > + heap_prune_record_unused(prstate, offnum); > + return; > + } > > I think this should be pushed to the callers. Else you make the > existing function name into another lie. Yes, so I preferred it in the body of heap_prune_chain() (the caller). Andres didn't like the extra level of indentation. I could wrap heap_record_dead() and heap_record_unused(), but I couldn't really think of a good wrapper name. heap_record_dead_or_unused() seems long and literal. But, it might be the best I can do. I can't think of a general word which encompasses both the transition to death and to disposal. > + bool recordfreespace; > > Not sure if it's necessary to move this to an outer scope like this? > The whole handling of this looks kind of confusing. If we're going to > do it this way, then I think lazy_scan_prune() definitely needs to > document how it handles this function (i.e. might set true to false, > won't set false to true, also meaning). But are we sure we want to let > a local variable with a weird name "leak out" like this? Which function do you mean when you say "document how lazy_scan_prune() handles this function". And no we definitely don't want a variable like this to be hanging out in lazy_scan_heap(), IMHO. The other patches in the stack move the FSM updates into lazy_scan_[no]prune() and eliminate this parameter. I moved up the scope because lazy_scan_noprune() already had recordfreespace as an output parameter and initialized it unconditionally inside. 
I initialize it unconditionally in lazy_scan_prune() as well. I mention in the commit message that this is temporary because we plan to eliminate recordfreespace as an output parameter by updating the FSM in lazy_scan_[no]prune(). I could have stuck recordfreespace into the LVPagePruneState with the other output parameters. But, leaving it as a separate output parameter made the diffs lovely for the rest of the patches in the stack. > + Assert(vacrel->do_index_vacuuming); > > Is this related? Yes, previously the assert was: Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming); And we eliminated the caller of lazy_vacuum_heap_page() with vacrel->nindexes == 0. Now it should only be called after doing index vacuuming (thus index vacuuming should definitely be enabled). - Melanie
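For what it's worth, a minimal sketch of the kind of wrapper being weighed here (a standalone toy with invented names; the real record functions are static to pruneheap.c and take a PruneState):

#include <stdbool.h>

typedef unsigned short OffsetNumber;

typedef struct
{
    bool no_indexes;   /* relation has no indexes, so dead items can be reaped immediately */
} PruneStateSketch;

static void record_dead(PruneStateSketch *prstate, OffsetNumber offnum)   { (void) prstate; (void) offnum; }
static void record_unused(PruneStateSketch *prstate, OffsetNumber offnum) { (void) prstate; (void) offnum; }

/*
 * An item that would normally become LP_DEAD goes straight to LP_UNUSED when
 * the table has no indexes.  Keeping the branch in a wrapper means
 * record_dead() itself keeps doing exactly what its name says.
 */
static void
record_dead_or_unused(PruneStateSketch *prstate, OffsetNumber offnum)
{
    if (prstate->no_indexes)
        record_unused(prstate, offnum);
    else
        record_dead(prstate, offnum);
}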
On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, the options I can think of are: > > 1) rename the parameter to "immed_reap" or similar and make very clear > in heap_page_prune_opt() that you are not to pass true. > 2) make immediate reaping work for on-access pruning. I would need a > low cost way to find out if there are any indexes on the table. Do you > think this is possible? Should we just rename the parameter for now > and think about that later? I don't think we can implement (2), because: robert.haas=# create table test (a int); CREATE TABLE robert.haas=# begin; BEGIN robert.haas=*# select * from test; a --- (0 rows) <in another window> robert.haas=# create index on test (a); CREATE INDEX In English, the lock we hold during regular table access isn't strong enough to foreclose concurrent addition of an index. So renaming the parameter seems like the way to go. How about "mark_unused_now"? > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > > + if (!ItemIdIsUsed(itemid)) > > continue; > > > > There is a bit of overhead here for the !no_indexes case. I assume it > > doesn't matter. > > Right. Should be okay. Alternatively, and I'm not saying this is a > good idea, but we could throw this into the loop in heap_page_prune() > which calls heap_prune_chain(): > > + if (ItemIdIsDead(itemid) && prstate.no_indexes) > + { > + heap_prune_record_unused(&prstate, offnum); > + continue; > + } > > I think that is correct? Wouldn't that be wrong in any case where heap_prune_chain() calls heap_prune_record_dead()? > Yes, so I preferred it in the body of heap_prune_chain() (the caller). > Andres didn't like the extra level of indentation. I could wrap > heap_record_dead() and heap_record_unused(), but I couldn't really > think of a good wrapper name. heap_record_dead_or_unused() seems long > and literal. But, it might be the best I can do. I can't think of a > general word which encompasses both the transition to death and to > disposal. I'm sure we could solve the wordsmithing problem but I think it's clearer if the heap_prune_record_* functions don't get clever. > > + bool recordfreespace; > > > > Not sure if it's necessary to move this to an outer scope like this? > > The whole handling of this looks kind of confusing. If we're going to > > do it this way, then I think lazy_scan_prune() definitely needs to > > document how it handles this function (i.e. might set true to false, > > won't set false to true, also meaning). But are we sure we want to let > > a local variable with a weird name "leak out" like this? > > Which function do you mean when you say "document how > lazy_scan_prune() handles this function". "function" was a thinko for "variable". > And no we definitely don't > want a variable like this to be hanging out in lazy_scan_heap(), IMHO. > The other patches in the stack move the FSM updates into > lazy_scan_[no]prune() and eliminate this parameter. I moved up the > scope because lazy_scan_noprune() already had recordfreespace as an > output parameter and initialized it unconditionally inside. I > initialize it unconditionally in lazy_scan_prune() as well. I mention > in the commit message that this is temporary because we plan to > eliminate recordfreespace as an output parameter by updating the FSM > in lazy_scan_[no]prune(). I could have stuck recordfreespace into the > LVPagePruneState with the other output parameters. But, leaving it as > a separate output parameter made the diffs lovely for the rest of the > patches in the stack. 
I guess I'll have to look at more of the patches to see what I think of this. Looking at this patch in isolation, it's ugly, IMHO, but it seems we agree on that. > > + Assert(vacrel->do_index_vacuuming); > > > > Is this related? > > Yes, previously the assert was: > Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming); > And we eliminated the caller of lazy_vacuum_heap_page() with > vacrel->nindexes == 0. Now it should only be called after doing index > vacuuming (thus index vacuuming should definitely be enabled). Ah. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 9, 2024 at 2:35 PM Andres Freund <andres@anarazel.de> wrote: > I don't have that strong feelings about it. If both of you think it looks good, go ahead... Done. -- Robert Haas EDB: http://www.enterprisedb.com
Attached v7 is rebased over the commit you just made to remove LVPagePruneState->hastup. On Thu, Jan 11, 2024 at 11:54 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 10, 2024 at 5:28 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Yes, the options I can think of are: > > > > 1) rename the parameter to "immed_reap" or similar and make very clear > > in heap_page_prune_opt() that you are not to pass true. > > 2) make immediate reaping work for on-access pruning. I would need a > > low cost way to find out if there are any indexes on the table. Do you > > think this is possible? Should we just rename the parameter for now > > and think about that later? > > I don't think we can implement (2), because: > > robert.haas=# create table test (a int); > CREATE TABLE > robert.haas=# begin; > BEGIN > robert.haas=*# select * from test; > a > --- > (0 rows) > > <in another window> > > robert.haas=# create index on test (a); > CREATE INDEX > > In English, the lock we hold during regular table access isn't strong > enough to foreclose concurrent addition of an index. Ah, I see. > So renaming the parameter seems like the way to go. How about "mark_unused_now"? I've done this in attached v7. > > > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid)) > > > + if (!ItemIdIsUsed(itemid)) > > > continue; > > > > > > There is a bit of overhead here for the !no_indexes case. I assume it > > > doesn't matter. > > > > Right. Should be okay. Alternatively, and I'm not saying this is a > > good idea, but we could throw this into the loop in heap_page_prune() > > which calls heap_prune_chain(): > > > > + if (ItemIdIsDead(itemid) && prstate.no_indexes) > > + { > > + heap_prune_record_unused(&prstate, offnum); > > + continue; > > + } > > > > I think that is correct? > > Wouldn't that be wrong in any case where heap_prune_chain() calls > heap_prune_record_dead()? Hmm. But previously, we skipped heap_prune_chain() for already dead line pointers. In my patch, I rely on the test in the loop in heap_prune_chain() to set those LP_UNUSED if mark_unused_now is true (previously the below code just broke out of the loop). /* * Likewise, a dead line pointer can't be part of the chain. (We * already eliminated the case of dead root tuple outside this * function.) */ if (ItemIdIsDead(lp)) { /* * If the caller set mark_unused_now true, we can set dead line * pointers LP_UNUSED now. We don't increment ndeleted here since * the LP was already marked dead. */ if (unlikely(prstate->mark_unused_now)) heap_prune_record_unused(prstate, offnum); break; } so wouldn't what I suggested simply set the item LP_UNUSED before invoking heap_prune_chain()? Thereby achieving the same result without invoking heap_prune_chain() for already dead line pointers? I could be missing something. That heap_prune_chain() logic sometimes gets me turned around. > > Yes, so I preferred it in the body of heap_prune_chain() (the caller). > > Andres didn't like the extra level of indentation. I could wrap > > heap_record_dead() and heap_record_unused(), but I couldn't really > > think of a good wrapper name. heap_record_dead_or_unused() seems long > > and literal. But, it might be the best I can do. I can't think of a > > general word which encompasses both the transition to death and to > > disposal. > > I'm sure we could solve the wordsmithing problem but I think it's > clearer if the heap_prune_record_* functions don't get clever. I've gone with my heap_record_dead_or_unused() suggestion. - Melanie
Attachment
On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Attached v7 is rebased over the commit you just made to remove > LVPagePruneState->hastup. Apologies for making you rebase but I was tired of thinking about that patch. I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c. Say we didn't add the recordfreespace parameter to lazy_scan_prune(). Couldn't the caller just compute it? lpdead_items goes out of scope, but there's also prstate.has_lpdead_items. Pressing that gripe a bit more, I just figured out that "Wait until lazy_vacuum_heap_rel() to save free space" gets turned into "If we will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to save free space." That follows the good practice of phrasing the comment conditionally when the comment is outside the if-statement. But the if-statement says merely "if (recordfreespace)", which is not obviously related to "If we will likely do index vacuuming" but under the hood actually is. But it seems like an unpleasant amount of action at a distance. If that condition said if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming && prstate.has_lpdead_items) UnlockReleaseBuffer(); else { PageGetHeapFreeSpace; UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more obvious that the code was doing what the comment says. That's a refactoring question, but I'm also wondering if there's a functional issue. Pre-patch, if the table has no indexes and some items are left LP_DEAD, then we mark them unused, forget them, RecordPageWithFreeSpace(), and if enough pages have been visited, FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no items will be left LP_DEAD, and *recordfreespace will be set to true, so we'll PageRecordFreeSpace(). But this seems to me to mean that post-patch we'll PageRecordFreeSpace() in more cases than we do now. Specifically, imagine a case where the current code wouldn't mark any items LP_DEAD and the page also had no pre-existing items that were LP_DEAD and the table also has no indexes. Currently, I believe we wouldn't PageRecordFreeSpace(), but with the patch, I think we would. Am I wrong? Note that the call to FreeSpaceMapVacuumRange() seems to try to guard against this problem by testing for vacrel->tuples_deleted > tuples_already_deleted. I haven't tried to verify whether that guard is correct, but the fact that FreeSpaceMapVacuumRange() has such a guard and RecordPageWithFreeSpace() does not have one makes me suspicious. Another interesting effect of the patch is that instead of getting lazy_vacuum_heap_page()'s handling of the all-visible status of the page, we get the somewhat more complex handling done by lazy_scan_heap(). I haven't fully thought through the consequences of that, but if you have, I'd be interested to hear your analysis. -- Robert Haas EDB: http://www.enterprisedb.com
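Spelling out the suggested restructuring as a rough sketch (this paraphrases the quoted condition and assumes the surrounding lazy_scan_heap() loop's vacrel, prstate, buf, page, and blkno; it is not committed code):

if (vacrel->nindexes > 0 && vacrel->do_index_vacuuming &&
    prstate.has_lpdead_items)
{
    /* Dead items remain; lazy_vacuum_heap_rel() will revisit this page
     * and record its free space then. */
    UnlockReleaseBuffer(buf);
}
else
{
    Size        freespace = PageGetHeapFreeSpace(page);

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
}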
On Thu, Jan 11, 2024 at 4:49 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 2:30 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I'm still kind of hung up on the changes that 0001 makes to vacuumlazy.c. > > Say we didn't add the recordfreespace parameter to lazy_scan_prune(). > Couldn't the caller just compute it? lpdead_items goes out of scope, > but there's also prstate.has_lpdead_items. > > Pressing that gripe a bit more, I just figured out that "Wait until > lazy_vacuum_heap_rel() to save free space" gets turned into "If we > will likely do index vacuuming, wait until lazy_vacuum_heap_rel() to > save free space." That follows the good practice of phrasing the > comment conditionally when the comment is outside the if-statement. > But the if-statement says merely "if (recordfreespace)", which is not > obviously related to "If we will likely do index vacuuming" but under > the hood actually is. But it seems like an unpleasant amount of action > at a distance. If that condition said if (vacrel->nindexes > 0 && > vacrel->do_index_vacuuming && prstate.has_lpdead_items) > UnlockReleaseBuffer(); else { PageGetHeapFreeSpace; > UnlockReleaseBuffer; RecordPageWithFreeSpace } it would be a lot more > obvious that the code was doing what the comment says. Yes, this is a good point. Seems like writing maintainable code is really about never lying and then figuring out when hiding the truth is also lying. Darn! My only excuse is that lazy_scan_noprune() has a similarly confusing output parameter, recordfreespace, both of which I removed in a later patch in the set. I think we should change it as you say. > That's a refactoring question, but I'm also wondering if there's a > functional issue. Pre-patch, if the table has no indexes and some > items are left LP_DEAD, then we mark them unused, forget them, > RecordPageWithFreeSpace(), and if enough pages have been visited, > FreeSpaceMapVacuumRange(). Post-patch, if the table has no indexes, no > items will be left LP_DEAD, and *recordfreespace will be set to true, > so we'll PageRecordFreeSpace(). But this seems to me to mean that > post-patch we'll PageRecordFreeSpace() in more cases than we do now. > Specifically, imagine a case where the current code wouldn't mark any > items LP_DEAD and the page also had no pre-existing items that were > LP_DEAD and the table also has no indexes. Currently, I believe we > wouldn't PageRecordFreeSpace(), but with the patch, I think we would. > Am I wrong? Ah! I think you are right. Good catch. I could fix this with logic like this: bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; if ((vacrel->nindexes == 0 && space_freed) || (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) I think I made this mistake when working on a different version of this that combined the prune and no prune cases. I noticed that lazy_scan_noprune() updates the FSM whenever there are no indexes. I wonder why this is (and why we don't do it in the prune case). > Note that the call to FreeSpaceMapVacuumRange() seems to try to guard > against this problem by testing for vacrel->tuples_deleted > > tuples_already_deleted. I haven't tried to verify whether that guard > is correct, but the fact that FreeSpaceMapVacuumRange() has such a > guard and RecordPageWithFreeSpace() does not have one makes me > suspicious. FreeSpaceMapVacuumRange() is not called for the no prune case, so I think this is right. 
> Another interesting effect of the patch is that instead of getting > lazy_vacuum_heap_page()'s handling of the all-visible status of the > page, we get the somewhat more complex handling done by > lazy_scan_heap(). I haven't fully thought through the consequences of that, > but if you have, I'd be interested to hear your analysis. lazy_vacuum_heap_page() calls heap_page_is_all_visible() which does another HeapTupleSatisfiesVacuum() call -- which is definitely going to be more expensive than not doing that. In one case, in lazy_scan_heap(), we might do a visibilitymap_get_status() (via VM_ALL_FROZEN()) to avoid calling visibilitymap_set() if the page is already marked all frozen in the VM. But that would pale in comparison to another HeapTupleSatisfiesVacuum() (I think). The VM update code in lazy_scan_heap() looks complicated but two out of four cases are basically to deal with uncommon data corruption. - Melanie
On Thu, Jan 11, 2024 at 01:43:27PM -0500, Robert Haas wrote: > On Tue, Jan 9, 2024 at 2:35 PM Andres Freund <andres@anarazel.de> wrote: >> I don't have that strong feelings about it. If both of you think it >> looks good, go ahead... > > Done. Thanks for e2d5b3b9b643. -- Michael
Attachment
On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, this is a good point. Seems like writing maintainable code is > really about never lying and then figuring out when hiding the truth > is also lying. Darn! I think that's pretty true. In this particular case, this code has a fair number of preexisting problems of this type, IMHO. It's been touched by many hands, each person with their own design ideas, and the result isn't as coherent as if it were written by a single mind from scratch. But, because this code is also absolutely critical to the system not eating user data, any changes have to be approached with the highest level of caution. I think it's good to be really careful about this sort of refactoring anywhere, because finding out a year later that you broke something and having to go back and fix it is no fun even if the consequences are minor ... here, they might not be. > Ah! I think you are right. Good catch. I could fix this with logic like this: > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > if ((vacrel->nindexes == 0 && space_freed) || > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) Perhaps that would be better written as space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point you might not need to introduce the variable. > I think I made this mistake when working on a different version of > this that combined the prune and no prune cases. I noticed that > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > wonder why this is (and why we don't do it in the prune case). Yeah, this all seems distinctly under-commented. As noted above, this code has grown organically, and some things just never got written down. Looking through the git history, I see that this behavior seems to date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced lazy_scan_noprune(). The comments don't explain the reasoning, but my guess is that it was just an accident. It's not entirely evident to me whether there might ever be good reasons to update the freespace map for a page where we haven't freed up any space -- after all, the free space map isn't crash-safe, so you could always postulate that updating it will correct an existing inaccuracy. But I really doubt that there's any good reason for lazy_scan_prune() and lazy_scan_noprune() to have different ideas about whether to update the FSM or not, especially in an obscure corner case like this. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. Why do you think that lazy_scan_prune() and lazy_scan_noprune() have different ideas about whether to update the FSM or not? Barring certain failsafe edge cases, we always call PageGetHeapFreeSpace() exactly once for each scanned page. While it's true that we won't always do that in the first heap pass (in cases where VACUUM has indexes), that's only because we expect to do it in the second heap pass instead -- since we only want to do it once. It's true that lazy_scan_noprune unconditionally calls PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases we know that there won't be any second heap pass, and so in both cases we always call PageGetHeapFreeSpace() in the first heap pass. It's just that it's a bit harder to see that in the lazy_scan_prune case. No? -- Peter Geoghegan
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Ah! I think you are right. Good catch. I could fix this with logic like this: > > > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > > if ((vacrel->nindexes == 0 && space_freed) || > > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) > > Perhaps that would be better written as space_freed || > (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point > you might not need to introduce the variable. As I revisited this, I realized why I had written this logic for updating the FSM: if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || lpdead_items == 0) It is because in master, in lazy_scan_heap(), when there are no indexes and no space has been freed, we actually keep going and hit the other logic to update the FSM at the bottom of the loop: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) The effect is that if there are no indexes and no space is freed we always update the FSM. When combined, that means that if there are no indexes, we always update the FSM. With the logic you suggested, we fail a pageinspect test which expects the FSM to exist when it doesn't. So, adding the logic: if (space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming) || vacrel->nindexes == 0) We pass that pageinspect test. But, we fail a freespacemap test which expects some freespace reported which isn't:

 id            | blkno | is_avail
-----------------+-------+----------
- freespace_tab |     0 | t
+ freespace_tab |     0 | f

which I presume is because space_freed is not the same as !has_lpdead_items. I can't really see what logic would be right here. > > I think I made this mistake when working on a different version of > > this that combined the prune and no prune cases. I noticed that > > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > > wonder why this is (and why we don't do it in the prune case). > > Yeah, this all seems distinctly under-commented. As noted above, this > code has grown organically, and some things just never got written > down. > > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. All of this makes me wonder why we would update the FSM in vacuum when no space was freed. I thought that RelationAddBlocks() and RelationGetBufferForTuple() would handle making sure the FSM gets created and updated if the reason there is freespace is because the page hasn't been filled yet. - Melanie
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Looking through the git history, I see that this behavior seems to > > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > > lazy_scan_noprune(). The comments don't explain the reasoning, but my > > guess is that it was just an accident. It's not entirely evident to me > > whether there might ever be good reasons to update the freespace map > > for a page where we haven't freed up any space -- after all, the free > > space map isn't crash-safe, so you could always postulate that > > updating it will correct an existing inaccuracy. But I really doubt > > that there's any good reason for lazy_scan_prune() and > > lazy_scan_noprune() to have different ideas about whether to update > > the FSM or not, especially in an obscure corner case like this. > > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? > > Barring certain failsafe edge cases, we always call > PageGetHeapFreeSpace() exactly once for each scanned page. While it's > true that we won't always do that in the first heap pass (in cases > where VACUUM has indexes), that's only because we expect to do it in > the second heap pass instead -- since we only want to do it once. > > It's true that lazy_scan_noprune unconditionally calls > PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same > behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases > we know that there won't be any second heap pass, and so in both cases > we always call PageGetHeapFreeSpace() in the first heap pass. It's > just that it's a bit harder to see that in the lazy_scan_prune case. > No? So, I think this is the logic in master: Prune case, first pass - indexes == 0 && space_freed -> update FSM - indexes == 0 && (!space_freed || !index_vacuuming) -> update FSM - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM which reduces to: - indexes == 0 || !space_freed || !index_vacuuming No Prune (except aggressive vacuum), first pass: - indexes == 0 -> update FSM - indexes > 0 && !space_freed -> update FSM which reduces to: - indexes == 0 || !space_freed which is, during the first pass, if we will not do index vacuuming, either because we have no indexes, because there is nothing to vacuum, or because do_index_vacuuming is false, make sure we update the freespace map now. but what about no prune when do_index_vacuuming is false? I still don't understand why vacuum is responsible for updating the FSM per page when no line pointers have been set unused. That is how PageGetFreeSpace() figures out if there is free space, right? - Melanie
On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, I think this is the logic in master: > > Prune case, first pass > > ... > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM What is "space_freed"? Isn't that something from your uncommitted patch? As I said, the aim is to call PageGetHeapFreeSpace() (*not* PageGetFreeSpace(), which is only used for index pages) exactly once per heap page scanned. This is supposed to happen independently of whatever specific work was/will be required for the heap page. In general, we don't ever trust that the FSM is already up-to-date. Presumably because the FSM isn't crash safe. On master, prunestate.has_lpdead_items may be set true when our VACUUM wasn't actually the thing that performed pruning that freed tuple storage -- typically when some other backend was the one that did all required pruning at some earlier point in time, often via opportunistic pruning. For better or worse, the only thing that VACUUM aims to do is make sure that PageGetHeapFreeSpace() gets called exactly once per scanned page. > which is, during the first pass, if we will not do index vacuuming, > either because we have no indexes, because there is nothing to vacuum, > or because do_index_vacuuming is false, make sure we update the > freespace map now. > > but what about no prune when do_index_vacuuming is false? Note that do_index_vacuuming cannot actually affect the "nindexes == 0" case. We have an assertion that's kinda relevant to this point: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) { /* * Wait until lazy_vacuum_heap_rel() to save free space. * .... */ Assert(vacrel->nindexes > 0); UnlockReleaseBuffer(buf); } else { .... } We should never get this far down in the lazy_scan_heap() loop when "nindexes == 0 && prunestate.has_lpdead_items". That's handled in the special "single heap pass" branch after lazy_scan_prune(). As for the case where we use lazy_scan_noprune() for a "nindexes == 0" VACUUM's heap page, we also can't get this far down. (Actually, lazy_scan_noprune lies if it has to in order to make sure that lazy_scan_heap never has to deal with a "nindexes == 0" VACUUM with LP_DEAD items from a no-cleanup-lock page. This is a kludge -- see comments about how this is "dishonest" inside lazy_scan_noprune().) > I still don't understand why vacuum is responsible for updating the > FSM per page when no line pointers have been set unused. That is how > PageGetFreeSpace() figures out if there is free space, right? You mean PageGetHeapFreeSpace? Not really. (Though even pruning can set line pointers unused, or heap-only tuples.) Even if pruning doesn't happen in VACUUM, that doesn't mean that the FSM is up-to-date. In short, we do these things with the free space map because it is a map of free space (which isn't crash safe) -- nothing more. I happen to agree that that general design has a lot of problems, but those seem out of scope here. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? When lazy_scan_prune() is called, we call RecordPageWithFreeSpace if vacrel->nindexes == 0 && prunestate.has_lpdead_items. See the code near the comment that begins "Consider the need to do page-at-a-time heap vacuuming". When lazy_scan_noprune() is called, whether we call RecordPageWithFreeSpace depends on the output parameter recordfreespace. That is set to true whenever vacrel->nindexes == 0 || lpdead_items == 0. See the code near the comment that begins "Save any LP_DEAD items". The case where I thought there was a behavior difference is when vacrel->nindexes == 0 and lpdead_items == 0 and thus prunestate.has_lpdead_items is false. But now I see (from Melanie's email) that this isn't really true, because in that case we fall through to the logic that we use when indexes are present, giving us a second chance to call RecordPageWithFreeSpace(), which we take when (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) comes out false, as it always does in the scenario that I postulated. P.S. to Melanie: I'll respond to your further emails next but I wanted to respond to this one from Peter first so he didn't think I was rudely ignoring him. :-) P.P.S. to everyone: Yikes, this logic is really confusing. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? Yes, I was mixing the two together. I just want to make sure that we agree that, on master, when lazy_scan_prune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items || !index_vacuuming and when lazy_scan_noprune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items Those seem different to me. - Melanie
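Putting the two side by side as a hypothetical helper makes the difference easy to see (illustrative only, not PostgreSQL code; the function and parameter names here are invented):

    #include <stdbool.h>

    /*
     * Summarizes, per the analysis above, when master's first heap pass
     * records a scanned page's free space in the FSM.
     */
    static bool
    first_pass_records_freespace(bool got_cleanup_lock, int nindexes,
                                 bool has_lpdead_items, bool do_index_vacuuming)
    {
        if (got_cleanup_lock)
            /* lazy_scan_prune() path */
            return nindexes == 0 || !has_lpdead_items || !do_index_vacuuming;

        /* lazy_scan_noprune() path */
        return nindexes == 0 || !has_lpdead_items;
    }

The only divergence is the !do_index_vacuuming disjunct, which matters when the failsafe has disabled index vacuuming.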
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? > > As I said, the aim is to call PageGetHeapFreeSpace() (*not* > PageGetFreeSpace(), which is only used for index pages) exactly once > per heap page scanned. This is supposed to happen independently of > whatever specific work was/will be required for the heap page. In > general, we don't ever trust that the FSM is already up-to-date. > Presumably because the FSM isn't crash safe. > > On master, prunestate.has_lpdead_items may be set true when our VACUUM > wasn't actually the thing that performed pruning that freed tuple > storage -- typically when some other backend was the one that did all > required pruning at some earlier point in time, often via > opportunistic pruning. For better or worse, the only thing that VACUUM > aims to do is make sure that PageGetHeapFreeSpace() gets called > exactly once per scanned page. ... > > I still don't understand why vacuum is responsible for updating the > > FSM per page when no line pointers have been set unused. That is how > > PageGetFreeSpace() figures out if there is free space, right? > > You mean PageGetHeapFreeSpace? Not really. (Though even pruning can > set line pointers unused, or heap-only tuples.) > > Even if pruning doesn't happen in VACUUM, that doesn't mean that the > FSM is up-to-date. > > In short, we do these things with the free space map because it is a > map of free space (which isn't crash safe) -- nothing more. I happen > to agree that that general design has a lot of problems, but those > seem out of scope here. So, there are 3 issues I am trying to understand: 1) How often should vacuum update the FSM (not vacuum as in the second pass but vacuum as in the whole thing that is happening in lazy_scan_heap())? 2) What is the exact logic in master that ensures that vacuum implements the cadence in 1)? 3) How can the logic in 2) be replicated exactly in my patch that sets would-be dead items LP_UNUSED during pruning? From what Peter is saying, I think 1) is decided and is once per page (across all passes) For 2), see my previous email. And for 3), TBD until 2) is agreed upon. - Melanie
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yes, I was mixing the two together. > > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. This analysis seems correct to me, except that "when lazy_scan_noprune() is called" should really say "when lazy_scan_noprune() is called (and returns true)", because when it returns false we fall through and call lazy_scan_prune() afterwards. Here's a draft patch to clean up the inconsistency here. It also gets rid of recordfreespace, because ISTM that recordfreespace is adding to the confusion here rather than helping anything. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 12, 2024 at 2:32 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > This analysis seems correct to me, except that "when > lazy_scan_noprune() is called" should really say "when > lazy_scan_noprune() is called (and returns true)", because when it > returns false we fall through and call lazy_scan_prune() afterwards. Now that I see your patch, I understand what Melanie must have meant. I agree that there is a small inconsistency here, that we could well do without. In general I am in favor of religiously eliminating such inconsistencies (between lazy_scan_prune and lazy_scan_noprune), unless there is a reason not to. Not because it's necessarily important. More because it's just too hard to be sure whether it might matter. It's usually easier to defensively assume that it matters. > Here's a draft patch to clean up the inconsistency here. It also gets > rid of recordfreespace, because ISTM that recordfreespace is adding to > the confusion here rather than helping anything. You're using "!prunestate.has_lpdead_items" as part of your test that sets "recordfreespace". But lazy_scan_noprune doesn't get passed a pointer to prunestate, so clearly you'll need to detect the same condition some other way. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > You're using "!prunestate.has_lpdead_items" as part of your test that > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > pointer to prunestate, so clearly you'll need to detect the same > condition some other way. OOPS. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > > What is "space_freed"? Isn't that something from your uncommitted patch? > > Yes, I was mixing the two together. An understandable mistake. > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. Right. As I said to Robert just now, I can now see that they're slightly different conditions. FWIW my brain was just ignoring " || !index_vacuuming". I dismissed it as an edge-case, only relevant when the failsafe has kicked in. Which it is. But that's still no reason to allow an inconsistency that we can easily just avoid. -- Peter Geoghegan
On Fri, Jan 12, 2024 at 2:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > > You're using "!prunestate.has_lpdead_items" as part of your test that > > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > > pointer to prunestate, so clearly you'll need to detect the same > > condition some other way. > > OOPS. Thanks. Also, I think you should combine these in lazy_scan_noprune() now:

/* Save any LP_DEAD items found on the page in dead_items array */
if (vacrel->nindexes == 0)
{
    /* Using one-pass strategy (since table has no indexes) */
    if (lpdead_items > 0)
    {

since we don't set recordfreespace in the outer if statement anymore. And I noticed you missed a reference to the recordfreespace output parameter in the function comment above lazy_scan_noprune(). - Melanie
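Combined, that block would look something like this (a sketch of the suggestion only, not the actual patch):

    /*
     * Save any LP_DEAD items found on the page in the dead_items array
     * (one-pass strategy, since the table has no indexes).
     */
    if (vacrel->nindexes == 0 && lpdead_items > 0)
    {
        /* ... existing one-pass handling ... */
    }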
On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Also, I think you should combine these in lazy_scan_noprune() now > > /* Save any LP_DEAD items found on the page in dead_items array */ > if (vacrel->nindexes == 0) > { > /* Using one-pass strategy (since table has no indexes) */ > if (lpdead_items > 0) > { > > Since we don't set recordfreespace in the outer if statement anymore Well, maybe, but there's an else clause attached to the outer "if", so you have to be a bit careful. I didn't think it was critical to further rejigger this. > And I noticed you missed a reference to recordfreespace output > parameter in the function comment above lazy_scan_noprune(). OK. So what's the best way to solve the problem that Peter pointed out? Should we pass in the prunestate? Maybe just replace bool *recordfreespace with bool *has_lpdead_items? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 12, 2024 at 3:22 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > So what's the best way to solve the problem that Peter pointed out? > Should we pass in the prunestate? Maybe just replace bool > *recordfreespace with bool *has_lpdead_items? Yea, that works for now. I mean, I think the way we should do it is update the FSM in lazy_scan_noprune(), but, for the purposes of this patch, yes. has_lpdead_items output parameter seems fine to me. - Melanie
On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Yea, that works for now. I mean, I think the way we should do it is > update the FSM in lazy_scan_noprune(), but, for the purposes of this > patch, yes. has_lpdead_items output parameter seems fine to me. Here's v2. It's not exactly clear to me why you want to update the FSM in lazy_scan_[no]prune(). When I first looked at v7-0004, I said to myself "well, this is dumb, because now we're just duplicating something that is common to both cases". But then I realized that the logic was *already* duplicated in lazy_scan_heap(), and that by pushing the duplication down to lazy_scan_[no]prune(), you made the two copies identical rather than, as at present, having two copies that differ from each other. Perhaps that's a sufficient reason to make the change all by itself, but it seems like what would be really good is if we only needed one copy of the logic. I don't know if that's achievable, though. More generally, I somewhat question whether it's really right to push things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly because of the risk of having to duplicate logic. I'm not even really convinced that it's good for us to have both of those functions. There's an awful lot of code duplication between them already. Each has a loop that tests the status of each item and then, for LP_USED, switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It doesn't seem trivial to unify all that code, but it doesn't seem very nice to have two copies of it, either. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
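The duplication Robert refers to is the per-item loop that both functions contain; schematically it looks like this (heavily simplified; the real loops also handle redirects, freezing, and various counters):

    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
            continue;

        if (ItemIdIsDead(itemid))
        {
            /* remember the LP_DEAD item */
            continue;
        }

        /*
         * LP_NORMAL: classify the tuple (htsv_get_valid_status() in
         * lazy_scan_prune(), HeapTupleSatisfiesVacuum() in
         * lazy_scan_noprune()) and update live/dead/frozen accounting.
         */
    }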
On Mon, Jan 15, 2024 at 12:29:57PM -0500, Robert Haas wrote: > On Fri, Jan 12, 2024 at 4:05 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Yea, that works for now. I mean, I think the way we should do it is > > update the FSM in lazy_scan_noprune(), but, for the purposes of this > > patch, yes. has_lpdead_items output parameter seems fine to me. > > Here's v2. > > It's not exactly clear to me why you want to update the FSM in > lazy_scan_[no]prune(). When I first looked at v7-0004, I said to > myself "well, this is dumb, because now we're just duplicating > something that is common to both cases". But then I realized that the > logic was *already* duplicated in lazy_scan_heap(), and that by > pushing the duplication down to lazy_scan_[no]prune(), you made the > two copies identical rather than, as at present, having two copies > that differ from each other. Perhaps that's a sufficient reason to > make the change all by itself, but it seems like what would be really > good is if we only needed one copy of the logic. I don't know if > that's achievable, though. Upthread in v4-0004, I do have a version which combines the two FSM updates for lazy_scan_prune() and lazy_scan_noprune(). If you move the VM updates from lazy_scan_heap() into lazy_scan_prune(), then there is little difference between the prune and no prune cases in lazy_scan_heap(). The main difference is that, when lazy_scan_noprune() returns true, you are supposed to avoid calling lazy_scan_new_or_empty() again and have to avoid calling lazy_scan_prune(). I solved this with a local variable "do_prune" and checked it before calling lazy_scan_new_or_empty() and lazy_scan_prune(). I moved away from this approach because it felt odd to test "do_prune" before calling lazy_scan_new_or_empty(). Though, perhaps it doesn't matter if we just call lazy_scan_new_or_empty() again. We do that when lazy_scan_noprune() returns false anyway. I thought perhaps we could circumvent this issue by moving lazy_scan_new_or_empty() into lazy_scan_prune(). But, this seemed wrong to me because, as it stands now, if lazy_scan_new_or_empty() returns true, we would want to skip the VM update code and FSM update code in lazy_scan_heap() after lazy_scan_prune(). That would mean lazy_scan_prune() would have to return something to indicate that. > More generally, I somewhat question whether it's really right to push > things from lazy_scan_heap() and into lazy_scan_[no]prune(), mostly > because of the risk of having to duplicate logic. I'm not even really > convinced that it's good for us to have both of those functions. > There's an awful lot of code duplication between them already. Each > has a loop that tests the status of each item and then, for LP_USED, > switches on htsv_get_valid_status or HeapTupleSatisfiesVacuum. It > doesn't seem trivial to unify all that code, but it doesn't seem very > nice to have two copies of it, either. I agree that the duplicated logic in both places is undesirable. I think the biggest issue with combining them would be that when lazy_scan_noprune() returns false, it needs to go get a cleanup lock and then invoke the logic of lazy_scan_prune() for all the tuples on the page. This seems a little hard to get right in a single function. Then there are more trivial-to-solve differences like invoking heap_tuple_should_freeze() and not bothering with the whole heap_prepare_freeze_tuple() if we only have the share lock. I am willing to try and write a version of this to see if it is better. 
I will say, though, my agenda was to eventually push the actions taken in the loop in lazy_scan_prune() into heap_page_prune() and heap_prune_chain(). > From 32684f41d1dd50f726aa0dfe8a5d816aa5c42d64 Mon Sep 17 00:00:00 2001 > From: Robert Haas <rhaas@postgresql.org> > Date: Mon, 15 Jan 2024 12:05:52 -0500 > Subject: [PATCH v2] Be more consistent about whether to update the FSM while > vacuuming. Few small review comments below, but, overall, LGTM. > Previously, when lazy_scan_noprune() was called and returned true, we would > update the FSM immediately if the relation had no indexes or if the page > contained no dead items. On the other hand, when lazy_scan_prune() was > called, we would update the FSM if either of those things was true or > if index vacuuming was disabled. Eliminate that behavioral difference by > considering vacrel->do_index_vacuuming in both cases. > > Also, instead of having lazy_scan_noprune() make the decision > internally and pass it back to the caller via *recordfreespace, just > have it pass the number of LP_DEAD items back to the caller, and then It doesn't pass the number of LP_DEAD items back to the caller. It passes a boolean. > let the caller make the decision. That seems less confusing, since > the caller also decides in the lazy_scan_prune() case; moreover, this > way, the whole test is in one place, instead of spread out. Perhaps it isn't important, but I find this wording confusing. You mention lazy_scan_prune() and then mention that "the whole test is in one place instead of spread out" -- which kind of makes it sound like you are consolidating FSM updates for both the lazy_scan_noprune() and lazy_scan_prune() cases. Perhaps simply flipping the order of the "since the caller" and "moreover, this way" conjunctions would solve it. I defer to your judgment. > --- > src/backend/access/heap/vacuumlazy.c | 58 ++++++++++++++-------------- > 1 file changed, 29 insertions(+), 29 deletions(-) > > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index b63cad1335..f17816b81d 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -1960,7 +1974,6 @@ lazy_scan_noprune(LVRelState *vacrel, > Assert(BufferGetBlockNumber(buf) == blkno); > > hastup = false; /* for now */ > - *recordfreespace = false; /* for now */ > > lpdead_items = 0; > live_tuples = 0; > @@ -2102,18 +2115,8 @@ lazy_scan_noprune(LVRelState *vacrel, > hastup = true; > missed_dead_tuples += lpdead_items; > } > - > - *recordfreespace = true; > - } > - else if (lpdead_items == 0) > - { > - /* > - * Won't be vacuuming this page later, so record page's freespace in > - * the FSM now > - */ > - *recordfreespace = true; > } > - else > + else if (lpdead_items != 0) It stuck out to me a bit that this test is if lpdead_items != 0 and the one above it: /* Save any LP_DEAD items found on the page in dead_items array */ if (vacrel->nindexes == 0) { /* Using one-pass strategy (since table has no indexes) */ if (lpdead_items > 0) { tests if lpdead_items > 0. It is more the inconsistency that bothered me than the fact that lpdead_items is signed. > @@ -2159,6 +2156,9 @@ lazy_scan_noprune(LVRelState *vacrel, > if (hastup) > vacrel->nonempty_pages = blkno + 1; > > + /* Did we find LP_DEAD items? */ > + *has_lpdead_items = (lpdead_items > 0); I would drop this comment. The code doesn't really need it, and the reason we care if there are LP_DEAD items is not because they are dead but because we want to know if we'll touch this page again. 
You don't need to rehash all that here, so I think omitting the comment is enough. - Melanie
On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > It doesn't pass the number of LP_DEAD items back to the caller. It > passes a boolean. Oops. > Perhaps it isn't important, but I find this wording confusing. You > mention lazy_scan_prune() and then mention that "the whole test is in > one place instead of spread out" -- which kind of makes it sound like > you are consolidating FSM updates for both the lazy_scan_noprune() and > lazy_scan_prune() cases. Perhaps simply flipping the order of the "since > the caller" and "moreover, this way" conjunctions would solve it. I > defer to your judgment. I rewrote the commit message a bit. See what you think of this version. > tests if lpdead_items > 0. It is more the inconsistency that bothered me > than the fact that lpdead_items is signed. Changed. > > @@ -2159,6 +2156,9 @@ lazy_scan_noprune(LVRelState *vacrel, > > if (hastup) > > vacrel->nonempty_pages = blkno + 1; > > > > + /* Did we find LP_DEAD items? */ > > + *has_lpdead_items = (lpdead_items > 0); > > I would drop this comment. The code doesn't really need it, and the > reason we care if there are LP_DEAD items is not because they are dead > but because we want to know if we'll touch this page again. You don't > need to rehash all that here, so I think omitting the comment is enough. I want to keep the comment. I guess it's a pet peeve of mine, but I hate it when people do this:

/* some comment */
some_code();
some_more_code();

/* some other comment */
even_more_code();

IMV, this makes it unclear whether /* some comment */ is describing both some_code() and some_more_code(), or just the former. To be fair, there is often no practical confusion, because if the comment is good and the code is nothing too complicated then you understand which way the author meant it. But sometimes the comment is bad or out of date and sometimes the code is difficult to understand and then you're left scratching your head as to what the author meant. I prefer to insert a comment above some_more_code() in such cases, even if it's a bit perfunctory. I think it makes the code easier to read. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
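Rendered out, the layout Robert says he prefers looks like this (an illustrative sketch using the same placeholder names as his example):

    /* some comment */
    some_code();

    /* explain some_more_code() here, even if only briefly */
    some_more_code();

    /* some other comment */
    even_more_code();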
On Tue, Jan 16, 2024 at 10:24 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Perhaps it isn't important, but I find this wording confusing. You > > mention lazy_scan_prune() and then mention that "the whole test is in > > one place instead of spread out" -- which kind of makes it sound like > > you are consolidating FSM updates for both the lazy_scan_noprune() and > > lazy_scan_prune() cases. Perhaps simply flipping the order of the "since > > the caller" and "moreover, this way" conjunctions would solve it. I > > defer to your judgment. > > I rewrote the commit message a bit. See what you think of this version. All LGTM.
On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > All LGTM. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/12/24 12:45 PM, Robert Haas wrote: > P.P.S. to everyone: Yikes, this logic is really confusing. Having studied all this code several years ago when it was even simpler - it was *still* very hard to grok even back then. I *greatly appreciate* the effort that y'all are putting into increasing the clarity here. BTW, back in the day the whole "no indexes" optimization was a really tiny amount of code... I think it amounted to 2 or 3 if statements. I haven't yet attempted to grok this patchset, but I'm definitely wondering how much it's worth continuing to optimize that case. Clearly it'd be very expensive to memoize dead tuples just to trawl that list a single time to clean the heap, but outside of that I'm not sure other optimizations are worth it given the amount of code complexity/duplication they seem to require - especially for code where correctness is so crucial. -- Jim Nasby, Data Architect, Austin TX
On Tue, Jan 16, 2024 at 4:28 PM Jim Nasby <jim.nasby@gmail.com> wrote: > On 1/12/24 12:45 PM, Robert Haas wrote: > > P.P.S. to everyone: Yikes, this logic is really confusing. > > Having studied all this code several years ago when it was even simpler > - it was *still* very hard to grok even back then. I *greatly > appreciate* the effort that y'all are putting into increasing the > clarity here. Thanks. And yeah, I agree. > BTW, back in the day the whole "no indexes" optimization was a really > tiny amount of code... I think it amounted to 2 or 3 if statements. I > haven't yet attempted to grok this patchset, but I'm definitely > wondering how much it's worth continuing to optimize that case. Clearly > it'd be very expensive to memoize dead tuples just to trawl that list a > single time to clean the heap, but outside of that I'm not sure other > optimazations are worth it given the amount of code > complexity/duplication they seem to require - especially for code where > correctness is so crucial. Personally, I don't think throwing away that optimization is the way to go. The idea isn't intrinsically complicated, I believe. It's just that the code has become messy because of too many hands touching it. At least, that's my read. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 16, 2024 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > All LGTM. > > Committed. Attached v8 patch set is rebased over this. In 0004, I've taken the approach you seem to favor and combined the FSM updates from the prune and no prune cases in lazy_scan_heap() instead of pushing the FSM updates into lazy_scan_prune() and lazy_scan_noprune(). I did not guard against calling lazy_scan_new_or_empty() a second time in the case that lazy_scan_noprune() was called. I can do this. I mentioned upthread I found it confusing for lazy_scan_new_or_empty() to be guarded by if (do_prune). The overhead of calling it wouldn't be terribly high. I can change that based on your opinion of what is better. The big open question/functional change is when we consider vacuuming the FSM. Previously, we only considered vacuuming the FSM in the no indexes, has dead items case. After combining the FSM updates from lazy_scan_prune()'s no indexes/has lpdead items case, lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them consider vacuuming the FSM. I could guard against this, but I wasn't sure why we would want to only vacuum the FSM in the no indexes/has dead items case. I also noticed while rebasing something I missed while reviewing 45d395cd75ffc5b -- has_lpdead_items is set in a slightly different place in lazy_scan_noprune() than lazy_scan_prune() (with otherwise identical surrounding code). Both are correct, but it would have been nice for them to be the same. If the patches below are committed, we could standardize on the location in lazy_scan_noprune(). - Melanie
Attachment
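For orientation, the consolidated end-of-page handling being described would look roughly like this in lazy_scan_heap() (a sketch of the idea only, not the v8 patch; has_lpdead_items here stands for whatever either scan function reported for the page):

    if (vacrel->nindexes == 0 || !has_lpdead_items ||
        !vacrel->do_index_vacuuming)
    {
        Size        freespace = PageGetHeapFreeSpace(page);

        UnlockReleaseBuffer(buf);
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
    }
    else
    {
        /* LP_DEAD items remain; record free space after the second pass */
        UnlockReleaseBuffer(buf);
    }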
On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Attached v8 patch set is rebased over this. Reviewing 0001, consider the case where a table has no indexes. Pre-patch, PageTruncateLinePointerArray() will happen when lazy_vacuum_heap_page() is called; post-patch, it will not occur. That's a loss. Should we care? On the plus side, visibility map repair, if required, can now take place. That's a gain. I'm otherwise satisfied with this patch now, except for some extremely minor nitpicking:

+ * For now, pass mark_unused_now == false regardless of whether or

Personally, I would write "pass mark_unused_now as false" here, because we're not testing equality. Or else "pass mark_unused_now = false". This is not an equality test.

+ * During pruning, the caller may have passed mark_unused_now == true,

Again here, but also, this is referring to the name of a parameter to a function whose name is not given. I think this should either talk fully in terms of code ("When heap_page_prune was called, mark_unused_now may have been passed as true, which allows would-be LP_DEAD items to be made LP_UNUSED instead.") or fully in English ("During pruning, we may have decided to mark would-be dead items as unused."). > In 0004, I've taken the approach you seem to favor and combined the FSM > updates from the prune and no prune cases in lazy_scan_heap() instead > of pushing the FSM updates into lazy_scan_prune() and > lazy_scan_noprune(). I do like that approach. I think do_prune can be declared one scope inward, in the per-block for loop. I would probably initialize it to true so I could drop the stubby else block:

+ /* If we do get a cleanup lock, we will definitely prune */
+ else
+     do_prune = true;

And then I'd probably write the call as if (!lazy_scan_noprune()) do_prune = true. If I wanted to stick with not initializing do_prune, I'd put the comment inside as /* We got the cleanup lock, so we will definitely prune */ and add braces since that makes it a two-line block. > I did not guard against calling lazy_scan_new_or_empty() a second time > in the case that lazy_scan_noprune() was called. I can do this. I > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > to be guarded by if (do_prune). The overhead of calling it wouldn't be > terribly high. I can change that based on your opinion of what is > better. To me, the relevant question here isn't reader confusion, because that can be cleared up with a comment explaining why we do or do not test do_prune. Indeed, I'd say it's a textbook example of when you should comment a test: when it might look wrong to the reader but you have a good reason for doing it.
Like this:

bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
if (lazy_scan_new_or_empty())
    continue;
if (!got_cleanup_lock && !lazy_scan_noprune())
{
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    LockBufferForCleanup(buf);
    got_cleanup_lock = true;
}
if (got_cleanup_lock)
    lazy_scan_prune();

The current organization of the code seems to imply that we don't need to worry about the PageIsNew() and PageIsEmpty() cases before calling lazy_scan_noprune(), and at the moment I'm not understanding why that should be the case. I wonder if this is a bug, or if I'm just confused. > The big open question/functional change is when we consider vacuuming > the FSM. Previously, we only considered vacuuming the FSM in the no > indexes, has dead items case. After combining the FSM updates from > lazy_scan_prune()'s no indexes/has lpdead items case, > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > consider vacuuming the FSM. I could guard against this, but I wasn't > sure why we would want to only vacuum the FSM in the no indexes/has > dead items case. I don't get it. Conceptually, I agree that we don't want to be inconsistent here without some good reason. One of the big advantages of unifying different code paths is that you avoid being accidentally inconsistent. If different things are different it shows up as a test in the code instead of just having different code paths in different places that may or may not match. But I thought the whole point of 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing inconsistencies so that we could unify this code without having to change any more behavior. In particular, I thought we just made it consistently adhere to the principle Peter articulated, where we record free space when we're touching the page for presumptively the last time. I gather that you think it's still not consistent, but I don't understand what the remaining inconsistency is. Can you explain further? -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 17, 2024 at 12:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Attached v8 patch set is rebased over this. > > Reviewing 0001, consider the case where a table has no indexes. > Pre-patch, PageTruncateLinePointerArray() will happen when > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > That's a loss. Should we care? On the plus side, visibility map > repair, if required, can now take place. That's a gain. I thought that this wasn't an issue because heap_page_prune_execute() calls PageRepairFragmentation() which similarly modifies pd_lower and sets the hint bit about free line pointers. > I'm otherwise satisfied with this patch now, except for some extremely > minor nitpicking: > > + * For now, pass mark_unused_now == false regardless of whether or > > Personally, i would write "pass mark_unused_now as false" here, > because we're not testing equality. Or else "pass mark_unused_now = > false". This is not an equality test. > > + * During pruning, the caller may have passed mark_unused_now == true, > > Again here, but also, this is referring to the name of a parameter to > a function whose name is not given. I think this this should either > talk fully in terms of code ("When heap_page_prune was called, > mark_unused_now may have been passed as true, which allows would-be > LP_DEAD items to be made LP_USED instead.") or fully in English > ("During pruning, we may have decided to mark would-be dead items as > unused."). Fixed both of the above issues as suggested in attached v9 > > In 0004, I've taken the approach you seem to favor and combined the FSM > > updates from the prune and no prune cases in lazy_scan_heap() instead > > of pushing the FSM updates into lazy_scan_prune() and > > lazy_scan_noprune(). > > I do like that approach. > > I think do_prune can be declared one scope inward, in the per-block > for loop. I would probably initialize it to true so I could drop the > stubby else block: > > + /* If we do get a cleanup lock, we will definitely prune */ > + else > + do_prune = true; > > And then I'd probably write the call as if (!lazy_scan_noprune()) > doprune = true. > > If I wanted to stick with not initializing do_prune, I'd put the > comment inside as /* We got the cleanup lock, so we will definitely > prune */ and add braces since that makes it a two-line block. If we don't unconditionally set do_prune using the result of lazy_scan_noprune(), then we cannot leave do_prune uninitialized. I preferred having it uninitialized, as it didn't imply that doing pruning was the default. Also, it made it simpler to have that comment about always pruning when we get the cleanup lock. However, with the changes you mentioned below (got_cleanup_lock), this discussion is moot. > > I did not guard against calling lazy_scan_new_or_empty() a second time > > in the case that lazy_scan_noprune() was called. I can do this. I > > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > > to be guarded by if (do_prune). The overhead of calling it wouldn't be > > terribly high. I can change that based on your opinion of what is > > better. > > To me, the relevant question here isn't reader confusion, because that > can be cleared up with a comment explaining why we do or do not test > do_prune. Indeed, I'd say it's a textbook example of when you should > comment a test: when it might look wrong to the reader but you have a > good reason for doing it. 
> > But that brings me to what I think the real question is here: do we, > uh, have a good reason for doing it? At first blush the structure > looks a bit odd here. lazy_scan_new_or_empty() is intended to handle > PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases > where we process the page without a cleanup lock, and > lazy_scan_prune() the regular case. So you might think that > lazy_scan_new_or_empty() would always be applied *first*, that we > would then conditionally apply lazy_scan_noprune(), and finally > conditionally apply lazy_scan_prune(). Like this: > > bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf); > if (lazy_scan_new_or_empty()) > continue; > if (!got_cleanup_lock && !lazy_scan_noprune()) > { > LockBuffer(buf, BUFFER_LOCK_UNLOCK); > LockBufferForCleanup(buf); > got_cleanup_lock = true; > } > if (got_cleanup_lock) > lazy_scan_prune(); > > The current organization of the code seems to imply that we don't need > to worry about the PageIsNew() and PageIsEmpty() cases before calling > lazy_scan_noprune(), and at the moment I'm not understanding why that > should be the case. I wonder if this is a bug, or if I'm just > confused. Yes, I also spent some time thinking about this. In master, we do always call lazy_scan_new_or_empty() before calling lazy_scan_noprune(). The code is aiming to ensure we call lazy_scan_new_or_empty() once before calling either of lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call lazy_scan_new_or_empty() unconditionally first because of the different lock types expected. But, your structure has solved that. I've used a version of your example code above in attached v9. It is much nicer. > > The big open question/functional change is when we consider vacuuming > > the FSM. Previously, we only considered vacuuming the FSM in the no > > indexes, has dead items case. After combining the FSM updates from > > lazy_scan_prune()'s no indexes/has lpdead items case, > > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > > consider vacuuming the FSM. I could guard against this, but I wasn't > > sure why we would want to only vacuum the FSM in the no indexes/has > > dead items case. > > I don't get it. Conceptually, I agree that we don't want to be > inconsistent here without some good reason. One of the big advantages > of unifying different code paths is that you avoid being accidentally > inconsistent. If different things are different it shows up as a test > in the code instead of just having different code paths in different > places that may or may not match. > > But I thought the whole point of > 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing > inconsistencies so that we could unify this code without having to > change any more behavior. In particular, I thought we just made it > consistently adhere to the principle Peter articulated, where we > record free space when we're touching the page for presumptively the > last time. I gather that you think it's still not consistent, but I > don't understand what the remaining inconsistency is. Can you explain > further? Ah, I realize I was not clear. I am now talking about inconsistencies in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating the freespace map during the course of vacuuming the heap relation. - Melanie
Attachment
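To keep the two operations straight (the function names are the real ones; the calls are shown out of context as a sketch):

    /* Updating the FSM: record one heap page's free space in its leaf slot */
    RecordPageWithFreeSpace(vacrel->rel, blkno, PageGetHeapFreeSpace(page));

    /*
     * Vacuuming the FSM: propagate leaf-level updates up the FSM tree for a
     * range of blocks, so that searches for free space can actually see the
     * newly recorded values.
     */
    FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);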
On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Reviewing 0001, consider the case where a table has no indexes. > > Pre-patch, PageTruncateLinePointerArray() will happen when > > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > > That's a loss. Should we care? On the plus side, visibility map > > repair, if required, can now take place. That's a gain. > > I thought that this wasn't an issue because heap_page_prune_execute() > calls PageRepairFragmentation() which similarly modifies pd_lower and > sets the hint bit about free line pointers. Ah, OK, I didn't understand that PageRepairFragmentation() does what is also done by PageTruncateLinePointerArray(). > Yes, I also spent some time thinking about this. In master, we do > always call lazy_scan_new_or_empty() before calling > lazy_scan_noprune(). The code is aiming to ensure we call > lazy_scan_new_or_empty() once before calling either of > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > lazy_scan_new_or_empty() unconditionally first because of the > different lock types expected. But, your structure has solved that. > I've used a version of your example code above in attached v9. It is > much nicer. Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was called either way. Glad that my proposed restructuring managed to be helpful despite that confusion, though. :-) At a quick glance, I also like the way this looks. I'll review it more thoroughly later. Does this patch require 0002 and 0003 or could it equally well go first? I confess that I don't entirely understand why we want 0002 and 0003. > Ah, I realize I was not clear. I am now talking about inconsistencies > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > the freespace map during the course of vacuuming the heap relation. Fair enough, but I'm still not quite sure exactly what the question is. It looks to me like the current code, when there are indexes, vacuums the FSM after each round of index vacuuming. When there are no indexes, doing it after each round of index vacuuming would mean never doing it, so instead we vacuum the FSM every ~8GB. I assume what happened here is that somebody decided doing it after each round of index vacuuming was the "right thing," and then realized that was not going to work if no index vacuuming was happening, and so inserted the 8GB threshold to cover that case. I don't really know what to make of all of this. On a happy PostgreSQL system, doing anything after each round of index vacuuming means doing it once, because multiple rounds of indexing vacuum are extremely expensive and we hope that it won't ever occur. From that point of view, the 8GB threshold is better, because it means that when we vacuum a large relation, space should become visible to the rest of the system incrementally without needing to wait for the entire vacuum to finish. On the other hand, we also have this idea that we want to record free space in the FSM once, after the last time we touch the page. Given that behavior, vacuuming the FSM every 8GB when we haven't yet done index vacuuming wouldn't accomplish much of anything, because we haven't updated it for the pages we just touched. On the third hand, the current behavior seems slightly ridiculous, because pruning the page is where we're mostly going to free up space, so we might be better off just updating the FSM then instead of waiting. 
That free space could be mighty useful during the long wait between pruning and marking line pointers unused. On the fourth hand, that seems like a significant behavior change that we might not want to undertake without a bunch of research that we might not want to do right now -- and if we did do it, should we then update the FSM a second time after marking line pointers unused? I'm not sure if any of this is answering your actual question, though. -- Robert Haas EDB: http://www.enterprisedb.com
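For readers following along, here is a minimal sketch of the ~8GB cadence being discussed. It is reconstructed from memory of vacuumlazy.c rather than quoted from it, and the variable names (blkno, next_fsm_block_to_vacuum, has_lpdead_items) and the exact placement inside lazy_scan_heap()'s per-block loop are assumptions based on that file's conventions:

    /*
     * In the no-indexes case there is no round of index vacuuming to hang
     * the FSM vacuum on, so the FSM itself is vacuumed periodically, making
     * newly recorded free space visible on upper FSM pages before the whole
     * VACUUM finishes.  VACUUM_FSM_EVERY_PAGES works out to roughly 8GB
     * worth of blocks.
     */
    if (vacrel->nindexes == 0 && has_lpdead_items &&
        blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
    {
        FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
        next_fsm_block_to_vacuum = blkno;
    }

Note that blkno advances over every block in the relation, including blocks skipped via the visibility map -- the looseness Peter raises in the next message.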
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Note that VACUUM_FSM_EVERY_PAGES is applied against the number of rel_pages "processed" so far -- *including* any pages that were skipped using the visibility map. It would make a bit more sense if it was applied against scanned_pages instead (just like FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only a very loose relationship with physical work performed/time elapsed. I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the wrong idea. If it's such a good idea then why not apply it all the time? That is, why not apply it independently of whether nindexes==0 in the current VACUUM operation? (You know, just like with FAILSAFE_EVERY_PAGES.) -- Peter Geoghegan
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > wrong idea. If it's such a good idea then why not apply it all the > time? That is, why not apply it independently of whether nindexes==0 > in the current VACUUM operation? (You know, just like with > FAILSAFE_EVERY_PAGES.) Actually, I suppose that we couldn't apply it independently of nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our second pass over the heap takes place for those LP_DEAD-containing heap pages scanned since the last round of index/heap vacuuming took place (or since VACUUM began). We need to make sure that the FSM has the most recent possible information known to VACUUM, which would break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. -- Peter Geoghegan
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > Yes, I also spent some time thinking about this. In master, we do > > always call lazy_scan_new_or_empty() before calling > > lazy_scan_noprune(). The code is aiming to ensure we call > > lazy_scan_new_or_empty() once before calling either of > > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > > lazy_scan_new_or_empty() unconditionally first because of the > > different lock types expected. But, your structure has solved that. > > I've used a version of your example code above in attached v9. It is > > much nicer. > > Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was > called either way. Glad that my proposed restructuring managed to be > helpful despite that confusion, though. :-) > > At a quick glance, I also like the way this looks. I'll review it more > thoroughly later. Does this patch require 0002 and 0003 or could it > equally well go first? I confess that I don't entirely understand why > we want 0002 and 0003. Well, 0002 and 0003 move the updates to the visibility map into lazy_scan_prune(). We only want to update the VM if we called lazy_scan_prune() (i.e. not if lazy_scan_noprune() returned true). We also need the lock on the heap page when updating the visibility map but we want to have released the lock before updating the FSM, so we need to first update the VM then the FSM. The VM update code, in my opinion, belongs in lazy_scan_prune() -- since it is only done when lazy_scan_prune() is called. To keep the VM update code in lazy_scan_heap() and still consolidate the FSM update code, we would have to surround all of the VM update code in a test (if got_cleanup_lock, I suppose). I don't see any advantage in doing that. > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Ah, I see. I understood that we want to update the FSM every 8GB, but I didn't understand that we wanted to check if we were at that 8GB only after a round of index vacuuming. That would explain why we also had to do it in the no indexes case -- because, as you say, there wouldn't be a round of index vacuuming. This does mean that something is not quite right with 0001 as well as 0004. We'd end up checking if we are at 8GB much more often. I should probably find a way to replicate the cadence on master. > I don't really know what to make of > all of this. On a happy PostgreSQL system, doing anything after each > round of index vacuuming means doing it once, because multiple rounds > of indexing vacuum are extremely expensive and we hope that it won't > ever occur. 
From that point of view, the 8GB threshold is better, > because it means that when we vacuum a large relation, space should > become visible to the rest of the system incrementally without needing > to wait for the entire vacuum to finish. On the other hand, we also > have this idea that we want to record free space in the FSM once, > after the last time we touch the page. Given that behavior, vacuuming > the FSM every 8GB when we haven't yet done index vacuuming wouldn't > accomplish much of anything, because we haven't updated it for the > pages we just touched. On the third hand, the current behavior seems > slightly ridiculous, because pruning the page is where we're mostly > going to free up space, so we might be better off just updating the > FSM then instead of waiting. That free space could be mighty useful > during the long wait between pruning and marking line pointers unused. > On the fourth hand, that seems like a significant behavior change that > we might not want to undertake without a bunch of research that we > might not want to do right now -- and if we did do it, should we then > update the FSM a second time after marking line pointers unused? I suspect we'd need to do some testing of various scenarios to justify such a change. - Melanie
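As a rough illustration of the ordering constraint Melanie describes (the VM must be updated while the heap page is still locked, the FSM only after the lock is dropped), the consolidated path discussed here might look something like the sketch below. It is based on the patch discussion, not on committed code, and the condition for updating the FSM now versus in the second pass paraphrases the comment quoted later in the thread:

    if (got_cleanup_lock)
        lazy_scan_prune(vacrel, buf, blkno, page, vmbuffer,
                        all_visible_according_to_vm, &has_lpdead_items);
    /* any VM bit is set inside lazy_scan_prune(), with the buffer lock held */

    if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || !has_lpdead_items)
    {
        /* no second heap pass will touch this page: record free space now */
        Size        freespace = PageGetHeapFreeSpace(page);

        UnlockReleaseBuffer(buf);   /* drop the lock before touching the FSM */
        RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
    }
    else
        UnlockReleaseBuffer(buf);   /* defer the FSM update to the second pass */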
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jan 17, 2024 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Ah, I realize I was not clear. I am now talking about inconsistencies > > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > > the freespace map during the course of vacuuming the heap relation. > > > > Fair enough, but I'm still not quite sure exactly what the question > > is. It looks to me like the current code, when there are indexes, > > vacuums the FSM after each round of index vacuuming. When there are no > > indexes, doing it after each round of index vacuuming would mean never > > doing it, so instead we vacuum the FSM every ~8GB. I assume what > > happened here is that somebody decided doing it after each round of > > index vacuuming was the "right thing," and then realized that was not > > going to work if no index vacuuming was happening, and so inserted the > > 8GB threshold to cover that case. > > Note that VACUUM_FSM_EVERY_PAGES is applied against the number of > rel_pages "processed" so far -- *including* any pages that were > skipped using the visibility map. It would make a bit more sense if it > was applied against scanned_pages instead (just like > FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other > words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only > a very loose relationship with physical work performed/time elapsed. This is a good point. Seems like a very reasonable change to make, as I would think that was the original intent. - Melanie
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > wrong idea. If it's such a good idea then why not apply it all the > > time? That is, why not apply it independently of whether nindexes==0 > > in the current VACUUM operation? (You know, just like with > > FAILSAFE_EVERY_PAGES.) > > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I now see I misunderstood and my earlier email was wrong. I didn't notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes == 0. So, in master, we call FreeSpaceMapVacuumRange() always after a round of index vacuuming and periodically if there are no indexes. It seems like you are asking whether or not we should vacuum the FSM at a different cadence for the no indexes case (and potentially count blocks actually vacuumed instead of blocks considered). And it seems like Robert is asking whether or not we should FreeSpaceMapVacuumRange() more frequently than after index vacuuming in the nindexes > 0 case. Other than the overhead of the actual vacuuming of the FSM, what are the potential downsides of knowing about freespace sooner? It could change what pages are inserted to. What are the possible undesirable side effects? - Melanie
On Wed, Jan 17, 2024 at 5:47 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > > wrong idea. If it's such a good idea then why not apply it all the > > > time? That is, why not apply it independently of whether nindexes==0 > > > in the current VACUUM operation? (You know, just like with > > > FAILSAFE_EVERY_PAGES.) > > > > Actually, I suppose that we couldn't apply it independently of > > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > > second pass over the heap takes place for those LP_DEAD-containing > > heap pages scanned since the last round of index/heap vacuuming took > > place (or since VACUUM began). We need to make sure that the FSM has > > the most recent possible information known to VACUUM, which would > > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. > > I now see I misunderstood and my earlier email was wrong. I didn't > notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes ==0. > So, in master, we call FreeSpaceMapVacuumRange() always after a round > of index vacuuming and periodically if there are no indexes. The "nindexes == 0" if() that comes just after our call to lazy_scan_prune() is "the one-pass equivalent of a call to lazy_vacuum()". Though this includes the call to FreeSpaceMapVacuumRange() that immediately follows the two-pass case calling lazy_vacuum(), too. > It seems like you are asking whether not we should vacuum the FSM at a > different cadence for the no indexes case (and potentially count > blocks actually vacuumed instead of blocks considered). > > And it seems like Robert is asking whether or not we should > FreeSpaceMapVacuumRange() more frequently than after index vacuuming > in the nindexes > 0 case. There is no particular reason for the nindexes==0 case to care about how often we'd call FreeSpaceMapVacuumRange() in the counterfactual world where the same VACUUM ran on the same table, except that it was nindexes>1 instead. At least I don't see any. > Other than the overhead of the actual vacuuming of the FSM, what are > the potential downsides of knowing about freespace sooner? It could > change what pages are inserted to. What are the possible undesirable > side effects? The whole VACUUM_FSM_EVERY_PAGES thing comes from commit 851a26e266. The commit message of that work seems to suppose that calling FreeSpaceMapVacuumRange() more frequently is pretty much strictly better than calling it less frequently, at least up to the point where certain more-or-less fixed costs paid once per FreeSpaceMapVacuumRange() start to become a problem. I think that that's probably about right. The commit message also says that we "arbitrarily update upper FSM pages after each 8GB of heap" (in the nindexes==0 case). So VACUUM_FSM_EVERY_PAGES is only very approximately analogous to what we do in the nindexes>1 case. That seems reasonable because these two cases really aren't so comparable in terms of the FSM vacuuming requirements -- the nindexes==0 case legitimately doesn't have the same dependency on heap vacuuming (and index vacuuming) that we have to consider when nindexes>1. -- Peter Geoghegan
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I agree with all of this. I thought I'd said all of this, actually, in my prior email, but perhaps it wasn't as clear as it needed to be. But I also said one more thing that I'd still like to hear your thoughts about, which is: why is it right to update the FSM after the second heap pass rather than the first one? I can't help but suspect this is an algorithmic holdover from pre-HOT days, when VACUUM's first heap pass was read-only and all the work happened in the second pass. Now, nearly all of the free space that will ever become free becomes free in the first pass, so why not advertise it then, instead of waiting? Admittedly, HOT is not yet 15 years old, so maybe it's too soon to adapt our VACUUM algorithm for it. *wink* -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 17, 2024 at 5:33 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > This does mean that something is not quite right with 0001 as well as > 0004. We'd end up checking if we are at 8GB much more often. I should > probably find a way to replicate the cadence on master. I believe I've done this in attached v10. - Melanie
Attachment
On Thu, Jan 18, 2024 at 8:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > But I also said one more thing that I'd still like to hear your > thoughts about, which is: why is it right to update the FSM after the > second heap pass rather than the first one? I can't help but suspect > this is an algorithmic holdover from pre-HOT days, when VACUUM's first > heap pass was read-only and all the work happened in the second pass. > Now, nearly all of the free space that will ever become free becomes > free in the first pass, so why not advertise it then, instead of > waiting? I don't think that doing everything FSM-related in the first heap pass is a bad idea -- especially not if it buys you something elsewhere. The problem with your justification for moving things in that direction (if any) is that it is occasionally not quite true: there are at least some cases where line pointer truncation after making a page's LP_DEAD items -> LP_UNUSED will actually matter. Plus PageGetHeapFreeSpace() will return 0 if and when "PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage && !PageHasFreeLinePointers(page)". Of course, nothing stops you from compensating for this by anticipating what will happen later on, and assuming that the page already has that much free space. It might even be okay to just not try to compensate for anything, PageGetHeapFreeSpace-wise -- just do all FSM stuff in the first heap pass, and ignore all this. I happen to believe that a FSM_CATEGORIES of 256 is way too much granularity to be useful in practice -- I just don't have any faith in the idea that that kind of granularity is useful (it's quite the opposite). A further justification might be what we already do in the heapam.c REDO routines: the way that we use XLogRecordPageWithFreeSpace already operates with far less precision than the corresponding code from vacuumlazy.c. heap_xlog_prune() already has recovery do what you propose to do during original execution; it doesn't try to avoid duplicating an anticipated call to XLogRecordPageWithFreeSpace that'll take place when heap_xlog_vacuum() runs against the same page a bit later on. You'd likely prefer a simpler argument for doing this -- an argument that doesn't require abandoning/discrediting the idea that a high degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure that that's possible -- the current design is at least correct on its own terms. And what you propose to do will probably be less correct on those same terms, silly though they are. -- Peter Geoghegan
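To make the PageGetHeapFreeSpace() point concrete, here is a simplified paraphrase, written from memory of the bufpage.c logic Peter refers to rather than copied from it; the helper name is invented for illustration:

    /*
     * Once a heap page carries its maximum complement of line pointers
     * (MaxHeapTuplesPerPage, 291 with 8kB pages and default alignment) and
     * none of them are reusable, the page reports zero usable free space no
     * matter how much raw space sits between pd_lower and pd_upper.
     */
    static Size
    usable_heap_free_space(Page page)   /* illustrative, not a real API */
    {
        Size        space = PageGetFreeSpace(page);

        if (space > 0 &&
            PageGetMaxOffsetNumber(page) >= MaxHeapTuplesPerPage &&
            !PageHasFreeLinePointers(page))
            space = 0;

        return space;
    }

Only after the second heap pass sets those LP_DEAD items LP_UNUSED (possibly truncating the line pointer array along the way) does the reported free space jump back up.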
On Thu, Jan 18, 2024 at 9:53 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I believe I've done this in attached v10. Oh, I see. Good catch. I've now committed 0001. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > The problem with your justification for moving things in that > direction (if any) is that it is occasionally not quite true: there > are at least some cases where line pointer truncation after making a > page's LP_DEAD items -> LP_UNUSED will actually matter. Plus > PageGetHeapFreeSpace() will return 0 if and when > "PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage && > !PageHasFreeLinePointers(page)". Of course, nothing stops you from > compensating for this by anticipating what will happen later on, and > assuming that the page already has that much free space. I think we're agreeing but I want to be sure. If we only set LP_DEAD items to LP_UNUSED, that frees no space. But if doing so allows us to truncate the line pointer array, then that frees a little bit of space. Right? One problem with using this as a justification for the status quo is that truncating the line pointer array is a relatively recent behavior. It's certainly much newer than the choice to have VACUUM touch the FSM in the second heap pass rather than the first. Another problem is that the amount of space that we're freeing up in the second pass is really quite minimal even when it's >0. Any tuple that actually contains any data at all is at least 32 bytes, and most of them are quite a bit larger. Item pointers are 2 bytes. To save enough space to fit even one additional tuple, we'd have to free *at least* 16 line pointers. That's going to be really rare. And even if it happens, is it even useful to advertise that free space? Do we want to cram one more tuple into a page that has a history of extremely heavy updates? Could it be that it's smarter to just forget about that free space? You've written before about the stupidity of cramming tuples of different generations into the same page, and that concept seems to apply here. When we heap_page_prune(), we don't know how much time has elapsed since the page was last modified - but if we're lucky, it might not be very much. Updating the FSM at that time gives us some shot of filling up the page with data created around the same time as the existing page contents. By the time we vacuum the indexes and come back, that temporal locality is definitely lost. > You'd likely prefer a simpler argument for doing this -- an argument > that doesn't require abandoning/discrediting the idea that a high > degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure > that that's possible -- the current design is at least correct on its > own terms. And what you propose to do will probably be less correct on > those same terms, silly though they are. I've never really understood why you think that the number of FSM_CATEGORIES is the problem. I believe I recall you endorsing a system where pages are open or closed, to try to achieve temporal locality of data. I agree that such a system could work better than what we have now. I think there's a risk that such a system could create pathological cases where the behavior is much worse than what we have today, and I think we'd need to consider carefully what such cases might exist and what mitigation strategies might make sense. However, I don't see a reason why such a system should intrinsically want to reduce FSM_CATEGORIES. If we have two open pages and one of them has enough space for the tuple we're now trying to insert and the other doesn't, we'd still like to avoid having the FSM hand us the one that doesn't. 
Now, that said, I suspect that we actually could reduce FSM_CATEGORIES somewhat without causing any real problems, because many tables are going to have tuples that are all about the same size, and even in a table where the sizes vary more than is typical, a single tuple can't consume more than a quarter of the page, so granularity above that point seems completely useless. So if we needed some bitspace to track the open/closed status of pages or similar, I suspect we could find that in the existing FSM byte per page without losing anything. But all of that is just an argument that reducing the number of FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that it's better. My current belief is that it isn't better, just a vehicle to do something else that maybe is better, like squeezing open/closed tracking or similar into the existing bit space. My understanding is that you think it would be better on its own terms, but I have not yet been able to grasp why that would be so. -- Robert Haas EDB: http://www.enterprisedb.com
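A quick back-of-the-envelope check of the numbers in this exchange. The constants assume the stock values mentioned above (an 8192-byte BLCKSZ, 256 FSM categories); this is standalone arithmetic, not PostgreSQL code:

    #include <stdio.h>

    #define BLCKSZ          8192
    #define FSM_CATEGORIES  256
    #define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes per category */

    int
    main(void)
    {
        /* Each category covers 32 bytes -- finer than the smallest heap tuple. */
        printf("bytes per FSM category: %d\n", FSM_CAT_STEP);

        /* A tuple normally tops out around a quarter of a page, so only the
         * lower quarter of the categories does much distinguishing. */
        printf("categories at or below a quarter page: %d of %d\n",
               (BLCKSZ / 4) / FSM_CAT_STEP, FSM_CATEGORIES);

        /* Robert's arithmetic: 2-byte line pointers vs. a >= 32-byte tuple
         * means reclaiming at least 16 line pointers per extra tuple. */
        printf("line pointers freed per extra minimal tuple: %d\n", 32 / 2);
        return 0;
    }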
On Thu, Jan 18, 2024 at 10:42 AM Robert Haas <robertmhaas@gmail.com> wrote: > Now, that said, I suspect that we actually could reduce FSM_CATEGORIES > somewhat without causing any real problems, because many tables are > going to have tuples that are all about the same size, and even in a > table where the sizes vary more than is typical, a single tuple can't > consume more than a quarter of the page, Actually, I think that's a soft limit, not a hard limit. But the granularity above that level probably doesn't need to be very high, at least. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:43 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think we're agreeing but I want to be sure. If we only set LP_DEAD > items to LP_UNUSED, that frees no space. But if doing so allows us to > truncate the line pointer array, that that frees a little bit of > space. Right? That's part of it, yes. > One problem with using this as a justification for the status quo is > that truncating the line pointer array is a relatively recent > behavior. It's certainly much newer than the choice to have VACUUM > touch the FSM in the second page than the first page. True. But the way that PageGetHeapFreeSpace() returns 0 for a page with 291 LP_DEAD stubs is a much older behavior. When that happens it is literally true that the page has lots of free space. And yet it's not free space we can actually use. Not until those LP_DEAD items are marked LP_UNUSED. > Another problem is that the amount of space that we're freeing up in > the second pass is really quite minimal even when it's >0. Any tuple > that actually contains any data at all is at least 32 bytes, and most > of them are quite a bit larger. Item pointers are 2 bytes. To save > enough space to fit even one additional tuple, we'd have to free *at > least* 16 line pointers. That's going to be really rare. I basically agree with this. I would still worry about the "291 LP_DEAD stubs makes PageGetHeapFreeSpace return 0" thing specifically, though. It's sort of a special case. > And even if it happens, is it even useful to advertise that free > space? Do we want to cram one more tuple into a page that has a > history of extremely heavy updates? Could it be that it's smarter to > just forget about that free space? I think so, yes. Another big source of inaccuracies here is that we don't credit RECENTLY_DEAD tuple space with being free space. Maybe that isn't a huge problem, but it makes it even harder to believe that precision in FSM accounting is an intrinsic good. > > You'd likely prefer a simpler argument for doing this -- an argument > > that doesn't require abandoning/discrediting the idea that a high > > degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure > > that that's possible -- the current design is at least correct on its > > own terms. And what you propose to do will probably be less correct on > > those same terms, silly though they are. > > I've never really understood why you think that the number of > FSM_CATEGORIES is the problem. I believe I recall you endorsing a > system where pages are open or closed, to try to achieve temporal > locality of data. My remarks about "FSM_CATEGORIES-wise precision" were basically remarks about the fundamental problem with the free space map. Which is really that it's just a map of free space, that gives exactly zero thought to various high level things that *obviously* matter. I wasn't particularly planning on getting into the specifics of that with you now, on this thread. A brief recap might be useful: other systems with a heap table AM free space management structure typically represent the free space available on each page using a far more coarse grained counter. Usually one with less than 10 distinct increments. The immediate problem with FSM_CATEGORIES having such a fine granularity is that it increases contention/competition among backends that need to find some free space for a new tuple. 
They'll all diligently try to find the page with the least free space that still satisfies their immediate needs -- there is no thought for the second-order effects, which are really important in practice. > But all of that is just an argument that reducing the number of > FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that > it's better. My current belief is that it isn't better, just a vehicle > to do something else that maybe is better, like squeezing open/closed > tracking or similar into the existing bit space. My understanding is > that you think it would be better on its own terms, but I have not yet > been able to grasp why that would be so. I'm not really arguing that reducing FSM_CATEGORIES and changing nothing else would be better on its own (it might be, but that's not what I meant to convey). What I really wanted to convey is this: if you're going to go the route of ignoring LP_DEAD free space during vacuuming, you're conceding that having a high degree of precision about available free space isn't actually useful (or wouldn't be useful if it was actually possible at all). Which is something that I generally agree with. I'd just like it to be clear that you/Melanie are in fact taking one small step in that direction. We don't need to discuss possible later steps beyond that first step. Not right now. -- Peter Geoghegan
On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > True. But the way that PageGetHeapFreeSpace() returns 0 for a page > with 291 LP_DEAD stubs is a much older behavior. When that happens it > is literally true that the page has lots of free space. And yet it's > not free space we can actually use. Not until those LP_DEAD items are > marked LP_UNUSED. To me, this is just accurate reporting. What we care about in this context is the amount of free space on the page that can be used to store a new tuple. When there are no line pointers available to be allocated, that amount is 0. > Another big source of inaccuracies here is that we don't credit > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a > huge problem, but it makes it even harder to believe that precision in > FSM accounting is an intrinsic good. The difficulty here is that we don't know how long it will be before that space can be reused. Those recently dead tuples could become dead within a few milliseconds or stick around for hours. I've wondered about the merits of some FSM that had built-in visibility awareness, i.e. the capability to record something like "page X currently has Y space free and after XID Z is all-visible it will have Y' space free". That seems complex, but without it, we either have to bet that the space will actually become free before anyone tries to use it, or that it won't. If whatever guess we make is wrong, bad things happen. > My remarks about "FSM_CATEGORIES-wise precision" were basically > remarks about the fundamental problem with the free space map. Which > is really that it's just a map of free space, that gives exactly zero > thought to various high level things that *obviously* matter. I wasn't > particularly planning on getting into the specifics of that with you > now, on this thread. Fair. > A brief recap might be useful: other systems with a heap table AM free > space management structure typically represent the free space > available on each page using a far more coarse grained counter. > Usually one with less than 10 distinct increments. The immediate > problem with FSM_CATEGORIES having such a fine granularity is that it > increases contention/competition among backends that need to find some > free space for a new tuple. They'll all diligently try to find the > page with the least free space that still satisfies their immediate > needs -- there is no thought for the second-order effects, which are > really important in practice. I think that the completely deterministic nature of the computation is a mistake regardless of anything else. That serves to focus contention rather than spreading it out, which is dumb, and would still be dumb with any other number of FSM_CATEGORIES. > What I really wanted to convey is this: if you're going to go the > route of ignoring LP_DEAD free space during vacuuming, you're > conceding that having a high degree of precision about available free > space isn't actually useful (or wouldn't be useful if it was actually > possible at all). Which is something that I generally agree with. I'd > just like it to be clear that you/Melanie are in fact taking one small > step in that direction. We don't need to discuss possible later steps > beyond that first step. Not right now. Yeah. 
I'm not sure we're actually going to change that right now, but I agree with the high-level point regardless, which I would summarize like this: The current system provides more precision about available free space than we actually need, while failing to provide some other things that we really do need. We need not agree today on exactly what those other things are or how best to get them in order to agree that the current system has significant flaws, and we do agree that it does. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 11:46 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > True. But the way that PageGetHeapFreeSpace() returns 0 for a page > > with 291 LP_DEAD stubs is a much older behavior. When that happens it > > is literally true that the page has lots of free space. And yet it's > > not free space we can actually use. Not until those LP_DEAD items are > > marked LP_UNUSED. > > To me, this is just accurate reporting. What we care about in this > context is the amount of free space on the page that can be used to > store a new tuple. When there are no line pointers available to be > allocated, that amount is 0. I agree. All I'm saying is this (can't imagine you'll disagree): It's not okay if you fail to update the FSM a second time in the second heap pass -- at least in some cases. It's reasonably frequent for a page that has 0 usable free space when lazy_scan_prune returns to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page() is done with it. While I am sympathetic to the argument that LP_DEAD item space just isn't that important in general, that doesn't apply with this one special case. This is a "step function" behavior, and is seen whenever VACUUM runs following bulk deletes of tuples -- a rather common case. Clearly the FSM shouldn't show pages that are actually completely empty at the end of VACUUM as having no available free space (on account of how they looked immediately after lazy_scan_prune ran). That'd just be wrong. > > Another big source of inaccuracies here is that we don't credit > > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a > > huge problem, but it makes it even harder to believe that precision in > > FSM accounting is an intrinsic good. > > The difficulty here is that we don't know how long it will be before > that space can be reused. Those recently dead tuples could become dead > within a few milliseconds or stick around for hours. I've wondered > about the merits of some FSM that had built-in visibility awareness, > i.e. the capability to record something like "page X currently has Y > space free and after XID Z is all-visible it will have Y' space free". > That seems complex, but without it, we either have to bet that the > space will actually become free before anyone tries to use it, or that > it won't. If whatever guess we make is wrong, bad things happen. All true -- it is rather complex. Other systems with a heap table access method based on a foundation of 2PL (Oracle, DB2) literally need a transactionally consistent FSM structure. In fact I believe that Oracle literally requires the equivalent of an MVCC snapshot read (a "consistent get") to be able to access what seems like it ought to be strictly a physical data structure correctly. Everything needs to work in the rollback path, independent of whatever else may happen to the page before an xact rolls back (i.e. independently of what other xacts might end up doing with the page). This requires very tight coordination to avoid bugs where a transaction cannot roll back due to not having enough free space to restore the original tuple during UNDO. I don't think it's desirable to have anything as delicate as that here. But some rudimentary understanding of free space being allocated/leased to certain transactions and/or backends does seem like a good idea. 
There is some intrinsic value to these sorts of behaviors, even in a system without any UNDO segments, where it is never strictly necessary. > I think that the completely deterministic nature of the computation is > a mistake regardless of anything else. That serves to focus contention > rather than spreading it out, which is dumb, and would still be dumb > with any other number of FSM_CATEGORIES. That's a part of the problem too, I guess. The actual available free space on each page is literally changing all the time, when measured at FSM_CATEGORIES-wise granularity -- which leads to a mad dash among backends that all need the same amount of free space for their new tuple. One reason why other systems pretty much require coarse-grained increments of free space is the need to manage the WAL overhead for a crash-safe FSM/free list structure. > Yeah. I'm not sure we're actually going to change that right now, but > I agree with the high-level point regardless, which I would summarize > like this: The current system provides more precision about available > free space than we actually need, while failing to provide some other > things that we really do need. We need not agree today on exactly what > those other things are or how best to get them in order to agree that > the current system has significant flaws, and we do agree that it > does. I agree with this. -- Peter Geoghegan
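A sketch of the second-pass step Peter is describing, reconstructed from memory of lazy_vacuum_heap_rel(); the names and the lazy_vacuum_heap_page() signature (including its returned dead_items index) are assumptions, not quoted source:

    Buffer      buf;
    Size        freespace;

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                             vacrel->bstrategy);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* Set this page's LP_DEAD items LP_UNUSED, advancing through dead_items */
    index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);

    /* Measure free space only after reaping: on a bulk-deleted page it can
     * jump from roughly 0 to nearly BLCKSZ */
    freespace = PageGetHeapFreeSpace(BufferGetPage(buf));

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);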
On Thu, Jan 18, 2024 at 12:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > It's not okay if you fail to update the FSM a second time in the > second heap pass -- at least in some cases. It's reasonably frequent > for a page that has 0 usable free space when lazy_scan_prune returns > to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page() > is done with it. Oh, good point. That's an important subtlety I had missed. > That's a part of the problem too, I guess. > > The actual available free space on each page is literally changing all > the time, when measured at FSM_CATEGORIES-wise granularity -- which > leads to a mad dash among backends that all need the same amount of > free space for their new tuple. One reason why other systems pretty > much require coarse-grained increments of free space is the need to > manage the WAL overhead for a crash-safe FSM/free list structure. Interesting. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 10:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > Oh, I see. Good catch. > > I've now committed 0001. I have now also committed 0002 and 0003. I made some modifications to 0003. Specifically: - I moved has_lpdead_items inside the loop over blocks instead of putting it at the function toplevel. - I adjusted the comments in lazy_scan_prune() because it seemed to me that the comment about "Now scan the page..." had gotten too far separated from the loop where that happens. - I combined two lines in an if-test because one of them was kinda short. Hope that's OK with you. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 18, 2024 at 3:20 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 10:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Oh, I see. Good catch. > > > > I've now committed 0001. > > I have now also committed 0002 and 0003. I made some modifications to > 0003. Specifically: > > - I moved has_lpdead_items inside the loop over blocks instead of > putting it at the function toplevel. > - I adjusted the comments in lazy_scan_prune() because it seemed to me > that the comment about "Now scan the page..." had gotten too far > separated from the loop where that happens. > - I combined two lines in an if-test because one of them was kinda short. > > Hope that's OK with you. Awesome, thanks! I have attached a rebased version of the former 0004 as v11-0001. - Melanie
Attachment
On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I have attached a rebased version of the former 0004 as v11-0001. This looks correct to me, although I wouldn't mind some more eyes on it. However, I think that the comments still need more work. Specifically: /* * Prune, freeze, and count tuples. * * Accumulates details of remaining LP_DEAD line pointers on page in * dead_items array. This includes LP_DEAD line pointers that we * pruned ourselves, as well as existing LP_DEAD line pointers that * were pruned some time earlier. Also considers freezing XIDs in the * tuple headers of remaining items with storage. It also determines * if truncating this block is safe. */ - lazy_scan_prune(vacrel, buf, blkno, page, - vmbuffer, all_visible_according_to_vm, - &has_lpdead_items); + if (got_cleanup_lock) + lazy_scan_prune(vacrel, buf, blkno, page, + vmbuffer, all_visible_according_to_vm, + &has_lpdead_items); I think this comment needs adjusting. Possibly, the preceding calls to lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit better comments, but in those cases, you're basically keeping the same code with the same comment, so it's kinda defensible. Here, though, you're making the call conditional without any comment update. /* * Final steps for block: drop cleanup lock, record free space in the * FSM. * * If we will likely do index vacuuming, wait until * lazy_vacuum_heap_rel() to save free space. This doesn't just save * us some cycles; it also allows us to record any additional free * space that lazy_vacuum_heap_page() will make available in cases * where it's possible to truncate the page's line pointer array. * + * Our goal is to update the freespace map the last time we touch the + * page. If the relation has no indexes, or if index vacuuming is + * disabled, there will be no second heap pass; if this particular + * page has no dead items, the second heap pass will not touch this + * page. So, in those cases, update the FSM now. + * * Note: It's not in fact 100% certain that we really will call * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index * vacuuming (and so must skip heap vacuuming). This is deemed okay * because it only happens in emergencies, or when there is very * little free space anyway. (Besides, we start recording free space * in the FSM once index vacuuming has been abandoned.) */ I think this comment needs a rewrite, not just sticking the other comment in the middle of it. There's some duplication between these two comments, and merging it all together should iron that out. Personally, I think my comment (which was there before, this commit only moves it here) is clearer than what's already here about the intent, but it's lacking some details that are captured in the other two paragraphs, and we probably don't want to lose those details. If you'd like, I can try rewriting these comments to my satisfaction and you can reverse-review the result. Or you can rewrite them and I'll re-review the result. But I think this needs to be a little less mechanical. It's not just about shuffling all the comments around so that all the text ends up somewhere -- we also need to consider the degree to which the meaning becomes duplicated when it all gets merged together. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 24, 2024 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 18, 2024 at 9:23 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I have attached a rebased version of the former 0004 as v11-0001. > > This looks correct to me, although I wouldn't mind some more eyes on > it. However, I think that the comments still need more work. > > Specifically: > > /* > * Prune, freeze, and count tuples. > * > * Accumulates details of remaining LP_DEAD line pointers on page in > * dead_items array. This includes LP_DEAD line pointers that we > * pruned ourselves, as well as existing LP_DEAD line pointers that > * were pruned some time earlier. Also considers freezing XIDs in the > * tuple headers of remaining items with storage. It also determines > * if truncating this block is safe. > */ > - lazy_scan_prune(vacrel, buf, blkno, page, > - vmbuffer, all_visible_according_to_vm, > - &has_lpdead_items); > + if (got_cleanup_lock) > + lazy_scan_prune(vacrel, buf, blkno, page, > + vmbuffer, all_visible_according_to_vm, > + &has_lpdead_items); > > I think this comment needs adjusting. Possibly, the preceding calls to > lazy_scan_noprune() and lazy_scan_new_or_empty() could even use a bit > better comments, but in those cases, you're basically keeping the same > code with the same comment, so it's kinda defensible. Here, though, > you're making the call conditional without any comment update. > > /* > * Final steps for block: drop cleanup lock, record free space in the > * FSM. > * > * If we will likely do index vacuuming, wait until > * lazy_vacuum_heap_rel() to save free space. This doesn't just save > * us some cycles; it also allows us to record any additional free > * space that lazy_vacuum_heap_page() will make available in cases > * where it's possible to truncate the page's line pointer array. > * > + * Our goal is to update the freespace map the last time we touch the > + * page. If the relation has no indexes, or if index vacuuming is > + * disabled, there will be no second heap pass; if this particular > + * page has no dead items, the second heap pass will not touch this > + * page. So, in those cases, update the FSM now. > + * > * Note: It's not in fact 100% certain that we really will call > * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index > * vacuuming (and so must skip heap vacuuming). This is deemed okay > * because it only happens in emergencies, or when there is very > * little free space anyway. (Besides, we start recording free space > * in the FSM once index vacuuming has been abandoned.) > */ > > I think this comment needs a rewrite, not just sticking the other > comment in the middle of it. There's some duplication between these > two comments, and merging it all together should iron that out. > Personally, I think my comment (which was there before, this commit > only moves it here) is clearer than what's already here about the > intent, but it's lacking some details that are captured in the other > two paragraphs, and we probably don't want to lose those details. > > If you'd like, I can try rewriting these comments to my satisfaction > and you can reverse-review the result. Or you can rewrite them and > I'll re-review the result. But I think this needs to be a little less > mechanical. It's not just about shuffling all the comments around so > that all the text ends up somewhere -- we also need to consider the > degree to which the meaning becomes duplicated when it all gets merged > together. 
I will take a stab at rewriting the comments myself first. Usually, I try to avoid changing comments if the code isn't functionally different because I know it adds additional review overhead and I try to reduce that to an absolute minimum. However, I see what you are saying and agree that it would be better to have actually good comments instead of frankenstein comments made up of parts that were previously considered acceptable. I'll have a new version ready by tomorrow. - Melanie
On Wed, Jan 24, 2024 at 4:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 2:59 PM Robert Haas <robertmhaas@gmail.com> wrote: ... > > If you'd like, I can try rewriting these comments to my satisfaction > > and you can reverse-review the result. Or you can rewrite them and > > I'll re-review the result. But I think this needs to be a little less > > mechanical. It's not just about shuffling all the comments around so > > that all the text ends up somewhere -- we also need to consider the > > degree to which the meaning becomes duplicated when it all gets merged > > together. > > I will take a stab at rewriting the comments myself first. Usually, I > try to avoid changing comments if the code isn't functionally > different because I know it adds additional review overhead and I try > to reduce that to an absolute minimum. However, I see what you are > saying and agree that it would be better to have actually good > comments instead of frankenstein comments made up of parts that were > previously considered acceptable. I'll have a new version ready by > tomorrow. v12 attached has my attempt at writing better comments for this section of lazy_scan_heap(). Above the combined FSM update code, I have written a comment that is a revised version of your comment from above the lazy_scan_noprune() FSM update code but with some of the additional details from the previous comment above the lazy_scan_prune() FSM update code. The one part that I did not incorporate was the point about how sometimes we think we'll do a second pass on the block, so we don't update the FSM, but then we end up not doing that second pass and it's all okay. * Note: It's not in fact 100% certain that we really will call * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index * vacuuming (and so must skip heap vacuuming). This is deemed okay * because it only happens in emergencies, or when there is very * little free space anyway. (Besides, we start recording free space * in the FSM once index vacuuming has been abandoned.) I didn't incorporate it because I wasn't sure I understood the situation. I can imagine us skipping updating the FSM after lazy_scan_prune() because there are indexes on the relation and dead items on the page and we think we'll do a second pass. Later, we end up triggering a failsafe vacuum or, somehow, there are still too few TIDs for the second pass, so we update do_index_vacuuming to false. Then we wouldn't ever record this block's free space in the FSM. That seems fine (which is what the comment says). So, what does the last sentence mean? "Besides, we start recording..." - Melanie
Attachment
On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I didn't incorporate it because I wasn't sure I understood the > situation. I can imagine us skipping updating the FSM after > lazy_scan_prune() because there are indexes on the relation and dead > items on the page and we think we'll do a second pass. Later, we end > up triggering a failsafe vacuum or, somehow, there are still too few > TIDs for the second pass, so we update do_index_vacuuming to false. > Then we wouldn't ever record this block's free space in the FSM. That > seems fine (which is what the comment says). So, what does the last > sentence mean? "Besides, we start recording..." It means: when the failsafe kicks in, from that point on we won't do any more heap vacuuming. Clearly any pages that still need to be scanned at that point won't ever be processed by lazy_vacuum_heap_rel(). So from that point on we should record the free space in every scanned heap page in the "first heap pass" -- including pages that have LP_DEAD stubs that aren't going to be made LP_UNUSED in the ongoing VACUUM. -- Peter Geoghegan
On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > v12 attached has my attempt at writing better comments for this > section of lazy_scan_heap(). + /* + * If we didn't get the cleanup lock and the page is not new or empty, + * we can still collect LP_DEAD items in the dead_items array for + * later vacuuming, count live and recently dead tuples for vacuum + * logging, and determine if this block could later be truncated. If + * we encounter any xid/mxids that require advancing the + * relfrozenxid/relminxid, we'll have to wait for a cleanup lock and + * call lazy_scan_prune(). + */ I like this comment. I would probably drop "and the page is not new or empty" from it since that's really speaking to the previous bit of code, but it wouldn't be the end of the world to keep it, either. /* - * Prune, freeze, and count tuples. + * If we got a cleanup lock, we can prune and freeze tuples and + * defragment the page. If we didn't get a cleanup lock, we will still + * consider whether or not to update the FSM. * - * Accumulates details of remaining LP_DEAD line pointers on page in - * dead_items array. This includes LP_DEAD line pointers that we - * pruned ourselves, as well as existing LP_DEAD line pointers that - * were pruned some time earlier. Also considers freezing XIDs in the - * tuple headers of remaining items with storage. It also determines - * if truncating this block is safe. + * Like lazy_scan_noprune(), lazy_scan_prune() will count + * recently_dead_tuples and live tuples for vacuum logging, determine + * if the block can later be truncated, and accumulate the details of + * remaining LP_DEAD line pointers on the page in the dead_items + * array. These dead items include those pruned by lazy_scan_prune() + * as well we line pointers previously marked LP_DEAD. */ To me, the first paragraph of this one misses the mark. What I thought we should be saying here was something like "If we don't have a cleanup lock, the code above has already processed this page to the extent that is possible. Otherwise, we either got the cleanup lock initially and have not processed the page yet, or we didn't get it initially, attempted to process it without the cleanup lock, and decided we needed one after all. Either way, if we now have the lock, we must prune, freeze, and count tuples." The second paragraph seems fine. - * Note: It's not in fact 100% certain that we really will call - * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index - * vacuuming (and so must skip heap vacuuming). This is deemed okay - * because it only happens in emergencies, or when there is very - * little free space anyway. (Besides, we start recording free space - * in the FSM once index vacuuming has been abandoned.) Here's a suggestion from me: Note: In corner cases, it's possible to miss updating the FSM entirely. If index vacuuming is currently enabled, we'll skip the FSM update now. But if failsafe mode is later activated, disabling index vacuuming, there will also be no opportunity to update the FSM later, because we'll never revisit this page. Since updating the FSM is desirable but not absolutely required, that's OK. I think this expresses the same sentiment as the current comment, but IMHO more clearly. The one part of the current comment that I don't understand at all is the remark about "when there is very little freespace anyway". 
I get that if the failsafe activates we won't come back to the page, which is the "only happens in emergencies" part of the existing comment. But the current phrasing makes it sound like there is a second case where it can happen -- "when there is very little free space anyway" -- and I don't know what that is talking about. If it's important, we should try to make it clearer. We could also just decide to keep this entire paragraph as it is for purposes of the present patch. The part I really thought needed adjusting was "Prune, freeze, and count tuples." -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 25, 2024 at 8:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 9:13 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > v12 attached has my attempt at writing better comments for this > > section of lazy_scan_heap(). > > + /* > + * If we didn't get the cleanup lock and the page is not new or empty, > + * we can still collect LP_DEAD items in the dead_items array for > + * later vacuuming, count live and recently dead tuples for vacuum > + * logging, and determine if this block could later be truncated. If > + * we encounter any xid/mxids that require advancing the > + * relfrozenxid/relminxid, we'll have to wait for a cleanup lock and > + * call lazy_scan_prune(). > + */ > > I like this comment. I would probably drop "and the page is not new or > empty" from it since that's really speaking to the previous bit of > code, but it wouldn't be the end of the world to keep it, either. Yes, probably best to get rid of the part about new or empty. > /* > - * Prune, freeze, and count tuples. > + * If we got a cleanup lock, we can prune and freeze tuples and > + * defragment the page. If we didn't get a cleanup lock, we will still > + * consider whether or not to update the FSM. > * > - * Accumulates details of remaining LP_DEAD line pointers on page in > - * dead_items array. This includes LP_DEAD line pointers that we > - * pruned ourselves, as well as existing LP_DEAD line pointers that > - * were pruned some time earlier. Also considers freezing XIDs in the > - * tuple headers of remaining items with storage. It also determines > - * if truncating this block is safe. > + * Like lazy_scan_noprune(), lazy_scan_prune() will count > + * recently_dead_tuples and live tuples for vacuum logging, determine > + * if the block can later be truncated, and accumulate the details of > + * remaining LP_DEAD line pointers on the page in the dead_items > + * array. These dead items include those pruned by lazy_scan_prune() > + * as well we line pointers previously marked LP_DEAD. > */ > > To me, the first paragraph of this one misses the mark. What I thought > we should be saying here was something like "If we don't have a > cleanup lock, the code above has already processed this page to the > extent that is possible. Otherwise, we either got the cleanup lock > initially and have not processed the page yet, or we didn't get it > initially, attempted to process it without the cleanup lock, and > decided we needed one after all. Either way, if we now have the lock, > we must prune, freeze, and count tuples." I see. Your suggestion makes sense. The sentence starting with "Otherwise" is a bit long. I started to lose the thread at "decided we needed one after all". You previously referred to the cleanup lock as "it" -- once you start referring to it as "one", I as the future developer am no longer sure we are talking about the cleanup lock (as opposed to the page or something else). > - * Note: It's not in fact 100% certain that we really will call > - * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip index > - * vacuuming (and so must skip heap vacuuming). This is deemed okay > - * because it only happens in emergencies, or when there is very > - * little free space anyway. (Besides, we start recording free space > - * in the FSM once index vacuuming has been abandoned.) > > Here's a suggestion from me: > > Note: In corner cases, it's possible to miss updating the FSM > entirely. If index vacuuming is currently enabled, we'll skip the FSM > update now. 
But if failsafe mode is later activated, disabling index > vacuuming, there will also be no opportunity to update the FSM later, > because we'll never revisit this page. Since updating the FSM is > desirable but not absolutely required, that's OK. > > I think this expresses the same sentiment as the current comment, but > IMHO more clearly. The one part of the current comment that I don't > understand at all is the remark about "when there is very little > freespace anyway". I get that if the failsafe activates we won't come > back to the page, which is the "only happens in emergencies" part of > the existing comment. But the current phrasing makes it sound like > there is a second case where it can happen -- "when there is very > little free space anyway" -- and I don't know what that is talking > about. If it's important, we should try to make it clearer. > > We could also just decide to keep this entire paragraph as it is for > purposes of the present patch. The part I really thought needed > adjusting was "Prune, freeze, and count tuples." I think it would be nice to clarify this comment. I think the "when there is little free space anyway" is referring to the case in lazy_vacuum() where we set do_index_vacuuming to false because "there are almost zero TIDs". I initially thought it was saying that in the failsafe vacuum case the pages whose free space we wouldn't record in the FSM have little free space anyway -- which I didn't get. But then I looked at where we set do_index_vacuuming to false. As for the last sentence starting with "Besides", even with Peter's explanation I still am not sure what it should say. There are blocks whose free space we don't record in the first heap pass. Then, due to skipping index vacuuming and the second heap pass, we also don't record their free space in the second heap pass. I think he is saying that once we set do_index_vacuuming to false, we will stop skipping updating the FSM after the first pass for future blocks. So, future blocks will have their free space recorded in the FSM. But that feels self-evident. The more salient point is that there are some blocks whose free space is not recorded (those whose first pass happened before unsetting do_index_vacuuming and whose second pass did not happen before do_index_vacuuming is unset). The extra sentence made me think there was some way we might go back and record free space for those blocks, but that is not true. - Melanie
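To make the missed-FSM-update case above concrete, here is a small standalone C model. It is only a sketch, not PostgreSQL source: the block count, the point at which do_index_vacuuming gets unset, and every name other than do_index_vacuuming are invented purely for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 8

    int
    main(void)
    {
        bool        do_index_vacuuming = true;      /* may be unset mid-vacuum */
        bool        fsm_recorded[NBLOCKS] = {false};

        /* First heap pass */
        for (int blk = 0; blk < NBLOCKS; blk++)
        {
            bool        has_lpdead_items = (blk % 2 == 0);  /* pretend half the blocks have LP_DEAD items */

            /*
             * Record free space now only if no second pass is expected for
             * this block; otherwise defer the FSM update to the second pass.
             */
            if (!has_lpdead_items || !do_index_vacuuming)
                fsm_recorded[blk] = true;

            /* Pretend index vacuuming is disabled partway through the scan */
            if (blk == 3)
                do_index_vacuuming = false;
        }

        /* Second heap pass: never happens, because index vacuuming was disabled */

        for (int blk = 0; blk < NBLOCKS; blk++)
        {
            if (!fsm_recorded[blk])
                printf("block %d: free space never recorded in the FSM\n", blk);
        }
        return 0;
    }

In this model, blocks 0 and 2 are the problem blocks: their first pass deferred the FSM update to a second pass that never runs, so their free space is never recorded.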
On Thu, Jan 25, 2024 at 9:18 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > To me, the first paragraph of this one misses the mark. What I thought > > we should be saying here was something like "If we don't have a > > cleanup lock, the code above has already processed this page to the > > extent that is possible. Otherwise, we either got the cleanup lock > > initially and have not processed the page yet, or we didn't get it > > initially, attempted to process it without the cleanup lock, and > > decided we needed one after all. Either way, if we now have the lock, > > we must prune, freeze, and count tuples." > > I see. Your suggestion makes sense. The sentence starting with > "Otherwise" is a bit long. I started to lose the thread at "decided we > needed one after all". You previously referred to the cleanup lock as > "it" -- once you start referring to it as "one", I as the future > developer am no longer sure we are talking about the cleanup lock (as > opposed to the page or something else). Ok... trying again: If we have a cleanup lock, we must now prune, freeze, and count tuples. We may have acquired the cleanup lock originally, or we may have gone back and acquired it after lazy_scan_noprune() returned false. Either way, the page hasn't been processed yet. > I think it would be nice to clarify this comment. I think the "when > there is little free space anyway" is referring to the case in > lazy_vacuum() where we set do_index_vacuuming to false because "there > are almost zero TIDs". I initially thought it was saying that in the > failsafe vacuum case the pages whose free space we wouldn't record in > the FSM have little free space anyway -- which I didn't get. But then > I looked at where we set do_index_vacuuming to false. Oh... wow. That's kind of confusing; somehow I was thinking we were talking about free space on the disk, rather than newly free space in pages that could be added to the FSM. And it seems really questionable whether that case is OK. I mean, in the emergency case, fine, whatever, we should do whatever it takes to get the system back up, and it should barely ever happen on a well-configured system. But this case could happen regularly, and losing track of free space could easily cause bloat. This might be another argument for moving FSM updates to the first heap pass, but that's a separate task from fixing the comment. > As for the last sentence starting with "Besides", even with Peter's > explanation I still am not sure what it should say. There are blocks > whose free space we don't record in the first heap pass. Then, due to > skipping index vacuuming and the second heap pass, we also don't > record their free space in the second heap pass. I think he is saying > that once we set do_index_vacuuming to false, we will stop skipping > updating the FSM after the first pass for future blocks. So, future > blocks will have their free space recorded in the FSM. But that feels > self-evident. Yes, I don't think that necessarily needs to be mentioned here. > The more salient point is that there are some blocks > whose free space is not recorded (those whose first pass happened > before unsetting do_index_vacuuming and whose second pass did not > happen before do_index_vacuuming is unset). The extra sentence made me > think there was some way we might go back and record free space for > those blocks, but that is not true. I don't really see why that sentence made you think that, but it's not important. 
I agree with you about what point we need to emphasize here. -- Robert Haas EDB: http://www.enterprisedb.com
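The cleanup-lock wording being settled on here is easier to see as control flow. The following is a minimal standalone sketch, not the vacuumlazy.c code: the helper functions are hypothetical stand-ins for ConditionalLockBufferForCleanup(), lazy_scan_noprune(), LockBufferForCleanup(), and lazy_scan_prune(), with hard-coded return values chosen so that the retry path is taken.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for the real buffer-lock and page-processing routines. */
    static bool
    try_conditional_cleanup_lock(void)
    {
        return false;               /* pretend another backend has the page pinned */
    }

    static bool
    scan_page_without_cleanup_lock(void)
    {
        /* returns false when the page must be fully processed after all */
        return false;
    }

    static void
    wait_for_cleanup_lock(void)
    {
        printf("waiting for a cleanup lock\n");
    }

    static void
    prune_freeze_and_count(void)
    {
        printf("pruning, freezing, and counting tuples\n");
    }

    int
    main(void)
    {
        bool        got_cleanup_lock = try_conditional_cleanup_lock();

        if (!got_cleanup_lock)
        {
            if (scan_page_without_cleanup_lock())
            {
                /* Page was processed to the extent possible without the cleanup lock */
                return 0;
            }

            /* The no-cleanup-lock path decided we need the cleanup lock after all */
            wait_for_cleanup_lock();
            got_cleanup_lock = true;
        }

        /*
         * If we reach here we hold the cleanup lock: either we acquired it up
         * front, or we went back and acquired it after the no-cleanup-lock
         * path returned false.  Either way, the page has not been fully
         * processed yet.
         */
        if (got_cleanup_lock)
            prune_freeze_and_count();

        return 0;
    }

The comment under discussion describes exactly the last branch: whichever way the lock was obtained, the page still needs to be pruned, frozen, and counted.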
On Thu, Jan 25, 2024 at 10:19 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 9:18 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > To me, the first paragraph of this one misses the mark. What I thought > > > we should be saying here was something like "If we don't have a > > > cleanup lock, the code above has already processed this page to the > > > extent that is possible. Otherwise, we either got the cleanup lock > > > initially and have not processed the page yet, or we didn't get it > > > initially, attempted to process it without the cleanup lock, and > > > decided we needed one after all. Either way, if we now have the lock, > > > we must prune, freeze, and count tuples." > > > > I see. Your suggestion makes sense. The sentence starting with > > "Otherwise" is a bit long. I started to lose the thread at "decided we > > needed one after all". You previously referred to the cleanup lock as > > "it" -- once you start referring to it as "one", I as the future > > developer am no longer sure we are talking about the cleanup lock (as > > opposed to the page or something else). > > Ok... trying again: > > If we have a cleanup lock, we must now prune, freeze, and count > tuples. We may have acquired the cleanup lock originally, or we may > have gone back and acquired it after lazy_scan_noprune() returned > false. Either way, the page hasn't been processed yet. Cool. I might add "successfully" or "fully" to "Either way, the page hasn't been processed yet" > > I think it would be nice to clarify this comment. I think the "when > > there is little free space anyway" is referring to the case in > > lazy_vacuum() where we set do_index_vacuuming to false because "there > > are almost zero TIDs". I initially thought it was saying that in the > > failsafe vacuum case the pages whose free space we wouldn't record in > > the FSM have little free space anyway -- which I didn't get. But then > > I looked at where we set do_index_vacuuming to false. > > Oh... wow. That's kind of confusing; somehow I was thinking we were > talking about free space on the disk, rather than newly free space in > pages that could be added to the FSM. Perhaps I misunderstood. I interpreted it to refer to the bypass optimization. > And it seems really questionable > whether that case is OK. I mean, in the emergency case, fine, > whatever, we should do whatever it takes to get the system back up, > and it should barely ever happen on a well-configured system. But this > case could happen regularly, and losing track of free space could > easily cause bloat. > > This might be another argument for moving FSM updates to the first > heap pass, but that's a separate task from fixing the comment. Yes, it seems we could miss recording space freed in the first pass if we never end up doing a second pass. consider_bypass_optimization is set to false only if index cleanup is explicitly enabled or there are dead items accumulated for vacuum's second pass at some point. - Melanie
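For readers unfamiliar with the bypass optimization mentioned here, a minimal sketch of that decision follows, written as a standalone C program rather than the actual lazy_vacuum() code; the 2% page threshold and all of the values are assumptions chosen purely for illustration, and the real check involves additional conditions.

    #include <stdbool.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* All values below are made up for illustration. */
        bool        consider_bypass_optimization = true;    /* false if index cleanup was forced on */
        long        rel_pages = 100000;                     /* pages in the relation */
        long        lpdead_item_pages = 50;                 /* pages that accumulated LP_DEAD items */
        bool        do_index_vacuuming = true;

        /*
         * If only a tiny fraction of the pages have dead items ("almost zero
         * TIDs"), skip index vacuuming, which also skips the second heap pass.
         */
        if (consider_bypass_optimization &&
            lpdead_item_pages < (long) (rel_pages * 0.02))
            do_index_vacuuming = false;

        printf("do_index_vacuuming = %s\n", do_index_vacuuming ? "true" : "false");

        /*
         * With do_index_vacuuming now false, any block whose FSM update was
         * deferred to the second pass never gets its free space recorded --
         * the case being discussed above.
         */
        return 0;
    }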
On Thu, Jan 25, 2024 at 11:19 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Cool. I might add "successfully" or "fully" to "Either way, the page > hasn't been processed yet" I'm OK with that. > > > I think it would be nice to clarify this comment. I think the "when > > > there is little free space anyway" is referring to the case in > > > lazy_vacuum() where we set do_index_vacuuming to false because "there > > > are almost zero TIDs". I initially thought it was saying that in the > > > failsafe vacuum case the pages whose free space we wouldn't record in > > > the FSM have little free space anyway -- which I didn't get. But then > > > I looked at where we set do_index_vacuuming to false. > > > > Oh... wow. That's kind of confusing; somehow I was thinking we were > > talking about free space on the disk, rather than newly free space in > > pages that could be added to the FSM. > > Perhaps I misunderstood. I interpreted it to refer to the bypass optimization. I think you're probably correct. I just didn't realize what was meant. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 25, 2024 at 12:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think you're probably correct. I just didn't realize what was meant. I tweaked your v12 based on this discussion and committed the result. Thanks to you for the patches, and to Peter for participating in the discussion which, IMHO, was very helpful in clarifying things. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 26, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 12:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I think you're probably correct. I just didn't realize what was meant. > > I tweaked your v12 based on this discussion and committed the result. > > Thanks to you for the patches, and to Peter for participating in the > discussion which, IMHO, was very helpful in clarifying things. Thanks! I've marked the CF entry as committed. - Melanie
On Fri, Jan 26, 2024 at 11:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > Thanks to you for the patches, and to Peter for participating in the > discussion which, IMHO, was very helpful in clarifying things. Glad I could help. -- Peter Geoghegan