Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) - Mailing list pgsql-hackers

From Kirill Reshke
Subject Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Date
Msg-id CALdSSPhiCwJwWwgJP1NmqRmnp9RS2tGOBY0gQrfLCbB+OS5_KQ@mail.gmail.com
In response to Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
On Wed, 15 Oct 2025 at 04:27, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Thanks so much for the review! I've addressed all your feedback except
> what is commented on inline below.
> I've gone ahead and committed the preliminary patches that you thought
> were ready to commit.
>
> Attached v18 is what remains.
>
> 0001 - 0003: refactoring
> 0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE
> 0007 - 0009: refactoring
> 0010: Set VM on-access
> 0011: Set prune xid on insert
> 0012: Some refactoring for discussion
>
> For 0001, I got feedback heap_page_prune_and_freeze() has too many
> arguments, so I tried to address that. I'm interested to know if folks
> like this more.
>
> 0011 still needs a bit of investigation to understand fully if
> anything else in the index-killtuples test needs to be changed to make
> sure we have the same coverage.
>
> 0012 is sort of WIP. I got feedback heap_page_prune_and_freeze() was
> too long and should be split up into helpers. I want to know if this
> split makes sense. I can pull it down the patch stack if so.
>
> Only 0001 and 0012 are optional amongst the refactoring patches. The
> others are required to make on-access VM-setting possible or viable.
>
> On Thu, Oct 9, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> > >       }
> > > -     action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > -                                            (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > > -                                            &buffer);
> > > -     if (action == BLK_NEEDS_REDO)
> > > +     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > +                                       (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > > +                                       &buffer) == BLK_NEEDS_REDO)
> > >       {
> > >               Page            page = BufferGetPage(buffer);
> > >               OffsetNumber *redirected;
> >
> > Why move it around this way?
>
> Because there will be an action for the visibility map's
> XLogReadBufferForRedoExtended() call. I could have renamed it heap_action,
> but it is only used in one place, so I preferred to just cut it
> to avoid any confusion.
>
> > > +              * Advance the page LSN
> > > +              * only if the record could include an FPI, since recovery skips
> > > +              * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> > > +              * needed to repair a torn page.
> > > +              */
> >
> > This is confusing, should probably just reference the stuff we did in the
> > !recovery case.
>
> I fixed this and addressed all your feedback related to this before committing.
>
> > > +             if (do_prune || nplans > 0 ||
> > > +                     ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> > > +                     PageSetLSN(page, lsn);
> > > +
> > >               /*
> > >                * Note: we don't worry about updating the page's prunability hints.
> > >                * At worst this will cause an extra prune cycle to occur soon.
> > >                */
> >
> > Not your fault, but that seems odd? Why aren't we just doing the right thing?
>
> The comment dates back to 6f10eb2. I imagine no one ever bothered to
> fuss with extracting the XID. You could change
> heap_page_prune_execute() to return the right value -- though that's a
> bit ugly since it is used in normal operation as well as recovery.
>
> > I wonder if the VM specific redo portion should be in a common helper? Might
> > not be enough code to worry though...
>
> I think it might be more code as a helper at this point.
>
> > > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> > >                                                        VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
> > >                                                        InvalidOffsetNumber);
> > >
> > > +     /*
> > > +      * Before marking dead items unused, check whether the page will become
> > > +      * all-visible once that change is applied.
> >
> > So the function is named _would_ but here you say will :)
>
> I thought about it more and still feel that this function name should
> contain "would". From vacuum's perspective it is "will" -- because it
> knows it will remove those dead items, but from the function's
> perspective it is hypothetical. I changed the comment though.
>
> > > +     if (heap_page_would_be_all_visible(vacrel, buffer,
> > > +                                                                        deadoffsets, num_offsets,
> > > +                                                                        &all_frozen, &visibility_cutoff_xid))
> > > +     {
> > > +             vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> > > +             if (all_frozen)
> > > +             {
> > > +                     vmflags |= VISIBILITYMAP_ALL_FROZEN;
> > > +                     Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> > > +             }
> > > +
> > > +             /* Take the lock on the vmbuffer before entering a critical section */
> > > +             LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> >
> > It sure would be nice if we had documented the lock order between the heap
> > page and the corresponding VM page anywhere.  This is just doing what we did
> > before, so it's not this patch's fault, but I did get worried about it for a
> > moment.
>
> Well, the comment above the visibilitymap_set* functions says what
> expectations they have for the heap page being locked.
>
> > > +static bool
> > > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> > > +                                                        OffsetNumber *deadoffsets,
> > > +                                                        int ndeadoffsets,
> > > +                                                        bool *all_frozen,
> > > +                                                        TransactionId *visibility_cutoff_xid)
> > > +{
> > >       Page            page = BufferGetPage(buf);
>
> > Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
> > the end?
>
> I was going to put an Assert(ndeadoffsets <= matched_dead_count), but
> then I started wondering if there is a way we could end up with fewer
> dead items than we collected during phase I.
>
> I had thought about the case where we drop an index and then do on-access
> pruning -- but we don't allow setting LP_DEAD items LP_UNUSED in
> on-access pruning. So, maybe this is safe... I can do a follow-on
> commit to add the assert. But I'm just not 100% sure I've thought of
> all the cases where we might end up with fewer dead items.
>
> > > During vacuum's first and third phases, we examine tuples' visibility
> > > to determine if we can set the page all-visible in the visibility map.
> > >
> > > Previously, this check compared tuple xmins against a single XID chosen at
> > > the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> > > enables future work to set the VM during on-access pruning, since ordinary
> > > queries have access to GlobalVisState but not OldestXmin.
> > >
> > > This also benefits vacuum directly: GlobalVisState may advance
> > > during a vacuum, allowing more pages to be considered all-visible.
> > > In the rare case that it moves backward, VACUUM falls back to OldestXmin
> > > to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> > > prunable according to the GlobalVisState.
> >
> > It could, but it currently won't advance in vacuum, right?
>
> I thought it was possible for it to advance when calling
> heap_prune_satisfies_vacuum() ->
> GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't
> going to be common, but some things can cause us to update it.
>
> We have talked about explicitly updating GlobalVisState more often
> during vacuums of large tables. But I was under the impression that it
> was at least possible for it to advance during vacuum now.
>
> - Melanie


Hi!

First of all, I rechecked the v18 patches; they still reduce WAL
volume. In the no-index vacuum case I measured a 39% reduction in WAL
bytes, almost the same as in your first message.

Here are my comments on the code. I may be very nitpicky about minor
details; sorry for that.

In 0003:

The logic of the get_conflict_xid() function is a bit strange to me:
it assigns conflict_xid a value, but at the very end we have

> + /*
> +  * We can omit the snapshot conflict horizon if we are not pruning or
> +  * freezing any tuples and are setting an already all-visible page
> +  * all-frozen in the VM. In this case, all of the tuples on the page must
> +  * already be visible to all MVCC snapshots on the standby.
> +  */
> + if (!do_prune && !do_freeze &&
> +     do_set_vm && blk_already_av && set_blk_all_frozen)
> +     conflict_xid = InvalidTransactionId;

I feel like we should move this check to the beginning of the
function and just return InvalidTransactionId there.
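
Something like this (just a sketch; I'm abbreviating the argument
list, which I haven't checked against the actual patch):

static TransactionId
get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
                 bool blk_already_av, bool set_blk_all_frozen)
{
    TransactionId conflict_xid = InvalidTransactionId;

    /*
     * Bail out early: no snapshot conflict horizon is needed if we are
     * not pruning or freezing any tuples and are only setting an already
     * all-visible page all-frozen in the VM, since every tuple on the
     * page must already be visible to all MVCC snapshots on the standby.
     */
    if (!do_prune && !do_freeze &&
        do_set_vm && blk_already_av && set_blk_all_frozen)
        return InvalidTransactionId;

    /* ... the existing logic that computes conflict_xid ... */

    return conflict_xid;
}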

In 0004:

> + if (old_vmbits == new_vmbits)
> + {
> +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +     /* Unset so we don't emit WAL since no change occurred */
> +     do_set_vm = false;
> + }

and then

>  END_CRIT_SECTION();
> + if (do_set_vm)
> +     LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +

So, in the heap_page_prune_and_freeze() function we release the buffer
lock both inside and outside the critical section. As I understand it,
this is actually safe. I also looked at the WAL coding practices of
other access methods (GiST, GIN, ...), and some of them release buffers
before leaving the critical section and some after. But I still suggest
staying in sync with the 'Write-Ahead Log Coding' section of
src/backend/access/transam/README, which says:

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Let's at least be consistent about this within this single function.
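
That is, something along these lines (only a sketch of the ordering,
not of the actual patch; the log_heap_prune_and_freeze() call and the
variable names are my guesses):

    START_CRIT_SECTION();

    /* ... apply the changes to the heap page and the VM page ... */

    MarkBufferDirty(buffer);
    if (do_set_vm)
        MarkBufferDirty(vmbuffer);

    if (RelationNeedsWAL(relation))
    {
        XLogRecPtr  recptr = log_heap_prune_and_freeze(...);

        PageSetLSN(page, recptr);
    }

    END_CRIT_SECTION();

    /* README step 7: unlock (and unpin) only after the critical section */
    if (do_set_vm)
        LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);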


In 0010:

I'm not terribly convinced that adding SO_ALLOW_VM_SET to the TAM
ScanOptions is the right thing to do. VM bits look like something that
makes sense for the heap AM, not for an arbitrary TAM. So, don't we
break a layer of abstraction here? Would it be better for the heap AM
to set some flag in heap_beginscan()?
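
Purely illustratively (rs_allow_vm_set is a hypothetical field and the
condition is made up; this is not from the patch):

static TableScanDesc
heap_beginscan(Relation relation, Snapshot snapshot,
               int nkeys, ScanKey key,
               ParallelTableScanDesc parallel_scan,
               uint32 flags)
{
    HeapScanDesc scan = ...;    /* existing setup */

    /*
     * Hypothetical: the heap AM itself decides whether on-access VM
     * setting is allowed for this scan, instead of exposing that
     * through the TAM-level ScanOptions.
     */
    scan->rs_allow_vm_set = (snapshot != NULL &&
                             IsMVCCSnapshot(snapshot) &&
                             (flags & SO_TYPE_SEQSCAN) != 0);

    /* ... rest of the existing function ... */
    return (TableScanDesc) scan;
}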


Overall, 0001-0003 are mostly fine by me, and 0004-0006 are the right
thing to do IMHO, but maybe they need some more review from other
hackers. I did not review the other patches in great detail; I will
return to them later.



--
Best regards,
Kirill Reshke


