Thread: Fixing a couple of buglets in how VACUUM sets visibility map bits

Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
Attached are two patches, each of which fixes two historical buglets
around VACUUM's approach to setting bits in the visibility map.
(Whether or not this is actually refactoring work or hardening work is
debatable, I suppose.)

The first patch makes sure that the snapshotConflictHorizon cutoff
(XID cutoff for recovery conflicts) is never a special XID, unless
that XID is InvalidTransactionId, which is interpreted as a
snapshotConflictHorizon value that will never need a recovery conflict
(per the general convention for snapshotConflictHorizon values
explained above ResolveRecoveryConflictWithSnapshot). This patch
establishes a hard rule that snapshotConflictHorizon values can never
be a special XID value, unless it's InvalidTransactionId. An assertion
enforces the rule for us in REDO routines (at the point that they call
ResolveRecoveryConflictWithSnapshot with the WAL record's
snapshotConflictHorizon XID value).

The second patch makes sure that VACUUM can never set a page
all-frozen in the visibility map without also setting the same page
all-visible in the same call to visibilitymap_set() -- regardless of
what we think we know about the current state of the all-visible bit
in the VM.

The second patch adjusts one of the visibilitymap_set() calls in
vacuumlazy.c that would previously sometimes set a page's all-frozen
bit without also setting its all-visible bit. This could allow VACUUM
to leave a page all-frozen but not all-visible in the visibility map
(since the value of all_visible_according_to_vm can go stale). I think
that this should be treated as a basic visibility map invariant: an
all-frozen page must also be all-visible, by definition, so why should
it be physically possible for the VM to give a contradictory picture
of the all_visible/all_frozen status of any one page? Assertions are
added that more or less make this rule into an invariant.
amcheck/pg_visibility coverage might make sense too, but I haven't
done that here.

The second patch also adjusts a later visibilitymap_set() call site
(the one used just after heap vacuuming runs in the final heap pass)
in roughly the same way. It no longer reads from the visibility map to
see what bits need to be changed. The existing approach here seems
rather odd. The whole point of calling lazy_vacuum_heap_page() is to
set LP_DEAD items referenced by VACUUM's dead_items array to LP_UNUSED
-- there has to have been at least one LP_DEAD item on the page for us
to end up here (which a Postgres 14 era assertion verifies for us). So
we already know perfectly well that the visibility map shouldn't
indicate that the page is all-visible yet -- why bother asking the VM?
And besides, any call to visibilitymap_set() will only modify the VM
when it directly observes that the bits have changed -- so why even
attempt to duplicate that on the caller side?

It seems to me that the visibilitymap_get_status() call inside
lazy_vacuum_heap_page() is actually abused to work as a substitute for
visibilitymap_pin(). Why not use top-level visibilitymap_pin() calls
instead, just like we do it in the first heap pass? That's how it's
done in the second patch; it adds a visibilitymap_pin() call in
lazy_vacuum_heap_rel()'s blkno-wise loop. That gives us parity between
the first and second heap pass, which seems like a clear
maintainability win -- everybody can pass the
already-pinned/already-setup vmbuffer by value.

-- 
Peter Geoghegan

Attachment

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The first patch makes sure that the snapshotConflictHorizon cutoff
> (XID cutoff for recovery conflicts) is never a special XID, unless
> that XID is InvalidTransactionId, which is interpreted as a
> snapshotConflictHorizon value that will never need a recovery conflict
> (per the general convention for snapshotConflictHorizon values
> explained above ResolveRecoveryConflictWithSnapshot).

Pushed this just now.

Attached is another very simple refactoring patch for vacuumlazy.c. It
makes vacuumlazy.c save the result of get_database_name() in vacrel,
which matches what we already do with things like
get_namespace_name().

Would be helpful if I could get a +1 on
v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
somewhat more substantial than the others.

-- 
Peter Geoghegan

Attachment

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Would be helpful if I could get a +1 on
> v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
> somewhat more substantial than the others.

There has been no response on this thread for over a full week at this
point. I'm CC'ing Robert now, since the bug is from his commit
a892234f83.

Attached revision of the "don't unset all-visible bit while unsetting
all-frozen bit" patch adds some assertions that verify that
visibility_cutoff_xid is InvalidTransactionId as expected when we go
to set any page all-frozen in the VM. It also broadens an existing
nearby test for corruption, which gives us some chance of detecting
and repairing corruption of this sort that might have slipped in in
the field.

My current plan is to commit something like this in another week or
so, barring any objections.

-- 
Peter Geoghegan

Attachment

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Andres Freund
Date:
Hi,

On 2023-01-08 14:39:19 -0800, Peter Geoghegan wrote:
> One of the calls to visibilitymap_set() during VACUUM's initial heap
> pass could unset a page's all-visible bit during the process of setting
> the same page's all-frozen bit.

How? visibilitymap_set() just adds flags, it doesn't remove any already
existing bits:

        map[mapByte] |= (flags << mapOffset);

It'll afaict lead to potentially unnecessary WAL records though, which does
seem buggy:
    if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))

here we check for *equivalence*, but then below we just or-in flags. So
visibilitymap_set() with just one of the flags bits set in the parameters,
but both set in the page would end up WAL logging unnecessarily.




> @@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
>  static void
>  lazy_vacuum_heap_rel(LVRelState *vacrel)
>  {
> -    int            index;
> -    BlockNumber vacuumed_pages;
> +    int            index = 0;
> +    BlockNumber vacuumed_pages = 0;
>      Buffer        vmbuffer = InvalidBuffer;
>      LVSavedErrInfo saved_err_info;
>  
> @@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP,
>                               InvalidBlockNumber, InvalidOffsetNumber);
>  
> -    vacuumed_pages = 0;
> -
> -    index = 0;


> @@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   */
>  static int
>  lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> -                      int index, Buffer *vmbuffer)
> +                      int index, Buffer vmbuffer)
>  {
>      VacDeadItems *dead_items = vacrel->dead_items;
>      Page        page = BufferGetPage(buffer);
>      OffsetNumber unused[MaxHeapTuplesPerPage];
> -    int            uncnt = 0;
> +    int            nunused = 0;
>      TransactionId visibility_cutoff_xid;
>      bool        all_frozen;
>      LVSavedErrInfo saved_err_info;
> @@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>  
>          Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
>          ItemIdSetUnused(itemid);
> -        unused[uncnt++] = toff;
> +        unused[nunused++] = toff;
>      }
>  
> -    Assert(uncnt > 0);
> +    Assert(nunused > 0);
>  
>      /* Attempt to truncate line pointer array now */
>      PageTruncateLinePointerArray(page);
> @@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>          xl_heap_vacuum xlrec;
>          XLogRecPtr    recptr;
>  
> -        xlrec.nunused = uncnt;
> +        xlrec.nunused = nunused;
>  
>          XLogBeginInsert();
>          XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
>  
>          XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
> -        XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
> +        XLogRegisterBufData(0, (char *) unused, nunused * sizeof(OffsetNumber));
>  
>          recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
>  

You have plenty of changes like this, which are afaict entirely unrelated to
the issue the commit is fixing, in here. It just makes it hard to review the
patch.

Greetings,

Andres Freund



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:
> How?

See the commit message for the scenario I have in mind, which involves
a concurrent HOT update that aborts.

We're vulnerable to allowing "all-frozen but not all-visible"
inconsistencies because of two factors: this business with not passing
VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
visibilitymap_set(), *and* the use of all_visible_according_to_vm to
set the VM (a local variable that can go stale). We sort of assume
that all_visible_according_to_vm cannot go stale here without our
detecting it. That's almost always the case, but it's not quite
guaranteed.

> visibilitymap_set() just adds flags, it doesn't remove any already
> existing bits:

I know. The concrete scenario I have in mind is very subtle (if the
problem was this obvious I'm sure somebody would have noticed it by
now, since we do hit this visibilitymap_set() call site reasonably
often). A concurrent HOT update will certainly clear all the bits for
us, which is enough.

> You have plenty of changes like this, which are afaict entirely unrelated to
> the issue the commit is fixing, in here. It just makes it hard to review the
> patch.

I didn't think that it was that big of a deal to tweak the style of
one or two details in and around lazy_vacuum_heap_rel() in passing,
for consistency with lazy_scan_heap(), since the patch already needs
to do some of that. I do take your point, though.

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> We're vulnerable to allowing "all-frozen but not all-visible"
> inconsistencies because of two factors: this business with not passing
> VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
> visibilitymap_set(), *and* the use of all_visible_according_to_vm to
> set the VM (a local variable that can go stale). We sort of assume
> that all_visible_according_to_vm cannot go stale here without our
> detecting it. That's almost always the case, but it's not quite
> guaranteed.

On further reflection even v2 won't repair the page-level
PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
might actually leave the all-frozen bit set in the VM, while both the
all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
Again, all due to the approach we take with
all_visible_according_to_vm, which can go stale independently of both
the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
example problem scenario).

FWIW I don't have this remaining problem in my VACUUM
freezing/scanning strategies patchset. It just gets rid of
all_visible_according_to_vm altogether, which makes things a lot
simpler at the point that we set VM bits at the end of lazy_scan_heap
-- there is nothing left that can become stale. Quite a lot of the
code is just removed; there is exactly one call to visibilitymap_set()
at the end of lazy_scan_heap with the patchset, that does everything
we need.

The patchset also has logic for setting PD_ALL_VISIBLE when it needs
to be set, which isn't (and shouldn't) be conditioned on whether we're
doing a "all visible -> all frozen " transition or a "neither -> all
visible" transition. What it actually needs to be conditioned on is
whether it's unset now, and so needs to be set in passing, as part of
setting one or both VM bits -- simple as that.

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Sun, Jan 8, 2023 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On further reflection even v2 won't repair the page-level
> PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
> might actually leave the all-frozen bit set in the VM, while both the
> all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
> Again, all due to the approach we take with
> all_visible_according_to_vm, which can go stale independently of both
> the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
> example problem scenario).

Attached is v3, which explicitly checks the need to set the
PD_ALL_VISIBLE flag at the relevant visibilitymap_set() call site. It
also has improved comments.

-- 
Peter Geoghegan

Attachment

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Andres Freund
Date:
Hi,

On 2023-01-08 16:27:59 -0800, Peter Geoghegan wrote:
> On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:
> > How?
> 
> See the commit message for the scenario I have in mind, which involves
> a concurrent HOT update that aborts.

I looked at it. I specifically was wondering about this part of it:
> One of the calls to visibilitymap_set() during VACUUM's initial heap
> pass could unset a page's all-visible bit during the process of setting
> the same page's all-frozen bit.

Which I just don't see as possible, due to visibilitymap_set() simply never
unsetting bits.

I think that's just an imprecise formulation though - the problem is that we
can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though
VISIBILITYMAP_ALL_VISIBLE was concurrently unset.

ISTM that we ought to update all_visible_according_to_vm from
PageIsAllVisible() once we've locked the page. Even if we avoid this specific
case, it seems a recipe for future bugs to have a potentially outdated
variable around.

Greetings,

Andres Freund



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Andres Freund
Date:
Hi,

On 2023-01-09 10:16:03 -0800, Peter Geoghegan wrote:
> Attached is v3, which explicitly checks the need to set the PD_ALL_VISIBLE
> flag at the relevant visibilitymap_set() call site. It also has improved
> comments.

Afaict we'll need to backpatch this all the way?


> From e7788ebdb589fb7c6f866cf53658cc369f9858b5 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg@bowt.ie>
> Date: Sat, 31 Dec 2022 15:13:01 -0800
> Subject: [PATCH v3] Don't accidentally unset all-visible bit in VM.
> 
> One of the calls to visibilitymap_set() during VACUUM's initial heap
> pass could unset a page's all-visible bit during the process of setting
> the same page's all-frozen bit.

As just mentioned upthread, this just seems wrong.


> This could happen in the event of a
> concurrent HOT update from a transaction that aborts soon after.  Since
> the all_visible_according_to_vm local variable that lazy_scan_heap works
> off of when setting the VM doesn't reflect the current state of the VM,
> and since visibilitymap_set() just requested that the all-frozen bit get
> set in one case, there was a race condition.  Heap pages could initially
> be all-visible just as all_visible_according_to_vm is established, then
> not be all-visible after the update, and then become eligible to be set
> all-visible once more following pruning by VACUUM.  There is no reason
> why VACUUM can't remove a concurrently aborted heap-only tuple right
> away, and so no reason why such a page won't be able to reach the
> relevant visibilitymap_set() call site.

Do you have a reproducer for this?


> @@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel)
>           * got cleared after lazy_scan_skip() was called, so we must recheck
>           * with buffer lock before concluding that the VM is corrupt.
>           */
> -        else if (all_visible_according_to_vm && !PageIsAllVisible(page)
> -                 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
> +        else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
> +                 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
>          {
>              elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page
%u",
>                   vacrel->relname, blkno);

Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it
might be useful to know what bit was wrong when debugging problems.


> @@ -1164,12 +1167,34 @@ lazy_scan_heap(LVRelState *vacrel)
>                   !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
>          {
>              /*
> -             * We can pass InvalidTransactionId as the cutoff XID here,
> -             * because setting the all-frozen bit doesn't cause recovery
> -             * conflicts.
> +             * Avoid relying on all_visible_according_to_vm as a proxy for the
> +             * page-level PD_ALL_VISIBLE bit being set, since it might have
> +             * become stale -- even when all_visible is set in prunestate.
> +             *
> +             * Consider the example of a page that starts out all-visible and
> +             * then has a tuple concurrently deleted by an xact that aborts.
> +             * The page will be all_visible_according_to_vm, and will have
> +             * all_visible set in prunestate.  It will nevertheless not have
> +             * PD_ALL_VISIBLE set by here (plus neither VM bit will be set).
> +             * And so we must check if PD_ALL_VISIBLE needs to be set.
>               */
> +            if (!PageIsAllVisible(page))
> +            {
> +                PageSetAllVisible(page);
> +                MarkBufferDirty(buf);
> +            }
> +
> +            /*
> +             * Set the page all-frozen (and all-visible) in the VM.
> +             *
> +             * We can pass InvalidTransactionId as our visibility_cutoff_xid,
> +             * since a snapshotConflictHorizon sufficient to make everything
> +             * safe for REDO was logged when the page's tuples were frozen.
> +             */
> +            Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
>              visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>                                vmbuffer, InvalidTransactionId,
> +                              VISIBILITYMAP_ALL_VISIBLE    |
>                                VISIBILITYMAP_ALL_FROZEN);
>          }
>  
> @@ -1311,7 +1336,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
>  
>          /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
>          if (!vacrel->skipwithvm)
> +        {
> +            /* Caller shouldn't rely on all_visible_according_to_vm */
> +            *next_unskippable_allvis = false;
>              break;
> +        }
>  
>          /*
>           * Aggressive VACUUM caller can't skip pages just because they are
> @@ -1818,7 +1847,11 @@ retry:
>               * cutoff by stepping back from OldestXmin.
>               */
>              if (prunestate->all_visible && prunestate->all_frozen)
> +            {
> +                /* Using same cutoff when setting VM is now unnecessary */
>                  snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
> +                prunestate->visibility_cutoff_xid = InvalidTransactionId;
> +            }
>              else
>              {
>                  /* Avoids false conflicts when hot_standby_feedback in use */
> @@ -2388,8 +2421,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
>  static void
>  lazy_vacuum_heap_rel(LVRelState *vacrel)
>  {
> -    int            index;
> -    BlockNumber vacuumed_pages;
> +    int            index = 0;
> +    BlockNumber vacuumed_pages = 0;
>      Buffer        vmbuffer = InvalidBuffer;
>      LVSavedErrInfo saved_err_info;
>  
> @@ -2406,42 +2439,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP,
>                               InvalidBlockNumber, InvalidOffsetNumber);
>  
> -    vacuumed_pages = 0;
> -
> -    index = 0;
>      while (index < vacrel->dead_items->num_items)
>      {
> -        BlockNumber tblk;
> +        BlockNumber blkno;
>          Buffer        buf;
>          Page        page;
>          Size        freespace;
>  
>          vacuum_delay_point();

I still think such changes are inappropriate for a bugfix, particularly one
that needs to be backpatched.

Greetings,

Andres Freund



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 11:44 AM Andres Freund <andres@anarazel.de> wrote:
> I think that's just an imprecise formulation though - the problem is that we
> can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though
> VISIBILITYMAP_ALL_VISIBLE was concurrently unset.

That's correct.

You're right that my description of the problem from the commit
message was confusing. But we're on the same page about the problem
now.

> ISTM that we ought to update all_visible_according_to_vm from
> PageIsAllVisible() once we've locked the page. Even if we avoid this specific
> case, it seems a recipe for future bugs to have a potentially outdated
> variable around.

I basically agree, but some of the details are tricky.

As I mentioned already, my work on visibility map snapshots just gets
rid of all_visible_according_to_vm, which is my preferred long term
approach. We will very likely need to keep all_visible_according_to_vm
as a cache for performance reasons for as long as we have
SKIP_PAGES_THRESHOLD.

Can we just update all_visible_according_to_vm using
PageIsAllVisible(), without making all_visible_according_to_vm
significantly less useful as a cache? Maybe. Not sure offhand.

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Robert Haas
Date:
On Mon, Jan 2, 2023 at 1:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > The first patch makes sure that the snapshotConflictHorizon cutoff
> > (XID cutoff for recovery conflicts) is never a special XID, unless
> > that XID is InvalidTransactionId, which is interpreted as a
> > snapshotConflictHorizon value that will never need a recovery conflict
> > (per the general convention for snapshotConflictHorizon values
> > explained above ResolveRecoveryConflictWithSnapshot).
>
> Pushed this just now.
>
> Attached is another very simple refactoring patch for vacuumlazy.c. It
> makes vacuumlazy.c save the result of get_database_name() in vacrel,
> which matches what we already do with things like
> get_namespace_name().
>
> Would be helpful if I could get a +1 on
> v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
> somewhat more substantial than the others.

I feel that you should at least have a reproducer for these problems
posted to the thread, and ideally a regression test, before committing
things. I think it's very hard to understand what the problems are
right now.

I don't particularly have a problem with the idea of 0001, because if
we use InvalidTransactionId to mean that there cannot be any
conflicts, we do not need FrozenTransactionId to mean the same thing.
Picking one or the other makes sense. Perhaps we would need two values
if we both needed a value that meant "conflict with nothing" and also
a value that meant "conflict with everything," but in that case I
suppose we would want FrozenTransactionId to be the one that meant
conflict with nothing, since it logically precedes all other XIDs, and
conflicts are with XIDs that precede the value in the record. However,
I don't find the patch very clear, either. It doesn't update any
comments, not even this one:

         /*
          * It's possible that we froze tuples and made the page's XID cutoff
          * (for recovery conflict purposes) FrozenTransactionId.  This is okay
          * because visibility_cutoff_xid will be logged by our caller in a
          * moment.
          */
-        Assert(cutoff == FrozenTransactionId ||
+        Assert(!TransactionIdIsValid(cutoff) ||
                cutoff == prunestate->visibility_cutoff_xid);

Isn't the comment now incorrect as a direct result of the changes in the patch?

As for 0002, I agree that it's bad if we can get into a state where
the all-frozen bit is set and the all-visible bit is not. I'm not
completely sure what concrete harm that will cause, but it does not
seem good. But I also *entirely* agree with Andres that patches should
run around adjusting nearby code - e.g. variable names - in ways that
aren't truly necessary. That just makes life harder, not only for
anyone who wants to review the patch now, but also for future readers
who may need to understand what the patch changed and why.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> Afaict we'll need to backpatch this all the way?

I thought that we probably wouldn't need to, at first. But I now think
that we really have to.

I didn't realize that affected visibilitymap_set() calls could
generate useless set-VM WAL records until you pointed it out. That's
far more likely to happen than the race condition that I described --
it has nothing at all to do with concurrency. That's what clinches it
for me.

> > One of the calls to visibilitymap_set() during VACUUM's initial heap
> > pass could unset a page's all-visible bit during the process of setting
> > the same page's all-frozen bit.
>
> As just mentioned upthread, this just seems wrong.

I don't know why this sentence ever made sense to me. Anyway, it's not
important now.

> Do you have a reproducer for this?

No, but I'm quite certain that the race can happen.

If it's important to have a reproducer then I can probably come up
with one. I could likely figure out a way to write an isolation test
that reliably triggers the issue. It would have to work by playing
games with cleanup lock/buffer pin waits, since that's the only thing
that the test can hook into to make things happen in just the
right/wrong order.

> >                       elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation
\"%s\"page %u",
 
> >                                vacrel->relname, blkno);
>
> Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it
> might be useful to know what bit was wrong when debugging problems.

Theoretically it might change again, if we call
visibilitymap_get_status() again. Maybe I should just broaden the
error message a bit instead?

> I still think such changes are inappropriate for a bugfix, particularly one
> that needs to be backpatched.

I'll remove the changes that are inessential in the next revision. I
wouldn't have done it if I'd fully understood the seriousness of the
issue from the start.

If you're really concerned about diff size then I should point out
that the changes to lazy_vacuum_heap_rel() aren't strictly necessary,
and probably shouldn't be backpatched. I deemed that in scope because
it's part of the same overall problem of updating the visibility map
based on potentially stale information. It makes zero sense to check
with the visibility map before updating it when we already know that
the page is all-visible. I mean, are we trying to avoid the work of
needlessly updating the visibility map in cases where its state was
corrupt, but then became uncorrupt (relative to the heap page) by
mistake?

--
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I feel that you should at least have a reproducer for these problems
> posted to the thread, and ideally a regression test, before committing
> things. I think it's very hard to understand what the problems are
> right now.

Hard to understand relative to what, exactly? We're talking about a
very subtle race condition here.

I'll try to come up with a reproducer, but I *utterly* reject your
assertion that it's a hard requirement, sight unseen. Why should those
be the parameters of the discussion?

For one thing I'm quite confident that I'm right, with or without a
reproducer. And my argument isn't all that hard to follow, if you have
relevant expertise, and actually take the time. But even this is
unlikely to matter much. Even if I somehow turn out to have been
completely wrong about the race condition, it is still self-evident
that the problem of uselessly WAL logging non-changes to the VM
exists. That doesn't require any concurrent access at all. It's a
natural consequence of calling visibilitymap_set() with
VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
for 2 minutes to see it.

> I don't particularly have a problem with the idea of 0001, because if
> we use InvalidTransactionId to mean that there cannot be any
> conflicts, we do not need FrozenTransactionId to mean the same thing.
> Picking one or the other makes sense.

We've already picked one, many years ago -- InvalidTransactionId. This
is a long established convention, common to all REDO routines that are
capable of creating granular conflicts.

I already committed 0001 over a week ago. We were calling
ResolveRecoveryConflictWithSnapshot with FrozenTransactionId arguments
before now, which was 100% guaranteed to be a waste of cycles. I saw
no need to wait more than a few days for a +1, given that this
particular issue was so completely clear cut.

--
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I didn't realize that affected visibilitymap_set() calls could
> generate useless set-VM WAL records until you pointed it out. That's
> far more likely to happen than the race condition that I described --
> it has nothing at all to do with concurrency. That's what clinches it
> for me.

I didn't spend as much time on this as I'd like to so far, but I think
that this concern about visibilitymap_set() actually turns out to not
apply. The visibilitymap_set() call in question is gated by a
"!VM_ALL_FROZEN()", which is enough to avoid the problem with writing
useless VM set records.

That doesn't make me doubt my original concern about races where the
all-frozen bit can be set, without setting the all-visible bit, and
without accounting for the fact that it changed underneath us. That
scenario will have !VM_ALL_FROZEN(), so that won't save us. (And we
won't test VM_ALL_VISIBLE() or PD_ALL_VISIBLE in a way that is
sufficient to realize that all_visible_according_to_vm is stale.
prunestate.all_visible being set doesn't reliably indicate that it's not stale,
but lazy_scan_heap incorrectly believes that it really does work that way.)


--
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Robert Haas
Date:
On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I feel that you should at least have a reproducer for these problems
> > posted to the thread, and ideally a regression test, before committing
> > things. I think it's very hard to understand what the problems are
> > right now.
>
> Hard to understand relative to what, exactly? We're talking about a
> very subtle race condition here.
>
> I'll try to come up with a reproducer, but I *utterly* reject your
> assertion that it's a hard requirement, sight unseen. Why should those
> be the parameters of the discussion?
>
> For one thing I'm quite confident that I'm right, with or without a
> reproducer. And my argument isn't all that hard to follow, if you have
> relevant expertise, and actually take the time.

Look, I don't want to spend time arguing about what seem to me to be
basic principles of good software engineering. When I don't put test
cases into my patches, people complain at me and tell me that I'm a
bad software engineer because I didn't include test cases. Your
argument here seems to be that you're such a good software engineer
that you don't need any test cases to know what the bug is or that
you've fixed it correctly. That seems like a surprising argument, but
even if it's true, test cases can have considerable value to future
code authors, because it allows them to avoid reintroducing bugs that
have previously been fixed. In my opinion, it's not worth trying to
have automated test cases for absolutely every bug we fix, because
many of them would be really hard to develop and executing all of them
every time we do anything would be unduly time-consuming. But I can't
remember the last time before this that someone wanted to commit a
patch for a data corruption issue without even providing a test case
that other people can run manually. If you think that is or ought to
be standard practice, I can only say that I disagree.

I don't particularly appreciate the implication that I either lack
relevant or expertise or don't actually take time, either. I spent an
hour yesterday looking at your patches yesterday and didn't feel I was
very close to understanding 0002 in that time. I feel that if the
patches were better-written, with relevant comments and test cases and
really good commit messages and a lack of extraneous changes, I
believe I probably would have gotten a lot further in the same amount
of time. There is certainly an alternate explanation, which is that I
am stupid. I'm inclined to think that's not the correct explanation,
but most stupid people believe that they aren't, so that doesn't
really prove anything.

> But even this is
> unlikely to matter much. Even if I somehow turn out to have been
> completely wrong about the race condition, it is still self-evident
> that the problem of uselessly WAL logging non-changes to the VM
> exists. That doesn't require any concurrent access at all. It's a
> natural consequence of calling visibilitymap_set() with
> VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
> for 2 minutes to see it.

Apparently not, because I spent more time than that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 10:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Look, I don't want to spend time arguing about what seem to me to be
> basic principles of good software engineering. When I don't put test
> cases into my patches, people complain at me and tell me that I'm a
> bad software engineer because I didn't include test cases. Your
> argument here seems to be that you're such a good software engineer
> that you don't need any test cases to know what the bug is or that
> you've fixed it correctly.

That's not what I said. This is a straw man.

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

> I don't particularly appreciate the implication that I either lack
> relevant or expertise or don't actually take time, either.

The implication was only that you didn't take the time. Clearly you
have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> In summary, I think that there is currently no way that we can have
> the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
> the page all_frozen. It can happen and leave the page all_visible, but
> not all_frozen, due to these very fine details. (Assuming I haven't
> missed another path to the problem with aborted Multis or something,
> but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Robert Haas
Date:
On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> What I actually said was that there is no reason to declare up front
> that the only circumstances under which a fix could be committed is
> when a clean repro is available. I never said that a test case has
> little or no value, and I certainly didn't assert that we definitely
> don't need a test case to proceed with a commit -- since I am not in
> the habit of presumptuously attaching conditions to such things well
> in advance.

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like
this reliably, at least with the aid of a debugger and some
breakpoints, before we go changing the code. The risk of being wrong
is quite high because the code is subtle, and the consequences of
being wrong are potentially very bad because the code is critical to
data integrity. If the reproducer doesn't require a debugger or other
extreme contortions, then we should consider reducing it to a TAP test
that can be committed. If you agree with that, then I'm not sure what
your last email was complaining about. If you disagree, then I don't
know why.

> I have been unable to reproduce the problem, and think it's possible
> that the issue cannot be triggered in practice. Though only through
> sheer luck. Here's why that is:
>
> While pruning will remove aborted dead tuples, freezing will not
> remove the xmax of an aborted update unless the XID happens to be <
> OldestXmin. With my problem scenario, the page will be all_visible in
> prunestate, but not all_frozen -- so it dodges the relevant
> visibilitymap_set() call site.
>
> That just leaves inserts that abort, I think. An aborted insert will
> be totally undone by pruning, but that does still leave behind an
> LP_DEAD item that needs to be vacuumed in the second heap pass. This
> means that we can only set the page all-visible/all-frozen in the VM
> in the second heap pass, which also dodges the relevant
> visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code
correctly understood the circumstances under which we needed to avoid
calling visibilitymap_set(), and wrote the code in a way that
accomplished that purpose. Maybe there's contrary evidence or maybe it
is actually broken somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

        /*
         * If the all-visible page is all-frozen but not marked as such yet,
         * mark it as all-frozen.  Note that all_frozen is only valid if
         * all_visible is true, so we must check both prunestate fields.
         */
        else if (all_visible_according_to_vm && prunestate.all_visible &&
                 prunestate.all_frozen &&
                 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
        {
            /*
             * We can pass InvalidTransactionId as the cutoff XID here,
             * because setting the all-frozen bit doesn't cause recovery
             * conflicts.
             */
            visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_FROZEN);
        }

Or maybe this one?

    if (PageIsAllVisible(page))
    {
        uint8       flags = 0;
        uint8       vm_status = visibilitymap_get_status(vacrel->rel,
                                                         blkno, vmbuffer);

        /* Set the VM all-frozen bit to flag, if needed */
        if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
            flags |= VISIBILITYMAP_ALL_VISIBLE;
        if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
            flags |= VISIBILITYMAP_ALL_FROZEN;

        Assert(BufferIsValid(*vmbuffer));
        if (flags != 0)
            visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
                              *vmbuffer, visibility_cutoff_xid, flags);
    }

These are the only two call sites in vacuumlazy.c where I can see
there being a theoretical risk of the kind of problem that you're
describing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't understand what distinction you're making. It seems like
> hair-splitting to me. We should be able to reproduce problems like
> this reliably, at least with the aid of a debugger and some
> breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

> The risk of being wrong
> is quite high because the code is subtle, and the consequences of
> being wrong are potentially very bad because the code is critical to
> data integrity. If the reproducer doesn't require a debugger or other
> extreme contortions, then we should consider reducing it to a TAP test
> that can be committed. If you agree with that, then I'm not sure what
> your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to
work on that before you chimed in.

> > I have been unable to reproduce the problem, and think it's possible
> > that the issue cannot be triggered in practice. Though only through
> > sheer luck. Here's why that is:

> I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own minds.

> For the purposes of clarifying my understanding, is this the code
> you're principally worried about?

>             visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
>                               vmbuffer, InvalidTransactionId,
>                               VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.

The other visibilitymap_set() callsite that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become
LP_UNUSED items. That isn't buggy, but it is a silly approach, in that
it cares about what the visibility map says about the page being
all-visible, as if it might take a dissenting view that needs to be
taken into consideration (obviously we know what's going on with the
page because we just scanned it ourselves, and determined that it was
at least all-visible).

-- 
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meston
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.

-- 
Peter Geoghegan

Attachment

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
> relevant visibilitymap_set() call site (the one that passes
> VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
> VISIBILITYMAP_ALL_VISIBLE).
>
> Now all_visible_according_to_vm is set to true, but we don't have a
> lock/pin on the same heap page just yet.
>
> * Acquire several non-conflicting row locks on a row on the block in
> question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all
of these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because
I deliberately removed its FRM_NOOP handling, but that is a very
delicate thing to rely on, especially from such a great distance. I
can't imagine that it doesn't fail on HEAD for any reason beyond sheer
luck.)

--
Peter Geoghegan



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

From
Peter Geoghegan
Date:
On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:
> > Afaict we'll need to backpatch this all the way?
>
> I thought that we probably wouldn't need to, at first. But I now think
> that we really have to.

Attached is v4. This is almost the same as v3. The only notable change
is in how the issue is explained in comments, and in the commit
message.

I have revised my opinion on this question once more. In light of what
has come to light about the issue from recent testing, I lean towards
a HEAD-only commit once again. What do you think?

I still hope to be able to commit this on my original timeline (on
Monday or so), without the issue taking up too much more of
everybody's time.

Thanks
-- 
Peter Geoghegan

Attachment