Re: Fixing a couple of buglets in how VACUUM sets visibility map bits - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
Date
Msg-id CAH2-WzmisLS0Gxj=+NJLAxqC9bZmUV4W3CdD0dxnisf7WeXPcQ@mail.gmail.com
Whole thread Raw
In response to Re: Fixing a couple of buglets in how VACUUM sets visibility map bits  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
List pgsql-hackers
On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meston
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.

-- 
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Use windows VMs instead of windows containers on the CI
Next
From: David Rowley
Date:
Subject: Re: Todo: Teach planner to evaluate multiple windows in the optimal order