On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, FreezeMultiXactId() can fully remove an xmax that has some
> member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
> possible, at least when no individual member is still running. Doesn't
> have to involve transaction aborts at all.
>
> Let me go try to break it that way...
Attached patch shows how this could break.
It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.
The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meston
test" with the patch applied, which will pass without incident.
I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:
* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.
* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.
* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).
Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.
* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.
* End every session whose XID is stored in our multi (commit/abort).
* Within GDB, continue from before -- observe assertion failure.
Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.
--
Peter Geoghegan