Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date
Msg-id CAH2-WznNKY6ydUczuTXutVmb_dj3MnAcoaVYc8xyignWfNQ=FQ@mail.gmail.com
In response to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List pgsql-bugs
On Tue, Nov 9, 2021 at 9:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've discussed this privately with Andres -- expect more from him
> soon. I came up with more sophisticated instrumentation (better
> assertions, really) that shows that the problem begins in VACUUM, not
> opportunistic pruning (certainly with the test case we have).

Attached is a WIP fix for the bug. The idea here is to follow all HOT
chains in an initial pass over the page, following even LIVE
heap-only tuples. Any heap-only tuples that the initial pass does not
find to be part of some valid HOT chain will now be processed in a
second pass over the page. We
expect (and assert) that these "disconnected" heap-only tuples will
all be either DEAD or RECENTLY_DEAD. We treat them as DEAD either way,
on the grounds that they must be from an aborted xact in any case.
Note that we already do something very similar in some cases -- we can
consider some tuples from a HOT chain DEAD, even though
they're RECENTLY_DEAD (provided a later tuple from the chain really is
DEAD).
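
To make the two-pass idea concrete, here is a minimal sketch in C. It is a toy model, not the attached patch: ToyItem, TupleStatus, TOY_MAX_ITEMS and toy_find_disconnected() are all hypothetical names, and the xmin/xmax matching that real HOT chain traversal performs is deliberately left out. It only shows how a first pass can mark every tuple reachable from a chain root, so that a second pass can pick out the "disconnected" heap-only tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (assumptions for illustration, NOT PostgreSQL's real types). */
typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_REDIRECT } ItemKind;
typedef enum { TUPLE_LIVE, TUPLE_RECENTLY_DEAD, TUPLE_DEAD } TupleStatus;

typedef struct
{
    ItemKind    kind;
    bool        heap_only;  /* meaningful for ITEM_NORMAL only */
    TupleStatus status;     /* meaningful for ITEM_NORMAL only */
    int         next;       /* 1-based offset of next chain member (or the
                             * redirect target); 0 means end of chain */
} ToyItem;

#define TOY_MAX_ITEMS 32

/*
 * Sets disconnected[off - 1] for every heap-only tuple that no chain root
 * reaches, and returns how many such tuples were found.
 */
static int
toy_find_disconnected(const ToyItem *items, int nitems, bool *disconnected)
{
    bool    visited[TOY_MAX_ITEMS] = {false};
    int     ndisconnected = 0;

    assert(nitems <= TOY_MAX_ITEMS);

    /* Pass 1: follow every chain from its root, through LIVE tuples too. */
    for (int off = 1; off <= nitems; off++)
    {
        const ToyItem *root = &items[off - 1];
        int         cur;

        if (root->kind == ITEM_UNUSED)
            continue;
        if (root->kind == ITEM_NORMAL && root->heap_only)
            continue;           /* heap-only tuples are never chain roots */

        visited[off - 1] = true;
        cur = root->next;
        while (cur >= 1 && cur <= nitems && !visited[cur - 1])
        {
            const ToyItem *member = &items[cur - 1];

            /* Every non-root chain member must be a normal heap-only tuple. */
            if (member->kind != ITEM_NORMAL || !member->heap_only)
                break;
            visited[cur - 1] = true;
            cur = member->next;
        }
    }

    /* Pass 2: any heap-only tuple that pass 1 never reached is disconnected. */
    for (int off = 1; off <= nitems; off++)
    {
        const ToyItem *item = &items[off - 1];

        disconnected[off - 1] = false;
        if (item->kind == ITEM_NORMAL && item->heap_only && !visited[off - 1])
        {
            /* Expected to be DEAD or RECENTLY_DEAD, never LIVE. */
            assert(item->status != TUPLE_LIVE);
            disconnected[off - 1] = true;   /* treated as DEAD either way */
            ndisconnected++;
        }
    }
    return ndisconnected;
}
```

In this model a caller would pass the page's items plus a bool array of the same length, then treat every flagged offset as removable, which mirrors the "DEAD either way" handling described above.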

The patch also has more detailed assertions inside heap_page_prune().
These should catch any HOT chain invariant violations at just about
the earliest opportunity, at least when assertions are enabled.
This is especially true because we now follow every HOT chain from
beginning to end, even when we already know that there are no more
DEAD/RECENTLY_DEAD tuples in the chain to be found.

I'm not sure why this seems to have become more of a problem following
the snapshot scalability work from Andres -- Alexander mentioned that
commit dc7420c2 looked like it was the source of the problem here, but
I can't see any reason why that might be true (even though I accept
that it might well *appear* to be true). I believe Andres has some
theory on that, but I don't know the details myself. AFAICT, this is a
live bug on all supported versions. We simply weren't being careful
enough to preserve the invariant that an LP_REDIRECT can only point
to a valid heap-only tuple. The really surprising thing here is that
it took this long for it to visibly break.
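
The LP_REDIRECT invariant can be written down as a simple page-level check. The sketch below is again a toy model under stated assumptions: ToyLinePointer, ToyLpKind and toy_redirects_are_valid() are hypothetical names, not anything from the patch or from PostgreSQL's headers, and "valid" is reduced here to "a normal item whose tuple is heap-only".

```c
#include <assert.h>     /* so callers can assert() on the result */
#include <stdbool.h>

/* Toy line pointer kinds (illustrative only, not PostgreSQL's ItemId flags). */
typedef enum
{
    LP_TOY_UNUSED,
    LP_TOY_NORMAL,
    LP_TOY_REDIRECT,
    LP_TOY_DEAD
} ToyLpKind;

typedef struct
{
    ToyLpKind   kind;
    bool        heap_only;      /* meaningful for LP_TOY_NORMAL only */
    int         redirect_to;    /* 1-based target, LP_TOY_REDIRECT only */
} ToyLinePointer;

/*
 * Returns true if every redirect on the toy page points at an in-range,
 * normal, heap-only item; returns false on the first violation found.
 */
static bool
toy_redirects_are_valid(const ToyLinePointer *lps, int nlps)
{
    for (int off = 1; off <= nlps; off++)
    {
        const ToyLinePointer *lp = &lps[off - 1];
        const ToyLinePointer *target;

        if (lp->kind != LP_TOY_REDIRECT)
            continue;

        /* The target offset must exist on the page at all. */
        if (lp->redirect_to < 1 || lp->redirect_to > nlps)
            return false;

        /* The target must be a normal item holding a heap-only tuple. */
        target = &lps[lp->redirect_to - 1];
        if (target->kind != LP_TOY_NORMAL || !target->heap_only)
            return false;
    }
    return true;
}
```

A check of this shape would typically sit behind an assertion, for example assert(toy_redirects_are_valid(lps, nlps)), so that a violated invariant is caught where it happens rather than in some later, unrelated code path.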

-- 
Peter Geoghegan
