Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com
In response to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List pgsql-bugs
On Tue, Nov 9, 2021 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is a WIP fix for the bug. The idea here is to follow all HOT
> chains in an initial pass over the page, while even following LIVE
> heap-only tuples. Any heap-only tuples that we don't determine are
> part of some valid HOT chain (following an initial pass over the whole
> heap page) will now be processed in a second pass over the page.

I realized that I could easily go further than in v1, and totally get
rid of the "marked" array (which tracks whether we have decided to
mark an item as LP_DEAD/LP_UNUSED/a new LP_REDIRECT/newly pointed to
by another LP_REDIRECT). In my v1 from earlier today we already had an
array that records whether or not each item is part of any known valid
chain, which is strictly better than knowing whether or not they were
"marked" earlier. So why bother with the "marked" array at all, even
for assertions? It is less robust (not to mention less efficient) than
just using the new "fromvalidchain" array.
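
To make the two-pass idea concrete, here is a minimal sketch in C. It is
a toy model, not code from the patch: ToyItem, MAX_ITEMS and
find_orphaned_heap_only() are hypothetical names invented for
illustration, and the model ignores LP_REDIRECT items, visibility checks
and everything else the real pruning code has to deal with. Only the
"fromvalidchain" name is taken from the description above. The first
pass follows each chain from its root and records membership; the second
pass picks up any heap-only tuple that no chain reached.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS 16			/* toy limit, not MaxHeapTuplesPerPage */

/* Hypothetical, simplified stand-in for a line pointer plus tuple header. */
typedef struct ToyItem
{
	bool		used;			/* slot holds a tuple */
	bool		heap_only;		/* reachable only through a HOT chain */
	int			next;			/* offset of next chain member, 0 = end */
} ToyItem;

/*
 * Offsets are 1-based; items[0] is unused.  Writes the offsets of heap-only
 * tuples that belong to no chain into orphans[] and returns how many.
 */
static int
find_orphaned_heap_only(const ToyItem *items, int nitems, int *orphans)
{
	bool		fromvalidchain[MAX_ITEMS + 1] = {false};
	int			norphans = 0;

	assert(nitems <= MAX_ITEMS);

	/* Pass 1: follow every chain from its root, recording each member. */
	for (int off = 1; off <= nitems; off++)
	{
		if (!items[off].used || items[off].heap_only)
			continue;			/* only non-heap-only tuples start a chain */

		/* Stop at end of chain, a bad offset, or an already-seen member. */
		for (int cur = off;
			 cur >= 1 && cur <= nitems &&
			 items[cur].used && !fromvalidchain[cur];
			 cur = items[cur].next)
			fromvalidchain[cur] = true;
	}

	/* Pass 2: heap-only tuples that no chain reached need separate handling. */
	for (int off = 1; off <= nitems; off++)
	{
		if (items[off].used && items[off].heap_only && !fromvalidchain[off])
			orphans[norphans++] = off;
	}

	return norphans;
}
```

In this model a single per-item boolean answers both "was this item
already handled?" and "does this item belong to a valid chain?", which
is the reason a separate "marked" array adds nothing.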

Attached is v2, which gets rid of the "marked" array as described. It
also has more carefully worked-out comments and assertions. The patch
has stood up to a fair amount of stress-testing: I ran Alexander's
original test case against it for over an hour. Without any fix, the
same test case would usually cause an assertion failure in about 5
minutes.

I have yet to do any work on validating the performance of this patch,
though that definitely needs to happen.

Anybody have any thoughts on how far this should be backpatched? We'll
probably need to do that for Postgres 14. Less sure about other
branches, which haven't been directly demonstrated to be affected by
the bug so far. Haven't tried to break earlier branches with
Alexander's test case, though I will note again that Alexander
couldn't do that when he tried.

-- 
Peter Geoghegan
