Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

From Dmitry Dolgov
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id 20211113150640.vk5zhjangylufxaa@localhost
In response to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
> On Fri, Nov 12, 2021 at 02:46:22PM -0800, Peter Geoghegan wrote:
> On Fri, Nov 12, 2021 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> > > Naturally, I also went through the exercise of trying to find a
> > > counterexample, where pruning doesn't see a disconnected tuple as DEAD
> > > in its HTSV. I could not get the assertion to fail with Alexander's
> > > test case, nor with make check-world.
> >
> > I don't think that provides a meaningful coverage. Alexander's test has a
> > quite limited set operations (which e.g. doesn't include an subxacts), and our
> > own tests around subtransactions, and particularly concurrent subtransaction
> > heavy work, is quite, uh, minimal.
>
> It's a start.
>
> We need to be pragmatic here. There is some uncertainty about what
> HTSV might say about a disconnected tuple in the absence of
> corruption, or there is a risk of a new problem like that coming up in
> the future -- let's work within those confines, then. What do you want
> to do about that? There aren't that many choices, since, to repeat,
> the tuple is "morally" DEAD no matter what. Even with corruption, even
> without corruption in the presence of some unanticipated corner case
> with HTSV -- this is fundamental.

I got curious whether modifying Alexander's test case could reveal
something interesting, so I sprinkled it with savepoints and rollbacks.
Almost immediately a new problem manifested itself, although the
crash has nothing to do with disconnected tuples as far as I can
tell -- still probably worth mentioning. In this case vacuum invoked
lazy_scan_prune, and during the first scan one of the chains had a
HEAPTUPLE_DEAD tuple at the third position. The processing flow fell
through to heap_prune_record_prunable, which crashed on an assertion
because it was passed InvalidTransactionId (xid=0):

    #3  0x000055a2b260d1f9 in heap_prune_record_prunable (prstate=0x7ffd0c0ecdf0, xid=0) at pruneheap.c:872
    #4  0x000055a2b260ca72 in heap_prune_chain (buffer=2117, rootoffnum=150, prstate=0x7ffd0c0ecdf0) at pruneheap.c:695
    #5  0x000055a2b260bcd6 in heap_page_prune (relation=0x7fb98e217e20, buffer=2117, vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=false, off_loc=0x55a2b3e6a0cc) at pruneheap.c:288
    #6  0x000055a2b261309c in lazy_scan_prune (vacrel=0x55a2b3e6a060, buf=2117, blkno=192, page=0x7fb97856bf80 "", vistest=0x55a2b31d2d60 <GlobalVisCatalogRels>, prunestate=0x7ffd0c0ee9d0) at vacuumlazy.c:1739

Calling heap_prune_record_prunable only if TransactionIdIsNormal holds
for the xid seems to help. The original implementation doesn't reach
heap_prune_record_prunable in this case, and doesn't crash either.
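
To make the suggested guard concrete, here is a minimal standalone sketch,
not the actual patch. It assumes the assertion in
heap_prune_record_prunable is on TransactionIdIsNormal(xid), which is
consistent with the xid=0 in frame #3. PruneStateSketch, xid_precedes,
record_prunable and record_prunable_if_normal are hypothetical stand-ins
for the real code in pruneheap.c, and the macros are simplified copies of
the ones in access/transam.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the definitions in access/transam.h. */
typedef uint32_t TransactionId;

#define InvalidTransactionId       ((TransactionId) 0)
#define FirstNormalTransactionId   ((TransactionId) 3)
#define TransactionIdIsValid(xid)  ((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/* Hypothetical cut-down prune state: only the field the sketch needs. */
typedef struct PruneStateSketch
{
    TransactionId new_prune_xid;    /* oldest prunable xid seen so far */
} PruneStateSketch;

/* Wraparound-aware "a is older than b", modeled on TransactionIdPrecedes. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    if (!TransactionIdIsNormal(a) || !TransactionIdIsNormal(b))
        return a < b;
    return (int32_t) (a - b) < 0;
}

/* Stand-in for heap_prune_record_prunable: asserts on a non-normal xid. */
static void
record_prunable(PruneStateSketch *prstate, TransactionId xid)
{
    assert(TransactionIdIsNormal(xid));
    if (!TransactionIdIsValid(prstate->new_prune_xid) ||
        xid_precedes(xid, prstate->new_prune_xid))
        prstate->new_prune_xid = xid;
}

/*
 * The guard described above: skip the call for non-normal xids such as
 * InvalidTransactionId. Returns true if the xid was recorded.
 */
static bool
record_prunable_if_normal(PruneStateSketch *prstate, TransactionId xid)
{
    if (!TransactionIdIsNormal(xid))
        return false;
    record_prunable(prstate, xid);
    return true;
}
```

With the guard in place, passing xid=0 leaves new_prune_xid untouched
instead of tripping the assertion. Whether silently skipping the tuple is
the right behaviour in heap_prune_chain is a separate question that this
sketch does not answer.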


