Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

Hi,

On 2021-11-22 14:34:36 -0800, Andres Freund wrote:
> I think I'm actually just not sure what the better approach for the
> backbranches is. I'm worried about the size of your patch, I'm worried about
> the craziness of the current architecture (if you can call it that) of
> heap_page_prune() with my patch.

Given the lack of further discussion, let's go with the "minimal" fix and your
stuff later. Attached is further polished patch - no significant changes,
plenty copy-editing stuff.

0001 is the bugfix

0002 adds the additional assertions - I'm wondering about only committing that
  in HEAD?

  I think your patch should easily be able to use prstate->htsv?

0003 removes the redundant RelationGetNumberOfBlocks() calls. As 0001 does not
  change the total number of RelationGetNumberOfBlocks() calls I'm inclined to
  just apply this to HEAD?

0004 is a patch that tries to address your point about the GlobalVisTestFor()
  placement, sort of

  My earlier point that moving it to earlier would be a bad idea because it'd
  make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
  any horizon determination. However moving the GlobalVisTestFor() earlier
  still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
  later. The attached patch moves both, wrapped in a new function, to just
  before the scan in lazy_scan_heap().

  This clearly is HEAD only material - I'm only bringing it up here because
  the issue of the GlobalVisTestFor() placement was raised in here...


> > > I think it's worth to clean up the regression test I wrote and use it
> > > regardless of which version of the fix we end up choosing. But I'm a bit bit
> > > on the fence - it's quite complicated.
> >
> > +1 to the idea of keeping it somewhere. Without necessarily running it
> > on the BF.
> >
> > > OTOH, I think with the framework in place it'd not be too hard to write a few
> > > more tests for odd pruning scenarios...
> >
> > Can we commit a test like this without running it by default, at all?
> > It's not like it has never been done before.
>
> Is there a reason not to run it if we commit it? IME tests not run by default
> tend to bitrot very quickly.  The only issue that I can think of is that it
> might be hard to make the test useful and robust in the presence of
> debug_discard_caches != 0.

I tried pretty hard to get the test to be reliable enough to commit it. But in
the end I think using the index locking as a "control" mechanism just isn't
great - as evidenced by 0003 breaking the test entirely. I think unfortunately
there's not much hope for testing this unless we get wait points - which seems
not too close :(.

Greetings,

Andres Freund

Attachment

pgsql-bugs by date:

Previous
From: Vik Fearing
Date:
Subject: Re: BUG #17321: count(*) on a 1,874,554,883 rows partitioned table takes several minutes.
Next
From: Andres Freund
Date:
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum