Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date
Msg-id CAH2-Wzm9_2s=_6eKm5bNfz6-X9QYmWPd2h=iF_1wBhZufQ4WQw@mail.gmail.com
Whole thread Raw
In response to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Andres Freund <andres@anarazel.de>)
Responses Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List pgsql-bugs
On Fri, Dec 10, 2021 at 8:57 PM Andres Freund <andres@anarazel.de> wrote:
> Given the lack of further discussion, let's go with the "minimal" fix and your
> stuff later. Attached is further polished patch - no significant changes,
> plenty copy-editing stuff.

Makes sense.

> 0001 is the bugfix

LGTM.

> 0002 adds the additional assertions - I'm wondering about only committing that
>   in HEAD?

I don't think that it makes much difference, since these are only
assertions. I probably wouldn't bother with that myself, but I also
have zero concern about it breaking anything.

>   I think your patch should easily be able to use prstate->htsv?

Let's get this committed, to 14 and HEAD. And then leave it to me to
rebase, as additional HEAD-only hardening.

> 0003 removes the redundant RelationGetNumberOfBlocks() calls. As 0001 does not
>   change the total number of RelationGetNumberOfBlocks() calls I'm inclined to
>   just apply this to HEAD?

Yeah, that can be part of my follow-up hardening patch, which (to
repeat) will be HEAD-only.

> 0004 is a patch that tries to address your point about the GlobalVisTestFor()
>   placement, sort of
>
>   My earlier point that moving it to earlier would be a bad idea because it'd
>   make the horizon "older" was bogus - GlobalVisTestFor() doesn't itself do
>   any horizon determination. However moving the GlobalVisTestFor() earlier
>   still seems wrong - imo the vacuum_set_xid_limits() call should be moved to
>   later. The attached patch moves both, wrapped in a new function, to just
>   before the scan in lazy_scan_heap().

I think that this is the right direction, and I like the comments a
lot. But I don't like the idea of putting this initialization into
lazy_scan_heap(), for purely structural reasons. I plan to get rid of
the VacuumParams argument to lazy_scan_heap() completely, which would
make "VacuumParams setup" the sole responsibility of its caller,
heap_vacuum_rel(). The draft version of my "remove special cases to
advance relfrozenxid" patch (the one you've reviewed) already does
this. In general I really think it's important to make the state in
vacuumlazy.c as simple as possible.

I have no objection to delaying the lazy_scan_heap_limits() stuff
until right before lazy_scan_heap() is called. However, I do think
that we should always know certain basic "immutable" facts about a
VACUUM at the point that we call lazy_scan_heap(), which is not the
case with this patch.

Honestly, I'm surprised that you see much value in delaying the
lazy_scan_heap_limits() stuff until the very last microsecond. How
many microseconds could we possibly delay it by?

> I tried pretty hard to get the test to be reliable enough to commit it. But in
> the end I think using the index locking as a "control" mechanism just isn't
> great - as evidenced by 0003 breaking the test entirely. I think unfortunately
> there's not much hope for testing this unless we get wait points - which seems
> not too close :(.

Hopefully somebody will seriously commit to that at some point. It's
important, but not very glamorous.

-- 
Peter Geoghegan



pgsql-bugs by date:

Previous
From: Strahinja Kustudić
Date:
Subject: Re: BUG #17330: EXPLAIN hangs and very long query plans
Next
From: Peter Geoghegan
Date:
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum