Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs
From: Andres Freund
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id: 20211118071941.e4far2w5vufnahun@alap3.anarazel.de
In response to: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Andres Freund <andres@anarazel.de>)
Responses: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List: pgsql-bugs
Hi,

I spent quite a bit more time trying to have a good testcase for reproducing this reliably. I didn't feel comfortable relying on testing that takes substantial amounts of time to hit the bug.

On 2021-11-10 11:20:10 -0800, Andres Freund wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
>
> Consider a page like this:
>
> lp 1: redirect to lp 2
> lp 2: deleted by xid x, not yet committed
>
> and a sequence of events like this:
>
> 1) heap_prune_chain(rootlp = 1)
> 2) commit x
> 3) heap_prune_chain(rootlp = 2)
>
> 1) heap_prune_chain(rootlp = 1) will go to lp 2, and see a
> HEAPTUPLE_DELETE_IN_PROGRESS and thus not do anything.
>
> 3) then could, with the snapshot scalability changes, get DEAD back from
> HTSV. Due to the "fuzzy" nature of the post-snapshot-scalability xid
> horizons, that is possible, because we can end up rechecking the boundary
> condition and seeing that now the horizon allows us to prune x / lp 2.
>
> At that point we have a redirect tuple pointing into an unused slot. Which
> is "illegal", because something independent can be inserted into that slot.

The above isn't quite right - it turns out this requires additional steps to be triggerable (without debugging code to make things easier to hit). As described above, the result won't change, because x's xmin will have held the overall xmin horizon far enough back that GlobalVisTestShouldUpdate() won't ever recompute the horizon for x on its own.

However, what *can* happen is that *another* tuple is within the "fuzzy" horizon window, and that tuple can trigger a recomputation of the horizon. Which then basically triggers the above problem.

However, even that is very hard to hit, because normally we won't recompute the horizon in that case either, because of

 * The current heuristic is that we update only if RecentXmin has changed
 * since the last update. If the oldest currently running transaction has not
 * finished, it is unlikely that recomputing the horizon would be useful.

I.e. a snapshot needs to be computed between the vacuum_set_xid_limits() and the heap_page_prune() call.

Turns out there's a window: vac_open_indexes() needs to lock the indexes, which in turn will trigger AcceptInvalidationMessages(). Relcache invalidations can then trigger a catalog snapshot to be built, which can increase RecentXmin and later trigger a new accurate horizon to be built.

Consider a page like:

lp 1: redirect to lp 3
lp 2: RECENTLY_DEAD tuple, xmax x1, held back by session y's xmin horizon
lp 3: DELETE_IN_PROGRESS tuple, xmax x2

and the following sequence of actions:

1) vacuum_set_xid_limits() -> sets ComputeXidHorizonsResultLastXmin = RecentXmin
2) session y commits
3) vac_open_indexes() -> AcceptInvalidationMessages() -> InvalidateCatalogSnapshot()
   -> RelationCacheInvalidateEntry() -> GetCatalogSnapshot() -> GetSnapshotData()
   updates RecentXmin
4) x2 commits
5) heap_prune_chain(1) -> sees lp 3 as RECENTLY_DEAD (or DELETE_IN_PROGRESS,
   depending on timing) -> doesn't do anything
6) heap_prune_chain(2) -> sees a RECENTLY_DEAD tuple, GlobalVisTestShouldUpdate()
   says yes, updates the xid horizon to above x2
7) heap_prune_chain(3) -> uses the HeapOnly path at the beginning, sees a DEAD
   tuple, marks the element as unused

Boom.
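To make the failure mode easier to follow, below is a minimal, self-contained toy model of the fuzzy-horizon behaviour (compiles with any C compiler). It is not the real procarray.c code: the bounds, xid values and names such as LastXminAtHorizonComputation are invented for illustration, but the control flow mirrors the two-bound (maybe_needed / definitely_needed) window and the "only recompute if RecentXmin changed" rule quoted above.

/*
 * Toy model of the "fuzzy" horizon window described above.  This is NOT the
 * real PostgreSQL code; the variables, bounds and helper names are invented
 * for illustration only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* horizon state as a pruning pass would see it */
static TransactionId maybe_needed = 100;       /* optimistic (old) bound */
static TransactionId definitely_needed = 107;  /* conservative (new) bound */

/* snapshot-side state */
static TransactionId RecentXmin = 100;
static TransactionId LastXminAtHorizonComputation = 100;

/* what an accurate recomputation would yield once x2 has committed */
static TransactionId accurate_horizon = 111;

/* heuristic: only recompute if RecentXmin moved since the last computation */
static bool should_update_horizon(void)
{
    return RecentXmin != LastXminAtHorizonComputation;
}

static bool xid_is_removable(TransactionId xid)
{
    if (xid < maybe_needed)
        return true;                /* clearly prunable */
    if (xid >= definitely_needed)
        return false;               /* clearly still needed, no recheck */

    /* in the fuzzy window: maybe recompute an accurate horizon */
    if (should_update_horizon())
    {
        maybe_needed = definitely_needed = accurate_horizon;
        LastXminAtHorizonComputation = RecentXmin;
    }
    return xid < maybe_needed;
}

int main(void)
{
    TransactionId x1 = 105;   /* lp 2's xmax, inside the fuzzy window */
    TransactionId x2 = 108;   /* lp 3's xmax, above definitely_needed */

    /* step 3: a catalog snapshot during vac_open_indexes() bumped RecentXmin */
    RecentXmin = 109;

    /* step 5: lp 3 is not removable, no recomputation is even attempted */
    printf("lp 3 removable? %d\n", xid_is_removable(x2));   /* prints 0 */

    /* step 6: lp 2 falls into the fuzzy window and triggers a recomputation */
    printf("lp 2 removable? %d\n", xid_is_removable(x1));   /* prints 1 */

    /* step 7: the same check on lp 3 now says removable, although lp 1 still
     * redirects to it - the inconsistency described above */
    printf("lp 3 removable? %d\n", xid_is_removable(x2));   /* prints 1 */
    return 0;
}

Running it prints 0, 1, 1: the same xid that was "still needed" on the first check becomes removable on the last one, purely because an unrelated tuple in the fuzzy window triggered a horizon recomputation in between.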
Not sure if it's worth it, but I think this can be made into a reliable isolation test by causing a table's index to be locked exclusively, blocking vac_open_indexes(), which can then reliably schedule a new relcache inval, which in turn can compute a new RecentXmin. Might be too fragile to be worth it.

As an independent issue, I don't think it can ever be safe to do catalog access with PROC_IN_VACUUM set - which vac_open_indexes() (and likely some other things) clearly does. Without the current backend's xmin set, another backend can remove still-required tuples. That can cause us to build corrupt relcache entries, and then subsequently skip indexing some indexes, causing corruption. I think I hit this while trying to repro the above, but it's hard to be sure.

> I hit a crash once in 13 with a slightly evolved version of the test (many
> connections creating / dropping the partitions as in the original scenario,
> using :client_id to target different tables). It's possible that my
> instrumentation was the cause of that. Unfortunately it took quite a few
> hours to hit the problem in 13...

I wonder if this was actually the above PROC_IN_VACUUM thing.

Greetings,

Andres Freund
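To illustrate the PROC_IN_VACUUM concern above: horizon computations skip backends that have PROC_IN_VACUUM set, so such a backend's xmin does not hold back tuple removal anywhere, including in the catalogs it is about to scan. The sketch below is a toy model, not the real ComputeXidHorizons(); the struct, the flag value and the xids are all made up for illustration.

/*
 * Toy illustration of why catalog access with PROC_IN_VACUUM set is unsafe,
 * as described above.  NOT the real ComputeXidHorizons(); everything here is
 * invented.  The point is only that a backend whose xmin is ignored for
 * horizon purposes cannot safely scan catalogs: other backends may prune
 * tuples it still needs.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define PROC_IN_VACUUM 0x01

typedef struct
{
    TransactionId xmin;     /* oldest xid this backend might still need */
    int           flags;
} Backend;

/* compute the oldest xmin that must be kept, skipping vacuum backends */
static TransactionId oldest_required_xmin(const Backend *procs, int n,
                                          TransactionId latest)
{
    TransactionId result = latest;

    for (int i = 0; i < n; i++)
    {
        if (procs[i].flags & PROC_IN_VACUUM)
            continue;           /* vacuum's xmin is not taken into account */
        if (procs[i].xmin < result)
            result = procs[i].xmin;
    }
    return result;
}

int main(void)
{
    Backend procs[] = {
        { 100, PROC_IN_VACUUM },  /* our vacuum, scanning catalogs at xmin 100 */
        { 205, 0 },               /* some other, newer backend */
    };
    TransactionId horizon = oldest_required_xmin(procs, 2, 210);

    /* prints 205: tuples only needed by the vacuum backend (dead between
     * xid 100 and 205) are considered removable and may be pruned under it */
    printf("removal horizon: %u\n", (unsigned) horizon);
    return 0;
}

With the vacuum backend's xmin skipped, the computed removal horizon (205 in the toy numbers) is newer than what that backend still needs (100), so catalog tuples it depends on while building relcache entries can be pruned underneath it.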