Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id CAH2-WzmqBtFwdmBW=AG8ZZW_uF5a4entmv4QmqhsXOT-Fj4L-g@mail.gmail.com
In response to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Andres Freund <andres@anarazel.de>)
Responses Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
List pgsql-bugs
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.
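
To make the quoted failure mode concrete, here is a minimal toy sketch in C. It is not PostgreSQL code: the ToyTuple and ToyVisibility types, the toy_* functions, and the simplified visibility rule are all hypothetical stand-ins. It only shows how a visibility check that reads mutable shared state can give two different answers for the same tuple when that tuple is examined twice.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only; not PostgreSQL's real types. */
typedef enum { TOY_DELETE_IN_PROGRESS, TOY_DEAD } ToyStatus;

typedef struct {
    unsigned deleter_xid;         /* hypothetical: xid that deleted the tuple */
} ToyTuple;

/* Hypothetical shared state that a concurrent backend may change. */
typedef struct {
    bool     deleter_committed;
    unsigned removable_horizon;   /* xids below this count as removable */
} ToyVisibility;

/* Toy visibility check: the answer depends on mutable state. */
static ToyStatus
toy_satisfies_vacuum(const ToyTuple *tup, const ToyVisibility *vis)
{
    if (vis->deleter_committed && tup->deleter_xid < vis->removable_horizon)
        return TOY_DEAD;
    return TOY_DELETE_IN_PROGRESS;
}

/*
 * Examine the same tuple twice. Between the two checks, simulate the
 * deleting transaction committing and the horizon advancing past it.
 * Returns true if the two checks disagree.
 */
static bool
toy_two_visits_disagree(const ToyTuple *tup, ToyVisibility *vis)
{
    ToyStatus first = toy_satisfies_vacuum(tup, vis);

    vis->deleter_committed = true;                  /* concurrent commit */
    vis->removable_horizon = tup->deleter_xid + 1;  /* horizon moves on */

    ToyStatus second = toy_satisfies_vacuum(tup, vis);

    return first != second;
}
```

In this toy model, calling toy_two_visits_disagree() on a tuple whose deleter has not yet committed returns true: the first check reports TOY_DELETE_IN_PROGRESS and the second reports TOY_DEAD. Any code that assumes both checks return the same answer is then working from inconsistent state.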

As you know, I don't like the way that vacuumlazy.c doesn't explain
anything about the relationship between OldestXmin (which still
exists, but isn't used for pruning) and the similar GlobalVisState
(used only during pruning). Surely this deserves to be addressed,
because we expect these two things to agree in certain specific ways,
but not necessarily in others.

> Note that there are several paths < 14, that cause HTSV()'s answer to change
> for the same xid. E.g. when the transaction inserting a tuple version aborts,
> we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.

Right -- that one I certainly knew about.  After all, the
tupgone-ectomy work from my commit 8523492d specifically targeted this
case.

> But I haven't quite found a
> path to trigger problems with that, because there won't be redirects to a
> tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
> to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).

That explains why the snapshot scalability work either made these
problems possible for the first time, or at the very least made them
far, far more likely in practice.

The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even so, there is really no good reason to believe that
that was actually a problem before commit dc7420c2. Even if we assume
that there's a problem before 14, the surface area is vastly smaller
than on 14 -- the relevant pruneheap.c code hasn't really ever changed
since HOT went in. And so I think that the most sensible course of
action here is this: commit a fix to Postgres 14 + HEAD only -- no
backpatch to earlier versions.

We could go back further than that, but ISTM that the risk of causing
new problems far outweighs the benefits. Whereas I feel pretty
confident that we need to do something on 14.

-- 
Peter Geoghegan


