On Tue, Aug 4, 2020 at 1:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The core of the issue seems to be that in the presence of concurrent
> updates, we might not have a stable opinion of which entry of the HOT
> chain is live, leading to trying to index multiple ones of them (using
> the same root TID), which leads to the assertion failure.
I agree with that assessment. FWIW, I believe that contrib/amcheck
will detect this issue on Postgres 12+. If it happened all that often
then we probably would have heard about it by now.
BTW, I backpatched the assertion that fails. All branches have it now.
It might not help, but it certainly can't hurt.
> I'm not sure if I was considering the HOT-chain case when I wrote that,
> but "no harm" seems clearly wrong in that situation: indexing more than
> one in-doubt chain member leads to having multiple index entries pointing
> at the same HOT chain. That could be really bad if they have distinct
> index values (though we do not expect such a case to arise in a system
> catalog, since broken HOT chains should never occur there).
I think that it might accidentally be okay for those reasons, though I
have a hard time imagining that that's what you meant back then. I
doubt that the exact consequences of the problem will affect what the
fix looks like now, so this may be somewhat of an academic question.
> In the light of this, it bothers me that the DELETE_IN_PROGRESS case
> has an exception for HOT chains:
>
> if (checking_uniqueness ||
> HeapTupleIsHotUpdated(heapTuple))
> // wait
>
> while the INSERT_IN_PROGRESS case has none. Simple symmetry
> would suggest that the INSERT_IN_PROGRESS case should be
>
> if (checking_uniqueness ||
> HeapTupleIsHeapOnly(heapTuple))
> // wait
I had exactly the same intuition.
> but I'm not 100% convinced that that's right.
Why doubt that explanation?
As I've said, it's clear that the original HOT commit imagined that
this wait business was all about avoiding confusion about which heap
tuple to index for the HOT chain -- nothing more or less than that.
The simplest explanation seems to be that 1ddc2703a93 missed that
subtlety. When some (though not all) of the problems came to light a
few years later, 520bcd9c9bb didn't go far enough. We know that
1ddc2703a93 got the DELETE_IN_PROGRESS stuff wrong -- why doubt that
it also got the INSERT_IN_PROGRESS stuff wrong?
--
Peter Geoghegan