On Mon, Apr 15, 2024 at 02:10:20PM -0700, Andres Freund wrote:
> On 2024-04-15 13:52:04 -0700, Noah Misch wrote:
> > On Mon, Apr 15, 2024 at 12:35:59PM -0400, Robert Haas wrote:
> > > I propose to remove this open item from
> > > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> > >
> > > On the original thread (BUG #17257), Alexander Lakhin says that he
> > > can't reproduce this after dad1539ae/18b87b201. Based on my analysis
> >
> > I have observed the infinite loop in production with v15.5, so that
> > non-reproduce outcome is a limitation in the test procedure. (v14.2 added
> > those two commits.)
>
> How closely have you analyzed those production occurrences? It's not too hard
> to imagine some form of corruption that leads to such a loop, but which isn't
> related to the horizon going backwards? E.g. a corrupted HOT chain can lead
> to heap_page_prune() not acting on a DEAD tuple, but lazy_scan_prune() would
> then encounter a DEAD tuple.

One occurrence had these facts:

HeapTupleHeaderGetXmin = 95271613
HeapTupleHeaderGetUpdateXid = 95280147
vacrel->OldestXmin = 95317451
vacrel->vistest->definitely_needed = 95318928
vacrel->vistest->maybe_needed = 93624425

How compatible are those with the corruption vectors you have in view?
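
To make the relationship between the values above concrete, here is a minimal
sketch of how the two horizon tests would classify the update XID. It is an
illustration, not the PostgreSQL code: the names xid_precedes,
classify_by_vistest, dead_by_oldest_xmin and check_reported_values are
hypothetical, the comparison uses plain 32-bit XIDs where the real
GlobalVisState uses 64-bit FullTransactionIds, and it assumes the updating
transaction committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Outcome of the cheap vistest bounds check, before any horizon recomputation. */
typedef enum
{
    XID_REMOVABLE,      /* older than maybe_needed */
    XID_NOT_REMOVABLE,  /* at or beyond definitely_needed */
    XID_NEEDS_RECHECK   /* between the two bounds; accurate horizon required */
} HorizonResult;

/* Wraparound-aware "a is older than b", valid for normal XIDs only. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Simplified model of a test against maybe_needed/definitely_needed. */
static HorizonResult
classify_by_vistest(TransactionId xid,
                    TransactionId maybe_needed,
                    TransactionId definitely_needed)
{
    if (xid_precedes(xid, maybe_needed))
        return XID_REMOVABLE;
    if (!xid_precedes(xid, definitely_needed))
        return XID_NOT_REMOVABLE;
    return XID_NEEDS_RECHECK;
}

/* Simplified model of a DEAD test against OldestXmin (assumes xmax committed). */
static bool
dead_by_oldest_xmin(TransactionId update_xid, TransactionId oldest_xmin)
{
    return xid_precedes(update_xid, oldest_xmin);
}

/* Applies both models to the values reported above. */
static void
check_reported_values(void)
{
    const TransactionId update_xid = 95280147u;

    /* OldestXmin-based test: the deleting XID is older, so the tuple looks DEAD. */
    assert(dead_by_oldest_xmin(update_xid, 95317451u));

    /* vistest bounds: the XID falls between maybe_needed and definitely_needed. */
    assert(classify_by_vistest(update_xid, 93624425u, 95318928u)
           == XID_NEEDS_RECHECK);
}
```

Under those assumptions, the OldestXmin comparison reports the tuple as DEAD,
while the vistest bounds alone cannot decide and the answer depends on what a
recomputed horizon says.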

> > > of the code, I suspect that there is a residual bug, or at least that
> > > there was one prior to 6f47f6883151366c031cd6fd4011e66d2c702a90.
> >
> > Can you say more about how 6f47f6883151366c031cd6fd4011e66d2c702a90 mitigated
> > the regression that 1ccc1e05ae introduced? Thanks for discovering that.
>
> Which regression has 1ccc1e05ae actually introduced? As I pointed out
> upthread, the proposed path to corruption doesn't seem to actually lead to
> corruption, "just" an error? Which actually seems considerably better than an
> endless retry loop that cannot be cancelled.

A transient, spurious error is far better than an uninterruptible infinite
loop with a buffer content lock held. If a transient error is the consistent
outcome, I would agree 1ccc1e05ae improved the situation and didn't regress
it. That would close the open item. I tried briefly to understand
https://postgr.es/m/flat/20240415173913.4zyyrwaftujxthf2@awork3.anarazel.de
but I felt verifying its argument was going to be a big job for me. Would
those errors happen transiently, like the infinite loop, or would they persist
until something resets the tuple fields (e.g. ATRewriteTables())?

Thanks,
nm