Re: recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: recovering from "found xmin ... from before relfrozenxid ..."
Date
Msg-id CAA4eK1KsVpDQPMj5hB5tT9=Pu8-9OG2EOcLHb+aeEg8YSV1YjA@mail.gmail.com
Whole thread Raw
In response to Re: recovering from "found xmin ... from before relfrozenxid ..."  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: recovering from "found xmin ... from before relfrozenxid ..."  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum.  I don't have time to poke further right now.
>
..
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway?  I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules.  (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.)  However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear).  So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables.  We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits.  However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist.  I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes.  This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>

Yeah, on a quick look it seems before commit dc7420c2c9 the
pruneheap.c and the main Vacuum code use to make the same decision and
that is commit which has introduced GlobalVisState stuff.

> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>

What if ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part but if we
can do that then the test will be probably stable.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Ashutosh Sharma
Date:
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Next
From: Andy Fan
Date:
Subject: Re: Dynamic gathering the values for seq_page_cost/xxx_cost