On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > So this confirms the suspicion that the cause of the buildfarm
> > failures is a concurrently-open transaction, presumably from
> > autovacuum. I don't have time to poke further right now.
>
..
> 2. As Amit suspected, there's an inconsistency between pruneheap.c's
> rules for which tuples are removable and vacuum.c's rules for that.
> This seems like a massive bug in its own right: what's the point of
> pruneheap.c going to huge effort to decide whether it should keep a
> tuple if vacuum will then kill it anyway? I do not understand why
> whoever put in the GlobalVisState stuff only applied it in pruneheap.c
> and not VACUUM proper.
>
> These two points interact, in that we don't get to the "cannot freeze"
> failure except when pruneheap has decided not to remove something that
> is removable according to VACUUM's rules. (VACUUM doesn't actually
> remove it, because lazy_scan_heap won't try to remove HeapOnly tuples
> even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze
> the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert
> a7212be8b then the pg_surgery test still fails in the presence of a
> concurrent transaction (both of the expected "skipping TID" notices
> disappear). So reverting that patch wouldn't get us out of trouble.
>
> I think to move forward, we need to figure out what the freezing
> behavior ought to be for temp tables. We could make it the same
> as it was before a7212be8b, which'd just require some more complexity
> in vacuum_set_xid_limits. However, that negates the idea that we'd
> like VACUUM's behavior on a temp table to be fully independent of
> whether concurrent transactions exist. I'd prefer to allow a7212be8b's
> behavior to stand, but then it seems we need to lobotomize the error
> check in heap_prepare_freeze_tuple to some extent.
>
> Independently of that, it seems like we need to fix things so that
> when pruneheap.c is called by vacuum, it makes EXACTLY the same
> dead-or-not-dead decisions that the main vacuum code makes. This
> business with applying some GlobalVisState rule or other instead
> seems just as unsafe as can be.
>
Yeah, on a quick look it seems before commit dc7420c2c9 the
pruneheap.c and the main Vacuum code use to make the same decision and
that is commit which has introduced GlobalVisState stuff.
> AFAICS, there is no chance of the existing pg_surgery regression test
> being fully stable if we don't fix both things.
>
What if ensure that it runs with autovacuum = off and there is no
parallel test running? I am not sure about the second part but if we
can do that then the test will be probably stable.
--
With Regards,
Amit Kapila.