Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date
Msg-id CAH2-Wzn_LLtTk-jGranhLcciSt_70pFKJNngZOf8Gt_MNgTmnA@mail.gmail.com
Whole thread Raw
In response to Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
List pgsql-hackers
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Perhaps something is amiss inside vac_update_relstats(), where the
> boolean flag that indicates that pg_class.relfrozenxid was advanced is
> set:
>
>     if (frozenxid_updated)
>         *frozenxid_updated = false;
>     if (TransactionIdIsNormal(frozenxid) &&
>         pgcform->relfrozenxid != frozenxid &&
>         (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
>          TransactionIdPrecedes(ReadNextTransactionId(),
>                                pgcform->relfrozenxid)))
>     {
>         if (frozenxid_updated)
>             *frozenxid_updated = true;
>         pgcform->relfrozenxid = frozenxid;
>         dirty = true;
>     }
>
> Maybe the "existing relfrozenxid is in the future, silently update
> relfrozenxid" part of the condition (which involves
> ReadNextTransactionId()) somehow does the wrong thing here. But how?

I tried several times to recreate this issue on CI. No luck with that,
though -- can't get it to fail again after 4 attempts.

This was a VACUUM of pg_database, run from an autovacuum worker. I am
vaguely reminded of the two bugs fixed by Andres in commit a54e1f15.
Both were issues with the shared relcache init file affecting shared
and nailed catalog relations. Those bugs had symptoms like " ERROR:
found xmin ... from before relfrozenxid ..." for various system
catalogs.

We know that this particular assertion did not fail during the same VACUUM:

    Assert(vacrel->NewRelfrozenXid == OldestXmin ||
           TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
                                         vacrel->NewRelfrozenXid));

So it's hard to see how this could be a bug in the patch -- the final
new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
problem scenario seen on the CI Windows instance yesterday (that's why
this earlier assertion didn't fail).  The assertion I'm showing here
needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the
condition to account for the fact that
OldestXmin/GetOldestNonRemovableTransactionId() is known to "go
backwards". Without that the regression tests will fail quite easily.

The surprising part of the CI failure must have taken place just after
this assertion, when VACUUM's call to vacuum_set_xid_limits() actually
updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
presumably because the existing relfrozenxid appeared to be "in the
future" when we examine it in pg_class again. We see evidence that
this must have happened afterwards, when the closely related assertion
(used only in instrumentation code) fails:

From my patch:

>             if (frozenxid_updated)
>             {
> -               diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
> +               diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
> +               Assert(diff > 0);
>                 appendStringInfo(&buf,
>                                  _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
> -                                FreezeLimit, diff);
> +                                vacrel->NewRelfrozenXid, diff);
>             }

Does anybody have any ideas about what might be going on here?

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Mingw task for Cirrus CI
Next
From: Andres Freund
Date:
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations