Re: pg_upgrade can result in early wraparound on databases with high transaction load - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pg_upgrade can result in early wraparound on databases with high transaction load
Date
Msg-id 20210423234256.hwopuftipdmp3okf@alap3.anarazel.de
Whole thread Raw
In response to Re: pg_upgrade can result in early wraparound on databases with hightransaction load  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
Hi,

On 2019-06-15 11:37:59 -0700, Noah Misch wrote:
> On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> > On Mon, May 20, 2019 at 3:10 AM Jason Harvey <jason@reddit.com> wrote:
> > > This week I upgraded one of my large(2.8TB), high-volume databases from 9 to 11. The upgrade itself went fine.
Abouttwo days later, we unexpectedly hit transaction ID wraparound. What was perplexing about this was that the age of
ouroldest `datfrozenxid` was only 1.2 billion - far away from where I'd expect a wraparound. Curiously, the wraparound
errorreferred to a mysterious database of `OID 0`:
 
> > >
> > > UPDATE ERROR:  database is not accepting commands to avoid wraparound data loss in database with OID 0
> 
> That's bad.

Yea. The code triggering it in pg_resetwal is bogus as far as I can
tell. That pg_upgrade triggers it makes this quite bad.

I just hit issues related to it when writing a wraparound handling
test. Peter remembered this issue (how?)...

Especially before 13 (inserts triggering autovacuum) it is quite common
to have tables that only ever get vacuumed due to anti-wraparound
vacuums. And it's common for larger databases to increase
autovacuum_freeze_max_age. Which makes it fairly likely for this to
guess an oldestXid value that's *newer* than an accurate one. Since
oldestXid is used in a few important-ish places (like triggering
vacuums, and in 14 also some snapshot related logic) I think that's bad.

The relevant code:

    if (set_xid != 0)
    {
        ControlFile.checkPointCopy.nextXid =
            FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
                                             set_xid);

        /*
         * For the moment, just set oldestXid to a value that will force
         * immediate autovacuum-for-wraparound.  It's not clear whether adding
         * user control of this is useful, so let's just do something that's
         * reasonably safe.  The magic constant here corresponds to the
         * maximum allowed value of autovacuum_freeze_max_age.
         */
        ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
        if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
            ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
        ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
    }

Originally from:

commit 25ec228ef760eb91c094cc3b6dea7257cc22ffb5
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2009-08-31 02:23:23 +0000

    Track the current XID wrap limit (or more accurately, the oldest unfrozen
    XID) in checkpoint records.  This eliminates the need to recompute the value
    from scratch during database startup, which is one of the two remaining
    reasons for the flatfile code to exist.  It should also simplify life for
    hot-standby operation.


I think we should remove the oldestXid guessing logic, and expose it as
an explicit option. I think it's important that pg_upgrade sets an
accurate value. Probably not worth caring about oldestXidDB though?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Next
From: Andres Freund
Date:
Subject: Re: Testing autovacuum wraparound (including failsafe)