Thread: Re: pg_upgrade can result in early wraparound on databases with high transaction load
Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Mon, May 20, 2019 at 3:10 AM Jason Harvey <jason@reddit.com> wrote:
> This week I upgraded one of my large (2.8TB), high-volume databases from 9 to 11. The upgrade itself went fine. About two days later, we unexpectedly hit transaction ID wraparound. What was perplexing about this was that the age of our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd expect a wraparound. Curiously, the wraparound error referred to a mysterious database of `OID 0`:
>
> UPDATE ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0
>
> We were able to recover after a few hours by greatly speeding up our vacuum on our largest table.
>
> In a follow-up investigation I uncovered the reason we hit the wraparound so early, and also the cause of the mysterious OID 0 message. When pg_upgrade executes, it calls pg_resetwal to set the next transaction ID. Within pg_resetwal is the following code:
>
> https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450
>
> This sets the controldata to have a fake database (OID 0) on the brink of transaction wraparound. Specifically, after pg_upgrade is run, wraparound will occur within around 140 million transactions (provided the autovacuum doesn't finish first). I confirmed by analyzing our controldata before and after the upgrade that this was the cause of our early wraparound.
>
> Given the size and heavy volume of our database, we tend to complete a vacuum in the time it takes around 250 million transactions to execute. With our tunings this tends to be rather safe and we stay well away from the wraparound point under normal circumstances.

This does seem like an unfriendly behavior.

Moving the thread over to the -hackers list for further discussion...

--
Peter Geoghegan
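To see where the "around 140 million transactions" figure comes from, here is a back-of-the-envelope sketch (standalone C, not PostgreSQL source; it assumes the stop limit sits about one million XIDs short of the 2^31 wrap horizon, as in SetTransactionIdLimit()):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Hypothetical post-upgrade state; the exact value doesn't matter. */
        uint32_t next_xid   = 3000000000u;
        /* What pg_resetwal fakes: oldestXid = nextXid - 2 billion. */
        uint32_t oldest_xid = next_xid - 2000000000u;
        /* Wraparound horizon: ~2^31 XIDs past oldestXid (modular arithmetic). */
        uint32_t wrap_limit = oldest_xid + 2147483647u;
        /* Commands start being refused ~1 million XIDs before the horizon. */
        uint32_t stop_limit = wrap_limit - 1000000u;

        printf("XIDs left until commands are refused: %u\n",
               stop_limit - next_xid);    /* ~146.5 million */
        return 0;
    }

The result does not depend on next_xid, which is why a busy cluster sits only about 146 million transactions away from the stop point immediately after the upgrade unless autovacuum manages to advance datfrozenxid first.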
Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> On Mon, May 20, 2019 at 3:10 AM Jason Harvey <jason@reddit.com> wrote:
> > This week I upgraded one of my large (2.8TB), high-volume databases from 9 to 11. The upgrade itself went fine. About two days later, we unexpectedly hit transaction ID wraparound. What was perplexing about this was that the age of our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd expect a wraparound. Curiously, the wraparound error referred to a mysterious database of `OID 0`:
> >
> > UPDATE ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0

That's bad.

> > We were able to recover after a few hours by greatly speeding up our vacuum on our largest table.

For what it's worth, a quicker workaround is to VACUUM FREEZE any database, however small. That forces a vac_truncate_clog(), which recomputes the wrap point from pg_database.datfrozenxid values. This demonstrates the workaround:

--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -248,7 +248,10 @@ case $testhost in
 esac
 
 pg_dumpall --no-sync -f "$temp_root"/dump2.sql || pg_dumpall2_status=$?
+pg_controldata "${PGDATA}"
+vacuumdb -F template1
 pg_ctl -m fast stop
+pg_controldata "${PGDATA}"
 
 if [ -n "$pg_dumpall2_status" ]; then
 	echo "pg_dumpall of post-upgrade database cluster failed"

> > In a follow-up investigation I uncovered the reason we hit the wraparound so early, and also the cause of the mysterious OID 0 message. When pg_upgrade executes, it calls pg_resetwal to set the next transaction ID. Within pg_resetwal is the following code:
> > https://github.com/postgres/postgres/blob/6cd404b344f7e27f4d64555bb133f18a758fe851/src/bin/pg_resetwal/pg_resetwal.c#L440-L450

pg_upgrade should set oldestXID to the same value as the source cluster, or set it the way vac_truncate_clog() would set it. Today's scheme is usually too pessimistic, but it can be too optimistic if the source cluster was on the brink of wrap. Thanks for the report.
Re: pg_upgrade can result in early wraparound on databases with high transaction load
Hi,

On 2019-06-15 11:37:59 -0700, Noah Misch wrote:
> On Tue, May 21, 2019 at 03:23:00PM -0700, Peter Geoghegan wrote:
> > On Mon, May 20, 2019 at 3:10 AM Jason Harvey <jason@reddit.com> wrote:
> > > This week I upgraded one of my large (2.8TB), high-volume databases from 9 to 11. The upgrade itself went fine. About two days later, we unexpectedly hit transaction ID wraparound. What was perplexing about this was that the age of our oldest `datfrozenxid` was only 1.2 billion - far away from where I'd expect a wraparound. Curiously, the wraparound error referred to a mysterious database of `OID 0`:
> > >
> > > UPDATE ERROR: database is not accepting commands to avoid wraparound data loss in database with OID 0
>
> That's bad.

Yea. The code triggering it in pg_resetwal is bogus as far as I can tell. That pg_upgrade triggers it makes this quite bad.

I just hit issues related to it when writing a wraparound handling test. Peter remembered this issue (how?)...

Especially before 13 (inserts triggering autovacuum) it is quite common to have tables that only ever get vacuumed due to anti-wraparound vacuums. And it's common for larger databases to increase autovacuum_freeze_max_age. Which makes it fairly likely for this to guess an oldestXid value that's *newer* than an accurate one. Since oldestXid is used in a few important-ish places (like triggering vacuums, and in 14 also some snapshot related logic) I think that's bad.

The relevant code:

	if (set_xid != 0)
	{
		ControlFile.checkPointCopy.nextXid =
			FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
											 set_xid);

		/*
		 * For the moment, just set oldestXid to a value that will force
		 * immediate autovacuum-for-wraparound.  It's not clear whether adding
		 * user control of this is useful, so let's just do something that's
		 * reasonably safe.  The magic constant here corresponds to the
		 * maximum allowed value of autovacuum_freeze_max_age.
		 */
		ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
		if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
			ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
	}

Originally from:

commit 25ec228ef760eb91c094cc3b6dea7257cc22ffb5
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2009-08-31 02:23:23 +0000

    Track the current XID wrap limit (or more accurately, the oldest unfrozen
    XID) in checkpoint records.  This eliminates the need to recompute the
    value from scratch during database startup, which is one of the two
    remaining reasons for the flatfile code to exist.  It should also
    simplify life for hot-standby operation.

I think we should remove the oldestXid guessing logic, and expose it as an explicit option. I think it's important that pg_upgrade sets an accurate value. Probably not worth caring about oldestXidDB though?

Greetings,

Andres Freund
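To make that suggestion concrete, a minimal sketch of the non-guessing variant of the block quoted above (an illustration only, not the committed change; "set_oldest_xid" is a hypothetical variable that would be filled in from a new command-line switch):

	if (set_xid != 0)
		ControlFile.checkPointCopy.nextXid =
			FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid),
											 set_xid);

	/*
	 * Sketch: rather than deriving oldestXid from set_xid, take a value the
	 * caller (e.g. pg_upgrade) copied from the source cluster's control data.
	 */
	if (set_oldest_xid != 0)
	{
		ControlFile.checkPointCopy.oldestXid = set_oldest_xid;
		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
	}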
Re: pg_upgrade can result in early wraparound on databases with high transaction load
Hi,
On 4/24/21 3:00 AM, Andres Freund wrote:
> Hi,
>
> On 2021-04-23 19:28:27 -0500, Justin Pryzby wrote:
>> This (combination of) thread(s) seems relevant.
>>
>> Subject: pg_upgrade failing for 200+ million Large Objects
>> https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com
>> https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com
>> https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e
>
> Huh. Thanks for digging these up.
>
>> Allows the user to provide a constant via pg_upgrade command-line, that
>> overrides the 2 billion constant in pg_resetxlog [1] thereby increasing
>> the (window of) Transaction IDs available for pg_upgrade to complete.
>
> That seems entirely the wrong approach to me, buying further into the
> broken idea of inventing random wrong values for oldestXid.
>
> We drive important things like the emergency xid limits off oldestXid. On
> databases with tables that are older than ~147 million xids (i.e. not even
> affected by the default autovacuum_freeze_max_age) the current constant
> leads to setting the oldestXid to a value *in the future* / wrapped around.
> Any different constant (or pg_upgrade parameter) will do that too in other
> scenarios.
>
> As far as I can tell there is precisely *no* correct behaviour here other
> than exactly copying the oldestXid limit from the source database.
Please find attached a patch proposal doing so: it adds a new (-u) parameter to pg_resetwal that allows specifying the oldest unfrozen XID to set.

This new parameter is then used in pg_upgrade to copy over the source cluster's "Latest checkpoint's oldestXID".
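For illustration, the pg_upgrade side of that could look roughly like the following, modeled on copy_xact_xlog_xid() in pg_upgrade.c (a sketch, not the patch itself; the chkpnt_oldstxid field name is an assumption about where the parsed "Latest checkpoint's oldestXID" value would be kept):

	/*
	 * Reset WAL in the new cluster, carrying over both the next XID and the
	 * oldest unfrozen XID from the old cluster (sketch; field name assumed).
	 */
	exec_prog(UTILITY_LOG_FILE, NULL, true, true,
			  "\"%s/pg_resetwal\" -f -u %u -x %u \"%s\"",
			  new_cluster.bindir,
			  old_cluster.controldata.chkpnt_oldstxid,
			  old_cluster.controldata.chkpnt_nxtxid,
			  new_cluster.pgdata);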
Questions:
- Should we keep the old behavior when -x is used without -u? (The proposed patch no longer sets an arbitrary oldestXID when -x is used.)
- Also, shouldn't we ensure that the xid provided with -x or -u is >= FirstNormalTransactionId? (Currently the only check is that it is != 0.)
I'm adding this patch to the commitfest.
Bertrand
Attachment
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
Hi,
> Hi,
>
> On 4/24/21 3:00 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-04-23 19:28:27 -0500, Justin Pryzby wrote:
>>> This (combination of) thread(s) seems relevant.
>>>
>>> Subject: pg_upgrade failing for 200+ million Large Objects
>>> https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com
>>> https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com
>>> https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e
>>
>> Huh. Thanks for digging these up.
>>
>>> Allows the user to provide a constant via pg_upgrade command-line, that
>>> overrides the 2 billion constant in pg_resetxlog [1] thereby increasing
>>> the (window of) Transaction IDs available for pg_upgrade to complete.
>>
>> That seems entirely the wrong approach to me, buying further into the
>> broken idea of inventing random wrong values for oldestXid.
>>
>> We drive important things like the emergency xid limits off oldestXid. On
>> databases with tables that are older than ~147 million xids (i.e. not even
>> affected by the default autovacuum_freeze_max_age) the current constant
>> leads to setting the oldestXid to a value *in the future* / wrapped around.
>> Any different constant (or pg_upgrade parameter) will do that too in other
>> scenarios.
>>
>> As far as I can tell there is precisely *no* correct behaviour here other
>> than exactly copying the oldestXid limit from the source database.
>
> Please find attached a patch proposal doing so: it adds a new (-u)
> parameter to pg_resetwal that allows specifying the oldest unfrozen XID
> to set.
>
> This new parameter is then used in pg_upgrade to copy over the source
> cluster's "Latest checkpoint's oldestXID".
>
> Questions:
>
> - Should we keep the old behavior when -x is used without -u? (The
>   proposed patch no longer sets an arbitrary oldestXID when -x is used.)
>
> - Also, shouldn't we ensure that the xid provided with -x or -u is >=
>   FirstNormalTransactionId? (Currently the only check is that it is != 0.)
Copy/pasting Andres' feedback on those questions from another thread [1] (thanks Andres for the feedback):
> I was also wondering if:
>
> * We should keep the old behavior in case pg_resetwal -x is being used
>   without -u? (The proposed patch does not set an arbitrary oldestXID
>   anymore in case -x is used.)

Andres: I don't think we should. I don't see anything in the old behaviour worth maintaining.

> * We should ensure that the xid provided with -x or -u is >=
>   FirstNormalTransactionId (currently the only check is that it is != 0)?

Andres: Applying TransactionIdIsNormal() seems like a good idea.
=> I am attaching a new version that makes use of TransactionIdIsNormal() checks.
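For reference, the kind of check meant here could look something like this in pg_resetwal's option handling (a simplified sketch, not the exact patch text; error reporting is condensed and set_oldest_xid is the hypothetical variable holding the -u argument):

			case 'u':
				errno = 0;
				set_oldest_xid = strtoul(optarg, &endptr, 0);
				if (endptr == optarg || *endptr != '\0' || errno != 0)
				{
					fprintf(stderr, "invalid argument for option -u\n");
					exit(1);
				}
				/* reject Bootstrap/Frozen XIDs, i.e. anything below FirstNormalTransactionId */
				if (!TransactionIdIsNormal((TransactionId) set_oldest_xid))
				{
					fprintf(stderr, "oldest transaction ID (-u) must be greater than or equal to %u\n",
							FirstNormalTransactionId);
					exit(1);
				}
				break;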
Andres: I think it's important to verify that the xid provided with -x is within a reasonable range of the oldest xid.
=> What do you mean by "a reasonable range"?
Thanks
Bertrand
[1]: https://www.postgresql.org/message-id/20210517185646.pwe4klaufwmdhe2a%40alap3.anarazel.de
Attachment
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
Hi,

On 7/27/21 4:39 AM, Bruce Momjian wrote:
> This patch has been applied back to 9.6 and will appear in the next
> minor release.

Thank you!

Bertrand
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
Bruce Momjian <bruce@momjian.us> writes:
> This patch has been applied back to 9.6 and will appear in the next
> minor release.

I have just discovered that this patch broke pg_upgrade's ability to upgrade from 8.4:

$ pg_upgrade -b ~/version84/bin -d ...
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The source cluster lacks some required control information:
  latest checkpoint oldestXID

Cannot continue without required control information, terminating
Failure, exiting

Sure enough, 8.4's pg_controldata doesn't print anything about oldestXID, because that info wasn't there then.

Given the lack of field complaints, it's probably not worth trying to do anything to restore that capability. But we really ought to update pg_upgrade's code and docs in pre-v15 branches to say that the minimum supported source version is 9.0.

			regards, tom lane
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On 2022-07-05 Tu 12:59, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> This patch has been applied back to 9.6 and will appear in the next
>> minor release.
>
> I have just discovered that this patch broke pg_upgrade's ability
> to upgrade from 8.4:
>
> $ pg_upgrade -b ~/version84/bin -d ...
> Performing Consistency Checks
> -----------------------------
> Checking cluster versions                                   ok
>
> The source cluster lacks some required control information:
>   latest checkpoint oldestXID
>
> Cannot continue without required control information, terminating
> Failure, exiting
>
> Sure enough, 8.4's pg_controldata doesn't print anything about
> oldestXID, because that info wasn't there then.
>
> Given the lack of field complaints, it's probably not worth trying
> to do anything to restore that capability. But we really ought to
> update pg_upgrade's code and docs in pre-v15 branches to say that
> the minimum supported source version is 9.0.

So it's taken us a year to discover the issue :-( Perhaps if we're going to say we support upgrades back to 9.0 we should have some testing to be assured we don't break it without knowing like this. I'll see if I can coax crake to do that - it already tests back to 9.2.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
Andrew Dunstan <andrew@dunslane.net> writes:
> So it's taken us a year to discover the issue :-( Perhaps if we're going
> to say we support upgrades back to 9.0 we should have some testing to be
> assured we don't break it without knowing like this. I'll see if I can
> coax crake to do that - it already tests back to 9.2.

Hmm ... could you first look into why 09878cdd4 broke it? I'd supposed that that was just detecting situations we must already have dealt with in order for the pg_upgrade test to work, but crake's not happy.

			regards, tom lane
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Tue, Jul 5, 2022 at 11:53 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Sure enough, 8.4's pg_controldata doesn't print anything about
> > oldestXID, because that info wasn't there then.
> >
> > Given the lack of field complaints, it's probably not worth trying
> > to do anything to restore that capability. But we really ought to
> > update pg_upgrade's code and docs in pre-v15 branches to say that
> > the minimum supported source version is 9.0.
>
> So it's taken us a year to discover the issue :-(

I'm not surprised at all, given the history here. There were at least a couple of bugs affecting how pg_upgrade carries forward information about these cutoffs. See commits 74cf7d46 and a61daa14.

Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u argument was first added, for use by pg_upgrade. That commit is only about a year old, and was only backpatched to 9.6. Unfortunately the previous approach to carrying forward oldestXID was an accident that usually worked.

So...yeah, things are bad here. At least we now have the ability to detect any downstream problems that this might cause by using pg_amcheck.

--
Peter Geoghegan
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On Tue, Jul 5, 2022 at 12:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, commit 74cf7d46 was where pg_resetxlog/pg_resetwal's -u
> argument was first added, for use by pg_upgrade. That commit is only
> about a year old, and was only backpatched to 9.6.

I just realized that this thread was where that work was first discussed. That explains why it took a year to discover that we broke 8.4!

On further reflection I think that breaking pg_upgrade for 8.4 might have been a good thing. The issue was fairly visible and obvious if you actually ran into it, which is vastly preferable to what would have happened before commit 74cf7d46.

--
Peter Geoghegan
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
On 2022-07-05 Tu 15:17, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> So it's taken us a year to discover the issue :-( Perhaps if we're going
>> to say we support upgrades back to 9.0 we should have some testing to be
>> assured we don't break it without knowing like this. I'll see if I can
>> coax crake to do that - it already tests back to 9.2.
>
> Hmm ... could you first look into why 09878cdd4 broke it? I'd supposed
> that that was just detecting situations we must already have dealt with
> in order for the pg_upgrade test to work, but crake's not happy.

It's complaining about this:

andrew@emma:HEAD $ cat ./inst/REL9_6_STABLE-20220705T160820.039/incompatible_polymorphics.txt
In database: regression
  aggregate: public.first_el_agg_f8(double precision)

I can have TestUpgradeXVersion.pm search for and remove offending functions, if that's the right fix.

I note too that drongo is failing similarly, but its pg_upgrade output directory is missing, so 4fff78f009 seems possibly shy of a load w.r.t. MSVC. I will investigate.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2022-07-05 Tu 15:17, Tom Lane wrote:
>> Hmm ... could you first look into why 09878cdd4 broke it? I'd supposed
>> that that was just detecting situations we must already have dealt with
>> in order for the pg_upgrade test to work, but crake's not happy.

> It's complaining about this:
> andrew@emma:HEAD $ cat ./inst/REL9_6_STABLE-20220705T160820.039/incompatible_polymorphics.txt
> In database: regression
>   aggregate: public.first_el_agg_f8(double precision)

Thanks.

> I can have TestUpgradeXVersion.pm search for and remove offending
> functions, if that's the right fix.

I'm not sure. It seems like the new check must be too strict, because it was only meant to detect cases that would cause a subsequent dump/reload failure, and evidently this did not. I'll have to look closer to figure out what to do. Anyway, it's off topic for this thread ...

			regards, tom lane
Re: [UNVERIFIED SENDER] pg_upgrade can result in early wraparound on databases with high transaction load
> On 5 Jul 2022, at 18:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Given the lack of field complaints, it's probably not worth trying
> to do anything to restore that capability. But we really ought to
> update pg_upgrade's code and docs in pre-v15 branches to say that
> the minimum supported source version is 9.0.

(reviving an old thread from the TODO)

Since we never got around to doing this, we still refer to 8.4 as a possible upgrade path in v14 and older. There seem to be two alternatives here: either we bump the minimum version in v14-v12 to 9.0, which is the technical limitation brought by 695b4a113ab, or we follow the direction taken by e469f0aaf3c and set 9.2.

e469f0aaf3c raised the minimum supported version to 9.2 based on the complexity of compiling anything older using a modern toolchain. It can be argued that making a change we don't cover with testing is unwise, but we clearly don't test the current code either, since it's broken.

The attached takes the conservative approach of raising the minimum supported version to 9.0 while leaving the code to handle 8.4 in place. While it can be removed, the risk/reward tradeoff of gutting code in backbranches doesn't seem appealing since the code will be unreachable with this check anyway.

Thoughts?

--
Daniel Gustafsson
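For context, the version gate in question amounts to a small check in pg_upgrade's cluster version checking; roughly along these lines (a sketch of the idea in check_cluster_versions(), not the exact backpatched hunk):

	/*
	 * Refuse source clusters older than 9.0: their pg_controldata does not
	 * report "Latest checkpoint's oldestXID", which pg_upgrade now needs.
	 */
	if (GET_MAJOR_VERSION(old_cluster.major_version) < 900)
		pg_fatal("This utility can only upgrade from PostgreSQL version %s and later.\n",
				 "9.0");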
Attachment
Re: [UNVERIFIED SENDER] pg_upgrade can result in early wraparound on databases with high transaction load
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 5 Jul 2022, at 18:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Given the lack of field complaints, it's probably not worth trying
>> to do anything to restore that capability. But we really ought to
>> update pg_upgrade's code and docs in pre-v15 branches to say that
>> the minimum supported source version is 9.0.

> (reviving an old thread from the TODO)
> Since we never got around to doing this we still refer to 8.4 as a possible
> upgrade path in v14 and older.

Oh, yeah, that seems to have fallen through a crack.

> The attached takes the conservative approach of raising the minimum supported
> version to 9.0 while leaving the code to handle 8.4 in place. While it can be
> removed, the risk/reward tradeoff of gutting code in backbranches doesn't seem
> appealing since the code will be unreachable with this check anyways.

Yeah, it's not worth working harder than this. I do see one typo in your comment: s/supported then/supported when/. LGTM otherwise.

			regards, tom lane
Re: [UNVERIFIED SENDER] pg_upgrade can result in early wraparound on databases with high transaction load
> On 16 May 2024, at 19:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, it's not worth working harder than this. I do see one typo
> in your comment: s/supported then/supported when/. LGTM otherwise.

Thanks for the review, I've pushed this (with the fix from above) to 14 through 12.

--
Daniel Gustafsson