Thread: BUG #17744: Fail Assert while recovering from pg_basebackup
The following bug has been logged on the website:

Bug reference:      17744
Logged by:          bowenshi
Email address:      zxwsbg12138@gmail.com
PostgreSQL version: 15.1
Operating system:   centos
Description:

Dears,
There may be some problems in recovery. The following steps can stably reproduce the problem:

Firstly, run the following script on the master, to make sure that we have at least 20 GB of data:

1. create table t(a int);
2. echo "insert into t select generate_series(1,5000);">script.sql
3. pgbench --no-vacuum --client=25 -U postgres --transactions=10000 --file script.sql

Secondly, take a base backup using pg_basebackup in stream mode:

1. pg_basebackup --checkpoint=fast -h localhost -U postgres -p 5432 -Xs -Ft -v -P -D /data2/sqpg/inst/data_b
2. pgbench --no-vacuum --client=25 -U postgres --transactions=3000 --file script.sql
   (Run this script concurrently during pg_basebackup)

Thirdly, we start the backed-up instance:

cd /data2/sqpg/inst/data_b
tar xvf base.tar
mv pg_wal.tar pg_wal/
cd pg_wal
tar xvf pg_wal.tar
cd ../
echo "port=5433">>postgresql.conf
echo "log_min_messages=debug1">>postgresql.conf
echo "checkpoint_timeout=30s">>postgresql.conf
cd /data2/sqpg/inst/bin
./pg_ctl start -D ../data_b -l logfile_b

Then, during crash recovery, the checkpointer process will create a restart point, and this fails an Assert. The stack is as follows:

TRAP: FailedAssertion("TransactionIdIsValid(initial)", File: "procarray.c", Line: 1750, PID: 2063)
postgres: checkpointer (ExceptionalCondition+0xb9)[0xb378bc]
postgres: checkpointer [0x962195]
postgres: checkpointer (GetOldestTransactionIdConsideredRunning+0x14)[0x9628a3]
postgres: checkpointer (CreateRestartPoint+0x698)[0x5972bf]
postgres: checkpointer (CheckpointerMain+0x5b7)[0x8cae37]
postgres: checkpointer (AuxiliaryProcessMain+0x165)[0x8c8b01]
postgres: checkpointer [0x8d32b2]
postgres: checkpointer (PostmasterMain+0x11dd)[0x8ce559]
postgres: checkpointer [0x7d38e7]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f80352c9555]
postgres: checkpointer [0x48d1b9]
2023-01-10 07:25:42.368 UTC [2062] LOG:  checkpointer process (PID 2063) was terminated by signal 6: Aborted

(gdb) bt
#0  0x00007f80352dd3d7 in raise () from /lib64/libc.so.6
#1  0x00007f80352deac8 in abort () from /lib64/libc.so.6
#2  0x0000000000b378e9 in ExceptionalCondition (
    conditionName=0xd13697 "TransactionIdIsValid(initial)",
    errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
    lineNumber=1750) at assert.c:69
#3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0) at procarray.c:1750
#4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning () at procarray.c:2050
#5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
#6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464
#7  0x00000000008c8b01 in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at auxprocess.c:153
#8  0x00000000008d32b2 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5430
#9  0x00000000008ce559 in PostmasterMain (argc=3, argv=0x18149a0) at postmaster.c:1463
#10 0x00000000007d38e7 in main (argc=3, argv=0x18149a0) at main.c:202
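The assertion that fires here lives in ComputeXidHorizons(), which seeds its horizon computation from the shared latestCompletedXid. The sketch below is an abridged paraphrase of that part of procarray.c (not an exact copy of the 15.1 sources), showing why an unset latestCompletedXid trips it:

    /* Abridged paraphrase of ComputeXidHorizons() in procarray.c */
    static void
    ComputeXidHorizons(ComputeXidHorizonsResult *h)
    {
        TransactionId initial;

        LWLockAcquire(ProcArrayLock, LW_SHARED);

        /* copy the shared value that recovery is expected to keep up to date */
        h->latest_completed = ShmemVariableCache->latestCompletedXid;

        /*
         * The MIN() computation of the horizons starts from
         * latestCompletedXid + 1.  If the shared value was never initialized
         * for this kind of recovery, the XID extracted here is invalid and
         * the Assert below produces the TRAP reported above.
         */
        initial = XidFromFullTransactionId(h->latest_completed);
        Assert(TransactionIdIsValid(initial));
        TransactionIdAdvance(initial);

        h->oldest_considered_running = initial;
        /* ... the rest of the function is omitted here ... */

        LWLockRelease(ProcArrayLock);
    }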
At Tue, 10 Jan 2023 07:45:45 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> #2  0x0000000000b378e9 in ExceptionalCondition (
>     conditionName=0xd13697 "TransactionIdIsValid(initial)",
>     errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
>     lineNumber=1750) at assert.c:69
> #3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
>     at procarray.c:1750
> #4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
>     at procarray.c:2050
> #5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
> #6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464

The function requires a valid value in ShmemVariableCache->latestCompleteXid, but that value is neither initialized nor maintained in this case. The attached quick hack seems to work, but of course a more decent fix is needed.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0070d56b0b..667c990301 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5363,7 +5363,7 @@ StartupXLOG(void)
 	 * control file and we've established a recovery snapshot from a
 	 * running-xacts WAL record.
 	 */
-	if (ArchiveRecoveryRequested && EnableHotStandby)
+	if (1 || (ArchiveRecoveryRequested && EnableHotStandby))
 	{
 		TransactionId *xids;
 		int			nxids;
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 94cc860f5f..c814c991f8 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1161,7 +1161,7 @@ standby_redo(XLogReaderState *record)
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
 	/* Do nothing if we're not in hot standby mode */
-	if (standbyState == STANDBY_DISABLED)
+	if (standbyState == STANDBY_DISABLED && info != XLOG_RUNNING_XACTS)
 		return;
 
 	if (info == XLOG_STANDBY_LOCK)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Thanks for your advice! I wonder whether there are potential risks if we just compile the PG code without --enable-cassert.

regards.

On Fri, Jan 13, 2023 at 5:36 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> The function requires a valid value in
> ShmemVariableCache->latestCompleteXid, but that value is neither
> initialized nor maintained in this case.  The attached quick hack
> seems to work, but of course a more decent fix is needed.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
Hi,

On 2023-01-13 18:36:05 +0900, Kyotaro Horiguchi wrote:
> The function requires a valid value in
> ShmemVariableCache->latestCompleteXid, but that value is neither
> initialized nor maintained in this case.  The attached quick hack
> seems to work, but of course a more decent fix is needed.

I might be missing something, but I suspect the problem here is that we shouldn't have been creating a restart point. Afaict, the setup instructions provided don't configure a recovery.signal, so we'll just perform crash recovery.

And I don't think it'd ever make sense to create a restart point during crash recovery?

Except that in this case, it's not pure crash recovery, it's restoring from a backup label. Due to which it actually might make sense to create restart points? If you're doing PITR or such you don't really gain anything by doing checkpoints until you've reached consistency, unless you want to optimize for the case that you might need to start/stop the instance multiple times?

So maybe it's the right thing to create restart points? Really not sure.

If we do want to do restartpoints, we definitely shouldn't try to TruncateSUBTRANS() in the crash-recovery-like-restartpoint case; we've not even done StartupSUBTRANS(), because that's guarded by ArchiveRecoveryRequested.

The most obvious (but wrong!) fix would be to change

    if (EnableHotStandby)
        TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());

to

    if (standbyState != STANDBY_DISABLED)
        TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());

except that doesn't work, because we don't have working access to standbyState. Nor the other relevant variables. Gah.

We've really made a hash out of the state management for xlog.c: ArchiveRecoveryRequested, InArchiveRecovery, StandbyModeRequested, StandbyMode, EnableHotStandby, LocalHotStandbyActive, ... :(. We use InArchiveRecovery = true even if there's no archiving involved. Afaict ArchiveRecoveryRequested=false, InArchiveRecovery=true isn't really something the comments around the variables foresee.

Greetings,

Andres Freund
Thanks for the comment!

At Wed, 1 Feb 2023 07:32:52 -0800, Andres Freund <andres@anarazel.de> wrote in
> I might be missing something, but I suspect the problem here is that we
> shouldn't have been creating a restart point.  Afaict, the setup
> instructions provided don't configure a recovery.signal, so we'll just
> perform crash recovery.
>
> And I don't think it'd ever make sense to create a restart point during
> crash recovery?
>
> Except that in this case, it's not pure crash recovery, it's restoring
> from a backup label.  Due to which it actually might make sense to create
> restart points?  If you're doing PITR or such you don't really gain
> anything by doing checkpoints until you've reached consistency, unless
> you want to optimize for the case that you might need to start/stop the
> instance multiple times?
>
> So maybe it's the right thing to create restart points?  Really not sure.

In my memory that behavior was intentionally introduced recently... yes, here it is ([1]). And it seems that this is the motive:

> Once we had the checkpointer running, we could also consider making
> the end-of-recovery checkpoint optional, or at least the wait for it
> to complete.  I haven't shown that in this patch, but it's just
> different checkpoint request flags and possibly an end-of-recovery
> record.  What problems do you foresee with that?  Why should we have
> "fast" promotion but not "fast" crash recovery?

[1]: https://www.postgresql.org/message-id/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com

> If we do want to do restartpoints, we definitely shouldn't try to
> TruncateSUBTRANS() in the crash-recovery-like-restartpoint case; we've
> not even done StartupSUBTRANS(), because that's guarded by
> ArchiveRecoveryRequested.

Yeah..

> The most obvious (but wrong!) fix would be to change

:)

> if (EnableHotStandby)
> 	TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
> to
> if (standbyState != STANDBY_DISABLED)
> 	TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
> except that doesn't work, because we don't have working access to
> standbyState.  Nor the other relevant variables.  Gah.

Yeah, we also don't have access to ArchiveRecoveryRequested (but I could write to the variable there!).

> We've really made a hash out of the state management for
> xlog.c: ArchiveRecoveryRequested, InArchiveRecovery,
> StandbyModeRequested, StandbyMode, EnableHotStandby,
> LocalHotStandbyActive, ... :(.  We use InArchiveRecovery = true even if
> there's no archiving involved.  Afaict ArchiveRecoveryRequested=false,
> InArchiveRecovery=true isn't really something the comments around the
> variables foresee.

Ah... InArchiveRecovery is set when we know how far recovery needs to run to reach consistency, regardless of whether recovery starts from crash recovery or not. This is *slightly*(:p) different from what its comment suggests.

Mmm. Okay, now I know it's not that simple. Let me sleep on it..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 01, 2023 at 07:32:52AM -0800, Andres Freund wrote: > I might be missing something, but I suspect the problem here is that we > shouldn't have been creating a restart point. Afaict, the setup > instructions provided don't configure a recovery.signal, so we'll just > perform crash recovery. > > And I don't think it'd ever make sense to create a restart point during > crash recovery? A restart point should never be created during crash recovery until we reach a consistent state, because there could be a risk of seeing inconsistent pages, for one, no? SwitchIntoArchiveRecovery() would be the portion of the code doing the switch from crash to archive recovery if it was requested, as called in the callback to read records in ReadRecord(). > Except that in this case, it's not pure crash recovery, it's restoring > from a backup label. Due to which it actually might make sense to create > restart points? Hmm. I'd like to think that these days requesting replay from a backup_label file without a recovery.signal or a standby.signal is just asking for trouble. Still, the case of the OP, by just having a backup label, is equivalent to restoring from a self-contained backup, which is something that should work on its own even if there is no recovery.signal. > If you're doing PITR or such you don't really gain > anything by doing checkpoints until you've reached consistency, unless > you want to optimize for the case that you might need to start/stop the > instance multiple times? Yes, that could help in some cases. The PITR target could be far away from the consistent point, and the user could have just copied WAL segments in pg_wal/ rather than relying on a recovery.signal with at least a restore_command. That would be fancy, still being able to start/stop and retry targets may make sense.. So, yes, it seems to me that the correct answer here would be just to skip restart points as long as SharedRecoveryState is in RECOVERY_STATE_CRASH, because the checkpointer relies on RecoveryInProgress() and it cannot make the difference between crash recovery and archive recovery. But now the checkpointer is started during recovery to ease the end-of-recovery checkpoint process. -- Michael
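A rough sketch of the kind of guard described in the last paragraph, skipping restart points while SharedRecoveryState is still RECOVERY_STATE_CRASH, could look like this; it is only an illustration of the idea as stated here (not a tested patch), and it assumes the existing GetRecoveryState() accessor in xlog.c:

    /* Hypothetical early exit at the top of CreateRestartPoint() in xlog.c */
    bool
    CreateRestartPoint(int flags)
    {
        /*
         * Never run a restart point while still in crash recovery; only
         * archive recovery sets up the state (subtrans, xid horizons, ...)
         * that later parts of a restart point rely on.
         */
        if (GetRecoveryState() == RECOVERY_STATE_CRASH)
            return false;

        /* ... existing restart point logic continues here ... */
    }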
At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Wed, Feb 01, 2023 at 07:32:52AM -0800, Andres Freund wrote:
> > I might be missing something, but I suspect the problem here is that we
> > shouldn't have been creating a restart point.  Afaict, the setup
> > instructions provided don't configure a recovery.signal, so we'll just
> > perform crash recovery.
> >
> > And I don't think it'd ever make sense to create a restart point during
> > crash recovery?
>
> A restart point should never be created during crash recovery until we
> reach a consistent state, because there could be a risk of seeing
> inconsistent pages, for one, no?

I'm not sure. Anyway, the on-disk pages are always inconsistent regardless of the type of recovery being run, or of whether recovery is being executed at all.

> SwitchIntoArchiveRecovery() would be the portion of the code doing the
> switch from crash to archive recovery if it was requested, as called in
> the callback to read records in ReadRecord().
>
> > Except that in this case, it's not pure crash recovery, it's restoring
> > from a backup label.  Due to which it actually might make sense to create
> > restart points?
>
> Hmm.  I'd like to think that these days requesting replay from a
> backup_label file without a recovery.signal or a standby.signal is
> just asking for trouble.  Still, the case of the OP, by just having a
> backup label, is equivalent to restoring from a self-contained backup,
> which is something that should work on its own even if there is no
> recovery.signal.
>
> > If you're doing PITR or such you don't really gain
> > anything by doing checkpoints until you've reached consistency, unless
> > you want to optimize for the case that you might need to start/stop the
> > instance multiple times?
>
> Yes, that could help in some cases.  The PITR target could be far away
> from the consistent point, and the user could have just copied WAL
> segments in pg_wal/ rather than relying on a recovery.signal with at
> least a restore_command.  That would be fancy, still being able to
> start/stop and retry targets may make sense..

According to Thomas' initial mail that started the discussion about that change, one of the goals is to let crash recovery have a "fast" mode just like archive recovery already has. It's not just limited to the end-of-recovery checkpoint.

> So, yes, it seems to me that the correct answer here would be just to
> skip restart points as long as SharedRecoveryState is in
> RECOVERY_STATE_CRASH, because the checkpointer relies on
> RecoveryInProgress() and it cannot make the difference between crash
> recovery and archive recovery.  But now the checkpointer is started
> during recovery to ease the end-of-recovery checkpoint process.

If we prevent restartpoints during crash recovery, why do we need to launch the checkpointer at such an early stage? In PG15, the comment for SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) no longer has the following description, which means we are no longer supposed to have this restriction.

> We don't bother during
> * crash recovery as restartpoints can only be performed during
> * archive recovery.  And we'd like to keep crash recovery simple, to
> * avoid introducing bugs that could affect you when recovering after
> * crash.

The following is the corresponding commit.

> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> Author: Thomas Munro <tmunro@postgresql.org>
> Date:   Mon Aug 2 17:32:20 2021 +1200
>
>     Run checkpointer and bgwriter in crash recovery.
>
>     Start up the checkpointer and bgwriter during crash recovery (except in
>     --single mode), as we do for replication.  This wasn't done back in
>     commit cdd46c76 out of caution.  Now it seems like a better idea to make
>     the environment as similar as possible in both cases.  There may also be
>     some performance advantages.

Wouldn't just preventing restartpoints during crash recovery mean making this change ineffective? If something actually goes wrong when restartpoints happen during crash recovery, shouldn't we identify the cause and let restartpoints run safely during crash recovery?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 21, 2023 at 03:22:58PM +0900, Kyotaro Horiguchi wrote: > At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in >> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 >> Author: Thomas Munro <tmunro@postgresql.org> >> Date: Mon Aug 2 17:32:20 2021 +1200 >> >> Run checkpointer and bgwriter in crash recovery. >> >> Start up the checkpointer and bgwriter during crash recovery (except in >> --single mode), as we do for replication. This wasn't done back in >> commit cdd46c76 out of caution. Now it seems like a better idea to make >> the environment as similar as possible in both cases. There may also be >> some performance advantages. > > Wouldn't just previnting restartpoints during crash recovery mean > making this change inneffective? If something wrong actually happens > when restartpoints happen during crash recovery, shouldn't we identify > the cause and let restartpoints run safely during crash recovery? I will let Thomas comment on that (now added in CC of this thread), because, honestly, I am not sure. Now, it seems to me that the original intent of starting the checkpointer this early is to be able to unify the code path for the end-of-recovery checkpoint, primarily, which is a fine goal in itself because the startup process can query a checkpoint rather than doing things itself. It is the fact of starting early the bgwriter that showed most of the improvements in crash recovery with some aggressive tuning setups, under specific workloads like heavy INSERTs. This report is telling us that restartpoints during crash recovery may not be a good idea without more careful consideration, even if the original intent of 7ff23c6 is to allow them in the crash recovery case. I am not sure about this last point. (Restartpoint is mentioned once in the thread of this commit, FWIW.) -- Michael
On Tue, Feb 21, 2023 at 8:01 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Feb 21, 2023 at 03:22:58PM +0900, Kyotaro Horiguchi wrote: > > At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in > >> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > >> Author: Thomas Munro <tmunro@postgresql.org> > >> Date: Mon Aug 2 17:32:20 2021 +1200 > >> > >> Run checkpointer and bgwriter in crash recovery. > >> > >> Start up the checkpointer and bgwriter during crash recovery (except in > >> --single mode), as we do for replication. This wasn't done back in > >> commit cdd46c76 out of caution. Now it seems like a better idea to make > >> the environment as similar as possible in both cases. There may also be > >> some performance advantages. > > > > Wouldn't just previnting restartpoints during crash recovery mean > > making this change inneffective? If something wrong actually happens > > when restartpoints happen during crash recovery, shouldn't we identify > > the cause and let restartpoints run safely during crash recovery? > > I will let Thomas comment on that (now added in CC of this thread), > because, honestly, I am not sure. Thanks, catching up with this thread. > Now, it seems to me that the original intent of starting the > checkpointer this early is to be able to unify the code path for the > end-of-recovery checkpoint, primarily, which is a fine goal in itself > because the startup process can query a checkpoint rather than doing > things itself. It is the fact of starting early the bgwriter that > showed most of the improvements in crash recovery with some aggressive > tuning setups, under specific workloads like heavy INSERTs. > > This report is telling us that restartpoints during crash recovery may > not be a good idea without more careful consideration, even if the > original intent of 7ff23c6 is to allow them in the crash recovery > case. I am not sure about this last point. (Restartpoint is > mentioned once in the thread of this commit, FWIW.) I started by wanting to have the bgwriter active. For that, the checkpointer must also be active, to have a shared fsync request queue (since both the startup process and the bgwriter create fsync requests, it wouldn't work if the startup process had a local "pendingOps"). Looking into this.
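The shared-versus-local fsync bookkeeping mentioned here is visible in RegisterSyncRequest() in src/backend/storage/sync/sync.c; the following is an abridged paraphrase (the retry loop and error handling are left out), not an exact copy, showing how a request either stays in the local pendingOps table or is handed to the checkpointer:

    /* Abridged paraphrase of RegisterSyncRequest() in sync.c */
    bool
    RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
                        bool retryOnError)
    {
        if (pendingOps != NULL)
        {
            /*
             * This process absorbs its own fsync requests (standalone
             * backend, or the checkpointer itself).
             */
            RememberSyncRequest(ftag, type);
            return true;
        }

        /* otherwise forward the request to the checkpointer's shared queue */
        return ForwardSyncRequest(ftag, type);
    }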
On Wed, Feb 22, 2023 at 11:27:03PM +1300, Thomas Munro wrote:
> I started by wanting to have the bgwriter active.  For that, the
> checkpointer must also be active, to have a shared fsync request queue
> (since both the startup process and the bgwriter create fsync
> requests, it wouldn't work if the startup process had a local
> "pendingOps").  Looking into this.

I was thinking about that, and you may be fine as long as you skip some parts of the restartpoint logic. The case reported in this thread does not cause crash recovery, actually, because startup switches to +archive+ recovery any time it sees a backup_label file. One thing I did not remember here is that we also set minRecoveryPoint at a much earlier LSN than it should be (see 6c4f666). However, we rely heavily on backupEndRequired in the control file to make sure that we've replayed up to the end-of-backup record to decide if the system is consistent or not. This is the type of check done by CheckRecoveryConsistency() for the startup process. And what we want here is to be able to apply the same kind of thing to the checkpointer, as the checkpointer needs to decide which parts of a restartpoint are fine to run without consistency being reached yet.

The case of the OP actually switches InArchiveRecovery to true, hence enforcing SharedRecoveryState = RECOVERY_STATE_ARCHIVE from the start of the redo loop, so you cannot rely only on that in the checkpointer as I initially thought. Neither can you rely only on minRecoveryPoint, as we could expect an end-of-backup record. So, you need to check for a combination of SharedRecoveryState, minRecoveryPoint and the end-of-backup record, roughly. Note that, as ReachedEndOfBackup() tells, minRecoveryPoint could just as well be ahead of the end-of-backup record, meaning that we'd still need to replay more than the end-of-backup record before reaching consistency.

There are a few things that we can do here that I can think of, in conservative order:
- Delay the checkpointer startup until the startup process is OK with a consistent state, which could be the end of a backup.  No need to let the checkpointer know when consistency has been reached, though it basically invalidates your previous change to make the bgwriter and the checkpointer more aggressive.
- Delay restart points completely, which is pretty much the same as the last one, except that the checkpointer is started earlier, and it needs to know about the consistent state.
- Skip some parts of the restartpoint logic.  TruncateSUBTRANS() is one, as proved here; still, are there other parts of CheckPointGuts() that equally rely on a consistent system?  For example, I am not sure about things like CheckPointSnapBuild(), CheckPointLogicalRewriteHeap() or CheckPointReplicationOrigin(), to name three.  Most of these involve SLRU flushes, these days, but we'd better be careful in the choices made.
-- 
Michael
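For reference, the subsystems flushed by CheckPointGuts() (used by checkpoints and restart points alike) are roughly the following; this is an abridged paraphrase of xlog.c with the tracing and statistics lines omitted, not an exact copy, but it puts the three calls named above next to the SLRU flushes:

    /* Abridged paraphrase of CheckPointGuts() in xlog.c */
    static void
    CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
    {
        CheckPointRelationMap();
        CheckPointReplicationSlots();
        CheckPointSnapBuild();
        CheckPointLogicalRewriteHeap();
        CheckPointReplicationOrigin();

        /* Write out all dirty data in SLRUs and the main buffer pool */
        CheckPointCLOG();
        CheckPointCommitTs();
        CheckPointSUBTRANS();
        CheckPointMultiXact();
        CheckPointPredicate();
        CheckPointBuffers(flags);

        /* Perform all queued up fsyncs */
        ProcessSyncRequests();

        /* We deliberately delay 2PC checkpointing as long as possible */
        CheckPointTwoPhase(checkPointRedo);
    }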
On Fri, Feb 24, 2023 at 09:36:50AM +0900, Michael Paquier wrote: > I was thinking about that, and you may be fine as long as you skip > some parts of the restartpoint logic. The case reported of this > thread does not cause crash recovery, actually, because startup > switches to +archive+ recovery any time it sees a backup_label file. > One thing I did not remember here is that we also set minRecoveryPoint > at a much earlier LSN than it should be (see 6c4f666). However, we > rely heavily on backupEndRequired in the control file to make sure > that we've replayed up the end-of-backup record to decide if the > system is consistent or not. I have been spending more time on that to see if I was missing something, and reproducing the issue is rather easy by using pgbench that gets stopped with a SIGINT so as restart points would be able to see transactions still running in the code path triggering the assert. A cheap regression test should be possible, actually, though for now the only thing I have been able to rely on is a hack to force checkpoint_timeout at 1s to make the failure rate more aggressive. Anyway, with this simple method (and an increase of short pgbench runs that are interrupted to increase the chance of hits), a bisect points at 7ff23c6 :/ -- Michael
On Sat, Feb 25, 2023 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Feb 24, 2023 at 09:36:50AM +0900, Michael Paquier wrote: > > I was thinking about that, and you may be fine as long as you skip > > some parts of the restartpoint logic. The case reported of this > > thread does not cause crash recovery, actually, because startup > > switches to +archive+ recovery any time it sees a backup_label file. > > One thing I did not remember here is that we also set minRecoveryPoint > > at a much earlier LSN than it should be (see 6c4f666). However, we > > rely heavily on backupEndRequired in the control file to make sure > > that we've replayed up the end-of-backup record to decide if the > > system is consistent or not. > > I have been spending more time on that to see if I was missing > something, and reproducing the issue is rather easy by using pgbench > that gets stopped with a SIGINT so as restart points would be able to > see transactions still running in the code path triggering the assert. > A cheap regression test should be possible, actually, though for now > the only thing I have been able to rely on is a hack to force > checkpoint_timeout at 1s to make the failure rate more aggressive. > > Anyway, with this simple method (and an increase of short pgbench runs > that are interrupted to increase the chance of hits), a bisect points > at 7ff23c6 :/ Thanks. I've been thinking about how to make a deterministic test script to study this and possible fixes, too. Unfortunately I came down with a nasty cold and stopped computing for a couple of days, so sorry for the slow response on this thread, but I seem to have rebooted now. Looking.
On Sat, Feb 25, 2023 at 08:56:18AM +1300, Thomas Munro wrote:
> Thanks.  I've been thinking about how to make a deterministic test
> script to study this and possible fixes, too.

I can send you what I have if you are interested, but I would need to rewrite it a bit as it is integrated with my local scripts. If you wish me to test something, just ping me.

> Unfortunately I came
> down with a nasty cold and stopped computing for a couple of days, so
> sorry for the slow response on this thread, but I seem to have
> rebooted now.  Looking.

Ouch, take care. This has been around for some time, so a couple of weeks won't matter much. There is still a lot of time until the next minor release.

By the way, I think that I have put my finger on something.. The startup process happily sets InArchiveRecovery to true after read_backup_label() if it finds one, with a comment saying that archive recovery is requested, yet ArchiveRecoveryRequested is never set to true unless there is a recovery.signal or a standby.signal. The configuration of this thread fails on that:

@@ -619,6 +619,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	 * file, we know how far we need to replay to reach consistency.  Enter
 	 * archive recovery directly.
 	 */
+	if (!ArchiveRecoveryRequested)
+		elog(PANIC, "ArchiveRecoveryRequested not set with backup_label");

One of the tests of 010_pg_basebackup.pl with tablespace maps fails on that as well; still, there does not seem to be a lot of coverage for self-contained backups..
-- 
Michael
On Sat, Feb 25, 2023 at 4:39 PM Michael Paquier <michael@paquier.xyz> wrote:
> By the way, I think that I have put my finger on something..  The
> startup process happily sets InArchiveRecovery to true after
> read_backup_label() if it finds one, with a comment saying that
> archive recovery is requested, yet ArchiveRecoveryRequested is never
> set to true unless there is a recovery.signal or a standby.signal.

Yeah, it's all a bit confusing. That particular fact is the reason we can't just rely on RECOVERY_STATE_CRASH to suppress restart points in RecoveryRestartPoint(), which was my first naive idea before I started working through scenarios... Another thing that is a little confusing is that commit 7ff23c6 didn't change the behaviour of hot standbys in this respect: they already started the checkpointer at the start of recovery, not at the start of hot standby. Which leads to Bowen's suggestion to make recovery-from-backup work the same.

As for what we're actually going to do about this, and how to test that it's working the way we want, here's my current idea, which I'm planning to work up experimental patches for on Monday:

(1) Since it's quite hard to hit the assertion failure without generating a megaton of data, or hacking checkpoint_timeout to allow 1 second etc, I think we should test the startup process's behaviour instead (ie rather than the checkpointer's). We could have RecoveryRestartPoint() log when it's recording a restart point in shared memory (= a potential restart point, whether or not the checkpointer chooses to act on it), and also when it's choosing to skip, at DEBUG2 level. We could then have a test 035_restart_points.pl for all the various situations (actual crash, backup, backup + archives, pitr, with standby enabled, etc etc), and make assertions about which messages we log. That gives us observability and nails down the differences between these modes.

(2) Then I think we should simply prevent restart points that 14 wouldn't have allowed, just a couple of back-patchable lines, maybe (yet another) new variable.

Then for master I think we should (in later patches, for 16 or maybe 17) consider relaxing that. There doesn't seem to be a technical reason why a very long-running PITR shouldn't benefit from restart points, and that would be in line with the direction of making recovery modes more similar, but given we/I missed this subtlety in the past, the more conservative choice seems more appropriate for 15.
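One possible shape for point (2), purely as an illustration (the shared flag and the message text below are invented, and this is not the patch being described): the startup process records in shared memory whether this is the kind of recovery in which 14 would have run restart points at all, and CreateRestartPoint() bails out otherwise:

    /* Hypothetical sketch only; the restartPointsAllowed field is made up. */

    /* In StartupXLOG() (xlog.c), once we know how recovery was requested. */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->restartPointsAllowed = ArchiveRecoveryRequested;
    SpinLockRelease(&XLogCtl->info_lck);

    /* At the top of CreateRestartPoint() (also xlog.c). */
    bool        allowed;

    SpinLockAcquire(&XLogCtl->info_lck);
    allowed = XLogCtl->restartPointsAllowed;
    SpinLockRelease(&XLogCtl->info_lck);
    if (!allowed)
    {
        ereport(DEBUG2,
                (errmsg_internal("skipping restartpoint, not in archive recovery")));
        return false;
    }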
On Sun, Feb 26, 2023 at 09:38:01AM +1300, Thomas Munro wrote:
> Yeah, it's all a bit confusing.  That particular fact is the reason we
> can't just rely on RECOVERY_STATE_CRASH to suppress restart points in
> RecoveryRestartPoint(), which was my first naive idea before I started
> working through scenarios...  Another thing that is a little confusing
> is that commit 7ff23c6 didn't change the behaviour of hot standbys in
> this respect: they already started the checkpointer at the start of
> recovery, not at the start of hot standby.  Which leads to Bowen's
> suggestion to make recovery-from-backup work the same.

FWIW, I think that 7ff23c6 and this thread's issue are just victims of the decision taken in abf5c5c9 (basically mentioned upthread), where it was decided that we should initialize for hot standby depending on ArchiveRecoveryRequested rather than InArchiveRecovery, as of this part of the commit:

@@ -5243,7 +5344,7 @@ StartupXLOG(void)
 	 * control file and we've established a recovery snapshot from a
 	 * running-xacts WAL record.
 	 */
-	if (InArchiveRecovery && EnableHotStandby)
+	if (ArchiveRecoveryRequested && EnableHotStandby)
 	{
 		TransactionId *xids;
 		int			nxids;

At the end, I would suggest something like the attached patch, where we switch a bit the order of the actions taken at the beginning of the startup process and enforce ArchiveRecoveryRequested when we find a backup_label, so that we always go through archive recovery, as all the code paths of the startup process assume that. As far as I can see, we have never assumed that the combination of ArchiveRecoveryRequested = false and InArchiveRecovery = true should be a valid one.

With HEAD, it is actually possible to do much nastier things, like missing the setup of recoveryWakeupLatch, etc. One of my test scripts has been doing something like that, actually, because there is no need to run an interrupted pgbench concurrently with pg_basebackup to trigger the issue:
- Set up a primary, with archiving.
- Take a base backup of the primary; it has an end-of-backup record already.  Do not start it yet.
- Run pgbench and interrupt it a few times.
- Copy the contents of the archives to the previous base backup's pg_wal/ (Critical!).
- Start the base backup, causing the backup to replay all the WAL it can find, and enjoy the show as consistency is reached as of the end of the previous base backup (one can reload some recovery parameters, as well).

Note that this makes the backup_label read happen slightly earlier. I'd like to think that we should enforce the decision of having at least a recovery.signal in 17~ if we find a backup_label, with a proper validation of validateRecoveryParameters(). Another side effect is that this actually changes the behavior reported on this thread by making the end-of-recovery *switch* to a new TLI rather than continuing the existing timeline once recovery is finished, so this actually makes base backup recovery much more predictable. It is true that this could influence some recovery flows, but this combination actually leads to more consistency at recovery, and a backup_label is all about archive recovery, so..

> Then for master I think we should (in later patches, for 16 or maybe
> 17) consider relaxing that.  There doesn't seem to be a technical
> reason why a very long-running PITR shouldn't benefit from restart
> points, and that would be in line with the direction of making
> recovery modes more similar, but given we/I missed this subtlety in
> the past, the more conservative choice seems more appropriate for 15.

If you wish to be more conservative and prevent checkpoints from running in cases where they were not running before 14, and lift this restriction in 17~, that would be fine by me, but I am not convinced that we really need that. As far as I can see, 7ff23c6 is kind of right.
-- 
Michael
At Mon, 27 Feb 2023 09:08:19 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> One of my test scripts has been doing something like that, actually,
> because there is no need to run an interrupted pgbench concurrently
> with pg_basebackup to trigger the issue:
> - Set up a primary, with archiving.
> - Take a base backup of the primary; it has an end-of-backup record
>   already.  Do not start it yet.
> - Run pgbench and interrupt it a few times.
> - Copy the contents of the archives to the previous base backup's
>   pg_wal/ (Critical!).
> - Start the base backup, causing the backup to replay all the WAL it
>   can find, and enjoy the show as consistency is reached as of the end
>   of the previous base backup (one can reload some recovery parameters,
>   as well).

In my understanding, this approach differs slightly from the OP's case. While the server reaches a consistent state early on using these steps, in the OP's case it takes much longer to reach consistency, to the point where a checkpoint occurs before it is reached. It seems that the approach taken by the patch does not work in the OP's case, where (I guess) hot standby is disabled.

> Note that this makes the backup_label read happen slightly earlier.
> I'd like to think that we should enforce the decision of having at
> least a recovery.signal in 17~ if we find a backup_label, with a proper
> validation of validateRecoveryParameters().  Another side effect is
> that this actually changes the behavior reported on this thread by
> making the end-of-recovery *switch* to a new TLI rather than continuing
> the existing timeline once recovery is finished, so this actually makes
> base backup recovery much more predictable.  It is true that this could
> influence some recovery flows, but this combination actually leads to
> more consistency at recovery, and a backup_label is all about archive
> recovery, so..

As you know, pg_basebackup has the -R option. It seems to me that its existence means we consciously allowed the backup to start without archive recovery. The OP's operation (if it is correct) involves taking a base backup and treating it as a literal backup, meaning it begins its life on the original timeline. This change prevents an operation that we may have implicitly allowed (at least, that is my understanding).

> > Then for master I think we should (in later patches, for 16 or maybe
> > 17) consider relaxing that.  There doesn't seem to be a technical
> > reason why a very long-running PITR shouldn't benefit from restart
> > points, and that would be in line with the direction of making
> > recovery modes more similar, but given we/I missed this subtlety in
> > the past, the more conservative choice seems more appropriate for 15.
>
> If you wish to be more conservative and prevent checkpoints from
> running in cases where they were not running before 14, and lift this
> restriction in 17~, that would be fine by me, but I am not convinced
> that we really need that.  As far as I can see, 7ff23c6 is kind of
> right.

Thus, I slightly object to letting backup_label enforce archive recovery, and rather prefer to suppress checkpoints before reaching consistency(?).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Feb 27, 2023 at 12:01:01PM +0900, Kyotaro Horiguchi wrote:
> In my understanding, this approach differs slightly from the OP's
> case.  While the server reaches a consistent state early on using
> these steps, in the OP's case it takes much longer to reach
> consistency, to the point where a checkpoint occurs before it is
> reached.  It seems that the approach taken by the patch does not work
> in the OP's case, where (I guess) hot standby is disabled.

Yep, hot standby is disabled in this case. Now note that we go through the hot-standby initialization based on (ArchiveRecoveryRequested && EnableHotStandby). The first one is true when we find a recovery.signal or standby.signal. The second one is the hot_standby GUC, hence recovery.signal enforces the initialization even if we know that we won't allow standby connections anyway. This is a waste, surely, if you only have recovery.signal; still, it does not hurt?

> As you know, pg_basebackup has the -R option.  It seems to me that its
> existence means we consciously allowed the backup to start without
> archive recovery.  The OP's operation (if it is correct) involves
> taking a base backup and treating it as a literal backup, meaning it
> begins its life on the original timeline.  This change prevents an
> operation that we may have implicitly allowed (at least, that is my
> understanding).

Note that pg_basebackup -R enforces a standby.signal, not a recovery.signal. I am questioning whether we have to support the case of this thread, to be honest, particularly if dropping it means simplifying the recovery logic that's now behind the two signal files and the backup_label. That's not a behavior that should be enforced in a stable branch, perhaps, but if it means removing some of the boolean flags controlling the recovery behavior, simplifying its logic, then I see a pretty good argument for moving in this direction if it eases the maintenance of this code, with fewer cases to think about. We have too many of them already and these have caused us many issues in recent history (continuation records, for one), so I would draw a clear line and simplify things.

Also, note "Recovering Using a Continuous Archive Backup", where we basically say in one of the steps that one *has* to put a recovery.signal when recovering using a base backup, which enforces a new timeline once recovery is done with a base backup. What I have sent upthread enforces that more in the presence of a backup_label file without recovery.signal, which may be too aggressive, perhaps, because it always causes a TLI jump; still, it avoids the need for a hard failure that always requires a recovery.signal or standby.signal if we are recovering from a base backup. If we were to support the case where we have (InArchiveRecovery && !ArchiveRecoveryRequested) during recovery, we'd better put a clear definition behind it. Personally, I don't know what it means based on what is documented in xlogrecovery.c or xlog.c (echoing [1]).

> Thus, I slightly object to letting backup_label enforce archive
> recovery, and rather prefer to suppress checkpoints before reaching
> consistency(?).

Suppressing restartpoints is not enough to completely close the issue. With the case I am mentioning above, it is possible to run restartpoints after consistency has been reached without the hot standby initialization done, causing the same failure when attempting to truncate pg_subtrans/ in a restart point :/

[1]: https://www.postgresql.org/message-id/20230201153252.l6kcfum7trdovw2b@alap3.anarazel.de
-- 
Michael