Thread: BUG #17744: Fail Assert while recoverying from pg_basebackup

BUG #17744: Fail Assert while recoverying from pg_basebackup

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17744
Logged by:          bowenshi
Email address:      zxwsbg12138@gmail.com
PostgreSQL version: 15.1
Operating system:   centos
Description:

Dear all,

There may be a problem in recovery. The following steps reliably
reproduce it (a combined script is sketched after the three stages).

Firstly, run the following on the primary, to make sure that we have at
least 20 GB of data:
1. create table t(a int);
2. echo "insert into t select generate_series(1,5000);">script.sql
3. pgbench --no-vacuum --client=25 -U postgres --transactions=10000 --file script.sql

Secondly, take a base backup using pg_basebackup in stream mode:
1. pg_basebackup --checkpoint=fast -h localhost -U postgres -p 5432 -Xs -Ft -v -P -D /data2/sqpg/inst/data_b
2. pgbench --no-vacuum --client=25 -U postgres --transactions=3000 --file script.sql   (run this script concurrently with pg_basebackup)

Thirdly, start the instance from the backup:
cd /data2/sqpg/inst/data_b
tar xvf base.tar
mv pg_wal.tar pg_wal/
cd pg_wal
tar xvf pg_wal.tar
cd ../
echo "port=5433">>postgresql.conf
echo "log_min_messages=debug1">>postgresql.conf
echo "checkpoint_timeout=30s">>postgresql.conf
cd /data2/sqpg/inst/bin
./pg_ctl start -D ../data_b -l logfile_b
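
For convenience, here is the whole procedure as a single script (a sketch only; the psql invocation for the CREATE TABLE step is an assumption, while the paths, ports and other commands are taken from the steps above):

#!/bin/sh
# Stage 1: generate at least ~20 GB of data on the primary
psql -U postgres -p 5432 -c "create table t(a int);"
echo "insert into t select generate_series(1,5000);" > script.sql
pgbench --no-vacuum --client=25 -U postgres --transactions=10000 --file script.sql

# Stage 2: take a tar-format base backup with streamed WAL,
# while more write load runs concurrently
pg_basebackup --checkpoint=fast -h localhost -U postgres -p 5432 -Xs -Ft -v -P -D /data2/sqpg/inst/data_b &
pgbench --no-vacuum --client=25 -U postgres --transactions=3000 --file script.sql
wait

# Stage 3: unpack and start the backup instance
cd /data2/sqpg/inst/data_b
tar xvf base.tar
mv pg_wal.tar pg_wal/
(cd pg_wal && tar xvf pg_wal.tar)
echo "port=5433" >> postgresql.conf
echo "log_min_messages=debug1" >> postgresql.conf
echo "checkpoint_timeout=30s" >> postgresql.conf
cd /data2/sqpg/inst/bin
./pg_ctl start -D ../data_b -l logfile_b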

Then, during crash recovery, the checkpointer process creates a restart point,
and this fails an assertion. The stack is as follows:
TRAP: FailedAssertion("TransactionIdIsValid(initial)", File: "procarray.c",
Line: 1750, PID: 2063)
postgres: checkpointer (ExceptionalCondition+0xb9)[0xb378bc]
postgres: checkpointer [0x962195]
postgres: checkpointer (GetOldestTransactionIdConsideredRunning+0x14)[0x9628a3]
postgres: checkpointer (CreateRestartPoint+0x698)[0x5972bf]
postgres: checkpointer (CheckpointerMain+0x5b7)[0x8cae37]
postgres: checkpointer (AuxiliaryProcessMain+0x165)[0x8c8b01]
postgres: checkpointer [0x8d32b2]
postgres: checkpointer (PostmasterMain+0x11dd)[0x8ce559]
postgres: checkpointer [0x7d38e7]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f80352c9555]
postgres: checkpointer [0x48d1b9]
2023-01-10 07:25:42.368 UTC [2062] LOG:  checkpointer process (PID 2063) was
terminated by signal 6: Aborted

(gdb) bt
#0  0x00007f80352dd3d7 in raise () from /lib64/libc.so.6
#1  0x00007f80352deac8 in abort () from /lib64/libc.so.6
#2  0x0000000000b378e9 in ExceptionalCondition (
    conditionName=0xd13697 "TransactionIdIsValid(initial)",
    errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
    lineNumber=1750) at assert.c:69
#3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
    at procarray.c:1750
#4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
    at procarray.c:2050
#5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
#6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464
#7  0x00000000008c8b01 in AuxiliaryProcessMain (auxtype=CheckpointerProcess)
    at auxprocess.c:153
#8  0x00000000008d32b2 in StartChildProcess (type=CheckpointerProcess)
    at postmaster.c:5430
#9  0x00000000008ce559 in PostmasterMain (argc=3, argv=0x18149a0)
    at postmaster.c:1463
#10 0x00000000007d38e7 in main (argc=3, argv=0x18149a0) at main.c:202


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Kyotaro Horiguchi
Date:
At Tue, 10 Jan 2023 07:45:45 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> #2  0x0000000000b378e9 in ExceptionalCondition (
>     conditionName=0xd13697 "TransactionIdIsValid(initial)", 
>     errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
> 
>     lineNumber=1750) at assert.c:69
> #3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
>     at procarray.c:1750
> #4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
>     at procarray.c:2050
> #5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
> #6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464

The function requires a valid value in
ShmemVariableCache->latestCompleteXid, but that value is neither
initialized nor maintained in this case.  The attached quick hack seems
to work, but of course a more decent fix is needed.


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0070d56b0b..667c990301 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5363,7 +5363,7 @@ StartupXLOG(void)
          * control file and we've established a recovery snapshot from a
          * running-xacts WAL record.
          */
-        if (ArchiveRecoveryRequested && EnableHotStandby)
+        if (1 || (ArchiveRecoveryRequested && EnableHotStandby))
         {
             TransactionId *xids;
             int            nxids;
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 94cc860f5f..c814c991f8 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1161,7 +1161,7 @@ standby_redo(XLogReaderState *record)
     Assert(!XLogRecHasAnyBlockRefs(record));
 
     /* Do nothing if we're not in hot standby mode */
-    if (standbyState == STANDBY_DISABLED)
+    if (standbyState == STANDBY_DISABLED && info != XLOG_RUNNING_XACTS)
         return;
 
     if (info == XLOG_STANDBY_LOCK)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Bowen Shi
Date:
Thanks for your advice!

I wonder whether there are any potential risks if we simply compile
PostgreSQL without `--enable-cassert`.

regards.

On Fri, Jan 13, 2023 at 17:36, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Tue, 10 Jan 2023 07:45:45 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> > #2  0x0000000000b378e9 in ExceptionalCondition (
> >     conditionName=0xd13697 "TransactionIdIsValid(initial)",
> >     errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
> >
> >     lineNumber=1750) at assert.c:69
> > #3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
> >     at procarray.c:1750
> > #4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
> >     at procarray.c:2050
> > #5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
> > #6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464
>
> The function requires a valid value in
> ShmemVariableCache->latestCompleteXid. But it is not initialized and
> maintained in this case.  The attached quick hack seems working, but
> of course more decent fix is needed.
>
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 0070d56b0b..667c990301 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5363,7 +5363,7 @@ StartupXLOG(void)
>                  * control file and we've established a recovery snapshot from a
>                  * running-xacts WAL record.
>                  */
> -               if (ArchiveRecoveryRequested && EnableHotStandby)
> +               if (1 || (ArchiveRecoveryRequested && EnableHotStandby))
>                 {
>                         TransactionId *xids;
>                         int                     nxids;
> diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
> index 94cc860f5f..c814c991f8 100644
> --- a/src/backend/storage/ipc/standby.c
> +++ b/src/backend/storage/ipc/standby.c
> @@ -1161,7 +1161,7 @@ standby_redo(XLogReaderState *record)
>         Assert(!XLogRecHasAnyBlockRefs(record));
>
>         /* Do nothing if we're not in hot standby mode */
> -       if (standbyState == STANDBY_DISABLED)
> +       if (standbyState == STANDBY_DISABLED && info != XLOG_RUNNING_XACTS)
>                 return;
>
>         if (info == XLOG_STANDBY_LOCK)
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Andres Freund
Date:
Hi,

On 2023-01-13 18:36:05 +0900, Kyotaro Horiguchi wrote:
> At Tue, 10 Jan 2023 07:45:45 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> > #2  0x0000000000b378e9 in ExceptionalCondition (
> >     conditionName=0xd13697 "TransactionIdIsValid(initial)", 
> >     errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
> > 
> >     lineNumber=1750) at assert.c:69
> > #3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
> >     at procarray.c:1750
> > #4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
> >     at procarray.c:2050
> > #5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
> > #6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464
> 
> The function requires a valid value in
> ShmemVariableCache->latestCompleteXid. But it is not initialized and
> maintained in this case.  The attached quick hack seems working, but
> of course more decent fix is needed.

I might be missing something, but I suspect the problem here is that we
shouldn't have been creating a restart point. Afaict, the setup
instructions provided don't configure a recovery.signal, so we'll just
perform crash recovery.

And I don't think it'd ever make sense to create a restart point during
crash recovery?

Except that in this case, it's not pure crash recovery, it's restoring
from a backup label. Due to which it actually might make sense to create
restart points?  If you're doing PITR or such you don't really gain
anything by doing checkpoints until you've reached consistency, unless
you want to optimize for the case that you might need to start/stop the
instance multiple times?


So maybe it's the right thing to create restart points? Really not sure.


If we do want to do restartpoints, we definitely shouldn't try to
TruncateSUBTRANS() in the crash-recovery-like-restartpoint case; we've
not even done StartupSUBTRANS(), because that's guarded by
ArchiveRecoveryRequested.

The most obvious (but wrong!) fix would be to change

    if (EnableHotStandby)
        TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
to
    if (standbyState != STANDBY_DISABLED)
        TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
except that doesn't work, because we don't have working access to
standbyState. Nor the other relevant variables. Gah.




We've really made a hash out of the state management for
xlog.c. ArchiveRecoveryRequested, InArchiveRecovery,
StandbyModeRequested, StandbyMode, EnableHotStandby,
LocalHotStandbyActive, ... :(.  We use InArchiveRecovery = true, even if
there's no archiving involved. Afaict ArchiveRecoveryRequested=false,
InArchiveRecovery=true isn't really something the comments around the
variables foresee.


Greetings,

Andres Freund



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Kyotaro Horiguchi
Date:
Thanks for the comment!

At Wed, 1 Feb 2023 07:32:52 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2023-01-13 18:36:05 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 10 Jan 2023 07:45:45 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> > > #2  0x0000000000b378e9 in ExceptionalCondition (
> > >     conditionName=0xd13697 "TransactionIdIsValid(initial)", 
> > >     errorType=0xd12df4 "FailedAssertion", fileName=0xd12de8 "procarray.c",
> > > 
> > >     lineNumber=1750) at assert.c:69
> > > #3  0x0000000000962195 in ComputeXidHorizons (h=0x7ffe93de25e0)
> > >     at procarray.c:1750
> > > #4  0x00000000009628a3 in GetOldestTransactionIdConsideredRunning ()
> > >     at procarray.c:2050
> > > #5  0x00000000005972bf in CreateRestartPoint (flags=256) at xlog.c:7153
> > > #6  0x00000000008cae37 in CheckpointerMain () at checkpointer.c:464
> > 
> > The function requires a valid value in
> > ShmemVariableCache->latestCompleteXid. But it is not initialized and
> > maintained in this case.  The attached quick hack seems working, but
> > of course more decent fix is needed.
> 
> I might be missing something, but I suspect the problem here is that we
> shouldn't have been creating a restart point. Afaict, the setup
> instructions provided don't configure a recovery.signal, so we'll just
> perform crash recovery.
> 
> And I don't think it'd ever make sense to create a restart point during
> crash recovery?
> 
> Except that in this case, it's not pure crash recovery, it's restoring
> from a backup label. Due to which it actually might make sense to create
> restart points?  If you're doing PITR or such you don't really gain
> anything by doing checkpoints until you've reached consistency, unless
> you want to optimize for the case that you might need to start/stop the
> instance multiple times?
> 
> 
> So maybe it's the right thing to create restart points? Really not sure.

As I recall, that behavior was intentionally introduced
recently... yes, here it is ([1]). And this seems to be the motivation:

> Once we had the checkpointer running, we could also consider making
> the end-of-recovery checkpoint optional, or at least the wait for it
> to complete.  I haven't shown that in this patch, but it's just
> different checkpoint request flags and possibly an end-of-recovery
> record.  What problems do you foresee with that?  Why should we have
> "fast" promotion but not "fast" crash recovery?

[1]: https://www.postgresql.org/message-id/CA%2BhUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q%40mail.gmail.com

> If we do want to do restartpoints, we definitely shouldn't try to
> TruncateSUBTRANS() in the crash-recovery-like-restartpoint case, we've
> not even done StartupSUBTRANS(), because that's guarded by
> ArchiveRecoveryRequested.

Yeah..

> The most obvious (but wrong!), fix would be to change

:)

>     if (EnableHotStandby)
>         TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
> to
>     if (standbyState != STANDBY_DISABLED)
>         TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
> except that doesn't work, because we don't have working access to
> standbyState. Nor the other relevant variables. Gah.

Yeah, and we don't have access to ArchiveRecoveryRequested either (but
I could write to the variable there!).

> We've really made a hash out of the state management for
> xlog.c. ArchiveRecoveryRequested, InArchiveRecovery,
> StandbyModeRequested, StandbyMode, EnableHotStandby,
> LocalHotStandbyActive, ... :(.  We use InArchiveRecovery = true, even if
> there's no archiving involved. Afaict ArchiveRecoveryRequested=false,
> InArchiveRecovery=true isn't really something the comments around the
> variables foresee.

Ah... InArchiveRecovery is set when we know how far recovery needs to
run to reach consistency, regardless of whether recovery starts as
crash recovery or not. This is *slightly*(:p) different from what its
comment suggests.

Mmm.  Okay, now I know it's not that simple. Let me sleep on it..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Wed, Feb 01, 2023 at 07:32:52AM -0800, Andres Freund wrote:
> I might be missing something, but I suspect the problem here is that we
> shouldn't have been creating a restart point. Afaict, the setup
> instructions provided don't configure a recovery.signal, so we'll just
> perform crash recovery.
>
> And I don't think it'd ever make sense to create a restart point during
> crash recovery?

A restart point should never be created during crash recovery until we
reach a consistent state, because there could be a risk of seeing
inconsistent pages, for one, no?   SwitchIntoArchiveRecovery() would
be the portion of the code doing the switch from crash to archive
recovery if it was requested, as called in the callback to read
records in ReadRecord().

> Except that in this case, it's not pure crash recovery, it's restoring
> from a backup label. Due to which it actually might make sense to create
> restart points?

Hmm.  I'd like to think that these days requesting replay from a
backup_label file without a recovery.signal or a standby.signal is
just asking for trouble.  Still, the case of the OP, by just having a
backup label, is equivalent to restoring from a self-contained backup,
which is something that should work on its own even if there is no
recovery.signal.

> If you're doing PITR or such you don't really gain
> anything by doing checkpoints until you've reached consistency, unless
> you want to optimize for the case that you might need to start/stop the
> instance multiple times?

Yes, that could help in some cases.  The PITR target could be far away
from the consistent point, and the user could have just copied WAL
segments into pg_wal/ rather than relying on a recovery.signal with at
least a restore_command.  That would be fancy; still, being able to
start/stop and retry targets may make sense.

So, yes, it seems to me that the correct answer here would be just to
skip restart points as long as SharedRecoveryState is
RECOVERY_STATE_CRASH, because the checkpointer relies on
RecoveryInProgress() and cannot tell the difference between crash
recovery and archive recovery.  But the checkpointer is now started
during recovery to ease the end-of-recovery checkpoint process.
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Kyotaro Horiguchi
Date:
At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Wed, Feb 01, 2023 at 07:32:52AM -0800, Andres Freund wrote:
> > I might be missing something, but I suspect the problem here is that we
> > shouldn't have been creating a restart point. Afaict, the setup
> > instructions provided don't configure a recovery.signal, so we'll just
> > perform crash recovery.
> > 
> > And I don't think it'd ever make sense to create a restart point during
> > crash recovery?
> 
> A restart point should never be created during crash recovery until we
> reach a consistent state, because there could be a risk of seeing
> inconsistent pages, for one, no?   SwitchIntoArchiveRecovery() would

I'm not sure. Anyway, the on-disk pages are always inconsistent
regardless of which type of recovery is running, or whether recovery is
running at all.

> be the portion of the code doing the switch from crash to archive
> recovery if it was requested, as called in the callback to read
> records in ReadRecord().
> 
> > Except that in this case, it's not pure crash recovery, it's restoring
> > from a backup label. Due to which it actually might make sense to create
> > restart points?
> 
> Hmm.  I'd like to think that these days requesting replay from a
> backup_label file without a recovery.signal or a standby.signal is
> just asking for trouble.  Still, the case of the OP, by just having a
> backup label, is equivalent to restoring from a self-contained backup,
> which is something that should work on its own even if there is no
> recovery.signal.
> 
> > If you're doing PITR or such you don't really gain
> > anything by doing checkpoints until you've reached consistency, unless
> > you want to optimize for the case that you might need to start/stop the
> > instance multiple times?
> 
> Yes, that could help in some cases.  The PITR target could be far away

According to Thomas' initial mail that started the discussion about
that change, one of the goals is to let crash recovery have a "fast"
mode just like archive recovery already has. It's not just limited to
end-of-recovery checkpoint.

> from the consistent point, and the user could have just copied WAL
> segments in pg_wal/ rather than relying on a recovery.signal with at
> least a restore_command.  That would be fancy, still being able to
> start/stop and retry targets may make sense..
> 
> So, yes, it seems to me that the correct answer here would be just to
> skip restart points as long as SharedRecoveryState is in
> RECOVERY_STATE_CRASH, because the checkpointer relies on
> RecoveryInProgress() and it cannot make the difference between crash
> recovery and archive recovery.  But now the checkpointer is started
> during recovery to ease the end-of-recovery checkpoint process.

If we prevent restartpoints during crash recovery, why do we need to
launch the checkpointer at such an early stage? In PG15 the comment for
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) no longer carries the
following description, which means we are no longer supposed to have
this restriction.

>                                                   We don't bother during
>         * crash recovery as restartpoints can only be performed during
>         * archive recovery.  And we'd like to keep crash recovery simple, to
>         * avoid introducing bugs that could affect you when recovering after
>         * crash.

The following is the corresponding commit.

> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> Author: Thomas Munro <tmunro@postgresql.org>
> Date:   Mon Aug 2 17:32:20 2021 +1200
> 
>     Run checkpointer and bgwriter in crash recovery.
>     
>     Start up the checkpointer and bgwriter during crash recovery (except in
>     --single mode), as we do for replication.  This wasn't done back in
>     commit cdd46c76 out of caution.  Now it seems like a better idea to make
>     the environment as similar as possible in both cases.  There may also be
>     some performance advantages.

Wouldn't just preventing restartpoints during crash recovery make this
change ineffective?  If something actually goes wrong when restartpoints
happen during crash recovery, shouldn't we identify the cause and let
restartpoints run safely during crash recovery?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Tue, Feb 21, 2023 at 03:22:58PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
>> Author: Thomas Munro <tmunro@postgresql.org>
>> Date:   Mon Aug 2 17:32:20 2021 +1200
>>
>>     Run checkpointer and bgwriter in crash recovery.
>>
>>     Start up the checkpointer and bgwriter during crash recovery (except in
>>     --single mode), as we do for replication.  This wasn't done back in
>>     commit cdd46c76 out of caution.  Now it seems like a better idea to make
>>     the environment as similar as possible in both cases.  There may also be
>>     some performance advantages.
>
> Wouldn't just previnting restartpoints during crash recovery mean
> making this change inneffective?  If something wrong actually happens
> when restartpoints happen during crash recovery, shouldn't we identify
> the cause and let restartpoints run safely during crash recovery?

I will let Thomas comment on that (now added in CC of this thread),
because, honestly, I am not sure.

Now, it seems to me that the original intent of starting the
checkpointer this early is primarily to unify the code path for the
end-of-recovery checkpoint, which is a fine goal in itself because the
startup process can request a checkpoint rather than doing the work
itself.  It is starting the bgwriter early that showed most of the
improvements in crash recovery with some aggressive tuning setups,
under specific workloads like heavy INSERTs.

This report is telling us that restartpoints during crash recovery may
not be a good idea without more careful consideration, even if the
original intent of 7ff23c6 is to allow them in the crash recovery
case.  I am not sure about this last point.  (Restartpoint is
mentioned once in the thread of this commit, FWIW.)
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Thomas Munro
Date:
On Tue, Feb 21, 2023 at 8:01 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Feb 21, 2023 at 03:22:58PM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 21 Feb 2023 12:51:51 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> >> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> >> Author: Thomas Munro <tmunro@postgresql.org>
> >> Date:   Mon Aug 2 17:32:20 2021 +1200
> >>
> >>     Run checkpointer and bgwriter in crash recovery.
> >>
> >>     Start up the checkpointer and bgwriter during crash recovery (except in
> >>     --single mode), as we do for replication.  This wasn't done back in
> >>     commit cdd46c76 out of caution.  Now it seems like a better idea to make
> >>     the environment as similar as possible in both cases.  There may also be
> >>     some performance advantages.
> >
> > Wouldn't just previnting restartpoints during crash recovery mean
> > making this change inneffective?  If something wrong actually happens
> > when restartpoints happen during crash recovery, shouldn't we identify
> > the cause and let restartpoints run safely during crash recovery?
>
> I will let Thomas comment on that (now added in CC of this thread),
> because, honestly, I am not sure.

Thanks, catching up with this thread.

> Now, it seems to me that the original intent of starting the
> checkpointer this early is to be able to unify the code path for the
> end-of-recovery checkpoint, primarily, which is a fine goal in itself
> because the startup process can query a checkpoint rather than doing
> things itself.  It is the fact of starting early the bgwriter that
> showed most of the improvements in crash recovery with some aggressive
> tuning setups, under specific workloads like heavy INSERTs.
>
> This report is telling us that restartpoints during crash recovery may
> not be a good idea without more careful consideration, even if the
> original intent of 7ff23c6 is to allow them in the crash recovery
> case.  I am not sure about this last point.  (Restartpoint is
> mentioned once in the thread of this commit, FWIW.)

I started by wanting to have the bgwriter active.  For that, the
checkpointer must also be active, to have a shared fsync request queue
(since both the startup process and the bgwriter create fsync
requests, it wouldn't work if the startup process had a local
"pendingOps").  Looking into this.



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Wed, Feb 22, 2023 at 11:27:03PM +1300, Thomas Munro wrote:
> I started by wanting to have the bgwriter active.  For that, the
> checkpointer must also be active, to have a shared fsync request queue
> (since both the startup process and the bgwriter create fsync
> requests, it wouldn't work if the startup process had a local
> "pendingOps").  Looking into this.

I was thinking about that, and you may be fine as long as you skip
some parts of the restartpoint logic.  The case reported in this
thread does not actually cause crash recovery, because the startup
process switches to +archive+ recovery any time it sees a backup_label
file.  One thing I did not remember here is that we also set
minRecoveryPoint at a much earlier LSN than it should be (see 6c4f666).
However, we rely heavily on backupEndRequired in the control file to
make sure that we've replayed up to the end-of-backup record when
deciding whether the system is consistent or not.

This is the type of check done by CheckRecoveryConsistency() in the
startup process.  What we want here is to apply the same kind of logic
to the checkpointer, as the checkpointer needs to decide which parts of
a restartpoint are fine to run before consistency has been reached.

The case of the OP actually switches InArchiveRecovery to true, hence
enforcing SharedRecoveryState = RECOVERY_STATE_ARCHIVE from the start
of the redo loop, so you cannot rely only on that in the checkpointer
as I initially thought.  Neither can you rely only on minRecoveryPoint,
as we could expect an end-of-backup record.  So you need to check for a
combination of SharedRecoveryState, minRecoveryPoint and
endBackupRecord, roughly.  Note that, as ReachedEndOfBackup() tells,
minRecoveryPoint could as well be ahead of the end-of-backup record,
meaning that we'd still need to replay more than the end-of-backup
record before reaching consistency.

There are a few things that we can do here that I can think of, in
conservative order:
- Delay the checkpointer startup until the startup process is OK with
a consistent state, which could be the end of a backup.  No need to
let the checkpointer know when consistency has been reached, though it
basically invalidates your previous change to make the bgwriter and
the checkpointer more aggressive.
- Delay restart points completely, which is pretty much the same as
the previous option, except that the checkpointer is started earlier
and needs to know about the consistent state.
- Skip some parts of the restartpoint logic.  TruncateSUBTRANS() is
one, as proved here, but are there other parts of CheckPointGuts()
that equally rely on a consistent system?  For example, I am not sure
about things like CheckPointSnapBuild(),
CheckPointLogicalRewriteHeap() or CheckPointReplicationOrigin(), to
name three.  Most of these involve SLRU flushes these days, but we'd
better be careful about the choices made.
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Fri, Feb 24, 2023 at 09:36:50AM +0900, Michael Paquier wrote:
> I was thinking about that, and you may be fine as long as you skip
> some parts of the restartpoint logic.  The case reported of this
> thread does not cause crash recovery, actually, because startup
> switches to +archive+ recovery any time it sees a backup_label file.
> One thing I did not remember here is that we also set minRecoveryPoint
> at a much earlier LSN than it should be (see 6c4f666).  However, we
> rely heavily on backupEndRequired in the control file to make sure
> that we've replayed up the end-of-backup record to decide if the
> system is consistent or not.

I have been spending more time on that to see if I was missing
something, and reproducing the issue is rather easy with a pgbench run
that gets stopped with a SIGINT, so that restart points can see
transactions still running in the code path triggering the assert.
A cheap regression test should be possible, actually, though for now
the only thing I have been able to rely on is a hack forcing
checkpoint_timeout to 1s to make the failure rate more aggressive.
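
As an illustration of that method, the interrupted pgbench runs might look roughly like this (the loop count, sleep time and pgbench settings are assumptions; lowering checkpoint_timeout to 1s additionally requires relaxing the GUC's minimum in the source):

# Sketch: repeatedly start a pgbench run and interrupt it with SIGINT,
# leaving transactions in the WAL that a later restart point can still
# see as running.
for i in 1 2 3 4 5; do
    pgbench --no-vacuum --client=25 -U postgres --transactions=100000 --file script.sql &
    sleep 2
    kill -INT $!
    wait
done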

Anyway, with this simple method (and a number of short, interrupted
pgbench runs to increase the chance of a hit), a bisect points
at 7ff23c6 :/
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Thomas Munro
Date:
On Sat, Feb 25, 2023 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Feb 24, 2023 at 09:36:50AM +0900, Michael Paquier wrote:
> > I was thinking about that, and you may be fine as long as you skip
> > some parts of the restartpoint logic.  The case reported of this
> > thread does not cause crash recovery, actually, because startup
> > switches to +archive+ recovery any time it sees a backup_label file.
> > One thing I did not remember here is that we also set minRecoveryPoint
> > at a much earlier LSN than it should be (see 6c4f666).  However, we
> > rely heavily on backupEndRequired in the control file to make sure
> > that we've replayed up the end-of-backup record to decide if the
> > system is consistent or not.
>
> I have been spending more time on that to see if I was missing
> something, and reproducing the issue is rather easy by using pgbench
> that gets stopped with a SIGINT so as restart points would be able to
> see transactions still running in the code path triggering the assert.
> A cheap regression test should be possible, actually, though for now
> the only thing I have been able to rely on is a hack to force
> checkpoint_timeout at 1s to make the failure rate more aggressive.
>
> Anyway, with this simple method (and an increase of short pgbench runs
> that are interrupted to increase the chance of hits), a bisect points
> at 7ff23c6 :/

Thanks.  I've been thinking about how to make a deterministic test
script to study this and possible fixes, too.  Unfortunately I came
down with a nasty cold and stopped computing for a couple of days, so
sorry for the slow response on this thread, but I seem to have
rebooted now.  Looking.



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Sat, Feb 25, 2023 at 08:56:18AM +1300, Thomas Munro wrote:
> Thanks.  I've been thinking about how to make a deterministic test
> script to study this and possible fixes, too.

I can send you what I have if you are interested, but I would need to
rewrite it a bit as it is integrated with my local scripts.  If you
wish me to test something, just ping me.

> Unfortunately I came
> down with a nasty cold and stopped computing for a couple of days, so
> sorry for the slow response on this thread, but I seem to have
> rebooted now.  Looking.

Ouch, take care.  This has been around for some time, so a couple of
weeks won't matter much.  There is still a lot of time until the next
minor release.

By the way, I think that I have put my finger on something..  The
startup process happily sets InArchiveRecovery to true after
read_backup_label() if it finds one, with a comment saying that
archive recovery is requested, yet ArchiveRecoveryRequested is never
set to true unless there is a recovery.signal or a standby.signal.

The configuration of this thread fails on that:
@@ -619,6 +619,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
         * file, we know how far we need to replay to reach consistency. Enter
         * archive recovery directly.
         */
+       if (!ArchiveRecoveryRequested)
+           elog(PANIC, "ArchiveRecoveryRequested not set with backup_label");

One of the tests of 010_pg_basebackup.pl with tablespace maps fails on
that as well; still, there does not seem to be a lot of coverage for
self-contained backups.
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Thomas Munro
Date:
On Sat, Feb 25, 2023 at 4:39 PM Michael Paquier <michael@paquier.xyz> wrote:
> By the way, I think that I have put my finger on something..  The
> startup process happily sets InArchiveRecovery to true after
> read_backup_label() if finding one, with a comment telling that
> archive recovery is requested, still ArchiveRecoveryRequested is never
> set to true unless there is a recovery.signal or a standby.signal.

Yeah, it's all a bit confusing.  That particular fact is the reason we
can't just rely on RECOVERY_STATE_CRASH to suppress restart points in
RecoveryRestartPoint(), which was my first naive idea before I started
working through scenarios...  Another thing that is a little confusing
is that commit 7ff23c6 didn't change the behaviour of hot standbys in
this respect: they already started the checkpointer at the start of
recovery, not at the start of hot standby.  Which leads to Bowen's
suggestion to make recovery-from-backup work the same.

As for what we're actually going to do about this, and how to test
that it's working the way we want, here's my current idea, which I'm
planning to work up experimental patches for on Monday:

(1) Since it's quite hard to hit the assertion failure without
generating a megaton of data, or hacking checkpoint_timeout to allow 1
second etc, I think we should test the startup process's behaviour
instead (ie rather than the checkpointer's).  We could have
RecoveryRestartPoint() log when it's recording a restart point in
shared memory (= a potential restart point, whether or not the
checkpointer chooses to act on it), and also when it's choosing to skip,
at DEBUG2 level.  Then we could have a test 035_restart_points.pl
for all the various situations (actual crash, backup, backup +
archives, pitr, with standby enabled, etc etc), and make assertions
about which messages we log.  That gives us observability and nails
down the differences between these modes.

(2) Then I think we should simply prevent restart points that 14
wouldn't have allowed, just a couple of back-patchable lines, maybe
(yet another) new variable.

Then for master I think we should (in later patches, for 16 or maybe
17) consider relaxing that.  There doesn't seem to be a technical
reason why a very long running PITR shouldn't benefit from restart
points, and that would be in line with the direction of making
recovery modes more similar, but given we/I missed this subtlety in
the past, the more conservative choice seems more appropriate for 15.



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Sun, Feb 26, 2023 at 09:38:01AM +1300, Thomas Munro wrote:
> Yeah, it's all a bit confusing.  That particular fact is the reason we
> can't just rely on RECOVERY_STATE_CRASH to suppress restart points in
> RecoveryRestartPoint(), which was my first naive idea before I started
> working through scenarios...  Another thing that is a little confusing
> is that commit 7ff23c6 didn't change the behaviour of hot standbys in
> this respect: they already started the checkpointer at the start of
> recovery, not at the start of hot standby.  Which leads to Bowen's
> suggestion to make recovery-from-backup work the same.

FWIW, I think that 7ff23c6 and this thread's issue are just victims of
the decision taken in abf5c5c9 (basically mentioned upthread), where it
was decided that we should initialize for hot standby depending on
ArchiveRecoveryRequested rather than InArchiveRecovery, as of this part
of the commit:
@@ -5243,7 +5344,7 @@ StartupXLOG(void)
         * control file and we've established a recovery snapshot from a
         * running-xacts WAL record.
         */
-       if (InArchiveRecovery && EnableHotStandby)
+       if (ArchiveRecoveryRequested && EnableHotStandby)
        {
            TransactionId *xids;
            int         nxids;

In the end, I would suggest something like the attached patch, where we
switch the order of the actions taken at the beginning of the startup
process a bit and enforce ArchiveRecoveryRequested when we find a
backup_label, so that we always go through archive recovery, as all the
code paths of the startup process assume that.  As far as I can see,
we have never assumed that the combination of ArchiveRecoveryRequested
= false and InArchiveRecovery = true should be a valid one.  With
HEAD, it is actually possible to do much nastier things, like missing
the setup of recoveryWakeupLatch, etc.  One of my test scripts has been
doing something like that; there is no need to run an interrupted
pgbench concurrently with pg_basebackup to trigger the issue (a
command-level sketch follows these steps):
- Setup a primary, with archiving.
- Take a base backup of the primary, it has an end-of-recovery record,
already.  Do not start it yet.
- Run pgbench and interrupt it a few times.
- Copy the contents of the archives to the previous base backup's
pg_wal/ (Critical!).
- Start the base backup, causing the backup to replay all the WAL it
can find, and enjoy the show as consistency is reached as of the end
of the previous base backup (one can reload some recovery parameters,
as well).
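
Roughly, in commands, that recipe could look like this (the paths, ports, archive_command and pgbench settings here are assumptions; only the sequence of steps comes from the list above):

# 1. Set up a primary with WAL archiving enabled.
mkdir -p /tmp/archive
initdb -D /tmp/primary
cat >> /tmp/primary/postgresql.conf <<'EOF'
archive_mode = on
archive_command = 'cp %p /tmp/archive/%f'
EOF
pg_ctl -D /tmp/primary -l /tmp/primary.log start

# 2. Take a base backup of the primary, but do not start it yet.
pg_basebackup --checkpoint=fast -D /tmp/backup

# 3. Run pgbench against the primary and interrupt it a few times.
pgbench -i postgres
for i in 1 2 3; do
    pgbench --no-vacuum --client=10 --transactions=100000 postgres &
    sleep 2
    kill -INT $!
    wait
done

# 4. Copy the archived WAL into the backup's pg_wal/ (critical!).
cp /tmp/archive/* /tmp/backup/pg_wal/

# 5. Start the backup and let it replay all the WAL it can find.
echo "port = 5433" >> /tmp/backup/postgresql.conf
pg_ctl -D /tmp/backup -l /tmp/backup.log start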

Note that this makes the backup_label read happen slightly earlier.  I'd
like to think that we should enforce the decision of having at least a
recovery.signal in 17~ if we find a backup_label, with a proper
validation of validateRecoveryParameters().  Another side effect is
that this actually changes the behavior reported on this thread by
making the end of recovery *switch* to a new TLI rather than
continuing the existing timeline once recovery is finished, so this
actually makes base backup recovery much more predictable.  It is
true that this could influence some recovery flows, but this
combination actually leads to more consistency at recovery, and a
backup_label is all about archive recovery, so..

> Then for master I think we should (in later patches, for 16 or maybe
> 17) consider relaxing that.  There doesn't seem to be a technical
> reason why a very long running PITR shouldn't benefit from restart
> points, and that would be in line with the direction of making
> recovery modes more similar, but given we/I missed this subtlety in
> the past, the more conservative choice seems more appropriate for 15.

If you wish to be more conservative and prevent checkpoints from
running in cases where they were not running before 14, and lift this
restriction in 17~, that would be fine by me, but I am not convinced
that we really need that.  As far as I can see, 7ff23c6 is kind of
right.
--
Michael


Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Kyotaro Horiguchi
Date:
At Mon, 27 Feb 2023 09:08:19 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Sun, Feb 26, 2023 at 09:38:01AM +1300, Thomas Munro wrote:
> > Yeah, it's all a bit confusing.  That particular fact is the reason we
> > can't just rely on RECOVERY_STATE_CRASH to suppress restart points in
> > RecoveryRestartPoint(), which was my first naive idea before I started
> > working through scenarios...  Another thing that is a little confusing
> > is that commit 7ff23c6 didn't change the behaviour of hot standbys in
> > this respect: they already started the checkpointer at the start of
> > recovery, not at the start of hot standby.  Which leads to Bowen's
> > suggestion to make recovery-from-backup work the same.
> 
> FWIW, I think that 7ff23c6 and this thread's issue are just a victim
> of the decision taken in abf5c5c9 (basically mentioned upthread) where
> it was decided that we should initialize for hot standby depending on
> ArchiveRecoveryRequested rather than InArchiveRecovery, as of this
> part of the commit:
> @@ -5243,7 +5344,7 @@ StartupXLOG(void)
>          * control file and we've established a recovery snapshot from a
>          * running-xacts WAL record.
>          */
> -       if (InArchiveRecovery && EnableHotStandby)
> +       if (ArchiveRecoveryRequested && EnableHotStandby)
>         {
>             TransactionId *xids;
>             int         nxids;
> 
> At the end, I would suggest something the attached patch, where we
> like missing the setup of recoveryWakeupLatch, etc.  One of my test
> scripts has been doing something like that, actually, because there is
> no need to run an interrupted pgbench concurrently with pg_basebackup
> to trigger the issue:
> - Setup a primary, with archiving.
> - Take a base backup of the primary, it has an end-of-recovery record,
> already.  Do not start it yet.
> - Run pgbench and interrupt it a few times.
> - Copy the contents of the archives to the previous base backup's
> pg_wal/ (Critical!).
> - Start the base backup, causing the backup to replay all the WAL it
> can find, and enjoy the show as consistency is reached as of the end
> of the previous base backup (one can reload some recovery parameters,
> as well).

In my understanding, that approach differs slightly from the OP's
case. While the server reaches a consistent state early on with these
steps, in the OP's case it takes much longer to reach consistency, to
the point where a checkpoint occurs before it is reached. It seems that
the approach taken by the patch does not work in the OP's case, where
(I guess) hot standby is disabled.

> Note that this makes the backup_file read happen slightly earlier.  I'd   
> like to think that we should enforce the decision of having at least a
> recovery.signal in 17~ if we find a backup_label with a proper
> validation of validateRecoveryParameters().  Another side effect is
> that this actually changes the behavior reported on this thread by
> making the end-of-recovery *switch* to a new TLI rather than
> continuing the existing timeline once recovery is finished, so this
> actually makes the base backup recovery much more predictible.  It is
> true that this could influence some recovery flows, but this
> combination is actually leads to more consistency at recovery, and a
> backup_label is all about archive recovery, so..

As you know, pg_basebackup has the -R option.  It seems to me that its
existence means we consciously allowed the backup to start without
archive recovery. The OP's operation (if it is correct) involves
taking a base backup and treating it as a literal backup, meaning it
begins its life on the original timeline. This change prevents an
operation that we may have implicitly allowed (at least, that is my
understanding).

> > Then for master I think we should (in later patches, for 16 or maybe
> > 17) consider relaxing that.  There doesn't seem to be a technical
> > reason why a very long running PITR shouldn't benefit from restart
> > points, and that would be in line with the direction of making
> > recovery modes more similar, but given we/I missed this subtlety in
> > the past, the more conservative choice seems more appropriate for 15.
> 
> If you wish to be more conservative and prevent checkpoints to run in
> cases where they were not running before 14 and lift this restriction
> in 17~, that would be fine by me, but I am not convinced that we
> really need that.  As far as I can see, 7ff23c6 is kind of right.

Thus, I slightly object to letting backup_label enforce archive
recovery, and would rather prefer to suppress checkpoints before
reaching consistency(?).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From
Michael Paquier
Date:
On Mon, Feb 27, 2023 at 12:01:01PM +0900, Kyotaro Horiguchi wrote:
> In my understanding, the approach differs slightly from the OP's
> case. While server reaches consistency state early on using this
> steps, in the OP's case it takes much longer time to reach
> consistency, to the point where a checkpoint occurs before it is
> reached. It seems that the approach taken by the patch does not work
> in the OP's case, where (I guess that) hot-standby is disabled.

Yep, hot standby is disabled in this case.  Now note that we go
through the hot-standby initialization based on
(ArchiveRecoveryRequested && EnableHotStandby).  The first one is true
when we find a recovery.signal or standby.signal.  The second one is
the hot_standby GUC, so a recovery.signal enforces the initialization
even if we know that we won't allow standby connections anyway.  This
is a waste, surely, if you only have a recovery.signal; still, it does
not hurt?

> As you know, pg_basebackup has the -R option.  It seems to me that the
> existence means we conciously allowed the backup to start without
> archive recovery. The OP's operation (if it is correct) involves
> taking a base backup and treating it as a literal backup, meaning it
> begins its life on the origial timeline. This change prevents an
> operation that we may have implicitly allowed (at least, that is my
> understanding).

Note that pg_basebackup -R enforces a standby.signal, not a
recovery.signal.  I am questioning whether we have to support the case
of this thread, to be honest, particularly if it means simplifying the
recovery logic that's now behind the two signal files and the
backup_label.  That's not a behavior that should be enforced in a
stable branch, perhaps, but if it means removing some of the boolean
flags controlling the recovery behavior and simplifying its logic, then
I see a pretty good argument for moving in this direction if it eases
the maintenance of this code, with fewer cases to think about.  We
have too many of them already, and they have caused us many issues in
recent history (continuation records, for one), so I would draw a
clear line and simplify things.

Also, note "Recovering Using a Continuous Archive Backup", where we
basically say that one *has* to put a recovery.signal when recovering
using a base backup in one of the steps, which enforces a new timeline
once recovery is done with a base backup.  What I have sent upthread
enforces that more in the presence of a backup_label file without
recovery.signal, which may be too aggressive, perhaps, because it
causes always a TLI jump, still it avoids the need of a hard failure
to always require a recovery.signal or standby.signal if we are
recovering from a base backup.  If we were to support the case where
we have (InArchiveRecovery && !ArchiveRecoveryRequested) during
recovery, we'd better put a clear definition behind it.  Personnally,
I don't know what it means based on what is documented in
xlogrecovery.c or xlog.c (echoing with [1]).
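
For reference, what that documented procedure asks for amounts to something like the following before starting a server restored from a base backup (the paths are placeholders):

# Request archive recovery explicitly for a restored base backup.
touch /path/to/restored_datadir/recovery.signal
cat >> /path/to/restored_datadir/postgresql.conf <<'EOF'
restore_command = 'cp /path/to/wal_archive/%f "%p"'
EOF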

> Thus, I slightly object to let backup_label to enforce archive
> recovery. And rather prefer to suppress checkpoints before reaching
> consistency(?).

Suppressing restartpoints is not enough to completely close the issue.
With the case I am mentioning above, it is possible to run
restartpoints after consistency has been reached without the hot
standby initialization done, causing the same failure when attempting
to truncate pg_subtrans/ in a restart point :/

[1]: https://www.postgresql.org/message-id/20230201153252.l6kcfum7trdovw2b@alap3.anarazel.de
--
Michael
