Thread: recovery_min_apply_delay in archive recovery causes assertion failurein latch

recovery_min_apply_delay in archive recovery causes assertion failurein latch

From
Fujii Masao
Date:
Hi,

I got the following assertion failure when I enabled recovery_min_apply_delay
and started archive recovery (i.e., I put only recovery.signal not
standby.signal).

TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
"latch.c", Line: 522)

Here is the example to reproduce the issue:

----------------------------
initdb -D data
pg_ctl -D data start
psql -c "alter system set recovery_min_apply_delay to '60s'"
psql -c "alter system set archive_mode to on"
psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
mkdir arch
pg_basebackup -D bkp -c fast
pgbench -i
pgbench -t 1000
pg_ctl -D data -m i stop
rm -rf bkp/pg_wal
mv data/pg_wal bkp
rm -rf data
mv bkp data
touch data/recovery.signal
pg_ctl -D data -W start
----------------------------

The latch that causes this assertion failure is recoveryWakeupLatch.
The ownership of this latch is taken only when standby mode is
requested. But this latch can be used when starting archive recovery
with recovery_min_apply_delay set even though it's unowned.
So the assertion failure happened.

Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.

This issue is not new in v12. I observed that the issue was reproduced
in v11. So the back-patch is necessary.

Regards,

-- 
Fujii Masao

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Michael Paquier
Date:
On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote:
> Attached patch fixes this issue by making archive recovery always ignore
> recovery_min_apply_delay. This change is OK because
> recovery_min_apply_delay was introduced for standby mode, I think.
>
> This issue is not new in v12. I observed that the issue was reproduced
> in v11. So the back-patch is necessary.

I have not directly tested, but from a lookup at the code I think
that you are right.  Perhaps we'd want more safeguards in
WaitForWALToBecomeAvailable(), like an assert within the
XLOG_FROM_STREAM part similar to the check you are adding?  My point
is that we should switch to XLOG_FROM_STREAM only if we are in standby
mode, and that's the only place where the startup process looks at
recoveryWakeupLatch.
--
Michael

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Fujii Masao
Date:
On Mon, Sep 30, 2019 at 12:42 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote:
> > Attached patch fixes this issue by making archive recovery always ignore
> > recovery_min_apply_delay. This change is OK because
> > recovery_min_apply_delay was introduced for standby mode, I think.
> >
> > This issue is not new in v12. I observed that the issue was reproduced
> > in v11. So the back-patch is necessary.
>
> I have not directly tested, but from a lookup at the code I think
> that you are right.  Perhaps we'd want more safeguards in
> WaitForWALToBecomeAvailable(), like an assert within the
> XLOG_FROM_STREAM part similar to the check you are adding?  My point
> is that we should switch to XLOG_FROM_STREAM only if we are in standby
> mode, and that's the only place where the startup process looks at
> recoveryWakeupLatch.

Thanks for the review! OK, attached is the patch which also added
two assertion checks as you described.

Regards,

-- 
Fujii Masao

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Michael Paquier
Date:
On Mon, Sep 30, 2019 at 05:50:03PM +0900, Fujii Masao wrote:
> Thanks for the review! OK, attached is the patch which also added
> two assertion checks as you described.

Thanks, looks fine.  The indentation looks a bit wrong for the
comments, but that's a nit.
--
Michael

Attachment
On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Hi,
>
> I got the following assertion failure when I enabled recovery_min_apply_delay
> and started archive recovery (i.e., I put only recovery.signal not
> standby.signal).
>
> TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
> "latch.c", Line: 522)
>
> Here is the example to reproduce the issue:
>
> ----------------------------
> initdb -D data
> pg_ctl -D data start
> psql -c "alter system set recovery_min_apply_delay to '60s'"
> psql -c "alter system set archive_mode to on"
> psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
> psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
> mkdir arch
> pg_basebackup -D bkp -c fast
> pgbench -i
> pgbench -t 1000
> pg_ctl -D data -m i stop
> rm -rf bkp/pg_wal
> mv data/pg_wal bkp
> rm -rf data
> mv bkp data
> touch data/recovery.signal
> pg_ctl -D data -W start
> ----------------------------
>
> The latch that causes this assertion failure is recoveryWakeupLatch.
> The ownership of this latch is taken only when standby mode is
> requested. But this latch can be used when starting archive recovery
> with recovery_min_apply_delay set even though it's unowned.
> So the assertion failure happened.
>
> Attached patch fixes this issue by making archive recovery always ignore
> recovery_min_apply_delay. This change is OK because
> recovery_min_apply_delay was introduced for standby mode, I think.

No, I found the following description in the doc.

------------------------------
This parameter is intended for use with streaming replication deployments;
however, if the parameter is specified it will be honored in all cases
------------------------------

So archive recovery with recovery_min_apply_delay enabled would be
intended configuration. My patch that changes archive recovery so that
it always ignores thesetting might be in wrong direction. Maybe we should
make recovery_min_apply_delay work properly even in archive recovery.
Thought?

Regards,

-- 
Fujii Masao



On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > Hi,
> >
> > I got the following assertion failure when I enabled recovery_min_apply_delay
> > and started archive recovery (i.e., I put only recovery.signal not
> > standby.signal).
> >
> > TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
> > "latch.c", Line: 522)
> >
> > Here is the example to reproduce the issue:
> >
> > ----------------------------
> > initdb -D data
> > pg_ctl -D data start
> > psql -c "alter system set recovery_min_apply_delay to '60s'"
> > psql -c "alter system set archive_mode to on"
> > psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
> > psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
> > mkdir arch
> > pg_basebackup -D bkp -c fast
> > pgbench -i
> > pgbench -t 1000
> > pg_ctl -D data -m i stop
> > rm -rf bkp/pg_wal
> > mv data/pg_wal bkp
> > rm -rf data
> > mv bkp data
> > touch data/recovery.signal
> > pg_ctl -D data -W start
> > ----------------------------
> >
> > The latch that causes this assertion failure is recoveryWakeupLatch.
> > The ownership of this latch is taken only when standby mode is
> > requested. But this latch can be used when starting archive recovery
> > with recovery_min_apply_delay set even though it's unowned.
> > So the assertion failure happened.
> >
> > Attached patch fixes this issue by making archive recovery always ignore
> > recovery_min_apply_delay. This change is OK because
> > recovery_min_apply_delay was introduced for standby mode, I think.
>
> No, I found the following description in the doc.
>
> ------------------------------
> This parameter is intended for use with streaming replication deployments;
> however, if the parameter is specified it will be honored in all cases
> ------------------------------
>
> So archive recovery with recovery_min_apply_delay enabled would be
> intended configuration. My patch that changes archive recovery so that
> it always ignores thesetting might be in wrong direction. Maybe we should
> make recovery_min_apply_delay work properly even in archive recovery.
> Thought?

Patch attached. This patch allows archive recovery with
recovery_min_apply_delay set, but not crash recovery.

Regards,

-- 
Fujii Masao

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Michael Paquier
Date:
On Tue, Oct 08, 2019 at 02:18:00AM +0900, Fujii Masao wrote:
> On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>> So archive recovery with recovery_min_apply_delay enabled would be
>> intended configuration. My patch that changes archive recovery so that
>> it always ignores thesetting might be in wrong direction. Maybe we should
>> make recovery_min_apply_delay work properly even in archive recovery.
>> Thought?
>
> Patch attached. This patch allows archive recovery with
> recovery_min_apply_delay set, but not crash recovery.

Right.  In short it makes no sense to wait the delay when in crash
recovery.  After more testing I have been able to reproduce the
failure myself.

+   /* nothing to do if crash recovery is requested */
+   if (!ArchiveRecoveryRequested && !StandbyModeRequested)
+       return false;

ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal are found, so it seems to me that you can make all
those checks more simple by removing from the equation
StandbyModeRequested, no?  StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.

It would be nice to test some scenario within 002_archiving.pl.
--
Michael

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Michael Paquier
Date:
On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
> ArchiveRecoveryRequested will be set to true if recovery.signal or
> standby.signal are found, so it seems to me that you can make all
> those checks more simple by removing from the equation
> StandbyModeRequested, no?  StandbyModeRequested is never set to true
> if ArchiveRecoveryRequested is not itself true.

For the sake of the archives, this has been applied by Fujii-san as of
ec1259e8.
--
Michael

Attachment

Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch

From
Alvaro Herrera
Date:
On 2019-Oct-19, Michael Paquier wrote:

> On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
> > ArchiveRecoveryRequested will be set to true if recovery.signal or
> > standby.signal are found, so it seems to me that you can make all
> > those checks more simple by removing from the equation
> > StandbyModeRequested, no?  StandbyModeRequested is never set to true
> > if ArchiveRecoveryRequested is not itself true.
> 
> For the sake of the archives, this has been applied by Fujii-san as of
> ec1259e8.

So, the commit message says

    Fix failure of archive recovery with recovery_min_apply_delay enabled.
    
    recovery_min_apply_delay parameter is intended for use with streaming
    replication deployments. However, the document clearly explains that
    the parameter will be honored in all cases if it's specified. So it should
    take effect even if in archive recovery. But, previously, archive recovery
    with recovery_min_apply_delay enabled always failed, and caused assertion
    failure if --enable-caasert is enabled.

but I'm not clear how would this problem manifest in the case of a build
with assertions disabled.  Will it keep sleeping beyond the specified
time?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Sat, Dec 14, 2019 at 12:35 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2019-Oct-19, Michael Paquier wrote:
>
> > On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
> > > ArchiveRecoveryRequested will be set to true if recovery.signal or
> > > standby.signal are found, so it seems to me that you can make all
> > > those checks more simple by removing from the equation
> > > StandbyModeRequested, no?  StandbyModeRequested is never set to true
> > > if ArchiveRecoveryRequested is not itself true.
> >
> > For the sake of the archives, this has been applied by Fujii-san as of
> > ec1259e8.
>
> So, the commit message says
>
>     Fix failure of archive recovery with recovery_min_apply_delay enabled.
>
>     recovery_min_apply_delay parameter is intended for use with streaming
>     replication deployments. However, the document clearly explains that
>     the parameter will be honored in all cases if it's specified. So it should
>     take effect even if in archive recovery. But, previously, archive recovery
>     with recovery_min_apply_delay enabled always failed, and caused assertion
>     failure if --enable-caasert is enabled.
>
> but I'm not clear how would this problem manifest in the case of a build
> with assertions disabled.  Will it keep sleeping beyond the specified
> time?

When assertion is disabled, the recovery exists with the following messages.

FATAL:  cannot wait on a latch owned by another process
LOG:  startup process (PID 81007) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down

Regards,

-- 
Fujii Masao