Thread: recovery_min_apply_delay in archive recovery causes assertion failurein latch
recovery_min_apply_delay in archive recovery causes assertion failurein latch
From
Fujii Masao
Date:
Hi, I got the following assertion failure when I enabled recovery_min_apply_delay and started archive recovery (i.e., I put only recovery.signal not standby.signal). TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 522) Here is the example to reproduce the issue: ---------------------------- initdb -D data pg_ctl -D data start psql -c "alter system set recovery_min_apply_delay to '60s'" psql -c "alter system set archive_mode to on" psql -c "alter system set archive_command to 'cp %p ../arch/%f'" psql -c "alter system set restore_command to 'cp ../arch/%f %p'" mkdir arch pg_basebackup -D bkp -c fast pgbench -i pgbench -t 1000 pg_ctl -D data -m i stop rm -rf bkp/pg_wal mv data/pg_wal bkp rm -rf data mv bkp data touch data/recovery.signal pg_ctl -D data -W start ---------------------------- The latch that causes this assertion failure is recoveryWakeupLatch. The ownership of this latch is taken only when standby mode is requested. But this latch can be used when starting archive recovery with recovery_min_apply_delay set even though it's unowned. So the assertion failure happened. Attached patch fixes this issue by making archive recovery always ignore recovery_min_apply_delay. This change is OK because recovery_min_apply_delay was introduced for standby mode, I think. This issue is not new in v12. I observed that the issue was reproduced in v11. So the back-patch is necessary. Regards, -- Fujii Masao
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Michael Paquier
Date:
On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote: > Attached patch fixes this issue by making archive recovery always ignore > recovery_min_apply_delay. This change is OK because > recovery_min_apply_delay was introduced for standby mode, I think. > > This issue is not new in v12. I observed that the issue was reproduced > in v11. So the back-patch is necessary. I have not directly tested, but from a lookup at the code I think that you are right. Perhaps we'd want more safeguards in WaitForWALToBecomeAvailable(), like an assert within the XLOG_FROM_STREAM part similar to the check you are adding? My point is that we should switch to XLOG_FROM_STREAM only if we are in standby mode, and that's the only place where the startup process looks at recoveryWakeupLatch. -- Michael
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Fujii Masao
Date:
On Mon, Sep 30, 2019 at 12:42 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote: > > Attached patch fixes this issue by making archive recovery always ignore > > recovery_min_apply_delay. This change is OK because > > recovery_min_apply_delay was introduced for standby mode, I think. > > > > This issue is not new in v12. I observed that the issue was reproduced > > in v11. So the back-patch is necessary. > > I have not directly tested, but from a lookup at the code I think > that you are right. Perhaps we'd want more safeguards in > WaitForWALToBecomeAvailable(), like an assert within the > XLOG_FROM_STREAM part similar to the check you are adding? My point > is that we should switch to XLOG_FROM_STREAM only if we are in standby > mode, and that's the only place where the startup process looks at > recoveryWakeupLatch. Thanks for the review! OK, attached is the patch which also added two assertion checks as you described. Regards, -- Fujii Masao
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Michael Paquier
Date:
On Mon, Sep 30, 2019 at 05:50:03PM +0900, Fujii Masao wrote: > Thanks for the review! OK, attached is the patch which also added > two assertion checks as you described. Thanks, looks fine. The indentation looks a bit wrong for the comments, but that's a nit. -- Michael
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Fujii Masao
Date:
On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote: > > Hi, > > I got the following assertion failure when I enabled recovery_min_apply_delay > and started archive recovery (i.e., I put only recovery.signal not > standby.signal). > > TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: > "latch.c", Line: 522) > > Here is the example to reproduce the issue: > > ---------------------------- > initdb -D data > pg_ctl -D data start > psql -c "alter system set recovery_min_apply_delay to '60s'" > psql -c "alter system set archive_mode to on" > psql -c "alter system set archive_command to 'cp %p ../arch/%f'" > psql -c "alter system set restore_command to 'cp ../arch/%f %p'" > mkdir arch > pg_basebackup -D bkp -c fast > pgbench -i > pgbench -t 1000 > pg_ctl -D data -m i stop > rm -rf bkp/pg_wal > mv data/pg_wal bkp > rm -rf data > mv bkp data > touch data/recovery.signal > pg_ctl -D data -W start > ---------------------------- > > The latch that causes this assertion failure is recoveryWakeupLatch. > The ownership of this latch is taken only when standby mode is > requested. But this latch can be used when starting archive recovery > with recovery_min_apply_delay set even though it's unowned. > So the assertion failure happened. > > Attached patch fixes this issue by making archive recovery always ignore > recovery_min_apply_delay. This change is OK because > recovery_min_apply_delay was introduced for standby mode, I think. No, I found the following description in the doc. ------------------------------ This parameter is intended for use with streaming replication deployments; however, if the parameter is specified it will be honored in all cases ------------------------------ So archive recovery with recovery_min_apply_delay enabled would be intended configuration. My patch that changes archive recovery so that it always ignores thesetting might be in wrong direction. Maybe we should make recovery_min_apply_delay work properly even in archive recovery. Thought? Regards, -- Fujii Masao
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Fujii Masao
Date:
On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote: > > On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote: > > > > Hi, > > > > I got the following assertion failure when I enabled recovery_min_apply_delay > > and started archive recovery (i.e., I put only recovery.signal not > > standby.signal). > > > > TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: > > "latch.c", Line: 522) > > > > Here is the example to reproduce the issue: > > > > ---------------------------- > > initdb -D data > > pg_ctl -D data start > > psql -c "alter system set recovery_min_apply_delay to '60s'" > > psql -c "alter system set archive_mode to on" > > psql -c "alter system set archive_command to 'cp %p ../arch/%f'" > > psql -c "alter system set restore_command to 'cp ../arch/%f %p'" > > mkdir arch > > pg_basebackup -D bkp -c fast > > pgbench -i > > pgbench -t 1000 > > pg_ctl -D data -m i stop > > rm -rf bkp/pg_wal > > mv data/pg_wal bkp > > rm -rf data > > mv bkp data > > touch data/recovery.signal > > pg_ctl -D data -W start > > ---------------------------- > > > > The latch that causes this assertion failure is recoveryWakeupLatch. > > The ownership of this latch is taken only when standby mode is > > requested. But this latch can be used when starting archive recovery > > with recovery_min_apply_delay set even though it's unowned. > > So the assertion failure happened. > > > > Attached patch fixes this issue by making archive recovery always ignore > > recovery_min_apply_delay. This change is OK because > > recovery_min_apply_delay was introduced for standby mode, I think. > > No, I found the following description in the doc. > > ------------------------------ > This parameter is intended for use with streaming replication deployments; > however, if the parameter is specified it will be honored in all cases > ------------------------------ > > So archive recovery with recovery_min_apply_delay enabled would be > intended configuration. My patch that changes archive recovery so that > it always ignores thesetting might be in wrong direction. Maybe we should > make recovery_min_apply_delay work properly even in archive recovery. > Thought? Patch attached. This patch allows archive recovery with recovery_min_apply_delay set, but not crash recovery. Regards, -- Fujii Masao
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Michael Paquier
Date:
On Tue, Oct 08, 2019 at 02:18:00AM +0900, Fujii Masao wrote: > On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote: >> So archive recovery with recovery_min_apply_delay enabled would be >> intended configuration. My patch that changes archive recovery so that >> it always ignores thesetting might be in wrong direction. Maybe we should >> make recovery_min_apply_delay work properly even in archive recovery. >> Thought? > > Patch attached. This patch allows archive recovery with > recovery_min_apply_delay set, but not crash recovery. Right. In short it makes no sense to wait the delay when in crash recovery. After more testing I have been able to reproduce the failure myself. + /* nothing to do if crash recovery is requested */ + if (!ArchiveRecoveryRequested && !StandbyModeRequested) + return false; ArchiveRecoveryRequested will be set to true if recovery.signal or standby.signal are found, so it seems to me that you can make all those checks more simple by removing from the equation StandbyModeRequested, no? StandbyModeRequested is never set to true if ArchiveRecoveryRequested is not itself true. It would be nice to test some scenario within 002_archiving.pl. -- Michael
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Michael Paquier
Date:
On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote: > ArchiveRecoveryRequested will be set to true if recovery.signal or > standby.signal are found, so it seems to me that you can make all > those checks more simple by removing from the equation > StandbyModeRequested, no? StandbyModeRequested is never set to true > if ArchiveRecoveryRequested is not itself true. For the sake of the archives, this has been applied by Fujii-san as of ec1259e8. -- Michael
Attachment
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Alvaro Herrera
Date:
On 2019-Oct-19, Michael Paquier wrote: > On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote: > > ArchiveRecoveryRequested will be set to true if recovery.signal or > > standby.signal are found, so it seems to me that you can make all > > those checks more simple by removing from the equation > > StandbyModeRequested, no? StandbyModeRequested is never set to true > > if ArchiveRecoveryRequested is not itself true. > > For the sake of the archives, this has been applied by Fujii-san as of > ec1259e8. So, the commit message says Fix failure of archive recovery with recovery_min_apply_delay enabled. recovery_min_apply_delay parameter is intended for use with streaming replication deployments. However, the document clearly explains that the parameter will be honored in all cases if it's specified. So it should take effect even if in archive recovery. But, previously, archive recovery with recovery_min_apply_delay enabled always failed, and caused assertion failure if --enable-caasert is enabled. but I'm not clear how would this problem manifest in the case of a build with assertions disabled. Will it keep sleeping beyond the specified time? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: recovery_min_apply_delay in archive recovery causes assertionfailure in latch
From
Fujii Masao
Date:
On Sat, Dec 14, 2019 at 12:35 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Oct-19, Michael Paquier wrote: > > > On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote: > > > ArchiveRecoveryRequested will be set to true if recovery.signal or > > > standby.signal are found, so it seems to me that you can make all > > > those checks more simple by removing from the equation > > > StandbyModeRequested, no? StandbyModeRequested is never set to true > > > if ArchiveRecoveryRequested is not itself true. > > > > For the sake of the archives, this has been applied by Fujii-san as of > > ec1259e8. > > So, the commit message says > > Fix failure of archive recovery with recovery_min_apply_delay enabled. > > recovery_min_apply_delay parameter is intended for use with streaming > replication deployments. However, the document clearly explains that > the parameter will be honored in all cases if it's specified. So it should > take effect even if in archive recovery. But, previously, archive recovery > with recovery_min_apply_delay enabled always failed, and caused assertion > failure if --enable-caasert is enabled. > > but I'm not clear how would this problem manifest in the case of a build > with assertions disabled. Will it keep sleeping beyond the specified > time? When assertion is disabled, the recovery exists with the following messages. FATAL: cannot wait on a latch owned by another process LOG: startup process (PID 81007) exited with exit code 1 LOG: terminating any other active server processes LOG: database system is shut down Regards, -- Fujii Masao