Thread: Failed assertion on standby while shutdown
Hi, haсkers! Recently, I was doing some experiments with primary/standby instances interaction. In certain conditions I’ve got and was able to reproduce crash on failed assertion. The scenario is the following: 1. start primary server 2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432 -D data 3. apply some load to the primary server by pgbench -p5432 -i -s 150 postgres 4. kill primary server (with kill -9) and keep it down 5. stop standby server by pg_ctl 6. run standby server Then any standby server termination will result in a failed assertion. The log with a backtrace is following: 2021-03-19 18:54:25.352 MSK [3508443] LOG: received fast shutdown request 2021-03-19 18:54:25.379 MSK [3508443] LOG: aborting any active transactions TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File: "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", Line: 592, PID: 3508452) postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f] postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31] postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32] postgres: walreceiver (+0x530bff)[0x555555a84bff] postgres: walreceiver (+0x531044)[0x555555a85044] postgres: walreceiver (+0x530959)[0x555555a84959] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0] /lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da] postgres: walreceiver (+0x52bea4)[0x555555a7fea4] postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1] postgres: walreceiver (+0x41ff1f)[0x555555973f1f] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3] postgres: walreceiver (_start+0x2e)[0x55555561abfe] After a brief investigation I found out that I can get this assert with 100% probability if I insert a sleep for about 5 sec into InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c: diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 897045ee272..b5f365f426d 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -525,7 +525,7 @@ InitAuxiliaryProcess(void) if (MyProc != NULL) elog(ERROR, "you already exist"); - + pg_usleep(5000000L); /* * We use the ProcStructLock to protect assignment and releasing of * AuxiliaryProcs entries. Maybe, this kinda behaviour would appear if a computer hosting instances is under significant side load, which cause delay to start db-instances under a heavy load. Configuration for a primary server is default with "wal_level = logical" Configuration for a standby server is default with "wal_level = logical" and "primary_conninfo = 'port=5432'" I'm puzzled with this behavor. I'm pretty sure it is not what should be. Any ideas how this can be fixed? --- Best regards, Maxim Orlov.
On 2021/03/20 2:25, Maxim Orlov wrote: > Hi, haсkers! > > Recently, I was doing some experiments with primary/standby instances interaction. In certain conditions I’ve got and wasable to reproduce crash on failed assertion. > > The scenario is the following: > 1. start primary server > 2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432 -D data > 3. apply some load to the primary server by pgbench -p5432 -i -s 150 postgres > 4. kill primary server (with kill -9) and keep it down > 5. stop standby server by pg_ctl > 6. run standby server > > Then any standby server termination will result in a failed assertion. > > The log with a backtrace is following: > > 2021-03-19 18:54:25.352 MSK [3508443] LOG: received fast shutdown request > 2021-03-19 18:54:25.379 MSK [3508443] LOG: aborting any active transactions > TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File: "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c",Line: 592, PID: 3508452) > postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f] > postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31] > postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32] > postgres: walreceiver (+0x530bff)[0x555555a84bff] > postgres: walreceiver (+0x531044)[0x555555a85044] > postgres: walreceiver (+0x530959)[0x555555a84959] > /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0] > /lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da] > postgres: walreceiver (+0x52bea4)[0x555555a7fea4] > postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1] > postgres: walreceiver (+0x41ff1f)[0x555555973f1f] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3] > postgres: walreceiver (_start+0x2e)[0x55555561abfe] > > After a brief investigation I found out that I can get this assert with 100% probability if I insert a sleep for about5 sec into InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c: > > diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c > index 897045ee272..b5f365f426d 100644 > --- a/src/backend/storage/lmgr/proc.c > +++ b/src/backend/storage/lmgr/proc.c > @@ -525,7 +525,7 @@ InitAuxiliaryProcess(void) > > if (MyProc != NULL) > elog(ERROR, "you already exist"); > - > + pg_usleep(5000000L); > /* > * We use the ProcStructLock to protect assignment and releasing of > * AuxiliaryProcs entries. Thanks for the report! I could reproduce this issue by adding that sleep into InitAuxiliaryProcess(). > Maybe, this kinda behaviour would appear if a computer hosting instances is under significant side load, which cause delayto start db-instances under a heavy load. > > Configuration for a primary server is default with "wal_level = logical" > > Configuration for a standby server is default with "wal_level = logical" and "primary_conninfo = 'port=5432'" > > I'm puzzled with this behavor. I'm pretty sure it is not what should be. Any ideas how this can be fixed? ISTM that the cause of this issue is that the startup process exits without releasing the locks that it was holding when shutdown is requested. To address this issue, IMO the startup process should call ShutdownRecoveryTransactionEnvironment() at its exit. Attached is the POC patch that changes the startup process that way. I've not tested the patch enough yet.. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
On 2021-03-22 16:40, Fujii Masao wrote: > On 2021/03/20 2:25, Maxim Orlov wrote: >> Hi, haсkers! >> >> Recently, I was doing some experiments with primary/standby instances >> interaction. In certain conditions I’ve got and was able to reproduce >> crash on failed assertion. >> >> The scenario is the following: >> 1. start primary server >> 2. start standby server by pg_basebackup -P -R -X stream -c fast >> -p5432 -D data >> 3. apply some load to the primary server by pgbench -p5432 -i -s 150 >> postgres >> 4. kill primary server (with kill -9) and keep it down >> 5. stop standby server by pg_ctl >> 6. run standby server >> >> Then any standby server termination will result in a failed assertion. >> >> The log with a backtrace is following: >> >> 2021-03-19 18:54:25.352 MSK [3508443] LOG: received fast shutdown >> request >> 2021-03-19 18:54:25.379 MSK [3508443] LOG: aborting any active >> transactions >> TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", >> File: >> "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", >> Line: 592, PID: 3508452) >> postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f] >> postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31] >> postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32] >> postgres: walreceiver (+0x530bff)[0x555555a84bff] >> postgres: walreceiver (+0x531044)[0x555555a85044] >> postgres: walreceiver (+0x530959)[0x555555a84959] >> /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0] >> /lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da] >> postgres: walreceiver (+0x52bea4)[0x555555a7fea4] >> postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1] >> postgres: walreceiver (+0x41ff1f)[0x555555973f1f] >> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3] >> postgres: walreceiver (_start+0x2e)[0x55555561abfe] >> >> After a brief investigation I found out that I can get this assert >> with 100% probability if I insert a sleep for about 5 sec into >> InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c: >> >> diff --git a/src/backend/storage/lmgr/proc.c >> b/src/backend/storage/lmgr/proc.c >> index 897045ee272..b5f365f426d 100644 >> --- a/src/backend/storage/lmgr/proc.c >> +++ b/src/backend/storage/lmgr/proc.c >> @@ -525,7 +525,7 @@ InitAuxiliaryProcess(void) >> >> if (MyProc != NULL) >> elog(ERROR, "you already exist"); >> - >> + pg_usleep(5000000L); >> /* >> * We use the ProcStructLock to protect assignment and >> releasing of >> * AuxiliaryProcs entries. > > Thanks for the report! I could reproduce this issue by adding that > sleep > into InitAuxiliaryProcess(). > >> Maybe, this kinda behaviour would appear if a computer hosting >> instances is under significant side load, which cause delay to start >> db-instances under a heavy load. >> >> Configuration for a primary server is default with "wal_level = >> logical" >> >> Configuration for a standby server is default with "wal_level = >> logical" and "primary_conninfo = 'port=5432'" >> >> I'm puzzled with this behavor. I'm pretty sure it is not what should >> be. Any ideas how this can be fixed? > > ISTM that the cause of this issue is that the startup process exits > without releasing the locks that it was holding when shutdown is > requested. To address this issue, IMO the startup process should > call ShutdownRecoveryTransactionEnvironment() at its exit. > Attached is the POC patch that changes the startup process that way. > > I've not tested the patch enough yet.. > > Regards, Thank you for reply! As far as I understand, this is really the case. I've test your patch a bit. This annoying failed assertion is gone now. I think I should test more and report later about results. Should we put this patch to CF? --- Best regards, Maxim Orlov.
On 2021/03/24 14:02, Maxim Orlov wrote: > Thank you for reply! As far as I understand, this is really the case. I've test your patch a bit. Thanks for testing the patch! > This annoying failed assertion is gone now. Good news! > I think I should test more and report later about results. > > Should we put this patch to CF? Yes. Since this is a bug, we can review and commit the patch without waiting for next CF. But I agree that it's better to add the patch to next CF so that we don't forget the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested All the tests passed successfully. The new status of this patch is: Ready for Committer
отбой: Маша сказала: уже оплатили :) вт 30.03.21 20:44, Maxim Orlov пишет: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: not tested > > All the tests passed successfully. > > The new status of this patch is: Ready for Committer
On 2021-03-30 20:44, Maxim Orlov wrote: > The following review has been posted through the commitfest > application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: not tested > > All the tests passed successfully. > > The new status of this patch is: Ready for Committer The patch is good. One note, should we put a comment about ShutdownRecoveryTransactionEnvironment not reentrant behaviour? Or maybe rename it to ShutdownRecoveryTransactionEnvironmentOnce? --- Best regards, Maxim Orlov.
On 2021/03/31 19:51, Maxim Orlov wrote: > On 2021-03-30 20:44, Maxim Orlov wrote: >> The following review has been posted through the commitfest application: >> make installcheck-world: tested, passed >> Implements feature: tested, passed >> Spec compliant: not tested >> Documentation: not tested >> >> All the tests passed successfully. >> >> The new status of this patch is: Ready for Committer > > The patch is good. One note, should we put a comment about ShutdownRecoveryTransactionEnvironment not reentrant behaviour?Or maybe rename it to ShutdownRecoveryTransactionEnvironmentOnce? +1 to add more comments into ShutdownRecoveryTransactionEnvironment(). I did that. What about the attached patch? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
On 2021-04-01 15:02, Fujii Masao wrote: > On 2021/03/31 19:51, Maxim Orlov wrote: >> On 2021-03-30 20:44, Maxim Orlov wrote: >>> The following review has been posted through the commitfest >>> application: >>> make installcheck-world: tested, passed >>> Implements feature: tested, passed >>> Spec compliant: not tested >>> Documentation: not tested >>> >>> All the tests passed successfully. >>> >>> The new status of this patch is: Ready for Committer >> >> The patch is good. One note, should we put a comment about >> ShutdownRecoveryTransactionEnvironment not reentrant behaviour? Or >> maybe rename it to ShutdownRecoveryTransactionEnvironmentOnce? > > +1 to add more comments into ShutdownRecoveryTransactionEnvironment(). > I did that. What about the attached patch? > > Regards, Well done! In my view is just what it's needed. --- Best regards, Maxim Orlov.
On 2021/04/05 16:30, Maxim Orlov wrote: > Well done! In my view is just what it's needed. Thanks for the review! I pushed the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION