Thread: Checkpointer crashes on slave in 9.4 on windows
During internal tests, we observed that the checkpointer
crashes on a slave on Windows, with the following log on
the slave:
LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG: terminating any other active server processes
I debugged this and found that it happens when the checkpointer
tries to update the shared memory config; the call stack is
below.
> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 579 + 0x14 bytes C
postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 510 C
postgres.exe!WALInsertLockAcquireExclusive() Line 1627 C
postgres.exe!UpdateFullPageWrites() Line 9037 C
postgres.exe!UpdateSharedMemoryConfig() Line 1364 C
postgres.exe!CheckpointerMain() Line 359 C
postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180) Line 427 C
postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170) Line 4635 C
postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170) Line 207 C
Basically, the issue is that during startup, when the
checkpointer tries to acquire the WAL insertion locks to
update the value of fullPageWrites, it crashes because
the locks are not yet initialized. They are initialized in
InitXLOGAccess(), which is called via RecoveryInProgress()
when recovery is in progress, before doing the actual checkpoint.
However, we try to access them before that, which leads to
the crash.
I think the reason it occurs only on Windows is that
on Linux, fork ensures that the WAL insertion locks are
initialized with the same values as in the postmaster.
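To illustrate the difference, here is a minimal sketch in plain C, not PostgreSQL code. All names (SimLock, sim_insert_locks and the sim_* functions) are hypothetical. The two sim_child_pointer_* functions only model the assumption stated above: a forked child keeps the parent's process-local pointer values, while a freshly executed child (as on Windows) starts with them reset to NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one WAL insertion lock. */
typedef struct SimLock
{
    int held;
} SimLock;

/* Plays the role of the lock array living in shared memory. */
static SimLock sim_shared_locks[8];

/* Process-local pointer into shared memory, analogous to WALInsertLocks. */
static SimLock *sim_insert_locks = NULL;

/* Plays the role of InitXLOGAccess(): derives the process-local pointer. */
void sim_init_xlog_access(void)
{
    sim_insert_locks = sim_shared_locks;
}

/* Model of fork(): the child sees the parent's pointer value. */
SimLock *sim_child_pointer_after_fork(void)
{
    return sim_insert_locks;
}

/* Model of a re-executed child: process-local globals start over as NULL. */
SimLock *sim_child_pointer_after_exec(void)
{
    return NULL;
}

/* Reports failure instead of dereferencing NULL, which is what crashes. */
bool sim_acquire_exclusive(SimLock *locks)
{
    if (locks == NULL)
        return false;
    locks[0].held = 1;
    return true;
}

/* Walks through both startup models once. */
void sim_demo_fork_vs_exec(void)
{
    sim_init_xlog_access();     /* done once in the parent */
    assert(sim_acquire_exclusive(sim_child_pointer_after_fork()));
    assert(!sim_acquire_exclusive(sim_child_pointer_after_exec()));
}
```

In the real server there is no NULL check on this path, so the second case shows up as the access violation (0xC0000005) in the log above.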
To fix this issue, we need to ensure that the WAL insertion
locks are initialized before we use them. One way is to call
InitXLOGAccess() before calling CheckpointerMain(), as I have
done in the attached patch; another would be to call
RecoveryInProgress() much earlier in the code path than now.
Steps to reproduce the issue
-------------------------------------------
On Master
a. Change below parameters in postgresql.conf
wal_level = archive
archive_mode = on
archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f"'
archive_timeout = 10
b. Change pg_hba.conf to accept connections from slave
c. Start Server
d. Connect to server and start online backup
psql.exe -p 5432 -c "select pg_start_backup('label-1')"; postgres
e. Create the slave directory by copying everything from master
f. Remove postmaster.pid from the slave directory
g. Change the port on the slave
h. Create recovery.conf with the below parameters on the slave:
standby_mode=on
restore_command = 'copy "c:\\Users\\PostgreSQL\9.4\\bin\\archive\\%f" "%p"'
i. Stop online backup on master
psql.exe -p 5432 -c "select pg_stop_backup('1')"; postgres
j. Start the slave and you can observe the logs below:
LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
Comments?
Attachment
On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> During internals tests, it is observed that checkpointer
> is getting crashed on slave with below log on slave in
> windows:
>
> LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
> HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
> LOG: terminating any other active server processes
>
> I debugged and found that it is happening when checkpointer
> tries to update shared memory config and below is the
> call stack.
>
> > postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 579 + 0x14 bytes C
> postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 510 C
> postgres.exe!WALInsertLockAcquireExclusive() Line 1627 C
> postgres.exe!UpdateFullPageWrites() Line 9037 C
> postgres.exe!UpdateSharedMemoryConfig() Line 1364 C
> postgres.exe!CheckpointerMain() Line 359 C
> postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180) Line 427 C
> postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170) Line 4635 C
> postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170) Line 207 C
>
> Basically, here the issue is that during startup when
> checkpointer tries to acquire WAL Insertion Locks to
> update the value of fullPageWrites, it crashes because
> the same is still not initialized. It will be initialized in
> InitXLOGAccess() which will get called via RecoveryInProgress()
> in case recovery is in progress before doing actual checkpoint.
> However we are trying to access it before that which leads to
> crash.
>
> I think the reason why it occurs only on windows is that
> on linux fork will ensure that WAL Insertion Locks get
> initialized with same values as postmaster.
>
> To fix this issue, we need to ensure that WAL Insertion
> Locks should get initialized before we use them, so one of
> the ways is to call InitXLOGAccess() before calling
> CheckpointerMain() as I have done in attached patch, other
> could be to call RecoveryInProgress() much earlier in path
> than now.

So, this problem was introduced by Heikki's commit,
68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
with regular LWLocks. I think the problem here is that the
initialization code here really doesn't belong in InitXLOGAccess at
all:

1. I think WALInsertLocks is just another global variable that needs
to be saved and restored in EXEC_BACKEND mode and that it therefore
ought to participate in the save_backend_variables() mechanism instead
of having its own special-purpose mechanism to save and restore the
value.

2. And I think that the LWLockRegisterTranche call belongs in
XLOGShmemInit(), so that it's parallel to the other call in
CreateLWLocks.

I think that would be more robust, because while your fix will
definitely work, we could easily reintroduce a similar
platform-specific bug for some other auxiliary process. Using the
mechanisms described above will mean that this is set up properly for
everything that's attached to shared memory at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 24, 2014 at 12:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> So, this problem was introduced by Heikki's commit,
> 68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
> with regular LWLocks. I think the problem here is that the
> initialization code here really doesn't belong in InitXLOGAccess at
> all:
>
> 1. I think WALInsertLocks is just another global variable that needs
> to be saved and restored in EXEC_BACKEND mode and that it therefore
> ought to participate in the save_backend_variables() mechanism instead
> of having its own special-purpose mechanism to save and restore the
> value.
It seems to me that we don't need to save/restore variables that point
to shared memory, which we initialize during process startup. We
do initialize shared memory during each process start in the call below:
SubPostmasterMain(){
..
..
CreateSharedMemoryAndSemaphores(false, 0);
}
A few other examples of similar variables are below:
MultiXactShmemInit()
{
..
OldestMemberMXactId = MultiXactState->perBackendXactIds;
OldestVisibleMXactId = OldestMemberMXactId + MaxOldestSlot;
}
CreateSharedProcArray()
{
..
allProcs = ProcGlobal->allProcs;
}
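The pattern in these examples can be sketched in standalone C as follows. This is only an illustration under stated assumptions: SimXLogCtl, sim_shmem_init_struct() and the other sim_* names are hypothetical, and sim_shmem_init_struct() merely imitates the "found" convention of ShmemInitStruct(). The point is that the process-local pointers are derived on every call, while the shared contents are initialized only by the first caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_NUM_LOCKS 8

/* Hypothetical shared control structure. */
typedef struct SimXLogCtl
{
    bool initialized;
    int  locks[SIM_NUM_LOCKS];
} SimXLogCtl;

/* Plays the role of the shared memory segment. */
static SimXLogCtl sim_segment;

/* Process-local pointers into shared memory. */
static SimXLogCtl *sim_ctl = NULL;
static int        *sim_locks = NULL;

/* Imitates ShmemInitStruct(): returns the struct, reports if it existed. */
SimXLogCtl *sim_shmem_init_struct(bool *found)
{
    *found = sim_segment.initialized;
    return &sim_segment;
}

/* Runs in every process that attaches to shared memory. */
void sim_xlog_shmem_init(void)
{
    bool found;

    sim_ctl = sim_shmem_init_struct(&found);

    /* Derive process-local pointers unconditionally. */
    sim_locks = sim_ctl->locks;

    if (found)
        return;                 /* contents already set up by someone else */

    memset(sim_ctl->locks, 0, sizeof(sim_ctl->locks));
    sim_ctl->initialized = true;
}

/* Models a newly executed child: process-local pointers start as NULL. */
void sim_reset_process_local_state(void)
{
    sim_ctl = NULL;
    sim_locks = NULL;
}

/* Walks through "postmaster initializes, child re-attaches". */
void sim_demo_attach(void)
{
    sim_xlog_shmem_init();              /* first caller initializes */
    sim_locks[0] = 42;
    sim_reset_process_local_state();    /* new child process */
    sim_xlog_shmem_init();              /* child only re-derives pointers */
    assert(sim_locks != NULL && sim_locks[0] == 42);
}
```

Because the pointer assignment sits before the "found" check, a child that re-attaches gets valid pointers without any separate save/restore step.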
However, I think it is better to initialize WALInsertLocks in XLOGShmemInit(),
as we do in other cases and as you suggested in point 2.
> 2. And I think that the LWLockRegisterTranche call belongs in
> XLOGShmemInit(), so that it's parallel to the other call in
> CreateLWLocks.
Agreed.
The revised patch initializes WALInsertLocks and calls
LWLockRegisterTranche() in XLOGShmemInit(), which makes their
initialization similar to what we do in other places.
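As a rough sketch of why the tranche registration also has to run in every process: assuming the tranche lookup table is process-local, a newly executed child starts with an empty table until the shared-memory init hook registers the tranche again. The code below is standalone C with hypothetical names (SimTranche, sim_register_tranche(), SIM_WAL_INSERT_TRANCHE_ID); it does not use the real LWLock API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_TRANCHES          16
#define SIM_WAL_INSERT_TRANCHE_ID 1     /* hypothetical tranche id */

/* Hypothetical tranche descriptor. */
typedef struct SimTranche
{
    const char *name;
} SimTranche;

/* Process-local lookup table: empty in every newly executed process. */
static const SimTranche *sim_tranches[SIM_MAX_TRANCHES];

static const SimTranche sim_wal_insert_tranche = { "WALInsertLocks" };

/* Registers a tranche for the current process; returns 0 on success. */
int sim_register_tranche(int id, const SimTranche *tranche)
{
    if (id < 0 || id >= SIM_MAX_TRANCHES || tranche == NULL)
        return -1;
    sim_tranches[id] = tranche;
    return 0;
}

/* Returns the tranche name, or NULL if this process never registered it. */
const char *sim_tranche_name(int id)
{
    if (id < 0 || id >= SIM_MAX_TRANCHES || sim_tranches[id] == NULL)
        return NULL;
    return sim_tranches[id]->name;
}

/* Shared-memory init hook, run by every process that attaches. */
void sim_xlog_shmem_init_tranche(void)
{
    sim_register_tranche(SIM_WAL_INSERT_TRANCHE_ID, &sim_wal_insert_tranche);
}

/* Shows the lookup before and after the hook has run in this process. */
void sim_demo_tranche(void)
{
    sim_tranches[SIM_WAL_INSERT_TRANCHE_ID] = NULL;   /* fresh process */
    assert(sim_tranche_name(SIM_WAL_INSERT_TRANCHE_ID) == NULL);
    sim_xlog_shmem_init_tranche();
    assert(strcmp(sim_tranche_name(SIM_WAL_INSERT_TRANCHE_ID),
                  "WALInsertLocks") == 0);
}
```

Putting the registration in the init hook rather than in a lazily called function means no auxiliary process can reach the locks before the table entry exists.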
Attachment
On Thu, Jul 24, 2014 at 5:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Revised patch initialize the WALInsertLocks and call
> LWLockRegisterTranche() in XLOGShmemInit() which makes their
> initialization similar to what we do at other places.

OK, that seems all right. Will commit and back-patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company