Thread: Checkpointer crashes on slave in 9.4 on windows

Checkpointer crashes on slave in 9.4 on windows

From
Amit Kapila
Date:
During internal testing, we observed that the checkpointer
crashes on a slave on Windows with the below log:

LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG:  terminating any other active server processes

I debugged and found that it happens when the checkpointer
tries to update the shared memory config; below is the
call stack.

> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 579 + 0x14 bytes C
  postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 510 C
  postgres.exe!WALInsertLockAcquireExclusive()  Line 1627 C
  postgres.exe!UpdateFullPageWrites()  Line 9037 C
  postgres.exe!UpdateSharedMemoryConfig()  Line 1364 C
  postgres.exe!CheckpointerMain()  Line 359 C
  postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180)  Line 427 C
  postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170)  Line 4635 C
  postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170)  Line 207 C

Basically, the issue is that during startup, when the
checkpointer tries to acquire the WAL Insertion Locks to
update the value of fullPageWrites, it crashes because
the locks are not yet initialized. They are initialized in
InitXLOGAccess(), which gets called via RecoveryInProgress()
when recovery is in progress, before doing the actual checkpoint.
However, we try to access them before that, which leads to
the crash.
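
The call stack is consistent with this: the lock pointer is NULL and
valptr is 0x20, which is what you get when the address of a member is
computed from a NULL base pointer. A minimal sketch of that arithmetic,
assuming a made-up layout (ToyInsertLock and member_address are
hypothetical names, not PostgreSQL's, and the 32-byte lock state is
chosen only to mirror the 0x20 seen in the trace):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not PostgreSQL's: 32 bytes of lock state, then a value. */
typedef struct ToyInsertLock
{
    unsigned char lock_state[32];   /* stands in for the embedded lock */
    uint64_t      insertingAt;      /* value protected by the lock */
} ToyInsertLock;

/*
 * Address that the insertingAt member of array element 'index' appears to
 * have when the array base pointer holds the value 'base'.
 */
uintptr_t
member_address(uintptr_t base, size_t index)
{
    /* guard against overflow in the multiplication below */
    assert(index < SIZE_MAX / sizeof(ToyInsertLock));

    return base + index * sizeof(ToyInsertLock)
                + offsetof(ToyInsertLock, insertingAt);
}
```

With a NULL base (0) and index 0, the computed member address is just
the member's offset, so the first real memory access faults.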

I think the reason it occurs only on Windows is that
on Linux, fork ensures that the WAL Insertion Locks are
initialized with the same values as in the postmaster.
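
A minimal POSIX-only sketch of the fork half of that claim, assuming
nothing about PostgreSQL internals (insert_locks and
child_inherits_pointer are hypothetical names): a pointer set in the
parent before fork() is already set in the child. A process started
fresh, as on Windows, would instead begin with the pointer at its
static initial value (NULL) until some code assigns it again.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a process-local pointer such as WALInsertLocks. */
static int *insert_locks = NULL;

/*
 * Returns 1 if a forked child sees the pointer value set by the parent,
 * 0 if it does not, and -1 on error.
 */
int
child_inherits_pointer(void)
{
    static int storage[4];
    pid_t      pid;
    int        status;

    insert_locks = storage;         /* the "postmaster" sets the pointer */
    assert(insert_locks != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* child: memory was copied by fork(), so the pointer is already set */
        _exit(insert_locks == storage ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```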

To fix this issue, we need to ensure that the WAL Insertion
Locks are initialized before we use them. One way
is to call InitXLOGAccess() before calling
CheckpointerMain(), as I have done in the attached patch; another
would be to call RecoveryInProgress() much earlier in the path
than now.

Steps to reproduce the issue
-------------------------------------------
On Master
a. Change below parameters in postgresql.conf
    wal_level = archive
    archive_mode = on
    archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f"'
    archive_timeout = 10
b. Change pg_hba.conf to accept connections from slave
c. Start Server
d. Connect to server and start online backup
    psql.exe -p 5432 -c "select pg_start_backup('label-1');" postgres
e. Create the slave directory by copying everything from master
f.  Remove postmaster.pid from the slave directory
g. Change the port on the slave
h. Create recovery.conf with the below parameters on the slave:
    standby_mode=on
    restore_command = 'copy  "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f" "%p"'
i. Stop online backup on master
    psql.exe -p 5432 -c "select pg_stop_backup();" postgres
j.  Start the slave and you can observe the below logs:
LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.    

Comments? 


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Checkpointer crashes on slave in 9.4 on windows

From
Robert Haas
Date:
On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> During internals tests, it is observed that checkpointer
> is getting crashed on slave with below log on slave in
> windows:
>
> LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> value.
> LOG:  terminating any other active server processes
>
> I debugged and found that it is happening when checkpointer
> tries to update shared memory config and below is the
> call stack.
>
>> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode
>> mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned
>> __int64 val=18446744073709551615)  Line 579 + 0x14 bytes C
>   postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned
> __int64 * valptr=0x0000000000000020, unsigned __int64
> val=18446744073709551615)  Line 510 C
>   postgres.exe!WALInsertLockAcquireExclusive()  Line 1627 C
>   postgres.exe!UpdateFullPageWrites()  Line 9037 C
>   postgres.exe!UpdateSharedMemoryConfig()  Line 1364 C
>   postgres.exe!CheckpointerMain()  Line 359 C
>   postgres.exe!AuxiliaryProcessMain(int argc=2, char * *
> argv=0x00000000007d2180)  Line 427 C
>   postgres.exe!SubPostmasterMain(int argc=4, char * *
> argv=0x00000000007d2170)  Line 4635 C
>   postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170)  Line 207
> C
>
> Basically, here the issue is that during startup when
> checkpointer tries to acquire WAL Insertion Locks to
> update the value of fullPageWrites, it crashes because
> the same is still not initialized. It will be initialized in
> InitXLOGAccess() which will get called via RecoveryInProgress()
> in case recovery is in progress before doing actual checkpoint.
> However we are trying to access it before that which leads to
> crash.
>
> I think the reason why it occurs only on windows is that
> on linux fork will ensure that WAL Insertion Locks get
> initialized with same values as postmaster.
>
> To fix this issue, we need to ensure that WAL Insertion
> Locks should get initialized before we use them, so one of
> the ways is to call InitXLOGAccess() before calling
> CheckpointerMain() as I have done in attached patch, other
> could be to call RecoveryInProgress() much earlier in path
> than now.

So, this problem was introduced by Heikki's commit,
68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
with regular LWLocks.  I think the problem is that the
initialization code really doesn't belong in InitXLOGAccess at
all:

1. I think WALInsertLocks is just another global variable that needs
to be saved and restored in EXEC_BACKEND mode and that it therefore
ought to participate in the save_backend_variables() mechanism instead
of having its own special-purpose mechanism to save and restore the
value.

2. And I think that the LWLockRegisterTranche call belongs in
XLOGShmemInit(), so that it's parallel to the other call in
CreateLWLocks.

I think that would be more robust, because while your fix will
definitely work, we could easily reintroduce a similar
platform-specific bug for some other auxiliary process.  Using the
mechanisms described above will mean that this is set up properly for
everything that's attached to shared memory at all.
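
To illustrate point 1, here is a minimal sketch of a save/restore
mechanism of that general shape. It is not PostgreSQL's code: the
struct, variables and functions are all hypothetical toy_ names, and a
temporary file stands in for whatever channel the parent uses to hand
values to the child. The restored pointer is only meaningful if the
child maps the shared region at the same address as the parent, which
this sketch simply assumes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bundle of process-local values handed to a new process. */
typedef struct ToyBackendParameters
{
    void *insert_locks;
    int   n_locks;
} ToyBackendParameters;

/* Hypothetical process-local variables. */
static void *toy_insert_locks = NULL;
static int   toy_n_locks = 0;

/* Parent side: write the current values to fp. Returns 0 on success. */
int
toy_save_variables(FILE *fp)
{
    ToyBackendParameters p;

    memset(&p, 0, sizeof(p));
    p.insert_locks = toy_insert_locks;
    p.n_locks = toy_n_locks;
    return fwrite(&p, sizeof(p), 1, fp) == 1 ? 0 : -1;
}

/* Child side: read the values back into the local variables. */
int
toy_restore_variables(FILE *fp)
{
    ToyBackendParameters p;

    if (fread(&p, sizeof(p), 1, fp) != 1)
        return -1;
    toy_insert_locks = p.insert_locks;
    toy_n_locks = p.n_locks;
    return 0;
}

/* Returns 1 if the values survive a simulated process restart, -1 on error. */
int
toy_round_trip(void)
{
    static char region[64];         /* stands in for shared memory */
    FILE       *fp = tmpfile();
    int         ok;

    if (fp == NULL)
        return -1;

    toy_insert_locks = region;
    toy_n_locks = 8;
    ok = (toy_save_variables(fp) == 0);

    /* simulate a freshly started process: locals are back to their defaults */
    toy_insert_locks = NULL;
    toy_n_locks = 0;
    assert(toy_insert_locks == NULL);

    rewind(fp);
    ok = ok && (toy_restore_variables(fp) == 0);
    fclose(fp);

    if (!ok)
        return -1;
    return (toy_insert_locks == (void *) region && toy_n_locks == 8) ? 1 : 0;
}
```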

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Checkpointer crashes on slave in 9.4 on windows

From
Amit Kapila
Date:
On Thu, Jul 24, 2014 at 12:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> So, this problem was introduced by Heikki's commit,
> 68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
> with regular LWLocks.   I think the problem here is that the
> initialization code here really doesn't belong in InitXLOGAccess at
> all:
>
> 1. I think WALInsertLocks is just another global variable that needs
> to be saved and restored in EXEC_BACKEND mode and that it therefore
> ought to participate in the save_backend_variables() mechanism instead
> of having its own special-purpose mechanism to save and restore the
> value.

It seems to me that we don't need to save/restore variables that point
to shared memory, since we initialize them during process startup.  We
do initialize shared memory during each process start in the below call:
SubPostmasterMain()
{
..
..
CreateSharedMemoryAndSemaphores(false, 0);
}

A few other examples of similar variables are below:

MultiXactShmemInit()
{
..
OldestMemberMXactId = MultiXactState->perBackendXactIds;
OldestVisibleMXactId = OldestMemberMXactId + MaxOldestSlot;
}

CreateSharedProcArray()
{
..
allProcs = ProcGlobal->allProcs;
allPgXact = ProcGlobal->allPgXact;
}

However, I think it is better to initialize WALInsertLocks in XLOGShmemInit(),
as we do in the other cases and as you suggested in point 2.
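
The pattern can be sketched roughly as below. This is a toy model under
stated assumptions, not the patch: a static struct stands in for the
shared memory segment, and every toy_ name is hypothetical. The point
it shows is that the init function runs both in the process that
creates the structure and in each process that attaches later, and it
sets the process-local pointer in both cases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical control structure living in "shared memory". */
typedef struct ToyXLogCtl
{
    int  n_locks;
    long locks[8];
} ToyXLogCtl;

static ToyXLogCtl toy_segment;          /* stands in for the shared segment */
static int        toy_segment_created = 0;

/* Hypothetical process-local pointer into the shared structure. */
static long *toy_insert_locks_ptr = NULL;

/* Toy lookup: returns the struct and reports whether it already existed. */
static ToyXLogCtl *
toy_shmem_init_struct(int *found)
{
    *found = toy_segment_created;
    toy_segment_created = 1;
    return &toy_segment;
}

/* Called at the start of every process that attaches to shared memory. */
void
toy_xlog_shmem_init(void)
{
    int         found;
    ToyXLogCtl *ctl = toy_shmem_init_struct(&found);

    if (!found)
    {
        /* first caller only: initialize the shared contents */
        memset(ctl, 0, sizeof(*ctl));
        ctl->n_locks = 8;
    }

    /* every caller: derive the process-local pointer from shared memory */
    toy_insert_locks_ptr = ctl->locks;
    assert(toy_insert_locks_ptr != NULL);
}

/* Simulate a freshly started process whose local pointer is unset. */
void
toy_simulate_new_process(void)
{
    toy_insert_locks_ptr = NULL;
}

long *
toy_insert_locks(void)
{
    return toy_insert_locks_ptr;
}
```

Because the pointer is derived again on every attach, no per-variable
save/restore step is needed for it.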

> 2. And I think that the LWLockRegisterTranche call belongs in
> XLOGShmemInit(), so that it's parallel to the other call in
> CreateLWLocks.

Agreed.

The revised patch initializes WALInsertLocks and calls
LWLockRegisterTranche() in XLOGShmemInit(), which makes their
initialization similar to what we do in other places.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Checkpointer crashes on slave in 9.4 on windows

From
Robert Haas
Date:
On Thu, Jul 24, 2014 at 5:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Revised patch initialize the WALInsertLocks and call
> LWLockRegisterTranche() in XLOGShmemInit() which makes their
> initialization similar to what we do at other places.

OK, that seems all right.  Will commit and back-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company