Checkpointer crashes on slave in 9.4 on windows - Mailing list pgsql-hackers

From Amit Kapila
Subject Checkpointer crashes on slave in 9.4 on windows
Date
Msg-id CAA4eK1JPz-=rKR5+aKf=gwPBYXsFJXcP-qHxaugB+_MNvNeJVQ@mail.gmail.com
Responses Re: Checkpointer crashes on slave in 9.4 on windows  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
During internal tests, we observed that the checkpointer
crashes on a slave on Windows with the following log:

LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG:  terminating any other active server processes

I debugged and found that it happens when the checkpointer
tries to update the shared memory config; below is the
call stack.

> postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 579 + 0x14 bytes C
  postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615)  Line 510 C
  postgres.exe!WALInsertLockAcquireExclusive()  Line 1627 C
  postgres.exe!UpdateFullPageWrites()  Line 9037 C
  postgres.exe!UpdateSharedMemoryConfig()  Line 1364 C
  postgres.exe!CheckpointerMain()  Line 359 C
  postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180)  Line 427 C
  postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170)  Line 4635 C
  postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170)  Line 207 C

Basically, the issue is that during startup, when the
checkpointer tries to acquire the WAL Insertion Locks to
update the value of fullPageWrites, it crashes because
they are not yet initialized. They are initialized in
InitXLOGAccess(), which gets called via RecoveryInProgress()
in case recovery is in progress before doing the actual checkpoint.
However, we are trying to access them before that, which leads to
the crash.
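
The ordering problem described above can be modelled in a few lines
of C. This is only a sketch: every name in it (ModelLock, model_locks,
model_init_access, model_acquire_all) is a hypothetical stand-in and
not PostgreSQL's real code, and it reports an error where the real
checkpointer dereferences the unset pointer and dies with 0xC0000005.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PostgreSQL's definitions. */
#define MODEL_NUM_LOCKS 8

typedef struct { int held; } ModelLock;

static ModelLock model_shared[MODEL_NUM_LOCKS]; /* plays the role of shared memory */
static ModelLock *model_locks = NULL;           /* per-process pointer, unset at start */

/* Plays the role of InitXLOGAccess(): attach the per-process pointer. */
void model_init_access(void)
{
    model_locks = model_shared;
}

/*
 * Plays the role of acquiring all WAL insertion locks.
 * Returns 0 on success, -1 if the per-process pointer is still unset.
 * The real code has no such check, so it faults on the NULL pointer.
 */
int model_acquire_all(void)
{
    if (model_locks == NULL)
        return -1;
    for (size_t i = 0; i < MODEL_NUM_LOCKS; i++)
        model_locks[i].held = 1;
    return 0;
}
```

Calling model_acquire_all() before model_init_access() fails, which
corresponds to the l=0x0000000000000000 argument in the call stack.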

I think the reason it occurs only on Windows is that
on Linux, fork() ensures that the WAL Insertion Locks are
initialized with the same values as in the postmaster.
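
As an illustration of that hypothesis, the POSIX-only sketch below
sets a process-local pointer in the parent and checks that a forked
child still sees it. The names (fork_demo_locks,
fork_child_sees_pointer) are made up for this example. A process
started from scratch, as child processes are on Windows, would begin
with the pointer at its initial NULL value instead.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for the lock array and the per-process pointer. */
static int fork_demo_storage;
static int *fork_demo_locks = NULL;

/*
 * Returns 1 if the forked child sees the pointer set by the parent,
 * 0 if it does not, and -1 if fork() or waitpid() fails.
 */
int fork_child_sees_pointer(void)
{
    int status;
    pid_t pid;

    fork_demo_locks = &fork_demo_storage;   /* the "postmaster" initializes it */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(fork_demo_locks != NULL ? 0 : 1);   /* child inherits the value */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```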

To fix this issue, we need to ensure that the WAL Insertion
Locks are initialized before we use them. One way
is to call InitXLOGAccess() before calling
CheckpointerMain(), as I have done in the attached patch; another
would be to call RecoveryInProgress() much earlier in the path
than now.
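
The first option can be sketched as follows. This is not the attached
patch, only an outline of the idea under stated assumptions: the
names (fix_init_access, fix_update_config, fix_checkpointer_entry)
are hypothetical, and making the initializer idempotent is a choice
of this sketch so that a later call from another path stays harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PostgreSQL's definitions. */
#define FIX_NUM_LOCKS 4

typedef struct { bool held; } FixLock;

static FixLock fix_shared[FIX_NUM_LOCKS];
static FixLock *fix_locks = NULL;
int fix_init_calls = 0;   /* counts how often initialization really ran */

/* Idempotent initializer: does nothing if the pointer is already set. */
void fix_init_access(void)
{
    if (fix_locks != NULL)
        return;
    fix_locks = fix_shared;
    fix_init_calls++;
}

/* Stand-in for the config update that takes and releases all locks. */
bool fix_update_config(void)
{
    if (fix_locks == NULL)
        return false;
    for (size_t i = 0; i < FIX_NUM_LOCKS; i++)
        fix_locks[i].held = true;
    for (size_t i = 0; i < FIX_NUM_LOCKS; i++)
        fix_locks[i].held = false;
    return true;
}

/* Process entry: initialize first, then do the work that needs the locks. */
bool fix_checkpointer_entry(void)
{
    fix_init_access();
    return fix_update_config();
}
```

With this ordering the update succeeds on the first pass through the
entry point, and a repeated call to the initializer changes nothing.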

Steps to reproduce the issue
-------------------------------------------
On Master
a. Change the below parameters in postgresql.conf
    wal_level = archive
    archive_mode = on
    archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f"'
    archive_timeout = 10
b. Change pg_hba.conf to accept connections from the slave
c. Start the server
d. Connect to the server and start an online backup
    psql.exe -p 5432 -c "select pg_start_backup('label-1');" postgres
e. Create the slave directory by copying everything from the master
f. Remove postmaster.pid from the slave directory
g. Change the port on the slave
h. Create recovery.conf with the below parameters on the slave:
    standby_mode=on
    restore_command = 'copy "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f" "%p"'
i. Stop the online backup on the master
    psql.exe -p 5432 -c "select pg_stop_backup();" postgres
j. Start the slave and you can observe the below logs:
LOG:  checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.

Comments? 


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com