[GENERAL] the database system is shutting down - terminating walsender processdue to replication timeout - Mailing list pgsql-general

From Zarko Aleksic
Subject [GENERAL] the database system is shutting down - terminating walsender processdue to replication timeout
Date
Msg-id CY4PR04MB1224DF879B60CBD87D7345F29A590@CY4PR04MB1224.namprd04.prod.outlook.com
Whole thread Raw
In response to [GENERAL] the database system is shutting down - terminating walsender processdue to replication timeout  (Zarko Aleksic <zarko.aleksic@outlook.com>)
Responses Re: [GENERAL] the database system is shutting down - terminating walsender process due to replication timeout  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general


Greetings everyone,


I'm looking for a bit of help understanding a particular behavior we are seeing with our PostgreSQL 9.6. After issuing a service shutdown command with "systemctl stop" on RHEL 7 our PostgreSQL instance started behaving weirdly. For the first time it wouldn't shutdown so easily / quickly. 


From the logs we could see that standby nodes that were trying to connect were rejected due to database being shutdown. After wal_sender_timeout and wal_receiver_timeout (default 60s) were reached the database finally shut down. It seems that walsender process was preventing the shutdown of the master database - until timeout was reached, a behavior we didn't experience before. 


Does anyone know why would this happen? 


We have 1 standby node in our primary site (same subnet as master DB), and two standbys in a remote site. WAL archiving is enabled to the remote site with rsync command that worked normally during this time and generally completes within a couple of seconds - definitely less than a minute. So we kind of ruled out WAL archiving. 

Would shutting down remote site standbys prevent this kind of delay, they usually have a couple of seconds lag in terms of replication (pg_xlog) location? 


Here is the postgres log from master that was being shutdown. 

2017-10-26 22:04:01 CDT [1701]: [6-1]  user=,db= LOG:  received fast shutdown request
2017-10-26 22:04:01 CDT [1701]: [7-1]  user=,db= LOG:  aborting any active transactions
2017-10-26 22:04:01 CDT [1711]: [2-1]  user=,db= LOG:  dbms_aq launcher shutting down
2017-10-26 22:04:01 CDT [1708]: [2-1]  user=,db= LOG:  autovacuum launcher shutting down
2017-10-26 22:04:01 CDT [1705]: [9971-1]  user=,db= LOG:  shutting down
2017-10-26 22:04:01 CDT [1705]: [9972-1]  user=,db= LOG:  checkpoint starting: shutdown immediate
2017-10-26 22:04:01 CDT [1705]: [9973-1]  user=,db= LOG:  checkpoint complete: wrote 54 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.017 s, sync=0.001 s, total=0.026 s; sync files=12, longest=0.000 s, average=0.000 s; distance=1100 kB, estimate=3307 kB
2017-10-26 22:04:01 CDT [1705]: [9974-1]  user=,db= LOG:  database system is shut down

2017-10-26 22:04:36 CDT [8763]: [1-1] "Local site standby IP"(49606) user=replication_user ,db=[unknown] FATAL:  the database system is shutting down

[OMITTED - a bunch of "FATAL:  the database system is shutting down" messages]

2017-10-26 22:05:00 CDT [2669]: [1-1] "Disaster recovery site standby IP" (55498) user=replication_user,db=[unknown] LOG:  terminating walsender process due to replication timeout

2017-10-26 22:05:00 CDT [1465]: [1-1] "Disaster recovery site standby IP"(36948) user=replication_user,db=[unknown] LOG:  terminating walsender process due to replication timeout
2017-10-26 22:05:01 CDT [1701]: [8-1]  user=,db= LOG:  database system is shut down

Standbys had the same message regarding walreceiver being terminated due to replication timeout.

Any help is greatly appreciated.

Thanks,
Zarko

pgsql-general by date:

Previous
From: Zarko Aleksic
Date:
Subject: [GENERAL] the database system is shutting down - terminating walsender processdue to replication timeout
Next
From: Zarko Aleksic
Date:
Subject: [GENERAL] the database system is shutting down - terminating walsender processdue to replication timeout