Thread: [GENERAL] PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"
We have encountered a problem where the PostgreSQL walsender process does not exit after "pg_ctl stop -m fast". We use PostgreSQL 9.6.2.

Steps:
1) The active postgres server is up.
2) The standby postgres instance takes a pg_basebackup.
3) We use the command "pg_ctl stop -W -m fast -D /mnt/db/DBTestPostgres/db_data" to stop the active postgres server.
4) The active postgres server cannot stop successfully. It hangs on the walsender process, which does not exit (we waited about 10 minutes).
5) After "kill -9" on the walsender process, it exits immediately.

1. We made postgres dump core; the stack traces are below:

Stack trace of thread 2288:
#0  0x00007f7fae6ea013 __select (libc.so.6)
#1  0x000000000046df6b ServerLoop (postgres)
#2  0x000000000060d528 PostmasterMain (postgres)
#3  0x000000000046f2dc main (postgres)
#4  0x00007f7fae62a461 __libc_start_main (libc.so.6)
#5  0x000000000046f35a _start (postgres)

Nov 10 10:08:02 mn-1 systemd-coredump[8994]: Process 2645 (postgres) of user 550 dumped core.

Stack trace of thread 2645:
#0  0x00007f7fae6f1d63 epoll_wait (libc.so.6)
#1  0x000000000063ce69 WaitEventSetWait (postgres)
#2  0x000000000063d1ab WaitLatchOrSocket (postgres)
#3  0x0000000000622ead WalSndLoop (postgres)
#4  0x0000000000623a8f exec_replication_command (postgres)
#5  0x00000000006558fd PostgresMain (postgres)
#6  0x000000000046e631 ServerLoop (postgres)
#7  0x000000000060d528 PostmasterMain (postgres)
#8  0x000000000046f2dc main (postgres)
#9  0x00007f7fae62a461 __libc_start_main (libc.so.6)
#10 0x000000000046f35a _start (postgres)

2.
Active postgres instance log:

Nov 09 12:29:53 mn-1 postgres[6073]: [3-1] DEBUG: autovacuum: processing database "DBTestPostgres"
Nov 09 12:30:13 mn-1 postgres[6204]: [3-1] DEBUG: autovacuum: processing database "postgreswd"
Nov 09 12:30:20 mn-1 postgres[2231]: [35-1] DEBUG: checkpoint sync: number=1 file=base/16385/16467 time=14.990 msec
Nov 09 12:30:20 mn-1 postgres[2231]: [36-1] LOG: checkpoint complete: wrote 1065 buffers (6.5%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=59.892 s, sync=0.015 s, total=59.927 s; sync files=1, longest=0.014 s, average=0.014 s; distance=12487 kB, estimate=19411 kB
Nov 09 12:30:20 mn-1 postgres[2231]: [37-1] LOG: checkpoint starting: force wait
Nov 09 12:30:20 mn-1 postgres[2231]: [38-1] DEBUG: performing replication slot checkpoint
Nov 09 12:30:21 mn-1 postgres[2231]: [39-1] DEBUG: checkpoint sync: number=1 file=base/16385/16467 time=59.400 msec
Nov 09 12:30:21 mn-1 postgres[2231]: [40-1] LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.102 s, sync=0.059 s, total=1.168 s; sync files=1, longest=0.059 s, average=0.059 s; distance=3900 kB, estimate=17860 kB
Nov 09 12:30:23 mn-1 postgres[2235]: [28-1] DEBUG: archived transaction log file "000000020000000000000193"
Nov 09 12:30:23 mn-1 postgres[2235]: [29-1] DEBUG: archived transaction log file "000000020000000000000193.00000108.backup"
Nov 09 12:30:24 mn-1 postgres[6263]: [3-1] DEBUG: received replication command: IDENTIFY_SYSTEM
Nov 09 12:30:24 mn-1 postgres[6263]: [4-1] DEBUG: received replication command: START_REPLICATION 0/65000000 TIMELINE 2
Nov 09 12:30:24 mn-1 postgres[6263]: [5-1] DEBUG: standby "walreceiver" has now caught up with primary
Nov 09 12:30:24 mn-1 dbim-postgres[2039]: [2039-139870943506432] ....
/opt/nokia/libexec/SS_RCPPostgress/bin/pg_ctl stop -W -m fast -D /mnt/db/DBTestPostgres/db_data
Nov 09 12:30:24 mn-1 postgres[2157]: [3-1] LOG: received fast shutdown request
Nov 09 12:30:24 mn-1 postgres[2157]: [4-1] LOG: aborting any active transactions
Nov 09 12:30:24 mn-1 postgres[6073]: [4-1] err-2: terminating autovacuum process due to administrator command
Nov 09 12:30:24 mn-1 postgres[2234]: [3-1] LOG: autovacuum launcher shutting down
Nov 09 12:30:24 mn-1 postgres[2231]: [41-1] LOG: shutting down
......
Nov 09 12:30:44 mn-1 postgres[6408]: [5-1] err-2: the database system is shutting down
Nov 09 12:30:46 mn-1 postgres[6411]: [5-1] err-2: the database system is shutting down
Nov 09 12:30:48 mn-1 postgres[6415]: [5-1] err-2: the database system is shutting down
Nov 09 12:30:50 mn-1 postgres[6434]: [5-1] err-2: the database system is shutting down
Nov 09 12:30:52 mn-1 postgres[6442]: [5-1] err-2: the database system is shutting down

3. Slave instance log:

Nov 09 12:30:23 mn-0 postgres[11641]: [1-1] LOG: ending log output to stderr
Nov 09 12:30:23 mn-0 dbim-postgres[11104]: LOG: ending log output to stderr
Nov 09 12:30:23 mn-0 dbim-postgres[11104]: HINT: Future log output will go to log destination "syslog".
Nov 09 12:30:23 mn-0 postgres[11641]: [1-2] HINT: Future log output will go to log destination "syslog".
Nov 09 12:30:23 mn-0 postgres[11642]: [2-1] LOG: database system was interrupted; last known up at 2017-11-09 10:30:21 GMT
Nov 09 12:30:23 mn-0 postgres[11642]: [3-1] LOG: entering standby mode
Nov 09 12:30:23 mn-0 postgres[11642]: [4-1] DEBUG: checkpoint record is at 0/64C00140
Nov 09 12:30:23 mn-0 postgres[11642]: [5-1] DEBUG: redo record is at 0/64C00108; shutdown FALSE
Nov 09 12:30:23 mn-0 postgres[11642]: [6-1] DEBUG: next transaction ID: 0:464; next OID: 16473
Nov 09 12:30:23 mn-0 postgres[11642]: [7-1] DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
Nov 09 12:30:23 mn-0 postgres[11642]: [8-1] DEBUG: oldest unfrozen transaction ID: 418, in database 1
Nov 09 12:30:23 mn-0 postgres[11642]: [9-1] DEBUG: oldest MultiXactId: 1, in database 1
Nov 09 12:30:23 mn-0 postgres[11642]: [10-1] DEBUG: commit timestamp Xid oldest/newest: 0/0
Nov 09 12:30:23 mn-0 postgres[11642]: [11-1] DEBUG: transaction ID wrap limit is 2147484065, limited by database with OID 1
Nov 09 12:30:23 mn-0 postgres[11642]: [12-1] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
Nov 09 12:30:23 mn-0 postgres[11642]: [13-1] DEBUG: starting up replication slots
Nov 09 12:30:23 mn-0 postgres[11642]: [14-1] DEBUG: resetting unlogged relations: cleanup 1 init 0
Nov 09 12:30:23 mn-0 postgres[11642]: [15-1] DEBUG: initializing for hot standby
Nov 09 12:30:23 mn-0 postgres[11642]: [16-1] LOG: redo starts at 0/64C00108
Nov 09 12:30:23 mn-0 postgres[11642]: [17-1] DEBUG: recovery snapshots are now enabled
Nov 09 12:30:23 mn-0 postgres[11642]: [17-2] CONTEXT: xlog redo at 0/64C00108 for Standby/RUNNING_XACTS: nextXid 464 latestCompletedXid 463 oldestRunningXid 464
Nov 09 12:30:23 mn-0 postgres[11642]: [18-1] DEBUG: end of backup reached
Nov 09 12:30:23 mn-0 postgres[11642]: [18-2] CONTEXT: xlog redo at 0/64C001B0 for XLOG/BACKUP_END: 0/64C00108
Nov 09 12:30:23 mn-0 postgres[11642]: [19-1] LOG: consistent recovery state reached at 0/64C001D8
Nov 09 12:30:23 mn-0 postgres[11641]: [2-1] LOG: database system is ready to accept read only connections
Nov 09 12:30:23 mn-0 postgres[11650]: [3-1] LOG: started streaming WAL from primary at 0/65000000 on timeline 2
Nov 09 12:30:24 mn-0 postgres[11642]: [20-1] DEBUG: transaction ID wrap limit is 2147484065, limited by database with OID 1
Nov 09 12:30:24 mn-0 postgres[11642]: [20-2] CONTEXT: xlog redo at 0/65400028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/65400028; tli 2; prev tli 2; fpw true; xid 0:464; oid 16473; multi 1; offset 0; oldest xid 418 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
Nov 09 12:30:24 mn-0 postgres[11650]: [4-1] LOG: replication terminated by primary server
Nov 09 12:30:24 mn-0 postgres[11650]: [4-2] DETAIL: End of WAL reached on timeline 2 at 0/65400098.
Nov 09 12:30:24 mn-0 postgres[11650]: [5-1] err-2: could not send end-of-streaming message to primary: no COPY in progress
Nov 09 12:30:24 mn-0 postgres[11650]: [5-2]
Nov 09 12:30:24 mn-0 postgres[11642]: [21-1] LOG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:24 mn-0 postgres[11701]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:24 mn-0 postgres[11701]: [3-2]
Nov 09 12:30:26 mn-0 postgres[11642]: [22-1] DEBUG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:26 mn-0 postgres[11720]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:26 mn-0 postgres[11720]: [3-2]
Nov 09 12:30:28 mn-0 postgres[11642]: [23-1] DEBUG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:28 mn-0 postgres[11739]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:28 mn-0 postgres[11739]: [3-2]
Nov 09 12:30:30 mn-0 postgres[11642]: [24-1] DEBUG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:30 mn-0 postgres[11767]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:30 mn-0 postgres[11767]: [3-2]
Nov 09 12:30:32 mn-0 postgres[11642]: [25-1] DEBUG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:32 mn-0 postgres[11770]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:54 mn-0 postgres[11642]: [36-1] DEBUG: invalid record length at 0/65400098: wanted 24, got 0
Nov 09 12:30:54 mn-0 postgres[12088]: [3-1] err-2: could not connect to the primary server: err-2: the database system is shutting down
Nov 09 12:30:54 mn-0 postgres[12088]: [3-2]

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
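For completeness: the manual "kill -9" from step 5 can be scripted as an escalating stop, and that is what we do as a stop-gap while the hang is unexplained. Below is a minimal Python sketch of the escalation logic only; it uses a spawned child process as a stand-in for the stuck walsender (in real use one would target the walsender PID from pg_stat_activity, and probing a non-child PID would need os.kill(pid, 0) instead of poll()). The function name and timeouts are ours, not anything PostgreSQL provides, and force-killing a backend makes the postmaster go through crash recovery, so this is a diagnostic aid, not a production practice.

```python
import signal
import subprocess
import sys
import time

def stop_with_escalation(proc, grace_seconds=10.0, poll_interval=0.1):
    """Send SIGTERM (what a fast shutdown delivers to backends); if the
    process is still alive after grace_seconds, fall back to SIGKILL,
    mirroring the manual 'kill -9' from step 5.  Returns the signal
    that finally took the process down."""
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:      # process exited on SIGTERM
            return signal.SIGTERM
        time.sleep(poll_interval)
    proc.kill()                          # stuck, like the hung walsender
    proc.wait()                          # reap it
    return signal.SIGKILL

if __name__ == "__main__":
    # Demo with a child that ignores SIGTERM, like the stuck walsender.
    child = subprocess.Popen([sys.executable, "-c",
        "import signal, time; "
        "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"])
    time.sleep(0.5)                      # let the child install its handler
    sig = stop_with_escalation(child, grace_seconds=1.0)
    print(sig == signal.SIGKILL)         # the stubborn child needed SIGKILL
```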
Re: [GENERAL] PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"
From
Tom Lane
Date:
y39chen <yan-jack.chen@nokia.com> writes:
> We encounter one problem that PostgreSQL walsender process doesn't exit
> after "pg_ctl stop -m fast".
> Uses PostgreSQL 9.6.2

There was a fix in 9.6.4 that's at least related to this problem. It would be interesting to see if you can still reproduce it on current 9.6.

			regards, tom lane
Re: [GENERAL] PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"
From
Michael Paquier
Date:
On Tue, Nov 14, 2017 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> y39chen <yan-jack.chen@nokia.com> writes:
>> We encounter one problem that PostgreSQL walsender process doesn't exit
>> after "pg_ctl stop -m fast".
>> Uses PostgreSQL 9.6.2
>
> There was a fix in 9.6.4 that's at least related to this problem.
> It would be interesting to see if you can still reproduce it on
> current 9.6.

Commit that may matter here:

commit: e9d4aa594f2caa8c28d55c41c9926420b1efdb79
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Fri, 30 Jun 2017 12:00:03 -0400

    Fix walsender to exit promptly if client requests shutdown.

    It's possible for WalSndWaitForWal to be asked to wait for WAL that
    doesn't exist yet.  That's fine, in fact it's the normal situation if
    we're caught up; but when the client requests shutdown we should not
    keep waiting.  The previous coding could wait indefinitely if the source
    server was idle.

    In passing, improve the rather weak comments in this area, and slightly
    rearrange some related code for better readability.

    Back-patch to 9.4 where this code was introduced.

    Discussion: https://postgr.es/m/14154.1498781234@sss.pgh.pa.us

--
Michael
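The failure mode that commit message describes boils down to a wait loop that watches only for new WAL and never re-checks whether the client has asked to shut down. The sketch below is a schematic Python rendition of that pattern, purely to illustrate the shape of the fix; the real code is C in walsender.c, and all names here (wait_for_wal, wal_arrived, shutdown_requested) are illustrative, not PostgreSQL's.

```python
import threading

def wait_for_wal(target_lsn, current_lsn, wal_arrived, shutdown_requested,
                 timeout=0.1):
    """Schematic version of the WalSndWaitForWal situation: wait until WAL
    up to target_lsn exists.  Before the fix, the loop only woke up for new
    WAL, so an idle primary left it sleeping forever even after the client
    requested shutdown.  The fix is the extra shutdown check in the loop."""
    while current_lsn() < target_lsn:
        if shutdown_requested.is_set():   # the 9.6.4-era fix: leave promptly
            return "shutdown"
        wal_arrived.wait(timeout)         # otherwise keep waiting for WAL
        wal_arrived.clear()
    return "caught-up"

# With shutdown requested and no WAL ever arriving, the loop exits at once
# instead of hanging, which matches the symptom reported in this thread.
wal = threading.Event()
stop = threading.Event()
stop.set()                                # client has requested shutdown
print(wait_for_wal(100, lambda: 0, wal, stop))
```

Without the `shutdown_requested` check, the loop above would spin on `wal_arrived.wait()` forever when the primary produces no more WAL, which is exactly the state a fast shutdown puts it in.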
[GENERAL] Re: PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"
From
y39chen
Date:
Thank you for the explanation. We shall try the latest PostgreSQL 9.6.6 version.