Thread: [GENERAL] PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"

We encounter a problem where the PostgreSQL walsender process doesn't exit
after "pg_ctl stop -m fast".
We are using PostgreSQL 9.6.2.
Steps (a command sketch follows below):
1) the active postgres server is up;
2) the standby postgres instance takes a pg_basebackup;
3) use the command "pg_ctl stop -W -m fast -D /mnt/db/DBTestPostgres/db_data" to
stop the active postgres server;
4) the active postgres server can't stop successfully. It hangs on the walsender
process, which doesn't exit (we waited about 10 minutes);
5) after "kill -9" on the walsender process, it exits immediately.
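
For reference, a rough command sketch of the reproduction. The pg_ctl
invocation and the data directory are the ones from this report; the
hostnames, the replication user and the standby directory are placeholders
for our setup:

    # 1) the active server is already running on mn-1
    # 2) take a base backup for the standby (mn-0); options are illustrative
    pg_basebackup -h mn-1 -U replicator -X stream -R \
        -D /mnt/db/DBTestPostgres/db_data_standby
    # 3) request a fast shutdown of the active server without waiting (-W)
    pg_ctl stop -W -m fast -D /mnt/db/DBTestPostgres/db_data
    # 4)/5) the walsender is still there after ~10 minutes; kill -9 makes it exit
    ps -ef | grep 'wal sender'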

1. We made postgres produce core dumps; the stack traces are below:
        Stack trace of thread 2288:
        #0  0x00007f7fae6ea013 __select (libc.so.6)
        #1  0x000000000046df6b ServerLoop (postgres)
        #2  0x000000000060d528 PostmasterMain (postgres)
        #3  0x000000000046f2dc main (postgres)
        #4  0x00007f7fae62a461 __libc_start_main (libc.so.6)
        #5  0x000000000046f35a _start (postgres)

Nov 10 10:08:02 mn-1 systemd-coredump[8994]: Process 2645 (postgres) of user
550 dumped core.
        Stack trace of thread 2645:
        #0  0x00007f7fae6f1d63 epoll_wait (libc.so.6)
        #1  0x000000000063ce69 WaitEventSetWait (postgres)
        #2  0x000000000063d1ab WaitLatchOrSocket (postgres)
        #3  0x0000000000622ead WalSndLoop (postgres)
        #4  0x0000000000623a8f exec_replication_command (postgres)
        #5  0x00000000006558fd PostgresMain (postgres)
        #6  0x000000000046e631 ServerLoop (postgres)
        #7  0x000000000060d528 PostmasterMain (postgres)
        #8  0x000000000046f2dc main (postgres)
        #9  0x00007f7fae62a461 __libc_start_main (libc.so.6)
        #10 0x000000000046f35a _start (postgres)
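
For completeness, this is roughly how such traces can be collected while the
walsender is stuck; PID 2645 is the walsender from the core dump above,
substitute the actual PID on your system:

    # find the stuck walsender process
    ps -ef | grep 'wal sender'
    # attach gdb and dump all thread backtraces without killing the process
    gdb -p 2645 -batch -ex 'thread apply all bt'
    # or inspect the core captured by systemd-coredump afterwards
    coredumpctl gdb 2645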


2. active postgres instance log:

Nov 09 12:29:53 mn-1 postgres[6073]: [3-1] DEBUG:  autovacuum: processing
database "DBTestPostgres"
Nov 09 12:30:13 mn-1 postgres[6204]: [3-1] DEBUG:  autovacuum: processing
database "postgreswd"
Nov 09 12:30:20 mn-1 postgres[2231]: [35-1] DEBUG:  checkpoint sync:
number=1 file=base/16385/16467 time=14.990 msec
Nov 09 12:30:20 mn-1 postgres[2231]: [36-1] LOG:  checkpoint complete: wrote
1065 buffers (6.5%); 0 transaction log file(s) added, 0 removed, 2 recycled;
write=59.892 s, sync=0.015 s, total=59.927 s; sync files=1, longest=0.014 s,
average=0.014 s; distance=12487 kB, estimate=19411 kB
Nov 09 12:30:20 mn-1 postgres[2231]: [37-1] LOG:  checkpoint starting: force
wait
Nov 09 12:30:20 mn-1 postgres[2231]: [38-1] DEBUG:  performing replication
slot checkpoint
Nov 09 12:30:21 mn-1 postgres[2231]: [39-1] DEBUG:  checkpoint sync:
number=1 file=base/16385/16467 time=59.400 msec
Nov 09 12:30:21 mn-1 postgres[2231]: [40-1] LOG:  checkpoint complete: wrote
1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=1.102 s, sync=0.059 s, total=1.168 s; sync files=1, longest=0.059 s,
average=0.059 s; distance=3900 kB, estimate=17860 kB
Nov 09 12:30:23 mn-1 postgres[2235]: [28-1] DEBUG:  archived transaction log
file "000000020000000000000193"
Nov 09 12:30:23 mn-1 postgres[2235]: [29-1] DEBUG:  archived transaction log
file "000000020000000000000193.00000108.backup"
Nov 09 12:30:24 mn-1 postgres[6263]: [3-1] DEBUG:  received replication
command: IDENTIFY_SYSTEM
Nov 09 12:30:24 mn-1 postgres[6263]: [4-1] DEBUG:  received replication
command: START_REPLICATION 0/65000000 TIMELINE 2
Nov 09 12:30:24 mn-1 postgres[6263]: [5-1] DEBUG:  standby "walreceiver" has
now caught up with primary

Nov 09 12:30:24 mn-1 dbim-postgres[2039]: [2039-139870943506432] ....
/opt/nokia/libexec/SS_RCPPostgress/bin/pg_ctl stop -W -m fast -D
/mnt/db/DBTestPostgres/db_data
Nov 09 12:30:24 mn-1 postgres[2157]: [3-1] LOG:  received fast shutdown
request
Nov 09 12:30:24 mn-1 postgres[2157]: [4-1] LOG:  aborting any active
transactions
Nov 09 12:30:24 mn-1 postgres[6073]: [4-1] err-2:  terminating autovacuum
process due to administrator command
Nov 09 12:30:24 mn-1 postgres[2234]: [3-1] LOG:  autovacuum launcher
shutting down
Nov 09 12:30:24 mn-1 postgres[2231]: [41-1] LOG:  shutting down
......
Nov 09 12:30:44 mn-1 postgres[6408]: [5-1] err-2:  the database system is
shutting down
Nov 09 12:30:46 mn-1 postgres[6411]: [5-1] err-2:  the database system is
shutting down
Nov 09 12:30:48 mn-1 postgres[6415]: [5-1] err-2:  the database system is
shutting down
Nov 09 12:30:50 mn-1 postgres[6434]: [5-1] err-2:  the database system is
shutting down
Nov 09 12:30:52 mn-1 postgres[6442]: [5-1] err-2:  the database system is
shutting down

3. standby instance log:
Nov 09 12:30:23 mn-0 postgres[11641]: [1-1] LOG:  ending log output to
stderr
Nov 09 12:30:23 mn-0 dbim-postgres[11104]: LOG:  ending log output to stderr
Nov 09 12:30:23 mn-0 dbim-postgres[11104]: HINT:  Future log output will go
to log destination "syslog".
Nov 09 12:30:23 mn-0 postgres[11641]: [1-2] HINT:  Future log output will go
to log destination "syslog".
Nov 09 12:30:23 mn-0 postgres[11642]: [2-1] LOG:  database system was
interrupted; last known up at 2017-11-09 10:30:21 GMT
Nov 09 12:30:23 mn-0 postgres[11642]: [3-1] LOG:  entering standby mode
Nov 09 12:30:23 mn-0 postgres[11642]: [4-1] DEBUG:  checkpoint record is at
0/64C00140
Nov 09 12:30:23 mn-0 postgres[11642]: [5-1] DEBUG:  redo record is at
0/64C00108; shutdown FALSE
Nov 09 12:30:23 mn-0 postgres[11642]: [6-1] DEBUG:  next transaction ID:
0:464; next OID: 16473
Nov 09 12:30:23 mn-0 postgres[11642]: [7-1] DEBUG:  next MultiXactId: 1;
next MultiXactOffset: 0
Nov 09 12:30:23 mn-0 postgres[11642]: [8-1] DEBUG:  oldest unfrozen
transaction ID: 418, in database 1
Nov 09 12:30:23 mn-0 postgres[11642]: [9-1] DEBUG:  oldest MultiXactId: 1,
in database 1
Nov 09 12:30:23 mn-0 postgres[11642]: [10-1] DEBUG:  commit timestamp Xid
oldest/newest: 0/0
Nov 09 12:30:23 mn-0 postgres[11642]: [11-1] DEBUG:  transaction ID wrap
limit is 2147484065, limited by database with OID 1
Nov 09 12:30:23 mn-0 postgres[11642]: [12-1] DEBUG:  MultiXactId wrap limit
is 2147483648, limited by database with OID 1
Nov 09 12:30:23 mn-0 postgres[11642]: [13-1] DEBUG:  starting up replication
slots
Nov 09 12:30:23 mn-0 postgres[11642]: [14-1] DEBUG:  resetting unlogged
relations: cleanup 1 init 0
Nov 09 12:30:23 mn-0 postgres[11642]: [15-1] DEBUG:  initializing for hot
standby
Nov 09 12:30:23 mn-0 postgres[11642]: [16-1] LOG:  redo starts at 0/64C00108
Nov 09 12:30:23 mn-0 postgres[11642]: [17-1] DEBUG:  recovery snapshots are
now enabled
Nov 09 12:30:23 mn-0 postgres[11642]: [17-2] CONTEXT:  xlog redo at
0/64C00108 for Standby/RUNNING_XACTS: nextXid 464 latestCompletedXid 463
oldestRunningXid 464
Nov 09 12:30:23 mn-0 postgres[11642]: [18-1] DEBUG:  end of backup reached
Nov 09 12:30:23 mn-0 postgres[11642]: [18-2] CONTEXT:  xlog redo at
0/64C001B0 for XLOG/BACKUP_END: 0/64C00108
Nov 09 12:30:23 mn-0 postgres[11642]: [19-1] LOG:  consistent recovery state
reached at 0/64C001D8
Nov 09 12:30:23 mn-0 postgres[11641]: [2-1] LOG:  database system is ready
to accept read only connections
Nov 09 12:30:23 mn-0 postgres[11650]: [3-1] LOG:  started streaming WAL from
primary at 0/65000000 on timeline 2
Nov 09 12:30:24 mn-0 postgres[11642]: [20-1] DEBUG:  transaction ID wrap
limit is 2147484065, limited by database with OID 1
Nov 09 12:30:24 mn-0 postgres[11642]: [20-2] CONTEXT:  xlog redo at
0/65400028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/65400028; tli 2; prev tli 2;
fpw true; xid 0:464; oid 16473; multi 1; offset 0; oldest xid 418 in DB 1;
oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest
running xid 0; shutdown
Nov 09 12:30:24 mn-0 postgres[11650]: [4-1] LOG:  replication terminated by
primary server
Nov 09 12:30:24 mn-0 postgres[11650]: [4-2] DETAIL:  End of WAL reached on
timeline 2 at 0/65400098.
Nov 09 12:30:24 mn-0 postgres[11650]: [5-1] err-2:  could not send
end-of-streaming message to primary: no COPY in progress
Nov 09 12:30:24 mn-0 postgres[11650]: [5-2]
Nov 09 12:30:24 mn-0 postgres[11642]: [21-1] LOG:  invalid record length at
0/65400098: wanted 24, got 0
Nov 09 12:30:24 mn-0 postgres[11701]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down
Nov 09 12:30:24 mn-0 postgres[11701]: [3-2]
Nov 09 12:30:26 mn-0 postgres[11642]: [22-1] DEBUG:  invalid record length
at 0/65400098: wanted 24, got 0
Nov 09 12:30:26 mn-0 postgres[11720]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down
Nov 09 12:30:26 mn-0 postgres[11720]: [3-2]
Nov 09 12:30:28 mn-0 postgres[11642]: [23-1] DEBUG:  invalid record length
at 0/65400098: wanted 24, got 0
Nov 09 12:30:28 mn-0 postgres[11739]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down
Nov 09 12:30:28 mn-0 postgres[11739]: [3-2]
Nov 09 12:30:30 mn-0 postgres[11642]: [24-1] DEBUG:  invalid record length
at 0/65400098: wanted 24, got 0
Nov 09 12:30:30 mn-0 postgres[11767]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down
Nov 09 12:30:30 mn-0 postgres[11767]: [3-2]
Nov 09 12:30:32 mn-0 postgres[11642]: [25-1] DEBUG:  invalid record length
at 0/65400098: wanted 24, got 0
Nov 09 12:30:32 mn-0 postgres[11770]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down

Nov 09 12:30:54 mn-0 postgres[11642]: [36-1] DEBUG:  invalid record length
at 0/65400098: wanted 24, got 0
Nov 09 12:30:54 mn-0 postgres[12088]: [3-1] err-2:  could not connect to the
primary server: err-2:  the database system is shutting down
Nov 09 12:30:54 mn-0 postgres[12088]: [3-2]





y39chen <yan-jack.chen@nokia.com> writes:
> We encounter a problem where the PostgreSQL walsender process
> doesn't exit after "pg_ctl stop -m fast".
> We are using PostgreSQL 9.6.2.
There was a fix in 9.6.4 that's at least related to this problem.
It would be interesting to see if you can still reproduce it on
current 9.6.
        regards, tom lane
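
A quick way to confirm which minor version a server is actually running
before retesting; the pg_ctl path is the one from the report above, and local
psql access is assumed:

    psql -At -c 'SHOW server_version;'
    # or, without a client connection, from the installed binaries
    /opt/nokia/libexec/SS_RCPPostgress/bin/pg_ctl --version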



Re: [GENERAL] PostgreSQL walsender process doesn't exit after "pg_ctl stop -m fast"

From: Michael Paquier
On Tue, Nov 14, 2017 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> y39chen <yan-jack.chen@nokia.com> writes:
>> We encounter a problem where the PostgreSQL walsender process
>> doesn't exit after "pg_ctl stop -m fast".
>> We are using PostgreSQL 9.6.2.
>
> There was a fix in 9.6.4 that's at least related to this problem.
> It would be interesting to see if you can still reproduce it on
> current 9.6.

Commit that may matter here:
commit: e9d4aa594f2caa8c28d55c41c9926420b1efdb79
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Fri, 30 Jun 2017 12:00:03 -0400
Fix walsender to exit promptly if client requests shutdown.

It's possible for WalSndWaitForWal to be asked to wait for WAL that doesn't
exist yet.  That's fine, in fact it's the normal situation if we're caught
up; but when the client requests shutdown we should not keep waiting.
The previous coding could wait indefinitely if the source server was idle.

In passing, improve the rather weak comments in this area, and slightly
rearrange some related code for better readability.

Back-patch to 9.4 where this code was introduced.

Discussion: https://postgr.es/m/14154.1498781234@sss.pgh.pa.us
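
As a side note, one way to see which release tags already contain the commit
quoted above, assuming a local checkout of the PostgreSQL git repository:

    # prints every release tag that includes this commit
    git tag --contains e9d4aa594f2caa8c28d55c41c9926420b1efdb79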
-- 
Michael



Thank you for the explanation. We shall try the latest PostgreSQL 9.6.6
version.
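
A minor-version update such as 9.6.2 -> 9.6.6 normally only requires
replacing the binaries and restarting; the data directory stays as it is and
pg_upgrade is not needed. A rough sketch, reusing the paths from this report;
the actual binary installation step depends on the platform's packaging:

    pg_ctl stop -m fast -D /mnt/db/DBTestPostgres/db_data
    # install the 9.6.6 binaries here (package manager or build, platform specific)
    pg_ctl start -D /mnt/db/DBTestPostgres/db_data
    psql -At -c 'SHOW server_version;'   # should now report 9.6.6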


