Re: streaming replication master can fail to shut down - Mailing list pgsql-bugs

From Nick Cleaton
Subject Re: streaming replication master can fail to shut down
Date
Msg-id CAFgz3kv_6YA2FNXr=Eho6qhKY64KkVH8oDnhakjHHPM8VkA=7w@mail.gmail.com
Whole thread Raw
In response to streaming replication master can fail to shut down  (Nick Cleaton <nick@cleaton.net>)
Responses Re: streaming replication master can fail to shut down
List pgsql-bugs
On 9 March 2016 at 16:17, Nick Cleaton <nick@cleaton.net> wrote:
> I've found that the master server does not finish shutting down in
> response to SIGINT to the postmaster when a streaming replica is
> connected, if and only if all of the following hold:
>
> * The master server is a specific new bit of kit, which has slightly
> higher performance CPUs than the replica and is much more idle.
> * The master is configured with archive_mode=on and a non-empty archive command.
> * I do not have "strace -p ..." running on the master against the wal
> sender servicing the slow replica.

To debug the replication conversation during a failing master shutdown
with my 9.5.1 test installation, I've now also set ssl=false in the
server config and then:

* start a tcpdump
* start the master
* start the replica
* sleep 60
* kill -INT the postmaster on the master

A libpcap dump of the first 100k packets exchanged is at
http://nick.cleaton.net/protodump-100k-nossl.xz (800k compressed)

Highlights from pgshark: at master shutdown time there's:

Packet: t=1457611872.617496, session=17578864249730
PGSQL: type=XLogData, B -> F
XLOG DATA len=2896

... and everything following that is a storm of:

Packet: t=1457611872.761098, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

Packet: t=1457611872.761221, session=17578864249730
PGSQL: type=StandbyStatusUpdate, F -> B
STANDBY STATUS UPDATE

Packet: t=1457611872.761363, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

There were a total of 25486 StandbyStatusUpdate messages and 64769
PrimaryKeepalive messages during the 0.6 seconds between the start of
master shutdown and tcpdump hitting the 100k packet limit.


Relevant top and netstat output about 10 minutes into the blocked shutdown:

          PID  USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM
TIME+ COMMAND
master:  6167  postgres  20   0  251632   3020   1372 S  74.7  0.0
0:10.64 postgres: wal sender process postgres 10.122.82.98(49730)
streaming 0/11000098
replica: 28211 postgres  20   0  263100   4580   1964 R 100.0  0.0
0:21.26 postgres: wal receiver process   streaming 0/11000000

         Proto Recv-Q  Send-Q Local Address           Foreign Address
       State
master:  tcp        0 2553544 10.122.83.27:5432
10.122.82.98:49730      ESTABLISHED
replica: tcp  4284291     390 10.122.82.98:49730
10.122.83.27:5432       ESTABLISHED

pgsql-bugs by date:

Previous
From: Oliver Seemann
Date:
Subject: Undetected deadlock with pg_dump on hot_standby server
Next
From: Teodor Sigaev
Date:
Subject: Re: BUG #13440: unaccent does not remove all diacritics