Re: streaming replication master can fail to shut down - Mailing list pgsql-bugs

From Nick Cleaton
Subject Re: streaming replication master can fail to shut down
Msg-id CAFgz3ku0_B8g56kJ+NWQZsqcbP-+DKgAGH9WTjmUQT2BFMG2jQ@mail.gmail.com
In response to Re: streaming replication master can fail to shut down  (Andres Freund <andres@anarazel.de>)
Responses Re: streaming replication master can fail to shut down
List pgsql-bugs
On 29 April 2016 at 04:38, Andres Freund <andres@anarazel.de> wrote:

>> > I guess you have a fair amount of WAL traffic, and the receiver was
>> > behind a good bit?
>>
>> No, IIRC this was on the test cluster that I installed for the purpose
>> of replicating the problem under 9.5; it was essentially idle.
>
> The reason I'm asking is that I so far can't really replicate the issue
> so far. It's pretty clear that waiting_for_ping_response = true; is
> needed, but I'm suspicious that that's not all.
>
> Was your standby on a separate machine?

Yes, I've only seen it happen when the standby was on a machine with
slower CPU cores than the primary. All my attempts to replicate it on
a single machine by trying to slow down the wal receiver have failed.
I'm fairly convinced it's some sort of race that depends on wal sender
+ network being faster than wal receiver.

> What kind of latency?

1G switches.

root@XXX:~# ping XXX
PING XXX (XXX) 56(84) bytes of data.
64 bytes from XXX: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from XXX: icmp_seq=2 ttl=64 time=0.223 ms
64 bytes from XXX: icmp_seq=3 ttl=64 time=0.122 ms
64 bytes from XXX: icmp_seq=4 ttl=64 time=0.126 ms
64 bytes from XXX: icmp_seq=5 ttl=64 time=0.149 ms
