Re: streaming replication master can fail to shut down - Mailing list pgsql-bugs

From Andres Freund
Subject Re: streaming replication master can fail to shut down
Msg-id 20160429183332.5tiaz2ccu36uqjee@alap3.anarazel.de
In response to Re: streaming replication master can fail to shut down  (Nick Cleaton <nick@cleaton.net>)
List pgsql-bugs
Hi,

I pushed a fix for this to 9.4, 9.5 and master yesterday. I'm not
convinced it's all that needs to be fixed, particularly for Magnus'
report.

On 2016-04-29 08:05:51 +0100, Nick Cleaton wrote:
> On 29 April 2016 at 04:38, Andres Freund <andres@anarazel.de> wrote:
>
> >> > I guess you have a fair amount of WAL traffic, and the receiver was
> >> > behind a good bit?
> >>
> >> No, IIRC this was on the test cluster that I installed for the purpose
> >> of replicating the problem under 9.5; it was essentially idle.
> >
> > The reason I'm asking is that I can't really replicate the issue
> > so far. It's pretty clear that waiting_for_ping_response = true; is
> > needed, but I'm suspicious that that's not all.
> >
> > Was your standby on a separate machine?
>
> Yes, I've only seen it happen when the standby was on a machine with
> slower CPU cores than the primary. All my attempts to replicate it on
> a single machine by trying to slow down the wal receiver have failed.
> I'm fairly convinced it's some sort of race that depends on wal sender
> + network being faster than wal receiver.

Yes, that's kind of what I'm expecting. You'll only hit that branch if
there's outstanding data to be replicated, but the message has already
been handed to the OS (!pq_is_send_pending()). Locally that's just a
small data volume, but over an actual network on a longer-lived
connection that can be a lot more.
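
To make the condition above concrete, here is a minimal sketch in C. It
is a simplified model, not the PostgreSQL source: SenderModel,
shutdown_step and all field names are hypothetical, and the logic only
reflects one reading of this thread. It shows a sender whose output
buffer is empty (everything handed to the OS) while the standby has not
yet confirmed the data, and why the waiting flag has to be set after a
reply is requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a WAL sender during shutdown. */
typedef struct SenderModel
{
    uint64_t sent_ptr;        /* last WAL position handed to the socket layer */
    uint64_t replicated_ptr;  /* last WAL position confirmed by the standby */
    bool     send_pending;    /* bytes still queued in the sender's own buffer */
    bool     waiting_for_ping_response; /* reply requested, none seen yet */
    int      keepalives_sent; /* counts reply requests, for illustration only */
} SenderModel;

typedef enum { SHUTDOWN_EXIT, SHUTDOWN_WAIT } ShutdownStep;

/* One pass of the (assumed) shutdown check. */
ShutdownStep
shutdown_step(SenderModel *s)
{
    assert(s != NULL);

    /* Fully caught up and nothing buffered: safe to exit. */
    if (s->sent_ptr == s->replicated_ptr && !s->send_pending)
        return SHUTDOWN_EXIT;

    /*
     * Not caught up. With send_pending == false this is the state described
     * above: the data has left our buffer but is still in flight or unread.
     * Request a reply once, and remember that we did.
     */
    if (!s->waiting_for_ping_response)
    {
        s->keepalives_sent++;
        s->waiting_for_ping_response = true;
    }
    return SHUTDOWN_WAIT;
}
```

In this model, leaving out the assignment to waiting_for_ping_response
would make every pass request another reply while the standby is still
catching up. Whether that matches the real failure in full is exactly
what remains open in this thread.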

Andres