Re: Logical decoding and walsender timeouts - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: Logical decoding and walsender timeouts
Msg-id: CAMsr+YENnRzC+1qRQ373wu2A-GhmxjN20sGm4=ZYSFDUJzrRTw@mail.gmail.com
In response to: Re: Logical decoding and walsender timeouts (Andres Freund <andres@anarazel.de>)
Responses: Re: Logical decoding and walsender timeouts (Vladimir Gordiychuk <folyga@gmail.com>)
List: pgsql-hackers
On 31 October 2016 at 16:52, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
>> TL;DR: Logical decoding clients need to generate their own keepalives
>> and not rely on the server requesting them to prevent timeouts. Or
>> admins should raise the wal_sender_timeout by a LOT when using logical
>> decoding on DBs with any big rows.
>
> Unconvinced.

Yeah. I've seen enough issues in the wild where we keep timing out and
restarting over and over until we increase wal_sender_timeout to know
there's _something_ going on. I'm less sure I'm right about what it is
or how to solve it.

>> When sending a big message, WalSndWriteData() notices that it's
>> approaching timeout and tries to send a keepalive request, but the
>> request just gets buffered behind the remaining output plugin data and
>> isn't seen by the client until the client has received the rest of the
>> pending data.
>
> Only for individual messages, not the entire transaction though.

Right.  I initially thought it was the whole tx, but I was mistaken as
I'd failed to notice that WalSndWriteData() queues a keepalive
request.
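
To make the ordering concrete, the flow is roughly as below. This is a
simplified paraphrase of WalSndWriteData() from memory, not the actual
source, but it shows why the keepalive request can't help for a single
big message: it lands in the same output buffer, behind the data.

/* Simplified sketch of WalSndWriteData(), paraphrased from memory */
static void
WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                TransactionId xid, bool last_write)
{
    /* Queue the whole output plugin message as one CopyData packet */
    pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

    for (;;)
    {
        TimestampTz now;

        /* Push whatever we can to the client without blocking */
        if (pq_flush_if_writable() != 0)
            WalSndShutdown();

        if (!pq_is_send_pending())
            break;              /* everything sent, done */

        now = GetCurrentTimestamp();

        /* die if wal_sender_timeout expired without a client reply */
        WalSndCheckTimeOut(now);

        /*
         * Request a status update. But this keepalive is queued *behind*
         * the big 'd' message above, so the client can't see it, let
         * alone reply to it, until it has drained the rest of that
         * message.
         */
        WalSndKeepaliveIfNecessary(now);

        /* sleep on socket write / latch, then loop */
    }
}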

> Are
> you sure the problem at hand is that we're sending a keepalive, but it's
> too late?

No, I'm not sure. I'm trying to identify the cause of an issue I've
seen in the wild, but never under conditions where it's been possible
to sit around and debug in a leisurely manner.

I'm trying to set up a TAP test to demonstrate that this happens, but
I don't think it's going to work without some way to simulate limited
network bandwidth or added latency. A local unix socket is just too
fast relative to Pg's maximum row sizes.

> It might very well be that the actual issue is that we're
> never sending keepalives, because the network is fast enough / the tcp
> window is large enough.  IIRC we only send a keepalive if we're blocked
> on network IO?

Mm, that's a good point. That might better explain the issues I've
seen in the wild, since I never found strong evidence that individual
big rows were involved, but I hadn't been able to come up with any
other explanation yet.
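
If memory serves, that matches the shape of the code: WalSndWriteData()
returns early when nothing is left pending, something like the fragment
below (again from memory, not the exact source), so the loop that
requests keepalives is only ever reached when we're actually blocked on
the socket.

    /*
     * After queueing the message: if the kernel buffer / TCP window
     * swallowed it all, return at once. We never reach the flush loop
     * that calls WalSndKeepaliveIfNecessary(), so on a fast enough
     * network no keepalive gets requested at all.
     */
    if (!pq_is_send_pending())
        return;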

>> So: We could ask output plugins to deal with this for us, by chunking
>> up their data in small pieces and calling OutputPluginPrepareWrite()
>> and OutputPluginWrite() more than once per output plugin callback if
>> they expect to send a big message. But this pushes the complexity of
>> splitting up and handling big rows, and big Datums, onto each plugin.
>> It's awkward to do well and hard to avoid splitting things up
>> unnecessarily.
>
> There's decent reason for doing that independently though, namely that
> it's a lot more efficient from a memory management POV.

Definitely. Though you're always going to be tossing around ridiculous
chunks of memory when dealing with big external compressed toasted
data, unless there are ways to access that progressively that I'm
unaware of. Hopefully there are.

I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
in-core logical rep, in some later version, can write a field as 'to
follow', like we currently mark unchanged toasted datums separately.
Then send it chunked, after the main row, in follow-up messages. That
way we keep processing keepalives, we don't allocate preposterous
amounts of memory, etc.
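
Something along these lines, as a strawman. Everything here other than
OutputPluginPrepareWrite()/OutputPluginWrite() and ctx->out is invented
for illustration -- the message tags, the chunk size, the helper name --
but it shows the shape: the oversized value goes out as many small
writes, so the walsender gets regular chances to flush and to notice
keepalive replies, and the plugin never has to build the whole thing in
one buffer.

#include "postgres.h"
#include "lib/stringinfo.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

#define CHUNK_SIZE (64 * 1024)              /* arbitrary, for illustration */

/* Hypothetical helper: emit a row whose big field is sent "to follow" */
static void
emit_change_chunked(LogicalDecodingContext *ctx, const char *bigval, Size len)
{
    Size        off;

    /* Main row message, with the big field just marked as "to follow" */
    OutputPluginPrepareWrite(ctx, false);
    appendStringInfoChar(ctx->out, 'R');    /* row message, body elided */
    OutputPluginWrite(ctx, false);

    /* Follow-up messages, one per chunk of the oversized value */
    for (off = 0; off < len; off += CHUNK_SIZE)
    {
        Size    thislen = Min(CHUNK_SIZE, len - off);
        bool    last = (off + thislen >= len);

        OutputPluginPrepareWrite(ctx, last);
        appendStringInfoChar(ctx->out, 'C');    /* chunk message */
        appendBinaryStringInfo(ctx->out, bigval + off, thislen);
        OutputPluginWrite(ctx, last);
    }
}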

> I don't think the "unrequested keepalive" approach really solves the
> problem on a fundamental enough level.

Fair. It feels a bit like flailing in the dark, too.

>> (A separate issue is that we can also time out when doing logical
>> _replication_ if the downstream side blocks on a lock, since it's not
>> safe to send on a socket from a signal handler ... )
>
> That's strictly speaking not true. write() / sendmsg() are signal safe
> functions.  There's good reasons not to do that however, namely that the
> non signal handler code might be busy writing data itself.

Huh, OK. Since in pglogical/bdr, and as far as I can tell in core
logical rep, we don't send anything on the socket while we're calling
into heap access, the executor, etc., that'd actually be an option. We
could possibly safeguard it with a volatile "socket busy" flag, since
we don't do much sending anyway. But I'd need to do my reading on
signal handler safety etc. Still, it's good to know it's not completely
absurd to do this if the issue comes up.
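
Very rough sketch of the guard idea, just to record it; none of this
exists today, the handler and function names are made up, and the
keepalive payload is elided. sig_atomic_t makes the flag safe to read
from the handler, and write() itself is on the async-signal-safe list.

#include <signal.h>
#include <unistd.h>

/* Set while the normal code path is in the middle of a send */
static volatile sig_atomic_t socket_busy = 0;

static void
timeout_handler(int signum)
{
    /* Only touch the socket if the mainline isn't mid-send */
    if (!socket_busy)
    {
        /* write() a pre-built keepalive message directly here */
    }
}

static void
send_message(int sock, const char *buf, size_t len)
{
    socket_busy = 1;
    /* normal, non-signal-handler send path goes here */
    socket_busy = 0;
}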

Thanks very much for the input. I saw your post with proposed changes.
Once I can get the issue simulated reliably I'll test the patch and
see if it solves it, but it looks like it's sensible to just apply it
anyway TBH.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


