Hi,
On 2014-05-16 16:37:16 -0400, Steve Singer wrote:
> I am finding that my logical walsender connections are being terminated due
> to a timeout on the CREATE REPLICATION SLOT command, with "terminating
> walsender process due to replication timeout"
>
> Below is the stack trace when this happens
>
> #3 0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871)
> at walsender.c:1748
> #4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
> walsender.c:1216
> ...
> #9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
> walsender.c:800
> #10 exec_replication_command () at walsender.c:1291
> #11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
> argv=argv@entry=0x177db50, dbname=0x177db30 "test1",
>
> (gdb) p last_reply_timestamp
> $1 = 0
>
>
> I propose the attached patch, which sets last_reply_timestamp to now() when
> it starts processing a command, since receiving a command is hearing
> something from the client.

Hm. Yes, that's a problem.

> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> new file mode 100644
> index 5c11d68..56a2f10
> *** a/src/backend/replication/walsender.c
> --- b/src/backend/replication/walsender.c
> *************** exec_replication_command(const char *cmd
> *** 1276,1281 ****
> --- 1276,1282 ----
> parse_rc))));
>
> cmd_node = replication_parse_result;
> + last_reply_timestamp = GetCurrentTimestamp();
>
> switch (cmd_node->type)
> {

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there are lots of
long-running transactions). I think checking whether last_reply_timestamp
is 0 during timeout checking, and skipping the timeout in that case, is
more robust.
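
Roughly, the check I have in mind looks like the sketch below. This is a
standalone illustration, not the actual walsender.c code: reply_timed_out(),
timestamp_us and the timeout_ms parameter are hypothetical stand-ins for the
timeout check, TimestampTz and wal_sender_timeout. It assumes that a
last_reply_timestamp of 0 means "no reply received from the client yet".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TimestampTz: microseconds since some epoch. */
typedef int64_t timestamp_us;

/*
 * Decide whether the connection should be terminated for inactivity.
 *
 * last_reply == 0 is treated as "the client has not replied yet" (the
 * state seen in the gdb output above), so no timeout is enforced. This
 * covers commands such as slot creation that can legitimately run
 * longer than the configured timeout.
 */
static bool
reply_timed_out(timestamp_us now, timestamp_us last_reply, int timeout_ms)
{
    /* A timeout of zero (or less) disables the mechanism entirely. */
    if (timeout_ms <= 0)
        return false;

    /* Nothing received yet: there is no reply to measure against. */
    if (last_reply == 0)
        return false;

    /* Convert milliseconds to microseconds before comparing. */
    return now >= last_reply + (timestamp_us) timeout_ms * 1000;
}
```

Unlike resetting the timestamp when a command starts, this does not depend
on the command finishing within the timeout.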
Greetings,
Andres Freund