On Thu, Mar 31, 2016 at 9:15 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi hackers,
>
> If you shut down a primary server, a standby that is streaming from it says54:
>
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1 at 0/14F4B68.
> FATAL: could not send end-of-streaming message to primary: no COPY in progress
>
> Isn't that FATAL ereport a bug?
ISTM that the cause is that walsender exits and replication connection is
closed just after "COPY 0" is sent. That is, then after receiving "COPY 0",
walreceiver tries to send an end-of-copy message to the primary, but fails
because the connection has been already closed.
> How is clean server shutdown supposed to work?
One option is to make walsender wait for end-of-copy message from walreceiver
before it closes the connection and exits, after sending "COPY 0" message.
But one question is; how should walsender behave when walreceiver gets stuck
and cannot reply an end-of-copy message to walsender? Probably we need
the timeout (maybe we can use wal_sender_timeout here but not sure yet
if it's appropriate or not).
Another option is to prevent walreceiver from sending an end-of-copy message.
If "COPY 0" always means the exit of walsender and the termination of
the connection, there seems to be no need to send back an end-of-copy message.
I've not checked yet how this interferes with other replication logics, though.
Regards,
--
Fujii Masao