Hi hackers,
If you shut down a primary server, a standby that is streaming from it says:
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/14F4B68.
FATAL: could not send end-of-streaming message to primary: no COPY in progress
Isn't that FATAL ereport a bug?
I haven't worked out the root cause, but the immediate problem seems to
be that libpqrcv_endstreaming calls PQputCopyEnd, which doesn't like the
state that the libpq connection is in, namely PGASYNC_BUSY. That
state seems to have been established by the call to walrcv_receive
that returned -1 (end of copy). It doesn't happen in the similar case
of promotion of the remote server.
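
To make the suspected sequence concrete, here is a toy model of it in C.
This is only a sketch of my reading of the behaviour, not libpq code: the
ModelConn struct, the MODEL_ASYNC_* states and both functions are
hypothetical stand-ins for PGconn, its async status, walrcv_receive and
PQputCopyEnd, and the real state handling may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the connection's async state; not libpq's enum. */
typedef enum
{
    MODEL_ASYNC_IDLE,
    MODEL_ASYNC_BUSY,
    MODEL_ASYNC_COPY_BOTH
} ModelAsyncStatus;

/* Hypothetical stand-in for PGconn, reduced to what the sketch needs. */
typedef struct
{
    ModelAsyncStatus status;
    const char *error;          /* last error message, or NULL */
} ModelConn;

/*
 * Models the receive call. If the server has ended the copy stream, the
 * connection leaves copy mode (assumed here to become BUSY) and the call
 * returns -1; otherwise it returns 0 for "no data yet".
 */
static int
model_receive(ModelConn *conn, int server_ended_copy)
{
    assert(conn != NULL);
    if (server_ended_copy)
    {
        conn->status = MODEL_ASYNC_BUSY;
        return -1;
    }
    return 0;
}

/*
 * Models sending the end-of-copy message. It refuses to run unless the
 * connection is still in copy mode, which is the failure seen in the log.
 */
static int
model_put_copy_end(ModelConn *conn)
{
    assert(conn != NULL);
    if (conn->status != MODEL_ASYNC_COPY_BOTH)
    {
        conn->error = "no COPY in progress";
        return -1;
    }
    conn->status = MODEL_ASYNC_BUSY;
    return 1;
}
```

In this model, calling model_receive with server_ended_copy set and then
calling model_put_copy_end fails with "no COPY in progress", which matches
the FATAL message above. Whether walreceiver should skip the end-of-copy
message in that state, or whether the state transition itself is wrong, is
the open question.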
How is clean server shutdown supposed to work? It looks like
walsender sends COPY 0 and then just hangs up. Meanwhile, walreceiver
has to distinguish between that case and the new timeline case,
which involves a further exchange of messages. Is an explicit message
at the end of the copy stream saying either "goodbye" or "but wait,
there's more" lacking here? Or is there some other way that
walreceiver could distinguish between clean shutdown of remote server
(no error necessary), unclean shutdown of remote server, and timeline
negotiation?
--
Thomas Munro
http://www.enterprisedb.com