Hi,
On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs. It goes like this:
>
> * While streaming from master to slave, test script shuts down master
> while slave is left running. We soon restart the master, but meanwhile:
>
> * slave's walreceiver process fails, reporting
>
> 2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
>
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
>
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver. In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.
Yuck.
I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before postmastre gets the message. I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.
> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one. But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
>
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver. But I wonder whether it's intentional that the old
> walreceiver dies in the first place. That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.
It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress), but
exiting seems quite reasonable. Otherwise we'd have add a separate
retry logic into the walsender, that reconnects without a new walsender
being started.
- Andres