Re: [HACKERS] Another reason why the recovery tests take a long time - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Another reason why the recovery tests take a long time
Date
Msg-id 20170626164659.ea43donv7tqmwl5a@alap3.anarazel.de
Whole thread Raw
In response to [HACKERS] Another reason why the recovery tests take a long time  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Another reason why the recovery tests take a long time
List pgsql-hackers
Hi,


On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs.  It goes like this:
> 
> * While streaming from master to slave, test script shuts down master
> while slave is left running.  We soon restart the master, but meanwhile:
> 
> * slave's walreceiver process fails, reporting
> 
> 2017-06-26 16:06:50.209 UTC [13209] LOG:  replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL:  End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
> 
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver.  In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before postmastre gets the message.  I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.


> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one.  But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
> 
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.  But I wonder whether it's intentional that the old
> walreceiver dies in the first place.  That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress), but
exiting seems quite reasonable.  Otherwise we'd have add a separate
retry logic into the walsender, that reconnects without a new walsender
being started.

- Andres



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: [HACKERS] GSoC 2017: Foreign Key Arrays
Next
From: nb
Date:
Subject: [HACKERS] pg_basebackup fails on Windows when using tablespace mapping