[HACKERS] Another reason why the recovery tests take a long time

From: Tom Lane
I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs.  It goes like this:

* While streaming from master to slave, test script shuts down master
while slave is left running.  We soon restart the master, but meanwhile:

* slave's walreceiver process fails, reporting

2017-06-26 16:06:50.209 UTC [13209] LOG:  replication terminated by primary server
2017-06-26 16:06:50.209 UTC [13209] DETAIL:  End of WAL reached on timeline 1 at 0/3000098.
2017-06-26 16:06:50.209 UTC [13209] FATAL:  could not send end-of-streaming message to primary: no COPY in progress

* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one

* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver.  In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.

* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one.  But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).

So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver.  But I wonder whether it's intentional that the old
walreceiver dies in the first place.  That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
        regards, tom lane



Re: [HACKERS] Another reason why the recovery tests take a long time

From: Andres Freund
Hi,


On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs.  It goes like this:
> 
> * While streaming from master to slave, test script shuts down master
> while slave is left running.  We soon restart the master, but meanwhile:
> 
> * slave's walreceiver process fails, reporting
> 
> 2017-06-26 16:06:50.209 UTC [13209] LOG:  replication terminated by primary server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL:  End of WAL reached on timeline 1 at 0/3000098.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
> 
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver.  In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. at the
moment I can frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before the postmaster gets the message.  I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.


> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one.  But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
> 
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.  But I wonder whether it's intentional that the old
> walreceiver dies in the first place.  That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable.  Otherwise we'd have to add separate
retry logic into the walreceiver, so that it reconnects without a new
walreceiver being started.

- Andres



Re: [HACKERS] Another reason why the recovery tests take a long time

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
>> ... But I wonder whether it's intentional that the old
>> walreceiver dies in the first place.  That FATAL exit looks suspiciously
>> like it wasn't originally-designed-in behavior.

> It's quite intentional afaik - I've complained about the bad error
> message recently (we really shouldn't say "no COPY in progress"), but
> exiting seems quite reasonable.  Otherwise we'd have to add separate
> retry logic into the walreceiver, so that it reconnects without a new
> walreceiver being started.

Ah, I see.  I'd misinterpreted the purpose of the infinite loop in
WalReceiverMain() --- now I see that's for receiving requests from the
startup proc for different parts of the WAL stream, not for reconnecting
to the master.
        regards, tom lane



Re: [HACKERS] Another reason why the recovery tests take a long time

From: Andres Freund
On 2017-06-26 13:42:52 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> >> ... But I wonder whether it's intentional that the old
> >> walreceiver dies in the first place.  That FATAL exit looks suspiciously
> >> like it wasn't originally-designed-in behavior.
> 
> > It's quite intentional afaik - I've complained about the bad error
> > message recently (we really shouldn't say "no COPY in progress"), but
> > exiting seems quite reasonable.  Otherwise we'd have to add separate
> > retry logic into the walreceiver, so that it reconnects without a new
> > walreceiver being started.
> 
> Ah, I see.  I'd misinterpreted the purpose of the infinite loop in
> WalReceiverMain() --- now I see that's for receiving requests from the
> startup proc for different parts of the WAL stream, not for reconnecting
> to the master.

Right.  And if the connection fails, we intentionally (whether that's
good or bad is another question) switch to restore_command (or
pg_xlog...) based recovery, in which case we certainly do not want the
walreceiver around.

- Andres



Re: [HACKERS] Another reason why the recovery tests take a long time

From: Tom Lane
I wrote:
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.

Concretely, I propose the attached patch.  Together with reducing
wal_retrieve_retry_interval to 500ms, which I propose having
PostgresNode::init do in its standard postgresql.conf adjustments,
this takes the runtime of the recovery TAP tests down from 2m50s
(after the patches I posted yesterday) to 1m30s.
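[Editor's note: the proposed PostgresNode::init adjustment would amount to appending a line like the following to each test node's postgresql.conf; shown here as an assumed illustration, since the thread names only the setting and value.]

```ini
# Retry a failed WAL-stream connection after 500ms instead of the
# default 5s, so test standbys reconnect quickly.
wal_retrieve_retry_interval = 500ms
```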

I think there's still gold to be mined, because "top" is still
showing pretty low CPU load over most of the run, but this is
lots better than 4m30s.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 83e99b7..6c79c36 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** static volatile sig_atomic_t start_autov
*** 357,362 ****
--- 357,365 ----
  /* the launcher needs to be signalled to communicate some condition */
  static volatile bool avlauncher_needs_signal = false;

+ /* received START_WALRECEIVER signal */
+ static volatile sig_atomic_t WalReceiverRequested = false;
+
  /* set when there's a worker that needs to be started up */
  static volatile bool StartWorkerNeeded = true;
  static volatile bool HaveCrashedWorker = false;
*************** static void maybe_start_bgworkers(void);
*** 426,431 ****
--- 429,435 ----
  static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
  static pid_t StartChildProcess(AuxProcType type);
  static void StartAutovacuumWorker(void);
+ static void MaybeStartWalReceiver(void);
  static void InitPostmasterDeathWatchHandle(void);

  /*
*************** ServerLoop(void)
*** 1810,1815 ****
--- 1814,1823 ----
                  kill(AutoVacPID, SIGUSR2);
          }

+         /* If we need to start a WAL receiver, try to do that now */
+         if (WalReceiverRequested)
+             MaybeStartWalReceiver();
+
          /* Get other worker processes running, if needed */
          if (StartWorkerNeeded || HaveCrashedWorker)
              maybe_start_bgworkers();
*************** reaper(SIGNAL_ARGS)
*** 2958,2964 ****
          /*
           * Was it the wal receiver?  If exit status is zero (normal) or one
           * (FATAL exit), we assume everything is all right just like normal
!          * backends.
           */
          if (pid == WalReceiverPID)
          {
--- 2966,2973 ----
          /*
           * Was it the wal receiver?  If exit status is zero (normal) or one
           * (FATAL exit), we assume everything is all right just like normal
!          * backends.  (If we need a new wal receiver, we'll start one at the
!          * next iteration of the postmaster's main loop.)
           */
          if (pid == WalReceiverPID)
          {
*************** sigusr1_handler(SIGNAL_ARGS)
*** 5066,5079 ****
          StartAutovacuumWorker();
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
!         WalReceiverPID == 0 &&
!         (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
!          pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
!         Shutdown == NoShutdown)
      {
          /* Startup Process wants us to start the walreceiver process. */
!         WalReceiverPID = StartWalReceiver();
      }

      if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
--- 5075,5086 ----
          StartAutovacuumWorker();
      }

!     if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
      {
          /* Startup Process wants us to start the walreceiver process. */
!         /* Start immediately if possible, else remember request for later. */
!         WalReceiverRequested = true;
!         MaybeStartWalReceiver();
      }

      if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
*************** StartAutovacuumWorker(void)
*** 5410,5415 ****
--- 5417,5440 ----
  }

  /*
+  * MaybeStartWalReceiver
+  *        Start the WAL receiver process, if requested and our state allows.
+  */
+ static void
+ MaybeStartWalReceiver(void)
+ {
+     if (WalReceiverPID == 0 &&
+         (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+          pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
+         Shutdown == NoShutdown)
+     {
+         WalReceiverPID = StartWalReceiver();
+         WalReceiverRequested = false;
+     }
+ }
+
+
+ /*
   * Create the opts file
   */
  static bool

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Another reason why the recovery tests take a long time

From: Simon Riggs
On 26 June 2017 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> So this looks like a pretty obvious race condition in the postmaster,
>> which should be resolved by having it set a flag on receipt of
>> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
>> new walreceiver.
>
> Concretely, I propose the attached patch.  Together with reducing
> wal_retrieve_retry_interval to 500ms, which I propose having
> PostgresNode::init do in its standard postgresql.conf adjustments,
> this takes the runtime of the recovery TAP tests down from 2m50s
> (after the patches I posted yesterday) to 1m30s.

Patch looks good

> I think there's still gold to be mined, because "top" is still
> showing pretty low CPU load over most of the run, but this is
> lots better than 4m30s.

Thanks for looking into this

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services