Apparent race condition in standby promotion - Mailing list pgsql-hackers

From Tom Lane
Subject Apparent race condition in standby promotion
Date
Msg-id 9542.1461607767@sss.pgh.pa.us
Whole thread Raw
Responses Re: Apparent race condition in standby promotion  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
I'm looking at buildfarm member tern's recent failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08

The perl script thinks it's started the postmaster and sent a promote
signal, then it's waited 90 seconds for the postmaster to come out of
standby:

### Starting node "standby"
# Running: pg_ctl -w -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l
/home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.logstart
 
waiting for server to start.... done
server started
# Postmaster PID for node "standby" is 14025120
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tbl1 values ('in
master,before promotion')
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO trunc_tbl values ('in
master,before promotion')
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tail_tbl SELECT g, 'in
master,before promotion: ' || g FROM generate_series(1, 10000) g
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c CHECKPOINT
### Promoting node "standby"
# Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l
/home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.logpromote
 
server promoting
# 
Timed out while waiting for promotion of standby at RewindTest.pm line 166.

But look at what's in the standby's log:

LOG:  database system was interrupted; last known up at 2016-04-25 01:58:39 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/2000C00
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
LOG:  statement: SELECT NOT pg_is_in_recovery()
... 88 occurrences of that removed ...
LOG:  statement: SELECT NOT pg_is_in_recovery()
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command

The standby server didn't notice the promote request until AFTER
the perl script had probed 90 times, once a second, for promotion
to finish.  What's up with that?

One theory is that the postmaster handles promotion requests improperly.
Its only consideration of the problem is in sigusr1_handler:
   if (CheckPromoteSignal() && StartupPID != 0 &&       (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
pmState== PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))   {       /* Tell startup process to finish recovery */
signal_child(StartupPID,SIGUSR2);   }
 

This means that if pg_ctl sends a SIGUSR1 after creating a promotion
trigger file, and the system state is not such that the signal can be
forwarded on to the startup process at that very instant, the signal is
simply forgotten.  We will reconsider sending it to the startup process
only when the postmaster next gets a SIGUSR1, which might not be for a
long time.  Now, I do not immediately see a way for this to happen: the
postmaster should create the startup process before it first enables
interrupts, and that list of states seems to cover every possibility.
But it seems fragile as heck.  Maybe we need to check for promotion in
PostmasterStateMachine or related places, not in (or not just in) the
signal handler.

The other theory is that the startup process received the SIGUSR2
but failed to act on it for a long time.  It checks for that only
in CheckForStandbyTrigger(), and the calls to that function are
in assorted rather unprincipled places, which makes it hard to
convince oneself that such a call will always happen soon.
This example sure looks like no such call happened.

We've seen previous examples of the same thing, eg
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
Noah raised the test's timeout from 30 seconds to 90 in response
to that failure, but it looks to me like this was a mis-diagnosis
of the problem.  What I suspect is happening is that if the signal
arrives at the startup process at the right time, probably around
the time it's opening up the streaming connection from the primary,
it doesn't get handled promptly.

I'm not planning to pursue this further, but someone who's more
familiar with the WAL-receiving logic in startup.c ought to.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [BUGS] Breakage with VACUUM ANALYSE + partitions
Next
From: Andres Freund
Date:
Subject: Re: Suspicious behaviour on applying XLOG_HEAP2_VISIBLE.