Thread: Race conditions with checkpointer and shutdown

Race conditions with checkpointer and shutdown

From
Michael Paquier
Date:
Hi all,

This is a continuation of the following thread, but I prefer spawning
a new thread for clarity:
https://www.postgresql.org/message-id/20190416064512.GJ2673@paquier.xyz

The buildfarm has reported two similar failures when shutting down a
node:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

In both cases, the instance fails to shut down: the test times out
waiting for the shutdown checkpoint to finish, but I suspect that this
checkpoint actually never happens.

The first case involves piculet, which runs with --disable-atomics and
gcc 6, on the recovery test 016_min_consistency, where we trigger a
checkpoint and then issue a fast shutdown on a standby.  At this point
the test waits forever.

The second case involves dragonet, which uses clang with JIT enabled.
The failure is in test 009_twophase.pl, after the step preparing
transaction xact_009_11, where a *standby* gets restarted.  Again, the
test waits forever for the instance to shut down.

The most recent commits touching checkpoints are 0dfe3d0e and
c6c9474a, which map roughly to the point where the failures began to
happen, so my suspicion is that something related to clean shutdowns
of standbys has broken since then.

Thanks,
--
Michael


Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

> In both cases, the instance cannot shut down because it times out,
> waiting for the shutdown checkpoint to finish but I suspect that this
> checkpoint actually never happens.

Hmm, I don't think that that is actually where the problem is.  In
piculet's failure, the test script times out waiting for a "fast"
shutdown of the standby server, and what we see in the standby's log is

2019-03-23 22:44:12.181 UTC [9731] LOG:  received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG:  aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL:  terminating walreceiver process due to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG:  received immediate shutdown request

where the last line indicates that the test script lost patience and
issued an immediate shutdown.  However, in a successful run of the
test, the log looks like

2019-03-24 03:33:25.592 UTC [23816] LOG:  received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG:  aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL:  terminating walreceiver process due to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG:  shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG:  database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG:  starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit

where the last line reflects restarting the server for the next test step.
So in the failure case we don't see the "shutting down" message, which
means we never got to ShutdownXLOG, so no checkpoint request was made.
Even if we had got to ShutdownXLOG, the process just executes the
operation directly; it does not send a signal asking some other process
to do the checkpoint.  So it's hard to see how either of the commits
you mention could be involved.

I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.

A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process.  If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.
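
To make that hazard concrete, here is a minimal sketch of the pattern
(illustrative only: DoCheckpointWork() and timeout_ms are made-up
stand-ins, not the actual checkpointer code, and the usual latch/signal
headers are assumed):

/*
 * Assume a SIGUSR2 handler that sets shutdown_requested and calls
 * SetLatch(MyLatch).
 */
static volatile sig_atomic_t shutdown_requested = false;

static void
SketchedMainLoop(void)
{
    long        timeout_ms = 300 * 1000L;   /* stand-in for checkpoint_timeout */

    for (;;)
    {
        ResetLatch(MyLatch);

        if (shutdown_requested)
            break;              /* run the shutdown checkpoint and exit */

        /*
         * Suppose the work phase internally waits on and resets MyLatch
         * (say, via a condition variable).  If the shutdown signal arrives
         * in that window, the handler sets the flag and the latch, but the
         * nested ResetLatch() clears the latch again.
         */
        DoCheckpointWork();

        /*
         * shutdown_requested may now be true while the latch is clear, so
         * this sleeps for the whole timeout instead of returning at once.
         */
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, timeout_ms, 0);
    }
}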

Now, c6c9474aa did not break this, because the latch resets that
it added happen in other processes not the checkpointer.  But I'm
feeling suspicious that some other change we made recently might've
borked it.  And in general, it seems like we've managed to load a
lot of potentially conflicting roles onto process latches.

Do we need to think harder about establishing rules for multiplexed
use of the process latch?  I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch.  Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
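
For illustration, an inner-level wait that followed such a rule might
look roughly like this (a sketch with made-up names; WaitForMyCondition()
and my_condition_satisfied() don't exist anywhere):

static void
WaitForMyCondition(void)
{
    while (!my_condition_satisfied())
    {
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1L, 0);
        ResetLatch(MyLatch);
    }

    /*
     * The latch may have been set for some reason other than our condition
     * (for instance, a pending shutdown request aimed at the outer loop).
     * Since we can't tell, leave it set, so that the outer loop's next
     * WaitLatch() returns immediately and its flags get rechecked.
     */
    SetLatch(MyLatch);
}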

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Robert Haas
Date:
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we need to think harder about establishing rules for multiplexed
> use of the process latch?  I'm imagining some rule like "if you are
> not the outermost event loop of a process, you do not get to
> summarily clear MyLatch.  Make sure to leave it set after waiting,
> if there was any possibility that it was set by something other than
> the specific event you're concerned with".

Hmm, yeah.  If the latch is left set, then the outer loop will just go
through an extra and unnecessary iteration, which seems fine.  If the
latch is left clear, then the outer loop might miss a wakeup intended
for it and hang forever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Race conditions with checkpointer and shutdown

From
Andres Freund
Date:
Hi,

On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do we need to think harder about establishing rules for multiplexed
> > use of the process latch?  I'm imagining some rule like "if you are
> > not the outermost event loop of a process, you do not get to
> > summarily clear MyLatch.  Make sure to leave it set after waiting,
> > if there was any possibility that it was set by something other than
> > the specific event you're concerned with".
>
> Hmm, yeah.  If the latch is left set, then the outer loop will just go
> through an extra and unnecessary iteration, which seems fine.  If the
> latch is left clear, then the outer loop might miss a wakeup intended
> for it and hang forever.

Arguably that's a sign that the latch using code in the outer loop(s) isn't
written correctly? If you do it as:

while (true)
{
    CHECK_FOR_INTERRUPTS();

    ResetLatch(MyLatch);

    if (work_needed)
    {
        Plenty();
        Code();
        Using(MyLatch);
    }
    else
    {
        WaitLatch(MyLatch);
    }
}

I think that's not a danger? I think the problem really is that we
suggest doing that WaitLatch() unconditionally:

 * The correct pattern to wait for event(s) is:
 *
 * for (;;)
 * {
 *       ResetLatch();
 *       if (work to do)
 *           Do Stuff();
 *       WaitLatch();
 * }
 *
 * It's important to reset the latch *before* checking if there's work to
 * do. Otherwise, if someone sets the latch between the check and the
 * ResetLatch call, you will miss it and Wait will incorrectly block.
 *
 * Another valid coding pattern looks like:
 *
 * for (;;)
 * {
 *       if (work to do)
 *           Do Stuff(); // in particular, exit loop if some condition satisfied
 *       WaitLatch();
 *       ResetLatch();
 * }

Obviously there's the issue that a lot of latch-using code isn't written
that way - but I also don't think there's that much latch-using outer-loop
code whose work phase then also uses the latch internally.  Seems like we
could fix that.  While this approach obviously has the danger of not being
followed, so does the 'always-set-latch-unless-outermost-loop' approach.

I'm not sure I like the idea of incurring another unnecessary SetLatch()
call for most latch using places.

I guess there's a bit bigger danger of taking longer to notice
postmaster-death. But I'm not sure I can quite see that being
problematic - seems like all we should incur is another cycle through
the loop, as the latch shouldn't be set anymore.

Greetings,

Andres Freund



Re: Race conditions with checkpointer and shutdown

From
Andres Freund
Date:
Hi,

On 2019-04-16 17:05:36 -0700, Andres Freund wrote:
> On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Do we need to think harder about establishing rules for multiplexed
> > > use of the process latch?  I'm imagining some rule like "if you are
> > > not the outermost event loop of a process, you do not get to
> > > summarily clear MyLatch.  Make sure to leave it set after waiting,
> > > if there was any possibility that it was set by something other than
> > > the specific event you're concerned with".
> >
> > Hmm, yeah.  If the latch is left set, then the outer loop will just go
> > through an extra and unnecessary iteration, which seems fine.  If the
> > latch is left clear, then the outer loop might miss a wakeup intended
> > for it and hang forever.
> 
> Arguably that's a sign that the latch using code in the outer loop(s) isn't
> written correctly? If you do it as:
> 
> while (true)
> {
>     CHECK_FOR_INTERRUPTS();
> 
>     ResetLatch(MyLatch);
> 
>     if (work_needed)
>     {
>         Plenty();
>         Code();
>         Using(MyLatch);
>     }
>     else
>     {
>         WaitLatch(MyLatch);
>     }
> }
> 
> I think that's not a danger? I think the problem really is that we
> suggest doing that WaitLatch() unconditionally:
> 
>  * The correct pattern to wait for event(s) is:
>  *
>  * for (;;)
>  * {
>  *       ResetLatch();
>  *       if (work to do)
>  *           Do Stuff();
>  *       WaitLatch();
>  * }
>  *
>  * It's important to reset the latch *before* checking if there's work to
>  * do. Otherwise, if someone sets the latch between the check and the
>  * ResetLatch call, you will miss it and Wait will incorrectly block.
>  *
>  * Another valid coding pattern looks like:
>  *
>  * for (;;)
>  * {
>  *       if (work to do)
>  *           Do Stuff(); // in particular, exit loop if some condition satisfied
>  *       WaitLatch();
>  *       ResetLatch();
>  * }
> 
> Obviously there's the issue that a lot of latch-using code isn't written
> that way - but I also don't think there's that much latch-using outer-loop
> code whose work phase then also uses the latch internally.  Seems like we
> could fix that.  While this approach obviously has the danger of not being
> followed, so does the 'always-set-latch-unless-outermost-loop' approach.
> 
> I'm not sure I like the idea of incurring another unnecessary SetLatch()
> call for most latch using places.
> 
> I guess there's a bit bigger danger of taking longer to notice
> postmaster-death. But I'm not sure I can quite see that being
> problematic - seems like all we should incur is another cycle through
> the loop, as the latch shouldn't be set anymore.

I think we should thus change our latch documentation to something like:

diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fc995819d35..dc46dd94c5b 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -44,22 +44,31 @@
  * {
  *    ResetLatch();
  *    if (work to do)
- *        Do Stuff();
- *    WaitLatch();
+ *        DoStuff();
+ *    else
+ *        WaitLatch();
  * }
  *
  * It's important to reset the latch *before* checking if there's work to
  * do. Otherwise, if someone sets the latch between the check and the
  * ResetLatch call, you will miss it and Wait will incorrectly block.
  *
+ * The reason to only wait on the latch in case there is nothing to do is that
+ * code inside DoStuff() might use the same latch, and leave it reset, even
+ * though a SetLatch() aimed for the outer loop arrived. Which again could
+ * lead to incorrectly blocking in Wait.
+ *
  * Another valid coding pattern looks like:
  *
  * for (;;)
  * {
  *    if (work to do)
- *        Do Stuff(); // in particular, exit loop if some condition satisfied
- *    WaitLatch();
- *    ResetLatch();
+ *        DoStuff(); // in particular, exit loop if some condition satisfied
+ *    else
+ *    {
+ *        WaitLatch();
+ *        ResetLatch();
+ *    }
  * }
  *
  * This is useful to reduce latch traffic if it's expected that the loop's

and adapt code to match (at least in the outer loops).

Greetings,

Andres Freund



Re: Race conditions with checkpointer and shutdown

From
Thomas Munro
Date:
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what we need to look for is reasons why (1) the postmaster
> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
> main loop doesn't get to noticing shutdown_requested.
>
> A rather scary point for (2) is that said main loop seems to be
> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
> other purposes in the checkpointer process.  If there were something,
> like say a condition variable wait, that would reset MyLatch at any
> time during a checkpoint, then we could very easily go to sleep at the
> bottom of the loop and not notice that there's a pending shutdown request.

Agreed on the non-composability of that coding, but if there actually
is anything in that loop that can reach ResetLatch(), it's well
hidden...

-- 
Thomas Munro
https://enterprisedb.com



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>> 
>> A rather scary point for (2) is that said main loop seems to be
>> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
>> other purposes in the checkpointer process.  If there were something,
>> like say a condition variable wait, that would reset MyLatch at any
>> time during a checkpoint, then we could very easily go to sleep at the
>> bottom of the loop and not notice that there's a pending shutdown request.

> Agreed on the non-composability of that coding, but if there actually
> is anything in that loop that can reach ResetLatch(), it's well
> hidden...

Well, it's easy to see that there's no other ResetLatch call in
checkpointer.c.  It's much less obvious that there's no such
call anywhere in the code reachable from e.g. CreateCheckPoint().

To try to investigate that, I hacked things up to force an assertion
failure if ResetLatch was called from any other place in the
checkpointer process (dirty patch attached for amusement's sake).
This gets through check-world without any assertions.  That does not
really prove that there aren't corner timing cases where a latch
wait and reset could happen, but it does put a big dent in my theory.
Question is, what other theory has anybody got?

            regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;

+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */

 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
         int            elapsed_secs;
         int            cur_timeout;

+        reset_latch_ok = true;
+
         /* Clear any already-pending wakeups */
         ResetLatch(MyLatch);

+        reset_latch_ok = false;
+
         /*
          * Process any requests or signals received recently.
          */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };

+bool reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
     /* Only the owner should reset the latch */
     Assert(latch->owner_pid == MyProcPid);

+    Assert(reset_latch_ok);
+
     latch->is_set = false;

     /*

Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter.  I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached.  Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017.  To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it.  The oldest case that
clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

A significant majority of the recent cases look just like the piculet
failure Michael pointed to, that is we fail to shut down the "london"
server while it's acting as standby in the recovery/t/009_twophase.pl
test.  But there are very similar failures in other tests.

I also notice that the population of machines showing the problem seems
heavily skewed towards, um, weird cases.  For instance, in the set
that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms.  The timing theory is
maybe a bit stronger given that one test case shows this more often than
others.  I've not got any clear ideas beyond that.

Anyway, this is *not* new in v12.

            regards, tom lane

    sysname    |      snapshot       |        stage        |                  l
---------------+---------------------+---------------------+-------------------------------------
 jacana        | 2016-07-23 06:15:32 | pg_upgradeCheck     | pg_ctl: server does not shut down\r
 pademelon     | 2016-08-14 03:49:36 | ECPG-Check          | pg_ctl: server does not shut down
 mereswine     | 2017-02-13 14:24:37 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-03-04 20:06:10 | StopDb-C:4          | pg_ctl: server does not shut down
 nightjar      | 2017-06-02 18:54:29 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-02 19:54:11 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 15:54:05 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 17:54:18 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 21:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 00:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 16:34:32 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 17:54:16 | SubscriptionCheck   | pg_ctl: server does not shut down
 hornet        | 2017-06-05 16:22:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-05 16:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-05 20:26:24 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 03:30:02 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 15:54:18 | SubscriptionCheck   | pg_ctl: server does not shut down
 hornet        | 2017-06-06 17:10:02 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 18:54:27 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 00:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 02:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 15:12:15 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 17:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 18:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 sungazer      | 2017-06-07 19:46:53 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 21:03:43 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 01:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 15:54:10 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 16:57:03 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 17:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 19:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 nudibranch    | 2017-07-23 08:31:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 nudibranch    | 2017-08-06 08:31:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 spurfowl      | 2017-08-21 18:43:08 | BinInstallCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-10 22:12:01 | pg_upgradeCheck     | pg_ctl: server does not shut down
 calliphoridae | 2017-10-11 20:32:24 | recoveryCheck       | pg_ctl: server does not shut down
 arapaima      | 2017-10-12 16:36:18 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-12 16:36:18 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-13 10:26:13 | test-decoding-check | pg_ctl: server does not shut down
 arapaima      | 2017-10-14 07:40:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-14 23:22:28 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-15 10:32:28 | test-decoding-check | pg_ctl: server does not shut down
 arapaima      | 2017-10-16 08:40:39 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-17 20:32:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-18 13:11:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-18 23:47:09 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-19 09:41:15 | Check               | pg_ctl: server does not shut down
 spurfowl      | 2017-11-15 13:29:06 | BinInstallCheck     | pg_ctl: server does not shut down
 baiji         | 2017-11-27 23:00:01 | Check               | pg_ctl: server does not shut down\r
 skink         | 2017-11-30 08:47:34 | recoveryCheck       | pg_ctl: server does not shut down
 skink         | 2017-12-02 14:30:01 | recoveryCheck       | pg_ctl: server does not shut down
 gharial       | 2017-12-15 09:12:45 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2017-12-17 09:33:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 nudibranch    | 2018-01-20 10:08:33 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2018-01-27 09:44:39 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2018-02-01 09:49:26 | test-decoding-check | pg_ctl: server does not shut down
 mereswine     | 2018-03-01 11:05:39 | Check               | pg_ctl: server does not shut down
 topminnow     | 2018-03-10 12:45:57 | pg_rewindCheck      | pg_ctl: server does not shut down
 mereswine     | 2018-03-14 08:10:08 | Check               | pg_ctl: server does not shut down
 gull          | 2018-03-30 02:10:15 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-08 11:37:13 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-08 21:41:39 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-10 23:37:22 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-13 16:06:17 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-13 23:59:24 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-04-14 20:09:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-17 11:59:57 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-19 07:31:31 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-20 08:02:00 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-21 04:21:33 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-21 23:06:17 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-22 17:03:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-22 17:03:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-23 13:08:51 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-24 07:52:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-25 01:32:16 | Check               | pg_ctl: server does not shut down
 gull          | 2018-04-25 02:10:12 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-26 17:42:38 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-28 05:58:59 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-30 13:11:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-03 02:36:32 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-04 11:07:37 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-04 18:02:59 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-06 02:52:23 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-07 19:16:33 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-08 13:03:52 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 04:36:06 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 10:41:59 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 16:41:32 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-10 14:06:04 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-11 09:00:33 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-12 04:07:50 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-15 22:11:06 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-16 10:14:21 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-16 23:01:57 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-17 07:06:03 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-17 14:22:36 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 00:15:19 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 11:23:03 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 22:36:45 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-19 11:54:21 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-19 19:06:08 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-20 09:32:53 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-20 16:47:17 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-21 03:36:24 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-21 14:38:21 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-22 03:02:05 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-22 11:06:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-23 12:13:37 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-23 21:05:13 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-24 08:29:25 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-25 14:08:16 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-25 19:11:35 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-27 16:56:10 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-05-29 16:09:14 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-05-30 15:07:44 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 13:10:53 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 21:12:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 21:12:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-04 17:07:20 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-05 00:12:13 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 17:06:03 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 21:49:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 21:49:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-07 06:53:06 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-07 19:36:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-08 15:47:15 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-09 07:11:38 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-09 21:54:40 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-10 08:12:11 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-12 00:12:15 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-13 23:29:02 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 04:38:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 04:38:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 13:54:08 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 20:19:02 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-16 08:09:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-17 04:38:19 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-17 19:47:19 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-18 04:06:04 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-06-18 16:23:17 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-19 18:45:19 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-20 09:15:42 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-20 09:15:42 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 04:41:45 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 13:06:30 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 23:50:37 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-22 13:11:55 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-23 02:09:06 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-23 13:12:49 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 01:06:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 01:06:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 08:25:08 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 16:10:48 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-26 03:34:45 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-26 12:47:21 | StopDb-C:1          | pg_ctl: server does not shut down
 gull          | 2018-08-24 03:27:16 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-08-24 08:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-08-29 09:26:52 | pg_rewindCheck      | pg_ctl: server does not shut down
 gull          | 2018-08-30 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-09-06 02:10:16 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-09-14 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 sungazer      | 2018-09-18 19:44:53 | subscriptionCheck   | pg_ctl: server does not shut down
 gull          | 2018-09-26 02:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-10-01 08:10:04 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-03 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-07 03:45:39 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-07 05:07:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-10-07 08:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-13 02:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-13 05:21:13 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-29 03:25:28 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-15 11:55:56 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-16 09:10:13 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-30 10:34:17 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-30 11:44:54 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-12-08 04:29:34 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-01-30 05:18:59 | recoveryCheck       | pg_ctl: server does not shut down
 francolin     | 2019-02-07 07:55:41 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-02-20 03:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-02-22 03:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 spurfowl      | 2019-02-25 11:02:04 | pg_rewindCheck      | pg_ctl: server does not shut down
 sidewinder    | 2019-02-25 16:45:00 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-03-06 03:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 piculet       | 2019-03-23 22:28:59 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2019-04-11 06:25:20 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-04-12 06:31:30 | recoveryCheck       | pg_ctl: server does not shut down
 dragonet      | 2019-04-16 06:14:01 | recoveryCheck       | pg_ctl: server does not shut down
(188 rows)


Re: Race conditions with checkpointer and shutdown

From
Thomas Munro
Date:
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Question is, what other theory has anybody got?

I wondered if there might be a way for PostmasterStateMachine() to be
reached without signals blocked, in the case where we fork a fresh
checkpointer, and it then misses the SIGUSR2 that we immediately send
because it hasn't installed its handler yet.  But I can't see it.

This is a curious thing from dragonet's log:

2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

LogCheckpointStart() is the thing that writes "starting: ...", and it
prefers to report "shutdown" over "time", but it didn't.
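
For anyone not familiar with that function: each flag simply appends a
word to the message, roughly as in the paraphrased fragment below (not
the verbatim source), so a shutdown-triggered restartpoint would have
said "restartpoint starting: shutdown ...".  Seeing only "time" means
CHECKPOINT_CAUSE_TIME was set and CHECKPOINT_IS_SHUTDOWN was not, i.e.
this was just the routine timed restartpoint, not the shutdown one.

    elog(LOG, "%s starting:%s%s",
         restartpoint ? "restartpoint" : "checkpoint",
         (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
         (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");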

-- 
Thomas Munro
https://enterprisedb.com



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> This is a curious thing from dragonet's log:

> 2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> process due to administrator command
> 2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

> LogCheckpointStart() is the thing that writes "starting: ...", and it
> prefers to report "shutdown" over "time", but it didn't.

Yeah, but since we don't see "shutting down", we know that the shutdown
checkpoint hasn't begun.  Here's another similar case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54

The relevant fragment of the standby server's log is

2018-11-30 05:09:22.996 PST [4229] LOG:  received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG:  aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG:  checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB, estimate=16563 kB
2018-11-30 05:09:23.679 PST [4229] LOG:  background worker "logical replication launcher" (PID 4276) exited with exit code 1
2018-11-30 05:11:23.757 PST [4288] master LOG:  unexpected EOF on standby connection
2018-11-30 05:11:23.883 PST [4229] LOG:  received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG:  database system is shut down

To the extent that I've found logs in which the checkpointer prints
anything at all during this interval, it seems to be just quietly
plodding along with its usual business, without any hint that it's
aware of the pending shutdown request.  It'd be very easy to believe
that the postmaster -> checkpointer SIGUSR2 is simply getting dropped,
or never issued.

Hmm ... actually, looking at the postmaster's logic, it won't issue
SIGUSR2 to the checkpointer until the walreceiver (if any) is gone.
And now that I think about it, several of these logs contain traces
showing that the walreceiver is still live.  Like the one quoted above:
seems like the line from PID 4288 has to be from a walreceiver.

Maybe what we should be looking for is "why doesn't the walreceiver
shut down"?  But the dragonet log you quote above shows the walreceiver
exiting, or at least starting to exit.  Tis a puzzlement.

I'm a bit tempted to add temporary debug logging to the postmaster so
that walreceiver start/stop is recorded at LOG level.  We'd have to wait
a few weeks to have any clear result from the buildfarm, but I'm not sure
how we'll get any hard data without some such measures.
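
Something as small as the following would likely do, purely as
temporary instrumentation (placement is hypothetical: one line where
the postmaster records the new walreceiver's PID, and a matching one
where the reaper notices that child's exit, with "pid" standing for
whatever variable holds the reaped PID there):

    ereport(LOG,
            (errmsg("walreceiver process (PID %d) started",
                    (int) WalReceiverPID)));

    ereport(LOG,
            (errmsg("walreceiver process (PID %d) exited",
                    (int) pid)));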

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Michael Paquier
Date:
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter.  I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached.  Not all of these look like
> the failure pattern Michael pointed to, but enough of them do to say
> that the problem has existed since at least mid-2017.  To be concrete,
> we have quite a sample of cases where a standby server has received a
> "fast shutdown" signal and acknowledged that in its log, but it never
> gets to the expected "shutting down" message, meaning it never starts
> the shutdown checkpoint let alone finishes it.  The oldest case that
> clearly looks like that is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

Interesting.  I was sort of thinking about c6c3334 first but this
failed based on 9fcf670, which does not include the former.

> This leads me to suspect that the problem is (a) some very low-level issue
> in spinlocks or latches or the like, or (b) a timing problem that just
> doesn't show up on generic Intel-oid platforms.  The timing theory is
> maybe a bit stronger given that one test case shows this more often than
> others.  I've not got any clear ideas beyond that.
>
> Anyway, this is *not* new in v12.

Indeed.  It seems to me that v12 makes the problem more likely to
appear, though, and I have to wonder whether c6c9474 is contributing
to that, as more cases have been popping up since mid-March.
--
Michael


Re: Race conditions with checkpointer and shutdown

From
Thomas Munro
Date:
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> > Anyway, this is *not* new in v12.
>
> Indeed.  It seems to me that v12 makes the problem more likely to
> appear, though, and I have to wonder whether c6c9474 is contributing
> to that, as more cases have been popping up since mid-March.

Interesting, but I'm not sure how that could be though.  Perhaps, a
bit like the other thing that cropped up in the build farm after that
commit, removing ~200ms of needless sleeping around an earlier online
CHECKPOINT made some other pre-existing race condition more likely to
go wrong.

-- 
Thomas Munro
https://enterprisedb.com



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.

> Indeed.  It seems to me that v12 makes the problem more likely to
> appear, though, and I have to wonder whether c6c9474 is contributing
> to that, as more cases have been popping up since mid-March.

Yeah.  Whether that's due to a server code change, or new or modified
test cases, is unknown.  But looking at my summary of buildfarm runs
that failed like this, there's a really clear breakpoint at

 gull          | 2018-08-24 03:27:16 | recoveryCheck       | pg_ctl: server does not shut down

Since that, most of the failures with this message have been in the
recoveryCheck step.  Before that, the failures were all over the
place, and now that I look closely a big fraction of them were
in bursts on particular animals, suggesting it was more about
some local problem on that animal than any real code issue.

So it might be worth groveling around in the commit logs from
last August...

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Interesting, but I'm not sure how that could be though.  Perhaps, a
> bit like the other thing that cropped up in the build farm after that
> commit, removing ~200ms of needless sleeping around an earlier online
> CHECKPOINT made some other pre-existing race condition more likely to
> go wrong.

The data that we've got is entirely consistent with the idea that
there's a timing-sensitive bug that gets made more or less likely
to trigger by "unrelated" changes in test cases or server code.

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Thomas Munro
Date:
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> > process due to administrator command

> Maybe what we should be looking for is "why doesn't the walreceiver
> shut down"?  But the dragonet log you quote above shows the walreceiver
> exiting, or at least starting to exit.  Tis a puzzlement.

One thing I noticed about this message: if you receive SIGTERM at a
rare time when WalRcvImmediateInterruptOK is true, then that ereport()
runs directly in the signal handler context.  That's not strictly
allowed, and could cause nasal demons.  On the other hand, it probably
wouldn't have managed to get the FATAL message out if that was the
problem here (previously we've seen reports of signal handlers
deadlocking while trying to ereport() but they couldn't get their
message out at all, because malloc or some such was already locked in
the user context).  Is there some way that the exit code could hang
*after* that due to corruption of libc resources (FILE streams,
malloc, ...)?  It doesn't seem likely to me (we'd hopefully see some
more clues) but I thought I'd mention the idea.
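
For reference, the conventionally safe shape for such a handler -- just
a sketch here, not the walreceiver's actual code -- confines the handler
to async-signal-safe work and defers the ereport() to the main loop:

static volatile sig_atomic_t got_SIGTERM = false;

static void
SketchedSigtermHandler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_SIGTERM = true;
    SetLatch(MyLatch);          /* wake the main loop; safe from a handler */

    errno = save_errno;
}

/* ... and later, at a point in the main loop where ereport() is safe: */
if (got_SIGTERM)
    ereport(FATAL,
            (errcode(ERRCODE_ADMIN_SHUTDOWN),
             errmsg("terminating walreceiver process due to administrator command")));

The hazard described above comes from the other arm, where
WalRcvImmediateInterruptOK lets the handler run ereport(FATAL) itself.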

-- 
Thomas Munro
https://enterprisedb.com



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe what we should be looking for is "why doesn't the walreceiver
>> shut down"?  But the dragonet log you quote above shows the walreceiver
>> exiting, or at least starting to exit.  Tis a puzzlement.

> ... Is there some way that the exit code could hang
> *after* that due to corruption of libc resources (FILE streams,
> malloc, ...)?  It doesn't seem likely to me (we'd hopefully see some
> more clues) but I thought I'd mention the idea.

I agree it's not likely ... but that's part of the reason I was thinking
about adding some postmaster logging.  Whatever we're chasing here is
"not likely", per the observed buildfarm failure rate.

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
>>> Maybe what we should be looking for is "why doesn't the walreceiver
>>> shut down"?  But the dragonet log you quote above shows the walreceiver
>>> exiting, or at least starting to exit.  Tis a puzzlement.

huh ... take a look at this little stanza in PostmasterStateMachine:

    if (pmState == PM_SHUTDOWN_2)
    {
        /*
         * PM_SHUTDOWN_2 state ends when there's no other children than
         * dead_end children left. There shouldn't be any regular backends
         * left by now anyway; what we're really waiting for is walsenders and
         * archiver.
         *
         * Walreceiver should normally be dead by now, but not when a fast
         * shutdown is performed during recovery.
         */
        if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
            WalReceiverPID == 0)
        {
            pmState = PM_WAIT_DEAD_END;
        }
    }

I'm too tired to think through exactly what that last comment might be
suggesting, but it sure seems like it might be relevant to our problem.
If the walreceiver *isn't* dead yet, what's going to ensure that we
can move forward later?

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
I have spent a fair amount of time trying to replicate these failures
locally, with little success.  I now think that the most promising theory
is Munro's idea in [1] that the walreceiver is hanging up during its
unsafe attempt to do ereport(FATAL) from inside a signal handler.  It's
extremely plausible that that could result in a deadlock inside libc's
malloc/free, or some similar place.  Moreover, if that's what's causing
it, then the windows for trouble are determined by the length of time that
malloc might hold internal locks, which fits with the results I've gotten
that inserting delays in various promising-looking places doesn't do a
thing towards making this reproducible.

Even if that isn't the proximate cause of the current reports, it's
clearly trouble waiting to happen, and we should get rid of it.
Accordingly, see attached proposed patch.  This just flushes the
"immediate interrupt" stuff in favor of making sure that
libpqwalreceiver.c will take care of any signals received while
waiting for input.

The existing code does not use PQsetnonblocking, which means that it's
theoretically at risk of blocking while pushing out data to the remote
server.  In practice I think that risk is negligible because (IIUC) we
don't send very large amounts of data at one time.  So I didn't bother to
change that.  Note that for the most part, if that happened, the existing
code was at risk of slow response to SIGTERM anyway since it didn't have
Enable/DisableWalRcvImmediateExit around the places that send data.

My thought is to apply this only to HEAD for now; it's kind of a large
change to shove into the back branches to handle a failure mode that's
not been reported from the field.  Maybe we could back-patch after we
have more confidence in it.

            regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B%3D1G98m61VjNS-qGboJPwdZcF%2BrAPu2eC4XuWRTR3UPw%40mail.gmail.com

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7123d41..765d58d 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -99,6 +99,7 @@ static WalReceiverFunctionsType PQWalReceiverFunctions = {

 /* Prototypes for private functions */
 static PGresult *libpqrcv_PQexec(PGconn *streamConn, const char *query);
+static PGresult *libpqrcv_PQgetResult(PGconn *streamConn);
 static char *stringlist_to_identifierstr(PGconn *conn, List *strings);

 /*
@@ -196,7 +197,7 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname,
         if (rc & WL_LATCH_SET)
         {
             ResetLatch(MyLatch);
-            CHECK_FOR_INTERRUPTS();
+            ProcessWalRcvInterrupts();
         }

         /* If socket is ready, advance the libpq state machine */
@@ -456,6 +457,10 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli)
 {
     PGresult   *res;

+    /*
+     * Send copy-end message.  As in libpqrcv_PQexec, this could theoretically
+     * block, but the risk seems small.
+     */
     if (PQputCopyEnd(conn->streamConn, NULL) <= 0 ||
         PQflush(conn->streamConn))
         ereport(ERROR,
@@ -472,7 +477,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli)
      * If we had not yet received CopyDone from the backend, PGRES_COPY_OUT is
      * also possible in case we aborted the copy in mid-stream.
      */
-    res = PQgetResult(conn->streamConn);
+    res = libpqrcv_PQgetResult(conn->streamConn);
     if (PQresultStatus(res) == PGRES_TUPLES_OK)
     {
         /*
@@ -486,7 +491,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli)
         PQclear(res);

         /* the result set should be followed by CommandComplete */
-        res = PQgetResult(conn->streamConn);
+        res = libpqrcv_PQgetResult(conn->streamConn);
     }
     else if (PQresultStatus(res) == PGRES_COPY_OUT)
     {
@@ -499,7 +504,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli)
                             pchomp(PQerrorMessage(conn->streamConn)))));

         /* CommandComplete should follow */
-        res = PQgetResult(conn->streamConn);
+        res = libpqrcv_PQgetResult(conn->streamConn);
     }

     if (PQresultStatus(res) != PGRES_COMMAND_OK)
@@ -509,7 +514,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli)
     PQclear(res);

     /* Verify that there are no more results */
-    res = PQgetResult(conn->streamConn);
+    res = libpqrcv_PQgetResult(conn->streamConn);
     if (res != NULL)
         ereport(ERROR,
                 (errmsg("unexpected result after CommandComplete: %s",
@@ -572,12 +577,11 @@ libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn,
  * The function is modeled on PQexec() in libpq, but only implements
  * those parts that are in use in the walreceiver api.
  *
- * Queries are always executed on the connection in streamConn.
+ * May return NULL, rather than an error result, on failure.
  */
 static PGresult *
 libpqrcv_PQexec(PGconn *streamConn, const char *query)
 {
-    PGresult   *result = NULL;
     PGresult   *lastResult = NULL;

     /*
@@ -588,60 +592,26 @@ libpqrcv_PQexec(PGconn *streamConn, const char *query)
      */

     /*
-     * Submit a query. Since we don't use non-blocking mode, this also can
-     * block. But its risk is relatively small, so we ignore that for now.
+     * Submit the query.  Since we don't use non-blocking mode, this could
+     * theoretically block.  In practice, since we don't send very long query
+     * strings, the risk seems negligible.
      */
     if (!PQsendQuery(streamConn, query))
         return NULL;

     for (;;)
     {
-        /*
-         * Receive data until PQgetResult is ready to get the result without
-         * blocking.
-         */
-        while (PQisBusy(streamConn))
-        {
-            int            rc;
-
-            /*
-             * We don't need to break down the sleep into smaller increments,
-             * since we'll get interrupted by signals and can either handle
-             * interrupts here or elog(FATAL) within SIGTERM signal handler if
-             * the signal arrives in the middle of establishment of
-             * replication connection.
-             */
-            rc = WaitLatchOrSocket(MyLatch,
-                                   WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
-                                   WL_LATCH_SET,
-                                   PQsocket(streamConn),
-                                   0,
-                                   WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE);
-
-            /* Interrupted? */
-            if (rc & WL_LATCH_SET)
-            {
-                ResetLatch(MyLatch);
-                CHECK_FOR_INTERRUPTS();
-            }
+        /* Wait for, and collect, the next PGresult. */
+        PGresult   *result;

-            /* Consume whatever data is available from the socket */
-            if (PQconsumeInput(streamConn) == 0)
-            {
-                /* trouble; drop whatever we had and return NULL */
-                PQclear(lastResult);
-                return NULL;
-            }
-        }
+        result = libpqrcv_PQgetResult(streamConn);
+        if (result == NULL)
+            break;                /* query is complete, or failure */

         /*
          * Emulate PQexec()'s behavior of returning the last result when there
          * are many.  We are fine with returning just last error message.
          */
-        result = PQgetResult(streamConn);
-        if (result == NULL)
-            break;                /* query is complete */
-
         PQclear(lastResult);
         lastResult = result;

@@ -656,6 +626,51 @@ libpqrcv_PQexec(PGconn *streamConn, const char *query)
 }

 /*
+ * Perform the equivalent of PQgetResult(), but watch for interrupts.
+ */
+static PGresult *
+libpqrcv_PQgetResult(PGconn *streamConn)
+{
+    /*
+     * Collect data until PQgetResult is ready to get the result without
+     * blocking.
+     */
+    while (PQisBusy(streamConn))
+    {
+        int            rc;
+
+        /*
+         * We don't need to break down the sleep into smaller increments,
+         * since we'll get interrupted by signals and can handle any
+         * interrupts here.
+         */
+        rc = WaitLatchOrSocket(MyLatch,
+                               WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
+                               WL_LATCH_SET,
+                               PQsocket(streamConn),
+                               0,
+                               WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE);
+
+        /* Interrupted? */
+        if (rc & WL_LATCH_SET)
+        {
+            ResetLatch(MyLatch);
+            ProcessWalRcvInterrupts();
+        }
+
+        /* Consume whatever data is available from the socket */
+        if (PQconsumeInput(streamConn) == 0)
+        {
+            /* trouble; return NULL */
+            return NULL;
+        }
+    }
+
+    /* Now we can collect and return the next PGresult */
+    return PQgetResult(streamConn);
+}
+
+/*
  * Disconnect connection to primary, if any.
  */
 static void
@@ -716,13 +731,13 @@ libpqrcv_receive(WalReceiverConn *conn, char **buffer,
     {
         PGresult   *res;

-        res = PQgetResult(conn->streamConn);
+        res = libpqrcv_PQgetResult(conn->streamConn);
         if (PQresultStatus(res) == PGRES_COMMAND_OK)
         {
             PQclear(res);

             /* Verify that there are no more results. */
-            res = PQgetResult(conn->streamConn);
+            res = libpqrcv_PQgetResult(conn->streamConn);
             if (res != NULL)
             {
                 PQclear(res);
@@ -886,7 +901,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres,
     {
         char       *cstrs[MaxTupleAttributeNumber];

-        CHECK_FOR_INTERRUPTS();
+        ProcessWalRcvInterrupts();

         /* Do the allocations in temporary context. */
         oldcontext = MemoryContextSwitchTo(rowcontext);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f32cf91..7f4bd56 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -112,28 +112,7 @@ static struct
 static StringInfoData reply_message;
 static StringInfoData incoming_message;

-/*
- * About SIGTERM handling:
- *
- * We can't just exit(1) within SIGTERM signal handler, because the signal
- * might arrive in the middle of some critical operation, like while we're
- * holding a spinlock. We also can't just set a flag in signal handler and
- * check it in the main loop, because we perform some blocking operations
- * like libpqrcv_PQexec(), which can take a long time to finish.
- *
- * We use a combined approach: When WalRcvImmediateInterruptOK is true, it's
- * safe for the signal handler to elog(FATAL) immediately. Otherwise it just
- * sets got_SIGTERM flag, which is checked in the main loop when convenient.
- *
- * This is very much like what regular backends do with ImmediateInterruptOK,
- * ProcessInterrupts() etc.
- */
-static volatile bool WalRcvImmediateInterruptOK = false;
-
 /* Prototypes for private functions */
-static void ProcessWalRcvInterrupts(void);
-static void EnableWalRcvImmediateExit(void);
-static void DisableWalRcvImmediateExit(void);
 static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last);
 static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI);
 static void WalRcvDie(int code, Datum arg);
@@ -151,7 +130,20 @@ static void WalRcvShutdownHandler(SIGNAL_ARGS);
 static void WalRcvQuickDieHandler(SIGNAL_ARGS);


-static void
+/*
+ * Process any interrupts the walreceiver process may have received.
+ * This should be called any time the process's latch has become set.
+ *
+ * Currently, only SIGTERM is of interest.  We can't just exit(1) within the
+ * SIGTERM signal handler, because the signal might arrive in the middle of
+ * some critical operation, like while we're holding a spinlock.  Instead, the
+ * signal handler sets a flag variable as well as setting the process's latch.
+ * We must check the flag (by calling ProcessWalRcvInterrupts) anytime the
+ * latch has become set.  Operations that could block for a long time, such as
+ * reading from a remote server, must pay attention to the latch too; see
+ * libpqrcv_PQexec for example.
+ */
+void
 ProcessWalRcvInterrupts(void)
 {
     /*
@@ -163,26 +155,12 @@ ProcessWalRcvInterrupts(void)

     if (got_SIGTERM)
     {
-        WalRcvImmediateInterruptOK = false;
         ereport(FATAL,
                 (errcode(ERRCODE_ADMIN_SHUTDOWN),
                  errmsg("terminating walreceiver process due to administrator command")));
     }
 }

-static void
-EnableWalRcvImmediateExit(void)
-{
-    WalRcvImmediateInterruptOK = true;
-    ProcessWalRcvInterrupts();
-}
-
-static void
-DisableWalRcvImmediateExit(void)
-{
-    WalRcvImmediateInterruptOK = false;
-    ProcessWalRcvInterrupts();
-}

 /* Main entry point for walreceiver process */
 void
@@ -292,12 +270,10 @@ WalReceiverMain(void)
     PG_SETMASK(&UnBlockSig);

     /* Establish the connection to the primary for XLOG streaming */
-    EnableWalRcvImmediateExit();
     wrconn = walrcv_connect(conninfo, false, cluster_name[0] ? cluster_name : "walreceiver", &err);
     if (!wrconn)
         ereport(ERROR,
                 (errmsg("could not connect to the primary server: %s", err)));
-    DisableWalRcvImmediateExit();

     /*
      * Save user-visible connection string.  This clobbers the original
@@ -336,7 +312,6 @@ WalReceiverMain(void)
          * Check that we're connected to a valid server using the
          * IDENTIFY_SYSTEM replication command.
          */
-        EnableWalRcvImmediateExit();
         primary_sysid = walrcv_identify_system(wrconn, &primaryTLI);

         snprintf(standby_sysid, sizeof(standby_sysid), UINT64_FORMAT,
@@ -348,7 +323,6 @@ WalReceiverMain(void)
                      errdetail("The primary's identifier is %s, the standby's identifier is %s.",
                                primary_sysid, standby_sysid)));
         }
-        DisableWalRcvImmediateExit();

         /*
          * Confirm that the current timeline of the primary is the same or
@@ -509,6 +483,8 @@ WalReceiverMain(void)
                 if (rc & WL_LATCH_SET)
                 {
                     ResetLatch(walrcv->latch);
+                    ProcessWalRcvInterrupts();
+
                     if (walrcv->force_reply)
                     {
                         /*
@@ -577,9 +553,7 @@ WalReceiverMain(void)
              * The backend finished streaming. Exit streaming COPY-mode from
              * our side, too.
              */
-            EnableWalRcvImmediateExit();
             walrcv_endstreaming(wrconn, &primaryTLI);
-            DisableWalRcvImmediateExit();

             /*
              * If the server had switched to a new timeline that we didn't
@@ -726,9 +700,7 @@ WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last)
                     (errmsg("fetching timeline history file for timeline %u from primary server",
                             tli)));

-            EnableWalRcvImmediateExit();
             walrcv_readtimelinehistoryfile(wrconn, tli, &fname, &content, &len);
-            DisableWalRcvImmediateExit();

             /*
              * Check that the filename on the master matches what we
@@ -805,7 +777,7 @@ WalRcvSigUsr1Handler(SIGNAL_ARGS)
     errno = save_errno;
 }

-/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
+/* SIGTERM: set flag for ProcessWalRcvInterrupts */
 static void
 WalRcvShutdownHandler(SIGNAL_ARGS)
 {
@@ -816,10 +788,6 @@ WalRcvShutdownHandler(SIGNAL_ARGS)
     if (WalRcv->latch)
         SetLatch(WalRcv->latch);

-    /* Don't joggle the elbow of proc_exit */
-    if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
-        ProcessWalRcvInterrupts();
-
     errno = save_errno;
 }

diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 33e89ca..7f2927c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -302,6 +302,7 @@ walrcv_clear_result(WalRcvExecResult *walres)

 /* prototypes for functions in walreceiver.c */
 extern void WalReceiverMain(void) pg_attribute_noreturn();
+extern void ProcessWalRcvInterrupts(void);

 /* prototypes for functions in walreceiverfuncs.c */
 extern Size WalRcvShmemSize(void);

Re: Race conditions with checkpointer and shutdown

From
Thomas Munro
Date:
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch.  This just flushes the
> "immediate interrupt" stuff in favor of making sure that
> libpqwalreceiver.c will take care of any signals received while
> waiting for input.

+1

I see that we removed the code that this was modelled on back in 2015,
and in fact your patch even removes a dangling reference in a comment:

- * This is very much like what regular backends do with ImmediateInterruptOK,

> The existing code does not use PQsetnonblocking, which means that it's
> theoretically at risk of blocking while pushing out data to the remote
> server.  In practice I think that risk is negligible because (IIUC) we
> don't send very large amounts of data at one time.  So I didn't bother to
> change that.  Note that for the most part, if that happened, the existing
> code was at risk of slow response to SIGTERM anyway since it didn't have
> Enable/DisableWalRcvImmediateExit around the places that send data.

Right.

> My thought is to apply this only to HEAD for now; it's kind of a large
> change to shove into the back branches to handle a failure mode that's
> not been reported from the field.  Maybe we could back-patch after we
> have more confidence in it.

+1

That reminds me, we should probably also clean up at least the
ereport-from-signal-handler hazard identified over in this thread:

https://www.postgresql.org/message-id/CAEepm%3D10MtmKeDc1WxBM0PQM9OgtNy%2BRCeWqz40pZRRS3PNo5Q%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Even if that isn't the proximate cause of the current reports, it's
>> clearly trouble waiting to happen, and we should get rid of it.

> +1

> That reminds me, we should probably also clean up at least the
> ereport-from-signal-handler hazard identified over in this thread:
> https://www.postgresql.org/message-id/CAEepm%3D10MtmKeDc1WxBM0PQM9OgtNy%2BRCeWqz40pZRRS3PNo5Q%40mail.gmail.com

Yeah, probably.  I imagine the reason we've not already seen complaints
about that is that not that many custom bgworkers exist.

I do not think we can get away with back-patching a change in that area,
though, since it'll move the goalposts about what bgworker code has to
do to cope with SIGTERMs.  It might already be too late for v12, unless
we want to treat that as an Open Item.

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Andres Freund
Date:
Hi,

On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch.  This just flushes the
> "immediate interrupt" stuff in favor of making sure that
> libpqwalreceiver.c will take care of any signals received while
> waiting for input.

Good plan.


> The existing code does not use PQsetnonblocking, which means that it's
> theoretically at risk of blocking while pushing out data to the remote
> server.  In practice I think that risk is negligible because (IIUC) we
> don't send very large amounts of data at one time.  So I didn't bother to
> change that.  Note that for the most part, if that happened, the existing
> code was at risk of slow response to SIGTERM anyway since it didn't have
> Enable/DisableWalRcvImmediateExit around the places that send data.

Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
wait until there's an OS TCP timeout, no? It's bad enough that there
were cases of this before. Increasing the number of cases where we
might want to shut down the walreceiver (e.g. because we would rather
switch to restore_command, or just shut down the server) but instead
get stuck waiting an hour for a TCP timeout doesn't seem OK.

Greetings,

Andres Freund



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
>> The existing code does not use PQsetnonblocking, which means that it's
>> theoretically at risk of blocking while pushing out data to the remote
>> server.  In practice I think that risk is negligible because (IIUC) we
>> don't send very large amounts of data at one time.  So I didn't bother to
>> change that.  Note that for the most part, if that happened, the existing
>> code was at risk of slow response to SIGTERM anyway since it didn't have
>> Enable/DisableWalRcvImmediateExit around the places that send data.

> Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
> wait until there's an OS TCP timeout, no?

No.  send() is only going to block if there's no room in the kernel's
buffers, and that would only happen if we send a lot of data in between
waits to receive data.  Which, AFAIK, the walreceiver never does.
We might possibly need to improve that code in the future, but I don't
think there's a need for it today.
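
If it ever does become a problem, one hypothetical way to handle it
would be to push output the same way libpqrcv_PQgetResult now pulls
input: switch the connection to non-blocking mode and wait on the
latch plus the socket while PQflush() still has data queued.  Rough
sketch only -- the function name is made up, and it reuses the
existing receive wait-event for lack of a dedicated one:

/*
 * Hypothetical: flush buffered libpq output without blocking, while
 * still reacting promptly to SIGTERM via the latch.
 */
static bool
libpqrcv_flush_output(PGconn *streamConn)
{
    if (PQsetnonblocking(streamConn, 1) != 0)
        return false;

    for (;;)
    {
        int         rc;
        int         ret = PQflush(streamConn);

        if (ret == 0)
            return true;        /* output queue is empty */
        if (ret < 0)
            return false;       /* connection trouble */

        /* ret == 1: couldn't send everything yet; wait for progress */
        rc = WaitLatchOrSocket(MyLatch,
                               WL_EXIT_ON_PM_DEATH | WL_LATCH_SET |
                               WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE,
                               PQsocket(streamConn),
                               0,
                               WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            ProcessWalRcvInterrupts();
        }

        /* libpq may need incoming data before it can send more */
        if ((rc & WL_SOCKET_READABLE) && PQconsumeInput(streamConn) == 0)
            return false;
    }
}

But as said, today's traffic pattern doesn't seem to need it.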

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Andres Freund
Date:
Hi,

On 2019-04-29 12:55:31 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
> > wait until there's an OS TCP timeout, no?
> 
> No.  send() is only going to block if there's no room in the kernel's
> buffers, and that would only happen if we send a lot of data in between
> waits to receive data.  Which, AFAIK, the walreceiver never does.
> We might possibly need to improve that code in the future, but I don't
> think there's a need for it today.

Ah, right.

- Andres



Re: Race conditions with checkpointer and shutdown

From
Ashwin Agrawal
Date:
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success.  I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler.  It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place.  Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.

For Greenplum (based on 9.4, though the current master code looks the
same) we have recently seen the walreceiver hit this deadlock many
times in CI, which I believe confirms the above finding.

#0  __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>,
p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at
malloc.c:2968
#3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
#6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1,
listp=0x7f063822b5f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00000000008c72c7 in proc_exit ()
#9  0x0000000000a75867 in errfinish ()
#10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
#11 0x000000000089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>,
bytes=bytes@entry=16384) at malloc.c:3802
#14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x00000000007754c3 in makeEmptyPGconn ()
#16 0x0000000000779686 in PQconnectStart ()
#17 0x0000000000779b8b in PQconnectdb ()
#18 0x00000000008aae52 in libpqrcv_connect ()
#19 0x000000000089f735 in WalReceiverMain ()
#20 0x00000000005c5eab in AuxiliaryProcessMain ()
#21 0x00000000004cd5f1 in ServerLoop ()
#22 0x000000000086fb18 in PostmasterMain ()
#23 0x00000000004d2e28 in main ()

ImmediateInterruptOK was removed from regular backends but not from
the walreceiver, and having the walreceiver perform elog(FATAL) inside
a signal handler is dangerous.
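
To make the safe pattern concrete, here is a condensed sketch of what
the proposed patch does instead: the handler only sets a flag and the
latch, and the FATAL exit happens later from straight-line code.  (This
is abridged from the diff upthread, not the complete functions; the
flag declaration shown is the pre-existing one in walreceiver.c.)

/* Flag set by the SIGTERM handler, checked from regular code paths. */
static volatile sig_atomic_t got_SIGTERM = false;

/* SIGTERM: only async-signal-safe work happens here. */
static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_SIGTERM = true;

    /* Wake up whatever wait the main code is currently parked in */
    if (WalRcv->latch)
        SetLatch(WalRcv->latch);

    errno = save_errno;
}

/* Called from the main loop and wait points whenever the latch is set. */
void
ProcessWalRcvInterrupts(void)
{
    if (got_SIGTERM)
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating walreceiver process due to administrator command")));
}

With that shape, the exit path (and therefore malloc/free, ereport and
proc_exit) only ever runs from ordinary code, never from inside a
signal handler.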



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> For Greenplum (based on 9.4 but current master code looks the same) we
> did see deadlocks recently hit in CI many times for walreceiver which
> I believe confirms above finding.

> #0  __lll_lock_wait_private () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>,
> p=0x26bb3b0, have_lock=0) at malloc.c:3962
> #2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at
> malloc.c:2968
> #3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
> #4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
> #5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
> #6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1,
> listp=0x7f063822b5f8 <__exit_funcs>,
> run_list_atexit=run_list_atexit@entry=true) at exit.c:82
> #7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
> #8  0x00000000008c72c7 in proc_exit ()
> #9  0x0000000000a75867 in errfinish ()
> #10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
> #11 0x000000000089eac5 in WalRcvShutdownHandler ()
> #12 <signal handler called>
> #13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>,
> bytes=bytes@entry=16384) at malloc.c:3802
> #14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
> #15 0x00000000007754c3 in makeEmptyPGconn ()
> #16 0x0000000000779686 in PQconnectStart ()
> #17 0x0000000000779b8b in PQconnectdb ()
> #18 0x00000000008aae52 in libpqrcv_connect ()
> #19 0x000000000089f735 in WalReceiverMain ()
> #20 0x00000000005c5eab in AuxiliaryProcessMain ()
> #21 0x00000000004cd5f1 in ServerLoop ()
> #22 0x000000000086fb18 in PostmasterMain ()
> #23 0x00000000004d2e28 in main ()

Cool --- that stack trace is *exactly* what you'd expect if this
were the problem.  Thanks for sending it along!

Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
to see if it fixes the problem for you?

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Ashwin Agrawal
Date:
On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> to see if it fixes the problem for you?

Yes, will give it a try on greenplum and report back the result.

Have we decided if this will be applied to back branches?



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
>> to see if it fixes the problem for you?

> Yes, will give it a try on greenplum and report back the result.

> Have we decided if this will be applied to back branches?

My feeling about it is "maybe eventually, but most definitely not
the week before a set of minor releases".  Some positive experience
with Greenplum would help increase confidence in the patch, for sure.

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Alvaro Herrera
Date:
On 2019-Apr-29, Tom Lane wrote:

> Ashwin Agrawal <aagrawal@pivotal.io> writes:
> > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> >> to see if it fixes the problem for you?
> 
> > Yes, will give it a try on greenplum and report back the result.
> 
> > Have we decided if this will be applied to back branches?

Hi Ashwin, did you have the chance to try this out?


> My feeling about it is "maybe eventually, but most definitely not
> the week before a set of minor releases".  Some positive experience
> with Greenplum would help increase confidence in the patch, for sure.

I looked at the buildfarm failures for the recoveryCheck stage.  It
looks like there is only one failure for branch master after this
commit, which was chipmunk saying:

  # poll_query_until timed out executing this query:
  # SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;
  # expecting this output:
  # standby1|1|sync
  # standby2|2|sync
  # standby3|2|potential
  # standby4|2|potential
  # last actual query output:
  # standby1|1|sync
  # standby2|2|potential
  # standby3|2|sync
  # standby4|2|potential
  # with stderr:
  not ok 6 - asterisk comes before another standby name

  #   Failed test 'asterisk comes before another standby name'
  #   at t/007_sync_rep.pl line 26.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
AFAICS this is wholly unrelated to the problem at hand.

No other animal has failed the recoveryCheck test; before the commit,
the failure was not terribly frequent, but rarely would 10 days go by
without it happening.  So I suggest that the bug has indeed been fixed.

Maybe now's a good time to get it back-patched?  In branch
REL_11_STABLE, it failed as recently as 11 days ago in gull,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-06-01%2004%3A11%3A36

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Apr-29, Tom Lane wrote:
>> Ashwin Agrawal <aagrawal@pivotal.io> writes:
>>> Have we decided if this will be applied to back branches?

>> My feeling about it is "maybe eventually, but most definitely not
>> the week before a set of minor releases".  Some positive experience
>> with Greenplum would help increase confidence in the patch, for sure.

> I looked at the buildfarm failures for the recoveryCheck stage.  It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
> ...
> AFAICS this is wholly unrelated to the problem at hand.

Yeah, that seems unrelated.

> No other animal failed recoveryCheck test; before the commit, the
> failure was not terribly frequent, but rarely would 10 days go by
> without it failing.  So I suggest that the bug has indeed been fixed.

I feel pretty good about it too.

> Maybe now's a good time to get it back-patched?

Should we do that now, or wait till after next week's releases?

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Alvaro Herrera
Date:
On 2019-Jun-12, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> > Maybe now's a good time to get it back-patched?
> 
> Should we do that now, or wait till after next week's releases?

IMO this has been hammered enough in master, and we still have a few
days of buildfarm coverage for the back branches, so it's okay to do it
now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Race conditions with checkpointer and shutdown

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Jun-12, Tom Lane wrote:
>> Should we do that now, or wait till after next week's releases?

> IMO this has been hammered enough in master, and we still have a few
> days in the back-branches for buildfarm, that it's okay to do it now.

Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
merge failures.  Also, looking at the git logs shows that we did a hell
of a lot of subtle work on that code (libpqwalreceiver.c in particular)
during the v10 cycle.  So I've got no confidence that successful
buildfarm/beta1 testing of the HEAD patch means much of anything for
putting it into pre-v10 branches.

Given that we've seen few if any field reports of this issue, my
inclination is to back-patch as far as v10, but not take the risk
and effort involved in going further.

            regards, tom lane



Re: Race conditions with checkpointer and shutdown

From
Michael Paquier
Date:
On Wed, Jun 12, 2019 at 04:26:23PM -0400, Tom Lane wrote:
> Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
> into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
> merge failures.  Also, looking at the git logs shows that we did a hell
> of a lot of subtle work on that code (libpqwalreceiver.c in particular)
> during the v10 cycle.  So I've got no confidence that successful
> buildfarm/beta1 testing of the HEAD patch means much of anything for
> putting it into pre-v10 branches.
>
> Given that we've seen few if any field reports of this issue, my
> inclination is to back-patch as far as v10, but not take the risk
> and effort involved in going further.

+1 for back-patching only down to v10, per the invasiveness argument.
I think that you have made the right move here.
--
Michael

Attachment

Re: Race conditions with checkpointer and shutdown

From
Michael Paquier
Date:
On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage.  It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.

Yes, that's unrelated.

This failure is interesting, though it has happened only once as far
as I can see.  I think it points to a rare race condition in test
007_sync_rep.pl, caused by this sequence which reorders the standbys:
# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;

$node_standby_2->start;
$node_standby_3->start;

The failure indicates that even though standby3 was started after
standby2, standby3 registered itself in the primary's WAL sender array
before standby2 had a chance to do so; we have no guarantee about that
ordering.  That messes up the expected set of sync standbys when they
are not listed explicitly.  I think the same thing could happen when
the standbys are first initialized at the top of the test, but that
window is much, much narrower and basically impossible to hit.

Using a combination of safe_psql('checkpoint') and wait_for_catchup()
would make the test safer, but that's much more costly than simply
using poll_query_until() on pg_stat_replication to make sure that each
standby is registered in the primary's WAL sender array.  In short,
something like the attached should improve the stability of the test.

Thoughts?
--
Michael

Attachment