Parallel query vs smart shutdown and Postmaster death

From
Thomas Munro
Date:
Hello hackers,

1.  In a nearby thread, I misdiagnosed a problem reported[1] by Justin
Pryzby (though my misdiagnosis is probably still a thing to be fixed;
see next).  I think I just spotted the real problem he saw: if you
execute a parallel query after a smart shutdown has been initiated,
you wait forever in gather_readnext()!  Maybe parallel workers can't
be launched in this state, but we lack code to detect this case?  I
haven't dug into the exact mechanism or figured out what to do about
it yet, and I'm tied up with something else for a bit, but I will come
back to this later if nobody beats me to it.

2.  Commit cfdf4dc4 on the master branch fixed up all known waits that
didn't respond to postmaster death, and added an assertion to that
effect.  One of the cases fixed was in gather_readnext(), and
initially I thought that's what Justin was telling us about (his
report was from 11.x), until I reread his message and saw that it was
SIGTERM and not e.g. SIGKILL.  I should probably go and back-patch a fix
for that case anyway... but now I'm wondering, was there a reason for
that omission, and likewise for mq_putmessage()?

(Another case of missing PM death detection in the back-branches is
postgres_fdw.)

[1] https://www.postgresql.org/message-id/CAEepm%3D0kMunPC0hhuT0VC-5dfMT3K-xsToJHkTznA6yrSARsPg%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com


Re: Parallel query vs smart shutdown and Postmaster death

From
Thomas Munro
Date:
On Mon, Feb 25, 2019 at 2:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 1.  In a nearby thread, I misdiagnosed a problem reported[1] by Justin
> Pryzby (though my misdiagnosis is probably still a thing to be fixed;
> see next).  I think I just spotted the real problem he saw: if you
> execute a parallel query after a smart shutdown has been initiated,
> you wait forever in gather_readnext()!  Maybe parallel workers can't
> be launched in this state, but we lack code to detect this case?  I
> haven't dug into the exact mechanism or figured out what to do about
> it yet, and I'm tied up with something else for a bit, but I will come
> back to this later if nobody beats me to it.

Given smart shutdown's stated goal, namely that it "lets existing
sessions end their work normally", my questions are:

1.  Why does pmdie()'s SIGTERM case terminate parallel workers
immediately?  That aborts running parallel queries, so they
don't get to end their work normally.
2.  Why are new parallel workers not allowed to be started while in
this state?  That hangs future parallel queries forever, so they don't
get to end their work normally.
3.  Suppose we fix the above cases; should we do it for parallel
workers only (somehow), or for all bgworkers?  It's hard to say since
I don't know what all bgworkers do.

In the meantime, perhaps we should teach the postmaster to report this
case as a failure to launch in back-branches, so that at least
parallel queries don't hang forever?  Here's an initial sketch of a
patch like that: it gives you "ERROR:  parallel worker failed to
initialize" and "HINT:  More details may be available in the server
log." if you try to run a parallel query.  The HINT is right, the
server logs say that a smart shutdown is in progress.  If that seems a
bit hostile, consider that any parallel queries that were running at
the moment the smart shutdown was requested have already been ordered
to quit; why should new queries started after that get a better deal?
Then perhaps we could do some more involved surgery on master that
achieves smart shutdown's stated goal here, and lets parallel queries
actually run?  Better ideas welcome.

-- 
Thomas Munro
https://enterprisedb.com

Attachment

Re: Parallel query vs smart shutdown and Postmaster death

From
Robert Haas
Date:
On Tue, Feb 26, 2019 at 5:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Then perhaps we could do some more involved surgery on master that
> achieves smart shutdown's stated goal here, and lets parallel queries
> actually run?  Better ideas welcome.

I have noticed before that the smart shutdown code does not disclose
to the rest of the system that a shutdown is in progress: no signals
are sent, and no shared memory state is updated.  That makes it a bit
challenging for any other part of the system to respond to the smart
shutdown in a way that is, well, smart.  But I guess that's not really
the problem in this case.

It seems to me that we could fix pmdie() to skip parallel workers; I
think that the postmaster could notice that they are flagged as
BGWORKER_CLASS_PARALLEL.  But we'd also have to fix things so that new
parallel workers could be launched during a smart shutdown, which
looks not quite so simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Parallel query vs smart shutdown and Postmaster death

From
Arseny Sher
Date:
Hi,

Thomas Munro <thomas.munro@gmail.com> writes:

> 1.  Why does pmdie()'s SIGTERM case terminate parallel workers
> immediately?  That aborts running parallel queries, so they
> don't get to end their work normally.
> 2.  Why are new parallel workers not allowed to be started while in
> this state?  That hangs future parallel queries forever, so they don't
> get to end their work normally.
> 3.  Suppose we fix the above cases; should we do it for parallel
> workers only (somehow), or for all bgworkers?  It's hard to say since
> I don't know what all bgworkers do.

The attached patch fixes 1 and 2. As for 3, the only other internal
bgworkers currently are the logical replication launcher and workers,
which arguably should be killed immediately.

> Here's an initial sketch of a
> patch like that: it gives you "ERROR:  parallel worker failed to
> initialize" and "HINT:  More details may be available in the server
> log." if you try to run a parallel query.

Reporting bgworkers that the postmaster is never going to start is probably
worthwhile anyway, to prevent further potential deadlocks like this.
However, I think there is a problem in your patch: we might be in
post-PM_RUN states due to FatalError, not because of a shutdown. In that
case, we shouldn't refuse to run bgws in the future. I would also merge
the check into bgworker_should_start_now.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers after a shutdown has been issued, the parallel leader gets
stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 44 ++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d60969fdb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS)
             if (pmState == PM_RUN || pmState == PM_RECOVERY ||
                 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
             {
+                slist_iter siter;
+
+                /*
+                 * Shut down all bgws except parallel workers: graceful
+                 * termination of existing sessions needs them.
+                 */
+                slist_foreach(siter, &BackgroundWorkerList)
+                {
+                    RegisteredBgWorker *rw;
+
+                    rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+                    if (rw->rw_pid > 0 &&
+                        ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0))
+                    {
+                        ereport(DEBUG4,
+                                (errmsg_internal("sending signal %d to process %d",
+                                                 SIGTERM, (int) rw->rw_pid)));
+                        signal_child(rw->rw_pid, SIGTERM);
+
+                    }
+                }
+
                 /* autovac workers are told to shut down immediately */
-                /* and bgworkers too; does this need tweaking? */
-                SignalSomeChildren(SIGTERM,
-                                   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+                SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
@@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+    BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
     switch (pmState)
     {
         case PM_NO_CHILDREN:
         case PM_WAIT_DEAD_END:
         case PM_SHUTDOWN_2:
         case PM_SHUTDOWN:
+            return false;
+
         case PM_WAIT_BACKENDS:
         case PM_WAIT_READONLY:
         case PM_WAIT_BACKUP:
-            break;
+            /*
+             * Allow new parallel workers in SmartShutdown while backends
+             * are still there.
+             */
+            if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+                Shutdown != SmartShutdown ||
+                FatalError)
+                return false;
+            /* fall through */
 
         case PM_RUN:
             if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void)
             }
         }
 
-        if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+        if (bgworker_should_start_now(rw))
         {
             /* reset crash time before trying to start worker */
             rw->rw_crashed_at = 0;
-- 
2.11.0


Re: Parallel query vs smart shutdown and Postmaster death

From
Thomas Munro
Date:
On Sun, Mar 17, 2019 at 5:53 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>
> > 1.  Why does pmdie()'s SIGTERM case terminate parallel workers
> > immediately?  That aborts running parallel queries, so they
> > don't get to end their work normally.
> > 2.  Why are new parallel workers not allowed to be started while in
> > this state?  That hangs future parallel queries forever, so they don't
> > get to end their work normally.
> > 3.  Suppose we fix the above cases; should we do it for parallel
> > workers only (somehow), or for all bgworkers?  It's hard to say since
> > I don't know what all bgworkers do.
>
> The attached patch fixes 1 and 2. As for 3, the only other internal
> bgworkers currently are the logical replication launcher and workers,
> which arguably should be killed immediately.

Hi Arseny,

Thanks for working on this!  Yes, it seems better to move forwards
rather than backwards, and fix this properly as you have done in this
patch.

Just a thought: instead of the new hand-coded loop you added in
pmdie(), do you think it would make sense to have a new argument
"exclude_class_mask" for SignalSomeChildren()?  If we did that, I
would consider renaming the existing parameter "target" to
"include_type_mask" to make it super clear what's going on.

> > Here's an initial sketch of a
> > patch like that: it gives you "ERROR:  parallel worker failed to
> > initialize" and "HINT:  More details may be available in the server
> > log." if you try to run a parallel query.
>
> Reporting bgworkers that the postmaster is never going to start is probably
> worthwhile anyway, to prevent further potential deadlocks like this.
> However, I think there is a problem in your patch: we might be in
> post-PM_RUN states due to FatalError, not because of a shutdown. In that
> case, we shouldn't refuse to run bgws in the future. I would also merge
> the check into bgworker_should_start_now.

Hmm, yeah, I haven't tested or double-checked in detail yet but I
think you're probably right about all of that.

-- 
Thomas Munro
https://enterprisedb.com


Re: Parallel query vs smart shutdown and Postmaster death

From
Arseny Sher
Date:
Thomas Munro <thomas.munro@gmail.com> writes:

> Just a thought: instead of the new hand-coded loop you added in
> pmdie(), do you think it would make sense to have a new argument
> "exclude_class_mask" for SignalSomeChildren()?  If we did that, I
> would consider renaming the existing parameter "target" to
> "include_type_mask" to make it super clear what's going on.

I thought this was a bit too complicated for a single use case, but if
you like it better, here is an updated version.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers after a shutdown has been issued, the parallel leader gets
stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 72 ++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..2ad215b12d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(void);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
-static bool SignalSomeChildren(int signal, int targets);
+static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask);
 static void TerminateChildren(int signal);
 
-#define SignalChildren(sig)               SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+#define SignalChildren(sig)               SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0)
 
 static int    CountChildren(int target);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
@@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS)
             if (pmState == PM_RUN || pmState == PM_RECOVERY ||
                 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
             {
-                /* autovac workers are told to shut down immediately */
-                /* and bgworkers too; does this need tweaking? */
-                SignalSomeChildren(SIGTERM,
-                                   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+                /*
+                 * Shut down all bgws except parallel workers: graceful
+                 * termination of existing sessions needs them.
+                 * Autovac workers are also told to shut down immediately.
+                 */
+                SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
@@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS)
                 signal_child(WalReceiverPID, SIGTERM);
             if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
             {
-                SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
+                SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0);
 
                 /*
                  * Only startup, bgwriter, walreceiver, possibly bgworkers,
@@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS)
                 /* shut down all backends and workers */
                 SignalSomeChildren(SIGTERM,
                                    BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
-                                   BACKEND_TYPE_BGWORKER);
+                                   BACKEND_TYPE_BGWORKER, 0);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
@@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal)
 
 /*
  * Send a signal to the targeted children (but NOT special children;
- * dead_end children are never signaled, either).
+ * dead_end children are never signaled, either). If BACKEND_TYPE_BGWORKER
+ * is in include_type_mask, bgworkers are additionally filtered out by
+ * exclude_bgw_mask.
  */
 static bool
-SignalSomeChildren(int signal, int target)
+SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask)
 {
     dlist_iter    iter;
+    slist_iter    siter;
     bool        signaled = false;
 
+    /*
+     * Handle bgws by walking over BackgroundWorkerList because we have
+     * to check the bgw class.
+     */
+    if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0)
+    {
+        slist_foreach(siter, &BackgroundWorkerList)
+        {
+            RegisteredBgWorker *rw;
+
+            rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+            if (rw->rw_pid > 0 &&
+                ((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0))
+            {
+                ereport(DEBUG4,
+                        (errmsg_internal("sending signal %d to process %d",
+                                         signal, (int) rw->rw_pid)));
+                signal_child(rw->rw_pid, signal);
+                signaled = true;
+            }
+        }
+    }
+
     dlist_foreach(iter, &BackendList)
     {
         Backend    *bp = dlist_container(Backend, elem, iter.cur);
 
-        if (bp->dead_end)
+        if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER)
             continue;
 
         /*
          * Since target == BACKEND_TYPE_ALL is the most common case, we test
          * it first and avoid touching shared memory for every child.
          */
-        if (target != BACKEND_TYPE_ALL)
+        if (include_type_mask != BACKEND_TYPE_ALL)
         {
             /*
              * Assign bkend_type for any recently announced WAL Sender
@@ -3968,7 +3996,7 @@ SignalSomeChildren(int signal, int target)
                 IsPostmasterChildWalSender(bp->child_slot))
                 bp->bkend_type = BACKEND_TYPE_WALSND;
 
-            if (!(target & bp->bkend_type))
+            if (!(include_type_mask & bp->bkend_type))
                 continue;
         }
 
@@ -5735,18 +5763,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+    BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
     switch (pmState)
     {
         case PM_NO_CHILDREN:
         case PM_WAIT_DEAD_END:
         case PM_SHUTDOWN_2:
         case PM_SHUTDOWN:
+            return false;
+
         case PM_WAIT_BACKENDS:
         case PM_WAIT_READONLY:
         case PM_WAIT_BACKUP:
-            break;
+            /*
+             * Allow new parallel workers in SmartShutdown while backends
+             * are still there.
+             */
+            if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+                Shutdown != SmartShutdown ||
+                FatalError)
+                return false;
+            /* fall through */
 
         case PM_RUN:
             if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5946,7 @@ maybe_start_bgworkers(void)
             }
         }
 
-        if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+        if (bgworker_should_start_now(rw))
         {
             /* reset crash time before trying to start worker */
             rw->rw_crashed_at = 0;
-- 
2.11.0