Re: Parallel query vs smart shutdown and Postmaster death - Mailing list pgsql-hackers

From Arseny Sher
Subject Re: Parallel query vs smart shutdown and Postmaster death
Date
Msg-id 87lg1ep0ds.fsf@ars-thinkpad
Whole thread Raw
In response to Re: Parallel query vs smart shutdown and Postmaster death  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Parallel query vs smart shutdown and Postmaster death
List pgsql-hackers
Hi,

Thomas Munro <thomas.munro@gmail.com> writes:

> 1.  Why does pmdie()'s SIGTERM case terminate parallel workers
> immediately?  That breaks aborts running parallel queries, so they
> don't get to end their work normally.
> 2.  Why are new parallel workers not allowed to be started while in
> this state?  That hangs future parallel queries forever, so they don't
> get to end their work normally.
> 3.  Suppose we fix the above cases; should we do it for parallel
> workers only (somehow), or for all bgworkers?  It's hard to say since
> I don't know what all bgworkers do.

Attached patch fixes 1 and 2. As for the 3, the only other internal
bgworkers currently are logical replication launcher and workers, which
arguably should be killed immediately.

> Here's an initial sketch of a
> patch like that: it gives you "ERROR:  parallel worker failed to
> initialize" and "HINT:  More details may be available in the server
> log." if you try to run a parallel query.

Reporting bgworkers that postmaster is never going to start is probably
worthwhile doing anyway to prevent potential further deadlocks like
this. However, I think there is a problem in your patch: we might be in
post PM_RUN states due to FatalError, not because of shutdown. In this
case, we shouldn't refuse to run bgws in the future. I would also merge
the check into bgworker_should_start_now.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since postmaster doesn't advertise that he is never going to start parallel
workers after shutdown was issued, parallel leader stucks waiting for them
indefinitely.
---
 src/backend/postmaster/postmaster.c | 44 ++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d60969fdb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS)
             if (pmState == PM_RUN || pmState == PM_RECOVERY ||
                 pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
             {
+                slist_iter siter;
+
+                /*
+                 * Shut down all bgws except parallel workers: graceful
+                 * termination of existing sessions needs them.
+                 */
+                slist_foreach(siter, &BackgroundWorkerList)
+                {
+                    RegisteredBgWorker *rw;
+
+                    rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+                    if (rw->rw_pid > 0 &&
+                        ((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0))
+                    {
+                        ereport(DEBUG4,
+                                (errmsg_internal("sending signal %d to process %d",
+                                                 SIGTERM, (int) rw->rw_pid)));
+                        signal_child(rw->rw_pid, SIGTERM);
+
+                    }
+                }
+
                 /* autovac workers are told to shut down immediately */
-                /* and bgworkers too; does this need tweaking? */
-                SignalSomeChildren(SIGTERM,
-                                   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+                SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
                 /* and the autovac launcher too */
                 if (AutoVacPID != 0)
                     signal_child(AutoVacPID, SIGTERM);
@@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+    BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
     switch (pmState)
     {
         case PM_NO_CHILDREN:
         case PM_WAIT_DEAD_END:
         case PM_SHUTDOWN_2:
         case PM_SHUTDOWN:
+            return false;
+
         case PM_WAIT_BACKENDS:
         case PM_WAIT_READONLY:
         case PM_WAIT_BACKUP:
-            break;
+            /*
+             * Allow new parallel workers in SmartShutdown while backends
+             * are still there.
+             */
+            if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+                Shutdown != SmartShutdown ||
+                FatalError)
+                return false;
+            /* fall through */
 
         case PM_RUN:
             if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void)
             }
         }
 
-        if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+        if (bgworker_should_start_now(rw))
         {
             /* reset crash time before trying to start worker */
             rw->rw_crashed_at = 0;
-- 
2.11.0


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Possible to modify query language in an extension?
Next
From: Michael Paquier
Date:
Subject: Re: Make pg_checksums complain if compiled BLCKSZ and data folder'sblock size differ