Thread: Parallel query vs smart shutdown and Postmaster death
Hello hackers,

1. In a nearby thread, I misdiagnosed a problem reported[1] by Justin Pryzby (though my misdiagnosis is probably still a thing to be fixed; see next). I think I just spotted the real problem he saw: if you execute a parallel query after a smart shutdown has been initiated, you wait forever in gather_readnext()! Maybe parallel workers can't be launched in this state, but we lack code to detect this case? I haven't dug into the exact mechanism or figured out what to do about it yet, and I'm tied up with something else for a bit, but I will come back to this later if nobody beats me to it.

2. Commit cfdf4dc4 on the master branch fixed up all known waits that didn't respond to postmaster death, and added an assertion to that effect. One of the cases fixed was in gather_readnext(), and initially I thought that's what Justin was telling us about (his report was from 11.x), until I reread his message and saw that it was SIGTERM and not eg SIGKILL. I should probably go and back-patch a fix for that case anyway... but now I'm wondering, was there a reason for that omission, and likewise for mq_putmessage()? (Another case of missing PM death detection in the back-branches is postgres_fdw.)

[1] https://www.postgresql.org/message-id/CAEepm%3D0kMunPC0hhuT0VC-5dfMT3K-xsToJHkTznA6yrSARsPg%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
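For reference, the pattern that commit cfdf4dc4 standardised for waits that must notice postmaster death looks roughly like the sketch below. This is not the actual gather_readnext() or mq_putmessage() code; the function name, the "done" flag and the wait event are only illustrative.

#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

/*
 * Minimal sketch of a post-cfdf4dc4 wait loop.  WL_EXIT_ON_PM_DEATH makes
 * the process exit if the postmaster dies, so the caller no longer has to
 * pass WL_POSTMASTER_DEATH and test the returned event itself; the waits
 * mentioned above lacked even that check in the back-branches.
 */
static void
wait_until_done(volatile bool *done)
{
	while (!*done)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L,
						 WAIT_EVENT_MQ_RECEIVE);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}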
On Mon, Feb 25, 2019 at 2:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 1. In a nearby thread, I misdiagnosed a problem reported[1] by Justin
> Pryzby (though my misdiagnosis is probably still a thing to be fixed;
> see next). I think I just spotted the real problem he saw: if you
> execute a parallel query after a smart shutdown has been initiated,
> you wait forever in gather_readnext()! Maybe parallel workers can't
> be launched in this state, but we lack code to detect this case? I
> haven't dug into the exact mechanism or figured out what to do about
> it yet, and I'm tied up with something else for a bit, but I will come
> back to this later if nobody beats me to it.

Given smart shutdown's stated goal, namely that it "lets existing sessions end their work normally", my questions are:

1. Why does pmdie()'s SIGTERM case terminate parallel workers immediately? That aborts running parallel queries, so they don't get to end their work normally.

2. Why are new parallel workers not allowed to be started while in this state? That hangs future parallel queries forever, so they don't get to end their work normally.

3. Suppose we fix the above cases; should we do it for parallel workers only (somehow), or for all bgworkers? It's hard to say since I don't know what all bgworkers do.

In the meantime, perhaps we should teach the postmaster to report this case as a failure to launch in back-branches, so that at least parallel queries don't hang forever? Here's an initial sketch of a patch like that: it gives you "ERROR: parallel worker failed to initialize" and "HINT: More details may be available in the server log." if you try to run a parallel query. The HINT is right: the server logs say that a smart shutdown is in progress. If that seems a bit hostile, consider that any parallel queries that were running at the moment the smart shutdown was requested have already been ordered to quit; why should new queries started after that get a better deal?

Then perhaps we could do some more involved surgery on master that achieves smart shutdown's stated goal here, and lets parallel queries actually run? Better ideas welcome.

--
Thomas Munro
https://enterprisedb.com
Attachment
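For orientation, the leader-side half of that behaviour -- turning a worker the postmaster is never going to start into the ERROR/HINT quoted above -- looks roughly like the sketch below. This is not the attached patch (the attachment works on the postmaster side), and check_worker_launched is a made-up name.

#include "postgres.h"

#include "postmaster/bgworker.h"

/*
 * Sketch only: if the postmaster reports that the worker behind "handle"
 * was never started (BGWH_STOPPED before it ever attached), raise the
 * error quoted above instead of waiting forever.
 */
static void
check_worker_launched(BackgroundWorkerHandle *handle)
{
	pid_t		pid;
	BgwHandleStatus status;

	status = WaitForBackgroundWorkerStartup(handle, &pid);
	if (status != BGWH_STARTED)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("parallel worker failed to initialize"),
				 errhint("More details may be available in the server log.")));
}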
On Tue, Feb 26, 2019 at 5:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Then perhaps we could do some more involved surgery on master that
> achieves smart shutdown's stated goal here, and lets parallel queries
> actually run? Better ideas welcome.

I have noticed before that the smart shutdown code does not disclose to the rest of the system that a shutdown is in progress: no signals are sent, and no shared memory state is updated. That makes it a bit challenging for any other part of the system to respond to the smart shutdown in a way that is, well, smart. But I guess that's not really the problem in this case.

It seems to me that we could fix pmdie() to skip parallel workers; I think that the postmaster could notice that they are flagged as BGWORKER_CLASS_PARALLEL. But we'd also have to fix things so that new parallel workers could be launched during a smart shutdown, which looks not quite so simple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
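For context, the BGWORKER_CLASS_PARALLEL bit is set by the leader when it registers its workers, which is what would let the postmaster tell them apart. A rough sketch of such a registration follows; it is illustrative only, with simplified field values -- the real registration is LaunchParallelWorkers() in access/transam/parallel.c.

#include "postgres.h"

#include "miscadmin.h"
#include "postmaster/bgworker.h"

/*
 * Sketch of registering a worker that carries BGWORKER_CLASS_PARALLEL in
 * bgw_flags -- the bit pmdie() could test to leave parallel workers alone.
 */
static bool
register_parallel_worker_sketch(BackgroundWorkerHandle **handle)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION |
		BGWORKER_CLASS_PARALLEL;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "ParallelWorkerMain");
	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker (sketch)");
	snprintf(worker.bgw_type, BGW_MAXLEN, "parallel worker");
	worker.bgw_notify_pid = MyProcPid;

	return RegisterDynamicBackgroundWorker(&worker, handle);
}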
Hi,

Thomas Munro <thomas.munro@gmail.com> writes:

> 1. Why does pmdie()'s SIGTERM case terminate parallel workers
> immediately? That aborts running parallel queries, so they
> don't get to end their work normally.
> 2. Why are new parallel workers not allowed to be started while in
> this state? That hangs future parallel queries forever, so they don't
> get to end their work normally.
> 3. Suppose we fix the above cases; should we do it for parallel
> workers only (somehow), or for all bgworkers? It's hard to say since
> I don't know what all bgworkers do.

Attached patch fixes 1 and 2. As for 3, the only other internal bgworkers currently are the logical replication launcher and workers, which arguably should be killed immediately.

> Here's an initial sketch of a
> patch like that: it gives you "ERROR: parallel worker failed to
> initialize" and "HINT: More details may be available in the server
> log." if you try to run a parallel query.

Reporting bgworkers that the postmaster is never going to start is probably worthwhile doing anyway, to prevent potential further deadlocks like this. However, I think there is a problem in your patch: we might be in post-PM_RUN states due to FatalError, not because of shutdown. In this case, we shouldn't refuse to run bgws in the future. I would also merge the check into bgworker_should_start_now.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From da11d22a5ed78a690ccdbfeb77c59749c541d463 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers once a shutdown has been issued, the parallel leader gets
stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 44 ++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d60969fdb8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2689,10 +2689,30 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
+				slist_iter	siter;
+
+				/*
+				 * Shut down all bgws except parallel workers: graceful
+				 * termination of existing sessions needs them.
+				 */
+				slist_foreach(siter, &BackgroundWorkerList)
+				{
+					RegisteredBgWorker *rw;
+
+					rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+					if (rw->rw_pid > 0 &&
+						((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0))
+					{
+						ereport(DEBUG4,
+								(errmsg_internal("sending signal %d to process %d",
+												 SIGTERM, (int) rw->rw_pid)));
+						signal_child(rw->rw_pid, SIGTERM);
+
+					}
+				}
+
 				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -5735,18 +5755,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+	BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
 		case PM_SHUTDOWN_2:
 		case PM_SHUTDOWN:
+			return false;
+
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
-			break;
+			/*
+			 * Allow new parallel workers in SmartShutdown while backends
+			 * are still there.
+			 */
+			if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+				Shutdown != SmartShutdown ||
+				FatalError)
+				return false;
+			/* fall through */
 
 		case PM_RUN:
 			if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5938,7 @@ maybe_start_bgworkers(void)
 			}
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		if (bgworker_should_start_now(rw))
 		{
 			/* reset crash time before trying to start worker */
 			rw->rw_crashed_at = 0;
--
2.11.0
On Sun, Mar 17, 2019 at 5:53 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>
> > 1. Why does pmdie()'s SIGTERM case terminate parallel workers
> > immediately? That aborts running parallel queries, so they
> > don't get to end their work normally.
> > 2. Why are new parallel workers not allowed to be started while in
> > this state? That hangs future parallel queries forever, so they don't
> > get to end their work normally.
> > 3. Suppose we fix the above cases; should we do it for parallel
> > workers only (somehow), or for all bgworkers? It's hard to say since
> > I don't know what all bgworkers do.
>
> Attached patch fixes 1 and 2. As for 3, the only other internal
> bgworkers currently are the logical replication launcher and workers, which
> arguably should be killed immediately.

Hi Arseny,

Thanks for working on this! Yes, it seems better to move forwards rather than backwards, and fix this properly as you have done in this patch.

Just a thought: instead of the new hand-coded loop you added in pmdie(), do you think it would make sense to have a new argument "exclude_class_mask" for SignalSomeChildren()? If we did that, I would consider renaming the existing parameter "target" to "include_type_mask" to make it super clear what's going on.

> > Here's an initial sketch of a
> > patch like that: it gives you "ERROR: parallel worker failed to
> > initialize" and "HINT: More details may be available in the server
> > log." if you try to run a parallel query.
>
> Reporting bgworkers that the postmaster is never going to start is probably
> worthwhile doing anyway, to prevent potential further deadlocks like
> this. However, I think there is a problem in your patch: we might be in
> post-PM_RUN states due to FatalError, not because of shutdown. In this
> case, we shouldn't refuse to run bgws in the future. I would also merge
> the check into bgworker_should_start_now.

Hmm, yeah, I haven't tested or double-checked in detail yet but I think you're probably right about all of that.

--
Thomas Munro
https://enterprisedb.com
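Concretely, the suggestion amounts to a signature along these lines; this is a fragment-level sketch of the idea only, and the updated patch that follows spells the second mask exclude_bgw_mask rather than exclude_class_mask.

/* Hypothetical prototype following the renaming suggested above */
static bool SignalSomeChildren(int signal, int include_type_mask,
							   int exclude_class_mask);

/* ... so pmdie()'s smart-shutdown branch could simply say: */
SignalSomeChildren(SIGTERM,
				   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER,
				   BGWORKER_CLASS_PARALLEL);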
Thomas Munro <thomas.munro@gmail.com> writes:

> Just a thought: instead of the new hand-coded loop you added in
> pmdie(), do you think it would make sense to have a new argument
> "exclude_class_mask" for SignalSomeChildren()? If we did that, I
> would consider renaming the existing parameter "target" to
> "include_type_mask" to make it super clear what's going on.

I thought this was a bit too complicated for a single use case, but if you like it better, here is an updated version.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 2e6359df5bf60b9ab6ca6e35fa347993019526db Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@yandex.ru>
Date: Sun, 17 Mar 2019 07:42:18 +0300
Subject: [PATCH] Allow parallel workers while backends are alive in 'smart'
 shutdown.

Since the postmaster doesn't advertise that it is never going to start
parallel workers once a shutdown has been issued, the parallel leader gets
stuck waiting for them indefinitely.
---
 src/backend/postmaster/postmaster.c | 72 ++++++++++++++++++++++++++++---------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..2ad215b12d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -412,10 +412,10 @@ static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(void);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
-static bool SignalSomeChildren(int signal, int targets);
+static bool SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask);
 static void TerminateChildren(int signal);
 
-#define SignalChildren(sig)	SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+#define SignalChildren(sig)	SignalSomeChildren(sig, BACKEND_TYPE_ALL, 0)
 
 static int	CountChildren(int target);
 static bool assign_backendlist_entry(RegisteredBgWorker *rw);
@@ -2689,10 +2689,12 @@ pmdie(SIGNAL_ARGS)
 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
 			{
-				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
+				/*
+				 * Shut down all bgws except parallel workers: graceful
+				 * termination of existing sessions needs them.
+				 * Autovac workers are also told to shut down immediately.
+				 */
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_WORKER, BGWORKER_CLASS_PARALLEL);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
@@ -2752,7 +2754,7 @@ pmdie(SIGNAL_ARGS)
 				signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
-				SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
+				SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER, 0);
 
 				/*
 				 * Only startup, bgwriter, walreceiver, possibly bgworkers,
@@ -2773,7 +2775,7 @@ pmdie(SIGNAL_ARGS)
 			/* shut down all backends and workers */
 			SignalSomeChildren(SIGTERM,
 							   BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
-							   BACKEND_TYPE_BGWORKER);
+							   BACKEND_TYPE_BGWORKER, 0);
 			/* and the autovac launcher too */
 			if (AutoVacPID != 0)
 				signal_child(AutoVacPID, SIGTERM);
@@ -3939,26 +3941,52 @@ signal_child(pid_t pid, int signal)
 
 /*
  * Send a signal to the targeted children (but NOT special children;
- * dead_end children are never signaled, either).
+ * dead_end children are never signaled, either).  If BACKEND_TYPE_BGWORKER
+ * is in include_type_mask, bgworkers are additionally filtered out by
+ * exclude_bgw_mask.
  */
 static bool
-SignalSomeChildren(int signal, int target)
+SignalSomeChildren(int signal, int include_type_mask, int exclude_bgw_mask)
 {
 	dlist_iter	iter;
+	slist_iter	siter;
 	bool		signaled = false;
 
+	/*
+	 * Handle bgws by walking over BackgroundWorkerList because we have
+	 * to check the bgw class.
+	 */
+	if ((include_type_mask & BACKEND_TYPE_BGWORKER) != 0)
+	{
+		slist_foreach(siter, &BackgroundWorkerList)
+		{
+			RegisteredBgWorker *rw;
+
+			rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+			if (rw->rw_pid > 0 &&
+				((rw->rw_worker.bgw_flags & exclude_bgw_mask) == 0))
+			{
+				ereport(DEBUG4,
+						(errmsg_internal("sending signal %d to process %d",
+										 signal, (int) rw->rw_pid)));
+				signal_child(rw->rw_pid, signal);
+				signaled = true;
+			}
+		}
+	}
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend    *bp = dlist_container(Backend, elem, iter.cur);
 
-		if (bp->dead_end)
+		if (bp->dead_end || bp->bkend_type == BACKEND_TYPE_BGWORKER)
 			continue;
 
 		/*
 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
 		 * it first and avoid touching shared memory for every child.
 		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (include_type_mask != BACKEND_TYPE_ALL)
 		{
 			/*
 			 * Assign bkend_type for any recently announced WAL Sender
@@ -3968,7 +3996,7 @@ SignalSomeChildren(int signal, int target)
 				IsPostmasterChildWalSender(bp->child_slot))
 				bp->bkend_type = BACKEND_TYPE_WALSND;
 
-			if (!(target & bp->bkend_type))
+			if (!(include_type_mask & bp->bkend_type))
 				continue;
 		}
 
@@ -5735,18 +5763,30 @@ do_start_bgworker(RegisteredBgWorker *rw)
  * specified start_time?
  */
 static bool
-bgworker_should_start_now(BgWorkerStartTime start_time)
+bgworker_should_start_now(RegisteredBgWorker *rw)
 {
+	BgWorkerStartTime start_time = rw->rw_worker.bgw_start_time;
+
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
 		case PM_SHUTDOWN_2:
 		case PM_SHUTDOWN:
+			return false;
+
 		case PM_WAIT_BACKENDS:
 		case PM_WAIT_READONLY:
 		case PM_WAIT_BACKUP:
-			break;
+			/*
+			 * Allow new parallel workers in SmartShutdown while backends
+			 * are still there.
+			 */
+			if (((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0) ||
+				Shutdown != SmartShutdown ||
+				FatalError)
+				return false;
+			/* fall through */
 
 		case PM_RUN:
 			if (start_time == BgWorkerStart_RecoveryFinished)
@@ -5906,7 +5946,7 @@ maybe_start_bgworkers(void)
 			}
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		if (bgworker_should_start_now(rw))
 		{
 			/* reset crash time before trying to start worker */
 			rw->rw_crashed_at = 0;
--
2.11.0