Re: Preventing hangups in bgworker start/stop during DB shutdown - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Preventing hangups in bgworker start/stop during DB shutdown
Date
Msg-id CALj2ACWWiqC1Ci3dr4nTbpxH78z=fGO14uk6NEaT14zpbutKnA@mail.gmail.com
Whole thread Raw
In response to Preventing hangups in bgworker start/stop during DB shutdown  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Preventing hangups in bgworker start/stop during DB shutdown
List pgsql-hackers
On Wed, Dec 23, 2020 at 3:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Here's an attempt at closing the race condition discussed in [1]
> (and in some earlier threads, though I'm too lazy to find them).
>
> The core problem is that the bgworker management APIs were designed
> without any thought for exception conditions, notably "we're not
> gonna launch any more workers because we're shutting down the database".
> A process waiting for a worker in WaitForBackgroundWorkerStartup or
> WaitForBackgroundWorkerShutdown will wait forever, so that the database
> fails to shut down without manual intervention.
>
> I'd supposed that we would need some incompatible changes in those APIs
> in order to fix this, but after further study it seems like we could
> hack things by just acting as though a request that won't be serviced
> has already run to completion.  I'm not terribly satisfied with that
> as a long-term solution --- it seems to me that callers should be told
> that there was a failure.  But this looks to be enough to solve the
> lockup condition for existing callers, and it seems like it'd be okay
> to backpatch.
>
> Thoughts?
>
> [1] https://www.postgresql.org/message-id/flat/16785-c0207d8c67fb5f25%40postgresql.org

1) Yeah, the postmaster will not be able to start the bg workers in
following cases, when bgworker_should_start_now returns false. So we
might encounter the hang issue.
static bool
bgworker_should_start_now(BgWorkerStartTime start_time)
{
    switch (pmState)
    {
        case PM_NO_CHILDREN:
        case PM_WAIT_DEAD_END:
        case PM_SHUTDOWN_2:
        case PM_SHUTDOWN:
        case PM_WAIT_BACKENDS:
        case PM_STOP_BACKENDS:
            break;

2) What if postmaster enters pmState >= PM_STOP_BACKENDS state after
it calls BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS)?
First of all, is it possible? I think yes, because we are in
sigusr1_handler(), and I don't see we blocking the signal that sets
pmState >= PM_STOP_BACKENDS either in sigusr1_handler or in
BackgroundWorkerStateChange(). Though it's a small window, we might
get into the hangup issue? If yes, can we check the pmState in the for
loop in BackgroundWorkerStateChange()?

     if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
     {
-        BackgroundWorkerStateChange();
+        /* Accept new dynamic worker requests only if not stopping. */
+        BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS);
         StartWorkerNeeded = true;
     }

3) Can we always say that if bgw_restart_time is BGW_NEVER_RESTART,
then it's a dynamic bg worker? I think we can also have normal
bgworkers with BGW_NEVER_RESTART flag(I see one in worker_spi.c
_PG_init()), will there be any problem? Or do we need some comment
tweaking?

+        /*
+         * If this is a dynamic worker request, and we aren't allowing new
+         * dynamic workers, then immediately mark it for termination; the next
+         * stanza will take care of cleaning it up.
+         */
+        if (slot->worker.bgw_restart_time == BGW_NEVER_RESTART &&
+            !allow_new_workers)
+            slot->terminate = true;

4) IIUC, in the patch we mark slot->terminate = true only for
BGW_NEVER_RESTART kind bg workers, what happens if a bg worker has
bgw_restart_time seconds and don't we hit the hanging issue(that we
are trying to solve here) for those bg workers?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: How is this possible "publication does not exist"
Next
From: "k.jamison@fujitsu.com"
Date:
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist