Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> On Wed, Dec 23, 2020 at 3:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Here's an attempt at closing the race condition discussed in [1]
>> (and in some earlier threads, though I'm too lazy to find them).
> 2) What if postmaster enters pmState >= PM_STOP_BACKENDS state after
> it calls BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS)?
> First of all, is it possible? I think yes, because we are in
> sigusr1_handler(), and I don't see we blocking the signal that sets
> pmState >= PM_STOP_BACKENDS either in sigusr1_handler or in
> BackgroundWorkerStateChange().
If you're asking whether the postmaster's signal handlers can interrupt
each other, they can't; see comment at the start of each one. If you're
wondering about the order of operations in sigusr1_handler, I agree that
seems wrong now. I'm inclined to move the BackgroundWorkerStateChange
call to be just before maybe_start_bgworkers(). That way it's after the
possible entry to PM_HOT_STANDBY state. The later steps can't change
pmState, except for PostmasterStateMachine, which would be responsible for
anything that needs to be done to bgworkers as a result of making a state
change.
> 3) Can we always say that if bgw_restart_time is BGW_NEVER_RESTART,
> then it's a dynamic bg worker?
That assumption's already baked into ResetBackgroundWorkerCrashTimes,
for one. Personally I'd have designed things with some more-explicit
indicator, but I'm not interested in revisiting those API decisions now;
any cleanup we might undertake would result in a non-back-patchable fix.
As far as whether it's formally correct to do this given the current
APIs, I think it is. We're essentially pretending that the worker
got launched and instantly exited. If it's BGW_NEVER_RESTART then that
would result in deregistration in any case, while if it's not that,
then the worker record should get kept for a possible later restart.
> I think we can also have normal
> bgworkers with BGW_NEVER_RESTART flag(I see one in worker_spi.c
> _PG_init()),
That might be a bug in worker_spi.c, but since that's only test code,
I don't care about it too much. Nobody's really going to care what
that module does in a postmaster shutdown.
> 4) IIUC, in the patch we mark slot->terminate = true only for
> BGW_NEVER_RESTART kind bg workers, what happens if a bg worker has
> bgw_restart_time seconds and don't we hit the hanging issue(that we
> are trying to solve here) for those bg workers?
The hang problem is not with the worker itself, it's with anything
that might be waiting around for the worker to finish. It doesn't
seem to me to make a whole lot of sense to wait for a restartable
worker; what would that mean?
There's definitely room to revisit and clarify these APIs, and maybe
(if we don't change them altogether) add some Asserts about what are
sane combinations of properties. But my purpose today is just to get
a back-patchable bug fix, and that sort of change wouldn't fit in.
regards, tom lane