Re: Parallel query hangs after a smart shutdown is issued - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Parallel query hangs after a smart shutdown is issued
Date
Msg-id CA+hUKGJZ9d2vUTZMPnNkZP_pJ-7++XyHP48K+OYEi6S4feUD4Q@mail.gmail.com
Whole thread Raw
In response to Re: Parallel query hangs after a smart shutdown is issued  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallel query hangs after a smart shutdown is issued
List pgsql-hackers
On Thu, Aug 13, 2020 at 6:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Thu, Aug 13, 2020 at 3:32 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> After a smart shutdown is issued(with pg_ctl), run a parallel query,
> >> then the query hangs.
>
> > Yeah, the current situation is not good.  I think your option 2 sounds
> > better, because the documented behaviour of smart shutdown is that it
> > "lets existing sessions end their work normally".  I think that means
> > that a query that is already running or allowed to start should be
> > able to start new workers and not have its existing workers
> > terminated.  Arseny Sher wrote a couple of different patches to try
> > that last year, but they fell through the cracks:
> >
https://www.postgresql.org/message-id/flat/CA%2BhUKGLrJij0BuFtHsMHT4QnLP54Z3S6vGVBCWR8A49%2BNzctCw%40mail.gmail.com
>
> I already commented on this in the other thread that Bharath started [1].
> I think the real issue here is why is the postmaster's SIGTERM handler
> doing *anything* other than disallowing new connections?  It seems quite
> premature to kill support processes of any sort, not only parallel
> workers.  The documentation says existing clients are allowed to end
> their work, not that their performance is going to be crippled until
> they end.

Right.  It's pretty strange that during smart shutdown, you could run
for hours with no autovacuum, walwriter, bgwriter.  I guess Arseny and
I were looking for a minimal change to fix a bug, but clearly there's a
more general problem and this change works out cleaner anyway.

> So I looked at moving the kills of all the support processes to happen
> after we detect that there are no remaining regular backends, and it
> seems to not be too hard.  I realized that the existing PM_WAIT_READONLY
> state is doing that already, but just for a subset of support processes
> that it thinks might be active in hot standby mode.  So what I did in the
> attached was to repurpose that state as "PM_WAIT_CLIENTS", which does the
> right thing in either regular or hot standby mode.

Make sense, works as expected and passes check-world.

> One other thing I changed here was to remove PM_WAIT_READONLY from the
> set of states in which we'll allow promotion to occur or a new walreceiver
> to start.  I'm not convinced that either of those behaviors aren't
> bugs; although if someone thinks they're right, we can certainly put
> back PM_WAIT_CLIENTS in those checks.  (But, for example, it does not
> appear possible to reach PM_WAIT_READONLY/PM_WAIT_CLIENTS state with
> Shutdown == NoShutdown, so the test in MaybeStartWalReceiver sure looks
> like confusingly dead code to me.  If we do want to allow restarting
> the walreceiver in this state, the Shutdown condition needs fixed.)

If a walreceiver is allowed to run, why should it not be allowed to
restart?  Yeah, I suppose that other test'd need to be Shutdown <=
SmartShutdown, just like we do in SIGHUP_handler().  Looking at other
places where we test Shutdown == NoShutdown, one that jumps out is the
autovacuum wraparound defence stuff and the nearby
PMSIGNAL_START_AUTOVAC_WORKER code.



pgsql-hackers by date:

Previous
From: Marco Atzeri
Date:
Subject: ltree_plpython failure test on Cygwin for 12.4 test
Next
From: Tom Lane
Date:
Subject: Re: Parallel query hangs after a smart shutdown is issued