Thread: rare avl shutdown slowness (related to signal handling)

rare avl shutdown slowness (related to signal handling)

From

Qingqing Zhou

Date:

07 April 2015, 17:54:27

I am playing git tip on windows 7/32 bits, with the backend compiled
with visual studio 2005 (I know, it is very old :-( ).

I encountered avl shutdown slowness twice, last night and this
morning: after a ctrl_c is hit, wait for another 60 seconds, server
shuts down. Here is one log:

D:\pgsql\bin>postgres -D../data --log_line_prefix="%t %p"
2015-04-07 10:30:04 PDT 3148LOG:  database system was shut down at
2015-04-07 10:29:24 PDT
2015-04-07 10:30:04 PDT 19548LOG:  database system is ready to accept
connections
2015-04-07 10:30:04 PDT 27008LOG:  autovacuum launcher started
2015-04-07 10:30:08 PDT 19548LOG:  received fast shutdown request
2015-04-07 10:30:08 PDT 19548LOG:  aborting any active transactions
2015-04-07 10:30:08 PDT 27008LOG:  autovacuum launcher shutting down
2015-04-07 10:30:08 PDT 27008ERROR:  canceling statement due to user request
2015-04-07 10:31:09 PDT 27008LOG:  autovacuum launcher shutting down
2015-04-07 10:31:09 PDT 15656LOG:  shutting down
2015-04-07 10:31:09 PDT 15656LOG:  database system is shut down

The interesting part is on PID 27008: avl first ereport() shutdown,
which is at the very end of the main loop and just one step away from
proc_exit(). Then it seems honors a SIGINT within ereport(), longjmp
to the loop head, and waits for another 60 seconds. After timeout, it
ereports shutdown again, and finally exits.

Not sure if this is windows only or general. I can hardly repro it.
Anyone has ever seen this?

Regards,
Qingqing

Re: rare avl shutdown slowness (related to signal handling)

From

Qingqing Zhou

Date:

07 April 2015, 21:11:31

 I got another repro with the shutdown slowness (DEBUG5 with verbosed
log are attached).

It gives a finer picture of what's going on:
1. Avl ereport("autovacuum launcher shutting down");
2. At the end of errfinish(), it honors a pending SIGINT;
3. SIGINT handler longjmp to the start of avl error handling;
4.  The error handling continues and rebuild_database_list() (that's
why we see begin/commit pair);
5. In main loop, it WaitLatch(60 seconds);
6. Finally it ereport() again and proc_exit().

This looks like a general pattern - don't think *nix is immune. Notice
that this ereport() is special as there is way to go back. So we can
insert HOLD_INTERRUPTS() just before it.

Thoughts?

Regards,
Qingqing


On Tue, Apr 7, 2015 at 10:54 AM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:
> I am playing git tip on windows 7/32 bits, with the backend compiled
> with visual studio 2005 (I know, it is very old :-( ).
>
> I encountered avl shutdown slowness twice, last night and this
> morning: after a ctrl_c is hit, wait for another 60 seconds, server
> shuts down. Here is one log:
>
> D:\pgsql\bin>postgres -D../data --log_line_prefix="%t %p"
> 2015-04-07 10:30:04 PDT 3148LOG:  database system was shut down at
> 2015-04-07 10:29:24 PDT
> 2015-04-07 10:30:04 PDT 19548LOG:  database system is ready to accept
> connections
> 2015-04-07 10:30:04 PDT 27008LOG:  autovacuum launcher started
> 2015-04-07 10:30:08 PDT 19548LOG:  received fast shutdown request
> 2015-04-07 10:30:08 PDT 19548LOG:  aborting any active transactions
> 2015-04-07 10:30:08 PDT 27008LOG:  autovacuum launcher shutting down
> 2015-04-07 10:30:08 PDT 27008ERROR:  canceling statement due to user request
> 2015-04-07 10:31:09 PDT 27008LOG:  autovacuum launcher shutting down
> 2015-04-07 10:31:09 PDT 15656LOG:  shutting down
> 2015-04-07 10:31:09 PDT 15656LOG:  database system is shut down
>
> The interesting part is on PID 27008: avl first ereport() shutdown,
> which is at the very end of the main loop and just one step away from
> proc_exit(). Then it seems honors a SIGINT within ereport(), longjmp
> to the loop head, and waits for another 60 seconds. After timeout, it
> ereports shutdown again, and finally exits.
>
> Not sure if this is windows only or general. I can hardly repro it.
> Anyone has ever seen this?
>
> Regards,
> Qingqing

Attachment

avlexit.log

Re: rare avl shutdown slowness (related to signal handling)

From

Tom Lane

Date:

07 April 2015, 21:32:45

Qingqing Zhou <zhouqq.postgres@gmail.com> writes:
>  I got another repro with the shutdown slowness (DEBUG5 with verbosed
> log are attached).

> It gives a finer picture of what's going on:
> 1. Avl ereport("autovacuum launcher shutting down");
> 2. At the end of errfinish(), it honors a pending SIGINT;
> 3. SIGINT handler longjmp to the start of avl error handling;
> 4.  The error handling continues and rebuild_database_list() (that's
> why we see begin/commit pair);
> 5. In main loop, it WaitLatch(60 seconds);
> 6. Finally it ereport() again and proc_exit().

> This looks like a general pattern - don't think *nix is immune. Notice
> that this ereport() is special as there is way to go back. So we can
> insert HOLD_INTERRUPTS() just before it.

> Thoughts?

That seems like (a) a hack, and (b) not likely to solve the problem
completely, unless you leave interrupts held throughout proc_exit(),
which would create all sorts of opportunities for corner case bugs
during on_proc_exit hooks.

I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
be a much safer fix.

It looks like there's a related risk associated with this bit:
/* in emergency mode, just start a worker and go away */if (!AutoVacuumingActive()){    do_start_worker();
proc_exit(0);           /* done */}
 

If we get SIGHUP and see that autovacuum has been turned off,
we exit the main loop, but we don't set got_SIGTERM.  So if we
then get a similar error at the shutdown report, we'd not merely
waste some time, but actually incorrectly launch a child.
        regards, tom lane

Re: rare avl shutdown slowness (related to signal handling)

From

Alvaro Herrera

Date:

07 April 2015, 21:56:03

Tom Lane wrote:

> That seems like (a) a hack, and (b) not likely to solve the problem
> completely, unless you leave interrupts held throughout proc_exit(),
> which would create all sorts of opportunities for corner case bugs
> during on_proc_exit hooks.
>
> I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
> be a much safer fix.

Ah, yeah.  I was thinking in changing PG_exception_stack once shutdown
was requested, but this is much simpler.

> It looks like there's a related risk associated with this bit:
>
>     /* in emergency mode, just start a worker and go away */
>     if (!AutoVacuumingActive())
>     {
>         do_start_worker();
>         proc_exit(0);            /* done */
>     }
>
> If we get SIGHUP and see that autovacuum has been turned off,
> we exit the main loop, but we don't set got_SIGTERM.  So if we
> then get a similar error at the shutdown report, we'd not merely
> waste some time, but actually incorrectly launch a child.

Well, what it does is request a new child from postmaster, which
presumably wouldn't be listening; but yeah there's a lot of other work
done here also.

How about the attached?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

autovac-shutdown.patch

Re: rare avl shutdown slowness (related to signal handling)

From

Tom Lane

Date:

07 April 2015, 22:13:26

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
>> be a much safer fix.

> Ah, yeah.  I was thinking in changing PG_exception_stack once shutdown
> was requested, but this is much simpler.

Your proposed patch seems to be doing both of those, which is probably
unnecessary.  I don't object to the SIGHUP test and goto in the error
path, but I'd put it a lot further down, like after the existing
RESUME_INTERRUPTS.  I doubt it's a good idea to skip the transaction
cleanup steps.
        regards, tom lane

Re: rare avl shutdown slowness (related to signal handling)

From

Qingqing Zhou

Date:

07 April 2015, 22:57:07

On Tue, Apr 7, 2015 at 2:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That seems like (a) a hack, and (b) not likely to solve the problem
> completely, unless you leave interrupts held throughout proc_exit(),
> which would create all sorts of opportunities for corner case bugs
> during on_proc_exit hooks.
>

Hmm, looks like proc_exit() already taken care of this by setting
proc_exit_inprogress and StatementCancelHandler() respects it.
Actually, in quickdie(), I found a similar practice for the same
reason:
/* * Prevent interrupts while exiting; though we just blocked signals that * would queue new interrupts, one may have
beenpending.  We don't want a * quickdie() downgraded to a mere query cancel. */HOLD_INTERRUPTS();

I do feel that we have too many functions instructing how to handle
interrupts and they are subtle - I just found a new friend
HOLD_CANCEL_INTERRUPTS :-(

Regards,
Qingqing

Re: rare avl shutdown slowness (related to signal handling)

From

Tom Lane

Date:

07 April 2015, 23:02:06

Qingqing Zhou <zhouqq.postgres@gmail.com> writes:
> I do feel that we have too many functions instructing how to handle
> interrupts and they are subtle - I just found a new friend
> HOLD_CANCEL_INTERRUPTS :-(

Indeed, which is why I think a patch for this issue should not introduce
a new mode/context in which proc_exit() is executed.  The code path would
be so seldom executed that we'd never manage to isolate any bugs that
might be in it.  Thus also my objection to Alvaro's patch that had us
going to proc_exit from partway through the error cleanup sequence.
        regards, tom lane

Re: rare avl shutdown slowness (related to signal handling)

From

Qingqing Zhou

Date:

07 April 2015, 23:15:29

On Tue, Apr 7, 2015 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Indeed, which is why I think a patch for this issue should not introduce
> a new mode/context in which proc_exit() is executed.

Agree. Along this line, we can add an on_proc_exit hook simply
ereport("we are exiting") there. In this way, we reuse all shields
proc_exit() already has. Abuse?

Regards,
Qingqing

Re: rare avl shutdown slowness (related to signal handling)

From

Alvaro Herrera

Date:

08 April 2015, 15:33:53

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> I think changing the outer "for(;;)" to "while (!got_SIGTERM)" would
> >> be a much safer fix.
>
> > Ah, yeah.  I was thinking in changing PG_exception_stack once shutdown
> > was requested, but this is much simpler.
>
> Your proposed patch seems to be doing both of those, which is probably
> unnecessary.  I don't object to the SIGHUP test and goto in the error
> path, but I'd put it a lot further down, like after the existing
> RESUME_INTERRUPTS.  I doubt it's a good idea to skip the transaction
> cleanup steps.

Ah, of course.  See attached patch.

We could keep the "for (;;)" instead of having it conditional on
got_SIGTERM, but I don't think there's any harm in the change.

This is backpatchable to 9.0.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

autovac-shutdown-2.patch