Re: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?)
Date
Msg-id CAEepm=10MtmKeDc1WxBM0PQM9OgtNy+RCeWqz40pZRRS3PNo5Q@mail.gmail.com
Whole thread Raw
In response to Fwd: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Oct 10, 2018 at 7:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> > generates a FATAL ereport "terminating background worker \"%s\" due to
> > administrator command", directly in the signal handler (hmm, see also
> > recent discussions about the legality of similar code in quickdie()).
>
> Yeah, I think that code is bad news, for the same reasons discussed
> with regard to quickdie().

So I suppose we should just remove it, with something like 0002.  I'm
a bit uneasy about existing code out there that might be not calling
CFI.  OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

> > 1.  Some processes install their own custom SIGTERM handler that sets
> > a flag, and use that to break out of their main loop and exit quietly.
> > Example: autovacuum.c, or the open-source pg_cron extension, and
> > probably many other things that just want a nice quiet shutdown.
> >
> > 2.  Some processes install the standard SIGTERM handler die(), and
> > then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> > default this looks like "FATAL:  terminating connection due to
> > administrator command".  This approach is effectively required if the
> > main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> > loop.  For example, a "launcher"-type process that calls
> > WaitForBackgroundWorkerStartup() could hang forever if it used
> > approach 1 (ask me how I know).
>
> My experience with background workers has been that approach #1 does
> not really work.  I mean, if you aren't doing anything complicated it
> might be OK, if for example you are executing SQL queries, and you try
> to do #1, then your SQL queries don't respond to interrupts.  And that
> sucks.  So I have generally adopted approach #2.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)?  I think I might have been
guilty of copying that.

> To put that another way, nearly everything can reach
> CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
> in the case isn't much different from saying that it is required, full
> stop.

> > 3.  Some system processes generally use approach 2, but have a special
> > case in ProcessInterrupts() to suppress or alter the usual FATAL
> > message or behaviour.  This includes the logical replication launcher.
>
> Maybe we could replace this by a general-purpose hook.  So instead of
> the various tests for process types that are there now, we would just
> have
>
> if (procdie_hook != NULL)
>     (*procdie_hook)();
>
> And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: partition tree inspection functions
Next
From: Michael Paquier
Date:
Subject: Re: partition tree inspection functions