Fwd: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?) - Mailing list pgsql-hackers

From Robert Haas
Subject Fwd: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?)
Date
Msg-id CA+TgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_+X=w@mail.gmail.com
Whole thread Raw
In response to [HACKERS] Why does logical replication launcher exit with exit code 1?  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: background worker shudown (was Re: [HACKERS] Why does logicalreplication launcher exit with exit code 1?)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I still think the current situation is non-ideal.  I don't have a
> strong view on whether some or all system-wide processes should say
> hello and goodbye explicitly in the log, but I do think we need a way
> to make that not look like an error condition, and ideally without
> special cases for known built-in processes.
>
> I looked into this a bit today, while debugging an extension that runs
> a background worker.  Here are some assorted approaches to shutdown:
>
> 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> generates a FATAL ereport "terminating background worker \"%s\" due to
> administrator command", directly in the signal handler (hmm, see also
> recent discussions about the legality of similar code in quickdie()).

Yeah, I think that code is bad news, for the same reasons discussed
with regard to quickdie().

> 1.  Some processes install their own custom SIGTERM handler that sets
> a flag, and use that to break out of their main loop and exit quietly.
> Example: autovacuum.c, or the open-source pg_cron extension, and
> probably many other things that just want a nice quiet shutdown.
>
> 2.  Some processes install the standard SIGTERM handler die(), and
> then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> default this looks like "FATAL:  terminating connection due to
> administrator command".  This approach is effectively required if the
> main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> loop.  For example, a "launcher"-type process that calls
> WaitForBackgroundWorkerStartup() could hang forever if it used
> approach 1 (ask me how I know).

My experience with background workers has been that approach #1 does
not really work.  I mean, if you aren't doing anything complicated it
might be OK, if for example you are executing SQL queries, and you try
to do #1, then your SQL queries don't respond to interrupts.  And that
sucks.  So I have generally adopted approach #2.

To put that another way, nearly everything can reach
CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
in the case isn't much different from saying that it is required, full
stop.

> 3.  Some system processes generally use approach 2, but have a special
> case in ProcessInterrupts() to suppress or alter the usual FATAL
> message or behaviour.  This includes the logical replication launcher.

Maybe we could replace this by a general-purpose hook.  So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
    (*procdie_hook)();

And that hook can do whatever you like (but probably including dying).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: executor relation handling
Next
From: Andrew Dunstan
Date:
Subject: Re: pread() and pwrite()