Re: Async-unsafe functions in signal handlers - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Async-unsafe functions in signal handlers
Date
Msg-id 20210827210547.nntbtdlhqbxbo4wu@alap3.anarazel.de
Whole thread Raw
In response to Re: Async-unsafe functions in signal handlers  (Denis Smirnov <sd@arenadata.io>)
Responses Re: Async-unsafe functions in signal handlers  (Denis Smirnov <sd@arenadata.io>)
List pgsql-hackers
Hi,

On 2021-08-27 23:51:27 +0300, Denis Smirnov wrote:
> > 26 авг. 2021 г., в 23:39, Tom Lane <tgl@sss.pgh.pa.us> написал(а):
> > 
> > (BTW, I think it's pretty silly to imagine that adding backtrace()
> > calls inside ereport is making things any more dangerous.  ereport
> > has pretty much always carried a likelihood of calling malloc(),
> > for example.)
> 
> I have taken a look through the signal handlers and found out that many of them use malloc() via ereport() and
elog().Here is the list:
 
> 
> SIGUSR1
> - procsignal_sigusr1_handler(): autoprewarm, autovacuum, bgworker, bgwriter, checkpointer, pgarch, startup,
walwriter,walreciever, walsender
 

There shouldn't be meaningful uses of elog/ereport() inside
procsignal_sigusr1_handler(). The exception I found was an elog(FATAL) for
unreachable code.


> - sigusr1_handler(): postmaster
> SIGHUP:
> - SIGHUP_handler(): postmaster
> SIGCHLD:
> - reaper(): postmaster

I think these runs in a very controlled set of circumstances because most of
postmaster runs with signals masked.


> SIGFPE:
> - FloatExceptionHandler(): autovacuum, bgworker, postgres, plperl

Yep, although as discussed this might not be a "real" problem because it
should only run during an instruction triggering an FPE.


> SIGQUIT:
> - quickdie(): postgres

Yes, this is an issue. I've previously argued for handling this via write()
and _exit(), instead of the full ereport() machinery. However, we have a
bandaid that deals with possible hangs, by SIGKILLing when processes don't
shut down (at that point things have already gone quite south, so that's not
an issue).


> SIGTERM:
> - bgworker_die(): bgworker

Bad.


> SIGALRM:
> - handle_sig_alarm(): autovacuum, bgworker, postmaster, startup, walsender, postgres

I don't think there reachable elogs in there. I'm not concerned about e.g.
        elog(FATAL, "timeout index %d out of range 0..%d", index,
             num_active_timeouts - 1);
because that's not something that should ever be reachable in a production
scenario. If it is, there's bigger problems.


Perhaps we ought to have a version of Assert() that's enabled in production
builds as well, and that outputs the error messages via write() and then
_exit()s?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Denis Smirnov
Date:
Subject: Re: Async-unsafe functions in signal handlers
Next
From: Stephen Frost
Date:
Subject: Re: New predefined roles- 'pg_read/write_all_data'